Texts are hierarchic constructs which consist of several autonomous levels [1–3]: letters, words, phrases, clauses, sentences, paragraphs. If a text is looked at statistically, i.e. without understanding its meaning (e.g. because it is written in an unknown system), how can it be efficiently distinguished from a meaningless collection of words [4–6]? Several such distinctions are well-known: (i) Meaningful texts have a large number of words that appear in the text only a few times, in particular once (rare words or hapax legomena) [4]. (ii) Ranked frequencies of words obey Zipf’s law [7–9]. (iii) Letters and words of a text demonstrate long-range (power-law) correlations [10–15].
However, these characteristics can be reproduced by sufficiently simple stochastic models, which puts in doubt their direct relation to the meaningfulness of a text. (i) Simple stochastic models can recover quite precisely the detailed structure of the hapax legomena [16]. (ii) Zipf’s law can be deduced from statistical approaches [16–30]. The first model of this type was a random text, where words are generated through random combinations of letters, i.e. the most primitive stochastic process [22, 28]. Its drawbacks [31–33] (e.g. many words having the same frequency) are avoided by more refined models [16, 23–26]. More generally, it was recently understood that Zipf’s law is a statistical regularity that emerges in samples which are informative about the underlying generative process [34]. (iii) Physics and mathematics of stochastic processes offer a plethora of models and approaches for generating long-range correlations [35].
Here we contribute to resolving the above question by recalling that meaningful texts evolve sequentially (linearly) from beginning to end. This was taken as one of the basic features of language [36], which, together with other design features, allows one to distinguish human language from other communication systems [37]. Thus we divide texts into two halves, each one containing the same number of words. Thereby we neutralize confounding variables that are involved in a complex text-producing process (style, genre, subject, the author’s motives and vocabulary etc), because they are the same in both halves. Hence by comparing the two halves with each other we hope to see statistical regularities that are normally shielded by the above variables. By a statistical regularity we mean a relation that holds for the majority of texts, such that it is highly unlikely to obtain this majority by chance (as checked by the $3\sigma$ rule). The significance of results will be checked by well-accepted statistical tests; for our purpose this is the W-test (Wilcoxon’s test) [38].
In two sets of several hundred texts we noted the following statistical regularities. (1) The first half has a larger number of different words, i.e. a larger vocabulary. (2) It also has a larger number of rare words, i.e. words that appear once or twice. (3) The first half is less compressible than the second half. The compressibility was studied via several different standard approaches, e.g. the Lempel-Ziv complexity and the zip algorithm. Lesser compressibility relates to more information in the sense of Shannon [39]. (4) Common words of both halves tend to have a larger overall frequency in the second half. These four features significantly correlate with each other as quantified by Pearson’s correlation coefficient. (5) The words in the first half are distributed less homogeneously, since they have a larger difference between the frequency and the (inverse) spatial period.
TABLE I: Qualitative comparison of various features of texts between first and second halves: + (–) means that the feature is larger (smaller) in the corresponding half. A separate mark indicates that the sought difference does not show up. Features are divided into two groups by double-lines. The first four features correlate with each other. The number of letters is separated, since there is only weak evidence toward its validity (the values of test quantities are close to their threshold values).
One possible explanation of these results is that the first part of the text normally contains the exposition (which sometimes can be up to 20% of the text), where the background information about events, settings, and characters is introduced to readers. The first part also lays out the main conflict (open issue), whose denouement (solution) comes in the second half. Hence (1)–(5) can be hypothetically explained by the fact that the exposition, and hence the first half, needs more different words (1), more rare words (2), is more informative (in the sense of Shannon) (3), and introduces words that are employed in the second half (4, 5) (i.e. the second half processes information introduced in the first half). We emphasize that this explanation is hypothetical; its direct validity is yet to be checked via more refined methods to be developed in future.
(6) Many other features, in particular those related to higher-order hierarchic structures of the text [1–3], do not show any significant difference between the first and second half: the number of sentences and paragraphs, the repetitiveness of words (as quantified by Yule’s constant), the number of punctuation signs etc. Among such features we especially mention the overall number of letters, since there is weak statistical evidence (the W-test is not always passed) that this quantity is still larger in the first half.
Checking these features does not require any understanding of the text, i.e. it is not required that the meaning of words is understood, or even that their writing system is known. We show that (expectedly) none of them survives if the words of the text are randomly permuted and only after that the resulting “text” is divided into two parts. Hence these features are specific to meaningful texts and they can be employed for distinguishing meaningful texts from a random collection of words.
This paper is organized as follows. The next section reviews our data collection method and recalls the W-test. Sections III–V present results from the above points (1)–(5). Section VI reviews negative results from point (6). The last section contains the outlook of this research and relates it to the existing literature. Our main results are briefly summarized in Table I. All other tables are given in Appendix A.
A. Studied texts
We selected English texts with a single narrative that are written on relatively few tightly connected subjects, and are sufficiently short for not containing “texts inside of texts”. We divided the studied texts into two halves (each half contains an equal number of words) along the flow of the narrative, i.e. from the beginning to the end. Several aspects of texts are left unchanged: each half is sufficiently large for statistics to apply, they have the same overall number of words, the same author, genre etc. The halves are semantically different, since the first half can be understood without the second half, but the second half generally cannot be understood alone. Also, the structure of narrative is different: the first half normally contains the exposition, where actors, situations and conflicts are set and defined, while the second half normally contains the denouement; cf. Footnote 2.
We have chosen to work with two datasets [57]. The first dataset was taken from the Gutenberg project at [58]. It consists of 156 fiction novels; for each novel the overall number of words is in the range [10000, 50000], which is sufficiently large for statistics to apply, but sufficiently short to ensure that the novels do not amount to “books inside of a book”. This range is motivated by our own experience as readers.
The texts within the first dataset are thematically close, since they are all fiction novels. The second dataset consists of 350 thematically diverse texts taken from various online sources and collected at [59]. When collecting those texts we tried to ensure that they do not contain texts that are meaningless to divide into halves, i.e. that they do not contain effectively independent narratives. Hence we did not include in this dataset biographies, poems, collections of short stories or essays (in particular, folk stories), lectures, proceedings, letters.
B. Testing the difference between the halves
After processing, the typical form of our data is a set of pairs of numbers, one pair per text: $\{(a_i, b_i)\}_{i=1}^{M}$, where $a_i$ and $b_i$ are certain features of (resp.) the first and second half of a text $i$, with $M$ being the overall number of texts in the dataset. E.g. $a_i$ and $b_i$ are the numbers of different words in (resp.) the first and second halves of a text; see below.
To inquire whether these data indicate a difference between the two halves, we formulate two natural hypotheses: $H_0$ ($H_1$) means that the difference $a_i - b_i$ does (does not) follow a symmetric distribution around zero. Some understanding on excluding $H_0$ can be gained by looking at the percentage of cases where $a_i > b_i$. This amounts to calculating
$$ \frac{1}{M}\sum_{i=1}^{M} \vartheta(a_i - b_i), \tag{1} $$
where $\vartheta(x) = 1$ for $x > 0$ and $\vartheta(x) = 0$ otherwise. But looking only at (1) is incomplete, since it does not take into account the magnitudes $|a_i - b_i|$. A relatively complete answer to the above comparison between $a_i$ and $b_i$ is provided by the W-test (Wilcoxon’s test) [38], which looks at [cf. (1)]
$$ W = \sum_{i=1}^{M} {\rm sgn}[a_i - b_i]\, R_i, \tag{2} $$
where $\{|a_i - b_i|\}_{i=1}^{M}$ is ordered in increasing way and assigned ranks $R_i$ that enter into $W$ [38]. Note that $W$ does not depend on absolute magnitudes, i.e. $W$ is invariant upon multiplying $a_i$ and $b_i$ by the same positive number. Now if $H_0$ holds, then for $M \gg 1$ (effectively $M > 30$ suffices, which always holds in our cases) the central limit theorem applies and $W$ is approximately a Gaussian random variable, since it is a weighted sum of a large number of independent random variables. Its average is zero, since ${\rm sgn}[a_i - b_i]$ assumes the values $\pm 1$ with equal probability (once $H_0$ is assumed to hold). Its dispersion is calculated directly from (2) [38]:
$$ \sigma^2 = \sum_{i=1}^{M} i^2 = \frac{M(M+1)(2M+1)}{6}. \tag{3} $$
Hence $H_0$ can be excluded via the $3\sigma$ rule, if
$$ |W| \geq 3\sigma. \tag{4} $$
We accept the $3\sigma$ rule (4) as the minimal threshold for claiming the statistical significance of our results. However, we emphasize that the absolute majority of our results satisfy the much stronger $5\sigma$ rule; see tables in Appendix A.
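The test described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the tables: the function names are ours, tied magnitudes receive average ranks (an assumption, since ties are not discussed above), and the threshold for the percentage assumes that under the symmetric hypothesis each text independently shows a positive difference with probability 1/2. With that assumption the sketch reproduces the values 0.62 and 0.5802 quoted in the captions of Tables II and III.

```python
import math


def wilcoxon_w(first, second):
    """Signed-rank statistic: sum over texts of sgn(difference) * rank of |difference|."""
    diffs = [a - b for a, b in zip(first, second)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        # Assumption: tied magnitudes share the average of the ranks they occupy.
        end = pos
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    # A zero difference has sign 0 and therefore does not contribute to W.
    return sum(((d > 0) - (d < 0)) * r for d, r in zip(diffs, ranks))


def sigma_w(m):
    """Standard deviation of W under the symmetric hypothesis, sqrt(M(M+1)(2M+1)/6)."""
    return math.sqrt(m * (m + 1) * (2 * m + 1) / 6)


def fraction_threshold(m, k=3):
    """k-sigma threshold for the fraction of texts with a positive difference
    (binomial with success probability 1/2, an assumption of this sketch)."""
    return 0.5 + k / (2 * math.sqrt(m))
```

For example, `wilcoxon_w([3, 5, 9], [1, 6, 2])` has differences 2, -1, 7 with ranks 2, 1, 3, hence W = 2 - 1 + 3 = 4, and the value does not change if both lists are multiplied by the same positive number.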
A. Different words (vocabulary)
The basic hierarchic level of text is that of words. Neglecting phenomena of synonymy and homonymy (which are rare in English, but not at all rare e.g. in Chinese [40]), we can say that every word has several closely related meanings (polysemy). Neglecting also the difference between polysemic meanings, the number of independent meanings in a text can be estimated via the number of different words $n$. Tables II and III show that the first half of a meaningful text has statistically more different words than the second half:
$$ n_1 > n_2, \tag{5} $$
where $n_1$ ($n_2$) is the number of different words in the first (second) half.
As expected, this result disappears after random shuffling (random permutation of words) of a text, which destroys its linear structure; see Tables IV and V.
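The comparison of vocabularies and its shuffled control can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer (lowercase alphabetic runs, apostrophes kept) and all function names are ours and are not taken from the data processing behind Tables II–V.

```python
import random
import re


def tokenize(text):
    # Assumption: a word is a run of lowercase letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def split_halves(words):
    # Both halves get the same number of words; an odd last word is dropped.
    half = len(words) // 2
    return words[:half], words[half:2 * half]


def vocabulary_sizes(words):
    """Number of different words in the first and in the second half."""
    first, second = split_halves(words)
    return len(set(first)), len(set(second))


def shuffled_vocabulary_sizes(words, seed=0):
    """Same comparison after a random permutation of the words."""
    rng = random.Random(seed)
    permuted = list(words)
    rng.shuffle(permuted)
    return vocabulary_sizes(permuted)
```

The pair returned by `vocabulary_sizes` for each text of a dataset is the kind of input on which the percentage and the W-value are then evaluated.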
B. Rare words (hapax legomena)
In any meaningful text, a sizable number of words appear only very few times (hapax legomena). These rare words amount to a finite fraction of $n$ (i.e. the number of different words). The existence and the (large) number of rare events is not peculiar to texts, since there are statistical distributions that can generate samples with a large number of rare events [4, 16]. One reason why many rare words should appear in a meaningful text is that a typical sentence contains functional words (which come from a small pool), but it also has to contain some rare words, which then necessarily have to come from a large pool [27].
Let $V^{(1)}_m$ ($V^{(2)}_m$) be the number of words that appear $m$ times in the first (second) half. For defining rare words we focused on
$$ \sum_{m=1}^{m_0} V^{(k)}_m, \qquad k = 1, 2, \tag{6} $$
i.e. on words that appear up to $m_0$ times. We chose to work with different $m_0$’s to ensure that our results are robust with respect to varying the definition of “rare”. For both datasets we observed that in the majority of cases the number of rare words in the first half is larger than the number of rare words in the second half [see Tables VI and VII]:
$$ \sum_{m=1}^{m_0} V^{(1)}_m > \sum_{m=1}^{m_0} V^{(2)}_m. \tag{7} $$
We confirmed via the $5\sigma$ rule of the W-test that the probability to get (7) by chance is negligible.
Eq. (7) suggests that the first half uses more rare words, but such a conclusion is incomplete, since the two halves have different numbers of distinct words. Denote them as $n_1$ and $n_2$ for the first and second half, respectively; cf. (5). Note that
$$ n_k = \sum_{m=1}^{m^{\max}_k} V^{(k)}_m, \qquad k = 1, 2, \tag{8} $$
where $m^{\max}_1$ ($m^{\max}_2$) is the number of times the most frequent word appears in the first (second) half. Hence in addition to (7) it is necessary to consider normalized quantities, i.e.
$$ \frac{1}{n_1}\sum_{m=1}^{m_0} V^{(1)}_m > \frac{1}{n_2}\sum_{m=1}^{m_0} V^{(2)}_m. \tag{9} $$
Relation (9) does hold statistically; see Table VI for the first dataset and Table VII for the second dataset. Hence the first half has more rare words both in absolute and relative terms; see (5, 9). These differences between the halves disappear after random shuffling of texts; see Tables IV and V. Note that yet another possibility to define rare words comes from the relations
$$ \sum_{m=1}^{m^{\max}_k} m\, V^{(k)}_m = N_k, \qquad k = 1, 2, \tag{10} $$
where $N_k$ is the total number of words in half $k$ (with $N_1 = N_2$ by construction). Eq. (10) invites one to compare the normalized quantities $\frac{1}{N_1}\sum_{m=1}^{m_0} m\, V^{(1)}_m$ with $\frac{1}{N_2}\sum_{m=1}^{m_0} m\, V^{(2)}_m$ for $m_0 = 1, ..., 5$. We carried out this comparison and the results (for percentages and W-values) are very similar to (7), i.e. we obtain
$$ \frac{1}{N_1}\sum_{m=1}^{m_0} m\, V^{(1)}_m > \frac{1}{N_2}\sum_{m=1}^{m_0} m\, V^{(2)}_m \tag{11} $$
in the same statistical sense as (7).
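The rare-word counts entering (6)–(9) can be obtained from a frequency spectrum, i.e. from the number of distinct words that appear exactly m times. The following is a minimal sketch with our own (hypothetical) function names; it takes a list of words as input and makes no claim about how the tables were actually produced.

```python
from collections import Counter


def spectrum(words):
    """Map m -> number of distinct words that appear exactly m times."""
    return Counter(Counter(words).values())


def rare_count(words, m0):
    """Number of distinct words that appear at most m0 times."""
    spec = spectrum(words)
    return sum(spec[m] for m in range(1, m0 + 1))


def rare_fraction(words, m0):
    """Rare-word count normalized by the number of different words."""
    return rare_count(words, m0) / len(set(words))
```

Applied separately to the two halves of a text with m0 between 1 and 5, `rare_count` gives the two sides of the absolute comparison and `rare_fraction` the two sides of the normalized one.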
C. Common words
Both halves of a text have certain words in common; e.g. the non-common words of the second half are those that do not occur in the first half. Let the number of common words in a given text be denoted by $C$. Our first result is that for each half the common words are less numerous than the non-common ones:
$$ C < n_1 - C, \qquad C < n_2 - C, \tag{12} $$
where $n_1$ ($n_2$) is the number of different words in the first (second) half. Relations (12) hold statistically; see Table VIII. Though common words are in the minority, they are more frequent, i.e. the overall frequency $F_1$ ($F_2$) of common words in the first (second) half satisfies
$$ F_1 > 1 - F_1, \qquad F_2 > 1 - F_2. \tag{13} $$
Relations (13) hold without exception for all texts we studied. Inequalities (12, 13) are well expected, since common words include functional words, which are frequent, but not numerous [16].
Our next finding does indicate a difference between the two halves, and is therefore less expected: the overall frequency of common words is larger in the second half [see Tables II and III]:
$$ F_1 < F_2. \tag{14} $$
Eq. (14) indicates in which specific sense the second half employs more words from the first half than vice versa.
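The quantities compared in (12)–(14) are easy to compute once the two halves are available as word lists. A minimal sketch follows; the function name and the returned tuple layout are our own choices, made for illustration only.

```python
from collections import Counter


def common_word_stats(first, second):
    """Return (number of common words, distinct words in half 1, distinct words
    in half 2, overall frequency of common words in half 1, same for half 2)."""
    counts1, counts2 = Counter(first), Counter(second)
    common = set(counts1) & set(counts2)
    freq1 = sum(counts1[w] for w in common) / len(first)
    freq2 = sum(counts2[w] for w in common) / len(second)
    return len(common), len(counts1), len(counts2), freq1, freq2
```

For the toy halves "the cat saw the dog" and "the dog bit the man" the common words are "the" and "dog"; they are in the minority among the four distinct words of each half, yet cover a fraction 0.6 of each half.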
A. Lempel-Ziv (LZ) complexity and compressibility
Here we discuss to which extent the two halves differ by their compressibility, and whether the word-inverted text has a different compressibility compared to the original text. Compressibility, i.e. the ability of the size to decrease under lossless compression, is an interesting indicator of textual features, e.g. it is known that a random permutation of words in a text decreases its compressibility [50]. The idea that texts should have a larger compressibility than random data also appeared in the form of Hilberg’s conjecture [51, 52]. Compressibility was recently employed for detecting ordered structures in data [53].
Now the compressibility can be defined via one of the standard lossless compression methods, e.g. zip. But algorithms for such methods are not freely available. Hence we employ the Lempel-Ziv (LZ) complexity, which estimates compressibility and which is basic for other compression methods (zip, gzip) [39]. The LZ-complexity is widely applied in data science; see e.g. [49] for a recent review.
Let us illustrate how the LZ-complexity is calculated [39, 49] for the bit-string 01001011. This string is separated into fragments by commas starting from the left. Each fragment is defined to be the shortest substring that did not appear before: 01001011 → 0, 1, 00, 10, 11. Now each fragment whose length is larger than one bit consists of the last bit and the prefix, i.e. the part that already appeared somewhere earlier. (The last fragment need not have the last symbol, and can simply consist of a prefix.) In the second step each fragment is coded via the number of the fragment where its prefix first appeared (first symbol) and its last bit (second symbol), with 0 standing for an empty prefix. Thus
$$ 0,\ 1,\ 00,\ 10,\ 11 \;\to\; (0,0),\ (0,1),\ (1,0),\ (2,0),\ (2,1). \tag{15} $$
The LZ-complexity $C_{\rm LZ}$ of the string is defined as the overall number of fragments. Now $C_{\rm LZ}$ determines the bit-length of the coded string, since we need $C_{\rm LZ}$ bits for the last symbols of the fragments plus (at best) $C_{\rm LZ}\log_2 C_{\rm LZ}$ bits for representing prefixes [39]. Obviously, $C_{\rm LZ}(1 + \log_2 C_{\rm LZ})$ can be significantly smaller than the initial string length only for sufficiently long strings. The LZ-complexity defines a universal (since no prior information about the string is to be available) and asymptotically optimal lossless compression, since its compression ratio for a stationary random process tends to the entropy rate of the process [39].
$C_{\rm LZ}$ reads for the two simplest $N$-bit strings (to leading order for $N \gg 1$):
$$ 000\ldots0: \quad C_{\rm LZ} \simeq \sqrt{2N}, \tag{16} $$
$$ 0101\ldots01: \quad C_{\rm LZ} \simeq 2\sqrt{N}, \tag{17} $$
where (16) [(17)] is derived from noting that the lengths of successive fragments are 1, 2, 3, .... [1, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7....]. Eqs. (16, 17) are to be contrasted with the length of optimal compressions for the corresponding strings: (0; N) and (01; N/2). These lengths reduce to representing $N$ via bits and amount to $1 + \log_2 N$ for both cases. Though $C_{\rm LZ}$ is far from the optimal case for the ordered strings (16, 17), it is interesting to see that the string in (17) is still more LZ-complex, as expected intuitively. Thus, more regularity leads to a better compression, a general idea of all lossless compression methods.
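The fragmentation described above can be sketched in Python as follows. This is an illustrative implementation with our own function names; coding an empty prefix by the index 0, and coding a repeated trailing fragment in the same way as the others, are conventions assumed here for simplicity.

```python
def lz_fragments(s):
    """Split s into the shortest fragments that did not appear before."""
    seen = set()
    fragments = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            fragments.append(current)
            current = ""
    if current:
        # The trailing fragment may coincide with an earlier one.
        fragments.append(current)
    return fragments


def lz_complexity(s):
    """LZ-complexity as the overall number of fragments."""
    return len(lz_fragments(s))


def lz_code(s):
    """Code each fragment as (index of the fragment equal to its prefix, last symbol)."""
    index = {"": 0}  # assumption: index 0 denotes the empty prefix
    pairs = []
    for k, fragment in enumerate(lz_fragments(s), start=1):
        pairs.append((index[fragment[:-1]], fragment[-1]))
        index.setdefault(fragment, k)
    return pairs
```

Here `lz_fragments("01001011")` returns `['0', '1', '00', '10', '11']`, so the LZ-complexity of this string is 5. For a string of 56 alternating bits the fragment lengths come out as 1, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, in agreement with the sequence quoted above.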
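We transform each text into a bit-string and define its relative compressibility $\rho$ (18) by normalizing the LZ-complexity $C_{\rm LZ}$ of this bit-string by $B$, the original size of the given text in bytes; a larger ratio $C_{\rm LZ}/B$ means a smaller compressibility. The normalization of $C_{\rm LZ}$ by the original size of the string is standard [39].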
B. Permutation and inversion
Note that if the string in (17) is randomly permuted sufficiently many times, it will turn into a random string (with equal number of 0’s and 1’s). The LZ-coded length of such a random string will be $\mathcal{O}(N)$ [39], i.e. much larger than the $\mathcal{O}(\sqrt{N}\log_2 N)$ length implied by (17). Results of [50] are easy to understand in this context: random permutations will increase the LZ-complexity of any given text. To confirm this relation in our datasets we generated 10 random permutations of words in each text, after each permutation we calculated the difference of compressibilities (18) between the original text and the permuted one, and then averaged the difference over the 10 permutations. The difference was positive for all our texts without exception.
Here is however a result that is much less straightforward. Since the present work focuses on the linear structure of a text, it is natural to ask the following question: is there any compressibility difference between a given text $T$ and its inverted version $\bar{T}$? Here the last word of $T$ becomes the first word of $\bar{T}$, the penultimate word of $T$ becomes the second word of $\bar{T}$, ..., and the first word of $T$ becomes the last word of $\bar{T}$. Next, $T$ and $\bar{T}$ are (separately) turned into bit-strings and their compressibility is calculated via (18).
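It appears that the original text $T$ is (statistically) more compressible in terms of (18) than the inverted text $\bar{T}$ [cf. Table X]:
$$ \rho(T) > \rho(\bar{T}), \tag{19} $$
where $\rho$ denotes the relative compressibility (18) and $\bar{T}$ is the text $T$ with the order of its words inverted.
The percentage of (19) is remarkably high: it holds for > 97% of cases in both of our datasets; see Table X. Eq. (19) is confirmed via the zip compression method; see Table X.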
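The comparison between a text and its word-inverted version can be tried out with a general-purpose compressor. The sketch below is an illustration under explicit assumptions: it uses Python's `zlib` (DEFLATE) as a stand-in for the zip method, and it takes the relative compressibility to be one minus the ratio of compressed to original size in bytes, which is our choice and need not coincide with the definition (18). All function names are hypothetical.

```python
import zlib


def compressibility(text):
    # Assumption: relative size reduction under DEFLATE at the highest level.
    raw = text.encode("utf-8")
    return 1.0 - len(zlib.compress(raw, 9)) / len(raw)


def invert_words(text):
    """Last word becomes first, penultimate becomes second, and so on."""
    return " ".join(reversed(text.split()))


def inversion_difference(text):
    """Compressibility of the text minus that of its word-inverted version."""
    normalized = " ".join(text.split())  # single spaces, so both versions have equal size
    return compressibility(normalized) - compressibility(invert_words(normalized))
```

A positive `inversion_difference` for a given text corresponds to the direction of (19); whether it is positive for a particular text has to be checked on actual data.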
Recall that previous applications of the LZ-complexity in texts [50–52] assumed that the LZ-complexity captures correlations between different text symbols (letters, words etc). Relation (19) can be explained by noting that the LZ-complexity focuses on short-range correlations of letters, which may be lost after inversion of words. To illustrate this point, let $x_1x_2x_3$ and $y_1y_2y_3$ be two consecutive 3-letter words. Their order in $T$ [in $\bar{T}$] is $x_1x_2x_3\ y_1y_2y_3$ [$y_1y_2y_3\ x_1x_2x_3$]. Now if there are correlations between $x_3$ and $y_1$ in $T$, and such correlations are accounted for in the LZ-complexity, then in $\bar{T}$ such correlations will have a longer range, and will not be seen in the LZ-complexity.
Note that instead of inverting texts at the level of words we also inverted them at the level of letters: put the last letter as the first one etc. We saw that under this letter inversion the compressibility does change, i.e. there is more to the LZ-complexity than just short-range correlations of letters. However, no clear indications emerged on the analogue of (19) or on its inverse. The results differ from one dataset to another and from one compression method to another.
C. Compressibility of two halves
Table IX shows that the relative compressibility of the first half is statistically smaller, i.e. it is compressed less than the second half:
$$ \rho_1 < \rho_2, \tag{20} $$
where $\rho_1$ ($\rho_2$) is the relative compressibility (18) of the first (second) half.
Inequality (20) holds in 60%–70% of cases with a W-value that is significant at the level of at least $4\sigma$. The results are fully corroborated when using in (20) the zip compression method instead of the LZ-complexity.
Recall that according to information theory, more compressible sequences convey less information (in the sense of Shannon) [39]. Then one can interpret (20) in the context of (5, 7, 9): the first half has more words—hence it conveys more information—and more rare words, which altogether combine into a more informative and hence less compressible structure. Indeed Tables XIII and XIV show that the validity of (20) does correlate (in the sense of the Pearson correlation coefficient recalled in Appendix C) with the validity of (5, 7, 9).
A. Definitions
Let us now turn to features that reflect the distribution of words along the text. Studying this spatial distribution of words is traditional for quantitative linguistics [9, 41]. More recently, Refs. [42–45] investigated the spatial distribution of key-words versus functional words. The conclusion reached is that key-words are distributed less homogeneously, i.e. they cluster into certain parts of the text [42–45].
For a given text we extract the frequencies of different words ($n$ is the number of different words):
$$ \{f(w_k)\}_{k=1}^{n}, \qquad \sum_{k=1}^{n} f(w_k) = 1. \tag{21} $$
Let $w_{[1]}, \ldots, w_{[\ell]}$ denote all occurrences of a word $w$ along the text. Let $\tau_j$ denote the number of words (different from $w$) between $w_{[j]}$ and $w_{[j+1]}$, i.e. $\tau_j + 1 \geq 1$ is the number of space symbols between $w_{[j]}$ and $w_{[j+1]}$. Define the average period $t(w)$ of this word $w$ via
$$ t(w) = \frac{1}{\ell - 1}\sum_{j=1}^{\ell - 1} (\tau_j + 1). \tag{22} $$
The averaging is conceptually meaningful only for sufficiently frequent words, though formally (22) is well-defined for any word that appears at least twice. Note that $(\ell - 1)\,t(w)$ equals $\ell - 1$ plus the number of words that differ from $w$ and occur between $w_{[1]}$ and $w_{[\ell]}$. Hence $t(w)$ will stay intact under redistributing $w_{[2]}, \ldots, w_{[\ell - 1]}$ for fixed $w_{[1]}$ and $w_{[\ell]}$.
Now define the inverse spatial period:
$$ g(w) = \frac{1}{t(w)}. \tag{23} $$
If a word $w$ is distributed homogeneously, then $g(w)$ is expressed via the ordinary frequency $f(w)$. If, in addition, this is a sufficiently frequent word, then $g(w) = f(w)$, where we assume that $\ell \gg 1$ and $N \gg 1$ ($N$ is the total number of words). Hence the difference $|f(w) - g(w)|$ (for sufficiently frequent words) can tell how the distribution of $w$ deviates from the homogeneous one.
One can also directly define the average characteristic frequency for the word $w$:
$$ \tilde{g}(w) = \frac{1}{\ell - 1}\sum_{j=1}^{\ell - 1}\frac{1}{\tau_j + 1}. \tag{24} $$
One feature of (24), which is absent in (23), is that (24) is not susceptible to outliers, e.g. if $\tau_1$ is much larger than the other $\tau_j$’s, then $\tau_1$ does not contribute much into (24). Here we shall focus on $t(w)$, since (24) did not so far demonstrate any interesting behavior.
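The average period and the inverse spatial period of a word can be computed directly from the positions of its occurrences. The sketch below uses our own function names and treats a text as a list of words; it relies on the fact that the gap between two successive occurrences (the number of intervening words plus one) is the difference of their positions, so that the mean gap depends only on the first and the last occurrence.

```python
def positions(words, w):
    """Indices at which the word w occurs."""
    return [i for i, x in enumerate(words) if x == w]


def average_period(words, w):
    """Mean over successive occurrences of (number of intervening words + 1)."""
    pos = positions(words, w)
    if len(pos) < 2:
        raise ValueError("the period needs at least two occurrences of the word")
    return (pos[-1] - pos[0]) / (len(pos) - 1)


def inverse_period(words, w):
    return 1.0 / average_period(words, w)


def frequency(words, w):
    return words.count(w) / len(words)
```

For the evenly spread sequence "a b a b a b a b" the inverse period of "a" equals its frequency 0.5, while for the clustered sequence "a a a a b b b b" the inverse period is 1.0 at the same frequency 0.5, which illustrates how the difference between the two quantities signals an inhomogeneous distribution.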
B. Results
We now aim to compare the frequency $f(w)$ with the inverse spatial period $g(w)$ in each half of a given text. We focus on sufficiently frequent words, because $g(w)$ is not well-defined for words that appear only once. Hence for a given half of a text we define the normalized distance $\Delta$ between $f(w)$ and $g(w)$ as the average (25) of the normalized difference $|f(w) - g(w)|$ over the words $w$ of a set $\Omega$, where $|\Omega|$ is the number of elements in the set $\Omega$, and where $\Omega$ includes words that appear at least $k$ times. Tables XI and XII exemplify the behavior of (25) for three selected values of $k$: 15, 20 and 30. (We did not choose smaller values of $k$, since the definition of $1/g(w)$ as the average period becomes unclear.) It is seen that the normalized distance is typically larger in the first half,
$$ \Delta_1 > \Delta_2, \tag{26} $$
and that this effect gets stronger, both in terms of the percentage of cases and the value of $W$ in (2), when the set $\Omega$ in (25) is restricted to common words of both halves; see Tables XI and XII.
So far we mostly concentrated on one level of textual hierarchy, i.e. words. Letters are on the hierarchy level below that of words. For the total number of letters $L$ in each half the statistical evidence we got is weaker, since the percentage of cases where $L_1 > L_2$ (i.e. the first half has a larger overall number of letters than the second half) and the W-statistics for $L_1 > L_2$ are close to their critical values; see Tables II and III. Moreover, in one of our datasets the W-test is passed, while in the other it is not. However, there is weak but definite evidence for the validity of $L_1 > L_2$. First, after a random shuffling of texts, the fraction of cases where $L_1 > L_2$ holds drops from its value $\approx 0.58$ (for original texts) to $\approx 0.5$ for shuffled texts; see Tables IV and V. Second, $L_1 > L_2$ shows significant correlations with (5) and (7); see Tables XIII and XIV. Third, the relation $L_1 > L_2$ holds on average for both datasets:
$$ \langle L_1 \rangle > \langle L_2 \rangle, \tag{27} $$
where $\langle \cdot \rangle$ denotes the average over the texts of a dataset.
For hierarchy levels higher than that of words, our results are negative, i.e. they do not indicate a statistically significant difference between the halves. The number of sentences does not show significant differences; see Tables II and III. Here we (conventionally) defined the sentence as the shortest sequence of words located between any two of the following symbols: comma, dot, semicolon, question mark, exclamation mark. We also studied the number of sentences when the comma is excluded from the above list (not shown in tables). This did not change our conclusion.
We also calculated the full distribution of sentences over the length (measured in words): $p_k$ is the fraction of sentences with word-length $k$ ($\sum_k p_k = 1$). Two specific characteristics of this distribution were looked at: the average $\langle k \rangle$ and the dispersion $\sigma_k^2$,
$$ \langle k \rangle = \sum_k k\, p_k, \qquad \sigma_k^2 = \sum_k k^2 p_k - \langle k \rangle^2. \tag{28} $$
None of these quantities showed a statistically significant difference between the halves; see Tables II and III. Another level of the textual hierarchy is the one containing paragraphs. Denoting the number of paragraphs as $P$, we saw that there is no statistical evidence in favor of $P_1 > P_2$ or $P_1 < P_2$; see Tables II and III. Our results on Yule’s constant, which describes the repetitiveness of words (see Appendix B for details of the definition), also do not indicate a significant difference between the halves.
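The sentence-level quantities can be illustrated with a short sketch that uses the conventional sentence definition given above (sequences of words between commas, dots, semicolons, question marks and exclamation marks). The function names and the default delimiter string are ours; passing a delimiter string without the comma gives the alternative definition mentioned in the text.

```python
import re


def sentence_lengths(text, delimiters=",.;?!"):
    """Word-lengths of the pieces between the delimiters; empty pieces are skipped."""
    pieces = re.split("[" + re.escape(delimiters) + "]", text)
    return [len(piece.split()) for piece in pieces if piece.split()]


def mean_and_dispersion(lengths):
    """Average and dispersion of the sentence-length distribution."""
    mean = sum(lengths) / len(lengths)
    dispersion = sum(x * x for x in lengths) / len(lengths) - mean ** 2
    return mean, dispersion
```

For "Hello there, friend. How are you? Fine!" the lengths are 2, 1, 3, 1 with average 1.75 and dispersion 0.6875; without the comma as a delimiter they become 3, 3, 1.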
We proposed a set of relations between statistical features of the two halves of a meaningful text; see Table I for a summary of our results. The validity of these relations is statistical, i.e. the majority of them holds with $5\sigma$ significance of the Wilcoxon test; see Appendix A. No understanding of the text (or even knowing its writing system) is needed for checking these relations. We explicitly confirmed that all these relations disappear after a random permutation of words in the text.
We conjecture that these relations between the halves are connected to a specific, information-carrying structure of the text, where the information is introduced (defined) in the first half, and then is processed in the second half. Such a structure is anticipated in text linguistics, where the flow of the text narrative is conventionally separated between the exposition and the denouement, which are typically located in the first and second halves, respectively [1–3]. This is however a qualitative concept, and hence the connection between our results and the exposition-denouement is stated as a hypothesis. Work is currently in progress for designing specific tests for checking the hypothesis.
Practically, knowing whether a string of symbols is a meaningful text or not can be useful in cryptography, fraud detection and historical analysis. The latter can refer to inferring whether a given text in an unknown writing system is meaningful or asemic [60]. One interesting application relates to the CETI problem (communication with extraterrestrial intelligence) [56]. Here the code of a potential signal is completely unknown, but it can be plausibly conjectured that a meaningful signal has similar differences between the halves. (Fractioning into “words” is possible, once the “space symbol” is identified as the most frequent symbol in the text [56].)
Some of our results were sporadically observed in the literature. Ref. [47] emphasized the translation invariance of books, but still noted on a concrete text that its last part has fewer rare words. Ref. [14] noted the following (nontopical) differences between the first and the second halves of Moby Dick (by H. Melville): (1.) The word is is more frequent than was in the first half, but less frequent in the second half. (2.) The ratio of articles the to a is larger in the second half, which may mean that the second half makes more concrete statements.
Our results concerning the compressibility features—in particular, the result in section IV B on the compressibility decrease under inverting the word order—are especially worth studying in more detail. While we focused on the Lempel-Ziv complexity and the related zip method for defining the compressibility, it is known that the Lempel-Ziv complexity for long but finite sequences has a drawback of not capturing the real randomness, i.e. not agreeing with the Kolmogorov complexity [54]. Hence more refined compression methods are to be studied in future, e.g. the Huffman coding that is algorithmically slower, but does capture the notion of randomness. It is also important to clarify whether the compressibility difference between the original and inverted text can serve for quantifying syntagmatic correlations between the words; cf. Footnote 8.
Another important open problem relates to modeling the above effects. A superficial modeling would be possible via altering the existing sequential text-generating models (see [23–25] for example) such that e.g. they generate fewer rare words towards the end of the text. But we should warn the reader against such quick attempts. First, the main drawback of sequential models is that they do not describe sufficiently well the distribution of rare words [47], which was so far possible only via non-sequential statistical models [16]. So a good model should predict both the distribution of rare words and their difference between the halves. Second, the example of the Zipf law, with its numerous models and explanations [16–30], shows that excessive modeling even with a reasonable quantitative agreement is not at all a guarantee for understanding the actual meaning of a complex textual phenomenon. Hence at the present stage of our understanding we want to concentrate on conceptual issues (and novel tests) relating statistics to meaningfulness, more than to develop a purely statistical model for explaining the above results.
W. Deng was partially supported by the Fundamental Research Funds for the Central Universities, the Program of Introducing Talents of Discipline to Universities under grant no. B08033, and National Natural Science Foundation of China (Grant Nos. 11505071 and 11905163). A.E. Allahverdyan was supported by SCS of Armenia, grants No. 18RF-015 and No. 18T-1C090.
TABLE II: Results for the first dataset of 156 texts. Notations: $n$ is the number of different words; the subscripts 1 and 2 refer to the first and second half, respectively. $F_1$ ($F_2$) is the overall frequency of common words in the first (second) half, $L$ is the number of letters, $P$ is the number of paragraphs, $K$ is Yule’s constant. Eqs. (28) define the average and the dispersion of the sentence length. The number of sentences is denoted by $S$. This table shows the W-value (2) for the relation $n_1 > n_2$ (and other indicated relations), and also the percentage with which this relation holds in the dataset. The W-test is passed according to the $3\sigma$ rule for $W \geq 3\sigma = 3391.02$; see (3). The $3\sigma$ threshold for the percentage is 0.62. The table shows whether the $k\sigma$ rule is passed with $3 \leq k \leq 5$. False means that even the $3\sigma$ rule is not passed.
TABLE III: Results for the second dataset of 350 texts. The same notations as in Table II. The W-test is passed according to the $3\sigma$ rule; see (3). The $3\sigma$ threshold for percentages is 0.5802. The table shows whether the $k\sigma$ rule is passed with $3 \leq k \leq 5$. False means that even the $3\sigma$ rule is not passed.
TABLE IV: Random shuffle for the first dataset of 156 texts, averaged over 10 shuffles; cf. Table II. For the relations on rare words, the threshold for the number of appearances is set to $m_0 = 3$; see (6–9). False means that even the $3\sigma$ rule is not passed.
TABLE V: Random shuffle for the second dataset of 350 texts, averaged over 10 shuffles; cf. Table III. For the relations on rare words, the threshold for the number of appearances is set to $m_0 = 3$; see (6–9). False means that even the $3\sigma$ rule is not passed.
TABLE VI: Results for for the first dataset of 156 texts; see (6–9). The rareness is defined by
We show the percentage of cases where the first half has a larger number of rare words. The 3σ threshold for this percentage is 0.62. We show the W-value (2) for the W-test. The test is passed according to the 5σ rule for all but one case.
TABLE VII: Results for for the second dataset of 350 texts. The rareness is defined by
(6–9). We show the percentage of cases where the first half has a larger number of rare words. The 3σ threshold for percentages is 0.5802. We show the W-value for the W-test. The test is passed according to the 5σ rule for all but one case.
TABLE VIII: Shows the percentage and W-value for inequalities (12).
TABLE IX: The compressibility difference between the two halves; cf. (20) and the surrounding discussion. It shows the percentage and W-value for the relation . The relative compressibilities in (20) were calculated via the LZ-complexity (LZ) and the zip method. The results are similar and confirm (20).
TABLE X: Shows the percentage and W-value for the relation (19), i.e. for the compressibility difference between a text and its inverted version. Relation (19) is checked via the Lempel-Ziv (LZ) complexity and via the zip method.
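A minimal sketch of how a compressibility comparison between the two halves could be set up with a zip-type compressor. It assumes that compressibility is measured as compressed size divided by original size, and it uses Python's `zlib` (DEFLATE) as a stand-in for the zip method. Neither the exact definition in (20) nor the compressor settings of the actual analysis are reproduced here, and the helper names `split_halves` and `compression_ratio` are hypothetical.

```python
import zlib
from typing import Tuple


def split_halves(text: str) -> Tuple[str, str]:
    """Split a text into two halves with (nearly) equal numbers of words."""
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def compression_ratio(text: str, level: int = 9) -> float:
    """Compressed size / original size in bytes (smaller = more compressible)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("empty text")
    return len(zlib.compress(raw, level)) / len(raw)


# Toy usage: compare the two halves of a short artificial text.
sample = "the cat sat on the mat " * 40 + "each new word here differs from all others " * 5
first, second = split_halves(sample)
print(compression_ratio(first), compression_ratio(second))
```

For short inputs the compressor's fixed overhead dominates the ratio, so a comparison of this kind is only meaningful for texts of realistic length.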
TABLE XI: Results for for the first dataset of 156 texts. Here n.a. is the minimal number of appearances for words included into the definition (25) of
15 means that only words with frequency
are included, where N is the total number of words in the text.
TABLE XII: Results for for the second dataset of 350 texts; cf. Table XI.
Let us recall how Yule’s constant is defined [4]. Define $V_m$ to be the number of words that (in a fixed text) appear $m$ times. We get two obvious features:
$$\sum_{m=1}^{m_{\max}} V_m = n, \qquad \sum_{m=1}^{m_{\max}} m\,V_m = N, \qquad \text{(B1)}$$
where n and N are, respectively, the number of different words and the number of all words, and $m_{\max}$ is the number of times the most frequent word appears in the text. Note that for $m$ sufficiently close to $m_{\max}$, $V_m$ is either zero or one. For instance, consider the text The Age of Reason by T. Paine, 1794 (the major source of British deism), whose first half has N = 11612 total words and n = 2012 different words. In the first half of this text:
= 1,
TABLE XIII: Pearson correlation coefficients between the number of appearances for various relations between the two halves in the first dataset of 156 texts; see also Appendix C in this context. Now
]) is the number of times (among 156 texts) that the relation
holds; cf. (7). For h we understand the number of words that appear once, while for
number of appearances of words included in (25) is set to
denote (25) applied to common words of (resp.) first and second halves. Significant correlations are underlined (i.e. the coefficient is larger than 0.4). For the relation
[cf. (20)] we employed the zip compression method.
TABLE XIV: Pearson correlation coefficients between the number of appearances for various relations between the two halves in the second dataset of 350 texts. For notations see Table XIII.
= 0,
= 1,
= 0,
= 1 etc.
Take a word w that appears m times in a text with length N. Now $m/N$ is the probability that a randomly taken word in the text will be w. Likewise, $(m-1)/(N-1)$ is the probability that the second randomly taken word in the text will again be w. Both probabilities refer to a word w that appears m times. The probability to take such a word among the n distinct words of the text is $V_m/n$, where $V_m$ is the number of words that appear m times; cf. (B1). Thus the average
$$\frac{1}{n}\sum_{m} V_m\,\frac{m(m-1)}{N(N-1)}$$
is a measure of repetitiveness of words. Yule’s constant K employs this quantity without the factor $1/n$, since it is meant to depend only weakly on N [4]. For us this feature is not important, since we compare the halves of a text. Following the tradition, we also omit the factor $1/n$, but we stress that including it does not change our conclusions. Using (B1) and $N \gg 1$, the sum without the factor $1/n$ reduces to $\left(\sum_{m} m^2 V_m - N\right)/N^2$, and Yule’s constant reads [4]
where 10is a conventional factor we applied to keep K = O(1).
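The definitions of this appendix can be checked on a toy example. The sketch below computes the frequency spectrum (the number of distinct words that occur m times), verifies the two sum rules (B1), and evaluates Yule's constant as a prefactor times $\left(\sum_m m^2 V_m - N\right)/N^2$. The prefactor `scale` defaults to 10^4, the value used in the standard definition of Yule's K; treat it as an assumption. The function names `frequency_spectrum` and `yule_k` are hypothetical.

```python
from collections import Counter


def frequency_spectrum(words):
    """Return {m: V_m}: the number of distinct words that occur exactly m times."""
    word_counts = Counter(words)
    return Counter(word_counts.values())


def yule_k(words, scale=1e4):
    """Yule's K = scale * (sum_m m^2 V_m - N) / N^2.

    `scale` is an assumption: 10^4 is the value in the standard definition.
    """
    spectrum = frequency_spectrum(words)
    n_total = sum(m * v for m, v in spectrum.items())  # N, second relation in (B1)
    if n_total == 0:
        raise ValueError("empty text")
    second_moment = sum(m * m * v for m, v in spectrum.items())
    return scale * (second_moment - n_total) / n_total**2


words = "the cat and the dog and the bird".split()
spectrum = frequency_spectrum(words)
# Sanity checks corresponding to the two relations in (B1).
assert sum(spectrum.values()) == len(set(words))              # n distinct words
assert sum(m * v for m, v in spectrum.items()) == len(words)  # N words in total
print(dict(spectrum), yule_k(words))
```

A text in which every word occurs exactly once gives K = 0, since then $\sum_m m^2 V_m = N$; repetition of words increases K.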
We have M texts (k = 1, ..., M for our case M = 350 or M = 156) and 5 features (a = 1, ..., 5): 1. 2.
3.
4.
5. space frequency.
Let us define such that
= 1 (
= 0) if the feature a holds (does not hold) for text k. Let us now define the Pearson correlation coefficient between the features:
where
Eq. (C1) can be simplified as
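For 0/1 indicators the Pearson coefficient takes a compact form, because the square of an indicator equals the indicator itself, so its variance is p(1 − p) with p the fraction of texts for which the feature holds. The sketch below is one way to compute it. The notation (`x` and `y` for the indicator columns of two features over the M texts) and the function name `pearson_binary` are assumptions, and the formula is the generic Pearson coefficient for binary variables rather than a transcription of (C1).

```python
import math


def pearson_binary(x, y):
    """Pearson correlation between two equal-length 0/1 indicator sequences.

    Uses x*x == x for 0/1 values, so the variance of x is p*(1-p),
    where p is the mean of x.
    """
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty sequences of equal length")
    m = len(x)
    px = sum(x) / m                              # fraction of texts where feature a holds
    py = sum(y) / m                              # fraction of texts where feature b holds
    pxy = sum(a * b for a, b in zip(x, y)) / m   # fraction where both hold
    denom = math.sqrt(px * (1 - px) * py * (1 - py))
    if denom == 0:
        raise ValueError("a constant feature has no defined correlation")
    return (pxy - px * py) / denom


# Toy usage with four texts and two features.
print(pearson_binary([1, 1, 0, 0], [1, 0, 1, 0]))
```

A feature that holds for all texts, or for none, has zero variance; the sketch raises an error in that case instead of returning a value.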
[1] W. J. Hutchins, On the problem of aboutness in document analysis, Journal of Informatics, 1, 17 (1977).
[2] N.S. Valgina, Theory of the text. Textbook (Logos, Moscow, 2003) (Russian Edition).
[3] M. A. K. Halliday and R. Hasan, Language, Context and Text; Aspects of Language in a Social-semiotic Perspective (Oxford, Oxford University Press, 2nd edition, 1989).
[4] H. Baayen, Word frequency distribution (Kluwer Academic Publishers, 2001).
[5] J.K. Orlov, On statistical structure of message that are optimal for human perception, Naucno-techniceskaja informacija (Serija 2), 8, 11 (1970) (In Russian).
[6] M.V. Arapov and Yu. A. Shrejder, , 74 (1978).
[7] J.-B. Estoup, (Paris, Institut Sténographique, 1912).
[8] E.U. Condon, Statistics of vocabulary, Science, 67, 300 (1928).
[9] G.K. Zipf, The psycho-biology of language (MIT Press, Cambridge, MA, 1935) .
[10] A. Schenkel, J. Zhang, Y.-C. Zhang, Long range correlations in human writings, Fractals, 1, 47 (1993).
[11] M. Amit, Y. Shmerler, E. Eisenberg, M. Abraham, and N. Shnerb, Language and codification dependence of long-range correlations in text, Fractals, 2, 7 (1994).
[12] W. Ebeling and A. Neiman, Long-range correlations between letters and sentences in texts, Physica A 215, 233 (1995).
[13] E. Alvarez-Lacalle, B. Dorow, J.-P. Eckmann, and E. Moses, Hierarchical structures induce long-range dynamical correlations in written texts, PNAS, 103 (2006).
[14] D.Y. Manin, On the nature of long-range letter correlations in texts, arXiv:0809.0103.
[15] E. G. Altmann, G. Cristadoro, and M. Degli Esposti, On the origin of long-range correlations in texts, PNAS, 109, 11582 (2012).
[16] A. E. Allahverdyan, W. Deng, and Q. A. Wang, , Phys. Rev. E 88, 062804 (2013).
[17] Yu.A. Shreider, , Prob. Inform. Trans. 3, 45 (1967).
[18] Y. Dover, A short account of a connection of power laws to the information entropy, Physica A 334, 591 (2004).
[19] E.V. Vakarin and J. P. Badiali, Maximum entropy approach to power-law distributions in coupled dynamic-stochastic systems, Phys. Rev. E 74, 036120 (2006).
[20] C.-S. Liu, , 99 (2008).
[21] S.K. Baek, S. Bernhardsson and P. Minnhagen, , New Journal of Physics 13, 043004 (2011).
[22] G.A. Miller, Some effects of intermitted silence, Am. J. Psyc. 70, 311 (1957). G.A. Miller and E.B. Newman, Tests of statistical explanation for rank-frequency relation in written English, Am. J. Psyc. 71, 209 (1958).
[23] H.A. Simon, On a class of skew distribution functions, Biometrika 42, 425 (1955).
[24] D.H. Zanette and M.A. Montemurro, , J. Quant. Ling. 12, 29 (2005).
[25] I. Kanter and D.A. Kessler, , Phys. Rev. Lett. 74, 4559 (1995).
[26] B.M. Hill, J. Am. Stat. Ass., , 1017 (1974). H.S. Sichel, On a distribution law for word frequencies, J. Am. Stat. Ass. 70, 542 (1975). G. Troll and P. Beim Graben, , Phys. Rev. E 57, 1347 (1998). A. Czirok, H.E. Stanley and T. Vicsek, Possible origin of power-law behavior in n-tuple Zipf analysis, Phys. Rev. E 53, 6371 (1996).
[27] L. Aitchison, N. Corradi, P. E. Latham, PLoS Comput. Biol. 12, e1005110 (2016).