Long-Range Memory in Literary Texts: On the Universal Clustering of the Rare Words

A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf’s law). Here we address the complementary question of whether the rhythm of the text, characterized by the arrangement of the rare words in the text, can also be quantified mathematically in a similarly basic way. To this end, we consider representative classic single-authored texts from England/Ireland, France, Germany, China, and Japan. In each text, we classify each word by its rank. We focus on the rare words with ranks above some threshold $Q$ and study the lengths of the (return) intervals between them. We find that for all texts considered, the probability $S_Q(r)$ that the length of an interval exceeds $r$ follows a perfect Weibull function, $S_Q(r) = \exp(-b(\beta)\, r^{\beta})$, with $\beta$ around 0.7. The return intervals themselves are arranged in a long-range correlated, self-similar fashion, where the autocorrelation function $C_Q(s)$ of the intervals follows a power law, $C_Q(s) \sim s^{-\gamma}$, with an exponent $\gamma$ between 0.14 and 0.48. We show that these features lead to a pronounced clustering of the rare words in the text.
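The procedure described in the abstract (rank the words, keep those with rank above $Q$, measure the gaps between them) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `return_intervals` and `exceedance`, the whitespace tokenization, and the tie-breaking among equally frequent words are all assumptions made for the example.

```python
from collections import Counter

def return_intervals(tokens, Q):
    """Gaps between successive rare words, i.e. words whose rank exceeds Q."""
    counts = Counter(tokens)
    # Rank 1 is the most frequent word; equal counts keep first-seen order (assumption).
    rank = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    positions = [i for i, w in enumerate(tokens) if rank[w] > Q]
    return [b - a for a, b in zip(positions, positions[1:])]

def exceedance(intervals, r):
    """Empirical S_Q(r): fraction of intervals strictly longer than r."""
    return sum(1 for x in intervals if x > r) / len(intervals)

# Toy text, purely illustrative.
tokens = "the cat saw the dog and the bird saw the cat".split()
intervals = return_intervals(tokens, Q=2)
```

In this toy text the words of rank above 2 sit at positions 2, 4, 5, 7, and 8, so the intervals are 2, 1, 2, 1. The mean of the intervals plays the role of the return period $R_Q$.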

In this Supporting Information, we first show how the parameter $b$ in the stretched exponential describing the exceedance probability can be determined. Then we present, in Figure A, the rank-frequency distribution of the 10 texts considered here, and show in Figure B how the return period $R_Q$ varies with rank $Q$ in the texts. We demonstrate in Figure C that for shuffled texts the exceedance probability $S_Q(r)$ is a simple exponential and the autocorrelation function $C_Q(s)$ fluctuates around 0. Table A summarizes all the statistics obtained in the present study. Figure D shows $S_Q(r)$ and $C_Q(s)$ for newspaper corpora in three languages.
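The statement about shuffled texts can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the analysis behind Figure C: rare-word positions are drawn in alternating sparse and dense blocks (a hypothetical way to produce clustering), the helper names `intervals_from_mask` and `autocorr` are invented for the example, and only the lag-1 autocorrelation is inspected.

```python
import numpy as np

def intervals_from_mask(mask):
    """Gaps between successive True entries (positions of rare words)."""
    return np.diff(np.flatnonzero(mask))

def autocorr(x, s):
    """Normalized autocorrelation of the interval sequence at lag s >= 1."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.mean(d[:-s] * d[s:]) / np.mean(d * d))

rng = np.random.default_rng(0)

# Hypothetical toy "text": rare words arrive in alternating sparse and dense
# blocks, so neighbouring return intervals are correlated.
n = 1_000_000
rate = np.where((np.arange(n) // 20_000) % 2 == 0, 0.002, 0.018)
mask = rng.random(n) < rate

# Shuffling keeps the number of rare words but destroys their arrangement.
shuffled = rng.permutation(mask)

r_orig = intervals_from_mask(mask)
r_shuf = intervals_from_mask(shuffled)
R_Q = r_shuf.mean()            # return period of the shuffled sequence

r0 = 100
S_shuf = np.mean(r_shuf > r0)  # empirical exceedance at r0
S_exp = np.exp(-r0 / R_Q)      # simple exponential with the same return period
c_orig = autocorr(r_orig, 1)
c_shuf = autocorr(r_shuf, 1)
```

Under these assumptions `S_shuf` stays close to `S_exp` and `c_shuf` stays close to zero, whereas `c_orig` is clearly positive because long intervals tend to follow long ones inside the sparse blocks.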
The parameter $b$ in the exceedance probability for large $R_Q$
In the article we observed that the exceedance probability $S_Q(r)$ can be expressed by a stretched exponential, $S_Q(r) = \exp(-b\,(r/R_Q)^{\beta}) \equiv \exp(-b\,x^{\beta})$, for all texts considered. Here we show that for large $R_Q$, where $x = r/R_Q$ can be (approximately) considered as continuous, the relation $b = \left[\int_0^{\infty} dx\, \exp(-x^{\beta})\right]^{\beta}$ holds. To see this, we note that in the continuous limit, the probability density distribution $P_Q$ satisfies $P_Q = -R_Q\, dS_Q(x)/dx$, yielding […]
Zipf's law for the 10 texts considered
Figure A. Zipf's law for the 10 texts considered. Log-log plots of the rank-frequency distribution of the words in the 10 texts considered. The black line has slope −1, as suggested by Zipf's law. The figure demonstrates that Zipf's law is not rigorous but a reasonable approximation for the rank-frequency distribution in the considered texts.
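Returning to the parameter $b$ of the stretched exponential: the relation between $b$ and $\beta$ can be retraced in a few steps. The sketch below is one possible route and not necessarily the authors' own. It assumes the continuous limit and uses only the fact that the mean return interval equals the return period $R_Q$, together with the standard identity that the mean of a non-negative variable is the integral of its exceedance probability.

```latex
\begin{align}
  R_Q &= \int_0^{\infty} S_Q(r)\,dr
       = R_Q \int_0^{\infty} \exp(-b\,x^{\beta})\,dx ,
       \qquad x = r/R_Q , \\
  1   &= \int_0^{\infty} \exp(-b\,x^{\beta})\,dx
       = b^{-1/\beta} \int_0^{\infty} \exp(-y^{\beta})\,dy ,
       \qquad y = b^{1/\beta} x , \\
  b   &= \left[ \int_0^{\infty} \exp(-y^{\beta})\,dy \right]^{\beta}
       = \left[ \Gamma\!\left(1 + \tfrac{1}{\beta}\right) \right]^{\beta} .
\end{align}
```

The last equality follows from the substitution $t = y^{\beta}$. For $\beta = 1$ this gives $b = 1$, the simple exponential, and for $\beta = 0.7$ it gives $b$ of roughly 1.18.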
Dependence of the return period $R_Q$ on $Q$
Figure B. Dependence of the return period $R_Q$ on $Q$. (a) Log-log plots of $R_Q$ as a function of $Q$ (the left panel shows the first five texts, the right panel the second five). From Zipf's law we would expect […]. Accordingly, when plotting $(1 - 1/R_Q) \ln Q_{\max}$ versus $\ln Q$, deviations from the straight (red) line are due to deviations from Zipf's law.
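The quantity plotted in Figure B suggests one reading of the Zipf expectation, offered here as an assumption rather than as the authors' formula. If the word of rank $k$ occurs with probability proportional to $1/k$ and harmonic sums are approximated by logarithms, the probability of a word with rank above $Q$ is about $1 - \ln Q / \ln Q_{\max}$. Since this probability is $1/R_Q$, it follows that $(1 - 1/R_Q) \ln Q_{\max} \approx \ln Q$. The sketch below compares both sides for an ideal Zipf distribution; the function name `return_period_zipf` and the vocabulary size are hypothetical.

```python
import math

def return_period_zipf(Q, Q_max):
    """R_Q for an ideal Zipf text in which rank k has probability ~ 1/k."""
    total = sum(1.0 / k for k in range(1, Q_max + 1))
    tail = sum(1.0 / k for k in range(Q + 1, Q_max + 1))  # ranks above Q
    return total / tail  # inverse of the probability of a rare word

Q_max = 100_000  # hypothetical vocabulary size
for Q in (10, 100, 1000, 10_000):
    R_Q = return_period_zipf(Q, Q_max)
    lhs = (1 - 1 / R_Q) * math.log(Q_max)
    print(Q, round(R_Q, 3), round(lhs, 3), round(math.log(Q), 3))
```

The two sides agree only approximately even for this ideal case, because the logarithm is itself an approximation to the harmonic sums.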
$S_Q(r)$ and $C_Q(s)$ in randomized texts
Table A. Summary of parameters, presented on the following two pages. For fixed return period $R_Q$, the table shows (i) the corresponding $Q$ value and the number of words above $Q$, $N_Q$, (ii) the exponent $\beta$ and the parameter $b$ in $S_Q(r) = \exp(-b\,(r/R_Q)^{\beta})$, and (iii) the exponent $\gamma$ and the parameter $C_Q(1)$ for the fitted autocorrelation function $C_Q(s) = C_Q(1)\, s^{-\gamma}$. From $C_Q(1)$ and $\gamma$, the fraction $a$ of white noise can be estimated [24]: a = 1/[1 + …

$S_Q(r)$ and $C_Q(s)$ of Newspaper Corpora
We also considered newspaper archives consisting of the Wall Street Journal (1987, 108 MB, N = 22679512), 10 years of the People's Daily in Chinese (1995, 67 MB, N = 19420852), and the Mainichi newspaper in Japanese (2000-2009). These corpora are thus multiple-author texts. The articles in each newspaper are arranged in chronological order, and some rare words could therefore occur locally in the corpora. For the three languages, the exceedance probabilities and autocorrelation functions are shown in the left-hand, middle, and right-hand graphs of Figure D, respectively.
For the exceedance probability, the Weibull function gives an excellent fit, especially for larger $R_Q$. In the case of the Chinese newspaper, the fit is perfect even for lower $R_Q$. Therefore, the Weibull characteristic underlying the distribution of intervals occurs independently of whether texts have single or multiple authors, although the values of $\beta$ are smaller for the newspaper corpora.
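One common way to check a Weibull fit of this kind is to linearize it: if $S(x) = \exp(-b\,x^{\beta})$, then $\ln(-\ln S)$ is a straight line in $\ln x$ with slope $\beta$ and intercept $\ln b$. The sketch below applies this to synthetic continuous data. It is an illustration under stated assumptions, not the fitting procedure used for the figures: the function name `fit_stretched_exponential` is hypothetical, the sample is drawn from a known Weibull law, and real return intervals are integers with ties, which this sketch does not treat.

```python
import math
import numpy as np

def fit_stretched_exponential(intervals):
    """Fit S(x) = exp(-b * x**beta), with x = r / mean(r), by least squares
    on the linearized form ln(-ln S) = ln b + beta * ln x."""
    r = np.sort(np.asarray(intervals, dtype=float))
    x = r / r.mean()
    n = len(x)
    # Empirical exceedance at each sorted sample: fraction of samples above it.
    S = 1.0 - np.arange(1, n + 1) / n
    keep = (S > 0) & (S < 1) & (x > 0)  # drop points where the logs are undefined
    beta, ln_b = np.polyfit(np.log(x[keep]), np.log(-np.log(S[keep])), 1)
    return float(beta), math.exp(ln_b)

rng = np.random.default_rng(1)
sample = rng.weibull(0.7, size=50_000)  # synthetic intervals, S(r) = exp(-r**0.7)
beta_hat, b_hat = fit_stretched_exponential(sample)

# Value of b implied by beta = 0.7 once the mean interval is rescaled to 1.
b_expected = math.gamma(1 + 1 / 0.7) ** 0.7
```

Because the intervals are rescaled by their mean, the fitted `b_hat` can be compared directly with $[\Gamma(1 + 1/\beta)]^{\beta}$, which serves as a consistency check on the fitted $\beta$.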
The autocorrelation function behaves differently from that for single-author texts. The plots tend to show a convex shape, indicating a shorter memory and leading to larger error bars than those in Fig 3. This can be understood since newspaper articles are usually quite short, and hence the memory tends to decay faster than for single-author texts. A more detailed mathematical and experimental explanation of the difference between single- and multi-author texts is left for future work.