
Conceived and designed the experiments: RFiC BE. Performed the experiments: RFiC. Analyzed the data: RFiC. Contributed reagents/materials/analysis tools: RFiC. Wrote the paper: RFiC BE.

The authors have declared that no competing interests exist.

Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the second most frequent has rank 2, and so on) is approximately a power law.

In this article, we examine the flaws of such putative good fits of random texts. We demonstrate, by means of three different statistical tests, that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid both for the simplest random texts, composed of equally likely characters, and for more elaborate and realistic versions where character probabilities are borrowed from a real text.

The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages.

Imagine that one takes a text, counts the frequency of every word and assigns a rank to each word in decreasing order of frequency. This would result in the most frequent word having a rank of 1, the second most frequent a rank of 2, and so on.

has been generated using English letters ranging from ‘a’ to ‘z’ (the separation between words in our example is arbitrary and due to automatic formatting).

The idea that random sequences of characters reproduce Zipf's law stems from the seminal work of Mandelbrot

There have been many arguments against the meaningfulness or relevance of Zipf's law

The studies that question the relevance to natural language of Zipf's law argue for the matching between Eq. 1 and random texts. However, Eq. 1 is only an approximation for the real rank histogram. The best candidate for the actual rank distribution remains an open question

As far as we know, in none of the popular articles that argue against the meaningfulness of Zipf's law

Eqs. 2, 3 and 4 are derived in the context of a very long text. It is not known

As far as we know, in none of the popular articles that question the meaningfulness to natural language of Zipf's law

A comparison of the real rank histogram (thin black line) and two control curves with the

The same as

To address Problem 1, we evaluate the goodness of fit of random texts to real texts

We exclude from our analysis a variant of the random text that generates empty words. Empty words are obtained when producing two blanks in a row, which is allowed in

Many authors have discussed the explanatory adequacy of random texts for real Zipf's law-like word rank distributions indirectly from inconsistencies between random texts and real texts beyond the distribution of ranks

To our knowledge, only a few studies have addressed this question

In this article we employ a set of ten English texts (eight novels and two essays) to evaluate the goodness of fit of random texts in

Title | Abbreviation | Author
Alice's adventures in wonderland | AAW | Lewis Carroll (1832–1898)
The adventures of Tom Sawyer | ATS | Mark Twain (1835–1910)
A Christmas carol | CC | Charles Dickens (1812–1870)
David Crockett | DC | John S. C. Abbott (1805–1877)
An enquiry concerning human understanding | ECHU | David Hume (1711–1776)
Hamlet | H | William Shakespeare (1564–1616)
The hound of the Baskervilles | HB | Sir Arthur Conan Doyle (1859–1930)
Moby-Dick: or, the whale | MB | Herman Melville (1819–1891)
The origin of species by means of natural selection | OS | Charles Darwin (1809–1882)
Ulysses | U | James Joyce (1882–1941)

Abbreviation | | | | | |
AAW | 27342 | 28 | 0.254 | 2574 | 254.05 | 466.60
CC | 29253 | 30 | 0.240 | 4263 | 463.31 | 887.22
H | 32839 | 28 | 0.253 | 4582 | 474.39 | 932.44
ECHU | 57958 | 36 | 0.212 | 4912 | 433.91 | 861.35
HB | 59967 | 39 | 0.244 | 5568 | 472.87 | 990.44
ATS | 73523 | 31 | 0.248 | 7169 | 612.45 | 1298.53
DC | 78819 | 36 | 0.228 | 7385 | 668.60 | 1346.19
OS | 209176 | 36 | 0.207 | 8955 | 589.94 | 1274.53
MB | 218522 | 36 | 0.229 | 17190 | 1291.67 | 2909.44
U | 269589 | 36 | 0.228 | 29213 | 2425.63 | 5444.95

We consider three different versions of the random text (RT) model without empty words that have been considered in the literature. All the versions generate a random sequence of independent characters. These three versions are (the subscript indicates the number of parameters of each version of the random text):

All characters, including the blank, are equally likely. This model is specified with a single parameter:

All characters except the blank are equally likely. This model is specified with two parameters,

All characters can take any probability. This model is specified with

Real character probabilities extracted from the target writing as in

An example of
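The three versions of the random text model above can be sketched in code. The following is a minimal illustration, not the authors' implementation; the function names, the alphabet, and the value p_blank = 0.18 are assumptions chosen for the example.

```python
import random
import string

ALPHABET = list(string.ascii_lowercase)  # assumed alphabet for the sketch

def rt_equal(n_chars, rng):
    """Version with one parameter: all characters, blank included, equally likely."""
    symbols = ALPHABET + [' ']
    return ''.join(rng.choice(symbols) for _ in range(n_chars))

def rt_blank(n_chars, p_blank, rng):
    """Version with two parameters: the blank has probability p_blank,
    the remaining characters share the rest equally."""
    return ''.join(' ' if rng.random() < p_blank else rng.choice(ALPHABET)
                   for _ in range(n_chars))

def rt_any(n_chars, probs, rng):
    """Version where every character, blank included, takes an arbitrary
    probability (e.g., probabilities estimated from a target real text)."""
    symbols, weights = zip(*probs.items())
    return ''.join(rng.choices(symbols, weights=weights, k=n_chars))
```

Splitting the returned string on blanks yields the sequence of random words whose rank histogram is compared against that of a real text.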

Here we aim to compare rank histograms from real texts and expected rank histograms from random texts. If random texts really reproduce the rank histogram of real texts, then the histogram of real texts and those of the random texts should completely overlap. We will see that this is not the case.

Here our emphasis is on providing a fair visual comparison. We use the term fair in two senses. First, we consider real and artificial texts of the same length in words. Notice that the equations that have been derived so far for the rank distribution of

In the interest of being concise, for visual fits, we chose four works representing different genres and covering the whole range of text lengths in the sample.

It is well known that if characters other than the blank have unequal probabilities then the rank histogram smoothes

The same as

In the next section, we employ rigorous statistical fitting, not because we think that it is strictly necessary when large visual differences between random and real texts are found (e.g.,

We detailed in the introduction that we did not seek to evaluate the goodness of fit of random texts to actual rank histograms through Zipf's law, because this implies the risk that the target equation, i.e. Eq. 1, is not accurate enough

To our knowledge, the expectation of these statistics for a text of a certain finite length has not previously been reported. If the rank distribution of the real texts and that of the random texts are the same, statistically significant differences between the value of the above statistics in real texts and those of random texts should not be found or be exceptional. Here we consider the whole set of ten English texts including the four works we examined in detail in the previous section (

For each real text, we estimate the expectation and standard deviation of these statistics by generating

Abbrv. | | | | | | | | | |
AAW | 42.6 | −97.5 | −133.2 | −163.4 | −573.4 | −160.6 | 54.0 | −74.3 | −147.1
 | 130.5 | −59.7 | −78.6 | −94.2 | −312.9 | −93.1 | 173.0 | −46.3 | −85.7
 | 56.3 | −83.4 | −119.5 | −156.8 | −2033.6 | −153.1 | 74.1 | −63.0 | −135.5
CC | 99.1 | −80.7 | −120.0 | −151.1 | −555.4 | −158.5 | 116.8 | −53.7 | −139.3
 | 267.3 | −54.5 | −76.8 | −93.6 | −317.8 | −98.0 | 347.2 | −36.8 | −87.1
 | 136.5 | −72.8 | −111.9 | −149.5 | −1969.6 | −159.2 | 169.9 | −48.7 | −134.7
H | 103.5 | −86.6 | −127.6 | −158.4 | −581.5 | −157.7 | 121.5 | −58.6 | −142.3
 | 277.8 | −58.4 | −81.4 | −97.6 | −331.3 | −97.5 | 361.9 | −40.5 | −89.1
 | 142.1 | −77.8 | −118.3 | −155.8 | −2017.9 | −154.6 | 176.8 | −53.1 | −135.4
ECHU | 75.6 | −133.8 | −184.6 | −226.0 | −795.6 | −275.9 | 93.7 | −98.9 | −240.5
 | 247.4 | −81.9 | −108.5 | −129.7 | −431.3 | −155.9 | 328.9 | −61.6 | −137.4
 | 106.3 | −112.7 | −161.7 | −210.2 | −2494.6 | −278.7 | 138.1 | −82.9 | −227.2
HB | 92.0 | −131.4 | −182.2 | −225.5 | −791.3 | −246.2 | 112.8 | −93.8 | −207.3
 | 272.8 | −82.6 | −109.2 | −131.3 | −432.7 | −142.6 | 366.7 | −60.5 | −121.9
 | 127.8 | −112.0 | −161.0 | −211.0 | −2482.7 | −238.3 | 165.5 | −79.9 | −189.0
ATS | 120.7 | −137.9 | −195.9 | −242.1 | −854.7 | −253.9 | 143.7 | −97.7 | −219.7
 | 369.8 | −87.6 | −118.6 | −142.1 | −469.4 | −148.1 | 488.6 | −63.6 | −130.6
 | 173.0 | −118.1 | −173.1 | −226.1 | −2620.6 | −241.3 | 218.5 | −83.9 | −199.6
DC | 119.2 | −143.6 | −201.8 | −250.1 | −882.0 | −294.2 | 143.9 | −102.1 | −246.5
 | 404.6 | −89.5 | −120.6 | −145.5 | −482.0 | −168.9 | 540.4 | −64.0 | −143.9
 | 175.7 | −121.8 | −177.0 | −232.0 | −2678.1 | −288.0 | 224.7 | −86.6 | −226.7
OS | 72.9 | −258.2 | −341.1 | −419.4 | −1446.2 | −539.9 | 100.0 | −205.0 | −443.3
 | 349.3 | −148.4 | −189.5 | −228.6 | −754.5 | −289.1 | 486.6 | −119.9 | −240.5
 | 117.7 | −203.8 | −279.5 | −362.9 | −3939.7 | −514.2 | 164.6 | −160.9 | −390.7
MB | 222.1 | −221.6 | −311.1 | −392.2 | −1418.5 | −470.9 | 266.4 | −155.8 | −382.7
 | 849.5 | −137.8 | −184.8 | −226.4 | −765.8 | −266.8 | 1152.2 | −98.0 | −221.3
 | 352.8 | −184.8 | −265.5 | −350.9 | −3908.0 | −444.5 | 452.7 | −130.7 | −339.9
U | 404.3 | −200.7 | −303.2 | −398.6 | −1491.0 | −481.1 | 466.6 | −120.8 | −388.7
 | 1672.5 | −133.7 | −190.8 | −241.7 | −828.3 | −285.2 | 2206.4 | −78.8 | −235.9
 | 693.6 | −175.0 | −266.7 | −364.3 | −4068.4 | −462.1 | 862.0 | −107.3 | −354.5

How can we determine the significance of these distances? The Chebyshev inequality provides us with an upper bound on the p-value, the probability that the observed distance is due to mere chance, and this bound holds for any kind of distribution. This upper bound is
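The significance test can be sketched as a few lines of code. This is an illustration of the Chebyshev argument, not the authors' code: for a distance of d standard deviations from the expectation, Chebyshev's inequality gives P(|X − E[X]| ≥ d·sd) ≤ 1/d², whatever the underlying distribution.

```python
def chebyshev_upper_bound(observed, expectation, sd):
    """Return the distance in standard deviations, d, and the Chebyshev
    upper bound on the p-value, 1/d**2, valid for any distribution."""
    d = (observed - expectation) / sd
    return d, 1.0 / d ** 2
```

For instance, a distance of 10 standard deviations already bounds the p-value by 0.01; the distances reported above run into the hundreds, so the corresponding bounds are vanishingly small.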

the fair die rolling experiment considered in

the variant of the random text model considered in

the random text with unequal letter probabilities (

Next we focus on the sign of the distances in order to shed light on the nature of the disagreement between real and random texts. The sign of the distance indicates whether the actual value is too small (

We have seen that three different rank statistics are able to show, independently, that ten English texts and random texts with different versions and parameter settings are statistically inconsistent in all cases. We have seen that for the majority of the parameter settings considered, the nature of the disagreement is that the real rank statistic is smaller than that expected for a random text.

Although we have shown the poor fits of random texts by means of rigorous statistical tests, our limited exploration of the parameter space cannot exclude the possibility that random texts provide good fits for actual rank histograms with parameter values not considered here. Notice that random texts fail both with arbitrarily chosen parameters, e.g., the fair die rolling experiment

We believe that the quest for parameters that provide a good fit of random texts to real texts is a tough challenge for detractors of the meaningfulness of Zipf's law, because real writers do not produce words by concatenating independent events under a certain termination probability. Real writers extract words from a mental lexicon that provides almost ‘ready to use’ words

There are still many models of Zipf's law for which the goodness of fit to real texts has not been studied rigorously (e.g.,

To simplify the analysis, we normalize the English texts in

After text normalization, there is a small fraction of word characters that are not letters in the English alphabet. Most of these characters are digits or accents. To make sure that our results are not a consequence of these infrequent characters we repeated the fitting tests excluding words not made exclusively of English lowercase letters from ‘a’ to ‘z’ after text normalization. We found that the results were qualitatively identical: each of the three rank statistics is able to reject the hypothesis of a random text in all cases.

Here we aim to provide some guidelines to perform the computer calculations presented in this article for easy replication of our results. In what follows we consider the computational efficiency of three issues: (i) the generation of random words; (ii) counting the frequency of random words; and (iii) sorting.

Here we explain how to generate a random word efficiently. We start with the simplest (or naïve) algorithm of random word generation (we assume that the space delimiting words does not belong to the word):

Start with an empty string of characters

Generate a random character

Generate a uniform random deviate

While

Generate a random character

Generate a uniform random deviate
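The naive algorithm above can be sketched as follows. This is an illustrative rendering under assumptions: words have at least one letter (empty words are excluded, as in the text), and after each letter a uniform deviate below the blank probability terminates the word.

```python
import random

def naive_random_word(p_blank, alphabet, rng):
    """Naive generator: append a random letter, then draw a uniform deviate;
    the word ends as soon as the deviate falls below p_blank (a blank is
    produced). Starting with one letter excludes empty words."""
    word = [rng.choice(alphabet)]
    while rng.random() >= p_blank:
        word.append(rng.choice(alphabet))
    return ''.join(word)
```

Note that this scheme draws two random deviates per letter (one for the character, one for the termination test), which motivates the faster length-first method described next.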

Generating a uniformly distributed random letter (steps

Imagine that a random word has length

Generate a random geometric deviate

Generate a random word

where Step 2 is performed through the following algorithm:

Start with an empty string of characters

Repeat

Generate a random character

Of key importance is the generation of the random geometric deviate in
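The length-first scheme can be sketched as below. This is an illustration, not the authors' code; it assumes the standard inversion method for geometric deviates, which yields the whole word length from a single uniform deviate, and it shifts lengths by one so that empty words are excluded.

```python
import math
import random

def geometric_length(p_blank, rng):
    """Word length via inversion of the geometric CDF: one uniform deviate
    per word instead of one per character. Lengths start at 1 (no empty
    words): P(L = k) = (1 - p_blank)**(k - 1) * p_blank for k >= 1."""
    u = 1.0 - rng.random()  # u in (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(1.0 - p_blank))

def fast_random_word(p_blank, alphabet, rng):
    """Step 1: draw the length; Step 2: fill in that many random characters."""
    length = geometric_length(p_blank, rng)
    return ''.join(rng.choice(alphabet) for _ in range(length))
```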

It is still possible to generate a random word of length

Start with an empty string of characters

Generate a random character

Generate a random character

While

Add

Generate a random character
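One way to realize the steps above in code, sketched here as an assumption rather than the authors' exact implementation: when characters have unequal probabilities, drawing a non-blank character amounts to redrawing whenever a blank appears, which is equivalent to sampling from the character distribution conditioned on 'non-blank'.

```python
import random

def nonblank_character(symbols, weights, rng):
    """Rejection step: redraw until the character is not the blank, i.e.,
    sample from the character distribution conditioned on non-blank."""
    while True:
        c = rng.choices(symbols, weights=weights, k=1)[0]
        if c != ' ':
            return c

def random_word_unequal(length, symbols, weights, rng):
    """A word of a given length when characters have unequal probabilities."""
    return ''.join(nonblank_character(symbols, weights, rng)
                   for _ in range(length))
```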

We define

With simultaneous random word generation and counting, the time efficiency can be improved by employing more memory for the case of

Generate a random geometric deviate

If

Generate a random uniform number

Increase

else

Generate a random word

Increase the frequency of

The extra memory needed for the table of words of length not exceeding
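The table-based counting scheme can be sketched as follows. This is an illustrative version under assumptions: letters are equally likely (so a word of length L is uniform over |A|**L possibilities and a single uniform integer indexes it, with no string ever built), and the cutoff max_len is a tuning parameter chosen for the example.

```python
import math
import random
from collections import Counter

def count_random_words(n_words, p_blank, alphabet, max_len, rng):
    """Count word frequencies: words of length <= max_len are indexed
    directly into a flat array via one random integer; longer words are
    generated explicitly and counted in a dictionary."""
    a = len(alphabet)
    offsets, total = [], 0
    for length in range(1, max_len + 1):
        offsets.append(total)       # start of the block for this length
        total += a ** length
    table = [0] * total             # the extra memory discussed in the text
    long_words = Counter()
    for _ in range(n_words):
        u = 1.0 - rng.random()
        length = 1 + int(math.log(u) / math.log(1.0 - p_blank))
        if length <= max_len:
            table[offsets[length - 1] + rng.randrange(a ** length)] += 1
        else:
            word = ''.join(rng.choice(alphabet) for _ in range(length))
            long_words[word] += 1
    return table, long_words
```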

Sorting natural numbers efficiently is needed to calculate ranks. Obtaining the ranks of a certain text (real or random) requires sorting the word frequencies in decreasing order. All the above techniques may fail to increase the speed of the computer calculations significantly if the sorting takes more than
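The source does not fix a particular sorting algorithm; one option consistent with the efficiency concern, sketched here as an assumption, is a counting sort, which exploits the fact that word frequencies are natural numbers bounded by the text length.

```python
def sort_frequencies_desc(freqs):
    """Counting sort in decreasing order: frequencies are natural numbers
    bounded by the text length, so sorting takes O(n + max) time rather
    than the O(n log n) of a comparison sort."""
    hi = max(freqs)
    counts = [0] * (hi + 1)
    for f in freqs:
        counts[f] += 1
    out = []
    for value in range(hi, -1, -1):
        out.extend([value] * counts[value])
    return out
```

The rank of a word is then simply one plus the position of its frequency in the sorted list.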


We are grateful to R. Gavaldà, B. McCowan, J. Mačutek and E. Wheeler for helpful comments. We also thank M. A. Serrano, A. Arenas and S. Caldeira for helpful discussions and J. Cortadella for computing facilities.