Reader Comments

More on the poor fit of random typing

Posted by rferrericancho on 15 Mar 2010 at 16:24 GMT

You can find further visual evidence of the poor fit of random typing ("random texts") to real texts by looking at the frequency histogram. In a frequency histogram, frequency is on the x-axis and the number of words with that frequency is on the y-axis. Have a look at Figure 1 of the article:

Ferrer-i-Cancho, R. & Gavaldà, R. (2009). The frequency spectrum of finite samples from the intermittent silence process. Journal of the American Society for Information Science and Technology 60 (4), 837-843.

There you can see the expected frequency histogram of random typing with the parameters that Miller & Chomsky (1963) argued give a good fit to actual word frequencies. Although the frequency histogram of actual texts is known to conform (approximately) to a straight line on a double logarithmic scale (Zipf 1949), the expected frequency histogram of random typing clearly does not. Pay special attention to the humps and the gaps between them. There is no "power-law" in Figure 1.

Notice that, in a rank histogram, frequency cannot increase as the rank increases, whereas in a frequency histogram the number of words with a certain frequency can, a priori, decrease or increase freely. The frequency histogram allows for humps and gaps and thus acts as an amplifier of the profound differences between random typing and real texts at the level of word frequencies.
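For readers who want to look at both histograms themselves, here is a minimal Python sketch of random typing. It is only an illustration under stated assumptions: the function names, the five-letter alphabet, the space probability of 0.18, and the sample size are hypothetical choices made for this example, not the parameters or the method of the article cited above, and the sketch computes an empirical histogram from one finite sample rather than the expected histogram shown in Figure 1.

```python
import random
from collections import Counter


def random_typing(n_chars, alphabet="abcde", p_space=0.18, seed=0):
    """Type n_chars characters at random and split the result into words.

    Each keystroke is a space with probability p_space, otherwise a letter
    drawn uniformly from alphabet. All parameter values are illustrative
    assumptions, not values taken from the cited article.
    """
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    # str.split() with no argument drops the empty "words" between
    # consecutive spaces.
    return "".join(chars).split()


def rank_histogram(words):
    """Word frequencies sorted in decreasing order (index 0 is rank 1)."""
    return sorted(Counter(words).values(), reverse=True)


def frequency_histogram(words):
    """Map each frequency f to the number of distinct words occurring f times."""
    word_freq = Counter(words)
    return dict(sorted(Counter(word_freq.values()).items()))


words = random_typing(200_000)
ranks = rank_histogram(words)
spectrum = frequency_histogram(words)

# Print the low-frequency end of the frequency histogram.
for f, n_words in list(spectrum.items())[:20]:
    print(f, n_words)
```

By construction, `ranks` can never increase with rank, while nothing forces `spectrum` to be monotonic. Plotting `spectrum` on a double logarithmic scale is one way to check for humps and gaps in a given sample.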

No competing interests declared.

RE: More on the poor fit of random typing

allegrip replied to rferrericancho on 08 Sep 2010 at 12:42 GMT

Dear Ramon,
I think that this comment is even more compelling than the paper.
In fact, random texts (as defined by Wentian Li: monkeys
in front of a typewriter) do possess a probability as the limit of
their relative frequency of occurrence.

What is clear from quantitative studies of texts is that this limit
does not in fact exist, and what Zipf's law is telling us is that the
only probability density that can be defined is the probability of
relative frequencies [Montemurro].

In this space the r(f) inverse power laws of natural and random
languages have completely different origins, the latter
arising from a weak convergence to a Rényi (people call it Tsallis')
density, produced by an inhomogeneous sum of normal functions with,
in this case, a hierarchical structure (hence the holes).

I have always argued that the inverse power law in the case
of natural texts is due to the generalized central limit theorem.
Really a different story.

Paolo Allegrini

No competing interests declared.