In what way is random typing model relevant to language?

Posted by dmanin on 16 Mar 2010 at 04:59 GMT

I have read the paper with a lot of interest. I am not sure, though, that I understand the main
question posed in it: "is Zipf's law relevant to natural language", because
I don't understand what exactly "being relevant" means in this context.
Let me try to outline my own position on this issue.

There is a wide variety of opinions about power-law distributions,
ranging from "they certainly indicate complex-system behavior" to "they
are entirely meaningless".

I certainly disagree with both statements. Power rank distributions can
result from many different mechanisms, including the extremely simple
one, "random typing", which clearly defeats the idea that *every* system
exhibiting a power-law distribution is complex. (This reminds me of
another grand claim: that every character sequence with long-range
correlations is "complex".) This
work confirms that random typing leads to (something like) a power-law rank distribution.
Whether it is in quantitative agreement with distributions observed for
natural language (which it is shown here not to be) is immaterial for
this particular question.
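
To make the "random typing" mechanism concrete, here is a minimal sketch of it in Python. The alphabet size, the space probability, and the function names are illustrative assumptions, not the parameters studied in the paper: each keystroke is a space with probability `p_space`, otherwise a uniformly chosen letter, and "words" are whatever falls between spaces.

```python
import random
from collections import Counter

def random_typing(n_chars, alphabet="abcde", p_space=0.2, seed=0):
    """Type n_chars random characters and split the result into 'words'.

    Assumed model: each keystroke is a space with probability p_space,
    otherwise one of the letters in `alphabet`, chosen uniformly.
    """
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    # str.split() with no argument drops the empty strings that
    # consecutive spaces would otherwise produce.
    return "".join(chars).split()

def rank_frequency(words):
    """Return word frequencies in decreasing order (index 0 is rank 1)."""
    return sorted(Counter(words).values(), reverse=True)

words = random_typing(200_000)
freqs = rank_frequency(words)
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

Plotting `freqs` against rank on log-log axes should show the roughly straight, stepped curve that is usually described as Zipf-like; the steps come from all words of equal length being equiprobable in this variant.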

On the other hand, I also disagree with the "Zipf nihilism" position
that claims that Zipf's law for language is simply not interesting just
because random typing models also exhibit a Zipf-like rank distribution.
Whatever the mechanism generating such a distribution in language is (in
my article, cited here as ref [24], I argue that it may have to do with word semantics), it is
certainly *not* random typing, so random typing is simply irrelevant to
the question. Though not a proof, the random-typing argument is a source of
doubt, because it demonstrates that simple causes *may* underlie
power-law rank distributions. And nothing would change if one demonstrated a
random-typing model producing exactly the same distribution as natural
language, because *the same distribution does not ensure the same
mechanism*.

The bottom line, to me, is this: the only way to make a claim that
Zipf's law for language is interesting is to propose a specific
mechanism that generates a power-law rank distribution in natural
language.

As for the differences between Zipf-like distributions resulting from
random typing vs. natural language, which are studied in this work, I
find them very interesting even outside of this debate. I would be
very much interested to see how the distributions change with the number
of words, what mechanisms produce large-scale deviations from an
ideal power law, what determines the exponent, and perhaps what happens when
one progresses from Poisson processes to Markov chains.
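
On the exponent question, the simplest variant of random typing admits a short back-of-the-envelope answer. The assumptions here are mine and not necessarily the paper's parameterization: N equiprobable letters, a space typed with probability p_s, and ranks treated only to order of magnitude. A word w of length L has the probability below, and words of length L occupy ranks of order N^L, which gives:

```latex
p(w) = p_s \left(\frac{1 - p_s}{N}\right)^{L}, \qquad r \sim N^{L}
\;\Longrightarrow\;
p(r) \propto r^{-\alpha}, \qquad \alpha = 1 - \frac{\ln(1 - p_s)}{\ln N}
```

Since ln(1 - p_s) is negative, this exponent is always above 1 and approaches 1 as the alphabet grows. With unequal letter probabilities the steps smear out and the expression no longer applies as written.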

RE: In what way is random typing model relevant to language?

rferrericancho replied to dmanin on 22 Mar 2010 at 18:41 GMT

(see also my reply to galtmann)

You are bringing the discussion back to power laws, but our focus is the real rank distributions (the power law might be too rough an approximation to the actual rank histograms; see refs in our article). I wonder if there would be such a multiplicity of models for power laws if people had really cared about making rigorous statistical tests...

But let us come back to your comment. I totally agree on the need of realistic models of Zipf’s law for word frequencies. These models, as you say, should be able to reproduce Zipf’s law but also other statistical properties of language. This is an elegant way of excluding unrealistic explanations such as random typing. This is the normal way of scientific progress in physics, biology, cognitive science, …
The point, however, is mainly sociological: many researchers feel strongly attached to the idea that random typing reproduces Zipf’s law, which, for them, makes research on Zipf’s law uninteresting or less valuable. This is a very reductionist point of view: one focuses on the fit of Zipf’s law and then forgets about any other statistical property of the real system. Although these rules of the scientific game make no sense to many people (including me and you), they have been defended by many authorities (B. Mandelbrot, G. Miller, N. Chomsky, S. Wolfram, …; simply follow the refs in our article), which makes changing this culture especially difficult.
The questions are:
1) Can we show the bad fit of monkey typing with the rules of the game imposed by these researchers?
2) Equivalently, can we reject the hypothesis of random typing by simply looking at word frequencies?
The answer to these questions, according to our article, is YES (for the parameters of the model that we have considered and with which many researchers claimed a good fit).

RE: In what way is random typing model relevant to language?

dmanin replied to dmanin on 25 Mar 2010 at 05:04 GMT

Yes, I generally agree with your points. It looks like there is a major difference between random-typing statistics and real-text statistics in the number of hapaxes (words occurring exactly once in a corpus). I know people have specifically studied the number of hapaxes for real-language corpora. I'm not very familiar with this research, but it raises many interesting questions, and the number of hapaxes appears to be an important characteristic.
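
For anyone who wants to compare hapax counts between a real corpus and a random-typing text, a minimal sketch in Python follows. The function names and the toy sentence are illustrative assumptions; any list of word tokens can be passed in.

```python
from collections import Counter

def hapax_count(tokens):
    """Number of distinct words that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1)

def hapax_fraction(tokens):
    """Share of distinct word types that are hapaxes (0.0 for empty input)."""
    n_types = len(set(tokens))
    return hapax_count(tokens) / n_types if n_types else 0.0

# Toy input only; a real comparison would use a tokenized corpus and a
# random-typing text of the same length.
sample = "the cat sat on the mat and the dog sat too".split()
print(hapax_count(sample), hapax_fraction(sample))
```

In a rank-frequency plot, the hapaxes form the final plateau at frequency 1, so the count above is simply the width of that plateau.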

RE: RE: In what way is random typing model relevant to language?

allegrip replied to dmanin on 09 Sep 2010 at 13:40 GMT

I find that the "hapax" point is a very important one, and
in fact there is a big difference between the two kinds of texts (at the
right-end limit of the f(r) curve).

