Maximum Entropy, Word-Frequency, Chinese Characters, and Multiple Meanings

  • Xiaoyong Yan, 
  • Petter Minnhagen
PLOS

Abstract

The word-frequency distribution of a text written by an author is well accounted for by a maximum entropy distribution, the RGF (random group formation)-prediction. The RGF-distribution is completely determined by the a priori values of the total number of words in the text (M), the number of distinct words (N) and the number of repetitions of the most common word (kmax). It is here shown that this maximum entropy prediction also describes a text written in Chinese characters. In particular it is shown that although the same Chinese text written in words and in Chinese characters has quite differently shaped distributions, both are nevertheless well predicted by their respective three a priori characteristic values. It is pointed out that this is analogous to the change in the shape of the distribution when translating a given text to another language. Another consequence of the RGF-prediction is that taking a part of a long text will change the input parameters (M, N, kmax) and consequently also the shape of the frequency distribution. This is explicitly confirmed for texts written in Chinese characters. Since the RGF-prediction has no system-specific information beyond the three a priori values (M, N, kmax), any specific language characteristic has to be sought in systematic deviations between the RGF-prediction and the measured frequencies. One such systematic deviation is identified and, through a statistical information theoretical argument and an extended RGF-model, it is proposed that this deviation is caused by multiple meanings of Chinese characters. The effect is stronger for Chinese characters than for Chinese words. The relation between Zipf’s law, the Simon-model for texts and the present results is discussed.

Introduction

The scientific interest in the information-content hidden in the frequency statistics of words and letters in a text goes back at least to Islamic scholars in the ninth century. The first practical application of these early endeavors seems to have been the use of frequency statistics of letters to decipher cryptic messages [1]. The more specific question of what linguistic information is hidden in the shape of the word-frequency distribution stems from the first part of the twentieth century, when it was discovered that the word-frequency distribution of a text typically has a broad “fat-tailed” shape, which often can be well approximated with a power law over a large range [2–5]. This led to the empirical concept of Zipf’s law, which states that the probability that a word occurs k times in a text, P(k), is proportional to $1/k^2$ [3–5]. The question is then what principle or property of a language causes this power law distribution of word-frequencies, and this is still an area of ongoing research [6–10]. In the middle of the twentieth century Simon [11] instead suggested that, since quite a few completely different systems also seemed to follow Zipf’s law in their corresponding frequency distributions, the explanation of the law must be more general and stochastic in nature and hence independent of any specific information of the language itself. He proposed a random stochastic growth model for a book written one word at a time from beginning to end. This became a very influential model and has served as a starting point for much later work [12–18]. However, it was recently pointed out that the Simon-model has a fundamental flaw: the rare words in the text are more often found in the later part of the text, whereas a real text is to a very good approximation translationally invariant: the first half of a real text has, provided it is written by the same author, the same word-frequency distribution as the second [23, 24]. So, although the Simon-model is very general and contains a stochastic element, it is still history dependent and, in this sense, it leads to a less random frequency distribution than a real text. An extreme random model was proposed in the middle of the twentieth century by Miller [25]: the resulting text can be described as being produced by a monkey randomly typing away on a typewriter. The monkey book is definitely translationally invariant, but its properties are quite unrealistic and different from a real text [26].

The RGF (random group formation)-model, which is the basis for the present analysis, can be seen as a next step along Simon’s suggestion of system-independence [27]. Instead of introducing randomness from a stochastic growth model, RGF introduces randomness directly from the maximum entropy principle [27]. An important point of the RGF-theory is that it is predictive: if the only knowledge of the text is M (total number of words), N (number of distinct words), and kmax (number of repetitions of the most common word), then RGF provides a complete prediction of the probability distribution P(k). This prediction includes the functional form, which embraces Gaussian-like, exponential-like and power-law-like shapes; the form is determined by the sole knowledge of (M, N, kmax). A crucial point is that, if the maximum entropy principle, through RGF, gives a very good description of the data, then this implies that the values (M, N, kmax) incorporate all information contained in the distribution P(k), which makes the prediction neutral and void of more specific characteristic features. More specific text information is, from this view-point, associated with systematic deviations from the RGF-prediction.

Texts sometimes deviate significantly from the empirical Zipf’s law and a substantial body of work has been devoted to explaining such deviations. These explanations usually involve text- and language-specific features. However, from the RGF point of view, such explanations appear rather redundant and arbitrary whenever the RGF-prediction agrees with the data. This point of view has been further elucidated in [28] for the case of species divided into taxa in biology.

In a recent paper by L. Lü et al. [18] it was pointed out that the character frequency-distribution for a text written in Chinese characters differs significantly from Zipf’s law, as had also been noticed earlier [19–22]. Chinese characters carry specific meanings. For example, ‘huí’ and ‘jia’ are two Chinese characters carrying the elementary meanings of “return” and “home”, respectively. In general a Chinese character can also carry multiple meanings, where the relevant meaning has to be deduced from the context. A Chinese word corresponds to one, two or more characters, e.g. the two characters ‘huí’ and ‘jia’ can be combined into the Chinese word ‘huí,jia’ denoting the concept of “returning home”. Thus both Chinese characters and Chinese words carry meanings which can be single or multiple. Roughly, a word in Chinese corresponds to about 1.5 characters on average, and typically more than 90% of the words in a novel are written with one or two characters: about 50% of the words are written with one character and 40% with two. The remaining ones are made up of more than two Chinese characters.

The Chinese character frequency distribution is illustrated in Fig 1. The straight line in the figure is the Zipf’s law expectation. From a Zipf’s law perspective one might then be tempted to conclude that the deviations between the data and Zipf’s law have something to do specifically with the Chinese language or the representation in terms of Chinese characters, or perhaps a bit of both. However, the dashed curve in the figure is the RGF-prediction. This prediction is very close to the data, which suggests that beyond the three characteristic numbers (M, N, kmax) [total number of Chinese characters, distinct characters, and the number of repetitions of the most common character] there is no specifically Chinese feature, which can be extracted from the data.

Fig 1. Frequency of Chinese characters for the novel A Q Zheng Zhuan by Xun Lu and comparison with the RGF-prediction and the Zipf’s law expectation.

(a) Compares the probability, P(k), for a character to appear k times in the text: crosses are raw data, filled dots are the log2-binned data, the straight line is the Zipf’s law expectation, and the dashed curve is the RGF-prediction. The RGF predicts the dashed curve directly from the three values (M, N, kmax) (see Table 1 for the input values and the corresponding predicted output values from RGF). (b) The same features in terms of the cumulative distribution $C(k) = \sum_{k' \ge k} P(k')$: filled triangles are the data, the straight line the Zipf’s law expectation and the dashed curve the RGF-prediction. RGF gives a very good ab initio description of the data which differs substantially from the Zipf’s law expectation. (Note that the RGF-prediction is based solely on the raw data and predicts both the binned data in (a) and the cumulant data in (b).)

https://doi.org/10.1371/journal.pone.0125592.g001

A crucial point for reaching our conclusions in the present paper is the distinction between a predictive model like RGF and conventional curve-fitting. This can be illustrated by Fig 1(b): if your aim is to fit the lowest k-data points in Fig 1(b) (e.g. k = 1 to 10) with an ad hoc two-parameter curve you can obviously do slightly better than the dashed curve in Fig 1(b). However, the dashed curve is a prediction solely based on the knowledge of the right-most point in Fig 1(b) (kmax = 747) and the average number of times a character is used (M/N = 11.5). RGF predicts where the data points in the interval k = 1 to 10 in Fig 1(b) should fall without any explicit a priori knowledge of their whereabouts and with very little knowledge of anything else. This is the crucial difference between a prediction from a model and a fitting procedure, and this difference carries over into the different conclusions which can be drawn from the two procedures. Another illustration is the fact that although the data in Fig 1(b) cannot be described by a Zipf’s-line with slope -1, such a line can be fitted to the data over a narrow range somewhere in the middle. Such an ad hoc fitting has no predictive value.

Specific information about the system may be reflected in deviations from the RGF-prediction [28]. One such possible deviation is discussed. It is also suggested that the cause of this deviation is multiple meanings of Chinese characters. A statistical information based argument for this conclusion is presented together with an extended RGF-model.

The method section starts with a brief recapitulation of the RGF-theory, as well as the Random Book Transformation, which allows for the analysis of sub-parts of the novels. Both these methods are used as starting points when analyzing the frequency distribution for two Chinese novels. The Chinese character-frequency distributions are compared to the corresponding word-frequency distributions for both novels, as well as for parts of the novels. The results from these comparisons lead to an information theory which makes it possible to approximately include the multiple meanings of Chinese characters. It is pointed out that the existence of words with multiple meanings isn’t a characteristic specific to Chinese, but a general feature of languages. The frequency distribution of the elementary entities of a written language (words or characters) is therefore influenced by the distribution of meanings over these entities, in a characteristic way. Conclusions are discussed in a last section.

Methods

Random Group Formation

The random group formation describes the general situation in which M objects are randomly grouped together into N groups [27]. The simplest case is when the objects are denumerable. Then if you know M and N the most likely distribution of group sizes, N(k) (number of groups with k objects), can be obtained by minimizing the information average $I[N(k)] = N^{-1}\sum_k N(k)\ln(kN(k))$ with respect to the functional form of N(k), subject to the two constraints $N^{-1}\sum_k N(k)k = \langle k \rangle = M/N$ and $\sum_k N(k) = N$. Note that the information to localize an object in one of the groups of size k is $\log_2(kN(k))$ in bits and $\ln(kN(k))$ in nats. Minimizing the average information I[N(k)] is equivalent to maximizing the entropy [27]. Thus RGF is a way to apply the maximum entropy principle to this particular class of problems. The result of the simplest case is the prediction $N(k) = A\exp(-bk)/k$ [27]. However, in more general cases there might be many additional constraints and in addition all the objects might not lend themselves to a simple denumerization. The point is that in many applications you do know that there must be additional constraints relative to the simplest case but you have no idea what they might be. The RGF-idea is then based on the observation that any deviation from the simplest case will be reflected in a change of the entropy $S[N(k)] = -\sum_k (N(k)/N)\ln(kN(k)/N)$. This can then be taken into account by incorporating the actual value of the entropy S as an additional constraint in the minimizing of I[N(k)]. The resulting more general prediction then becomes $N(k) = A\exp(-bk)/k^{\gamma}$ [27]. Thus RGF transforms the three values (M, N, S) into a complete prediction of the group-size distribution. This also means that the form of the distribution is determined by the values (M, N, S) and includes a Gaussian limit (when $\gamma = (M/N)b$ and $(M/N)^2/\gamma$ is small), exponential (when $\gamma = 0$), power-law (when $b = 0$) and anything in between.
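The simplest-case result quoted above can be checked with a short Lagrange-multiplier calculation. The following is a sketch, not the derivation given in [27]: the multipliers a and b are introduced here for illustration, N(k) is treated as a continuous variable for each k, and the constant prefactor 1/N of the information average is dropped since it does not affect the minimizer.

```latex
\begin{align}
F[N(k)] &= \sum_k N(k)\ln\bigl(kN(k)\bigr)
          + a\Bigl(\sum_k N(k) - N\Bigr)
          + b\Bigl(\sum_k kN(k) - M\Bigr), \\
\frac{\partial F}{\partial N(k)} &= \ln\bigl(kN(k)\bigr) + 1 + a + bk = 0
\quad\Longrightarrow\quad
N(k) = \frac{A\,e^{-bk}}{k}, \qquad A = e^{-(1+a)} .
\end{align}
```

The two multipliers are then fixed by the two constraints, which is why (M, N) alone determine the distribution in this simplest case.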

In comparison with earlier work, one may note that the functional form $P(k) = A\exp(-bk)/k^{\gamma}$ has been used before when parameterizing distributions, as described e.g. by Clauset et al. [29], and that such a functional form can be obtained from maximum entropy, as described e.g. by Visser [30]. The difference with our approach is the connection to minimal information, which opens up the predictive part of the RGF. As emphasized in the Introduction, it is this predictive aspect which is crucial in the present approach and which lends itself to the generalization of including multiple meanings of characters.

The RGF-distribution was shown in [27, 28, 31, 32] to apply to a variety of systems like words in texts, population in counties, family names, distribution of richness, distribution of species into taxa, node sizes in metabolic networks, etc. In the case of words, N is the number of different words, M is the total number of words, and N(k) is the number of different words which appear k times in the text. In English the largest group consists of the word “the” and its occurrence in a text written by an author is statistically very well defined: it is typically about 4% of the total number of words [24, 27]. As a consequence one may replace the three values (M, N, S) by the three values (M, N, kmax). Both choices completely determine the parameters (A, b, γ) in the RGF-prediction. However, the latter choice has the practical advantage that kmax, the number of repetitions of the most common word, is more directly accessible and statistically very well-defined. For example, if kmax is close to the average $\langle k \rangle = M/N$, such that $(k_{\max} - \langle k \rangle)/\langle k \rangle \ll 1$, then the RGF-prediction approaches a Gaussian, which comes as no surprise because a Gaussian is just the outcome of the maximum entropy principle for such a narrow distribution [27].
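As a concrete illustration of the quantities defined above, the following minimal sketch computes (M, N, kmax), the histogram N(k) and the cumulative distribution C(k) from a list of tokens. The function names and the toy sentence are made up for this example and do not come from the paper.

```python
from collections import Counter


def state_variables(tokens):
    """Return (M, N, kmax) and the group-size histogram N(k) for a token list."""
    counts = Counter(tokens)             # token -> number of occurrences k
    M = sum(counts.values())             # total number of tokens
    N = len(counts)                      # number of distinct tokens
    kmax = max(counts.values())          # occurrences of the most common token
    N_of_k = Counter(counts.values())    # k -> number of distinct tokens occurring k times
    return M, N, kmax, N_of_k


def cumulative(N_of_k, N):
    """C(k) = sum over k' >= k of P(k'), with P(k) = N(k)/N."""
    C, running = {}, 0
    for k in sorted(N_of_k, reverse=True):
        running += N_of_k[k]
        C[k] = running / N
    return C


# Toy input (hypothetical): "the" occurs 3 times, "and" twice, the rest once.
tokens = "the cat and the dog and the bird".split()
M, N, kmax, N_of_k = state_variables(tokens)
C = cumulative(N_of_k, N)
```

By construction the histogram satisfies the two constraints used in the text: the values N(k) sum to N, and the products k·N(k) sum to M.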

Random Book Transformation

In general, a system which falls into the RGF-class has a distribution with a shape which depends on M. Since M for a text is the total number of words, this means that the frequency distribution is text-length dependent. The reason is that if you start from a text characterized by (M, N, kmax), then the corresponding half of the text is characterized by $(M_{1/2}, N_{1/2}, k_{\max,1/2})$. Here $M_{1/2} = M/2$ by definition and $k_{\max,1/2} = k_{\max}/2$ because the most common word is to good approximation equally distributed within the text, but $N_{1/2}$ is non-trivial. In the present investigation we need a method to distinguish between changes in the frequency distribution due to multiple meanings and changes due to the size of the text. For this purpose we use the Random Book Transformation (RBT) discussed in [27], where it was shown that the text-length dependence of the average N, when taking a part of a given text, is to good approximation a neutral feature: it is to good approximation the same as when you randomly delete the corresponding amount of words from the text. The process of changing the length of a text by randomly deleting words is a simple statistical process which transforms the probability distribution $P_M(k) = N(k)/N$ for the full text into $P_{M/n}(k)$ for the nth part of the text by

$$P_{M/n} = B\,A\,P_M \tag{1}$$

where $P_{M/n}$ and $P_M$ are column matrices corresponding to $P_{M/n}(k)$ and $P_M(k)$. The transformation matrix $A_{kk'}$ is given by

$$A_{kk'} = \frac{(n-1)^{k'-k}}{n^{k'}}\,C_{k'}^{k} \tag{2}$$

where $C_{k'}^{k}$ is the binomial coefficient. B is given by the normalization condition

$$B^{-1} = \sum_{k \ge 1}\sum_{k' \ge k} A_{kk'}\,P_M(k') \tag{3}$$

As shown in the next section, this simple random book transformation also to good approximation applies to text written in Chinese characters.
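The random deletion process behind the RBT can be sketched in a few lines. The version below assumes that each occurrence of a word is kept independently with probability 1/n (binomial thinning), which is an approximation to deleting a fixed number of words; the function name and the toy distribution are hypothetical.

```python
from math import exp, lgamma, log, log1p


def random_book_transform(P_M, n):
    """Map P_M(k) of a full text to the distribution of an nth part.

    P_M: dict k -> probability that a distinct word occurs k times.
    Assumption: every occurrence survives independently with probability 1/n.
    Returns the renormalized distribution and the surviving fraction of
    distinct words.
    """
    if n <= 1:
        raise ValueError("n must be greater than 1")
    log_keep = log(1.0 / n)       # log of the survival probability
    log_drop = log1p(-1.0 / n)    # log of the deletion probability
    unnormalized = {}
    for k_full, p in P_M.items():
        for k in range(1, k_full + 1):
            # Binomial probability that k of the k_full occurrences survive,
            # evaluated in log space to stay finite for large k_full.
            log_binom = lgamma(k_full + 1) - lgamma(k + 1) - lgamma(k_full - k + 1)
            a = exp(log_binom + k * log_keep + (k_full - k) * log_drop)
            unnormalized[k] = unnormalized.get(k, 0.0) + a * p
    # Words with no surviving occurrence drop out of the text, so renormalize.
    surviving = sum(unnormalized.values())
    return {k: v / surviving for k, v in unnormalized.items()}, surviving


# Toy distribution (hypothetical): half the words occur once, half occur twice.
P_half, surviving = random_book_transform({1: 0.5, 2: 0.5}, 2)
```

Under this assumption, the expected number of distinct words in the part is N times the returned surviving fraction, which is how a neutral estimate of N for the shortened text can be obtained.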

Results

RGF and size transformation for Chinese texts

Fig 2 shows that the data for the novel A Q Zheng Zhuan is well described by the neutral-model prediction provided by RGF. This implies that the frequency distribution of both words and characters is to a large extent directly determined by the “state”-variable triple (M, N, kmax). At first sight this might appear surprising because the development of a spoken language and its written counterpart is a long and intricate process. However, in statistical physics this type of emergent simple property of a complex system is well established. A well-known example is the ideal gas law P = NT/V (in units where Boltzmann’s constant equals one), which predicts the pressure, P, that an ideal gas inside a closed container exerts on the walls from the three “state”-variables (N, V, T), where N is the number of gas particles, V is the volume of the container and T is the absolute temperature of the gas. Yet each gas particle follows its own deterministic trajectory including collisions with other particles and the walls. Since the number of particles is enormous it is in practice impossible to predict the outcome by deterministically following what happens in time to all the particles. The emergence of the simple ideal gas law stems from the fact that, with an enormous number of possibilities, the actual one is very likely to be close to the most likely outcome, assuming that all possibilities are equally likely. The basis for the maximum entropy principle in the present context is precisely the assumption that all distinct possibilities are equally likely.

Fig 2. Comparison between Chinese texts in characters and words.

(a) Comparison between characters and words for the novel A Q Zheng Zhuan by Xun Lu together with the respective RGF-predictions. (b) The same comparison for the novel Ping Fan De Shi Jie by Yao Lu. Filled dots correspond to the binned data for Chinese characters and filled triangles the data for words. Full and dashed curves correspond to the respective RGF-predictions and dotted straight lines are the Zipf’s law expectations for the word-frequency distribution. The respective “state”-variables (M, N, kmax) and the corresponding RGF-predictions are given in Table 1. Note that the translation between words and characters is a deterministic process. Yet the “state”-variables (M, N, kmax) suffice to predict the change in frequency distribution caused by the translation between words and characters.

https://doi.org/10.1371/journal.pone.0125592.g002

A crucial point is that, provided RGF does give a good description of the data, this means that it is the deviations between the data and the RGF-prediction which may carry interesting system-specific information. From this perspective Zipf’s law is just an approximation of the RGF i.e. the straight line in Fig 1 should be regarded as an approximation of the dashed curve. It follows that the deviation between Zipf’s law and the data does not reflect any characteristic property of the underlying system [27].

Following this line of argument, it is essential to establish just how well the RGF does describe the data. Fig 2(a) gives such a quality test: if all that matters is the “state”-variables (M, N, kmax), then one could equally well translate the same novel from Chinese characters to words. As seen in Fig 2(a), the word-frequency distribution for the novel A Q Zheng Zhuan is completely different from the character-frequency and also the “state”-variables are totally different (see Table 1 for “state”-variables and RGF prediction values). Yet according to RGF the change in shape only depends on the value of the “state”-variables and not if they relate to characters or words. As seen from Fig 2(a), RGF does indeed give a very good description in both cases.

Table 1. Data and RGF-predictions.

Two Chinese novels are used as empirical data, i.e. A Q Zheng Zhuan (AQ for short) written by Xun Lu and Ping Fan De Shi Jie (PF for short) by Yao Lu. For each book we first remove punctuation marks and numbers from the text, then count the Chinese characters one by one to obtain the character-frequency results. In the Chinese language the words are not separated by spaces, so we use a word segmenter, Jieba (https://github.com/fxsjy/jieba), to extract words from Chinese texts. The RGF-prediction is given in the form $P(k) = A' \exp(-bk)/k^{\gamma}$. This means that the RGF-theory transforms the data-triple (M, N, kmax) into the prediction triple (γ, b, A′).

https://doi.org/10.1371/journal.pone.0125592.t001
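The character-counting step described in the Table 1 caption might look as follows. This is a sketch under a stated assumption: a “Chinese character” is approximated here by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which the paper does not specify, and the function name and sample string are made up for illustration.

```python
import re
from collections import Counter

# Assumption: only the basic CJK Unified Ideographs block counts as a
# Chinese character; rarer characters outside this block are ignored.
CJK = re.compile(r"[\u4e00-\u9fff]")


def character_state(text):
    """Return (M, N, kmax) for the Chinese characters found in text."""
    chars = CJK.findall(text)    # skips punctuation, digits, Latin letters, spaces
    counts = Counter(chars)
    return len(chars), len(counts), max(counts.values())


# Hypothetical sample: two characters, each used twice, plus punctuation and digits.
sample = "回家，回家。2015"
state = character_state(sample)
```

For the word-level counts, the token list would instead come from a segmenter such as Jieba (for instance its `lcut` function), with punctuation tokens filtered out afterwards; that step is left out here to keep the sketch free of third-party dependencies.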

The translation of A Q Zheng Zhuan from characters to words is in itself an example of a deterministic process. Yet, as illustrated in Fig 2(a), it is a complicated process in the sense that the resulting word-frequency distribution, through RGF, can be obtained to very good approximation without having any knowledge about the actual deterministic translation-process! This can again be viewed as a case when complexity results in simplicity.

Fig 2(b) gives a second example for a longer novel, Ping Fan De Shi Jie by Lu Yao (about 40 times as many characters as A Q Zheng Zhuan, see Table 1). In this case the word-frequency is very well accounted for by RGF. Note that in this particular case the Zipf’s law prediction agrees very well with both the RGF-prediction and the data (Zipf’s law is a straight line with slope -2 in Fig 2(a) and 2(b)). RGF also provides a reasonable approximation of the character-frequency, whereas Zipf’s law fails completely for this case. This is consistent with the interpretation that Zipf’s law is just an approximation of RGF; an approximation which sometimes works and sometimes does not. However, as will be argued below, the discernible deviation between RGF and the data may reflect some specific linguistic feature.

As shown above, the shape of the frequency curve for a given text changes when translating between characters and words and this change is well accounted for by the RGF and the corresponding change in “state”-variables. This is quite similar to the change of shape when more generally translating a novel to different languages. This analogy is demonstrated on the basis of the Russian short story The Man in a Case by A. Chekhov and its translations into English words and Chinese characters. As shown in Fig 3(a), the respective RGF-predictions match the corresponding frequency distributions very well. The same is true for the English novel The Old Man and the Sea by E. Hemingway (compare Fig 3(b)). These findings confirm that the information contained in the triple (M, N, kmax) is sufficient to describe the frequency distribution of the fundamental entities of a written language, independent of whether those are words or characters in Chinese and irrespective of the underlying language.

Fig 3. Similarity of translation between words and characters versus words of different languages.

(a) The Russian short story The Man in a Case by A. Chekhov, and its translations into English words and Chinese characters (triangles, squares, and filled dots, respectively). The RGF-predictions are given by the curves (dashed dotted, dashed, and full, respectively). The RGF-prediction completely characterizes a frequency distribution in terms of the total number of words/characters (M), the number of specific words/characters (N), and how many of the total number of words/characters are given by the most common word/character (kmax/M). Each such triple (M, N, kmax) gives a unique prediction-curve [(M, N, kmax) = (4061, 1721, 231), (5375, 1317, 256), and (8212, 1150, 312), respectively]. The agreement shows that words and characters are entirely analogous with respect to frequency distributions. (b) illustrates the same thing starting from the English novel The Old Man and the Sea and translating into Russian words and Chinese characters. The triples are this time (M, N, kmax) = (22414, 5378, 988), (23894, 2388, 2091), and (34220, 1685, 1289), in the order Russian, English and Chinese characters (data points and RGF-curves as in (a)).

https://doi.org/10.1371/journal.pone.0125592.g003

In order to gain further insight into what causes the difference in word-frequency and character-frequency of a text written in Chinese one can compare text-parts of different lengths from a given novel. As described in [24], text-parts of different length of a novel have different frequency distributions. For example, if you start from A Q Zheng Zhuan and take a 10th-part, then the shape changes, as shown in Fig 4(a). According to RGF this new shape should now to good approximation be directly predicted from the new “state” (M/10, N′, k′max) (see Table 1 for the precise values). As seen in Fig 4(b) this is to good approximation the case. As explained in Methods, and as can be verified from Table 1, k′max ≈ kmax/10. One may then ask if the transformation from N to N′ involves some system-specific feature. In order to check this one can compare the process of taking an nth-part of a text with the process of randomly deleting characters until only an nth-part of them remains. This latter process is a trivial statistical transformation described in Methods under the name RBT (Random-Book-Transformation). Fig 4(b) also shows the predicted frequency distribution obtained from the “state”-variable triple (M/10, N′, k′max), with N′ derived from RBT, used as input in RGF. (The actual RBT-derived value for N′ is given in Table 1.) The close agreements signal that the change of shape due to a reduction in text length is, to a large extent, a general and totally system-independent feature. Fig 4(c) shows the change of the frequency-distribution when taking parts of the longer novel Ping Fan De Shi Jie written in characters, and Fig 4(d) compares the parts with the RGF-prediction, as well as with the combined RGF+RBT-prediction. The conclusion is that the change of shape carries very little system-specific information.
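The comparison between taking an nth-part and randomly deleting tokens can be tried out directly on a token list. The sketch below is an illustration only: the function names are hypothetical, contiguous parts are allowed to wrap around the end of the text (an assumption made here so that every starting point yields a full-length part), and the toy token list stands in for a real novel.

```python
import random
from collections import Counter


def state(tokens):
    """Return (M, N, kmax) for a list of tokens."""
    counts = Counter(tokens)
    return len(tokens), len(counts), max(counts.values())


def contiguous_part(tokens, n, start):
    """State of a contiguous nth-part beginning at index start (wrapping around)."""
    length = len(tokens) // n
    doubled = tokens + tokens    # lets a window run past the end of the text
    return state(doubled[start:start + length])


def random_part(tokens, n, rng):
    """State after keeping a uniformly random subset of len(tokens)//n tokens."""
    return state(rng.sample(tokens, len(tokens) // n))


def average_N(tokens, n, trials, rng):
    """Average number of distinct tokens over contiguous and random nth-parts."""
    starts = [rng.randrange(len(tokens)) for _ in range(trials)]
    n_contig = sum(contiguous_part(tokens, n, s)[1] for s in starts) / trials
    n_random = sum(random_part(tokens, n, rng)[1] for _ in range(trials)) / trials
    return n_contig, n_random


# Toy token list (hypothetical), 100 tokens drawn from four symbols.
tokens = list("abcabcabcd") * 10
rng = random.Random(0)
n_contig, n_random = average_N(tokens, 10, 50, rng)
```

If the two averages are close for a real text, the change from N to N′ is dominated by the neutral size effect rather than by any ordering of the tokens.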

Fig 4. Size dependence of novels written in Chinese characters.

The same two novels as in Fig 2 are divided into parts. The frequency distribution of a full novel is compared to that of a part. (a) P(k) for A Q Zheng Zhuan (filled dots) is compared to the distribution for a typical 10th-part (filled triangles). Here the word typical means an average distribution obtained by taking many different 10th-parts with different starting points. These two functions have quite different shapes. However, the shapes of both are equally well predicted by RGF (curves with dashed and full lines). (b) The distribution of the 10th-part, which can to very good approximation be trivially obtained from the full book by just randomly removing 90% of the words from the full book. This corresponds to the dashed curve, which is almost identical to the RGF-prediction, and both correspond very well to the data. (c-d) The same features for the novel Ping Fan De Shi Jie. Note that the 10th-part agrees better with RGF than the full novel.

https://doi.org/10.1371/journal.pone.0125592.g004

By comparing Fig 2(a) and 2(b), one notices that whereas RGF gives a very good account of the shorter novel A Q Zheng Zhuan, there appears to be some deviation for the longer novel Ping Fan De Shi Jie. In Fig 5(a) we compare a 40th part of Ping Fan De Shi Jie with the full length of A Q Zheng Zhuan. As seen from Fig 5(a) the two texts have very closely the same character-frequency distribution. From the point of view of RGF, it would mean that the “state”-variables (M, N, kmax) are closely the same. This is indeed the case, as seen in Table 1 and from the direct comparison with RGF in Fig 5(b). Ping Fan De Shi Jie and its partitioning suggest a possible specific additional feature for written texts: a deviation from RGF for longer texts, which becomes negligible for shorter. In the following section we suggest what type of feature this might be.

Fig 5. Comparison between two different texts of approximately equal length written by different authors in Chinese characters.

(a) A Q Zheng Zhuan (filled dots) is compared to the 40th part of Ping Fan De Shi Jie (filled triangles). Note that the two data sets almost completely overlap. This means the difference in the frequency distribution between A Q Zheng Zhuan and Ping Fan De Shi Jie is just caused by the difference in length of the two novels. Furthermore (b) illustrates that this length difference is rather trivial because it is just the frequency distribution you get when randomly removing 97.5% of the words from Ping Fan De Shi Jie (dashed curve).

https://doi.org/10.1371/journal.pone.0125592.g005

Systematic deviations, information loss and multiple meanings of words

As suggested in the previous section, the clearly discernible deviation in Fig 2(b) between the character-frequency distribution for the data and the RGF-prediction in case of Ping Fan De Shi Jie could be a systematic difference. The cause of this deviation should then be such that it becomes almost undetectable for a 40th-part of the same text, as seen in Fig 5(b).

We here propose that this deviation is caused by the specific linguistic feature that a written word can have more than one meaning. Let us start from an English alphabetic text. A word is then defined as a collection of letters partitioned by blanks (or other partitioning signs). Such a written word could then within the text have more than one meaning. Multiple meanings here means that a word in a dictionary is listed as having several meanings, i.e. a written word may consist of a group of words with different meanings. We will call the members of these under-groups primary words. So in order to pick a distinct primary word, you first have to pick a written word and then one of its meanings within the text. It follows that the longer the text is, the larger the chance that several meanings of a written word appear in the text. Our explanation is based on an earlier proposed specific linguistic feature: a written word which occurs more frequently in the text has a tendency to have more meanings [33–36]. This means that a written word which occurs k times in the text on average consists of a larger number of primary words than a written word which occurs fewer times. Thus if the text consists of N(k) written words which occur k times in the text, then the average number of primary words is $N_P(k) = N(k)f(k)$, where f(k) describes how the number of multiple meanings depends on the frequency of the written word. In the case of texts written with Chinese characters it is, as explained in the Introduction, the characters that are the elementary entities carrying individual meanings and that hence play the role of words.

It is possible to incorporate the concept of multiple meanings into an RGF-type formulation. The point to note is that the distributed entities are really the primary words/Chinese characters, and the information needed to localize a primary word/Chinese character belonging to a written word/Chinese character which occurs k times in the text is log2(kN_P(k)) = log2(kN(k)f(k)). We want to determine the distribution N(k), taking into account the information loss, −log2(f(k)), caused by the (average) number of multiple meanings of a word which occurs k times in the text. It follows that the information which then needs to be minimized in order to obtain the maximum entropy solution is the average of log2(kN(k)) − log2(f(k)), or equivalently the average of log2(kN(k)/f(k)) (Eq (4)). Following the same steps as in Methods and [27], this predicts the functional form $P(k) = A \exp(-bk)/[k/f(k)]^{\gamma}$ (Eq (5)).
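The variational step can be sketched as follows. This is a reconstruction under stated assumptions rather than the derivation given in Methods: it assumes the minimization is carried out for the normalized distribution P(k) = N(k)/N (which differs from N(k) only by a constant inside the logarithm), with normalization, the mean ⟨k⟩ = M/N and the entropy held fixed, and it uses natural logarithms. The multipliers λ1, λ2, λ3 are notation introduced here for illustration.

```latex
\begin{align}
  \mathcal{I}[P] &= \sum_k P(k)\,\ln\frac{k\,P(k)}{f(k)}, \\
  0 &= \frac{\partial}{\partial P(k)}\Big[\mathcal{I}[P]
       + \lambda_1 \sum_{k'} P(k')
       + \lambda_2 \sum_{k'} k' P(k')
       - \lambda_3 \sum_{k'} P(k') \ln P(k')\Big] \\
    &= \ln\frac{k}{f(k)} + (1-\lambda_3)\,\bigl(\ln P(k) + 1\bigr)
       + \lambda_1 + \lambda_2 k, \\
  \Rightarrow\quad P(k) &= A\,\frac{\exp(-bk)}{[k/f(k)]^{\gamma}},
  \qquad \gamma = \frac{1}{1-\lambda_3},
  \quad b = \frac{\lambda_2}{1-\lambda_3}.
\end{align}
```

Setting f(k) = 1 for all k recovers the usual RGF form A exp(−bk)/k^γ, so in this sketch the multiple-meaning constraint enters only through the replacement of k by k/f(k).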

Basically the specific linguistic character is that f(k) is an increasing function and that f(k = 1) = 1, because a word which only occurs a single time in the text can only have one meaning within the text. The simplest approximation is then just a linear increase. Fig 6 gives some support for this supposition: the average frequency, k(fD), of Chinese characters in Ping Fan De Shi Jie which have fD dictionary meanings is plotted against fD. The plot shows that k(fD) to a fair approximation has a linear increase of the form k = fD/c′ − 1/c′ + 1, or equivalently fD = c′k + 1 − c′. Fig 6(a) corresponds to the full text and Fig 6(b) to a 40th-part. Note that the slope c′ changes with text size. This is easily understood: shortening the text is, as explained in the previous section, basically the same as randomly removing characters. This means that a character with a smaller k has a larger chance of being completely removed from the text than one with a higher k. But since the characters with higher frequency on average have a larger number of multiple meanings, this means that the characters which end up with low k in the shortened text will on average have more multiple meanings. Also note that the dictionary meanings and the meanings within a text are not the same; the former number is larger than the latter, but the longer the text, the more equal they become. However, it is reasonable to assume that the number of meanings within a text also follows a similar linear relationship. Next we make the further simplification of replacing the average k with just k, i.e. we ignore the spread in frequency of characters having a specific number of meanings within the text. However, this approximation still catches the increase in meanings with frequency. We will take this linear increase as our ansatz and include a cut-off kc for large k, since the most frequent Chinese characters have few multiple meanings. This is a general linguistic feature: the most frequent English word, “the”, has only one meaning.
Thus we use the approximate ansatz f(k) ∝ k/(1+k/kc). This approximation reduces the RGF functional form to $P(k) = A' \exp(-bk)/[k^{\gamma}(1 + 1/(dk))^{\gamma}]$ (6), where d = 1/kc. In addition to the “state”-variable triple (N, M, kmax) we should specify the a priori knowledge of f(k). The knowledge of this linguistic constraint is limited and enters through its approximate form f(k) ∝ k/(1+kd). This enables us to determine the value d = 1/kc from the RGF-method by including the value of the entropy S as an additional constraint. Thus we use RGF in the form of Eq (6) together with the “state”-variable quadruple (N, M, kmax, S). This follows since the four constants (A′, b, γ, d) in Eq (6), through the RGF-formulation, completely determine the quadruple (N, M, kmax, S) and vice versa. In Fig 7 this form of extended RGF is tested on data from three novels written in Chinese characters. The corresponding “state”-quadruples (N, M, kmax, S) are given in Table 2 together with the corresponding predicted output-quadruple (γ, b, d, A′). The agreement with the data is in all cases excellent (dashed curves in Fig 7). The dotted curves are the usual RGF-prediction based on the “state”-triples (M, N, kmax). Note that for a 100th-part of Ping Fan De Shi Jie, the usual RGF and the extended RGF agree equally well with the data. This means that any effect of multiple meanings is in this case already taken care of by the usual RGF. However, as the text size is increased to the 40th-part, the 10th-part and the full novel, the extended RGF continues to agree equally well, whereas the usual RGF starts to deviate. It is this systematic difference which suggests that there is a specific effect beyond the neutral-model prediction given by the usual RGF.
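To see how the cut-off parameter d reshapes the distribution, the following minimal sketch evaluates the extended form P(k) ∝ exp(−bk)/[k(1 + 1/(dk))]^γ, which is what the ansatz f(k) ∝ k/(1+kd) gives when k is replaced by k/f(k) in the RGF form exp(−bk)/k^γ. The parameter values are arbitrary illustrations, not values from Table 2; truncating the sum at kmax and measuring the entropy in nats are simplifying assumptions and not necessarily the conventions of the RGF method. The sketch only evaluates the forward direction (constants to distribution); it does not solve for the constants from (M, N, kmax, S).

```python
import math

def extended_rgf(b, gamma, d, kmax):
    """Normalized P(k) proportional to exp(-b k) / (k (1 + 1/(d k)))**gamma, k = 1..kmax."""
    weights = [math.exp(-b * k) / (k * (1.0 + 1.0 / (d * k))) ** gamma
               for k in range(1, kmax + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

def plain_rgf(b, gamma, kmax):
    """Normalized usual RGF form, P(k) proportional to exp(-b k) / k**gamma."""
    weights = [math.exp(-b * k) / k ** gamma for k in range(1, kmax + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

def mean_and_entropy(p):
    """Mean <k> and Shannon entropy (in nats) of a distribution over k = 1, 2, ..."""
    mean_k = sum(k * pk for k, pk in enumerate(p, start=1))
    entropy = -sum(pk * math.log(pk) for pk in p if pk > 0.0)
    return mean_k, entropy

# Arbitrary illustrative parameters (assumptions, not fitted values).
b, gamma, kmax = 1e-3, 1.5, 1000
p_plain = plain_rgf(b, gamma, kmax)
p_ext = extended_rgf(b, gamma, d=0.1, kmax=kmax)
p_limit = extended_rgf(b, gamma, d=1e9, kmax=kmax)  # very large d: f(k) is nearly constant

print(mean_and_entropy(p_plain))
print(mean_and_entropy(p_ext))
```

In this sketch a very large d reproduces the usual RGF curve, while a small d flattens the head of the distribution (small k), which is the region where the text locates the systematic deviation.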

Fig 6. The average frequency k for the occurrence of a Chinese character in a given text is plotted against its number of multiple dictionary meanings fD.

The Chinese character dictionary Xinhua Dictionary, 5th Edition is used for the number of dictionary meanings of Chinese characters. Figure (a) shows the occurrence in the novel Ping Fan De Shi Jie and figure (b) the occurrence for the average 40th-part of the same novel. In both cases the trend of the functional dependence can be represented by a straight line. For the linear increase fD ∝ c′k, the slope is c′ ≈ 0.0083 for the full novel and c′ ≈ 0.34 for the 40th-part. The reason that c′ increases with decreasing size is explained in the text.

https://doi.org/10.1371/journal.pone.0125592.g006
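A slope such as c′ can be estimated by a least-squares fit that is forced through the point (k, fD) = (1, 1), in line with the linear form fD = c′k + 1 − c′. The sketch below is one possible way to do this and is not necessarily the fitting procedure behind Fig 6; the function name is hypothetical and the data points are synthetic, generated from an assumed slope only to check that the fit recovers it.

```python
def fit_slope_through_unit_point(k_values, fd_values):
    """Least-squares slope c' of fD - 1 = c' (k - 1), a line through (k, fD) = (1, 1)."""
    numerator = sum((k - 1.0) * (fd - 1.0) for k, fd in zip(k_values, fd_values))
    denominator = sum((k - 1.0) ** 2 for k in k_values)
    return numerator / denominator

# Synthetic points lying exactly on the line with an assumed slope (illustration only).
assumed_slope = 0.0083
k_values = [1.0, 50.0, 120.0, 300.0, 800.0]
fd_values = [assumed_slope * k + 1.0 - assumed_slope for k in k_values]

c_prime = fit_slope_through_unit_point(k_values, fd_values)
print(c_prime)
```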

Fig 7. Test of RGF including multiple meaning constraints.

The RGF is in each case predicted from the quadruple of state variables (M, N, kmax, S). The data is from three novels in Chinese (see Table 2). The RGF predictions with multiple meaning constraint are given by the dashed curves. The RGF without the multiple meaning constraint is predicted from the state variable triple (M, N, kmax) and corresponds to the dotted curves. Only when the multiple meaning constraint significantly improves the RGF-prediction can some specific interpretation be associated with it. As seen from the figure the significance increases with increasing length of the novel.

https://doi.org/10.1371/journal.pone.0125592.g007

Table 2. Data and RGF-predictions including multiple meanings.

Three Chinese novels are used as the empirical data, i.e. A Q Zheng Zhuan written by Xun Lu, Ping Fan De Shi Jie by Yao Lu and Harry Potter (HP for short) volumes 1 to 7 (written by J. K. Rowling and translated into Chinese by Ainong Ma et al.). The statistics for the characters are obtained as described in Table 1. In this case the input quadruple (M, N, kmax, S) is transformed by the RGF-theory into the output prediction (γ, b, d, A′) corresponding to the RGF-form $A' \exp(-bk)/[k^{\gamma}(1 + \frac{1}{dk})^{\gamma}]$.

https://doi.org/10.1371/journal.pone.0125592.t002

Is the multiple meaning explanation sensible? To investigate this we estimate the average number of multiple meanings < f(k) > using the ansatz form for f, including the condition that a character which occurs a single time can only have a single meaning in the text, f(k = 1) = 1, i.e. f(k) = (1+d)k/(1+kd), together with the obtained values of d (see Table 2) (Eq (7)). These estimated values for < f > are given in Table 2. Fig 8(a) shows that < f > increases with the text length. This is consistent with the fact that the number of uses of a character increases, and hence so does the chance that more of its multiple meanings appear in the text. For the same reason < f > increases with the average number of uses of a character < k >, as shown in Fig 8(b). In addition, the chance of a larger number of dictionary meanings is larger for a more frequent character (see Fig 6). Thus it appears that the connection between < f > and multiple meanings makes sense.
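As a rough illustration of such an estimate, the sketch below averages the ansatz f(k) = (1+d)k/(1+kd) over distinct characters, weighting each frequency k by N(k). This weighting is an assumption made here for illustration; the exact averaging used in the paper's Eq (7) may differ. The counts N(k) and the value of d are made up and do not come from Table 2.

```python
def meaning_ansatz(k, d):
    """Ansatz f(k) = (1 + d) k / (1 + k d), normalized so that f(1) = 1."""
    return (1.0 + d) * k / (1.0 + k * d)

def average_meanings(n_of_k, d):
    """Average of f(k) over distinct characters, each k weighted by N(k) (assumed weighting)."""
    total = sum(n_of_k.values())
    return sum(n * meaning_ansatz(k, d) for k, n in n_of_k.items()) / total

# Made-up frequency-of-frequency counts N(k) and cut-off parameter d.
n_of_k = {1: 50, 2: 20, 5: 10, 50: 2}
d = 0.05

avg_f = average_meanings(n_of_k, d)
saturation = (1.0 + d) / d  # value that f(k) approaches for k much larger than 1/d
print(avg_f, saturation)
```

The saturation value (1+d)/d shows the role of the cut-off: with this ansatz, very frequent characters cannot be assigned arbitrarily many meanings.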

Fig 8. Consistency test of the multiple meaning model.

According to the multiple meaning model the parameter d (see Table 2) should give a sensible approximate estimate of the average number of multiple meanings per character within a text < f >. The figure shows that < f > increases with the size of the text M. This is consistent with the fact that the number of uses of a character increases, and hence so does the chance that more of its multiple meanings appear in the text. For the same reason < f > increases with the average number of uses of a character < k >. In addition, the chance of a larger number of dictionary meanings is larger for a more frequent character (see Fig 6). The inset shows how < k > increases with M.

https://doi.org/10.1371/journal.pone.0125592.g008

Multiple meaning is of course not a unique feature of Chinese; it is a common feature of many languages. Therefore, it is unsurprising that we can also observe systematic deviations from the RGF-prediction in other languages, such as English [24] and Russian [36]. However, the average number of meanings of an English word is much smaller than that of a Chinese character: in modern Chinese there are only about 3,500 commonly used characters [37], and even for a novel of more than one million characters, the number of distinct characters involved is less than 4,000 (see Table 2); but for the same novel written in English, the number of distinct words is more than 20,000 (see Fig 9(a)). Therefore, the systematic deviation caused by multiple meanings can be neglected for a short English text, as shown in Fig 9(b). Even for a rather long text, the deviation is still very slight and, as shown in Fig 9(a), the usual RGF gives a good prediction (RGF with the multiple meaning constraint incorporates more a priori information and may consequently be expected to give a better prediction, but the difference is very small). Taken together, Chinese uses a small number of characters to describe the primary words, resulting in a high degree of multiple meanings, which in turn causes the head of the character-frequency distribution (or the tail of the frequency-rank distribution) to deviate somewhat from the RGF-prediction. But such deviations are not special to Chinese: as we have demonstrated in Fig 9, they are just more pronounced in Chinese than in some other languages.

Fig 9. Test of RGF including multiple meaning constraints for English books.

The RGF is in each case predicted from the quadruple of state variables (M, N, kmax, S). The data is from two English novels: Tess of the d’Urbervilles written by T. Hardy and Harry Potter volume 1 to 7 by J. K. Rowling. The RGF predictions with multiple meaning constraint are given by the dashed curves. The RGF without the multiple meaning constraint is predicted from the state variable triple (M, N, kmax) and corresponds to the dotted curves.

https://doi.org/10.1371/journal.pone.0125592.g009

Discussion

The view taken in the present paper is somewhat different and heretical compared to a large body of earlier work [3–18]. First of all, we argue that Zipf’s law is not a good starting point when trying to extract information from word/character frequency distributions. Our starting point is instead a neutral model containing a minimum of a priori information about the system. From this minimal information, the frequency distribution is predicted through a maximum entropy principle. The minimal information consists of the “state”-variable triple (M, N, kmax) corresponding to the (total number of-, number of different-, maximum occurrence of most frequent-) word/character, respectively. The shape of the distribution is entirely determined by the triple (M, N, kmax). Within this RGF-approach, a Zipf’s-law distribution (or any other power law with an exponent different from the Zipf’s law exponent) only results for seemingly accidental triples of (M, N, kmax). The first question is then whether these Zipf’s law triples are really accidental or whether they carry some additional information about the system. According to our findings there is nothing special about these power-law cases. First of all, in the examples discussed here, Zipf’s law is in most cases not a good approximation of the data, whereas the RGF-prediction in general gives a very good account of all the data, including the rare cases when the distribution is close to a Zipf’s law. Second, translating a novel between languages, or between words and Chinese characters, or taking parts of the novel, all change the triple (M, N, kmax). This means that the shape of the distribution changes, such that if it happened to be close to a Zipf’s law before the change, it deviates after. Furthermore, in the case of taking parts of a novel, the change in the triple (M, N, kmax) is to a large extent trivial, which means that there is no subtle constraint for preferring special values of (M, N, kmax). 
What all this leads up to is that the distributions you find in word/character frequencies are very general and apply to any system which can be similarly described in terms of the triple (M, N, kmax), as discussed in [27, 28]. From this point of view the word/character frequency carries little specific information about languages.
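The statement that taking a part of a text changes the triple (M, N, kmax) in a largely trivial way can be illustrated with a minimal sketch. It treats shortening as random removal of tokens, as described earlier in the text, and uses a synthetic "text" drawn from an assumed 1/rank weighting over a hypothetical vocabulary of 2,000 symbols; none of the numbers correspond to the novels analyzed here.

```python
import random
from collections import Counter

def state_triple(tokens):
    """Return the triple (M, N, kmax) for a list of tokens."""
    counts = Counter(tokens)
    return len(tokens), len(counts), max(counts.values())

def random_nth_part(tokens, n, seed=0):
    """Keep a random 1/n of the tokens, i.e. shorten the text by random removal."""
    rng = random.Random(seed)
    return rng.sample(tokens, len(tokens) // n)

# Synthetic text: 50,000 tokens from 2,000 symbols with weights 1/rank (assumption).
rng = random.Random(1)
vocabulary = list(range(2000))
weights = [1.0 / (rank + 1) for rank in vocabulary]
full_text = rng.choices(vocabulary, weights=weights, k=50000)

M, N, kmax = state_triple(full_text)
M10, N10, kmax10 = state_triple(random_nth_part(full_text, 10))
print((M, N, kmax), (M10, N10, kmax10))
print(M / N, M10 / N10)  # average number of uses per distinct symbol
```

The number of distinct symbols shrinks much more slowly than M, so the average number of uses per symbol drops in the shortened text; within the RGF picture it is this change in the triple that changes the shape of the distribution.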

In a wider context, this generality and lack of system-dependence was also expressed in [28] as: “…we can safely exclude the possibility that the processes that led to the distribution of avian species over families also wrote the United States’ declaration of independence, yet both are described by RGF”, and earlier and more drastically by Herbert Simon in [11]: “No one supposes that there is any connection between horse-kicks suffered by soldiers in the German army and blood cells on a microscopic slide other than that the same urn scheme provides a satisfactory abstract model for both phenomena.” The urn scheme used in the present paper is the maximum entropy principle in the form of RGF.

Herbert Simon’s own urn model is called the Simon model [11]. The problem with the Simon model in the context of written text is that it presumes a specific relation between the parameters of the “state”-triple (M, N, kmax): for a text with a given M and N, the Simon model predicts a kmax. This value of kmax is quite different from the ones describing the real data analyzed here. For example, in the case of the “state”-triple for A Q Zheng Zhuan in Chinese characters, the values of M and N are 17,915 and 1,552, respectively (see Table 1), and the Simon model predicts kmax = 9,256 and P(k) in the form of a power law, P(k) ∝ 1/k^2.1. Thus the most common character accounts for about 50% of the total text, which does not correspond to any realistic language. Fig 10(a) compares this Simon model result with the real data, as well as with the corresponding RGF-predictions. One could perhaps imagine modifying the Simon model in each case so as to produce the correct “state”-triple. However, even a modified Simon model will still have a serious problem, as discussed in [24]: if you take a novel written by the Simon stochastic model and divide it into two equally sized parts, then the first part has a quite different triple (M/2, N1/2, kmax/2) than the second. Yet both parts of a real book are described by the same “state”-variable triple. This means that the change in shape of the distribution by partitioning cannot be correctly described within any stochastic Simon-type model.
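The two objections can be explored with a small simulation. The sketch below uses a common textbook variant of the Simon process (with probability α the next token is a new symbol, otherwise it is a copy of a uniformly chosen earlier token), with α = N/M taken from the values quoted above. It is not necessarily the exact variant or initial condition behind Fig 10, so the simulated kmax is not expected to reproduce the quoted 9,256; the function names and the seed are arbitrary.

```python
import random
from collections import Counter

def simon_text(M, alpha, seed=0):
    """Generate M tokens: a new symbol with probability alpha, else copy a random earlier token."""
    rng = random.Random(seed)
    text = [0]
    next_label = 1
    while len(text) < M:
        if rng.random() < alpha:
            text.append(next_label)        # introduce a new symbol
            next_label += 1
        else:
            text.append(rng.choice(text))  # repeat an earlier token (rich get richer)
    return text

def state_triple(tokens):
    """Return the triple (M, N, kmax) for a list of tokens."""
    counts = Counter(tokens)
    return len(tokens), len(counts), max(counts.values())

M, N_target = 17915, 1552                   # values quoted for A Q Zheng Zhuan
alpha = N_target / M
exponent = 1.0 + 1.0 / (1.0 - alpha)        # asymptotic Simon power law, P(k) ~ k**(-exponent)

text = simon_text(M, alpha)
full = state_triple(text)
first_half = state_triple(text[: M // 2])
second_half = state_triple(text[M // 2:])
print(round(exponent, 2), full, first_half, second_half)
```

With α = 1552/17915 the asymptotic exponent evaluates to about 2.1, matching the power law quoted in the text. In this variant the most common symbol takes a share of the text far above the few percent typical of real language, and the two halves contain different numbers of distinct symbols, which is the lack of translational invariance discussed above.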

Fig 10. Test of the Simon model.

(a) The data (solid triangles) together with the RGF-prediction (dashed curve) for A Q Zheng Zhuan in Chinese characters. The Simon model with the same M and N is given by the solid dots and the Simon prediction for infinite M by the dotted line. Note that the most common character appears 9,256 times for the Simon model, which is about 50% of the total number of characters. This is completely unrealistic for a sensible language (the most common character in Chinese accounts for about 4% of a text, and the most common word in English, “the”, also accounts for about 4%). Figure (b) shows that the frequency distribution for the Simon model is not translation invariant: for a real novel the word frequency distribution of the first half of the novel is to a good approximation the same as that of the second. The data for the novel A Q Zheng Zhuan in Chinese characters illustrate this (full drawn and short dashed curves in the figure). However, for the Simon model the frequency distribution depends on which part you take (long dashed and dotted curves in the figure).

https://doi.org/10.1371/journal.pone.0125592.g010

From the point of view of the present approach, the fact that the data is very well described by the RGF-model gives a tentative handle to get one step further: since RGF is a neutral-model prediction, the implication is that any systematic deviations between the data and the RGF-prediction might carry additional specific information about the system. Such a deviation was shown to become more discernible the longer the text written in Chinese characters is. The multiple meanings of Chinese characters were suggested as an explanatory factor for this phenomenon. This is based on the notion that characters/words used with larger frequency have a tendency to have more multiple meanings within a text. Some support for this was gained by comparing to the dictionary meanings of a Chinese character. It was also argued that this tendency towards more multiple meanings could be entered as an additional constraint within the RGF-formulation. Comparison with data suggested that this is indeed a sensible contender for an explanation.

Our view is that the neutral model provided by RGF is a useful starting point for extracting information from word/character distributions in texts. Compared to most other approaches, it has the advantage that it actually predicts the real data from a very limited amount of a priori information. It also has the advantage of being a general approach which can be applied to a great variety of different systems.

Supporting Information

Acknowledgments

Economic support from IceLab is gratefully acknowledged.

Author Contributions

Conceived and designed the experiments: PM XY. Performed the experiments: XY. Analyzed the data: XY. Wrote the paper: PM.

References

  1. Singh S. The code book. New York: Random House; 2002.
  2. Estoup JB. Les Gammes Sténographiques. 4th ed. Paris: Institut Stenographique de France; 1916.
  3. Zipf GK. Selective studies of the principle of relative frequency in language. Cambridge: Harvard University Press; 1932.
  4. Zipf GK. The psycho-biology of language: an introduction to dynamic philology. Boston: Mifflin; 1935.
  5. Zipf GK. Human behavior and the principle of least effort. Reading, MA: Addison-Wesley; 1949.
  6. Mandelbrot B. An informational theory of the statistical structure of languages. Woburn: Butterworth; 1953.
  7. Li W. Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE T Inform Theory. 1992;38: 1842–1845.
  8. Baayen RH. Word frequency distributions. Dordrecht: Kluwer Academic; 2001.
  9. i Cancho RF, Solé RV. Least effort and the origins of scaling in human language. Proc Natl Acad Sci USA. 2003;100: 788–791.
  10. Montemurro MA. Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A. 2001;300: 567–578.
  11. Simon H. On a class of skew distribution functions. Biometrika. 1955;42: 425–440.
  12. Kanter I, Kessler DA. Markov processes: linguistics and Zipf’s Law. Phys Rev Lett. 1995;74: 4559–4562. pmid:10058537
  13. Dorogovtsev SN, Mendes JFF. Language as an evolving word web. Proc R Soc Lond B. 2001;268: 2603–2606.
  14. Zanette DH, Montemurro MA. Dynamics of text generation with realistic Zipf’s distribution. J Quant Linguistics. 2005;12: 29–40.
  15. Wang DH, Li MH, Di ZR. True reason for Zipf’s law in language. Physica A. 2005;358: 545–550.
  16. Masucci A, Rodgers G. Networks properties of written human language. Phys Rev E. 2006;74: 26102.
  17. Cattuto C, Loreto V, Servedio VDP. A Yule-Simon process with memory. Europhys Lett. 2006;76: 208–214.
  18. Lü L, Zhang Z-K, Zhou T. Deviation of Zipf’s and Heaps’ laws in human languages with limited dictionary sizes. Sci Rep. 2013;3: 1082. pmid:23378896
  19. Zhao KH. Physics nomenclature in China. Am J Phys. 1990;58: 449–452.
  20. Rousseau R, Zhang Q. Zipf’s data on the frequency of Chinese words revisited. Scientometrics. 1992;24: 201–220.
  21. Shtrikman S. Some comments on Zipf’s law for the Chinese language. J Inf Sci. 1994;20: 142–143.
  22. Ha LQ, Sicilia-Garcia EI, Ming J, Smith FJ. Extension of Zipf’s law to words and phrases. Proc 19th Intl Conf Comput Linguistics. 2002: 315–320.
  23. Bernhardsson S, da Rocha LEC, Minnhagen P. Size dependent word frequencies and the translational invariance of books. Physica A. 2010;389: 330–341.
  24. Bernhardsson S, da Rocha LEC, Minnhagen P. The meta book and size-dependent properties of written language. New J Phys. 2009;11: 123015.
  25. Miller GA. Some effects of intermittent silence. Am J Psychol. 1957;70: 311–314. pmid:13424784
  26. Bernhardsson S, Baek SK, Minnhagen P. A paradoxical property of the monkey book. J Stat Mech. 2011;7: P07013.
  27. Baek SK, Bernhardsson S, Minnhagen P. Zipf’s law unzipped. New J Phys. 2011;13: 043004.
  28. Bokma F, Baek SK, Minnhagen P. 50 years of inordinate fondness. Syst Biol. 2013: syt067.
  29. Clauset A, Shalizi CR, Newman MEJ. Power-law distributions in empirical data. SIAM Rev. 2009;51: 661–703.
  30. Visser M. Zipf’s law, power laws and maximum entropy. New J Phys. 2013;15: 043021.
  31. Lee SH, Bernhardsson S, Holme P, Kim BJ, Minnhagen P. Neutral theory of chemical reaction networks. New J Phys. 2012;14: 033032.
  32. Baek SK, Minnhagen P, Kim BJ. The ten thousand Kims. New J Phys. 2011;13: 073036.
  33. Zipf GK. The meaning-frequency relationship of words. J Gen Psychol. 1945;33: 251–256. pmid:21006715
  34. Reder L, Anderson JR, Bjork RA. A semantic interpretation of encoding specificity. J Exp Psychol. 1974;102: 648–656.
  35. i Cancho RF. The variation of Zipf’s law in human language. Eur Phys J B. 2005;44: 249–257.
  36. Manin DY. Zipf’s law and avoidance of excessive synonymy. Cognitive Sci. 2008;32: 1075–1098.
  37. Yan X, Fan Y, Di Z, Havlin S, Wu J. Efficient learning strategy of Chinese characters based on network approach. PLoS ONE. 2013;8: e69745. pmid:23990887