A Common Construction Pattern of English Words and Chinese Characters

Rankings are ubiquitous around the world. Here I investigate spatial ranking patterns of English words and Chinese characters, and reveal a common construction pattern related to phase separation. Specifically, I analyze a list of different words in the English language, and find that the frequency of the number of letters per word decays either linearly or nonlinearly with its rank in the frequency table. I interpret the linearly decaying region as a linear phase that covers 96.4% of the words, in sharp contrast to a nonlinear phase (the nonlinearly decaying region) that covers the remaining 3.6%. Strikingly, the same phase separation, with the same two percentages of 96.4% and 3.6%, also holds for the relation between strokes and characters in the Chinese language, although English and Chinese are two distinctly different language systems. The common construction pattern originates from the log-normal distributions of frequencies of words or characters, which can be understood as the joint effect of the Weber-Fechner law in psychophysics and the principle of maximum entropy in information theory.


Introduction
The world is full of rankings: everything, from the reputation of movie stars and academic journals, to purchasing choices, to a global rich list, is affected by differences between the items ranked. This makes the quantitative understanding of rankings a central project of scientific research. For example, in the English language, the frequency, f, of encountering the rth most common word is inversely proportional to its rank order r (namely, $f(r) \propto 1/r$), as indicated by Zipf's law [2,3]. Besides linguistics [2][3][4][5][6][7][8][9][10], Zipfian type power laws (which include Zipf's law [2,3] and its many extensions given by $f(r) = c/r^{s}$ with $s > 0$, where c is a positive constant and r is either a rank order or a quantity that can be ranked, say, firm size [23]) have been observed and studied in many disciplines, such as physics [18][19][20][21][22], acoustics [11], biology [12,13], economics or finance [15,16,24], sociology [17,23,25], and architectonics [14].
However, although many rankings can be described by Zipfian type power laws, many others cannot, e.g., in communications [26] and linguistics [27,28]. That is, ranking patterns may differ markedly across areas. Accordingly, in the seminal paper [1], Blumm et al. studied temporal ranking patterns of some complex systems and revealed a novel noise-driven phase transition that separates the stable rankings observed in some complex systems from the volatile rankings observed in others. While they succeeded in revealing and explaining the common phase transition phenomenon of different temporal ranking patterns, little academic attention has been devoted to the common phase separation phenomenon of different spatial ranking patterns, which are the counterparts of temporal ranking patterns. [Here the "temporal ranking pattern" describes the ranking of a specific item (say, the reputation of a particular scientist) that changes with time; the "spatial ranking pattern" depicts the rankings of different items (say, the reputation of a particular scientist versus that of a particular athlete) at a given time.] This is partly because rankings appear in such diverse areas that obtaining a common phase separation behavior seems an impossible task. In fact, the failure of existing models of Zipfian type power laws to capture the information-theoretic model for communication [26] also implies the possibility that various spatial ranking patterns share a common phase separation phenomenon. Moreover, understanding the common phase separation phenomenon of different spatial ranking patterns raises some valuable questions: What is the underlying mechanism? Is the mechanism also common to different systems?
To proceed, I study two different language systems: English and Chinese. Belonging to two different language families, English and Chinese differ significantly in, for example, alphabet, phonology, grammar, and vocabulary. But do their construction patterns of rankings also differ from each other? I show that, although the language families are different, their construction patterns of rankings are not: both exhibit the same phase separation phenomenon, with a linear phase described by a linearly-decaying law and a nonlinear phase described by a nonlinearly-decaying law. Accordingly, the phase separation appears to be "common" to the two different systems, at least to some extent.

Methods
Let me start by briefly introducing the two languages. The English language contains 54,700 different words [29], each constructed from letters; an individual letter in a word carries no special meaning (except in 1-letter words like "I"). The Chinese language has 20,893 different characters [30], each constructed from strokes; a stroke in a character carries no special meaning either (except in 1-stroke characters like "一" (one)). In fact, a stroke is only an individual pen movement that is needed to draw a character. For instance, the character "二" (two) has two strokes, and the character "手" (hand) has four strokes.
Consider a list of numbers of letters (or strokes) per English word (or per Chinese character), each corresponding to a certain number, N, of English words (or Chinese characters) that determines its ranking. The number of letters or strokes with the largest N (that is, 8326 for English words that are 8 letters long and 1957 for Chinese characters with 12 strokes, respectively) is ranked first, namely, its rank order is $R = 1$. Similarly, the remaining rank orders are assigned in decreasing order of N; see Table 1.
In addition, I use the frequency, F, to measure the occupation ratio, defined as the quotient of N and the total number of English words or Chinese characters.
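The ranking procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `rank_by_count` and the toy counts are hypothetical and are not the data of Table 1; ties in N are broken by input order, which the text does not specify.

```python
from typing import Dict, List, Tuple


def rank_by_count(counts: Dict[int, int]) -> List[Tuple[int, int, int, float]]:
    """Return rows (R, x, N, F) sorted by decreasing N.

    x is the number of letters (or strokes), N the number of words
    (or characters) having that x, R the rank order, and F = N / total.
    """
    total = sum(counts.values())
    # sorted() is stable, so equal N keep their input order (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, x, n, n / total)
            for rank, (x, n) in enumerate(ordered, start=1)]


# Hypothetical toy counts, NOT the counts behind Table 1.
toy_counts = {1: 2, 2: 10, 3: 30, 4: 45, 5: 40, 6: 25, 7: 12, 8: 5, 9: 1}
rows = rank_by_count(toy_counts)

for R, x, N, F in rows:
    print(f"R={R:2d}  x={x:2d}  N={N:3d}  F={F:.4f}")
```

With the real data, the first row for English would be x = 8 with N = 8326, that is, F = 8326/54,700.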

Results
As shown in Figs. 1 and 2, most of the frequencies are well approximated by a linearly-decaying law, $F(R) = a + bR$. The fact that the linearly-decaying behavior spans $R = 1$ to $11$ for English words and $R = 1$ to $19$ for Chinese characters indicates that the linearly-decaying law is valid for most English words or Chinese characters. Remarkably, the same percentage, 96.4%, is covered for both English words and Chinese characters. Accordingly, as shown in Figs. 1 and 2, these 96.4% of words or characters lie in a linear phase where a linear function fits the data. In contrast, the remaining 3.6% of words or characters are located in a nonlinear phase where a nonlinear function fits instead. For the nonlinear function, I adopt power-law distribution functions [31] in Figs. 1 and 2 (based on a private communication with Mr. G. Yang), which belong to the family of Zipfian type power laws. Nevertheless, I should remark that other types of nonlinear functions, such as exponential distributions, might also be suitable because the data in the current nonlinear phase are sparse.
Because the number of words or characters in the nonlinear phase is small enough to be neglected compared with the linear phase, I focus on the linear phase and raise a question: what is the origin of the observed linearly-decaying behavior? To answer this question, I plot the frequency versus the number of letters (or strokes) per English word (or per Chinese character). Figs. 3 and 4 show that the frequencies are approximated by a log-normal distribution for both English words and Chinese characters. This echoes the findings of Herdan [9] and Zhang [10]. In Ref. [9], Herdan reported that a log-normal distribution appeared for 738 different English words in phone conversations, where the mean value and standard deviation are 5.05 and 1.47, respectively. In Ref. [10], Zhang revealed a log-normal distribution for 16,262 different Chinese characters in the Chinese dictionary "Cihai (辞海)" that was edited as early as 1979, where the mean value and standard deviation are 2.4739 and 0.3827, respectively. When I rank the theoretical values predicted by the two log-normal distributions depicted in Figs. 3 and 4, I find that they agree with those obtained empirically from the 54,700 English words and 20,893 Chinese characters, respectively; see Figs. 1 and 2. Remarkably, they can even be fitted within the same ranges of R by the same linear function, $F(R) = a + bR$, with almost the same values of a and b. So the existence of log-normal distributions is a possible origin of the linearly-decaying law.
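As a rough illustration of this comparison, the sketch below ranks frequencies generated from a log-normal distribution and fits a straight line to the first R_c ranks by least squares. Several choices are assumptions rather than the author's stated procedure: the values 2.0794 and 0.3351 from the Fig. 3 caption are treated as the mean and standard deviation of ln(x), the density is evaluated at integer x up to a hypothetical cutoff `X_MAX` and then normalized, and all function names are illustrative. The fitted numbers are therefore not expected to reproduce the values reported in the figure captions exactly.

```python
import math
from typing import List, Sequence, Tuple


def lognormal_pdf(x: float, mu: float, sigma: float) -> float:
    """Log-normal density with parameters mu, sigma of ln(x)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def ranked_frequencies(mu: float, sigma: float, x_max: int) -> List[float]:
    """Evaluate the density at x = 1..x_max, normalize, sort descending."""
    weights = [lognormal_pdf(x, mu, sigma) for x in range(1, x_max + 1)]
    total = sum(weights)
    return sorted((w / total for w in weights), reverse=True)


def least_squares_line(xs: Sequence[float],
                       ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


MU, SIGMA = 2.0794, 0.3351   # values quoted in the Fig. 3 caption
X_MAX = 25                   # hypothetical cutoff on letters per word
R_C = 11                     # critical threshold for English (Fig. 1)

freqs = ranked_frequencies(MU, SIGMA, X_MAX)
ranks = list(range(1, R_C + 1))
a, b = least_squares_line(ranks, freqs[:R_C])
coverage = sum(freqs[:R_C])  # share of total frequency with R <= R_C

print(f"a = {a:.4f}, b = {b:.5f}, coverage = {coverage:.3f}")
```

Replacing `freqs` with the empirical frequencies of Table 1 would give the corresponding empirical fit over the same range of R.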
Figure 1. [...] Fig. 3. For comparison, the same linear function is used to fit the data of frequencies obtained from this log-normal distribution for the same range of R: $a = 1.76 \times 10^{-1}$ and $b = -1.49 \times 10^{-2}$ (blue dashes). On the other hand, for $R > 11$ covering the remaining 3.6% words, I attempt to use a power-law distribution function, $F(R) = 10^{10.39}/R^{11.14}$, with regression coefficient $r_c^2 = 98.58\%$ (note the perfect fit corresponds to $r_c^2 = 100\%$ [31]); see the inset that shows a log-log plot. The linear fits are obtained by the least square method. In analogy with critical phenomena, I indicate a critical threshold, $R_c = 11$. For $R \le R_c$, the linearly-decaying behavior described by the linear function comes to appear; I interpret this as a linear phase. For $R > R_c$, the nonlinearly-decaying behavior occurs, which can be described by nonlinear functions (say, a power-law distribution function as used in the figure); I interpret this as a nonlinear phase. doi:10.1371/journal.pone.0074515.g001
Figure 2. Frequency, F, versus rank order, R, for the list of 20,893 Chinese characters (red circles). For $R \le 19$, the linear function, $F(R) = a + bR$, is adopted for the linear fit (red line) covering 96.4% characters: $a = 1.01 \times 10^{-1}$ and $b = -4.99 \times 10^{-3}$. Also shown are the frequencies (blue stars) according to the log-normal distribution depicted in Fig. 4. The same linear function is used to fit the data of frequencies obtained from this log-normal distribution for the same range of R: $a = 1.01 \times 10^{-1}$ and $b = -5.05 \times 10^{-3}$ (blue dashes). In addition, for $R > 19$ covering the remaining 3.6% characters, I try using a power-law distribution function, $F(R) = 10^{12.02}/R^{10.61}$, with regression coefficient $r_c^2 = 97.04\%$; see the inset that depicts a log-log plot. Also, I obtain the linear fits according to the least square method. Following Fig. 1, I also indicate a critical threshold, $R_c = 19$, to distinguish the linear phase from the nonlinear phase. doi:10.1371/journal.pone.0074515.g002
Figure 3. Frequency, F, as a function of the number of letters per English word (red circles). Note that the function is approximated with a log-normal distribution (blue stars) that has the same mean value (2.0794) and standard deviation (0.3351) as those determined by the whole list of 54,700 English words. doi:10.1371/journal.pone.0074515.g003
So far, one may ask why "log-normal distributions" come to appear herein. This can be understood according to the following theoretical analysis (based on the private communication with Dr. J. R. Wei), which is somehow different from the models mentioned in Ref. [32]. Let me set the probability density function of the number (n) of English words or Chinese characters to be $n = f(x)$. Here x is the number of letters per English word, or the number of strokes per Chinese character. Now I am in a position to introduce both the Weber-Fechner law in psychophysics [33] and the principle of maximum entropy in information theory [34,35]; see the following two steps.
Step I: According to the Weber-Fechner law, people's psychological perception of x scales as $\ln(x)$. So, for a particular group of people, the distribution of x is such that the mean and the standard deviation of $\ln(x)$ are each constant.
Step II: The long-time evolution of English words or Chinese characters optimizes $f(x)$. According to the principle of maximum entropy, in order to achieve the optimal $f(x)$, one should maximize the information entropy, $\max\left[-\sum f(x) \ln f(x)\right]$. As a result of the two steps above, one obtains $f(x)$ in the form of a log-normal distribution, as expected. In other words, this theoretical analysis suggests that the joint effect of the Weber-Fechner law and the principle of maximum entropy serves as the underlying mechanism for the linear phases shown in Figs. 1 and 2.
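The two steps can be made explicit with a short worked derivation. The sketch below is a standard maximum-entropy argument written in the continuous limit (an integral in place of the sum above); the symbols $\mu$ and $\sigma$ for the fixed mean and standard deviation of $\ln(x)$, and the Lagrange multipliers $\lambda_0, \lambda_1, \lambda_2$, are notation introduced here for illustration and do not appear in the original analysis.

```latex
\begin{aligned}
&\text{Maximize } S[f] = -\int_0^\infty f(x)\,\ln f(x)\,dx \\
&\text{subject to } \int_0^\infty f\,dx = 1,\quad
  \int_0^\infty f\,\ln x\,dx = \mu,\quad
  \int_0^\infty f\,(\ln x-\mu)^2\,dx = \sigma^2 . \\[4pt]
&\text{Stationarity of the Lagrangian with respect to } f: \\
&\qquad -\ln f(x) - 1 - \lambda_0 - \lambda_1 \ln x - \lambda_2(\ln x-\mu)^2 = 0 \\
&\Rightarrow\ f(x) = C\,x^{-\lambda_1}\exp\!\left[-\lambda_2(\ln x-\mu)^2\right],
  \qquad C = e^{-1-\lambda_0}. \\[4pt]
&\text{The three constraints fix }
  \lambda_1 = 1,\quad \lambda_2 = \frac{1}{2\sigma^2},\quad
  C = \frac{1}{\sigma\sqrt{2\pi}}, \\
&\Rightarrow\ f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}
  \exp\!\left[-\frac{(\ln x-\mu)^2}{2\sigma^2}\right].
\end{aligned}
```

The value $\lambda_1 = 1$ follows from the substitution $y = \ln x$: the density of $y$ is then proportional to $e^{(1-\lambda_1)y}\exp[-\lambda_2(y-\mu)^2]$, which has mean $\mu$ only if $\lambda_1 = 1$.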

Discussion
The present results indicate that the linearly-decaying law and the nonlinearly-decaying law observed in the English and Chinese language systems represent different phases separated by a critical threshold. Although English and Chinese belong to different language families, their spatial ranking patterns can be captured by the same phase separation, which displays two distinct phases: a linear phase and a nonlinear phase. In analogy with critical phenomena, one might regard R (rank order) as the control parameter and F (frequency) as the order parameter. Besides the construction pattern, it is also instructive to compare English and Chinese when they evolve or expand [36,37].
Uncovering the common phase separation in spatial ranking patterns is expected to have scientific and commercial potential in areas that involve rankings. Below I list some initial thoughts, which might serve as an agenda for future research. First, one may extend the present analysis to other languages or to specific books. For books, say, by Shakespeare, one might use this analysis to assess the authenticity of some controversial works. Second, models of rankings [38,39] are indispensable for models of social organization, ranging from urban or national models to financial market models. The common phase separation might change the conclusions these models offer. In this case, the Bayesian model averaging established by Raftery et al. can also help to account for uncertainty about model form [40,41]. Third, because forecasting ranking patterns is a difficult goal, in contrast with the accurate predictive tools common in the natural sciences, models describing spatial ranking patterns with a common phase separation become potentially useful for better resource allocation [42,43] and for pricing plans that help companies improve inventory and service allocation. Finally, the common phase separation reported in this work might suggest a class of self-organized critical phenomena, raising the intriguing possibility that, besides the usual physical systems like granular piles [44] and proteins [45], the theory of self-organized criticality [46] might also be used to understand ranking systems that people face daily. This also shows that self-organization can serve as an underlying mechanism not only for traditional physical systems, but also for non-traditional physical systems [47].

Author Contributions
Conceived and designed the experiments: JPH. Performed the experiments: JPH. Analyzed the data: JPH. Contributed reagents/materials/analysis tools: JPH. Wrote the paper: JPH.