Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems

Scaling laws characterize diverse complex systems in a broad range of fields, including physics, biology, finance, and social science. The human language is another example of a complex system of words organization. Studies on written texts have shown that scaling laws characterize the occurrence frequency of words, words rank, and the growth of distinct words with increasing text length. However, these studies have mainly concentrated on the western linguistic systems, and the laws that govern the lexical organization, structure and dynamics of the Chinese language remain not well understood. Here we study a database of Chinese and English language books. We report that three distinct scaling laws characterize words organization in the Chinese language. We find that these scaling laws have different exponents and crossover behaviors compared to English texts, indicating different words organization and dynamics of words in the process of text growth. We propose a stochastic feedback model of words organization and text growth, which successfully accounts for the empirically observed scaling laws with their corresponding scaling exponents and characteristic crossover regimes. Further, by varying key model parameters, we reproduce differences in the organization and scaling laws of words between the Chinese and English language. We also identify functional relationships between model parameters and the empirically observed scaling exponents, thus providing new insights into the words organization and growth dynamics in the Chinese and English language.


Introduction
Scaling laws have been discovered and investigated in many fields such as physics, biology, finance, geology, and sociology. Examples include urban growth [1], population distribution of cities [2,3], clusters in DNA structure [4,5], financial markets [6][7][8], spatial distribution of words in texts [9][10][11][12][13], structures of rocks and geological formations [14], citation networks [15], social interactions [16], and stochastic physical systems [17,18]. Zipf's law and Heaps' law, two classical representatives of scaling laws, have been studied in various systems [19][20][21][22]. In the context of natural language structure and organization, Zipf's law states that the word frequency Z(r), sorted in descending order, is approximately inversely proportional to the word frequency rank r, which appears as a straight line in a log-log plot [23]. Heaps' law reveals a different aspect of words organization in natural languages: the vocabulary size N(t) of a given text grows sublinearly with increasing text length t, roughly as a straight line in a log-log plot [24]. Zipf's initial study focused on the English language [23], and later researchers extended his work to other natural languages, including Hebrew, Greek, Spanish and Irish [25,26], reporting that different languages are characterized by distinct scaling exponents, reflecting differences in words organization.
Although Chinese is a widely used language, relatively few studies have focused on the Chinese lexical organization [21,27], and the laws which govern words frequency, words rank and growth of distinct words with increasing text length in the Chinese language remain not well understood. The existence of Chinese polysemic words and the complex word segmentation rules in the Chinese language pose challenges to systematic and consistent studies [28]. Currently, it is an open question whether Zipf's law, Heaps' law or other scaling laws adequately describe the structure and organization of the Chinese language system.
In this study, we analyze scaling properties of words organization in the Chinese language and compare them with English, using a database of classic Chinese and English books. Our analyses show that: (i) the probability distribution of words frequency in both the Chinese and English languages obeys a power-law, and the word organization of both languages conforms to Zipf's law and Heaps' law; (ii) the scaling exponents and crossover behavior of these two languages are significantly different, reflecting the different role and organization of Chinese characters compared to English words. Furthermore, we propose a model of words organization and text growth which accounts for the empirically observed scaling laws. The introduced model parameters provide new insight into the scaling dynamics and construction mechanisms of words organization in Chinese and English written texts. By varying key model parameters, we successfully reproduce the differences in the organization and scaling properties of words in Chinese and English texts, and we establish a functional relationship between the empirically observed scaling exponents and our model parameters.

Methods and Results

Database
We compare two categories of data sets obtained from Project Gutenberg [29]:
1. Ten classic Chinese books written by different authors during the period from the 14th to the late 18th century, including novels, Chinese mythologies and history texts. Each book has approximately $0.4 \times 10^4$ distinct Chinese characters and on average $5 \times 10^5$ Chinese characters in total (a Chinese character can appear many times in a book). In our analyses, we treat each Chinese character as a separate word because, in contrast to western languages where each word is composed of letters, characters in the Chinese language do not correspond to letters but often indicate separate words, and the same Chinese character can play a role as a verb, noun, or adverb depending on the context in the sentence.
2. Ten classic English books written by different authors covering different themes and genres. Each book contains approximately $10^4$ distinct words and has a text length of about $10^5$ words, a corpus comparable in size with the database of Chinese classic books analyzed in this paper.
Detailed information on the books included in our analyses, their length and the vocabulary size of each book is shown in Table 1.

Fundamental scaling laws of words organization in Chinese language
We first obtain the distribution of word frequency for each Chinese and English language book in our database. We find that the probability distribution P(k) of word frequency k for both Chinese and English books follows a power-law with scaling exponent β (Fig 1(a)),
$$P(k) \sim k^{-\beta}. \quad (1)$$
Our analyses show that Chinese language texts exhibit a higher percentage of high frequency words compared to English texts. This is reflected by the significantly lower value of the scaling exponent β = 1.55 ± 0.06 (group mean ± standard deviation) for Chinese books, compared to β = 1.83 ± 0.04 for English books (Student's t-test p < 0.01; Fig 2, fitting range $k \in [1, 10^3]$). We next perform Zipf's rank analysis, and we find that words organization in both Chinese and English language texts obeys Zipf's law, i.e. the normalized word frequency Z(r) exhibits a power-law behavior as a function of the word frequency rank r, characterized by a scaling exponent α,
$$Z(r) \sim r^{-\alpha}. \quad (2)$$
Our analyses show that while English texts exhibit a power-law with a single exponent α = 1.05 ± 0.03 for the entire fitting range of words frequency rank $r \in [3, 2 \times 10^3]$, the Chinese language texts are characterized by a clear crossover in the Zipf's scaling, from a regime with $\alpha_1 = 0.60 \pm 0.07$ for high and intermediate frequency ranks $r \in [3, 100]$ to a second regime with scaling exponent $\alpha_2 = 1.48 \pm 0.14$ for lower frequency ranks $r \in [2 \times 10^2, 2 \times 10^3]$ (Fig 1(b)). The differences are significant both when comparing $\alpha_1$ and $\alpha_2$ of Chinese language texts and when comparing the scaling exponents between the Chinese and English languages (Fig 2). In Table 2, we list the top 20 most frequently used English words and Chinese characters and their frequencies.
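The two measurements described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the input is assumed to be an already tokenized list (one item per Chinese character, or one item per English word), and the exponent is estimated with an ordinary least-squares fit in log-log coordinates, which is an assumption because the text does not state which estimator was used.

```python
import math
from collections import Counter


def frequency_distribution(tokens):
    """Return P(k): the fraction of distinct words that occur exactly k times."""
    word_counts = Counter(tokens)
    count_of_counts = Counter(word_counts.values())
    n_distinct = len(word_counts)
    return {k: n / n_distinct for k, n in sorted(count_of_counts.items())}


def zipf_curve(tokens):
    """Return [(rank, normalized frequency)], rank 1 being the most frequent word."""
    total = len(tokens)
    ordered = sorted(Counter(tokens).values(), reverse=True)
    return [(rank, count / total) for rank, count in enumerate(ordered, start=1)]


def loglog_slope(points, lo, hi):
    """Least-squares slope of log(y) against log(x) for points with lo <= x <= hi.

    For a power law y ~ x**(-a) the returned slope is -a.
    """
    selected = [(math.log(x), math.log(y)) for x, y in points if lo <= x <= hi and y > 0]
    if len(selected) < 2:
        raise ValueError("need at least two points inside the fitting range")
    mean_x = sum(x for x, _ in selected) / len(selected)
    mean_y = sum(y for _, y in selected) / len(selected)
    sxx = sum((x - mean_x) ** 2 for x, _ in selected)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in selected)
    if sxx == 0:
        raise ValueError("all selected points share the same x value")
    return sxy / sxx
```

With these helpers, an estimate of α over the fitting range used for English texts would be `-loglog_slope(zipf_curve(tokens), 3, 2000)`, and an estimate of β would be `-loglog_slope(list(frequency_distribution(tokens).items()), 1, 1000)`. A crossover such as the one reported for Chinese texts would be probed by fitting the two rank ranges separately.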
We also find that words organization of both Chinese and English language texts obeys Heaps' law. Specifically, we find that the number of distinct words N(t) grows as a power-law with increasing text length t for all Chinese and English books in our database (Fig 1(c)):
$$N(t) \sim t^{\lambda}. \quad (3)$$
Our results show that at short scales $t \in [1, 10^2]$ both Chinese and English texts exhibit practically the same scaling behavior with exponent $\lambda_1 \approx 1$ (Fig 1(c)): $\lambda_1 = 0.96 \pm 0.02$ for Chinese language books and $\lambda_1 = 0.94 \pm 0.05$ for English (Student's t-test indicates no statistical difference, with p = 0.34). To estimate the range of this linear growth regime for each Chinese and English text, we determine the scale $t_1$ at which the scaling exponent λ reaches 0.95, and we find a significant difference in the range of linear growth between the Chinese and English texts ($t_1 = 130.2 \pm 43.2$ for Chinese and $t_1 = 78.6 \pm 16.2$ for English, with p < 0.01).
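The vocabulary growth curve N(t) used in the Heaps' analysis can be computed in a single pass over the text. The sketch below is illustrative only: the function names are hypothetical, and the two-point exponent estimate is a deliberately crude stand-in, since the text does not specify how the local exponent λ or the scale $t_1$ was actually determined.

```python
import math


def vocabulary_growth(tokens):
    """Return the list [N(1), N(2), ..., N(T)] of distinct-word counts."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def two_point_exponent(growth, t_start, t_end):
    """Estimate lambda between text lengths t_start < t_end (1-indexed).

    Uses the slope of log N(t) against log t between the two end points only.
    """
    if not 1 <= t_start < t_end <= len(growth):
        raise ValueError("need 1 <= t_start < t_end <= text length")
    n_start = growth[t_start - 1]
    n_end = growth[t_end - 1]
    return (math.log(n_end) - math.log(n_start)) / (math.log(t_end) - math.log(t_start))
```

For a book tokenized into `tokens`, `two_point_exponent(vocabulary_growth(tokens), 1, 100)` gives a rough value to compare with $\lambda_1$; a regression over many points in each range would be the more robust choice for real data.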
Further, we find that the Heaps' scaling law exhibits a crossover at $t \approx 100$ from a linear to a sub-linear scaling regime characterized by a scaling exponent $\lambda_2 \approx 0.7$ for the intermediate scales (Fig 1(c)). Student's t-test p = 0.28 indicates no significant difference between the Chinese and English languages in this intermediate scaling regime.

Fig 2. Box plots of the scaling exponents β, α and λ for the Chinese and English books in Table 1. The central red mark in each box indicates the median value for the corresponding scaling exponents β, α and λ shown in three panels. Bottom and upper edges of each box denote the 25th and 75th percentiles, respectively, while the whiskers extend to the most extreme data points not considered outliers; outliers are plotted individually by the symbol "+" in red color. The exponent β of the probability distribution P(k) is obtained in the range of word frequency $k \in [1, 10^3]$ for both Chinese and English texts. The Zipf's law scaling exponent α for English books is obtained in the range of word frequency rank $r \in [3, 2 \times 10^3]$, while for Chinese books $\alpha_1$ is obtained for small and intermediate scales $r \in [3, 100]$ and $\alpha_2$ for large scales $r \in [2 \times 10^2, 2 \times 10^3]$. Heaps' scaling law for both languages exhibits several scaling regimes, with the exponent $\lambda_1$ obtained for short scales of text length $t \in [1, 100]$ for both languages, $\lambda_2$ in the scales $t \in [10^2, 2 \times 10^4]$ for English texts and $t \in [10^2, 3 \times 10^3]$ for Chinese texts, and the exponent $\lambda_3$ in long scales $t > 10^4$ for Chinese texts only. Our analyses indicate significant differences in all three scaling exponents β, α and λ between written texts in the Chinese and English language.
Notably, our analyses show that all Chinese texts exhibit a second crossover at large scale $t \approx 4 \times 10^3$, corresponding to $N(t) \approx 2 \times 10^3$, with a transition to a saturation regime characterized by a distinct scaling exponent $\lambda_3 = 0.27 \pm 0.03$. This saturation regime is not observed in the English texts, but is typical for Chinese texts (Fig 1(c)) and results from the limited number of distinct Chinese characters (considered as words in our analyses) that are predominantly used in Chinese language texts.

Stochastic feedback model of words organization and text growth mechanism in Chinese language
To understand how the mechanism underlying words organization leads to the empirically observed scaling laws, and to study the difference between the Chinese and English languages, we introduce a stochastic feedback model that accounts for the probability of word occurrence and for the growth of new words (words that have not yet appeared in the text) with increasing text length.
To guide our model construction, we perform an additional statistical analysis on all books in our database to determine how the frequency of words in a new part of the text depends on the frequency of words in the previous part of the text. First, we divide a text into two equal parts (Part I and Part II) and calculate the frequency k of each distinct word in Part I. We then count n(k), the number of distinct words that appear k times in Part I. We next count N(k), the total number of times these n(k) words appear in Part II of the text. Finally, we calculate ϕ(k), the average number of occurrences in Part II of words that appeared k times each in Part I:
$$\phi(k) = \frac{N(k)}{n(k)}. \quad (4)$$
For both Chinese and English language books in Table 1, we find that ϕ(k) scales with k with an exponent $\gamma \approx 1$ (Fig 3):
$$\phi(k) \sim k^{\gamma}. \quad (5)$$
Hence, this analysis reveals a "rich-get-richer" mechanism in words growth, i.e., a high frequency word in the "old" text tends to appear with similarly high frequency in the "new" text. This behavior is consistently observed in all Chinese and English books in our database, and thus we introduce in our modeling approach this empirically derived functional relation for the organization of distinct words in written text.
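The split-half analysis described above translates directly into code. The following sketch assumes a tokenized text and uses a hypothetical function name; for a text of odd length it places the extra token in Part II, a detail the description leaves open.

```python
from collections import Counter


def split_half_phi(tokens):
    """Return {k: phi(k)}, where phi(k) = N(k) / n(k).

    n(k) is the number of distinct words occurring k times in Part I, and
    N(k) is the total number of occurrences of those words in Part II.
    """
    half = len(tokens) // 2
    part_one = Counter(tokens[:half])
    part_two = Counter(tokens[half:])

    n_of_k = Counter()  # n(k)
    total_of_k = Counter()  # N(k)
    for word, k in part_one.items():
        n_of_k[k] += 1
        total_of_k[k] += part_two[word]  # a Counter yields 0 for absent words

    return {k: total_of_k[k] / n_of_k[k] for k in sorted(n_of_k)}
```

For the toy text `a a b c a a b d`, Part I is `a a b c` and Part II is `a a b d`, so the two words with k = 1 (`b` and `c`) appear once in total in Part II, giving ϕ(1) = 0.5, while the single word with k = 2 (`a`) appears twice, giving ϕ(2) = 2. A slope close to 1 in a log-log plot of the returned values corresponds to the $\gamma \approx 1$ behavior reported in the text.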
To understand the origin of the observed scaling laws describing words structure in Chinese language texts, we simulate the process of language construction using the following stochastic feedback model of text growth. Given a text with a number of words, we grow the text following two different procedures: adding a new distinct word that is not present in the prior text (Procedure I), or adding a word that is already present in the prior text (Procedure II). To model text growth dynamics based on the process of generating new distinct words (Procedure I), we introduce a probability p that gradually changes with the growing text length t (Eq 6). In our model simulation, for each book, we generate a text of length t = T equal to the length of the corresponding empirical text in Table 1.

Fig 3. Dependence of ϕ(k) on k for the books in Table 1, revealing a scaling relation of the average number of occurrences ϕ(k) of a given word in the second half of a text, provided the frequency of occurrence of this word in the first half of the text is k. Both languages are characterized by a scaling exponent $\gamma \approx 1$, indicating that words which appear with high frequency k in the first part of the text also have a high average occurrence in the rest of the text.
At short text length t, the functional form of p in Eq (6) provides for a close to linear growth of the number of distinct new words. The value of the parameter $k_0 > 1$ determines the range of the linear growth region, where the text growth process includes only new words that are not present in the prior text. As the text length t increases, the probability p of adding distinct new words to the text decreases. This decrease is controlled by the scaling parameter $k_t$: for Chinese texts $k_t > 1$, in order to account for the saturation regime observed in the Heaps' scaling law at very large scales of t (Fig 1(c)), while for English texts $k_t \in (0, 1)$.
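The overall shape of a two-procedure growth process of this kind can be sketched as follows. This is explicitly not the model of Eq (6) and Eq (7), whose functional forms are not reproduced here: the new-word probability is passed in as an arbitrary function of t, `placeholder_new_word_probability` is a hypothetical stand-in with made-up parameters `plateau` and `decay` (not $k_0$ and $k_t$), and the choice of weights proportional to `n(i) ** selection_exponent` for re-using an existing word is an assumption that merely satisfies "more frequent words are more likely to be chosen".

```python
import random


def placeholder_new_word_probability(t, plateau=100, decay=0.5):
    """HYPOTHETICAL stand-in for the new-word probability; not Eq (6).

    Equals 1 up to t = plateau (only new words), then decays as a power of t.
    """
    return 1.0 if t <= plateau else (plateau / t) ** decay


def grow_text(length, new_word_probability, selection_exponent=1.0, seed=0):
    """Grow a text of integer word ids using two procedures.

    Procedure I (probability p): append a word id that has not appeared yet.
    Procedure II (probability 1 - p): repeat an existing word, chosen with
    weight n(i) ** selection_exponent (an assumed form, not Eq (7)).
    """
    rng = random.Random(seed)
    counts = []  # counts[i] = n(i), occurrences of word id i so far
    text = []
    for t in range(1, length + 1):
        p = new_word_probability(t)
        if not counts or rng.random() < p:
            counts.append(1)
            text.append(len(counts) - 1)
        else:
            weights = [n ** selection_exponent for n in counts]
            word = rng.choices(range(len(counts)), weights=weights, k=1)[0]
            counts[word] += 1
            text.append(word)
    return text
```

A call such as `grow_text(5000, placeholder_new_word_probability)` returns a synthetic text whose frequency distribution, rank curve and vocabulary growth can be inspected with the same analyses applied to the books. Recomputing the weights at every step costs time proportional to the vocabulary size, which is acceptable for a sketch but would need a faster sampling structure for book-length simulations.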
In parallel to the text growth process that involves generating new distinct words with probability p (Eq 6), natural language text growth in our model also involves a second process of adding words that are already present in the prior text (Procedure II) with probability 1 − p. Guided by the empirical observations shown in Fig 3, in this procedure our model incorporates a probability $\tilde{p}(i)$ to add a word i that is already in the text, where $\tilde{p}(i)$ has a positive dependence on the number of times n(i) that the word i already appears in the text (Eq 7). The value of the parameter $k_p$ determines the degree of dependence between the probability of a word being selected and its number of occurrences in the already generated text.
Fig 4 shows the scaling analysis results from our model simulation outputs, where model parameters were chosen to reproduce the empirical scaling properties we found for the first Chinese and the first English book in Table 1. All fundamental scaling laws for P(k), Z(r) and N(t), with their specific scaling regimes, are very well captured by our model. Moreover, our separate simulations for each book in the database reveal that:
1. There is a significant difference in the values of the model parameter $k_0$ between the Chinese ($k_0 = 5.51 \pm 1.56$, group average ± standard deviation) and English language texts ($k_0 = 2.93 \pm 1.01$) (Table 3). This corresponds to our empirical finding that the Chinese and English languages have different linear growth regimes in N(t).
2. The model parameter $k_t$ has significantly higher values for Chinese texts than for English texts: $k_t = 1.16 \pm 0.17$ for Chinese and $k_t = 0.33 \pm 0.05$ for English, indicating a much lower growth rate of new distinct words with increasing text length in the Chinese language. A scatter plot of the model parameter $k_t$ vs. the empirical scaling exponent λ for each Chinese and English book in our database (Fig 5(b)) reveals an exponential relationship between $k_t$ and λ: $\lambda = e^{-1.2 k_t}$.
3. Chinese language texts have significantly lower values of the model parameter $k_p$ compared to English texts ($k_p = 1.02 \pm 0.02$ for Chinese books and $k_p = 1.10 \pm 0.02$ for English books, Table 3), indicating that Chinese texts exhibit a weaker dependence between the probability $\tilde{p}(i)$ of a word i being selected and added to the text and its number of occurrences n(i) in the prior text, which accounts for the scaling exponent β of the power-law probability distribution P(k). A scatter plot of the model parameter $k_p$ vs. the empirically derived scaling exponent β for each Chinese and English book in our database reveals a linear functional dependence between $k_p$ and β: $\beta = 3.5 k_p - 2$ (Fig 5(a)).

Fig 4. Scaling analyses of the model simulation outputs for the first Chinese and the first English book in Table 1, shown in panels (a) to (f). Modeling parameters for all Chinese and English language books are given in Table 3. doi:10.1371/journal.pone.0168971.g004

Table 3. Model parameters of Chinese and English books. The statistics show significant differences in the model parameters $k_0$, $k_t$ and $k_p$ between Chinese and English texts, indicating differences in the dynamic process underlying the language structure, words organization and the occurrence of new words with text growth.

Note that in Fig 5 we also include the results for spoken English, which we analyzed in a previous study [30]. The data points obtained from spoken English fall closely on the fitting curves, which further validates the functional relation we establish in this study between the empirically observed scaling exponents and the model parameters.
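As a quick consistency check, the two fitted relations can be evaluated at the group-average parameter values quoted in the text. The snippet below is only arithmetic on the reported numbers; the function names are hypothetical, and pairing $k_t$ with $\lambda_3$ for Chinese and with $\lambda_2$ for English is an assumption, since the text does not state which Heaps' exponent enters Fig 5(b).

```python
import math


def beta_from_k_p(k_p):
    """Linear relation reported in the text: beta = 3.5 * k_p - 2."""
    return 3.5 * k_p - 2.0


def lambda_from_k_t(k_t):
    """Exponential relation reported in the text: lambda = exp(-1.2 * k_t)."""
    return math.exp(-1.2 * k_t)


# Group-average values quoted in the text; the lambda pairing is an assumption.
reported = {
    "Chinese": {"k_p": 1.02, "k_t": 1.16, "beta": 1.55, "lambda": 0.27},
    "English": {"k_p": 1.10, "k_t": 0.33, "beta": 1.83, "lambda": 0.70},
}

for language, values in reported.items():
    predicted_beta = beta_from_k_p(values["k_p"])
    predicted_lambda = lambda_from_k_t(values["k_t"])
    print(
        f"{language}: beta {predicted_beta:.2f} vs {values['beta']:.2f}, "
        f"lambda {predicted_lambda:.2f} vs {values['lambda']:.2f}"
    )
```

The predicted exponents are about 1.57 and 1.85 for β and about 0.25 and 0.67 for λ, each within roughly 0.03 of the empirical group values, which is in line with the one-standard-deviation spreads reported for the exponents.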

Conclusion
Our analyses of Chinese and English books indicate that both languages obey a power-law in the probability distribution of word frequency, as well as Zipf's law and Heaps' law, but with different scaling characteristics. Specifically, words organization in the Chinese language (i) is characterized by a significantly lower exponent for the probability distribution of words frequency, (ii) exhibits a pronounced crossover in the Zipf's scaling law of words frequency rank, and (iii) obeys the Heaps' scaling law with a growth rate in the number of new distinct words at short and intermediate scales similar to that of English texts, but with a saturation regime unique to the Chinese language, which sets in when the number of distinct words (Chinese characters) reaches about 4000. The total size of the Chinese vocabulary is approximately 90 thousand, and the number of commonly used Chinese characters is about 4500, which accounts for the Heaps' law saturation regime at long text length. In contrast, the vocabulary size of the English language [31] is much larger than that of the Chinese language, and thus the average text length of English books may not be sufficient to manifest a saturation regime as observed for Chinese language books. Our empirical findings and model simulations indicate a clear difference in the dynamics of the language construction process between the Chinese and English languages, associated with a higher concentration of high frequency words and a lower occurrence of distinct new words in Chinese texts. The presented model successfully accounts for the different lexical generation mechanisms in Chinese and English. Our simulation results on words generation and text growth are in agreement with the empirical findings of three distinct scaling laws, and confirm that these laws accurately represent the complex dynamics of words organization in the Chinese and English language. The stochastic feedback model for text generation proposed here can be generally applied to other dynamic growth processes in complex systems characterized by scaling laws.

Fig 5. Relations between the model parameters and the empirically observed scaling exponents for the books in Table 1, and English spoken language from Ref. [30]. The dotted lines indicate 95% confidence intervals of the data points obtained from empirical and model parameters for each separate book.