Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems

doi:10.1371/journal.pone.0168971

Table 1.

Detailed information on the database of Chinese and English books and corresponding words statistics for each book.

T is the total number of words, and N_T is the vocabulary size of each book.


Fig 1.

Scaling analyses of words organization in Chinese and English books.

Word organization in Chinese and English exhibits scaling laws with different characteristics. Log-log plots of (a) the probability distribution P(k) of the word frequency k, (b) Zipf’s law Z(r) as a function of the word frequency rank r, and (c) Heaps’ scaling law for the number of distinct words N(t) vs. the number of words t in the text, obtained for the first Chinese book and the first English book listed in Table 1. Straight lines indicate the fitting ranges over which the scaling exponents β, α and λ are obtained. Our analyses show that (a) Chinese books exhibit a lower exponent β than English books; (b) English books exhibit a single scaling regime over the entire range of frequency ranks r, whereas Chinese texts show a clear crossover in the Zipf scaling of the normalized word frequency Z(r) vs. word frequency rank r; and (c) for both Chinese and English books, the number of distinct words N(t) vs. text length t exhibits a crossover between two different scaling exponents at small and intermediate scales. Chinese texts, however, are characterized by a third, saturation regime at large scales t that is not observed for English books.
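The three curves named in this caption can be computed from a tokenized text roughly as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, the input is assumed to be an already segmented list of words (or characters), and the normalizations (P(k) by vocabulary size, Z(r) by text length) are assumptions, since the caption does not state them. The exponent is estimated here as a plain least-squares slope on log-log axes over a chosen fitting range.

```python
import math
from collections import Counter


def frequency_distribution(tokens):
    """P(k): fraction of distinct words that occur exactly k times.

    Normalizing by vocabulary size is an assumption."""
    counts = Counter(tokens)
    freq_of_freq = Counter(counts.values())
    n_distinct = len(counts)
    return {k: n / n_distinct for k, n in sorted(freq_of_freq.items())}


def zipf_curve(tokens):
    """Z(r): word frequency divided by text length, indexed by rank r.

    Normalizing by text length is an assumption."""
    total = len(tokens)
    ranked = sorted(Counter(tokens).values(), reverse=True)
    return [(r, c / total) for r, c in enumerate(ranked, start=1)]


def heaps_curve(tokens):
    """N(t): number of distinct words among the first t tokens."""
    seen = set()
    curve = []
    for t, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((t, len(seen)))
    return curve


def loglog_slope(points, lo, hi):
    """Least-squares slope of log10(y) vs. log10(x) for lo <= x <= hi."""
    xs, ys = [], []
    for x, y in points:
        if lo <= x <= hi and x > 0 and y > 0:
            xs.append(math.log10(x))
            ys.append(math.log10(y))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points in the fitting range")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values in the fitting range are identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

For example, `loglog_slope(heaps_curve(tokens), 1, 100)` would give a rough estimate of the short-scale Heaps exponent for one text. The sign convention for β and α (whether the exponent is reported as the slope or its negative) is not stated in the caption and is left to the caller.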


Fig 2.

Statistical comparison of the scaling exponents derived from different scaling laws.

Data show the group-averaged values obtained for all books listed in Table 1. The central red mark in each box indicates the median value of the corresponding scaling exponent (β, α and λ, shown in three panels). The bottom and top edges of each box denote the 25th and 75th percentiles, respectively, while the whiskers extend to the most extreme data points not considered outliers; outliers are plotted individually with a red “+” symbol. The exponent β of the probability distribution P(k) is obtained in the range of word frequency k ∈ [1, 10³] for both Chinese and English texts. The Zipf’s law scaling exponent α for English books is obtained in the range of word frequency rank r ∈ [3, 2 × 10³], while for Chinese books α₁ is obtained for small and intermediate scales r ∈ [3, 100] and α₂ for large scales r ∈ [2 × 10², 2 × 10³]. Heaps’ scaling law exhibits several scaling regimes in both languages: the exponent λ₁ is obtained at short text-length scales t ∈ [1, 100] for both languages, λ₂ at scales t ∈ [10², 2 × 10⁴] for English texts and t ∈ [10², 3 × 10³] for Chinese texts, and λ₃ at long scales t > 10⁴ for Chinese texts only. Our analyses indicate significant differences in all three scaling exponents β, α and λ between written texts in Chinese and English.
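The box-plot quantities described in this caption can be reproduced for a list of per-book exponents with a short helper. This is a sketch under stated assumptions: the caption does not say how outliers are defined, so the common 1.5 × IQR fence rule is assumed here, and the percentile interpolation (Python's "inclusive" method) may differ slightly from the plotting software used for the figure. The function name `box_stats` is hypothetical.

```python
import statistics


def box_stats(values):
    """Median, quartiles, whisker ends and outliers for one group of exponents.

    Assumption: a point is an outlier if it lies more than 1.5 * IQR
    beyond the 25th or 75th percentile."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - 1.5 * iqr
    hi_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    outliers = [v for v in data if v < lo_fence or v > hi_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),   # most extreme non-outlier below the box
        "whisker_high": max(inside),  # most extreme non-outlier above the box
        "outliers": outliers,
    }
```

Calling `box_stats` once per language and per exponent gives the numbers behind each box; the input lists would be the fitted exponents for the individual books.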


Table 2.

Top 20 most frequently used English words and Chinese characters and their frequencies.

A Chinese character can have different functions in the structure of a sentence and carry different meanings depending on the context, as shown in brackets following each Chinese character in the table. The frequencies are calculated using pooled data of all books in our database.


Fig 3.

Empirical analyses of the word growth mechanism.

Data show results for the first (a) English and (b) Chinese book listed in Table 1, revealing a scaling relation for the average number of occurrences ϕ(k) of a word in the second half of a text, given that the word occurs k times in the first half. Both languages are characterized by a scaling exponent γ ≈ 1, indicating that words that appear with high frequency k in the first half of the text also have a high average occurrence in the rest of the text.
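The quantity ϕ(k) described in this caption can be measured directly from a tokenized text. The following is a minimal sketch, assuming the text is split at its midpoint by token count and that γ is estimated as the least-squares slope of log ϕ(k) vs. log k; the function names are hypothetical and the fitting range used in the paper is not given in the caption.

```python
import math
from collections import Counter, defaultdict


def second_half_occurrence(tokens):
    """phi(k): mean second-half count of words seen k times in the first half."""
    half = len(tokens) // 2
    first = Counter(tokens[:half])
    second = Counter(tokens[half:])
    totals = defaultdict(int)
    n_words = defaultdict(int)
    for word, k in first.items():
        totals[k] += second[word]  # Counter returns 0 for unseen words
        n_words[k] += 1
    return {k: totals[k] / n_words[k] for k in sorted(n_words)}


def growth_exponent(phi):
    """Slope of log(phi) vs. log(k), using only points with phi(k) > 0."""
    pts = [(math.log(k), math.log(v)) for k, v in phi.items() if v > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive points to fit a slope")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx
```

A slope close to 1 from `growth_exponent(second_half_occurrence(tokens))` would correspond to the γ ≈ 1 behavior reported in the caption.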


Fig 4.

Empirical results and modeling simulation for Chinese and English language books.

Scaling laws and model simulations for English book No. 1 (Table 1) are shown in panels (a), (b) and (c), and those for Chinese book No. 1 (Table 1) in panels (d), (e) and (f). Modeling parameters for all Chinese and English books are given in Table 3.


Table 3.

Model parameters of Chinese and English books.

The statistics show significant differences in the model parameters k₀, k_t and k_p between Chinese and English texts, indicating differences in the dynamic process underlying the language structure, words organization and the occurrence of new words with text growth.


Fig 5.

Functional relations between the empirically observed scaling exponents and model parameters.

(a) Exponent β of the empirical probability distribution P(k) vs. model parameter k_p, indicating a linear functional dependence. (b) Heaps’ law scaling exponents λ₂ for English books and spoken transcriptions, and λ₃ for Chinese books, vs. model parameter k_t, indicating an exponential functional dependence. Data points are obtained from the scaling analyses and simulations of all ten Chinese and English books listed in Table 1, and of spoken English from Ref. [30]. The dotted lines indicate 95% confidence intervals of the data points obtained from empirical and model parameters for each separate book.
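The two kinds of dependence named in this caption can be fitted with ordinary least squares. This is only a sketch: the caption does not give the exact functional forms, so y = slope · x + intercept is assumed for the linear case and y = a · exp(b · x) (fitted as a straight line in ln y) for the exponential case. The function names are hypothetical, and the sketch does not compute the confidence intervals shown in the figure.

```python
import math


def fit_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by linear regression of ln(y) on x.

    Assumes every y is positive; the exact form used in the paper
    (for example, with an additive offset) is not stated in the caption."""
    if any(y <= 0 for y in ys):
        raise ValueError("exponential fit requires positive y values")
    b, ln_a = fit_linear(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), b
```

In this reading, `fit_linear` would take the per-book values of k_p and β, and `fit_exponential` the per-book values of k_t and the corresponding Heaps exponent.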
