Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts

doi:10.1371/journal.pone.0129031

Table 1.

Characteristics of the books analyzed.

The length of each book L is measured in millions of tokens.

Fig 1.

(a) Probability mass functions f(n) of the absolute frequencies n of words and lemmas in La Regenta, together with their fits. (b) The same, from top to bottom, for Clarissa, Moby-Dick, Ulysses (all three in English), and Don Quijote (in Spanish). The distributions are multiplied by factors 1, 10⁻², 10⁻⁴ and 10⁻⁶ for a clearer visualization.
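
The quantity plotted here, f(n), is the fraction of types (word forms or lemmas) whose absolute frequency in the text is exactly n. The following is a minimal sketch of that computation, assuming the text is already tokenized; the function name `frequency_pmf` and the toy sentence are illustrative and not taken from the paper.

```python
from collections import Counter

def frequency_pmf(tokens):
    """Return {n: f(n)}: the fraction of types that occur exactly n times."""
    type_counts = Counter(tokens)                 # absolute frequency n of each type
    freq_of_freq = Counter(type_counts.values())  # how many types share each n
    vocabulary_size = len(type_counts)            # V, the number of types
    return {n: k / vocabulary_size for n, k in sorted(freq_of_freq.items())}

# Toy example (hypothetical data): 11 tokens, 7 types.
tokens = "the cat saw the dog and the dog saw a bird".split()
pmf = frequency_pmf(tokens)
for n, f in pmf.items():
    print(n, round(f, 3))
```

Applying the same function to the sequence of lemmas instead of word forms would give the lemma distribution.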

Table 2.

Power-law fitting results for words and lemmas, denoted respectively by subindices w and l.

V is the number of types (vocabulary size), n_m is the maximum frequency of the distribution, N_a is the number of types in the power-law tail, i.e., with n ≥ a, a is the minimum value for which the power-law fit holds, and γ and σ are the power-law exponent and its standard deviation, respectively. Twice the standard deviation σ_d, i.e., 2σ_d, is also given, where σ_d is the standard deviation of γ_l − γ_w assuming independence, which is σ_d = √(σ_l² + σ_w²). The last column provides ℓ₁, the number of lemmas associated with only one word form. Notice that the lemma exponent is very close to the one found in Ref. [29] for the tail of a double power-law fitting, except for Moby-Dick and Ulysses.
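
For two independent estimates, the standard deviation of their difference is the square root of the sum of the squared standard deviations. The sketch below applies this to a pair of exponents and checks whether the difference lies within 2σ_d. The numerical values are hypothetical placeholders, not entries from the table, and the function name is illustrative.

```python
import math

def exponent_difference(gamma_w, sigma_w, gamma_l, sigma_l):
    """Difference of exponents, its standard deviation, and a 2-sigma check."""
    diff = gamma_l - gamma_w
    # Standard deviation of the difference, assuming independent estimates.
    sigma_d = math.sqrt(sigma_l ** 2 + sigma_w ** 2)
    within_two_sigma = abs(diff) <= 2 * sigma_d
    return diff, sigma_d, within_two_sigma

# Hypothetical values, for illustration only.
diff, sigma_d, compatible = exponent_difference(
    gamma_w=1.90, sigma_w=0.03, gamma_l=1.94, sigma_l=0.04
)
print(round(diff, 3), round(sigma_d, 3), compatible)
```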

Fig 2.

γ_l (the exponent of the frequency distribution of lemmas) versus γ_w (the exponent of the frequency distribution of word forms).

As a guide to the eye, the line γ_l = γ_w is also shown (solid line). Error bars indicate one standard deviation.

Table 3.

Analysis of the association between random variables using Pearson and Spearman correlations as statistics.

ρ is the value of the correlation statistic and p is the p-value of a two-sided test with null hypothesis ρ = 0, calculated through permutations of one of the variables (the results can be different if p is calculated from a t-test). The sample size is 𝓝 = 10 in all cases. Only the Spearman correlation between a_w and a_l/a_w is significantly different from zero.
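
A permutation test of this kind can be sketched as follows: the statistic is recomputed after shuffling one variable many times, and p is estimated as the fraction of shuffles whose absolute correlation is at least as large as the observed one. This is a minimal illustration under assumed choices (random rather than exhaustive permutations, `n_perm` shuffles, a fixed seed); the function names and the data are hypothetical and not taken from the paper.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_p_value(x, y, statistic, n_perm=10000, seed=0):
    """Two-sided p-value for the null hypothesis of zero correlation."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute one variable only
        if abs(statistic(x, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, hits / n_perm

# Hypothetical sample of size 10, for illustration only.
x = list(range(1, 11))
y = [v * v for v in x]
rho_s, p_s = permutation_p_value(x, y, spearman, n_perm=2000)
rho_p, p_p = permutation_p_value(x, y, pearson, n_perm=2000)
print(rho_s, p_s, round(rho_p, 3), p_p)
```

With a sample of only 10 points, exhaustive enumeration of all permutations would also be feasible; random shuffling is used here only to keep the sketch short.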

Table 4.

The fit of a linear model for the relationship between exponents (γ_w and γ_l) and the relationship between cut-offs (a_w and a_l).

c₁ and c₃ stand for slopes and c₂ and c₄ stand for intercepts. The error bars correspond to one standard deviation. A Student’s t-test is applied to test whether the slopes are significantly different from one and whether the intercepts are significantly different from zero. The resulting p-values indicate that in all cases the slopes are compatible with being equal to one. The intercepts are compatible with zero for the exponents, but seem to be incompatible with zero for the cut-offs.
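
The kind of test described here can be sketched as follows, assuming an ordinary least-squares fit (the caption does not state the fitting method). The code returns the slope and intercept, their standard errors, and the t statistics for the hypotheses slope = 1 and intercept = 0. The data and the function name are hypothetical. Turning the t statistics into p-values requires the Student's t distribution with N − 2 degrees of freedom, which is not in the Python standard library (`scipy.stats.t.sf` is one option).

```python
import math

def linear_fit_with_t(x, y):
    """OLS fit y = slope * x + intercept with t statistics for
    H0: slope = 1 and H0: intercept = 0 (n - 2 degrees of freedom)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s2 = rss / (n - 2)
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1 / n + mx ** 2 / sxx))
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        "t_slope_vs_1": (slope - 1) / se_slope,
        "t_intercept_vs_0": intercept / se_intercept,
        "dof": n - 2,
    }

# Hypothetical exponents, for illustration only.
gamma_w = [1.85, 1.90, 1.95, 2.00, 2.05]
gamma_l = [1.88, 1.91, 1.99, 2.01, 2.09]
fit = linear_fit_with_t(gamma_w, gamma_l)
print({k: round(v, 3) for k, v in fit.items()})
```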

Fig 3.

(a) Probability mass functions f(n) of the absolute frequencies n of words and lemmas in La Regenta, together with their fits, under rescaling of both axes. The collapse of the tails indicates the compatibility of both power-law exponents. (b) The same for, from top to bottom, Artamène, Bragelonne (both in French), Seitsemän v., Kevät ja t., and Vanhempieni r. (all three in Finnish). The rescaled distributions are additionally multiplied by factors 1, 10⁻², etc., for a clearer visualization.

Fig 4.

The lower cut-off for the frequency distribution of lemmas (a_l) versus the lower cut-off for the frequency distribution of word forms (a_w).

The line a_l = a_w is also shown (solid line).

Fig 5.

(a) Number of words per lemma as a function of lemma absolute frequency n_l in Vanhempieni romaani (in Finnish) and in La Regenta. The values for the former have been shifted up slightly for clarity. (b) Frequency of words n_w as a function of the frequency of their lemmas n_l in La Regenta.

Fig 6.

Probability density D(n_l/n_w) of the frequency ratio for lemmas and words, n_l/n_w, in La Regenta.

Values of n_l smaller than n_w are disregarded, as they arise from words associated with more than one lemma. Bending for the largest n_l/n_w is expected, as the maximum of the ratio is given by n_l, which is not constant for each distribution but varies by half an order of magnitude (see plot legend).
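
The frequency ratio and the filtering step can be illustrated with a small sketch, assuming the text is available as a sequence of (word, lemma) pairs, one per token. The function name and the toy data are hypothetical, and the sketch only computes the ratios, not the density D.

```python
from collections import Counter

def lemma_word_ratios(pairs):
    """Map each (word, lemma) type to n_l / n_w, keeping only ratios >= 1."""
    n_w = Counter(word for word, _ in pairs)    # word-form frequencies
    n_l = Counter(lemma for _, lemma in pairs)  # lemma frequencies
    ratios = {}
    for word, lemma in set(pairs):
        ratio = n_l[lemma] / n_w[word]
        # n_l < n_w can only happen when a word form is split among lemmas.
        if ratio >= 1:
            ratios[(word, lemma)] = ratio
    return ratios

# Toy data: "saw" belongs to two lemmas ("see" and "saw").
pairs = [("saw", "see")] + [("saw", "saw")] * 3 + [("sees", "see")] * 2
ratios = lemma_word_ratios(pairs)
print(ratios)
```

In this toy example the word form "saw" occurs 4 times, but each of its two lemmas occurs only 3 times, so both of its pairs are discarded.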

Table 5.

Coverage of the vocabulary by the dictionary in each language, both at the word-type and at the token level.

The average for all texts is also included. Remember that we distinguish between a word type (corresponding to its orthographic form) and its tokens (actual occurrences in text).
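
The two coverage measures differ only in what is counted: distinct orthographic forms for type-level coverage, and occurrences for token-level coverage. A minimal sketch, assuming the dictionary is available as a set of word forms (the function name and the toy data are hypothetical):

```python
from collections import Counter

def dictionary_coverage(tokens, dictionary):
    """Return (type-level coverage, token-level coverage) as fractions."""
    counts = Counter(tokens)
    covered_types = [w for w in counts if w in dictionary]
    type_coverage = len(covered_types) / len(counts)
    token_coverage = sum(counts[w] for w in covered_types) / sum(counts.values())
    return type_coverage, token_coverage

# Toy example: "zyx" is not in the dictionary.
tokens = "the cat saw the dog and the zyx".split()
dictionary = {"the", "cat", "saw", "dog", "and"}
type_cov, token_cov = dictionary_coverage(tokens, dictionary)
print(round(type_cov, 3), round(token_cov, 3))
```

Because frequent words are more likely to be in the dictionary, token-level coverage is typically higher than type-level coverage, as in this toy case (7 of 8 tokens versus 5 of 6 types).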

Table 6.

Size of vocabulary V (i.e., number of types) when texts are decomposed into different sorts of types, namely: word-lemma-tag (w-l-t), plain words, lemma-POS (l-pos), lemma-POS of words in the dictionary (l-pos dic), lemmas, and lemmas of words in the dictionary (lemma dic).

The last of these provides the most radical transformation, as it yields the largest reduction in vocabulary size.
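
The different decompositions can be illustrated by counting distinct keys over the same tagged tokens. The sketch below assumes each token is a (word, lemma, tag) triple and, purely for illustration, takes the part of speech to be the first character of the tag; the tag set, the function name, and the toy data are hypothetical and not taken from the paper.

```python
def vocabulary_sizes(tagged_tokens, dictionary):
    """Number of types V under several decompositions of the same text."""
    def pos(tag):
        # Assumption for this sketch: the POS is the first character of the tag.
        return tag[0]

    in_dic = [t for t in tagged_tokens if t[0] in dictionary]
    return {
        "w-l-t": len(set(tagged_tokens)),
        "words": len({w for w, _, _ in tagged_tokens}),
        "l-pos": len({(l, pos(t)) for _, l, t in tagged_tokens}),
        "l-pos dic": len({(l, pos(t)) for _, l, t in in_dic}),
        "lemmas": len({l for _, l, _ in tagged_tokens}),
        "lemma dic": len({l for _, l, _ in in_dic}),
    }

# Toy tagged text; "zyx" is not in the dictionary.
tagged = [
    ("saw", "see", "VBD"), ("saw", "saw", "NN"), ("saws", "saw", "VBZ"),
    ("sees", "see", "VBZ"), ("dogs", "dog", "NNS"), ("dog", "dog", "NN"),
    ("zyx", "zyx", "NN"),
]
dictionary = {"saw", "saws", "sees", "dogs", "dog"}
sizes = vocabulary_sizes(tagged, dictionary)
print(sizes)
```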
