Rank Diversity of Languages: Generic Behavior in Computational Linguistics

Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: “heads” consist of words whose rank almost does not change in time, “bodies” are words of general use, while “tails” comprise context-specific words whose rank varies considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.


S1 Models for rank-frequency distributions
The rank-frequency distributions of words for different languages are very similar to each other, as shown in Fig. S1a. The distributions are also similar across centuries, as shown in Fig. S1b.
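The empirical rank-frequency curves of Fig. S1 are obtained directly from token counts. A minimal sketch in Python (the tokenization and the toy text are illustrative, not part of the original analysis):

```python
from collections import Counter

def rank_frequency(words):
    """Return ranks 1..V and frequencies normalized so that the most
    frequent word has relative frequency 1 (as in Fig. S1)."""
    counts = sorted(Counter(words).values(), reverse=True)
    fmax = counts[0]
    ranks = list(range(1, len(counts) + 1))
    freqs = [c / fmax for c in counts]
    return ranks, freqs

# Toy text for illustration only.
tokens = "the cat and the dog and the bird".split()
ranks, freqs = rank_frequency(tokens)
print(list(zip(ranks, freqs)))  # rank 1 has relative frequency 1.0
```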
We present five different distributions with distinct origins, all of which contain the common factor $1/k^a$. The distributions are given below, where the $N_i$ are normalization factors depending on the parameters $a$, $b$, and $\alpha$ of the different models, and $N$ is the total number of words. In Fig. S2 we compare the fits of these distributions with the observed curves. It can be seen that none of the distributions reproduces the data closely. We performed a $\chi^2$ test for all fits, with similar results; the best value corresponds to the fit proposed in [1], namely the double Zipf model (equation (S5)). In all cases we studied the p-value of the data, needed for an appropriate interpretation of the goodness of fit. In all cases, that is, for all years, all languages, and all models, this number was smaller than machine precision. This shows that none of these models captures the behavior of the data satisfactorily.

Figure: Word frequency $f_R$ as a function of the rank $k$ for English and several years, normalized so that the most frequent element has relative frequency one. The inset shows the unnormalized frequency $f$.

The origin of some of these models is similar. The following discussion shows how they can be encompassed in a common formulation.
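The common factor $1/k^a$ shared by these models can be fitted by least squares in log-log space. A minimal sketch; the synthetic data and the fitting routine are illustrative, not the procedure used for Fig. S2:

```python
import numpy as np

def fit_power_law(ranks, freqs):
    """Least-squares fit of log f = log C - a log k, i.e. the common
    factor f(k) = C / k^a shared by the five models."""
    logk = np.log(np.asarray(ranks, dtype=float))
    logf = np.log(np.asarray(freqs, dtype=float))
    slope, intercept = np.polyfit(logk, logf, 1)
    return -slope, np.exp(intercept)  # exponent a, prefactor C

# Synthetic Zipf data with a = 1 (illustrative, not corpus data).
k = np.arange(1, 1001)
a, C = fit_power_law(k, 1.0 / k)
print(a, C)  # close to 1.0 and 1.0
```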
Given a set of words forming a text, one can evaluate the number of times $N(k,t)$ that a certain word appears with rank $k$ at time $t$. If $B(k)$ and $D(k)$ denote, respectively, the probability per unit time that a word enters or leaves rank $k$, we have equation (S6). Here the two terms on the r.h.s. within the first curly brackets describe, respectively, the local growth rate and the overall decrease rate acting on $N(k,t)$; $\Sigma(t)$ is the total number of words at a given time $t$, and $F$ is a function that encodes global constraints on the total number of words. The terms within the second curly brackets give the balance arising from the birth $B(k)$ and death $D(k)$ contributions of first-neighbor words with ranks $k \pm 1$ at time $t$. If we consider the total number of words at a given time $t$ to be fixed, we can define the probability density of finding a word with rank $k$, or relative frequency distribution, by $n(k,t) = N(k,t)/\Sigma(t)$.

Substitution of equation (S7) and equation (S8) into equation (S6) leads to
where the bracket indicates a sum over all $k$ weighted by $n(k,t)$. We assume, for simplicity, that $\xi(k)$ is a linear function of the rank $k$, so that $\xi(k) = \xi_0 + \xi_1 k$, where $\xi_0$ and $\xi_1$ are constants. Then equation (S6) reduces to the master equation for a one-step process given in equation (S10). In what follows we shall only consider the case $\xi(k) = \xi_0$, so equation (S10) reduces to the general form of the master equation for a one-step process. If the changes in $k$ are small and we are only interested in solutions $n(k,t)$ that vary slowly with $k$, then $k$ may be treated as a continuous variable and we obtain a Fokker-Planck equation. For the stationary solutions $m(k)$ we obtain equation (S15), assuming the simplest expressions for the transition probabilities $D(k)$ and $B(k)$. These and additional results could also be obtained using the formalism of complex networks [2,3,4].
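The one-step master equation can be integrated numerically. A minimal sketch, assuming constant birth and death rates $B$ and $D$ (the simplest choice of transition probabilities mentioned above) and reflecting boundaries; for constant rates, detailed balance gives a geometric stationary distribution with $m(k+1)/m(k) = B/D$:

```python
import numpy as np

def evolve_one_step(n, B, D, dt=0.01, steps=20000):
    """Euler integration of the one-step master equation
        dn(k)/dt = B(k-1) n(k-1) + D(k+1) n(k+1) - [B(k) + D(k)] n(k)
    on a finite range of ranks with reflecting boundaries."""
    n = n.copy()
    for _ in range(steps):
        gain = np.zeros_like(n)
        gain[1:] += B[:-1] * n[:-1]   # births promote k-1 -> k
        gain[:-1] += D[1:] * n[1:]    # deaths demote k+1 -> k
        loss = (B + D) * n
        loss[-1] -= B[-1] * n[-1]     # no birth out of the last rank
        loss[0] -= D[0] * n[0]        # no death out of the first rank
        n = n + dt * (gain - loss)
    return n / n.sum()

# Constant rates: the stationary ratio m(k+1)/m(k) should approach B/D.
K, b, d = 10, 0.3, 0.6
m = evolve_one_step(np.ones(K) / K, np.full(K, b), np.full(K, d))
print(m[1] / m[0])  # close to b/d = 0.5
```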
With respect to the distribution of equation (S5), the derivation given in [1] is based on the assumption of two word regimes: a language core containing low-rank words, which do not affect the birth of new words, and the remaining high-rank words, which reduce the probability that new words are used.

Table S1 shows the most frequent words for the year 2000, with their translations and relative frequencies. Notice that these are very similar across languages. Table S2 shows the most frequent nouns for the years 1700, 1800, 1900, and 2000. There are similarities across languages and across centuries, but also important differences.

Figs. S3-S9 show rank trajectories of words for the languages studied, including our simulated language. The behavior is similar for all languages: words with low rank (heads) almost do not vary in time, while beyond the head the variation in rank depends on the rank itself, approximating a scale-invariant random walk. Notice that there is a higher variation at all scales before 1850. Further work is required to measure how much of this variation is due to having less data before 1850 and how much to language properties of the time. Fig. S10 shows the distribution of relative flights for all languages; see the main text for details.
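The scale-dependent random walk described above, with rank changes whose size grows with the rank itself followed by re-ranking, can be sketched as follows. The parameter values are illustrative, not those used for Figs. S3-S9:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ranks(n_words=1000, n_steps=200, sigma=0.1):
    """Gaussian random walk over ranks where the step size is
    proportional to the current rank, followed by re-ranking.
    `sigma` is an illustrative relative step size."""
    ranks = np.arange(1, n_words + 1, dtype=float)
    trajectories = np.empty((n_steps, n_words))
    for t in range(n_steps):
        # Perturb each word's rank by a Gaussian step of std sigma * rank.
        scores = ranks + rng.normal(0.0, sigma * ranks)
        # Re-rank: the order of the perturbed scores gives new ranks 1..N.
        ranks = np.argsort(np.argsort(scores)) + 1.0
        trajectories[t] = ranks
    return trajectories

traj = simulate_ranks()
# Heads (low ranks) barely move; tails vary at all scales.
print(traj[:, 0].std(), traj[:, -1].std())
```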

S3 Correlation of relative frequency changes
We studied the correlations of the relative frequency changes (flights) $d_t$, defined in the main text. We use the normalized version $\tilde{d}_t = (d_t - \langle d_t \rangle)/\sqrt{\langle d_t^2 \rangle - \langle d_t \rangle^2}$, where $\langle \cdot \rangle$ denotes an average over time. This normalization ensures that $\langle \tilde{d}_t \rangle = 0$ and $\langle \tilde{d}_t^2 \rangle = 1$. The time correlation is given by $C_\tau = \langle \tilde{d}_t \, \tilde{d}_{t+\tau} \rangle$. In principle, this quantity also depends on $t$, but usually this dependence is very weak, as in the present case, and one can ignore it.
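The normalized flights and their time correlation $C_\tau$ can be computed as in the sketch below; the white-noise input is used only to illustrate that the normalization gives $C_0 = 1$:

```python
import numpy as np

def autocorrelation(d, max_tau=10):
    """C(tau) = <d~_t d~_{t+tau}>, with d~ the flights standardized to
    zero mean and unit variance, so that C(0) = 1 by construction."""
    d = np.asarray(d, dtype=float)
    dn = (d - d.mean()) / d.std()
    T = len(dn)
    return np.array([np.mean(dn[: T - tau] * dn[tau:])
                     for tau in range(max_tau + 1)])

# White-noise flights (illustrative): no correlation for tau >= 1.
rng = np.random.default_rng(1)
C = autocorrelation(rng.normal(size=5000), max_tau=5)
print(C[0])  # 1.0 by construction
```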
In Fig. S11 we show the average of $C_\tau$ over 50 different ranks chosen at random, for the different languages as well as for the simulated language. We note that the correlation is very small, except at $\tau = 0$, where it equals 1 due to the normalization chosen, and at $\tau = 1$, where a negative value, typical of bounded sequences, is observed for the six languages studied here. The Gaussian random walk model reproduces these correlations well, except at $\tau = 1$. Note that some words are used not only as nouns, which can give them a higher rank; for example, été in French is summer, but also the past participle of être (to be).