The Evolution of the Exponent of Zipf's Law in Language Ontogeny

It is well known that word frequencies arrange themselves according to Zipf's law. However, little is known about the dependency between the parameters of the law and the complexity of a communication system. Many models of the evolution of language assume that the exponent of the law remains constant as the complexity of a communication system increases. Using longitudinal studies of child language, we analysed the word rank distribution for the speech of children and adults participating in conversations. The adults typically included family members (e.g., parents) or the investigators conducting the research. Our analysis of the evolution of Zipf's law yields two main unexpected results. First, in children the exponent of the law tends to decrease over time, while this tendency is weaker in adults, suggesting that it is not a mere mirror effect of adult speech. Second, although the exponent of the law is more stable in adults, their exponents fall below 1, which is the typical value of the exponent assumed in both children and adults. Our analysis also shows a tendency of the mean length of utterances (MLU), a simple estimate of syntactic complexity, to increase as the exponent decreases. The parallel evolution of the exponent and a simple indicator of syntactic complexity (MLU) supports the hypothesis that the exponent of Zipf's law and linguistic complexity are inter-related. The assumption that Zipf's law for word ranks is a power-law with a constant exponent of one in both adults and children needs to be revised.


Age ranges of target children
A summary of the age ranges of the target children included in our analyses is shown in Table 1 for English, Table 2 for German and Table 3 for Dutch and Swedish. Ages are given in months. Only the names of the target children employed in our study are shown (target children with fewer than two time points are excluded). 'Age points' refers to the number of different ages before applying the filter that excludes transcripts from 5 years onwards (see Methods in the main article). Initial and final age refer to the ages at which the study started and ended, respectively. Tables 2 and 3 follow the format of Table 1.

The cut-offs for normalization
The cut-offs for normalization, T* (by length) and n* (by observed vocabulary size), were chosen based on the summary of the raw statistics of T and n in Tables 4 and 5. We focused on the major classes of roles: 'target child', 'father', 'mother' and 'investigator'. T* = 500 and n* = 100 were chosen as round lower bounds of the smallest mean T and the smallest mean n, respectively, among the major classes of roles at the level of all languages mixed (i.e. the mean T and the mean n of investigators). T* and n* were then halved to increase the number of participants and the number of ages considered for each participant, yielding T* = 250 and n* = 50.
Table 4. N is the number of individuals analyzed for a given role class and language category that have at least m* = 5 time points (see Methods for a justification of this lower bound). For each individual, four statistics concerning T are computed: the minimum (min), the mean (mean), the maximum (max) and the standard deviation (dev) over all his/her transcripts. The mean plus/minus 1 standard deviation of these four statistics is shown for each role class and language category (when N = 1, a standard deviation of 0 is assumed).
Table 5. The same format as in Table 4 is adopted. In our analyses, n is equivalent to r_M, one of the parameters of the right-truncated zeta distribution.
Tables 6 and 7 show the results of the analysis of the dependency between α and age for cut-offs at T* = 250 and n* = 50, respectively.
Table 6. Analysis of the correlation between α and age from two perspectives: the sign of the correlation and the significance of the correlation. Five language categories, i.e. All (all languages mixed), Dutch, English, German and Swedish, are considered.
N is the number of individuals analyzed for a given role class and language category that had at least m* = 5 different points of time (the minimum number of points needed to show a significant correlation between a parameter and age through a two-sided correlation test at a significance level of 0.05, see the Methods section). This filter was applied for consistency between the analysis of the sign of the dependency and the analysis of its significance. For each individual, the Spearman rank correlation [1] between age and a certain parameter of the right-truncated distribution was computed. In the analysis of the sign of the correlation, two counts are provided for each role class and language category, namely N+ and N−. N+ and N− are, respectively, the number of individuals with a positive and a negative correlation (regardless of the significance of the correlation). In the analysis of the significance of the correlation, three counts are provided for each role class and language category, namely N_S+, N_S− and N?. N_S+ and N_S− are the number of individuals with a statistically significant positive and negative correlation, respectively. N? is the number of individuals with a correlation that is not significant. Significance was decided by a two-sided Spearman rank correlation test [1] at a significance level of 0.05. ↑ and ↓ indicate counts that are, respectively, significantly high or significantly low according to a binomial test (see Methods).
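The counting procedure described in this caption can be illustrated with a minimal sketch. It is not the code used for the tables: the function names are hypothetical, the Spearman correlation is computed as the Pearson correlation of average ranks, and the binomial tail assumes that positive and negative signs are equally likely (p = 0.5) under the null hypothesis, which is an assumption about the test detailed in Methods.

```python
from math import comb


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation (undefined if x or y is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def sign_counts(rhos):
    """Return (N+, N-): individuals with positive and negative correlation."""
    n_plus = sum(1 for r in rhos if r > 0)
    n_minus = sum(1 for r in rhos if r < 0)
    return n_plus, n_minus


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p); p = 0.5 is an assumption."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical data: ages in months and fitted exponents of three individuals.
    individuals = [
        ([20, 24, 28, 32, 36], [1.30, 1.22, 1.15, 1.10, 1.02]),
        ([18, 22, 26, 30, 34], [1.25, 1.27, 1.18, 1.12, 1.05]),
        ([21, 25, 29, 33, 37], [1.10, 1.12, 1.09, 1.15, 1.11]),
    ]
    rhos = [spearman_rho(ages, alphas) for ages, alphas in individuals]
    n_plus, n_minus = sign_counts(rhos)
    # Is N- larger than expected by chance among individuals with a defined sign?
    print(n_plus, n_minus, binomial_upper_tail(n_minus, n_plus + n_minus))
```

The significance of each individual correlation (the counts N_S+, N_S− and N?) would additionally require the two-sided Spearman test, which this sketch leaves out.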
Table 7. Methods (other than the normalization) and format are the same as in Table 6.
Table 8 shows the results of the analysis of the dependency between r_M and age for this normalization.
Table 8. Methods (other than the target parameter) and format are the same as in Table 6.
Figures 4 and 5 show the actual dependency between α and MLU for cut-offs at T* = 250 and n* = 50, respectively.
Tables 9 and 10 show the results of the analysis of the dependency between MLU and age for cut-offs at T* = 250 and n* = 50, respectively.
Table 9. Methods (other than the target variables) and format are the same as in Table 6.
Table 10. Methods (other than the normalization and the target variables) and format are the same as in Table 6.

Dependencies between α and MLU
3.2. Normalization by random sampling.
3.2.1. Dependency of parameters with age
The analysis of the correlation between α and time supports the idea that the behavior of infants and adults differs notably. The analysis of the sign of the correlation between α and age confirms the tendency of α to decrease over time: N+ is never significantly high, while N− is significantly large for target children, with the sole exception of Swedish, where the number of target children is very small. N− is also significantly large in investigators and parents, depending on the language (Tables 11 and 12 for length normalization; Tables 13 and 14 for observed vocabulary size normalization). If the significance of the correlation between α and age is taken into account, then it turns out that N_S+ is very small (zero in the majority of cases) and never significantly large (Tables 11 and 12 for length normalization; Tables 13 and 14 for observed vocabulary size normalization).
Tables 11 to 14. Methods (other than the normalization) and format are the same as in Table 6.
The analysis of the sign of the correlation between r_M and age confirms the tendency of r_M to increase over time: N− is never significantly high, while N+ is significantly large for target children, with the sole exception of Swedish, where the number of target children is very small. N+ is also significantly large in investigators, parents and other adults, depending on the language (Tables 15 and 16). If the significance of the correlation between r_M and age is taken into account, then it turns out that N_S− is very small (zero in the majority of cases) and never significantly large (Tables 15 and 16). Interestingly, N_S+ is significantly large for all target children. Compared with the analysis of α versus time, the ratio N_S+/N (where N = N_S+ + N_S− + N?) is more balanced between target children and the adults for whom N_S+ is significantly large. These results confirm the previous finding based upon prefix normalization, namely that the increase of r_M with time does not distinguish children from adults as clearly as α does, and they also confirm that prefix normalization is not omitting vital information.
Tables 15 and 16. Methods (other than the normalization and the target parameter) and format are the same as in Table 6.
4. The right-truncated zeta distribution: α = 1 versus free α
For each corpus and major class of role (target child, father, investigator and mother), we compared the quality of the fit of two theoretical distributions: the right-truncated zeta distribution with two parameters (α and r_M) and a right-truncated zeta distribution with only one free parameter, r_M (with α = 1). The control right-truncated distribution with α = 1 was also fitted by maximum likelihood. The maximum likelihood estimator of r_M coincides with n, the maximum rank of the sample.
To see this, consider Eq. 6 of the main text with α = 1 and n > 1, and notice that H(r_M, 1) is a monotonically increasing function of r_M, so that the likelihood is largest for the smallest r_M compatible with the sample, namely n. The quality of the fit was evaluated using Akaike's Information Criterion (AIC), a metric that combines a quantitative measure of the goodness of the fit to the real data with a penalty for the number of parameters used [2]. In our analysis, we adopted a variant that incorporates a correction for small samples, which is defined as [3] AIC_k = −2 log(L) + 2kT/(T − k − 1), where k is the number of free parameters of the right-truncated zeta distribution (k = 1 or k = 2 in our case), T is the length of the text sample in words and log(L) is the log-likelihood as defined in the main article. The lower the value of AIC_k of a model with regard to that of alternative models, the better the model. If no size/length normalization is used, the right-truncated distribution with two parameters gives a better fit in the majority of cases (Table 17).
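The comparison of the two models can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the right-truncated zeta distribution has the form p(r) = r^(−α)/H(r_M, α) with H(r_M, α) = Σ_{r=1..r_M} r^(−α), that r_M is set to n in both models, and that the small-sample correction takes the usual form 2kT/(T − k − 1). The function names, the golden-section search and its bracket [0, 5] for α are illustrative choices.

```python
import math


def harmonic(r_max, alpha):
    """Generalized harmonic number H(r_max, alpha) = sum_{r=1..r_max} r^(-alpha)."""
    return sum(r ** -alpha for r in range(1, r_max + 1))


def log_likelihood(freqs, alpha, r_max):
    """Log-likelihood under p(r) = r^(-alpha) / H(r_max, alpha).

    freqs[i] is the frequency of the word of rank i + 1.
    """
    total = sum(freqs)
    weighted_log_ranks = sum(f * math.log(r) for r, f in enumerate(freqs, start=1))
    return -alpha * weighted_log_ranks - total * math.log(harmonic(r_max, alpha))


def fit_alpha(freqs, r_max, lo=0.0, hi=5.0, tol=1e-6):
    """Maximize the log-likelihood over alpha by golden-section search.

    The bracket [lo, hi] is an assumption; the log-likelihood is concave in alpha.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(freqs, c, r_max) > log_likelihood(freqs, d, r_max):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


def aic_corrected(loglik, k, T):
    """Small-sample corrected AIC, assuming the form -2 log L + 2kT / (T - k - 1)."""
    return -2 * loglik + 2 * k * T / (T - k - 1)


if __name__ == "__main__":
    freqs = [8, 4, 2, 1, 1]  # hypothetical rank frequencies
    T, n = sum(freqs), len(freqs)
    alpha_hat = fit_alpha(freqs, n)
    aic_1 = aic_corrected(log_likelihood(freqs, 1.0, n), 1, T)        # alpha = 1
    aic_2 = aic_corrected(log_likelihood(freqs, alpha_hat, n), 2, T)  # free alpha
    print(alpha_hat, aic_1, aic_2)
```

Because H(r_M, α) grows with r_M, `log_likelihood` decreases as `r_max` moves above n for a fixed α, which is why the sketch fixes `r_max` at n.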

Normalization by constant length in words
If fragments of the same T (i.e. the same length in words) are considered, the right-truncated distribution with two parameters gives a better fit than that with one parameter, whether a prefix of length T* is taken for each time point (see Table 18 for T* = 250 and Table 19 for T* = 500) or a random sample of size T* is taken (see Table 20 for T* = 250 and Table 21 for T* = 500).
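The two ways of obtaining a fragment of length T* can be sketched as below. This is an illustration under stated assumptions rather than the procedure of the study: the function names are hypothetical, transcripts shorter than T* are simply discarded (returned as None), and the random sample is assumed to be drawn uniformly without replacement.

```python
import random
from collections import Counter


def prefix_sample(tokens, t_star):
    """First t_star tokens, or None if the transcript does not reach the cut-off."""
    if len(tokens) < t_star:
        return None
    return tokens[:t_star]


def random_sample(tokens, t_star, rng=None):
    """t_star tokens drawn uniformly without replacement (assumed sampling scheme)."""
    if len(tokens) < t_star:
        return None
    rng = rng or random.Random()
    return rng.sample(tokens, t_star)


def rank_frequencies(tokens):
    """Word frequencies sorted in decreasing order (frequency of rank 1, 2, ...)."""
    return sorted(Counter(tokens).values(), reverse=True)


if __name__ == "__main__":
    transcript = "the dog saw the cat and the cat saw the dog".split()
    print(rank_frequencies(prefix_sample(transcript, 8)))
    print(rank_frequencies(random_sample(transcript, 8, random.Random(0))))
```

Passing a seeded `random.Random` makes the random fragments reproducible across runs.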

Normalization by constant number of different words
If fragments of the same n (i.e. the same number of different words) are considered, the right-truncated distribution with two parameters gives a better fit than that with one parameter, whether a prefix of n* different words is taken for each time point (see Table 22 for n* = 50 and Table 23 for n* = 100) or a random sample of n* different words is taken (see Table 24 for n* = 50 and Table 25 for n* = 100).
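A fragment with a fixed number of different words can be sketched in a similar way. The details here are assumptions, since the text does not specify them: the prefix is taken to be the shortest one that contains n* different words, and the random variant is taken to be that same prefix computed on a shuffled copy of the transcript. All function names are hypothetical.

```python
import random


def prefix_with_n_types(tokens, n_star):
    """Shortest prefix containing n_star different words (assumed definition).

    Returns None if the transcript has fewer than n_star different words.
    """
    seen = set()
    for i, token in enumerate(tokens):
        seen.add(token)
        if len(seen) == n_star:
            return tokens[: i + 1]
    return None


def random_with_n_types(tokens, n_star, rng=None):
    """Same as prefix_with_n_types, applied to a shuffled copy (assumed scheme)."""
    rng = rng or random.Random()
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return prefix_with_n_types(shuffled, n_star)


if __name__ == "__main__":
    transcript = "the dog saw the cat and the cat saw the dog".split()
    print(prefix_with_n_types(transcript, 4))
    print(random_with_n_types(transcript, 4, random.Random(1)))
```

With either function, the resulting fragment has n = n*, so the observed vocabulary size is the same for every time point that survives the cut-off.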

Brief discussion
For all the normalizations considered above and given a language and a class of role, the percentage of cases where the one-parameter truncated zeta distribution yields a better fit than the two-parameter version is less than 7%. Interestingly, the success of the two-parameter version drops when no normalization is used: various combinations of language and role class reach at least 7% of cases where the one-parameter function is better than the two-parameter version (recall Table 17). This suggests that normalization improves the adequacy of the truncated zeta distribution with two parameters, but this could simply be due to the loss of individuals producing small samples. A sample that is too small may not contain enough information to discriminate accurately between the one-parameter and the two-parameter versions and may not reach the cut-off imposed for normalization. In fact, various classes of roles do not survive normalization (they are present in Table 17 but have disappeared from the normalization tables).
Table 17. AIC_1 and AIC_2 are, respectively, the corrected Akaike information criterion for the right-truncated zeta distribution with one parameter (α = 1 and free r_M) and that with two parameters (α and r_M). For each language category and role class, the percentage of times (over all the available individual-age pairs where the fit can be performed) that AIC_1 < AIC_2, AIC_1 > AIC_2 and AIC_1 = AIC_2 is shown.
Table 18. Comparison of AICs for prefixes of the same length T in words (T* = 250). The same format as in Table 17 is adopted.
Table 19. Comparison of AICs for prefixes of the same length T in words (T* = 500). The same format as in Table 17 is adopted.
Table 20. Comparison of AICs for random samples of the same length T in words (T* = 250). The same format as in Table 17 is adopted.
Table 21. Comparison of AICs for random samples of the same length T in words (T* = 500). The same format as in Table 17 is adopted.
Table 22. Comparison of AICs for prefixes of the same number n of different words (n* = 50). The same format as in Table 17 is adopted.
Table 23. Comparison of AICs for prefixes of the same number n of different words (n* = 100). The same format as in Table 17 is adopted.
Table 24. Comparison of AICs for random samples of the same number n of different words (n* = 50). The same format as in Table 17 is adopted.
Table 25. Comparison of AICs for random samples of the same number n of different words (n* = 100). The same format as in Table 17 is adopted.
5. The range of variation of α: further support for the evolution of α
Tables 26 and 27 show the range of variation of α for normalization by prefix at lower cut-offs than those considered in the main article, T* = 250 and n* = 50, respectively.