Fig 1.
Word frequency distributions for English and German UDHR.
Bars indicate frequencies of occurrence in German (blue) and English (red) for the highest-ranking words in the UDHR text.
Fig 2.
Word frequency distributions with ZM parameter approximations for selected languages.
Dots represent frequencies and ranks for the 50 highest-frequency words in English (red), Fijian (green), German (blue) and Hungarian (purple). Lines reflect Zipf-Mandelbrot approximations. Lower frequencies towards the first ranks are associated with more word types in the tails of distributions. More diverse languages have more hapax legomena (i.e. words with frequency = 1); for example, Hungarian has the most hapax legomena, followed by German, English, and Fijian, in this order.
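The quantities named in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and the Zipf-Mandelbrot form f(r) = C / (β + r)^α with parameter names C and β is the standard textbook parameterization, assumed here because the caption names only α.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (word, frequency) pairs, highest frequency first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def hapax_legomena(tokens):
    """Return the sorted list of word types that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, freq in counts.items() if freq == 1)


def zipf_mandelbrot(rank, C, beta, alpha):
    """Assumed standard form: predicted frequency at a given rank (rank >= 1)."""
    return C / (beta + rank) ** alpha


# Toy input (hypothetical, not UDHR data).
tokens = "the cat and the dog and the bird".split()
ranked = rank_frequency(tokens)
hapaxes = hapax_legomena(tokens)
```

On this toy input, `ranked` begins with the most frequent word and `hapaxes` holds the three words that occur once; fitting α to real rank-frequency data would need an optimizer, which this sketch leaves out.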
Table 1.
Data sets for phylogenetic signal analyses.
Table 2.
Data set for PGLS regression.
Fig 3.
Lexical diversity distribution.
Scaled LDT measures for 647 languages (histogram with grey bars), with smoothing function overlaid (red). The corresponding normal distribution is plotted in blue (dashed line).
Fig 4.
Locations of 647 languages along ZM’s α, Hw and TTR (centered and scaled). Highly diverse languages cluster towards the upper-right corner in the back (highest values), whereas lexically redundant languages cluster towards the lower-left corner in the front (lowest values). To illustrate between-family variation, Altaic (yellow squares), Indo-European (green squares) and Creole languages (red squares) are highlighted among languages of other families (grey dots).
Fig 5.
Lexical diversity space for Indo-European languages.
Locations of Indo-European languages along ZM’s α, Hw and TTR (UDHR only). High-LDT languages are found in the upper-right corner (e.g. Lithuanian, Marathi), whereas low-LDT languages are found in the lower-right corner (e.g. Low Saxon, English, Afrikaans).
Fig 6.
Linear model for the relationship between the ratio of L2 to L1 speakers (logarithmically transformed) and scaled lexical diversities. Model parameters (β-coefficients, R²-values and t-values) are displayed in Table 3. The blue line indicates a linear model with the respective intercept and slope (coefficient) and 95% confidence intervals.
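As a rough illustration of the model this caption describes, the sketch below fits an ordinary least-squares line to a log-transformed speaker ratio. It is not the authors' analysis: the function names are hypothetical, the natural logarithm is an assumption (the caption only says "logarithmically transformed"), and confidence intervals are not computed.

```python
import math


def log_ratio(l2_speakers, l1_speakers):
    """Log-transformed ratio of L2 to L1 speakers (natural log assumed)."""
    return math.log(l2_speakers / l1_speakers)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up, exactly linear toy values (not data from the study).
x = [0.0, 1.0, 2.0, 3.0]    # stands in for log-transformed L2/L1 ratios
y = [1.0, -1.0, -3.0, -5.0]  # stands in for scaled lexical diversity
intercept, slope = fit_line(x, y)
```

Here the predictor would be `log_ratio(l2, l1)` per language and the response the scaled lexical diversity; a negative slope corresponds to lower diversity at higher L2 ratios.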
Table 3.
Results for linear regression model.
Table 4.
Results for linear mixed-effects regression.
Fig 7.
Scatterplots of log-transformed ratios of L2 speakers versus LDTs facetted by language families. Colored lines represent linear models by families with 95% confidence intervals. Languages of families with fewer than 10 data points are subsumed under “Other”. Note that this grouping is done for plotting only; for statistical modeling, language families are not collapsed.
Fig 8.
Scatterplots of log-transformed ratios of L2 speakers versus LDTs facetted by language regions. Colored lines represent linear models by regions with 95% confidence intervals. Languages of regions with fewer than 10 texts are subsumed under “Other”. Note that this grouping is done for plotting only; for statistical modeling, language regions are not collapsed.
Fig 9.
Regression plots by LDT measures.
Scatterplots of log-transformed ratios of L2 speakers versus LDTs facetted by LDT measures. Lines represent linear models by families with 95% confidence intervals.
Fig 10.
Regression plots by text types.
Scatterplots of log-transformed ratios of L2 speakers versus LDTs facetted by text type. Lines represent linear models by families with 95% confidence intervals.
Table 5.
Results for the phylogenetic signal analysis (mean λ).
Table 6.
Results for PGLS.