Adaptive Communication: Languages with More Non-Native Speakers Tend to Have Fewer Word Forms

doi:10.1371/journal.pone.0128254

Fig 1.

Word frequency distributions for English and German UDHR.

Bars indicate frequencies of occurrence in German (blue) and English (red) for the highest-ranking words in the UDHR text.

Fig 2.

Word frequency distributions with ZM parameter approximations for selected languages.

Dots represent frequencies and ranks for the 50 highest-frequency words in English (red), Fijian (green), German (blue) and Hungarian (purple). Lines show Zipf-Mandelbrot approximations. Lower frequencies at the first ranks are associated with more word types in the tails of the distributions. Lexically more diverse languages have more hapax legomena (i.e. words with frequency = 1): Hungarian has the most, followed by German, English and Fijian, in this order.
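
The quantities named in this caption can be illustrated with a minimal sketch. It assumes the common Zipf-Mandelbrot form f(r) = C / (β + r)^α, naive whitespace tokenization, and a toy sentence; the function names are hypothetical and the article's own preprocessing and fitting procedure may differ.

```python
from collections import Counter

def rank_frequencies(text):
    """Return word frequencies sorted from most to least frequent."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return sorted(Counter(tokens).values(), reverse=True)

def hapax_count(freqs):
    """Count hapax legomena, i.e. word types with frequency 1."""
    return sum(1 for f in freqs if f == 1)

def zipf_mandelbrot(rank, c, alpha, beta):
    """Assumed Zipf-Mandelbrot form: f(r) = C / (beta + r) ** alpha."""
    return c / (beta + rank) ** alpha

# Toy text for illustration only, not data from the article.
sample = "the cat saw the dog and the dog saw a bird"
freqs = rank_frequencies(sample)
hapaxes = hapax_count(freqs)
```

A flatter head of the distribution (lower frequencies at rank 1, 2, ...) leaves more tokens to be spread over rare types, which is why it tends to go together with a higher hapax count.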

Table 1.

Data sets for phylogenetic signal analyses.

Table 2.

Data set for PGLS regression.

Fig 3.

Lexical diversity distribution.

Scaled LDT measures for 647 languages (histogram with grey bars), with smoothing function overlaid (red). The corresponding normal distribution is plotted in blue (dashed line).

Fig 4.

Lexical diversity space.

Locations of 647 languages along ZM’s α, H_w and TTR (centered and scaled). Highly diverse languages cluster towards the upper-right corner in the back (highest values), whereas lexically redundant languages cluster towards the lower-left corner in the front (lowest values). To illustrate between-family variation, Altaic (yellow squares), Indo-European (green squares) and Creole languages (red squares) are highlighted among languages of other families (grey dots).
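
Two of the three axes can be sketched in a few lines. The example below assumes TTR is the number of types divided by the number of tokens, H_w is a plug-in Shannon entropy over word frequencies in bits, and "centered and scaled" means z-scoring with the sample standard deviation. The function names are hypothetical, and the article's own entropy estimator may differ.

```python
import math
from collections import Counter
from statistics import mean, stdev

def type_token_ratio(tokens):
    """Number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def word_entropy(tokens):
    """Plug-in Shannon entropy of the word frequency distribution, in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Toy tokens for illustration only, not data from the article.
tokens = "a b a b".split()
ttr = type_token_ratio(tokens)
h_w = word_entropy(tokens)
```

Centering and scaling puts measures with different units on a common axis, so that one language's position can be compared across all three dimensions.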

Fig 5.

Lexical diversity space for Indo-European languages.

Locations of Indo-European languages along ZM’s α, H_w and TTR (UDHR only). High-LDT languages are found in the upper-right corner (e.g. Lithuanian, Marathi); low-LDT languages are found in the lower-right corner (e.g. Low Saxon, English, Afrikaans).

Fig 6.

Linear regression.

Linear model for the relationship between the ratio of L2 to L1 speakers (logarithmically transformed) and scaled lexical diversities. Model parameters (β-coefficients, R²-values and t-values) are displayed in Table 3. The blue line indicates the fitted linear model with the respective intercept and slope (coefficient), shown with 95% confidence intervals.
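
The shape of such a model can be sketched with ordinary least squares on a single predictor. The example assumes a natural-log transform of the L2/L1 ratio (the caption does not state the base) and uses hypothetical function names; it does not reproduce the article's model, data or confidence intervals.

```python
import math

def log_l2_ratio(l2_speakers, l1_speakers):
    """Log-transformed ratio of L2 to L1 speakers (natural log assumed)."""
    return math.log(l2_speakers / l1_speakers)

def ols_fit(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy values for illustration only, not data from the article.
intercept, slope = ols_fit([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

In the article's setting, x would be the log-transformed speaker ratio and y the scaled lexical diversity, so a negative slope would correspond to lower diversity in languages with proportionally more L2 speakers.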

Table 3.

Results for linear regression model.

Table 4.

Results for linear mixed-effects regression.

Fig 7.

Regression plots by families.

Scatterplots of log-transformed ratios of L2 speakers versus LDTs, faceted by language family. Colored lines represent linear models by family with 95% confidence intervals. Languages of families with fewer than 10 data points are subsumed under “Other”. Note that this is done for plotting only; for statistical modeling, language families are not collapsed.

Fig 8.

Regression plots by regions.

Scatterplots of log-transformed ratios of L2 speakers versus LDTs, faceted by language region. Colored lines represent linear models by region with 95% confidence intervals. Languages of regions with fewer than 10 texts are subsumed under “Other”. Note that this is done for plotting only; for statistical modeling, language regions are not collapsed.

Fig 9.

Regression plots by LDT measures.

Scatterplots of log-transformed ratios of L2 speakers versus LDTs, faceted by LDT measure. Lines represent linear models by LDT measure with 95% confidence intervals.

Fig 10.

Regression plots by text types.

Scatterplots of log-transformed ratios of L2 speakers versus LDTs, faceted by text type. Lines represent linear models by text type with 95% confidence intervals.

Table 5.

Results for the phylogenetic signal analysis (mean λ).

Table 6.

Results for PGLS.
