Conceived and designed the experiments: GL. Analyzed the data: GL RD. Contributed reagents/materials/analysis tools: GL RD. Wrote the paper: GL RD.
The authors have declared that no competing interests exist.
Languages differ greatly both in their syntactic and morphological systems and in the social environments in which they exist. We challenge the view that language grammars are unrelated to the social environments in which they are learned and used.
We conducted a statistical analysis of >2,000 languages using a combination of demographic sources and the World Atlas of Language Structures, a database of structural language properties. We found strong relationships between linguistic factors related to morphological complexity and demographic/socio-historical factors such as the number of language users, geographic spread, and degree of language contact. The analyses suggest that languages spoken by large groups have simpler inflectional morphology than languages spoken by smaller groups, as measured on a variety of factors such as case systems and complexity of conjugations. Additionally, languages spoken by large groups are much more likely to use lexical strategies in place of inflectional morphology to encode evidentiality, negation, aspect, and possession. Our findings indicate that just as biological organisms are shaped by ecological niches, language structures appear to adapt to the environment (niche) in which they are being learned and used. As adults learn a language, features that are difficult for them to acquire are less likely to be passed on to subsequent learners. Languages used for communication in large groups that include adult learners appear to have been subjected to such selection. Conversely, the morphological complexity common to languages used in small groups increases redundancy, which may facilitate language learning by infants.
We hypothesize that language structures are subjected to different evolutionary pressures in different social environments. Just as biological organisms are shaped by ecological niches, language structures appear to adapt to the environment (niche) in which they are being learned and used. The proposed
Although the largest languages are spoken by millions of people spread over vast geographic areas, most languages are spoken by relatively few individuals over comparatively small areas. The median number of speakers for the 6,912 languages catalogued by the Ethnologue is only 7,000, compared to the mean of over 828,000
Just as there are socio-historical and demographic differences among the world's languages, there are also vast differences among languages in morphology and syntax
Languages with richer morphological systems are said to be more overspecified
The degree and specificity of morphological encoding can reach astounding levels. For example, Karok—a language of N.W. California—has morphological suffixes for forms of containment
Attempts to establish relationships between social and linguistic structure date back at least a century
Languages with histories of adult learning have been argued to be morphologically simpler, less redundant, and more regular/transparent
The primary goal of the present work is to examine whether non-spurious relationships exist between social and linguistic structure by using large-scale demographic and linguistic databases. A secondary goal is to provide a tentative framework within which to understand the reported results—the
In assessing the relationship between social and linguistic structure, it is useful to distinguish two main contexts (niches) in which languages are learned and used: the
To assess relationships between social and linguistic structure we constructed a dataset that combined social/demographic and typological information for 2,236 languages. Grammatical information was obtained from the World Atlas of Language Structures (WALS)
Model | |||||
Feature | Observed Pattern | Population (Log Speakers) | Area (Log km2) | Ling Contact (Log ling. neighbors) | Model Fits |
1. Fusion of inflectional formatives (20) § | Isolating > Concatenating | **$ | x | . | 358/138/140 |
2. Inflectional Morphology(26) § | Little or None > Present | ** | . | . | 688/678/680 |
3. | Fewer Cases > More Cases | **$ | x | x | 795/920/912 |
4. Case Syncretism (28) § | Core/Non-Core Cases > Core Only = No Syncretism | ◊$ | ◊ | ◊ | 103/89/93 |
5. Alignment of Case markings of Full NPs (98) § | Nom/Acc > Erg/Abs | **$ | ** | ** | 437/348/349 |
6. Inflectional Synthesis of the Verb (categories per word)(22) § (See | Few Forms > Many Forms | **$ | ** | ** | 450/451/454 |
7. Alignment of Verbal Person Marking (100) § | Neutral ≥ Ergative = Accusative > Context Dependent | ** | ** | x | 1083/818/821 |
8. Person Marking on Verbs (102) | None = Agent > Agent & Patient = Patient Only > Agent or Patient | ** | ** | ** | 1373/911/923 |
9. Person Marking on Adpositions (48) §(see | None > Pronoun > Pronoun + Noun | **$ | ◊1 | ** | 640/498/495 |
10. Syncretism in Verbal Person/Number Marking (29) | Syncretic > None | **$ | ◊ | ** | 207/184/188 |
11. Situational Possibility (74) § | Verbal > Morphological | **$ | ** | ** | 250/246/249 |
12. Epistemic Possibility (75) § | Verbal > Morphological | **$ | ** | ** | 177/112/112 |
13. Overlap b/w Epistemic and Situational Possibility (76) § | Situational/Epistemic Collapsed > Separate Markers | **$ | ◊ | ** | 501/350/350 |
14. Coding of Evidentiality (77) | No Gram. Evidentials > Gram. Evidentials | **$ | . | . | 497/536/537 |
15. Coding of Negation (112) § | Word > Affix ≥ Double Neg ≥ Particle ≥ Aux. Verb ≥ Word/Affix Variation | **$ | ** | ** | 2961/2454/2468 |
16. Coding/Occurrence of Plurality (34) § | Obligatory > Optional [word > affix/clitic] > None | **$ | ◊ | ◊ | 1055/807/816 |
17. Associative Plural (36) § | No assoc. Plural > Assoc. Plural | ◊$ | . | . | 200/201/205 |
18. Polar Question coding (92) § | Question particle > No Question particle | **$ | ** | ** | 1022/979/979 |
19. Future Tense (67) § (see | No Morph > Morph. | **$ | ◊ | ◊ | 320/295/294 |
20. Past Tense (66) § | Simple Past > No Morph Past > 2–3 Remoteness Dist. > 3+ Remoteness Dist. | **$ | ◊1 | ◊ | 617/466/458 |
21. Perfective/Imperfective (65) | Morph. Distinction > No Morph Distinction | ◊$ | ◊ | . | 330/303/304 |
22. Morphological Imperative (70) | Sing only > Not Morph. Marked ≥Sing & Plural ≥ Sing. Syncretic with Plural | **$ | x | x | 1395/1228/1223 |
23. Coding of Possessives (57) § (see | No possessive affix > Possessive Affix | **$ | ** | ** | 757/826/828 |
24. Possessive Classification (59)§ | No classification > 2 Classes > 3–5 Classes | **$ | ** | ** | 514/477/480 |
25. Optative (73) § | Not Marked > Morphologically Marked | . | **1 | x | 264/264/250 |
26. Definite/Indefinite Articles (38–39) § | None ≥ Both (Lexical) = Only Def. or Only Indef. ≥ Both (Affixes) | . | **1 | . | 1359/1178/1169 |
27. Distance distinctions in demonstratives (41) | No distance contrasts > 2 Contrasts ≥ 2+ Contrasts | **$ | . | ** | 501/471/474 |
28. Expression of Pronominal Subjects (101) § | Oblig. Lexical = Opt. Lexical > Affixes/Clitics | ** | ◊ | ** | 1102/1011/1012 |
Smaller values indicate better fits.
§ = Demographics and geographic location predict typology better than geographic location alone (χ2 model comparison, p<.05).
$ = Predictive power of population is reduced (significantly larger residual deviations) by randomly shuffling languages within their families. Indicates that reported effects generalize to within language families.
** = Reported pattern is significant (p between 0.05 and 10^-11) after controlling for language family.
◊ = Pattern no longer significant (p≥.05) after controlling for language family.
1 = Area and Number of Neighbors are significant predictors controlling for population.
. = Consistent with the pattern reported, but not significant.
x = Pattern after controlling for geographic covariates is non-significantly inconsistent with the pattern observed without controlling for geographic location.
Compared to languages spoken in the
Are more likely to be classified by typologists as
Contain fewer case markings (3), and have case systems with a higher degree of case syncretism (4) (further reducing the number of morphological distinctions). Nominative/accusative alignment is more prevalent than ergative/absolutive alignment (5).
Have fewer grammatical categories marked on the verb (6) and are less likely to have idiosyncratic verbal morphology such as verbal person markings that alternate between marking agent or patient depending on semantic context (7).
Are more likely to not possess noun/verb agreement or have agreement limited to agents (8) and are more likely to possess no person markings on adpositions (9). As with case markings, syncretism in noun/verb/adposition agreement is more common in languages spoken in the exoteric niche (10).
Are more likely to make possibility and evidentiality distinctions using lexical (e.g., verbal) constructions rather than using inflections such as affixes (11,12,14) and are more likely to conflate the two (semantically distinct) types of possibility (13).
(a) Are more likely to encode negation using analytical strategies (negative word) than using inflections (affixes) and are less likely to have idiosyncratic variations between word and affixation strategies (15). (b) Are more likely to have obligatory plural markers (16). For languages with optional markers, analytic (lexical) strategies are more common than inflectional strategies (affixes or clitics). (c) Are less likely to have a separate associative plural (e.g., “He and his friends”) (17). (d) Are more likely to have a dedicated question particle (18).
(a) Are less likely to encode the future tense morphologically (19) or possess remoteness distinctions in the past tense (20). Languages spoken in the exoteric niche are somewhat more likely to mark the perfective/imperfective distinction in their morphology (21), although this relationship disappears when language geography is partialed out. (b) Are more likely to mark singular imperatives on verbs using inflections than have no morphological markings for imperatives at all, but are less likely to contain more elaborate markings that differentiate between singular and plural imperatives (22). (c) Are less likely to have inflections that mark possession (23). If possession is marked, it is less likely to distinguish between types of possession (e.g., alienable versus inalienable) (24). (d) Are less likely to morphologically mark the optative mood (25).
Are less likely to have definite and indefinite articles (26). If both are present, they are more likely to be expressed by separate words than affixes.
Are less likely to communicate distance distinctions in demonstratives (27).
Are more likely to express pronominal subjects lexically than morphologically (28).
Y-axis of right-side panels displays residual population after the GLM model partialed out geographic information (reducing the correlation between population and geography to 0). Values above bars represent the number of languages coded for that feature value. (A) Adpositions (prepositions or postpositions) may be coded for person agreement in some languages. In English, there is no such agreement/person marking. One may say “from him” without, for example, encoding onto “from” the gender or number identity of “him,” as opposed to “me” in “from me.” Languages that do encode more information on adpositions show smaller populations. (B) Languages that use inflections (i.e., morphology) for the future tense have smaller populations. (C) Morphological encoding of possession is associated with smaller populations of speakers.
We constructed a morphological complexity measure by summing the number of features for which each language relies on lexical versus morphological coding and subtracting the total from 0. There was a strong relationship between complexity and speaker population,
X-axis scores represent a measure of lexical devices compared to the use of inflectional morphology. Filled symbols represent population means for languages with a given complexity score; bars show 95% confidence intervals of the median. Bar width is proportional to sample size for each score.
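The complexity score described above can be illustrated with a minimal sketch. It assumes one reading of the description: count the features a language codes lexically and subtract that count from 0, so that scores are never positive and more negative values mean greater reliance on lexical devices. The coding labels ("lexical", "morphological", `None` for uncoded features), the feature names, and the function name `complexity_score` are hypothetical and are not taken from the original analysis.

```python
from typing import Dict, Optional

# Hypothetical coding scheme: each feature is "lexical", "morphological",
# or None when the language is not coded for that feature.
Coding = Dict[str, Optional[str]]


def complexity_score(coding: Coding) -> int:
    """Return 0 minus the number of features coded lexically.

    Assumption: only lexically coded features count toward the total,
    so scores are <= 0 and more negative means morphologically simpler.
    """
    lexical_total = sum(1 for value in coding.values() if value == "lexical")
    return 0 - lexical_total


# Invented feature codings for two hypothetical languages.
language_a: Coding = {
    "future_tense": "lexical",
    "negation": "lexical",
    "possession": "morphological",
    "evidentiality": None,  # not coded for this language
}
language_b: Coding = {
    "future_tense": "morphological",
    "negation": "morphological",
    "possession": "morphological",
    "evidentiality": "morphological",
}

print(complexity_score(language_a))  # -2
print(complexity_score(language_b))  # 0
```

Uncoded features are simply skipped here; how missing values were handled in the actual measure is not specified in this passage.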
In cross-cultural or -linguistic research, it is important to consider the issue of non-independence of cases, often subject to autocorrelation (also known as Galton's problem). We controlled for non-independence in several ways:
We factored in both language family and geographic location to ensure they did not completely account for the observed language feature distribution (e.g.,
We also performed a Monte Carlo simulation, randomizing language-demographic information
(A) Inflectional synthesis of the verb (feature 6 in
Interestingly, a number of the languages that lie far below the regression line are lingua francas (vehicular languages), e.g., Hausa, Bambara, and Oromo. The Padang dialect of Minangkabau (the second simplest Austronesian language by our measure) is also a lingua franca around West Sumatra, Indonesia.
These controls ensure that the present results cannot be explained as consequences of historical events such as the colonization of the New World (and the population reduction that ensued)
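The within-family randomization control can be sketched as a permutation test. This is a simplified illustration, not the analysis reported here: the reported control compares residual deviances of regression models, whereas the sketch swaps in a simpler statistic (the gap in mean log population between languages without and with a feature). The records, family labels, and function names are all invented for the example.

```python
import random
from collections import defaultdict


def shuffle_within_families(records, rng):
    """Return a copy of records with log_pop permuted inside each family."""
    by_family = defaultdict(list)
    for index, record in enumerate(records):
        by_family[record["family"]].append(index)
    shuffled = [dict(record) for record in records]
    for indices in by_family.values():
        pops = [records[i]["log_pop"] for i in indices]
        rng.shuffle(pops)
        for i, pop in zip(indices, pops):
            shuffled[i]["log_pop"] = pop
    return shuffled


def mean_gap(records):
    """Mean log population without the feature minus mean with it."""
    with_f = [r["log_pop"] for r in records if r["feature"] == 1]
    without = [r["log_pop"] for r in records if r["feature"] == 0]
    return sum(without) / len(without) - sum(with_f) / len(with_f)


def permutation_p(records, n_iter=1000, seed=0):
    """One-sided p-value: how often a shuffled gap reaches the observed gap."""
    rng = random.Random(seed)
    observed = mean_gap(records)
    hits = sum(
        mean_gap(shuffle_within_families(records, rng)) >= observed
        for _ in range(n_iter)
    )
    return (hits + 1) / (n_iter + 1)


# Invented data: feature = 1 means the distinction is marked morphologically.
records = [
    {"family": "A", "log_pop": 3.1, "feature": 1},
    {"family": "A", "log_pop": 3.8, "feature": 1},
    {"family": "A", "log_pop": 6.2, "feature": 0},
    {"family": "B", "log_pop": 2.9, "feature": 1},
    {"family": "B", "log_pop": 5.5, "feature": 0},
    {"family": "B", "log_pop": 7.0, "feature": 0},
    {"family": "C", "log_pop": 4.0, "feature": 1},
    {"family": "C", "log_pop": 4.9, "feature": 0},
]

print(permutation_p(records))
```

Because populations are only exchanged among members of the same family, a small p-value in this setup would indicate that the population effect holds within families rather than merely between them.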
Languages on the exoteric side of the esoteric-exoteric continuum (as indicated by larger speaker populations, greater geographical coverage, and a greater degree of contact with other languages) had overall simpler morphological systems, more frequently expressed semantic distinctions using lexical means, and were overall less grammatically specified. This was true both for quantitative grammatical measures such as the number of different grammatical categories encoded by verbal inflections (feature 6) and case markings, as well as for qualitative grammatical types. For example, languages spoken in the exoteric niche were associated with a lack of conventional strategies for encoding semantic distinctions like situational/epistemic possibility, evidentiality, the optative, indefiniteness, the future tense, and both distance contrasts in demonstratives (consider the rarity of the English “over yonder”) and remoteness distinctions in the past tense.
With few exceptions, the same patterns were observed whether population, area, or linguistic contact was used in the model. Overall, the population model provided the greatest predictive power.
As noted above, semantic distinctions coded lexically are more likely to be optionally expressed than those coded inflectionally (e.g., lexical versus inflectional encoding of tense). Thus, languages that are less grammatically specified tend to rely more on extra-linguistic information such as pragmatics and context
Our results provide strong evidence for a relationship between social structure and linguistic structure. Here, we speculate about the social and cognitive mechanisms that may give rise to this relationship. The linguistic niche hypothesis (LNH) provides one framework in which to consider two central questions raised by the present analyses: (1) Why are languages spoken in the exoteric niche morphologically simpler than languages spoken in the esoteric niche? (2) Why are languages spoken in the esoteric niche so morphologically complex, given that such a high level of specification seems unnecessary for communication?
We tentatively propose that the level of morphological specification is a product of languages adapting to the learning constraints and the unique communicative needs of the speaker population. Complex morphological paradigms appear to present particular learning challenges for adult learners even when their native languages make use of similar paradigms
With increased geographic spread and an increasing speaker population, a language is more likely to be subjected to learnability biases and limitations of adult learners (
The
In
The paradoxical prediction that morphological overspecification, while clearly difficult for adults, facilitates infant language acquisition is novel and is empirically testable. In
The linguistic niche hypothesis stresses redundancy as the force that results in greater inflection in languages with few speakers. An alternative is that languages with fewer speakers may come to rely more on inflectional rather than lexical devices because these afford greater economy of expression. On average a language with a greater reliance on inflectional devices will produce shorter sentences than one that relies on lexical devices
We have presented statistical evidence showing that aspects of morphological structure are predicted from nonlinguistic demographic variables, especially population. These results provide support for a non-arbitrary relationship between linguistic and social structure. One way to understand how this relationship comes about is through what we have referred to as the Linguistic Niche Hypothesis (LNH), according to which different languages are placed under different learning constraints by socio-demographic factors. Languages spoken by millions of people over a diverse region are under greater pressure to be learnable by adult outsiders. This pressure gradually results in morphological simplification with an increase in productivity of existing grammatical patterns, and greater analytical and compositional structure
Because direct measures of esotericity are not available on a large scale, we used three proxy variables: speaker population, geographic spread, and degree of inter-language contact. Speaker population data for each language was retrieved from the Ethnologue
Our analyses focused on typological factors most relevant to morphological encoding with particular emphasis on continuous variables such as the number of inflectional case markings or the inflectional synthesis of verbs—the number of different types of information that can be inflectionally encoded by verbal affixes—measured in categories per word
Although our corpus included 2,236 languages, no feature was defined for all the languages in the WALS database. The results presented in
Typological variables with no natural ordering were predicted using multinomial regression (proportional odds logistic regression). Binary variables were predicted using simple logistic regression (logit GLM), and continuous variables (features 3, 6, 24, 27) were predicted using a Gaussian GLM. The included analyses partial out language location by including as covariates the latitude/longitude coordinates of the language as reported in WALS. We also ran analyses that partialed out location by including the continent as a random effect. These analyses resulted in larger uncertainties in the typological value estimates, but in no case led to discrepant conclusions.
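For the binary case, the idea of partialing out location and then asking whether population adds predictive power can be sketched as a comparison of two nested logit models. This is a toy illustration under stated assumptions, not the fitting procedure used in the analyses: the eight languages are invented, the model is fit by plain gradient ascent for a fixed number of steps rather than by a statistics package, and only population (one extra parameter) is added to the geography-only model.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def standardize(column):
    """Center a column and scale it to unit variance."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def predict(beta, x):
    """Predicted probability for one row; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def fit_logit(rows, y, steps=2000, rate=0.5):
    """Fit a logit model by gradient ascent; returns [intercept, b1, ...]."""
    n, k = len(rows), len(rows[0])
    beta = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, target in zip(rows, y):
            err = target - predict(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta


def deviance(beta, rows, y):
    """Residual deviance: -2 times the log-likelihood."""
    total = 0.0
    for x, target in zip(rows, y):
        p = min(max(predict(beta, x), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if target == 1 else math.log(1.0 - p)
    return -2.0 * total


# Invented data: (speakers, latitude, longitude, has_morphological_future)
languages = [
    (500, -5.0, 140.0, 1),
    (2000, 10.0, -70.0, 1),
    (8000, 60.0, 100.0, 1),
    (30000, -15.0, 30.0, 0),
    (150000, 45.0, 10.0, 1),
    (900000, 5.0, 100.0, 0),
    (5000000, 30.0, 75.0, 0),
    (80000000, 50.0, 0.0, 0),
]

log_pop = standardize([math.log10(row[0]) for row in languages])
lat = standardize([row[1] for row in languages])
lon = standardize([row[2] for row in languages])
y = [row[3] for row in languages]

geo_rows = [list(r) for r in zip(lat, lon)]            # location only
full_rows = [list(r) for r in zip(log_pop, lat, lon)]  # population + location

beta_geo = fit_logit(geo_rows, y)
beta_full = fit_logit(full_rows, y)
dev_geo = deviance(beta_geo, geo_rows, y)
dev_full = deviance(beta_full, full_rows, y)

# Chi-square comparison with 1 degree of freedom (one added predictor).
p_value = math.erfc(math.sqrt(max(dev_geo - dev_full, 0.0) / 2.0))
print(round(dev_geo, 2), round(dev_full, 2), p_value)
```

A lower deviance for the fuller model, judged against the chi-square distribution, is what the § marker in the results table summarizes; with a sample this small the numbers themselves carry no evidential weight.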
Because many languages only had information for a few of the features listed in
The relationship between population and number of nominal cases (a), and number of categories per verb (b). The regression lines are flanked by 95% CIs. The ranges on the x-axis correspond to the coding of these features in the World Atlas of Language Structures.
(0.10 MB DOC)
Word order and affixation frequencies and associated speaker populations. a. Distribution of word order types versus the mean speaker populations (numbers above bars indicate number of languages with the given feature value). b. Speaker population adjusted by geography. c–d. A break-down of languages classified as having dominantly prefixing versus dominantly suffixing inflectional morphology.
(0.10 MB DOC)
Examples of native (L1) to non-native (L2) populations for several languages.
(0.03 MB DOC)
A comparison of linguistic features (typologies) that are most common to languages in the exoteric niche compared to overall typological frequency.
(0.05 MB DOC)
A note regarding Japanese as an example.
(0.02 MB DOC)
A note about the correlations between our main demographic variables.
(0.02 MB DOC)
A note regarding our nomothetic approach.
(0.02 MB DOC)
A representative analysis of our language sample.
(0.35 MB DOC)
A note about esoteric and exoteric uses of a language.
(0.02 MB DOC)
A detailed description of the linguistic features used.
(0.06 MB DOC)
A note regarding multilingualism.
(0.02 MB DOC)
A note regarding generational transmission of a nonnative language.
(0.03 MB DOC)
A clarification of the term “redundancy.”
(0.02 MB DOC)
Modeling language fitness as a function of age of acquisition.
(0.56 MB DOC)
Text compressibility as a measure of linguistic redundancy.
(0.10 MB DOC)
Supporting analyses of constituent order and application to adult versus child language acquisition.
(0.04 MB DOC)
We thank the following people for their input (listed in alphabetical order): Dan Dediu, Guy Deutscher, Connie DeVos, Adele Goldberg, Jim Hurford, Asifa Majid, Daniel Nettle, Michael Tomasello, Peter Trudgill, Anne Warlaumont.