Some Structural Aspects of Language Are More Stable than Others: A Comparison of Seven Methods

Understanding the patterns and causes of differential structural stability is an area of major interest for the study of language change and evolution. It is still debated whether structural features have intrinsic stabilities across language families and geographic areas, or whether the processes governing their rate of change are completely dependent upon the specific context of a given language or language family. We conducted an extensive literature review and selected seven different approaches to conceptualising and estimating the stability of structural linguistic features, aiming to compare them on the same dataset, the World Atlas of Language Structures. We found that, despite profound conceptual and empirical differences between these methods, they tend to agree in classifying some structural linguistic features as being more stable than others. This suggests that there are intrinsic properties of such structural features influencing their stability across methods, language families and geographic areas. This finding is a major step towards understanding the nature of structural linguistic features and their interaction with idiosyncratic, lineage- and area-specific factors during language change and evolution.


Introduction
We present here in detail the derivation of Elena Maslova's estimation of transition probabilities (denoted M in our paper), as well as some possible extensions. A more general introduction can be found in [1]. This presentation is based on [2][3][4][5] and on personal communication between Cysouw and Maslova in 2006. The technical details of Maslova's proposals are not easily extractable from her publications, and because there is still no detailed description from her own hand we decided to publish our rendition of her ideas here. As is discussed in the main paper, Maslova's approach can be seen as a simplified formalisation of the method used in [6] (denoted D in our paper) and is empirically strongly correlated with it. Considering that Maslova's approach is much easier to handle, as both the empirical prerequisites (i.e. only pairs of related languages are necessary) and the mathematical calculations (i.e. only some quadratic equations have to be solved) are simpler, we propose that it can be used to quickly obtain approximate estimates of transition probabilities.

Assumptions
Consider a typology of languages for a particular feature. To estimate the transition probabilities, we will restrict ourselves here to features with two possible types only, A and B. Languages have to be either of type A or of type B; nothing else is allowed (but see Section for extensions of this restriction). This restriction to binary features immediately implies that the fraction of languages of type A is the complement of the fraction of languages of type B. Or, stated in terms of probabilities:

P(A) + P(B) = 1   (1)

where P(A) and P(B) are the fractions of languages of type A and B, respectively. Within a period of time t, ranging from t_0 to t_1, we attempt to estimate the probability that languages will change from A to B, or vice versa from B to A. We do not make any assumptions concerning the causes of such changes: they can be due to internal developments within the language or to external influences (such as contact with other languages); from the current method's point of view these distinctions are irrelevant.
Let p_AB be the probability that a language of type A changes to type B within the timeframe t. Likewise, let p_BA be the probability of a change from B to A. These two probabilities are independent of each other. The complement of p_AB, viz. 1 − p_AB, can be interpreted as the probability that a language of type A does not change to type B within the timeframe t. Likewise, the complement of p_BA, viz. 1 − p_BA, is the probability that a language of type B does not change to A.
Of course, within the timeframe t a language might change from A to B and back to A (A → B → A), in which case it would be included in the 1 − p_AB because, at the end of t, it would still be of type A. This applies to any even number of changes (A → B → ... → B → A) and shows (i) that t must be short enough that such reversals to the original value are not too frequent, and (ii) that this method is unable to account for reversals, as opposed to more advanced likelihood and Bayesian phylogenetic methods [7,8]. The necessary assumption that the period t is short is a limiting factor on Maslova's approach. However, in practice there is not much accepted knowledge about deep phylogenies in linguistics anyway, so empirically linguists will mostly work within groups of closely related languages, or even with variation between dialectal variants.
Assuming that these transition probabilities remain the same over longer periods of time, it is possible to predict the stable distribution of the types A and B, i.e. the situation in which the fractions of languages of type A and B do not change: P_t1(A) = P_t0(A) and P_t1(B) = P_t0(B). Such a stable situation does not mean that there are no changes anymore; it means that the changes in the two directions cancel each other out. The crucial assumption that the transition probabilities themselves remain stable is of course far from proven. It might very well be the case that even these probabilities have changed over time. Still, the assumption of universal transition probabilities represents a step forward from the common practice in linguistic typology of assuming universal empirical frequencies [1].

Basic implications
Concretely, in a stable distribution P_S there are equally many languages changing from A to B as there are languages changing from B to A within a particular period of time, so:

P_S(A) · p_AB = P_S(B) · p_BA   (2)

Using (1) and (2), the fractions of languages of type A and type B in the stable distribution can be predicted from the transition probabilities:

P_S(A) = p_BA / (p_AB + p_BA)
P_S(B) = p_AB / (p_AB + p_BA)   (3)

Further, the complement of the average of the transition probabilities, i.e.

1 − (p_AB + p_BA) / 2

can be interpreted as stability, i.e. the probability that there will be no change in a particular period. A high value indicates that few languages will change, which means that the characteristic is very stable. Conversely, a low value indicates that the characteristic is highly unstable. However, as discussed above, this interpretation crucially hinges on the absence of hidden reversals within the timeframe t and thus on the careful choice of t.
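For concreteness, the equilibrium relations above can be checked numerically. The following sketch (in Python rather than the R code used for this paper, with purely illustrative transition probabilities) computes the stable distribution of equation (3) and the stability value:

```python
# Stable distribution and stability from the transition probabilities,
# following equations (2) and (3). The values of p_AB and p_BA below are
# illustrative only, not estimates from WALS.

def stable_distribution(p_ab, p_ba):
    """Fraction of type-A languages in the stable distribution, P_S(A)."""
    return p_ba / (p_ab + p_ba)

def stability(p_ab, p_ba):
    """Complement of the average transition probability."""
    return 1 - (p_ab + p_ba) / 2

p_ab, p_ba = 0.2, 0.3
ps_a = stable_distribution(p_ab, p_ba)
s = stability(p_ab, p_ba)

# In the stable distribution the two flows cancel: P_S(A)*p_AB = P_S(B)*p_BA.
assert abs(ps_a * p_ab - (1 - ps_a) * p_ba) < 1e-12
print(ps_a, s)  # 0.6 0.75
```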
Finally, the expected fractions of languages of type A and type B at the end of a period t can be predicted from the fractions at the start of the period. Namely, the languages of type A at the end of the period, P_t1(A), will consist of those languages that were of type A at the start of the period, P_t0(A), and did not change to B, i.e. with probability 1 − p_AB, together with those languages that were of type B at the start of the period, P_t0(B), and did change to A, i.e. with probability p_BA. The same holds reversely for P_t1(B).
P_t1(A) = P_t0(A) · (1 − p_AB) + P_t0(B) · p_BA
P_t1(B) = P_t0(A) · p_AB + P_t0(B) · (1 − p_BA)   (4)

Using (1), these equations can be inverted to expressions of the fractions at the beginning of the period in terms of the fractions at the end of the period (we will need this in the next section for the estimation of the transition probabilities), e.g. for type A, substituting P_t0(B) = 1 − P_t0(A):

P_t1(A) = P_t0(A) · (1 − p_AB − p_BA) + p_BA   (5)

Doing this likewise for type B, and solving for the fractions at t_0, results in:

P_t0(A) = (P_t1(A) − p_BA) / (1 − p_AB − p_BA)
P_t0(B) = (P_t1(B) − p_AB) / (1 − p_AB − p_BA)   (6)
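The update rule (4) and its inversion (6) can likewise be illustrated numerically. This Python sketch (with illustrative values, not WALS estimates) shows that iterating (4) converges to the stable distribution of (3), and that (6) recovers the starting fraction:

```python
# Iterating the update rule in (4), P_t1(A) = P_t0(A)*(1 - p_AB) + P_t0(B)*p_BA,
# converges to the stable distribution p_BA / (p_AB + p_BA); the inversion in
# (6) recovers the starting fraction. The numbers are illustrative only.

def step(p_a, p_ab, p_ba):
    """One period of change, equation (4) for type A."""
    return p_a * (1 - p_ab) + (1 - p_a) * p_ba

p_ab, p_ba = 0.2, 0.3

# Convergence: repeated application approaches p_BA / (p_AB + p_BA) = 0.6.
p_a = 0.05
for _ in range(200):
    p_a = step(p_a, p_ab, p_ba)
assert abs(p_a - p_ba / (p_ab + p_ba)) < 1e-9

# Inversion (6): P_t0(A) = (P_t1(A) - p_BA) / (1 - p_AB - p_BA).
p_t1 = step(0.3, p_ab, p_ba)
p_t0 = (p_t1 - p_ba) / (1 - p_ab - p_ba)
assert abs(p_t0 - 0.3) < 1e-12
```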

Estimating transition probabilities
Linguists are often highly confident that certain languages are related, without necessarily being able to reach agreement on the details of the internal subgrouping of such a genealogical unit. Taking advantage of this empirical situation, we will only assume that (see also the discussion above):

1. there is a distinction between pairs of related vs. non-related languages (i.e. no detailed genealogical trees are assumed);
2. the time-depth of the split-up of related languages is relatively small, so that it is likely that there has been at most one single change of type per language in that period (no reversals) and that there is a low probability that all related languages have changed;
3. all pairs of related languages have approximately the same time depth. In practice we used the genus level as described in WALS as the maximum divergence time depth.
Given a sample of pairs of such related languages we can estimate the transition probabilities and the stability of the features concerned (see Section ); this is the method we applied in this paper, and it amounts to Maslova's original proposal. However, two interesting extensions are also possible: (a) we could look at groups of three such related languages (see Section ), and (b) we could add to the sample of related pairs a third, non-related language that is geographically close, allowing us to estimate the transition probabilities with borrowing events included in the model (see Section ). Interestingly, the resulting formulas for estimating the transition probabilities in these cases differ only by a constant factor.

Using genealogically closely related pairs
Given a pair of closely related languages, the method assumes that they both shared the same type at the start t_0 of the period t. Either both languages are of type A, with probability P_t0(A), or both are of type B, with probability P_t0(B). Some changes might happen (or not) during the period t, resulting in a particular probability that both languages are still identical at the end t_1 of the period t. This probability is called P_t1(identical). It is the sum of four possible histories: either both languages started off as type A and neither changed (AA_t0 → AA_t1); or both started off as type A and both changed to B (AA_t0 → BB_t1); or both started off as type B and neither changed (BB_t0 → BB_t1); or, finally, both started off as type B and both changed to A (BB_t0 → AA_t1). Note that the assumption of a short time-span t leads to the further assumption that the number of pairs that did not change (AA_t0 → AA_t1 and BB_t0 → BB_t1) will be larger than the number of pairs that changed completely (AA_t0 → BB_t1 and BB_t0 → AA_t1). This assumption will become important in the solving of the equations (see Section ). Thus, the probability of pairs of languages being identical in a synchronic empirical collection of pairs can be expressed as:

P_t1(I) = P_t0(A) · [(1 − p_AB)² + p_AB²] + P_t0(B) · [(1 − p_BA)² + p_BA²]   (7)

Using (1), this can be reformulated as an equation relating the fraction of identical pairs at the end of the period P_t1(I) and the fraction of languages of type A at the beginning of the period P_t0(A):

P_t1(I) = 1 − 2 · p_BA(1 − p_BA) + 2 · P_t0(A) · [p_BA(1 − p_BA) − p_AB(1 − p_AB)]

Using (6), the fraction of type A at the beginning of the period can be expressed in terms of the fraction of type A at the end of the period. Thus, both P(I) and P(A) in the equation are expressed at the same point in time, removing the need for a time subscript:

P(I) = 1 − 2 · [P(A) · (p_AB − p_BA) + p_BA(1 − p_AB)]

or, by defining P(D) as the complement of P(I), i.e. P(D) = 1 − P(I), this becomes

P(D) = 2 · [P(A) · (p_AB − p_BA) + p_BA(1 − p_AB)]   (8)

P(D) is the frequency with which the languages within the pair are different.
So, there should be a linear dependency between the frequency of pairs of languages being different, P(D), and the frequency of languages of type A, P(A), of the form P(D) = 2α · P(A) + 2β, with α = p_AB − p_BA and β = p_BA(1 − p_AB). By empirically measuring P(D) and P(A) and by estimating the coefficients α and β of the linear dependency, it is possible to estimate the transition probabilities (see Section for the practical details).
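This linear dependency can be verified numerically. The following Python sketch (with illustrative transition probabilities and an illustrative starting fraction, not WALS data) computes P(D) and P(A) exactly for pairs that start out identical and checks the linear form:

```python
# Numerical check of the linear dependency P(D) = 2*(alpha*P(A) + beta), with
# alpha = p_AB - p_BA and beta = p_BA*(1 - p_AB). The transition probabilities
# and the starting fraction q0 are illustrative only.

p_ab, p_ba = 0.15, 0.25
q0 = 0.4                                   # P_t0(A): fraction of type A at t0

# Exact end-of-period quantities for pairs that start out identical:
p_a1 = q0 * (1 - p_ab) + (1 - q0) * p_ba   # P_t1(A), equation (4)
p_ident = (q0 * ((1 - p_ab) ** 2 + p_ab ** 2)
           + (1 - q0) * ((1 - p_ba) ** 2 + p_ba ** 2))   # equation (7)
p_diff = 1 - p_ident

alpha = p_ab - p_ba
beta = p_ba * (1 - p_ab)
assert abs(p_diff - 2 * (alpha * p_a1 + beta)) < 1e-12   # equation (8)
```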
Note that it might seem even more interesting to consider a less constrained model, starting with two languages of any pair of types, AA, AB, BA, or BB. This would result in the following equation for P(I):

P_t1(I) = [P_t0(A) · (1 − p_AB) + P_t0(B) · p_BA]² + [P_t0(A) · p_AB + P_t0(B) · (1 − p_BA)]²

However, after performing the same algebra as above, all transition probabilities factor out, leaving just the evidently true, but useless, equation

P(I) = P(A)² + P(B)²

Using genealogically closely related triples

There are various different settings that can be used to estimate the transition probabilities. However, most of them quickly become rather complex. The algebra of the following two settings also nicely reduces to a manageable model, differing only slightly from the previous one. These settings were not considered by Maslova herself, but were added by the present authors.
Instead of looking at pairs of languages, one might also look at groups of three closely related languages. In that case the probability that all three languages are identical consists of four different possible histories (AAA_t0 → AAA_t1, AAA_t0 → BBB_t1, BBB_t0 → BBB_t1 and BBB_t0 → AAA_t1). This results in an equation very similar to (7):

P_t1(I) = P_t0(A) · [(1 − p_AB)³ + p_AB³] + P_t0(B) · [(1 − p_BA)³ + p_BA³]

Performing the same algebra as in the previous section, this leads to:

P(D) = 3 · [P(A) · (p_AB − p_BA) + p_BA(1 − p_AB)]   (9)

i.e. exactly the same formula as in (8), though with a constant 3 instead of 2. Note that this does not generalise to larger groups: for groups of five languages, for instance, it does not work simply to replace the constant with a 5. All groups of more than three languages lead to much more complex algebra and are thus not usable as a quick approximation (which is the goal of the present method).
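The constant-3 result, and the failure of the naive generalisation to five languages, can both be checked numerically; the following Python sketch uses illustrative values only:

```python
# Numerical check that moving from pairs to triples only changes the constant
# from 2 to 3, i.e. equation (9): P(D) = 3*(alpha*P(A) + beta). The transition
# probabilities and the starting fraction q0 are illustrative only.

p_ab, p_ba = 0.15, 0.25
q0 = 0.4                                    # P_t0(A)

p_a1 = q0 * (1 - p_ab) + (1 - q0) * p_ba    # P_t1(A), equation (4)
alpha = p_ab - p_ba
beta = p_ba * (1 - p_ab)

# Probability that all three related languages are still identical at t1:
p_ident3 = (q0 * ((1 - p_ab) ** 3 + p_ab ** 3)
            + (1 - q0) * ((1 - p_ba) ** 3 + p_ba ** 3))
assert abs((1 - p_ident3) - 3 * (alpha * p_a1 + beta)) < 1e-12

# For groups of five, the analogous quantity is NOT 5*(alpha*P(A) + beta):
p_ident5 = (q0 * ((1 - p_ab) ** 5 + p_ab ** 5)
            + (1 - q0) * ((1 - p_ba) ** 5 + p_ba ** 5))
assert abs((1 - p_ident5) - 5 * (alpha * p_a1 + beta)) > 1e-3
```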

Using geographically close, but genealogically unrelated pairs
In this extension of Maslova's approach we again consider two closely related languages, but now we add a third, non-related language that is geographically close to one (and only one) of the two related languages. We are interested in situations where the two related languages are of different types, but the two non-related yet geographically close languages are of the same type. Such a situation is typically interpreted as the result of contact-induced change in one of the geographically close languages. However, various histories can lead to this setting. We assume, as before, that at the start t_0 of the period t the two related languages are of the same type, so either both are of type A, with probability P_t0(A), or both are of type B, with probability P_t0(B). The non-related third language can also be of either type (with the same probabilities), so there are four possible start settings: AA-A, AA-B, BB-A, and BB-B (the non-related, but geographically close language is shown separated by a dash). We are interested in end situations in which the two related languages are of different types, but the two geographically close languages are of the same type: AB-B and BA-A. The probability for any of these situations to occur will be denoted P(C), where the 'C' mnemonically stands for 'convergence'.
From each starting situation it is possible to arrive at both end situations, given the right changes. For example, to get from AA-A to AB-B requires two languages to change from A to B and one language to remain of type A. Writing out all eight such possibilities gives the following unwieldy formula:

P_t1(C) = P_t0(A)² · [(1 − p_AB) · p_AB² + p_AB · (1 − p_AB)²]
        + P_t0(A) · P_t0(B) · [(1 − p_AB) · p_AB · (1 − p_BA) + p_AB · (1 − p_AB) · p_BA]
        + P_t0(B) · P_t0(A) · [p_BA · (1 − p_BA) · p_AB + (1 − p_BA) · p_BA · (1 − p_AB)]
        + P_t0(B)² · [p_BA · (1 − p_BA)² + (1 − p_BA) · p_BA²]

Combining terms, and using the complementarity of P(A) and P(B), this reduces to

P_t1(C) = P_t0(A) · [(1 − p_AB) · p_AB − (1 − p_BA) · p_BA] + (1 − p_BA) · p_BA

Using (6), the probability P_t0(A) can be expressed using only the probability P_t1(A), so the time subscripts are identical and can thus be left out, which reduces nicely to

P(C) = P(A) · (p_AB − p_BA) + p_BA(1 − p_AB)   (10)

So, again there is the same linear dependency between the probability that the two unrelated languages are identical while the two related languages are different, P(C), and the empirical probability of languages being of type A, P(A). The only difference between (10) and the earlier results in (8) and (9) is the constant. By estimating the coefficients, it is possible to estimate the transition probabilities, and from that the stable distribution and the stability of the feature.
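The eight-history computation can be verified by brute-force enumeration. In the following Python sketch (with illustrative values), the intercept comes out as p_BA·(1 − p_AB), i.e. the same β as in (8), consistent with the statement that the pair, triple and convergence results differ only in the constant:

```python
# Brute-force enumeration of the eight histories leading from the four start
# settings (AA-A, AA-B, BB-A, BB-B) to the convergence outcomes AB-B and BA-A.
# The transition probabilities and starting fraction are illustrative only.

p_ab, p_ba = 0.15, 0.25
q0 = 0.4                                    # P_t0(A)

def trans(s, e):
    """Probability that a single language of type s is of type e at t1."""
    p = p_ab if s == "A" else p_ba
    return p if s != e else 1 - p

p_c = 0.0
for rel in ("A", "B"):                      # shared type of the related pair at t0
    for third in ("A", "B"):                # type of the unrelated neighbour at t0
        p_start = ((q0 if rel == "A" else 1 - q0)
                   * (q0 if third == "A" else 1 - q0))
        for end in (("A", "B", "B"), ("B", "A", "A")):   # AB-B and BA-A
            p_c += (p_start * trans(rel, end[0]) * trans(rel, end[1])
                    * trans(third, end[2]))

p_a1 = q0 * (1 - p_ab) + (1 - q0) * p_ba    # P_t1(A), equation (4)
# Same slope as (8), with constant 1; the intercept is p_BA*(1 - p_AB):
assert abs(p_c - (p_a1 * (p_ab - p_ba) + p_ba * (1 - p_ab))) < 1e-12
```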

Empirically estimating the transition probabilities using WALS
In the main part of this paper, we applied the method described in Section to the data of the World Atlas of Language Structures (WALS) [9] in order to obtain estimates of the stability of the structural features of language covered in this database. The actual R code (released under a GPL v3 license) is given below.
For any given feature F in WALS, with n ≥ 2 values V_1, ..., V_n, we estimated separately the transition probabilities for each of its values V_i, such that, in the previous notation, A is V_i and B represents all other values except V_i. Thus, for each value V_i we estimated the transition probability from V_i to any other possible value, p_ABi.
The basic idea is the following. For a specific value A we select pairs of languages of the same genus from WALS. All those pairs are separated into two different samples, for convenience called sample 1 and sample 2 here. For each of these samples, we count how many pairs are not identical (so, one of the two languages has value A and the other does not). The proportions of different pairs for the two samples are P(D_1) and P(D_2). Further, for both samples we count the number of languages that have value A. The proportions of value A for the two samples are P(A_1) and P(A_2). From equation (8) we then have:

P(D_1) = 2 · [P(A_1) · (p_AB − p_BA) + p_BA(1 − p_AB)] = 2α · P(A_1) + 2β
P(D_2) = 2 · [P(A_2) · (p_AB − p_BA) + p_BA(1 − p_AB)] = 2α · P(A_2) + 2β   (11)

From the equations in (11) we can derive:

α = p_AB − p_BA = (P(D_1) − P(D_2)) / (2 · (P(A_1) − P(A_2)))
β = p_BA(1 − p_AB) = (P(A_1) · P(D_2) − P(A_2) · P(D_1)) / (2 · (P(A_1) − P(A_2)))

By filling in the four empirical estimates for P(D_1), P(D_2), P(A_1) and P(A_2) we can thus directly derive estimates for p_AB and p_BA. However, note that there are actually two solutions for p_AB and p_BA, one with lower transition probabilities (the 'minus' variant) and one with higher transition probabilities (the 'plus' variant), with p′_AB = 1 − p_BA and vice versa. The interpretation of these two solutions can be understood by looking back at equation (7). The 'minus' solution represents the situation in which the number of pairs that did not change (AA_t0 → AA_t1 and BB_t0 → BB_t1) is larger than the number of pairs that changed completely (AA_t0 → BB_t1 and BB_t0 → AA_t1). As was discussed in Section , this is the interpretation fitting the assumption that the time period t is small. We will therefore use the 'minus' solution here.
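Solving for the transition probabilities from the coefficients amounts to one quadratic equation. The following Python sketch (the function name and the example values are our own, for illustration only) recovers both the 'minus' and the 'plus' solutions and checks the relation p′_AB = 1 − p_BA:

```python
import math

# Recovering p_AB and p_BA from the coefficients alpha = p_AB - p_BA and
# beta = p_BA*(1 - p_AB). Substituting p_AB = p_BA + alpha gives the quadratic
# p_BA^2 - (1 - alpha)*p_BA + beta = 0, with a 'minus' and a 'plus' root.
# Function name and example values are illustrative, not from the paper's code.

def transition_probs(alpha, beta, solution="minus"):
    disc = math.sqrt((1 - alpha) ** 2 - 4 * beta)
    p_ba = ((1 - alpha) - disc) / 2 if solution == "minus" else ((1 - alpha) + disc) / 2
    return p_ba + alpha, p_ba               # (p_AB, p_BA)

alpha, beta = -0.10, 0.2125                 # consistent with p_AB=0.15, p_BA=0.25
p_ab, p_ba = transition_probs(alpha, beta)
assert abs(p_ab - 0.15) < 1e-9 and abs(p_ba - 0.25) < 1e-9

# The 'plus' root satisfies p'_AB = 1 - p_BA (and vice versa):
p_ab2, p_ba2 = transition_probs(alpha, beta, "plus")
assert abs(p_ab2 - (1 - p_ba)) < 1e-9 and abs(p_ba2 - (1 - p_ab)) < 1e-9
```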
Instead of using just two samples of pairs, as illustrated above, it is also possible to select many different samples. In fact, in order to estimate α and β we used multiple sets of P(A) and P(D) for the same value. We obtained these multiple sets by randomly subsampling the set of all available genera. For example, using WALS for feature 10 (Vowel Nasalization) and its value 1 ("Contrast present"), we have a total of 26 genera with enough data for this feature. We created 50 random subsets of 13 genera and for each such subset we computed P(D) and P(A), as exemplified in Table 1.

Table 1. Example P(A) and P(D) for feature 10 (Vowel Nasalization) in WALS. Each subset shown is composed of 13 random genera (here given by their alphabetical index, thus 1 represents Adamawa-Ubangian). As an example, we only show the first 10 subsets generated in one particular run.

If we then regress P(D) on P(A) we obtain estimates of α and β as the coefficients of this regression. In this example α = 0.08 (std. error 0.17) and β = 0.17 (std. error 0.05). The error values show that these estimates are not completely random, though it should be noted that the errors are substantial. The estimates are thus to be interpreted with care. Still, proceeding with these estimated parameters we can estimate p_AB = 0.33 and p_BA = 0.25. Using the formula in (3), this implies a stable distribution of P_S(A) = 0.43. Note that the actual frequency of vowel nasalization in WALS is only 26%, indicating that the current distribution is not in its stable state, and that there probably is influence from historical coincidences on the current world-wide distribution of vowel nasalisation. The stability of this characteristic is rather high, so languages do not seem to change very often, making it even more probable that the current distribution shows signs of historical events.
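As a check, the example numbers can be re-derived from the estimated coefficients. The following Python sketch starts from α = 0.08 and β = 0.17; because these coefficients are rounded, the results match the reported p_AB, p_BA and P_S(A) only approximately:

```python
import math

# Re-deriving the Vowel Nasalization example from the estimated coefficients
# alpha = 0.08 and beta = 0.17. Because the coefficients are rounded, the
# results only approximately match the reported values.

alpha, beta = 0.08, 0.17
disc = math.sqrt((1 - alpha) ** 2 - 4 * beta)
p_ba = ((1 - alpha) - disc) / 2             # the 'minus' solution
p_ab = p_ba + alpha
ps_a = p_ba / (p_ab + p_ba)                 # stable distribution, equation (3)

assert abs(p_ab - 0.33) < 0.01              # reported: p_AB = 0.33
assert abs(p_ba - 0.25) < 0.01              # reported: p_BA = 0.25
assert abs(ps_a - 0.43) < 0.01              # reported: P_S(A) = 0.43
```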
Further, we computed the stability of the feature F by taking the weighted average of the stabilities of each of its values V_i (defined as 1 − p_ABi), where the weights are the relative frequencies of the feature values:

S(F) = Σ_i P_i · (1 − p_ABi)

where P_i is the frequency of value V_i relative to all the possible values of feature F. Thus, the stability of more frequent values has a bigger influence on the estimate of the whole feature's stability S(F).
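The weighted average can be sketched as follows (in Python; the frequencies and per-value transition probabilities are made up for illustration):

```python
# Weighted-average stability of a feature across its values:
# S(F) = sum_i P_i * (1 - p_ABi). The frequencies and per-value transition
# probabilities below are made up for illustration.

def feature_stability(freqs, p_abs):
    """S(F) as the frequency-weighted average of the value stabilities."""
    assert abs(sum(freqs) - 1.0) < 1e-9     # relative frequencies sum to 1
    return sum(p * (1 - p_ab) for p, p_ab in zip(freqs, p_abs))

freqs = [0.5, 0.3, 0.2]                     # relative frequencies of V1..V3
p_abs = [0.10, 0.30, 0.50]                  # estimated p_ABi per value
s_f = feature_stability(freqs, p_abs)
assert abs(s_f - 0.76) < 1e-9               # 0.5*0.9 + 0.3*0.7 + 0.2*0.5
```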

The R code
This section contains the R code (released under GPL v3 and also reproduced in Script S1) implementing the estimation of stability (used in this paper) based on Elena Maslova's method (as described above). Please note that, depending on the version of the WALS database used, results might differ slightly from the ones reported here; for maximum reproducibility we have also included the version of the WALS dataset we used as Dataset S1.

## Please note that currently the WALS data is released in a slightly different format
## (see http://wals.info/export) and in order to import it you need to use instead
## something of the form:
#langs <- read.table("languages.tab", sep="\t", header=T)
#mat <- read.table("datapoints.tab", sep="\t", header=T, row.names=1)
## but we did not test the script with this new format and it might require some tweaking.

# Select genera that have more than one language coded for a feature
# For example, "length(getgenera(83))" gives "197"
if (!is.na(macroarea))
{
  langs <- langs[lgs$macroarea[rownames(lgs) %in% langs] == macroarea];
}
if (!is.na(lgfamily))
{
  langs <- langs[lgs$family[rownames(lgs) %in% langs] == lgfamily];
}