How individuals change language

Languages emerge and change over time at the population level through interactions between individual speakers. It is, however, hard to directly observe how a single speaker's linguistic innovation precipitates a population-wide change in the language, and many theoretical proposals exist. We introduce a very general mathematical model that encompasses a wide variety of individual-level linguistic behaviours and provides statistical predictions for the population-level changes that result from them. This model allows us to compare the likelihood of empirically-attested changes in definite and indefinite articles in multiple languages under different assumptions about the way in which individuals learn and use language. We find that accounts of language change that appeal primarily to errors in childhood language acquisition are very weakly supported by the historical data, whereas those that allow speakers to change incrementally across the lifespan are more plausible, particularly when combined with social network effects.


S1.1 Empirical data set
Our empirical data set of 52 languages is derived from 84 sources that document instances of article usage in different languages at different times. In Table S1.1 we summarise the stages of the grammaticalisation cycles for the definite and indefinite articles that have been observed, along with a historical time period that covers these observations. The quantities that enter the likelihood analysis are the total length of the historical period, and the number of changes in the article that occurred within it. In most cases, the beginning and end of each period correspond to the earliest and most recent record (in cases where the language is still spoken, the latter is the present day). The exception to this is Hebrew, which was not spoken for a 1700-year period. In this case we take processes of change to be halted during this period. In cases where a language split into several daughter languages, we take the observation period for the parent language to end at the time of the split, and the daughters' observation periods to begin. We also quote a measure of relative population size (the weight of a language), which can be converted into an estimate of the true population size using the procedure described in Section S1.2 below. These weights are obtained from geographical population sizes as described in Section S1.3 below. Finally, in this table we record the relevant sources of historical language use so that our characterisation of the historical data can be verified as required.

S1.2 Historical populations of geographical regions
We use a survey of population sizes across many regions of the world [85] to fit the following model for the size N_i(t) of region i at time t, measured in years since 1BCE:

N_i(t) = N_0 w_i g(t) . (S1.1)

In this model, N_0 sets an overall scale, w_i is a region-dependent weight that specifies its relative size, and g(t) is a universal time-dependent growth function. This model amounts to an assumption that populations in different regions of the world maintain constant ratios over the relevant historical time period. Below we show that this model provides an estimate of a region's population size that is accurate to within a factor of 2.8 at a confidence level of 95%. The source data for this analysis is provided in the S2 File as an MS Excel spreadsheet; the resulting weights are specified in Table S1.2. Although not a particularly precise estimate, this variation of a factor of 2.8 is to be compared with a factor of 3400 variation in the weights themselves, and an overall increase by a factor of 300 in population sizes from 5000BCE to the present day. Consequently, this simple model captures the range of population sizes and their changes over time rather well with a single parameter per geographical region, and we do not feel there would be much to be gained from using a more refined model.

Figure S1.1: Population sizes normalised by the weight factor w_i for each geographical region i. The grey dots show observations recorded in [85]. The black squares are the mean population size after normalisation has been applied to minimise the variance between different regions. The smooth line is a degree-4 polynomial fit to the logarithm of the mean normalised population size. See Table S1.3 for the coefficients of this polynomial.
The weights and the unknown function g(t) are found by performing a linear least-squares fit to

ln N_i(t_j) = ln w_i + ln N_0 + ln g(t_j) + ε_i(t_j) , (S1.2)

where N_i(t_j) is the population size in region i at time point t_j as recorded in [85], and the parameters a_i = ln w_i and b_j = ln g(t_j) are varied to minimise the sum of squared residuals ε_i(t_j)². This minimisation problem is underdetermined, and a unique solution is obtained after fixing g(0) = 1 and w_i = 1 for the smallest region in the sample (Iceland). This procedure yields an overall scale N_0 = 14600 (to 3 s.f.), and the weights w_i are presented in Table S1.2. In Figure S1.1, we plot the normalised population sizes N_i(t_j)/w_i, along with a fit to their mean, g(t), whose logarithm is found to be well described by a quartic polynomial. This figure demonstrates that there is some scatter around this average, which we quantify further with the distribution of residuals shown in Figure S1.2. We find that although this distribution is not normal, the central 95% of the residuals span the interval from −1.02 to 1.3. Since these residuals are natural logarithms, this range corresponds to the overall factor of 2.8 error (in either direction) on the estimated population size, as claimed above. The R² statistic for the linear least-squares fit is 0.923 (to 3 s.f.).
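For concreteness, the least-squares procedure of Eq. (S1.2) can be sketched as follows. This is an illustrative implementation exercised on synthetic data, not the code used for the paper; the function name and the bookkeeping for the gauge-fixing constraints are ours.

```python
import numpy as np

def fit_population_model(log_N, region_idx, time_idx, n_regions, n_times):
    """Least-squares fit of ln N_i(t_j) = ln w_i + ln N0 + ln g(t_j).

    The gauge freedom is removed by fixing w_0 = 1 for a reference region
    and g(t_0) = 1 for a reference time, as in the text (Iceland, t = 0).
    Returns (ln_N0, ln_w, ln_g)."""
    n_obs = len(log_N)
    # Columns: [ln N0, ln w_1 .. ln w_{R-1}, ln g(t_1) .. ln g(t_{T-1})]
    A = np.zeros((n_obs, 1 + (n_regions - 1) + (n_times - 1)))
    A[:, 0] = 1.0
    for k, (i, j) in enumerate(zip(region_idx, time_idx)):
        if i > 0:
            A[k, i] = 1.0                  # indicator for ln w_i
        if j > 0:
            A[k, n_regions - 1 + j] = 1.0  # indicator for ln g(t_j)
    coef, *_ = np.linalg.lstsq(A, log_N, rcond=None)
    ln_w = np.concatenate([[0.0], coef[1:n_regions]])
    ln_g = np.concatenate([[0.0], coef[n_regions:]])
    return coef[0], ln_w, ln_g
```

With noiseless synthetic data the design matrix is full rank and the fit recovers the weights and growth function exactly, up to the chosen normalisation.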

S1.3 Geographical composition of languages
We assume that language sizes are linear combinations of the sizes of the geographical regions where they are spoken, and that the fractions entering these combinations remain constant over time. The combinations are provided in Table S1.4, with the resulting language sizes in Table S1.1. In most cases, we assume a one-to-one relationship between a geographical region (e.g., Sweden) and a language (Swedish). In other cases, further explanation is necessary:
• English and Welsh - Currently, around 5% of the population of England and Wales resides in Wales. Assuming this fraction to be constant over history, and that roughly one third of Welsh residents are speakers of Welsh, we arrive at a 98% to 2% split of England and Wales into English and Welsh speakers respectively.
• Scandinavian -Two of the languages in the sample, Icelandic and Swedish, descended from Common Scandinavian with the split estimated to have occurred around 1300CE. We therefore take the period from 1100CE to 1300CE as common to both languages, and to have a correspondingly larger population (see below for details of how this was incorporated into the analysis).
• Irish -Until around 1600CE, we assume the entire population of Ireland is Irish-speaking before the Irish-speaking population declines. We therefore truncate the time-window for Irish in the analysis at 1600CE, rather than the present day (see Table S1.1).
• Latin -Analogously to the Scandinavian languages, we take Latin to be an ancestor of French and Romanian, estimating the split to take place around 500CE, before which both daughter languages are considered to have a common history.
• Turkish -The region designated Turkey-in-Asia by McEvedy and Jones [85] includes a significant population of Kurdish, Armenian and Greek speakers. We take the number of Turkish speakers to be 75% of this larger population.
• Hebrew -Hebrew was not spoken between around 200CE and 1900CE, being a liturgical and written language in the intervening period. Here, we assume that the language was frozen (unable to change) in the period that it was not spoken.
• Aramaic -McEvedy and Jones [85] combine Syria (where Aramaic was spoken) and Lebanon into a single region. We estimate that Aramaic speakers make up 75% of this region.
• Ethiopian Semitic - Ge'ez is assumed to be the direct ancestor of both Tigré and Tigrinya, with a split occurring at 1000CE. The fraction of Ethiopia in which each daughter language is spoken is assumed to be constant and equal to 1986 values taken from [86].
• Indo-Aryan -Masica [32, p8] quotes a figure of 640m total speakers of Indo-Aryan languages. The majority of these reside in the region designated as Pakistan, India and Bangladesh by McEvedy and Jones [85], for which the most recent estimate of population size (1975CE) was given as 745m. We therefore assume the population of Indo-Aryan speakers to track the size of Pakistan, India and Bangladesh, but scaled by a factor of 85% to match recent estimates in [86]. We take the Indo-Aryan ancestor language to split at around 1000CE; the daughter languages Bengali, Assamese, Hindi and Gujarati that were included in the sample are thereby taken to share a common period of evolution from around 700CE to 1000CE (as with the Scandinavian and Latin languages). These sizes of each of these daughter languages were also taken to track the population of Pakistan, India and Bangladesh, again using the ratio that applies to the present-day populations.
• Tamil -Tamil is spoken in India and Sri Lanka. Proportions of the corresponding regions documented by McEvedy and Jones [85] were estimated using numbers for 1986CE in [86].
• Tibetan -We take 25% of the population of the Chinese Turkestan and Tibet region [85] to be Tibetan.
• Mongolian -The Mongolian-speaking population extends beyond the region designated as Mongolia by McEvedy and Jones which excludes Inner Mongolia. We assume that the total population of Mongolian speakers is 35% larger than that of Mongolia.
• Mesoamerican languages - These languages are spoken in regions designated as Mexico and Guatemala by McEvedy and Jones [85]. We take current estimates of their speaker numbers as fractions of the relevant geographical regions to obtain the weight of each language, and a historical estimate to determine the fraction of the population of Mexico that spoke Classical Nahuatl before it split into daughter languages at around 1600CE. Given the recent decline of Nahuatl in particular, as it was replaced by Spanish in Mexico, the post-split numbers are likely to be underestimates, and it may not be reasonable to assume that the speaker numbers are a constant fraction of the geographical population over time. However, we do not believe that this uncertainty greatly affects our results.
• Quechua - We consider three varieties of Quechua, all of which are taken to be descendants of a common Colonial Quechua language spoken widely across Bolivia, Ecuador and Peru before a split at around 1700CE. After this split, we assume that the speakers of the daughter varieties are constant fractions of Peru and Ecuador, set at values that pertain to the 1970s [86].

Table S1.4: Geographical composition of the speech community for each language in the sample. The resulting relative historical average population sizes are given in Table S1.1.

In the main text, the model calls for a single size to characterise each population over the relevant historical period. We use the mean of N_i(t), given by Eq. (S1.1), over the historical time period (or periods, in the case of Hebrew) given in Table S1.1, summed over regions i with the weights given in Table S1.4.
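A minimal sketch of this averaging, assuming a quartic polynomial for ln g(t) as in Figure S1.1 (the coefficients and N_0 below are illustrative placeholders, not the fitted values of Table S1.3):

```python
import numpy as np

# Illustrative stand-ins for the fitted values (not those of Table S1.3).
N0 = 14600.0
poly_coeffs = [1e-14, -2e-11, 3e-8, 1e-4, 0.0]  # ln g(t), highest degree first

def g(t):
    """Universal growth function: exponential of a quartic in t (years since 1BCE)."""
    return np.exp(np.polyval(poly_coeffs, t))

def mean_population(weights, t_start, t_end, step=10.0):
    """Mean of N(t) = N0 * (sum of regional weights) * g(t) over [t_start, t_end],
    evaluated on a regular grid of the given step (in years)."""
    t = np.arange(t_start, t_end + step, step)
    return N0 * sum(weights) * np.mean(g(t))
```

For a language composed of several regions, `weights` is the list of per-region weights from Table S1.4; for a split lineage the window is truncated at the time of the split as described above.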

S1.4 Origin-fixation model
As described in the main text, we have constructed an origin-fixation model to describe the dynamics of language change at the population level and, more specifically, to provide likelihood functions for the empirical data sets under various assumptions about the underlying linguistic behaviour of individuals. In this model, an innovation (mutation) of type i + 1 that successfully propagates through (invades) a population of type i individuals is introduced as a Poisson process with rate ω_i. Recall from the main text that we refer to the introduction of a successful innovation to the population as an origination event. We take ω_i = ω̄/(V f_i), where f_i is the typological frequency of variant i = 1, 2, . . . , V across the world's languages, so that the mean time between origination events is proportional to f_i. In most origin-fixation models [87], the fixation time T_F is idealised as zero. As explained in the main text, in language change the timescale of fixation (decades to hundreds of years) is not greatly separated from the origination timescale (hundreds to thousands of years). Consequently, we must generalise to nonzero fixation times. Specifically, we model the fixation process as one whose time to fixation is drawn from a Gamma distribution (because the fixation time is necessarily positive, and we wish to treat its mean and variance as independent quantities). Since in this framework it is possible that the change to the next stage of the cycle may be triggered before the previous one has gone to fixation, we must also account for interference between successive originations.
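The resulting renewal process (an exponential waiting time to origination, followed by a Gamma-distributed fixation time) can be sampled directly. The sketch below is a Monte Carlo illustration under these stated assumptions, with function names of our own choosing; it ignores interference between originations, which is treated in S1.4.2.

```python
import numpy as np

def simulate_changes(omegas, T_F_mean, T_F_var, t_window, rng):
    """Count completed changes in an observation window of length t_window.

    Stage i lasts an Exp(omega_i) origination waiting time plus a Gamma
    fixation time with the given mean and variance (shape/rate chosen so
    that mean = alpha/beta and variance = alpha/beta**2)."""
    alpha = T_F_mean**2 / T_F_var   # Gamma shape parameter
    beta = T_F_mean / T_F_var       # Gamma rate parameter
    t, m = 0.0, 0
    for omega in omegas:
        # numpy's gamma takes (shape, scale); scale = 1/rate
        t += rng.exponential(1.0 / omega) + rng.gamma(alpha, 1.0 / beta)
        if t > t_window:
            break
        m += 1
    return m
```

In the limit of a vanishing fixation time the counts reduce to those of an ordinary Poisson process, which provides a simple check of the sampler.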

S1.4.1 Likelihood function
In the main text, we compared different models using the Akaike Information Criterion, defined through equation (4). This involves the likelihood function

L = ∏_i L_{m_i}(t_i) ,

where L_{m_i}(t_i) gives the probability that exactly m_i language changes have occurred in a time window of length t_i, these quantities corresponding to language i in the sample. In this section, we explain how L_{m_i}(t_i), and therewith L, is calculated.
We assume that at the beginning of an observation window, t = 0, only one variant is present in the population. Let ω_1 be the origination rate at the first stage in the cycle, and let T̄_F and σ²_{T_F} be the mean and variance of the time for the innovation to reach fixation in the population, conditioned on this event occurring. Even in those cases where exact results are available [88], the functional form of the distribution p_1(t) that the invading mutant fixes at time t is very complicated. We have found that it is well approximated by the convolution of a Poisson process with rate ω_1 and a Gamma distribution with mean T̄_F and variance σ²_{T_F} (see S1.5.2.1 below). That is,

p_1(t) = ω_1 e^{−ω_1 t} * (β^α t^{α−1} e^{−β t} / Γ(α)) , (S1.4)

where Γ(α) is the Gamma function and * denotes the convolution operation. The parameters α and β are related to the mean and variance of the fixation time via

α = T̄_F² / σ²_{T_F} , β = T̄_F / σ²_{T_F} . (S1.5)

Once the innovation has gone to fixation, the next innovation can be introduced to the population at rate ω_2 (which need not equal ω_1), and its fixation time is assumed to have the same mean and variance as the first mutant's. The probability P_m(t) that at least m changes have occurred by time t is obtained by convolving the stage densities p_1(t), . . . , p_m(t) (each with its own rate ω_i) and integrating from 0 to t. The probability L_m(t) that exactly m changes have occurred by time t is then given by L_m(t) = P_m(t) − P_{m+1}(t). This is most conveniently written in the form of the Laplace transform

L̃_m(s) = (1/s) [∏_{i=1}^m ω_i/(s + ω_i) (β/(s + β))^α] [1 − ω_{m+1}/(s + ω_{m+1}) (β/(s + β))^α] . (S1.6)

Note that in the case m = 0, the product over i is set equal to unity.
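As a sketch, the transform (S1.6) can be coded directly, assuming each stage contributes an Exp(ω_i) origination time convolved with a Gamma(α, β) fixation time as described above. In the limit β → ∞ it reduces to the transform of a pure Poisson process, which provides a check (function name ours):

```python
import numpy as np

def laplace_L_m(s, m, omegas, alpha, beta):
    """Laplace transform of L_m(t), the probability of exactly m changes.

    Each stage i contributes omega_i/(s + omega_i) * (beta/(s + beta))**alpha,
    i.e. an exponential origination time convolved with a Gamma fixation time;
    exactly m changes means the first m stages are complete but not the next."""
    stage = lambda i: (omegas[i] / (s + omegas[i])) * (beta / (s + beta))**alpha
    prod = 1.0
    for i in range(m):
        prod *= stage(i)
    return (prod / s) * (1.0 - stage(m))
```

For β → ∞ and a single rate ω, L̃_0(s) → 1/(s + ω) and L̃_1(s) → ω/(s + ω)², the transforms of e^{−ωt} and ωt e^{−ωt} respectively, i.e. the Poisson limit.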
It is possible to invert the Laplace transform and obtain explicit expressions for L_m(t) involving alternating sums of incomplete Gamma functions. Unfortunately, these expressions are difficult to compute to the desired numerical precision, due to cancellations between terms at leading order. A much better approach is to numerically invert (S1.6) using the Euler algorithm as set out in [89]. This involves computing the sum

L_m(t) ≈ (1/t) ∑_{i=1}^{n} c_i L̃_m(ζ_i/t) , (S1.7)

where the number of terms n, the weights c_i and the nodes ζ_i depend on the desired precision [89].
We have found five digits of precision sufficient for our needs, which corresponds to n = 18, a remarkably small number of function evaluations given the complexity of the problem. The different cultural evolutionary scenarios that we test in the main text give rise to a wide range of parameter combinations. To maintain numerical precision in calculating the logarithm of the likelihood L_m(t) across the full range of parameter values, a few minor modifications to the standard Euler algorithm were required:
• For the case m = 0, L_0(t) → 1 as t → 0. Here, the log-likelihood is close to zero, and for it to be obtained to the desired precision we instead invert the transform of 1 − L_0(t) (which is close to zero), and use a library function to evaluate ln(1 − x) for small x.
• In all other cases, the likelihood has a leading exponential decay e^{−s*t}, where s* = min{β, ω_1, . . . , ω_m} is the location of the singularity in (S1.6) closest to the origin in the complex-s plane. Since the combination s*t can become large (leading to a very small likelihood), we maintain precision via the identity ln L(t) = −s*t + ln R_{s*}(t), where R_{s*}(t) is the inverse of the shifted Laplace transform L̃(s − s*). Note that the apparent pole at s = 0 is cancelled by a zero in the numerator, which permits this shift of the integration contour.
• Finally, some of the terms in the sum (S1.7) can take values that are sufficiently small or large to cause overflow when working to machine precision. We handle this by computing the logarithm of each term in the sum and subtracting out the largest real part from all terms. Then the remainders can be safely exponentiated and summed without causing overflow. The contribution that was subtracted is then reinstated into the result for the log likelihood at the end of the calculation.
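The overflow-safe summation described in the last point is a complex-valued variant of the familiar log-sum-exp trick; a minimal sketch (function name ours):

```python
import numpy as np

def logsumexp_complex(log_terms):
    """Sum terms specified by their complex logarithms without overflow.

    Factor out the largest real part, exponentiate the shifted remainders,
    sum them, and reinstate the shift in the logarithm of the result."""
    log_terms = np.asarray(log_terms, dtype=complex)
    shift = log_terms.real.max()               # largest real part of any term
    total = np.exp(log_terms - shift).sum()    # safe: shifted terms have modulus <= 1-ish
    return shift + np.log(total)               # complex log of the summed terms
```

The terms of (S1.7) are complex, so the shifted exponentials can partially cancel; the returned value is the complex logarithm of the sum, whose real part is the desired log-magnitude.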
The likelihood of the macroscopic Poisson process with state-dependent origination rates ω 1 , ω 2 , . . . can be obtained from the inversion of (S1.6) after taking the limit β → ∞. A complete implementation of the likelihood analysis code is provided for reference at [90].

S1.4.2 Interference correction
The calculation of the likelihood function above is conditioned on each origination going to fixation before the next origination event is triggered. When the origination rate ω i is comparable to 1/T F , successive origination events can interfere, which ultimately leads to coexistence of multiple variants rather than a sequence of changes going through the population. This feature is inconsistent with the empirical data, in which the typical situation is that one convention dominates, or that we are in transition from one stage of the cycle to the next. To prevent a maximum of the likelihood function being found in this region, we need to multiply it by the probability that an originated innovation does go to fixation: this then gives us the joint probability that fixation will occur, and that it has occurred by a given time.
We therefore introduce the correction C_i, which is the probability that the innovation introduced at stage i (i.e., the one that precipitates a change to stage i + 1) goes to fixation without interference. Then,

L_m(t) → L_m(t) ∏_{i=1}^m C_i . (S1.8)

In the case of the Poisson process (or the Wright-Fisher model with infinite selection), T_F ≡ 0, and so no correction is needed (C_i ≡ 1). In the general Wright-Fisher model, we find from numerical computations that C_i is well approximated by C_i = e^{−ω_i T̄_F} (see S1.5.2.1 below). However, the precise form of the correction is not too important for likelihood maximisation, as long as C_i ≈ 1 when innovations can propagate freely without interference, and decreases towards 0 when they cannot.
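Applied to the log-likelihood, the correction amounts to subtracting T̄_F Σ_i ω_i, assuming the exponential form C_i = e^{−ω_i T̄_F} quoted above (a sketch; the function name is ours):

```python
import numpy as np

def corrected_loglik(loglik, omegas, T_F_mean):
    """Interference-corrected log-likelihood: with C_i = exp(-omega_i * T_F),
    ln L -> ln L + sum_i ln C_i = ln L - T_F * sum_i omega_i."""
    return loglik - T_F_mean * np.sum(omegas)
```

When ω_i T̄_F is small for all stages the correction vanishes, and it penalises parameter combinations in which successive originations would interfere.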

S1.5 Wright-Fisher model
At the individual speaker level, we use a Wright-Fisher model. We provide the full definition of this model here, explain how we extract from it the parameters α, β and ω i in the origin-fixation model, and demonstrate numerically that the latter serves as a good approximation to the Wright-Fisher model after averaging over individual innovation frequencies.

S1.5.1 Definition
The dynamics of the Wright-Fisher model are illustrated in Fig. 1 of the main text. We assume that at a given point in time, the language is in transition from stage i to i + 1 of the cycle. Each of the N speakers is then characterised by the frequency x_n with which they use the innovation (i.e., produce an utterance consistent with stage i + 1 of the cycle). Each speaker updates their grammar (their x_n value) at intervals of ∆t = 1/R, where R is the interaction rate. They retain a fraction (1 − ε) of their existing grammar, and replace the remaining fraction ε with a memory of either the innovation or the existing convention, depending on the relevant linguistic interactions they have over the interval ∆t. If we define a variable τ_n such that τ_n = 0 when the speaker stores a memory of the convention, and τ_n = 1 in the case of the innovation, we have

x_n ← (1 − ε) x_n + ε τ_n . (S1.9)

The probability p_n that τ_n = 1 depends on the frequency of the innovation in the speaker's neighbourhood, x̄_n, the individual innovation rate η_i and the selection strength s. The specific prescription is

p_n = (1 − x̄_n) η_i / (1 + s x̄_n) + (1 + s) x̄_n / (1 + s x̄_n) . (S1.10)

In words, this equation says that a speaker first samples an instance of linguistic behaviour from their local neighbourhood, wherein the convention is given weight 1 and the innovation weight 1 + s. That is, if s > 0 the innovation is selected for; if s < 0 it is selected against. If this instance of linguistic behaviour corresponds to the convention, there is a probability η_i that it is nevertheless recorded by the speaker as the innovation (i.e., a mutation from stage i to i + 1 of the cycle has occurred at the individual level).
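One update sweep of this model can be sketched as follows, assuming the sampling probability takes the equivalent combined form p = ((1 − x̄)η + (1 + s)x̄)/(1 + s x̄), i.e. sampling from a neighbourhood in which the innovation carries weight 1 + s, with misperception rate η (function and variable names ours):

```python
import numpy as np

def wf_step(x, neighbours, eps, eta, s, rng):
    """One grammar update for every speaker.

    x[n] is speaker n's innovation frequency; neighbours[n] is an index array
    of the speakers whose behaviour n observes (equal attention to each)."""
    # Local innovation frequency perceived by each speaker.
    xbar = np.array([x[nb].mean() for nb in neighbours])
    # Sample convention (weight 1 - xbar) or innovation (weight (1+s)*xbar);
    # a sampled convention is misperceived as the innovation with probability eta.
    p = ((1.0 - xbar) * eta + (1.0 + s) * xbar) / (1.0 + s * xbar)
    tau = rng.random(len(x)) < p
    # Retain a fraction (1 - eps) of the grammar, replace eps with the memory tau.
    return (1.0 - eps) * x + eps * tau
```

With η = 0 the all-convention and all-innovation states are absorbing, and every update keeps each x_n in [0, 1], as required of a frequency.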
We are deliberately abstract in our specification of the model, as different mechanisms can give rise to selection and innovation. A number of concrete examples, and their relation to linguistic theories, are provided in the main text.

S1.5.2 Correspondence with the origin-fixation model
To connect the Wright-Fisher model to the origin-fixation model, we need to work out the values of the parameters ω_i, T̄_F and σ²_{T_F} ≡ ⟨T_F²⟩ − ⟨T_F⟩² of the latter that are implied by the former. We consider first the origination rates. Starting from a state with x_n = 0 for all speakers, each agent has a probability η_i of generating an innovation at each update, which occurs at rate R. The total rate at which successful innovations propagate is then N R η_i Q(ε/N), where Q(x_0) is the probability that an innovation with initial frequency x_0 in the population goes to fixation. The initial frequency is x_0 = ε/N because we assume the innovation rate is sufficiently small that exactly one speaker starts off with the innovation at level ε. Within this low innovation-rate regime, the innovation then propagates under the influence of selection and drift (fluctuations arising from the finite exposure to linguistic behaviour) until it reaches fixation.
To calculate Q(x_0), we first define δx_n, the change in x_n over a single update, and determine the expectation values ⟨δx_n⟩ and ⟨(δx_n)²⟩ over the distribution of τ_n given above. This allows one to write down the forward or backward Kolmogorov equation for the set of individual speaker frequencies x_n via the Kramers-Moyal expansion. Various studies, e.g., [91-93], have shown that for an appropriate weighted average x of these individual speaker frequencies, one has the backward Kolmogorov equation

T_M ∂q(t|x)/∂t = s x(1 − x) ∂q(t|x)/∂x + (x(1 − x)/(2N_e)) ∂²q(t|x)/∂x² (S1.11)

for the probability distribution q(t|x) that an innovation with initial frequency x reaches fixation at time t. In this equation, s is as defined in the Wright-Fisher model, T_M = 1/(Rε) is the memory lifetime identified in the main text, and N_e is an effective population size (S1.12) that depends on the interaction network through the quantities z_n. The quantity z_n is defined as the number of other speakers whose linguistic behaviour speaker n can observe; we assume that each of these speakers is given the same amount of attention, although the number of neighbours z_n can vary across a social network. (Derivations of this result can be found in [91-93].)
This Kolmogorov equation is extremely well studied in the population genetics literature (e.g. [88,94]). In particular, the procedure for obtaining moments of the time to reach fixation from a single mutant (x → 0), conditioned on fixation occurring, is well established [94]. To compute T̄_F and σ²_{T_F}, we require the first two moments. Defining

F_k(x) = Q(x) ⟨T_F^k⟩_x , (S1.13)

where ⟨T_F^k⟩_x is the kth moment of the fixation time from initial frequency x, conditioned on fixation, we have

T̄_F^k = lim_{x→0} F_k(x)/Q(x) , (S1.14)

where Q(x) is the probability that a mutant with initial frequency x fixes. By multiplying (S1.11) by t^k and integrating [94], we find the recursion

s x(1 − x) F_k′(x) + (x(1 − x)/(2N_e)) F_k″(x) = −k T_M F_{k−1}(x) , (S1.15)

where a prime denotes differentiation. The boundary conditions of this differential equation are F_k(0) = 0 for all k, F_0(1) = 1 and F_k(1) = 0 for k > 0. For the case k = 0, the solution has the closed form [94]

Q(x) = (1 − e^{−2N_e s x}) / (1 − e^{−2N_e s}) . (S1.16)

Substituting x = ε/N we arrive at Eq. (3) in the main text. Although no general closed-form expression exists for the case k > 0, (S1.15) can still be integrated to obtain an explicit integral representation (S1.17) [94], which can be evaluated numerically and substituted into (S1.14) to obtain the desired moments.
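For the k = 0 case, the closed form can be evaluated in a numerically stable way with expm1, taking care of the neutral limit s → 0 where Q(x) → x. This is a sketch assuming the standard form Q(x) = (1 − e^{−2N_e s x})/(1 − e^{−2N_e s}); the function name and the cutoff on |2N_e s| are ours.

```python
import numpy as np

def fixation_probability(x, Ne, s):
    """Conditional fixation probability Q(x) for initial frequency x.

    Uses expm1 to avoid catastrophic cancellation when 2*Ne*s is small,
    and returns the neutral limit Q(x) = x when it is effectively zero."""
    a = 2.0 * Ne * s
    if abs(a) < 1e-8:             # neutral drift limit, avoids 0/0
        return x
    return np.expm1(-a * x) / np.expm1(-a)
```

For s > 0 the innovation fixes more often than a neutral variant (Q(x) > x), and for a single mutant one substitutes x = ε/N as in the text.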
The complexity of this numerical problem is reduced slightly by noting that the integrand in (S1.17) involves the combination 1 − e^{−2N_e s(1−y)} (S1.18), and that T̄_F^k is symmetric under s → −s. However, we find that a numerical integration routine, implemented naïvely, becomes unreliable for both small and large N_e|s|. In these regimes it is both more efficient and less susceptible to numerical instability to use Taylor series and asymptotic expansions, respectively. Specifically, for small |s| we use Taylor-series approximations (S1.19) when 2N_e|s| < 10⁻³ and (S1.20) when 2N_e|s| < 10⁻², and for large |s| asymptotic expansions (S1.21) and (S1.22) when 2N_e|s| > 500, in which γ = 0.577… is the Euler-Mascheroni constant.

S1.5.2.1 Numerical test of the correspondence between the models
In arriving at the Kolmogorov equation (S1.11), we made a number of approximations, in particular that innovation (mutation) can be ignored when estimating the time to fixation. We therefore compare numerical solutions for the probability P(t) that an innovation has reached fixation by time t within the full Wright-Fisher dynamics against the formulae we have derived for the corresponding origin-fixation model. These results are shown in Fig. S1.3, and we find the essential features are well captured. Of particular importance are deviations from Poisson behaviour that are evident at early times. These deviations arise from the fact that an innovation takes a finite amount of time to propagate through the whole population, and are absent in classical origin-fixation models where T_F is assumed to be zero. We see that even in a population of 100 speakers, the probability that a single change has occurred is suppressed for a historically relevant time (∼100 generations, which would equate to 2,500 years in a child-based model). It is this property that is ultimately responsible for the very low likelihoods that are encountered in the main text.
These numerics also allow us to estimate the form of the interference correction C_i introduced in S1.4.2. To achieve this, we consider a generalisation of the Wright-Fisher model in which two innovations can occur. If the innovation rate is fast enough, the second innovation can occur before the first has gone to fixation. A plot of the probability that the first goes to fixation, for a variety of innovation rates, gives us the correction factor C_i. We find that the numerical data are reasonably well fit by the function C_i = e^{−ω_i T̄_F}. As noted previously, it is not important to capture the exact form of this function, since the regime where successive innovations interfere is not empirically relevant; what matters is establishing where the correction becomes significant.

S1.5.3 Robustness of the Wright-Fisher model
It is well established in the population genetics literature (e.g. [95]) that the Wright-Fisher model approximates well the behaviour of a large number of evolutionary processes. In the main text, we presented one specific member of this class, since this allowed us to ascribe concrete meanings to the parameters N_e, s and T_M in terms of individual linguistic behaviour. Here, we demonstrate that many aspects of that specific model-for example, that all individuals have the same memory lifetime and that this remains constant over time, that they are equally biased in favour of (or against) an innovation, and that they do not themselves undergo processes of birth and death-are incidental. The key property is that linguistic behaviour is socially learnt, i.e., acquired from other members of the speech community, perhaps in the presence of biases.
This demonstration model comprises the following components:
• Agent lifetime - After being introduced into the population, an agent has a lifetime d drawn from a Gamma distribution with mean µ_d and variance σ²_d. It is then removed from the population after time d has elapsed. If this removal causes the entire population to go extinct, a new population comprising a single individual is immediately established. (The probability of this event is very small once a steady state is reached, and it is included simply to guarantee that every run of the simulation reaches the steady state.)
• Agent reproduction - When an agent is introduced into the population, it is also assigned a number of offspring k ≥ 0 drawn from a Geometric distribution with mean µ_k. Associated with each offspring n = 1, 2, . . . , k is the age b_n of the parent when the offspring is born. This age is drawn from a Gamma distribution with mean µ_b and variance σ²_b (if b_n exceeds d, the parent's age of death, then it is discarded and the distribution resampled). When the time of birth arrives, a new agent is entered into the population with probability 1 if the current population size N is smaller than a carrying capacity K, and with probability K/(N µ_k) otherwise. This rule prevents unbounded exponential growth of the population, and in practice causes its size N to fluctuate around K in the steady state.
• Initial interaction network - When an agent is born, its parent is marked as an interlocutor (that is, someone who may influence its linguistic behaviour). The offspring also inherits each of its parent's z interlocutors with probability µ_i/z. Thus the offspring inherits all interlocutors if µ_i > z; otherwise it inherits µ_i of its parent's interlocutors on average.
• Expansion of the interaction network - At the time of birth, an agent is also assigned an age e drawn from a Gamma distribution with mean µ_e and variance σ²_e. At this age, each member of the population, n, is assigned a weight w_n = w_0 if they are one of the agent's existing interlocutors, or w_n = exp(−h|δ|) otherwise, where δ is the age difference between the two agents. The existing interlocutors are then discarded, and a new set is built up, with agent n being marked as an interlocutor with probability µ_i w_n/Z (capped at 1), where Z = Σ_n w_n. Here µ_i is the mean number of interlocutors arising from the expansion of the interaction network. If the homophily parameter h = 0, then every agent in the population has the same chance of becoming an interlocutor at the time of expansion; for h > 0, agents who are closer in age are more likely to interact after expansion than those further apart in age. Thus, in this model, young agents tend to be influenced by their parents (and their parents' peer group), whereas older agents tend to be influenced by their own peer group.
• Initial linguistic behaviour -When an agent is created, the frequency x with which it uses the innovation is inherited unchanged from its parent.
• Rate of linguistic interactions - An agent participates in a linguistic interaction that modifies its behaviour at age a according to a time-inhomogeneous Poisson process of intensity R(a) = R_∞ + (R_0 − R_∞)e^{−a/θ}. That is, when the agent is born, the rate of (behaviour-modifying) interactions is R_0, and this decays exponentially to R_∞ < R_0 with characteristic timescale θ. Thus, in this model, an agent may become less liable to change their behaviour as they age. Each of the parameters R_0, R_∞ and θ is assigned from distributions when the agent is born: R_0 is drawn from a Gamma distribution with mean µ_R and variance σ²_R; R_∞ = R_0/(1 + r), where r is drawn from a Gamma distribution with mean µ_r and variance σ²_r; and the decay time θ is drawn from a Gamma distribution with mean µ_θ and variance σ²_θ.
• Behaviour modification in a linguistic interaction - When a linguistic interaction takes place, each of the agent's interlocutors is included in the interaction with probability q, subject to the constraint that at least one interlocutor must be present. The mean frequency of the innovation among this subset of interlocutors, y, is calculated. The agent then updates their innovation frequency x as described in the main text: a fraction ε of their existing frequency is replaced with τ = 1 if they perceive the innovation in the interaction, and with τ = 0 if they perceive the convention. The probability that τ = 1 is (1 + χ)y/(1 + χy), where χ is a bias towards (or against) the innovation. Both parameters ε and χ are randomly assigned to an agent at birth: ε is drawn from a Beta distribution on [0, 1] with mean µ_ε and variance σ²_ε, and the bias χ is drawn from a normal distribution with mean µ_χ and variance σ²_χ.
It is evident that this model is much more complex than that described in the main text: agents are born and die, and the population size fluctuates over time; there is vertical and horizontal transmission of linguistic behaviour, the proportion of which changes during an agent's lifespan; agents can be more or less liable to change their behaviour, and the rate at which they do so decreases over time; and they can be more or less disposed to the innovation. It also has a correspondingly increased number of parameters: 23 in total; by contrast, the Wright-Fisher model described in the main text has only 5 (if one excludes, as here, the possibility of an innovation being generated during the process of fixation).
To establish that this demonstration model falls into the general class of Wright-Fisher models-and can be well represented with a smaller number of parameters-we use the fact that over short time increments ∆t, the first two moments of the change in the frequency of the innovation over the population, ∆x, should satisfy
$$\overline{\Delta x} = \frac{s}{T_M}\, x(1-x)\,\Delta t \;, \qquad \overline{(\Delta x)^2} = \frac{x(1-x)}{N_e T_M}\,\Delta t \;,$$
where x is the innovation frequency, and the parameters T_M, s and N_e are as described in the main text. Here, the overlines denote averages over multiple time intervals. The crucial point is that correspondence with the Wright-Fisher model is manifested as both moments varying with frequency as x(1 − x).
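The logic of this test can be illustrated with a generic Wright-Fisher simulation. The sketch below (our own illustration, not the demonstration model itself) estimates the two jump moments over one generation and shows that, for a neutral population, the mean change vanishes and the variance takes the binomial-sampling form x(1 − x)/N:

```python
import random

def wf_jump_moments(x0, N, s, trials=10000, rng=random.Random(1)):
    """Estimate the mean and variance of the one-generation change in
    innovation frequency, starting from frequency x0, in a Wright-Fisher
    population of N agents in which the innovation enjoys a (1 + s)-fold
    sampling advantage."""
    changes = []
    for _ in range(trials):
        # Biased sampling probability, analogous to (1 + chi)y/(1 + chi*y)
        p = (1 + s) * x0 / (1 + s * x0)
        # Resample the whole population and record the frequency change
        x1 = sum(rng.random() < p for _ in range(N)) / N
        changes.append(x1 - x0)
    mean = sum(changes) / trials
    var = sum((c - mean) ** 2 for c in changes) / trials
    return mean, var
```

For small s the estimated mean change is approximately s x(1 − x) per generation and the variance is approximately x(1 − x)/N, so both jump moments carry the characteristic x(1 − x) frequency dependence described in the text.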
In Figure S1.4 we plot these moments obtained from simulations as a function of x under three choices of the parameters controlling the bias χ: (i) µ_χ = 0, σ_χ = 0.005; (ii) µ_χ = 0.005, σ_χ = 0; and (iii) µ_χ = 0.005, σ_χ = 0.005. That is, in case (i) agents are equally likely to be biased in favour of, or against, the innovation; in case (ii) all agents are biased by the same amount in favour of the innovation; and in case (iii) agents are more likely to be biased in favour of, rather than against, the innovation. The values of the remaining 21 parameters are given in Table S1.5 (and were chosen to lie in a range that could plausibly describe a small human-like population). Fits of the parabola Ax(1 − x), with the amplitude A as a free parameter, are shown as dashed lines in Figure S1.4. We find that these empirical fits describe the jump moments obtained from simulation well (albeit subject to some noise in the estimation of the first jump moment, which is likely a consequence of the wide variation in individual behaviour this demonstration model permits). The amplitudes can also be estimated directly from the model parameters via Eq. (S1.24); the resulting parabolas are plotted as solid lines in Figure S1.4. We find that the amplitude of the second jump moment (which characterises the stochastic contribution to the dynamics) is well described by this estimate, whilst that of the first jump moment is of the right order of magnitude but is over-estimated. This demonstrates that the additional complexity of this model does not fundamentally change its behaviour, but instead leads to values of the parameters in the Wright-Fisher (and therewith, the origin-fixation) model that deviate slightly from estimates obtained by simple averaging.
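The one-parameter fit of the parabola Ax(1 − x) has a closed form: minimising Σ_i (m_i − A u_i)² with u_i = x_i(1 − x_i) gives A = Σ_i m_i u_i / Σ_i u_i². A minimal sketch (our own illustration; the function name is not from the original analysis code):

```python
def fit_parabola_amplitude(xs, moments):
    """Least-squares estimate of A in moment(x) = A * x * (1 - x).
    Writing u_i = x_i * (1 - x_i), minimising sum_i (m_i - A*u_i)^2
    yields the closed form A = sum_i(m_i * u_i) / sum_i(u_i^2)."""
    us = [x * (1 - x) for x in xs]
    return sum(m * u for m, u in zip(moments, us)) / sum(u * u for u in us)
```

Applied to the estimated jump moments at a set of sampled frequencies, this yields the amplitudes plotted as dashed lines in Figure S1.4.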
This observation has two important consequences for our analysis. First, by surveying all combinations of the parameters T M , s and N e , we ultimately account for any model which-like the demonstration model here-falls into the large Wright-Fisher class. Second, in addition to the Wright-Fisher model providing a robust description of many different evolutionary processes, the interpretation of quantities like memory lifetime and individual biases given in the main text also generalises beyond the specific individual-based model presented there.