Evoked and transmitted culture models: Using bayesian methods to infer the evolution of cultural traits in history

A central question in behavioral and social sciences is understanding to what extent cultural traits are inherited from previous generations, transmitted from adjacent populations or produced in response to changes in socioeconomic and ecological conditions. As quantitative diachronic databases recording the evolution of cultural artifacts over many generations are becoming more common, there is a need for appropriate data-driven methods to approach this question. Here we present a new Bayesian method to infer the dynamics of cultural traits in a diachronic dataset. Our method called Evoked-Transmitted Cultural model (ETC) relies on fitting a latent-state model where a cultural trait is a latent variable which guides the production of the cultural artifacts observed in the database. The dynamics of this cultural trait may depend on the value of the cultural traits present in previous generations and in adjacent populations (transmitted culture) and/or on ecological factors (evoked culture). We show how ETC models can be fitted to quantitative diachronic or synchronic datasets, using the Expectation-Maximization algorithm, enabling estimating the relative contribution of vertical transmission, horizontal transmission and evoked component in shaping cultural traits. The method also allows to reconstruct the dynamics of cultural traits in different regions. We tested the performance of the method on synthetic data for two variants of the method (for binary or continuous traits). We found that both variants allow reliable estimates of parameters guiding cultural evolution, and that they outperform purely phylogenetic tools that ignore horizontal transmission and ecological factors. Overall, our method opens new possibilities to reconstruct how culture is shaped from quantitative data, with possible application in cultural history, cultural anthropology, archaeology, historical linguistics and behavioral ecology.

Given the number of parameters, such an exploration would be a major endeavor, and so it would be reasonable for the authors to hold off, and save such an exploration for a future paper. But either way, I would encourage them to do so at some point if they want others to take notice of their suggested methodology.
The reviewer raises an important point. As noted by the reviewer, the parameter space of the model is simply too large to allow for a thorough exploration of the estimation properties throughout this parameter space. As a first step towards this endeavor, we have replicated the estimation for the continuous ETC model, varying the values of parameter λ (the influence of the ecological variable) and the number of regions (50 regions instead of 5). The good estimation properties of the algorithm were maintained for all values of λ, and also for a larger number of regions. In other simulations, we varied the number of time steps in the simulation, which is equivalent to changing the sample size. As expected, the more time steps are included in the simulations, the more accurate are parameter estimates, and the more often is the correct model identified using model selection tools. We have also added a new tool (the pseudo Variance Inflation Factor) to estimate the identifiability of parameters of cultural trait evolution for continuous traits (see new Section B.6). We introduce the tool in Results section and illustrate it with the case of a large number of connected regions that result in the poor identifiability of the cultural diffusion parameter. The new results are presented as Supplementary Figures 1-5, with the following addition in main text: "These results were obtained for a certain choice of parameter values and number of observations (i.e. the total number of artifacts). Exploring the estimation properties when all parameter values and number of observations are changed is out of scope of the present study. However, we performed further analyses varying the value of parameter λ. We found that parameter values were correctly estimated when λ was varied (Supplementary Figure  1), suggesting that our previous analysis hold at least for a certain range of parameters. Trait trajectories and model parameters were also correctly estimated when we increased the number of regions to K=50 instead of 5 (Supplementary Figure 2). Moreover, in another set of simulations, we varied the number of time steps in each simulation, which effectively modulates the sample size. As expected, we found that when the number of time steps was larger, parameters were more accurately estimated (Supplementary Figure 3), and that the full model was more often (correctly) identified as the underlying model (Supplementary Figure 4). This stresses the importance of sample size for making correct inference from the data, as in any statistical analysis. Overall, as for any statistical modelling technique, the best approach is to validate the estimation procedure before applying it to experimental data on synthetic data that matches as much as possible the experimental data (same number of regions, artifacts, credible values of parameters). This can be done in a straightforward manner by any researcher interested in using our method for their dataset by some simple adaptation of the code. We have now added in the Discussion: "As expected for any statistical procedure, the accuracy of parameter estimation and model selection depends largely on the structure of the data and in particular on the number of observations (here the number of artifacts Similarly, the authors' position would be strengthened if they applied their method to real (as opposed to purely simulated) data to show an example of the kind of insights this method offers in practice.
We have actually applied the methods to a dataset of literary works across ages and regions of the world, to understand how the concept of romanticism has evolved and its relation to the affluence of the population. This analysis is part of a broader set of analyses to test different hypotheses regarding the cultural evolution of romanticism. We feel this practical example cannot be presented here in a compact fashion, and will point to the other manuscript (soon to be published) for a case example of the method.
While I do think these are important concerns. I do not think they necessary preclude publication of the paper. This is because what is already there is (as far as I can tell) technically sound, and the paper does make a sensible contribution in its current form. There is simply more that could be done if the authors wish to maximize the impact of their suggested method.
Minor points: 1. lines 13-16. The use of the demographic transition is weird here. Yes, it occurs across societies, but this does not simply imply an ecological response without transmission. For instance, Heidi Colleran has shown the role of social transmission in women's contraceptive choices in a rural population. While Boyd & Richerson have argued that prestige biased transmission is critical to the demographic transition (see Not By Genes Alone). More generally, there is no unproblematic explanation for the demographic transition as an adaptative ecological response. It remains a big question, that defies a behavioral ecological approach, and cultural transmission seems very important.
We understand the point of view of the reviewer, and we agree that "there is no unproblematic explanation for the demographic transition as an adaptative ecological response". However, there are many adaptive approaches to the demographic transition (some arguing that it is adaptive, some arguing that it is a mismatch). In fact, Colleran publishes her article in a special issue of Philosophical Transaction 'Understanding variation in human fertility: what can we learn from evolutionary demography?' in which most articles discuss adaptive responses to ecological changes. Similarly, in economics, a popular explanation is that the demographic transition is driven by the increase in access to education, which is an environmental change (i.e. the relative cost of education decreases), and does not necessarily requires cultural transmission. In this perspective, such an example does not seem weird at all.
We have nuanced the sentence: 2. line 32. Putting "institutions" in the environment and not a part of cultural transmission is a big claim! Many cultural evolutionists would push back against this. If the authors really want to have culturally constructed and transmitted institutions to be part of the environment and not part of transmitted culture then they need to argue why this is the right thing to do in this case.
We agree that this is not the right place to have a debate as to whether institutions are information inheritance or environment inheritance. We have deleted the term.
3. lines 77-89. Overall the model introduction is well written, but it's a challenge writing it such that even non-statistically oriented scientists can follow. Minimally, I'd unpack the term "probabilistic generative models" just so readers can follow more easily.
We agree with the referee that it is a real challenge. Following the reviewer's advice, we have now unpacked: Thank you for the suggestion. We now detail: "θT represents a set of parameters of cultural evolution (which control notably the level of horizontal transmission, vertical transmission and evoked component, see details in following sections)" 5. Figure 1, I would suggest using different shaped symbols for E, T and A. As it currently is I assumed time flowed from top to bottom, not left to right, so it took me a while to wrap my head around the figure.
We followed the referee's advice and now use different shapes for E, T and A.
6. line 128 -please explain why this is necessary. I think it is because otherwise you might get multiple transitions within a single time step.
The reviewer's intuition is correct. We now make it explicit: Unfortunately, it does not seem that the other reviewer can either. We hope the good performance of the estimation procedure argues in favor of their validity.
9. Figure 3 is really hard to read, the thick lines obscure the shaded areas and thin lines. A different visual approach is required.
We agree with the reviewer. We have changed the colour and width of the difference curves, which in our opinion makes it easier to compare the true and estimated trajectories of the cultural trait.
Also, why do the blue lines stop early?
This is because the corresponding region is suppressed at the corresponding time. We illustrated here the possibility that some states are created, divided or suppressed at certain times. This was described in the corresponding Appendix C2, but we now make it explicit in the legend of the figure: "Note that in this example region 2 was created from region 3 at time step 31, region 1 was created de novo at time step 51, while region 3 was suppressed at time step 300." In fact, the x-axis limits are different in all 3 panels -what's going on?
This was an unfortunate error. It has been corrected.

Reviewer #2:
The manuscript Evoked and Transmitted Culture models: Using bayesian methods to infer the evolution of cultural traits in history propose a novel model to infer the evolution of cultural traits. The article is interesting and I support publication should the following points be addressed.

Use Case
The authors decided to introduce the model in rather abstract terms and to test it on synthetic data. This is fine, but it makes it difficult to verify how the model assumptions relate to concrete real-world phenomena, say the evolution of linguistic traits. I was wondering if the authors could present the model with a clear use case instead. This would make it easier to verify to what degree the model assumptions are appropriate for inferring a specific evolutionary process.
We thank the reviewer for raising the important issue of clarity. Following the reviewer advice, we now present a what the different variables mean for a possible example. The presentation of the introduction now reads: "We will use the following example as an illustration: in this example, the cultural trait is the general support for gender equity in the population (a cognitive disposition), while the cultural artifacts could be a corpus of fictions, and for each fiction in the corpus we measure (quantitatively) the importance of female characters. We expect that the cognitive disposition at a given time and region will influence the content of the artifacts: the more support there is for gender equity in the population, the more likely we expect to find important female characters in the fictions produced at that time and region.

A general question in cultural evolution is to understand the relative importance of ecological factors, vertical transmission and horizontal transmission in the evolution of the cultural trait. Because we have no direct access to the mental dispositions of ancient populations, we want to infer the dynamics of the cultural traits using the indirect observations of cultural artifacts. In our example, we would want to understand what drives changes in the support for gender equity in the population from a dataset of fiction books at different times and regions, each classified according to the importance of its female characters. If our hypothesis is that economic development leads to higher support of gender equity, we could use the GDP per habitant (a proxy to economic development) as the ecological factor. (…) Bayesian methods allow notably to estimate the parameters of the generative model that best fit the observations in the dataset (e.g. through Maximum
Likelihood Estimation) and compare which of a set of models is best supported by the observations. In our example, we could compare a model where economic development is taken as the ecological factor driving changes in support for gender equity with a model where the average level of science education is the ecological factor.

(…)
In the following sections we discuss two distinct classes of ETC models: when the cultural trait takes the form of a binary value and when it takes the form of a continuous value. In our example, these two classes correspond to classifying a population as supporting or non-supporting gender equity for the binary trait; or to quantifying the support to gender equity in the population with a continuous value." The authors explain and motivate the model parameters with many examples (which is good!), but these examples appear to be somewhat unrelated. The example to explain the parameter for artifact productions is the number of universities in the region if artifacts are the number of scientific publications in a certain field, and the example for the data collection process is the number of registered archaeological sites when assessing the fabrication of a particular tool in a region. Again, I would encourage the authors to start with a clear use case in mind, explain all model parameters with respect to the use case and only then show how the model generalizes to other evolutionary stories. It will be much easier for the reader to follow the argument and to verify that the generative model fits the evolutionary process.
We have also changed the examples related to artifact productions to be in line with the single example provided above: "The production of artifacts is guided by the value of the cultural traits, and possibly as well by some other (known)  We have tried to clarify better the choice of variables and equations throughout the presentation of the model (see tracked changes). We hope this will make it more digestible now.
I collected some other points below: In equation 1, T_t,r on the left hand side depends on T_t-1, but on the right hand side this conditional event referring to one time stamp earlier has been omitted. Could the authors comment on this? Maybe I do not fully understand the equation.
We thank the reviewer for noting this error on the equation. It has been corrected.
Theta_T (line 91) It is rather difficult to follow the argument here, because no further details are given for Theta_T at this point other than the brief explanation that it is a set of parameters of cultural evolution. Theta_T is only explained properly on line 137. Maybe change the order? The same holds true for Theta_A which does not seem to be defined at all.
We thank the reviewer for the suggestion we now detail: "Theta_T represents a set of parameters of cultural evolution (which control notably the level of horizontal transmission, vertical transmission and evoked component, see details in following sections)" "For example, if each cultural artifact is a binary variable, then we could define p(Atr) = σ(β0 + β1 Ttr + β2 Ftr) where σ is the logistic function and θA = { β0, β1, β2} are parameters that control the bias and the influence of the trait and factor on the cultural artifact, respectively."

Equation 4
Could equation 4 be based on relevant theory? What are the implications of this being a linear relationship?
We main reasons for choosing a linear relationship are simplicity of the model, interpretability (each parameter characterizes the influence of one variable onto the transition rates), and simplifying the fitting algorithm (most notably the M-step of the EM equation). More complex frameworks have been developed for controlling transitions in a discrete Markov chain, e.g. Bengio & Frasconi 1995, but at the cost of interpretability. Actually the linear assumption is equivalent to the model of Bengio & Frasconi with a single layer network defining the transition rates from the variables. We have now added: "For the sake of simplicity and interpretability of the model, we propose here linear effects, i.e. the transitions rates depend linearly on the value of the ecological factor and cultural traits in neighbouring regions" Line 129 Why is this relationship required? Please explain.
We now explain: "If the time step is too large, then in equations [3] the probability of transition in one time step is not infinitesimal, so the transition rates Rup(r,t) and Rdown(r,t) cannot be approximated by a constant during each time step due to their dependence in cultural traits (see below)."

Equation 5
Why does equation 4 use a transition rate, but equation 5 the transition rate times delta t? We thank the reviewer for noting the discrepancy. The dt term was incorrect in equation 5, is has been removed.

Equation 6 and 7
Is there any theoretical justification for equation 6 and 7?
Equation 6 can be viewed as an Ornstein-Uhlenbeck process where the drift term depends linearly on the value of the ecological variable and cultural traits in neighbouring regions and the diffusion term is constant. The Orstein-Uhlenbeck is the standard stochastic process used to describe a variable that evolves through time with a certain drift, noise and default value (the strength of the force towards the default value, parametrized by parameter ρ, determines the speed at which a cultural trait can be lost, i.e. vertical transmission). Equation 7 corresponds to the discretization of Equation 6 (required as all datasets rely on discrete times). We have added the following sentence in the manuscript: "Equation 6 be viewed as an Ornstein-Uhlenbeck process where the drift term depends linearly on the value of the ecological variable and cultural traits in neighbouring regions, while the diffusion term is constant." Could you explain what cultural lability is and why it is relevant here? Why is there a variable to model a potential bias towards positive or negative trait values? Please justify your design decisions.
We agree with the reviewer that these terms deserved a better introduction. We have now expanded the explanation: "ρ determines the time scale at which the memory of cultural traits is lost. For example, in an isolated region (no neighbour), in the absence of ecological variable and noise, then Ttr converges exponentially to γ /ρ at time scale 1/ρ, i.e. the cultural trait is "forgotten" with a time scale of 1/ρ. γ is a bias towards positive or negative trait values capturing a "default" tendency of the cultural trait in the absence of evoked component and horizontal transmission." Line 130: this seems to belong to the simulation, but not the model. Same for line 156.
The initial value (or initial distribution) of the latent factor (i.e. cultural trait) are also parts of the definition of a latent probabilistic model, usually with their own set of parameters. This is the case here: x0 (and σ0 in the case of cultural traits) are parameters from the cultural evolution model that are estimated from the dataset. We now detail: "Initial values are parametrized by p(T1r=1)=x0 (and also for regions created de novo), with x0 the probability that the cultural trait is present in any new region." Both the binary and the cont. model use the same set of parameters theta_T. I would use different parameters here.
We disagree with the reviewer here. Following the standards in the statistics / machine learning literature, we define a vector θT that represents the parameter set for cultural evolution for any model of cultural trait evolution, and then details what this parameter set is composed of for different subclasses of models. We use the same name for parameters in θT when their interpretation is similar (i.e. λ is the sensitivity to the ecological factor, ξ controls the level of horizontal transmission, x0 sets the initial value), while other parameters are specific to binary or continuous cultural traits.

Simulation study and scalability
The authors chose to test the model on simulated data, which is fine. I was wondering, though, if the authors could design the simulation with a specific real-world scenario in mind and define the simulation goals accordingly. This could allow others to verify if the scenario is plausible and if all relevant goals are achieved.
We have added the following in the presentation of the model with a continuous latent variable: "Subsequently, we tested the estimation method for ETC models with a continuous trait, and artifacts with count values (the number of fiction books produced in a given region and year). We simulated K=5 regions over 400 times steps (e.g. over 2000 years with time step of dt=5 years). The number of books in the dataset was on average 10 per century and per region, and the values in A correspond in our example to the number of important female characters in each book. We want to reconstruct the dynamics of the support for gender equity (the cultural trait) from this synthetic corpus of fictions." Moreover, the simulation relies on 6 regions, which seems rather low for real world studies, especially given that the model should be applicable when the nodes of the network are interconnected individuals. How scalable is the approach?
We thank the reviewer for raising this important point. The scalability differs between the case of binary vs continuous traits. The different methods in the continuous trait model scale cubically with the number of regions K, so they can be applied as long as K remains somehow lower than 80-100. For large number of regions, the moment method should be favored as it scales linearly with the number of time steps so the overall computational cost is not prohibitive. We now illustrate the method on such large number of regions: we simulated and fitted the model using the moment method with now K=50. The results are presented in the new Supplementary Figure  2: the trajectories of cultural traits and the model parameters are well estimated. This figure is presented in main text: "Trait trajectories and model parameters were also correctly estimated when we increase the number of regions to K=50 instead of 5 (Supplementary Figure 2)." We also added the following in the discussion: "The computational complexity of the algorithm is largely driven by inversions of covariance matrices in the E-step. Theses squares matrices are of size K for the moment method, and matrix inversion is repeated at each time step, so the computational cost scales as O(TK 3 ). By contrast, the GP framework works with the full covariance matrix of size KT, so the cost scales as K 3 T 3 (the inversion can be performed only for data points with attached observations, so the actual cost can be drastically reduced if the artifacts are sparse)." The method however does not scale well for binary traits as we explicitly represent all combinations of cultural trait values (i.e. 2 K combinations). We moved one sentence from the Annexes to the main text to make this point more explicit: "For large number of regions K (more than 10-20), the number of combinations becomes prohibitive, so we must resort to approximate solutions using variational inference or sampling methods (Gharamani et al, 1997)." Language I am not an English native either, but I feel that the language can be improved here and there (especially the use of articles).
We have tried to improve the English.
Defining ETC models The paragraph introducing the generative model is rather low on references (line 67 and following). The two definitions for cultural artifacts and cultural traits, for example, appear to be make sense intuitively, but I was wondering if they reflect the scientific consensus. The same