
Conceived and designed the experiments: MGT. Performed the experiments: YI AP MGT. Analyzed the data: YI AP MAB MGT. Contributed reagents/materials/analysis tools: YI AP MAB JB MGT. Wrote the paper: YI AP JB MGT.

The authors have declared that no competing interests exist.

Lactase persistence (LP) is common among people of European ancestry but, with the exception of some African, Middle Eastern and southern Asian groups, is rare or absent elsewhere in the world. Lactase gene haplotype conservation around a polymorphism strongly associated with LP in Europeans (

Most adults worldwide do not produce the enzyme lactase and so are unable to digest the milk sugar lactose. However, most people in Europe and many from other populations continue to produce lactase throughout their life (lactase persistence). In Europe, a single genetic variant, −13,910*T, is strongly associated with lactase persistence and appears to have been favoured by natural selection in the last 10,000 years. Since adult consumption of fresh milk was only possible after the domestication of animals, it is likely that lactase persistence coevolved with the cultural practice of dairying, although it is not known when lactase persistence first arose in Europe or what factors drove its rapid spread. To address these questions, we have developed a simulation model of the spread of lactase persistence, dairying, and farmers in Europe, and have integrated genetic and archaeological data using newly developed statistical approaches. We infer that lactase persistence/dairying coevolution began around 7,500 years ago between the central Balkans and central Europe, probably among people of the Linearbandkeramik culture. We also find that lactase persistence was not more favoured in northern latitudes through an increased requirement for dietary vitamin D. Our results illustrate the possibility of integrating genetic and archaeological data to address important questions on human evolution.

Lactase persistence (LP) is an autosomal dominant trait enabling the continued production of the enzyme lactase throughout adult life. Lactase non-persistence is the ancestral condition for humans, and indeed for all mammals

Using long-range haplotype conservation

It is unlikely that lactase persistence would provide a selective advantage without a supply of fresh milk, and this has led to a gene-culture co-evolutionary model where lactase persistence is only favoured in cultures practicing dairying

Estimates of the age of the

Important questions remain regarding the location of the earliest

Assuming that the

Unlike the simulation models used in related studies

We applied the regression adjustment and weighting step of ABC to simulations accepted at the 0.5% tolerance level _{c}_{Fd}, is significantly higher than that for non-dairying farmers; 99.998% of 100,000 random draws from the former are greater than those from the latter. We note that for some parameters the estimated 95% credible intervals lie outside the upper prior bound. This is a consequence of using regression adjustment in a model with rectangular priors
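The comparison of random draws described above can be illustrated with a short sketch. It assumes two lists of posterior samples are available; the Gaussian samples below are synthetic stand-ins, not the study's posteriors, and the function name `prob_first_greater` is hypothetical.

```python
import random

def prob_first_greater(samples_a, samples_b, n_draws=100_000, seed=0):
    """Estimate P(a > b) by pairing independent random draws from two samples."""
    rng = random.Random(seed)
    wins = sum(rng.choice(samples_a) > rng.choice(samples_b) for _ in range(n_draws))
    return wins / n_draws

# Synthetic stand-ins for two marginal posterior samples (NOT the study's values).
rng = random.Random(42)
posterior_a = [rng.gauss(10.0, 1.0) for _ in range(1000)]
posterior_b = [rng.gauss(5.0, 1.0) for _ in range(1000)]

p = prob_first_greater(posterior_a, posterior_b)
print(f"Fraction of draws with a > b: {p:.5f}")
```

A fraction close to 1 indicates that nearly all of the posterior mass of the first parameter lies above that of the second.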

ABC was performed using regression adjustment and weighting, following acceptance at the 0.5% tolerance level

To investigate relationships among demographic and evolutionary parameters we calculated Spearman's R^{2} and p-values for all possible pairwise joint posterior parameter distributions (see Supplementary ^{2}>0.024. The following parameter pairs, in order of decreasing R^{2}, showed non-independence by this criterion: (A) proportion available for sporadic migration and the sporadic mobility of dairying farmers, (B) proportion available for sporadic migration and the sporadic mobility of non-dairying farmers, (C) selective advantage and sporadic mobility of non-dairying farmers, and (D) sporadic mobility of dairying farmers and sporadic mobility of hunter-gatherers. That the first two joint distributions show negative correlation is unsurprising, since changes in the proportion available for sporadic migration, or in the sporadic migration mobility of dairying and non-dairying farmers, will have similar effects on the timing of arrival of farming at different locations.
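As an illustration of the pairwise screening described above, the following sketch computes Spearman's rank correlation and its square for two parameter samples using only the standard library. The helper names and the example data are hypothetical; only the R^{2} > 0.024 threshold is taken from the text.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_r2(x, y):
    """Spearman's rho is the Pearson correlation of the ranks; return (rho, rho^2)."""
    rho = pearson(ranks(x), ranks(y))
    return rho, rho * rho

# Hypothetical example: a perfectly monotone decreasing relationship.
rho, r2 = spearman_r2([1, 2, 3, 4, 5], [25, 16, 9, 4, 1])
flagged = r2 > 0.024  # threshold quoted in the text
```

The sign of rho gives the direction of the association (negative for pairs A and B above), while R^{2} is what is compared against the threshold.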

Points represent regression-adjusted parameter values from simulations accepted at the 0.5% tolerance level. Shading was added using 2D kernel density estimation. Parameter combinations shown are the proportion of individuals in a deme available for sporadic long-distance migration versus the average mobility – in number of demes moved – of (A) dairying farmers and (B) non-dairying farmers; (C) the selective advantage of an LP allele among dairying farmers versus the average mobility of non-dairying farmers; and (D) the average mobility of dairying farmers versus the average mobility of hunter-gatherers.

Following acceptance at the 0.5% level and regression adjustment we found that the most probable location where an LP allele first underwent selection among dairying farmers lies in a region between the central Balkans and central Europe (see

Points represent regression-adjusted latitude and longitude coordinates from simulations accepted at the 0.5% tolerance level. Shading was added using 2D kernel density estimation.

Although not parameters of the model

Although not strictly a parameter of the model presented, we have applied the ABC approach to estimate the genetic contribution of people living in the deme where LP-dairying gene-culture coevolution began, and its 8 surrounding demes, to the modern European gene pool (95% CI 2.83 to 27.4%; mode = 7.47%; see
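Intervals of the kind quoted above can be read off a posterior sample as equal-tail quantiles. The sketch below is a minimal, unweighted illustration with linear interpolation between order statistics; the function name is hypothetical and the authors' exact procedure (for example, how regression weights enter) is not specified here.

```python
def equal_tail_interval(samples, mass=0.95):
    """Return (lower, upper) quantiles leaving (1 - mass) / 2 in each tail."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

    return quantile(tail), quantile(1.0 - tail)

# Hypothetical sample: the integers 0..100.
lower, upper = equal_tail_interval(list(range(101)))
```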

Value distributions were taken from 5,000 simulations assuming selection (black line), and 5,000 simulations assuming no selection (red line). Simulation parameter values were sampled at random from the marginal posterior density estimates presented in

To explore the power of our model to explain the two data sets we have considered (−13,910*T allele frequency at 12 European locations and farming arrival date at 11 European locations) we plotted the following for each data type and at each location considered: (1) the observed value, (2) the distribution of values from simulations accepted at the 0.5% tolerance level, and (3) the distribution of values from all simulations in which the −13,910*T allele arose and did not go extinct (see Supplementary

The simulation model we have employed here is relatively complex compared to related human demographic/evolutionary models reported _{d}

Estimates of the arrival dates for farming at the 11 locations we consider here were calculated as local weighted averages of calibrated carbon-14 dates

We are well aware that the spread of the Neolithic over Europe was not as constant as our model assumes. After the arrival of the Neolithic in the Balkans, there is a pause of approximately 800 years before it starts to spread to Central Europe, and there is another pause of 1,000 years before it spreads further into the northern German lowlands and other parts of northern Europe. Clearly, the carbon-14 dates we have used to estimate the farming arrival times will not fully reflect the complex history of neolithisation in all parts of the continent.

The list of parameters for which the marginal posterior distributions are notably narrower than their corresponding prior ranges (selective advantage, intrademic gene flow, the sporadic migration distance of _{d}_{nd}

The estimated selective advantage conferred by a LP allele (mode = 0.0953; 95% CI = 0.0518–0.159) is in good agreement with previous estimates for Europeans (0.014–0.15

Perhaps the most interesting result presented here is our estimation of the geographic and temporal origins of LP-dairying co-evolution. We find the highest posterior probabilities for a region between the central Balkans and central Europe (see

Overall, by considering the results from our simulations and archaeological, archaeozoological, and archaeometric findings, it seems very plausible to connect the geographic origin of the spread of LP to the increasing emergence of a cattle-based dairying economy during the 6^{th} millennium BC. The geographic region of origin of the LBK – in modern day Northwest Hungary and Southwest Slovakia _{d}_{nd}

Contrary to our expectations, we did not find that the presence of a positively selected LP allele in early dairying groups increases the unlinked genetic contribution of people living in the region where LP-dairying coevolution started to the modern European gene pool, when using demographic parameter values estimated here. The main reason for this is likely to be the relatively high inferred rates of intra- and interdemic gene flow between dairying and non-dairying farmers and between neighbouring demes, respectively, leading to a rapid erosion of any demographic ‘hitchhiking’ of unlinked genomic regions. Additionally, we only track the genetic contribution of people living in and around the deme of LP/dairying coevolution from the inception of this process. Since it takes some time for the LP allele to rise to appreciable frequencies, any demographic ‘hitchhiking’ effect may become important only after the allele centroid has moved some distance away from its origin deme.

Another notable result was obtained when we compared the range of expected −13,910*T allele frequencies at different European locations – from simulations accepted at the 0.5% tolerance level – to those observed. While all observed values were within the 95% equal tail probability interval of the simulated values, many were somewhat offset from the modes. We interpret this as indicating that our model does not fully explain the distribution of the −13,910*T allele in Europe. One possible explanation for this is that migration activity – as modeled here by interdemic gene flow and sporadic unidirectional migration – has increased subsequent to the expansion of farming into the northwestern reaches of Europe. In this scenario the farming expansion phase, occurring 9,000 to 5,500 years BP, would be mainly responsible for generating the −13,910*T allele frequency cline in Europe, but higher migration activity following this period would then have a homogenizing effect on LP allele frequencies. Intriguingly, a general pattern can be seen (Supplementary

As inferred here, the spread of a LP allele in Europe was shaped not only by selection but also by underlying demographic processes; in this case the spread of farmers from the Balkans into the rest of Europe. We propose that this combination of factors could also explain the apparent homogeneity of LP-associated mutations in Europe. In Africa there are at least four known LP-associated alleles, including three that are likely to be of African origin

We accept that the model we have used does not accommodate all data (both genetic and archaeological) that is potentially informative on the coevolution of LP and dairying in Europe. Future improvements can be made by adding more ‘realism’ to the model and by increasing the number of data types that are used in the ABC analysis, leading to more integrative inference. The former should include both adding more fixed parameter information (such as the effects of past vegetation, climate variation and other geographic features on migration parameters and carrying capacities

We infer that the coevolution of European LP and dairying originated in a region between central Europe and the northern Balkans around 6,256 to 8,683 years BP. We propose the following scenario: after the arrival of the Neolithic in south-eastern Europe and the increasing importance of cattle herding and dairying, natural selection started to act on a few LP individuals of the early Neolithic cultures of the northern Balkans. After the initial slow increase of LP frequency in those populations and the onset of the Central European LBK culture around 7,500 BP, LP frequencies rose more rapidly in a gene-culture co-evolutionary process and on the wave front of a demographic expansion (see Supplementary

Our simulation approach is motivated by a previous demic computer simulation study _{deme}

So _{max}^{2} (i.e. in a sea level Mediterranean climate deme _{deme}^{2}.

Each deme contains three distinct cultural groups: non-dairying farmers (F_{nd}), dairying farmers (F_{d}), and hunter-gatherers (HG). The ratios of ceiling population size for F_{nd}, F_{d}, and HG (as a proportion of the total maximum population size for the deme, _{deme}_{d} group. The latter, here termed the GB (genetic background) ‘allele’, is used to track the general genetic ancestry component from the region where the LP allele is first found among dairying farmers. It will be used to infer the

The LP and GB ‘allele’ frequency dynamics are determined in each generation by five processes: (1) intrademic bidirectional geneflow between cultural groups; (2) bidirectional geneflow between demes (interdemic) within the same cultural groups; (3) sporadic unidirectional migration within the same cultural groups; (4) cultural diffusion (CD); and (5) selection operating on LP allele-carrying individuals within the F_{d} group. Hardy-Weinberg equilibrium within each cultural group within each deme is assumed. Population size increase for each cultural group in each deme is modelled by logistic growth, limited by the carrying capacity of each group within each deme. We fixed the growth rate to _{d} group is allowed to increase in size as a function of the selective advantage of the LP allele,
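The logistic growth component can be sketched as a simple per-generation update. This is a minimal illustration assuming the standard discrete logistic form; the exact equation and the fixed growth rate used in the simulations are not reproduced here, so the rate `r` and the carrying capacity `k` below are hypothetical placeholders.

```python
def logistic_step(n, r, k):
    """One generation of discrete logistic growth toward carrying capacity k."""
    return n + r * n * (1.0 - n / k)

def grow(n0, r, k, generations):
    """Return the population size trajectory, starting from n0."""
    sizes = [float(n0)]
    for _ in range(generations):
        sizes.append(logistic_step(sizes[-1], r, k))
    return sizes

# Hypothetical values: 20 founders, growth rate 0.3, ceiling of 1000 individuals.
trajectory = grow(n0=20, r=0.3, k=1000.0, generations=200)
```

With a growth rate at or below 1 per generation, this update approaches the ceiling monotonically and never overshoots it, so each cultural group's size stays within its share of the deme's carrying capacity.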

We define _{c}_{i↔j}_{i} and N_{j}_{c}

We define _{d}_{d}_{c}_{i} and N_{j}

We define _{s}_{mig}_{curr}_{dest}_{curr}_{dest}_{s}_{i}_{curr}_{i}_{curr}_{curr}

We define _{dif}_{i→j}_{dif}

The geographic location where LP/dairying gene-culture coevolution starts is chosen at random from all land demes. This LP mutation is initialized at a frequency of 0.1 in F_{d} when their population size reaches a critical size in the chosen start deme, set to a minimum of 20 individuals per deme in simulations. While we would expect any _{d} in the deme of origin. However, such a starting frequency means that little more than four LP alleles are initialized in simulations. Selection acting on the LP allele, _{d} only, as follows _{d} as follows:_{d} in a particular deme.
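The arithmetic behind the starting conditions can be checked with a short sketch. It assumes diploid individuals, so 20 individuals carry 40 allele copies, of which a frequency of 0.1 corresponds to 4 copies. It also shows the LP phenotype frequency implied by Hardy-Weinberg proportions for a dominant trait, and a binomial sampling step of the kind used to model drift. Function names are hypothetical, and where drift falls in the order of per-generation processes is not specified here.

```python
import random

def initial_allele_copies(n_individuals, frequency, ploidy=2):
    """Expected number of allele copies at a given frequency."""
    return ploidy * n_individuals * frequency

def lp_phenotype_frequency(p):
    """Dominant trait under Hardy-Weinberg: everyone except non-carrier homozygotes."""
    return 1.0 - (1.0 - p) ** 2

def drift_step(p, n_individuals, rng):
    """Binomially sample 2N allele copies to get the next generation's frequency."""
    copies = 2 * n_individuals
    count = sum(rng.random() < p for _ in range(copies))
    return count / copies

copies_at_start = initial_allele_copies(20, 0.1)    # 20 individuals at frequency 0.1
persistent_at_start = lp_phenotype_frequency(0.1)   # fraction who are lactase persistent
next_freq = drift_step(0.1, 20, random.Random(3))   # one generation of drift
```

With so few copies at initialization, binomial sampling alone can remove the allele in the first few generations, which is why simulations in which the allele went extinct are treated separately.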

All simulations were run for 360 generations which, assuming a generation time of 25 years

The genetic contribution of the population living in the region of origin of LP/dairying gene-culture coevolution to the overall European population is tracked over generations by calculating the GB ‘allele’ frequency over all demes in all 3 cultural groups. In the generation when the LP allele is initialized, all cultural groups in the origin deme and 8 neighbouring demes are assigned the unlinked GB ‘allele’ at a frequency of 1. The GB ‘allele’ is subjected to the same intra- and inter-deme geneflow and migration processes as described above, but is not subject to drift, as modelled by binomial sampling, or to selection. At the end of each simulation this GB allele is taken to represent the general genetic contribution of the population living in the region of origin of LP to the modern European population. The ancestry component of Europeans, at any generation, that originates from people living in the region of origin of the LP allele (_{GB}_{i}_{GBij}_{ij}

To estimate parameters of interest we use an ABC approach, following _{d}_{nd}

| Location | −13,910*T allele frequency | N individuals used to assess −13,910*T allele frequency | Reference for −13,910*T allele frequency | Great circle distance from central Anatolia (km) | Inferred farming arrival date in years BP ^{1} (generations after start of simulation) | Inferred farming arrival date in years BP ^{2} (generations after start of simulation) | Latitude | Longitude |
| Turkey | 0.031 | 49 | | 0 | 9000 (0) | 9000 (0) | 38.00 | 30.00 |
| Greece | 0.134 | 41 | | 550 | 7932 (43) | 8389 (24) | 37.98 | 23.73 |
| Tuscany | 0.063 | 16 | | 1699 | 7274 (69) | 7112 (76) | 43.77 | 11.25 |
| Sardinia | 0.071 | 56 | | 1829 | 7371 (65) | 6968 (81) | 39.00 | 9.00 |
| North Italy | 0.357 | 28 | | 1880 | 6992 (80) | 6911 (84) | 45.68 | 9.72 |
| Scandinavia | 0.815 | 360 | | 2523 | 5833 (127) | 6197 (112) | 59.33 | 18.05 |
| Germany | 0.556 | 60 | | 2309 | 6396 (104) | 6434 (103) | 53.55 | 10.00 |
| France | 0.431 | 58 | | 2523 | 6552 (98) | 6197 (112) | 48.87 | 2.33 |
| French Basque | 0.667 | 48 | | 2666 | 7078 (77) | 6037 (119) | 43.00 | −1.00 |
| Southern UK | 0.734 | 64 | | 2785 | 5954 (122) | 5905 (124) | 51.50 | −0.12 |
| Orkney | 0.688 | 32 | | 3325 | 5778 (129) | 5306 (148) | 58.95 | −3.30 |
| Ireland | 0.954 | 65 | | 3349 | 5807 (128) | 5260 (150) | 54.37 | −7.63 |

Inferred arrival of farming dates were based on: ^{1} a weighted average of all calibrated carbon-14 earliest farming arrival dates from Pinhasi et al. ^{2} by assuming a constant rate of spread of farming (estimated at 0.9 km/year
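The derived columns of the table can be illustrated with a short sketch: a great-circle distance from coordinates, an arrival date under a constant rate of spread, and the conversion of a date to generations after the start of the simulation. The 9000 years BP start, 25-year generation time and 0.9 km/year rate are taken from the text; the haversine formula, the Earth radius, the rounding rule and the function names are assumptions, and the table's values may have been computed somewhat differently.

```python
import math

EARTH_RADIUS_KM = 6371.0    # mean Earth radius (assumption)
START_BP = 9000             # start of simulation, years BP
GENERATION_YEARS = 25       # generation time in years
SPREAD_KM_PER_YEAR = 0.9    # constant rate of spread of farming

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def arrival_bp_constant_spread(distance_km):
    """Arrival date (years BP) if farming spreads outward at a constant rate."""
    return START_BP - distance_km / SPREAD_KM_PER_YEAR

def generations_after_start(date_bp):
    """Number of generations elapsed between the simulation start and a date."""
    return round((START_BP - date_bp) / GENERATION_YEARS)

# Worked example for the Greece row, using the Turkey row as the origin.
dist = great_circle_km(38.00, 30.00, 37.98, 23.73)
date = arrival_bp_constant_spread(550)
gens = generations_after_start(round(date))
```

For the Greece row this gives a distance close to the tabulated 550 km, and the tabulated distance in turn gives an arrival date and generation count that round to the tabulated values.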

Parameters of interest, _{d}_{d}_{d}_{c}_{dif}_{s}_{i}_{GB}

Our full ABC algorithm is as follows: (1) choose the summary statistics _{δ}, to accept and from this calculate an implicit tolerance level δ), (3) sample a parameter set φ_{i} from the pre-determined prior distribution of φ, (4) simulate forward under our model using parameter set φ_{i}, (5) in the final generation of our simulation we calculate the summary statistics, _{i}_{i}_{i}, (7) steps 3 to 6 are repeated until we have a sufficient number of retained parameter sets, (8) A local-linear standard multiple regression is then performed to adjust the φ_{i}, with each φ_{i} weighted according to the size of ∥_{i}_{δ}(t) (see _{i}* form a random sample from the approximate joint posterior distribution P(φ|_{i}*, as suggested by Beaumont
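The steps above can be sketched in a deliberately small form. The example below is not the study's implementation: the forward simulation is replaced by a toy model with one parameter and one summary statistic, the prior range is arbitrary, and an Epanechnikov kernel is assumed for the weights. It only illustrates acceptance at a fixed fraction of simulations followed by a weighted local-linear regression adjustment.

```python
import random
import statistics

def simulate(theta, rng, n=20):
    """Toy stand-in for the forward simulation: returns one summary statistic."""
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(n))

def abc_regression(s_obs, n_sims=20_000, accept_fraction=0.005, seed=1):
    rng = random.Random(seed)
    runs = []
    for _ in range(n_sims):
        theta = rng.uniform(-5.0, 5.0)             # draw from a flat (rectangular) prior
        s = simulate(theta, rng)                   # simulate and summarise
        runs.append((abs(s - s_obs), theta, s))    # distance to the observed statistic
    runs.sort()
    kept = runs[: max(2, round(n_sims * accept_fraction))]
    delta = kept[-1][0]                            # implicit tolerance level
    # Epanechnikov-style weights: 1 at zero distance, 0 at the tolerance boundary.
    weights = [1.0 - (d / delta) ** 2 for d, _, _ in kept]
    x = [s - s_obs for _, _, s in kept]
    y = [theta for _, theta, _ in kept]
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    beta = sxy / sxx                               # weighted least-squares slope
    adjusted = [yi - beta * xi for xi, yi in zip(x, y)]  # project onto s = s_obs
    return adjusted, weights, delta

adjusted, weights, delta = abc_regression(s_obs=1.5)
posterior_mean = sum(w * a for w, a in zip(weights, adjusted)) / sum(weights)
print(f"accepted={len(adjusted)} tolerance={delta:.4f} mean={posterior_mean:.3f}")
```

Because the adjustment shifts each accepted parameter value along the fitted regression line, adjusted values are not constrained to the prior range, which is consistent with the earlier remark that some credible intervals extend beyond a prior bound.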

The simulation and ABC analysis procedures were written in the Python Programming Language (URL:

Performance of the model in explaining observed data on −13,910*T allele frequency at 12 locations throughout Europe. The observed point values are indicated by vertical red lines. The distributions of expected values from all simulations in which the −13,910*T allele arose and did not go extinct are indicated by black lines. The distributions of expected values from all simulations accepted at the 0.5% tolerance level in ABC analysis are indicated by green lines.

(0.62 MB EPS)

Performance of the model in explaining observed data on the estimated time of arrival of farming at 11 locations throughout Europe. The observed point values are indicated by vertical red lines. The distributions of expected values from all simulations in which the −13,910*T allele arose and did not go extinct are indicated by black lines. The distributions of expected values from all simulations accepted at the 0.5% tolerance level in ABC analysis are indicated by green lines.

(0.60 MB EPS)

Approximate marginal posterior density estimates of demographic and evolutionary parameters. ABC was performed using regression adjustment and weighting, following acceptance at the 0.5% tolerance level. The upper and lower 2.5% of each distribution are shaded. These simulation results are equivalent to those presented in

(7.42 MB EPS)

Pairwise joint approximate posterior density estimates of demographic and evolutionary parameters showing high degrees of correlation (Spearman's R^{2}>0.024). Points represent regression adjusted parameter values from simulations accepted at the 0.5% tolerance level. Shading was added using 2D kernel density estimation. These simulation results are equivalent to those presented in

(0.46 MB TIF)

Approximate posterior density of region of origin for LP/dairying co-evolution. Points represent regression-adjusted latitude and longitude coordinates from simulations accepted at the 0.5% tolerance level. Shading was added using 2D kernel density estimation. This result is equivalent to that presented in

(1.03 MB EPS)

Approximate marginal posterior density estimates of (a) the date of origin for LP/dairying co-evolution, and (b) the contribution of people living in the deme of origin for LP/dairying co-evolution, and its 8 surrounding demes, to the modern European gene pool. The upper and lower 2.5% of each distribution are shaded. These simulation results are equivalent to those presented in

(2.96 MB EPS)

Main regions of early (dark green) and late phase (light green) spread of the Linearbandkeramik culture from its origins in modern day northwest Hungary and southwest Slovakia.

(5.34 MB TIF)

Average deme elevation (scale bar in meters above sea level).

(0.47 MB EPS)

Deme climate zones: Mediterranean, Temperate, and Cold/Desert.

(0.39 MB EPS)

Carrying capacity (maximum number of people per deme; indicated by scale bar), calculated as a function of average deme elevation (Supplementary

(0.48 MB EPS)

Demographic processes: (a) Intrademic bidirectional geneflow - a single example deme is illustrated; bidirectional geneflow occurs between all cultural groups within the deme. The number of individuals exchanged between cultural groups _{i⇔j}_{i}_{curr}_{i⇒j}

(0.06 MB DOC)

Correlations among demographic and evolutionary parameters. Spearman's R^{2} (above diagonal) and p-values (below diagonal) are given for all pairwise joint posterior parameter distributions. Posterior distributions were estimated by ABC employing regression adjustment and weighting of simulations accepted at the 0.5% tolerance level. Parameter joint distributions are shown in ^{2} value>0.024.

(0.06 MB DOC)

Posterior estimates of demographic and evolutionary parameters (mean, mode and 95% credibility interval). Posterior distributions were estimated by ABC employing regression adjustment and weighting of simulations accepted at the 0.5% tolerance level.

(0.06 MB DOC)

Parameters of simulation model. ‘Flat’ indicates that a uniform prior was used.

(0.06 MB DOC)

Supplementary Video S1 - Animation graphically representing the geographic frequency distribution of the −13,910*T allele at 10-generation time slices over the last 9000 years (assuming a generation time of 25 years), taken from simulations that best fitted data on modern −13,910*T allele frequency and timing of the arrival of farming in Europe.

(1.50 MB MPG)

Supplementary Video S2 - Animation graphically representing the geographic frequency distribution of the −13,910*T allele at 10-generation time slices over the last 9000 years (assuming a generation time of 25 years), taken from simulations that best fitted data on modern −13,910*T allele frequency and timing of the arrival of farming in Europe.

(1.50 MB MPG)

Supplementary Video S3 - Animation graphically representing the geographic frequency distribution of the −13,910*T allele at 10-generation time slices over the last 9000 years (assuming a generation time of 25 years), taken from simulations that best fitted data on modern −13,910*T allele frequency and timing of the arrival of farming in Europe.

(1.42 MB MPG)

We thank Dallas Swallow, Lounès Chikhi, Charlotte Mulcare, Pascale Gerbault, Jens Lüning, Kevin Bryson, Norbert Benecke, Mehmet Özdogan, László Bartosievicz, Marjan Mashkour and Yael Pinchevsky for access to data and helpful comments.