
Analyzed the data: OF MB MJ NR. Contributed reagents/materials/analysis tools: OF MB MJ NR. Wrote the paper: OF NR. Designed the study: OF.

The authors have declared that no competing interests exist.

The model plant species

The demographic forces that have shaped the pattern of genetic variability in the plant species

Many of the same traits that contribute to the utility of

Most European species are believed to have been restricted to southern refugia at the height of glaciation (∼18,000 BP), many in the peninsulas of Iberia, Italy, and the Balkans, and some near the Caucasus region and the Caspian Sea

In this article, we consider an alternative model for the spread of

To investigate spatial population structure in European accessions of _{max} clusters, estimating the most likely value for the number of clusters as a value _{max} (see

The TESS runs with the smallest values of the Deviance Information Criterion, a penalized measure of how well the model underlying TESS fits the data, were obtained for K_{max} greater than four (see K_{max} = 5 clusters. The cluster membership coefficients estimated for the central European and western European accessions suggest that clinal variation occurs along an east-west gradient separating two clusters. The western cluster grouped accessions mainly from the British Isles, France and Iberia. The eastern cluster grouped all accessions from central Europe, southern Sweden, Poland, Russia, Ukraine, and Estonia. German and Swiss accessions shared almost the same amount of membership in the western and eastern clusters. The eight northern Swedish accessions and two Finnish accessions were grouped into a separate cluster.

(A) Membership coefficients in K_{max} = 5 putative populations, computed using the average values over the 10 TESS runs with the smallest values of the deviance information criterion from a total of 100 runs. Similar results were obtained with other values of K_{max} from 4 to 10. (B) Interpolated membership coefficients in the three apparent subpopulations: western cluster, eastern cluster, and northern cluster.

In previous analysis of the same data set

To better evaluate the direction of variation in the continental cluster, we regressed heterozygosity on geographic distance. This analysis used the approach of Ramachandran et al.

All accessions from the northern Sweden sample, as well as a few accessions that were poorly geographically connected to other accessions, were removed from the regression analysis. The remaining accessions were grouped into seven samples (

For each of 300×180 points on a two-dimensional lattice covering Europe, we computed distances from each lattice point considered as a potential source for the geographic expansion of
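The lattice scan described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes great-circle (haversine) distances and a Pearson correlation between sample heterozygosity and distance to each candidate origin, and the function names (`great_circle_km`, `pearson_r`, `scan_origins`) and the lattice bounds are hypothetical.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def scan_origins(samples, lat_range, lon_range, n_lat, n_lon):
    """Score every lattice point as a candidate origin.

    samples: list of (lat, lon, heterozygosity), one entry per sample.
    Returns a list of (lat, lon, r), where r is the correlation between
    heterozygosity and distance from the lattice point (n_lat, n_lon >= 2).
    """
    het = [s[2] for s in samples]
    results = []
    for i in range(n_lat):
        lat = lat_range[0] + (lat_range[1] - lat_range[0]) * i / (n_lat - 1)
        for j in range(n_lon):
            lon = lon_range[0] + (lon_range[1] - lon_range[0]) * j / (n_lon - 1)
            dist = [great_circle_km(lat, lon, s[0], s[1]) for s in samples]
            results.append((lat, lon, pearson_r(dist, het)))
    return results
```

Under a serial founder model, candidate origins with the most negative correlation are the most plausible sources, so `min(results, key=lambda t: t[2])` picks the best-scoring lattice point in this sketch.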

Correlation (

Because this analysis is based on a relatively limited geographic sample, it is possible that it is affected by the peculiarities of this sample. Therefore, to assess the possibility of bias due to non-uniform and sparse geographic sampling, we performed spatially explicit range expansion simulations that reproduced the geographic sampling scheme of the actual data (

Inference of demographic parameters and the choice of a best-fitting demographic model for the data were performed using an approximate Bayesian computation (ABC) analysis _{obs}. Parameter values that have generated summary statistics close enough to those of the observed data are retained to form an approximate sample from the posterior distribution, enabling parameter estimation and model choice (see

The ABC analysis was limited to a subset of 64 individuals representing the central European and western European populations. We restricted the analysis to the non-coding part of the genomic data, using the intron and the intergenic sequences only (648 loci). Simulated data also included 648 corresponding loci, each paired to have the same length as a locus in the observed data. The loci were assumed to be in linkage equilibrium, in agreement with the median ∼100 kb distance between fragments in the genome-wide data

Coalescent simulations were performed under four demographic scenarios (Models A–D). Model A has a constant population size, N_{0}. Model B has an exponentially growing population size (present size, N_{0}; ancestral size, N_{1}; time since the onset of expansion, t_{0}). In model C, the population size was constant in the distant past as well as in the recent past, and the growth was exponential between the two periods of constant population size (present size, N_{0}; ancestral size, N_{1}; time since the onset of expansion, t_{0}; time since the end of expansion, t_{1}). Model D is similar to model B, but it includes an ancient bottleneck before expansion. The prior distributions used in the four models are described in

The Bayes factor of model 1 against model 2 was computed as BF_{1,2} = Σ_{i} K_{δ}(|s_{i,1} − s_{obs}|) / Σ_{i} K_{δ}(|s_{i,2} − s_{obs}|), where K_{δ} is the Epanechnikov kernel with window size δ, and s_{i,1} and s_{i,2} are the i^{th} vectors of summary statistics simulated under models 1 and 2 (see
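The size trajectories of Models A to C, as defined above, can be written as a single function of time measured backwards from the present. This is a sketch under stated assumptions: the growth rate is derived from the two sizes and the two dates, the function name `pop_size` is hypothetical, and Model D is omitted because its bottleneck parameters are not given in this section.

```python
import math


def pop_size(t, model, n0, n1=None, t0=None, t1=0.0):
    """Population size at time t before present (same time unit as t0 and t1).

    model "A": constant size n0.
    model "B": exponential growth from n1 (at t0) up to n0 (at the present).
    model "C": constant n1 before t0, exponential growth between t0 and t1,
               constant n0 from t1 to the present.
    """
    if model == "A":
        return n0
    if model == "B":
        t1 = 0.0  # growth continues until the present
    elif model != "C":
        raise ValueError("only models A, B and C are sketched here")
    if t <= t1:
        return n0
    if t >= t0:
        return n1
    # Growth rate implied by the two sizes and the length of the growth phase.
    r = math.log(n0 / n1) / (t0 - t1)
    return n0 * math.exp(-r * (t - t1))
```

For example, `pop_size(8500.0, "C", 1000.0, 10.0, 12000.0, 5000.0)` evaluates the size halfway through the growth phase, which is the geometric mean of the two plateau sizes; the numbers are placeholders, not estimates from the study.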

Among all the scenarios, variants of the four models with variable mutation rates across loci received higher statistical support, measured by the Bayes factor, than models with fixed mutation rates, reflecting the high heterogeneity of diversity estimates among loci. The value BF_{A,B} = 0 indicates that the model with constant population size (model A) was totally unsupported. The exponential growth model (model B) was the second best-supported model, and the evidence supporting model C against model B was moderate (BF_{C,B} = 1.9). Model D received less support than model B (BF_{D,B} = 0.7).

The four demographic scenarios (Models A–D) and their associated Bayes factors. Model A is the model with constant population size, N_{0}. Model B is a model with an exponentially growing population size (present size, N_{0}; ancestral size, N_{1}; time since the onset of expansion, t_{0}). In Model C, the growth is exponential between two periods with constant size (present size, N_{0}; ancestral size, N_{1}; time since the onset of expansion, t_{0}; time since the end of expansion, t_{1}). Model D is similar to Model B, but it includes an ancient bottleneck before expansion. Variants of these four models, including variable mutation rates across loci, are considered here. The Bayes factors (top boxes) correspond to the ratio of the weight of evidence of each model to the weight of evidence of Model B. Two window sizes, δ_{0.01} and δ_{0.05}, were used when computing the Bayes factors. These window sizes correspond to the 1% and 5% quantiles of the distance between the values of the summary statistics obtained under Model B and the observed values of the summary statistics. The Bayes factors were identical for the two window sizes when rounded to one decimal place, except for Model C, for which a minor difference was observed (1.8 for δ_{0.05} instead of 1.9).

t_{0} = 10,000 BP (model B) and t_{0} = 12,000 BP (model C) using the Maximum A Posteriori (MAP) estimate (N_{1}/N_{0} = 0.3, but the large credibility interval (0, 0.6) makes it impossible to eliminate the hypothesis of a wider expansion. The MAP estimate of the mutation rate was ^{−8} with credibility intervals ranging from 0.9×10^{−8} to 12.6×10^{−8}. The MAP estimate for the date of the end of the expansion was t_{1} = 5,000 BP (see t_{0}, and the length of the expansion, t_{0}−t_{1}, the joint posterior distribution of these two quantities was computed.

Plot of the joint posterior distribution for the time of onset of the expansion, t_{0}, and the length of the expansion, t_{0}−t_{1}. Computations were performed under demographic Model C, in which the population was initially constant, then grew exponentially until t_{1}, and then remained constant until the present. Percentages represent the cumulative probabilities under the density curve. The straight line indicates that the duration of expansion cannot be longer than the time elapsed since the onset of expansion.

Model Parameters | Model B | Model C |

μ (×10^{−8}) | ||

_{0} | ||

_{0} | ||

_{1} | ||

N_{0}/N_{1} | ||

t_{1} | - |

Because we observed considerable difference in the TESS analysis between the northernmost accessions and the main European populations (

The mean number of distinct haplotypes and the mean number of private haplotypes of the central European population and the northern European population as functions of sample size. Vertical bars show standard error.

To study the split between the northern and central European populations, we used a coalescent model for the divergence between two populations at some time

The mean number of distinct haplotypes and the mean number of private haplotypes of two simulated populations, as functions of sample size. The dark orange lines show the simulation results for a population of size 135,000, and the dark green lines show the simulation results for a population of size 135,000×1/4. The top panel shows the case when the split time is 0. Below follow the results for increasing split times. No migration is assumed. The split time

The mean number of distinct haplotypes and the mean number of private haplotypes of two simulated populations as functions of sample size, shown for 100 replicates. The dark orange lines show the simulation results for a population of size N_{CE} = 135,000, and the dark green lines show the results for a population of size 135,000×1/4, when _{CE}). The results from the observed populations are also plotted for comparison (lighter orange and green lines).

As

In the ABC analysis the scenarios that consisted solely of population size change produced patterns of DNA sequence diversity similar to those resulting from a rapid spatial range expansion ^{2}. To account for the fact that in Europe,

It has been previously recognized that the frequency spectrum may be influenced by signals of past demographic events χ^{2} distance (see

A coarse preliminary search found that values of migration rates and growth rates corresponding to the saturation of a deme in 100–300 years and lengths of the colonization phase around 3,000–6,000 years followed by an equilibrium migration phase yielded non-significant χ^{2}

In a second stage of the analysis, we investigated the time at which the range expansion began, varying this time from t_{0} = 5,000 BP to t_{0} = 20,000 BP assuming a growth rate of

(A) χ^{2} distances between the simulated and the empirical folded frequency spectra as a function of the time of onset of the expansion. The other parameters were fixed at N_{1} = 10,000. The origin was placed north of the Black Sea (48°N, 35°E). The horizontal line corresponds to the 95% rejection interval of the χ^{2} test (df = 3). (B) χ^{2} distances between simulated and empirical folded spectra for 24 potential origins (black dots). The time of onset was fixed at 9,000 years BP, and the other parameters were fixed as in (A).

To better locate the origin of χ^{2} distances between simulated spectra and the empirical spectrum on an interpolated map (χ^{2} values ranged from 0.03 (East) to 0.3 (Spain - North Africa). Although the map does not provide an accurate localization of the onset of range expansion, it is similar to

_{1} = 5,000 and _{0} = 10,000, ^{2} = 0.03,

Minor allele frequency spectra of empirical data and data simulated under the best-fitting model of spatial range expansion. Population growth followed the logistic model within each deme (see text for the other parameter settings). The solid line (grey) corresponds to the neutral folded frequency spectrum. (A) The empirical folded spectrum was computed from the 648 intergenic and non-coding sequences. (B) The simulated spectrum was computed using the same number of neutral nucleotides as in the data. In simulations, expansion started 9,000 years ago from a potential origin north of the Black Sea (48°N, 35°E). Other locations from a large region around this potential origin yielded very similar simulated spectra.

We have performed an investigation of the population structure and demographic history of European

From a biogeographic point of view, Europe is a large peninsula with an east-west orientation, delimited in the south by a strong Mediterranean barrier. During glaciation epochs, many species likely went through alternating contractions and expansions of range, involving extinctions of northern populations when the temperature decreased, and spread of the southern populations from different refugial areas after glaciation. Such colonization processes were likely characterized by recurrent bottlenecks that would have led to a loss of diversity in the northern populations.

The idea that the refugia were localized in three areas (Iberia, Italy, Balkans) is now well-established

We observed that genetically diverse populations of

We observed that intraspecific diversity declines away from the southeast, as predicted by a model of successive founder events during colonization. We also inferred that the putative origin of most accessions in the sample is localized somewhere in a vast eastern region, encompassing refugia such as the Caucasus region and the Balkans. The direction of diffusion from the east towards the British Isles coincides with the post-glacial re-colonization of Europe for many species such as beech, alder and ash trees, or flightless grasshoppers

The boreal regions, in which environmental conditions are often very severe, contain the northern distribution limit of many European plants. These regions are often characterized by larger fluctuations in population size, which increase the effect of drift and can lead to increased genetic differentiation

An alternative hypothesis to the idea of a natural spatial expansion of

The possible prehistoric anthropogenic spread of

While our results might be explained by the simultaneous expansion of

A set of 76 individuals containing both hierarchical population samples and stock center accessions was extracted from the sample of 96 individuals studied by Nordborg et al.

Since

We performed an admixture analysis using TESS version 1.1, whose individual-based spatially explicit Bayesian clustering algorithm uses a hidden Markov random field model to compute the proportion of individual genomes originating in

Two values of the TESS interaction parameter were used,

TESS and STRUCTURE proceed with the determination of the number of clusters K_{max}, and by running the program until the final inferred number of clusters, K_{max}. We used the admixture version of TESS, and we set the admixture parameter to α = 1. The algorithm was run with a burn-in period of 20,000 cycles, and estimation was performed using 30,000 additional cycles. We increased the maximal number of clusters from K_{max} = 3 to K_{max} = 8 (20 replicates for each value). Runs with K_{max} = 5 led to either

For each run we computed the Deviance Information Criterion (DIC) K_{max} = 5. One accession, Mr-0 (Italy), shared nearly equal membership in each of the K_{max} clusters, regardless of the value of K_{max} (see the clustering tree in K_{max} = 5, we performed 100 additional runs (interaction parameter

Spatial interpolation of admixture coefficients was performed according to the kriging method as implemented in the R packages ‘spatial’ and ‘fields’

The regression analysis of heterozygosities on geographic distances was based on 57 central European, eastern and western European accessions. The 57 individuals were grouped into seven samples as described in

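We used an ABC approach for inferring demographic parameters under four models of population growth. In the ABC approach, we assume that there is a multidimensional parameter of interest and an observed value, s_{obs}, of a set of summary statistics. Parameter values are drawn from the prior distribution, and a vector of summary statistics, s_{i}, is simulated for each draw i. A draw is retained when the distance |s_{obs} − s_{i}| is smaller than a tolerance error, where |.| is the Euclidean norm. We used tolerance errors such that fractions of either 5% or 1% of the total number of simulations were retained.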
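The rejection step of ABC can be illustrated with a small sketch. It is not the pipeline used in the study: the coalescent simulator is replaced by a toy model (a normal mean with a uniform prior), the tolerance is set implicitly by keeping a fixed fraction of the closest simulations, and all names (`abc_rejection`, `toy_prior`, `toy_simulator`) are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean norm of the difference between two summary-statistic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def abc_rejection(s_obs, prior_sampler, simulator, n_sims, keep_fraction, rng):
    """Keep the parameter draws whose simulated summaries are closest to s_obs.

    Returns (accepted_parameters, delta), where delta is the largest accepted
    distance, i.e. the tolerance implied by keep_fraction.
    """
    draws = []
    for _ in range(n_sims):
        theta = prior_sampler(rng)
        s_i = simulator(theta, rng)
        draws.append((euclidean(s_obs, s_i), theta))
    draws.sort(key=lambda pair: pair[0])
    n_keep = max(1, int(round(keep_fraction * n_sims)))
    delta = draws[n_keep - 1][0]
    return [theta for _, theta in draws[:n_keep]], delta


def toy_prior(rng):
    """Toy prior (assumption): uniform on [0, 10]."""
    return rng.uniform(0.0, 10.0)


def toy_simulator(theta, rng):
    """Toy model (assumption): the summary is the mean of 20 normal draws."""
    return [sum(rng.gauss(theta, 1.0) for _ in range(20)) / 20]
```

Calling `abc_rejection([4.0], toy_prior, toy_simulator, 2000, 0.05, random.Random(1))` retains 5% of the simulations, mirroring the 5% tolerance mentioned in the text; the retained values form an approximate posterior sample.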

The four demographic scenarios were described in the text as Models A–D. The six-dimensional parameter ^{−8}), the population size at the onset of expansion, N_{1}, the time since the onset of expansion, t_{0}, the growth rate, _{0}, and the time elapsed since the equilibrium phase, t_{1}. The variable mutation rate models included locus-specific rates, μ_{j}, obtained as independent realizations of an exponential prior distribution for which the hyperparameter was exponentially distributed with mean

Twelve summary statistics were used to capture genomic information at the 648 loci, defined as the 25%, 50% and 75% quantiles (quartiles) of each of the distributions of the number of segregating sites, the mean number of pairwise differences between sequences, the Tajima ^{2}<0.25). The second improvement of the original method, namely smooth weighting, was retained in our analysis. Smoothing was implemented using the Epanechnikov kernel K_{δ}, with weights K_{δ}(|s_{i} − s_{obs}|)

We computed the Bayes factor when evaluating the evidence of model 1 against model 2 (where 1 and 2 are chosen among A, B, C and D) as described in Results. The new formula can be seen as an improvement of the method that used the ratio of acceptances under the two models to approximate the Bayes factor; that method corresponds to replacing the smoothing kernel K_{δ} with an indicator function of the distance. Two window sizes, δ_{0.01} and δ_{0.05}, corresponding to the 1% and 5% quantiles of the distance between the summary statistics obtained under the variant of model B with variable mutation rates and the observed summary statistics, were used when computing the Bayes factors.
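One way to realise a smoothly weighted Bayes factor is as a ratio of kernel-weighted sums over the simulated distances. The sketch below makes several assumptions that should be stated plainly: both models have the same number of simulations and equal prior probabilities, the normalising constant of the Epanechnikov kernel is dropped because it cancels in the ratio, and the function names are hypothetical.

```python
import math


def epanechnikov(d, delta):
    """Unnormalised Epanechnikov weight for a distance d and window size delta."""
    if d > delta:
        return 0.0
    u = d / delta
    return 1.0 - u * u


def window_size(distances, q):
    """Empirical q-quantile of the distances (e.g. q = 0.01 or 0.05)."""
    ordered = sorted(distances)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def bayes_factor(dists_1, dists_2, delta):
    """Kernel-weighted evidence ratio of model 1 against model 2.

    dists_1 and dists_2 hold the distances |s_i - s_obs| of the simulations
    run under each model (assumed equal in number).
    """
    w1 = sum(epanechnikov(d, delta) for d in dists_1)
    w2 = sum(epanechnikov(d, delta) for d in dists_2)
    if w2 == 0.0:
        raise ValueError("no simulation of model 2 falls inside the window")
    return w1 / w2


def acceptance_ratio(dists_1, dists_2, delta):
    """Earlier approach: ratio of acceptance counts (indicator kernel)."""
    a1 = sum(1 for d in dists_1 if d <= delta)
    a2 = sum(1 for d in dists_2 if d <= delta)
    if a2 == 0:
        raise ValueError("no simulation of model 2 falls inside the window")
    return a1 / a2
```

The smooth version gives more weight to simulations that land very close to the observed summaries, whereas the acceptance ratio treats every accepted simulation equally.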

We selected 64 individuals from central Europe and western Europe and ten individuals from northern Europe (northern Sweden and Finland). From the 876 fragments, we removed indels, sites with more than 20% missing data, and monomorphic sites. A total of 795 fragments and 11,134 SNPs remained. For each site, the remaining missing data was replaced by sampling alleles from the allele frequency distribution so that the final data set did not contain any missing data.
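The filtering and imputation steps described above can be sketched as follows. This is an illustration under assumptions: each site is a list of single-character alleles with `"N"` marking missing data, indel removal is taken to have happened upstream, and the function names are hypothetical.

```python
import random
from collections import Counter


def impute_missing(site, rng, missing="N"):
    """Replace missing calls by alleles sampled from the site's allele frequencies."""
    observed = [a for a in site if a != missing]
    if not observed:
        raise ValueError("site has no observed allele")
    alleles, counts = zip(*sorted(Counter(observed).items()))
    return [a if a != missing else rng.choices(alleles, weights=counts)[0]
            for a in site]


def filter_and_impute(sites, rng, max_missing=0.2, missing="N"):
    """Drop sites with too much missing data or no polymorphism, then impute."""
    kept = []
    for site in sites:
        n_missing = sum(1 for a in site if a == missing)
        if n_missing / len(site) > max_missing:
            continue  # more than 20% missing data by default
        if len({a for a in site if a != missing}) < 2:
            continue  # monomorphic among observed alleles
        kept.append(impute_missing(site, rng, missing))
    return kept
```

Sampling from the observed allele frequencies keeps the expected frequency at each site unchanged, which matters when the downstream statistics are frequency-based.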

We simulated data from model C using MS N_{CE} = 135,000, the estimated size of the central European population, and that migration occurred at rate _{CE}. The size of the northern population, N_{NE}, was assumed to be 1/4 of the estimated size of the central European population, N_{CE}. The growth scenario was assumed to be the same in the two populations, with only the population sizes differing.

To approximate the likelihood of the parameters, we used two haplotype diversity statistics, the mean number of distinct haplotypes and the mean number of private haplotypes. To correct the number of distinct haplotypes and the number of private haplotypes statistics for sample size differences, we used the rarefaction method
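The sample-size correction can be illustrated with the standard rarefaction formulas for the expected number of distinct and private haplotypes in a subsample of size g. The specific method cited by the authors is not visible in this text, so the formulas below are an assumption, and the function names are hypothetical.

```python
from math import comb


def expected_distinct(counts, g):
    """Expected number of distinct haplotypes in a random subsample of size g.

    counts: haplotype counts observed in one population.
    """
    n = sum(counts)
    if g > n:
        raise ValueError("subsample size exceeds sample size")
    total = comb(n, g)
    # A haplotype seen c times is missed with probability C(n - c, g) / C(n, g).
    return sum(1 - comb(n - c, g) / total for c in counts)


def expected_private(counts_by_pop, pop, g):
    """Expected number of haplotypes found in `pop` and in no other population,
    when g sequences are subsampled from every population.

    counts_by_pop: dict mapping population name -> dict haplotype -> count.
    """
    sizes = {p: sum(c.values()) for p, c in counts_by_pop.items()}
    if any(g > n for n in sizes.values()):
        raise ValueError("subsample size exceeds a sample size")
    haplotypes = set()
    for c in counts_by_pop.values():
        haplotypes.update(c)
    expected = 0.0
    for h in haplotypes:
        c_here = counts_by_pop[pop].get(h, 0)
        present = 1 - comb(sizes[pop] - c_here, g) / comb(sizes[pop], g)
        absent_elsewhere = 1.0
        for other, c in counts_by_pop.items():
            if other != pop:
                absent_elsewhere *= (comb(sizes[other] - c.get(h, 0), g)
                                     / comb(sizes[other], g))
        expected += present * absent_elsewhere
    return expected
```

Evaluating both functions over a range of g values gives curves of the kind plotted against sample size in the haplotype figures.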

Simulations of a two-dimensional stepping stone model were performed using the program SPLATCHE 1.1 ^{2}. To account for the fact that ^{−8}, and the generation time corresponded to one year. Because the memory requirements of SPLATCHE are particularly high, we modified the mutation rate and the effective size in order to accelerate the generation time from one year to ten years (this means that the model was simulated ten generations at a time).

The original population size, N_{1}, was set to values around 10,000 (5,000–15,000). DNA sequences were simulated using the modified mutation rate ^{−5}. Rescaling the generation time to a value _{R}_{e} _{1}×_{R} ∼10^{−2}). Note that N_{1} cannot be compared to the value used in the non-spatial ABC simulations unless we restore the original mutation rates and generation times. After the correction, the values used in the spatial and non-spatial scenarios were actually similar. In simulations, we assumed that the population remained constant (equal to N_{1}) for 100 ky before range expansion.

To compare with the data in western and central Europe, we simulated the genealogies of 66 individuals located at the same spatial coordinates as the set of 66 accessions that excluded those from northern Sweden and Finland. The fit of simulated data to the real data was assessed by evaluating the distance between the empirical folded frequency spectrum computed from the non-coding sequences, and frequency spectra obtained from individuals simulated at the same locations.

The distance used to compare folded spectra was the χ^{2} distance defined from four classes as follows: Class 1) minor allele frequency 1 (total 28%); Class 2) minor allele frequency 2–4 (total 26%); Class 3) minor allele frequency 5–12 (total 25%); and Class 4) minor allele frequency 13–33 (total 21%).
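The four-class comparison can be sketched in a few lines. The class boundaries and empirical proportions are taken from the text; the exact form of the distance is an assumption (squared differences in class proportions, scaled by the empirical proportions), and the names `class_proportions` and `chi2_distance` are hypothetical.

```python
# Minor allele count classes and empirical class proportions quoted in the text.
CLASS_BOUNDS = ((1, 1), (2, 4), (5, 12), (13, 33))
EMPIRICAL = (0.28, 0.26, 0.25, 0.21)


def class_proportions(minor_counts, bounds=CLASS_BOUNDS):
    """Proportion of segregating sites falling in each minor allele count class."""
    totals = [0] * len(bounds)
    for k in minor_counts:
        for idx, (lo, hi) in enumerate(bounds):
            if lo <= k <= hi:
                totals[idx] += 1
                break
        else:
            raise ValueError(f"minor allele count {k} is outside all classes")
    n = len(minor_counts)
    return [t / n for t in totals]


def chi2_distance(sim_props, emp_props=EMPIRICAL):
    """Chi-square-type distance between simulated and empirical class proportions."""
    return sum((s - e) ** 2 / e for s, e in zip(sim_props, emp_props))
```

As a point of comparison, a simulated spectrum with equal mass in all four classes, `chi2_distance((0.25, 0.25, 0.25, 0.25))`, gives a distance of about 0.011 under this definition.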

Five model parameters were varied: the time of the onset of spatial expansion t_{0}, the migration rate, the growth rate, the original population size N_{1} (resized), and the location of the origin. Ideally, one would use an ABC analysis to choose a subset of parameters that maximizes the posterior probability of the corresponding evolutionary scenario given prior distributions over these parameters. However, performing an ABC analysis with geographically explicit simulations is prohibitively time-consuming, due to the large cost of a single simulation. In practice, we first performed a coarse search using fixed values of the starting date t_{0} (8,000–12,000 BP) and a random sampling design for the other parameters, exploring migration rates (χ^{2} t_{0} = 5,000 BP to t_{0} = 20,000 BP using t_{0} = 10,000), 0.9 (t_{0} = 7,000) and 1.2 (t_{0} = 5,000), so that the colonization phase ended before the present day. Finally, we studied the explanatory power of twenty-four potential spatial origins throughout central and western Europe (


R^{2} statistic calculated in the regression of diversity on distance to the putative origin. The sampling scheme was identical to the one used to collect the actual data. The sample barycenter locations were 1: Southern Sweden, 2: British Isles, 3: France-Belgium, 4: Germany, 5: Iberia, 6: Central Europe, 7: Northeastern Europe (


t_{0} since the beginning of the expansion. t_{0} to time t_{1}, and was constant again until the present. The dashed blue line corresponds to model B, for which the population size was initially constant, and then grew exponentially until the present.


N_{0} is the present population size, N_{1} is the population size at the onset of expansion, _{0} ^{−rt } t_{0} is the time since the start of the expansion, and t_{1} is the time since population size reached an equilibrium value. Time is measured backwards and in coalescent units of N_{0} generations. LN denotes the log-normal distribution, and


Two window sizes, δ_{0.01} and δ_{0.05}, were used when computing the Bayes factors. These window sizes correspond to the 1% and 5% quantiles of the distance between observed summary statistics and the summary statistics obtained under the variant of Model B with variable mutation rates.


We are grateful to Magnus Nordborg for inspiring discussions and many useful comments on a previous draft of the manuscript. We also wish to thank Pierre Taberlet, Uma Ramakrishnan, Vincent Plagnol, and Karl Ljung.