plosPLoS GenetplosgenPLoS Genetics1553-73901553-7404Public Library of ScienceSan Francisco, USA09-PLGE-RA-0228R310.1371/journal.pgen.1000695Research ArticleComputational Biology/GenomicsGenetics and Genomics/Population GeneticsInferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency DataDemography from Multidimensional SNP DataGutenkunstRyan N.^{1}^{*}HernandezRyan D.^{2}WilliamsonScott H.^{3}BustamanteCarlos D.^{3}Theoretical Biology and Biophysics and Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of AmericaHuman Genetics, University of Chicago, Chicago, Illinois, United States of AmericaBiological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of AmericaMcVeanGilEditorUniversity of Oxford, United Kingdom* E-mail: ryang@lanl.gov
Conceived and designed the experiments: RNG SHW CDB. Performed the experiments: RNG. Analyzed the data: RNG. Contributed reagents/materials/analysis tools: RNG RDH. Wrote the paper: RNG CDB.
The authors have declared that no competing interests exist.
10200923102009510e1000695122200923920092009This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40–270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17–43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3–26.9 kya), and our analysis yields no evidence for subsequent migration. Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).
Author Summary
The demographic history of our species is reflected in patterns of genetic variation within and among populations. We developed an efficient method for calculating the expected distribution of genetic variation, given a demographic model including such events as population size changes, population splits and joins, and migration. We applied our approach to publicly available human sequencing data, searching for models that best reproduce the observed patterns. Our joint analysis of data from African, European, and Asian populations yielded new dates for when these populations diverged. In particular, we found that African and Eurasian populations diverged around 100,000 years ago. This is earlier than other genetic studies suggest, because our model includes the effects of migration, which we found to be important for reproducing observed patterns of variation in the data. We also analyzed data from European, Asian, and Mexican populations to model the peopling of the Americas. Here, we find no evidence for recurrent migration after East Asian and Native American populations diverged. Our methods are not limited to studying humans, and we hope that future sequencing projects will offer more insights into the history of both our own species and others.
This research was supported by National Science Foundation grant PHY05-51164, National Institutes of Health grants 1R01GM83606 and 2R01HG003229, and DOE contract DE-AC52-06NA25396. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.Introduction
Demographic models inferred from genetic data play several important roles in population genetics. First, they complement archeological evidence in understanding prehistorical events (such as the number and timing of major continental migrations) which have left no written record [1],[2]. Second, they facilitate the search for genetic regions that have been targets of non-neutral forces, such as recent natural selection, by guiding our expectations as to how much sequence and haplotype variation one expects to see in a given genomic region (and, more importantly, the variance around these expectations) [3]. Finally, existing demographic models can guide sampling design for subsequent population or medical genetic studies.
Given their many uses, it is not surprising that many studies have inferred demographic models for populations of humans and other species [4]–[15].
The process of inferring a demographic model consistent with a particular data set typically involves exploring a large parameter space by simulating the model many times, often using coalescent-theory based Monte Carlo approaches. For computational reasons, many of the demographic inference procedures developed thus far have focused on single population models or models with multiple populations but no subsequent migration after subpopulations split (i.e., [4]–[6],[16],[17], but also see [10],[18]). Methods that do consider multiple populations with migration often assume independent non-recombining regions [7],[19] and do not often scale to genomic size data sets. Approaches for jointly considering recombination and migration often use a restricted set of summary statistics [9] of the data, which limits their statistical power. Finally, complex demographic inferences that make use of many summary statistics are often very computationally intensive [8],[10],[18], which precludes thorough investigation of their statistical properties.
Here, we develop and apply a computationally efficient diffusion-based approach to the problem of demographic inference, based on the multi-population allele frequency spectrum (AFS) (i.e., the joint distribution of allele frequencies across diallelic variants) [10],[17],[18],[20],[21]. Given a genetic region sequenced in multiple individuals from each of populations, the resulting AFS is a P-dimensional matrix. Each entry of this matrix records the number of diallelic genetic polymorphisms in which the derived allele was found in the corresponding number of samples from each population. For example, if diploid individuals from two populations were sequenced, with 10 individuals from population 1 and 5 from population 2, the AFS would be a 21-by-11 matrix (indexed from 0). The [2,0] entry would record the number of polymorphisms for which the derived allele was seen twice in population 1 but never seen in population 2, while the [20],[5] entry would record polymorphisms for which the derived allele was homozygous in all individuals from population 1 and seen 5 times in population 2. If all polymorphic sites possess only two alleles and can be considered independent, the AFS is a complete summary of the data. Many of the statistics commonly used for population genetic inference, such as and Tajima's , are summaries of the AFS (see [18],[22]).
Efficient techniques exist for simulating the AFS of a single population [4],[5],[23]. The joint AFS between two populations has been used by several recent studies [10],[11],[18],[24], but these have all relied upon very computationally intensive coalescent simulations. Here we approximate the joint multi-population AFS by numerical solution of a diffusion equation, and our implementation supports up to three simultaneous populations. Because the diffusion approach neglects linkage, our comparison with the data is through a composite likelihood function. Such likelihoods are consistent estimators under a wide range of population genetic scenarios for selectively-neutral data, but do not correctly capture variances [25]. (Lower recombination induces higher linkage and higher variance in the entries of the AFS.) As we demonstrate below, the efficiency of our diffusion approach enables both conventional and parametric bootstrap resampling of the data, allowing us to accurately estimate confidence intervals for parameter values and critical values for hypothesis tests [26], accounting for any degree of linkage found in the data. This bootstrap procedure overcomes the traditional concerns with composite likelihood as a philosophy for inference in population genetics.
To demonstrate the utility of our approach, we apply our method to two epochs in human history, using single nucleotide polymorphism (SNP) data from the Environmental Genome Project (EGP) [27], the largest public database of human resequencing data. We first study the expansion of humans out of Africa, jointly modeling the history of African, European, and East Asian populations. We then study the settlement of the New World, jointly modeling European, East Asian, and admixed Mexican populations. In both cases, we quantify the uncertainty of our parameter inferences and test hypotheses about migration (bootstrapping to account for linkage). In particular, we infer an earlier divergence between African and Eurasian populations than previous studies, because our inferences account for the substantial migration between these populations. Our methods also find no evidence for multiple migrations between East Asia and the New World. While similarly complex models for human continental populations have been studied [8], to our knowledge, our analysis is the first in which the full joint AFS is used for inference and in which uncertainty and goodness-of-fit have been quantified.
An important advantage of the diffusion approach is the ease with which selection can be incorporated. As an illustrative application, we also predict the distribution of protein-coding variation between populations. In agreement with the data, we find that less nonsynonymous variation is shared between populations than might be expected based only on patterns of shared noncoding variation.
While no model can capture the full complexity of any species' genetic history, the models presented refine our understanding of the expansion of humanity across the globe. None of the methodology is specific to humans, and we expect our method will find wide application to demographic inference of other species.
MethodsDiffusion approximation
To efficiently simulate the AFS, we adopt a diffusion approach. Such approaches have a long and distinguished history in population genetics, dating back to R. A. Fischer [28]–[30]. The diffusion approach is a continuous approximation to the population genetics of a discrete number of individuals evolving in discrete generations. An important underlying assumption is that per-generation changes in allele frequency are small. Consequently, the diffusion approximation applies when the effective population size is large and migration rates and selection coefficients are of order .
If we have samples from populations, the numbers of sampled sequences from each population are . (For diploids, is typically twice the number of individuals sampled from population 1.) Entry of the AFS records the number of diallelic polymorphic sites at which the derived allele was found in samples from population 1, from population 2, and so forth. (If ancestral alleles cannot be determined, then the “folded” AFS can be considered, in which entries correspond to the frequency of the minor allele.)
We model the evolution of , the density of derived mutations at relative frequencies in populations at time . (All run from 0 to 1.) Given an infinitely-many-sites mutational model [31] and Wright-Fisher reproduction in each generation, the dynamics of for an arbitrary finite number of populations are governed by a linear diffusion equation:The first term models genetic drift, and the second term models selection and migration. Figure 1A illustrates the effects of different evolutionary forces on components of . Time is in units of , where is the time in generations and is a reference effective population size. The relative effective size of population is . The scaled migration rate is , where is the proportion of chromosomes per generation in population that are new migrants from population . (Thus migration is assumed to be conservative [32]). Finally, the scaled selection coefficient is , where is the relative selective advantage or disadvantage of variants in population . Boundary conditions are no-flux except at two corners of the domain, where all population frequencies are 0 or 1; these are absorbing points corresponding to allele loss or fixation. Because the diffusion equation is linear, we can solve simultaneously for the evolution of all polymorphism by continually injecting density at low frequency in each population (at a rate proportional to the total mutation flux ), corresponding to novel mutations.
(A) Qualitative effects of modeled neutral genetic forces on , the density of alleles at relative frequencies and in populations 1 and 2. (B) For the spectra shown, an equilibrium population of effective size diverges into two populations generations ago. Populations 1 and 2 have effective sizes and , respectively. Migration is symmetric at per generation, and . (C) The AFS at . Each entry is colored by the logarithm of the number of sites in it, according to the scale shown. (D) The AFS at various times for various demographic parameters, on the same scale as (B). (E) Comparison between coalescent- and diffusion-based estimates of the likelihood of data generated under the model (A). Coalescent-based estimates of the likelihood, each of which took approximately 7.0 seconds, are represented in the histogram. The result from our diffusion approach, which took 2.0 seconds, is represented by the red line. For accuracy comparison, the yellow line indicates the likelihood inferred from coalescent simulations.
Changes in population size and migration alter the parameters in Equation 1, while population splits and mergers alter the dimensionality of . For example, if new population 3 is admixed with a proportion from population 1 and from population 2 thenwhere denotes the Dirac delta function. To remove population 2, is integrated over : .
Given , the expected value of each entry of the AFS, , is found via a P-dimensional integral over all possible population allele frequencies of the probability of sampling derived alleles times the density of sites with those population allele frequencies. For SNP data obtained by resequencing, these probabilities are binomial, soIn some cases of ascertained data [33], the resulting bias can be corrected by modifying the above equation [11],[34].
Likelihood-based inference
Let correspond to the parameters of a demographic model we wish to estimate from the observed multi-population allele frequency spectrum, which we denote . Assuming no linkage between polymorphisms, each entry in the AFS is an independent Poisson variable [20], with mean (which depends on ). We can, therefore, construct a likelihood function using standard statistical theory:So is the product of Poisson likelihoods, one for each entry in the AFS.
In words, our approach consists of calculating the expected allele frequency spectrum using a particular demographic model (and set of parameter values for that demographic model) using our diffusion approach. We then maximize the similarity between and the observed AFS over the parameter values that can take on. Competing demographic models can be chosen from using standard statistical theory such as the nested likelihood ratio test or information criteria such as the Akaike or Bayesian Information Criteria.
For linked polymorphisms, is a composite likelihood. Such likelihoods are consistent estimators under a wide range of neutral population genetic scenarios [25], but simulations incorporating linkage are necessary to estimate variances and define critical values for hypothesis testing and model selection. In our applications, we estimate variances using simulations from the coalescent simulator ms[35].
Numerics
Solving the multi-population diffusion equation is substantially more demanding than the single-population case [23]. This is primarily because the boundary conditions are more complex, and the numerical grid of population frequencies must be much coarser to be computationally tractable, because it is of dimensions. For example, a previous single-population study [23] used a uniform grid of order values between 0 and 1. Extending this grid to a three-population simulation would require an infeasible array of size . Instead, we use a nonuniform grid and extrapolation to enable accurate computation using of order 100 values along each dimension, for a final array size of order .
We solve the diffusion equation on a regular nonuniform grid, using a finite difference scheme [36] inspired by the method of Chang and Cooper [37] (Text S1). Mutations in population arise at frequency . The diffusion approximation applies when , but the minimum frequency in our numerical simulation is that of the first grid point, denoted . To overcome this, we extrapolate our results to an infinitely fine grid. We use a quadratic extrapolation on the logarithm of the AFS entry, modeling the bias introduced by the finite initial grid point asHere is an AFS element calculated at grid size and is the extrapolated value. Given three evaluations at different grid sizes , we solve for and use this value when calculating likelihoods. This vastly increases both the speed and accuracy of our calculation (Supplementary Figure 3 in Text S1). While higher-order extrapolations may improve accuracy in some cases, they may also be more sensitive to numerical noise. Our empirical experience is that a quadratic approximation provides a good compromise between accuracy, efficiency, and robustness.
The computational cost for a single likelihood evaluation scales as where is the number of grid points used. In our experience, for stability and accuracy should somewhat larger than the largest population sample size. Although our theoretical framework extends to an arbitrary number of populations, the exponential scaling of computation with limits our current applications to three simultaneous populations. Importantly, our likelihood calculation is deterministic and numerically smooth, so numerical derivatives can be used in optimization. We use the the quasi-Newton BFGS method [36], which converges in order steps, where is the number of free parameters.
Our implementation of these methods, , is written in cross-platform Python and C, making use of the NumPy [38], Scipy [39], and Matplotlib libraries [40]. It is distributed under the open-source BSD license. All calculations herein were performed with version 1.1.0.
We estimated parameter uncertainties by both conventional bootstrap (fitting data sets resampled over loci) and parametric bootstrap (fitting simulated data sets). To generate simulated data we used the coalescent program ms[35], a region-specific recombination rate, and the detailed EGP sequencing strategy (Text S1).
The confidence intervals reported in Table 1 and Table 2 derive from a normal approximation to the bootstrap results. For the conventional bootstrap, confidence intervals were calculated as . For the parametric bootstrap, biased-corrected intervals were calculated as . The maximum-likelihood value is denoted , while and denote the mean and standard deviation of the bootstrap results. Aside from the growth rates , all our model parameters are positive by definition, so in those cases we used their logarithms when calculating confidence intervals.
10.1371/journal.pgen.1000695.t001Out of Africa inferred parameters.
aSee Figure 3B for model schematic. Growth rates and migration rates are per generation. is the average European admixture proportion of the Mexican-Americans sampled.
Pearson's goodness-of-fit test was performed using all 21^{3}−2 = 9259 bins in the AFS. Results are similar if we restrict our analysis to entries in which the expected value is greater than 1 or greater than 5.
Data
We used the National Institute of Environmental Health Science's Environmental Genome Project SNPs database [41], which results from direct Sanger resequencing of environmental response genes in several populations. We considered all diallelic SNPs in 5.01 Mb of sequence from noncoding regions of 219 autosomal genes (Supplementary Table 8 in Text S1). These data have been the subject of many publications, including [17],[23],[27],[42]. As an assessment of quality, additional high-coverage short-read sequencing has recently been performed across 8 samples in this data set. Over 26,000 sites, the SNP concordance between this next-generation sequencing and the original Sanger sequencing averages 99.5% (D. Nickerson, personal communication). Given the high quality of this data set, we do not incorporate sequencing error into our modeling. We believe such correction will be essential in future applications to less accurate short-read sequencing data, as inference based on the frequency spectrum is sensitive to rare alleles.
To estimate the ancestral allele, we aligned to the panTro2 build of the chimp genome [43]. Like other methods based on the unfolded AFS, our analysis is sensitive to errors in identifying the ancestral allele. We statistically corrected the AFS for ancestral misidentification [17], using a context-dependent substitution model [44]. This procedure has been shown to perform better than aligning to multiple species [17]. To account for missing data and ease qualitative comparisons between populations, we projected all spectra down to 20 samples per population [5] (Text S1).
The human-chimp divergence in the data is 1.13%. We assumed a divergence time of 6 My [45] and a generation time of 25 years. This yielded an estimated neutral mutation rate of per site per generation, which is comparable to direct estimates [46]. There is some controversy as to the appropriate generation time to assume in human population genetic studies [47],[48]. In particular, the human generation time may differ between cultures and may have changed during our biological and cultural evolution. The bootstrap uncertainties reported in Table 1 and Table 2 do not include systematic uncertainties in the human-chimp divergence or generation times. The generation time, however, formally cancels when converting between genetic and chronological times.
Nonsynonymous polymorphism
In our prediction of the distribution of nonsynonymous polymorphism, the distribution of selective effects assumed was a negative-gamma distribution with shape parameter and scale [49]. The AFS was calculated by trapezoid-rule integration over this distribution, using 201 evaluations logarithmically spaced over . All demographic parameters, including the scaled mutation rate , were set to the maximum-likelihood values from our Out of Africa analysis.
Results
First, we explored how various demographic forces affect the AFS, building intuition for our subsequent applications to real data. We then compared the performance of diffusion versus coalescent methods for evaluating the AFS, finding that the diffusion approach is substantially faster. We then applied our diffusion approach to infer parameters for plausible demographic models for the history of continental human populations. We first considered the expansion of humans out of Africa and then the settlement of the New World. In these applications, we inferred the maximum composite-likelihood parameters of our models using diffusion fits to the real data. To account for linkage in estimating variances and critical values for hypothesis tests, we then repeatedly fit both conventional and parametric bootstrap data sets. Finally, in an application incorporating selection, we predicted the distribution of nonsynonymous variation between populations in our Out of Africa model, finding good agreement with the available data.
Demographic effects on the AFS
In Figure 1, we provide examples of the AFS under different demographic scenarios. Figure 1B illustrates the isolation-with-migration model for which the spectra are calculated. The expected spectrum at zero divergence time is shown in Figure 1C. Figure 1D shows the expected spectrum at various divergence times under various demographic scenarios. Qualitatively, correlation between population allele frequencies declines with increasing divergence time, depopulating the diagonal of the AFS. On the other hand, migration prolongs and sustains correlation. Less obviously, AFS entries corresponding to shared low-frequency alleles distinguish between increased migration and reduced divergence time (Supplementary Figure 1 in Text S1). Additionally, differences in genetic drift between populations with different effective sizes result in asymmetries in the AFS. These qualitative features of the AFS are also evident in human data. Detailed modeling allows us to quantify our inference regarding the type, timing, and strength of demographic events that are consistent with the data.
Computational performance
The computer program implementing our method is named (Diffusion Approximations for Demographic Inference). It is open-source and freely available at http://dadi.googlecode.com.
Figure 1E compares with a coalescent approach to evaluating the likelihood of frequency spectrum data. The coalescent simulator ms[35] was used to generate a simulated data set from the model in Figure 1B, with parameters , , , , , scaled total recombination rate , and 20 samples per population. Coalescent-based estimates of the expected AFS were generated by averaging ms simulations, each run with and . These estimates were scaled to for comparison with the simulated data set. (This procedure is substantially faster than simulating with larger and .) Each estimate took approximately 7.2 seconds of computation. The histogram in Figure 1E shows the resulting distribution of estimated likelihoods of the data. Shown by the red line in Figure 1E is the result from our diffusion approach (with grid sizes ), which took approximately 2.0 seconds of computation. The yellow line is the likelihood from coalescent simulations, illustrating the high accuracy of our diffusion approach. (Note that the coalescent approach we consider here is not necessarily optimal. We are, however, unaware of any such approach that is competitive in computational speed with the diffusion method.)
The computational advantage of the diffusion method is even larger when placed in the context of parameter optimization. Unlike the coalescent approach, there is no simulation variance, so efficient derivative-based optimization methods can be used. As examples, consider our applications to human data, which involve 20 samples per population. On a modern workstation, fitting a single-population three-parameter model took roughly a minute, while fitting a two-population six-parameter model took roughly 10 minutes. The fits of three-population models with roughly a dozen parameters typically took a few hours to converge from a reasonable initial parameter set. This speed allows us to use extensive bootstrapping to estimate variances, overcoming the limitations of composite likelihood.
Expansion out of Africa
Our analysis of human expansion out of Africa used data from three HapMap populations: 12 Yoruba individuals from Ibadan, Nigeria (YRI); 22 CEPH Utah residents with ancestry from northern and western Europe (CEU); and 12 Han Chinese individuals sampled in Beijing, China (CHB). Because approaches based on the frequency spectrum are sensitive to miscalling of the ancestral state, we statistically corrected for ancestral misidentification using an approach that accounts for a myriad of mutation and context-dependent biases (such as CpG effects) [17]. To ease qualitative comparison among populations and account for missing data, we projected the data down to 20 sampled chromosomes per population [5]. Because this data set is of very high quality (>99% concordance of sequenced SNPs with next-generation sequencing of the same individuals to high coverage; see Methods), we do not explicitly correct for sequencing errors here. We were left with 17,446 segregating diallelic SNPs from effectively 4.04 Mb of sequence. Figure 2A shows the resulting AFS. For ease of visualization, the top row of Figure 2C shows the two-population marginal spectra.
10.1371/journal.pgen.1000695.g002Out of Africa analysis.
(A) AFS for the YRI, CEU, and CHB populations. The color scale is as in (C). (B) Illustration of the model we fit, with the 14 free parameters labeled. (C) Marginal spectra for each pair of populations. The top row is the data, and the second is the maximum-likelihood model. The third row shows the Anscombe residuals [61] between model and data. Red or blue residuals indicate that the model predicts too many or too few alleles in a given cell, respectively. (D) The observed decay of linkage disequilibrium (black lines) is qualitatively well-matched by our simulated data sets (colored lines). (E) Goodness-of-fit tests based on the likelihood and Pearson's statistic both indicate that our model is a reasonable, though incomplete description of the data. In both plots, the red line results from fitting the real data and the histogram from fits to simulated data. Poorer fits lie to the right (lower and higher ). (F) The improvement in likelihood from including contemporary migration in the real data fit (red line) is much greater than expected from fits to simulated data generated without contemporary migration (histogram). This indicates that the data contain a strong signal of contemporary migration.
There are many possible three-population demographic models one could consider for these populations. To develop a parsimonious yet realistic model, we first considered the marginal AFS for each population and each pair of populations. Previous analyses found that the YRI spectrum is well-fit by a two-epoch model with ancient population growth [5],[17], and we found this as well (Supplementary Figure 6 in Text S1). Previous analyses of the CEU and CHB populations found that both populations went through bottlenecks [5],[11] concurrent with divergence [11]. Such models qualitatively fit our marginal CEU-CHB spectrum (Supplementary Figure 7 in Text S1).
Combining these demographic features yields the model illustrated in Figure 2B. The maximum likelihood values for the 14 free parameters are reported in Table 1. Qualitatively, the resulting model reproduces the observed spectra well, as seen in the second and third rows of Figure 2C. (The correlation between adjacent residuals is due in part to our projection of the data down from a larger sample size (Supplementary Figure 8 in Text S1).) Allowing for asymmetric gene flow yielded very little improvement in fit, as did allowing for growth in the Eurasian ancestral population or allowing the CEU and CHB bottleneck and divergence times to differ (data not shown).
Our composite likelihood function assumes that polymorphic sites are independent. Because it thus overestimates the number of effective independent data points, confidence intervals calculated directly from the composite likelihood function will be too small. To control for linkage, we performed both conventional and parametric bootstraps. Because our sequenced genes are typically well separated, they can be treated as independent, and our conventional bootstrap resampled from the 219 sequenced loci. For the parametric bootstrap, simulated data sets that incorporate linkage and the EGP's sequencing strategy were generated with ms[35].
Table 1 reports parameter 95% confidence intervals from both the conventional and bias-corrected parametric bootstraps. The parametric bootstraps yield slightly smaller confidence intervals than the conventional bootstrap, suggesting that some variability in the data has not been accounted for by our simulations. This variability may involve small varied selective forces on the sequenced regions or slight relatedness between sampled individuals. The parametric bootstrap results additionally show that our method possesses very little bias in parameter inference (Supplementary Figure 9 in Text S1).
As seen in Table 1, the times for growth in the African ancestral population and divergence of the Eurasian ancestral population ( and ) have particularly wide confidence intervals, likely a consequence of the high inferred migration rate between the African and Eurasian ancestral populations. shows high correlation with the ancestral population size , while shows no strong linear correlation with any other single parameter (Supplementary Figure 11 in Text S1). We found that 92 out of our 100 conventional bootstrap fits yield , supporting the contention that the CHB population suffered a more severe bottleneck than the CEU population [11] (Supplementary Figure 11 in Text S1).
We used several metrics to assess our model's goodness-of-fit, in additional to visual inspection of the residuals seen in Figure 2C. Figure 2D compares the decay of linkage disequilibrium (LD) in the data and in the parametric bootstrap simulations. The agreement seen is notable because our demographic inference used no LD information in building and fitting the model. This LD comparison thus serves as independent validation of both our model and bootstrap simulations. We also asked whether the likelihood found in the real data fit is atypical of fits to simulated data. Out of fits to 100 simulated data sets, 2 produced a smaller likelihood (worse fit) than the real data fit (Figure 2E), yielding a p-value of ≈0.02. One can craft examples in which a likelihood-based goodness-of-fit test fails to exclude very poor models [50]. Thus we also applied Pearson's goodness-of-fit test, a more robust and standard method for data that is in Poisson-distributed bins, such as the AFS [36]. In our case, we must use our parametric bootstraps to assess the significance of the sum-of-squared-residuals test statistic , because many entries in the AFS are small and because they are not strictly independent. Figure 2E shows the bootstrap-derived empirical distribution of . Two of the bootstraps yielded a larger (worse fit) than the real data fit, giving a p-value of ≈0.02, identical to that from the likelihood-based test. (The two simulations that yield a higher than the real fit are not the same two that yield a lower , suggesting that these tests are somewhat independent.) In some cases specific frequency classes of SNPs, such as rare alleles, may be of particular interest. In Supplementary Table 5 in Text S1, we provide comparisons of the joint distribution of rare alleles seen in the data with that from our simulations. These comparisons indicate that our model also reproduces well this interesting region of the frequency spectrum. Finally, in Figure 4 we compare the model and data using larger bins of SNPs specific to particular populations or segregating at high or low frequency. In all cases the model agrees within the uncertainty of the bootstrapped data. Taken together, these tests suggest that our model provides a reasonable, though not complete, explanation of the data, lending credence to our demographic estimates.
The inferred contemporary migration parameters (, and ) are small, raising the question as to whether they are statistically distinguishable from zero. Figure 2F shows that the improvement in fit to the real data upon adding contemporary migration to the model is much larger than would be expected if there were no such migration, implying that the contemporary migration we infer is highly statistically significant. Omitting ancient migration () reduced fit quality even more, indicating that the data also demand substantial ancient migration (data not shown).
Settling the New World
To study the settlement of the Americas, we used the previously considered 22 CEU and 12 CHB individuals, plus an additional 22 individuals of Mexican descent sampled in Los Angeles (MXL). Data were processed as in our Out of Africa analysis, yielding 13,290 segregating SNPs from effectively 4.22 Mb of sequence. Figure 3A shows the resulting AFS, while Figure 3C shows the marginal spectra.
10.1371/journal.pgen.1000695.g003Settlement of the New World analysis.
As in Figure 2, (A) is the data, (B) is a schematic of the model we fit, (C) compares the data and model AFS, and (D) compares LD. (E) The fit of our model to the real data is not atypical of fits to simulated data. (F) The improvement in real data fit upon including CHB-MXL migration (red line) is very typical of the improvement in fits to simulated data without CHB-MXL migration. Thus we have no evidence for CHB-MXL migration after divergence.
A model in which the CEU and CHB diverge from an equilibrium population did not reproduce the AFS well (Supplementary Figure 13 in Text S1). Interestingly, a model allowing a prior size change in the ancestral population better fit the AFS but very poorly fit the observed LD decay (Supplementary Figure 13 in Text S1). Thus, reproducing the AFS does not guarantee reproduction of LD, at least given a historically unrealistic model. To develop a more realistic model, we endeavored to include the effects of Eurasian divergence from and migration with the African population. Computational limits precluded us from considering all 4 populations simultaneously, so we dropped the African population from the simulation upon MXL divergence (Figure 3B).
Table 2 records the maximum-likelihood parameter values inferred for this model. Because this fit did not include African data, we could not reliably infer demographic parameters involving the African population. Thus, for this point estimate we fixed the Africa-related parameters , , , , , , and to their maximum-likelihood values from Table 1. Figure 3C compares the model and data spectra. The residuals show little correlation, with the possible exception that the model may underestimate the number of high-frequency segregating alleles.
Parameter confidence intervals are reported in Table 2. To account for our uncertainty in those parameters derived from the Out of Africa fit, for each conventional bootstrap fit we used a set of Africa-related parameters randomly chosen from the sets yielded by our Out of Africa conventional bootstrap. For the parametric bootstrap, we used the maximum-likelihood point estimates. Again, we see that the conventional bootstrap confidence intervals are comparable to, although slightly wider than, the parametric bootstrap intervals. Several parameters in this analysis have direct correspondence with our Out of Africa analysis. Of particular note, the confidence intervals for the CEU-CHB divergence time overlap.
In assessing goodness of fit, Figure 3D shows that this model does indeed reproduce the observed pattern of LD decay. Unlike in our Out of Africa analysis, however, here the LD decay was used to choose the form of the model (although not its parameter values), so this is not a completely independent assessment of fit. Of our 100 parametric bootstrap fits, 13 yielded a worse likelihood than the real fit (Figure 3E), for a p-value of ≈0.13. Applying Pearson's test, we find that 23 of 100 bootstrap fits yield a higher (worse) than the fit to the real data, for a p-value of ≈0.23, similar to that of the likelihood analysis. Comparing distributions of rare alleles, our model typically reproduces the observed distribution well, although it may be somewhat overestimating the proportion of alleles that are rare or absent in the CHB population (Supplementary Table 7 in Text S1). In sum, our model appears to be a reasonable explanation of this data, somewhat better than in our Out of Africa analysis.
An essential feature of the Mexican-American individuals considered here is that they are typically admixed from Native American and European ancestors. The ≈50% average European admixture proportion we inferred for the MXL population is consistent with previous estimates for Los Angeles Latinos [51]. We have no direct data from the Native American populations ancestral to MXL, but our model does account for their divergence from East Asia. A model neglecting this divergence (by setting to zero) fit the data substantially worse and yields an unrealistically high average European admixture proportion into MXL of 68%.
Not only are Mexican-American individuals admixed, their admixture proportions also vary, and this subtlety is not directly accounted for in our analysis. To assess its effect on our results, we first roughly estimated the ancestry proportion of each individual, using essentially a maximum-likelihood version [18] of the algorithm used in structure[52] (Text S1). (Methods based on “admixture LD”, which identify breakpoints between regions of Native American and European ancestry, may be more powerful [53]. However, the strategy used by the EGP of sequencing widely spaced genes will resolve few of these breakpoints, limiting the applicability of these methods.) We then performed additional parametric bootstrap analyses, using simulations with a distribution of individual ancestry chosen to mimic that seen in the data and, to further test the method, with an extremely wide distribution. These simulations showed that variation in individual ancestry does not bias our parameter inferences (Supplementary Figure 19 in Text S1). Remarkably, it does not even change our statistical power. This is evidenced by the fact that these bootstrap simulations yielded confidence intervals identical to our original simulations without variation in ancestry proportion (Supplementary Figure 19 in Text S1). Nevertheless, future studies may profit by incorporating individual ancestry information [18], perhaps inferred from admixture LD.
Finally, our model allowed us to assess the role recurrent migration from Asia played in the settlement of the New World [2]. When we added CHB-MXL migration to our model, we found that the maximum likelihood migration rate was per generation. As shown in Figure 3F, the resulting improvement in likelihood is typical (p-value≈0.45) of fits including CHB-MXL migration to data simulated without it. Our data and analysis thus yielded no evidence of recurrent migration in the settlement of the New World. Note, however, that this simple test does not necessarily rule out more complex scenarios, in which migration may vary over time.
Nonsynomymous polymorphism
Polymorphisms that change protein amino acid sequence are of medical interest because they are particularly likely to affect gene function [54]. Correspondingly, they are often subject to natural selection. Diffusion approaches are particularly useful for studying such nonsynonymous polymorphism, because they easily incorporate selection. Although the diffusion approximation assumes that sites are unlinked, nonsynonymous segregating sites are rare enough that this is often a reasonable approximation [49].
As an illustration, we used our Out of Africa demographic model to predict the distribution of such variation between continental populations. To do so, we must specify a distribution for the selective effects of nonsynonymous mutations that enter the population. For this we adopted a negative gamma distribution whose parameters were recently inferred [49]. The resulting distribution of segregating variation is shown in Figure 4A. (To ease comparison, we have assumed the same scaled mutation rate as in the neutral case of Figure 2C.) As expected, selection sharply reduces the amount of segregating polymorphism. Figure 4B shows the proportion of variants within various classes. Also as expected, selection shifts nonsynonymous variation toward lower frequencies, raising the proportion of singletons and lowering the proportion at frequency greater than 10%. Less obviously, it also reduces the proportion of variation that is shared between populations. In the neutral case, 43% of polymorphism is predicted to be present in more than one population, while in the selected case only 35% is. Thus genetic inferences from coding polymorphism may be less transferable between populations than might be expected from neutral patterns of allele sharing.
10.1371/journal.pgen.1000695.g004Distribution of nonsynonymous polymorphism.
We simulated our maximum-likelihood Out of Africa demographic model with a distribution of selective effects previously inferred for nonsynonymous polymorphism [49]. (A) To enable direct comparison with the neutral AFS (Figure 2C), the scaled mutation rate was set identically, as is the color scale. As expected, selection dramatically reduces the amount of segregating polymorphism. (B) Shown are the proportions of variation found in various frequency classes. As expected, nonsynonymous variants typically have lower frequency. They also less likely to be shared between populations. Data error bars indicate 95% bootstrap confidence intervals.
In the data considered here, there are about 400 nonsynomymous polymorphisms segregating in the three populations considered. This is too few for a detailed goodness-of-fit test of our predicted distribution. (Although see Supplementary Figure 20 in Text S1 for a direct AFS comparison.) Nevertheless, we observe that our predictions shown in Figure 4B all lie within the bootstrap 95% confidence intervals from the data.
Discussion
Our diffusion approximation to the joint allele frequency spectrum is a powerful tool for population genetic inference. Although the diffusion approximation neglects linkage between sites, our method's computational efficiency allows us to use extensive bootstrap simulations to account for the effects of linkage. (Let us reiterate that linkage does not affect the expected allele frequency spectrum of neutral sites, so our diffusion-based approach is estimating the same AFS that coalescent simulations are estimating, but in a small fraction of the time). We applied our method to human expansion out of Africa and settlement of the New World, using public resequencing data from the Environment Genome Project. The flexibility of the diffusion approach also allowed us to consider the distribution of non-neutral variation, which is difficult to address with other approaches. Although no model can capture in detail the complete history of any population, the models presented here help refine our understanding of human expansion across the globe.
Our demographic results are in most respects broadly consistent with previous analyses of human populations. In particular, single-population analyses have also inferred African population growth and European and Asian bottlenecks [4]–[6]. Also, the migration rates we infer are similar to those inferred by Schaffner et al. [8] but somewhat smaller than those of Cox et al. [15]. On the other hand, Keinan et al. [11] inferred no significant migration between CEU and CHB. Finally, our estimate of a New World founding effective population size in the hundreds is compatible other inferences [14].
Perhaps our most interesting demographic results are the inferred divergence times. Other studies [11],[12] have estimated divergence times between Europeans and East Asians similar to the ≈23 kya we infer. Interestingly, archeological evidence places humans in Europe much earlier (≈40 kya) [1]. Our inferred divergence time of ≈22 kya between East Asians and Mexican-Americans is somewhat older than the oldest well-accepted New World archeological evidence [2]. The divergence we infer may reflect the settlement of Beringia, rather than the expansion into the New World proper [14]. Finally, the divergence time of ≈140 kya we infer between African and Eurasian populations is consistent with archeological evidence for modern humans in the Middle East ≈100 kya [1], but it is much older than other inferences of ≈50 kya divergence from mitochondrial DNA [1]. This discrepancy may be explained by our inclusion of migration in the model. Migration preserves correlation between population allele frequencies, so an observed correlation across the genome can be explained by either recent divergence without migration or ancient divergence with migration. In fact, the African-Eurasian migration rate we infer of ≈25×10^{−5} per generation is comparable to the ≈100×10^{−5} inferred from census records between modern continental Europe and Britain [55].
One difficulty in interpreting our divergence times is that the sampled populations may not best represent those in which historically important divergences occurred. For example, the Yoruba are a West African population, so the divergence time we infer between Yoruba and Eurasian ancestral populations may correspond to divergence within Africa itself. Future studies of more populations [56]–[58] will help alleviate this difficulty.
Another difficulty is that the genic loci we study here may not be ideal for demographic inference. Although we consider only noncoding sequence in fitting our historical model, selection on regulatory or linked coding sites may skew the AFS [59]. In fact, the EGP data have been shown to differ in some ways (e.g. Tajima's ) from intergenic regions [58]. Nevertheless, we use the EGP data because it is currently the largest public resource of noncoding human genetic variation, and we fit a neutral model because disentangling the small expected effects of selection on these sites from demographic effects will require additional data. The rapidly declining cost of sequencing will give future studies access to many more loci that are likely to be less influenced by selection. Importantly, the computational burden of our method is independent of the amount of sequence used to construct the AFS. Additional loci will also increase power to discriminate between models and incorporate more detail.
The AFS encodes substantial demographic information. It is has been shown, however, that an isolated population's AFS does not uniquely and unambiguously identify its demographic history [60]; we expect a similar result to hold for multiple interacting populations. Moreover, the AFS does not capture all the information in the data. As illustrated by the alternative New World models we considered, patterns of linkage disequilibrium encode additional information. Future studies may profit from coupling our efficient AFS simulation with methods that address other aspects of the data.
We have developed a powerful diffusion-based method for demographic inference from the joint allele frequency spectrum. We applied our method to human expansion out of African and the settlement of the New World, developing models of human history that refine our knowledge and raise intriguing questions. We also applied our method to predict the distribution of nonsynonymous variation across populations, and this prediction is consistent with the available data. Our methods and the models inferred from it offer a foundation for studying the history and evolution of both our own species and others.
Supporting Information
Complete supplementary data.
(1.80 MB PDF)
We thank Amit Indap for bioinformatics assistance and Jim Booth for statistical assistance. We also thank the NIEHS Program (ES-15478) for making their dataset so easily accessible. We had fruitful discussions with Adam Auton, Debbie Nickerson, Michael Hammer, Rasmus Nielsen, Nick Patterson, Molly Przeworski, Jeff Wall, and Carsten Wiuf.
ReferencesMellarsP2006Going east: new genetic and archaeological perspectives on the modern human colonization of Eurasia.GoebelTWatersMRO'RourkeDH2008The late Pleistocene dispersal of modern humans in the Americas.NielsenRHellmannIHubiszMBustamanteCClarkAG2007Recent and ongoing selection in the human genome.AdamsAMHudsonRR2004Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms.MarthGTCzabarkaEMurvaiJSherryST2004The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations.VoightBFAdamsAMFrisseLAQianYHudsonRR2005Interrogating multiple aspects of variation in a full resequencing data set to infer human population size changes.HeyJ2005On the number of New World founders: a population genetic portrait of the peopling of the Americas.SchaffnerSFFooCGabrielSReichDDalyMJ2005Calibrating a coalescent simulation of human genome sequence variation.BecquetCPrzeworskiM2007A new approach to estimate parameters of speciation models with application to apes.CaicedoALWilliamsonSHHernandezRDBoykoAFledel-AlonA2007Genome-wide patterns of nucleotide polymorphism in domesticated rice.KeinanAMullikinJCPattersonNReichD2007Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans.GarriganDKinganSBPilkingtonMMWilderJACoxMP2007Inferring human population sizes, divergence times and rates of gene flow from mitochondrial, X and Y chromosome resequencing data.MulliganCJKitchenAMiyamotoMM2008Updated three-stage model for the peopling of the Americas.KitchenAMiyamotoMMMulliganCJ2008A three-stage colonization model for the peopling of the Americas.CoxMWoernerAWallJHammerM2008Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow.DrummondAJRambautAShapiroBPybusOG2005Bayesian coalescent inference of past population dynamics from molecular sequences.HernandezRDWilliamsonSHBustamanteCD2007Context dependence, ancestral misidentification, and spurious signatures of natural selection.NielsenRHubiszMHellmannITorgersonDAndrésA2009Darwinian and demographic forces affecting human protein coding genes.HeyJNielsenR2004Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis.SawyerSAHartlDL1992Population genetics of polymorphism and divergence.BustamanteCDWakeleyJSawyerSHartlDL2001Directional selection and the site-frequency spectrum.WakeleyJ2008WilliamsonSHHernandezRFledel-AlonAZhuLNielsenR2005Simultaneous inference of selection and population growth from patterns of variation in the human genome.HernandezRDHubiszMJWheelerDASmithDGFergusonB2007Demographic histories and patterns of linkage disequilibrium in Chinese and Indian rhesus macaques.WiufC2006Consistency of estimators of population scaled parameters using composite likelihood.ZhuLBustamanteCD2005A composite-likelihood approach for detecting directional selection from DNA sequence data.LivingstonRJvon NiederhausernAJeggaAGCrawfordDCCarlsonCS2004Pattern of sequence variation across 213 environmental response genes.FischerRA1922On the dominance ratio.KimuraM1964Diffusion models in population genetics.EwensWJ2000WattersonGA1975On the number of segregating sites in genetical models without recombination.NagylakiT1980The strong-migration limit in geographically structured populations.ClarkAGHubiszMJBustamanteCDWilliamsonSHNielsenR2005Ascertainment bias in studies of human genome-wide polymorphism.NielsenRHubiszMJClarkAG2004Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data.HudsonRR2002Generating samples under a Wright-Fisher neutral model of genetic variation.PressWHTeukolskySAVetterlingWTFlanneryBP2007ChangJSCooperG1970A practical difference scheme for Fokker-Planck equations.OliphantTE2006OliphantTE2007Python for scientific computing.HunterJD2007Matplotlib: a 2D graphics environment.NIEHS Environmental Genome Project.URL http://egp.gs.washington.edu. Accessed November 6, 2007AkeyJMEberleMARiederMJCarlsonCSShriverMD2004Population history and natural selection shape patterns of genetic variation in 132 genes.Chimpanzee Sequencing and Analysis Consortium2005Initial sequence of the chimpanzee genome and comparison with the human genome.HwangDGGreenP2004Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution.KumarSFilipskiASwarnaVWalkerAHedgesSB2005Placing confidence limits on the molecular age of the human-chimpanzee divergence.KondrashovAS2002Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases.FennerJN2005Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies.TremblayMVézinaH2000New estimates of intergenerational time intervals for the calculation of age and origin of mutations.BoykoARWilliamsonSHIndapARDegenhardtJDHernandezRD2008Assessing the evolutionary impact of amino acid mutations in the human genome.HeinrichJG2001Can the likelihood-function value be used to measure goodness of fit?Technical Report 5639, Collider Detector at FermilabPriceALPattersonNYuFCoxDRWaliszewskaA2007A genomewide admixture map for Latino populations.PritchardJKStephensMDonnellyP2000Inference of population structure using multilocus genotype data.PattersonNHattangadiNLaneBLohmuellerKEHaflerDA2004Methods for high-density admixture mapping of disease genes.KryukovGVShpuntAStamatoyannopoulosJASunyaevSR2009Power of deep, all-exon resequencing for discovery of human trait genes.WealeMEWeissDAJagerRFBradmanNThomasMG2002Y chromosome evidence for Anglo-Saxon mass migration.LiJZAbsherDMTangHSouthwickAMCastoAM2008Worldwide human relationships inferred from genome-wide patterns of variation.JakobssonMScholzSWScheetPGibbsJRVanLiereJM2008Genotype, haplotype and copy-number variation in worldwide human populations.WallJDCoxMPMendezFLWoernerASeversonT2008A novel DNA sequence database for analyzing human demographic history.BravermanJMHudsonRRKaplanNLLangleyCHStephanW1995The hitchhiking effect on the site frequency spectrum of DNA polymorphisms.MyersSFeffermanCPattersonN2008Can one learn history from the allelic spectrum?PierceDASchaferDW1986Residuals in generalized linear models.