Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data

Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40–270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17–43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3–26.9 kya), and our analysis yields no evidence for subsequent migration. Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).

As an example of inference with migration, consider the demographic scenario of Figure 1 for parameter values of moderate migration and divergence: θ = 1000, ν 1 = ν 2 = 0.5, M = 2, and τ = 0.3. The noise-free AFS is shown in the upper left panel of Supplementary  Figure 1. For a model with M ≡ 0 fit to this AFS, the maximum likelihood parameters are: θ = 1034.4, ν 1 = ν 2 = 0.44, and τ = 0.084. As expected, when we neglect migration the inferred divergence time is substantially smaller. Less obviously, the ancestral population size (proportional to θ) is overestimated, while current populations sizes are underestimated. The resulting AFS is shown in the upper right panel of Supplementary Figure 1 and is qualitatively very similar to the AFS with migration. As illustrated in the lower left panel, however, fitting this incorrect model yields correlated residuals; the model predicts too few shared polymorphisms at low frequency in one or both populations. This is the AFS signal that distinguishes between reduced divergence time and increased migration.
2 Numerical methods 2.1 Finite-difference scheme Note: here we index populations using lower-case Greek indices, to distinguish from the numerical index we use over grid points.
To numerically approximate the solution to our diffusion equation (Eqn. 1) for P populations, we discretize the equation over a non-uniform regular P -dimensional grid. (The grid is described in section 2.3). To solve this multidimensional problem, we adopt an alternating difference scheme [1], in which we evolve terms involving derivatives in each variable x α separately. Here we outline the finite difference scheme we use, inspired by work of Chang and Cooper [2] on diffusion equations arising in plasma physics.
Importantly, aside from the boundary conditions our diffusive evolution conserves probability. For stability and accuracy our finite difference scheme should also explicitly conserve probability. This motivates us to consider the fluxes F α . Our diffusion equation can be decomposed into fluxes as where V (α) = x α (1 − x α ), and M (α) = β M α←β (x β − x α ). If we take a centered finite difference about point j on our grid x i , where i = 1, 2, . . . G, we obtain To conserve total probability (as defined by the trapezoidal numerical integration scheme) we take F −1/2 ≡ F G+1/2 ≡ 0 and To evaluate F t+1 j+1/2 , we take a centered finite-difference, so that where φ t+1 j+1/2 ≡ 1 2 φ t+1 j + φ t+1 j+1 . (Note that Chang and Cooper take a non-centered difference on φ to ensure non-negativity of the equilibrium solution. This is not possible for us, because the equilibrium solution of our equations can be singular.) Combining equations S4 and S6 yields a tridiagonal system of equations which can be solved by standard methods, e.g. a j φ t+1 j−1 + b j φ t+1 j + c j φ t+1 j+1 = φ t j . The flux out is handled by particular absorbing terms [3] of the form: These are applied only at the grid points where all frequencies equal 0 or 1 respectively. New mutations are injected into the system at each timestep via: The final term (involving 2 P ) normalizes the probability density given trapezoid rule integration over the other dimensions of the grid. Note that we inject probability at a rate θ/(2x 1 ), independent of the relative population size ν α . This reflects the fact that mutations are being input at x 1 , not 1/(2N α ). The effect of relative population size is accounted for by the drift term V .
In total, for a P -dimensional problem, each timestep requires solving G P −1 tridiagonal systems of size G. We set our timestep to be ∆t ≡ min ∆x/10, which provides good stability and accuracy for the scenarios considered here. (Situations with very high migration rates or selection coefficients or very small population sizes may require finer timesteps.) Thus the total run time for a single likelihood evaluation is order G P +1 .

Population splits
Numerically handling δ-function density for φ upon population splits (Eqn. 2) involves some subtlety. In particular, it is important to properly conserve φ density, and the value z to which a given pair of x and y map to may not be a point in the grid. Consider x and y which map to a value z which lies between grid points z i and z i+1 . Define ∆z i ≡ z i+1 − z i . We distribute the density φ(x, y) which a proportion a going to z i and b going to z i+1 , such that To ensure proper normalization under the trapezoid rule in the interior of our domain, we must have a 2 Upon splitting, a contribution aC is thus added to element φ(x, y, z i ) and a contribution bC to φ(x, y, z i+1 ). The edge cases, in which z, for example, maps between z 1 and z 2 are handled similarly. In those cases, Eqn. S11 must simply be altered to reflect the correct trapezoidal integration rule for those edge cases. We solve our equations on a non-uniform x grid. For a grid of G points, the first G/10 points (rounded down) are uniformly spaced within x ∈ [0, 0.05]. This uniform grid spacing makes the solution more accurate in the low-frequency regime, where we have many segregating mutations and the data are therefore relatively less noisy.

Non-uniform grid
The remaining points are placed so that their spacings increase quadratically, being finer at small and large values of x. This scheme minimizes the difference in spacings between adjacent grid points, which we found helped accuracy and stability. Specifically, we take the remaining points onto q ∈ [0, 1] which is mapped onto x by x = a q 3 + b q 2 + c q + d, were d = 0.05, c is equal to the spacing between the first G/10 points, b = −3(∆q + c + ∆q d)/∆q, a = −(2/3)b. This pattern of grid points is illustrated in Supplementary Figure 2.
Because our algorithm run time scales as O(G 4 ) when working with 3 populations, this non-uniform grid gives a dramatic increase in computational speed for a given solution accuracy.

Extrapolation
Our numerical approximation to the solution of the diffusion equation improves as the grid becomes finer, as G → ∞. Our computational costs, however, increase rapidly with G. To overcome this, we use Richardson extrapolation [1], extrapolating on the log of each entry

Small-M comparison with ms
The human applications we consider here involve relatively high migration rates, so there are very few fixed differences between populations. In other cases, much smaller migration rates may be of interest, so we have additionally compared performance between diffusion and coalescent simulations for M = 0.1, using the same procedure as reported in the main text for M = 2. Supplementary Figure 4 shows that our method is again very accurate in likelihood evaluation. The speed advantage of the diffusion method is similar in this case to the M = 2 case.

Data processing
The 219 genes we consider yield 5,017,161 bases of non-coding sequenced DNA, with 31,352 observed SNPs in the EGP samples. Table 8 gives a gene-by-gene break down of the data.
To apply the statistical correction for ancestral misidentification [5], we align to the panTro2 build of the chimp genome using BLAT [6]. Genes longer than the 25 kb recommended for BLAT were split into 25 kb chunks with 2 kb overlap. Each 25 kb chunk was required to have at least 90% identity with chimp across at least 70% of the sequence, or it was discarded from analysis.
To apply the ancestral misidentification correction to a SNP it must satisfy 4 criteria. 1) It must have been successfully aligned to chimp. 2) The chimp allele must be one of the segregating human alleles.
3) The two bases flanking the SNP must be the same in chimp and human. 4) The SNP cannot be adjacent to any other SNP. All these criteria are satisfied for 27,824 SNPs in our sample.

Projection
To account for missing data and ease visual comparison between populations, we project each AFS down to 20 samples per population using a hypergeometric distribution. In essence, to project from n successful calls in a population to m, we average over all possible results of choosing a size m subsample from those n calls. Because the sampling in each population is independent of sampling in the others, this is a simple extension of the one-dimensional case [7]. In particular, if we are projecting an AFS S from a sample size in population α of n to a sample size of m, the projected AFS P is (S12) Note that this projection reduces the number of segregating SNPs, as SNPs segregating in the original sample may project to make contributions to the 'absent in the sample' and 'fixed in the sample' classes. For our Out of Africa model, which uses YRI, CEU & CHB data, 25,258 SNPs have ≥ 20 calls in every population. We correct our estimation of the ancestral population size for the SNPs lost to the ancestral misidentification criteria and the ≥ 20 calls criteria by using an effective sequenced length. In this case, it is 5.02 Mb · 25, 259/31, 352 = 4.02 Mb. After projection, our AFS sums to 17,446 segregating SNPs.
For the New World model we have 26,387 SNPs with ≥ 20 calls in every population, yielding an effective sequencing length of 4.20 Mb. Our projected AFS, however, contains only 13,290 segregating SNPs, reflecting lower diversity in these populations. show linkage disequilibrium as a function of physical distance in the data. These values were calculated using phased haplotypes generated using PHASE 2.0 [8] and provided by the EGP. For comparison with our simulated data, only SNPs in non-coding regions and with minor allele frequency ≥ 10% in all populations were considered.

Estimating µ
We estimate the neutral mutation rate µ using the divergence between human and chimp. Comparing aligned sequences in our data, we estimate the divergence to be 1.13%. Assuming a divergence time of 6 million years [9] and a mean generation time for human and chimp over this interval of 25 years, we have µ = 0.0113 · 25/(2 · 6 × 10 6 ) = 2.35 × 10 −8 per generation. (S13) Note that the assumed generation time formally cancels in the conversion between genetic and chronological time units. If t C is the chronological time, τ is the genetic time (in units of 2N A generations), and t g is the generation time, then As shown in the previous paragraph, µ is proportional to t g , resulting in the cancellation of t g .

Coalescent simulations
Detailed simulations of the EGP data are required to estimate accurate confidence intervals and to perform realistic goodness-of-fit and hypothesis tests. For these simulations, we use ms [10].
To account for potential linkage between genes, in our simulations we divide the 219 sequenced genes into 194 potentially linked regions, each separated by at least 500 kb (Table 8). We simulate the entirety of each region using a value for θ scaled by the relative length of that region to the total sequenced length. The recombination rate for each region is set to the average recombination rate from the HapMap Release 22 genetic map [11]. This rate r was converted into a population-scaled rate ρ = 2N ref r by using N ref inferred from θ and our estimate of µ.
Simulations use the total number of samples from each EGP population. The resulting AFS is then projected down to 20 samples per population. When extracting an AFS from a simulation, only SNPs lying within regions sequenced by the EGP are included. To first order, the effect of linkage on our analysis is to reduce the effective number of independent samples in the frequency spectrum. To estimate the magnitude of this effect, in Supplementary Figure 5 we compare the distribution of likelihoods generated by the full simulation procedure described above and by modified Poisson sampling from the model frequency spectrum M . In the modified Poisson sampling, we sample from M/f and then multiply the resulting sampled frequency spectrum by f . In effect, this generates a frequency spectrum where each sample is replicated f times.

Reduction of effective independent SNPs by linkage
As illustrated in Supplementary Figure 5, the likelihood distributions are similar for f = 4.5. This suggests that linkage reduces the effective number of independent samples in our data by a factor of approximately 4.5. Note that the likelihood distribution for the full simulations remains somewhat wider than that for f = 4.5, indicating that the effect of the linkage in the data is more than a simple consistent reduction in the number of effective data points.    Table 2 present results from a divergence and growth model fit to the CEU/CHB spectrum. (In this model both populations diverge from an equilibrium population and grow exponentially following a concurrent bottleneck.) Here we do see a pattern of correlated residuals, in that the model underestimates the amount of high frequency shared polymorphism. As seen in Figure 2, the more complex model which incorporates migration from YRI alleviates this somewhat. Allowing the bottleneck times in the two populations to occur anytime after divergence yielded only a very slight increase in fit quality (data not shown).

Maximum likelihood parameters
In genetic units (scaled by N A ), the maximum likelihood parameters are shown in Supplementary Table 3. In ms syntax, the demographic model is: To convert to physical units, we use θ = 4N A µL, where L is the length of sequence considered. Importantly, L must use the effective sequenced length, which accounts for losses in alignment and missed calls. So (S15) The remaining conversions are straightforward.

Projection and residuals
Some the correlation between residuals seen in Figure 2 and 3 is due to the fact that we've projected the data down from a larger sample size. This effect is illustrated in Supplementary  Figure 8. In subfigure (a) we show the residuals between our model AFS simulated for the total number of individuals in the EGP and a Poisson sampling from that AFS. As expected, the residuals shown no correlation. Subfigure (b) compares the model AFS with the same sampled AFS as in (a), but projected down to 20 samples per population. This projection results in correlation between adjacent entries in the AFS that is very similar to that seen in the model comparison with the real data. Supplementary

Parametric bootstrap
Shown in Supplementary Figure 9 are the results of the parametric bootstrap analysis of parameter uncertainties for our Out of Africa model. All distributions of bootstrap parameter estimates have their mode approximately at the maximum likelihood value. This provides empirical evidence that, as expected, our estimation is not significantly biased.

Parameter correlations
The correlations between parameter values inferred during our conventional bootstrap shed light on the dependencies between model parameters. Supplementary Figure 10 shows the squared correlation coefficient between each of the inferred parameter values. Correlations are typically low, with a few exceptions. Supplementary Figure 11 illustrates three of the most strongly correlated parameter pairs. The strongest correlation is between the time of the African population size change T AF and the ancestral population size N A . This likely reflects the need to generate the appropriate level of polymorphism before divergence from Africa. Interestingly, the divergence time between CEU and CHB, T EU −AS , is more strongly correlated with the growth rate of the Asian population r AS than with the migration rate between the two populations m EU −AS .
The lower-right panel of Supplementary Figure 11 compares inferred values of the initial population sizes for European and Asian populations, N EU 0 and N AS0 , respectively. There is    Recently it has been suggested that the sensitivities of many-parameter models to changes in those parameter values follow a universal 'sloppy' distribution [12]. This distribution is characterized by eigenvalues of the Hessian matrix (∂ 2 L/∂ log p i ∂ log p j ) which decrease linearly in their logarithm. As illustrated in Supplementary Figure 12, the demographic model we fit exhibits a sloppy spectrum of sensitivities. As expected, a principle components analysis of our ensemble of 100 parameter sets derived from fitting simulated data sets shows a similar sloppy pattern.
The sloppiness we see is indicative of strong correlations between the effects of different parameters on the AFS. In some problems, it is possible to define an orthogonal parameterization which eliminates these correlations [12]. Such a parameterization has been demonstrated for the single-population AFS [13], but none is know yet for the multiple-population case.
Additionally, the fact that this model is sloppy is interesting because the model was designed to be parsimonious, and in all previous examples of sloppy systems the parameterization was determined by biological or physical considerations. This example expands the domain of 'sloppy' problems.

Contemporary migration test
Fitting the data with a model lacking contemporary migration yields the parameters in Supplementary Table 4. In our test, we generate simulated data sets with these parameters, then fit them with a model allowing contemporary migration.

Rare alleles
In Figure 2.E. of the main text, we evaluate the goodness-of-fit of our model to the observed data using our composite likelihood function. This function involves the entirety of the frequency spectrum, and in some cases specific frequency classes may be of particular interest.
In particular, rare alleles may be of medical interest. Supplementary Table 5 compares our the rare-allele entries of our bootstrap simulations with the real data. Each entry in the tables records the fraction of bootstrap simulations which yielded a larger proportion of SNP in that entry that in the real data. For example, 3% of simulations yield a larger proportion of SNPs that are at frequency (2,1,1) in (YRI,CEU,CHB) samples, respectively, than is observed in the real data. In general, the proportions seen in the real data are typical in our simulations. For some entries the real data may be atypical of the simulations, but it is unclear whether these deviations are signficant.
Supplementary Table 5: P-values of rare frequency spectrum entries. The tables record the fraction of parametric bootstrap simulations yielding larger proportion of mutations in a given frequency class than observed in the data. Each panel shows one of the three models we considered for modeling settlement of the New World, which differ only in history prior to the MXL divergence. Each panel also shows the residuals from the fit to the data, along with a comparison of LD decay from the real data and from the simulations.
Supplementary Figure 13 illustrates several alternative demographic models considered for the Settlement of the New World. In Model (a) the CEU and CHB populations diverge from an equilibrium neutral population. Unsurprisingly, this model fails to reproduce the observed frequency spectrum well (log-likelihood = -5405.9, 11 free parameters). As an alternative, we considered Model (b), which allows for a population size change before divergence. (We allowed for the population to increase or decrease. The optimization resulting in a decrease.) This model reproduces the frequency spectrum substantially better (log-likelihood = -5261.1, 13 free parameters), but the predicted pattern of linkage disequilibrium is very different from the observed pattern. The model we settled upon is Model (c), in which the Supplementary population history prior to CEU/CHB divergence is set to the maximum likelihood history from our previous Out of Africa analysis. Note that θ must be scaled to reflect the differing effective sequence length. This model reproduces the frequency spectrum as well as Model (b) (log-likelihood = -5262.0, 10 free parameters), and it yields the correct pattern of linkage disequilibrium.

Maximum likelihood parameters
Supplementary Table 6 gives the maximum likelihood parameter values for our Settlement of the New World. The corresponding ms demography is: Note that this command is complicated by the need to model admixture in MXL by generating a 5th population that exists for 0 time.

Parametric bootstrap
The parametric bootstrap results for our New World model are shown in Supplementary  Figure 14. Note that there does appear to be some slight bias in our inference of the growth rate, r M X , of the MXL population. This is because our calculation of the frequency spectrum is slightly biased in evaluation of the number of singletons private to MXL. Increasing the number of grid points G used in the evaluation would help eliminate this bias, at the cost of an increase in computation time.

Parameter correlations
Supplementary Figure 15 shows the correlation between conventional bootstrap values between pairs of parameters.
Supplementary Figure 16 plots the three most correlated parameter combinations. The correlation between MXL initial population size N M X0 and growth rate r M X is particular strong. These likely represent combinations of parameters with generate the appropriate YRI and CHB divergence.

Comparison with Out of Africa model parameters
For those inferred parameter values that are shared between our Out of Africa and Settlement of the New World Model analyses, Figure 17 compares the distributions of conventional bootstrap estimates. Importantly, all confidence intervals overlap substantially.

Admixture variability
Individuals in admixed populations mary vary considerably in their ancestry. In particular, previous analyses of Mexican-Americans have found a wide range of ancestry [14]. Our model adopts a single population-level admixture proportion, and it is important that we understand the degree to varying individual ancestry affects our results.
To roughly estimate individual ancestry proportions, we adopt a maximum-likelihood approach [15] similar to the Bayesian approach used in structure [16]. We consider only the European and East Asian ancestry of Mexican individuals, in order to compare with our three-population simulations. This is a rather crude approximation to the possibly very complex ancestry of these individuals, but it nevertheless provides useful guidance for our simulations.
Define q i to be European ancestry proportion of Mexican individual i. In this rough analysis, 1 − q i is then that individual's East Asian ancestry proportion. For that individual, the probability of the observing the pair of alleles x l = (α, β) at locus l is: Here E lα is the frequency of allele α at locus l in the European population. Similarly, A lα is the frequency of allele α at locus l in the East Asian population. We simply estimate E lα and A lα by the frequencies in our CEU and CHB samples, respectively. To estimate the ancestry proportions q i , we maximize the composite likelihood formed by multiplying the probability in Equation S16 over all loci. The histogram in Supplementary Figure 18 shows the European admixture proportions inferred for the 22 individuals in our Mexican-American sample. The green curve shows the distribution of admixture proportions inferred from parametric bootstrap simulations of our demographic model. These simulations were performed with f M X = 0.47, indicated by the dashed green line. As expected, we see that our admixture inference procedure overestimates the CEU contribution, because we have no Native American samples. Also, we see that the distribution of ancestry proportions arising from our simulations is narrower than seen in the real data. In particular, there are a few individuals in the data with unusually little European ancestry.
To roughly approximate the distribution of ancestry seen in the data, we perform simulations in which 3 of our 22 individuals have 5% European ancestry, 16 have 47%, and 3 have 60%. 1 The resulting distribution of inferred ancestries is shown by the red curve in Supplementary Figure 18. This distribution captures the individuals with low-European ancestry in the real data, and is overall perhaps somewhat wider.
To test our method in the presence of an extremely wide distribution of individual ancestry, we consider a scenario in which we have 1 individual each of European ancestry proportion {0, 5%, 10% . . . }. To have 22 individuals as in our data, we simulate with two individuals of European ancestry proportion 50%. 2 The resulting distribution of inferred ancestries is shown by the cyan curve in Supplementary Figure 18. As expected, the distribution of inferred ancestry proportions is very wide.
We fit our CEU/CHB/MXL model to 100 data sets each simulated with our two distribution of MXL ancestry. Supplementary Figure 19 compares the resulting parametric bootstrap parameter distributions with those from our single-ancestry-proportion simulations. The differences in f M X are expected and correctly reflect the average ancestry of Remarkably, variable ancestry does not even effect our power, as evidenced by the fact that the widths of all the parameter distributions are identical.

Rare alleles
As in section 5.6, it may be of interest specifically how well our model including MXL reproduces the distribution of shared rare alleles. Supplementary Table 7. In this case, it appears that our model may be somewhat overestimating the proportion of alleles that are observed to be absent or at very low frequency in CHB.   Figure 18: Distribution of inferred European ancestry proportion for real data, and simulations with varying degrees of individual ancestry variability. Histogram is the real data. The green, red, and cyan curves are, respectively, simulations with only a single admixture proportion, with a simple distribution of admixture proportions chosen to mimic the real data, and with an extremely wide distribution. Note that these are only crude estimates of European ancestry proportion, because our data lack Native American samples. This is emphasized by the dashed black line, which shows the European admixture proportion used in the single proportion simulations.  Table 8: Properties of genes analyzed. Recorded are the location of each gene, the number of noncoding bases sequenced, and the number of SNPs found in that sequence. Note that these are SNPs segregating in any of the EGP populations, of which we've considered subsets. The horizontal lines divide genes that were considered contiguous possibly linked blocks in our simulations with linkage. For each of these 194 blocks, the recombination rate used is reported.