Inferring Demographic History from a Spectrum of Shared Haplotype Lengths

There has been much recent excitement about the use of genetics to elucidate ancestral history and demography. Whole genome data from humans and other species are revealing complex stories of divergence and admixture that were left undiscovered by previous smaller data sets. A central challenge is to estimate the timing of past admixture and divergence events, for example the time at which Neanderthals exchanged genetic material with humans and the time at which modern humans left Africa. Here, we present a method for using sequence data to jointly estimate the timing and magnitude of past admixture events, along with population divergence times and changes in effective population size. We infer demography from a collection of pairwise sequence alignments by summarizing their length distribution of tracts of identity by state (IBS) and maximizing an analytic composite likelihood derived from a Markovian coalescent approximation. Recent gene flow between populations leaves behind long tracts of identity by descent (IBD), and these tracts give our method power by influencing the distribution of shared IBS tracts. In simulated data, we accurately infer the timing and strength of admixture events, population size changes, and divergence times over a variety of ancient and recent time scales. Using the same technique, we analyze deeply sequenced trio parents from the 1000 Genomes project. The data show evidence of extensive gene flow between Africa and Europe after the time of divergence as well as substructure and gene flow among ancestral hominids. In particular, we infer that recent African-European gene flow and ancient ghost admixture into Europe are both necessary to explain the spectrum of IBS sharing in the trios, rejecting simpler models that contain less population structure.


Mixed admixture status
Consider a history described by four parameters: a population size N , a split time τ s , a later admixture time τ a and an admixture fraction f that denotes the percentage of individuals in the recipient population that recently migrated over from the donor population. Given this history, we will describe how to calculate the frequency H τa,τs,f (L) of IBS tracts with one internal recombination.
As with previous calculations, we must marginalize over two coalescence times t 0 and t, as well as a time of recombination t (r) < min(t 0 , t). In addition, we must consider the "admixture status" of each tract half: whether one of the sequences was involved in the historical migration or whether they were constrained to coalesce before the split time. We will say that a locus is admixture-positive if one of the sequences was involved in the migration and admixture-negative otherwise (see Figure S13). This allows us to introduce the conditional coalescence density function ζ(t|(t (r) , a)), which denotes the coalescence time density function given that at time t (r) , the base pair was uncoalesced with admixture status a. When t (r) < τ a , the admixture status at the time of recombination is undetermined, which will be denoted '0'. Conversely, the admixture status cannot be undetermined at more ancient times of recombination t (r) > τ a . For the one-pulse, constant size history considered here, ζ(t|(t (r) , a)) is the following: If we let a denote the admixture status at the time of recombination, then Therefore, our goal is to evaluate the expression in closed form. Marginalizing over the admixture status at the time of recombination yields that The easiest integrals to compute are the ones where the whole segment has the same admixture status. These scenarios split into two classes: one where the recombination is more recent than τ a (class R) and one where the one where the recombination is more ancient than τ a (class A). When admixture status is negative it also matters whether recombination happened before or after τ s , which divides class A into two subclasses A i (intermediate recombination times, τ a ≤ t (r) < τ s ) and A a (ancient recombination times, t (r) ≥ τ s ). The scenarios with mixed admixture status involve exponential integrals, which cannot be computed in closed form and must be approximated further. In total, there are seven integrals we must do: ( ( The first five of these integrals can be evaluated in closed form with no further approxima-tion. After some algebra, we find that The last terms H τa,τs,f (L, +, R, −) and H τa,τs,f (L, −, R, +), as mentioned before, involve exponential integrals that cannot be reduced to elementary functions. We will approximate them using a standard tight bracketing of the exponential integral. Since (see [?]), we will let and use this to derive the approximation 1 2 e (τs−τa)(1+θ) log 1 + 2 1 + θ , which simplifies to 1 2 e −(τs−τa)(L(ρ+θ)−ρ) log 1 + 2 (τ s − τ a )(1 + θ) .

Simulated admixture histories
To generate each colored curve in Figure 2 of the main text, we used MS to simulate 4.8×10 10 bases of pairwise sequence alignment assuming a mutation rate of 2.5 × 10 −8 per site per generation and a recombination rate of 1.0 × 10 −8 per site per generation. Letting ta denote the admixture time in units of 2N generations, the data were generated with the following command line: ./ms 2 4800 -t 10000 -r 4000 100000 -I 2 1 1 -es ta/2 1 0.95 -ej ta/2+0.000001 3 2 -ej 0.05 2 1 For each of two admixture times τ a = 200 generations and τ a = 400 generations, we simulated 100 replicate datasets and inferred the set of demographic parameters τ a , τ s , f , and N . The mean and variance of the estimates for each parameter are reported in Table 1 of the main text. Figures S1 and S2 plot the full histogram of values estimated for each of the four parameters, suggesting that all parameters are estimated consistently.

Analysis of human data 3.1 Generating empirical tract spectra
We generated empirical spectra of IBS tract lengths from the 1000 Genomes pilot sequences reported in [?] and available at: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/ We used VCF-tools to extract the haplotype sequences from the VCF files encoding the low-coverage haplotypes and trio haplotypes [?]. We then used the CEU, CHBJPT, and YRI mask files available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_03/pilot1/ supporting/README_callability_masks to excise all haplotype regions that were at least 10,000 bases long and annotated as inaccessible for SNP calling in any of the three low coverage data sets. In addition, we excised all regions annotated as gaps on the UCSC genome browser, a list available at: http://cistrome.dfci.harvard.edu/browser/cgi-bin/hgTables After these annotated gaps were removed, the remaining genome contained some conspicuously long regions with few or no SNP calls in the low coverage data, meaning a large fraction of the 358 total haplotypes were IBS. By visual inspection on the 1000 genomes browser, many of these SNP deserts had disappeared with the addition of new individuals sequenced after completion of the pilot phase, indicating that their sparsity of SNP calls was probably a sequencing artifact. We therefore excised each 10 6 -base region of the genome that did not have at least 66 SNP calls in each of the CEU, YRI, and JPTCHB low coverage data sets.
For the remaining portion of the genome, we generated within-population IBS length spectra as follows: For each low coverage population, we numbered the 120 haplotypes with consecutive integers and aligned haplotype n with haplotype n + 1. This generated a total of 119 whole-genome alignments, totaling 3.05 × 10 11 base pairs, which was cut up into IBS fragments at each of the sites where the two haplotypes differed. For each pair of populations, we also generated a between-population IBS length spectrum by aligning haplotype n from population A with haplotype n from population B, yielding 120 whole-genome alignments that totaled 3.08 × 10 11 base pairs. Each alignment was parsed into an IBS length spectrum by cutting it up at the sites where the two haplotypes differed and sorting the resulting fragments by length.
The four parental haplotypes from each trio were numbered 1,2,3,4, and all six pairwise alignments (1 paired with 2, 1 paired with 3, etc.) were used to create the within-population tract spectra. All twelve possible pairwise alignments were used to create the spectrum of CEU-YRI trio sharing.

IBS tracts in low coverage data
For the CEU and YRI populations, we looked at IBS tract sharing within two subsets of the 1000 Genomes pilot data: four high quality whole genome haplotypes from the trio parents and 120 whole-genome haplotypes sequenced at low coverage. We were able to account for the excess of long IBS tracts in the high coverage trios by modeling the distribution of excess errors in the low coverage data (see Figure 5 of the main text).
One difference between the trios and the low coverage sequences was that the low coverage alignments had higher mean heterozygosity. We found that the YRI low coverage data had a mean heterozygosity of 8.47 × 10 −4 , while the YRI trio parents had a mean heterozygosity of 6.98 × 10 −4 . To determine whether the difference was significant, we bootstrapped the low coverage data as follows: for 0 ≤ n ≤ 30, we subsampled haplotypes 4n, 4n + 1, 4n + 2, and 4n + 3 from the low coverage YRI data and determined their shared IBS tract spectrum in the same way that was done with the four trio haplotypes. These bootstrapped low coverage data sets had mean heterozygosity 8.23 × 10 −4 with standard deviation 1.81 × 10 −5 , making the trio heterozygosity significantly lower. When we bootstrapped the CEU sequences in the same way, the subsample heterozygosities had mean 6.84 × 10 −4 and standard deviation 2.3 × 10 −5 . In contract, the CEU trio parents had mean heterozygosity 5.50 × 10 −4 . In both populations, the low coverage data had an excess heterozygosity between 1.25 × 10 −4 and 1.35 × 10 −4 , probably due to sequencing errors. The mean heterozygosity between CEU and YRI was also higher in the low coverage data, at 9.33 × 10 −4 compared to 8.05 × 10 −4 in the trio data.
An error rate of 10 −4 per base pair would destroy most 10 5 -and 10 6 -base IBS tracts if the errors were evenly Poisson-distributed throughout the low coverage sequences. However, the situation is somewhat better because of the imputation that was used to generate the low coverage 1000 Genomes sequences, preferentially calling haplotypes that are IBS with one another in regions where both appear IBD with one of the HapMap references. For this reason, as well as the empirical abundance of long IBS tracts in the low coverage data, we expect true IBS tracts in the low coverage alignments to be broken up by errors at some rate IBS 10 −4 . Let f YRI trio (L) (resp. f YRI lc ) be the frequency of differences between two high coverage YRI samples (resp. two low coverage YRI samples) that are followed by exactly L bases of IBS. If IBS tracts are accurately observed in the trio data but broken up by errors at rate YRI IBS in the low coverage data, then we should expect that f YRI lc (L) ≈ f YRI trio (L)e −L YRI IBS . In this way, the data are consistent with an error rate of YRI IBS = 5 × 10 −6 , with the function f YRI trio (L)e −L YRI IBS falling within the realm of variation of f YRI lc (L) in the bootstrapped low coverage datasets. The frequencies of long IBS tracts in the trio data are consistent with an identical error rate of CEU IBS = 5 × 10 −6 (see Figure S4).
To estimate parameters efficiently, we used a two-step procedure. First, we estimated N 2 , N 3 , and t ancient by fitting a simple bottleneck history to the IBS sharing with the YRI. Next, we fixed N 2 , N 3 , and t ancient and estimated the rest of the parameters by jointly maximizing the likelihood of all three informative spectra: within the YRI, within the CEU and between YRI and CEU.

Assessing uncertainty via simulation
We simulated 30 replicate datasets under the human evolutionary model described in section ?? with the maximum likelihood parameters inferred from the trio data. We then estimated parameters from each replicate dataset, using parametric bootstrapping to gauge our accuracy at inferring a complex history. Figure S10 illustrates the differences between the parameters inferred from the trio data and the mean estimates obtained from replicate simulations. Figures S7, S8  . We therefore evaluated ∂a∂i's ability to infer the parameters of the simple admixture history considered in the main text.
In Figure 3, we compare composite likelihood surfaces generated by ∂a∂i from the joint allele frequency spectrum and by our method from the joint IBS tract spectrum. Each point in the ∂a∂i likelihood surface is generated by fixing τ a and τ s and then optimizing f (∂a∂i deterministically optimizes the size N ). Similarly, each point in the IBS tract likelihood surface is generated by fixing τ a and τ s and jointly optimizing f and N . These likelihood surfaces show that both methods allow for accurate grid search estimation of demographic parameters. However, the ∂a∂i numerical optimization routines usually fail to arrive at a good estimate starting from a random point in demographic parameter space. For the τ a = 0.01 history, the optimal parameters located by grid search have a Poisson log likelihood greater than −5, 000, the best parameters obtained from 20 random Nelder-Mead optimizations have Poisson log likelihood −8, 329. Nelder-Mead optimization was chosen for this comparison because it is the routine recommended in the ∂a∂i manual for optimization starting far from the true optimum. In contrast, if we sample Θ uniformly at random from a bounded range and maximize the likelihood of observed IBS tracts, the optimization consistently terminates very close to the global maximum (see Supplementary Table 1). Both the SFS likelihood surface and the IBS tract likelihood surface allow for parameter estimation by grid search, but the two likelihood surfaces have different shapes that suggest complementary demographic sensitivities. The SFS likelihood is more sensitive to variation in divergence time than to changes in admixture time, while the IBS tract likelihood is more sensitive to variation in the time of last gene flow.

The NIEHS site frequency spectrum
Although the Gutenkunst, et al. and Gravel, et al. histories fit human site frequency spectra well, Figure 5 illustrates that they do not predict the right spectrum of shared IBS tracts. Similarly, there is no guarantee that our inferred demographic history should predict the right CEU-YRI site frequency spectrum. To test this, we used MS to simulate a 20-by-20 site frequency spectrum under our demographic model and compared it to the National Institute of Environmental Health Science (NIEHS) frequency spectrum that was analyzed in [?] and generously made available by Ryan Gutenkunst. The simulated SFS had an FST of 0.205, significantly larger than the FST value of 0.158 that was computed from the NIEHS frequency spectrum data (see Figure S11). We discuss possible reasons for the discrepancy in the main text.