Abstract
We learn about population history and underlying evolutionary biology through patterns of genetic polymorphism. Many approaches to reconstruct evolutionary histories focus on a limited number of informative statistics describing distributions of allele frequencies or patterns of linkage disequilibrium. We show that many commonly used statistics are part of a broad family of two-locus moments whose expectation can be computed jointly and rapidly under a wide range of scenarios, including complex multi-population demographies with continuous migration and admixture events. A full inspection of these statistics reveals that widely used models of human history fail to predict simple patterns of linkage disequilibrium. To jointly capture the information contained in classical and novel statistics, we implemented a tractable likelihood-based inference framework for demographic history. Using this approach, we show that human evolutionary models that include archaic admixture in Africa, Asia, and Europe provide a much better description of patterns of genetic diversity across the human genome. We estimate that an unidentified, deeply diverged population admixed with modern humans within Africa both before and after the split of African and Eurasian populations, contributing 4–8% genetic ancestry to individuals in world-wide populations.
Author summary
Throughout human history, populations have expanded and contracted, split and merged, and exchanged migrants. Because these events affected genetic diversity, we can learn about human history by comparing predictions from evolutionary models to genetic data. Here, we show how to rapidly compute such predictions for a wide range of diversity measures within and across populations under complex demographic scenarios. While widely used models of human history accurately predict common measures of diversity, we show that they strongly underestimate the co-occurrence of low-frequency mutations within human populations in Asia, Europe, and Africa. Models allowing for archaic admixture, the relatively recent mixing of human populations with deeply diverged human lineages, resolve this discrepancy. We use such models to infer demographic models that include both recent and ancient features of human history. We recover the well-characterized admixture of Neanderthals in Eurasian populations, as well as admixture from an as-yet unknown diverged human population within Africa, further suggesting that admixture with deeply diverged lineages occurred multiple times in human history. By simultaneously testing model predictions for a broad range of diversity statistics, we can assess the robustness of common evolutionary models, identify missing historical events, and build more informed models of human demography.
Citation: Ragsdale AP, Gravel S (2019) Models of archaic admixture and recent history from two-locus statistics. PLoS Genet 15(6): e1008204. https://doi.org/10.1371/journal.pgen.1008204
Editor: Joshua M. Akey, Princeton University, UNITED STATES
Received: January 8, 2019; Accepted: May 17, 2019; Published: June 10, 2019
Copyright: © 2019 Ragsdale, Gravel. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The only data used in this study are publicly available through the Thousand Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
Funding: SG received funding from the Canada Research Chairs program, the Natural Sciences and Engineering Research Council of Canada (NSERC) discovery grant, and Canadian Institutes of Health Research (CIHR) MOP-136855 (http://www.chairs-chaires.gc.ca/home-accueil-eng.aspx, http://www.nserc-crsng.gc.ca/ResearchPortal-PortailDeRecherche/Instructions-Instructions/DG-SD_eng.asp, http://www.cihr-irsc.gc.ca/e/193.html). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The study of genetic diversity in human populations has shed light on the origins of our species and our spread across the globe. With the growing abundance of sequencing data from contemporary and ancient humans, coupled with archaeological evidence and detailed models of human demography, we continue to refine our understanding of our intricate history. Accurate demographic models also serve as a statistical foundation for the identification of loci under natural selection and the design of biomedical and association studies.
Whole-genome sequencing data are high dimensional and noisy. In order to make inferences of history and biology, we rely on summary statistics of variation across the entire genome and in many sequenced individuals. One such statistic that is commonly used for demographic inference is the distribution of SNP allele frequencies in one or more populations, called the sample or allele frequency spectrum (AFS) [1–4]. AFS-based inference has proven to be a powerful inference approach, yet it assumes independence between SNPs and therefore ignores information contained in correlations between neighboring linked loci, which is also referred to as linkage disequilibrium (LD).
Measures of LD are also informative about demographic history, mutation, recombination, and selection. A separate class of inference methods leverages observed LD across the genome to infer local recombination rates [5–8] and demographic history [9–12].
While two-locus statistics have been extensively studied [13–22], most of this work focused on a single population at equilibrium demography, precluding their application to realistic demographic scenarios. Recently, approaches for computing the full two-locus sampling distribution for a single population with non-equilibrium demography were developed via the coalescent [8] or a numerical solution to the diffusion approximation [23], allowing for more robust inference of fine-scale recombination rates and single population demographic history. However, there remain significant limitations. Computing the full two-locus haplotype frequency spectrum is computationally expensive, hindering its application to inference problems that require a large number of function evaluations. Alternatively, computationally efficient low-order equations for specific LD statistics have been proposed [12, 14], but these have seen limited application and only to single populations.
In this article, we show that the moment system of Hill and Robertson [14] can be expanded to compute a large family of one- and two-locus statistics with flexible recombination, population size history, and mutation models. Additionally, we show that the system can be extended to multiple populations with continuous migration and discrete admixture events, and that low order statistics can be accurately and efficiently computed for tens of populations with complex demography.
We use this moment system together with likelihood-based optimization to infer multi-population demographic histories. We reexamine how well widely used models of human demographic history recover observed patterns of polymorphism, and find that these models underestimate LD among low-frequency variants in each population, sometimes by a large amount. The inclusion of admixture from deeply diverged lineages in both Eurasian and African populations resolves these differences, and we infer that an archaic lineage contributed ∼6–8% genetic ancestry in two populations in Africa. By jointly modeling a wide range of summary statistics across human populations, we can reveal important aspects of our history that are hidden from traditional analyses using individual statistics.
Models and methods
To compute a large set of summary statistics for genetic data, we use mathematical properties of the Wright-Fisher model that are related to the look-down model of Donnelly and Kurtz [24, 25]. To illustrate this process, we first build intuition through familiar equations from population genetics and then explain how these fit within a larger hierarchy of tractable models.
In this section, we therefore begin with evolution equations for heterozygosity and the frequency spectrum, then turn to recursions for low-order LD statistics and show that the classical Hill-Robertson [14] system for $D^2$ can be extended to arbitrary moments of D, multiple populations, and even the full sampling distribution of two-locus haplotypes. Mathematical details and expanded discussion for each result are given in S1 Appendix. Throughout this article, we assume that human populations can be described, approximately, by a finite number of randomly mating populations. We also assume an infinite-sites model in the main text and describe a reversible mutation model in Appendix S1.1.3.
Motivation: Single site statistics and the allele frequency spectrum
The most basic measure of diversity is expected heterozygosity $\mathbb{E}[H]$, or the expected number of differences between two haploid copies of the genome. Given $\mathbb{E}[H]_t$ at time t, population size N(t), and mutation rate u, Wright [26] showed that enumerating all distinct ways to choose parents among two lineages leads to a recursion for $\mathbb{E}[H]$,
$$\mathbb{E}[H]_{t+1} = \left(1 - \frac{1}{2N(t)}\right)\mathbb{E}[H]_t + 2u. \tag{1}$$
To leading order in 1/N and u, two copies of the genome are different if their parents were distinct (which has probability $1 - \frac{1}{2N(t)}$) and carried different alleles (which has probability $\mathbb{E}[H]_t$), or if there was a mutation along one of two lineages (which has probability 2u).
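As a minimal sketch of the heterozygosity recursion described above, the following iterates the leading-order update once per generation. The function name and the parameter values (N = 1000, u = 10^-6) are illustrative assumptions, not taken from the article. Under constant size the recursion has the fixed point 4Nu, which the example approaches.

```python
def iterate_heterozygosity(h0, pop_sizes, u):
    """Apply H_{t+1} = (1 - 1/(2 N_t)) H_t + 2u once per entry of pop_sizes.

    h0        : starting expected heterozygosity
    pop_sizes : sequence of population sizes N(t), one per generation
    u         : per-generation mutation rate
    Returns the full trajectory, including the starting value.
    """
    h = h0
    trajectory = [h]
    for n_t in pop_sizes:
        h = (1.0 - 1.0 / (2.0 * n_t)) * h + 2.0 * u
        trajectory.append(h)
    return trajectory


# Illustrative (hypothetical) parameter values.
N, u = 1000, 1e-6
traj = iterate_heterozygosity(0.0, [N] * 20000, u)

# Fixed point of the constant-size recursion: H = 4 N u.
equilibrium = 4 * N * u
```

Because `pop_sizes` is an arbitrary sequence, the same loop handles size changes by passing a non-constant list.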
Heterozygosity is a low-order statistic: we require only two copies of the genome to estimate $\mathbb{E}[H]$ genome-wide. More samples provide additional information that can be encoded in the sample AFS $\Phi_n$, the distribution of allele counts within a sample of size n. Specifically, $\Phi_n(i)$ is the number (or proportion) of loci where the derived allele is observed in i copies out of n samples.
A standard forward approach to compute $\Phi_n$ involves numerically solving the partial differential equation for the distribution of allele frequencies in the full population and then sampling from this distribution for the given sample size n (e.g. Gutenkunst et al. [2]). By enumerating mutation events and parental copying probabilities in a sample of size n, Jouganous et al. [3] showed that Eq 1 can be generalized to a recursion for $\{\Phi_n(i)\}_{i=0,\ldots,n}$ (Fig 1). $\mathbb{E}[H]$ can be seen as a special case equal to $\Phi_2(1)$, the i = 1 bin in the size n = 2 frequency spectrum. These recursions can also be derived as moment equations for the diffusion approximation [3, 27, 28].
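To make the link between the sample AFS and heterozygosity concrete, the sketch below tabulates the size-n spectrum from per-locus derived-allele counts and then projects it down to the i = 1 bin of the size-2 spectrum. The function names and the toy counts are hypothetical; the projection uses the fact that two copies drawn without replacement from n differ at a locus with i derived copies with probability i(n − i) divided by the number of pairs.

```python
def sample_afs(derived_counts, n):
    """Return Phi_n as a list of length n + 1.

    afs[i] is the number of loci at which the derived allele
    is observed in i of the n sampled copies.
    """
    afs = [0] * (n + 1)
    for i in derived_counts:
        afs[i] += 1
    return afs


def project_to_pairs(afs):
    """Project Phi_n down to Phi_2(1).

    For each locus with i derived copies out of n, two copies drawn
    without replacement differ with probability i (n - i) / C(n, 2).
    """
    n = len(afs) - 1
    pairs = n * (n - 1) / 2
    return sum(count * i * (n - i) / pairs for i, count in enumerate(afs))


# Hypothetical derived-allele counts at six loci in a sample of n = 4 copies.
counts = [1, 1, 2, 3, 1, 4]
phi4 = sample_afs(counts, 4)
phi2_1 = project_to_pairs(phi4)
```

Dividing `phi2_1` by the number of sequenced sites would give a per-site heterozygosity estimate.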
Expected statistics under neutral Wright-Fisher evolution depend on equal or lower-order statistics in the previous generation, allowing for a hierarchy of closed recursion equations. Left: single-site statistics are represented as the entries in the size-n AFS, $\Phi_n$, and depend only on same-order statistics. Right: the corresponding two-locus statistics, including the Hill-Robertson system for $\mathbb{E}[D^2]$, rely on statistics of the same or lower order. Closed recursions can be found for any given $\mathbb{E}[D^m]$, leading to a sparse, linear system of ODEs. We denote $\pi_2 = p(1-p)q(1-q)$, $z = (1-2p)(1-2q)$, and $\sigma_i = p^i(1-p)^i + q^i(1-q)^i$. Arrows indicate dependence of moments and highlighted moments indicate classical recursions. Odd-order moments are shown in Fig A1. Here we are particularly interested in such closed recursions in multi-population settings.
Two-locus statistics
We will use this same intuition for the two-locus theory. First consider the model for two loci that each permit two alleles: alleles A/a at the left locus, and B/b at the right. There are four possible two-locus haplotypes, AB, Ab, aB, and ab, whose frequencies sum to 1 in the population. LD between two loci is measured as the covariance of their allele frequencies:
$$D = \operatorname{Cov}(A, B) = f_{AB} - f_A f_B = f_{AB} f_{ab} - f_{Ab} f_{aB}.$$
Therefore D can also be interpreted as the probability of drawing two lineages from the population and observing one lineage of type AB and the other of type ab, minus the probability of observing the two cross types Ab and aB. As such, $\mathbb{E}[D]$ is a two-haplotype statistic, meaning we require just two haploid copies of the genome (or a single phased diploid genome) to estimate genome-wide $\mathbb{E}[D]$, in the same way that the expected heterozygosity $\mathbb{E}[H]$ is a two-sample statistic of single-site variation.
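The two equivalent forms of D can be checked numerically. The following sketch computes D from four haplotype frequencies both as a covariance and as the difference between coupling and repulsion products, and also evaluates the per-pair terms D², Dz, and π2 that appear in the Hill-Robertson system. Function names and the example frequencies are hypothetical.

```python
def ld_from_haplotypes(f_AB, f_Ab, f_aB, f_ab):
    """Return (p, q, D as covariance, D as coupling minus repulsion).

    p is the frequency of allele A, q the frequency of allele B.
    The four haplotype frequencies are assumed to sum to 1.
    """
    p = f_AB + f_Ab
    q = f_AB + f_aB
    d_cov = f_AB - p * q
    d_det = f_AB * f_ab - f_Ab * f_aB
    return p, q, d_cov, d_det


def hill_robertson_terms(p, q, d):
    """Per-pair values of D^2, Dz, and pi_2 for given p, q, and D."""
    z = (1 - 2 * p) * (1 - 2 * q)
    pi2 = p * (1 - p) * q * (1 - q)
    return d * d, d * z, pi2


# Hypothetical haplotype frequencies for one pair of loci.
p, q, d_cov, d_det = ld_from_haplotypes(0.4, 0.1, 0.2, 0.3)
d2, dz, pi2 = hill_robertson_terms(p, q, d_cov)
```

Genome-wide statistics are averages of such per-pair terms; unbiased estimation from finite samples requires corrections that this sketch does not attempt.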
Moment equations for D and D².
Enumerating possible copying, recombination, and mutation events for two lineages also leads to a well-known recursion for $\mathbb{E}[D]$ [13]. The possibility of sharing a common parent from the previous generation leads to the same decay familiar from Eq 1. $\mathbb{E}[D]$ also decays due to recombination, at a rate proportional to the probability r of a recombination event between the two loci in a given generation. Throughout, we assume r ≪ 1, so that higher-order terms may be ignored. For loosely linked or unlinked loci (r = 1/2), higher-order terms must be considered [14].
To leading order in r, u, and 1/N, we have
$$\mathbb{E}[D]_{t+1} = \left(1 - \frac{1}{2N(t)} - r\right)\mathbb{E}[D]_t. \tag{2}$$
Mutation does not contribute to $\mathbb{E}[D]$ because any mutation event is equally likely to contribute positively or negatively to the statistic. As a result, D is expected to be zero across the genome.
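However, the second moment $\mathbb{E}[D^2]$ is positive. Hill and Robertson [14] found a recursion for a triplet of statistics including $\mathbb{E}[D^2]$, which we write as
$$\mathbf{y} = \begin{pmatrix} \mathbb{E}[D^2] \\ \mathbb{E}[Dz] \\ \mathbb{E}[\pi_2] \end{pmatrix}, \qquad z = (1-2p)(1-2q), \quad \pi_2 = p(1-p)q(1-q),$$
where p is the allele frequency of A, and q is the allele frequency of B. The recursion is
$$\mathbf{y}_{t+1} = \mathcal{D}\,\mathcal{R}\,\mathbf{y}_t, \tag{3}$$
where $\mathcal{D}$ and $\mathcal{R}$ are matrix operators for drift and recombination, respectively. To leading order in 1/N and r, these take the form
$$\mathcal{D} = \begin{pmatrix} 1 - \frac{3}{2N} & \frac{1}{2N} & \frac{1}{2N} \\ \frac{4}{2N} & 1 - \frac{5}{2N} & 0 \\ 0 & \frac{1}{2N} & 1 - \frac{2}{2N} \end{pmatrix} \quad \text{and} \quad \mathcal{R} = \begin{pmatrix} 1 - 2r & 0 & 0 \\ 0 & 1 - r & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$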
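A minimal sketch of one generation of the Hill-Robertson recursion follows, assuming the classical leading-order drift and recombination coefficients for the basis (D², Dz, π2); the function names and parameter values are illustrative. As a sanity check, the vector proportional to (10 + ρ, 8, 22 + 13ρ + ρ²), with ρ = 4Nr, leaves the first two components unchanged to leading order, and its ratio of first to third components is the classical equilibrium value of E[D²]/E[π2]. The third component decays here because the step includes no mutational input.

```python
def drift_matrix(N):
    """Leading-order drift operator acting on (E[D^2], E[Dz], E[pi_2])."""
    c = 1.0 / (2.0 * N)
    return [
        [1 - 3 * c, c, c],
        [4 * c, 1 - 5 * c, 0.0],
        [0.0, c, 1 - 2 * c],
    ]


def recombination_matrix(r):
    """Leading-order recombination operator (diagonal)."""
    return [
        [1 - 2 * r, 0.0, 0.0],
        [0.0, 1 - r, 0.0],
        [0.0, 0.0, 1.0],
    ]


def matvec(m, v):
    """Plain matrix-vector product for small dense matrices."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def hill_robertson_step(y, N, r):
    """One generation: recombination followed by drift."""
    return matvec(drift_matrix(N), matvec(recombination_matrix(r), y))


# Illustrative (hypothetical) parameters.
N, r = 10_000, 1e-4
rho = 4 * N * r

# Direction along which D^2 and Dz are stationary to leading order.
y_ss = [10 + rho, 8.0, 22 + 13 * rho + rho ** 2]
y_next = hill_robertson_step(y_ss, N, r)
sigma_d2 = y_ss[0] / y_ss[2]
```

Applying drift before recombination instead changes the result only at second order in 1/N and r.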
The three statistics in the Hill-Robertson system have a natural interpretation. $\mathbb{E}[D^2]$ is the variance of D and has received considerable attention over the years. The second statistic, $\mathbb{E}[Dz]$, includes a term z = (1 − 2p)(1 − 2q) whose magnitude is largest when there are rare alleles at both loci, and which is positive when p and q both correspond to the minor allele (or both to the major allele). Thus $\mathbb{E}[Dz]$ measures positive covariance among low-frequency variants. Fig 2A–2C shows that the decays of $\mathbb{E}[D^2]$ and $\mathbb{E}[Dz]$ are sensitive to demographic history. Fig A2 in S1 Appendix shows how the bulk of the Dz statistic is contributed by pairs of variants where the rarest allele has frequency between 2 and 20%, while common variants comprise the bulk of $D^2$.
Demographic histories shown in (A) affect statistics in the Hill-Robertson system and their dependence on recombination distance (B-C). Both the amplitude and shape of the LD curves differ between demographic models for (B) $\mathbb{E}[D^2]$ and (C) $\mathbb{E}[Dz]$. (D) To illustrate the effect of admixture on LD curves, we consider two populations in isolation for 2N generations, followed by an admixture event where the focal population receives 1% of lineages from the diverged population. (E) $\mathbb{E}[D^2]$ curves are largely unaffected by this low level of admixture. (F) However, $\mathbb{E}[Dz]$ is immediately and strongly elevated following admixture, and remains significantly elevated for a prolonged time T (in units of 2N generations) since the admixture event.
$\mathbb{E}[\pi_2]$ is the joint heterozygosity across pairs of SNPs. If we sample four haplotypes from the population, this is proportional to the probability that the first pair differ at the left locus, and the second pair differ at the right locus.
The applications in this article focus on generalizing the Hill-Robertson equations to multi-population settings. However, we first outline generalizations to high-order moments and non-neutral evolution, leaving theoretical developments and simulations to the Appendix.
Generalizing to higher moments of D.
The existence of tractable higher-order moment equations for one-locus statistics [3] suggests the existence of a similar high-order system for two-locus statistics. Higher moments of D provide additional information about the distribution of two-locus haplotypes. Appendix S1.1 shows that the Hill-Robertson system can be extended to compute any moment of D, and presents recursions for systems of arbitrary order $D^m$ that close under drift, recombination, and mutation.
This family of recursion equations takes a form similar to the $D^2$ system: the evolution of $\mathbb{E}[D^m]$ requires $\mathbb{E}[D^{m-1}z]$ and $\mathbb{E}[D^{m-2}\pi_2]$, with each of those terms depending on additional terms of the same order and smaller orders (Fig 1). For any order m, Appendix S1.4.1 shows the system closes and forms a hierarchy of moment equations, in that the $D^m$ recursion contains the $D^{m-2}$ system, which itself contains the $D^{m-4}$ system, and so on (Fig A1 in S1 Appendix). Just as the Wright equation for heterozygosity generalizes naturally to equations for the more informative distribution of allele frequency [3], the Hill and Robertson equations for $\mathbb{E}[D]$ and $\mathbb{E}[D^2]$ generalize to informative higher-order LD statistics.
Generalizing to arbitrary two-locus haplotype distribution.
Given the analogy between the frequency spectrum and the Hill-Robertson equations, it is natural to study the connection between the moment equations for $\mathbb{E}[D^m]$ and the evolution of the two-locus haplotype frequency distribution $\Psi_n(f_{AB}, f_{Ab}, f_{aB}, f_{ab})$.
While classical approaches for computing $\Psi_n$ [18, 20] were limited to neutrality and steady-state demography, recent coalescent and diffusion developments allow for $\Psi_n$ to be computed under non-equilibrium demography and selection [8, 23]. These approaches are computationally expensive and limited to one population, as $\Psi_n$ has size $\mathcal{O}(n^3)$, and the P-population distribution grows asymptotically as $n^{3P}$.
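The cubic growth can be made concrete by counting entries. A sample of n haplotypes is split among the four types AB, Ab, aB, and ab, so the number of configurations is the number of ways to write n as an ordered sum of four non-negative integers. The sketch below assumes, for illustration only, the same sample size n in each of P populations; the function name is hypothetical.

```python
from math import comb


def n_two_locus_configs(n, n_pops=1):
    """Number of two-locus sample configurations.

    For one population this is the number of ways to split n sampled
    haplotypes among the types AB, Ab, aB, ab, which is C(n + 3, 3)
    and grows as n^3 / 6. With n_pops populations, each sampled at
    size n, the joint count is that number raised to the power n_pops.
    """
    return comb(n + 3, 3) ** n_pops


single_pop = n_two_locus_configs(30)
two_pops = n_two_locus_configs(30, n_pops=2)
```

Even at n = 30 the joint two-population table has tens of millions of entries, which illustrates why low-order moments are attractive for multi-population models.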
Generalizing the approach of Jouganous et al. [3], we can write a recursion equation on the entries of $\Psi_n$ under drift, mutation, recombination, and selection at one or both loci (Appendix S1.3). As expected, this recursion does not close under selection: to find $\Psi_n$ at time t + 1, we require $\Psi_{n+1}$ and $\Psi_{n+2}$ at time t. It also does not close under recombination, requiring a closure approximation. Using the same closure strategy for selection and recombination, however, we can approximate the entries of $\Psi_{n+1}$ and $\Psi_{n+2}$ as linear combinations of entries in $\Psi_n$ and obtain a closed equation. This approach provides an accurate approximation for moderate n under recombination and selection (Appendix S1.3.5) and represents a 10- to 100-fold speedup over the numerical PDE implementation in [23] (Table A1 in S1 Appendix). However, closure is inaccurate for small n.
By contrast to the full two-locus model, equations for moments of D close under recombination because the symmetric combination of haplotype frequencies that define D ensures the cancellation of higher-order terms (Appendix S1.1.2). This makes the moments of D particularly suitable for rapid computation of low-order statistics over a large number of populations.
The Hill-Robertson system does not close, however, if one or both loci are under selection. Appendix S1.1.4 considers a model where one of the two loci is under additive selection. We derive recursion equations for terms in the system and describe the moment hierarchy and a closure approximation, though we leave its development to future work. In the following we focus on neutral evolution.
Multiple populations
While a large body of work exists for computing expected LD in a single population, little progress has been made toward extending these models to multiple populations. Forward equations for the full two-locus sampling distribution become computationally intractable beyond just a single population, even with the moment-based approach described above. Here, we extend the Hill-Robertson system to any number of populations, allowing for population splits, admixture, and continuous migration.
Motivation: Heterozygosity across populations.
To motivate our derivation of the multi-population Hill-Robertson system and provide intuition, we begin with a model for heterozygosity across populations with migration. With two populations we consider the cross-population heterozygosity, $\mathbb{E}[H_{12}] = \mathbb{E}[p_1(1 - p_2) + p_2(1 - p_1)]$, where $p_i$ and $q_i$ are allele frequencies at the left and right loci, respectively, in population i. This is the probability that two lineages, one drawn from each population, differ by state. At the time of split between populations 1 and 2, $\mathbb{E}[H_{12}] = \mathbb{E}[H_1] = \mathbb{E}[H_2]$. Because coalescence between lineages in different populations is unlikely, $\mathbb{E}[H_{12}]$ is not directly affected by drift. In the absence of migration and under the infinite-sites assumption used here, this statistic increases linearly with the mutation rate over time (Fig A3 in S1 Appendix).
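With migration, the evolution of $\mathbb{E}[H_{12}]$ also depends on $\mathbb{E}[H_1]$ and $\mathbb{E}[H_2]$. We define the migration rate $m_{12}$ to be the probability that a lineage in population 2 has its parent in population 1. Assuming $m_{ij} \ll 1$, the probability that both lineages in $\mathbb{E}[H_{12}]$ come from population 1 is $m_{12}$ (to leading order), in which case $\mathbb{E}[H_{12}]$ is equal to $\mathbb{E}[H_1]$, and the probability that both come from population 2 is $m_{21}$. Then to leading order in $m_{ij}$, we have
$$\mathbb{E}[H_{12}]_{t+1} = (1 - m_{12} - m_{21})\,\mathbb{E}[H_{12}]_t + m_{12}\,\mathbb{E}[H_1]_t + m_{21}\,\mathbb{E}[H_2]_t + 2u.$$
Similar intuition leads to recursions for $\mathbb{E}[H_1]$ and $\mathbb{E}[H_2]$ under migration, and this system easily extends to more than two populations.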
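The two-population heterozygosity system can be sketched as a single update function. The within-population updates below are not given in the text; they are an assumption obtained from the same leading-order argument (each of two lineages sampled in population 1 has its parent in population 2 with probability m21, and the pair otherwise coalesces with probability 1/(2N1)). Function names and parameter values are hypothetical.

```python
def step_heterozygosity(h1, h2, h12, N1, N2, m12, m21, u):
    """One generation of the leading-order two-population system.

    m12: probability that a lineage in population 2 has its parent
         in population 1.
    m21: probability that a lineage in population 1 has its parent
         in population 2.
    The h1 and h2 updates are an assumed leading-order form.
    """
    new_h12 = (1 - m12 - m21) * h12 + m12 * h1 + m21 * h2 + 2 * u
    new_h1 = (1 - 1 / (2 * N1) - 2 * m21) * h1 + 2 * m21 * h12 + 2 * u
    new_h2 = (1 - 1 / (2 * N2) - 2 * m12) * h2 + 2 * m12 * h12 + 2 * u
    return new_h1, new_h2, new_h12


# Isolation: H12 grows linearly at rate 2u per generation.
u_iso = 1e-8
h1, h2, h12_iso = 0.001, 0.001, 0.001
for _ in range(1000):
    h1, h2, h12_iso = step_heterozygosity(h1, h2, h12_iso, 1000, 1000, 0.0, 0.0, u_iso)

# Symmetric migration: iterate to the stationary state.
N, m, u = 1000, 1e-3, 1e-6
g1 = g2 = g12 = 0.0
for _ in range(100_000):
    g1, g2, g12 = step_heterozygosity(g1, g2, g12, N, N, m, m, u)
```

In the symmetric case the fixed point of these equations is H1 = H2 = 8Nu and H12 = 8Nu + u/m, so the excess of cross-population over within-population heterozygosity shrinks as migration increases.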
The Hill-Robertson system with migration.
We take the same approach to determine transition probabilities in the multi-population Hill-Robertson system. Suppose that at some time, a population splits into two populations. At the time of the split, expected two-locus statistics ($D^2$, Dz, $\pi_2$) in each population are each equal to those in the parental population at the time of split (Appendix S1.2.1). Additionally, the covariance of D between the two populations, $\mathbb{E}[D_1 D_2]$, is initially equal to $\mathbb{E}[D^2]$ in the parental population. In the absence of migration, Hill-Robertson statistics in each population evolve according to Eq 3, and
$$\mathbb{E}[D_1 D_2]_{t+1} = (1 - 2r)\,\mathbb{E}[D_1 D_2]_t. \tag{4}$$
With migration, additional moments are needed to obtain a closed system. These additional terms take the same general form as the original terms in the Hill-Robertson system, but include cross-population statistics, analogous to $\mathbb{E}[H_{12}]$ in the heterozygosity model with migration. Again using y to denote bases of Hill-Robertson moments, this basis is
$$\mathbf{y} = \left(\mathbb{E}[D_i D_j],\ \mathbb{E}[D_i z_{j,k}],\ \mathbb{E}[\pi_2(i,j;k,l)]\right)_{1 \le i,j,k,l \le P}, \tag{5}$$
where P is the number of populations, and we slightly abuse notation so that $D_i D_j$ stands in for all index permutations ($D_1^2$, $D_2^2$, and $D_1 D_2$ in the two-population case). We derive transition probabilities under continuous migration in Appendix S1.2.2, leading to the closed recursion
$$\mathbf{y}_{t+1} = \mathcal{D}\,\mathcal{M}\,\mathcal{R}\,\mathcal{U}\,\mathbf{y}_t, \tag{6}$$
where $\mathcal{D}$, $\mathcal{M}$, $\mathcal{R}$, and $\mathcal{U}$ are sparse matrices for drift, migration, recombination, and mutation that depend on the number of populations, population sizes N(t), and migration rates m.
Admixture.
Patterns of LD are sensitive to migration and admixture events, and low-order LD statistics are commonly used to infer the parameters of admixture events [10, 29]. A well-known result (e.g., example 2.7 in [30]) is that D in an admixed population can be nonzero even when D is zero in both parental populations if allele frequencies differ between the two parental populations. This is seen by enumerating all possible combinations of haplotype sampling when a fraction f of lineages were contributed by population 1, and 1 − f by population 2 (Appendix S1.2.3). More generally, immediately following the admixture event, the expectation of D in the admixed population is
$$\mathbb{E}[D_{\mathrm{adm}}] = f\,\mathbb{E}[D_1] + (1 - f)\,\mathbb{E}[D_2] + f(1 - f)\,\mathbb{E}[\delta], \tag{7}$$
where $\delta = (p_1 - p_2)(q_1 - q_2)$ [31].
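To integrate the multi-population $D^2$ system after an admixture event, we require $\mathbb{E}[D_{\mathrm{adm}}^2]$ and other second-order terms in the basis (5) involving the admixed population. Using the same enumeration approach as for Eq 7, the expectation immediately following the admixture event is
$$\mathbb{E}[D_{\mathrm{adm}}^2] = f^2\,\mathbb{E}[D_1^2] + (1-f)^2\,\mathbb{E}[D_2^2] + 2f(1-f)\,\mathbb{E}[D_1 D_2] + 2f^2(1-f)\,\mathbb{E}[D_1\delta] + 2f(1-f)^2\,\mathbb{E}[D_2\delta] + f^2(1-f)^2\,\mathbb{E}[\delta^2]. \tag{8}$$
Each other required term can be found in a similar manner (Appendix S1.2.3). In this way, the set of moments may be expanded to include the admixed population and integrated forward in time using Eq 6.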
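The admixture result for D is an exact algebraic identity for given parental haplotype frequencies, which the following sketch verifies by mixing two haplotype frequency vectors directly and comparing with the closed form f·D1 + (1 − f)·D2 + f(1 − f)·δ. All names and the example frequencies are hypothetical. Both parental populations in the example are in linkage equilibrium, so all LD in the admixed population comes from the δ term.

```python
def mix(h1, h2, f):
    """Haplotype frequencies after a fraction f comes from population 1."""
    return tuple(f * a + (1 - f) * b for a, b in zip(h1, h2))


def ld(h):
    """D for haplotype frequencies ordered (f_AB, f_Ab, f_aB, f_ab)."""
    return h[0] * h[3] - h[1] * h[2]


def allele_freqs(h):
    """Return (p, q): frequencies of A and of B."""
    return h[0] + h[1], h[0] + h[2]


def admixed_ld_formula(h1, h2, f):
    """Closed form: f D1 + (1 - f) D2 + f (1 - f) delta."""
    p1, q1 = allele_freqs(h1)
    p2, q2 = allele_freqs(h2)
    delta = (p1 - p2) * (q1 - q2)
    return f * ld(h1) + (1 - f) * ld(h2) + f * (1 - f) * delta


# Two parental populations in linkage equilibrium with different
# allele frequencies (hypothetical values), mixed in equal proportions.
pop1 = (0.81, 0.09, 0.09, 0.01)   # p = q = 0.9, D = 0
pop2 = (0.01, 0.09, 0.09, 0.81)   # p = q = 0.1, D = 0
d_direct = ld(mix(pop1, pop2, 0.5))
d_formula = admixed_ld_formula(pop1, pop2, 0.5)
```

Squaring the per-pair expression before taking expectations gives the six terms needed for the second moment in the admixed population.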
Numerical implementation
We rescale time by $2N_{\mathrm{ref}}$ generations ($N_{\mathrm{ref}}$ is an arbitrary reference population size, often the ancestral population size), so that the recursion can be approximated as a differential equation
$$\frac{d\mathbf{y}}{dt} = \left[\mathcal{D}(\boldsymbol{\nu}) + \mathcal{M}(M) + \mathcal{R}(\rho) + \mathcal{U}(\theta)\right]\mathbf{y}, \tag{9}$$
where the operators now act as rate matrices in rescaled time, ν are the relative population sizes at time t ($\nu_i(t) = N_i(t)/N_{\mathrm{ref}}$), $M_{ij} = 2N_{\mathrm{ref}} m_{ij}$ are the population size-scaled migration rates, $\rho = 4N_{\mathrm{ref}} r$, and $\theta = 4N_{\mathrm{ref}} u$. Each matrix is sparse, and this equation can be solved efficiently using a standard Crank-Nicolson integration scheme. Our implementation allows users to define general models with standard demographic events (migrations, splits and mergers, size changes, etc.) similar to the interface familiar to ∂a∂i and moments [2, 3]. A single evaluation of the four-population model shown in Fig A4 in S1 Appendix can be computed in roughly 0.1 second. We packaged our method with moments [3] as moments.LD, a Python module that computes expected statistics and performs likelihood-based inference from observed data (described below), available at bitbucket.org/simongravel/moments.
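To illustrate the integration scheme, here is a small dense Crank-Nicolson solver for a linear system dy/dt = A y, applied to a single-population Hill-Robertson generator with drift and recombination only. This is a sketch under stated assumptions and not the moments.LD implementation: it omits mutation and migration, uses dense lists rather than sparse matrices, and the generator coefficients are the classical single-population values in units of 2N_ref generations. The check uses the fact that (1, 1, 1) is an eigenvector of the drift generator with eigenvalue −1; it is a purely numerical test vector, not a biologically meaningful state.

```python
import math


def generator(nu, rho):
    """Rate matrix for (E[D^2], E[Dz], E[pi_2]): drift scaled by 1/nu plus recombination."""
    drift = [[-3.0, 1.0, 1.0], [4.0, -5.0, 0.0], [0.0, 1.0, -2.0]]
    recomb = [rho, rho / 2.0, 0.0]
    return [
        [drift[i][j] / nu - (recomb[i] if i == j else 0.0) for j in range(3)]
        for i in range(3)
    ]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[piv] = m[piv], m[col]
        for k in range(col + 1, n):
            factor = m[k][col] / m[col][col]
            for j in range(col, n + 1):
                m[k][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def crank_nicolson(y, nu, rho, T, dt):
    """Integrate dy/dt = A y over time T: (I - dt/2 A) y_new = (I + dt/2 A) y_old."""
    a = generator(nu, rho)
    n = len(y)
    eye = lambda i, j: 1.0 if i == j else 0.0
    lhs = [[eye(i, j) - 0.5 * dt * a[i][j] for j in range(n)] for i in range(n)]
    rhs = [[eye(i, j) + 0.5 * dt * a[i][j] for j in range(n)] for i in range(n)]
    for _ in range(int(round(T / dt))):
        b = [sum(rhs[i][j] * y[j] for j in range(n)) for i in range(n)]
        y = solve(lhs, b)
    return y


# Pure drift from the eigenvector (1, 1, 1): every entry should decay as exp(-t).
y_final = crank_nicolson([1.0, 1.0, 1.0], nu=1.0, rho=0.0, T=1.0, dt=0.001)
```

For constant coefficients the left-hand matrix could be factorized once and reused across steps; the sketch re-solves each step for brevity.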
Validation.
We validated our numerical implementation and estimation of statistics from simulated genomes using msprime [32]. Expectations for low-order statistics match closely with coalescent simulations. For example, Fig A4 in S1 Appendix shows the agreement for a four population model with non-constant demography, continuous migration, and an admixture event, for which we computed expectations using moments.LD that matched estimates from msprime. While approximating expectations from msprime required the time-consuming running and parsing of many simulations, expectations from moments.LD were computed in seconds on a personal computer.
Data and inference
Genotype data.
Computing D using the standard definition requires phased haplotype data (Appendix S1.5). However, most currently available whole-genome sequence data are unphased, so that we must rely on two-locus statistics based on observed genotype counts instead of haplotype counts. One could estimate haplotype statistics using the Weir [33] estimator
$$\hat{D} = \frac{1}{n_d}\left(2n_{AABB} + n_{AABb} + n_{AaBB} + \frac{1}{2}n_{AaBb}\right) - 2\,\frac{n_A}{2n_d}\,\frac{n_B}{2n_d}, \tag{10}$$
where $n_A$ is the count of A at the left locus, $n_B$ the count of B at the right locus, $n_d$ the number of diploid individuals in the sample, and $\{n_{AABB}, n_{AABb}, \ldots\}$ the counts of each observed genotype. However, the Weir estimator for D is biased. Fortunately, we can simply treat the Weir estimator as a statistic and obtain an unbiased prediction for its expectation (Appendix S1.7.3). Even though $\mathbb{E}[D^n]$ can be estimated from 2n phased haplotypes, more samples are required to accurately estimate LD for a given pair of SNPs. However, as we are interested in genome-wide averages of $\mathbb{E}[D^2]$ and other LD statistics, even when individual estimates are noisy, by averaging over a very large number of pairs of SNPs we can accurately estimate LD from relatively few diploid genomes.
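A genotype-based estimator of this kind can be sketched in a few lines. The version below uses the composite (Burrows) form as it is commonly written, which we assume corresponds to the estimator referenced above; the function name and the encoding of genotypes as pairs (copies of A, copies of B) are choices made for this illustration. The weight on each genotype class, 2 for AABB, 1 for AABb and AaBB, and 1/2 for AaBb, equals half the product of the two allele copy numbers.

```python
def weir_D(genotype_counts):
    """Composite LD estimate from unphased genotype counts.

    genotype_counts maps (a, b) to a number of diploid individuals,
    where a and b in {0, 1, 2} are the numbers of copies of allele A
    and allele B carried by the individual.
    """
    n_d = sum(genotype_counts.values())
    n_A = sum(a * c for (a, b), c in genotype_counts.items())
    n_B = sum(b * c for (a, b), c in genotype_counts.items())
    # Weights: AABB -> 2, AABb and AaBB -> 1, AaBb -> 1/2, others -> 0.
    n_AB = sum(0.5 * a * b * c for (a, b), c in genotype_counts.items())
    return n_AB / n_d - 2.0 * (n_A / (2.0 * n_d)) * (n_B / (2.0 * n_d))


# Sixteen individuals at Hardy-Weinberg and linkage equilibrium, p = q = 0.5.
equilibrium_sample = {
    (0, 0): 1, (0, 1): 2, (0, 2): 1,
    (1, 0): 2, (1, 1): 4, (1, 2): 2,
    (2, 0): 1, (2, 1): 2, (2, 2): 1,
}
d_eq = weir_D(equilibrium_sample)

# Only double homozygotes AABB and aabb.
d_extreme = weir_D({(2, 2): 5, (0, 0): 5})
```

The second example returns 0.5 rather than the gametic value 0.25 because the composite measure also captures association between alleles carried on different gametes within an individual, which is nonzero in such a non-randomly-mating sample.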
1000 Genomes Project data.
We computed statistics from intergenic data in the Phase 3 1000 Genomes Project data [34]. The non-coding regions of the 1000 Genomes data are low coverage, which can lead to significant underestimation of low-frequency variant counts. This distorts the frequency spectrum and can bias AFS-based demographic inference [35]. However, low-order statistics in the Hill-Robertson system are robust to low coverage given a large enough sample size (Fig A6 in S1 Appendix), so that low-coverage data are well suited for inference from LD statistics (see also [12]).
To avoid possible confounding due to variable mutation rate across the genome, we calculated and compared statistics normalized by π2, the joint heterozygosity: $\sigma_d^2 = \mathbb{E}[D^2]/\mathbb{E}[\pi_2]$, as in [12]. All figures showing $\sigma_d^2$-type statistics are normalized using π2(YRI), the joint heterozygosity in the Yoruba in Ibadan, Nigeria (YRI). This normalization removes all dependence of the statistics on the overall mutation rate, so that estimates of split times and population sizes are calibrated by the recombination rate per generation instead of the mutation rate [23]. This is convenient given that genome-wide estimates of the recombination rate tend to be more consistent across experimental approaches than estimates of the mutation rate.
We considered all pairs of intergenic SNPs with $10^{-5} \le r \le 2 \times 10^{-3}$ using the African-American recombination map estimated by Hinch et al. [36] using ancestry switch-points. The lower bound was chosen to further reduce the potential effect of short-range correlations of mutation rates, clustered mutations, experimental error, and low resolution of the recombination map at very short distances.
Likelihood-based inference on LD-curves.
To compare observed LD statistics in the data to model predictions, and thus to evaluate the fit of the model to data, we used a likelihood approach. We binned pairs of SNPs based on the recombination distance separating them (Appendix S1.7.2). Bins were defined by bin edges $\{r_0, r_1, \ldots, r_n\}$, roughly logarithmically spaced. The model is defined by the set of demographic parameters Θ. We included the ancestral $N_{\mathrm{ref}}$ as a parameter to be fit, which we also use to scale recombination bins as $\rho_i = 4N_{\mathrm{ref}} r_i$.
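The binning step can be sketched as follows. The exact edges used in the analysis are not given here, so the helper that generates logarithmically spaced edges, the number of bins, and the value of N_ref are all illustrative assumptions. Bins are treated as half-open on the left, so a pair with distance r falls in bin i when edges[i] < r <= edges[i + 1].

```python
from bisect import bisect_left


def log_spaced_edges(r_min, r_max, n_bins):
    """Return n_bins + 1 edges with a constant ratio between neighbors."""
    ratio = (r_max / r_min) ** (1.0 / n_bins)
    return [r_min * ratio ** k for k in range(n_bins + 1)]


def bin_index(r, edges):
    """Index i with edges[i] < r <= edges[i + 1], or None if r is outside."""
    if r <= edges[0] or r > edges[-1]:
        return None
    return bisect_left(edges, r) - 1


def scale_edges(edges, n_ref):
    """Convert per-generation recombination edges r_i to rho_i = 4 N_ref r_i."""
    return [4.0 * n_ref * r for r in edges]


# Explicit edges keep the example free of floating-point edge effects.
edges = [1e-5, 1e-4, 1e-3]
indices = [bin_index(r, edges) for r in (1e-5, 5e-5, 1e-4, 2e-4, 2e-3)]

auto_edges = log_spaced_edges(1e-5, 2e-3, 8)
rho_edges = scale_edges(edges, n_ref=10_000)  # hypothetical N_ref
```

Because N_ref is itself a fitted parameter, the scaled edges need to be recomputed whenever the optimizer proposes a new value.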
For a given recombination bin $(\rho_i, \rho_{i+1}]$, we computed statistics from the data and normalized by π2 in one population (we used π2(YRI)), and denote this set of normalized statistics $\hat{\mathbf{y}}_i$. We computed expectations for normalized statistics from the model, $M_i$, and then estimated the likelihood as
$$\mathcal{L}_i(\Theta \mid \hat{\mathbf{y}}_i) = P(\hat{\mathbf{y}}_i \mid M_i, \Sigma_i) = \frac{\exp\left(-\frac{1}{2}(\hat{\mathbf{y}}_i - M_i)^{T}\,\Sigma_i^{-1}\,(\hat{\mathbf{y}}_i - M_i)\right)}{\sqrt{(2\pi)^{k}\,|\Sigma_i|}},$$
where k is the number of statistics in the bin, taking the probability of observing data $\hat{\mathbf{y}}_i$ to be normally distributed with mean $M_i$ and covariance matrix $\Sigma_i$ (the normal distribution assumption is validated in Fig A5 in S1 Appendix).
We estimated $\Sigma_i$ directly from the data by constructing bootstrap replicates from sampled subregions of the genome with replacement. This has the advantage of accounting for the covariance of statistics in our basis, as well as non-independence between distinct neighboring or overlapping pairs of SNPs. To compute the composite likelihood across ρ bins, we simply took the product of likelihoods over recombination bins indexed by i, so that
$$\mathcal{L}(\Theta \mid \hat{\mathbf{y}}) = \prod_i \mathcal{L}_i(\Theta \mid \hat{\mathbf{y}}_i).$$
To compute confidence intervals on parameters, we used the approach proposed by Coffman et al. [37], which adjusts uncertainty estimates to account for non-independence between recombination bins and neighboring pairs of SNPs.
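The per-bin Gaussian likelihood and its product over bins are easiest to handle on the log scale, where the product becomes a sum. The sketch below evaluates the multivariate normal log-density through a Cholesky factorization of the covariance matrix. It is a generic illustration with hypothetical function names and toy inputs, not the moments.LD code, and it takes the covariance matrices as given rather than estimating them by bootstrap.

```python
import math


def cholesky(s):
    """Lower-triangular L with L L^T = s, for a symmetric positive-definite s."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def gaussian_loglik(data, model, cov):
    """Log-density of data under a normal with mean model and covariance cov."""
    n = len(data)
    low = cholesky(cov)
    resid = [d - m for d, m in zip(data, model)]
    # Forward substitution: solve low w = resid, so that w.w is the
    # quadratic form resid^T cov^{-1} resid.
    w = []
    for i in range(n):
        w.append((resid[i] - sum(low[i][k] * w[k] for k in range(i))) / low[i][i])
    quad = sum(x * x for x in w)
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (quad + logdet + n * math.log(2.0 * math.pi))


def composite_loglik(data_bins, model_bins, cov_bins):
    """Sum of per-bin log-likelihoods (log of the product over bins)."""
    return sum(
        gaussian_loglik(d, m, c)
        for d, m, c in zip(data_bins, model_bins, cov_bins)
    )


# Toy inputs: two recombination bins with one and two statistics.
ll_bin0 = gaussian_loglik([1.0], [0.0], [[4.0]])
ll_bin1 = gaussian_loglik([1.0, 0.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
ll_total = composite_loglik(
    [[1.0], [1.0, 0.0]],
    [[0.0], [0.0, 0.0]],
    [[[4.0]], [[2.0, 1.0], [1.0, 2.0]]],
)
```

An optimizer would maximize `composite_loglik` over the demographic parameters, recomputing the model expectations for each proposed parameter set while the data and covariance matrices stay fixed.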
Results
Human expansion models underestimate LD between low frequency variants
The demographic model for human out-of-Africa (OOA) expansion proposed and inferred by Gutenkunst et al. [2] has been widely used for subsequent simulation studies, and parameter estimates have been refined as more data became available [3, 35, 38]. These models have typically been fit to the single-locus joint AFS, with Yoruba in Ibadan, Nigeria (YRI), Utah residents of Western European ancestry (CEU), and Han Chinese from Beijing (CHB) as representative panels. Gutenkunst et al. verified that the observed decay of $r^2$ was consistent with simulations under their inferred model.
We first asked if the OOA model (Fig 3A) is able to capture observed patterns of LD within and between these three populations. When fitting to all statistics in the multi-population basis, parameters diverged to infinite values, suggesting that the model is mis-specified. In particular, this model was unable to describe observed Dz statistics, with Dz-curves from the model drastically underestimating observations. We refit the OOA model without including Dz statistics, and we inferred best-fit parameters that generally align with estimates using the joint AFS (Table 1, left, and Fig 3). This model underestimated observed Dz in each population, especially in the YRI population (Fig 3D). Using AFS-inferred parameters from previous studies led to qualitatively similar results.
(A) We fit the 13-parameter Gutenkunst et al. model to statistics in the two-locus, multi-population Hill-Robertson system. The remaining 35 statistics from the Hill-Robertson basis used in the fit are shown in Fig A7 in S1 Appendix, and residuals are shown in Fig A8 in S1 Appendix. Best fit values for labeled parameters are given in Table 1. Most statistics were accurately predicted by this model, including (B) the decays of $\mathbb{E}[D^2]$ in each population, (C) the decay of the covariance of D between populations, and (E) the joint heterozygosity $\mathbb{E}[\pi_2]$. (D) However, $\mathbb{E}[Dz]$ was fit poorly by this model, and we were unable to find a three-population model that recovered these observed statistics, including with additional periods of growth, recent admixture between modern human populations, or substructure within modern populations. Error bars represent bootstrapped 95% confidence intervals on the statistic estimate.
Two models for the out-of-Africa expansion. We fit the commonly used 13-parameter model to the multi-population Hill-Robertson statistics (left). The best fit parameters shown here were fit to the set of statistics without the Dz terms, because the inclusion of those terms led to runaway parameter behavior in the optimization. This is often a sign of model mis-specification. On the right, the same 13-parameter model is augmented by the inclusion of two deeply diverged branches, putatively Neanderthal and an unknown lineage within Africa. We inferred that these branches split from the branch leading to modern humans roughly 460 − 650 kya, and contributed migrants until quite recently (∼19 kya). Times reported here assume a generation time of 29 years and are calibrated by the recombination (rather than mutation) rate. Confidence intervals were computed using the Godambe information matrix on bootstrap replicates of the data [37].
The Gutenkunst model is a vast oversimplification of human evolutionary history, so its failure to account for Dz is not all that surprising. However, given the good agreement of the model with both allele frequencies and r2 decay [2], we did not expect such a large discrepancy. Having ruled out low coverage and spatial correlations in the mutation rate as explaining factors, our next hypothesis was a more complex demographic history. We generalized the Gutenkunst model with a number of additional parameters accounting for recent events, including size changes in the YRI population, recent mixture between populations, and substructure within each continental population. None of these modifications provided a satisfactory fit to the data, and some did not converge to biologically realistic parameters.
Inference of archaic admixture
Dz is a measure of positive covariance between low-frequency alleles (Fig A2 in S1 Appendix). We therefore expect this statistic to be sensitive to the presence of rare, deep-coalescing lineages within the population, as those lineages will contribute haplotypes with a large number of tightly linked low frequency variants (see Discussion below).
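To make the sign behavior of this statistic concrete, the following minimal sketch computes the population-level quantities D, D2, Dz, and π2 from the four two-locus haplotype frequencies, using the standard Hill-Robertson definitions D = f_AB f_ab − f_Ab f_aB, Dz = D(1 − 2p)(1 − 2q), and π2 = p(1 − p)q(1 − q). The function name and the example frequencies are hypothetical and chosen only for illustration; the analysis in this paper relies on unbiased sample estimators rather than these plug-in formulas.

```python
def two_locus_stats(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, D2, Dz, pi2) from the four haplotype frequencies.

    Illustrative plug-in formulas at the population level; this is not the
    unbiased sample estimator used for real data.
    """
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p = f_AB + f_Ab  # frequency of allele A at the left locus
    q = f_AB + f_aB  # frequency of allele B at the right locus
    D = f_AB * f_ab - f_Ab * f_aB          # covariance between the two loci
    D2 = D * D                             # squared LD
    Dz = D * (1 - 2 * p) * (1 - 2 * q)     # LD weighted toward rare alleles
    pi2 = p * (1 - p) * q * (1 - q)        # joint heterozygosity
    return D, D2, Dz, pi2


# Two rare alleles that mostly sit on the same haplotypes (coupling).
coupled = two_locus_stats(0.04, 0.01, 0.01, 0.94)
# Two rare alleles that never share a haplotype (repulsion).
repulsion = two_locus_stats(0.00, 0.05, 0.05, 0.90)
# Intermediate-frequency alleles: Dz vanishes even under perfect LD.
common = two_locus_stats(0.50, 0.00, 0.00, 0.50)

print("coupled   Dz =", coupled[2])
print("repulsion Dz =", repulsion[2])
print("common    Dz =", common[2])
```

In this toy setting Dz is positive when rare alleles co-occur on the same haplotypes, negative when they are in repulsion, and zero when both alleles are at frequency one half, which is the sense in which Dz tracks covariance among low-frequency alleles.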
Given prior genetic evidence for archaic admixture in Eurasia and Africa (reviewed in [39]), we proposed a model that includes two deeply diverged human branches, with one branch mixing with Eurasian ancestors beginning at the OOA event, and the second one mixing with the ancestors of the Yoruba population over a time period that could include the OOA event. In this scenario, this second branch could also contribute to Eurasians through admixture prior to the OOA event (Fig 4A). Many human lineages coexisted on the African continent, possibly until quite recently [40–42], and genetic evidence points to a history of archaic admixture or deep structure across many modern African populations [43–48]. It is likely that modern humans have met and mixed with diverged lineages many times through history, rather than receiving just a single pulse of migrants [49, 50]. We chose to model the mixing of archaic and modern human branches as continuous and symmetric [51], parameterizing the migration rate between these branches and the times that migration began and ended.
(A) We fit a model for out-of-Africa expansion related to the standard model in Fig 3A. Demographic events for the three modern human populations are parameterized as above, but we also include two branches that split deeply from the population ancestral to modern humans: a putatively Neanderthal branch that remains isolated until the Eurasian split from YRI, and a deep branch within Africa that is allowed to be isolated for some time before continuously exchanging migrants with the common ancestral branch and the YRI branch. (B-E) This model fits the data much better than the model without archaic admixture, especially for the Dz statistics (D). Fits to 35 more curves and statistics are shown in Fig A7 in S1 Appendix, and residuals are shown in Fig A8 in S1 Appendix. The migration rates inferred between the diverged African branch and YRI provide an estimate of ∼ 7.5% contribution.
We considered two topologies for the archaic branches: 1) both branches split independently from that leading to modern humans (Fig 4A and Table 1), and 2) one branch split from the modern human branch, which some time later split into the two populations (Fig A9 and Table A2 in S1 Appendix). Both models fit the data well with little statistical evidence to discriminate between these two models (Fig 4B–4E and Fig A8 in S1 Appendix). The difference in log-likelihood between the two models was ΔLL < 1, as opposed to ΔLL = 1,730 between models with and without archaic admixture. ΔLL between the best fit model with archaic admixture and the fully saturated model (using observations as expectations) was 767. Consistent among the inferred models was the age of the split between diverged and modern human branches within Africa at ∼ 500 kya, though uncertainty remains with regard to the relationship between archaic human lineages in Africa and Eurasia. The sequencing of archaic genomes within Africa would clearly be helpful in resolving these topologies.
We inferred an archaic population to have contributed measurably to Eurasian populations. This branch (putatively Eurasian Neanderthal) split from the branch leading to modern humans 470 − 650 thousand years ago (kya) and contributed 1.2 ± 0.6% ancestry to modern CEU and CHB populations after the out-of-Africa split. This range of divergence dates from our maximum-likelihood model overlaps with previous estimates of the time of divergence between Neanderthals and human populations, estimated at 550 − 765 kya [52]. The diverged African branch split from the ancestors of modern humans 460 − 540 kya and contributed to both the pre-OOA human branch and the lineage leading to YRI. This admixture began between 90 − 160 kya, well before the estimated split between the Eurasian and YRI lineages, so that this archaic branch also contributed to the ancestors of Eurasian populations. We estimated 4.7 − 9.2% ancestry contribution from this unknown population to YRI, and 1.9 − 6.6% contribution to CEU and CHB.
We chose a separate population trio to validate our inference and compare levels of archaic admixture with different representative populations. This second trio consisted of the Luhya in Webuye, Kenya (LWK), Kinh in Ho Chi Minh City, Vietnam (KHV), and British in England and Scotland (GBR). We inferred the KHV and GBR populations to have experienced comparable levels of migration from the putatively Neanderthal branch. However, the LWK population exhibited lower levels of admixture (∼ 6%) in comparison to YRI, possibly suggesting population differences in archaic admixture events within the African continent (Table A3 in S1 Appendix).
Discussion
Multi-population two-locus diversity statistics
The application presented here relied on the four-haplotype statistics (D2, Dz, π2). Studying these low-order multi-population statistics in a likelihood framework allowed us to infer a demographic model with archaic admixture, even without reference genomes from those diverged populations. We have also shown that higher order statistics may be computed through this same framework. Extending higher order two-locus moment systems to multiple populations would potentially provide further information about demography, particularly for past encounters with archaic lineages.
Relation to other statistics.
There are many approaches for computing expected statistics for diversity under a wide range of scenarios. Single-site statistics, which include expected heterozygosity and the AFS, may be computed efficiently using forward- or reverse-time approaches. Beyond the classical recursions for D2 and related two-locus statistics [12, 14], two-locus statistics are difficult to compute for non-equilibrium, multi-population demographic models. Sved [53] proposed an IBD-based recursion to compute LD correlation measures across subdivided populations, but its accuracy and interpretation remain debated [12].
The moments-based approach presented here generalizes the recursion for the single-site AFS presented in [3]. The moments system includes all heterozygosity statistics, so we recover expected F-statistics under arbitrary demography, which are commonly used to test for admixture [54–56]. Long-range patterns of elevated LD in putatively admixed populations are used to infer the timing of admixture events and relative contributions of parental populations [10, 29]. These approaches rely on the recursion for D after admixture events that is used here (Eqs 2 and 7). Thus the generalized Hill-Robertson system is sensitive to ancient admixture, but also captures statistics used to identify recent admixture history, with fewer assumptions about early history.
Plagnol and Wall [57, 58] introduced a statistic, S*, specifically designed to scan for introgressed haplotypes without having sequence data from the diverged population. S* uses an ad-hoc score to identify SNPs that likely arose on haplotypes contributed from a deeply diverged population, and is estimated through simulation. These SNPs will tend to be rare and in high LD, and therefore also contribute to Dz (Fig 2D–2F). Thus even a small amount of archaic admixture will significantly elevate Dz compared to that in an unadmixed population, and Dz itself could be used as an ad-hoc statistic similar to S*. Given its conceptual relationship to S*, it may not be so surprising that this previously overlooked statistic is particularly well suited for model-based inference of archaic admixture.
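The intuition that rare, tightly linked variants inflate Dz can be illustrated with a deliberately simple toy calculation. Assume, purely for illustration, that a fraction f of haplotypes derive from a diverged lineage and carry the derived allele at both of two linked sites, while all remaining haplotypes carry neither. With the standard definitions Dz = D(1 − 2p)(1 − 2q) and π2 = p(1 − p)q(1 − q), the ratio Dz/π2 for such a pair simplifies to (1 − 2f)² / (f(1 − f)). The function name and the values of f below are hypothetical, and this per-pair ratio is not the genome-wide ratio of expectations fit in this study.

```python
def dz_over_pi2_for_introgressed_pair(f):
    """Dz / pi2 for a pair of sites whose derived alleles are found only,
    and always together, on a fraction f of haplotypes (toy assumption)."""
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    p = q = f                      # both derived alleles have frequency f
    D = f * (1 - f)                # f_AB * f_ab, since f_Ab = f_aB = 0
    Dz = D * (1 - 2 * p) * (1 - 2 * q)
    pi2 = p * (1 - p) * q * (1 - q)
    return Dz / pi2


# The rarer the introgressed haplotypes, the larger the normalized signal.
for f in (0.01, 0.05, 0.2, 0.5):
    print(f, dz_over_pi2_for_introgressed_pair(f))
```

Under these assumptions the ratio grows without bound as f becomes small and falls to zero at f = 0.5, consistent with the statement that a small archaic contribution can have a disproportionate effect on Dz.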
Caveats.
Like many inference approaches in population genetics, we approximate human history using discrete, randomly mating populations with size and migration histories described by relatively few parameters. History is much more complex than this. Thus statistical uncertainties estimated using bootstrap analysis mask much larger, systematic errors due to model misspecification. In particular, some choices we made in modeling archaic admixture are certainly oversimplified, such as the assumption of symmetric and constant migration rates during the period of contact between archaic and modern humans.
Variability in fine-scale recombination rates between populations and over time contributes another source of systematic error. While large-scale recombination rates are generally better understood than the mutation rate in humans (for which current estimates vary over a factor of two [59]), recombination rates can vary at short distances. Spence and Song [60] showed that recombination maps are highly concordant across populations represented in the Thousand Genomes Project [34], although this correlation surely decreases at shorter distances. We filtered out pairs of mutations at very close distances (less than roughly 1 kb) to reduce potential biases due to very fine scale variation. We therefore do not expect variation in recombination rate among human populations to explain the large differences in Dz compared to the Gutenkunst et al. model. However, the effect of population-specific recombination maps may play a role when considering finer-scale patterns and data from deeply diverged populations such as the Neanderthal.
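The distance filter described above can be sketched as follows. This is a minimal illustration that assumes SNP positions are given as physical coordinates in base pairs and that the cutoff is exactly 1,000 bp; the function name, the optional upper bound, and the example positions are hypothetical, and the actual pipeline may define distance bins differently (for example, in recombination units).

```python
def snp_pairs_in_range(positions, min_dist=1000, max_dist=None):
    """Return pairs of positions whose separation is at least min_dist
    (and at most max_dist, if given). Positions are in base pairs."""
    pos = sorted(positions)
    pairs = []
    for i, left in enumerate(pos):
        for right in pos[i + 1:]:
            dist = right - left
            if max_dist is not None and dist > max_dist:
                break  # positions are sorted, so later ones are even farther
            if dist >= min_dist:
                pairs.append((left, right))
    return pairs


# Hypothetical positions on one chromosome.
positions = [100, 600, 1500, 5000]
print(snp_pairs_in_range(positions))                 # pairs at >= 1 kb
print(snp_pairs_in_range(positions, max_dist=2000))  # pairs within 1-2 kb
```

The nested loop is quadratic in the worst case; the early `break` keeps it practical when an upper distance bound is supplied, which is the usual situation when statistics are binned by distance.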
Finally, our model and inferences assumed that mutations are evolving neutrally. We chose to analyze SNPs in intergenic regions and excluded genic and intronic regions in an effort to reduce biases due to selection acting on mutations included in the analysis or nearby selected regions, although some intergenic regions are expected to be affected by selection or biased gene conversion. While outside the scope of this study, a more detailed characterization of the effects of linked selection on Hill-Robertson statistics is warranted.
Conclusion
We described an infinite hierarchy of multi-locus summaries of genomic diversity that are easy to compute under arbitrary, multi-population demographies. Some of these statistics are familiar, including expected heterozygosity, F-statistics, and LD decay, while others have been largely unexplored in multi-population models, such as the degree of LD between low frequency alleles (Dz) and the joint heterozygosity across sites and populations (π2). The one-population Dz statistic, in particular, has an interesting history, as it has come up in early work as a mathematical stepping-stone on the way to computing D2 [14], but was, to our knowledge, never used in data analysis. As it happens, this ‘ghost’ statistic provides a unique window into human history.
Using this set of summary statistics, we explored a commonly used model of human demographic history derived from single-site AFS and validated using LD decay curves. While many statistics under this model fit the data well, the model dramatically underestimates levels of LD among rare alleles. Modeling archaic admixture worldwide resolved this discrepancy. We recovered the signal of Neanderthal admixture in Eurasian populations, and found evidence for substantial and long-lasting admixture from a deeply diverged lineage in two African populations that is consistent with evidence from previous studies [46–48, 57].
This model deserves a more thorough investigation, including data from ancient humans and additional contemporary African populations. We leave this to future work for three reasons. First, proposing a detailed multi-population model of evolution in Africa will require carefully incorporating anthropological and archaeological evidence, which is a substantial endeavor. Second, the inclusion of two-locus statistics from ancient genomes will require vetting possible biases associated with ancient DNA sequencing, although we see no problem with using two-locus statistics in modern populations jointly with one-locus statistics in ancient DNA.
Third, and more importantly, archaic admixture can hide in the blind spot of classical statistics, and widely used demographic models for simulating genomes underestimate LD between low frequency variants in populations around the globe, especially in Africa. This large bias affects neither the distribution of allele frequencies nor the amount of correlation measured by D2, but it may impact analyses aiming to identify disease variants based on overrepresentation of rare variants in specific genes or pathways. Thus both statistical and population geneticists would benefit from including archaic admixture into baseline models of human genomic diversity.
Supporting information
S1 Appendix. Supporting material.
In the Appendix, we provide detailed mathematical derivations, expanded discussions, supporting analyses, and supplemental figures and tables.
https://doi.org/10.1371/journal.pgen.1008204.s001
(PDF)
Acknowledgments
We thank Brenna Henn, Mathias Steinrücken, Ryan Gutenkunst, and Chris Gignoux for useful discussions, and Ryan Gutenkunst for also making his source code open and accessible. We also thank Nick Patterson and an anonymous reviewer for useful comments that improved this manuscript.
References
- 1. Marth GT, Czabarka E, Murvai J, Sherry ST. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics. 2004;166(1):351–372.
- 2. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics. 2009;5(10):e1000695. pmid:19851460
- 3. Jouganous J, Long W, Ragsdale AP, Gravel S. Inferring the Joint Demographic History of Multiple Populations: Beyond the Diffusion Approximation. Genetics. 2017;206(3):1549–1567. pmid:28495960
- 4. Kamm JA, Terhorst J, Song YS. Efficient computation of the joint sample frequency spectra for multiple populations. Journal of Computational and Graphical Statistics. 2017;26(1):182–194. pmid:28239248
- 5. McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. The Fine-Scale Structure of Recombination Rate Variation in the Human Genome. Science. 2004;304(5670):581–584. pmid:15105499
- 6. Auton A, McVean G. Recombination rate estimation in the presence of hotspots. Genome Research. 2007;17:1219–1227. pmid:17623807
- 7. Chan AH, Jenkins PA, Song YS. Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster. PLoS Genetics. 2012;8(12).
- 8. Kamm JA, Spence JP, Chan J, Song YS. Two-locus likelihoods under variable population size and fine-scale recombination rate estimation. Genetics. 2016;203(3):1381–1399. pmid:27182948
- 9. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–496. pmid:21753753
- 10. Loh PR, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D, et al. Inferring admixture histories of human populations using linkage disequilibrium. Genetics. 2013;193(4):1233–1254. pmid:23410830
- 11. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nature Genetics. 2014;46(8):919–925. pmid:24952747
- 12. Rogers AR. How population growth affects linkage disequilibrium. Genetics. 2014;197(4):1329–1341. pmid:24907258
- 13. Hill WG, Robertson A. The effect of linkage on limits to artificial selection. Genetical Research. 1966;8(03):269. pmid:5980116
- 14. Hill WG, Robertson A. Linkage disequilibrium in finite populations. Theoretical and Applied Genetics. 1968;38(6):226–231. pmid:24442307
- 15. Karlin S, McGregor J. Rates and probabilities of fixation for two locus random mating finite populations without selection. Genetics. 1968;58:141–159. pmid:5656343
- 16. Ohta T, Kimura M. Linkage disequilibrium at steady state determined by random genetic drift and recurrent mutation. Genetics. 1969;63(1):229–238. pmid:5365295
- 17. Ohta T, Kimura M. Linkage disequilibrium due to random genetic drift. Genetical Research. 1969;13(01):47.
- 18. Golding GB. The sampling distribution of linkage disequilibrium. Genetics. 1984;108(1):257–274. pmid:6479585
- 19. Ethier SN, Griffiths RC. On the two-locus sampling distribution. Journal of Mathematical Biology. 1990;29(2):131–159.
- 20. Hudson RR. Two-locus sampling distributions and their application. Genetics. 2001;159(4):1805–1817. pmid:11779816
- 21. McVean GAT. A genealogical interpretation of linkage disequilibrium. Genetics. 2002;162(2):987–991. pmid:12399406
- 22. Song YS, Song JS. Analytic computation of the expectation of the linkage disequilibrium coefficient r2. Theoretical Population Biology. 2007;71(1):49–60. pmid:17069867
- 23. Ragsdale AP, Gutenkunst RN. Inferring Demographic History Using Two-Locus Statistics. Genetics. 2017;206(2):1037–1048. pmid:28413158
- 24. Donnelly P, Kurtz TG. Genealogical processes for Fleming-Viot models with selection and recombination. Annals of Applied Probability. 1999;9(4):1091–1148.
- 25. Donnelly P, Kurtz TG. Particle Representations for Measure-Valued Population Models. The Annals of Probability. 1999;27(1):166–205.
- 26. Wright S. Evolution in mendelian populations. Genetics. 1931;16:97–159. pmid:17246615
- 27. Evans SN, Shvets Y, Slatkin M. Non-equilibrium theory of the allele frequency spectrum. Theoretical Population Biology. 2007;71(1):109–119. pmid:16887160
- 28. Živković D, Steinrücken M, Song YS, Stephan W. Transition densities and sample frequency spectra of diffusion processes with selection and variable population size. Genetics. 2015;200(2):601–617. pmid:25873633
- 29. Moorjani P, Patterson N, Hirschhorn JN, Keinan A, Hao L, Atzmon G, et al. The history of African gene flow into Southern Europeans, Levantines, and Jews. PLoS Genetics. 2011;7(4):e1001373. pmid:21533020
- 30. Cavalli-Sforza LL, Bodmer WF. The genetics of human populations. W. H. Freeman and Company; 1971.
- 31. Nei M, Li WH. Linkage disequilibrium in subdivided populations. Genetics. 1973;75(1):213–9. pmid:4762877
- 32. Kelleher J, Etheridge AM, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Computational Biology. 2016;12(5):1–22.
- 33. Weir BS. Inferences about linkage disequilibrium. Biometrics. 1979;35(1):235–254. pmid:497335
- 34. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. pmid:26432245
- 35. Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, et al. Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences. 2011;108(29):11983–11988.
- 36. Hinch AG, Tandon A, Patterson N, Song Y, Rohland N, Palmer CD, et al. The landscape of recombination in African Americans. Nature. 2011;476(7359):170–175. pmid:21775986
- 37. Coffman AJ, Hsieh PH, Gravel S, Gutenkunst RN. Computationally Efficient Composite Likelihood Statistics for Demographic Inference. Molecular Biology and Evolution. 2016;33(2):591–593. pmid:26545922
- 38. Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, et al. Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science. 2012;337(6090):64–69. pmid:22604720
- 39. Wall JD, Brandt DYC. Archaic admixture in human history. Current Opinion in Genetics & Development. 2016;41:93–97.
- 40. Rightmire GP. Middle and later Pleistocene hominins in Africa and Southwest Asia. Proceedings of the National Academy of Sciences. 2009;106(38):16046–16050.
- 41. Harvati K, Stringer C, Grün R, Aubert M, Allsworth-Jones P, Folorunso CA. The later stone age calvaria from Iwo Eleru, Nigeria: Morphology and chronology. PLoS ONE. 2011;6(9). pmid:21949689
- 42. Berger LR, Hawks J, Dirks PHGM, Elliott M, Roberts EM. Homo naledi and Pleistocene hominin evolution in subequatorial Africa. eLife. 2017;6:1–19.
- 43. Hammer MF, Woerner AE, Mendez FL, Watkins JC, Wall JD. Genetic evidence for archaic admixture in Africa. Proceedings of the National Academy of Sciences. 2011;108(37):15123–15128.
- 44. Lachance J, Vernot B, Elbers CC, Ferwerda B, Froment A, Bodo JM, et al. Evolutionary history and adaptation from high-coverage whole-genome sequences of diverse African hunter-gatherers. Cell. 2012;150(3):457–469. pmid:22840920
- 45. Hsieh PH, Woerner AE, Wall JD, Lachance J, Tishkoff SA, Gutenkunst RN, et al. Model-based analyses of whole-genome data reveal a complex evolutionary history involving archaic introgression in Central African Pygmies. Genome Research. 2016;26(3):291–300. pmid:26888264
- 46. Skoglund P, Thompson JC, Prendergast ME, Mittnik A, Sirak K, Hajdinjak M, et al. Reconstructing Prehistoric African Population Structure. Cell. 2017;171(1):59–71.e21. pmid:28938123
- 47. Durvasula A, Sankararaman S. Recovering signals of ghost archaic admixture in the genomes of present-day Africans. bioRxiv. 2018.
- 48. Hey J, Chung Y, Sethuraman A, Lachance J, Tishkoff S, Sousa VC, et al. Phylogeny Estimation by Integration over Isolation with Migration Models. Molecular Biology and Evolution. 2018;35(11):2805–2818. pmid:30137463
- 49. Browning SR, Browning BL, Zhou Y, Tucci S, Akey JM. Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture. Cell. 2018;173(1):53–61.e9. pmid:29551270
- 50. Villanea FA, Schraiber JG. Multiple episodes of interbreeding between Neanderthal and modern humans. Nature Ecology & Evolution. 2019;3(1):39.
- 51. Kuhlwilm M, Gronau I, Hubisz MJ, de Filippo C, Prado-Martinez J, Kircher M, et al. Ancient gene flow from early modern humans into Eastern Neanderthals. Nature. 2016;530(7591):429–433. pmid:26886800
- 52. Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature. 2014;505(7481):43–49. pmid:24352235
- 53. Sved JA. Correlation measures for linkage disequilibrium within and between populations. Genetics Research. 2009;91(3):183–192. pmid:19589188
- 54. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461(7263):489–494. pmid:19779445
- 55. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, et al. Ancient admixture in human history. Genetics. 2012;192(3):1065–1093. pmid:22960212
- 56. Peter BM. Admixture, population structure, and f-statistics. Genetics. 2016;202(4):1485–1501. pmid:26857625
- 57. Plagnol V, Wall JD. Possible ancestral structure in human populations. PLoS Genetics. 2006;2(7):e105. pmid:16895447
- 58. Wall JD, Lohmueller KE, Plagnol V. Detecting ancient admixture and estimating demographic parameters in multiple human populations. Molecular Biology and Evolution. 2009;26(8):1823–1827. pmid:19420049
- 59. Scally A. The mutation rate in human evolution and demographic inference. Current Opinion in Genetics & Development. 2016;41:36–43.
- 60. Spence JP, Song YS. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. bioRxiv. 2019.
- 61. Lan T, Lin H, Zhu W, Laurent TCAM, Yang M, Liu X, et al. Deep whole-genome sequencing of 90 Han Chinese genomes. GigaScience. 2017;6(9):1–7.