
VP and JDW analyzed the data and wrote the paper.

The authors have declared that no competing interests exist.

Determining the evolutionary relationships between fossil hominid groups such as Neanderthals and modern humans has been a question of enduring interest in human evolutionary genetics. Here we present a new method for addressing whether archaic human groups contributed to the modern gene pool (called ancient admixture), using the patterns of variation in contemporary human populations. Our method improves on previous work by explicitly accounting for recent population history before performing the analyses. Using sequence data from the Environmental Genome Project, we find strong evidence for ancient admixture in both a European and a West African population (p ≈ 10^{−7}), with contributions to the modern gene pool of at least 5%. While Neanderthals form an obvious archaic source population candidate in Europe, there is not yet a clear source population candidate in West Africa.

Determining the evolutionary relationships between modern humans and fossil hominine groups such as Neanderthals has been a question of enduring interest in human evolutionary genetics. In this paper, Plagnol and Wall present a new method for addressing whether archaic human groups contributed to the modern gene pool. Using sequence data from the Environmental Genome Project, they find strong evidence for ancient admixture in both a European and a West African population, with contributions to the modern gene pool of at least 5%. While Neanderthals form an obvious archaic source population candidate in Europe, there is not yet a clear source population candidate in West Africa. The authors' results have direct implications for the competing models of modern human origins. In particular, their estimates of non-negligible contributions of archaic populations to the modern gene pool are inconsistent with strict forms of the Recent African Origin model, which posits that modern humans evolved in a single location in Africa and from there spread and replaced all other existing hominines.

A long-standing controversy in the field of human evolution concerns the origin of modern humans [

The easiest way to answer this question is through a direct comparison of DNA sequences from both archaic and modern populations. Recently, researchers have managed to sequence fragments of Neanderthal mtDNA from fossil bones [

In this paper, we take a different approach to the question. We look for signs of Neanderthal admixture by analyzing the patterns of linkage disequilibrium (LD) in contemporary human DNA sequences. Our method relies on the observation that the genetic signature of ancient admixture is so strong that even tens of thousands of years of random mating is not enough to obscure it [

To avoid possible confounding effects, we first use extant sequence data to estimate parameters for a demographic null model that incorporates several known features of modern human history: recent population growth, a bottleneck in Europeans, and population differentiation between European and African populations [^{*}^{*}

Our analysis is based on 135 finished genes from the NIEHS Environmental Genome Project (EGP, as of February 2006, see [_{ST}

Distribution of Summary Statistics for the NIEHS-EGP Dataset and Our Best-Fitting Model

We use a simple two-island model with islands representing European and African populations. Initially, we considered models where there was no migration between the two populations after they split. These models did not fit the data well (unpublished data) so we use a model that incorporates a low level of migration between the populations. We include population growth in each population as well as a bottleneck in the European branch. We estimate a total of six parameters and the likelihood is estimated over a grid of values to find the maximum (see

To estimate parameters we use a composite-likelihood approach based on various summary statistics (cf. [^{*}_{ST}^{*}_{ST}

The second set of summary statistics divides the SNPs at a locus into three categories: private in the CEPH sample, private in the Yoruba sample, and segregating in both samples. Sites segregating in both samples are subsequently divided between low and high frequency (we set the threshold at 10%). SNPs segregating in both samples at low frequency are characteristic of recent migration and help to estimate the migration rate. This set of statistics has the useful property that the joint distribution can be computed exactly for a given realization of the genealogical process (Ancestral Recombination Graph [ARG] [
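The classification described above can be illustrated with a minimal Python sketch. The function name, the use of allele counts, and the choice to apply the 10% threshold to the pooled minor-allele frequency are assumptions made for illustration; the text does not state in which sample the frequency is measured.

```python
def classify_snp(count_ceph, n_ceph, count_yor, n_yor, threshold=0.10):
    """Classify one biallelic SNP from allele counts in each sample.

    count_*: copies of one allele observed in the sample.
    n_*: number of chromosomes sampled (e.g. 44 for 22 CEPH individuals).
    Assumption: the low/high split uses the pooled minor-allele frequency.
    """
    seg_ceph = 0 < count_ceph < n_ceph  # variable within the CEPH sample
    seg_yor = 0 < count_yor < n_yor     # variable within the Yoruba sample
    if seg_ceph and seg_yor:
        pooled = (count_ceph + count_yor) / (n_ceph + n_yor)
        minor = min(pooled, 1.0 - pooled)
        return "shared_low" if minor < threshold else "shared_high"
    if seg_ceph:
        return "private_CEPH"
    if seg_yor:
        return "private_Yoruba"
    return "not_segregating_within_samples"


# Hypothetical counts: 44 CEPH and 24 Yoruba chromosomes.
labels = [
    classify_snp(3, 44, 0, 24),
    classify_snp(0, 44, 5, 24),
    classify_snp(2, 44, 2, 24),
    classify_snp(20, 44, 10, 24),
]
```

Counting the labels at a locus gives the vector of class counts that enters the likelihood.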

Even though these two sets of summary statistics are correlated, we could not estimate their joint distribution. To estimate the overall likelihood we use a composite-likelihood approximation: precisely, we assume that both sets of summary statistics are independent.
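Written out, the independence assumption amounts to adding the log-likelihoods of the two sets of statistics at each locus and summing over the 135 loci. In the sketch below, the symbols A_l and B_l (the first and second set of summary statistics at locus l) and θ (the vector of demographic parameters) are notation introduced here for illustration, not the authors' own.

```latex
\log L_{C}(\theta) \;=\; \sum_{l=1}^{135}
  \Bigl[\, \log P\bigl(A_{l} \mid \theta\bigr) \;+\; \log P\bigl(B_{l} \mid \theta\bigr) \Bigr]
```

Because the two sets are in fact correlated, this quantity is a composite likelihood rather than a true likelihood, so intervals derived from it are likely to be too narrow.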

Our approach does not provide an accurate estimate of the date of divergence between the populations. Interpreting confidence intervals is difficult because we use a composite-likelihood approach. Nevertheless, a χ^{2} approximation for the composite-log-likelihood ratio provides our best estimate of the confidence interval. Using this approximation we find a lower bound at 120,000 years and no upper bound. Specifically, the goodness-of-fit for an equilibrium island model with a low rate of migration between both populations is only slightly worse than for our best-fitting model. We set the divergence date to the lower bound of the confidence interval (more consistent with our knowledge of human history) and verified that this choice does not qualitatively affect the results presented in this paper (see
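As a rough illustration of how an interval can be read off a likelihood profile, the sketch below keeps every grid value whose log-likelihood ratio statistic stays under a chi-square critical value. The 95% level, the single degree of freedom, the use of natural logarithms, and the toy numbers are all assumptions; the text does not give these details.

```python
CHI2_95_1DF = 3.841  # 95% quantile of a chi-square with 1 degree of freedom


def profile_interval(grid, logliks, critical=CHI2_95_1DF):
    """Return (lower, upper) bounds of the grid values not rejected.

    A bound is reported as None when the interval reaches the edge of the
    grid, i.e. when the data do not bound the parameter on that side.
    """
    best = max(logliks)
    kept = [g for g, ll in zip(grid, logliks) if 2.0 * (best - ll) <= critical]
    lower, upper = min(kept), max(kept)
    return (
        None if lower == min(grid) else lower,
        None if upper == max(grid) else upper,
    )


# Hypothetical profile over divergence dates (thousands of years).
dates = [60, 90, 120, 150, 200]
toy_logliks = [-110.0, -104.0, -101.5, -100.0, -100.3]
interval = profile_interval(dates, toy_logliks)
```

With a profile that stays nearly flat toward older dates, as in the toy values, the upper bound comes back as None, which mirrors the "no upper bound" situation described above.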

Our procedure estimates that the bottleneck event is more ancient than the putative admixture event. We find that precise dating of this bottleneck is difficult because beyond 50,000 years, a change in the date of the bottleneck has very little effect on the pattern of polymorphism. The parameters of this model are presented in

DOI: 10.1371/journal.pgen.0020105.g001

To be able to assess the significance of the pattern of LD, one needs to measure the goodness of fit of the model to confirm that it captures the main features of European and West African demography. We provide quantile–quantile plots between data and simulated distributions in

An important feature of our demographic model is that it reproduces well the ratio of the estimated recombination rate

We now show that we can detect a specific aspect of the level of LD that is directly affected by the level of admixture and that is not captured by the estimator of the recombination rate ρ. Our statistic S^{*} is computed in three versions: S^{*}_{All}, S^{*}_{Yor}, and S^{*}_{CEPH}. While S^{*}_{All} typically captures information about the oldest and deepest branches of the genealogical tree, S^{*}_{CEPH} provides information about branches internal to the European tree. These branches are expected to be the signature of an ancient admixture in Europe. An illustration of what S^{*}_{CEPH} does is provided in

See [

For this gene, S^{*}_{CEPH} picks 18 SNPs divided into three congruent sets. All selected SNPs, denoted by a black dot, are segregating in the European sample and fixed in the Yoruba sample. The associated

DOI: 10.1371/journal.pgen.0020105.g002

To illustrate the efficiency of the method, we compare a scenario where there is no archaic population to another where the admixture level is set to 5% in the European population. We assume the admixture occurred 50,000 years ago and use our best-fitting model, with and without admixture. We simulate 40-kb regions and use the recombination and mutation rates that were estimated from the EGP data.

We compare both simulated distributions of S^{*}_{CEPH} in a quantile–quantile plot. S^{*}_{All} and S^{*}_{Yor} are not significantly affected by this admixture (unpublished data). We find that for S^{*}_{CEPH} the difference between the two simulated means is 60% of the standard deviation (computed under the no-admixture hypothesis). With such values a power study shows that a

Since for S^{*}_{CEPH} many points are far away from the diagonal, we can conclude that the two models are easily distinguishable from each other. This would not be possible based on the distribution of ρ̂_{CEPH}.

DOI: 10.1371/journal.pgen.0020105.g003

To test the null hypothesis of no ancient admixture, we calculate S^{*}_{All}, S^{*}_{Yor}, and S^{*}_{CEPH} for each of the 135 loci. We estimate a p-value at each locus by comparing the observed S^{*} with its simulated distribution.

These 135 p-values are combined into an overall p-value using a χ^{2} distribution with 2 × 135 degrees of freedom.

S^{*}_{All}, S^{*}_{Yor}, S^{*}_{CEPH}, and Average Values of ρ̂ in Each Sample for Three Different Levels of Neanderthal Admixture and Three Scenarios for Recombination Rate Heterogeneity

We first find that a constant recombination rate (within and between loci) cannot account for the distribution of S^{*}_{All}. This observation is consistent with the fact that the observed variability of ρ̂ exceeds the variance of the simulated distribution under the assumption of a uniform recombination rate.

To account for this variability, we consider a random distribution for ρ, under which the model fits the observed distribution of S^{*}_{All}. This consistent fit (see

The deviation from the diagonal line shows a discrepancy between the data and the null model for S^{*}_{CEPH} but not for S^{*}_{All}.

DOI: 10.1371/journal.pgen.0020105.g004

We investigated a third model of recombination rate variability. In this model, we assume a uniform background rate and a random number of hotspots (parameters were estimated based on [

However, based on the values of S^{*}_{CEPH} we find that within the CEPH sample the level of LD is higher than predicted by our model. This discrepancy is very strong and observed for all models of recombination. Even setting the recombination rate to zero does not account for the large values of S^{*}_{CEPH} in the CEPH sample. The values of S^{*}_{Yor} found in the data are also significantly higher than expected. The p-values are on the order of 10^{−7} for both samples, depending on the recombination model one considers.

We now consider the effect of ancestral admixture on our inference procedure and show that it significantly improves the fit of our model, indicating strong evidence in favor of some form of ancestral admixture in the history of European and West African populations. We use the following approach: for different levels of admixture we reestimate the demographic parameters, and investigate whether the newly estimated demography is consistent with the observed distribution of S^{*}.

We first observe that the level of admixture in Europe has little effect on the average ρ̂ in the European sample.

Second, in the presence of admixture, the estimated demographic parameters are only slightly modified. The log_{10}-ratio between the maximum values estimated with a 5% admixture and with no admixture equals three.

Third, this putative admixture in the European sample has a limited effect on the distribution of S^{*}_{All} and S^{*}_{Yor}. However, it strongly increases the values of S^{*}_{CEPH}, bringing them closer to the values of S^{*}_{CEPH} observed in the data.

We use a model-based approach to describe the history of European and West African populations. This model predicts what pattern of polymorphism to expect in the absence of ancient admixture, and how such an event would affect the data. In the absence of admixture, the comparison of the data with our model shows a clear discrepancy that can be explained by an admixture rate of 5% in the European population. Even though we cannot exclude the possibility that an alternative demographic scenario is the cause of this pattern, this aspect of the data was chosen to be the most sensitive to an admixture event.

If the signal we observe is indeed the result of an admixture event, then these results would change our understanding of the origins of modern humans. It would indicate that archaic populations such as Neanderthals must have made a substantial contribution to the modern gene pool in Europe. We observe a similar pattern for West African populations even though a clear source population has not yet been found.

While the putative source population may not be as obvious as in Europe (Neanderthals), the fossil record shows that transitional forms of Homo were widespread in Africa, even after the time of emergence of modern humans. Other genetic studies have also found evidence for ancient structure in African populations [

Our model was designed to be as simple as possible while still capturing the main features of human polymorphism. We assessed qualitatively its goodness of fit and we found that it fits the data well: both the statistics we fitted and also the estimates of the recombination rate in both populations are consistent with the data. Our model makes simpler and fewer assumptions about human demography than a previous study with different findings [

Our inference procedure based on summary statistics has similarities to two previous studies [

We found that the choice of summary statistics is very important in our inference procedure. In particular, not incorporating a statistic which measures the level of migration between European and African branches leads to a different maximum of the composite likelihood where the divergence date is much lower and the estimates of LD are biased. If we had not added this summary statistic we could not have observed this poor goodness of fit. Hence, we tried to fit in our inference procedure all components of the data which seemed relevant. However, we cannot exclude that an important feature has been missed because our summary statistics cannot measure it.

Our inference procedure cannot clearly reject an island model where both populations remain separated indefinitely with a low level of migration. Using a χ^{2} approximation, the p-value is approximately 10^{−3}. This is relatively low, but our composite-likelihood approach is likely to narrow confidence intervals, and given the uncertainty regarding our model (constant migration rate in particular) we cannot clearly exclude this model. However, the discrepancy between observations and expectations remains unchanged when looking at S^{*}_{Yor} and S^{*}_{CEPH} (p ≈ 10^{−7}), and this choice does not significantly affect our main findings. We note that an equilibrium island model, along with all models incorporating a substantial element of ancient admixture, is not compatible with simple forms of the RAO model.

We investigated the pattern of polymorphism of the SNPs selected by our method in the two other samples available in the NIEHS dataset: Hispanics and Han Chinese. We found that 75% of the SNPs selected by S^{*}_{Yor} in the Yoruba sample are fixed in the Hispanic and Chinese samples. However, this number is not significantly different from that for other SNPs segregating in the Yoruba sample and fixed in the CEPH sample. Most SNPs selected by S^{*}_{CEPH} in the CEPH sample are variable in the Hispanic (90%) and Chinese (50%) samples. These numbers also do not differ significantly from those for other SNPs fixed in the Yoruba sample and segregating in the CEPH sample. However, there is no clear expectation regarding those proportions because we do not know precisely where and when the admixture occurred. In addition, the pattern of polymorphism has been affected by recent migrations, in particular between non-African populations.

Some alternate explanations can potentially explain the elevated values of ^{*},^{*}^{*}_{CEPH} are not identified as positively selected by Voight et al. [^{*}_{CEPH}) in the ADH cluster, where Voight et al. [

An advantage of our method is that in addition to showing some evidence in favor of a significant rate of admixture, we also specifically pick a subset of candidate “archaic” SNPs. The only way to be certain of the answer would be to verify that these SNPs are indeed mutated in the DNA of Neanderthal fossils. Estimating the significance of the observed pattern is difficult because modeling human demography is a complex task. However, it is likely that if there has been a significant level of admixture, the SNPs selected by S^{*}

Our approach requires resequencing polymorphism data free of ascertainment bias and we based our analyses on data from the EGP. The data were generated at the University of Washington (Seattle, Washington, United States) (see [

Our analysis requires that the ethnicity of the samples is known. 135 out of the 505 genes in the EGP dataset fit this criterion. The average length of the genes included in the study is 50 kb. The average sequenced length is 25 kb. We restricted our study to the 12 Yoruba and 22 Caucasian (CEPH) individuals. We excluded genes on the sex chromosomes. Indels as well as SNPs with more than two alleles are also excluded.

We make several assumptions about the demographic scenario in order to limit the number of parameters to estimate. First, we assume that the population growth is 100-fold in each population. Second, we assume that the bottleneck lasts 1,000 years, and we only estimate the reduction of the population size during this period. The six parameters left to be estimated are: the date of the beginning of growth (one parameter for each population), the date of the bottleneck, and its intensity (reduction of the population size during the bottleneck), the date of divergence between European and African populations, and the migration rate after this divergence.

We use a grid for the parameter values on which we estimate the composite likelihood. The grid consists of ten values per parameter (10^{6} total) and we refined this grid several times to locate the maximum of the composite-likelihood surface. For each value of the parameters, the same set of simulations is used across different loci in the data (using the mean sequenced length). The log-likelihood is then summed across loci to obtain the final value.
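The grid evaluation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-locus log-likelihood callables, and the specific refinement rule (a new evenly spaced axis spanning the neighbours of the current best value) are assumptions, since the text says only that the grid was refined several times.

```python
import itertools


def grid_search(axes, locus_logliks):
    """Evaluate the summed log-likelihood at every point of a grid.

    axes: dict mapping parameter name -> list of candidate values.
    locus_logliks: one callable per locus, mapping a parameter dict to a
        log-likelihood (hypothetical stand-ins for the simulation step).
    """
    names = list(axes)
    best_params, best_value = None, float("-inf")
    for values in itertools.product(*(axes[name] for name in names)):
        params = dict(zip(names, values))
        total = sum(f(params) for f in locus_logliks)  # sum across loci
        if total > best_value:
            best_params, best_value = params, total
    return best_params, best_value


def refine_axis(values, best, n_points=10):
    """Build a finer, evenly spaced axis around the current best value."""
    i = values.index(best)
    low = values[max(i - 1, 0)]
    high = values[min(i + 1, len(values) - 1)]
    step = (high - low) / (n_points - 1)
    return [low + k * step for k in range(n_points)]


# Toy example with two parameters and two "loci".
toy_axes = {"a": [0, 1, 2, 3], "b": [0, 1, 2]}
toy_loci = [lambda p: -(p["a"] - 2) ** 2, lambda p: -(p["b"] - 1) ** 2]
toy_best, toy_value = grid_search(toy_axes, toy_loci)
finer_a = refine_axis(toy_axes["a"], toy_best["a"], n_points=5)
```

With six parameters and ten values per axis, the product loop visits the 10^{6} points mentioned above, which is why the evaluation at each point has to be cheap or stopped early.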

The first set of summary statistics consists of Tajima's ^{*} in the CEPH sample, and _{ST}

The second set of summary statistics divides the SNPs at a locus into four categories: private in the CEPH sample, private in the Yoruba sample, and segregating in both samples at low or high frequency (we set the threshold at 10%). For each branch of the ARG, a mutation on that branch falls into one of these classes, which gives the probabilities (p_{1}, p_{2}, p_{3}, p_{4}) for a random SNP to belong to each of the four classes. Conditional on the ARG and the total number of SNPs, the distribution of the counts (n_{1}, n_{2}, n_{3}, n_{4}) is multinomial and can be obtained explicitly. By simulating a large number of ARGs and averaging the computed probabilities over the simulated ARGs, we estimate the likelihood of the observed counts.
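The averaging step can be sketched in Python. The names below are hypothetical, and the class probabilities for each simulated ARG are taken as given inputs; computing them from branch lengths is outside the scope of this sketch.

```python
import math


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` under a multinomial with `probs`."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)  # log(n!)
    for c, p in zip(counts, probs):
        if c > 0 and p == 0.0:
            return 0.0  # an observed class that this ARG cannot produce
        log_p -= math.lgamma(c + 1)
        if c > 0:
            log_p += c * math.log(p)
    return math.exp(log_p)


def averaged_likelihood(counts, class_probs_per_arg):
    """Monte Carlo estimate: mean multinomial probability over simulated ARGs."""
    total = sum(multinomial_pmf(counts, probs) for probs in class_probs_per_arg)
    return total / len(class_probs_per_arg)


# Toy input: two SNPs observed, two simulated ARGs.
toy_counts = (1, 1, 0, 0)
toy_args = [(0.5, 0.5, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)]
toy_likelihood = averaged_likelihood(toy_counts, toy_args)
```

Averaging probabilities (not log-probabilities) over ARGs is what makes this an estimate of the marginal likelihood; the logarithm is taken only after the average.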

For each point on the grid, we found that 80,000 simulated ARGs are needed to obtain a precise value of the likelihood. The computation of the likelihood of the data at one point of the grid requires approximately five minutes on a 1.8-GHz Opteron processor. A significant amount of time is saved by stopping the computation after 20,000 simulations if the estimated likelihood at this point of the grid is clearly lower than the current maximum. The total computing time over the 10^{6} points of the grid takes approximately three days on 100 processors. All simulations in this paper use a modified version of

We consider various scenarios for the recombination rate. To describe the simulated distribution of

A second model includes variability between loci but not within. Specifically, we set the distribution of

Finally, we consider a model consisting of a background rate and a random number of hotspots. The background rate is variable with mean μ =
f̄ = 0.21 and standard deviation

We define S^{*}

The score is computed as a sum over successive pairs of SNPs in the optimum subset

We call the distance between two SNPs the number of chromosomes in the data at which the genotypes differ. Note that when—for a given pair of SNPs—an individual is a double heterozygote, we assume that the distance between both SNPs is zero (in other words we assume that both genotypes are in phase). If the total distance (summed over all individuals) between two SNPs is zero, both sites are congruent and the score is equal to the distance in bp between them plus 5,000. If this distance is greater than five, the score is set to −∞. If the distance is between one and five, the score is equal to −10,000. We also impose a minimum distance between sites within the optimum set
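The scoring rules above can be sketched in Python. Several choices here are assumptions rather than the authors' implementation: genotypes are coded as 0, 1, or 2 copies of one allele per individual, so that the per-individual distance is the absolute difference (which gives zero for a double heterozygote); the maximization over subsets is done with a simple dynamic program over SNPs sorted by position; and the minimum-distance constraint and the missing-data rules are left out.

```python
import math


def pair_score(pos_a, pos_b, geno_a, geno_b):
    """Score for two successive SNPs in a candidate subset."""
    distance = sum(abs(a - b) for a, b in zip(geno_a, geno_b))
    if distance == 0:
        return abs(pos_b - pos_a) + 5000  # congruent sites
    if distance > 5:
        return -math.inf
    return -10000  # distance between one and five


def s_star(positions, genotypes):
    """Maximum, over subsets of at least two SNPs, of the summed pair scores.

    positions: SNP coordinates in bp, sorted in increasing order.
    genotypes: one list per SNP with a 0/1/2 code for each individual.
    """
    n = len(positions)
    if n < 2:
        raise ValueError("at least two SNPs are needed")
    end = [0.0] * n  # best chain ending at SNP j (a single SNP scores 0)
    best = -math.inf
    for j in range(1, n):
        for i in range(j):
            candidate = end[i] + pair_score(
                positions[i], positions[j], genotypes[i], genotypes[j]
            )
            best = max(best, candidate)
            end[j] = max(end[j], candidate)
    return best


# Toy data: three SNPs typed in four individuals.
toy_positions = [1000, 3000, 9000]
congruent = [[0, 1, 0, 2], [0, 1, 0, 2], [0, 1, 0, 2]]
one_mismatch = [[0, 1, 0, 2], [0, 1, 0, 2], [1, 1, 0, 2]]
score_congruent = s_star(toy_positions, congruent)
score_mismatch = s_star(toy_positions, one_mismatch)
```

In the toy data, three congruent SNPs are all kept, whereas the SNP with one mismatch is left out of the optimum subset because including it would cost 10,000.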

We also need to account for missing data. For a pair of SNPs to be congruent, we allow no more than two chromosomes with the property that a missing call at one SNP is associated with the minor allele at the other SNP. When one of the two SNPs has a minor allele frequency of two, we make this criterion more stringent and allow only one such chromosome.

The computation of S^{*}

Because each locus has a different length, and different regions were not scanned for polymorphism, different sets of simulations (which reproduce these precise characteristics) are used for each locus to estimate the distribution of S^{*}_{All}, S^{*}_{Yor}, and S^{*}_{CEPH}. On each simulated ARG we place a number of mutations equal to the number of variable sites at this locus in the data. This is done to avoid biases due to variability in the mutation rate. Also, a random fraction of the genotyping calls is labeled missing. The probability of being missing is equal to the fraction of missing calls in the data at this locus.

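For each simulated genealogical tree, we obtain a value for S^{*}_{All}, S^{*}_{Yor}, and S^{*}_{CEPH}. The p-value at a locus is defined as ℙ(S^{*} ≥ S^{*}_{data}), where S^{*}_{data} is the value computed from the data. We can obtain an overall p-value by combining the per-locus p-values using a χ^{2} distribution with 2n degrees of freedom, where n is the number of loci.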
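The two steps (an empirical p-value per locus, then a combination across loci) can be sketched as below. The sketch assumes that the combination is Fisher's method, which is what a chi-square with twice the number of loci as degrees of freedom suggests. The add-one correction in the empirical p-value is an addition made here to avoid taking the logarithm of zero and is not taken from the text.

```python
import math


def empirical_p_value(simulated, observed):
    """Estimate P(S* >= S*_data) from simulated values.

    Assumption: an add-one correction keeps the estimate strictly positive.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(log p_i) is compared with a chi-square with 2n
    degrees of freedom. For even degrees of freedom the survival function
    has the closed form exp(-x/2) * sum_{k<n} (x/2)^k / k!.
    """
    n = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals x / 2
    term = math.exp(-half_x)  # k = 0 term of the series
    total = term
    for k in range(1, n):
        term *= half_x / k
        total += term
    return min(total, 1.0)


# Toy usage with hypothetical simulated values at one locus.
p_locus = empirical_p_value([1.0, 2.0, 3.0, 4.0], 3.0)
p_overall = fisher_combined_p([0.05, 0.05])
```

The closed form avoids a dependency on a statistics library; for very small combined p-values the exponential can underflow to zero in floating point, so a log-scale computation would be preferable in that regime.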


ARG: ancestral recombination graph

CEPH: Centre d'Etude du Polymorphisme Humain 1980 database of people living in Utah with ancestry from Northern and Western Europe

EGP: Environmental Genome Project

RAO: recent African origin