Joint Estimation of Contamination, Error and Demography for Nuclear DNA from Ancient Humans

When sequencing an ancient DNA sample from a hominin fossil, DNA from present-day humans involved in excavation and extraction will be sequenced along with the endogenous material. This type of contamination is problematic for downstream analyses as it will introduce a bias towards the population of the contaminating individual(s). Quantifying the extent of contamination is a crucial step as it allows researchers to account for possible biases that may arise in downstream genetic analyses. Here, we present an MCMC algorithm to co-estimate the contamination rate, sequencing error rate and demographic parameters—including drift times and admixture rates—for an ancient nuclear genome obtained from human remains, when the putative contaminating DNA comes from present-day humans. We assume we have a large panel representing the putative contaminant population (e.g. European, East Asian or African). The method is implemented in a C++ program called ‘Demographic Inference with Contamination and Error’ (DICE). We applied it to simulations and genome data from ancient Neanderthals and modern humans. With reasonable levels of genome sequence coverage (>3X), we find we can recover accurate estimates of all these parameters, even when the contamination rate is as high as 50%.


Author Summary
nature of the data in O will be explained below, and will vary in each of the 98 different cases we describe. The parameters contained in Ω may simply be  For all models we will describe, the probability of the data can be defined (1) Here, i is the true (unknown) genotype of the ancient sample, and P [i |Ω, O] 108 is the probability of genotype i given the demographic parameters and the 109 data. 110 We focus now on computation on the likelihood for one site j in the 111 genome. In the following, we abuse notation and drop the subscript j. Given where 116 q 2 = r C (w(1 − ) + (1 − w) ) + (1 − r C )(1 − ) (4) q 1 = r C (w(1 − ) + (1 − w) ) + (1 − r C ) ((1 − )/2 + /2) (5) In the sections below, we will turn to the more complicated part of the model, which is obtaining the probability P [i|Ω, O] for a genotype in the 118 ancient sample, given particular demographic parameters and additional data 119 available. We will do this in different ways, depending on the kind of data 120 we have at hand. ination from present-day humans, from which many genomes are available.

127
For all analyses below, we restrict to sites where 0 < y j < 1. Note 128 that it is entirely possible (but not required) that y = w, meaning that, 129 aside from the ancient DNA sample, the only additional data we have are Appendix A, and lead to the following formulas: We generated 10,000 neutral simulations using msms [19] for different 152 choices of τ C and τ A (with θ = 20 in each simulation) to verify our analytic 153 expressions were correct ( Figure 2). The probability does not depend on θ, 154 so the choice of this value is arbitrary. 155 The above probabilities allows us to finally obtain P [i | y j , Ω, O].  2) Two panels of genomes from two "anchor" populations that may be 167 related to the ancient DNA sample. One of these populations -called pop-168 ulation Y -may (but need not) be the same population as the contaminant 169 and may (but need not) have received admixture from the ancient population 170 (Figure 1.B). The sample frequencies for this population will be labeled as 171 y. The other population -called Z -will have sample frequencies labeled z. 172 We will assume the drift times separating these two populations are known 173 (parameters τ Y and τ Z in Figure 1.B). This is a reasonable assumption as 174 these parameters can be accurately estimated without the need of using an 175 ancient outgroup sample, as long as admixture is not extremely high. 176 We can then estimate the remaining drift parameters, the error and contamination rates and the admixture time (β) and rate (α) between the archaic population and modern population Y . The diffusion solution for this threepopulation scenario with admixture is very difficult to obtain analytically. PhiX dataset distribution, we generated quality scores for each nucleotide.

303
Fragments were mapped back to a random individual from the contaminant 304 panel. Figure 6 shows DICE's performance on this scenario with different 305 error models. In all cases, we find that the parameters are estimated with 306 high accuracy. As expected, the ts/tv model infers a higher error rate at 307 transitions, due to the additional errors introduced by deamination on the 308 ends of the ancient fragments.  Additionally, we simulated a scenario in which only a single human con-319 taminated the sample. That is, rather than drawing contaminant fragments 320 from a panel of individuals, we randomly picked a set of two chromosomes 321 at each unlinked site and only drew contaminant fragments from those two 322 chromosomes. Figure S13 shows that inference is robust to this scenario, 323 unless the contamination rate is very high (25%). In that case, the drift 324 of the archaic genome is substantially under-estimated, but the error, con-  We sought to see whether we would use our method to identify the con-  Additionally, Figures S17, S18, S19 show the effect of misspecifying the con-

378
The EUR panel is the one with the highest posterior mode (Table 1). 379 We then tested a variety of ancient DNA nuclear genome sequences at 380 different levels of coverage, obtained via different methods (shotgun sequenc-381 ing and SNP capture) and from different hominin groups (modern humans 382 and Neanderthals). We used AFR as the anchor panel and either AFR (Ta-383 ble S1) or EUR (Table S2)   using Africans as the anchor population and Europeans as the contaminant 419 population ( Figure S20). We observe that contamination rates are higher in  accurately estimated than when the admixture event is ancient (Figure 7).

464
As before, we also verified that the posterior mode was a good proxy to  Table   491 2).  higher posterior mode than another is no guarantee that it is a better fit to 518 the data for demographic scenarios that may be different from the ones we 519 simulated.

520
We also applied our method to empirical data, specifically to two Ne-     . We tested the performance of the two-population method under a variety of drift and contamination scenarios for a sample of very low (0.5X) or very high (30X) coverage. We found that we needed more sites (≈ 1.6 million) to obtain accurate estimates from the low coverage sample. The MCMC chain was also run for a longer time (1 million steps). The top row shows the absolute difference between the estimated and the simulated contamination rate, while the bottom row shows the absolute difference corresponding to the anchor drift. In all simulations, the anchor drift was set to be equal to the ancient sample drift.    Table S2. We applied the two-population method to ancient Neanderthal and modern human genomes ranging from 52X to 0.054X coverage. We tested both shotgun-sequencing data and SNP capture data. We used AFR as the anchor panel and     Table S4. Posterior modes of parameter estimates under the three-population inference framework for the Mezmaiskaya Neanderthal autosomal genome. We used different 1000G populations as candidate contaminants. In all cases, Africans were the unadmixed anchor population and Europeans were the admixed anchor population. The ancestral human drift refers to the drift in the modern human branch before the split of Europeans and Africans. The post-split European-specific and African-specific drifts were estimated separately without the archaic genome (τ Af r = 0.009, τ Eur = 0.255). In all cases, the Neanderthal drift parameter gets stuck at the upper boundary (5 drift units) of parameter space.  Figure S2. Absolute difference between estimated and simulated contamination rates for a variety of anchor drift and contamination scenarios, for different levels of coverage. In all simulations, the anchor drift was set to be equal to the ancient sample drift. Figure S3. Absolute difference between estimated and simulated anchor drifts for a variety of anchor drift and contamination scenarios, for different levels of coverage. In all simulations, the anchor drift was set to be equal to the ancient sample drift. Figure S4. Absolute difference between estimated and simulated ancient sample drifts for a variety of anchor drift and contamination scenarios, for different levels of coverage. In all simulations, the anchor drift was set to be equal to the ancient sample drift. Figure S5. Absolute difference between estimated and simulated contamination rates for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). Here, we used a large number of sites and run the MCMC chain for 1 million steps. In all simulations, the anchor drift was set to be equal to the ancient sample drift. Figure S6. Absolute difference between estimated and simulated anchor drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). Here, we used a large number of sites and run the MCMC chain for 1 million steps. In all simulations, the anchor drift was set to be equal to the ancient sample drift. Figure S7. Absolute difference between estimated and simulated ancient sample drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). Here, we used a large number of sites and run the MCMC chain for 1 million steps. In all simulations, the anchor drift was set to be equal to the ancient sample drift. Figure S8. Absolute difference between estimated and simulated contamination rates for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). We used a large number of sites and run 10 MCMC chains for 1 million steps each. To ensure convergence, we then selected the chain with the highest posterior probability, and here show estimates from that chain. In all simulations, the anchor drift was set to be equal to the ancient sample drift Figure S9. Absolute difference between estimated and simulated anchor drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). We used a large number of sites and run 10 MCMC chains for 1 million steps each. To ensure convergence, we then selected the chain with the highest posterior probability, and here show estimates from that chain. In all simulations, the anchor drift was set to be equal to the ancient sample drift Figure S10. Absolute difference between estimated and simulated ancient sample drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). We used a large number of sites and run 10 MCMC chains for 1 million steps each. To ensure convergence, we then selected the chain with the highest posterior probability, and here show estimates from that chain. In all simulations, the anchor drift was set to be equal to the ancient sample drift. Figure S11. Estimation of parameters for a high-coverage ancient DNA genome (30X) with low sequencing error (0.1%), a large anchor population panel (100 haploid genomes) and admixture in the anchor population from the archaic population (5%), using the two-population inference framework, which does not model admixture. Error bars represent 95% posterior intervals.     . We added a 1 to the difference to be able to plot the difference on a logarithmic scale. The three panels contain results for three admixture scenarios (from left to right: admixture rate of 0%, 5% and 50%) and each panel shows the difference under different contamination rates and demographic models (see Figure S15). Figure S17. Parameters estimates under the two-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 0%. The anchor panel used was panel D (see Figure S15). Figure S18. Parameters estimates under the two-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 5%. The anchor panel used was panel D (see Figure S15). Figure S19. Parameters estimates under the two-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 50%. The anchor panel used was panel D (see Figure S15). Figure S20. Estimation of parameters for the Altai Neanderthal genome across different GC levels using the two-population model, while keeping (black) or removing (red) CpG sites from the input dataset. Error bars represent 95% posterior intervals. Figure S21. We tested one of the Yoruba genomes from Prüfer et al. [4] and obtain an estimate of 0% contamination, regardless of whether we use Europeans or Africans as the candidate contaminant. The anchor drift time is close to 0 when using Africans as the anchor population, as the sample belongs to that same population, while it is non-zero (= 0.22) when using Europeans. Error bars represent 95% posterior intervals.    . We added a 1 to the difference to be able to plot the difference on a logarithmic scale. The three panels contain results for three admixture scenarios (from left to right: admixture rate of 0%, 5% and 50%) and each panel shows the difference under different contamination rates and demographic models (see Figure  S15). Figure S26. Parameters estimates under the three-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 0%. The unadmixed anchor panel used was panel D and the admixed anchor panel was panel B (see Figure S15). Figure S27. Parameters estimates under the three-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 5%. The unadmixed anchor panel used was panel D and the admixed anchor panel was panel B (see Figure S15). Figure S28. Parameters estimates under the three-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 50%. The unadmixed anchor panel used was panel D and the admixed anchor panel was panel B (see Figure S15). and is homozygous for the ancient allele in the ancient individual. Then: In the above expressions, the functions f depend on τ C and τ A , but we 737 omit this conditioning for ease of notation. As can be seen, all we need To obtain f AA y , we know that, assuming the anchor population to be at So we finally obtain: We now have all the elements necessary to obtain the conditional probabil-

786
Let v be the base that was originally sampled at a given site, before v under the assumption that R j was mapped at the correct genomic location. 807 We have: