The Strength of Selection against Neanderthal Introgression

Hybridization between humans and Neanderthals has resulted in a low level of Neanderthal ancestry scattered across the genomes of many modern-day humans. After hybridization, on average, selection appears to have removed Neanderthal alleles from the human population. Quantifying the strength and causes of this selection against Neanderthal ancestry is key to understanding our relationship to Neanderthals and, more broadly, how populations remain distinct after secondary contact. Here, we develop a novel method for estimating the genome-wide average strength of selection and the density of selected sites using estimates of Neanderthal allele frequency along the genomes of modern-day humans. We confirm that East Asians had somewhat higher initial levels of Neanderthal ancestry than Europeans even after accounting for selection. We find that the bulk of purifying selection against Neanderthal ancestry is best understood as acting on many weakly deleterious alleles. We propose that the majority of these alleles were effectively neutral—and segregating at high frequency—in Neanderthals, but became selected against after entering human populations of much larger effective size. While individually of small effect, these alleles potentially imposed a heavy genetic load on the early-generation human–Neanderthal hybrids. This work suggests that differences in effective population size may play a far more important role in shaping levels of introgression than previously thought.

S 1 S 1 homozygotes are very rare and can be ignored. Numerical and individual-based simulations (see below) show that this assumption is reasonable for the parameter values we consider. We assume that, prior to admixture, Neanderthals and humans were fixed for allele N 1 and N 2 , respectively. At the time of introgression, the frequency of N 1 rises instantaneously from 0 to p 0 in the human population.
In the following, we describe how p t , the frequency of N 1 t generations after introgression, depends on its initial frequency p 0 , the heterozygote selection coefficient s, and the recombination rate r. Let x 0 and y 0 denote the frequency of haplotypes S 1 N 1 and S 2 N 1 in humans at the time of admixture. Hence, the frequency of allele N 1 immediately after admixture is In the following we assume that x 0 + y 0 1, and so we can ignore the effects of selection against homozygotes.
After recombination and random union of gametes, the haplotype frequencies in zygotes can be approximated by still assuming that S 1 is initially rare in humans.
After viability selection, the haplotype frequencies in the next generation of adult humans become From here, it is straightforward to obtain the explicit equations for the haplotype frequencies at generation t as This can be simplified to Plugging equation (5) into equation (1), we obtain where As time t goes to infinity, f (r, s, t) approaches Bengtsson's [1] gene flow factor for the case when selection happens before migration (cf. Eq. A3 in [2]): If s and r are small, at equilibrium f (r, s, t) further simplifies to r/(r + s), which is equal to the approximation to the effective rate of gene flow found by Petry [3] based on a diffusion approximation. If y 0 = 0, we can replace x 0 by p 0 in the equations above.

A single X-chromosomal locus under selection
The non-pseudoautosomal X chromosome is a special case, because of the differences in transmission and the fact that recombination only happens in females. We take this into account by modifying Eq. (2) to The factor of 2/3 accounts for the fact that two thirds of all X chromosomes are found in females.
As above, we assume that selection acts on viability and after recombination and random mating, but we now allow for selection to be sex specific. We denote by s f and s m the strength of selection against female and male carriers of a single S 1 allele, respectively. Recalling that, at the time of selection, a proportion of 2/3 of the X chromosomes are found in females, and 1/3 in males, we obtain the haplotype frequencies in adults of the next generation as Iteration of Eq. (10) yields the explicit haplotype frequencies at time t, The frequency of allele N 1 at time t, p X,t = x X,t + y X,t , can therefore be written as By setting s f = s m = s and allowing recombination to happen in the entire population, i.e. by replacing 1/3 and 2/3 by 0 and 1, respectively, we recover Eq. (6), as expected. At equilibrium (t = ∞), Eq. (12) becomes which can be approximated as if both selection and recombination are weak (s m , s f , r 1).

Multiple autosomal loci under selection
In the following, we consider a neutral locus embedded in a suite of multiple autosomal loci under purifying selection. Due to the complexities of multilocus models, we make the following simplifications. First, we assume that, immediately before introgression, all deleterious alleles were fixed in Neanderthals, but absent in humans. Second, we directly consider the situation where the frequency of the neutral Neanderthal-derived alleles has reached equilibrium in the human population (i.e. t = ∞). This can be justified if purifying selection is strong relative to the time since introgression, the admixture proportion, and the recombination rate. Third, we directly draw on the above-mentioned analogy between the surviving proportion of a cohort of introgressed alleles and the effective rate of gene flow at a neutral locus experiencing linkagemediated selection (cf. Eq. 7 and subsequent text). We can therefore approximate the equilibrium frequency of the introgressed neutral allele N 1 by modifying the multilocus version of the effective migration rate in Eq. (24) of reference [4].
Specifically, let there be I and J loci under selection on the left-and right-hand side of the focal neutral site, and denote them by A i and B j , respectively. The equilibrium frequency of the introgressed neutral allele N 1 can then be approximated as where a i and b i are the selection coefficients against the deleterious mutations at locus A i and B j , respectively, and r Ai and r Bj are the recombination fractions between the neutral locus and the respective locus under selection. Equation (15) assumes that both selection and recombination are weak.
If we set the selection coefficient at all loci under selection to the same value s, Eq. (15) simplifies to where r i and r j are short cuts for r Ai and r Bj , respectively.
To assess the accuracy of Eq. (15), we derived discrete-time recursion equations for a model with one neutral and two selected loci. As before, we assumed that the admixture proportion is small, so that homozygous carriers of Neanderthal-derived alleles are very rare in the human population and the dynamics of the full diploid model can be approximated by considering a haploid model with four haplotypes. In addition, we assumed that the mean fitness of the human population was not affected by the few carriers of deleterious introgressed mutations. To simplify our notation, we denote the two loci under selection by A and B, and use A 1 (A 2 ) and B 1 (B 2 ) for the deleterious (advantageous) alleles at locus A and B, respectively. We consider the following two configurations: one in which the neutral locus N is flanked by one locus under selection on each side (A-N-B), and another where the neutral locus is flanked by two selected loci on one side (A-B-N). We denote by r XY the recombination rate between locus X and Y, where r XY = r YX , and we assume that recombination distances accumulate additively across loci.
For configuration A-N-B, the four focal haplotypes are We denote their frequencies among adults of generation t after viability selection but before recombination and random mating by , and x 4 (t), respectively. After recombination, random mating, and viability selection, the haplotype frequencies among adults of the next generation are where w i denotes the relative fitness of haplotype i.
For configuration A-B-N, the four focal haplotypes are In this case, the haplotype frequencies follow the following recursions: For both configurations, the frequency of the introgressed neutral allele N 1 at time t can be approximated by where the x i (t) evolve according to Eq. (17) and (18), depending on the configuration.
We assume that fitness is additive across loci, and parametrize it as where 0 ≤ a, b ≤ 1.
We numerically iterated Eqs. (17) and (18) and computed p t according to Eq. (19) at each step until an equilibrium was reached. Specifically, we terminated the process when the absolute difference between consecutive predicted allele frequencies p t and p t+1 became smaller than 10 −9 .
We also iterated Eqs. (17) and (18)  (13) and Eq. (24) from reference [4] suggest that the equilibrium frequency of the introgressed neutral allele N 1 may be approximated as where a i,f (a i,m ) and b j,f (b j,m ) are the coefficients of selection against heterozygous carriers of the deleterious mutation at locus A i and B j in females (males). Moreover, r Ai and r Bj are the recombination rates between the neutral locus and locus A i and B j , respectively. The factor of 1/2 accounts for the fact that a given X-chromosomal haplotype spends half of its time in males relative to the time spent in females.
If we fix the selection coefficients in females and males across all loci to s f and s m as above, where r i and r j are short cuts for r Ai and r Bj , respectively.
To test our conjecture in Eq. (21), we derived discrete-time recursions for a model with one neutral and two selected loci analogous to those in Eq. (17) and (18), but for the case of the X chromosome. As before, we assumed that the admixture proportion is small, so that the diploid model can be approximated by a haploid model with four haplotypes, and the mean fitness of the human population is not affected by introgression of deleterious mutations. We again simplify our notation by denoting the two loci under selection by A and B, and we use A 1 (A 2 ) and B 1 (B 2 ) for the deleterious (advantageous) alleles at locus A and B, respectively. As above, we consider the two configurations A-N-B and A-B-N, and assume that recombination rates accumulate additively across loci.
For configuration A-N-B, the four haplotypes of interest are and A 2 N 1 B 2 , and their frequencies after recombination in generation t are given by respectively. After random mating and viability selection, the frequency of haplotype i among adults in the next generation is where m i and f i are the relative fitnesses of haplotype i in males and females, respectively. The frequency of the introgressed neutral allele N 1 at time t is then obtained from Eq. (19), where the x i (t) behave as described in Eq. (24).
For configuration A-B-N, the four haplotypes of interest are and A 2 B 2 N 1 . Their frequencies after recombination in generation t are respectively. Note that r AN = r AB + r BN by assumption. Equations (24) and (19) remain unchanged.
As above, we assume that fitness is additive across loci. For females, we parametrize fitnesses as and for males, we set where 0 ≤ a f , a m , b f , b m ≤ 1.
We numerically iterated Eq. (24) and computed p t according to Eq. (19) at each step until an equilibrium was reached. As above, we terminated the process when the absolute difference between consecutive allele frequencies p t and p t+1 became smaller than 10 −9 . We also iterated

Evaluating the accuracy of approximations
We evaluated the accuracy of Eqs. We chose the range for s such that it includes the point estimate obtained from our inference procedure (see S2 Text for details).
First, we tested the effects of ignoring homozygous individuals by numerically iterating discrete-time recursions for a model with two loci under selection, after setting the selection coefficient at one of these to zero. For the case of autosomes, we used the recursions given in Eq.
(2.9) on page 45 in reference [5] (cf. [6]). For the X chromosome, we used those from Eq. (10) in reference [7] after correcting them by substituting F for 2F . We verified Eqs. (6) and (8), and Eqs. (12) and (13) by comparison to values obtained through iteration of the recursions for the autosomes and X chromosome, respectively. To obtain the equilibrium values using recursions, we iterated the latter until the difference in allele frequency between two generations was less than 10 −9 . The results of this comparison suggest that ignoring homozygous individuals does not affect p t substantially (S1 Fig-S4 Fig).
Our inference method (S2 Text) is based on the deterministic expressions for the Neanderthal allele frequency derived above, and we view drift as 'noise' around these expectations. In a second test, we therefore assessed whether genetic drift has a substantial influence on the average frequency of the focal neutral allele. We did this by performing individual-based simulations and calculating the difference between the frequency of the neutral alleles after t = 2000 generations (p t,sim ) and that obtained from Eq. (6)  In conclusion, our tests suggest that for the parameter range of interest, our approximations describe the expected frequency of neutral alleles well.

Models of waves of introgression
Hybridization with Neanderthals likely happened over many generations, rather than in one generation as we assumed above with the single-pulse admixture model. Furthermore, there is evidence of at least two waves of Neanderthal introgression into the East Asian population [8][9][10][11].
In this section, we derive expressions for the present-day frequency of a neutral Neanderthal allele linked to a single locus exposed to purifying selection under a number of more complicated models. Specifically, we start by considering how haplotype frequencies change over time during continuous introgression. We then use this result to derive the equivalent expressions for models with one and two discrete waves of introgression. Lastly, we show that the a single-pulse model cannot be distinguished from the wave models if the onset and duration of admixture periods are unknown.

Haplotype frequencies during continuous introgression
In the simplest model of continuous introgression, each generation a constant fraction of N 1 S 1 and N 1 S 2 haplotypes flow into the human population. We consider a slightly more complex model where the fraction of haplotypes that introgresses into humans at the initial admixture event is x 0 and y 0 , but mx 0 and ny 0 for all future generations, where m > 0 and n > 0. This parametrization will be useful when we derive a model of two waves of introgression below.
The following events happen each generation: 1) admixture (inflow of Neanderthal alleles) at the adult stage, 2) recombination, mating, gamete production, random union of gametes, and 3) viability selection. During the initial admixture event, the frequency of N 1 S 1 and N 1 S 2 in humans rises instantaneously from 0 to x 0 and y 0 , respectively. Recombination and viability selection then alter the haplotype frequencies according to Eq. (2) and (3), respectively. From the first generation onward, introgression increases the frequency of Neanderthal haplotypes in humans by mx 0 and ny 0 . Therefore, for generations t = 1, 2, 3, . . . , the haplotype frequencies after admixture become After recombination, we have and after viability selection, the haplotype frequencies among adults in the next generation are Using the same methods as for the single-pulse model, we find the explicit expressions for the frequency of N 1 S 1 and N 1 S 2 in the human population in generation t as where

Single-wave introgression model
We now consider a model with a single wave of continuous introgression over τ generations, after which introgression stops. Therefore, during the first τ generations, haplotype frequencies change according to the model of continuous introgression discussed above (Eq. 31). For t > τ , haplotype frequencies change according to the single-pulse model with initial frequencies x τ and y τ (Eq. 5). Therefore we can write and where

Dual-wave introgression model
It is straighforward to extend the model of a single wave of continuous introgression to include further waves. We do this here for a second wave. Specifically, introgression occurs during a first wave of length τ 1 generations, after which it stops until generation τ 2 , when the second wave of introgression starts. The second wave ends at generation τ 3 . Up until τ 2 this model is equivalent to the single-wave introgression model introduced above. Therefore, the expressions for x t and y t for t ≤ τ 2 are given by Eqs. (33) and (34), after replacing τ by τ 1 . At generation τ 2 , introgression starts again, but the initial haplotype frequencies are now x τ2 and y τ2 , rather than 0. This situation is equivalent to the continuous introgression model starting from t = 1 (rather than t = 0) and with x 0 and y 0 replaced by x τ2 and x τ2 .
Then, the expressions for x t and y t can be found to be and where In both the single-pulse and wave models, haplotype frequencies change at the same rate once introgression stops, and this change is determined by r and s. The difference is that the haplotype

S8 Fig
Mapping models with one (red line) and two (blue line) waves of introgression to a single-pulse model. By changing time in the single-pulse model (dashed and dotted black lines) as described in S1 Text, we can recover present-day haplotype frequencies generated by the