
Allele frequency divergence reveals ubiquitous influence of positive selection in Drosophila

  • Jason Bertram

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    jxb@iu.edu

    Affiliations Environmental Resilience Institute, Indiana University, Bloomington, Indiana, United States of America, Department of Biology, Indiana University, Bloomington, Indiana, United States of America

Abstract

Resolving the role of natural selection is a basic objective of evolutionary biology. It is generally difficult to detect the influence of selection because ubiquitous non-selective stochastic change in allele frequencies (genetic drift) degrades evidence of selection. As a result, selection scans typically only identify genomic regions that have undergone episodes of intense selection. Yet it seems likely such episodes are the exception; the norm is more likely to involve subtle, concurrent selective changes at a large number of loci. We develop a new theoretical approach that uncovers a previously undocumented genome-wide signature of selection in the collective divergence of allele frequencies over time. Applying our approach to temporally resolved allele frequency measurements from laboratory and wild Drosophila populations, we quantify the selective contribution to allele frequency divergence and find that selection has substantial effects on much of the genome. We further quantify the magnitude of the total selection coefficient (a measure of the combined effects of direct and linked selection) at a typical polymorphic locus, and find this to be large (of order 1%) even though most mutations are not directly under selection. We find that selective allele frequency divergence is substantially elevated at intermediate allele frequencies, which we argue is most parsimoniously explained by positive—not negative—selection. Thus, in these populations most mutations are far from evolving neutrally in the short term (tens of generations), including mutations with neutral fitness effects, and the result cannot be explained simply as an ongoing purging of deleterious mutations.

Author summary

Natural selection is the process fundamentally driving evolutionary adaptation; yet the specifics of how natural selection molds the genome are contentious. A prevailing neutralist view holds that the evolution of most mutations is essentially random. Here, we develop new theory that looks past the stochasticity of individual mutations and instead analyzes the behavior of mutations across the genome as a collective. We find that selection has a strong non-random influence on most of the Drosophila genome over short timescales (tens of generations), including the bulk of mutations that are not themselves directly targeted by selection. We show that this likely involves ongoing positive selection.

Introduction

One of the central problems of evolutionary biology is to delineate the role of natural selection in shaping genetic variation. Most genetic variation consists of neutral mutations which, though having no appreciable effects on fitness, are not free from the influence of selection. When selection acts on non-neutral mutations, neutral mutations that share similar genetic backgrounds can be dragged along for the ride, a process called linked selection [1]. The extent to which linked selection influences neutral variation is a major point of contention [2, 3]—one with practical implications because putatively neutral mutations are widely used to infer population demographic history [4] and as a baseline for detecting selection [2, 5]. There is also ongoing debate about the particular modes of selection responsible for shaping genetic variation. Negative selection purging the influx of deleterious mutations is probably prevalent [6, 7], but positive selection on rarer advantageous mutations is crucial for adaptive evolution and likely also has a hand in shaping neutral variation [8].

Until recently, the bulk of the evidence entering the above debates rested on patterns of genetic variation measured at single snapshots in time. The interpretation of such evidence is complicated because the prospective signatures of selection are accumulated over an uncertain history during which other confounding processes (e.g. population demography) also shape genetic diversity [5, 9, 10]. Crucially, single snapshot data is unable to reveal what the process of selection is doing at any point in time i.e. selectively changing allele frequencies.

A more direct approach is to analyze allele frequency data gathered from the same population at multiple points in time [10]. Evolve and resequence (E&R) experiments [11–14] and studies on wild populations [15, 16] have identified allele frequency changes associated with rapid phenotypic adaptation. However, determining the full nature and extent of selective allele frequency change has been difficult. Numerous methods exist for inferring selection coefficients from allele frequency time series [10, 17–24], but these are only reliable for selection that is strong relative to the intensity of random, non-selective allele frequency change (random genetic drift) and allele frequency measurement error (e.g. due to population sampling or limited sequencing read depth).

This is a major limitation that likely precludes detection of most of the influence of selection. Fitness-relevant traits are often complex (influenced by a large number of genes) and harbor ample genetic variation. Selection on such traits will thus often cause modest allele frequency shifts distributed across many loci rather than be concentrated at a small number of strongly selected loci [25–27]. Moreover, even if some genomic regions harbor strongly selected alleles, much of the associated linked selection could be undetectably weak. Thus, resolving the short-term (∼ tens of generations) influence of selection across the genome remains an important challenge [28].

Here we present a new approach to analyze the genome-wide influence of selection using time-resolved allele frequency data. Our approach capitalizes on a distinctive pattern of among-locus temporal allele frequency divergence that to our knowledge has not previously been described. In contrast with single-locus approaches, this allele frequency divergence is a collective pattern incorporating alleles across the genome. We therefore lose the ability to identify particular loci under selection; in return we are able to detect polygenic selective processes that are not detectable with single-locus approaches.

Traditionally the allele frequency variance in a cohort of neutral alleles with initial frequency $p$ is assumed to have the binomial form

$$\mathrm{Var}(\Delta_t p \mid p) = C_t\, p(1 - p), \tag{1}$$

where $\Delta_t p$ denotes the change in allele frequency after $t$ generations, and the variance coefficient $C_t$ is frequency independent [29, Chap. 3]. The allele frequency divergence in Eq (1) is largely a consequence of random genetic drift. However, selection can also cause neutral allele frequencies to diverge. The influential effective population size literature has derived (frequency-independent) expressions for $C_t$ in a wide variety of circumstances [30]. Crucially, a large body of work has attempted to subsume the effects of selection on neutral alleles into the frequency-independent value of $C_t$, including both the effects of unlinked fitness variation [31, 32], and some manifestations of linked selection [6, 33, 34]. The effective population size literature thus views Eq (1) as a broadly applicable model of neutral allele divergence, simply requiring a tuning of $C_t$ to capture the effects of selection on neutral alleles, at least to a first approximation [3, 30].

Here we show, on the contrary, that linked selection causes among-locus neutral allele frequency variance to deviate from the binomial form (1), such that the variance coefficient Ct is frequency-dependent. We use this frequency-dependence to detect the presence of selection, analyze its influence on allele frequencies over time and estimate the typical magnitude of total selection coefficients (capturing both direct and linked selection) across the genome. Applying our approach to E&R and wild Drosophila single nucleotide polymorphism (SNP) data we find evidence of strong linked selection affecting most SNPs (although we cannot rule out migratory fluxes in the wild population). We argue that the specific form of frequency-dependence we find implies a substantial role for positive selection.

Results

Neutral evolution implies binomial allele frequency variance

The Eq (1) binomial variance classically arises in the neutral Wright-Fisher model, which assumes random sampling of gametes each generation; then $C_t = 1 - (1 - 1/(2N))^t$ where $N$ is the (diploid) population size. In its basic form the neutral Wright-Fisher model entails a number of biological simplifications including random mating, constant $N$, non-overlapping generations, and the absence of fitness differences between individuals. Many of these assumptions can be relaxed without affecting the binomial form of Eq (1), at least approximately for large $N$ and over long timescales [30, 35]. Similarly, much of the justification for Wright-Fisher as a biologically valid description of genetic drift is derived from its equivalence to a broader class of drift models in the limit of large $N$ and slow allele frequency change (the diffusion limit [36]). Here we are interested in shorter time scales (≤ tens of generations), and want our approach to be applicable to small laboratory populations ($< 10^3$ individuals). We therefore evaluate the validity of Eq (1) more generally.
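The classical Wright-Fisher result quoted above can be checked numerically. The following is a minimal sketch, assuming independent neutral loci and binomial sampling of 2N gametes per generation; the function names and parameter values are illustrative and are not taken from the paper.

```python
import numpy as np

def simulated_ct(p0, N, t, n_loci=50_000, seed=0):
    """Estimate C_t = E[(Delta_t p)^2 | p0] / [p0 (1 - p0)] from independent neutral loci."""
    rng = np.random.default_rng(seed)
    p = np.full(n_loci, p0, dtype=float)
    for _ in range(t):
        # Neutral diploid Wright-Fisher step: sample 2N gametes binomially.
        p = rng.binomial(2 * N, p) / (2 * N)
    # Under neutrality E[Delta_t p] = 0, so the mean square equals the variance.
    return np.mean((p - p0) ** 2) / (p0 * (1 - p0))

def theoretical_ct(N, t):
    """Wright-Fisher variance coefficient C_t = 1 - (1 - 1/(2N))^t."""
    return 1 - (1 - 1 / (2 * N)) ** t

if __name__ == "__main__":
    N, t = 500, 10
    for p0 in (0.5, 0.7, 0.9):
        print(p0, simulated_ct(p0, N, t), theoretical_ct(N, t))
```

In this neutral setting the estimated coefficient should be roughly the same at every starting frequency, which is the frequency independence that the rest of the analysis uses as its null expectation.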

An enormous variety of purely neutral genetic drift models have binomial variance [37]. This includes the Cannings model, which represents neutrality using a general exchangeability assumption that allows for arbitrary offspring number distributions [38]. Binomial variance thus accommodates fundamental deviations from Wright-Fisher such as “sweepstakes” reproduction in high-fecundity organisms [39]. However, due to the presence of fitness variation in adapting populations, neutral mutations do not evolve according to “pure drift” of the sort studied in ref. [37], even if unlinked from alleles under selection [31]. In particular, the Cannings model is not applicable because exchangeability precludes fitness variation between individuals.

We show that binomial variance applies quite generally for neutral alleles unlinked from selected loci (A in S1 Text). In short, we use a generalized exchangeability argument to show that binomial variance holds in the presence of fitness variation provided that the neutral alleles under consideration are in linkage equilibrium with alleles under selection. Intuitively, linkage equilibrium ensures that the distribution of genetic backgrounds is exchangeable between alternate neutral alleles, even though individual genetic backgrounds are not exchangeable.

Non-binomial variance (equivalently, frequency-dependent Ct) thus signifies a violation of generalized exchangeability. The obvious way for this to occur is for allele frequency change to have a nonzero bias; this could be due to linked selection, migration, mutation bias or gene drive. Additionally, deviations from binomial variance can occur if the population is structured into genetically differentiated demes (B in S1 Text). Below we check for binomial variance empirically and discuss our findings in relation to these factors leading to non-binomial variance, focusing mostly on selection for reasons that will become apparent.

Note that while our exchangeability argument yields Eq (1) with finite variance for finite N, in the diffusion limit infinite variance is possible in the Cannings model [37]. None of our results depend on N → ∞ limiting behavior, so we do not discuss this possibility further.

Selection creates non-binomial allele frequency variance

We now analyze the effects of selection on allele frequency divergence, demonstrating that deviations from binomial variance will often result.

The expected frequency change after one generation due to selection on an allele starting at frequency $p$ is given by

$$E[\Delta_1 p \mid p, s] = s\, p(1 - p), \tag{2}$$

where $s = (w_1 - w_2)/\bar{w}$ is the selection coefficient, $w_1$ is the mean fitness of the focal allele, $w_2$ is the mean fitness of all other alleles at the same locus, and $\bar{w}$ is population mean fitness. Here $s$ is the “total” selection coefficient that captures the net effect of selection at linked loci and the focal locus [40, 41].

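Selection generates among-locus divergence of allele frequencies when its strength or sign varies among alleles in a cohort. To quantify this effect, we apply the law of total variance to $\Delta_1 p$ where $s$ is allowed to vary between loci:

$$\mathrm{Var}(\Delta_1 p \mid p) = E_s[\mathrm{Var}(\Delta_1 p \mid p, s)] + \mathrm{Var}_s(E[\Delta_1 p \mid p, s]). \tag{3}$$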

The second term in Eq (3) represents selective divergence, i.e. the deterministic allele frequency divergence created by among-locus variation in $s$. Using Eq (2), it can be written as $\sigma^2(s \mid p)\,[p(1 - p)]^2$, where $\sigma^2(s \mid p)$ is the (possibly frequency-dependent) variance in total selection coefficients among loci with initial frequency $p$. The presence of the $[p(1 - p)]^2$ factor will tend to cause intermediate frequency alleles to have elevated variance relative to the binomial case (Fig 1). Thus, while it is possible for the allele frequency variance created by selection to be binomial, in general it is not. Beyond the tendency for elevated variance at intermediate frequencies, the exact shape and magnitude of the deviation is determined by $\sigma^2(s \mid p)$.
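To make the contrast between the two scalings concrete, here is a small sketch that adds a binomial drift term to the selective term σ²(s|p)[p(1 − p)]² and divides by p(1 − p). It assumes, purely for illustration, that σ(s|p) is the same at every frequency; the names `drift_coeff` and `sigma_s` are hypothetical.

```python
def variance_coefficient(p, drift_coeff, sigma_s):
    """Var(Delta_1 p | p) / [p (1 - p)] when drift is binomial and selection adds
    sigma_s^2 [p (1 - p)]^2 (illustrative assumption: sigma_s does not depend on p)."""
    pq = p * (1 - p)
    total_variance = drift_coeff * pq + sigma_s ** 2 * pq ** 2
    return total_variance / pq

def excess(p, p_ref, drift_coeff, sigma_s):
    """Difference in variance coefficient between frequency p and a reference p_ref."""
    return (variance_coefficient(p, drift_coeff, sigma_s)
            - variance_coefficient(p_ref, drift_coeff, sigma_s))

if __name__ == "__main__":
    # The drift term cancels; what remains is sigma_s^2 * (0.25 - 0.09).
    print(excess(0.5, 0.9, drift_coeff=0.005, sigma_s=0.01))
```

Under this constant-σ assumption the drift contribution cancels in the difference and only the selective term survives, weighted by the difference in p(1 − p) between the two frequencies.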

Fig 1.

(A) The total selection coefficient measures the overall effect of selection on an allele including any associations with other sites under selection. (B) When alleles at different sites have different total selection coefficients, selection generates allele frequency divergence. (C) Compared to the binomial variance (proportional to $p(1 - p)$) created by random genetic drift, the selective variance tends to be more elevated at intermediate frequencies (proportional to $[p(1 - p)]^2$) because the magnitude of selective allele frequency change is proportional to $p(1 - p)$.

https://doi.org/10.1371/journal.pgen.1009833.g001

More generally, allele frequencies are measured $t > 1$ generations apart during which time the selective divergence accumulates. The temporal structure of selection is then important. The total allele frequency change after $t$ generations is the sum over the intervening $t$ generations, $\Delta_t p = \sum_{i=0}^{t-1} \delta_i p$, where $\delta_i p = p_{i+1} - p_i$ and $p_i$ is the frequency in generation $i$ ($i = 0, 1, \ldots, t - 1$ counting from the preceding measurement). From Eq (2) we have $\delta_i p = s_i\, p_i(1 - p_i)$ where $s_i$ is the total selection coefficient in generation $i$. Assuming that the total selection coefficients and total allele frequency change over $t$ generations are small ($|\Delta_t p| \ll 1$ and $|s_i| \ll 1$), the expected allele frequency change is approximately $p(1 - p) \sum_{i=0}^{t-1} s_i$ (dropping terms of order $s^2$). The selective divergence is then given by

$$\mathrm{Var}_s\big(E[\Delta_t p \mid p, s_0, \ldots, s_{t-1}]\big) \approx [p(1 - p)]^2 \left[ \sum_{i=0}^{t-1} \mathrm{Var}(s_i) + \sum_{i \neq j} \mathrm{Cov}(s_i, s_j) \right]. \tag{4}$$

The two sums on the right represent respectively: the divergence contribution from fitness variation within intervening generations; and the divergence contribution from temporal consistency in fitness variation across intervening generations.

Sustained selection manifests as positive among-locus temporal covariances $\mathrm{Cov}(s_i, s_j) > 0$. If total selection coefficients were perfectly constant with time these positive covariances would create rapid selective divergence with the allele frequency variance in Eq (4) growing quadratically over time (because there are $t(t - 1)$ covariance terms in Eq (4); for further details see C in S1 Text). However, for neutral alleles (the bulk of segregating variants), the temporal covariance between $s_i$ and $s_j$ is expected to decay exponentially with increasing time separation $|j - i|$ due to recombination. In the two-locus case where the neutral allele is hitchhiking with one selected allele, linkage disequilibrium (and thus covariance) decays at rate $\sim (1 - r)^{|j - i|}$ where $r$ is the recombination rate between the two alleles [1]. The multilocus case similarly involves exponential decay averaged over all linked sites under selection [42]. Nevertheless, even if recombination destroys linkage disequilibrium so rapidly that only consecutive generations ($|i - j| = 1$) covary, there are still $t - 1$ such pairs contributing to Eq (4). Thus, among-locus temporal autocovariances $\mathrm{Cov}(s_i, s_j)$ can make a substantial contribution to the overall selective divergence.

Alternatively, even if selection fluctuates in such a way that total selection coefficients are temporally uncorrelated Cov(si, sj) = 0, the within-generation selective divergence can still create non-binomial frequency dependence. The variance resulting from this effect accumulates at a slower linear rate with time (because there are t variance terms in Eq (4); C in S1 Text)—a selective random walk [43]. Selection that changes in a more predictable manner could in principle generate no overall divergence at all—if selection reverses direction concurrently at many loci, negative covariances can be created in Eq (4) shrinking the overall divergence.
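The linear and quadratic accumulation regimes described above can be illustrated by evaluating the sum of variance and covariance terms directly. This sketch assumes a constant per-generation variance `var_s` and a two-locus style covariance Cov(s_i, s_j) = var_s (1 − r)^|j − i|; both are simplifying assumptions made here for illustration, and the function name is hypothetical.

```python
def divergence_terms(t, var_s, r):
    """Return (sum of Var(s_i), sum over i != j of Cov(s_i, s_j)) for t generations.

    Assumes Var(s_i) = var_s in every generation and
    Cov(s_i, s_j) = var_s * (1 - r) ** |j - i| (illustrative decay model).
    Multiply the total by [p (1 - p)]^2 to obtain the selective divergence.
    """
    variance_part = t * var_s
    covariance_part = sum(
        var_s * (1 - r) ** abs(j - i)
        for i in range(t)
        for j in range(t)
        if i != j
    )
    return variance_part, covariance_part

if __name__ == "__main__":
    for r in (0.0, 0.1, 0.5, 1.0):
        v, c = divergence_terms(t=10, var_s=1e-4, r=r)
        print(f"r={r}: variance terms={v:.2e}, covariance terms={c:.2e}")
```

With r = 0 the covariance terms contribute t(t − 1) times `var_s`, so the total grows as t²; with r = 1 they vanish and only the t within-generation variance terms remain.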

In addition to the selective divergence described above, selection has another effect in Eq (3): it perturbs the drift contribution to divergence Es[Var(Δtp|p, s0, …, st−1)]. This effect occurs when a mean selective bias in the cohort displaces allele frequencies and thus perturbs the effects of drift (regardless of whether there is among-locus variation in total selection coefficients). We show that the selective perturbation to the drift variance has the form where c is a frequency-independent constant of order 1 (D in S1 Text). This result assumes that the cohort does not start close to fixation, and is also insensitive to population dynamic specifics if many generations separate measurements (t ∼ 10 in the data we analyze). For a generational measurement interval (t = 1) this result also holds in canonical models (i.e. Wright-Fisher and Moran), but in general it is possible that the exact form of the selective drift perturbation depends on population specifics. In the following analysis the exact expression for the selective drift perturbation will not be important; we only use the fact that it scales with , which implies that its effects are negligibly small in the populations of interest here (Methods).

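Combining variance contributions we have

$$\mathrm{Var}(\Delta_t p \mid p) = C_t(p)\, p(1 - p), \qquad C_t(p) = D_t + [\text{selective drift perturbation}] + p(1 - p)\, \mathrm{Var}\Big(\sum_{i=0}^{t-1} s_i \,\Big|\, p\Big), \tag{5}$$

where $D_t$ is the frequency-independent variance coefficient in the absence of selection. The variance coefficient $C_t(p)$ is thus partitioned respectively into a frequency-independent genetic drift component, a frequency-dependent selective drift perturbation, and a frequency-dependent selective divergence.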

Positive excess variance indicates positive selection

The deviation from binomial allele frequency variance described in the previous section depends crucially on the among-locus total selection coefficient variance σ2(s|p). This quantity is challenging to analyze because it is determined by the structure of linkage disequilibrium. We thus performed forward-time population genetic simulations using SLiM [44] to supplement our theoretical results (see Methods for simulation details). For simplicity, we focus on three archetypal scenarios in an unstructured, demographically stable population closed to migration: a continual influx of deleterious mutations, no non-neutral mutations (the control case), and a continual influx of unconditionally beneficial mutations. For short we call the first and last of these “negative selection” and “positive selection” respectively. In all three cases we maintain a steady influx of neutral mutations; these constitute the bulk of segregating mutations and therefore dominate the behavior of σ2(s|p). Intuitively we expect that the frequency dependence of σ2(s|p) could be quite different in the negative versus positive selection scenarios, because unconditionally deleterious mutations strong enough to cause detectable allele frequency divergence rarely reach intermediate frequencies, whereas beneficial mutations routinely do so.

To check for binomial frequency variance, we use allele frequencies from two timepoints $t = 10$ generations apart (chosen for compatibility with the empirical data we consider below) to calculate $C_t = \mathrm{Var}(\Delta_t p \mid p)/[p(1 - p)]$ for alleles starting at intermediate ($0.5 < p < 0.55$) and high ($0.9 < p^* < 0.95$) major allele frequencies. We then calculate the “excess variance” $C_t(p) - C_t(p^*)$. We also calculate total selection coefficients for all segregating mutations to investigate how the selective divergence term in Eq (5) behaves as a function of $p$. To make the magnitude of the latter easier to interpret, we show total selection coefficient variance on a per-generation scale, $\sigma^2(\bar{s} \mid p)$, where $\bar{s}$ is the time-averaged total selection coefficient.
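The excess variance calculation can be sketched as follows. The binned estimator used here (variance of the frequency change divided by the mean of p(1 − p) within the bin) is one plausible choice and is an assumption, not necessarily the estimator used in the study. The toy data generator is likewise hypothetical: it draws Gaussian frequency changes with a binomial drift term plus a selective term proportional to [p(1 − p)]².

```python
import numpy as np

def simulate_cohort(n_loci, drift_coeff, total_sel_var, seed=0):
    """Toy data with Var(dp | p) = p(1-p) * [drift_coeff + total_sel_var * p(1-p)]."""
    rng = np.random.default_rng(seed)
    p0 = rng.uniform(0.5, 0.95, n_loci)   # initial major allele frequencies
    pq = p0 * (1 - p0)
    sd = np.sqrt(pq * (drift_coeff + total_sel_var * pq))
    # Gaussian noise is a convenience; it ignores the boundaries at 0 and 1.
    pt = p0 + rng.normal(0.0, sd)
    return p0, pt

def variance_coefficient(p0, pt, lo, hi):
    """C_t for loci with lo <= p0 < hi: Var(dp) / mean[p0 (1 - p0)] within the bin."""
    in_bin = (p0 >= lo) & (p0 < hi)
    dp = pt[in_bin] - p0[in_bin]
    return np.var(dp) / np.mean(p0[in_bin] * (1 - p0[in_bin]))

def excess_variance(p0, pt, mid_bin=(0.5, 0.55), ref_bin=(0.9, 0.95)):
    """C_t(p) - C_t(p*) for an intermediate bin and a high-frequency reference bin."""
    return (variance_coefficient(p0, pt, *mid_bin)
            - variance_coefficient(p0, pt, *ref_bin))

if __name__ == "__main__":
    neutral = simulate_cohort(400_000, drift_coeff=0.002, total_sel_var=0.0, seed=1)
    selected = simulate_cohort(400_000, drift_coeff=0.002, total_sel_var=0.004)
    print("neutral excess :", excess_variance(*neutral))
    print("selected excess:", excess_variance(*selected))
```

In the neutral toy data the two bins share the same coefficient, so the excess hovers around zero; adding the selective term raises the intermediate bin more than the reference bin.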

According to the theory in the preceding section, selective divergence tends to create positive excess variance due to the $p(1 - p)$ factor in the last term in Eq (5). Our positive selection simulations confirm this prediction, consistently creating positive excess variance (Fig 2A and 2C). On the other hand, there is no consistent deviation from binomial variance in the negative selection simulations: $\sigma^2(\bar{s} \mid p)$ increases with major allele frequency so rapidly that the overall selective divergence term in Eq (5) is independent of frequency (Fig 2A and 2B). While these simulations are obviously simplified, the concentration of selective divergence at low/high frequencies is a general feature of the purging of new deleterious mutations. Thus, selection does generate elevated variance at intermediate frequencies as predicted theoretically, but not just any form of selection: it is important that selection be “positive” in the sense of not only eliminating rare variants.

Fig 2.

(A) Forward-time population genetic simulations consistently show elevated excess variance under positive selection only. Excess variance defined as Ct(p) − Ct(p*) with major allele frequencies 0.5 < p < 0.55 and 0.9 < p* < 0.95 and t = 10 generations. (B) Under strong negative selection (deleterious mutation rate U = 1/genome/generation, mutation selection coefficient s = −0.05), total selection coefficients are substantial at all frequencies but much stronger for high major allele frequencies resulting in a frequency-independent overall selective divergence like the neutral case. (C) In contrast, the selective divergence shows clear frequency dependence under positive selection, thus producing excess variance at intermediate frequencies. Population size N = 1000; 100 replicates per parameter combination. Stars indicate which panel A simulations are shown in panels B and C respectively.

https://doi.org/10.1371/journal.pgen.1009833.g002

Intermediate frequency alleles have elevated variance in Drosophila

We next investigated whether binomial allele frequency variance is observed empirically. In two fruit fly (D. simulans) E&R experiments [11, 12], we observe systematically elevated variance coefficients $C_t$ at intermediate frequencies (Fig 3). We rule out measurement error as driving this pattern, because the major sources of pooled sequencing error (population sampling, read sampling, unequal individual contributions to pooled DNA) also create binomial variance rather than a systematic frequency-dependent bias (E in S1 Text; [45, 46]). We also rule out migration, since these E&R populations are closed. Moreover, as will be discussed in the next section, systematically elevated variance cannot be explained by a few large effect loci, implying that a substantial fraction of SNPs across the genome are involved in the observed pattern. Hence we also rule out mutation bias and gene drive as being the main driver of elevated variance at intermediate frequencies since these processes do not have the requisite scale. Finally, population structure tends to create a variance deficit at intermediate frequencies (B in S1 Text); thus, even if some population structure is present in these closed E&R populations, it would tend to eliminate the observed elevation of variance, not explain it. We deduce that the pattern observed in Fig 3 is due to selection, consistent with the theoretical prediction that selective divergence tends to cause elevated variance at intermediate frequencies.

Fig 3. Intermediate frequency SNPs in E&R D. simulans populations (A [11]; B [12]) have systematically elevated variance coefficients $C_t(p) = \mathrm{Var}(\Delta_t p \mid p)/[p(1 - p)]$ relative to higher frequency SNPs after one round of evolution and resequencing ($t \approx 10$ in A; $t \approx 15$ in B), inconsistent with the binomial expectation for neutrally evolving alleles, Eq (1).

$C_t(p)$ is calculated in 2.5% major allele frequency bins using all SNPs in the genome (circles). Vertical lines show 95% block bootstrap confidence intervals (1Mb blocks). We subtract the constant $\min_p C_t(p)$ from $C_t(p)$ in each replicate to prevent differences in the overall magnitude of $C_t(p)$ between replicates from obscuring $p$ dependence within each replicate.

https://doi.org/10.1371/journal.pgen.1009833.g003

Similar results are found in a wild D. melanogaster population [15] (S1 Fig), although this population is not closed and elevated variance could also be attributed to migration. The effect of migration on allele frequency divergence can be understood analogously to selection (Eq (3)) as introducing a migration divergence term $\mathrm{Var}(m(p^* - p) \mid p) = m^2\, \mathrm{Var}(p^* - p \mid p)$ where $m$ is the proportion of individuals in the focal population replaced by migrants from the source population each generation, and $p^*$ denotes source population frequencies. The migration divergence thus depends on the structure of differentiation between focal and source populations. The a priori expectation is for $\mathrm{Var}(p^* - p \mid p)$ to be greatest at high $p$ (the opposite of the observed pattern), where the largest differences $p^* - p$ are possible (analogous to the mathematical constraints on $F_{ST}$ [47]). However, since we do not know the structure of population differentiation (or even what the source population might be), we remain agnostic about the influence of migration in the ref. [15] population.

Next we explored the behavior over time of the elevated variance shown in Fig 3 by following its accumulation within a frequency cohort for two studies in which allele frequencies were measured more than twice [11, 15]. Similar to our simulations, at each measured timepoint we quantified the excess variance using the difference Ct(p) − Ct(p*), where p is the initial frequency of the cohort and p* > p is a reference frequency. In practice we choose p = 0.5 to maximize the contrast with the reference frequency, while p* ∼ 0.8–0.9 is chosen to be large enough that there is a meaningful contrast with p = 0.5 but safely displaced from the p = 1 boundary where allele frequency variances are not measured reliably (see sharp increases in Fig 3 as p → 1).

We find that excess variance accumulates over the course of the entire Barghi et al. [11] E&R experiment (Fig 4A shows one replicate, other replicates are similar; S2 Fig), implying a sustained, polygenic divergence in allele frequencies. This pattern is consistent with the positive Δp temporal autocovariances documented in [28]. Sustained divergence is what we expect to occur from selection in a novel but constant laboratory environment.

Fig 4. Excess allele frequency variance (a measure of deviation from neutrality defined as $C_t(p) - C_t(p^*)$) accumulates over time in a D. simulans E&R experiment (A; [11]), but remains relatively flat in a wild D. melanogaster population (B; S = Spring, F = Fall, LF = Late Fall; 09 = 2009 etc.; [15]).

The excess variance is calculated for intermediate frequency alleles falling within a major allele frequency bin at p = 0.5. In (A), p* = 0.9 and bin width is 2.5%. In (B), p* = 0.8 and bin width is 5%. Vertical lines show 95% block bootstrap confidence intervals (1Mb blocks).

https://doi.org/10.1371/journal.pgen.1009833.g004

By contrast, excess variance in the wild D. melanogaster population [15] does not accumulate continually over time, with fluctuations evident in each cohort (Fig 4B). Fluctuations imply a concurrent reversal in the direction of non-neutral allele frequency change across many loci such that non-neutral divergence is partly lost to a subsequent coordinated non-neutral convergence. Bearing in mind that migration may contribute to this pattern, the fluctuations shown in Fig 4 are compatible with temporally fluctuating selection affecting a large proportion of the genome, as proposed by ref. [15]. However, while ref. [15] attributed temporal fluctuations in selection to periodic seasonal change, we do not see a clear annual periodicity in the accumulation of variance. A similar lack of annual periodicity is found in allele frequency temporal autocovariances [28]. These results suggest a more complex selective (or migratory) regime of which seasonal fluctuations are only a part.

Linked selection strongly perturbs SNP frequencies in Drosophila

In the previous section we argued that selection is most likely responsible for elevated allele frequency divergence at intermediate frequencies in three Drosophila studies (with the possible exception of the ref. [15] study because of migration). We next used the theory developed above to estimate the typical magnitude of total selection coefficients associated with elevated divergence (we also apply our analysis to ref. [15] supposing that selection was responsible).

We measure the typical intensity of selection using the among-locus standard deviation σ(s|p). This quantity determines the selective divergence in Eq (5), and has the convenient property of measuring the absolute magnitude of s regardless of sign. Intuitively, σ(s|p) measures the intensity of a collective “polygenic” adaptive response shared across many loci. If a fraction f of loci have s = 0, then where σnn(s|p) is the standard deviation in s among non-neutral loci. Thus, a substantial fraction of the alleles in a cohort must have nonzero s (f appreciably smaller than 1) for there to be a discernible σ(s|p) signal.

We estimate $\sigma(s \mid p)$ from measured allele frequency divergence using Eq (5). Since we only have measurements separated by $t$ generations, we actually estimate $\sigma(\bar{s} \mid p)$, where $\bar{s} = \frac{1}{t}\sum_{i=0}^{t-1} s_i$ is the time-averaged selection coefficient. To estimate $\sigma(\bar{s} \mid p)$ from Eq (5), we need to eliminate the non-selective divergence contributions of genetic drift $D_t$ and measurement error (which was not included in Eq (5)). In Methods we show that these latter contributions are cancelled out in the excess variance $C_t(p) - C_t(p^*)$, avoiding the complication of independently estimating them. However, some selective divergence is also cancelled out in the difference $C_t(p) - C_t(p^*)$, so that this approach only obtains a lower bound

$$\sigma^2(\bar{s} \mid p) \geq \frac{C_t(p) - C_t(p^*)}{t^2\, p(1 - p)}. \tag{6}$$

In all three Drosophila studies, we find the above lower bound to be of order $10^{-4}$ (Fig 5), implying that total selection coefficients with magnitudes of order $10^{-2}$ are commonplace in the populations considered here.
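As a worked illustration of how an excess variance translates into a selection coefficient magnitude, the sketch below assumes the lower bound takes the form σ²(s̄|p) ≥ [C_t(p) − C_t(p*)] / [t² p(1 − p)], which follows from writing the selective term of the variance coefficient as t² p(1 − p) σ²(s̄|p). That form, the function name, and the input values are assumptions for illustration, not figures from the study.

```python
import math

def sigma_lower_bound(excess_variance, t, p):
    """Lower bound on sigma(s_bar | p) implied by an excess variance C_t(p) - C_t(p*).

    Assumes sigma^2 >= excess / (t^2 * p * (1 - p)); a non-positive excess
    gives no information, so the bound is reported as zero.
    """
    if excess_variance <= 0:
        return 0.0
    return math.sqrt(excess_variance / (t ** 2 * p * (1 - p)))

if __name__ == "__main__":
    # Hypothetical excess variance of 0.0025 over t = 10 generations at p = 0.5.
    print(sigma_lower_bound(0.0025, t=10, p=0.5))
```

Here a variance bound of order 10⁻⁴ corresponds to a standard deviation of order 10⁻², which is how a bound on the variance maps onto selection coefficients of roughly 1%.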

Fig 5. Total selection coefficients show substantial among-locus variance in Drosophila.

(A-C) Lower bound estimates of $\sigma^2(\bar{s} \mid p)$ calculated from Eq (6) (circles; vertical lines show 95% block bootstrap confidence intervals) are of order $10^{-4}$, which implies typical $s$ values of ∼1%. Following the original studies [11, 12, 15], we assume $t = 10$ (A); $t = 15$ (B) and $t = 10$ (C; for both summer and winter).

https://doi.org/10.1371/journal.pgen.1009833.g005

Discussion

Several lines of evidence support the view that selection strongly influences genetic variation in Drosophila [8, 12, 28, 48]. Our results independently show that even over a short time interval (tens of generations), most intermediate frequency SNPs are influenced by selection—total selection coefficients (which include linked selection) of |s|∼1% are the norm among intermediate frequency SNPs, despite most of these SNPs having no effect on fitness. Since our method relies on contrasting behavior at different frequencies, the effect of selection on extreme frequency alleles is used as a reference and is therefore not directly inferred. We expect the effects of selection to be even greater at extreme frequencies where most deleterious mutations are segregating and recent neutral mutations are most tightly linked to selected backgrounds.

The power of our approach stems from aggregating allele frequency behavior over many loci, thereby leveraging the sheer number of variants measured with whole-genome sequencing to discern a selective signal. Heuristically, the sampling error in the lower bound estimate (6) is proportional to $1/\sqrt{L}$, where L is the number of independent loci used to estimate $C_t(p)$. With enough sequenced variants ($L \sim 10^5$), selection coefficients of order |s|∼1% should be detectable over a single generation even when allele frequency noise is of comparable magnitude (i.e. read depth and population size $\sim 10^2$; see Methods). Intuitively, variants across the genome experience a detectable non-neutral shift as a collective even though the underlying allele frequency changes may be indistinguishable from drift at individual loci.

Our approach is a departure from the widespread use of frequency-independent $C_t$ for neutral mutations [30]. The variance coefficient $C_t$ can be expressed in terms of the “variance effective population size” $N_e$ as $C_t = 1 - (1 - 1/(2N_e))^t$. Thus, selection makes $N_e$ frequency-dependent for neutral mutations over short timescales (i.e. before an appreciable fraction of the alleles in a cohort fix). The origin of this non-binomial allele frequency variance is variation in the selective background of alleles at different loci.
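The relationship between the variance coefficient and the variance effective population size can be made concrete with a small Python sketch. The function names are hypothetical, and the inverse mapping simply solves the standard drift formula $C_t = 1 - (1 - 1/(2N_e))^t$ for $N_e$.

```python
def variance_coefficient(ne, t):
    """C_t implied by a variance effective population size ne after t generations."""
    return 1.0 - (1.0 - 1.0 / (2.0 * ne)) ** t

def effective_size(c_t, t):
    """Variance effective population size implied by an observed C_t (inverse of the above)."""
    per_generation = 1.0 - (1.0 - c_t) ** (1.0 / t)   # equals 1 / (2 * Ne)
    return 1.0 / (2.0 * per_generation)

# Example: a frequency cohort with a larger C_t maps to a smaller implied Ne.
print(effective_size(0.0100, t=10))
print(effective_size(0.0125, t=10))
```

Applying `effective_size` separately to each frequency cohort is one way to see what a frequency-dependent $N_e$ means in practice.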

Selection does not need to be consistent over time to have this effect: stochastically fluctuating selection with no temporal consistency can also generate non-binomial allele frequency variance. However, temporally consistent selection generates divergence more rapidly, and temporal covariances can be responsible for most of the selective divergence (Results). Moreover, allele frequency changes Δp are correlated over time in the systems analyzed here [28]. Thus, it seems likely that temporally consistent selection is at least partly responsible for the patterns documented here.

Note, however, that in contrast to ref. [28], the temporal covariances relevant to allele frequency divergence in Eq (4) are between total selection coefficients, not Δp. For among-locus temporal covariances in total selection coefficients to be non-zero it is necessary for those coefficients to vary among loci, whereas Δp covariances quantify any temporal consistency in allele frequency change [42]. Thus, Δp temporal covariances can theoretically be present without any selective divergence, and vice versa. In practice, the temporal autocovariances in Δp must be calculated across three measurement steps, e.g. $\mathrm{Cov}(p_t - p_0,\, p_{2t} - p_t)$. These cross-measurement covariances do not contribute to the divergence observed at t generations, and are only a subset of the covariances contributing to the divergence observed at 2t generations (Eq (4)). Therefore, the patterns of variance accumulation documented here are related but not equivalent to the patterns documented in ref. [28]. Temporal autocovariances in Δp predominantly capture the extent to which the genome-wide influence of selection has a temporally enduring pattern across measurements. Allele frequency divergence captures the cumulative genome-wide influence of both temporally stable and fluctuating selection between two measurements. The relative contribution from temporal covariances in total selection coefficients depends on the intensity of selective fluctuations as well as the persistence time of linkage disequilibrium (Results), and would require generational allele frequency measurements to quantify.
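The distinction between cross-measurement covariances and divergence can be illustrated with a short Python sketch. It computes the covariance between successive allele frequency changes across three timepoints and checks the elementary identity that the variance of the change over 2t generations is the sum of the two interval variances plus twice their covariance. The function names are hypothetical, and for simplicity the variances are taken across all loci without conditioning on initial frequency or normalizing by p(1 − p).

```python
import numpy as np

def delta_p_autocovariance(p0, pt, p2t):
    """Among-locus covariance between successive frequency changes, Cov(p_t - p_0, p_2t - p_t)."""
    d1 = np.asarray(pt, dtype=float) - np.asarray(p0, dtype=float)
    d2 = np.asarray(p2t, dtype=float) - np.asarray(pt, dtype=float)
    return float(np.cov(d1, d2)[0, 1])

def divergence_decomposition(p0, pt, p2t):
    """Split Var(p_2t - p_0) into the two interval variances and the cross-measurement covariance."""
    p0, pt, p2t = (np.asarray(x, dtype=float) for x in (p0, pt, p2t))
    d1, d2 = pt - p0, p2t - pt
    return {
        "var_first": float(np.var(d1, ddof=1)),       # divergence over the first interval
        "var_second": float(np.var(d2, ddof=1)),      # divergence over the second interval
        "cov": delta_p_autocovariance(p0, pt, p2t),   # only enters the divergence at 2t
        "var_2t": float(np.var(p2t - p0, ddof=1)),    # = var_first + var_second + 2 * cov
    }
```

The covariance term appears only in `var_2t`, which mirrors the point that cross-measurement covariances do not contribute to the divergence observed over a single interval.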

We found that the frequency structure of allele frequency divergence is informative about the underlying structure of direct selection (Fig 2). Elevated divergence of intermediate frequency alleles is difficult to explain if only negative selection on unconditionally deleterious mutations is occurring. Although selection against an influx of deleterious mutations can generate transient sweep-like behavior for neutral mutations that originate on genetic backgrounds with above-average fitness, this scenario still entails overwhelmingly more influence on allele frequency dynamics at low/high frequencies compared to intermediate frequencies [49]. More broadly, it may be possible to make more detailed inferences about the structure of direct selection by moving beyond allele frequency variances and analyzing the entire distribution of allele frequency change $\Delta_t p$.

Quantifying the bounds on how much selection is possible, and how much selection actually occurs in natural populations, is a long-running controversy [50, 51]. The strong total selection coefficients (|s|∼1%) we find must predominantly reflect linked selection on neutral SNPs. This implies a substantial risk of overestimating the amount of direct selection when, as is commonly done, selection coefficients are inferred at individual loci and then attributed to direct selection. This “excess significance” is a well known difficulty in E&R experiments [12, 52], and similar challenges have arisen in wild populations [15]. Our results indicate that improving the sensitivity of single-locus selection coefficient inferences, or better controlling for multiple comparisons, will likely not resolve this issue. Our total selection coefficient estimates are also substantially larger than direct selection coefficients of individual alleles estimated from diversity patterns in Drosophila [8]. This is consistent with a linkage-centered view of neutral mutation evolution in which the selective background of most neutral mutations contains multiple alleles under selection such that allele frequency behavior is governed by the fitness variation within local “linkage blocks” [53] or larger haplotypes [11].

Methods

Simulations

We used SLiM [44] to simulate a closed population with $N = 10^3$ individuals, a 100 Mb diploid genome, a recombination rate of $10^{-8}$/base pair/generation, and a neutral mutation rate of $10^{-8}$/base pair/generation. Non-neutral mutations were introduced at rate U/chromosome/generation, and within each simulation all non-neutral mutations had the same fixed selection coefficient s. Four background selection regimes (U = 1, 0.1 × s = −0.05, −0.01), one neutral regime (U = 0), and four positive selection regimes (U = 0.1, 0.01 × s = 0.01, 0.02) were evaluated (Fig 2). In each regime, 100 replicates were simulated with complete genotypes recorded at generations $10^4$ and $10^4 + 10$, mimicking the t = 10 generation interval in the empirical studies after a burn-in period of $10N = 10^4$ generations. Total selection coefficients in Fig 2B and 2C were computed using Eq (2) from genotype data at generation $10^4$.
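The simulation design described above can be enumerated in a few lines of Python. This is only a bookkeeping sketch of the parameter grid, not the SLiM scripts themselves. The variable names are hypothetical, and the selection coefficient of 0.0 in the neutral regime is a placeholder because no non-neutral mutations arise when U = 0.

```python
from itertools import product

N = 10**3                                  # population size
BURN_IN = 10 * N                           # burn-in of 10N generations
T = 10                                     # interval between recorded generations
SAMPLE_GENERATIONS = (BURN_IN, BURN_IN + T)
REPLICATES = 100

# Each regime is (label, U per chromosome per generation, selection coefficient s)
background = [("background", u, s) for u, s in product([1.0, 0.1], [-0.05, -0.01])]
neutral = [("neutral", 0.0, 0.0)]          # s is a placeholder; irrelevant when U = 0
positive = [("positive", u, s) for u, s in product([0.1, 0.01], [0.01, 0.02])]
regimes = background + neutral + positive

for label, u, s in regimes:
    print(f"{label:>10}  U={u:<5} s={s:+.2f}  replicates={REPLICATES}")
```

The grid has nine regimes in total: four with background selection, one neutral, and four with positive selection.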

Data processing

SNP frequency data were obtained from the open access resources published in [15] (wild D. melanogaster, 1 replicate, $\sim 5 \times 10^5$ SNPs, 7 timepoints), [11] (D. simulans E&R, 10 replicates, $\sim 5 \times 10^6$ SNPs, 7 timepoints) and [12] (D. simulans E&R, 3 replicates, $\sim 3 \times 10^5$ SNPs, 2 timepoints). We performed no additional SNP filtering. For the [15] data, only SNPs tagged as “used” were included.

Block bootstrap confidence intervals

We use bootstrapping to estimate the variability of the quantities plotted in Figs 3–5. These quantities are calculated as an average over loci, where nearby loci are unlikely to be statistically independent due to linkage. To account for the non-independence of individual loci when bootstrap sampling, 95% confidence intervals are calculated using a block bootstrap procedure [28]. Each chromosome is partitioned into 1 megabase windows (∼120 total windows). Bootstrap sampling is then applied to these windows. The plotted vertical lines span the 2.5% and 97.5% block bootstrap percentiles.
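A minimal Python sketch of a block bootstrap along these lines is given below. It resamples whole genomic windows with replacement and reports the 2.5% and 97.5% percentiles of the recomputed statistic. The function name, the default number of bootstrap draws, and the use of a single integer label per window are assumptions made for illustration; the procedure of ref. [28] may differ in detail.

```python
import numpy as np

def block_bootstrap_ci(values, block_ids, statistic=np.mean, n_boot=1000, seed=0):
    """95% percentile interval for `statistic`, resampling genomic blocks rather than loci.

    values    : per-locus quantity (e.g. squared allele frequency change)
    block_ids : integer window label for each locus (e.g. one label per 1 Mb window)
    """
    values = np.asarray(values, dtype=float)
    block_ids = np.asarray(block_ids)
    # Group the per-locus values by window so linked loci are kept together
    blocks = [values[block_ids == b] for b in np.unique(block_ids)]
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        # Draw as many windows as there are in the data, with replacement
        draw = rng.integers(0, len(blocks), size=len(blocks))
        stats[i] = statistic(np.concatenate([blocks[j] for j in draw]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(lo), float(hi)
```

A window label could be built, for instance, from a chromosome index and the integer part of position divided by 1,000,000, so that every locus in the same megabase window shares a label.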

Estimation of the selection coefficient variance

To derive Eq (6) we show that the reference value Ct(p*) satisfies the inequality (7)

The first line above is Eq (5) evaluated at the reference frequency p* with an additional measurement error term M included. M is frequency-independent because measurement error is binomial (E in S1 Text; [45, 46]). Eq (7) implies that the reference value $C_t(p^*)$ is an upper bound on the drift and measurement components of $C_t(p)$ for all p. Taking the difference $C_t(p) - C_t(p^*)$, we then have

$$C_t(p) - C_t(p^*) \leq t^2\,p(1-p)\,\sigma^2(\bar{s}_t|p) \qquad (8)$$

eliminating $D_t$ and M.

To derive Eq (7) we first drop the selective drift perturbation because it is negligibly small compared to the selective divergence in the populations considered here: $C_t$ (and therefore $D_t$) is of order $10^{-2}$, $E[s_i|p]$ is at most of order $10^{-2}$, and $t \sim 10$, so the perturbation is small. By comparison, the selective divergence is of order $10^{-2}$. Second, we have $p^*(1 - p^*)\,\sigma^2(\bar{s}_t|p^*) > 0$; subtracting this term gives the inequality.

Estimation limits

Our analysis relies on detecting differences in $C_t(p)$ between cohorts with different values of p. The ability to detect such differences is determined by the sampling error in $C_t(p)$ that arises from calculating $\mathrm{Var}(\Delta_t p|p)$ from a finite number of loci. To estimate this sampling error, we assume that $\Delta_t p$ is approximately normally distributed, in which case the sampling variance of $\mathrm{Var}(\Delta_t p|p)$ is $2\,\mathrm{Var}(\Delta_t p|p)^2/(L - 1) \approx 2\,\mathrm{Var}(\Delta_t p|p)^2/L$, where $L \gg 1$ is the number of independent loci used to estimate $\mathrm{Var}(\Delta_t p|p)$. The standard error in $C_t(p) = \mathrm{Var}(\Delta_t p|p)/[p(1 - p)]$ is thus given by $\sqrt{2/L}\,C_t(p)$. This defines the scale of statistically detectable differences in $C_t(p) - C_t(p^*)$, which in turn determines the statistically detectable lower bound estimate on $\sigma^2(s|p)$ (6). For example, to detect $\sigma^2(s|p) \sim 10^{-4}$ at p = 0.5 (i.e. a typical selection coefficient of σ(s|p)∼1%) after one generation of evolution with $C_1 \sim 10^{-2}$ (i.e. a population sample of ∼100 individuals, an average read depth of ∼100 and fairly strong genetic drift $D_1 \sim 10^{-2}$), we need at least $L \sim 10^5$ independent SNPs.
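The order-of-magnitude argument above can be checked with a short Python sketch. It uses the standard error $\sqrt{2/L}\,C_t(p)$ implied by the normal approximation and asks how many independent loci are needed for that standard error to shrink to the size of the selective signal. The function names are hypothetical, and two things are assumptions made for illustration: the signal is taken to be $t^2\,p(1-p)\,\sigma^2(s|p)$, and the detection criterion is that the standard error equals the signal.

```python
import math

def se_variance_coefficient(c_t, n_loci):
    """Standard error of C_t(p) under the normal approximation: sqrt(2 / L) * C_t."""
    return math.sqrt(2.0 / n_loci) * c_t

def loci_needed(sigma2, p, t, c_t):
    """Number of independent loci at which the standard error of C_t(p)
    equals the assumed selective signal t^2 * p * (1 - p) * sigma2."""
    signal = t**2 * p * (1.0 - p) * sigma2
    return 2.0 * (c_t / signal) ** 2

# Worked example from the text: sigma^2 ~ 1e-4, p = 0.5, one generation, C_1 ~ 1e-2
print(loci_needed(sigma2=1e-4, p=0.5, t=1, c_t=1e-2))
```

Under these assumptions the example evaluates to roughly 3 × 10^5 loci, in line with the stated requirement of order 10^5 independent SNPs.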

Supporting information

S1 Text. Supplemental text.

This file contains supplemental text sections A-E.

https://doi.org/10.1371/journal.pgen.1009833.s001

(PDF)

S1 Fig. Frequency dependence of C.

Same as Fig 3 but for the Bergland et al. data. Each curve represents a different seasonal iterate e.g. summer 2009 to fall 2009.

https://doi.org/10.1371/journal.pgen.1009833.s002

(TIF)

S2 Fig. All Barghi et al. replicates.

Same as Fig 4A but including all 10 replicates from Barghi et al.

https://doi.org/10.1371/journal.pgen.1009833.s003

(TIF)

Acknowledgments

We thank Matthew Hahn for insightful discussions.

References

  1. Barton NH. Genetic hitchhiking. Philosophical Transactions of the Royal Society of London Series B: Biological Sciences. 2000;355(1403):1553–1562. pmid:11127900
  2. Kern AD, Hahn MW. The neutral theory in light of natural selection. Molecular biology and evolution. 2018;35(6):1366–1371. pmid:29722831
  3. Jensen JD, Payseur BA, Stephan W, Aquadro CF, Lynch M, Charlesworth D, et al. The importance of the neutral theory in 1968 and 50 years on: a response to Kern and Hahn 2018. Evolution. 2019;73(1):111–114. pmid:30460993
  4. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics. 2009;5(10):e1000695. pmid:19851460
  5. Haasl RJ, Payseur BA. Fifteen years of genomewide scans for selection: trends, lessons and unaddressed genetic sources of complication. Molecular ecology. 2016;25(1):5–23. pmid:26224644
  6. Charlesworth B, Morgan M, Charlesworth D. The effect of deleterious mutations on neutral molecular variation. Genetics. 1993;134(4):1289–1303. pmid:8375663
  7. Pouyet F, Aeschbacher S, Thiéry A, Excoffier L. Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences. eLife. 2018;7:e36317. pmid:30125248
  8. Elyashiv E, Sattath S, Hu TT, Strutsovsky A, McVicker G, Andolfatto P, et al. A genomic map of the effects of linked selection in Drosophila. PLoS Genetics. 2016;12(8):e1006130. pmid:27536991
  9. Li J, Li H, Jakobsson M, Li S, Sjödin P, Lascoux M. Joint analysis of demography and selection in population genetics: where do we stand and where could we go? Molecular ecology. 2012;21(1):28–44. pmid:21999307
  10. Lynch M, Ho WC. The Limits to Estimating Population-Genetic Parameters with Temporal Data. Genome biology and evolution. 2020;12(4):443–455. pmid:32181820
  11. Barghi N, Tobler R, Nolte V, Jakšić AM, Mallard F, Otte KA, et al. Genetic redundancy fuels polygenic adaptation in Drosophila. PLoS Biology. 2019;17(2):e3000128. pmid:30716062
  12. Kelly JK, Hughes KA. Pervasive linked selection and intermediate-frequency alleles are implicated in an evolve-and-resequencing experiment of Drosophila simulans. Genetics. 2019;211(3):943–961. pmid:30593495
  13. Therkildsen NO, Wilder AP, Conover DO, Munch SB, Baumann H, Palumbi SR. Contrasting genomic shifts underlie parallel phenotypic evolution in response to fishing. Science. 2019;365(6452):487–490. pmid:31371613
  14. Castro JP, Yancoskie MN, Marchini M, Belohlavy S, Hiramatsu L, Kučka M, et al. An integrative genomic analysis of the Longshanks selection experiment for longer limbs in mice. eLife. 2019;8:e42014. pmid:31169497
  15. Bergland AO, Behrman EL, O’Brien KR, Schmidt PS, Petrov DA. Genomic evidence of rapid and stable adaptive oscillations over seasonal time scales in Drosophila. PLoS Genetics. 2014;10(11):e1004775. pmid:25375361
  16. Monnahan PJ, Colicchio J, Fishman L, Macdonald SJ, Kelly JK. Predicting evolutionary change at the DNA level in a natural Mimulus population. PLoS Genetics. 2021;17(1):1–25.
  17. Bollback JP, York TL, Nielsen R. Estimation of 2Nes from temporal allele frequency data. Genetics. 2008;179(1):497–502. pmid:18493066
  18. Illingworth CJ, Mustonen V. Distinguishing driver and passenger mutations in an evolutionary history categorized by interference. Genetics. 2011;189(3):989–1000. pmid:21900272
  19. Malaspinas AS, Malaspinas O, Evans SN, Slatkin M. Estimating allele age and selection coefficient from time-serial data. Genetics. 2012;192(2):599–607. pmid:22851647
  20. Feder AF, Kryazhimskiy S, Plotkin JB. Identifying signatures of selection in genetic time series. Genetics. 2014;196(2):509–522. pmid:24318534
  21. Khatri BS. Quantifying evolutionary dynamics from variant-frequency time series. Scientific reports. 2016;6:32497. pmid:27616332
  22. He Z, Dai X, Beaumont M, Yu F. Estimation of Natural Selection and Allele Age from Time Series Allele Frequency Data Using a Novel Likelihood-Based Approach. Genetics. 2020;216(2):463–480. pmid:32769100
  23. Schraiber JG, Evans SN, Slatkin M. Bayesian inference of natural selection from allele frequency time series. Genetics. 2016;203(1):493–511. pmid:27010022
  24. Taus T, Futschik A, Schlötterer C. Quantifying selection with pool-seq time series data. Molecular biology and evolution. 2017;34(11):3023–3034. pmid:28961717
  25. Pritchard JK, Pickrell JK, Coop G. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current biology. 2010;20(4):R208–R215. pmid:20178769
  26. Höllinger I, Pennings PS, Hermisson J. Polygenic adaptation: From sweeps to subtle frequency shifts. PLoS Genetics. 2019;15(3):e1008035. pmid:30893299
  27. Barghi N, Hermisson J, Schlötterer C. Polygenic adaptation: a unifying framework to understand positive selection. Nature Reviews Genetics. 2020;21(12):769–781. pmid:32601318
  28. Buffalo V, Coop G. Estimating the genome-wide contribution of selection to temporal allele frequency change. Proceedings of the National Academy of Sciences. 2020;117(34):20672–20680. pmid:32817464
  29. Walsh B, Lynch M. Evolution and selection of quantitative traits. Oxford University Press; 2018.
  30. Charlesworth B. Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics. 2009;10(3):195–205. pmid:19204717
  31. Robertson A. Inbreeding in artificial selection programmes. Genetics Research. 1961;2(2):189–194.
  32. Santiago E, Caballero A. Effective size of populations under selection. Genetics. 1995;139(2):1013–1030. pmid:7713405
  33. Gillespie JH. Genetic drift in an infinite population: the pseudohitchhiking model. Genetics. 2000;155(2):909–919. pmid:10835409
  34. Santiago E, Caballero A. Effective size and polymorphism of linked neutral loci in populations under directional selection. Genetics. 1998;149(4):2105–2117. pmid:9691062
  35. Nagylaki T. Models and approximations for random genetic drift. Theoretical Population Biology. 1990;37(1):192–212.
  36. Kimura M. Diffusion models in population genetics. Journal of Applied Probability. 1964;1(2):177–232.
  37. Der R, Epstein CL, Plotkin JB. Generalized population models and the nature of genetic drift. Theoretical population biology. 2011;80(2):80–99. pmid:21718713
  38. Cannings C. The latent roots of certain Markov chains arising in genetics: a new approach, I. Haploid models. Advances in Applied Probability. 1974;6(2):260–290.
  39. Hedgecock D. Does variance in reproductive success limit effective population sizes of marine organisms? Genetics and evolution of aquatic organisms. 1994; p. 122–134.
  40. Kirkpatrick M, Johnson T, Barton N. General Models of Multilocus Evolution. Genetics. 2002;161(4):1727–1750. pmid:12196414
  41. Gompert Z, Egan SP, Barrett RD, Feder JL, Nosil P. Multilocus approaches for the measurement of selection on correlated genetic loci. Molecular Ecology. 2017;26(1):365–382. pmid:27696571
  42. Buffalo V, Coop G. The linked selection signature of rapid adaptation in temporal genomic data. Genetics. 2019;213(3):1007–1045. pmid:31558582
  43. Felsenstein J. Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution. 1981; p. 1229–1242. pmid:28563384
  44. Haller BC, Messer PW. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Molecular biology and evolution. 2019;36(3):632–637. pmid:30517680
  45. Zhu Y, Bergland AO, González J, Petrov DA. Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster. PLoS ONE. 2012;7(7):e41901. pmid:22848651
  46. Gautier M, Foucaud J, Gharbi K, Cézard T, Galan M, Loiseau A, et al. Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Molecular Ecology. 2013;22(14):3766–3779. pmid:23730833
  47. Alcala N, Rosenberg NA. Mathematical Constraints on FST: Biallelic Markers in Arbitrarily Many Populations. Genetics. 2017;206(3):1581–1600. pmid:28476869
  48. Sella G, Petrov DA, Przeworski M, Andolfatto P. Pervasive natural selection in the Drosophila genome? PLoS Genetics. 2009;5(6):e1000495. pmid:19503600
  49. Cvijović I, Good BH, Desai MM. The effect of strong purifying selection on genetic diversity. Genetics. 2018;209(4):1235–1278. pmid:29844134
  50. Haldane JBS. The cost of natural selection. Journal of Genetics. 1957;55(3):511.
  51. Barton NH. Linkage and the limits to natural selection. Genetics. 1995;140(2):821–841. pmid:7498757
  52. Nuzhdin SV, Turner TL. Promises and limitations of hitchhiking mapping. Current opinion in genetics & development. 2013;23(6):694–699. pmid:24239053
  53. Neher RA, Kessinger TA, Shraiman BI. Coalescence and genetic diversity in sexual populations under selection. Proceedings of the National Academy of Sciences. 2013;110(39):15836–15841. pmid:24019480