Conceived and designed the experiments: JBP. Performed the experiments: SK. Analyzed the data: SK. Wrote the paper: SK JBP. Designed the research: SK JBP.

The authors have declared that no competing interests exist.

Evolutionary pressures on proteins are often quantified by the ratio of substitution rates at non-synonymous and synonymous sites. The dN/dS ratio was originally developed for application to distantly diverged sequences, the differences among which represent substitutions that have fixed along independent lineages. Nevertheless, the dN/dS measure is often applied to sequences sampled from a single population, the differences among which represent segregating polymorphisms. Here, we study the expected dN/dS ratio for samples drawn from a single population under selection, and we find that in this context, dN/dS is relatively insensitive to the selection coefficient. Moreover, the hallmark signature of positive selection over divergent lineages, dN/dS>1, is violated within a population. For population samples, the relationship between selection and dN/dS does not follow a monotonic function, and so it may be impossible to infer selection pressures from dN/dS. These results have significant implications for the interpretation of dN/dS measurements among population-genetic samples.

Since the time of Darwin, biologists have worked to identify instances of evolutionary adaptation. At the molecular scale, it is understood that adaptation should induce more genetic changes at amino acid-altering sites in the genome, compared to amino acid–preserving sites. The ratio of substitution rates at such sites, denoted dN/dS, is therefore commonly used to detect proteins undergoing adaptation. This test was originally developed for application to distantly diverged genetic sequences, the differences among which represent substitutions along independent evolutionary lineages. Nonetheless, the dN/dS statistic is also frequently applied to genetic sequences sampled from a single population, the differences among which represent transient polymorphisms, not substitutions. Here, we show that the behavior of the dN/dS statistic is very different in these two cases. In particular, when applied to sequences from a single population, the dN/dS ratio is relatively insensitive to the strength of natural selection, and the anticipated signature of adaptive evolution, dN/dS>1, is violated. These results have implications for the interpretation of genetic variation sampled from a population. In particular, these results suggest that microbes may experience substantially stronger selective forces than previously thought.

The identification of genetic loci undergoing adaptation is a central project of evolutionary biology. With the advent of sequencing technologies, a variety of statistical tests have been developed to quantify selection pressures acting on protein-coding regions. Among these, the dN/dS ratio is one of the most widely used, owing in part to its simplicity and robustness. This measure quantifies selection pressures by comparing the rate of substitutions at silent sites (dS), which are presumed neutral, to the rate of substitutions at non-silent sites (dN), which possibly experience selection. The ratio dN/dS is expected to exceed unity only if natural selection promotes changes in the protein sequence, whereas a ratio less than unity is expected only if natural selection suppresses protein changes.

The dN/dS ratio was originally developed for the analysis of genetic sequences from divergent species

Here we analyze the population genetics of dN/dS. We find that the relationship between the selection pressure and dN/dS is qualitatively different for samples drawn from a single population compared to samples drawn from divergent lineages. As a result, standard tests for selection based on dN/dS are extremely sensitive to violation of the assumption of divergent lineages. We show that the expected dN/dS ratio within a population is relatively insensitive to selection pressure, a result which helps to explain a body of empirical observations about microbial populations. Moreover, we show that the hallmark signature of positive selection across divergent lineages, dN/dS>1, does not hold within a population: strong positive selection is expected to produce dN/dS<1 among population samples. As a result, when applied to intra-specific samples, the standard interpretation of dN/dS is unjustified and may lead to surprising conclusions. This point is illustrated by two recent studies that report dN/dS ratios near 1 among strains of ^{5}) and strong selective advantages of antibiotic-resistance mutations

Our presentation begins with a review of the theory underlying the interpretation of dN/dS across divergent lineages. We then develop the appropriate theory for studying selection and dN/dS within a single population. We compare our theoretical expectations to Monte Carlo simulations based on the Wright-Fisher model. We conclude with a discussion of practical implications.

There are at least two time-scales on which to investigate adaptive evolution: short time-scales, which apply to genetic variation segregating within a population of conspecifics; and long, or evolutionary, time-scales, which apply when comparing the genomes of divergent species.

Over short time-scales, natural selection at a genetic locus may be inferred by inspecting sequences sampled from a population. Polymorphism data are typically compared to expectations under a neutral null model, such as the Wright-Fisher model that forms the basis of Kingman's coalescent

Over long time-scales, by contrast, natural selection is often quantified by comparing orthologous gene sequences from divergent species. In this context, each species is associated with a single representative genetic sequence, and intraspecific polymorphisms are ignored

Over long time-scales, the dN/dS ratio is an extremely popular measure of adaptive evolution in protein-coding sequences. This measure quantifies selection pressures by comparing the rate of substitutions at silent sites (dS), which are presumed neutral, to the rate of substitutions at non-silent sites (dN), which possibly experience selection. In practice, the dN/dS ratio is commonly estimated from data using, for example, the PAML computer package

The Markov-chain model underlying PAML's calculation of dN/dS explicitly ignores polymorphisms segregating within a population; instead, it represents each divergent species as a single sequence. Furthermore, the Markov-chain model does not describe any details of the process by which a mutation enters a population, changes in frequency, and eventually fixes. Instead, fixation events occur instantaneously in the model, and transient polymorphisms within each divergent population are ignored. These simplifying assumptions are perfectly reasonable when studying substitution rates between long-diverged species (e.g.

Given a data set of diverged sequences, and assuming (or simultaneously inferring) their phylogenetic relationship, PAML estimates the parameter

Although originally formulated without reference to population genetics

Equation (2) provides an important link between

Researchers often compute a dN/dS value when comparing conspecific sequences, whose differences reflect polymorphisms segregating within a population (e.g.

To address this question, we must understand the behavior of the dN/dS statistic within a single population over a relatively short time-scale—i.e. the population genetics of dN/dS. In this context, dN and dS represent, respectively, the number of

In principle, calculating these quantities requires knowing the expected coalescent time between sampled individuals. Since the general expression for the coalescent time in the presence of selection is not known, we approximate dN and dS by the number of

In order to calculate the expected number of differences between two sampled individuals we utilize the stationary allele frequency distribution at a site. If Φ denotes the stationary frequency distribution for polymorphisms that arise at rate

We use diffusion theory to derive an expression for the stationary frequency distribution of polymorphisms at a site, Φ. In the case of recurrent mutation between two alleles with fixed fitnesses 1 and 1+

Strictly speaking, Yang's model of selection is a special case of an infinite-sites model under which subsequent mutations each provide an additional selective advantage (or disadvantage)

In the

Equations (3) and (4) provide an analytic approximation for the expected dN/dS ratio between sequences sampled from a single population, which we denote _{pop}:_{pop} depends on both

Across divergent lineages there is a simple monotonic relationship between the selection coefficient,

The dashed line shows the expected dN/dS ratio for samples from divergent lineages, given by Equation (2). The solid lines show the expected dN/dS ratio for within-population samples, given by Equation (5), under two mutation rates.

Within a single population, however, the relationship between selection and dN/dS is markedly different (

The difference between short and long time-scales is even more striking in the case of positive selection. Within a population, the dN/dS ratio equals 1 under neutrality (

Black squares show the mean±two standard errors of the observed dN/dS ratio. Left panel shows results for ^{3} independent sites.

Compared to the case of divergent lineages, the behavior of dN/dS within a population is so radically different that inferences of positive and negative selection based on dN/dS are problematic or, in many cases, impossible. Whereas dN/dS<1 is a faithful indication of negative selection across divergent lineages, the observation of dN/dS<1 within a population is consistent with either weak negative or strong positive selection. The intuition behind this result is straightforward: strong positive selection within a population will produce rapid sweeps at selected sites (but not at neutral sites, which are assumed independent). As a result, two individuals sampled from such a population are likely to contain the same allele at each selected site, producing a dN/dS value less than unity. By contrast, selective sweeps along divergent lineages will tend to produce fixed differences between representative individuals sampled from the two independent populations. Thus, the simple interpretation of dN/dS that applies to divergent lineages does not apply within a population.
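The divergent-lineage expectation can be made concrete with a short numerical sketch. It assumes that Equation (2) takes the standard form implied by Kimura's fixation probability in the diffusion limit, ω = 2γ/(1 − e^{−2γ}), where γ is the scaled selection coefficient and ω the expected dN/dS ratio. The function name `omega_divergent` is hypothetical and introduced only for illustration.

```python
import math

def omega_divergent(gamma: float) -> float:
    """Expected dN/dS between divergent lineages (assumed form of Equation 2).

    Ratio of the fixation probability of a selected mutation to that of a
    neutral one: 2*gamma / (1 - exp(-2*gamma)), which tends to 1 as gamma -> 0.
    """
    if abs(gamma) < 1e-8:
        return 1.0  # neutral limit; avoids evaluating 0/0
    # -expm1(-2*gamma) equals 1 - exp(-2*gamma) but is accurate for small gamma
    return 2.0 * gamma / -math.expm1(-2.0 * gamma)

# Evaluate the ratio over a range of scaled selection coefficients
for g in (-5, -2, 0, 2, 5):
    print(g, omega_divergent(g))
```

Under this assumed form the ratio increases monotonically with γ and crosses 1 at neutrality, which is the behavior that the divergent-lineage interpretation of dN/dS relies on.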

We performed two sets of Monte Carlo simulations, each based on the Wright-Fisher model with continual selection (i.e. selection

Black squares show the mean±two standard errors of the observed dN/dS ratio. The predicted dN/dS ratios for divergent lineages are shown in dashed lines (Equation 2); the predicted dN/dS ratios for a single population are shown in solid lines (Equation 5). Left column corresponds to results for two independent populations; right column corresponds to results for a single population. Top panels show results for ^{3} independent sites, and the simulations for a single population were performed at ^{4} independent sites.

The simulation results confirm our theoretical analysis of dN/dS. The relationship between selection and dN/dS is accurately described by Equation (2) when comparing individuals sampled from two divergent lineages. By contrast, when individuals are sampled from a single population, the relationship between selection and dN/dS is radically different and accurately described by Equation (5) —even though the simulation procedure used for a single population is identical to the procedure used in each of the two independent populations.
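For readers who want to experiment with the single-population case, a minimal Wright-Fisher sketch follows. It is not the simulation code used in this study: it assumes a haploid population, two alleles with fixed fitnesses 1 and 1+s (rather than continual selection), symmetric mutation at rate `mu`, and it scores a sampled pair by whether the two alleles differ rather than by counting mutations on the connecting lineage. All function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_sites(N, s, mu, generations, sites, seed=0):
    """Evolve independent biallelic sites in a haploid Wright-Fisher population.

    Returns the final count of allele 1 (fitness 1+s) at each site.
    """
    rng = np.random.default_rng(seed)
    k = np.zeros(sites, dtype=np.int64)  # every site starts fixed for allele 0
    for _ in range(generations):
        x = k / N
        x = x * (1.0 + s) / (1.0 + s * x)    # selection on allele 1
        x = x * (1.0 - mu) + (1.0 - x) * mu  # symmetric mutation
        k = rng.binomial(N, x)               # drift: binomial resampling
    return k

def mean_pairwise_difference(k, N):
    """Average probability that two sampled individuals differ at a site."""
    return float(np.mean(2.0 * k * (N - k) / (N * (N - 1))))

# Illustrative parameters (assumptions): N*s = 10 at selected sites
N, mu, generations, sites = 200, 1e-3, 2000, 500
pi_neutral = mean_pairwise_difference(
    simulate_sites(N, 0.0, mu, generations, sites, seed=1), N)
pi_selected = mean_pairwise_difference(
    simulate_sites(N, 0.05, mu, generations, sites, seed=2), N)
print(pi_selected / pi_neutral)
```

With these illustrative parameters the selected sites show fewer pairwise differences than the neutral sites, in line with the sweep intuition. Because of the fixed-fitness assumption, the output should not be compared quantitatively with Equation (5).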

In the second set of simulations we considered a slightly more realistic situation based on the true genetic code. These simulations employed the same Wright-Fisher model with continual selection, but in this case 64 allelic types are available instead of two. We compared two sampled individuals, each consisting of 10^{4} (single population) or 10^{3} (two populations) independent codon sites, and we estimated dN/dS from the sampled sequences using the PAML computer package, as opposed to using the exact ancestry. Thus, these simulations and dN/dS values provide a close representation of data that are likely to be encountered in practice.

| Scaled selection coefficient | Two populations | One population |
| −5 | 0.002 (0.000, 0.014) | 0.001 (0.000, 2.755) |
| | 0.002 (0.000, 0.014) | 0.289 (0.068, 0.813) |
| −2 | 0.068 (0.040, 0.106) | 1.000 (0.000, 19.300) |
| | 0.105 (0.065, 0.159) | 0.608 (0.226, 1.399) |
| 0 | 0.934 (0.712, 1.237) | 0.750 (0.000, 11.020) |
| | 1.066 (0.810, 1.412) | 0.967 (0.456, 1.934) |
| 2 | 4.114 (2.821, 5.451) | 0.500 (0.025, 5.621) |
| | 3.245 (1.840, 4.868) | 1.472 (0.749, 2.796) |
| 5 | 4.409 (2.942, 6.172) | 2.501 (0.396, 14.330) |
| | 2.823 (1.763, 4.023) | 1.680 (0.927, 3.024) |

The framework used in our second set of simulations is more realistic than the simple two-allele framework used in our theoretical analyses or those of Nielsen & Yang

The dN/dS ratio remains one of the most popular and reliable measures of evolutionary pressures on protein-coding regions. Much of its popularity stems from the simple, intuitive interpretation of dN/dS<1 as negative selection, dN/dS = 1 as neutrality, and dN/dS>1 as positive selection. However, this simple interpretation requires that the sequences being compared represent stereotypical samples from divergent populations—an assumption that is also implicit in the methods that estimate dN/dS by maximum likelihood

Recently, Rocha et al. have investigated the relationship between divergence time and dN/dS

The fact that polymorphisms within a population differ from divergences between species is well understood by population geneticists

Our analysis of selection and dN/dS has assumed independence of sites or, equivalently, free recombination between sites. This assumption is unrealistic in many practical settings. However, the same assumption has been made in prior analytic work on dN/dS

We have focused our analysis on Yang's particular formulation of selection, which stipulates that all mutations experience the same selection coefficient compared to the resident type

Complications associated with interpreting dN/dS for population samples do not arise in many practical applications of dN/dS—i.e. those involving comparisons among divergent species. However, as sequence data are increasingly available, there is a temptation to apply computer packages such as PAML to intraspecific data—as has been done in many cases already (e.g.

Many empirical studies of genes evolving under negative selection have found puzzling results, which our analysis helps to clarify: dN/dS values for such genes are typically closer to 1 when comparing intra-specific samples as opposed to inter-specific samples. This observation holds for bacterial data

Our results also have implications for inferences of positive selection based on dN/dS among conspecific samples. Even when samples come from independently evolving populations, the power of the dN/dS statistic to detect positive selection is low when the majority of sites in the protein evolve under purifying selection

For higher eukaryotes, the distinction between multiple independent populations versus a single population is usually clear-cut: samples from different species represent independent populations, whereas conspecific samples should be treated as arising from a single population (unless they are sampled from regions that have been reproductively isolated for more than

As the discussion above suggests, it may be difficult to determine the appropriate time-scale associated with a dataset of sampled microbial sequences, particularly for a virus sampled at different timepoints. In fact, there may not be a single time-scale that applies to the entire dataset. In such cases, the relationship between the observed dN/dS ratios and the underlying selection coefficients will be described by some (unknown) mixture of Equation (2) and Equation (5). In such cases our central conclusion still holds: the relationship between selection and dN/dS is not necessarily a simple monotonic function, and it may be impossible to infer the selection pressure from the dN/dS measurement.

Here we derive the stationary distribution (4) under Yang's model of continual positive or negative selection. Consider a haploid population of constant size

Within the standard diffusion approximation, the system is described by the frequency

Equations (7) and (8) are the initial condition and the normalization condition, respectively. The non-standard condition (9) arises in the model of selection

It is worth noting that our boundary condition is not the same as a periodic boundary condition. A periodic condition would allow probability flux from state

We are interested in the stationary solution Φ(_{1} = 0, we arrive at the classical zero-flux stationary solution by Wright _{1} and _{2}. To take the limit in (9), we notice that the following equality is true for any ^{θ}^{−1}^{2γx}(_{1}Ψ(_{2}) and _{2} = 0. This leads to (4) for 0<

We performed Wright-Fisher simulations of a population of constant size

This simulation takes the following parameters as input: ^{−6}, 5×10^{−5}}, These values correspond to

^{−1} generations in order for it to reach the mutation-selection-drift equilibrium. In the last generation, we sampled two individuals and counted the number of mutations that occurred on the lineage connecting them, _{pop}(

^{−1} generations, after which we counted the number of substitutions (fixation events) that occurred in each population. The number of substitutions,

We also simulated the evolution of a protein coding sequence consisting of

In each simulation at a site, an individual could carry one of the 64 codons. The mutation probability was ^{−7},10^{−6}}. We ran the single population simulations for ^{4} sites for ^{5} generations. We ran the two population simulations for ^{3} sites for ^{−1} generations.

We used the CODEML program from the PAML package to infer the most likely dN/dS ratio for each pair of sequences. We used the likelihood ratio test, based on the χ^{2} distribution, to determine the 95% confidence interval on the estimated dN/dS ratio.
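The interval construction can be sketched independently of CODEML. The example assumes a single free parameter, so that the likelihood-ratio statistic is compared against the χ^{2} distribution with one degree of freedom. It uses a toy binomial likelihood for counts of non-synonymous and synonymous differences (with equal numbers of sites of each kind) in place of PAML's codon model. The names `lrt_interval` and `toy_loglik`, and the counts, are hypothetical.

```python
import numpy as np

CHI2_1DF_95 = 3.841458820694124  # 95% quantile of chi-square with 1 d.o.f.

def lrt_interval(grid, loglik):
    """Return (mle, lower, upper) for a one-parameter likelihood on a grid.

    The interval holds every grid value w whose likelihood-ratio statistic
    2*(max lnL - lnL(w)) does not exceed the chi-square cutoff.
    """
    grid = np.asarray(grid, dtype=float)
    ll = np.array([loglik(w) for w in grid])
    best = int(ll.argmax())
    inside = 2.0 * (ll[best] - ll) <= CHI2_1DF_95
    return grid[best], grid[inside].min(), grid[inside].max()

def toy_loglik(omega, n_nonsyn=30, n_syn=30):
    # Toy model (assumption): each observed difference is non-synonymous
    # with probability omega / (1 + omega)
    p = omega / (1.0 + omega)
    return n_nonsyn * np.log(p) + n_syn * np.log(1.0 - p)

grid = np.linspace(0.01, 5.0, 5000)
mle, lower, upper = lrt_interval(grid, toy_loglik)
print(mle, lower, upper)
```

A grid search is used here only for transparency. The resolution of the reported interval is limited by the grid spacing, and the min/max construction presumes a unimodal likelihood.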

The relationship between the selection coefficient,

Stationary frequency distribution of the mutant allele for the Wright-Fisher model with continual selection. Gray bars show the histogram obtained from the two-allele simulations with

SK and JBP were funded by a grant from the James S. McDonnell Foundation. JBP also acknowledges support from the Burroughs Wellcome Fund, the Penn Genome Frontiers Institute, and the Defense Advanced Research Projects Agency “Fun Bio” Program (HR0011-05-1-0057). The authors are grateful to Todd Parsons, Warren Ewens, Michael Desai, and Michael Lässig for discussions on the diffusion approximation, and to Ricky Der for the asymptotic analysis of the expected dN/dS ratio.