Quantifying how constraints limit the diversity of viable routes to adaptation

Convergent adaptation occurs at the genome scale when independently evolving lineages use the same genes to respond to similar selection pressures. These patterns of genetic repeatability provide insights into the factors that facilitate or constrain the diversity of genetic responses that contribute to adaptive evolution. A first step in studying such factors is to quantify the observed amount of repeatability relative to expectations under a null hypothesis. Here, we formulate a novel index to quantify the constraints driving the observed amount of repeated adaptation in pairwise contrasts based on the hypergeometric distribution, and then generalize this for simultaneous analysis of multiple lineages. This index is explicitly based on the probability of observing a given amount of repeatability by chance under a given null hypothesis and is readily compared among different species and types of trait. We also formulate an index to quantify the effective proportion of genes in the genome that have the potential to contribute to adaptation. As an example of how these indices can be used to draw inferences, we assess the amount of repeatability observed in existing datasets on adaptation to stress in yeast and climate in conifers. This approach provides a method to test a wide range of hypotheses about how different kinds of factors can facilitate or constrain the diversity of genetic responses observed during adaptive evolution.


Introduction
If different species encounter the same selection pressure, will adaptive responses occur via homologous genes or follow distinct genetic routes to the same phenotype? What factors limit the diversity of viable genetic routes to adaptation and how does variation translate into evolution? Empirical studies have identified different amounts of convergent adaptation at the genome scale across a range of species, traits, timescales, and levels of developmental-genetic hierarchy [1][2][3]. When evolution uses the same genes repeatedly to generate a given trait value, is this because of constraints acting on the genetic and developmental pathways limiting the production of variation (i.e., there is only a limited number of ways to generate a given trait value), or because of fitness constraints acting on genotypes that yield the same trait value (i.e., only some genotypes are selectively optimal)? Note that "constraint" in the evolutionary literature is commonly invoked to refer to factors that limit an adaptive phenotypic response in general (e.g., [4,5]). Here, we use it to refer to the factors that limit the diversity of genes used in independent bouts of adaptation and use the term "diversity constraint" hereafter for clarity.
As a case study to examine the differing types of diversity constraint, Mc1r provides perhaps the most well-known example of convergent local adaptation at the gene scale, and has been implicated in driving colour pattern variation in mice, lizards, mammoths, fish, and a range of other organisms [6][7][8][9][10]. Extensive studies in mice have revealed that over 50 genes can be mutated to give rise to variation in colour pattern [11], yet Mc1r consistently tends be one of the main contributors to locally adapted colour polymorphisms. Mc1r has minimal pleiotropic side effects [8,11] and it can mutate to a similar trait value through numerous different changes in its protein sequence [10,11] and therefore may have a higher rate of mutation to beneficial alleles than other genes. As such, it seems to be driven by a combination of both types of constraint: more ways to mutate via Mc1r implies that developmental-genetic constraints limit the contribution of other genes, while limited pleiotropy in Mc1r implies that fitness costs constrain which genes can yield mutations that provide a viable route to adaptation.
Are the diversity constraints acting on melanism representative of the kinds of constraints that shape patterns of genome-scale convergence and non-convergence across the tree of life? To answer this question, it is necessary to quantify the extent of genome-scale convergence in a wide variety of organisms and traits and ascertain what kind of diversity constraints are operating, which requires the development of an appropriate statistical framework. To this end, it is helpful to frame the above questions based on the flexibility of the mapping from genotype to trait to fitness: does repeatability occur because of low redundancy in the mapping of genotype to trait (hereafter low GT-redundancy [4]: only a few ways to make the same trait value; Fig 1A), or because of low redundancy in the mapping of genotype to fitness? (hereafter low GF-redundancy: only a subset of the genotypes yielding the same trait value are optimal; Fig  1B).
GT-redundancy is determined by two factors: 1) the difference between the number of genes that need to mutate to yield a given trait value and the number of genes that could mutate to give rise to variation in the trait, and 2) the extent to which different genes have of a phenotypic effect, because architectures with different allele effect sizes and linkage relationships can have different fitness depending on the interaction between migration, selection, and drift. For example, if a given phenotype is coded by many small unlinked alleles, this architecture would be less fit than a similar phenotype coded by a few large or tightly linked alleles, in the context of migration-selection balance [15] or negative frequency dependence [16,17]. Similarly, the increased drift that occurs in small populations may prevent alleles of small effect from responding to natural selection [18,19], resulting in such genotypes being effectively neutral and therefore lower in realized fitness than those made up of large-effect alleles. For example, polygenic models of directional selection (e.g., [20]) assume no GT-and GF-redundancy, while traditional quantitative genetic models of Gaussian stabilizing selection assume high GT-and GF-redundancy (e.g., [21]).
In addition to GT-and GF-redundancy, other factors also impact signatures of convergence, such as differences among genes in recombination rate or propensity to retain standing variation. There has been considerable discussion in the literature about the effects of these and other factors on convergence [3,[23][24][25][26][27][28][29][30], and various indices have been previously used to quantify repeatability in empirical contexts (e.g. Jaccard index, Proportional Similarity; [1,2]). These existing indices provide a useful description of how often the same gene is used in adaptation, but as we will show below, they are not well-suited for testing of hypotheses to discriminate between these different kinds of constraint. They do not incorporate information about the genes that could contribute to adaptation but don't, which is necessary to evaluate what kinds of diversity constraints are operating, and they are not explicitly tied to the probability of repeatability occurring under a null model.
Here, we develop novel statistical approaches for quantifying the diversity constraints that drive repeatability in genomic data from studies of local adaptation and experimental evolution. To study these constraints, we formulate an explicit probability-based representation of the deviation of observed repeatability from expectations under different null hypotheses. This approach can be used after standard tests have been applied to identify the putative genes driving adaptation and uses as input either binary categorization of genes as "adapted" or "non-adapted" or any continuous index representing the relative amount of evidence for a given gene being involved in adaptation (e.g. F ST , p-values, Bayes factors). We begin by formulating an analytical model for a contrast of two lineages with binary data, and then generalize this model for contrasts of multiple lineages using either binary or continuous data. We also propose a novel index estimating the proportion of genes in the genome that can potentially give rise to adaptation. In all cases, these models can be used to successively test null hypotheses that incorporate different amounts of information about the constraints that could shape repeatability. The simplest null hypothesis is that there are no constraints and all genes have equal probability of contributing to adaptation. If more repeatability is observed than expected under this null model, then two inferences can be made: natural selection is driving patterns of convergence (and that observed signatures are not all false positives), and some diversity constraints are operating to increase the repeatability of adaptation. We then consider how other null hypotheses can be formulated to represent the various kinds of constraints discussed above. We focus mainly on the effect of low GT-redundancy, where the number of genes that could potentially contribute to adaptation is much smaller than the total number of genes in the genome, but also discuss how constraints arising from GFredundancy, standing variation, or mutation rate could be modeled. Because this method quantifies repeatability in terms of probability-scaled deviations from expectations, it can be applied across any trait or species of interest, allowing contrasts to be made on the same scale of measurement.

Quantifying diversity constraints in pairwise contrasts
Suppose there are two lineages, x and y, that have recently undergone adaptation to a given selection pressure, resulting in convergent evolution of the same trait value within each lineage. This adaptation could be global, with new mutations fixed within lineages (e.g., in experimental evolution studies with multiple replicate populations), or local, with mutations contributing to divergence among populations within each lineage (e.g., in observational studies of natural adaptation to environmental gradients). In either case, we assume that adaptation can be reduced to a binary categorization of genes as "adapted" or "non-adapted" to represent which genes contribute to fitness differences (either relative to an ancestor or a differentlyadapted population). We use the following notation to represent different properties of the genomic basis of trait variation: the number of loci in the genome of each species is n x , and n y , with the number of orthologous loci shared by both species being n s ; the adaptive trait is controlled by g x and g y loci in each species, with g s shared loci (i.e. the loci in which mutations will give rise to phenotypic variation in the trait, hereafter the "mutational target"); of the g loci that give rise to variation, only a subset have the potential to contribute to adaptation due to the combined effect of all constraints, represented by ga x and ga y , with ga s shared loci (the "effective adaptive target"); in a given bout of adaptation, the number of loci that contribute to adaptation in each lineage is a x and a y , with a s orthologous loci contributing in both lineages. For simplicity, we assume that there is complete overlap in the genomes (n s = n x = n y ), mutational targets (g s = g x = g y ), and loci potentially contributing to adaptation (ga s = ga x = ga y ) in both species (see supplementary materials and S1 Fig for set notation). These assumptions are most appropriate for lineages that are relatively recently diverged, where most orthologous genes are retained at the same copy number and the developmental-genetic program is relatively conserved, so that the same genes potentially give rise to variation in both lineages. Lineages separated by greater amounts of time would be expected to have reduced n s due to gene deletion, duplication, and pseudogenization in either lineage, and reduced g s and ga s due to evolution and divergence of the developmental-genetic program, through sub-and neo-functionalization, and divergence in regulatory networks.
Under the assumption that all ga s genes have equal probability of contributing to adaptation (i.e., no diversity constraints are operating), the amount of overlap in the complement of genes that are adapted in both lineages (a s ) is described by a hypergeometric distribution where the expected amount of overlap is ā s = a x a y /ga s (e.g. [31]). In practice, we typically have little prior knowledge about which genes have the potential to contribute to either adaptation (ga s ) or standing variation in the trait (g s ), but we can draw inferences about how these parameters constrain the diversity of adaptive responses by testing hypotheses and comparing the observed amount of overlap (a s ) to the amount expected under a given null hypothesis (ā s ). To test different hypotheses about how diversity constraints give rise to repeated adaptation, we represent the total number of genes included in the test set as g 0 . The simplest null hypothesis is that there are no diversity constraints and all genes potentially give rise to variation and contribute to adaptation (g 0 = ga s = g s = n s ), so by rejecting this null, we can infer that ga s < n s , and calculate an effect size that represents the magnitude by which all types of constraints contribute to repeatability (see Eq 1, below; note that it is also possible that the null hypothesis could be falsified in the opposite direction, with less overlap in the loci contributing than expected under the null, which might occur if evolution had occurred towards a different optimum in each lineage). Without independent lines of evidence about which genes potentially contribute to variation in the trait (g s ), it is not possible to evaluate the relative importance of GT-vs. GFredundancy using the framework here. In model systems where independent information is available for the magnitude of g s (based on mutation accumulation or GWAS; see Discussion), then a more refined null hypothesis can be tested, where g 0 = g s , allowing some inferences to be made about the relative importance of GT-and GF-redundancy (Table 1). By rejecting this null, we can infer that ga s < g s , which could occur due to low GF-redundancy or differences among genes in mutation rate or standing variation. Alternatively, if we fail to reject this null hypothesis, then it suggests that g s ffi ga s , which would imply that GF-redundancy doesn't make any additional contribution to repeatability beyond the contribution of GT-redundancy. We can also reverse the direction of inquiry and estimate ga s directly from the data by calculating b ga s ¼ a x a y =a s , such that an index representing the effective proportion of the genome that can potentially contribute to adaptation can be calculated asP a;hyper ¼ a x a y =ða s n s Þ.
For any value of g 0 , an effect size representing the excess in overlap due to convergence relative to the null hypothesis can be expressed by standardizing the observed overlap by subtracting the mean (ā s = a x a y /g 0 ) and dividing by the standard deviation of the hypergeometric distribution: This index provides a quantitative representation of how much more overlap occurs than expected under the null hypothesis, scaled according to how much a given bout of evolution would deviate from this expectation if the null hypothesis were true. Similarly, the exact probability of observing a s or more shared loci contributing to adaptation can also be calculated using the hypergeometric probability (see Supplementary Information for sample R-script), which provides a p-value.

Quantifying diversity constraints in multiple lineages
While pairwise contrasts are most straightforward statistically, they have considerably lower power than comparisons among multiple lineages. If one gene (such as Mc1r) tends to drive adaptation repeatedly in a large number of lineages, this may go undetected in an approach using multiple pairwise comparisons but would be readily detected in a simultaneous comparison of multiple lineages. Unfortunately, while the hypergeometric distribution provides an exact analytical prediction for the amount of overlap in a pairwise comparison, which can be used to calculate a p-value and the probability-based effect size (C hyper ), it cannot be easily generalized to simultaneously analyze multiple lineages. While it is possible to conduct pairwise analysis and average the results across multiple comparisons, p-values from this approach might fail to detect cases where a single gene contributes repeatedly to adaptation in more than two lineages, as information does not transfer among the pairwise comparisons. We now develop an alternate, approximate approach to assess repeatability in multiple lineages by calculating Pearson's χ 2 goodness of fit statistic and comparing this to a null distribution of χ 2 statistics simulated under the null hypothesis to calculate a p-value as the proportion of replicates in the null that exceed the observed test statistic. The p-value obtained by this approach Quantifying how constraints limit the diversity of viable routes to adaptation represents the probability of observing a test statistic as extreme or more extreme under the null hypothesis, considering all lineages simultaneously. While the p-value is calculated from simultaneous analysis of all lineages, the effect size is instead calculated as an average across all pairwise comparisons among the k replicate lineages, because this represents the increase in repeatability relative to expectations under the null for a given bout of adaptation in any single lineage. This difference is important because the effect size should not depend on sampling effort in terms of the number of lineages, while the p-value should reflect the statistical power gained from multiple lineages. Consider the case where g 0 genes can potentially contribute to adaptation in the given trait and each lineage has some complement of genes that have mutated to drive adaptation, with α i,j representing the binary score for gene i in lineage j (1 = adapted, 0 = non-adapted). The summation for gene i across all lineages provides the observed counts (o i = S j α i,j ) while the expected counts (e i ) can be set based on the null hypothesis being tested. Under null hypotheses where all genes in g 0 have equal probability of contributing to adaptation, the expected counts are equal to the mean of the observed counts (e = S i o i /g 0 ), and Pearson's χ 2 statistic can be calculated by the usual approach: χ 2 = S(o − e) 2 /e. Under ideal conditions, Pearson's χ 2 would approximate the analytical χ 2 distribution with its mean and standard deviation equal to the degrees of freedom (df) and 2df, respectively. While this could be used to make an analytical hypothesis test (as above), in practice there will often be large deviations between Pearson's χ 2 and the analytical distribution, due to violation of the assumptions when expected counts are low (See Supplementary Materials, S2 Fig). Instead, we simulate a null distribution of w 2 sim values under the null hypothesis by using permutation within each lineage and recalculating w 2 sim for each replicate. The p-value is then equal to the proportion of the w 2 sim values that exceed the observed χ 2 (using all lineages simultaneously), while the effect size is calculated as the mean C-score across all pairwise contrasts (simulating w 2 sim for each pairwise contrast): The magnitude of C chisq therefore represents deviation between the observed amount of repeatability and that expected under the null hypothesis, which will vary as a function of the diversity constraints affecting the trait evolution, but not the number of lineages being compared. While C chisq relies upon simulation of a null distribution, it can be calculated relatively quickly. Importantly, the magnitude of C chisq varies linearly with C hyper (Fig 2A & 2B), showing that it represents the extent of diversity constraints in the same way as the analytically precise C hyper . While this approach provides a more accurate p-value for comparisons of multiple lineages, there is no particular reason to use C chisq rather than C hyper for binary input data, as both effect sizes are calculated on a pairwise basis. The main reason that we develop this approach is to extend it to continuously distributed data, which can allow greater sensitivity and avoid arbitrary choices necessary to categorize the commonly used indices of local adaptation (e.g. F ST or p-values) into "adapted" or "non-adapted".

Quantifying diversity constraints with continuous data
In many empirical contexts, genome scans for selection yield continuously distributed scores representing the strength of evidence for each locus contributing to adaptation (e.g., F ST , p-values, Bayes factors). Using the same notation as above, but with α i,j representing the continuous score for the i th gene in the j th lineage, the total score for each gene can be calculated as a sum across lineages, " a i ¼ P k j a i;j , while the mean score over all genes and lineages is " " A statistic analogous to the above χ 2 can then be calculated as w 2 ¼ P ð" a i À " " aÞ 2 = " " a, and the same approach for calculating the null distribution of this statistic can then be used to calculate C chisq according to Eq 2.
With continuous data, there are additional complexities that arise depending on the distribution of the particular dataset being used and how its magnitude represents evidence for a gene's involvement adaptation. One approach, which we used in all examples here, is to transform data so that values scale positively and approximately linearly with the weight of evidence for adaptation, by standardizing data within each lineage by subtracting their observed withinlineage minimum and dividing by their observed within-lineage maximum, such that the values within each lineage are bounded from 0 to 1. This reduces differences among lineages in the absolute magnitude of indices representing adaptation, which can be desirable when they C chisq and C hyper provide approximately equal estimates of the magnitude of the diversity constraints driving repeatability, whilê P a;lik provides an estimate of the proportion of all genes that could potentially contribute to adaptation, which is not collinear with the Cscores. Plots show values calculated for simulated datasets generated by randomly drawing two arrays with g s genes, with a i loci adapted in one array and a i + 20 in the other, and then sorting a proportion of the rows in each array to artificially generate more repeatability than would occur by chance (with a different proportion sorted in each replicate). In Panel A&C, g s = 200; in panel B&D, a i = 10;P a;lik calculated using Eq 4.
https://doi.org/10.1371/journal.pgen.1007717.g002 vary across many orders of magnitude (e.g. p-values from GWAS of 10 −10 and 10 −20 both provide strong evidence of adaptation). However, if some lineages actually have stronger signatures of adaptation at more loci, then this kind of standardization should not be used, as it would obscure these true differences among lineages. In this case, it would be preferable to use the same standardization across all lineages by subtracting the minimum and dividing by the maximum values observed across all lineages. While Pearson's χ 2 statistic was designed for discrete data, the above approach using continuous data represents the variability among lineages in the same way, as a variance among genes in the sum of their scores representing putative adaptation. The C chisq statistic on continuous data behaves similarly to the C hyper statistic across wide ranges of parameter space, as both are formulated in terms of deviations from the null distribution (see below).

What proportion of the genome can potentially contribute to adaptation?
While the number of genes that potentially contribute to adaptation (ga s ) can be estimated using the hypergeometric equation, b ga s ¼ a x a y =a s , it is difficult to apply this to comparisons of multiple lineages, as some pairwise contrasts may have no overlap in the genes contributing to adaptation (a s = 0), making the equation undefined. To estimate b ga s from all lineages simultaneously, we can instead formulate a likelihood-based approach where the probability that we observe locus i adapted in o i lineages is: where Bin(n,y,x) is the probability under the binomial distribution of getting x successes in n trials, each with probability y. As above, o i is the number of adapted genes in k lineages (with , P a is the proportion of g 0 that can actually contribute to adaptation (P a = ga s /n s ), and ō is the probability of each gene contributing to adaptation ō = So i /(ga s k). The estimated value of b ga s is then the value at which the log-likelihood function: is maximized. Once the maximum-likelihood value of b ga s is estimated, this can be expressed either as an absolute number representing the effective number of genes that can contribute to adaptation or as a proportion of the total number of shared genes in the genome: P a;lik ¼ b ga s =n s . This approach implicitly assumes that all genes that have the potential to contribute to adaptation (ga s ) have approximately equal probabilities of actually contributing to adaptation. In very extreme cases, such where one gene is very highly repeatable while other genes only contribute to adaptation in a single lineage, b ga s will tend to represent the contribution of the repeatable genes and discount the contribution of the idiosyncratic genes (see Supplementary Materials). Multi-class models could be developed to estimate ga s for different classes of genes in such scenarios by accounting for their different probabilities of contributing to adaptation (See https://github.com/samyeaman/dgconstraint for scripts containing functions for the above calculations).

Comparison between indices for quantifying convergence
The C hyper , C chisq , andP a;lik estimators capture different aspects of the biology underlying convergence than other previously used estimators of repeatability. To estimate the repeatability of evolution, Conte et al. [1] used the additive and multiplicative Proportional Similarity (PS add and PS mult ) indices of [32] in a meta-analysis of QTL and candidate gene studies, while Bailey et al. [2] used the Jaccard Index to quantify patterns in bacterial evolution experiments.
The PS indices are defined as PS add = S min(α ix , α iy ) and where α ix and α iy are the relative contribution of gene i to adaptation in lineages x and y [33], while the Jaccard index is defined as where A x and A y are the sets of adapted genes in each lineage [2]. Both of these indices are based on standardizing the number of overlapping adapted loci by the total number of adapted loci, and neither includes information about non-adapted genes that potentially could have contributed to adaptation.
To illustrate the differences between these various indices of convergence, we generated four example datasets showing either randomly drawn complements of genes with adapted mutations (Fig 3A) or highly convergent datasets drawn from a smaller (Fig 3B) or larger ( Fig  3C & 3D) pool of genes that potentially contribute to trait variation (g s ), with differing numbers of loci contributing to adaptation. Scenario C is the most constrained, as it exhibits the same amount of overlap as B, but this overlap is drawn from a larger pool of genes so it is less likely to occur by chance. While neither the Jaccard index nor the PS indices distinguish between the B, C, and D scenarios (as the same proportions of genes are being used for adaptation, so repeatability is the same), both the C chisq and C hyper indices show the highest scores for scenario C, because it has the smallest probability of occurring by chance if all genes had equal probabilities of contributing to adaptation. TheP a;lik index also identifies scenario C as most constrained in terms of the smallest proportion genes potentially contributing to adaptation. TheP a;lik index also shows that this proportion is equal for scenarios B & D, despite differences in the probability of the observed repeatabilities occurring by chance (as per the C-scores). More generally, whileP a;lik tends to decrease with increasing C-score, these indices differ in magnitude (Fig 2C & 2D), as they represent different aspects of diversity constraints. In summary, the Jaccard and PS indices quantify the proportion of genes used for adaptation that are used repeatedly, the C-score indices are inversely proportional to the probability of the observed repeatability occurring if there were no constraints, andP a;lik represents the proportion of genes in the genome that are available for adaptation, given the existing diversity constraints (also see S3 Fig for further comparisons).

Simulating convergence using individual-based simulations
To further explore the effect of population genetic parameters on the behaviour of the above indices of repeatability and constraint, we used Nemo (v2.3.45; [34]) to simulate two scenarios of two-patches under migration-selection balance: (i) constant size of mutational target with variable proportions of small-and large-effect loci; and (ii) constant number of large-effect loci and variable number of small effect loci, resulting in a variable size of mutational target. For scenario (i), simulations had n = g s = 100 loci, of which u loci had alleles of size +/-0.1, while (100 − u) loci had alleles of size +/-0.01 (with subsequent mutations causing the allele sign to flip from positive to negative or the reverse). For scenario (ii), simulations had 10 largeeffect loci with alleles of size +/-0.1 and v small-effect loci with alleles of size +/-0.01, resulting in a variable size of mutation target. In all simulations, migration rate was set to 0.005 and the strength of quadratic phenotypic selection was 0.5, so that an individual perfectly adapted to one patch would suffer a fitness cost of 0.5 in the other patch (patch optima were +/-1; similar to [13]). Simulations were run for 50,000 generations and censused every 100 generations. For binary categorization of the input data, loci were considered to be "adapted" if F ST > 0.1 for >80% of the last 25 census points (these cut-offs are somewhat arbitrary, but qualitative patterns were comparable under different cut-offs); for continuous input data, raw F ST values Quantifying how constraints limit the diversity of viable routes to adaptation were used. Results are averaged across 20 runs, each with 20 replicates, with C chisq calculated across the 20 replicates within each run.
These scenarios further illustrate the difference between the Jaccard and PS add indices of repeatability and the C-score andP a;lik indices of constraint. In both scenarios, the small effect loci do not tend to contribute much to adaptation because large effect loci are more strongly favoured under migration-selection balance [35], which results in low GF-redundancy. In scenario (i), all indices show qualitatively similar patterns, with decreasing repeatability occurring as a result of the decreasing constraints that occur as the number of large-effect loci increases, increasing the GT-and GF-redundancy (Fig 4A). By contrast, in scenario (ii), the Jaccard and PS add indices indicate that roughly the same amount of repeatability is occurring regardless of the number of small effect loci and total size of mutational target (Fig 4B). However, over this same range of parameter space, the C -score indices show that constraint increases as the total mutational target is increasing. This occurs because while a larger number of potential routes to an adaptive phenotype are available with increasing number of small effect loci, only the same small number of loci are actually being involved in adaptation (i.e. the large effect loci), which is illustrated by the decrease in theP a;lik index. While there are many potential genetic routes to adaptation that could involve these small effect loci (high GT-redundancy), the large effect loci tend to be favoured and repeatedly involved in adaptation (low GF-redundancy). Thus, when the size of the mutational target increases in scenario (ii), the repeatability tends to stay about the same (Jaccard and PS add ) but the amount of constraint is higher (C-scores), because a smaller proportion of the available routes to adaptation are being used (P a;lik ). The continuous and binary C chisq indices are broadly similar across these parameters because there is very little variation in F ST among loci within the same size class (see Supplementary Materials for additional simulations under varying allele effect sizes).

Adjusting for incomplete sampling of the genome
The amount of constraint quantified by the C-score will depend upon the proportion of the mutational target (g s ) that is sampled by the sequencing approach, which should be proportional to the sampling of the total number of genes in the genome (n s ). Some approaches, such as targeted sequence capture, will sample only a subset of the total number of genes in the genome, which will therefore cause a bias in the estimation of constraint due to incomplete sampling, even if the genes included are a random subset of g s . This can be most clearly seen in the calculation of C hyper , where multiplying all the variables in Eq 1 by a given factor will cause a change in the magnitude of the effect size. By contrast, the Jaccard and PS measures of repeatability are not affected by incomplete sampling. If binary input data are being used and the proportion of g s that has been sampled can be accurately estimated (q), then the calculation of C hyper can be corrected by dividing all input variables by q prior to calculation, yielding a corrected score C hyper-adj . If continuously distributed input data are being used, then the dataset can be adjusted by adding g 0 (1 − q) new entries to the dataset by randomly sampling genes with replacement from the existing dataset, and then applying Eq 2 to this extended set.
To explore the effect of incomplete sampling of the genome on the calculation of C-scores and the impact of these types of correction, we constructed a test dataset by concatenating 5 replicates from the simulations in Fig 4A with 10 large effect loci, yielding a dataset with 500 loci in total and a high amount of repeatability. We then sampled a proportion q of this total dataset to simulate incomplete representation of the genome and used the above approach calculate uncorrected and corrected C-scores. While incomplete sampling can cause considerable bias in C-scores, as long as q is not too small, these approaches yield relatively accurate  mutational target (B).P a;lik shows qualitatively similar patterns to the C-scores, with a decreasing proportion of the genome accessible to adaptation occurring in scenarios with higher C-scores and higher constraints. In panel A, all runs have n s = g s = 100 loci, with u large effect loci and (100 − u) small-effect loci. In panel B, there are 10 large-effect loci, and v small-effect loci. In both scenarios, simulations were run with N = 10,000 individuals in each patch, recombination rate of r = 0.5 between loci, and per-locus mutation rate = 10 −5 . The calculation of C hyper is based on corrections of these estimates (Fig 5). At very low values of q, the variance in estimation among replicate subsets increases as a result of sampling effects when only a small number of adapted loci are included, but on average the magnitude of the corrected C-score is independent of q.

Example: Stress resistance in yeast
Experimental evolution studies provide a controlled framework to test theories on the genetic basis of adaptation under a diversity of scenarios. Gerstein et al. (2012) previously conducted an experiment to examine the diversity of first-step adaptive mutations that arose in different lines initiated with the same genotypes in response to the antifungal drug nystatin [36] and in response to copper [37]. The design allowed them to directly test how many different first-step solutions were accessible to evolution when the same genetic background adapted to the same environmental stressor. In the nystatin-evolved lines they identified 20 unique and categorizing genes as adapted when F ST > 0.1, while the calculation of C chisq is based on F ST standardized by subtracting the minimum value and dividing by the maximum within each lineage.
https://doi.org/10.1371/journal.pgen.1007717.g004 independently evolved mutations in only four different genes that act in the nystatin biosynthesis pathway: 11 unique mutations in ERG3, seven unique mutations in ERG6, and one unique mutation in each of ERG5 and ERG7 [36]. The genotypic basis of copper adaptation was broader, and there were both genomic (SNPs, small indels) and karyotypic (aneuploidy) mutations identified. If we consider just the genomic mutations, mutations were found in 28 different genes, with multiple mutations identified in four genes (12 unique mutations in VTC4, four unique mutations in PMA1, and three unique mutations in MAM3 and VTC1). If we assume that all genes in the genome could potentially contribute to adaptation (i.e. g 0 = 6604), then C hyper-nystatin = 32.5, while C hyper-copper = 12.3, and p < 0.00001 in both cases.
If we assume much lower GT-redundancy and that only the observed genes could possibly contribute to the trait (i.e. g 0-nystatin = 4, g 0-copper = 28), we can test whether the mutations are still more clustered than expected within these sets. Using the methods outlined above, we find C hyper-nystatin = 0.35, p = 0.002, and C hyper-copper = 0.43, p < 0.0001, indicating that even under the severe developmental-genetic constraints to diversity represented by this model, these data are slightly more overlapping than expected at random, likely due low GF-redundancy and potentially gene-specific differences in mutation rate (because these experiments were initiated using isogenic strains, standing variation was precluded).
Experimental evolution studies lend themselves nicely to future hypothesis testing about the impact of constraint on the genetic basis of adaptation and provide us with hypotheses about differences between the genes that were and were not observed in the screen. For example, we parsed the Saccharomyces Genome Database (http://www.yeastgenome.org) for genes that have been annotated as "resistance to nystatin: increased", where this phenotype is conferred by the null mutation. This should be a conservative dataset, as we also expect there could be mutations in additional genes that do not result in a loss-of-function phenotype that could also confer tolerance to nystatin (although we expect that the mutations we recovered in ERG3, ERG5 and ERG6 are all similar to loss of function mutations, ERG7 is inviable when null [36]). This identified an additional five genes (KES1, OSH2, SLK19, VHR2, YEH2). We can test whether the five genes without an observed mutation have a negative pleiotropic effect when null or are in areas of the genome with a lower mutation rate compared to the ERG genes (particularly compared to ERG3 and ERG6). Similar experiments could also be conducted with different Saccharomyces cerevisiae genetic backgrounds, with closely related species, or under slightly different environmental conditions (e.g., increased or decreased concentration of stress) to directly examine how different aspects of the genomic and ecological environments influence the observed level of constraints acting on adaptation.

Example: Cold tolerance in conifers
Lodgepole pine and interior spruce both inhabit large ranges of western North America and display extensive local adaptation, with large differences in cold tolerance between northern and southern populations in each species. Recent work studied the strength of correlations between population allele frequencies and a number of environmental variables and phenotypes in each species [38]. Taking one representative environmental variable as an example, a total of 50 and 121 single-copy orthologs showed strong signatures of association to Mean Coldest Month Temperature (MCMT) in pine and spruce, respectively, with 5 of these genes overlapping (based on binary categorization using the binomial cutoff "top candidate" method, as per [38]). This study included a total of 9891 one-to-one orthologs with sufficient data in both species (i.e. at least 5 SNPs per gene), so observing 5 genes overlapping corresponds to C hyper = 5.6 and p = 0.00034 under the null hypothesis that all genes had equal potential to contribute to adaptation. Alternatively, it is also possible to estimate C chisq on continuously distributed data by calculating top candidate scores for each gene using the binomial probability of seeing u outliers when there are v SNPs in a given gene, with an overall rate of w outliers per SNP (as per [38], this yields an index rather than an exact probability, due to linkage among SNPs). This approach is more sensitive to weak signatures of adaptation that occur below the binary categorization cutoff, yielding C chisq = 5.1 and p < 0.00001. Assuming that the 9891 studied genes represent a random sample from approximately 23,000 genes in the whole genome and ignoring divergence in gene content between species (n s = n x = n y ), the adjusted C-scores are C hyper-adj = 8.6 and C chisq-adj = 7.8 (with resampling of 50 replicates and 10,000 permutations per replicate), providing a very rough estimate of the total diversity constraints driving repeatability.
As it is possible that some factors such as conservation of the genomic landscape of cold-and hot-spots of recombination could spuriously drive signatures of convergence (see Discussion), it is possible to perform a basic control by comparing the above results for pine-MCMT vs. spruce-MCMT to the overlap between top candidates for different environments in each species, where convergence would not be expected. Examining the variable least strongly correlated to MCMT (annual heat-moisture index; AHM), we find 23 top candidates in spruce with one overlap with pine-MCMT top candidates and 25 in pine with no overlaps with spruce-MCMT, which correspond to p = 0.11 and p = 1, respectively. Thus, there was no significant increase in overlap among the top-candidates for different variables, where signatures of convergence are unexpected but could have been generated spuriously by some combination of demography and low recombination in some regions of the genome. However, as this is a negative control, the lack of a significant result does not prove that such effects are absent, so caution is necessary when drawing inferences from these data.
These diversity constraints correspond to an effective adaptive target of ga s = 1462 genes that could potentially contribute to adaptation in these species (out of 9891), which yieldŝ P a;lik ¼ 0:15. However, a large number of the 50 and 121 genes identified using their "top candidate test" were likely false positives, because there were no controls for population structure during the association test, as this was subsequently accounted for by the among-species comparison. Thus, if we assume a 50% false positive rate for a x and a y , then ga s declines to 370 genes, withP a;lik ¼ 0:037. In their analysis, Yeaman et al. [38] used another more sensitive test (null-W) to identify loci with signatures of convergence that were not detected based on overlap in the top candidates lists, which suggests the true amount of repeatability may be higher than inferred here. This example illustrates how these kinds of statistics may be used to make inferences about constraints, but also highlights the sensitivity of the results to small changes in parameters.

Discussion
The methods developed here provide a way to estimate the effective number of loci that can potentially contribute to adaptation and an index to quantify the total amount that all constraints contribute to repeatability relative to the null hypothesis of no constraints ("C-score"). Importantly, these statistics can be used to contrast the total constraints affecting convergence in highly divergent traits and species. The comparison between antimicrobial and copper resistance in yeast vs. cold tolerance in conifers suggests that the latter trait is less constrained than the former (C antimicrobial = 32.5; C copper = 12.3; C climate = 7.8), but that in all cases considerably more repeatability in adaptation is seen than under a model with no constraints. As C-scores are scaled by the standard deviation of the probability distribution, their magnitude scales linearly with decreasing probability of the observed repeatability occurring by chance under the null model. In conjunction with other information about number of loci that could potentially give rise to standing variation (g s ), this method can be used to test hypotheses about whether observed repeatability is due to GT-or GF-redundancy, or other confounding factors. We now review how these methods can be used to draw inferences and discuss potential problems that should be considered in their implementation.

Hypothesis testing to identify the factors that constrain diversity
Under the simplest null hypothesis that there are no diversity constraints, all genes can give rise to potentially adaptive variation in the trait (g 0 = g s = ga s = n s ). While simplistic, this approach provides an intuitive method to assess whether the amount of convergence observed is more than expected due to pure randomness. But what do we learn if we reject such a simple null hypothesis? Two inferences can be drawn in this case: many of the genes flagged by our tests for selection are likely evolving by natural selection (i.e., they are not all false positives) and some kind of constraint is involved in shaping this adaptation. The former inference means that analyzing comparative data for convergence can provide a powerful tool for identifying the genes involved in local adaptation, as this is often a significant methodological hurdle in evolutionary biology (e.g., [38]). The latter inference may seem a straw-man, as few molecular biologists would advocate a model where every gene can mutate to give rise to adaptively useful variation in a given trait. However, different forms of the "universal pleiotropy" model have been assumed in theoretical quantitative genetics [39], and the recently proposed "omnigenic model" advocates extensive pleiotropy [14]. Regardless of the true number of genes involved, this null hypothesis provides a benchmark against which we can quantify how all factors constraining the diversity of forms combine to drive repeatability, which is useful for interpreting patterns of repeatability among species and traits.
In order to make inferences about the potential importance of different kinds of diversity constraints driving repeatability, it is necessary to specify more realistic models for the evolution of local adaptation that incorporate different assumptions about size of the mutational target of the trait, extent of shared standing variation, differences in mutation rate among genes, distribution of mutation effect sizes, and species demography. The simplest modification to the above null model is to represent the extent of GT-redundancy by specifying the number of loci that potentially contribute to trait variation as a subset of the total number of loci in the genome (g 0 = g s < n s ). In the context of the C hyper index (Eq 1), reducing g 0 increases both the mean and standard deviation of the hypergeometric distribution and therefore decreases C hyper and the inferred level of residual (unexplained) constraints. If empirical estimates of g s result in C hyper~0 , then it is reasonable to conclude that low GT-redundancy is mainly responsible for the observed amount of convergent adaptation. This would not discount the importance of natural selection overall, as selection on the phenotype is still responsible for adaptation but would suggest that g s individual loci are more or less interchangeable and GF-redundancy contributes no additional constraints above those imposed by GT-redundancy (Table 1). However, as we have few (if any) conclusive estimates of g s in highly polygenic traits [40,41], the extent of constraint arising through low GT-redundancy will be difficult to assess without further directed study. Although they are by no means simple experiments to conduct, it should be possible to estimate g s from QTLs identified in multiple mutation accumulation experiments, as the number of loci detected across all experiments should asymptote towards g s , and rarefaction designs could be used to estimate g s based on the overlap between QTLs detected in two experiments (although this would likely still be biased by failing to detect loci of small effect). A similar approach to the study of repeatability in adaptive loci taken here could be applied to multiple GWAS results on standing variation for a given trait conducted independently in different species to assess the proportion of shared loci. However, it should be noted that in this case, the loci that contribute to standing variation could be shaped by previous selection and might therefore be more convergent than those identified using mutation accumulation, especially if long-term balancing selection is operating (e.g. [42]).
In order to draw inferences about the importance of these types of redundancy, it is critical to account for other factors unrelated to GT-and GF-redundancy that might drive repeatability, mainly through differences among genes in mutation rate or standing variation. The simplest approach to control these factors is to design studies that preclude shared standing variation, either through experiments founded from isogenic strains (e.g., [2,36]) or comparisons of distantly related lineages (divergence time >> 4N e ) where lineage sorting has been completed (as per [38]). While repeatability could still be driven by differences among genes in mutation rate, this can be seen as a component of GT-redundancy and therefore as a factor that can also constrain diversity. By contrast, the existence of shared standing variation occurs mainly due to historical contingency and is therefore a bias affecting estimation of C-scores, rather than a constraint. As such, parsing the contribution of mutation rate to C-scores andP a is less critical than parsing the contribution of standing variation when using these as overall indices of constraint. Unfortunately, in studies of recently diverged natural populations, it is not possible to preclude shared standing variation, so C-scores andP a could be strongly driven by this factor and therefore not particularly representative of diversity constraints. The recently developed likelihood-based method for discriminating between convergence via de novo mutation, migration, or shared standing variation ( [24]) may provide a means to parse these contributions to repeatability and refine the inference of constraint. While testing the null hypothesis of no constraints is relatively straightforward, discriminating among other potential factors constraining diversity is much more complicated. Although it is possible to make very intricate models with variable mutation rates, selection coefficients, indices of pleiotropy, shared standing variation, and/or other factors determining the likelihood of each gene contributing to adaptation [3,25,28,43], it may be very difficult to actually confidently discriminate between such models.

Practical considerations in implementation
The accuracy of the indices developed here will critically depend on the correct identification of the genes contributing to adaptation. Studies of local adaptation are particularly prone to false positives when population structure is oriented on the same axis as adaptive divergence, and it is unclear how extensively methods that correct for population structure induce false negatives or fail to accurately control for false positives [44,45]. Assuming false positives are distributed randomly throughout the genome in each lineage, failure to remove them will cause the C-scores derived here to be biased downwards. Failure to identify true positives (i.e. false negatives) will impair the accuracy of C chisq , with the direction dependent upon the underlying biology. Assuming false negatives are randomly distributed in the genome, they could also reduce the magnitude of C-scores due to lower information content. On the other hand, if large-effect loci are more likely than small-effect loci to be both detected and convergent, false negatives will tend to bias C-scores upwards. As it is typically necessary to set arbitrary cutoffs for statistical significance to identify putatively adapted loci, we might expect C chisq to increase with increasing stringency of these cutoffs, as this would be expected to reduce false positives (but also increase false negatives). However, as there are many potential contingencies and interactions between the factors that affect these two types of error, there is a clear need for both theoretical studies on how the repeatability of local adaptation is affected by the interplay between demography and selection (e.g. [28]), and refinement of these methods to derive confidence intervals taking into account likely error rates.
A particularly important problem to address in implementing this method is that false positives may be non-randomly distributed throughout the genome in a similar way in different lineages. As local variations in the rate of mutation or recombination can drive genome-wide patterns in some indices used to identify selection and adaptation [46][47][48], this could lead to signatures of convergence among distantly-related species if such patterns are conserved over long periods of evolutionary time. For example, genome-wide patterns of variation in nucleotide diversity, F ST , and d xy were all significantly correlated across three distantly related bird species, likely driven in part by conservation of local recombination rate coupled with linked selection [49]. The extent of convergence of local recombination rates appears to vary considerably among species [50][51][52][53], so it will be important to consider this factor as a potential driver of similarity in the genomic signatures used to identify selection. Methods for identifying signatures of adaptation that are explicitly linked to a phenotype or environment of interest across multiple pairs of populations may be less likely to be affected by such factors, as recombination and linked selection are unlikely to drive a pattern of repeated correlation between allele frequency and phenotype/environment. However, such methods are still vulnerable to potential biases that arise from the complex interplay between genomic landscape, selection, and recombination, and further study in both theoretical and empirical contexts will be important to test the robustness of different methods to this important source of bias. One potential approach to estimate the contribution of such confounding factors would be to compare signatures of the repeatability for adaptation in two different traits (which are not phenotypically convergent) to those for a phenotypically convergent trait.
While studying adaptation across multiple pairs of populations can greatly increase the power to detect signatures of selection when all populations are adapting via the same loci, such methods are inherently unable to detect idiosyncratic patterns where different populations of a given species are adapting via different loci. By its very nature, it may be very difficult, if not impossible to detect local adaptation in traits with high GT-or GF-redundancy, as each pair of populations may be differentiated via a different set of loci [13]. If local adaptation is much more readily detected when it arises repeatedly within a lineage, then it will be difficult to identify conclusive cases with low C-scores, causing an overestimation of the prevalence of highly repeated adaptation.
If patterns of genomic convergence are compared among multiple differentially-related lineages, it is important to consider their phylogeny when testing the importance of phylogenetic sharing of different factors affecting the propensity for gene reuse [54]. The ability to resolve orthology relationships also decreases with increasing phylogenetic distance, which can affect the estimation of n s . Similarly, the set of genes in a trait's mutational target (g i ) is expected to evolve over time, so the set of shared genes should decrease with phylogenetic distance (so that g i − g s increases with divergence time), leading to decreased repeatability over time [33]. When studies include multiple differentially-related lineages, it is probably useful to estimate Cscores on both a pairwise and mean-across-all-lineages basis to more clearly describe cases where convergence is high within pairs of closely related lineages but low among more distantly related lineages.
Finally, physical linkage is a factor that could critically affect the measurement of repeatability, as neutral alleles in other genes linked to a causal allele will tend to respond to indirect selection, causing spurious signatures of selection/local adaptation. If the same causal gene is driving adaptation in two lineages, this will tend to overestimate repeatability on a gene-bygene basis, whereas the opposite will occur if different causal genes are driving adaptation. Yeaman et al. [38] found significantly elevated levels of linkage disequilibrium (LD) among candidate genes for local adaptation, which may have arisen due to physical linkage (with or without selection on multiple causal loci) or statistical associations driven by selection among physically unlinked loci. In this case, the fragmented genome and lack of suitable genetic map precluded a comprehensive analysis of the impact of LD. If genome/genetic map resources permit, it may be possible to analyse repeatability on haplotype blocks rather than individual genes, which could minimize the biases due to physical linkage.

Comparison to other indices of repeatability
A large number of indices have been developed to characterize similarity among ecological communities, which can be broadly grouped based on binary vs. quantitative input data and whether they account for joint absence of a given type (reviewed in [55]). In most cases, these indices are not derived from a probability-based representation of expectations, though Raup and Crick [56] quantified an index of similarity based on the p-value of a hypergeometric test (see also [57]). The C hyper index that we have developed here uses the same underlying logic as the Raup-Crick index but quantifies the effect size as a deviation from the expectation under the null hypothesis in units of the standard deviation of the null distribution. The C-score and P a indices developed here provide a complement to indices of repeatability that have been used in previous studies of convergence at the genome scale (e.g. [2,33]). Whereas the Jaccard, PS add , and other similar indices represent how commonly a given gene tends to be used in adaptation, the C-score indices quantify how much constraint is involved in driving this observed repeatability, whereasP a quantifies the proportion of the genome that is effectively available for adaptation. In some cases, these indices will be qualitatively similar in quantifying patterns of convergence (e.g. Fig 4A), but in other cases they will diverge considerably, because the C-scores are explicitly aimed at representing the importance of genes that could contribute to adaptation but do not.

What does convergence tell us about the basis of trait variation?
The repeated observation of convergent adaptation at the genome scale violates a fundamental assumption of the infinitesimal model of quantitative genetics: that the mutations responsible for adaptation have small effects and are essentially interchangeable [58]. If there are infinitely many interchangeable loci that could contribute to adaptation, the chance of the same gene playing a causal role in independent bouts should be vanishingly small. Molecular-genetic studies show that some traits are causally generated by only a few genes in a specific pathway, presumably limiting the mutational target and increasing the potential for convergence. For example, only the small number of genes that are directly involved in terpene production [59] would likely contribute to large variations in the amount of terpene produced by a plant. However, a second category of mutations in other non-pathway genes could also indirectly contribute to variation in terpene production through perturbations of regulatory networks. The recently proposed "omnigenic model" posits that genes can be categorized into "core" vs. "peripheral" function for a given trait, as a way to distinguish between those with larger direct effects vs. smaller indirect effects [14], although this model has also been criticized [60]. The majority of evidence that has been considered in the context of the omnigenic model has come from Genome-Wide Association Studies (GWAS) of standing variation, but it is unclear whether this represents the "stuff" of long-term adaptation. Indeed, it appears that in humans there are pronounced differences in the distributions of alleles that contribute to standing vs. adaptive genetic variation, as GWAS studies of standing variation find mainly small-effect variants [61,62], whereas studies of local adaptation have found a number of large effect loci (e.g, for lactase persistence [63], diving [64], and high altitude [65]). If GT-redundancy is typically high and GF-redundancy commonly low, then there will be little correlation between the loci that can give rise to standing variation and the smaller subset of those responsible for long-term adaptation. Studying whether adaptation is commonly repeatable at the genome scale will therefore make an important complement to GWAS studies of standing variation, providing a window into the factors that constrain the diversity of viable routes to adaptation, and informing our broader understanding of how variation translates into evolution.

Conclusions
We present a method to quantify the constraints that drive genomic repeatability of adaptation, to enable testing of hypotheses about the nature of these constraints. Contrasting the repeatability of adaptation with studies of standing variation will deepen our understanding of evolution and the factors that affect how it gives rise to diversity. Comparative approaches examining C-scores and the proportion of adaptation-effective loci (P a;lik ) for the same trait across different branches of the phylogeny may allow us to infer rates of evolution in constraints and potential differences between rapidly vs. slowly radiating lineages, and study whether adaptation drives such changes. Comparisons across traits within lineages will illuminate how different kinds of traits are constrained, and whether low GT-and GF-redundancy constitute important constraints at different levels of biological organization. Similarly, this approach could be used to examine whether the types of constraint that predominate depend upon critical population genetic parameters such as effective population size, which affects the long term efficiency of selection on the developmental-genetic program. While we have focused on repeatability at the gene level, this framework could be applied at other levels of organization, such as gene network, protein domain, or individual nucleotide (reviewed by [3]), and could include the contribution of intergenic regulatory regions if it is possible to identify orthology. These methods therefore provide a first step towards comprehensive quantification and understanding of evolutionary constraints and the role that different factors play in the rise of diversity during adaptation.
Supporting information S1 Data. Additional analysis and model description.
(DOCX) S1 Fig. Schematic representation of the overlap between the genes that potentially contribute to variation (G) and adaptation (GA) and those observed to contribute to adaptation (A) within two lineages (x and y) undergoing local adaptation to a similar selection pressure. Note that for simplicity N x and N y are not shown, and this figure is drawn under the assumption that all members of G x and G y are also members of N s , and all members of GA x and GA y are also members of G x and G y (i.e., G