Generalization of the Ewens sampling formula to arbitrary fitness landscapes

In considering evolution of transcribed regions, regulatory modules, and other genomic loci of interest, we are often faced with a situation in which the number of allelic states greatly exceeds the population size. In this limit, the population eventually adopts a steady state characterized by mutation-selection-drift balance. Although new alleles continue to be explored through mutation, the statistics of the population, and in particular the probabilities of seeing specific allelic configurations in samples taken from a population, do not change with time. In the absence of selection, probabilities of allelic configurations are given by the Ewens sampling formula, widely used in population genetics to detect deviations from neutrality. Here we develop an extension of this formula to arbitrary, possibly epistatic, fitness landscapes. Although our approach is general, we focus on the class of landscapes in which alleles are grouped into two, three, or several fitness states. This class of landscapes yields sampling probabilities that are computationally more tractable, and can form a basis for the inference of selection signatures from sequence data. We demonstrate that, for a sizeable range of mutation rates and selection coefficients, the steady-state allelic diversity is not neutral. Therefore, it may be used to infer selection coefficients, as well as other key evolutionary parameters, using high-throughput sequencing of evolving populations to collect data on locus polymorphisms. We also find that our theory remains sufficiently accurate even if the assumptions such as the infinite allele limit and the"full connectivity"assumption in which each allele can mutate into any other allele are relaxed. Thus, our framework establishes a theoretical foundation for inferring selection signatures from samples of sequences produced by evolution on epistatic fitness landscapes.


Introduction
With the advent of high-throughput molecular biology techniques, it has recently become possible to carry out large-scale genotype-phenotype assays in molecular systems [1][2][3][4][5]. For example, Podgornaia  four key residues in the E. coli protein kinase PhoQ, and assayed each mutant for the signaling function mediated by its binding partner PhoP [1]. This study revealed 1659 functional PhoQ variants, which can be thought of as forming the upper plane on the fitness landscape; all nonfunctional variants form the lower plane. The upper plane is divided into several clusters under single-point amino acid or nucleotide mutations-sequences within each cluster can mutate into each other through neutral substitutions only. The two-plane landscape is epistatic-the effect of a given mutation depends on the amino acids at the other three positions, in agreement with previous reports on the major role of epistasis in molecular evolution [6][7][8][9]. The picture of a "coarse-grained" fitness landscape stratified into several distinct phenotypes is in agreement with other recent high-throughput experiments aimed at elucidating the relationship between sequence and function [2-4, 7, 10, 11]. Although these experiments typically yield continuous distributions of selection coefficients, the distributions tend to be bimodal, with one peak corresponding to strongly deleterious and lethal mutations and another to weakly deleterious and neutral ones [12][13][14]. These observations suggest stratifying the fitness landscape into functional and non-functional phenotypes; intermediate fitness states such as those corresponding to weakly deleterious phenotypes can be added if necessary to refine the picture.
Overall, given the astronomically large number of alleles, the typical size of neutrally-connected clusters of sequences can be assumed to be much larger than the population size. Then evolutionary dynamics on a multiple-plane landscape will be characterized by mutation-selection-drift balance [15][16][17][18][19][20][21][22] in the infinite-allele limit. At steady state, population statistics, such as the mean and the variance of the number of distinct alleles or the probability of observing a given pattern of allelic diversity in a sample of sequences, do not change anymore, even though the population continues to explore new alleles through mutation [22]. In the absence of selection, the steady-state allele sampling probability was derived by Ewens [23]. The Ewens sampling formula can be used to understand allelic diversity in neutral populations and to test for deviations from the neutral expectation; [24] its essential limitation is that, essentially, each allele is allowed to mutate into every other allele [22]. The Ewens formula arises naturally in many sampling problems in biological and physical sciences [25][26][27]. However, in order to understand molecular evolution in the presence of selection and make quantitative predictions of selection coefficients, it is necessary to extend it to more general fitness distributions.
Previous work in this area has focused mostly on the symmetric overdominance model, first analyzed in this context by Watterson [18,28]. This is a diploid model in which all heterozygotes have the same selective advantage over all homozygotes, such that the mean population fitness depends on the square of allele frequencies. Since the sampling formula for this model is challenging to evaluate and therefore has never been used in practical calculations, subsequent work in the field focused on various approximations to the exact result, which require additional assumptions such as weak selection [18] or large sample sizes [29]. In particular, Joyce and collaborators have discussed asymptotic properties of the sampling distributions under a model of selection with multiple fitness states [30,31], as well as the symmetric overdominance model [32]. More recently, Watterson's model of selection was generalized by Handa [33] and Huillet [34], who considered mean population fitness involving allele frequencies raised to the arbitrary power q ! 1. They obtained sampling probabilities expressed in terms of multi-dimensional integrals which would be difficult to employ in practical calculations. In any event, only the q = 1 (neutral evolution) and q = 2 (symmetric overdominance) cases appear to have biological meaning.
Furthermore, Ethier and Kurtz have studied allelic diversity in a general model of selection in which fitness of each new allele is a symmetric function of the allelic states of its two parents, focusing on the proofs of existence and uniqueness of a steady state in the infinite-allele limit. [35,36] Desai et al. have investigated sampling probabilities in a model (previously introduced by Charlesworth et al. [37] and Hudson and Kaplan [38]) based on a sequence of neutral and negatively selected sites [39]. This model has no interactions between sites, and therefore can be treated using the Poisson Random Field approach [40]. Since molecular evolution is characterized by prominent epistasis and correlated fitness values between parents and their offspring, the approach of Desai et al. cannot be applied to genomic data without careful numerical analysis of all the approximations involved. Finally, several prior publications have focused on steady-state population statistics other than sampling probabilities. In particular, Li used the steady-state approach to obtain the frequency spectrum for a general landscape, and derived expressions for the mean number of alleles in a sample, as well as the mean and the variance of heterozygosity [19][20][21]. Ewens and Li derived frequency spectra for landscapes with two and three distinct fitness states and used them to compute the mean number of distinct alleles and the mean heterozygosity [41]. Griffiths derived a general integral expression for the frequency spectrum in a genic selection model [42].
Here we demonstrate an extension of the Ewens sampling formula to arbitrary fitness landscapes with genic selection. First, we follow previous work [15][16][17][18][19][20][21][22] in assuming that the population adopts a steady state characterized by mutation-selection-drift balance. The steady state depends on the mean population fitness, which involves a linear combination of gene frequencies. Next, we derive a general sampling formula valid for any mutation rate μ, population size N, sample size n ( N, and the number of alleles K with arbitrary fitness. We find that the most general sampling formula is difficult to employ in numerical calculations with large finite values of K, but small values of K and the infinite K limit are more manageable. Here we focus on the infinite-allele (K ! 1) approximation with several phenotypic states, inspired by recent high-throughput molecular evolution experiments [1-5, 7, 10, 11]. We have developed a numerical technique based on the efficient calculation of Bell polynomials, which is distinct from previous efforts to compute sampling probabilities [43,44]. Our approach enables us to study selection signatures and deviations from neutrality on landscapes with arbitrary fitness distributions.
We contrast our predictions with the effective population size approximation [37,39]. We also compare our results with explicit simulations, using the Moran population genetics model [45] with single-point mutations as a benchmark against which the accuracy of the "full connectivity" assumption is checked. Finally, we investigate the limitations of the infinite-allele assumption. Our results are applicable to understanding the nature of allelic diversity under selection, mutation and drift. Moreover, our sampling formulas can form a basis of a quantitative, numerically feasible test for detecting the presence of selection and estimating its strength in evolving populations. Population-level allele diversity data are made increasingly available through high-throughput sequencing techniques, making our approach a practical and timely tool for studying the role of selection in evolution-a topic of much current interest and debate [14,[46][47][48][49][50][51].

Sampling probability with selection
We consider a haploid population of fixed size N (our results also hold for diploid populations, as long as fitness values are assigned to individual genes rather than organisms). Each organism in the population is represented by a single allele in the state i, with fitness f i ; there are K distinct allelic states. Mutations occur with a probability μ per generation, changing the original allele into one of the K − 1 remaining alleles. Thus the probability of offspring A j produced by parent A i 6 ¼ j is μ/(K − 1) (note that our approach can be easily generalized to the case of final-state-dependent mutation rates: μ ij = μ j , 8i in A i ! A j ). We can view this system as an "allelic network" with the topology of a complete graph, with K vertices representing allelic states and edges representing mutational moves. Stochastic evolution of the population can then be described using Moran [45,52] or Wright-Fisher [15,52] models of population dynamics.
The steady-state distribution of allelic frequencies for these models is given by [15][16][17][18][19] pðxÞ where x = (x 1 , . . ., x K ) is a vector of allelic frequencies, = θ/(K − 1) with θ = Nμ for Moran and θ = 2Nμ for Write-Fisher models correspondingly, hf i ¼ P K i¼1 f i x i is mean population fitness, and Z is a normalization constant.
In many situations relevant to molecular evolution, the number of alleles K is much larger than the population size N. In this case, the steady state in terms of allele frequencies is unlikely to be reached on relevant evolutionary time scales. Mathematically, the K ! 1 limit of Eq 1 becomes ill-defined [53,54]. Nonetheless, the steady state is well-defined in terms of allelic counts rather than frequencies of specific alleles [22]. In other words, the allelic diversity of the population (e.g. as characterized by the mean and the variance of the distribution of the number of distinct allelic types) is tractable and will no longer change in steady state, although new alleles will continue to be explored through mutation.
Since only a subset of the entire population is typically available for analysis, we shall focus on the probabilities of allelic counts in samples of size n ( N. To introduce the concept of allelic counts, let us for a moment consider a finite number of allelic types, e.g. K = 5, and call the corresponding alleles A, B, C, D, E. Suppose that we take a sample of n = 4 alleles from the population and we first observe allele A, then C, then A again, and finally D. We can record this sequence of alleles as an ordered list (A, C, A, D). However, typically we are not interested in the order in which alleles appear in the sample, and therefore record the result as an unordered list {A, A, C, D}, which shows that allele A has appeared twice and alleles C and D have appeared once each. Here we have used the notation {a, b, . . ., z} for unordered lists ({a, b, . . ., z} = {b, a, . . ., z}), and (a, b, . . ., z) for ordered lists ((a, b, . . ., z) 6 ¼ (b, a, . . ., z)).
Alternatively, we can record non-zero allelic counts, which yields n A = 2, n C = 1, n D = 1. Finally, we can dispense with the allele labels altogether, identifying each allele in the sample as either new or already seen. In this case, we are left with an unordered list of counts {2, 1, 1}, meaning that we have observed 4 alleles of 3 different types, with one type represented by two alleles and the other two types by one each. In general, we will refer to n = {n 1 , . . ., n k } as the sample configuration or the allelic counts. An equivalent representation would be to use a histogram which records how many groups of j identical alleles occur in the sample, with j ranging from 1 to n. In our example, there is one group of two identical alleles and two groups of one allele each, so that (A, C, A, D) is recorded as the allelic histogram (a 1 = 2, a 2 = 1, a 3 = 0, a 4 = 0). All results in the paper are presented in terms of the counts {n 1 , . . ., n k } rather than the histogram (a 1 , . . ., a n ).
It turns out that the allelic counts are appropriate variables in the infinite allele limit. The celebrated Ewens sampling formula [22,23] expresses the probability of observing a particular sample configuration n in the absence of selection: where N P is the total number of distinct permutations of the allelic counts, and θ (n) = θ(θ + 1). . .(θ + n − 1) is the rising factorial. Following an approach developed by Watterson [18], we generalize the Ewens sampling formula to the case of multiple fitness states. We define γ, a vector whose components, γ m , are fractions of all alleles with fitness f m . Allowing m to range from 1 to M ( P M m¼1 g m ¼ 1) results in a landscape with M ( K distinct fitness states. Unless γ m * 1/K, there is an infinite number of alleles with the same fitness, so that the landscape looks like M fitness planes interconnected through mutations. For this reason we shall often refer to phenotypic states as fitness planes and to the fitness landscape as the multiple-plane landscape. Our main result is the following expression for the sampling probability (details of the derivation are available in Materials and Methods): F ðγy þ ν Y ; y þ n; βÞ F ðγy; y; βÞ Here, F ða; b; zÞ is a generalization of the confluent hypergeometric function 1 F 1 ða; b; zÞ to vector arguments. The double sum in Eq 3 takes into account all possible ways of assigning observed allelic counts n to M fitness planes; ν Y is an auxiliary vector which encodes these assignments (see Materials and Methods and Fig 1 for extended explanations). Each assignment contributes differently to the final expression due to the non-trivial fitness landscape. The fitness values are stored in the vector β, whose components are fitness differences β m = N(f m − f 1 ) scaled by the population size N. For example, in the case of two fitness states β 1 = 0 and β 2 = N(f 2 − f 1 ) = Ns, where s is the selection coefficient. Finally, i 1 . . .i M indicate the number of distinct allelic types sampled from the corresponding fitness plane ( P M m¼1 i m ¼ k). The first line in Eq 3 is simply the Ewens formula (Eq 2) without N P , which is the value returned by the double sum on the second line when all fitness values are equal. The version of the sampling formula with selection (Eq 3) suitable for a finite number of alleles K is provided in Materials and Methods. In the main text we shall focus on the infinite allele limit. Despite the seemingly complicated structure of Eq 3, it can be used in efficient numerical calculations. The following sections are devoted to exploring the properties of this formula and discussing its applicability and accuracy if some of the model assumptions are relaxed.

The effective population size approximation
According to the effective population size (EPS) approximation [37,39] in the monomorphic limit population dynamics is effectively neutral with a rescaled population size N Ã . Indeed, in this limit Eq 3 reduces to in the two-plane case. The θ ! 0 limit corresponds to the s ) μ regime with s being finite; Eq 4 is the same as the neutral sampling formula (Eq 2) in the monomorphic limit if the population size is rescaled: This result can be generalized to the landscape with multiple fitness planes, in which case N Ã = γ m N, where γ m is a fraction of nodes with the highest fitness. However, the EPS approximation breaks down in the polymorphic regime. Indeed, if we take the θ ! 1 limit while keeping s/μ finite, it can be shown for the two-plane landscape that where P½n; s ¼ 0 is given by Eq 2, and the coefficients c m depend solely on the allelic counts n 1 , . . ., n k . Since the right-hand side of Eq 5 does not depend on the population size, it can be used to define N Ã = λ 1/(k − n) N. However, this definition will be sample-specific, as λ depends on the allelic counts via c m 's. Thus there is no universal rescaling of the population size in the strongly polymorphic regime, and therefore evolutionary dynamics is non-neutral.

Detection of selection signatures
As discussed above, in general we expect allele diversity to deviate from neutrality, making it possible to detect selection signatures using a set of sequences sampled from the population.
To investigate non-neutral population dynamics, we compute probabilities for all integer partitions n = {n 1 , . . ., n k } of n alleles sampled from the population evolving under selection (Eq 3), and compare them with steady-state partition probabilities obtained under neutral evolution (Eq 2) and the monomorphic EPS approximation (Eq 4). We use the Kullback-Leibler (KL) distance to quantify the difference between two probability distributions [55]: KL(p||q) = ∑ i p i log(p i /q i ), where i is the partition label. For the two-plane system, we first compare partition probabilities under selection, p i ¼ P½n; y; b, with the corresponding neutral probabilities, q i ¼ P½n; y; b ¼ 0. In Fig 2A, we plot the KL divergence as a function of the mutation rate and the selection strength for the two-plane fitness landscape.
We observe that evolutionary dynamics is essentially neutral if selection is weak (s μ); in addition, the range of selection coefficients for which neutrality holds increases in the monomorphic regime (Nμ 1). On the other hand, population statistics is clearly non-neutral when the population is polymorphic and when the separation between the two fitness planes is large. Next, we compute the KL divergence KL(p||q Ã ) between the EPS probability distribution, Fig 2B). We see that the EPS approximation fails in the polymorphic, weak-selection regime. Overall, the neutral and EPS approximations are approximately complementary: for example, in the strong-selection (s ) μ) polymorphic regime, when evolutionary dynamics becomes non-neutral, it is well approximated by the EPS model.
In Fig 2C we show KL divergences between partition probability distributions on two-and three-plane fitness landscapes. We observe that the partition probabilities are essentially twoplane (i.e., there are no selection signatures indicating presence of intermediate-fitness alleles) if the population is monomorphic (Nμ 1), or if the distance between the two upper planes is smaller than the mutation rate (Δs μ). However, there is a considerable parameter region in which deviations between two and three-plane sampling probabilities appear to be significant (with KL divergences between the two distributions of 0.01 or more), making it possible to detect three distinct fitness states in the sampling data.

Mutation load
By definition, the mutation load is given by [52,56] To estimate the mutation load at steady state, we compute the expected value of the mean population fitness over multiple realizations of the stochastic process, E½hf i.
For the two-plane system, this computation leads to Another indicative quantity is the average fraction of the population with low fitness, E½x low . For the two-plane system it is given by E½x low ¼ Lð1 þ sÞ=s:  Values of mutation load for the two-plane fitness landscape are shown in Fig 3A over a range of selection strengths and mutation rates. As expected, we observe that the largest deviations from the maximum fitness occur in the strong-mutation, strong-selection regime, where a fraction of the population is constantly displaced to the lower plane by mutation, incurring a fitness cost. Correspondingly, at a given value of selection strength the mutation load increases with the mutation rate. In the monomorphic regime the mutation load is vanishingly low because the entire population condenses to a single allelic state and moves randomly on the upper plane. The fraction of the population on the lower fitness plane is shown in Fig 3B. The fraction is high when the separation between the two planes is low and, at a fixed separation, it increases with the mutation rate.

Fitness landscape models and numerical simulations
To check our main result (Eq 3), we have compared it to the outcomes of numerical simulations of two models. In the first model, each allele is allowed to mutate into any of the other K − 1 alleles with equal probability. We call this model fully-connected (FC); derivations of the Ewens sampling formula and our generalization of it (Eq 3) were carried out for the FC model. The second model is more realistic: an organism is represented by a sequence of integers S = (a 1 , . . ., a L ) of length L and alphabet size A, meaning that 0 a i A − 1. A mutation replaces an integer at a randomly chosen site with one of the remaining A − 1 integers; all the replacements have equal probabilities. We call this model a single-point mutation (SPM) model; it is a more realistic description of protein or nucleotide sequence evolution.
To assign a fitness value to each allele, we focus on the landscapes in which alleles can have either low or high fitness values (the two-plane model), or low, intermediate, and high fitness values (the three-plane model). The fractions of alleles found in each plane are given by γ: γ = (γ, 1 − γ) for the two-plane model and γ = (γ 1 , γ 2 , 1 − γ 1 − γ 2 ) for the three-plane model. In the FC model, the mutational neighborhood of each allele is the same, so that any desired allele fractions γ can be implemented. However, in the SPM model the fractions of neutral, beneficial and deleterious moves in each plane will depend on γ and the assignment of states to planes. We wished to produce non-trivial distributions of neutral moves on the fitness planes, with mutational neighborhoods of some alleles being completely neutral in each plane. Another condition was that the number of alleles in each plane should decrease with its fitness, to reflect the fact that beneficial mutations are rare.
To fulfill these requirements, we chose to assign fitness values in the SPM model in the following way. We use the sequence length L = 10 and the alphabet size A = 4. For each sequence S = (a 1 , . . ., a L ) we compute a score z = a 1 + . . . + a L . We compare these scores with a set of cutoffs (c 1 , . . ., c M−1 ) for the M-plane landscape. For the two-plane landscape, the fitness is 1 if z c 1 , and 1 + s otherwise. We use the cutoff c 1 = 17, which yields γ = (0.758, 0.242). For the three-plane landscape, if z c 1 the fitness is 1, if c 1 < z c 2 the fitness is 1 + s − Δs, and if z > c 2 the fitness is 1 + s + Δs. We choose the cutoffs c 1 = 17 and c 2 = 21, which lead to γ = (0.758, 0.210, 0.032). In order to compare FC and SPM simulations directly, we use the same values of γ in the corresponding FC models.
Our numerical simulations have been carried out using the Moran model of population genetics [22,45]. Specifically, we have evolved a population of N = 10 3 haploid organisms, each of which could be in one of K allelic states. At each step a parent is chosen by randomly sampling the population with weights proportional to the fitness of each individual. An offspring is then produced as an exact copy of the parent. Next, the offspring undergoes mutation with the probability μ. Finally, the population is uniformly sampled to choose an organism that will be replaced by the offspring, keeping the overall population size constant. Probabilities of sampling n individuals from the population were calculated as averages over 10 6 samples gathered from 10 3 independent runs. For each run, a randomly generated initial population was evolved to steady state, after which n individuals were sampled from the population with replacement 10 3 times, waiting *1/μ generations between subsequent samples.
Note that in the neutral case the exact mapping between θ and μ is given by θ = Nμ/(1 − μ) for the Moran model. [22] However, it is unclear if this mapping can be extended to the nonneutral cases considered here. In any event, for the population size and the values of θ investigated below, μ = θ/(N + θ) ' θ/N. Therefore, we use the diffusion theory result θ = Nμ in comparing theoretical predictions with numerical simulations.

Partition probabilities on fully-connected vs. single-point-mutant networks
Here we investigate the extent to which sampling probabilities change in the SPM sequence evolution model described above, compared to the FC fitness landscape. We are especially interested in the limits of the predictive power of our theoretical framework, which necessarily involves the FC assumption. In Fig 4 and Table 1 we compare theoretical predictions with numerical simulations on the FC and SPM networks in the two-plane system for the sample of n = 3 alleles. Overall, as expected, we observe an excellent agreement between theory and simulations on FC networks. Furthermore, we see that the agreement between SPM simulations and our theoretical results is reasonable: in nearly all cases, the predicted ranking of the sample partitions, as well as the ranking of any given sample partition with respect to the selection strength, Ns, are preserved. The largest discrepancies occur in the weakly polymorphic (Nμ = 1), non-neutral regime (Ns = 6, 13).
The situation is qualitatively similar when a three-plane fitness landscape is considered ( Fig  5, Table 1). We again observe an excellent agreement between theory and FC simulations and, overall, a reasonable agreement between theory and SPM simulations, with the largest discrepancies again occurring in the weakly polymorphic, non-neutral regime. These observations remain true when samples with n = 4 and 5 alleles are considered (Tables 2 and 3).
Finally, we have checked whether our theoretical predictions, which rely on the full-connectivity assumption, are closer to the non-neutral rather than neutral SPM steady-state dynamics in numerical simulations: if this is the case, we should be able to predict selection signatures in populations evolving under single-point mutations using our methodology. We have  Table 1. Note that the error bars of the partition probabilities are too small to be shown, due to extensive sampling in our numerical simulations.
https://doi.org/10.1371/journal.pone.0190186.g004 Table 1. KL divergences between theoretical predictions and numerical simulations for single-plane, two-plane (Fig 4), and three-plane (Fig 5) Table 1. https://doi.org/10.1371/journal.pone.0190186.g005 Ewens sampling formula with selection computed the ratio of KL distances defined in the Table 1 caption; this ratio is less than 1 if the theoretical predictions with selection are closer to the corresponding SPM simulation than to the neutral SPM simulation, and greater than 1 otherwise. We observe that the ratio is less than 1 in all cases with selection and for all sample sizes (Tables 1-3), indicating that the error introduced by the FC assumption is less than the distance between selective and neutral systems (note that the ratio is 1 by definition in the single-plane neutral case).

Infinite-allele assumption
Although our approach is valid for an arbitrary number of alleles K, statistics of allele diversity in a population under selection become substantially easier to deal with in the infinite-allele limit. As discussed in the Introduction, this limit is justified since our focus here is on evolution of protein, RNA and DNA sequences, where the number of alleles grows exponentially with sequence length. Nonetheless, we have systematically investigated the extent of deviations between our infinite-allele theoretical results and simulations as the number of alleles K decreases and becomes comparable to the population size N. Fig 6 shows the KL divergence between partition probabilities derived theoretically for the two-plane landscape in the infinite-allele limit (Eq 3) and obtained numerically on finite-size FC networks. We consider three regimes: monomorphic (Nμ = 0.1), weakly polymorphic (Nμ = 1.0), and strongly polymorphic (Nμ = 10.0). In the latter two cases, noticeable deviations between theory and simulations begin to appear below the K * N regime; the agreement improves as the population becomes more monomorphic. We conclude that our theory is applicable over a wide range of mutation rates, as long as the network size is comparable to, or greater than, the population size.

Discussion and conclusion
One of the most challenging problems in evolutionary biology is to understand evolutionary dynamics of molecular loci, such as protein or RNA-coding sequences, or gene regulatory regions. The number of nucleotides at these loci, L, is large enough so that the total number of possible sequences, K = A L , is astronomical, far exceeding the population size N. Under these conditions the evolution of a molecular locus, assumed to be decoupled by recombination from the rest of the genome, reaches a "de-labelled" steady state. The allelic diversity in the steady-state population is determined by the balance of forces of selection and drift on one hand, and mutation on the other. The former act to reduce allelic diversity, while the latter acts to increase it. As a result, population statistics such as the mean number of distinct alleles, or the probability of seeing a certain allelic configuration in a sample, do not change with time, even though new genotypes continue to be explored on the effectively infinite allelic network. The steady-state allelic diversity in an infinite-allele neutral system was explored by Ewens [22,23]. The main result of that study, the Ewens sampling formula, is widely used in population genetics. However, selection is bound to play a key role in molecular evolution, and recent high-throughput studies connecting protein sequences with phenotypes [1-4, 7, 10, 11] reveal a more complex picture of molecular evolution: generally, a functional protein is disrupted by a fraction of mutations (e.g., through substitution of a hydrophobic residue for a hydrophilic one in the protein core). Other mutations do not significantly change protein stability, binding affinity, or binding specificity, and are therefore effectively neutral. Occasionally, a mutation is found which increases the fitness of an already functional, adapted protein, but these Ewens sampling formula with selection mutations are very infrequent. Overall, recent experimental studies indicate that "coarsegrained" fitness landscapes comprised of multiple interconnected planes (i.e., several distinct fitness states) are a reasonable representation. The simplest landscape of this kind has just two fitness states, with functional sequences on the upper plane and non-functional sequences on the lower plane [1]. Multiple-plane fitness landscapes constructed in this way are characterized by extensive epistasis under the single-point mutational move set, which is likely to be pervasive in molecular evolution [6][7][8][9].
Since molecular evolution may be described by steady-state dynamics on multiple-plane fitness landscapes, it is of great interest to generalize the Ewens sampling formula to arbitrary fitness distributions, and to the case of several distinct fitness states in particular. Tractable expressions for sampling probabilities would enable inference of selection coefficients, relative numbers of alleles in each fitness state, and mutation rates, using DNA, RNA, or protein sequences sampled from the population as input to the inference procedure. Here we report an extension of the Ewens sampling formula to arbitrary fitness distributions, focusing on the multiple-plane case which yields substantial simplifications in the infinite-allele limit. Unlike techniques based on the Poisson random field framework [40], such as the sampling probability formulas developed by Desai et al. [39], our approach does not rely on assuming independent evolution at each site along the sequence. However, an essential drawback of the Ewens sampling formula and our generalization of it is the "full-connectivity" assumption (i.e., that each allele can mutate into every other allele). Furthermore, the sampling formula becomes intractable for large sample sizes, since the number of terms to sum over in Eq 3 becomes too large.
Therefore, in order to study the limits of applicability of our theory, we have carried out extensive comparisons with numerical simulations on multiple-plane fitness landscapes. First, we checked the full-connectivity assumption inherent in the Ewens approach by comparing the sampling probabilities of our theory with those obtained by simulation of steady-state populations evolving on single-point-mutant networks. We find that the agreement, although dependent on the details of the fitness landscape model, the values of selection coefficients, and mutation rates (and least reliable in the weakly polymorphic regime), remains strong enough overall to encourage application of our theoretical results to sequence data. We also find that the error introduced by the full-connectivity assumption, as measured by the KL distance, is less than the distance between sampling probabilities in neutral and non-neutral systems. Note that our SPM model of the fitness landscape was constructed specifically to create a non-trivial distribution of neutral, deleterious and beneficial single-point mutations for the alleles, in some sense making it as distant from the fully connected network as possible. Thus we expect the errors inherent in our theoretical framework to be smaller (or at least not much worse) in applications to natural systems. Second, we have checked the infinite-allele assumption by systematically reducing the number of alleles until it became lower than the population size. We find that, for a wide range of mutation rates, deviations between theory and simulations become significant only when the number of alleles approaches the population size from above. Thus our assumption of the infinite network size is justified for sufficiently long loci, such as those encoding transcribed or regulatory regions.
Robust inference of selection coefficients from a sample of sequences collected from an evolving population requires statistics of allelic diversity to deviate substantially from the neutral expectation. If selection cannot be ruled out a priori, the use of our generalized Ewens sampling formula, which is valid throughout the entire parameter space, is necessary for inferring selection signatures and mutation rates from data. Moreover, allelic diversity generated by steady-state evolutionary dynamics on a three-plane fitness landscape is sufficiently distinct from its two-plane counterpart in the strong-selection, weakly polymorphic regime, opening up a possibility of inferring multiple selection coefficients from a sample of sequences. Another hallmark of non-neutral population dynamics is de-localization of the population to multiple fitness planes. With a two-plane landscape, we expect the fraction of the population on the lower plane to increase with the mutation rate and decrease with the distance between the two planes. Our investigation of the mutation load confirms these predictions.
In summary, we have generalized the Ewens sampling formula to populations evolving under selection. Although in principle our results are valid for arbitrary fitness distributions, focusing on the infinite allele limit and landscapes characterized by several distinct fitness states yields substantial simplifications, making our approach computationally tractable and thus applicable to inferring selection signatures from high-throughput sequence data. Such multiple-state "coarse-grained" fitness distributions appear to be a reasonable starting point supported by recent large-scale genotype-phenotype maps in molecular systems [1-4, 7, 10, 11]. Unlike previous approaches, we do not assume that each site along the sequence evolves independently-an assumption that has recently been challenged in molecular evolution studies [6][7][8][9]. However, we do make the infinite allele assumption, and, as in the Ewens original formula [23], assume that each allele can mutate into any other allele. Therefore, we check our theory against numerical simulations in model systems where these assumptions are relaxed, and find that our predictions remain accurate enough to enable inference of evolutionary parameters from sequencing data.
is a generalization of the confluent hypergeometric function 1 F 1 ða; b; zÞ to vector arguments.
Here, a (j) = Γ(a + j)/Γ(a) is the rising factorial, B j is the jth complete Bell polynomial, and a j ¼ ðj À 1Þ! P n i¼1 a i z j i . To obtain Eq 7, we have used the following result for integrating over the (K − 1)-dimensional simplex S K − 1 : Z A (K − 1)-dimensional simplex S K − 1 is a subspace of R K : ðx 1 ; . . . ; x K Þ 2 ½0; 1 K which satis- We have expanded the exponent in Eq 1 in a Taylor series and applied Eq 10 to each term in the resulting expansion.

Strongly monomorphic limit
In this limit the mutation rate tends to zero while the population size is kept fixed, ! 0 [52,[56][57][58]. Consider the Fourier transform of the steady-state distribution in Eq 7: where the integral is over the (K − 1)-dimensional simplex. Using Eq 9, we can write the Fourier transform as a ratio of two generalized hypergeometric functions: Taking the ! 0 limit yieldsp mono ðkÞ ¼ Thus the steady-state distribution in the monomorphic limit is given by: where VolðS KÀ 1 Þ ¼ ffiffiffi ffi K p =ðK À 1Þ! is the volume of the (K − 1)-dimensional simplex and (1 m ) i = δ mi . The population resides in one of the K monomorphic states available to it, with the probability of being in a particular state exponentially weighted by its fitness [59][60][61].

Probability of a sample of alleles
In this section we derive the sampling probability when the number of alleles K is finite. Let us find the probability P½n of observing counts n = {n 1 , . . ., n k }, assuming that the population has reached steady state in terms of its allelic diversity. Before considering general case, we illustrate our approach using an example with only K = 3 allelic types: A ¼ ðA; B; CÞ. We wish to calculate the probability of observing counts {2, 1} in a sample of size n = 3, which is assumed to be much less than the population size N. There are 18 samples that contribute to this counts: where p(x A , x B , x C ) is given by Eq 7. Consequently, the probability of observing two A's and one B in any order is given by [18] P½fA; where 3 2 1 À Á is the multinomial coefficient. Introducing a set S 2 A ¼ fðA; BÞ; ðA; CÞ; ðB; CÞg, which permutes allelic identities in an ordered manner (i.e., the overall allele ordering from A to B to C is preserved in each pair of alleles), we can take into account the first 9 configurations in the table above: In order to include 9 remaining configurations in the table, we need to switch the order of the alleles: But switching the alleles in each pair amounts to replacing x 2 Thus we can summarize the entire table by introducing a set P(n 1 , . . ., n k ) of all distinct permutations of the counts {n 1 , . . ., n k }, which determine the powers to which the allelic frequencies are raised in Eq 17. In our example P(2, 1) = {(2, 1), (1, 2)}. Therefore, The above example can be easily generalized to describe the probability P½fn 1 ; . . . ; n k g of observing arbitrary counts. To do so, we enumerate all K alleles, forming a unique ordered list A ¼ ð1; . . . ; KÞ. Second, we choose a subset σ = (σ 1 , . . ., σ k ) of size k from A without replacement, so that the allelic order is preserved: σ 1 < . . . < σ k (note that no subsets are allowed to contain repeating elements of A). Then S k A can be naturally defined as a set which contains all ordered subsets of A of size k. Finally, as before P(n) is a set of all distinct permutations of allelic counts. Following these steps we have where the expectation is calculated with respect to the steady-state allele distribution, Eq 7.
We can use sampling probability (Eq 20) to compute the distribution of the number of different allelic types k: where the summation runs over all ordered partitions of n into k positive integers.

Generalized sampling formula
As Eq 20 demonstrates, evaluation of sample probabilities requires calculation of moments of allele frequency distributions. This could be done by taking derivatives of the normalization constant Z ¼ BðϵÞFðϵ; jϵj; βÞ in Eq 7 with respect to the corresponding components of β: Then Eq 20 takes the form where ν σ is a K-dimensional vector whose σ i -th components are ν i with i = 1, . . ., k and all the other components are zero. Here, we have used the fact that differentiating Eq 9 with respect to z yields a simple result similar to that known for the regular confluent hypergeometric function: where n ¼ P k i¼1 n i and (1 i ) j = δ ij . As discussed above, the sum over σ extends over all distinct subsets of k alleles sampled from K uniquely ordered alleles and subject to the σ 1 < . . . < σ k constraint. Therefore ν σ has K − k zero and k non-zero components which are distributed according to σ. The sum over ν extends over all distinct permutations of allelic counts which sum up to n. Eq 23 is valid for an arbitrary fitness landscape and an arbitrary number of alleles K.

Neutral limit of the sampling formula
When all alleles have the same fitness, the general sampling formula given by Eq 23 should reduce to the Ewens formula for neutral evolutionary dynamics [22,23]. Indeed, with all β i set to zero, the generalized hypergeometric function F ða; b; 0Þ (Eq 9) becomes 1. Then for the finite number of alleles K where N P = |P(n)| is the total number of distinct permutations of allelic counts. In the limit of an infinite number of alleles K ! 1, Eq 24 reduces to Eq 2. Changing variables to allelic histogram counts yields Q k i¼1 n i ¼ Q n j¼1 j a j and N P ¼ k!= Q n j¼1 a j !, resulting in P½ða 1 ; . . . ; a n Þ ¼ n! Q n j¼1 a j !j a j y k y ðnÞ : ð25Þ Eq 25 is a standard form of the Ewens sampling formula [22,23].

Sampling formula for a population with two fitness states
As a straightforward generalization of the neutral case, consider a system with I alleles of fitness f 2 and K − I alleles with fitness f 1 > f 2 . Thus the fitness landscape consists of two interconnected "planes". We can assume without loss of generality that alleles 1 through I belong to the lower plane and alleles I + 1 through K belong to the higher plane. Then γ = I/K defines a fraction of nodes on the lower plane and the fitness vector is plus all alternative assignments of the first i counts within the first I entries of ν Y , and the remaining k − i counts within the last K − I entries of ν Y , such that the original order of the non-zero count entries is not changed. In this case, the generalized hypergeometric function reduces to the confluent hypergeometric function: F ðϵ þ ν Y ; jϵj þ n; βÞ ¼ 1 F 1 ðgy þ X i m¼1 n m ; y þ n; bÞ: ð28Þ Then for finite K the sampling probability is given by: Here, the I i À Á and KÀ I kÀ i À Á binomial factors are due to assigning non-zero counts to alternative positions within ν Y , as described above. Taking the infinite allele (K ! 1) limit with γ fixed, we arrive at 1 F 1 ðgy þ P i m¼1 n m ; y þ n; bÞ 1 F 1 ðgy; y; bÞ Thus hypergeometric sampling of Eq 29 reduces to binomial sampling in the infinite-allele limit.

Sampling formula for a population with multiple fitness states
and its infinite allele limit is given by F ðγy þ ν Y ; y þ n; βÞ F ðγy; y; βÞ The sums in Eqs 31 and 32 take into account all possible ways of sampling n alleles from M planes (Fig 1). To explain these sums, let us imagine distributing n books over M shelves. The books come in k indivisible volume sets, and the ith set has ν i identical books in it. We would like to find all book-to-shelf arrangements, keeping in mind that shelves have finite capacities: only I m books can be placed on the m-th shelf. One way to describe any book-to-shelf arrangement is to use an M-dimensional vector ν Y which records how many books are placed on each shelf. For example, if M > k, a vector ν Y = (ν 1 , . . ., ν k , 0, . . ., 0) with M − k zeros following k non-zero entries describes placing volume sets on shelves in a particular order: the first volume set goes on the first shelf, the second volume on the second shelf and so on (assuming that the shelves are large enough to accommodate the volume sets), until no more books are left, so that the remaining M − k shelves remain empty. Permutations of this arrangement, expressed as permutations of ν Y vector elements, are also allowed (again, assuming that all the shelves are large enough). We can also put more than one volume set on a single shelf, leading to arrangements such as (ν 1 + ν 2 , ν 3 , . . ., ν k , 0, . . ., 0) with M − k + 1 zero and k − 1 non-zero entries. As before, this arrangement is allowed only if the number of books on each shelf does not exceed shelf capacities. Note that the question of capacity does not arise in the infinite allele limit, since the shelves become effectively infinitely long.
In order to systematically list all the arrangements for volume sets (ν 1 , . . ., ν k ), we follow a simple rule: if the kth set of ν k books is placed on the mth shelf, the (k + 1)th set of ν k + 1 books goes either on the same shelf or on the m 0 th shelf with m 0 > m. Taking elements of (ν 1 , . . ., ν k ) one by one and changing the initial shelf (onto which the 1st volume set is placed) and the number of volume sets on each shelf, we can generate a set of all permutations of ν Y elements. We shall call this set YðI; nÞ since it depends on both the shelf capacities I = (I 1 , . . ., I M ) and the volume sets n. In the limit of infinite shelf capacity the dependence on shelf sizes disappears, and the set of all permutations will be called YðnÞ. To include all possible arrangements, we need to perform the book-placing procedure for each distinct permutation of n. Now, if we replace shelves with fitness planes and volume sets with allelic counts, we obtain an algorithm for generating all allowed placements of allelic counts on fitness planes. The nonnegative indices i 1 , . . ., i M in Eqs 31 and 32 represent the number of volume sets (allelic counts) on each shelf (fitness plane). The distribution of alleles among fitness planes of finite capacity is illustrated in Fig 1A for M = 3 and a vector of allelic counts ν = (4, 1, 2); the infinite-plane case is shown in Fig 1B. Next, let us consider the monomorphic limit of Eq 32. It can be shown that Therefore, as expected, the P½fng (k = 1) term dominates in the monomorphic limit. By construction, Eq 32 reduces to the neutral limit, Eq 2, when all fitness values are the same. In addition, the neutral limit is reproduced in the strongly polymorphic limit F ðγy þ ν Y ; y þ n; βÞ À ! y!1 F ðγy; y; βÞ; ð35Þ and Eq 32 reduces to the neutral result. This is expected since selection effects become vanishingly small in this regime.

Efficient evaluation of sampling probabilities
To evaluate sampling probabilities, we need to compute F ða; b; zÞ (Eq 9) efficiently. The calculation of F ða; b; zÞ is performed by filling a square matrix with the partial Bell polynomials B n, k , from which complete Bell polynomials can be calculated from the rows as B n ¼ P n k¼1 B n;k . We use the following convolution identity: ðx♢yÞ n ¼ P nÀ 1 j¼1 n j x j y nÀ j . Note that the identity is commutative, i.e. ðx♢yÞ n ¼ ðy♢xÞ n , and that the summation limits are such that the convolution of two vectors with nonzero elements will always have a zero as its first element. Let x k♢ denote the vector that results when x is convolved with itself k times. The convolution matrix C is lower triangular and has the vector x = (x 1 , . . ., x n ) T as its leftmost column, x 2♢ as the second leftmost, etc. Partial Bell polynomials can then be calculated as: B n;k ðx 1 ; . . . ; x nÀ kþ1 Þ ¼ ðx k♢ Þ n k! ¼ C n;k k! : ð36Þ The matrix elements C n, k can be calculated starting from the top of the matrix, left-to-right within each row. The sum in Eq 9 runs over complete Bell polynomials in ascending order, so that convergence can be checked after the completion of each row. We specify a relative precision, e.g. ¼ 10 À 12 , and terminate the computation of F once the contribution of the current term j is small enough compared to the partial sum from 0 to j − 1: jF j =F partial j <.