Predicting Carriers of Ongoing Selective Sweeps without Knowledge of the Favored Allele

Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory—for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.

Expected value of M k,i , constant population size. Let T k denote the duration of epoch k. For a population of constant size N , the duration T k is exponentially distributed with rate k 2 /N (see [2]): and the number of mutations on a lineage i in epoch k, denoted M k,i , is Poisson distributed with rate µ T k . For a constant-sized population, this implies .
Later, we will explore E[T k ] and E[M k,i ] for exponentially growing populations. , where 1 ≤ i ≤ k, 2 ≤ k ≤ n, and 1 ≤ w ≤ n − k + 1. We wish to evaluate Eq. (7) (or (S1)) for = 1, and generally for any . This requires E[M k,i ], which we already have for constant populations, and E[W k,i ], which we derive next. For a positive integer , denote the rising factorial w ( ) = w(w + 1)(w + 2) · · · (w + − 1) (S5) and set w (0) = 1. Then W ( ) k,i denotes the th rising factorial of random variable W k,i , while W k,i denotes the ordinary th power. We will compute E[W Let be a nonnegative integer. We will prove that which is equivalent to Eq. (9) in the text. Using Eq. (S4), the expected value is the sum Here we use that rising factorials and binomial coefficients are related via w ( ) ! = w+ −1 . The rightmost sum may be evaluated using a combinatorial method: we will count all (k + − 1)-element subsets of {1, 2, . . . , n + − 1} by two methods. First, it is the binomial coefficient n+ −1 k+ −1 . Second, we will show that it also equals the rightmost sum in (S7).
To show this, we may order any such subset and partition it into three parts, (A, b, C): A is the set of its smallest elements; b is the ( + 1) th smallest element; and C is the set of the largest k − 2 elements. Element b must be in the range + 1 ≤ b ≤ n − k + + 1 to allow for smaller elements and k − 2 larger elements. We rewrite b as b = + w, where w is in the range 1 ≤ w ≤ n − k + 1.
Given b, we may choose A in one of b−1 = w+ −1 ways and choose C in one of n+ −1−b k−2 = n−w−1 k−2 ways. Multiply these counts for a given b, and sum over all b (by summing over all w) to obtain that the total number of subsets is the rightmost sum in (S7). But we also showed the number of such subsets is n+ −1 k+ −1 . Thus, (S7) evaluates to

Deriving E[ -HAF] in a constant-sized population
Our derivation of E[ -HAF] consists of three main steps. We start by defining a more general form of the HAF score for an arbitrary function f (w), which we denote HAF f . We then set f (w) to the rising factorial function and derive E[HAF w ( ) ]. Finally, we use Stirling Numbers to convert between rising factorials and powers, and obtain E[ -HAF].
Step 1: The generalized form E[HAF f ] Consider an arbitrary polynomial f (w) (or more generally, an arbitrary function f : Z → Z). Define the HAF f score of a HAF vector c as We now generalize the derivation of Eq. (7) to HAF f . Sum the above equation over all haplotypes in a genealogy: Note that in the sum over j, all summands are independent of j, so that sum is replaced by multiplying by the number of terms, w k,i . We divide the above sum by n and then average over all genealogies to obtain Further, since the random variables W k,i for i = 1, . . . , k are identically distributed, all terms in the inner sum are the same, and may be consolidated into k times the value of the term on one branch: In a constant-sized population, Eq. (S3) states that E[M k,i ] = θ/(k(k − 1)). Plug this in: (S12) Step 2: The rising factorial form E[HAF w ( ) ] We will show that for f (w) = w ( ) , for any nonnegative integer : For this proof only, abbreviate W = W k,1 . Note that Using this and Eq. (S6) gives We will split this sum into two telescoping sums. One may show that (S17) Using this with shifted k's and 's, part of the sum in Eq. (S16) can be written which is a telescoping sum that evaluates to Note that 1 ( +2) = ( + 2)!, simplifying this to Plugging Eqs. (S18),(S19) into Eq. (S16), we obtain: This proves Eq. (S13). Step The expected value of (W k,i ) is then: We evaluate E[ -HAF] = E[HAF w ] as a linear combination of terms E[HAF w (q) ], using the same coefficients that express w as a linear combination of terms of w (q) (Eq. (S20a)): For = 0: For = 1: It is straightforward to compute this for any in the same fashion.
(3) has simple derivation, but a variable number of terms, n, that grows as n grows. Eq. (7) leads to an evaluation (S22) that is a polynomial in n of degree . Both represent E[ -HAF] in a constant-sized population; we now show that they are equal as a consequence of the following lemma: Lemma 1 In a constant-sized population, for any polynomial f (w), we have Applying this to f (w) = w gives equivalence of the two formulas (3) and (S22) for E[ -HAF].
Proof of lemma. Both sides are linear in f , so it suffices to prove it on any basis of polynomials. We use the basis of rising factorials, f (w) = w ( ) for ≥ 0. Eq. (S13) evaluates E[HAF f ] on this basis. We now evaluate n−1 w=1 w ( ) . Note that This leads to a telescoping sum: Multiplying this sum by θ/n gives the same value as (S13) gives for E[HAF w ( ) ]. Thus, the lemma holds on the basis f (w) = w ( ) for ≥ 0. By linearity, it holds on all polynomials f (w).

Deriving E[ -HAF] in an exponentially growing population
Let the population size at time t in the past be given by N (t) = N 0 e −rt , where N 0 is the current population size. M k,i depends on T k , which in turn depends on the size of the population in epoch k. Slatkin and Hudson (1991) [5, p. 559] derived an iterative formula to generate a sequence of random times T k (k = n, n − 1, . . . , 2).
The actual formula Hudson implemented in the simulator ms [6] (precursor to msms [7]) is slightly different; in our notation, it is as follows. Let α = 2 N 0 r be a scaled growth rate. Generate random values U n , . . . , U 2 that are uniformly distributed in (0, 1), and iteratively compute T n , . . . , T 2 via Then generate a random value for each M k,i using a Poisson distribution with mean µ T k (k = 2, . . . , n and i = 1, . . . , k). This does not lead to a closed form expression for the expected HAF scores, but we may nonetheless use it to estimate the expected HAF scores computationally, as illustrated in S2 Fig. The first method, cumulative time, is to generate random values of T k 's and M k,i 's as above, and plug them into (S10), along with exact values of E[W k,i · f (W k,i )] computed using (S6) or (S21).
The second method is conditional expectation. For a given k, we may compute the expected value of (S24) given the condition that T k+1 , . . . , T n are known. This gives estimates T k = t k (computed in order k = n, n − 1, . . . , 2) as follows: where τ k = t k+1 + · · · + t n (with τ n = 0) and dt. In practice, Eq. (S25) is used to generate scaled times, In terms of these times, E[M k,i ] = µE[T k ] ≈ µ t k (which is approximate since the formula for t k uses conditional expectation while this equation does not). We then estimate E[HAF f ] via (S11): For f (w) = w ( ) , similarly to (S15), this gives an estimate and for f (w) = w , we use (S20) to obtain In terms of α = 2 N 0 r and θ = 2 N 0 µ, the coefficient µ/r may be replaced by θ/α. We used Maple (maplesoft.com) to generate the sequence of scaled times r · t k by iterating Eq. (S25). Maple supports arbitrary precision numerical computations, which is needed due to complications with double precision arithmetic. Let x = k(k−1) α exp(r · τ k ). Small rounding errors in x due to limited precision are amplified in e x , which may lead to slightly different results when iterating the recursion using double precision arithmetic vs. Maple's arbitrary precision arithmetic. Additionally, as x increases, e x grows rapidly and leads to an overflow in double precision arithmetic at approximately x ≥ 709.7827, while E 1 (x) decays rapidly and leads to an underflow at approximately x ≥ 701.8334, even though e x E 1 (x) is usually representable in double precision. For example, E 1 (702) ≈ 1.9 · 10 −308 underflows in double precision arithmetic, but e 702 E 1 (702) ≈ 0.0014 is representable. The largest value of x used in Eq. (S25) is x = n(n − 1)/α in iteration k = n, leading to an underflow in double precision if n(n − 1)/α > 701.8.

Number of mutations in a genealogy
Let f (w) = 1/w for w > 0 and f (0) = 0. Note that for an individual HAF vector c, some entries may be 0, but in a genealogy, all w k,i = 0 since every branch of the observed tree has at least one haplotype as its descendant. The sum (S9) of HAF f (c) over all haplotypes in a genealogy becomes and thus (S11) becomes In a constant-sized population, Eqs. (S3) and (S29) give that the expected number of mutations in a genealogy is where H(n) = n k=1 1 k is the Harmonic number. Under exponential growth, we instead estimate times t k or scaled times r · t k via (11). Then we estimate the expected number of mutations by Computing the mean HAF f score from a SNP matrix For a sample of n individuals whose genealogy has q segregating sites, the SNP matrix A consists of n rows (each row representing one individual) and q columns, with entry 0 denoting ancestral alleles and 1 denoting derived alleles.
The v th row of A is a haplotype vector as depicted as h in Fig 1. Let w all denote the sum of the rows of A. The frequency of the j th allele is w all j , the sum of all entries in the j th column of A.
In n v=1 HAF f (c(v)), the j th mutation contributes f (w j ) for each of the w j haplotypes that have that mutation (A vj = 1) and contributes 0 on each of the n − w j others. Thus, the average of HAF f over the genealogy is 1 n We use this to compute empirical averages of HAF f in simulations (ms [6] and msms [7] output a SNP matrix), as well as in further theoretical developments. In particular, given a haplotype vector h and the corresponding HAF vector c, we have

HAF score dynamics and peak values
Consider n haplotypes randomly sampled from a Wright-Fischer (WF) fixed-size population of N haploid individuals under a hard sweep with selection coefficient s. Let ν denote the fraction of individuals that carry the favored allele. We assume the sample has at least one carrier and at least one non-carrier, so that ν ∈ { 1 n , 2 n , . . . , n−1 n }; at fixation (ν = 1), there are no non-carriers remaining. We assume strong selection (N s 1) and no recombination in the region being sampled. We measure time in generations going backwards. See S13 Fig. The current time is time 0; let T car denote the time when all sampled carriers coalesce to their most recent common ancestor (denoted by MRCA car ). Let T all denote the time when all sampled individuals coalesce to a common ancestor (denoted by MRCA all ). Let T (k) denote time to MRCA of k randomly chosen haplotypes in a population of size N . From Nordborg [8], we have At time T car , we have exactly one ancestor of the favored allele carrier, and assume we have m ancestors of non-carriers. In a hard sweep scenario, the favored mutation arises at the same time as the onset of selection; therefore, the time between T car and T all is governed by the neutral WF model with population N , and is well approximated by neutral coalescent theory. Applying Eq. (S35) to the remaining sample of m + 1 individuals at time T car gives the expected time to coalesce: Let A denote the SNP matrix of a sample of n individuals, with νn carriers of the favored allele (S14 Fig). We order the columns so that all mutations are ordered chronologically from left to right. Similarly, the rows are ordered so that the first νn rows correspond to carrier haplotypes.
The i th haplotype is represented by the i th row of A (denoted by row vector h i ) and has 1-HAF score denoted by 1-HAF i . From Eq. (S34), we have We partition the matrix into three submatrices: A 1 consists of the rows corresponding to carriers and columns corresponding to mutations occurring prior to the onset of selection (times between T car and T all ). A 2 is the submatrix from all non-carrier rows and mutations prior to onset of selection. Finally, A 3 is the submatrix of all columns after the onset of selection (times between 0 and T car ). Define w car as the sum of all carrier haplotypes; similarly, let w non be the sum of all non-carrier haplotypes. Thus, Also, we can partition the equation for computing 1-HAF i by rows as We partition a haplotype vector h in the matrix by separating SNPs that occurred before (subvector h b ) and after (subvector h a ) the onset of mutation. A similar partitioning works for w all , w car , and w non . Thus, for a random haplotype h, The expected 1-HAF score of a carrier (or non-carrier) haplotype h can similarly be decomposed as To compute the expected 1-HAF score, we bound each of these constituent terms.
Lemma 2 Consider a sample of n individuals under a hard sweep with νn carriers ( 1 n ≤ ν ≤ n−1 n ). Let m be the number of non-carrier haplotypes remaining at time T car . Then for a random carrier haplotype h, we have while for a random non-carrier haplotype h, these become is the expected number of ones in h b , which is the expected number of mutations on the lineage from MRCA all to MRCA car . Therefore, Plugging this into Eq. (S42) gives Eq. (S38). Proof of Eq. (S39). On restricting w non = n i=νn+1 h i to the mutations from before the onset of selection, we obtain Going back from time T car to T all , we observe the coalescence of the ancestors of carrier haplotype h and non-carrier is the expected number of ones in common between these two haplotypes among the SNPs occurring before the onset of selection. This equals the expected number of mutations in the lineage from T to T all : Note that T − T car is the time for a sample of 2 individuals (ancestors of h and h i at time T car ) to coalesce. We evaluate Eq. (S45) using Eq. (S35): Plug this into Eq. (S44). All n − νn = (1 − ν)n terms are the same, so it simplifies to Eq. (S39). Proof of Eq. (S40). We have Here, h is a non-carrier while h i is a carrier, so they are distinct and they coalesce prior to the onset of selection. Thus, Eq. (S46) applies here. All νn terms of the sum in Eq. (S47) evaluate to Eq. (S46), so it simplifies to Eq. (S40). Proof of Eq. (S41). We have The term is the expected number of ones in common in haplotypes h and h i among the SNPs occurring before the onset of selection.
We apply Lemma 3 (upcoming) to show that two random non-carrier haplotypes h and h i have the same ancestor at time T car with probability or different ancestors with probability 1 − p. Thus, with probability p, we have h b i = h b , and with probability 1 − p, we have h b i = h b . Therefore, Eq. (S48) is a sum of (1 − ν)n terms, each of form (S50), so Eq. (S48) evaluates to Eq. (S41).
In the preceding proof, the probability p that two non-carriers share the same ancestor was computed using the following lemma, applied to n 0 = (1 − ν)n non-carriers in the sample and m ancestors of non-carriers at time T car . Lemma 3 Consider a subsample of n 0 individuals, which coalesce to m ancestors at some time T , while the rest of the sample coalesces to one or more ancestors distinct from these m. Draw two individuals (with replacement) from the subsample. The probability that they have the same ancestor at time T is Proof Note that Pr(W m,i = w) is given by Eq. (S4), and holds upon restricting the full genealogy to the subsample. Divide this expected value by n 0 2 pairs and simplify to obtain Eq. (S51).

Lemma 4 Consider a sample of n individuals under a hard sweep. For a random haplotype h (carrier or non-carrier), we have
Proof. We have The dimension of vectors h a i and h is the number of mutations in [0, T car ]. The expected value of this dimension is µE[T car ]. The dot product of two binary vectors is at least 0, and is bounded above by their dimension, so Plugging Eq. (S54) into Eq. (S53) gives Eq. (S52).
Lemma 5 Consider a population of size N with selection coefficient s, under a hard sweep with strong selection (N s 1). Then Proof. Campbell [9] shows that under strong selection, the fixation time is approximately 1 s ln(2N s) . As T car is the time for the favored allele to reach frequency νn, it is dominated by the fixation time. Therefore, Plug Eq. (S56) and the scaled mutation rate θ = 2N µ into Lemma 4 to obtain Eq. (S55). be the expected number of non-carriers in the whole population when the u th coalescent event occurs among the non-carrier samples. As we go backwards in time (increasing u), the expected number of non-carriers increases: Additionally, the chance of coalescing decreases and the expected coalescent time increases. For a fixed size population of size N , the expected time for one coalescent event in a sample of size n is N/ n 2 (see [8]). Hence, the expected time for the u th coalescent event is at least On the other hand, the sum of all coalescent times during the selective sweep is at most T car . Therefore, Comparing this lower bound on E[T car ] with the upper bound in Eq. (S56) gives This gives the upper bound on ε ν in Eq. (S57).
Theorem 7 Consider a sample of n individuals under a hard sweep with νn carriers ( 1 n ≤ ν ≤ n−1 n ). Assume strong selection (N s 1). In the time [0, T car ], let the number of coalescent events among the (1 − ν)n non-carrier haplotypes be ε ν (1 − ν)n. The expected 1-HAF score of a random carrier haplotype is while for a random non-carrier haplotype, Proof. Recall the decomposition of E[1-HAF] given in Eq. (S37): The number of ancestors of non-carriers at time For a random carrier haplotype h, we evaluate the first two terms in Eq. (S63) by using Eqs. (S38) and (S39), and bound the third term by using Lemma 5, to obtain: For a random non-carrier haplotype h, we use Eqs. (S40) and (S41) and Lemma 5: Ignoring the small term ln(2N s) 2N s , we obtain the desired expressions (S61) and (S62). We find the maximum of Eq. (S61) over 0 ≤ ν ≤ 1 using differentiation; the result is We simulated selective sweeps for a variety of parameters and compared their trajectories against these results. In S15 Fig, we compared the trajectories of both carriers and non-carriers in 500 selective sweeps for each pair (θ, n) with θ ∈ {24, 48}, n ∈ {100, 200, 300, 400}, s = 0.08, and N = 2000. Over the course of a trajectory, the frequency ν of the favored allele varies from ν = 1/n (1 carrier) to ν = 1 (fixation), but different trajectories will go through a different sequence of values of ν. We aligned the trajectories by the values of ν. For each ν, we separated carriers and non-carriers and averaged together (1-HAF)/(nθ) over the generation (if any) with allele frequency ν in each sweep. We plotted these averaged trajectories in green (θ = 24) and pink (θ = 48). We compared these against the expected value (Theorem 7). The expected value is bounded above in blue (ε ν = 0) and below in red (ε ν = ε max ν ); see Lemma 6. These are bounds on the expected value, not on simulation means. Simulation means may range over the whole distribution, but tend to vary around the expected value.
In this paper, due to the run time for forward simulations, we used a relatively small value for population size (N = 2000); in reality, N is a lot larger (10000), so that ε ν becomes smaller (see Lemma 6). Therefore, the upper and lower bounds in the proof of Theorem 7 move closer together, and we approximate ε ν = 0. To simplify the presentation in the main text, we set ε ν = 0, giving:

Empirical validation of PreCIOSS
We tested the performance of PreCIOSS for different population genetics parameters provided in S1      Table 2]. The model assumes an out-of-Africa split at time T B , with a bottleneck that reduced the effective population from N Af to N B , allowing for migrations at rate m Af-B . The African population stays constant at N Af up to the present generation. The model assumes a second split between European and Asian populations at time T EuAs , with a bottleneck reducing the Asian and European populations to N As0 and N Eu0 respectively. The bottleneck was followed by exponential growth at rates r As and r Eu , as well as migrations among all three sub-populations, leading to current populations from which Asian (CHB+JPT), European (CEU), and Africans (YRI) individuals were sampled. n , 2 n , . . . , n−1 n }, s = 0.08, and N = 2000, we did 1500 trials. We plotted the mean value of (1-HAF)/(θn) as a function of ν, for both carriers and non-carriers, and compared against the theoretical expected value. The expected value of (1-HAF)/(nθ) lies somewhere between the blue and red curves. The mean values may range over the whole distribution (and are not constrained by the blue and red curves) but tend to vary around the expected value.