The Yule Approximation for the Site Frequency Spectrum after a Selective Sweep

In evolutionary theory, a key question is which portions of the genome of a species are targets of natural selection. Genetic hitchhiking is a theoretical concept that has helped to identify various such targets in natural populations. In the presence of recombination, a severe reduction in sequence diversity is expected around a strongly beneficial allele. The site frequency spectrum is an important tool in genome scans for selection and is composed of the numbers $S_1, \dots, S_{n-1}$, where $S_i$ is the number of single nucleotide polymorphisms (SNPs) present in $i$ from $n$ individuals. Previous work has shown that the numbers of both low- and high-frequency variants are elevated relative to neutral evolution when a strongly beneficial allele fixes. Here, we follow a recent investigation of genetic hitchhiking using a marked Yule process to obtain an analytical prediction of the site frequency spectrum in a panmictic population at the time of fixation of a highly beneficial mutation. We combine standard results from the neutral case with the effects of a selective sweep. As simulations show, the resulting formula produces predictions that are more accurate than previous approaches for the whole frequency spectrum. In particular, the formula correctly predicts the elevation of low- and high-frequency variants and is significantly more accurate than previously derived formulas for intermediate-frequency variants.

for s = n. (10)

Using this definition to formulate the joint distribution of E and L leads to

Section B
Corrected proof for the calculation of P(E = 0, L = l) (see Corollary 2.7 in [1])

The probability for E = 0 and L = l can first be expressed by

Now, we can calculate u by using (4)

Section C
Proof of (7)

By our assumptions, new mutations hit the genealogy of the sample only in the neutral phase. Therefore, we have to combine our understanding of the selective phase, as given by the Yule approximation of the genealogy, with the neutral site frequency spectrum. The expected frequency spectrum at time t = 0 of a sample of size k, i.e., after the neutral phase, is given by

\[ \mathbb{E}[S_i] = \frac{\theta}{i}, \qquad i = 1, \dots, k - 1. \]

In our case, the sample size k at time t = 0 depends on the genealogy in the selective phase. We combine the probabilities of the different genealogies in the selective phase with the corresponding expected frequency spectra. We distinguish three different situations:
a) There is no early recombinant family, i.e., E = 0.
b) There is an early recombinant family, but it consists of only one individual, i.e., E = 1.
c) There is an early recombinant family, and it consists of more than one individual, i.e., E ≥ 2.
The different situations are displayed in Figure 5 for a sample of size 5. Additionally, for every situation, different cases must be distinguished. Here, we present the situations E = 0 and E ≥ 2.

If L = n or L = n − 1, then the number of lines before and after the selective phase is the same, i.e., K = n. Therefore, only the neutral phase has an effect on the frequency spectrum, which leads to

\[ \mathbb{E}[S_i \mid E = 0, L = l] = \frac{\theta}{i}. \]

In the case that l ∈ {1, . . . , n − 2}, note that there are l + 1 lines at the end of the neutral phase. We must consider two different events that lead to a mutation of size i. Either the mutation achieves a size of i before the selective phase and none of these lines splits during the selective phase, or the mutation has size i − d + 1 before the selective phase and one of these lines is hit by the founder of the sweep.
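In the first event, the mutation affects i lines, leading to $\binom{l+1}{i}$ possibilities to select the i lines that have the mutation. The line that splits must not be one of the i lines, so the probability that a mutation of size i keeps its size is

\[ \binom{l}{i} \Big/ \binom{l+1}{i} = \frac{l + 1 - i}{l + 1}. \]

Combining this probability with the expected number of mutations under neutrality (see (14)) gives the expected number of mutations that achieve a size of i by this first event:

\[ \frac{\theta}{i} \cdot \frac{l + 1 - i}{l + 1} \, \mathbf{1}_{l \ge i}. \]

Using the same considerations for the second event (size i is not reached until the end of the selective phase), the conditional expectation in the case l ∈ {1, . . . , n − 2} is obtained by

\[ \mathbb{E}[S_i \mid E = 0, L = l] = \frac{(l + 1 - i)\,\theta}{(l + 1)\, i} \, \mathbf{1}_{l \ge i} + \frac{\theta}{l + 1} \, \mathbf{1}_{d \le i}. \]

For l ∈ {n − 1, n}, we use the equation

\[ \frac{(l + 1 - i)\,\theta}{(l + 1)\, i} + \frac{\theta}{l + 1} = \frac{\theta}{i} \]

and the fact that $\mathbf{1}_{l \ge i} = 1$ and $\mathbf{1}_{d \le i} = 1$ always hold for l = n and l = n − 1.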
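The two events for E = 0 can be checked by direct counting. The following Python sketch is an illustration only and not part of the original derivation: the function names are hypothetical, and it assumes d = n − l, a neutral expectation of θ/i for mutations of size i, and that a mutation of a given size sits on a uniformly chosen set of the l + 1 lines.

```python
from fractions import Fraction
from math import comb


def keep_size_prob(l, i):
    """Probability that a mutation on i of the l + 1 lines avoids the line that splits."""
    if i > l:
        return Fraction(0)  # corresponds to the indicator 1_{l >= i}
    return Fraction(comb(l, i), comb(l + 1, i))


def expected_sfs_no_early_recombinant(i, l, n, theta=Fraction(1)):
    """Sketch of E[S_i | E = 0, L = l], assuming d = n - l (an assumption of this example)."""
    d = n - l
    # Event 1: the mutation already has size i and none of its lines splits.
    first = theta / i * keep_size_prob(l, i)
    # Event 2: the mutation has size m = i - d + 1 and includes the line
    # that goes back to the founder; this happens with probability m / (l + 1).
    second = Fraction(0)
    m = i - d + 1
    if m >= 1:  # corresponds to the indicator 1_{d <= i}
        second = theta / m * Fraction(m, l + 1)
    return first + second


if __name__ == "__main__":
    # n = 5, l = 3, i = 2: both events contribute theta / 4.
    print(expected_sfs_no_early_recombinant(2, 3, 5))
    # l = n - 1 reduces to the neutral value theta / i.
    print(expected_sfs_no_early_recombinant(2, 4, 5))
```

Exact rational arithmetic is used so that the counted probabilities can be compared with the closed-form terms without rounding.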
In the situation E > 1, cases 9 (D = 0) and 10 (D = 1) (see Figure 5) can be handled in the same way as before. There is only one line that splits (now e and not d) in the selective phase, such that a mutation again has two possibilities to reach a size of i. Therefore, let us concentrate on the more challenging case 8, E > 1 and D > 1. Note that in this case, K = l + 2 lines are present at the beginning of the selective phase. There are two coalescence events that must be considered, one for the early recombinants and one for the lines going to the founder of the sweep (see Figure 5), so there are a total of four possibilities that cause a mutation of size i. We distinguish the following four events:
• event 1: a mutation grows before the selective phase to a size of i and none of these lines is hit by the two lines that go back to the founder or the early recombinant family. (This event is only possible for l ≥ i.)
• event 2: a mutation grows to a size of i − d + 1 before the selective phase and one of these lines is hit by the founder, but no line is hit by the early recombinant family. (This event is only possible for d ≤ i and d + l ≥ i.)
• event 3: a mutation grows to a size of i − e + 1 before the selective phase and one of these lines is hit by the early recombinant family, but no line is hit by the founder. (This event is only possible for e ≤ i and e + l ≥ i.)
• event 4: a mutation grows to a size of i − e − d + 2 before the selective phase, one of these lines is hit by the founder and another line is hit by the early recombinant family. (This event is only possible for e + d ≤ i.)
Each event can be treated similarly to situation 1, except for the fact that there are now 2 drawings.
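The four events can be made concrete by enumerating every placement of a mutation on the K = l + 2 lines. The sketch below is illustrative only: the function names are hypothetical, line 0 is arbitrarily taken to be the line going back to the founder and line 1 the line going back to the early recombinant family, and placements are assumed to be uniform over subsets of lines.

```python
from fractions import Fraction
from itertools import combinations


def event_probabilities(l, m):
    """Classify all placements of a size-m mutation on l + 2 lines.

    Returns a dict keyed by (hits_founder, hits_recombinant) whose values are
    the fractions of placements falling into each of the four events.
    """
    if not 1 <= m <= l + 2:
        raise ValueError("m must lie between 1 and l + 2")
    counts = {(False, False): 0, (True, False): 0,
              (False, True): 0, (True, True): 0}
    total = 0
    for subset in combinations(range(l + 2), m):
        counts[(0 in subset, 1 in subset)] += 1
        total += 1
    return {key: Fraction(c, total) for key, c in counts.items()}


def final_size(m, d, e, hits_founder, hits_recombinant):
    """Size after the selective phase: a hit line is replaced by d (or e) lines."""
    return m + (d - 1) * hits_founder + (e - 1) * hits_recombinant


if __name__ == "__main__":
    # l = 3, so 5 lines; a mutation of size 2 has 10 possible placements.
    for key, prob in event_probabilities(3, 2).items():
        print(key, prob, final_size(2, 3, 2, *key))
```

Inverting `final_size` recovers the sizes before the selective phase listed in the four events, for example m = i − d + 1 when only the founder line is hit.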
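We will explain event 2 to show the procedure: At the beginning of the selective phase, the mutation has size i − d + 1, so there are $\binom{l+2}{i-d+1}$ possibilities to distribute the mutation among the lines and $\binom{1}{1}\binom{l}{i-d}$ possibilities to receive the designated result, because one line must be hit by the founder of the sweep and the other i − d lines can be distributed among l lines. Combining the arising probability with the expected frequency spectrum before the selective phase leads to

\[ \frac{\theta}{i - d + 1} \cdot \frac{\binom{1}{1}\binom{l}{i-d}}{\binom{l+2}{i-d+1}} \, \mathbf{1}_{d \le i} \mathbf{1}_{d + l \ge i} = \frac{(l + 1 - i + d)\,\theta}{(l + 1)(l + 2)} \, \mathbf{1}_{d \le i} \mathbf{1}_{d + l \ge i} \]

for the expected number of mutations that achieve a size of i due to event 2. In total, under the condition e + l ∈ {2, . . . , n − 2} (because in case 8 we have D ≥ 2) and L = l, we get the equation

\[
\begin{aligned}
\mathbb{E}[S_i \mid E = e, L = l] = \theta \Big[ & \frac{(l + 2 - i)(l + 1 - i)}{(l + 1)(l + 2)\, i} \, \mathbf{1}_{l \ge i} + \frac{l + 1 - i + d}{(l + 1)(l + 2)} \, \mathbf{1}_{d \le i} \mathbf{1}_{d + l \ge i} \\
& + \frac{l + 1 - i + e}{(l + 1)(l + 2)} \, \mathbf{1}_{e \le i} \mathbf{1}_{e + l \ge i} + \frac{i - e - d + 1}{(l + 1)(l + 2)} \, \mathbf{1}_{e + d \le i} \Big],
\end{aligned}
\]

where d = n − e − l.

Now, we want to use the conditional expectations to obtain the expectation $\mathbb{E}[S_i]$. For this purpose, we split the expectation into conditional expectations, that is,

\[ \mathbb{E}[S_i] = \sum_{e,\, l} P(E = e, L = l) \, \mathbb{E}[S_i \mid E = e, L = l] + O\!\left(\frac{\rho^2}{\alpha^2}\right), \]

where the O term appears because of the approximate character of the applied formula for the selective phase. Now, we can insert the conditional expectations to obtain the final result

\[
\begin{aligned}
\dots \; & P(E = s - l, L = l) \cdot \theta \cdot \Big[ \frac{(l + 2 - i)(l + 1 - i)}{(l + 1)(l + 2)\, i} \, \mathbf{1}_{l \ge i} + \frac{l + 1 - i + n - s}{(l + 1)(l + 2)} \, \mathbf{1}_{n - s \le i} \mathbf{1}_{l + n - s \ge i} \\
& \qquad + \frac{1 - i + s}{(l + 1)(l + 2)} \, \mathbf{1}_{s - l \le i} \mathbf{1}_{s \ge i} + \frac{i + l - n + 1}{(l + 1)(l + 2)} \, \mathbf{1}_{n - l \le i} \Big] \\
& + \sum_{l=1}^{n-2} P(E = n - l, L = l) \Big[ \frac{(l + 1 - i)\,\theta}{(l + 1)\, i} \, \mathbf{1}_{l \ge i} + \frac{\theta}{l + 1} \, \mathbf{1}_{n - l \le i} \Big] + O\!\left(\frac{\rho^2}{\alpha^2}\right).
\end{aligned}
\]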
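The closed-form terms for case 8 can be cross-checked against the binomial counting used for event 2. The following Python sketch is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, d and e are taken to be at least 2 with n = l + d + e, the neutral expectation for a mutation of size m is assumed to be θ/m, and the four closed-form terms are read off from the final formula with s = e + l.

```python
from fractions import Fraction
from math import comb


def case8_by_counting(i, l, d, e, theta=Fraction(1)):
    """Sum the four events for K = l + 2 lines by counting placements."""
    total = Fraction(0)
    for hit_f in (0, 1):          # does the mutation include the founder line?
        for hit_r in (0, 1):      # does it include the early recombinant line?
            # Size before the selective phase that grows to size i afterwards.
            m = i - hit_f * (d - 1) - hit_r * (e - 1)
            others = m - hit_f - hit_r  # mutated lines among the l remaining lines
            if m < 1 or others < 0 or others > l:
                continue  # this event is impossible for these parameters
            prob = Fraction(comb(l, others), comb(l + 2, m))
            total += theta / m * prob
    return total


def case8_closed_form(i, l, d, e, theta=Fraction(1)):
    """Four-term expression with indicators written as explicit conditions."""
    denom = (l + 1) * (l + 2)
    total = Fraction(0)
    if l >= i:                                   # event 1
        total += Fraction((l + 2 - i) * (l + 1 - i), denom * i)
    if d <= i <= d + l:                          # event 2
        total += Fraction(l + 1 - i + d, denom)
    if e <= i <= e + l:                          # event 3
        total += Fraction(l + 1 - i + e, denom)
    if e + d <= i:                               # event 4
        total += Fraction(i - e - d + 1, denom)
    return theta * total


if __name__ == "__main__":
    l, d, e = 3, 2, 2
    n = l + d + e
    for i in range(1, n):
        print(i, case8_by_counting(i, l, d, e), case8_closed_form(i, l, d, e))
```

Both functions agree for every i between 1 and n − 1 in the parameter ranges tried in the test, which supports the term-by-term reading of the indicators in the final formula.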