Genetic Hitchhiking under Heterogeneous Spatial Selection Pressures

During adaptive evolutionary processes substantial heterogeneity in selective pressure might act across local habitats in sympatry. Examples are selection for drug resistance in malaria or herbicide resistance in weeds. In such setups standard population-genetic assumptions (homogeneous constant selection pressures, random mating etc.) are likely to be violated. To avoid misinferences on the strength and pattern of natural selection it is therefore necessary to adjust population-genetic theory to meet the specifics driving adaptive processes in particular organisms. We introduce a deterministic model in which selection acts heterogeneously on a population of haploid individuals across different patches over which the population randomly disperses every generation. A fixed proportion of individuals mates exclusively within patches, whereas the rest mates randomly across all patches. We study how the allele frequencies at neutral markers are affected by the spread of a beneficial mutation at a closely linked locus (genetic hitchhiking). We provide an analytical solution for the frequency change and the expected heterozygosity at the neutral locus after a single copy of a beneficial mutation became fixed. We furthermore provide approximations of these solutions which allow for more obvious interpretations. In addition, we validate the results by stochastic simulations. Our results show that the application of standard population-genetic theory is accurate as long as differences across selective environments are moderate. However, if selective differences are substantial, as for drug resistance in malaria, herbicide resistance in weeds, or insecticide resistance in agriculture, it is necessary to adapt available theory to the specifics of particular organisms.


Introduction
When an advantageous mutation arises and rapidly increases to high frequency under strong positive selection, neutral variants on the same chromosome (initially linked to the mutation) "hitchhike" along with the mutation to high frequency. This rapid change in neutral allele frequencies generates a characteristic pattern of polymorphism, commonly referred to as a "selective sweep". Meiotic recombination breaks the association between the advantageous and the neutral allele (the "hitchhiker"). Therefore, the pattern of a selective sweep is confined to a small map distance from the locus under selection. Signatures of selective sweeps include a local reduction of polymorphism (expected heterozygosity), a skew of the site frequency spectrum, and a unique spatial pattern of linkage disequilibrium. As vast amounts of genome-wide data become available, the characteristic patterns of genetic hitchhiking provide a powerful tool to identify candidate regions in the genome that were recently (or still are) under positive directional selection. Moreover, as the quality of genetic data improves, it might be possible to develop methods that reconstruct the underlying evolutionary dynamics by "reverse engineering" hitchhiking patterns. This, however, requires extending classical theory to situations that reflect organism-specific characteristics of, e.g., the selective environment, demography, or mating structure.
[1] first provided a comprehensive mathematical analysis of this evolutionary process. Since then, remarkable advancements in the mathematical theory of selective sweeps were made [2][3][4][5]. Theories focused on the stochastic patterns of variation, mainly achieved through coalescent and diffusion approximations, in order to detect and interpret selective sweeps from DNA sequence polymorphism. Consequently, as more genomic data became available, clear cases of selective sweeps that confirm such theoretical predictions rapidly accumulated (reviewed in [6][7][8]).
Recent theoretical studies focus on the expansion of theory beyond the "standard model" of genetic hitchhiking. The standard model assumes that an advantageous mutation, arising as a single copy on a random chromosome, increases to high frequency under constant and homogeneous selective pressure in an ideal random-mating population of constant size. This model, the basic scenario of adaptive evolution that [1] considered, is simple enough to allow the application of diffusion and coalescent approximations and thus the prediction of stochastic patterns. However, a selective sweep in a real population occurs under a complex demographic structure and under various modes of positive selection. Applying the standard model of genetic hitchhiking to the interpretation of actual data may thus be seriously misleading. Recent studies addressed this problem by modeling selective sweeps that occur from standing genetic variation or recurrent beneficial mutations [9,10], under arbitrary dominance of the beneficial allele [11], under selection on a quantitative trait [12], in newly derived populations [13][14][15], in geographically structured populations [16], and under the complex life cycle of malaria parasites [17,18].
Homogeneity of the selective pressure driving the beneficial mutation to high frequency is an important assumption in the standard model of selective sweeps. Typically this is well justified even for a population that is distributed over multiple "patches" with different selective environments, if individuals move rapidly between patches and also mate with individuals from other patches. In that case, the population can be modeled as panmictic under a homogeneous selective pressure, which is given by the selective advantage of the beneficial allele averaged over all patches. However, as will be argued below, if mating is restricted to individuals within the same patch, this can alter the effective rate of meiotic recombination and thus the strength of genetic hitchhiking. This may be important for many species in which mating occurs between individuals within a restricted range (under the same selective environment) but the young offspring (or seeds) are dispersed over a much wider range. Such species include plants that frequently reproduce by self-fertilization, animals that lay eggs in common breeding sites from which the young disperse into random habitats, and agents causing vector-borne parasitic diseases. Plasmodium species, the parasites that cause human malaria, are a particularly important example: male and female gametocytes produced inside a single human host enter a mosquito's gut during the blood meal and release gametes, which immediately fuse and undergo meiosis, producing sporozoites that are transferred to different hosts. Therefore, given that hosts constitute heterogeneous selective environments ("patches") for parasites, an allele under selection experiences random switches of patches over malaria transmission cycles while mating always occurs between gametes derived from the same patch.
This is an important consideration for malaria parasites in which strong patterns of selective sweeps due to the evolution of drug resistance were discovered [19][20][21][22].
This study investigates the hitchhiking effect of a mutant allele spreading over a heterogeneous environment, which is composed of patches with different selective pressures. Averaged over all patches, the mutant allele is advantageous over the wild type. This model of selection with random dispersal over patches between generations is known as the hard-selection Levene model (cf. [23], Ch. 6). [24] originally formulated this model assuming soft selection. While the Levene model is typically used to study the maintenance of multiple alleles at the balance of selective pressures in different patches, we consider an overall advantage of the mutant allele that ensures the rapid increase of its frequency by directional selection. Our model also differs from the Levene model in that mating between individuals can occur within patches.
After formulating the deterministic model and deriving the corresponding recursion equations, we study the effect of a single locus under positive directional selection on a neutral multiallelic locus. We propose an analytical solution for the equilibrium frequencies and the expected heterozygosity at the neutral locus after the sweep is complete. We further derive approximations for the equilibrium heterozygosity that are easier to interpret. In particular, we want to contrast within-patch mating and mating after random dispersal over the whole population. Finally, we compare stochastic simulations with the analytic results of the deterministic model.

Overview of the Model
Assume that a haploid sexual population disperses randomly over finitely many patches $P_1, \ldots, P_K$. Offspring are born in a common breeding site and then migrate randomly into the $K$ patches. Let $a_k$ denote the proportion of individuals that migrate into patch $P_k$. Viability selection acts differently across patches. After reaching reproductive age, adults migrate to the common breeding site for reproduction. A proportion $b_k$ of the individuals of patch $P_k$ mates randomly with individuals from the same patch, whereas the remaining individuals mate randomly in the common breeding site. The haploid offspring of the next generation again migrate from the common breeding site into the different patches. The proportion $b_k$ of individuals mating with other individuals of the same patch has various interpretations. It might reflect that individuals from different patches arrive at different times at the common breeding site, and hence have a higher chance to mate with individuals from their own patch. Alternatively, it might be interpreted as matings that occur on the way to the breeding site. It might also reflect that some matings occur within the patches before migration to the breeding site. For simplicity, we will refer to the proportion $b_k$ of matings as within-patch matings and to the proportion $1 - b_k$ as breeding-site matings.
Suppose that the size of the population is sufficiently large to treat the evolutionary changes deterministically. Then, the population in a given generation is represented by a vector $p$ of haplotype frequencies, which are counted after sexual reproduction in the common breeding site. The single-generation change of $p$ is determined by the reproductive success within the different patches. Mating and meiotic recombination are as described above.
This model superficially appears to be the hard-selection Levene model (cf. [23] Chapter 6, for a diploid version), which is equivalent to the standard haploid selection model without migration. However, there is a crucial difference. Namely, the Levene model assumes that mating occurs randomly within the common breeding site, while we assume that only a proportion of the individuals of each patch mates within this site. Clearly, our model reduces to the hard-selection Levene model if $b_k = 0$ for $k \in \{1, \ldots, K\}$. Conversely, if $b_k = 1$ for all $k$, all matings occur within the patches. We will discuss the differences between our model and the hard-selection Levene model in more detail in the following sections. (The Levene model was originally introduced by [24] for soft selection.)

Change of Haplotype Frequencies
Assume $L$ multi-allelic loci in a genome of haploid individuals, and let $n_i$ be the number of alleles segregating at locus $i$, yielding $n = \prod_{i=1}^{L} n_i$ haplotypes in total. These are labelled $1, \ldots, n$ in the usual order. Their respective relative frequencies in the overall population are $p_1, \ldots, p_n$, which are summarized by the haplotype-frequency vector $p = (p_1, \ldots, p_n)$.
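Let $a_k p_i$ denote the frequency of haplotype $i$ in patch $k$, and let $W_i^{(k)}$ denote the fitness of haplotype $i$ in patch $P_k$. Then the (absolute) frequency of haplotype $i$ in patch $k$ after selection is $a_k p_i W_i^{(k)}$. Hence, $b_k a_k \bar{W}^{(k)}$ and $(1-b_k) a_k \bar{W}^{(k)}$ are the absolute numbers of individuals in patch $P_k$ that mate randomly within the patch and at the breeding site, respectively. Moreover, $a_k \bar{W}^{(k)}$ with $\bar{W}^{(k)} = \sum_{i=1}^{n} p_i W_i^{(k)}$ denotes the number of haplotypes in patch $k$ after selection. The probability that a mating between an $i$- and a $j$-haplotype occurs in patch $P_k$ is then given by
$$\frac{p_i W_i^{(k)} \, p_j W_j^{(k)}}{\left(\bar{W}^{(k)}\right)^2}. \qquad (1)$$
Let the probability that a mating between an $i$- and a $j$-haplotype gives rise to an $l$-haplotype be $R(i,j \to l)$. Therefore, the number of $l$-haplotypes that are produced in patch $P_k$ is given by
$$\frac{b_k a_k}{\bar{W}^{(k)}} \sum_{i,j=1}^{n} p_i p_j W_i^{(k)} W_j^{(k)} R(i,j \to l). \qquad (2)$$
The number of $l$-haplotypes that arrive unmated in the common breeding site is
$$\sum_{k=1}^{K} (1-b_k) a_k p_l W_l^{(k)} = p_l W_l, \qquad (3)$$
where $W_i = \sum_{k=1}^{K} (1-b_k) a_k W_i^{(k)}$, and the total number of unmated individuals at the breeding site is
$$\bar{W} = \sum_{j=1}^{n} p_j W_j. \qquad (4)$$
Hence, the number of $l$-haplotypes produced in the breeding site is
$$\frac{1}{\bar{W}} \sum_{i,j=1}^{n} p_i p_j W_i W_j R(i,j \to l). \qquad (5)$$
From (2) and (5), the relative frequency of haplotype $l$ in the whole population is calculated to be
$$p_l' = \frac{1}{\sum_{k=1}^{K} a_k \bar{W}^{(k)}} \left[ \frac{1}{\bar{W}} \sum_{i,j=1}^{n} p_i p_j W_i W_j R(i,j \to l) + \sum_{k=1}^{K} \frac{b_k a_k}{\bar{W}^{(k)}} \sum_{i,j=1}^{n} p_i p_j W_i^{(k)} W_j^{(k)} R(i,j \to l) \right]. \qquad (6)$$
We shall briefly summarize the classical hard-selection Levene model:
Remark 1. In the case of the hard-selection Levene model, all individuals mate in a common pool. The relative frequency of $i$-haplotypes in this pool is $p_i W_i / \bar{W}$, where $W_j = \sum_{k=1}^{K} W_j^{(k)} a_k$ and $\bar{W} = \sum_{j=1}^{n} p_j W_j$. Hence, the frequency of $l$-haplotypes in the next generation is given by
$$p_l' = \frac{1}{\bar{W}^2} \sum_{i,j=1}^{n} p_i p_j W_i W_j R(i,j \to l). \qquad (7)$$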

Results
Now we want to study genetic hitchhiking, i.e., the influence of selection at a single locus on a linked neutral locus. For this purpose we assume that the first locus is selected with two alleles $A_1$ and $A_2$, and that the second locus is selectively neutral with finitely many alleles $B_1, \ldots, B_M$. We number the haplotypes such that $l$ stands for $A_1 B_l$ and $l+M$ stands for $A_2 B_l$ ($1 \le l \le M$). Moreover, we denote the recombination rate between the two loci by $r$.
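One generation of the model for this two-locus setting can be sketched numerically. The sketch below is illustrative only: it assumes that random mating within a mating pool with recombination rate $r$ maps haplotype frequencies $x_{ab}$ to $(1-r)\,x_{ab} + r\,x_{a\cdot}\,x_{\cdot b}$, that the pools are the $K$ within-patch pools (sizes proportional to $b_k$) plus the breeding site, and that fitness depends only on the allele at the selected locus. The function name `next_generation` and the array layout are hypothetical choices, not part of the original analysis.

```python
import numpy as np

def next_generation(p, a, b, w, r):
    """One generation of the two-locus recursion (illustrative sketch).

    p : array (2, M); p[0, l] = frequency of A1B_l, p[1, l] = frequency of A2B_l
    a : array (K,); proportion of individuals migrating into each patch
    b : array (K,); proportion of each patch mating within the patch
    w : array (K, 2); w[k, 0] / w[k, 1] = fitness of A1 / A2 in patch k
    r : recombination rate between the selected and the neutral locus
    """
    p, a, b, w = (np.asarray(v, dtype=float) for v in (p, a, b, w))
    # Absolute haplotype numbers per patch after viability selection: (K, 2, M).
    survivors = a[:, None, None] * w[:, :, None] * p[None, :, :]
    # Mating pools: one per patch (within-patch maters) plus the breeding site.
    pools = list(b[:, None, None] * survivors)
    pools.append(((1.0 - b)[:, None, None] * survivors).sum(axis=0))
    offspring = np.zeros_like(p)
    for pool in pools:
        size = pool.sum()
        if size == 0.0:
            continue  # nobody mates in this pool
        x = pool / size                         # haplotype frequencies in the pool
        freq_a = x.sum(axis=1, keepdims=True)   # allele frequencies, selected locus
        freq_b = x.sum(axis=0, keepdims=True)   # allele frequencies, neutral locus
        offspring += size * ((1.0 - r) * x + r * freq_a * freq_b)
    return offspring / offspring.sum()
```

Iterating this map until $A_1$ is close to fixation, once with all $b_k = 0$ and once with all $b_k = 1$, gives a direct numerical comparison of the neutral-locus heterozygosity under breeding-site and within-patch mating.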

Dynamics at the Selected Locus
Let us denote the frequency of $A_1$ by $p$ and that of $A_2$ by $1-p$. The fitness of a haplotype carrying the allele $A_i$ in patch $P_k$ is denoted by $w_i^{(k)}$ ($i = 1, 2$). By marginalization of the above dynamics it is straightforward to derive the dynamics for $p$. In subsection 1 of Analysis we show
$$p' = \frac{\lambda p}{\lambda p + \mu (1-p)}, \qquad (8a)$$
where
$$\lambda = \sum_{k=1}^{K} a_k w_1^{(k)} \qquad (8b)$$
is the mean fitness of $A_1$ among all patches and
$$\mu = \sum_{k=1}^{K} a_k w_2^{(k)} \qquad (8c)$$
is the mean fitness of $A_2$ among all patches. Note that the dynamics (8) are independent of the $b_k$'s. In particular, the dynamics (8) at the selected locus are those of the standard haploid selection model, which is identical to the hard-selection Levene model. Summarizing, we obtain:
Result 1. The allele $A_1$ will become fixed in the population if and only if $\lambda > \mu$.
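The single-locus dynamics can be illustrated with a few lines of code. This is a minimal sketch that assumes the standard haploid selection recursion $p' = \lambda p / (\lambda p + \mu(1-p))$, with $\lambda = \sum_k a_k w_1^{(k)}$ and $\mu = \sum_k a_k w_2^{(k)}$ the patch-averaged fitnesses of $A_1$ and $A_2$. The function name `allele_trajectory` and the numerical values are hypothetical.

```python
def allele_trajectory(p0, a, w1, w2, generations):
    """Iterate p' = lam*p / (lam*p + mu*(1-p)) for the selected allele A_1.

    a  : patch proportions a_k
    w1 : fitness of A_1 in each patch
    w2 : fitness of A_2 in each patch
    """
    lam = sum(ak * wk for ak, wk in zip(a, w1))  # mean fitness of A_1
    mu = sum(ak * wk for ak, wk in zip(a, w2))   # mean fitness of A_2
    traj = [p0]
    for _ in range(generations):
        p = traj[-1]
        traj.append(lam * p / (lam * p + mu * (1.0 - p)))
    return lam, mu, traj

# Hypothetical example: A_1 is strongly favored in patch 1 and slightly
# disfavored in patch 2, so that lam > mu and A_1 sweeps.
lam, mu, traj = allele_trajectory(1e-4, [0.5, 0.5], [1.0, 0.9], [0.2, 1.0], 60)
```

Under this recursion the odds $p/(1-p)$ are multiplied by $\lambda/\mu$ in every generation, which gives a simple check on the iteration and makes clear why only the patch averages matter at the selected locus.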
Moreover, by iterating (8a) the frequency of $A_1$ in generation $t$, with initial condition $p(0) = p_0$, is calculated to be
$$p(t) = \frac{p_0 \lambda^t}{p_0 \lambda^t + (1-p_0) \mu^t}.$$
Furthermore, we have $\lim_{t \to \infty} p(t) = 1$ if $\lambda > \mu$, and $\lim_{t \to \infty} p(t) = 0$ if $\lambda < \mu$.

Dynamics at the Neutral Locus
Now we want to study the hitchhiking effect of the spread of a resistant allele at a single locus on neutral variation. As before $p$ denotes the frequency of the resistant allele $A_1$. We have In Analysis, subsection 2, we derive $Q_l$ in generation $T$ to be where and In the last step we set $c_k := a_k b_k$ for $k \in \{1, \ldots, K\}$, $c_0 := \sum_{k=1}^{K} a_k (1-b_k)$, and $w_i^{(0)} = w_i$. Hence, we defined patch $P_0$ as the breeding site, and $c_k$ is the proportion of the population mating in patch $P_k$.
Although we could in principle derive $R_l(T)$ analogously, we refrain from doing so. We are only interested in the case in which the allele $A_1$ sweeps through the population. Hence, at equilibrium $A_2$ vanishes, and all neutral alleles are linked to the allele $A_1$. The equilibrium frequency of $B_l$ is therefore given by $\lim_{t \to \infty} Q_l(t)$. Deriving these frequencies allows us to study genetic hitchhiking. In particular, we have where From the above it is straightforward to calculate the equilibrium heterozygosity defined by The equilibrium heterozygosity depends on the initial allele-frequency distribution at the neutral locus, because $\hat{Q}_l$ does. However, as shown in Analysis (subsection 3) the relative expected heterozygosity defined by is independent of the initial allele-frequency distribution. Here, $E$ denotes the expectation (over the initial distribution of allele frequencies), and is the initial heterozygosity. We summarize:
Result 2. The equilibrium frequency of the neutral allele $B_l$ is given by (15). The expected relative heterozygosity is calculated to be where $A_r$ is defined in (15b).
Remark 2. For the hard-selection Levene model, we need to set $b_k = 0$ for all $k$, which gives which clearly is exactly the solution for standard hitchhiking.
The differences between our model and the hard-selection Levene model become obvious from the above remark. Whereas the dynamics at the selected locus coincide for both models, differences occur at linked neutral loci. Not surprisingly, the hard-selection Levene model is equivalent to the standard haploid selection model. In particular, the relative heterozygosity, which measures the hitchhiking effect (see section 3), does not coincide for the two models. Figures 1, 2, and 3 illustrate these differences.
The analytic solution (17) is unsatisfying insofar as it is iterative and difficult to interpret. We will therefore derive approximations that have a simpler form and are easier to interpret in terms of the involved parameters.

Approximations
By writing $p_t$ for $p$, and using (8), $\Lambda_{p_t}$ becomes (cf. 35, 28). Hence, Moreover, In the section Analysis we show that $\Lambda_p \ge 1 - r$ always holds. Hence, we can approximately set $\Lambda_{p_t} \approx 1 - r$. Therefore, we can approximate $\hat{Q}_l$ by Note that (20) has the same structure as comparable quantities in [18]. Hence, (20) can be further approximated with exactly the same methods as in [18]. This leads to
Result 3. The equilibrium frequency $\hat{Q}_l$ of the allele $B_l$ is given by (18). If $p_0 \approx 0$, the frequency is approximately ) $\log(1-r)$ $\log \mu - \log \lambda$ If additionally $r \approx 0$, we approximately obtain $\hat{Q}_l \approx$
Figure 3. Exact vs. approximate average relative heterozygosity. Average relative heterozygosity $H(r)$ as a function of $r$ as given by (15)
A sketch of the proof based on the results of [18] is presented in the section Analysis (subsection 4).
The above results allow for a simple interpretation. The neutral allele's frequency is a weighted average over the respective frequencies resulting from each patch (including the breeding site $P_0$). The weights are the proportion of individuals mating in each patch, $c_k$, times the relative size of the patches, i.e., the relative frequency of individuals in the patch, $w_1^{(k)}/\lambda$. Moreover, within-patch mating leads to an adjustment factor e r for the neutral allele's frequency within each patch as compared to standard hitchhiking. This adjustment measures how deviations of the selective regime in patch $P_k$ from the overall selection regime affect recombination. In particular, if This implies that patches that reflect the population-average selective pressures can be subsumed within the common breeding site. However, in patches characterized by 'extreme' selective regimes, deviations might be substantial. We can summarize: the set of patches that reflect the overall selective regime. If $p_0, r \approx 0$, the equilibrium frequency $\hat{Q}_l$ of the allele $B_l$ is given by where $V^c = \{0, \ldots, K\} \setminus V$.
The equilibrium heterozygosity is obtained by combining an adaptation of Result 2 with Results 3 and 4.
Result 5. If $p_0 \approx 0$ and $r \approx 0$ we have where with $V$ defined as in Result 4. The factor w is an adjustment due to increased inbreeding within patches caused by different survival rates resulting from different selective regimes.

Stochastic Simulations
The stochastic behavior of our two-locus model is explored by computer simulation in which the population contains a finite number ($N$) of haploid individuals. We restrict our attention to contrasting two extreme situations: complete intra-patch mating ($b_k = 1$ for all $k$) and the hard-selection Levene model ($b_k = 0$ for all $k$). Furthermore, we assume only two patches for most of the simulations.
Given $N$ individuals in generation $t$, sampling of individuals (offspring) for generation $t+1$ is performed in the following manner. First, a copy of a randomly picked individual in generation $t$ is sent to patch $P_k$ with probability $a_k$. Then, a number $x$ is drawn from a uniform distribution between 0 and the maximum fitness $\max_i w_i^{(k)}$. This copy is accepted (i.e. sampled) into generation $t+1$ if $x < w_1^{(k)}$ ($x < w_2^{(k)}$) and it carries the mutant (wildtype) allele. Otherwise this copy is discarded. This procedure is repeated until all $N$ haploids are sampled. Next, to perform recombination, $Nr/2$ pairs of individuals are chosen and cross-overs occur. For each pair, the first individual is chosen randomly from the entire population. If $b_k = 0$, the second individual is also chosen from the entire population (Levene model). If $b_k = 1$, the second individual is chosen from the same patch. This completes reproduction for generation $t+1$. Simulations start ($t = 0$) with one mutant and $N-1$ wildtype alleles. If the mutant allele is lost, the simulation starts again from the initial condition. The simulation stops when the mutant allele reaches fixation in the entire population ($t = \tau$). We use the method of quantifying the short-term coalescent rate from the individual-based simulation, as described in [25], to determine the expected heterozygosity at a neutral locus. Briefly, at the beginning of the simulation, all $N$ individuals carry distinct neutral alleles, as the neutral allele of the $i$th individual is represented by the "ancestral number" $i$ ($= 1, \ldots, N$). Then, let $q_i(t)$ be the frequency of ancestral number $i$ at time $t$ during the simulation. As described above, $q_i(0) = 1/N$ for all $i$. As a result of the selective sweep, $q_i(\tau) = 0$ for many $i$, while $\sum_{i=1}^{N} q_i(\tau) = 1$. Assuming that new neutral mutations between time 0 and $\tau$ can be ignored, the expected heterozygosity at $t = \tau$ is given by
$$H = 1 - \sum_{i=1}^{N} q_i(\tau)^2$$
(cf. [25]). The results of the simulation model are presented in Figures 1 and 2.
As expected, the heterozygosity is lower than predicted by the deterministic model. This can be corrected by adjusting the initial frequency in the deterministic model (i.e., by shortening the length of the trajectory).
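The simulation procedure described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used for the figures: the upper bound for $x$ is taken as the largest entry of the fitness table, $Nr/2$ is rounded to an integer number of pairs, a cross-over is modeled as a swap of the neutral "ancestral numbers" of the two individuals, and the function name `simulate_sweep` and its signature are hypothetical.

```python
import numpy as np

def simulate_sweep(N, a, w, r, within_patch, rng):
    """Heterozygosity 1 - sum_i q_i^2 at fixation of the mutant allele.

    a[k]    : probability that a sampled copy is sent to patch k
    w[k, 0] : fitness of the mutant in patch k; w[k, 1]: fitness of the wildtype
    within_patch : True for b_k = 1 (all k), False for b_k = 0 (Levene model)
    """
    a, w = np.asarray(a, dtype=float), np.asarray(w, dtype=float)
    w_max = w.max()                       # assumed upper bound for x
    n_pairs = int(round(N * r / 2))       # Nr/2 pairs, rounded (assumption)
    while True:                           # restart whenever the mutant is lost
        allele = np.ones(N, dtype=int)    # 1 = wildtype
        allele[0] = 0                     # 0 = mutant, a single initial copy
        anc = np.arange(N)                # distinct ancestral numbers
        while 0 < np.count_nonzero(allele == 0) < N:
            parent = np.empty(N, dtype=int)
            patch = np.empty(N, dtype=int)
            filled = 0
            while filled < N:             # rejection sampling, done in batches
                idx = rng.integers(N, size=N)
                k = rng.choice(len(a), size=N, p=a)
                x = rng.uniform(0.0, w_max, size=N)
                ok = np.flatnonzero(x < w[k, allele[idx]])[: N - filled]
                parent[filled:filled + ok.size] = idx[ok]
                patch[filled:filled + ok.size] = k[ok]
                filled += ok.size
            allele, anc = allele[parent], anc[parent]
            for _ in range(n_pairs):      # cross-overs swap the neutral marker
                i = rng.integers(N)
                if within_patch:
                    pool = np.flatnonzero(patch == patch[i])
                else:
                    pool = np.arange(N)
                j = rng.choice(pool)
                anc[i], anc[j] = anc[j], anc[i]
        if allele[0] == 0:                # mutant fixed rather than lost
            q = np.bincount(anc, minlength=N) / N
            return 1.0 - np.sum(q ** 2)
```

Averaging the returned value over many replicates, for example with `rng = np.random.default_rng()` and both settings of `within_patch`, gives an estimate of the expected heterozygosity that can be set against the deterministic prediction.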

Discussion
While adaptive evolution in reality follows complex patterns (demography, heterogeneous selection pressures, spatial structure, mating behavior, etc.), such processes can often be accurately described within the idealized framework formed by standard population-genetic assumptions (constant homogeneous selection pressures, constant population size, random mating). Deviations from standard assumptions, particularly heterogeneities in selective pressures, are obviously important in allopatry and parapatry. However, even individuals living in sympatry might experience substantial differences in selective pressures. Examples include selection for herbicide resistance in weeds [26][27][28], stress tolerance in insects and weeds in agriculture, insecticide resistance in bed bugs [29][30][31][32], and drug resistance in vector-borne diseases (see below). Whereas in these examples candidate regions under selection might be inferred with population-genetic methods that build on standard theory, substantial errors could result when attempting to reconstruct the underlying evolutionary dynamics (e.g., estimating selection coefficients, speed of evolution, recombination rates, etc.) from the selective sweep patterns. To avoid misinferences under such scenarios, it is therefore necessary to validate the applicability of standard population-genetic theory and, if appropriate, adapt existing theory, particularly since many of the mentioned examples are matters of economic relevance and/or global health interest.
For instance, Plasmodium parasites causing human malaria typically experience different 'environmental conditions' depending on the characteristics of human hosts that determine selective regimes (drug treatment, drug dosage, immune response, levels of host-acquired or natural immunity, etc.). Parasites resistant to antimalarial drugs have an advantage only in hosts treated with the respective drugs, whereas resistance is slightly deleterious in untreated hosts due to metabolic costs. In parallel, sexual reproduction occurs inside the mosquito vector, randomly but exclusively between parasites that were extracted from the same host, which constitutes another deviation from standard assumptions. Heterogeneous selection pressures also act on a spatial scale because drug-deployment policies and control interventions are country specific. This is particularly relevant along the borders of Cambodia, Laos, Myanmar, and Thailand, where the containment of emerging artemisinin resistance is of fundamental importance to sustain successful malaria control [33]. Inferences based on standard population-genetic assumptions might be misleading, as parasites experience highly varying selective environments and severe inbreeding is inherent to the specifics of malaria transmission.
More generally, parasites or pathogens that sexually reproduce within hosts might experience radically heterogeneous selection pressures, as immune responses may differ across organs or within specific tissues. Sexual reproduction might be common even in fungal pathogens [34]. In agriculture, patches of contrasting selective regimes are created in sympatry by human interventions (cf. [35,36]). The use of fertilizer, manure, herbicides, and pesticides, along with interventions such as plowing and irrigation, varies across farmed land. Therefore, insects or weeds might experience radically different selective conditions across nearby acres. A striking example of rapid evolutionary change under such a setting is the fast progression of glyphosate ("roundup") resistance in many species of weeds, which poses an economic challenge to US agriculture. A genetic understanding of glyphosate resistance will require the detection and analysis of selective sweeps in the plants, including those reproducing by self-pollination and long-distance seed dispersal.
In this study we introduced a model for heterogeneous selection in sympatry within a haploid population that randomly disperses across patches in every generation. Viability selection acts differently within the patches and mating occurs randomly within or between patches. In the limiting case that mating occurs randomly between all patches, the model reduces to the hard-selection Levene model (cf. [23], Ch. 6), which is identical to the standard selection model. However, if mating occurs exclusively within patches, the deviations from the standard model can be substantial. We showed that the dynamics at a single selected locus are independent of the mating pattern. Namely, they are solely determined by the average selection intensities across patches. However, as soon as two or more linked loci are considered, deviations from standard population-genetic assumptions become apparent. In particular, we studied how the genetic variation at a neutral locus is affected as a beneficial mutation sweeps at a nearby linked locus.
We were able to derive an analytic solution for the allele frequencies at a neutral locus after the beneficial mutation became fixed. As the analytic solution is complicated, we also derived approximations, which allow for clear and simple interpretations. Namely, they reflect the frequency change driven by the selective pressure averaged over patches, adjusted by a factor determining the relative importance of the patches. As long as differences in selection pressures are moderate, the hitchhiking effect is accurately described by standard population-genetic theory. However, if selection pressures are extreme, as might be the case in the above-mentioned examples, heterogeneities in selection pressures in combination with intra-patch mating lead to stronger reductions in genetic variation than predicted by the standard model. The reason is as follows. Radically reversing directional selection across patches leads to mating only between individuals carrying the allele that allows survival within the respective selective environments, thus greatly increasing the effect of inbreeding. Hence, meiotic recombination is less efficient at restoring genetic variation. This effect, however, cannot simply be summarized by an adjustment of the recombination rate. In fact, the unique mating scheme leads to a process for which selection and recombination cannot be decoupled.
We also performed stochastic simulations to verify the analytic predictions of the deterministic model. As expected, the deterministic solutions underestimated the reduction of genetic variation at neutral loci. However, as usual, this can be compensated by adjusting the effective initial frequency of the advantageous allele, which reflects the shorter allele-frequency trajectory of the advantageous allele conditional on its escape from extinction by random genetic drift.
In general, our results help to properly interpret selection coefficients when one attempts to estimate them from the patterns of selective sweeps. Unfortunately, appropriate data are unavailable for some of the mentioned examples to which our model would apply (pesticide and herbicide resistance). Nevertheless, as these examples are of great economic interest, and as population-genetic theory continues to advance, such data will hopefully become available soon. In any case, the model is applicable to malaria, where attempts have been made to link estimates of selection to the hitchhiking pattern (e.g. [19,22]).
The hitchhiking effect revealed in this study might be compared to that found in other studies assuming the subdivision of the population into many small demes or patches [16,37]. In contrast to our current result, those studies predicted a reduced strength of hitchhiking (higher heterozygosity) due to population subdivision. Their model, however, assumes homogeneous selective pressure over demes and limited migration of individuals between demes. In such a case, the delay in the propagation of the advantageous allele into the entire population provides more opportunities for recombination that breaks the hitchhiking. Most populations in nature would violate the assumptions of both studies (instantaneous dispersal among demes in the current study and homogeneous selective pressure in [16]). Further investigation is needed into the joint effect of the two forces.

Single-locus Dynamics
Here, we derive the marginal dynamics at a single locus. Let $p$ denote the frequency of allele $A_1$, i.e., $p = \sum_{i=1}^{M} p_i$. The fitness of allele $A_i$ in patch $P_k$ is denoted by $w_i^{(k)}$. Hence, with $w_i := \sum_{k=1}^{K} (1-b_k) a_k w_i^{(k)}$, we obtain $w_1 = W_i$ and $w_2 = W_{i+M}$ for $i \in \{1, \ldots, M\}$.
With the above notation we can derive $\bar{W}^{(k)} = p w_1^{(k)} + (1-p) w_2^{(k)}$. Thus, Clearly, we have $-\frac{2p(1-p)y^2}{((1-p)x + py)^3} < 0$ and $\det H = 0$, i.e., the leading minors of $H$ are non-positive. Hence, $f$ is concave but not strictly concave (note that $f(x,x) = x/2$). Hence, for positive random variables $X$ and $Y$ Jensen's inequality for higher dimensions yields.