Live Hot, Die Young: Transmission Distortion in Recombination Hotspots

There is strong evidence that hotspots of meiotic recombination in humans are transient features of the genome. For example, hotspot locations are not shared between human and chimpanzee. Biased gene conversion in favor of alleles that locally disrupt hotspots is a possible explanation of the short lifespan of hotspots. We investigate the implications of such a bias on human hotspots and their evolution. Our results demonstrate that gene conversion bias is a sufficiently strong force to produce the observed lack of sharing of intense hotspots between species, although sharing may be much more common for weaker hotspots. We investigate models of how hotspots arise, and find that only models in which hotspot alleles do not initially experience drive are consistent with observations of rather hot hotspots in the human genome. Mutations acting against drive cannot successfully introduce such hotspots into the population, even if there is direct selection for higher recombination rates, such as to ensure correct segregation during meiosis. We explore the impact of hotspot alleles on patterns of haplotype variation, and show that such alleles mask their presence in population genetic data, making them difficult to detect.

where the first term is the change due to the drive in heterozygotes and the second term to mutation out of the hotspot allele. For a large effective population size N e , this model is well approximated as a simple diffusion process (scaling time in units of 2N e , as N e tends to infinity, see Ewens [2] for an introduction) with diffusion parameter x(1 − x) and drift which is merely equation (2) scaled by 2N e .

SOM 2: Obtaining human parameters
In order to obtain the estimated human drive parameters shown in table 1 of the main text, we based calculations on (3) with µ D ≈ 0, and N e = 10, 000. For each hotspot, the intensity of crossover in males has previously been estimated [7] (DNA1-3), [10] (NID1) and [14] (SHOX). At two of these hotspots, DNA2 [9] and NID1 [10] a marker was identified for which one of the two types disrupted hotspot activity, and estimates were obtained of both the crossover rate r h , and the probability p of transmission of this marker in recombinants, for heterozygotes. We assumed the same rates in females as in males throughout.
It is easy to show that the population scaled drive parameter (3) is given by 8N e r H (1/2 − p) = 8N e r h (1/2 − p ) and we substituted in p = 0.24 for DNA2 and p = 0.26 for NID1 to obtain the strength of drive for these hotspots. To estimate crossover-based drive for the other hotspots, we used the same p as for DNA2 and a crossover rate as shown in the table.
To estimate the gene-conversion based drive we modified the above to include the fact that at a chosen marker, under the DSB model we can only observe gene conversion occurring if the initiating chromosome is transmitted to the offspring (with probability 1/2). If r h is now the observed gene conversion rate, and p is the proportion of gene conversion products carrying the disrupting allele, and r H is now the rate (in heterozygotes) of DSBs that result in gene conversion, the contribution of simple gene conversion to the scaled drive is 8N e r H (1/2 − p) = 16N e r h (1/2 − p ).
Finally, to estimate the gene-conversion contribution we used direct estimates of r h = 1.3 × 10 −4 and p = 0.29 for NID1 [10], and for the other hotspots we used the same value of p , estimates of r h = 1.3 − 3.4 × 10 −3 for the most-frequently converted marker at hotspot DNA3 [8], and then assumed the same ratio of crossover to conversion at the remaining four hotspots as for DNA3. The estimates produced are therefore very approximate, since recombination rates in females are unknown, there is uncertainty in the rate estimates, and p and the assumed gene conversion rates are extrapolated from other hotspots in most cases. The SHOX hotspot is in the psuedoautosomal region of the Y chromosome, which is subject to an obligate crossover in males, and likely to be much less active in females.
We have assumed that the SHOX hotspot is inactive in females, and so the sex-averaged recombination rate for the SHOX hotspot is half that observed in males. Further, we have assumed implicitly here that in heterozygotes for the disrupting mutation the recombination rate would equal the current population rate; an additive model of intensity might instead suggest a smaller recombination rate in heterozygotes, perhaps as little as half the population rate, and in this case the drive parameter should be reduced by this factor.

SOM 3: Probability that a hotspot survives uncooled from ancestral populations
Using the parameterisation defined in the main text and standard diffusion theory, an allele that arises via a single mutation (so at frequency 1/2N ), and lowers the rate of DSB initiation by r H , has an approximate probability of achieving fixation (and thereby weakening, or removing, the hotspot). Mutations that disrupt the hotspot occur at rate 2N µ D , therefore the rate at which disrupting alleles arise and fix in the population is approximately 2N µ D × eq(4). As r H becomes large, this death rate increases approximately linearly with hotspot intensity, the mutation rate (or the number of sites able to disrupt the hotspot through their mutation) and the effective population size N e . Neglecting the possibility of a currently segregating disrupting allele, or of several disruptive mutations co-segregating within the population at one time, the probability of survival of a hotspot in both species from the ancestral population to the current day is approximated by This approximation is commonly used for selected alleles [11] and we found it to provide an excellent approximation, for the parameter values used in this paper, to the probability obtained via direct simulation using a Wright-Fisher simulation (results not shown).

SOM 4: The frequency spectrum of hotspot alleles
Suppose within a region of DNA sequence, the rate of introduction of hotspot alleles is µ H and the rate of introduction of a mutation killing any such hotspot is µ D (note that µ D is the rate per hotspot, while µ H is the rate of hotspot introduction in the region). Then the expected number of hotspots with population frequency in a small frequency interval This equation is obtained by modifying the expected number of selected alleles with frequency x in a model with one way mutation [13], to reflect the hotspot parameters. Note that the above expression implies that exactly as for polymorphic sites there could potentially be a large number of hotspot alleles segregating at low frequency in the population. We assume that the effective population size is the census population size here in order to simplify our results (this assumption is common and in our case has little effect as it scales only the mutation rates).

SOM 5: Simulations to investigate demography
We sought to consider whether human population size changes might strongly influence the evolution of hotspots, and in particular if such events could result in more active hotspots reaching high frequency in the population. Human populations are likely to have experienced both recent expansion and bottleneck events. Population growth from some ancestral size will increase the efficacy of drive against hotspots relative to the ancestral size, and so result in fewer hot hotspots reaching fixation, so we did not consider this possibility in detail. ago, and reducing the population size 10-fold for 400 generations, before a return to the original population size, whilst the second has a milder bottleneck beginning at the same time but reducing the population size 2.5-fold for 1600 generations. Although neither of these models is likely to be perfect, we believe they do provide a sense of the impact of plausible population size changes, at least in the European group, on hotspot intensity distributions.
For each model, we set the mutation rate away from hotspots to be low, and 25 times the mutation rate towards hotspots (10 −8 ), as in Figure 3. (The effect of altering these mutation rates is chiefly to alter the absolute number of hotspots in the population, rather than the relative counts of hotspots as a function of heat.) For each of the four intensities and the two demographic models, we simulated 25,000,000 generations of evolution of the population, to ensure convergence to stationarity, 20,000 times and measured the proportion of simulations that a hotspot at frequency above 50% in the population was observed in the present day (in fact we increased the efficiency of simulation by reducing all times and population sizes by a factor of 10, while increasing the recombination and mutation parameters by factors of 10, to achieve the same population dynamics but at 100-fold reduced computational cost). The results, shown graphically in Figure 3, indicated at most a weak impact of these bottlenecks on the hotspot distribution relative to constant population size models, and in particular we never observed hotspots as intense as either NID1 or DNA3 in our simulations. We also performed similar simulations for a constant population size of 10,000 -the results (not shown) showed excellent agreement with the theoretical predictions plotted on Figure 3.

SOM 6: Upper bound on the number of hotspot alleles with apparent population heat r H
It is reasonable to assume that the rate of recombination inferred by an LD based approach for a segregating hotspot can be thought of crudely representing a time average of the mean heat of the hotspot across individuals in the population. Thus an intense hotspot that has always been at low frequency would be inferred to be a less intense hotspot by LD based methods. To capture in part this effect we construct an upper bound on the "apparent" intensity of a segregating very intense hotspot allele, and consider the spectrum of population intensities of hotspots using our upper bound.
Consider a hotspot allele (with recombination rate r H ) presently at frequency x in the population. Imagining that the hotspot allele was held at this frequency through time, it would appear to have a rate of recombination r H x in the population (i.e. the rate inferred by LD based approaches). However if this hotspot allele has a high rate of recombination and hence strong drive against it, in the past the allele will typcially have had frequency below x (its current frequency will be the highest frequency achieved, as for intense hotspots the frequency will decrease roughly deterministically backward in time). Thus r H x is normally an upper bound on the population estimated rate of an intense hotspot allele with current frequency x. Assuming this upper bound, each hotspot of heat r H /x which are currently at a frequency x in the population has an apparent heat r H . The density of hotspots with heat r H and at frequency x is: and we can transform to units of apparent heat r H using the relationship r H = r H x) to give density of hotspots of inferred heat r H and at frequency x: Finally, to find the total density of hotspots with an upper bounded heat of r H we integrate over the frequency of the hotspot allele where the lower limit on the integral, α ≥ 1/2N , is the minimum frequency a hotspot needs to achieve in practice to have apparent intensity r H (and in particular if any new hotspots have maximal intensity c, α ≥ r H /c). Thus for large hotspot intensities the number of hotspots estimated to have heat r H decreases at least exponentially.
SOM 7: Some alternative models of selection for recombination due to correct segregation It is straightforward to show that under more general models of selection for crossing over, an expression of the same form as equation (4) of the main text holds for the drive parameter, where we must merely substitute w for w to obtain the appropriate (modified) scaled drive parameter: For example, our model of selection for recombination may be more appropriate for female meiosis, where a far higher percentage of gametes with an incorrect number of chromosomes are allowed to proceed through meiosis [6] and thus perhaps lower reproductive success. In contrast, a relatively low recombination rate in male meiosis, although lowering the proportion of viable sperm, might have somewhat less of an effect on male fertility. In fact, the qualitative effect of such selection is similar whether recombination fraction in males affects fertility or not. In this case, where the recombination rate only influences fertility in females, where w is the probability of crossing over in a region in females.
Similarly, under an even more dramatic model of selection for recombination where fitness is proportional to the number of crossover events C, and if this number of events has finite expectation E(C) then w = (1 + 1/E(C)) −1 (corresponding to more favourable conditions for a hotspot allele compared to the previous cases, by Jensen's inequality).

SOM 8: Expected number of fixed hotspot alleles
A slightly more general form of the expected number of fixed hotspot alleles, equation (6) of the main text, can be derived. We assume that the hotspot allele increases the probability of a DSB initiation by r * H , and that any locally disrupting mutation reduces this probability of initiation by r H . In a heterozygote for the ancestral and hotspot allele the probability that the initiating allele is transmitted is p * . Similarly, in a heterozygote for the disrupting and hotspot allele the probability that the initiating allele is transmitted is p.
Hotspot alleles, as before, are introduced at rate 2N e µ H and reach fixation with the probability given by equation (4) modified to reflect that the drive 8N e r * H (1/2 − p * ) acts against the allele instead of for it. Note that for p * = 1/2 this probability converges to 1/2N e , since the allele experiences no drive and therefore acts as a neutral allele. The probability of a disrupting allele reaching fixation is given by equation (4). The expected number of currently fixed hotspots in the region is then simply the ratio of the rate at which hotspots arise and fix to the rate at which each given hotspot is lost by fixation of a disrupting allele. The distribution of the number of fixed hotspots is Poisson (through a simple application of queuing theory), with expectation In addition to mutations that locally disrupt the hotspot, a hotspot may also be disrupted by alleles at remote loci that do not benefit from the drive. However, unless the rate at which these arise is much greater than the rate at which locally disrupting alleles arise they are unlikely to play a strong role in the evolution of hotspots. They may however, play a part in the evolution of relatively cold hotspots.

SOM 9: The Coalescent process with biased transmission
We suppose that the DSB initiation process is additive, i.e. the rate of initiation in AA homozygotes is 2r A and that the rate of initiation in BB homozygotes is 2r B (the fully general case is easily treated in the same way as presented here). Table 1 shows the joint probability of the offspring allelic type (denoted by O) and the DSB events that could have occurred in the parents (given that the frequency of the A allele in the parental generation is x). Using these joint probabilities we can construct the conditional probabilities of various events.
Note that the first and second row of the table sum approximately x and 1 − x respectively (as the frequency of a particular allelic type, P (0), alters little between two generations).

SOM 10: Background-specific rate of recombination
In the normal ancestral process both backgrounds A and B recombine at the same rate.
Consideration must be given now to the probability that a DSB happens in the previous generation, on a haplotype with a particular allele, A or B; this will be influenced by not just the differing probabilities of initiation, but also by the very fact that this allele was transmitted through the recombination event. If the frequency of the A allele in the previous generation is x, then conditional on the offspring type in the current generation, O, being A, the probability of a DSB in the previous generation is and by a similar argument Note that although this formula strictly applies to the probability of DSBs on each background, it is directly applicable to the probability of crossover, if we instead let r A and r B be the crossover rates for the two alleles and p be the transmission probability of the initiating allele conditional on a crossover in a heterozygote. When the DSB process is completely biased against the initiating allele, i.e. p = 0, then from equations (12) and (13)  This at first seems counterintuitive, since forward in time more recombination occurs on the hotspot allelic background. However, the fact that the allele was transmitted acts in such a way as to counter this effect. The argument for equal rates can be phrased as follows for p = 0. The offspring of recombinants are always the non-initiating allele, and the a type of the non-initiating allele is simply drawn from the population frequency at random.
Following lineages back in time, we are constantly following non-initiating types through any recombination events encountered. Further, since at any such recombination these are drawn at random from the population, the type of a lineage gives no information about the recombination rate on that lineage.
In contrast, if p = 1 the initiating allele is always passed on, and in this case equations (12) and (13) and by a similar argument for an offspring of type B Once again, these results may be applied directly to crossing over by considering the parameters r A and r B as the probability of crossing over for the two alleles, and p as the probability of transmission of the initiating allele given a crossover in a heterozygote. Similarly it is applicable to gene conversion events unaccompanied by crossing over, with the corresponding probabilities. In the case of perfect biased transmission against the initiating allele, p = 0, then the type of the nontransmitted allele is chosen to be A with probability The region of intense gene conversion associated with the hotspot is small [8]. For simplicity we consider only markers outside this region, as there will be few markers within this region (other than the hotspot locus). We also do not model gene conversion outside of the hotspot. A full treatment would have to model the ancestral gene conversion fragments, for example as described in Wiuf and Hein [19]. We effectively assume that the hotspot has zero width, but finite heat, which allows us to consider gene conversion within the hotspot that does not result in crossing over as a process that merely switches the background of the ancestral haplotype outside of the hotspot. Conditional on a gene conversion event, we can ignore the parent contributing at most the small gene converted fragment, and need follow only the other parent back. For an offspring of type A with parental genotype AB, we follow the A chromosome with some probability f A , and otherwise the B chromosome. We assume both chromosomes produced by a gene conversion are equally likely to be passed down to the offspring, and consider the case where initiation takes place on an allele's own strand. The initiating allele is passed down with probability p nc which is half the probability that the gene conversion tract does not include the rate determining locus. This enables calculation of the required probability, and we obtain the following; we describe a method to simulate the ancestry of a sample, and hence sample variation data, conditional on a current (or at least recent) segregating hotspot allele, in order to enable future quantitative investigation of these questions. We also briefly consider the implications of the derived backward process.
To explore the effect of a segregating hotspot allele in detail, we can model the underlying genealogy by a modified coalescent process. The rate of DSB initiation and the choice of parental background depend on the frequency of the A and B alleles. As noted before the drive is exactly analogous to genic selection. This means that we need to consider simultaneously, backward in time, the ancestry of the sample and the frequency of the hotspot allele.
This can be modeled in a similar way to modeling the genealogy when selection acts at a single site, as described in Hudson and Kaplan [5] (see Nordborg [16] for an introduction).
Conditional on the frequency of the hotspot allele through time, the genealogical process can be modeled as a subdivided population where the frequency of each allele gives the population sizes.
Let crossing over result from a DSB with probability q. When a DSB is initiated, if crossing over does not occur the initiating allele is transmitted with probability p nc , while if crossing over does result, it is transmitted with probability p c . Combining the results of the previous three sections, the following algorithm describes how to simulate from the process.
• Simulate the frequency {X t } of the hotspot allele in the population backward in time from the present day, to the eventual loss of the A allele, approximating the conditional backward diffusion using a birth and death process as described in Griffiths [4] and implemented in Coop and Griffiths [1] (see SOM 12 for more details).
• Initially set t = 0, and sample the initial numbers n A of type A haplotypes and n B of type B haplotypes according to the appropriate ascertainment model for sampled sequences.
• At a time t into the past, with current frequency X t = x of the hotspot allele A in the population, some totals n A of type A haplotypes and n B of type B ancestral haplotypes still remain. Then events occur to the n A + n B ancestors of the sample at the following instantaneous rates: 5. Crossing over on one of the A or B ancestral haplotypes, rates n A ×eqn. (12) and n B ×eqn. (13) respectively; using qr A , qr B and p c in place of r A , r B and p.
• Allow the above process to continue until all sequences have reached a single ancestor (i.e. n A + n B = 1) For gene conversion and crossover events, one parental chromosomes will be the same allelic type as the offspring. The allelic type of the nontransmitted allele must also be chosen, using (1 − q)r A , (1 − q)r B and p nc for gene conversion (without crossing over) and qr A , qr B and p c for crossing over, from equation (14) or (15) if the ancestral haplotype is A or B respectively.
In addition, the material that the respective parents contribute to the offspring must also be chosen, and the probabilities associated with this choice are described in SOM 10.
As usual, the above algorithm can be adapted to save computational time by only following lineages containing DNA material ancestral to the sample, and at positions which have not yet reached a most recent common ancestor (we must also keep track of the type, A or B, at the hotspot allele on such lineages). We can also easily extend to allow crossing over outside of the hotspot region which is dealt with in the usual way, i.e. by drawing the allelic type of the parental chromosome from the current population frequency of the hotspot allele when the crossing over occurs [5].
which is a version of equation (3) modified to reflect the differing biases in the products of DSBs that are and are not accompanied by crossover. The approximating Moran model has a population size of N chromosomes; which for computational convenience is less than 2N e .
If there are currently j (0 < j < N ) chromosomes with the hotspot allele in the population, the Moran model has birth rate N (N x(1 − x) + µ(x))/2 (18) and death rate where x = j/N . Note that we do not have to condition on eventual loss back in time, since this is guaranteed by the one-way mutation towards the B allele.