Inference of recombination maps from a single pair of genomes and its application to archaic samples

Understanding the causes and consequences of recombination rate evolution is a fundamental goal in genetics that requires recombination maps from across the tree of life. Since statistical inference of recombination maps typically depends on large samples, reaching out studies to non-model organisms requires alternative tools. Here we extend the sequentially Markovian coalescent model to jointly infer demography and the variation in recombination along a pair of genomes. Using extensive simulations and sequence data from humans, fruit-flies and a fungal pathogen, we demonstrate that iSMC accurately infers recombination maps under a wide range of scenarios – remarkably, even from a single pair of unphased genomes. We exploit this possibility and reconstruct the recombination maps of archaic hominids. We report that the evolution of the recombination landscape follows the established phylogeny of Neandertals, Denisovans and modern human populations, as expected if the genomic distribution of crossovers in hominids is largely neutral.


Overview of iSMC
Besides its common interpretation as a backwards-in-time process, the coalescent with recombination 33,34 can also be modelled as unfolding spatially along chromosomes 35 . Starting from a genealogy at the first position of the alignment, the process moves along the chromosome sequence, adding recombination and coalescence events to the ensuing ancestral recombination graph (ARG) 36,37 (Figure 1a). Due to long-range correlations imposed by rare recombination events that happen outside the ancestry of the sample (in so-called trapped non-ancestral material 38 ), the genealogy after a recombination event cannot be entirely deduced from the genealogy before, rendering the process non-Markovian. The sequentially Markovian coalescent process (SMC) 39,40 ignores such recombination events, but captures most of the properties of the original coalescent 41 while being computationally tractable. It is the foundation of recent models for demographic inference 42 -44 and has been used to infer the broad-scale recombination map of the human-chimpanzee ancestor based on patterns of incomplete lineage sorting 45,46 . In the SMC, transition probabilities between genealogies are functions of ancestral coalescence rates and -of key relevance to this study -the population recombination rate (ρ = 4.Ne.r) 43,44 . Thus, heterogeneous recombination landscapes affect the SMC by modulating the frequency of genealogy transitions: genomic regions with higher recombination rate are expected to harbour relatively more genealogies than regions with smaller recombination rate (Figure 1). We leverage this information by extending the SMC to accommodate spatial heterogeneity in ρ (see Methods). In brief, our new model combines the discretised distribution of times to the most recent common ancestor (TMRCAs) of the pairwise SMC 43 with a discretised distribution of ρ to jointly model their variation along the genome. Since we model the transition between discretised ρ categories as a spatially Markovian process along the genome, combining the SMC with the Markov model of recombination variation leads to a Markov-modulated Markov model. pairs (Figure 1c). We name our new approach "integrative sequentially Markov coalescent (iSMC)", as it enables jointly capturing the effect of time and space in the Coalescent. This framework explicitly connects the genealogical process with the classical definition of LD as the non-random association of alleles at different loci 49 , which has been formulated in terms of covariances in coalescence times 50 . Henceforth, we restrict the use of the term LD to its "topological" interpretation 51 .
iSMC models spatial variation in the recombination rate using a single discrete distribution (Figure 1b), which can be accommodated to various models of recombination rate variation (see Methods). After fitting the alternative distributions to sequence data, Akaike's Information Criterium (AIC) 52 is employed as a mean of model selection. If AIC favours a spatially heterogeneous model over the null model where ρ is constant along the genome, iSMC then estimates a recombination landscape of single-nucleotide resolution by weighting the discretised values of the selected distribution with their local posterior probabilities. In the following section, we benchmark our model on different simulated scenarios. The correlations between simulated and inferred maps reported therein were computed after binning the inferred landscapes into windows of 50 kb, 200 kb, 500 kb and 1 Mb.

Simulation study
To assess iSMC's overall performance, we used SCRM 53 to simulate five recombination landscapes corresponding to different patterns of magnitude and frequency of change in ρ and a "null" scenario with constant recombination rate along the genome (see Methods). For each landscape, we simulated 10 ARGs, each describing the ancestry of 2 haploid chromosomes. We tested two discretisation schemes for the joint distribution of TMRCAs and recombination rates: the first with 40 time intervals, five ρ categories; the second with 20 time intervals, 10 ρ categories, leading to a total of 200 hidden states in both configurations (Supplemental Note).
Model selection based on AIC favours the correct model in 45 of the 50 datasets (Table S1), with the five exceptions belonging to the scenario where changes are frequent and of small magnitude. In this regime, transitions to regions of slightly different recombination rates do not significantly skew the distribution of genealogies, and the short length of blocks with constant ρ leaves little signal in the data. Nevertheless, correlations between simulated and inferred maps are highly significant (ranging from 0.385 to 0.563 in the five identifiable replicates with frequent changes of low magnitude, and from 0.682 to 0.928 in the others, Tables S2, S3).
Overall, the results are consistent between replicates and robust to the choice of discretisation (Figure 2b). We use the configuration with 40 time intervals and five ρ categories, as it implements a finer discretisation of time that is more adequate to capture the effect of ancestral demography. In the following, as we introduce new simulated scenarios, we focus on the recombination landscape with frequent changes of large magnitude.
Demographic history. The random sampling of haplotypes during population bottlenecks and expansions affects LD between SNPs, thus creating spurious signals of variation in ρ 54 -56 . To test whether iSMC could capture the effect of demography on the inference of recombination maps, we simulated a heterogeneous recombination landscape coupled with either a recent 20-fold increase, or ancient 20-fold decrease in population size. We then fit our model twice for each scenario: first, erroneously assuming a flat demographic history; second, allowing iSMC to infer piecewise constant coalescence rates in order to accommodate population size changes. The correlation coefficients are all highly significant (Figure 3), showing that the inferred recombination landscape is relatively robust to misspecification of the demographic scenario, but are systematically higher when demography is jointly inferred (Figure 2c-d, Table S4, S5). The difference is stronger at the fine scale, where, in the presence of complex demography the distribution of genealogies can get locally confined to a time period, and ignorance about differential coalescence rates reflects poorly on local ρ estimates. We conclude that the joint-inference approach of iSMC can disentangle the signal that variable recombination and fluctuating population sizes leave on the distribution of SNPs. Variation in mutation rate. The rate of de novo mutations varies along the genome of many species. For example, CpG di-nucleotides experience an increase in mutation rate (μ) as a result of methylation followed by deamination into thymine, whereas the efficiency of the molecular repair machinery is negatively correlated with the distance from the DNA replication origin, causing μ to vary accordingly 59 . Such heterogeneity could bias iSMC's estimates because the transition into a region of higher μ mimics the transition to a genealogy with a more ancient common ancestor, since in both cases the outcome is locally increased genetic diversity. To assess the impact of variation of mutation rate on the estimation of recombination rate, we simulated two scenarios of variation of θ = 4.Ne.μ along the genome, corresponding to low and high frequency of change, relative to the frequency of change in the recombination rate. We report that transitions to different mutation rates along the genome globally do not introduce substantial biases in our estimates (correlations range from 0.655 to 0.844, Figure 2f, Table S8,   S9). There is, however, an effect of the local ratio θ / ρ at the fine-scale: regions with a ratio < 1 tend to have their local ρ underestimated because the relatively small number of mutations render some recombination events undetectable ( Figure S1a). Importantly, this effect is a consequence of the magnitude of μ, and is not expected as a result of the local reduction in diversity caused by selection, since in the latter case there is a linear scaling of both θ and ρ (due to a decrease in the local effective population size, Ne), keeping their ratio constant. Furthermore, the bias is reduced when site-specific recombination estimates are binned into larger windows ( Figure S1b).

Application to a fungal pathogen, fruit-flies and humans
We benchmarked iSMC on model organisms with contrasting genomic architectures and evolutionary histories. The leaf blotch Zymoseptoria tritici is a highly polymorphic fungal pathogen with a compact genome (40 Mb) that is under widespread selection 60 and exhibits an extremely rugged recombination landscape 31,61 . In this species, AIC favours a heterogeneous model with the presence of recombination hotspots in all three pairs of genomes analysed ( Table   1, see Methods). Under this model, correlations between iSMC maps and a previously published genetic map 61 are highly significant (Table S10)

Application to archaic samples
At the fine scale, the location of cross-over events in great apes is strongly influenced by the sequence of the PRDM9 gene 6,10,63 . Recombination hotspots determined by PRDM9 tend to erode over time, being replaced somewhere else in the genome with the rise of new PRDM9 alleles.
Such turn-over of hotspot locations leads to rapid evolution of the recombination landscape 64,65 , therefore, if the location of hotspots is selectively neutral, recombination maps should become more dissimilar with increasing divergence between populations and species. This hypothesis has been corroborated by two lines of evidence. First, comparisons between recombination maps of extant great ape species show no overlap of hotspots at the fine scale, but correlations increase with window size. While this observation suggests that molecular players other than PRDM9 shape the landscape at the large scale, such positive relationship could be an artefact of reduced variance when recombination estimates are averaged in larger windows (e.g., correlations increase with window size in our simulations, despite recombination landscape being static over time, Table S1-2). Second, in silico prediction of PRDM9 binding sites in the Denisovan Europeans form a distinct cluster; the 45,000 year-old Ust'Ishim is a long branch, sister to this clade, depicting similarities in the recombination landscape that have been frozen by his demise soon after the out-of-Africa migration; and all modern humans are highly diverged from the monophyletic Neandertal-Denisovan group (Figure 4a). The tree has high bootstrap support for its branches and is corroborated by principal component analysis (Figure 4b). Overall, the topology is consistent at larger scales ( Figure S2), with the notable exception of Ust'Ishim forming a highly supported cluster with the Altai Neandertal. Interestingly, introgression from Neandertals is estimated to have occurred into the population ancestral to this ancient Homo sapiens, 232-430 generations before his lifetime. His DNA carries segments of Neandertal ancestry that are longer and more abundant than those in contemporary modern humans 71 . It is thus tempting to interpret this result as a consequence of the hybridisation event. Besides similar LD in the longer introgressed regions, a non-exclusive possibility is that the recombination landscape of Ust'Ishim was influenced by trans-acting alleles inherited from Neandertals, presumably acting at a larger scale than PRDM9. This could explain the discrepancy between recombination-based trees at the large versus fine scale (where rapid evolution of PRDM9 would quickly erode the signal of introgression), although further investigation is needed to test this hypothesis -ideally, as the number of high-quality archaic genomes continues to increase in the next years. Concretely, these results show that 1) iSMC can be used to extract information from archaic genomes; 2) hotspot turn-over driven by PRDM9 not only is responsible for short timescale differences in the location of cross-over events, but the accumulation of such differences is

DISCUSSION:
Our analyses have shown that iSMC is able to infer accurate recombination maps from single pairs of genomes. Nevertheless, the correlations between statistical and experimental maps for the three species are consistently lower than the correlations obtained during simulations. While this difference can be partly explained by technical noise (e.g., sequencing or SNP calling errors), there are alternative -and conceptually engaging -explanations for it. First, the different types of data used by experimental and statistical methods imply that they measure different facets of recombination. While experimental maps reflect the average cross-over rate (r) between markers (i.e., a snapshot of the landscape at present-day generation), the population rate ρ (4.Ne.r) estimated by iSMC depends on ancestral population sizes as well as on the variation of Ne along the genome, and reflects the time-average of the historical recombination rate r. Since regions of high r tend to erode quickly 64,65 , the evolution of the recombination landscape over time implies that ρ may be quite different from r (and more meaningful than r in the context of population genomics 57 ). Second, biological processes that affect Ne locally (therefore distorting the distribution of genealogies) but are unaccounted for by our model will affect LD without reflecting the recombination rate. Among these, introgression and selective sweeps (through the rapid increase of the frequency of advantageous haplotypes) can introduce a substantial bias if prevalent along the genome. In a similar vein, background selection distorts the distribution of genealogies towards lower TMRCAs 72 , and should be common near regions under purifying selection such as coding regions.
In order to study processes such as population structure 62  genomics generates high quality whole-genome sequences, with a small number of individuals representing each of several sub-populations 73 -76 . Due to the importance of linkage in ancestral inference, recombination maps are key to analysing these data -and because of its power with restricted sample sizes, iSMC is well suited for the task. We have demonstrated its accuracy in species with contrasting levels of diversity, demographic histories and selective pressures, and posit that it will be useful for investigation in other species. Not only will such maps aid the interpretation of diversity in non-model organisms, but a picture of the recombination landscape in different groups will tell us about the nature of recombination itself 77 . Open questions include whether the recombination landscape is associated with large-scale genome architecture and how variation in the recombination landscape relates to life history and ecological traits. Finally, as ancient DNA samples become more common (including species other than humans 78 ), it will be possible to obtain maps from extinct taxa, granting the opportunity to study the evolution of the recombination landscape with unprecedented resolution 45,79 .

METHODS:
The Markov-modulated Hidden Markov Model framework SMC models discretise a distribution of coalescence times into t intervals to implement a discrete space Hidden Markov Model (HMM) with t x t transition matrix: where G ij (the transition probabilities between genealogies i and j) is a function of ancestral coalescence rates and the global parameter ρ, which is assumed to be constant along the genome 40,44 . The key innovation in iSMC is to relieve this assumption by letting ρ vary along the states that the probability distribution of R at position x + 1 only depends on the the distribution at position x. We consider the case where the transition probability between any two R categories (P ij ) is identical and equivalent to one auto-correlation parameter (δ). The transition matrix of this Markovian process is simply: Because ρ is a parameter of the SMC, variation in the recombination rate affects the transition probabilities between genealogies (Q SMC ). Since spatial variation in ρ is modeled as a Markovian process, the combined process is said to be Markov-modulated by ρ, leading to a Markovmodulated HMM. If t is the number of discrete genealogies of the SMC, and k is the number of discretised ρ categories, then the Markov-modulated HMM is a HMM with n = t x k hidden states (Figure 1). We further disallow transitions between hidden states where both the genealogies and ρ values change simultaneously. In this case, the transition matrix of the Markov-modulated process, Q iSMC , is given by: diag (Q p )⊗Q SMC +(Q p −diag (Q p ))⊗I t where ⊗ denotes the Kronecker product and I t the identity matrix of dimension t. Since many of its elements are reduced to zero, the matrix can be transversed efficiently: In other words, Q iSMC is a composition of k 2 square sub-matrices of dimension t. The main diagonal sub-matrices are Q SMC assembled using the corresponding ρ value for category k (and scaled by 1 -δ), while the off-diagonal sub-matrices are identity matrices which have δ as main diagonal elements.

Modelling spatial variation in recombination rates
iSMC implements three models of spatial variation in the recombination rate. We first use a Gamma probability density function with a single parameter (α = beta), which constrains it to have a mean equal to 1.0. After discretisation into k categories of equal density, the mean value inside each category is drawn to scale the genome-wide average ρ during integration over all recombination rates in the forward recursion (equation 1). In our simulation study, since we used a continuous Gamma distribution to draw values of the recombination landscape, we used this model to infer recombination maps. In the second model, we extend the gamma distribution by adding a category that represents the intensity of the recombination rate in sharp hotspots (parameter H). Since hotspots are narrow relative to the extension of the background recombination rate, we use extra parameters to accommodate this effect. As before, δ is the transition probability between gamma categories, and we introduce u as the transition probability where we integrate over all k discretized values of ρ and over all t TMRCA intervals. The transition between genealogies ( G t →G u ) is a function of both the focal recombination rate (ρ k ) and the ancestral coalescence rates, and G u → S x represents the emission probability from at position x, where n is the product of t TMRCA intervals and k ρ categories, iSMC obtains the k-dimensional vector k Z x by summing, for all k, the posterior probabilities of every hidden state n matching ρ category k. Finally, iSMC uses k Z x to scale the corresponding values of the discretized distribution of ρ, such that the posterior average ρ for position x is (ρ x )= ∑ k Z x k ⋅ρ k .

Modelling complex demographic histories
The original HMM implementation of the SMC 43 uses the expectation-maximisation algorithm to optimise transition probabilities, where the actual targets of inference -the coalescence rates at each time interval -are hyper-parameters of the model. Here we use cubic spline interpolation 42 to map coalescence rates at time boundaries, which are then assumed to be piecewise constant for the duration of each interval. Because we use three internal splines knots (i.e., the demographic history is divided into four epochs wherein a cubic curve is fitted), the number of parameters is substantially reduced in our model -in particular when a fine discretisation of TMRCA is employed.

Simulation study
Four scenarios of spatial variation in ρ. We simulated a piecewise constant recombination rate along the genome by drawing values from a gamma distribution with parameters α and β, and segment lengths from a geometric distribution with mean length g. We considered four possible scenarios where α = β = 0.5 or 5.0, and g = 100 kb or 1 Mb. For each of the four combinations, [github repository will be made public upon publication]. R scripts used both for simulations and data analyses are available in Supplemental Data 1.

Data analysis
Model selection followed by inference of recombination maps in the three species studied ( Table   1)         Magenta dashed line represents y = x. A, both ρ and θ are binned into 50 kb windows. Solid lines represent loess smoothing curves fitted to each class of θ / ρ, with 95% confidence intervals aroung them (shaded). B, both ρ and θ were binned into 500 kb windows.