Tracking human population structure through time from whole genome sequences

The genetic diversity of humans, like many species, has been shaped by a complex pattern of population separations followed by isolation and subsequent admixture. This pattern, reaching at least as far back as the appearance of our species in the paleontological record, has left its traces in our genomes. Reconstructing a population’s history from these traces is a challenging problem. Here we present a novel approach based on the Multiple Sequentially Markovian Coalescent (MSMC) to analyze the separation history between populations. Our approach, called MSMC-IM, uses an improved implementation of the MSMC (MSMC2) to estimate coalescence rates within and across pairs of populations, and then fits a continuous Isolation-Migration model to these rates to obtain a time-dependent estimate of gene flow. We show, using simulations, that our method can identify complex demographic scenarios involving post-split admixture or archaic introgression. We apply MSMC-IM to whole genome sequences from 15 worldwide populations, tracking the process of human genetic diversification. We detect traces of extremely deep ancestry between some African populations, with around 1% of ancestry dating to divergences older than a million years ago.


PSMC
Here we briefly rederive the central equations of the PSMC [3]. In the following, we denote the rate of coalescence by λ(t) = (2N (t)) −1 . The transition probability is derived from the SMC model by McVean and Cardin [6]. We consider a given recombination event, which takes place at time u < s in either of the two branches m = {1, 2}. This recombination event causes a "floating" branch which coalesces back onto the other branch at time t. The probability for this is given by the probability that no coalescence occurred between u and t times the probability that it coalesces exactly at time t: where the Heavyside-function is defined as and reflects the fact that the transition probability to switch to time t is zero if u > t. We show in the Appendix that this conditional probability is properly normalized, i.e. that ∞ 0 q(t|s, u, m) dt = 1 for all given s, u and m. We need to integrate out the two unknown variables u and m, both with uniform probability. The probability that no recombination occurred in either of the two branches of length s is exp(−2rs). Together this yields:

Including Self-coalescence: PSMC'
Marjoram and Wall [5] realized that there was one particular feature missing from the original SMC formulation. An important rationale behind equation 1 is that the recombining "floating" branch will definitely coalesce with the other of the two branches, therefore definitely changing the tMRCA to the new value s. However, it is of course possible, that the floating branch will simply coalesce back onto its own branch, therefore resulting in a recombination event that does not change the tMRCA.
In order to extend the model to include this self-coalescence, we again consider the probability that the time switches from s to t, given some recombination time u. We can distinguish two cases: for t > s, the transition probability is given by the probability that no coalescence occurred with either of the two branches < t and no coalescence to the single branch between t and s. For s < t, the transition probability is given by the probability of coalescing to the other branch, rather than to the self-branch. Finally, we have a third class of recombination events which result in t = s, namely if the floating branch coalesces back onto its own branch before s.
The conditional probability then reads Again, we show in the Appendix, that this conditional probability is normalized. The full transition probability then reads The equilibrium probability is with the integral For later purposes, we introduce some more functions. We rewrite the transition matrix

Discrete time intervals
We divide time into a set of n T intervals that span the entire space from 0 to ∞. In practice, as interval boundaries we use the same boundaries as chosen by PSMC [3], defined as: Here, α and T max are constants that in the case of PSMC were chosen to be α = 0.1 and t max = 15. Note that by construction we have T 0 = 0 and T N T = ∞.
This patterning sets of with time patterns approximately linearly distributed through time, and then crosses over to a patterning that is uniformly distributed in log-space. This ensures higher resolution in recent than in ancient times.
For MSMC2, we would like to increase resolution in recent times depending on the number of individuals, i.e. haplotypes we use. For example, with four haplotypes, in recent times we have approximately 6 times more recent coalescent events to analyse compared to just two haplotypes . This should therefore allow us to increase resolution in recent times by 6 fold. We generally set the parameter α in equation 13 to be where n pairs is the number of total haplotype pairs analysed. With phased data, and n hap haplotypes from the same population, we have but this can be different if multiple populations or unphased data is anlaysed. For example, if we have four diploid individuals in total, separated evenly into two populations, then we consider all pairs of haplotypes across the two populations, so we have n pairs = 16. If eight diploid individuals from the same population are analysed, and no phasing is available, then we have n pairs = 8. In the MSMC2-implementation, this behaviour can be controlled with the --pairIndices flag (see https://github.com/stschiff/msmc2). The scaling of α, according to equation 14, is then set automatically by the number of specified pairs.

Piecewise constant Population sizes
We then define piecewise constant population sizes which correspond to piecewise constant coalescence rates: We now can compute the integral L (t 1 ; t 2 ). Let the next lower time boundary from t 1 be β, and the next lower time boundary from t 2 be α. We also define ∆ α = T α+1 − T α : In the following, we denote the next lower index of a given time in the function parameters, with q 0 (t; α) meaning that T α < t < T α+1 : For the off-diagonal integrals we first get for t < s: For the case t > s, things depend on the interval in which s lies, denoted by β:

Integrating over time intervals
For each time interval we now have to integrate t through [T a ; T a+1 ]. First the equilibrium probability: Next, we compute the expected time in interval β: This expression for t β has a numerical instability for λ β 10 −3 . We set the following asymptotic values: We can now write down equations for the off-diagonal elements of the transition matrix, i.e. elements with α = β. First the case α < β: where we have used Analogously we have: where we have used The complete discrete transition matrix now reads: due to the column normalization of the transition matrix.

Emission Probability
An observation at location i in the genome for a pair of haplotypes (as in a single diploid genome), O i , can be either of O i = {0, 1, 2}, where 0 denotes missing data in either of the two haplotypes, 1 denotes a site where both haplotypes have the same allele (i.e. a homozygous genotype in case of a single diploid genome), 1 denotes a mismatch between the alleles of the two haplotypes (i.e. a heterozygote genotype in case of a single diploid genome), The emission probabilities for exact coalescence times are: For discrete time intervals, we need to integrate over the conditional probability distribution in each time interval: and of course we have as before: There are special forms of these expressions for two cases. First, if T α+1 = ∞, then we have ∆ α = ∞, and so the expression becomes Second, there is again a numerical instability for λ α 10 −3 , in which case the expression becomes

MSMC2 Hidden Markov Model
We can now define a Hidden Markov Model (see [1] for background reading), based on PSMC' using the above defined transition and emission probabilities. For a given sequence of length L, we define the observations as O 1 . . . O L . We define a forward variable f 1 (α) . . . f L (α) by the recursion relation: In practice, we can speed these algorithms up substantially by precomputing powers of emission-transition matrices in order to quickly skip over long regions with missing or homozygous data. This is described in [7].
We now recursively run these two variables over all chromosomes and all pairs of haplotypes. This makes it different from MSMC, which consisted of one HMM across all haplotypes simultaneously. Here we run separately over all combinations of pairs. So for example, with two diploid phased human genomes from a single population, we would run the forward-backward algorithm independently (and possibly in parallel) over 132 chromosomal pairs of haplotypes: In order to estimate parameters of our HMM (i.e. the piecewise constant coalescence rates λ α and the recombination rate r), we use the Baum-Welch algorithm, similarly to MSMC.
We first define an objective function and where O n denotes the entire collection of observed data across all chromosomes and analysed haplotype pairs from all individuals, θ denotes the set of parameters used in this iteration of the algorithm,θ denotes free parameters to be varied in the maximization step of the algorithm (see below). The first term in equation 43 sums up the evidence from the observed transitions along the data, and the second sums up the evidence from the observed emissions. Both evidence matrices depend on the data and on the current set of parameters θ. Matrix Ξ is a square-matrix with as many rows and columns as there are hidden states. Matrix Γ has as many rows as there are different symbols in the alphabet (here 3), and as many columns as there are hidden states.
The fact that all haplotypes pairs from all analysed individuals and chromosomes are summed up into one objective function corresponds to a composite-likelihood across all individuals. We essentially ignore correlations of hidden states across different pairs of haplotypes, which affects the likelihood itself, but turns out in practice to yield unbiased parameter estimates.
The sum runs in principle over all sites. In practice, we sparsen this sum by selecting an equally spaced set of sites. By default, the distance between each counted site is 1000, but this can be controlled via the parameter --hmmStrideWidth.
The maximization step of the Baum-Welch algorithm then re-estimates the parameters by maximizing the objective function: The Baum-Welch algorithm consists of iterations of i) the forward-backward algorithm to compute the objective function, and ii) a maximization step to estimate new parameters. In the next iteration, the forward-backward algorithm is then run with the new parameters, and so forth.
After about 20 iterations, we find that the likelihood plateaus for most MSMC runs.
Note that due to the sparsening using --hmmStrideWidth as explained above, it can principle happen that the likelihood does not anymore strictly increase from iteration to iteration. If that is observed, we recommend to decrease the stride width. But in practice we never observe this within 20 iterations.

Combining within-and cross-coalescence rates estimates
While MSMC can estimate three coalescence rate functions simultaneously when run over genomes from two populations, MSMC2 runs over pairs of populations separately. Each run then uses a slightly different time scaling (due to different heterozygosity, i.e. allele mismatch, estimates within and across populations). For MSMC-IM, we however need three estimates of coalescence rates defined along the same time intervals.
We supply a simple python script, called combineCrossCoal.py, which reads in three result files from MSMC2, each from one pair of populations, and uses interpolation of the resulting piecewise constant coalescence rate estimates to merge these datasets. Details about this can be found in the accompanying README of the msmc-tools repository on github.com/stschiff/msmc-tools.

Appendix: Normalizations
In the following derivations, we define L(t) to be an antiderivative of λ(t), i.e. L (t) = λ(t). We will also make use of the substitution rule b a g (x)f (g(x))dx = g(b) g(a) f (z) dz. (47)

PSMC conditional transition probability
We have We need to show that the PSMC conditional probability is normalized:

PSMC' conditional transition probability
We have We again need to compute the integral ∞ 0 q(t|s, u, m) dt. We divide the integral into three parts: 2 MSMC-IM model

Continuous IM model
Our model is based on Hobolth et al. 2011 [2], which demonstrates that the time to the most recent common ancestor (tMRCA) of two lineages sampled from a pair of populations can be exactly computed from a matrix exponential. Hobolth et al. 2011 [2] formulate the IM model as a continuous time Markov chain.
Here we build on that work and define a two-island model by time-dependent population sizes N 1 (t) and N 2 (t) and a time-dependent continuous symmetric migration rate m(t) between the two populations, discarding the clean split concept in Hobolth et al. but describe the population separation as a continuous process. The state of the two lineages composes a series of states in a Markov chain. At time t = 0 (the present-day generation), the state of two randomly sampled uncoalesced lineages starts from either of the following three states S 11 , S 12 , S 22 , and at any later time end up in any of five states S 11 , S 12 , S 22 , S 1 or S 2 .
We describe this evolution of the state space via a probability vector x n (t) denoting the state probability to be in state n at time t, with time counting backwards in time. We summarise that vector in bold font as x(t).
We summarise the transition rate between states by a matrix Q(t), where rows indicating the state at some time t, and columns the state one generation later. Then the matrix Q can be expressed in terms of a symmetric migration rate and effective population sizes (very similar to [2]): where N 1 (t), N 2 (t) and m(t) are all time-dependent. Diagonal elements are set such that rows sum up to zero. The state probability vector in the next generation is then the product of x(t) and Q: where 1 is a diagonal unit matrix. For n generations, we get We now switch to continuous time, and note that for a small time interval ∆t we can write: Longer time segments t can then be divided into n small time intervals, and we assume Q is constant in each interval and independent from matrices in other intervals.
In the limit of n → ∞, the equation above becomes a matrix exponential: When t 0 = 0, we then have: We can use this general state propagation equation to compute the conditional probability of ending up in a specific final state s f after time t given a specific starting state s 0 . For example, the probability to end in state s f = S 11 when starting in state s 0 = S 12 would be: where we have followed the convention introduced above that the order of states in vector notation is S 11 , S 12 , S 22 , S 1 , S 2 .
We can now use this to write down the probability of a coalescence event of the two lineages at time t, starting in on of the starting states s 0 ∈ {S 11 , S 12 , S 22 }: P IM (t|s 0 , N 1 , N 2 , m) = G(S 11 , t|s 0 ) · 1/2N 1 + G(S 22 , t|s 0 ) · 1/2N 2 (59) because in order for a coalescence event to occur exactly at time t, we require that i) no coalescence has occurred before (so we exclude final states S 1 and S 2 ), ii) both lineages are in the same population (so we exclude S 12 ).

Comparing with MSMC outputs
MSMC (here as a term used independently from a specific implementation like MSMC or MSMC2) estimates time-dependent effective coalescent rates λ ij between a pair of lineages i and j. From these rates, we can compute the probability density for coalescence events: The basic idea behind MSMC-IM is to fit the model from equation 59 to the observed distribution from equation 60 to estimate parameters N 1 (t), N 2 (t) and m(t).

Model Fitting
So far we haven't specified the form of the time-dependent parameters N 1 (t), N 2 (t) and m(t). Since MSMC uses piecewise constant functions for the coalescence rates, we decided to use exactly the same method in MSMC-IM, and impose a piece-wise constant structure on our model parameters with the same time patterning as in MSMC.
We denote the time boundaries by t i , with i = 0 . . . n T , where n T is the number of time segments, and t 0 = 0 is the left-most time-boundary, and t n T = ∞ is the rightmost time segment. Note that in practice we set t n T = 4t n T −1 . We can then define the following χ 2 -statistic across all time-segments to measure the fit deviation between the coalescent distributions from MSMC and the IM model: For brevity we omit the dependency on model parameters N 1 (t), N 2 (t) and m(t) here. Minimization of this χ 2 -statistic is numerically implemented via Powell's method (using the function minimize(method='Powell') from the scipy-package in python (www.scipy.org)).

Regularisation
We need to estimate N 1 , N 2 and m for each time interval, which for the default MSMC time patterning means 96 parameters in total. It turns out that this model is overspecified for times at which the two populations have almost completely merged (as for example reflected by M (t) approaching 1, see main text). To avoid over-fitting, we add two regularisation terms to the above χ 2statistic: The regularization terms β 1 and β 2 are tunable, and in practice we set β 1 to 1e-8 and β 2 to 1e-6 by default. This β 1 value was chosen to be low enough to not affect migration rate estimates but avoid over-estimation, and the β 2 value was chosen to be low enough to not affect population size estimates at time substantially before the split time, but strong enough to "pull together" the two population sizes for times very deep in the past, where all lineages have effectively merged into one population.

Hazard function for estimating coalescence rates from IM model
While the primary variable to use for comparison between model and data is the probability density function of pairwise coalescence times (eqs. 60 and 59), we can also compute the Hazard function from the model, to be directly compared to the pairwise coalescence rates output by MSMC: as following equation: This expression becomes numerically unstable for very ancient times, for which the denominator becomes too small.

Internal auto-Correction and parameter constraints
In some cases, MSMC coalescence rate estimates in the most ancient few time intervals are noisy, which can affect migration rate estimates in these windows and lead to artifacts. We therefore implemented an automatic check of the rate estimates in the most ancient time intervals before fitting with MSMC-IM, and auto-correct these values. Specifically, we check in all time segments that correspond to the last two free parameters (with the default patterning of 1*2+25*1+1*2+1*3, as in MSMC2, the last five time intervals would be checked). In these intervals, since we do not genuinely expect estimates to fluctuate much at this end of the analysis time window, we require estimates to fall within a range of [a/1.5, a × 1.5], where a is the value of the third-last free parameter in MSMC, so the time segment just before the segments that are checked. If this condition is not fulfilled, we correct the estimates in the checked time intervals to a. This autocorrection is independently performed for each pair of haplotypes analysed (so for example we independently check λ 11 , λ 12 and λ 22 independently).
We also constrain parameters N 1 (t) and N 2 (t) to be below 10 7 and migration rates to be below 100, to avoid overflow issues during the fit. Furthermore, in MSMC-IM's automatic output report, we do not report estimated migration rates for times more ancient than after M (t) has reached 0.999, because of the very little data that is left to infer migration rates when all but 0.1% of lineages have effectively already merged in one ancestral population.

Interpreting Population size estimates
In MSMC-IM, we have two populations that never merge into one ancestral population. Instead, continuous migration is used to model movement of lineages across population boundaries, and hence also coalescence events between lineages sampled across populations.
The degree to which lineages get mixed, looking back in time, can be quantified by the cumulative migration density, as defined in Methods as In recent times, where M (t) 1, population sizes parameters N 1 (t) and N 2 (t) correspond closely to the inverse coalescence rates 1/λ 11 (t) and 1/λ 11 (t) estimated by MSMC. However, as M (t) approaches 1, the interpretation of these parameters differs from what one would normally call an "ancestral population size" in a clean-split model: In our model, we maintain two separate populations, so that with probability 1/2, two lineages will be in separate populations and cannot coalesce. Therefore, the effective coalescence rates in MSMC-IM for times at which M (t) → 1, is half the rate expected for an ancestral population with size N 1 (t) or N 2 (t).
Therefore, for M (t) → 1, a meaningful estimate for the effective "ancestral" population size would be 2N 1 (t) ≈ 2N 2 (t). We therefore found it useful to report "corrected" population size estimates defined as