Genome-Wide Inference of Ancestral Recombination Graphs

The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the “ancestral recombination graph” (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n − 1 chromosomes, an operation we call “threading.” Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps.


Calculation of transition probabilities
The general formula for the transition probabilities of the HMM (equation 20) can be simplified, and its evaluation made more efficient, by recognizing several distinct scenarios for the joint configuration of the previous local tree T^n_{i−1}, the current local tree T^n_i, and the recombination R^n_i. We will consider two main cases, corresponding to the presence (R^{n−1}_i ≠ ∅) and absence (R^{n−1}_i = ∅) of "old" (previously sampled) recombinations, respectively (see Figure S1-1). In addition, we will consider three subcases of each of these main cases. Throughout this section, we will use the notation y_{i−1} = (x_{i−1}, t_{i−1}) and y_i = (x_i, t_i) to indicate the previous and current coalescence points for the resampled branch, respectively, with each x indicating a branch and each t a time point. We will assume y_{i−1} is indexed by l and y_i by m, and that their time points are indexed by a and b, respectively (i.e., t_{i−1} = s_a, t_i = s_b). In addition, z_i = (w_i, u_i) will denote a new recombination between positions i − 1 and i, with w_i indicating a branch and u_i a time point in T^n_{i−1}. We will use v to indicate the new branch that is being threaded into the ARG. We will assume the single-sequence threading operation, so v must be an external branch, but the subtree threading setting is very similar (as discussed in later sections). We will also use the notation S(u) to indicate the index of the time point associated with a node u in a local tree.
Figure S1-1 illustrates the possible scenarios. If a new recombination is added to the new branch v (blue), that branch may re-coalesce anywhere else in the local tree. Alternatively, the new recombination can be placed on the branch associated with the previous thread state, x_{i−1}, in which case the recoalescence must occur on the new branch v (B) or on branch x_{i−1} (C). When an old recombination is present (R^{n−1}_i ≠ ∅), the associated SPR operation often does not affect the thread location. However, it can cause the thread state to change its branch (D) or its time (E) in a deterministic manner. (F) If the recoalescence point is the same as the thread point, the thread state can change to the recombination-bearing branch w_i and any time within the interval between the recombination and recoalescence times, t_i ∈ [u_i, t]. The thread state can also change to the branch above the recoalescence point and keep the same time (not shown).

Major Case #1: No Old Recombinations
1. Recoalescence to a different branch: If the new coalescence point y_i falls on a different branch than y_{i−1}, a new recombination must have occurred on the new branch v, with a time index k between 0 and min(a, b). Thus,

P(y_i | y_{i−1}, T^n_{i−1}, R^{n−1}_i = ∅) = Σ_{k=0}^{min(a,b)} P(z_i = (v, s_k) | y_{i−1}, T^n_{i−1}) P(y_i | z_i = (v, s_k), y_{i−1}, T^n_{i−1}),  (S1)

where, for simplicity, we drop the explicit conditioning on the model parameters ρ and N. Notice that this sum has no more than K + 1 terms.
As in the general case, the first term in equation S1 is given by equation 5 and the second term by equation 13 from the main text. However, these equations simplify in this case. Because we have assumed that R^{n−1}_i = ∅, and because we can assume that z_i = (v, s_k) represents a valid recombination, equation 5 reduces to a simpler form (equation S2), in which A_k is given by equation S3 and all other terms are as defined for equation 5. Similarly, equation 13 simplifies in this case to equation S4, where P(s_b | v, s_k, T^n_{i−1}) is given by equations 10-12.
2. Recoalescence to same branch at different times: In this case, the recombination may have occurred either on the new branch v or on the recoalescence branch x_i. If the recombination occurred on branch v, then, as above, its time index can range between 0 and min(a, b) (the indices of t_{i−1} and t_i, respectively). If it occurred on branch x_i, then its time index can range between the time index at which x_i came into existence, which is given by S(x_i), and min(a, b). The transition probability is therefore given by a pair of such sums over recombination time points. As in case (1), the first term in each of these sums is given by equation S2 and the second term by equation S4. Each of these sums again has no more than K + 1 terms.
3. Recoalescence to same branch at same time: This case is similar to the previous one, except that it must also allow for the possibility of no recombination between positions i − 1 and i (z_i = ∅).
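The bounded sums in these subcases can be sketched as follows. Here `p_recomb` and `p_recoal` are hypothetical stand-ins for the per-term factors given by equations S2 and S4 (the real values depend on ρ, N, and the local tree), and `S_x` stands for S(x_i), the time index at which branch x_i arose.

```python
# Hedged sketch of the bounded sums over recombination time indices in
# Major Case #1, subcase 2. p_recomb and p_recoal are hypothetical
# stand-ins for the factors given by equations S2 and S4.

def transition_same_branch_diff_time(a, b, S_x, p_recomb, p_recoal):
    """Subcase 2: recoalescence to the same branch at a different time.

    a, b  -- time indices of t_{i-1} and t_i (so t_{i-1} = s_a, t_i = s_b)
    S_x   -- time index at which branch x_i came into existence
    """
    m = min(a, b)
    # Recombination on the new branch v: time index k in 0..min(a, b).
    total = sum(p_recomb('v', k) * p_recoal(k) for k in range(0, m + 1))
    # Recombination on branch x_i: time index k in S(x_i)..min(a, b).
    total += sum(p_recomb('x', k) * p_recoal(k) for k in range(S_x, m + 1))
    return total
```

Each of the two sums has at most K + 1 terms, matching the bound stated above.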

Major Case #2: Old Recombinations
The other major case to consider is when a recombination is already given, R^{n−1}_i = (w_i, s_k) ≠ ∅. Our modeling assumptions prohibit a new recombination in this case, so it must be true that z_i = ∅. Because we can assume in this setting that R^{n−1}_i represents a valid recombination, the first term has a form similar to that of equation S2, where A_k is given by equation S3 and all other terms are as defined for equation 5. Similarly, the second term has a form similar to that of equation S4. The calculation of these transition probabilities can be further simplified by considering three subcases. In defining these cases, we use the notation (x_i, t_i) to indicate the recoalescence point associated with the old recombination R^{n−1}_i = (w_i, u_i). Notice that, in the case of an old recombination, this is not the same as the state y_i = (x_i, t_i), which represents the new coalescence point for branch v (not branch w_i). Here, the time index b corresponds to the recoalescence time.
1. Deterministic case. If the previous state y_{i−1} equals neither the recombination point (w_i, s_k) nor the recoalescence point, then the transition process is completely deterministic, meaning that there is only one transition with non-zero probability. A series of well-defined rules identifies the state y_i that must follow y_{i−1} (Figure S1-2). Note that, because the HMM is unnormalized, the probability of the permitted state transition will generally not be equal to one.

2. Recombination-point case. If the previous state y_{i−1} equals the recombination point (w_i, s_k), then the recombination can lie either above the new branch v or below it. If the recombination is above branch v, then the new state must be the same as the old one, y_i = y_{i−1}. If, on the other hand, the recombination is below branch v, then branch x_{i−1} "escapes" and the new branch v must coalesce higher up in the tree (see Figure S1-2 for a similar calculation).
3. Recoalescence-point case. If the previous state equals the recoalescence point, we must consider the possibility that the recombining branch w_i recoalesces at (x_i, t_i), as well as the possibility that it recoalesces at any location along the new branch v. The reason is that all such scenarios allow T^{n−1}_i to have the same configuration after removal of v. Note that this case is always distinct from case (2), because the recoalescence cannot be on the same branch as the recombination (this would imply a "bubble," which is not allowed in the SMC process). The destination states y_i that are relevant for this scenario are the states on the recombination-bearing branch w_i at time points within the interval between the recombination and recoalescence times, together with the state on the branch above the recoalescence point at the same time (see Figure S1-1F). To evaluate the resulting sums efficiently, we re-express equation 10 in terms of partial sums of the form C_{j,m}. The values of the form C_{0,m} can be computed recursively in a preprocessing step for m = 0, . . . , K.

Further optimization of forward algorithm
Implemented in a direct manner, the forward algorithm would require O(Ln^2 K^2) time. However, by taking advantage of redundancies in the transition probabilities, we can reduce this running time to O(LnK^2). The approach used here is similar to that used by Paul et al. [1].
Recall that the forward algorithm computes a table of values of the form

f_i(m) = b^i_m(D^n_i) Σ_l f_{i−1}(l) a^{i−1}_{l,m},

where b^i_m(D^n_i) is the emission probability for state m and alignment column i, and a^{i−1}_{l,m} is the transition probability from state l to state m at position i − 1 (see section entitled "Stochastic Traceback" in main text for complete details).
Again, let the state variable y_i be defined by a branch x_i and a time t_i. In addition, let C(x_i, j) be the index of the state with branch x_i and time index j. The key observation is that most transitions leave the branch unchanged (x_i = x_{i+1}), and for such transitions the probability depends only on the two time points involved, not on the particular branch. These symmetries mean that the true dimensionality of the transition matrix A^i is considerably reduced. To exploit this reduced dimensionality, let us define a reduced transition matrix Ã^i = {ã^i_{j,k}} indexed by the time points (i.e., 0 ≤ j ≤ K and 0 ≤ k ≤ K), such that ã^i_{j,k} gives the transition probability from time point j to time point k assuming that x_i = x_{i+1}. We can then rewrite the recurrence in the forward algorithm, assuming that k is the time point associated with index m, so that the shared, branch-independent contributions are accumulated once per time point and only a small branch-specific correction is applied per state, yielding the O(LnK^2) running time.
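As a rough illustration of this reduction, the following sketch compares a naive forward step with one that exploits the shared time-indexed transitions. It assumes, purely for illustration, that same-branch transitions depend only on the two time indices (`A_same`) and that branch-changing transitions are branch-independent (`A_other`); ARGweaver's actual transition structure involves further per-branch corrections, but the algebraic trick is the same.

```python
import numpy as np

def naive_step(f, b, A_same, A_other, n, K):
    # Direct forward step over the full (nK x nK) transition structure:
    # O(n^2 K^2) work per alignment column.
    nK = n * K
    out = np.zeros(nK)
    for m in range(nK):
        xm, km = divmod(m, K)
        s = 0.0
        for l in range(nK):
            xl, jl = divmod(l, K)
            a = A_same[jl, km] if xl == xm else A_other[jl, km]
            s += f[l] * a
        out[m] = b[m] * s
    return out

def reduced_step(f, b, A_same, A_other, n, K):
    # Exploit the symmetry: precompute the time-marginal of f once, then
    # each state needs only O(K) work, for O(n K^2) per column overall.
    F = f.reshape(n, K)
    g = F.sum(axis=0)                 # time-marginal of f, O(nK)
    shared = g @ A_other              # branch-independent part, O(K^2)
    corr = F @ (A_same - A_other)     # per-branch correction, O(nK^2)
    return b * (corr + shared).reshape(-1)
```

The two functions compute identical values; only the second avoids the quadratic dependence on the number of branches.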

Subtree sampling
In this section, we outline our strategy for subtree sampling (i.e., resampling of internal branches in the local trees) in greater detail. As described in the main text, subtree sampling is needed to enable efficient mixing of the MCMC sampler with more than a few sequences. Unlike the single-sequence threading operation, subtree sampling allows the "deep structure" of the ARG to be perturbed in a reasonably efficient manner.
For each local tree T^n_i, imagine that one of the internal branches v is removed, thus producing two trees: a main tree T^{M,n}_i and a subtree T^{S,n}_i. The main tree has the same root node as the original full tree T^n_i and has a basal branch that extends to the maximum time s_K. The subtree T^{S,n}_i has v as its root node and does not have a basal branch. In this setting, the effect of the recoalescence operation is to allow the partial local tree (T^{M,n}_i, T^{S,n}_i) to be reconnected into a full local tree. This is accomplished by introducing a lineage leading to v and allowing it to recoalesce with the main tree. Notice that this is a direct generalization of the single-sequence recoalescence operation. In that case, v is required to be a leaf node, but in the general case, it is allowed to be any node (other than the root) in the local tree.
As in the single-sequence case, we can denote the recoalescence point at site i by y_i = (x_i, t_i).
Let the age of the subtree root v be s_q. Notice that the structures of the main tree and the subtree below age s_q do not affect the coalescence rate of the new branch v. Thus, resampling the coalescence point for the internal branch is essentially the same as resampling the coalescence point for an external branch. The only restriction is that the recoalescence point must be at least as old as s_q. This operation can be used within any local block having a single local tree, i.e., for which T^n_i = T^n_j for all i and j. However, a new problem arises in the case in which the local trees differ across sites, as discussed in the next section.
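The restricted state space described above can be sketched as follows. The branch lifetimes are hypothetical inputs (start and end time indices for each main-tree branch); the only substantive rule encoded is that the recoalescence time index must be at least q, the index of the subtree root age s_q.

```python
# Minimal sketch of the candidate recoalescence states for subtree
# threading: (branch, time index) pairs on the main tree, restricted to
# times at least as old as the subtree root age s_q. branch_spans is a
# hypothetical representation of branch lifetimes on the time grid.

def candidate_states(branch_spans, q, K):
    """branch_spans: dict branch -> (start_idx, end_idx) on the main tree;
    q: time index of the subtree root; K: index of the oldest time point."""
    states = []
    for branch, (lo, hi) in branch_spans.items():
        # The new lineage may coalesce to this branch at any time point
        # during the branch's lifetime, as long as that time is >= s_q.
        for t in range(max(lo, q), min(hi, K) + 1):
            states.append((branch, t))
    return states
```

For an external branch (q = 0) this reduces to the single-sequence state space, consistent with the text's observation that the two cases are essentially the same.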

Resampling internal branches across multiple local blocks
Internal branches can also be resampled across multiple local blocks. This involves removing one branch from each tree in T n to create a list of main trees T M,n and subtrees T S,n . A coalescence threading Y can then be sampled to define how each subtree recoalesces to the corresponding main tree, thereby defining a new collection of complete local trees T n .
The problem is that a poor choice of a series of internal branches to remove and resample can result in a highly constrained threading distribution. To see why this is true, imagine that the selected series of internal branches is such that the branch for each local block is completely unrelated (e.g., in a different subtree of the full phylogeny) to the previous one. In this case, if a new recombination is sampled within a local block during the subtree threading operation, that recombination will have to be "undone" by the end of the local block to allow the new local tree for that block to be reconciled with the main tree and subtree for the next block. Thus, any "move" in ARG space must involve tightly coordinated sequences of recombinations that cancel one another out, in a sense. Because such sequences will be difficult to find, there will be a strong tendency to simply resample the previous threading, and the sampler will not mix well.
The solution to this problem is to select sequences of internal branches that are in some way mutually "compatible," so that these constraints on the reconciliation of local trees across blocks are relaxed. It turns out that it is sufficient to select sequences of branches such that adjacent branches in the sequence share ancestry, as defined below.
In order to identify such sequences, we use an auxiliary data structure called a branch graph (Figure S1-3). The branch graph B is derived from the local trees T^n and recombinations R^n. To construct B we need only one local tree from each non-recombining block. Let this subset of trees and recombinations be represented by the vectors T and R, respectively (we will drop the superscript for simplicity). For each local tree T_i and node v_{i,j} ∈ V(T_i), we create a node u_{i,j} ∈ V(B). Then we add a directed edge (u_{i,j}, u_{i+1,k}) ∈ E(B) if, and only if, the branches above v_{i,j} and v_{i+1,k} share ancestry.
We define "shared ancestry" as follows. Let z_{i+1} = (w_{i+1}, u_{i+1}) represent the recombination point and y_{i+1} = (x_{i+1}, t_{i+1}) the recoalescence point leading from local tree T_i to local tree T_{i+1}. In addition, let M be a mapping such that M(v_{i,j}) = v_{i+1,k} if v_{i,j} and v_{i+1,k} represent precisely the same coalescent event in trees T_i and T_{i+1}, respectively. Notice that the node above the recombination branch w_{i+1} does not map to any node in T_{i+1}, that is, M(p(w_{i+1})) = ∅. Also, the new node created in T_{i+1} by recoalescence, which we denote v^+_{i+1}, does not have any node mapping to it. However, all other nodes in T_i and T_{i+1} have a one-to-one mapping.
Shared ancestry can occur in three ways. Consider two arbitrary nodes in adjacent local trees, v i, j and v i+1,k .
First, if M(v_{i,j}) ≠ ∅, then v_{i,j} and v_{i+1,k} share ancestry if, and only if, M(v_{i,j}) = v_{i+1,k}. Second, if v_{i,j} = p(w_{i+1}), the node that is eliminated by the recombination between i and i + 1, then v_{i,j} and v_{i+1,k} share ancestry if, and only if, v_{i+1,k} is the remaining child of the eliminated node. By "remaining child" we mean the child of p(w_{i+1}) other than w_{i+1}, as mapped into T_{i+1}. Third, if v_{i+1,k} = v^+_{i+1}, the new node created by recoalescence, then v_{i,j} and v_{i+1,k} share ancestry if, and only if, the recoalescence occurs on the branch above v_{i,j}, that is, v_{i,j} = x_{i+1}. These rules produce a graph B such that, for each local block i, there is one node (the one above which the recoalescence occurs) with an out-degree of two, while all other nodes have an out-degree of one. Similarly, for each local block, there is one node with an in-degree of two (the remaining child of the node eliminated by the recombination), and all other nodes have an in-degree of one. A directed path in the branch graph indicates a series of branches valid for removal.
The number of directed paths indicates the number of possible ways to remove internal branches according to this scheme. This number can be computed in a straightforward way using dynamic programming. Let P_{i,j} represent the number of paths ending in block i on branch j. This value can be computed recursively as follows:

P_{i,j} = Σ_{k : (u_{i−1,k}, u_{i,j}) ∈ E(B)} P_{i−1,k},  with P_{1,j} = 1 for all j.  (S17)

Thus, the total number of directed paths can be computed as Σ_j P_{m,j}, where m is the index of the last local block. Paths can be sampled uniformly using a standard traceback procedure. Starting with the last block m, the last node j is chosen with probability

P_{m,j} / Σ_{j′} P_{m,j′}.  (S18)

Given a chosen node v_{i,j}, the next node v_{i−1,k} in the traceback is chosen with probability

P_{i−1,k} / Σ_{k′ : (u_{i−1,k′}, u_{i,j}) ∈ E(B)} P_{i−1,k′}.  (S19)
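The dynamic program and traceback described above (equations S17-S19) can be sketched as follows; the branch graph is represented here simply as, for each block i and node j, the list of nodes in block i − 1 with a directed edge into (i, j).

```python
import random

# Sketch of path counting and uniform path sampling in the branch graph B
# (equations S17-S19). Blocks are layers of the graph; edges[i][j] lists
# the nodes k of block i-1 with a directed edge into node j of block i.

def count_paths(edges, sizes):
    """sizes[i]: number of branch-graph nodes in block i (equation S17)."""
    P = [[1] * sizes[0]]          # one path ends at each node of block 0
    for i in range(1, len(sizes)):
        P.append([sum(P[i - 1][k] for k in edges[i][j])
                  for j in range(sizes[i])])
    return P

def sample_path(edges, P, rng=random):
    """Uniform traceback sampling of one directed path."""
    m = len(P) - 1
    # choose the last node with probability proportional to P[m][j] (S18)
    j = rng.choices(range(len(P[m])), weights=P[m])[0]
    path = [j]
    for i in range(m, 0, -1):
        ks = edges[i][j]
        # traceback step (S19): weight each predecessor k by P[i-1][k]
        j = rng.choices(ks, weights=[P[i - 1][k] for k in ks])[0]
        path.append(j)
    return path[::-1]
```

Because each path's sampling probability is the product of the traceback weights, every directed path is drawn with probability 1 / Σ_j P_{m,j}, i.e., uniformly.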

Gibbs and Metropolis-Hastings Sampling of ARGs
Our goal is to sample ARGs G^n from the posterior distribution given the model parameters Θ = (N, µ, ρ) and the data D^n, namely,

P(G^n | Θ, D^n).  (S20)

Using our threading method, we can define both Gibbs and generalized Metropolis-Hastings Markov chain Monte Carlo (MCMC) methods for sampling ARGs. Let g and g′ be two possible values for the random variable G^n, and let q(g → g′) give the probability of proposing g′ given g under some proposal procedure. The Metropolis-Hastings algorithm requires that the acceptance probability for the proposed move be

A(g → g′) = min( 1, [P(G^n = g′ | Θ, D^n) q(g′ → g)] / [P(G^n = g | Θ, D^n) q(g → g′)] ).

Let us now consider a particular type of probabilistic proposal procedure. Let S be a random variable representing a random subgraph of an ARG g, and let S(g) give a restricted set of subgraphs of g. Given a current ARG g, we randomly choose a subgraph S = s and then sample from the posterior a new ARG g′ in which the subgraph s is held fixed (i.e., all changes occur outside of s). We can now write the proposal probability as

q(g → g′) = Σ_{s ∈ S(g)} P(S = s | G^n = g) P(G^n = g′ | S = s, Θ, D^n) = Σ_{s ∈ S(g,g′)} P(S = s | G^n = g) P(G^n = g′ | S = s, Θ, D^n),

where we use the notation S(g, g′) to indicate S(g) ∩ S(g′), thereby enforcing the constraint that the sampled subgraph s must belong to the restricted sets for both the original ARG g and the proposed ARG g′. This proposal probability can be further simplified, with the simplification possible because S is a subgraph of G^n.
If we choose subgraphs uniformly from the set S(g), such that P(S = s | G^n = g) = 1/|S(g)|, we can then write the acceptance probability as

A(g → g′) = min( 1, |S(g)| / |S(g′)| ).  (S24)

For cases where |S(g)| = |S(g′)|, the acceptance probability is always 1, and the procedure is a valid Gibbs sampler. This is true when S(g) is the set of subgraphs of g in which one sequence is removed from the ARG: if there are n sequences, then |S(g)| = n regardless of g.
For resampling internal branches, |S(g)| is not as trivial to calculate, but it can be computed using dynamic programming (see previous section). However, |S(g)| will not always be equal to |S(g′)|, and therefore there is a chance of rejection.
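The acceptance rule of equation S24 is simple enough to state directly in code. Here the subgraph counts are hypothetical inputs: for single-sequence threading both counts equal n, while for internal-branch resampling they would come from the path-counting dynamic program of the previous section.

```python
# Sketch of the Metropolis-Hastings acceptance rule in equation S24.
# n_sub_g and n_sub_gprime stand in for |S(g)| and |S(g')|, the numbers
# of removable subgraphs of the current and proposed ARGs.

def acceptance_prob(n_sub_g, n_sub_gprime):
    # A(g -> g') = min(1, |S(g)| / |S(g')|)
    return min(1.0, n_sub_g / n_sub_gprime)
```

When |S(g)| = |S(g′)| (as in single-sequence threading, where both equal n), the move is always accepted and the sampler is a Gibbs sampler; otherwise the ratio corrects for the asymmetry in how subgraphs are chosen.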
The rationale for using a proposal procedure that conditions on a subgraph is that it uses the data to drive the proposal. Without such a strategy, the acceptance probability would be driven by changes in likelihood, which can vary wildly when resampling large ARGs.
In order to establish that the stationary distribution of this Markov chain equals the desired posterior distribution, we must show that the chain is irreducible, aperiodic, and positive recurrent. First, the chain is irreducible because, given any two ARGs g and g′, it is possible to find a sequence of proposed moves that will transform g into g′. To see that this is true, consider a subgraph S(g, L) of g that is defined by removing threads from g until only a set of leaves L remains. First consider the base case of a single leaf, L = {l}. In this case, we trivially have S(g, L) = S(g′, L), because both subgraphs are simply trunk genealogies. Now let us add one leaf at a time to L. Each time we add a leaf l to the set L, we can ensure that S(g, L) = S(g′, L) by removing the thread for l from g and then re-threading l in such a way that it matches the corresponding thread in g′. In this way, we can obtain any g′ from any g using the threading operation.
Next, to see that the chain is aperiodic and positive recurrent, note that self-transitions g → g have nonzero probability. In addition, every transition in the Markov chain is reversible. Specifically, for any transition g → g′, we choose a subgraph s and then sample g′ conditional on s. Notice that, based on the design of our branch removal procedure, if g′ was sampled conditional on s, then s can be obtained by applying the branch removal procedure to g′. Since s is also a subgraph of g, g can be sampled by the threading procedure conditional on s. Thus, the reverse transition g′ → g must have non-zero probability. Together, nonzero self-transitions and reversible non-self transitions guarantee that the chain is aperiodic. The chain is positive recurrent because it has a finite state space and is irreducible.

Evaluation of Discretized Sequentially Markov Coalescent in Data Generation
Following McVean and Cardin [2], we compared data sets generated under the DSMC with ones generated under the sequentially Markov coalescent (SMC) and the coalescent-with-recombination (CwR). First, we simulated 100 kb regions of 20 sequences using our standard simulation parameters, including four different µ/ρ ratios (see Methods). We then compared the distributions of the number of recombination events under the three models (Supplementary Figure S1A). These distributions were essentially indistinguishable at lower recombination rates, and the DSMC exhibited only a slight excess of recombination events at higher rates.
Interestingly, the DSMC appeared not to be highly sensitive to the number of time intervals K, although the excess in recombination events at high rates was most pronounced under the most coarse-grained discretization scheme (K = 9).
In a second simulation experiment, we assumed a single ratio of µ/ρ = 2 and considered four effective population sizes (N), ranging from 10,000 to 30,000 individuals. In this comparison, we used the number of segregating sites as the summary statistic of interest. As expected, this statistic increases approximately linearly with N under all models.
Once again, we found that the CwR, SMC, and DSMC models produced nearly identical distributions of counts, with only a minor inflation under the coarsest discretization schemes (Supplementary Figure S1B). Overall, these comparisons indicate that the discretization scheme used by the DSMC has at most a minimal effect on measurable patterns of mutation and recombination at realistic parameter values for human populations, suggesting that the model will be adequate for use in inference.

Information in the ARG about Population Phylogenies
To explore the usefulness of ARGweaver in demographic analysis, we attempted to infer a population phylogeny with admixture edges for the 11 human populations represented in the Complete Genomics data set (see Supplementary Figure S20 for a list of populations). We began by extracting genealogies from two loci per ∼2-Mb block, such that each genealogy was approximately 1 Mb from the next one. We then computed a consensus tree at each locus using a standard majority rule for edges across the sampled local trees. Here we considered every 100th sample from our ARGweaver sampling run (21 trees per locus). In addition, we collapsed identical adjacent local consensus trees into a single tree. In the end, our analysis considered 2,304 trees from 1,376 ∼2-Mb blocks. Next, we reduced each tree to a single representative of each of the 11 represented populations by selecting one haploid sample per population at random, and extracting the subtree spanning these 11 leaves.
We then analyzed these 11-leaf trees with the PhyloNet program. PhyloNet finds a population tree that minimizes the number of "deep coalescences" required for reconciliation with a given set of local trees, allowing both for phylogenetic discordance from incomplete lineage sorting (see, e.g., [3]) and for a specified number of hybridization (admixture) events between groups [4,5]. We ran PhyloNet version 3.5.0 on the 11-leaf consensus trees using the InferNetwork parsimony option and specifying a maximum number of hybridization nodes in the range 0-5. Identical phylogenies and networks were obtained for four different random choices of haploid samples per population.
In the absence of hybridization (0 nodes), PhyloNet recovers the expected phylogeny for these populations, with the deepest divergence event between African and Eurasian populations and successively more recent events separating the remaining population groups. When hybridization nodes are allowed, PhyloNet adds admixture edges to this backbone phylogeny (Supplementary Figure S20B). In most cases, the inferred source populations are consistent with previous studies, but there are two major anomalies in the inferred networks. The first anomaly is the use of the Gujarati Indians (GIH) as a source population for the admixed MXL and PUR populations. This may be a consequence of the absence in this data set of a better surrogate for the Native American source populations of the MXL and PUR, or it may reflect European admixture in India [7]. The second anomaly is the inference of admixture from the MXL and Tuscan (TSI) individuals into the CEU sample. Overall, it appears that the program correctly identifies a complex pattern of gene flow among the Latino, European, and African populations but is unable to reconstruct the precise topology of this subnetwork.