The Effect of Single Recombination Events on Coalescent Tree Height and Shape

The coalescent with recombination is a fundamental model to describe the genealogical history of DNA sequence samples from recombining organisms. Considering recombination as a process which acts along genomes and which creates sequence segments with shared ancestry, we study the influence of single recombination events upon tree characteristics of the coalescent. We focus on properties such as tree height and tree balance and quantify analytically the changes in these quantities incurred by recombination in terms of probability distributions. We find that changes in tree topology are often relatively mild under conditions of neutral evolution, while changes in tree height are on average quite large. Our results add to a quantitative understanding of the spatial coalescent and provide the neutral reference to which the impact by other evolutionary scenarios, for instance tree distortion by selective sweeps, can be compared.


Introduction
Coalescent theory is a central part of modern population genetics [1][2][3].It constitutes the basis of genealogical models, of statistical tests of the neutral evolution hypothesis [4] as well as of many simulation tools [5][6][7].Besides application in population genetics, coalescent models and their various generalizations became an object of study in their own right in probability, graph theory and combinatorics [8][9][10][11][12].
The classical coalescent is a binary, rooted, unordered tree with a fixed number n of leafs.The latter is also called the size of the tree (Figure 1A).Such a tree can be interpreted as the genealogical history of a sample of DNA sequences, where mergers (''coalescents'') of two lineages represent events of common ancestry.Thus, coalescent trees are naturally fitted with a time scale and for this reason they are sometimes called labelled histories.A biologically important generalization of the simple case is the coalescent with recombination.Recombination is a process by which two DNA sequences reciprocally exchange genetic material.In the coalescent framework this translates into lineage splits (Figure 1B).A split represents the un-coupling of the genealogical history of two sequence fragments.The ancestral recombination graph (ARG) [13] is a model to integrate such lineage splits into coalescent trees.Each sequence position x along the chromosome is associated with a coalescent tree T x , which is the marginal tree of the ARG at position x.Depending on the rate of recombination, chromosomes are divided into smaller or larger sequence fragments f i (''haplotype block'') in such a way that all positions within a fragment are free of recombination and therefore have the same marginal tree T f .The spatial coalescent is the sequence (T fi ) i of coalescent trees along a sample of recombining chromosomes.Study of the spatial coalescent is of prominent interest in population genomics, since it contains information about the demographic and evolutionary history of a population.For instance, it has lately been used to infer demographic parameters in non-African human [14].Unfortunately, the spatial coalescent is not a simple Markov process [15], complicating its probabilistic analysis and leaving many open problems to be addressed.
Here, we investigate the impact of single recombination events upon some measures of tree topology and shape.By topology we mean the branching pattern of a tree; by shape we mean its topology and branch lengths.In particular, we ask how recombination affects tree height and tree (im-)balance.The latter is measured by the difference in size of the left and right subtrees emerging from the root or any internal node.Depending on when and where a recombination event occurs, the effect on altering tree structure may be drastic, mild or completely silent.Informally, drastic events are those which lead to a large change of tree height or balance.These are events which typically involve splits by recombination of the branches emerging from the root of the tree.As such they may strongly affect the genealogical structure of haplotypes.Identifying and characterizing these events is very informative for population genetic inference.Mild events are typically those which occur along very recent branches, close to the leafs of the tree.They do not, or only mildly, affect haplotype structure and mutation frequency spectrum.Interestingly, there is a non-negligible portion of recombination events which do not alter tree topology, i.e. the branching pattern.We call these events silent.Sometimes, also the branch lengths remain unchanged; we call these events hidden (Figure 2).
Our goals are to formalize these concepts, to characterize in more detail the effect of single recombination events upon tree shape and to quantify the relative frequencies of drastic, mild and silent events.We explicitly calculate the probabilities of changes in height or root balance induced by a single recombination event.
Our results are based on the assumption of a standard neutral model of constant population size.This means that for each coalescent event two lineages are chosen at random to merge.Further, the timing of events is exponentially distributed with a rate which, after re-scaling by population size N, depends only on the number of lineages at a given time.
In Results Section (a), we define a probability density for the trees in the spatial coalescent and we explain the difference between pointwise marginal trees T x , evaluated at every basepair x of the DNA sequences, and the marginal trees T f , evaluated at every fragment f .We derive a simple relation between the densities of T x and T f .In Section (b) we analyze the recombination events which lead to height-changes and derive their probabilities.In Section (c) we quantify the concept of root imbalance, called V, and derive the first-order transition probabilities under single recombination events.We focus on events which produce unbalanced trees and, at the same time, lead to an increase of tree height.This type of events is of particular interest for the analysis of biological data.Their effect on the mutation frequency spectrum and on haplotype structure is the basis of tests to reject the neutral evolution hypothesis (e.g., [16][17][18]).Therefore, for bench-marking it is highly interesting to know how often such events occur under purely neutral conditions, but it is not the goal of this paper to devise another neutrality test.Then, we generalize the results regarding the tree topology parameter V and derive the transition probability for arbitrary types of recombination events.Using this, we calculate the run-length distribution of V along recombining chromosomes.Finally, in Section 0.4, we calculate the average proportion of hidden recombination events and derive its limiting behavior for large sample sizes.
We remind the reader that the spatial coalescent is a non-Markovian process and not completely determined by transitions of any finite order.However, it is a homogeneous process.Therefore, first-order transition probabilities are well-defined and independent of the position in the sequence.Here, we compute first order probabilities for single recombination events from one tree to the next, averaging over all trees of the ARG which are not directly involved in the recombination event considered.Therefore, our results hold for the spatial coalescent as described by the ARG [13].In fact, the ARG is the model which is underlying all our calculations.

(a) Tree Distribution and Recombination
We consider a sample of n ''chromosomes'' from a diploid panmictic population of constant size N. Without recombination, the genealogical history for these chromosomes is described by the classical coalescent process [1,2].The set of all possible coalescent trees of size n is a product R n{1 z 6L n , where R n{1 z contains positive real waiting times of n{1 independent coalescent events and the discrete set L n represents the set of all possible tree topologies.For our purposes here it is more convenient to consider labelled coalescent trees: this means that not only the internal nodes are ordered but also the leafs carry leaf labels.Hence [19] (see also http://oeis.org/A006472),the cardinality of L n is Furthermore, all trees in L n have the same probability , when they are generated under the standard coalescent process [20].The waiting times t k for a coalescent event, given k lineages, are exponentially distributed with mean 1=k(k{1).Time runs backward from the leafs to the root of the tree and is measured in units of the coalescent, i.e. time is scaled by four times the population size.Therefore, R n{1 z 6L n can be regarded as being equipped with a probability mass function which factorizes into a probability density p k (t k ) for each waiting time (2ƒkƒn) and the discrete probability for the topology P (top) .For trees T in the above sense, we denote the resulting probability 'density' by and we have where t k (T) is the time interval during which the coalescent tree T has k lineages.Modeling recombination as an ARG [13], there are two processes to be considered: coalescence and recombination.Given k independent lineages, in the coalescent process two lineages merge into a single one with rate k(k{1).In the recombination process, a single lineage splits into two with rate krL, where r~4Nr denotes the population recombination rate, r is the recombination rate per base and L is the finite length of the sequence.After a recombinational split the two ancestral lineages correspond to different sequence fragments, left and right of the point of recombination.This point is chosen uniformly along the sequence of length L. We assume that r is small, so multiple recombination events in the same position are negligible.
Given a tree T(x) in position x, the length before the first recombination event downstream (or upstream) of x is geometrically distributed with parameter rl(T), where l(T) represents the total length of the tree.Since r is small, it can be safely approximated by an exponential distribution with the same parameter rl(T).
Recombination events may change the shape of the tree.The local tree at position x in the genome may differ from the local tree at position y due to recombination.Moving along the genome, we consider two different sequences of trees: the sequence S x ~fT(x 1 ),T(x 2 ), . ..g of local trees for all positions x 1 ,x 2 , . .., and the sequence S f ~fT(f 1 ),T(f 2 ), . ..g of local trees which are separated by a single recombination event (Figure 3).Note that a tree in S f can span several base positions, as the typical length 1=rl(T f ) of the fragment f is greater than 1.Also, note that consecutive trees in S f need not be different.This occurs when fragments are separated by hidden recombination events.
The standard coalescent without recombination is recovered when looking at the tree for a single position x in the sequence, ignoring all other trees.Neither the rate of coalescent events nor the choice of coalescing lineages in this tree are influenced by ancestral lineages at other positions.The local tree T(x) at any position x is therefore a standard coalescent tree without recombination [21] and the marginal density of a tree in position x of the ARG is identical to p (c) (T); i.e., picking the tree in position x from a random sequence S x is equivalent to generating one from the standard coalescent process without recombination.
On the other hand, picking a tree from a random sequence S f results in a different distribution.The reason is that short trees recombine less, therefore they tend to span larger regions and to be under-represented in S f compared to S x , as illustrated in Figures 4 and 5.
In fact, the two distributions differ by weights which are proportional to the length L f of the fragments spanned by each tree.Since in the limit of large sequences the average length is E(L f (T))~1=(rl(T)), we have p (c) (T)!p (r) (T)=l(T).Therefore, for large sequences, the tree density after a random recombination event is given by where l(T) denotes the total length of the tree.For the standard neutral model, E c (l)~a n ~Pn{1 i~1 1=i.Note that the two distributions differ only in their weights of branch lengths, but not with respect to topology.
The argument leading to eq (3) can be made rigorous under the assumption of infinitely long chromosomes, using the fact that the coalescent with recombination is an ergodic process [22] (see Text S1, Supporting Information eqs (1)-( 3)).As a check of eq (3), we show that p (r) (T) is invariant under a single recombination event.Let P x (T'DT) be the transition density from tree T in a given position x to tree T' in position xz1, and P r (T'DT) the transition density from tree T to tree T' obtained by a single recombination event.Since the marginal density p (c) (T) is the same for every position, we have independent of the recombination rate.For small recombination rates and at first order in r, we have P x (T'DT)~(1{rl(T))d T',T zrl(T)P r (T'DT).Substituting this into (4) gives Figure 2. Non-silent, silent and hidden recombination events.A: Non-silent recombination changes tree topology.In the case shown, also V changes from 2 to 1. B: A recombination event which changes the order of internal nodes.Whether this event is classified as non-silent or silent, depends on the tree definition.It is non-silent for labelled histories (considered here; eq (1)), but it would be silent for unlabelled trees.C: A silent recombination event, which does not affect the branching pattern, but the lengths of the recombining branches.D: A hidden recombination event.It does neither affect branching pattern nor branch lengths.doi:10.1371/journal.pone.0060123.g002 That is, after normalization Furthermore, any marginal tree obtained from an ARG (conditioned on the number of recombinations in the sequence) by choosing randomly an ancestral lineage for every recombination event is distributed according to p (r) (T).This can be seen from symmetry: none of two trees separated by a single recombination event is distinguished, so they have the same distribution, which is the invariant distribution under a single recombination event, i.e. p (r) (T).This property has far-reaching consequences since it makes it possible to exploit the symmetries of the ARG.
Note that the two distributions, p (r) (T) and p (c) (T), become asymptotically identical when n becomes large.To see this, it suffices to consider the random variable l=E(l).Its mean is identical to 1. Since Var(l)~P n{1 i~1 i {2 &p 2 =6 for large n [2], one has The right hand side of equation ( 6) converges to 0 with increasing n.Therefore the factor l=E(l) converges to 1 and p (r) (T)~(l=E(l))p (c) (T)?p (c) (T) (in the sense of local weak convergence).The relations between the empirical probability distributions p½(T(x)) x and p½(T f ) f along the sequence and the probability densities p (c) (T) and p (r) (T) are summarized in the following diagram: Shown are the height distribution of trees in S x (solid; ''positions'') and in S f (dashed; ''fragments'').For comparison, the theoretical distributions for S x are plotted in light colors.doi:10.1371/journal.pone.0060123.g004 The distributions p (c) (T) and p (r) (T) need to be carefully distinguished when measuring the effect of a single recombination event.If one asks for the first recombination event downstream of a given position x in the genome, then the initial tree at position x is distributed with p (c) (T).If one asks instead for the effect of a randomly chosen recombination event, then the density p (r) (T) is the appropriate one.

(b) Height-changing Recombination Events
Probabilities of height changing events.Recombination can be interpreted as a random prune-and-regraft event on the tree [23].First, a time point of pruning is selected uniformly anywhere on the tree; second, the node immediately above the selected branch is removed; third, the pruned branch is re-grafted onto the tree anywhere above the pruning point or onto the ancestral lineage of the root, forming a new node.For hidden recombination events, prune and re-graft occur on the same branch, without modifying topology or branch lengths of the tree.
We denote the root node by n 0 and the first internal node by n 1 .There are four types of recombination events that change the height of the tree (Figure 6).U ('up'): a prune-and-regraft event on the root branches generates a higher root without changing the topology; D ('down'): a prune-and-regraft event on the root branches generates a lower root without changing the topology; N ('new'): pruning a branch below the root branches and regrafting onto the ancestral branch of the root creates a new root, while the old root becomes internal node n 1 ; S ('substitute'): pruning a root branch and re-grafting onto a branch in the subtree of n 1 causes n 1 to become the root.
In fact, for the root to change height it must either be shifted (cases U and D) or be replaced (cases N and S).If the root is replaced, it can become an internal node n 1 (case N) or be lost (case S).Cases U and D leave the topology unchanged, while cases N and S do not.
We denote the probabilities of these events by P U , P D , P S , P N .We compute these quantities under both distributions, p (c) (T) and p (r) (T).
Given a coalescent tree of size n, let the level k be the time interval when exactly k independent lineages coexist, with k~2, . . .,n.The waiting time at the kth level is t k (T), in the following called t k for short.Tree height may be increased by recombination events of type U or N. The total probability for this, P UN (T), is given by the sum of the probabilities of pruning at all possible levels, but never re-grafting lower than the root: where the product is defined to be 1 when k~2.This is a telescopic series that can be re-summed in a function of the total length of the tree Interestingly, this probability depends only on the total length l(T) of the tree and not on the topology.Very short trees grow with high probability, very long trees are unlikely to grow (Figure S1).
The average probability of height-increase when passing from one recombination-delimited sequence fragment to the next is which agrees very well with simulations (Figure 7).Note that P (r) UN approaches zero as slowly as O(1= log (n)).Figure 7. Increase of tree height.Probabilities P (r) UN (black), P (r) U (green) and P (r) N (red) of events that increase tree height as a function of sample size n.Dots represent the values of P (r)  UN obtained by simulations using program ms [5] and selecting a random recombination event which is far from the sequence boundaries.doi:10.1371/journal.pone.0060123.g007 This result can also be derived directly by counting ARGs, since p (r) (T) corresponds to the distribution of a random tree in an ARG.We will consider the case of a recombination event at a given level k and then average over all levels.To obtain the total number of ARGs A n,k with a single recombination event at level k, choose a tree at random (among DL n D possibilities), then choose the branch to be pruned (k possibilities) and the branch to which it is re-grafted at the same or a higher level ( P k j~1 j possibilities).Therefore, The number of ARGs where the new tree is higher than the old one is kDL n D, because there is just one possibility of re-grafting, namely on the ancestral lineage above the root of the old tree.The probability of pruning at level k in the old tree is P k ~kt k =l.Therefore, one can average over p (r) (T) to obtain P (r) UN ~Pn k~2 kE r (t k =l)kDL n D=A n,k , which is identical to equation (9).
Focusing now on pruning of the root branches, we obtain P U analogously to equation (7).Let N k (n j ) be the number of direct descendants of node n j at level k.N k (n j ) can take values 0,1,2.The average value of N k (n j ) satisfies the recursion that has the solution N N k (n j )~2 (jz1) k{1 : In particular, the average number of direct descendants of the root at level k is N N k (n 0 )~2=(k{1).The probability P U is a modification of equation ( 7): multiplying by the fraction of events that are actually of type U, i.e.N k (n 0 )=k, one obtains In contrast to equation (7), equation ( 11) cannot be easily simplified since it depends also on the topology.After averaging over p (r) (T), we obtain and where b n ~Pn{1 j~1 1=j 2 .
The probabilities P (r) D and P (r) S can be computed similarly to the above formulae, giving and (Text S1, Supporting Information eqs ( 4)-( 9)).Alternatively, one may employ an argument based on symmetry properties of the ARG.Among two adjacent trees in the ARG, the left one is smaller or larger than the right one with equal probability.Therefore, The same is true when the root is only shifted.Thus, Hence, by subtraction, Note that the identities ( 17) and (18), being topological in nature, are also valid for models with variable population size.A related result about the probability that a random recombination event leaves tree height unchanged (1{P (r)  UN {P (r) DS ) has been obtained previously by Griffiths & Marjoram [24].
Equations ( 8), ( 11), ( 14), (15) are valid also when averaging over the distribution p (c) (T), instead of p (r) (T).However, exact results are available only for small sample sizes.For the case of arbitrary n we use the following Taylor approximation of the ratio moment where E(X )=E(l) represents the desired probability P (c) .When the expansion is truncated at zeroth order (i.e., replacing the first moment of the ratio by the ratio of first moments), one obtains the results analogous to equations ( 12), ( 13), ( 17) and (18).More detailed calculations are given in Text S1, Supporting Information eqs ( 10)- (12).These yield, for instance, the probability of increasing tree height Note that the scaling factor on the right hand side in equation ( 20) approaches 1 very slowly with increasing n.The case P (c) UN is actually an exception since an exact formula exists [15] for all values of ; in fact, P UN (T) depends only on l(T), therefore it is sufficient to average this quantity over the distribution of l obtained in [15].For small samples there is a considerable difference between P (c) UN and P (r) UN .For example, if n~2, we have P (c) UN ~0:55 while only P (r) UN ~0:33.Amount of change in height.The variation in height Dh has a simple distribution.If the height increases, then the difference is given by the waiting time for coalescence of two lineages.It is and where g(x) is the Heaviside function, g(x)~1 if x §0 and 0 otherwise.If the height decreases because of an event of type D, its distribution is given by the waiting time for coalescence before time t 2 , equivalent to the ''bounded coalescent'' for two lineages [25] P D (DhDT)~2 e {2(t 2 zDh) g({Dh)g(t 2 zDh) For events of type S, the variation in height is simply the waiting time t 2 of the tree where d(x) is the Dirac delta distribution.Averaging these quantities over p (r) (T) and using the symmetries of the ARG, we obtain and P (r) N (Dh)~P (r) S ({Dh)~2e {2Dh g(Dh)P (r) N : i.e., all these variations in height are exponentially distributed for an average tree.Taking expectations, the average change in height after one of these events is

(c) Root Imbalance and Recombination
Let L n0 (R n0 ) be the number of left (right) descendants of the root.We have L n0 zR n0 ~n.We call the random variable V~min (L n0 ,R n0 ) root imbalance.V is a coarse-grained measure of tree topology.A recombination event may or may not change V and a change of V is neither sufficient nor necessary for a change in tree height.Since many recombination events induce rearrangements of the lower branches (close to the leafs) of the tree, they may affect V without affecting tree height.Still, large changes in V are often associated with height-changing recombination events of type N or S and thus are associated with drastic changes of tree topology.
In this section we calculate the transition probabilities P(vDv 0 ) for V under a single recombination event, averaged over the initial tree.First, we focus on events of type UN, i.e. increasing height, and then we obtain the transition probabilities for all types of events separately.
Root imbalance and height-increasing events.Let the size of a branch be the number of leaves below the branch.A specific tree of size n can be fully described by the probability P n,k (iDT) that a randomly chosen branch at level k has size i.Averaging over trees of size n, the probability that a branch of level k has size i is [26].Let P P (r) UN (i) be the probability that the height increases and the pruned branch has size i.It is obtained, similarly to P (r) UN , by multiplying each term of the sum in equation ( 7) by P n,k (iDT).Thus, given a tree T, and, averaging over p (r) (T), one obtains More generally, the probability that the pruned branch has size i, given that recombination leads to an increase in height, is simply P P (r) (iDUN)~P P (r) UN (i)=P (r) UN .The random variable V can take values between 1 and n=2 and is the folded version of the random variable i which ranges from 1 to n{1.Hence, the distribution of V, after an event that increases tree height, is and the distribution of V, conditioned on tree height increase, is as illustrated in Figure S3.Now we calculate the probability conditioned on the value v 0 of V before recombination, i.e. the transition probability P (r)  UN (vDv 0 ).The basic quantity for this computation is the probability P n,k (iDv 0 ) that a branch at level k has size i in a tree of total size n, given that the size of the root branches are v 0 and n{v 0 .To compute this, we need information about the actual size k at level k of the subtree of size v 0 of the root.We denote the distribution of k by P(kDv 0 ,k,n) and the distribution of i given the sizes k and v 0 of its root subtree at levels k and n by P(iDk,v 0 ).Note that i does not depend on k nor on n, but only on the size of the root subtree to which it belongs (see Figure S4).Therefore we have The probability P(iDk,v 0 ) is equal to as can be shown by considering the corresponding subtree of the root as the whole tree and using equation ( 27).The probability P(kDv 0 ,k,n) depends only on the topology, therefore it can be obtained by counting the number of labelled coalescent trees (http://arxiv.org/abs/1112.1295v2) with a root branch of size v 0 in the whole tree that reduces to size k at level k, denoted by L n,v 0 ,k,k , and dividing by the total number of trees with a root branch of size v 0 , denoted by L n,v 0 .Using that DL n D~n!(n{1)!=2 n{1 , that the coalescent process induces a uniform distribution on L n and that the distribution of v 0 is 2=(n{1)(1zd 2v 0 ,n ) [27], we have The set of all trees in L n,v 0 ,k,k can be generated in the following way: (i) choose v 0 leafs out of n; (ii) choose an relative order of the n{2 coalescent events among the two subsets with v 0 and n{v 0 leafs such that among the first n{k events v 0 {k events belong to the first subset and n{v 0 {kzk belong to the second; (iii) choose a topology for the root subtree of size v 0 ; (iv) choose a topology for the complementary subtree of the root.This process generates exactly once all trees in L n,v 0 ,k,k , except for the case v 0 ~n=2, where each tree is generated twice.Therefore, we have Taking the ratio of tree counts, we obtain an hypergeometric distribution Finally, inserting the results ( 32) and ( 35) into (31), we obtain where B x,y;z and M x,y;z are the normalization and the mean (i.e., the zeroth and first moment) of the hypergeometric distribution with parameters x, y and z, if they satisfy 0ƒx,yƒz, and 0 otherwise.Note that M x,y;z ~xy z B x,y;z .
As before, we introduce P n,k (iDv 0 ) in equation ( 7) to obtain and, finally, the result Figures 8 and S5 illustrate these probabilities.With a recombination event of type N, v tends to change to smaller values.Thus, the tree becomes more unbalanced.However, by far the highest probability is attained for v~v 0 , irrespective of v 0 and mainly due to events of type U.This case is omitted from the figures for clarity.
Other recombination events that change root imbalance.Now we consider all possible recombination events that change V. Events of type U and D do not change V, so they can be ignored.Apart from the events of type N that we discussed above, other relevant recombination events are of type S and of type R ('root remains'), i.e. any event which leaves the root untouched.To compute the probability of a change in V for these types of events, we use the fact that random trees from an ARG have the distribution p (r) (T) and that the probability of each labelled ARG topology is the same.Due to this, we need only count the number of ARGs with a single recombination event at level k compatible with root imbalances v 0 and v, and denoted by A n,k,v0,v,S and A n,k,v0,v,R .Then, we divide by the total number A n,k,v0 of ARGs with a recombination at level k and root imbalance v 0 for the original tree.Putting everything together, we obtain where Q x,y;z is the second moment of the hypergeometric distribution with parameters x, y and z satisfying 0ƒx,yƒz, and 0 otherwise, and H(n) is the Heaviside function, H(n)~1 if n §0 and 0 otherwise.Note that the ARG symmetries imply the non-trivial relation The relative importance of P (r) R versus P (r) UN and P (r) DS is shown in Figure S6.
The contribution for events of type S can be obtained using the symmetry properties of the ARG.In fact, an ARG with a recombination event of type S changing v 0 to v is equivalent to an ARG with an event of type N changing v to v 0 .Therefore, DS (vDv 0 )~P (r) UN (v 0 Dv) This result is essentially the transpose of the one shown in Figure 8, i.e. after an event of Type S, v has an almost uniform distribution irrespective of v 0 .Finally, the transition probability is ~v=v 0 : UN (vDv 0 )zP (r) DS (vDv 0 )zP (r) R (vDv 0 ) v~v 0 : 1{ P v=v 0 P (r) UN (vDv 0 )zP (r) DS (vDv 0 )zP (r) R (vDv 0 ) This distribution is shown in Figures S7 and S8 for n~40.
It Increases Logarithmically in n (Figure 9) To translate this into physical length, we assume that the distance between two consecutive recombination events is exponentially distributed with mean 1=(rl(T)).Averaging over p (r) (T) we obtain 1=(ra n ).Therefore, distance l top between two events of type N or S is approximately independent of n.For example, if the scaled recombination rate is r&10 {3 , the genomic distance between such events is about 4kb.
Assuming that also the scaled mutation rate is h&10 {3 per bp and assuming n~100, an interval between drastic recombination events of type N or S contains about 4a 99 &20 polymorphic sites.This number should be sufficiently high to enable at least a rough tree re-construction from SNP data, and to estimate V.It will probably not be sufficient for the reconstruction of the fine topological structure of the lower branches.
To estimate the correlation length of V, also events of type R need to be taken into account.In fact, changes in V occur more often than events of type N or S. Using equation (43), we determined the run-length of V, i.e. the number of recombination events that occur before a change in V happens.Considering a random initial tree, an estimate for the run-length is given by The run-length is longer for more imbalanced trees, but always on the order of a few recombination events (between 2 and 6; Figure 10).This is also a reasonable estimate for the correlation length of the fine topological structure.We now consider correlation in tree height.Height can change by events U,D,N and S. The average change in height is the same, DDhD~1=2, for all these events.Therefore, correlation length can be estimated as L (r)  h *1=P (r) UNDS : Since P (r) UNDS ~2P (r)  UN is between 0:25 and 0:3 for 20=n=100 (Figure 7), drastic changes in height are expected on average every 3 to 4 recombination events.More generally, the correlation length also increases logarithmically in n and is For the physical correlation length we have.
This is only about a quarter of the topological correlation length.Therefore, an exact reconstruction of tree height is difficult.For instance, for n~100 and h~r~10 {3 , one would have on average only 5 SNPs to estimate height or other tree parameters.
For the case n~2, Hudson [21] gives a formula for the correlation between the heights of two trees in dependence of the recombination rate r.The formula predicts that the correlation drops to about 0:5 with r1:4, i.e. after approximately 1.4 recombination events.Our rough estimate for the correlation length in this case is 1=P (r) UNDS ~1:5, and in good agreement with Hudson's result.
Finally, we briefly comment that linkage disequilibrium and haplotype block size depend strongly on the number and distribution of mutation and recombination events along coalescent trees, i.e. they depend strongly on tree topology and length.Since topology can in practice only be indirectly estimated from polymorphism patterns, not all changes in topology are actually visible for these statistics.The correlation lengths estimated from experimental data will tend to be larger than the theoretical estimates presented here.Assuming that haplotype blocks are mostly delimited by 'drastic' recombination events, involving a change of topology, we estimate the size of these haplotype fragments L h , centered at some position x with a tree T. Assuming further that neither tree length l(T) nor the probability of topology-changing drastic recombination events P td (T) change much after a 'non-drastic' recombination event, the probability distribution for the haplotype sizes is The average size is then The class of drastic recombination events that should be considered to determine P td (T) is probably larger than the class of type N and S events.However, P td (T)~P NS (T) is a reasonable lower bound approximation.

Discussion
We have considered the effect of single recombination events on coalescent tree topology and explicitly determined the probability with which recombination triggers 'drastic' We consider a change to be drastic if it leads to a change of tree height or of tree imbalance.These types of events are of practical interest because both have an effect on the pattern of polymorphic sites which are informative for genealogical reconstruction and evolutionary inferences.The primary effect of height change is upon the number of mutations, while a change in tree imbalance primarily affects the mutation site frequency spectrum.
Our results show important qualitative differences for the two types.The average change in height is quite drastic per se (50% of average tree height), while the average change in imbalance is quite mild, with large jumps occuring only very rarely.Our results hold for the standard neutral model, i.e. a model with constant population size and without substructure.As such, our results may serve as the analytical reference case for constructing formal tests of the neutral evolution hypothesis.For instance, the probabilities of height or topology change are markedly altered in the presence of selective sweeps, i.e. the fast fixation of a mutant allele due to positive selection.Recombination close to the sweep site, where tree height is severely reduced [29], tends to lead to both a drastic increase of tree height and highly imbalanced trees [16,18].In contrast, variable population size leaves a different signature on the probabilities of drastic recombination events.Non-constancy of N is reflected in branch length variation, but it has no impact on the branching pattern, i.e. on topology.In fact, if panmixis continues to hold, the probability distribution of tree topologies does not depend on population size.Variation of N affects only branch lengths and waiting times.Since all our results, averaged over p (r) (T), depend implicitly on the first moments of the waiting times through the quantity P k ~kE(t k =l), they can in principle be adapted to models with variable population size using the theory developed earlier [26,30].A detailed treatment is left to further investigation.Here we just note that the relations (17), ( 18) and (49) are valid for all models of variable population size.
Population substructure is another important case of deviation from the standard neutral model.Restricted gene flow between subpopulations strongly affects the transition probabilities of root imbalance, but less the distribution of height change.A more detailed discussion of the impact of these evolutionary scenarios upon a test statistic of the neutral evolution hypothesis is given in [18].We have derived a number of further results which shed more light on the details and consequences of recombination.We analysed the correlation length between trees on a recombining chromosome and showed that topological correlation is generally longer-ranging than correlation in tree height.Still, for both types very few recombination events -on the order of ten -are sufficient to unlink the genealogical histories of two genomic fragments, given standard neutral conditions.The calculations also make clear that correlation length (number of recombinations) scales logarithmically in n.This is important to take into account for deep sequencing association studies.
It is perhaps surprising to see that a considerable fraction of recombination events remains hidden.Even for large sample sizes, about 10% of the recombination events are not visible.An even larger fraction is silent, i.e. does not cause topological changes of the underlying genealogy.
Analyzing root imbalance in more detail, we found that the distribution of V-run lengths is biased towards unbalanced trees: under the standard neutral model, unbalanced trees tend to span larger genomic regions than balanced trees.Interestingly, the Vrun length, when normalized, is asymptotically independent of n.Our results provide a basis to tackle problems of correlation between tree statistics in coalescent models.They extend known results, such as the one by Hudson [21] concerning tree height correlation, to the more general case of arbitrary sample size n.
Some of the quantities studied here involve counting problems of ancestral recombination graphs with a single recombination event.These problems are related to counting problems of phylogenetic networks [31].Unlike counting problems of trees, which can often be tackled by generating function techniques ( [20], arxiv.org/abs/1112.1295v2,arxiv.org/abs/1202.5668v3),only few results are available for tree-like structures with independent cycles so far [32].Our results represent a step towards a combinatorial treatment of these problems.

Figure 1 .
Figure 1.Example coalescent trees.A: Tree of size n~10 generated under the coalescent process.The y-axis represents a time scale, with leafs at the 'present', and the root in the 'past'.Starting from the present and going backwards in time, coalescent events are exponentially distributed with a parameter depending on population size (2N) and the number of lineages at any given point in time.B: Recombination is a prune (asterisk) and regraft (circle) event: a lineage splits and merges onto another lineage which exists in the population at the time of recombination.This lineage does not need to extend to the present, and it may have become extinct from the entire population (cross).Recombination has changed the height of the coalescent tree with respect to the tree in panel A (Dh), but has not changed root imbalance: for both trees V~3.doi:10.1371/journal.pone.0060123.g001

Figure 3 .Figure 4 .
Figure 3. Distinction between sequences S x and S f along a recombining chromosome (sketched in the middle).Sequence S x is the sequence of coalescent trees plotted for each nucleotide.Sequence S f is the sequence of coalescent trees for each recombination fragment.Recombination breakpoints are indicated by arrows.doi:10.1371/journal.pone.0060123.g003

Figure 5 .Figure 6 .
Figure 5. Height of neutral coalescent trees along the genome.One simulation run using ms[5] with n~20 and r~4Nr~10 {3 .On the right, the distribution of the trees according to S x and S r and the average length before a recombination event, for a simulation of a sequence of length 10 6 .doi:10.1371/journal.pone.0060123.g005 )~E(DhDN)~{E(DhDD)~{E(DhDS)~1=2.Comparing this to the average height of a tree, E(h)~1{1=n, one notices that a single recombination event changes tree height by 50% on average.

Figure
Figure S1Probability of increasing height after a recombination event as a function of the total tree length l. (PDF)