The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection

Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa.


Phylogenetic networks
The term phylogenetic network has grown to become an umbrella term that encompasses any non-treelike model [1]; therefore, it is important to explicitly describe the phylogenetic network model used. Since we are concerned with hybridization and deep coalescence, we use the evolutionary, or hybridization, phylogenetic network model given in [2], which we now briefly review.
Definition 1 A phylogenetic X -network, or X -network for short, W is an ordered pair (G, ), where G = (V, E) is a directed, acyclic graph (DAG) with • indeg(r) = 0 (r is the root of W ); • ∀v ∈ V L , indeg(v) = 1 and outdeg(v) = 0 (V L are the external tree nodes, or leaves, of W ); • ∀v ∈ V T , indeg(v) = 1 and outdeg(v) ≥ 2 (V T are the internal tree nodes of W ); and, • ∀v ∈ V N , indeg(v) = 2 and outdeg(v) = 1 (V N are the reticulation nodes of W ), E ⊆ V ×V are the network's edges (we distinguish between reticulation edges, edges whose heads are reticulation nodes, and tree edges, edges whose heads are tree nodes), and : V L → X is the leaf-labeling function, which is a bijection from V L to X .
We use V (W ) and E(W ) to denote the set of nodes and edges of phylogenetic network W . Fig. S1 shows an example of a phylogenetic network based on Definition 1. . A phylogenetic network W , and its associated branch lengths and hybridization probabilities. The network has 9 nodes (solid circles), which include the root r, one reticulation node, h, 4 leaves (bijectively labeled by the set X = {A, B, C, D}), and 3 internal tree nodes. Shown also are the branch lengths (red) and hybridization probabilities (blue).
In addition to the topology of a phylogenetic network W , we associate with each branch b = (u, v) in the network a branch length, denoted by λ b (equivalently, λ (u,v) ), which reflects the time in coalescent units between the two endpoints of the branch. To describe all branch lengths of a phylogenetic network, a vector λ with one entry per branch is provided. In addition, for each reticulation node h, with two parent edges b 1 = (u, h) and b 2 = (v, h), we associate hybridization probabilities γ b1 (equivalently, γ (u,h) ) and γ b2 (equivalently, γ (v,h) ), such that γ b1 + γ b2 = 1. The parameter γ (x,h) is taken to denote the proportion of alleles in the population h that are inherited from population x. To describe all hybridization probabilities associated with a phylogenetic network, a vector γ with one entry per reticulation edge is provided.

From a phylogenetic network to a Multilabeled (MUL) tree
Central to our formulation/algorithm for computing the probability of a gene tree given a phylogenetic network is converting the phylogenetic network to a multilabeled tree, or MUL tree [3]. A MUL tree is not a true phylogenetic tree, since its leaves are not uniquely labeled by a taxa set. However, we show in this work that the MUL tree representation of a phylogenetic network allows us to extend coalescent-based calculations of gene tree probabilities in a straightforward manner to cases where hybridization may be involved.
It is straightforward to convert a phylogenetic network into its corresponding MUL tree. The main idea is to process the phylogenetic network in a bottom-up fashion, traversing its nodes from the leaves towards the root. Every time a reticulation node h is encountered, two copies of the tree rooted at its child w are created, and each of h's two parents points to exactly one of the two copies. As the traversal operates in a bottom-up fashion, it is guaranteed that when a reticulation node is encountered, there are no reticulation nodes remaining "under" it (they would have been processed already). In addition to the topology, the conversion maps the branch lengths and hybridization probabilities to the appropriate branches as well. Finally, as a single edge in a phylogenetic network W may give rise to multiple edges in the MUL tree T , Algorithm NetworkToMULTree returns a mapping φ : E(T ) → E(W ) that keeps track of this information.
The MUL tree T that corresponds to the phylogenetic network W of Fig. S1 is given in Fig. S2. In this example, we have the following values of φ : S2. The MUL tree, branch lengths (red), and hybridization probabilities (blue), that correspond to the phylogenetic network of Fig. S1, as generated by Algorithm 1. In the MUL tree, each branch has a hybridization probability; values not shown here equal 1.
The conversion procedure is given formally in Algorithm 1 (NetworkToMULTree).
Input: Phylogenetic X -network W ; branch lengths λ; hybridization probabilities γ. Output: MUL tree T ; branch lengths λ ; hybridization probabilities γ ; edge mapping φ : while traversing the nodes of T bottom-up do if node h has two parents, u and v, and child w then Create a copy of Tw whose root is new node w and set φ(e) = e where e ∈ E(T w ) is a copy of e ∈ E(Tw); Add to T two new edges e1 = (u, w) and e2 = (v, w ); It is important to note that it is possible that two different phylogenetic network topologies give rise to the same MUL-tree topology, and under certain settings of branch lengths and hybridization probabilities, the networks may also give rise to identical MUL-tree topologies and branch parameters (which, by definition, would result in nonidentifiability of the topology and/or parameter values). However, if the parameter values differ between the two networks, they may still be identifiable, even though the two networks give rise to the same MUL-tree topology. This issue is illustrated in Fig. S3. Fig. S3. Two phylogenetic networks N 1 and N 2 that give rise to the same MUL-tree toplogy.
The phylogenetic network W 1 involves a hybridization between A and B, a hybridization between C and D, and a hybridization of the two hybrids. The phylogenetic network W 2 involves a hybridization between A and C, a hybridization between B and D, and a hybridization between the two hybrids. When the MUL-tree T is obtained from W 1 , then we have When the MUL-tree T is obtained from W 2 , then we have Further, different lengths of the branches of the two networks would result in different branch lengths of the MULtrees produced from each of the networks.

Coalescent histories on phylogenetic networks and their MUL trees
The notion of coalescent histories is central to computing gene tree probabilities [4]. Let V (t) denote the set of nodes in a tree t, and let t u denote the subtree of tree t that is rooted at node u. Given gene tree g and species tree T , a coalescent history is a function h : V (g) → V (T ) such that the following conditions hold: • if w is a leaf in g, then h(w) is the leaf in T with the same label (in the case of multiple alleles, h(w) is the leaf in T with the label of the species from which the allele labeling leaf w in g is sampled); and, Given a species tree T and a gene tree g, H T (g) denotes the set of all coalescent histories; mathematical properties and algorithms for computing H T (g) have been given [5,6]. A similar notion of coalescent histories can be defined on phylogenetic networks. Let W be a phylogenetic network and u be a node in V (W ). We denote by W u the set of nodes in W that are under node u (that is, the set of nodes that are reachable from the root of W via at least one path that goes through node u). We can now define a coalescent history of a gene tree g and a species (phylogenetic) network W as a function h : V (g) → V (W ) such that the following conditions hold: • if w is a leaf in g, then h(w) is the leaf in W with the same label (the same as above in the case of multiple alleles); and, The algorithm given in [6] for computing the set H T (g) does not apply to the case when the species phylogeny is a network; that is, for computing H W (g). Further, a phylogenetic network is parameterized with hybridization probabilities γ that must be associated properly with the coalescent histories to obtain the gene tree probability.
Let T be a MUL tree, g be a gene tree, and f be a valid allele mapping (see main text). Then, a coalescent history is a function h : V (g) → V (T ) such that the following conditions hold: where a is the allele that labels leaf w; and, We denote by H T,f (g) the set of all coalescent histories of gene tree g within the branches of MUL tree T given the valid allele mapping f . Table S1 lists all the coalescent histories of the gene tree and MUL tree in Fig. S4. The allele mappings are given in Fig. 1 in the main text. Each row in the table gives the branches of the MUL tree on which the coalescent events, represented by the gene tree internal nodes x, y and z, occur. Table S1 . The coalescent histories of the gene tree topology and the MUL tree T of Fig. S4 (the valid allele mappings are given in Fig. 1 in the main text). x, y, and z are the internal nodes of the gene tree, and each number corresponds to the branch in the MUL tree to which the internal node of the gene tree is mapped.
Allele mapping x y z
For Bayesian inference, we used the GTR+Gamma+I model of sequence evolution, as well as the following setting of MCMC analysis, in MrBayes: mcmc ngen=1000000 mcmcdiagn=yes relburnin=yes burninfrac=0.25 stoprule=no stopval=0.01 For maximum parsimony, we used the following commands in PAUP* (when multiple optimal trees were found for a locus, we used the strict consensus of all of them): set criterion=parsimony maxtrees=1000 increase=no; outgroup Calb; hs; This step resulted in a set G of 106 gene trees, each of which was restricted to the five taxa under study (notice that some of the trees are not fully resolved-a reflection of the use of strict consensus on multiple trees). When reconciling a gene tree that is not fully resolved with a species phylogeny, we considered all possible full resolutions of the gene tree, and considered the resolution that resulted in the best score.
To account for the model parameterization in the likelihood computation, we computed the values of three information criteria, AIC by [11], AICc by [12] and BIC by [13] , in order to account for the number of parameters and allow for model selection.
The AIC measure is defined as: where ln L is the log likelihood score, and k is the number of parameters. In our case, the number of parameters equals the number of branch lengths being estimated plus the number of hybridization probabilities being estimated. The AICc measure corrects for finite sample size, and is defined as: where ln L and k are as in the case of AIC, and n is the number of gene trees used to estimated the likelihood score. Finally, the BIC measure is defined as: The lower the values of these criteria, the better the fit of the model to the data. Table S2. The different topologies inferred by MrBayes and/or PAUP* along with their posterior probabilities (for MrBayes analyses) and frequencies (for PAUP* analyses).

Simulating gene genealogies
We used the ms program [14] to generate synthetic data reflecting six different scenarios that combine hybridization, divergence, and extinction in various ways; these scenarios are depicted by the phylogenetic networks in Fig. S5. In our simulations, all horizontal branches in Fig. S5 had length 0. In all cases, we simulated Fig. S5. Phylogenetic networks depicting different hybridization/divergence/extinction scenarios. The α and β parameters denote the proportions (or, probabilities) of alleles that are inherited from the "left" parents of the reticulation nodes (1 − α and 1 − β denote the proportions of the alleles that are inherited from the "right" parents of the nodes).
nloci ∈ {10, 25, 50, 100, 500, 1000, 2000} loci, for two time intervals: interval 1, which corresponds to t1 = t2 = t3 = t4 = 1.0 coalescent units, and interval 2, which corresponds to t1 = t2 = t3 = t4 = 2.0 coalescent units. It is important to note that the ms program measures time in 4N e units, where N e is the effective population size. Since a coalescent unit equals 2N e , we used values 0.5 and 1.0 for times in ms to reflect time intervals 1 and 2, respectively. For each setting of parameters, 100 data sets were generated, and averaged results over the 100 data sets were computed.

Accuracy of inference
For the four scenarios I, IV, V, and VI, the parameters (branch lengths and hybridization probabilities) are identifiable, and we focused on the accuracy of our method for inferring these parameters from samples of gene trees that were simulated as discussed in the previous section. That is, given a sample G of gene tree topologies, and a phylogenetic network topology W , we solved where P W,λ,γ (G) is computed based on Equation (2)

Identifiability
The results in Fig. S10 show that if we use the correct (true) values of branch lengths, the hybridization probabilities are identifiable, and can be estimated with high accuracy as the number of gene trees sampled increases (the inference procedure is identical to that described in the previous section).
However, if both branch lengths and hybridization probabilities are to be estimated, then issues of unidentifiability arise, as we now show.
Consider the phylogenetic network depicted by Scenario II in Fig. S5. Let λ be the branch lengths vector with λ 1 ≡ t 1 = s, λ 2 ≡ t 2 = p, and λ 3 ≡ t 3 = q, and let γ be the hybridization probabilities vector with γ 1 ≡ α = a and γ 2 ≡ β = b. For a given set G of gene trees, we can obtain other vectors λ and γ such that P W,λ,γ (G) = P W,λ ,γ (G), by setting the branch lengths arbitrarily to t 1 = s , t 2 = p , t 3 = q , and then setting the hybridization probabilities as follows α = − (e p − 1)(e q − 1)abe p+q (e q − 1)(e p+q − be p+p +q − e p +q + be p +q ) and β = − (e p+q − be p+p +q − e p +q + be p +q )e −(p+q) e p − 1 .
For example, if we use p = 2.0, q = 2.0, a = 0.5, b = 0.5, p = 1.7, q = 1.7, and then set α = 0.9088149157446168 and β = 0.29101947060819205 (based on the above two formulas), then we obtain the same probability of any set of gene trees on the phylogenetic network of Scenario II in Fig. S5.
If we sample two alleles per species B (and a single or more alleles per each of the two species A and C), this lack of identifiability case disappears, since now the number of gene tree topologies is greater than the number of parameters being estimated. However, in practice, the value of t 1 does affect the identifiability of the parameter values, since the larger it is, the higher the probability that the two alleles sampled from B would coalesce and give a signal similar to that provided by a single allele. This point is illustrated by the results shown in Fig. S11.
To produce these results, we parameterized the phylogenetic network of Scenario II above with two different sets of values: • network1: t 2 = t 3 = 2.0, α = β = 0.5.
• network2: t 2 = t 3 = 1.7, α = 0.9088149157446168 and β = 0.29101947060819205. The probabilities of each of these 9 gene tree topologies (we choose one gene tree topology per category), as a function of the value of t 1 are shown in Fig. S11.
Clearly, the two networks exhibit the gene tree topologies with different probabilities, when t 1 = 0.25. However, the gap between the probabilities starts closing as the value of t 1 increases. When t 1 = 4.0 or 8.0, the gaps are too small to be even observed in any realistic data set (of a few thousand loci). At these branch lengths, the three topologies with non-negligible probabilities are the ones of categories 6, 8, and 9, which have the two alleles of B coalesce before either of them coalesce with alleles of the other two species.
In other words, while sampling two alleles from B help ameliorate the identifiability issue, a relatively large sample (in terms of the number of loci) needs to be used, and the the time between hybridization and the subsequent divergence must not be too large, for methods to uniquely identify the parameter values.
Furthermore, in the special case where α = 0.0, a phylogenetic tree, with appropriate branch lengths can be found, to fit the data exactly with the same probability that the phylogenetic network would. Consider the phylogenetic network N in Fig. S12(left), which reflects Scenario II in Fig. S5 in the case where α = 0.0.
Let λ be the branch lengths vector with λ 1 ≡ t 1 , λ 2 ≡ t 2 , and λ 3 ≡ t 3 , and let γ be the hybridization probabilities vector with γ 1 ≡ β. Now, consider the phylogenetic tree T in Fig. S12(right). Then, if we set t as a function of β, t 2 , and t 3 , as follows: for any set G of gene trees. This result shows (as illustrated in Fig. 2 in the main text) that as t 2 increases, the value of t becomes unaffected by t 2 , and that increasing t proportionally to the increase in t 3 always maintains identical probabilities of gene trees under both phylogenies in Fig. S12, as reflected by the derivatives: Clearly, lim t2→∞ ∂t ∂t 2 = 0.
Let us now consider the phylogenetic network of Scenario III in Fig. S5. In this case, both species involved in the hybridization are extinct. Surprisingly, the results in Fig. S13 show that if we use the correct (true) values of branch lengths, the hybridization probability α is identifiable, and can be estimated with high accuracy as the number of gene trees sampled increases.
However, if both branch lengths and hybridization probability are to be estimated, then issues of non-identifiability arise, as we now show. Let λ be the branch lengths vector with λ 1 ≡ t 1 = s, λ 2 ≡ t 2 = p, and λ 3 ≡ t 3 = q, and let γ be the hybridization probabilities vector with γ 1 ≡ α = a. For a given set G of gene trees, we can obtain other vectors λ and γ such that P W,λ,γ (G) = P W,λ ,γ (G), by setting the hybridization probability arbitrarily to α = a and the branch lengths arbitrarily to t 1 = s , t 3 = q , and For example, if we use p = 1.0, q = 2.0, a = 0.8, a = 0.1, q = 1.8, and then set p = 1.050498643 (based on the above formula), then we obtain the same probability of any set of gene trees on the phylogenetic network of Scenario III in Fig. S5.
Furthermore, a phylogenetic tree, with appropriate branch lengths can be found, to fit the data exactly with the same probability that the phylogenetic network would. Let λ be the branch lengths vector with λ 1 ≡ t 1 , λ 2 ≡ t 2 , and λ 3 ≡ t 3 , and let γ be the hybridization probabilities vector with γ 1 ≡ α. Now, consider the phylogenetic tree T in Fig. S12(right). Then, if we set t as a function of α, t 2 , and t 3 , as follows: then, P N,λ,γ (G) = P T,t (G) for any set G of gene trees. See Fig. S14 for values of t(α, t 2 , t 3 ). This result shows that as t 2 increases, the value of t becomes unaffected by t 2 , and that increasing t proportionally to the increase in t 3 always maintains identical probabilities of gene trees under both the phylogenetic network of Scenario III and the phylogenetic tree in Fig. S12, as reflected by the derivatives:     (5); from left to right: α = 0.1, 0.5, and 0.9, respectively.