TripNet: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets

Abstract

The problem of constructing an optimal rooted phylogenetic network from an arbitrary set of rooted triplets is NP-hard. In this paper, we present a heuristic algorithm called TripNet, which tries to construct a rooted phylogenetic network with the minimum number of reticulation nodes from an arbitrary set of rooted triplets. Unlike current methods, which work only for dense sets of rooted triplets, TripNet is also applicable to non-dense sets of rooted triplets; this is its key innovation. We prove some theorems to clarify the performance of the algorithm. To demonstrate the efficiency of TripNet, we compared it with SIMPLISTIC, the only available software that can return some rooted phylogenetic network consistent with a given dense set of rooted triplets. The results show that for complex networks with high levels, the SIMPLISTIC running time increases abruptly, whereas in all cases TripNet outputs an appropriate rooted phylogenetic network in an acceptable time. We also tested TripNet on the Yeast data. The results show that the TripNet network and the optimal network have the same clustering, and that TripNet produced a level-3 network which contains only one more reticulation node than the optimal network.

Introduction

Phylogenetic networks are a generalization of phylogenetic trees that permit the representation of non-tree-like underlying histories. A rooted phylogenetic network is a rooted directed acyclic graph in which no node has indegree greater than 2 and the outdegree of each node with indegree 2 is 1. Such nodes are called reticulation nodes. In rooted phylogenetic networks the nodes with indegree 1 and outdegree 0 are called leaves and are distinctly labeled by a set of given taxa. Mathematicians are interested in developing methods that infer a phylogenetic tree or network from basic building blocks. In the computation of a rooted tree or network, one group of the basic building blocks are rooted triplets, the rooted binary trees on three taxa [1].

In 1981, Aho et al. studied the problem of constructing a rooted tree from a set of rooted triplets [2]. They proposed an algorithm, called BUILD, which, given a set of rooted triplets, either constructs in polynomial time a rooted tree that contains all the input triplets or decides that no such tree exists.

When there is no tree for a given set of triplets, one may try to produce an optimal phylogenetic network. In this context, the goal is to compute an optimal rooted phylogenetic network that contains all the rooted triplets. One possible optimality criterion is to minimize the level of the network, which is defined as the maximum number of reticulation nodes contained in any biconnected component of the network. The other optimality criterion is to minimize the number of reticulation nodes [1]. In [3] and [4] the authors considered the problem of deciding whether, given a set of rooted triplets as input, it is possible to construct a level-1 rooted phylogenetic network that contains all the input triplets. They showed that, in general, this problem is NP-hard. However, in [4] the authors showed that when the set of rooted triplets is dense, which means that for each set of three taxa there is at least one rooted triplet in the input set, the problem can be solved in polynomial time. Following these results, all research in this area has so far focused on constructing rooted phylogenetic networks from dense rooted triplet sets.

LEV1ATHAN is an algorithm for generating a level-1 rooted phylogenetic network from a set of rooted triplets [5]. Specifically, it attempts to find a level-1 rooted phylogenetic network that contains as many of the input rooted triplets as possible. This problem is NP-hard [5]. The algorithm of [6] can be used to find a level-1 or a level-2 rooted phylogenetic network which minimizes the number of reticulation nodes, if such a network exists. In [6] the authors also showed that for a dense set of rooted triplets τ, if τ is precisely equal to the set of rooted triplets that are contained in some rooted phylogenetic network, then they can construct such a rooted phylogenetic network with the smallest possible level in time $O(|\tau|^{k+1})$, where k is a fixed upper bound on the level of the network. In addition, based on the ideas described in [6], for a given dense set of rooted triplets τ, the authors proposed the SIMPLISTIC algorithm, which always returns some rooted phylogenetic network that contains τ but does not give any minimality guarantees.

In [7] the authors showed that, given a dense set of rooted triplets τ and a fixed number k, it is possible in time $O(|\tau|^{k+1})$ to construct a level-k rooted phylogenetic network that contains τ or to decide that no such network exists.

In this paper we present a heuristic algorithm called TripNet for constructing rooted phylogenetic networks with the minimum number of reticulation nodes from an arbitrary set of rooted triplets. Unlike current methods, which work only for dense sets of rooted triplets, TripNet is also applicable to non-dense sets of rooted triplets; this is its key innovation.

In “unpublished data” the authors applied TripNet to both real and simulated data. Here the TripNet algorithm is described in detail, some theorems are proved, and one simulation is performed to show the accuracy of TripNet. TripNet is also tested on the Yeast data. This paper is organized as follows. In section 2, we first present some definitions and notation, then describe the BUILD algorithm, and finally introduce a new method, called TCD, for constructing rooted triplets from (biological) sequences. In section 3 we compare TripNet with SIMPLISTIC on the triplet sets that are obtained from the TCD method, and then test TripNet on the Yeast data. In section 4 we discuss the performance of TripNet. In the last section the directed graph Gτ related to a set of triplets τ is introduced. We show that if a set of triplets is either obtained from a set of sequences using the TCD method or consistent with a tree, then Gτ is a DAG. This property plays a key role in solving, in polynomial time, the Integer Programming system introduced later. Then the concept of the height function of a rooted phylogenetic network is introduced, and an efficient method for obtaining a height function hτ for a given set of rooted triplets τ is explained. It is shown that consistency of a rooted phylogenetic network N with the height function hτ can be a good alternative to consistency of N with τ. To show this, we first define the Integer Programming system in such a way that its constraints intuitively force the consistency of N with τ. Secondly, we show that if τ is consistent with a tree T, then T is consistent with hτ and T can be constructed using this height function. Finally, we present the TripNet algorithm.

Preliminaries

Here we first present some definitions and notation. Then we describe the BUILD algorithm. Finally, a new method, called TCD, is introduced for constructing rooted triplets from a set of sequences.

2.1 Definitions and notation

Let X be a set of taxa. A rooted phylogenetic tree (tree for short) on X is a rooted unordered leaf labeled tree whose leaves are distinctly labeled by X and every node which is not a leaf has at least outdegree two. A directed acyclic graph (DAG) is a directed graph that is free of directed cycles. A DAG G is connected if there is an undirected path between any two nodes of G. It is biconnected if it contains no node whose removal disconnects G. A biconnected component of a graph G is a maximal biconnected subgraph of G. A rooted phylogenetic network (network for short) on X is a rooted DAG in which the root has indegree 0 and outdegree 2 and every node except the root satisfies one of the following conditions:

  1. It has indegree 2 and outdegree 1. These nodes are called reticulation nodes.
  2. It has indegree 1 and outdegree 2.
  3. It has indegree 1 and outdegree 0. These nodes are called leaves and are distinctly labeled by X.

A reticulation leaf is a leaf whose parent is a reticulation node. A network is said to be a level-k network if each of its biconnected components contains at most k reticulation nodes. A tree can be considered as a level-0 network.

A rooted triplet (triplet for short) is a rooted binary unordered tree with three leaves. We use ij|k to denote a triplet with taxa i and j on one side and k on the other side of the root (Figure 1a). A set of triplets τ is called dense if for each subset of three taxa, there is at least one triplet in τ. A triplet ij|k is consistent with a network N, or equivalently N is consistent with ij|k, if the leaf set of ij|k is a subset of the leaf set of N and N contains a subdivision of ij|k, i.e. if N contains distinct nodes u and v and pairwise internally node-disjoint paths ui, uj, vu and vk. Figure 1b shows an example of a network consistent with ij|k. A set τ of triplets is consistent with a network N if all the triplets in τ are consistent with N. We use the symbols τ(N) and LN to represent the set of all triplets that are consistent with N and the set of labels of its leaves, respectively. For any set τ of triplets define $L(\tau) = \bigcup_{t \in \tau} L_t$, where $L_t$ is the leaf set of the triplet t. The set τ is called a set of triplets on X if L(τ) = X.

Figure 1. A triplet and a network consistent with it.

(a) The triplet ij|k. (b) A network with which the triplet ij|k is consistent.

https://doi.org/10.1371/journal.pone.0106531.g001

2.2 BUILD algorithm

Let τ be a set of triplets. BUILD is a top-down algorithm that constructs a tree consistent with τ if such a tree exists. The algorithm is guided by the Aho graph.

Definition 1. (Aho graph) Let X be a set of taxa and τ be a set of triplets on X. The Aho graph AG(τ) = (V,E) associated with τ has node set V = X, and any two nodes i and j are connected by an edge in E if and only if there exists a triplet ij|k ∈ τ [1].

BUILD algorithm: Given a non-empty set of rooted triplets τ on X, the aim is to construct a rooted phylogenetic tree T on X that is consistent with τ, if one exists. If AG(τ) has only one connected component, then the algorithm reports failure. Otherwise, for each node set U of a connected component of AG(τ), determine the set τ|U, which denotes the set of all triplets in τ whose leaves are in U, and recursively compute the rooted phylogenetic subtree T(τ|U), which denotes the tree constructed by the BUILD algorithm consistent with τ|U. Finally, create a root node r and combine all computed subtrees by connecting r to the root of each of them [1]. For an example see Figure 2.
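The recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: a triplet ij|k is represented by the tuple `(i, j, k)`, a tree by nested tuples, and the function names `aho_components` and `build` are hypothetical. A single taxon is returned as a leaf, which is the usual base case of the recursion.

```python
from collections import defaultdict


def aho_components(taxa, triplets):
    """Connected components of the Aho graph: i and j are adjacent
    iff some triplet ij|k belongs to the given triplet list."""
    adj = defaultdict(set)
    for i, j, _k in triplets:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in sorted(taxa):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def build(taxa, triplets):
    """Return a nested-tuple tree consistent with the triplets,
    or None when the Aho graph is connected (BUILD fails)."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))  # a single taxon is a leaf
    comps = aho_components(taxa, triplets)
    if len(comps) == 1:
        return None
    subtrees = []
    for comp in comps:
        # tau|U: triplets whose three leaves all lie in the component U
        restricted = [t for t in triplets if set(t) <= comp]
        sub = build(comp, restricted)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)
```

For the triplet set {bc|a, ac|d, de|b} of Figure 2, `build("abcde", [("b", "c", "a"), ("a", "c", "d"), ("d", "e", "b")])` returns the nested tuple `(('a', ('b', 'c')), ('d', 'e'))` under this representation.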

Figure 2. An example of BUILD algorithm for the given set {bc|a, ac|d, de|b} of triplets.

https://doi.org/10.1371/journal.pone.0106531.g002

2.3 Triplets construction method

There exist different methods, such as Maximum Parsimony or Maximum Likelihood, for constructing triplets from (biological) sequences [6]. In this section a method for constructing triplets is presented. Suppose that X is a set of n taxa and D = [Dij] is an n×n distance matrix on X. For each three taxa i, j, k ∈ X with entries Dij, Dik, and Djk, we assign the triplet ij|k if Dij < min{Dik, Djk}. We name this method Triplets Construction with Distance, TCD for short. In this paper we use the TCD method for constructing triplets.
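As an illustration of the TCD rule, the following minimal sketch enumerates every three taxa and emits ij|k whenever Dij is strictly smaller than both Dik and Djk. The function name `tcd_triplets` and the nested-dictionary layout of the distance matrix are assumptions made for the example.

```python
from itertools import combinations


def tcd_triplets(taxa, dist):
    """dist[i][j] is a symmetric distance; returns tuples (i, j, k) meaning ij|k."""
    triplets = []
    for a, b, c in combinations(taxa, 3):
        # try each of the three possible cherries of the three taxa
        for i, j, k in ((a, b, c), (a, c, b), (b, c, a)):
            if dist[i][j] < min(dist[i][k], dist[j][k]):
                triplets.append((i, j, k))
    return triplets
```

Because the comparison is strict, three taxa whose smallest distance is attained twice receive no triplet, which is why the resulting set need not be dense.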

Results

In this section, to show the performance of TripNet on the triplet sets which are obtained from TCD, we compare TripNet with SIMPLISTIC. We also test TripNet on the Yeast data, which is the only published triplet data set obtained from biological data.

3.1 Comparing SIMPLISTIC and TripNet

SIMPLISTIC is the only available software which has the ability to return some rooted phylogenetic network consistent with a given dense set of rooted triplets. But it does not give any minimality guarantees [6].

SplitsTree is a valuable tool for constructing a special kind of unrooted phylogenetic network from different types of input data. This program converts a given set of sequences X into a distance matrix DX to compute the resulting network. The distance matrix DX is reported as one of the outputs of SplitsTree [8].

Let $\tau_{D_X}$ be the set of triplets that is obtained from DX using TCD, and consider it as the input for TripNet.

Note that $\tau_{D_X}$ is not necessarily dense, since for some three taxa i, j, and k we might have $D_{ij} = D_{jk} < D_{ik}$. In this case one of the triplets ij|k or jk|i is assigned to i, j, and k to obtain a dense set of triplets as the input of SIMPLISTIC. Also if $D_{ij} = D_{ik} = D_{jk}$, then one of the three possible triplets related to i, j and k is chosen at random and assigned to them.

To perform the simulation, 160 different sets of sequences are generated using TREEVOLVE. TREEVOLVE is a software package which simulates the evolution of DNA sequences under a coalescent model [9]. TREEVOLVE has many adjustable input parameters. In this study we adjust the Number of samples, the Number of sequences, and the Length of sequence, and keep the default values for the other parameters. The Number of sequences is 10, 20, 30, or 40. For each value of the Number of sequences, the Length of sequence is 100, 200, 300, or 400. For each case the Number of samples is set to 10.

In this study we run both methods on a PC with an Intel Dual-Core processor running at 1.80 GHz.

We set a running time limit of 6 hours for both methods. Let Nfinite be the set of networks for which the running time is less than 6 hours.

The results of the comparison between TripNet and SIMPLISTIC on the three most important parameters i.e. running time of both methods, number of the reticulation nodes and the level of the final networks, are shown in Table 1.

The results show that when the number of input taxa is 10, both methods always return a network in at most one second. For 20 input taxa, in 5% of the cases SIMPLISTIC returns no result in less than 6 hours. For the remaining 95% of the cases, the SIMPLISTIC running time is on average 306 seconds, while in all cases the TripNet running time is on average at most 2 seconds. When the number of input taxa is increased to 30, in 67.5% of the cases SIMPLISTIC is not able to return a network in less than 6 hours. For the remaining 32.5% of the cases SIMPLISTIC outputs a network in 2675 seconds on average, while in all cases the TripNet running time is on average 200 seconds. Moreover, when the number of input taxa is set to 40, in all cases SIMPLISTIC fails to return any network in less than 6 hours, while on average TripNet outputs a network in 775 seconds. Overall, for all 160 input triplet sets TripNet outputs a network in less than 250 seconds on average, while for the 57% of cases in which the SIMPLISTIC network belongs to Nfinite, the average SIMPLISTIC running time is close to 750 seconds.

The results also show that in all cases the number of reticulation nodes and the level of the TripNet networks are less than those of the SIMPLISTIC networks. Note that for 40 input taxa, the average number of reticulation nodes and the average level of the TripNet networks are 15.825 and 15.25, respectively, while for these data SIMPLISTIC cannot return any network in less than 6 hours.

3.2 Yeast data

The Yeast data is a dense set of triplets generated using real yeast data, obtained from the Fungal Biodiversity Center in Utrecht. This data set, which contains information about 21 species, is available online from (http://skelk.sdf-eu.org/level2triplets.html). Based on the algorithm developed in [10], Steven Kelk has developed a software application, called LEVEL2, for constructing level-2 networks from dense sets of triplets. LEVEL2 is not applicable to general triplet sets, and it produces a network only if there exists a level-2 network consistent with the input triplets. However, LEVEL2 has the advantage that it always produces the best possible network, which also minimizes the number of reticulation nodes. The LEVEL2 network for the Yeast data is a 21-leaf level-2 network which is given in Figure 3a [10]. As this was our only chance to compare TripNet networks with the best possible networks, we repeated the analysis of the Yeast data using TripNet. The TripNet network for the Yeast data set is given in Figure 3b. As one can see, TripNet produced a level-3 network which contains only one more reticulation node than the network obtained by LEVEL2. The running time of both algorithms is nearly one second.

Figure 3. Resulting networks from Yeast triplets. (a) LEVEL2 algorithm result. (b) TripNet algorithm result.

https://doi.org/10.1371/journal.pone.0106531.g003

Discussion

In this paper we introduced TripNet, a software tool that is able to return some network consistent with an arbitrary given set of triplets. TripNet and supplementary files are freely available for download at (www.bioinf.cs.ipm.ir/software/tripnet). Unlike previous methods, which only work on dense triplet sets, our method works on any set of triplets. Some theorems were proved to clarify the rationale behind the steps of TripNet. In this paper the TCD method was introduced for constructing triplets. In order to study the performance of TripNet on the triplets that are obtained from the TCD method, we performed a simulation on 160 different sets of triplets and compared TripNet with SIMPLISTIC.

The results showed that in all 160 cases TripNet outputs an appropriate network in an acceptable time, while just in 57.5% of these cases SIMPLISTIC has the ability to return some network in less than 6 hours. Also on average in all cases TripNet outperforms SIMPLISTIC on the number of the reticulation nodes, and the level of the output network.

Also, as the number of input taxa increases, the running time of SIMPLISTIC increases abruptly, such that for 40 input taxa it could not return any network in less than 6 hours.

These results showed that for large size input data that are obtained from TCD method, SIMPLISTIC is not a practical method for constructing networks, while TripNet works well in all cases.

To establish the performance of TripNet on real datasets, we tested TripNet on Yeast data, and compared our results with those of LEVEL2. For Yeast data TripNet produced a level-3 network which contains only one more reticulation node than the optimal network obtained by LEVEL2. Both networks have the same clustering and represent the same evolutionary relationship between taxa. While TripNet has been designed for general triplet sets (not necessarily dense or consistent with a restricted level network), this example shows that the network produced by TripNet is very close to the best possible solution.

Materials and Methods

In this section we prove some theorems to clarify the rationale behind the steps of TripNet. Then TripNet is presented in nine steps.

5.1 The directed graph related to a set of triplets and height function

Throughout this subsection we denote the pair {i, j} by ij for short. Let τ be a set of triplets. Define Gτ, the directed graph related to τ, by V(Gτ) = {ij: i,j ∈ L(τ), i ≠ j} and E(Gτ) = {(ij,ik): ij|k ∈ τ} ∪ {(ij,jk): ij|k ∈ τ}. In the following we present some basic properties of Gτ.
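To make the definition of Gτ concrete, the sketch below builds its edge set and tests whether the graph is acyclic. The names `triplet_graph` and `is_dag` are hypothetical, a node ij is stored as a `frozenset`, and the acyclicity test (repeated removal of nodes without incoming edges) is a standard technique chosen for the example rather than one prescribed by the text. Isolated nodes of Gτ are ignored, since they cannot lie on a cycle.

```python
def triplet_graph(triplets):
    """Edge set of G_tau: each triplet ij|k contributes (ij, ik) and (ij, jk)."""
    edges = set()
    for i, j, k in triplets:
        ij, ik, jk = frozenset((i, j)), frozenset((i, k)), frozenset((j, k))
        edges.add((ij, ik))
        edges.add((ij, jk))
    return edges


def is_dag(edges):
    """True iff the directed graph given by its edge set has no directed cycle."""
    nodes = {u for edge in edges for u in edge}
    indegree = {u: 0 for u in nodes}
    successors = {u: [] for u in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [u for u in nodes if indegree[u] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    # every node can be removed exactly when there is no cycle
    return removed == len(nodes)
```

For example, the two triplets ab|c and bc|a together produce the edges (ab, bc) and (bc, ab), so the corresponding Gτ is not a DAG.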

In what follows the height function of a tree is introduced. Let $\binom{X}{2}$ denote the set of all subsets of X of size 2.

Definition 2. Let X be an arbitrary finite set. A function $h: \binom{X}{2} \to \mathbb{N}$ is called a height function on X.

Let T be a rooted tree with root r, let cij be the lowest common ancestor of the leaves i and j, and let lT denote the length of a longest path starting at r.

Definition 3. The height function of T, hT is defined as hT(i,j) = lT-dT(r,cij) where i and j are two distinct leaves of T (dT(r,cij) denotes the length of the path between r and cij).

Let T be a tree. The definition above implies that a triplet ij|k is consistent with T if and only if hT(i, j)<hT(i, k) or hT(i, j)<hT(j, k).
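Definition 3 and the consistency criterion can be illustrated with a short sketch. It assumes a tree given as nested tuples with leaf labels at the tips; the names `tree_height_function` and `triplet_consistent_with_tree` are hypothetical.

```python
from itertools import combinations


def tree_height_function(tree):
    """Return h_T as a dict mapping frozenset({i, j}) to l_T - d_T(r, c_ij)."""
    def height(node):
        if not isinstance(node, tuple):
            return 0
        return 1 + max(height(child) for child in node)

    l_T = height(tree)  # length of a longest path starting at the root
    h = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        # leaves in different child subtrees have this node as their LCA
        for g1, g2 in combinations(groups, 2):
            for i in g1:
                for j in g2:
                    h[frozenset((i, j))] = l_T - depth
        return [leaf for group in groups for leaf in group]

    visit(tree, 0)
    return h


def triplet_consistent_with_tree(h, i, j, k):
    """The criterion stated above: h(i,j) < h(i,k) or h(i,j) < h(j,k)."""
    hij = h[frozenset((i, j))]
    return hij < h[frozenset((i, k))] or hij < h[frozenset((j, k))]
```

For the tree `((("j", "k"), "i"), "l")` this gives h(j,k) = 1, h(i,j) = h(i,k) = 2 and h(i,l) = h(j,l) = h(k,l) = 3, so jk|i passes the test while ij|k does not.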

Let X = {x1, x2, …, xm} be a finite set, D be a distance matrix on X, and τ be the set of triplets on X that are obtained from the TCD method using D. Suppose that Gτ contains a cycle $x_1x_2 \to x_2x_3 \to \dots \to x_{n-1}x_n \to x_1x_2$. Then $D_{x_1x_2} < D_{x_2x_3} < \dots < D_{x_{n-1}x_n} < D_{x_1x_2}$, which is a contradiction. So Gτ is a DAG.

Moreover, if τ is a triplet set consistent with a tree T, then Gτ is a DAG. This is so because if Gτ contains a cycle $x_1x_2 \to x_2x_3 \to \dots \to x_{n-1}x_n \to x_1x_2$, then hT(x1,x2) < hT(x2,x3) < … < hT(xn−1,xn) < hT(x1,x2), which is a contradiction.

The height function related to a DAG is introduced as follows.

Let τ be a set of triplets such that Gτ is a DAG, and let $l_{G_\tau}$ denote the length of the longest path in Gτ. Since Gτ is a DAG, the set of nodes with outdegree zero is nonempty. Assign $l_{G_\tau}+1$ to the nodes with outdegree zero and remove them from Gτ. Assign $l_{G_\tau}$ to the nodes with outdegree zero in the resulting graph and continue this procedure, decreasing the assigned value by one in each round, until all nodes are removed.

Definition 4. For any two distinct i, j ∈ L(τ), define $h_{G_\tau}(i,j)$ as the value that is assigned by the above procedure to the node ij and call $h_{G_\tau}$ the height function related to Gτ.
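The removal procedure behind Definition 4 can be sketched as follows. The function name `graph_height_function` is hypothetical, triplets are tuples `(i, j, k)` meaning ij|k, and nodes of Gτ are stored as frozensets. The sketch numbers the removal rounds first and converts them to heights afterwards, using the fact that the number of rounds equals the length of the longest path plus one.

```python
from itertools import combinations


def graph_height_function(taxa, triplets):
    """Height function related to G_tau; raises ValueError if G_tau has a cycle."""
    nodes = {frozenset(pair) for pair in combinations(taxa, 2)}
    successors = {u: set() for u in nodes}
    for i, j, k in triplets:
        ij = frozenset((i, j))
        successors[ij].add(frozenset((i, k)))
        successors[ij].add(frozenset((j, k)))

    alive = set(nodes)
    round_of = {}
    rounds = 0
    while alive:
        # nodes whose outdegree is zero in the remaining graph
        sinks = {u for u in alive if not (successors[u] & alive)}
        if not sinks:
            raise ValueError("G_tau contains a directed cycle")
        for u in sinks:
            round_of[u] = rounds
        alive -= sinks
        rounds += 1

    # rounds == longest path length + 1, so the first round receives that value
    return {u: rounds - round_of[u] for u in nodes}
```

For τ = {kl|j, kl|i, jk|i, jl|i}, the triplet set used in Figure 4, the sketch assigns 3 to ij, ik and il, 2 to jk and jl, and 1 to kl.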

Let τ be a set of triplets that is consistent with a tree, and let Tτ denote the unique tree that is produced by the BUILD algorithm. Then Gτ is a DAG and $h_{G_\tau}$ is well-defined. The following theorem represents an upper bound for $h_{T_\tau}$ based on $h_{G_\tau}$.

Theorem 1. Let τ be a set of triplets that is consistent with a tree. Then $h_{T_\tau} \le h_{G_\tau}$.

Proof. The proof proceeds by induction on . It is trivial when  = 3. Assume that theorem holds when . Let  = k+1 and T1, T2, …, Tm be m subtrees which are obtained from Tτ by removing its root. For each i, , let , and ri be the root of Ti. By the induction assumption for each i, ,. Moreover we conclude from BUILD algorithm that , for . Thus , for . Also for i, , the maximum length of the longest path in Ti is . It means that for i, , the maximum length of the longest path in is at least . Therefore the length of the longest path in is at least . Let . We have two cases.

Case 1. For some i and j, , and . Since the outdegree of ab in is zero and cab = r, then .

Case 2. For some i, , . By the induction assumption for i, . Therefore  =  =   = . The last inequality is obtained by construction of from for i, .

So for each , and the proof is complete.

Now we describe an algorithm similar to BUILD algorithm, using height functions. We refer to this algorithm by HBUILD. Let h be a height function on X. Define a weighted complete graph (G,h) where V(G) = X and edge {i, j} has weight h(i,j). Remove the edges with maximum weight from G. If removing these edges results in a connected graph the algorithm stops. Otherwise, the process of removing the edges with maximum weight is continued in each connected component until each connected component contains only one node. At the end of this procedure one can reconstruct the tree by reversing the steps of the algorithm similar to BUILD algorithm (see Figure 4). The algorithm above decides in polynomial time whether a tree with height function h exists.
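A minimal sketch of HBUILD is given below. It assumes that h is a dictionary keyed by `frozenset({i, j})` and that edges removed at one stage stay removed in the components that follow; the names `hbuild` and `_components` are hypothetical, and the resulting tree is returned as nested tuples.

```python
from itertools import combinations


def _components(nodes, edges):
    """Connected components of the undirected graph (nodes, edges)."""
    adj = {v: set() for v in nodes}
    for edge in edges:
        u, v = tuple(edge)
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def hbuild(taxa, h):
    """Return a nested-tuple tree for the height function h, or None on failure."""
    def solve(nodes, edges):
        if len(nodes) == 1:
            return next(iter(nodes))
        if edges:
            top = max(h[e] for e in edges)
            edges = {e for e in edges if h[e] < top}  # drop maximum-weight edges
        comps = _components(nodes, edges)
        if len(comps) == 1:
            return None  # still connected: the algorithm stops
        subtrees = []
        for comp in comps:
            sub = solve(comp, {e for e in edges if e <= comp})
            if sub is None:
                return None
            subtrees.append(sub)
        return tuple(subtrees)

    taxa = set(taxa)
    return solve(taxa, {frozenset(pair) for pair in combinations(taxa, 2)})
```

With h(i,j) = h(i,k) = h(i,l) = 3, h(j,k) = h(j,l) = 2 and h(k,l) = 1, the sketch returns `('i', ('j', ('k', 'l')))`.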

Figure 4. The steps of constructing Tτ from the given set τ = {kl|j, kl|i, jk|i, jl|i}, using HBUILD.

(a) The graph Gτ, (b) The graph (G,h), (c) Removing maximum weights from the graph (G,h), (d) Constructing Tτ using step c.

https://doi.org/10.1371/journal.pone.0106531.g004

So if τ is a set of triplets which is consistent with a tree, then $G_\tau$ is a DAG, and with $h = h_{G_\tau}$ the HBUILD algorithm constructs a tree consistent with τ. Note that based on Theorem 1 the tree that is produced by HBUILD is exactly Tτ.

The HBUILD tree is not necessarily a binary tree. To obtain a binary tree consistent with a set of triplets, we use the following procedure.

Let T be a tree and x be a node of T with x1, x2, …, xk as its children. Consider a new node y. Construct a tree T′ by removing the edges (x, x1), (x, x2), …, (x, xk-1) from T and adding the edges (x, y), (y, x1), (y, x2), …, (y, xk-1) to T. Repeating this step for each node with outdegree more than 2 yields a binary tree, which we call a binarization of T (see Figure 5). Obviously, one can obtain different binarizations of T. Let τ be a set of triplets that is consistent with a tree T1, and T2 be a binarization of T1. Then τ is consistent with T2.
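The binarization step can be sketched for trees given as nested tuples. The function name `binarize` is hypothetical, and the sketch always keeps the last child x_k next to the new node y, which is only one of the possible binarizations.

```python
def binarize(tree):
    """Return one binarization of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return tree  # a leaf
    children = [binarize(child) for child in tree]

    def split(kids):
        if len(kids) <= 2:
            return tuple(kids)
        # the new node y adopts x_1..x_{k-1}; x keeps y and x_k
        return (split(kids[:-1]), kids[-1])

    return split(children)
```

For instance, `binarize(("a", "b", "c", "d"))` yields `((("a", "b"), "c"), "d")`.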

Figure 5. An example of binarization.

The binary tree is a binarization of the non-binary tree.

https://doi.org/10.1371/journal.pone.0106531.g005

In the remainder of this section we generalize the concept of height function from trees to networks. This generalization is not straightforward because the (lowest) common ancestor of two leaves of a network is not uniquely defined. Let N be a network with the root r and lN be the length of a longest directed path from r to the leaves. For each node u consider d(r,u) as the length of the longest directed path from r to u. For any two nodes u and v, we call u an ancestor of v if there is a directed path from u to v. If u is an ancestor of v then we say that v is lower than u. Let i and j be two leaves of N. A node c is called a lowest common ancestor of i and j in N if c is a common ancestor of i and j and there is no common ancestor of i and j lower than c. For any two leaves i and j, let Cij denote the set of all lowest common ancestors of i and j.

Definition 5. For each pair of leaves i and j, define hN(i,j) = min{lN − d(r,c): c ∈ Cij} and call it the height function of N.
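Definition 5 can be evaluated directly on a small network, as in the following sketch. The network is assumed to be given as a dictionary mapping each internal node to the list of its children; this representation and the name `network_height_function` are hypothetical.

```python
from itertools import combinations


def network_height_function(children, root):
    """Return h_N as a dict mapping frozenset({i, j}) to min over c in C_ij of l_N - d(r, c)."""
    # post-order traversal; its reverse is a topological order of the DAG
    order, seen = [], set()

    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for c in children.get(u, ()):
            visit(c)
        order.append(u)

    visit(root)

    # d(r, u): length of the longest directed path from the root to u
    depth = {u: 0 for u in order}
    for u in reversed(order):
        for c in children.get(u, ()):
            depth[c] = max(depth[c], depth[u] + 1)

    # proper descendants of every node (children are finished before parents)
    desc = {}
    for u in order:
        below = set()
        for c in children.get(u, ()):
            below.add(c)
            below |= desc[c]
        desc[u] = below

    leaves = [u for u in order if not children.get(u)]
    l_N = max(depth[leaf] for leaf in leaves)

    h = {}
    for i, j in combinations(leaves, 2):
        common = {u for u in order if i in desc[u] and j in desc[u]}
        # lowest common ancestors: no other common ancestor lies below them
        lowest = [c for c in common if not (desc[c] & common)]
        h[frozenset((i, j))] = min(l_N - depth[c] for c in lowest)
    return h
```

As a usage example, take the level-1 network with edges r→a, r→b, a→i, a→m, b→m, b→k and m→j, in which m is a reticulation node. The sketch gives h(i,j) = h(j,k) = 2 and h(i,k) = 3.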

Obviously, every network N indicates a unique height function hN. But two different networks may have the same height function (see Figure 6).

Figure 6. Two different networks with the same height function.

For the given network N and tree T, hN = hT = h. h(j,k) = 1, h(i,j) = h(i,k) = 2 and h(i,l) = h(j,l) = h(k,l) = 3.

https://doi.org/10.1371/journal.pone.0106531.g006

In the following proposition we prove that for a given height function h there is a network N such that hN = h+1.

Proposition 1. Let X be an arbitrary finite set and h be a height function on X. Then there exists a network N not necessarily binary, such that its leaves are distinctly labeled by X and hN = h+1.

Proof. Let X = {x1, x2, …, xn} and $h_{max} = \max\{h(x_i, x_j): 1 \le i < j \le n\}$. Let r be the root of N, and X′ = {x′1, x′2, …, x′n}. Consider n nodes that are distinctly labeled by the members of X′. For each pair of nodes x′i and x′j with h(xi, xj) = hmax, connect x′i and x′j to r by two paths of length hmax which have only the root in common. For each pair of nodes x′i and x′j with h(xi, xj) < hmax, consider a new node, connect x′i and x′j to this new node, and connect this node to r by a path of length hmax-h(xi, xj). For each node which is labeled by x′i, consider a new node as its child and label it by xi. The resulting network, whose leaves are distinctly labeled by X, satisfies the condition hN = h+1.

Note that the network N which is constructed in the proof of Proposition 1 is not necessarily a rooted phylogenetic network. To construct a rooted phylogenetic network N′ from N in such a way that if a triplet is consistent with N then it is consistent with N′, use the following procedure. Replace each path in which all inner nodes have indegree and outdegree one with a path of length one. The method of constructing N shows that if there is a node v with indegree d > 2, then it has just one child, which is a leaf. Let this child be labeled by x, and let its d parents be labeled by x1, x2, …, xd. Replace the edge which is connected to x with a path of length d-2 in such a way that its d-2 inner nodes from v to x are labeled with 1 to d-2. For each i, remove the edge xiv and connect xi to i. Do the binarization on the root. The resulting network N′ is consistent with all triplets which are consistent with N.

The following theorem shows the relation between the height function of a network and a triplet consistent with it.

Theorem 2. Let N be a network, i, j, and k be its three distinct leaves. If hN(i, j) < hN(i, k) or hN(i, j) < hN(j, k) then ij|k is consistent with N.

Proof. Suppose that hN(i, j) < hN(i, k). Let vij and vik be common ancestors of i, j and i, k respectively, such that hN(i, j) = lN-d(r,vij) and hN(i, k) = lN-d(r,vik). Let li and lj be two distinct paths from vij to i and j, respectively. Let lk be an arbitrary path from vik to k. If then it follows that which is a contradiction. So ij|k is consistent with N.

The converse of the above theorem is not necessarily true. For example, consider the network of Figure 7. The triplet ij|k is consistent with it, but h(i,j) = h(i,k) = 3 and h(j,k) = 2.

Figure 7. A counter example for the reverse of Theorem 2.

ij|k is consistent with the given network, but h(i,j) = h(i,k) = 3 and h(j,k) = 2.

https://doi.org/10.1371/journal.pone.0106531.g007

The basic idea of the TripNet algorithm is to find a height function as an intermediate computational step that yields the minimum amount of information required to construct the network from a set of triplets. So it is important to find a way of computing hN from a set of triplets. In the rest of this section we introduce a method for computing hN using Integer Programming. Let τ be a set of triplets with |L(τ)| = n. Inspired by the two inequalities that are consequences of Definition 3 and Theorem 2, for each triplet ij|k ∈ τ, define the two inequalities $h(i,j) < h(i,k)$ and $h(i,j) < h(j,k)$. Since the number of variables in such inequalities is at most $\binom{n}{2}$, we obtain the following system of inequalities from τ.

Let s be an integer. Define the following Integer Programming and call it IP(τ,s).
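One formulation of IP(τ,s) that agrees with the two inequalities per triplet and with Theorems 3 and 4 is sketched below. The objective (maximizing the sum of the heights) and the bounds $0 < h(i,j) \le s$ are assumptions inferred from those theorems, so this should be read as a plausible reconstruction rather than as the authors' exact system.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_{1 \le i < j \le n} h(i,j) \\
\text{subject to}\quad & h(i,k) - h(i,j) > 0, && ij|k \in \tau, \\
                       & h(j,k) - h(i,j) > 0, && ij|k \in \tau, \\
                       & 0 < h(i,j) \le s,    && 1 \le i < j \le n, \\
                       & h(i,j) \in \mathbb{Z}, && 1 \le i < j \le n.
\end{align*}
```

Under this reading, a single triplet ij|k already forces $s \ge 2$, because h(i,j) must be strictly smaller than both h(i,k) and h(j,k) while all three values lie in {1, …, s}.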

Intuitively if IP(τ,s) has a feasible solution, we expect that the optimal solution to this integer programming is an approximation of the height function of an optimal network N consistent with τ. The following theorems support this intuition.

Theorem 3. Let τ be a set of triplets. Then Gτ is a DAG if and only if for some integer s, IP(τ,s) has a feasible solution. In this case the minimum number s for which IP(τ,s) has a feasible solution is $l_{G_\tau}+1$.

Proof. Let Gτ be a DAG. Without loss of generality assume that Gτ is connected.

The proof proceeds by induction on $l_{G_\tau}$. If $l_{G_\tau} = 1$ then obviously for s = 1, IP(τ,s) has no feasible solution and for each $s \ge 2$, IP(τ,s) has a feasible solution. Assume that the theorem holds for $l_{G_\tau} = k$. Suppose that τ is a set of triplets with $l_{G_\tau} = k+1$. Let A be the set of the terminal nodes of all longest paths in Gτ. For each ij ∈ A there is some x ∈ L(τ) such that ix|j ∈ τ. Let B be the set of all such triplets and τ′ = τ\B. Apparently, B ≠ ∅ and the length of the longest path in Gτ′ is k. By the induction assumption the minimum number s for which IP(τ′,s) has a feasible solution is $l_{G_{\tau'}}+1 = l_{G_\tau}$; let h′ be such a feasible solution. Consider $\mathrm{IP}(\tau, l_{G_\tau}+1)$. Define $h(i,j) = l_{G_\tau}+1$ for each ij ∈ A and h(t,l) = h′(t,l) for each tl ∉ A. Then h is a feasible solution to $\mathrm{IP}(\tau, l_{G_\tau}+1)$. Now if IP(τ,s) has a feasible solution then IP(τ′,s-1) has a feasible solution. So $l_{G_\tau}+1$ is the minimum s for which IP(τ,s) has a feasible solution. Now suppose that τ is a set of triplets and for some integer s, IP(τ,s) has a feasible solution h. Assume that Gτ has a cycle $C: x_1x_2 \to x_2x_3 \to \dots \to x_{n-1}x_n \to x_1x_2$. Corresponding to C we have the inequalities $h(x_1,x_2) < h(x_2,x_3) < \dots < h(x_{n-1},x_n) < h(x_1,x_2)$, which is a contradiction, and the proof is complete.

Let τ be a set of triplets that is consistent with a tree or constructed from a given set of taxa using the TCD method. It was shown that Gτ is a DAG, and by Theorem 3, $h_{G_\tau}$ is a feasible solution to $\mathrm{IP}(\tau, l_{G_\tau}+1)$.

Theorem 4. Let τ be a set of triplets consistent with a tree. Then $h_{G_\tau}$ is the unique optimal solution to $\mathrm{IP}(\tau, l_{G_\tau}+1)$.

Proof. The graph Gτ is a DAG, since τ is consistent with a tree. So $h_{G_\tau}$ is well defined.

The proof proceeds by induction on $l_{G_\tau}$. Without loss of generality assume that Gτ is connected. The theorem is trivial when $l_{G_\tau} = 1$. Assume that for each set of triplets τ consistent with a tree and with $l_{G_\tau} = k \ge 1$, $h_{G_\tau}$ is the unique optimal solution to $\mathrm{IP}(\tau, l_{G_\tau}+1)$. Suppose that τ is a set of triplets consistent with a tree and $l_{G_\tau} = k+1$. Let τ′ be the set of triplets which is introduced in the proof of Theorem 3. By the induction assumption $h_{G_{\tau'}}$ is the unique optimal solution to $\mathrm{IP}(\tau', l_{G_{\tau'}}+1)$. By Theorem 3 the minimum s for which IP(τ, s) has a feasible solution is $l_{G_\tau}+1$. Also $l_{G_{\tau'}}+1 = l_{G_\tau}$. It follows that $h_{G_\tau}$ is the unique optimal solution to $\mathrm{IP}(\tau, l_{G_\tau}+1)$ and the proof is complete.

It is important to point out that the target function of the above IP can be replaced with other appropriate target functions. We use this particular target function for two reasons. First, a solution to this IP can be found in polynomial time when the input triplets are obtained from the TCD method. Second, this target function enables us to prove the above theorems, which show that the result of the TripNet algorithm is consistent with a tree whenever there is a tree consistent with the given triplets.

5.2 TripNet algorithm

Now we describe the TripNet algorithm in nine steps. In this algorithm the input is a set of triplets τ and the output is a network consistent with τ. Also if τ is consistent with a tree the algorithm constructs a binarization of Tτ.

Step 1. In this step we find a height function h on L(τ). If Gτ is a DAG we set G′τ = Gτ. If Gτ is not a DAG we remove some edges from Gτ in such a way that the resulting graph G′τ is a DAG. Set h to the height function obtained from G′τ.

If τ is obtained from a set of taxa using the TCD method, then Gτ is a DAG. Removing the minimum number of edges from a directed graph to make it a DAG is known as the minimum Feedback Arc Set problem, which is NP-hard [11]. Thus we use the following heuristic method, which tries to remove as few edges as possible from Gτ in order to lose as little information as possible. First a cycle C is selected randomly. Let Cmax denote the set of nodes in C with the maximum degree. Remove an edge of C one of whose ends belongs to Cmax. This process continues until the resulting graph is a DAG. Any information lost in this way will be recaptured in Step 9.
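The cycle-breaking heuristic of Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the graph is stored as a dictionary from each node to its set of successors, the degree of a node is taken as in-degree plus out-degree, and the cycle is found by depth-first search rather than selected randomly as in the text. The names `find_cycle`, `degree` and `break_cycles` are hypothetical and do not come from the TripNet software.

```python
from collections import defaultdict


def find_cycle(succ):
    """Return the nodes of one directed cycle in order, or None for a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE
    stack = []

    def dfs(u):
        color[u] = GREY
        stack.append(u)
        for v in sorted(succ.get(u, ())):
            if color[v] == GREY:  # back edge u -> v closes a cycle
                return stack[stack.index(v):]
            if color[v] == WHITE:
                cycle = dfs(v)
                if cycle:
                    return cycle
        stack.pop()
        color[u] = BLACK
        return None

    for u in list(succ):
        if color[u] == WHITE:
            cycle = dfs(u)
            if cycle:
                return cycle
    return None


def degree(succ, u):
    """In-degree plus out-degree of u (an assumption about 'degree')."""
    in_deg = sum(1 for vs in succ.values() if u in vs)
    return in_deg + len(succ.get(u, ()))


def break_cycles(succ):
    """Remove edges until the graph is a DAG; return (dag, removed_edges)."""
    g = {u: set(vs) for u, vs in succ.items()}
    removed = []
    while True:
        cycle = find_cycle(g)
        if cycle is None:
            return g, removed
        d_max = max(degree(g, u) for u in cycle)
        edges = list(zip(cycle, cycle[1:] + cycle[:1]))
        # First edge of the cycle with an end among the maximum-degree nodes.
        u, v = next(e for e in edges
                    if degree(g, e[0]) == d_max or degree(g, e[1]) == d_max)
        g[u].discard(v)
        removed.append((u, v))


example = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": {"a"}}
dag, removed = break_cycles(example)
```

In this small example the cycle a, b, c is found, node a has the largest degree, and a single edge incident to a is removed.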

Step 2. In this step TripNet first applies HBUILD to h. If the result is a tree, TripNet constructs a binarization of this tree. Otherwise TripNet goes to Step 3.

Note that if τ is consistent with a tree, TripNet constructs a binarization of Tτ.

Step 3. Remove all the maximum-weight edges from G. The process of removing all the maximum-weight edges from the graph continues until the resulting graph is disconnected.
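A minimal sketch of Step 3 is given below, assuming that the weighted graph is stored as a dictionary from unordered pairs (frozensets) of nodes to weights. The function names `components` and `split_by_max_weight` are illustrative only.

```python
def components(nodes, edges):
    """Connected components of an undirected graph given by frozenset edges."""
    adj = {u: set() for u in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, todo = set(), [start]
        while todo:
            u = todo.pop()
            if u in comp:
                continue
            comp.add(u)
            todo.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_by_max_weight(nodes, w):
    """Repeatedly drop all maximum-weight edges until the graph disconnects."""
    w = dict(w)
    comps = components(nodes, w)
    while len(comps) == 1 and w:
        w_max = max(w.values())
        w = {e: x for e, x in w.items() if x != w_max}
        comps = components(nodes, w)
    return comps


nodes = ["a", "b", "c", "d"]
weights = {
    frozenset("ab"): 1, frozenset("cd"): 1,
    frozenset("ac"): 3, frozenset("ad"): 3,
    frozenset("bc"): 2, frozenset("bd"): 2,
}
parts = split_by_max_weight(nodes, weights)
```

Here the edges of weight 3 are removed first, the graph is still connected, and removing the edges of weight 2 leaves the two components {a, b} and {c, d}.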

In [3] and [4] the authors introduced the concept of SN-sets for a set of triplets τ. A subset S of L(τ) is an SN-set if there is no triplet ij|k ∈ τ such that i ∉ S and j, k ∈ S. In [4] it is shown that if τ is dense then the maximal SN-sets partition L(τ) and can be found in polynomial time. By contracting each SN-set to a single node and assuming a common ancestor for all of its leaves, the size of the problem is reduced. In these papers, the authors rely on the high density of the input triplet sets to find the maximal SN-sets in polynomial time. The TripNet algorithm instead uses the height function as an auxiliary tool to obtain SN-sets, without the high density assumption.
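The SN-set condition can be checked directly from its definition. The sketch below assumes that a triplet ij|k is stored as the tuple (i, j, k); since ij|k and ji|k denote the same triplet, the condition is tested for both leaves of the cherry. The helper names `parse` and `is_sn_set` are hypothetical, and the triplet set is the one used in the example of Figure 8.

```python
def parse(s):
    """Turn a string such as 'ij|l' into the tuple ('i', 'j', 'l')."""
    cherry, out = s.split("|")
    return cherry[0], cherry[1], out


def is_sn_set(S, triplets):
    """True if no triplet ij|k has k and exactly one of i, j inside S."""
    S = set(S)
    for i, j, k in triplets:
        if k in S and ((i in S) != (j in S)):
            return False
    return True


TAU = [parse(s) for s in
       "ij|l jk|i kl|j kl|i no|m lo|k jl|o mn|l mn|j no|k "
       "mo|i jk|n ij|o ik|m il|n".split()]

print(is_sn_set({"n", "o"}, TAU))  # {n, o} satisfies the definition
print(is_sn_set({"m", "n"}, TAU))  # violated by the triplet no|m
```

The check takes time linear in the number of triplets for a given candidate set.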

Step 4. For each connected component obtained in Step 3 which is not an SN-set, we apply Step 3. This process continues until all of the resulting components are SN-sets. Let {S1, S2, …, Sk} be the set of resulting SN-sets. If each SN-set contains only one node, HBUILD is applied; if the result is a tree, TripNet constructs a binary tree and goes to Step 6, and otherwise TripNet goes to Step 5. If for some i, |Si|>1, contract each Si to a single node si and set S = {s1, s2, …, sk}. Update the set of triplets by defining τS = {sisj|sk: ∃ xy|z ∈ τ with x ∈ Si, y ∈ Sj and z ∈ Sk}. Construct a weighted complete graph (GS, wS) with V(GS) = S and wS(si, sj) = min {h(x, y): x ∈ Si and y ∈ Sj}. Set (G, w) = (GS, wS) and go to Step 3.
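The contraction at the end of Step 4 can be illustrated as follows. The sketch assumes that triplets are tuples (x, y, z) for xy|z, that h is a dictionary from unordered pairs of leaves to heights, and that a triplet contributes to τS only when its three leaves lie in three different SN-sets; the last point is an interpretation of the definition, not a statement from the text. Contracted nodes are represented by the index of their SN-set, and the name `contract` is hypothetical.

```python
from itertools import combinations


def contract(sn_sets, triplets, h):
    """Return (tau_S, w_S) for SN-sets given as a list of sets of leaves."""
    index = {x: a for a, S in enumerate(sn_sets) for x in S}

    tau_S = set()
    for x, y, z in triplets:
        a, b, c = index[x], index[y], index[z]
        if len({a, b, c}) == 3:  # assumed: only triplets spanning 3 SN-sets
            tau_S.add((min(a, b), max(a, b), c))

    w_S = {}
    for a, b in combinations(range(len(sn_sets)), 2):
        # w_S(s_a, s_b) is the minimum height over all cross pairs.
        w_S[(a, b)] = min(h[frozenset((x, y))]
                          for x in sn_sets[a] for y in sn_sets[b])
    return tau_S, w_S


sn_sets = [{"a"}, {"b"}, {"c", "d"}]
triplets = [("a", "b", "c"), ("c", "d", "a"), ("a", "b", "d")]
h = {
    frozenset("ab"): 1, frozenset("ac"): 3, frozenset("ad"): 4,
    frozenset("bc"): 2, frozenset("bd"): 5, frozenset("cd"): 1,
}
tau_S, w_S = contract(sn_sets, triplets, h)
```

In this toy input the triplets ab|c and ab|d collapse to the same contracted triplet, and cd|a disappears because c and d lie in the same SN-set.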

The following theorem is a consequence of the definition of an SN-set for (GS, wS).

Theorem 5. Applying Steps 3 and 4 on (GS, wS) and τS, each resulting SN-set has one member.

Proof. Suppose that S = {s1, s2, s3, …, sr} is an SN-set in (GS, wS). Now assume that in the procedure of Step 3 by removing the edges with weight l, S1 separates from S2. Thus there exists k > l such that by removing the edges with weight at least k in (GS, wS), the connected component S separates from other components of GS. It means that by removing the edges with weight at least k in G, we obtain the SN-set which is a contradiction.

In the next step the reticulation leaves are recognized using the following three criteria:

Criterion I. Let mi and Mi be the minimum and maximum weights of the edges in (G,h) with exactly one end in Si. Choose the node with minimum mi; if there is more than one node with minimum mi, choose among them the nodes which have minimum Mi. Let R1 denote the set of such nodes.

Criterion II. Let wmin be the minimum of w(si,sj) over all pairs of distinct nodes si, sj of GS. In GS consider the subgraph induced by the edges with weight wmin. Choose the nodes of R1 with the maximum degree in this induced subgraph. Let R2 denote the set of such nodes.

Criterion III. For each node s ∈ R2, remove it from GS and find the SN-sets of this new graph using Steps 3 and 4. Let ns be the number of SN-sets of this new graph with cardinality greater than one. Choose the nodes in R2 with maximum ns. Let R3 denote the set of such nodes.
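Criteria I and II can be sketched as two small filters. The code below assumes that h maps unordered pairs of leaves to heights, that SN-sets are indexed by position in a list, and that w_S maps pairs of indices to contracted weights; all function names are illustrative. Criterion III is not shown because it requires re-running Steps 3 and 4 on the reduced graph.

```python
def criterion_1(sn_sets, h):
    """Indices of SN-sets with minimum m_i, ties broken by minimum M_i."""
    keys = []
    for S in sn_sets:
        # Weights of edges of (G, h) with exactly one end in S.
        crossing = [wt for e, wt in h.items() if len(e & S) == 1]
        keys.append((min(crossing), max(crossing)))
    best = min(keys)  # lexicographic: first m_i, then M_i
    return {a for a, key in enumerate(keys) if key == best}


def criterion_2(R1, w_S):
    """Members of R1 with maximum degree among the minimum-weight edges."""
    w_min = min(w_S.values())
    deg = {a: 0 for a in R1}
    for (a, b), wt in w_S.items():
        if wt == w_min:
            for v in (a, b):
                if v in deg:
                    deg[v] += 1
    top = max(deg.values())
    return {a for a, d in deg.items() if d == top}


# Four singleton SN-sets, so h and w_S carry the same weights.
sn_sets = [{"a"}, {"b"}, {"c"}, {"d"}]
h = {
    frozenset("ab"): 1, frozenset("ac"): 1, frozenset("ad"): 3,
    frozenset("bc"): 2, frozenset("bd"): 3, frozenset("cd"): 3,
}
w_S = {(0, 1): 1, (0, 2): 1, (0, 3): 3, (1, 2): 2, (1, 3): 3, (2, 3): 3}

R1 = criterion_1(sn_sets, h)
R2 = criterion_2(R1, w_S)
```

In this toy input three nodes tie under Criterion I, and Criterion II keeps only the node that is incident to both minimum-weight edges.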

We state an example to show the idea behind these three criteria.

Let τ = {ij|l, jk|i, kl|j, kl|i, no|m, lo|k, jl|o, mn|l, mn|j, no|k, mo|i, jk|n, ij|o, ik|m, il|n}.

τ is not consistent with a tree but it is consistent with the network N shown in Figure 8a. Obviously, N is an optimal network consistent with τ. In order to find SN-sets we construct Gτ and (G, h), and find SN-sets from (G, h) using Steps 3 and 4 (Figures 8b to 8g). It follows that S = {{i}, {j}, {k}, {l}, {m}, {n, o}}. Now in GS (Figure 8h) we expect that the reticulation leaf is in R1. In this example both k and l are in R1. We also expect that if there is a reticulation leaf, it belongs to R2, and again both k and l are in R2. However, only l belongs to R3. Thus we consider l as the reticulation leaf (Figures 8i to 8n). Remove the triplets that contain l from τS and denote the new set of triplets by τ′S. Obviously τ′S is consistent with a tree. We add this reticulation leaf to a binarization of this tree such that the resulting network is consistent with τS. Note that if any node other than l is taken as the reticulation leaf, then the final network consistent with τS has at least two reticulation leaves.

Figure 8. An example showing how TripNet finds a reticulation leaf by applying Step 5.

Edges with weight 6 are shown by dotted lines. (a) τ = {ij|l, jk|i, kl|j, kl|i, no|m, lo|k, jl|o, mn|l, mn|j, no|k, mo|i, jk|n, ij|o, ik|m, il|n} is not consistent with a tree and is consistent with the given level-1 network, (b) G′τ is obtained from Gτ by removing the dotted line, (c) Compute (G, h), (d) Remove edges with weights 6 and 5 from (G, h) to obtain SN-sets {n, o} and {m}, (e) Remove edges with weights 4 and 3 from the remaining graph to obtain SN-set {i}, (f) Remove edges with weight 2 from the remaining graph to obtain SN-set {j}, (g) Remove edges with weight 1 from the remaining graph to obtain SN-sets {k} and {l}, (h) Compute GS; both SN-sets {k} and {l} satisfy Criteria I and II, (i) Remove {l} from GS, (j) Remove edges with weight 6 from the graph of the previous step to obtain SN-sets {i, j, k} and {m, n, o}, (k) Remove {k} from GS, (l) Remove edges with weights 6 and 5 from the graph of the previous step to obtain SN-sets {n, o} and {m}, (m) Remove edges with weight 4 from the remaining graph to obtain SN-set {l}, (n) Remove edges with weight 3 from the remaining graph to obtain SN-sets {i} and {j}. Steps (i) to (n), in which Criterion III is applied, show that l is the reticulation leaf.

https://doi.org/10.1371/journal.pone.0106531.g008

Step 5. In this step the reticulation leaf is recognized using the three criteria. Apply Criterion I. If |R1| = 1 then choose the node x ∈ R1 as the reticulation leaf. Otherwise, if |R1|>1, apply Criterion II. If |R2| = 1 then choose the node x ∈ R2 as the reticulation leaf. Otherwise, if |R2|>1, apply Criterion III. If |R3| = 1 then choose the node x ∈ R3 as the reticulation leaf. Otherwise, if |R3|>1, the reticulation leaf is chosen according to one of the following speed options.

Slow. Each node in R3 is examined as the reticulation leaf.

Normal. Two nodes in R3 are selected randomly and each of these two nodes is examined as the reticulation leaf.

Fast. One node in R3 is selected randomly as the reticulation leaf.

Let x be a node which is considered as a reticulation leaf. Remove x from GS and all of the triplets which contain x from τS. Define G = G \ {x} and go to Step 3.
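The selection logic of Step 5 can be summarized in a short sketch. It assumes that the sets R1, R2 and R3 have already been computed (when an earlier criterion is decisive, the later sets can simply repeat it) and that the speed option is passed as a string. The function name and the `rng` parameter are hypothetical; they are not part of the TripNet software.

```python
import random


def reticulation_candidates(R1, R2, R3, speed="fast", rng=random):
    """Return the list of nodes to examine as the reticulation leaf."""
    # The first criterion that leaves a single node decides.
    for R in (R1, R2, R3):
        if len(R) == 1:
            return list(R)
    candidates = sorted(R3)
    if speed == "slow":    # examine every node in R3
        return candidates
    if speed == "normal":  # examine two randomly selected nodes of R3
        return rng.sample(candidates, 2)
    if speed == "fast":    # examine one randomly selected node of R3
        return [rng.choice(candidates)]
    raise ValueError("speed must be 'slow', 'normal' or 'fast'")


rng = random.Random(0)
tied = {"k", "l", "m"}
slow = reticulation_candidates(tied, tied, tied, "slow", rng)
normal = reticulation_candidates(tied, tied, tied, "normal", rng)
fast = reticulation_candidates(tied, tied, tied, "fast", rng)
unique = reticulation_candidates({"k", "l"}, {"l"}, {"l"}, "slow", rng)
```

With the Slow option every returned candidate leads to a separate run of the remaining steps, which is why only the Fast option bounds the number of branches by one.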

Note that for the Fast option the running time of the algorithm is polynomial. For biological data, Criteria I and II almost always find a unique reticulation leaf, so on real data the running time of TripNet is almost always polynomial.

Step 6. Let x1, x2, …, xm be the m reticulation leaves obtained in Step 5, in this order, and let T be the tree constructed in Step 4. Add these m nodes to T in reverse order as follows. Let e1 and e2 be two edges of T. Insert two new nodes y1 and y2 in the middle of e1 and e2. Connect y1 and y2 to a new node y3 and connect the reticulation leaf xm to y3. Do this for all pairs of edges and choose a pair such that the resulting network is consistent with the maximum number of triplets in τ. Continue this procedure until all the reticulation leaves are added.

Step 7. For each SN-set Si, we run the algorithm again on the set of triplets of τ whose leaves all belong to Si.

Step 8. Replace each SN-set in the network of Step 6 with its related network constructed in Step 7 to obtain a network N′.

Let τ′ ⊆ τ be the set of the triplets which are not consistent with N′. For each pair of leaves a and b, let τ′ab be the set of triplets in τ′ which are of the form ab|c. Consider the pair of leaves i and j such that τ′ij has the maximum cardinality. Assume that pi and pj are the parents of i and j, respectively.

Step 9. Create two new nodes in the middle of the edges pi i and pj j and connect them with a new edge. This new edge creates a reticulation node, and all of the triplets in τ′ij become consistent with the new network. All triplets consistent with the new network are removed from τ′, and this procedure continues until τ′ becomes empty.
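The choice of the pair i, j used in Step 9 amounts to counting the inconsistent triplets by their cherry. The sketch below assumes that τ′ is given as a list of tuples (a, b, c) for ab|c; the function name is hypothetical, and ties are broken by first occurrence, which the text does not specify.

```python
from collections import Counter


def most_violated_pair(inconsistent):
    """Return (pair, count) for the cherry ab with the most triplets ab|c."""
    counts = Counter(frozenset((a, b)) for a, b, _ in inconsistent)
    pair, n = max(counts.items(), key=lambda kv: kv[1])
    return tuple(sorted(pair)), n


tau_prime = [("a", "b", "c"), ("b", "a", "d"), ("c", "d", "a")]
pair, n = most_violated_pair(tau_prime)
```

Here ab|c and ba|d share the same cherry, so the pair (a, b) is selected with two inconsistent triplets.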

Figure 9 presents an example of the algorithm with all of its steps.

Figure 9. Steps of TripNet for input triplets: {jk|i, li|j, mj|i, jn|i, kl|i, ik|m, ik|n, lm|i, ln|i, mn|i, kl|j, km|j, jn|k, lm|j, jl|n, mn|j, kl|m, kl|n, mn|k, mn|l}.

https://doi.org/10.1371/journal.pone.0106531.g009

Acknowledgments

The authors would like to thank M. Kargar for his kind and useful comments. This work is supported in part by a grant (BS-1392-1-05) from the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.

Author Contributions

Conceived and designed the experiments: CE HP RT. Performed the experiments: HP. Analyzed the data: CE HP RT. Contributed reagents/materials/analysis tools: CE HP RT. Wrote the paper: CE HP RT.

References

  1. Huson DH, Rupp R, Scornavacca C (2010) Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press.
  2. Aho AV, Sagiv Y, Szymanski TG, Ullman JD (1981) Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comp 10: 405–421.
  3. Jansson J, Nguyen NB, Sung WK (2006) Algorithms for combining rooted triplets into a galled phylogenetic network. SIAM Journal on Computing 35(5): 1098–1121.
  4. Jansson J, Sung WK (2006) Inferring a Level-1 Phylogenetic Network from a Dense Set of Rooted Triplets. Theoretical Computer Science 363: 60–68.
  5. Huber K, Iersel LV, Kelk S, Suchecki R (2010) A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
  6. Van Iersel L, Kelk S (2011) Constructing the simplest possible phylogenetic network from triplets. Algorithmica 60: 207–235.
  7. To TH, Habib M (2009) Level-k Phylogenetic Networks are Constructable from a Dense Triplet Set in Polynomial Time. In CPM09 5577: 275–288.
  8. Huson DH (1998) SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14(10): 68–73.
  9. Grassly N, Rambaut A (1997) Treevolve: a program to simulate the evolution of DNA sequences under different population dynamic scenarios. 1.3. Wellcome Centre for Infectious Disease, Department of Zoology, Oxford University, Oxford, UK.
  10. Van Iersel L, Keijsper J, Kelk S, Stougie L, Hagen F, et al. (2009) Constructing level-2 phylogenetic networks from triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6(4): 667–681.
  11. Karp R (1972) Reducibility among combinatorial problems. Proc. Sympos., IBM Thomas J. Watson Res. Center, Yorktown Heights, N.Y. 85–103.