TripNet: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets

Constructing an optimal rooted phylogenetic network from an arbitrary set of rooted triplets is an NP-hard problem. In this paper, we present a heuristic algorithm called TripNet, which tries to construct a rooted phylogenetic network with the minimum number of reticulation nodes from an arbitrary set of rooted triplets. Unlike current methods, which work only for dense sets of rooted triplets, TripNet is also applicable to non-dense sets of rooted triplets; this is its key innovation. We prove some theorems to clarify the performance of the algorithm. To demonstrate the efficiency of TripNet, we compared it with SIMPLISTIC, the only available software that can return some rooted phylogenetic network consistent with a given dense set of rooted triplets. The results show that for complex networks with high levels, the running time of SIMPLISTIC increases abruptly, whereas in all cases TripNet outputs an appropriate rooted phylogenetic network in an acceptable time. We also tested TripNet on the Yeast data. The results show that the TripNet network and the optimal network have the same clustering, and that TripNet produced a level-3 network which contains only one more reticulation node than the optimal network.


Introduction
Phylogenetic networks are a generalization of phylogenetic trees that permit the representation of non-tree-like underlying histories. A rooted phylogenetic network is a rooted directed acyclic graph in which no node has indegree greater than 2 and the outdegree of each node with indegree 2 is 1. Such nodes are called reticulation nodes. In rooted phylogenetic networks the nodes with indegree 1 and outdegree 0 are called leaves and are distinctly labeled by a set of given taxa. Mathematicians are interested in developing methods that infer a phylogenetic tree or network from basic building blocks. In the computation of a rooted tree or network, one group of the basic building blocks are rooted triplets, the rooted binary trees on three taxa [1].
In 1981, Aho et al. studied the problem of constructing a rooted tree from a set of rooted triplets [2]. They proposed the BUILD algorithm which, given a set of rooted triplets, either constructs in polynomial time a rooted tree that contains all the input triplets or decides that no such tree exists.
When there is no tree for a given set of triplets, one may try to produce an optimal phylogenetic network. In this context, the goal is to compute an optimal rooted phylogenetic network that contains all the rooted triplets. One possible optimality criterion is to minimize the level of the network, which is defined as the maximum number of reticulation nodes contained in any biconnected component of the network. The other optimality criterion is to minimize the number of reticulation nodes [1]. In [3] and [4] the authors considered the problem of deciding whether, given a set of rooted triplets as input, it is possible to construct a level-1 rooted phylogenetic network that contains all the input triplets. They showed that, in general, this problem is NP-hard. However, in [4] the authors showed that when the set of rooted triplets is dense, which means that for each set of three taxa there is at least one rooted triplet in the input set, the problem can be solved in polynomial time. Following these results, all research in this area has up to this point focused on constructing rooted phylogenetic networks from dense rooted triplet sets.
LEV1ATHAN is an algorithm for generating a level-1 rooted phylogenetic network from a set of rooted triplets [5]. Specifically, it attempts to find a level-1 rooted phylogenetic network that contains as many of the input rooted triplets as possible. This problem is NP-hard [5]. The algorithm of [6] can be used to find a level-1 or a level-2 rooted phylogenetic network which minimizes the number of reticulation nodes, if such a network exists. In [6] the authors also showed that for a dense set of rooted triplets t, if t is precisely equal to the set of rooted triplets contained in some rooted phylogenetic network, then such a rooted phylogenetic network with the smallest possible level can be constructed in time $O(|t|^{k+1})$, where k is a fixed upper bound on the level of the network. In addition, based on the ideas described in [6], the authors proposed the SIMPLISTIC algorithm which, for a given dense set of rooted triplets t, always returns some rooted phylogenetic network that contains t. However, it does not give any minimality guarantees.
In [7] the authors showed that, given a dense set of rooted triplets t and a fixed number k, it is possible in time $O(|t|^{k+1})$ either to construct a level-k rooted phylogenetic network that contains t or to decide that no such network exists.
In this paper we present a heuristic algorithm called TripNet for constructing rooted phylogenetic networks with the minimum number of reticulation nodes from an arbitrary set of rooted triplets. Unlike current methods, which work only for dense sets of rooted triplets, TripNet is also applicable to non-dense sets of rooted triplets; this is its key innovation.
The authors have applied TripNet to both real and simulated data (unpublished data). Here the TripNet algorithm is described in detail, some theorems are proved, and one simulation is performed to show the accuracy of TripNet. TripNet is also tested on the Yeast data. This paper is organized as follows. In section 2, we first present some definitions and notation. Then we describe the BUILD algorithm. Finally, a new method called TCD is introduced for constructing rooted triplets from (biological) sequences. In section 3 we compare TripNet with SIMPLISTIC on the triplet sets that are obtained from the TCD method. Then we test TripNet on the Yeast data. In section 4 we discuss the performance of TripNet. In the last section the directed graph $G_t$ related to a set of triplets t is introduced. We show that if a set of triplets is either obtained from a set of sequences using the TCD method or consistent with a tree, then $G_t$ is a DAG. This property plays a key role in solving, in polynomial time, the Integer Programming system introduced later. Then the concept of the height function of a rooted phylogenetic network is introduced, and an efficient method for obtaining a height function $h_t$ for a given set of rooted triplets t is explained. It is shown that consistency of a rooted phylogenetic network N with the height function $h_t$ can be a good alternative to consistency of N with t. To show this, we first define the Integer Programming system in such a way that its constraints intuitively force the consistency of N with t. Second, we show that if t is consistent with a tree T, then T is consistent with $h_t$ and T can be constructed using this height function. Finally, we present the TripNet algorithm.

Preliminaries
Here we first present some definitions and notation. Then we describe the BUILD algorithm. Finally, a new method called TCD is introduced for constructing rooted triplets from a set of sequences.

Definitions and notation
Let X be a set of taxa. A rooted phylogenetic tree (tree for short) on X is a rooted unordered leaf labeled tree whose leaves are distinctly labeled by X and every node which is not a leaf has at least outdegree two. A directed acyclic graph (DAG) is a directed graph that is free of directed cycles. A DAG G is connected if there is an undirected path between any two nodes of G. It is biconnected if it contains no node whose removal disconnects G. A biconnected component of a graph G is a maximal biconnected subgraph of G. A rooted phylogenetic network (network for short) on X is a rooted DAG in which the root has indegree 0 and outdegree 2 and every node except the root satisfies one of the following conditions: a) It has indegree 2 and outdegree 1. These nodes are called reticulation nodes. b) It has indegree 1 and outdegree 2. c) It has indegree 1 and outdegree 0. These nodes are called leaves and are distinctly labeled by X.
A reticulation leaf is a leaf whose parent is a reticulation node. A network is said to be a level-k network if each of its biconnected components contains at most k reticulation nodes. A tree can be considered as a level-0 network.
A rooted triplet (triplet for short) is a rooted binary unordered tree with three leaves. We use ij|k to denote a triplet with taxa i and j on one side and k on the other side of the root (Figure 1a). A set of triplets t is called dense if for each subset of three taxa, there is at least one triplet in t. A triplet ij|k is consistent with a network N, or equivalently N is consistent with ij|k, if the leaf set of ij|k is a subset of the leaf set of N and N contains a subdivision of ij|k, i.e. if N contains distinct nodes u and v and pairwise internally node-disjoint paths $u \to i$, $u \to j$, $v \to u$ and $v \to k$. Figure 1b shows an example of a network consistent with ij|k. A set t of triplets is consistent with a network N if all the triplets in t are consistent with N. We use the symbols $t(N)$ and $L_N$ to represent the set of all triplets that are consistent with N and the set of labels of its leaves, respectively. For any set t of triplets define $L(t) = \bigcup_{s \in t} L_s$, where $L_s$ is the leaf set of the triplet s. The set t is called a set of triplets on X if $L(t) = X$.

BUILD algorithm
Let t be a set of triplets. BUILD is a top-down algorithm that constructs a tree consistent with t if such a tree exists. The algorithm is guided by the Aho graph. Definition 1. (Aho graph) Let X be a set of taxa and t be a set of triplets on X. The Aho graph $AG(t) = (V, E)$ associated with t has node set $V = X$, and any two nodes i and j are connected by an edge in E if and only if there exists a triplet $ij|k \in t$ [1].
BUILD algorithm: Given a non-empty set of rooted triplets t on X, the aim is to construct a rooted phylogenetic tree T on X that is consistent with t, if one exists. If AG(t) has only one connected component, then the algorithm reports failure. Otherwise, for each node set U of a connected component of AG(t), determine the set $t|_U$ of all triplets in t whose leaves are in U, and recursively compute the rooted phylogenetic subtree $T(t|_U)$, i.e. the tree constructed by the BUILD algorithm consistent with $t|_U$. Finally, create a root node r and combine all computed subtrees by connecting r to the root of each of them [1]. For an example see Figure 2.
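The recursion described above can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration only: a triplet ij|k is the tuple `(i, j, k)`, a tree is a nested tuple of leaf labels, and the names `aho_components`, `build` and `clusters` are hypothetical helpers, not part of any published implementation.

```python
def aho_components(taxa, triplets):
    """Connected components of the Aho graph on `taxa`, found with union-find."""
    parent = {x: x for x in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _k in triplets:  # a triplet ij|k contributes the edge {i, j}
        parent[find(i)] = find(j)
    groups = {}
    for x in taxa:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def build(taxa, triplets):
    """Return a nested-tuple tree consistent with `triplets`, or None on failure."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    # restrict to the triplets whose three leaves all lie in the current taxon set
    local = [t for t in triplets if set(t) <= taxa]
    components = aho_components(taxa, local)
    if len(components) == 1:
        return None  # the Aho graph is connected, so no tree exists
    subtrees = [build(c, local) for c in components]
    return None if any(s is None for s in subtrees) else tuple(subtrees)


def clusters(tree):
    """Leaf sets below every node of a nested-tuple tree (order independent)."""
    if not isinstance(tree, tuple):
        return {frozenset([tree])}
    found, leaves = set(), set()
    for child in tree:
        sub = clusters(child)
        found |= sub
        leaves |= max(sub, key=len)  # the largest cluster is the child's leaf set
    found.add(frozenset(leaves))
    return found
```

For the triplets ab|c and cd|a on the taxa a, b, c, d, the sketch groups {a, b} and {c, d} under the root, while the conflicting pair ab|c and ac|b on three taxa makes it report failure by returning `None`.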

Triplets construction method
There exist different methods, such as Maximum Parsimony or Maximum Likelihood, for constructing triplets from (biological) sequences [6]. In this section a method for constructing triplets is presented. Suppose that X is a set of n taxa and $D = [D_{ij}]$ is an $n \times n$ distance matrix on X. For each three taxa $i, j, k \in X$, with entries $D_{ij}$, $D_{ik}$, and $D_{jk}$, we assign the triplet ij|k if $D_{ij} < \min\{D_{ik}, D_{jk}\}$. We name this method Triplets Construction with Distance, TCD for short. In this paper we use the TCD method for constructing triplets.
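As an illustration of the TCD rule, the following Python sketch enumerates all three-taxon subsets and emits ij|k whenever the distance of the pair ij is strictly smaller than both other distances. The data layout is an assumption: distances are stored in a dictionary keyed by `frozenset` pairs, and `tcd_triplets` is a hypothetical name.

```python
from itertools import combinations


def tcd_triplets(taxa, dist):
    """Triplets ij|k, as tuples (i, j, k), with D_ij < min(D_ik, D_jk)."""
    def d(a, b):
        return dist[frozenset((a, b))]

    triplets = []
    for a, b, c in combinations(taxa, 3):
        # try each of the three pairs as the closest pair of the triplet
        for i, j, k in ((a, b, c), (a, c, b), (b, c, a)):
            if d(i, j) < min(d(i, k), d(j, k)):
                triplets.append((i, j, k))
    return triplets
```

Because the inequality is strict, a three-taxon subset whose two smallest distances are equal yields no triplet at all, which is why a triplet set produced this way need not be dense.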

Results
In this section, to show the performance of TripNet on the triplet sets obtained from TCD, we compare TripNet with SIMPLISTIC. We also test TripNet on the Yeast data, the only published triplet data obtained from biological data.

Comparing SIMPLISTIC and TripNet
SIMPLISTIC is the only available software which has the ability to return some rooted phylogenetic network consistent with a given dense set of rooted triplets. But it does not give any minimality guarantees [6].
SplitsTree is a valuable tool for constructing a special kind of unrooted phylogenetic network from different types of input data. This program converts a given set of sequences X into a distance matrix $D^X$ to compute the resulting network. The distance matrix $D^X$ is reported as one of the outputs of SplitsTree [8].
Let $t_{D^X}$ be the set of triplets obtained from $D^X$ using TCD, and consider it as the input for TripNet. Note that $t_{D^X}$ is not necessarily dense, since for some three taxa i, j, and k we might have $D^X_{ij} = D^X_{jk} < D^X_{ik}$. In this case one of the triplets ij|k or jk|i is assigned to i, j, and k to obtain a dense set of triplets $t^X_{dense}$ as the input of SIMPLISTIC. Also, if $D^X_{ij} = D^X_{jk} = D^X_{ik}$, then one of the three possible triplets on i, j and k is assigned to them at random.
To perform the simulation, we generated 160 different sets of sequences using TREEVOLVE. TREEVOLVE is a software package which simulates the evolution of DNA sequences under a coalescent model [9]. TREEVOLVE has many adjustable input parameters. In this study we adjust the Number of samples, the Number of sequences, and the Length of sequence, and keep the default values for the other parameters. The Number of sequences is 10, 20, 30, and 40. For each value of the Number of sequences, the Length of sequence is 100, 200, 300, and 400. For each case the Number of samples is set to 10.
We ran both methods on a PC with an Intel Dual-Core processor running at 1.80 GHz. We set a running time limit of 6 hours for both methods. Let $N_{finite}$ be the set of networks for which the running time is less than 6 hours.
The results of the comparison between TripNet and SIMPLISTIC on the three most important parameters, i.e. the running time of both methods, the number of reticulation nodes, and the level of the final networks, are shown in Table 1.
The results show that when the number of input taxa is 10, both methods always return a network in at most one second. For 20 input taxa, in 5% of the cases SIMPLISTIC returns no result in less than 6 hours. For the remaining 95% of the cases, the SIMPLISTIC running time is on average 306 seconds, while in all cases the TripNet running time is on average at most 2 seconds. When this parameter is increased to 30, in 67.5% of the cases SIMPLISTIC is unable to return a network in less than 6 hours. Moreover, when this parameter is set to 40, in all cases SIMPLISTIC fails to return any network in less than 6 hours, while on average TripNet outputs a network in 775 seconds. In total, over all 160 input triplet sets TripNet outputs a network in less than 250 seconds on average, while for the 57% of cases in which the SIMPLISTIC network belongs to $N_{finite}$, the average running time is close to 750 seconds.
The results also show that in all cases the number of reticulation nodes and the level of the TripNet networks are lower than those of the SIMPLISTIC networks. Note that for 40 input taxa, the average number of reticulation nodes and the average level of the TripNet networks are 15.825 and 15.25, respectively, while for these data SIMPLISTIC cannot return any network in less than 6 hours.

Yeast data
The Yeast data is a dense set of triplets generated using real yeast data, obtained from the Fungal Biodiversity Center in Utrecht. This data set, which contains information about 21 species, is available online (http://skelk.sdf-eu.org/level2triplets.html). Based on the algorithm developed in [10], Steven Kelk has developed a software application, called LEVEL2, for constructing level-2 networks from dense sets of triplets. LEVEL2 is not applicable to general triplet sets, and it produces a network only if there exists a level-2 network consistent with the input triplets. However, LEVEL2 has the advantage that it always produces the best possible network, which also minimizes the number of reticulation nodes. The LEVEL2 network for the Yeast data is a 21-leaf level-2 network, given in Figure 3a [10]. As our only chance to compare TripNet networks with the best possible networks, we repeated the analysis of the Yeast data using TripNet. The TripNet network for the Yeast dataset is given in Figure 3b. As one can see, TripNet produced a level-3 network which contains only one more reticulation node than the network obtained by LEVEL2. The running time of both algorithms is nearly one second.

Discussion
In this paper we introduced TripNet, a software tool that can return some network consistent with an arbitrary given set of triplets. TripNet and supplementary files are freely available for download at (www.bioinf.cs.ipm.ir/software/tripnet). Unlike previous methods, which only work on dense triplet sets, our method works on any set of triplets. Some theorems were proved to clarify the rationale behind the steps of TripNet. In this paper the TCD method was introduced for constructing triplets. In order to study the performance of TripNet on the triplets obtained from the TCD method, we performed a simulation on 160 different sets of triplets and compared TripNet with SIMPLISTIC.
The results showed that in all 160 cases TripNet outputs an appropriate network in an acceptable time, while in just 57.5% of these cases SIMPLISTIC is able to return some network in less than 6 hours. Also, on average TripNet outperforms SIMPLISTIC in all cases on the number of reticulation nodes and the level of the output network.
Moreover, as the number of input taxa increases, the running time of SIMPLISTIC increases abruptly, such that for 40 input taxa it could not return any network in less than 6 hours.
These results showed that for large input data obtained from the TCD method, SIMPLISTIC is not a practical method for constructing networks, while TripNet works well in all cases.
To establish the performance of TripNet on real datasets, we tested TripNet on the Yeast data and compared our results with those of LEVEL2. For the Yeast data TripNet produced a level-3 network which contains only one more reticulation node than the optimal network obtained by LEVEL2. Both networks have the same clustering and represent the same evolutionary relationship between taxa. While TripNet has been designed for general triplet sets (not necessarily dense or consistent with a restricted level network), this example shows that the network produced by TripNet is very close to the best possible solution.
[Figure 8, caption fragment: The steps i to n show that l is the reticulation leaf. In these steps criterion III is applied. doi:10.1371/journal.pone.0106531.g008]
Figure 9. Steps of TripNet for input triplets: {jk|i, li|j, mj|i, jn|i, kl|i, ik|m, ik|n, lm|i, ln|i, mn|i, kl|j, km|j, jn|k, lm|j, jl|n, mn|j, kl|m, kl|n, mn|k, mn|l}. doi:10.1371/journal.pone.0106531.g009

Materials and Methods
In this section we prove some theorems to clarify the rationale behind the steps of TripNet. Then TripNet is presented in nine steps.

The directed graph related to a set of triplets and height function
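Throughout this subsection we write ij for the pair $\{i, j\}$. Let t be a set of triplets. Define $G_t$, the directed graph related to t, by $V(G_t) = \{ij : i, j \in L(t),\ i \neq j\}$ and $E(G_t) = \{(ij, ik) : ij|k \in t\} \cup \{(ij, jk) : ij|k \in t\}$. In the following we present some basic properties of $G_t$.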
In what follows the height function of a tree is introduced. Let $\binom{X}{2}$ denote the set of all subsets of X of size 2.
Definition 2. Let X be an arbitrary finite set. A function $h: \binom{X}{2} \to \mathbb{N}$ is called a height function on X. Let T be a rooted tree with root r, let $c_{ij}$ be the lowest common ancestor of the leaves i and j, and let $l_T$ denote the length of a longest path starting at r. Definition 3. The height function of T, $h_T$, is defined as $h_T(i,j) = l_T - d_T(r, c_{ij})$, where i and j are two distinct leaves of T and $d_T(r, c_{ij})$ denotes the length of the path between r and $c_{ij}$.
Let T be a tree. The definition above implies that a triplet ij|k is consistent with T if and only if $h_T(i,j) < h_T(i,k)$ or $h_T(i,j) < h_T(j,k)$.
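To make the tree height function concrete, here is a small Python sketch that computes $h_T$ for a tree given as nested tuples and then applies the consistency criterion stated above. The nested-tuple representation and the names `longest`, `leaves`, `tree_height` and `consistent` are assumptions chosen for illustration.

```python
def longest(t):
    """Length of a longest downward path starting at the root of t."""
    return 1 + max(longest(c) for c in t) if isinstance(t, tuple) else 0


def leaves(t):
    """Leaf labels of a nested-tuple tree."""
    return [x for c in t for x in leaves(c)] if isinstance(t, tuple) else [t]


def tree_height(tree):
    """Return {frozenset({i, j}): l_T - d_T(r, c_ij)} for all leaf pairs."""
    l_T = longest(tree)
    h = {}

    def walk(t, depth):
        if not isinstance(t, tuple):
            return
        groups = [leaves(c) for c in t]
        # leaves taken from two different children of t have t as lowest common ancestor
        for n, g1 in enumerate(groups):
            for g2 in groups[n + 1:]:
                for a in g1:
                    for b in g2:
                        h[frozenset((a, b))] = l_T - depth
        for c in t:
            walk(c, depth + 1)

    walk(tree, 0)
    return h


def consistent(h, triplet):
    """Criterion from the text: h(i,j) < h(i,k) or h(i,j) < h(j,k)."""
    i, j, k = triplet
    p = lambda a, b: h[frozenset((a, b))]
    return p(i, j) < p(i, k) or p(i, j) < p(j, k)
```

For the tree `(('a', 'b'), 'c')` the pair ab receives height 1 and the pairs ac and bc receive height 2, so ab|c passes the test while ac|b does not.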
Let $X = \{x_1, x_2, \ldots, x_m\}$ be a finite set, D be a distance matrix on X, and t be the set of triplets on X obtained from the TCD method using D. Suppose that $G_t$ contains a cycle $x_1x_2 \to x_2x_3 \to \cdots \to x_{n-1}x_n \to x_1x_2$. Then $D_{x_1x_2} < D_{x_2x_3} < \cdots < D_{x_{n-1}x_n} < D_{x_1x_2}$, which is a contradiction. So $G_t$ is a DAG.
Moreover, if t is a triplet set consistent with a tree T, then $G_t$ is a DAG. This is so because if $G_t$ contains a cycle $x_1x_2 \to x_2x_3 \to \cdots \to x_{n-1}x_n \to x_1x_2$, then $h_T(x_1,x_2) < h_T(x_2,x_3) < \cdots < h_T(x_{n-1},x_n) < h_T(x_1,x_2)$, which is a contradiction.
The height function of a DAG is introduced as follows. Let t be a set of triplets such that $G_t$ is a DAG, and let $l_{G_t}$ denote the length of the longest path in $G_t$. Since $G_t$ is a DAG, the set of nodes with outdegree zero is nonempty. Assign $l_{G_t}+1$ to the nodes with outdegree zero and remove them from $G_t$. Assign $l_{G_t}$ to the nodes with outdegree zero in the resulting graph, and continue this procedure until all nodes are removed.
Definition 4. For any two distinct $i, j \in L(t)$, define $h_{G_t}(i,j)$ as the value that is assigned by the above procedure to the node ij, and call $h_{G_t}$ the height function related to $G_t$.
Let t be a set of triplets that is consistent with a tree, and let $T_t$ denote the unique tree that is produced by the BUILD algorithm. Then $G_t$ is a DAG and $h_{G_t}$ is well-defined. The following theorem gives an upper bound for $h_{T_t}$ based on $h_{G_t}$. Theorem 1. Let t be a set of triplets that is consistent with a tree. Then $h_{T_t} \le h_{G_t}$.
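The peeling procedure behind Definition 4 can be sketched in Python as follows. The sketch assumes that a triplet ij|k contributes the edges ij → ik and ij → jk to the graph, which is the orientation suggested by the cycle arguments above; the function names `triplet_graph` and `dag_height` are hypothetical.

```python
def triplet_graph(triplets):
    """Successor sets of the pair graph: ij|k yields ij -> ik and ij -> jk (assumed)."""
    pair = lambda a, b: frozenset((a, b))
    labels = {x for t in triplets for x in t}
    succ = {pair(a, b): set() for a in labels for b in labels if a != b}
    for i, j, k in triplets:
        succ[pair(i, j)].update((pair(i, k), pair(j, k)))
    return succ


def dag_height(succ):
    """Peel off sinks layer by layer; return None if a directed cycle remains."""
    remaining = {v: set(ws) for v, ws in succ.items()}
    layers = []
    while remaining:
        sinks = {v for v, ws in remaining.items() if not ws}
        if not sinks:
            return None  # every remaining node has an out-edge, so a cycle exists
        layers.append(sinks)
        for v in sinks:
            del remaining[v]
        for ws in remaining.values():
            ws -= sinks
    longest = len(layers) - 1  # length of the longest path, counted in edges
    # the first layer removed gets longest + 1, the next one longest, and so on
    return {v: longest + 1 - r for r, layer in enumerate(layers) for v in layer}
```

With the single triplet ab|c, the nodes ac and bc are removed first and receive the value 2, and the node ab is removed next and receives the value 1. Two conflicting triplets such as ab|c and ac|b create a cycle between ab and ac, and the sketch returns `None`.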
Proof. The proof proceeds by induction on $|L_{T_t}|$. It is trivial when $|L_{T_t}| = 3$. Assume that the theorem holds when $|L_{T_t}| \le k$. Let $|L_{T_t}| = k+1$ and let $T_1, T_2, \ldots, T_m$ be the m subtrees obtained from $T_t$ by removing its root. For each i, $1 \le i \le m$, let $t_i = t|_{L_{T_i}}$ and let $r_i$ be the root of $T_i$. By the induction assumption, for each i, $1 \le i \le m$, $h_{T_{t_i}} \le h_{G_{t_i}}$. Moreover, we conclude from the BUILD algorithm that $T_i = T_{t_i}$ for $1 \le i \le m$. Thus $h_{T_i} \le h_{G_{t_i}}$ for $1 \le i \le m$. Also, for $1 \le i \le m$, the maximum length of the longest path in $T_i$ is $l_{T_t} - 1$. It means that, for $1 \le i \le m$, the maximum length of the longest path in $G_{t_i}$ is at least $l_{T_t} - 2$. Therefore the length of the longest path in $G_t$ is at least $l_{T_t} - 1$. Let $a, b \in L_{T_t}$. We have two cases. Case 1. For some i and j, $1 \le i < j \le m$, $a \in L_{T_i}$ and $b \in L_{T_j}$. Since the outdegree of ab in $G_t$ is zero and $c_{ab} = r$, we have $h_{T_t}(a,b) = l_{T_t} \le h_{G_t}(a,b)$.
Case 2. For some i, $1 \le i \le m$, $a, b \in L_{T_i}$. By the induction assumption, $h_{T_{t_i}}(a,b) \le h_{G_{t_i}}(a,b)$. Therefore $h_{T_t}(a,b) \le h_{G_t}(a,b)$; this inequality is obtained by the construction of $G_t$ from the graphs $G_{t_i}$, $1 \le i \le m$.
So for each $a, b \in L_{T_t}$, $h_{T_t}(a,b) \le h_{G_t}(a,b)$, and the proof is complete. Now we describe an algorithm similar to the BUILD algorithm, using height functions. We refer to this algorithm as HBUILD. Let h be a height function on X. Define a weighted complete graph (G,h) where V(G) = X and the edge {i, j} has weight h(i,j). Remove the edges with maximum weight from G. If removing these edges results in a connected graph, the algorithm stops. Otherwise, the process of removing the edges with maximum weight is continued in each connected component until each connected component contains only one node. At the end of this procedure one can reconstruct the tree by reversing the steps of the algorithm, similarly to the BUILD algorithm (see Figure 4). The algorithm above decides in polynomial time whether a tree with height function h exists.
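The HBUILD procedure can be sketched recursively in Python. The sketch assumes that the height function is a dictionary keyed by `frozenset` pairs, that taxon labels are sortable (sorting is used only to make the output order reproducible), and that a tree is returned as nested tuples; `hbuild` is a hypothetical name.

```python
def hbuild(taxa, h):
    """Tree (nested tuples) built from a height function h, or None if HBUILD stops."""
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]
    pairs = [(a, b) for n, a in enumerate(taxa) for b in taxa[n + 1:]]
    top = max(h[frozenset(p)] for p in pairs)
    # connected components that remain after deleting all maximum-weight edges
    comp = {x: {x} for x in taxa}
    for a, b in pairs:
        if h[frozenset((a, b))] < top and comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for x in merged:
                comp[x] = merged
    parts = sorted({frozenset(c) for c in comp.values()}, key=min)
    if len(parts) == 1:
        return None  # the graph is still connected, so the algorithm stops
    subtrees = [hbuild(part, h) for part in parts]
    return None if any(s is None for s in subtrees) else tuple(subtrees)
```

Each recursive call removes the maximum-weight edges inside one connected component, and the nesting of the returned tuples mirrors the reversal of the removal steps described above.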
So if t is a set of triplets which is consistent with a tree, then $G_t$ is a DAG, $h_{T_t}(a,b) \le h_{G_t}(a,b)$, and with $h = h_{G_t}$ the HBUILD algorithm constructs a tree consistent with t. Note that, based on Theorem 1, the tree produced by HBUILD is exactly $T_t$.
The HBUILD tree is not necessarily a binary tree. To obtain a binary tree consistent with a set of triplets, we apply the following procedure.
Let T be a tree and x be a node of T with children $x_1, x_2, \ldots, x_k$, $k \ge 3$. Consider a new node y. Construct $T'$ by removing the edges $(x, x_1), (x, x_2), \ldots, (x, x_{k-1})$ from T and adding the edges $(x, y), (y, x_1), (y, x_2), \ldots, (y, x_{k-1})$ to T. Continuing in the same way for each node with outdegree more than 2, a binary tree is obtained, which we call a binarization of T (see Figure 5). Obviously, one can obtain different binarizations of T. Let t be a set of triplets that is consistent with a tree $T_1$, and let $T_2$ be a binarization of $T_1$. Then t is consistent with $T_2$.
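A minimal Python sketch of one possible binarization follows, assuming trees are stored as nested tuples of leaf labels. It always groups the first k-1 children under the new node, which is just one of the many valid choices; `binarize` is a hypothetical name.

```python
def binarize(tree):
    """One binarization of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return tree
    children = [binarize(c) for c in tree]
    if len(children) <= 2:
        return tuple(children)
    # the new node y adopts x_1 .. x_{k-1}; x keeps the two children y and x_k
    y = binarize(tuple(children[:-1]))
    return (y, children[-1])
```

Applied to a root with four leaf children a, b, c, d, the sketch produces the caterpillar `((('a', 'b'), 'c'), 'd')`.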
In the remainder of this section we generalize the concept of height function from trees to networks. This generalization is not straightforward because the (lowest) common ancestor of two leaves of a network is not uniquely defined. Let N be a network with root r and let $l_N$ be the length of a longest directed path from r to the leaves. For each node u let d(r,u) be the length of the longest directed path from r to u. For any two nodes u and v, we call u an ancestor of v if there is a directed path from u to v. If u is an ancestor of v then we say that v is lower than u. Let i and j be two leaves of N. A node c is called a lowest common ancestor of i and j in N if c is a common ancestor of i and j and there is no common ancestor of i and j lower than c. For any two leaves i and j, let $C_{ij}$ denote the set of all lowest common ancestors of i and j.
Definition 5. For each pair of leaves i and j, define $h_N(i,j) = \min\{l_N - d(r,c) : c \in C_{ij}\}$ and call $h_N$ the height function of N.
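Definition 5 can be evaluated directly on a small network. The Python sketch below assumes the network is given as a dictionary mapping each internal node to the list of its children, with every node reachable from the root; the name `network_height` and the example network used in the note are illustrative assumptions.

```python
from itertools import combinations


def network_height(children, root):
    """Return {frozenset({i, j}): h_N(i, j)} for all pairs of leaves of a rooted DAG."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    kids = {u: list(children.get(u, ())) for u in nodes}

    # topological order obtained as the reverse of a DFS post-order
    order, seen = [], set()

    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for c in kids[u]:
            visit(c)
        order.append(u)

    visit(root)
    order.reverse()

    # depth[u] = d(r, u), the length of the longest directed path from the root
    depth = {u: 0 for u in order}
    for u in order:
        for c in kids[u]:
            depth[c] = max(depth[c], depth[u] + 1)

    # below[u] = nodes reachable from u, with u included
    below = {}
    for u in reversed(order):  # children are handled before their parents
        below[u] = {u}.union(*(below[c] for c in kids[u]))

    leaf_nodes = [u for u in order if not kids[u]]
    l_N = max(depth[x] for x in leaf_nodes)
    h = {}
    for i, j in combinations(leaf_nodes, 2):
        common = [u for u in order if i in below[u] and j in below[u]]
        # a common ancestor is lowest if no other common ancestor lies below it
        lowest = [c for c in common
                  if not any(d != c and d in below[c] for d in common)]
        h[frozenset((i, j))] = min(l_N - depth[c] for c in lowest)
    return h
```

As an example, take the network with edges r → u, r → v, u → a, u → m, v → m, v → c and m → b, in which m is a reticulation node. Here $l_N = 3$, the pairs ab and bc have lowest common ancestors u and v at depth 1, and the pair ac has only the root as common ancestor.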
Obviously, every network N determines a unique height function $h_N$. But two different networks may have the same height function (see Figure 6).
In the following proposition we prove that for a given height function h there is a network N such that h N = h+1. Proposition 1. Let X be an arbitrary finite set and h be a height function on X. Then there exists a network N not necessarily binary, such that its leaves are distinctly labeled by X and h N = h+1.
Proof. Let $X = \{x_1, x_2, \ldots, x_n\}$ and $h_{max} = \max\{h(x_i, x_j) : 1 \le i, j \le n\}$. Let r be the root of N, and let $X' = \{x'_1, x'_2, \ldots, x'_n\}$. Consider n nodes that are distinctly labeled by the members of $X'$. For each pair of nodes $x_i$ and $x_j$ with $h(x_i, x_j) = h_{max}$, connect $x'_i$ and $x'_j$ to r by two paths of length $h_{max}$ which have only the root in common. For each pair of nodes $x_i$ and $x_j$ with $h(x_i, x_j) < h_{max}$, consider a new node, connect $x'_i$ and $x'_j$ to this new node, and connect this node to r by a path of length $h_{max} - h(x_i, x_j)$. For each node which is labeled by $x'_i$, consider a new node as its child and label it by $x_i$. The resulting network, whose leaves are distinctly labeled by X, satisfies the condition $h_N = h+1$.
Note that the network N constructed in the proof of Proposition 1 is not necessarily a rooted phylogenetic network. To construct a rooted phylogenetic network $N'$ from N in such a way that every triplet consistent with N is also consistent with $N'$, apply the following procedure. Replace each path all of whose inner nodes have indegree and outdegree one with a path of length one. The method of constructing N shows that if there is a node v with indegree $d \ge 2$, then it has just one child, which is a leaf. Suppose that this child is labeled x, that $d \ge 3$, and that the d parents of v are labeled $x_1, x_2, \ldots, x_d$. Replace the edge which is connected to x with a path of length d-2 in such a way that its d-2 inner nodes from v to x are labeled with 1 to d-2. For each i, $1 \le i \le d-2$, remove the edge $x_i v$ and connect $x_i$ to i. Do the binarization on the root. The resulting network $N'$ is consistent with all triplets which are consistent with N.
The following theorem shows the relation between the height function of a network and a triplet consistent with it. Theorem 2. Let N be a network and let i, j, and k be three distinct leaves of N. If $h_N(i,j) < h_N(i,k)$ or $h_N(i,j) < h_N(j,k)$, then ij|k is consistent with N. Proof. Suppose that $h_N(i,j) < h_N(i,k)$. Let $v_{ij}$ and $v_{ik}$ be common ancestors of i, j and i, k, respectively, such that $h_N(i,j) = l_N - d(r, v_{ij})$ and $h_N(i,k) = l_N - d(r, v_{ik})$. Let $l_i$ and $l_j$ be two distinct paths from $v_{ij}$ to i and j, respectively. Let $l_k$ be an arbitrary path from $v_{ik}$ to k. If $l_i \cap l_k \neq \emptyset$ then it follows that $h_N(i,j) \ge h_N(i,k)$, which is a contradiction. So ij|k is consistent with N.
The converse of the above theorem is not necessarily true. For example, consider the network of Figure 7. The triplet ij|k is consistent with it, but $h(i,j) = h(i,k) = 3$ and $h(j,k) = 2$.
The basic idea of the TripNet algorithm is to find a height function as an intermediate computational step that yields the minimum amount of information required to construct the network from a set of triplets. So it is important to find a way of computing $h_N$ from a set of triplets. In the rest of this section we introduce a method for computing $h_N$ using Integer Programming. Let t be a set of triplets with $|L(t)| = n$. Inspired by the two inequalities that are consequences of Definition 3 and Theorem 2, for each triplet $ij|k \in t$ define the two inequalities $h(i,k) - h(i,j) \ge 1$ and $h(j,k) - h(i,j) \ge 1$. Since the number of variables in such inequalities is at most $\left|\binom{L(t)}{2}\right|$, we obtain the following system of inequalities from t.
\[
\begin{aligned}
h(i,k) - h(i,j) &\ge 1, && ij|k \in t,\\
h(j,k) - h(i,j) &\ge 1, && ij|k \in t,\\
0 < h(i,j) &\le \left|\tbinom{L(t)}{2}\right|, && 1 \le i, j \le n.
\end{aligned}
\]
Let s be an integer. Define the following Integer Programming and call it IP(t,s).
\[
\begin{aligned}
\text{Maximize}\quad & \sum_{1 \le i, j \le n} h(i,j)\\
\text{subject to}\quad & h(i,k) - h(i,j) \ge 1, && ij|k \in t,\\
& h(j,k) - h(i,j) \ge 1, && ij|k \in t,\\
& 0 < h(i,j) \le s, && 1 \le i, j \le n.
\end{aligned}
\]
Intuitively, if IP(t,s) has a feasible solution, we expect that the optimal solution to this integer program is an approximation of the height function of an optimal network N consistent with t.
The following theorems support this intuition. Theorem 3. Let t be a set of triplets. Then G t is a DAG if and only if for some integer s, the IP(t,s) has a feasible solution. In this case the minimum number s, for which IP(t,s) has a feasible solution, is l Gt +1.
Proof. Let G t be a DAG. Without loss of generality assume that G t is connected.
The proof proceeds by induction on $l_{G_t}$. If $l_{G_t} = 1$ then obviously for $s = 1$, IP(t,s) has no feasible solution and for each $s \ge 2$, IP(t,s) has a feasible solution. Assume that the theorem holds for $l_{G_t} \le k$. Suppose that t is a set of triplets with $l_{G_t} = k+1$. Let A be the set of the terminal nodes of all longest paths in $G_t$. For each $ij \in A$ there is some $x \in L(t)$ such that $ix|j \in t$. Let B be the set of all such triplets and $t' = t \setminus B$. Clearly $B \neq \emptyset$ and the length of the longest path in $G_{t'}$ is k. By the induction assumption the minimum number s for which IP(t',s) has a feasible solution is $l_{G_{t'}} + 1 = l_{G_t}$; let $h'$ be such a feasible solution. Consider IP$(t, l_{G_t}+1)$. Define $h(i,j) = l_{G_t} + 1$ for each $ij \in A$ and $h(p,q) = h'(p,q)$ for each $pq \notin A$. Then h is a feasible solution to IP$(t, l_{G_t}+1)$. Now if IP(t,s) has a feasible solution then IP(t',s-1) has a feasible solution. So $l_{G_t}+1$ is the minimum s for which IP(t,s) has a feasible solution. Conversely, suppose that t is a set of triplets and for some integer s, IP(t,s) has a feasible solution h. Assume that $G_t$ has a cycle $i_1j_1 \to i_2j_2 \to \cdots \to i_mj_m \to i_1j_1$. Corresponding to this cycle we have the inequalities $h(i_1,j_1) < h(i_2,j_2) < \cdots < h(i_m,j_m) < h(i_1,j_1)$, which is a contradiction, and the proof is complete.
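To experiment with the constraints of IP(t,s) on small inputs, one can check a candidate assignment directly. The Python sketch below assumes a triplet ij|k is the tuple `(i, j, k)` and that the candidate h is a dictionary keyed by `frozenset` pairs; `ip_feasible` and `ip_objective` are hypothetical helper names, and no integer programming solver is involved.

```python
def ip_feasible(triplets, h, s):
    """Check the constraints of IP(t, s) for a candidate h on unordered pairs."""
    pair = lambda a, b: frozenset((a, b))
    # bound constraints: every value is an integer with 0 < h(i, j) <= s
    if any(not (isinstance(v, int) and 0 < v <= s) for v in h.values()):
        return False
    # two difference constraints for every triplet ij|k
    return all(h[pair(i, k)] - h[pair(i, j)] >= 1 and
               h[pair(j, k)] - h[pair(i, j)] >= 1
               for i, j, k in triplets)


def ip_objective(h):
    """Value of the target function, the sum of all h(i, j)."""
    return sum(h.values())
```

For the single triplet ab|c, the assignment h(a,b) = 1, h(a,c) = h(b,c) = 2 is feasible for s = 2 and infeasible for s = 1, in line with the statement of Theorem 3 for a graph whose longest path has length 1.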
Let t be a set of triplets that is consistent with a tree or constructed from a given set of taxa using the TCD method. It was shown that $G_t$ is a DAG and, by Theorem 3, $h_{T_t}$ is a feasible solution to IP$(t, l_{G_t}+1)$.
Theorem 4. Let t be a set of triplets consistent with a tree. Then $h_{T_t}$ is the unique optimal solution to IP$(t, l_{G_t}+1)$.
Proof. The graph $G_t$ is a DAG, since t is consistent with a tree. So $l_{G_t}$ is well defined.
The proof proceeds by induction on $l_{G_t}$. Without loss of generality assume that $G_t$ is connected. The theorem is trivial when $l_{G_t} = 1$. Assume that for each set of triplets t consistent with a tree and with $l_{G_t} = k \ge 1$, $h_{T_t}$ is the unique optimal solution to IP$(t, l_{G_t}+1)$. Suppose that t is a set of triplets consistent with a tree and $l_{G_t} = k+1$. Let $t'$ be the set of triplets introduced in the proof of Theorem 3. By the induction assumption $h_{T_{t'}}$ is the unique optimal solution to IP$(t', l_{G_{t'}}+1)$. By Theorem 3 the minimum s for which IP(t,s) has a feasible solution is $l_{G_t}+1$. Also $l_{G_{t'}}+1 = l_{G_t}$. It follows that $h_{T_t}$ is the unique optimal solution to IP$(t, l_{G_t}+1)$ and the proof is complete.
It is important to point out that the target function of the above IP can be replaced with other appropriate target functions. We use this particular target function for two reasons. First, with it a solution to the IP can be found in polynomial time when the input triplets are obtained from the TCD method. Second, it enables us to prove the theorems above, which show that the result of the TripNet algorithm is consistent with a tree when there is a tree consistent with the given triplets.

TripNet algorithm
Now we describe the TripNet algorithm in nine steps. The input of the algorithm is a set of triplets $\tau$ and the output is a network consistent with $\tau$. Moreover, if $\tau$ is consistent with a tree, the algorithm constructs a binarization of $T_\tau$.
Step 1. In this step we find a height function $h$ on $L(\tau)$. If $G_\tau$ is a DAG we set $G'_\tau = G_\tau$. If $G_\tau$ is not a DAG we remove some edges from $G_\tau$ in such a way that the resulting graph $G'_\tau$ is a DAG. Set $h = h_{G'_\tau}$. If $\tau$ is obtained from a set of taxa using the TCD method, then $G_\tau$ is a DAG. Removing the minimum number of edges from a directed graph to make it a DAG is known as the minimum Feedback Arc Set problem, which is NP-hard [11]. Thus we use the following heuristic, which tries to remove as few edges as possible from $G_\tau$ in order to lose as little information as possible. First a cycle $C$ is selected randomly. Let $C_{\max}$ denote the set of nodes in $C$ with the maximum degree. Remove an edge of $C$ that has one of its ends in $C_{\max}$. This process continues until the resulting graph is a DAG. Any information lost in this way will be recaptured in Step 9.
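The cycle-breaking heuristic of Step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the degree of a node is taken to be its total degree (in-degree plus out-degree) in the current graph, the cycle is found by a depth-first search started from a randomly shuffled node order, and the names `find_cycle` and `break_cycles` are hypothetical.

```python
import random
from collections import defaultdict

def find_cycle(nodes, edges):
    """Return one directed cycle as a list of edges, or None for a DAG."""
    succ = defaultdict(list)
    for u, v in sorted(edges):
        succ[u].append(v)
    color = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on stack, 2 = finished
    stack = []

    def dfs(u):
        color[u] = 1
        stack.append(u)
        for v in succ[u]:
            if color[v] == 1:  # back edge closes a cycle
                cyc = stack[stack.index(v):] + [v]
                return list(zip(cyc, cyc[1:]))
            if color[v] == 0:
                found = dfs(v)
                if found:
                    return found
        color[u] = 2
        stack.pop()
        return None

    for n in nodes:
        if color[n] == 0:
            found = dfs(n)
            if found:
                return found
    return None

def break_cycles(nodes, edges, rng=None):
    """Remove edges until the graph is a DAG; return (kept edges, removed edges)."""
    rng = rng or random.Random(0)
    edges = set(edges)
    removed = []
    while True:
        order = list(nodes)
        rng.shuffle(order)  # randomizes which cycle is found
        cycle = find_cycle(order, edges)
        if cycle is None:
            return edges, removed
        degree = defaultdict(int)  # assumption: total degree in the current graph
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        cycle_nodes = {u for u, _ in cycle}
        top = max(degree[n] for n in cycle_nodes)
        c_max = {n for n in cycle_nodes if degree[n] == top}
        candidates = [e for e in cycle if e[0] in c_max or e[1] in c_max]
        victim = rng.choice(candidates)
        edges.remove(victim)
        removed.append(victim)
```

The removed edges are returned alongside the resulting DAG so that a caller can keep track of which triplet information was dropped.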
Step 2. In this step TripNet first applies HBUILD to $h$. If the result is a tree, TripNet constructs a binarization of this tree. Otherwise TripNet goes to Step 3. Note that if $\tau$ is consistent with a tree, TripNet constructs a binarization of $T_\tau$.
Step 3. Remove all the maximum-weight edges from $G$. Repeat this removal until the resulting graph is disconnected.
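A minimal Python sketch of Step 3 is given below. It assumes that the weighted complete graph is passed as a dictionary mapping each unordered pair of nodes (a `frozenset`) to its weight; the function names are hypothetical.

```python
def components(nodes, edges):
    """Connected components of an undirected graph with frozenset edges."""
    adj = {n: set() for n in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, todo = set(), [n]
        while todo:
            u = todo.pop()
            if u in comp:
                continue
            comp.add(u)
            todo.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def split_by_max_weight(nodes, weight):
    """Repeatedly drop all maximum-weight edges until the graph disconnects.
    weight: dict mapping frozenset({u, v}) to a number for every pair of nodes."""
    edges = set(weight)
    comps = components(nodes, edges)
    while len(comps) == 1 and edges:
        top = max(weight[e] for e in edges)
        edges = {e for e in edges if weight[e] < top}
        comps = components(nodes, edges)
    return comps
```

Each returned component is a candidate for the SN-set test applied in Step 4.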
In [3] and [4] the authors introduced the concept of SN-sets for a set of triplets $\tau$. A subset $S$ of $L(\tau)$ is an SN-set if there is no triplet $ij|k \in \tau$ such that $i \notin S$ and $j, k \in S$. In [4] it is shown that if $\tau$ is dense then the maximal SN-sets partition $L(\tau)$ and can be found in polynomial time. By contracting each SN-set to a single node and assuming a common ancestor for all of its leaves, the size of the problem is reduced. In these papers the authors rely on the high density of the input triplet set to find the maximal SN-sets in polynomial time. The TripNet algorithm instead uses the height function as an auxiliary tool to obtain SN-sets, without the high density assumption.
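The SN-set condition can be checked directly from the definition, as in the following Python sketch. A triplet ij|k is represented as the tuple `(i, j, k)`, and because ij|k and ji|k denote the same triplet, the test treats i and j symmetrically. The helper `sn_closure`, which grows a set until the condition holds, is a hypothetical addition for illustration only.

```python
def is_sn_set(S, triplets):
    """True if no triplet ij|k has k and exactly one of i, j inside S."""
    S = set(S)
    for i, j, k in triplets:
        if k in S and (i in S) != (j in S):
            return False
    return True

def sn_closure(S, triplets):
    """Smallest superset of S satisfying the SN-set condition (illustrative helper)."""
    S = set(S)
    changed = True
    while changed:
        changed = False
        for i, j, k in triplets:
            if k in S and (i in S) != (j in S):
                S |= {i, j}
                changed = True
    return S
```

With the triplets ab|c, ab|d, cd|a and cd|b, the set {a, b} passes the test, whereas {a, c} fails because of the triplet ab|c.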
Step 4. For each connected component obtained in Step 3 that is not an SN-set, we apply Step 3 again. This process continues until all of the resulting components are SN-sets. Let $\{S_1, S_2, \ldots, S_k\}$ be the set of resulting SN-sets. If each SN-set contains only one node, HBUILD is applied; if the result is a tree, TripNet constructs a binary tree and goes to Step 6, and otherwise TripNet goes to Step 5. If $|S_i| > 1$ for some $i$, contract each $S_i$ to a single node $s_i$ and set $S = \{s_1, s_2, \ldots, s_k\}$. Update the set of triplets by defining $\tau_S = \{ s_i s_j | s_k : \exists\, xy|z \in \tau,\ x \in S_i,\ y \in S_j,\ z \in S_k \}$. Construct a weighted complete graph $(G_S, w_S)$ with $V(G_S) = S$ and $w_S(s_i, s_j) = \min\{h(x, y) : x \in S_i,\ y \in S_j\}$. Set $(G, w) = (G_S, w_S)$ and go to Step 3.
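The contraction at the end of Step 4 can be sketched in Python as follows. The sketch assumes that the SN-sets are disjoint, that the height function is stored as a dictionary keyed by unordered leaf pairs, and that a contracted triplet is kept only when its three leaves fall into three different SN-sets; contracted nodes are represented by the index of their SN-set, and the function name is hypothetical.

```python
from itertools import combinations

def contract(sn_sets, triplets, h):
    """Contract each SN-set to one node.
    sn_sets : list of disjoint sets of leaves
    triplets: iterable of (x, y, z) meaning xy|z
    h       : dict mapping frozenset({x, y}) to the height of the pair
    Returns the contracted triplets and the weights of the contracted graph."""
    index = {x: a for a, S in enumerate(sn_sets) for x in S}
    contracted = set()
    for x, y, z in triplets:
        a, b, c = index[x], index[y], index[z]
        if len({a, b, c}) == 3:  # assumption: three distinct SN-sets are required
            contracted.add((min(a, b), max(a, b), c))
    weights = {}
    for a, b in combinations(range(len(sn_sets)), 2):
        weights[frozenset((a, b))] = min(
            h[frozenset((x, y))] for x in sn_sets[a] for y in sn_sets[b]
        )
    return contracted, weights
```

Storing the pair of a contracted triplet in sorted order avoids counting ab|c and ba|c as two different triplets.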
The following theorem is a consequence of the definition of an SN-set for $(G_S, w_S)$.
Theorem 5. Applying Steps 3 and 4 to $(G_S, w_S)$ and $\tau_S$, each resulting SN-set has one member.
Proof. Suppose that $S = \{s_1, s_2, s_3, \ldots, s_r\}$ is an SN-set in $(G_S, w_S)$ with more than one member. Now assume that in the procedure of Step 3, by removing the edges with weight $l$, $S_1$ separates from $S_2$. Thus there exists $k > l$ such that by removing the edges with weight at least $k$ in $(G_S, w_S)$, the connected component $S$ separates from the other components of $G_S$. This means that by removing the edges with weight at least $k$ in $G$, we obtain the SN-set $S_1 \cup S_2 \cup \cdots \cup S_r$, which is a contradiction.
In the next step the reticulation leaves are recognized using the following three criteria.
Criterion I. Let $m_i$ and $M_i$ be the minimum and maximum weights of the edges in $(G, h)$ with exactly one end in $S_i$. Choose the nodes with minimum $m_i$; if there is more than one such node, choose among them those with minimum $M_i$. Let $R_1$ denote the set of such nodes.
Criterion II. Let $w_{\min} = \min\{w(s_i, s_j) : 1 \le i, j \le k\}$. In $G_S$ consider the subgraph induced by the edges with weight $w_{\min}$. Choose the nodes of $R_1$ with the maximum degree in this induced subgraph. Let $R_2$ denote the set of such nodes.
Criterion III. For each node $s \in R_2$, remove it from $G_S$ and find the SN-sets of this new graph using Steps 3 and 4. Let $n_s$ be the number of SN-sets of this new graph with cardinality greater than one. Choose the nodes in $R_2$ with maximum $n_s$. Let $R_3$ denote the set of such nodes.
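Criteria I and II translate fairly directly into code. The following Python sketch assumes at least two SN-sets, a height dictionary keyed by unordered leaf pairs, and a contracted weight dictionary keyed by unordered pairs of SN-set indices; the function names are hypothetical, and Criterion III is omitted because it re-runs Steps 3 and 4.

```python
def criterion_one(sn_sets, h):
    """Indices of SN-sets with minimum m_i, ties broken by minimum M_i."""
    leaves = set().union(*sn_sets)
    stats = []
    for S in sn_sets:
        crossing = [h[frozenset((x, y))] for x in S for y in leaves - S]
        stats.append((min(crossing), max(crossing)))  # (m_i, M_i)
    best = min(stats)  # lexicographic: smallest m_i first, then smallest M_i
    return {a for a, s in enumerate(stats) if s == best}

def criterion_two(r1, w_s):
    """Members of r1 with maximum degree among the minimum-weight edges."""
    w_min = min(w_s.values())
    degree = {a: 0 for a in r1}
    for edge, w in w_s.items():
        if w == w_min:
            for a in edge:
                if a in degree:
                    degree[a] += 1
    top = max(degree.values())
    return {a for a in r1 if degree[a] == top}
```

Comparing the pairs $(m_i, M_i)$ as tuples implements the tie-breaking rule of Criterion I in a single `min` call.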
We state an example to show the idea behind these three criteria.
The set $\tau$ is not consistent with a tree, but it is consistent with the network $N$ shown in Figure 8a. Obviously, $N$ is an optimal network consistent with $\tau$. In order to find SN-sets we construct $G'_\tau$ and $(G, h)$, and find SN-sets from $(G, h)$ using Steps 3 and 4 (Figures 8b to 8g) (Figure 8h). We expect that the reticulation leaf is in $R_1$. In this example both $k$ and $l$ are in $R_1$. We also expect that if there is a reticulation leaf, it belongs to $R_2$, and again both $k$ and $l$ are in $R_2$. Now only $l$ belongs to $R_3$. Thus we consider $l$ as the reticulation leaf (Figures 8i to 8n). Remove the triplets that contain $l$ from $\tau_S$ and denote the new set of triplets by $\tau'_S$. Obviously $\tau'_S$ is consistent with a tree. We add this reticulation leaf to a binarization of $T_{\tau'_S}$ such that the resulting network is consistent with $\tau_S$. Note that if we consider any node other than $l$ as the reticulation leaf, then the final network consistent with $\tau_S$ has at least two reticulation leaves.
Step 5. In this step the reticulation leaf is recognized using the three criteria. Apply Criterion I. If $|R_1| = 1$, choose the node $x \in R_1$ as the reticulation leaf. Otherwise, if $|R_1| > 1$, apply Criterion II. If $|R_2| = 1$, choose the node $x \in R_2$ as the reticulation leaf. Otherwise, if $|R_2| > 1$, apply Criterion III. If $|R_3| = 1$, choose the node $x \in R_3$ as the reticulation leaf. Otherwise, if $|R_3| > 1$, the reticulation leaf is chosen according to one of the following speed options.
Slow. Each node in $R_3$ is examined as the reticulation leaf.
Normal. Two nodes in $R_3$ are selected randomly and each of these two nodes is examined as the reticulation leaf.
Fast. One node in $R_3$ is selected randomly as the reticulation leaf.
Let $x$ be a node that is considered as a reticulation leaf. Remove $x$ from $G_S$ and remove all of the triplets that contain $x$ from $\tau_S$. Define $G = G \setminus \{x\}$ and go to Step 3. Note that for the Fast option the running time of the algorithm is polynomial.
For biological data, Criteria I and II almost always find a unique reticulation leaf, so on real data the running time of TripNet is almost always polynomial.
Step 6. Let $x_1, x_2, \ldots, x_m$ be the $m$ reticulation leaves obtained in Step 5, in this order, and let $T$ be the tree constructed in Step 4. Add these $m$ nodes to $T$ in reverse order as follows. Let $e_1$ and $e_2$ be two edges of $T$. Insert two new nodes $y_1$ and $y_2$ in the middle of $e_1$ and $e_2$, respectively. Connect $y_1$ and $y_2$ to a new node $y_3$ and connect the reticulation leaf $x_m$ to $y_3$. Do this for all pairs of edges and choose a pair for which the resulting network is consistent with the maximum number of triplets in $\tau$. Continue this procedure until all the reticulation leaves are added.
Step 7. For each SN-set $S_i$ and the set $\tau_{S_i}$ of triplets we run the algorithm again.
Step 8. Replace each SN-set in the network of Step 6 with its related network constructed in Step 7 to obtain a network $N'$.
Let $\tau' \subseteq \tau$ be the set of the triplets that are not consistent with $N'$. For each pair of leaves $a$ and $b$, let $\tau'_{ab}$ be the set of triplets in $\tau'$ of the form $ab|c$. Consider the pair of leaves $i$ and $j$ such that $\tau'_{ij}$ has the maximum cardinality. Assume that $p_i$ and $p_j$ are the parents of $i$ and $j$, respectively.
Step 9. Create two new nodes in the middle of the edges $p_i i$ and $p_j j$ and connect them with a new edge. This new edge creates a reticulation node, and all of the triplets in $\tau'_{ij}$ will be consistent with the new network. All triplets consistent with the new network are removed from $\tau'$, and this procedure continues until $\tau'$ becomes empty. Figure 9 presents an example of the algorithm with all of its steps.
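The selection of the pair of leaves used in Step 9 can be sketched in Python as follows. The sketch only groups the inconsistent triplets by their pair and picks the largest group; it does not modify the network. The function name is hypothetical, and ties are resolved in favour of the pair encountered first, which is an assumption since the text does not specify a tie-breaking rule.

```python
from collections import defaultdict

def most_violated_pair(inconsistent):
    """Group triplets (a, b, c) = ab|c by the unordered pair {a, b} and
    return the pair with the most triplets together with those triplets."""
    groups = defaultdict(list)
    for a, b, c in inconsistent:
        groups[frozenset((a, b))].append((a, b, c))
    pair = max(groups, key=lambda p: len(groups[p]))
    return pair, groups[pair]
```

A full implementation would repeat this selection after each new edge is added, recomputing the set of triplets that are still inconsistent with the network.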