TripNet: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets

Hadi Poormohammadi; Changiz Eslahchi; Ruzbeh Tusserkani

doi:10.1371/journal.pone.0106531

Abstract

The problem of constructing an optimal rooted phylogenetic network from an arbitrary set of rooted triplets is an NP-hard problem. In this paper, we present a heuristic algorithm called TripNet, which tries to construct a rooted phylogenetic network with the minimum number of reticulation nodes from an arbitrary set of rooted triplets. Despite of current methods that work for dense set of rooted triplets, a key innovation is the applicability of TripNet to non-dense set of rooted triplets. We prove some theorems to clarify the performance of the algorithm. To demonstrate the efficiency of TripNet, we compared TripNet with SIMPLISTIC. It is the only available software which has the ability to return some rooted phylogenetic network consistent with a given dense set of rooted triplets. But the results show that for complex networks with high levels, the SIMPLISTIC running time increased abruptly. However in all cases TripNet outputs an appropriate rooted phylogenetic network in an acceptable time. Also we tetsed TripNet on the Yeast data. The results show that Both TripNet and optimal networks have the same clustering and TripNet produced a level-3 network which contains only one more reticulation node than the optimal network.

Citation: Poormohammadi H, Eslahchi C, Tusserkani R (2014) TripNet: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets. PLoS ONE 9(9): e106531. https://doi.org/10.1371/journal.pone.0106531

Editor: William W. Lytton, SUNY Downstate MC, United States of America

Received: March 13, 2014; Accepted: August 8, 2014; Published: September 10, 2014

Copyright: © 2014 Poormohammadi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. We can send all required data via email. Also all data will be online available in few days later. TripNet software, TripNet data and supplementary files are freely available for download at (www.bioinf.cs.ipm.ir/software/tripnet). We addressed this link in the manuscript and mentioned that all required data and software are freely available for download from the link.

Funding: This work is supported in part by a grant (BS-1392-1-05) from the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. The funders had used the clusters of this institute for data collection and data analysis of this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Phylogenetic networks are a generalization of phylogenetic trees that permit the representation of non-tree-like underlying histories. A rooted phylogenetic network is a rooted directed acyclic graph in which no node has indegree greater than 2 and the outdegree of each node with indegree 2 is 1. Such nodes are called reticulation nodes. In rooted phylogenetic networks the nodes with indegree 1 and outdegree 0 are called leaves and are distinctly labeled by a set of given taxa. Mathematicians are interested in developing methods that infer a phylogenetic tree or network from basic building blocks. In the computation of a rooted tree or network, one group of the basic building blocks are rooted triplets, the rooted binary trees on three taxa [1].

In 1981, Aho et al., studied the problem of constructing a rooted tree from a set of rooted triplets [2]. They proposed an algorithm called BUILD algorithm which shows that, given a set of rooted triplets, it is possible to construct in polynomial time a rooted tree that all the input triplets are contained in it or decide that no such tree exists.

When there is no tree for a given set of triplets one may try to produce an optimal phylogenetic network. In this context, the goal is to compute an optimal rooted phylogenetic network that contains all the rooted triplets. One possible optimality criterion is to minimize the level of the network, which is defined as the maximum number of reticulation nodes contained in any biconnected component of the network. The other optimality criterion is to minimize the number of reticulation nodes [1]. In [3] and [4] the authors considered the problem of deciding whether, given a set of rooted triplets as input, is it possible to construct a level-1 rooted phylogenetic network that contains all the input triplets? They showed that, in general, this problem is NP-hard. However, in [4] the authors showed that when the set of rooted triplets is dense, which means that for each set of three taxa there is at least one rooted triplet in the input set, the problem can be solved in polynomial time. After their results, all research in this new area has up to this point focused on constructing rooted phylogenetic networks from dense rooted triplet sets.

LEV1ATHAN is an algorithm for generating a level-1 rooted phylogenetic network from a set of rooted triplets [5]. Specifically, it attempts to find a level-1 rooted phylogenetic network that contains as many of the input rooted triplets as possible. This problem is an NP-hard problem [5]. The algorithm by [6] can be used to find a level-1 or a level-2 rooted phylogenetic network which minimizes the number of reticulation nodes, if such a network exists. In [6] the authors also showed that for a dense set of rooted triplets τ, if τ is precisely equal to the set of rooted triplets that are contained in some rooted phylogenetic network, then they can construct such a rooted phylogenetic network with smallest possible level in time O(|τ|^k⁺¹), where k is a fixed upper bound on the level of the network. In addition based on the ideas described in [6], for a given dense set of rooted triplets τ, the authors proposed the SIMPLISTIC algorithm which always returns some rooted phylogenetic network that contains τ. But it does not give any minimality guarantees.

In [7] the authors showed that given a dense set of rooted triplets τ and a fixed number k, it is possible to construct in time O(|τ |^k⁺¹) a level-k rooted phylogenetic network that contains τ or decides that no such network exists.

In this paper we present a heuristic algorithm called TripNet for constructing rooted phylogenetic networks with the minimum number of reticulation nodes from an arbitrary set of rooted triplets. Despite of current methods that work for dense set of rooted triplets, a key innovation is the applicability of TripNet to non-dense set of rooted triplets.

In “unpublished data” the authors applied TripNet on both real and simulated data. Here TripNet algorithm is described in details, some theorems are proved, and one simulation is performed to show the accuracy of TripNet. Also TripNet is tested on the Yeast data. This paper is organized as follows. In section 2, first some definitions and notation are presented. Then we describe BUILD algorithm. Finally a new method called TCD, is introduced for constructing rooted triplets from (biological) sequences. In section 3 we compare TripNet with SIMPLISTIC on the triplets sets that are obtained from TCD method. Then we test TripNet on the Yeast data. In section 4 we discuss the performance of TripNet. In the last section the directed graph Gτ related to a set of triplets τ is introduced. Then we show that if either a set of triplets is obtained from a set of sequences using TCD method or a set of triplets is consistent with a tree, then G_τ is a DAG. This property has a key role in solving the Integer Programming system which is introduced in the remaining, in polynomial time. Then the concept of the height function of a rooted phylogenetic network is introduced, and an efficient method for obtaining a height function h_τ for a given set of rooted triplets τ is explained. It is shown that the condition of consistency of a rooted phylogenetic network N with the height function h_τ can be a good alternative for the condition of consistency of N with τ. To show this, firstly we define the Integer Programming system in such a way that its constraints intuitively force the consistency of N with τ. Secondly, we show that if τ is consistent with a tree T, then T is consistent with h_τ and T can be constructed using this height function. In the last section we present TripNet algorithm.

Preliminaries

Here first we present some definitions and notation. Then we describe BUILD algorithm. Finally a new method called TCD, is introduced for constructing rooted triplets from a set of sequences.

2.1 Definitions and notation

Let X be a set of taxa. A rooted phylogenetic tree (tree for short) on X is a rooted unordered leaf labeled tree whose leaves are distinctly labeled by X and every node which is not a leaf has at least outdegree two. A directed acyclic graph (DAG) is a directed graph that is free of directed cycles. A DAG G is connected if there is an undirected path between any two nodes of G. It is biconnected if it contains no node whose removal disconnects G. A biconnected component of a graph G is a maximal biconnected subgraph of G. A rooted phylogenetic network (network for short) on X is a rooted DAG in which the root has indegree 0 and outdegree 2 and every node except the root satisfies one of the following conditions:

It has indegree 2 and outdegree 1. These nodes are called reticulation nodes.
It has indegree 1 and outdegree 2.
It has indegree 1 and outdegree 0. These nodes are called leaves and are distinctly labeled by X.

A reticulation leaf is a leaf whose parent is a reticulation node. A network is said to be a level-k network if each of its biconnected components contains at most k reticulation nodes. A tree can be considered as a level-0 network.

A rooted triplet (triplet for short) is a rooted binary unordered tree with three leaves. We use ij|k to denote a triplet with taxa i and j on one side and k on the other side of the root (Figure 1a). A set of triplets τ is called dense if for each subset of three taxa, there is at least one triplet in τ. A triplet ij|k is consistent with a network N or equivalently N is consistent with ij|k if the leaf set of ij|k is a subset of the leaf set of N, and N contains a subdivision of ij|k, i.e. if N contains distinct nodes u and v and pairwise internally node-disjoint paths u → i, u → j, v → u and v → k. Figure 1b shows an example of a network consistent with ij|k. A set τ of triplets is consistent with a network N if all the triplets in τ are consistent with N. We use the symbols τ(N) and L_N to represent the set of all triplets that are consistent with N and the set of labels of its leaves respectively. For any set τ of triplets define L(τ) = . The set τ is called a set of triplets on X if L(τ) = X.

Download:

Figure 1. A triplet and a network consistent with it.

(a) The triplet $ij|k$, (b) The triplet ij|k is consistent with the given network.

https://doi.org/10.1371/journal.pone.0106531.g001

2.2 BUILD algorithm

Let τ be a set of triplets. BUILD is a top-down algorithm, constructs a tree consistent with τ if such a tree exists. The algorithm is guided by the Aho graph.

Definition 1. (Aho graph) Let X be a set of taxa and τ be a set of triples on X. The Aho graph AG(τ) = (V,E) associated with τ has node set V = X and any two nodes i and j are connected by an edge in E if and only if there exists a triplet ij|k ∈ τ [1].

BUILD algorithm: Given a non-empty set of rooted triples τ on X, the aim is to construct a rooted phylogenetic tree T on X that is consistent with τ, if one exists. If AG(τ) has only one connected component, then the algorithm reports fail. Else, for each node set U of a connected component of AG(τ), determine the set τ|_U which denotes the set of all triplets in τ whose leaves are in U and recursively compute the rooted phylogenetic subtree T(τ|_U) which denotes the tree constructed with BUILD algortihm consistent with τ|_U. Finally, create a root node r and combine all computed subtrees by connecting r to the root of each of them [1]. For an example see Figure 2.

Download:

Figure 2. An example of BUILD algorithm for the given set {bc |a, ac |d,

de |b} of triplets.

https://doi.org/10.1371/journal.pone.0106531.g002

2.3 Triplets construction method

There exist different methods like Maximum Parsimony or Maximum Likelihood for constructing triplets from (biological) sequences [6]. In this section a method for constructing triplets is presented. Suppose that X is a set of n taxa, and D = [D_ij] be an n×n distance matrix on X. For each three taxa i, j, and k ∈ X, and the entries D_ij, D_ik, and D_jk, we assign the triplet ij|k if D_ij < min {D_ik, D_jk}. We name this method Triplets Construction with Distance; TCD for short. In this paper we use TCD method for constructing triplets.

Results

In this section to show the performance of TripNet on the triplets sets which are obtained from TCD, we compare TripNet with SIMPLISTIC. Also we test TripNet on the Yeast data. It is the only published triplets data that are obtained from biological data.

3.1 Comparing SIMPLISTIC and TripNet

SIMPLISTIC is the only available software which has the ability to return some rooted phylogenetic network consistent with a given dense set of rooted triplets. But it does not give any minimality guarantees [6].

SplitsTree is a valuable tool for constructing an special kind of unrooted phylogenetic networks from different types of data as input. This program converts a given set of sequences X into a distance matrix D_X to compute the resulting network. The distance matrix D_X is reported as one of the output of SplitsTree [8].

Let be the set of triplets that is obtained from D_X using TCD, and consider it as the input for TripNet.

Note that is not necessarily dense, since for some three taxa i, j, and k we might have = <. In this case one of the triplets ij|k or jk|i is assigned to i, j, and k to obtain a dense set of triplets as the input of SIMPLISTIC. Also if = = , then randomly one of the three possible triplets related to i, j and k is assigned to them.

To perform the simulation we generate 160 different sets of sequences are generated using TREEVOLVE. TREVOLVE is a software which simulate the evolution of DNA sequences under a coalescent model [9]. TREEVOLVE contains many input parameters which one can adjust them. In this study we adjust the Number of samples, the Number of sequences, and the Length of sequence, and for the other parameters the default values are adjusted. In this study the Number of sequences is 10, 20, 30, and 40. For each input parameter the Number of sequences the Length of sequence is 100, 200, 300, and 400. For each case the Number of samples is set to 10.

In this study we run both methods on a PC with an Intel DuallCore processor running at 1.80 GHz.

We set the running time restriction 6 hours for methods. Let N_finite be the set of networks for which the running time is less than 6 hours.

The results of the comparison between TripNet and SIMPLISTIC on the three most important parameters i.e. running time of both methods, number of the reticulation nodes and the level of the final networks, are shown in Table 1.

Download:

Table 1. SIMPLISTIC and TripNet network results.

https://doi.org/10.1371/journal.pone.0106531.t001

The results show that when the number of input taxa is 10, both methods always return a network in at most one second. For the number of input 20, in 5% of cases SIMPLISTIC returns no results in less than 6 hours. For the remaining 95% of the cases, the SIMPLISTIC running time is on average 306 seconds, while in all cases on average the TripNet running time is at most 2 seconds. But by increasing this parameter to 30, in 67.5% of the cases, SIMPLISTIC has not the ability to return a network in less than 6 hours. For the remaining 22.5% of the cases on average SIMPLISTIC outputs a network in 2675 seconds, while in all cases the TripNet running time is on average 200 seconds. Moreover when this parameter is set to 40, in all cases SIMPLISTIC fails to return any network in less than 6 hours, while on average TripNet outputs a network in 775 seconds. Totally for all 160 input triplets sets on average TripNet outputs a network in less than 250 seconds, while on average in 57% of the SIMPLISTIC networks which belong to N_finite, the running time is near to 750 seconds.

Also the results show that in all cases the number of the reticulation nodes and the level of TripNet networks are less than SIMPLISTIC networks. Note that for the number of input 40, on average the number of the reticulation nodes and the level of the TripNet networks are 15.825 and 15.25, while for these data SIMPLISTIC can not return any network in less than 6 hours.

3.2 Yeast data

The Yeast data is a dense set of triplets generated using real yeast data, obtained from the Fungal Biodiversity Center in Utrecht. This data set which contains information about 21 species is available online from (http://skelk.sdf-eu.org/level2triplets.html). Based on the algorithm developed in [10]. Steven Kelk has developed a software application, called LEVEL2, for constructing level-2 networks from dense sets of triplets. LEVEL2 is not applicable to general triplet sets and it produces a network only if there exists a level-2 network consistent with the input triplets. However, LEVEL2 has the advantage that it always produces the best possible network which also minimizes the number of reticulation nodes. LEVEL2 network for the Yeast data is a 21-leaf level-2 network which is given in Figure 3a [10]. As our only chance for comparing TripNet networks with the best possible networks we repeated the analysis of Yeast data using TripNet. The TripNet network for the Yeast dataset is given in Figure 3b. As one can see, TripNet produced a level-3 network which contains only one more reticulation node than the network obtained by LEVEL2. The running time of both algorithms is nearly one second.

Download:

Figure 3. Resulting networks from Yeast triplets. (a) LEVEL2 algorithm result. (b) TripNet algorithm result.

https://doi.org/10.1371/journal.pone.0106531.g003

Discussion

In this paper we introduced TripNet which is the software that has the ability to return some network consistent with an arbitrary given set of triplets.TripNet and supplementary files are freely available for download at (www.bioinf.cs.ipm.ir/software/tripnet). Unlike previous methods which only work on dense triplet sets, our method works on any set of triplets. Some theorems were proved to clarify the rationale behind the steps of TripNet. In this paper the TCD method was introduced for constructing triplets. In order to study the performance of TripNet on the triplets that are obtained from TCD method we performed a simulation on 160 different sets of triplets, and compared TripNet with SIMPLISTIC.

The results showed that in all 160 cases TripNet outputs an appropriate network in an acceptable time, while just in 57.5% of these cases SIMPLISTIC has the ability to return some network in less than 6 hours. Also on average in all cases TripNet outperforms SIMPLISTIC on the number of the reticulation nodes, and the level of the output network.

Also by increasing the number of input taxa, the running time of SIMPLISTIC exceeds abruptly, such that for the input taxa 40, it could not return any network in less than 6 hours.

These results showed that for large size input data that are obtained from TCD method, SIMPLISTIC is not a practical method for constructing networks, while TripNet works well in all cases.

To establish the performance of TripNet on real datasets, we tested TripNet on Yeast data, and compared our results with those of LEVEL2. For Yeast data TripNet produced a level-3 network which contains only one more reticulation node than the optimal network obtained by LEVEL2. Both networks have the same clustering and represent the same evolutionary relationship between taxa. While TripNet has been designed for general triplet sets (not necessarily dense or consistent with a restricted level network), this example shows that the network produced by TripNet is very close to the best possible solution.

Materials and Methods

In this section we prove some theorems to clarify the rationale behind the steps of TripNet. Then TripNet is presented in nine steps.

5.1 The directed graph related to a set of triplets and height function

Throughout this subsection we denote i, j by ij for short. Let τ be a set of triplets. Define G_τ, the directed graph related to τ, by V(G_τ) = {ij: i,j ∈ L(τ), i ≠ j} and E(G_τ) = {(ij,ik): ij|k ∈ τ} ∪ {(ij,jk): ij|k ∈ τ}. In the following we present some basic properties of G_τ.

In what follows the height function of a tree is introduced. Let denotes the set of all subsets of X of size 2.

Definition 2. Let X be an arbitrary finite set. A function h: → is called a height function on X.

Let T be a rooted tree with the root r, c_ij be the lowest common ancestor of the leaves i and j, and l_T denotes the length of a longest path starting at r.

Definition 3. The height function of T, h_T is defined as h_T(i,j) = l_T-d_T(r,c_ij) where i and j are two distinct leaves of T (d_T(r,c_ij) denotes the length of the path between r and c_ij).

Let T be a tree. The definition above implies that a triplet ij|k is consistent with T if and only if h_T(i, j)<h_T(i, k) or h_T(i, j)<h_T(j, k).

Let X = {x₁, x₂, …, x_m} be a finite set, D be a distance matrix on X, and τ be the set of triplets on X that are obtained from TCD method using D. Let G_τ contains a cycle x₁x₂ → x₂x₃ → … → x_n₋₁x_n → x₁x₂. Then , which is a contradiction. So G_τ is a DAG.

Moreover if τ is a triplet set consistent with a tree T, then G_τ is a DAG. This is so because if G_τ contains a cycle x₁x₂ → x₂x₃ → … → x_n₋₁x_n → x₁x₂, then h_T(x₁,x₂) < h_T(x₂,x₃) < … < h_T(x_n₋₁,x_n) < h_T(x₁,x₂), which is a contradiction.

The height function of a DAG is introduced as what follows.

Let τ be a set of triplets, G_τ be a DAG and denotes the length of the longest path in G_τ. Since G_τ is a DAG, the set of nodes with outdegree zero is nonempty. Assign +1 to the nodes with outdegree zero and remove them from G_τ. Assign to the nodes with outdegree zero in the resulting graph and continue this procedure until all nodes are removed.

Definition 4. For any two distinct i, j ∈ L(τ), define as the value that is assigned by the above procedure to the node ij and call it the height function related to G_τ.

Let τ be a set of triplets that is consistent with a tree, and T_τ denotes the unique tree that is produced by BUILD algorithm. Then G_τ is a DAG and is well-defined. The following theorem represents an upper bound for based on .

Theorem 1. Let τ be a set of triplets that is consistent with a tree. Then .

Proof. The proof proceeds by induction on . It is trivial when = 3. Assume that theorem holds when . Let = k+1 and T₁, T₂, …, T_m be m subtrees which are obtained from T_τ by removing its root. For each i, , let , and r_i be the root of T_i. By the induction assumption for each i, ,≤. Moreover we conclude from BUILD algorithm that , for . Thus , for . Also for i, _, the maximum length of the longest path in T_i is . It means that for i, , the maximum length of the longest path in is at least . Therefore the length of the longest path in is at least . Let . We have two cases.

Case 1. For some i and j, , and . Since the outdegree of ab in is zero and c_ab = r, then ≤.

Case 2. For some i, , . By the induction assumption ≤ for i, . Therefore = = = ≤≤. The last inequality is obtained by construction of from for i, .

So for each , ≤ and the proof is complete.

Now we describe an algorithm similar to BUILD algorithm, using height functions. We refer to this algorithm by HBUILD. Let h be a height function on X. Define a weighted complete graph (G,h) where V(G) = X and edge {i, j} has weight h(i,j). Remove the edges with maximum weight from G. If removing these edges results in a connected graph the algorithm stops. Otherwise, the process of removing the edges with maximum weight is continued in each connected component until each connected component contains only one node. At the end of this procedure one can reconstruct the tree by reversing the steps of the algorithm similar to BUILD algorithm (see Figure 4). The algorithm above decides in polynomial time whether a tree with height function h exists.

Download:

Figure 4. The steps of constructing T_τ from the given set τ = {kl |j,

kl |i, jk |i, jl |i}, using HBUILD. (a) The graph G_τ, (b) The graph (G,h), (c) Removing maximum weights from the graph (G,h), (d) Constructing T_τ using step c.

https://doi.org/10.1371/journal.pone.0106531.g004

So if τ is a set of triplets which is consistent with a tree, then is a DAG and ≤ = h and HBUILD algorithm constructsa tree consistent with τ. Note that based on theorem 1 the tree that is produced by HBUILD is exactly T_τ.

The HBUILD tree is not necessarily a binary tree. To obtain a binary tree consistent with a set of triplets, we do the following procedure.

Let T be a tree and x be a node of T with x₁, x₂, …, x_k, as its children. Consider a new node y. Construct by removing the edges (x, x₁), (x, x₂), …, (x, x_k_-1) from T and adding the edges (x, y), (y, x₁), (y, x₂), …, (y, x_k_-1) to T. Continuing the same method for each node with outdegree more than 2 a binary tree is obtained, and call it a binarization of T (see Figure 5). Obviously, one can obtain different binarization of T. Let τ be a set of triplets that is consistent with a tree T₁, and T₂ be a binarization of T₁. Then τ is consistent with T₂.

Download:

Figure 5. An example of binarization.

The binary tree is a binarization of the non-binary tree.

https://doi.org/10.1371/journal.pone.0106531.g005

In the remaining of this section we generalize the concept of height function from trees to networks. This generalization is not straightforward because the concept of (lowest) common ancestor of two leaves of a network is not well-defined. Let N be a network with the root r and l_N be the length of a longest directed path from r to the leaves. For each node u consider d(r,u) as the length of the longest directed path from r to u. For any two nodes u and v, we call u an ancestor of v, if there is a directed path from u to v. If u is an ancestor of v then we say that v is lower than u. Let i and j be two leaves of N. c is called a lowest common ancestor of i and j in N, if c is a common ancestor of i and j and there is no common ancestor of i and j lower than c. For any two leaves i and j, let C_ij denote the set of all lowest common ancestors of i and j.

Definition 5. For each pair of leaves i and j, define h_N(i,j) = min{l_N-d(r,c): c∈C_ij} and call it the height function of N.

Obviously, every network N indicates a unique height function h_N. But two different networks may have the same height function (see Figure 6).

Download:

Figure 6. Two different networks with the same height function.

For the given network N and tree T, h_N = h_T = h. h(j,k) = 1, h(i,j) = h(i,k) = 2 and h(i,l) = h(j,l) = h(k,l) = 3.

https://doi.org/10.1371/journal.pone.0106531.g006

In the following proposition we prove that for a given height function h there is a network N such that h_N = h+1.

Proposition 1. Let X be an arbitrary finite set and h be a height function on X. Then there exists a network N not necessarily binary, such that its leaves are distinctly labeled by X and h_N = h+1.

Proof. Let X = {x₁, x₂, …, x_n} and h_max = max{h(x_i, x_j): }. Let r be the root of N,, and X′ = {x′₁, x′₂, …, x′_n}. Consider n nodes that are distinctly labeled by X′ members. For each pair of nodes x_i and x_j with h(x_i, x_j) = h_max, connect x′_i and x′_j to r by two paths of length h_max which just are common in the root. For each pair of nodes x_i and x_j with h(x_i, x_j) < h_max, consider a new node and connect x′_i and x′_j to this new node and connect this node to r by a path of length h_max-h(x_i, x_j). For each node which is labeled by x′_i, consider a new node as its child and label it by x_i. The resulting network in which its leaves are distinctly labeled by X satisfies the condition h_N = h+1.

Note that the network N which is constructed in the proof of Proposition 1 is not necessarily a rooted phylogenetic network. To construct a rooted phylogenetic network N′ from N in such a way that if a triplet is consistent with N then it is consistent with N′, do the following procedure. Replace each path in which all its inner nodes have indegree and outdegree one, with a path of length one. The method of constructing N shows that If there is a node v with indegree , then it has just one child as a leaf. Let this child is labeled by x, and its d parents are labeled by x₁, x₂, …, x_d. Replace the edge which is connected to x with a path of length d-2 in such a way that its d-2 inner nodes from v to x are labeled with 1 to d-2. For each i, remove the edge x_iv and connect x_i to i. Do the binarization on the root. The resulting network N′ is consistent with all triplets which are consistent with N.

The following theorem shows relation between the height function of a network and a triplet consistent with it.

Theorem 2. Let N be a network, i, j, and k be its three distinct leaves. If h_N(i, j) < h_N(i, k) or h_N(i, j) < h_N(j, k) then ij|k is consistent with N.

Proof. Suppose that h_N(i, j) < h_N(i, k). Let v_ij$ and v_ik be common ancestors of i, j and i, k respectively, such that h_N(i, j) = l_N-d(v_ij, r) and h_N(i, k) = l_N-d(r,v_ik). Let l_i and l_j be two distinct paths from v_ij to i and j, respectively. Let l_k be an arbitrary path from v_ik to k. If then it follows that which is a contradiction. So ij|k is consistent with N.

The reverse of the above theorem is not necessarily true. For example, consider the network of Figure 7. The triplet ij|k is consistent with it, but h(i,j) = h(i,k) = 3 and h(j,k) = 2.

Download:

Figure 7. A counter example for the reverse of Theorem 2.

ij|k is consistent with the given network, but h(i,j) = h(i,k) = 3 and h(j,k) = 2.

https://doi.org/10.1371/journal.pone.0106531.g007

The basic idea of TripNet algorithm is to find a height function as an intermediate computational step that yields the minimum amount of information required to construct the network from a set of triplets. So it is important to find a way for computing h_N from a set of triplets. In the rest of this section we introduce a computational method for computing h_N using Integer Programming. Let τ be a set of triplets with |L(τ)| = n. Inspired from the two inequalities that are the consequence of Definition 3 and Theorem 2, for each triplet ij|k ∈ τ, define two inequalities and . Since the number of variables in such inequalities are at most , we obtain the following system of inequalities from τ.

Let s be an integer. Define the following Integer Programming and call it IP(τ,s).

Intuitively if IP(τ,s) has a feasible solution, we expect that the optimal solution to this integer programming is an approximation of the height function of an optimal network N consistent with τ. The following theorems support this intuition.

Theorem 3. Let τ be a set of triplets. Then G_τ is a DAG if and only if for some integer s, the IP(τ,s) has a feasible solution. In this case the minimum number s, for which IP(τ,s) has a feasible solution, is +1.

Proof. Let G_τ be a DAG. Without loss of generality assume that G_τ is connected.

The proof proceeds by induction on . If = 1 then obviously for s = 1, IP(τ,s) has no feasible solution and for each , IP(τ,s) has a feasible solution. Assume that the theorem holds for . Suppose that τ is a set of triplets with = k+1. Let A be the set of the terminal nodes of all longest paths in G_τ. For each ij ∈ A there is some x ∈ L(τ) such that ix|j ∈ τ. Let B be the set of all such triplets and τ′ = τ\B. Apparently, B≠φ and the length of the longest path in G_τ′ is k. By the induction assumption the minimum number s for which IP(τ′,s) has a feasible solution, is +1 = . Consider IP(τ,+1). Define h(i, j) = +1, for each ij ∈ A and h(t,l) = h′(t,l), for each tl A. h is a feasible solution to IP(τ,+1). Now if s is a solution for IP(τ,s) then s-1 is a solution for IP(τ′,s-1). So +1 is the minimum solution for IP(τ,s). Now suppose that τ is a set of triplets and for some integer s, IP(τ,s) has a feasible solution h. Assume that G_τ has a cycle . Corresponds to C we have inequalities which is a contradiction and the proof is complete.

Let τ be a set of triplets that is consistent with a tree or constructed from a given set of taxa, using TCD method. It was shown that G_τ is a DAG and by Theorem 3, is a feasible solution to IP(τ,+1).

Theorem 4. Let τ be a set of triplets consistent with a tree. Then is the unique optimal solution to IP(τ,+1).

Proof. The graph G_τ is a DAG, since τ is consistent with a tree. So is well efined.

The proof proceeds by induction on . Without loss of generality assume that G_τ is connected. The theorem is trivial when = 1. Let for each set of triplets consistent with a tree, be the unique optimal solution to IP(τ,+1) where = k1. Suppose that τ is a set of triplets consistent with a tree and = k+1. Let τ′ be the set of triplets which is introduced in the proof of Theorem 3. By the induction assumption is the unique optimal solution to IP(τ′, +1). By Theorem 3 the minimum s for which IP(τ, s) has a feasible solution is +1. Also +1 = . It follows that is the unique optimal solution to the IP(τ,+1) and the proof is complete.

It is important to point out that the introduced target function of the above IP can be replaced with other appropriate target functions. But we use this special target function because it can be easily possible to find a solution for this IP in polynomial time when the input triplets are obtained from TCD method. Secondly using this target function, enable us to prove those above theorems which show the consistency of the result of the TripNet algorithm with a tree when there is a tree consistent with given triplets.

5.2 TripNet algorithm

Now we describe the TripNet algorithm in nine steps. In this algorithm the input is a set of triplets τ and the output is a network consistent with τ. Also if τ is consistent with a tree the algorithm constructs a binarization of T_τ.

Step 1. In this step we find a height function h on L(τ). If G_τ is a DAG we set G′_τ = G_τ. If G_τ is not a DAG we remove some edges from G_τ in such a way that the resulting graph G′_τ is a DAG. Set h = .

If τ is obtained from a set of taxa using TCD method, then G_τ is a DAG. Removing minimum number of edges from a directed graph to make it a DAG is known as the minimum Feedback Arc Set problem which is NP-hard [11]. Thus we use the following heuristic method and try to remove as minimum number of edges as possible from G_τ in order to lose minimum information. First a cycle C is selected randomly. Let C_max denote the set of nodes in C with the maximum degree. Remove an edge of C which one of its ends belongs to C_max. This process continues until the resulting graph is a DAG. However, any such missing information will be recaptured in Step 9.

Step 2. In this step TripNet first apply HBUILD on h. If the result is a tree, TripNet constructs a binarization of this tree. Otherwise TripNet goes to Step 3.

Note that if τ is consistent with a tree, TripNet constructs a binarization of T_τ.

Step 3. Remove all the maximum-weight edges from G. The process of removing all the maximum-weight edges from the graph continues until the resulting graph is disconnected.

In [3] and [4] the authors introduced the concept of SN-sets for a set of triplets τ. A subset S of L(τ) is an SN-set if there is no triplet ij|k ∈ τ such that iS and j, k ∈ S. In [4] it is shown that if τ is dense then the maximal SN-sets partition L(τ) and can be found in polynomial time. By contracting each of the SN-set to a single node and assuming a common ancestor for all of these leaves, the size of the problem is reduced. In these papers, for finding the maximal SN-sets in polynomial time, the authors use the high density of the input triplet sets. TripNet algorithm uses the concept of height function as an auxiliary tool to obtain SN-sets instead of the high density assumption.

Step 4. For each connected component obtained in Step 3 which is not an SN-set, we apply Step 3. This process continues until all of the resulting components are SN-sets. Let {S₁, S₂, …, S_k} be the set of resulting SN-sets. If each SN-set contains only one node, HBUILD is applied and if the result is a tree TripNet constructs a binary tree and goes to Step 6. Otherwise TripNet goes to Step 5. If for some i, |S_i|>1, contract each S_i to a single node s_i and set S = {s₁, s₂, …, s_k}. Update the set of triplets by defining τ_S = {s_is_j|s_k: if ∃ xy|z ∈ τ, x ∈ S_i, y ∈ S_j and z ∈ S_k}. Constructs a weighted complete graph (G_S, w_S) with V(G_S) = S and w_S(s_i, s_j) = min {h(x, y): x ∈ S_i and y ∈ S_j}. Set (G, w) = (G_S, w_S) and TripNet goes to Step 3.

The following theorem is a consequence of the definition SN-set for (G_S, w_S).

Theorem 5. Applying Steps 3 and 4 on (G_S, w_S) and τ_S, each resulting SN-set has one member.

Proof. Suppose that S = {s₁, s₂, s₃, …, s_r} is an SN-set in (G_S, w_S). Now assume that in the procedure of Step 3 by removing the edges with weight l, S₁ separates from S₂. Thus there exists k > l such that by removing the edges with weight at least k in (G_S, w_S), the connected component S separates from other components of G_S. It means that by removing the edges with weight at least k in G, we obtain the SN-set which is a contradiction.

In the next step the reticulation leaves are recognized using the following three criteria:

Criterion I. Let m_i and M_i be the minimum and maximum weight of the edges in (G,h) with exactly one end in S_i. Choose the node with minimum m_i and if there is more than one node with minimum m_i then choose among them the nodes which has minimum M_i. Let R₁ denotes the set of such nodes.

Criterion II. Let w_min = min {w(s_i,s_j): }. In G_S consider the induced subgraph on the edges with the weight w_min. Choose the nodes of R₁ with the maximum degree in this induced subgraph. Let R₂ denotes the set of such nodes.

Criterion III. For each node s ∈ R₂, remove it from GS and find SN-sets for this new graph using Steps 3 and 4. Let n_s be the number of SN-sets of this new graph with cardinality greater than one. Choose the nodes in R₂ with maximum n_s. Let R₃ denotes the set of such nodes.

We state an example to show the idea behind these three criteria.

Let τ = {ij|l, jk|i, kl|j, kl|i, no|m, lo|k, jl|o, mn|l, mn|j, no|k, mo|i, jk|n, ij|o, ik|m, il|n}.

τ is not consistent with a tree but it is consistent with the network N shown in Figure 8a. Obviously, N is an optimal network consistent with τ. In order to find SN-sets we construct G′_τ and (G, h), and find SN-sets from (G, h) using Steps 3 and 4 (Figures 8b to 8g). It follows that S = {{i}, {j}, {k}, {l}, {m}, {n, o}}. Now in G_S (Figure 8h). we expect that the reticulation is in R₁. In this example both k and l are in R₁. Also we expect that if there is a reticulation leaf, it belongs to R₂ which again both k and l are in R₂. Now just l belongs to R₃. Thus we consider l as the reticulation leaf (Figures 8i to 8n). Remove triplets from τ_S which contain l and denote the new set of triplets by τ′_S. Obviously τ′_S is consistent with a tree. We add this reticulation leaf to a binarization of such that the resulting network is consistent with τ_S. Note that if we consider each node except than l as the reticulation leaf then final network consistent with τ_S has at least two reticulation leaves.

Download:

Figure 8. An example to show how TripNet works to find a reticulation leaf by applying step 5.

Edges with weight 6 are shown by dotted lines. (a) τ = {ij|l, jk|i, kl|j, kl|i, no|m, lo|k, jl|o, mn|l, mn|j, no|k, mo|i, jk|n, ij|o, ik|m, il|n} is not consistent with a tree and is consistent with the given level-1 network, (b) G′_τ is obtained from G_τ by removing the dotted line, (c) Computing (G, h), (d) Remove edges with weights 6 and 5 from (G, h) to obtain SN-sets {n, o} and {m}, (e) Remove edges with weights 4 and 3 from the remaining graph to obtain SN-set {i}, (f) Remove edges with weights 2 from the remaining graph to obtain SN-set {j}, (g) Remove edges with weights 1 from the remaining graph to obtain SN-sets {k} and {l}, (h) Compute G_S. both SN-sets {k} and {l} satisfy Criteria I and II, (i) Remove {l} from G_S, (j) Remove edges with weights 6 from the graph of previous step to obtain SN-sets {i, j, k} and {m, n, o}, (k) Remove {k} from G_S, (l) Remove edges with weights 6 and 5 from the graph of previous step to obtain SN-sets {n, o} and {m}, (m) Remove edges with weights 4 from the remaining graph to obtain SN-set {l}, (n) Remove edges with weights 3 from the remaining graph to obtain SN-sets {i} and {j}. The steps i to n shows that l is the reticulation leaf. In these steps criterion III is applied.

https://doi.org/10.1371/journal.pone.0106531.g008

Step 5. In this step the reticulation leaf is recognized using three criteria. Do the criterion I. If |R₁| = 1 then choose the node x ∈ R₁ as the reticulation node. Otherwise if |R₁|>1 do the criterion II. If |R₂| = 1 then choose the node x ∈ R₂ as the reticulation node. Otherwise if |R₂|>1 do the criterion III. If |R₃| = 1 then choose the node x ∈ R₃ as the reticulation node. Otherwise if |R₃|>1 then by the speed options we choose the reticulation node as follows.

Slow. Each node in R₃ is examined as the reticulation leaf.

Normal. Two nodes in R₃ are selected randomly and each of these two nodes is examined as the reticulation leaf.

Fast. One node in R₃ is selected randomly as the reticulation leaf.

Let x be a node which is considered as a reticulation leaf. Remove x from G_S and all of the triplets which contain x from τ_S. Define G = G \ {x} and go to Step 3.

Note that for the Fast option the running time of the algorithm is polynomial.

For biological data almost always the criteria I and II find a unique reticulation leaf.

So on real data the running time of TripNet is almost always polynomial.

Step 6. Let x₁, x₂, …, x_m be m reticulation leaves which are obtained in Step 5 with this order and T be the tree that is constructed in Step 4. Now add these m nodes in the reverse order to T as what follows. Let e₁ and e₂ be two edges of T. Consider two new nodes y₁ and y₂ in the middle of e₁ and e₂. Connect y₁ and y₂ to a new node y₃ and connect the reticulation leaf x_m to y₃. Do this procedure for all pairs of edges and choose a pair such that the resulting network is consistent with maximum number of triplets in τ. Continue this procedure until all the reticulation nodes are added.

Step 7. For each SN-set S_i and the set of triplets we run the algorithm again.

Step 8. Replace each SN-set in the network of Step 6 with its related network constructed in Step 7 to obtain a network N′.

Let τ′ ∈ τ be the set of the triplets which are not consistent with N′. For each pair of leaves a and b assume that τ′_ab is the set of triplets in τ′ which are of the form ab|c. Consider the pair of leaves i and j such that τ′_ij has the maximum cardinality. Assume that p_i and p_j are the parents of i and j, respectively.

Step 9. Create two new nodes in the middle of the edges p_i i and p_j j and connect them with a new edge. This new edge creates a reticulation node and all of the triplets in τ′_ij will be consistent with the new network. All consistent triplets with the new network are removed from τ′ and this procedure will continue until τ′ becomes empty.

Figure 9 presents an example of the algorithm with all of its Steps.

Download:

Figure 9. Steps of TripNet for input triplets: jk |i, li |j, mj |i, jn|i, kl |i, ik |m, ik |n, lm|i, ln|i, mn|i, kl |j, km|j, jn|k, lm|j, jl |n, mn|j, kl |m, kl |n, mn|k, mn|l }.

https://doi.org/10.1371/journal.pone.0106531.g009

Acknowledgments

The authors would like to thank M.Kargar for his kind and useful comments. This work is supported in part by a grant (BS-1392-1-05) from the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.

Author Contributions

Conceived and designed the experiments: CE HP RT. Performed the experiments: HP. Analyzed the data: CE HP RT. Contributed reagents/materials/analysis tools: CE HP RT. Wrote the paper: CE HP RT.

References

1. Huson DH, Rupp R, Scornavacca C (2010) Phylogenetic Networks Concepts, Algorithms and Applications. Cambridge University Press.
2. Aho AV, Sagiv Y, Szymanski TG, Ullman JD (1981) Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comp 10: 405–421.
3. Jansson J, Nguyen NB, Sung WK (2006) Algorithms For combining rooted triplets into a galled phylogenetic network. SIAM Journal on Computing 35(5): 1098–1121.
- View Article
- Google Scholar
4. Jansson J, Sung Wk (2006) Inferring a Level-1 Phylogenetic Network from a Dense Set of Rooted Triplets. Theoretical Computer Science 363: 60–68.
5. Huber K, Iersel LV, Kelk S, Suchecki R (2010) A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
6. Van Iersel L, Kelk S (2011) Constructing the simplest possible phylogenetic network from triplets. Algorithmica 60: 207–235.
- View Article
- Google Scholar
7. To TH, Habib M (2009) Level-k Phylogenetic Networks are Constructable from a Dense Triplet Set in Polynomial Time. In CPM09 5577: 275–288.
- View Article
- Google Scholar
8. Huson DH (1998) SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14(10): 68–73.
- View Article
- Google Scholar
9. Grassly N, Rambaut A (1997) Treevole: a program to simulate the evolution of DNA sequences under different population dynamic scenarios. 1.3. Wellcome Centre for Infectious Disease, Department of Zoology, Oxford University, Oxford, UK.
10. Van Iersel L, Keijsper J, Kelk S, Stougie L, Hagen F, et al. (2009) Constructing level-2 phylogenetic networks from triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6(4): 667–681.
- View Article
- Google Scholar
11. Karp R (1972) Reducibility among combinatorial problems. Proc. Sympos., IBM Thomas J. Watson Res. Center, Yorktown Heights, N.Y. 85–103.

[ref1] 1. Huson DH, Rupp R, Scornavacca C (2010) Phylogenetic Networks Concepts, Algorithms and Applications. Cambridge University Press.

[ref2] 2. Aho AV, Sagiv Y, Szymanski TG, Ullman JD (1981) Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comp 10: 405–421.

[ref3] 3. Jansson J, Nguyen NB, Sung WK (2006) Algorithms For combining rooted triplets into a galled phylogenetic network. SIAM Journal on Computing 35(5): 1098–1121.
View Article
Google Scholar

[4] View Article

[5] Google Scholar

[ref4] 4. Jansson J, Sung Wk (2006) Inferring a Level-1 Phylogenetic Network from a Dense Set of Rooted Triplets. Theoretical Computer Science 363: 60–68.

[ref5] 5. Huber K, Iersel LV, Kelk S, Suchecki R (2010) A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[ref6] 6. Van Iersel L, Kelk S (2011) Constructing the simplest possible phylogenetic network from triplets. Algorithmica 60: 207–235.
View Article
Google Scholar

[9] View Article

[10] Google Scholar

[ref7] 7. To TH, Habib M (2009) Level-k Phylogenetic Networks are Constructable from a Dense Triplet Set in Polynomial Time. In CPM09 5577: 275–288.
View Article
Google Scholar

[12] View Article

[13] Google Scholar

[ref8] 8. Huson DH (1998) SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14(10): 68–73.
View Article
Google Scholar

[15] View Article

[16] Google Scholar

[ref9] 9. Grassly N, Rambaut A (1997) Treevole: a program to simulate the evolution of DNA sequences under different population dynamic scenarios. 1.3. Wellcome Centre for Infectious Disease, Department of Zoology, Oxford University, Oxford, UK.

[ref10] 10. Van Iersel L, Keijsper J, Kelk S, Stougie L, Hagen F, et al. (2009) Constructing level-2 phylogenetic networks from triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6(4): 667–681.
View Article
Google Scholar

[19] View Article

[20] Google Scholar

[ref11] 11. Karp R (1972) Reducibility among combinatorial problems. Proc. Sympos., IBM Thomas J. Watson Res. Center, Yorktown Heights, N.Y. 85–103.