A Survey of Methods for Constructing Rooted Phylogenetic Networks

  • Juan Wang

Abstract

Rooted phylogenetic networks are primarily used to represent conflicting evolutionary information and to describe reticulate evolutionary events in phylogeny. Many methods have been presented for constructing rooted phylogenetic networks; among them, the methods based on the decomposition property of networks and on the incompatibility graph (such as CASS, LNETWORK and BIMLR) are more efficient than the other available methods. This paper discusses and compares these methods on both practical and artificial datasets, in terms of their running time and the effectiveness of the constructed phylogenetic networks. The results show that LNETWORK can construct much simpler networks than the others.

Introduction

The evolutionary history of species is traditionally represented as a rooted phylogenetic tree. Each tree represents certain evolutionary information about the species (expressed as clusters, which are defined below) [1–4]. When rooted phylogenetic trees are constructed by different methods or from different datasets, the evolutionary information they represent often conflicts, and conflicting evolutionary information cannot be expressed as a single phylogenetic tree. A phylogenetic network, which is a generalization of a phylogenetic tree, can represent such conflicting information. It can also describe evolution involving significant amounts of reticulate events such as recombinations, hybridizations and horizontal gene transfers [5–9].

Phylogenetic networks can be divided into unrooted [10–15] and rooted networks [16–26]. Unrooted phylogenetic networks are mainly used to visualize conflicting evolutionary information. Rooted phylogenetic networks can not only represent the conflicting evolutionary information implied by phylogenetic trees, but also describe the reticulate evolutionary events that species underwent during evolution [27]. There is a large body of research on rooted phylogenetic networks, and devising appropriate algorithms for constructing rooted phylogenetic networks from rooted phylogenetic trees has become an important field of research in molecular evolution. Many researchers have recently worked in this field and developed a number of methods.

Dendroscope [28] is a program for constructing rooted phylogenetic networks that brings together several methods, such as the cluster network [29], the galled network [30] and CASS [22]. Among these methods, CASS can construct simpler networks than the others, but it is extremely slow for large datasets. The networks constructed by CASS also depend strongly on the order of the input data, i.e. the constructed phylogenetic networks generally differ for the same dataset when the input orders differ. Wang et al. subsequently improved the CASS algorithm and designed two algorithms, LNETWORK [31] and BIMLR [32]. LNETWORK and BIMLR are faster than CASS and are less influenced by the order of the input data.

In the following, we refer to rooted phylogenetic networks as networks, unless otherwise stated.

Preliminaries

Let $\mathcal{X}$ be a set of taxa. A subset of $\mathcal{X}$ other than $\emptyset$ and $\mathcal{X}$ itself is called a cluster. Two clusters $C_1$ and $C_2$ on $\mathcal{X}$ are compatible if they are disjoint or one is a subset of the other, i.e. $C_1 \cap C_2 = \emptyset$, $C_1 \subseteq C_2$ or $C_2 \subseteq C_1$; otherwise they are incompatible. Let $\mathcal{C}$ be a set of clusters on $\mathcal{X}$. If every two clusters in $\mathcal{C}$ are compatible, $\mathcal{C}$ is called compatible; otherwise it is incompatible. The incompatibility graph $IG(\mathcal{C}) = (V, E)$ of $\mathcal{C}$ is the undirected graph with node set $V = \mathcal{C}$ and edge set $E$, in which two clusters are connected by an edge if and only if they are incompatible.
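The compatibility test and the incompatibility graph can be sketched in a few lines of Python. This is only an illustration: clusters are modelled as `frozenset` objects, and the function names and the four-taxon example are assumptions, not part of any of the surveyed programs.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def incompatibility_graph(clusters):
    """Return the edge set of the incompatibility graph:
    one edge per incompatible pair of clusters."""
    return {(a, b) for a, b in combinations(clusters, 2)
            if not compatible(a, b)}

# Hypothetical clusters on the taxa {a, b, c, d}.
clusters = [frozenset("ab"), frozenset("bc"), frozenset("abc"), frozenset("d")]
edges = incompatibility_graph(clusters)
```

In this example only {a, b} and {b, c} overlap without being nested, so the graph has a single edge between them.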

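Let $S$ be a subset of $\mathcal{X}$. The restriction of $\mathcal{C}$ to $S$, denoted by $\mathcal{C}|_S$, is the result of removing all taxa in $\mathcal{X} \setminus S$ (i.e. the taxa in $\mathcal{X}$ but not in $S$) from each cluster in $\mathcal{C}$. The incompatibility degree of $\mathcal{C}$, denoted by $d(\mathcal{C})$, is the number of edges in $IG(\mathcal{C})$. Let $\mathcal{C}$ be a set of clusters on $\mathcal{X}$ and $x$ a taxon in $\mathcal{X}$. The incompatibility degree of $x$, denoted by $d(x)$, is obtained by subtracting the incompatibility degree of $\mathcal{C}|_{\mathcal{X} \setminus \{x\}}$ from that of $\mathcal{C}$, i.e. $d(x) = d(\mathcal{C}) - d(\mathcal{C}|_{\mathcal{X} \setminus \{x\}})$. If the incompatibility degree of a taxon $x$ is maximal among all taxa in $\mathcal{X}$, i.e. $d(x) = \max_{y \in \mathcal{X}} d(y)$, we call $x$ an incompatibility taxon w.r.t. $\mathcal{C}$. The frequency of a taxon $x$ w.r.t. $\mathcal{C}$, denoted by $f(x)$, is the number of clusters that contain $x$, i.e. $f(x) = |\{C \in \mathcal{C} : x \in C\}|$.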
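The definitions of restriction, incompatibility degree and frequency translate directly into code. The sketch below is illustrative only; the function names are assumptions, and the restriction discards empty and duplicate clusters, which does not change the number of incompatible pairs.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def degree(clusters):
    """Incompatibility degree of a cluster set: number of incompatible pairs."""
    return sum(not compatible(a, b) for a, b in combinations(clusters, 2))

def restrict(clusters, keep):
    """Restriction to `keep`: drop all other taxa from every cluster.
    Empty and duplicate clusters are discarded (an assumption of this sketch)."""
    return {c & keep for c in clusters} - {frozenset()}

def taxon_degree(clusters, taxa, x):
    """Incompatibility degree of taxon x: the drop in degree when x is removed."""
    return degree(clusters) - degree(restrict(clusters, taxa - {x}))

def frequency(clusters, x):
    """Number of clusters that contain taxon x."""
    return sum(x in c for c in clusters)

# Hypothetical example on four taxa.
taxa = frozenset("abcd")
clusters = {frozenset("ab"), frozenset("bc"), frozenset("cd")}
degrees = {x: taxon_degree(clusters, taxa, x) for x in sorted(taxa)}
```

Here the cluster set has incompatibility degree 2; removing b or c resolves both conflicts, whereas removing a or d resolves only one, so b and c are the incompatibility taxa.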

A rooted phylogenetic network $N = (V, E)$ on $\mathcal{X}$ is a rooted directed acyclic graph (DAG for short) whose leaves are bijectively labelled by $\mathcal{X}$. The indegree of a node $v \in V$ is denoted by $\sigma(v)$. A node $v$ is a reticulate node if $\sigma(v) \geq 2$; otherwise it is a tree node. In particular, a tree node is the root node if $\sigma(v) = 0$. An edge $e = (u, v)$ is a tree edge if $v$ is a tree node; otherwise it is a reticulate edge. The reticulation number of a network $N = (V, E)$ is $\sum_{v \in V,\ \sigma(v) > 0} (\sigma(v) - 1) = |E| - |V| + 1$.
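As a small check of the reticulation number formula, the following sketch computes it from an edge list in both ways. The network and the node names are hypothetical, and the closed form assumes a DAG with a single root.

```python
def reticulation_number(edges):
    """Sum of (indegree - 1) over all nodes with indegree > 0."""
    indegree = {}
    for _, v in edges:
        indegree[v] = indegree.get(v, 0) + 1
    return sum(d - 1 for d in indegree.values())

# Hypothetical network: leaf b hangs below reticulate node h,
# which has the two parents p and q.
edges = [("root", "p"), ("root", "q"), ("p", "a"), ("p", "h"),
         ("q", "h"), ("q", "c"), ("h", "b")]
nodes = {n for edge in edges for n in edge}

r = reticulation_number(edges)
closed_form = len(edges) - len(nodes) + 1  # |E| - |V| + 1
```

Both expressions give 1 here, since h is the only node with indegree greater than one.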

Let $C$ be a cluster and $T$ a rooted phylogenetic tree. If there is an edge $e$ in $T$ such that the set of taxa reachable from $e$ equals $C$, we say that $T$ represents $C$. For a network $N$, consider keeping exactly one incoming edge and deleting all other incoming edges for each reticulate node. If, for some such choice, there exists a tree edge $e$ such that the set of taxa reachable from $e$ equals $C$, we say that $N$ represents $C$ in the softwired sense. Alternatively, if there is a tree edge $e$ in $N$ such that the set of taxa reachable from $e$ equals $C$, we say that $N$ represents $C$ in the hardwired sense.
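The softwired definition can be illustrated by brute force: enumerate every way of keeping one incoming edge per reticulate node and collect the clusters below the tree edges. This is a minimal sketch for tiny networks only (the enumeration is exponential in the number of reticulate nodes); the function name and the example network are assumptions.

```python
from itertools import product

def softwired_clusters(edges, leaves):
    """Collect the clusters found below tree edges over all switchings."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    nodes = list(parents)  # every node except the root
    found = set()
    # A switching keeps exactly one incoming edge per node; only
    # reticulate nodes (indegree >= 2) offer a real choice.
    for kept in product(*(parents[v] for v in nodes)):
        children = {}
        for v, u in zip(nodes, kept):
            children.setdefault(u, []).append(v)

        def below(v):
            """Set of leaf labels reachable from v in the switched network."""
            if v in leaves:
                return frozenset([v])
            return frozenset().union(*(below(w) for w in children.get(v, [])))

        for v in nodes:
            cluster = below(v)
            if len(parents[v]) == 1 and cluster:  # edge into a tree node
                found.add(cluster)
    return found

# Hypothetical network: leaf b hangs below reticulate node h,
# which has the two parents p and q.
edges = [("root", "p"), ("root", "q"), ("p", "a"), ("p", "h"),
         ("q", "h"), ("q", "c"), ("h", "b")]
clusters = softwired_clusters(edges, leaves={"a", "b", "c"})
```

Keeping the edge (p, h) yields the cluster {a, b}, keeping (q, h) yields {b, c}, so this network represents both of these incompatible clusters in the softwired sense. The result also contains the singleton clusters of the leaves.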

Let $N = (V, E)$ be a network representing the set of clusters $\mathcal{C}$. A cluster $C \in \mathcal{C}$ is often represented by more than one tree edge in $N$, and a tree edge $e \in E$ often represents more than one cluster in $\mathcal{C}$. Suppose there exists a mapping $\epsilon$ from $\mathcal{C}$ to the set of tree edges of $N$, where $\epsilon(C)$ is a tree edge representing $C$ for each $C \in \mathcal{C}$, such that for any two clusters $C_1, C_2 \in \mathcal{C}$, $C_1$ and $C_2$ lie in the same connected component of the incompatibility graph $IG(\mathcal{C})$ if and only if $\epsilon(C_1)$ and $\epsilon(C_2)$ are contained in the same biconnected component of $N$. Then $N$ is called decomposable w.r.t. $\mathcal{C}$.

When constructing rooted phylogenetic networks from rooted phylogenetic trees, the methods first compute the clusters represented by the input trees, and then construct a rooted phylogenetic network representing all clusters.

Rooted phylogenetic networks can describe evolutionary history in the presence of reticulate events, such as horizontal gene transfers, hybridizations and recombinations. These reticulate events are rare in reality [33]. Accordingly, the constructed network is expected to have the minimum number of reticulate nodes. Let $N$ be a network constructed for the input cluster set $\mathcal{C}$, and let $\mathcal{C}(N)$ be the set of clusters represented by $N$. In general $\mathcal{C}(N)$ contains more clusters than $\mathcal{C}$, i.e. $\mathcal{C} \subseteq \mathcal{C}(N)$. We define the redundant clusters of $N$ as the clusters that are in $\mathcal{C}(N)$ but not in $\mathcal{C}$, i.e. $\mathcal{C}(N) \setminus \mathcal{C}$. In phylogenetic analysis, the taxa in a cluster are putatively monophyletic. Consequently, the ideal situation would be $\mathcal{C}(N) = \mathcal{C}$, i.e. all clusters represented by the constructed network would be clusters represented in the input trees, and no others. Therefore, by the parsimony principle, the best constructed network is one that minimizes the number of redundant clusters, subject to having the minimum number of reticulate nodes.

The following section will introduce the main methods for constructing rooted phylogenetic networks from rooted phylogenetic trees.

Methods

So far, the main methods for constructing rooted phylogenetic networks from rooted phylogenetic trees are the cluster network, the galled network, the CASS, the LNETWORK and the BIMLR. The following will give a brief introduction to each method.

The cluster network is a method for constructing rooted phylogenetic networks based on the Hasse diagram. Given a set of clusters $\mathcal{C}$, it first defines a partial order $\preceq$ on $\mathcal{C}$: for $u, v \in \mathcal{C}$, $u \preceq v$ if and only if $u \subseteq v$. The pair $(\mathcal{C}, \preceq)$ is a partially ordered set. It then draws the Hasse diagram $H = (V, E)$ of $(\mathcal{C}, \preceq)$, which is a DAG with node set $V = \mathcal{C}$ and edge set $E$, where there is an edge $e = (u, v)$ if and only if $v \subsetneq u$ and there exists no other node $w$ in $V$ such that $v \subsetneq w \subsetneq u$. Finally, it labels the leaves of $H$ by the taxa of $\mathcal{X}$ and assigns the root of $H$. The resulting DAG is the rooted phylogenetic network representing $\mathcal{C}$.
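The Hasse diagram step can be sketched as follows. This is an illustration of the covering relation only, not the Dendroscope implementation; edges are directed from the larger cluster to the smaller one, and the example assumes that the full taxon set and the singleton clusters have been added so that the diagram has a root and labelled leaves.

```python
def hasse_edges(clusters):
    """Edges (u, v) where v is a proper subset of u and no cluster
    lies strictly between them (the covering relation of inclusion)."""
    edges = set()
    for u in clusters:
        for v in clusters:
            if v < u and not any(v < w < u for w in clusters):
                edges.add((u, v))
    return edges

# Hypothetical input: two incompatible clusters {a, b} and {b, c},
# plus the root cluster and the singleton leaves (added by assumption).
names = ["abc", "ab", "bc", "a", "b", "c"]
clusters = [frozenset(n) for n in names]
edges = hasse_edges(clusters)

# Nodes with more than one incoming edge are the reticulate nodes.
indegree = {}
for _, v in edges:
    indegree[v] = indegree.get(v, 0) + 1
reticulate = {v for v, d in indegree.items() if d >= 2}
```

In this example the leaf b is covered by both {a, b} and {b, c}, so it receives two incoming edges and becomes the single reticulate node.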

The galled network is a method based on the seed-growing algorithm. It first uses the seed-growing algorithm to find a set of taxa $S$ such that the set $\mathcal{C}|_{\mathcal{X} \setminus S}$ is compatible. Next it constructs a rooted tree $T$ for $\mathcal{C}|_{\mathcal{X} \setminus S}$. Finally it attaches reticulate nodes to $T$ under certain constraints, where the children of the reticulate nodes are labelled by the taxa of $S$. The constructed network represents $\mathcal{C}$.

CASS, LNETWORK and BIMLR are methods based on the decomposition property of networks. They first find all non-trivial biconnected components $\mathcal{C}_1, \ldots, \mathcal{C}_k$ of $IG(\mathcal{C})$, then construct a subnetwork for each $\mathcal{C}_i$, and finally integrate those subnetworks into a final network. The methods differ in how they construct the subnetworks.
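The decomposition step can be sketched as a component search on the incompatibility graph. The sketch below assumes that the components in question are the connected components of the incompatibility graph, as in the definition of decomposability, and that a component is non-trivial when it contains more than one cluster; the function names and the example are hypothetical.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def components(clusters):
    """Connected components of the incompatibility graph (depth-first search)."""
    clusters = list(clusters)
    adjacent = {c: set() for c in clusters}
    for a, b in combinations(clusters, 2):
        if not compatible(a, b):
            adjacent[a].add(b)
            adjacent[b].add(a)
    seen, result = set(), []
    for start in clusters:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            c = stack.pop()
            if c in component:
                continue
            component.add(c)
            stack.extend(adjacent[c] - component)
        seen |= component
        result.append(component)
    return result

# Hypothetical clusters: two independent conflicts and one compatible cluster.
clusters = [frozenset(s) for s in ("ab", "bc", "abc", "de", "ef")]
non_trivial = [comp for comp in components(clusters) if len(comp) > 1]
```

Each non-trivial component can then be handled separately: here {a, b} and {b, c} form one subproblem and {d, e} and {e, f} another, while {a, b, c} conflicts with nothing.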

A network $N$ represents an incompatible set of clusters, and after all reticulate nodes are removed, $N$ becomes a tree representing a compatible set of clusters. Network construction is the inverse of this process. Given a set of clusters $\mathcal{C}$ on $\mathcal{X}$, we can construct a tree for $\mathcal{C}$ directly if it is compatible. Otherwise, we first remove a set of taxa $S$ from $\mathcal{X}$ such that the resulting set $\mathcal{C}|_{\mathcal{X} \setminus S}$ is compatible, then construct a tree for $\mathcal{C}|_{\mathcal{X} \setminus S}$, and finally append the removed taxa to the tree under certain conditions.
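For the compatible case, the tree can be read off directly: the parent of each cluster is its smallest proper superset. The sketch below illustrates this under the assumption that the root cluster and the singleton leaves are added to the input; it is not taken from any of the surveyed programs, and the names are hypothetical.

```python
def tree_parents(clusters, taxa):
    """Map every non-root cluster of a compatible set to its parent cluster.

    In a compatible set the proper supersets of a cluster are nested,
    so the smallest one is unique and serves as the parent.
    """
    root = frozenset(taxa)
    nodes = set(clusters) | {root} | {frozenset([x]) for x in taxa}
    parent = {}
    for c in nodes:
        if c != root:
            supersets = [d for d in nodes if c < d]
            parent[c] = min(supersets, key=len)
    return parent

# Hypothetical compatible cluster set on four taxa.
taxa = "abcd"
parent = tree_parents([frozenset("ab"), frozenset("abc")], taxa)
```

The resulting tree is the caterpillar ((a, b), c), d: {a, b} sits below {a, b, c}, which in turn sits below the root.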

CASS, LNETWORK and BIMLR construct subnetworks as described above. Let $\mathcal{X}_i$ be the taxa set of $\mathcal{C}_i$ ($1 \leq i \leq k$). To construct the subnetwork for $\mathcal{C}_i$, they first remove a few taxa $S_i \subseteq \mathcal{X}_i$ from each cluster in $\mathcal{C}_i$ such that the resulting set $\mathcal{C}_i|_{\mathcal{X}_i \setminus S_i}$ is compatible, then construct a rooted tree $T$ for $\mathcal{C}_i|_{\mathcal{X}_i \setminus S_i}$, next attach new reticulate nodes to $T$, and finally add a new leaf below each reticulate node and label it with a removed taxon. The choice of the taxa removed from $\mathcal{X}_i$ is therefore pivotal.

The three methods differ in which taxa they remove. CASS proceeds by trial and error: it removes a few randomly chosen taxa and stops if it can construct a network representing the set of clusters; otherwise it continues by removing other taxa. LNETWORK removes the taxa computed by the seed-growing algorithm. BIMLR removes the incompatibility taxa with the maximal frequency. CASS aims at minimizing the number of reticulate nodes, while LNETWORK and BIMLR aim to minimize the number of redundant clusters while also keeping the number of reticulate nodes as small as possible.
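The selection rule described for BIMLR can be illustrated with a simple greedy loop: repeatedly remove a taxon of maximal incompatibility degree, breaking ties by maximal frequency, until the remaining clusters are compatible. This is a simplified sketch under those assumptions, not the published BIMLR algorithm, and all function names are hypothetical.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def degree(clusters):
    """Number of incompatible pairs in a cluster set."""
    return sum(not compatible(a, b) for a, b in combinations(clusters, 2))

def restrict(clusters, keep):
    """Drop all taxa outside `keep`; discard empty and duplicate clusters."""
    return {c & keep for c in clusters} - {frozenset()}

def taxa_to_remove(clusters, taxa):
    """Greedily pick taxa until the restricted cluster set is compatible."""
    clusters, taxa, removed = set(clusters), set(taxa), []
    while degree(clusters) > 0:
        def score(t):
            drop = degree(clusters) - degree(restrict(clusters, taxa - {t}))
            freq = sum(t in c for c in clusters)
            return (drop, freq)  # incompatibility degree first, then frequency
        x = max(sorted(taxa), key=score)  # sorted() makes ties deterministic
        removed.append(x)
        taxa.discard(x)
        clusters = restrict(clusters, taxa)
    return removed

# Hypothetical example: taxon b is involved in all three conflicts.
clusters = {frozenset("ab"), frozenset("bc"), frozenset("bd")}
removed = taxa_to_remove(clusters, "abcd")
```

In this example removing b alone makes the clusters compatible, so b would become the leaf below the single reticulate node of the subnetwork.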

When constructing networks, the LNETWORK and the BIMLR find all networks representing the cluster set, mainly to reduce the number of redundant clusters in the resulting network and lessen the influence of the input data order on the resulting network.

Results

To assess the performance of these methods, we ran experiments on both practical and artificial data. Reference [31] compared LNETWORK, CASS, the cluster network and the galled network, and its results show that LNETWORK and CASS can construct much simpler networks than the others. Here we compare only LNETWORK, CASS and BIMLR, using both practical and artificial data (https://sites.google.com/site/cassalgorithm/data-sets).

All experiments were performed on a computer with an Intel Xeon E5504 2.0GHz CPU, 8 GB RAM and 147GB HDD. The operating system was Debian 4.1 32bit with Java 1.6 installed.

The experiments compare two main aspects of these methods: the influence of input data order (Table 1), and the complexity of the constructed networks, i.e. the reticulation number and the number of redundant clusters (Tables 2 and 3).

Table 1. Results of LNETWORK compared with BIMLR in terms of influence of input data order.

https://doi.org/10.1371/journal.pone.0165834.t001

Table 2. Results of BIMLR, LNETWORK and CASS for the artificial datasets.

https://doi.org/10.1371/journal.pone.0165834.t002

Table 3. Results of BIMLR, LNETWORK and CASS for the practical datasets.

https://doi.org/10.1371/journal.pone.0165834.t003

References [31, 32] have shown that LNETWORK and BIMLR are superior to CASS in terms of the influence of input data order, and that both methods are faster than CASS. Here we compare only LNETWORK and BIMLR on the influence of input data order; the results are shown in Table 1. Because each program must construct a network for every input order of a dataset, the running time of this experiment grows factorially, so we use only small datasets. When a method is run on the same dataset with different input orders, the smaller the differences among the constructed networks, the more stable the method. The difference among the constructed networks is measured by the tripartition distance [34].
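The cost of this protocol is easy to see in a sketch: running a constructor on every ordering of n input items requires n! runs. The code below only illustrates the enumeration; `toy_build` is a hypothetical stand-in for a network construction program, not LNETWORK or BIMLR, and its output is deliberately order-dependent.

```python
from itertools import permutations
from math import factorial

def distinct_outputs(items, build):
    """Run `build` on every ordering of `items` and collect the distinct results."""
    return {build(order) for order in permutations(items)}

def toy_build(order):
    """Hypothetical order-dependent constructor: the result depends
    only on which input item comes first."""
    return order[0]

items = ["t1", "t2", "t3", "t4"]
runs = factorial(len(items))  # number of input orders to try
results = distinct_outputs(items, toy_build)
```

With four input items there are already 24 orders, and the toy constructor produces four distinct outputs. A stable method would produce few distinct networks, and those networks would be close to each other under the chosen distance.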

Table 1 shows, for each dataset, the number of constructed networks and the mean, minimum (min) and maximum (max) of the tripartition distance among those networks. The last row shows the average values. From Table 1 we draw the following conclusions. First, for the same data with different input orders, BIMLR constructs fewer different networks than LNETWORK except for the dataset with and . Second, for almost all datasets, the mean, min and max of BIMLR are less than the corresponding values of LNETWORK except for the datasets with , and . The results show that, when the input order of the data is changed, even if BIMLR constructs more than one network, those networks are more similar to each other than the networks constructed by LNETWORK. Thus, BIMLR is more stable than LNETWORK.

We compare BIMLR, LNETWORK and CASS on several artificial datasets. Table 2 reports the reticulation number r, the number of redundant clusters c, and the running time t in hours (h), minutes (m) and seconds (s), with the average values in the last row. Table 2 shows that BIMLR takes the least time except for the datasets with and . For every dataset, the reticulation number of the network constructed by BIMLR is the same as that of the network constructed by CASS, which is less than that of the network constructed by LNETWORK. The networks constructed by LNETWORK have fewer redundant clusters than those constructed by CASS for almost all datasets, while the networks constructed by BIMLR have the fewest redundant clusters. Thus, for the artificial datasets, the networks constructed by BIMLR are the simplest in terms of both the number of redundant clusters and the reticulation number.

We compare BIMLR, LNETWORK and CASS on practical datasets. Table 3 reports the reticulation number r, the number of redundant clusters c, and the running time t in hours (h), minutes (m) and seconds (s), with the average values in the last row. Table 3 shows that BIMLR takes the least time except for the dataset with . For most datasets, the networks constructed by CASS have the smallest reticulation number, followed by LNETWORK and then BIMLR. The networks constructed by LNETWORK have the fewest redundant clusters, followed by BIMLR and then CASS. Table 3 also shows that the average reticulation number of LNETWORK and BIMLR is only slightly greater than that of CASS. Hence, the networks constructed by LNETWORK are the simplest for the practical datasets.

From Tables 2 and 3, it follows that LNETWORK and BIMLR take less time than CASS; the networks constructed by LNETWORK and BIMLR have fewer redundant clusters than those constructed by CASS; and the average reticulation number of LNETWORK and BIMLR is slightly greater than that of CASS.

Conclusion

We compared BIMLR, LNETWORK and CASS using both artificial and practical datasets. The results show that BIMLR is superior to the others for the artificial datasets, while LNETWORK is superior to the others for the practical datasets. Accordingly, in practice, LNETWORK is the best option for the construction of networks.

Acknowledgments

I would like to thank the editor and two anonymous referees for remarks and suggestions which improved the exposition in this paper.

Author Contributions

  1. Conceptualization: JW.
  2. Data curation: JW.
  3. Formal analysis: JW.
  4. Funding acquisition: JW.
  5. Investigation: JW.
  6. Methodology: JW.
  7. Project administration: JW.
  8. Resources: JW.
  9. Software: JW.
  10. Supervision: JW.
  11. Validation: JW.
  12. Visualization: JW.
  13. Writing – original draft: JW.
  14. Writing – review & editing: JW.

References

  1. Vellutini BC, Hejnol A. Expression of segment polarity genes in brachiopods supports a non-segmental ancestral role of engrailed for bilaterians. Scientific Reports. 2016; 6(6): 1994–1994. pmid:27561213
  2. Hejnol A, Pang K. Xenacoelomorpha’s significance for understanding bilaterian evolution. Current Opinion in Genetics & Development. 2016; 39: 48–54. pmid:27322587
  3. Huson DH, Rupp R, Scornavacca C. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press; 2011.
  4. Zou Q, Guo J, Ju Y, Wu M, Zeng X, Hong Z. Improving tRNAscan-SE Annotation Results via Ensemble Classifiers. Molecular Informatics. 2015; 34(11-12): 2992–3000.
  5. Zou Q, Hu Q, Guo M, Wang G. HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics. 2015; 31(15): 2475–2481. pmid:25812743
  6. Linder CR, Rieseberg LH. Reconstructing patterns of reticulate evolution in plants. American Journal of Botany. 2004; 91(10): 1700–1708.
  7. Gusfield D, Hickerson D, Eddhu S. An efficiently computed lower bound on the number of recombinations in phylogenetic networks: Theory and empirical study. Discrete Applied Mathematics. 2007; 155(6): 806–830.
  8. Zou Q, Li XB, Jiang WR, Lin ZY, Li GL, Chen K. Survey of MapReduce frame operation in bioinformatics. Briefings in Bioinformatics. 2014; 15(4): 637–647. pmid:23396756
  9. Liu Y, Zeng X, He Z, Zou Q. Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Transactions on Computational Biology & Bioinformatics. 2016: 1–1. pmid:27076459
  10. Bandelt HJ, Dress AW. A canonical decomposition theory for metrics on a finite set. Advances in Mathematics. 1992; 92(1): 47–105.
  11. Bandelt HJ, Forster P, Röhl A. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution. 1999; 16(1): 37–48. pmid:10331250
  12. Buneman P. The recovery of trees from measures of dissimilarity. Mathematics in the archaeological and historical sciences. Edinburgh University Press; 1971: 387–395.
  13. Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution. 2006; 23(2): 254–267. pmid:16221896
  14. Bryant D, Moulton V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution. 2004; 21(2): 255–265. pmid:14660700
  15. Huson DH, Dezulian T, Klopper T, Steel MA. Phylogenetic super-networks from partial trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2004; 1(4): 151–158. pmid:17051697
  16. Song YS, Hein J. Constructing minimal ancestral recombination graphs. Journal of Computational Biology. 2005; 12(2): 147–169. pmid:15767774
  17. Huson DH, Kloepper TH. Computing recombination networks from binary sequences. Bioinformatics. 2005; 21(suppl2): ii159–ii165. pmid:16204096
  18. Gusfield D, Eddhu S, Langley C. Efficient reconstruction of phylogenetic networks with constrained recombination. Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE. IEEE; 2003: 363–374.
  19. Gusfield D. Optimal, efficient reconstruction of root-unknown phylogenetic networks with constrained and structured recombination. Journal of Computer and System Sciences. 2005; 70(3): 381–398.
  20. Gusfield D, Bansal V. A fundamental decomposition theory for phylogenetic networks and incompatible characters. Research in Computational Molecular Biology. Springer; 2005: 217–232.
  21. Semple C. Hybridization networks. Department of Mathematics and Statistics, University of Canterbury; 2006.
  22. van Iersel L, Kelk S, Rupp R, Huson D. Phylogenetic networks do not need to be complex: using fewer reticulations to represent conflicting clusters. Bioinformatics. 2010; 26(12): i124–i131. pmid:20529896
  23. Collins J, Linz S, Semple C. Quantifying hybridization in realistic time. Journal of Computational Biology. 2011; 18(10): 1305–1318. pmid:21210735
  24. Wu Y. Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics. 2010; 26(12): i140–i148. pmid:20529899
  25. van Iersel L, Keijsper J, Kelk S, Stougie L, Hagen F, Boekhout T. Constructing level-2 phylogenetic networks from triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009; 6(4): 667–681.
  26. van Iersel L, Kelk S. Constructing the simplest possible phylogenetic network from triplets. Algorithmica. 2011; 60(2): 207–235.
  27. Wang J. A new algorithm to construct phylogenetic networks from trees. Genetics and Molecular Research. 2014; 13(1): 1456–1464. pmid:24634244
  28. Huson DH, Scornavacca C. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Systematic Biology. 2012; 61(6): 1061–1067. pmid:22780991
  29. Huson DH, Rupp R. Summarizing multiple gene trees using cluster networks. Algorithms in Bioinformatics. Springer; 2008: 296–305.
  30. Huson DH, Rupp R, Berry V, Gambette P, Paul C. Computing galled networks from real data. Bioinformatics. 2009; 25(12): i85–i93. pmid:19478021
  31. Wang J, Guo M, Liu X, Liu Y, Wang C, Xing L, Che K. LNETWORK: an efficient and effective method for constructing phylogenetic networks. Bioinformatics. 2013; 29(18): 2269–2276. pmid:23811095
  32. Wang J, Guo M, Xing L, Che K, Liu X, Wang C. BIMLR: A Method for Constructing Rooted Phylogenetic Networks from Rooted Phylogenetic Trees. Gene. 2013; 527(1): 344–351. pmid:23816409
  33. Linder CR, Moret BME, Nakhleh L, Warnow T. Network (Reticulate) Evolution: Biology, Models, and Algorithms. In The Ninth Pacific Symposium on Biocomputing (PSB. 2010). 2010.
  34. Cardona G, Llabrés M, Rosselló F, Valiente G. Metrics for phylogenetic networks I: Generalizations of the Robinson-Foulds metric. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2009; 6(1): 46–61. pmid:19179698