NetCombin: An algorithm for optimal level-k network construction from triplets

Phylogenetic networks construction is one the most important challenge in phylogenetics. These networks can present complex non-treelike events such as gene flow, horizontal gene transfers, recombination or hybridizations. Among phylogenetic networks, rooted structures are commonly used to represent the evolutionary history of a species set, explicitly. Triplets are well known input for constructing the rooted networks. Obtaining an optimal rooted network that contains all given triplets is main problem in network construction. The optimality criteria include minimizing the level and the number of reticulation nodes. The complexity of this problem is known to be NP-hard. In this research, a new algorithm called Netcombin is introduced to construct an optimal network which is consistent with input triplets. The innovation of this algorithm is based on binarization and expanding processes. The binarization process innovatively uses a measure to construct a binary rooted tree T consistent with the maximum number of input triplets. Then T is expanded in an intellectual process by adding minimum number of edges to obtain final network with the minimum number of reticulation nodes. In order to evaluate the proposed algorithm, NetCombin is compared with four state of the art algorithms, RPNCH, NCHB, TripNet, and SIMPLISTIC. The experimental results on real data indicate that by considering the trade-off between speed and precision, the NetCombin outperforms the others. Author summary Hadi Poormohammadi got his PhD in Mathematics, Applied combinatorics from Shahid Beheshti University, Tehran, Iran in 2013. He is now working as an assistant professor at the Faculty of Computer Engineering, Meybod University. His research interests include Combinatorics, Graph theory and Bioinformatics. Mohsen Sardari Zarchi got his PhD in computer engineering, Artificial Intelligence from University of Isfahan in 2015. He is now working as an assistant professor at the Faculty of Computer Engineering, Meybod University. His research interests include Deep learning, Image processing, Artificial intelligence and Bioinformatics.

Triplets are commonly considered as a standard input for building rooted 24 structures [2,6]. These small tree structures are usually obtained from a set of biological 25 sequence data using standard methods such as Maximum Likelihood (ML) and 26 Maximum Parsmiony (MP) [5,12]. Also in some cases, the output of some biological 27 experiments is directly in the form of triplets [13]. Moreover, in some experiments, 28 triplets are generated randomly to evaluate a model [5]. 29 For a given set of triplets as input, the main challenge is to construct a rooted 30 structure (tree or network) which contains all triplets or equivalently, all triplets are 31 consistent with the obtained structure [2,6,11,12]. Formally, a triplet is consistent with 32 a rooted structure when the triplet is a subgraph of that structure [2,5,6,11,12]. The 33 majority preference is to obtain a rooted tree structure. However, as mentioned before, 34 the tree structure can't represent reticulate events and eventually the reticulate events 35 can't be covered. So, in this case the network construction should be considered. 36 In order to build a rooted network from a set of triplets, several algorithms were 37 introduced recently [2,6,7,11,12,14]. The well-known algorithms are TripNet [6], 38 SIMPLISTIC [12], NCHB [14] and RPNCH [7]. These algorithms try to find an optimal 39 rooted phylogenetic network that is consistent with a given set of triplets. Formally, a 40 rooted phylogenetic network (network for short) is a directed acyclic graph (DAG) 41 ( Figure 3) that is connected and each vertex satisfies one of the following four categories: 42 (i) A unique root node with in-degree 0 and out-degree 2. (ii) Tree nodes with in-degree 43 1 and out-degree 2. (iii) Reticulation nodes with in-degree 2 and out-degree 1. (iv)  (a) A rooted phylogenetic network. r is root, t 1 , t 2 , . . . , t 8 are tree nodes, r 1 , r 2 , r 3 are reticulation nodes, and l 1 , l 2 , . . . , l 7 are leaves. (b) The network of figure 4a has two biconnected components. One of the biconnected component contains r 1 and the other contains r 2 and r 3 and . So the network is a level-2 network.
The level of a rooted phylogenetic network is defined as the maximum number of 50 reticulation nodes in each biconnected component (Figure 4b). 51 Generally the problem of constructing an optimal rooted phylogenetic network 52 consistent with a given set of triplets is known to be NP-hard [15,16]. When the 53 structure of the input triplets is dense, this problem can be solved in polynomial time 54 order [16]. A set of triplets τ is called dense if for each subset of three taxa there is at 55 least one information in the set of input triplets [6,16]. More precisely, a set of triplets τ 56 is called dense if for a given set of taxa X and each subset of three taxa {i, j, k} one of 57 the triplets ij|k or ik|j or jk|i belongs to τ [6,16]. For example for a given set of taxa 58 X = {a, b, c, d, e}, the set of triplets 59 τ = {ab|c, ad|b, be|a, ac|d, ae|c, de|a, bd|c, bc|e, be|d, de|c} is dense. 60 As mentioned above, density is a critical constraint concerning with constructing a 61 rooted phylogenetic network that contains all given triplets. However, usually there is 62 no constraint on the input triplets and in most cases the input triplets might not be 63 dense. So, introducing efficient heuristic methods to solve this problem is demanding.

64
The desirable goal is to construct a rooted network with no reticulate events i.e. a 65 rooted tree structure. BUILD is the algorithm that was introduced for obtaining a tree 66 structure from a given set of triplets if such a tree exists [17]. In fact, BUILD algorithm 67 decides in polynomial time order if there is a rooted phylogenetic tree that contains all 68 given triplets and then produces an output if such a tree exists. Figure 5, indicates an 69 example of BULID algorithm steps for the given 70 τ = {cd|b, cd|a, cd|e, cd|f, ef |a, ef |b, ef |c, ef |d, db|a, db|e, db|f, da|e, de|f, cb|a, cb|e, cb|f 71 , ab|e, ab|f, ac|e, ac|f }. The Aho graph AG(τ ) is defined based on τ . The set of nodes are X = L(τ ) = {a, b, c, d, e, f } and two nodes i, j ∈ X are connected iff there is a node x ∈ X such that ij|x ∈ τ . Also AG(τ |A) in which A ⊆ X is defined in a similar way. Note that in this way the induced graph of AG(τ ) on the set of nodes A ⊆ X is considered i.e. the set of nodes are A and i, j ∈ A are connected iff there is a node x ∈ A such that ij|x ∈ τ . (a) AG(τ ). (b) Based on AG(τ ) the resulting tree is obtained. (c) AG(τ |e, f ). (d) AG(τ |a, b, c, d). (e) Based on AG(τ |e, f ) the resulting tree is obtained. (f) Based on AG(τ |a, b, c, d) the resulting tree is obtained. (g) AG(τ |b, c, d).
(h) Based on AG(τ |b, c, d) the resulting tree is obtained. (i) AG(τ |c, d). (j) Based on AG(τ |c, d) the resulting tree is obtained. (k) The final tree consistent with given τ that is obtained from BUILD algorithm by reversing its steps.
In tree construction process for a given τ if BUILD stops, it means that there is no 73 tree structure for the given set of triplets. Figure 6 shows an example for the set of 74 triplets τ = {bc|a, bd|a, cd|a, bc|d, cd|b} in which BUILD algorithm stops. In this case, 75 the main goal is to construct a network structure similar to a tree as much as possible. 76 In other words, constructing a rooted phylogenetic network with the minimum 77 reticulate events is the main challenge.

78
The simplest possible non-treelike structure (network structure) is level-1 rooted 79 phylogenetic network which also known as galled tree [15,16]. Figure 7 shows an example of a galled tree. If level-1 networks can not represent all input triplets, more 81 complex (higher level) networks are considered to achieve consistency. LEV1ATHAN is 82 a well-known algorithm to construct level-1 networks [18]. In [12] an algorithm is 83 introduced that produces at most a level-2 network (Figure 4). When more complex networks are needed (e.g. figure 8), not restricted algorithms 85 such as NCHB, TripNet, RPNCH and SIMPLISTIC are applicable which try to 86 construct a consistent network with the optimality criterions (the level and the number 87 of reticulation nodes) [6,7,12,14]. Among the above four mentioned algorithms, 88 SIMPLISTIC is not exact and just works for dense sets of triplets while for the other 89 three methods there is no constraint on the input triplets. This is one of the 90 SIMPLISTIC disadvantages. Moreover for complex networks SIMPLISTIC is very time 91 consuming and has not the ability to return an output in an appropriate time [6]. TripNet has three speed options: slow, normal, and fast. The slow option returns a 93 network near to an optimal network. Normal option works faster compared to slow 94 option, but its network is more complex compared to slow option. Note that slow and 95 normal options return an output in an appropriate time for input triplets consistent 96 with simple and low level networks. However these two options are not appropriate for 97 large data, because by increasing the number of taxa, the set of triplets corresponds to 98 them are consistent with high level networks. Fast option usually output a network in 99 an appropriate time but its network is more complex compared to the two other options. 100 This option is used when the slow and normal options have not the ability to return a 101 network in an appropriate time. It means that fast option just try to output a network 102 and does not consider the optimality criterions. In summary, TripNet has not the ability 103 to return an optimal network in an appropriate time, when input data is large [6].

104
NCHB is an improvement of TripNet which tries to improve the complexity of the 105 TripNet networks but like TripNet it has not the ability to return an optimal network in 106 an appropriate time for large data [14].

107
RPNCH is a fast method for constructing a network consistent with a given set of 108 triplets, but its output is usually more complex considering the two optimality criterions 109 compared to SIMPLISTIC and TripNet networks. It means that although RPNCH is 110 fast but on average, the RPNCH networks are far away from the optimality 111 criterions [7].

112
Generally none of the above four methods have the ability to return a network near 113 to an optimal network consistent with a given set of input triplets in an appropriate 114 time. So the focus of this paper is to introduce a new method called NetCombin for 115 constructing a network near to optimal in an appropriate time without any constraint 116 on the input triplets. In this research our innovation is based on the binarization and 117 expanding processes. In the binarization process nine measures are used innovatively to 118 construct binary rooted trees consistent with the maximum number of input triplets. into a consistent network an intellectual process is used. In this process minimum 122 number of edges are added heuristically to obtain the final network with the minimum 123 number of reticulation nodes.

124
The structure of this paper is as follow. Section 2, presents the basic notations and 125 definitions. In section 3, our proposed algorithm (NetCombin) is introduced. In section 126 4, the new introduced algorithm is compared with NCHB, TripNet, RPNCH, and

133
A rooted phylogenetic tree (tree for short) on a given set of taxa X is a rooted 134 directed tree that contains a unique node r (root) with in-degree zero and out-degree at 135 least 2. In a tree, leaves are with in-degree 1 and out-degree 0 and are distinctly labeled 136 by X. Also inner nodes i.e. nodes except root and leaves, has in-degree 1 and out-degree 137 at least 2 [2,6].
For example, figure 9 shows that 142 triplets ij|k and jk|i are consistent with the given network, but ik|j is not consistent. A 143 set of triplets τ is consistent with a network N (or equivalently N is consistent with τ ) 144 if all the triplets in τ are consistent with N . τ (N ) denotes the set of all triplets that are 145 Binarization is a basic concept, defined as follows. Let T be a rooted tree and x be a 147 node with x 1 , x 2 , . . . , x k , k ≥ 3 childeren. These k children are partitioned into two 148 disjoint subsets X l and X r . Let X l = {x 1 , x 2 , . . . , x i } and X r = {x i+1 , . . . , x k } in which 149 x 1 , x 2 , . . . , x k is an arbitrary relabeling of x 1 , x 2 , . . . , x k . If |X l | > 1 then create a new 150 node x l , remove the edges (x, x 1 ), (x, x 2 ), . . . , (x, x i ) and create the edges (x, x l ) and 151 (x l , x 1 ), (x l , x 2 ), . . . , (x l , x i ). Do the same process if |X r | > 1 . Continue the process 152 until the out-degree of all nodes except the leaves, be 2. The new tree is called a 153 binarization of T . Figure 10 shows an example of a non-binary tree and its two different 154 binarizations. If T 2 is a binarization of Let G τ be a DAG and l Gτ be the length of the longest directed path in G τ . Assign 166 l Gτ + 1 to the nodes with out-degree=0 and remove them. Assign l Gτ to the nodes with 167 out-degree=0 in the resulting graph and continue this procedure until all nodes are 168 removed. Define h Gτ (a, b), a, b ∈ L(τ ) and a = b as the value that is assigned to the 169 node ab ∈ V (G τ ) and call it the height function related to G τ [6]. For example figures 170 13a to 13d. If τ is consistent with a tree then G τ is a DAG and h Gτ is well defined [6]. 171 Let r be the root of a given network N and l N be the length of the longest directed 172 path in N . For each node a let d(r, a) be the length of the longest directed path from r 173 to a. For any two nodes a and b, a is an ancestor of b if there is a directed path from a 174 to b. In this case b is lower than a. For any two nodes a, b ∈ L N a node c is called a

181
A quartet is an un-rooted binary tree with four leaves. The symbol ij|kl is used to 182 show a quartet in which i, j and k, l are its two pairs. Each quartet contains a unique 183 edge for which two its endpoints are not leaves. This edge is called the inner edge of the 184 quartet (See figure 12) [6].

186
In order to build a network N consistent with a given set of triplets τ , the height 187 function h N related to τ is defined [6]. The height function is a measure that is used to 188 obtain a basic structure of the final network (N ) [6]. This basic structure is in the form 189 of a rooted tree. The height function enforce that the obtained rooted tree be consistent 190 with maximum number of triplets of τ . In this research, the height function is obtained 191 as follow [6]:

192
Constructing tree using height function 193 Let T be a tree with its unique height function h T and i, j ∈ L T . The triplet ij|k is . The above two items 197 imply that the following Integer Programming (IP) IP (τ, s) is established for a given 198 triplets τ with |L(τ )| = n [6].
December 28, 2019 6/15 The solution of the above IP provides a criterion to obtain the basic tree structure. 201 Ideally it is expected that the above IP has a feasible solution i.e. a solution that 202 satisfies all its constraints. If there is a tree consistent with a given τ then the above IP 203 has a feasible solution and the solution that maximizes the above IP is the height 204 function of a tree that is consistent with τ . More precisely in this case h Tτ is the unique 205 optimal solution to the IP(τ, l Gτ + 1) in which T τ is the unique tree that is constructed 206 by BUILD (or equivalently HBUILD) [6]. So in this case by using HBUILD the desired 207 tree consistent with τ can be constructed in polynomial time based on the optimal 208 solution [6] Figure 13 indicates an example of the HBUILD process for the given 209 τ = {cd|b, cd|a, bd|a, bc|a}. Generally the above IP has a feasible solution iff the graph G τ is a DAG and in this 211 case the minimum s that gives a feasible solution for IP(τ, s) is l Gτ + 1 [6]. So for a 212 given τ the IP might have a feasible solution although there is no tree consistent with τ . 213 In the worst case, there is no tree consistent with a given τ and no feasible solution for 214 the above IP i.e. equivalently the graph G τ is not a DAG. To overcome this flaw, the 215 goal is to remove minimum number of edges from G τ (minimum number of criterions 216 from the IP) to lose minimum information. The problem of removing minimum number 217 of edges from a directed graph to obtain a DAG is known as the Minimum Feedback 218 Arc Set problem, MFAS problem for short. MFAS is NP-hard [19]. The heuristic 219 method that is introduced in [14] is used to obtain a DAG from G τ . For simplicity the 220 new graph that is a DAG is called G τ again. Now the height function h Gτ related to G τ 221 is the desired solution.

222
In the following the goal is to obtain a tree structure from the obtained h Gτ . In the 223 initial step HBUILD is applied on h Gτ . The ideal situation is when HBUILD continues 224 until a tree structure is obtained. However, HBUILD may stop in one of its subsequence 225 steps. More precisely, Let (G, h) be the weighted complete graph related to h Gτ .  If by removing the edges with maximum weight from a connected component C, the 232 resulting graph C remains connected, then HBUILD halts. Hence, the goal is to 233 disconnect the obtained connected component (C ). In order to disconnect C , similar 234 to RPNCH [7] three different processes can be performed as follow (See figure 14) : ii. The Min-Cut method is applied on C . Min-Cut is a method that removes 238 minimum weights sum of removed edges in a way that the resulting graph is converted 239 into two connected components.

244
Then Min-Cut method is applied on the updated graph.

245
In this research, for each connected component, the above three processes is applied 246 and then by using HBUILD, three possible tree structures are obtained. From here, 247 without loss of generality, the symbol T int is used to show the tree structure obtained 248 from HBUILD or the three tree structures gained from the above processes.  Let T be a rooted tree and τ (T ) be the set of triplets consistent with T . Also let 251 T binary be a binarization of T and τ (T binary ) be the set of triplets consistent with 252 T binary . Then τ (T ) ⊆ τ (T binary ). It means that binarization is an effective tool to make 253 the tree structure more consistent with the given triplets. To perform binarization on 254 T int , the following heuristic algorithm is proposed. For a given set of triplets τ and T int a binary tree structure T intBin is demanded.

256
Binarization can be performed simply with a random approach [6,7]. In order to make 257 binarization more efficient, a new heuristic algorithm is introduced innovatively in this 258 research. This algorithm is originally based on the three parameters, w, t, and p [14,20]. 259 Let τ be a set of triplets and V i , V j ⊆ L(τ ) and [14,20]. Based on 264 the three parameters (w, t, and p ), nine different measure are defined [5]. The measures 265 M = {m 1 , m 2 , . . . , m 9 } are defined as: . By using 268 these measures, nine binary tree structures ( T intBin ) are built from T int .

269
The binarization process is performed as follow:   Among 6+1 structures, select the more consistent structure and add its 285 root to C. 286

13:
Update T intBin respect to selected structure. The binarization process is performed based on using nine different defined measures 291 and Subtree-Pruning-Regrafting (SPR) [21]. SPR is a method in tree topology 292 search [21]. In the binarization process SPR helps to obtain a tree from T int more 293 consistent with input triplets. If T int is binary there is nothing to do; Else there is at  figure 15). Here, SPR is used innovatively to improve the merging consistency.

301
Suppose that c lk and c rk are the roots of the two left and right subtrees of c k , 302 k ∈ {i, j}. The idea behind SPR is replacing subtrees to achieve a new binary tree 303 structure with higher consistency. In this work the potential replacement are introduced 304 in six different ways as follow (See figure 16). i) c i c lj , ii) c i c rj , iii) c j c li , obtained. Among these tree structures and the structure without replacement, the best 307 tree structure consistent with more input triplets is selected.

309
Let τ ⊆ τ be the set of triplets that are not consistent with T intBin . Here, the goal is 310 to add some edges to T intBin in order to construct a network consistent with input τ . 311 In the network construction process, edges are added incrementally to obtain the final 312 network consistent with τ . In order to add edges, we use innovatively a heuristic 313 criterion to select edges rather than random selection. The heuristic criterion is 314 depended on the current non-consistent triplets ∈ τ and the current network structure. 315 To this purpose, for each pair of edges of the current network structure, a value is

323
The RPNCH, NCHB, SIMPLISTIC and TripNet are famous algorithms in constructing 324 phylogenetic networks from given triplets. The SIMPLISTIC algorithm just works for 325 dense triplets [12], while there is no constraints on the NCHB, TripNet, and RPNCH 326 inputs [6,7,14]. In order to evaluate the performance of NetCombin, the following 327 scenario is designed. There is two standard approaches to generate triplets data. Firstly, triplets can be 330 generated randomly which is the simplest way. Secondly, triplets can be obtained from 331 real data. Real data usually are in the form of (biological) sequences which is used in 332 this research. There are standard methods for converting sequences into triplets.

333
Maximum Likelihood (ML) is the well-known method which constructs tree from 334 sequence data [5,12]. For this reason, TREEVOLVE is used which is a software for 335 generating biological sequences [22]. TREEVOLVE has different parameters that can be 336 adjusted manually. In this research we set the parameters, the number of samples, the 337 number of sequences, and the length of sequences. For the other parameters, the default 338 values are used. The number of sequences is set to 10, 20 30, and 40 and the length of 339 sequences is set to 100, 200, 300, and 400. For each case, the number of samples is 10. 340 So totally 160 different sets of sequences are generated. Then PhyML software is used 341 which works based on Maximum Likelihood (ML) criterion. For each set of sequences, 342 all subsets of three sequences are considered and for each of them, an outgroup is 343 assigned. Each subset of three sequences plus the assigned outgroup, are considered as 344 input for PhyML and for these data the output of PhyML is a quartet. Finally by 345 removing the outgroup from each quartet, the set of triplets is obtained. In this 346 research, each triplet information related to a quartet in which the weight of its unique 347 inner edges is zero, is removed. This is because of these types of triplets contains no 348 information and are stars. The way of generating triplets may give non-dense sets of 349 triplets. SIMPLISTIC is used as a method for comparison and its output should be 350 dense. So by adding a random triplet correspond to each star, each non-dense set is 351 converted to a dense set and is used as the input.

353
In order to show the performance of NetCombin we compare it with TripNet, 354 SIMPLISTIC, NCHB, and RPNCH on the data that are generated in the previous 355 subsection. Since for large size data, SIMPLISTIC has not the ability to return a 356 network in an appropriate time, the time restriction 6 hours is considered. Let N f inite 357 be the set of networks for which the running time of the method is at most 6 hours. Let 358 S sequence shows the number of sequences where S sequence ∈ {10, 20, 30, 40}. The output 359 of TripNet, SIMPLISTIC, NCHB, and RPNCH is a unique network, but NetCombin 360 outputs 27 networks and the best network is reported. Since the process of constructing 361 these 27 NetCombin networks is independent, we apply NetCombin in a parallel way to 362 obtain 27 networks simultaneously.

363
The results of comparing these methods on the two optimality criterions and  Table 1 and 2 show the results of the number of networks that belong to N f inite , and 366 the average of running time of the networks that belong to N f inite . These results show 367 that when the number of taxa is 10, in all cases, all methods on average give an output 368 in at most 2 seconds. When the number of taxa is 20, in 5 % of the cases, SIMPLISTIC 369 has not the ability to return a network in less than 6 hours. For the remaining 95% of 370 the cases SIMPLISTIC on average gives an output in 310 seconds. For these data the 371 other four methods on average construct a network in less than 4 seconds. When the 372 number of taxa is 30, in 32.5% of the cases, on average SIMPLISTIC outputs a network 373 in 2600 seconds. For the remaining 77.5% of the cases, SIMPLISTIC has not the ability 374 to return a network in less than 6 hours. For these data, on average NetCombin and 375 RPNCH output a network in at most 15 seconds, while NCHB and TripNet on average 376 output a network in 203 and 210 seconds, respectively. When the number of input taxa 377 is 40, in all cases SIMPLISTIC does not return an output in time restriction 6 hours. In 378 this case NetCombin and RPNCH on average output a network in at most 44 seconds 379 while NCHB and TripNet return a network in time at least 740 seconds.   The results show that on average for small size data SIMPLISTIC is appropriate.

415
But by increasing the number of taxa and for large size data it has not the ability to 416 return a network in an appropriate time and its running time is highest. Also in all 417 cases on average the SIMPLISTIC number of reticulation nodes and levels are just 418 better than RPNCH. Note that SIMPLISTIC just works for dense sets of input triplets. 419 The results show that by increasing the number of taxa, the running time of 420 SIMPLISTIC increases exponentially. In more details when the number of taxa is 40, in 421 time less than 6 hours it does not return any network, while the other 4 methods in at 422 most 745 seconds output a network.

423
Also the results show that on average NCHB and TripNet running time results are 424 nearly the same, but on average the two optimality criterions for NCHB results are 425 better compared to TripNet. Note that the differences between TripNet and NCHB 426 results for the optimality criterions are not significant.

427
The results show that for small size data TripNet and NCHB are appropriate and 428 their results for the optimality criterions and running time are on average the best. But 429 by increasing the number of taxa, the running time of these methods exceeds 430 significantly compared to NetCombin, while the two optimality criterions for their 431 networks are nearly the same with NetCombin networks results.

432
The results show that generally and by considering the running time, the level, and 433 the number of reticulation nodes of the final networks, on average NetCombin is a 434 valuable method that returns an appropriate network in an appropriate time.