A complete theoretical framework for inferring horizontal gene transfers using partial order sets

Nahla A. Belal; Lenwood S. Heath

doi:10.1371/journal.pone.0281824

Abstract

We present a method for detecting horizontal gene transfer (HGT) using partial orders (posets). The method requires a poset for each species/gene pair, where we have a set of species S, and a set of genes G. Given the posets, the method constructs a phylogenetic tree that is compatible with the set of posets; this is done for each gene. Also, the set of posets can be derived from the tree. The trees constructed for each gene are then compared and tested for contradicting information, where a contradiction suggests HGT.

Citation: Belal NA, Heath LS (2023) A complete theoretical framework for inferring horizontal gene transfers using partial order sets. PLoS ONE 18(3): e0281824. https://doi.org/10.1371/journal.pone.0281824

Editor: Vladimir Makarenkov, Universite du Quebec a Montreal, CANADA

Received: June 13, 2022; Accepted: January 31, 2023; Published: March 24, 2023

Copyright: © 2023 Belal, Heath. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: This is a theoretical paper that addresses a problem from a mathematical point of view. The problem is proven to be NP-Complete and all algorithms and proofs are in the paper.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Most work in evolutionary genomics has focused on vertical gene transfer from one species to a lineal descendant. Much recent work has been directed towards the phenomenon of horizontal gene transfer (HGT) [1]. Because of the impact of HGTs on the ecological and pathogenic character of genomes, algorithms are sought that can computationally determine which genes of a given genome are products of HGT events. Numerous strategies have employed the nucleotide composition of coding sequences to predict HGT. Previous methods marked the genes with atypical G + C content. Other methods used codon usage patterns to predict HGT. Many models also used nucleotide patterns as a genomic signature; these models have been analyzed using sliding windows, Bayesian classifiers, Markov models, and support vector machines. While no previous work uses partial orders to investigate HGT, we summarize computational research on detecting HGT in the Related literature section below.

Suppose that we have complete, annotated genomes for m species. Further, suppose that we have selected a set of n genes, from some reference genome or otherwise, for analysis. If we know the relative distances between each pair of species per gene, then we have a set of partial orders defining the relative relationship among species that can be used to identify which genes are candidates for HGT. Given a poset for each gene, a tree corresponding to that gene is constructed; different trees suggest genes that are candidates for HGT. Once HGT is indicated, additional time-related information can be brought to bear to determine the relative order of events and to establish direction. In fact, our algorithm predicts direction as illustrated in Fig 1.

Fig 1. The possible HGT events for the example in Fig 34.

https://doi.org/10.1371/journal.pone.0281824.g001

Suppose that we have complete, annotated genomes for species s₁, s₂, …, s_m. Further, suppose that we have selected a set of genes, from some reference genome or otherwise, for analysis. Let those genes be g₁, g₂, …, g_n. Standard methods for obtaining the set of genes, such as the one in Lake and Rivera [2], can be followed. BLASTing gene g_k in species s_i against a database of genes from all m species, we obtain a bit score B(g_k; s_i, s_j) of a best alignment of that gene against the same gene in species s_j. If g_k is not found in s_j, then set B(g_k; s_i, s_j) = 0. In general, the higher B(g_k; s_i, s_j) is, the better the match between gene g_k in species s_i and gene g_k in species s_j. There is no need to take special notice of an absent gene, since B(g_k; s_i, s_j) = 0 is a meaningful substitute for a Boolean value representing presence or absence of a gene.

There is another quantity associated with the (g_k, s_i, s_j) triple. Define T(g_k; s_i, s_j) to be the true evolutionary distance in time between the g_k gene of s_i and the g_k gene of s_j, that is, the distance determined by what actually happened during the evolution of the gene. For example, if the most recent common ancestor of the two genes existed 20 million years ago, then T(g_k; s_i, s_j) is 40 million years, since each of the two lineages contributes 20 million years. While these T(g_k; s_i, s_j) values cannot be measured directly, either absolute or relative values for times can be estimated using probabilistic models.

The B(g_k; s_i, s_j) values are not random. In fact, a ranking of the B(g_k; s_i, s_j) values for 1 ≤ j ≤ m should roughly match a ranking of the T(g_k; s_i, s_j) values from the s_i gene g_k to all the other g_k’s. In the absence of HGT or other horizontal evolutionary events, we must have T(g_k; s_i, s_j) = T(g_ℓ; s_i, s_j) for every pair of genes g_k and g_ℓ. Therefore, we expect that the rankings of the B(g_k; s_i, s_j) and B(g_ℓ; s_i, s_j) values will be similar in ways we want to explore. Under reasonable assumptions, the distribution of relative distances should also be consistent with the predictions of coalescent theory. In particular, as evolutionary distances increase, there will typically be multiple genes that have the same T value from the g_k gene in species s_i. Moreover, the probability that two evolutionary events occur at the same instant in time is 0.
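As a minimal sketch of how a ranking of bit scores might be turned into a partial order rooted at the query species, the following Python fragment places s_j strictly below s_k whenever the score against s_j is strictly higher, and leaves ties incomparable. The function name `poset_from_scores`, the dictionary layout, the tie-handling rule, and the numeric scores are all illustrative assumptions; the text does not prescribe this particular conversion.

```python
def poset_from_scores(ref, scores):
    """Return the strict pairs (a, b), read as a <_ref b, of a poset whose
    unique minimum is the query species ref.

    scores maps every other species s_j to a bit score B(g; ref, s_j).
    A higher score is read as "closer to ref"; equal scores stay
    incomparable. This encoding is an assumption made for illustration.
    """
    pairs = {(ref, s) for s in scores}      # ref lies below every species
    for a, score_a in scores.items():
        for b, score_b in scores.items():
            if score_a > score_b:           # a ranks closer to ref than b
                pairs.add((a, b))
    return pairs


# Hypothetical bit scores for one gene, with s1 as the query species.
# A score of 0 stands for a gene that was not found in that species.
example = poset_from_scores(
    "s1", {"s2": 950.0, "s3": 410.0, "s4": 410.0, "s5": 0.0}
)
```

In this hypothetical input, s3 and s4 tie and therefore remain incomparable, while the absent gene in s5 ends up at the top level of the order.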

In the presence of horizontal evolutionary events, the patterns of rankings of the B and T values will be different for different genes, depending on which horizontal events each gene is involved in. Two genes that are involved in exactly the same horizontal events will have identical patterns in their T values and similar patterns in their B values.

If we use the rankings of the B values as an approximate substitute for the rankings of the unknown T values, then the rankings can be compared and clustered to identify groups of genes that participated in the same horizontal events. Fix a gene g_k. Then there is a gene g_k tree that represents the true evolutionary history of the g_k’s in all the species. It is rooted at the most recent common ancestor of the m species. Our first goal is to define a computational problem to achieve this clustering and to design an efficient algorithm to solve the problem. In the following, proofs of results are elaborated. Note that Belal and Heath [3] is an earlier five-page announcement of these results.

Definitions

For a rooted (directed) tree T, let R(T) be the root of T, let I(T) be the set of internal nodes of T, and let L(T) be the set of leaves of T.

Let S be a finite set of species. An S-tree T = (V, E) is a rooted tree in which every internal node has outdegree at least two, together with a bijective labeling function λ: L(T) → S. In particular, every S-tree has precisely |S| leaves. Fig 2 illustrates an S-tree for the case n = |S| ≥ 2, where there is only one internal node, the root r = R(T). There are n leaves x₁, x₂, …, x_n and λ(x_i) = s_i. If every internal node of T has outdegree exactly two, then T is an evolutionary tree. Fig 3 illustrates an evolutionary tree on five species.

Fig 2. A trivial non-binary S-tree with a minimum number of nodes and no evolutionary assumptions [3].

https://doi.org/10.1371/journal.pone.0281824.g002

Fig 3. An evolutionary S-tree with 5 taxa [3].

https://doi.org/10.1371/journal.pone.0281824.g003

Let T = (V, E) be an S-tree. Let u ∈ V. The subtree rooted at u is T(u). The species set S(u) for u is the set of leaf labels in T(u).

Let T be an S-tree with an internal node x that has three or more children. A refinement step (on T at x) adds an internal node y to the tree T, where y is the parent of a proper subset of the children of x and y is a new child of x. An S-tree T′ is a refinement of T if T′ can be obtained by performing zero or more refinement steps on T. For example, in Fig 4, T₂ is a refinement of T₁ by a refinement step on T₁ at r. The refinement step applied adds one internal node y, which is the parent of s₁ and s₂ in T₂; y and s₃ are the direct children of r in T₂.

Fig 4. Refinement of T₁ to T₂ [3].

https://doi.org/10.1371/journal.pone.0281824.g004
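The refinement step described above can be sketched in Python as follows. The representation of a tree as a dictionary from internal nodes to child lists, the function name `refinement_step`, and the requirement that the moved subset contain at least two children (so that the new node keeps outdegree at least two) are assumptions made for this illustration.

```python
def refinement_step(children, x, subset, new_node):
    """Return a copy of the tree in which new_node becomes a child of x
    and the parent of the children of x listed in subset.

    children maps each internal node to the list of its children.
    subset must contain at least two, but not all, children of x.
    """
    kids = children[x]
    if len(kids) < 3:
        raise ValueError("x must have three or more children")
    moved = [c for c in kids if c in subset]
    if len(moved) != len(set(subset)) or not 2 <= len(moved) < len(kids):
        raise ValueError("subset must be a proper subset of the children of x")
    refined = {node: list(cs) for node, cs in children.items()}
    refined[new_node] = moved
    refined[x] = [c for c in kids if c not in subset] + [new_node]
    return refined


# The refinement of T1 to T2 described for Fig 4: y becomes the parent
# of s1 and s2, and y and s3 are the children of r.
t1 = {"r": ["s1", "s2", "s3"]}
t2 = refinement_step(t1, "r", {"s1", "s2"}, "y")
```

The function returns a new dictionary rather than mutating its input, so a tree and its refinement can be compared side by side.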

Let X = {X₁, X₂} and Y = {Y₁, Y₂} be two partitions of S. Call such partitions with two elements each 2-partitions. Note that the deletion of an edge from an S-tree induces two connected subtrees and, hence, a 2-partition of S. X and Y are contradicting partitions if there exist four species s₁, s₂, s₃, s₄ such that s₁, s₂ ∈ X₁, s₃, s₄ ∈ X₂, s₁, s₃ ∈ Y₁, and s₂, s₄ ∈ Y₂. Two S-trees T₁ and T₂ are contradictory if there exists an edge in T₁ and an edge in T₂ such that their induced 2-partitions are contradicting.
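The definition of contradicting partitions amounts to requiring that each of the four intersections X₁ ∩ Y₁, X₁ ∩ Y₂, X₂ ∩ Y₁, and X₂ ∩ Y₂ contains at least one species. A short Python check, with the function name `contradicting` and the pair-of-sets encoding chosen here purely for illustration, might look like this:

```python
def contradicting(x, y):
    """x and y are 2-partitions of the same species set, each given as a
    pair of sets. They contradict when all four pairwise intersections
    X1&Y1, X1&Y2, X2&Y1, X2&Y2 are non-empty."""
    (x1, x2), (y1, y2) = x, y
    return all(a & b for a in (x1, x2) for b in (y1, y2))


# Hypothetical 2-partitions of S = {s1, s2, s3, s4}.
X = ({"s1", "s2"}, {"s3", "s4"})
Y = ({"s1", "s3"}, {"s2", "s4"})      # contradicts X
Z = ({"s1"}, {"s2", "s3", "s4"})      # does not contradict X
```

Because the four intersections are pairwise disjoint, picking one species from each automatically yields four distinct species, as the definition requires.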

Let u, v ∈ L(T), for some S-tree T. The most recent common ancestor MRCA(u, v) of u and v is the node w that is a common ancestor of u and v such that T(w) is the smallest rooted subtree in T containing both u and v.

A partial order is a binary relation ≤ over a set S that is reflexive, antisymmetric, and transitive, i.e., for all a, b, c ∈ S, we have that

a ≤ a (reflexivity);
if a ≤ b and b ≤ a then a = b (antisymmetry); and
if a ≤ b and b ≤ c then a ≤ c (transitivity).

A set with a partial order is a partially ordered set or a poset. If (S, ≤) is a poset and a, b ∈ S, then a < b if and only if a ≤ b and a ≠ b. Note that a < b is transitive. The directed graph G = (S, <) is clearly a directed acyclic graph (DAG). The transitive reduction of G is the DAG on node set S that contains those edges (a, b) such that there is no c ∈ S satisfying a < c < b. A Hasse diagram of < (which is also a Hasse diagram of ≤) is a drawing of the transitive reduction of (S, <) such that no arrows are included. An example of a Hasse diagram is shown in Fig 5. The diagram shown corresponds to the following poset:

Fig 5. An example of a Hasse diagram.

https://doi.org/10.1371/journal.pone.0281824.g005
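To make the transitive reduction concrete, the following Python sketch computes the edges of a Hasse diagram from a strict order given as a set of pairs. The function name `hasse_edges` and the example order are assumptions for illustration; the input is assumed to be transitively closed already.

```python
def hasse_edges(elements, less_than):
    """Return the pairs (a, b) with a < b and no c such that a < c < b.

    less_than is a set of pairs (a, b) meaning a < b, assumed to be
    transitively closed.
    """
    return {
        (a, b)
        for (a, b) in less_than
        if not any((a, c) in less_than and (c, b) in less_than
                   for c in elements)
    }


# Hypothetical poset: s1 < s2, s2 < s3, s2 < s4, with s3 and s4
# incomparable, written out with all pairs implied by transitivity.
order = {("s1", "s2"), ("s1", "s3"), ("s1", "s4"),
         ("s2", "s3"), ("s2", "s4")}
covers = hasse_edges({"s1", "s2", "s3", "s4"}, order)
```

Only the three covering pairs survive; the pairs (s1, s3) and (s1, s4) are dropped because s2 lies strictly between their endpoints.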

Let s_i ∈ S be a species. An s_i-poset P = (S, ≤_i) is a poset with the property that, for every s_j ∈ S, we have s_i ≤_i s_j. In other words, s_i is the unique minimum element of P.

The s_i-poset P_i = (S, ≤_i) is compatible with an S-tree T if, for all triples of distinct leaves x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, λ(z) = s_k, and s_j ≤_i s_k, the shortest path from either x or y to z passes through MRCA(x, y). For the tree shown in Fig 6, Fig 7 shows an example of a poset that is compatible with the tree, while Fig 8 shows an incompatible poset: the poset indicates that s₃ is the closest species to s₁, whereas in the tree the closest species to s₁ is s₂.

Fig 6. An example of a tree to test compatibility with posets [3].

https://doi.org/10.1371/journal.pone.0281824.g006

Fig 7. An example of a poset compatible with the tree in Fig 6 [3].

https://doi.org/10.1371/journal.pone.0281824.g007

Fig 8. An example of a poset incompatible with the tree in Fig 6 [3].

https://doi.org/10.1371/journal.pone.0281824.g008

Let 𝒫 be a set of posets. 𝒫 is consistent if, for all posets P_i, P_j ∈ 𝒫, whenever s_j ≤_i s_k, then s_i ≤_j s_k. For example, let P₁ = {(s₁, s₂), (s₁, s₃), (s₂, s₃)}, P₂ = {(s₂, s₁), (s₂, s₃), (s₁, s₃)}, and P₃ = {(s₃, s₁), (s₃, s₂)}. Then, {P₁, P₂, P₃} is consistent. However, if P₄ = {(s₃, s₁), (s₃, s₂), (s₁, s₂)}, then {P₁, P₂, P₄} is inconsistent, since P₁ and P₂ indicate that s₁ and s₂ are closer to each other than to s₃, while P₄ indicates that s₁ is closer to s₃ than to s₂.
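The consistency condition can be checked mechanically. The sketch below is one possible reading: it works with the strict pairs of each poset (as the proof of Theorem 2 does) and applies the implication only for i ≠ j, since pairs whose first element is s_i hold in every s_i-poset by definition. The function name `is_consistent` and the dictionary encoding are assumptions for illustration; the four posets are the ones from the example above.

```python
def is_consistent(posets):
    """posets maps each species s_i to the set of strict pairs (a, b),
    read as a <_i b, of its s_i-poset.

    Checks that s_j <_i s_k implies s_i <_j s_k whenever i != j.
    """
    for i, pairs in posets.items():
        for (j, k) in pairs:
            if j == i:
                continue                  # s_i is the minimum of P_i
            if (i, k) not in posets[j]:
                return False
    return True


P1 = {("s1", "s2"), ("s1", "s3"), ("s2", "s3")}
P2 = {("s2", "s1"), ("s2", "s3"), ("s1", "s3")}
P3 = {("s3", "s1"), ("s3", "s2")}
P4 = {("s3", "s1"), ("s3", "s2"), ("s1", "s2")}

ok = is_consistent({"s1": P1, "s2": P2, "s3": P3})
bad = is_consistent({"s1": P1, "s2": P2, "s3": P4})
```

With P₄ in place of P₃, the pair (s₁, s₂) in the s₃-poset would require (s₃, s₂) in P₁, which is absent, so the check fails as the example in the text predicts.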

Related literature

One method for detecting HGT that many researchers have addressed is conditioned reconstruction (CR), a phylogenetic technique that uses gene presence/absence data to reconstruct phylogenetic relationships [4]. CR [2] compares one genomic sequence to another and assigns a character state of P or A according to whether a gene ortholog is present or absent. The probability of a state transition is analyzed using Markov models. Given two genomes, X and Y, four patterns are possible: PP, PA, AP, and AA. Many questions were raised about how to count the pattern AA: how can one identify genes that are missing from both genomes X and Y? To solve this problem, CR uses a conditioning genome as a reference for which genes are to be considered. A gene has to be present in both the conditioning genome and the genome being coded in order to be considered present. An absent gene is present in the conditioning genome and absent from the genome under study. The conditioning genome has a large effect on the results obtained, as it represents the full set of orthologous genes coded during matrix development. In our approach, we avoid building our results on a conditioning genome or on any other input that would bias our results. However, the approach we present is similar to CR in the problem addressed and in its use of information about all genes in the genomes. Bailey et al. [4] argue that CR cannot be used to distinguish between HGT and genome fusion. They suggest some refinements that make CR perform better. Bapteste and Walsh [5] question the ring of life hypothesis of Lake and Rivera [2]. They claim that it is not possible to reconstruct the ring of life in the presence of HGT. Bapteste and Walsh [5] regard the conditioning genome (CG) as more a tool than a biological concept; this genome can exist anywhere in the tree of life and cannot be used in evolutionary reconstruction. See Belal [6] for additional discussion of CR. Related methods are found in [7–11].

Other methods for detecting horizontal gene transfer have been proposed by multiple researchers. Podell and Gaasterland [12] present the DarkHorse method for detecting HGT. They define the lineage probability index (LPI) to measure HGT and species closeness. This measure relies on lineage key terms. The higher the LPI score for an organism, the closer it is to the query (reference) genome. Groups of closely related organisms have similar LPI scores. Xiang et al. [13] apply DarkHorse in analyzing the evolutionary relationship between Microsporidia and Fungi.

Moreover, phylogenetic reconstruction research has contributed to solving many evolutionary problems. Nakhleh et al. [14] present a method for reconstructing phylogenetic networks using maximum parsimony. Their method is then studied and applied in [15]. Other network-based methods are found in [16–20]. For example, Cardona, Pons, and Rosselló [17] investigate LGT (lateral gene transfer) networks that combine a principal rooted subtree with a set of additional edges representing LGT. They present an efficient algorithm for constructing an LGT network from a set of phylogenetic trees.

Snir and Trifonov [21] present a method for detecting HGT. Their algorithm takes two genomes with their lengths and calculates the expectancy of each identical region’s length to obtain a measure of confidence as to exceptional similarity. Abby et al. [22] present a program called Prunier for the detection of HGT. The program searches for a maximum statistical agreement forest between a gene tree and a reference tree. Adato et al. [23] provide an algorithm for detecting HGT based on gene synteny and the concept of constant relative mutability. Scornavacca et al. [24] provide an algorithm for detecting HGT in some alternative cases. Sanchez-Soto et al. [25] introduce the algorithm ShadowCaster for HGT detection in prokaryotes.

Some researchers combine HGT with other evolutionary phenomena. Bansal et al. [26] develop the tool RANGER-DTL to detect gene duplication, transfer, and loss. Van Iersel et al. [27] develop a polynomial-time algorithm for some cases of HGT detection. Hasic and Tannier [28] present NP-hard cases for HGT detection.

In addition to the above, there are a number of theoretical approaches to problems related to HGT [28–31]. These typically concern mathematically oriented methodologies for reconstructing a species tree or reconciling gene and species trees.

Also worth discussing is reticulate evolution. According to [32], there are numerous reticulations among related species, especially in insects, vertebrates, microbes, and plants. In [33], extensions of Wayne Maddison’s approach are presented for reconstructing reticulate evolution that results from horizontal transfer or hybrid speciation. Two polynomial-time algorithms are presented that outperform both NeighborNet and Maddison’s method. Moreover, [34] gives a review of the mathematical techniques used to reconstruct phylogenies and reticulate evolution. Different methods are discussed, among which are distance-based, maximum parsimony, and maximum likelihood methods. In [35], the problem of approximating a dissimilarity matrix using a reticulogram is discussed; the reticulogram is obtained by adding edges to an additive tree, which improves the approximation of the dissimilarity matrix. The authors of [36] state that horizontal gene transfer (HGT) is one of the most important events in evolution, and they describe a new polynomial-time algorithm to infer HGT events. The algorithm uses least squares (LS), Robinson and Foulds (RF) distance, quartet distance (QD), and bipartition dissimilarity (BD). The results show that bipartition dissimilarity performs best.

Also, in [37] a novel heuristic technique for HGT detection was employed and tested on both simulated and real data. The technique was found to provide greater sensitivity than other HGT techniques. It also considers the lengths of the genes being transferred.

In [38] a number of operons have been identified experimentally by sequence similarity analysis and then by phylogenetic analysis. Many occurrences of horizontal transfer of entire operons were detected.

Mosaic genes have been discussed in [39]. A mosaic gene is composed of alternating sequence polymorphisms either belonging to the host original allele or derived from the integrated donor DNA. In that work, the authors propose a method for detecting partial HGT events and related intragenic recombination giving rise to the formation of mosaic genes.

Constructing an S-tree from a set of posets

Recall the definition of compatible from the Definitions section. The s_i-poset P_i = (S, ≤_i) is compatible with an S-tree T if, for all triples of distinct leaves x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, λ(z) = s_k, and s_j ≤_i s_k, the shortest path from either x or y to z passes through MRCA(x, y).

The problem of constructing a tree is defined as follows:

Compatible Tree Construction
INSTANCE: Set S = {s₁, s₂, …, s_n} of n taxa; for 1 ≤ i ≤ n, an s_i-poset P_i = (S, ≤_i).
SOLUTION: An S-tree T compatible with P₁, P₂, …, P_n, if one exists.

Theorem 1. Let 𝒫 be a set of posets that is compatible with an S-tree T. Let T′ be a refinement of T. Then 𝒫 is compatible with T′.

Proof. The proof is by induction on the number k of refinement steps needed to obtain T′ from T. For the base case of the induction, assume that k = 0. Then T′ = T and, therefore, the set of posets 𝒫 is clearly compatible with T′. Now assume that k ≥ 1 and that the result holds for k − 1 refinement steps. Then there exists an S-tree T″ such that T″ is obtained by k − 1 refinement steps from T and T′ is obtained from T″ in one refinement step. Let u in T″ have children v₁, v₂, …, v_p such that in T′ there is a new node w that is a child of u with children v₁, v₂, …, v_q, where u retains children v_{q + 1}, …, v_p in T′. Note that q ≥ 2 and p − q ≥ 1. For 𝒫 to be compatible with T′, the compatibility condition must hold, namely:

For all distinct triples x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, and λ(z) = s_k and such that s_j≤_is_k, then there is a shortest path from either of x or y to z passing through MRCA (x, y).

By applying the compatibility condition to T″, the cases for x, y, and z are as follows:

x ∈ v₁, v₂, …, v_p or y ∈ v₁, v₂, …, v_p. Since s_j ≤_i s_k, there exists an MRCA for x and y. Let MRCA(x, y) be q. Therefore, the shortest path from either x or y to z passes through q.
x, y ∈ v₁, v₂, …, v_p. Therefore, MRCA(x, y) is u, and the shortest path from either x or y to z passes through u.
x, y ∉ v₁, v₂, …, v_p. Since s_j ≤_i s_k, there exists an MRCA for x and y such that the shortest path from either x or y to z passes through MRCA(x, y).

Similarly, by applying the compatibility condition to T′, the cases for x, y, and z are as follows:

x, y ∈ v₁, v₂, …, v_q. Therefore, MRCA(x, y) is w, and the shortest path from either x or y to z passes through w.
x, y ∈ v_{q + 1}, …, v_p. Therefore, MRCA(x, y) is u, and the shortest path from either x or y to z passes through u.
x ∈ v₁, v₂, …, v_q and y ∈ v_{q + 1}, …, v_p. Therefore, MRCA(x, y) is u, and the shortest path from either x or y to z passes through u.
y ∈ v₁, v₂, …, v_q and x ∈ v_{q + 1}, …, v_p. Therefore, MRCA(x, y) is u, and the shortest path from either x or y to z passes through u.
x ∈ v₁, v₂, …, v_p or y ∈ v₁, v₂, …, v_p. Since s_j ≤_i s_k, there exists an MRCA for x and y. Let MRCA(x, y) be q. Therefore, the shortest path from either x or y to z passes through q.
x, y ∉ v₁, v₂, …, v_p. Since s_j ≤_i s_k, there exists an MRCA for x and y such that the shortest path from either x or y to z passes through MRCA(x, y).

Therefore, if the compatibility condition holds for T″, and T′ is obtained using one refinement step from T″, then the compatibility condition also holds for T′.

By induction, the set of posets 𝒫 is compatible with T′, as required.

Now we present a data structure that the algorithm uses to identify siblings. For the set of posets 𝒫, a matrix A of size n × n is defined. For i ≠ j, we define A(i, j) = |{s_x ∈ S ∣ s_j <_i s_x}|. In other words, for i ≠ j, A(i, j) is the number of species s_x such that s_j is strictly less than s_x in the poset (S, ≤_i).

Theorem 2. Let 𝒫 be a set of posets, and let A be the matrix representing 𝒫. If 𝒫 is consistent, then A is symmetric.

Proof. Let 𝒫 be a set of posets. Recall that 𝒫 is consistent if, for all posets P_i, P_j ∈ 𝒫, whenever s_j ≤_i s_k, then s_i ≤_j s_k. Let 1 ≤ i < j ≤ n. By the consistency condition, {s_x ∣ s_j <_i s_x} = {s_x ∣ s_i <_j s_x}. Therefore, A(i, j) = A(j, i), and A is symmetric.
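The matrix A and the symmetry property of Theorem 2 can be illustrated with a small Python sketch. The function names `build_matrix` and `is_symmetric`, the encoding of each poset as a set of strict pairs, and the choice to leave the diagonal at 0 are assumptions made here; the posets are those of the consistency example in the Definitions section.

```python
def build_matrix(species, posets):
    """Return A as a list of lists, where for i != j the entry A[i][j] is
    the number of species s_x with s_j <_i s_x. The diagonal is left at 0
    (an assumption; the definition only covers i != j)."""
    n = len(species)
    a = [[0] * n for _ in range(n)]
    for i, si in enumerate(species):
        for j, sj in enumerate(species):
            if i != j:
                a[i][j] = sum(1 for (lo, _hi) in posets[si] if lo == sj)
    return a


def is_symmetric(a):
    """True when A[i][j] == A[j][i] for every pair of indices."""
    return all(a[i][j] == a[j][i] for i in range(len(a)) for j in range(i))


species = ["s1", "s2", "s3"]
P1 = {("s1", "s2"), ("s1", "s3"), ("s2", "s3")}
P2 = {("s2", "s1"), ("s2", "s3"), ("s1", "s3")}
P3 = {("s3", "s1"), ("s3", "s2")}
P4 = {("s3", "s1"), ("s3", "s2"), ("s1", "s2")}

consistent = build_matrix(species, {"s1": P1, "s2": P2, "s3": P3})
inconsistent = build_matrix(species, {"s1": P1, "s2": P2, "s3": P4})
```

For the consistent set, the only non-zero entries are A(1, 2) = A(2, 1) = 1, which marks s₁ and s₂ as siblings; replacing P₃ by P₄ makes A(3, 1) differ from A(1, 3), so the matrix is no longer symmetric.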

The matrix A represents an undirected graph with labeled edges, in which siblings are indicated by cliques: for a species s_i, all other species connected to s_i by edges with equal labels are siblings of s_i. Higher values indicate siblings at lower levels in the tree; in other words, the maximum value indicates leaf siblings. Note that if the posets contain missing or incorrect data, the algorithm will not be able to construct a tree for the gene corresponding to that set of posets. As an example of these data structures, consider the set of posets 𝒫 whose matrix A is shown in Table 1.

Table 1. Matrix A for the set of posets 𝒫.

https://doi.org/10.1371/journal.pone.0281824.t001

And the graph G that is represented by the matrix A given in Table 1 is shown in Fig 9, where s₁, s₂, and s₃ are siblings, and their parent and s₄ are both children of the root.

Fig 9. An undirected graph with cliques representing siblings.

https://doi.org/10.1371/journal.pone.0281824.g009

The following example illustrates the data structures used in tree construction. The matrix shown in Table 2 is constructed for the posets in Fig 10.

Fig 10. Diagram for posets [3].

https://doi.org/10.1371/journal.pone.0281824.g010

Table 2. Matrix A for posets in Fig 10.

https://doi.org/10.1371/journal.pone.0281824.t002

The graph in Fig 11 shows the cliques that represent siblings indicated by matrix A in Table 2.

Fig 11. An undirected graph corresponding to the matrix shown in Table 2.

https://doi.org/10.1371/journal.pone.0281824.g011

The first row of matrix A indicates that s₂ is a sibling of s₁: the maximum value in the s₁ row is 3, and it appears only in the s₂ column, so s₂ is the only sibling of s₁. This is also clear in the graph shown in Fig 11. Similarly, s₄ and s₅ are siblings.

The algorithm starts by inferring siblings, which it does by detecting cliques in the graph. For each species, the algorithm scans the row corresponding to that species and detects which species are connected to it by edges with equal labels. The detected species are all siblings. After each set of siblings is detected comes the updating step, in which the rows and columns of the siblings are merged. This procedure is repeated until only one node remains, which is the root.

After scanning the s₁ row, the matrix A is reduced as shown in Table 3.

Table 3. Matrix A for posets in Fig 10 after reducing s₁ and s₂.

https://doi.org/10.1371/journal.pone.0281824.t003

Similarly, the matrix A is reduced after detecting the siblings s₄ and s₅, as shown in Table 4.

Table 4. Updated matrix A for posets in Fig 10 after reducing s₄ and s₅.

https://doi.org/10.1371/journal.pone.0281824.t004

This procedure is repeated, but this time the highest integer is 2; therefore, s₃ is a sibling of s₁₂, the parent of s₁ and s₂. The new matrix is shown in Table 5.

Table 5. Updated matrix A for posets in Fig 10.

https://doi.org/10.1371/journal.pone.0281824.t005

The final step creates one root for the remaining species because all the values are 0, hence, all the remaining species are at the same level. The tree reconstructed from the posets in Fig 10 is shown in Fig 12.

Fig 12. Tree corresponding to the posets in Fig 10.

https://doi.org/10.1371/journal.pone.0281824.g012

Another example to further illustrate the algorithm uses the set of posets in Fig 13.

Fig 13. Diagrams for posets.

https://doi.org/10.1371/journal.pone.0281824.g013

The matrix in Table 6 is constructed for the set of posets in Fig 13.

Table 6. Matrix A for posets in Fig 13.

https://doi.org/10.1371/journal.pone.0281824.t006

The largest integer is 3, and it indicates that s₁, s₂, and s₃ are siblings, as well as s₄, s₅, and s₆.

The matrix then becomes as shown in Table 7.

Table 7. Updated matrix A for posets in Fig 13.

https://doi.org/10.1371/journal.pone.0281824.t007

Therefore, one root is created for the remaining two nodes to construct the tree in Fig 14.

Fig 14. Tree corresponding to the posets in Fig 13.

https://doi.org/10.1371/journal.pone.0281824.g014

The following example illustrates how the algorithm constructs an S-tree from a set of posets 𝒫.

Consider the set of species S = {s₁, s₂, s₃, s₄, s₅} with the set of posets 𝒫 shown in Fig 15.

Fig 15. Set of posets 𝒫 for the set of species S = {s₁, s₂, s₃, s₄, s₅}.

https://doi.org/10.1371/journal.pone.0281824.g015

The corresponding A matrix is shown in Table 8.

Table 8. Matrix A for the posets in Fig 15.

https://doi.org/10.1371/journal.pone.0281824.t008

Therefore, the maximum is 3, with the siblings s₁ and s₂, as well as s₃ and s₄.

And, the matrix A becomes as shown in Table 9.

Table 9. Updated matrix A for the posets in Fig 15.

https://doi.org/10.1371/journal.pone.0281824.t009

Now, s₅ is a sibling of both s₁₂ and s₃₄, giving one root for the three nodes. The constructed tree is shown in Fig 16.

Fig 16. The tree corresponding to the set of posets 𝒫 in Fig 15.

https://doi.org/10.1371/journal.pone.0281824.g016

Fig 17 shows the algorithm for reconstructing a tree from a set of posets 𝒫. The algorithm validates the matrix A by testing that A[i, j] = A[j, i] for all i and j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n. The algorithm also uses a subroutine to find cliques with equal edge labels; this subroutine scans the matrix A to find a clique whose edge labels equal the maximum value. The subroutine AddSiblings adds the vertices that belong to a given clique as siblings in the tree T and reduces the graph by merging the corresponding rows and columns of the matrix A.

Fig 17. Algorithm to construct an S-tree from a set of posets 𝒫.

https://doi.org/10.1371/journal.pone.0281824.g017

Fig 18 shows the subroutine for validating the matrix A. Fig 19 shows the subroutine that finds the maximum value stored in the matrix A, and Fig 20 shows the subroutine that finds the clique with edge labels equal to the maximum value. The subroutine in Fig 21 adds the nodes in the clique found as siblings in the tree under construction.

Fig 18. Algorithm to validate an n × n matrix A.

https://doi.org/10.1371/journal.pone.0281824.g018

Fig 19. Algorithm to find the maximum of a matrix A.

https://doi.org/10.1371/journal.pone.0281824.g019

Fig 20. Algorithm to find a clique with edge labels equal to max.

https://doi.org/10.1371/journal.pone.0281824.g020

Fig 21. Algorithm to add elements of a clique as siblings in a tree T.

https://doi.org/10.1371/journal.pone.0281824.g021
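The overall procedure (validate the matrix, find the maximum label, find a clique with that label, add its members as siblings, and merge their rows and columns) can be sketched in Python as below. This is an illustrative reading of the steps described in the text, not a transcription of the pseudocode in Figs 17 to 21: the function name `construct_tree`, the generated parent names `p1`, `p2`, …, the error handling, and the example matrix are all assumptions. The example matrix was chosen to match the walkthrough for the posets in Fig 10 (label 3 between s₁ and s₂ and between s₄ and s₅, label 2 between s₃ and each of s₁ and s₂, and 0 elsewhere).

```python
def construct_tree(labels, a):
    """Build a tree by repeatedly merging cliques of maximum edge label.

    labels: leaf names; a: symmetric matrix (list of lists).
    Returns (root, children), where children maps each internal node to
    the list of its children. Raises ValueError if merging breaks down.
    """
    n = len(labels)
    if any(a[i][j] != a[j][i] for i in range(n) for j in range(n)):
        raise ValueError("matrix A is not symmetric")
    children = {}
    if n == 1:
        return labels[0], children
    # Dict-of-dicts copy so merged nodes can be added and removed.
    w = {labels[i]: {labels[j]: a[i][j] for j in range(n) if j != i}
         for i in range(n)}
    counter = 0
    while len(w) > 1:
        # Find the maximum label and one node u incident to it.
        u, best = max(((x, max(row.values())) for x, row in w.items()),
                      key=lambda t: t[1])
        # Clique: u plus every node joined to u by an edge labeled best.
        clique = [u] + [v for v, val in w[u].items() if val == best]
        outside = [v for v in w if v not in clique]
        for x in clique:
            if any(w[x][y] != best for y in clique if y != x) or \
               any(w[x][v] != w[u][v] for v in outside):
                raise ValueError("matrix does not describe a tree")
        # Add the clique members as siblings under a new parent and
        # merge their rows and columns into a single row and column.
        counter += 1
        parent = f"p{counter}"
        children[parent] = clique
        row = {v: w[u][v] for v in outside}
        for x in clique:
            del w[x]
        for v in outside:
            for x in clique:
                del w[v][x]
            w[v][parent] = row[v]
        w[parent] = row
    return parent, children


labels = ["s1", "s2", "s3", "s4", "s5"]
matrix = [[0, 3, 2, 0, 0],
          [3, 0, 2, 0, 0],
          [2, 2, 0, 0, 0],
          [0, 0, 0, 0, 3],
          [0, 0, 0, 3, 0]]
root, children = construct_tree(labels, matrix)
```

On this input the sketch first joins s₁ and s₂, then s₄ and s₅, then s₃ with the parent of s₁ and s₂, and finally creates a root over the two remaining nodes, mirroring the sequence of reductions described for Tables 2 to 5. The sketch favors readability and makes no attempt to match the running time analyzed in Theorem 3.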

Theorem 3. The algorithm ConstructTree has O(n³) time complexity.

Proof. Lines 2–6 of the algorithm ConstructTree contain two nested loops, each of which repeats n times. The statement on line 6, which is executed inside the nested loops, takes O(n) time, because the poset P_i contains at most n ordered pairs with x = s_j. Therefore, the total time for these three levels of nesting is O(n³). Line 8 scans the matrix A in O(n²) time. The while loop on line 10 repeats at most n times; within it, FindMax on line 11 is O(n²), FindClique on line 12 is O(n²), and AddSiblings on line 13 is O(n). Therefore, the while loop takes O(n³) time, and the complexity of the algorithm is O(n³).

Theorem 4. The algorithm ConstructTree solves the Compatible Tree Construction problem.

Proof. We use induction on the number of species n. For n = 1 and n = 2, there is no maximum value in the matrix A; hence, the tree is trivial. For n = 3, there are three possibilities for the third species s₃: either s₃ is a sibling of s₁ and s₂, a sibling of their parent, or a sibling of one of them. The algorithm checks the values in the matrix A. If A(1, 3) = A(2, 3) = A(1, 2), then s₃ is a sibling of s₁ and s₂; otherwise, s₃ is a sibling of their parent. If s₁ and s₂ are not siblings, then the values in the matrix A identify s₃ as a sibling of one of them, which is the third possibility. After siblings are detected, the matrix A is reduced by eliminating the siblings and replacing them by their parent. Therefore, for n species, the algorithm scans the matrix A and, at each step, eliminates the siblings and replaces them by their parent; this reduces the matrix A until only one node remains, which is the root.

Generating a set of posets from a given S-tree

For each tree T, there exists a set of posets compatible with T. In this section, we show how given a tree T, the set of compatible posets can be generated.

A set of posets is compatible with an S-tree T if, for all triples of distinct leaves x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, λ(z) = s_k, and s_j ≤_i s_k, the shortest path from either x or y to z passes through MRCA(x, y). Therefore, the procedure for obtaining posets from a tree is straightforward. Given a tree T, it is clear which species are closer to each other than others, and hence posets can be generated. By obtaining the path from each species (leaf node) to the root of the tree and laying this path out horizontally, we get the nodes sorted in order of closeness to that leaf node. Each node on the path represents a subtree, and the leaves of that subtree that have not already been placed at an earlier level form one level of the poset.

An example to illustrate how posets are generated from a tree is shown in Fig 22. The tree on the right shows the path from s₁ to the root, where each node on the path is a root to a subtree, and the leaves belonging to each subtree represent a level of the poset P₁. The subtree with the root s₁ has only one leaf and that is s₁. The second level of the poset contains the leaves in the subtree with the root x, and that is only s₂, then comes the last level, in the subtree with the root r, and this subtree contains the leaves s₃ and s₄. Therefore, the poset P₁ is generated as follows. P₁ = {(s₁, s₂), (s₁, s₃), (s₁, s₄), (s₂, s₃), (s₂, s₄)}.

Fig 22. An example of how the poset corresponding to s₁ is generated.

https://doi.org/10.1371/journal.pone.0281824.g022

For example, given the tree shown in Fig 12, we look at each species to generate the corresponding poset. Starting with s₁, the poset P₁ automatically contains the ordered pairs (s₁, s₂), (s₁, s₃), (s₁, s₄), and (s₁, s₅). It is clear from the tree that s₂ is the closest sibling to s₁; this adds the ordered pairs (s₂, s₃), (s₂, s₄), and (s₂, s₅) to the poset P₁. The ordered pairs (s₃, s₄) and (s₃, s₅) are also added. In a similar manner, the posets P₂, P₃, P₄, and P₅ are generated as shown in Fig 10.
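The path-scanning procedure described above can be illustrated with a minimal Python sketch. This is not the GeneratePosets pseudocode of Fig 23; the function names (`leaves_under`, `poset_levels`, `poset_pairs`) and the dictionary encoding of the tree are assumptions made for illustration. The example tree follows the description of Fig 22, assuming that s₃ and s₄ hang directly off the root r.

```python
def leaves_under(children, node):
    """Return the set of leaves in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves_under(children, kid)
    return found


def poset_levels(children, parent, species):
    """Walk from `species` to the root; each ancestor contributes the
    leaves of its subtree that have not been seen yet as one level."""
    levels, seen, node = [], set(), species
    while node is not None:
        new = leaves_under(children, node) - seen
        if new:
            levels.append(sorted(new))
            seen |= new
        node = parent.get(node)  # None once the root has been processed
    return levels


def poset_pairs(levels):
    """Ordered pairs (a, b) with a on a strictly lower level than b."""
    pairs = set()
    for i, low in enumerate(levels):
        for high in levels[i + 1:]:
            pairs |= {(a, b) for a in low for b in high}
    return pairs


# Hypothetical encoding of the tree described for Fig 22:
# r has children x, s3, s4 (assumed); x has children s1, s2.
children = {"r": ["x", "s3", "s4"], "x": ["s1", "s2"]}
parent = {"x": "r", "s3": "r", "s4": "r", "s1": "x", "s2": "x"}

levels = poset_levels(children, parent, "s1")
print(levels)                       # [['s1'], ['s2'], ['s3', 's4']]
print(sorted(poset_pairs(levels)))  # the five ordered pairs of P1
```

Under these assumptions, the pairs produced for s₁ agree with the poset P₁ listed for Fig 22.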

Theorem 5. The algorithm GeneratePosets shown in Fig 23 generates the set of posets that is compatible with a given tree T.

Fig 23. Algorithm to generate a set of posets from an S-tree T.

https://doi.org/10.1371/journal.pone.0281824.g023

Proof. Using a proof by construction, we show that the algorithm GeneratePosets generates the set of posets compatible with a given tree T. From the definition of compatible in Section 2 of the main document, we know that an s_i-poset P_i = (S, ≤_i) is compatible with an S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, and λ(z) = s_k and such that s_j ≤_i s_k, the shortest path from either x or y to z passes through MRCA(x, y). For a species s_i, the algorithm GeneratePosets finds the path p from s_i to the root r. The nodes that come first on the path p are closer to s_i and, hence, come at a lower level in the poset. This follows from the definition of compatible, which indicates that if s_j ≤_i s_k, then the shortest path from either x or y to z passes through MRCA(x, y). Therefore, by scanning the path p, the set of posets can be constructed.

Theorem 6. The algorithm GeneratePosets has a time complexity of O(n³).

Proof. Let the number of species be n. The loop on line 2 iterates n times, and on line 3, finding the path from a given species to the root is also linear in the number of species; this gives a complexity of O(n²). Then, on line 7, the while loop is also linear in n, and on line 9, finding all leaves in a subtree is linear as well. This gives a total complexity of O(n³).

Relating posets to trees

The following theorems relate posets and trees to one another.

Theorem 7. Given a set of posets 𝒫, if there exists an S-tree T that is compatible with 𝒫, then T can be used to generate the same set of posets 𝒫.

Proof. Given a set of posets 𝒫, assume that 𝒫 is compatible with a tree T. Assume that T, in turn, generates a different set of posets 𝒫′. Then 𝒫′ can be used to construct a tree T′ that is compatible with 𝒫′, and T′ is expected to be equivalent to T. However, since 𝒫 and 𝒫′ are not equal, the two trees constructed are also not the same. Since T and T′ are different, they can yield contradictory 2-partitions; that is, T and T′ may be contradictory trees, and hence one of them cannot be used to give the same set of posets. This is a contradiction, so T cannot be used to generate a set of posets other than 𝒫.

Theorem 8. Let 𝒫₁ and 𝒫₂ be two sets of posets that are compatible with the two S-trees T₁ and T₂, respectively. Then T₁ and T₂ are contradictory if and only if there exist a poset P ∈ 𝒫₁ and a poset P′ ∈ 𝒫₂ such that P is inconsistent with P′.

Proof. First, we prove that if T₁ and T₂ are contradictory, then there exist a poset P ∈ 𝒫₁ and a poset P′ ∈ 𝒫₂ such that P is inconsistent with P′. Using a proof by contradiction, assume that T₁ and T₂ are contradictory and that there are no posets P ∈ 𝒫₁ and P′ ∈ 𝒫₂ such that P is inconsistent with P′. Since T₁ and T₂ are contradictory, there exist an edge in T₁ and an edge in T₂ that, when cut, induce contradictory 2-partitions. This means that there exist four species s₁, s₂, s₃, and s₄ such that s₁ and s₂ belong to the same partition in one tree but not in the other. Similarly, s₃ and s₄ belong to the same partition in one tree but not in the other. Since the set of posets 𝒫₁ is compatible with T₁ and the set of posets 𝒫₂ is compatible with T₂, and since T₁ and T₂ are contradictory, there exist a poset P ∈ 𝒫₁ and a poset P′ ∈ 𝒫₂ such that P is inconsistent with P′. This contradicts the assumption.

The second part of the proof shows that if there exist a poset P ∈ 𝒫₁ and a poset P′ ∈ 𝒫₂ such that P is inconsistent with P′, then T₁ and T₂ are contradictory. Using a proof by contradiction, assume that there exist a poset P ∈ 𝒫₁ and a poset P′ ∈ 𝒫₂ such that P is inconsistent with P′ while T₁ and T₂ are non-contradictory. If P is inconsistent with P′, then 𝒫₁ is inconsistent with the set of posets 𝒫₂; hence, the two sets of posets can create contradictory 2-partitions in their corresponding trees, and therefore the trees that are compatible with the two sets of posets cannot be non-contradictory. This contradicts the assumption. Therefore, the theorem follows.

Fig 24 shows an example to illustrate Theorem 8. One poset in the set corresponding to the tree at the top indicates that s₂ is a sibling of s₁, while one poset in the set corresponding to the tree at the bottom indicates that s₃ is a sibling of s₁. Therefore, the two posets are inconsistent.

Fig 24. Two contradicting trees.

https://doi.org/10.1371/journal.pone.0281824.g024

Refinement of trees

We start with a basic result about refinement (Theorem 9).

Lemma 1. Let T be an S-tree. Let Q be the 2-partition set of T. Then Q is not contradictory with itself.

Proof. We show that every pair of 2-partitions in Q is non-contradictory. Consider an arbitrary pair of distinct edges of T. These two edges are the end edges of a unique path in T. Let u₀, u₁, …, u_{k − 1}, u_k be that path. Then the edges are (u₀, u₁) and (u_{k − 1}, u_k). These edges partition S into three sets: X, the set of species reachable from u₀ without using (u₀, u₁); Y, the set of species reachable from u_k without using (u_{k − 1}, u_k); and Z, the set of species reachable from u₁, u₂, …, u_{k − 1} without using (u₀, u₁) or (u_{k − 1}, u_k). The 2-partition corresponding to (u₀, u₁) is (X, Y ∪ Z), and the 2-partition corresponding to (u_{k − 1}, u_k) is (X ∪ Z, Y). Recall the definition of contradictory 2-partitions: two 2-partitions X = (X₁, X₂) and Y = (Y₁, Y₂) are contradictory if there exist four species s₁, s₂, s₃, s₄ such that s₁, s₂ ∈ X₁, s₃, s₄ ∈ X₂, s₁, s₃ ∈ Y₁, and s₂, s₄ ∈ Y₂. Let s₁, s₂, s₃, s₄ ∈ S. If s₁, s₂ ∈ X and s₃, s₄ ∈ Y ∪ Z, then s₁ and s₂ both lie in X ∪ Z, the same part of the second 2-partition, whereas the definition requires them to lie in different parts. Hence, the definition does not apply to the 2-partitions corresponding to (u₀, u₁) and (u_{k − 1}, u_k). Since the two edges were arbitrary, we conclude that Q is not contradictory with itself.
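The definition recalled in this proof can be checked mechanically. The definition asks for one species in each of X₁ ∩ Y₁, X₁ ∩ Y₂, X₂ ∩ Y₁, and X₂ ∩ Y₂, so two 2-partitions are contradictory exactly when all four intersections are non-empty. The following Python sketch tests that condition; the function name `contradictory` and the tuple-of-sets encoding are assumptions made for illustration, not part of the algorithms in the figures.

```python
def contradictory(p, q):
    """p = (X1, X2) and q = (Y1, Y2) are 2-partitions given as pairs of sets.

    They are contradictory when each part of p meets each part of q,
    i.e. a witness s1, s2, s3, s4 as in the definition exists.
    """
    (x1, x2), (y1, y2) = p, q
    return all(len(a & b) > 0 for a in (x1, x2) for b in (y1, y2))


# A contradictory pair: {s1, s2} | {s3, s4} against {s1, s3} | {s2, s4}.
p = ({"s1", "s2"}, {"s3", "s4"})
q = ({"s1", "s3"}, {"s2", "s4"})
print(contradictory(p, q))  # True

# The situation of Lemma 1 with X = {s1, s2}, Z = {s3}, Y = {s4, s5}:
# the 2-partitions (X, Y ∪ Z) and (X ∪ Z, Y) come from the same tree.
same_tree_1 = ({"s1", "s2"}, {"s3", "s4", "s5"})
same_tree_2 = ({"s1", "s2", "s3"}, {"s4", "s5"})
print(contradictory(same_tree_1, same_tree_2))  # False, since X ∩ Y is empty
```

The second call mirrors the argument of the proof: X and Y are disjoint, so one of the four intersections is always empty for two edges of the same tree.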

Lemma 2. Let T₁ be an S-tree, and let T₂ be a refinement of T₁. Let Q₁ be the 2-partition set of T₁, and let Q₂ be the 2-partition set of T₂. Then Q₁ ⊆ Q₂.

Proof. A refinement step adds one edge to T₁ and one 2-partition. By induction on the number of refinement steps to go from T₁ to T₂, we obtain Q₁ ⊆ Q₂.

Theorem 9. If S-tree T₂ can be obtained from S-tree T₁ using a number of refinement steps, then T₁ and T₂ are non-contradictory.

Proof. Let T₁ be an S-tree, and let T₂ be a refinement of T₁. Let Q₁ be the set of 2-partitions of T₁, and let Q₂ be the set of 2-partitions of T₂. By Lemma 2, Q₁ ⊆ Q₂. By Lemma 1, Q₂ is not contradictory with itself. Then Q₁ and Q₂ are non-contradictory, since otherwise Q₂ would be contradictory with itself. By definition, T₁ and T₂ are non-contradictory.

The posets given for each gene are used in the construction of one tree for each gene. These trees can contain contradictory information, as illustrated in Fig 24. To be able to identify HGT events, contradictory trees must be identified. This can be done by examining the number of ways leaves and the root in a tree can be partitioned. This is done by examining the cuts in edges that are not incident to leaf nodes. If two trees are contradictory, then there is evidence for HGT.

The minimum common refinement of two non-contradictory S-trees T₁ and T₂ is an S-tree T₃ that is a common refinement of T₁ and T₂ such that any other common refinement of T₁ and T₂ is a refinement of T₃.

Theorem 10. Let T₁ and T₂ be S-trees that are non-contradictory. Let Q₁ and Q₂ be their respective sets of 2-partitions. Then there exists a unique tree T₃ that is their minimum common refinement. Furthermore, if Q₃ is the set of 2-partitions of T₃, then Q₃ = Q₁ ∪ Q₂.

Proof. Define Q₃ = Q₁ ∪ Q₂. Therefore, Q₃ contains 2-partitions, where each 2-partition is obtained by cutting one edge of the tree T₃. Hence, the set Q₃ can be used to construct the tree T₃, by checking each 2-partition, starting with the 2-partition of minimum cardinality. Siblings in T₃ are inferred and the set is reduced. This process is repeated until only 2-partitions with one of its elements having cardinality one are remaining. Since Q₃ = Q₁ ∪ Q₂ and since Q₁ already corresponds to a tree and also Q₂ corresponds to a tree, all the 2-partitions in Q₁ and Q₂ already correspond to edges in a tree. Therefore, using the two sets, a more refined tree can be constructed. Since Q₁ and Q₂ both contain non-contradictory partitions, and since Q₃ = Q₁ ∪ Q₂, Q₃ also contains non-contradictory partitions, and hence, there exists a tree T₃ that corresponds to Q₃. Using induction, we start by Q₁ and T₁ and add 2-partitions from Q₂ to Q₁. Let k be the number of 2-partitions added. If k = 1, then a 2-partition is added from Q₂ to Q₁. Since T₁ and T₂ are non-contradictory, a 2-partition that exists in Q₂ but not in Q₁ only adds an internal node and an edge to T₁. Therefore, T₁ becomes a more refined tree. Hence, adding k 2-partitions to T₁ will further refine T₁ by adding more edges and internal nodes. Therefore, given Q₃, a set of non-contradictory 2-partitions, a tree T₃ can be constructed.

An algorithm for finding the minimum common refinement of T₁ and T₂ is shown in Fig 25. The algorithm finds all 2-partitions of T₁ and T₂. A 2-partition is found by cutting an edge of the tree and finding the leaves in the two subtrees induced. For example, cutting an edge (i, j), induces two subtrees, one with the root i and the other with the root j. Performing a depth-first search on the two subtrees finds the leaves in both subtrees. The species set for each subtree composes one of the 2-partitions; therefore, S(i) composes one partition, and S(j) composes the other.
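A minimal Python sketch of the edge-cutting step follows. It is not the pseudocode of Fig 26: instead of running a separate depth-first search on both sides of each cut, it computes the leaf set below every node once and takes the complement for the other side, which yields the same 2-partitions. The function name `two_partitions` and the dictionary encoding of the tree are assumptions, and the example tree is hypothetical.

```python
def two_partitions(children, root):
    """Return the 2-partitions induced by cutting each edge of the tree.

    `children` maps an internal node to its list of children; leaves are
    nodes without an entry. Each 2-partition is an unordered pair,
    encoded as a frozenset of two frozensets of species.
    """
    leaf_sets = {}

    def collect(node):
        # Depth-first search that records the leaves below every node.
        kids = children.get(node, [])
        if kids:
            found = frozenset().union(*(collect(k) for k in kids))
        else:
            found = frozenset([node])
        leaf_sets[node] = found
        return found

    all_species = collect(root)
    # Cutting the edge above `node` separates the leaves below it
    # from all remaining species.
    return {
        frozenset([below, all_species - below])
        for node, below in leaf_sets.items()
        if node != root
    }


# Hypothetical tree: r has children x, s3, s4; x has children s1, s2.
children = {"r": ["x", "s3", "s4"], "x": ["s1", "s2"]}
parts = two_partitions(children, "r")
print(len(parts))  # 5, one 2-partition per edge
```

In this example only the edge (r, x) gives a 2-partition with two species on each side; the four edges incident to leaves each separate a single species from the rest.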

Fig 25. Algorithm to find the minimum common refinement of two trees.

https://doi.org/10.1371/journal.pone.0281824.g025

The subroutine FindTwoPartitions shown in Fig 26 finds the 2-partition set for a given tree. When the 2-partitions sets are found for both trees, a union is performed on these sets to obtain the minimum common refinement tree.
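The union step can be sketched in Python as follows, under the assumption that each tree is already represented by its set of 2-partitions. The names `split`, `contradictory`, and `min_common_refinement_partitions` are hypothetical, the example trees are invented for illustration, and only the 2-partitions from internal edges are listed because those from edges incident to leaves are the same for every tree on the same species.

```python
from itertools import product


def split(a, b):
    """Encode the 2-partition (a, b) as an unordered pair of frozensets."""
    return frozenset([frozenset(a), frozenset(b)])


def contradictory(p, q):
    """True when every part of p intersects every part of q."""
    return all(len(a & b) > 0 for a, b in product(p, q))


def min_common_refinement_partitions(q1, q2):
    """Return Q1 ∪ Q2, or None if some pair of 2-partitions is contradictory."""
    if any(contradictory(p, q) for p, q in product(q1, q2)):
        return None
    return q1 | q2


# Hypothetical T1 groups s1 with s2; hypothetical T2 groups s4 with s5.
q1 = {split({"s1", "s2"}, {"s3", "s4", "s5"})}
q2 = {split({"s4", "s5"}, {"s1", "s2", "s3"})}
q3 = min_common_refinement_partitions(q1, q2)
print(len(q3))  # 2: the refined tree keeps both groupings

# A tree that groups s1 with s3 contradicts T1, so no common refinement exists.
q_bad = {split({"s1", "s3"}, {"s2", "s4", "s5"})}
print(min_common_refinement_partitions(q1, q_bad))  # None
```

Returning None for contradictory inputs is a design choice of this sketch; it reflects the precondition of Theorem 10 that the two trees are non-contradictory.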

Fig 26. Algorithm to find the 2-partitions set of a given tree.

https://doi.org/10.1371/journal.pone.0281824.g026

The algorithm that constructs a tree from its two-partition set is shown in Fig 27, followed by an illustrative example.

Fig 27. Algorithm to construct a tree from its 2-partitions set.

https://doi.org/10.1371/journal.pone.0281824.g027

As an example of minimum common refinement, consider two S-trees T₁ and T₂. If both trees can be refined, using a number of refinement steps, into a third S-tree T₃, then both trees are guaranteed to carry non-contradictory information. For example, the two S-trees T₁ and T₂ shown in Fig 28 are non-contradictory, and both are refined into T₃. In this example, T₃ is obtained using the minimum number of refinement steps; hence, T₃ is the minimum common refinement of T₁ and T₂.

Fig 28. Refinement of T₁ and T₂ into T₃.

https://doi.org/10.1371/journal.pone.0281824.g028

Fig 29 shows a further example of minimum common refinement, where the tree T₃ is the minimum common refinement of the two trees T₁ and T₂. Here T₃ is obtained using one refinement step, performed on T₁ by adding a parent for s₃ and s₄. The refined tree is the same tree as T₂.

Fig 29. T₃ is the minimum common refinement of T₁ and T₂.

https://doi.org/10.1371/journal.pone.0281824.g029

Fig 30 shows an example to illustrate the algorithm. The node s₀ is added under the root to avoid having equivalent sets for a 2-partition, as these equivalent sets disappear when performing the union operation. In the example, T₁ has eight edges, including the edge connecting s₀ to the root. Hence, there are eight 2-partitions sets for T₁. Similarly, there are eight 2-partitions sets for T₂. The 2-partitions sets for T₁ are as follows:

Fig 30. An example to illustrate the algorithm MinCommonRefine.

https://doi.org/10.1371/journal.pone.0281824.g030

The 2-partitions sets for T₂ are as follows:

The union of the two sets of partitions gives the following 2-partitions sets, which are the sets that give the tree T₃:

Let us consider the following two-partition set, Q, to illustrate the algorithm.

The algorithm starts by removing all sets with cardinality 1. So the set Q is reduced to the following:

The set with the minimum cardinality is in Q₇; therefore, the species s₁ and s₂ are detected as siblings and are replaced by a parent node u₁ in all sets, and Q is modified accordingly. The next step finds the minimum cardinality in both Q₈ and Q₉, where u₁ and s₃ are siblings, and s₄ and s₅ are siblings. When Q₈ and Q₉ are removed from Q, it becomes empty and the root connects the subtrees constructed. Fig 31 shows the tree constructed from the two-partition set Q.
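The construction of a tree from a non-contradictory two-partition set can be sketched in Python. This sketch swaps in a different technique from the sibling-merging loop of ConstructTree2Partitions: it takes, for each 2-partition, the side that does not contain s₀ (called a cluster here) and nests the clusters by inclusion, which should give the same tree when the 2-partitions are pairwise non-contradictory. The function name `tree_from_clusters` and the cluster data are assumptions, chosen to be consistent with the walkthrough above.

```python
def tree_from_clusters(clusters, species):
    """Return a child -> parent map over clusters (frozensets of species).

    Assumes the clusters are nested or disjoint, which is the case when
    they come from pairwise non-contradictory 2-partitions.
    """
    root = frozenset(species)
    nodes = {frozenset([s]) for s in species}
    nodes |= {frozenset(c) for c in clusters}
    nodes.add(root)

    ordered = sorted(nodes, key=len)
    parent = {}
    for i, node in enumerate(ordered):
        # The first strictly larger cluster containing `node` is its parent.
        for bigger in ordered[i + 1:]:
            if node < bigger:
                parent[node] = bigger
                break
    return parent


# Assumed clusters: s1 and s2 are siblings, their parent is a sibling
# of s3, and s4 and s5 are siblings.
species = ["s1", "s2", "s3", "s4", "s5"]
clusters = [{"s1", "s2"}, {"s1", "s2", "s3"}, {"s4", "s5"}]
parent = tree_from_clusters(clusters, species)

u1 = frozenset({"s1", "s2"})
print(parent[frozenset({"s1"})] == u1)               # True
print(parent[u1] == frozenset({"s1", "s2", "s3"}))   # True
```

Each key of the returned map is a node identified by the species below it, so the cluster {s₁, s₂} plays the role of the parent node u₁ in the walkthrough.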

Fig 31. An example to illustrate the algorithm ConstructTree2Partitions.

https://doi.org/10.1371/journal.pone.0281824.g031

Theorem 11. The time complexity of MinCommonRefine is O(mn + n²).

Proof. Let n be the number of species, and let m be the number of edges in a tree T. The subroutine FindTwoPartitions, called on lines 3 and 4, is O(mn). Line 5 performs a union operation that is linear in the number of species. Line 6 constructs the tree from its two-partition set using ConstructTree2Partitions, which is O(n²). Therefore, the overall complexity of the algorithm MinCommonRefine is O(mn + n²).

Inferring HGT from posets

In this section, we show how posets and trees are used to infer HGT.

The problem is defined as follows:

Inferring HGT From Posets
INSTANCE: Set S = {s₁, s₂, …, s_n} of n taxa; set G = {g₁, g₂, …, g_m} of m genes; mn individual posets P_ij = (S, <_ij), for 1 ≤ i ≤ m and 1 ≤ j ≤ n.
SOLUTION: Sets of genes corresponding to contradictory trees.

A number of steps are followed to be able to infer HGT events. First, trees are constructed from posets, then the different trees are compared, where contradictory trees are identified. Trees that are contradictory with the majority of trees suggest HGT. Other events such as gene duplication, gene loss, and incomplete lineage sorting can cause the incongruence of trees [40]. In the “Constructing an S-tree From a Set of Posets” Section, we show how trees are constructed from posets; in what follows, we show how contradictory trees are detected. The algorithm DetectContradiction shown in Fig 32 takes two trees as input and detects whether they are contradictory or not.

Fig 32. Algorithm to detect contradiction between two trees.

https://doi.org/10.1371/journal.pone.0281824.g032

The process of identifying which genes are candidates for HGT proceeds as follows. Two S-trees T₁ and T₂ are tested for contradiction. If they are contradictory, then they are placed in two different sets; if not, they are placed in one set. The process continues with the remaining trees. If the next tree to be tested is T₃, then it is compared with one tree from each set to determine to which set T₃ belongs. It is expected that the majority of the trees will be non-contradictory, with some trees contradicting this majority, so there will be one set with a higher cardinality. Therefore, the other sets, which are the minority, are considered candidates for HGT.
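The grouping process described above can be sketched in Python, assuming that each gene tree is represented by its set of 2-partitions. This is an illustration rather than the InferHGT pseudocode of Fig 33; the names `split`, `trees_contradict`, `group_trees`, and `hgt_candidates` are hypothetical, and the four trees are invented to match the situation in which three trees agree and a fourth groups s₁ with s₃.

```python
from itertools import product


def split(a, b):
    """Encode the 2-partition (a, b) as an unordered pair of frozensets."""
    return frozenset([frozenset(a), frozenset(b)])


def trees_contradict(q1, q2):
    """True when some 2-partition of q1 contradicts some 2-partition of q2."""
    return any(
        all(len(a & b) > 0 for a, b in product(p, q))
        for p, q in product(q1, q2)
    )


def group_trees(trees):
    """Place each tree in the first set whose representative it does not contradict."""
    groups = []
    for idx, tree in enumerate(trees):
        for group in groups:
            if not trees_contradict(trees[group[0]], tree):
                group.append(idx)
                break
        else:
            groups.append([idx])  # contradicts every representative
    return groups


def hgt_candidates(trees):
    """Indices of trees outside the largest set of mutually grouped trees."""
    groups = group_trees(trees)
    if not groups:
        return []
    majority = max(groups, key=len)
    return [i for g in groups if g is not majority for i in g]


agree = {split({"s1", "s2"}, {"s3", "s4"})}
differ = {split({"s1", "s3"}, {"s2", "s4"})}
trees = [agree, agree, agree, differ]
print(group_trees(trees))     # [[0, 1, 2], [3]]
print(hgt_candidates(trees))  # [3]
```

As in the text, each new tree is compared with only one representative per set, so the result can depend on the order in which trees are processed when the trees are not fully refined.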

The algorithm performs ideally when all the trees are completely refined (binary) trees, where the trees that are not identical are considered contradictory. In what follows, some real-life HGT examples are given to support the argument that the genes involved in HGT are a minority and that there will always be a dominant tree. Ponting [41] indicates that only 0.5% of all human genes were copied into the genome from bacteria by HGT. Rujan and Martin [42] analyzed how many genes in Arabidopsis come from cyanobacteria. They used a sample of 3961 Arabidopsis nuclear protein-coding genes and compared those with the complete set of proteins from yeast and 17 reference prokaryotic genomes, including one cyanobacterium. In their analysis of 386 phylogenetic trees, they found that the number of genes horizontally transferred to Arabidopsis from cyanobacteria falls between approximately 400 genes and approximately 2200 genes, that is, between 1.6% and 9.2% of nuclear genes.

The algorithm InferHGT is shown in Fig 33. The input to the algorithm is a set of trees T = {T₁, T₂, …, T_n}, where n is the number of trees and also the number of genes.

Fig 33. Algorithm to infer HGT.

https://doi.org/10.1371/journal.pone.0281824.g033

An example to illustrate the algorithm for inferring HGT is shown in Fig 34, where the trees T₁, T₂, and T₃ are non-contradictory, while the tree T₄ contradicts the three trees. In T₄ there is a 2-partition that places the two species {s₁, s₃} in one partition and {s₂, s₄} in the other. This 2-partition contradicts the other three trees. Therefore, the gene corresponding to T₄ is a candidate for HGT, where a horizontal transfer occurred between s₁ and s₃, or between s₂ and s₄. The network in Fig 1 shows the possible horizontal transfers. We note that the figure documents not only the existence of two possible horizontal transfers but also their directionality, which is especially valuable for any further investigation.

Fig 34. An example to illustrate the algorithm InferHGT.

https://doi.org/10.1371/journal.pone.0281824.g034

Theorem 12. InferHGT has complexity max(O(n²), O(m²n)).

Proof. The two nested loops on lines 4 and 5 are O(n²), where n is the number of trees. The subroutine DetectContradiction on line 6 is O(m²n), where m is the number of edges in a tree.

Conclusions

We have introduced the theoretical problem of inferring HGT using partial orders, where there is one poset per gene per species. These posets have been used to construct S-trees for the genes corresponding to these posets, one tree for each gene. These trees are then compared, where the trees that contradict the majority of trees correspond to genes that are candidates for HGT. An algorithm for identifying contradiction is presented and then used in the algorithm to infer HGT. The concept of refinement is also presented in this paper, where it can also be used to identify contradiction among trees. An algorithm for finding a minimum common refinement for two trees is also presented. This algorithm finds the union of the 2-partition sets of two trees and then uses this set to construct a third tree, which is their minimum common refinement. Other points in this problem can be studied further. For example, more work could be done to find solutions to the problem of incorrect or missing data in the input posets. This will be challenging but, from a practical viewpoint, most valuable. Another direction is to develop algorithms that use the refinement of trees for identifying contradictory trees, since two contradictory trees do not have a common refinement.

Acknowledgments

We thank Ruth Grene (biology consultant), Ayman Abdel Hamid, T.M. Murali, and João Setubal for valuable comments and Thomas Jones for implementing some of the algorithms.

References

1. Daubin V, Szoellosi GJ. Horizontal Gene Transfer and the History of Life. Cold Spring Harbor Perspectives in Biology. 2016;8. pmid:26801681
2. Lake JA, Rivera MC. Deriving the genomic tree of life in the presence of horizontal gene transfer: Conditioned reconstruction. Molecular Biology and Evolution. 2004;21(4):681–690. pmid:14739244
3. Belal NA, Heath LS. Inferring horizontal gene transfers from posets. In: 2nd International Conference on Computer Technology and Development, ICCTD 2010; 2010. p. 32–36.
4. Bailey CD, Fain MG, Houde P. On conditioned reconstruction, gene content data, and the recovery of fusion genomes. Molecular Phylogenetics and Evolution. 2006;39:263–270. pmid:16414287
5. Bapteste E, Walsh DA. Does the ring of life ring true? Trends in Microbiology. 2005;13(6):256–261. pmid:15936656
6. Belal NA. Two Problems in Computational Genomics [PhD Dissertation]. Virginia Tech. Blacksburg, Virginia; 2011.
7. Bansal MS, Alm EJ, Kellis M. Reconciliation Revisited: Handling Multiple Optima when Reconciling with Duplication, Transfer, and Loss. Journal of Computational Biology. 2013;20:738–754. pmid:24033262
8. Bansal MS, Wu YC, Alm EJ, Kellis M. Improved Gene Tree Error Correction in the Presence of Horizontal Gene Transfer. Bioinformatics. 2015;31:1211–1218. pmid:25481006
9. Chan YB, Ranwez V, Scornavacca C. Exploring the Space of Gene/Species Reconciliations with Transfers. Journal of Mathematical Biology. 2015;71:1179–1209. pmid:25502987
10. Liu L, Wu S, Yu L. Coalescent Methods for Estimating Species Trees from Phylogenomic Data. Journal of Systematics and Evolution. 2015;53:380–390.
11. Nguyen M, Ekstrom A, Li X, Yin Y. HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus Genomes. Toxins. 2015;7:4035–4053. pmid:26473921
12. Podell S, Gaasterland T. DarkHorse: A method for genome-wide prediction of horizontal gene transfer. Genome Biology. 2007;8(2):R16.1–R16.18. pmid:17274820
13. Xiang H, Zhang R, De Koeyer D, Pan G, Li T, Liu T, et al. New Evidence on the Relationship Between Microsporidia and Fungi: A Genome-Wide Analysis by DarkHorse Software. Canadian Journal of Microbiology. 2014;60:557–568. pmid:25134955
14. Nakhleh L, Jin G, Zhao F, Mellor-Crummey J. Reconstructing phylogenetic networks using maximum parsimony. In: CSB’05: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference; 2005. p. 93–102.
15. Jin G, Nakhleh L, Snir S, Tuller T. Inferring phylogenetic networks by the maximum parsimony criterion: A case study. Molecular Biology and Evolution. 2007;24(1):324–337. pmid:17068107
16. Alix B, Boubacar DA, Vladimir M. T-REX: A Web Server for Inferring, Validating and Visualizing Phylogenetic Trees and Networks. Nucleic Acids Research. 2012;40(W1):W573–W579.
17. Cardona G, Pons JC, Rossello F. A Reconstruction Problem for a Class of Phylogenetic Networks with Lateral Gene Transfers. Algorithms for Molecular Biology. 2015;1–15. pmid:26691555
18. Layeghifard M, Peres-Neto PR, Makarenkov V. Inferring Explicit Weighted Consensus Networks to Represent Alternative Evolutionary Histories. BMC Evolutionary Biology. 2013;13. pmid:24359207
19. Nakhleh L. Evolutionary Phylogenetic Networks: Models and Issues. In: Heath LS, Ramakrishnan N, editors. Problem Solving Handbook in Computational Biology and Bioinformatics. New York: Springer; 2011. p. 125–158.
20. Pardi F, Scornavacca C. Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable. PLoS Computational Biology. 2015;11. pmid:25849429
21. Snir S, Trifonov E. A novel technique for detecting putative horizontal gene transfer in the sequence space. Journal of Computational Biology. 2010;17(11):1535–1548. pmid:20973741
22. Abby S, Tannier E, Gouy M, Daubin V. Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics. 2010;11(324):1–13. pmid:20550700
23. Adato O, Ninyo N, Gophna U, Snir S. Detecting Horizontal Gene Transfer between Closely Related Taxa. PLoS Computational Biology. 2015;11. pmid:26439115
24. Scornavacca C, Mayol JCP, Cardona G. Fast Algorithm for the Reconciliation of Gene Trees and LGT Networks. Journal of Theoretical Biology. 2017;418:129–137. pmid:28111320
25. Sanchez-Soto D, Aguero-Chapin G, Armijos-Jaramillo V, Perez-Castillo Y, Tejera E, Antunes A, et al. ShadowCaster: Compositional Methods Under the Shadow of Phylogenetic Models to Detect Horizontal Gene Transfers in Prokaryotes. Genes. 2020;11(7):12 pages. pmid:32645885
26. Bansal MS, Kellis M, Kordi M, Kundu S. RANGER-DTL 2.0: Rigorous Reconstruction of Gene-Family Evolution by Duplication, Transfer and Loss. Bioinformatics. 2018;34(18):3214–3216. pmid:29688310
27. van Iersel L, Janssen R, Jones M, Murakami Y, Zeh N. Polynomial-Time Algorithms for Phylogenetic Inference Problems Involving Duplication and Reticulation. IEEE-ACM Transactions on Computational Biology and Bioinformatics. 2020;17(1):14–26.
28. Hasic D, Tannier E. Gene Tree Reconciliation Including Transfers with Replacement Is NP-hard and FPT. Journal of Combinatorial Optimization. 2019;38(2):502–544.
29. Chan YB, Robin C. Reconciliation of a Gene Network and Species Tree. Journal of Theoretical Biology. 2019;472:54–66. pmid:30951730
30. Piovesan T, Kelk SM. A Simple Fixed Parameter Tractable Algorithm for Computing the Hybridization Number of Two (Not Necessarily Binary) Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2013;10(1):18–25. pmid:23702540
31. Schaller D, Lafond M, Stadler PF, Wieseke N, Hellmuth M. Indirect identification of horizontal gene transfer. Journal of Mathematical Biology. 2021;83:73 pages. pmid:34218334
32. Mallet J, Besansky N, Hahn MW. How reticulated are species? BioEssays. 2016;38(2):140–149. pmid:26709836
33. Nakhleh L, Warnow T, Linder CR. Reconstructing reticulate evolution in species: theory and practice. Journal of Computational Biology. 2005;12(6):796–811. pmid:16108717
34. Makarenkov V, Legendre P. Improving the additive tree representation of a dissimilarity matrix using reticulations. Data Analysis, Classification, and Related Methods, Springer, Berlin, Heidelberg. 2000; p. 35–40.
35. Makarenkov V, Kevorkov D, Legendre P. Phylogenetic network construction approaches. Applied Mycology and Biotechnology. 2006;6:61–97.
36. Boc A, Philippe H, Makarenkov V. Inferring and validating horizontal gene transfer events using bipartition dissimilarity. Systematic Biology. 2010;59(2):195–211. pmid:20525630
37. Sevillya G, Adato O, Snir S. Detecting horizontal gene transfer: a probabilistic approach. BMC Genomics. 2020;106(Suppl 1). pmid:32138652
38. Omelchenko MV, Makarova KS, Wolf YI, et al. Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biology. 2003;4(R55). pmid:12952534
39. Boc A, Makarenkov V. Towards an accurate identification of mosaic genes and partial horizontal gene transfers. Nucleic acids research. 2011;39(21):e144–e144. pmid:21917854
40. Than CV, Rosenberg NA. Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology. 2011;18(1):1–15. pmid:21210728
41. Ponting C. Plagiarized bacterial genes in the human book of life. Trends in Genetics. 2001;17(5):235–237. pmid:11335018
42. Rujan T, Martin W. How many genes in Arabidopsis come from cyanobacteria? An estimate from 386 protein phylogenies. Trends in Genetics. 2001;17(3):113–120. pmid:11226586

[ref1] 1. Daubin V, Szoellosi GJ. Horizontal Gene Transfer and the History of Life. Cold Spring Harbor Perspectives in Biology. 2016;8. pmid:26801681
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Lake JA, Rivera MC. Deriving the genomic tree of life in the presence of horizontal gene transfer: Conditioned reconstruction. Molecular Biology and Evolution. 2004;21(4):681–690. pmid:14739244
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Belal NA, Heath LS. Inferring horizontal gene transfers from posets. In: 2nd International Conference on Computer Technology and Development, ICCTD 2010; 2010. p. 32–36.

[ref4] 4. Bailey CD, Fain MG, Houde P. On conditioned reconstruction, gene content data, and the recovery of fusion genomes. Molecular Phylogenetic and Evolution. 2006;39:263–270. pmid:16414287
View Article
PubMed/NCBI
Google Scholar

[11] View Article

[12] PubMed/NCBI

[13] Google Scholar

[ref5] 5. Bapteste E, Walsh DA. Does the ring of life ring true? Trends in Microbiology. 2005;13(6):256–261. pmid:15936656
View Article
PubMed/NCBI
Google Scholar

[15] View Article

[16] PubMed/NCBI

[17] Google Scholar

[ref6] 6. Belal NA. Two Problems in Computational Genomics [PhD Dissertation]. Virginia Tech. Blacksburg, Virginia; 2011.

[ref7] 7. Bansal MS, Alm EJ, Kellis M. Reconciliation Revisited: Handling Multiple Optima when Reconciling with Duplication, Transfer, and Loss. Journal of Computational Biology. 2013;20:738–754. pmid:24033262
View Article
PubMed/NCBI
Google Scholar

[20] View Article

[21] PubMed/NCBI

[22] Google Scholar

[ref8] 8. Bansal MS, Wu YC, Alm EJ, Kellis M. Improved Gene Tree Error Correction in the Presence of Horizontal Gene Transfer. Bioinformatics. 2015;31:1211–1218. pmid:25481006
View Article
PubMed/NCBI
Google Scholar

[24] View Article

[25] PubMed/NCBI

[26] Google Scholar

[ref9] 9. Chan Yb, Ranwez V, Scornavacca C. Exploring the Space of Gene/Species Reconciliations with Transfers. Journal of Mathematical Biology. 2015;71:1179–1209. pmid:25502987
View Article
PubMed/NCBI
Google Scholar

[28] View Article

[29] PubMed/NCBI

[30] Google Scholar

[ref10] 10. Liu L, Wu S, Yu L. Coalescent Methods for Estimating Species Trees from Phylogenomic Data. Journal of Systematics and Evolution. 2015;53:380–390.

[ref11] 11. Nguyen M, Ekstrom A, Li X, Yin Y. HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus Genomes. Toxins. 2015;7:4035–4053. pmid:26473921

[ref12] 12. Podell S, Gaasterland T. DarkHorse: A method for genome-wide prediction of horizontal gene transfer. Genome Biology. 2007;8(2):R16.1–R16.18. pmid:17274820

[ref13] 13. Xiang H, Zhang R, De Koeyer D, Pan G, Li T, Liu T, et al. New Evidence on the Relationship Between Microsporidia and Fungi: A Genome-Wide Analysis by DarkHorse Software. Canadian Journal of Microbiology. 2014;60:557–568. pmid:25134955

[ref14] 14. Nakhleh L, Jin G, Zhao F, Mellor-Crummey J. Reconstructing phylogenetic networks using maximum parsimony. In: CSB’05: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference; 2005. p. 93–102.

[ref15] 15. Jin G, Nakhleh L, Snir S, Tuller T. Inferring phylogenetic networks by the maximum parsimony criterion: A case study. Molecular Biology and Evolution. 2007;24(1):324–337. pmid:17068107

[ref16] 16. Boc A, Diallo AB, Makarenkov V. T-REX: A Web Server for Inferring, Validating and Visualizing Phylogenetic Trees and Networks. Nucleic Acids Research. 2012;40(W1):W573–W579.

[ref17] 17. Cardona G, Pons JC, Rossello F. A Reconstruction Problem for a Class of Phylogenetic Networks with Lateral Gene Transfers. Algorithms for Molecular Biology. 2015;1–15. pmid:26691555

[ref18] 18. Layeghifard M, Peres-Neto PR, Makarenkov V. Inferring Explicit Weighted Consensus Networks to Represent Alternative Evolutionary Histories. BMC Evolutionary Biology. 2013;13. pmid:24359207

[ref19] 19. Nakhleh L. Evolutionary Phylogenetic Networks: Models and Issues. In: Heath LS, Ramakrishnan N, editors. Problem Solving Handbook in Computational Biology and Bioinformatics. New York: Springer; 2011. p. 125–158.

[ref20] 20. Pardi F, Scornavacca C. Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable. PLoS Computational Biology. 2015;11. pmid:25849429

[ref21] 21. Snir S, Trifonov E. A novel technique for detecting putative horizontal gene transfer in the sequence space. Journal of Computational Biology. 2010;17(11):1535–1548. pmid:20973741

[ref22] 22. Abby S, Tannier E, Gouy M, Daubin V. Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics. 2010;11(324):1–13. pmid:20550700

[ref23] 23. Adato O, Ninyo N, Gophna U, Snir S. Detecting Horizontal Gene Transfer between Closely Related Taxa. PLoS Computational Biology. 2015;11. pmid:26439115

[ref24] 24. Scornavacca C, Mayol JCP, Cardona G. Fast Algorithm for the Reconciliation of Gene Trees and LGT Networks. Journal of Theoretical Biology. 2017;418:129–137. pmid:28111320

[ref25] 25. Sanchez-Soto D, Aguero-Chapin G, Armijos-Jaramillo V, Perez-Castillo Y, Tejera E, Antunes A, et al. ShadowCaster: Compositional Methods Under the Shadow of Phylogenetic Models to Detect Horizontal Gene Transfers in Prokaryotes. Genes. 2020;11(7):12 pages. pmid:32645885

[ref26] 26. Bansal MS, Kellis M, Kordi M, Kundu S. RANGER-DTL 2.0: Rigorous Reconstruction of Gene-Family Evolution by Duplication, Transfer and Loss. Bioinformatics. 2018;34(18):3214–3216. pmid:29688310

[ref27] 27. van Iersel L, Janssen R, Jones M, Murakami Y, Zeh N. Polynomial-Time Algorithms for Phylogenetic Inference Problems Involving Duplication and Reticulation. IEEE-ACM Transactions on Computational Biology and Bioinformatics. 2020;17(1):14–26.

[ref28] 28. Hasic D, Tannier E. Gene Tree Reconciliation Including Transfers with Replacement Is NP-hard and FPT. Journal of Combinatorial Optimization. 2019;38(2):502–544.

[ref29] 29. Chan YB, Robin C. Reconciliation of a Gene Network and Species Tree. Journal of Theoretical Biology. 2019;472:54–66. pmid:30951730

[ref30] 30. Piovesan T, Kelk SM. A Simple Fixed Parameter Tractable Algorithm for Computing the Hybridization Number of Two (Not Necessarily Binary) Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2013;10(1):18–25. pmid:23702540

[ref31] 31. Schaller D, Lafond M, Stadler PF, Wieseke N, Hellmuth M. Indirect identification of horizontal gene transfer. Journal of Mathematical Biology. 2021;83:73 pages. pmid:34218334

[ref32] 32. Mallet J, Besansky N, Hahn MW. How reticulated are species? BioEssays. 2016;38(2):140–149. pmid:26709836

[ref33] 33. Nakhleh L, Warnow T, Linder CR. Reconstructing reticulate evolution in species: theory and practice. Journal of Computational Biology. 2005;12(6):796–811. pmid:16108717

[ref34] 34. Makarenkov V, Legendre P. Improving the additive tree representation of a dissimilarity matrix using reticulations. Data Analysis, Classification, and Related Methods, Springer, Berlin, Heidelberg. 2000; p. 35–40.

[ref35] 35. Makarenkov V, Kevorkov D, Legendre P. Phylogenetic network construction approaches. Applied Mycology and Biotechnology. 2006;6:61–97.

[ref36] 36. Boc A, Philippe H, Makarenkov V. Inferring and validating horizontal gene transfer events using bipartition dissimilarity. Systematic Biology. 2010;59(2):195–211. pmid:20525630

[ref37] 37. Sevillya G, Adato O, Snir S. Detecting horizontal gene transfer: a probabilistic approach. BMC Genomics. 2020;21(Suppl 1):106. pmid:32138652

[ref38] 38. Omelchenko MV, Makarova KS, Wolf YI, et al. Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biology. 2003;4(R55). pmid:12952534

[ref39] 39. Boc A, Makarenkov V. Towards an accurate identification of mosaic genes and partial horizontal gene transfers. Nucleic Acids Research. 2011;39(21):e144. pmid:21917854

[ref40] 40. Than CV, Rosenberg NA. Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology. 2011;18(1):1–15. pmid:21210728

[ref41] 41. Ponting C. Plagiarized bacterial genes in the human book of life. Trends in Genetics. 2001;17(5):235–237. pmid:11335018

[ref42] 42. Rujan T, Martin W. How many genes in Arabidopsis come from cyanobacteria? An estimate from 386 protein phylogenies. Trends in Genetics. 2001;17(3):113–120. pmid:11226586
