Abstract
We present a method for detecting horizontal gene transfer (HGT) using partial orders (posets). The method requires a poset for each species/gene pair, where we have a set of species S, and a set of genes G. Given the posets, the method constructs a phylogenetic tree that is compatible with the set of posets; this is done for each gene. Also, the set of posets can be derived from the tree. The trees constructed for each gene are then compared and tested for contradicting information, where a contradiction suggests HGT.
Citation: Belal NA, Heath LS (2023) A complete theoretical framework for inferring horizontal gene transfers using partial order sets. PLoS ONE 18(3): e0281824. https://doi.org/10.1371/journal.pone.0281824
Editor: Vladimir Makarenkov, Universite du Quebec a Montreal, CANADA
Received: June 13, 2022; Accepted: January 31, 2023; Published: March 24, 2023
Copyright: © 2023 Belal, Heath. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: This is a theoretical paper that addresses a problem from a mathematical point of view. The problem is proven to be NP-Complete and all algorithms and proofs are in the paper.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Most work in evolutionary genomics has focused on vertical gene transfer from one species to a lineal descendant. Much recent work has been directed towards the phenomenon of horizontal gene transfer (HGT) [1]. Because of the impact of HGTs on the ecological and pathogenic character of genomes, algorithms are sought that can computationally determine which genes of a given genome are products of HGT events. Numerous strategies have employed nucleotide composition of coding sequences to predict HGT. Previous methods marked the genes with atypical G + C content. Other methods used codon usage patterns to predict HGT. Also, many models used nucleotide patterns as a genomic signature; these models have been analyzed using sliding windows, Bayesian classifiers, Markov models, and support vector machines. While no previous work uses partial orders to investigate HGT, we do summarize computational research for detecting HGT in the later Related Literature section.
Suppose that we have complete, annotated genomes for m species. Further, suppose that we have selected a set of n genes, from some reference genome or otherwise, for analysis. If we know the relative distances between each pair of species per gene, then we have a set of partial orders defining the relative relationship among species that can be used to identify which genes are candidates for HGT. Given a poset for each gene, a tree corresponding to that gene is constructed; different trees suggest genes that are candidates for HGT. Once HGT is indicated, additional time-related information can be brought to bear to determine the relative order of events and to establish direction. In fact, our algorithm predicts direction as illustrated in Fig 1.
Suppose that we have complete, annotated genomes for species s1, s2, …, sm. Further, suppose that we have selected a set of genes, from some reference genome or otherwise, for analysis. Let those genes be g1, g2, …, gn. Standard methods for obtaining the set of genes, such as the one in Lake and Rivera [2], can be followed. BLASTing gene gk in species si against a database of genes from all m species, we obtain a bit score B(gk; si, sj) of a best alignment of that gene against the same gene in species sj. If gk is not found in sj, then set B(gk; si, sj) = 0. In general, the higher B(gk; si, sj) is, the better the match between gene gk in species si and gene gk in species sj. There is no need to take special notice of an absent gene, since B(gk; si, sj) = 0 is a meaningful substitute for a Boolean value representing presence or absence of a gene.
There is another quantity associated with the (gk, si, sj) triple. Define T(gk; si, sj) to be the true evolutionary distance in time between the gk gene of si and the gk gene of sj, that is, the distance reflecting what actually happened during the process of gene evolution. For example, if the most recent common ancestor of the two genes existed 20 million years ago, then T(gk; si, sj) is 40 million years. While these T(gk; si, sj) values cannot be measured directly, either absolute or relative values for times can be estimated using probabilistic models.
The B(gk; si, sj) values are not random. In fact, a ranking of the B(gk; si, sj) values for 1 ≤ j ≤ m should roughly match a ranking of the T(gk; si, sj) values from the si gene gk to all the other gk’s. In the absence of HGT or other horizontal evolutionary events, we must have T(gk; si, sj) = T(gℓ; si, sj) for every pair of genes gk and gℓ. Therefore, we expect that the rankings of the B(gk; si, sj) and B(gℓ; si, sj) values will be similar in ways we want to explore. Moreover, under reasonable assumptions, the distribution of relative distances should be consistent with predictions of coalescent theory. In particular, as evolutionary distances increase, there will typically be multiple genes that have the same T value from the gk gene in species si. Moreover, the probability that two evolutionary events occur at the same instant in time is 0.
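As a minimal sketch of the ranking idea, the following Python snippet groups species into levels of decreasing bit score for one gene gk and one reference species si. The function name `rank_by_bit_score` and the score values are hypothetical, and placing tied scores at the same level is an assumption rather than something prescribed in the text.

```python
from itertools import groupby


def rank_by_bit_score(scores):
    """Group species into levels of decreasing bit score.

    scores maps a species name to B(gk; si, sj) for a fixed gene gk and
    a fixed reference species si. Species with equal scores share a level.
    """
    # Sort by decreasing score so that equal scores end up adjacent.
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    return [sorted(name for name, _ in group)
            for _, group in groupby(ordered, key=lambda item: item[1])]


# Hypothetical scores; 0 stands for an absent gene, as described above.
example_scores = {"s2": 910.0, "s3": 640.5, "s4": 640.5, "s5": 0.0}
levels = rank_by_bit_score(example_scores)
```

Here `levels` lists the species closest to si first, with `s3` and `s4` sharing the second level because their scores are equal.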
In the presence of horizontal evolutionary events, the patterns of rankings of the B and T values will be different for different genes, depending on which horizontal events each gene is involved in. Two genes that are involved in exactly the same horizontal events will have identical patterns in their T values and similar patterns in their B values.
If we use the rankings of the B values as an approximate substitute for the rankings of the unknown T values, then the rankings can be compared and clustered to identify groups of genes that participated in the same horizontal events. Fix a gene gk. Then there is a gene gk tree that represents the true evolutionary history of the gk’s in all the species. It is rooted at the most recent common ancestor of the m species. Our first goal is to define a computational problem to achieve this clustering and to design an efficient algorithm to solve the problem. In the following, proofs of results are elaborated. Note that Belal and Heath [3] is an earlier five-page announcement of these results.
Definitions
For a rooted (directed) tree T, let R(T) be the root of T, let I(T) be the set of internal nodes of T, and let L(T) be the set of leaves of T.
Let S be a finite set of species. An S-tree T = (V, E) is a rooted tree in which every internal node has outdegree at least two, together with a bijective labeling function λ: L(T) → S. In particular, every S-tree has precisely |S| leaves. Fig 2 illustrates an S-tree for the case n ≥ 2, where there is only one internal node, the root r = R(T). There are n leaves x1, x2, …, xn and λ(xi) = si. If every internal node of T has outdegree exactly two, then T is an evolutionary tree. Fig 3 illustrates an evolutionary tree on five species.
Let T = (V, E) be an S-tree. Let u ∈ V. The subtree rooted at u is T(u). The species set S(u) for u is the set of leaf labels in T(u).
Let T be an S-tree with an internal node x that has three or more children. A refinement step (on T at x) adds an internal node y to the tree T, where y is the parent of a proper subset of the children of x and y is a new child of x. An S-tree T′ is a refinement of T if T′ can be obtained by performing zero or more refinement steps on T. For example, in Fig 4, T2 is a refinement of T1 by a refinement step on T1 at r. The refinement step applied adds one internal node y, which is the parent of s1 and s2 in T2; y and s3 are the direct children of r in T2.
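A refinement step can be sketched in Python as follows. The dictionary representation (each internal node mapped to the list of its children) and the function name `refinement_step` are assumptions made for illustration. The sketch also requires the chosen subset to contain at least two children, so that the new node keeps outdegree at least two as the S-tree definition demands.

```python
def refinement_step(children, x, subset, new_node):
    """Return a copy of the tree with new_node inserted below x.

    children maps each internal node to the list of its children.
    subset must be a proper subset of the children of x with at least
    two elements, so that every internal node keeps outdegree >= 2.
    """
    kids = children[x]
    chosen = [c for c in kids if c in subset]
    if len(chosen) != len(set(subset)) or not 2 <= len(chosen) < len(kids):
        raise ValueError("subset must be a proper subset of size >= 2")
    if new_node in children or any(new_node in cs for cs in children.values()):
        raise ValueError("new_node already occurs in the tree")
    refined = {node: list(cs) for node, cs in children.items()}
    # x keeps the children outside the subset and gains new_node.
    refined[x] = [c for c in kids if c not in subset] + [new_node]
    refined[new_node] = chosen
    return refined


# T1: a root r with three leaf children (shape assumed from the text on Fig 4).
t1 = {"r": ["s1", "s2", "s3"]}
t2 = refinement_step(t1, "r", {"s1", "s2"}, "y")
```

In this example `t2` has y as the parent of s1 and s2, with y and s3 as the children of r, matching the description of T2 above.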
Let X = {X1, X2} and Y = {Y1, Y2} be two partitions of S. Call such partitions with two elements each 2-partitions. Note that the deletion of an edge from an S-tree induces two connected subtrees and, hence, a 2-partition of S. X and Y are contradicting partitions if there exist four species s1, s2, s3, s4 such that s1, s2 ∈ X1, s3, s4 ∈ X2, s1, s3 ∈ Y1, and s2, s4 ∈ Y2. Two S-trees T1 and T2 are contradictory if there exists an edge in T1 and an edge in T2 such that their induced 2-partitions are contradicting.
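The definition of contradicting partitions amounts to requiring that all four intersections X1 ∩ Y1, X1 ∩ Y2, X2 ∩ Y1, and X2 ∩ Y2 are nonempty. The following Python sketch uses that observation to test two trees for contradiction. The tree representation (internal node mapped to its children, leaves named by their species) and all function names are assumptions made for illustration.

```python
def contradicting(x, y):
    """x and y are 2-partitions, each given as a pair of sets of species."""
    (x1, x2), (y1, y2) = x, y
    # One witness species is needed in each of the four intersections.
    return all(a & b for a in (x1, x2) for b in (y1, y2))


def leaves_below(children, node):
    """Species at the leaves of the subtree rooted at node."""
    if node not in children:
        return {node}
    found = set()
    for child in children[node]:
        found |= leaves_below(children, child)
    return found


def edge_partitions(children, root):
    """The 2-partition induced by deleting each edge of the tree."""
    all_leaves = leaves_below(children, root)
    parts = []
    for kids in children.values():
        for child in kids:
            below = leaves_below(children, child)
            parts.append((below, all_leaves - below))
    return parts


def contradictory_trees(tree1, root1, tree2, root2):
    return any(contradicting(p, q)
               for p in edge_partitions(tree1, root1)
               for q in edge_partitions(tree2, root2))


# Two hypothetical trees on four species that group the species differently.
t1 = {"r": ["a", "b"], "a": ["s1", "s2"], "b": ["s3", "s4"]}
t2 = {"r": ["c", "d"], "c": ["s1", "s3"], "d": ["s2", "s4"]}
```

For these two trees, the edge above a in `t1` and the edge above c in `t2` induce contradicting 2-partitions, so `contradictory_trees(t1, "r", t2, "r")` is true, while a tree is never contradictory with itself.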
Let u, v ∈ L(T), for some S-tree T. The most recent common ancestor MRCA(u, v) of u and v is the node w that is a common ancestor of u and v such that T(w) is the smallest rooted subtree in T containing both u and v.
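As an illustration, the species set S(u) and MRCA(u, v) can be computed from parent pointers. This is a minimal sketch; the `parent` dictionary (each non-root node mapped to its parent, leaves named by their species) and the small example tree are hypothetical.

```python
def root_path(parent, node):
    """Nodes from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def mrca(parent, u, v):
    """First node on the root path of v that is also an ancestor of u."""
    on_u_path = set(root_path(parent, u))
    for node in root_path(parent, v):
        if node in on_u_path:
            return node
    raise ValueError("nodes are not in the same tree")


def species_set(parent, u):
    """S(u): the leaf labels in the subtree rooted at u."""
    internal = set(parent.values())
    return {leaf for leaf in parent
            if leaf not in internal and u in root_path(parent, leaf)}


# Hypothetical tree: root r with children x and s3; x has children s1 and s2.
parent = {"s1": "x", "s2": "x", "x": "r", "s3": "r"}
```

In this tree, `mrca(parent, "s1", "s2")` is x, `mrca(parent, "s1", "s3")` is r, and `species_set(parent, "x")` is {s1, s2}.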
A partial order is a binary relation ≤ over a set S that is reflexive, antisymmetric, and transitive, i.e., for all a, b, c ∈ S, we have that
- a ≤ a (reflexivity);
- if a ≤ b and b ≤ a then a = b (antisymmetry); and
- if a ≤ b and b ≤ c then a ≤ c (transitivity).
A set with a partial order is a partially ordered set or a poset. If (S, ≤) is a poset and a, b ∈ S, then a < b if and only if a ≤ b and a ≠ b. Note that a < b is transitive. The directed graph G = (S, <) is clearly a directed acyclic graph (DAG). The transitive reduction of G is the DAG on node set S that contains those edges (a, b) such that there is no c ∈ S satisfying a < c < b. A Hasse diagram of < (which is also a Hasse diagram of ≤) is a drawing of the transitive reduction of (S, <) such that no arrows are included. An example of a Hasse diagram is shown in Fig 5. The diagram shown corresponds to the following poset:
Let si ∈ S be a species. An si-poset P = (S, ≤i) is a poset with the property that, for every sj ∈ S, we have si ≤i sj. In other words, si is the unique minimum element of P.
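The definitions above can be checked mechanically. The sketch below stores a poset by its strict pairs (a, b) with a < b, which is the convention used for the example posets later in this section. The function names are hypothetical, and the pair set is assumed to be transitively closed.

```python
from itertools import product


def is_strict_partial_order(elements, less):
    """less is a set of pairs (a, b) meaning a < b."""
    irreflexive = all((a, a) not in less for a in elements)
    asymmetric = all((b, a) not in less for (a, b) in less)
    transitive = all((a, d) in less
                     for (a, b), (c, d) in product(less, repeat=2) if b == c)
    return irreflexive and asymmetric and transitive


def is_si_poset(elements, less, si):
    """Check that si lies strictly below every other element."""
    return (is_strict_partial_order(elements, less)
            and all((si, s) in less for s in elements if s != si))


def hasse_edges(less):
    """Transitive reduction: keep (a, b) when no c satisfies a < c < b."""
    middle = {c for pair in less for c in pair}
    return {(a, b) for (a, b) in less
            if not any((a, c) in less and (c, b) in less for c in middle)}


species = {"s1", "s2", "s3"}
p1 = {("s1", "s2"), ("s1", "s3"), ("s2", "s3")}   # the chain s1 < s2 < s3
```

Here `p1` is an s1-poset but not an s2-poset, and its Hasse diagram has only the edges (s1, s2) and (s2, s3).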
The si-poset Pi = (S, ≤i) is compatible with S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = si, λ(y) = sj, and λ(z) = sk and such that sj ≤i sk, the shortest path from either of x or y to z passes through MRCA(x, y). Given the tree shown in Fig 6, Fig 7 shows an example of a poset that is compatible with the tree, while Fig 8 shows an incompatible poset: the poset indicates that s3 is the closest species to s1, while, in the tree, the closest species to s1 is s2.
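The compatibility condition can be sketched in Python as follows. The sketch reads "from either of x or y" as requiring the path from x to z and the path from y to z to both pass through MRCA(x, y); this reading is an assumption, chosen because the weaker reading (at least one of the two paths) holds in every tree. The tree, the two posets, and all names are hypothetical; they only mirror the situation described for Figs 6 to 8.

```python
def root_path(parent, node):
    """Nodes from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def mrca(parent, u, v):
    on_u_path = set(root_path(parent, u))
    return next(n for n in root_path(parent, v) if n in on_u_path)


def passes_through(parent, a, z, w):
    """True when the tree path from leaf a to leaf z visits node w."""
    top = mrca(parent, a, z)
    up, down = root_path(parent, a), root_path(parent, z)
    path = up[:up.index(top) + 1] + down[:down.index(top)]
    return w in path


def compatible(parent, si, less):
    """less holds the strict pairs (sj, sk) of the si-poset, sj <i sk."""
    for sj, sk in less:
        if len({si, sj, sk}) < 3:
            continue  # the condition is stated for distinct triples only
        w = mrca(parent, si, sj)
        if not (passes_through(parent, si, sk, w)
                and passes_through(parent, sj, sk, w)):
            return False
    return True


# Hypothetical tree in which s2 is the closest species to s1.
tree = {"s1": "x", "s2": "x", "x": "r", "s3": "r"}
good = {("s1", "s2"), ("s1", "s3"), ("s2", "s3")}   # s2 closest to s1
bad = {("s1", "s3"), ("s1", "s2"), ("s3", "s2")}    # claims s3 is closest
```

Under this reading, `good` is compatible with the tree, while `bad` is not: the path from s1 to s2 does not pass through MRCA(s1, s3), which is the root.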
Let 𝒫 be a set of posets. 𝒫 is consistent if, for all posets Pi, Pj ∈ 𝒫, whenever sj ≤i sk, then si ≤j sk. For example, let P1 = {(s1, s2), (s1, s3), (s2, s3)}, P2 = {(s2, s1), (s2, s3), (s1, s3)}, and P3 = {(s3, s1), (s3, s2)}. Then, {P1, P2, P3} is consistent. However, if P4 = {(s3, s1), (s3, s2), (s1, s2)}, then {P1, P2, P4} is inconsistent, since P1 and P2 indicate that s1 and s2 are closer to each other than to s3, while P4 indicates that s1 is closer to s3 than to s2.
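The consistency condition can be checked directly on the example above. In this sketch each poset is stored by its strict pairs and indexed by the species that is its minimum element; the dictionary layout and the function name `consistent` are assumptions made for illustration.

```python
def consistent(posets):
    """posets maps each species si to the strict pairs (sj, sk) of its si-poset.

    Checks that sj <i sk implies si <j sk for every pair of posets.
    """
    for si, less_i in posets.items():
        for sj, sk in less_i:
            if sj == si:
                continue  # si <=i sk always holds in an si-poset
            if (si, sk) not in posets[sj]:
                return False
    return True


p1 = {("s1", "s2"), ("s1", "s3"), ("s2", "s3")}
p2 = {("s2", "s1"), ("s2", "s3"), ("s1", "s3")}
p3 = {("s3", "s1"), ("s3", "s2")}
p4 = {("s3", "s1"), ("s3", "s2"), ("s1", "s2")}

first = consistent({"s1": p1, "s2": p2, "s3": p3})
second = consistent({"s1": p1, "s2": p2, "s3": p4})
```

The first set is reported as consistent. The second is not, because P4 contains (s1, s2) while P1 does not contain (s3, s2).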
Related literature
Among the methods for detecting HGT addressed by many researchers is conditioned reconstruction. Conditioned reconstruction (CR) is a phylogenetic technique that utilizes gene absence/presence data to reconstruct phylogenetic relationships [4]. CR [2] compares one genomic sequence to another and, according to whether a gene ortholog is present or absent, supplies a P or A character state. The probability of a state transition is analyzed using Markov models. Given two genomes, X and Y, four patterns are possible: PP, PA, AP, and AA. Many questions were raised on how to count the pattern AA: how can one identify genes that are missing from both genomes X and Y? To solve this problem, CR uses a conditioning genome as a reference for which genes are to be considered. A gene has to be present in both the conditioning genome and the genome being coded in order to be considered present. An absent gene is present in the conditioning genome and absent from the genome under study. The conditioning genome has a big effect on the results obtained, as it represents the full set of orthologous genes coded during matrix development. In our approach, we avoid building our results on a conditioning genome, or any other input that would bias our results. However, the approach we present is similar to CR in the problem addressed and the use of information about all genes in the genomes. Bailey et al. [4] argue that CR cannot be used to distinguish between HGT and genome fusion. They suggest some refinements that make CR perform better. Bapteste and Walsh [5] question the ring of life hypothesis of Lake and Rivera [2]. They claim that it is not possible to reconstruct the ring of life in the presence of HGT. Bapteste and Walsh [5] see the conditioning genome (CG) as more a tool than a biological concept; this genome can exist anywhere in the tree of life and cannot be used in evolutionary reconstruction. See Belal [6] for additional discussion of CR. Related methods are found in [7–11].
Other methods for detecting horizontal gene transfer are proposed by multiple researchers. Podell and Gaasterland [12] present the DarkHorse method for detecting HGT. They defined the LPI, lineage probability index, to measure HGT and species closeness. This measure relies on lineage key terms. The higher the LPI score for an organism, the closer it is to the query (reference) genome. Groups of closely related organisms have similar LPI scores. Xiang et al. [13] apply DarkHorse in analyzing the evolutionary relationship between Microsporidia and Fungi.
Moreover, phylogenetic reconstruction research has contributed to solving many evolutionary problems. Nakhleh et al. [14] present a method for reconstructing phylogenetic networks using maximum parsimony. Their method is then studied and applied in [15]. Other network-based methods are found in [16–20]. For example, Cardona, Pons, and Rosselló [17] investigate LGT (lateral gene transfer) networks that combine a principal rooted subtree with a set of additional edges representing LGT. They present an efficient algorithm for constructing an LGT network from a set of phylogenetic trees.
Snir and Trifonov [21] present a method for detecting HGT. Their algorithm takes two genomes with their lengths and calculates the expectancy of each identical region’s length to obtain a measure of confidence as to exceptional similarity. Abby et al. [22] present a program called Prunier for the detection of HGT. The program searches for a maximum statistical agreement forest between a gene tree and a reference tree. Adato et al. [23] provide an algorithm for detecting HGT based on gene synteny and the concept of constant relative mutability. Scornavacca et al. [24] provide an algorithm for detecting HGT in some alternative cases. Sanchez-Soto et al. [25] introduce the algorithm ShadowCaster for HGT detection in prokaryotes.
Some researchers combine HGT with other evolutionary phenomena. Bansal et al. [26] develop the tool RANGER-DTL to detect gene duplication, transfer, and loss. Van Iersel et al. [27] develop a polynomial-time algorithm for some cases of HGT detection. Hasic and Tannier [28] present NP-hard cases for HGT detection.
In addition to the above, there are a number of theoretical approaches to problems related to HGT: [28–31]. These are typically about mathematically-oriented methodologies for reconstructing a species tree or reconciling gene and species trees.
Also worth discussing is reticulate evolution. According to [32], there are numerous reticulations among related species, especially in insects, vertebrates, microbes, and plants. In [33], extensions of Wayne Maddison’s approach are presented for reconstructing reticulate evolution that results from horizontal transfer or hybrid speciation. Two polynomial time algorithms are presented and outperform both NeighborNet and Maddison’s method. Moreover, [34] gives a review of the mathematical techniques used to construct phylogenies and reticulate evolution. Different methods are discussed, among which are distance-based, maximum parsimony, and maximum likelihood methods. In [35], the problem of approximating a dissimilarity matrix using a reticulogram is discussed, where the reticulogram is obtained by adding edges to an additive tree, which improves the approximation of the dissimilarity matrix. As stated in [36], horizontal gene transfer (HGT) is one of the most important events in evolution; the authors describe a new polynomial-time algorithm to infer HGT events. The algorithm uses least squares (LS), Robinson and Foulds (RF) distance, quartet distance (QD), and bipartition dissimilarity (BD). The results show that bipartition dissimilarity gives the best results.
Also, in [37] a novel heuristic technique for HGT detection was employed and tested on both simulated and real data. The technique was found to provide a greater sensitivity than other HGT techniques. The proposed technique also considers the lengths of the genes being transferred.
In [38] a number of operons have been identified experimentally by sequence similarity analysis and then by phylogenetic analysis. Many occurrences of horizontal transfer of entire operons were detected.
Mosaic genes have been discussed in [39]. A mosaic gene is composed of alternating sequence polymorphisms either belonging to the host original allele or derived from the integrated donor DNA. In [39], the authors propose a method for detecting partial HGT events and related intragenic recombination giving rise to the formation of mosaic genes.
Constructing an S-tree from a set of posets
Recall the definition of compatible from the Definitions section. The si-poset Pi = (S, ≤i) is compatible with S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = si, λ(y) = sj, and λ(z) = sk and such that sj ≤i sk, the shortest path from either of x or y to z passes through MRCA(x, y).
The problem of constructing a tree is defined as follows:
- Compatible Tree Construction
- INSTANCE: Set S = {s1, s2, …, sn} of n taxa; for 1 ≤ i ≤ n, an si-poset Pi = (S, ≤i).
- SOLUTION: An S-tree T compatible with P1, P2, …, Pn, if one exists.
Theorem 1. Let 𝒫 be a set of posets that is compatible with an S-tree T. Let T′ be a refinement of T. Then 𝒫 is compatible with T′.
Proof. The proof is by induction on the number of refinement steps, k, to obtain T′ from T. For the base case of the induction, assume that k = 0. Then T′ = T, and, therefore, 𝒫 is clearly compatible with T′. Now assume that k ≥ 1 and that the result holds for k − 1 refinement steps. Then there exists an S-tree T′′ such that T′′ is obtained by k − 1 refinement steps from T and T′ is obtained from T′′ in one refinement step. Let u in T′′ have children v1, v2, …, vp such that in T′ there is a new node w that is a child of u with children v1, v2, …, vq, where u retains children vq+1, …, vp in T′. Note that q ≥ 2 and p − q ≥ 1. Therefore, for 𝒫 to be compatible with T′, the compatibility condition must hold, and that is:
- For all distinct triples x, y, z ∈ L(T) such that λ(x) = si, λ(y) = sj, and λ(z) = sk and such that sj ≤i sk, there is a shortest path from either of x or y to z passing through MRCA(x, y).
By applying the compatibility condition to T′′, the cases for x, y, and z are as follows:
- x ∈ v1, v2, …, vp or y ∈ v1, v2, …, vp. Since sj ≤i sk, there exists an MRCA for x and y. Let MRCA(x, y) be q. Therefore, the shortest path from either of x or y to z passes through q.
- x, y ∈ v1, v2, …, vp. Therefore, MRCA (x, y) is u, and the shortest path from either of x or y to z passes through u.
- x, y ∉ v1, v2, …, vp. Since sj ≤i sk, there exists an MRCA for x and y such that the shortest path from either of x or y to z passes through MRCA(x, y).
Similarly, by applying the compatibility condition to T′, the cases for x, y, and z are as follows:
- x, y ∈ v1, v2, …, vq. Therefore, MRCA (x, y) is w, and the shortest path from either of x or y to z passes through w.
- x, y ∈ vq + 1, …, vp. Therefore, MRCA (x, y) is u, and the shortest path from either of x or y to z passes through u.
- x ∈ v1, v2, …, vq and y ∈ vq + 1, …, vp. Therefore, MRCA (x, y) is u and the shortest path from either of x or y to z passes through u.
- y ∈ v1, v2, …, vq and x ∈ vq + 1, …, vp. Therefore, MRCA (x, y) is u and the shortest path from either of x or y to z passes through u.
- x ∈ v1, v2, …, vp or y ∈ v1, v2, …, vp. Since sj ≤i sk, there exists an MRCA for x and y. Let MRCA(x, y) be q. Therefore, the shortest path from either of x or y to z passes through q.
- x, y ∉ v1, v2, …, vp. Since sj ≤i sk, there exists an MRCA for x and y such that the shortest path from either of x or y to z passes through MRCA(x, y).
Therefore, if the compatibility condition holds for T′′, and T′ is obtained using one refinement step from T′′, then the compatibility condition also holds for T′.
By induction, 𝒫 is compatible with T′, as required.
Now we present a data structure that the algorithm uses to identify siblings. For the set of posets 𝒫, a matrix A of size n × n is defined. For i ≠ j, we define A(i, j) = |{sx ∈ S ∣ sj <i sx}|. In other words, for i ≠ j, A(i, j) is the number of species sx such that sj is strictly less than sx in the poset (S, ≤i).
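A minimal sketch of building the matrix A from a set of posets is given below. It assumes that each poset is stored as a transitively closed set of strict pairs, so that counting the pairs whose first component is sj gives the number of species above sj. Leaving the diagonal at 0 is also an assumption, since the definition only covers i ≠ j. The posets are the consistent example P1, P2, P3 from the Definitions section.

```python
def build_matrix(species, posets):
    """A[i][j] = number of species sx with species[j] <i sx, for i != j.

    posets maps each species si to the strict pairs of its si-poset.
    """
    n = len(species)
    A = [[0] * n for _ in range(n)]   # diagonal left at 0 (assumption)
    for i, si in enumerate(species):
        for j, sj in enumerate(species):
            if i != j:
                # Count the pairs (sj, sx) in the si-poset.
                A[i][j] = sum(1 for (a, _) in posets[si] if a == sj)
    return A


species = ["s1", "s2", "s3"]
posets = {
    "s1": {("s1", "s2"), ("s1", "s3"), ("s2", "s3")},
    "s2": {("s2", "s1"), ("s2", "s3"), ("s1", "s3")},
    "s3": {("s3", "s1"), ("s3", "s2")},
}
A = build_matrix(species, posets)
```

For this example the only nonzero entries are A(1, 2) = A(2, 1) = 1, reflecting that s3 lies above s2 in P1 and above s1 in P2.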
Theorem 2. Let 𝒫 be a set of posets, and let A be the matrix representing 𝒫. If 𝒫 is consistent, then A is symmetric.
Proof. Let 𝒫 be a set of posets. 𝒫 is consistent if, for all posets Pi, Pj ∈ 𝒫, whenever sj ≤i sk, then si ≤j sk. Let 1 ≤ i < j ≤ n. By the consistency condition, {sx ∣ sj <i sx} = {sx ∣ si <j sx}. Therefore, A(i, j) = A(j, i), and A is symmetric.
This A matrix represents an undirected graph in which siblings are indicated by cliques: for a species si, all other species connected to si by edges with equal labels are siblings of si. Higher values indicate siblings at lower levels in the tree; in other words, the maximum value indicates leaf siblings. Note that if the posets have missing or incorrect information, the algorithm will not be able to construct a tree for the gene corresponding to that set of posets. The following example illustrates the defined data structures. Consider the set of posets 𝒫, where 𝒫 is given as follows: The matrix A corresponding to 𝒫 is shown in Table 1.
And the graph G that is represented by the matrix A given in Table 1 is shown in Fig 9, where s1, s2, and s3 are siblings, and their parent and s4 are both children of the root.
The following example illustrates the data structures used in tree construction. The matrix shown in Table 2 is constructed for the posets in Fig 10.
The graph in Fig 11 shows the cliques that represent siblings indicated by matrix A in Table 2.
The first row of matrix A indicates that s2 is a sibling of s1. The maximum value in the s1 row is 3, which is in the s2 column, and it is the only column with this value. This is also clear in the graph shown in Fig 11. Since the maximum value found in the s1 row is 3, and it is only under the s2 column, therefore, s2 is the only sibling of s1. Similarly, s4 and s5 are also siblings.
The algorithm starts by inferring siblings, which it does by detecting cliques in the graph. For each species, the algorithm scans the row corresponding to that species and detects which species are connected by edges with equal labels. The detected species are all siblings. After each set of siblings is detected comes the updating step, in which the rows and columns of the siblings are merged. This procedure is repeated until only one node remains, which is the root.
After scanning the s1 row, the matrix A is reduced as shown in Table 3.
Similarly, the matrix A is reduced after detecting the siblings s4 and s5, as shown in Table 4.
This procedure is repeated, but this time the highest integer is 2; therefore, s3 is a sibling of s12, the parent of s1 and s2. The new matrix is shown in Table 5.
The final step creates one root for the remaining species because all the values are 0, hence, all the remaining species are at the same level. The tree reconstructed from the posets in Fig 10 is shown in Fig 12.
Another example to further illustrate the algorithm uses the set of posets in Fig 13.
The matrix in Table 6 is constructed for the set of posets in Fig 13.
The largest integer is 3, and it indicates that s1, s2, and s3 are siblings, as well as s4, s5, and s6.
The matrix then becomes as shown in Table 7.
Therefore, one root is created for the remaining two nodes to construct the tree in Fig 14.
The following example illustrates how the algorithm constructs an S-tree from a set of posets 𝒫.
Consider a set of species, S = {s1, s2, s3, s4, s5}, with the set of posets in Fig 15.
The corresponding A matrix is shown in Table 8.
Therefore, the maximum is 3, with the siblings s1 and s2, as well as s3 and s4.
And, the matrix A becomes as shown in Table 9.
Now, s5 is a sibling of both s12 and s34, giving one root for the three nodes. The constructed tree is shown in Fig 16.
Fig 17 shows the algorithm for reconstructing a tree from a set of posets 𝒫. The algorithm validates the matrix A by testing that A[i, j] = A[j, i], for all i and j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n. The algorithm also uses a subroutine to find cliques with equal edge labels. The subroutine scans the matrix A to find a clique with maximum edge labels. The subroutine AddSiblings shows the steps for adding the vertices that belong to a certain clique as siblings in the tree T. The subroutine also reduces the graph by merging the rows and columns in the matrix A.
Fig 18 shows the subroutine for validating the matrix A. And, Fig 19 shows the subroutine that finds the maximum value stored in the matrix A, where Fig 20 is the subroutine that finds the clique with edge labels equal to the maximum value. The subroutine in Fig 21 adds the nodes in the clique found as siblings in the tree constructed.
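The procedure described above can be sketched in Python as follows. This is not the pseudocode of Figs 17 to 21, which is not reproduced here, but a minimal reconstruction from the prose: validate symmetry, repeatedly find the maximum label, collect the clique with that label, and merge its rows and columns into a new parent. The nested-dictionary matrix, the parent naming scheme, and the example values are assumptions; the values were chosen to match the walkthrough of the five-species example (s1 and s2 siblings with label 3, s4 and s5 siblings with label 3, s3 joining the parent of s1 and s2 with label 2).

```python
def construct_tree(species, A):
    """Build a rooted tree from a symmetric matrix A given as nested dicts.

    Returns (root, children), where children maps each internal node to
    the list of its children. Raises ValueError on invalid input.
    """
    nodes = list(species)
    if any(A[a][b] != A[b][a] for a in nodes for b in nodes if a != b):
        raise ValueError("matrix A is not symmetric")
    A = {a: dict(A[a]) for a in nodes}   # work on a copy
    children = {}
    while len(nodes) > 1:
        # Find the largest off-diagonal label among the remaining nodes.
        best, u = max(
            ((A[a][b], a) for a in nodes for b in nodes if a != b),
            key=lambda pair: pair[0],
        )
        # Collect u together with every node joined to u by that label.
        clique = [u] + [v for v in nodes if v != u and A[u][v] == best]
        if any(A[a][b] != best for a in clique for b in clique if a != b):
            raise ValueError("siblings do not form a clique with equal labels")
        # Add the siblings under a new parent and merge rows and columns.
        parent = "(" + ",".join(clique) + ")"
        children[parent] = clique
        rest = [v for v in nodes if v not in clique]
        A[parent] = {v: A[u][v] for v in rest}
        for v in rest:
            A[v][parent] = A[u][v]
        nodes = rest + [parent]
    return nodes[0], children


labels = ["s1", "s2", "s3", "s4", "s5"]
rows = [[0, 3, 2, 0, 0],
        [3, 0, 2, 0, 0],
        [2, 2, 0, 0, 0],
        [0, 0, 0, 0, 3],
        [0, 0, 0, 3, 0]]
A = {a: dict(zip(labels, row)) for a, row in zip(labels, rows)}
root, children = construct_tree(labels, A)
```

On this input the sketch first joins s1 with s2, then s4 with s5, then s3 with the parent of s1 and s2, and finally creates a root over the two remaining nodes, since all remaining labels are 0. Each pass of the loop scans the matrix once, so the sketch stays within the cubic bound discussed next.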
Theorem 3. The algorithm ConstructTree has O(n³) time complexity.
Proof. Lines 2–6 in the algorithm ConstructTree contain two nested loops, each of which repeats n times. The statement in line 6, which is repeated in the nested loops, takes O(n) time, because the poset Pi contains at most n ordered pairs with x = sj. Therefore, the total time for these three nested levels is O(n³). Line 8 scans the matrix A in O(n²) time. The while loop on line 10 repeats at most n times; on line 11, FindMax is O(n²); on line 12, FindClique is O(n²); and AddSiblings on line 13 is O(n). Therefore, the while loop takes O(n³) time, and the complexity of the algorithm is O(n³).
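The count in the proof can be written out as a single sum. The following is only a restatement of the bounds quoted above, with the line numbers referring to the algorithm in Fig 17:

```latex
\underbrace{n \cdot n \cdot O(n)}_{\text{lines 2--6}}
\;+\; \underbrace{O(n^{2})}_{\text{line 8}}
\;+\; \underbrace{n \cdot \bigl( O(n^{2}) + O(n^{2}) + O(n) \bigr)}_{\text{while loop, lines 10--13}}
\;=\; O(n^{3}).
```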
Theorem 4. The algorithm ConstructTree solves the Compatible Tree Construction problem.
Proof. To prove the theorem, we use induction on the number of species. Let the number of species be n. For n = 1 and n = 2, there is no maximum value in the matrix A; hence, the tree is trivial. For n = 3, there are three possibilities for the third species s3: either s3 is a sibling of s1 and s2, a sibling of their parent, or a sibling of either one of them. The algorithm checks the values in the A matrix: if A(1, 3) = A(2, 3) = A(1, 2), then s3 is a sibling of s1 and s2; otherwise, s3 is a sibling of their parent. If s1 and s2 are not siblings, then the values in the A matrix will detect s3 as a sibling of either one of them, which is the third possibility. After detecting siblings, the matrix A is reduced by eliminating the siblings and replacing them by their parent. Therefore, for n species, the algorithm scans the matrix A, and at each step the siblings are eliminated and replaced by their parent. This reduces the matrix A until only one species remains, which is the root.
Generating a set of posets from a given S-tree
For each tree T, there exists a set of posets compatible with T. In this section, we show how given a tree T, the set of compatible posets can be generated.
A set of posets 𝒫 is compatible with an S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = si, λ(y) = sj, and λ(z) = sk and such that sj ≤i sk, the shortest path from either of x or y to z passes through MRCA(x, y). Therefore, the procedure of obtaining posets from a tree is straightforward. Given a tree T, it is clear which species are closer to each other than others, and hence, posets can be generated. By obtaining the path from each species (leaf node) to the root of the tree, and laying this path horizontally, we get the nodes sorted in order of closeness to this specific leaf node. Each node on the path represents a subtree, of which the leaves belonging to the species set represent one level of the poset.
An example to illustrate how posets are generated from a tree is shown in Fig 22. The tree on the right shows the path from s1 to the root, where each node on the path is a root to a subtree, and the leaves belonging to each subtree represent a level of the poset P1. The subtree with the root s1 has only one leaf and that is s1. The second level of the poset contains the leaves in the subtree with the root x, and that is only s2, then comes the last level, in the subtree with the root r, and this subtree contains the leaves s3 and s4. Therefore, the poset P1 is generated as follows. P1 = {(s1, s2), (s1, s3), (s1, s4), (s2, s3), (s2, s4)}.
For example, given the tree shown in Fig 12, we look at each species to generate the corresponding poset. Starting with s1, the poset P1 automatically contains the ordered pairs (s1, s2), (s1, s3), (s1, s4), and (s1, s5). It is clear from the tree that s2 is the closest sibling to s1; this adds the ordered pairs (s2, s3), (s2, s4), and (s2, s5) to the poset P1. Also, the ordered pairs (s3, s4) and (s3, s5) are added. In a similar manner, the posets P2, P3, P4, and P5 are generated as shown in Fig 10.
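The path-scanning idea described above can be sketched in Python as follows. This is not the GeneratePosets pseudocode of Fig 23, which is not reproduced here, but a minimal reconstruction from the prose: each leaf is assigned the position, on the path from si to the root, of the first node whose subtree contains it, and a species at a lower level is placed below every species at a higher level. The `parent` dictionary and the four-species tree are assumptions chosen to agree with the description of Fig 22.

```python
def generate_posets(parent):
    """Return {si: strict pairs (a, b) with a <i b} for every leaf si.

    parent maps each non-root node to its parent; leaves are named by species.
    """
    internal = set(parent.values())
    leaves = [v for v in parent if v not in internal]

    def root_path(node):
        path = [node]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    posets = {}
    for si in leaves:
        # Position of every node on the path from si to the root.
        position = {node: k for k, node in enumerate(root_path(si))}
        # Level of a leaf: first node on that path whose subtree contains it.
        level = {leaf: min(position[a] for a in root_path(leaf) if a in position)
                 for leaf in leaves}
        posets[si] = {(a, b) for a in leaves for b in leaves
                      if level[a] < level[b]}
    return posets


# Assumed shape: root r with children x, s3, s4; x has children s1 and s2.
tree = {"s1": "x", "s2": "x", "x": "r", "s3": "r", "s4": "r"}
posets = generate_posets(tree)
```

For this tree, `posets["s1"]` equals the poset P1 = {(s1, s2), (s1, s3), (s1, s4), (s2, s3), (s2, s4)} given in the example above.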
Theorem 5. The algorithm GeneratePosets shown in Fig 23 generates the set of posets that is compatible with a given tree T.
Proof. Using a proof by construction, we show that the algorithm GeneratePosets generates the set of posets compatible with a given tree T. From the definition of compatible in the Definitions section, we know that an si-poset Pi = (S, ≤i) is compatible with S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = si, λ(y) = sj, and λ(z) = sk and such that sj ≤i sk, the shortest path from either of x or y to z passes through MRCA(x, y). The algorithm GeneratePosets finds, for a species si, the path p from si to the root r. The nodes that come first on the path p are closer to si and, hence, come at a lower level in the poset. That follows from the definition of compatible, which indicates that if sj ≤i sk, then the shortest path from either of x or y to z passes through MRCA(x, y). Therefore, by scanning the path p, the set of posets can be constructed.
Theorem 6. The algorithm GeneratePosets has a time complexity of O(n³).
Proof. Let the number of species be n. The loop on line 2 iterates n times, and on line 3, finding the path from a certain species to the root is also linear in the number of species; this gives a complexity of O(n²). Then on line 7, the while loop is also linear in n, and on line 9, finding all leaves in a subtree is linear as well. This gives a total complexity of O(n³).
Relating posets to trees
The following theorems relate posets and trees to one another.
Theorem 7. Given a set of posets 𝒫, if there exists an S-tree T that is compatible with 𝒫, then T can be used to generate the same set of posets 𝒫.
Proof. Given a set of posets 𝒫, assume that 𝒫 is compatible with a tree T. Assume that T, in turn, generates a different set of posets 𝒫′. The set 𝒫′ can now be used to construct a tree T′ that is compatible with 𝒫′; T′ is expected to be equivalent to T. However, since 𝒫 and 𝒫′ are not equal, the two trees constructed are also not the same. Since T and T′ are different, T and T′ can yield contradictory 2-partitions. This means that T and T′ may be contradictory trees, and hence one of them cannot be used to give the same set of posets. Hence, there is a contradiction, and T cannot be used to generate a set of posets other than 𝒫.
Theorem 8. Let 𝒫1 and 𝒫2 be two sets of posets that are compatible with the two S-trees T1 and T2, respectively. Then T1 and T2 are contradictory if and only if there exist a poset P ∈ 𝒫1 and a poset P′ ∈ 𝒫2 such that P is inconsistent with P′.
Proof. First, we prove that if T1 and T2 are contradictory, then there exist a poset P ∈ 𝒫1 and a poset P′ ∈ 𝒫2 such that P is inconsistent with P′. Using a proof by contradiction, assume that T1 and T2 are contradictory and that there are no posets P ∈ 𝒫1 and P′ ∈ 𝒫2 such that P is inconsistent with P′. Since T1 and T2 are contradictory, there exist an edge in T1 and an edge in T2 that, when cut, induce contradictory 2-partitions. This means that there exist four species s1, s2, s3, and s4 such that s1 and s2 belong to the same partition in one tree but not in the other. Similarly, s3 and s4 belong to the same partition in one tree but not in the other. Since the set of posets 𝒫1 is compatible with T1 and the set of posets 𝒫2 is compatible with T2, and since T1 and T2 are contradictory, there exist a poset P ∈ 𝒫1 and a poset P′ ∈ 𝒫2 such that P is inconsistent with P′. This contradicts the assumption.
The second part of the proof shows that if there exist a poset P ∈ 𝒫1 and a poset P′ ∈ 𝒫2 such that P is inconsistent with P′, then T1 and T2 are contradictory. Using a proof by contradiction, assume that such posets P and P′ exist while T1 and T2 are non-contradictory. If P is inconsistent with P′, then P is inconsistent with the set of posets 𝒫2. Hence, the two sets of posets can create contradictory 2-partitions in their corresponding trees, and therefore the trees that are compatible with the two sets of posets cannot be non-contradictory. This contradicts the assumption. Therefore, the theorem follows.
Fig 24 shows an example to illustrate Theorem 8. The set of posets corresponding to the tree at the top consists of the following posets: And, the set of posets corresponding to the tree at the bottom consists of the following posets: The poset indicates that s2 is a sibling of s1, while the poset indicates that s3 is a sibling of s1. Therefore, the two posets are inconsistent.
Refinement of trees
We start with a basic result about refinement (Theorem 9).
Lemma 1. Let T be an S-tree. Let Q be the 2-partition set of T. Then Q is not contradictory with itself.
Proof. We show that every pair of 2-partitions in Q is non-contradictory. Consider an arbitrary pair of distinct edges of T. These two edges are the end edges of a unique path in T. Let u0, u1, …, uk − 1, uk be that path, so that the edges are (u0, u1) and (uk − 1, uk). These edges partition S into three sets: X, the set of species reachable from u0 without using (u0, u1); Y, the set of species reachable from uk without using (uk − 1, uk); and Z, the set of species reachable from u1, u2, …, uk − 1 without using (u0, u1) or (uk − 1, uk). The 2-partition corresponding to (u0, u1) is (X, Y ∪ Z), and the 2-partition corresponding to (uk − 1, uk) is (X ∪ Z, Y). Recall the definition of contradictory 2-partitions: Two 2-partitions X = (X1, X2) and Y = (Y1, Y2) are contradictory partitions if there exist four species s1, s2, s3, s4 such that s1, s2 ∈ X1, s3, s4 ∈ X2, s1, s3 ∈ Y1, and s2, s4 ∈ Y2. Apply this definition with X1 = X, X2 = Y ∪ Z, Y1 = X ∪ Z, and Y2 = Y. Let s1, s2, s3, s4 ∈ S. If s1, s2 ∈ X and s3, s4 ∈ Y ∪ Z, then s2 ∈ X ⊆ X ∪ Z, so s2 ∉ Y and the requirement s2, s4 ∈ Y cannot hold. Hence, the definition does not apply to the 2-partitions corresponding to (u0, u1) and (uk − 1, uk). Since the two edges were arbitrary, we conclude that Q is not contradictory with itself.
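The definition recalled in this proof can be checked mechanically. Because the four witnesses s1, s2, s3, s4 must lie in X1 ∩ Y1, X1 ∩ Y2, X2 ∩ Y1, and X2 ∩ Y2 respectively, two 2-partitions are contradictory exactly when all four of these intersections are non-empty. The Python below is a small sketch under that reading; the function name `contradictory` and the concrete species sets are illustrative assumptions, not taken from the paper's figures.

```python
def contradictory(p, q):
    """Return True if 2-partitions p = (X1, X2) and q = (Y1, Y2) contradict.

    A witness quadruple exists exactly when every side of p meets
    every side of q.
    """
    (x1, x2), (y1, y2) = p, q
    return all(a & b for a in (x1, x2) for b in (y1, y2))


# Hypothetical sets X, Y, Z playing the roles used in the proof of Lemma 1.
X, Y, Z = {"s1", "s2"}, {"s4", "s5"}, {"s3"}
first_edge = (X, Y | Z)    # 2-partition induced by the edge (u0, u1)
last_edge = (X | Z, Y)     # 2-partition induced by the edge (uk-1, uk)
same_tree = contradictory(first_edge, last_edge)

# A contradictory pair: {s1, s2 | s3, s4} against {s1, s3 | s2, s4}.
clash = contradictory(({"s1", "s2"}, {"s3", "s4"}),
                      ({"s1", "s3"}, {"s2", "s4"}))
```

Here `same_tree` is False because X and Y are disjoint, which is the situation Lemma 1 describes, while `clash` is True.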
Lemma 2. Let T1 be an S-tree, and let T2 be a refinement of T1. Let Q1 be the 2-partition set of T1, and let Q2 be the 2-partition set of T2. Then Q1 ⊆ Q2.
Proof. A refinement step adds one edge to T1 and one 2-partition. By induction on the number of refinement steps to go from T1 to T2, we obtain Q1 ⊆ Q2.
Theorem 9. If S-tree T2 can be obtained from S-tree T1 using a number of refinement steps, then T1 and T2 are non-contradictory.
Proof. Let T1 be an S-tree, and let T2 be a refinement of T1. Let Q1 be the set of 2-partitions of T1, and let Q2 be the set of 2-partitions of T2. By Lemma 2, Q1 ⊆ Q2. By Lemma 1, Q2 is not contradictory with itself. Then Q1 and Q2 are non-contradictory, since otherwise Q2 would be contradictory with itself. By definition, T1 and T2 are non-contradictory.
The posets given for each gene are used in the construction of one tree for each gene. These trees can contain contradictory information, as illustrated in Fig 24. To be able to identify HGT events, contradictory trees must be identified. This can be done by examining the number of ways leaves and the root in a tree can be partitioned. This is done by examining the cuts in edges that are not incident to leaf nodes. If two trees are contradictory, then there is evidence for HGT.
The minimum common refinement of two non-contradictory S-trees T1 and T2 is an S-tree T3 that is a common refinement of T1 and T2 such that any other common refinement of T1 and T2 is a refinement of T3.
Theorem 10. Let T1 and T2 be S-trees that are non-contradictory. Let Q1 and Q2 be their respective sets of 2-partitions. Then there exists a unique tree T3 that is their minimum common refinement. Furthermore, if Q3 is the set of 2-partitions of T3, then Q3 = Q1 ∪ Q2.
Proof. Define Q3 = Q1 ∪ Q2. Therefore, Q3 contains 2-partitions, where each 2-partition is obtained by cutting one edge of the tree T3. Hence, the set Q3 can be used to construct the tree T3 by checking each 2-partition, starting with the 2-partition of minimum cardinality. Siblings in T3 are inferred and the set is reduced. This process is repeated until the only 2-partitions remaining are those in which one side has cardinality one. Since Q3 = Q1 ∪ Q2, and since Q1 and Q2 each already correspond to a tree, all the 2-partitions in Q1 and Q2 already correspond to edges in a tree. Therefore, using the two sets, a more refined tree can be constructed. Since Q1 and Q2 both contain non-contradictory partitions, and since Q3 = Q1 ∪ Q2, Q3 also contains non-contradictory partitions, and hence there exists a tree T3 that corresponds to Q3. Using induction, we start with Q1 and T1 and add 2-partitions from Q2 to Q1. Let k be the number of 2-partitions added. If k = 1, then one 2-partition is added from Q2 to Q1. Since T1 and T2 are non-contradictory, a 2-partition that exists in Q2 but not in Q1 only adds an internal node and an edge to T1. Therefore, T1 becomes a more refined tree. Hence, adding k 2-partitions to T1 further refines T1 by adding more edges and internal nodes. Therefore, given Q3, a set of non-contradictory 2-partitions, a tree T3 can be constructed.
An algorithm for finding the minimum common refinement of T1 and T2 is shown in Fig 25. The algorithm finds all 2-partitions of T1 and T2. A 2-partition is found by cutting an edge of the tree and finding the leaves in the two subtrees induced. For example, cutting an edge (i, j), induces two subtrees, one with the root i and the other with the root j. Performing a depth-first search on the two subtrees finds the leaves in both subtrees. The species set for each subtree composes one of the 2-partitions; therefore, S(i) composes one partition, and S(j) composes the other.
The subroutine FindTwoPartitions shown in Fig 26 finds the 2-partition set for a given tree. When the 2-partition sets have been found for both trees, a union is performed on these sets to obtain the 2-partition set of the minimum common refinement tree.
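The edge-cutting and union steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pseudocode of Figs 25 and 26: the tree is a hypothetical `children` dictionary, each 2-partition is stored as an unordered pair of frozensets, and the two example trees are invented for the purpose of the sketch. The trees are assumed to be non-contradictory, as Theorem 10 requires.

```python
def leaves_under(children, node):
    """Leaf labels of the subtree rooted at node, found by depth-first search."""
    stack, found = [node], set()
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if kids:
            stack.extend(kids)
        else:
            found.add(v)
    return frozenset(found)


def find_two_partitions(children, root):
    """Cut every edge (parent, child) and record the induced 2-partition."""
    all_leaves = leaves_under(children, root)
    parts = set()
    for kids in children.values():
        for child in kids:
            below = leaves_under(children, child)
            rest = all_leaves - below
            if rest:  # skip a cut that leaves one side empty
                parts.add(frozenset([below, rest]))
    return parts


# Two hypothetical non-contradictory trees on five species.
t1 = {"r": ["u1", "s3", "s4", "s5"], "u1": ["s1", "s2"]}
t2 = {"r": ["s1", "s2", "u2", "s5"], "u2": ["s3", "s4"]}
q1 = find_two_partitions(t1, "r")
q2 = find_two_partitions(t2, "r")
q3 = q1 | q2  # 2-partition set of the common refinement (Theorem 10)
```

Each tree contributes five leaf-edge 2-partitions and one internal one, so `q3` holds seven 2-partitions and contains both `q1` and `q2`, in line with Lemma 2. One depth-first search per edge gives the O(mn) cost quoted for FindTwoPartitions.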
The algorithm that constructs a tree from its two-partition set is shown in Fig 27, followed by an illustrative example.
As an example of minimum common refinement, consider two S-trees T1 and T2. If both trees can be refined into a third S-tree T3 using a number of refinement steps, then the two trees are guaranteed to carry non-contradictory information. For example, the two S-trees T1 and T2 shown in Fig 28 are non-contradictory, and both are refined into T3. In this example, T3 is obtained using the minimum number of refinement steps; hence, T3 is the minimum common refinement of T1 and T2.
Fig 29 shows a further example, in which the tree T3 is the minimum common refinement of the two trees T1 and T2. Here T3 is obtained using one refinement step, which is performed on T1 by adding a parent for s3 and s4. The refined tree is the same tree as T2.
Fig 30 shows an example to illustrate the algorithm. The node s0 is added under the root to avoid having equivalent sets for a 2-partition, as these equivalent sets disappear when performing the union operation. In the example, T1 has eight edges, including the edge connecting s0 to the root. Hence, there are eight 2-partition sets for T1. Similarly, there are eight 2-partition sets for T2. The 2-partition sets for T1 are as follows:
The 2-partition sets for T2 are as follows:
The union of the two sets of partitions gives the following 2-partition sets, which are the sets that give the tree T3:
Let us consider the following two-partition set, Q, to illustrate the algorithm.
The algorithm starts by removing all sets with cardinality 1. So the set Q is reduced to the following:
The set with the minimum cardinality is in Q7. Therefore, the species s1 and s2 are detected as siblings, and they are replaced by a parent node in all sets. Therefore, Q is modified to the following: The next step finds the minimum cardinality in both Q8 and Q9, where u1 and s3 are siblings, and s4 and s5 are siblings. When Q8 and Q9 are removed from Q, it becomes empty and the root connects the subtrees constructed. Fig 31 shows the tree constructed from the two-partition set Q.
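The sibling-merging procedure in this walk-through can be sketched in Python. The code below is a hedged illustration, not the algorithm of Fig 27: it works on clusters, meaning the side of each 2-partition that does not contain s0, and it assumes the clusters are non-contradictory. The input clusters are chosen to match the siblings named above ({s1, s2}, then {u1, s3} and {s4, s5}); the function name `construct_tree` and the generated labels u1, u2, u3 and root are assumptions, so the numbering of internal nodes may differ from the figure.

```python
def construct_tree(species, clusters):
    """Build a child map by repeatedly merging the smallest cluster.

    Singleton clusters and the full species set are dropped first,
    mirroring the removal of sets with cardinality 1.
    """
    full = set(species)
    unique = {frozenset(c) for c in clusters}
    work = [set(c) for c in unique if 1 < len(c) < len(full)]
    children = {}
    counter = 0
    while work:
        # Pick the cluster of minimum cardinality (ties broken by name).
        work.sort(key=lambda c: (len(c), sorted(c)))
        smallest = work.pop(0)
        counter += 1
        parent = f"u{counter}"
        children[parent] = sorted(smallest)  # its members are siblings
        # Replace the siblings by their parent in every remaining cluster.
        for c in work:
            if smallest <= c:
                c -= smallest
                c.add(parent)
    # The root connects every subtree that has no parent yet.
    attached = {v for kids in children.values() for v in kids}
    children["root"] = sorted((full | set(children)) - attached)
    return children


species = ["s1", "s2", "s3", "s4", "s5"]
clusters = [{"s1", "s2"}, {"s1", "s2", "s3"}, {"s4", "s5"}]
tree = construct_tree(species, clusters)
```

With this input, s1 and s2 are merged first under u1, after which {u1, s3} and {s4, s5} are both of minimum cardinality and are merged in turn; the root then joins the two resulting subtrees.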
Theorem 11. The time complexity of MinCommonRefine is O(mn + n²).
Proof. Let n be the number of species. Let m be the number of edges in a tree T. The subroutine FindTwoPartitions on lines 3 and 4 is O(mn). Line 5 performs a union operation that is linear in the number of species. Line 6 constructs the tree from its two-partition set; ConstructTree2Partitions is O(n²). Therefore, the overall complexity of the algorithm MinCommonRefine is O(mn + n²).
Inferring HGT from posets
In this section, we show how posets and trees are used to infer HGT.
The problem is defined as follows:
- Inferring HGT From Posets
- INSTANCE: Set S = {s1, s2, …, sn} of n taxa; set G = {g1, g2, …, gm} of m genes; mn individual posets Pij = (S, <ij), for 1 ≤ i ≤ m and 1 ≤ j ≤ n.
- SOLUTION: Sets of genes corresponding to contradictory trees.
A number of steps are followed to be able to infer HGT events. First, trees are constructed from posets, then the different trees are compared, where contradictory trees are identified. Trees that are contradictory with the majority of trees suggest HGT. Other events such as gene duplication, gene loss, and incomplete lineage sorting can cause the incongruence of trees [40]. In the “Constructing an S-tree From a Set of Posets” Section, we show how trees are constructed from posets; in what follows, we show how contradictory trees are detected. The algorithm DetectContradiction shown in Fig 32 takes two trees as input and detects whether they are contradictory or not.
The process of identifying which genes are candidates for HGT proceeds as follows. Two S-trees T1 and T2 are tested for contradiction. If they are contradictory, they are placed in two different sets; if not, they are placed in one set. The process continues. If the next tree to be tested is T3, then it is compared with one tree from each set to determine to which set T3 belongs. It is expected that the majority of the trees will be non-contradictory, with some trees contradicting this majority, so there will be one set with a higher cardinality. Therefore, the other sets, which are the minority, are considered candidates for HGT.
The algorithm performs ideally when all the trees are completely refined (binary) trees, where the trees that are not identical are considered contradictory. In what follows, some real-life HGT examples are shown to support the argument that the genes involved in HGT are a minority and that there will always be a dominant tree. Ponting [41] indicates that only 0.5% of all human genes were copied into the genome from bacteria by HGT. Rujan and Martin [42] analyzed how many genes in Arabidopsis come from cyanobacteria. They used a sample of 3961 Arabidopsis nuclear protein-coding genes and compared those with the complete set of proteins from yeast and 17 reference prokaryotic genomes, including one cyanobacterium. In their analysis of 386 phylogenetic trees, they found that the number of genes horizontally transferred to Arabidopsis from cyanobacteria falls between approximately 400 genes and approximately 2200 genes. That is between 1.6% and 9.2% of nuclear genes.
The algorithm InferHGT is shown in Fig 33. The input to the algorithm is a set of trees T = {T1, T2, …, Tn}, where n is the number of trees and also the number of genes.
An example to illustrate the algorithm for inferring HGT is shown in Fig 34, where the trees T1, T2, and T3 are non-contradictory, while the tree T4 contradicts the three trees. In T4 there is a 2-partition that places the two species {s1, s3} in one partition, and {s2, s4} in another partition. This 2-partition contradicts the other three trees. Therefore, the gene corresponding to T4 is a candidate for HGT, where a horizontal transfer occurred between s1 and s3, or between s2 and s4. The network in Fig 1 shows the possible horizontal transfers. We note that the figure documents not only the existence of two possible horizontal transfers but also their directionality, which is especially valuable for any further investigation.
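The grouping procedure and the four-tree example can be sketched in Python. This is a minimal illustration, not the InferHGT pseudocode of Fig 33: each gene tree is represented only by its set of non-trivial 2-partitions, a new tree is compared with the first member of each existing group, and the largest group is treated as the majority. The gene names g1 to g4 and the helper names are hypothetical; the 2-partitions follow the example, with {s1, s2 | s3, s4} assumed for the three agreeing trees.

```python
def contradictory(p, q):
    """Two 2-partitions contradict when every side of p meets every side of q."""
    return all(a & b for a in p for b in q)


def trees_contradict(q1, q2):
    """Two trees contradict if some pair of their 2-partitions does."""
    return any(contradictory(p, q) for p in q1 for q in q2)


def infer_hgt(trees):
    """Group genes by non-contradiction and return the minority genes."""
    groups = []  # each group is a list of gene names
    for gene, parts in trees.items():
        for group in groups:
            # Compare against one representative tree of the group.
            if not trees_contradict(trees[group[0]], parts):
                group.append(gene)
                break
        else:
            groups.append([gene])
    if not groups:
        return []
    majority = max(groups, key=len)
    return [g for group in groups if group is not majority for g in group]


def split(a, b):
    return (frozenset(a), frozenset(b))


agree = {split({"s1", "s2"}, {"s3", "s4"})}
trees = {
    "g1": agree,
    "g2": agree,
    "g3": agree,
    "g4": {split({"s1", "s3"}, {"s2", "s4"})},  # the contradicting tree T4
}
candidates = infer_hgt(trees)
```

In this run the first three genes fall into one group and g4 forms a group of its own, so g4 is reported as the HGT candidate.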
Theorem 12. InferHGT has complexity max(O(n²), O(m²n)).
Proof. The two nested loops on lines 4 and 5 are O(n²), where n is the number of trees. The subroutine DetectContradiction on line 6 is O(m²n), where m is the number of edges in a tree.
Conclusions
We have introduced the theoretical problem of inferring HGT using partial orders, where there is one poset per gene per species. These posets have been used to construct S-trees for the genes corresponding to these posets, one tree for each gene. These trees are then compared, where the trees that contradict the majority of trees correspond to genes that are candidates for HGT. An algorithm for identifying contradiction is presented and then used in the algorithm to infer HGT. The concept of refinement is also presented in this paper, where it can also be used to identify contradiction among trees. An algorithm for finding a minimum common refinement for two trees is also presented. This algorithm finds the union of the 2-partition sets of two trees and then uses this set to construct a third tree, which is their minimum common refinement. Other points can be further studied in this problem. For example, more work could be done to find solutions to the problem of incorrect or missing data in the input posets. This will be incredibly challenging, but, from a practical viewpoint, it would be most valuable. Another point is to develop algorithms that use the refinement of trees for identifying contradictory trees, where two contradictory trees do not have a common refinement.
Acknowledgments
We thank Ruth Grene (biology consultant), Ayman Abdel Hamid, T.M. Murali, and João Setubal for valuable comments and Thomas Jones for implementing some of the algorithms.
References
- 1. Daubin V, Szoellosi GJ. Horizontal Gene Transfer and the History of Life. Cold Spring Harbor Perspectives in Biology. 2016;8. pmid:26801681
- 2. Lake JA, Rivera MC. Deriving the genomic tree of life in the presence of horizontal gene transfer: Conditioned reconstruction. Molecular Biology and Evolution. 2004;21(4):681–690. pmid:14739244
- 3. Belal NA, Heath LS. Inferring horizontal gene transfers from posets. In: 2nd International Conference on Computer Technology and Development, ICCTD 2010; 2010. p. 32–36.
- 4. Bailey CD, Fain MG, Houde P. On conditioned reconstruction, gene content data, and the recovery of fusion genomes. Molecular Phylogenetics and Evolution. 2006;39:263–270. pmid:16414287
- 5. Bapteste E, Walsh DA. Does the ring of life ring true? Trends in Microbiology. 2005;13(6):256–261. pmid:15936656
- 6. Belal NA. Two Problems in Computational Genomics [PhD Dissertation]. Virginia Tech. Blacksburg, Virginia; 2011.
- 7. Bansal MS, Alm EJ, Kellis M. Reconciliation Revisited: Handling Multiple Optima when Reconciling with Duplication, Transfer, and Loss. Journal of Computational Biology. 2013;20:738–754. pmid:24033262
- 8. Bansal MS, Wu YC, Alm EJ, Kellis M. Improved Gene Tree Error Correction in the Presence of Horizontal Gene Transfer. Bioinformatics. 2015;31:1211–1218. pmid:25481006
- 9. Chan Yb, Ranwez V, Scornavacca C. Exploring the Space of Gene/Species Reconciliations with Transfers. Journal of Mathematical Biology. 2015;71:1179–1209. pmid:25502987
- 10. Liu L, Wu S, Yu L. Coalescent Methods for Estimating Species Trees from Phylogenomic Data. Journal of Systematics and Evolution. 2015;53:380–390.
- 11. Nguyen M, Ekstrom A, Li X, Yin Y. HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus Genomes. Toxins. 2015;7:4035–4053. pmid:26473921
- 12. Podell S, Gaasterland T. DarkHorse: A method for genome-wide prediction of horizontal gene transfer. Genome Biology. 2007;8(2):R16.1–R16.18. pmid:17274820
- 13. Xiang H, Zhang R, De Koeyer D, Pan G, Li T, Liu T, et al. New Evidence on the Relationship Between Microsporidia and Fungi: A Genome-Wide Analysis by DarkHorse Software. Canadian Journal of Microbiology. 2014;60:557–568. pmid:25134955
- 14. Nakhleh L, Jin G, Zhao F, Mellor-Crummey J. Reconstructing phylogenetic networks using maximum parsimony. In: CSB’05: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference; 2005. p. 93–102.
- 15. Jin G, Nakhleh L, Snir S, Tuller T. Inferring phylogenetic networks by the maximum parsimony criterion: A case study. Molecular Biology and Evolution. 2007;24(1):324–337. pmid:17068107
- 16. Boc A, Diallo AB, Makarenkov V. T-REX: A Web Server for Inferring, Validating and Visualizing Phylogenetic Trees and Networks. Nucleic Acids Research. 2012;40(W1):W573–W579.
- 17. Cardona G, Pons JC, Rossello F. A Reconstruction Problem for a Class of Phylogenetic Networks with Lateral Gene Transfers. Algorithms for Molecular Biology. 2015;1–15. pmid:26691555
- 18. Layeghifard M, Peres-Neto PR, Makarenkov V. Inferring Explicit Weighted Consensus Networks to Represent Alternative Evolutionary Histories. BMC Evolutionary Biology. 2013;13. pmid:24359207
- 19. Nakhleh L. Evolutionary Phylogenetic Networks: Models and Issues. In: Heath LS, Ramakrishnan N, editors. Problem Solving Handbook in Computational Biology and Bioinformatics. New York: Springer; 2011. p. 125–158.
- 20. Pardi F, Scornavacca C. Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable. PLoS Computational Biology. 2015;11. pmid:25849429
- 21. Snir S, Trifonov E. A novel technique for detecting putative horizontal gene transfer in the sequence space. Journal of Computational Biology. 2010;17(11):1535–1548. pmid:20973741
- 22. Abby S, Tannier E, Gouy M, Daubin V. Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics. 2010;11(324):1–13. pmid:20550700
- 23. Adato O, Ninyo N, Gophna U, Snir S. Detecting Horizontal Gene Transfer between Closely Related Taxa. PLoS Computational Biology. 2015;11. pmid:26439115
- 24. Scornavacca C, Mayol JCP, Cardona G. Fast Algorithm for the Reconciliation of Gene Trees and LGT Networks. Journal of Theoretical Biology. 2017;418:129–137. pmid:28111320
- 25. Sanchez-Soto D, Aguero-Chapin G, Armijos-Jaramillo V, Perez-Castillo Y, Tejera E, Antunes A, et al. ShadowCaster: Compositional Methods Under the Shadow of Phylogenetic Models to Detect Horizontal Gene Transfers in Prokaryotes. Genes. 2020;11(7):12 pages. pmid:32645885
- 26. Bansal MS, Kellis M, Kordi M, Kundu S. RANGER-DTL 2.0: Rigorous Reconstruction of Gene-Family Evolution by Duplication, Transfer and Loss. Bioinformatics. 2018;34(18):3214–3216. pmid:29688310
- 27. van Iersel L, Janssen R, Jones M, Murakami Y, Zeh N. Polynomial-Time Algorithms for Phylogenetic Inference Problems Involving Duplication and Reticulation. IEEE-ACM Transactions on Computational Biology and Bioinformatics. 2020;17(1):14–26.
- 28. Hasic D, Tannier E. Gene Tree Reconciliation Including Transfers with Replacement Is NP-hard and FPT. Journal of Combinatorial Optimization. 2019;38(2):502–544.
- 29. Chan YB, Robin C. Reconciliation of a Gene Network and Species Tree. Journal of Theoretical Biology. 2019;472:54–66. pmid:30951730
- 30. Piovesan T, Kelk SM. A Simple Fixed Parameter Tractable Algorithm for Computing the Hybridization Number of Two (Not Necessarily Binary) Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2013;10(1):18–25. pmid:23702540
- 31. Schaller D, Lafond M, Stadler PF, Wieseke N, Hellmuth M. Indirect identification of horizontal gene transfer. Journal of Mathematical Biology. 2021;83:73 pages. pmid:34218334
- 32. Mallet J, Besansky N, Hahn MW. How reticulated are species? BioEssays. 2016;38(2):140–149. pmid:26709836
- 33. Nakhleh L, Warnow T, Linder CR. Reconstructing reticulate evolution in species: theory and practice. Journal of Computational Biology. 2005;12(6):796–811. pmid:16108717
- 34. Makarenkov V, Legendre P. Improving the additive tree representation of a dissimilarity matrix using reticulations. Data Analysis, Classification, and Related Methods, Springer, Berlin, Heidelberg. 2000; p. 35–40.
- 35. Makarenkov V, Kevorkov D, Legendre P. Phylogenetic network construction approaches. Applied Mycology and Biotechnology. 2006;6:61–97.
- 36. Boc A, Philippe H, Makarenkov V. Inferring and validating horizontal gene transfer events using bipartition dissimilarity. Systematic Biology. 2010;59(2):195–211. pmid:20525630
- 37. Sevillya G, Adato O, Snir S. Detecting horizontal gene transfer: a probabilistic approach. BMC Genomics. 2020;106(Suppl 1). pmid:32138652
- 38. Omelchenko MV, Makarova KS, Wolf YI, et al. Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biology. 2003;4(R55). pmid:12952534
- 39. Boc A, Makarenkov V. Towards an accurate identification of mosaic genes and partial horizontal gene transfers. Nucleic acids research. 2011;39(21):e144–e144. pmid:21917854
- 40. Than CV, Rosenberg NA. Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology. 2011;18(1):1–15. pmid:21210728
- 41. Ponting C. Plagiarized bacterial genes in the human book of life. Trends in Genetics. 2001;17(5):235–237. pmid:11335018
- 42. Rujan T, Martin W. How many genes in Arabidopsis come from cyanobacteria? An estimate from 386 protein phylogenies. Trends in Genetics. 2001;17(3):113–120. pmid:11226586