The authors have declared that no competing interests exist.
Conceived and designed the experiments: AA CD. Performed the experiments: AA. Analyzed the data: AA CD. Contributed reagents/materials/analysis tools: AA GG MG CD. Wrote the paper: AA MG GG CD.
Hierarchical orthologous groups are defined as sets of genes that have descended from a single common ancestor within a taxonomic range of interest. Identifying such groups is useful in a wide range of contexts, including inference of gene function, study of gene evolution dynamics and comparative genomics. Hierarchical orthologous groups can be derived from reconciled gene/species trees but, this being a computationally costly procedure, many phylogenomic databases work on the basis of pairwise gene comparisons instead (“graph-based” approach). To our knowledge, there is only one published algorithm for graph-based hierarchical group inference, but both its theoretical justification and its performance in practice are as yet largely uncharacterised. We establish a formal correspondence between the orthology graph and hierarchical orthologous groups. Based on that, we devise GETHOGs (“Graph-based Efficient Technique for Hierarchical Orthologous Groups”), a novel algorithm to infer hierarchical groups directly from the orthology graph, thus requiring neither gene tree inference nor gene/species tree reconciliation. GETHOGs is shown to correctly reconstruct hierarchical orthologous groups when applied to perfect input, and several extensions with stringency parameters are provided to deal with imperfect input data. We demonstrate its competitiveness using both simulated and empirical data. GETHOGs is implemented as a part of the freely-available OMA standalone package (
Homologous biological sequences–sequences related through common ancestry–can be further classified according to the type of evolutionary event that initiated their divergence from one another. Notably, pairs of genes that descended from their last common ancestor through a speciation are referred to as orthologs, while genes that have diverged from a duplication event are referred to as paralogs
Orthology between pairs of genes can be quite reliably inferred using various algorithms, such as bidirectional best hit
For instance, OrthoMCL identifies groups of orthologs and “close” paralogs using Markov clustering, a procedure to identify sets of genes with high pairwise alignment scores
One particularly useful gene grouping strategy, sometimes referred to as
In this example, the hierarchical groups for the taxonomic range
Hierarchical groups can be trivially derived from reconciled gene/species trees, such as those obtained by LOFT
In this article, we present GETHOGs, which stands for “Graph-based Efficient Technique for Hierarchical Orthologous Groups”. The algorithm is based on correspondences between the orthology graph and the underlying gene phylogeny, correspondences that we prove in two new lemmas. We present an efficient implementation of the algorithm as part of the OMA standalone package. We demonstrate that the resulting algorithm outperforms COCO-CL on simulated and real data. We also show that GETHOGs outperforms the tree reconciliation method LOFT. Lastly, we contrast GETHOGs’s results on real data with predictions of the EggNOG and OrthoDB databases (whose precise algorithms are as yet unpublished).
In this section, we first mathematically define hierarchical orthologous groups in terms of gene and species trees, and derive useful notions and properties. We then define the orthology graph, which, crucially, can be inferred without computing gene trees. Next, we describe the correspondence between hierarchical orthologous groups and the orthology graph. The rest of the section details the data and methods used for validating and comparing our new algorithm with existing approaches.
Readers not interested in the technical details can skip this section and proceed directly to the description of GETHOGs (Results Section).
Let
By definition,
Let
We prove this proposition by contradiction. Assume the existence of such speciation nodes
The one-to-one correspondence between
Let
We define an
Here, we consider two cases: perfect data, where we assume that the pairwise orthologs have been correctly and exhaustively identified, and “real data”, where these have been imperfectly identified, using OMA pairwise (Sect. “Orthology graph inference”;
To restrict the orthology graph to a chosen taxonomic range, we denote by
Our novel algorithm for hierarchical orthologous group inference will use the following two lemmas. The first lemma establishes a correspondence between hierarchical orthologous groups and the orthology graph (illustrated in
Likewise, it also induces an orthology subgraph with set of connected component
As Proposition 2 asserts, the correspondence between the hierarchical orthologous groups
In the second lemma, we prove that on perfect data, members of a hierarchical group have at most two degrees of separation in the orthology graph. Intuitively, this can be seen by the fact that the deepest split in all considered gene (sub)trees is a speciation node, so every gene in one subtree of this split is orthologous to every gene in the other subtree of that split. Hence, regardless of the relationships within these subtrees, it is always possible to go to another gene within the same subtree by first going to any gene in the other subtree and then coming back.
According to Lemma 1, every
We will make use of Lemma 2 to motivate and establish the heuristic FractionReachableInTwoSteps parameter to cope with imperfect input data.
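The intuition behind Lemma 2 can be checked on a toy example. The following is a minimal sketch, not part of GETHOGs itself: it assumes a hypothetical gene-tree encoding in which an internal node is a tuple `("S", children)` for a speciation or `("D", children)` for a duplication, and a leaf is a gene name. Two genes are taken to be orthologs when their last common ancestor is a speciation node, and the check verifies that every pair of genes is either adjacent in the resulting orthology graph or shares a neighbour.

```python
from itertools import combinations

# Hypothetical encoding: leaf = gene name (str); internal node = (kind, children),
# where kind is "S" (speciation) or "D" (duplication).

def genes(node):
    """Return the list of genes (leaves) below a node."""
    if isinstance(node, str):
        return [node]
    return [g for child in node[1] for g in genes(child)]

def ortholog_pairs(node):
    """Pairs of genes whose last common ancestor is a speciation node."""
    if isinstance(node, str):
        return set()
    kind, children = node
    pairs = set()
    for child in children:
        pairs |= ortholog_pairs(child)
    if kind == "S":
        for left, right in combinations(children, 2):
            pairs |= {frozenset((a, b)) for a in genes(left) for b in genes(right)}
    return pairs

def within_two_steps(node):
    """True if every pair of genes is at distance <= 2 in the orthology graph."""
    all_genes = genes(node)
    adj = {g: set() for g in all_genes}
    for pair in ortholog_pairs(node):
        a, b = tuple(pair)
        adj[a].add(b)
        adj[b].add(a)
    return all(b in adj[a] or adj[a] & adj[b]
               for a, b in combinations(all_genes, 2))

# Deepest split is a speciation (fish vs. mammals), followed by a duplication.
family = ("S", ["F1", ("D", [("S", ["HA", "MA"]), ("S", ["HB", "MB"])])])
# Same mammalian subtree on its own: its deepest split is a duplication.
not_a_group = ("D", [("S", ["HA", "MA"]), ("S", ["HB", "MB"])])

print(within_two_steps(family))       # True
print(within_two_steps(not_a_group))  # False
```

In the first tree the paralogs HA and HB are not adjacent, but both are orthologous to F1, which gives the two-step path described above. In the second tree the deepest split is a duplication, so the genes do not form a single hierarchical group and the property fails.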
To generate the simulated genomes we used ALF
We reanalysed the three gene families from a recent manually curated study by Boeckmann et al.
To construct the orthology graph, we used pairwise orthologs inferred by the OMA algorithm
With the simulated dataset, we do not assume knowledge of the true species tree. Instead, we estimate it using a least-squares distance approach (
COCO-CL requires initial homologous clusters and refines them into a hierarchy by applying a single linkage clustering algorithm on the induced pairwise distance estimates of the cluster’s multiple alignment. As suggested by the authors
LOFT is a tree-based orthology inference method
Following Boeckmann et al.
We first present an algorithm which, given a perfect input orthology graph (i.e. all the pairwise orthologs have been correctly and exhaustively identified) and the true (partially or fully resolved) species tree topology, correctly identifies for all taxonomic ranges the corresponding hierarchical orthologous groups. In the second part, we present extensions to cope with imperfect data followed by some remarks about the implementation of the algorithm. We conclude this section by comparing the performance of GETHOGs with existing methods.
In order to obtain a hierarchy of nested orthologous groups, our approach requires a rooted, at least partially resolved species tree. Our proposed algorithm computes a hierarchy of orthologous groups by recursively identifying the connected components on the orthology subgraphs induced by the species in the lineages at various taxonomic levels (Algorithm 2,
Note that, due to the definition of hierarchical groups, genes belonging to different groups at the same taxonomic range have descended from distinct genes in the last common ancestor. As we have formally established in Proposition 3, such genes are in no circumstance orthologous, and are paralogous if the groups are evolutionarily related (homologous).
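The perfect-data procedure described above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the published implementation: the species tree is encoded as nested tuples of species names, the orthology graph as a dictionary of neighbour sets, and the names `leaves`, `connected_components`, `hierarchical_groups` and `species_of` are hypothetical. For every internal node of the species tree (a taxonomic range), the sketch restricts the graph to genes from species in that clade and reports the connected components as the groups for that range.

```python
def leaves(tree):
    """Species below a node of the species tree (nested tuples of names)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def connected_components(genes, adj):
    """Connected components of the orthology graph restricted to `genes`."""
    seen, comps = set(), []
    for g in sorted(genes):
        if g in seen:
            continue
        comp, stack = set(), [g]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(n for n in adj.get(x, ()) if n in genes and n not in comp)
        seen |= comp
        comps.append(comp)
    return comps

def hierarchical_groups(tree, species_of, adj, out=None):
    """Map each clade (frozenset of species) to its list of groups."""
    out = {} if out is None else out
    if isinstance(tree, str):
        return out
    clade = frozenset(leaves(tree))
    genes = {g for g, s in species_of.items() if s in clade}
    out[clade] = connected_components(genes, adj)
    for child in tree:
        hierarchical_groups(child, species_of, adj, out)
    return out

# Toy family: one fish gene, and a duplication in the mammalian ancestor.
species_tree = (("human", "mouse"), "fish")
species_of = {"F1": "fish", "HA": "human", "HB": "human", "MA": "mouse", "MB": "mouse"}
adj = {
    "F1": {"HA", "HB", "MA", "MB"},
    "HA": {"F1", "MA"}, "MA": {"F1", "HA"},
    "HB": {"F1", "MB"}, "MB": {"F1", "HB"},
}
groups = hierarchical_groups(species_tree, species_of, adj)
```

On this toy input the range covering all three species yields a single group of five genes, whereas the mammalian range yields two groups, {HA, MA} and {HB, MB}, one per gene copy present in the mammalian ancestor.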
The runtime complexity of the GETHOGs algorithm on perfect input data is
The two lemmas described in the “Methods” section are only valid for perfect data. In practice, for all but trivial examples, the input orthology graph can be expected to have missing (false negative) and spurious (false positive) orthology predictions. While missing predictions are typically not a problem (the orthology graph is normally dense enough to provide a path from every group member to every other), additional predictions are more disruptive: false positives result in the erroneous merging of orthologous groups. Hence, using the transitive closure of the pairwise orthology relations would in such situations lead to excessively large clusters. Fortunately, these spuriously merged clusters are often not strongly connected to each other, with only a few edges connecting them [
An example orthology graph from the OMA database where two false-positive predictions merge two well-defined orthologous groups. At the level of vertebrates, the
To cope with such errors in the orthology graph, we modify/extend the algorithm GETHOGs (Algorithm 2,
As for the termination criterion, it is motivated by the property that, with correct input, the connected components of the orthology graph have a diameter of at most 2 (Lemma 2). To approximate the diameter, which is expensive to compute, we estimate the average fraction of nodes that are reachable within two steps of each node (Algorithm 4,
Algorithm 4. Output: estimate of the average fraction of nodes reachable within 2 steps.
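The quantity behind the FractionReachableInTwoSteps criterion can be illustrated as follows. This is a minimal sketch under stated assumptions: the graph is a dictionary mapping each gene to the set of its neighbours, the function name and the `sample_size` and `seed` parameters are hypothetical, and random sampling of probe nodes is one plausible way to obtain an estimate rather than the documented procedure.

```python
import random

def fraction_reachable_in_two_steps(adj, sample_size=None, seed=0):
    """Average, over probed nodes, of the fraction of other nodes at distance <= 2.

    adj: dict mapping each node to the set of its neighbours.
    sample_size: if given and smaller than the graph, probe only a random sample
                 of nodes (an assumption of this sketch), otherwise probe all.
    """
    nodes = list(adj)
    if len(nodes) < 2:
        return 1.0
    if sample_size is not None and sample_size < len(nodes):
        probes = random.Random(seed).sample(nodes, sample_size)
    else:
        probes = nodes
    total = 0.0
    for v in probes:
        reach = set(adj[v])          # one step
        for u in adj[v]:
            reach |= adj[u]          # two steps
        reach.discard(v)
        total += len(reach) / (len(nodes) - 1)
    return total / len(probes)

# A path a-b-c-d: the end nodes reach only 2 of the 3 other nodes in two steps.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
# A complete bipartite graph, as produced by a single speciation on perfect data.
bipartite = {"x1": {"y1", "y2"}, "x2": {"y1", "y2"},
             "y1": {"x1", "x2"}, "y2": {"x1", "x2"}}

print(fraction_reachable_in_two_steps(bipartite))  # 1.0
```

A value of 1.0 corresponds to a diameter of at most 2, as expected on perfect data. For the path graph the value is 5/6, so a stringency threshold above that would trigger a further split.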
Furthermore, it is also possible to use the weighted version of Minimum-Cut. For this purpose, we augment the orthology graph with edge weights corresponding to pairwise alignment scores, and use these weights to guide the Minimum-Cut algorithm. The rationale is that spurious false positives often have relatively low alignment scores. Hence, the spurious edges erroneously connecting two bona fide groups will have low scores and thus be targeted by the weighted Minimum-Cut procedure. But note that while this heuristic has a theoretical motivation based on our findings on perfect data, we do not claim it to be optimal.
We now give an asymptotic runtime analysis of our algorithm. Giving a tight bound on the runtime on imperfect input data is not easy. We therefore make the assumption that gene duplications and losses are distributed uniformly on the gene trees (thus resulting in a mostly balanced gene family tree).
The time complexity of the D
The depth of the recursion D
The resulting overall time complexity for GETHOGs on imperfect data is therefore of order
The source of the described algorithm is freely available for non-commercial uses as part of the OMA standalone package on
The Karger-Stein algorithm was implemented for weighted graphs. The algorithm is randomised: with a certain probability (which can be made arbitrarily small), it may return not the minimum cut but a slightly larger one. In practice, we could not find cases where it failed with the default parameters, and even if it did fail, this would mostly alter the order in which we find the groups. This randomisation allows us to parallelise the procedure for very large graphs [Materials S1].
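To make the randomised minimum cut concrete, the sketch below implements plain weighted Karger contraction with repeated independent trials. It is deliberately simpler than the recursive Karger-Stein variant used in the implementation, and the function name, the edge-list format and the `trials` and `seed` parameters are assumptions of this illustration. It also assumes the input graph is connected. Edges are contracted with probability proportional to their weight, so low-scoring edges are the ones most likely to survive and end up in the cut.

```python
import random

def karger_min_cut(edges, trials=200, seed=0):
    """Randomised minimum cut of a connected weighted graph.

    edges: list of (u, v, weight) tuples with positive weights.
    Returns (cut_weight, side_a, side_b) for the best cut found.
    """
    rng = random.Random(seed)
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    best = None
    for _ in range(trials):
        parent = {n: n for n in nodes}  # union-find over contracted nodes

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        remaining = len(nodes)
        while remaining > 2:
            # Edges still joining two different super-nodes.
            live = [(u, v, w) for u, v, w in edges if find(u) != find(v)]
            u, v, _ = rng.choices(live, weights=[w for _, _, w in live])[0]
            parent[find(u)] = find(v)   # contract the chosen edge
            remaining -= 1
        cut = sum(w for u, v, w in edges if find(u) != find(v))
        if best is None or cut < best[0]:
            root = find(nodes[0])
            side_a = {n for n in nodes if find(n) == root}
            best = (cut, side_a, set(nodes) - side_a)
    return best

# Two tightly connected groups joined by one low-scoring (spurious) edge.
edges = [
    ("a1", "a2", 5), ("a2", "a3", 5), ("a1", "a3", 5),
    ("b1", "b2", 5), ("b2", "b3", 5), ("b1", "b3", 5),
    ("a1", "b1", 1),
]
cut_weight, side_a, side_b = karger_min_cut(edges)
```

On this toy graph the cut found should consist of the single weight-1 edge, separating the `a` genes from the `b` genes. Because each trial is independent, the trials can be distributed over several workers, which is the property that makes the randomised approach easy to parallelise.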
We applied our algorithm to both simulated and real data problems, and compared it to a graph-based and a tree-based hierarchical grouping strategy. We generated two artificial datasets by simulation with ALF
On the dataset with moderate duplication rate, compared to COCO-CL and LOFT, GETHOGs reported considerably more orthologous relations at roughly the same level of precision [
The two datasets show average rates over 4 independent runs of genome simulations with fixed parameters. The two datasets differ essentially in their gene duplication rates (see Methods section for details). As a point of reference, we also show the performance of pairwise orthologs inferred in OMA (OMA Pairwise). The colour gradient corresponds to various
To analyse the sensitivity to the species phylogeny required by GETHOGs, we ran the algorithm once with the true species tree and once with a species tree inferred from the data (Supplementary
The surprisingly low recall of LOFT with respect to orthologs and paralogs in the more difficult dataset can be mainly attributed to errors in the gene family inference step, for which LOFT uses the COG algorithm. Indeed, if provided perfect gene family input, the recall for LOFT and COCO-CL increases substantially for both orthologs and paralogs (Supplementary
We now turn to the evaluation on empirical biological data. With real data, the true evolutionary relations are mostly unknown. Therefore, we restrict our analysis to a small set of thoroughly studied gene families, which we assume to be free of errors
This analysis covers three gene families, the “ancestral-type” subfamily of NADPH oxidases (
In this analysis, we observe that the predictions of GETHOGs largely outperform the ones of OrthoDB and COCO-CL in terms of precision and recall [
Although these 3 families are not sufficient to draw general conclusions, they nevertheless suggest that the good performance of GETHOGs in simulation extends to real data as well. Furthermore, it should be noted that the absence of description of the EggNOG and OrthoDB algorithms, let alone available implementation, precludes their use on custom genomic data.
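For readers who wish to reproduce this kind of comparison, precision and recall over pairwise relations can be computed as in the sketch below. It assumes that predictions and the reference are both given as collections of unordered gene pairs. The function name and the toy pairs are hypothetical and are not taken from the datasets analysed here.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted gene pairs against a reference set.

    Both arguments are iterables of 2-tuples; pair order is ignored.
    """
    predicted = {frozenset(p) for p in predicted}
    reference = {frozenset(p) for p in reference}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Toy example: 3 predicted orthologous pairs, 4 reference pairs, 2 in common.
predicted = [("HA", "MA"), ("HB", "MB"), ("HA", "MB")]
reference = [("MA", "HA"), ("HB", "MB"), ("F1", "HA"), ("F1", "MA")]
precision, recall = precision_recall(predicted, reference)
```

Here two of the three predicted pairs are correct (precision 2/3) and two of the four reference pairs are recovered (recall 1/2).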
We finish this section by discussing the limitations of GETHOGs. Most importantly, the method depends on the quality of the input orthology graph. We have established that GETHOGs reconstructs the correct groups on perfect input data, but real data cannot be expected to yield a perfect orthology graph. Although we have introduced heuristics to cope with errors in the orthology graph, the performance will deteriorate when the input information is not sufficient to discriminate among multiple evolutionary scenarios. We acknowledge that OMA pairwise, which is known to be relatively conservative
One potential problem with the input graph might be caused by genes encoding multi-domain proteins. Indeed, if the pairwise orthology detection method used to construct the orthology graph does not ensure that orthology between two genes extends over all (or at least most) domains, the resulting graph might strongly violate GETHOGs's working assumptions. Note however that the very concept of orthology among genes with different domain composition (and thus non-homologous parts) is ill-defined, as orthology is a subtype of (and thus presupposes) homology. Because of that, many pairwise orthology inference algorithms, including the OMA algorithm we used for all input in this work, require homologous regions between two genes to extend over most of their sequence lengths
The other main limitation of GETHOGs lies in the computational cost of processing huge gene families. The currently biggest orthology graph in the OMA database contains
We presented GETHOGs, a novel algorithm for reconstructing hierarchical orthologous groups. The approach is based on an orthology graph induced by pairwise orthologous gene relations, and as such requires neither gene tree inference nor gene/species tree reconciliation. The algorithm is motivated by a lemma demonstrating the equivalence of the connected components in the orthology subgraph induced by a taxonomic range and the orthologous groups with respect to the same taxonomic range on perfect data. In order to extend the algorithm to be applicable for real data, we separate weakly connected components by splitting the graph repeatedly at its minimum cut. We stop once the graph is sufficiently densely connected, based on the lemma that the orthology graph should have diameter less than or equal to two.
We applied the algorithm to simulated and real datasets, and compared it to COCO-CL and LOFT, where it finds considerably more orthologs/paralogs at roughly the same precision rate. On real data, we also compared our algorithm to EggNOG and OrthoDB, two databases providing hierarchical orthologous groups, by re-analysing three manually curated gene families from a recent study. Though the empirical datasets are too small to draw firm general conclusions, the results based on these families indicate that our method is competitive.
Regardless of these promising results, the
PDF containing supplementary materials and figures.
(PDF)
Parameter file used to generate the simulated datasets with low duplication rate with ALF
(DRW)
Parameter file used to generate the simulated datasets with high duplication rate with ALF
(DRW)
We thank Daniel Dalquen (ETH Zurich), Nick Goldman (EMBL-EBI), Kevin Gori (EMBL-EBI), Matthieu Muffato (EMBL-EBI), Paul Thomas (University of Southern California) and Stefan Zoller (ETH Zurich) for valuable input and stimulating discussions.