Fig 1.
A species tree S and a multifurcated gene tree G.
Each leaf xi or x of G represents a gene belonging to genome x, which is present as a leaf in S. Step (1) of ProfileNJ is PolytomySolver, which resolves each polytomy P of G independently by constructing a dynamic programming table M. Step (2) of ProfileNJ takes as input a count vector V, here resulting from the backtracking path indicated by rectangles and arrows in table M, and a distance matrix d for the genes under consideration. A Neighbor-Joining (NJ) based procedure computes the gene tree in agreement with V that best reflects the distance matrix. The final, completely refined tree is shown at the bottom right. Duplication nodes are indicated by squares.
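As an illustration of the NJ ingredient of Step (2), the following minimal sketch shows the pair selection of standard, unconstrained Neighbor-Joining. It is not the ProfileNJ procedure itself, which additionally has to respect the count vector V. The function name `nj_pick_pair` and the dict-of-dicts representation of the distance matrix d are assumptions made for this example.

```python
from itertools import combinations

def nj_pick_pair(d):
    """Return the pair of taxa minimising the standard NJ criterion.

    d is a symmetric dict-of-dicts distance matrix (hypothetical layout):
    d[x][y] is the distance between taxa x and y.
    """
    names = sorted(d)
    n = len(names)
    # Sum of distances from each taxon to all the others.
    row_sum = {x: sum(d[x][y] for y in names if y != x) for x in names}

    def q(pair):
        i, j = pair
        # Q(i, j) = (n - 2) * d(i, j) - r_i - r_j
        return (n - 2) * d[i][j] - row_sum[i] - row_sum[j]

    return min(combinations(names, 2), key=q)
```

In a full NJ run the selected pair is merged into a new node, the distances are updated, and the selection is repeated until the tree is resolved.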
Fig 2.
Topology accuracy of RAxML, TreeFix and ProfileNJ trees, measured by RF distance with the true tree, on ∼ 2500 simulated trees from the fungal dataset.
We use a sample of trees simulated under four different DL rates: (1rD-1rL), (2rD-2rL), (4rD-4rL) and (4rD-1rL). The y-axis gives the percentage of reconstructed trees with a given RF distance (x-axis) to the true tree. TreeFix and ProfileNJ have similar reconstruction accuracy (75% of trees match the true trees), while the input trees (RAxML) have the lowest accuracy. The x-axis is truncated on the right, but the displayed range contains more than 99% of the data.
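For readers unfamiliar with the RF (Robinson-Foulds) distance, a minimal sketch of its computation is given here. It assumes trees are written as nested tuples with leaf names as strings and compares the non-trivial bipartitions of the two unrooted topologies. The function names are illustrative, and this is not the code used to produce the figure.

```python
def leaf_sets(tree, out):
    """Return the leaf set below `tree`, appending each internal clade to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves

def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)  # arbitrary but fixed reference leaf
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (one leaf against all the others, or empty).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits

def rf_distance(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))
```

A distance of 0 means the two topologies are identical, which corresponds to the leftmost bar of the histogram.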
Fig 3.
Run time of TreeFix and ProfileNJ for increasing gene tree size.
Fig 4.
The input consists of a species tree (by default, the Ensembl species tree), a gene tree (or an Ensembl gene tree ID), gene sequences, and additional options such as the branch contraction threshold, the request to test all roots, and the maximum number of trees to be output by ProfileNJ and sorted by likelihood. The integrated algorithms are ProfileNJ and ParalogyCorrector. Using the latter additionally requires a set of orthology constraints as input.
Fig 5.
A general view of RefineTree when run on the Ensembl Compara gene families.
An example is given for a species tree S of four fish species, a gene family of six genes (a gene is represented by the picture of the species it belongs to, and two paralogs belonging to the same species are distinguished by different frame colors), a rooted gene tree G with branch support (in general, G can also be unrooted), and a given threshold for branch contraction. Data framed in black are the input, and data framed in blue are the output of the correction algorithm labeling the edge that links the frames in question. Black arrows depict the way we use RefineTree on the Ensembl gene trees. The green arrow and the green “or” indicate alternative uses that bypass one or both of the correction tools ParalogyCorrector and Unduplicator. Any framed set of data can alternatively be provided to the pipeline as input. For example, orthology constraints obtained from various sources can be provided directly as input to ParalogyCorrector. The method for inferring orthology constraints from synteny blocks is described in the text.
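The branch contraction step mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical tree encoding in which a leaf is a string and an internal node is a pair `(support, children)`, where `support` is the support of the branch above that node. It is not the RefineTree implementation.

```python
def contract(node, threshold):
    """Collapse every internal branch whose support is below `threshold`.

    The children of a collapsed node are attached to its parent, which
    may turn the parent into a polytomy. The root is never collapsed.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = contract(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            # Weakly supported branch: splice the grandchildren in.
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (support, new_children)
```

The resulting multifurcated tree is the kind of input that a polytomy resolution step then refines.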
Fig 6.
Sequence likelihood, ancestral genome content and ancestral chromosome linearity for ProfileNJ, Synteny and Ensembl trees.
(A) Proportion of trees with a significantly better likelihood, computed with PhyML. AU tests were computed on the three trees of each family: if the first-ranked tree was significantly better than the second, it was recorded as having the best likelihood; otherwise the family was recorded as “no significant difference at the first rank”. (B) Gene content computed with DeCo. Gene content has one value for each node of the phylogeny of 63 species, except for extant genomes, for which it has one value for each leaf. (C) Genome linearity computed with DeCo. Genome linearity is represented by a graph whose x-axis is the number of neighbors a gene can have and whose y-axis shows the proportion of genes having this number of neighbors. Values for extant genomes are given as a reference in (B) and (C). Statistics for ancestral genomes are assumed to be better when they are close to the extant ones.
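The linearity statistic of panel (C) can be illustrated with a short sketch that turns a list of gene adjacencies into the proportion of genes having each number of neighbors. The function name and the input layout (a list of genes and a list of gene pairs, each adjacency listed once) are assumptions made for this example, which does not reproduce the DeCo output format.

```python
from collections import Counter

def neighbor_distribution(genes, adjacencies):
    """Map each neighbor count to the proportion of genes having it.

    genes: iterable of gene identifiers of one genome.
    adjacencies: iterable of (gene, gene) pairs, each listed once.
    """
    degree = dict.fromkeys(genes, 0)
    for a, b in adjacencies:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    counts = Counter(degree.values())
    return {k: counts[k] / len(degree) for k in sorted(counts)}
```

In a linear or circular chromosome every gene has at most two neighbors, so any mass at three or more neighbors signals a non-linear ancestral reconstruction.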
Fig 7.
Numbers of duplications in the eukaryote phylogeny, estimated with reconciled ProfileNJ trees from PhyML starting trees on the whole Ensembl Compara database, version 73.
Drawn with Figtree [49].
Fig 8.
A probable example of ILS visible on a subtree of an Ensembl gene family.
The monophyly of the chimpanzee and gorilla genes (ENSPTRP00000033018 and ENSGGOP00000011432) is well supported by the sequences (left tree, constructed by PhyML, with aLRT supports), while synteny argues for orthology of both with the human genes (ENSP00000414208 and ENSP00000378687) (right tree, constructed by ProfileNJ followed by ParalogyCorrector), so that a scenario of duplications and losses compatible with the left tree is unlikely.
Fig 9.
The unduplication principle (figure redrawn from [33]).
A non-linearity is detected in an ancestral genome (gene g has three neighbors). Two of its neighbors, g1 and g2, descend from a possibly dubious node labeled as a duplication. The tree is rearranged so that its root is labeled as a speciation instead of a duplication. In the resulting configuration, g1 and g2 are in two different species, so that g can have only one neighbor in this family, and linearity is recovered.