OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees

Jacob L. Steenwyk; Dayna C. Goltz; Thomas J. Buida III; Yuanning Li; Xing-Xing Shen; Antonis Rokas

doi:10.1371/journal.pbio.3001827

Abstract

Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species—a phenomenon observed among several important families of genes such as transporters and transcription factors—are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, a software that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.

Citation: Steenwyk JL, Goltz DC, Buida TJ III, Li Y, Shen X-X, Rokas A (2022) OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees. PLoS Biol 20(10): e3001827. https://doi.org/10.1371/journal.pbio.3001827

Academic Editor: Andreas Hejnol, University of Bergen, NORWAY

Received: November 4, 2021; Accepted: September 13, 2022; Published: October 13, 2022

Copyright: © 2022 Steenwyk et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All results and data presented in this study are available from figshare (doi: 10.6084/m9.figshare.16875904).

Funding: J.L.S. and A.R. were funded by the Howard Hughes Medical Institute through the James H. Gilliam Fellowships for Advanced Study program. Research in A.R.’s lab is supported by grants from the National Science Foundation (DEB-2110404), the National Institutes of Health/National Institute of Allergy and Infectious Diseases (R56 AI146096 and R01 AI153356), and the Burroughs Wellcome Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: Antonis Rokas is a scientific consultant for LifeMine Therapeutics, Inc. Jacob L. Steenwyk is a scientific consultant for Latch AI Inc.

Introduction

Molecular evolution studies, such as species tree inference, genome-wide surveys of selection, evolutionary rate estimation, and measures of gene–gene coevolution, typically rely on single-copy orthologs (SC-OGs), groups of homologous genes that originated via speciation and are present in single copy among species of interest [1–6]. In contrast, paralogs (homologous genes that originated via duplication and are often members of large gene families) are typically absent from these studies (Fig 1). Gene families of orthologs and paralogs often encode functionally significant proteins such as transcription factors, transporters, and olfactory receptors [7–10]. The exclusion of SC-OGs nested within gene families has not only hindered our understanding of their evolution and phylogenetic informativeness but also artificially reduces the number of gene markers available for molecular evolution studies. Furthermore, as the number of species and/or their evolutionary divergence increases in a dataset, the number of SC-OGs decreases [11,12]; case in point, no SC-OGs were identified in a dataset of 42 plants [11]. As the number of available genomes across the tree of life continues to increase, identifying SC-OGs present in many taxa will become more challenging.

Fig 1. Cartoon depiction of 3 classes of paralogs: outparalogs, inparalogs, and coorthologs.

(A) Paralogs refer to related genes that have originated via gene duplication, such as genes M, N, and O. (B) Outparalogs and inparalogs refer to paralogs that are related to one another via a duplication event that took place prior to or after a speciation event, respectively. With respect to the speciation event that led to the split of taxa A, B, and C from D, genes M, N, and O are outparalogs because they arose prior to the speciation event; genes O1 and O2 in taxa A, B, and C are inparalogs because they arose after the speciation event. Species-specific inparalogs are paralogous genes observed only in 1 species, strain, or organism in a dataset, such as gene N1 and N2 in species A. Species-specific inparalogs N1 and N2 in species A are also coorthologs of gene N in taxa B, C, and D; the same is true for inparalogs O1 and O2 from species A, which are coorthologs of gene O from species D. (C) Cartoon depiction of SNAP-OGs identified by OrthoSNAP.

https://doi.org/10.1371/journal.pbio.3001827.g001

In light of these issues, several methods have been developed to account for paralogs in specific types of molecular evolution studies—for example, in species tree reconstruction [13]. Methods such as SpeciesRax, STAG, ASTRAL-PRO, and DISCO can be used to infer a species tree from a set of SC-OGs and gene families composed of orthologs and paralogs [11,14–16]. Other methods such as PHYLDOG [17] and guenomu [18] jointly infer the species and gene trees but require abundant computational resources, which has hindered their use for large datasets. Other software, such as PhyloTreePruner, can conduct species-specific inparalog trimming [19]. Agalma, as part of a larger automated phylogenomic workflow, can prune gene trees into maximally inclusive subtrees wherein each species, strain, or organism is represented by 1 sequence [20]. Similarly, OMA identifies subgroups of SC-OGs using graph-based clustering of sequence similarity scores [21]. Although these methods have expanded the numbers of gene markers used in species tree reconstruction, they were not designed to facilitate the retrieval of as broad a set of SC-OGs as possible for downstream molecular evolution studies such as surveys of selection. Furthermore, the phylogenetic information content of these gene families remains unknown, calling into question their usefulness.

To address this need and measure the information content of subgroups of single-copy orthologous genes, we developed OrthoSNAP, a novel algorithm that identifies SC-OGs nested within larger gene families via tree decomposition and species-specific inparalog pruning. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they were retrieved using a splitting and pruning procedure. The efficacy of OrthoSNAP and the information content of SNAP-OGs were examined across 7 eukaryotic datasets, which include species with complex evolutionary histories (e.g., whole-genome duplication) or complex gene sequence data (e.g., transcriptomes, which typically have multiple transcripts per protein-coding gene). These analyses revealed that OrthoSNAP can substantially increase the number of orthologs for downstream analyses such as phylogenomics and surveys of selection. Furthermore, we found that the information content of SNAP-OGs was statistically indistinguishable from that of SC-OGs, suggesting that SNAP-OGs are likely to be as informative as SC-OGs in downstream analyses. These analyses indicate that SNAP-OGs identified by OrthoSNAP hold promise for increasing the number of markers used in molecular evolution studies, which can, in turn, be used for constructing and interpreting the tree of life.

Results

OrthoSNAP is a novel tree traversal algorithm that conducts tree splitting and species-specific inparalog pruning to identify SC-OGs nested within larger gene families (Fig 1C). OrthoSNAP takes as input a gene family phylogeny and associated FASTA file and can output individual FASTA files populated with sequences from SNAP-OGs as well as the associated Newick tree files (Fig 2). During tree traversal, tree uncertainty can be accounted for by OrthoSNAP by collapsing poorly supported branches. In a set of 7 eukaryotic datasets that contained 9,821 SC-OGs, we used OrthoSNAP to identify an additional 10,704 SNAP-OGs. Using a combination of multivariate statistics and phylogenetic measures, we demonstrate that SNAP-OGs and SC-OGs have similar phylogenetic information content in all 7 datasets. This observation was consistent across datasets where the identification of large numbers of SC-OGs is challenging: flowering plants that have complex patterns of gene duplication and loss (15 SC-OGs and 653 SNAP-OGs), a lineage of budding yeasts wherein half of the species have undergone an ancient whole-genome duplication event (2,782 SC-OGs and 1,334 SNAP-OGs), and a dataset of transcriptomes where many genes are represented by multiple transcripts (390 SC-OGs and 2,087 SNAP-OGs). Lastly, similar patterns of support were observed among the 252 SC-OGs and the 1,428 SNAP-OGs in a contentious branch in the tree of life. Taken together, these results suggest that OrthoSNAP is helpful for expanding the set of gene markers available for molecular evolutionary studies, even in datasets where inference of orthology has historically been difficult due to complex evolutionary history or complex data characteristics.
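
The splitting step can be illustrated with a minimal sketch. It assumes a rooted gene family tree encoded as nested Python tuples whose tips follow a species|gene naming scheme, and it returns the largest subtrees in which every species appears at most once. The function names, the tuple encoding, and the `min_taxa` parameter are hypothetical and chosen for illustration; this is not OrthoSNAP's implementation, and inparalog pruning and collapsing of poorly supported branches are omitted.

```python
def tips(node):
    """Return all tip labels beneath a node (a tip is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [label for child in node for label in tips(child)]


def single_copy_subtrees(node, min_taxa):
    """Collect the largest subtrees in which each species occurs once."""
    labels = tips(node)
    species = [label.split("|", 1)[0] for label in labels]
    if len(set(species)) == len(species):
        # Every species is single copy here; keep the subtree only if
        # it contains at least min_taxa tips.
        return [sorted(labels)] if len(labels) >= min_taxa else []
    groups = []
    for child in node:
        # Some species is duplicated, so split and inspect each child.
        groups.extend(single_copy_subtrees(child, min_taxa))
    return groups


# Hypothetical gene family: an ancient duplication produced copies M and N.
family = (
    (("A|M", "B|M"), ("C|M", "D|M")),
    (("A|N", "B|N"), "C|N"),
)
groups = single_copy_subtrees(family, min_taxa=3)
for group in groups:
    print(group)
```

In this toy tree the root contains duplicated species, so the traversal descends and recovers one group for copy M and one for copy N; raising `min_taxa` to 4 would discard the N group.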

Fig 2. Cartoon depiction of OrthoSNAP workflow.

(A) OrthoSNAP takes as input 2 files: a FASTA file of a gene family with multiple homologs observed in 1 or more species and the associated gene family tree. The output is a set of individual FASTA files of SNAP-OGs. Depending on user arguments, individual Newick tree files can also be written. (B) A cartoon phylogenetic tree that depicts the evolutionary history of a gene family and 5 SNAP-OGs therein. While identifying SNAP-OGs, OrthoSNAP also identifies and prunes species-specific inparalogs (e.g., species2|gene2-copy_0 and species2|gene2-copy_1), retaining only the inparalog with the longest sequence, a practice common in transcriptomics. Note that OrthoSNAP requires sequence names to be identical in both input files and to follow the convention in which a species, strain, or organism identifier and a gene identifier are separated by a pipe (or vertical bar; “|”) character.

https://doi.org/10.1371/journal.pbio.3001827.g002
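
The naming convention and the longest-sequence rule described in the caption can be sketched as follows. The tree is again assumed to be a nested tuple of tip names, and `split_name`, `prune_inparalogs`, and the example sequences are hypothetical illustrations rather than OrthoSNAP's own code. A clade is collapsed only when all of its tips come from a single species.

```python
def split_name(name):
    """Split 'species|gene' into its two identifiers."""
    species, sep, gene = name.partition("|")
    if not (sep and species and gene):
        raise ValueError(f"expected 'species|gene', got {name!r}")
    return species, gene


def prune_inparalogs(node, seqs):
    """Replace each clade made up of one species by its longest sequence."""
    if isinstance(node, str):
        return node
    # Prune the children first so nested same-species clades collapse too.
    children = [prune_inparalogs(child, seqs) for child in node]
    if all(isinstance(child, str) for child in children):
        if len({split_name(child)[0] for child in children}) == 1:
            return max(children, key=lambda name: len(seqs[name]))
    return tuple(children)


# Hypothetical sequences; only their lengths matter for this sketch.
seqs = {
    "species1|gene1": "MKTAYIAK",
    "species2|gene2-copy_0": "MKT",
    "species2|gene2-copy_1": "MKTAYIA",
    "species3|gene3": "MKTAYI",
}
tree = (
    ("species1|gene1", ("species2|gene2-copy_0", "species2|gene2-copy_1")),
    "species3|gene3",
)
pruned = prune_inparalogs(tree, seqs)
print(pruned)
```

Here the two species2 copies form a species-specific clade, so only the longer one (copy_1) is retained and every species is left with a single sequence.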

SC-OGs and SNAP-OGs have similar information content

To compare SC-OGs and SNAP-OGs, we first independently inferred orthologous groups of genes in 3 eukaryotic datasets of 24 budding yeasts (none of which have undergone whole-genome duplication), 36 filamentous fungi (Aspergillus and Penicillium species), and 26 mammals including humans, dogs, pigs, elephants, sloths, and others (S1 Table). There was variation in the number of SC-OGs and SNAP-OGs in each lineage (S1 Fig and S2 Table). Interestingly, the ratio of SNAP-OGs: SC-OGs among budding yeasts, filamentous fungi, and mammals was 0.83 (1,392: 1,668), 0.46 (2,035: 4,393), and 5.53 (1,775: 321), respectively, indicating SNAP-OGs can substantially increase the number of gene markers in certain lineages. The number of SNAP-OGs identified in a gene family with multiple homologs in 1 or more species also varied (S2 Fig).
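
The ratios reported above follow directly from the counts of SNAP-OGs and SC-OGs in each dataset, as this short check shows. Only the counts come from the text; the variable names are arbitrary.

```python
# (SNAP-OGs, SC-OGs) per dataset, as reported in the text.
counts = {
    "budding yeasts": (1392, 1668),
    "filamentous fungi": (2035, 4393),
    "mammals": (1775, 321),
}

# Ratio of SNAP-OGs to SC-OGs, rounded to 2 decimal places.
ratios = {name: round(snap / sc, 2) for name, (snap, sc) in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```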

Similar orthogroup occupancy and best-fitting models of substitutions were observed among SC-OGs and SNAP-OGs (S3 Fig and S3 Table), raising the question of whether SC-OGs and SNAP-OGs have similar information content. To answer this, the information content among multiple sequence alignments and phylogenetic trees from SC-OGs and SNAP-OGs (S4 Fig and S4 Table) was compared across 9 properties—Robinson–Foulds distance [22], relative composition variability [23], and average bootstrap support, for example—using multivariate analysis and statistics as well as information theory-based phylogenetic measures. Principal component analysis enabled qualitative comparisons between SC-OGs and SNAP-OGs in reduced dimensional space and revealed a high degree of similarity (Figs 3 and S5). Multivariate statistics—namely, multifactor analysis of variance—facilitated a quantitative comparison of SC-OGs and SNAP-OGs and revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1; S5 Table) and no interaction between the 9 properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Similarly, multifactor analysis of variance using an additive model, which assumes each factor is independent and there are no interactions (as observed here), also revealed no differences between SC-OGs and SNAP-OGs (p = 0.65, F = 0.21, df = 1). Next, we calculated tree certainty, an information theory-based measure of tree congruence from a set of gene trees, and found similar levels of congruence among phylogenetic trees inferred from SC-OGs and SNAP-OGs (S6 Table). Taken together, these analyses demonstrate that SC-OGs and SNAP-OGs have similar phylogenetic information content.
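
As an illustration of one of the 9 properties, the sketch below computes a Robinson–Foulds distance between two trees with the same tips. It is a simplified rooted (clade-based) variant on nested tuples, assumed here for brevity; it is not the software used in this study, and dedicated phylogenetics packages typically compare unrooted bipartitions and may normalize the result.

```python
def collect_clades(node, out):
    """Add the tip set of every internal node to `out`; return this node's tips."""
    if isinstance(node, str):
        return frozenset([node])
    tip_set = frozenset().union(*(collect_clades(child, out) for child in node))
    out.add(tip_set)
    return tip_set


def rf_distance(tree_a, tree_b):
    """Count clades found in exactly one of two rooted trees."""
    clades_a, clades_b = set(), set()
    if collect_clades(tree_a, clades_a) != collect_clades(tree_b, clades_b):
        raise ValueError("trees must contain the same tips")
    # Symmetric difference: clades unique to either tree.
    return len(clades_a ^ clades_b)


# Hypothetical reference and gene trees over 4 taxa.
reference = (("A", "B"), ("C", "D"))
gene_tree = (("A", "C"), ("B", "D"))
print(rf_distance(reference, reference))
print(rf_distance(reference, gene_tree))
```

Lower values indicate closer agreement with the reference topology; identical trees score 0.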

Fig 3. SC-OGs and SNAP-OGs have similar phylogenetic information content.

To evaluate similarities and differences between SC-OGs (orange dots) and SNAP-OGs (blue dots), we examined each gene’s phylogenetic information content by measuring 9 properties of multiple-sequence alignments and phylogenetic trees. We performed these analyses on 12,764 gene families from 3 datasets: 24 budding yeasts (1,668 SC-OGs and 1,392 SNAP-OGs) (A), 36 filamentous fungi (4,393 SC-OGs and 2,035 SNAP-OGs) (B), and 26 mammals (321 SC-OGs and 1,775 SNAP-OGs) (C). Principal component analysis revealed striking similarities between SC-OGs and SNAP-OGs in all 3 datasets. For example, the centroids (i.e., the mean across all metrics and genes) for SC-OGs and SNAP-OGs, each depicted as a larger opaque dot, are very close to one another in reduced dimensional space. Supporting this observation, multifactor analysis of variance with interaction effects of the 6,630 SNAP-OGs and 6,134 SC-OGs revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1) and no interaction between the 9 properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Multifactor analysis of variance using an additive model yielded similar results wherein SC-OGs and SNAP-OGs do not differ (p = 0.65, F = 0.21, df = 1). There are also very few outliers of individual SC-OGs and SNAP-OGs, which are represented as translucent dots, in all 3 panels. For example, SNAP-OG outliers at the top of panel C are driven by high treeness/RCV values, which are associated with a high signal-to-noise ratio and/or low composition bias [23]; SNAP-OG outliers at the right of panel C are driven by high average bootstrap support values, which are associated with greater tree certainty [74]; and the single SC-OG outlier observed in the bottom right of panel C is driven by a high degree of violation of a molecular clock [78], which is associated with lower tree certainty [79]. Multiple-sequence alignment and phylogenetic tree properties used in principal component analysis and abbreviations thereof are as follows: average bootstrap support (ABS), degree of violation of the molecular clock (DVMC), relative composition variability (RCV), Robinson–Foulds distance (RF distance), alignment length (Aln. len.), the number of parsimony informative sites (PI sites), saturation, treeness (tness), and treeness/RCV (tness/RCV). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.g003

We next aimed to determine if SC-OGs and SNAP-OGs have greater phylogenetic information content than a random null expectation. Groups of genes reflecting a random null expectation were constructed by randomly selecting a single sequence from representative species in multicopy orthologous genes (hereafter referred to as Random-GGs for random combinations of orthologous and paralogous groups of genes) in the budding yeast (N = 647), filamentous fungi (N = 999), and mammalian (N = 954) datasets. Random-GGs were aligned and trimmed, and phylogenetic trees were inferred from the resulting multiple sequence alignments. Random-GG phylogenetic information was also calculated. Across each dataset, significant differences were observed among SC-OGs, SNAP-OGs, and Random-GGs (p < 0.001, F = 189.92, df = 4; multifactor ANOVA). Further examination of differences revealed that Random-GGs are significantly different from SC-OGs and SNAP-OGs (p < 0.001 for both comparisons; Tukey honest significant differences (THSD) test) in the budding yeast dataset. In contrast, SC-OGs and SNAP-OGs are not significantly different (p = 0.42; THSD). The same was also true for the datasets of filamentous fungi and mammals: Random-GGs were significantly different from SC-OGs and SNAP-OGs (p < 0.001 for each comparison in each dataset; THSD), whereas SC-OGs and SNAP-OGs were not significantly different (p = 1.00 for the filamentous fungi dataset; p = 0.42 for the dataset of mammals; THSD). Principal component analysis revealed that Robinson–Foulds distances (a measure of tree accuracy wherein lower values represent greater tree accuracy) and relative composition variability (a measure of alignment composition bias wherein lower values represent less compositional bias) often drove differences among Random-GGs, SC-OGs, and SNAP-OGs across the datasets. In all datasets, SC-OGs and SNAP-OGs outperformed the null expectation in tree accuracy and were less compositionally biased (Table 1). These findings suggest SNAP-OGs and SC-OGs are similar in phylogenetic information content and outperform the null expectation.
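
The construction of a Random-GG can be sketched as drawing one sequence per species from a multicopy orthologous group without regard to the gene tree. The sketch assumes sequence names in species|gene form; the function name, the seeded random number generator, and the example group are hypothetical and do not reproduce the scripts used in this study.

```python
import random


def random_gg(orthogroup, rng):
    """Pick one sequence name per species at random from a multicopy group."""
    by_species = {}
    for name in orthogroup:
        species = name.split("|", 1)[0]
        by_species.setdefault(species, []).append(name)
    # Sort species and copies so that a given seed gives a reproducible draw.
    return [rng.choice(sorted(copies)) for _, copies in sorted(by_species.items())]


# Hypothetical multicopy orthologous group with 3 species.
orthogroup = ["A|g1", "A|g2", "B|g1", "C|g1", "C|g2", "C|g3"]
sample = random_gg(orthogroup, random.Random(0))
print(sample)
```

Because copies are drawn irrespective of orthology, a Random-GG may mix orthologs and paralogs, which is what makes it a useful null expectation.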

Table 1. SC-OGs and SNAP-OGs are more accurate and have less compositional biases than Random-GGs.

https://doi.org/10.1371/journal.pbio.3001827.t001

SC-OGs and SNAP-OGs have similar performances in complex datasets

Complex biological processes and datasets pose a serious challenge for identifying markers for molecular evolution studies. To test the efficacy of OrthoSNAP in scenarios of complex evolutionary histories and datasets, we executed the same workflow described above—ortholog calling, sequence alignment, trimming, tree inference, and SNAP-OG detection—on 3 new datasets: (1) 30 plants known to have complex histories of gene duplication and loss [24–26]; (2) 30 budding yeast species wherein half of the species originated from a hybridization event that gave rise to a whole-genome duplication followed by complex patterns of loss and duplication [27–30]; and (3) 20 choanoflagellate transcriptomes, which contain thousands more transcripts than genes [31,32]; for orthology inference software, multiple transcripts per gene appear similar to artificial gene duplicates.

Corroborating previous results, OrthoSNAP successfully identified SNAP-OGs that can be used downstream for molecular evolution analyses. Specifically, using a species-occupancy threshold of 50% in the plant, budding yeast, and choanoflagellate datasets, 653, 1,334, and 2,087 SNAP-OGs were identified, respectively (Table 2). In comparison, 15 SC-OGs were identified in the plant dataset; 2,782 in the budding yeast dataset; and 390 in the choanoflagellate dataset. (Note that there are likely more SC-OGs than SNAP-OGs in budding yeasts because their genomes are relatively small and therefore do not have as many duplicate gene copies compared to other lineages, such as plants. Nonetheless, OrthoSNAP still substantially increases the number of markers in a phylogenomic data matrix.) To explore the impact of orthogroup occupancy, SNAP-OGs were also identified using a minimum occupancy threshold of 4 taxa. This resulted in the identification of substantially more SNAP-OGs: 15,854 in plants; 4,199 in budding yeasts; and 11,556 in choanoflagellates. Furthermore, these were substantially higher than the number of SC-OGs identified using a minimum orthogroup occupancy of 4 taxa: 200 in plants; 3,566 in budding yeasts; and 2,438 in choanoflagellates. These findings support previous observations that incorporating OrthoSNAP into ortholog identification workflows can substantially increase the number of available loci.
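
The two occupancy criteria used above (a fraction of the taxa in the dataset versus a fixed minimum of 4 taxa) can be compared with a small sketch. The helper names are hypothetical, and rounding the fractional cutoff up to a whole number of taxa is an assumption of this sketch rather than a documented OrthoSNAP behavior.

```python
import math


def min_taxa_from_fraction(total_taxa, fraction=0.5):
    """Convert a fractional occupancy threshold into a taxon count (rounded up)."""
    return math.ceil(fraction * total_taxa)


def passes_occupancy(tip_names, min_taxa):
    """Check whether a candidate group contains enough distinct species."""
    species = {name.split("|", 1)[0] for name in tip_names}
    return len(species) >= min_taxa


plant_cutoff = min_taxa_from_fraction(30)    # 30 plants, 50% occupancy
choano_cutoff = min_taxa_from_fraction(20)   # 20 choanoflagellates, 50% occupancy

# Hypothetical candidate group covering 4 species.
group = ["sp1|g1", "sp2|g7", "sp3|g2", "sp4|g9"]
print(passes_occupancy(group, plant_cutoff))  # judged against the 50% cutoff
print(passes_occupancy(group, 4))             # judged against a minimum of 4 taxa
```

A group of 4 taxa fails the 50% cutoff in a 30-taxon dataset but passes the minimum of 4, which is consistent with the larger number of SNAP-OGs recovered under the relaxed threshold.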

Table 2. OrthoSNAP identifies SNAP-OGs in complex datasets.

https://doi.org/10.1371/journal.pbio.3001827.t002

SC-OGs and SNAP-OGs have similar patterns of support in a contentious branch in the tree of life

To further evaluate the information content of SNAP-OGs, we compared patterns of support among SC-OGs and SNAP-OGs in a difficult-to-resolve branch in the tree of life. Specifically, we evaluated the support among 3 hypotheses concerning deep evolutionary relationships among eutherian mammals: (1) Xenarthra (eutherian mammals from the Americas) and Afrotheria (eutherian mammals from Africa) are sister to all other Eutheria [33,34]; (2) Afrotheria are sister to all other Eutheria [35,36]; and (3) Xenarthra are sister to a clade of both Afrotheria and Eutheria (Fig 4A). Resolution of this conflict has important implications for understanding the historical biogeography of these organisms. To do so, we first obtained protein-coding gene sequences from 6 Afrotheria, 2 Xenarthra, 12 other Eutheria, and 8 outgroup taxa from NCBI (S7 Table), which represent all annotated and publicly available genome assemblies at the time of this study (S8 Table). Using the protein translations of these gene sequences as input to OrthoFinder, we identified 252 SC-OGs shared across taxa; application of OrthoSNAP identified an additional 1,428 SNAP-OGs, which represents a greater than 5-fold increase in the number of gene markers for this dataset (S8 Table). There was variation in the number of SNAP-OGs identified per orthologous group of genes (S6 Fig). The highest number of SNAP-OGs identified in an orthologous group of genes was 10, which was a gene family of olfactory receptors; olfactory receptors are known to have expanded in the evolutionary history of eutherian mammals [8]. The best-fitting substitution models were similar between SC-OGs and SNAP-OGs (S7 Fig).

Fig 4. SC-OGs and SNAP-OGs display similar patterns of support in a contentious branch concerning deep evolutionary relationships among eutherian mammals.

(A) Two leading hypotheses for the evolutionary relationships among Eutheria, which have implications for the evolution and biogeography of the clade, are that Afrotheria and Xenarthra are sister to all other Eutheria (hypothesis 1; blue) and that Afrotheria are sister to all other Eutheria (hypothesis 2; pink). The third possible, but less well-supported topology, is that Xenarthra are sister to Eutheria and Afrotheria. (B) Comparison of gene support frequency (GSF) values for the 3 hypotheses among 252 SC-OGs and 1,428 SNAP-OGs using an α level of 0.01 revealed no differences in support (p = 0.26, Fisher’s exact test with Benjamini–Hochberg multitest correction). Comparison after accounting for gene tree uncertainty by collapsing bipartitions with ultrafast bootstrap approximation support lower than 75 (SC-OGs collapsed vs. SNAP-OGs collapsed) also revealed no differences (p = 0.05; Fisher’s exact test with Benjamini–Hochberg multitest correction). (C) Examination of the distribution of frequency of topology support using gene-wise log-likelihood scores revealed no difference between SNAP-OGs and SC-OGs support for the 3 topologies (p = 0.52; Fisher’s exact test). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.g004

Two independent tests examining support between alternative hypotheses of deep evolutionary relationships among eutherian mammals revealed similar patterns of support between SC-OGs and SNAP-OGs. More specifically, no differences were observed in gene support frequencies (the number of genes that support 1 of 3 possible hypotheses at a given branch in a phylogeny), either without or with accounting for single-gene tree uncertainty by collapsing branches with low support values (p = 0.26 and p = 0.05, respectively; Fisher’s exact test with Benjamini–Hochberg multitest correction; Fig 4B and S9 Table). A second test of single-gene support was conducted wherein individual gene log likelihoods were calculated for each of the 3 possible topologies, and the frequency of gene-wise support for each topology was determined. No differences were observed in gene support frequency using the log likelihood approach (p = 0.52; Fisher’s exact test). Together, these 2 independent tests of a contentious branch in the tree of life indicate that SC-OGs and SNAP-OGs show similar patterns of support, further supporting the observation that they contain similar phylogenetic information.
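
The gene-wise log-likelihood test can be sketched as a simple tally: each gene is assigned to the topology under which its log likelihood is highest, and the resulting frequencies are then compared between marker sets. The log-likelihood values below are invented for illustration, and the function and key names are hypothetical; the statistical comparison itself (Fisher’s exact test) is not reproduced here.

```python
from collections import Counter


def topology_support(gene_lnl):
    """Count, per topology, the genes whose log likelihood is highest under it."""
    counts = Counter()
    for lnl_by_topology in gene_lnl.values():
        best = max(lnl_by_topology, key=lnl_by_topology.get)
        counts[best] += 1
    return counts


# Invented log likelihoods for 4 genes under the 3 hypotheses.
gene_lnl = {
    "gene1": {"hyp1": -1050.2, "hyp2": -1051.9, "hyp3": -1053.0},
    "gene2": {"hyp1": -2210.7, "hyp2": -2208.1, "hyp3": -2211.4},
    "gene3": {"hyp1": -980.3, "hyp2": -982.6, "hyp3": -981.0},
    "gene4": {"hyp1": -1502.5, "hyp2": -1503.1, "hyp3": -1499.8},
}
support = topology_support(gene_lnl)
print(dict(support))
```

Running the same tally separately on SC-OGs and SNAP-OGs yields two frequency distributions over the 3 topologies, which is the kind of contingency table that the test described above compares.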

In summary, 415,129 orthologous groups of genes across 7 eukaryotic datasets contained 9,821 SC-OGs; application of OrthoSNAP identified an additional 10,704 SNAP-OGs, thereby more than doubling the number of gene markers. Comprehensive comparison of SC-OGs and SNAP-OGs revealed no differences in phylogenetic information content. Strikingly, this observation held true across datasets with complex evolutionary histories and when conducting hypothesis testing in a difficult-to-resolve branch in the tree of life. These findings suggest that SNAP-OGs may be useful for diverse studies of molecular evolution, including genome-wide surveys of selection, phylogenomic investigations, and gene–gene coevolution analyses.

Discussion

Molecular evolution studies typically rely on SC-OGs. Recently developed methods can integrate gene families of orthologs and paralogs into species tree inference but are not designed to broadly facilitate the retrieval of gene markers for molecular evolution analyses. Furthermore, the phylogenetic information content of gene families of orthologs and paralogs remains unknown. These gaps underscore the need both for algorithms that can identify SC-OGs nested within larger gene families, which can, in turn, be incorporated into diverse molecular evolution analyses, and for a comprehensive assessment of their phylogenetic properties.

To address this need, we developed OrthoSNAP, a tree splitting and pruning algorithm that identifies SNAP-OGs, which refers to SC-OGs nested within larger gene families wherein species-specific inparalogs have also been pruned. Comprehensive examination of the phylogenetic information content of SNAP-OGs and SC-OGs from 7 empirical datasets of diverse eukaryotic species revealed that their content is similar. Inclusion of SNAP-OGs increased the size of all 7 datasets, sometimes substantially. We note that our results are qualitatively similar to those reported recently by Smith and colleagues [37], who retrieved SC-OGs nested within larger families from 26 primates and examined their performance in gene tree and species tree inference. Three noteworthy differences are that we also conduct species-specific inparalog trimming, provide user-friendly command-line software for SNAP-OG identification, and evaluate the phylogenetic information content of SNAP-OGs and SC-OGs across 7 diverse phylogenomic datasets. We also note that our algorithm can account for diverse types of paralogy (outparalogs, inparalogs, and species-specific inparalogs), whereas other software like PhyloTreePruner, which only conducts species-specific inparalog trimming [19], and Agalma, which identifies single-copy outparalogs and inparalogs [20], can account for some, but not all, types of paralogs (S10 Table). Another difference between OrthoSNAP and other approaches is that Agalma and PhyloTreePruner both require rooted phylogenies. In contrast, OrthoSNAP will automatically midpoint root phylogenies or accept prerooted phylogenies as input. Furthermore, these algorithms are not designed to handle transcriptomic data wherein multiple transcripts per gene will be interpreted as multicopy orthologs. Thus, OrthoSNAP allows for greater user flexibility and accounts for more diverse scenarios, leading to, at least in some instances, the identification of more loci for downstream analyses (S8 Fig). Notably, these tools also differ from sequence similarity graph-based inference of subgroups of single-copy orthologous genes, such as the algorithm implemented in OMA [21]. In other words, OrthoSNAP identifies subgroups of single-copy orthologous genes by examining evolutionary histories, rather than sequence similarity values. Moreover, examination of evolutionary histories facilitates the identification of species-specific inparalogs. Finally, our results, together with other studies, demonstrate the utility of SC-OGs that are nested within larger families [15,20,37,38].

Despite the ability of OrthoSNAP to identify additional loci for molecular evolution analyses, there were instances wherein SNAP-OGs were not identified in multicopy orthologous groups of genes. We discuss 3 reasons that contribute to why SNAP-OGs could not be identified among some genes—specifically, gene families with sequence data from <50% of the taxa; gene families with complex evolutionary histories (for example, HGT and duplication/loss patterns); and gene families with evolutionary histories that differ from the species tree (for example, due to analytical factors, such as sampling and systematic error, or biological factors, such as lineage sorting or introgression/hybridization [39–41]). Notably, the first reason can, but does not always, result in inability to infer SNAP-OGs and can be, to a certain extent, addressed (e.g., by lowering the orthogroup occupancy threshold in OrthoSNAP), whereas the other 2 reasons are more challenging because they often result in a genuine absence of SC-OGs. Furthermore, the actual number of SC-OGs (either those nested within multicopy orthologs or not) for any given group of organisms is not known, making it difficult to determine how many SNAP-OGs and SC-OGs one should expect to recover. Notably, this issue has long challenged researchers, even when ortholog identification is performed by also taking genome synteny into account [27].

Next, we discuss some practical considerations when using OrthoSNAP. In the present study, we inferred orthology information using OrthoFinder [42], but several other approaches can be used upstream of OrthoSNAP. For example, other graph-based algorithms such as OrthoMCL and OMA [21,43] or sequence similarity-based algorithms such as orthofisher [44] can be used to infer gene families. Similarly, sequence similarity search algorithms like BLAST+ [45], USEARCH [46], and HMMER [47] can be used to retrieve homologous sets of sequences that are used as input for OrthoSNAP. Care should also be taken during the multicopy tree inference step. For example, inferring phylogenies for all orthologous groups of genes may be a computationally expensive task. Rapid tree inference software, such as FastTree or IQ-TREE with the “-fast” parameter [48,49], may expedite these steps, but users should be aware that this may result in a loss of accuracy in inference [50].

We suggest employing “best practices” when inferring groups of putatively orthologous genes, including SNAP-OGs. Specifically, orthology information can be further scrutinized using phylogenetic methods. Orthology inference errors may occur upstream of OrthoSNAP; for example, SNAP-OGs may be susceptible to erroneous inference of orthology during upstream clustering of putatively orthologous genes. One method to identify putatively spurious orthology inference is by identifying long terminal branches [51]. Terminal branches of outlier length can be identified using the “spurious_sequence” function in PhyKIT [52]. Other tools, such as PhyloFisher, UPhO, and other orthology inference pipelines employ similar strategies to refine orthology inference [53–55]. Lastly, we acknowledge that future iterations of OrthoSNAP may benefit from incorporating additional layers of information, such as sequence similarity scores or synteny. Even though OrthoSNAP did identify SNAP-OGs in some complex datasets where synteny has previously been very helpful, such as the budding yeast dataset, other ancient and rapidly evolving lineages may benefit from synteny analysis to dissect complex relationships of orthology [51,56–58].

Taken together, we suggest that OrthoSNAP is useful for retrieving single-copy orthologous groups of genes from gene family data and that the identified SNAP-OGs have similar phylogenetic information content compared to SC-OGs. In combination with other phylogenomic toolkits, OrthoSNAP may be helpful for reconstructing the tree of life and expanding our understanding of the tempo and mode of evolution therein.

Methods

OrthoSNAP availability and documentation

OrthoSNAP is available under the MIT license from GitHub (https://github.com/JLSteenwyk/orthosnap), PyPI (https://pypi.org/project/orthosnap), and the Anaconda cloud (https://anaconda.org/JLSteenwyk/orthosnap). OrthoSNAP is also freely available to use via the LatchBio (https://latch.bio/) cloud-based console (dedicated interface link: https://console.latch.bio/explore/65606/info). The documentation describes the OrthoSNAP algorithm and its parameters and provides user tutorials (https://jlsteenwyk.com/orthosnap).

OrthoSNAP algorithm description and usage

We next describe how OrthoSNAP identifies SNAP-OGs. OrthoSNAP requires 2 files as input: one is a FASTA file that contains 2 or more homologous sequences in 1 or more species and the other the corresponding gene family phylogeny in Newick format. In both the FASTA and Newick files, users must follow a naming scheme—wherein species, strain, or organism identifiers and gene sequences identifiers are separated by a vertical bar (also known as a pipe character or “|”)—which allows OrthoSNAP to determine which sequences were encoded in the genome of each species, strain, or organism. After initiating OrthoSNAP, the gene family phylogeny is first midpoint rooted (unless the user specifies the inputted phylogeny is already rooted) and then SNAP-OGs are identified using a tree-traversal algorithm. To do so, OrthoSNAP will loop through the internal branches in the gene family phylogeny and evaluate the number of distinct taxa identifiers among children terminal branches. If the number of unique taxon identifiers is greater than or equal to the orthogroup occupancy threshold (default: 50% of total taxa in the inputted phylogeny; users can specify an integer threshold), then all children branches and termini are examined further; otherwise, OrthoSNAP will examine the next internal branch. Next, OrthoSNAP will collapse branches with low support (default: 80, which is motivated by using ultrafast bootstrap approximations [59] to evaluate bipartition support; users can specify an integer threshold) and conduct species-specific inparalog trimming wherein the longest sequence is maintained, a practice common in transcriptomics. However, users can specify whether the shortest sequence or the median sequence (in the case of 3 or more sequences) should be kept instead. Users can also pick which species-specific inparalog to keep based on branch lengths (the longest, shortest, or median branch length in the case of having 3 or more sequences). 
Species-specific inparalogs are defined as sequences encoded in the same genome that are sister to one another or belong to the same polytomy [19]. The resulting set of sequences is examined to determine whether each species, strain, or organism is represented by exactly 1 sequence and to ensure that none of these sequences has already been assigned to a SNAP-OG. If both conditions hold, the sequences are considered a SNAP-OG; if not, OrthoSNAP will examine the next internal branch. When SNAP-OGs are identified, FASTA files of SNAP-OG sequences are outputted. Users can also output the subtree of the SNAP-OG using an additional argument.
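The naming scheme and the occupancy threshold described above can be illustrated with a minimal Python sketch. This is not the OrthoSNAP source code: the function names are hypothetical, and rounding half of the taxa up to the next integer is an assumption about how the 50% default is converted to a count.

```python
import math
from collections import defaultdict


def taxon_of(seq_id: str) -> str:
    """Return the taxon part of a 'taxon|gene' identifier."""
    taxon, sep, gene = seq_id.partition("|")
    if not sep or not taxon or not gene:
        raise ValueError(f"expected 'taxon|gene', got {seq_id!r}")
    return taxon


def group_by_taxon(seq_ids):
    """Map each taxon identifier to the sequences encoded in its genome."""
    groups = defaultdict(list)
    for seq_id in seq_ids:
        groups[taxon_of(seq_id)].append(seq_id)
    return dict(groups)


def default_occupancy_threshold(seq_ids) -> int:
    """Half of the distinct taxa, rounded up (the rounding rule is an assumption)."""
    n_taxa = len({taxon_of(s) for s in seq_ids})
    return math.ceil(n_taxa / 2)
```

Keeping the taxon identifier before the vertical bar is what lets a program decide, from tip labels alone, how many distinct taxa a subtree contains and which tips come from the same genome.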

The principles of the OrthoSNAP algorithm are also described using the following pseudocode:

FOR internal branch in midpoint rooted gene family phylogeny:

> IF orthogroup occupancy among children termini is greater than or equal to orthogroup occupancy threshold;
>> Collapse poorly supported bipartitions and trim species-specific inparalogs;
>> IF each species, strain, or organism among the trimmed set of species, strains, or organisms is represented by only one sequence and no sequences being examined have been assigned to a SNAP-OG yet;
>>> Sequences represent a SNAP-OG and are outputted to a FASTA file
>> ELSE
>>> examine next internal branch
> ELSE
>> examine next internal branch

ENDFOR
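The pseudocode can also be read alongside a small runnable Python sketch. It is an illustration under stated assumptions rather than the OrthoSNAP implementation: the Node class, the function names, and the example tree are hypothetical; the input tree is assumed to be already rooted (midpoint rooting is omitted); internal branches are visited in pre-order; and the longest sequence is kept among species-specific inparalogs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""          # "taxon|gene" for tips; empty for internal nodes
    seq_len: int = 0        # tip sequence length, used to choose among inparalogs
    support: float = 100.0  # support of the branch leading to this node
    children: list = field(default_factory=list)

    def is_tip(self):
        return not self.children


def taxon(name):
    return name.split("|")[0]


def tips(node):
    if node.is_tip():
        return [node]
    return [t for child in node.children for t in tips(child)]


def internal_nodes(node):
    # Pre-order, so larger subtrees are examined before the subtrees nested in them.
    if not node.is_tip():
        yield node
        for child in node.children:
            yield from internal_nodes(child)


def collapse(node, min_support):
    # Copy the subtree, dissolving weakly supported internal branches into polytomies.
    if node.is_tip():
        return node
    children = []
    for child in (collapse(c, min_support) for c in node.children):
        if not child.is_tip() and child.support < min_support:
            children.extend(child.children)
        else:
            children.append(child)
    return Node(support=node.support, children=children)


def trim_inparalogs(node):
    # Bottom-up: among tips that share a parent, keep the longest sequence per taxon.
    if node.is_tip():
        return node
    longest, subtrees = {}, []
    for child in (trim_inparalogs(c) for c in node.children):
        if not child.is_tip():
            subtrees.append(child)
        elif taxon(child.name) not in longest or child.seq_len > longest[taxon(child.name)].seq_len:
            longest[taxon(child.name)] = child
    kept = list(longest.values()) + subtrees
    # A node left with one child is replaced by that child, so nested inparalogs keep merging.
    return kept[0] if len(kept) == 1 else Node(support=node.support, children=kept)


def find_snap_ogs(root, occupancy, min_support=80):
    assigned, snap_ogs = set(), []
    for node in internal_nodes(root):
        if len({taxon(t.name) for t in tips(node)}) < occupancy:
            continue  # too few taxa: examine the next internal branch
        names = [t.name for t in tips(trim_inparalogs(collapse(node, min_support)))]
        single_copy = len(names) == len({taxon(n) for n in names})
        if single_copy and assigned.isdisjoint(names):
            snap_ogs.append(sorted(names))
            assigned.update(names)
    return snap_ogs


# Hypothetical gene family: an ancient duplication gives two clades, each with taxa A, B, and C.
EXAMPLE = Node(children=[
    Node(support=95, children=[
        Node(support=90, children=[Node("A|g1", 300), Node("A|g2", 280)]),
        Node(support=99, children=[Node("B|g1", 310), Node("C|g1", 305)]),
    ]),
    Node(support=97, children=[
        Node("A|g3", 290),
        Node(support=60, children=[Node("B|g2", 295), Node("C|g2", 300)]),
    ]),
])

print(find_snap_ogs(EXAMPLE, occupancy=2))
```

For this example tree, the root fails the single-copy test because every taxon appears at least twice, whereas each of the two duplicated clades passes it: the first after trimming the shorter inparalog A|g2, and the second after collapsing the branch with support 60. The sketch therefore prints `[['A|g1', 'B|g1', 'C|g1'], ['A|g3', 'B|g2', 'C|g2']]`.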

To enhance the user experience, arguments or default values are printed to the standard output, a progress bar informs the user of how much of the analysis has been completed, and the number of SNAP-OGs identified as well as the names of the outputted FASTA files are printed to the standard output.

Development practices and design principles to ensure long-term software stability

Archival instabilities among software threaten the reproducibility of bioinformatics research [60]. To ensure long-term stability of OrthoSNAP, we implemented previously established rigorous development practices and design principles [44,52,61,62]. For example, OrthoSNAP features a refactored codebase, which facilitates debugging, testing, and future development. We also implemented a continuous integration pipeline to automatically build, package, and install OrthoSNAP across Python versions 3.7, 3.8, and 3.9. The continuous integration pipeline also conducts 57 unit and integration tests, which span 95.90% of the codebase and ensure faithful function of OrthoSNAP.

Dataset generation

To generate a dataset for identifying SNAP-OGs and comparing them to SC-OGs, we first identified putative groups of orthologous genes across 4 empirical datasets. To do so, we first downloaded proteomes for each dataset, which were obtained from publicly available repositories on NCBI (S1 and S7 Tables) or figshare [51]. Each dataset varied in its sampling of sequence diversity and in the evolutionary divergence of the sampled taxa. The dataset of 24 budding yeasts spans approximately 275 million years of evolution [51]; the dataset of 36 filamentous fungi spans approximately 94 million years of evolution [63]; the dataset of 26 mammals spans approximately 160 million years of evolution [64]; and the dataset of 28 eutherian mammals—which was used to study the contentious deep evolutionary relationships among eutherian mammals—concerns an ancient divergence that occurred approximately 160 million years ago [65]. Putatively orthologous groups of genes were identified using OrthoFinder, v2.3.8 [42], with default parameters, which resulted in 46,645 orthologous groups of genes with at least 50% orthogroup occupancy (S8 Table).

To infer the evolutionary history of each orthologous group of genes, we first individually aligned and trimmed each group of sequences using MAFFT, v7.402 [66], with the “auto” parameter and ClipKIT, v1.1.3 [61], with the “smart-gap” parameter, respectively. Thereafter, we inferred the best-fitting substitution model using Bayesian information criterion and evolutionary history of each orthologous group of genes using IQ-TREE2, v2.0.6 [49]. Bipartition support was examined using 1,000 ultrafast bootstrap approximations [59].
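As a rough illustration of the alignment, trimming, and tree inference steps, the following Python sketch assembles one command line per step for a single orthologous group. The file names and the helper function are hypothetical, and the flags reflect our reading of each tool's command-line interface ("--auto" for MAFFT, "-m smart-gap" for ClipKIT, and "-m MFP" with "-B 1000" for IQ-TREE 2); they should be checked against the help text of the installed versions. The sketch only builds the commands and does not run them.

```python
def alignment_and_tree_commands(fasta: str) -> dict:
    """Build argument lists for aligning, trimming, and inferring a tree for one gene family."""
    aligned = f"{fasta}.mafft"        # hypothetical output name
    trimmed = f"{aligned}.clipkit"    # hypothetical output name
    return {
        # MAFFT writes the alignment to standard output, so the caller must redirect it to `aligned`.
        "align": ["mafft", "--auto", fasta],
        "trim": ["clipkit", aligned, "-m", "smart-gap", "-o", trimmed],
        # "-m MFP" asks ModelFinder to choose a substitution model before tree search;
        # "-B 1000" requests 1,000 ultrafast bootstrap replicates.
        "tree": ["iqtree2", "-s", trimmed, "-m", "MFP", "-B", "1000"],
    }


for step, command in alignment_and_tree_commands("OG0000001.fa").items():
    print(step, " ".join(command))
```

Each list could be passed to `subprocess.run`, with the standard output of the alignment step written to the file that the trimming step expects.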

To identify SNAP-OGs, the FASTA file and associated phylogenetic tree for each gene family with multiple homologs in 1 or more species were used as input for OrthoSNAP, v0.0.1 (this study). Across 40,011 gene families with multiple homologs in 1 or more species in all datasets, we identified 6,630 SNAP-OGs with at least 50% orthogroup occupancy (S1 Fig and S8 Table). Unaligned sequences of SNAP-OGs were then individually aligned and trimmed using the same strategy as described above. To determine which gene families were SC-OGs, we identified orthologous groups of genes with at least 50% orthogroup occupancy in which each species, strain, or organism was represented by only 1 sequence; 6,634 orthologous groups of genes met these criteria and were considered SC-OGs.
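The distinction between SC-OGs and gene families with multiple homologs can be expressed as a short Python sketch. The function name and return labels are hypothetical, and the sketch assumes that sequence identifiers follow the "taxon|gene" scheme used for OrthoSNAP input.

```python
from collections import Counter


def classify_orthogroup(seq_ids, n_taxa_total, min_occupancy=0.5):
    """Label an orthologous group by occupancy and by copy number per taxon."""
    copies = Counter(seq_id.split("|")[0] for seq_id in seq_ids)
    if len(copies) / n_taxa_total < min_occupancy:
        return "below-occupancy"   # too few taxa represented
    if all(n == 1 for n in copies.values()):
        return "SC-OG"             # every represented taxon has exactly 1 sequence
    return "multicopy"             # candidate input for OrthoSNAP


print(classify_orthogroup(["A|g1", "B|g1"], n_taxa_total=4))          # SC-OG
print(classify_orthogroup(["A|g1", "A|g2", "B|g1"], n_taxa_total=4))  # multicopy
print(classify_orthogroup(["A|g1"], n_taxa_total=4))                  # below-occupancy
```

The counts reported here are consistent with this partition: the 40,011 gene families with multiple homologs and the 6,634 SC-OGs sum to the 46,645 orthologous groups with at least 50% occupancy.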

Measuring and comparing information content among SC-OGs and SNAP-OGs

To compare the information content of SC-OGs and SNAP-OGs, we calculated 9 properties of multiple sequence alignments and phylogenetic trees associated with robust phylogenetic signal in the budding yeasts, filamentous fungi, and mammalian datasets (S4 Table). More specifically, we calculated information content from phylogenetic trees such as measures of tree certainty (average bootstrap support), accuracy (Robinson–Foulds distance; [67]), signal-to-noise ratios (treeness; [68]), and violation of clock-like evolution (degree of violation of a molecular clock or DVMC; [69]). Information content was also measured among multiple sequence alignments by examining alignment length and the number of parsimony-informative sites, which are associated with robust and accurate inferences of evolutionary histories [70] as well as biases in sequence composition (RCV; [68]). Lastly, information content was also evaluated using metrics that consider characteristics of phylogenetic trees and multiple sequence alignments such as the degree of saturation, which refers to multiple substitutions in multiple sequence alignments that underestimate the distance between 2 taxa [71], and treeness/RCV, a measure of signal-to-noise ratios in phylogenetic trees and sequence composition biases [68]. For tree accuracy, phylogenetic trees were compared to species trees reported in previous studies [51,63,64]. All properties were calculated using functions in PhyKIT, v1.1.2 [52]. The function used to calculate each metric and additional information are described in S4 Table.
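Two of the alignment-based properties can be illustrated with a minimal Python sketch. It is not the PhyKIT implementation. A site is counted as parsimony-informative when at least 2 character states each occur in at least 2 sequences, and RCV is computed as the summed absolute deviation of each taxon's character counts from the mean count across taxa, divided by the product of the number of taxa and the number of sites. The treatment of gaps and ambiguous characters (ignored for parsimony-informative sites, counted as ordinary characters for RCV) is an assumption of this sketch.

```python
from collections import Counter


def parsimony_informative_sites(alignment, ignore="-?X"):
    """Count sites where at least 2 states each occur in at least 2 sequences."""
    n_informative = 0
    for column in zip(*alignment.values()):
        counts = Counter(ch for ch in column if ch not in ignore)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            n_informative += 1
    return n_informative


def rcv(alignment):
    """Relative composition variability of an alignment given as {taxon: sequence}."""
    seqs = list(alignment.values())
    n_taxa, n_sites = len(seqs), len(seqs[0])
    per_taxon = [Counter(seq) for seq in seqs]
    total = Counter("".join(seqs))
    deviation = sum(
        abs(counts[ch] - total[ch] / n_taxa)  # a missing character counts as 0
        for ch in total
        for counts in per_taxon
    )
    return deviation / (n_taxa * n_sites)


toy = {"A": "ACGT", "B": "ACGA", "C": "AGGA", "D": "AGTT"}
print(parsimony_informative_sites(toy))  # columns 2 and 4 qualify
print(rcv(toy))
```

In the toy alignment, the second column (C, C, G, G) and the fourth column (T, A, A, T) are parsimony-informative, whereas the first column is invariant and the third contains a state found in only 1 sequence.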

Principal component analysis across the 9 properties that summarize phylogenetic information content was used to qualitatively compare SC-OGs and SNAP-OGs in reduced dimensional space. Principal component analysis, visualization, and determination of property contribution to each principal component were conducted using factoextra, v1.0.7 [72], and FactoMineR, v2.4 [73], in the R, v4.0.2 (https://cran.r-project.org/), programming environment. A multifactor ANOVA, conducted using the aov() function in R, was used to quantitatively compare SC-OGs and SNAP-OGs.

Information theory-based approaches were used to evaluate incongruence among SC-OGs and SNAP-OGs phylogenetic trees. More specifically, we calculated tree certainty and tree certainty-all [74–76], which are conceptually similar to entropy values and are derived from examining support among a set of gene trees and the 2 most supported topologies or all topologies that occur with a frequency of ≥5%, respectively. More simply, tree certainty values range from 0 to 1 in which low values are indicative of low congruence among gene trees and high values are indicative of high congruence among gene trees. Tree certainty and tree certainty-all values were calculated using RAxML, v8.2.10 [77].
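The entropy-like nature of these measures can be made concrete with a short Python sketch of internode certainty, on which tree certainty is built. This is not the RAxML implementation: the function names are hypothetical, the handling of an internode without conflict is a simplification, and taking the mean across internodes as the relative (0 to 1) value reflects our understanding of how tree certainty is normalized.

```python
import math


def internode_certainty(counts):
    """Certainty for one internode from gene-tree counts of conflicting bipartitions.

    Pass the 2 most supported bipartitions for internode certainty, or every
    bipartition found in at least 5% of gene trees for the "-all" variant.
    """
    counts = [c for c in counts if c > 0]
    n = len(counts)
    if n < 2:
        return 1.0  # no conflicting bipartition: complete certainty
    total = sum(counts)
    # 1 + sum(p * log_n(p)): 0 when all bipartitions are equally supported.
    return 1.0 + sum((c / total) * math.log(c / total, n) for c in counts)


def relative_tree_certainty(internode_values):
    """Mean certainty across internodes (assumed normalization)."""
    return sum(internode_values) / len(internode_values)


print(internode_certainty([50, 50]))   # equal support for 2 bipartitions
print(internode_certainty([90, 10]))   # strong support for 1 bipartition
```

Equal support for 2 conflicting bipartitions gives a value of 0, and the value approaches 1 as the gene trees increasingly agree on a single bipartition.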

To examine patterns of support in a contentious branch concerning deep evolutionary relationships among eutherian mammals, we calculated gene support frequencies and ΔGLS. Gene support frequencies were calculated using the “polytomy_test” function in PhyKIT, v1.1.2 [52]. To account for uncertainty in gene tree topology, we also examined patterns of gene support frequencies after collapsing bipartitions with ultrafast bootstrap approximation support lower than 75 using the “collapse” function in PhyKIT. To calculate gene-wise log likelihood values, partition log-likelihoods were calculated using the “wpl” parameter in IQ-TREE2 [49], which required as input a phylogeny in Newick format that represented either hypothesis 1, 2, or 3 (Fig 4A) and a concatenated alignment of SC-OGs and SNAP-OGs with partition information. Thereafter, the log likelihood values were used to assign genes to the topology they best supported. Inconclusive genes, defined as having a gene-wise log likelihood difference of less than 0.01, were removed.
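The assignment of genes to hypotheses can be sketched in Python as follows. The function name and the example values are hypothetical. The sketch also assumes that the log likelihood difference used to flag inconclusive genes is the difference between the best and the second-best hypothesis, which the description above does not specify.

```python
def assign_genes(gene_lnl, min_delta=0.01):
    """Assign each gene to its best-supported hypothesis and drop inconclusive genes.

    gene_lnl maps a gene name to {hypothesis: gene-wise log likelihood}.
    """
    assigned = {}
    for gene, lnl_by_hypothesis in gene_lnl.items():
        ranked = sorted(lnl_by_hypothesis.items(), key=lambda item: item[1], reverse=True)
        (best, best_lnl), (_, second_lnl) = ranked[0], ranked[1]
        if best_lnl - second_lnl < min_delta:
            continue  # inconclusive: the two best hypotheses are nearly tied
        assigned[gene] = best
    return assigned


# Hypothetical gene-wise log likelihoods under the 3 hypotheses.
example = {
    "gene1": {"H1": -1000.0, "H2": -1002.5, "H3": -1001.0},
    "gene2": {"H1": -500.0, "H2": -500.005, "H3": -510.0},
}
print(assign_genes(example))
```

In this example, gene1 is assigned to hypothesis 1, and gene2 is removed because its 2 best hypotheses differ by less than 0.01 log likelihood units.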

The same methodologies (orthology inference, multiple-sequence alignment, trimming, tree inference, SNAP-OG identification, and phylogenetic information content calculations) were also applied to 3 additional complex datasets: 30 plants (with a history of extensive gene duplication and loss events), 30 budding yeast species (15 of which experienced whole-genome duplication), and 20 choanoflagellate transcriptomes (where typically multiple transcripts correspond to a single protein-coding gene) [31,32].

Supporting information

S1 Fig. Numbers of orthogroups, single-copy orthogroups, orthogroups with 1 or more homologs in 1 species, and the number of SNAP-OGs identified for each dataset.

(A) The total number of orthogroups with at least 50% ortholog occupancy for each dataset. (B) The number of single-copy orthologs (SC-OGs) for each dataset (with at least 50% taxon occupancy). (C) The number of multicopy orthologs (or orthologous groups of genes wherein 1 or more species is represented by 2 or more sequences; MC-OGs) for each dataset (with at least 50% taxon occupancy). (D) The number of SNAP-OGs identified in each dataset (with at least 50% taxon occupancy). Note that the numbers depicted in panel A reflect the sum of the numbers of SC-OGs and MC-OGs in panels B and C. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s001

(TIF)

S2 Fig. The number of SNAP-OGs identified in orthologous groups of genes with 2 or more homologs in 1 or more species.

The number of SNAP-OGs per orthologous group of genes is depicted on the x-axis. For example, in the budding yeasts dataset, 977 gene families had 1 SNAP-OG each. The highest number of SNAP-OGs identified in a single orthologous group of genes in each dataset were as follows: in budding yeasts, 5 SNAP-OGs were identified in 1 orthologous group of genes that encode transcriptional activators; in filamentous fungi, 5 SNAP-OGs were identified in each of 2 orthologous groups of genes that encode multifacilitator superfamily transporters and amino acid permeases; and in mammals, 4 SNAP-OGs were identified in each of 3 orthologous groups of genes that encode voltage-gated potassium channels, casein kinases, and a tropomyosin family of actin-binding proteins. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s002

(TIF)

S3 Fig. The 10 most frequent best-fitting substitution models are similar between SC-OGs and SNAP-OGs.

The top 10 most frequently observed best-fitting substitution models were similar between SC-OGs and SNAP-OGs among (A) 1,668 SC-OGs and 1,392 SNAP-OGs in budding yeasts, (B) 4,393 SC-OGs and 2,035 SNAP-OGs in filamentous fungi, and (C) 321 SC-OGs and 1,775 SNAP-OGs in mammals. For example, the LG+F+I+G4 model was the most frequently observed best-fitting substitution model in SC-OGs and SNAP-OGs from budding yeasts. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s003

(TIF)

S4 Fig. Distributions of information content among SNAP-OGs and SC-OGs.

Boxplot and violin plot distributions of 9 properties representative of phylogenetic information are depicted for SNAP-OGs (blue) and SC-OGs (orange) among (A) 1,668 SC-OGs and 1,392 SNAP-OGs in budding yeasts, (B) 4,393 SC-OGs and 2,035 SNAP-OGs in filamentous fungi, and (C) 321 SC-OGs and 1,775 SNAP-OGs in mammals. Abbreviations are as follows: average bootstrap support (ABS), degree of violation of the molecular clock (DVMC), relative composition variability (RCV), Robinson-Foulds distance (RF distance), alignment length (Aln. len.), the number of parsimony informative sites (PI sites), saturation, treeness (tness), and treeness/RCV (tness/RCV). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s004

(TIF)

S5 Fig. Quality of representation and contributions of properties of phylogenetic information content during principal component analysis.

Principal component analysis was used to qualitatively compare the similarities and differences between SNAP-OGs and SC-OGs (Fig 3). The leftmost figure in each panel of budding yeasts (A), filamentous fungi (B), and mammals (C) represents the quality of representation for each property across all principal components. The next 2 figures depict the contribution of each property (or variable) to the first and second dimension in reduced dimensional space. The red dashed line represents equal contributions from each variable. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s005

(TIF)

S6 Fig. The number of SNAP-OGs identified in an orthologous group of genes with 2 or more homologs in 1 or more species for the dataset used to examine a contentious branch in the tree of life.

The number of SNAP-OGs per orthologous group of genes is depicted on the x-axis. For example, a single SNAP-OG was identified in 1,330 gene families with 2 or more homologs in 1 or more species, whereas 4 SNAP-OGs were identified in 2 gene families with 2 or more homologs in 1 or more species. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s006

(TIF)

S7 Fig. The 10 most frequently observed best-fitting substitution models are similar between SC-OGs and SNAP-OGs in the dataset used to examine a contentious branch in the tree of life.

Similar best-fitting substitution models were observed between 252 SC-OGs and 1,428 SNAP-OGs in a dataset of mammals, which was used to investigate patterns of support in a contentious branch in the tree of life concerning deep evolutionary relationships among placental mammals. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s007

(TIF)

S8 Fig. Cartoon comparison of different tree decomposition algorithms.

Using the phylogeny presented in Fig 1B (panel A) and Fig 2B (panel B), different tree decomposition algorithms are compared. (A) OrthoSNAP will identify 4 SNAP-OGs, whereas DISCO and the maximally inclusive strategies will each identify 3 subgroups of orthologous genes. PhyloTreePruner will not identify any subgroups of single-copy orthologous genes. (B) OrthoSNAP will identify 5 subgroups of single-copy orthologous genes (light blue) by identifying maximally inclusive subgroups—subtrees where each taxon is represented by a single sequence—and maximally inclusive subgroups after species-specific inparalog trimming (species-specific inparalogs are shown in orange). In contrast, DISCO and maximally inclusive strategies will identify 3 SC-OGs, in part, because they do not account for species-specific inparalogs. PhyloTreePruner, which only prunes species-specific inparalogs, will not identify any subgroups of single-copy orthologous genes due to the presence of more ancient duplication events.

https://doi.org/10.1371/journal.pbio.3001827.s008

(TIF)

S1 Table. Species and accession numbers for proteomes used in each dataset.

This table details the species used for the budding yeasts, filamentous fungi, and mammalian datasets. All proteomes from budding yeasts were downloaded from Shen and colleagues [51]. Proteomes from filamentous fungi and mammals were downloaded from NCBI, and their accessions and assembly names are provided.

https://doi.org/10.1371/journal.pbio.3001827.s009

(XLSX)

S2 Table. Number of orthogroups examined.

A table of the number of orthogroups, the number of SC-OGs, the number of gene families with orthologs and paralogs (MC-OGs), and the number of SNAP-OGs examined in the present study.

https://doi.org/10.1371/journal.pbio.3001827.s010

(XLSX)

S3 Table. Ortholog occupancy for each dataset.

A table summarizing the average and standard deviation of taxon completeness in SC-OGs and SNAP-OGs.

https://doi.org/10.1371/journal.pbio.3001827.s011

(XLSX)

S4 Table. Nine properties of phylogenetic information content.

Phylogenetic information content of SC-OGs and SNAP-OGs was examined using the 9 properties described here. The abbreviation, description, additional notes, and function in PhyKIT used to calculate each property are listed here.

https://doi.org/10.1371/journal.pbio.3001827.s012

(XLSX)

S5 Table. Multifactor analysis of variance results reveals no substantial differences between SC-OGs and SNAP-OGs.

Degrees of freedom, sum of squares, mean square, F-value, and p-value for multifactorial analysis of variance are shown here. Multifactorial analysis of variance was conducted both while accounting for potential interaction effects and using an additive model, which does not account for interaction effects.

https://doi.org/10.1371/journal.pbio.3001827.s013

(XLSX)

S6 Table. Tree certainty and tree certainty-all results.

Examining tree certainty and tree certainty-all revealed similar levels of incongruence among gene trees inferred using SC-OGs and SNAP-OGs.

https://doi.org/10.1371/journal.pbio.3001827.s014

(XLSX)

S7 Table. Dataset for examining deep evolutionary relationships among eutherian mammals.

The NCBI accession, assembly name, name in files, and ingroup/outgroup designations are detailed here for each proteome used.

https://doi.org/10.1371/journal.pbio.3001827.s015

(XLSX)

S8 Table. Number of orthogroups examined among eutherian mammals.

A table of the number of orthogroups, the number of SC-OGs, the number of gene families with orthologs and paralogs (MC-OGs), and the number of SNAP-OGs examined among eutherian mammals.

https://doi.org/10.1371/journal.pbio.3001827.s016

(XLSX)

S9 Table. Gene support frequency results among ancient eutherian mammalian relationships.

Gene support frequency results reveal similar levels of support between the 3 hypotheses concerning deep evolutionary divergences among mammals. Multitest corrected p-values are also shown here.

https://doi.org/10.1371/journal.pbio.3001827.s017

(XLSX)

S10 Table. Comparison between different algorithms that identify subgroups of orthologous genes or conduct species-specific inparalog trimming.

Notably, OrthoSNAP provides the most user flexibility and handles the most use cases.

https://doi.org/10.1371/journal.pbio.3001827.s018

(XLSX)

Acknowledgments

We thank the Rokas lab for helpful discussion and feedback.

References

1. Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. pmid:14574403
2. Jeffares DC, Tomiczek B, Sojo V, dos Reis M. A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome. 2015. p. 65–90.
3. Steenwyk JL, Phillips MA, Yang F, Date SS, Graham TR, Berman J, et al. A gene coevolution network provides insight into eukaryotic cellular and genomic structure and function. bioRxiv. 2021; 2021.07.09.451830.
4. Li Z, De La Torre AR, Sterck L, Cánovas FM, Avila C, Merino I, et al. Single-Copy Genes as Molecular Markers for Phylogenomic Studies in Seed Plants. Genome Biol Evol. 2017;9:1130–1147. pmid:28460034
5. Dong Y, Chen S, Cheng S, Zhou W, Ma Q, Chen Z, et al. Natural selection and repeated patterns of molecular evolution following allopatric divergence. Elife. 2019;8. pmid:31373555
6. Wu J, Yonezawa T, Kishino H. Rates of Molecular Evolution Suggest Natural History of Life History Traits and a Post-K-Pg Nocturnal Bottleneck of Placentals. Curr Biol. 2017;27:3025–3033.e5. pmid:28966093
7. Malnic B, Godfrey PA, Buck LB. The human olfactory receptor gene family. Proc Natl Acad Sci. 2004;101:2584–2589. pmid:14983052
8. Niimura Y, Matsui A, Touhara K. Extreme expansion of the olfactory receptor gene repertoire in African elephants and evolutionary dynamics of orthologous gene groups in 13 placental mammals. Genome Res. 2014;24:1485–1496. pmid:25053675
9. Ozcan S, Johnston M. Function and regulation of yeast hexose transporters. Microbiol Mol Biol Rev. 1999;63:554–569. pmid:10477308
10. Wingender E, Schoeps T, Dönitz J. TFClass: an expandable hierarchical classification of human transcription factors. Nucleic Acids Res. 2013;41:D165–D170. pmid:23180794
11. Emms DM, Kelly S. STAG: Species Tree Inference from All Genes. bioRxiv. 2018;267914.
12. Thomas GWC, Dohmen E, Hughes DST, Murali SC, Poelchau M, Glastad K, et al. Gene content evolution in the arthropods. Genome Biol. 2020;21:15. pmid:31969194
13. Smith ML, Hahn MW. New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet. 2021;37:174–187. pmid:32921510
14. Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Thorne J, editor. Mol Biol Evol. 2020;37:3292–3307. pmid:32886770
15. Willson J, Roddur MS, Liu B, Zaharias P, Warnow T. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition. Hahn M, editor. Syst Biol. 2021. pmid:34450658
16. Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss. bioRxiv. 2021; 2021.03.29.437460.
17. Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013;23:323–330. pmid:23132911
18. de Oliveira Martins L, Posada D. Species Tree Estimation from Genome-Wide Data with guenomu. 2017. p. 461–478.
19. Kocot KM, Citarella MR, Moroz LL, Halanych KM. PhyloTreePruner: A phylogenetic tree-based approach for selection of orthologous sequences for phylogenomics. Evol Bioinform Online. 2013;2013:429–435. pmid:24250218
20. Dunn CW, Howison M, Zapata F. Agalma: an automated phylogenomics workflow. BMC Bioinformatics. 2013;14:330. pmid:24252138
21. Train C-M, Glover NM, Gonnet GH, Altenhoff AM, Dessimoz C. Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference. Bioinformatics. 2017;33:i75–i82. pmid:28881964
22. Schuh RT, Polhemus JT. Analysis of Taxonomic Congruence among Morphological, Ecological, and Biogeographic Data Sets for the Leptopodomorpha (Hemiptera). Syst Biol. 1980;29:1–26.
23. Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. pmid:12878457
24. Defoort J, Van de Peer Y, Carretero-Paulet L. The evolution of gene duplicates in angiosperms and the impact of protein-protein interactions and the mechanism of duplication. Golding B, editor. Genome Biol Evol. 2019. pmid:31364708
25. De Smet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van de Peer Y. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc Natl Acad Sci. 2013;110:2898–2903. pmid:23382190
26. Panchy N, Lehti-Shiu M, Shiu S-H. Evolution of Gene Duplication in Plants. Plant Physiol. 2016;171:2294–2316. pmid:27288366
27. Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature. 2006;440:341–345. pmid:16541074
28. Wolfe KH. Origin of the Yeast Whole-Genome Duplication. PLoS Biol. 2015;13:e1002221. pmid:26252643
29. Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997;387:708–713. pmid:9192896
30. Marcet-Houben M, Gabaldón T. Beyond the Whole-Genome Duplication: Phylogenetic Evidence for an Ancient Interspecies Hybridization in the Baker’s Yeast Lineage. Hurst LD, editor. PLoS Biol. 2015;13:e1002220. pmid:26252497
31. Richter DJ, Fozouni P, Eisen MB, King N. Gene family innovation, conservation and loss on the animal stem lineage. Elife. 2018;7. pmid:29848444
32. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–652. pmid:21572440
33. Hallström BM, Kullberg M, Nilsson MA, Janke A. Phylogenomic Data Analyses Provide Evidence that Xenarthra and Afrotheria Are Sister Groups. Mol Biol Evol. 2007;24:2059–2068. pmid:17630282
34. Wildman DE, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, et al. Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci. 2007;104:14395–14400. pmid:17728403
35. Murphy WJ. Resolution of the Early Placental Mammal Radiation Using Bayesian Phylogenetics. Science. 2001;294:2348–2351. pmid:11743200
36. Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O’Brien SJ. Molecular phylogenetics and the origins of placental mammals. Nature. 2001;409:614–618. pmid:11214319
37. Smith ML, Vanderpool D, Hahn MW. Using all gene families vastly expands data available for phylogenomic inference in primates. bioRxiv 2021; 2021.09.22.461252.
38. van der Heijden RT, Snel B, van Noort V, Huynen MA. Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics. 2007;8:83. pmid:17346331
39. Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346:1320–1331. pmid:25504713
40. Steenwyk JL, Lind AL, Ries LNA, dos Reis TF, Silva LP, Almeida F, et al. Pathogenic Allodiploid Hybrids of Aspergillus Fungi. Curr Biol. 2020;30:2495–2507.e7. pmid:32502407
41. Meleshko O, Martin MD, Korneliussen TS, Schröck C, Lamkowski P, Schmutz J, et al. Extensive Genome-Wide Phylogenetic Discordance Is Due to Incomplete Lineage Sorting and Not Ongoing Introgression in a Rapidly Radiated Bryophyte Genus. Mol Biol Evol. 2021;38:2750–2766. pmid:33681996
42. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. pmid:31727128
43. Li L, Stoeckert CJ, Roos DS. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. pmid:12952885
44. Steenwyk JL, Rokas A. orthofisher: a broadly applicable tool for automated gene identification and retrieval. Comeron JM, editor. G3 (Bethesda). 2021;11. pmid:34544141
45. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. pmid:20003500
46. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. pmid:20709691
47. Eddy SR. Accelerated Profile HMM Searches. Pearson WR, editor. PLoS Comput Biol. 2011;7:e1002195. pmid:22039361
48. Price MN, Dehal PS, Arkin AP. FastTree 2—Approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5. pmid:20224823
49. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Teeling E, editor. Mol Biol Evol. 2020;37:1530–1534. pmid:32011700
50. Zhou X, Shen X-X, Hittinger CT, Rokas A. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets. Mol Biol Evol. 2018;35:486–503. pmid:29177474
51. Shen X-X, Opulente DA, Kominek J, Zhou X, Steenwyk JL, Buh KV, et al. Tempo and Mode of Genome Evolution in the Budding Yeast Subphylum. Cell. 2018;175:1533–1545.e20. pmid:30415838
52. Steenwyk JL, Buida TJ, Labella AL, Li Y, Shen X-X, Rokas A. PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data. Schwartz R, editor. Bioinformatics (Oxford, England). 2021. pmid:33560364
53. Tice AK, Žihala D, Pánek T, Jones RE, Salomaki ED, Nenarokov S, et al. PhyloFisher: A phylogenomic package for resolving eukaryotic relationships. Hejnol A, editor. PLoS Biol. 2021;19:e3001365. pmid:34358228
54. Ballesteros JA, Hormiga G. A New Orthology Assessment Method for Phylogenomic Data: Unrooted Phylogenetic Orthology. Mol Biol Evol. 2016;33:2117–2134. pmid:27189539
55. Yang Y, Smith SA. Orthology Inference in Nonmodel Organisms Using Transcriptomes and Low-Coverage Genomes: Improving Accuracy and Matrix Occupancy for Phylogenomics. Mol Biol Evol. 2014;31:3081–3092. pmid:25158799
56. Shen X-X, Steenwyk JL, LaBella AL, Opulente DA, Zhou X, Kominek J, et al. Genome-scale phylogeny and contrasting modes of genome evolution in the fungal phylum Ascomycota. Sci Adv. 2020;6:eabd0079. pmid:33148650
57. Steenwyk JL, Opulente DA, Kominek J, Shen X-X, Zhou X, Labella AL, et al. Extensive loss of cell-cycle and DNA repair genes in an ancient lineage of bipolar budding yeasts. Kamoun S, editor. PLoS Biol. 2019;17:e3000255. pmid:31112549
58. Vakirlis N, Sarilar V, Drillon G, Fleiss A, Agier N, Meyniel J-P, et al. Reconstruction of ancestral chromosome architecture and gene repertoire reveals principles of genome evolution in a model yeast genus. Genome Res. 2016;26:918–932. pmid:27247244
59. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018;35:518–522. pmid:29077904
60. Mangul S, Martin LS, Eskin E, Blekhman R. Improving the usability and archival stability of bioinformatics software. Genome Biol. 2019;20:47. pmid:30813962
61. Steenwyk JL, Buida TJ, Li Y, Shen X-X, Rokas A. ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. Hejnol A, editor. PLoS Biol. 2020;18:e3001007. pmid:33264284
62. Steenwyk JL, Buida TJ, Gonçalves C, Goltz DC, Morales G, Mead ME, et al. BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data. Stajich J, editor. Genetics. 2022. pmid:35536198
63. Steenwyk JL, Shen X-X, Lind AL, Goldman GH, Rokas A. A Robust Phylogenomic Time Tree for Biotechnologically and Medically Important Fungi in the Genera Aspergillus and Penicillium. Boyle JP, editor. MBio. 2019;10. pmid:31289177
64. Tarver JE, dos Reis M, Mirarab S, Moran RJ, Parker S, O’Reilly JE, et al. The Interrelationships of Placental Mammals and the Limits of Phylogenetic Inference. Genome Biol Evol. 2016;8:330–344. pmid:26733575
65. Luo Z-X, Yuan C-X, Meng Q-J, Ji Q. A Jurassic eutherian mammal and divergence of marsupials and placentals. Nature. 2011;476:442–445. pmid:21866158
66. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30:772–780. pmid:23329690
67. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–147.
68. Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. pmid:12878457
69. Liu L, Zhang J, Rheindt FE, Lei F, Qu Y, Wang Y, et al. Genomic evidence reveals a radiation of placental mammals uninterrupted by the KPg boundary. Proc Natl Acad Sci. 2017;114:E7282–E7290. pmid:28808022
70. Shen X-X, Salichos L, Rokas A. A Genome-Scale Investigation of How Sequence, Function, and Tree-Based Gene Properties Influence Phylogenetic Inference. Genome Biol Evol. 2016;8:2565–2580. pmid:27492233
71. Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, et al. Resolving Difficult Phylogenetic Questions: Why More Sequences Are Not Enough. Penny D, editor. PLoS Biol. 2011;9:e1000602. pmid:21423652
72. Kassambara A, Mundt F. factoextra. R package, v. 1.0.5. 2017.
73. Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25:1–18.
74. Salichos L, Rokas A. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature. 2013;497:327–331. pmid:23657258
75. Salichos L, Stamatakis A, Rokas A. Novel Information Theory-Based Measures for Quantifying Incongruence among Phylogenetic Trees. Mol Biol Evol. 2014;31:1261–1271. pmid:24509691
76. Kobert K, Salichos L, Rokas A, Stamatakis A. Computing the Internode Certainty and Related Measures from Partial Gene Trees. Mol Biol Evol. 2016;33:1606–1617. pmid:26915959
77. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. pmid:24451623
78. Song S, Liu L, Edwards SV, Wu S. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Natl Acad Sci. 2012;109:14942–14947. pmid:22930817
79. Doyle VP, Young RE, Naylor GJP, Brown JM. Can We Identify Genes with Increased Phylogenetic Reliability? Syst Biol. 2015;64:824–837. pmid:26099258

