Figure 1.
Presented as proportions of all data, based on number of mitogenomes analyzed (left; 1835 mitogenomes in total) and number of lineages/datasets analyzed (right; 372 lineages in total).
Figure 2.
Example demonstrating discordance between the ND5 and the “supergene” phylogenies based on a specific lineage/dataset.
Here, the two phylogenies for the Coleoptera were significantly different based on the Shimodaira-Hasegawa (SH) test, and the ΔLn L between them was 264.6. For this example, a minimum of four mt protein-coding (PC) genes were required to infer a topology statistically indistinguishable from that of the “supergene” set (i.e., all 13 mt PC genes) when genes were selected by length, although the three mt PC genes that performed “best” (ND5, ND4, ND2) also inferred it. Clades/OTUs in the phylogeny showing different positions and/or relationships between the topologies are connected by colored lines, while those with the same positions and/or relationships are connected by black lines. Bootstrap support values (numbers at nodes) that increased noticeably for clades in the “supergene” topology as compared to ND5 are presented in red. Phylogenies analyzed here utilized amino acid data and were inferred via maximum likelihood (scale bars indicate replacements per site).
Figure 3.
Comparisons between phylogenetic topologies inferred from single mt protein-coding (PC) genes relative to that of the “supergene” set.
Presented as 1) decreases in Ln L (± S.E.M.) compared to the “supergene” phylogeny (i.e., all 13 PC genes, left y-axis) and 2) proportion of lineages where the “supergene” topology was recovered (right y-axis). The three “best” performing genes (i.e., ND5, ND2, ND1) are bolded and italicized along the x-axis. Single-gene phylogenies utilized amino acid data and were inferred via maximum likelihood for 63 datasets with > 40 OTUs and/or having previously published phylogenies (see text for details).
Figure 4.
Comparisons between phylogenetic topologies inferred from concatenated mt protein-coding (PC) gene sets relative to that of the “supergene” set.
A) Average decreases (± S.E.M.) in Ln L between phylogenies based on concatenated alignments of the longest mt protein-coding (PC) genes (from one to 12 genes) relative to the “supergene” (i.e., all 13 mt PC genes) phylogeny. The average gene number at which this difference becomes significantly worse based on Shimodaira-Hasegawa (SH) tests is indicated by a vertical dashed line. B) Regression analysis of the minimum mt PC gene number required to infer a topology statistically indistinguishable from that of the “supergene” set relative to the number of OTUs per lineage. Phylogenies analyzed in A and B utilized amino acid data and were inferred for the 372 lineages examined in this study via maximum likelihood.
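As an illustration of the summary statistic plotted in panel A, the following minimal sketch computes a mean ± S.E.M. from per-lineage decreases in Ln L. It assumes the per-lineage values are already available and that the S.E.M. uses the sample standard deviation (n − 1 denominator); the function name and the input values are hypothetical and are not taken from the study.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for a set of values.

    Assumption: S.E.M. = sample standard deviation / sqrt(n).
    With fewer than two values the S.E.M. is undefined and NaN is returned.
    """
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return mean, sem


# Hypothetical decreases in Ln L for three lineages at one gene-set size.
example_mean, example_sem = mean_and_sem([2.0, 4.0, 6.0])
```

Applying the function once per gene-set size (one to 12 genes) would yield the points and error bars of a plot like panel A.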
Figure 5.
The minimum number of mt protein-coding (PC) genes required to infer the phylogenetic topology of the “supergene” set as a function of taxonomic rank and average % divergence.
A) The number of mt protein-coding (PC) genes minimally required for inferring a topology statistically indistinguishable from that of the “supergene” set (i.e., all 13 mt PC genes) as a function of taxonomic rank of the lineage/dataset. A taxonomic rank of “1” along the x-axis corresponds to a within-species phylogeny, “2” corresponds to within genus, etc. Subphylum was the highest rank analyzed, being assigned a taxonomic rank of 33. Taxonomic rank of each lineage followed the convention of NCBI as indicated in Dataset S1. The minimum numbers of mt PC genes presented here are for phylogenies that utilized amino acid data and were inferred via maximum likelihood. Larger symbols represent multiple datasets, with size being proportional to the number of datasets. B) The number of mt PC genes minimally required for inferring a topology statistically indistinguishable from that of the “supergene” set (i.e., all 13 mt PC genes) as a function of average % divergence of the lineage/dataset. The average % divergence was calculated using infoalign in EMBOSS. The minimum numbers of mt PC genes presented here are for phylogenies that utilized amino acid data and were inferred via maximum likelihood.
Figure 6.
Examples demonstrating instances where the minimum number of mt protein-coding (PC) genes is either apparent or ambiguous.
A) A phylogeny for the Cephalopoda where the number of mt protein-coding (PC) genes minimally required to infer a topology statistically indistinguishable from that of the “supergene” set (i.e., all 13 mt PC genes) is apparent. Shimodaira-Hasegawa (SH) tests indicated that a minimum of four genes was needed. The break is clearly indicated by the large decrease in Ln L compared to the “supergene” phylogeny when utilizing fewer than four genes as opposed to four or more genes. This trend was typical for the majority of lineages examined (93.5% and 93.0% when utilizing amino acid and nucleotide data, respectively). B) A phylogeny for the Haplorrhini where the number of mt PC genes minimally required to infer a topology statistically indistinguishable from the “supergene” topology is ambiguous. In this case, the SH tests indicated that any combination other than three genes recovered the “supergene” topology, as shown by the large decrease in Ln L when just three genes are employed. In such cases, the minimum number of genes that statistically recreated the “supergene” topology was chosen (e.g., a single gene, ND5, for Haplorrhini). This alternative trend was atypical of the data (6.5% and 7.0% of lineages when utilizing amino acid and nucleotide data, respectively). Comparisons presented in A and B are for phylogenies that utilized amino acid data and were inferred via maximum likelihood.
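The selection rule described for panels A and B can be sketched as follows. This is a minimal illustration, assuming the SH-test outcome for each gene-set size is available as a boolean (True when the topology was statistically indistinguishable from the “supergene” topology). The function names and the two outcome tables are hypothetical; the tables only mimic the patterns described in the caption and are not the study's data.

```python
from typing import Mapping, Optional


def minimum_gene_number(sh_indistinguishable: Mapping[int, bool]) -> Optional[int]:
    """Smallest gene count whose topology the SH test could not distinguish
    from the "supergene" topology, or None if no gene count qualifies."""
    passing = [k for k, ok in sh_indistinguishable.items() if ok]
    return min(passing) if passing else None


def is_ambiguous(sh_indistinguishable: Mapping[int, bool]) -> bool:
    """True when some larger gene count fails the test even though a smaller
    one passes, i.e., there is no single clean break."""
    k_min = minimum_gene_number(sh_indistinguishable)
    if k_min is None:
        return False
    return any(not ok for k, ok in sh_indistinguishable.items() if k > k_min)


# Hypothetical outcomes for gene-set sizes 1..12.
# "Apparent" pattern: every set with fewer than four genes fails.
apparent_case = {k: k >= 4 for k in range(1, 13)}
# "Ambiguous" pattern: only the three-gene set fails.
ambiguous_case = {k: k != 3 for k in range(1, 13)}

apparent_min = minimum_gene_number(apparent_case)
ambiguous_min = minimum_gene_number(ambiguous_case)
```

Under these assumed inputs, the apparent pattern has a single break at four genes, whereas the ambiguous pattern resolves to a single gene because the smallest passing gene count is chosen.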
Figure 7.
Agreement metrics using concatenated mt protein-coding (PC) gene sets.
Agreement metrics (± S.E.M.) between topologies based on concatenated alignments of the longest mt protein-coding (PC) genes (from one to 12 genes) relative to the “supergene” (i.e., all 13 mt PC genes) topology. Phylogenies utilized amino acid data and were inferred via maximum likelihood for the 32 lineages with > 40 OTUs examined in this study.
Figure 8.
Comparisons between maximum likelihood (ML) and Bayesian inference (BI) on concatenated mt protein-coding (PC) gene sets relative to that of the “supergene” set.
Average decreases (± S.E.M.) in Ln L between phylogenies based on concatenated alignments of the longest mt protein-coding (PC) genes (from one to 12 genes) relative to the “supergene” (i.e., all 13 mt PC genes) phylogeny. The average gene number at which this difference becomes significantly worse based on Shimodaira-Hasegawa (SH) tests is indicated by a vertical dashed line (as in Figure 4A). Phylogenies utilized amino acid data and were inferred via either ML or BI for the 32 lineages with > 40 OTUs examined in this study. Overall, the minimum number of genes needed to recreate the “supergene” topology did not differ between BI and ML methods (vertical dashed lines; Poisson regression, z = 0.74, df = 1, 31, P = 0.458).
Figure 9.
Comparison between selecting mt protein-coding (PC) genes by individual gene performance vs. by length.
Utilizing the three mt protein-coding (PC) genes that individually performed “best” (based on Figure 3: ND5, ND4, ND2), rather than the three longest mt PC genes (i.e., ND5, COX1, and ND4), resulted in a significantly smaller decrease in Ln L (± S.E.M.) compared to the “supergene” (i.e., all 13 mt PC genes) topology (t-test, t = 2.21, df = 1, 66, P = 0.03). Additionally, the three “best” performing mt PC genes inferred a topology statistically indistinguishable from the “supergene” topology in 31% of datasets, whereas the three longest mt PC genes never inferred the “supergene” topology. Phylogenies utilized amino acid data and were inferred via maximum likelihood for the 67 lineages that originally required more than three mt PC genes when genes were chosen based solely on length (these included the 32 lineages with > 40 OTUs).