Table 1.
Single-matrix models underestimate the level of compositional heterogeneity and overestimate the site-specific biochemical diversity of the informational gene alignments.
Figure 1.
Unrooted phylogeny of RNAP2 based on Bayesian analysis of 80 sequences of 272 amino acid positions performed with PhyloBayes under the CAT60 model.
Detailed parameters are given in the Materials and Methods section. Assuming that the root of the tree lies outside the viruses and eukaryotes, the NCLDV sequences (red) are not monophyletic but form three groups: one branch located between the archaeal (green) and the eukaryotic (blue) sequences, one branch emerging from within the eukaryotes, and one branch comprising the Emiliania huxleyi virus. Bacterial sequences are shown in purple, and metagenomic sequences of unknown organismal origin are shown in black. The indicated branch support values are posterior probabilities, and the bar represents 0.3 substitutions per site.
Figure 2.
Unrooted phylogeny of TFIIB based on Bayesian analysis of 30 sequences of 162 amino acid positions performed with PhyloBayes under the CAT60 model.
Detailed parameters are given in the Materials and Methods section. This tree shows a polytomy in which the relationships among the different eukaryotic groups and NCLDV lineages are not resolved at a posterior probability threshold of 0.5. This lack of resolution beyond the eukaryote/prokaryote split is typical of the topologies recovered for this gene under the models that passed our tests. Archaeal sequences are shown in green, and the black sequence represents a metagenomic sequence of unknown organismal origin. The indicated branch support values are posterior probabilities, and the bar represents 0.3 substitutions per site.
Figure 3.
Unrooted phylogeny of PCNA based on Bayesian analysis of 40 sequences of 178 Dayhoff-recoded amino acid positions performed with p4 with an additional base composition vector.
Detailed parameters are given in the Materials and Methods section. The NCLDV sequences (red) and metagenomic sequences (black) emerge as a single group from within the eukaryotes, with the exception of the Emiliania huxleyi virus. Archaeal sequences are in green. The indicated branch support values are posterior probabilities, and the bar represents 0.3 substitutions per site.
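As an illustration of what Dayhoff recoding does to an alignment column, the following minimal sketch collapses the 20 amino acids into six classes. It assumes the commonly used six-category Dayhoff grouping (AGPST, DENQ, HKR, ILMV, FWY, C). The digit labels, the function name `dayhoff_recode`, and the treatment of unrecognized characters as gaps are assumptions made for this example, not the settings used for the analysis; those are described in the Materials and Methods section.

```python
# Assumed six-category Dayhoff grouping; the digit labels are arbitrary.
DAYHOFF_GROUPS = {
    "AGPST": "1",  # small
    "DENQ": "2",   # acidic and amide
    "HKR": "3",    # basic
    "ILMV": "4",   # aliphatic
    "FWY": "5",    # aromatic
    "C": "6",      # cysteine
}

# Flatten the grouping into a per-residue lookup table.
RECODE = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def dayhoff_recode(seq, unknown="-"):
    """Map each residue to its Dayhoff class.

    Any character outside the 20 standard amino acids (gaps, X, and so on)
    is replaced by `unknown`. This is a simplification for the sketch.
    """
    return "".join(RECODE.get(aa, unknown) for aa in seq.upper())
```

Recoding discards substitutions within a class, so it reduces the compositional signal in the data at the cost of some phylogenetic information.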
Figure 4.
Unrooted phylogeny of FEN based on Bayesian analysis of 37 sequences of 215 amino acid positions performed with PhyloBayes under the CAT60 model.
Detailed parameters are given in the Materials and Methods section. The NCLDV sequences (red) and metagenomic sequences of unknown organismal origin (black) emerge as a single group from within the eukaryotes (blue), with the exclusion of the Emiliania huxleyi virus. Archaeal sequences are in green. The indicated branch support values are posterior probabilities, and the bar represents 0.6 substitutions per site.
Table 2.
Support for the 4th domain of life hypothesis [18] from analyses of informational genes using different models.
Figure 5.
Observed (solid lines) and predicted (dashed lines) numbers of homoplasic events per site for the RNAP2 alignment under the JTT (purple) and CAT60 (black) models.
This case illustrates the pattern seen for three of the four genes (see Table 3): under the CAT model, which predicts substantially higher levels of homoplasy, there is good agreement (P = 0.38) between the observed and predicted distributions. By contrast, the JTT model predicts significantly less homoplasy than it observes (P = 0.018), and the means of both of its distributions are lower than under CAT60. This suggests that CAT60 anticipates, and is better able to account for, higher levels of homoplasy in the data.
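The comparison of observed and predicted homoplasy can be summarized as a posterior predictive P value. The sketch below shows one common way to compute such a value, assuming that each posterior sample yields a paired observed and predicted statistic; the function name and the one-sided definition are assumptions for illustration and may differ from the procedure used to obtain the values reported here.

```python
def posterior_predictive_p(observed, predicted):
    """Fraction of posterior samples in which the predicted statistic
    is at least as large as the observed one (one-sided, assumed form).

    `observed` and `predicted` are equal-length sequences of per-sample
    statistics, e.g. mean homoplasic events per site.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    exceed = sum(1 for o, p in zip(observed, predicted) if p >= o)
    return exceed / len(observed)
```

Under this definition, a small value indicates that the model rarely predicts as much homoplasy as it infers from the data, which is the kind of mismatch described for JTT.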
Table 3.
Homoplasy (mean predicted homoplasic events per site ± variance) in each gene under the homogeneous models used in [18] (JTT for RNAP2, WAG for the others) and the UL3 and CAT60 mixture models.