Robust Identification of Noncoding RNA from Transcriptomes Requires Phylogenetically-Informed Sampling
Figure 3
Conservation of protein and RNA families.
All available full-length bacterial and archaeal genomes were annotated using Rfam and Pfam models. For each Pfam/Rfam family, RNA-seq species, or taxonomic group, the “phylogenetic distance” is calculated as the maximum SSU rRNA F84 distance (see Methods for details). A. For the Pfam and Rfam families, we compare the levels of conservation as a function of phylogenetic distance using annotations of 2,562 bacterial genomes. E.g. of RNA families are conserved between species from the same family, whereas
of protein families are conserved within the same taxonomic range. B. The barplot shows the distribution of all pairwise distances between the RNA-seq datasets. Eleven pairs (boxed) are in the Goldilocks Zone (see Figure 4 for further analysis). C. The ranges of phylogenetic distances for comparing species from different taxonomic groups.
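The "phylogenetic distance" of a family or taxonomic group, as described in this legend, is the maximum pairwise SSU rRNA F84 distance among its members. A minimal sketch of that step follows, assuming the pairwise distances have already been computed elsewhere. The function name `max_pairwise_distance`, the dictionary layout, and the toy values are hypothetical and are not taken from the study.

```python
from itertools import combinations


def max_pairwise_distance(species, dist):
    """Return the largest pairwise distance among the given species.

    `dist` maps frozenset({a, b}) to a precomputed distance (assumed here
    to be an F84 distance between SSU rRNA sequences).
    """
    species = sorted(set(species))
    if len(species) < 2:
        # A single species has no pair to compare, so its span is zero.
        return 0.0
    return max(dist[frozenset(pair)] for pair in combinations(species, 2))


# Toy, made-up distances purely for illustration (not data from the study).
toy_dist = {
    frozenset({"sp_a", "sp_b"}): 0.02,
    frozenset({"sp_a", "sp_c"}): 0.15,
    frozenset({"sp_b", "sp_c"}): 0.14,
}

# Hypothetical list of species in which one family was annotated.
family_members = ["sp_a", "sp_b", "sp_c"]
print(max_pairwise_distance(family_members, toy_dist))
```

Taking the maximum rather than the mean makes the value reflect the widest divergence spanned by the group, which is what the per-group ranges in panel C summarize.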