tRNA Signatures Reveal a Polyphyletic Origin of SAR11 Strains among Alphaproteobacteria

Molecular phylogenetics and phylogenomics are subject to noise from horizontal gene transfer (HGT) and bias from convergence in macromolecular compositions. Extensive variation in size, structure and base composition of alphaproteobacterial genomes has complicated their phylogenomics, sparking controversy over the origins and closest relatives of the SAR11 strains. SAR11 are highly abundant, cosmopolitan aquatic Alphaproteobacteria with streamlined, A+T-biased genomes. A dominant view holds that SAR11 are monophyletic and related to both Rickettsiales and the ancestor of mitochondria. Other studies dispute this, finding evidence of a polyphyletic origin of SAR11 with most strains distantly related to Rickettsiales. Although careful evolutionary modeling can reduce bias and noise in phylogenomic inference, entirely different approaches may be useful to extract robust phylogenetic signals from genomes. Here we develop simple phyloclassifiers from bioinformatically derived tRNA Class-Informative Features (CIFs), features predicted to target tRNAs for specific interactions within the tRNA interaction network. Our tRNA CIF-based model robustly and accurately classifies alphaproteobacterial genomes into one of seven undisputed monophyletic orders or families, despite great variability in tRNA gene complement sizes and base compositions. Our model robustly rejects monophyly of SAR11, classifying all but one strain as Rhizobiales with strong statistical support. Yet remarkably, conventional phylogenetic analysis of tRNAs classifies all SAR11 strains identically as Rickettsiales. We attribute this discrepancy to convergence of SAR11 and Rickettsiales tRNA base compositions. Thus, tRNA CIFs appear more robust to compositional convergence than tRNA sequences generally. Our results suggest that tRNA-CIF-based phyloclassification is robust to HGT of components of the tRNA interaction network, such as aminoacyl-tRNA synthetases. We explain why tRNAs are especially advantageous for prediction of traits governing macromolecular interactions from genomic data, and why such traits may be advantageous in the search for robust signals to address difficult problems in classification and phylogeny.


Introduction
What parts of genomes are most robust to compositional convergence? What information is most faithfully inherited vertically? The key assumptions of compositional stationarity and consistency in gene histories underpin most current approaches in phylogenomics and are frequently violated (reviewed in e.g. [1]). HGT is so widespread that the very existence of a "Tree of Life" has been questioned [2,3]. Better understanding of ancient phylogenetic relationships requires discovery of new universal, slowly-evolving phylogenetic markers that are robust to compositional convergence and HGT.
The controversial phylogeny of Ca.Pelagibacter ubique (SAR11) is a case in point. SAR11 make up between a fifth and a third of the bacterial biomass in marine and freshwater ecosystems [4]. Adaptations to extreme environmental nutrient limitation may explain why SAR11 have very small cell and genome sizes and small fractions of intergenic DNA [5]. While some recent phylogenomic studies define a clade among SAR11, the largely endoparasitic Rickettsiales, and the alphaproteobacterial ancestor of mitochondria [6,7,8], others argue that this placement of SAR11 is an artifact of independent convergence towards increased genomic A+T content, and that SAR11 belongs closer to other freeliving Alphaproteobacteria such as the Rhizobiales and Rhodobacteraceae [9,10,11]. Monophyly of SAR11 was also recently rejected [10].
Nonstationary macromolecular compositions are a known source of bias in phylogenomics [12,13]. Widespread variation in macromolecular compositions may be associated with loss of DNA repair pathways in reduced genomes [14,11], unveiling an inherent A+T-bias of mutation in bacteria [15] and elevating genomic A+T content [16,17]. A process such as this has likely altered protein and RNA compositions genome-wide in SAR11, and if such effects are accounted for, the placement of SAR11 with Rickettsiales drops away as an apparent artifact [10,11]. Consistent with this interpretation, SAR11 strain HTTC1062 shares a surprising and unique codivergence of tRNA His and histidyl-tRNA synthetase (HisRS) with a clade of free-living Alphaproteobacteria [18,19] that likely arose only once in bacteria [20]. This synapomorphy contradicts the placement of SAR11 with Rickettsiales.
[ Figure 1 about here.] This work was motivated to determine whether the entire system of tRNAprotein interactions could be exploited to address phylogeny of bacteria, particularly SAR11. The highly conserved tRNA-protein interaction network ( Fig. 1) has special advantages for comparative systems biological study from genomic data. First, the components and interactions of this network are highly conserved. Second, bioinformatic mining of interaction-determining traits from genomic tRNA data is favorable because tRNA structures are highly conserved not just across extant taxa but also across different functional classes of tRNAs ("conformity" [21]). Yet each functional class of tRNA must maintain a hierarchy of increasingly specific interactions with various proteins and other factors ("identity" [22]). The conflicting requirements of conformity and identity allow structural comparison and contrast to predict class-informative traits of tRNAs from sequence data by relatively simple bioinformatic methods [19]. The features that govern tRNA-protein interactions diverge across the three domains of life (reviewed in [23]) and also within the domain of bacteria [20].
In prior work, we developed "function logos" to predict, at the level of individual nucleotides before post-transcriptional modification, what genetically templated information in tRNA gene sequences is associated to specific functional identity classes [24]. We now call these function-logo-based predictions Class-Informative Features (CIFs). A tRNA CIF answers a question like: "if a tRNA gene from a group of related genomes carries a specific nucleotide at a specific structural position, how much informaiton do we gain about that tRNAs specific function?" Such information estimates are corrected for biased sampling of functional classes and sample size effects [24], and their statistical significance may be calculated [19]. Although an individual bacterial genome does not present enough data to generate a function logo, related genome data may be lumped, weakly assuming homogeneity of tRNA identity rules (although heterogeneity generally reduces signal). Function logos recover known tRNA identity elements (i.e. features that govern the specificity of interactions between tRNAs and proteins) [23], and more generally, predict features governing interactions with class-specific network partners such as amidotransferases [25]. A recent molecular dynamics study on a tRNA Glu -GluRS (Glutaminal tRNAsynthetase) complex identified tRNA functional sites involved in intra-and intermolecular allosteric signalling within GluRS that couples substrate recognition to reaction catalysis [26]. The predicted sites are correlated with those from proteobacterial function logos [27].
In this work, we show that tRNA CIFs have diverged among Alphaproteobacteria in a phylogenetically informative manner. Second, as phylogenetic markers, tRNA CIFs are more robust to compositional convergence than the tRNA bodies in which they are embedded. Using our tRNA-CIF-based phyloclassification approach, we confirm that SAR11 are polyphyletic with the majority of strains clustering with the free-living Alphaproteobacteria. Our results have implications for how to best mine genomic data for phylogenetic signals.
[ Figure 2 about here.] The unique traits of the RRCH tRNA His are perfectly associated to substitutions of key residues in the motif IIb tRNA-binding loops of HisRS involved in tRNA recognition [20]. Seven of eight SAR11 strains exhibited the unique tRNA His /HisRS codivergence traits in common with RRCH genomes. In contrast, strain HIMB59 presented ancestral bacterial characters in both tRNA His and HisRS (Fig. S1). These results immediately suggest that HIMB59 is not monophyletic with the other SAR11 strains, consistent with [10].
[ Figure 3 about here.] We computed function logos [24] of the RRCH clade and RSR grade to form the basis of a tRNA-CIF-based binary phyloclassifier as shown schematically in Fig. 2. To reduce bias, we used a Leave-One-Out Cross-Validation (LOOCV) approach. For comparison, we also performed LOOCV phyloclassification using sequence profiles of entire tRNAs, with typical results shown in Fig. 3B. Although the tRNA-CIF-based phyloclassifier (Fig. 3A) was biased positively by the much larger RRCH sample size, it achieved better phylogenetic separation of genomes than the total-tRNA-sequence-based phyloclassifier (Fig. 3B). The Sphingomonadales and Rhodospirillales separated in scores from the Rickettsiales in both classifiers. Most importantly, the tRNA-CIF-based phyloclassifier placed all eight SAR11 genomes closer to the RRCH clade and far away from the Rickettsiales with HIMB59 overlapping the Rhodospirillales, while the total-tRNA-sequence-based phyloclassifier placed all eight SAR11 genomes closer to the Rickettsiales. Fig. S2 shows the effects of different treatments of missing data in the total-tRNA-sequence-based classifier. Method "zero," shown in Fig. 3B, is most analogous to the method used to generate Fig. 3A. Method "skip" (Fig. S2B) shows that SAR11 tRNAs share sequence characters in common with the RSR grade that are not seen in the RRCH clade. Methods "small" and "pseudo" (Figs. S2C and S2D) show that SAR11 have sequence traits not observed in either RSR or RRCH.
[ Figure 4 about here.] Many other tRNA classes besides tRNA His contribute to the differentiated classification of RRCH and RSR genomes by the CIF-based binary classifier (Fig. 4). Other tRNA classes are also differentiated between these two groups, including tRNA Cys , tRNA Asp , tRNA Glu , tRNA Ile LAU (symbolized "J"), tRNA Lys , tRNA Tyr . These results extend the observations of [18] who discovered unusual base-pair features of tRNA Glu in the RRCH clade. In classes for which the RRCH and RSR groups are well-differentiated, HIMB59 uniquely groups with RSR while other strains group with RRCH, while for other tRNA classes, all putative SAR11 strains lie outside the RRCH and RSR distributions. This implies that more diverse Alphaproteobacterial genomic data are necessary to completely resolve the phylogenetic affiliation of SAR11 strains, but strongly contradict a monophyletic affiliation of SAR11 with Rickettsiales.
[ Figure 5 about here.] The increases in genomic A+T contents in SAR11 and Rickettsiales have also driven elevated A+T contents of their tRNA genes (Fig. 5A). Rickettsiales and SAR11 tRNA genes are both notably elevated in both A and T, and share an overall similarity in composition distinct from other Alphaproteobacteria. Hierarchical clustering of Alphaproteobacterial taxa based on tRNA gene base contents closely group SAR11 and Rickettsiales together (Fig. 5B).
[ Figure 6 about here.] Nonstationary tRNA base content -convergence to greater A+T content -causes all eight SAR11 strains in our dataset to group with Rickettsiales using phylogenomic approaches based on total tRNA sequence evidence. In a "supermatrix" phylogenomic approach, concatenating genes for 28 isoacceptor classes from 169 species (2156 total sites) and using the GTR+Gamma model in RAxML, we estimated a Maximum Likelihood tree in which all eight putative SAR11 strains branch together with Rickettsiales (Fig. S3). For this analysis, in 31% of instances when isoacceptor genes were picked from a genome, we randomly picked one gene from a set of isoacceptor paralogs. However, our results did not depend on which paralog we picked. Using a distance-based approach with FastTree, we computed a consensus cladogram over 100 replicate alignments each representing different randomized picks over paralogs. As the consensus cladogram shows (Fig. S4) each replicate distance tree placed all eight putative SAR11 strains together with Rickettsiales. The recently introduced tRNA-specific FastUniFrac-based method for microbial classification [31] also places all SAR11 strains together with Rickettsiales ( Fig. 6).
[ Figure 7 about here.] However, as shown in Fig. 7, a multiway classifier based on tRNA CIFs bins all SAR11 strains with the Rhizobiales except for HIMB59, which bins with the Rhodospirillales, consistent with the results of [10]. These results use a multilayer perceptron (MLP) classifier implemented in WEKA [32] and only seven taxon-specific CIF-based summary scores. The MLP is the simplest non-linear classifier able to handle the interdependent signals in the CIF-based scores for tree-like data [33]. In a Leave-One-Out cross-classification, all other genomes scored consistently with NCBI Taxonomy except three placed in Rhodobacteraceae based on 16S ribosomal RNA evidence: Stappia aggregata, Labrenzia alexandrii and the denitrifying Pseudovibrio sp. JE062. None of these genomes scored strongly against Rhodobacteraceae except Pseudovibrio, which scored four times greater against the Rhizobiales.
To assess robustness of our results we performed two controls: we bootstrapped sites of tRNA data in each genome to be classified, and we filtered away small CIFs with Gorodkin heights < 0.5 bits from our models, retrained the classifier and bootstrapped sites again. Generally bootstrap support values correspond to original classification probabilities. All SAR11 strains have support values > 80% as Rhizobiales, majority bootstrap values as Rhizobiales (HIMB114 at 70% with Rickettsiales at 15% and HTCC7211 at 54% with Rickettsiales at 13%), or plurality bootstrap value as Rickettsiales (HIM5 at 48% with Rickettsiales at 18%) except HIM59 which had a bootstrap support value of 87% to be in the Rhodospirillales. Full bootstrap statistics with these model are provided in Table S1.

Discussion
Our results provide strong, albeit unconventional, evidence that most SAR11 strains are affiliated with Rhizobiales, while strain HIMB59 is affiliated with Rhodospirillales. These results are entirely consistent with comprehensive phylogenomic studies that control for nonstationary macromolecular compositions in Alphaproteobacteria [9,10,11] or a site-rate-filtered analysis [34]. Our CIFbased method works even though SAR11 and Rickettsiales tRNAs have converged in base content, so that total tRNA sequence-based phylogenomics gives opposite results. tRNA CIFs must be at least partly robust to compositional convergence of the tRNA bodies in which they are embedded.
It is well known that aminoacyl-tRNA synthetases (aaRS) are highly prone to HGT [35,36,37,38,39] including in Alphaproteobacteria [20,40,41]. We hypothesize that our tRNA-CIF-based phyloclassifiers are also robust to HGT of components of the tRNA-protein interaction network, consistent with [42], who argued that a horizontally transferred aaRS is more likely to functionally ameliorate to a tRNA-protein network into which it has been transferred rather than remodel that network to accomodate itself. HGT of aaRSs may also perturb a network so as to cause a distinct pattern of divergence ( [20] and this work). Wang et al. [18] discuss the possibility that RRCH tRNA His and HisRS were co-transferred into an ancestral SAR11 genome. However, this fails to explain the correlations of many other tRNA traits of SAR11 genomes with the RRCH clade reported here. Further study is needed to address the robustness of our method to component HGT.
A more distant relationship between most SAR11 strains and Rickettsiales actually strengthens the genome streamlining hypothesis [5]. If SAR11 were a true branch within Rickettsiales, it becomes more difficult to claim that genome reduction in SAR11 occurred by a selection-driven evolutionary process distinct from the drift-dominated erosion of genomes in the Rickettsiales [43, 16,44]. By the same token, polyphyly of nominal SAR11 strains implies that the extensive similarity in genome structure and other traits between HIMB59 and SAR11 reported by [45] may have originated independently. Perhaps convergence in some traits is consistent with streamlining, which could also explain trait-sharing between SAR11 and Prochlorococcus, marine cyanobacteria also argued to have undergone streamlining [46]. Clear signs of data-limitation in our study should be taken to mean that better taxonomic sampling will improve our results and could ultimately resolve more than two origins of SAR11-type genomes among Alphaproteobacteria.
We extracted accurate and robust phylogenetic signals from tRNA gene sequences by first integrating within genomes to identify features likely to govern functional interactions with other macromolecules. Unlike small molecule interactions, macromolecular interactions are mediated by genetically determined structural and dynamic complementarities. These are intrinsically relative; a large neutral network [47] of interaction-determining features should be compatible with the same interaction network. Coevolutionary divergenceturnover-of features that mediate macromolecular interactions, while conserv-ing network architecture, has been described in the transcriptional networks of yeast [48,49] and worms [50] and in post-translational modifications underlying protein-protein interactions [51]. This work demonstrates that divergence of interaction-governing features is phylogenetically informative.
It remains open how such features diverge, with possibilities including compensatory nearly neutral mutations [52], fluctuating selection [53], adaptive reversals [54], and functionalization of pre-existent variation [55]. Major changes to interaction interfaces may be sufficient to induce genetic isolation between related lineages, as discussed for the 16S rRNA-and 23S rRNA-based standard model of the "Tree of Life," in which many important and deep branches associate with large, rare macromolecular changes ("signatures") in ribosome structure and function [56,57,58].
Interaction-mediating features of macromolecules may be systems biology's answer to the phylogeny problem. Perhaps no other traits of genomes are vertically inherited more consistently than those that mediate functional interactions with other macromolecules in the same lineage. In fact, the structural and dynamic basis of interaction among macromolecular components -essential to their collaborative function in a system -may define a lineage better than any of those components can themselves, either alone or in ensemble.

Materials and Methods
Supplementary data packages are provided to reproduce all figures from raw data and enable third-party classification of alphaproteobacterial genomes.

Data
The 2011 release of the tRNAdb-CE database [28] was downloaded on August 24, 2011. From this master database, we selected Alphaproteobacteria data as specified by NCBI Taxonomy data (downloaded September 24, 2010, [59]). Also using NCBI Taxonomy, we further tripartitioned Alphaproteobacterial tRNAdb-CE data into those from the RRCH clade, the RSR grade (excluding SAR11), and three SAR11 genomes, as documented in Supplementary data for figure 2. Five additional SAR11 genomes (for strains HIMB59, HIMB5, HIMB114, IMCC9063 and HTCC9565) were obtained from J. Cameron Thrash courtesy of the lab of S. Giovannoni. We custom annotated tRNA genes in these genomes as the union of predictions from tRNAscan-SE version 1.3.1 (with -B option, [60]) and Aragorn version 1.2.34 [61]. We classified initiator tRNAs and tRNA Ile CAU using TFAM version 1.4 [62] using a model previously created to do this based on identifications in [63] provided as supplementary data. We aligned tRNAs with covea version 2.4.4 [64] and the prokaryotic tRNA covariance model [60], removed sites with more than 97% gaps with a bioperl-based utility [65], and edited the alignment manually in Seaview 4.1 [66] to remove CCA tails and remove sequences with unusual secondary structures. We mapped sites to Sprinzl coordinates manually [29] and verified by spot-checks against tRNAdb [67]. We added a gap in the -1 position for all sequences and -1G for tRNA His in the RSR group [18].

tRNA CIF Estimation and Binary Classifiers
Our tRNA-CIF-based binary phyloclassifier with Leave-One-Out Cross-Validation (LOO CV) is computed directly from function logos, estimated from tDNA alignments as described in [24]. Here, we define a feature f ∈ F as a nucleotide n ∈ N at a position l ∈ L in a structurally aligned tDNA, where N = {A, C, G, T } and L is the set of all Sprinzl coordinates [29]. The set F of all possible features is the Cartesian product F = N ×L. A functional class or class of a tDNA is denoted c ∈ C where C = {A, C, D, E, F, G, H, I, J, K, L, M, N, P, Q, R, S, T, V, W, X, Y } is the universe of functions we here consider, symbolized by IUPAC one-letter amino acid codes (for aminoacylation classes), X for initiator tRNAs, and J for tDNA Ile LAU . A taxon set of genomes or just taxon set S ∈ P(G) is a set of genomes, where G is the set of all genomes, and P(G) is the power set of G. In this work a genome G is represented by the multiset of tDNA sequences it contains, denoted T G . The functional information of features is computed with a map h : (F × C × P(G)) −→ R ≥0 from the Cartesian product of features, classes and taxon sets to non-negative real numbers. For a feature f ∈ F , class c ∈ C and taxon set S ∈ P(G), h(f, c, S) is the fraction of functional information or "Gorodkin height" [68], measured in bits, associated to that feature, class and taxon set. In this work, for a given taxon set S, a function logo H(S) is the tuple: Furthermore the set I(S) ⊂ (F × C) of tRNA Class-Informative Features for taxon set S is defined: Briefly, a tRNA Class-Informative Feature is a tRNA structural feature that is informative about the functional classes it associates with, given the context of tRNA structural features that actually co-occur among a taxon set of related cells, and corrected for biased sampling of classes and finite sampling of sequences [24]. Let A denote a set of Alphaproteobacterial genomes partitioned into three disjoint subsets X, Y and Z with X ∪ Y ∪ Z = A, representing genomes from the RRCH clade, the RSR grade, and the eight nominal Ca. Pelagibacter strains respectively. To execute Leave-One-Out Cross-Validation of a tRNA CIF-based binary phyloclassifier for a genome G ∈ A, we compute a score S C (G, S 1 , S 2 ), averaging contributions from the multiset T G of tDNAs in G scored against two function logos H(S 1 ) and H(S 2 ) computed respectively from two disjoint taxon sets S 1 ⊂ A and S 2 ⊂ A, with G / ∈ S 1 ∪ S 2 . In this study, those sets are X \ G and Y \ G, denoted X G and Y G respectively. Each tDNA t ∈ T G presents a set of features F t ⊂ F and has a functional class c t ∈ C associated to it. The score S C (G, X G , Y G ) is then defined: As controls, we implemented four total-tDNA-sequence based binary phyloclassifiers to score a genome G. All are slight variations in which a tRNA t ∈ T G of class c(t) contributes a score that is a difference in log relative frequencies of the features it shares in class-specific profile models generated from X G and Y G . The default "zero" scoring scheme method S Z T (G, X G , Y G ) shown in Fig. 3B is defined as: where #{f, c, S} is the observed frequency of feature f in tDNAs of class c in set S, and #{c, S} is the frequency of tDNAs of class c in set S. Method "skip" corresponds to scoring scheme S K T (G, X G , Y G ) defined as: where and p(f |c, R) ≡ #{f, c, R}/#{c, R} for R ∈ {S, T } as before. Methods "pseudo" and "small" correspond to scoring schemes S I T (G, X G , Y G ): where where f = (n, l), o ≡ #{f, c, S}, t ≡ #{c, S}, I = 1 for method "pseudo," and, for method "small,"

Analysis of tRNA Base Composition
We computed the base composition of tRNAs aggregated by clades using bioperlbased [65] scripts, and transformed them by the centered log ratio transformation [30] with a custom script provided as supplementary data. We then computed Euclidean distances on the transformed composition data, and then performed hierarchical clustering by UPGMA on those distances as implemented in the program NEIGHBOR from Phylip 3.

Multiway Classifier
All tDNA data from the RSR and RRCH clades were partitioned into one of seven monophyletic clades: orders Rickettsiales (N = 40 genomes), Rhodospirillales (N = 10), Sphingomonadales (N = 9), Rhizobiales (N = 91), and Caulobacterales (N = 6), or families Rhodobacteraceae (N = 43) or Hyphomonadaceae (N = 4) as specified by NCBI taxonomy (downloaded September 24, 2010, [59]) and documented in supplementary data for figure 7. We withheld data from the eight nominal SAR11 strains, as well as from three genera Stappia, Pseudovibrio, and Labrenzia, based on preliminary analysis of tDNA and CIF sequence variation. Following a related strategy as with the binary classifier, we computed, for each genome, seven tRNA-CIF-based scores, one for each of the seven Alphaproteobacterial clades as represented by their function logos, using the principle of Leave-One-Out Cross-Validation (LOO CV), that is, excluding data from the genome to be scored. Function logos were computed for each clade as described in [24]. For each taxon set X G (with genome G left out if it occurs), genome G obtains a score S M (G, X G ) defined by: Each genome G is then represented by a vector of seven scores, one for each taxon set modeled. These labeled vectors were then used to train a multilayer perceptron classifier in WEKA 3.7.7 (downloaded January 24, 2012, [32]) by their defaults through the command-line interface, which include a ten-fold cross-validation procedure. We bootstrap resampled sites in genomic tRNA alignment data (100 replicates) and also bootstrap resampled a reduced (and retrained) model including only CIFs with a Gorodkin height [24] ≥ 0.5 bits.
[                   Figure 6: FastUniFrac-based phylogenetic tree of alphaproteobacteria using tRNA data computed according to the methods of [31].  Figure 7: FastUniFrac-based phylogenetic tree of alphaproteobacteria using tRNA data computed according to the methods of [31].