Sequence Evidence in the Archaeal Genomes that tRNAs Emerged Through the Combination of Ancestral Genes as 5′ and 3′ tRNA Halves

The discovery of separate 5′ and 3′ halves of transfer RNA (tRNA) molecules—so-called split tRNA—in the archaeal parasite Nanoarchaeum equitans made us wonder whether ancestral tRNA was encoded on 1 or 2 genes. We performed a comprehensive phylogenetic analysis of tRNAs in 45 archaeal species to explore the relationship between the three types of tRNAs (nonintronic, intronic and split). We classified 1953 mature tRNA sequences into 22 clusters. All split tRNAs have shown phylogenetic relationships with other tRNAs possessing the same anticodon. We also mimicked split tRNA by artificially separating the tRNA sequences of 7 primitive archaeal species at the anticodon and analyzed the sequence similarity and diversity of the 5′ and 3′ tRNA halves. Network analysis revealed specific characteristics of and topological differences between the 5′ and 3′ tRNA halves: the 5′ half sequences were categorized into 6 distinct groups with a sequence similarity of >80%, while the 3′ half sequences were categorized into 9 groups with a higher sequence similarity of >88%, suggesting different evolutionary backgrounds of the 2 halves. Furthermore, the combinations of 5′ and 3′ halves corresponded with the variation of amino acids in the codon table. We found not only universally conserved combinations of 5′–3′ tRNA halves in tRNAiMet, tRNAThr, tRNAIle, tRNAGly, tRNAGln, tRNAGlu, tRNAAsp, tRNALys, tRNAArg and tRNALeu but also phylum-specific combinations in tRNAPro, tRNAAla, and tRNATrp. Our results support the idea that tRNA emerged through the combination of separate genes and explain the sequence diversity that arose during archaeal tRNA evolution.


Introduction
The origin and evolution of tRNA is an exciting topic widely discussed in the field of molecular evolution [1][2][3][4][5]. Three types of tRNA genes have been identified in archaeal genomes: ''nonintronic tRNA'', which is encoded on a single gene with no intron sequence; ''intron-containing tRNA'' (intronic tRNA), which is encoded on a single gene with 1 to 3 introns punctuating the mature tRNA sequence at various locations [6][7][8]; and ''split tRNA'', which has been found only in the hyperthermophilic archaeal parasite Nanoarchaeum equitans, of which the 59 and 39 halves are encoded on separate genes [9]. The discovery of these split tRNA genes raised the question of whether ancestral tRNA was encoded on a single gene or separate genes. No redundant tRNAs with the same anticodon have been found in the N. equitans genome, and experimental data suggest that these split tRNAs are transcribed and processed into functional tRNA molecules [9]. The joiner region of the pre-tRNA product of the 2 split tRNA genes consists of a bulge-helix-loop (BHL) motif, a relaxed form of bulge-helix-bulge (BHB) motif which is recognized and cleaved by homodimer or heterotetramer splicing endonucleases during tRNA processing [10,11]. Interestingly, BHB and BHL motif is known as a distinctive structure for protein-spliced intron and are observed at the exon-intron boundary of archaeal intronic tRNA [12,13], mRNA [14,15], rRNA [16] and eukaryal intronic tRNAs [17,18]. Recently, a novel permuted structure of tRNA gene was discovered in a genome of primitive red alga, Cyanidioschyzon merolae, which tRNAs are encoded on a single gene in the order of 39 half-59 half forming a potential BHB motif around intron-exon junction [19]. These common processing machineries suggest an evolutionary link among various types of tRNAs.
In this context, Di Giulio suggested that ancestral tRNA was encoded by 2 separate minigenes, which later fused to encode modern tRNAs [3,5,20]. This conclusion is based on the probabilistic argument that the boundaries of the split tRNAs and of many of the introns found in the tRNA sequences are located at anticodon loops. In addition, it has been proposed that the tRNA molecule originated by direct duplication of an RNA hairpin structure [3,4,21]. Experimental studies have shown that a single hairpin minihelix that recapitulates the domain of top half tRNA sequence (acceptor stem+TyC arm helix) can be aminoacylated by modern tRNA synthetases [22]. The definition of the minihelix is different from that of split tRNA (separated at the anticodon loop) but the idea of tRNA emerged from two distinct minigenes has been supported by the genomic tag hypothesis, which proposes that the top half tRNA sequence must have originated among ancient life to catalyze RNA replication in the 'RNA world' ahead of bottom half tRNA sequence (D arm helix+anticodon stem) on account of the independent structure and functionality of the 59 and 39 tRNA sequences [1,2]. Nevertheless, the phylogenetic position of N. equitans remains uncertain, as it has two incongruent positions in the phylogenetic tree of the Archaea depending on the dataset [23,24], and so the question remains as to whether the split tRNA genes really exemplify the ancestral form of tRNA.
We have recently developed software called SPLITS [7,8], which has predicted most of the archaeal missing tRNAs that cannot be predicted by tRNAscan-SE, the most widely used software for tRNA annotation [25]. The recent accumulation of complete archaeal genome sequences in the public databases enabled us to perform comprehensive phylogenetic analysis of 1953 archaeal tRNA sequences in 45 archaeal species. Here, we report on the overall phylogeny of the archaeal tRNAs, focusing on the evolutionary relationships among nonintronic, intronic, and split tRNAs. Further, by separating all kinds of tRNAs at the anticodon region and mimicking the split tRNA sequences, we show that network topologies based on the sequence similarities of 59 and 39 tRNA halves differ significantly. Moreover, the specific combination of 59 and 39 tRNA halves from different groups explained the variation of amino acids in the codon table. Here, we provide sequence evidence supporting the hypothesis that tRNA evolved through a combination of 59 and 39 tRNA sequences.

Results and Discussion
Phylogenetic analysis of mature tRNAs in 45 archaeal species We predicted 1977 putative tRNA candidates from the genome sequences of 45 archaeal species with 2 tRNA predicting programs, SPLITS [7,8] and tRNAscan-SE [25]. All tRNA sequences were manually checked, and 24 false candidates (tRNAlike sequences used for viral integration, and pseudogenes) were eliminated from the dataset. The resulting 1953 archaeal tRNAs, including 6 known split tRNAs and 423 intronic tRNAs, were used as a dataset for phylogenetic analysis. We performed structural alignment based on their mature tRNA sequences (from which introns and leader sequences were deleted) by manually improving the multiple alignment data through complete matching of the consensus nucleotides conserved among archaeal tRNAs (see Methods for detail). An unrooted neighbor-joining (NJ) tree was then produced. As a result, 1953 tRNAs were separated into 22 clusters: 12 dominated by tRNAs with its anticodon corresponding to a single type of amino acid (e.g., a tRNA for Ala), and 10 consisting of tRNAs with anticodons corresponding to 2 to 4 amino acids (e.g., a tRNA for Arg-Lys-Trp [i.e., 3 amino acids]) ( Fig. 1). For example, 89 out of 90 archaeal tRNA Gln were clustered in the same branch (cluster 16); and all tRNA Asp and tRNA Glu except the split tRNA Glu were clustered indistinguishably in the same branch (cluster 17). Like as Asp-Glu cluster, there are several indistinguishable pairs of amino acids clustered in the same branch that are in precursorproduct relationship or in biosynthetic relationship (ex: Ala-Val, Arg- Lys, Phe-Tyr, Cys-Ser) supporting the coevolution theory of the origin of the genetic code [26].
The split tRNAs were placed in 5 different clusters: 4 split tRNAs (tRNA iMet [cluster 1], tRNA His [cluster 6], tRNA Gln [cluster 16], and tRNA Lys [cluster 22]) were placed adjacent to other tRNAs with synonymous codons. Only the two split tRNA Glu (cluster 15) was distant from any other tRNAs that branched at the root of the tRNA Gln cluster. Accordingly, only 13 tRNA clusters (shaded in Fig. 1) included tRNAs from all 3 archaeal phyla (Nanoarchaeota, Crenarchaeota, and Euryarchaeota), suggesting that about half of the tRNAs corresponding to specific amino acids evolved from a monophyletic origin. For a clearer understanding of the phylogenetic distribution of archaeal tRNAs, we tabulated the phylogenetic location of tRNAs from 16 archaeal genera against each tRNA's cluster number (Supplemental Fig. S1). Further, we have found that none of the N. equitans tRNAs branched adjacent to the tRNAs of the host species Ignicoccus hospitalis (data not shown), thus we assume that no tRNA genes were laterally transferred between the 2 species after N. equitans became parasite of Ignicoccus cell.
Since 4 out of the 6 split tRNAs possessed clear sequence similarity with other tRNAs with the same anticodons, we focused on the precise phylogeny of the 6 split tRNAs from N. equitans in relation to other archaeal tRNAs. We performed detailed phylogenetic analysis for each of the 6 split tRNAs based on the Bayesian method (Supplemental Fig. S2). Mature sequences of 3 split tRNAs (tRNA iMet , tRNA Lys and tRNA Gln ) branched with other tRNAs with same anticodon from Crenarchaeota and Euryarchaeota lineages. Split tRNA His clustered with tRNA His from the Crenarchaeota lineage and M. kandleri. The most notable was split tRNA Glu , which located at the root of tRNA Glu cluster. The phylogenetic position of split tRNA Glu in NJ tree was adjacent to tRNA Gln and tRNA Trp cluster, although this contradiction exemplifies the sequence ambiguity of split tRNA Glu . Thus, split tRNAs reveal various characteristics in the phylogeny of archaeal tRNAs-universal (tRNA iMet , tRNA Lys and tRNA Gln ), crenarchaeal-specific (tRNA His ), and unique (tRNA Glu ) phylogenetic positions-suggesting that split tRNA could be the ancestral form of tRNAs. Besides, intronic tRNAs were scattered throughout tRNA phylogeny in almost every tRNA clusters with introns positioned at the same location as the 59-39 boundary of the split tRNAs. We found an intronic tRNA Arg with an intron sequence possessing 58% identity to that of the split tRNA Lys leader sequence located at the same position as that of the intron position (Fig. 2). Both tRNA belongs to the same cluster (cluster 22 in Fig. 1) suggesting that some intronic tRNAs may have emerged from integrated split tRNA in the archaeal genome.
Sequence analysis of the 59 and 39 tRNA halves reveals their different evolutionary backgrounds As described in the previous section, comprehensive phylogenetic analysis of 1953 mature archaeal tRNA sequences revealed an evolutionary relationship between the three types of tRNAs. If split genes were the ancestral form of tRNA, can we explain the origin and evolution of modern tRNA sequences by the combination patterns of 59 and 39 tRNA halves? We attempted to corroborate this hypothesis by analyzing the evolutionary backgrounds of 59 and 39 halves. We selected 7 archaeal species, each from the closest taxon to the last archaeal common ancestor: 1 Nanoarchaeota (N. equitans, Neq), 3 Crenarchaeota (P. aerophilum, Pae; A. pernix, Ape; and S. solfataricus; Sso), and 3 Euryarchaeota (P. furiosus, Pfu; M. kandleri, Mka; and M. jannaschii, Mja). In total, 296 tRNA sequences were extracted and separated into the 59 half (positions 1-33) and the 39 half (positions 37-73) at the anticodon region (positions 34-36) to mimic split tRNA. We considered each tRNA half as a network node, and nodes were linked when sequence identity between the 2 tRNA halves was above certain threshold. Accordingly, we have constructed 2 different networks (59 half network and 39 half network) based on their sequence similarities. Previously, a network-based approach was used to explain 2 mechanisms of tRNA evolution: point mutation and complementary mutation [27]. Thus, concept of representing the sequence similarities among tRNA halves as a network provides comprehensive understanding of sequence characteristics and its diversity based on the global statistics. Figure. 3 shows the sequence similarity networks of 59 and 39 tRNA halves with thresholds of .70%, .75%, and .80% for 59 tRNA halves (Fig. 3A), and .80%, .85%, and .88% for 39 tRNA halves (Fig. 3B). A noticeable topological difference is apparent between the sequence similarity networks of the 59 and 39 halves. Sequence diversity of the 59 half sequences appeared between sequence similarities of .70% and .75%, where the single large network started to localize into small clusters corresponding to specific amino acids. This feature was more prominent at a similarity of .80%, where 59 halves clearly separated into 6 groups (1-6 in Fig. 3A). In contrast, all 39 half sequences except tRNA Ser and tRNA Leu (possessing long variable stem loops shown as yellow and orange nodes in Fig. 3B) gathered into 1 large group even at a sequence similarity of .80%. The 39 half sequences were finally differentiated at a sequence similarity of .88% into 9 groups (A-I in Fig. 3B).
To characterize the overall network topology of the 59 and 39 halves, we compared the 2 networks by measuring the distribution of the number of connections per node (connectivity distribution). Network topology can be classified into 2 distinct types: 'scale-free networks', in which a few nodes have many links but most have only a few links, and which follow a power-law distribution [28]; and 'small-world networks', which consist of nodes linked randomly and follow a Poisson or exponential distribution [29]. We found obvious difference in degree distribution between the 2 networks at all similarity thresholds (Fig. 4). At a similarity threshold of .75%, the clustering coefficient c (degree of linkage between nodes) of 59 half was only 0.17 while 39 half was 0.47, which means that 47% of all possible connection are used. The power-law-like distribution of the 59 half network was enhanced as the similarity threshold increased, while the distribution of 39 half network changed from Poisson-like (.75%) to irregular (.80-85%) then finally a power-law-like distribution at a sequence similarity of .88%. Meanwhile, the ratio of the cluster coefficient between 59 and 39 halves were constantly 1:2.5-1:4 showing that characteristics of the 2 tRNA domains significantly differ at all level of sequence threshold (75%-88%). We suggest that high commonality of the 39 half sequences is one of the evidence of which 39 half could have had a single origin. On the other hand, the scale-free distribution of the 59 half represents a specific sequence characteristic within each sequence group, suggesting a non-monophyletic origin of the 59 half sequences. These different sequence characteristics of the 59 and 39 tRNA halves further support the idea that ancient archaeal tRNA emerged from a specific combination of 59 and 39 halves.
Specific combinations of 59 with 39 tRNA halves strongly correlate with amino acid distribution in the codon table Since the 59 and 39 halves are suggested to have had different evolutionary backgrounds, we hypothesized that the sequence variety of tRNAs was derived directly from different combinations of 59 and 39 tRNA sequences in the early stage of archaeal  Figure S3. Group A is the largest group in the 39 half network, accounting for 104/296 (35%) sequences of tRNAs corresponding to various amino acids. In N. equitans, 39 half sequences of tRNA Arg , tRNA Ile , tRNA Val , tRNA Ala , tRNA Pro , tRNA Trp , and tRNA Phe were all categorized as group A. Adding the 59 sequence information further classified these tRNAs into small groups possessing the same 59-39 combination. For example, in all 7 species, the same 4-A combination was observed for tRNA Arg and tRNA Lys ; this, together with the fact that split tRNA Lys is clustered among the Arg-Lys cluster (Fig. 1), suggests that these tRNAs originated from a combination of ancestral minigenes of the 2 sequence groups. The same features are apparent in the 5-A combination of tRNA Val . Exceptionally, however, 2 tRNA Val s in M. kandleri both have the 4-A combination located in unique cluster (cluster 11 in Fig. 1). M. kandleri is an intriguing species which possesses many unique tRNAs clustered at different phylogenetic positions from other tRNAs with the same anticodon (Supplemental Fig. 1). The N. equitans genome also encodes unique tRNA Ile , which displays the only UAU (TAT) anticodon found in the Archaea. tRNA Ile (Neq) is considered to be evolutionarily unique, as no homologous sequence has been found in any other archaeal species and therefore it is located within an isolated branch with various tRNAs from M. kandleri (cluster 11 in Fig. 1). However, by splitting the sequence of tRNA Ile (Neq), we found tRNA Gly (GCC) in N. equitans genome which possessed identical 39 half but different 59half (Fig. 6A). This sequence evidence supports our idea of the ancient integration of split tRNA genes.
Comparing the codon tables of all 3 archaeal species groups (Supplemental Fig. S3), we found several interesting features related to the evolutionary selection of 59-39 combinations. We identified a phylum-specific distribution of tRNA halves corresponding to the same amino acids. For example, the 59 tRNA sequences of euryarchaeal tRNA Pro belong to group 6, but those of the nanoarchaeal and crenarchaeal tRNA Pro s belong to group 5. In contrast, the 39 sequences of tRNA Pro all belong to group C except N. equitans. tRNA Trp and tRNA Ala also showed phylumspecific 59 half sequences. This different ancestry of the 59 half sequences clearly explains the non-monophyletic origins of tRNA Pro , tRNA Trp and tRNA Ala . Secondly, specific 59-39 combinations of tRNA iMet , tRNA Thr , tRNA Ile , tRNA Gly , tRNA Gln , tRNA Glu , tRNA Asp , tRNA Lys , tRNA Arg and tRNA Leu are conserved among all 7 species, placing these tRNAs among the most ancient unchanged tRNAs in the genome since the last archaeal common ancestor. Finally we observed a clear rule in the sequence divergence of 59 and 39 halves. For example in 59 half network, group 1 has connection with group 3 but not with group 5 nor group 6 and in 39 half network, group D has connection with group C and group B but not with group A (Fig. 3). Based on these rules, we found that group 5 and group C were placed at the center of network diversity, suggesting that these cluster should have been the origin of the whole network. When conbining the two (59 and 39) rules together, tRNA Gly appeared as the only 5-C combination conserved among 7 species. Thus we suggest that minigenes encoding 59 and 39 tRNA sequence of tRNA Gly were the origins of other tRNA genes in the very early stage of tRNA evolution. Later, gene duplication and sequence divergence occurred and selection of specific 59-39 tRNA sequences would have increased the specificity of tRNAs along with the usage of various amino acids. As shown in Figure 6, many tRNAs share identical 39 halves but have different 59 halves (Fig. 6A-6C), or vice versa (Fig. 6D). In that circumstance, the anticodon sequence could have been supplied by either sequences, a feature observed in split tRNAs [30].
In conclusion, our hypothesis that modern tRNAs evolved through the combination of 59 and 39 tRNA fragments is supported strongly by the variation in amino acids in the codon table and explains the sequence diversity in archaeal tRNAs. Some tRNAs show a highly conserved 39 half but different 59 half sequences (or vice versa), while other tRNAs consist of specific 59-39 combinations and correspond to single types of amino acids, features that are conserved throughout all archaeal phyla. We need natural or experimental evidence of how these split genes evolved into modern tRNA genes. According to current studies of processed pseudogenes, at least some rRNAs, mRNAs, and tRNAs can retropose (reverse-transcribe and integrate) into the genome in Mammalia and plants [31][32][33]. A good example is known as short interspread nuclear element (SINE); a retrotransposon which is known to be originated from tRNA via retropositional mechanism [34,35]. Experimentally, Escherichia coli-derived tRNA Tyr was reverse-transcribed as mono-cDNA by using retroplasmid-derived reverse transcriptase without any template DNA [36]. Therefore, we speculate that ancestral 59 and 39 tRNA minigenes integrated at the transcript level, along with the help of reverse transcriptase or integrase. If so, both intronic and nonintronic tRNAs could have been produced from transcribed split tRNA, resulting in the integration of the leader sequence of split tRNA as an intron sequence containing the BHB motif, and the deposition of the CCA sequence added by nucleotidyltransferase downstream of the 39 end of retroposed tRNA sequences. This hypothesis could also be adapted to mRNAs and rRNAs, because some of these functional RNAs in the N. equitans genome are encoded on separate genes just like split tRNA [23]. Recently we have reported a novel type of functional tRNA gene in a primitive red alga, Cyanidioschyzon merolae, in which the 39 half of the tRNA sequence lies upstream of the 59 half in the genome [19]. This 'permuted' tRNA might also be considered as a result of the adaptive integration of 59 and 39 tRNA halves of ancient eukaryotes or a retroposed product of circular pre-tRNA transcript. Consequently, we have provided a molecular basis for our current understanding of the origin and evolution of tRNAs. Novel tRNA sequences from the upcoming fourth archaeal phylum, the Korarchaeota [37], and other primitive Eukarya and Prokarya separated near the root of the phylogenetic tree will allow further testing of our hypothesis.
Both nonintronic and intronic tRNAs were predicted by 2 tRNA-predicting programs, SPLITS (7,8) and tRNAscan-SE (17). tRNAscan-SE was run with default parameters for predicting nonintronic tRNAs and intronic tRNAs with canonical introns. SPLITS was run parameters -d 2, -p 0.51, -H 2, and -F 3 for predicting intronic tRNAs with multiple introns. The split tRNA sequences were taken from Randau [30]. All predicted tRNA sequences were manually checked to eliminate false candidates and to correct the intron sequences. All predicted archaeal tRNA sequences are registered in a new tRNA database (Sugahara and Kikuta et al., to be published separately).

tRNA sequence analysis
All mature tRNA sequences (input as tDNA sequences) were aligned using CLUSTALW [38] with the following parameters: Multiple Alignment parameter ''Gap Opening'' = 22.5; Multiple Alignment parameter ''Gap Extension Penalty'' = 0.83; Pairwise Alignment parameter ''Gap Opening'' = 22.5; Multiple Alignment parameter ''Delay Divergent Sequence'' = 25%. These parameters gave the highest performance in RNA sequence alignment [39]. Aligned tRNA sequences (.aln file) were manually improved by matching the consensus nucleotides conserved among archaeal tRNAs: base 8 (T or C), base 14-15 (AG), base 18-19 (GG), base 21 (A or G), base 33 (T), base 48 (T or C) base 53-58 (GTTC[AG]A), base 60-61 (TC) and base 72 (T or C) [6] to obtain a structural alignment. The neighbor joining tree was constructed using CLUSTALW. Bayesian trees were constructed using MrBayes software v. 3.1.2 [40]. The model of sequence evolution was determined by MrModeltest software v. 2.2. The phylogenetic tree files were visualized using Hypertree [41] and Treeview X software (http://darwin.zoology.gla.ac.uk/,rpage/ treeviewx/). The mimicry of split tRNA halves were conducted by separating tRNA sequences at positions 34-36 (anti-codon); we defined the 2 separated sequences as 59 half tRNA (positions 1-33) and 39 half tRNA (positions 37-73). Sequence similarity was calculated based on the aligned file of mature tRNA sequences (.aln). The anti-codon and CCA sequence was excluded to avoid bias. Networks were visualized with Cytoscape software v. 2.5 [42]. Clustering coefficient c is calculated as given equation: where i(i-1) refers to the total number of possible connection among i nodes, j refers to the total number of edge found in the network. Clustering coefficient measures relative linkage among nodes and it is used to understand the topological characteristics of a network (18,20).

Supporting Information
Figure S1