Exploration of Small RNAs

For several decades, only a limited number of noncoding RNAs, such as ribosomal and transfer RNA, have been studied in any depth. In recent years, additional species of noncoding RNAs have increasingly been discovered. Of these, small RNA species attract particular interest because of their essential roles in processes such as RNA silencing and modification. Detailed analyses revealed several pathways associated with the function of small RNAs. Although these pathways are evolutionarily conserved, there are substantial differences between organisms. Advanced technologies to profile RNAs have accelerated the field further, resulting in the discovery of an increasing number of novel species and suggesting that we are only just beginning to appreciate the complexity of small RNAs and their functions. Here, we review recent progress in novel small RNA exploration, including discovered small RNA species, their pathways, and devised technologies.


Introduction
There is substantial interest in noncoding RNAs (ncRNAs), which play an essential role in complex biological systems without encoding proteins. Only a limited number of ncRNAs, such as ribosomal RNA (rRNA) and transfer RNA (tRNA), have previously been characterized in any depth. Recent studies revealed many novel ncRNAs, covering a wide range of sizes [1]. RNA molecules have several functions, including catalytic activity and the ability to act as a structural component. Of these functions, the ability to specify a nucleic acid sequence is one in which RNA is superior to proteins. A common way in which ncRNA contributes to biological processes is through the ribonucleoprotein (RNP) complex, where its role is to guide recognition of nucleic acid target sequences relying upon sequence complementarity [2]. Small RNA molecules are widely utilized in this type of machinery, and are involved in important biological processes [3]. Exploration of novel small RNA species and their functions attracts substantial interest. The advent of recent technologies to profile cellular RNAs, such as high-throughput sequencing and microarrays, coupled with computational analysis, has contributed to rapid progress in this field. Here, we review the recently discovered small RNA species and their pathways with a view to the conservation and differences among higher eukaryotes. We also summarize recent efforts to explore novel small RNAs based on devised technologies, to provide a perspective for the future.

RNA Silencing-Related Small RNAs
RNA silencing of endogenous genes, viruses, and selfish genomic elements is a regulatory process that relies on small RNA molecules, approximately 22 nucleotides long [4]. The trigger of RNA silencing is an RNA molecule harboring a duplex. Such a molecule is processed with the following steps: (i) small RNA production: a precursor RNA is cleaved to produce a small double-stranded RNA (dsRNA), where the precursor can be a hairpin-structured RNA or a long dsRNA in the case of microRNA (miRNA) or small interfering RNA (siRNA), respectively; (ii) RNP assembly: the resulting small RNA is loaded into an RNP complex, and (iii) gene silencing: the RNP suppresses its target gene, where the target recognition is guided by the loaded small RNA, and the silencing activity is mediated by the proteins composing the RNP at the post-transcriptional or transcriptional level [5].
This machinery is adopted in a wide range of organisms. Although the overall pathways resemble each other, there are substantial differences between organisms (Figure 1). In metazoans, two RNase III endonucleases, Drosha and Dicer, contribute to small RNA production. Drosha cleaves a long primary transcript including a stem-loop (termed primary miRNA, or pri-miRNA) near the base of the stem to release a hairpin structure, termed precursor miRNA (pre-miRNA). Dicer cleaves pre-miRNA or a long dsRNA (precursor of siRNA) to produce a small dsRNA, only one strand of which is loaded into the RNP [6]. Homo sapiens, Mus musculus, and Caenorhabditis elegans have only one Dicer gene, which contributes to both miRNA and siRNA production. On the other hand, these two roles are encoded by distinct genes in several organisms. Drosophila melanogaster has two Dicer proteins, DCR-1 and DCR-2, which are used for miRNA and siRNA, respectively [7]. Arabidopsis thaliana has four Dicer orthologues, DCL1 to DCL4, but no Drosha. DCL1 contributes to miRNA production, DCL2 to 22nt siRNA from invading viruses, and DCL3 to 24nt siRNA from endogenous genes [8,9].
There are also differences in the effector complexes participating in RNA silencing, siRNP or miRNP. Their main component is Argonaute, which has two principal domains: an RNA-binding PAZ domain at the N-terminus and an RNase-like Piwi domain at the C-terminus. The Argonaute protein family consists of the Ago subfamily, the Piwi subfamily, and the C. elegans specific subfamily (Figure 1) [10]. Mammalian AGO subfamily members contribute to both the siRNA and miRNA pathways. In contrast, AGO1 contributes only to the miRNA pathway, and AGO2 only to the siRNA pathway in D. melanogaster [7]. The many members in C. elegans have also been suggested to have distinct roles [11]. Only ALG contributes to miRNA processing, whereas RDE-1 and ERGO-1 cleave exogenous and endogenous dsRNA to produce siRNA. Intriguingly, the Piwi subfamily is not found in A. thaliana. This organism has only AGO subfamily members, where AGO1 is involved in miRNA and endogenous siRNA, and AGO4 in DNA methylation through endogenous siRNA [9].
There is an additional organism-dependent pathway: an amplification of small RNAs relying upon RNA-dependent RNA polymerase (RdRP), which currently has been found only in C. elegans and A. thaliana. The polymerase synthesizes dsRNAs from RNAs cleaved by siRNP or miRNP, and the synthesized dsRNAs are used as siRNA precursors [12]. This process contributes to the amplification of siRNA and the subsequent silencing effect. In plants, two cleavage events often trigger siRNA biogenesis by this amplification pathway [13]. RdRP activity has also been observed in D. melanogaster, although its homolog has not yet been identified [14].
Endogenous siRNA variations. siRNAs were originally found as processed products of introduced long dsRNA, but subsequent analysis has revealed endogenous siRNAs with several origins. Repeat-associated siRNAs (rasiRNAs) originate from repetitive sequences, such as transposable elements, that have the ability to replicate themselves independently of their host organisms. rasiRNAs are found in both the sense and antisense strands of the transposable elements, and there are predominant biases in the strand from which the RNAs are derived [15][16][17][18][19][20]. rasiRNAs are suggested to repress the transposable elements themselves or mRNAs harboring sequences complementary to them. Besides post-transcriptional silencing, rasiRNAs have been shown to be involved in transcriptional silencing in plants through chromatin modifications [15,16]. Repetitive sequences are frequently found in chromatin domains, and it is suggested that rasiRNAs contribute to the regulation of chromatin status.
Trans-acting siRNA (ta-siRNA) is a class of siRNA that targets other genes rather than the gene producing the siRNA itself [21]. It is found only in plants, and surprisingly it is derived from an mRNA cleaved by a miRNA. A cleaved product of the mRNA attacked by a miRNP is used as a template to synthesize double-strand RNA with RdRP, and the synthesized product triggers RNA silencing to repress other genes [22,23].
Putatively siRNA-related RNAs. Novel types of ncRNAs that resemble siRNAs have also been discovered, although their function still remains unclear. Piwi-interacting RNAs (piRNAs) are potentially a third class of small RNAs involved in RNA silencing [24][25][26]. Piwi-related proteins compose a subfamily belonging to the Argonaute family as described above, and they are specifically expressed in the germline. In fact, Piwi of D. melanogaster and Miwi2 of mouse are essential for germline stem cells, which suggests a common machinery between these organisms [27]. piRNAs are commonly 25nt-29nt long, slightly longer than miRNAs and siRNAs, and they are clustered into a limited number of loci on the genome, where piRNAs are mainly encoded on only one strand. Recent analysis suggests a model of their biogenesis in fly: piRNA attacks a transcribed transposon, and the cleaved transposon contributes to piRNA production [28]. Surprisingly, the three members of the Piwi subfamily, Piwi, Aub, and Ago3, play distinct roles in the model: Piwi and Aub bind specifically to piRNA that is antisense to the transposon, while Ago3 binds the cleaved fragment of the targeted transposon. The interplay between piRNAs that are sense and antisense to transposable elements is suggested to be conserved in mouse [29]. However, no long transcripts derived from the piRNA loci, the potential precursors of piRNA, have yet been found. In addition, mouse piRNAs are not significantly related to repeats, whereas a substantial fraction of fly piRNAs are derived from repeats [24,25]. The piRNA pathway, including its biogenesis, still remains to be studied in detail.
A class of 21nt-long RNAs that is distinct from miRNA and siRNA has been found in C. elegans. These RNAs, termed 21U-RNA [30], are similar to piRNA in a few aspects: a uridine is frequently found at their 5′ end, and their origins are clustered on the genome. However, their clusters span large regions (2-3 Mbp), compared to the piRNA clusters (approximately 100 kbp). The 21U-RNAs also share two motifs in their upstream regions (less than 50 bp away), whereas no motifs have so far been found in the biogenesis of other small RNAs involved in RNA silencing [30]. The C. elegans specific subfamily of the Argonaute family (Figure 1) is expected to take part in species-specific pathways, and some of them might be related to the 21U-RNAs.

snoRNA/scaRNA
Small nucleolar RNA (snoRNA), another class of small RNA discovered recently, contributes to RNA modification of ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs), and, putatively, other RNAs [31]. Two families of snoRNA have been revealed to direct distinct modifications: the C/D box family, for 2′-O-methylation, and the H/ACA box family, for pseudouridylation. The former possesses two motifs at the 5′ and 3′ ends (termed C and D boxes), imperfect copies of the motifs (C′ and D′ boxes), and guide sequences to specify target RNAs. C/D box snoRNA forms an RNP complex with four proteins: FBL (fibrillarin, a methyltransferase), NOL5A (Nop56), NOP5/NOP58, and NHP2L1 (Snu13). The H/ACA box snoRNA possesses two hairpins, which contain internal loops to form pseudo-knot structures with its target RNA, and two single-stranded regions containing two motifs (H and ACA boxes). It also forms an RNP, with four distinct proteins: DKC1 (dyskerin, a pseudouridine synthase), NOLA1 (Gar1p), NOLA2 (Nhp2p), and NOLA3 (Nop10p) [32]. These snoRNPs mediate the modifications of the targeted RNA, where their target sites are recognized by complementary (guide) sequences within the snoRNAs. Orphan snoRNAs, whose guide sequences are not complementary to rRNA or snRNA, have also been observed [33]. Notably, one of these orphan snoRNAs, HBII-52, which is located in an imprinted locus, has been revealed to regulate alternative splicing of the serotonin receptor 2C by relying upon sequence complementarity. Loss of this snoRNA produces different isoforms of mRNA, which are likely to cause Prader-Willi syndrome [34]. This finding suggests that the other orphan snoRNAs are potentially involved in splicing machinery as well as in RNA modification. Intriguingly, composite RNAs harboring both C/D boxes and H/ACA boxes have been found. They are localized to Cajal bodies, conserved subnuclear organelles in the nucleoplasm, and are termed small Cajal body-specific RNAs (scaRNAs). They are suggested to mediate both of the two modifications, 2′-O-methylation and pseudouridylation, relying upon the corresponding motifs [35,36].
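Target recognition by a guide sequence, as described above, can be illustrated with a minimal sketch. It assumes perfect Watson-Crick complementarity between the guide and its target site, which is a simplification: real guide-target duplexes may tolerate mismatches, and the position of the modified nucleotide relative to the box motifs is not modeled. The function names and the sequences are hypothetical and serve only as an illustration.

```python
# Minimal sketch: locate sites in a target RNA that are perfectly
# complementary to a snoRNA guide sequence (an illustrative assumption;
# mismatches and G-U wobble pairs are not considered here).

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (T is read as U)."""
    return rna.upper().replace("T", "U").translate(COMPLEMENT)[::-1]


def find_guide_targets(guide, target):
    """Return 0-based start positions in `target` that pair with `guide`."""
    site = reverse_complement(guide)
    target = target.upper().replace("T", "U")
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)  # allow overlapping hits
    return positions


# Hypothetical guide and target sequences.
guide = "AUGGCUACG"
target = "GGAACGUAGCCAUUUCC"
hits = find_guide_targets(guide, target)
```

An orphan snoRNA in this picture is simply a guide for which such a search against rRNA and snRNA sequences returns no positions.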

Small RNA Exploration
Technologies to profile cellular RNA, sometimes termed RNomics [37,38], have led to recent discoveries in the field of small RNA. Their advances drastically extend the range that can be explored. The approaches to profile cellular RNAs are mainly classified into four categories [38]: (i) RNA direct sequencing, (ii) cDNA cloning followed by sequencing, (iii) hybridization-based detection, and (iv) genomic SELEX. The first category, RNA direct sequencing, is a classical method applicable for a very limited number of RNAs, which are highly abundant and distinguishable from other species relying upon just length, like tRNA and rRNA. The fourth category, genomic SELEX, identifies possible RNA sequences for binding a specific protein through in vitro synthesis of RNAs based on the genome. Although it has a benefit in its independence from samples expressing target RNAs, its application is quite limited. The remaining two, cDNA sequencing and the hybridization-based approach, are the most commonly used methods for recent exploration efforts.
cDNA sequencing. A widely used approach to explore small RNAs is random sequencing of size-fractionated RNAs, which requires linker ligation to cellular RNAs, reverse transcription, PCR amplification, concatemerization, cloning, and sequencing [39]. Based on its ability to explore unpredicted RNAs, this approach initiated the systematic exploration of novel miRNAs [40][41][42], and subsequently it has been applied to profile additional RNA species in a wide range of samples, such as RNAs extracted from various organisms and mutants, and RNAs immunoprecipitated with a related protein [13,24,25,28,30,43,44,45]. A benefit of random sequencing is its ability to extract information about the abundance of different RNA species included in the sample. However, as the likelihood of a molecule being sequenced correlates with its abundance, rare species have a small chance of being discovered in small-scale sequencing. This limitation is being addressed by the development of highly parallel sequencing technologies, such as MPSS [46] and 454 pyrosequencing [47]. These large-scale sequencing techniques enable the reading of small RNAs several hundred thousand times (Figure 2, Table S1), which makes it possible to detect rare species and to quantify RNA abundances with better accuracy. The number of reads obtained through random sequencing of cDNA will not necessarily reflect the original proportion, because the efficiency of reverse transcription of RNA into cDNA differs depending on secondary structure and/or modifications [38]. However, it is still possible to compare abundances of detected RNAs across samples because reverse-transcription efficiency is expected to depend on the RNA itself, not on which samples were used.
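The reasoning about reverse-transcription bias can be made concrete with a small numerical sketch. It assumes, as a simplification not stated in the cited work, that the expected read count of an RNA is the product of a sample-wide sequencing depth, an RNA-specific efficiency, and the true abundance. All names and numbers below are hypothetical.

```python
from fractions import Fraction

# Assumed model (illustrative only):
#   expected reads = depth * efficiency[rna] * abundance[rna]
# where efficiency depends on the RNA, not on the sample.


def expected_reads(abundance, efficiency, depth):
    """Expected read count per RNA species under the assumed model."""
    return {rna: depth * efficiency[rna] * abundance[rna] for rna in abundance}


def fold_change(values_a, values_b):
    """Per-RNA ratio of sample A to sample B."""
    return {rna: values_a[rna] / values_b[rna] for rna in values_a}


# Hypothetical reverse-transcription efficiencies and true abundances.
efficiency = {"rna_x": Fraction(9, 10), "rna_y": Fraction(1, 10)}
sample_1 = {"rna_x": Fraction(100), "rna_y": Fraction(100)}
sample_2 = {"rna_x": Fraction(50), "rna_y": Fraction(200)}

# Equal depth is used here; unequal depths would scale every
# fold change by the same constant factor.
reads_1 = expected_reads(sample_1, efficiency, depth=Fraction(1))
reads_2 = expected_reads(sample_2, efficiency, depth=Fraction(1))

true_fc = fold_change(sample_1, sample_2)
observed_fc = fold_change(reads_1, reads_2)
```

Within sample 1 the two RNAs are equally abundant but yield unequal read counts, so proportions within a sample are distorted. The per-RNA fold change between samples is unaffected, because the efficiency term appears in both the numerator and the denominator and cancels.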
Hybridization-based detection. Hybridization-based detection systems, such as northern blots and microarrays, are used to detect and/or quantify expression of small RNAs [37,38] as well as mRNAs. Northern blots are commonly used to confirm small RNAs detected with other methods and have also been devised for expression profiling of more than 100 miRNAs [48]. Their sensitivity is comparatively limited, but the limitation can be addressed by the use of locked nucleic acids (LNA) as probes [49] and the use of a soluble carbodiimide to cross-link RNA to nylon membranes [50]. Microarrays are utilized for large-scale profiling of small RNAs [51,52], and their sensitivities and specificities can also be improved with the use of beads [53], LNA [54], and incorporation of hairpin structures into probes [55]. In the context of novel miRNA exploration, customized microarrays are used in combination with computational prediction [43,56]. Such predictions are not necessarily exact, and in several cases the predicted sequence differs slightly from the true mature miRNA. In order to detect true miRNAs from such predictions, RNA-primed array-based Klenow extension (RAKE) has been used to determine the precise borders of a mature miRNA with tiling probes covering the regions proximal to the predictions [43]. Another study performed sequence-directed cDNA cloning and sequencing following a microarray analysis to confirm and determine the detected sequences [56]. Besides miRNA analysis, siRNAs and rasiRNAs have been analyzed with microarrays, in particular with high-density (tiling) microarrays. In the study of siRNAs derived from scattered sense and antisense regions within a locus, the use of tiling arrays to profile the entire region revealed distinct features between siRNAs and rasiRNAs [19]. In another study, a whole-genome tiling array was used to profile small RNAs, as well as long RNAs derived from the nucleus and the cytosol. This enabled the profiling of the production process of small RNAs on a whole-genome scale [57].
[Figure caption: The original data is provided as Table S1.]
Computational analysis. The experimental approaches described above require a coupling to computational analysis, such as the prediction of small RNAs and characterization of discovered RNAs. The sequencing approach requires classification of obtained RNAs in order to get a complete picture of the RNA composition in the analyzed samples, and to select RNAs of interest in subsequent analyses. A major approach in the classification of small RNAs is mapping them onto the genome with subsequent comparison to genome annotations. Databases collecting ncRNA sequences (and profiles in Rfam [58]) of specific interest (Table 1) are also available for such classifications.
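The classification step, mapping reads onto the genome and comparing them with annotations, can be sketched as a simple interval-overlap lookup. This is a minimal illustration that assumes reads have already been mapped, uses 0-based half-open coordinates, and requires strand agreement; the type names, coordinates, and class labels are hypothetical and do not come from any specific annotation database.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


class Annotation(NamedTuple):
    region: Interval
    rna_class: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals share at least one base on the same strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def classify(read: Interval, annotations: List[Annotation]) -> str:
    """Return the class of the first overlapping annotation, if any."""
    for annotation in annotations:
        if overlaps(read, annotation.region):
            return annotation.rna_class
    return "unannotated"


# Hypothetical annotations and mapped reads.
annotations = [
    Annotation(Interval("chr1", 100, 180, "+"), "miRNA precursor"),
    Annotation(Interval("chr1", 500, 640, "-"), "snoRNA"),
]
reads = [
    Interval("chr1", 120, 142, "+"),
    Interval("chr1", 120, 142, "-"),
    Interval("chr1", 630, 660, "-"),
]
classes = [classify(read, annotations) for read in reads]
```

Reads that remain unannotated after such a comparison are the natural candidates for novel small RNA species. A linear scan is adequate for a sketch; a large read set would normally call for an indexed structure such as an interval tree.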
Except in the case of the whole-genome tiling arrays, the hybridization-based approach requires target RNA sequences to design probes in advance of experiments. Computational prediction of miRNA has been performed in many studies, mainly relying upon the secondary structure of the miRNA precursor [59,60], thermodynamic stability [61], and/or genome conservation between species [62][63][64]. Several studies have succeeded in validating the predicted RNAs experimentally and demonstrating the accuracy of the prediction methods [65]. Prediction of snoRNA has also been performed computationally, relying upon features such as secondary structure, antisense sequence to putative target, and the presence of C/D and H/ACA boxes [66][67][68][69][70]. Recent analyses have revealed unexpected RNAs even within these classes, such as species-specific (not conserved among species) miRNA and orphan snoRNA with unknown targets, as described above. In particular, genome conservation is used broadly to find functional elements within the genome, but this does not necessarily mean that nonconserved regions have no function. Computational methods improved in light of these recent findings are being used to explore further novel RNAs [56,69,71].
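The secondary-structure criterion used in miRNA prediction can be illustrated with a deliberately simplified hairpin check. The sketch below folds a candidate sequence back on itself around a central loop of assumed length and reports the fraction of stem positions that form Watson-Crick or G-U pairs. It is not the method of any cited predictor: published approaches rely on proper RNA folding and thermodynamic scoring, and the function name, loop length, and sequences here are hypothetical.

```python
# Simplified hairpin test: ungapped pairing between the 5' arm and the
# reversed 3' arm of a candidate precursor. Bulges and internal loops,
# which real pre-miRNA stems contain, are not modeled.

PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),  # wobble pairs
}


def stem_pairing_fraction(seq, loop_len):
    """Fraction of stem positions that pair when `seq` forms a hairpin
    with a central loop of about `loop_len` nucleotides."""
    seq = seq.upper().replace("T", "U")
    arm_len = (len(seq) - loop_len) // 2
    if arm_len <= 0:
        raise ValueError("sequence too short for the requested loop length")
    five_arm = seq[:arm_len]
    three_arm = seq[len(seq) - arm_len:]
    paired = sum((x, y) in PAIRS for x, y in zip(five_arm, reversed(three_arm)))
    return paired / arm_len


# Hypothetical candidates: a perfect hairpin and a sequence that cannot pair.
hairpin = "GGCAUCGA" + "UUCG" + "UCGAUGCC"
no_stem = "A" * 20
hairpin_score = stem_pairing_fraction(hairpin, loop_len=4)
no_stem_score = stem_pairing_fraction(no_stem, loop_len=4)
```

A candidate would be kept only if its score exceeds some threshold. That threshold is itself an assumption to be tuned against known precursors, and a check of this kind would in practice be combined with the stability and conservation criteria mentioned above.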

Discussion
Recent post-genome analyses have revealed that a large fraction of the genome, more than 60%-70%, can be transcribed [72,73]. Considering that small RNAs are derived from intergenic regions, introns, exons, and repetitive sequences originally thought to be unimportant or junk, all transcripts are potential sources of functional small RNAs. As only a limited fraction of the cellular RNAs have been characterized functionally [1], it is expected that a substantial number of small RNA species and their related pathways still remain to be discovered. Recent studies have demonstrated that synthetic RNA duplexes harboring sequences complementary to promoters rather than mRNA can contribute to gene activation and repression [74][75][76]. No endogenous RNAs have previously been found to have such a function. This does not necessarily mean that no endogenous RNAs are involved in this machinery, but rather suggests the possibility that there are still undiscovered small RNA pathways.