Sequence Fingerprints of MicroRNA Conservation

It is known that the conservation of protein-coding genes is associated with their sequences both various species, such as animals and plants. However, the association between microRNA (miRNA) conservation and their sequences in various species remains unexplored. Here we report the association of miRNA conservation with its sequence features, such as base content and cleavage sites, suggesting that miRNA sequences contain the fingerprints for miRNA conservation. More interestingly, different species show different and even opposite patterns between miRNA conservation and sequence features. For example, mammalian miRNAs show a positive/negative correlation between conservation and AU/GC content, whereas plant miRNAs show a negative/positive correlation between conservation and AU/GC content. Further analysis puts forward the hypothesis that the introns of protein-coding genes may be a main driving force for the origin and evolution of mammalian miRNAs. At the 5′ end, conserved miRNAs have a preference for base U, while less-conserved miRNAs have a preference for a non-U base in mammals. This difference does not exist in insects and plants, in which both conserved miRNAs and less-conserved miRNAs have a preference for base U at the 5′ end. We further revealed that the non-U preference at the 5′ end of less-conserved mammalian miRNAs is associated with miRNA function diversity, which may have evolved from the pressure of a highly sophisticated environmental stimulus the mammals encountered during evolution. These results indicated that miRNA sequences contain the fingerprints for conservation, and these fingerprints vary according to species. More importantly, the results suggest that although species share common mechanisms by which miRNAs originate and evolve, mammals may develop a novel mechanism for miRNA origin and evolution. In addition, the fingerprint found in this study can be predictor of miRNA conservation, and the findings are helpful in achieving a clearer understanding of miRNA function and evolution.


Introduction
Gene evolution has been well investigated, and the conservation of genes has been found to be associated with many biological factors [1]. microRNAs (miRNAs) are small non-coding regulatory RNAs that have been identified in recent years [2]. They have important biological functions and show strong association with various diseases [3]. Similar with protein-coding genes, studies on miRNA conservation are important in understanding not only their function and genomic organization but also their relation to human disease and medicine [4]. Indeed, many factors have been revealed to be correlated with miRNA conservation in recent years. For example, many miRNAs tend to be conserved during evolution [2,5,6]. Zhang et al. reported that human-specific miRNAs tend to evolve rapidly [4]. miRNA precursor stem loops show significantly increased mutational robustness [7]. The rapid evolution of an X-linked microRNA cluster [8] and the Alumediated rapid expansion of miRNA genes [9] in primates have been observed. Vazquez et al. found that recently evolved miRNAs tend to be longer than ancient miRNAs in Arabidopsis [10]. Szollosi et al. suggested that the conservation of miRNAs may be associated with selection for environmental robustness [11]. Piriyapongsa et al. revealed that transposable element (TE)derived miRNAs are less conserved than non-TE-derived miRNAs in human [12]. Liang et al. observed that highly expressed human microRNA genes tend to be conserved [13]. Recently, de Wit et al. discovered hairpin shifting, a novel mode of miRNA evolution [14].
The above studies have uncovered several important evolutionary insights. However, the relationship between some important sequence features and miRNA conservation has not been investigated to date. Many questions about the relationships of miRNA conservation and their sequence features remain unknown. Is the base content of miRNAs correlated with their conservation? Do the sequence features of mature miRNA (MIR) sequences and miRNA precursors (pre-miR) have similar correlation with miRNA conservation? Do conserved miRNAs tend to have a high GC content? Are there any differences among species? Does the cleavage process of MIRs have an affect on their conservation?
To address these questions, we analyzed the correlation between miRNA conservation and various miRNA sequence features, including ACGU content of MIRs, pre-miRs, and non-MIR sequences, and the base content at cleavage sites. The results show that these sequence features are significantly correlated with miRNA conservation, and different species unexpectedly show different and even opposite correlation patterns. Our analysis further revealed that the introns of protein-coding genes and the need of functional diversity of might be driving forces for miRNA origin and evolution in mammals.

Species show specific patterns of correlation between miRNA conservation and their base content
We first performed correlation analysis for miRNA conservation and their base content (Features 1-12). The results showed that miRNA conservation is significantly correlated with the sequence features in the six species (Table 1). For human and mouse, miRNA conservation showed a significantly positive correlation with AU content but a significantly negative correlation with GC content in all types of sequences (MIR, pre-miR, and non-MIR). The above correlations disappeared in fly but began to be reversed from worm (Table 1, Figure 1). Compared with two mammals, in two plants, Arabidopsis and rice, the correlation patterns between miRNA conservation and base content are nearly totally reversed (Table 1). For protein-coding genes, however, conserved genes tend to have an increased GC content in mammals, birds [15], yeast, Arabidopsis [16], and rice [17]. Obviously, this situation in mammal miRNAs is different from that in worm and plant miRNAs and that in protein-coding genes. This suggested that the forces connecting base content and evolution in miRNAs might be diverged in miRNAs, especially in mammal miRNAs.

Species show specific patterns of correlation between miRNA conservation and the bases at cleavage sites
The location of caspase cleavage sites has been reported to often have specific structural characteristics [18]. This gives us clues that the cleavage sites of miRNAs may also have specific features for miRNA conservation. To investigate this, we first classified miRNAs into two groups: miRNAs that have at least one other homolog (conserved miRNAs) and miRNAs that do not have any homolog (termed as less-conserved miRNAs). We next analyzed the distribution of ACGU in the four bases at the cleavage sites ( Figure 2). The result shows that conserved miRNAs have a greater fraction of U at the 59 terminus than less-conserved miRNAs in mammals ( Figure S1). For example, 45.5% of conserved miRNAs have a base U at the 59 terminus in human, whereas only 30.2% of less-conserved miRNAs have a base U at the 59 terminus. This difference was further confirmed to be significant (P = 0.02, randomization test; Figure 3). Moreover, miRNAs with a base U at the 59 terminus are more conserved than miRNAs with a non-U base at the 59 terminus (P = 2.0610 24 , Wilcoxon test). As a comparison, the miRNAs of insects and plants do not have the above pattern.

Hypothesis on the intron-driven origin of miRNAs in mammals
The investigation of why animals and plants show opposite patterns in the correlation of base content and conservation of miRNAs is interesting. As described above, unlike protein-coding genes in both animals and plants and unlike miRNAs in plants which show positive correlation between GC content and their conservation, the miRNAs in mammals show a negative correlation between GC content and their conservation, an opposite pattern with protein and plant miRNAs. How does this occur? A big difference in the genome location of miRNAs in animals and plants may give us the clues. Most plant miRNAs are located in intergenic regions, whereas many animal miRNAs are located in the introns of protein-coding genes. Considering that proteincoding genes have a higher fraction of GC content than other DNA regions, as introduced above, novel miRNAs will have a higher fraction of GC content than conserved miRNAs in animals if less-conserved miRNAs have a higher probability of introndriven origins. We first confirmed this in human. We calculated the fractions of intronic miRNAs in conserved miRNAs and lessconserved miRNAs, respectively. Indeed, less-conserved miRNAs have a higher fraction (46.5%) of intron origin than conserved miRNAs (38.2%). We further compared the conservation scores of intronic miRNAs and non-intronic miRNAs. As expected, intronic miRNAs show significantly less conservation than non-intronic miRNAs (P = 1.3610 29 , Wilcoxon test; Figure 4). This result suggested that the higher enrichment of GC in novel miRNAs may mainly result from their enrichment in introns. Therefore, investigating the correlation between base content and miRNA conservation in the context of introns and non-introns, respectively, will be interesting. As a result, non-intronic miRNAs show consistent patterns with plant miRNAs. For example, the content of base A shows a significantly negative correlation with miRNA conservation (R = 20.33, P = 7.0610 211 for miRNA precursors; R = 20.15, P = 0.004 for non-mature sequences, which are the rest sequences of miRNA precursors after removing the mature ones). For intronic miRNAs, however, they still tend to show opposite patterns compared with non-intronic miRNAs, miRNAs in plants, and protein-coding genes. For example, base C is negatively correlated with miRNA conservation (R = 20.17, P = 0.009 for miRNA precursors; R = 227, P = = 2.7610 25 for mature miRNAs). This result suggested that intronic miRNAs prefer to originate from younger protein-coding genes rather than older ones. We repeated the above analysis in mouse, and the main results are the same (data not shown). All these findings suggested the hypothesis that the introns of protein-coding genes, especially young protein-coding genes, may be a driving force for miRNA origin.

Hypothesis on functional diversity-driven miRNA evolution in mammals
We have shown that in mammals, conserved miRNAs have a preference for base U at the 59 end, whereas this pattern does not  [19,20]. During evolution, highly sophisticated organisms need to develop more diverse biological functions in order to meet the challenges of more complex tasks. The change of the base fraction at the 59 terminus may be driven by the need for miRNA functional diversity. The evolution of Dicer genes, critical genes in small RNA biogenesis, may provide some clues. Different species have different numbers of Dicer genes. For example, the six species used in this study have one (human and mouse), two (fly), three (worm), four (Arabidopsis), and six (rice) Dicer genes, respectively. This suggested that during evolution, the number of Dicer genes tends to decrease and that the functions of multiple Dicer genes are integrated into one Dicer gene in mammals. Therefore, the mammal Dicer gene tends to have greater functional diversity, which may result in greater diversity in miRNA biogenesis and function. We first confirmed this by investigating the change of the base U fraction at the 59 end, the most critical cut point of Dicer. From plants to mammals, the differences of the base U fraction between conserved and lessconserved miRNAs increase as the number of Dicer genes decreases (R = 20.81, P = 0.05, Spearman's correlation; Figure 5A). Moreover, mammals also showed greater diversity in miRNA length (R = 20.97, P = 0.001, Spearman's correlation; Figure 5B). In miRNA sorting, Ago1 has been shown to have a strong preference for base U at the 59 end, whereas Ago2 and Ago4 have a preference for other bases at the 59 end [21]. The changes in base content at the 59 end may trigger a switch in miRNA silencing mechanisms, from cleavage to translation inhibition [21]. miRNA is known to silence targets by two mechanisms, cleavage and translation inhibition. The target cleavage often results from perfect miRNA-mRNA hybrids, whereas translation inhibition normally results from imperfect miRNA-mRNA hybrids. The binding style means that miRNAs using the translation inhibition mechanism normally have bigger   target numbers. Therefore, they have greater functional diversity than miRNAs using the cleavage mechanism. Indeed, reports have indicated that typically, animal miRNAs regulate targets by translation inhibition so they normally have dozens of targets. Plant miRNAs, in contrast, typically regulate targets by cleavage so they normally have a limited number of targets [21]. This supported the hypothesis that the origin of miRNAs in highly sophisticated organisms (i.e., human) might be driven by the increasing need for functional diversity. As a result, young miRNAs tend to use the non-U base at 59 ends, which then triggers the switch in miRNA regulation mechanisms and alters functional diversity. To confirm this, we investigated the relationship between miRNA regulation mechanisms and miRNA conservation in human using a number of experimentally supported miRNA regulation mechanism data. In particular, we obtained data on experimentally supported miRNA regulation mechanisms from TarBase [22], then we classified these miRNAs into two groups: the highly conserved group (old miRNAs) and the lowly conserved group (young miRNAs). As expected, young miRNAs have a strong preference for the mechanism of translation inhibition compared with old miRNAs (translation inhibition fraction in two groups: 83.9% vs. 32.3%, P = 7.0610 211 , Fisher's exact test; Figure 6). This suggested that young miRNAs in mammals tend to have diverse functions, supporting the hypothesis that the pressure of functional diversity during the evolution of highly sophisticated organisms might be a force influencing miRNA origin.

Network context of miRNAs have effects on their evolution
Previous studies have shown that molecules evolution is affected by the biological networks of these molecules [23,24]. For miRNAs, they also show preference when targeting genes in the context of biological network, for example signaling network [25]. Therefore, it will be interesting whether miRNAs show different patterns of evolution in the context of biological network. For doing so, we obtained the predicted miRNA target from TargetScan. We then mapped miRNAs into a human signaling network [26] through miRNA targets. We further identified top 10% miRNAs that predominantly target extracellular molecules, membrane molecules, cytosol molecules, and nucleus molecules, respectively. As a result, miRNAs that regulate different network context targets show significantly different evolutionary patterns. miRNAs regulating extracellular molecules are the most conserved ones, followed by miRNAs regulating membrane molecules, cytosol molecules, and nucleus molecules. From extracellular space to nucleus, these miRNAs show decreasing conservation. For example, from extracellular space to nucleus, the four groups of miRNAs have 14, 12, 6, and 2 most conserved miRNAs, respectively.

Discussion
In summary, we have revealed that miRNA sequence features contain information on miRNA conservation, suggesting that these features might be the fingerprint for miRNA evolution. More interestingly, different species show different and even opposite patterns of fingerprints, indicating that although common  Analysis of the correlation between base content and miRNA conservation revealed that mammals and plants show opposite patterns. Furthermore, evidence from intronic miRNAs and their base content distribution put forward the hypothesis on introndriven miRNA origin in mammals. The reasons why many mammal miRNAs originate from the introns of protein-coding genes, especially evolutionary young protein-coding genes, remain largely unknown. This origin of miRNAs is assumed to possess some advantages. For example, intronic miRNAs with their host genes may form transcription units and co-play roles in common tasks.
Analysis of the bases at miRNA cleavage sites showed that conserved miRNAs have a preference for base U at the 59 end in mammals, whereas less-conserved miRNAs have a preference for non-U bases at the 59 end. Interestingly, this pattern does not exist in insects and plants. Further analysis revealed that this pattern might have resulted from the changes in Dicer number and miRNA sorting by Ago proteins. From plants to mammals, the number of Dicer genes decreases from more than 4 to only 1, which means that the Dicer gene in mammals has greater diversity in function. One possible reason for the Dicer gene decrease in mammals may be that mammals show different immune mechanisms from insects and plants. In both insects and plants, Dicer genes are necessary for anti-virus protection [27]. In contrast, mammals have evolved other innate immune mechanisms for anti-virus protection [28]. Therefore, the Dicer gene related with immune response is not necessary for mammals. As a result, the integration of Dicer genes makes miRNAs more diverse. Under the pressure of a complex environment, mammalian cells need functions that are more diverse. Therefore, newly evolved miRNAs tend to have a preference for the non-U base at the 59 end, which may be helpful in switching perfect miRNA-mRNA binding to imperfect miRNA-mRNA binding. A main result of the imperfect binding is that one miRNA will have more binding sites on one gene and will have more targets. Therefore, it will be more diverse in functions.

Data of miRNA
We downloaded miRNA family data from miRBase (November, 2009) [29]. We calculated the conservation score for the miRNAs of human (Homo sapiens, hsa), mouse (Mus musculus, mmu), fly (Drosophila melanogaster, dme), worm (Caenorhabditis elegans, cel), Arabidopsis (Arabidopsis thaliana, ath), and rice (Oryza sativa, osa) using the method presented by Zhang et al. [4], which defined the conservation score of an miRNA using the number of its family members. The method is validated by SNP analysis [4] and evolution analysis of the miRNA transcriptional network [30].
We classified miRNAs into less-conserved and conserved based on the miRNA family data presented in miRBase. For example, for a human miRNA, if it does not have other family members (in human or other species), we take it as less-conserved, otherwise, we take it as conserved. miRNAs that belong to a miRNA family have the same seed regions but most of them do not have the same sequences at miRNA gene level. We also counted the number of family members of for each human miRNA and took this count as its conservation score. Using this score and miRNA expression data by Liang et al. [31], we observed the positive correlation of miRNA conservation and its expression level, as reported by a previous study [13]. This suggests that this score could be a useful metric to evaluate miRNA conservation. Because this metric depends on the miRNA family data, it is not perfectly accurate and will be improved when more miRNA family data becomes available in the future.

miRNA sequences features
We obtained the sequences of MIRs and pre-miRs of the above six species from miRBase [29]. We focused on 16 sequence features, the ACGU content of pre-miRs (Features 1 to 4), the ACGU content of MIRs (Features 5 to 8), the ACGU content of non-MIRs (Features 9 to 12), and the bases at cleavage sites (Features 13 to 16). Features 1 to 12 are real numbers, and Features 13 to 16 are characters (A, C, G, or U).

Statistical computing
All statistical analyses were performed using R, a statistical computing language (http://www.r-project.org/). We used Cluster 3.0 for clustering analysis.