Discovering Numerical Differences between Animal and Plant microRNAs

Previous studies have confirmed that there are many differences between animal and plant microRNAs (miRNAs), and that numerical features based on sequence and structure can be used to predict the function of individual miRNAs. However, there is little research regarding numerical differences between animal and plant miRNAs, and whether a single numerical feature or combination of features could be used to distinguish animal and plant miRNAs or not. Therefore, in current study we aimed to discover numerical features that could be used to accomplish this. We performed a large-scale analysis of 132 miRNA numerical features, and identified 17 highly significant distinguishing features. However, none of the features independently could clearly differentiate animal and plant miRNAs. By further analysis, we found a four-feature subset that included helix number, stack number, length of pre-miRNA, and minimum free energy, and developed a logistic classifier that could distinguish animal and plant miRNAs effectively. The precision of the classifier was greater than 80%. Using this tool, we confirmed that there were universal differences between animal and plant miRNAs, and that a single feature was unable to adequately distinguish the difference. This feature set and classifier represent a valuable tool for identifying differences between animal and plant miRNAs at a molecular level.

Recently, several studies have shown that miRNA genes are lineage-specific or species-specific, and that numerical features of miRNA genes also be conserved [32,33]. Numerical features of miRNA genes refer to quantity index which are used to describe nucleotide content, secondary structure information, free energy and information entropy and so on. These findings imply that there may be numerical differences between animal and plant miRNAs. We therefore aimed to identify any significantly different numerical differences and explore the possibility that these differences could be used to distinguish between animal and plant miRNAs.
We selected 10951 animal and 3188 plant miRNA genes from miRBase (version21) [34] for use as a basic library and examined 132 numerical features that included sequence, structure, energy, and information entropy using the Perl program. We systematically analyzed numerical differences between animal and plant miRNAs using several statistical analysis methods. We found several numerical features, which include helix number, stack number, length of pre-miRNA, MFE and so on that could be used to differentiate between plant and animal miRNA genes. However, none of the numerical differences were sufficient on their own to clearly distinguish between individual animal and plant miRNAs. Using these results, we developed an efficient classifier to distinguish between plant and animal miRNAs based on the differences in the miRNA numerical features. Our findings demonstrate that combinations of numerical features can be used to effectively identify plant and animal miRNAs.

miRNAs gene sequences
We selected 10951 animal and 3188 plant miRNA genes from miRBase for use in this analysis. Details on these genes are shown in Table 1.

Obtaining numerical features of miRNA
We extracted 132 numerical features that included sequence, structure, energy, and information entropy by designing a Perl program (S1 File). These features were divided into eight classes, and the serial numbers and names of the features are described in S1 Table. The first class referred to the frequency characteristics of single nucleotides. The second class referred to twobase combinations of the four bases A, C, G, and U, while the third class referred to three-base combinations of the four bases.
The fourth class referred to frequency features of the secondary structure matching state. Based on RNA secondary structure predicted by Mfold [35], the matching state of each nucleotide was described using the method presented by Xue et al [36]. For example, "C++. " indicates that this nucleotide at the site is "A", with a left matching site, a right mismatching site in the secondary structure and itself is a matching site. Examples are shown in Fig 1. There were 32 frequency features for the secondary structure matching state of miRNAs.
The fifth class included the length of miRNA genes, the number of bulge loops, the number of helices, the number of interior loops, and the number of stacks. Except for the length of the gene, the features were taken from Mfold predictions of secondary structure. Detailed examples are shown in Fig 1. The sixth class included the minimum free energy (MFE) [37], the adjusted MFE [38], and the MFE index [39], while the seventh class included G+C content, (G+C)/(A +U) ratio, A/C ratio, and G/U ratio.
The eighth class referred to features related to information entropy. The information entropy [40] was calculated using the formula: Formula (1) generated four kinds of information entropy (IE) related to the frequency of single nucleotides (IESN), dual nucleotides (IEDN), triple nucleotides (IETN), and the matching state frequency of the secondary structure (IESS). The eight classes were designated A-H in corresponding order. The p-value is frequency of every class nucleotides (For example, frequency of A, C, G and U or frequency of AA, AC, AG, AU, CA, CC, CG, CU, GA, GC, GG, GU, UA, UC, UG and UU). Formula (1) generated four kinds of entropy information related to the frequency of single nucleotides, dual nucleotides, triple nucleotides, and the matching state frequency of the secondary structure.
The 132 numerical features of 10951 animal and 3188 plant miRNA have been obtained and kept in S2 Table. Basic statistical analysis methods We applied a two-sample Kolmogorov-Smirnov test [41,42] and t-test to determine whether there were numerical differences between animal and plant miRNAs. The two-sample Kolmogorov-Smirnov test is a nonparametric test that can be used to compare two samples. The Kolmogorov-Smirnov statistic quantifies a distance between the empirical distribution functions of two samples, and is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.
The Kolmogorov-Smirnov statistic for two given cumulative distribution functions F1(x) and F2(x) is shown below: The sup is abbreviation of the supremum of one numerical set.

Feature selection and classification method
We applied several feature selection methods to analyze numerical features of the miRNAs, and used the selected features to build a classifier for differentiating between animal and plant miRNAs. Seven feature selection search methods were selected: BestFirst, ExhaustiveSearch, GeneticSearch, GreedyStepwise, LinearForwardSelection, RandomSearch, and RankSearch. Osa-mir156a secondary structure as predicted by Mfold. H1~H7 denote helices. I1~I2 denote interior loops. T1 denote terminal loops or hairpin loops. B1~B3 denote bulge loops. 'G++.' indicates that the left base of G is a matching base ('+' denote matching, the left base of G base corresponding to the first mark behind G) and the right base of G is mismatching base ('.' denote mismatching, the right base of G base corresponding to the third mark behind G). G base is a matching base (the mark of G base is the second mark behind G). These methods have been described previously [43]. The cfsSubsetEval and FilteredSubsetEval attribute evaluators [44] were selected, and the Logistic [45]and J48 [46]models were selected as classification algorithms. NaiveBayes, BayesNet, FilteredClassifier, ZeroR, and RandomForest were used as described previously [47]. Those algorithms have been implemented by Weka [48]. About attribute evaluators, search methods and classification algorithms, we have introduced their details in S3 Table.

Results
Evaluating numerical differences between animal and plant miRNAs based on a single numerical feature To further clarify our results, the threshold for the Kolmogorov-Smirnov test statistic was set at 0.15. Using this threshold, we selected 17 significantly different numerical features: AU%, GU%, AUC%, GAC%, GAU%, GUC%, CUC%, A. . .%, U. . .%, helix number, interior loop number, stack number, length of pre-miRNA, MFE, adjusted MFE, MFE index, and information entropy of secondary structure. Except for GU%, GAC%, GUC%, and CUC%, the results for all features were higher in plant miRNAs than in animal miRNAs.
Specific differences between animal and plant miRNAs based on the top three significant numerical features. The Kolmogorov-Smirnov test statistic was much higher for three out of the 17 significantly different numerical features, specifically stack number, length of pre-miRNA, and MFE. We designed a bar plot for analyzing differences in the three features between animal and plant miRNAs in detail. Our results are shown in Fig 3. As shown in Fig 3A, we found that the distribution of pre-miRNA length in animals was more concentrated than that observed in plants, with >65% of sequences being 70-100 nt in length. The length of plant pre-miRNA was more diverse: only 35% of plant pre-miRNAs were in the 70-100 nt range, and nearly 5% of sequences were longer than 318 nt. In contrast, there were very few animal pre-miRNAs that were longer than 160 nt.
We found that the distribution of animal miRNA MFE values was also more concentrated than that of plants, with over 85% of animal MFE values greater than −46.2 kcal (Fig 3B). Again, the MFE values for plant miRNAs were more widely distributed. Only 50% of plant miRNAs had a MFE value greater than −46.2 kcal, but nearly 4% were larger than −126.2 kcal. Few animal MFE values were less than −76.2 kcal (Fig 3B). Fig 3C shows that distribution of animal miRNA stack numbers was highly concentrated, and over 90% of animal stack numbers were less than 35. Few animal stack numbers were higher than 40. Only 60% of plant miRNA stack numbers were less than 35, but over 20% were more than 40.
Although there were very obvious differences between animal and plant miRNAs based on these three numerical features, there was a large amount of overlap. This showed that a single feature was not sufficient for distinguishing between animal and plant miRNAs.

Single feature differences law for animal and plant miRNAs based on the Kolmogorov-Smirnov test statistic.
To outline a law for identifying differences between plant and animal miRNAs using a single numerical feature, we selected C%, G%, MFE index, and length of pre-miRNA. The Kolmogorov-Smirnov test statistic was from small to large. Based on these parameters, we designed a frequency density plot that included four subplots. The selected features were on four different levels based on the Kolmogorov-Smirnov test statistic. As shown in Fig 4, although feature distribution differences became clearer the closer the Kolmogorov-Smirnov test statistic became to 0.5, there was still a large area of overlap between animal and plant feature distribution density curves. This again showed that a single numerical feature was not sufficient to differentiate between animal and plant miRNAs. In general, we found that the larger the value of the Kolmogorov-Smirnov test statistic, the more significant the difference  between the animal and plant miRNA numerical feature. As a result of these findings, we decided to evaluate a combination of features to try to distinguish between plant and animal miRNAs.
Identification of feature sets that could be used to differentiate between animal and plant miRNAs Based on the Kolmogorov-Smirnov test statistic values, the top 17 out of the 132 examined numerical features were selected. We applied a feature selection technique for these 17 significantly different features, including two attribute evaluators, CfsSubsetEval and FilteredSubse-tEval, and six search methods, BestFirst, ExhaustiveSearch, GeneticSearch, GreedyStepwise, LinearForwardSelection,and RandomSearch. The analysis was finished by Weka software [48]. Our analysis results are shown in Table 2.  From the results, we found that four out of the 17 numerical features almost always arose in the six feature selection strategies. They were helix number, stack number, length of pre-miRNA, and MFE. Therefore, this feature subset was used as the basis of the classifier.

Building a classifier for animal and plant miRNAs
We applied seven classifiers for two feature subsets. Our analysis results are shown in Table 3. The seven classifiers included NaiveBayes, BayesNet, Logistic, FilteredClassifier, ZeroR, J48, and RandomForest. The S1 feature subset included the four features identified by feature selection, while the S2 feature subset included all 17 significantly different features. Analysis was performed using Weka software.
As shown in Table 3, we found that the maximum receiver operating characteristic (ROC) areas for each classifier all occurred in the logistic model for both of the feature subsets. For S1, the logistic classifier's ROC area was 0.805, and the precision of classification was 0.854. For S2, the logistic classifier's ROC area was 0.816, with a precision of classification of 0.861. The Note: Serial number information of being selected features refer to S1 Table. The number 51 represent GUC content frequency, the number 118 represent number of helix, the number 120 represent number of stack, the number 121 represent length of hairpin and the number 122 represent minimum free energy of pre-miRNA's secondary structure. Attribute Evaluator and Search Method refer to papers [43,44] and all details have been recorded in S3 Table. doi:10.1371/journal.pone.0165152.t002 performance of the classifiers was very similar based on the two feature subsets. Consistent with our aim of determining the smallest number of numerical features that could be used to differentiate between animal and plant miRNAs, S1 and the logistic classifier were selected as our research model. The logistic model was as follows: LogitðPÞ ¼ 6:1436 þ 0:0893x 1 À 0:0691x 2 À 0:0241x 3 þ 0:0263x 4 ð3Þ Where P stands for probability of animal miRNA, x 1 denotes helix number, x 2 denotes stack number, x 3 denotes length of pre-miRNA; and x 4 denotes MFE. The model and its coefficients were all significant (P = 0.01).

Discussion
Although there were significant differences between animal and plant miRNAs based on each of the 17 numerical features, none of them could be used in isolation to reliably assess miRNAs. Therefore, a feature selection and classifier method was applied, and a feature subset and analysis model were obtained. We could distinguish between animal and plant miRNAs using the logistic model that was built based on four numerical features. Candidate miRNAs analyzed for these four features, specifically helix number, stack number, length of pre-miRNA, and MFE, could be classified with >85% precision.
Interestingly, 13 of 17 significantly different numerical features were higher in plant miR-NAs than in animal miRNAs. We speculated that there may be were more complexity and a larger variety of sequences and structures in plant miRNAs compared with those in animals [29].
The selected feature subset was composed of the top four features based on Kolmogorov-Smirnov test statistic values. The larger the Kolmogorov-Smirnov test statistic value the more significant the difference between animal and plant miRNAs for a certain numerical feature. This relationship is shown in Fig 4. To clarify this relationship between Kolmogorov-Smirnov test statistic value and the detailed numerical difference between animal and plant miRNA, we used stack number of miRNAs as an example. The results of this analysis are shown in S1 Fig. Based on the results shown in Fig 4 and S1 Fig, we determined that the Kolmogorov-Smirnov test statistic value could be used as an evaluation criterion for differences in frequency distribution.
In this study, several feature selection methods were applied and a high level of accuracy was obtained. However, the relationship among features was not considered. To determine whether a relationship existed between the features, we calculated the Pearson correlation coefficients between any two features (S2 Fig). This analysis showed that relationships between features were ubiquitous, and therefore the nature of a feature relationship might influence the results of feature selection. Feature transformation may be a good method for obtaining effective features for classification without such bias.
By our analysis, 17 highly significant distinguishing features were identified and they would become main numerical difference between plant and animal miRNAs. By further analysis, we found a four-feature subset that included helix number, stack number, length of pre-miRNA, and minimum free energy, and developed a logistic classifier that could distinguish animal and plant miRNAs effectively. The precision of the classifier was greater than 80%. Using this tool, we confirmed that there were universal differences between animal and plant miRNAs, and that a single feature was unable to adequately distinguish the difference. This feature set and classifier represent a valuable tool for identifying between animal and plant miRNAs at a molecular level.