Fig 1.
Workflow used to analyze the imbalanced and balanced sequences.
The workflow was used to compare the computational performance of machine learning algorithms for classification.
Fig 2.
Heat map of the row-normalized k-mer count distribution.
Rows correspond to the k-mers, and columns correspond to the 16 riboswitch families. The clustered heat map groups the features; the features clustered within a family were essential for classifying that family. Red indicates a relatively high count, while blue indicates a lower count (see details in S1 Fig).
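As a rough illustration of what a row-normalized k-mer count matrix could look like, the sketch below counts overlapping k-mers per family and divides each k-mer row by its total across families. The normalization rule, the function names, and the toy sequences are assumptions made for illustration; the source does not state how the rows were normalized.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers in a single sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def row_normalized_matrix(family_seqs, k):
    """Return (families, {kmer: [fraction per family]}).

    Assumption: each row (k-mer) is divided by its total count across
    families, so a non-empty row sums to 1.
    """
    families = sorted(family_seqs)
    totals = {fam: Counter() for fam in families}
    for fam in families:
        for seq in family_seqs[fam]:
            totals[fam].update(kmer_counts(seq, k))

    matrix = {}
    for letters in product("ACGU", repeat=k):
        kmer = "".join(letters)
        row = [totals[fam][kmer] for fam in families]
        row_sum = sum(row)
        matrix[kmer] = [c / row_sum if row_sum else 0.0 for c in row]
    return families, matrix


# Hypothetical toy input: two families with one sequence each.
toy = {"famA": ["ACGUACGU"], "famB": ["GGGGACGU"]}
families, matrix = row_normalized_matrix(toy, 2)
print(families, matrix["AC"], matrix["GG"])
```

With this rule, a k-mer that occurs in only one family gets a value of 1 in that column, which would correspond to the strongest color in a heat map of this kind.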
Fig 3.
Heat map showing the correlation between features.
The white diagonal line represents a correlation coefficient equal to one. Blue indicates a positive correlation, while red indicates a negative correlation (see details in S2 Fig).
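A feature correlation matrix of this kind can be sketched as follows. The sketch assumes the Pearson coefficient, which the source does not confirm, and uses hypothetical feature vectors; the diagonal is one because every feature is perfectly correlated with itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined for a constant feature
    return cov / (sd_x * sd_y)


def correlation_matrix(features):
    """Return (names, square matrix of pairwise correlations)."""
    names = list(features)
    return names, [[pearson(features[a], features[b]) for b in names]
                   for a in names]


# Hypothetical k-mer count vectors over four sequences.
toy = {"AC": [1, 2, 3, 4], "CG": [2, 4, 6, 8], "GU": [4, 3, 2, 1]}
names, corr = correlation_matrix(toy)
for name, row in zip(names, corr):
    print(name, [round(v, 2) for v in row])
```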
Table 1.
Accuracy, sensitivity, specificity and F-score.
These parameters were used to evaluate the Naïve Bayes (NB), Multilayer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) algorithms when applied to the imbalanced sequences. The color gradient of the F-score from blue to red indicates performance from best to poorest. Accuracy, sensitivity, specificity, and F-score are abbreviated in the table as Acc, Sen, Spec, and F-sco, respectively.
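For reference, the four metrics can be derived per family from a multiclass confusion matrix by treating each family as one-vs-rest. The sketch below is a minimal illustration under that assumption; the source does not state how the per-family values were computed or averaged, and the labels and counts are hypothetical.

```python
def per_class_metrics(cm, labels):
    """One-vs-rest metrics from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(row[i] for row in cm) - tp
        tn = total - tp - fn - fp
        sen = tp / (tp + fn) if tp + fn else 0.0    # sensitivity (recall)
        spec = tn / (tn + fp) if tn + fp else 0.0   # specificity
        prec = tp / (tp + fp) if tp + fp else 0.0   # precision
        f_sco = 2 * prec * sen / (prec + sen) if prec + sen else 0.0
        acc = (tp + tn) / total if total else 0.0
        results[label] = {"Acc": acc, "Sen": sen, "Spec": spec, "F-sco": f_sco}
    return results


# Hypothetical two-family confusion matrix.
metrics = per_class_metrics([[8, 2], [1, 9]], ["famA", "famB"])
print(metrics["famA"])
```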
Fig 4.
Comparison of the balanced and imbalanced sequences and of classifier performance.
The comparison was performed using the Wilcoxon rank test. A) Accuracy showed a significant difference between balanced and imbalanced sequences (p < 0.05); C) sensitivity showed a very significant difference between balanced and imbalanced sequences (p < 0.001); E) specificity revealed no significant differences at any level; G) F-score showed a very significant difference between balanced and imbalanced sequences (p < 0.001). Classifier performance on the balanced and imbalanced sequences is shown as follows: B) accuracy differed significantly for all classifiers except KNN (p < 0.05, p < 0.01, p < 0.001); D) sensitivity differed significantly only for MLP and SVM (p < 0.05), whereas the remaining algorithms showed no differences; F) specificity differed significantly for NB, SVM, and KNN (p < 0.05), whereas MLP, RF, and GB showed no differences between the two sequence groups; H) F-score differed very significantly for NB (p < 0.01), RF (p < 0.001), and SVM (p < 0.001), whereas KNN and MLP showed no differences. Violin box plots depict the statistical differences between the two groups (* indicates a significant difference at p < 0.05, ** a very significant difference at p < 0.01, and *** a very significant difference at p < 0.001).
Fig 5.
Confusion matrices for the imbalanced sequences from the independent test experiments, showing the true family and the predicted family.
The classifiers are: A) K-Nearest Neighbors, B) Support Vector Machine, C) Random Forest, D) Gradient Boosting, E) Multilayer Perceptron, and F) Naïve Bayes.
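A confusion matrix of true versus predicted family can be tallied as in the minimal sketch below. The label lists are hypothetical and are not results from the study; only the two Rfam accessions are taken from the text.

```python
def confusion_matrix(true_labels, predicted_labels, labels):
    """Rows are the true family, columns are the predicted family."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        cm[index[t]][index[p]] += 1
    return cm


# Hypothetical test-set labels for two families.
labels = ["RF00174", "RF01055"]
true_fam = ["RF00174", "RF00174", "RF01055", "RF01055", "RF01055"]
pred_fam = ["RF00174", "RF01055", "RF01055", "RF01055", "RF00174"]
cm = confusion_matrix(true_fam, pred_fam, labels)
for label, row in zip(labels, cm):
    print(label, row)
```

Correct predictions accumulate on the diagonal, so a classifier that separates the families well produces a matrix with small off-diagonal entries.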
Table 2.
Performance of Naïve Bayes (NB), Multilayer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN).
These algorithms were evaluated on the balanced sequences from the 16 riboswitch families using accuracy, sensitivity, specificity, and F-score. The color gradient of the F-score from blue to red indicates performance from best to poorest. Accuracy, sensitivity, specificity, and F-score are abbreviated in the table as Acc, Sen, Spec, and F-sco, respectively.
Fig 6.
Confusion matrices for the balanced sequences from the independent test experiments.
The true family and the predicted family are shown for the following classifiers: A) K-Nearest Neighbors, B) Support Vector Machine, C) Random Forest, D) Gradient Boosting, E) Multilayer Perceptron, and F) Naïve Bayes.
Table 3.
Clustered k-mers from S1 Fig used to validate their biological function against reported riboswitch motifs.
The designated nucleotide locations refer to matches with the positions reported in the references.
Fig 7.
Secondary structure of the RF00174 Cobalamin riboswitch (Acido bacterium) (A) and the RF01055 MOCO riboswitch class (B).
For each base, the color gradient scale represents the normalized number of hits from the 156 features aligned to the sequence. The different colors in each region represent the coverage of that region by the k-mers of the family. I, H, and T are abbreviations for interior loops, helices, and terminal loops, respectively.
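One way to obtain a per-base normalized hit number is sketched below. It assumes that features are aligned by exact substring matching and that hit counts are scaled by the maximum count in the sequence; neither assumption is stated in the source, and the sequence and k-mers are hypothetical.

```python
def per_base_hits(seq, kmers):
    """Count, for every base, how many k-mer occurrences cover it."""
    hits = [0] * len(seq)
    for kmer in kmers:
        start = seq.find(kmer)
        while start != -1:
            for i in range(start, start + len(kmer)):
                hits[i] += 1
            start = seq.find(kmer, start + 1)  # allow overlapping matches
    return hits


def normalize(hits):
    """Scale hit counts to [0, 1] by the maximum (assumed normalization)."""
    peak = max(hits, default=0)
    return [h / peak if peak else 0.0 for h in hits]


# Hypothetical sequence and feature k-mers.
hits = per_base_hits("ACGUACG", ["ACG", "CGU"])
coverage = normalize(hits)
print(hits, coverage)
```

Bases covered by many family-specific k-mers receive values near one, which would map to the strong end of a color gradient on the structure.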