A novel riboswitch classification based on imbalanced sequences achieved by machine learning

Riboswitches, parts of regulatory mRNAs (50–250 nt in length), consist of two main components: an aptamer and an expression platform. One of the main challenges in riboswitch classification is imbalanced data, a circumstance in which the sequences of one group are far fewer than those of the others. Such circumstances lead classifiers to ignore minority groups and emphasize majority ones, resulting in skewed classification. We considered sixteen riboswitch families, in accord with recent riboswitch classification work, that contain imbalanced sequences. The sequences were split into training and test sets using a newly developed pipeline. From the 5460 k-mers (k values 1 to 6) produced, 156 features were selected using the CfsSubsetEval and BestFirst functions in WEKA 3.8. Statistical testing showed a significant difference between balanced and imbalanced sequences (p < 0.05). In addition, each algorithm showed significant differences in sensitivity, specificity, accuracy, and macro F-score between the two groups (p < 0.05). Several k-mers clustered from the heat map were found to have biological functions and motifs at different positions, such as interior loops, terminal loops and helices; some are known riboswitch motifs. The analysis highlights the importance of addressing majority-class bias and overfitting. The presented results are a generalized evaluation of both balanced and imbalanced models, which implies their ability to classify novel riboswitches. The Python source code is available at https://github.com/Seasonsling/riboswitch.


INTRODUCTION
Riboswitches, primarily discovered in bacteria (1), are parts of noncoding mRNA (2), predominantly present in the 5' untranslated region (3,4), and they adopt complex folded structures (5,6). They act as switches that control the transcription or translation of genes: in transcription, they turn a downstream gene 'off' or 'on' (7) in response to changing concentrations of specific metabolites or ligands (8), allowing microbes to react quickly to changing metabolite levels (7). A high-throughput platform showed how RNA structural transitions (9) kinetically compete during transcription, a new mechanism for riboswitch regulation.
A riboswitch (50–250 nt in length) has two main domains: an aptamer and an expression platform (10). The aptamer region is a highly conserved domain that serves as the binding site for ligands (metabolites), while the expression platform alters its conformation upon metabolite binding and hence regulates the expression of related genes (5,6). To date, over twenty diverse classes of riboswitches have been found in bacteria, archaea (11,12) and eukaryotes, with the majority in bacteria (12,13). The thiamine pyrophosphate (TPP) riboswitch is the only one known in eukaryotes; it has been found in some fungi (13), for instance Neurospora crassa, as well as in algae and in plants such as Arabidopsis thaliana (14,15).
In recent decades, novel high-throughput experimental technologies such as Next Generation Sequencing have driven incredible advances in large, complex omics data (16,17). Numerous bioinformatics databases are available to gather data for riboswitch analyses and to assemble information on the diverse functionality of RNA molecules (18), including GenBank (9), the National Center for Biotechnology Information (NCBI), Rfam (19), the Protein Data Bank (PDB), RiboD (20) and the European Bioinformatics Institute (EMBL-EBI).
Many efforts have been made to develop suitable bioinformatics tools to predict the presence of riboswitches in ribonucleic acid sequences (18). The computational tools most commonly used for the analysis of riboswitches are RiboD (20), Riboswitch finder (21), RibEx (22), RiboSW (23), mFold (24) and RegRNA (18). These tools use Covariance Model (CM), Support Vector Machine (SVM) and Hidden Markov Model (HMM) algorithms. Most existing research depends mainly on the principle of multiple sequence alignment to investigate conserved sequences in already reported riboswitches, attempting to find the conserved sequences of previously reported riboswitches in a targeted manner; most reported studies are limited to specific types of riboswitches. However, frequency-dependent research has demonstrated its importance in the classification of riboswitches (25,26). Frequency-dependent classification uses k-mer counts. K-mer counts have many applications, such as building de Bruijn graphs (27) for de novo assembly from the very large numbers of short reads generated by next generation sequencing (NGS), multiple sequence alignment (28), and repeat detection (29).
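As a minimal illustration of the k-mer counting that frequency-dependent classification rests on (a sketch, not part of the published pipeline), overlapping k-mers can be counted in a few lines of Python:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("AUGGCAUGG", 3)
print(counts["AUG"])  # AUG occurs at positions 0 and 5 -> 2
```

A sequence of length L yields L − k + 1 overlapping windows, so counts across families are comparable only after normalizing by sequence length, as frequency-based methods do.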
A tremendous amount of data is generated every day, creating demand for learning algorithms that can classify, predict and analyse data more accurately (30). There are two classification categories: binary classification (31) and multi-class classification (32,33).
The concept of an imbalanced dataset is defined as follows. Among the riboswitch families considered here, majority groups have more than two thousand sequences each while minority groups have fewer than a thousand; such a collection is considered imbalanced. When the imbalanced group is treated with the Synthetic Minority Over-Sampling Technique (SMOTE), the result is called a balanced dataset. Classification with imbalanced data favors samples of the majority class (30): imbalanced data occur when the records of one class are very few relative to the other classes, which leads classifier algorithms to ignore minority groups and emphasize the majority class, resulting in skewed classifier accuracy. The overall accuracy of the classifier might be high while the minority classes are misclassified. Several riboswitch classification studies (25,26) have been based on imbalanced data. However, data resampling can be a solution to class imbalance problems (30). SMOTE, a sampling-based algorithm introduced in 2002 (34), balances the class distribution of an imbalanced dataset by incrementally adding synthetic samples.
To address the needs of riboswitch prediction, conserved nucleotide frequency counts were considered and SMOTE was used for resampling. Several machine learning algorithms were evaluated: Random Forest (RF), Gradient Boosting (GB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naïve Bayes (NB) and Multilayer Perceptron (MLP). The classification performance of each algorithm was derived from the confusion matrix, which reveals the numbers of correctly matched and mismatched riboswitch instances. Specificity, sensitivity, accuracy, and macro F-score were calculated; these parameters are the main performance evaluation criteria for machine learning algorithms (35)(36)(37)(38).

Feature engineering and selection
The riboswitch families considered for this analysis and their corresponding details are presented in Figure 1, and the features were clustered (Figure 2). Looking at the instances per riboswitch family, representation ranged from Cobalamin riboswitch (4,826 sequences) to PreQ1-II (39). Out of the 16 riboswitch classes, Cobalamin riboswitch, TPP riboswitch (THI element), and Glycine riboswitch contributed 68% of instances, while the remaining 13 families contributed 32%. The performance of the algorithms and methods was computed and evaluated on the training and test sets (details in the methodology section). We produced 5460 k-mers and then selected 156 features, covering k-mers up to length six, consistent with previous research (26).

Effect of imbalanced classes on classification performance
On the minority classes, classifiers achieved F-score values from 0.50 (NB) to 0.94 (MLP), while on the majority classes the range was 0.91 to 1.00, as indicated in Table 1 and Figure 4. The riboswitch families considered for classification are listed in Supplementary Table S1. The average performance of each classifier was computed as the mean and standard deviation of accuracy, specificity, sensitivity and F-score.
Comparative analysis of the six algorithms revealed that MLP performed best, while NB performed poorest (Supplementary Table S2). RF00174, RF00059, RF00504 and RF00522 were classified better than minority classes such as RF01054, RF00634 and RF00380 (Table 1). The F-scores of MLP and RF for the majority class (RF00174) were 0.997 and 0.996, respectively. In the minority group, even classifiers with high overall accuracy had F-scores as low as 0.50 in the case of NB; the minimum NB values for RF01054, RF00634, and RF00521 were 0.50, 0.38, and 0.24, respectively.
The accuracy of all algorithms across all riboswitch families was greater than 0.97. The confusion matrices of predicted versus true family illustrate classifier performance on riboswitch classification (Figure 5).

Effect of SMOTE balancing on classifier performance
The overall analysis of frequency counts for all families revealed improved classifier performance (Supplementary Table S2, Table 2 and Figure 4). For RF00059 and RF00174, F-scores ranged between 0.93 and 1.00. For NB and KNN, F-scores below 0.84 indicated poorer performance. The performance evaluations revealed that KNN, NB, SVM, MLP, RF and GB can all be used for riboswitch classification (Figure 6).
As presented, Random Forest and MLP exhibited consistently higher accuracy and F-score values than NB, GB, SVM and KNN. Figure 4 and Table 2 show that SMOTE improves riboswitch classification and algorithm performance. The balanced dataset produced better results than the imbalanced dataset (Tables 1 and 2 and Supplementary Table S2): the specificity of NB, MLP, RF, GB, SVM and KNN was better on the balanced dataset than on the imbalanced one, and sensitivity was slightly better on balanced instances.

Table 1. Accuracy, sensitivity, specificity and F-score parameters used to evaluate the Naïve Bayes, Multilayer Perceptron, Random Forest, Gradient Boosting, Support Vector Machine and K-Nearest Neighbors algorithms on the imbalanced dataset. The color trend of the F-score from blue to red indicates performance from best to poorest.
The F-score values of all models showed that a balanced dataset can improve the classification of riboswitches: the balanced dataset increased not only classification accuracy but also overall algorithm performance. Table 2 shows F-score values increasing from 0.50 on the imbalanced dataset to 0.84 on the balanced one.

Performance of the six selected machine learning algorithms. Violin box plots depict the statistical differences between the two groups (* indicates a significant difference at p < 0.05, ** a very significant difference at p < 0.01, and *** a very significant difference at p < 0.001).

Statistical significance analysis
Statistical computation using the Wilcoxon rank test (39) between the balanced and imbalanced datasets revealed significant differences between the two groups. In addition, the performance of NB, MLP, RF, GB, SVM and KNN varied statistically in accuracy, specificity, sensitivity and F-score. Very significant differences between balanced and imbalanced datasets were noticed in F-score and specificity (p < 0.001), and accuracies were significantly different (p < 0.05), whereas sensitivity showed no significant difference between the two groups (Figure 4, Supplementary Table S3).
SVM showed significant differences in all performance parameters: F-score (p < 0.001), with accuracy, specificity and sensitivity significantly different (p < 0.05). RF showed very significant differences between groups in F-score (p < 0.001) and accuracy (p < 0.01) (Figure 4 and Supplementary Table S2). For KNN we noticed no statistically significant differences in any parameter except specificity (p < 0.05).
For MLP, the balanced and imbalanced groups showed very significant differences in accuracy and sensitivity (p < 0.01). GB showed a significant difference only in accuracy (p < 0.05). Finally, for NB the two datasets showed very significant differences in F-score (p < 0.01) and accuracy (p < 0.001), whereas specificity showed a significant difference (p < 0.05). The accuracy of all classifiers except KNN differed significantly, at different levels, between the two groups (Figure 4 and Supplementary Table S3).
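A paired Wilcoxon comparison of the kind above can be run in Python with SciPy; the F-score values below are hypothetical placeholders for illustration, not the study's actual measurements:

```python
from scipy.stats import wilcoxon

# Hypothetical paired F-scores of six classifiers on the imbalanced
# and SMOTE-balanced datasets (values are illustrative only).
imbalanced = [0.50, 0.94, 0.93, 0.88, 0.85, 0.80]
balanced   = [0.84, 0.97, 0.98, 0.92, 0.92, 0.88]

# Paired, two-sided Wilcoxon signed-rank test on the six differences.
stat, p = wilcoxon(imbalanced, balanced)
print(f"W = {stat}, p = {p:.4f}")
```

Because every classifier improves after balancing, all six signed differences share one sign and the test statistic is at its minimum, giving a p-value below 0.05 even for this small sample.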

Biological functions of clustered k-mers
K-mers with high relative counts were extracted from the distribution heat map (Figure 2A and 2B), which depicted feature clustering. These clustered k-mers were used for biological function and motif searching; the results are summarized in Table 3.

DISCUSSION
Machine learning has enormous capacity to boost our knowledge of riboswitch classification, an area still in the early stages of comprehensive investigation. Numerous machine learning applications based on different methods have been developed to detect riboswitches. However, most riboswitch classification studies have applied machine learning algorithms to imbalanced datasets (25,26), and several findings have revealed the impact of an imbalanced dataset on correct classification and algorithm performance (25,26,30). Chawla and colleagues proposed the SMOTE method of treating imbalanced datasets for better classification of majority and minority instances (30,34). SMOTE-based balancing oversamples minority classes accurately and produces a dataset that does not distort the majority class.
In this analysis, the instances in the riboswitch families are imbalanced. Comparative results confirmed the widely reported impact of such imbalance on classification (Supplementary Table S1 and Table 1). The imbalanced distribution ranged from the 4,826-instance majority class (Cobalamin riboswitch) to the 39-instance minority class (PreQ1-II riboswitch). Under such circumstances, general classifiers favor the class with the majority of instances (30,34). The imbalanced and balanced confusion matrices revealed the same problem (Figures 5 and 6). Out of the 16 riboswitch classes, Cobalamin riboswitch, TPP riboswitch (THI element), and Glycine riboswitch together contributed 68% of instances, and the remaining 13 families 32%. In Table 2, the full set of sequences was split into a training set (70%) and a test set (30%), and classifier performance was evaluated with respect to sensitivity, accuracy, specificity and F-score. The correlation heat map in Figure 3 indicates the relationships between k-mers.
On the imbalanced riboswitch dataset, the classifiers ranked from MLP (best) to NB (poorest) by their mean scores, which ranged from 0.771 to 0.961. In Table 2, individual scores show the best results in RF00234, RF00522 and RF01057 (1.00 for RF): greater values than reported in other studies using BLAST+ (26,56), one of the most popular tools for sequence similarity analysis (56), and elsewhere (25,26). Converting sequences into vectors produced good results in both groups used for the analysis (Tables 1 and 2 and Supplementary Table S2); in protein studies, converting protein sequences into feature vectors has likewise shown good performance with SVM and KNN (57)(58)(59)(60). RF00174, RF00059, RF00504 and RF00522 were predicted better than minority classes such as RF01054, RF00634 and RF00380 (Tables 1 and 2). The class with the maximum number of instances (RF00174) achieved an F-score greater than 0.94 with all classifiers except NB, which had a value below 0.93 in both cases.
The NB classifier showed poor performance on the imbalanced dataset compared to the other classifiers, with accuracy, sensitivity, specificity, and F-score of 0.979, 0.989, 0.814 and 0.705, respectively (Table 3). Figure 4 indicates that the proposed method of balancing instances increases classifier performance; this approach has also been reported as a general solution in machine learning (62). RF showed the best result, followed by MLP, which revealed near-optimal performance.
On the other hand, Naïve Bayes performed poorly in imbalanced dataset classification, in accordance with Mwagha and colleagues (63,64). The overall comparison revealed that balanced datasets are better for riboswitch classification; the performances were compared to BLAST+ (26) and other findings (Tables 1 and 2 and Supplementary Table S3).
The positions of the k-mers in the secondary structure illustrate riboswitch biological functions and motifs (Table 6 and Figure 7). In RF00174, the CCCGC k-mer mapped to the predicted secondary structure of the cobalamin riboswitch in the btuB leader region of Synechococcus. In cases like RF00168, the AAAAAA k-mer carried motifs predicted to interact with the Nova-1 protein, overlapping k-mers in the A-rich 3′ aptamer domain, whose unique pseudoknot fold compacts PreQ1 (Scott et al. 2016).
Gene expression turn-off was observed in RF00162 with the GAGGGA k-mer, a kink-turn motif that allows pseudoknot interaction. It interacts with SAM, which stabilizes the fold and can cause the downstream expression platform to form a rho-independent transcriptional terminator, turning off gene expression (Montange and Batey 2006). Overall, the k-mers and their biological functions in this study are summarized in Table 3.
The pipeline can be used in machine learning and deep learning studies in other domains of bioinformatics and computational biology that suffer from imbalanced datasets. Finally, the scientific community can use the Python source code for analyses of interest as well as to develop suitable software packages.

METHODOLOGY
We present a complete evaluation of different machine learning approaches for classifying and predicting regulatory riboswitches. First, we describe the benchmark datasets and data mining approach, followed by the feature engineering performed through testing. In addition, model selection methods were used to model and compare the balanced and imbalanced dataset problem (65). These methods are implemented with the open-source machine learning platform WEKA 3.8.3 (66,67), SMOTE (34) and Python 3 (68), which allow evaluating different parameters and algorithms for riboswitch classification and prediction. Lastly, we describe the classification results of the learned models. The workflow for the analysis of imbalanced and balanced datasets used for performance evaluation of the different machine learning algorithms is shown in Figure 1; it can be reused in other research areas that suffer from imbalanced datasets. All datasets and the Python source code are available at https://github.com/Seasonsling/riboswitch.

Data preprocessing
The dataset for this investigation was collected from Rfam 13.0 (19), together with datasets already produced (26) for comparison with our new methods. Rfam is a resource that collects RNA families, including riboswitches (19). Machine learning approaches are needed to train algorithms to classify riboswitches, as has happened in other areas of bioinformatics. Only 16 families were used, to allow comparison with previous research work, and they clearly show the impact of an imbalanced dataset on classifier performance. Preprocessing, cleaning and filtering were performed, including the handling of missing values, noisy data, and redundant and irrelevant features that could affect model accuracy (67). The number of sequences per family is shown in Supplementary Table S1.
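Rfam families are distributed as FASTA files; a minimal parser of the kind any such preprocessing step needs (a sketch, independent of the study's actual scripts) looks like:

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:          # flush previous record
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:                          # sequence lines may wrap
                chunks.append(line)
    if header is not None:                      # flush final record
        records.append((header, "".join(chunks)))
    return records
```

Counting the parsed records per family header is then enough to reproduce the per-family totals reported in Supplementary Table S1.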

Feature selection
The FASTA-format dataset was used for k-mer (1 ≤ k ≤ 6) frequency counting, executed with the R package kcount (69). In order to obtain a sufficiently informative k-mer counting matrix for the task (70), we set the maximum k value to 6 and obtained 5,460 features, consistent with other research (26). This k-mer composition was used to compute frequencies for each riboswitch. Limiting k also avoids unnecessary computation and the dimensional explosion caused by the extremely sparse matrices that arise at high k values.
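The published counts were produced with the R package kcount; an equivalent Python sketch (an illustration, not the study's code) builds the full frequency vector over all k-mers for 1 ≤ k ≤ 6, where 4 + 16 + 64 + 256 + 1024 + 4096 = 5,460 features:

```python
from itertools import product

ALPHABET = "ACGU"

def kmer_features(seq, kmax=6):
    """Frequency vector over all k-mers for 1 <= k <= kmax
    (5,460 features for kmax = 6 over a 4-letter alphabet)."""
    seq = seq.upper().replace("T", "U")
    feats = {}
    for k in range(1, kmax + 1):
        total = max(len(seq) - k + 1, 1)      # number of windows
        counts = {}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
        # emit every possible k-mer so all sequences share one schema
        for kmer in map("".join, product(ALPHABET, repeat=k)):
            feats[kmer] = counts.get(kmer, 0) / total
    return feats

vec = kmer_features("ACGUACGUACGU")
print(len(vec))  # 5460
```

Emitting the full schema for every sequence is what makes the matrix sparse at high k, which is why the paper caps k at 6.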
The attribute evaluator CfsSubsetEval and the search method BestFirst were used for dimensionality reduction, searching the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility (71). WEKA 3.8.3 was used to implement this task (66,67). Feature selection was performed for dimensionality reduction and thus to decrease the processing load (72,73).
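WEKA's CfsSubsetEval with BestFirst has no direct scikit-learn counterpart; for readers working in Python, a univariate mutual-information filter is a rough stand-in (an assumption of this sketch, not the method used in the paper), shown here on random toy data in place of the 5,460-column k-mer matrix:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 50))            # 200 sequences x 50 toy k-mer features
y = rng.integers(0, 3, size=200)     # 3 hypothetical riboswitch families

# Keep the 10 features with highest mutual information with the label
# (the paper kept 156 of 5,460 via CfsSubsetEval + BestFirst).
selector = SelectKBest(mutual_info_classif, k=10).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (200, 10)
```

Unlike CfsSubsetEval, a univariate filter ignores redundancy between the selected features, so it will usually keep more correlated columns than the subset search does.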

Imbalanced data
The dataset for this study contains imbalanced classes ranging from 4,826 instances (RF00174) down to 39 instances (RF01051) (Supplementary Table S1). Learning from imbalanced datasets has become a critical concern, particularly when the minority class contains few instances (25,26,74). Mainstream methods for dealing with imbalanced data can be roughly divided into two categories: the first considers the different costs of different misclassifications (75), while the second focuses on training data sampling strategies, where over-sampling and under-sampling are common techniques used to adjust the class distribution. However, traditional random oversampling simply copies samples to increase the minority class, which is prone to overfitting: the information learned by the model becomes too specific and fails to generalize (76).
SMOTE, an improved scheme based on random oversampling, was therefore applied (34). The basic idea of the SMOTE algorithm is to analyze the minority-class samples and add new synthetic samples to the dataset based on them.
The algorithm flow used is as follows. For each sample x in the minority class, compute its Euclidean distance to all other minority samples and obtain its k nearest neighbors. Set a sampling ratio according to the class imbalance ratio to determine the sampling magnification N. For each minority sample x, randomly select samples from its k nearest neighbors; let a selected neighbor be x̂. For each randomly selected neighbor x̂, construct a new sample from the original sample according to the formula

x_new = x + rand(0, 1) × (x̂ − x)

where rand(0, 1) is a uniform random number in [0, 1]. SMOTE was deployed by importing the "imblearn.over_sampling" module in Python 3 and was applied both in the 10-fold cross-validation and in building the final model, as shown in Figure 1.
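The study used the imblearn implementation; the numpy sketch below reproduces only the neighbor-search and interpolation steps, to make the formula concrete (illustrative, not the production code):

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: pick a random minority sample x, one of
    its k nearest minority neighbours x_hat, and interpolate:
        x_new = x + rand(0, 1) * (x_hat - x)
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        x_hat = X[nn[i, rng.integers(nn.shape[1])]]
        synthetic.append(X[i] + rng.random() * (x_hat - X[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on the segment between two real minority points, SMOTE stays inside the minority region instead of duplicating samples verbatim, which is what mitigates the overfitting of plain random oversampling.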

Machine learning models
A crucial step in machine learning is model selection, as algorithm performance is sensitive to the calibration parameters; the configuration and choice of hyper-parameters are therefore crucial.
For our dataset, we calibrated the models using 10-fold cross-validation. First, the complete feature-selected k-mer dataset was divided randomly into two parts: 70% training set and 30% test set. The training set was used to build the multiclass classification models and determine the hyper-parameters through 10-fold cross-validation. The test set was then used to assess the final generalization performance of the balanced and imbalanced models. To increase the credibility of the comparison and ensure repeatability of the results, all splits were chosen randomly, and input data and model parameters were strictly consistent between the balanced and imbalanced models except for the SMOTE processing step. This was handled by the make_pipeline function and Pipeline object of the Python package imblearn (0.5.0), which ensure that, in cross-validation or generalization testing, SMOTE only transforms the training data used to build the cross-validation model or the final model. In this way, the validation set in each cross-validation fold was consistent across all models, just as the 30% test set was.
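The leakage-safe fold-wise scheme that the imblearn Pipeline automates can be sketched as follows; to keep the example dependency-light it substitutes naive random oversampling for SMOTE, which is an assumption of the sketch, not the study's actual resampler:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((120, 8))
y = np.array([0] * 100 + [1] * 20)       # imbalanced toy labels

def oversample(Xtr, ytr, rng):
    """Naive random oversampling stand-in for SMOTE (illustration only):
    duplicate minority rows until both classes are the same size."""
    minority = np.flatnonzero(ytr == 1)
    extra = rng.choice(minority, size=len(ytr) - 2 * len(minority))
    idx = np.concatenate([np.arange(len(ytr)), extra])
    return Xtr[idx], ytr[idx]

scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr, va in cv.split(X, y):
    Xtr, ytr = oversample(X[tr], y[tr], rng)   # resample training fold ONLY
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    scores.append(clf.fit(Xtr, ytr).score(X[va], y[va]))  # fold untouched
```

Resampling inside the loop, after the split, is the whole point: applying SMOTE before splitting would leak synthetic copies of validation-fold points into training and inflate the reported scores.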

Experimentation classifiers
Random Forest is a commonly used machine learning algorithm (77) with many successful applications in computing and bioinformatics (77)(78)(79). It randomizes the variables (columns) and data (rows), generates thousands of classification trees, and then summarizes the results of the trees. In this research, the mean decrease impurity method was used to assess feature importance.
SVM is a simple and efficient method that solves a quadratic programming problem (80) to compute the maximum-margin hyperplane (66). In SVM, the kernel function implicitly defines the feature space for linear partitioning, which makes the choice of kernel the most important variable of SVM.
Gradient Boosting is a boosting algorithm that, like Random Forest, belongs to ensemble learning and has proved to perform well on imbalance problems. It builds the model in a stage-wise fashion and generalizes by allowing optimization of an arbitrary differentiable loss function (81).
Another classifier is K-Nearest Neighbors (KNN), also named IBk (instance-based learning with parameter k); this classifier offers numerous options to speed up the search for nearest neighbors (67). The NB (Naïve Bayes) classifier is based on Bayes' theorem, a probability-based model in the family of Bayesian networks (82). MLP is likewise a commonly used machine learning algorithm (83).
ncRNA classification and prediction problems have been widely addressed with the six algorithms selected for this analysis (84)(85)(86), including riboswitch classification and prediction (3,26).
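For readers reproducing the comparison in scikit-learn, the six classifiers can be instantiated as below; the hyper-parameters shown are library defaults or plausible placeholders, not the tuned values from the study:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Illustrative settings only; the study tuned hyper-parameters by
# 10-fold cross-validation on the 70% training split.
classifiers = {
    "RF":  RandomForestClassifier(n_estimators=100),
    "GB":  GradientBoostingClassifier(),
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB":  GaussianNB(),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
}
```

Each model exposes the same fit/score interface, so one loop over this dictionary suffices to evaluate all six on the balanced and imbalanced feature matrices.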

Evaluation
To evaluate the performance of the classifiers, confusion matrices were used to compute sensitivity, specificity, accuracy and F-score (32,87). Most researchers use the weighted F-score to evaluate a classifier's overall performance; however, it biases the assessment between majority and minority families. In this evaluation we used the macro F1 instead, which takes the arithmetic mean of the per-class F1-scores and thus avoids that assessment bias to some extent. Statistical tests were carried out in GraphPad Prism 8.
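The per-class metrics derived from a confusion matrix, and the macro F1 built from them, can be sketched as follows (the toy matrix is illustrative, not data from the study):

```python
import numpy as np

def per_class_metrics(cm):
    """Sensitivity, specificity and F1 per class from a multiclass
    confusion matrix (rows = true family, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # missed members of the class
    fp = cm.sum(axis=0) - tp          # other classes predicted as it
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

cm = [[50, 2], [5, 10]]               # toy 2-family confusion matrix
sens, spec, f1 = per_class_metrics(cm)
macro_f1 = f1.mean()                  # unweighted mean: the macro F-score
```

Because macro F1 averages the per-class F1 values without weighting by class size, a 39-instance minority family counts exactly as much as a 4,826-instance majority family, which is why it exposes minority misclassification that overall accuracy hides.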