Fig 1.
Non-coding RNA functions can be separated into housekeeping (green) and regulatory (purple). Classes can be subdivided depending on the level of description. The classes represented are examples that are often mentioned in the literature, or that are of interest in the datasets mentioned below.
Fig 2.
Overview of existing approaches for ncRNA classification.
Methods are represented in chronological order. We specify the type of input data used (secondary structure, sequence, or epigenetic data) and the type of DL architecture (CNN, RNN, both, or other).
Fig 3.
Each row corresponds to a dataset, and each column to a metric. Each color corresponds to a method. Results on each validation set are represented by circles in the methods' colors, with a line on top representing the mean.
Fig 4.
Comparison of overall performance of the state-of-the-art methods on the test sets of Dataset1, Dataset1-nd and Dataset2.
The shape of the markers designates the datasets, and each color corresponds to a method. The tools to the right of the dashed line, ncRDense and NCYPred, could not be retrained before prediction. No results are shown for MFPred and NCYPred on Dataset1, since these two tools do not process degenerate nucleotides. The tools are sorted chronologically on each side of the dashed line.
Fig 5.
Difference between MCC values measured in our benchmark and those reported by each state-of-the-art method.
A negative value −x indicates that the MCC we measured is lower by x than the originally reported value.
Fig 6.
Comparison of accuracy of prediction of each ncRNA class obtained by state-of-the-art tools on Dataset1-nd and Dataset2.
Light colors correspond to lower accuracies, while colors tending towards dark green represent the best results. Results for Dataset1 are comparable to those for Dataset1-nd and are presented in S1 Fig.
Fig 7.
Evolution of accuracy with different numbers of non-functional sequences added to Dataset1-nd.
Each line (and color) corresponds to a method. Accuracy is measured when varying the ratio of non-functional to functional sequences from 0:1 (initial dataset) to 1:1 (as many non-functional sequences as functional sequences).
Fig 8.
Evolution of accuracy with different percentages of noise added to Dataset1-nd sequences.
Each line (and color) corresponds to a method. Accuracy is measured when the noise added to sequences is equal to 0%, 50%, 100%, 150%, or 200% of the sequence length.
Table 1.
Comparison of computation times and CO2 emissions on Dataset1-nd.
The computation time is calculated for preprocessing, training and prediction, while the CO2 emission is calculated for training only. Times and emissions are computed for the hyperparameter sets selected in Table 4 (see S4 Table for Dataset1 and Dataset2).
Table 2.
Description of the availability and ease-of-use of DL ncRNA classification tools.
Tools are presented in chronological order of publication. Each cell indicates whether the tool is accessible through a web server, whether its source code can be downloaded, and whether the tool can be retrained. All links were checked at the time of writing (February 2024).
Table 3.
Description of datasets for ncRNA classification.
(a) State-of-the-art datasets. (b) Datasets used in this study. Datasets are sorted by date. For each dataset, we give the number of instances, the number of classes, and the range of sequence lengths. The last column indicates how many times the dataset has been used by state-of-the-art ncRNA classifiers. For some datasets, the length range is marked 'n/a': it is not stated in the publication and cannot be computed because the dataset itself is unavailable.
Table 4.
Overview of tested hyperparameters.
Hyperparameters that gave the best results on Dataset1 and Dataset2 are denoted by (1) and (2), respectively. The models chosen on Dataset1 were also used for Dataset1-nd.