Deep learning predicts short non-coding RNA functions from only raw sequence data

Small non-coding RNAs (ncRNAs) are short non-coding sequences involved in gene regulation across many biological processes and diseases. The lack of a complete understanding of their biological functionality, especially in a genome-wide scenario, has demanded new computational approaches to annotate their roles. It is widely held that secondary structure is a key determinant of RNA function, and machine-learning-based approaches have proven successful at predicting RNA function from secondary-structure information. Here we show that RNA function can be predicted with good accuracy from a lightweight representation of sequence information, without the need to compute secondary-structure features, which is computationally expensive. This finding appears to go against the dogma of secondary structure being a key determinant of function in RNA. Compared to recent secondary-structure-based methods, the proposed solution is more robust to sequence boundary noise and drastically reduces the computational cost, allowing for large-data-volume annotations. Scripts and datasets to reproduce the results of the experiments proposed in this study are available at: https://github.com/bioinformatics-sannio/ncrna-deep.


Reply:
Thanks for the accurate question. However, we are not using k-mer composition as a feature. Maybe this was not clearly stated in the original manuscript. Here, k-mers are used to represent the sequence itself to be given as input to the neural network, maintaining the sequential order of nucleotides (see Figures 3 and 5 in the paper). Specifically, we do not collapse the sequence information into k-mer histograms; rather, we encode every k-mer of the sequence as a binary vector. For example, the sequence AGCTGATT is 1-mer encoded as: (1000)(0100)(0010)(0001)(0100)(1000)(0001)(0001). As such, we can say the input of the neural network is the raw sequence, suitably encoded by k-mers.
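The 1-mer encoding described in the reply can be sketched as follows; this is a minimal illustration of the AGCTGATT example, assuming the nucleotide ordering A, G, C, T (the function name is ours, not from the paper's code):

```python
def one_hot_encode(seq, alphabet="AGCT"):
    """Encode a sequence as an ordered list of one-hot binary vectors (1-mers).

    Note: this sketch assumes the ordering A, G, C, T, inferred from the
    reply's example; the paper's actual code may order nucleotides differently.
    """
    index = {nt: i for i, nt in enumerate(alphabet)}
    vectors = []
    for nt in seq:
        v = [0] * len(alphabet)
        v[index[nt]] = 1  # one bit set per position; sequential order preserved
        vectors.append(v)
    return vectors

# The example from the reply: AGCTGATT
# -> (1000)(0100)(0010)(0001)(0100)(1000)(0001)(0001)
encoded = one_hot_encode("AGCTGATT")
```

Unlike a k-mer histogram, the output has one vector per position, so the network still sees the order of nucleotides.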
Also, since RNA sequence determines structure that often underlies function, it is unclear how this finding poses a question against the dogma of secondary structure being a key determinant of function in RNA.
Reply: With this claim, we would like to raise an open question, maybe in a provocative way. In the literature, it is assumed that RNA sequence determines structure, which in turn determines function, so function depends basically on sequence through its 2D/3D structure (Tinoco et al., "How RNA folds", Journal of Molecular Biology, 1999). The question we raise, observing our results, is: are RNA functions determined exclusively through the 2D/3D structure? We observed that our deep network architecture is able to learn functions from lightweight sequence representations, such as k-mers, without precomputing the 2D structure. This is not a trivial question, as in the literature 2D/3D structure seems to be pivotal to predict functions (see INFERNAL, EDeN, and nRC).
Computing the 2D/3D structure through folding tools such as ViennaRNA and IPknot is very time-consuming, so avoiding it is of course attractive, as we show empirically. In addition to the objective result of saving computational time, a consequence is the question of whether 2D/3D structure is strictly necessary to predict function. We discussed such aspects only qualitatively, without giving data evidence. It may be plausible that the deep architecture's capability to learn abstract features even allows it to learn the structure in order to predict the function, but it may also not. We did not go deeper into this aspect and leave the question open.
The deep learning architecture in this study used convolutional neural networks (CNNs). I wonder whether some other deep learning techniques, such as recurrent neural networks and word embeddings, were also tested for this problem.
Reply: Thanks to the referee for pointing this out. We have tested three bidirectional LSTM recurrent neural network (RNN) architectures with an increasing number of nodes (50, 100, 150) on the dataset (training set and test set) named test13, provided by nRC's authors in F., 2017.
Since RNNs are able to process information as sequential data with no predetermined size limit, we applied these architectures to the sequences encoded as k-mers and not as space-filling curves, as the latter are not sequences but rather 2-D representations of the data. Table 5 of the manuscript has been updated with these new results. Here, the tested RNN architectures show performances similar to those of the standard CNN architecture. However, the improved CNN architecture still remains the best approach for the classification task that is the object of this study.
Given these results, we believe that a hybrid approach with both CNN and RNN layer blocks could perhaps improve the performance of short ncRNA classification tasks. Ideally, a first convolutional layer block could identify short sequence motifs correlated with the biological role of the short ncRNA family, and then a recurrent layer block could learn long-term relationships between the inferred functional motifs.
We plan to investigate the complexity of this kind of architecture in future work.
What are "non-functional RNA sequences"? The RNA sequences that do not belong to the considered classes can also be biologically functional.

Reply:
In the experiments on the rejection capability of the algorithm, we refer to non-functional RNA sequences as sequences randomly generated by shuffling the initial set while preserving the di-nucleotide composition of each original sequence. In the new version, we have clarified this aspect.
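As a rough illustration of such a shuffle, here is one common way to generate sequences that preserve di-nucleotide counts, via a random Eulerian walk over the di-nucleotide multigraph (in the spirit of Altschul and Erickson); the paper's actual shuffling implementation may differ:

```python
import random
from collections import defaultdict, Counter

def dinucleotide_shuffle(seq, rng=random):
    """Randomly shuffle seq while preserving its exact di-nucleotide counts.

    Sketch only: builds a multigraph with one edge per di-nucleotide and
    retries random greedy walks until one consumes every edge (a valid
    Eulerian walk), which is guaranteed to happen eventually.
    """
    if len(seq) < 3:
        return seq
    edges = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        edges[a].append(b)  # one edge per observed di-nucleotide
    while True:
        trial = {k: v[:] for k, v in edges.items()}
        for successors in trial.values():
            rng.shuffle(successors)
        walk = [seq[0]]
        node = seq[0]
        while trial.get(node):
            node = trial[node].pop()
            walk.append(node)
        if len(walk) == len(seq):  # all edges consumed: valid shuffle
            return "".join(walk)
```

Any output of this function has, by construction, exactly the same di-nucleotide composition as the input, which is the property the rejection experiments rely on.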
The classifier developed in this study should be compared directly with the previous models, especially the ones using RNA structural features. The results for nRC and RNAGCN in Table 6 were taken from a previous study. It is unclear whether the same datasets and testing strategy were also used in the previous study.

Reply:
The results reported in Table 6 all refer to the same dataset (training set and test set), named test13, provided by nRC's authors in F., 2017. We have re-applied only EDeN on this dataset, since the source code, or an executable version, of RNAGCN is not available.

Reviewer #2:

Summary: In recent years, research evidence has shown that secondary structure is a key factor in determining the function of RNA, and some machine-learning-based methods have been successfully shown to predict RNA function from secondary-structure information. At present, the existing methods for predicting RNA function all have some deficiencies: BLAST has a high false-negative rate, GraPPLE has a high false-positive rate, and INFERNAL has a high computational cost. In this context, the authors propose a method based on the original sequence, without calculating the known secondary-structure features. The method is more robust to sequence boundary noise and drastically reduces the computational cost, allowing for large-data-volume annotations. These last two advantages, together with fast classification speed, are essential for large genome annotation.

Major Comments
In general, the idea of this paper is to find a new way to predict RNA function from the original sequence information, instead of the existing methods of predicting RNA function through secondary structure, which is of great significance. However, when using k-mers and space-filling curves to represent the input, the authors could add some improvements to these two existing methods.
Reply: Thanks for the suggestion. Indeed, improving these representations can be a non-trivial task; however, we emphasize that the contribution of the paper is to show how a raw sequence representation can be enough to improve the state of the art in short RNA function prediction, avoiding the computation of secondary structure, which can be very time-consuming.
Secondly, two uncertainty estimators, information entropy and top difference, were evaluated in the prediction of RNA function. Regarding the threshold setting for these two uncertainty estimators, the authors do not provide the corresponding information.

Reply:
They are usually adopted in the literature and have been empirically calibrated. Anyway, we have also reported the ROC curve in Figure 9. We make this clearer in the text. Thanks for this point.
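For concreteness, the two uncertainty estimators can be sketched as follows; this is a minimal illustration with function names of our choosing, and the paper's exact formulation (e.g. the logarithm base) may differ:

```python
import math

def entropy(probs):
    """Shannon (information) entropy of a class-probability vector;
    higher values indicate a more uncertain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_difference(probs):
    """Gap between the two largest class probabilities;
    smaller values indicate a more uncertain prediction."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]
```

A prediction would then be rejected when entropy exceeds (or top difference falls below) an empirically calibrated threshold, e.g. one chosen from the ROC analysis in Figure 9.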
Finally, in assessing RNA function, the authors assume that any further structural coding in the input representation does not help improve performance, which remains to be debated and requires corresponding arguments to prove.

Reply:
We compared our approach with EDeN, nRC, and RNAGCN. All of them precompute the 2D structure with tools such as ViennaRNA and IPknot and extract the set of features adopted by the learning algorithm. We observed that our deep network architecture is able to learn functions without pre-computing the 2D structure but directly from the raw input sequence, and it performs more robustly under boundary noise. See the answer to reviewer #1 for further arguments.

Minor Comments
Picture layout: The graphs and tables in the paper are far apart from the text content that concerns them, which seems very inconvenient.

Reply:
In the new version, we have revised the figure positions according to your suggestion.
Supplementary Notes: (13th line from the bottom, page 4) The sentence "In our experiments we consider k varying from 1 to 3" needs to be supplemented to explain why k varies from 1 to 3 and the effect of k on the experiment.

Reply:
In the computational scenario of ncRNA classification, mono-, di-, and tri-nucleotide patterns have always been considered important discriminative features. We did not explore the effect of k in our experiment, but just considered three levels of k as three different input representations. Varying k from 1 to 3, we gain insight spanning from an atomic to a higher level of molecular composition of the sequence.
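The three input representations for k = 1, 2, 3 might be sketched as follows; this assumes overlapping k-mer windows and a one-hot vector of length 4^k per k-mer, and both the windowing scheme and the function name are our assumptions, not restated from the paper:

```python
from itertools import product

def kmer_one_hot(seq, k, alphabet="AGCT"):
    """Encode the overlapping k-mers of seq as one-hot binary vectors of
    length 4**k, preserving their sequential order (here k = 1, 2, or 3).

    Sketch only: the paper's exact windowing may differ.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vectors = []
    for i in range(len(seq) - k + 1):
        v = [0] * len(index)       # 4**k positions, one per possible k-mer
        v[index[seq[i:i + k]]] = 1
        vectors.append(v)
    return vectors
```

With k = 1 this reduces to plain per-nucleotide one-hot encoding; k = 2 and k = 3 capture di- and tri-nucleotide composition while still keeping positional order.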
Subjective argument: (1st line from the bottom, page 5) The sentence "We set empirically the kernel size to 3 and the number of filters at each i-th layer to 32 * 2^i" is somewhat subjective, and the authors should make clear what experience the kernel size and the number of filters at each i-th layer are based on.

Reply:
In order to choose the set of parameters for our models that best represents the peculiarities of the functional classification problem, we first performed several hyperparameter optimization experiments (data not shown). Regarding the number of filters, we chose an increasing number of filters in order to expand the representation in each subsequent layer relative to the previous one. Regarding the kernel size, a smaller size in general helps to capture local and complex features in the data, compared to a larger size that extracts more general features spread across the sequence. Moreover, with a smaller kernel size the number of extracted features is notable, which can be further useful in later layers.
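The 32 * 2^i filter scheme can be made concrete with a small sketch (the number of layers shown is illustrative only, and the helper name is ours):

```python
def conv_blocks(num_layers, base_filters=32, kernel_size=3):
    """Return (filters, kernel_size) for each i-th convolutional layer
    under the 32 * 2**i scheme mentioned in the manuscript.

    Sketch only: num_layers is illustrative, not the paper's actual depth.
    """
    return [(base_filters * 2 ** i, kernel_size) for i in range(num_layers)]

# e.g. a 3-block stack: 32, 64, 128 filters, each with kernel size 3
specs = conv_blocks(3)
```

Doubling the filters at each layer lets deeper layers combine the small kernel-size-3 local motifs into progressively richer representations.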

Reviewer #3:
The article by Noviello et al. is a nice investigation of non-coding RNAs, for which functional annotations are beneficial to the biological community. The authors exploited deep learning methods to tackle the challenge, and their results shed new light on the structure-function relationship in this class of biomolecules.
The authors also provided all the scripts and documentation to reproduce their work and compared their work to other state-of-the-art methods.
The work is nicely written and logical to follow. I have only minor comments to be addressed: a general proofreading to get rid of the remaining typos and some grammatical errors, or too-wordy sentences.

Reply:
We have revised the text as suggested. Thanks.
I am not an expert on non-coding RNAs, and in reading about the dataset curation I was wondering how the 41 classes were selected and, in general, would like to know more about how the classification of non-coding RNA sequences into classes is done. This might also be beneficial for a broad audience such as that of PLOS COMP BIOL.

Reply:
The database is essentially an updated version of the dataset adopted in Navarin and Costa (Bioinformatics, 2017), which is derived from the RFAM database. To address the issue related to the focus of the paper (reviewer #1), and to be consistent with the new focus, we decided to further extend the dataset to include almost all short non-coding RFAM classes. The dataset now includes 88 short non-coding classes and 306016 sequences, almost tripling the previous datasets and making the study stronger. So there is now no selection, as all RFAM classes are taken into consideration. To make the focus clearer to a broad audience, we added some clarification text in the introduction.
It will be nice if the authors could explain a little bit more the rationale behind the choice of the deep network architecture for this case study, instead of other approaches, also to benefit a broader audience.

Reply:
We added some clarification text in the introduction. Thanks.

Make the conclusions less technical and more accessible to biologists, so that they can really appreciate the value of the work.