Identification of RNA pseudouridine sites using deep learning approaches

Abu Zahid Bin Aziz; Md. Al Mehedi Hasan; Jungpil Shin

doi:10.1371/journal.pone.0247511

Abstract

Pseudouridine (Ψ) is one of the most prevalent RNA modifications and has been confirmed to occur in rRNA, mRNA, tRNA, and nuclear/nucleolar RNA. Identifying Ψ sites therefore has vital significance in academic research, drug development and gene therapies. Several laboratory techniques for Ψ identification have been introduced over the years. Although these techniques produce satisfactory results, they are costly and time-consuming and require skilled, experienced operators. As the amount of available RNA sequence data grows day by day, an efficient computational method for identifying pseudouridine sites is very important. In this paper, we propose a multi-channel convolutional neural network using binary encoding. We employed k-fold cross-validation and grid search to tune the hyperparameters. We evaluated its performance on the independent datasets and found promising results. The results indicate that our method can be used to identify pseudouridine sites for associated purposes. We have also implemented an easily accessible web server at http://103.99.176.239/ipseumulticnn/.

Citation: Aziz AZB, Hasan MAM, Shin J (2021) Identification of RNA pseudouridine sites using deep learning approaches. PLoS ONE 16(2): e0247511. https://doi.org/10.1371/journal.pone.0247511

Editor: Y-h. Taguchi, Chuo University, JAPAN

Received: September 30, 2020; Accepted: February 8, 2021; Published: February 23, 2021

Copyright: © 2021 Aziz et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files. You can also access the datasets here: http://103.99.176.239/ipseumulticnn/datasets.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Pseudouridine is the most common RNA modification observed in both prokaryotes and eukaryotes [1]. It is formed by Ψ synthase enzymes, and its occurrence has been demonstrated in various kinds of RNA [2]. The enzyme separates the uridine residue’s base from its sugar and rotates it 180° along the N3-C6 axis. The process is completed by the subsequent reattachment of the base’s 5’-carbon to the 1’-carbon of the sugar, which results in the formation of an isomer of uridine, pseudouridine [3]. Pseudouridines play a vital role in both biological and genetic aspects of RNAs, especially for tRNA and rRNA. For rRNA, ribonucleoproteins have been shown to be required for pseudouridylation [4]. Pseudouridines also act as a powerful mechanism for stabilizing tRNAs in both single- and double-stranded regions [5]. Their effects can also be species-specific; for example, pseudouridylation of U6 snRNA at Ψ28 contributes to the filamentous growth program in yeast [6]. Furthermore, mRNAs incorporating Ψ show increased translation efficiency and a restricted innate immune response [7]. Therefore, an effective method for identifying Ψ sites is of vital significance.

Several laboratory techniques have been introduced over the years with promising results. Carlile et al. introduced a transcriptome-wide pseudouridine-seq approach, whereas Lovejoy et al. relied on induced termination of reverse transcription [8, 9]. Furthermore, Schwartz et al. developed a transcriptome-wide quantitative mapping system to identify pseudouridine [10]. All of these systems are both expensive and time-consuming. Moreover, skilled and experienced people are required to operate them. That is why a more accessible method is required for identifying pseudouridine sites.

Despite the necessity, there are not many in silico methods to identify Ψ sites from nucleotide sequences. Li et al. introduced an SVM-based web server which is, as far as we know, the first computational method to identify pseudouridine synthase (PUS) specific Ψ sites [11]. They extracted features from the nucleotides around the Ψ sites, which provided good results for human and yeast samples. Later, Chen et al. improved on this performance with iRNA-PseU, which takes into account the chemical properties and the occurrence frequency density distributions of nucleotides. Their work also covered another species (M. musculus) [12]. He et al. proposed another SVM-based web server named PseUI [13]. They first generated five different types of features and then selected among them using the sequential forward feature selection approach.

Among the recent works, Tahir et al. implemented both machine learning and deep learning methods [14]. They extracted features using n-gram and MMI for their SVM classifier and adopted a convolutional neural network (CNN) as their deep learning method; the CNN classifier produced the better performance. To the best of our knowledge, this is the only method that has applied deep learning to this task so far. Using the best features from forward and incremental feature selection, Liu et al. proposed a gradient boosting based method named XG-PseU [15]. Furthermore, Mu et al. proposed an ensemble model named iPseU-Layer consisting of three machine learning techniques [16]. They employed random forest for the final prediction.

Many of the recent works used PseKNC for feature extraction [17–19]. In contrast, we wanted to adopt a CNN model which does not require any additional feature extraction technique. CNNs have already proven useful in computer vision problems, and recently they have been producing satisfactory results on nucleotide-based datasets [14, 20–23]. In this work, we employed a CNN model where multiple channels of convolution layers with different sized filters are applied separately. Each of these convolution layers is followed by a max-pooling layer, and the pooled outputs are concatenated. Our model yielded satisfactory results on the benchmark and independent datasets.

Materials and methods

Dataset collection

In this work, data were collected for three different species, H. sapiens, S. cerevisiae and M. musculus, represented by HS, SC and MM respectively. There were three benchmark datasets for training purposes, HS_990, SC_628, and MM_944, one for each species. Each of these datasets was balanced in terms of the number of samples. These are the same datasets used in Chen et al.’s work, where the RNA sequences were downloaded from RMBase [12, 24]. In addition to these benchmark datasets, Chen et al. also provided two independent datasets for testing purposes, HS_200 and SC_200, covering H. sapiens and S. cerevisiae but not M. musculus. In both HS_200 and SC_200, the number of positive and negative samples was equal. In the datasets, RNA sequences were formulated as shown:

$$R_{\xi}(\mathrm{U}) = N_{-\xi} N_{-(\xi-1)} \cdots N_{-2} N_{-1} \, \mathrm{U} \, N_{+1} N_{+2} \cdots N_{+(\xi-1)} N_{+\xi} \tag{1}$$

Here, U indicates “uridine”, N_−ξ denotes the ξ-th upstream nucleotide towards the 5’ end and N_+ξ denotes the ξ-th downstream nucleotide towards the 3’ end from the central uridine. The value of ξ was 10 in HS_990 and MM_944 and 15 in SC_628, so the sequences were 21 and 31 nucleotides long respectively.
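
The windowing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_window` and the example transcript are hypothetical, and the benchmark datasets already ship as fixed-length windows, so this step is not part of the authors' pipeline.

```python
def extract_window(rna: str, center: int, xi: int) -> str:
    """Return the (2*xi + 1)-nt window centred on the uridine at index `center`."""
    if rna[center] != "U":
        raise ValueError("the central nucleotide must be a uridine (U)")
    if center - xi < 0 or center + xi >= len(rna):
        raise ValueError("not enough flanking nucleotides for this xi")
    # xi nucleotides upstream, the central U, xi nucleotides downstream
    return rna[center - xi : center + xi + 1]


# Hypothetical transcript fragment, made up for illustration only.
transcript = "GGCAUCGAUCGUAGCUAGCUUAGCGAUCGAUCGGAUCC"
window = extract_window(transcript, center=20, xi=10)
print(len(window))  # 21, the HS/MM window length
```

With xi = 15 the same function would return 31-nt windows, matching the SC datasets.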

Data preprocessing

Before feeding the RNA sequences to our model, we needed to preprocess them. Preprocessing involved only one step: binary “one-hot” encoding, which converts each input into a 2-dimensional matrix. Each nucleotide of an input sequence was represented as a row vector in which all values are zero except for one. We applied two separate techniques for this task.

General “one-hot” encoding.

In this technique, the length of these row vectors was four, which is the number of nucleotides found in RNA. Therefore, a sequence with N nucleotides would become an (N x 4) matrix. The 1D vectors we chose for the nucleotides were: (“A” = [1, 0, 0, 0], “U” = [0, 1, 0, 0], “C” = [0, 0, 1, 0], “G” = [0, 0, 0, 1]).
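
A minimal sketch of this encoding in plain Python follows. The vectors are the ones listed above; the function name `one_hot_encode` is ours, chosen for illustration.

```python
# Nucleotide-to-vector mapping as listed in the text.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "U": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}


def one_hot_encode(seq: str) -> list:
    """Encode an RNA sequence of length N as an N x 4 matrix (list of rows)."""
    # list(...) copies each vector so rows never share the same list object
    return [list(ONE_HOT[nt]) for nt in seq.upper()]


matrix = one_hot_encode("GAUC")
for row in matrix:
    print(row)
```

A 21-nt sequence from the HS or MM datasets would therefore become a 21 x 4 matrix.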

Merged-seq “one-hot” encoding.

We also applied another technique that uses secondary structures predicted by RNAfold. Studies have shown that secondary structure reveals critical structural features for detecting Ψ sites [25]. We wanted to capture this mechanism computationally, so we predicted the secondary structure and merged it with the original sequence. We called the result “merged-seq”. The secondary structure provided a new set of features, and merging it with the original sequence generated further features. This technique provided good predictive performance in Zheng et al.’s pre-miRNA detection [23]. The encoding process is shown in Fig 1. The predicted secondary structure and merged sequence for each RNA sequence can be found in the supporting information or at this link: http://103.99.176.239/ipseumulticnn/datasets. The following steps were followed for this technique:

First, we predicted the secondary structure of the original sequence using RNAfold [26]. This structure had three types of symbols: “.”, “(” and “)”. The “(” and “)” indicated that a nucleotide towards the 5’-end and its complementary nucleotide towards the 3’-end are paired, and the “.” indicated that the nucleotide is not paired with any other nucleotide.
Then, we formed a merged sequence that consisted of the original sequence and the secondary structure. This merged sequence had N pairs, N being the length of the sequence. The pairs were formed by taking one nucleotide from the original sequence and one symbol from the secondary structure.
As there were four types of bases in RNA and three types of indicators in the secondary structure, we had 12 types of pairs in the merged sequences. After that, we encoded the pairs of the merged sequences using the “one-hot” technique. After encoding, an RNA sequence of length N became a two-dimensional matrix of (N x 12). So for both the HS and MM datasets, the preprocessed inputs turned into a (21 x 12) matrix, and for the SC dataset, the inputs turned into a (31 x 12) matrix.
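
The steps above can be sketched as follows, assuming the dot-bracket structure has already been predicted. The column order of the 12 pair types is our assumption (the text does not specify it), and the example structure string is hand-written for illustration rather than actual RNAfold output.

```python
from itertools import product

NUCLEOTIDES = "AUCG"
SYMBOLS = ".()"
# 4 nucleotides x 3 structure symbols = 12 pair types.
# The index assigned to each pair is an assumption made for this sketch.
PAIR_INDEX = {pair: i for i, pair in enumerate(product(NUCLEOTIDES, SYMBOLS))}


def merged_seq_encode(seq: str, structure: str) -> list:
    """Encode (nucleotide, structure symbol) pairs as an N x 12 one-hot matrix."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    matrix = []
    for pair in zip(seq.upper(), structure):
        row = [0] * len(PAIR_INDEX)
        row[PAIR_INDEX[pair]] = 1
        matrix.append(row)
    return matrix


seq = "GGGAAAUCCC"
structure = "(((....)))"  # hand-written example, not RNAfold output
encoded = merged_seq_encode(seq, structure)
print(len(encoded), len(encoded[0]))  # 10 12
```

Keeping the pair-to-index mapping in a single dictionary makes it easy to apply exactly the same column order at training and prediction time.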

Fig 1. Vectorization process of the RNA sequences.

Here, the secondary structure was the result predicted by RNAfold. The merged sequence was formed by pairing the original sequence with the secondary structure. This merged sequence was then encoded using the “one-hot” technique.

https://doi.org/10.1371/journal.pone.0247511.g001

CNN architecture

After preprocessing (“one-hot” encoding), the converted 2D inputs were fed to a convolutional neural network. Generally, in a CNN model, the inputs are connected to some convolution and max-pooling layers, followed by a couple of fully connected layers that are connected to the output layer. In our case, the preprocessed inputs were instead fed to a multi-channel CNN model, an approach that has been very effective in various text classification tasks [27, 28]. The motivation behind this approach was to process a sequence at several filter lengths at the same time. In a sequential model, each convolution uses only one filter size, which may not always extract the best features. We therefore applied multiple channels of feature extraction operations (convolution and max-pooling) to the input sequence and integrated the resulting features for better Ψ identification. A general architecture of our multi-channel model is shown in Fig 2.

Fig 2. The architecture of our multi-stage CNN model.

https://doi.org/10.1371/journal.pone.0247511.g002

Each channel of our model started with a convolution layer. We tuned the number of channels and the height of the filters of the convolution layer. The width of the filters remained unchanged. Each of these convolution layers was then connected to a max-pooling layer. Then, the outputs of the max-pooling layers were concatenated to combine the features extracted by the convolution and max-pooling layers. Next, the concatenated output was connected to the first fully connected layer, which had 1024 nodes. After that, we employed dropout regularization to limit the overfitting that such a large number of parameters can cause. Then, the final layer was connected, which gave a probability distribution over the classes. From the probability distribution, the final output was predicted.

The number of convolution layers was selected by applying k-fold cross-validation and grid search. Cross-validation also helped us to select the learning rate, dropout probability and height of the filters. The ReLU activation function was employed in every layer except for the last layer, where the softmax activation function was used. This was the general structure of our model. Only the height of the filters and the number of convolution layers varied between datasets. We used categorical cross-entropy as the loss function. We also examined several well-known optimizers, such as Adam, gradient descent and RMSprop, to minimize the loss function. Among these optimizers, Adam gave the best results.
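
To make the multi-channel idea concrete, here is a framework-free sketch of the feature extraction stage only (convolution, ReLU, max-pooling, concatenation). It is not the authors' Keras implementation. Several details are assumptions made for brevity: filters span the full input width, pooling is taken over all positions, biases are omitted, the filter weights are random rather than learned, and the filter heights (3, 5, 7) and the count of 4 filters per channel are hypothetical.

```python
import random


def conv_relu_maxpool(x, filt):
    """Slide one filter (height h, full input width) down x, apply ReLU, keep the max."""
    h, w = len(filt), len(filt[0])
    best = 0.0  # ReLU output is never below zero
    for start in range(len(x) - h + 1):
        act = sum(x[start + i][j] * filt[i][j] for i in range(h) for j in range(w))
        best = max(best, act)
    return best


def multi_channel_features(x, channels):
    """Run every channel on the same input and concatenate the pooled outputs."""
    features = []
    for filters in channels:
        features.extend(conv_relu_maxpool(x, f) for f in filters)
    return features


def random_filters(n_filters, height, width, rng):
    return [[[rng.uniform(-1, 1) for _ in range(width)] for _ in range(height)]
            for _ in range(n_filters)]


rng = random.Random(0)
# A random 21 x 12 one-hot input standing in for an encoded merged sequence.
x = [[0] * 12 for _ in range(21)]
for row in x:
    row[rng.randrange(12)] = 1

# Three channels with different filter heights (hypothetical values).
channels = [random_filters(4, h, 12, rng) for h in (3, 5, 7)]
features = multi_channel_features(x, channels)
print(len(features))  # 12 = 3 channels x 4 filters
```

In the full model, this concatenated feature vector would feed the 1024-node fully connected layer, dropout, and the softmax output described above.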

Method evaluation metrics

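Four evaluation metrics have been frequently used to evaluate the quality of a method in recent studies [29–31]. To calculate them, we required four parameters: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The equations for the evaluation metrics are given below:

Sensitivity (SN): $$\mathrm{SN} = \frac{TP}{TP + FN} \tag{2}$$
Specificity (SP): $$\mathrm{SP} = \frac{TN}{TN + FP} \tag{3}$$
Matthews Correlation Coefficient (MCC): $$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$
Accuracy (AC): $$\mathrm{AC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$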

Results and discussions

Hyperparameter tuning

Hyperparameter tuning is vital to maximize a model’s predictive performance. On the benchmark datasets, we tuned a number of hyperparameters to fine-tune our model. We did it in three separate steps using k-fold cross-validation and grid search. We used k = 5 so that our results could be compared with the existing works, which applied cross-validation with the same value. That is, we divided each benchmark dataset into 5 folds; in each round, 4 folds were used for training and the remaining fold was used for testing that particular model.
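
A minimal sketch of such a 5-fold split is shown below. The shuffling seed and the function name `k_fold_indices` are our own choices, and the split is not stratified by class, which the original procedure may or may not have been.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# HS_990 has 990 samples, so each fold holds 198 of them.
for train_idx, test_idx in k_fold_indices(990, k=5):
    print(len(train_idx), len(test_idx))  # 792 198
```

Each sample appears in the test fold exactly once across the five rounds, so the averaged score uses every benchmark sample for evaluation.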

First, we tuned the number of epochs and batch size. Then, we tuned the number of channels and the height of convolution filters using the values from the first step. The number of channels was tuned to investigate how many of them can be separately connected to the input layer to produce the best accuracy. Finally, using the values from the previous steps we tuned the learning rate and dropout probability. Grid search was adopted to select the values that produced the best result.

The considered and selected values for the hyperparameters are given in Table 1. We calculated accuracies for every possible combination of values of these hyperparameters and selected the ones that provided the highest accuracy. Merged-seq “one-hot” encoding was used when we tuned the hyperparameters. Then we trained our model by applying general and merged-seq “one-hot” encoding separately using the tuned values. As the shape of the inputs was different across the datasets, the selected values were not the same. They were used to train our model on the benchmark datasets and were evaluated on the independent data.
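
The exhaustive search over combinations can be sketched as below. The grid values and the scoring function are placeholders invented for this example; in practice `score_fn` would train the model with the given parameters and return the mean 5-fold cross-validation accuracy, and the real ranges are those of Table 1.

```python
from itertools import product


def grid_search(grid: dict, score_fn):
    """Evaluate every combination in `grid` and return the best one with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for the third tuning step (learning rate and dropout).
grid = {"learning_rate": [0.01, 0.001, 0.0001], "dropout": [0.3, 0.5, 0.7]}


def toy_score(params):
    # Stand-in for cross-validated accuracy; peaks at lr=0.001, dropout=0.5.
    return -(params["learning_rate"] - 0.001) ** 2 - (params["dropout"] - 0.5) ** 2


best_params, best_score = grid_search(grid, toy_score)
print(best_params)  # {'learning_rate': 0.001, 'dropout': 0.5}
```

Tuning in stages, as described above, keeps the number of combinations manageable: each stage fixes the values selected earlier and searches only its own small grid.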

Table 1. The ranges of values of the hyperparameters of the benchmark datasets.

https://doi.org/10.1371/journal.pone.0247511.t001

Training

Since the performance of CNNs in computer vision and NLP tasks is well established, we wanted to apply their classification success to biological sequence inputs. After the concatenation of the multiple convolution and max-pooling layers of our multi-stage CNN model, the number of parameters increased significantly. To limit the overfitting that this many parameters can cause, we employed dropout regularization after the first fully connected layer. We also applied early stopping to guard against overfitting: we stopped the training process if the validation loss did not improve for a certain number of consecutive epochs. After tuning the hyperparameters, we used the selected values to train our model on the benchmark datasets. The validation and training processes were run on a Core i5 laptop with an NVIDIA 940M GPU. Because of the grid search, the validation process took almost an hour to complete, and the training process took about 2-3 minutes. We implemented our model using the Keras framework (2.2) with TensorFlow as the backend.
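
The early-stopping rule can be sketched independently of any framework. The patience value and the loss curve below are hypothetical, since the text does not state the number of epochs used.

```python
def early_stopping_epoch(val_losses, patience: int) -> int:
    """Return the epoch (0-based) at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0  # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch  # no improvement for `patience` epochs in a row
    return len(val_losses) - 1  # ran through all epochs without stopping


# Hypothetical validation-loss curve.
losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.57]
print(early_stopping_epoch(losses, patience=3))  # 5
```

Note that with a patience of 3 the run stops before the late improvement at the last epoch, which illustrates the trade-off in choosing the patience value.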

We trained our model on the benchmark datasets using both general “one-hot” encoding and merged-seq “one-hot” encoding separately. Of these techniques, merged-seq “one-hot” encoding produced the better performance. We employed the same model architecture in both cases using the tuned hyperparameters. We compared the performance of our models with the existing predictors (iRNA-PseU [12], PseUI [13], iPseU-CNN [14], XG-PseU [15], iPseU-Layer [16]) on the benchmark datasets, as shown in Table 2. From the table, we can see that our models produced satisfactory results. The training accuracy of our model was lower than that of iPseU-Layer because of that model’s overfitting, which is stated by Mu et al. That is why our model had better accuracy on the independent datasets despite having lower accuracy on the benchmark datasets. Even though our model did not achieve the highest accuracy, it increased sensitivity by 4.26% and 20.27% on the SC_628 and HS_990 datasets respectively. In our case, sensitivity represents the ratio of correctly identified Ψ sites to all sequences that actually contained Ψ sites. That means our models were able to predict actual Ψ sites quite well.

Table 2. Comparison of the evaluation metrics with the existing predictors on the benchmark datasets.

https://doi.org/10.1371/journal.pone.0247511.t002

Comparative analysis

After training our models on the benchmark datasets, we examined their performance on the independent datasets by comparing the evaluation metrics with the existing predictors (iRNA-PseU [12], PseUI [13], iPseU-CNN [14], iPseU-Layer [16]). The findings are shown in Table 3. As in our training process, we tested both the general and merged-seq encoded models. Although both models produced better results than the existing predictors, the merged-seq encoded model outperformed them all.

Table 3. Comparison of the performance of our model with the existing predictors on the independent datasets.

https://doi.org/10.1371/journal.pone.0247511.t003

Among the existing methods, iPseU-CNN produced the best performance on the SC_200 dataset, so we calculated the increase in performance with respect to this classifier. On the SC_200 dataset, the specificity, accuracy and MCC were increased by 6.65%, 2% and 6.38% respectively for our general “one-hot” encoded model. For our merged-seq “one-hot” encoded model, accuracy increased by 4.08%, sensitivity increased by 16.34% and MCC increased by 12.76%. Here, our merged-seq “one-hot” encoded model produced the better performance.

In the HS_200 dataset, iPseU-Layer produced the best performance among the existing methods. In this dataset, our general “one-hot” encoded model had improved performance in accuracy by 2.11%, sensitivity by 26.98% and MCC by 2.32%. On the other hand, our “merged-seq” encoded model outperformed iPseU-Layer in accuracy, sensitivity and MCC by 4.22%, 15.87% and 11.62% respectively. Similar to the SC_200 dataset, our merged-seq “one-hot” encoded model produced better evaluation metrics in this dataset.
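
The percentage gains quoted above read most naturally as relative improvements, (new − old) / old. That reading is our assumption, as is pairing the merged-seq accuracy gains with the 76.5% (SC_200) and 74% (HS_200) accuracies reported in the Conclusion. Under those assumptions, a short back-of-the-envelope check recovers the baseline accuracies that the gains would imply; these are derived values, not figures taken from Table 3.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def implied_baseline(new: float, gain_percent: float) -> float:
    """Baseline value that would yield `gain_percent` relative improvement."""
    return new / (1.0 + gain_percent / 100.0)


# Accuracies from the Conclusion, gains from the comparison paragraphs.
print(round(implied_baseline(76.5, 4.08), 1))  # SC_200 vs iPseU-CNN: 73.5
print(round(implied_baseline(74.0, 4.22), 1))  # HS_200 vs iPseU-Layer: 71.0
```

Both implied baselines land on round values, which is consistent with the gains being relative rather than absolute percentage points.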

Since we applied deep learning methodologies in our work, we wanted to produce better results than other deep learning methodologies. As far as we know, iPseU-CNN is the only available deep learning methodology that used the same datasets as us. Although their encoding is similar to our general encoding technique, they adopted a single-stage sequential model, whereas our model had a multi-stage architecture. Both our general and merged-seq “one-hot” encoded models had better accuracy, sensitivity and MCC than iPseU-CNN. So we can say that our models outperform the existing deep learning methodology on these evaluation metrics. To enhance the comparison, we provide a graphical comparison of our models with the state-of-the-art methods on the independent datasets, which is depicted in Fig 3.

Fig 3. Graphical comparison of our models with the existing works in the independent datasets.

https://doi.org/10.1371/journal.pone.0247511.g003

We also plotted the receiver operating characteristic (ROC) curve on the benchmark and independent datasets to gain a better understanding of our merged-seq “one-hot” encoded model. The plot is illustrated in Fig 4. The ROC curve shows how well a model can differentiate between classes. Our model achieved AUC (Area Under Curve) scores of 0.88, 0.94 and 0.83 on the HS_990, SC_628 and MM_944 benchmark datasets respectively. On the independent datasets, our model produced AUC scores of 0.77 and 0.78 on HS_200 and SC_200 respectively.
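
For intuition, the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are made up, and this simple pairwise form is only practical for small datasets such as the 200-sample independent sets.

```python
def roc_auc(labels, scores) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of the positive (Ψ) class.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
print(roc_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of Ψ and non-Ψ sites.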

Fig 4. Illustration of the performance of our model in the benchmark and independent datasets using ROC curve.

https://doi.org/10.1371/journal.pone.0247511.g004

Visualization of the learned features

We visualized the outputs after the concatenation of the multi-stage convolution and max-pooling layers to gain further insights into the learned features for both general and merged-seq “one-hot” encoded models. We employed approaches similar to those used in recent CNN-based works [32–34] to convert the kernel outputs into motifs. Then we used sequence logos to visualize them and compare them with the logos generated from the independent datasets (Fig 5). The logos were generated in terms of probabilities (first three rows) and information content (last three rows). From the sequence logos we can see that, while the logos for the general “one-hot” encoding differ somewhat from the ground truth, the logos of the merged-seq “one-hot” encoding based models are quite similar to the ground truth. We can also observe from the information content logos that our models were able to capture the motifs around the central uracil (U) quite well for both datasets.
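
The two logo types can be illustrated with a small calculation. The sketch assumes a set of aligned subsequences is already available (how they are selected from the kernel outputs is not detailed here), uses made-up sequences, and omits the small-sample correction that logo tools may apply.

```python
import math

ALPHABET = "AUCG"


def position_probabilities(aligned):
    """Per-position nucleotide probabilities from equal-length sequences."""
    length = len(aligned[0])
    probs = []
    for pos in range(length):
        column = [s[pos] for s in aligned]
        probs.append({nt: column.count(nt) / len(column) for nt in ALPHABET})
    return probs


def information_content(column_probs) -> float:
    """Information content in bits: log2(4) minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column_probs.values() if p > 0)
    return math.log2(len(ALPHABET)) - entropy


# Hypothetical aligned subsequences with a conserved central U.
aligned = ["AUG", "AUC", "AUA", "CUG"]
probs = position_probabilities(aligned)
for pos, column in enumerate(probs):
    print(pos, round(information_content(column), 3))
```

A fully conserved position, like the central U in this toy alignment, reaches the maximum of 2 bits, while variable positions score lower.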

Fig 5. A comparison between the learned motifs of the general and merged-seq “one-hot” encoding.

The ground truths were generated using Weblogo [35].

https://doi.org/10.1371/journal.pone.0247511.g005

Discussion

Our merged-seq “one-hot” encoded classifier has already been implemented and made available through a user-friendly web server. In this work, we tuned only those hyperparameters that we expected to improve the performance of our classifier. Nevertheless, tuning other hyperparameters may result in improved performance. In our merged-seq “one-hot” encoding, the secondary structure of RNA played a vital role in improving the overall performance. We can further investigate how these new features help to improve the predictive performance. We also noticed some false positives for our merged-seq “one-hot” encoded model because of the secondary structure provided by RNAfold. We can investigate other secondary structure predictors in the future for further improvements. We can also explore encoding techniques for RNA sequences other than “one-hot” encoding, such as Word2Vec. Furthermore, we can extend our work by applying our model to other species for Ψ site identification. Besides, there are other RNA modifications such as inosine (I), m3C and m5C. We can investigate whether our classifier can identify those sites from RNA sequences as well. Moreover, compared to the existing methods, our model produced the highest accuracy on both the HS_200 and SC_200 datasets.

Conclusion

The purpose of our work was to identify pseudouridine sites from RNA sequences using computational methods, in our case a multi-stage convolutional neural network. After preprocessing our data using “one-hot” encoding, we adopted a CNN model having multiple convolution and max-pooling layers connected to the input layer individually, followed by a couple of fully connected layers and an output layer. We applied k-fold cross-validation and grid search for hyperparameter tuning. We trained our model using the values selected during tuning. Then we tested the performance of our model on the independent datasets and found 74% accuracy on the HS_200 dataset and 76.5% accuracy on the SC_200 dataset. We expect that our classifier can become a helpful tool for identifying Ψ sites. We can also say that CNNs can be used as an important method for classifying biological data.

Supporting information

S1 File. The benchmark and independent datasets with the secondary structure by RNAfold and merged sequence that we applied in this work.

https://doi.org/10.1371/journal.pone.0247511.s001

(ZIP)

S2 File. Probabilities of each sequence of the independent datasets.

https://doi.org/10.1371/journal.pone.0247511.s002

(ZIP)

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable and constructive comments which were very helpful for improving the contents of the manuscript.

References

1. Hudson GA, Bloomingdale RJ, Znosko BM. Thermodynamic contribution and nearest-neighbor parameters of pseudouridine-adenosine base pairs in oligoribonucleotides. RNA. 2013;19(11):1474–1482. pmid:24062573
2. Ge J, Yu YT. RNA pseudouridylation: new insights into an old modification. Trends in biochemical sciences. 2013;38(4):210–218. pmid:23391857
3. Charette M, Gray MW. Pseudouridine in RNA: what, where, how, and why. IUBMB life. 2000;49(5):341–351. pmid:10902565
4. Bousquet-Antonelli C, Henry Y, Gélugne JP, Caizergues-Ferrer M, Kiss T. A small nucleolar RNP protein is required for pseudouridylation of eukaryotic ribosomal RNAs. The EMBO journal. 1997;16(15):4770–4776. pmid:9303321
5. Davis DR, Veltri CA, Nielsen L. An RNA model system for investigation of pseudouridine stabilization of the codon-anticodon interaction in tRNALys, tRNAHis and tRNATyr. Journal of Biomolecular Structure and Dynamics. 1998;15(6):1121–1132. pmid:9669557
6. Basak A, Query CC. A pseudouridine residue in the spliceosome core is part of the filamentous growth program in yeast. Cell reports. 2014;8(4):966–973. pmid:25127136
7. Karijolich J, Yu YT. The new era of RNA modification. RNA. 2015;21(4):659–660. pmid:25780180
8. Carlile TM, Rojas-Duran MF, Zinshteyn B, Shin H, Bartoli KM, Gilbert WV. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature. 2014;515(7525):143–146. pmid:25192136
9. Lovejoy AF, Riordan DP, Brown PO. Transcriptome-wide mapping of pseudouridines: pseudouridine synthases modify specific mRNAs in S. cerevisiae. PLoS One. 2014;9(10). pmid:25353621
10. Schwartz S, Bernstein DA, Mumbach MR, Jovanovic M, Herbst RH, León-Ricardo BX, et al. Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA. Cell. 2014;159(1):148–162. pmid:25219674
11. Li YH, Zhang G, Cui Q. PPUS: a web server to predict PUS-specific pseudouridine sites. Bioinformatics. 2015;31(20):3362–3364. pmid:26076723
12. Chen W, Tang H, Ye J, Lin H, Chou KC. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy-Nucleic Acids. 2016;5:e332. pmid:28427142
13. He J, Fang T, Zhang Z, Huang B, Zhu X, Xiong Y. PseUI: pseudouridine sites identification based on RNA sequence information. BMC bioinformatics. 2018;19(1):306. pmid:30157750
14. Tahir M, Tayara H, Chong KT. iPseU-CNN: Identifying RNA pseudouridine sites using convolutional neural networks. Molecular Therapy-Nucleic Acids. 2019;16:463–470. pmid:31048185
15. Liu K, Chen W, Lin H. XG-PseU: an eXtreme Gradient Boosting based method for identifying pseudouridine sites. Molecular Genetics and Genomics. 2020;295(1):13–21. pmid:31392406
16. Mu Y, Zhang R, Wang L, Liu X. iPseU-Layer: Identifying RNA Pseudouridine Sites Using Layered Ensemble Model. Interdisciplinary Sciences: Computational Life Sciences. 2020; p. 1–11. pmid:32170573
17. Guo SH, Deng EZ, Xu LQ, Ding H, Lin H, Chen W, et al. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics. 2014;30(11):1522–1529. pmid:24504871
18. Feng P, Yang H, Ding H, Lin H, Chen W, Chou KC. iDNA6mA-PseKNC: Identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics. 2019;111(1):96–102. pmid:29360500
19. Yang H, Qiu WR, Liu G, Guo FB, Chen W, Chou KC, et al. iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. International journal of biological sciences. 2018;14(8):883. pmid:29989083
20. Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics. 2017;33(13):1930–1936. pmid:28334114
21. Aoki G, Sakakibara Y. Convolutional neural networks for classification of alignments of non-coding RNA sequences. Bioinformatics. 2018;34(13):i237–i244. pmid:29949978
22. Zheng X, Xu S, Zhang Y, Huang X. Nucleotide-level Convolutional Neural Networks for Pre-miRNA Classification. Scientific reports. 2019;9(1):1–6. pmid:30679648
23. Zheng X, Fu X, Wang K, Wang M. Deep neural networks for human microRNA precursor detection. BMC bioinformatics. 2020;21(1):1–7. pmid:31931701
24. Sun WJ, Li JH, Liu S, Wu J, Zhou H, Qu LH, et al. RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data. Nucleic acids research. 2016;44(D1):D259–D265. pmid:26464443
25. Carlile TM, Martinez NM, Schaening C, Su A, Bell TA, Zinshteyn B, et al. mRNA structure determines modification by pseudouridine synthase 1. Nature chemical biology. 2019;15(10):966–974. pmid:31477916
26. Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL. The vienna RNA websuite. Nucleic acids research. 2008;36(suppl_2):W70–W74. pmid:18424795
27. Guo B, Zhang C, Liu J, Ma X. Improving text classification with weighted word embeddings via a multi-channel TextCNN model. Neurocomputing. 2019;363:366–374.
28. Sun K, Li Y, Deng D, Li Y. Multi-channel CNN based inner-attention for compound sentence relation classification. IEEE Access. 2019;7:141801–141809.
29. Cheng X, Lin WZ, Xiao X, Chou KC. pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC. Bioinformatics. 2019;35(3):398–406. pmid:30010789
30. Chen W, Ding H, Zhou X, Lin H, Chou KC. iRNA (m6A)-PseDNC: identifying N6-methyladenosine sites using pseudo dinucleotide composition. Analytical biochemistry. 2018;561:59–65. pmid:30201554
31. Qiu WR, Sun BQ, Xiao X, Xu ZC, Jia JH, Chou KC. iKcr-PseEns: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics. 2018;110(5):239–246. pmid:29107015
32. Quang D, Xie X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods. 2019;166:40–47. pmid:30922998
33. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature biotechnology. 2015;33(8):831–838. pmid:26213851
34. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic acids research. 2016;44(11):e107–e107. pmid:27084946
35. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome research. 2004;14(6):1188–1190. pmid:15173120

