Block sparse Bayes-based fuzzy system for RNA N6-methyladenosine sites prediction

Abstract

N6-methyladenosine (m6A) can significantly affect RNA expression, gene regulation, and determination of cell fate. As a common and abundant post-transcriptional modification (PTM) of RNA, m6A is also closely associated with the occurrence of numerous diseases. Thus, identifying m6A modification sites in RNA sequences is a prerequisite for related research. High-throughput sequencing technology is demanding and not cost-effective. Computational methods have made encouraging progress in site prediction. However, most models only consider differences between species and ignore RNA modifications in different tissues within the same species. We develop and validate a fuzzy system based on Block Sparse Bayesian Learning (BSBL), named BSBL-TSK-FS, which is a powerful sequence-level m6A prediction model. We introduce a Bayesian method that provides a posterior probability output and produces sparser solutions, which improves the accuracy of the model. The model classifies m6A sites in several tissues of mouse, human, and rat. Under five-fold cross-validation (5-CV), the accuracy of the BSBL-TSK-FS model is 0.84∼0.95. The accuracy of our model improves by 9.4% over the existing SOTA predictors. Finally, to verify the generalizability of the model, we carry out cross-species tests, and the results demonstrate the robustness and adaptability of the model. An accurate and reliable sequence modification prediction model is developed to better understand the complex landscape of methylation modification.

Author summary

RNA molecules undergo a large number of PTMs that can affect their structure and interaction properties. As the most common type of PTM, N6-methyladenosine (m6A) plays a crucial role in life processes such as gene silencing, cell localization, and parental imprinting, and is associated with various diseases. Therefore, accurate identification of m6A modification sites from mRNA sequences is of great significance for basic research and drug development. Experimental methods scale poorly to large studies. In response to these limitations, computational models have been developed to quickly and economically identify m6A modification sites. In this study, we propose a fuzzy system prediction model, called BSBL-TSK-FS, to identify m6A. We verify the performance of the model on benchmark datasets. BSBL-TSK-FS performs well on 11 datasets, with an average AUC value of 0.9619 and an average accuracy value of 0.9028.

Introduction

Post-transcriptional modifications (PTMs) in RNA are common in all areas of life [1,2]. In addition to regulating RNA life stages, modification sites affect RNA localization, tertiary structure, function, and biogenesis [3,4], and thereby the biological function of RNA. PTMs are produced by covalent alterations or isomerization of nucleotides, usually involving the addition of chemical groups at different locations in the nitrogenous base or ribose ring [5–7]. More than 150 PTMs have been identified, of which m6A is the most prevalent type of PTM in RNA [8,9]. As an important epigenetic modification, m6A plays a crucial role in gene silencing, cell localization, parental origin imprinting, and other life processes [10–12]. It also regulates RNA localization, transcription, splicing, and stability [13–15]. In addition, it has been linked to diseases such as stomach cancer, obesity, and breast tumors [16–18]. To carry out basic research and develop new drugs, it is extremely important to precisely identify m6A modification sites from mRNA sequences.

Currently, most experimental methods for locating RNA post-transcriptional modifications fall into three categories: immunoprecipitation methods, chemical-based detection methods, and enzyme-specific methods. RNA immunoprecipitation-dependent methods include MeRIP-Seq [19], m6A-Seq [20], and miCLIP [21]. Pseudo-Seq [22] and AlkAniline-Seq [23] utilize compounds that selectively react with modified ribonucleotides to identify modification sites. Specific enzymes are used in methods such as m6A-REF-Seq [24] or DART-Seq [25]. Although these are the current gold standards, they still have certain limitations. For example, each PTM requires its own experimental protocol, the assays are sensitive to cross-reactivity of antibodies or chemical reagents, and complex protocols can introduce bias. They are also limited by the availability of compounds or specific antibodies [26,27]. In addition, several methods based on Oxford Nanopore technology have been developed, such as Epinano [28], Nanocompore [29], Tombo [30], and CHEUI [31]. These methods are time-consuming, laborious, and expensive, and the slow detection process further limits their applicability to large-scale studies [32,33]. In response to these limitations, computational models have been developed to quickly and economically identify m6A modification sites, making them ideal for large-scale data analysis.

Computational methods have become an attractive option for researchers. iRNA-Methyl is a pioneering predictor specifically designed to identify m6A sites in RNA, using support vector machines combined with hand-extracted features to build a model [34]. m6Apred is a predictor specifically designed to identify m6A sites in the Saccharomyces cerevisiae transcriptome [35]. The predictor extracts features based on physicochemical binary coding and cumulative nucleotide frequency. SRAMP can be used to predict RNA m6A sites in mammals [36]. It extracts features based on nucleotide binary encoding, secondary structure binary encoding, KSNPF, and KNN score, and assembles three random forests to predict m6A sites. WHISTLE [37] effectively captures key signals associated with m6A modifications by integrating sequence and evolutionary features, and is trained using an SVM. deepSRAMP [38] introduces an isoform-level m6A site prediction framework that leverages BiGRU networks combined with a multi-head attention mechanism to capture complex sequence dependencies. By encoding RNA sequences into fixed-length embeddings, the model effectively extracts deep contextual features and achieves promising predictive performance across multiple tissues and species. These predictors identify m6A modification sites based on specific tissues in a single species. Dao et al. [39] designed iRNA-m6A based on a support vector machine (SVM). Using a one-hot encoding scheme, Liu et al. [40] developed the im6A-TS-CNN tool, which predicts sites using convolutional neural networks (CNNs). TS-m6A-DL is described by Abbas et al. [41] as a method based on deep neural networks (DNNs). A combination of four classification algorithms and three deep learning models is used in the im6APred model presented by Luo et al. [42]. A tool called DL-m6A, which uses three different feature encoding schemes, was proposed by Rehman et al. [43].
m6A-TSHub [44] is a comprehensive platform for tissue-specific m6A research, integrating four key modules: m6A-TSDB, m6A-TSFinder, m6A-TSVar, and m6A-CAVar, which support database construction, predictive modeling, and variant impact analysis. It enables systematic exploration of tissue-specific m6A methylation from both low-resolution data and genetic variants across 23 human tissues. Most of these models use one-hot coding, k-mer coding or physicochemical property coding to extract the characterization of RNA sequences. Nevertheless, these methods usually consider only shallow RNA sequence encoding and ignore potential correlations between nucleotides. These models all use tissue-specific datasets, but their accuracy and generality need to be improved.

In this study, we propose a block sparse Bayesian learning (BSBL)-based Takagi-Sugeno-Kang fuzzy system (TSK-FS), called BSBL-TSK-FS, to identify m6A. The proposed method is more effective than the standard TSK-FS. To extract sequence information as completely as possible, we use the position-specific nucleotide propensity (PSNP) to extract RNA sequence features. Extensive benchmarking experiments were conducted on well-curated datasets, and BSBL-TSK-FS achieves performance superior to current state-of-the-art methods. Finally, cross-species tests were carried out, and the results demonstrate the robustness and adaptability of the model.

The contributions of this study are summarized as follows.

(1) We improve the TSK fuzzy system on the basis of block sparse Bayesian learning by introducing a Bayesian approach that provides the output of posterior probabilities to produce more sparse solutions. Our model has higher accuracy.

(2) Our model does not require a setting for the penalty factor. The penalty factor in a general TSK-FS is a constant that balances the regularization and error terms; the experimental results are very sensitive to its value, and improper settings can cause problems such as overfitting. In BSBL-TSK-FS, this parameter is assigned automatically.

(3) Compared to traditional task-specific computational tools, our model does not require different coding representations of RNA sequences and can directly predict different types of methylation.

(4) Our model can identify methylation modification sites in various tissues of different species.

The next section displays the model framework, experimental results, sequence analysis, and cross-species validation results, and compares them in detail with other methods. The third part describes the experimental materials and methods, including data set introduction, TSK-FS and BSBL algorithms, feature extraction methods and performance evaluation criteria. Finally the paper is summarized.

Results

The BSBL-TSK-FS framework

Our framework uses a block sparse Bayes-based fuzzy system to predict widely occurring m6A modifications in different tissues of mouse, human, and rat. It consists of four key modules: an input and encoding module, a fuzzification module, a block sparse Bayes module, and a prediction module, as shown in Fig 1. In the fuzzifier, the input features are mapped to a high-dimensional space by fuzzy rules and fuzzy membership functions [45]. We then introduce a Bayesian method that provides a posterior probability output and produces sparser solutions, which improves the accuracy of the model [46,47]. The idea is to find the posterior probability via Bayes' rule. Given the hyperparameters, the solution is given by the maximum a posteriori (MAP) estimate [48]. The hyperparameters are estimated from the data by maximum likelihood [49]. More details can be found in the Materials and methods section. Our model does not need a manually set penalty factor, which is assigned automatically in BSBL-TSK-FS. This avoids problems such as overfitting caused by improper parameter settings. We use 11 widely recognized datasets created by Dao et al. [39] and perform 5-CV on them. These datasets come from different tissues, such as brain, heart, kidney, and liver, of mouse, human, and rat. While we applied BSBL-TSK-FS to the task of m6A modification detection, the framework can also be directly applied to other tasks, such as detecting other types of modifications.

Fig 1. Illustration of the BSBL-TSK-FS model architecture.

(A) Input and encoding module. The RNA sequence of length 41 nt is encoded into a matrix via PSNP with 5-mers. (B) Fuzzification module. The model uses a fuzzy system to process the data, obtains the fuzzy features, and passes them to the next module. (C) Block sparse Bayesian module. The sparse solution of the model is obtained by the block sparse Bayes algorithm, and the parameter p is solved. (D) Prediction module. m6A and non-m6A sites are identified from the predicted results.

https://doi.org/10.1371/journal.pcbi.1013621.g001

BSBL-TSK-FS performance

The main goal of our study was to establish a convenient and reliable predictor that achieves SOTA accuracy in identifying widely occurring m6A modifications from RNA sequences. Table 1 lists the results of our proposed model on the tissue-specific datasets under five-fold cross-validation (5-CV). BSBL-TSK-FS performs well on all 11 datasets, with an average AUC of 0.9619 and an average accuracy (ACC) of 0.9028. All datasets have ACC scores above 84%. The model performs best on the Rat Liver dataset, with ACC, MCC and AUC reaching 95.41%, 90.84% and 0.9900, respectively. It also performs well on Mouse Heart, with an ACC of 0.9542. Human Brain scores are the lowest, with ACC, MCC and AUC of 84.75%, 69.63% and 0.9231, respectively. AUCs are all above 0.92, and MCCs are all at least 0.6963. The highest AUC is only about 0.07 higher than the lowest, which shows that our model is very stable on the AUC criterion. The Mouse Brain dataset is the largest and yields comparatively low scores, indicating that our model has a slight weakness in handling large datasets. In general, BSBL-TSK-FS performs better on small-scale datasets than on larger ones. As a traditional machine learning method, it tends to be more effective when the sample size is limited. On larger datasets such as Mouse Brain, Human Brain, and Human Kidney, the increased diversity and complexity of the sequences may pose challenges for the fuzzy inference system, which is constrained by the number of fuzzy rules and thus may struggle to capture complex patterns. In addition, the performance of our approach relies heavily on effective feature extraction. The PSNP encoding, in particular, has shown higher discriminative power on small datasets, while the greater sequence diversity in large datasets may reduce the effectiveness of the extracted features. These factors may jointly contribute to the slight performance decline observed on large-scale datasets.
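The metrics reported throughout (SN, SP, ACC, MCC, AUC) can be illustrated with a minimal sketch. It assumes the standard confusion-matrix definitions and a rank-based AUC; the function names are hypothetical and this is not the authors' evaluation code.

```python
import math

def binary_metrics(y_true, y_pred):
    """SN, SP, ACC and MCC from 0/1 labels (standard definitions, assumed here)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for illustration but not for the largest datasets.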

Table 1. Results of the BSBL-TSK-FS model on m6A datasets under 5-CV.

https://doi.org/10.1371/journal.pcbi.1013621.t001

We present the intersections of modifications across tissues in Fig 2A–2C, showing both correlations and significant differences across the tissue data. Some m6A modification sites may occur only in specific tissues, while others may show similar tendencies in multiple tissues. Specifically, we list the exons associated with each modification and treat modifications that share the same exon as intersecting. We find that there are overlaps between the tissues, but the overlaps are less than 5%. Fig 2D is a Sankey diagram that shows the flow of samples from the 11 datasets according to their true and predicted labels. The thin curved lines represent the misclassified samples. Most of the samples are classified correctly, so they appear as strong straight lines. Our proposed model performs well overall, which provides preliminary evidence of its effectiveness in predicting m6A sites.

Fig 2. (A–C) Venn diagram of data for different species.

(D) Sankey diagram of prediction results for 11 datasets. Straight lines represent correctly classified samples, curved lines represent incorrectly classified samples, and the stronger the lines, the larger the number of samples.

https://doi.org/10.1371/journal.pcbi.1013621.g002

Consensus region analysis

To better understand the mechanism of modification, we use kpLogo [50] to study the distribution of nucleotides around the m6A site. Fig 3A–3C visualizes the methylation sequence patterns; the methylated sequence regions in the tissues are very similar. Fig 3D–3F shows the statistical difference in nucleotide occurrence between m6A and non-m6A samples. The top half represents sequences that contain m6A sites, and the bottom half represents sequences that contain non-m6A sites. There are significant differences in the distribution of nucleotides between positive and negative samples (t-test, p < 0.01). The flanking sequences of m6A in all tissues are biased towards GC-rich regions, while the flanking sequences of non-m6A sites are biased towards AU-rich regions. This also indicates that constructing an m6A classification model from sequence information is reasonable. We further analyze the probability distribution around the methylation centers in the 3 human tissues. Fig 4A and 4B show the frequency distribution of positive and negative samples of the 3 datasets, respectively. The motifs of positive and negative samples differ significantly, and the positive samples show the motif GGACA with the highest frequency at positions 19 to 23 of the logos. This is consistent with m6A modifications occurring primarily on the consensus motif DRACH (D = A, G, or U; R = A or G; H = A, C, or U) [51]. Figs 3 and 4 reveal clear positional differences in nucleotide composition between positive and negative samples, particularly around the central region. These findings support the use of PSNP, a position-aware encoding scheme that captures local 5-mer preferences across aligned sequences. By leveraging such positional patterns, BSBL-TSK-FS effectively models the sequence context relevant to m6A modifications.
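The DRACH consensus can be checked mechanically. The sketch below is illustrative only: it assumes a 41-nt window with the candidate adenosine at position 21 (so the motif spans positions 19 to 23, as in the logos), and the helper names are hypothetical.

```python
import re

# D = A/G/U, R = A/G, then A, C, and H = A/C/U
DRACH = re.compile(r"[AGU][AG]AC[ACU]")

def to_rna(seq):
    """Upper-case the sequence and write T as U (kpLogo-style input uses T)."""
    return seq.upper().replace("T", "U")

def center_matches_drach(seq, center_index=20):
    """True if the 5-mer centred on the 0-based index `center_index` fits DRACH."""
    rna = to_rna(seq)
    start = center_index - 2
    if start < 0 or start + 5 > len(rna):
        return False
    return DRACH.fullmatch(rna[start:start + 5]) is not None
```

For example, a window whose central 5-mer is GGACA matches, whereas one with CGACA does not, because C is not allowed at the D position.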

Fig 3. Motif logo analysis on Human datasets.

(A–C) Probability logos of positive samples. (D–F) Comparative probability logos of positive and negative samples. Note that kpLogo uses “T” to represent “U” in the RNA sequence.

https://doi.org/10.1371/journal.pcbi.1013621.g003

Fig 4. Motif logos in central sequential regions of Human datasets.

(A) Motif logos of the positive samples. (B) Motif logos of the negative samples. It’s worth noting that kpLogo uses the “T” to represent the “U” in the RNA sequence.

https://doi.org/10.1371/journal.pcbi.1013621.g004

Ablation analysis

To further demonstrate the advantages of BSBL-TSK-FS, we conduct ablation studies. Experiments are conducted on the benchmark datasets using TSK-FS and SBL-TSK-FS, and the proposed method is compared with them in several ways. The comparison results highlight the contributions of sparse Bayesian learning and block sparse Bayesian learning. As shown in Fig 5A, the ROC curves of the three methods show that our model has the highest AUC. For the three species, its AUC values are above 0.92, 0.93, and 0.97, respectively. TSK-FS performs worst, while SBL-TSK-FS performs slightly better; their AUC values are 0.86–0.97 and 0.89–0.98, respectively. Our method achieves the highest AUC across all datasets, with an average AUC of 0.9619, surpassing TSK-FS (average AUC 0.9136) and SBL-TSK-FS (average AUC 0.9427). To demonstrate the advantages of the proposed method in capturing sequence information, we perform a visual comparison with the two baseline methods mentioned above (Fig 5B). We use UMAP to visualize the output features of the three methods. The visualization shows that our approach separates the vast majority of negative and positive samples. In contrast, a small number of negative samples in SBL-TSK-FS are misclassified as positive, and TSK-FS has the highest incidence of classification errors, resulting in areas of overlap between the two classes.

Fig 5. (A) ROC curves of TSK-FS, SBL-TSK-FS and BSBL-TSK-FS models on m6A datasets.

(B) Visualization of feature spatial distribution by 3 methods.

https://doi.org/10.1371/journal.pcbi.1013621.g005

The results of the above three methods on the Mouse datasets are detailed in Table 2. Our model is superior to the other two models in the overall evaluation metrics. Specifically, compared to SBL-TSK-FS on the Mouse datasets, the accuracy and AUC of our model improve by an average of 1.33% and 0.97%, respectively. TSK-FS performs the worst, especially on MCC, with an average of just 0.7138. Compared with TSK-FS, BSBL-TSK-FS improves MCC by 3.5%, 10.33%, 0.53%, 1.25% and 22.18% on the 5 datasets, respectively. TSK-FS performs worst on Mouse Testis, with an ACC value of 0.7721. Our model shows slightly lower SP than TSK-FS on the Mouse Liver dataset; however, the SN of TSK-FS there is only 0.8828, much lower than that of our model.

Table 2. Results of multiple methods on the mouse datasets as analyzed by 5-CV.

https://doi.org/10.1371/journal.pcbi.1013621.t002

The results on the Human datasets are detailed in Table 3. The experimental results show that our model is superior to the other two models in all evaluation metrics. The performance of TSK-FS is still the lowest, although its MCC is above 0.63; SBL-TSK-FS performs slightly better, with MCC values between 0.6694 and 0.8349. Compared to SBL-TSK-FS on the Human datasets, the ACC, MCC and AUC of our model improve by an average of 1.59%, 3.08% and 2.38%, respectively. On these three datasets, the improvement from our model is more pronounced. Compared with TSK-FS, BSBL-TSK-FS improves MCC by 3.83%, 10.69% and 16.36% on the 3 datasets, respectively. Moreover, the average ACC and AUC of SBL-TSK-FS are 3.36% and 2.75% higher than those of TSK-FS, respectively.

Table 3. Results of multiple methods on the human datasets as analyzed by 5-CV.

https://doi.org/10.1371/journal.pcbi.1013621.t003

Table 4 shows the results of the ablation experiments on the Rat datasets. Our model generally performs very well, with an average ACC of 0.9357 and an average MCC of 0.8718. The rat is the species that our model predicts most accurately. TSK-FS continues to perform poorly, with its lowest ACC below 0.8 and its worst MCC only 0.5801. SBL-TSK-FS performs acceptably, with ACC values between 0.8908 and 0.9298. Compared to SBL-TSK-FS, the ACC, MCC and AUC of our model improve by an average of 2.26%, 6.36% and 8.01%, respectively. Compared with TSK-FS, the average ACC, MCC and AUC of SBL-TSK-FS improve by 7.83%, 13.15% and 5.34% on the 3 datasets, respectively. TSK-FS has the highest SP score on the Rat Kidney dataset, but the corresponding SN is the lowest. A similar situation is seen with SBL-TSK-FS on the Rat Liver dataset. Overall, our model is more balanced and stable.

Table 4. Results of multiple methods on the rat datasets as analyzed by 5-CV.

https://doi.org/10.1371/journal.pcbi.1013621.t004

Comparison with other advanced tools

To evaluate the effectiveness of our model, we compared it with several mainstream tools, namely im6A-TS-CNN, im6APred, iRNA-m6A, DL-m6A, TS-m6A-DL, and M6A-BiNP, under a 5-CV framework. The lollipop plot in Fig 6A compares the SN and SP of six methods on the three human datasets. Our method has the highest performance, with the largest gains in SP. Fig 6B and 6C show MCC comparisons across the 11 datasets. Fig 6B shows a large gap in MCC between the other methods and ours, and Fig 6C shows that the MCC of BSBL-TSK-FS is markedly higher on Rat Brain, Rat Liver, and Mouse Brain. Fig 7 uses a heat map to compare the ACC, MCC, SN, SP, and AUC scores of the different methods on the benchmark datasets. The brighter the circle, the higher the value. Apart from our approach, DL-m6A and TS-m6A-DL perform best. BSBL-TSK-FS has the best MCC and ACC. The results show that our method is effective and reliable for the m6A modification prediction task.

Fig 6. (A) Comparison results between the proposed method and 5 advanced methods on SN and SP indicators.

(B) Piano plots of MCC comparisons of six methods across 11 datasets. (C) Radar maps of six methods for MCC comparison on 11 data sets.

https://doi.org/10.1371/journal.pcbi.1013621.g006

Fig 7. Comparison of six methods on MCC, SN, SP, ACC and AUC indicators.

The larger and brighter the bubbles, the higher the value.

https://doi.org/10.1371/journal.pcbi.1013621.g007

More detailed comparison results are shown in Tables 5–7. BSBL-TSK-FS has excellent performance on the Mouse datasets (Table 5). Our proposed model shows the best performance compared to SOTA methods on the 5 datasets, with an average improvement of 0.99% ACC, 20.44% MCC, and 8.22% AUC over the other baseline methods. Among the baselines, DL-m6A performs slightly better than the rest, with an average ACC of 0.7954, an average MCC of 0.585 and an average AUC of 0.8731. BSBL-TSK-FS provides the best accuracy and AUC values on Mouse Heart (0.9542 and 0.9894). Among the methods evaluated, iRNA-m6A, a predictor based on traditional machine learning techniques, shows the lowest overall performance across all 11 datasets, with an ACC of 0.735 and an MCC of 0.47. This observation suggests that deep representations of RNA sequences are more expressive than shallow ones. TS-m6A-DL and im6A-TS-CNN initialize the RNA sequence mainly with one-hot encoding and thus ignore the underlying semantic information.

Table 5. Comparison with models on mouse datasets via 5-CV.

https://doi.org/10.1371/journal.pcbi.1013621.t005

Table 6. Comparison with models on human datasets via 5-CV.

https://doi.org/10.1371/journal.pcbi.1013621.t006

Table 7. Comparison with models on rat datasets via 5-CV.

https://doi.org/10.1371/journal.pcbi.1013621.t007

On the Human datasets, ACC and AUC clearly improve (Table 6). The ACC of BSBL-TSK-FS improves by approximately 5.35–9.24% over the next-best predictor (DL-m6A). However, on Human Brain, our method is slightly lower than DL-m6A in terms of MCC. BSBL-TSK-FS shows the best performance on AUC, with an average improvement of 6.95% over the other baseline methods. Among the baselines, DL-m6A performs slightly better than the rest, with an average ACC of 0.8074, an average MCC of 0.7188 and an average AUC of 0.8849. On all three datasets, the AUC values of all methods exceed 0.8. BSBL-TSK-FS provides the best accuracy value on Human Liver (0.9211).

On the Rat datasets, our model is relatively stable and performs well on all datasets (Table 7). BSBL-TSK-FS provides the best accuracy value on Rat Liver (0.9541). Our proposed model shows the best performance on ACC and MCC, with average improvements of 10.34% in ACC and 20.41% in MCC over the other baseline methods. The AUC values of our method are higher than those of DL-m6A by 0.0943, 0.0616 and 0.0703, respectively. The remaining four methods (iRNA-m6A, im6APred, im6A-TS-CNN and TS-m6A-DL) also achieve reasonable AUC values, all above 0.82.

m6A-TSFinder [44] proposed a weakly supervised deep learning framework to predict tissue-specific m6A methylation from low-resolution data and constructed tissue-level models for 23 human tissues. This approach significantly broadened the landscape of m6A prediction beyond base-resolution data. However, due to differences in data resolution, sequence structure, and prediction tasks, our model cannot be directly compared with m6A-TSFinder on the 23 human tissue datasets. To ensure fairness, we evaluated our model on the same benchmark dataset used in the m6A-TSFinder study, enabling direct comparison under an equivalent prediction setting. Detailed performance comparisons are provided in Table 8.

Table 8. Performance comparison between different approaches on independent datasets from human tissues.

https://doi.org/10.1371/journal.pcbi.1013621.t008

Performance comparison on the m5C datasets

Our model is capable of predicting not only m6A methylation but also m5C methylation. To assess the performance of the proposed method, we employed the same dataset used by m5C-pred [52]. Table 9 summarizes the comparison results. On the M. musculus dataset, BSBL-TSK-FS achieves the best ACC value (0.9088), an improvement of 14.86% over m5C-pred. On the A. thaliana dataset, the accuracy and MCC of our prediction are also significantly improved.

Cross-species and cross-tissues prediction analysis

We conduct cross-species and cross-tissue experiments to assess whether the predictor depends on species and tissue. The results are shown in Fig 8, where each circle in the heat map represents the accuracy obtained. The rows represent training sets and the columns represent test sets. The accuracy of cross-tissue prediction may be affected by factors such as differences in sample size, intra-dataset redundancy, and random noise. All prediction accuracies are higher than 0.79. Thus, a model trained on one species or tissue can still achieve good accuracy on another.
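The train-on-one, test-on-another protocol can be sketched as a simple grid loop. This is a schematic under stated assumptions: `fit` and `predict` are hypothetical callbacks standing in for any classifier, and skipping the same-dataset pairs is an assumption rather than something the text specifies.

```python
def cross_dataset_accuracy(datasets, fit, predict):
    """Train on each dataset and test on every other one.

    datasets: dict mapping a name to (X, y), with y holding 0/1 labels.
    Returns a dict {(train_name, test_name): accuracy}; rows of the heat map
    correspond to train_name and columns to test_name.
    """
    grid = {}
    for train_name, (x_train, y_train) in datasets.items():
        model = fit(x_train, y_train)
        for test_name, (x_test, y_test) in datasets.items():
            if test_name == train_name:
                continue  # assumption: within-dataset results come from 5-CV instead
            y_hat = predict(model, x_test)
            correct = sum(1 for t, p in zip(y_test, y_hat) if t == p)
            grid[(train_name, test_name)] = correct / len(y_test)
    return grid

# Toy stand-in classifier on a single scalar feature: threshold at the midpoint
# between the two class means.
def fit_threshold(x, y):
    pos = [v for v, t in zip(x, y) if t == 1]
    neg = [v for v, t in zip(x, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def predict_threshold(threshold, x):
    return [1 if v > threshold else 0 for v in x]
```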

Fig 8. Heat map showing the ACCs of cross-species and cross-tissue predictions.

https://doi.org/10.1371/journal.pcbi.1013621.g008

Discussion

In this paper, we designed the BSBL-TSK-FS model to detect RNA m6A sites in a variety of tissues. We combine a fuzzy system with block sparse Bayesian learning. Compared with the traditional fuzzy system, this method has better approximation performance. We evaluated the model on the benchmark datasets. The experimental results show that BSBL-TSK-FS is an effective model for sequence prediction. Compared with existing predictors, BSBL-TSK-FS has higher predictive performance. Our model can predict methylation sites in different tissues of different species. To account for tissue- and species-specific characteristics of m6A modifications, we trained separate models for each tissue type and organism. This design ensured that each model was optimally adapted to its respective dataset. Moreover, while our framework is not fully interpretable in the strictest sense, it provides greater interpretability compared to deep learning-based methods. This is mainly due to its rule-based fuzzy inference system and structured Bayesian framework, which together offer clearer insight into the relationship between input features and prediction outcomes. In future work, instead of converting sequences into numerical values through physicochemical properties before feeding them to the model, we plan to use biological sequences directly as model inputs.

Materials and methods

Benchmark datasets

Zhang et al. [24] developed a technique for detecting m6A sites in different tissues. Based on this study, Dao et al. [39] constructed high-quality benchmark datasets for computational methods. Each dataset contains 41-nt sequences of m6A and non-m6A sites. CD-HIT was used to filter the sequences so that pairwise sequence similarity is below 80%. Table 10 shows the summary of tissue-specific datasets. Table 11 presents information on independent datasets from three human tissues.

Table 11. Summary of m6A independent datasets from human tissues.

https://doi.org/10.1371/journal.pcbi.1013621.t011

This study also utilized high-quality m5C datasets for Mus musculus and Arabidopsis thaliana from the work of Abbas et al. [52], which were retrieved from the GEO database (accession numbers GSE93751 and GSE94065) and processed using CD-HIT to remove redundant sequences with more than 70% similarity. Detailed information is provided in Table 12.

Position-specific nucleotide propensity

In bioinformatics, the position-specific nucleotide propensity (PSNP) has become a popular method for predicting sites in biological sequences [27,32]. PSNP extracts information from sequences by computing the frequency of nucleotides at specific positions. Most mammalian m6A sites are found within the consensus motif DRACH [51,53]. Therefore, we use 5-mer nucleotides to calculate the frequency, which gives 4^5 = 1024 possible combinations. For a sequence of length 41 nt, there are 41 − 5 + 1 = 37 overlapping 5-mers, so we get a 37-dimensional vector.
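A minimal sketch of a position-specific 5-mer encoding is given below. It assumes, as in related propensity encodings, that the feature at each position is the frequency of the observed 5-mer at that position in the positive training set minus its frequency in the negative training set; the exact formula and the function names are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def position_kmer_frequencies(seqs, k=5):
    """Per-position k-mer frequencies over equal-length sequences.

    Returns a list with one dict per position (len(seq) - k + 1 positions),
    mapping each observed k-mer to its relative frequency at that position.
    """
    n_pos = len(seqs[0]) - k + 1
    counts = [Counter() for _ in range(n_pos)]
    for seq in seqs:
        for j in range(n_pos):
            counts[j][seq[j:j + k]] += 1
    total = len(seqs)
    return [{kmer: c / total for kmer, c in col.items()} for col in counts]

def psnp_encode(seq, pos_freq, neg_freq, k=5):
    """Encode one sequence as (positive - negative) frequency per position."""
    features = []
    for j in range(len(seq) - k + 1):
        kmer = seq[j:j + k]
        features.append(pos_freq[j].get(kmer, 0.0) - neg_freq[j].get(kmer, 0.0))
    return features
```

With 41-nt inputs and k = 5, `psnp_encode` returns 37 values, matching the dimensionality stated above. In a cross-validation setting, the frequency tables would be built from the training folds only, so that test labels do not leak into the features.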

BSBL-TSK-FS

Fuzzy system.

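Given N samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is the feature vector of the i-th sample, we denote the label vector by $\mathbf{y} = [y_1, \dots, y_N]^{T}$. Suppose the first-order TSK fuzzy system has K fuzzy rules; the k-th rule is then as follows [54,55].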

(1)

where $\wedge$ represents a fuzzy conjunction operator and $A_j^k$ refers to the fuzzy subset of the j-th feature under the k-th rule. Each rule takes the input vector $\mathbf{x}$ as its premise and maps the fuzzy set $A^k$ in the input space to a fuzzy set in the output space. The Gaussian membership function is applied to each sample in the if-parts.

(2)

Fuzzy C-means (FCM) clustering [56–58] can be used to determine the mean $c_j^k$ and variance $\delta_j^k$ of the membership function.

(3)(4)

where h is a manually adjustable coefficient, and $u_{ik}$ represents the fuzzy membership of the i-th sample within the k-th cluster.
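A minimal sketch of this step is given below. It assumes the forms commonly used for TSK fuzzy systems, in which the mean is the membership-weighted average of the samples and the variance is h times the membership-weighted squared deviation. The names `fcm_memberships` and `centers_and_widths`, the fuzzifier m = 2, and the plain NumPy FCM loop are assumptions for illustration rather than the implementation used in this work.

```python
import numpy as np


def fcm_memberships(X, K, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means; returns U (N x K) whose rows sum to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], K))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U


def centers_and_widths(X, U, h=1.0):
    """Membership-weighted mean c (K x d) and scaled variance delta (K x d)."""
    weights = U / U.sum(axis=0, keepdims=True)      # normalise per cluster
    c = weights.T @ X
    sq_dev = (X[:, None, :] - c[None, :, :]) ** 2   # N x K x d
    delta = h * np.einsum("nk,nkd->kd", weights, sq_dev)
    return c, delta


# Two well-separated toy clusters (synthetic data, for illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
U = fcm_memberships(X, K=2)
c, delta = centers_and_widths(X, U, h=1.0)
```

A larger h widens every Gaussian membership function, so neighbouring rules overlap more; this is why h is treated as a tunable coefficient.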

The fuzzy membership function and normalized fuzzy membership for fuzzy set Ak are defined as

(5)(6)

The output is given by

(7)
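The computation described so far can be sketched as follows, assuming the standard first-order TSK form: a Gaussian membership per feature, a product over features to obtain the firing strength of each rule, normalization across rules, and an output equal to the firing-weighted sum of linear consequents. The names `normalized_firing`, `tsk_output`, and the parameter matrix `P` are hypothetical.

```python
import numpy as np


def normalized_firing(X, c, delta):
    """Normalised rule firing strengths (N x K) from Gaussian memberships."""
    # Product of per-feature Gaussian memberships, accumulated in log space.
    sq = (X[:, None, :] - c[None, :, :]) ** 2 / (2.0 * delta[None, :, :])
    log_mu = -sq.sum(axis=2)
    log_mu -= log_mu.max(axis=1, keepdims=True)     # numerical stability
    mu = np.exp(log_mu)
    return mu / mu.sum(axis=1, keepdims=True)


def tsk_output(X, c, delta, P):
    """First-order TSK output: firing-weighted sum of K linear consequents.

    P has shape (K, d + 1); column 0 holds the bias term of each rule.
    """
    mu_tilde = normalized_firing(X, c, delta)
    rule_out = P[:, 0][None, :] + X @ P[:, 1:].T    # N x K, one output per rule
    return (mu_tilde * rule_out).sum(axis=1)


# Two rules with constant consequents +1 and -1 (toy values).
c = np.array([[0.0, 0.0], [3.0, 3.0]])
delta = np.ones((2, 2))
P = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
X = np.array([[0.0, 0.0], [3.0, 3.0], [1.5, 1.5]])
y_out = tsk_output(X, c, delta, P)
```

In this toy setting a sample at a rule centre is dominated by that rule, while the midpoint between the two centres fires both rules equally and the two consequents cancel.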

Once the parameters of the if-parts are determined, the normalized fuzzy memberships in Eq (6) are determined as well. Let

(8)(9)(10)

A set of parameters for the then-parts can be expressed as

(11)

where and .

Therefore, the output can be rewritten as

(12)
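The practical consequence of this rewriting is that, once the if-parts are fixed, the fuzzy system is linear in the then-part parameters. The sketch below checks this numerically under assumed notation: `Xe` is the input augmented with a leading 1, `Xg` stacks the K firing-weighted copies of `Xe`, and `p_g` stacks the K consequent parameter vectors. All names and the random toy values are hypothetical.

```python
import numpy as np


def build_xg(X, mu_tilde):
    """Map samples (N x d) into the rule-weighted space of size K * (d + 1)."""
    N = X.shape[0]
    Xe = np.hstack([np.ones((N, 1)), X])            # augmented input (1, x)
    # Block k equals mu_tilde[:, k] * Xe; blocks are concatenated rule by rule.
    return (mu_tilde[:, :, None] * Xe[:, None, :]).reshape(N, -1)


rng = np.random.default_rng(0)
N, d, K = 6, 3, 4
X = rng.normal(size=(N, d))
mu_tilde = rng.random((N, K))
mu_tilde /= mu_tilde.sum(axis=1, keepdims=True)     # rows sum to 1
P = rng.normal(size=(K, d + 1))                     # consequent parameters per rule

Xg = build_xg(X, mu_tilde)
p_g = P.reshape(-1)                                 # stacked parameter vector
y_linear = Xg @ p_g

# Direct rule-by-rule evaluation for comparison.
rule_out = P[:, 0][None, :] + X @ P[:, 1:].T
y_rules = (mu_tilde * rule_out).sum(axis=1)
```

Because each rule contributes one contiguous group of d + 1 columns in `Xg`, the parameter vector has a natural block structure, which is what makes a block-sparse prior a reasonable fit.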

Block sparse Bayesian learning.

Let . Then we can obtain

(13)

First, we assume that the data is divided into blocks and all the sources are mutually independent, and the density of each is Gaussian, given by

(14)

where is a nonnegative hyperparameter controlling the row sparsity of , representing the correlation of the ith block of data. For smaller , the correlation is noise. When , the associated becomes 0. is the covariance matrix of . is a positive-definite matrix that captures the structure of the correlation of that needs to be estimated.

Further, assuming that blocks are uncorrelated with each other, it can be modeled as

(15)

where

(16)

The observation vector y is assumed to follow the probability density

(17)

The posterior probability density and the likelihood function can be obtained by applying standard Gaussian identities.

(18)

where

(19)

To estimate the hyperparameters, a Type-II maximum likelihood approach can be used, which yields the cost function:

(20)
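For orientation, the standard Gaussian results behind this step can be written compactly. The block below is a sketch for a generic linear model $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w} + \mathbf{v}$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_0)$ and noise $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \lambda\mathbf{I})$. The symbols $\boldsymbol{\Phi}$, $\mathbf{w}$, $\lambda$, and $\boldsymbol{\Sigma}_0$ are notational assumptions and need not match those used in Eqs (13)-(20).

```latex
\begin{align}
p(\mathbf{w} \mid \mathbf{y}) &= \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w), \\
\boldsymbol{\Sigma}_w &= \left(\boldsymbol{\Sigma}_0^{-1}
  + \tfrac{1}{\lambda}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}, \qquad
\boldsymbol{\mu}_w = \tfrac{1}{\lambda}\boldsymbol{\Sigma}_w\boldsymbol{\Phi}^{T}\mathbf{y}, \\
\mathcal{L} &= -2\log p(\mathbf{y}) + \text{const}
  = \log\left|\lambda\mathbf{I} + \boldsymbol{\Phi}\boldsymbol{\Sigma}_0\boldsymbol{\Phi}^{T}\right|
  + \mathbf{y}^{T}\left(\lambda\mathbf{I}
  + \boldsymbol{\Phi}\boldsymbol{\Sigma}_0\boldsymbol{\Phi}^{T}\right)^{-1}\mathbf{y}.
\end{align}
```

In a block-sparse setting, $\boldsymbol{\Sigma}_0$ is block diagonal, with each block scaled by its own nonnegative hyperparameter. Minimizing $\mathcal{L}$ over these hyperparameters drives the scales of irrelevant blocks toward zero, which in turn forces the corresponding posterior weights to zero.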

Estimation of hyperparameters.

Using the matrix inversion lemma, Eq (20) can be rewritten as

(21)

Taking partial derivatives separately yields

(22)

Furthermore, the update formula can be obtained

(23)

Algorithm 1 describes the entire process of BSBL-TSK-FS, and Fig 1 shows the framework of the proposed approach.

Algorithm 1 Algorithm of BSBL-TSK-FS model.

Require: The training set , number of blocks M, number of fuzzy rules K, adjustable parameter h;

Ensure: The prediction labels ;

1: Determine the mean and variance using the FCM method;

2: Calculate the normalized fuzzy membership by Eq (6);

3: Construct the dataset using the fuzzy rules mapped to the new feature space, where are obtained by Eq (10);

4: Initialize γ and set ;

5: Initialize β and set ;

6: while do

7:   Calculate by Eq (19);

8:   Calculate β, and intra-block correlation coefficient r by Eq (23);

9:   Calculate by Eq (23);

10:   Calculate C by Eq (19);

11: end while

12: Estimate sparse solution , parameter estimation β.

13: Estimate Y by Eq (13).
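To make the iterative part of the procedure more concrete, the following sketch implements a deliberately simplified block sparse Bayesian learning loop based on expectation-maximization. It is not Algorithm 1 itself: it assumes equal-sized blocks, an identity intra-block correlation matrix (so no correlation coefficient r is estimated), a fixed noise variance `lam`, and a fixed number of iterations instead of a convergence test. The names `block_sbl_em`, `Phi`, `lam`, and `gamma` are hypothetical, and the data are synthetic.

```python
import numpy as np


def block_sbl_em(Phi, y, block_size, lam=1e-3, n_iter=100):
    """Simplified block sparse Bayesian learning via EM.

    Returns the posterior mean of the weights and one variance per block.
    """
    N, D = Phi.shape
    M = D // block_size
    gamma = np.ones(M)
    for _ in range(n_iter):
        prior_var = np.repeat(gamma, block_size)        # diagonal of the prior covariance
        PhiS = Phi * prior_var[None, :]                 # Phi times prior covariance
        C = lam * np.eye(N) + PhiS @ Phi.T              # marginal covariance of y
        mu = PhiS.T @ np.linalg.solve(C, y)             # posterior mean
        # Diagonal of the posterior covariance, without inverting the prior.
        post_var = prior_var - np.einsum("nd,nd->d", PhiS, np.linalg.solve(C, PhiS))
        post_var = np.maximum(post_var, 0.0)            # guard against round-off
        # EM update: average second moment of the weights inside each block.
        gamma = (mu ** 2 + post_var).reshape(M, block_size).mean(axis=1)
    return mu, gamma


# Synthetic block-sparse problem: 12 blocks of 5 weights, 2 active blocks.
rng = np.random.default_rng(0)
N, M, block_size = 80, 12, 5
Phi = rng.normal(size=(N, M * block_size))
w_true = np.zeros(M * block_size)
active = [2, 7]
for b in active:
    w_true[b * block_size:(b + 1) * block_size] = rng.normal(size=block_size)
y = Phi @ w_true + 0.01 * rng.normal(size=N)

w_hat, gamma = block_sbl_em(Phi, y, block_size)
```

Blocks whose `gamma` shrinks toward zero are effectively pruned, which is the mechanism that yields a sparse set of then-part parameters.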

Evaluation metrics

Model performance on the methylation datasets is evaluated using the Matthews correlation coefficient (MCC), specificity (SP), accuracy (ACC), and sensitivity (SN) [59–61]. In addition, we evaluate the model by calculating the area under the ROC curve (AUC) [36]. AUC values range from 0 to 1, and models with higher AUCs generally perform better.
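For reference, these metrics can be computed from the confusion-matrix counts as in the sketch below, which uses their standard definitions. The helper names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration; the code also assumes that both classes are present.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc_score(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

Unlike SN, SP, ACC, and MCC, the AUC is computed from the continuous scores, so it does not depend on the decision threshold.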

Acknowledgments

Computational resources were provided by the High Performance Computing Center of Central South University (to F.G.).

References

  1. Shi H, Wei J, He C. Where, when, and how: context-dependent functions of RNA methylation writers, readers, and erasers. Mol Cell. 2019;74(4):640–50. pmid:31100245
  2. Wang X, Zhao BS, Roundtree IA, Lu Z, Han D, Ma H, et al. N(6)-methyladenosine modulates messenger RNA translation efficiency. Cell. 2015;161(6):1388–99. pmid:26046440
  3. Batista PJ, Molinie B, Wang J, Qu K, Zhang J, Li L, et al. m(6)A RNA modification controls cell fate transition in mammalian embryonic stem cells. Cell Stem Cell. 2014;15(6):707–19. pmid:25456834
  4. Wang Y, Li Y, Toth JI, Petroski MD, Zhang Z, Zhao JC. N6-methyladenosine modification destabilizes developmental regulators in embryonic stem cells. Nat Cell Biol. 2014;16(2):191–8. pmid:24394384
  5. Jia G, Fu Y, He C. Reversible RNA adenosine methylation in biological regulation. Trends Genet. 2013;29(2):108–15. pmid:23218460
  6. Lindstein T, June CH, Ledbetter JA, Stella G, Thompson CB. Regulation of lymphokine messenger RNA stability by a surface-mediated T cell activation pathway. Science. 1989;244(4902):339–43. pmid:2540528
  7. Xiao W, Adhikari S, Dahal U, Chen Y-S, Hao Y-J, Sun B-F, et al. Nuclear m(6)A reader YTHDC1 regulates mRNA splicing. Mol Cell. 2016;61(4):507–19. pmid:26876937
  8. Jiang X, Liu B, Nie Z, Duan L, Xiong Q, Jin Z, et al. The role of m6A modification in the biological functions and diseases. Signal Transduct Target Ther. 2021;6(1):74. pmid:33611339
  9. Zhang S-Y, Zhang S-W, Fan X-N, Meng J, Chen Y, Gao S-J, et al. Global analysis of N6-methyladenosine functions and its disease association using deep learning and network-based methods. PLoS Comput Biol. 2019;15(1):e1006663. pmid:30601803
  10. Zhang S-Y, Zhang S-W, Liu L, Meng J, Huang Y. m6A-Driver: identifying context-specific mRNA m6A methylation-driven gene interaction networks. PLoS Comput Biol. 2016;12(12):e1005287. pmid:28027310
  11. Zou G, Zou Y, Ma C, Zhao J, Li L. Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: lysine methylome as an example. PLoS Comput Biol. 2021;17(12):e1009682. pmid:34879076
  12. Liu J, Li K, Cai J, Zhang M, Zhang X, Xiong X, et al. Landscape and regulation of m6A and m6Am methylome across human and mouse tissues. Mol Cell. 2020;77(2):426-440.e6. pmid:31676230
  13. Liu J, Yue Y, Han D, Wang X, Fu Y, Zhang L, et al. A METTL3-METTL14 complex mediates mammalian nuclear RNA N6-adenosine methylation. Nat Chem Biol. 2014;10(2):93–5. pmid:24316715
  14. Ke S, Pandya-Jones A, Saito Y, Fak JJ, Vågbø CB, Geula S, et al. m6A mRNA modifications are deposited in nascent pre-mRNA and are not required for splicing but do specify cytoplasmic turnover. Genes Dev. 2017;31(10):990–1006. pmid:28637692
  15. An S, Huang W, Huang X, Cun Y, Cheng W, Sun X, et al. Integrative network analysis identifies cell-specific trans regulators of m6A. Nucleic Acids Res. 2020;48(4):1715–29. pmid:31912146
  16. Jia G, Fu Y, Zhao X, Dai Q, Zheng G, Yang Y, et al. N6-methyladenosine in nuclear RNA is a major substrate of the obesity-associated FTO. Nat Chem Biol. 2011;7(12):885–7. pmid:22002720
  17. An Y, Duan H. The role of m6A RNA methylation in cancer metabolism. Mol Cancer. 2022;21(1):14. pmid:35022030
  18. Xiong X, Hou L, Park YP, Molinie B, GTEx Consortium, Gregory RI, et al. Genetic drivers of m6A methylation in human brain, lung, heart and muscle. Nat Genet. 2021;53(8):1156–65. pmid:34211177
  19. Meyer KD, Saletore Y, Zumbo P, Elemento O, Mason CE, Jaffrey SR. Comprehensive analysis of mRNA methylation reveals enrichment in 3’ UTRs and near stop codons. Cell. 2012;149(7):1635–46. pmid:22608085
  20. Dominissini D, Moshitch-Moshkovitz S, Schwartz S, Salmon-Divon M, Ungar L, Osenberg S, et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature. 2012;485(7397):201–6. pmid:22575960
  21. Linder B, Grozhik AV, Olarerin-George AO, Meydan C, Mason CE, Jaffrey SR. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat Methods. 2015;12(8):767–72. pmid:26121403
  22. Carlile TM, Rojas-Duran MF, Zinshteyn B, Shin H, Bartoli KM, Gilbert WV. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature. 2014;515(7525):143–6. pmid:25192136
  23. Marchand V, Ayadi L, Ernst FGM, Hertler J, Bourguignon-Igel V, Galvanin A, et al. AlkAniline-Seq: profiling of m7G and m3C RNA modifications at single nucleotide resolution. Angew Chem Int Ed Engl. 2018;57(51):16785–90. pmid:30370969
  24. Zhang Z, Chen L-Q, Zhao Y-L, Yang C-G, Roundtree IA, Zhang Z, et al. Single-base mapping of m6A by an antibody-independent method. Sci Adv. 2019;5(7):eaax0250. pmid:31281898
  25. Meyer KD. DART-seq: an antibody-free method for global m6A detection. Nat Methods. 2019;16(12):1275–80. pmid:31548708
  26. Ryvkin P, Leung YY, Silverman IM, Childress M, Valladares O, Dragomir I, et al. HAMR: high-throughput annotation of modified ribonucleotides. RNA. 2013;19(12):1684–92. pmid:24149843
  27. Lin H, Deng E-Z, Ding H, Chen W, Chou K-C. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014;42(21):12961–72. pmid:25361964
  28. Liu H, Begik O, Novoa EM. EpiNano: detection of m6A RNA modifications using Oxford Nanopore direct RNA sequencing. RNA Modifications: Methods and Protocols. 2021. p. 31–52.
  29. Leger A, Amaral PP, Pandolfini L, Capitanchik C, Capraro F, Miano V, et al. RNA modifications detection by comparative Nanopore direct RNA sequencing. Nat Commun. 2021;12(1):7198. pmid:34893601
  30. Stoiber M, Quick J, Egan R, Eun Lee J, Celniker S, Neely RK. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv. 2016. 094672.
  31. Acera Mateos P, J Sethi A, Ravindran A, Srivastava A, Woodward K, Mahmud S, et al. Prediction of m6A and m5C at single-molecule resolution reveals a transcriptome-wide co-occurrence of RNA modifications. Nat Commun. 2024;15(1):3899. pmid:38724548
  32. He W, Jia C, Zou Q. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics. 2019;35(4):593–601. pmid:30052767
  33. Wang H, Nie F, Huang H, Risacher SL, Saykin AJ, Shen L, et al. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics. 2012;28(12):i127-36. pmid:22689752
  34. Chen W, Feng P, Ding H, Lin H, Chou K-C. iRNA-Methyl: Identifying N(6)-methyladenosine sites using pseudo nucleotide composition. Anal Biochem. 2015;490:26–33. pmid:26314792
  35. Chen W, Tran H, Liang Z, Lin H, Zhang L. Identification and analysis of the N(6)-methyladenosine in the Saccharomyces cerevisiae transcriptome. Sci Rep. 2015;5:13859. pmid:26343792
  36. Zhou Y, Zeng P, Li Y-H, Zhang Z, Cui Q. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016;44(10):e91. pmid:26896799
  37. Chen K, Wei Z, Zhang Q, Wu X, Rong R, Lu Z, et al. WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. 2019;47(7):e41. pmid:30993345
  38. Fan R, Cui C, Kang B, Chang Z, Wang G, Cui Q. A combined deep learning framework for mammalian m6A site prediction. Cell Genom. 2024;4(12):100697. pmid:39571573
  39. Dao F-Y, Lv H, Yang Y-H, Zulfiqar H, Gao H, Lin H. Computational identification of N6-methyladenosine sites in multiple tissues of mammals. Comput Struct Biotechnol J. 2020;18:1084–91. pmid:32435427
  40. Liu K, Cao L, Du P, Chen W. im6A-TS-CNN: identifying the N6-methyladenine site in multiple tissues by using the convolutional neural network. Mol Ther Nucleic Acids. 2020;21:1044–9. pmid:32858457
  41. Abbas Z, Tayara H, Zou Q, Chong KT. TS-m6A-DL: tissue-specific identification of N6-methyladenosine sites using a universal deep learning model. Comput Struct Biotechnol J. 2021;19:4619–25. pmid:34471503
  42. Luo Z, Lou L, Qiu W, Xu Z, Xiao X. Predicting N6-methyladenosine sites in multiple tissues of mammals through ensemble deep learning. Int J Mol Sci. 2022;23(24):15490. pmid:36555143
  43. Rehman MU, Tayara H, Chong KT. DL-m6A: identification of N6-methyladenosine sites in mammals using deep learning based on different encoding schemes. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(2):904–11. pmid:35857733
  44. Song B, Huang D, Zhang Y, Wei Z, Su J, Pedro de Magalhães J, et al. m6A-TSHub: unveiling the context-specific m6A methylation and m6A-affecting mutations in 23 human tissues. Genomics Proteomics Bioinformatics. 2023;21(4):678–94. pmid:36096444
  45. Zhou J, Pedrycz W, Gao C, Lai Z, Wan J, Ming Z. Robust jointly sparse fuzzy clustering with neighborhood structure preservation. IEEE Trans Fuzzy Syst. 2022;30(4):1073–87.
  46. Zhang Z, Rao BD. Sparse signal recovery with temporally correlated source vectors using sparse Bayesian learning. IEEE J Sel Top Signal Process. 2011;5(5):912–26.
  47. Wipf DP, Rao BD. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Trans Signal Process. 2007;55(7):3704–16.
  48. Chou W. Maximum a posterior linear regression with elliptically symmetric matrix variate priors. In: Eurospeech; 1999. p. 1–4.
  49. Myung IJ. Tutorial on maximum likelihood estimation. J Math Psychol. 2003;47(1):90–100.
  50. Wu X, Bartel DP. kpLogo: positional k-mer analysis reveals hidden specificity in biological sequences. Nucleic Acids Res. 2017;45(W1):W534–8. pmid:28460012
  51. Hendra C, Pratanwanich PN, Wan YK, Goh WSS, Thiery A, Göke J. Detection of m6A from direct RNA sequencing using a multiple instance learning framework. Nat Methods. 2022;19(12):1590–8. pmid:36357692
  52. Abbas Z, Rehman MU, Tayara H, Zou Q, Chong KT. XGBoost framework with feature selection for the prediction of RNA N5-methylcytosine sites. Mol Ther. 2023;31(8):2543–51. pmid:37271991
  53. Xu Y, Ding J, Wu L-Y, Chou K-C. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One. 2013;8(2):e55844. pmid:23409062
  54. Zhang Y, Ishibuchi H, Wang S. Deep Takagi–Sugeno–Kang fuzzy classifier with shared linguistic fuzzy rules. IEEE Trans Fuzzy Syst. 2018;26(3):1535–49.
  55. Ding Y, Tiwari P, Zou Q, Guo F, Pandey HM. C-Loss based higher order fuzzy inference systems for identifying DNA N4-methylcytosine sites. IEEE Trans Fuzzy Syst. 2022;30(11):4754–65.
  56. Deng Z, Jiang Y, Choi K, Chung F, Wang S. Knowledge-leverage-based TSK fuzzy system modeling. IEEE Trans Neural Netw Learn Syst. 2013;24(8):1200–12. pmid:24808561
  57. Giang NL, Son LH, Ngan TT, Tuan TM, Phuong HT, Abdel-Basset M, et al. Novel incremental algorithms for attribute reduction from dynamic decision tables using hybrid filter–wrapper with fuzzy partition distance. IEEE Trans Fuzzy Syst. 2020;28(5):858–73.
  58. Bezdek JC, Ehrlich R, Full W. FCM: The fuzzy c-means clustering algorithm. Comput Geosci. 1984;10(2–3):191–203.
  59. Jiao Y, Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant Biol. 2016;4(4):320–30.
  60. Tahir M, Tayara H, Chong KT. iPseU-CNN: identifying RNA pseudouridine sites using convolutional neural networks. Mol Ther Nucleic Acids. 2019;16:463–70. pmid:31048185
  61. Zhang D, Xu Z-C, Su W, Yang Y-H, Lv H, Yang H, et al. iCarPS: a computational tool for identifying protein carbonylation sites by novel encoded features. Bioinformatics. 2021;37(2):171–7. pmid:32766811