Abstract
N6-methyladenosine (m6A) can significantly affect RNA expression, gene regulation, and determination of cell fate. As a common and abundant post-transcriptional modification (PTM) of RNA, m6A is also closely associated with the occurrence of numerous diseases. Thus, identifying m6A modification sites in RNA sequences is a prerequisite for related research. High-throughput sequencing technology is demanding to run and not cost-effective. Computational methods have made encouraging progress in site prediction. However, most models only consider differences between species and do not explore RNA modifications across different tissues within the same species. We develop and validate a fuzzy system based on Block Sparse Bayesian Learning (BSBL), named BSBL-TSK-FS, which is a powerful sequence-level m6A prediction model. We introduce a Bayesian method that provides a posterior probability output and yields sparser solutions, which improves the accuracy of the model. The model classifies m6A sites in several tissues of mouse, human, and rat. Under five-fold cross-validation (5-CV), the accuracy of the BSBL-TSK-FS model is 0.84∼0.95, an improvement of 9.4% over the existing state-of-the-art (SOTA) predictors. Finally, to verify the generalizability of the model, we carry out cross-species tests, and the results support the robustness and adaptability of the model. The result is an accurate and reliable sequence modification prediction model that helps to better understand the complex landscape of methylation modification.
Author summary
RNA molecules undergo a large number of PTMs that can affect their structure and interaction properties. As the most common type of PTM, N6-methyladenosine (m6A) plays a crucial role in life processes such as gene silencing, cell localization, and parental imprinting, and is involved in various diseases. Therefore, accurate identification of m6A modification sites from mRNA sequences is of great significance for basic research and drug development. Experimental methods are poorly suited to large-scale studies. In response to these limitations, computational models have been developed to quickly and economically identify m6A modification sites. In this study, we propose a fuzzy system prediction model, called BSBL-TSK-FS, to identify m6A. We verify the performance of the model on benchmark datasets. BSBL-TSK-FS performs well on 11 datasets, with an average AUC value of 0.9619 and an average accuracy value of 0.9028.
Citation: Wang L, Zhao M, Xie H, Qian Y, Lu W, Ding Y, et al. (2025) Block sparse Bayes-based fuzzy system for RNA N6-methyladenosine sites prediction. PLoS Comput Biol 21(10): e1013621. https://doi.org/10.1371/journal.pcbi.1013621
Editor: Saurabh Sinha, Georgia Institute of Technology, UNITED STATES OF AMERICA
Received: September 25, 2024; Accepted: October 16, 2025; Published: October 30, 2025
Copyright: © 2025 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code and data used to perform the analyses in this manuscript are available at https://github.com/LeyaoWang199/BSBL-TSK-FS_code.
Funding: This work was supported by the National Natural Science Foundation of China (Grants No. 62322215 and No. 62532017, awarded to F.G.; and No. 62172076 and U22A2038, awarded to Y.D.), the Zhejiang Provincial Natural Science Foundation of China (Grant No. LY23F020003, awarded to Y.D.), and the Municipal Government of Quzhou (Grant No. 2024D002, awarded to Y.D.). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Post-transcriptional modifications (PTMs) of RNA are common in all domains of life [1,2]. In addition to regulating RNA life stages, modification sites affect RNA localization, tertiary structure, function, and biogenesis [3,4]. As a result, the biological function of RNA is affected. PTMs are produced by covalent alteration or isomerization of nucleotides, usually involving the addition of chemical groups at different positions of the nitrogenous base or the ribose ring [5–7]. More than 150 PTMs have been identified, of which m6A is the most prevalent in RNA [8,9]. As an important epigenetic modification, m6A plays a crucial role in gene silencing, cell localization, parental origin imprinting, and other life processes [10–12]. It regulates RNA localization, transcription, splicing, and stability [13–15]. In addition, it has been linked to diseases such as stomach cancer, obesity, and breast tumors [16–18]. To carry out basic research and develop new drugs, it is therefore extremely important to precisely identify m6A modification sites from mRNA sequences.
Currently, most experimental methods for locating RNA post-transcriptional modifications fall into three categories: immunoprecipitation methods, chemical-based detection methods, and enzyme-specific methods. RNA immunoprecipitation-dependent methods include MeRIP-Seq [19], m6A-Seq [20], and miCLIP [21]. Pseudo-Seq [22] and AlkAniline-Seq [23] utilize compounds that selectively react with modified ribonucleotides to identify modification sites. Specific enzymes are used in methods such as m6A-REF-Seq [24] or DART-Seq [25]. Although these are the current gold standards, they still have certain limitations. For example, a specific protocol must be developed for each PTM, the methods are sensitive to antibody cross-reactivity and to the chemical reactions used, and complex protocols can introduce bias. They are also limited by the availability of compounds or specific antibodies [26,27]. In addition, several methods based on Oxford Nanopore technology have been developed, such as Epinano [28], Nanocompore [29], Tombo [30], and CHEUI [31]. These methods are time-consuming, laborious, and expensive, and the slow detection process further limits their applicability to large-scale studies [32,33]. In response to these limitations, computational models have been developed to quickly and economically identify m6A modification sites, making them ideal for large-scale data analysis.
Computational methods have become an attractive option for researchers. iRNA-Methyl is a pioneering predictor specifically designed to identify m6A sites in RNA, using support vector machines combined with hand-extracted features to build a model [34]. m6Apred is a predictor specifically designed to identify m6A sites in the Saccharomyces cerevisiae transcriptome [35]. The predictor extracts features based on physicochemical binary coding and cumulative nucleotide frequency. SRAMP can be used to predict RNA m6A sites in mammals [36]. It extracts features based on nucleotide binary encoding, secondary structure binary encoding, KSNPF, and KNN score, and assembles three random forests to predict m6A sites. WHISTLE [37] effectively captures key signals associated with m6A modifications by integrating sequence and evolutionary features, and is trained using an SVM. deepSRAMP [38] introduces an isoform-level m6A site prediction framework that leverages BiGRU networks combined with a multi-head attention mechanism to capture complex sequence dependencies. By encoding RNA sequences into fixed-length embeddings, the model effectively extracts deep contextual features and achieves promising predictive performance across multiple tissues and species. These predictors identify m6A modification sites based on specific tissues in a single species. Dao et al. [39] designed iRNA-m6A based on a support vector machine (SVM). Using a one-hot encoding scheme, Liu et al. [40] developed the im6A-TS-CNN tool, which makes predictions with convolutional neural networks (CNNs). TS-m6A-DL is described by Abbas et al. [41] as a method based on deep neural networks (DNNs). A combination of four classification algorithms and three deep learning models is used in the im6APred model presented by Luo et al. [42]. A tool called DL-m6A, which uses three different feature encoding schemes, was proposed by Rehman et al. [43].
m6A-TSHub [44] is a comprehensive platform for tissue-specific m6A research, integrating four key modules: m6A-TSDB, m6A-TSFinder, m6A-TSVar, and m6A-CAVar, which support database construction, predictive modeling, and variant impact analysis. It enables systematic exploration of tissue-specific m6A methylation from both low-resolution data and genetic variants across 23 human tissues. Most of these models use one-hot coding, k-mer coding or physicochemical property coding to represent RNA sequences. Nevertheless, these methods usually consider only shallow RNA sequence encodings and ignore potential correlations between nucleotides. These models all use tissue-specific datasets, but their accuracy and generality need to be improved.
In this study, we propose a block sparse Bayesian learning (BSBL)-based Takagi-Sugeno-Kang fuzzy system (TSK-FS), called BSBL-TSK-FS, to identify m6A. The proposed method is more effective than the standard TSK-FS. To extract sequence information as completely as possible, we use the position-specific nucleotide propensity (PSNP) to extract RNA sequence features. Extensive benchmarking experiments were conducted on well-curated datasets, and BSBL-TSK-FS achieves performance superior to that of current state-of-the-art methods. Finally, cross-species tests were carried out, and the results support the robustness and adaptability of the model.
The contributions of this study are summarized as follows.
(1) We improve the TSK fuzzy system on the basis of block sparse Bayesian learning by introducing a Bayesian approach that provides the output of posterior probabilities to produce more sparse solutions. Our model has higher accuracy.
(2) Our model does not require the penalty factor to be set manually. The penalty factor in a general TSK-FS is a constant that balances the regularization and error terms. The experimental results are very sensitive to its value, and improper settings can cause problems such as overfitting. In BSBL-TSK-FS, this parameter is assigned automatically.
(3) Compared to traditional task-specific computational tools, our model does not require different coding representations of RNA sequences and can directly predict different types of methylation.
(4) Our model can identify methylation modification sites in various tissues of different species.
The next section presents the model framework, experimental results, sequence analysis, and cross-species validation results, and compares them in detail with other methods. The third part describes the experimental materials and methods, including the datasets, the TSK-FS and BSBL algorithms, the feature extraction methods, and the performance evaluation criteria. Finally, the paper is summarized.
Results
The BSBL-TSK-FS framework
Our framework uses a block-sparse Bayes-based fuzzy system to predict widely occurring m6A modifications in different tissues of mouse, human, and rat. It consists of four key modules: the input and encoding module, the fuzzification module, the block sparse Bayes module, and the prediction module, as shown in Fig 1. The input features are mapped to a high-dimensional space by the fuzzy rules and fuzzy membership functions in the fuzzifier [45]. Then we introduce a Bayesian method that provides a posterior probability output and yields sparser solutions, which improves the accuracy of the model [46,47]. The idea is to find the posterior probability via Bayes' rule. Given the hyperparameters, the solution is given by the maximum a posteriori estimate [48]. The hyperparameters are estimated from the data by maximum likelihood [49]. More details can be found in the Materials and methods section. Our model does not need the penalty factor to be set manually; it is assigned automatically in BSBL-TSK-FS. This avoids problems such as overfitting caused by improper parameter settings. We use 11 widely recognized datasets created by Dao et al. [39], and perform 5-CV on the datasets. These datasets come from different tissues, such as brain, heart, kidney, and liver, of mouse, human, and rat. While we applied BSBL-TSK-FS to the task of m6A modification detection, the framework can also be directly applied to other tasks, such as detecting other types of modifications.
(A) Input and encoding module. The RNA sequence with a length of 41 nt is encoded into a matrix via PSNP with 5-mers. (B) Fuzzification module. The model uses the fuzzy system to process the data, obtains the fuzzy features, and passes them to the next module. (C) Block sparse Bayesian module. The sparse solution of the model is obtained by the block sparse Bayes algorithm, and the parameter p is solved. (D) Prediction module. m6A and non-m6A sites are identified from the prediction results.
BSBL-TSK-FS performance
The main goal of our study was to establish a convenient and reliable predictor that achieves SOTA accuracy in identifying widely occurring m6A modifications from RNA sequences. Table 1 lists the results of our proposed model on the tissue-specific datasets via five-fold cross-validation (5-CV). Our model, BSBL-TSK-FS, performs well on 11 datasets, with an average AUC value of 0.9619 and an average accuracy value of 0.9028. All datasets have ACC scores above 84%. On the Rat Liver dataset, the model performs best, with ACC, MCC and AUC reaching 95.41%, 90.84% and 0.9900, respectively. It also performs well on Mouse Heart, with an ACC of 0.9542. Meanwhile, the scores on Human Brain are slightly lower, with ACC, MCC and AUC reaching 84.75%, 69.63% and 0.9231, respectively. AUCs are all above 0.92, and MCCs are all at least 0.6963. The Mouse Brain dataset is the largest and yields comparatively low scores, indicating that our model has a slight weakness in handling large datasets. The highest AUC is only about 0.07 higher than the lowest AUC, which shows that our model is very stable on the AUC criterion. The BSBL-TSK-FS model demonstrated better performance on small-scale datasets than on larger ones. As a traditional machine learning method, it tends to be more effective when the sample size is limited. On larger datasets such as Mouse Brain, Human Brain, and Human Kidney, the increased diversity and complexity of the sequences may pose challenges for the fuzzy inference system, which is constrained by the number of fuzzy rules and thus may struggle to capture complex patterns. In addition, the performance of our approach relies heavily on effective feature extraction. The PSNP module, in particular, has shown higher discriminative power on small datasets, while the greater sequence diversity in large datasets may reduce the effectiveness of the extracted features.
These factors may jointly contribute to the slight performance decline observed on large-scale datasets.
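The ACC, SN, SP, and MCC values discussed above can be reproduced from a confusion matrix. The following is a minimal sketch, assuming the standard confusion-matrix definitions of these metrics and 0/1 labels; the function name `binary_metrics` is hypothetical and not taken from the authors' code.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return ACC, SN, SP, and MCC for 0/1 labels (1 = m6A, 0 = non-m6A).

    Assumes the standard confusion-matrix definitions of each metric.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}
```

Because MCC uses all four cells of the confusion matrix, it can drop sharply when SN and SP are unbalanced even if ACC stays high, which is why it is reported alongside ACC.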
We present the intersections of modifications across tissues in Fig 2A–2C, showing both correlations and significant differences across the tissue data. Some m6A modification sites may occur only in specific tissues, while others may show similar tendencies in multiple tissues. Specifically, we list the exons associated with each modification and treat modifications that share the same exon as intersecting. We find that there are overlaps between the tissues, but the overlaps are less than 5%. Fig 2D is a Sankey chart that visually shows the flow of samples from the 11 datasets between true and predicted labels. The thin curved lines represent the misclassified samples. Most of the samples are classified correctly, so they appear as thick straight lines. It is clear that our method successfully classifies the majority of the samples. Our proposed model performs well overall, providing preliminary evidence of its effectiveness in predicting m6A sites.
(D) Sankey diagram of prediction results for 11 datasets. Straight lines represent correctly classified samples, curved lines represent incorrectly classified samples, and the stronger the lines, the larger the number of samples.
Consensus region analysis
To further understand the mechanism and cause of the modification, we use kpLogo [50] to study the distribution of nucleotides around the m6A site. Fig 3A–3C visualizes the methylation sequence patterns and shows that the methylated sequence regions in the tissues are very similar. Fig 3D–3F shows the statistical difference in nucleotide occurrence between m6A and non-m6A samples. The top half represents sequences that contain m6A sites, and the bottom half represents sequences that contain non-m6A sites. There are significant differences in the distribution of nucleotides between positive and negative samples (t-test, p value <0.01). The flanking sequences of m6A in all tissues are biased towards GC-rich regions, while the flanking sequences of non-m6A sites are biased towards AU-rich regions. This also shows that constructing an m6A classification model by extracting sequence information is reasonable. We present an analysis of the probability distribution of methylation centers in 3 different human tissues. Fig 4A and 4B show the frequency distribution of positive and negative samples of the 3 datasets, respectively. On the one hand, there are significant differences in the motifs of positive and negative samples. On the other hand, the positive samples show the motif GGACA with the highest frequency in the logos at positions 19 to 23. This is consistent with m6A modifications occurring primarily on the consensus motif DRACH (D = A, G, or U; R = A or G; H = A, C, or U) [51]. Figs 3 and 4 reveal clear positional differences in nucleotide composition between positive and negative samples, particularly around the central region. These findings support the use of PSNP, a position-aware encoding scheme that captures local 5-mer preferences across aligned sequences. By leveraging such positional patterns, BSBL-TSK-FS effectively models the sequence context relevant to m6A modifications.
(A–C) Probability logos of the positive samples. (D–F) Probability logos comparing positive and negative samples. Note that kpLogo uses “T” to represent “U” in the RNA sequence.
(A) Motif logos of the positive samples. (B) Motif logos of the negative samples. Note that kpLogo uses “T” to represent “U” in the RNA sequence.
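The DRACH consensus described above can be checked on a single 41-nt sample with a short regular expression. This is only an illustrative sketch: it assumes the candidate adenosine sits at the center of the sequence (position 21, so the motif spans positions 19 to 23), and the helper name `has_central_drach` is hypothetical.

```python
import re

# DRACH: D = A/G/U, R = A/G, then A, C, and H = A/C/U.
DRACH = re.compile(r"[AGU][AG]AC[ACU]")


def has_central_drach(seq, center=20):
    """Return True if the 5-mer centred on `center` (0-based) matches DRACH.

    For a 41-nt sequence with the candidate site at 1-based position 21,
    the window covers 1-based positions 19 to 23.
    """
    seq = seq.upper().replace("T", "U")      # accept DNA-style input (T for U)
    window = seq[center - 2:center + 3]
    return len(window) == 5 and DRACH.fullmatch(window) is not None


# Example: a synthetic 41-nt sequence carrying GGACA at positions 19 to 23.
example = "C" * 18 + "GGACA" + "C" * 18
print(has_central_drach(example))
```

A check like this is useful for sanity-testing a dataset, but it is not a predictor: many DRACH occurrences are unmethylated, which is why position-aware features are still needed.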
Ablation analysis
To further demonstrate the superiority of BSBL-TSK-FS, we conduct ablation studies. Experiments are conducted on the benchmark datasets using TSK-FS and SBL-TSK-FS, and the proposed method is compared with them in several ways. The comparison results highlight the contribution of sparse Bayesian learning and block sparse Bayesian learning. As shown in Fig 5A, the ROC curves of the three methods show that our model has the highest AUC. The three species have AUC values above 0.92, 0.93, and 0.97, respectively. In contrast, the TSK-FS method is the worst, while SBL-TSK-FS performs slightly better. The AUC values of these two methods are 0.86-0.97 and 0.89-0.98, respectively. Our method achieves the highest AUC across all datasets, with an average AUC of 0.9619, surpassing that of TSK-FS (average AUC 0.9136) and SBL-TSK-FS (average AUC 0.9427). To demonstrate the advantages of the proposed method in capturing sequence information, we perform a visual comparison with the two baseline methods mentioned above (Fig 5B). We use UMAP to visualize the output features of the three methods. The visualization shows that our approach successfully separates the vast majority of negative and positive samples. In contrast, a small number of negative samples in SBL-TSK-FS are misclassified as positive samples. TSK-FS, the worst of the three, has a higher incidence of classification errors, resulting in areas of overlap between the two classes.
(B) Visualization of feature spatial distribution by 3 methods.
The results of the above 3 methods on the Mouse datasets are detailed in Table 2. Our model is superior to the other two models in the overall evaluation metrics. Specifically, compared with the SBL-TSK-FS method on the Mouse datasets, the accuracy and AUC of our model improve by an average of 1.33% and 0.97%, respectively. TSK-FS performs the worst, especially on MCC, with an average of just 0.7138. Compared with TSK-FS, BSBL-TSK-FS improves MCC by 3.5%, 10.33%, 0.53%, 1.25% and 22.18% on these 5 datasets, respectively. TSK-FS performs worst on Mouse Testis, with an ACC value of 0.7721. Our model shows slightly lower SP than TSK-FS on the Mouse Liver dataset. However, the SN of TSK-FS is only 0.8828, much lower than that of our model.
The results on the Human datasets are detailed in Table 3. The experimental results show that our model is superior to the other two models in all evaluation metrics. The performance of TSK-FS is still the lowest, although its MCC is above 0.63, and the performance of SBL-TSK-FS is slightly better, with MCC values between 0.6694 and 0.8349. Compared with the SBL-TSK-FS method on the Human datasets, the ACC, MCC and AUC of our model improve by an average of 1.59%, 3.08% and 2.38%, respectively. On these three datasets, our model improves the evaluation metrics more markedly. Compared with TSK-FS, BSBL-TSK-FS improves MCC by 3.83%, 10.69% and 16.36% on these 3 datasets, respectively. Moreover, the average ACC and AUC of SBL-TSK-FS are 3.36% and 2.75% higher than those of TSK-FS, respectively.
Table 4 shows the results of the ablation experiments on the Rat datasets. Our model generally performs very well, with an average ACC of 0.9357 and an average MCC of 0.8718. The rat is the species that our model predicts most accurately. TSK-FS continues to perform poorly, with its lowest ACC not exceeding 0.8 and its worst MCC only 0.5801. SBL-TSK-FS performs acceptably, with ACC values between 0.8908 and 0.9298. Compared with the SBL-TSK-FS method, the ACC, MCC and AUC of our model improve by an average of 2.26%, 6.36% and 8.01%, respectively. Compared with TSK-FS, the average ACC, MCC and AUC of SBL-TSK-FS improve by 7.83%, 13.15% and 5.34% on these 3 datasets, respectively. TSK-FS has the highest SP score on the Rat Kidney dataset, but the corresponding SN is the lowest. A similar situation is seen with SBL-TSK-FS on the Rat Liver dataset. Overall, our model is more balanced and stable.
Comparison with other advanced tools
To evaluate the effectiveness of our model, we conducted comparisons with several mainstream tools, namely im6A-TS-CNN, im6APred, iRNA-m6A, DL-m6A, TS-m6A-DL, and M6A-BiNP, under a 5-CV framework. The lollipop plot in Fig 6A compares the SN and SP of six methods on the three human datasets. Our method has the highest performance and shows a large improvement, especially in SP. Fig 6B and 6C show MCC comparisons across the 11 datasets. Fig 6B shows that there is a large gap between the other methods and ours in MCC score. The MCC of BSBL-TSK-FS is markedly higher on Rat Brain, Rat Liver, and Mouse Brain (Fig 6C). Fig 7 uses a heat map to compare the ACC, MCC, SN, SP, and AUC scores of the different methods on the benchmark datasets. The brighter the circle, the higher the value. Apart from our approach, DL-m6A and TS-m6A-DL perform better than the remaining tools. BSBL-TSK-FS has the best prediction performance on MCC and ACC. The results show that our method is effective and reliable in the m6A modification prediction task.
(B) Piano plots of MCC comparisons of six methods across 11 datasets. (C) Radar maps of six methods for MCC comparison on 11 data sets.
The larger and brighter the bubbles, the higher the value.
More detailed comparison results are shown in Tables 5–7. BSBL-TSK-FS has excellent performance on the Mouse datasets (Table 5). Our proposed model shows the best performance compared with the SOTA methods on the 5 datasets, with an average improvement of 0.99% ACC, 20.44% MCC, and 8.22% AUC over the other baseline methods. Among the baselines, DL-m6A performs slightly better than the others, with an average ACC of 0.7954, an average MCC of 0.585 and an average AUC of 0.8731. BSBL-TSK-FS provides the best accuracy and AUC values on Mouse Heart (0.9542 and 0.9894). Among the methods evaluated, iRNA-m6A, a predictor based on traditional machine learning techniques, shows the lowest overall performance across all 11 datasets, with an ACC of 0.735 and an MCC of 0.47. This observation suggests that deep representations of RNA sequences are more discriminative than shallow representations. As for the other methods, TS-m6A-DL and im6A-TS-CNN initialize the RNA sequence mainly based on one-hot encoding, thus ignoring the underlying semantic information.
On the Human datasets, ACC and AUC show clear improvement (Table 6). The ACC of BSBL-TSK-FS improves by approximately 5.35-9.24% over the next highest predictor (DL-m6A). However, on Human Brain, our method is slightly lower than DL-m6A in terms of MCC. Our proposed model BSBL-TSK-FS shows the best performance on AUC, with an average improvement of 6.95% over the other baseline methods. Among the baselines, DL-m6A performs slightly better than the others, with an average ACC of 0.8074, an average MCC of 0.7188 and an average AUC of 0.8849. On all three datasets, the AUC values of all methods exceed 0.8. BSBL-TSK-FS provides the best accuracy value on Human Liver (0.9211).
On the Rat datasets, our model is relatively stable and performs well on all datasets (Table 7). BSBL-TSK-FS provides the best accuracy value on Rat Liver (0.9541). Our proposed model BSBL-TSK-FS shows the best performance on ACC and MCC, with an average improvement of 10.34% ACC and 20.41% MCC over the other baseline methods. The AUC values of our method are higher than those of the DL-m6A method by 0.0943, 0.0616 and 0.0703, respectively. The remaining four methods (iRNA-m6A, im6APred, im6A-TS-CNN and TS-m6A-DL) also achieve reasonable performance on AUC, all above 0.82.
m6A-TSFinder [44] proposed a weakly supervised deep learning framework to predict tissue-specific m6A methylation from low-resolution data and constructed tissue-level models for 23 human tissues. This approach significantly broadened the landscape of m6A prediction beyond base-resolution data. However, due to differences in data resolution, sequence structure, and prediction tasks, our model cannot be directly compared with m6A-TSFinder on the 23 human tissue datasets. To ensure fairness, we evaluated our model on the same benchmark dataset used in the m6A-TSFinder study, enabling direct comparison under an equivalent prediction setting. Detailed performance comparisons are provided in Table 8.
Performance comparison on the m5C datasets
Our model is capable of predicting not only m6A methylation but also m5C methylation. To assess the performance of the proposed method, we employed the same dataset used by m5C-pred [52]. Table 9 summarizes the comparison results. On the M. musculus dataset, BSBL-TSK-FS achieves the best ACC value (0.9088), an improvement of 14.86% over m5C-pred. On the A. thaliana dataset, the accuracy and MCC of our prediction are also significantly improved.
Cross-species and cross-tissues prediction analysis
We conduct cross-species and cross-tissue experiments to demonstrate that the predictor is not dependent on species and tissue. The result is shown in Fig 8, where each circle in the heat map represents the accuracy obtained. The rows represent training sets and the columns represent test sets. The accuracy of cross-tissue prediction may be affected by factors such as differences in sample size, intra-dataset redundancy, and random noise. All prediction accuracies are higher than 0.79. This shows that training on a different species or tissue can also achieve good accuracy.
Discussion
In this paper, we designed the BSBL-TSK-FS model to detect RNA m6A sites in a variety of tissues. We apply block sparse Bayesian learning to a fuzzy system. Compared with the traditional fuzzy system, this method has better approximation performance. We tested the model on the benchmark datasets. The experimental results show that BSBL-TSK-FS is an effective model for sequence prediction. Compared with existing predictors, BSBL-TSK-FS has higher predictive performance. Our model can predict methylation sites in different tissues of different species. To account for tissue- and species-specific characteristics of m6A modifications, we trained separate models for each tissue type and organism. This design ensured that each model was optimally adapted to its respective dataset. Moreover, while our framework is not fully interpretable in the strictest sense, it provides greater interpretability than deep learning-based methods. This is mainly due to its rule-based fuzzy inference system and structured Bayesian framework, which together offer clearer insight into the relationship between input features and prediction outcomes. In future work, instead of converting sequences into numerical values through physicochemical properties before feeding them to the model, we plan to use biological sequences directly as model inputs.
Materials and methods
Benchmark datasets
Zhang et al. [24] developed a technique for detecting m6A sites in different tissues. Based on this study, Dao et al. [39] constructed high-quality benchmark datasets for computational methods. Each dataset contains 41-nt-long sequences of m6A and non-m6A sites. CD-HIT was used to reduce redundancy so that the sequence similarity is below 80%. Table 10 shows the summary of the tissue-specific datasets. Table 11 presents information on independent datasets from three human tissues.
This study also utilized high-quality m5C datasets for Mus musculus and Arabidopsis thaliana from the work of Abbas et al. [52], which were retrieved from the GEO database (accession numbers GSE93751 and GSE94065) and processed using CD-HIT to remove redundant sequences with more than 70% similarity. Detailed information is provided in Table 12.
Position-specific nucleotide propensity
In bioinformatics, position-specific nucleotide propensity (PSNP) has become a popular method for predicting sites in biological sequences [27,32]. PSNP extracts information from sequences by computing the frequency of nucleotides at specific positions. Most mammalian m6A sites are found within the consensus motif DRACH [51,53]. Therefore, we use 5-mer nucleotides to calculate the frequencies, which gives $4^5 = 1024$ possible combinations. For a sequence of length 41 nt, sliding a 5-mer window yields 41 - 5 + 1 = 37 positions, so we get a 37-dimensional vector.
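To make the 37-dimensional encoding concrete, the following is a minimal sketch of a position-specific 5-mer propensity encoder. It rests on an assumption not stated in the text: each feature is the frequency of the observed 5-mer at that position in the positive training set minus its frequency in the negative training set, which is a common formulation of position-specific propensity features. The function names are hypothetical.

```python
from collections import Counter

K = 5  # k-mer length used in the text


def position_kmer_freqs(seqs, k=K):
    """For each window position, the relative frequency of every k-mer seen there."""
    n_pos = len(seqs[0]) - k + 1          # 41 - 5 + 1 = 37 for 41-nt sequences
    counts = [Counter() for _ in range(n_pos)]
    for s in seqs:
        for j in range(n_pos):
            counts[j][s[j:j + k]] += 1
    n = len(seqs)
    return [{kmer: c / n for kmer, c in pos.items()} for pos in counts]


def psnp_encode(seq, pos_freqs, neg_freqs, k=K):
    """Encode one sequence as (positive frequency - negative frequency) per position.

    ASSUMPTION: the difference of class-wise frequencies is used as the feature;
    the paper only states that position-specific k-mer frequencies are computed.
    """
    return [
        pos_freqs[j].get(seq[j:j + k], 0.0) - neg_freqs[j].get(seq[j:j + k], 0.0)
        for j in range(len(seq) - k + 1)
    ]
```

In a cross-validation setting, the frequency tables would be computed from the training folds only, so that no information from the held-out fold leaks into the features.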
BSBL-TSK-FS
Fuzzy system.
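Given $N$ samples $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$, where $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^{\mathrm{T}} \in \mathbb{R}^{d}$. We denote the label vector as $\mathbf{y} = [y_1, y_2, \ldots, y_N]^{\mathrm{T}}$. Suppose the first-order TSK fuzzy system has $K$ fuzzy rules; the $k$-th rule is then as follows [54,55]:

$$\text{IF } x_1 \text{ is } A_1^k \wedge x_2 \text{ is } A_2^k \wedge \cdots \wedge x_d \text{ is } A_d^k, \text{ THEN } f^k(\mathbf{x}) = p_0^k + p_1^k x_1 + \cdots + p_d^k x_d, \quad k = 1, \ldots, K,$$

where $\wedge$ represents a fuzzy conjunction operator and $A_j^k$ refers to the fuzzy subset of the $j$-th feature under the $k$-th rule. For an input vector $\mathbf{x}$, each rule maps the fuzzy set $A^k$ in the input space to the fuzzy set $f^k(\mathbf{x})$ in the output space. The Gaussian membership function is applied to the sample in the if-parts:

$$\mu_{A_j^k}(x_j) = \exp\left(-\frac{(x_j - c_j^k)^2}{2\delta_j^k}\right).$$

Fuzzy C-means (FCM) clustering [56–58] can be used to determine the membership function mean $c_j^k$ and variance $\delta_j^k$:

$$c_j^k = \frac{\sum_{i=1}^{N} u_{ik}\, x_{ij}}{\sum_{i=1}^{N} u_{ik}}, \qquad \delta_j^k = h\, \frac{\sum_{i=1}^{N} u_{ik}\, (x_{ij} - c_j^k)^2}{\sum_{i=1}^{N} u_{ik}},$$

where $h$ is a manually adjustable coefficient, and $u_{ik}$ represents the fuzzy membership of the $i$-th sample within the $k$-th cluster.

The fuzzy membership function $\mu^k(\mathbf{x})$ and normalized fuzzy membership $\tilde{\mu}^k(\mathbf{x})$ for fuzzy set $A^k$ are defined as

$$\mu^k(\mathbf{x}) = \prod_{j=1}^{d} \mu_{A_j^k}(x_j),$$

$$\tilde{\mu}^k(\mathbf{x}) = \frac{\mu^k(\mathbf{x})}{\sum_{k'=1}^{K} \mu^{k'}(\mathbf{x})}. \tag{6}$$

The output is given by

$$y^o = \sum_{k=1}^{K} \tilde{\mu}^k(\mathbf{x})\, f^k(\mathbf{x}).$$

Once the parameters of the if-parts are determined, $\tilde{\mu}^k(\mathbf{x})$ is determined. Let

$$\mathbf{x}_e = (1, \mathbf{x}^{\mathrm{T}})^{\mathrm{T}}, \qquad \tilde{\mathbf{x}}^k = \tilde{\mu}^k(\mathbf{x})\, \mathbf{x}_e, \qquad \mathbf{x}_g = \left((\tilde{\mathbf{x}}^1)^{\mathrm{T}}, (\tilde{\mathbf{x}}^2)^{\mathrm{T}}, \ldots, (\tilde{\mathbf{x}}^K)^{\mathrm{T}}\right)^{\mathrm{T}}. \tag{10}$$

A set of parameters for the then-parts can be expressed as

$$\mathbf{p}_g = \left((\mathbf{p}^1)^{\mathrm{T}}, (\mathbf{p}^2)^{\mathrm{T}}, \ldots, (\mathbf{p}^K)^{\mathrm{T}}\right)^{\mathrm{T}},$$

where $\mathbf{p}^k = (p_0^k, p_1^k, \ldots, p_d^k)^{\mathrm{T}}$ and $\mathbf{p}_g \in \mathbb{R}^{K(d+1)}$.

Therefore, the output can be rewritten as

$$y^o = \mathbf{p}_g^{\mathrm{T}}\, \mathbf{x}_g.$$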
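The mapping from the original features to the rule-induced feature space can be sketched in NumPy as follows. This is a minimal illustration under the standard first-order TSK formulation, not the authors' implementation: the function names are hypothetical, the membership matrix `U` is assumed to come from any FCM routine (a random stand-in is used here), and the Gaussian width is taken as the membership-weighted variance scaled by `h`.

```python
import numpy as np


def fcm_centers_and_widths(X, U, h=1.0):
    """Membership-weighted mean and h-scaled variance per cluster and feature.

    X: (N, d) samples; U: (N, K) fuzzy memberships from any FCM routine.
    """
    weights = U / U.sum(axis=0, keepdims=True)             # columns sum to 1
    centers = weights.T @ X                                  # (K, d)
    sq_dev = (X[:, None, :] - centers[None, :, :]) ** 2      # (N, K, d)
    widths = h * np.einsum("nk,nkd->kd", weights, sq_dev)    # (K, d)
    return centers, widths


def tsk_feature_map(X, centers, widths, eps=1e-12):
    """Map samples to the (N, K*(d+1)) space spanned by the fuzzy rules."""
    sq_dev = (X[:, None, :] - centers[None, :, :]) ** 2      # (N, K, d)
    # Product of Gaussian memberships over features, computed in log space.
    log_mu = -(sq_dev / (2.0 * widths[None] + eps)).sum(axis=2)  # (N, K)
    log_mu -= log_mu.max(axis=1, keepdims=True)              # numerical stability
    mu = np.exp(log_mu)
    mu_norm = mu / mu.sum(axis=1, keepdims=True)             # normalized memberships
    X_e = np.hstack([np.ones((X.shape[0], 1)), X])           # prepend the bias term
    # Each rule contributes one block: normalized membership times (1, x).
    X_g = (mu_norm[:, :, None] * X_e[:, None, :]).reshape(X.shape[0], -1)
    return X_g, mu_norm


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
U = rng.dirichlet(np.ones(2), size=6)   # stand-in for FCM memberships (K = 2)
centers, widths = fcm_centers_and_widths(X, U, h=1.0)
X_g, mu_norm = tsk_feature_map(X, centers, widths)
print(X_g.shape)  # (6, 8): K * (d + 1) = 2 * 4
```

Because the normalized memberships of each sample sum to one, the then-part parameters enter the output linearly, which is what allows a linear sparse estimator to be applied to the mapped data.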
Block sparse Bayesian learning.
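Let $\mathbf{X}_g = [\mathbf{x}_{g,1}, \mathbf{x}_{g,2}, \ldots, \mathbf{x}_{g,N}]^{\mathrm{T}}$ collect the mapped samples and let $\mathbf{p}_g$ denote the vector of then-part parameters. Then we can obtain

$$\mathbf{y} = \mathbf{X}_g\, \mathbf{p}_g + \boldsymbol{\varepsilon},$$

where $\boldsymbol{\varepsilon}$ is the noise term.

First, we assume that the data is divided into $M$ blocks, $\mathbf{p}_g = [\mathbf{p}_{(1)}^{\mathrm{T}}, \ldots, \mathbf{p}_{(M)}^{\mathrm{T}}]^{\mathrm{T}}$, that all the sources $\mathbf{p}_{(i)}$ are mutually independent, and that the density of each $\mathbf{p}_{(i)}$ is Gaussian, given by

$$p(\mathbf{p}_{(i)}; \gamma_i, \mathbf{B}_i) = \mathcal{N}(\mathbf{0}, \gamma_i \mathbf{B}_i), \quad i = 1, \ldots, M,$$

where $\gamma_i$ is a nonnegative hyperparameter controlling the sparsity of $\mathbf{p}_g$ by representing the relevance of the $i$-th block of data. For small $\gamma_i$, the contribution of the $i$-th block is indistinguishable from noise. When $\gamma_i = 0$, the associated $\mathbf{p}_{(i)}$ becomes $\mathbf{0}$. $\gamma_i \mathbf{B}_i$ is the covariance matrix of $\mathbf{p}_{(i)}$, and $\mathbf{B}_i$ is a positive-definite matrix that captures the correlation structure of $\mathbf{p}_{(i)}$ and needs to be estimated.

Further, assuming that blocks are uncorrelated with each other, the prior of $\mathbf{p}_g$ can be modeled as

$$p(\mathbf{p}_g; \{\gamma_i, \mathbf{B}_i\}_{i=1}^{M}) = \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_0),$$

where

$$\boldsymbol{\Sigma}_0 = \mathrm{diag}(\gamma_1 \mathbf{B}_1, \gamma_2 \mathbf{B}_2, \ldots, \gamma_M \mathbf{B}_M).$$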
For the observation vector y, it is assumed to obey the following probability density distribution
The posterior probability density and the likelihood function can then be obtained in closed form by using standard Gaussian identities.
where
To estimate the hyperparameters, a Type II maximum likelihood approach can be used to obtain the cost function:
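To make these quantities concrete, the following minimal sketch computes the posterior mean, the posterior covariance, and the Type II maximum likelihood cost in the standard sparse Bayesian learning form. It assumes a zero-mean Gaussian prior with block-diagonal covariance built from the block scales and correlation matrices, plus isotropic Gaussian noise. The function names, the argument names, and the noise-variance symbol `lam` are assumptions introduced for illustration and may differ from the notation of Eqs (19) and (20).

```python
import numpy as np


def block_diag(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    sizes = [b.shape[0] for b in blocks]
    out = np.zeros((sum(sizes), sum(sizes)))
    start = 0
    for b, s in zip(blocks, sizes):
        out[start:start + s, start:start + s] = b
        start += s
    return out


def bsbl_posterior_and_cost(Phi, y, gammas, Bs, lam):
    """Posterior mean/covariance of the weights and the Type II ML cost.

    Phi: (N, D) design matrix; y: (N,) targets; gammas: M nonnegative block
    scales; Bs: M positive-definite (d_i, d_i) matrices; lam: noise variance.
    """
    Sigma0 = block_diag([g * B for g, B in zip(gammas, Bs)])   # prior covariance
    C = lam * np.eye(Phi.shape[0]) + Phi @ Sigma0 @ Phi.T       # covariance of y
    C_inv_y = np.linalg.solve(C, y)
    mu = Sigma0 @ Phi.T @ C_inv_y                               # posterior mean
    Sigma = Sigma0 - Sigma0 @ Phi.T @ np.linalg.solve(C, Phi @ Sigma0)
    _, logdet = np.linalg.slogdet(C)
    cost = logdet + y @ C_inv_y          # -2 log marginal likelihood, up to a constant
    return mu, Sigma, cost


rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 6))
y = rng.normal(size=20)
Bs = [np.eye(2)] * 3
mu, Sigma, cost = bsbl_posterior_and_cost(Phi, y, [1.0, 0.0, 0.5], Bs, lam=0.1)
print(mu[2:4])  # the block whose scale is zero has zero posterior mean
```

The example also shows the pruning mechanism described in the text: a block whose scale is set to zero is removed from the model exactly, because its rows in the prior covariance vanish.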
Estimation of hyperparameters.
Using the matrix inversion lemma, Eq (20) can be rewritten as
Taking the partial derivatives with respect to each hyperparameter yields
The update formulas can then be obtained
Algorithm 1 describes the entire process of BSBL-TSK-FS, and Fig 1 shows the framework of the proposed approach.
Algorithm 1 Algorithm of BSBL-TSK-FS model.
Require: The training set $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$, number of blocks M, number of fuzzy rules K, adjustable parameter h;
Ensure: The prediction labels;
1: Determine the mean and variance using the FCM method;
2: Calculate the normalized fuzzy membership by Eq (6);
3: Construct the dataset using the fuzzy rules mapped to the new feature space, where the mapped samples are obtained by Eq (10);
4: Initialize γ;
5: Initialize β;
6: while not converged do
7: Calculate the posterior mean and covariance by Eq (19);
8: Calculate β and the intra-block correlation coefficient r by Eq (23);
9: Calculate γ by Eq (23);
10: Calculate C by Eq (19);
11: end while
12: Estimate the sparse solution and the parameter β.
13: Estimate Y by Eq (13).
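The alternating structure of the loop in Algorithm 1 can be illustrated with a deliberately simplified sketch. It uses the expectation-maximization update known from the sparse Bayesian learning literature and makes three assumptions that are not taken from this study: blocks have equal size, the intra-block correlation matrices are fixed to the identity (so no coefficient r is estimated), and the noise variance `lam` is held fixed. The function name `bsbl_em_identity` and all argument names are hypothetical.

```python
import numpy as np


def bsbl_em_identity(Phi, y, block_size, n_iter=50, lam=1e-2, prune=1e-8):
    """EM-style block sparse Bayesian learning with identity block correlation."""
    N, D = Phi.shape
    M = D // block_size
    gammas = np.ones(M)                                     # one scale per block
    costs = []
    for _ in range(n_iter):
        prior_var = np.repeat(gammas, block_size)           # diagonal of the prior
        PhiS = Phi * prior_var                               # Phi @ Sigma_0
        C = lam * np.eye(N) + PhiS @ Phi.T                   # covariance of y
        C_inv_y = np.linalg.solve(C, y)
        costs.append(np.linalg.slogdet(C)[1] + y @ C_inv_y)  # Type II ML cost
        mu = PhiS.T @ C_inv_y                                # posterior mean
        # Diagonal of the posterior covariance Sigma_0 - Sigma_0 Phi^T C^-1 Phi Sigma_0.
        Sigma_diag = prior_var - np.einsum("nd,nd->d", PhiS, np.linalg.solve(C, PhiS))
        # EM update: each block scale is the mean posterior second moment of its block.
        second_moment = (mu ** 2 + Sigma_diag).reshape(M, block_size)
        gammas = second_moment.mean(axis=1)
        gammas[gammas < prune] = 0.0                         # prune inactive blocks
    return mu, gammas, costs


rng = np.random.default_rng(1)
Phi = rng.normal(size=(40, 12))
w_true = np.zeros(12)
w_true[3:6] = [2.0, -1.5, 1.0]            # only the second block is active
y = Phi @ w_true + 0.1 * rng.normal(size=40)
mu, gammas, costs = bsbl_em_identity(Phi, y, block_size=3)
print(np.argmax(gammas))  # index of the block with the largest learned scale
```

On this toy problem the scale of the active block stays large while the scales of the inactive blocks shrink toward zero, which is the block-level sparsity that the hyperparameters are meant to induce.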
Evaluation metrics
Model evaluation on the methylation datasets is based on the Matthews correlation coefficient (MCC), specificity (SP), accuracy (ACC), and sensitivity (SN) [59–61]. Moreover, our model was evaluated objectively by calculating the area under the receiver operating characteristic curve (AUC) [36]. AUC values range from 0 to 1, and models with higher AUCs generally perform better.
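For reference, these metrics can be computed from the confusion-matrix counts as in the short sketch below. It uses the standard textbook definitions; the function names `binary_metrics` and `auc` and the toy labels and scores are hypothetical, and the AUC is computed with the pairwise ranking formulation rather than with any particular library routine.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return SN, SP, ACC, and MCC for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sn = tp / (tp + fn) if tp + fn else 0.0           # sensitivity (recall)
    sp = tn / (tn + fp) if tn + fp else 0.0           # specificity
    acc = (tp + tn) / len(y_true)                     # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.1, 0.2, 0.4, 0.6]
print(binary_metrics(y_true, y_pred))  # SN = SP = ACC = 0.75, MCC = 0.5
print(auc(y_true, scores))             # 0.875
```

Unlike ACC, the MCC remains informative when the positive and negative classes are imbalanced, which is why it is commonly reported alongside SN and SP.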
Acknowledgments
Computational resources were provided by the High Performance Computing Center of Central South University (to F.G.).
References
- 1. Shi H, Wei J, He C. Where, when, and how: context-dependent functions of RNA methylation writers, readers, and erasers. Mol Cell. 2019;74(4):640–50. pmid:31100245
- 2. Wang X, Zhao BS, Roundtree IA, Lu Z, Han D, Ma H, et al. N(6)-methyladenosine modulates messenger RNA translation efficiency. Cell. 2015;161(6):1388–99. pmid:26046440
- 3. Batista PJ, Molinie B, Wang J, Qu K, Zhang J, Li L, et al. m(6)A RNA modification controls cell fate transition in mammalian embryonic stem cells. Cell Stem Cell. 2014;15(6):707–19. pmid:25456834
- 4. Wang Y, Li Y, Toth JI, Petroski MD, Zhang Z, Zhao JC. N6-methyladenosine modification destabilizes developmental regulators in embryonic stem cells. Nat Cell Biol. 2014;16(2):191–8. pmid:24394384
- 5. Jia G, Fu Y, He C. Reversible RNA adenosine methylation in biological regulation. Trends Genet. 2013;29(2):108–15. pmid:23218460
- 6. Lindstein T, June CH, Ledbetter JA, Stella G, Thompson CB. Regulation of lymphokine messenger RNA stability by a surface-mediated T cell activation pathway. Science. 1989;244(4902):339–43. pmid:2540528
- 7. Xiao W, Adhikari S, Dahal U, Chen Y-S, Hao Y-J, Sun B-F, et al. Nuclear m(6)A reader YTHDC1 regulates mRNA splicing. Mol Cell. 2016;61(4):507–19. pmid:26876937
- 8. Jiang X, Liu B, Nie Z, Duan L, Xiong Q, Jin Z, et al. The role of m6A modification in the biological functions and diseases. Signal Transduct Target Ther. 2021;6(1):74. pmid:33611339
- 9. Zhang S-Y, Zhang S-W, Fan X-N, Meng J, Chen Y, Gao S-J, et al. Global analysis of N6-methyladenosine functions and its disease association using deep learning and network-based methods. PLoS Comput Biol. 2019;15(1):e1006663. pmid:30601803
- 10. Zhang S-Y, Zhang S-W, Liu L, Meng J, Huang Y. m6A-Driver: identifying context-specific mRNA m6A methylation-driven gene interaction networks. PLoS Comput Biol. 2016;12(12):e1005287. pmid:28027310
- 11. Zou G, Zou Y, Ma C, Zhao J, Li L. Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: lysine methylome as an example. PLoS Comput Biol. 2021;17(12):e1009682. pmid:34879076
- 12. Liu J, Li K, Cai J, Zhang M, Zhang X, Xiong X, et al. Landscape and regulation of m6A and m6Am methylome across human and mouse tissues. Mol Cell. 2020;77(2):426-440.e6. pmid:31676230
- 13. Liu J, Yue Y, Han D, Wang X, Fu Y, Zhang L, et al. A METTL3-METTL14 complex mediates mammalian nuclear RNA N6-adenosine methylation. Nat Chem Biol. 2014;10(2):93–5. pmid:24316715
- 14. Ke S, Pandya-Jones A, Saito Y, Fak JJ, Vågbø CB, Geula S, et al. m6A mRNA modifications are deposited in nascent pre-mRNA and are not required for splicing but do specify cytoplasmic turnover. Genes Dev. 2017;31(10):990–1006. pmid:28637692
- 15. An S, Huang W, Huang X, Cun Y, Cheng W, Sun X, et al. Integrative network analysis identifies cell-specific trans regulators of m6A. Nucleic Acids Res. 2020;48(4):1715–29. pmid:31912146
- 16. Jia G, Fu Y, Zhao X, Dai Q, Zheng G, Yang Y, et al. N6-methyladenosine in nuclear RNA is a major substrate of the obesity-associated FTO. Nat Chem Biol. 2011;7(12):885–7. pmid:22002720
- 17. An Y, Duan H. The role of m6A RNA methylation in cancer metabolism. Mol Cancer. 2022;21(1):14. pmid:35022030
- 18. Xiong X, Hou L, Park YP, Molinie B, GTEx Consortium, Gregory RI, et al. Genetic drivers of m6A methylation in human brain, lung, heart and muscle. Nat Genet. 2021;53(8):1156–65. pmid:34211177
- 19. Meyer KD, Saletore Y, Zumbo P, Elemento O, Mason CE, Jaffrey SR. Comprehensive analysis of mRNA methylation reveals enrichment in 3’ UTRs and near stop codons. Cell. 2012;149(7):1635–46. pmid:22608085
- 20. Dominissini D, Moshitch-Moshkovitz S, Schwartz S, Salmon-Divon M, Ungar L, Osenberg S, et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature. 2012;485(7397):201–6. pmid:22575960
- 21. Linder B, Grozhik AV, Olarerin-George AO, Meydan C, Mason CE, Jaffrey SR. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat Methods. 2015;12(8):767–72. pmid:26121403
- 22. Carlile TM, Rojas-Duran MF, Zinshteyn B, Shin H, Bartoli KM, Gilbert WV. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature. 2014;515(7525):143–6. pmid:25192136
- 23. Marchand V, Ayadi L, Ernst FGM, Hertler J, Bourguignon-Igel V, Galvanin A, et al. AlkAniline-Seq: profiling of m7G and m3C RNA modifications at single nucleotide resolution. Angew Chem Int Ed Engl. 2018;57(51):16785–90. pmid:30370969
- 24. Zhang Z, Chen L-Q, Zhao Y-L, Yang C-G, Roundtree IA, Zhang Z, et al. Single-base mapping of m6A by an antibody-independent method. Sci Adv. 2019;5(7):eaax0250. pmid:31281898
- 25. Meyer KD. DART-seq: an antibody-free method for global m6A detection. Nat Methods. 2019;16(12):1275–80. pmid:31548708
- 26. Ryvkin P, Leung YY, Silverman IM, Childress M, Valladares O, Dragomir I, et al. HAMR: high-throughput annotation of modified ribonucleotides. RNA. 2013;19(12):1684–92. pmid:24149843
- 27. Lin H, Deng E-Z, Ding H, Chen W, Chou K-C. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014;42(21):12961–72. pmid:25361964
- 28. Liu H, Begik O, Novoa EM. EpiNano: detection of m6A RNA modifications using Oxford Nanopore direct RNA sequencing. RNA Modifications: Methods and Protocols. 2021. p. 31–52.
- 29. Leger A, Amaral PP, Pandolfini L, Capitanchik C, Capraro F, Miano V, et al. RNA modifications detection by comparative Nanopore direct RNA sequencing. Nat Commun. 2021;12(1):7198. pmid:34893601
- 30. Stoiber M, Quick J, Egan R, Eun Lee J, Celniker S, Neely RK. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. BioRxiv. 2016. 094672.
- 31. Acera Mateos P, J Sethi A, Ravindran A, Srivastava A, Woodward K, Mahmud S, et al. Prediction of m6A and m5C at single-molecule resolution reveals a transcriptome-wide co-occurrence of RNA modifications. Nat Commun. 2024;15(1):3899. pmid:38724548
- 32. He W, Jia C, Zou Q. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics. 2019;35(4):593–601. pmid:30052767
- 33. Wang H, Nie F, Huang H, Risacher SL, Saykin AJ, Shen L, et al. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics. 2012;28(12):i127-36. pmid:22689752
- 34. Chen W, Feng P, Ding H, Lin H, Chou K-C. iRNA-Methyl: Identifying N(6)-methyladenosine sites using pseudo nucleotide composition. Anal Biochem. 2015;490:26–33. pmid:26314792
- 35. Chen W, Tran H, Liang Z, Lin H, Zhang L. Identification and analysis of the N(6)-methyladenosine in the Saccharomyces cerevisiae transcriptome. Sci Rep. 2015;5:13859. pmid:26343792
- 36. Zhou Y, Zeng P, Li Y-H, Zhang Z, Cui Q. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016;44(10):e91. pmid:26896799
- 37. Chen K, Wei Z, Zhang Q, Wu X, Rong R, Lu Z, et al. WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. 2019;47(7):e41. pmid:30993345
- 38. Fan R, Cui C, Kang B, Chang Z, Wang G, Cui Q. A combined deep learning framework for mammalian m6A site prediction. Cell Genom. 2024;4(12):100697. pmid:39571573
- 39. Dao F-Y, Lv H, Yang Y-H, Zulfiqar H, Gao H, Lin H. Computational identification of N6-methyladenosine sites in multiple tissues of mammals. Comput Struct Biotechnol J. 2020;18:1084–91. pmid:32435427
- 40. Liu K, Cao L, Du P, Chen W. im6A-TS-CNN: identifying the N6-methyladenine site in multiple tissues by using the convolutional neural network. Mol Ther Nucleic Acids. 2020;21:1044–9. pmid:32858457
- 41. Abbas Z, Tayara H, Zou Q, Chong KT. TS-m6A-DL: tissue-specific identification of N6-methyladenosine sites using a universal deep learning model. Comput Struct Biotechnol J. 2021;19:4619–25. pmid:34471503
- 42. Luo Z, Lou L, Qiu W, Xu Z, Xiao X. Predicting N6-methyladenosine sites in multiple tissues of mammals through ensemble deep learning. Int J Mol Sci. 2022;23(24):15490. pmid:36555143
- 43. Rehman MU, Tayara H, Chong KT. DL-m6A: identification of N6-methyladenosine sites in mammals using deep learning based on different encoding schemes. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(2):904–11. pmid:35857733
- 44. Song B, Huang D, Zhang Y, Wei Z, Su J, Pedro de Magalhães J, et al. m6A-TSHub: unveiling the context-specific m6A methylation and m6A-affecting mutations in 23 human tissues. Genomics Proteomics Bioinformatics. 2023;21(4):678–94. pmid:36096444
- 45. Zhou J, Pedrycz W, Gao C, Lai Z, Wan J, Ming Z. Robust jointly sparse fuzzy clustering with neighborhood structure preservation. IEEE Trans Fuzzy Syst. 2022;30(4):1073–87.
- 46. Zhang Z, Rao BD. Sparse signal recovery with temporally correlated source vectors using sparse Bayesian learning. IEEE J Sel Top Signal Process. 2011;5(5):912–26.
- 47. Wipf DP, Rao BD. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Trans Signal Process. 2007;55(7):3704–16.
- 48. Chou W. Maximum a posterior linear regression with elliptically symmetric matrix variate priors. In: Eurospeech; 1999. p. 1–4.
- 49. Myung IJ. Tutorial on maximum likelihood estimation. J Math Psychol. 2003;47(1):90–100.
- 50. Wu X, Bartel DP. kpLogo: positional k-mer analysis reveals hidden specificity in biological sequences. Nucleic Acids Res. 2017;45(W1):W534–8. pmid:28460012
- 51. Hendra C, Pratanwanich PN, Wan YK, Goh WSS, Thiery A, Göke J. Detection of m6A from direct RNA sequencing using a multiple instance learning framework. Nat Methods. 2022;19(12):1590–8. pmid:36357692
- 52. Abbas Z, Rehman MU, Tayara H, Zou Q, Chong KT. XGBoost framework with feature selection for the prediction of RNA N5-methylcytosine sites. Mol Ther. 2023;31(8):2543–51. pmid:37271991
- 53. Xu Y, Ding J, Wu L-Y, Chou K-C. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One. 2013;8(2):e55844. pmid:23409062
- 54. Zhang Y, Ishibuchi H, Wang S. Deep Takagi–Sugeno–Kang fuzzy classifier with shared linguistic fuzzy rules. IEEE Trans Fuzzy Syst. 2018;26(3):1535–49.
- 55. Ding Y, Tiwari P, Zou Q, Guo F, Pandey HM. C-Loss based higher order fuzzy inference systems for identifying DNA N4-methylcytosine sites. IEEE Trans Fuzzy Syst. 2022;30(11):4754–65.
- 56. Deng Z, Jiang Y, Choi K, Chung F, Wang S. Knowledge-leverage-based TSK fuzzy system modeling. IEEE Trans Neural Netw Learn Syst. 2013;24(8):1200–12. pmid:24808561
- 57. Giang NL, Son LH, Ngan TT, Tuan TM, Phuong HT, Abdel-Basset M, et al. Novel incremental algorithms for attribute reduction from dynamic decision tables using hybrid filter–wrapper with fuzzy partition distance. IEEE Trans Fuzzy Syst. 2020;28(5):858–73.
- 58. Bezdek JC, Ehrlich R, Full W. FCM: the fuzzy c-means clustering algorithm. Comput Geosci. 1984;10(2–3):191–203.
- 59. Jiao Y, Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant Biol. 2016;4(4):320–30.
- 60. Tahir M, Tayara H, Chong KT. iPseU-CNN: identifying RNA pseudouridine sites using convolutional neural networks. Mol Ther Nucleic Acids. 2019;16:463–70. pmid:31048185
- 61. Zhang D, Xu Z-C, Su W, Yang Y-H, Lv H, Yang H, et al. iCarPS: a computational tool for identifying protein carbonylation sites by novel encoded features. Bioinformatics. 2021;37(2):171–7. pmid:32766811