NN-RNALoc: Neural network-based model for prediction of mRNA sub-cellular localization using distance-based sub-sequence profiles

  • Negin Sadat Babaiha,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Writing – original draft, Writing – review & editing

    Affiliations Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran, Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany

  • Rosa Aghdam ,

    Roles Supervision, Validation, Writing – review & editing

    ch-eslahchi@sbu.ac.ir (CE); rosaaghdam@gmail.com (RA)

    Affiliations School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran, Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, United States of America

  • Shokoofeh Ghiam,

    Roles Formal analysis, Writing – review & editing

    Affiliation School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

  • Changiz Eslahchi

    Roles Formal analysis, Investigation, Methodology, Supervision, Writing – review & editing

    ch-eslahchi@sbu.ac.ir (CE); rosaaghdam@gmail.com (RA)

    Affiliations Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran, School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Abstract

The localization of messenger RNAs (mRNAs) is a frequently observed phenomenon and a crucial aspect of gene expression regulation. It is also a mechanism for targeting proteins to a specific cellular region. Moreover, prior studies have shown the significance of intracellular RNA positioning during embryonic and neural dendrite formation. Incorrect RNA localization, which can be caused by a variety of factors, such as mutations in trans-regulatory elements, has been linked to the development of certain neuromuscular diseases and cancer. In this study, we introduce NN-RNALoc, a neural network-based method for predicting the cellular location of mRNA using novel features extracted from mRNA sequence data and protein interaction patterns. Specifically, we developed a distance-based subsequence profile for RNA sequence representation that is more memory- and time-efficient than the well-known k-mer sequence representation. Combining protein-protein interaction data, which is essential for numerous biological processes, with our novel distance-based subsequence profiles of mRNA sequences produces more accurate features. On two benchmark datasets, CeFra-Seq and RNALocate, the performance of NN-RNALoc is compared to powerful predictive models proposed in previous works (mRNALoc, RNATracker, mLoc-mRNA, DM3Loc, iLoc-mRNA, and EL-RMLocNet) and to a baseline neural network (DNN-5mer). Compared to the previous methods, NN-RNALoc significantly reduces computation time and also outperforms them in terms of accuracy. This study’s source code and datasets are freely accessible at https://github.com/NeginBabaiha/NN-RNALoc.

Introduction

Numerous studies have implicated the intracellular localization of RNA as a cellular polarization mechanism, as it plays a crucial role in gene expression regulation [1]. Additionally, messenger RNA (mRNA) localization may be preferable to protein localization because a single mRNA molecule can serve as a template for multiple proteins. Therefore, predicting mRNA localization rather than protein localization is more efficient and saves time. Recent research suggests that the localization of mRNAs to specific sub-cellular compartments may be more common than previously thought, and that signals in the sequences of both types of molecules can play a role in their transport to specific cellular locations [2–5]. Notably, incorrect RNA localization within the cell has been associated with neuromuscular disorders and cancer. Oligonucleotides have previously been introduced as a new type of drug that targets RNAs rather than disease-causing proteins [6–8]. mRNA localization has been studied experimentally for many years, and there are two well-known experimental datasets in this regard: cell fractionation with RNA-sequencing (CeFra-Seq) and APEX-RIP [9, 10]. CeFra-Seq is a method for mapping the abundance of transcripts in the Nucleus, Cytoplasm, Membrane, and Insoluble fractions of cells. APEX-RIP is a technique for mapping Nuclear, Cytoplasmic, Endoplasmic Reticulum (ER), and Mitochondrial transcriptomes. Additionally, RNALocate is a well-known RNA localization dataset. RNALocate is a web-accessible dataset containing information for over 190,000 RNA-associated sub-cellular localization entries supported by experimental and predicted evidence [11]. It involves over 105,000 RNAs in 44 subcellular locations in 65 species, including Homo sapiens, Mus musculus, and Saccharomyces cerevisiae. Despite experimental efforts, the widening gap between known mRNAs and those whose location has been determined is increasing the need for computational predictors.
In recent years, computational predictors have emerged that rely heavily on machine learning techniques [12–14]. RNATracker [9], developed in 2019, was the first mRNA localization prediction model. RNATracker predicts the location of mRNAs in the CeFra-Seq and APEX-RIP datasets using a convolutional neural network (CNN) and long short-term memory (LSTM). In another recent study, a predictive model named RNA-GPS was introduced to predict the localization of the transcripts in the APEX-RIP dataset only [14]. RNA-GPS computes k-mer frequencies for k ranging from 3 to 5 for each transcript and assigns probabilities to mRNA locations using a random forest model. In 2020, mRNALoc was designed to predict mRNA sub-cellular localization by extracting k-mer profiles from mRNA sequences and applying a support vector machine (SVM); it was trained on the RNALocate dataset [12]. Zhang et al. developed a computational method, iLoc-mRNA, which was trained on the RNALocate dataset and applied an SVM model for multiclass classification [13]. It is also noteworthy that iLoc-mRNA makes predictions for one of the following locations: Cytosol/Cytoplasm, ribosome, Endoplasmic Reticulum, and nucleus/exosome/dendrite/mitochondrion. Combining nucleus, exosome, dendrite, and mitochondrion into a single class is not appropriate, as these are diverse locations that should not be merged into a single sub-cellular class [12]. Meher et al. presented “mLoc-mRNA” to predict nine distinct sub-cellular localizations for mRNAs. They used k-mers of sizes 1–6 to transform each mRNA sequence into a numerical feature vector. They applied the Elastic Net statistical model to select the best features from the k-mer features. The sub-cellular localization of mRNAs was then predicted using a Random Forest classifier [15].
In 2021, a multi-label mRNA sub-cellular localization predictor named “DM3Loc” was also proposed using deep learning, which predicts six distinct locations of mRNAs in Homo sapiens. It takes raw mRNA sequences as the input to a CNN and uses a novel multi-head self-attention mechanism capable of producing sequence motifs [16]. The deep learning model “EL-RMLocNet”, which predicts the subcellular localization of four different RNA classes (mRNA, miRNA, lncRNA, and snoRNA) in Homo sapiens and Mus musculus, was developed in [17]. To identify the most informative features from raw RNA sequences, it uses an LSTM network, which captures the short- and long-range relations of nucleotide k-mers. In this study, we focus on the CeFra-Seq and RNALocate datasets, and we use the powerful predictive models mRNALoc, RNATracker, mLoc-mRNA, DM3Loc, iLoc-mRNA, and EL-RMLocNet as benchmarks. The APEX-RIP dataset was not selected because it is noisy [12]. We also study a baseline neural network (DNN-5mer) that has only two hidden layers and extracts k-mer features from sequences. We present a novel representation of mRNA sequences based on subsequences of distance k, and we argue that combining this encoding with conventional k-mer frequency profiles can potentially yield more sequence-based information from mRNAs. Using a protein-protein interaction (PPI) network in addition, we developed a neural network-based model that we call NN-RNALoc. Specifically, we exploit the fact that proteins with similar PPI patterns tend to be located in the same sub-cellular location [18, 19] by incorporating this widely used data into the predictive model. Additionally, it is essential to note that “Chou’s 5-step rule” [20] can be used to develop a more practical predictor for a biological system.
Chou’s 5-step rule has the following notable benefits: it is clear in its logical development, completely transparent in operation, easily repeatable by other researchers, has a high potential for stimulating other sequence-analysis methods, and is very user-friendly for the vast majority of experimental scientists [21, 22]. Therefore, in this study, we establish the NN-RNALoc predictor through the following five steps:

  1. Select or create a valid benchmark dataset for use in training and validating the predictor. This step is detailed in the “Data Sources” section.
  2. Represent the mRNAs with an efficient formulation and extract k-mer information that reflects their intrinsic correlation with the to-be-predicted target. This process is outlined in “Feature Encoding” section.
  3. Introduce and develop the potent NN-RNALoc algorithm for prediction purposes. In the “NN-RNALoc” section, the three primary steps of NN-RNALoc and its workflow are described.
  4. Perform cross-validation tests to objectively evaluate the anticipated accuracy of the prediction. This corresponds to the third step of the NN-RNALoc workflow, which is also covered in the “NN-RNALoc” section.
  5. In our future work, we will create a user-friendly, publicly accessible web server for the predictor. More information is provided in the “Conclusion” section.

The remaining sections are organized as follows: in the Materials and Methods section, we describe the datasets and introduce the features extracted from mRNA transcripts. Then, we discuss the specifics and steps required to create our model. In the Results section, we describe the performance of NN-RNALoc on the aforementioned two datasets and compare it to that of other methods: mRNALoc, RNATracker, DNN-5mer, DM3Loc, iLoc-mRNA, mLoc-mRNA, and EL-RMLocNet. In the Discussion section, we evaluate the performance of NN-RNALoc on human and non-human transcripts, utilizing novel distance-based subsequence profiles and canonical k-mer information.

Materials and methods

This section describes the data sources utilized in our research. Then, the details of the features and the architecture of NN-RNALoc are explained in more depth.

Data sources

mRNA sequences and localization information.

Two datasets are considered to benchmark the performance of NN-RNALoc against well-known algorithms. The first dataset is CeFra-Seq, which is also utilized by RNATracker. As stated previously, CeFra-Seq contains human transcripts, and the localization information of mRNAs is presented as normalized gene expression values for each of four sub-cellular locations: Cytosol, Nucleus, Membrane, and Insoluble. Therefore, rather than a single cellular location label for each mRNA, we have a four-element vector whose elements represent the probability of each location for that mRNA. This dataset contains 11,373 mRNAs, and the sequences come from the Ensembl database [23]. For the second dataset (RNALocate), mRNA sequences and sub-cellular localization information are extracted from the RNALocate dataset. NN-RNALoc considers human and non-human transcripts separately, and for each gene, only one isoform is considered. This study examines the Cytoplasm, Endoplasmic Reticulum (ER), Extracellular Region (EX), Mitochondria, and Nucleus. In this dataset, only mRNA sequences belonging to a single location are taken into account. The RNALocate sub-cellular localization data were obtained from RNALocate at https://www.rna-society.org/rnalocate/. The mRNA sequences were downloaded from GenBank, and the mRNA sequence data in FASTA format were obtained from the NCBI in December 2022 [24]. In total, this dataset contains 11,180 mRNAs, of which 5,905 are human transcripts and 5,275 are non-human transcripts. Table 1 provides a summary of this dataset. Notably, because the data produced by APEX-RIP are fairly noisy [9, 12], we did not use them in this study.

Table 1. Total number of mRNAs in each of the five locations in the RNALocate dataset.

https://doi.org/10.1371/journal.pone.0258793.t001

Protein-Protein Interaction (PPI) information.

The PPI information regarding human mRNA is extracted from the STRING database [25]. The longest protein-coding isoform among all isoforms of a gene is considered in this database, and one protein is then assigned to each mRNA. Thus, we obtain a weighted network whose vertices are the proteins assigned to mRNAs, whose edges represent the interactions between the corresponding proteins, and whose weights represent the STRING-assigned strength of the interaction between two proteins. The PPI information can therefore be represented as a matrix whose entries indicate how strongly two proteins interact with each other.
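The matrix form of the weighted network described above can be illustrated with a minimal sketch. The protein identifiers, the edge scores, and the function name `build_ppi_matrix` are hypothetical and chosen only for illustration; the actual STRING file format and score scale are not reproduced here.

```python
def build_ppi_matrix(proteins, weighted_edges):
    """Return a symmetric weighted adjacency matrix (list of lists).

    proteins: ordered list of protein identifiers, one per mRNA.
    weighted_edges: iterable of (protein_a, protein_b, score) tuples.
    Edges that mention a protein outside `proteins` are ignored.
    """
    index = {protein: i for i, protein in enumerate(proteins)}
    n = len(proteins)
    matrix = [[0.0] * n for _ in range(n)]
    for a, b, score in weighted_edges:
        if a in index and b in index:
            i, j = index[a], index[b]
            matrix[i][j] = score  # interaction strength a -> b
            matrix[j][i] = score  # the network is undirected
    return matrix


# Hypothetical toy network with made-up interaction scores.
proteins = ["P1", "P2", "P3"]
edges = [("P1", "P2", 0.9), ("P2", "P3", 0.4), ("P1", "PX", 0.7)]
ppi = build_ppi_matrix(proteins, edges)
```

Each row of such a matrix describes the interaction pattern of one protein, which is the per-mRNA information that is later reduced with PCA.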

Feature encoding

With the explosive growth of biological sequences in the post-genomic era, one of the most important and challenging problems in computational biology is how to express a biological sequence using a discrete model or vector while retaining significant sequence-order information or essential pattern characteristics. Two types of characteristics are derived from mRNA sequences. The first is k-mer representation, one of the most commonly employed encodings for nucleotide sequences [26–28]. The second is a novel representation that we propose for mRNA sequences. These two characteristics are described in detail below.

k-mer representation.

Counting k-mer frequencies is one way to extract a uniform-length feature vector from these sequences [26, 28]. A k-mer is a subsequence of length k within the mRNA sequence. As there are four nucleotides, there are a total of 4^k possible k-mers. Some k-mer profiles have been demonstrated to be more important for certain tasks. Specifically, Hart et al. discussed the significance of 5-mer sites in microRNA-gene targeting [29]. As shown in Fig 1, for the ACGCCGC sequence, all 5-mer structures are ACGCC, CGCCG, and GCCGC. The k-mer features used in this paper are covered by a highly effective web server called “Pse-in-One” [30]. For an mRNA sequence S, we sort all 5-mer structures lexicographically, then count the occurrences of each 5-mer structure in the mRNA sequence and divide the count by the length of the sequence. Consequently, we acquire the following feature vector: F_5(S) = [v_1, v_2, …, v_n], where each v_i is the frequency of the i-th 5-mer and n = 4^5 = 1,024 in this case.

Fig 1. All 5-mer structures contained in the ACGCCGC sequence.

In this example, we have three 5-mers ACGCC, CGCCG and GCCGC that are shown in three different colors.

https://doi.org/10.1371/journal.pone.0258793.g001
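The 5-mer encoding described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `kmer_frequencies` is hypothetical, the alphabet is assumed to be A, C, G, T, and windows containing other characters are simply skipped.

```python
from itertools import product


def kmer_frequencies(seq, k=5, alphabet="ACGT"):
    """Return (kmers, freqs): all 4**k k-mers in lexicographic order and,
    for each, its number of occurrences in seq divided by len(seq)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    # product() over a sorted alphabet yields k-mers in lexicographic order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with unexpected characters
            counts[window] += 1
    return kmers, [counts[kmer] / len(seq) for kmer in kmers]


kmers, freqs = kmer_frequencies("ACGCCGC")
nonzero = {kmer: f for kmer, f in zip(kmers, freqs) if f > 0}
```

For the example sequence ACGCCGC, only the three 5-mers ACGCC, CGCCG, and GCCGC receive a nonzero entry, and the vector has 4^5 = 1,024 elements.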

Distance-based sub-sequence profiles.

The main drawback of the k-mer representation is that as k increases, the feature vector becomes extremely large and sparse, which is memory-inefficient and can reduce the performance of the model. Larger k-mer sizes may be advantageous for mitigating the issue of small repeat regions. However, because the number of matching subsequences decreases as k grows, large k-mers become computationally infeasible and result in significant sparsity in the feature vector. In this study, we propose a novel distance-based representation to partially address this issue. In the distance-based profiles, each counted subsequence has the form aXb, where a and b are its first and last nucleotides and X is any subsequence of size k, so that the distance between the first and last nucleotide is k. The frequency of such subsequences is determined for each of the 16 ordered pairs of nucleotides (a, b). Consequently, for an mRNA sequence S and a distance k, the following 16-element feature vector is obtained: D_k(S) = [w_1, w_2, …, w_16], where w_i is the frequency of the i-th distance-based sub-sequence. For any k, an illustration of all subsequences to count is provided in Fig 2.

Fig 2. For an mRNA sequence S and a distance k, we depict the 16-element feature vector, where w_i is the frequency of each distance-based subsequence and X denotes a possible sub-sequence of size k.

https://doi.org/10.1371/journal.pone.0258793.g002

For an mRNA sequence S of length m, X can be any sub-sequence of nucleotides (A, G, C, and T) of size 0 to m-2. As an example, consider the mRNA S with the sequence ACGCCGC of length 7, so that X has a maximum size of 5. In Fig 3, four distance-based substructures of ACGCCGC are shown in three different colors. The two sub-sequences CGCC and CCGC with distance 2 are shown in green, one sub-sequence GCCGC with distance 3 is drawn in red, and one sub-sequence ACGCCGC with distance 5 is illustrated in black. For instance, consider w_3 and w_6 in this sequence, counted over all distances. For w_3 (AXG), we have one sub-sequence ACG (k = 1) and one sub-sequence ACGCCG (k = 4), so the frequency of w_3 is 2. For w_6 (CXC), the sequence contains one sub-sequence CC (k = 0), two sub-sequences CGC (k = 1), one sub-sequence CCGC (k = 2), one sub-sequence CGCC (k = 2), and one sub-sequence CGCCGC (k = 4). Therefore, the frequency of w_6 is 6. In this work, we tested a wide range of distances, and after extensive trial and error, we found the best range for k to be between 0 and 8. As a result, the length of the resulting feature vector is 9 × 16 = 144.

Fig 3. Four distance-based substructures are shown in three different colors for the mRNA sequence S = ACGCCGC.

Two sub-sequences CGCC and CCGC with k = 2 are shown in green, one sub-sequence GCCGC with k = 3 is depicted in red, and one sub-sequence ACGCCGC with k = 5 is illustrated in black. In addition, the figure depicts the possible subsequences of S between A and G (AXG) and C and C (CXC).

https://doi.org/10.1371/journal.pone.0258793.g003
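The counting scheme in the worked example can be reproduced with a short sketch. The function name `distance_profile` is hypothetical, and the sketch returns raw counts, as in the ACGCCGC example; whether and how the authors normalize these counts is not stated here, so any normalization would be an additional assumption.

```python
from itertools import product


def distance_profile(seq, max_k=8, alphabet="ACGT"):
    """Return (pairs, profile) where profile concatenates, for k = 0..max_k,
    the 16 counts of ordered nucleotide pairs (a, b) separated by exactly
    k intermediate nucleotides, i.e. occurrences of the pattern aXb."""
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]  # AA, AC, ..., TT
    profile = []
    for k in range(max_k + 1):
        counts = dict.fromkeys(pairs, 0)
        # i indexes the first nucleotide, i + k + 1 the last one.
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        profile.extend(counts[pair] for pair in pairs)
    return pairs, profile


pairs, profile = distance_profile("ACGCCGC")
ag, cc = pairs.index("AG"), pairs.index("CC")  # positions of w_3 and w_6
total_ag = sum(profile[k * 16 + ag] for k in range(9))
total_cc = sum(profile[k * 16 + cc] for k in range(9))
```

Summed over all distances, the AXG and CXC entries give 2 and 6 for ACGCCGC, matching the worked example, and the full vector for k from 0 to 8 has 9 × 16 = 144 elements.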

Principal Component Analysis on PPI network.

As previously stated, the PPI information is represented as an adjacency matrix with the dimension of number of mRNAs × number of mRNAs. As a result, the PPI matrix in the first dataset has a dimension of 11,373 × 11,373, whereas the PPI matrix in the RNALocate dataset has a dimension of 5,880 × 5,880. Because the performance of machine learning models can decrease when too many features are considered, we first employ the Principal Component Analysis (PCA) technique to reduce the dimension of this matrix [31]. PCA is one of the most widely used methods for reducing the feature space and improving the storage or computational efficiency of a learning algorithm. It applies singular value decomposition to project data into a lower-dimensional space, emphasizing variation and highlighting strong patterns in a dataset. After many tests, the number of principal components in this study was set to 500, which explains more than 70% of the total variance in the data.
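As an illustration of this reduction step, the following sketch applies PCA through a singular value decomposition with NumPy. It is not the authors' code: the function name `pca_reduce` is hypothetical, and a small random symmetric matrix stands in for the real PPI matrix, with 10 components instead of 500.

```python
import numpy as np


def pca_reduce(matrix, n_components):
    """Project the rows of `matrix` onto the first n_components principal
    components and return (reduced, explained_variance_ratio_sum)."""
    x = np.asarray(matrix, dtype=float)
    x_centered = x - x.mean(axis=0)  # PCA requires column-centered data
    u, s, _ = np.linalg.svd(x_centered, full_matrices=False)
    reduced = u[:, :n_components] * s[:n_components]  # component scores
    explained = (s ** 2) / np.sum(s ** 2)  # variance share per component
    return reduced, float(explained[:n_components].sum())


# Toy stand-in for a PPI matrix (random values, not real interaction data).
rng = np.random.default_rng(0)
toy = rng.random((50, 50))
toy_ppi = (toy + toy.T) / 2  # symmetric, like an undirected network
reduced, ratio = pca_reduce(toy_ppi, n_components=10)
```

The returned ratio is the quantity to inspect when choosing the number of components, analogous to the 70% threshold reported above.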

NN-RNALoc

In this section, we describe the main steps of NN-RNALoc.

Step 1: Combination of the following three feature vectors:

  1. 5-mer frequencies (a vector of length 1,024)
  2. Subsequence distance-based profiles (a vector of length 144)
  3. Reduced PPI matrix using PCA method (a vector of length 500 for each mRNA)

We combine the collected data into a single 1668-dimensional feature vector (1,024 + 144 + 500). This vector serves as the final input for our neural network model for the prediction task.

Step 2: Design of a neural network model

We propose an artificial neural network (ANN) for assigning probabilities of an mRNA belonging to a specific location based on our developed features. A neural network can be represented as a sequence of matrix multiplications interleaved with nonlinear functions. An ANN is made up of a number of smaller units known as neurons, which can be repeated in multiple layers. To prevent the neural network from becoming complex and thus more difficult to train efficiently, we employ a model with a shallow architecture consisting of one hidden layer with 200 neurons. Dropout is also used in the hidden layer to randomly mask 50% of the connections during model training to prevent overfitting. We use the Rectified Linear Unit (ReLU) activation function in the hidden layer, which is defined as follows [32]:

$$\mathrm{ReLU}(x) = \max(0, x) \tag{1}$$

The Softmax function is applied as the non-linear function in the last layer of the model to assign a probability to each location from its score x_i, and is formulated as below [33]:

$$\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \tag{2}$$

Finally, we use the Kullback-Leibler divergence as the loss function. For probability distributions P and Q defined on the same probability space X, the Kullback-Leibler divergence is defined as [34]:

$$D_{KL}(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \tag{3}$$
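To make the architecture concrete, the sketch below runs a single forward pass of a one-hidden-layer network with ReLU, Softmax, and a Kullback-Leibler loss in plain Python. It is only an illustration under stated assumptions: the weights are random rather than trained, the input and target vectors are made up, dropout is omitted because it applies only during training, and the helper names are hypothetical. The authors' model is implemented in Keras, which is not used here.

```python
import math
import random


def relu(values):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def forward(features, w1, b1, w2, b2):
    hidden = relu(dense(features, w1, b1))
    return softmax(dense(hidden, w2, b2))


random.seed(0)
n_in, n_hidden, n_out = 1668, 200, 4  # 1,024 + 144 + 500 inputs, four locations
w1 = [[random.gauss(0, 0.01) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[random.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

features = [random.random() for _ in range(n_in)]  # made-up feature vector
target = [0.4, 0.3, 0.2, 0.1]                      # made-up location distribution
probs = forward(features, w1, b1, w2, b2)
loss = kl_divergence(target, probs)
```

During training, an optimizer would adjust the weights to reduce this loss; that step is left out of the sketch.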

Step 3: Training of the prediction model

The hyper-parameters of the model are selected on the training dataset. All parameters were chosen with the intent of minimizing the loss function. For training the model, we employ the 10-fold cross-validation method [35]. The outcomes are evaluated for a range of values for the number of hidden layers (no hidden layer, 1, 2, and 3), the number of neurons in each fully connected layer (1,000, 700, 500, 200, and 100), and the dropout rate (0.1, 0.2, 0.3, and 0.5). Table 2 displays the optimal parameters used by this model. A validation set consisting of 10% of the training data is also used to monitor the loss function during the training process and detect overfitting. The Keras library [36] is used to implement NN-RNALoc. In addition, the Adam optimizer with Nesterov momentum is used to train the model [37]. Fig 4 depicts the comprehensive workflow of NN-RNALoc.

Fig 4. The overview of NN-RNALoc method.

In Step 1, we first aggregate all information gathered from both the sequence-based features and the protein-protein interaction (PPI) matrix. In Step 2, we design a neural network model with the mentioned architecture. In Step 3, the model is trained and evaluated using 10-fold cross-validation. We report a probability vector of length 4 for each mRNA in the CeFra-Seq dataset and the location with the highest probability as the mRNA’s predicted location in the RNALocate dataset.

https://doi.org/10.1371/journal.pone.0258793.g004
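The training protocol of Step 3 can be outlined as follows. This sketch only enumerates the hyper-parameter grid listed in the text and produces 10-fold train/test index splits; the function name `k_fold_indices`, the shuffling seed, and the sample count are assumptions for illustration, and the actual model fitting is omitted.

```python
import random
from itertools import product


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hyper-parameter values taken from the text.
hidden_layers = [0, 1, 2, 3]
neurons = [1000, 700, 500, 200, 100]
dropout_rates = [0.1, 0.2, 0.3, 0.5]
grid = list(product(hidden_layers, neurons, dropout_rates))

splits = list(k_fold_indices(n_samples=103))  # 103 is an arbitrary toy size
```

Each of the 80 grid configurations would be scored on the held-out fold of every split, and the configuration with the lowest average loss retained.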

Evaluation criteria

As stated previously, we work with two datasets, and due to the differences in their structures, we use different metrics to evaluate the performance of the model on each dataset. As described previously, the CeFra-Seq localization values are continuous. We therefore consider correlation measures when evaluating model performance, as in [9]. The first measure is the Pearson correlation, which measures the linear correlation between predicted and observed values. It has a value between -1 and +1, with +1 representing a total positive linear correlation, 0 representing no linear correlation, and -1 representing a total negative linear correlation. In order to better evaluate the performance of the model, we also consider the Spearman correlation between predicted and experimental values to capture the order of locations to which an mRNA belongs. In addition, we employ classification metrics for the RNALocate dataset because its localization information consists of discrete values, as in [12]. True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Precision, Recall, F-score, Accuracy (ACC), and Matthews Correlation Coefficient (MCC) are computed for this dataset in order to compare the performance of the NN-RNALoc method to that of other methods. These criteria are defined below:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

The criteria listed above are some of the most prevalent metrics used in classification problems to evaluate the performance of a model. Relying solely on a single statistical measure (such as ACC) can lead to overoptimistic results, particularly when analyzing imbalanced datasets. As a result, we also evaluate the MCC measure, which is a more reliable statistical rate that yields a high score only if the predictive model achieves success in all categories. In Table 3, we have summarized all the metrics used to evaluate the performance of the models on the two benchmarks.
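The classification criteria can be computed from the four confusion counts as in the following sketch, which uses the standard definitions of these metrics. The function name `binary_metrics` and the example counts are made up for illustration; returning 0.0 when a denominator is zero is a convention assumed here, not something specified in the text.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Precision, Recall, F-score, ACC, and MCC from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts for one location treated as the positive class.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)

# A classifier that never predicts a rare class can still look accurate.
rare = binary_metrics(tp=0, tn=98, fp=0, fn=2)
```

The second call illustrates the point about imbalance: accuracy is 0.98 although recall and MCC are both 0.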

Table 3. A summary of the localization information for two datasets and metrics used to assess model performance.

https://doi.org/10.1371/journal.pone.0258793.t003

Results

Time complexity

Many of the methods, including mRNALoc, are provided only as server-based tools, making it impossible to compare their running times. However, the source code of RNATracker is available, so we can compare the running times of RNATracker, DNN-5mer, and NN-RNALoc. On a Linux Ubuntu machine with 15 CPUs (Intel Xeon(R) 2.00 GHz), NN-RNALoc takes approximately 3 hours on the CeFra-Seq dataset, which is significantly faster than RNATracker (RNATracker requires 7 days for training in full-length mode and 8 hours in fixed-length mode on a GTX1080Ti graphics card). On the RNALocate dataset, NN-RNALoc takes 2 hours, which is significantly less than the 6 hours required by RNATracker. The running time of DNN-5mer, on the other hand, is comparable to that of NN-RNALoc. Consequently, in terms of training time, we can conclude that NN-RNALoc significantly outperforms RNATracker, while having a running time comparable to that of DNN-5mer.

Assessment and comparison

The results of 30 repetitions of 10-fold cross-validation on the CeFra-Seq dataset are displayed in Table 4. In the Cytosol, Insoluble, Membrane, and Nuclear regions of the CeFra-Seq dataset, NN-RNALoc achieves Pearson correlations of 0.69, 0.65, 0.54, and 0.55, respectively. In every location, these correlations are stronger than those of RNATracker in fixed-length mode. Using NN-RNALoc, the number of mRNAs with a Spearman correlation of 1, indicating a perfect association of ranks, is 2893, which is slightly better than RNATracker (2849). As demonstrated in Table 4, NN-RNALoc achieves approximately 17% greater overall Pearson correlation than RNATracker’s fixed-length mode. Because mRNALoc is a standalone tool trained on five locations that differ from those of CeFra-Seq, it was not possible to compare it to NN-RNALoc on this dataset. DNN-kMer is a multilayer perceptron-based predictor that extracts k-mer features from sequences (1-mers to k-mers). On both datasets, the DNN-kMer model was trained on 1-mers to 8-mers, and the best results were obtained when all 1-mer to 5-mer information was taken into account. Therefore, the input of DNN-5mer is a 1364-dimensional (4^1 + 4^2 + 4^3 + 4^4 + 4^5) vector. As a result, using 1-mers to 5-mers as features, we evaluate the performance of NN-RNALoc and DNN-5mer. DNN-5mer has only two hidden layers, each with the same number of neurons as the input vector. In the hidden layers, the ReLU activation function is used. Despite the fact that both NN-RNALoc and DNN-5mer have a simple architecture, DNN-5mer performs significantly worse, with Pearson correlations of 0.63 in the Cytosol, 0.55 in Insoluble, 0.42 in the Membrane, and 0.48 in the Nucleus. Overall, NN-RNALoc achieves a Pearson correlation approximately 35% higher than DNN-5mer. In addition, we ran NN-RNALoc with only k-mer frequencies (for k from 1 to 5) to evaluate the effect of incorporating the distance-based profile into the model.
As Table 4 shows, in this comparison (NN-RNALoc(no PPI) versus NN-RNALoc(k-mer profile)) the Pearson correlations were 9% lower in total, demonstrating the advantage of using distance-based profiles.

Table 4. Average Pearson correlations of 30 repetitions of 10-fold cross-validation in each location of the CeFra-Seq dataset obtained by different methods.

https://doi.org/10.1371/journal.pone.0258793.t004
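The two correlation measures used on CeFra-Seq can be sketched in plain Python as follows. The function names and the example vectors are hypothetical, and the rank computation assumes there are no tied values, which is a simplification that may not hold for real data.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def ranks(values):
    """Ranks starting at 1 for the smallest value (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up four-location vectors (Cytosol, Insoluble, Membrane, Nuclear) for one mRNA.
observed = [0.50, 0.10, 0.15, 0.25]
predicted = [0.40, 0.15, 0.20, 0.25]
rho = spearman(observed, predicted)
r = pearson(observed, predicted)
```

In this toy case the predicted values order the four locations exactly as the observed values do, so the Spearman correlation equals 1 even though the values themselves differ.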

RNALocate is the best-known dataset in this field and was used for validation of the algorithms in the previous studies. The performance of NN-RNALoc on RNALocate is benchmarked against the RNATracker, DM3Loc, mRNALoc, iLoc-mRNA, EL-RMLocNet, and mLoc-mRNA methods. For a fair comparison of the tested methods, we report the area under the Receiver Operating Characteristic (ROC) curve (AUC-ROC) and the area under the Precision-Recall (PR) curve (AUC-PR), as in the RNATracker, DM3Loc, mRNALoc, iLoc-mRNA, EL-RMLocNet, and mLoc-mRNA studies. Table 5 summarizes the AUC-ROC, AUC-PR, and average MCC of the different methods for the human part of the RNALocate dataset. For the Cyt location, NN-RNALoc and mRNALoc outperformed the others based on AUC-ROC and AUC-PR, respectively. For ER, iLoc-mRNA and NN-RNALoc outperformed the others based on AUC-ROC and AUC-PR, respectively. For EX, mLoc-mRNA and RNATracker outperformed the others based on AUC-ROC and AUC-PR, respectively. For the Nuc location, mLoc-mRNA and RNATracker outperformed the others based on AUC-ROC and AUC-PR, respectively. As seen in Table 5, no single method outperforms the others in all locations; for the Cyt and ER locations, NN-RNALoc outperformed the well-known methods. Similar to some previous methods, we only considered single-location mRNA sequences in the RNALocate dataset. Except for the DM3Loc and mLoc-mRNA methods, which predict multiple locations for each mRNA sequence, all other methods predict only a single location. If the actual location of an mRNA sequence was present among the locations predicted by mLoc-mRNA or DM3Loc, it was counted as a correct prediction. By predicting multiple locations, these methods improve their performance in some locations compared to other methods, as shown in Table 5. Similarly, Table 6 presents the results of the different methods on the non-human part of the RNALocate dataset.
In this case, NN-RNALoc outperformed existing methods for the Nuc location and obtained results close to those of the other methods elsewhere. In terms of average MCC, NN-RNALoc performs better than the other methods, which shows that our method works well overall.

Table 5. Results of AUC-ROC and AUC-PR for different methods on the human part of the RNALocate dataset.

https://doi.org/10.1371/journal.pone.0258793.t005

Table 6. Results of AUC-ROC and AUC-PR for different methods on the non-human part of the RNALocate dataset.

https://doi.org/10.1371/journal.pone.0258793.t006

The performance of NN-RNALoc on the RNALocate dataset at the best threshold is as follows. The Precision, Recall, and F-score values for Cytosol using NN-RNALoc are 74%, 72%, and 74%, respectively. Endoplasmic Reticulum (ER) has a precision of 56%, a recall of 48%, and an F-score of 52%. In the Extracellular Region (EX) and Mitochondria, due to a lack of training samples (only 26 and 2, respectively), the Recall and F-score are close to zero. The precision of prediction in the nucleus is 52%, whereas the recall and F-score are 70% and 60%, respectively. Overall, NN-RNALoc increased the total F-score over all locations by about 17% compared to mRNALoc and by 56% compared to RNATracker. However, in the nucleus, the average F-scores obtained by NN-RNALoc and mRNALoc are nearly identical. The overall accuracy of prediction using NN-RNALoc is higher than that of both RNATracker and mRNALoc. NN-RNALoc additionally achieves an MCC of 0.40, which is greater than those of RNATracker and mRNALoc (0.34 and 0.37, respectively). The results are shown in S1 Table.

In addition, we replaced the NN with other shallow learning algorithms, namely SVM, RF, Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) [38]. SVM-RNALoc used SVM on k-mer and distance-based profile features, XGBoost-RNALoc employed XGBoost on k-mer and distance-based profile features, and LightGBM-RNALoc applied LightGBM on k-mer and distance-based profile features. Table 7 and S2 Table report the results of these algorithms for the CeFra-Seq and RNALocate datasets, respectively. The results show that NN-RNALoc outperforms the other methods for most locations. Hence, we used the NN method to predict locations based on k-mer and distance-based profile features. Moreover, we applied the DNN-kMer method, a multilayer perceptron-based predictor that extracts k-mer features from sequences (1-mers to k-mers), and compared it with NN-RNALoc (please see Table 4). The results show that NN-RNALoc outperforms the other shallow learning approaches.
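
The CeFra-Seq comparison is summarized as an average Pearson correlation over cross-validation folds. The sketch below shows one way such a summary could be computed for a single location; the function names and the toy fold values are assumptions for illustration and do not reproduce the study's pipeline or numbers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def mean_pearson_over_folds(folds):
    """folds: list of (observed, predicted) value lists, one pair per held-out fold."""
    scores = [pearson(observed, predicted) for observed, predicted in folds]
    return sum(scores) / len(scores)


# Two toy folds (hypothetical values): one perfectly correlated, one anti-correlated.
folds = [
    ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
]
print(mean_pearson_over_folds(folds))
```

Repeating the cross-validation several times and averaging the per-fold correlations again would give a single number per location and per method, which is the form in which the comparison is tabulated.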

Table 7. Average Pearson correlations of 30 times 10-fold cross-validation in each location of the CeFra-Seq dataset obtained by NN-RNALoc, SVM-RNALoc, RF-RNALoc, XGBoost-RNALoc, DNN-RNALoc, LGBM-RNALoc.

https://doi.org/10.1371/journal.pone.0258793.t007

Discussion

To evaluate the effect of incorporating PPI information into our model, the following analysis was performed on both the CeFra-Seq and RNALocate datasets. We first use only the 5-mer and distance-based sub-sequence information derived from mRNA sequences and compare the results to the scenario in which PPI information is also incorporated into the model. When the reduced PPI matrix is used in the model for the CeFra-Seq dataset, NN-RNALoc achieves an almost 11% higher Pearson correlation in total over all locations, as shown in Table 4. We conduct the same analysis on the human transcripts of the RNALocate dataset, utilizing only sequence-based information in the model. The results agree with those for the first dataset: when NN-RNALoc uses PPI information on the second dataset, its overall performance improves, with a 10% increase in MCC and a 2% increase in accuracy. Fig 5 compiles the results for a more precise comparison of NN-RNALoc with PPI information (NN-RNALoc) and without PPI information (NN-RNALoc(no PPI)), alongside other methods. Fig 5(a) displays the average Pearson correlation for the four locations in the CeFra-Seq dataset, and Fig 5(b) shows the average F-score values for the five locations in the RNALocate dataset. According to Fig 5, considering PPI information improves the results for all locations in both datasets and has the greatest influence on predicting the insoluble location in the CeFra-Seq dataset and the Endoplasmic Reticulum location in the RNALocate dataset. To evaluate the impact of including distance-based profiles in the model, we omit this information from the feature vector. As previously discussed in the results and as shown in Table 4 and Fig 5, the poorer performance of NN-RNALoc on both datasets when only k-mer frequencies (for k from 1 to 5) are used suggests that the distance-based profiles contribute to prediction quality.
We then examine the non-human transcripts within the RNALocate dataset. Due to the large number of species whose transcripts are included in the dataset, the PPI information cannot be used in this instance. Consequently, only 5-mer and distance-based subsequence profiles of mRNA sequences are utilized in the model. Table 8 compares the performance of NN-RNALoc, mRNALoc, and RNATracker on non-human species. In this instance, the total accuracy obtained by NN-RNALoc is 74%, which is 4% higher than RNATracker and 9% higher than mRNALoc. Moreover, in this dataset, NN-RNALoc achieves an MCC of 55%, which is 8% higher than RNATracker and 12% higher than mRNALoc. Finally, for a more detailed evaluation and to determine the impact of each distance-based k-mer on the prediction of mRNA location, the following experiment was conducted on the CeFra-Seq dataset. We independently considered each distance-based profile for k ranging from 0 to 8. Fig 6 depicts the average Pearson correlation in each of the four locations when a single distance-based k-mer profile was used. Using 8-mer distance-based profiles yields the highest correlation in Cytosol, Insoluble, and Nuclear, which are represented by blue, orange, and green curves, respectively, as shown in Fig 6. However, for Membrane, which is depicted by a red curve, the highest correlation is obtained using a 4-mer distance-based profile, although the differences in Pearson correlations are negligible. Therefore, in order to capture all possible patterns in mRNA sequences, we decided to combine the distance-based profiles for all k in the range of 0 to 8.
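
The idea of evaluating each distance-based profile separately and then combining them can be sketched as follows. This is only an illustration under a stated assumption: the profile for a distance k is taken here to be the normalized frequency of each ordered nucleotide pair with exactly k positions between the two nucleotides. The precise definition is the one given in the methods; the function names, the alphabet, and the toy sequence below are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed alphabet; sequences written with U would need mapping
PAIRS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]  # 16 ordered pairs


def distance_profile(seq, k):
    """Normalized counts of ordered nucleotide pairs with k positions between them.

    Assumption for this sketch: k = 0 means adjacent nucleotides.
    """
    counts = dict.fromkeys(PAIRS, 0)
    gap = k + 1
    total = 0
    for i in range(len(seq) - gap):
        key = seq[i] + seq[i + gap]
        if key in counts:  # skip ambiguous symbols such as N
            counts[key] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def combined_profile(seq, max_k=8):
    """Concatenate the profiles for k = 0..max_k into one feature vector."""
    features = []
    for k in range(max_k + 1):
        features.extend(distance_profile(seq, k))
    return features


seq = "ACGTACGTAC"  # toy sequence, not from either dataset
single = distance_profile(seq, 0)
features = combined_profile(seq)
print(len(single), len(features))
```

Under this assumption, each single profile has 16 entries and the combined vector for k from 0 to 8 has 9 × 16 = 144 entries, whereas a plain 8-mer frequency vector would have 4^8 = 65,536 entries.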

Fig 5. Comparison of Pearson correlations and F-measure values of NN-RNALoc algorithm with other methods for two datasets.

(a) The average of Pearson correlation for the CeFra-Seq dataset for four locations. (b) The average of F-score values for the five locations in the RNALocate dataset.

https://doi.org/10.1371/journal.pone.0258793.g005

Fig 6. Pearson correlation obtained by NN-RNALoc on the CeFra-Seq dataset when employing each distance-based profile for k in the range 0 to 8, individually.

Four locations are represented in four different colors; blue: Cytosol, orange: Insoluble, green: Nuclear, red: Membrane.

https://doi.org/10.1371/journal.pone.0258793.g006

Table 8. Performance of NN-RNALoc, RNATracker(fixed length mode) and mRNALoc on non-human mRNAs of RNALocate dataset.

https://doi.org/10.1371/journal.pone.0258793.t008

Our method has been evaluated using two different datasets. The first dataset, CeFra-Seq, uses a continuous set of values to represent the localization probability for each of the four compartments. Hence, we predict a probability value for each compartment of this dataset and use Pearson and Spearman correlations to assess the performance of the models. Using our method, we can either select one location using the maximum probability value or select multiple locations by setting a probability threshold. The second dataset, compiled from RNALocate, is among the most commonly used datasets for RNA localization, and all methods used in the comparison report their results on it. Each entry of this dataset carries a binary vector indicating whether a specific RNA is present at a given location or not. Given that this dataset contains five locations, the length of this binary vector is also five. We use a classification method on this dataset to predict the localization of a given mRNA. To evaluate the performance of the classification algorithms, precision, recall, F-score, MCC, and accuracy (ACC) were used. We also reported AUC-ROC and AUC-PR for classification performance comparisons. It is crucial to note that for NN-RNALoc, the probability of each location for each mRNA is computed, the probabilities are sorted, and the location with the highest probability is reported as the location of that mRNA. To assign more than one location to an mRNA, a threshold can be chosen, and all locations with probabilities greater than the threshold can be assigned to the mRNA sequence. However, in order to compare the results of this method with those of other methods, we assign the most probable location. Because no approach outperforms the others for predicting all locations, we intend to integrate several methods and predict locations based on a voting measure in a future study.
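
The two ways of turning predicted probabilities into locations described above, taking the most probable location or keeping every location above a threshold, can be sketched as follows. The function name, the threshold value, and the probabilities are illustrative assumptions rather than the study's implementation.

```python
def assign_locations(probabilities, threshold=None):
    """Turn a {location: probability} mapping into a list of predicted locations.

    With threshold=None, only the most probable location is returned.
    Otherwise, every location whose probability exceeds the threshold is
    returned, ordered from most to least probable.
    """
    if not probabilities:
        raise ValueError("at least one location probability is required")
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    if threshold is None:
        return ranked[:1]
    return [loc for loc in ranked if probabilities[loc] > threshold]


# Hypothetical output for one mRNA over the five RNALocate locations.
probs = {
    "Cytosol": 0.45,
    "Nucleus": 0.35,
    "ER": 0.15,
    "EX": 0.04,
    "Mitochondria": 0.01,
}
print(assign_locations(probs))                 # single most probable location
print(assign_locations(probs, threshold=0.3))  # all locations above the threshold
```

The single-location mode corresponds to the setting used for comparison with other methods, while the threshold mode illustrates how a multi-location prediction could be obtained from the same probabilities.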

Conclusion

NN-RNALoc is one of the few proposed methods that use neural network-based approaches to examine the cellular localization of mRNAs. As a result of the explosive growth of biological sequences discovered in the post-genomic era, and in order to use them in a timely manner for a variety of bioinformatics problems such as RNA and protein localization or drug development, a significant amount of sequence-based information, such as PTM (post-translational modification) sites in proteins, has been successfully predicted [39]. The rapid development of sequential bioinformatics and structural bioinformatics, as well as the introduction of computational methodologies for this purpose, has led to an unprecedented revolution in this field of study. Consequently, computational (or in silico) methods were also utilized in this study. Localization of messenger RNA (mRNA) molecules within the cytoplasm provides a foundation for cell polarization, thereby underpinning developmental processes such as asymmetric cell division, cell migration, neuronal maturation, and embryonic patterning [40]. The enormous benefit of mRNA targeting is that it allows for the regulation of gene expression in both space and time; thus, RNA localization is valuable for understanding cellular functions [40]. NN-RNALoc is a neural network-based tool that aims to predict the subcellular localization of mRNA based on the interaction information of the proteins encoded by the mRNA transcripts. To this end, we developed a novel distance-based subsequence profile for representing mRNA sequences. This encoding, which is more compact and less likely to add redundant data, was created to address the memory and time issues that arise as k in the k-mer representation increases.
Using distance-based sub-sequence profiles, k-mer frequencies, and reduced PPI matrix data, the results demonstrate that NN-RNALoc, a neural network with a simple and transparent architecture, outperforms three previously introduced and powerful methods. This simplicity also drastically reduces the computation time required for model training. Applying a dimension reduction technique, such as PCA, to the PPI data, which form a high-dimensional matrix, is significantly more advantageous than using raw interaction patterns. In the future, additional dimension reduction techniques, such as auto-encoders and PPI-specific compression techniques, can be investigated. In addition, future research can incorporate other important but difficult-to-implement features, such as knowledge of protein 3D structures or their complexes with ligands, which is crucial in numerous studies such as drug design [41]. Therefore, in future versions of NN-RNALoc, the incorporation of protein structural information could also be investigated. Moreover, as a number of recent publications presenting new findings or approaches have shown [42, 43], user-friendly and publicly accessible web servers significantly increase the impact of such work [20, 21]. Therefore, in future work, we will try to build a web server that users can configure and that displays the prediction results.

Supporting information

S1 Table. Results of 30 times 10-fold cross-validation of NN-RNALoc (with and without employing PPI information) compared with RNATracker(fixed length mode) and mRNALoc on the human part of RNALocate dataset.

https://doi.org/10.1371/journal.pone.0258793.s001

(PDF)

S2 Table. Results of AUC-ROC (ROC) and AUC-PR (PR) for each location of the human part of the RNALocate database obtained by NN-RNALoc, SVM-RNALoc, RF-RNALoc, XGBoost-RNALoc, DNN-RNALoc, LightGBM-RNALoc.

https://doi.org/10.1371/journal.pone.0258793.s002

(PDF)

Acknowledgments

Changiz Eslahchi and the other authors would like to thank the School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM). The support of the Computing Center of IPM in performing parallel computing is gratefully acknowledged.

References

  1. Kloc M, Zearfoss NR, Etkin LD. Mechanisms of subcellular mRNA localization. Cell. 2002 Feb 22;108(4):533–44. pmid:11909524
  2. Dominguez D, Freese P, Alexis MS, Su A, Hochman M, Palden T, et al. Sequence, structure, and context preferences of human RNA binding proteins. Molecular cell. 2018 Jun 7;70(5):854–67. pmid:29883606
  3. Ferre F, Colantoni A, Helmer-Citterich M. Revealing protein–lncRNA interaction. Briefings in bioinformatics. 2016 Jan 1;17(1):106–16. pmid:26041786
  4. Gerstberger S, Hafner M, Tuschl T. A census of human RNA-binding proteins. Nature Reviews Genetics. 2014 Dec;15(12):829–45. pmid:25365966
  5. Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013 Jul 11;499(7457):172–7. pmid:23846655
  6. Martin KC, Ephrussi A. mRNA localization: gene expression in the spatial dimension. Cell. 2009 Feb 20;136(4):719–30. pmid:19239891
  7. Smith R. Moving molecules: mRNA trafficking in Mammalian oligodendrocytes and neurons. The Neuroscientist. 2004 Dec;10(6):495–500. pmid:15534035
  8. Masumshah R, Aghdam R, Eslahchi C. A neural network-based method for polypharmacy side effects prediction. BMC bioinformatics. 2021 Dec;22(1):1–7. pmid:34303360
  9. Yan Z, Lécuyer E, Blanchette M. Prediction of mRNA subcellular localization using deep recurrent neural networks. Bioinformatics. 2019 Jul 15;35(14):i333–42. pmid:31510698
  10. Kaewsapsak P, Shechner DM, Mallard W, Rinn JL, Ting AY. Live-cell mapping of organelle-associated RNAs via proximity biotinylation combined with protein-RNA crosslinking. Elife. 2017 Dec 14;6:e29224. pmid:29239719
  11. Zhang T, Tan P, Wang L, Jin N, Li Y, Zhang L, et al. RNALocate: a resource for RNA subcellular localizations. Nucleic acids research. 2017 Jan 4;45(D1):D135–8. pmid:27543076
  12. Garg A, Singhal N, Kumar R, Kumar M. mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization. Nucleic Acids Research. 2020 Jul 2;48(W1):W239–43. pmid:32421834
  13. Zhang ZY, Yang YH, Ding H, Wang D, Chen W, Lin H. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Briefings in Bioinformatics. 2021 Jan;22(1):526–35. pmid:31994694
  14. Wu KE, Parker KR, Fazal FM, Chang HY, Zou J. RNA-GPS predicts high-resolution RNA subcellular localization and highlights the role of splicing. RNA. 2020 Jul 1;26(7):851–65. pmid:32220894
  15. Meher PK, Rai A, Rao AR. mLoc-mRNA: predicting multiple sub-cellular localization of mRNAs using random forest algorithm coupled with feature selection via elastic net. BMC bioinformatics. 2021 Jun 24;22(1):342. pmid:34167457
  16. Wang D, Zhang Z, Jiang Y, Mao Z, Wang D, Lin H, et al. DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic acids research. 2021 May 7;49(8):e46-. pmid:33503258
  17. Asim MN, Ibrahim MA, Malik MI, Zehe C, Cloarec O, Trygg J, et al. EL-RMLocNet: An explainable LSTM network for RNA-associated multi-compartment localization prediction. Computational and Structural Biotechnology Journal. 2022 Jan 1;20:3986–4002. pmid:35983235
  18. Mirzaei Mehrabad E, Hassanzadeh R, Eslahchi C. PMLPR: A novel method for predicting subcellular localization based on recommender systems. Scientific reports. 2018 Aug 13;8(1):12006. pmid:30104743
  19. Jamali R, Eslahchi C, Jahangiri-Tazehkand S. Psl-recommender: protein subcellular localization prediction using recommender system. bioRxiv. 2018 Nov 5:462812.
  20. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of theoretical biology. 2011 Mar 21;273(1):236–47. pmid:21168420
  21. Chou KC. Advances in predicting subcellular localization of multi-label proteins and its implication for developing multi-target drugs. Current medicinal chemistry. 2019 Aug 1;26(26):4918–43. pmid:31060481
  22. Chou KC. Impacts of pseudo amino acid components and 5-steps rule to proteomics and proteome analysis. Current topics in medicinal chemistry. 2019 Oct 1;19(25):2283–300. pmid:31648642
  23. Aken BL, Achuthan P, Akanni W, Amode MR, Bernsdorff F, Bhai J, et al. Ensembl 2017. Nucleic acids research. 2017 Jan 4;45(D1):D635–42. pmid:27899575
  24. Cui T, Dou Y, Tan P, Ni Z, Liu T, Wang D, et al. RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic acids research. 2022 Jan 7;50(D1):D333–9. pmid:34551440
  25. Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, et al. The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research. 2016 Oct 18:gkw937. pmid:27924014
  26. Asgari E, Garakani K, McHardy AC, Mofrad MR. MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples. Bioinformatics. 2018 Jul 1;34(13):i32–42. pmid:29950008
  27. Gudenas BL, Wang L. Prediction of LncRNA subcellular localization with deep learning from sequence features. Scientific reports. 2018 Nov 6;8(1):16385. pmid:30401954
  28. Kirk JM, Kim SO, Inoue K, Smola MJ, Lee DM, Schertzer MD, et al. Functional classification of long non-coding RNAs by k-mer content. Nature genetics. 2018 Oct;50(10):1474–82. pmid:30224646
  29. Hart M, Kern F, Backes C, Rheinheimer S, Fehlmann T, Keller A, et al. The deterministic role of 5-mers in microRNA-gene targeting. RNA biology. 2018 Jun 3;15(6):819–25. pmid:29749304
  30. Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic acids research. 2015 Jul 1;43(W1):W65–71. pmid:25958395
  31. Jolliffe IT, Cadima J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016 Apr;374(2065):20150202.
  32. Eckle K, Schmidt-Hieber J. A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Networks. 2019 Feb 1;110:232–42. pmid:30616095
  33. Tiwari S. Activation functions in neural networks. geeksforgeeks. org. 2020.
  34. Van Erven T, Harremos P. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory. 2014 Jun 12;60(7):3797–820.
  35. Berrar D. Cross-Validation. In: Ranganathan S, Gribskov M, Nakai K, Schönbach C, editors. Encyclopedia of Bioinformatics and Computational Biology. Oxford: Academic Press; 2019, 542–545.
  36. Gulli A, Pal S. Deep learning with Keras. Packt Publishing Ltd; 2017 Apr 26.
  37. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014 Dec 22.
  38. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. the Journal of machine Learning research. 2011 Nov 1;12:2825–30.
  39. Chou KC. Progresses in predicting post-translational modification. International Journal of Peptide Research and Therapeutics. 2020 Jun;26(2):873–88.
  40. Medioni C, Mowry K, Besse F. Principles and roles of mRNA localization in animal development. Development. 2012 Sep 15;139(18):3263–76. pmid:22912410
  41. Greer J, Erickson JW, Baldwin JJ, Varney MD. Application of the three-dimensional structures of protein target molecules in structure-based drug design. Journal of medicinal chemistry. 1994 Apr;37(8):1035–54. pmid:8164249
  42. Chen W, Tang H, Ye J, Lin H, Chou KC. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy-Nucleic Acids. 2016 Jan 1;5. pmid:28427142
  43. Liu B, Fang L, Long R, Lan X, Chou KC. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 2016 Feb 1;32(3):362–9. pmid:26476782