Deep learning framework for RNA 5hmC prediction using RNA language model embeddings

doi:10.1371/journal.pone.0341649

Table 1.

Dataset summary.

Table 2.

Feature descriptors and their sizes. Here, "sample" denotes the number of samples.

Fig 1.

The architecture of the Inception module.

It has four parallel convolutional paths. Each path uses different kernel sizes in the Conv1D layers. Lastly, the outputs from the four paths are concatenated along the channel dimension to form the final output. In each three-line block, the first line represents the input tensor size, the second line represents the layer name, and the third line represents the output tensor size. Batch refers to the batch size.
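
The parallel-path idea described in this caption can be sketched in plain Python. This is a minimal illustration only: the kernel sizes (1, 3, 5, 7), the number of output channels per path, the random weights, and the names `conv1d_same` and `inception_block` are assumptions made for the example and are not taken from the figure. The batch dimension is omitted, and a channel-first layout is used purely for readability.

```python
import random

def conv1d_same(x, weights):
    """Stride-1 1D convolution with zero 'same' padding.

    x:       list of input channels, each a list of `length` floats
    weights: [out_channels][in_channels][kernel_size] nested lists
    Returns a list of out_channels rows, each of the same length as the input.
    """
    kernel_size = len(weights[0][0])
    pad = kernel_size // 2  # preserves length for odd kernel sizes
    length = len(x[0])
    out = []
    for w_out in weights:
        row = []
        for t in range(length):
            acc = 0.0
            for c, w_in in enumerate(w_out):
                for j, w in enumerate(w_in):
                    idx = t + j - pad
                    if 0 <= idx < length:  # positions outside the input count as zero
                        acc += w * x[c][idx]
            row.append(acc)
        out.append(row)
    return out

def inception_block(x, kernel_sizes=(1, 3, 5, 7), out_channels=2, seed=0):
    """Run one convolution per kernel size and concatenate along channels.

    The kernel sizes and channel counts here are hypothetical placeholders.
    """
    rng = random.Random(seed)
    in_channels = len(x)
    outputs = []
    for k in kernel_sizes:
        weights = [[[rng.uniform(-1, 1) for _ in range(k)]
                    for _ in range(in_channels)]
                   for _ in range(out_channels)]
        outputs.extend(conv1d_same(x, weights))  # channel-wise concatenation
    return outputs

# A toy input with 3 channels and sequence length 41.
x = [[float(i + c) for i in range(41)] for c in range(3)]
y = inception_block(x)
print(len(y), len(y[0]))
```

Because every path preserves the sequence length, the four outputs can be stacked along the channel axis, so the output channel count is the sum of the per-path channel counts.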

Table 3.

Model configuration of InTrans-RNA5hmC.

Table 4.

Tuned hyperparameter values from GridSearch conducted on InTrans-RNA5hmC.

Table 5.

10-fold CV performance results of different feature descriptors. The scores are presented in ‘mean±standard deviation’ format. An XGB model was trained using each one of the feature descriptors separately on the Training set. The highest values of each metric are boldfaced. As RiNALMo generates embeddings with a tensor size of (batch size, seq length, embedding size), the embeddings were averaged along the sequence length dimension for comparison.
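
The averaging step mentioned in this caption reduces a (batch size, seq length, embedding size) tensor to (batch size, embedding size). A minimal sketch with nested lists follows; the function name `mean_pool_sequence` is hypothetical, and the sketch assumes every position is included in the average (the caption does not say whether special or padding tokens were excluded).

```python
def mean_pool_sequence(embeddings):
    """Average token embeddings over the sequence-length dimension.

    embeddings: nested list shaped (batch, seq_len, emb_size)
    Returns a nested list shaped (batch, emb_size).
    """
    pooled = []
    for sample in embeddings:
        seq_len = len(sample)
        emb_size = len(sample[0])
        pooled.append([
            sum(token[d] for token in sample) / seq_len
            for d in range(emb_size)
        ])
    return pooled

# Two sequences, two tokens each, embedding size 2.
batch = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [2.0, 6.0]],
]
print(mean_pool_sequence(batch))  # [[2.0, 3.0], [1.0, 3.0]]
```

The pooled output gives one fixed-length vector per sequence, which is the shape a tabular model such as XGB expects.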

Table 6.

Performance results of different deep learning models, which were trained on 80% of the Training set and tested on the Validation set (the remaining 20% of the Training set). The highest values of each metric are boldfaced. The scores are presented in ‘mean±standard deviation’ format. Here, in the {DL1 + DL2} architecture, Word embeddings are used as input to the DL1 branch, and RiNALMo embeddings are used as input to the DL2 branch. This experiment was repeated 20 times with different random seeds.
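
The ‘mean±standard deviation’ summary over repeated runs can be computed as in the sketch below. The function name `summarize_runs`, the four-decimal formatting, and the use of the sample standard deviation (`statistics.stdev`) rather than the population version are assumptions; the caption does not state which variant was used. The scores in the usage line are made-up placeholders, not results from the paper.

```python
import statistics

def summarize_runs(scores, digits=4):
    """Format per-run metric values as 'mean±std'.

    Uses the sample standard deviation (n - 1 denominator), which is an
    assumption; statistics.pstdev would give the population version.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{mean:.{digits}f}±{std:.{digits}f}"

# Placeholder scores from three hypothetical runs.
print(summarize_runs([0.90, 0.92, 0.94]))  # 0.9200±0.0200
```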

Table 7.

Performance results of different ML models, which were trained on 80% of the Training set and tested on the Validation set (the remaining 20% of the Training set). The scores are presented in ‘mean±standard deviation’ format. The highest values of each metric are boldfaced. This experiment was repeated 20 times with different random seeds.
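
A seeded 80/20 split that is repeated with different random seeds can be sketched as follows. The helper name `split_80_20` and the plain (non-stratified) shuffle are assumptions for illustration; the caption does not specify how the split was drawn.

```python
import random

def split_80_20(n_samples, seed):
    """Return (train_indices, validation_indices) for one seeded split.

    A plain shuffle is used here; whether the original split was
    stratified by class is not stated, so this is an assumption.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = n_samples * 8 // 10  # integer arithmetic avoids float rounding
    return indices[:cut], indices[cut:]

# One split per seed; each seed gives a reproducible partition.
for seed in range(3):
    train_idx, val_idx = split_80_20(100, seed)
    print(seed, len(train_idx), len(val_idx))
```

Repeating the experiment then amounts to looping over seeds, training on `train_idx`, scoring on `val_idx`, and summarizing the per-seed scores.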

Table 8.

Comparison with the SOTA methods on the Independent dataset. The highest values of each metric are boldfaced. Metrics not reported by the respective papers are indicated as ‘-’ in the table.

Table 9.

Ablation study results: performance comparison of Inception-only, Transformer-only, and InTrans-RNA5hmC models. Models were trained on the Training set and tested on the Independent test set.

Fig 2.

Model architecture of the proposed model, InTrans-RNA5hmC.

There are two input embeddings: Word embeddings and RiNALMo embeddings. The model has two branches: The Inception branch and the Transformer branch. The Word embeddings and RiNALMo embeddings are fed to the Inception branch and the Transformer branch, respectively. The features from both branches are concatenated and passed through a feed-forward neural network for final predictions. In each three-line block, the first line represents the input tensor size, the second line represents the layer name and the third line represents the output tensor size. Batch refers to the batch size.
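
The fusion step described in this caption (concatenate the features from both branches, then pass them through a feed-forward network) can be illustrated with a deliberately tiny sketch. Everything specific here is an assumption: the function name `fuse_and_predict`, the use of a single linear layer, and the sigmoid output are placeholders, and the actual layer sizes are those given in the figure and model configuration table, not the ones shown.

```python
import math

def fuse_and_predict(inception_features, transformer_features, weights, bias):
    """Concatenate two branch feature vectors and apply one linear layer.

    A single linear layer with a sigmoid is a stand-in for the real
    feed-forward network; it is an assumption made for illustration.
    """
    fused = list(inception_features) + list(transformer_features)
    if len(weights) != len(fused):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * f for w, f in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical branch outputs of sizes 2 and 1.
prob = fuse_and_predict([1.0, 2.0], [3.0], weights=[0.5, -0.25, 0.1], bias=0.0)
print(round(prob, 4))
```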

Fig 3.

t-SNE and UMAP visualizations of the proposed InTrans-RNA5hmC model on the Training dataset. The first row shows the initial vs. learned feature representation using t-SNE. The second row shows the same result using UMAP.

A: t-SNE visualization of the initial feature space, B: t-SNE visualization of the last hidden layer feature representation of the InTrans-RNA5hmC model, C: UMAP visualization of the initial feature space, D: UMAP visualization of the last hidden layer feature representation of the InTrans-RNA5hmC model.

Fig 4.

Comparison of nucleotide count distributions across upstream, downstream, and full regions.

A: Bar plot of nucleotide counts in the upstream and downstream regions. As the sequence length was 41 and the central nucleotide was at position 20 (0-based), the upstream region spans positions 0 to 19 and the downstream region spans positions 21 to 40. The nucleotide counts were averaged across all sequences. B: Bar plot of nucleotide counts in the full region, i.e., the whole sequence. The nucleotide counts were averaged across all sequences.
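
The region definitions in this caption translate directly into slices. The sketch below computes per-region nucleotide counts averaged over sequences; the function name `region_counts` and the `ACGU` alphabet are assumptions (the dataset may encode uracil as T), and the two toy sequences are fabricated solely to exercise the code.

```python
from collections import Counter

SEQ_LEN = 41
CENTER = 20  # 0-based index of the central nucleotide

REGIONS = {
    "upstream": slice(0, CENTER),              # positions 0..19
    "downstream": slice(CENTER + 1, SEQ_LEN),  # positions 21..40
    "full": slice(0, SEQ_LEN),                 # positions 0..40
}

def region_counts(sequences, alphabet="ACGU"):
    """Average nucleotide counts per region across all sequences."""
    n = len(sequences)
    result = {}
    for name, region in REGIONS.items():
        totals = Counter()
        for seq in sequences:
            totals.update(seq[region])
        result[name] = {nt: totals[nt] / n for nt in alphabet}
    return result

# Toy sequences of length 41 (not real data).
toy = ["A" * 20 + "C" + "G" * 20, "U" * 20 + "C" + "A" * 20]
for name, counts in region_counts(toy).items():
    print(name, counts)
```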

Fig 5.

Plots of nucleotide frequencies at each position across sequences for both positive and negative samples.

The first plot shows the positive samples, and the second plot shows the negative samples. At positions 21-26, positive samples have a high frequency of A and a low frequency of G; the opposite holds for negative samples.
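
Per-position nucleotide frequencies of the kind plotted here can be computed column by column over equal-length sequences. This is a minimal sketch: the function name `positional_frequencies` and the `ACGU` alphabet are assumptions, and the toy sequences are not from the dataset. In practice the function would be called once on the positive samples and once on the negative samples.

```python
def positional_frequencies(sequences, alphabet="ACGU"):
    """Return one {nucleotide: frequency} dict per sequence position.

    Assumes all sequences have the same length.
    """
    length = len(sequences[0])
    n = len(sequences)
    freqs = []
    for pos in range(length):
        column = [seq[pos] for seq in sequences]
        freqs.append({nt: column.count(nt) / n for nt in alphabet})
    return freqs

# Toy example with four sequences of length 3.
toy = ["ACG", "AGG", "UCG", "ACG"]
for pos, freq in enumerate(positional_frequencies(toy)):
    print(pos, freq)
```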

Fig 6.

Visualization of positional nucleotide distributions and feature importance.

A: Two Sample Logo (TSL) plot illustrating the nucleotide frequency differences between positive and negative samples. At positions 21-26, positive samples have a high frequency of A and a low frequency of G; the opposite holds for negative samples. B: A bar plot displaying absolute SHAP values averaged over different ranges of nucleotide positions. The x-axis shows various ranges of neighboring nucleotides, while the y-axis shows the corresponding averaged absolute SHAP values.
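
The aggregation behind panel B can be sketched as two averaging steps: first the mean absolute SHAP value per position across samples, then the mean of those values within each position range. The function name `mean_abs_shap_by_range`, the inclusive range convention, and the assumption that one SHAP value is available per sample and position are all illustrative; the numbers below are placeholders, not values from the paper.

```python
def mean_abs_shap_by_range(shap_values, ranges):
    """Average |SHAP| over samples, then over each inclusive position range.

    shap_values: nested list shaped (n_samples, n_positions)
    ranges:      list of (start, end) tuples, both ends inclusive
    """
    n_samples = len(shap_values)
    n_positions = len(shap_values[0])
    per_position = [
        sum(abs(row[p]) for row in shap_values) / n_samples
        for p in range(n_positions)
    ]
    return {
        (start, end): sum(per_position[start:end + 1]) / (end - start + 1)
        for start, end in ranges
    }

# Placeholder SHAP matrix: 2 samples, 4 positions.
shap_matrix = [[1.0, -1.0, 4.0, -2.0],
               [3.0, 1.0, -2.0, 0.0]]
print(mean_abs_shap_by_range(shap_matrix, [(0, 1), (2, 3)]))
# {(0, 1): 1.5, (2, 3): 2.0}
```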
