RNAtranslator: Modeling protein-conditional RNA design as sequence-to-sequence natural language translation

doi:10.1371/journal.pcbi.1013541

Fig 1.

Overview of RNAtranslator pipeline

The model follows an encoder-decoder transformer framework, where the encoder processes the protein sequence using positional encoding and encoder blocks to extract an embedding. The decoder takes the RNA sequence as input, applying self-attention, encoder-decoder attention, and feed-forward layers to learn to generate target-specific RNA sequences. Training occurs in two steps: large-scale pretraining with experimental and computationally predicted interactions, followed by fine-tuning with experimental interactions. During inference, the model requires only the protein sequence as input and generates novel RNA sequences through iterative sampling. We evaluate the designed RNAs in two ways: (i) we predict their in silico binding affinity to protein X with DeepCLIP^X, a model trained on the CLIPdb^X subset that contains only interactions for protein X and is pre-split into training and test sets; and (ii) we analyse each RNA–protein complex by molecular dynamics simulation.
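The inference step described in this caption (protein sequence in, RNA sequence out through iterative sampling) can be sketched as a generic autoregressive loop. This is a minimal illustration, not the RNAtranslator implementation: `toy_next_token_probs` is a hypothetical stand-in for the decoder's output distribution, and the vocabulary, end-of-sequence token, temperature parameter, and placeholder protein string are all assumptions.

```python
import random

VOCAB = ["A", "C", "G", "U", "<eos>"]  # assumed token set, for illustration only


def toy_next_token_probs(protein, prefix):
    """Hypothetical stand-in for the decoder: one probability per VOCAB token.

    A real model would condition on the encoded protein and the RNA prefix;
    this toy version only makes <eos> more likely as the prefix grows.
    """
    p_eos = min(0.9, len(prefix) / 50.0)
    p_nt = (1.0 - p_eos) / 4.0
    return [p_nt, p_nt, p_nt, p_nt, p_eos]


def sample_rna(protein, next_token_probs, max_len=64, temperature=1.0, seed=None):
    """Draw one token at a time until <eos> is sampled or max_len is reached."""
    rng = random.Random(seed)
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(protein, prefix)
        # Temperature scaling: weights p_i^(1/T); random.choices normalises them.
        weights = [p ** (1.0 / temperature) for p in probs]
        token = rng.choices(VOCAB, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        prefix.append(token)
    return "".join(prefix)


# Placeholder protein string; any amino-acid sequence would do for this toy model.
rna = sample_rna("MEEPQSDPSV", toy_next_token_probs, seed=0)
print(rna)
```

Fixing the seed makes the sampled sequence reproducible, while different seeds yield different candidate RNAs for the same protein input.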

Fig 2.

Evaluation of designed RNA molecules to bind to three therapeutic target proteins: p53, thrombin and EGFR

(A) The full pipeline for designing RNA sequences that bind to target proteins. Given the target protein’s sequence, RNAtranslator generates an RNA sequence to bind it. The 3D structure of the designed RNA sequence is predicted using RhoFold+, and the RNA-protein complex is then modeled using HDock. Finally, molecular dynamics simulations are conducted to test how well the RNA binds. (B–J) Evaluation of the designed RNA for three protein targets: p53 (B–D), thrombin (E–G), and EGFR (H–J). For each target, the designed RNA is compared with two RNA sequences: a validated RNA known to bind the target, and a randomly selected natural RNA sequence. (B, E, H) Three-dimensional structures of the RNA-protein complexes formed after molecular dynamics simulations using the designed RNA sequences. The zoomed-in panels highlight atomic contacts and hydrogen bonds at the interaction interface. (C, F, I) Distributions of hydrogen bonds and binding energies observed during molecular dynamics simulations. The designed RNA sequences form a similar or greater number of hydrogen bonds compared to validated aptamers, and more than random natural RNAs. Additionally, the binding energies of the designed RNAs are lower (indicating stronger binding) than those of random RNAs, and are often comparable to or better than those of the validated aptamers. (D, G, J) Energy landscapes based on root-mean-square deviation (RMSD) and radius of gyration (RG) during molecular dynamics simulations. For p53 and thrombin (D and G), the designed RNA sequences converge to stable conformations characterized by low RMSD and compact structures, comparable to those of the validated aptamers. In contrast, for EGFR (J), the designed RNA does not exhibit a clearly stable structure, displaying more variability in both conformation and compactness.
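The two coordinates used for the energy landscapes in this caption, RMSD and radius of gyration, have standard definitions that can be sketched in a few lines. The functions below are an illustrative sketch only: they assume frames are already superimposed on the reference (no alignment is performed), use equal masses by default, and are not taken from the authors' analysis code.

```python
import math


def rmsd(coords, reference):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumes the frame has already been superimposed on the reference.
    """
    if len(coords) != len(reference):
        raise ValueError("coordinate sets must have the same number of atoms")
    sq = sum((a - b) ** 2 for p, q in zip(coords, reference) for a, b in zip(p, q))
    return math.sqrt(sq / len(coords))


def radius_of_gyration(coords, masses=None):
    """Mass-weighted RMS distance of atoms from their centre of mass."""
    masses = masses or [1.0] * len(coords)  # assumption: equal masses if none given
    total = sum(masses)
    center = [sum(m * p[i] for m, p in zip(masses, coords)) / total for i in range(3)]
    sq = sum(
        m * sum((p[i] - center[i]) ** 2 for i in range(3))
        for m, p in zip(masses, coords)
    )
    return math.sqrt(sq / total)


# Toy two-atom example: the frame is the reference shifted by 1 along z.
reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
frame_rmsd = rmsd(frame, reference)        # 1.0
frame_rg = radius_of_gyration(reference)   # 1.0
```

A low RMSD indicates that the complex stays close to its reference conformation, and a small radius of gyration indicates a compact structure, which is how the caption reads the p53 and thrombin panels.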

Fig 3.

Comparison of predicted binding affinities between generated and natural RNAs for RBM5 and ELAVL1 proteins as targets

(A) Distribution of binding affinities for RBM5 protein predicted by DeepCLIP. (B) Distribution of binding affinities for ELAVL1 protein predicted by DeepCLIP. (C) HDOCKlite scores for the top 100 models for RBM5-RNA complexes. (D) HDOCKlite scores for the top 100 models of ELAVL1-RNA complexes. For HDOCKlite scores, lower values indicate better docking (stronger predicted binding).

Fig 4.

Molecular dynamics simulations showing interactions of RBM5 and ELAVL1 with RNAs generated by RNAtranslator and natural RNAs

(A) and (B) show simulation results for two RNA-binding proteins: RBM5 and ELAVL1, respectively. For each protein, three RNA sequences are compared: the RNA designed by RNAtranslator, a natural RNA known to bind the protein, and a randomly selected natural RNA. The top row displays free energy landscapes projected onto root-mean-square deviation (RMSD) and radius of gyration (RG). The RNAtranslator-designed RNAs reach energy minima that are comparable to or deeper than those of natural binding RNAs. The bottom row presents simulation metrics over a 20-nanosecond trajectory. The RNAtranslator-designed RNAs form more atomic contacts and achieve lower (i.e., more favorable) binding energies than random RNAs, and perform on par with or better than the natural binders. Hydrogen bond distributions further support the predicted RNA-protein interactions.
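A free energy landscape over (RMSD, RG) is conventionally obtained by Boltzmann inversion of the 2D histogram of trajectory frames, F = -kT ln(P / P_max). The caption does not state how the landscapes were computed, so the sketch below is an assumption about the general technique rather than the authors' procedure; the bin width, temperature, and function name are hypothetical.

```python
import math
from collections import Counter

KB_KJ_PER_MOL_K = 0.0083144626  # molar gas constant in kJ/(mol*K)


def free_energy_landscape(rmsd_values, rg_values, bin_width=0.1, temperature=300.0):
    """Return {(rmsd_bin, rg_bin): free energy in kJ/mol}.

    Frames are binned on a regular grid; the most populated bin is set to 0,
    and less populated bins get positive free energies.
    """
    counts = Counter(
        (math.floor(r / bin_width), math.floor(g / bin_width))
        for r, g in zip(rmsd_values, rg_values)
    )
    max_count = max(counts.values())
    kt = KB_KJ_PER_MOL_K * temperature
    return {b: -kt * math.log(c / max_count) for b, c in counts.items()}


# Toy trajectory: three frames in one basin, one frame elsewhere.
rmsd_series = [0.12, 0.15, 0.18, 0.55]
rg_series = [1.01, 1.02, 1.03, 1.55]
fel = free_energy_landscape(rmsd_series, rg_series)
```

In this toy example the bin holding three frames sits at 0 kJ/mol and the singly visited bin sits at kT ln 3 above it, which is the sense in which a "deeper" minimum corresponds to a more frequently visited conformation.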

Fig 5.

Evaluating RNAtranslator generalization to unseen proteins and robustness

(A) For PRPF8, we compare designed RNAs, natural binders, and random RNAs using molecular dynamics. RNAtranslator sequences form more hydrogen bonds and show lower (better) binding energy than random RNAs. Their free-energy landscape (FEL) also shows a deeper basin than those of the natural and random RNAs, indicating stable binding. (B) For PRP4K, RNAtranslator sequences again show more contacts, stronger binding energy, and more hydrogen bonds than random RNAs. FEL plots confirm that the RNAtranslator complex reaches the deepest energy basin, showing stable and strong binding.

Fig 6.

Binding affinity comparison of RNAtranslator-generated RNAs and natural RNAs

RNAtranslator-generated RNAs show high binding affinities across nine protein targets. The distributions of binding scores indicate that RNAtranslator-generated RNAs generally show affinities comparable to naturally binding RNAs, with substantial overlaps observed for most proteins.

Fig 7.

Stability analysis of RNAtranslator-generated RNAs compared to natural RNAs

Minimum free energy (MFE) distributions are shown in the first row, where RNAtranslator-generated RNAs achieve stability levels comparable to natural binding RNAs, reinforcing their thermodynamic favorability. Ensemble free energy distributions further validate the structural stability of RNAtranslator-generated RNAs in a thermodynamic ensemble. The GC content comparison indicates that RNAtranslator-generated RNAs closely match the natural binding RNAs, suggesting structural robustness. The distribution of RNA sequence lengths across different groups shows that RNAtranslator-generated RNAs exhibit a broad yet biologically relevant length distribution.
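Of the four quantities in this caption, GC content and sequence length need no folding software and can be computed directly from the sequences. The helper below is a minimal sketch with hypothetical names and toy sequences; MFE and ensemble free energy would require an RNA folding package and are not reproduced here.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in an RNA (or DNA) sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize(sequences):
    """Per-group summary: list of (length, GC fraction) pairs, one per sequence."""
    return [(len(s), gc_content(s)) for s in sequences]


# Toy sequences, for illustration only (not from the paper's data).
generated = ["GGCAUU", "GCGC"]
summary = summarize(generated)  # [(6, 0.5), (4, 1.0)]
```

Comparing these per-sequence values between generated, natural binding, and random groups gives the kind of distributions the caption describes.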

Fig 8.

Analysis of RNAtranslator attention through visualization of cross-attention

This figure visualizes the cross-attention analysis of the decoder in RNAtranslator. The model is run on up to 1000 RNA–protein pairs per RNA-binding protein (RBP), collecting attention maps from all layers and heads. For each protein position, the maximum attention weight received from any RNA position is recorded. To assess the model’s focus on known RNA-binding domains, an attention ratio is defined as the maximum attention score within a known domain divided by the maximum score outside the domain. (A) Four representative RBPs exhibit attention ratios greater than one (dashed line), indicating that the model assigns higher attention to known binding regions. (B) When averaged across all RBPs, this attention ratio peaks in the middle decoder layers (L1–L4), suggesting these layers contribute most to identifying binding sites.
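The attention ratio defined in this caption can be sketched for a single cross-attention map as follows. This is an illustrative reading of the definition, not the authors' code: the matrix layout (rows are RNA positions, columns are protein positions), the half-open domain interval, and the function name are assumptions, and the aggregation over layers, heads, and RNA–protein pairs is left out.

```python
def attention_ratio(attn, domain):
    """Max attention inside a known domain divided by max attention outside it.

    attn   : list of rows, one per RNA position; each row has one weight per
             protein position (assumed layout).
    domain : (start, end) protein-position interval, half-open.
    """
    n_protein = len(attn[0])
    # For each protein position, keep the largest weight received from any RNA position.
    per_residue = [max(row[j] for row in attn) for j in range(n_protein)]
    start, end = domain
    inside = per_residue[start:end]
    outside = per_residue[:start] + per_residue[end:]
    if not inside or not outside:
        raise ValueError("domain must cover part, but not all, of the protein")
    return max(inside) / max(outside)


# Toy map: 2 RNA positions x 4 protein positions, domain = protein positions 1-2.
toy_attn = [
    [0.25, 0.5, 0.125, 0.125],
    [0.125, 0.25, 0.5, 0.125],
]
ratio = attention_ratio(toy_attn, (1, 3))  # 0.5 / 0.25 = 2.0
```

A ratio above one, as in this toy example, means the strongest attention falls inside the annotated domain, matching the dashed-line threshold described in panel (A).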
