Fig 1.
(1) The sequence of nucleotides of the RNA molecule is first encoded numerically to facilitate its treatment. (2) The numerically encoded sequence is analyzed using the FFT to detect stems quickly, such that stems are added iteratively to form parallel folding paths. (3) The folding pathways form a graph that connects potential intermediate secondary structures. Starting with the unfolded structure, our kinetic ansatz predicts a complete folding trajectory (i.e. when the RNA molecule adopts a specific structure).
Fig 2.
Algorithm execution for one example sequence which requires two steps.
(Step 1) From the correlation cor(k), we select one peak, which corresponds to a positional lag k. Then, we search for the largest stem and form it. Two fragments, “In” (the interior part of the stem) and “Out” (the exterior part of the stem), are left, but only the “Out” fragment may contain a new stem to add. (Step 2) The procedure is called recursively on the “Out” sequence fragment only. The correlation cor(k) between the “Out” fragment and its mirror is then computed, and analyzing the positional lags k allows a new stem to be formed. Finally, no more stems can be formed on the remaining fragment (colored in blue), so the procedure stops.
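The correlation step in this caption can be illustrated with a minimal sketch. It is not the RAFFT implementation: the one-hot encoding, the restriction to Watson-Crick pairs (no G-U wobble), and the function names `one_hot` and `pair_correlation` are all assumptions made for illustration. Correlating a sequence with its mirror is written here as an FFT convolution, so that entry k counts the complementary positions i < j with i + j = k.

```python
import numpy as np

# Assumption: only Watson-Crick pairs are counted (no G-U wobble).
PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]

def one_hot(seq):
    """Hypothetical numerical encoding: one 0/1 channel per nucleotide."""
    return {b: np.array([1.0 if s == b else 0.0 for s in seq]) for b in "ACGU"}

def pair_correlation(seq):
    """cor[k] = number of complementary position pairs (i, j), i < j, with i + j = k."""
    n = 2 * len(seq) - 1                      # length of the full linear convolution
    chan = one_hot(seq)
    spec = {b: np.fft.rfft(chan[b], n) for b in "ACGU"}
    total = np.zeros(n)
    for a, b in PAIRS:
        # Product of spectra = convolution of channel a with channel b.
        total += np.fft.irfft(spec[a] * spec[b], n)
    # Each unordered pair is counted twice, once as (a, b) and once as (b, a).
    return np.rint(total / 2).astype(int)

cor = pair_correlation("GGGAAACCC")
best_lag = int(np.argmax(cor))                # lag with the most complementary positions
```

A peak in `cor` only indicates how many positions are complementary at that lag; locating the largest contiguous stem at the selected lag is a separate search that this sketch does not perform.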
Fig 3.
Fast folding graph constructed using RAFFT.
In this example, the sequence is folded in two steps: starting from the unfolded structure, the N = 5 most stable stems found are stored in stack 1. From stack 1, multiple stems can be formed but only the N = 5 most stable are stored in stack 2. All secondary structure visualizations were obtained using VARNA [37].
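The stack update described here, keeping only the N most stable candidates at each step, can be sketched as follows. The function name `next_stack`, the `(free_energy, structure)` tuple layout, and the energy values are hypothetical and chosen only for illustration; "most stable" is taken to mean lowest free energy.

```python
import heapq

def next_stack(candidates, n_keep=5):
    """Keep the n_keep lowest-free-energy structures among the candidates.

    candidates: iterable of (free_energy, dot_bracket) tuples (hypothetical layout).
    Duplicate structures are merged before ranking.
    """
    unique = {structure: energy for energy, structure in candidates}
    return heapq.nsmallest(n_keep, ((e, s) for s, e in unique.items()))

# Made-up energies, for illustration only.
candidates = [
    (-3.1, ".((...))."),
    (-1.0, "..(...).."),
    (-3.1, ".((...))."),   # same structure reached along another path
    (-4.2, "(((...)))"),
    (0.0, "........."),
]
stack = next_stack(candidates, n_keep=2)
```

With these inputs, `stack` holds the two most stable distinct structures, ordered from lowest to highest free energy.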
Fig 4.
For samples of 30 sequences per length, we averaged the execution times of five folding tools. The empirical time complexity is O(L^η), where η is obtained by non-linear regression. RAFFT denotes the naive algorithm, whereas RAFFT(50) denotes the algorithm where 50 structures can be saved per stack.
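As a rough illustration of how an exponent η can be estimated from timing data, the sketch below fits t ≈ a·L^η. Note the substitution: the caption reports a non-linear regression, whereas this sketch uses an ordinary least-squares fit in log-log space, a simpler stand-in that weights errors differently. The function name `fit_power_law` and the timing values are hypothetical.

```python
import numpy as np

def fit_power_law(lengths, times):
    """Fit times ~ a * lengths**eta by linear least squares on log-transformed data."""
    eta, log_a = np.polyfit(np.log(lengths), np.log(times), 1)
    return float(np.exp(log_a)), float(eta)

# Synthetic timings that follow an exact quadratic law (illustration only).
lengths = np.array([100, 200, 400, 800])
times = 2e-6 * lengths ** 2.0
a, eta = fit_power_law(lengths, times)
```

On noise-free synthetic data like this, the fitted exponent matches the one used to generate the timings; on real measurements the two regression approaches can give slightly different values.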
Fig 5.
RAFFT’s performance on folding task.
PPV and sensitivity vs. sequence length. In the left panels, RAFFT (in blue) shows the scores for the structure (out of N = 50 predictions) with the lowest free energy, whereas RAFFT* (in green) shows the best PPV score in that ensemble. Each dot corresponds to the mean performance for a given sequence length, and vertical lines display the standard deviation. The right panels of both figures show the sequence-wise distributions of PPV and sensitivity.
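For reference, the two metrics can be computed from base-pair sets as in the sketch below, assuming the usual definitions PPV = TP / (TP + FP) and sensitivity = TP / (TP + FN) and structures written in dot-bracket notation without pseudoknots. The helper names are hypothetical, and returning 0.0 when a structure has no base pairs is a convention chosen here, not one taken from the source.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs

def ppv_sensitivity(predicted, reference):
    """PPV: fraction of predicted pairs that are correct.
    Sensitivity: fraction of reference pairs that are recovered."""
    pred, ref = base_pairs(predicted), base_pairs(reference)
    tp = len(pred & ref)
    ppv = tp / len(pred) if pred else 0.0
    sens = tp / len(ref) if ref else 0.0
    return ppv, sens

# The prediction misses the innermost pair of the reference hairpin.
ppv, sens = ppv_sensitivity("((.....))", "(((...)))")
```

In this example both predicted pairs are correct, so the PPV is 1, while only two of the three reference pairs are recovered.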
Table 1.
Average performance displayed in terms of PPV and sensitivity.
The metrics were first averaged at fixed sequence length, limiting the over-representation of shorter sequences. The first two rows show the average performance over all sequences for each method. The bottom two rows correspond to the performance on sequences of length ≤ 200 nucleotides. For the ML and MFE methods, only one prediction per sequence was used, whereas 50 predictions per sequence were used for RAFFT. Here RAFFT (respectively RAFFT*) refers to the case where the structure with the lowest free energy (resp. highest PPV) is selected from the ensemble of 50 predictions.
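The two-stage averaging described in this caption (first within each sequence length, then across lengths) can be sketched as below. The function name `length_balanced_mean`, the record layout, and the scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def length_balanced_mean(records):
    """Average scores within each sequence length, then average those per-length means.

    records: iterable of (sequence_length, score) tuples (hypothetical layout).
    """
    by_length = defaultdict(list)
    for length, score in records:
        by_length[length].append(score)
    return mean(mean(scores) for scores in by_length.values())

# Three short sequences and one long one (made-up scores).
records = [(50, 0.8), (50, 0.6), (50, 0.7), (200, 0.2)]
balanced = length_balanced_mean(records)        # each length counts once
pooled = mean(score for _, score in records)    # short sequences dominate
```

Here the pooled mean is pulled toward the scores of the over-represented length, whereas the length-balanced mean gives each length equal weight.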
Fig 6.
PCA for the structures predicted using RAFFT, RNAfold, and MxFold2, compared to the known structures denoted “True”. The arrows represent the directions of the secondary structure types (E = exterior loop, I = interior loop, H = hairpin, B = bulge, S = stacks, M = multi-loop and R = root node).
Fig 7.
Application of the folding kinetic ansatz on CFSE.
(A) Fast-folding graph in four steps and N = 20 structures stored in a stack at each step. The edges are coloured according to ΔΔG. At each step, the structures are ordered by their free energy from top to bottom. The minimum free energy structure found is at the top left of the graph. A unique ID annotates visited structures in the kinetics. For example, “59” is the ID of the MFE structure. (B) MFE (computed with RNAfold) and the native CFSE structure. (C) The change in structure frequencies over time. The simulation starts with the whole population in the open-chain or unfolded structure (ID 0). The native structure (Nat.l) is trapped for a long time before the MFE structure (MFE.l) takes over the population. (D) Folding landscape derived from the 68 distinct structures predicted using RAFFT. The axes are the components optimized by the MDS algorithm, so the base pair distances are mostly preserved. Observed structures are also annotated using the unique ID. MFE-like structures (MFE.l) are at the bottom of the figure, while native-like (Nat.l) are at the top.
Fig 8.
Folding kinetics of CFSE using Treekin.
(A) Barrier tree of the CFSE. From a set of 1.5 × 10^6 sub-optimal structures, 40 local minima were found, connected through saddle points. The tree shows two alternative structures separated by a high barrier with the global minimum (MFE structure) on the right side. (B) Folding kinetics with initial population I1. Starting from an initial population of I1, as the initial frequency decreases, the others increase, and gradually the MFE structure is the only one populated. (C) Folding kinetics with initial population I2. When starting with a population of I2, the native structure (labelled Nat.1) is observable, and gets kinetically trapped for a long time due to the high energy barrier separating it from the MFE structure.