
Conceived and designed the experiments: NJPW IMM. Performed the experiments: NJPW. Analyzed the data: NJPW IMM. Contributed reagents/materials/analysis tools: NJPW IMM. Wrote the paper: IMM.

The authors have declared that no competing interests exist.

The prediction of functional RNA structures has attracted increased interest, as it allows us to study the potential functional roles of many genes. RNA structure prediction methods, however, assume that there is a unique functional RNA structure and also do not predict functional features required for

Many non-coding genes exert their function via an RNA structure which starts emerging while the RNA sequence is being transcribed from the genome. The resulting folding pathway is known to depend on a variety of features such as the transcription speed, the concentration of various ions and the binding of proteins and other molecules. Not all of these influences can be adequately captured by the existing computational methods which try to replicate what happens

RNA molecules play diverse roles in many of the most basic cellular processes. In translation, for instance, the protein-coding ‘message’ is encoded in a messenger RNA (mRNA), while transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) take part in the catalytic process itself. Micro RNAs are implicated in regulating mRNA availability. A range of other non-protein-coding RNAs (ncRNAs) have been identified

For many classes of ncRNA molecules studied so far, RNA structure plays a crucial part in defining their functional role in the cell. We know, for example, that tRNAs assume a distinct three-dimensional conformation in order to function properly during translation, and that the functional configuration of the ribosome complex relies both on properly folded rRNAs and on many proteins binding to these rRNAs. In contrast to proteins, we can typically learn a lot about an RNA's functionality by studying only its secondary structure, i.e. the set of base-pairing nucleotide positions in the RNA sequence. This is because most RNA sequences studied so far fold in a hierarchical manner, with the secondary structure emerging first and the tertiary contacts between secondary structure elements emerging later.

The view that one RNA sequence has one functional RNA structure turns out to be too simplistic. We now know of several cases where a given RNA sequence has more than one functionally important RNA structure, e.g. ribo-switches

There exist by now a wide range of computational methods that can predict an RNA secondary structure given an RNA sequence. Many of these methods

The program RNA

The increasing interest in RNA folding pathways has spurred the development of computational methods for RNA structure prediction which take the folding kinetics explicitly into account. These methods try to model the physical process by which an unfolded RNA folds into its functional conformation(s) as a continuous-time Markov process which allows only local rearrangements of secondary structures. If we knew all entries of the transition rate matrix

Mironov and Lebedev

K

Other computational approaches for predicting kinetic folding pathways consider energy landscapes in order to reduce the size of the state-space. The energy landscape can be viewed as a barrier tree, where the local minima are leaves in the tree which are connected to one or more gradient basins via saddle-points. Saddle-points are the lowest energy structures that connect the gradient basins around these local minima

Long sequences are problematic for all the above methods since the number of possible secondary structures, and therefore the worst-case complexity of the energy landscape, grows exponentially with the sequence length. The K

All of the above prediction methods take at most the RNA sequence itself, the temperature, the

It is also challenging to study kinetic folding pathways experimentally. A range of powerful experimental techniques now exists for studying large sets of RNA sequences in an ensemble-averaged way, such as UV melting, isothermal titration calorimetry, circular dichroism and chemical foot-printing, as well as, more recently, single-molecule techniques such as fluorescence correlation spectroscopy

We propose a conceptually new computational approach for studying RNA folding pathways

predict evolutionarily conserved helices that are likely to play a role in the co-transcriptional formation of the functional RNA structure(s)

do not require a detailed knowledge of the

estimate reliability values for all predictions

present a comprehensive performance evaluation

have a performance which is robust with respect to sequence length

If a structural feature is functionally important, it is typically well conserved in groups of related RNAs, even if the level of primary sequence conservation is low. These conserved structural features can be detected in alignments of several evolutionarily related RNA sequences by identifying pairs of alignment columns where the base-pairing potential, but not necessarily the primary sequence itself, has been conserved. The analysis of these so-called co-varying alignment columns is even capable of identifying tertiary structure motifs
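The idea of co-varying alignment columns can be illustrated with a minimal sketch. It is not the likelihood-based scoring used in this work; it only counts, for two columns, how many sequences retain a canonical or G-U wobble pair and how many distinct pair types occur. The function name `column_pair_summary` and the choice of allowed pairs are assumptions made for illustration.

```python
# Illustrative only: a simple count-based view of co-variation,
# not the phylogenetic likelihood model described in the text.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # assumed set of allowed pairs


def column_pair_summary(alignment, i, j):
    """Summarise the base-pairing potential of alignment columns i and j.

    `alignment` is a list of equal-length, gapped RNA strings.
    Sequences with a gap in either column are ignored here.
    """
    pairs = [seq[i] + seq[j] for seq in alignment
             if seq[i] != "-" and seq[j] != "-"]
    compatible = [p for p in pairs if p in CANONICAL]
    fraction = len(compatible) / len(pairs) if pairs else 0.0
    return {
        "fraction_compatible": fraction,              # pairing potential conserved?
        "distinct_pair_types": len(set(compatible)),  # > 1 hints at co-variation
    }


example = ["GAAAC", "CAAAG", "AAAAU"]
summary = column_pair_summary(example, 0, 4)
```

In this toy alignment the primary sequence differs in every row of columns 0 and 4, yet all three sequences can still form a base-pair, which is the signal that comparative methods exploit.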

We have devised T

In the first step, T

T

See the text for more details.

We model the evolution of base-paired and un-paired alignment columns along the evolutionary input tree with two reversible, time-continuous Markov chains using the same rate matrices and equilibrium distributions as the comparative RNA structure prediction programs P

In contrast to the customary way of calculating the likelihood, we interpret one-sided gaps in base-paired alignment columns as non-consensus base-pairs rather than missing information. Two-sided gaps, however, are still treated as missing information which amounts to summing over all possible base-pairs when moving “up the tree” in the Felsenstein calculation. This treatment of two-sided gaps makes sense as the length of a helix can shrink or expand over time
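The treatment of gaps as missing information can be sketched with a small Felsenstein-style recursion. The sketch below makes several simplifying assumptions that are not taken from this work: it uses a single-nucleotide, four-state Jukes-Cantor model rather than the base-pair rate matrices referred to above, a uniform equilibrium distribution, and a hypothetical nested-list tree encoding. It therefore only illustrates the "sum over all states" handling of a gap, not the interpretation of one-sided gaps as non-consensus base-pairs.

```python
import math

STATES = "ACGU"


def jc69(t):
    """Jukes-Cantor transition probabilities for branch length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def leaf_vector(symbol):
    """Observed nucleotide -> indicator vector; gap -> all ones (missing data)."""
    if symbol == "-":
        return [1.0] * 4
    return [1.0 if s == symbol else 0.0 for s in STATES]


def partial(node):
    """Conditional likelihoods at a node.

    A node is either a leaf symbol (str) or a list of
    (child, branch_length) pairs (hypothetical encoding).
    """
    if isinstance(node, str):
        return leaf_vector(node)
    result = [1.0] * 4
    for child, t in node:
        p = jc69(t)
        child_l = partial(child)
        for i in range(4):
            # Sum over all states j of the child, moving "up the tree".
            result[i] *= sum(p[i][j] * child_l[j] for j in range(4))
    return result


def likelihood(tree):
    """Likelihood of one alignment column under a uniform root distribution."""
    return sum(0.25 * x for x in partial(tree))
```

Because a gap contributes a vector of ones, a leaf that carries a gap leaves the likelihood of the remaining observations unchanged, which is exactly what "missing information" means in this calculation.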

In the likelihood calculation for two base-paired alignment columns, the Felsenstein algorithm traverses the tree from the leaf nodes (i.e. the observed nucleotides in two base-paired alignment columns) via the internal nodes to the root node of the tree. It sums over

The ability of an RNA sequence to form random helices is known to depend strongly on the sequence itself, in particular its length and its nucleotide and di-nucleotide composition. The log-likelihood value

In the first step, the input alignment is realigned based on primary sequence conservation only using T-C

For each original alignment, we generate 500 randomized alignments. For each shuffled alignment (which we assume to no longer contain any real helices), we detect “conserved” helices that may have appeared by chance and then calculate their log-likelihood values. Both are done in the same way as for the original input alignment. We then combine the log-likelihood values from all 500 randomized alignments into a single histogram of log-likelihood values and use the resulting distribution to assign p-values to the log-likelihood values of the conserved helices in the original input alignment. Conserved helices in alignments with a high structure-formation potential thus require – generally speaking – larger log-likelihood values in order to be considered significant than helices in alignments where the overall structure-formation potential is lower.
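The assignment of p-values from the pooled null distribution can be sketched as follows. The sketch assumes that larger log-likelihood values are more significant, as stated above, and uses a pseudo-count of one so that no p-value is exactly zero; this pseudo-count and the function name `empirical_p_value` are assumptions, not details taken from this work.

```python
from bisect import bisect_left


def empirical_p_value(observed, null_values):
    """Empirical p-value of `observed` against a pooled null sample.

    Returns the fraction of null log-likelihood values that are at least
    as large as the observed one, with a +1 pseudo-count (an assumption).
    """
    ordered = sorted(null_values)
    n_at_least = len(ordered) - bisect_left(ordered, observed)
    return (n_at_least + 1) / (len(ordered) + 1)


# Toy null sample standing in for the pooled values from the shuffled alignments.
null_sample = [1.0, 2.0, 3.0, 4.0]
p_mid = empirical_p_value(3.0, null_sample)
p_high = empirical_p_value(10.0, null_sample)
```

With such a scheme, an alignment whose shuffled versions readily produce high-scoring helices yields a null sample shifted towards larger values, so a real helix needs a correspondingly larger log-likelihood to reach the same p-value.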

T

The output of T

Our data set comprises four sub-sets which have been chosen to represent (a) data where multiple functional RNA secondary structures are known, (b) data where only one functional RNA secondary structure is known and (c) artificially generated data which allows us to investigate some features of T

The R

We select a sub-set of high-quality seed alignments from the R

Applying these four selection criteria, we arrive at a data set of 134 seed alignments which contain 6 to 712 sequences (average is 60 sequences) and whose length ranges from 100 to 1247 bp (average is 221 bp). The total tree length of these alignments ranges from 0.4 to 116.3, the average being 10.0. We call this data set the R

In order to be able to investigate the dependence of the performance on the alignment length and the total tree length in detail, we generate an artificial data set comprising a total of 990 alignments. Each alignment in this artificial dataset is generated as follows. In the first step, an RNA secondary structure from the RNA STRAND database

Structures selected from the RNA STRAND database were binned according to their sequence length (100–199, 200–299, etc. up to 900–999). For the tree length experiment, 10 structures were selected at random from each bin, and for each structure, alignments of 10 sequences were generated with total tree lengths of 0.5, 1, 2, 4, 8, and 16. The artificial data set for which the tree experiments were performed thus consists of 540 artificial alignments. For the alignment length experiment, 50 structures were selected at random from each bin. For each structure, we generated an alignment with 10 sequences and a total tree length of 4. The artificial data set for which the length experiments were performed consists therefore of 450 artificial alignments.
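The sizes of the two artificial sub-sets follow directly from the sampling scheme, as the short check below shows. The variable names are illustrative only.

```python
# Length bins 100-199, 200-299, ..., 900-999 as described in the text.
bins = [(low, low + 99) for low in range(100, 1000, 100)]
tree_lengths = [0.5, 1, 2, 4, 8, 16]

# Tree length experiment: 10 structures per bin, one alignment per tree length.
tree_experiment = len(bins) * 10 * len(tree_lengths)

# Alignment length experiment: 50 structures per bin, one alignment each.
length_experiment = len(bins) * 50

total = tree_experiment + length_experiment
```

Nine bins thus give 540 alignments for the tree length experiment and 450 for the alignment length experiment, which together make up the 990 artificial alignments.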

The

As the

A review of the

The

The formation of the 1∶2 helix during transcription causes the RNA-polymerase to pause. If a ribosome starts translation, it disrupts the 1∶2 helix as soon as it reaches region 1 of the transcript, thus freeing the RNA-polymerase. If tryptophan is limited, the ribosome will pause at the tryptophan codon in region 1, thereby allowing helix 2∶3 to form and simultaneously preventing the formation of the 3∶4 helix which serves as a terminator stem which ends transcription. This allows the

Several protein-mediated ribo-switches which regulate

The performance of new prediction methods is best benchmarked by comparing the set of predicted structures to the set of known structures for a data set that is, ideally, large and diverse and has been carefully and completely annotated (the

The conclusive benchmarking of computational methods for predicting kinetic folding pathways has, so far, been difficult. This is due to several reasons. First, detailed experimental results on folding kinetics, usually done via temperature- or pH-jump kinetic trapping procedures

T

Comparing the helices predicted by T

As we want to know how good T

As is customary, we define the sensitivity as

| | base-pair in known structure | base-pair not in known structure |
| minimum p-value | TP | FP |
| minimum p-value | FN | TN |

T

| p-value | TP | FP |
| p-value | FN | TN |
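The performance measures used in the figures can be computed from the four counts in such a table. The sketch below uses the standard textbook definitions of sensitivity, false positive rate, positive predictive value, F-measure and Matthews correlation coefficient (MCC); it is assumed, not confirmed by the text shown here, that these coincide with the definitions used in this work. Returning 0.0 for an undefined ratio is likewise an assumption.

```python
import math


def performance(tp, fp, fn, tn):
    """Standard performance measures from confusion-matrix counts."""
    sens = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
    fpr = fp / (fp + tn) if fp + tn else 0.0    # false positive rate
    ppv = tp / (tp + fp) if tp + fp else 0.0    # positive predictive value
    f_measure = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sens": sens, "FPR": fpr, "PPV": ppv, "F": f_measure, "MCC": mcc}


# Toy counts for illustration; not results from this study.
scores = performance(tp=8, fp=2, fn=2, tn=88)
```

Evaluating these measures over a range of p-value thresholds yields curves of the kind shown in the performance figures.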

As

The top left figure shows the sensitivity (Sens) as a function of the false positive rate (FPR) and the top right figure the sensitivity as a function of the positive predictive value (PPV). The bottom left figure shows the F-measure and the bottom right figure the MCC as a function of the p-value threshold; see the text for the definitions of the F-measure and the MCC. Note that each data point in the figures above corresponds to the respective performance measure averaged over the entire R

The performance of methods that predict a kinetic folding pathway is known to depend strongly on the length of the input sequence. In order to systematically investigate to what extent the performance of T

The top left figure shows the sensitivity (Sens) as a function of the false positive rate (FPR) for different alignment lengths. The colors indicate the length of the alignment in nucleotides, ranging from 100 to 999 nucleotides. The top right figure shows the sensitivity as a function of the positive predictive value (PPV) for different alignment lengths. The bottom left figure shows the F-measure and the bottom right figure the MCC as a function of the p-value threshold; see the text for the definitions of the F-measure and the MCC. All figures use the same coloring scheme as the top left figure.

T

The top left figure shows the sensitivity (Sens) as a function of the false positive rate (FPR) for different tree lengths. The colors indicate the total length of the maximum-likelihood trees that were derived for the alignments of the artificial data set; they range from 0.5 to 16. The top right figure shows the sensitivity as a function of the positive predictive value (PPV). The bottom left figure shows the F-measure and the bottom right figure the MCC as a function of the p-value threshold; see the text for the definitions of the F-measure and the MCC. All figures use the same coloring scheme as the top left figure.

When comparing the performance plots for the artificial data set to those for the R

When comparing the effect of the total tree length on the performance to the effect that the alignment length has, it is clear that evolutionary diversity, i.e. the total tree length, has a much greater influence on the performance than the alignment length. This can be understood by the way that T

The number of possible bi-secondary RNA structures, i.e. RNA structures that can be viewed as a combination of at most two secondary structures without pseudo-knots, grows exponentially with the sequence length

The

The performance of T

The top left figure shows the sensitivity as a function of the false positive rate (FPR) and the top right figure the sensitivity as a function of the positive predictive value (PPV). The bottom left figure shows the F-measure and the bottom right figure the MCC as a function of the p-value threshold; see the text for the definitions of the F-measure and the MCC.

The top left figure shows the sensitivity as a function of the false positive rate (FPR) and the top right figure the sensitivity as a function of the positive predictive value (PPV). The bottom left figure shows the F-measure and the bottom right figure the MCC as a function of the p-value threshold; see the text for the definitions of the F-measure and the MCC.

A more intuitive way of visualizing the T

The x-axis represents the

The x-axis represents the

The R

T

The T

The T

The four R

For a p-value threshold of

One motivation for devising T

Shown here are two examples, the CsrB/RsmB RNA family (left, RF00018) and the bacterial tmRNA (right, RF00023) for a p-value threshold value of

It is interesting to investigate if the helices predicted by T

In each figure, the x-axis represents the

In the top figure showing the T

T

T

The helices of the pseudo-knotted known structure for the bacterial tmRNA are correctly predicted by T

As opposed to T

Overall, we thus conclude that the presence of multiple functional RNA secondary structures as well as of pseudo-knotted structures is better modelled using T

We devised T

Our comprehensive performance evaluation of T

The T

We hope that the predictions by T

T

Supplementary Information and Figures

(0.96 MB PDF)

I. M. Meyer would like to thank Elena Rivas, Eric Westhof and the participants of the computational RNA workshop in Benasque, Spain, for many inspiring discussions. Both authors would like to thank Björn Voss from the University of Freiburg, Germany, for sharing the