Base pair probability estimates improve the prediction accuracy of RNA non-canonical base pairs

doi:10.1371/journal.pcbi.1005827

Fig 1.

Prediction of extended secondary structure with CycleFold.

(A) A predicted structure for a Sarcin-Ricin loop sequence form Rattus norvegicus [50] using CycleFold with the MFE algorithm. Correctly predicted canonical pairs are drawn with heavy black lines, correctly predicted non-canonical pairs are light black lines, and the incorrectly predicted non-canonical pair is shown with a gray dashed line. The G-A pair at the base of the tetraloop is not present in the reference structure because the 3’ A is not stacked on the subsequent G, but is instead in contact with a protein, Restrictocin. (B) The probability dot plot calculated using Cyclefold with the partition function algorithm. The upper right triangle shows pairs with estimated probabilities > 0.01, color-coded by pairing probability. The lower left triangle shows the pairs that are present in the reference structure. Each dot represents a single base pair, and nucleotide index (starting with 1 at the 5’ end) is shown along the x and y axes.

More »

Expand

Fig 2.

Benchmark of single-sequence prediction of canonical base pairs.

(A) Prediction accuracy of the lowest free energy structure, evaluated on canonical pairs. (B) Prediction with CycleFold, using structures composed of highly probable canonical pairs. Sensitivity and PPV are reported for structures with probability greater than a threshold labeled on the plot). This demonstrates that the threshold stringency provides a tradeoff in terms of sensitivity and PPV.

More »

Expand

Fig 3.

Benchmark of single-sequence prediction of non-canonical base pairs.

(A) Prediction accuracy of the lowest free energy structure, evaluated on non-canonical pairs. This includes a calculation where CycleFold is constrained to include the known canonical base pairs to illustrate the performance of the NCM approach when canonical base pairs are known. (B) Prediction with CycleFold, using structures composed of highly probable non-canonical pairs. Sensitivity and PPV are reported for structures with probability greater than a specified threshold (labeled on the plot). This demonstrates that the threshold stringency provides a tradeoff in terms of sensitivity and PPV.

More »

Expand

Fig 4.

Prediction of canonical base pairs by predicting a conserved structure using multiple homologous sequences for (A) an MVE virus nuclease resistant RNA [53], (B) a D. radiodurans SRP hairpin domain [54], and (C) a O. sativa Twister ribozyme [55]. Prediction accuracy is shown for structures composed of highly probable pairs using information from a single sequence (blue) or a TurboFold calculation with 10 sequences (red). Also shown is prediction accuracy using evolutionary couplings from the plmc program [16] (green).

More »

Expand

Fig 5.

Prediction of non-canonical base pairs by predicting the conserved structure with multiple homologous sequences for (A) an MVE virus nuclease resistant RNA, (B) a D. radiodurans SRP hairpin domain, and (C) a O. sativa Twister ribozyme. Prediction accuracy is shown for structures composed of highly probable pairs using information from a single sequence (blue) or a TurboFold calculation on 10 sequences (red). In panels A and B, no blue line is present because the single sequence prediction did not correctly predict any pairs. Also shown is prediction using evolutionary couplings from the plmc program [16] (green).

More »

Expand

Fig 6.

An example of a pseudo-energy calculation using the NCM model.

ΔG_junction is evaluated for each pair of NCMs that share an edge, i.e. the ones that have an overlapping base pair. This term depends on the identities of the two NCMs (that is, the length of the 5’ and 3’ cycles for a double-stranded NCM, or the total length for a single-stranded NCM), and the nucleotides in the common base pair. This term is evaluated for the junction of NCM a with NCM b, NCM b with NCM c, and NCM c with NCM d.

More »

Expand

Fig 7.

A recursion diagram [41] illustrating the NCM partition function algorithm.

Filled regions indicate terms that are being added to the partition function, and empty regions indicate results that were previously calculated. Solid lines indicate nucleotides that must be paired, while dotted lines indicate nucleotides that may or may not be paired.

More »

Expand