RNA structure prediction using positive and negative evolutionary information

doi:10.1371/journal.pcbi.1008387

Fig 1.

The CaCoFold algorithm.

(a) Toy alignment of five sequences. (b) The statistical analysis identifies five significantly covarying position pairs in the alignment (E-value < 0.05). Column pairs that significantly covary are marked with green arches; compensatory pairwise substitutions, including G:U pairs, are marked green relative to the consensus (black). (c) The maxCov algorithm requires two layers to explain all five covariations. In the first (C0) layer, three positive basepairs depicted in green are grouped together. In successive layers (C+), positive basepairs already taken into account (depicted in red) are excluded. (d) At each layer, a dynamic programming algorithm produces the most probable fold constrained by the assigned positive basepairs (green parentheses), to the exclusion of all negative basepairs and other positive basepairs (red arches). (This toy alignment does not include any negative basepairs.) Residues forming a red arch can pair to other bases. Basepairs that do not significantly covary are depicted by black parentheses. (e) S+ alternative structures that contain no positive basepairs and overlap in more than half of their residues with the S0 structure are removed. Alternative helices with positive basepairs are always kept. (f) The final consensus structure, combining the nested S0 structure with the filtered alternative helices from all other layers, is displayed automatically using a modified version of the program R2R. Positive basepairs are depicted in green.
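
The filtering rule in step (e) can be illustrated with a minimal Python sketch. This is not the R-scape implementation; the function name `filter_alternative_helices`, the representation of a helix as a list of (i, j) column pairs, and the reading of "overlap" as "residues already paired in S0" are all assumptions made for illustration.

```python
def filter_alternative_helices(helices, s0_pairs, positive_pairs):
    """Keep or drop alternative helices following the rule of step (e).

    helices:        list of helices, each a list of (i, j) column pairs
    s0_pairs:       basepairs of the nested S0 structure
    positive_pairs: significantly covarying (positive) basepairs
    All names and data layouts here are hypothetical.
    """
    # Assumption: "overlap" means sharing residues that are paired in S0.
    s0_residues = {r for pair in s0_pairs for r in pair}
    positive = set(positive_pairs)
    kept = []
    for helix in helices:
        # Helices containing a positive basepair are always kept.
        if any(pair in positive for pair in helix):
            kept.append(helix)
            continue
        residues = {r for pair in helix for r in pair}
        shared = len(residues & s0_residues)
        # Remove only when strictly more than half of the residues overlap.
        if 2 * shared > len(residues):
            continue
        kept.append(helix)
    return kept


s0 = [(0, 9), (1, 8), (2, 7)]
h_a = [(12, 20), (13, 19)]   # no overlap with S0: kept
h_b = [(1, 9), (2, 8)]       # fully overlapping, no positive pair: removed
h_c = [(0, 8), (1, 7)]       # fully overlapping, but (0, 8) is positive: kept
result = filter_alternative_helices([h_a, h_b, h_c], s0, positive_pairs=[(0, 8)])
```

In this toy input, `h_b` is the only helix dropped, because it has no covariation support and all of its residues are already paired in S0.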

Fig 2.

RNA models used by the CaCoFold algorithm.

(a) The Nussinov grammar implemented by the maxCov algorithm uses the R-scape E-values of the significantly covarying pairs, and maximizes the sum of -log(E-value). (b) The RBG model used by the first layer of the folding algorithm. (c) The G6X model used by the rest of the layers completing the non-nested part of the RNA structure. For the RBG and G6X models, the F nonterminal is a shorthand for 16 different non-terminals that represent stacked basepairs. The three models are unambiguous, that is, given any nested structure, there is always one possible and unique way in which the structure can be formulated by following the rules of the grammar.
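
As a rough illustration of panel (a), the sketch below uses a Nussinov-style dynamic program to pick the nested subset of covarying pairs that maximizes the sum of -log(E-value), then repeats on the leftover pairs to form further layers. It is a simplified sketch, not the maxCov code: the function names, the 0-based column indexing, the use of the natural logarithm, and the toy E-values are assumptions.

```python
import math


def best_nested_subset(weights, n):
    """Maximum-weight nested (non-crossing) subset of pairs.

    weights: dict mapping (i, j), i < j, to a positive score
    n:       number of alignment columns (0-based indices)
    """
    score = [[0.0] * n for _ in range(n)]
    back = [[None] * n for _ in range(n)]

    def get(a, b):
        # An empty or single-column interval contributes nothing.
        return score[a][b] if a <= b else 0.0

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best, arg = get(i + 1, j), None          # column i left unpaired
            for k in range(i + 1, j + 1):            # column i paired with k
                if (i, k) in weights:
                    cand = weights[(i, k)] + get(i + 1, k - 1) + get(k + 1, j)
                    if cand > best:
                        best, arg = cand, k
            score[i][j], back[i][j] = best, arg

    # Iterative traceback over the intervals that were used.
    chosen, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        k = back[i][j]
        if k is None:
            stack.append((i + 1, j))
        else:
            chosen.append((i, k))
            stack.append((i + 1, k - 1))
            stack.append((k + 1, j))
    return sorted(chosen)


def assign_layers(evalues, n):
    """Split covarying pairs into nested layers, best-scoring layer first."""
    remaining = {pair: -math.log(e) for pair, e in evalues.items()}
    layers = []
    while remaining:
        layer = best_nested_subset(remaining, n)
        if not layer:          # guard: no pair with a positive score is left
            break
        layers.append(layer)
        for pair in layer:
            del remaining[pair]
    return layers


# Hypothetical E-values for five covarying pairs in a 13-column alignment.
toy_evalues = {(0, 9): 1e-5, (1, 8): 1e-5, (2, 7): 1e-5,
               (4, 12): 1e-3, (5, 11): 1e-3}
layers = assign_layers(toy_evalues, n=13)
```

With these toy values the three mutually nested pairs form the first layer, and the two pairs that cross them are pushed to a second layer, mirroring the two-layer outcome described for the toy alignment in Fig 1.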

Fig 3.

The CaCoFold algorithm applied to the transfer-messenger RNA (tmRNA).

Steps (a) to (f) refer to the same methods as described in Fig 1. (a) Characteristics of the input alignment. (b) The statistical test considers all possible pairs equally, resulting in the assignment of 121 significantly covarying positive basepairs. The Rfam consensus structure is not used in the analysis. The whole analysis is performed using the single command R-scape --fold on the input alignment. The analysis takes 25 seconds (30 s including drawing all the figures) on a 3.3 GHz Intel Core i7 MacBook Pro. (c) The maxCov algorithm requires 6 layers to incorporate all 121 positive basepairs. (d) The cascade constrained folding completes the structure with a total of 139 basepairs. (e) After filtering, there are five pseudoknotted helices, three triplets and 10 other mRNA-induced covariations. The structural display in (f) has been modified by hand to match the standard depiction of the tmRNA secondary structure in (g). The thick line in (g) marked with an asterisk indicates the C-C triplet interaction proposed in Ref. 44. Details of the mRNA-induced covariations are given in S6(c) Fig.

Table 1.

CaCoFold structures with different covariation support than the structures provided with the structural alignments.

CaCoFold structures with different covariation support can only have more positive basepairs. (Left) The 319 structural RNAs (from the Rfam and ZWD databases combined) for which the CaCoFold structure has more covariation support are manually classified into 15 categories. Each RNA is assigned to one main type, although it can belong to others as well. Examples of types 1-11 are presented in S7 Fig. A full description of all 319 RNAs is given in the supplemental table.

Table 2.

21 RNAs with 3D structures and CaCoFold structures with different covariation support than the structures provided with the structural alignments.

Subset of 21/319 CaCoFold structures with more covariation support for which there is 3D structural information (not including the 6 rRNAs). We compare the 21 CaCoFold predicted structures to the 3D structures in Figs 4, 5 and Supplemental S2–S6 Figs. The associated “types” are described in Table 1.

Fig 4.

CaCoFold structures confirmed by known 3D structures (part 1/7).

Structural elements with covariation support introduced by CaCoFold relative to the Rfam annotation and corroborated by 3D structures are annotated in blue. (a) Relative to the Rfam structure, the A-type RNase P RNA CaCoFold structure includes one more helix (P6) and two significant covariations, named tr_1 and tr_2. Blue arrows show the placement of these three covarying motifs relative to the 3D structure [46]. The display of the crystal structure has been modified to indicate with black shaded boxes five regions with tertiary interactions labeled “1” to “5” [68]. “tr_1” occurs in region “3” between P8 and the hairpin loop of P14, and “tr_2” in region “4”, representing the interaction between P8 and the hairpin loop of P18. The display of the CaCoFold structure has been modified by hand to match the standard depiction of the structure. (b) Relative to the Rfam structure, the SAM-I riboswitch CaCoFold structure shows one more helix forming a pseudoknot and an A-U pair stacking on helix P1, both confirmed by the 2.9 Å resolution crystal structure of the T. tengcongensis SAM-I riboswitch [47]. CaCoFold also identifies additional pairs with covariation support for helices P2a, P3 and P4. (c) The U4 snRNA CaCoFold structure identifies one more internal loop and one more helix than the Rfam structure, both confirmed by the 3D structure [48]. The new U4 internal loop, flanked by covarying Watson-Crick basepairs, includes a kink turn (UAG-AG). The non Watson-Crick pairs in a kink turn (A-G, G-A) are generally conserved (> 97% in this alignment) and do not covary.

Fig 5.

CaCoFold structures confirmed by known 3D structures (part 2/7).

Structural elements with covariation support introduced by CaCoFold relative to the Rfam annotation and corroborated by 3D structures are annotated in blue. (a) Relative to the Rfam structure, the Cobalamin riboswitch CaCoFold structure adds one pseudoknot and one Watson-Crick basepair defining a four-way junction between helices P1, P2, and P3, both confirmed by the S. thermophilum crystal structure [49]. It also adds more covariation support for helices P1 and P2. (b) In CaCoFold structures, alternative helices that do not overlap with the nested structure are annotated as pseudoknots (pk); otherwise they are annotated as triplets (tr). For structures obtained from a crystal structure, non Watson-Crick basepairs are annotated as non-canonical (nc) regardless of whether or not they overlap with the nested structure. The tRNA CaCoFold structure has been re-annotated manually to match the labeling of the S. cerevisiae phenylalanine tRNA 1EHZ crystal structure (1.93 Å) for all common basepairs [51]. Of the covarying pairs in the CaCoFold structure but not in the Rfam tRNA structure, five (depicted in blue) are confirmed by the 1EHZ structure as analyzed by RNAView. The sequence of the 1EHZ tRNA does not include the V loop, which appears in 16% of the 954 sequences in the Rfam tRNA seed alignment. Two covarying pairs (depicted in orange) appear to be the result of constraints other than RNA structure. The remaining six covarying pairs are labeled in black. Four basepairs identified in the 3D structure but not incorporated in the CaCoFold structure are depicted in brown. The annotation of the non Watson-Crick pairs with at least two H-bonds follows the nomenclature of [34], which reports the two edges of the nucleotides involved in the plane of the H-bonds. “W” stands for the Watson-Crick edge, “S” for the Sugar edge, and “H” for the Hoogsteen face; “c” and “t” stand for cis and trans respectively. WWc is a standard Watson-Crick basepair.
(c) In the U2 spliceosomal RNA, Stem IIa and Stem IIc, both with covariation support, are two alternative helices that compete to promote different splicing steps [53].
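
The pk/tr annotation rule stated in panel (b) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `annotate_helix` is hypothetical, helices are represented as lists of (i, j) column pairs, and "overlap" is read as sharing at least one residue that is already paired in the nested structure.

```python
def annotate_helix(helix, nested_pairs):
    """Label an alternative helix as 'pk' (pseudoknot) or 'tr' (triplet).

    Assumption: a helix "overlaps" the nested structure when any of its
    residues is already paired there; such helices are labeled 'tr'.
    """
    nested_residues = {r for pair in nested_pairs for r in pair}
    helix_residues = {r for pair in helix for r in pair}
    return "tr" if helix_residues & nested_residues else "pk"


nested = [(0, 9), (1, 8), (2, 7)]
label_pk = annotate_helix([(4, 12), (5, 11)], nested)   # uses only free residues
label_tr = annotate_helix([(8, 15)], nested)            # residue 8 is paired in S0
```

Under this reading, a residue that takes part in a 'tr' helix is involved in more than one pair, which is why such interactions cannot be drawn as an ordinary pseudoknot.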
