Integrating Chemical Footprinting Data into RNA Secondary Structure Prediction

Chemical and enzymatic footprinting experiments, such as shape (selective 2′-hydroxyl acylation analyzed by primer extension), yield important information about RNA secondary structure. Indeed, since the -hydroxyl is reactive at flexible (loop) regions, but unreactive at base-paired regions, shape yields quantitative data about which RNA nucleotides are base-paired. Recently, low error rates in secondary structure prediction have been reported for three RNAs of moderate size, by including base stacking pseudo-energy terms derived from shape data into the computation of minimum free energy secondary structure. Here, we describe a novel method, RNAsc (RNA soft constraints), which includes pseudo-energy terms for each nucleotide position, rather than only for base stacking positions. We prove that RNAsc is self-consistent, in the sense that the nucleotide-specific probabilities of being unpaired in the low energy Boltzmann ensemble always become more closely correlated with the input shape data after application of RNAsc. From this mathematical perspective, the secondary structure predicted by RNAsc should be ‘correct’, in as much as the shape data is ‘correct’. We benchmark RNAsc against the previously mentioned method for eight RNAs, for which both shape data and native structures are known, to find the same accuracy in 7 out of 8 cases, and an improvement of 25% in one case. Furthermore, we present what appears to be the first direct comparison of shape data and in-line probing data, by comparing yeast asp-tRNA shape data from the literature with data from in-line probing experiments we have recently performed. With respect to several criteria, we find that shape data appear to be more robust than in-line probing data, at least in the case of asp-tRNA.

Increasing β decreases average distance to normalized shape data In this section, we generalize the theorem from the text to show that as RNAsc parameter β increases, the expected distance to normalized shape data decreases. The proof, which generalizes the proof of the theorem in the main text, is given below.
First, we begin with some notation. Thoughout this section, S denotes a secondary structure of an arbitary, but fixed RNA sequence a 1 , . . . , a n . As in the text, each secondary structure S is associated with a binary sequence b 1 , . . . , b n such that b i = 1 if the nucleotide a i is unpaired and b i = 0 if a i is base-paired. Given experimental shape data yielding probabilities q s = (q s 1 , . . . , q s n ), where q s i is the probability that nucleotide i is unpaired, the distance of S to q s is defined by: For secondary structure S and parameter β ≥ 0 in the algorithm RNAsc, shape weight for (S, β) is defined to be ω q s (S, β) = The β-weighted partition function then becomes The β-weighted Boltzmann probability P (S) of secondary structure S is defined by Let 0 ≤ β 1 ≤ β 2 be arbitrary, but fixed values for the parameter β in RNAsc. Define the critical distance d c (β 1 , β 2 ) by Note that d c (β 1 , β 2 ), which has the form of a change in ensemble free energy, does not depend on any particular secondary structure S, although it does depend on n, T, β 1 , β 2 , q s and of course the input RNA sequence a 1 , . . . , a n .
Claim 1: If d 1 ≤ d 2 and β ≥ 0, then Claim 1 is obvious. We now prove the following principal claim.
Claim 2: For any secondary structure S, and strict inequalities hold as well.
For notational ease, let d c abbreviate d c (β 1 , β 2 ). By Equation (5) and Claim 1, for any S, Multiply both sides of the last line by P This establishes Claim 2. For 0 ≤ β, define the β-expected distance D β between q s , obtained by normalizing shape data, and the ensemble of low energy structures by When β = 0, we write D , instead of D 0 . Define disjoint sets A, B of secondary structures S of a 1 , . . . , a n by It follows by definition that for all S ∈ A, d q s (S) ≤ d c , and for all S ∈ B, d q s (S) > d c .
shape discrepancies In order to directly characterize how well shape data reflects RNA secondary structure, we compared normalized shape data with base pairing status, as determined from crystallographic or NMR structures. We define shape distance to equal the difference between normalized shape reactivity (see Methods), scaled from 0 to 1, as just defined, and binary base-pairing status, with 0 for paired, 1 for unpaired, as derived from NMR or crystal structure. Using shape data for S. cerevisiae apartyl-tRNA [1], HCV IRES [2], bI3 group I intron p456 [3], E. coli phenylalanine-tRNA [4], E. coli 5S RNA [4], and Fusobacterium nucleatum glycine riboswitch [4], we computed shape distance at each nucleotide. We observed that at many positions the shape distance has an absolute value greater than 0.5, thus indicating a significant difference between shape reactivity and the actual secondary structure. We refer to these positions as discrepancies. Over the the set of RNAs we examined, between 24 − 35% of the total data corresponded to such discrepancies (see Fig. 1).

Cumulative distributions and Histograms of shape reactivity
Nucleotides with shape reactivities ≥ 0.7 or 0.3 − 0.7 are considered highly and moderately reactive, respectively [2]. Hence it is reasonable to normalize the shape reactivities in a piecewise linear fashion, where 0.3 will be roughly mapped to 0.5. However, very low shape reactivities should not be mapped close to 0.

Integrating pseudo-energy terms into the Partition function recursions
In this section, we provide the full recursive definitions necessary to compute the partition function, where pseudo-energy factors have been included for every nucleotide position. The recursions are adaptations of recursions from [5]. Corresponding recursions for the minimum free energy (MFE) secondary structure are obtained by replacing the Boltzmann factors of energy by the energy, replacing addition by minimization, and replacing multiplication by addition-hence will not be given explicitly.
Definition 1 Define: The partition function for the fragment from nucleotides i to j, inclusive, with i paired to j.
• W (i, j): The partition functions for the fragment from nucleotides i to j, inclusive, such that this fragment will be incorporated in a multibranch loop and it has one single helical branch.
• W L(i, j): The partition functions for the fragment from nucleotides i to j, inclusive, such that this fragment will be incorporated in a multibranch loop and it has one single helical branch. Also j is required to terminate the helical branch as either a paired nucleotide, a 3 dangling end, or a nucleotide in a terminal mismatch.
• W M B(i, j): The partition function from nucleotides i to j, inclusive, such that this fragment will be incorporated into a multibranch loop and it contains two or more branches.  Figure 1. shape discrepancies. Distribution of shape discrepancies for east tRNA-Asp, HCV IRES [2], bI3 group I intron p456 [3], E. coli 5S RNA [4], and Fusobacterium nucleatum glycine riboswitch [4].Using crystal structure as 'gold standard', red squares indicate locations, where the absolute value of the difference of shape data and crystal structure (1 unpaired, 0 paired) exceeds 0.5.
• W M BL(i, j): The partition function from nucleotides i to j, inclusive, such that this fragment will be incorporated into a multibranch loop and it contains two or more branches. Also j is required to be paired or associated with a helix as either a 3 dangling end, a nucleotide in a terminal mismatch, or a nucleotide in a mismatch between two coaxially stacked helices.
• W C(i, j): The partition function from nucleotides from i to j, inclusive, such that there are two , bI3 group I intron p456 [3], E. coli phenylalanine-tRNA [4], E. coli 5S RNA [4], and Fusobacterium nucleatum glycine riboswitch [4].The fraction of base-pairs could be used to estimate the threshold shape moderate reactivity.
• W 5(i): The partition function for the nucleotide fragment from the 5 end of the sequence to and including nucleotide i See [5,6] for details on the storage and calculation of the arrays. V is the sum of terms involving a pair between i and j, base pair stacking, hairpin loop closure, internal loop closure, and multibranch loop closure. For j ≤ N , we have: and for j > N we have: Note that V is not defined for i = N and for j = N + 1 (i.e., the ends of the sequence.) V h is the energy contribution of the hairpin and is defined by V s is the stacking contribution (of a base pair on a previous pair) and is defined by: V i is the energy contribution of internal loops which also considers bulge loops. V i is defined by, Computation of V i requires a search over i and j where i < i < j < j except where i = i + 1 and j = j − 1 simultaneously, that is, the base pair stacking case. The search is limited to internal loops of size 30 nt that is, V e is the energy contribution from the exterior loops. It is defined only when j > N : W 5(i) is the fragment from 1 to i and is computed by: W 3 is defined similarly, as follows.