Computing the Partition Function for Kinetically Trapped RNA Secondary Structures

An RNA secondary structure is locally optimal if there is no lower energy structure that can be obtained by the addition or removal of a single base pair, where energy is defined according to the widely accepted Turner nearest neighbor model. Locally optimal structures form kinetic traps, since any evolution away from a locally optimal structure must involve energetically unfavorable folding steps. Here, we present a novel, efficient algorithm to compute the partition function over all locally optimal secondary structures of a given RNA sequence. Our software, RNAlocopt runs in time and space. Additionally, RNAlocopt samples a user-specified number of structures from the Boltzmann subensemble of all locally optimal structures. We apply RNAlocopt to show that (1) the number of locally optimal structures is far fewer than the total number of structures – indeed, the number of locally optimal structures approximately equal to the square root of the number of all structures, (2) the structural diversity of this subensemble may be either similar to or quite different from the structural diversity of the entire Boltzmann ensemble, a situation that depends on the type of input RNA, (3) the (modified) maximum expected accuracy structure, computed by taking into account base pairing frequencies of locally optimal structures, is a more accurate prediction of the native structure than other current thermodynamics-based methods. The software RNAlocopt constitutes a technical breakthrough in our study of the folding landscape for RNA secondary structures. For the first time, locally optimal structures (kinetic traps in the Turner energy model) can be rapidly generated for long RNA sequences, previously impossible with methods that involved exhaustive enumeration. Use of locally optimal structure leads to state-of-the-art secondary structure prediction, as benchmarked against methods involving the computation of minimum free energy and of maximum expected accuracy. Web server and source code available at http://bioinformatics.bc.edu/clotelab/RNAlocopt/.


Introduction
Kinetics of RNA secondary structure formation plays an important role in many biological functions, as shown by cotranscriptional folding [1] of large RNA molecules, the host-killing (hok) and suppression of killing system (sok) system [2] to control plasmid copy number in E. coli, the kinetically driven trans-splicing of a 5' codon in Leptomonas collosoma [3], and the kinetic control in the formation of the Tetrahymena ribozyme [4].
RNA secondary structure kinetics depends on the distribution of locally optimal secondary structures, where a structure is said to be locally optimal if it is not the case that by adding or removing a single base pair, one can obtain a structure having lower free energy. In the context of the Nussinov energy model [5], where the energy of a base pair is {1, locally optimal structures are exactly the saturated secondary structures, as first defined by M. Zuker [6]. (A secondary structure is saturated if one cannot add any base pairs without violating the definition of a secondary structure; i.e. without either creating a base triple or pseudoknot.) In the paper [7] we developed an algorithm to compute the partition function for all saturated secondary structures of a given RNA sequence. Exploiting the idea behind this algorithm, in the papers [8,9], we subsequently proved that the asymptotic number of saturated secondary structures is 1:07427 : n {3=2: 2:35467 n , which (surprisingly) is not substantially less than the asymptotic number 1:104366 : n {3=2: 2:618034 n of all secondary structures, a result earlier proved by Stein and Waterman [10]. In Waldispühl and Clote [11], we extended our previous algorithm [7] to compute the partition function of all saturated secondary structures, with respect to the widely used Turner energy model [12]. In the Turner energy model, a secondary structure is decomposed into loops, as described in Zuker [13], and the free energy is computed by summing the energy contributions of all loops. A k-loop consists of k{1 base pairs (excluding the closing base pair) and u unpaired bases. The energies of 1-loops (hairpins), 2-loops (stacks if u~0, bulges or interior loops if uw0), 3-loops and 4-loops (also known as 3-way and 4-way multiloop junctions) are obtained by least squares fit of enthalpy and free energy change at 37C 0 , determined by optical melting (UV absorption) of small model systems [14,15]. Even though free energies for the most common multiloops (3-way and 4-way junctions) have been experimentally determined [15], for computational efficiency it is usual to define the free energy of arbitrary multiloops (kw2) by the affine approximation azb(k{1)zcu, where a, b and c are constants.
Computational studies of RNA kinetics are currently performed either by repeated Monte-Carlo simulations, as in software Kinfold of Flamm, Fontana, Hofacker and Schuster [16], Kinefold of Xayaphoummine, Bucher, and Isambert [17], and RNAKI-NETICS of Danilova, Pervouchine, Favorov, and Mironov [18], or by direct solution of the master equation from chemical kinetics dP i t ð Þ dt~X j P j (t) : k j,i {P i (t) : k i,j À Á : Here, P i (t) is the probability that the RNA molecule is in secondary structure S i at time t, and k i,j is the transitional probability of moving from structure S i to neighboring structure S j , which differs from S i by the addition or removal of a single base pair, and where k i,j k j,i~e xp ({(E j {E i )=RT). By constructing probabilistic roadmaps for RNA secondary structure formation, a technique derived from robotic motion planning, Tang, Kirkpatrick, Thomas, Song and Amato [19] and Tang, Thomas, Tapia, Giedroc and Amato [20] are able to apply both Monte Carlo methods and the master equation over a smaller set of structures. Flamm, Fontana, Hofacker and Schuster [16] describe RNA folding at an elementary step resolution, by using a Monte Carlo algorithm to study the kinetics of folding. Their Kinfold program is an implementation of Gillespie's Monte Carlo algorithm [21,22] for stochastic folding, where elementary steps consist of either adding, removing or shifting a single base pair. In that paper, Flamm et al. describe the barrier tree, whose leaves are those locally optimal secondary structures having free energy that lies below a user-defined threshold. The barrier tree is constructed by using the program RNAsubopt [23] to exhaustively generate all secondary structures, whose free energy lies below a user-defined threshold, then aggregating structures into basins containing a locally optimal structure. As more structures are aggregated, using the imagery of flooding a landscape, two basins may be gradually joined together by folding paths, all of whose intermediate structures lie in one of the two basins, for which there exists a saddle structure of highest free energy along the path. Flamm, Hofacker, Stadler and Wolfinger [24] present additional applications of the Barriers program, while Wolfinger et al. [25] describe a coarse-grained approach by applying the master equation of chemical kinetics to macrostates consisting of basins of structures aggregated near locally optimal structures. For additional results on saddle points and energy barriers, see Stadler and Flamm [26], Flamm, Hofacker, Stadler, and Stadler [27], as well as the recent paper by Hofacker, Flamm, Heine, Wolfinger, Scheuermann et al [28], who introduce the notion of barmap which ''links macrostates of temporally adjacent landscapes and defines the transfer of population densities from one 'snapshot' to the next''.
Other groups have studied various aspects of kinetically driven RNA folding. Shapiro, Bengali, Kasprzak and Wu [29] compute likely folding intermediates in the earlier described hok/sok system. Danilova, Pervouchine, Favorov, and Mironov [18] describe the web server, RNAKINETICS, which models the secondary structure kinetics of an elongating RNA molecule. Xayaphoummine, Bucher, and Isambert [17] and Isambert [30] introduce the Kinefold web server, which stochastically folds a user-given RNA sequence into a low energy structure that may include pseudoknots. Quite recently, Dotu, Lorenz, Van Hentenryck and Clote [31] describe an efficient program RNATABUPATH to compute near-optimal folding pathways between two secondary structures of a given RNA sequence. For an overview of RNA folding kinetics, see the review articles by Chen [32] and Al-Hashimi and Walter [1].
In this paper, we describe a novel, efficient algorithm, RNALOCOPT, to compute the partition function over all secondary structures that are locally optimal in the Turner energy model. Locally optimal structures form kinetic traps, hence create basins of attraction in the energy landscape. The structure of this paper is as follows. In the introduction sections, we provide background definitions for the Turner energy model and loop decomposition. To allow the paper to be self-contained, we additionally describe McCaskill's classical algorithm for the partition function [33].
In the Results section, we present three types of analysis using the software RNALOCOPT. First, by performing computational experiments on RNA sequences of increasing length, we show that the number of locally optimal structures is asymptotically the square root of the number of all structures, as depicted in Figure 5. Secondly, we compare the structural diversity, as measured by four different metrics, of the set of locally optimal structures with that of the Boltzmann ensemble of all secondary structures. Structural diversity appears to depend on the type of RNA; for instance, in the case of precursor microRNAs and 5S-rRNA, the structural diversity of the collection of locally optimal secondary structures is markedly lower than that over the Boltzmann ensemble, while structural diversity for TPP riboswitch aptamers appears to be about the same. Thirdly, we demonstrate how to combine McCaskill base pairing probabilities with those from sampled locally optimal structures in order to compute a modified maximum expected accuracy structure [34,35], which appears to be closer to the native structure than structures produced by other thermodynamics-based algorithms. The Discussion section provides additional comments on the energy model of RNALOCOPT and benchmarking issues, and as well describes intended future applications and possible extensions of the software. In particular, in forthcoming work, we will introduce a new method using RNALOCOPT to quickly and accurately determine the mean folding time for a given RNA sequence, a synthetic biology application for de novo RNA design.
In the Methods section, we begin by describing the intuition behind the new O(n 3 ) time and O(n 2 ) space algorithm, whose details and recurrence relations are then provided. Though our software RNALOCOPT additionally can sample a user-specified number of structures from the Boltzmann subensemble of locally optimal structures, we do not describe details of the construction, since it is analogous to the construction of Ding and Lawrence [36,37], albeit where the McCaskill partition function is replaced by the partition function for locally optimal structures.

Background
An RNA molecule is a biopolymer consisting of nucleotides, adenine (A), cytosine (C), guanine (G) and uracil (G), oriented in a natural left-to-right fashion given by the 5' to 3' direction. Given an RNA sequence a 1 , . . . ,a n of length n, an RNA secondary structure S is defined to be a set of base pairs (i,j), where (a) if (i,j) [ S, then (a i ,a j ) [ fAU,U A,GC,CG,GU,U Gg (base pairs are canonical, i.e. either Watson-Crick or wobble pairs); (ii) if (i,j) [ S, then jwizh, where by convention h~3 (minimum of h unpaired bases in a hairpin loop); (iii) if (i,j),(i,k) [ S, then j~k and if (i,j),(k,j) [ S, then i~k (non-existence of base triples); (iv) if (i,j),(k,') [ S, and ivk, then 'vj (non-existence of pseudoknots). See Figure 1 for three equivalent representations of the secondary structure for RNA from human accelerated region HAR1F, a region of the human genome that seems to have been under evolutionary pressure in the divergence of humans from great apes [38]. While secondary structures satisfy a planarity condition, pseudoknots violate that condition, as shown in Figure 2. Although pseudoknots and non-canonical base pairs play important roles in RNA tertiary structure formation [39], the secondary structure forms rapidly and serves largely as a scaffold for the formation of tertiary contacts [40]. In this paper, we are interested in developing an efficient algorithm to explore the energy landscape of kinetically trapped RNA structures. Since Lyngsø and Pedersen [41] have proved that it is NP-complete to compute the minimum free energy structure for a given RNA sequence, when general pseudoknots are permitted, we will restrict our attention throughfoout the paper to secondary structure.

Nearest neighbor energy model
The Turner nearest neighbor energy model is an additive model, where the free energy of an RNA secondary structure is computed as the sum of distinct loop free energies in a unique decomposition of the structure. Figure 3 illustrates the different types of possible loops for an example RNA secondary structure. The structure contains 8 loops of 4 basic different types. Hairpins are formed when a base pair (i,j) encloses an unpaired region of RNA; thus a hairpin contains the nucleotides a i , . . . ,a j , where due to steric constraints, j{iwh, for h~3, and positions iz1, . . . ,j{1 are unpaired. Stacked base pairs are loops containing adjacent base pairs, (i,j), (iz1,j{1), as shown in loops L 2 and L 7 . Left bulges are loops containing the closing base pairs (i,j), (iz1,j{k) for kw1, where j{kz1, . . . ,j{1 are unpaired; right bulges contain the closing base pairs (i,j), (izk,j{1) for kw1, where positions iz1, . . . ,izk{1 are unpaired. Loop L 3 depicts a left bulge. Internal loops are loops bordered by 2 base pairs (i,j), (k,'), where iz1vk and 'z1vj. Loop L 5 depicts an internal loop. A multiloop is a loop bordered by 3 or more base pairs. For instance, L 4 is a multiloop closed by the base pair (3,28), which here is a 3-way junction (i.e. bordered by three base pairs) and which has two components (i.e. stems bordered by base pairs (5,15) and (17,26)). The number k of base pairs that border a loop can be use to classify the loop; k~1 in hairpins, k~2 in stacked base pairs, bulges, and internal loops, and kw2 in multiloops. Finally, external loops, depicted in L 8 , are technically not loops, but rather are defined to be regions containing nucleotide positions x for which there is no base pair (i,j) satisfying iƒxƒj.
In the Turner energy model, there are free energies for each type of loop. For the example structure S depicted in Figure 3, if we denote the energy of loop L i by E(L i ), it follows that the free energy of S is The Turner rules were fit to enthalpy and folding free energy change at 37uC, determined by optical melting of small model systems [12,51] free energy contribution; hairpins, bulges, internal loops, and multiloops generally contribute positive (destabilizing) free energies, although certain 1|1 and 2|2 internal loops contribute stabilizing energies. Figure 1. RNA from human accelerated region HAR1F, a region of the human genome that differs from highly conserved regions of our closest primate relatives and is active in the developing human brain between the 7th and 18th gestational weeks [38]. Secondary structure representation in conventional form (left), as a circular Feynman diagram (center) and as a linear Feynman diagram (right). Sequence and consensus secondary structure taken from Rfam [42]; graphics produced with jViz software [43]. doi:10.1371/journal.pone.0016178.g001 Secondary structure with pseudoknots displayed in conventional form (left) and as a circular Feynman diagram (right). Sequence and structure of PKB239 taken from Pseudobase [44]; graphics produced with jViz software [43]. doi:10.1371/journal.pone.0016178.g002  Important aspects of the Turner energy model are additivity and locality. Both of these properties are critical in the development of an efficient computation of the partition function; indeed, it is this local nature of the energy model that renders it possible to inductively determine all locally optimal structures.
The algorithm to compute the partition function of all locally optimal structures is a modification of McCaskill's algorithm, which we will review now. McCaskill's algorithm recursively computes the partition function for structures on subsequence a i , . . . ,a j by table look-up of the previously computed partition function values for proper subwords of a i , . . . ,a j . Each recursion step involves the addition of either one base pair or one unpaired base to groups of structures whose partition function is already known. Our modification to McCaskill's algorithm is to make sure at each step that the base pair or base added does not cause the occurence of nonoptimal structures. This will require additional information to be stored at each step, but does not change the basic structure of the McCaskill recursions.

McCaskill's partition function
In order to provide a self-contained treatment, we now review the construction of McCaskill's algorithm [33] to construct the partition function for RNA secondary structures.
Given RNA nucleotide sequence a 1 , . . . ,a n , we let E HP (i,j) denote the free energy of a hairpin closed by base pair (i,j), while E IL (i,j,i',j') denotes the free energy of an internal loop enclosed by the base pairs (i,j) and (i',j'), where ivi'vj'vj. (Internal loops comprise the cases of stacked base pairs, left/right bulges and proper internal loops.) The free energy for a multiloop containing N b base pairs and N u unpaired bases is given by the affine approximation azbN b zcN u .
Given an RNA sequence a 1 , . . . ,a n , for 1ƒiƒjƒn, the McCaskill partition function Z(i,j) is defined by P S e {E(S)=RT , where the sum is taken over all secondary structures S of a i , . . . ,a j , E(S) is the free energy of secondary structure S, R is the universal gas constant, and T is absolute temperature. In the sequel we write a½i,j to abbreviate a i , . . . ,a j . subject to the constraint that a½i,j is part of a multiloop and has at exactly one component. Moreover, it is required that i basepair in the interval ½i,j; i.e. (i,r) is a base pair, for some ivrƒj.
Following McCaskill [33], the unconstrained partition function is defined by The constrained partition function closed by base pair (i,j) is given by The multiloop partition function with a single component and where position i is required to base-pair in the interval ½i,j is given by Finally, the multiloop partition function with one or more components, having no requirement that position i base-pair in the interval ½i,j is given by Figure 4 for a pictorial representation of the recursions of McCaskill's (original) algorithm [52]; note that the recursions are equivalent to, but not quite the same as, those given in [53].

Number of locally optimal structures
In this section, we compare the values of the partition function, Z LO , of all locally optimal structures, and the total number, N LO , of locally optimal structures, with those for all structures. The number of locally optimal structures, N LO , is determined by removing all energy factors in the previous equations for the Boltzmann partition function. This is equivalent to setting the temperature to z?, since all energetic factors are of the form e ({E=RT) .
In Figure 5, for lengths between 20 and 200 nt, 100 RNA were randomly generated for each length in the simplest possible manner, with 1/4 probability of A, C, G, and U at each location. For each such RNA, the number of locally optimal structures as well as the number of all secondary structures is determined. These are averaged over the 100 randomly generated RNA sequences of that length, and plotted in the graph shown in Figure 5. We find there is exponential growth in the average, or expected, number of locally optimal structures, as a function of sequence length. Moreover, the slope of the curve in Figure 5 for the total number of structures is approximately twice that of the number of locally optimal structures, hence implying that the number of structures is approximately the square of that for locally optimal structures. Indeed, by fitting the data with a least-squares approximation, we find that the number Num S (n)&10 0:254759 : n{1:95771 with R 2~0 :999815, while the number Num LO (n) of locally optimalstructures for length n random RNA satisfies Num LO (n) &10 0:130366 : n{1:50236 with R 2~9 99407. (The coefficient of determination, R 2 , is the square of Pearson correlation coefficient of the least squares (linear) fit of the logarithm of the average number of structures.) In Figure 6, we compare the partition function, Z LO , of all locally optimal structures, with the partition function, Z, of all structures, by plotting the ratio, Z LO =Z, by the same method, averaging over 100 RNA at each length. This ratio, depicted with error bars, represents the percentage of structures, as weighted by their Boltzmann factor, that are locally optimal. By numerical fitting the data from this curve, it appears that the ratio is approximately 1:0053 exp ({0:0123n) with coefficient of determination R 2~0 :9876 (see [54] for explanation of how to compute the coefficient of determination).
Another interesting computational experiment we performed was to determine the sum of the Boltzmann factors for a nonredundant subset of 1000 sampled locally optimal structures, produced by RNALOCOPT, compared with the sum of the Boltzmann factors for a non-redundant subset of secondary structures, sampled by the Ding-Lawrence algorithm [36], as implemented in RNASUBOPT -p. Table 1 presents these results for RNA generated in the previously described manner from an order 0 Markov chain, for lengths from 20 to 200 in steps of 20.
For each length, we averaged statistics over 10 runs, where for each run, we computed the percent coverage of the partition function; i.e. sum of the Boltzmann factors of a non-redundant subset from 1000 samples generated by RNASUBOPT [resp. RNALOCOPT], divided by the partition function Z [resp. partition function Z LO of locally optimal structures]. The number of locally optimal structures is far fewer than that of all structures (see Figure 5), hence, there is proportionately more redundancy among sampled locally optimal structures than than that over all structures. As well, the percentage coverage of the partition function for sampled locally optimal structures is higher than that for the Boltzmann ensemble.

Structural diversity of ensemble of locally optimal structures
In our paper on RNA saturated structures [55], we suggested that (a) there are far fewer locally optimal structures than there are of saturated structures, and (b) base pairing probabilities over locally optimal structures are similar to the base pair probabilities over all structures. In the previous section, we have shown that (a) holds; indeed, Figure 5 shows that the number of locally optimal structures is approximately the square root of the number of all structures, while the papers [7][8][9]56] show that the number of saturated structures lies closer to that of all structures. While statement (b) holds in some cases, such as for purine riboswitch aptamers, in other cases, such as for precursor microRNAs and 5S-rRNA, it does not hold.
To numerically quantify how closely the ensemble of locally optimal structures resembles the Boltzmann ensemble of all structures, we consider four measures: the pseudo-entropy for base pairing probabilities, the average entropy for the base pairing probabilities, and two forms of structural diversity, the first due to Morgan and Higgs [57] and the second described in the Vienna RNA Package [58].
For a fixed RNA sequence a 1 , . . . ,a n with base pairing probabilities p i,j , the pseudo-entropy is defined by Since the collection of base pairing probabilities p i,j does not form a probability distribution (although it does for fixed i, as exploited in the next definition), we cannot speak of its entropy, but rather use the term pseudo-entropy. The average (Shannon) entropy is defined by  The Morgan-Higgs structural diversity is defined by Finally, the Vienna structural diversity is defined by where the first sum is taken over all secondary structures S,T of a fixed RNA sequence, d(S,T) is the base pair distances between S,T, and P(S) is the Boltzmann probability P(S)~e xp ({E(S)=RT) Z for structure S (and similarly for T). If there is no structural diversity whatsoever, so that p i,j~1 for all base pairs (i,j) in the minimum free energy structure S 0 , then clearly the Morgan-Higgs diversity SD mh T will take on the least possible value, n{jS 0 j&n=2, while the Vienna diversity will equal 0. Variants of the above measures are given as well for the ensemble of locally optimal secondary structures, where we use base pairing frequencies p i,j over a sampled collection of 1000 locally optimal structures for a given RNA sequence. Table 2 summarizes these four measures for 14 families of seed alignments from the Rfam 10.0 database [42]. For essentially all of these measures, we see that the structural diversity of the ensemble of locally optimal structures appears to be less than that for all structures. Notable exceptions are the riboswitch aptamers from Rfam.
By using the new algorithm RNALOCOPT, we have shown that the collection of locally optimal structures constitutes an ensemble that is smaller (see Figure 5) and structurally less diverse in general than that of all structures. This provides additional evidence for the hypothesis advanced in [24,25,27] that locally optimal structures form basins of attraction in the folding landscape of RNA secondary structures. For this reason, RNALOCOPT may prove valuable in the study of kinetics of RNA folding.
Basepair probabilities lead to better RNA secondary structure prediction In ground-breaking work, Knudsen and Hein [59], followed by Do, Mahabhashyam, Brudno and Batzoglou [60] and by Kiryu, Kin and Asai [34], introduced the notion of maximum expected accuracy secondary structure, shown to be closer to the native structure, compared to the minimum free energy structure, when benchmarked against known structures. The underlying idea of this new approach is that there is a strong signal in the Boltzmann ensemble of low energy structures -a signal that is ignored when one computes the minimum free energy (MFE) structure, which is the maximum likelihood structure with respect to Boltzmann probability. Independently and at the same time, Ding, Chan and Lawrence [61] also realized the benefit of considering the Boltzmann ensemble rather than the MFE structure in their construction of the Boltzmann centroid of a cluster of sampled structures.
Following [34,35,59,60], we define the maximum expected accuracy (MEA) structure for a given RNA sequence to be that which is obtained by tracebacks, using the matrix M, defined as follows: where q i~1 { P n j~1 p i,j , and a,b are non-negative constants. In the previous studies [34,35], optimal values of a,b were found to be a~1,b~1. In this paper, we have set b~1 and performed benchmarking for a range of values a in f2 {4 ,2 {3 ,2 {2 ,2 1 ,1, 2,2 2 ,2 3 ,2 4 g. If most structures in the Boltzmann ensemble contain the base pair (i,j), then p i,j will be large, and it can happen that (i,j) will belong to the MEA structure even though (i,j) does not belong to the MFE structure. The values M i,j can be computed by a simple modification of the Nussinov-Jacobson algorithm [5], and the maximum expected accuracy structure with score M 1,n can be subsequently computed by tracebacks. See the references for more details, where the previously cited authors list benchmarking statistics to determine optimal parameters a,b for which the MEA structure is closer to the native structure than is the MFE structure.
We compare variants of the MEA construction, obtained by using (i) base pairing probabilities p M i,j computed by McCaskill's algorithm [33] using RNAFOLD, (ii) base pairing probabilities p LO i,j for locally optimal structures computed by relative frequency count from 10,000 sampled locally optimal structures, and (iii) base pairing probabilities p min i,j , and unpaired probabilities q min i , defined as the minimum of both probabilities; i.e. p min We can determine corresponding MEA structures, denoted by MEA and MEA LO , according to the use of P resp. P LO . We see in Figures 7 and 8 that predictions based on these MEA structures are better than the MFE structure, as predicted by RNAFOLD. However the predictions based on local optima are consistently worse.
However, we can create a third matrix, denoted by P MIN , where for each base pair (i,j), P MIN ((i,j))~min(P((i,j)),P LO ((i,j))): This will in essence emphasize those base pairs that occur prominently in both samples of local optima and samples of all structures. As shown in Figure 7, this consistently increases the sensitivity and positive predictive value.

Discussion
In this paper, we describe a novel and efficient algorithm to compute the partition function over all locally optimal secondary structures of a given RNA sequence. The software, RNALOCOPT runs in O(n 3 ) time and O(n 2 ) space, the same time and space complexity as that of McCaskill's algorithm to compute the partition function over all secondary structures. Additionally, RNALOCOPT samples a user-specified number of structures from the Boltzmann subensemble of all locally optimal structures. Our work completely solves a line of investigation begun originally by M. Zuker [6], who first defined the notion of saturated structure (for which no base pair can be added without violating the definition of secondary structure).
The energy model implemented in RNALOCOPT is the Turner nearest neighbor energy model without dangles; in contrast, the energy model used in the software RNAFOLD and RNASUBOPT is the Turner model with dangles. Our computation of sensitivity and positive predictive value (PPV) is exact; i.e. with no allowed slippage. In contrast, some authors, such as Lu and Mathews [35], benchmark sensitivity and positive predictive values by allowing a slippage of +1; i.e. if base pair (i,j) belongs to the native structure, then the predicted base pair (x,y) is counted as correctly predicted if (x,y) is one of the following: (i{1,j),(i,j),(iz1,j),(i,j{1), (i,j),(i,jz1). In [35], sensitivity and PPV values are reported with slippage for the maximum expected accuracy (MEA) method using the software RNASTRUCTURE [62], which includes energy terms for coaxial stacking.
There may be some discrepancies between reported sensitivity and PPV values from various groups. Such discrepancies will occur due to a combination of benchmarking with respect to Figure 6. Plot of ratio, with error bars, of the restricted Boltzmann partition function Z LO and the total Boltzmann partition function, as a function of RNA length, for the same random RNA generated as described in the Figure 5. This ratio represents the percentage of structures, as weighted by their Boltzmann factor, that are locally optimal. By numerical fitting, we find that this ratio is approximately 1:0053 exp ({0:0123n) with coefficient of determination (see [54]) R 2~0 :9876. doi:10.1371/journal.pone.0016178.g006 different databases, admitting slippage or not, and small differences in the underlying energy model. Nevertheless, there is a consistent improvement of MEA MIN, as shown in this paper, over both minimum free energy (MFE) and maximum expected accuracy (MEA) methods.
By applying RNALOCOPT to randomly generated RNA, we have shown that there are far fewer locally optimal structures than that of all structures (the number of locally optimal structures approximately equals the square root of the number of all structures). We have shown that the structural diversity, as measured by four different parameters, of samples of locally optimal structures can either be similar or quite distinct from samples from the Boltzmann ensemble of all structures -a situation that depends on the particular RNA family. While most RNA families we investigated displayed smaller locally optimal diversity than total structure diversity, notable exceptions were the riboswitch aptamers from Rfam. One might think that this is due to the fact that two distinct low energy conformations (gene-on and gene-off) are present in both the local optimal and Boltzmann ensemble. However, the Rfam database contains only the riboswitch aptamers, which do not undergo any significant conformation change. (Indeed, the riboswitch portion that undergoes conformation change, called the expression platform, is essentially missing from the Rfam data, a situation we will address in a future publication.) Thus it remains unclear exactly why riboswitch aptamers should display a difference in structural diversity between locally optimal and all structures.
Since there are relatively few locally optimal structures, compared to all structures, we are led to the hypothesis that in certain circumstances, a collection of sampled locally optimal structures can more succinctly represent the folding landscape of a given RNA sequence. In forthcoming work, we will describe an application of this observation, by presenting a new method for de novo RNA structure design, where kinetic properties are taken into account.
Theoretical studies of RNA folding kinetics have primarily focused on unit-step resolution, where a single base pair is added or removed in each time step. For such studies, RNALOCOPT will prove to be a valuable new tool. There is some possibility of extending RNALOCOPT to allow the formation or removal of entire helices in each time step, a direction we are currently considering. The idea would be to redefine a locally optimal structure to be one for which no addition or removal of any stem region would lower the free energy. An extension of RNALOCOPT in this direction would allow more rapid exploration of the folding process.
Locally optimal structures S form kinetic traps, in the sense that there does not exist a structure T , obtained from S by the removal or addition of a single base pair, which has lower free energy. Since thermal noise can overcome the energy barrier between certain conformations in the low energy ensemble, a better model of kinetic trap might arguably be a that of a basin of attraction located about locally optimal structure S. Such a basin would be a set S of low energy structures, such that: (i) there is a folding path whose barrier energy is less than a fixed energy threshold e that cannot be overcome by thermal noise, and (ii) if T is reachable by a folding pathway from S with barrier energy less than e, then T [S. Though it is currently unclear what value of e should be taken, it may be possible to extend RNALOCOPT in this direction. This is a possible avenue for future research. (A folding pathway from S to T is a sequence S~S 0 ,S 1 , . . . ,S n~T of secondary structures, such that S iz1 is obtained by adding or removing a single base pair from S i , for each 0ƒivn. The barrier energy of a folding pathway is maxfE(S iz1 ){E(S i ) : 0ƒivn. Computing optimal folding pathways between any two secondary structures is known to be NP-complete, though there are exponential time exact algorithms [24,63] and efficient near optimal algorithms [31].) Finally, we have shown the utility of locally optimal structures by demonstrating that the variant of maximum expected accuracy structure, MEA MIN, provides the most accuracy structure prediction currently available via thermodynamic methods. The improvement in sensitivity and PPV for this method depends on the fact that we take into account the base pairing frequency of pairs (i,j) within the ensemble of locally optimal structures as well as that of the Boltzmann ensemble of all structures.
Why is the MEA MIN structure apparently closer to the native structure, at least in the benchmarking study performed in this paper? Since there is no clear answer to this question, we can only formulate a guess. Recall that there are far fewer locally optimal structures than there are of all secondary structures, and that the ensemble of locally optimal structures appears to be more consistent (i.e. less structurally diverse, at least in most cases) than the ensemble of all structures. For these two reasons, certain unlikely, pathological candidate base pairs have diminuished likelihood of contributing to the MEA LO structure. However, certain important intermediate structures, which do not appear in the ensemble of locally optimal structures, could contribute to the accuracy of the MEA structure. By taking the minimum of base pairing probabilities over both ensembles, MEA MIN is closer to the native structure. Though reasonable, we must stress that this explanation can only be speculative. For each value of n, 1000 structures were sampled, by applying the Ding-Lawrence sampling algorithm [36], as implemented in RNASUBOPT with flag -p, and by applying RNALOCOPT. For each run, the number of non-redundant samples is computed, yielding the expected number SnrT+e for RNASUBOPT and RNALOCOPT, where e is the error bound (standard deviation s= ffiffiffiffiffi 50 p , since 50 sequences generated). For each run the percent coverage of the partition function was computed; i.e. the sum of the Boltzmann factors of the nonredundant collection from 1000 samples generated by RNASUBOPT [resp. RNALOCOPT], divided by the partition function Z [resp. partition function Z LO of locally optimal structures]. Since the number of locally optimal structures is far fewer than that of all structures (see Figure 5), it is not surprising that there is proportionately more redundancy among sampled locally optimal structures than than over all structures. As well, the percentage coverage of the partition function for sampled locally optimal structures is higher than that for the Boltzmann ensemble. doi:10.1371/journal.pone.0016178.t001

Methods
We begin by providing an intuitive overview of the construction, while subsequent sections provide full details and the recurrence relations for the RNALOCOPT algorithm.

Conditional local optimality
To implement our algorithm, at each step we wish to calculate the partition function of only the locally optimal structures. Since the Turner energy model is a loop-based model, it can largely be construed as a local model. Therefore we can locally check whether or not adding the base pair (i,j) makes some structures suddenly no longer locally optimal simply by looking at nucleotides near (i,j). To do this, we need to keep track of a bit more information during our recursion than is done in McCaskill's algorithm.
In this section we show through a simple example the key idea behind the recursions. Consider the partial sequence-structure shown in the left side of Figure 9. The Boltzmann factor (portion of the Boltzmann partition function) of this structure would be included in the term Z B (i,j) in McCaskill's recursion, which denotes the partition function of all structures ending in a base pair at (i,j).
It would be natural to define an analogous term Z min B (i,j) as the partition function of all locally optimal structures ending in a base pair at (i,j). Local optimality would mean that adding or removing Given the collection of base pairing probabilities p i,j over all locally optimal structures [resp. over all structures] of a given RNA sequence a 1 , . . . ,a n , we define four measures of structural diversity. any base pair would raise the energy, or keep it the same. But in our example structure, there is one base pair for which we do not have sufficient information to know the change in energy caused by removing it -namely, the outer base pair (i,j). In the next recursive steps, this structure could be extended in several different possible ways, perhaps with a base pair (x,y) shown in the righthand side of Figure 9. At that point, we will know the energy of the two loops in which the base pair (i,j) is contained. But until then, this energy is unknown. Since we do not yet know how removing the base pair (i,j) will affect the energy, the best we can do is to inductively assume that the structure is conditionally locally optimal, conditioned on the fact that (i,j) must base pair. It will not be until we add the next base pair (x,y) that we will know whether the base pair (i,j) causes the structure to not be locally optimal, that is if removing the base pair (i,j) decreases the energy.
Consider then the structure including the base pair (x,y) on the left side of Figure 10. Remember that we could not determine the change in energy caused by removing base pair (i,j) before. That change in energy is now given by the energy of the new loop, E(L 6 ) minus the energy of the old loops, E(L 4 )zE(L 5 ), as indexed in Figure 10. However, to determine the energies E(L 4 ) and E(L 6 ), we need to know the location of the base pair (a,b).
Our approach for this internal loop example is to induct on the last two base pairs, not just the last base pair. So in our example, our example structure on the left-hand side of Figure 9 will contribute to the term Z B ((i,j),(a,b)), which denotes the partition function of all locally optimal structures with the outermost two base pairs (i, j) and (a,b). Then, if removing the base pair (i,j) doesn't lower the energy, that is if E(L 6 ){(E(L 4 )zE(L 5 )) §0, the structure on the right-hand side of Figure 9 will contribute to the term Z B ((x,y),(i,j)).
We must also check if any base pairs can be added. In our example, when adding the base pair (x,y), we check if any base pairs can be added within the internal loop L 5 defined by (i,j) and (x,y) (see Figures 9 and 10). Any other base pair additions would already have been considered earlier in the recursion, and the energy change of adding different base pairs is independent due to the loop energy model.
The previous discussion deals with internal loops. For external loops and multiloops, the motivation is similar, but the approach is more difficult, and the solution, which is more time-consuming and depends at least theoretically on the parameters of the Turner energy model, is less satisfying. As the recursion progresses, the conditionality of the optimality will be pushed outward, and in checking the final external loop, the conditionality will be  [34,35], the maximum expected accuracy (MEA) structure is computed by applying a variant of the Nussinov-Jacobson [5] algorithm using the base pairing probabilities p i,j as computed by McCaskill's algorithm [33]. The parameter a is a weight for base pairing probability; in other words, the score, following [34,35], of a structure S is given by X

Details of recursion for locally optimal structures
To do our recursion, we need to know the energies of various internal loops, hairpins, and the energies associated with a multiloop in the Turner energy model. These are available as temperaturedependent parameters. For simplicity, all calculations will be at 37uC.
We let E IL (i,j,i',j') denote the free energy of an internal loop enclosed by two base pairs, (i,j) and (i',j'), where ivi'vj'vj. The energy of a hairpin enclosed by a base pair (i,j) will be denoted by E HP (i,j). For a multiloop, such notation is not possible. The accepted energy of a multiloop is given by a multiloop penalty, a, a penalty for unpaired bases in a multiloops, b, and a penalty or bonus for a base pair within a multiloop, c, which can depend on the type of base pair being considered. The energy of the multiloop is then where k is the number of unpaired bases in the multiloop, and l is the number of bases in the multiloop. This is standard, as used in McCaskill's algorithm, and is done in part for computational reasons. There is no affine energy term associated with external loops, but their treatment is somewhat analogous to that of multiloops (indeed, a multiloop can be formed by adding a closing base pair to an external loop).

Explanation of deltas
The method of calculating local optima is straightforward. We will calculate the partition function of locally optimal structures with the same basic McCaskill algorithm used to calculate the partition function over all secondary structures. However, some modifications must be made, for at each step in our recursion, we must make sure that no base pair can be added or removed that would lower the energy. Anything that does not satisfy this property is dropped from the partition function.
The way this is done is to realize all of the different ways a single base pair can be added and removed that can lower the total energy, and to build in a check for all of these cases as we build the partition function. Figure 11 shows all of the possibilities. A base pair can be added to or removed from a hairpin, (Types 3 and 4), an internal loop, which includes bulges and stacked base pairs (Types 1 and 2), or a multiloop (types 5 and 6).
In our recursion, we will have six different delta functions corresponding to these six different cases, where each delta function is 1 if adding, or removing, the relevant base pair does not lower the energy. Such deltas will act as checks whether the structures built so far are locally optimal.
For example, to check whether we can remove a base pair from between two internal loops, we have, from type 1 in Figure 11, , if removing bp (i',j') lowers the energy 1, otherwise This delta is calculated using the energies of a given segment. The energy of the internal loops before removing the base pair are Figure 9. Example structure in recursion. In the left structure, we do not yet know the two loops bordered by the base pair (i,j). Therefore we do not yet know whether by removing this base pair, the free energy will be lowered. In the right structure, one step further in the recursion, we now know which loops border the base pair (i,j) -namely, loops L 4 and L 5 . Images created using the software VARNA [50]. doi:10.1371/journal.pone.0016178.g009 and after removing the base pair, the energy of the resultant single internal loop is Thus we calculate delta by the formula Other deltas are computed in a similar fashion. For types 2 and 4 (in Figure 11), these are precomputed, in order to speed up the algorithm. This precomputation gives us a list (each of order n 2 for a sequence of length n) of possible IL's and HP's respectively, to which an internal base pair cannot be added which would lower the energy.
One note is that some base pairs are never favorable, and thus do not need to be calculated. The important case is adding a base pair to a multiloop, which would split the multiloop into two multiloops when the multiloop is closed. This type of base pair is shown in Figure 12. Provided that there are no energy terms for either dangles within a multiloop, or coaxial stacking, this base pair will never lower the energy. This is fortunate, since it is computationally more difficult to inductively include such base pairs.

Tails, conditional optimality
Just as there are hairpins, internal loops, multiloops, and external loops in the Turner energy model, there are recursion terms for hairpins, internal loops, multiloops, and external loops. However, as we need to keep a little more context to keep track of whether we still have a set of local optima, there will be some extra information.
Note that all these structures will be conditionally locally optimal. We commonly can't know if the most exterior base pair will be locally optimal, as that will depend on future base pairs, thus we need this conditional optimality in order to perform the recursion.  pair (a,b). Images created using the software VARNA [50]. doi:10.1371/journal.pone.0016178.g010 Figure 11. The six ways that a single base pair can be added to or removed from a structure and possibly reduce the overall energy. Images created using the software VARNA [50]. doi:10.1371/journal.pone.0016178.g011 For example, for internal loops, we will denote for the partition function of all locally optimal structures on the subinterval (i,j) with an internal loop with base pairs (i,j) and (i',j'), where ivi'vj'vj. This local optimality is conditional on (i,j) being a base pair, that is, we assume (i,j) is a base pair, and will check later if this is a problem. We cannot tell whether, in the future, removing the base pair (i,j) will lower the energy or not, as we don't know the structure outside of (i,j).
A few of the recursive elements will contain tails. For example, for multiloops, we will let Z ML (i,j,p,q) denote the partition function, on the interval (i,j), for all unclosed locally optimal multiloops (with more than one base pair) that have 'tails', regions of unpaired nucleotides, of lengths p and q on the left and right side respectively. See Figure 13.
These tails are needed. In McCaskill's algorithm, for a multiloop closed by base pair (i,j), there is a recursion of the form Z ML (i,j)~e b : Z ML (i,j{1)zremaining terms: We cannot use such a recursion, as adding an unpaired base may result in a structure that is no longer a local optimum. While there may be better approaches, we avoid this problem by indexing locally optimal multiloops by their tail length. We can then glue such multiloops together with tails. See Figure 14.
We have seen that p and q can be always less than 10, this is sufficient to avoid all possible base pairs in multiloops that lower energy. Almost all such base pairs can be avoided by setting p and q to be always less than 4; this allows for considerable speed-up with little loss of accuracy.
Note that we need the assumption that a single base pair cannot split a multiloop into two multiloops and thereby lower the energy. (This is true under the present Turner energy model. See Figure 12.) Otherwise, such a gluing method could result in a base pair being possible that lowers the energy -that is, the structure would not be locally optimal.

Recursion Relations
Let Z Ã (i,j) denote the partition function of all structures ending in the base pair (i,j) which will enter a multiloop. Note that we know from the Turner energy parameters that only an internal loop can enter a multiloop. It follows that Z Ã (i,j) will be the sum of all possible internal loops ending in (i,j).
where d wobble is 1 if we have an AU or GU base pair, p wobble is the corresponding energy penalty, b is the penalty of adding a base pair in a multiloop, and d enter ML is 0 if removing the base pair (i,j) (and exposing the base pair (i',j') to the multiloop) lowers the energy. Thus (by induction) Z Ã (i,j) is the partition function for all structures that are locally optimal with respect to all of their base pairs, including (i,j). Z M1 (i,j) is the partition function for locally optimal multiloops closed by base pair (i,j) and having with exactly one component, while Z M1 (i,j,p,q) is the partition function for locally optimal multiloops with exactly one component, and which contains tails of length p and q. We let p, q range from 0 to 10, with one extra position, called ''.10'', which is reserved for long tails. Thus p and q each have 12 possible values. (However, in practice, most values of Z M1 are not stored, but calculated as needed. Only those with 1 or 2 long tails need to be stored.) The partition function Z M1 (i,j,p,q) corresponds to having a base pair at (izp,j{q) entering a multiloop, with tails out to (i,j), i and j not base-paired. Z M1 (i,j,m,.10) means an M1 element with large right tail, greater than 10. This is used because if either tail is of length w10, there are no longer any base pairs that can be Figure 13. Example of the formation of a multiloop with tails of length p and q. Images created using the software VARNA [50]. doi:10.1371/journal.pone.0016178.g013 Figure 12. Image of base pair that could not possibly lower the energy by creating a multiloop, since it creates two bordering multiloops. Images created using the software VARNA [50]. doi:10.1371/journal.pone.0016178.g012 Figure 14. Example of gluing together two pieces of a multiloop. Note that if each piece is locally optimal, then the composite, obtained by gluing the pieces together, is as well. Images created using the software VARNA [50]. doi:10.1371/journal.pone.0016178.g014 added that can reduce the energy. This was shown by exhaustive search. This allows for traditional induction, since we don't have to worry about adding a base pair causing a formally locally optimal structure to become not locally optimal.
Z ML is the partition function of multiloops with at least 2 exiting base pairs. Tails are glued together as in Figure 14. Notation is similar to Z M1 , for the same reasons. Here, the recursion is quite nice.
Define the set S~f1,2,3,:::,10,.10g. For p,q [ S Z ML (i,j,p,q)~X where in the expression (kz1{a), we replace w10 with 10. (This corresponds to unambiguously gluing the largest possible fixed tail. Otherwise there are several ambiguous ways to glue two long tails together.) Thus we add a single exiting base pair with tails during the recursion. Note, with 12 possible tail lengths, the memory usage here is 144N 2 =2. As cases of isolated base pairs far into a multiloop lowering the energy are rare, we can reduce the number of tail lengths recorded.
A similar equation for the external loop can be determined. Here we can always assume that the left end of the external loop (usually denoted with the variable i) is 1, since we never need to close an external loop. Also, we need only worry about the right tail, for the same reason. Remember, an external loop can contain 0, 1, or more entering base pairs, corresponding respectively to the empty structure, structure with one component, and structures with more than one component. In this way it is slightly different than a multiloop. where d (q~j or jw10,qw10) is 1 if q~j or (qw10 and jw10), and where again in the expression (kz1{a), w10 is replaced by 10.
The term d (q~j or j §10,q §10) actually represents the empty structure. Note that Z EL (1,0,0) will be set to 1 by the above equation, as will Z EL (1,j,j). These can be thought of as representing the empty structure, or equivalently as initial conditions.
The variable Z MLC (i,j) represents the partition function of all locally optimal closed multiloops ending in base pair (i,j). It is given by all of the ways to end a multiloop. where d wobble , p wobble are as before, a is the closing penalty of a multiloop, and d MLC (i,j,izp,j{q) is 1 if there is no base pair (x,y), ivxƒizp,j{qƒxvj, that would lower the energy of the multiloop. That is, within the available tails that close the ML, there is no way to add a base pair connecting these tails and lowering the energy. All that is left is the partition function Z IL (i,j,i',j'). This is the partition function of all structures that are locally optimal, conditional on i,j base-pairing, that exit in an internal loop with outermost base pairs (i',j') and (i,j), ivi'vj'vj. Following standard convention, we consider only internal loops of size at most 30; i.e. we can restrict to the case i'{izj{j'ƒ30.
There are 3 cases: (i) the internal loop borders a hairpin at (i',j'), (ii) the internal loop borders a multiloop at (i',j'), (iii) the internal loop borders another internal loop with base pairs (i',j'), (i'',j''). In all 3 cases, we need to do our inductive checks on optimality. For the last case, we must sum over all possible internal loops. (In practice, there is a prerecorded set of possible internal loops, increasing speed considerably.) The recursion is a sum over these three cases and is given by where d ILcheck (i,j,i',j',i'',j'')~0 if removing the base pair (i',j') lowers the energy, and d ILmin (i,j,i',j')~1 if no base pair (x,y), ivxvi',j'vyvj can be added that will (split the multiloop in two and) lower the energy. d HP (i,j,i',j') and d MLcheck (i,j,i',j') both check if removing (i',j') lowers the energy. Z HP (i,j) is the partition function of a locally optimal hairpin with outer base pair (i,j), conditional on i, j being base paired. We have Z HP (i,j)~d HP (i,j)e {E HP (i,j)=kT where E HP (i,j) is the Turner energy for the hairpin with external base pair (i,j), and d HP (i,j)~1 if the hairpin is locally optimal, that is if no base pair (i',j'), ivi'vj'vj, can be added that would lower the energy.
This gives consistent recursions. To calculate the total partition function, simply sum up all of the external loops with different tail lengths to yield Z~X q Z EL (n,q):