Network Properties of the Ensemble of RNA Structures

  • Peter Clote

    clote@bc.edu

    Affiliation Department of Biology, Boston College, Chestnut Hill, MA 02467 United States of America

  • Amir Bayegan

    Affiliation Department of Biology, Boston College, Chestnut Hill, MA 02467 United States of America

Abstract

We describe the first dynamic programming algorithm that computes the expected degree for the network, or graph G = (V, E) of all secondary structures of a given RNA sequence a = a1, …, an. Here, the nodes V correspond to all secondary structures of a, while an edge exists between nodes s, t if the secondary structure t can be obtained from s by adding, removing or shifting a base pair. Since secondary structure kinetics programs implement the Gillespie algorithm, which simulates a random walk on the network of secondary structures, the expected network degree may provide a better understanding of kinetics of RNA folding when allowing defect diffusion, helix zippering, and related conformation transformations. We determine the correlation between expected network degree, contact order, conformational entropy, and expected number of native contacts for a benchmarking dataset of RNAs. Source code is available at http://bioinformatics.bc.edu/clotelab/RNAexpNumNbors.

Introduction

RNA folding kinetics plays an important role in various biological processes, including (i) trans splicing of RNA, which is controlled by trypanosomal spliced leader (SL) RNA kinetics [1], and (ii) the hok/sok (host-killing/suppression of killing) system, which kills E. coli replicates if insufficient plasmids are transferred to the new daughter cell [2]. To better understand how macromolecules fold into their native state, energy landscapes for protein and RNA folding have been intensively studied [3–8]. In the case of RNA secondary structure formation, numerous algorithms have been developed beyond thermodynamic equilibrium structure prediction [9, 10], including algorithms (1) to determine optimal or near-optimal folding pathways [6, 7, 11–13], (2) to compute explicit solutions of the master equation for possibly coarse-grained models [14–18], and (3) to simulate stepwise folding from an initial secondary structure to the target minimum free energy (MFE) structure [5, 19–24]. Nevertheless, RNA secondary structure folding kinetics remains a computationally difficult problem, since determining optimal folding pathways is known to be NP-complete [25]. Despite increasing awareness of the importance of regulatory and catalytic RNA, no database of experimentally determined RNA folding rates currently exists. The situation is different for proteins: KineticDB provides users with a diverse set of experimentally determined folding rates for 87 unique proteins and approximately one hundred mutants [26].

It is currently an open problem to predict the folding rate of proteins and RNA molecules from the sequence alone. The goal of this paper is to raise awareness of this problem, in particular the problem of predicting RNA secondary structure folding rate from the nucleotide sequence. For proteins, it has been shown that absolute contact order, which scales as ≈ n^0.7 for sequence length n, correlates rather well with protein folding rates for two- and multi-state folding proteins, reaching a correlation of 77% [27]; see also Table 1 of [28]. Here, protein contact order is defined as the average chain separation of residues in contact (e.g. within 6 Å) in the native structure. It has also been shown that the number of native contacts correlates with folding rates of small single-domain proteins with two-state kinetics. In this case, Makarov et al. showed that ln(k) ≈ ln(N) + a + bN, where k denotes the folding rate, N is the number of contacts in the folded state, and a, b are constants whose physical meaning is understood [29].

Table 1. Comparison of the expected network degree and the length-normalized expected network degree for three RNA sequences of moderate size: 32 nt fruA, encoding the A subunit of coenzyme F420-reducing hydrogenase; 56 nt spliced leader RNA from L. collosoma; and 76 nt transfer RNA with accession code RA1180 from the database tRNAdb 2009 [41].

Unif-MS1 [resp. Unif-MS2] denotes the expected network degree for model B (uniform probability) for the MS1 [resp. MS2] move set. Turner99-MS1 [resp. Turner99-MS2] and Turner04-MS1 [resp. Turner04-MS2] denote the expected network degree for model C (Boltzmann probability for Turner 1999 and Turner 2004 energy parameters [36]) for the MS1 [resp. MS2] move set. Sample-MS1 [resp. Sample-MS2] denotes the approximation of the expected network degree for model C (Turner 1999 and Turner 2004 parameters) obtained by generating low energy structures by RNAsubopt -d0 -e 12, as explained in the text. In the case of fruA, all 971,399 possible structures were generated by RNAsubopt -d0 -e 100, so that the Sample-MS1 and Sample-MS2 values are exact; for this reason, the standard deviation values are not included. Note that for L. collosoma, the expected degree values for the Turner 2004 energy parameters are much larger than those obtained for the Turner 1999 energy parameters.

https://doi.org/10.1371/journal.pone.0139476.t001

To our knowledge, no relation has been established between RNA folding rate and either contact order or the number of native contacts, due in part to the above-mentioned absence of a database of RNA folding rates, and due in part to the notorious difficulty of estimating RNA secondary structure folding rates when using secondary structure kinetics software such as Kinfold [5], Kinefold [20], RNAKinetics [21], KFold [30], or other software [22, 23]. Such programs implement an event-driven Monte Carlo algorithm known as Gillespie’s algorithm [31]; it follows that repeated (time-consuming) simulations will generate a collection of mean first passage times which are approximately exponentially distributed. Since an exponential distribution has the property that the mean is equal to the standard deviation, it follows that precise kinetics obtained by such methods necessarily requires inordinate computation time (e.g. the population occupancy curve for yeast phe-tRNA required 3 months of CPU time on a 2.4 GHz Intel Pentium 4 running Linux [14]). Until a database of experimentally determined RNA folding rates becomes available, it is likely that the best approximation of folding rates can be made using exact, coarse-grained approaches using spectral methods, such as Treekin [14], basin hopping with RNAlocmin [17], and Hermes [18].

Apart from contact order and the number of native contacts, the expected degree of the network of RNA secondary structures of an RNA sequence is another order parameter that could play a role in RNA folding kinetics; see the left panel of Fig 1 for an example of expected network degree for the toy sequence GGGGCCC. Here, the degree of a node (secondary structure) s is the number of secondary structures t that can be obtained from s by the addition, removal or shift of a base pair. These moves constitute the default move set employed by the program Kinfold [5], often used to estimate RNA folding kinetics. Moreover, by analyzing the network G = (V, E), whose node set V consists of low energy secondary structures of E. coli phe-tRNA (RF6280 [32]) and whose edge set E consists of directed edges s → t, where t is obtained from s by a base pair addition, removal or shift, the network for phe-tRNA was shown to be small-world in [33].

Fig 1.

(Left) Network for the toy 7-mer GGGGCCC, which has 8 nodes and 16 edges (hence 32 directed edges). The expected network degree is 32/8 = 4. Red edges indicate base pair addition or removal, while blue edges indicate shift moves. (Center) Feynman circular representation of secondary structure of Y RNA. (Right) Conventional representation of secondary structure of Y RNA. According to [55], one function of Y RNA is to bind to certain misfolded RNAs, including 5S rRNA, as part of a quality control mechanism. The secondary structure depicted is the consensus secondary structure of Y RNA with EMBL accession number AAPY01489510:220–119 from Rfam family RF00195 in the Rfam database [56]. Images produced with software jViz [57].

https://doi.org/10.1371/journal.pone.0139476.g001

In this paper, we provide the first algorithm to efficiently compute the expected degree of an RNA network of secondary structures. Our work generalizes a recent paper [34], which describes a vastly simpler algorithm to compute the expected degree without consideration of shift moves. Since our current algorithm is surprisingly complex, for clarity of exposition, we consider three successive models. Model A is the RNA homopolymer model [35], in which any two positions i, j can constitute a base pair, provided only that i + 1 < j. Model B is the usual RNA secondary structure model, where positions i, j can constitute a base pair if the corresponding nucleotides form a Watson-Crick or wobble pair and i + 3 < j; however, in Model B, the energy of a structure is taken to be zero, so the probability of a structure is simply one over the number of structures. Model C extends Model B by using the Turner 2004 energy parameters [36] without dangles. Our algorithms have been extensively tested against brute-force exhaustive methods to validate both the algorithms and their implementation. Finally, we begin a preliminary investigation into the relation between network degree, contact order, conformational entropy, and number of native contacts using two benchmarking sets of RNA structures. Since we show later that expected network degree is linear in sequence length for the (theoretical) homopolymer case, we additionally compute the length-normalized network degree.

Preliminaries

Definition 1. A secondary structure for a given RNA nucleotide sequence a1, …, an is a set s of base pairs (i, j), where 1 ≤ i < j ≤ n, such that:

  1. if (i, j) ∈ s then ai, aj form either a Watson-Crick (AU, UA, CG, GC) or wobble (GU, UG) base pair,
  2. if (i, j) ∈ s then j − i > θ = 3 (a steric constraint requiring that there be at least θ = 3 unpaired bases between any two positions that are paired),
  3. if (i, j) ∈ s then for all i′ ≠ i and j′ ≠ j, (i′, j) ∉ s and (i, j′) ∉ s (nonexistence of base triples),
  4. if (i, j) ∈ s and (k, ℓ) ∈ s, then it is not the case that i < k < j < ℓ (nonexistence of pseudoknots).
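
The four conditions of Definition 1 can be checked mechanically. The following is a minimal sketch in Python, not the authors' code; the function name `is_secondary_structure` and the representation of a structure as a set of 1-based `(i, j)` tuples are assumptions made for illustration.

```python
# Hypothetical helper (not from the paper): checks conditions 1-4 of Definition 1.
# Positions are 1-based, as in the text.
CANONICAL = {"AU", "UA", "CG", "GC", "GU", "UG"}

def is_secondary_structure(seq, pairs, theta=3):
    """Return True if the set `pairs` is a secondary structure of `seq`."""
    n = len(seq)
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            return False
        if seq[i - 1] + seq[j - 1] not in CANONICAL:   # condition 1
            return False
        if j - i <= theta:                              # condition 2
            return False
        if i in used or j in used:                      # condition 3: no base triples
            return False
        used.update((i, j))
    for i, j in pairs:                                  # condition 4: no pseudoknots
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True
```

For the toy sequence GGGGCCC, the set {(1, 7), (2, 6)} passes all four checks, whereas {(1, 5), (2, 6)} fails the pseudoknot condition.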

Secondary structures can be depicted in several equivalent manners. For instance, the sequence and dot bracket representation for the secondary structure of Y RNA with EMBL accession number AAPY01489510:220–119 is given by

GGCUGGUCCGAGUGCAGUGGUGUUUACAACUAAUUGAUCACAGCCAGUUACAGAUUCCUUUGUUCCUUCUCUACUCCCACUGCUUCACUUGACUAGCCUUUU
((((((((.((..(((((((.(.....(((.((.........................)).)))...........))))))...))..))))))))))....

Y RNA is a noncoding RNA, known to be required for the initiation of chromosomal DNA replication in mammalian cells [37]; a distinct function of Y RNA is mentioned in the caption to Fig 1, where two other formats for this secondary structure are depicted. A base pair (i, j) of structure s is an external base pair if there is no base pair (x, y) ∈ s with the property that x < i < j < y. A position 1 ≤ k ≤ n is said to be visible in s if there is no base pair (i, j) ∈ s with the property that i ≤ k ≤ j. The secondary structure of Y RNA in Fig 1 has only one external base pair, i.e. (1, 98), and only four visible positions, i.e. positions 99, 100, 101, 102. Throughout the remainder of this paper, structure will mean secondary structure.
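
The notions of external base pair and visible position can be read off directly from a dot bracket string. The sketch below is illustrative only; the helper names `parse_dot_bracket`, `external_pairs` and `visible_positions` are hypothetical and positions are 1-based as in the text.

```python
# Illustrative helpers (names are assumptions, not from the paper).

def parse_dot_bracket(db):
    """Return the sorted list of base pairs (i, j), 1-based, of a dot bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced dot bracket string")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced dot bracket string")
    return sorted(pairs)

def external_pairs(pairs):
    """Base pairs (i, j) not enclosed by any other pair (x, y) with x < i < j < y."""
    return [(i, j) for (i, j) in pairs
            if not any(x < i and j < y for (x, y) in pairs)]

def visible_positions(pairs, n):
    """Positions k in 1..n such that no pair (i, j) satisfies i <= k <= j."""
    covered = set()
    for i, j in pairs:
        covered.update(range(i, j + 1))
    return [k for k in range(1, n + 1) if k not in covered]
```

For the toy string `((...)).(...).` the base pairs are (1, 7), (2, 6) and (9, 13); the external base pairs are (1, 7) and (9, 13), and the visible positions are 8 and 14.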

The base pair distance dBP(s, t) between secondary structures s, t is the number |s \ t| + |t \ s| of base pairs belonging to s but not t, or vice versa. A shift move from base pair (i, j) in the structure s is of the form (i, k) [resp. (k, j)], where (s \ {(i, j)}) ∪ {(i, k)} [resp. (s \ {(i, j)}) ∪ {(k, j)}] is a valid secondary structure. Throughout, let bp(i, j) be a boolean valued function, where bp(i, j) = 1 if positions i, j can form a base pair; i.e. if ai, aj constitute a Watson-Crick or wobble pair. Reference [5] describes the Kinfold program, which implements the Gillespie algorithm [31] for RNA secondary structure folding kinetics. Kinfold produces secondary structure folding trajectories, or sequences s = s_0, s_1, …, s_m = t, where for 0 ≤ i < m, s_{i+1} is obtained from s_i by the addition or deletion of a base pair, and (optionally) by a shift move. These are defined as follows.

The move set MS1 allows a move from structure s to structure t, if t can be obtained from s by the removal or addition of a base pair; i.e. if t = s \ {(i, j)} or t = s ∪ {(i, j)}. The move set MS2 allows moves from MS1 as well as four shift moves, described by the following. Structure t is obtained from s by the replacement of base pair (i, j) ∈ s by the distinct base pair (i, j′), or (j′, i), or (i′, j), or (j, i′), provided that t is a valid secondary structure. Figs 2, 3 and 4 depict some typical shift moves, including defect diffusion [38].
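
The move sets can be made concrete by enumerating the neighbors of a single structure. The following is a brute-force sketch under stated assumptions, not the dynamic programming algorithm of this paper: structures are frozensets of 1-based pairs, and the names `candidate_pairs`, `fits` and `neighbors` are hypothetical.

```python
# Brute-force neighbor enumeration for MS1 and MS2 (illustrative sketch only).
CANONICAL = {"AU", "UA", "CG", "GC", "GU", "UG"}

def candidate_pairs(seq, theta=3):
    """All (i, j), 1-based, that may pair: canonical nucleotides and j - i > theta."""
    n = len(seq)
    return [(i, j) for i in range(1, n + 1) for j in range(i + theta + 1, n + 1)
            if seq[i - 1] + seq[j - 1] in CANONICAL]

def fits(pair, others):
    """True if `pair` shares no endpoint with, and does not cross, any pair in `others`."""
    i, j = pair
    for k, l in others:
        if {i, j} & {k, l}:
            return False
        if i < k < j < l or k < i < l < j:
            return False
    return True

def neighbors(seq, s, theta=3, shifts=True):
    """Structures reachable from frozenset `s` by one MS1 move (plus shifts if requested)."""
    cand = candidate_pairs(seq, theta)
    out = set()
    for p in cand:
        if p in s:
            out.add(s - {p})                      # removal
        elif fits(p, s):
            out.add(s | {p})                      # addition
    if shifts:
        for p in s:
            rest = s - {p}
            for q in cand:
                # a shift keeps exactly one endpoint of p
                if q != p and len(set(p) & set(q)) == 1 and fits(q, rest):
                    out.add(rest | {q})
    return out
```

For GGGGCCC and the structure {(1, 7)}, this yields 2 neighbors under MS1 (removal of (1, 7), addition of (2, 6)) and 6 neighbors under MS2, the additional four arising from shifts to (1, 5), (1, 6), (2, 7) and (3, 7).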

Fig 2. Defect diffusion [38], where a bulge migrates stepwise to become absorbed in a hairpin loop.

The move from structure (a) to structure (b) is possible by the shift (1, 12) → (1, 13), the move from (b) to (c) by shift (2, 11) → (2, 12), etc. Our algorithm properly accounts for such moves with respect to energy models A, B, C. Image adapted from figure on page 26 [19] and produced by VARNA [58].

https://doi.org/10.1371/journal.pone.0139476.g002

Fig 3. Example of multiloop creation which is handled by our algorithm for all energy models, including the Turner energy model.

To move from (a) to (b), remove the base pair (3, 13); to move from (b) to (c), shift (4, 12) → (12, 18); to move from (c) to (d), add base pair (13, 17). Image produced by VARNA [58].

https://doi.org/10.1371/journal.pone.0139476.g003

Fig 4. Example of multiloop creation which is handled by our algorithm for energy models A, B but not for Turner energy model C.

To move from (a) to (b), apply the shift (3, 13) → (13, 17); to move from (b) to (c), apply the shift (4, 12) → (12, 18). Our algorithm for the Turner energy model properly treats the move from (a) to (b), but not from (b) to (c), as explained in the Remark at the end of Section “Remaining recursions for Qi,j and Zi,j”. Image adapted from figure on page 27 [19] and produced by VARNA [58].

https://doi.org/10.1371/journal.pone.0139476.g004

Expected network degree

Throughout this paper, let a = a1, …, an be a fixed, but arbitrary RNA sequence. Consider the set of all secondary structures of a as a network, or graph, where two structures s, t, are connected by an edge if t can be obtained from s by a base pair addition, removal or shift.

Fig 1 displays the network for a toy 7 nt sequence GGGGCCC, where moves come from move set MS2 (base pair additions and removals indicated by red edges; shift moves indicated by blue edges). Fig 5 displays the network for the slightly larger sequence ACGUACGUACGU, where moves come from move set MS2. In contrast, Fig 6 displays the network where moves are restricted to the move set MS1, and Fig 7 displays the network where shifts are the only allowable move, i.e. moves are restricted to the move set MS2\MS1. When moves are allowed to range over either MS1, or over MS2, the resulting network is connected; this is not the case for moves in MS2\MS1. Since the network represents intermediate moves in RNA folding trajectories, it is of interest to know the average network degree. This was done for move set MS1 in [34]. The goal of this paper is to describe the first algorithm that computes the expected network degree, or equivalently, the expected number of neighbors, for the RNA network defined with move set MS2. Computing the expected number of neighbors when including shift moves turns out to be remarkably difficult, so for clarity of exposition, we present three versions of the algorithm, each adding a layer of complexity. Source code for all three energy models can be downloaded from http://bioinformatics.bc.edu/clotelab/.

Fig 5. The network of all secondary structures of the 12 nt (toy) sequence ACGUACGUACGU.

The minimum free energy structure is shown in green. Edges connect structures s, t, such that t is obtained by a move in MS2 from s, or vice versa; i.e. structures are connected by an edge if they differ by a base pair addition, removal or shift. There are 35 structures, 126 directed edges between structures that differ by a base pair removal or addition, and 68 directed edges between structures that differ by a base pair shift, where each move s → t and its reverse t → s are counted separately. Altogether, there are 194 directed edges. It follows that the average network degree is 194/35 ≈ 5.54.

https://doi.org/10.1371/journal.pone.0139476.g005

Fig 6. The network of all secondary structures of the 12 nt sequence ACGUACGUACGU, where edges connect structures s, t, such that t is obtained by a move in MS1 from s, or vice versa; i.e. structures are connected by an edge if they differ by a base pair addition or removal.

There are 35 structures and 126 directed edges between structures that differ by a base pair removal or addition, hence the average network degree is 126/35 = 3.6.

https://doi.org/10.1371/journal.pone.0139476.g006

Fig 7. The network of all secondary structures of the 12 nt sequence ACGUACGUACGU, where edges appear between structures that differ by a shift move.

There are 35 structures and 68 directed edges between structures that differ by a base pair shift, hence the average network degree is 68/35 ≈ 1.94. Note that the network is not connected, unlike the previous two networks.

https://doi.org/10.1371/journal.pone.0139476.g007
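
The node and edge counts quoted for the toy networks can be reproduced by exhaustive enumeration. The sketch below is an illustration under stated assumptions rather than the authors' software: it enumerates all structures of a short sequence and counts undirected MS1 edges and undirected shift edges by pairwise comparison; the names `all_structures` and `network_stats` are hypothetical.

```python
from itertools import combinations

# Exhaustive enumeration for short toy sequences only (exponential in length).
CANONICAL = {"AU", "UA", "CG", "GC", "GU", "UG"}

def compatible(p, q):
    """True if pairs p, q share no endpoint and do not cross."""
    (i, j), (k, l) = sorted([p, q])
    if len({i, j, k, l}) < 4:
        return False
    return not (i < k < j < l)

def all_structures(seq, theta=3):
    n = len(seq)
    cand = [(i, j) for i in range(1, n + 1) for j in range(i + theta + 1, n + 1)
            if seq[i - 1] + seq[j - 1] in CANONICAL]
    result = []

    def extend(start, current):
        result.append(frozenset(current))
        for idx in range(start, len(cand)):
            if all(compatible(cand[idx], q) for q in current):
                extend(idx + 1, current + [cand[idx]])

    extend(0, [])
    return result

def network_stats(seq, theta=3):
    """Return (number of structures, undirected MS1 edges, undirected shift edges)."""
    structs = all_structures(seq, theta)
    ms1 = shift = 0
    for s, t in combinations(structs, 2):
        only_s, only_t = s - t, t - s
        if len(only_s) + len(only_t) == 1:
            ms1 += 1                              # addition or removal
        elif len(only_s) == 1 and len(only_t) == 1:
            (p,), (q,) = only_s, only_t
            if len(set(p) & set(q)) == 1:         # one endpoint is retained
                shift += 1
    return len(structs), ms1, shift
```

For GGGGCCC this returns (8, 8, 8), i.e. 16 undirected edges and an average degree of 2 · 16/8 = 4. For ACGUACGUACGU it returns (35, 63, 34); doubling these undirected counts gives the values 126 and 68 quoted in the captions of Figs 5 to 7, so the average MS2 degree is 2 · (63 + 34)/35 = 194/35 ≈ 5.54.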

The plan of this paper is as follows. Section “Results” discusses the degree distribution for move sets MS1 and MS2, obtained by exhaustive enumeration and by sampling low energy structures. Asymptotic network degree is discussed and the correlation is computed between the expected network degree, contact order, conformational entropy, and expected number of native contacts. In Section “Homopolymer Model A”, we derive the recursions for the expected number of neighbors for move set MS2, with respect to the homopolymer Model A. In the homopolymer model, introduced in [35], any two positions i < j can form a base pair, provided only that j − i > 1; i.e. in Definition 1, item (1) is removed, and item (2) is modified so that θ = 1. In this model, the partition function Z of a length n homopolymer is simply the number of well-balanced parenthesis expressions with dots, having length n and in which j − i > 1 whenever matching left and right parentheses occur at positions i and j, respectively. For this model, the probability P(s) of each structure s is equal to the uniform probability 1/Z. In Section “Uniform, non-homopolymer Model B”, we give the recursions for the non-homopolymer uniform Model B, in which every secondary structure has energy zero, but where a secondary structure of the RNA sequence a = a1, …, an must satisfy all four properties of Definition 1. In this case, the probability P(s) of structure s is defined by P(s) = exp(−E(s)/RT)/Z where R = 0.00198717 kcal/(mol·K), T is absolute temperature, and the partition function is Z = ∑s exp(−E(s)/RT). However, since E(s) = 0 for each structure s, the partition function Z is simply the number of secondary structures of a, and the probability P(s) is equal to the uniform probability P(s) = 1/Z. In Section “Model C with Turner energy parameters”, we give the recursions for the full Model C, with respect to the Turner energy model [36], which includes base stacking free energies and free energies for hairpins, bulges, internal loops and multiloops. The partition function Z = ∑s exp(−E(s)/RT) can be computed by the McCaskill algorithm [39], and the probability of structure s is the usual Boltzmann probability P(s) = exp(−E(s)/RT)/Z.

Materials and Methods

Let a = a1, …, an be an arbitrary but fixed RNA sequence. For any 1 ≤ i ≤ j ≤ n, let a[i, j] denote the subsequence ai, …, aj, and let SS[i, j] denote the set of secondary structures of a[i, j]. For s ∈ SS[i, j], let BF(s) denote the Boltzmann factor exp(−E(s)/RT) of s, and define $Q_{i,j} = \sum_{s \in SS[i,j]} BF(s) \cdot N(s)$, where N(s) is the number of secondary structures t of a[i, j] obtained from the structure s by the addition, deletion or shift of a base pair. The partition function for a[i, j] is defined by $Z_{i,j} = \sum_{s \in SS[i,j]} BF(s)$. It follows that the expected number of neighbors (network degree) is $Q_{1,n}/Z_{1,n}$. For clarity of exposition, in the following subsections, we describe recursions to compute $Q_{i,j}$ and $Z_{i,j}$ for three energy models for RNA secondary structures, each model a refinement of the previous model.
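
The quotient Q/Z is simply a Boltzmann-weighted average of the degrees N(s). As a minimal sketch, assuming that the energies E(s) and the degrees N(s) of all structures are already available as two parallel lists (the function name `expected_degree` and the default temperature of 310.15 K are assumptions made for illustration), the expected network degree can be computed directly:

```python
import math

R = 0.00198717  # gas constant in kcal/(mol K), the value quoted in the text

def expected_degree(energies, degrees, temperature=310.15):
    """Boltzmann-weighted mean degree Q/Z for parallel lists of E(s) and N(s)."""
    rt = R * temperature
    weights = [math.exp(-e / rt) for e in energies]    # Boltzmann factors BF(s)
    q = sum(w * d for w, d in zip(weights, degrees))   # Q = sum of BF(s) * N(s)
    z = sum(weights)                                   # Z = sum of BF(s)
    return q / z

# Degrees of the 8 structures of GGGGCCC under MS2; with all energies zero
# (uniform model), the expected degree is the plain average 32/8 = 4.
toy_degrees = [6, 3, 4, 6, 4, 4, 3, 2]
uniform = expected_degree([0.0] * 8, toy_degrees)
```

This direct summation is exponential in sequence length, since it requires all structures; the recursions of the following subsections compute the same quantity by dynamic programming.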

Homopolymer Model A

In this section, we derive the recursions for $Q_{1,n}$ and $Z_{1,n}$ for the homopolymer model, in which any two positions 1 ≤ i < j ≤ n can form a base pair, provided only that i + 1 < j. For the homopolymer model, there is no RNA sequence a = a1, …, an, but rather only the interval [1, n] = {1, …, n}. Thus we speak of a structure on [i, j], rather than on a[i, j]. The energy of each structure in the homopolymer model is zero, so the probability of each structure s on [i, j] equals one divided by the number of structures on [i, j]. Moreover, there is no need to compute the doubly-indexed values $Q_{i,j}$ and $Z_{i,j}$, since the values depend only on the size j − i + 1 of the interval [i, j]; i.e. if j − i = j′ − i′, then $Q_{i,j} = Q_{i',j'}$ and $Z_{i,j} = Z_{i',j'}$. Thus it is notationally simpler to define $Q_n$ [resp. $Z_n$] in place of $Q_{1,n}$ [resp. $Z_{1,n}$], and similarly for all other auxiliary functions.

For 0 ≤ n, define Qn to be the sum, taken over all structures s of [1, n], of the number of base pair additions, removals or shifts of a base pair of s. Formally, we have (1) where I denotes the indicator function, and “(x, y) → (k, ℓ)” denotes the move which consists of replacing base pair (x, y) by base pair (k, ℓ). As well, let Zn denote the total number of homopolymer structures on [1, n] with θ = 1. Recursions for Zn are well-known [35], but for completeness given in Eq (2) below.

Auxiliary functions f(n, x) and g(n, x).

Recall that here we take θ = 1 for simplicity of exposition of the ideas. Let $Z_n$ denote the total number of structures on the homopolymer of length n. Since any two positions i, j can base-pair, as long as j − i > θ = 1, we have $Z_0 = Z_1 = Z_2 = 1$ and, for n > 2,

$$Z_n = Z_{n-1} + \sum_{r=0}^{n-3} Z_r \cdot Z_{n-r-2}. \tag{2}$$

The term $Z_{n-1}$ counts all structures s on [1, n] in which n is unpaired in s, while the term $Z_r \cdot Z_{n-r-2}$ counts all structures s on [1, n] that contain the base pair (r + 1, n).
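
The counting recursion for the number of homopolymer structures is short enough to state as code. This is a minimal sketch, with the function name `count_structures` chosen for illustration; the parameter `theta` generalizes the minimum loop size, the text using θ = 1 here.

```python
def count_structures(n_max, theta=1):
    """Return [Z_0, ..., Z_{n_max}] for the homopolymer model with minimum loop size theta."""
    z = [1] * (n_max + 1)          # only the empty structure when n <= theta + 1
    for n in range(theta + 2, n_max + 1):
        # n unpaired, or n paired with r + 1 where n - (r + 1) > theta
        z[n] = z[n - 1] + sum(z[r] * z[n - r - 2] for r in range(0, n - theta - 1))
    return z
```

With θ = 1 the first values are 1, 1, 1, 2, 4, 8, 17, 37. The table is filled in O(n) space and O(n^2) time.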

Define f(n, x) to be the number of secondary structures s for a length n homopolymer, such that s has x visible positions. Now for 0 ≤ n and 0 ≤ x ≤ n, define f by f(0, 0) = 1, f(1, 0) = f(2, 0) = 0, and

$$f(n,x) = \begin{cases} Z_{n-2} + \sum_{r=1}^{n-3} f(r,0) \cdot Z_{n-r-2} & \text{if } x = 0 \text{ and } n \geq 3, \\ f(n-1,x-1) + \sum_{r=x}^{n-3} f(r,x) \cdot Z_{n-r-2} & \text{if } 1 \leq x \leq n. \end{cases} \tag{3}$$

The computation of f(n, x) uses dynamic programming and proceeds by double induction, i.e. for n fixed, induction is performed on x. The term $Z_{n-2}$ arises from structures s on [1, n] that contain the base pair (1, n); the term f(n − 1, x − 1) is the contribution from structures s on [1, n] in which n is unpaired; the term $f(r, x) \cdot Z_{n-r-2}$ accounts for all structures s on [1, n] that contain the base pair (r + 1, n).

Define g(n, x) to be the number of secondary structures s for the length n homopolymer, such that s has x visible positions in the interval [1, n − θ − 1] = [1, n − 2], and position n is unpaired in s. (4) The term f(n − 2, x) accounts for all structures s on [1, n] in which n − 1, n are unpaired. The term $Z_{n-3}$ arises in the case n > 2, x = 0 for structures s on [1, n] that contain the base pair (1, n − 1). Finally, the term $f(r, x) \cdot Z_{n-r-3}$ arises from structures s on [1, n] that contain the base pair (r + 1, n − 1). In all cases, the structures considered are unpaired at position n, and have exactly x visible positions in the interval [1, n − 2].
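
The definitions of f and g can be cross-checked against exhaustive enumeration for small n. The sketch below is one possible rendering and not the authors' implementation: it seeds the tables with f(0, 0) = 1, so that the base pair (1, n) is covered by the r = 0 term of the sum, and it fills g only for n ≥ 2. These conventions, as well as all function names, are assumptions of the sketch.

```python
from collections import Counter

def structures(i, j, theta=1):
    """Yield every homopolymer structure on [i, j] as a frozenset of pairs."""
    if j - i <= theta:
        yield frozenset()
        return
    yield from structures(i, j - 1, theta)              # j unpaired
    for k in range(i, j - theta):                       # j paired with k
        for left in structures(i, k - 1, theta):
            for inner in structures(k + 1, j - 1, theta):
                yield left | inner | {(k, j)}

def visible_count(s, hi):
    """Number of positions in [1, hi] not covered by any base pair of s."""
    return sum(1 for k in range(1, hi + 1)
               if not any(i <= k <= j for i, j in s))

def tables(n_max):
    """Dynamic programming tables z[n], f[n][x], g[n][x] for theta = 1."""
    z = [1] * (n_max + 1)
    for n in range(3, n_max + 1):
        z[n] = z[n - 1] + sum(z[r] * z[n - r - 2] for r in range(n - 2))
    f = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    g = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    f[0][0] = 1
    for n in range(1, n_max + 1):
        for x in range(n + 1):
            # n unpaired (one more visible position), or n paired with r + 1
            f[n][x] = (f[n - 1][x - 1] if x > 0 else 0) + sum(
                f[r][x] * z[n - r - 2] for r in range(n - 2))
    for n in range(2, n_max + 1):
        for x in range(n - 1):
            # n - 1 unpaired, or n - 1 paired with r + 1; n is always unpaired
            g[n][x] = f[n - 2][x] + sum(
                f[r][x] * z[n - r - 3] for r in range(n - 3))
    return z, f, g

def brute(n):
    """Brute-force counterparts of f(n, .) and g(n, .)."""
    f, g = Counter(), Counter()
    for s in structures(1, n):
        f[visible_count(s, n)] += 1
        if all(j != n for _, j in s):
            g[visible_count(s, n - 2)] += 1
    return f, g
```

Comparing `tables(8)` with `brute(n)` for n = 2, …, 8 is a quick way to validate an implementation; the tables take O(n^2) space and O(n^3) time, in agreement with the complexity stated later in the text.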

Auxiliary function En.

For 1 ≤ n, define the function En to be the number of external base pairs in all homopolymer structures on [1, n]; formally, we have (5) Recalling that Zn denotes the number of structures on [1, n], we define Z0 = 1, E0 = 1, and En = 0 for 1 ≤ n ≤ 2 = θ + 1. Note that for 1 ≤ n ≤ 2, it must be that En = 0, since the empty structure is the only possible structure on [1, n] in this case. For larger values of n, note that (6) (7) (8) Note that the rightmost term in the last line arises from the contribution of 1 for base pair (k, n). In summary, we have shown that (9)
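
The total number of external base pairs can be accumulated by conditioning on the partner of the last position. The following sketch is derived from the definition of external base pair rather than copied from the equations above, and it sets E_0 = 0 because the empty structure contains no base pairs; this base value is an assumption of the sketch and differs from the base value stated in the text. The name `external_pair_totals` is hypothetical.

```python
def external_pair_totals(n_max, theta=1):
    """Return [E_0, ..., E_{n_max}]: external base pairs summed over all structures on [1, n]."""
    z = [1] * (n_max + 1)
    for n in range(theta + 2, n_max + 1):
        z[n] = z[n - 1] + sum(z[r] * z[n - r - 2] for r in range(n - theta - 1))
    e = [0] * (n_max + 1)          # sketch convention: E_0 = 0
    for n in range(theta + 2, n_max + 1):
        # n unpaired: E_{n-1}.  n paired with k: the external pairs are those of the
        # left part [1, k-1], weighted by the Z_{n-k-1} structures inside (k, n),
        # plus 1 for the base pair (k, n) itself in each of Z_{k-1} * Z_{n-k-1} structures.
        e[n] = e[n - 1] + sum((e[k - 1] + z[k - 1]) * z[n - k - 1]
                              for k in range(1, n - theta))
    return e
```

For θ = 1 the first values are 0, 0, 0, 1, 3, 7; for example, the eight structures on [1, 5] consist of the empty structure, six structures with a single base pair, and the structure {(1, 5), (2, 4)}, which has one external base pair.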

Main function Qn.

For clarity in the derivation of Qn, we start by explicitly listing the moves in move set MS2. Let x, x′, y, y′ denote distinct positions all belonging to the interval [1, n]. The structure t can be obtained from structure s by a move from MS2, if t is a valid secondary structure and can be obtained from s by applying a move of the form 1–6.

  1. Addition of a base pair (x, y) to s.
  2. Removal of a base pair (x, y) from s.
  3. Shift of a base pair (x, y) in s to (x, y′) in t.
  4. Shift of a base pair (x, y) in s to (y′, x) in t.
  5. Shift of a base pair (x, y) in s to (x′, y) in t.
  6. Shift of a base pair (x, y) in s to (y, x′) in t.

The shift moves 3–6 are depicted in Fig 8.

Fig 8. Illustration of shift moves defined in Sections “Main function Qn” and “Recursion for function Qi,j”.

https://doi.org/10.1371/journal.pone.0139476.g008

Let $Q_n = \sum_{s} N(s)$, where the sum ranges over all structures s on [1, n] and N(s) is the number of structures t that can be obtained from s by applying a move from move set MS2. Define $Q_0 = 1$, and $Q_1 = Q_2 = 0$, $Z_{-1} = 0$, $Z_0 = Z_1 = Z_2 = 1$. For the inductive case where n > 2, initialize $Q_n = 0$ and then add the contributions from below.

Case 1(a): In this case, we consider the contribution from structures s on [1, n] in which the last position n is unpaired, and t is obtained from s by a move from MS2 involving x, y, x′, y′ ∈ [1, n − 1].

Notice that in shifts of type 3, 4 the original position x is retained, while in shifts of type 5, 6 the original position y is retained, for distinct x, x′, y in the interval [1, n − 1]. Also, notice that shifts of base pairs involving the last position n are not considered in Case 1(a) – such shifts will later be treated in cases 1(c), 2(b) and 2(c). The contribution in this case is given by (10) The term Qn−1 arises from neighbors t of s in which the last position n is unpaired, and the base pair (x, y) is added/removed/shifted in s.

Case 1(b): In this case, we consider the contribution from structures s on [1, n] in which the last position n is unpaired, and t is obtained from s by adding the base pair (k, n) for some 1 ≤ k ≤ n − θ − 1. The contribution in this case is given by (11)

Case 1(c): In this case, we consider the contribution from structures s on [1, n] in which the last position n is unpaired, and t is obtained from s by shifting the base pair (x, y) to (x, n), or by shifting the base pair (x, y) to (y, n), for distinct x, y in the interval [1, n − 1]. These shifts are treated separately.

Case 1(c)(i): Consider a shift of the form (x, y) to (x, n), for y < n. The function $E_{n-1}$ counts the number of external base pairs (x, y) where y ≤ n − 1, for all structures on [1, n − 1]. For any such (x, y), it is possible to shift the base pair (x, y) to (x, n), and so the contribution is (12)

Case 1(c)(ii): Consider a shift of the form (x, y) to (y, n), for y < n − 1. The function $E_{n-2}$ counts the sum over all structures on [1, n − 2] of the number of external base pairs (x, y) with y ≤ n − 2. Since y ≤ n − 2 and θ = 1, and n is unpaired, it is possible to shift the base pair (x, y) to (y, n) and vice versa. So far, we have not considered structures s on [1, n − 1] in which n − 1 is base-paired. For a structure s on [1, n − 1] that contains base pair (r + 1, n − 1), there are $Z_{n-r-3}$ many structures $s_2$ on [r + 2, n − 2]; moreover, for any external base pair (x, y) in a structure $s_1$ on [1, r], we can shift the base pair (x, y) to (y, n). This explains the presence of the term $\sum_r E_r \cdot Z_{n-r-3}$. Thus the contribution is (13) In conclusion, (14)

Case 2(a): The contribution from structures s on [1, n] in which the last position n is base-paired, where neighbor t is obtained from s by removal of that last base pair (k, n), is given by (15) Note that Case 2(a) is dual to Case 1(b).

Case 2(b): In this case, we consider the contribution from structures s on [1, n] in which the last position n is base-paired, where neighbor t is obtained from structure s by a shift of the last base pair (k, n) to (k′, n) for some k′ ≠ k that is visible in structure s − {(k, n)}. Note that if we were to remove base pair (k, n) from s, then the last position of s − {(k, n)} must be unpaired, and the position n − 1 may or may not be base paired. Recall that g(n, x) is the number of structures s on [1, n] that contain x visible positions in the interval [1, n − 2], and in which position n is unpaired. If we choose a first position k out of the x visible positions, and subsequently a second distinct position k′ out of the remaining x − 1 visible positions, then we properly count the contribution from structures s containing (k, n) which can be transformed to a structure t by the shift to (k′, n).

The contribution in this case is (16) since we have x choices for value k and then (x − 1) choices for k′, both selected from the x visible positions of the structure.
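
The counting argument of Case 2(b) can be checked by brute force for small n. The sketch below is an illustration under stated assumptions, with hypothetical function names: it counts the shifts (k, n) → (k′, n) directly over all homopolymer structures (θ = 1), and compares the result with the sum of x · (x − 1) over all structures in which n is unpaired, where x is the number of visible positions in [1, n − 2]. Here the distribution of x is itself obtained by enumeration, not by the recursion for g.

```python
from collections import Counter

def structures(i, j, theta=1):
    """Yield every homopolymer structure on [i, j] as a frozenset of pairs."""
    if j - i <= theta:
        yield frozenset()
        return
    yield from structures(i, j - 1, theta)
    for k in range(i, j - theta):
        for left in structures(i, k - 1, theta):
            for inner in structures(k + 1, j - 1, theta):
                yield left | inner | {(k, j)}

def is_structure(pairs, theta=1):
    """Validity check: distinct endpoints, minimum loop size, no crossing pairs."""
    ends = [e for p in pairs for e in p]
    if len(ends) != len(set(ends)):
        return False
    return all(j - i > theta for i, j in pairs) and not any(
        i < k < j < l for i, j in pairs for k, l in pairs)

def direct_count(n):
    """Number of moves (k, n) -> (k', n), summed over all structures containing some (k, n)."""
    total = 0
    for s in structures(1, n):
        for (k, j) in s:
            if j != n:
                continue
            rest = s - {(k, j)}
            total += sum(1 for k2 in range(1, n - 1)      # k' <= n - 2 since theta = 1
                         if k2 != k and is_structure(rest | {(k2, n)}))
    return total

def formula_count(n):
    """Sum of x * (x - 1) * g(n, x), with g(n, x) tabulated by enumeration."""
    g = Counter()
    for s in structures(1, n):
        if all(j != n for _, j in s):                     # position n unpaired
            x = sum(1 for k in range(1, n - 1)
                    if not any(i <= k <= j for i, j in s))
            g[x] += 1
    return sum(c * x * (x - 1) for x, c in g.items())
```

For n = 5, both functions return 6: each of the three structures {(1, 5)}, {(2, 5)}, {(3, 5)} can be shifted to the other two, while the base pair (1, 5) in {(1, 5), (2, 4)} cannot be shifted.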

Case 2(c): In this case, we consider the contribution from structures s on [1, n] in which the last position n is base-paired, where neighbor t is obtained from structure s by a shift of base pair (k, n) to (k, k′), or a shift of the last base pair (k, n) to (k′, k), for some k′ ≠ k that is visible in structure s − {(k, n)}. These shifts are treated separately.

Case 2(c)(i): Consider a shift of the form (k, n) to (k, k′), for k′ < n. The function $E_{n-1}$ counts the sum over all structures on [1, n − 1] of the number of external base pairs (k, k′) with k′ ≤ n − 1. For any such (k, k′), it is possible to apply the shift to (k, n), and vice versa. Thus Case 2(c)(i) is dual to Case 1(c)(i) and the contribution is (17)

Case 2(c)(ii): Consider a shift of the form (k, n) to (k′, k), for k′ < k − 1. The function $E_{n-2}$ counts the sum over all structures on [1, n − 2] of the number of external base pairs (k′, k) with k ≤ n − 2. Since k ≤ n − 2 and θ = 1, and n is unpaired, it is possible to shift the base pair (k′, k) to (k, n) and vice versa. By duality to Case 1(c)(ii), we have the additional contribution of $\sum_r E_r \cdot Z_{n-r-3}$ to account for shifting the base pair (y, n) to an external base pair (x, y) in a structure $s_1$ on [1, r], in the case that n − 1 is base-paired. Thus Case 2(c)(ii) is dual to Case 1(c)(ii) and the contribution is (18) In conclusion, (19)

Case 2(d): In this case, we consider the contribution from structures s on [1, n] in which the last position n is base-paired with base pair (k, n), where neighbor t is obtained from a shift or addition/deletion of a base pair in the left portion [1, k − 1] or right portion [k + 1, n − 1], so that t retains the base pair (k, n). In this case, the contribution is (20) The first term arises from the addition/removal/shift of a base pair (x, y), where k + 1 ≤ x < y ≤ n − 1, and the second term arises from the addition/removal/shift of a base pair (x, y), where 1 ≤ x < y ≤ k − 1.

Putting together all contributions from Case 1(a) through Case 2(d), we have (21) The functions f, g require the greatest space and time resources, and it is easily seen that the space [resp. time] complexity for Z is O(n) [resp. O(n^2)], for f is O(n^2) [resp. O(n^3)], for g is O(n^2) [resp. O(n^3)], and that given arrays that contain the values of f and g, the additional space [resp. time] complexity for E and Q is O(n) [resp. O(n^2)]. It follows that the expected network degree in the homopolymer case Model A can be computed in quadratic space O(n^2) and cubic time O(n^3). We have implemented a dynamic programming algorithm for each of the functions E, f, g, Q, Z, resulting in software for the expected network degree with respect to the homopolymer model. Our code has been cross-checked extensively against alternative brute-force methods, which supports its reliability.
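
A brute-force reference of the kind mentioned above can be sketched as follows. This is not the authors' test code: it enumerates all homopolymer structures on [1, n], counts the MS2 neighbors of each, and returns the pair (Q_n, Z_n), so that it is feasible only for small n. All function names are hypothetical.

```python
def structures(i, j, theta=1):
    """Yield every homopolymer structure on [i, j] as a frozenset of pairs."""
    if j - i <= theta:
        yield frozenset()
        return
    yield from structures(i, j - 1, theta)
    for k in range(i, j - theta):
        for left in structures(i, k - 1, theta):
            for inner in structures(k + 1, j - 1, theta):
                yield left | inner | {(k, j)}

def is_structure(pairs, theta=1):
    """Validity check: distinct endpoints, minimum loop size, no crossing pairs."""
    ends = [e for p in pairs for e in p]
    if len(ends) != len(set(ends)):
        return False
    return all(j - i > theta for i, j in pairs) and not any(
        i < k < j < l for i, j in pairs for k, l in pairs)

def degree(s, n, theta=1):
    """Number of MS2 neighbors of structure s on [1, n]."""
    cand = [(i, j) for i in range(1, n + 1) for j in range(i + theta + 1, n + 1)]
    deg = len(s)                                                       # removals
    deg += sum(1 for q in cand
               if q not in s and is_structure(s | {q}, theta))        # additions
    for p in s:                                                        # shifts
        rest = s - {p}
        deg += sum(1 for q in cand
                   if q != p and len(set(p) & set(q)) == 1
                   and is_structure(rest | {q}, theta))
    return deg

def brute_force_q_and_z(n, theta=1):
    """Return (Q_n, Z_n); the expected network degree is Q_n / Z_n."""
    structs = list(structures(1, n, theta))
    return sum(degree(s, n, theta) for s in structs), len(structs)
```

For example, `brute_force_q_and_z(4)` returns (10, 4), so the expected network degree of the length 4 homopolymer is 2.5, and `brute_force_q_and_z(5)` returns (34, 8). A dynamic programming implementation of the recursions for Q and Z should reproduce these values for n ≥ 3.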

Uniform, non-homopolymer Model B

In this section, we consider the uniform, non-homopolymer model B, in which secondary structures must satisfy Definition 1; i.e. compared with the notion of structure from the previous Section “Homopolymer Model A”, each base pair (i, j) of a secondary structure s of the RNA sequence a = a1, …, an must satisfy j − i > θ = 3, and ai, aj must constitute a Watson-Crick or wobble pair. In model B, the energy of each structure is zero, so the partition function Z = Z1,n is the total number of structures of a, and the probability P(s) of each structure s is 1/Z. For the recursions necessary to compute the expected network degree, where N(s) denotes the number of neighbors of s under move set MS2, we need to define new functions EL, ER, ER′, F, G. There is a correspondence between functions ELi,j − 1, aj [resp. ] { resp. Gi,j,aj, x } in the current section with the functions En−1 [resp. ] { resp. g(n, x) } from the previous Section “Homopolymer Model A”.

Critical definitions and recursions.

For a given RNA sequence a = a1, …, an, define the subsequence a[i, j] = ai, …, aj. Positions i, j can form a base pair, denoted by bp(i, j) = 1, if ai, aj is either a Watson-Crick pair AU, UA, GC, or CG, or a wobble pair GU or UG; otherwise bp(i, j) = 0. For k ∈ [1, n] and c ∈ {A, C, G, U}, we also write bp(k, c) = 1 to mean that ak, c constitute either a Watson-Crick or wobble base pair. A nucleotide position k ∈ [1, n] is said to be visible in the secondary structure s, if for every base pair (i, j) ∈ s, it is not the case that i ≤ k ≤ j. If we state that structure s has exactly x visible occurrences of a nucleotide in [i, j − θ − 1] that can base pair with c, then we mean that there are positions i ≤ i1 < i2 < ⋯ < ix ≤ j − θ − 1 visible in s, such that bp(i1, c) = 1, …, bp(ix, c) = 1; moreover there are no other positions beyond i1, …, ix with this property.
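The definitions of bp and of visible positions can be illustrated by a short Python sketch. The function names can_pair, visible_positions and visible_partners are hypothetical, and a structure is represented (as an assumption of this sketch) by a set of 1-indexed pairs (i, j) with i < j.

```python
# Watson-Crick and wobble pairs, as listed in the text
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(c1, c2):
    """bp(., .) for two nucleotides: True if they form a Watson-Crick or wobble pair."""
    return (c1, c2) in PAIRS


def visible_positions(structure, i, j):
    """Positions k in [i, j] such that no base pair (x, y) satisfies x <= k <= y."""
    return [k for k in range(i, j + 1)
            if not any(x <= k <= y for (x, y) in structure)]


def visible_partners(seq, structure, i, j, c, theta=3):
    """Visible positions in [i, j - theta - 1] whose nucleotide can pair with c."""
    return [k for k in visible_positions(structure, i, j - theta - 1)
            if can_pair(seq[k - 1], c)]


if __name__ == "__main__":
    seq = "GGGAAAUCC"
    # With base pair (2, 8) present, only position 1 is visible in [1, 5]
    print(visible_partners(seq, {(2, 8)}, 1, 9, "C"))
```

The length of the list returned by visible_partners corresponds to the number x of visible occurrences used in the definition above.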

The base pair (i, j) ∈ s is said to be an external base pair of the secondary structure s, if there is no distinct base pair (i′, j′) ∈ s with the property that i′ ≤ i < j ≤ j′. In formulas, for brevity, we write that ‘(i, j) is external in s’, to mean that (i, j) is an external base pair of s. Below we consider the set of all secondary structures of the subword a[i, j]. Recall that the indicator function I[P] is equal to 1 if relation P is true, and 0 otherwise. For 1 ≤ i ≤ j ≤ n, c ∈ {A, C, G, U}, and x ∈ [0, n], define the functions ELi,j,c, ERi,j,c, ER′i,j,c, Fi,j,c,x, Gi,j,c,x as follows. (22) (23) (24) (25) (26)

The two differences between the homopolymer Model A and the current Model B are: (1) in Model B, if (k, j) is a base pair, then the nucleotides at positions k, j must be one of AU, UA, GC, CG, GU, UG, (2) in Model B, θ = 3, so if (k, j) is a base pair, then j ≥ i + θ + 1 = i + 4. Both of these issues substantially complicate the treatment, so instead of the function En with one argument, we have three functions, ELi,j,c, ERi,j,c, ER′i,j,c, each having three arguments. The arguments i, j designate the left and right endpoints of the interval [i, j], and the functions are defined by induction on increasing values of the difference j − i. The argument c contains the value A, C, G, U for the nucleotide at position j; this allows one to test whether the nucleotide at position k ∈ [i, j − θ − 1] can form a base pair with the nucleotide at position j. Thus ELi,j,c is the sum, taken over all structures on [i, j], of the number of external base pairs (x, y) where we can alternatively form the base pair (x, j) as depicted in panel (a) of Fig 9. As well, ER′i,j,c is the sum, taken over all structures on [i, j], of the number of external base pairs (x, y) where we can alternatively form the base pair (y, j) as depicted in panel (b) of Fig 9. The function ERi,j,c is first defined, since this simplifies the recursion for ER′i,j,c.

The function Gi,j,c,x has a fourth parameter x, for which Gi,j,c,x counts the number of structures on [i, j] having exactly x visible positions (external to all base pairs) in the interval [i, j − θ − 1] = [i, j − 4] of a nucleotide that can form a base pair with nucleotide c, as depicted in panel (d) of Fig 9. It will follow that for structures having exactly x such visible positions that can form a base pair with position j, there are many pairs k′, k that admit a shift of the form (k, j) → (k′, j). The function Fi,j,c,x is introduced to simplify the recursions for G, where Fi,j,c,x counts the number of structures on [i, j] having exactly x visible occurrences of a nucleotide that can form a base pair with c. With this introduction, we give the formal definitions.
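For short sequences, the verbal description of Fi,j,c,x given above can be checked by exhaustive enumeration. The following Python sketch is not the authors' dynamic program; it enumerates all structures on [i, j] and tabulates, for each x, how many have exactly x visible positions that can pair with c. All function names are hypothetical, and structures are represented as frozensets of 1-indexed pairs.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def structures(seq, i, j, theta=3):
    """All secondary structures on the 1-indexed interval [i, j]."""
    if j - i <= theta:
        return [frozenset()]  # only the empty structure fits
    out = list(structures(seq, i, j - 1, theta))  # position j unpaired
    for k in range(i, j - theta):  # position j paired with k <= j - theta - 1
        if (seq[k - 1], seq[j - 1]) in PAIRS:
            out += [left | right | {(k, j)}
                    for left in structures(seq, i, k - 1, theta)
                    for right in structures(seq, k + 1, j - 1, theta)]
    return out


def external_pairs(s):
    """Base pairs of s that are not nested inside another base pair of s."""
    return sorted(p for p in s
                  if not any(q != p and q[0] <= p[0] and p[1] <= q[1] for q in s))


def visible_count(seq, s, i, j, c):
    """Number of positions in [i, j] outside every base pair of s that can pair with c."""
    return sum(1 for k in range(i, j + 1)
               if (seq[k - 1], c) in PAIRS
               and not any(x <= k <= y for (x, y) in s))


def brute_force_F(seq, i, j, c, theta=3):
    """Map x to the number of structures on [i, j] with exactly x visible partners of c."""
    table = {}
    for s in structures(seq, i, j, theta):
        x = visible_count(seq, s, i, j, c)
        table[x] = table.get(x, 0) + 1
    return table


if __name__ == "__main__":
    print(brute_force_F("GAAAC", 1, 5, "C"))
```

Enumeration grows exponentially with sequence length, so this is useful only as a cross-check on small inputs.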

Fig 9. Illustration of cases 1c, 1d, 2c, 2d from Section “Recursion for function Qi,j”.

https://doi.org/10.1371/journal.pone.0139476.g009

Definition of EL.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ELi,j,c by induction on j − i.

Base Case: If j − i ≤ θ, define ELi,j,c = 0.

Inductive Case: If j − i > θ, define ELi,j,c as the sum of the following (27)

Definition of ER.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ERi,j,c by induction on j − i.

Base Case: If j − i ≤ θ, define ERi,j,c = 0.

Inductive Case: If j − i > θ, define ERi,j,c as the sum of the following (28)

Definition of ER′.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ER′i,j,c by induction on j − i.

Base Case: If j − i ≤ θ, define ER′i,j,c = 0.

Inductive Case: If j − i > θ, define ER′i,j,c as the sum of the following (29) Note that the first term to the right of the equality sign in the previous equation is ERi,j − θ − 1, c and not ER′i,j − θ − 1, c.

Definition of F.

For 1 ≤ i ≤ j ≤ n, c ∈ {A, C, G, U} and x ∈ [0, n], we define Fi,j,c,x by induction on j − i. For j − i < 0, c ∈ {A, C, G, U}, and 0 ≤ x ≤ j − i + 1, define Fi,j,c,x = 0.

Base Case i = j: For c ∈ {A, C, G, U}, define Fi,i,c,bp(i, c); i.e. (30) and (31)

Base Case i < j ≤ i + θ: For i < j ≤ i + θ, and x ∈ [0, j − i + 1], define by double induction on j − i and x (32)

Inductive Case j > i + θ: For j > i + θ, and x ∈ [0, n], we define F by double induction on j − i and x, where we separate the cases x = 0 and x > 0.

Subcase x = 0: (33) Subcase x > 0: (34)

Definition of G.

Recall that Gi,j,c,x is defined to be the number of structures s having exactly x visible occurrences of a nucleotide in [i, j − θ − 1] that can base-pair with c, and in which j is unpaired. Initially define Gi,j,c,x = 0 for all i, j, c, x.

Base Case: For i ≤ j ≤ i + θ, and c ∈ {A, C, G, U}, define Gi,j,c, 0 = 0.

Inductive Case: In this case, j > i + θ, and c ∈ {A, C, G, U}. We separately treat the subcases x = 0 and x > 0.

Subcase x = 0: (35) Subcase x > 0: (36)

Computing the total number of moves using MS1.

For 1 ≤ i ≤ j ≤ n, define Qi,j to be the sum, taken over all structures s of ai, …, aj, of the number of base pair additions to, or removals from, s. Formally, we have (37) or equivalently (38) where dBP(s, t) denotes the base pair distance between structures s, t. Define Qi,j by recursion on j − i, for 1 ≤ i ≤ j ≤ n.

Base Case: For i ≤ j ≤ i + θ, define Qi,j = 0.

Inductive Case: For j > i + θ, define (39)
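The quantity Qi,j for move set MS1 can be cross-checked independently of Eq (39), which is not reproduced here. The Python sketch below relies on a double-counting observation rather than on the authors' recursion: every addition s → t is the reverse of a removal t → s, so the total number of MS1 moves equals twice the total number of base pairs summed over all structures. The function names ms1_total_dp and ms1_total_brute are hypothetical; the second function verifies the first by exhaustive enumeration.

```python
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def ms1_total_dp(seq, theta=3):
    """Return (Q, Z) for MS1 in the uniform model, using Q = 2 * (total base pairs)."""
    @lru_cache(maxsize=None)
    def zb(i, j):
        # (number of structures on [i, j], base pairs summed over those structures)
        if j - i <= theta:
            return 1, 0
        z, b = zb(i, j - 1)  # position j unpaired
        for k in range(i, j - theta):  # position j paired with k
            if (seq[k - 1], seq[j - 1]) in PAIRS:
                zl, bl = zb(i, k - 1)
                zr, br = zb(k + 1, j - 1)
                z += zl * zr
                # pairs in the left part, pairs in the inner part, and (k, j) itself
                b += bl * zr + zl * br + zl * zr
        return z, b

    z, b = zb(1, len(seq))
    return 2 * b, z


def ms1_total_brute(seq, theta=3):
    """Same quantities by exhaustive enumeration; feasible only for short sequences."""
    n = len(seq)

    def structures(i, j):
        if j - i <= theta:
            return [frozenset()]
        out = list(structures(i, j - 1))
        for k in range(i, j - theta):
            if (seq[k - 1], seq[j - 1]) in PAIRS:
                out += [left | right | {(k, j)}
                        for left in structures(i, k - 1)
                        for right in structures(k + 1, j - 1)]
        return out

    def additions(s):
        paired = {p for bp in s for p in bp}
        return sum(
            1
            for x in range(1, n + 1)
            for y in range(x + theta + 1, n + 1)
            if x not in paired and y not in paired
            and (seq[x - 1], seq[y - 1]) in PAIRS
            # no existing pair may have exactly one endpoint strictly inside (x, y)
            and all((x < u < y) == (x < v < y) for (u, v) in s)
        )

    all_s = structures(1, n)
    return sum(len(s) + additions(s) for s in all_s), len(all_s)


if __name__ == "__main__":
    q, z = ms1_total_dp("GGGAAAUCC")
    print(q, z, q / z)  # total MS1 moves, number of structures, expected degree
```

The recursive formulation is limited by the Python recursion depth for long sequences; an iterative table over increasing j − i would avoid this.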

Computing the total number of moves using MS2.

For 1 ≤ i ≤ j ≤ n, define Qi,j to be the sum, taken over all structures s of ai, …, aj, of the number of base pair additions, removals or shifts of a base pair of s. Formally, we have (40)

Now define Qi,j by recursion on j − i, for 1 ≤ i ≤ j ≤ n.

Base Case: For i ≤ j ≤ i + θ, define Qi,j = 0.

Inductive Case: For j > i + θ, define (41)

Computing the total number of moves using MS2\MS1.

For 1 ≤ i ≤ j ≤ n, define Qi,j to be the sum, taken over all structures s of ai, …, aj, of the number of shifts of a base pair of s. Formally, we have (42)

Now define Qi,j by recursion on j − i, for 1 ≤ i ≤ j ≤ n.

Base Case: For i ≤ j ≤ i + θ, define Qi,j = 0.

Inductive Case: For j > i + θ, define (43)

We have implemented a dynamic programming algorithm for each of the functions EL, ER, ER′, F, G, Q and Z, resulting in software for the expected network degree, with respect to uniform probability for the move sets MS1, MS2, MS2\MS1. The space and time resources needed for the program can be analyzed in a manner similar to that described at the end of Section “Homopolymer Model A”; however, there is an additional factor of n in both space and time requirements, so that the software runs in space O(n³) and time O(n⁴). During algorithm development and implementation, we extensively cross-checked our results against those obtained by exhaustive, brute-force counting, which supports the correctness of our code.

Model C with Turner energy parameters

Here we consider Model C, for which secondary structures satisfy Definition 1 and such that E(s) indicates the Turner energy of s, which involves free energy parameters [36] for stacked base pairs, hairpins, bulges, internal loops and multiloops. For RNA sequence a = a1, …, an, we present recursions in the following for Zi,j and Qi,j, where (44) (45) (46) (47) (48) (49) Note that I is the indicator function, and that QBi,j is the Boltzmann weighted sum of the number of neighbors, using move set MS2, where the sum is taken over all structures that contain the base pair (i, j). Similarly ZBi,j is the sum of Boltzmann factors BF(s), where the sum is taken over all structures that contain the base pair (i, j). We write bp(k, j) = 1 to mean that nucleotides ak, aj can form either a Watson-Crick or wobble base pair, and for nucleotide c ∈ {A, C, G, U}, we write bp(k, c) = 1 to mean that nucleotides ak and c can form a Watson-Crick or wobble base pair. From the context, there should be no confusion between bp(k, j) and bp(k, c).

Auxiliary functions EL, ER, ER′, F, G.

For 1 ≤ i ≤ j ≤ n, c ∈ {A, C, G, U}, and x ∈ [0, n], define the Boltzmann version of the functions defined in the previous Section “Uniform, non-homopolymer Model B”, where without risk of confusion we use the same function notations for ELi,j,c, ERi,j,c, ER′i,j,c, Fi,j,c,x, Gi,j,c,x, although the underlying definitions must be modified. (50) (51) (52) (53) (54) Recursions for a dynamic programming implementation of these functions are given later in Section “Recursions for auxiliary functions”. We focus now on how to compute Qi,j using these auxiliary functions.

Recursion for function Qi,j.

For notational convenience, define Qi,i − 1 = 0 and Zi,i−1 = 1 for all 1 ≤ i ≤ n. If i ≤ j < i + θ + 1, then for any secondary structure s of a[i, j], there are no structural neighbors of s and so Qi,j = 0. If i ≤ j < i + θ + 1, then the only secondary structure on [i, j] is the empty structure with free energy of zero, so Zi,j = 1. Now assume that i + θ + 1 ≤ j. By definition (55) For the move set MS1 (in the absence of shift moves), it has been shown in [34] that (56) However, when allowing shift moves, the situation is more complicated since there are shifts involving x, y, x′, y′ ∈ [i, j] that are neither fully contained in the segment [i, j − 1] for structures in which j is unpaired, nor fully contained in one of the segments [i, k − 1], [k, j] for structures which contain the base pair (k, j). The former shifts are treated in cases 1(c), 1(d), while the latter shifts are treated in cases 2(c), 2(d).

For clarity in the derivation of Qi,j, we start by explicitly listing the moves in move set MS2. Let x, x′, y, y′ denote distinct positions all belonging to the interval [i, j]. The structure t can be obtained from structure s by a move from MS2, if t is a valid secondary structure and can be obtained from s by applying a move of the form 1–6.

  1. Addition of a base pair (x, y) to s.
  2. Removal of a base pair (x, y) from s.
  3. Shift of a base pair (x, y) in s to (x, y′) in t.
  4. Shift of a base pair (x, y) in s to (y′, x) in t.
  5. Shift of a base pair (x, y) in s to (x′, y) in t.
  6. Shift of a base pair (x, y) in s to (y, x′) in t.

The shift moves 3–6 are depicted in Fig 8. Notice that in shifts of type 3, 4 the original position x is retained, while in shifts of type 5, 6 the original position y is retained; here x, x′, y, y′ are distinct positions in the interval [i, j].
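The six move types listed above can be enumerated directly for a small structure. The following Python sketch is a brute-force illustration, not the dynamic program developed in this section; the function names is_valid and ms2_neighbors are hypothetical. A shift is realized by removing a base pair and reattaching one of its endpoints to any other position, keeping the result only if it is a valid secondary structure, which covers moves 3–6 at once.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_valid(seq, s, theta=3):
    """Check pairing rules, hairpin size, at most one partner per position, no crossings."""
    used = set()
    for (x, y) in s:
        if y - x <= theta or (seq[x - 1], seq[y - 1]) not in PAIRS:
            return False
        if x in used or y in used:
            return False
        used.update((x, y))
    pairs = list(s)
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (x, y), (u, v) = pairs[a], pairs[b]
            if x < u < y < v or u < x < v < y:  # crossing base pairs (pseudoknot)
                return False
    return True


def ms2_neighbors(seq, s, theta=3):
    """Set of structures reachable from s by one addition, removal or shift."""
    n = len(seq)
    s = frozenset(s)
    nbrs = set()
    for bp in s:  # move 2: removal
        nbrs.add(s - {bp})
    for x in range(1, n + 1):  # move 1: addition
        for y in range(x + 1, n + 1):
            if (x, y) not in s and is_valid(seq, s | {(x, y)}, theta):
                nbrs.add(s | {(x, y)})
    for (x, y) in s:  # moves 3-6: keep one endpoint, move the other
        rest = s - {(x, y)}
        for keep in (x, y):
            for z in range(1, n + 1):
                if z in (x, y):
                    continue
                t = rest | {(min(keep, z), max(keep, z))}
                if is_valid(seq, t, theta):
                    nbrs.add(t)
    return nbrs


if __name__ == "__main__":
    # One removal, four additions and four shifts of the pair (1, 9)
    print(len(ms2_neighbors("GGGAAAUCC", {(1, 9)})))
```

Restricting the function to its first two loops gives the MS1 neighborhood, so the difference in size is the number of shift neighbors.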

In the base case, for all i ∈ [1, n], we have Qi,i − 1 = 0, Zi,i − 1 = 1, and for i ≤ j ≤ i + θ = i + 3, Qi,j = 0, Zi,j = 1. For the inductive case in which j − i > θ = 3, initialize Qi,j = 0 and then add the contributions from the cases below. The recursions for Zi,j are well-known [39] and are given later in Section “Remaining recursions for Qi,j and Zi,j”.

Case 1(a): In this case, we consider the contribution from structures s in which j is unpaired in the interval [i, j], and t is obtained from s by a move from MS2 involving x, y, x′, y′ ∈ [i, j − 1]. The contribution is (57) which accounts for the addition, removal or shift of a base pair in [i, j − 1]. Note that shifts of base pairs involving the last position j are not considered in Case 1(a); such shifts will be treated in cases 1(c), 1(d), 2(c), 2(d).

Case 1(b): In this case, we consider the contribution from structures s in which j is unpaired in [i, j], and t is obtained from s by adding the base pair (k, j) for some i ≤ k ≤ j − θ − 1 = j − 4. The contribution is (58) This term arises from those t obtained from s by adding a base pair (k, j) for some k ∈ [i, j − θ − 1].

The remaining cases 1(c), 1(d) treat shifts involving x, y, x′, y′ ∈ [i, j] in structures in which j is unpaired in [i, j], where the position j is touched; i.e. it is not the case that x, y, x′, y′ ∈ [i, j − 1] and so these shifts are not already counted in the term Qi,j − 1.

Case 1(c): In this case, depicted in panel (a) of Fig 9, we consider the contribution from structures s in which j is unpaired in [i, j], and t is obtained from s by a shift of the base pair (x, y) to (x, j) for i ≤ x ≤ y − θ − 1 and y ≤ j − 1. The function ELi,j − 1,aj is the sum, taken over all structures in which j is unpaired, of the product of the Boltzmann factor BF(s) times the number of external base pairs (x, y) in s with y ≤ j − 1 such that the nucleotide ax at position x can form a base pair with the nucleotide aj at position j. For any such (x, y), it is possible to shift the base pair (x, y) to (x, j), and vice versa. Note that the current Case 1(c) handles shifts from (x, y) to (x, j), while Case 2(b) handles shifts from (x, j) to (x, y). The contribution in the current case is (59)

Case 1(d): In this case, depicted in panel (b) of Fig 9, we consider the contribution from structures s in which j is unpaired in [i, j], and t is obtained from s by a shift of the base pair (x, y) to (y, j) for i ≤ x ≤ y − θ − 1 and y ≤ j − θ − 1. The corresponding function ER′ is the sum, taken over all structures in which j is unpaired, of the product of the Boltzmann factor BF(s) times the number of external base pairs (x, y) in s with y ≤ j − θ − 1 such that the nucleotide ay at position y can form a base pair with the nucleotide aj at position j. For any such external base pair (x, y), it is possible to shift (x, y) to (y, j), and vice versa. Note that the current Case 1(d) handles shifts from (x, y) to (y, j), while Case 2(d) handles shifts from (y, j) to (x, y). The contribution in the case at hand is (60)

Case 2(a): In this case, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a move from MS2 involving x, y, x′, y′, such that x, y, x′, y′ ∈ [i, k − 1]. The contribution is (61)

Case 2(b): In this case, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a move from MS2 involving x, y, x′, y′, such that x, y, x′, y′ ∈ [k, j]. The contribution is (62)

The remaining cases 2(c), 2(d) treat shifts involving x, y, x′, y′ ∈ [i, j] in structures which contain the base pair (k, j) for some i ≤ k ≤ j − θ − 1, where it is neither the case that x, y, x′, y′ ∈ [i, k − 1] nor x, y, x′, y′ ∈ [k, j]; i.e. cross talk shifts that touch both the left [i, k − 1] and the right [k, j] segments.

Case 2(c): In this case, depicted in panel (c) of Fig 9, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a shift of the base pair (k, j) to (k′, j) for some k′ < k that is visible in structure s\{(k, j)}. Note that for k < k′, the shift of base pair (k, j) to (k′, j) is treated in Case 2(b).

Recall that the function Fi,k − 1,aj, x is the sum of Boltzmann factors of all structures s0 on [i, k − 1] that contain exactly x occurrences of a visible position that can form a base pair with the nucleotide aj at position j. The contribution in this case is (63)

Case 2(d): In this case, depicted in panel (d) of Fig 9, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a shift of the base pair (k, j) to (k′, k) for some i ≤ k′ ≤ k − θ − 1 which is visible in s. Recall that the function Gi,k,ak, x is the sum of Boltzmann factors of all structures s0 on [i, k], in which k is unpaired, for which there are exactly x occurrences of a visible position in [i, k − θ − 1] that can form a base pair with ak. The contribution is (64)

Putting together all contributions from Case 1(a) through Case 2(d), we have (65)

Recursions for auxiliary functions.

We now provide the recursions for functions EL, ER, ER′, F and G.

Definition of EL.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ELi,j,c by induction on j − i, where (66)

Base Case: If j − i ≤ θ, define ELi,j,c = 0.

Inductive Case: If j − i > θ, define ELi,j,c as the sum of the following (67)

Definition of ER.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ERi,j,c by induction on j − i, where (68)

Base Case: If j − i ≤ θ, define ERi,j,c = 0.

Inductive Case: If j − i > θ, define ERi,j,c as the sum of the following (69)

Definition of ER′.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ER′i,j,c by induction on j − i, where (70)

Base Case: If j − i ≤ θ, define ER′i,j,c = 0.

Inductive Case: If j − i > θ, define ER′i,j,c as the sum of the following (71) Note that the first term to the right of the equality sign in the previous equation is ERi,j − θ − 1, c and not ER′i,j − θ − 1, c.

Definition of F.

For 1 ≤ i ≤ j ≤ n, c ∈ {A, C, G, U} and x ∈ [0, n], we define Fi,j,c, x by induction on j − i, where (72) Define Fi,j,c,x = 0 for j < i and c ∈ {A, C, G, U} and x ∈ [0, n].

Base Case i = j: For c ∈ {A, C, G, U}, define Fi,i,c,bp(i, c) as follows (73) and (74)

Base Case i < j ≤ i + θ: For i < j ≤ i + θ, and x ∈ [0, j − i + 1], define by double induction on j − i and x (75)

Inductive Case j > i + θ: For j > i + θ, and x ∈ [0, n], we define F by double induction on j − i and x, where we separate the cases x = 0 and x > 0.

Subcase x = 0: (76) Subcase x > 0: (77)

Definition of G.

Recall that Gi,j,c,x is defined to be the sum of Boltzmann factors of structures s having exactly x visible occurrences of a nucleotide in [i, j − θ − 1] that can base-pair with c, and in which j is unpaired, i.e. (78) Initially define Gi,j,c,x = 0 for all i, j, c, x.

Base Case: For i ≤ j ≤ i + θ, and c ∈ {A, C, G, U}, define Gi,j,c, 0 = 0.

Inductive Case: In this case, j > i + θ, and c ∈ {A, C, G, U}. We separately treat the subcases x = 0 and x > 0.

Subcase x = 0: (79) Subcase x > 0: (80)

Remaining recursions for Qi,j and Zi,j.

In this section, we furnish the remaining recursions for Qi,j, Zi,j in the Turner 2004 energy model [36]. For a fixed sequence a = a1, …, an and for 1 ≤ i ≤ j ≤ n, define (81) where Ns is the number of secondary structures that can be obtained from s by a base pair addition, removal or shift, i.e. the number of neighbors of s with respect to move set MS2. It follows that Z = Z1, n is the partition function for secondary structures, and (82) where BF(s) abbreviates the Boltzmann factor exp(−E(s)/RT) of s.
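To make the roles of Q and Z concrete, the following Python sketch computes the Boltzmann-weighted expected number of neighbors Q/Z from an explicit list of structure energies and degrees. It is a direct evaluation of the definitions, usable only when the structures can be enumerated, and is not the dynamic program developed here. The function name expected_degree and the numerical value of RT (gas constant 0.0019872 kcal/(mol·K) at an assumed temperature of 310.15 K) are assumptions of this sketch.

```python
import math

# Assumed thermal energy in kcal/mol at 37 degrees Celsius
RT = 0.0019872 * 310.15


def expected_degree(energies, degrees, rt=RT):
    """Return Q / Z, where Z sums the Boltzmann factors exp(-E(s)/RT)
    and Q sums N_s * exp(-E(s)/RT) over the given structures."""
    weights = [math.exp(-e / rt) for e in energies]
    z = sum(weights)
    q = sum(w * d for w, d in zip(weights, degrees))
    return q / z


if __name__ == "__main__":
    # Two structures of equal energy: the result is the plain average degree
    print(expected_degree([0.0, 0.0], [1, 3]))
```

When all energies are equal, the Boltzmann weights cancel and the result reduces to the uniform average used in Model B.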

To provide a self-contained treatment, we recall McCaskill’s algorithm [39], which efficiently computes the partition function. For RNA nucleotide sequence a = a1, …, an, let H(i, j) denote the free energy of a hairpin closed by base pair (i, j), while IL(i, j, i′, j′) denotes the free energy of an internal loop enclosed by the base pairs (i, j) and (i′, j′), where i < i′ < j′ < j. Internal loops comprise the cases of stacked base pairs, left/right bulges and proper internal loops. The free energy for a multiloop containing Nb base pairs and Nu unpaired bases is given by the affine approximation a + bNb + cNu.

Definition 2 (Partition function Z and related function Q)

  • Zi,j = ∑s exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j].
  • ZBi,j = ∑s exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j] which contain the base pair (i, j).
  • ZMi,j = ∑s exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j] which are contained within an enclosing multiloop having at least one component.
  • ZM1i,j = ∑s exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j] which are contained within an enclosing multiloop having exactly one component. Moreover, it is required that (i, r) is a base pair of s, for some i < r ≤ j.
  • Qi,j = ∑s Ns ⋅ exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j].
  • QBi,j = ∑s Ns ⋅ exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j] which contain the base pair (i, j).
  • QMi,j = ∑s Ns ⋅ exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j] which are contained within an enclosing multiloop having at least one component.
  • QM1i,j = ∑s Ns ⋅ exp(−E(s)/RT) where the sum is taken over all structures s of a[i, j] which are contained within an enclosing multiloop having exactly one component. Moreover, it is required that (i, r) is a base pair of s, for some i < r ≤ j.

We will define Zi,j and Qi,j by recursion on j − i, for 1 ≤ i ≤ j ≤ n.

Base Case: Recalling that θ = 3, for j − i ∈ {−1, 0, 1, 2, 3}, define Qi,j = QBi,j = 0, Zi,j = 1, ZBi,j = ZMi,j = ZM1i,j = 0, since the empty structure is the only possible secondary structure.

Inductive Case for Zi,j: For j > i + θ, define (83) (84) (85) (86)

Inductive Case for Qi,j: For j > i + θ, recall that by Eq (65) we have (87) To complete the definition of QBi,j, we need additional auxiliary functions.

Auxiliary function arc.

To complete the inductive definition of Qi,j just given, we must define QBi,j, QM1i,j, QMi,j. This first requires the following auxiliary definitions, which count the number of structures obtained by adding a base pair within a hairpin, bulge, internal loop or multiloop, or by shifting a base pair at a boundary of the loop. For θ = 3 and j − i > θ define (88)

Note that arc1a(i, j) counts the number of neighbors obtained from structure s by adding a base pair (x, y) in the interval [i, j]. In contrast, arc1b(i, j) [resp. arc1c(i, j)] counts the number of neighbors obtained from structure s by shifting the base pair (i, j) to (i, k) [resp. (k, j)] where i < k < j. The function arc2a(i, j, ℓ, r) counts the number of neighbors obtained from structure s by adding a base pair (x, y) in the internal loop bounded by the base pairs (i, j) and (ℓ, r) where i < x < ℓ < r < y < j; note that i + 1, …, ℓ − 1 and r + 1, …, j − 1 are unpaired in the internal loop bounded by (i, j) and (ℓ, r). In contrast, arc2b,1(i, j, ℓ, r) [resp. arc2b,2(i, j, ℓ, r)] counts the number of neighbors obtained from structure s by shifting the base pair (i, j) to (i, y) [resp. (ℓ, r) to either (y, ℓ) or (ℓ, y)] where y occurs in the internal loop closed on both sides by (i, j) and (ℓ, r). Similarly, arc2c,1(i, j, ℓ, r) [resp. arc2c,2(i, j, ℓ, r)] counts the number of neighbors obtained from structure s by shifting the base pair (i, j) to (x, j) [resp. (ℓ, r) to either (r, x) or (x, r)] where x occurs in the internal loop closed on both sides by (i, j) and (ℓ, r). Finally, arc2b(i, j, ℓ, r) [resp. arc2c(i, j, ℓ, r)] is equal to arc2b,1(i, j, ℓ, r) + arc2b,2(i, j, ℓ, r) [resp. arc2c,1(i, j, ℓ, r) + arc2c,2(i, j, ℓ, r)], and arc2(i, j, ℓ, r) is the sum of arc2a(i, j, ℓ, r), arc2b(i, j, ℓ, r), and arc2c(i, j, ℓ, r).

Then arc3(i, j, ℓ, r) counts the number of neighbors obtained from structure s by either adding a base pair within the internal loop defined by (i, j) and (ℓ, r), or by shifting either (i, j) or (ℓ, r). For i < j < k, the function arc4(i, j, k) counts the number of neighbors obtained from structure s by shifting the base pair (i, j) to (i, y) for some j < y ≤ k, while arc5(i, j, k) counts the number of neighbors obtained from structure s by shifting the base pair (i, j) to (j, y) for some j < y ≤ k.

Recursion for QBi,j.

We can now proceed with the definition of QBi,j, defined to be the sum of Ai,j, Bi,j, Ci,j, each of which is defined below.

Case A: (i, j) closes a hairpin.

In this case, the contribution to QBi,j is given by (89) The term 1 arises from the neighbor of s = {(i, j)} by removing base pair (i, j). The term arc1a(i + 1, j − 1) arises from neighbors of s obtained by adding a base pair in the region [i + 1, j − 1], and the term arc1b(i, j) arises from a shift of the form (i, j) → (i, y), and finally the term arc1c(i, j) arises from a shift of the form (i, j) → (x, j).

Case B: (i, j) closes a stacked base pair, bulge or internal loop, whose other closing base pair is (ℓ, r), where i < ℓ < r < j.

Following the convention in the Vienna RNA Package, we assume that all loops have at most 30 unpaired nucleotides. This convention explains the presence of 31 in some indices. In this case, the contribution to QBi,j is given by the following (90) The term 1 arises from the neighbor of s = {(i, j)} by removing base pair (i, j) (the neighbor obtained by removing base pair (ℓ, r) is counted by the term N(s) for the structure on [ℓ, r]). The term arc3(i, j, ℓ, r) counts neighbors obtained by either adding a base pair within the internal loop defined by (i, j) and (ℓ, r), or by shifting either (i, j) or (ℓ, r).

In Case C below, we follow the convention that in the summation notation , if upper bound b is smaller than lower bound a, then we intend a loop of the form: FOR i = b downto a.

Case C: (i, j) closes a multiloop.

In this case, the contribution to QBi,j is given by the following (91) Now QBi,j = Ai,j + Bi,j + Ci,j. It nevertheless remains to define the recursions for QM1i,j and QMi,j. These satisfy the following. (92) The term arc1a(k + 1, j) counts neighbors obtained by adding a base pair in [k + 1, j]; the term arc4(i, k, j) counts neighbors obtained by a shift of the base pair (i, k) to (i, y) for some k < y ≤ j; the term arc5(i, k, j) counts neighbors obtained by a shift of the base pair (i, k) to (k, y) for some k + θ < y ≤ j. Finally (93)

Note that in the first line of the equation for QMi,j, the position r is required by definition of QM1r, j to pair to some position in [r + θ + 1, j]. Thus r is the left endpoint of a base pair, whose right endpoint will not be known until a subsequent call of function QM1r, j. The term arc1a(i, r − 1) counts neighbors obtained by adding a base pair (x, y) in the interval [i, r − 1]; the term arc1c(i − 1, r) counts neighbors obtained by shifting the base pair whose left endpoint is r to the base pair (x, r) for some i ≤ x < r. This completes the description of how to compute the expected number of neighbors with respect to the Turner energy model.

Finally, to accelerate the computation of the functions arc1a, …, arc5, the 4 × n × n array ARC is precomputed, where if a = a1, …, an denotes the input RNA sequence, then (94) As mentioned, we follow the convention that bulges and interior loops have a size of at most 30 nt; however, this bound does not apply to hairpin loops or multiloops.

Remark: Suppose that s = {(i, j), (i1, j1), …, (ik, jk)} is a multiloop closed by (i, j), where i < i1 < j1 < i2 < j2 < ⋯ < ik < jk < j. Then note that we do not count neighbors of s obtained by adding a base pair (x, y) to the multiloop s, where i < x < i < j < y, nor do we count shifts within a multiloop of the form (i, j) → (i, k) for j < k, nor (i, j) → (k, j) for k < i. Following the paradigm in the treatment of multiloops in McCaskill’s partition function algorithm [39], such added base pairs and shifts cannot be included. In particular, our Turner energy algorithm properly counts shifts depicted in Figs 2 and 3, but not those depicted in Fig 4. Multiloops are energetically costly due to entropic considerations, and so are penalized in the Turner energy model. For this reason, multiloops are generally small, have few components, and contain few unpaired bases that might allow the formation of base pairs or support shift moves. If a multiloop has sufficient size to permit such moves, then its free energy will be large, hence the Boltzmann factor of such structures s is small and the contribution to ⟨N⟩ is negligible. By introducing multiloop analogues of functions EL, ER, ER′, F, and G, it should be possible to account for such additional internal multiloop moves. However, this would lead to substantial complications of the algorithm with no likely benefit, hence this will not be pursued.

Results

In this section, we describe several results obtained by applying our novel algorithms to compute the expected network degree for a given RNA sequence. The left panel of Fig 10 depicts the length-normalized expected network degree of an RNA homopolymer sequence of length n, defined to be Qn/(n · Zn). In the homopolymer model, Qn = ∑s N(s), where N(s) is the number of neighbors of s, and the sum is taken over all secondary structures s of [1, n]. In the homopolymer case, the energy is 0, so the partition function Zn equals the number of structures. Fig 10 displays the normalized network degree as a function of homopolymer size, both in the case of move set MS1 (base pair additions, removals), and move set MS2 (base pair additions, removals, shifts). An asymptotic value of 0.4742 for Qn/(n · Zn) is suggested by running the dynamic programming (DP) algorithm described in Section “Homopolymer Model A” for values of sequence length 400 ≤ n ≤ 1000. Using methods from algebraic combinatorics, we have analytically proved that the value of Qn/(n · Zn) for MS1 is ≈ 0.4734176431521986 (see [40]). Runs of the DP algorithm also suggest that the asymptotic value of Qn/(n · Zn) for MS2 is ≈ 1.530161, so that there are more than three times as many structural neighbors, on average, for move set MS2 as for move set MS1 in the homopolymer model. The right panel of Fig 10 depicts an overlay of the degree distribution for secondary structures of the 32 nt selenocysteine element of fruA, the latter encoding the A subunit of coenzyme F420-reducing hydrogenase, for move sets MS1, MS2\MS1 and MS2.

Fig 10.

(Left) Normalized expected network degree of an RNA homopolymer sequence of length n is defined to be Qn/(n · Zn); i.e. the expected network degree divided by sequence length n. Here Qn is ∑s N(s), where N(s) is the number of neighbors of s, and the sum is taken over all secondary structures s of the homopolymer. In the homopolymer case, the energy is 0, hence the partition function Zn is simply the number of structures of the length n homopolymer. The purple graph was obtained with move set MS1 (base pair additions and removals), while the red graph was obtained with move set MS2 (base pair additions, removals and shifts). For n = 998, the value of Qn/(n · Zn) with respect to MS1 is 0.472393; using methods from enumerative combinatorics, we have analytically proved that the asymptotic value of Qn/(n · Zn) with respect to MS1 is 0.4734176431521986 [40]. For n = 998, the value of Qn/(n · Zn) with respect to MS2 is 1.530161; since the values of Qn/(n · Zn) are unchanged for n ≪ 998, it is likely that the asymptotic value is close to that value. It follows that there are more than three times as many structural neighbors, on average, for move set MS2 as for move set MS1. (Right) Relative frequency for number of neighbors (degree) for the network of all secondary structures of the 32 nt fruA selenocysteine (SECIS) element, produced by exhaustive enumeration of all structures. The blue [resp. purple resp. red] curve corresponds to move set MS2 [resp. (MS2\MS1) resp. MS1].

https://doi.org/10.1371/journal.pone.0139476.g010

Figs 11 and 12 display the relative frequency (for energy model C) for the number of neighbors, or degree, respectively for the 76 nt alanine transfer RNA from Mycoplasma mycoides with accession code RA1180 from tRNAdb 2009 [41] and for the 56 nt spliced leader RNA from L. collosoma. RNAsubopt -d0 -e 12 [10] was used to generate 537,180 [resp. 266,065] structures s having free energy within 12 kcal/mol of the minimum free energy (MFE) for tRNA RA1180 [resp. spliced leader RNA from L. collosoma]. The sum Z* of all Boltzmann factors exp(−E(s)/RT) of the sampled structures was computed, and the ratio Z*/Z of Z* with respect to the partition function Z was determined to be 0.9998 for tRNA RA1180 [resp. 0.9999 for spliced leader L. collosoma]. For tRNA RA1180, the sample mean ± one standard deviation is 29.11 ± 4.63 [resp. 46.51 ± 8.74] for move set MS1 [resp. MS2] using energy model C (Turner 2004 energy parameters), while the corresponding values for L. collosoma spliced leader are 69.87 ± 34.04 [resp. 90.46 ± 37.71] for move set MS1 [resp. MS2]. Table 1 compares these values with those obtained by our dynamic programming method, and additionally compares values for both Turner 1999 and Turner 2004 energy parameters. Note the stark differences between the length-normalized degree distribution for transfer RNA (accession code RA1180 from tRNAdb 2009 [41]) and for the conformational switch of spliced leader from L. collosoma. We are currently investigating whether other conformational switches have large values of length-normalized expected number of neighbors.
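The Boltzmann relative frequencies plotted in Figs 11 and 12 are described in the captions as sums of Boltzmann factors grouped by degree and normalized by Z*. A minimal Python sketch of that bookkeeping follows; it assumes the sampled structures are already available as (energy, degree) pairs, for example parsed from RNAsubopt output by a separate step not shown here. The function name degree_distribution and the value of RT are assumptions of this sketch.

```python
import math
from collections import defaultdict

# Assumed thermal energy in kcal/mol at 37 degrees Celsius
RT = 0.0019872 * 310.15


def degree_distribution(samples, rt=RT):
    """samples: iterable of (energy in kcal/mol, degree) pairs.

    Returns (distribution, z_star): distribution maps each degree x to the
    sum of Boltzmann factors of sampled structures of degree x divided by
    z_star, and z_star is the sum of all sampled Boltzmann factors."""
    hist = defaultdict(float)
    z_star = 0.0
    for energy, degree in samples:
        w = math.exp(-energy / rt)
        hist[degree] += w
        z_star += w
    return {d: w / z_star for d, w in sorted(hist.items())}, z_star


if __name__ == "__main__":
    dist, z_star = degree_distribution([(0.0, 2), (0.0, 2), (0.0, 5)])
    print(dist, z_star)
```

Dividing z_star by the full partition function Z, computed separately, gives the coverage ratio Z*/Z reported in the text.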

Fig 11. Relative frequency for the Boltzmann weighted number of neighbors for the 76 nt alanine transfer RNA from Mycoplasma mycoides with accession code RA1180 from tRNAdb 2009 [41], where the sample mean ± one standard deviation is 29.11 ± 4.63 [resp. 46.51 ± 8.74] for move set MS1 [resp. MS2] using energy model C (Turner 2004 energy parameters).

The length-normalized sample mean is 0.3831 ± 0.0610 for MS1 [resp. 0.6120 ± 0.1150 for MS2]. The number of neighbors, or degree, is given on the x-axis. RNAsubopt -d0 -e 12 [10] was used to generate 537,180 structures s having free energy within 12 kcal/mol of the MFE. The sum Z* of all Boltzmann factors exp(−E(s)/RT) of the sampled structures was computed, and the ratio Z*/Z of Z* with respect to the partition function Z was determined to be 0.9998202. For a given number x of neighbors, the corresponding value y is defined to be the sum, taken over all structures s whose degree is x, of the Boltzmann factor exp(−E(s)/RT) of s normalized by Z*. Using our code, with respect to energy model C (Turner 2004 energy parameters), we have the following values for the expected number of neighbors: (Boltzmann-MS1); (Boltzmann-MS2).

https://doi.org/10.1371/journal.pone.0139476.g011

Fig 12. Boltzmann relative frequency for the number of neighbors for the 56 nt spliced leader RNA from L. collosoma, where the mean ± one standard deviation is 69.87 ± 34.04 [resp. 90.46 ± 37.71] for move set MS1 [resp. MS2] using energy model C (Turner 2004 energy parameters).

The length-normalized sample mean is 1.2477 ± 0.6079 for MS1 [resp. 1.6153 ± 0.6734 for MS2]. The number of neighbors, or degree, is given on the x-axis. RNAsubopt -d0 -e 12 [10] was used to generate 266,065 structures s having free energy within 12 kcal/mol of the MFE. The sum Z* of all Boltzmann factors exp(−E(s)/RT) of the sampled structures was computed, and the ratio Z*/Z of Z* with respect to the partition function Z was determined to be 0.9998812, hence values of relative frequency should be close to the corresponding values for the Boltzmann probability. For given number x of neighbors, the corresponding value y is defined to be the sum, taken over all the structures s, whose degree is x, of the Boltzmann factor exp(−E(s)/RT) of s normalized by Z*. Using our code, with respect to energy model C (Turner 2004 energy parameters), we have the following values for the expected number of neighbors: (Boltzmann-MS1); (Boltzmann-MS2).

https://doi.org/10.1371/journal.pone.0139476.g012

Fig 13 depicts the correlation between expected network degree, conformational entropy, contact order, and expected number of native contacts, computed with respect to a collection of 180 PDB files and to a collection of 1904 RNA sequence and consensus structures taken from the Rfam 12.0 database [42]. Although the results are mixed and preliminary, the PDB data suggests a possible correlation between secondary structure contact order and (uniform) expected network degree, while the Rfam data suggests a possible correlation between the expected number of native contacts and (uniform) expected network degree. Definitions and details of the computational experiments now follow.

Fig 13. Correlation of network degree (expected number of neighbors) with (absolute) contact order, conformational entropy, expected number of native contacts, etc. determined with respect to a collection of 180 PDB files (left panel, see text) and to the first sequence with its consensus structure from the seed alignment of every family from the Rfam 12.0 database [42] (sequence length was capped at 200 nt, providing 1904 sequences and consensus structures).

Move set MS1 consists of base pair additions and removals; move set MS2 consists of base pair additions, removals, and shifts. (Left) The rows [resp. columns] correspond to the following measures, proceeding from top to bottom [resp. left to right]: Unif MS1: uniform expected number of neighbors for move set MS1. Unif MS2: uniform expected number of neighbors for move set MS2. Turner MS1: Boltzmann expected number of neighbors for move set MS1. Turner MS2: Boltzmann expected number of neighbors for move set MS2. Entropy: conformational entropy $-k_B \sum_s p(s) \cdot \ln p(s)$, where the sum is taken over all structures s of a given RNA sequence, and $p(s) = \exp(-E(s)/RT)/Z$ is the Boltzmann probability [50]. 3D CO: 3D (absolute) contact order, where two nucleotides are in contact if at least one atom of each is within 6 Å. pknot CO: pseudoknot (absolute) contact order, determined from the output of RNAview. 2D CO: 2D (absolute) contact order, determined by extraction of a maximal secondary structure from RNAview output. (Right) The rows [resp. columns] correspond to the following measures, proceeding from top to bottom [resp. left to right]: Unif MS1, Unif MS2, and Entropy: as explained in the caption to the left panel. MFE CO [resp. Rfam CO]: $\sum_{(i,j) \in s_0} (j-i)/|s_0|$, where the sum is taken over all base pairs (i, j) belonging to structure $s_0$, $|s_0|$ denotes the number of base pairs in $s_0$, and $s_0$ denotes the minimum free energy [resp. Rfam consensus] structure. Native Cont: expected number of native contacts, defined by $\sum_s P(s) \cdot |s \cap s_0|$, where the sum is taken over all structures s, $P(s) = \exp(-E(s)/RT)/Z$ is the Boltzmann probability of s, and $|s \cap s_0|$ denotes the number of base pairs common to both s and $s_0$, where $s_0$ is the Rfam consensus structure.

https://doi.org/10.1371/journal.pone.0139476.g013

Contact order is considered in the context of protein folding in [43], where absolute contact order is defined by $\sum_{i<j} (j-i)/N$, where the sum is over all N pairs of residues i, j that are in contact, taken here to mean that residues i, j each contain a heavy atom (non-hydrogen) within 6 Å, and that i, j are not consecutive ($j \neq i+1$). In Fig 13, we consider several formulations of RNA contact order. The 3D absolute contact order for an RNA structure is defined as above. The pseudoknot (pknot) absolute contact order is defined as $\sum_{i<j} (j-i)/N$, where the sum is over all N base pairs (i, j) determined by RNAview [44], a program that determines hydrogen-bonded atoms of distinct nucleotides in a PDB file of RNA and additionally classifies the base pair with respect to the Leontis-Westhof classification [45]. The 2D absolute contact order is defined as $\sum_{i<j} (j-i)/N$, where the sum is over all N base pairs (i, j) in the secondary structure extracted from RNAview output by our implementation of the method described in [46, 47], which essentially applies the Nussinov-Jacobson algorithm [48] to those base pairs determined by RNAview from the tertiary PDB structure, resulting in a secondary structure having the largest number of base pairs (one could alternatively use the web server RNApdbee [49]). We also consider the corresponding versions of relative contact order, obtained by dividing the absolute contact order by RNA sequence length.
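As a concrete illustration of the 2D contact order defined above, the following Python sketch computes the absolute and relative contact order of a secondary structure given in dot-bracket notation. The function names and the choice of dot-bracket input are assumptions made for this example; the paper extracts base pairs from RNAview output instead.

```python
def pairs_from_dot_bracket(db):
    """Return the sorted list of base pairs (i, j), 1-indexed, of a
    pseudoknot-free dot-bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def absolute_contact_order(pairs):
    """Average of j - i over all N contacts (i, j); 0.0 if there are none."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(j - i for i, j in pairs) / len(pairs)


def relative_contact_order(pairs, n):
    """Absolute contact order divided by sequence length n."""
    return absolute_contact_order(pairs) / n
```

For the structure `((...))` the base pairs are (1, 7) and (2, 6), so the absolute contact order is (6 + 4)/2 = 5 and the relative contact order is 5/7. The same two functions apply unchanged to a list of 3D contacts or of RNAview base pairs, since only the list of index pairs matters.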

For benchmarking purposes, we took two datasets: (1) tertiary structures from the PDB, and (2) consensus secondary structures from the Rfam 12.0 database [42]. For the former, we used PDB files from the dataset [50], since these files have no discrepancies between the SEQRES and ATOM fields. From this set of 486 PDB files, we retained 180 PDB files with a total of 227 RNA chains, after removing PDB files of very short RNAs, as well as those PDB files consisting of NMR data for which RNAview [44] did not use the first MODEL in its determination of base pairing, as well as those for which RNAview returned no base pairing information at all. For the latter, we took the first sequence, with its consensus structure, from the seed alignment of every family of Rfam 12.0, where sequence length was capped at 200 nt. This provided a collection of 1904 sequences and consensus structures.

The left panel of Fig 13 depicts the correlation computed for the 180 PDB files between various formulations of expected network degree and RNA secondary structure conformational entropy [51] (highest correlation value of 0.90) and contact order (highest correlation value of 0.86). Here, the conformational entropy is defined by $-k_B \sum_s p(s) \cdot \ln p(s)$, where p(s) is the Boltzmann probability of secondary structure s, and the sum is taken over all secondary structures of a given RNA sequence (low entropy means that the Boltzmann probability is very high for a small number of structures; i.e. a relatively small number of structures has low free energy). The right panel of Fig 13 depicts the correlation for the 1904 Rfam consensus secondary structures between (uniform) expected network degree and various formulations of conformational entropy (highest correlation 0.80), the expected number of native contacts (highest correlation of 0.86), and two formulations of contact order (highest correlation value of 0.43). Here, the expected number of native contacts is defined by $\sum_s p(s) \cdot |s \cap s_0|$, where the sum is taken over all structures s, $p(s) = \exp(-E(s)/RT)/Z$ is the Boltzmann probability of s, and $|s \cap s_0|$ denotes the number of base pairs common to both s and the Rfam consensus structure $s_0$. At present, it is unclear why the correlation between expected network degree and contact order is higher in the PDB data than in the Rfam data.
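The two ensemble averages defined above, conformational entropy and expected number of native contacts, can be illustrated on an explicitly enumerated ensemble. The following Python sketch assumes that every structure is given as a set of base pairs together with its free energy; this is feasible only for very small ensembles and is not how the values in Fig 13 were computed. All function names are hypothetical, and the Boltzmann constant defaults to 1 so that entropy is reported in units of $k_B$.

```python
import math


def boltzmann_probabilities(energies, rt):
    """p(s) = exp(-E(s)/RT) / Z for a fully enumerated list of energies."""
    e_min = min(energies)
    # Shifting by e_min avoids overflow and cancels after normalization.
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def conformational_entropy(probs, k_b=1.0):
    """-k_B * sum_s p(s) * ln p(s); terms with p(s) = 0 contribute 0."""
    return -k_b * sum(p * math.log(p) for p in probs if p > 0.0)


def expected_native_contacts(structures, probs, native):
    """sum_s p(s) * |s intersect s0|, where s0 is the native structure."""
    native = set(native)
    return sum(p * len(set(s) & native) for s, p in zip(structures, probs))
```

With four equally probable structures, the entropy is ln 4 in units of $k_B$, the maximum possible for an ensemble of that size; a strongly peaked distribution gives a value close to 0, in line with the interpretation of low entropy given in the text.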

Conclusion

Computational methods for RNA secondary structure folding kinetics generally involve either (1) algorithms to determine optimal or near-optimal folding pathways [6, 7, 11–13], (2) explicit solutions of the master equation for possibly coarse-grained models [14–18], or (3) repeated simulations to fold an initially empty secondary structure to the target minimum free energy (MFE) structure [5, 20–24]. Despite its importance, RNA secondary structure folding kinetics remains a computationally difficult problem, since it is known that the problem of determining optimal folding pathways is NP-complete [25].

To shed light on RNA kinetics from a different perspective, in this paper we have investigated a network property of RNA secondary structures. Let G be the network corresponding to the move set MS1 [resp. MS2] of the kinetics program Kinfold [5]; i.e. G = (V, E) is a directed graph, whose vertices are the secondary structures of a given RNA sequence and whose edges s → t are defined if structure t can be obtained from s by the addition or removal [resp. addition, removal or shift] of a base pair. In [34], we described an algorithm that computes the MS1 expected network degree $\langle N \rangle = \sum_s p(s) \cdot N(s)$, where N(s) is the out-degree of secondary structure s of a user-specified RNA sequence a = a1, …, an and $p(s) = \exp(-E(s)/RT)/Z$ is the probability of structure s. In the current paper, we describe (surprisingly) much more difficult algorithms to efficiently compute the MS2 expected network degree $\langle N \rangle = \sum_s p(s) \cdot N(s)$, with respect to increasingly complex energy models A, B, C. Model A is the homopolymer model [35], which we use to present a simplified version of the more complex algorithms for models B and C. Unlike the simple homopolymer model, Model B concerns the usual notion of RNA secondary structure s, defined in Definition 1, where the energy E(s) is zero, so that the probability p(s) is one over the number of structures (uniform probability). Model C concerns the Turner energy model without dangles, so that the probability p(s) is the Boltzmann probability of s; however, due to technical issues, certain low probability MS2 moves in multiloops cannot be considered (see an example in Fig 4). The run time [resp. space] for our algorithm for Model A is $O(n^3)$ [resp. $O(n^2)$], while that for models B and C is $O(n^4)$ [resp. $O(n^3)$]; cubic space is required only for functions F, G.
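The network definition above can be made concrete with a brute-force sketch for very short sequences. The Python code below is an illustration under stated assumptions, not the dynamic programming algorithm of the paper: it assumes Watson-Crick and GU wobble pairs, a minimum of θ = 3 unpaired bases inside a base pair, no pseudoknots, and it takes a shift to mean replacing a base pair (i, j) by a valid base pair that keeps exactly one of the two positions i, j. All names are hypothetical. Structures are enumerated by a graph search from the empty structure, and the expected degree is taken with respect to uniform probability, as in Model B.

```python
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"),
            ("C", "G"), ("G", "U"), ("U", "G")}
THETA = 3  # assumption: minimum number of unpaired bases inside a base pair


def valid_pair(seq, s, i, j):
    """True if base pair (i, j), with i < j, can be added to structure s."""
    if j - i <= THETA or (seq[i], seq[j]) not in CAN_PAIR:
        return False
    for a, b in s:
        if len({a, b, i, j}) < 4:                    # shares a position
            return False
        if a < i < b < j or i < a < j < b:           # would form a pseudoknot
            return False
    return True


def neighbors(seq, s, shifts=True):
    """Set of MS1 neighbors of s, or of MS2 neighbors if shifts is True."""
    s = frozenset(s)
    n = len(seq)
    out = {s - {bp} for bp in s}                     # base pair removals
    for i in range(n):
        for j in range(i + 1, n):
            if valid_pair(seq, s, i, j):
                out.add(s | {(i, j)})                # base pair additions
    if shifts:
        for (i, j) in s:
            rest = s - {(i, j)}
            for fixed in (i, j):                     # position that is kept
                for k in range(n):
                    new = (min(fixed, k), max(fixed, k))
                    if k != fixed and new != (i, j) and valid_pair(seq, rest, *new):
                        out.add(rest | {new})        # base pair shifts
    return out


def all_structures(seq):
    """All secondary structures of seq, found by search from the empty one."""
    seen = {frozenset()}
    frontier = [frozenset()]
    while frontier:
        s = frontier.pop()
        for t in neighbors(seq, s, shifts=False):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen


def uniform_expected_degree(seq, shifts):
    """Average number of neighbors over all structures (uniform probability)."""
    structs = all_structures(seq)
    return sum(len(neighbors(seq, s, shifts)) for s in structs) / len(structs)
```

For the toy sequence `GAAAAUC` there are five structures; the sum of MS1 degrees is 10, giving an expected degree of 2.0, and two additional shift edges raise the sum to 14 and the MS2 expected degree to 2.8. The number of structures grows exponentially with sequence length, which is why a dynamic programming approach is needed for sequences of realistic size.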

Our algorithms for Models A and B are exact, computing the same values as obtained by exhaustive brute force. Our algorithm for Model C ignores certain kinds of base pair additions, removals and shifts within a multiloop. Table 1 compares the values of expected number of neighbors (expected degree) for move sets MS1 and MS2 for Models B and C, where Turner 1999 and Turner 2004 energy parameters are considered [36]. Table 1 also includes values obtained by brute force computation from structures generated by RNAsubopt [52] from the Vienna RNA Package [10]. The time required for this method is $O(n^2)$ times the number of structures sampled by RNAsubopt, plus the overhead to run RNAsubopt. Except for small sequences, this computational cost is prohibitive, which makes our dynamic programming computation of the expected number of neighbors an attractive alternative. Nevertheless, much less information is conveyed by a single number, as shown in Table 1, than by the (approximate) distribution, as shown in Fig 11 for alanine transfer RNA from Mycoplasma mycoides and Fig 12 for the spliced leader conformational switch from L. collosoma. The striking difference between these figures suggests that conformational switches may display a bimodal or multimodal degree distribution, something we are currently investigating.

Table 1 displays a strong discrepancy for the expected number of neighbors for L. collosoma when using Turner 1999 or Turner 2004 energy parameters. To investigate the origin of this odd discrepancy, we ran RNAsubopt -d0 -e 12 with Turner 2004 [resp. Turner 1999] parameters to generate 266,065 [resp. 259,626] structures for 56 nt L. collosoma spliced leader RNA, 189,404 of which were common to both collections. Letting Z*(04) [resp. Z*(99)] denote the sum of Boltzmann factors of these 189,404 structures with respect to Turner 2004 [resp. Turner 1999] parameters, we computed the (pseudo) Boltzmann probability $\Pr_{04}(s) = \exp(-E_{04}(s)/RT)/Z^*(04)$ [resp. $\Pr_{99}(s) = \exp(-E_{99}(s)/RT)/Z^*(99)$] for each of the 189,404 common structures s. The difference in expected MS2 degree for Turner04 parameters minus that for Turner99 parameters is $\sum_s (\Pr_{04}(s) - \Pr_{99}(s)) \cdot N(s) = 24.35$. The contribution to expected degree for the set of sampled structures not common to both sets is negligible, i.e. less than 0.01. The strongest differences between Turner04 and Turner99 values are for the 1799 [resp. 246] structures having degree 33 [resp. 126], where the difference $\Pr_{04}(33) - \Pr_{99}(33)$ is −0.1415 [resp. $\Pr_{04}(126) - \Pr_{99}(126)$ is 0.1570], as shown by the large negative [resp. positive] spike in Fig 14. For unknown reasons, there are striking differences in the free energy values for Turner04 and Turner99 energy models for these structures. Although the choice of Turner energy model may entail a large difference in the expected degree computed, as shown in Table 1 and Fig 14, the general form of the corresponding histograms is maintained, as shown in Figs 11 and 12. We now summarize our findings.
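The comparison described above, which restricts two sampled ensembles to their common structures, renormalizes the Boltzmann factors within each energy model, and sums the probability differences weighted by degree, can be sketched in Python as follows. The dictionaries mapping a structure identifier (for instance a dot-bracket string) to its free energy or its degree, the function names and the temperature of 37 °C are all assumptions made for this illustration.

```python
import math

# Assumption: R in kcal/(mol*K) and T = 310.15 K (37 degrees Celsius).
RT = 0.0019872 * 310.15


def renormalized_probabilities(energy, common, rt=RT):
    """Pseudo Boltzmann probabilities over the common structures only."""
    e_min = min(energy[s] for s in common)
    # Shifting by e_min avoids overflow and cancels after normalization.
    weights = {s: math.exp(-(energy[s] - e_min) / rt) for s in common}
    z_star = sum(weights.values())
    return {s: w / z_star for s, w in weights.items()}


def expected_degree_difference(energy_a, energy_b, degree, rt=RT):
    """sum_s (Pr_a(s) - Pr_b(s)) * N(s) over structures present in both
    energy dictionaries; structures sampled in only one model are ignored."""
    common = energy_a.keys() & energy_b.keys()
    pr_a = renormalized_probabilities(energy_a, common, rt)
    pr_b = renormalized_probabilities(energy_b, common, rt)
    return sum((pr_a[s] - pr_b[s]) * degree[s] for s in common)
```

Grouping the terms of this sum by degree, rather than adding them all together, gives the per-degree probability differences of the kind plotted in Fig 14.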

Fig 14. Difference in Boltzmann probabilities for 56 nt spliced leader RNA from L. collosoma with respect to move set MS2—see text for explanation.

https://doi.org/10.1371/journal.pone.0139476.g014

Given the 3D native structure of a protein, the (absolute) contact order is defined by $\sum_{i<j} (j-i)/N$, where the sum is over all N pairs of residues i, j that are in contact, where non-contiguous residues i, j are in contact if each contains a heavy atom (non-hydrogen) within 6 Å [43]. We use the definition of [43] for 3D RNA contact order, whereas we define pseudoknot (pknot) contact order by $\sum_{i<j} (j-i)/N$, where the sum is over all N base pairs (i, j) determined by RNAview [44], a program that determines hydrogen-bonded atoms of distinct nucleotides in a PDB file of RNA and additionally classifies the base pair with respect to the Leontis-Westhof classification [45]. We define 2D contact order by $\sum_{i<j} (j-i)/N$, where the sum is over all N base pairs (i, j) in the secondary structure extracted from RNAview.

For benchmarking purposes, by removing short RNAs and RNAs for which RNAview yielded no base pairing information, we extracted a set of 180 PDB files with a total of 227 RNA chains from the dataset [50] of 486 PDB files that have no discrepancies between the SEQRES and ATOM fields. For this benchmarking set, the left panel of Fig 13 shows a relatively high correlation between contact order and expected network degree; for instance, there is a correlation of 0.86 between 2D contact order and MS1 or MS2 network degree. Surprisingly, the correlation is generally higher when expected network degree is computed with respect to uniform probability (corresponding to energy model B with zero energy) rather than Boltzmann probability (corresponding to energy model C, i.e. Turner energy model). In the case of energy model C, the correlation is somewhat higher for move set MS1 than for move set MS2.

The number of native contacts in a transitional protein structure is defined as the number of pairs of noncontiguous residues i, j that are in contact (i.e. close spatial proximity) in the native structure, usually meaning the X-ray structure [53]. The importance of this reaction coordinate for protein folding has been established in [54], where Best et al. analyze long equilibrium simulations of protein folding for more than 10 proteins using molecular dynamics trajectories from D.E. Shaw Research. It follows from Markov chain theory that the expected number of visitations of (transitional) structure s is the Boltzmann probability $p(s) = \exp(-E(s)/RT)/Z$ times the trajectory length, and hence the expected number of native contacts for RNA secondary structure formation can be defined by

$$\sum_s p(s) \cdot |s \cap s_0| \qquad (95)$$

where $|s \cap s_0|$ denotes the number of base pairs common to s and the native secondary structure $s_0$, taken here to be the Rfam consensus structure used in benchmarking. In the right panel of Fig 13, we establish a relatively high correlation of 0.86 [resp. 0.84] between the expected number of native contacts for a collection of 1904 RNA sequences and their consensus secondary structures from the Rfam 12.0 database and the uniform MS1 [resp. MS2] network degree. Again, it is worth pointing out the slightly higher correlation of the MS1 measure over the MS2 measure.

RNA secondary structure folding kinetics remains a computationally difficult problem for RNA sequences of even moderate length, despite the availability of software to compute near-optimal folding pathways [7, 11, 13], to compute population occupancy curves for coarse-grained models [14, 17, 18], and to repeatedly perform simulations of the Gillespie algorithm [5, 20–23, 30]. Our motivation in this article is to approach folding kinetics from a novel network perspective, where we show that network degree is moderately highly correlated with both contact order and the expected number of native contacts, both measures known to be correlated with experimentally measured protein folding kinetics. Despite the new algorithms of this paper and the existence of other software for RNA folding kinetics, it seems clear that significant progress in this field will require a database of experimentally determined RNA folding rates, comparable to the database KineticDB containing experimentally determined folding rates for proteins [26].

Acknowledgments

We would like to thank Juan Antonio Garcia-Martin for providing code to access the Turner 1999 and 2004 parameters in a uniform manner and related programming issues. We would also like to thank the reviewers for their helpful comments. This research was funded by the National Science Foundation grant DBI-1262439. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author Contributions

Conceived and designed the experiments: PC. Performed the experiments: PC AB. Analyzed the data: PC AB. Wrote the paper: PC.

References

  1. Harris KA, Crothers DM. The Leptomonas collosoma spliced leader RNA can switch between two alternate structural forms. Biochemistry. 1993;32(20):5301–5311.
  2. Gerdes K, Gultyaev AP, Franch T, Pedersen K, Mikkelsen ND. Antisense RNA-regulated programmed cell death. Annu Rev Genet. 1997;31:1–31. pmid:9442888
  3. Bryngelson JD, Onuchic JN, Socci ND, Wolynes PG. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins. 1995 Mar;21(3):167–195. pmid:7784423
  4. Bryngelson JD, Wolynes PG. Spin glasses and the statistical mechanics of protein folding. Proc Natl Acad Sci USA. 1987;84:7524–7528. pmid:3478708
  5. Flamm C, Fontana W, Hofacker IL, Schuster P. RNA folding at elementary step resolution. RNA. 2000;6:325–338. pmid:10744018
  6. Shapiro BA, Bengali D, Kasprzak W, Wu JC. RNA folding pathway functional intermediates: their prediction and analysis. J Mol Biol. 2001 September;312(1):27–44. pmid:11545583
  7. Flamm C, Hofacker IL, Stadler PF, Wolfinger M. Barrier trees of degenerate landscapes. Z Phys Chem. 2002;216:155–173.
  8. Heine C, Scheuermann G, Flamm C, Hofacker IL, Stadler PF. Visualization of barrier tree sequences. IEEE Trans Vis Comput Graph. 2006 Sep-Oct;12(5):781–788. pmid:17080800
  9. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9(1):133–148. pmid:6163133
  10. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011;6:26. pmid:22115189
  11. Morgan SR, Higgs PG. Barrier heights between ground states in a model of RNA secondary structure. J Phys A: Math Gen. 1998;31:3153–3170.
  12. Flamm C, Hofacker IL, Maurer-Stroh S, Stadler PF, Zehl M. Design of multistable RNA molecules. RNA. 2001 February;7(2):254–265. pmid:11233982
  13. Dotu I, Lorenz WA, Van Hentenryck P, Clote P. Computing folding pathways between RNA secondary structures. Nucleic Acids Res. 2010;38(5):1711–1722. pmid:20044352
  14. Wolfinger M, Svrcek-Seiler WA, Flamm C, Stadler PF. Efficient computation of RNA folding dynamics. J Phys A: Math Gen. 2004;37:4731–4741.
  15. Zhang W, Chen SJ. RNA hairpin-folding kinetics. Proc Natl Acad Sci USA. 2002 February;99(4):1931–1936. pmid:11842187
  16. Tang X, Kirkpatrick B, Thomas S, Song G, Amato NM. Using motion planning to study RNA folding kinetics. J Comput Biol. 2005;12(6):862–881. pmid:16108722
  17. Kucharik M, Hofacker IL, Stadler PF, Qin J. Basin Hopping Graph: a computational framework to characterize RNA folding landscapes. Bioinformatics. 2014 Jul;30(14):2009–2017. pmid:24648041
  18. Senter E, Clote P. Fast, approximate kinetics of RNA folding. J Comput Biol. 2015 February;22(2):124–144. pmid:25684201
  19. Flamm C. Kinetic Folding of RNA. Universität Wien; 1998.
  20. Xayaphoummine A, Bucher T, Isambert H. Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic Acids Res. 2005 July;33(Web):W605–W610. pmid:15980546
  21. Danilova LV, Pervouchine DD, Favorov AV, Mironov AA. RNAKinetics: a web server that models secondary structure kinetics of an elongating RNA. J Bioinform Comput Biol. 2006 April;4(2):589–596. pmid:16819804
  22. Geis M, Flamm C, Wolfinger MT, Tanzer A, Hofacker IL, Middendorf M, et al. Folding kinetics of large RNAs. J Mol Biol. 2008 May;379(1):160–173. pmid:18440024
  23. Aviram I, Veltman I, Churkin A, Barash D. Efficient procedures for the numerical simulation of mid-size RNA kinetics. Algorithms Mol Biol. 2012;7(1):24. pmid:22958879
  24. Anderson JW, Haas PA, Mathieson LA, Volynkin V, Lyngso R, Tataru P, et al. Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information. Bioinformatics. 2013 March;29(6):704–710. pmid:23396120
  25. Thachuk C, Manuch J, Rafiey A, Mathieson LA, Stacho L, Condon A. An algorithm for the energy barrier problem without pseudoknots and temporary arcs. Pac Symp Biocomput. 2010:108–19.
  26. Bogatyreva NS, Osypov AA, Ivankov DN. KineticDB: a database of protein folding kinetics. Nucleic Acids Res. 2009 January;37(Database):D342–D346. pmid:18842631
  27. Ivankov DN, Bogatyreva NS, Lobanov MY, Galzitskaya OV. Coupling between properties of the protein shape and the rate of protein folding. PLoS One. 2009;4(8):e6476. pmid:19649298
  28. Galzitskaya OV. Influence of Conformational Entropy on the Protein Folding Rate. Entropy. 2010;12:961–982.
  29. Makarov DE, Keller CA, Plaxco KW, Metiu H. How the folding rate constant of simple, single-domain proteins depends on the number of native contacts. Proc Natl Acad Sci USA. 2002 March;99(6):3535–3539. pmid:11904417
  30. Dykeman EC. An implementation of the Gillespie algorithm for RNA kinetics with logarithmic time update. Nucleic Acids Res. 2015 Jul;43(12):5708–5715. pmid:25990741
  31. Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comp Phys. 1976;22(403):403–434.
  32. Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 1998;26:148–153. pmid:9399820
  33. Wuchty S. Small worlds in RNA structures. Nucleic Acids Res. 2003 February;31(3):1108–1117. pmid:12560509
  34. Clote P. Expected degree for RNA secondary structure networks. J Comp Chem. 2015 Jan;36(2):103–17.
  35. Stein PR, Waterman MS. On some new Sequences Generalizing the Catalan and Motzkin Numbers. Discrete Mathematics. 1978;26:261–272.
  36. Turner DH, Mathews DH. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 2010 January;38(Database):D280–D282. pmid:19880381
  37. Zhang AT, Langley AR, Christov CP, Kheir E, Shafee T, Gardiner TJ, et al. Dynamic interaction of Y RNAs with chromatin and initiation proteins during human DNA replication. J Cell Sci. 2011 June;124(Pt):2058–2069. pmid:21610089
  38. Pörschke D. Model calculations on the kinetics of oligonucleotide double-helix coil transitions: Evidence for a fast chain sliding reaction. Biophys Chem. 1974 August;2(2):83–96. pmid:4433687
  39. McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990;29:1105–1119. pmid:1695107
  40. Clote P. Asymptotic connectivity for the network of RNA secondary structures. arXiv. 2015 Aug;ArXiv identifier: 1508.03815.
  41. Juhling F, Morl M, Hartmann RK, Sprinzl M, Stadler PF, Putz J. tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res. 2009 January;37(Database):D159–D162. pmid:18957446
  42. Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 2014 Nov.
  43. Plaxco KW, Simons KT, Baker D. Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol. 1998 Apr;277(4):985–994. pmid:9545386
  44. Yang H, Jossinet F, Leontis N, Chen L, Westbrook J, Berman H, et al. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 2003 Jul;31(13):3450–3460. pmid:12824344
  45. Leontis NB, Westhof E. Geometric nomenclature and classification of RNA base pairs. RNA. 2001 Apr;7(4):499–512. pmid:11345429
  46. Ponty Y. Modélisation de séquences génomiques structurées, génération aléatoire et applications. Université Paris-Sud XI; 2006. Laboratoire de Recherche en Informatique.
  47. Smit S, Rother K, Heringa J, Knight R. From knotted to nested RNA structures: a variety of computational methods for pseudoknot removal. RNA. 2008 Mar;14(3):410–416. pmid:18230758
  48. Nussinov R, Jacobson AB. Fast Algorithm for Predicting the Secondary Structure of Single Stranded RNA. Proceedings of the National Academy of Sciences, USA. 1980;77(11):6309–6313.
  49. Antczak M, Zok T, Popenda M, Lukasiak P, Adamiak RW, Blazewicz J, et al. RNApdbee–a webserver to derive secondary structures from pdb files of knotted and unknotted RNAs. Nucleic Acids Res. 2014 Jul;42(Web):W368–W372. pmid:24771339
  50. Kemena C, Bussotti G, Capriotti E, Marti-Renom MA, Notredame C. Using tertiary structure for the computation of highly accurate multiple RNA alignments with the SARA-Coffee package. Bioinformatics. 2013 May;29(9):1112–1119. pmid:23449094
  51. Garcia-Martin JA, Clote P. RNA thermodynamic structural entropy. PLoS One. 2015;Preprint available at http://arxiv.org/abs/1508.05499.
  52. Wuchty S, Fontana W, Hofacker IL, Schuster P. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999;49:145–164. pmid:10070264
  53. Shakhnovich E, Farztdinov G, Gutin AM, Karplus M. Protein folding bottlenecks: A lattice Monte Carlo simulation. Phys Rev Lett. 1991 Sep;67(12):1665–1668. pmid:10044213
  54. Best RB, Hummer G, Eaton WA. Native contacts determine protein folding mechanisms in atomistic simulations. Proc Natl Acad Sci USA. 2013 Oct;110(44):17874–17879. pmid:24128758
  55. Reinisch KM, Wolin SL. Emerging themes in non-coding RNA quality control. Curr Opin Struct Biol. 2007 April;17(2):209–214. pmid:17395456
  56. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, et al. Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res. 2011 January;39(Database):D141–D145. pmid:21062808
  57. Wiese KC, Glen E, Vasudevan A. JViz.Rna–a Java tool for RNA secondary structure visualization. IEEE Trans Nanobioscience. 2005 September;4(3):212–218. pmid:16220684
  58. Darty K, Denise A, Ponty Y. VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009 Aug;25(15):1974–1975. pmid:19398448