Network Properties of the Ensemble of RNA Structures

We describe the first dynamic programming algorithm that computes the expected degree for the network, or graph G = (V, E) of all secondary structures of a given RNA sequence a = a 1, …, a n. Here, the nodes V correspond to all secondary structures of a, while an edge exists between nodes s, t if the secondary structure t can be obtained from s by adding, removing or shifting a base pair. Since secondary structure kinetics programs implement the Gillespie algorithm, which simulates a random walk on the network of secondary structures, the expected network degree may provide a better understanding of kinetics of RNA folding when allowing defect diffusion, helix zippering, and related conformation transformations. We determine the correlation between expected network degree, contact order, conformational entropy, and expected number of native contacts for a benchmarking dataset of RNAs. Source code is available at http://bioinformatics.bc.edu/clotelab/RNAexpNumNbors.


Introduction
RNA folding kinetics plays an important role in various biological processes, including (i) trans splicing of RNA, which is controlled by trypanosomal spliced leader (SL) RNA kinetics [1], and (ii) the hok/sok host-killing/suppression of killing (hok/sok) system that kills E. coli replicates if insufficient plasmids are transfered to the new daughter cell [2]. To better understand how macromolecules fold into their native state, energy landscapes for protein and RNA folding have been intensively studied [3][4][5][6][7][8]. In the case of RNA secondary structure formation, numerous algorithms have been developed beyond thermodynamic equilibrium structure prediction [9,10], including algorithms (1) to determine optimal or near-optimal folding pathways, [6,7,[11][12][13], (2) to compute explicit solutions of the master equation for possibly coarse-grained models [14][15][16][17][18], and (3) to simulate stepwise folding from an initial secondary structure to the target minimum free energy (MFE) structure [5,[19][20][21][22][23][24]. Nevertheless, RNA secondary structure folding kinetics remains a computationally difficult problem, since it is known that the problem of determining optimal folding pathways is NP-complete [25]. Despite increasing awareness of the importance of regulatory and catalytic RNA, no database currently exists of experimentally determined RNA folding rates, in contrast to the situation for proteins. Indeed, KineticDB is a database that provides users with a diverse set of experimentally determined folding rates for 87 unique proteins and approximately one hundred mutants [26].
It is currently an open problem to predict the folding rate of proteins and RNA molecules from the sequence alone. The goal of this paper is to raise awareness of this problem-in particular, the problem of predicting RNA secondary structure folding rate from the nucleotide sequence. For proteins, it has been shown that absolute contact order, which scales as % n 0.7 for sequence length n, correlates rather well with protein folding rates for two-and multi-state folding proteins, reaching a correlation of 77% [27]-see as well Table 1 of [28]. Here, protein contact order is defined as the average chain separation of residues in contact (e.g. within 6 Å) in the native structure. It has also been shown that the number of native contacts correlates with folding rates of small single-domain proteins with two-state kinetics. In this case, Makarov et al. showed that ln(k) % ln(N) + a + bN, where k denotes the folding rate, N is the number of contacts in the folded state, and a, b are constants whose physical meaning is understood [29].
To our knowledge, no relation has been established between RNA folding rate and either contact order or the number of native contacts, due in part to the above-mentioned absence of a database of RNA folding rates, and due in part to the notorious difficulty of estimating RNA secondary structure folding rates when using secondary structure kinetics software such as Kinfold [5], Kinefold [20], RNAKinetics [21], KFold [30], or other software [22,23]. Such programs implement an event-driven Monte Carlo algorithm known as Gillespie's algorithm [31]; it follows that repeated (time-consuming) simulations will generate a collection of mean first passage times which are approximately exponentially distributed. Since an exponential distribution has the property that the mean is equal to the standard deviation, it follows that precise kinetics obtained by such methods necessarily requires inordinate computation time (e.g. the population occupancy curve for yeast phe-tRNA required 3 months of CPU time on a 2.4 GHz Intel Pentium 4 running linux [14]). Until the availability of a database of experimentally determined RNA folding rates, it is likely that the best approximation of folding rates can be made using exact, coarse-grained approaches using spectral methods, as Treekin [14], basin hopping with RNAlocmin [17], and Hermes [18].
Apart from contact order and the number of native contacts, the expected degree of the network of RNA secondary structures of an RNA sequence is another order parameter that could play a role in RNA folding kinetics-see the left panel of Fig 1 for an example of expected network degree for the toy sequence GGGGCCC. Here, the degree of a node (secondary structure) s is the number of secondary structures t that can be obtained from s by the addition, removal or shift of a base pair. These moves constitute the default move set employed by the program Kinfold [5], often used to estimate RNA folding kinetics. Moreover, by analyzing the network G = (V, E), whose node set V consists of low energy secondary structures of E. coli phe-tRNA (RF6280 [32]) and whose edge set E consists of directed edges s ! t, where t is obtained from s by a base pair addition, removal or shift, the network for phe-tRNA was shown to be small-world in [33].
In this paper, we provide the first algorithm to efficiently compute the expected degree of an RNA network of secondary structures. Our work generalizes a recent paper [34], which describes a vastly simpler algorithm to compute the expected degree without consideration of shift moves. Since our current algorithm is surprisingly complex, for clarity of exposition, we consider three successive models. Model A is the RNA homopolymer model [35], in which any two positions i, j can constitute a base pair, provided only that i + 1 < j. Model B is the usual RNA secondary structure model, where positions i, j can constitute a base pair if the corresponding nucleotides form a Watson-Crick or wobble pair and i + 3 < j; however, in Model B, the energy of a structure is taken to be zero, so the probability of a structure is simply one over the number of structures. Model C extends Model B by using the Turner 2004 energy parameters [36] without dangles. Our algorithms have been extensively tested against bruteforce exhaustive methods to be sure of algorithm and implementation. Finally, we begin a preliminary investigation into the relation between network degree, contact order, conformational entropy, and number of native contacts using two benchmarking sets of RNA structures. Since we show later that expected network degree is linear in sequence length for the (theoretical) homopolymer case, we additionally compute the length-normalized network degree. Table 1. This table compares expected network degree and the length-normalized expected network degree for three RNA sequences of moderate size: 32 nt fruA, encoding the A subunit of coenzyme F420-reducing hydrogenase; tRNA RA1180, 56 nt spliced leader RNA from L. collosoma; 76 nt transfer RNA with accession code RA1180 from the database tRNAdb 2009 [41]. Unif-MS1 [resp. Unif-MS2] denote the expected network degree for model B (uniform probability) for MS1 [resp. MS2] move set. Turner99-MS1 [resp. Turner99-MS2] and Turner04-MS1 [resp. Turner04-MS2] and denote the expected network degree for model C (Boltzmann probability for Turner 1999 and Turner 2004 energy parameters [36]) for MS1 [resp. MS2] move set. Sample-MS1 [resp. Sample-MS2] denotes the approximation of the expected network degree for model C (Turner 1999 andTurner 2004 parameters) obtained by generating low energy structures by RNAsubopt -d0 -e 12, as explained in the text. In the case of fruA, all 971,399 possible structures were generated by RNAsubopt -d0 -e 100, so that Sample-MS1 and Sample-MS2 values are correct-for this reason, the standard deviation values are not included. Note that for L. collosoma, the expected degree values for the Turner 2004 [55], one function of Y RNA is to bind to certain misfolded RNAs, including 5S rRNA, as part of a quality control mechanism. The secondary structure depicted is the consensus secondary structure of Y RNA with EMBL access number AAPY01489510:220-119 from Rfam family RF00195 in the Rfam database [56]. Images produced with sofware jViz [57]. Preliminaries Definition 1. A secondary structure for a given RNA nucleotide sequence a 1 , . . ., a n is a set s of base pairs (i, j), where 1 i < j n, such that: 1. if (i, j) 2 s then a i , a j form either a Watson-Crick (AU, UA, CG, GC) or wobble (GU, UG) base pair, 2. if (i, j) 2 s then j − i > θ = 3 (a steric constraint requiring that there be at least θ = 3 unpaired bases between any two positions that are paired), 3. if (i, j) 2 s then for all i 0 6 ¼ i and j 0 6 ¼ j, (i 0 , j) = 2 s and (i, j 0 ) = 2 s (nonexistence of base triples), 4. if (i, j) 2 s and (k, ℓ) 2 s, then it is not the case that i < k < j < ℓ (nonexistence of pseudoknots).
Secondary structures can be depicted in several equivalent manners. For instance, the sequence and dot bracket representation for the secondary structure of Y RNA with EMBL access number AAPY01489510:220-119 is given by ).)))...........))))))...))..)))))))))).... Y RNA is a noncoding RNA, known to be required for the initiation of chromosomal DNA replication in mammalian cells [37]; a distinct function of Y RNA is mentioned in the caption to Fig 1, where two other formats for this secondary structure are depicted. A base pair (i, j) of structure s is an external base pair, if there is no base pair (x, y) 2 s with the property that x < i < j < y. A position 1 k n is said to be visible in s if there is no base pair (i, j) 2 s with the property that i k j. The secondary structure of Y RNA in Fig 1 has only one external base pair, i.e. (1,98), and only four visible positions, i.e. positions 99, 100, 101, 102. Throughout the remainder of this paper, structure will mean secondary structure.
The base pair distance d BP (s, t) between secondary structures s, t is the number of base pairs js − tj + jt − sj belonging to s but not t, or vice versa. A shift move from base pair (i, j) in the structure s is of the form is a valid secondary structure. Throughout, let bp(i, j) be a boolean valued function, where bp(i, j) = 1 if positions i, j can form a base pair; i.e. if a i , a j constitute a Watson-Crick or wobble pair. Reference [5] describes the Kinfold program, which implements the Gillespie algorithm [31] for RNA secondary structure folding kinetics. Kinfold produces secondary structure folding trajectories, or sequences s = s 0 , s 1 , . . ., s m = t, where for 0 i < m, s i + 1 is obtained from s i by the addition or deletion of a base pair, and (optionally) by a shift move. These are defined as follows.
The move set MS1 allows a move from structure s to structure t, if t can be obtained from s by the removal of addition of a base pair The move set MS2 allows moves from MS1 as well as four shift moves, described by the following. Structure t is obtained from s by the replacement of base pair (i, j) 2 s by the distinct base pair (i, j 0 ), or (j 0 , i), or (i 0 , j), or (j, i 0 ), provided that t is a valid secondary structure. Figs 2, 3 and 4 depict some typical shift moves, including defect diffusion [38].

Expected network degree
Throughout this paper, let a = a 1 , . . ., a n be a fixed, but arbitrary RNA sequence. Consider the set of all secondary structures of a as a network, or graph, where two structures s, t, are connected by an edge if t can be obtained from s by a base pair addition, removal or shift.  [38], where a bulge migrates stepwise to become absorbed in an hairpin loop. The move from structure (a) to structure (b) is possible by the shift (1, 12) ! (1, 13), the move from (b) to (c) by shift (2, 11) ! (2, 12), etc. Our algorithm properly accounts for such moves with respect to energy models A, B, C. Image adapted from figure on page 26 [19] and produced by VARNA [58].  Example of multiloop creation which is handled by our algorithm for all energy models, including the Turner energy model. To move from (a) to (b), remove the base pair (3,13); to move from (b) to (c), shift (4, 12) ! (12,18); to move from (c) to (d), add base pair (13,17). Image produced by VARNA [58].  Example of multiloop creation which is handled by our algorithm for energy models A, B but not for Turner energy model C. To move from (a) to (b), apply the shift (3, 13) ! (13, 17); to move from (b) to (c), apply the shift (4, 12) ! (12,18). Our algorithm for the Turner energy model properly treats the move from (a) to (b), but not from (b) to (c), as explained in the Remark at the end of Section "Remaining recursions for Q i,j and Z i,j ". Image adapted from figure on page 27 [19] and produced by VARNA [58].  When moves are allowed to range over either MS1, or over MS2, the resulting network is connected; this is not the case for moves in MS2\MS1. Since the network represents intermediate moves in RNA folding trajectories, it is of interest to know the average network degree. This was done for move set MS1 in [34]. The goal of this paper is to describe the first algorithm, which computes the expected network degree, or equivalently, the expected number of neighbors, for the RNA network defined with move set MS2. Computing the expected number of neighbors when including shift moves turns out to be remarkably difficult, so for clarity of exposition, we present three versions of the algorithm, each adding a layer of complexity. Source code for all three energy models can be downloaded from http://bioinformatics.bc.edu/clotelab/.
The plan of this paper is as follows. Section "Results" discusses the degree distribution for move sets MS1 and MS2, obtained by exhaustive enumeration and by sampling low energy structures. Asymptotic network degree is discussed and the correlation is computed between the expected network degree, contact order, conformational entropy, and expected number of native contacts. In Section "Homopolymer Model A", we derive the recursions for the expected number of neighbors for move set MS2, with respect to the homopolymer Model A. In the homopolymer model, introduced in [35], any two positions i < j can form a base pair, provided only that j − i > 1; i.e. in Definition 1, item (1) is removed, and item (2) is modified so that θ = 1. In this model, the partition function Z of a length n homopolymer is simply the number of well-balanced parenthesis expressions with dots, having length n and in which j − i > 1 whenever a left [resp. right] parenthesis occurs at position i [resp. j]. For this model, the probability  The network of all secondary structures of the 12 nt sequence ACGUACGUACGU, where edges appear between structures that differ by a shift move. There are 35 structures, 68 edges between structures that differ by a base pair shift, hence the average network degree is 68 35 ¼ 1:94. Note that the network is not connected, unlike the previous two networks. P(s) of each structure s is equal to the uniform probability 1/Z. In Section "Uniform, nonhomopolymer Model B", we give the recursions for the non-homopolymer uniform Model B, in which every secondary structure has energy zero, but where a secondary structure of the RNA sequence a = a 1 , . . ., a n must satisfy all four properties of Definition 1. In this case, the probability P(s) of structure s is defined by P(s) = exp(−E(s)/RT)/Z where R = 0.00198717 kcal/ mol, T is absolute temperature, and the partition function is Z = ∑ s exp(−E(s)/RT). However, since E(s) = 0 for each structure s, the partition function Z is simply the number of secondary structures of a, and the probability P(s) is equal to the uniform probability P(s) = 1/Z. In Section "Model C with Turner energy parameters", we give the the recursions for the full Model C, with respect to the Turner energy model [36] which includes base stacking free energies and free energies for hairpins, bulges, internal loops and multiloops. The partition function Z = ∑ s exp(−E(s)/RT) can be computed by the McCaskill algorithm [39], and the probability of structure s is the usual Boltzmann probability P(s) = exp(−E(s)/RT)/Z.

Materials and Methods
Let a = a 1 , . . ., a n be an arbitrary but fixed RNA sequence. For any 1 i j n, let a[i, j] denote the subsequence a i , . . ., a j , and let SS½i; j denote the set of secondary structures of a[i, j]. For s 2 SS½i; j, let BF(s) denote the Boltzmann factor exp(−E(s)/RT) of s, and define Q i;j ¼ P s2SS½i;j BFðsÞ Á NðsÞ, where N(s) is the number of secondary structures t of a[i, j] obtained from the structure s by the addition, deletion or shift of a base pair. The partition function for a[i, j] is defined by Z i;j ¼ P s2SS½i;j BFðsÞ. It follows that the expected number of neighbors (network degree) is Q 1;n Z 1;n . For clarity of exposition, in the following subsections, we describe recursions to compute Q i,j and Z i,j for three energy models for RNA secondary structures, each model a refinement of the previous model.

Homopolymer Model A
In this section, we derive the recursions for Q 1,n and Z 1,n for the homopolymer model, in which any two positions 1 i < j n can form a base pair, provided only that i + 1 < j. For the homopolymer model, there is no RNA sequence a = a 1 , . . ., a n , but rather only the interval [1, n] = {1, . . ., n}. Thus we speak of a structure on [i, j], rather than on a[i, j]. The energy of each structure in the homopolymer model is zero, so the probability of each structure s on [i, j] equals one divided by the number of structures on [i, j]. Moreover, there is no need to compute the doubly-indexed values Q i,j and Z i,j , since the values depend only on the size j − i + 1 of the sequence [i, j]; i.e. if j − i = j 0 − i 0 , then Q i,j = Q i 0 , j 0 and Z i,j = Z i 0 , j 0. Thus it is notationally simpler to define Q n [resp. Z n ] in place of Q 1,n [resp. Z 1,n ], and similarly for all other auxilliary functions.
For 0 n, define Q n to be the sum, taken over all structures s of [1, n], of the number of base pair additions, removals or shifts of a base pair of s. Formally, we have I½ððx; yÞ ! ðk; 'ÞÞ 2 MS2; ðs n fðx; yÞgÞ [ fðk; 'Þg is a valid str ð1Þ where I denotes the indicator function, and "(x, y) ! (k, ℓ)" denotes the move which consists of replacing base pair (x, y) by base pair (k, ℓ). As well, let Z n denote the total number of homopolymer structures on [1, n] with θ = 1. Recursions for Z n are well-known [35], but for completeness given in Eq (2) below.
Auxilliary functions f(n, x) and g(n, x). Recall that here we take θ = 1 for simplicity of exposition of the ideas. Let Z n denote the total number of structures on the homopolymer of length n. Since any two positions i, j can base-pair, as long as j − i > θ = 1, we have ( The term Z n − 1 counts all structures s on [1, n] in which n is unpaired in s, while the term Z r Á Z n − r − 2 counts all structures s on [1, n] that contain the base pair (r + 1, n). Define f(n, x) to be the number of secondary structures s for a length n homopolymer, such that s has x visible positions. Now for 0 n and 0 x n, define f by The computation of f(n, x) uses dynamic programming and proceeds by double induction, i.e. for n fixed, induction is performed on x. The term Z n − 2 arises from structures s on [1, n] that contain the base pair (1, n); the term f(n − 1, x − 1) is the contribution from structures s on [1, n] in which n is unpaired; the term f(r, x) Á Z n − r − 2 accounts for all structures s on [1, n] that contain the base pair (r + 1, n). Define g(n, x) to be the number of secondary structures s for the length n homopolymer, such that s has x visible positions in the interval [1, n − θ − 1] = [1, n − 2], and position n is unpaired in s.
Note that the rightmost term in the last line arises from the contribution of 1 for base pair (k, n). In summary, we have shown that Main function Q n . For clarity in the derivation of Q n , we start by explicitly listing the moves in move set MS2. Let x, x 0 , y, y 0 denote distinct positions all belonging to the interval [1, n]. The structure t can be obtained from structure s by a move from MS2, if t is a valid secondary structure and can be obtained from s by applying a move of the form 1-6.
1. Addition of a base pair (x, y) to s.   Let Q n ¼ P s2SS½1;n NðsÞ, where N(s) is the number of structures t that can be obtained from s by applying a move from move set MS2. Define Q 0 = 1, and For the inductive case where n > 2, initialize Q n = 0 and then add the contributions from below. CASE 1(a): In this case, we consider the contribution from s 2 SS½1; n, in which the last position n is unpaired, and t is obtained from s by a move from MS2 involving x, y, Notice that in shifts of type 3, 4 the original position x is retained, while in shifts of type 5, 6 the original position y is retained, for distinct x, x 0 , y in the interval [1, n − 1]. Also, notice that shifts of base pairs involving the last position n are not considered in Case 1(a)such shifts will later be treated in cases 1(c), 2(b) and 2(c). The contribution in this case is given by The term Q n−1 arises from neighbors t of s in which the last position n is unpaired, and the base pair (x, y) is added/removed/shifted in s. CASE 1(b): In this case, we consider the contribution from s 2 SS½1; n, in which the last position n is unpaired, and t is obtained from s by adding the base pair (k, n) for some 1 k n − θ − 1. The contribution in this case is given by CASE 1(c): In this case, we consider the contribution from s 2 SS½1; n, in which the last position n is unpaired, and t is obtained from s by shifting the base pair (x, y) to (x, n), or by shifting the base pair (x, y) to (y, n), for distinct x, y in the interval [1, n − 1]. These shifts are treated separately. CASE 1(c)(i): Consider a shift of the form (x, y) to (x, n), for y < n. The function E n−1 counts the number of external base pairs (x, y) where y n − 1, for all structures on [1, n − 1]. For any such (x, y), it is possible to shift the base pair (x, y) to (x, n), and so the contribution is : Consider a shift of the form (x, y) to (y, n), for y < n − 1. The function E n−2 counts the sum over all structures on [1, n − 2] of the number of external base pairs (x, y) with y n − 2. Since k n − 2 and θ = 1, and n is unpaired, it is possible to shift the base pair (x, y) to (y, n) and vice versa. So far, we have not considered structures s on [1, n − 1] in which n − 1 is base-paired. For a structure s on [1, n − 1] that contains base pair (r + 1, n − 1), there are Z n−r −3 many structures s 2 on [r + 2, n − 2]; moreover, for any external base pair (x, y) in a structure s 1 on [1, r], we can shift the base pair (x, y) to (y, n). This explains the presence of the term P nÀ4 r¼1 E r Á Z nÀrÀ3 . Thus the contribution is In conclusion, CASE 2(a): The contribution from s 2 SS½1; n, in which the last position n is base-paired, where neighbor t is obtained from s by removal of that last base pair (k, n), is given by Note that Case 2(a) is dual to Case 1(b). CASE 2(b): In this case, we consider the contribution from s 2 SS½1; n, in which the last position n is base-paired, where neighbor t is obtained from structure s by a shift of the last base pair (k, n) to (k 0 , n) for some k 0 6 ¼ k that is visible in structure s − {(k, n)}. Note that if we were to remove base pair (k, n) from s, then the last position of s − {(k, n)} must be unpaired, and the position n − 1 may or may not be base paired. Recall that g(n, x) is the sum over all structures s on [1, n], that contain x visible positions in the interval [1, n − 2], and in which position n is unpaired. If we choose a first position k out of the x visible positions, and subsequently a second distinct position k 0 out of the remaining x − 1 visible positions, then we properly count the contribution from structures s containing (k, n) which can be transformed to a structure t by the shift (k 0 , n).
The contribution in this case is since we have x choices for value k and then (x − 1) choices for k 0 , both selected from the x visible positions of the structure. CASE 2(c): In this case, we consider the contribution from s 2 SS½1; n, in which the last position n is base-paired, where neighbor t is obtained from structure s by a shift of base pair (k, n) to (k, k 0 ), or a shift of the last base pair (k, n) to (k 0 , k), for some k 6 ¼ k 0 that is visible in structure s − {(k, n)}. These shifts are treated separately. CASE 2(c)(i): Consider a shift of the form (k, n) to (k, k 0 ), for k 0 < n. The function E n−1 counts the sum over all structures on [1, n − 1] of the number of external base pairs (k, k 0 ) with k 0 n − 1. For any such (k, k 0 ), it is possible to apply the shift (k, n), and vice versa. Thus Case 2(c)(i) case is dual to Case 1(c)(i) and the contribution is clearly The function E n−2 counts the sum over all structures on [1, n − 2] of the number of external base pairs (k 0 , k) with k n − 2. Since k n − 2 and θ = 1, and n is unpaired, it is possible to shift the base pair (k 0 , k) to (k, n) and vice versa. By duality to Case 1(c)(ii), we have the additional contribution of P nÀ4 r¼1 E r Á Z nÀrÀ3 to account for shifting the base pair (y, n) to an external base pair (x, y) in a structure s 1 on [1, r], in the case that n − 1 is base-paired. Thus Case 2(c)(ii) case is dual to Case 1(c)(ii) and the contribution is clearly In conclusion, In this case, we consider the contribution from s 2 SS½1; n, in which the last position n is base-paired with base pair (k, n), where neighbor t is obtained from a shift or addition/deletion of a base pair in the left portion [1, k − 1] or right portion [k + 1, n − 1], so that t retains the base pair (k, n). In this case, the contribution is The first term arises from the addition/removal/shift of a base pair (x, y), where k + 1 x < y n − 1, and the second term arises from the addition/removal/shift of a base pair (x, y), where 1 x < y k−1.
Putting together all contributions from Case 1(a) through Case 2(d), we have The functions f, g require the greatest space and time resources, and it is easily seen that the

Uniform, non-homopolymer Model B
In this section, we consider the uniform, non-homopolymer model B, in which secondary structures must satisfy Definition 1; i.e. compared with the notion of structure from the previous Section "Homopolymer Model A", each base pair (i, j) of a secondary structure s of the RNA sequence a = a 1 , . . ., a n must satisfy j − i > θ = 3, and a i , a j must constitute a Watson-Crick or wobble pair. In model B, the energy of each structure is zero, so the partition function Z = Z 1,n is the total number of structures of a, and the probability P(s) of each structure s is 1/Z. For the recursions necessary to compute Q i;j ¼ P s2SS½i;j NðsÞ, where N(s) denotes the number of neighbors of s under move set MS2, we need to define new functions EL, ER, ER 0 , F, G. There is a correspondence between functions x) } from the previous Section "Homopolymer Model A".
Critical definitions and recursions. For a given RNA sequence a = a 1 , . . ., a n , define the subsequence a[i, j] = a i , . . ., a j . Positions i, j can form a base pair, denoted by bp(i, j) = 1, if a i , a j is either a Watson-Crick pair AU, UA, GC, or CG, or a wobble pair; otherwise bp(i, j) = 0. For k 2 [1, n] and c 2 {A, C, G, U}, we also write bp(k, c) = 1 to mean that a k , c constitute either a Watson-Crick or wobble base pair. A nucleotide position k 2 [1, n] is said to be visible in the secondary structure s, if for every base pair (i, j) 2 s, it is not the case that i k j. If we state that structure s has exactly x visible occurrences of a nucleotide in [i, j − θ − 1] that can base pair with c, then we mean that there are positions i i 1 The base pair (i, j) 2 s is said to be an external base pair of the secondary structure s, if there is no distinct base pair (i 0 , j 0 ) 2 s with the property that i 0 i < j j 0 . In formulas, for brevity, we write that '(i, j) is external in s', to mean that (i, j) is an external base pair of s.
I½ðx; yÞ is external bp in s; bpðx; cÞ ¼ 1 ð22Þ I½ðx; yÞ is external bp in s; bpðy; cÞ ¼ 1 ð23Þ I½ðx; yÞ 2 s is ext: bp in s; bpðy; cÞ ¼ 1; y j À y À 1; j unpaired in s ð24Þ I½s has exactly x visible occurrences of a nucleotide that can pair with c ð25Þ I½s has exactly x visible occurrences of a nucleotide in ½1; j À y À 1 that can pair with c; and j unpaired in s The two differences between the homopolymer Model A and the current Model B are: (1) in Model B, if (k, j) is a base pair, then the nucleotides at positions k, j must be one of AU, UA, Definition of ER. For 1 i j n and c 2 {A, C, G, U}, we define ER i,j,c by induction on j − i.
c as the sum of the following Definition of ER 0 . For 1 i j n and Note that the first term to the right of the equality sign in the previous equation is ð35Þ SUBCASE X > 0: Computing the total number of moves using MS1. For 1 i j n, define Q i,j to be the sum, taken over all structures s of a i , . . ., a j , of the number of base pair additions or removals of a base pair to or from s. Formally, we have I½ððx; yÞ ! ðk; 'ÞÞ 2 MS1; ðs n fðx; yÞgÞ [ fðk; 'Þg valid str ð37Þ or equivalently where d BP (s, t) denotes the base pair distance between structures s, t. Define Q i,j by recursion on j − i, for 1 i j n. BASE CASE: For i j i + θ, define Q i,j = 0.
Computing the total number of moves using MS2. For 1 i j n, define Q i,j to be the sum, taken over all structures s of a i , . . ., a j , of the number of base pair additions, removals or shifts of a base pair of s. Formally, we have I½ððx; yÞ ! ðk; 'ÞÞ 2 MS2; ðs n fðx; yÞgÞ [ fðk; 'Þg is valid str ð40Þ Now define Q i,j by recursion on j − i, for 1 i j n.
Computing the total number of moves using MS2\MS1. For 1 i j n, define Q i,j to be the sum, taken over all structures s of a i , . . ., a j , of the number of shifts of a base pair of s. Formally, we have I½ðx; yÞ 2 s; ððx; yÞ ! ðk; 'ÞÞ 2 fMS2 n MS1g; ðs n fðx; yÞgÞ [ fðk; 'Þg valid str Now define Q i,j by recursion on j − i, for 1 i j n.
We have implemented a dynamic programming algorithm for each of the functions EL, ER, ER 0 , F, G, Q and Z, resulting in software for the expected network degree, with respect to uniform probability for the move sets MS1, MS2, MS2\MS1. Analysis of space and time resources needed for the program can be determined in a manner similar to that described at the end of Subsection; however, there is an additional factor of n in both space and time requirements, so that the software runs in space O(n 3 ) and time O(n 4 ). During the algorithm development and implementation, we have extensively cross-checked with results obtained by exhaustive, brute force counting, thus ensuring correctness of our code.

Model C with Turner energy parameters
Here we consider the Model C, for which secondary structures satisfy Definition 1 and such that E(s) indicates the Turner energy of s, which involves free energy parameters [36] for stacked base pairs, hairpins, bulges, internal loops and multiloops. For RNA sequence a = a 1 , . . ., a n , we present recursions in the following for Z i,j and Q i,j , where Note that I is the indicator function, and that QB i,j is the Boltzmann weighted sum of the number of neighbors, using move set MS2, where the sum is taken over all structures s 2 SS½i; j that contain the base pair (i, j). Similarly ZB i,j is the sum of Boltzmann factors BF(s), where the sum is taken over all structures s 2 SS½i; j that contain the base pair (i, j). We write bp(k, j) = 1 to mean that nucleotides a k , a j can form either a Watson-Crick or wobble base pair, and for nucleotide c 2 {A, C, G, U}, we write bp(k, c) = 1 to mean that nucleotides a k and c can form a Watson-Crick or wobble base pair. From the context, there should be no confusion between bp (k, j) and bp(k, c).  BFðsÞ Á I½ðx; yÞ 2 s is ext: bp in s; bpðy; cÞ ¼ 1; y j À y À 1; j unpaired in s ð52Þ BFðsÞ Á I½s has x visible occurrences of a nucleotide that can pair with c ð53Þ BFðsÞ Á I½s has exactly x visible occurrences of a nucleotide in ½1; j À y À 1 that can pair with c; and j unpaired in s Recursions for a dynamic programming implementation of these functions are given later in Section "Recursions for auxilliary functions". We focus now on how to compute Q i,j using these auxilliary functions.
Recursion for function Q i,j . For notational convenience, define Q i, i − 1 = 0 and Z i, i−1 = 1 for all 1 i n. If i j < i + θ + 1, then for any secondary structure s 2 SS½i; j, there are no structural neighbors of s and so Q i,j = 0. If i j < i + θ + 1, then the only secondary structure on [i, j] is the empty structure with free energy of zero, so Z i,j = 1. Now assume that i + θ + 1 j. By definition For the move set MS1 (in the absence of shift moves), it has been shown in [34] that However, when allowing shift moves, the situation is more complicated since there are shifts involving x, y, x 0 , y 0 2 [i, j] that are neither fully contained in the segment [i, j − 1] for structures s 2 SS½i; j in which j is unpaired, nor fully contained in one of the segments [i, k − 1], [k, j] structures s 2 SS½i; j which contain the base pair (k, j). The former shifts are treated in cases 1 (c), 1(d), while the latter shifts are treated in cases 2(c), 2(d).
For clarity in the derivation of Q i,j , we start by explicitly listing the moves in move set MS2. Let x, z 0 , y, y 0 denote distinct positions all belonging to the interval [i, j]. The structure t can be obtained from structure s by a move from MS2, if t is a valid secondary structure and can be obtained from s by applying a move of the form 1-6.
1. Addition of a base pair (x, y) to s.

2.
Removal of a base pair (x, y) from s.
3. Shift of a base pair (x, y) in s to (x, y 0 ) in t.
4. Shift of a base pair (x, y) in s to (y 0 , x) in t.

5.
Shift of a base pair (x, y) in s to (x 0 , y) in t.
6. Shift of a base pair (x, y) in s to (y, x 0 ) in t.
The shift moves 3-6 are depicted in Fig 8. Notice that in shifts of type 3, 4 the original position x is retained, while in shifts of type 5, 6 the original position y is retained. for distinct x, x 0 , y in the interval [i, j].
In the base case, for all i 2 [1, n], we have Q i, i − 1 = 0, Z i, i − 1 = 1, and for i j i + θ = i + 3, Q i,j = 0, Z i,j = 1. For the inductive case in which j − i > θ = 3, initialize Q i,j = 0 and then add the contributions from the cases below. The recursions for Z i,j are well-known [39] and are given later in Section "Remaining recursions for Q i,j and Z i,j ". CASE 1(a): In this case, we consider the contribution from s 2 SS½i; j, in which j is unpaired in the interval [i, j], and t is obtained from s by a move from MS2 involving x, y, x 0 , y 0 2 [i, j − 1]. The contribution is which accounts for the addition, removal or shift of a base pair in [i, j − 1]. Note that shifts of base pairs involving the last position j are not considered in Case 1(a)-such shifts will treated in cases 1(c), 1(d), 2(c), 2(d). CASE 1(b): In this case, we consider the contribution from s 2 SS½i; j, in which j is unpaired in [i, j], and t is obtained from s by adding the base pair (k, j) for some i k j − θ − 1 = j − 4. The contribution is This term arises from those t obtained from s by adding a base pair (k, j) for some k 2 [i, j − θ − 1]. The remaining cases 1(c), 1(d) treat shifts involving x, y, x 0 , y 0 2 [i, j] in structures s 2 SS½i; j in which j is unpaired in [i, j], where the position j is touched; i.e. it is not the case that x, y, x 0 , y 0 2 [i, j − 1] and so these shifts are not already counted in the term Q i,j − 1 . CASE 1(c): In this case, depicted in panel (a) of Fig 9, we consider the contribution from s 2 SS½i; j in which j is unpaired in [i, j], and t is obtained from s by a shift of the base pair (x, y) to (x, j) for i x y − θ − 1 and y j − 1. The function EL i,j − 1,a j is the sum, taken over all structures s 2 SS½i; j in which j in unpaired, of the product of the Boltzmann factor B(s) times the number of external base pairs (x, y) in s with y j − 1 such that the nucleotide a x at position x can form a base pair with the nucleotide a j at position j. For any such (x, y), it is possible to shift the base pair (x, y) to (x, j), and vice versa. Before proceeding, note that the current Case 1 (c) handles shifts from (x, y) to (x, j), while Case 2(b) handles shifts from (x, j) to (x, y). The contribution in the current case is clearly CASE 1(d): In this case, depicted in panel (b) of Fig 9, we consider the contribution from s 2 SS½i; j in which j is unpaired in [i, j], and t is obtained from s by a shift of the base pair (x, y) to (y, j) for i x y − θ − 1 and y j − θ − 1. The function ER 0 i;j;a j is the sum, taken over all structures s 2 SS½i; j in which j in unpaired, of the product of the Boltzmann factor B(s) times the number of external base pairs (x, y) in s with y j − θ − 1 such that the nucleotide a y at position y can form a base pair with the nucleotide a j at position j. For any such external base pair (x, y), it is possible to shift (x, y) to (y, j), and vice versa. Before proceeding, note that the current Case 1(d) handles shifts from (x, y) to (y, j), while Case 2(d) handles shifts from (y, j) to (x, y). The contribution in the case at hand is clearly CASE 2(a): In this case, we consider the contribution from structures s 2 SS½i; j, which contain the base pair (k, j), for some i k j − θ − 1, and t is obtained from s by a move from MS2 involving x, y, x 0 , y 0 , such that x, y, x 0 , y 0 2 [i, k − 1]. The contribution is CASE 2(b): In this case, we consider the contribution from structures s 2 SS½i; j, which contain the base pair (k, j), for some i k j − θ − 1, and t is obtained from s by a move from MS2 involving x, y, x 0 , y 0 , such that x, y, x 0 , y 0 2 [k, j]. The contribution is The remaining cases 2(c), 2(d) treat shifts involving x, y, x 0 , y 0 2 [i, j] in structures s 2 SS½i; j which contain the base pair (k, j) for some i k j − θ − 1, where it is neither the case that x, y, x 0 , y 0 2 [i, k − 1] nor x, y, x 0 , y 0 2 [k, j]; i.e. cross talk shifts that touch both the left [i, k − 1] and the right [k, j] segments. CASE 2(c): In this case, depicted in panel (c) of Fig 9, we consider the contribution from s 2 SS½i; j, which contain the base pair (k, j), for some i k j − θ − 1, and t is obtained from s by a shift of the base pair (k, j) to (k 0 , j) for some k 0 < k that is visible in structure s\{(k, j)}. Before proceeding, note that for k < k 0 , the shift of base pair (k, j) to (k 0 , j) is treated in Case 2 (b).
Recall that the function F i,k − 1,a j , x is the sum of Boltzmann factors of all structures s 0 on [i, k − 1] that contain exactly x occurrences of a visible position that can form a base pair with the nucleotide a j at position j. The contribution in this case is Putting together all contributions from Case 1(a) through Case 2(d), we have Recursions for auxilliary functions. We now provide the recursions for functions EL, ER, ER 0 , F and G.
Definition of EL. For 1 i j n and c 2 {A, C, G, U}, we define EL i,j,c by induction on j − i, where BFðsÞ Á I½ðx; yÞ is external bp in s; bpðx; cÞ ¼ 1 ð66Þ Note that the first term to the right of the equality sign in the previous equation is ER i,j − θ − 1, c and not ER 0 i;jÀyÀ1;c . Definition of F. For 1 i j n, c 2 {A, C, G, U} and x 2 [0, n], we define F i,j,c,x by induction on j − i, where Definition of G. Recall that G i,j,c,x is defined to be the sum of Boltzmann factors of structures s 2 SS½i; j having exactly x visible occurrences of a nucleotide in [i, j − θ − 1] that can base-pair with c, and j is unpaired in s, i.e.
I½j À y À 1 þ u À k > y Á bpðk; j À y À 1 þ uÞ Á F i;kÀ1;c;0 Á ZB k;jÀyÀ1þu ð79Þ SUBCASE X > 0: Remaining recursions for Q i,j and Z i,j . In this section, we furnish the remaining recursions for Q i,j , Z i,j in the Turner 2004 energy model [36]. For a fixed sequence a = a 1 , . . ., a n and for 1 i j n, define where N s is the number of secondary structures that can be obtained from s by a base pair addition, removal or shift-i.e. the number of neighbors of s with respect to move set MS2. It follows that Z = Z 1, n is the partition function for secondary structures, and where BF(s) abbreviates the Boltzmann factor exp(−E(s)/RT) of s.
To provide a self-contained treatment, we recall McCaskill's algorithm [39], which efficiently computes the partition function. For RNA nucleotide sequence a = a 1 , . . ., a n , let H(i, j) denote the free energy of a hairpin closed by base pair (i, j), while IL(i, j, i 0 , j 0 ) denotes the free energy of an internal loop enclosed by the base pairs (i, j) and (i 0 , j 0 ), where i < i 0 < j 0 < j. Internal loops comprise the cases of stacked base pairs, left/right bulges and proper internal loops. The free energy for a multiloop containing N b base pairs and N u unpaired bases is given by the affine approximation a + bN b + cN u .
Definition 2 (Partition function Z and related function Q) • Z i,j = ∑ s exp(−E(s)/RT) where the sum is taken over all structures s 2 SS½i; j.
• ZB i,j = ∑ s exp(−E(s)/RT) where the sum is taken over all structures s 2 SS½i; j which contain the base pair (i, j).
• ZM i,j = ∑ s exp(−E(s)/RT) where the sum is taken over all structures s 2 SS½i; j which are contained within an enclosing multiloop having at least one component.
• ZM1 i,j = ∑ s exp(−E(s)/RT) where the sum is taken over all structures s 2 SS½i; j which are contained within an enclosing multiloop having exactly one component. Moreover, it is required that (i, r) is a base pair of x, for some i < r j.
• Q i,j = ∑ s N s Á exp(−E(s)/RT) where the sum is taken over all structures s 2 SS½i; j.
• QB i,j = ∑ s N s Á exp(−E(s)/RT) where the sum is taken over all structures s 2 SS½i; j which contain the base pair (i, j).
• QM i,j = ∑ s N s Á exp(−E(s)/RT) where the sum is taken over all structures s 2 SS½i; j which are contained within an enclosing multiloop having at least one component.
where the sum is taken over all structures s 2 SS½i; j which are contained within an enclosing multiloop having exactly one component. Moreover, it is required that (i, r) is a base pair of s, for some i < r j.
We will define Z i,j and Q i,j by recursion on j − i, for 1 i j n. BASE CASE: Recalling that θ = 3, for j − i 2 {−1, 0, 1, 2, 3}, define Q i,j = QB i,j = 0, Z i,j = 1, ZB i,j = ZM i,j = ZM1 i,j = 0, since the empty structure is the only possible secondary structure.
Recursion for QB i,j . We can now proceed with the definition of QB i,j , defined to be the sum of A i,j , B i,j , C i,j , each of which is defined below.
CASE A: (i, j) closes a hairpin. In this case, the contribution to QB i,j is given by The term 1 arises from the neighbor of s = {(i, j)} by removing base pair (i, j). The term arc1 a (i + 1, j − 1) arises from neighbors of s obtained by adding a base pair in the region [i + 1, j − 1], and the term arc1 b (i, j) arises from a shift of the form (i, j) ! (i, y), and finally the term arc1 c (i, j) arises from a shift of the form (i, j) ! (x, j). CASE B: (i, j) closes a stacked base pair, bulge or internal loop, whose other closing base pair is (ℓ, r), where i < ℓ < r < j.
Following the convention in Vienna RNA Package, we assume that all loops have at most 30 unpaired nucleotides. This convention explains the presence of 31 in some indices. In this case, the contribution to QB i,j is given by the following The term 1 arises from the neighbor of s = {(i, j)} by removing base pair (i, j) (the neighbor obtained by removing base pair (ℓ, r) is counted by the term N(s) for s 2 SS½'; r). The term arc3(i, j, ℓ, r) counts neighbors obtained by either adding a base pair within the internal loop defined by (i, j) and (ℓ, r), or by shifting either (i, j) or (ℓ, r).
In Case C below, we follow the convention that in the summation notation P b i¼a , if upper bound b is smaller than lower bound a, then we intend a loop of the form: FOR i = b downto a. CASE C: (i, j) closes a multiloop. In this case, the contribution to QB i,j is given by the following ði;jÞ closes a multiloop Now QB i,j = A i,j + B i,j + C i,j . It nevertheless remains to define the recursions for QM1 i,j and QM i,j . These satisfy the following.
exp À cðj À kÞ RT Á BFðsÞ Á NðsÞ þ arc1 a ðk þ 1; jÞ þ arc4ði; k; jÞ þ arc5ði; k; jÞ ½ ¼ X j k¼iþyþ1 exp À cðj À kÞ RT Á QB i;k þ ZB i;k Á arc1 a ðk þ 1; jÞ þ arc4ði; k; jÞ þ arc5ði; k; jÞ ð Þ Â Ã : The term arc1 a (k + 1, j) counts neighbors obtained by adding a base pair in [k + 1, j]; the term arc4(i, k, j) counts neighbors obtained by a shift of the base pair (i, k) to (i, y) for some k < y j; the term arc5(i, k, j) counts neighbors obtained by a shift of the base pair (i, k) to (k, y) for some k + θ < y j. Finally Note that in the first line of the equation for QM i,j , the position r is required by definition of QM1 r, j to pair to some position in [r + θ + 1, j]. Thus r is the left endpoint of a base pair, whose right endpoint will not be known until a subsequent call of function QM1 r, j . The term arc1 a (i, r − 1) counts neighbors obtained by adding a base pair (x, y) in the interval [i, r − 1]; the term arc1 c (i − 1, r) counts neighbors obtained by shifting the base pair whose left endpoint is r to the base pair (x, r) for some i x < r. This completes the description of how to compute the expected number of neighbors with respect to the Turner energy model. Finally, to accelerate the computation of the functions arc1 a , . . ., arc5, the 4 × n × n array ARC is precomputed, where if a = a 1 , . . ., a n denotes the input RNA sequence, then As mentioned, we follow the convention that bulges and interior loops have a size of at most 30 nt; however, this bound does not apply to hairpin loops or multiloops. REMARK: Suppose that s = {(i, j), (i 1 , j 1 ), . . ., (i k , j k )} is a multiloop closed by (i, j), where i < i 1 < j 1 < i 2 < j 2 < Á Á Á < i k < j k < j. Then note that we do not count neighbors of s obtained by adding a base pair (x, y) to the multiloop s, where i < x < i ℓ < j ℓ < y, nor do we count shifts within a multiloop of the form (i ℓ , j ℓ ) ! (i ℓ , k) for j ℓ < k, nor (i ℓ , j ℓ ) ! (k, j ℓ ) for k < i ℓ . Following the paradigm in the treatment of multiloops in McCaskill's partition function algorithm [39], such added base pairs and shifts cannot be included. In particular, our Turner energy algorithm properly counts shifts depicted in Figs 2 and 3, but not those depicted in Fig 4. Multiloops are energetically costly due to entropic considerations, and so penalized in the Turner energy model. For this reason, multiloops are generally small, have few components, and contain few unpaired bases that might allow the formation of base pairs or support shift moves. If a multiloop has sufficient size to permit such moves, then its free energy will be large, hence the Boltzmann factor of such structures s is small and the contribution to hNi is negligeable. By introducing multiloop analogues of functions EL, ER, ER 0 , F, and G, it should be possible to account for such additional internal multiloop moves. However, this would lead to substantial complications of the algorithm with no likely benefit, hence this will not be pursued.

Results
In this section, we describe several results obtained by applying our novel algorithms to compute the expected network degree for given RNA sequence. The left panel of Fig 10 depicts the length-normalized expected network degree of an RNA homopolymer sequence of length n, defined to be Q n nZ n . In the homopolymer model, Q n = ∑ s N(s), where N(s) is the number of neighbors of s, and the sum is taken over all secondary structures s of [1, n]. In the homopolymer case, the energy is 0, so the partition function Z n equals the number of structures. Fig 10 displays the normalized network degree as a function of homopolymer size, both in the case of move set MS1 (base pair additions, removals), and move set MS2 (base pair additions, removals, shifts). An asymptotic value of 0.4742 for Q n nZ n is suggested by running the dynamic programming (DP) algorithm described in Section "Homopolymer Model A" for values of sequence length 400 n 1000. Using methods from algebraic combinatorics, we have analytically proved that the value of Q n nZ n for MS1 is % 0.4734176431521986 (see [40]). Runs of the DP algorithm also suggest that the asymptotic value of Q n nZ n for MS2 appears to be % 1.530161, so that there are more than 3 times more structural neighbors, on average, for move set MS2 than for move set MS1 for the homopolymer model. The right panel of Fig 10 depicts an overlay of the degree distribution for secondary structures of the 32 nt selenocysteine element of fruA, which latter encoding the A subunit of coenzyme F420-reducing hydrogenase, for move sets MS1, MS2\MS1 and MS2.
Figs 11 and 12 display the relative frequency (for energy model C) for the number of neighbors, or degree, respectively for the 76 nt alanine transfer RNA from Mycoplasma mycoides with accession code RA1180 from tRNAdb 2009 [41] and for the 56 nt spliced leader RNA from L. collosoma. RNAsubopt -d0 -e 12 [10] was used to generate 537,180 [resp. 266,065] Fig 10. (Left) Normalized expected network degree of an RNA homopolymer sequence of length n is defined to be Q n nZ n ; i.e. the length-normalized expected network degree Q n Z n divided by sequence length n. Here Q n is ∑ s N(s), where N(s) is the number of neighbors of s, and the sum is taken over all secondary structures s of the homopolymer. In the homopolymer case, the energy is 0, hence the partition function Z n is simply the number of structures of the length n homopolymer. The purple graph was obtained with move set MS1 (base pair additions and removals), while the red graph was obtained with move set MS2 (base pair additions, removals and shifts). For n = 998, the value of Q n nZ n with respect to MS1 is 0.472393; using methods from enumerative combinatorics, we have analytically proved that the value of Q n nZ n with respect to MS1 is exactly 0.4734176431521986 [40]. For n = 998, the value of Q n nZ n with respect to MS2 is 1.530161; since the values of Q n nZ n are unchanged for n ( 998, it is likely that the asymptotic value is close to that value. It follows that there are more than 3 times more structural neighbors, on average, for move set MS2 than for move set MS1.    Table 1 compares these values with those obtained by our dynamic programming method, and additionally compares values for both Turner 1999 and Turner 2004 energy parameters. Note the stark differences between the length-normalized degree distribution for transfer RNA (accession code RA1180 from tRNAdb 2009 [41]) and for the conformational switch of spliced leader from L. collosoma.
We are currently investigating whether other conformational switches have large values of length-normalized expected number of neighbors. Fig 13 depicts the correlation between expected network degree, conformational entropy, contact order, and expected number of native contacts, computed with respect to a collection of 180 PDB files and to a collection of 1904 RNA sequence and consensus structures taken from the Rfam 12.0 database [42]. Although the results are mixed and preliminary, the PDB data suggests a possible correlation between secondary structure contact order and (uniform) expected network degree, while the Rfam data suggests a possible correlation between the expected number of native contacts and (uniform) expected network degree. Definitions and details of the computational experiments now follow.
Contact order is considered in the context of protein folding in [43], where absolute contact order is defined by ∑ i < j (j − i)/N, where the sum is over all N pairs of residues i, j that are in contact, taken here to mean that residues i, j each contain a heavy atom (non-hydrogen) within 6 Å, and that i, j are not consecutive (j 6 ¼ i + 1). In Fig 13, we consider several formulations of RNA contact order. The 3D absolute contact order for an RNA structure is defined as above. The pseudoknot (pknot) absolute contact order is defined as ∑ i < j (j − i)/N, where the sum is over all N base pairs (i, j) determined by RNAview [44], a program that determines hydrogenbonded atoms of distinct nucleotides in a PDB file of RNA and additionally classifies the base pair with respect to the Leontis-Westhof classification [45]. The 2D absolute contact order is defined as ∑ i < j (j − i)/N, where the sum is over all N base pairs (i, j) in the secondary structure extracted from RNAview output by our implementation of the method described in [46,47], which essentially applies the Nussinov-Jacobson algorithm [48] to those base pairs determined by RNAview from the tertiary PDB structure, resulting in the secondary structure having a largest number of base pairs (one could alternatively use the web server RNApdbee [49]). We also consider the corresponding versions of relative contact order, by dividing the absolute contact order by RNA sequence length.
For benchmarking purposes, we took two datasets: (1) tertiary structures from the PDB, and (2) consensus secondary structures from the Rfam 12.0 database [42]. For the former, we used PDB files from the dataset [50], since these files have no discrepancies between the SEQRES and ATOM fields. From this set of 486 PDB files, we retained 180 PDB files with a total of 227 RNA chains, after removing PDB files of very short RNAs, as well as those PDB files consisting of NMR data for which RNAview [44] did not use the first MODEL in its determination of base pairing, as well as those for which RNAview returned no base pairing information at all. For the latter, we took the first sequence, with its consensus structure, from the seed alignment of every family of Rfam 12.0, where sequence length was capped at 200 nt. This provided a collection of 1904 sequences and consensus structures.
The left panel of Fig 13 depicts the correlation computed for the 180 PDB files between various formulations of expected network degree and RNA secondary structure conformational entropy [51] (highest correlation value of 0.90) and contact order (highest correlation value of 0.86). Here, the conformational entropy is defined by −k B Á ∑ s p(s) Á ln p(s), where p(s) is the Boltzmann probability of secondary structure s, and the sum is taken over all secondary structures of a given RNA sequence (low entropy means that the Boltzmann probability is very high for a small number of structuresi.e. a relatively small number of structures has low free energy). The right panel of Fig 13 depicts the correlation for the 1904 Rfam consensus secondary structures between (uniform) expected network degree and various formulations of conformational entropy (highest correlation 0.80), the expected number of native contacts (highest correlation of 0.86), and two formulations of contact order (highest correlation value of 0.43). Here, the expected number of native contacts is defined by ∑ s p(s) Á js \ s 0 j, where the sum is taken over all structures s, p(s) = exp(−E(s)/RT)/Z is the Boltzmann probability of s, and js \ s 0 j denotes the number of base pairs common to both s and the Rfam consensus structure s 0 . At present, it is unclear why the correlation between expected network degree and contact order is higher in the PDB data than in the Rfam data.
To shed light on RNA kinetics from a different perspective, in this paper we have investigated a network property of RNA secondary structures. Let G be the network corresponding to the move set MS1 [resp. MS2] of the kinetics program Kinfold [5]; i.e. G = (V, E) is a directed graph, whose vertices are the secondary structures of a given RNA sequence and whose edges s ! t are defined if structure t can be obtained from s by the addition or removal [resp. addition, removal or shift] of a base pair from s. In [34], we described an algorithm that computes the MS1 expected network degree hNi = ∑ s p(s) Á N(s), where N(s) is the out-degree of secondary structure s of a user-specified RNA sequence a = a 1 , . . ., a n and p(s) = exp(−E(s)/ RT)/Z is the probability of structure s. In the current paper, we describe (surprisingly) much more difficult algorithms to efficiently compute the MS2 expected network degree hNi = ∑ s p (s) Á N(s), with respect to increasingly complex energy models A, B, C. Model A is the homopolymer model [35], which we use to present a simplified version of the more complex algorithms for models B and C. Unlike the simple homopolymer model, Model B concerns the usual notion of RNA secondary structure s, defined in Definition 1 where the energy E(s) is zero, so that the probability p(s) is one over the number of structures (uniform probability). Model C concerns the Turner energy model without dangles, so that the probability p(s) is the Boltzmann probability of s; however, due to technical issues, certain low probability MS2 moves in multiloops can not be considered (see an example in Fig 4). The run time [resp. space] for our algorithm for Model A is O(n 3 ) [resp. O(n 2 )], while that for models B and C is O(n 4 ) [resp. O (n 3 )]-cubic space is required uniquely for functions F, G.
Our algorithms for Models A and B are exact, computing the same values as obtained by exhaustive brute force. Our algorithm for Model C ignores certain kinds of base pair additions, removals and shifts within a multiloop. Table 1 compares the values of expected number of neighbors (expected degree) for move sets MS1 and MS2 for Models B, C where Turner 1999 and Turner 2004 energy parameters are considered [36]. Table 1 also includes values obtained by brute force computation from structures generated by RNAsubopt [52] from the Vienna RNA Package [10]. The time required for this method is O(n 2 ) times the number of structures sampled by RNAsubopt plus the overhead to run RNAsubopt. Except for small sequences, this computation cost is prohibitive, which makes our dynamic programming computation of the expected number of neighbors an attractive alternative. Nevertheless much less information is conveyed by a single number, as shown in Table 1 than in the (approximate) distribution as shown in Fig 11 for alanine transfer RNA from Mycoplasma mycoides and Fig 12 for the spliced leader conformational switch from L. collosoma. The striking difference between these figures suggests that perhaps conformational switches may display a bimodal or multimodal degree distribution-something we are currently investigating. Table 1 displays a strong discrepancy for the expected number of neighbors for L. collosoma when using Turner 1999 Table 1 and Fig 14, the general form of the corresponding histograms is maintained, as shown in Figs 11 and 12. We now summarize our findings.
Given the 3D native structure of a protein, the (absolute) contact order is defined by ∑ i < j (j − i)/N, where the sum is over all N pairs of residues i, j that are in contact, where non-contiguous residues i, j are in contact if each contain a heavy atom (non-hydrogen) within 6 Å [43]. We use the definition of [43] for 3D RNA contact order, whereas we define pseudoknot (pknot) contact order by ∑ i < j (j − i)/N, where the sum is over all N base pairs (i, j) determined by RNAview [44], a program that determines hydrogen-bonded atoms of distinct nucleotides in a PDB file of RNA and additionally classifies the base pair with respect to the Leontis-Westhof classification [45]. We define 2D contact order by ∑ i < j (j − i)/N, where the sum is over all N base pairs (i, j) in the secondary structure extracted from RNAview. For benchmarking purposes, by removing short RNAs and RNAs for which RNAview yielded no base pairing information, we extracted a set of 180 PDB files with a total of 227 RNA chains from the datase [50] of 486 PDB files that have no discrepancies between the SEQRES and ATOM fields. For this benchmarking set, the left panel of Fig 13 shows a relatively high correlation between contact order and expected network degree-for instance, there is a correlation of 0.86 between 2D contact order and MS1 or MS2 network degree. Surprisingly, the correlation is generally higher when expected network degree is computed with respect to uniform probability (corresponding to energy model B with zero energy) rather than Boltzmann probability (corresponding to energy model C, i.e. Turner energy model). In the case of energy model C, the correlation is somewhat higher for move set MS1 rather than move set MS2.
The number of native contacts in a transitional protein structure is defined as the number of pairs of noncontiguous residues i, j that are in contact (i.e. close spatial proximity) in the native structure, usually meaning the X-ray structure [53]. The importance of this reaction coordinate for protein folding has been established in [54], where Best et al. analyze long equilibrium simulations of protein folding for more than 10 proteins using molecular dynamics trajectories from D.E. Shaw Research. It follows from Markov chain theory that the expected number of visitations of (transitional) structure s is the Boltzmann probability p(s) = exp(−E(s)/RT)/Z times the trajectory length, and hence the expected number of native contacts for RNA secondary structure formation can be defined by Q ¼ X i<j X s2SS½1;n pðsÞ Á jfði; jÞ : 1 i < j j; ði; jÞ 2 s 0 gj ¼ where js 0 j denotes the number of base pairs in the native secondary structure s 0 , taken here to be the Rfam consensus structure used in benchmarking. In the right panel of Fig 13, we establish a relatively high correlation of 0.86 [resp. 0.84] between the expected number of native contacts for a collection 1904 RNA sequences and their consensus secondary structures from the Rfam 12.0 database and the uniform MS1 [resp. MS2] network degree. Again, it is worth pointing out that the slightly higher correlation of the MS1 measure over the MS2 measure. RNA secondary structure folding kinetics remains a computationally difficult problem for RNA sequences of even moderate length, despite the availability of software to compute nearoptimal folding pathways [7,11,13], compute population occupancy curves for coarse-grained models [14,17,18], and to repeatedly perform simulations of the Gillespie algorithm [5,[20][21][22][23]30]. Our motivation in this article is to approach folding kinetics from a novel network perspective, where we show that network degree is moderately highly correlated with both contact order and the expected number of native contacts, both measures known to be correlated with experimentally measured protein folding kinetics. Despite the new algorithms of this paper and the existence of other software for RNA folding kinetics, it seems clear that significant progress in this field will require the a database of experimentally determined RNA folding rates, comparable to the database KineticDB containing experimentally determined folding rates for proteins [26].