Network Properties of the Ensemble of RNA Structures

Peter Clote; Amir Bayegan

doi:10.1371/journal.pone.0139476

Abstract

We describe the first dynamic programming algorithm that computes the expected degree for the network, or graph G = (V, E) of all secondary structures of a given RNA sequence a = a₁, …, a_n. Here, the nodes V correspond to all secondary structures of a, while an edge exists between nodes s, t if the secondary structure t can be obtained from s by adding, removing or shifting a base pair. Since secondary structure kinetics programs implement the Gillespie algorithm, which simulates a random walk on the network of secondary structures, the expected network degree may provide a better understanding of kinetics of RNA folding when allowing defect diffusion, helix zippering, and related conformation transformations. We determine the correlation between expected network degree, contact order, conformational entropy, and expected number of native contacts for a benchmarking dataset of RNAs. Source code is available at http://bioinformatics.bc.edu/clotelab/RNAexpNumNbors.

Citation: Clote P, Bayegan A (2015) Network Properties of the Ensemble of RNA Structures. PLoS ONE 10(10): e0139476. https://doi.org/10.1371/journal.pone.0139476

Editor: Danny Barash, Ben-Gurion University, ISRAEL

Received: June 26, 2015; Accepted: September 14, 2015; Published: October 21, 2015

Copyright: © 2015 Clote, Bayegan. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: Source code has been deposited to GitHub: http://dx.doi.org/10.5281/zenodo.31326.

Funding: PC received funding from the National Science Foundation under grant DBI-1262439 (www.nsf.gov). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

RNA folding kinetics plays an important role in various biological processes, including (i) trans splicing of RNA, which is controlled by trypanosomal spliced leader (SL) RNA kinetics [1], and (ii) the host-killing/suppression of killing (hok/sok) system that kills E. coli replicates if insufficient plasmids are transferred to the new daughter cell [2]. To better understand how macromolecules fold into their native state, energy landscapes for protein and RNA folding have been intensively studied [3–8]. In the case of RNA secondary structure formation, numerous algorithms have been developed beyond thermodynamic equilibrium structure prediction [9, 10], including algorithms (1) to determine optimal or near-optimal folding pathways [6, 7, 11–13], (2) to compute explicit solutions of the master equation for possibly coarse-grained models [14–18], and (3) to simulate stepwise folding from an initial secondary structure to the target minimum free energy (MFE) structure [5, 19–24]. Nevertheless, RNA secondary structure folding kinetics remains a computationally difficult problem, since it is known that the problem of determining optimal folding pathways is NP-complete [25]. Despite increasing awareness of the importance of regulatory and catalytic RNA, no database currently exists of experimentally determined RNA folding rates, in contrast to the situation for proteins. Indeed, KineticDB is a database that provides users with a diverse set of experimentally determined folding rates for 87 unique proteins and approximately one hundred mutants [26].

It is currently an open problem to predict the folding rate of proteins and RNA molecules from the sequence alone. The goal of this paper is to raise awareness of this problem—in particular, the problem of predicting RNA secondary structure folding rate from the nucleotide sequence. For proteins, it has been shown that absolute contact order, which scales as ≈ n^0.7 for sequence length n, correlates rather well with protein folding rates for two- and multi-state folding proteins, reaching a correlation of 77% [27]—see as well Table 1 of [28]. Here, protein contact order is defined as the average chain separation of residues in contact (e.g. within 6 Å) in the native structure. It has also been shown that the number of native contacts correlates with folding rates of small single-domain proteins with two-state kinetics. In this case, Makarov et al. showed that ln(k) ≈ ln(N) + a + bN, where k denotes the folding rate, N is the number of contacts in the folded state, and a, b are constants whose physical meaning is understood [29].

Table 1. This table compares expected network degree and the length-normalized expected network degree for three RNA sequences of moderate size: 32 nt fruA, encoding the A subunit of coenzyme F420-reducing hydrogenase; 56 nt spliced leader RNA from L. collosoma; and 76 nt transfer RNA with accession code RA1180 from the database tRNAdb 2009 [41].

Unif-MS1 [resp. Unif-MS2] denotes the expected network degree for model B (uniform probability) for the MS1 [resp. MS2] move set. Turner99-MS1 [resp. Turner99-MS2] and Turner04-MS1 [resp. Turner04-MS2] denote the expected network degree for model C (Boltzmann probability for Turner 1999 and Turner 2004 energy parameters [36]) for the MS1 [resp. MS2] move set. Sample-MS1 [resp. Sample-MS2] denotes the approximation of the expected network degree for model C (Turner 1999 and Turner 2004 parameters) obtained by generating low energy structures by RNAsubopt -d0 -e 12, as explained in the text. In the case of fruA, all 971,399 possible structures were generated by RNAsubopt -d0 -e 100, so that Sample-MS1 and Sample-MS2 values are exact; for this reason, the standard deviation values are not included. Note that for L. collosoma, the expected degree values for the Turner 2004 energy parameters are much larger than those obtained for Turner 1999 energy parameters.

https://doi.org/10.1371/journal.pone.0139476.t001

To our knowledge, no relation has been established between RNA folding rate and either contact order or the number of native contacts, due in part to the above-mentioned absence of a database of RNA folding rates, and due in part to the notorious difficulty of estimating RNA secondary structure folding rates when using secondary structure kinetics software such as Kinfold [5], Kinefold [20], RNAKinetics [21], KFold [30], or other software [22, 23]. Such programs implement an event-driven Monte Carlo algorithm known as Gillespie’s algorithm [31]; it follows that repeated (time-consuming) simulations will generate a collection of mean first passage times which are approximately exponentially distributed. Since an exponential distribution has the property that the mean is equal to the standard deviation, it follows that precise kinetics obtained by such methods necessarily requires inordinate computation time (e.g. the population occupancy curve for yeast phe-tRNA required 3 months of CPU time on a 2.4 GHz Intel Pentium 4 running Linux [14]). Until a database of experimentally determined RNA folding rates becomes available, it is likely that the best approximation of folding rates can be made using exact, coarse-grained approaches, such as spectral methods in Treekin [14], basin hopping with RNAlocmin [17], and Hermes [18].
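The cost argument above can be made concrete with a short derivation. This is only a sketch, assuming N independent simulation runs whose first passage times are exactly exponential; the symbols N, μ and ε are introduced here for illustration and do not appear in the text.

```latex
% Assumption: T_1, ..., T_N are i.i.d. exponential first passage times with mean \mu,
% and \bar{T}_N is their sample mean.
\[
\sigma(T_i) = \mu, \qquad
\sigma\!\left(\bar{T}_N\right) = \frac{\mu}{\sqrt{N}}, \qquad
\frac{\sigma(\bar{T}_N)}{\mu} = \frac{1}{\sqrt{N}} \le \varepsilon
\;\Longleftrightarrow\; N \ge \varepsilon^{-2}.
\]
```

Under these assumptions, a relative standard error of 1% in the estimated mean first passage time would require on the order of 10^4 independent trajectories.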

Apart from contact order and the number of native contacts, the expected degree of the network of RNA secondary structures of an RNA sequence is another order parameter that could play a role in RNA folding kinetics—see the left panel of Fig 1 for an example of expected network degree for the toy sequence GGGGCCC. Here, the degree of a node (secondary structure) s is the number of secondary structures t that can be obtained from s by the addition, removal or shift of a base pair. These moves constitute the default move set employed by the program Kinfold [5], often used to estimate RNA folding kinetics. Moreover, by analyzing the network G = (V, E), whose node set V consists of low energy secondary structures of E. coli phe-tRNA (RF6280 [32]) and whose edge set E consists of directed edges s → t, where t is obtained from s by a base pair addition, removal or shift, the network for phe-tRNA was shown to be small-world in [33].

Fig 1.

(Left) Network for the toy 7-mer GGGGCCC, which has 8 nodes and 16 edges (hence 32 directed edges). The expected network degree is 32/8 = 4. Red edges indicate base pair addition or removal, while blue edges indicate shift moves. (Center) Feynman circular representation of secondary structure of Y RNA. (Right) Conventional representation of secondary structure of Y RNA. According to [55], one function of Y RNA is to bind to certain misfolded RNAs, including 5S rRNA, as part of a quality control mechanism. The secondary structure depicted is the consensus secondary structure of Y RNA with EMBL access number AAPY01489510:220–119 from Rfam family RF00195 in the Rfam database [56]. Images produced with software jViz [57].

https://doi.org/10.1371/journal.pone.0139476.g001

In this paper, we provide the first algorithm to efficiently compute the expected degree of an RNA network of secondary structures. Our work generalizes a recent paper [34], which describes a vastly simpler algorithm to compute the expected degree without consideration of shift moves. Since our current algorithm is surprisingly complex, for clarity of exposition, we consider three successive models. Model A is the RNA homopolymer model [35], in which any two positions i, j can constitute a base pair, provided only that i + 1 < j. Model B is the usual RNA secondary structure model, where positions i, j can constitute a base pair if the corresponding nucleotides form a Watson-Crick or wobble pair and i +3 < j; however, in Model B, the energy of a structure is taken to be zero, so the probability of a structure is simply one over the number of structures. Model C extends Model B by using the Turner 2004 energy parameters [36] without dangles. Our algorithms have been extensively tested against brute-force exhaustive methods to be sure of algorithm and implementation. Finally, we begin a preliminary investigation into the relation between network degree, contact order, conformational entropy, and number of native contacts using two benchmarking sets of RNA structures. Since we show later that expected network degree is linear in sequence length for the (theoretical) homopolymer case, we additionally compute the length-normalized network degree.

Preliminaries

Definition 1. A secondary structure for a given RNA nucleotide sequence a₁, …, a_n is a set s of base pairs (i, j), where 1 ≤ i < j ≤ n, such that:

1. if (i, j) ∈ s then a_i, a_j form either a Watson-Crick (AU, UA, CG, GC) or wobble (GU, UG) base pair,
2. if (i, j) ∈ s then j − i > θ = 3 (a steric constraint requiring that there be at least θ = 3 unpaired bases between any two positions that are paired),
3. if (i, j) ∈ s then for all i′ ≠ i and j′ ≠ j, (i′, j) ∉ s and (i, j′) ∉ s (nonexistence of base triples),
4. if (i, j) ∈ s and (k, ℓ) ∈ s, then it is not the case that i < k < j < ℓ (nonexistence of pseudoknots).
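The four conditions of Definition 1 translate directly into a validity check. The following is a minimal sketch, not the authors' implementation; the function name `is_valid_structure` and the representation of a structure as a set of 1-indexed tuples are assumptions made for illustration.

```python
# Watson-Crick and wobble pairs allowed by the first condition of Definition 1.
ALLOWED = {"AU", "UA", "CG", "GC", "GU", "UG"}

def is_valid_structure(seq, pairs, theta=3):
    """Return True if `pairs` (1-indexed (i, j) tuples) is a secondary structure of `seq`."""
    n = len(seq)
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            return False
        if seq[i - 1] + seq[j - 1] not in ALLOWED:   # condition 1: allowed nucleotides
            return False
        if j - i <= theta:                           # condition 2: steric constraint
            return False
        if i in used or j in used:                   # condition 3: no base triples
            return False
        used.update((i, j))
    for i, j in pairs:                               # condition 4: no pseudoknots
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True

print(is_valid_structure("GGGGCCC", {(1, 7), (2, 6)}))  # nested pairs
print(is_valid_structure("GGGGCCC", {(1, 6), (2, 7)}))  # crossing pairs
```

The quadratic pseudoknot test is chosen for readability; it is adequate for the small toy sequences used in this paper.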

Secondary structures can be depicted in several equivalent manners. For instance, the sequence and dot bracket representation for the secondary structure of Y RNA with EMBL access number AAPY01489510:220–119 is given by

GGCUGGUCCGAGUGCAGUGGUGUUUACAACUAAUUGAUCACAGCCAGUUACAGAUUCCUUUGUUCCUUCUCUACUCCCACUGCUUCACUUGACUAGCCUUUU
((((((((.((..(((((((.(.....(((.((.........................)).)))...........))))))...))..))))))))))....

Y RNA is a noncoding RNA, known to be required for the initiation of chromosomal DNA replication in mammalian cells [37]; a distinct function of Y RNA is mentioned in the caption to Fig 1, where two other formats for this secondary structure are depicted. A base pair (i, j) of structure s is an external base pair, if there is no base pair (x, y) ∈ s with the property that x < i < j < y. A position 1 ≤ k ≤ n is said to be visible in s if there is no base pair (i, j) ∈ s with the property that i ≤ k ≤ j. The secondary structure of Y RNA in Fig 1 has only one external base pair, i.e. (1, 98), and only four visible positions, i.e. positions 99, 100, 101, 102. Throughout the remainder of this paper, structure will mean secondary structure.
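The notions of external base pair and visible position can be read off a dot bracket string mechanically. The sketch below uses a short made-up structure rather than Y RNA; the function names are hypothetical and chosen only for this example.

```python
def parse_dot_bracket(db):
    """Convert a dot bracket string into a sorted list of 1-indexed base pairs."""
    stack, pairs = [], []
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return sorted(pairs)

def external_pairs(pairs):
    """Base pairs (i, j) not enclosed by any other pair (x, y) with x < i < j < y."""
    return [(i, j) for (i, j) in pairs
            if not any(x < i and j < y for (x, y) in pairs)]

def visible_positions(pairs, n):
    """Positions k in [1, n] that lie inside no base pair, i.e. no i <= k <= j."""
    return [k for k in range(1, n + 1)
            if not any(i <= k <= j for (i, j) in pairs)]

db = "((...))..(...)."          # toy structure of length 15
pairs = parse_dot_bracket(db)
print(external_pairs(pairs))     # [(1, 7), (10, 14)]
print(visible_positions(pairs, len(db)))  # [8, 9, 15]
```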

The base pair distance d_BP(s, t) between secondary structures s, t is the number of base pairs ∣s − t∣ + ∣t − s∣ belonging to s but not t, or vice versa. A shift move from base pair (i, j) in the structure s is of the form (i, k) [resp. (k, j)], where (s \ {(i, j)}) ∪ {(i, k)} [resp. (s \ {(i, j)}) ∪ {(k, j)}] is a valid secondary structure. Throughout, let bp(i, j) be a boolean valued function, where bp(i, j) = 1 if positions i, j can form a base pair; i.e. if a_i, a_j constitute a Watson-Crick or wobble pair. Reference [5] describes the Kinfold program, which implements the Gillespie algorithm [31] for RNA secondary structure folding kinetics. Kinfold produces secondary structure folding trajectories, or sequences s = s₀, s₁, …, s_m = t, where for 0 ≤ i < m, s_i+1 is obtained from s_i by the addition or deletion of a base pair, and (optionally) by a shift move. These are defined as follows.

The move set MS1 allows a move from structure s to structure t, if t can be obtained from s by the removal or addition of a base pair; i.e. if t = s \ {(i, j)} or t = s ∪ {(i, j)}. The move set MS2 allows moves from MS1 as well as four shift moves, described by the following. Structure t is obtained from s by the replacement of base pair (i, j) ∈ s by the distinct base pair (i, j′), or (j′, i), or (i′, j), or (j, i′), provided that t is a valid secondary structure. Figs 2, 3 and 4 depict some typical shift moves, including defect diffusion [38].
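The two move sets can be sketched as neighbor generators for a single structure. This is an illustrative brute-force version, not the authors' dynamic programming algorithm or the Kinfold implementation; all function names are assumptions, and structures are represented as frozensets of 1-indexed pairs.

```python
ALLOWED = {"AU", "UA", "CG", "GC", "GU", "UG"}

def is_valid(seq, s, theta=3):
    """Check Definition 1 for a set s of pairs (i, j) with 1 <= i < j <= len(seq)."""
    used = [p for pair in s for p in pair]
    if len(used) != len(set(used)):                      # base triple
        return False
    for i, j in s:
        if j - i <= theta or seq[i - 1] + seq[j - 1] not in ALLOWED:
            return False
    return not any(i < k < j < l for (i, j) in s for (k, l) in s)

def ms1_neighbors(seq, s, theta=3):
    """Structures reachable from s by removing or adding one base pair."""
    n = len(seq)
    out = {s - {p} for p in s}                           # removals
    for i in range(1, n + 1):
        for j in range(i + theta + 1, n + 1):
            if (i, j) not in s and is_valid(seq, s | {(i, j)}, theta):
                out.add(s | {(i, j)})                    # additions
    return out

def shift_neighbors(seq, s, theta=3):
    """Structures reachable from s by shifting one base pair (MS2 minus MS1)."""
    n = len(seq)
    out = set()
    for (i, j) in s:
        rest = s - {(i, j)}
        for keep in (i, j):                              # endpoint that is retained
            for new in range(1, n + 1):
                if new in (i, j):
                    continue
                t = rest | {(min(keep, new), max(keep, new))}
                if is_valid(seq, t, theta):
                    out.add(t)
    return out

s = frozenset({(1, 7)})
print(len(ms1_neighbors("GGGGCCC", s)), len(shift_neighbors("GGGGCCC", s)))
```

For the toy sequence GGGGCCC, the structure consisting of the single pair (1, 7) has two MS1 neighbors and four shift neighbors, hence degree 6 in the MS2 network.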

Fig 2. Defect diffusion [38], where a bulge migrates stepwise to become absorbed in a hairpin loop.

The move from structure (a) to structure (b) is possible by the shift (1, 12) → (1, 13), the move from (b) to (c) by shift (2, 11) → (2, 12), etc. Our algorithm properly accounts for such moves with respect to energy models A, B, C. Image adapted from figure on page 26 [19] and produced by VARNA [58].

https://doi.org/10.1371/journal.pone.0139476.g002

Fig 3. Example of multiloop creation which is handled by our algorithm for all energy models, including the Turner energy model.

To move from (a) to (b), remove the base pair (3, 13); to move from (b) to (c), shift (4, 12) → (12, 18); to move from (c) to (d), add base pair (13, 17). Image produced by VARNA [58].

https://doi.org/10.1371/journal.pone.0139476.g003

Fig 4. Example of multiloop creation which is handled by our algorithm for energy models A, B but not for Turner energy model C.

To move from (a) to (b), apply the shift (3, 13) → (13, 17); to move from (b) to (c), apply the shift (4, 12) → (12, 18). Our algorithm for the Turner energy model properly treats the move from (a) to (b), but not from (b) to (c), as explained in the Remark at the end of Section “Remaining recursions for Q_i,j and Z_i,j”. Image adapted from figure on page 27 [19] and produced by VARNA [58].

https://doi.org/10.1371/journal.pone.0139476.g004

Expected network degree

Throughout this paper, let a = a₁, …, a_n be a fixed, but arbitrary RNA sequence. Consider the set of all secondary structures of a as a network, or graph, where two structures s, t, are connected by an edge if t can be obtained from s by a base pair addition, removal or shift.

Fig 1 displays the network for a toy 7 nt sequence GGGGCCC, where moves come from move set MS2 (base pair additions and removals indicated by red edge; shift moves indicated by blue edge). Fig 5 displays the network for the slightly larger sequence ACGUACGUACGU, where moves come from move set MS2. In contrast, Fig 6 displays the network where moves are restricted to the move set MS1, and Fig 7 displays the network where shifts are the only allowable move—i.e. moves are restricted to the move set MS2\MS1. When moves are allowed to range over either MS1, or over MS2, the resulting network is connected; this is not the case for moves in MS2\MS1. Since the network represents intermediate moves in RNA folding trajectories, it is of interest to know the average network degree. This was done for move set MS1 in [34]. The goal of this paper is to describe the first algorithm, which computes the expected network degree, or equivalently, the expected number of neighbors, for the RNA network defined with move set MS2. Computing the expected number of neighbors when including shift moves turns out to be remarkably difficult, so for clarity of exposition, we present three versions of the algorithm, each adding a layer of complexity. Source code for all three energy models can be downloaded from http://bioinformatics.bc.edu/clotelab/.
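For toy sequences, the network can be built by exhaustive enumeration, which gives a useful cross-check of the node and edge counts quoted in the figure captions. The sketch below is a brute-force illustration under the uniform model, not the dynamic programming algorithm of this paper; the function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

ALLOWED = {"AU", "UA", "CG", "GC", "GU", "UG"}

def enumerate_structures(seq, theta=3):
    """Return all secondary structures of seq as frozensets of 1-indexed pairs."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if j - i <= theta:                    # interval too short for any base pair
            return (frozenset(),)
        out = list(rec(i, j - 1))             # position j unpaired
        for k in range(i, j - theta):         # position j paired with k
            if seq[k - 1] + seq[j - 1] in ALLOWED:
                for left in rec(i, k - 1):
                    for inner in rec(k + 1, j - 1):
                        out.append(left | inner | {(k, j)})
        return tuple(out)
    return rec(1, len(seq))

def classify_edge(s, t):
    """Return 'ms1' for addition/removal, 'shift' for a shift move, else None."""
    only_s, only_t = s - t, t - s
    if len(only_s) + len(only_t) == 1:
        return "ms1"
    if len(only_s) == 1 and len(only_t) == 1:
        (p,), (q,) = only_s, only_t
        if len(set(p) & set(q)) == 1:         # the two pairs share exactly one position
            return "shift"
    return None

def network_summary(seq, theta=3):
    """Return (nodes, undirected MS1 edges, undirected shift edges, average degree)."""
    nodes = enumerate_structures(seq, theta)
    counts = {"ms1": 0, "shift": 0}
    for s, t in combinations(nodes, 2):
        kind = classify_edge(s, t)
        if kind:
            counts[kind] += 1
    edges = counts["ms1"] + counts["shift"]
    return len(nodes), counts["ms1"], counts["shift"], 2 * edges / len(nodes)

print(network_summary("GGGGCCC"))        # (8, 8, 8, 4.0)
print(network_summary("ACGUACGUACGU"))
```

The pairwise comparison is quadratic in the number of structures, so this approach is only feasible for very short sequences, which is the motivation for the recursions developed below.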

Fig 5. The network of all secondary structures of the 12 nt (toy) sequence ACGUACGUACGU.

The minimum free energy structure is shown in green. Edges connect structures s, t, such that t is obtained by a move in MS2 from s, or vice versa; i.e. structures are connected by an edge if they differ by a base pair addition, removal or shift. There are 35 structures, 126 edges between structures that differ by a base pair removal or addition, and 68 edges between structures that differ by a base pair shift. Altogether, there are 194 edges, each counted once per direction. It follows that the average network degree is 194/35 ≈ 5.54.

https://doi.org/10.1371/journal.pone.0139476.g005

Fig 6. The network of all secondary structures of the 12 nt sequence ACGUACGUACGU, where edges connect structures s, t, such that t is obtained by a move in MS1 from s, or vice versa; i.e. structures are connected by an edge if they differ by a base pair addition or removal.

There are 35 structures and 126 edges (each counted once per direction) between structures that differ by a base pair removal or addition, hence the average network degree is 126/35 = 3.6.

https://doi.org/10.1371/journal.pone.0139476.g006

Fig 7. The network of all secondary structures of the 12 nt sequence ACGUACGUACGU, where edges appear between structures that differ by a shift move.

There are 35 structures and 68 edges (each counted once per direction) between structures that differ by a base pair shift, hence the average network degree is 68/35 ≈ 1.94. Note that the network is not connected, unlike the previous two networks.

https://doi.org/10.1371/journal.pone.0139476.g007

The plan of this paper is as follows. Section “Results” discusses the degree distribution for move sets MS1 and MS2, obtained by exhaustive enumeration and by sampling low energy structures. Asymptotic network degree is discussed and the correlation is computed between the expected network degree, contact order, conformational entropy, and expected number of native contacts. In Section “Homopolymer Model A”, we derive the recursions for the expected number of neighbors for move set MS2, with respect to the homopolymer Model A. In the homopolymer model, introduced in [35], any two positions i < j can form a base pair, provided only that j − i > 1; i.e. in Definition 1, item (1) is removed, and item (2) is modified so that θ = 1. In this model, the partition function Z of a length n homopolymer is simply the number of well-balanced parenthesis expressions with dots, having length n and in which j − i > 1 whenever a left [resp. right] parenthesis occurs at position i [resp. j]. For this model, the probability P(s) of each structure s is equal to the uniform probability 1/Z. In Section “Uniform, non-homopolymer Model B”, we give the recursions for the non-homopolymer uniform Model B, in which every secondary structure has energy zero, but where a secondary structure of the RNA sequence a = a₁, …, a_n must satisfy all four properties of Definition 1. In this case, the probability P(s) of structure s is defined by P(s) = exp(−E(s)/RT)/Z where R = 0.00198717 kcal/mol, T is absolute temperature, and the partition function is Z = ∑_s exp(−E(s)/RT). However, since E(s) = 0 for each structure s, the partition function Z is simply the number of secondary structures of a, and the probability P(s) is equal to the uniform probability P(s) = 1/Z. 
In Section “Model C with Turner energy parameters”, we give the recursions for the full Model C, with respect to the Turner energy model [36] which includes base stacking free energies and free energies for hairpins, bulges, internal loops and multiloops. The partition function Z = ∑_s exp(−E(s)/RT) can be computed by the McCaskill algorithm [39], and the probability of structure s is the usual Boltzmann probability P(s) = exp(−E(s)/RT)/Z.
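The relation between the uniform Model B and the Boltzmann Model C can be illustrated numerically. The sketch below takes a list of structure free energies as given (computing them requires an energy model such as the Turner parameters, which is not reproduced here); the default temperature of 310.15 K (37 °C) and the function name are assumptions.

```python
import math

R = 0.00198717  # gas constant in kcal/(mol K), value quoted in the text

def boltzmann_probabilities(energies, temperature=310.15):
    """Return (probabilities, Z) for free energies given in kcal/mol.

    The default temperature is an assumption made for this example.
    """
    rt = R * temperature
    factors = [math.exp(-e / rt) for e in energies]   # Boltzmann factors exp(-E/RT)
    z = sum(factors)                                  # partition function
    return [f / z for f in factors], z

# With E(s) = 0 for every structure, Z is the number of structures and P(s) = 1/Z.
probs, z = boltzmann_probabilities([0.0, 0.0, 0.0, 0.0])
print(probs, z)
```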

Materials and Methods

Let a = a₁, …, a_n be an arbitrary but fixed RNA sequence. For any 1 ≤ i ≤ j ≤ n, let a[i, j] denote the subsequence a_i, …, a_j, and let SS(a[i, j]) denote the set of secondary structures of a[i, j]. For s ∈ SS(a[i, j]), let BF(s) denote the Boltzmann factor exp(−E(s)/RT) of s, and define Q_i,j = ∑_s BF(s) ⋅ N(s), where the sum ranges over all s ∈ SS(a[i, j]) and N(s) is the number of secondary structures t of a[i, j] obtained from the structure s by the addition, deletion or shift of a base pair. The partition function for a[i, j] is defined by Z_i,j = ∑_s BF(s), with the sum again taken over all s ∈ SS(a[i, j]). It follows that the expected number of neighbors (network degree) is Q_1,n/Z_1,n. For clarity of exposition, in the following subsections, we describe recursions to compute Q_i,j and Z_i,j for three energy models for RNA secondary structures, each model a refinement of the previous model.

Homopolymer Model A

In this section, we derive the recursions for Q_1,n and Z_1,n for the homopolymer model, in which any two positions 1 ≤ i < j ≤ n can form a base pair, provided only that i + 1 < j. For the homopolymer model, there is no RNA sequence a = a₁, …, a_n, but rather only the interval [1, n] = {1, …, n}. Thus we speak of a structure on [i, j], rather than on a[i, j]. The energy of each structure in the homopolymer model is zero, so the probability of each structure s on [i, j] equals one divided by the number of structures on [i, j]. Moreover, there is no need to compute the doubly-indexed values Q_i,j and Z_i,j, since the values depend only on the size j − i + 1 of the interval [i, j]; i.e. if j − i = j′ − i′, then Q_i,j = Q_i′,j′ and Z_i,j = Z_i′,j′. Thus it is notationally simpler to define Q_n [resp. Z_n] in place of Q_1,n [resp. Z_1,n], and similarly for all other auxiliary functions.

For 0 ≤ n, define Q_n to be the sum, taken over all structures s of [1, n], of the number of base pair additions, removals or shifts of a base pair of s. Formally, we have

Q_n = ∑_s ( ∑_{(x,y)} I[(x, y) ∈ s] + ∑_{(x,y)} I[(x, y) ∉ s and s ∪ {(x, y)} is a structure] + ∑_{(x,y) ∈ s} ∑_{(k,ℓ)} I[(x, y) → (k, ℓ) is a shift yielding a structure] ) (1)

where the outer sum ranges over all structures s on [1, n], I denotes the indicator function, and “(x, y) → (k, ℓ)” denotes the move which consists of replacing base pair (x, y) by base pair (k, ℓ). As well, let Z_n denote the total number of homopolymer structures on [1, n] with θ = 1. Recursions for Z_n are well-known [35], but for completeness given in Eq (2) below.

Auxiliary functions f(n, x) and g(n, x).

Recall that here we take θ = 1 for simplicity of exposition of the ideas. Let Z_n denote the total number of structures on the homopolymer of length n. Since any two positions i, j can base-pair, as long as j − i > θ = 1, we have

Z_n = Z_{n − 1} + ∑_{r=0}^{n−3} Z_r ⋅ Z_{n − r − 2} for n > 2, with Z₀ = Z₁ = Z₂ = 1. (2)

The term Z_{n − 1} counts all structures s on [1, n] in which n is unpaired in s, while the term Z_r ⋅ Z_{n − r − 2} counts all structures s on [1, n] that contain the base pair (r + 1, n).
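The counting recursion for Z_n is easy to tabulate. The following sketch assumes that the sum over r runs over exactly those r for which the base pair (r + 1, n) respects the constraint n − (r + 1) > θ; the function name `homopolymer_counts` is hypothetical.

```python
def homopolymer_counts(n_max, theta=1):
    """Tabulate Z_0, ..., Z_{n_max}: the number of homopolymer structures on [1, n]."""
    z = [1] * (n_max + 1)                     # only the empty structure for n <= theta + 1
    for n in range(theta + 2, n_max + 1):
        total = z[n - 1]                      # position n unpaired
        for r in range(0, n - theta - 1):     # base pair (r + 1, n), r = 0, ..., n - theta - 2
            total += z[r] * z[n - r - 2]      # structures on [1, r] times those on [r + 2, n - 1]
        z[n] = total
    return z

print(homopolymer_counts(9))  # [1, 1, 1, 2, 4, 8, 17, 37, 82, 185]
```

The table uses linear space and quadratic time, in agreement with the complexity stated for Z later in this section.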

Define f(n, x) to be the number of secondary structures s for a length n homopolymer, such that s has x visible positions. Now for 0 ≤ n and 0 ≤ x ≤ n, define f by f(0, 0) = 1 and, for n ≥ 1,

f(n, x) = I[x = 0 and n > 2] ⋅ Z_{n − 2} + I[x ≥ 1] ⋅ f(n − 1, x − 1) + ∑_{r=1}^{n−3} f(r, x) ⋅ Z_{n − r − 2} (3)

The computation of f(n, x) uses dynamic programming and proceeds by double induction, i.e. for n fixed, induction is performed on x. The term Z_{n − 2} arises from structures s on [1, n] that contain the base pair (1, n); the term f(n − 1, x − 1) is the contribution from structures s on [1, n] in which n is unpaired; the term f(r, x) ⋅ Z_{n − r − 2} accounts for all structures s on [1, n] that contain the base pair (r + 1, n).
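A table for f(n, x) can be filled following the three contributions just described and compared against exhaustive enumeration. This is a sketch under the assumption θ = 1 and with the base pair (1, n) treated as the r = 0 term of the sum; the function names are hypothetical and this is not the authors' code.

```python
from functools import lru_cache

def z_table(n_max):
    """Z_n for the homopolymer with theta = 1."""
    z = [1] * (n_max + 1)
    for n in range(3, n_max + 1):
        z[n] = z[n - 1] + sum(z[r] * z[n - r - 2] for r in range(0, n - 2))
    return z

def f_table(n_max):
    """f[n][x] = number of structures on [1, n] with exactly x visible positions."""
    z = z_table(n_max)
    f = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    f[0][0] = 1
    for n in range(1, n_max + 1):
        for x in range(0, n + 1):
            total = f[n - 1][x - 1] if x >= 1 else 0   # position n unpaired, hence visible
            for r in range(0, n - 2):                  # base pair (r + 1, n); r = 0 is (1, n)
                total += f[r][x] * z[n - r - 2]
            f[n][x] = total
    return f

def brute_force_f(n):
    """Count structures on [1, n] by number of visible positions via enumeration."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if j - i <= 1:
            return (frozenset(),)
        out = list(rec(i, j - 1))
        for k in range(i, j - 1):
            for left in rec(i, k - 1):
                for inner in rec(k + 1, j - 1):
                    out.append(left | inner | {(k, j)})
        return tuple(out)
    counts = [0] * (n + 1)
    for s in rec(1, n):
        covered = set()
        for i, j in s:
            covered.update(range(i, j + 1))            # positions inside a base pair
        counts[n - len(covered)] += 1
    return counts

f = f_table(8)
print(all(f[n][: n + 1] == brute_force_f(n) for n in range(9)))
```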

Define g(n, x) to be the number of secondary structures s for the length n homopolymer, such that s has x visible positions in the interval [1, n − θ − 1] = [1, n − 2], and position n is unpaired in s. (4) The term f(n − 2, x) accounts for all structures s on [1, n] in which n − 1, n are unpaired. The term Z_{n − 3} arises in the case n > 2, x = 0 for structures s on [1, n] that contain the base pair (1, n − 1). Finally, the term f(r, x) ⋅ Z_{n − r − 3} arises from structures s on [1, n] that contain the base pair (r + 1, n − 1). In all cases, the structures considered are unpaired at position n, and have exactly x visible positions in the interval [1, n − 2].

Auxiliary function E_n.

For 1 ≤ n, define the function E_n to be the number of external base pairs in all homopolymer structures on [1, n]; formally, we have (5) Recalling that Z_n denotes the number of structures on [1, n], we define Z₀ = 1, E₀ = 1, and E_n = 0 for 1 ≤ n ≤ 2 = θ + 1. Note that for 1 ≤ n ≤ 2, it must be that E_n = 0, since the empty structure is the only possible structure on [1, n] in this case. For larger values of n, note that (6) (7) (8) Note that the rightmost term in the last line arises from the contribution of 1 for base pair (k, n). In summary, we have shown that (9)
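The quantity E_n can be checked on small cases by enumeration. The sketch below also includes one possible recursion, obtained by splitting on whether position n is unpaired or paired with some k; it is our own illustration and uses the boundary value E₀ = 0 that the counting definition gives for the empty interval, which may differ from the convention adopted in the text. All function names are hypothetical.

```python
from functools import lru_cache

def z_table(n_max):
    """Z_n for the homopolymer with theta = 1."""
    z = [1] * (n_max + 1)
    for n in range(3, n_max + 1):
        z[n] = z[n - 1] + sum(z[r] * z[n - r - 2] for r in range(0, n - 2))
    return z

def e_table(n_max):
    """E_n = total number of external base pairs over all structures on [1, n]."""
    z = z_table(n_max)
    e = [0] * (n_max + 1)                 # assumption: E_0 = E_1 = E_2 = 0 in this sketch
    for n in range(3, n_max + 1):
        total = e[n - 1]                  # position n unpaired
        for k in range(1, n - 1):         # base pair (k, n), k = 1, ..., n - 2
            total += e[k - 1] * z[n - k - 1]   # external pairs lying in [1, k - 1]
            total += z[k - 1] * z[n - k - 1]   # contribution of 1 for the pair (k, n)
        e[n] = total
    return e

def brute_force_e(n):
    """Same quantity by exhaustive enumeration of homopolymer structures."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if j - i <= 1:
            return (frozenset(),)
        out = list(rec(i, j - 1))
        for k in range(i, j - 1):
            for left in rec(i, k - 1):
                for inner in rec(k + 1, j - 1):
                    out.append(left | inner | {(k, j)})
        return tuple(out)
    return sum(
        sum(1 for (i, j) in s if not any(x < i and j < y for (x, y) in s))
        for s in rec(1, n)
    )

print(e_table(5))  # [0, 0, 0, 1, 3, 7]
```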

Main function Q_n.

For clarity in the derivation of Q_n, we start by explicitly listing the moves in move set MS2. Let x, x′, y, y′ denote distinct positions all belonging to the interval [1, n]. The structure t can be obtained from structure s by a move from MS2, if t is a valid secondary structure and can be obtained from s by applying a move of the form 1–6.

Addition of a base pair (x, y) to s.
Removal of a base pair (x, y) from s.
Shift of a base pair (x, y) in s to (x, y′) in t.
Shift of a base pair (x, y) in s to (y′, x) in t.
Shift of a base pair (x, y) in s to (x′, y) in t.
Shift of a base pair (x, y) in s to (y, x′) in t.

The shift moves 3–6 are depicted in Fig 8.

Fig 8. Illustration of shift moves defined in Sections “Main function Q_n” and “Recursion for function Q_i,j”.

https://doi.org/10.1371/journal.pone.0139476.g008

Let Q_n = ∑_s N(s), where the sum is taken over all structures s on [1, n] and N(s) is the number of structures t that can be obtained from s by applying a move from move set MS2. Define Q₀ = 1, and Q₁ = Q₂ = 0, Z₋₁ = 0, Z₀ = Z₁ = Z₂ = 1. For the inductive case where n > 2, initialize Q_n = 0 and then add the contributions from below.

Case 1(a): In this case, we consider the contribution from structures s on [1, n], in which the last position n is unpaired, and t is obtained from s by a move from MS2 involving x, y, x′, y′ ∈ [1, n − 1].

Notice that in shifts of type 3, 4 the original position x is retained, while in shifts of type 5, 6 the original position y is retained, for distinct x, x′, y in the interval [1, n − 1]. Also, notice that shifts of base pairs involving the last position n are not considered in Case 1(a) – such shifts will later be treated in cases 1(c), 2(b) and 2(c). The contribution in this case is given by (10) The term Q_n−1 arises from neighbors t of s in which the last position n is unpaired, and the base pair (x, y) is added/removed/shifted in s.

Case 1(b): In this case, we consider the contribution from structures s on [1, n], in which the last position n is unpaired, and t is obtained from s by adding the base pair (k, n) for some 1 ≤ k ≤ n − θ − 1. The contribution in this case is given by (11)

Case 1(c): In this case, we consider the contribution from structures s on [1, n], in which the last position n is unpaired, and t is obtained from s by shifting the base pair (x, y) to (x, n), or by shifting the base pair (x, y) to (y, n), for distinct x, y in the interval [1, n − 1]. These shifts are treated separately.

Case 1(c)(i): Consider a shift of the form (x, y) to (x, n), for y < n. The function E_n−1 counts the number of external base pairs (x, y) where y ≤ n − 1, for all structures on [1, n − 1]. For any such (x, y), it is possible to shift the base pair (x, y) to (x, n), and so the contribution is (12)

Case 1(c)(ii): Consider a shift of the form (x, y) to (y, n), for y < n − 1. The function E_n−2 counts the sum over all structures on [1, n − 2] of the number of external base pairs (x, y) with y ≤ n − 2. Since y ≤ n − 2 and θ = 1, and n is unpaired, it is possible to shift the base pair (x, y) to (y, n) and vice versa. So far, we have not considered structures s on [1, n − 1] in which n − 1 is base-paired. For a structure s on [1, n − 1] that contains base pair (r + 1, n − 1), there are Z_n−r−3 many structures s₂ on [r + 2, n − 2]; moreover, for any external base pair (x, y) in a structure s₁ on [1, r], we can shift the base pair (x, y) to (y, n). This explains the presence of the term ∑_r E_r ⋅ Z_n−r−3. Thus the contribution is (13) In conclusion, (14)

Case 2(a): The contribution from structures s on [1, n], in which the last position n is base-paired, where neighbor t is obtained from s by removal of that last base pair (k, n), is given by (15) Note that Case 2(a) is dual to Case 1(b).

Case 2(b): In this case, we consider the contribution from structures s on [1, n], in which the last position n is base-paired, where neighbor t is obtained from structure s by a shift of the last base pair (k, n) to (k′, n) for some k′ ≠ k that is visible in structure s − {(k, n)}. Note that if we were to remove base pair (k, n) from s, then the last position of s − {(k, n)} must be unpaired, and the position n − 1 may or may not be base paired. Recall that g(n, x) is the number of structures s on [1, n] that contain x visible positions in the interval [1, n − 2], and in which position n is unpaired. If we choose a first position k out of the x visible positions, and subsequently a second distinct position k′ out of the remaining x − 1 visible positions, then we properly count the contribution from structures s containing (k, n) which can be transformed to a structure t by the shift (k, n) → (k′, n).

The contribution in this case is

∑_x x ⋅ (x − 1) ⋅ g(n, x) (16)

since we have x choices for value k and then (x − 1) choices for k′, both selected from the x visible positions of the structure.

Case 2(c): In this case, we consider the contribution from structures s on [1, n], in which the last position n is base-paired, where neighbor t is obtained from structure s by a shift of base pair (k, n) to (k, k′), or a shift of the last base pair (k, n) to (k′, k), for some k ≠ k′ that is visible in structure s − {(k, n)}. These shifts are treated separately.

Case 2(c)(i): Consider a shift of the form (k, n) to (k, k′), for k′ < n. The function E_n−1 counts the sum over all structures on [1, n − 1] of the number of external base pairs (k, k′) with k′ ≤ n − 1. For any such (k, k′), it is possible to apply the shift to (k, n), and vice versa. Thus Case 2(c)(i) is dual to Case 1(c)(i) and the contribution is clearly (17)

Case 2(c)(ii): Consider a shift of the form (k, n) to (k′, k), for k′ < k − 1. The function E_n−2 counts the sum over all structures on [1, n − 2] of the number of external base pairs (k′, k) with k ≤ n − 2. Since k ≤ n − 2 and θ = 1, and n is unpaired, it is possible to shift the base pair (k′, k) to (k, n) and vice versa. By duality to Case 1(c)(ii), we have the additional contribution of ∑_r E_r ⋅ Z_n−r−3 to account for shifting the base pair (y, n) to an external base pair (x, y) in a structure s₁ on [1, r], in the case that n − 1 is base-paired. Thus Case 2(c)(ii) is dual to Case 1(c)(ii) and the contribution is clearly (18) In conclusion, (19)

Case 2(d): In this case, we consider the contribution from structures s on [1, n], in which the last position n is base-paired with base pair (k, n), where neighbor t is obtained from a shift or addition/deletion of a base pair in the left portion [1, k − 1] or right portion [k + 1, n − 1], so that t retains the base pair (k, n). In this case, the contribution is (20) The first term arises from the addition/removal/shift of a base pair (x, y), where k + 1 ≤ x < y ≤ n − 1, and the second term arises from the addition/removal/shift of a base pair (x, y), where 1 ≤ x < y ≤ k − 1.

Putting together all contributions from Case 1(a) through Case 2(d), we have (21) The functions f, g require the greatest space and time resources, and it is easily seen that the space [resp. time] complexity for Z is O(n) [resp. O(n²)], for f is O(n²) [resp. O(n³)], for g is O(n²) [resp. O(n³)], and that given arrays that contain the values of f and g, the additional space [resp. time] complexity for E and Q is O(n) [resp. O(n²)]. It follows that the expected network degree in the homopolymer case Model A can be computed in quadratic space O(n²) and cubic time O(n³). We have implemented a dynamic programming algorithm for each of the functions E, f, g, Q, Z, resulting in software for the expected network degree with respect to the homopolymer model. Our code has been extensively cross-checked against alternative brute-force methods.
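A brute-force reference of the kind used for cross-checking can be sketched as follows. It enumerates all homopolymer structures with θ = 1 and counts MS2 neighbors by pairwise comparison, so it is exponential in n and only meant for small n ≥ 3; it is an illustration with hypothetical function names, not the authors' test code.

```python
from functools import lru_cache
from itertools import combinations

def homopolymer_structures(n, theta=1):
    """All structures on [1, n] in which any i, j with j - i > theta may pair."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if j - i <= theta:
            return (frozenset(),)
        out = list(rec(i, j - 1))
        for k in range(i, j - theta):
            for left in rec(i, k - 1):
                for inner in rec(k + 1, j - 1):
                    out.append(left | inner | {(k, j)})
        return tuple(out)
    return rec(1, n)

def are_ms2_neighbors(s, t):
    """True if t arises from s by adding, removing or shifting one base pair."""
    only_s, only_t = s - t, t - s
    if len(only_s) + len(only_t) == 1:
        return True
    if len(only_s) == 1 and len(only_t) == 1:
        (p,), (q,) = only_s, only_t
        return len(set(p) & set(q)) == 1      # shifted pair keeps exactly one endpoint
    return False

def brute_force_degree(n):
    """Return (Q_n, Z_n, Q_n / Z_n), where Q_n is the sum of all node degrees."""
    nodes = homopolymer_structures(n)
    q = 2 * sum(are_ms2_neighbors(s, t) for s, t in combinations(nodes, 2))
    return q, len(nodes), q / len(nodes)

for n in range(3, 11):
    q, z, degree = brute_force_degree(n)
    print(n, q, z, round(degree, 3), round(degree / n, 3))
```

The last printed column is the length-normalized expected degree, which is the quantity of interest given that the expected degree grows with sequence length.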

Uniform, non-homopolymer Model B

In this section, we consider the uniform, non-homopolymer Model B, in which secondary structures must satisfy Definition 1; i.e. in contrast to the notion of structure from the previous Section “Homopolymer Model A”, each base pair (i, j) of a secondary structure s of the RNA sequence a = a₁, …, a_n must satisfy j − i > θ = 3, and a_i, a_j must constitute a Watson-Crick or wobble pair. In Model B, the energy of each structure is zero, so the partition function Z = Z_1,n is the total number of structures of a, and the probability P(s) of each structure s is 1/Z. For the recursions necessary to compute the expected value of N(s), where N(s) denotes the number of neighbors of s under move set MS2, we need to define new functions EL, ER, ER′, F, G. There is a correspondence between functions EL_{i,j − 1, a_j} [resp. ] { resp. G_{i,j,a_j, x} } in the current section with the functions E_n−1 [resp. ] { resp. g(n, x) } from the previous Section “Homopolymer Model A”.

Critical definitions and recursions.

For a given RNA sequence a = a₁, …, a_n, define the subsequence a[i, j] = a_i, …, a_j. Positions i, j can form a base pair, denoted by bp(i, j) = 1, if a_i, a_j is either a Watson-Crick pair AU, UA, GC, or CG, or a wobble pair; otherwise bp(i, j) = 0. For k ∈ [1, n] and c ∈ {A, C, G, U}, we also write bp(k, c) = 1 to mean that a_k, c constitute either a Watson-Crick or wobble base pair. A nucleotide position k ∈ [1, n] is said to be visible in the secondary structure s, if for every base pair (i, j) ∈ s, it is not the case that i ≤ k ≤ j. If we state that structure s has exactly x visible occurrences of a nucleotide in [i, j − θ − 1] that can base pair with c, then we mean that there are positions i ≤ i₁ < i₂ < ⋯ < i_x ≤ j − θ − 1 visible in s, such that bp(i₁, c) = 1, …, bp(i_x, c) = 1; moreover there are no other positions beyond i₁, …, i_x with this property.
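The definitions of bp and of visible positions translate directly into code. The following Python sketch is illustrative only; the names `bp`, `visible_positions` and `visible_partners` are hypothetical helpers, positions are 1-indexed as in the text, and a structure is represented as a set of pairs (i, j).

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def bp(seq, k, other):
    """Return 1 if position k (1-indexed) can pair with `other`, else 0.

    `other` is either a 1-indexed position or a nucleotide letter,
    mirroring the two uses bp(i, j) and bp(k, c) in the text.
    """
    c = seq[other - 1] if isinstance(other, int) else other
    return int((seq[k - 1], c) in PAIRS)

def visible_positions(structure, i, j):
    """Positions k in [i, j] that lie under no base pair (x, y) with x <= k <= y."""
    return [k for k in range(i, j + 1)
            if not any(x <= k <= y for (x, y) in structure)]

def visible_partners(seq, structure, i, j, c, theta=3):
    """Visible positions in [i, j - theta - 1] that can base pair with nucleotide c."""
    return [k for k in visible_positions(structure, i, j - theta - 1)
            if bp(seq, k, c)]
```

For example, with the sequence AGGCAAAGCU and the structure {(3, 9)}, positions 1, 2 and 10 are visible, and of the visible positions in [1, 6] both A at position 1 and G at position 2 can pair with U.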

The base pair (i, j) ∈ s is said to be an external base pair of the secondary structure s, if there is no distinct base pair (i′, j′) ∈ s with the property that i′ ≤ i < j ≤ j′. In formulas, for brevity, we write that ‘(i, j) is external in s’, to mean that (i, j) is an external base pair of s. Let denote the set of all secondary structures of the subword a[i, j]. Recall that the indicator function I[P] is equal to 1 if relation P is true, and 0 otherwise. For 1 ≤ i ≤ j ≤ n, c ∈ {A, C, G, U}, and x ∈ [0, n], define the functions EL_i,j,c, ER_i,j,c, ER′_i,j,c, F_i,j,c,x, G_i,j,c,x as follows. (22) (23) (24) (25) (26)

The two differences between the homopolymer Model A and the current Model B are: (1) in Model B, if (k, j) is a base pair, then the nucleotides at positions k, j must be one of AU, UA, GC, CG, GU, UG, (2) in Model B, θ = 3, so if (k, j) is a base pair, then j ≥ i + θ + 1 = i + 4. Both of these issues substantially complicate the treatment, so instead of the function E_n with one argument, we have three functions, EL_i,j,c, ER_i,j,c, ER′_i,j,c, each having three arguments. The arguments i, j designate the left and right endpoints of the interval [i, j], and the functions are defined by induction on increasing values of the difference j − i. The argument c holds the nucleotide (A, C, G or U) at position j; this allows one to test whether the nucleotide at position k ∈ [i, j − θ − 1] can form a base pair with the nucleotide at position j. Thus EL_i,j,c is the sum, taken over all structures on [i, j], of the number of external base pairs (x, y) where we can alternatively form the base pair (x, j), as depicted in panel (a) of Fig 9. Likewise, ER′_i,j,c is the sum, taken over all structures on [i, j], of the number of external base pairs (x, y) where we can alternatively form the base pair (y, j), as depicted in panel (b) of Fig 9. The function ER_i,j,c is defined first, since this simplifies the recursion for ER′_i,j,c.

The function G_i,j,c,x has a fourth parameter x, for which G_i,j,c,x counts the number of structures on [i, j] having exactly x visible positions (external to all base pairs) in the interval [i, j − θ − 1] = [i, j − 4] of a nucleotide that can form a base pair with nucleotide c, as depicted in panel (d) of Fig 9. It will follow that for structures having exactly x such visible positions that can form a base pair with position j, there are many pairs k′, k where a shift of the form (k, j) → (k′, j) is possible. The function F_i,j,c,x is introduced to simplify the recursions for G, where F_i,j,c,x counts the number of structures on [i, j] having exactly x visible occurrences of a nucleotide that can form a base pair with c. With this introduction, we give the formal definitions.

Fig 9. Illustration of cases 1c, 1d, 2c, 2d from Section “Recursion for function Q_i,j”.

https://doi.org/10.1371/journal.pone.0139476.g009

Definition of EL.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define EL_i,j,c by induction on j − i.

Base Case: If j − i ≤ θ, define EL_i,j,c = 0.

Inductive Case: If j − i > θ, define EL_i,j,c as the sum of the following (27)

Definition of ER.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ER_i,j,c by induction on j − i.

Base Case: If j − i ≤ θ, define ER_i,j,c = 0.

Inductive Case: If j − i > θ, define ER_i,j,c as the sum of the following (28)

Definition of ER′.

For 1 ≤ i ≤ j ≤ n and c ∈ {A, C, G, U}, we define ER′_i,j,c by induction on j − i.

Base Case: If j − i ≤ θ, define ER′_i,j,c = 0.

Inductive Case: If j − i > θ, define ER′_i,j,c as the sum of the following (29) Note that the first term to the right of the equality sign in the previous equation is ER_{i,j−θ − 1, c} and not ER′_{i,j−θ − 1, c}.

Definition of F.

For 1 ≤ i ≤ j ≤ n, c ∈ {A, C, G, U} and x ∈ [0, n], we define F_i,j,c,x by induction on j − i. For j − i < 0, c ∈ {A, C, G, U}, and 0 ≤ x ≤ j − i + 1, define F_i,j,c,x = 0.

Base Case i = j: For c ∈ {A, C, G, U}, define F_{i,i,c,bp(i, c)} = 1, since the only structure on [i, i] is the empty structure, in which position i is visible; i.e. (30) and (31)

Base Case i < j ≤ i+θ: For i < j ≤ i + θ, and x ∈ [0, j − i + 1], define F_i,j,c,x by double induction on j − i and x (32)

Inductive Case j > i+θ: For j > i+θ, and x ∈ [0, n], we define F by double induction on j − i and x, where we treat the cases x = 0 and x > 0 separately.

Subcase x = 0: (33) Subcase x > 0: (34)

Definition of G.

Recall that G_i,j,c,x is defined to be the number of structures s on [i, j] in which j is unpaired and which have exactly x visible occurrences of a nucleotide in [i, j − θ − 1] that can base-pair with c. Initially define G_i,j,c,x = 0 for all i, j, c, x.

Base Case: For i ≤ j ≤ i + θ, and c ∈ {A, C, G, U}, define G_{i,j,c, 0} = 0.

Inductive Case: In this case, j > i + θ, and c ∈ {A, C, G, U}. We separately treat the subcases x = 0 and x > 0.

Subcase x = 0: (35) Subcase x > 0: (36)
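Because G is defined in words as a count of structures, its values can be tabulated by exhaustive enumeration on short sequences, which is useful for checking a dynamic programming implementation. The Python sketch below is a hypothetical illustration of such a check and does not implement recursions (35) and (36); the names `structures` and `G_table` are assumptions, and the enumeration is exponential in the sequence length.

```python
from collections import Counter

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

def structures(seq, i, j, theta=3):
    """All secondary structures of seq[i..j] (1-indexed), as lists of pairs."""
    if j - i <= theta:
        return [[]]
    out = list(structures(seq, i, j - 1, theta))           # j unpaired
    for k in range(i, j - theta):                           # j paired with k
        if seq[k - 1] + seq[j - 1] in PAIRS:
            for left in structures(seq, i, k - 1, theta):
                for right in structures(seq, k + 1, j - 1, theta):
                    out.append(left + right + [(k, j)])
    return out

def G_table(seq, i, j, c, theta=3):
    """Map x -> number of structures on [i, j] with j unpaired and exactly x
    visible positions in [i, j - theta - 1] that can base pair with c."""
    table = Counter()
    for s in structures(seq, i, j, theta):
        if any(y == j for (_, y) in s):
            continue                                        # j must be unpaired
        x = sum(1 for k in range(i, j - theta)
                if seq[k - 1] + c in PAIRS
                and not any(p <= k <= q for (p, q) in s))
        table[x] += 1
    return table
```

For the toy sequence GGAAAACC there are six structures, three of which leave position 8 unpaired: the empty structure (two visible G's in [1, 4]), {(2, 7)} (one) and {(1, 7)} (none), so `G_table("GGAAAACC", 1, 8, "C")` has one structure for each of x = 0, 1, 2.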

Computing the total number of moves using MS1.

For 1 ≤ i ≤ j ≤ n, define Q_i,j to be the sum, taken over all structures s of a_i, …, a_j, of the number of base pairs that can be added to or removed from s. Formally, we have (37) or equivalently (38) where d_BP(s, t) denotes the base pair distance between structures s, t. Define Q_i,j by recursion on j − i, for 1 ≤ i ≤ j ≤ n.

Base Case: For i ≤ j ≤ i + θ, define Q_i,j = 0.

Inductive Case: For j > i + θ, define (39)
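A minimal Python sketch of a dynamic program for the MS1 totals follows. It is a reconstruction from a counting argument rather than a transcription of the recursion above: if j is unpaired, moves within [i, j − 1] contribute Q_{i,j−1}; for each admissible pair (k, j), moves within the left and right parts contribute Q_{i,k−1}·Z_{k+1,j−1} + Z_{i,k−1}·Q_{k+1,j−1}, removal of (k, j) contributes Z_{i,k−1}·Z_{k+1,j−1}, and the reverse additions of (k, j) contribute the same amount again. The function name `ms1_totals` and the `can_pair` callback are hypothetical.

```python
def ms1_totals(n, can_pair, theta=3):
    """Return (Z, Q) tables for 1-indexed intervals [i, j] under move set MS1.

    Z[i][j] = number of structures on [i, j];
    Q[i][j] = sum over those structures of the number of base pairs that
              can be added or removed.
    """
    # Tables have n + 2 rows/columns so empty intervals [i, i - 1] are addressable.
    Z = [[1] * (n + 2) for _ in range(n + 2)]
    Q = [[0] * (n + 2) for _ in range(n + 2)]
    for d in range(theta + 1, n):                # d = j - i, increasing
        for i in range(1, n - d + 1):
            j = i + d
            z, q = Z[i][j - 1], Q[i][j - 1]      # j unpaired
            for k in range(i, j - theta):        # j paired with k
                if not can_pair(k, j):
                    continue
                zl, zr = Z[i][k - 1], Z[k + 1][j - 1]
                ql, qr = Q[i][k - 1], Q[k + 1][j - 1]
                z += zl * zr
                # moves inside the left or right part, plus removal of (k, j)
                # and the matching additions of (k, j): hence the factor 2
                q += ql * zr + zl * qr + 2 * zl * zr
            Z[i][j], Q[i][j] = z, q
    return Z, Q
```

Under the uniform model, the expected MS1 degree is then Q[1][n] / Z[1][n]. Passing `can_pair=lambda i, j: True` with `theta=1` gives the homopolymer case, which is convenient for checking small instances by hand.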

Computing the total number of moves using MS2.

For 1 ≤ i ≤ j ≤ n, define Q_i,j to be the sum, taken over all structures s of a_i, …, a_j, of the number of base pair additions, removals or shifts of a base pair of s. Formally, we have (40)

Now define Q_i,j by recursion on j − i, for 1 ≤ i ≤ j ≤ n.

Base Case: For i ≤ j ≤ i + θ, define Q_i,j = 0.

Inductive Case: For j > i + θ, define (41)

Computing the total number of moves using MS2\MS1.

For 1 ≤ i ≤ j ≤ n, define Q_i,j to be the sum, taken over all structures s of a_i, …, a_j, of the number of shifts of a base pair of s. Formally, we have (42)

Now define Q_i,j by recursion on j − i, for 1 ≤ i ≤ j ≤ n.

Base Case: For i ≤ j ≤ i + θ, define Q_i,j = 0.

Inductive Case: For j > i + θ, define (43)

We have implemented a dynamic programming algorithm for each of the functions EL, ER, ER′, F, G, Q and Z, resulting in software for the expected network degree, with respect to uniform probability for the move sets MS1, MS2, MS2\MS1. The space and time resources needed by the program can be analyzed in a manner similar to that described at the end of Section “Homopolymer Model A”; however, there is an additional factor of n in both space and time requirements, so that the software runs in space O(n³) and time O(n⁴). During algorithm development and implementation, we extensively cross-checked our results against those obtained by exhaustive, brute-force counting, thus ensuring correctness of our code.

Model C with Turner energy parameters

Here we consider Model C, for which secondary structures satisfy Definition 1 and such that E(s) indicates the Turner energy of s, which involves free energy parameters [36] for stacked base pairs, hairpins, bulges, internal loops and multiloops. For RNA sequence a = a₁, …, a_n, we present recursions in the following for Z_i,j and Q_i,j, where (44) (45) (46) (47) (48) (49)

Note that I is the indicator function, and that QB_i,j is the Boltzmann weighted sum of the number of neighbors, using move set MS2, where the sum is taken over all structures that contain the base pair (i, j). Similarly ZB_i,j is the sum of Boltzmann factors BF(s), where the sum is taken over all structures that contain the base pair (i, j). We write bp(k, j) = 1 to mean that nucleotides a_k, a_j can form either a Watson-Crick or wobble base pair, and for nucleotide c ∈ {A, C, G, U}, we write bp(k, c) = 1 to mean that nucleotides a_k and c can form a Watson-Crick or wobble base pair. From the context, there should be no confusion between bp(k, j) and bp(k, c).
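To make the Boltzmann-weighted quantities concrete, the following Python sketch computes, by exhaustive enumeration, the ratio of the Boltzmann-weighted sum of MS2 degrees to the partition function for a short sequence. It is an illustration under loud assumptions: `toy_energy` assigns −1 kcal/mol per base pair and is a hypothetical stand-in, not the Turner model; all function names are hypothetical; and the enumeration is exponential, so it only serves as a check for very small inputs.

```python
import math
from itertools import combinations

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}
RT = 0.0019872 * 310.15          # kcal/mol at 37 degrees C

def structures(seq, i, j, theta=3):
    """All secondary structures of seq[i..j] (1-indexed), as lists of pairs."""
    if j - i <= theta:
        return [[]]
    out = list(structures(seq, i, j - 1, theta))
    for k in range(i, j - theta):
        if seq[k - 1] + seq[j - 1] in PAIRS:
            for left in structures(seq, i, k - 1, theta):
                for right in structures(seq, k + 1, j - 1, theta):
                    out.append(left + right + [(k, j)])
    return out

def is_move(s, t):
    """True if s and t differ by one addition, removal or shift (MS2)."""
    diff = s ^ t
    if len(diff) == 1:
        return True
    if len(diff) == 2 and len(s) == len(t):
        p, q = diff
        return len(set(p) & set(q)) == 1
    return False

def toy_energy(s):
    """ASSUMPTION: toy energy of -1 kcal/mol per base pair, not Turner energy."""
    return -1.0 * len(s)

def expected_degree(seq, energy=toy_energy, theta=3):
    """Boltzmann-weighted mean number of MS2 neighbors, by brute force."""
    ss = [frozenset(s) for s in structures(seq, 1, len(seq), theta)]
    weight = {s: math.exp(-energy(s) / RT) for s in ss}
    degree = dict.fromkeys(ss, 0)
    for s, t in combinations(ss, 2):
        if is_move(s, t):
            degree[s] += 1
            degree[t] += 1
    Z = sum(weight.values())
    Q = sum(weight[s] * degree[s] for s in ss)
    return Q / Z
```

With `energy=lambda s: 0.0` every structure has weight 1, and the result reduces to the uniform expected degree of Model B.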

Auxiliary functions EL, ER, ER′, F, G.

For 1 ≤ i ≤ j ≤ n, c ∈ {A, C, G, U}, and x ∈ [0, n], define the Boltzmann version of the functions defined in the previous Section “Uniform, non-homopolymer Model B”, where without risk of confusion we use the same function notations for EL_i,j,c, ER_i,j,c, ER′_i,j,c, F_i,j,c,x, G_i,j,c,x, although the underlying definitions must be modified. (50) (51) (52) (53) (54) Recursions for a dynamic programming implementation of these functions are given later in Section “Recursions for auxiliary functions”. We focus now on how to compute Q_i,j using these auxiliary functions.

Recursion for function Q_i,j.

For notational convenience, define Q_{i,i − 1} = 0 and Z_i,i−1 = 1 for all 1 ≤ i ≤ n. If i ≤ j < i + θ + 1, then for any secondary structure s on [i, j], there are no structural neighbors of s and so Q_i,j = 0. If i ≤ j < i + θ + 1, then the only secondary structure on [i, j] is the empty structure with free energy of zero, so Z_i,j = 1. Now assume that i + θ + 1 ≤ j. By definition (55) For the move set MS1 (in the absence of shift moves), it has been shown in [34] that (56) However, when allowing shift moves, the situation is more complicated since there are shifts involving x, y, x′, y′ ∈ [i, j] that are neither fully contained in the segment [i, j − 1] for structures in which j is unpaired, nor fully contained in one of the segments [i, k − 1], [k, j] for structures which contain the base pair (k, j). The former shifts are treated in cases 1(c), 1(d), while the latter shifts are treated in cases 2(c), 2(d).

For clarity in the derivation of Q_i,j, we start by explicitly listing the moves in move set MS2. Let x, x′, y, y′ denote distinct positions all belonging to the interval [i, j]. The structure t can be obtained from structure s by a move from MS2, if t is a valid secondary structure and can be obtained from s by applying a move of the form 1–6.

1. Addition of a base pair (x, y) to s.
2. Removal of a base pair (x, y) from s.
3. Shift of a base pair (x, y) in s to (x, y′) in t.
4. Shift of a base pair (x, y) in s to (y′, x) in t.
5. Shift of a base pair (x, y) in s to (x′, y) in t.
6. Shift of a base pair (x, y) in s to (y, x′) in t.

The shift moves 3–6 are depicted in Fig 8. Notice that in shifts of type 3, 4 the original position x is retained, while in shifts of type 5, 6 the original position y is retained.
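The six move types listed above can be enumerated explicitly for a given structure. The following Python sketch is an illustration only, with hypothetical names `is_valid` and `ms2_neighbors`; it generates candidate structures naively and filters them by validity (base-pairing rules, minimum loop size θ, no shared positions, no crossing pairs), which is simple but far slower than the recursions developed in this section.

```python
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

def is_valid(seq, pairs, theta=3):
    """Check pairing rules, minimum loop size, and absence of shared
    positions or crossing base pairs."""
    pairs = sorted(pairs)
    for (x, y) in pairs:
        if y - x <= theta or seq[x - 1] + seq[y - 1] not in PAIRS:
            return False
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (x, y), (u, v) = pairs[a], pairs[b]
            if len({x, y, u, v}) < 4 or x < u < y < v:
                return False
    return True

def ms2_neighbors(seq, s, theta=3):
    """Return (adds, removes, shifts): structures reachable from s in one move."""
    n, s = len(seq), frozenset(s)
    candidates = [(x, y) for x in range(1, n + 1) for y in range(x + 1, n + 1)]
    adds = {s | {p} for p in candidates
            if p not in s and is_valid(seq, s | {p}, theta)}       # move 1
    removes = {s - {p} for p in s}                                 # move 2
    shifts = set()
    for (x, y) in s:
        rest = s - {(x, y)}
        for z in range(1, n + 1):
            if z in (x, y):
                continue
            # keep x (moves 3, 4) or keep y (moves 5, 6); order the endpoints
            for kept in (x, y):
                q = (min(kept, z), max(kept, z))
                if is_valid(seq, rest | {q}, theta):
                    shifts.add(rest | {q})
    return adds, removes, shifts
```

For the toy sequence GGAAAACC and the structure {(1, 8)}, this yields one addition (the nested pair (2, 7)), one removal, and two shifts, to (1, 7) and to (2, 8), for a total MS2 degree of 4.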

In the base case, for all i ∈ [1, n], we have Q_{i,i − 1} = 0, Z_{i,i − 1} = 1, and for i ≤ j ≤ i + θ = i + 3, Q_i,j = 0, Z_i,j = 1. For the inductive case in which j − i > θ = 3, initialize Q_i,j = 0 and then add the contributions from the cases below. The recursions for Z_i,j are well-known [39] and are given later in Section “Remaining recursions for Q_i,j and Z_i,j”.

Case 1(a): In this case, we consider the contribution from structures s in which j is unpaired in the interval [i, j], and t is obtained from s by a move from MS2 involving x, y, x′, y′ ∈ [i, j − 1]. The contribution is (57) which accounts for the addition, removal or shift of a base pair in [i, j − 1]. Note that shifts of base pairs involving the last position j are not considered in Case 1(a); such shifts will be treated in cases 1(c), 1(d), 2(c), 2(d).

Case 1(b): In this case, we consider the contribution from structures s in which j is unpaired in [i, j], and t is obtained from s by adding the base pair (k, j) for some i ≤ k ≤ j − θ − 1 = j − 4. The contribution is (58) This term arises from those t obtained from s by adding a base pair (k, j) for some k ∈ [i, j − θ − 1].

The remaining cases 1(c), 1(d) treat shifts involving x, y, x′, y′ ∈ [i, j] in structures in which j is unpaired in [i, j], where the position j is touched; i.e. it is not the case that x, y, x′, y′ ∈ [i, j − 1] and so these shifts are not already counted in the term Q_{i,j − 1}.

Case 1(c): In this case, depicted in panel (a) of Fig 9, we consider the contribution from structures s in which j is unpaired in [i, j], and t is obtained from s by a shift of the base pair (x, y) to (x, j) for i ≤ x ≤ y − θ − 1 and y ≤ j − 1. The function EL_{i,j − 1,a_j} is the sum, taken over all structures in which j is unpaired, of the product of the Boltzmann factor BF(s) times the number of external base pairs (x, y) in s with y ≤ j − 1 such that the nucleotide a_x at position x can form a base pair with the nucleotide a_j at position j. For any such (x, y), it is possible to shift the base pair (x, y) to (x, j), and vice versa. Before proceeding, note that the current Case 1(c) handles shifts from (x, y) to (x, j), while Case 2(b) handles shifts from (x, j) to (x, y). The contribution in the current case is clearly (59)

Case 1(d): In this case, depicted in panel (b) of Fig 9, we consider the contribution from structures s in which j is unpaired in [i, j], and t is obtained from s by a shift of the base pair (x, y) to (y, j) for i ≤ x ≤ y − θ − 1 and y ≤ j − θ − 1. The function ER′ is the sum, taken over all structures in which j is unpaired, of the product of the Boltzmann factor BF(s) times the number of external base pairs (x, y) in s with y ≤ j − θ − 1 such that the nucleotide a_y at position y can form a base pair with the nucleotide a_j at position j. For any such external base pair (x, y), it is possible to shift (x, y) to (y, j), and vice versa. Before proceeding, note that the current Case 1(d) handles shifts from (x, y) to (y, j), while Case 2(d) handles shifts from (y, j) to (x, y). The contribution in the case at hand is clearly (60)

Case 2(a): In this case, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a move from MS2 involving x, y, x′, y′, such that x, y, x′, y′ ∈ [i, k − 1]. The contribution is (61)

Case 2(b): In this case, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a move from MS2 involving x, y, x′, y′, such that x, y, x′, y′ ∈ [k, j]. The contribution is (62)

The remaining cases 2(c), 2(d) treat shifts involving x, y, x′, y′ ∈ [i, j] in structures which contain the base pair (k, j) for some i ≤ k ≤ j − θ − 1, where it is neither the case that x, y, x′, y′ ∈ [i, k − 1] nor x, y, x′, y′ ∈ [k, j]; i.e. cross-talk shifts that touch both the left segment [i, k − 1] and the right segment [k, j].

Case 2(c): In this case, depicted in panel (c) of Fig 9, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a shift of the base pair (k, j) to (k′, j) for some k′ < k that is visible in structure s\{(k, j)}. Before proceeding, note that for k < k′, the shift of base pair (k, j) to (k′, j) is treated in Case 2(b).

Recall that the function F_{i,k − 1,a_j, x} is the sum of Boltzmann factors of all structures s₀ on [i, k − 1] that contain exactly x occurrences of a visible position that can form a base pair with the nucleotide a_j at position j. The contribution in this case is (63)

Case 2(d): In this case, depicted in panel (d) of Fig 9, we consider the contribution from structures s which contain the base pair (k, j), for some i ≤ k ≤ j − θ − 1, and t is obtained from s by a shift of the base pair (k, j) to (k′, k) for some i ≤ k′ ≤ k − θ − 1 which is visible in s. Recall that the function G_{i,k,a_k, x} is the sum of Boltzmann factors of all structures s₀ on [i, k], in which k is unpaired, for which there are exactly x occurrences of a visible position in [i, k − θ − 1] that can form a base pair with a_k. The contribution is (64)

Putting together all contributions from Case 1(a) through Case 2(d), we have (65)

Recursions for auxiliary functions.

We now provide the recursions for functions EL, ER, ER′, F and G.