Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

RNAmountAlign: Efficient software for local, global, semiglobal pairwise and multiple RNA sequence/structure alignment

  • Amir H. Bayegan,

    Roles Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Biology Department, Boston College, Chestnut Hill, MA, United States of America

  • Peter Clote

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    clote@bc.edu

    Affiliation Biology Department, Boston College, Chestnut Hill, MA, United States of America

RNAmountAlign: Efficient software for local, global, semiglobal pairwise and multiple RNA sequence/structure alignment

  • Amir H. Bayegan, 
  • Peter Clote
PLOS
x

Abstract

Alignment of structural RNAs is an important problem with a wide range of applications. Since function is often determined by molecular structure, RNA alignment programs should take into account both sequence and base-pairing information for structural homology identification. This paper describes C++ software, RNAmountAlign, for RNA sequence/structure alignment that runs in O(n3) time and O(n2) space for two sequences of length n; moreover, our software returns a p-value (transformable to expect value E) based on Karlin-Altschul statistics for local alignment, as well as parameter fitting for local and global alignment. Using incremental mountain height, a representation of structural information computable in cubic time, RNAmountAlign implements quadratic time pairwise local, global and global/semiglobal (query search) alignment using a weighted combination of sequence and structural similarity. RNAmountAlign is capable of performing progressive multiple alignment as well. Benchmarking of RNAmountAlign against LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and MUSCLE shows that RNAmountAlign has reasonably good accuracy and faster run time supporting all alignment types. Additionally, our extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence, outperforms RSEARCH and sequence-only query scans and runs faster than FOLDALIGN query scan.

Introduction

A number of different metrics exist for comparison of RNA secondary structures, including base pair distance (BP), string edit distance (SE) [1], mountain distance (MD) [2], tree edit distance (TE) [3], coarse tree edit distance (HTE) [4], morphological distance [5] and a few other metrics. In what appears to be the most comprehensive published comparison of various secondary structure metrics [6], it was shown that all of these distance measures are highly correlated with respect to Pearson correlation when computing distances between structures taken from the Boltzmann low-energy ensemble of secondary structures [7] for the same RNA sequence—so-called intra-ensemble correlation. In contrast, these distance measures have low Pearson correlation when computing distances between structures taken from Boltzmann ensembles of different RNA sequences of the same length—so-called inter-ensemble correlation. For instance, the intra-ensemble correlation between base pair distance (BP) and mountain distance (MD) is 0.822, while the corresponding inter-ensemble correlation drops to 0.210. Intra-ensemble correlation between string edit distance (SE) and the computationally more expensive tree edit distance (TE) is 0.975, while the corresponding intra-ensemble correlation drops to 0.590—see Table 1.

thumbnail
Table 1. Pearson correlation between various secondary structure metrics.

https://doi.org/10.1371/journal.pone.0227177.t001

Due to poor inter-ensemble correlation of RNA secondary structure metrics, and the fact that most secondary structure pairwise alignment algorithms depend essentially on some form of base pair distance, string edit distance, or free energy of common secondary structure, we have developed the first RNA sequence/structure pairwise alignment algorithm that is based on (incremental ensemble) mountain distance. Our software, RNAmountAlign, uses this distance measure, since the Boltzmann ensemble of all secondary structures of a given RNA of length n can represented as a length n vector of real numbers, thus allowing an adaptation of fast sequence alignment methods. Depending on the command-line flag given, our software, RNAmountAlign can perform pairwise alignment, (Needleman-Wunsch global [8], Smith-Waterman local [9] or semiglobal [10] alignment) as well as progressive multiple alignment (global and local), computed using a guide tree as in CLUSTAL [11]. Expect values E for local alignments are computed using Karlin-Altschul extreme-value statistics [12, 13], suitably modified to account for our new sequence/structure similarity measure. Additionally, RNAmountAlign can determine p-values (hence E-values) by parameter fitting for the normal (ND), extreme value (EVD) and gamma (GD) distributions.

We benchmark the performance of RNAmountAlign on pairwise and multiple global sequence/structure alignment of RNAs against the widely used programs LARA, FOLDALIGN, DYNALIGN, LocARNA, STRAL and MXSCARNA. LARA (Lagrangian relaxed structural alignment) [14] formulates the problem of RNA (multiple) sequence/structure alignment as a problem in integer linear programming (ILP), then computes optimal or near-optimal solutions to this problem. The software FOLDALIGN [1517], and DYNALIGN [18] are different O(n4) approximate implementations of Sankoff’s O(n6) optimal RNA sequence/structure alignment algorithm. FOLDALIGN sets limits on the maximum length of the alignment as well as the maximum distance between subsequences being aligned in order to reduce the time complexity of the Sankoff algorithm. DYNALIGN [18] implements pairwise RNA secondary structural alignment by determining the common structure to both sequences that has lowest free energy, using a positive (destabilizing) energy heuristic for gaps introduced, in addition to setting bounds on the distance between subsequences being aligned. In particular, the only contribution from nucleotide information in Dynalign is from the nucleotide-dependent free energy parameters for base stacking, dangles, etc. LocARNA (local alignment of RNA) [19, 20] is a heuristic implementation of PMcomp [21] which compares the base pairing probability matrices computed by McCaskill’s algorithm. Although the software is not maintained, STRAL [22] which is similar to our approach, uses up- and downstream base pairing probabilities as the structural information and combines them with sequence similarity in a weighted fashion. MXSCARNA [23] is a progressive multiple alignment algorithm based on the pairwise alignment algorithm of SCARNA [24]. In contrast to Sankoff-type methods, SCARNA is a heuristic algorithm that performs alignment based on the detection of fixed-length stem candidates that belong to the consensus secondary structure of given sequences.

LARA, mLocARNA (extension of LocARNA), FOLDALIGNM [16, 25] (extension of FOLDALIGN), Multilign [26, 27] (extension of DYNALIGN), STRAL and MXSCARNA support multiple alignment. LARA computes all pairwise sequence alignments and subsequently uses the T-Coffee package [28] to construct multiple alignments. Both FOLDALIGNM and mLocARNA implement progressive alignment of consensus base pairing probability matrices using a guide tree similar to the approach of PMmulti [21]. For a set of given sequences, Multilign uses DYNALIGN to compute the pairwise alignment of a single fixed index sequence to each other sequence in the set, and computes a consensus structure. In each pairwise alignment, only the index sequence base pairs found in previous computations are used. More iterations in the same manner with the same index sequence are then used to improve the structure prediction of other sequences. The number of pairwise alignments in Multilign is linear with respect to the number of sequences. STRAL and MXSCARNA perform multiple alignment in a fashion similar to CLASTALW [29]. Table 2 provides an overview of various features, to the best of our knowledge, supported by the software benchmarked in this paper.

In this paper we provide a very fast, comprehensive software package capable of pairwise/multiple local/global/semiglobal alignment with p-values and E-values for statistical significance. Moreover, due to its speed and relatively good accuracy, the software can be used for whole-genome scans for homologues of an RNA as query, a similar modality as in the software RSEARCH [30]. This type of whole-genome scan is in contrast to Infernal [31], which requires a multiple alignment of distinct RNAs to construct a covariance model for whole-genome searches. Searching for homologues from only a single sequence is potentially useful in the case of orphan RNAs, very recently discovered RNAs that do not belong to any known RNA family.

Materials and methods

Quadratic time alignment using affine gap cost is implemented in RNAmountAlign using the Gotoh method [32] with the following pseudocode, shown for the case of semiglobal alignment. In our query scan form of semiglobal alignment, if a = a1, …, an is the query sequence and b = b1, …, bm is the current genomic window being searched for an occurrence of a, then there is no penalty for gaps occurring to the left of the nucleotide of b aligned with a1 nor for gaps occurring to the right of the nucleotide of b aligned with am, although internal gaps are penalized. This is obtained by the following pseudocode, as also found in [33]. Let g(k) denote an affine cost for size k gap, defined by g(0) = 0 and g(k) = gi + (k − 1) ⋅ ge for positive gap initiation [resp. extension] costs gi [resp. ge]. For query a = a1, …, an and target b = b1, …, bm, define (n + 1) × (m + 1) matrices M, P, Q as follows: Mi,0 = g(i) for all 1 ≤ in, M0,j = 0 for all 1 ≤ jm, while for positive i, j we have Mi,j = max (Mi−1,j−1 + sim (ai, bj), Pi,j, Qi,j), where sim is a similarity measure formally defined later in Eq (17). For 1 ≤ in, 1 ≤ jm, let P0,j = 0 and Pi,j = max(Mi−1,j + gi, Pi−1,j + ge), and define Qi,0 = 0 and Qi,j = max(Mi,j−1 + gi, Qi,j−1 + ge, 0). Determine the maximum semiglobal alignment score in row n, then perform backtracking to obtain an optimal semiglobal alignment.

Incremental ensemble mountain height

A secondary structure for a given RNA nucleotide sequence a = a1, …, an is a set s of base pairs (i, j), where 1 ≤ i < jn, such that:

  1. if (i, j) ∈ s then ai, aj form either a Watson-Crick (AU,UA,CG,GC) or wobble (GU,UG) base pair,
  2. if (i, j) ∈ s then ji > θ = 3 (a steric constraint requiring that there be at least θ = 3 unpaired bases between any two positions that are paired),
  3. if (i, j) ∈ s then for all i′ ≠ i and j′ ≠ j, (i′, j) ∉ s and (i, j′) ∉ s (nonexistence of base triples),
  4. if (i, j) ∈ s and (k, ) ∈ s, then it is not the case that i < k < j < (nonexistence of pseudoknots).

Secondary structures can be depicted in several equivalent manners, but in this paper, we use the dot-bracket notation in which matching left and right parenthesis positions indicate base-paired nucleotides. For instance, the EMBL accession code, sequence and secondary structure of a 53 nt type III hammerhead ribozyme from peach latent mosaic viroid (PLMVd), obtained from the Rfam database [34], is given as follows

> AJ550901.1/282-334

12345678901234567890123456789012345678901234567890123

GAAGAGUCGCGCUAAGCGCACUGAUGAGUCUUUGAGAUAAGACGAAACUCUUC

.((((((.((((…))))……‥((((……‥))))…)))))).

Positions 1 and 53 (for instance) are unpaired, as indicated by a dot, while positions 2 and 52 are paired and form the outermost base pair (2, 52), positions 12, 16 are paired and base pair (12, 16) constitutes one of the two apical (hairpin) loops, while the other apical (hairpin) loop is closed by the base pair (31, 40), etc.

Given an RNA sequence a = a1, …, an, the base pairing probabilities are defined by (1) where Z is the partition function, denotes the set of secondary structures of a, E(s) is the free energy of secondary structure s using energy parameters from [35]. Given a = a1, …, an of length n, the mountain height [36] hs(k) of a secondary structure s of a at position k is defined as the number of base pairs in s that lie between an external loop and k, formally given by (2) We follow [2, 36] in our definition of mountain height, and related notions of ensemble mountain height and distance, while [37] and Vienna RNA package [4] differ in an inessential manner by defining hs(k) = |{(i, j) ∈ s: i < k}| − |{(i, j) ∈ s: jk}|. If we consider the secondary structure s, defined as a set of base pairs (i, j) where 1 ≤ i < jn, as a dot-bracket notation, then hs(k) is simply the running count, where in scanning from left-to-right we add 1 to the count for each open parenthesis (, and subtract 1 from the count for each closed parenthesis ), encountered between 1, …, k.

The ensemble mountain heighth(k)〉 [37] for RNA sequence a = a1, …, an at position k is defined as the average mountain height, where the average is taken over the Boltzmann ensemble of all low-energy structures s of sequence a. If base pairing probabilities pi,j have been computed, then it follows that (3) and hence the incremental ensemble mountain height, which for values 1 < kn is defined by ma(k) = 〈h(k)〉 − 〈h(k − 1)〉 can be computed by (4) It is clear that −1 ≤ ma(k) ≤ 1, and that both ensemble mountain height and incremental ensemble mountain height can be computed in time that is quadratic in sequence length n, provided that base pairing probabilities pi,j have been computed. Except for the cubic time taken by a C-library function call of export_bppm() from from Vienna RNA package [4], the software RNAmountAlign has quadratic time and space requirements. The following pairwise alignment of 72 nt tRNA AL671879.2 and 69 nt tRNA D16387.1, both taken from the BRAliBase 2.1 K2 database [38], was generated by the RNAmountAlign web server, with default parameters of gap initiation −3, gap extension −1 and structural weight γ = 1/2, which latter equally weights the contributions from sequence and structural similarity. The alignment produced by RNAmountAlign is identical to the reference alignment from BRAliBase 2.1 K2, although sequence identity is only 28% (twilight zone). The consensus structure, shown in S1 Fig, is produced by a call of RNAalifold from Vienna RNA Package, given the alignment produced by RNAmountAlign; Fig 1 is an alternative display of this alignment as a superimposition of the (gapped) ensemble mountain heights of tRNA AL671879.2 and tRNA D16387.1.

thumbnail
Fig 1. Ensemble mountain height display of the alignment computed by RNAmountAlign for 72 nt tRNA AL671879.2 and 69 nt tRNA D16387.1, both taken from BRAliBase 2.1 K2, using default gap parameters of −3 for gap initiation and −1 for gap extension, and structural weight γ = 1/2.

The alignment produced by RNAmountAlign is identical to the reference alignment from BRAliBase 2.1 K2, although sequence identity is only 28% (twilight zone). The gap in the D16387.1 curve above corresponds to the size 3 gap in the alignment described in the text.

https://doi.org/10.1371/journal.pone.0227177.g001

> AL671879.2

GGGGAUGUAGCUCAGUGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCCCCGGGUUCGAUCCCCGGCAUCUCCA

> D16387.1

GUUUCAUGAGUAUAGC---AGUACAUUCGGCUUCCAACCGAAAGGUUUUUGUAAACAACCAAAAAUGAAAUA

> RNAalifold consensus structure

(((((((‥((((…….)))).(((((…….)))))….((((((……))))))))))))).

Transforming distance into similarity

In [39], Seller’s (distance-based) global pairwise alignment algorithm [40] was rigorously shown to be equivalent to Needleman and Wunsch’s (similarity-based) global pairwise alignment algorithm [8]. Recalling that Seller’s alignment distance is defined as the minimum, taken over all alignments of the sum of distances d(x, y) between aligned nucleotides x, y plus the sum of (positive) weights w(k) for size k gaps, while Needleman-Wunsch alignment similarity is defined as the maximum, taken over all alignments of the sum of similarities s(x, y) between aligned nucleotides x, y plus the sum of (negative) gap weights g(k) for size k gaps, Smith and Waterman [39] show that by defining (5) (6) and by taking the minimum distance, rather than maximum similarity, the Needleman-Wunsch algorithm is transformed into Seller’s algorithm. Though formulated here for RNA nucleotides, equivalence holds over arbitrary alphabets and similarity measures (e.g. BLOSUM62).

To improve the intuitive understanding of our structural distance measure STRSIM, we initially define a simple distance measure d0 between dot-bracket symbols. For dot-bracket symbol x ∈ { (, •, ) }, define the sign function by (7) Now define the distance d0(x, y) = |sign(x) − sign(y)| between dot-bracket symbols x, y ∈ { (, •, ) } by (8)

Let denote a fixed alignment between two arbitrary secondary structures s, t of (possibly different) lengths n, m, where , the dash − denotes the gap symbol, and where a gap is never aligned above another gap—this follows the notational convention for representation of alignments in [41]. We define the structural alignment distance for A by summing over those positions i where neither character is a gap symbol, then adding w(k) for all size k gaps in A.

Using incremental ensemble mountain height from Eq (4), we can generalize structural alignment distance from the simple case of comparing two dot-bracket representations of secondary structures to the more representative case of comparing the low-energy Boltzmann ensemble of secondary structures for RNA sequence a to that of RNA sequence b. Given RNA sequences a = a1, …, an and b = b1, …, bm, and given a fixed sequence alignment let denote the corresponding alignment between the incremental ensemble mountain heights ma(1) ⋯ ma(n) of a and and the incremental ensemble mountain heights mb(1) ⋯ mb(m) of b. Generalize structural distance d0 defined in Eq (8) to d1 defined by d1(ai, bj) = |ma(i) − mb(j)|, where ma(i) and mb(j) are real numbers in the interval [−1, 1], defined by Eq 4. Define ensemble structural alignment distance for A by summing d1(ai, bj) over all aligned positions i, j for which neither character is a gap symbol, then adding positive weight w(k) for all size k gaps. By Eqs (5) and (6), it follows that an equivalent ensemble structural similarity measure between two positions ai, bj, denoted STRSIM(ai, bj), is obtained by multiplying d1 and w(k) by −1: (9) This equation will be used later, since our algorithm RNAmountAlign combines both sequence and ensemble structural similarity. Indeed, −|ma(i) − mb(j)| ∈ [−2, 0] with maximum value of 0 while RIBOSUM85-60, shown in Table 3, has similarity values in the interval [−1.86, 2.22]. In order to combine sequence with structural similarity, both ranges should be rendered comparable as shown in the next section.

Pairwise alignment

In order to combine sequence and ensemble structural similarity, we determine a multiplicative scaling factor αseq and an additive shift factor αstr such that the mean and standard deviation for the distribution of sequence similarity values from a RIBOSUM matrix [30] (after being multiplied by αseq) are equal to the mean and standard deviation for the distribution of structural similarity values from STRSIM (after additive shift of αstr). The RIBOSUM85-60 nucleotide similarity matrix used in this paper is given in Table 3, and the distributions for RIBOSUM and STRSIM values are shown in Fig 2 for the 72 nt transfer RNA AL671879.2. Given query a = a1, …, an [resp. target b = b1, …, bn], let pA, pC, pG, pU [resp. ] denote the nucleotide relative frequencies for a [resp. b], i.e. the proportion of occurrences of each nucleotide A,C,G,U in query a [resp. target b]. Define the mean μseq and standard deviation σseq of RIBOSUM nucleotide similarities by (10) (11)

thumbnail
Fig 2. For 72 nt tRNA query sequence AL671879.2, there are 12 A’s, 20 C’s, 24 G’s, and 16 U’s so nucleotide relative frequencies are approximately pA ≈ 0.167, pC ≈ 0.278, pG ≈ 0.333, pU ≈ 0.222, and for 69 nt tRNA target sequence D16498.1, there are 26 A’s, 12 C’s, 12 G’s, and 19 U’s so nucleotide relative frequencies are approximately pA ≈ 0.377, pC ≈ 0.174, pG ≈ 0.174, pU ≈ 0.275.

From the base pairing probabilities computed by RNAfold -p, we have query frequencies p( = 0.3035, p = 0.3930, p) = 0.3035 and target frequencies p( = 0.2835, p = 0.433, p) = 0.2835, so by Eqs (10), (11), (13) and (14), we have μseq = −0.9098, σseq = 1.5871 and μstr = −0.8298, σstr = 0.6967. By Eqs (15) and (16), we determine that RIBOSUM scaling factor αseq = 0.4390 and αstr = 0.4305 (all values shown rounded to 4-decimal places). Using these values, the scaled RIBOSUM mean is now −0.39936, now equal to the shifted STRSIM mean, and both the scaled RIBOSUM standard deviation and shifted STRSIM standard deviation equal 0.6967. Panels (a) resp. (b) show the distribution of RIBOSUM resp. STRSIM values for the nucleotide and base pairing probabilities determined from query and target, while panel (c) shows the distribution of αseq-scaled RIBOSUM and αstr-shifted STRSIM values. It follows that the distributions in panel (c) have the same (negative) mean and same standard deviation.

https://doi.org/10.1371/journal.pone.0227177.g002

Compute the base pairing probabilities of query sequence a for 1 ≤ ijn, and of target sequence b for 1 ≤ ijm by a call to the matrix bppm, using the Vienna RNA Package C-library.

Define the expected left and right base pairing probabilities, and the expected unpaired probability of query sequence a by the following (12) Analogously define the corresponding probabilities , , for the target sequence b.

Setting s0(x, y) = −d0(x, y), where d0(x, y) is defined in Eq (8), for given query [resp. target] base pairing probabilities p(, p, p) [resp. ] of dot-bracket characters, it follows that the mean μstr and standard deviation σstr of structural similarities can be computed by (13) (14) Now we compute a multiplicative factor αseq and an additive shift term αstr, both dependent on frequencies pA, pC, pG, pU and p(, p, p), such that the mean [resp. standard deviation] of nucleotide similarity multiplied by αseq is equal to the mean [resp. standard deviation] of structural similarity after addition of shift term αstr: (15) (16)

Given the query RNA a = a1, …, an and target RNA b = b1, …, bm with incremental ensemble mountain heights ma(1) ⋯ ma(m) of a, mb(1) ⋯ mb(m) of b, and user-defined weight 0 ≤ γ ≤ 1, our final similarity measure is defined by (17) where αseq, αstr are computed by Eqs (15) and (16) depending on probabilities pA, pC, pG, pU [resp. ] and p(, p, p) [resp. ] of the query [resp. target]. All benchmarking computations were carried out using γ = 1/2, although it is possible to use position-specific weight γi,j defined as the average probability that i is paired in a and j is paired in b.

Our structural similarity measure is closely related to that of STRAL, which we discovered only after completing a preliminary version of this paper. Let and be the probability that position i of sequence a is paired to a position on the left or right, respectively. The similarity measure used in STRAL is defined by (18)

From Eqs (17) and (4) our measure can be defined as (19) Though clearly very related, RNAmountAlign was developed independently, without knowledge of STRAL, and offers a number of functionalities unavailable in STRAL, which latter appears to be no longer maintained. For instance, RNAmountAlign supports local and semiglobal alignment, and reports p-values and E-values; these features are not available in STRAL. Since we were unable to compile STRAL, our benchmarking results for STRAL use an adaptation of our code to support Eq (18). There are nevertheless some differences in how progressive alignment is implemented in STRAL that could affect run time.

To illustrate the method, suppose that the query [resp. target] sequence is the 72 nt tRNA AL671879.2 [resp. 69 nt tRNA D16498.1] with sequence GGGGAUGUAG CUCAGUGGUA GAGCGCAUGC UUCGCAUGUA UGAGGCCCCG GGUUCGAUCC CCGGCAUCUC CA. Then nucleotide query [resp. target] probabilities are (approximately) pA ≈ 0.167, pC ≈ 0.278, pG ≈ 0.333, pU ≈ 0.222. For the 69 nt tRNA target sequence D16498.1 with sequence GUUUCAUGAG UAUAGCAGUA CAUUCGGCUU CCAACCGAAA GGUUUUUGUA AACAACCAAA AAUGAAAUA the nucleotide relative frequencies are approximately , , , . From the base pairing probabilities returned by RNAfold -p [4], we determine that p( = 0.3035, p = 0.3930, p) = 0.3035 [resp. , , ]. Using these probabilities in Eqs (10)–(14), we determine that μseq = −0.9098, σseq = 1.5871 and μstr = −0.8298, σstr = 0.6967. By Eqs (15) and (16), we determine that RIBOSUM scaling factor αseq = 0.4390 and αstr = 0.4305 (all values shown rounded to 4-decimal places). Using these values, the scaled RIBOSUM mean is now −0.39936, now equal to the shifted STRSIM mean, and both the scaled RIBOSUM standard deviation and shifted STRSIM standard deviation equal 0.6967. Since the mean and standard deviation of αseq-scaled RIBOSUM values are identical with that of αstr-shifted STRSIM values, hence can be combined in Eq (17). Although sequence identity of the reference alignment of these tRNAs from BRAliBase 2.1 K2 is only 28%, the global alignment produced by RNAmountAlign is identical to that the reference alignment, using default parameters of gap initiation −3, gap extension −1, and structural weight γ = 1/2 in Eq (17).

Fig 2 depicts the distribution of RIBOSUM85-60 [resp. STRSIM] values in this case, both before and after application of scaling factor αseq [resp. shift αstr]—recall that αseq and αstr] depend on pA, pC, pG, pU, p(, p, p) of tRNA AL671879.2 and of tRNA D16498.1.

Statistics for pairwise alignment

Karlin-Altschul statistics for local pairwise alignment.

For a finite alphabet A and similarity measure s, suppose that the expected similarity is negative and that s(x, y) is positive for at least one choice of x, y. In the case of BLAST, amino acid and nucleotide similarity scores are integers, for which the Karlin-Altschul algorithm was developed [12]. In contrast, RNAmountAlign similarity scores are not integers (or more generally values in a lattice), because Eq (17) combines real-valued αseq-scaled RIBOSUM nucleotide similarities with real-valued αstr-shifted STRSIM structural similarities, which depend on query [resp. target] probabilities pA, pC, pG, pU, p(, p, p) [resp. ]. For that reason, we use Theorem 1 of Karlin, Dembo and Kawabata [13], reformulated using the notation of this paper, where the similarity score s(x, y) for RNA nucleotides x, y is defined by Eq (17).

Theorem 1 (Karlin, Dembo, Kawabata [13])

Given similarity measure s between nucleotides in alphabet A = {A, C, G, U}, let λ* be the unique positive root of , and let random variable Sk denote the score of a length k gapless alignment. For large local alignment score z, (20) where M denotes maximal segment scores for local alignment of random RNA sequences a1, …, an and b1, …, bm, and where (21)

Fitting data to probability distributions.

Data were fit to the normal distribution (ND) by the method of moments (i.e. mean and standard deviation were taken from data analysis). Data were fit to the extreme value distribution (EVD) (22) by an in-house implementation of maximum likelihood to determine λ, K, as described in supplementary information to [30]. Data were fit to the gamma distribution by using the function fitdistr(x, ‘gamma’) from the package MASS in the R programming language, which determines rate and shape parameters for the density function (23) with where α is the shape parameter, the rate is 1/λ, where λ is known as the scale parameter.

Multiple alignment

Suppose pA, pC, pG, pU are the nucleotide probabilities obtained after the concatenation of all sequences. Let p(, p, p) be computed by individually folding each sequence and taking the arithmetic average of probabilities of (, • and ) over all sequences. The mean and standard deviation of sequence and structure similarity are computed similar to Eqs (10)–(14). (24) (25) (26) (27) Sequence multiplicative scaling factor αseq and the structure additive shift factor αstr are computed from these values using Eqs (15) and (16).

As in CLUSTAL [42] and the CLUSTAL Omega [43], our software RNAmountAlign implements progressive multiple alignment using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [44] and p. 360 of [41]. In UPGMA, one first defines a similarity matrix S, where S[i, j] is equal to (maximum) pairwise sequence similarity of sequences i and j. A rooted tree is then constructed by progressively creating a parent node of the two closest siblings. Parent nodes are profiles (PSSMs) that represent alignments of two or more sequences, hence can be treated as pseudo-sequences in a straightforward adaptation of pairwise alignment to the alignment of profiles. Let’s consider an alignment of N sequences composed of M columns. Let denote column i of the alignment (for 1 ≤ iM). Suppose p(i, x), for x ∈ {A, C, G, U, −}, indicates the probability of occurrence of a nucleotide or gap at column i of alignment A. Then sequence similarity SEQSIM between two columns is defined by (28) where (29)

The structural measure for a profile is computed from the incremental ensemble heights averaged over each column. Let mA(i) denote the arithmetic average of incremental ensemble mountain height at column Ai (30) where is the incremental ensemble mountain height at position i of sequence obtained from Eq (4). Here, let if is a gap. Structural similarity between two columns is defined by (31) Finally, the combined sequence/structure similarity is computed from (32)

Benchmarking

Accuracy measures

Sensitivity, positive predictive value, and F1 score for pairwise alignments were computed as follows. Let denote an alignment, where ai, bi ∈ {A, C, G, U, −}, and the aligned sequences include may contain occurrences of the gap symbol ‘−’, provided that not both and are gap symbols. The number TP of true positives [resp. FP of false positives] is the number of alignment pairs in the predicted alignment that belong to [resp. do not belong to] the reference alignment. The sensitivity (Sen) [resp. positive predictive value (PPV)] of a predicted alignment is TP divided by reference alignment length [resp. TP divided by predicted alignment length]. The F1 score is the harmonic mean of sensitivity and PPV, so . For the computation of Sen, PPV, and F1, we consider not only nucleotide pairs (ai, bj), but also include pairs of the form (ai, —) [resp. (—, bj)], which represent an insertion in the top sequence of alignment (equivalently a deletion from the bottom sequence of ) [resp. deletion from the top sequence of (equivalently an insertion in the bottom sequence of )]. Suppose that the (toy) alignment is produced by an algorithm (such as RNAmountAlign, FOLDALIGN, MXSCARNA, etc.). Then a = a1a2a3a4a5, a = ACGUA, b = b1b2b3b4, b = ACUA. The length of alignment is N = 6, while and . If is the (toy) alignment of the same sequences taken from a reference database (such as BRAliBase 2.1), then sequences a, b are as above, the length of alignment is M = 5, while and . The number TP of correctly aligned pairs is 4, since Because TP = 4, rather than 3 (if we had instead counted only aligned nucleotide pairs), the computations of F1 score, Sen, and PPV are duly affected.

To compare predicted structures with consensus Rfam structures, we computed the Matthews correlation coefficient (MCC) [45] as follows: (33) where TPstr is the number of correctly predicted base pairs, FPstr is the number of incorrectly predicted base pairs, TNstr is the number of potential base pairs absent in both predicted and reference structures and FNstr is the number of base pairs in the reference structure that were not predicted.

In the case of local alignment, since the size of the reference alignment is unknown, only the predicted alignment length and PPV are reported. To compute the accuracy of multiple alignment, we used sum-of-pair-scores (SPS) [11], defined as the number of correctly nucleotide pairs in the alignment produced by a given algorithm, divided by the total number of aligned nucleotide pairs in the reference alignment. For completeness, and to contrast this with our definition of sensitivity, PPV and F1 score, we formally define SPS as follows. Suppose that A denotes a multiple alignment of the form . For 1 ≤ i, jM, 1 ≤ kN define pi,j,k = 1 if the nucleotide is aligned with the nucleotide in both the reference and predicted alignments, and pi,j,k = 0 otherwise. Sum-of-pairs score SPS is then the sum, taken over all i, j, k, of the pi,j,k. Though SPS can be considered as the average sensitivity, taken over all sequence pairs in the alignment, this is not technically correct in our case, since our definition of sensitivity also counts pairs of the form (X, —) and (—, X), where X ∈ {A, C, G, U}, from the reference alignment.

To measure the conservation of secondary structures in alignments, structural conservation index (SCI) was computed using RNAalifold [46]. RNAalifold computes SCI as the ratio of the free energy of the alignment, computed by RNAalifold, with the average minimum free energy of individual structures in the alignment. SCI values close to 1 [resp. 0] indicate high [resp. low] structural conservation. All computations made with Vienna RNA Package used version 2.1.7 [4] using default Turner 2004 energy parameters [35]).

Dataset for global and local alignment comparison

For pairwise global alignment benchmarking in Table 4, Figs 3 and 4, and S2S4 Figs all 8, 976 pairwise alignments in k2 from BRAliBase 2.1 database [38] were used. Note that BRAliBase 2.1 does not include consensus secondary structures for the reference alignments, which are required in the computation of MCC. However, since BRAliBase 2.1 alignments are obtained from Rfam 7.0, we searched for the exact occurrences BRAliBase 2.1 sequences in Rfam 7.0 and computed their Rfam consensus structure. Some BRAliBase alignments are manually curated and thus different from Rfam alignments for which consensus structure could not be obtained. This produced 7, 154 pairwise BRAliBase alignments with their consensus structures. For the computation of MCC in S5 Fig, this subset of reference alignments were used.

thumbnail
Fig 3. (Left) F1 score and (Right) structural conservation index (SCI) for pairwise global alignments using RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and sequence-only(γ = 0).

F1 score and SCI are shown as a function of reference alignment sequence identity for pairwise alignments in the BRAliBase 2.1 database used for benchmarking. Moving averages taken for centered, symmetric windows of size 11.

https://doi.org/10.1371/journal.pone.0227177.g003

thumbnail
Fig 4. Run time of pairwise global alignment for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, and MXSCARNA.

(Left) The log base 10 run time is shown as a function of reference alignment length for pairwise alignments in the BRAliBase 2.1 K2 database used for benchmarking. Moving averages taken for centered, symmetric windows of size 51. (Right) Actual run time for RNAmountAlign, LARA and MXSCARNA on the same data. Unlike the left panel the actual run time is shown, rather than log run time, without any moving average taken.

https://doi.org/10.1371/journal.pone.0227177.g004

For each pairwise alignment, a consensus structure was determined from the Rfam family consensus structure by removing gaps, removing base pairs (i, j) which are noncanonical or for which jiθ = 3. These derived consensus structures were used as the reference structures in the computation of MCC. To be explicit, consider the following toy example of multiple alignment in Stockholm format—for illustrative purposes, pretend that this is an Rfam seed alignment of a small RNA family (recall that additional requirements must be met concerning number of sequences and average nucleotide length for an Rfam family to be selected).

# STOCKHOLM 1.0

positions   123456789012345678

Eg1      GGACAAGCAAUGCUUGCC

Eg2      GG-CAAGCA-UGCUUGAC

Eg3      GGACAAGCAAUGCUUGCC

#=GC SS_cons ≪-⋘≪___⋙⋙>

#=GC RF    GGACAAGCAAUGCUUGCC

This multiple alignment would then produce the following consensus structures for the pairwise alignment of Eg1 with Eg2.

> Eg1

123456789012345678

GGACAAGCAAUGCUUGCC

((.(((((…)))))))

> Eg2

1234567890123456

GGCAAGCAUGCUUGAC

(.((((….)))).)

Note that the Rfam consensus indicates that positions 2 and 17 are paired in the multiple sequence alignment, hence the reference structure for Eg1 has the GC base pair (2, 17) as expected. However, the the corresponding positions in Eg2 are G,A. To avoid violating the requirement (1) in the definition of secondary structure, the reference structure for Eg2 has dots at positions 2 and 15. Note that the Rfam consensus indicates a hairpin loop at positions 8 and 12 of the multiple sequence alignment, so the consensus structure for Eg1 contains the base pair (8, 12). However, since Eg2 contains A-U, this hairpin would contain only two nucleotides A,U, which would violate condition (2) of the definition of secondary structure. For that reason, there is no base pair (7, 10) in the reference structure of Eg2.

For multiple global alignment benchmarking, we used k5 BRAliBase 2.1, which includes 2,405 reference alignments, each composed of 5 sequences. For pairwise local alignment benchmarking, 75 pairwise alignments having sequence identity ≤ 70% were randomly selected from each of 20 well-known families from the Rfam 12.0 database [34], many of which were considered in a previous study [47], yielding a total of 1500 alignments. Following [48], these alignments were trimmed on the left and right, so that both first and last aligned pairs of the alignment do not contain a gap symbol. For sequences a = a1, …, an [resp. b = b1, …, bm] from each alignment, random sequences a′ [resp. b′] were generated with the same nucleotide frequencies, then a random position was chosen in a′ [resp. b′] in which to insert a [resp. b], thus resulting in a pair of sequences of lengths 4n and 4m. Finally, since sequence identity was at most 70%, the RIBOSUM70-25 similarity matrix was used in RNAmountAlign. Preparation of the benchmarking dataset for local alignment was analogous to the method used in multiple local alignment of [48].

Dataset for correlation of p-values for different distribution fits

A pool of 2220 sequences from the Rfam 12.0 database [34] was created as follows. One sequence was selected from each Rfam family having average sequence length at most 200 nt, with the property that the base pair distance between its minimum free energy (MFE) structure and the Rfam consensus structure was a minimum. For example, the 102 nt sequence with EMBL accession code AAOX01000028.142464-42565 was selected from family RF00167 since its MFE structure was identical to its Rfam consensus structure, while the 178 nt sequence with EMBL accession code AE016827.11325416-1325239 was selected from RF00168 since the base pair distance between its MFE structure and its Rfam consensus structure was 11—the base pair distance between MFE and consensus structure for all other sequences from RF00168 was greater than or equal to 11. Subsequently, for each of 500 randomly selected query sequences from the pool of 2220 sequences, 1000 random target sequences of length 400 nt were generated to have the same expected nucleotide frequency as that of the query. For each query and random target, five semiglobal alignments were created using gap initiation costs of gi ∈ {−1, −2, −3, −4, −5} with gap extension cost ge equal to one-third the gap initiation cost. For each alignment score x for query and random target, the p-value was computed as 1 − CDF(x) for ND, EVD and GD, where CDF(x) is the cumulative density function evaluated at x. Additionally, a heuristic p-value was determined by calculating the proportion of alignment scores for given query that exceed x.

Software version and hardware specs

For benchmarking, we used LocARNA (version 1.8.7), FOLDALIGN (version 2.5), FOLDALIGNM (version 1.0.1), LARA (version 1.3.2), DYNALIGN (from version 5.7 of RNAstructure), the sequence alignment algorithm MUSCLE (version 3.8.31), STRAL (in-house implementation due to unavailability of the software [22], and MXSCARNA (version 2.1). For genome query scan benchmarking, we used RSEARCH 1.1, FOLDALIGN (version 2.5), and RNAmountAlignScan. Recommended default parameters were used for each software, including RNAmountAlign from this paper. The commands used to run the software are given in S1, S2 and S3 Tables. Each software package was run on a cluster of identically configured Intel Xeon 2.66 GHz 4-core nodes with 16GB of memory, running CentOS Linux release 6.10.

In contrast to all other benchmarking work described in this paper, benchmarking tests for RNAmountAlignScan, RSEARCH, and FOLDALIGN in genome scanning mode, as described in Section Query Scan, were conducted on a 24-processor Intel Xeon CPU E5-2440 2.40 GHz system with 198GB memory.

Results

We benchmarked RNAmountAlign’s performance for pairwise and multiple alignments on BRAliBase K2 and K5 datasets, respectively.

Pairwise alignment

Fig 3, S2 and S3 Figs depict moving averages of pairwise global alignment sensitivity, positive predictive value (PPV) and F1-score for the software described in this paper, as well as for LocARNA, FOLDALIGN, LARA, DYNALIGN, STRAL, and MXSCARNA. For pairwise benchmarking, reference alignments of size 2, a.k.a. K2, were taken from the BRAliBase 2.1 database [38]. BRAliBase 2.1 K2 data are based on seed alignments of the Rfam 7.0 database, and consist of 8,976 alignments of RNA sequences from 36 Rfam families.

Moving averages (window size 11) of sensitivity, positive predictive value, and F1 score were computed as a function of sequence identity, where it should be noted that the number of pairwise alignments for different values of sequence identity can vary for the BRAliBase 2.1 data (e.g. there are only 35 pairwise alignments having sequence identity < 20%). In computing moving averages, each value represents the average over a symmetric window (k − 5, k + 5) of size 11 nt centered at the value from the x-axis. For 504 pairwise global alignments, FOLDALIGN produced “No global alignment was found. This can either be due to the pruning, or because no structural alignment exists” and for 29, DYNALIGN yielded no alignments. The only BRAliBase alignment that both tools failed was Hammerhead_1.apsi36.sci72.no1. In our benchmarking the sensitivity, PPV and F1 values for such failures of FOLDALIGN and DYNALIGN were all assigned the value of 0, which explains the weaker performance of these two methods in Fig 3 compared to [14], where such failures are simply excluded from the analysis. Moreover, we observed a larger number of failed alignments in FOLDALIGN 2.5 than in the previous work of [14]. Default parameters were used for all software. For our software RNAmountAlign, gap initiation cost was -3, gap extension -1, and sequence/structure weighting parameter γ was 0.5 (value obtained by optimizing on a small set of 300 random alignments from Rfam 12.0, not considered in training or testing set). The sequence-only alignment is computed from RNAmountAlign with the same gap penalties, but for γ = 0. While its accuracy is high, RNAmountAlign is faster by an order of magnitude than LocARNA, LARA, FOLDALIGN, and DYNALIGN—indeed, algorithmic time complexity of our method is O(n3) for two sequences of length n. Since STRAL could not be compiled on any of our systems, we implemented its algorithm by modifying RNAmountAlign and obtained results for STRAL’s default parameter settings. Therefore, the run time of STRAL is identical to RNAmountAlign but RNAmountAlign achieves higher F1 score, sensitivity and PPV. MXSCARNA and RNAmountAlign have similar average F1 scores. Indeed, a paired 2-tailed Wilcoxson signed-rank test of the difference between F1 scores, as computed by MXSCARNA and RNAmountAlign, for the 8,976 pairwise global alignments mentioned in Table 4. It follows that the (null) hypothesis H0 cannot be rejected, where H0 asserts that the difference is 0 between F1 scores of MXSCARNA and RNAmountAlign—see S4 Table. Note however that the software MXSCARNA has slower run time than RNAmountAlign. Both MXSCARNA and RNAmountAlign support global and local alignment; however, unlike MXSCARNA, RNAmountAlign also supports semiglobal alignment and reports p-values. The right panel of Fig 4 depicts actual run times of the fastest software, RNAmountAlign, with the next fastest software, MXSCARNA and LARA. Unlike the graph in the left panel, actual run times are shown, graphed as a function of sequence length, rather than logarithms of moving averages.

In addition, Table 4 displays average pairwise global alignment F1 scores for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, and MXSCARNA when benchmarked on 36 families from the BRaliBase K2 database comprising altogether 8,976 RNA sequences with average length of 249.33. Averaging over all sequences, the F1 scores for the programs just mentioned were respectively 0.8370, 0.7808, 0.8406, 0.7977, 0.6822, 0.8247, 0.8402; i.e. F1 score 0.8406 of LARA and 0.8402 of MXSCARNA slightly exceeded the F1 score 0.8370 of RNAmountAlign and 0.8247 of STRAL, while other methods trailed by several percentage points. S4 Table indicates the statistical significance of difference between all 8,976 F1 scores. Morevoer, S6 and S7 Tables display values for global alignment sensitivity and positive predictive value, benchmarked on the same data for the same programs—these results are similar to the F1 scores in Tables 2 and 4.

Although there appears to be no universally accepted criterion for quality of local alignments, Table 5 shows pairwise local alignment comparisons for the above-mentioned methods supporting local alignment: RNAmountAlign, FOLDALIGN, and LocARNA. We had intended to include SCARNA_LM [48] in the benchmarking of multiple local alignment software; however, SCARNA_LM no longer appears to be maintained (The SCARNA_LM web server is no longer functional, and the authors did not respond to our request for the source code of SCARNA_LM). Since the reference alignments for the local benchmarking dataset are not known, and sensitivity depends upon the length of the reference alignment, we only report local alignment length and positive predictive value. Abbreviating RNAmountAlign by MA, FOLDALIGN by FA, and LocARNA by LOC, Table 5 shows average run time in seconds of MA (2.30 ± 2.12), FA (625.53 ± 2554.61), LOC (5317.96 ± 8585.19), average alignment length of reference alignments (118.67 ± 47.86), MA (50.35 ± 42.33), FA (114.86 ± 125.33), LOC (556.82 ± 227.00), and average PPV scores MA (0.53 ± 0.42), FA (0.64 ± 0.36), LOC (0.03 ± 0.04). S5 Table presents p-values for a 2-tailed paired Wilcoxon signed-rank test whether the difference in positive predictive values fo 1,500 pairwise local alignments.

Taken together, these results suggest that RNAmountAlign has comparable accuracy, but much faster run time, hence making it a potentially useful tool for genome processing applications. Here it should be stressed that all benchmarking results used equally weighted contributions of sequence and ensemble structural similarity; i.e. parameter γ = 1/2 when computing similarity by Eq (17).

Statistics for pairwise alignment

Fig 5 shows fits of the relative frequency histogram of alignment scores with the normal (ND), extreme value (EVD) and gamma (GD) distributions, where local, semiglobal and global alignment scores are shown in panels from left to right. The EVD provides the best fit for local alignment sequence-structure similarity scores, as expected by Karlin-Altschul theory [12, 13]. Moreover, Fig 6 shows a 96% correlation between (expect) E-values computed by our implementation of the Karlin-Altschul method, and E-values obtained by EVD fitting of local alignment scores. In contrast, the ND provides the best fit for semiglobal sequence/structure alignment similarity scores, at least for the sequence considered in Fig 5. This is not an isolated phenomenon, as shown in Fig 6, which depicts scatter plots, Pearson correlation values and sums of squared residuals (SSRs) when computing p-values for semiglobal alignment scores between Rfam sequences and random RNA. As explained earlier, a pool of 2220 sequences from the Rfam 12.0 database [34] was created by selecting one sequence of length at most 200 nt from each family, with the property that base pair distance between its minimum free energy (MFE) structure and the Rfam consensus structure was a minimum. Then 500 sequences were randomly selected from this pool, and for each of five gap initiation and extension costs gi = −5, −4, −3, −2, −1 with . Taking each of the 500 sequences successively as query sequence and for each choice of parameters, 1000 random 400 nt RNAs were generated with the same expected nucleotide relative frequency as that of the query. For each alignment score z for query and random target, the p-value was computed as 1 minus the cumulative density function, 1 − CDF(z), for fitted normal (ND), extreme value (EVD) and gamma (GD) distributions, thus defining 1000 p-values. Additionally, a heuristic p-value was determined by calculating the proportion of alignment scores for given query that exceed z. For each set of 2.5 million (500 × 5 × 1000) p-values (heuristic, ND, EVD, GD), Pearson correlation values were computed and displayed in the upper triangular portion of Fig 6, with SSRs shown in parentheses. Note that residuals were computed for regression equation row = m ⋅ column + b, where column values constitute the independent variable. Assuming that heuristic p-values constitute the reference standard, it follows that p-values computed from the normal distribution correlate best with semiglobal alignment scores computed by RNAmountAlign. Earlier studies have suggested that protein global alignment similarity scores using PAM120, PAM250, BLOSUM50, and BLOSUM62 matrices appear to be fit best by the gamma distribution (GD) [49], and that semiglobal RNA sequence alignment similarity scores (with no contribution from structure) appear to be best fit by GD [50]. However, in our preliminary studies (not shown), it appears that the type of distribution (ND, EVD, GD) that best fits RNAmountAlign semiglobal alignment depends on the gap costs applied (indeed, for certain choices, EVD provides the best fit). Since there is no mathematical theory concerning alignment score distribution for global or semiglobal alignments, it must be up to the user to decide which distribution provides the most reasonable p-values.

thumbnail
Fig 5. Relative frequency histograms of alignment scores for local (left), semiglobal (middle) and global (right) alignments of random RNAs produced by RNAmountAlign.

For the 5S rRNA AY544430.1:375-465 from the Rfam 12.0 database having nucleotide relative frequencies pA = 0.25, pC = 0.27, pG = 0.26, pU = 0.21, we generated 10,000 random sequences having the same nucleotide relative frequencies, each of length 400 nt. For each method (local, semiglobal, global), RNAmountAlign was run using default parameters to determine the optimal pairwise alignment score, when aligning the 5S rRNA with each random RNA, thus producing relative frequency histograms which were subsequently fit by the normal distribution (ND), extreme value distribution (EVD) and gamma distribution (GD). As expected by Karlin-Altschul theory [12], local alignment scores are best fit by EVD, while semiglobal alignment scores are best fit by ND. Our conclusions of the best fitting distributions were additionally supported by d computations of variation distance, symmetrized Kullback-Leibler distance, and χ2 goodness-of-fit tests (data not shown).

https://doi.org/10.1371/journal.pone.0227177.g005

thumbnail
Fig 6.

(Left) Pearson correlation values and scatter plots for p-values of semiglobal alignment scores between Rfam sequences and random RNAs. For 500 Rfam sequences, 1000 random semiglobal alignments were computed for 5 different gap penalties yielding the total of 2.5 million alignment scores. For each score a p-value is computed assuming Normal (ND), Gamma (GD) and Extreme Value (EVD) distributions in order to investigate which distribution is closest to the heuristic p-value, that is assumed to be the gold standard. Heuristic p-value for score z is determined by the proportion of alignment scores that exceed z. For all 2.5 million p-values, pairwise Pearson correlation values were computed and displayed in the upper triangular portion of the figure, with sums of squared residuals shown in parentheses, and histograms of p-values along the diagonal. It follows that ND p-values correlate best with heuristic p-values. (Right) Scatter plot of E-values computed by EVD fitting, EEVD, as well as our implementation of the Karlin-Altschul, EKA, for pairwise local alignments. The regression equation is EEVD = 0.1764 + 0.7991 · EKA; Pearson correlation between EEVD and EKA is 96%, with correlation p-value of 2 ⋅ 10−16 indicating that p-values obtained from these two methods are in well agreement. E-values were determined from local alignment scores computed by the genome scanning form of RNAmountAlign with query tRNA AB031215.1/9125-9195 and targets consisting of 300 nt windows from E. coli str. K-12 MG1655 with GenBank accession code NC_000913.

https://doi.org/10.1371/journal.pone.0227177.g006

Multiple alignment

We benchmarked RNAmountAlign with the software LARA, mLocARNA, FOLDALIGNM, Multilign, MXSCARNA, and sequence only MUSCLE for multiple global K5 alignments in BRAliBase 2.1. STRAL is not included since the source code could not be compiled. Fig 7 indicates average SPS and SCI as a function of average pairwise sequence identity (APSI). We used the -sci flag of RNAalifold to compute SCI from the output of each software without reference to the reference alignment. Fig 7 indicates that SCI values for outputs from various alignment algorithms is higher than the SCI value from reference alignments, suggesting that the consensus structure obtained from sequence/structure alignment algorithms has a larger number of base pairs than the the consensus structure obtained from reference alignments (this phenomenon was also in [51]). Fig 7 indicates that RNAmountAlign produces SPS scores comparable to mLocARNA, LARA and MXSCARNA, and higher than Multilign and FOLDALIGNM while the SCI score obtained from RNAmountAlign are lower than other software. Averaging over all sequences, the SPS scores for RNAmountAlign, LARA, mLocARNA, FOLDALIGNM, Multilign, MXSCARNA, and MUSCLE were respectively: 0.81 ± 0.18, 0.85 ± 0.15, 0.86 ± 0.15, 0.74 ± 0.24, 0.63 ± 0.17, 0.84 ± 0.15, and 0.82 ± 0.17, while SCI scores are respectively: 0.84 ± 0.24, 0.92 ± 0.22, 0.96 ± 0.21, 0.88 ± 0.23, 0.96 ± 0.21, 0.91 ± 0.20, and 0.79 ± 0.26. SPS score for each reference alignments is 1 by definition, and average SCI score over all reference alignments is 0.79 ± 0.26. Fig 8 indicates software run time in seconds on a logarithmic scale. This figure clearly shows that RNAmountAlign has faster run time than all other structural alignment software in our benchmarking tests, thus confirming the earlier result from pairwise benchmarking.

thumbnail
Fig 7. Sum-of-pairs (SPS) score (left) and structural conservation index (SCI) (right) for multiple global alignments using RNAmountAlign, LARA, mLocARNA, FoldalignM, Multilign, MXSCARNA and MUSCLE.

SPS and SCI are shown as a function of average pairwise sequence identity(APSI). The k5 BRAliBase 2.1 database was used for benchmarking. Moving averages taken for centered, symmetric windows of size 11.

https://doi.org/10.1371/journal.pone.0227177.g007

thumbnail
Fig 8. Run time of multiple global alignment for RNAmountAlign, mLocARNA, LARA, FoldalignM, Multilign, MXSCARNA and MUSCLE.

Log run time is as shown a function of reference alignment length for K5 alignments in BRAliBase 2.1. For the structural multiple alignment algorithms benchmarked on this data, RNAmountAlign and MXSCARNA both appear to have the fastest runtime, while RNAmountAlign is faster than MXSCARNA for pairwise alignment.

https://doi.org/10.1371/journal.pone.0227177.g008

Query scan

We developed an extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence. RNAmountAlignScan functions by computing optimal semiglobal alignments of the query to sliding windows of the target, then returns the aligned target segments sorted by p-value. For a query of length n, target genome of length m, window size w > n, and window step size of s, a total of many semiglobal alignments must be computed, making the total run time of RNAmountAlignScan . In order to show the utility of RNAmountAlignScan, we compared it with RSEARCH [30], FOLDALIGN and RNAmountAlignScan sequence-only where only sequence homology is considered. RSEARCH takes a single query sequence with its secondary structure and performs local alignment with the target genome, in order to search for homologous RNA. For our comparison, we used the query sequence from Fig 6, a 71 nt tRNA from Rfam 12.0, selected with the property that the base pair distance between its MFE structure and its Rfam consensus structure is a minimum compared with other tRNAs from Rfam 12.0. As target genome, we used the E. coli K12 MG1655 genome having 4, 641, 652 nt. Since tRNAscan-SE [52] is generally considered to be the gold standard in tRNA prediction, we measured accuracy of the software by the amount of overlap between returned predictions and predicted tRNAs according to tRNAscan-SE. tRNAscan-SE found 49 full tRNA sequences on the forward strand and 37 on the reverse strand. Default parameters of each software package were used to search the query in the forward strand of E. coli genome (see S3 Table); in the case of RNAmountAlignScan this means gap initiation of −3, gap extension of −1, and sequence/structure weighting parameter γ = 0.5. RNAmountAlignScan was also used to scan the E. coli genome using sequence-only semiglobal alignments, in which gap parameters were unchanged, but with γ = 0. Genome scanning was performed using windows of 300 nt, with step size 200, thus ensuring an overlap of 100 nt between target segments—this resulted in a total of 23, 209 semiglobal alignments. We also tested FOLDALIGN, which provides an option to scan two sequences and return the best local alignment score for each pair of positions in the two sequences. The output from FOLDALIGN with this option then requires subsequent postprocessing using the LocateHits script included in FOLDALIGN in order to generate a list of non-overlapping local alignments. For the query tRNA described above, FOLDALIGN could not process the E. coli genome, and instead produced memory errors. There is a newly developed, specialized form of FOLDALIGN, called “Foldalign version 2.5.1_long_sequences” that we recently learned of from Jakob Havgaard through personal communication and can process long RNA sequences (of more than 30,000 nt) without such memory errors. Fig 9 indicates the precision-recall plot for the top 49 and 37 tRNA predictions on the forward and reverse strands, respectively. RSEARCH is not shown in the plot because it reported 89 hits, none of which had any overlap with tRNAs predicted by tRNAscan-SE and thus its sensitivity and PPV are equal to zero. Sequence length of the predictions returned by RSEARCH was 11.73 ± 0.58. For calculation of PPV as a function of sensitivity, for each of the three methods on each strand, we sorted the predictions in descending order by p-value. Given the i-th prediction, we determined the number of predictions preceding the current i-th prediction (including the i-th), and calculated the number of true positives in that collection then divided this number by 49 for positive or 37 for reverse strands to obtain sensitivity. True positives are predictions with overlap proportion of greater than 0.8 to any of the tRNAscan-SE sequences. The overlap proportion between predictions and tRNAscan-SE is defined as the ratio |AB|/|B|, where A is a tRNA predicted by each software and B is a tRNA detected by tRNAscan-SE, the latter assumed to be the gold standard. FOLDALIGN with area under the curve of 0.90 and 0.64 for forward and reverse strands outperforms RNAmountAlignScan with 0.27 and 0.28 and sequence-only with 0.08 and 0.16. However, only one the FOLDALIGN predictions on each strand had p-value less than 0.1 while RNAmountAlignScan reported significant p-values for all of the true positives Fig 9. Moreover, FOLDALIGN finished in 12.06 hours while run time of RNAmountAlign was 1.63.

thumbnail
Fig 9.

(A) PPV-sensitivity (precision-recall) plot for tRNA query scan in E. coli genome. Measures are computed for the top 49 and 37 tRNA predictions in forward (left) and reverse (right) strands of E. coli genome, respectively. Since RSEARCH had no true positives and thus zero sensitivity and PPV, it is not shown here. (B) The predicted p-values of the top predictions for each software in forward (left) and reverse (right) strands.

https://doi.org/10.1371/journal.pone.0227177.g009

Fig 10 depicts the sequence and MFE structure of the first false positive of RNAmountAlignScan; i.e. that prediction having statistically significant sequence and structural similarity to the query tRNA, but which shares no overlap with a tRNA predicted by tRNAscan-SE. This prediction has 42% sequence identity to the query tRNA, and its MFE forms a cloverleaf secondary structure. In contrast to all other benchmarking computations of this paper, due to the memory requirements for genome scanning mode, each program for query search was run on a 24-processor Intel Xeon CPU E5-2440 2.40 GHz system with 198GB memory. The commands to run the software are given in S3 Table. Run time for FOLDALIGN, RNAmountAlignScan, and RSEARCH were respectively 12.06, 1.63, and 0.62 hours and memory usage were 0.57, 20.61 and 0.49 Gigabytes. The number of random RNAs used for p-value computation in both RSEARCH and RNAmountAlignScan was 1000, and both used the RIBOSUM 85-60 matrix. The outputs of all the program are available at http://bioinformatics.bc.edu/clotelab/RNAmountAlign/.

thumbnail
Fig 10. The MFE secondary structure of an E. coli segment which is predicted to be a tRNA by RNAmountAlignScan and not by tRNAscan-SE.

Two flanking nucleotides on both ends are added to the hit sequence. The structure is color-coded by base pairing probability generated by RNAfold web server, where red [resp. blue] base pairs (i, j) have base pairing probability pi,j ≈ 1 [resp. pi,j ≈ 0].

https://doi.org/10.1371/journal.pone.0227177.g010

Conclusion

RNAmountAlign is a new C++ software package for RNA local, global, and semiglobal sequence/structure alignment, which provides accuracy comparable with that of a number of widely used programs, but provides much faster run time. RNAmountAlign additionally computes E-values for local alignments, using Karlin-Altschul statistics, as well as p-values for normal, extreme value and gamma distributions by parameter fitting.

Supporting information

S1 Table. Commands used for benchmarking various software packages for global alignment.

https://doi.org/10.1371/journal.pone.0227177.s001

(TIFF)

S2 Table. Commands used for benchmarking various software packages for local alignment.

https://doi.org/10.1371/journal.pone.0227177.s002

(TIFF)

S3 Table. Commands used for benchmarking software packages for query scan.

https://doi.org/10.1371/journal.pone.0227177.s003

(TIFF)

S4 Table. p-values from two-tailed paired Wilcoxon signed rank test of all 8,976 F1 scores for pairwise global alignments indicated in Table 4 of the main text.

Tables of p-values computed separately for each family are available at http://bioinformatics.bc.edu/clotelab/RNAmountAlign.

https://doi.org/10.1371/journal.pone.0227177.s004

(TIFF)

S5 Table. p-values from two-tailed paired Wilcoxon signed rank test of all 1500 positive predictive values for pairwise local alignments indicated in Table 5 of the main text.

Tables of p-values computed separately for each family are available at http://bioinformatics.bc.edu/clotelab/RNAmountAlign.

https://doi.org/10.1371/journal.pone.0227177.s005

(TIFF)

S6 Table. Average sensitivity scores (± one standard deviation) for pairwise global alignment of RNAmountAlign and four widely used RNA sequence/structure alignment algorithms on the benchmarking set of 8,976 pairwise alignments from the BRaliBase K2 database [38].

For each indicated Rfam family, the the number of alignments (NumAln), sequence identity (SeqId), and sensitivity scores for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL and MXSCARNA are listed, along with pooled averages over all 8,976 pairwise alignments. Parameters used in Eq (17) of the main text for RNAmountAlign were similarity matrix RIBOSUM85-60, structural similarity weight γ = 1/2, gap initiation gi = −3, gap extension ge = −1.

https://doi.org/10.1371/journal.pone.0227177.s006

(TIFF)

S7 Table. Average positive predictive value (PPV) scores (± one standard deviation) for pairwise global alignment of RNAmountAlign and four widely used RNA sequence/structure alignment algorithms on the benchmarking set of 8,976 pairwise alignments from the BRaliBase K2 database [38].

For each indicated Rfam family, the the number of alignments (NumAln), sequence identity (SeqId), and PPV-scores for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL and MXSCARNA are listed, along with Pooled averages over all 8,976 pairwise alignments. Parameters used in Eq (17) of the main text for RNAmountAlign were similarity matrix RIBOSUM85-60, structural similarity weight γ = 1/2, gap initiation gi = −3, gap extension ge = −1.

https://doi.org/10.1371/journal.pone.0227177.s007

(TIFF)

S8 Table. Initial portion of a table that determines expected base pairing probabilities p(, p, p) as a function of nucleotide probabilities pA, pC, pG, pU.

The full table (not shown) has 1770 rows. To determine average base pairing probabilities, given nucleotide probabilities pA, pC, pG, pU, a total of N = 10000 RNA sequences of length n = 200 were randomly generated to have the given expected nucleotide frequency. To compute p( [resp. std(], a library call of function pf_fold() from Vienna RNA Package [4] was made in order to determine for position in each sequence, and the average [resp. standard deviation] was taken over all sequences and values i = 1, …, n. In a similar fashion, p and p) were determined.

https://doi.org/10.1371/journal.pone.0227177.s008

(TIFF)

S1 Fig. Consensus structure for the pairwise alignment indicated in Fig 1 of the main text.

The consensus structure is computed by a calling function alifold() from Vienna RNA Package. The figure is obtained from RNAalifold web server.

https://doi.org/10.1371/journal.pone.0227177.s009

(TIFF)

S2 Fig. F1-score for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA and sequence-only alignments(γ = 0) for pairwise global alignment.

Moving averages of F1-score for centered, symmetric windows of size 11 are shown as a function of sequence identity for pairwise alignments in the BRAliBase 2.1 database used for benchmarking. Moving averages taken for centered, symmetric windows of size 11.

https://doi.org/10.1371/journal.pone.0227177.s010

(TIFF)

S3 Fig. Average sensitivity (Sen) and positive predictive value (PPV) for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA and sequence-only alignments (γ = 0) for pairwise global alignment.

Sensitivity is shown as a function of sequence identity for pairwise alignments in the BRAliBase 2.1 database used for benchmarking. Moving averages taken for centered, symmetric windows of size 11.

https://doi.org/10.1371/journal.pone.0227177.s011

(TIFF)

S4 Fig. Average pairwise sensitivity (left) and positive predictive value (right) for multiple global alignments using RNAmountAlign, LARA, mLocARNA, FoldalignM and Multilign in the k5 BRAliBase 2.1 database used for benchmarking.

Note that in our definition of Sen and PPV, pairs of the form (X, —) and (—, X) are also counted while SPS is the average pairwise sensitivity only considering aligned residue pairs (Fig 7). However, the results with and without gap counts, indicated in this figure and Fig 7, respectively, are very close. Moving averages taken for centered, symmetric windows of size 11.

https://doi.org/10.1371/journal.pone.0227177.s012

(TIFF)

S5 Fig. Matthews Correlation Coefficient (MCC) for the quality of secondary structure prediction (see text) from each of RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA and sequence-only alignments (γ = 0) for pairwise global alignment.

Moving averages of MCC are computed for centered, symmetric windows of size 11 and shown as a function of sequence identity for a subset of 7, 154 reference pairwise alignments from k2 BRAliBase 2.1. Overall average MCC values ± one standard deviation are shown in parentheses. These 7, 154 alignments were selected with the property that both sequences in the alignment (exactly) appear in an Rfam family seed multiple alignment from Rfam 7.0, and so can be assigned an Rfam consensus secondary structure as described in the text. These consensus structures are taken as the reference structures in the computation of MCC. Predicted structures are obtained directly from the output of each software in the benchmarking test. For RNAmountAlign and for our in-house implementation of STRAL, the -alifold flag was used to compute the consensus structure by a function call to alifold() from libRNA.a in the Vienna RNA Package.

https://doi.org/10.1371/journal.pone.0227177.s013

(TIFF)

S6 Fig. F1-score for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA and sequence-only alignments(γ = 0) for pairwise global alignment, where any failure of the benchmarked program to output an assignment is simply ignored.

Moving averages of F1-score for centered, symmetric windows of size 11 are shown as a function of sequence identity for pairwise alignments in the BRAliBase 2.1 database used for benchmarking. Moving averages taken for centered, symmetric windows of size 11. This figure should be compared with S2 Fig, where failure to output an alignment is counted as zero.

https://doi.org/10.1371/journal.pone.0227177.s014

(TIFF)

S7 Fig. Average sensitivity (Sen) and positive predictive value (PPV) for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA and sequence-only alignments (γ = 0) for pairwise global alignment, where any failure of the benchmarked program to output an assignment is simply ignored.

Sensitivity is shown as a function of sequence identity for pairwise alignments in the BRAliBase 2.1 database used for benchmarking. Moving averages taken for centered, symmetric windows of size 11. This figure should be compared with S3 Fig, where failure to output an alignment is counted as zero.

https://doi.org/10.1371/journal.pone.0227177.s015

(TIFF)

S8 Fig. Illustration of a potential weakness of RNAmountAlign.

Using RNAmountAlign genome-scanning software, semiglobal alignments of the query tRNA AB031215.1/9125-9195 were made with each 300 nt window (successive window overlap of 200 nt) of the E. coli str. K-12 substr. MG1655 genome. This figure shows the MFE structure, color-coded by positional entropy [53], for the alignment of positions 696097-696164 with score −7.70, p-value of 4.145010 ⋅ 10−6. (gap costs gi = −3, gi = −1, γ = 0.5, scaling factor αseq = 0.447648, shift term αstr = 0.304766, γ = 1/2). However, this RNA is clearly not a tRNA, since the three loops are not within the scope of a multiloop, and the variable loop is located in the wrong position, and the large positional entropy suggests that there is not an unambiguous structure. Moreover, this sequence is not one of the 40 tRNA genes/pseudogenes on the plus-strand predicted by tRNAscan-SE [52].

https://doi.org/10.1371/journal.pone.0227177.s016

(TIFF)

References

  1. 1. Levenshtein VI. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady. 1966;10:707.
  2. 2. Moulton V, Zuker M, Steel M, Pointon R, Penny D. Metrics on RNA secondary structures. Journal of Computational Biology. 2000;7:277–292. pmid:10890402
  3. 3. Shapiro BA. An algorithm for comparing multiple RNA secondary structures. Comput Appl Biosci. 1988;4(3):387–393. pmid:2458170
  4. 4. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011;6:26. pmid:22115189
  5. 5. Voss B, Meyer C, Giegerich R. Evaluating the predictability of conformational switching in RNA. Bioinformatics. 2004;20(10):1573–1582. pmid:14962925
  6. 6. Barsacchi M, Baù A, Bechini A. Extensive Assessment of Metrics on RNA Secondary Structures and Relative Ensembles. In: Proceedings of the 31st Annual ACM Symposium on Applied Computing. SAC’16. New York, NY, USA: ACM; 2016. p. 44–47. Available from: http://doi.acm.org/10.1145/2851613.2851868.
  7. 7. Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 2003;31(24):7280–7301. pmid:14654704
  8. 8. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443–453. pmid:5420325
  9. 9. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–197. pmid:7265238
  10. 10. Gusfield D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University; 1997.
  11. 11. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999;27(13):2682–2690. pmid:10373585
  12. 12. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990;87(6):2264–2268. pmid:2315319
  13. 13. Karlin S, Dembo A, Kawabata T. Statistical composition of high-scoring segments from molecular sequences. Annals of Statistics. 1990;18(2):571–581.
  14. 14. Bauer M, Klau GW, Reinert K. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics. 2007;8:271. pmid:17662141
  15. 15. Havgaard JH, Lyngsø R, Stormo G, Gorodkin J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics. 2005;21(9). pmid:15657094
  16. 16. Havgaard J, Kaur S, Gorodkin J. Comparative ncRNA gene and structure prediction using Foldalign and FoldalignM. Curr Protoc Bioinformatics. 2012;0(O):O.
  17. 17. Sundfeld D, Havgaard JH, De Melo AC, Gorodkin J. Foldalign 2.5: multithreaded implementation for pairwise structural RNA alignment. Bioinformatics. 2016;32(8):1238–1240. pmid:26704597
  18. 18. Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 2002;317(2):191–203. pmid:11902836
  19. 19. Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol. 2007;3(4):e65. pmid:17432929
  20. 20. Smith C, Heyne S, Richter AS, Will S, Backofen R. Freiburg RNA Tools: a web server integrating INTARNA, EXPARNA and LOCARNA. Nucleic Acids Res. 2010;38(Web):W373–W377. pmid:20444875
  21. 21. Hofacker IL, Bernhart SH, Stadler PF. Alignment of RNA base pairing probability matrices. Bioinformatics. 2004;20(14):2222–2227. pmid:15073017
  22. 22. Dalli D, Wilm A, Mainz I, Steger G. STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics. 2006;22(13):1593–1599. pmid:16613908
  23. 23. Tabei Y, Kiryu H, Kin T, Asai K. A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics. 2008;9(1):33. pmid:18215258
  24. 24. Tabei Y, Tsuda K, Kin T, Asai K. SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics. 2006;22(14):1723–1729. pmid:16690634
  25. 25. Torarinsson E, Havgaard JH, Gorodkin J. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23(8):926–932. pmid:17324941
  26. 26. Xu Z, Mathews DH. Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences. Bioinformatics. 2011;27(5):626–632. pmid:21193521
  27. 27. Xu ZZ, Mathews DH. Prediction of Secondary Structures Conserved in Multiple RNA Sequences. Methods Mol Biol. 2016;1490:35–50. pmid:27665591
  28. 28. Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302(1):205–217. pmid:10964570
  29. 29. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–4680. pmid:7984417
  30. 30. Klein RJ, Eddy SR. RSEARCH: Finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003;4:44. pmid:14499004
  31. 31. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29(22):2933–2935. pmid:24008419
  32. 32. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982;162(3):705–708. pmid:7166760
  33. 33. Ferre F, Ponty Y, Lorenz WA, Clote P. DIAL: a web server for the pairwise alignment of two RNA three-dimensional structures using nucleotide, dihedral angle and base-pairing similarities. Nucleic Acids Res. 2007;35(Web):W659–W668. pmid:17567620
  34. 34. Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 2015;43(Database):D130–D137. pmid:25392425
  35. 35. Turner DH, Mathews DH. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 2010;38(Database):D280–D282. pmid:19880381
  36. 36. Hogeweg P, Hesper B. Energy directed folding of RNA sequences. Nucleic Acids Res. 1984;12(1):67–74. pmid:6198625
  37. 37. Huynen MA, Perelson A, Vieira WA, Stadler PF. Base pairing probabilities in a complete HIV-1 RNA. J Comput Biol. 1996;3(2):253–274. pmid:8811486
  38. 38. Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005;33(8):2433–2439. pmid:15860779
  39. 39. Smith TF, Waterman MS. Comparison of biosequences. Advances in Applied Mathematics. 1981;2:482–489.
  40. 40. Sellers PH. On the theory and computation of evolutionary distances. SIAM J Appl Math. 1974;26:787–793.
  41. 41. Waterman MS. Introduction to Computational Biology. Chapman and Hall/CRC; 1995.
  42. 42. Bashford D, Chothia C, Lesk AM. Determinants of a protein fold: Unique features of the globin amino acid sequences. Journal of Molecular Biology. 1987;196(1):199–216. https://doi.org/10.1016/0022-2836(87)90521-3. pmid:3656444
  43. 43. Sievers F, Wilm A, Dineen D, Gibson JJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7(539):1–6.
  44. 44. Sneath PHA, Sokal RR. Numerical taxonomy. The principles and practice of numerical classification. San Francisco, W.H. Freeman and Company., USA: Taylor & Francis, Ltd.; 1973.
  45. 45. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)—Protein Structure. 1975;405(2):442–451. https://doi.org/10.1016/0005-2795(75)90109-9.
  46. 46. Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics. 2008;9:474. pmid:19014431
  47. 47. Clote P, Ferre F, Kranakis E, Krizanc D. Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA. 2005;11(5):578–591. pmid:15840812
  48. 48. Tabei Y, Asai K. A local multiple alignment method for detection of non-coding RNA sequences. Bioinformatics. 2009;25(12):1498–1505. pmid:19376823
  49. 49. Pang H, Tang J, Chen SS, Tao S. Statistical distributions of optimal global alignment scores of random protein sequences. BMC Bioinformatics. 2005;6:257. pmid:16225696
  50. 50. Hertel J, De Jong D, Marz M, Rose D, Tafer H, Tanzer A, et al. Non-coding RNA annotation of the genome of Trichoplax adhaerens. Nucleic Acids Res. 2009;37(5):1602–1615. pmid:19151082
  51. 51. Smith MA, Seemann SE, Quek XC, Mattick JS. DotAligner: identification and clustering of RNA structure motifs. Genome Biol. 2017;18(1):244. pmid:29284541
  52. 52. Lowe TM, Chan PP. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Research. 2016;44(W1):W54–W57. pmid:27174935
  53. 53. Huynen M, Gutell R, Konings D. Assessing the reliability of RNA folding using statistical mechanics. Edited by Draper D. E. Journal of Molecular Biology. 1997;267(5):1104–1112. https://doi.org/10.1006/jmbi.1997.0889.