RNAmountAlign: Efficient software for local, global, semiglobal pairwise and multiple RNA sequence/structure alignment

Alignment of structural RNAs is an important problem with a wide range of applications. Since function is often determined by molecular structure, RNA alignment programs should take into account both sequence and base-pairing information for structural homology identification. This paper describes C++ software, RNAmountAlign, for RNA sequence/structure alignment that runs in O(n3) time and O(n2) space for two sequences of length n; moreover, our software returns a p-value (transformable to expect value E) based on Karlin-Altschul statistics for local alignment, as well as parameter fitting for local and global alignment. Using incremental mountain height, a representation of structural information computable in cubic time, RNAmountAlign implements quadratic time pairwise local, global and global/semiglobal (query search) alignment using a weighted combination of sequence and structural similarity. RNAmountAlign is capable of performing progressive multiple alignment as well. Benchmarking of RNAmountAlign against LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and MUSCLE shows that RNAmountAlign has reasonably good accuracy and faster run time supporting all alignment types. Additionally, our extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence, outperforms RSEARCH and sequence-only query scans and runs faster than FOLDALIGN query scan.


Introduction
A number of different metrics exist for comparison of RNA secondary structures, including base pair distance (BP), string edit distance (SE) [22], mountain distance (MD) [26], tree edit distance (TE) [33], coarse tree edit distance (HTE) [23], morphological distance [44] and a few other metrics.In what appears to be the most comprehensive published comparison of various secondary structure metrics [1], it was shown that all of these distance measures are highly correlated when computing distances between structures taken from the Boltzmann low-energy ensemble of secondary structures [6] for the same RNA sequence -so-called intraensemble correlation.In contrast, these distance measures have low correlation when computing distances between structures taken from Boltzmann ensembles of different RNA sequences of the same length -socalled inter-ensemble correlation.For instance, the intra-ensemble correlation between base pair distance (BP) and mountain distance (MD) is 0.822, while the corresponding inter-ensemble correlation drops to 0.210.Intra-ensemble correlation between string edit distance (SE) and the computationally more expensive tree edit distance (TE) is 0.975, while the corresponding intra-ensemble correlation drops to 0.590 -see Table 1.
Due to poor inter-ensemble correlation of RNA secondary structure metrics, and the fact that most secondary structure pairwise alignment algorithms depend essentially on some form of base pair distance, string edit distance, or free energy of common secondary structure, we have developed the first RNA sequence/structure pairwise alignment algorithm that is based on (incremental ensemble) mountain distance.Our software, RNAmountAlign, uses this distance measure, since the Boltzmann ensemble of all secondary structures of a given RNA of length n can represented as a length n vector of real numbers, thus allowing an adaptation of fast sequence alignment methods.Depending on the command-line flag given, our software,  1: Correlation between various secondary structure metrics, as computed in [1]: base pair distance (BP), string edit distance (SE) [22], mountain distance (MD) [26], tree edit distance (TE) [33] and coarse tree edit distance (HTE) [23].Lower triangular values indicate intra-ensemble correlations; upper triangular values indicate inter-ensemble correlations.Table values are taken from [1].
RNAmountAlign can perform pairwise alignment, (Needleman-Wunsch global [29], Smith-Waterman local [37] or semiglobal [11] alignment) as well as progressive multiple alignment (global and local), computed using a guide tree as in CLUSTAL [41].Expect values E for local alignments are computed using Karlin-Altschul extreme-value statistics [19,20], suitably modified to account for our new sequence/structure similarity measure.Additionally, RNAmountAlign can determine p-values (hence E-values) by parameter fitting for the normal (ND), extreme value (EVD) and gamma (GD) distributions.
We benchmark the performance of RNAmountAlign on pairwise and multiple global sequence/structure alignment of RNAs against the widely used programs LARA, FOLDALIGN, DYNALIGN, LocARNA and STRAL.LARA (Lagrangian relaxed structural alignment) [2] formulates the problem of RNA (multiple) sequence/structure alignment as a problem in integer linear programming (ILP), then computes optimal or near-optimal solutions to this problem.The software FOLDALIGN [12,13,38], and DYNALIGN [25] are different O(n 4 ) approximate implementations of Sankoff's O(n 6 ) optimal RNA sequence/structure alignment algorithm.FOLDALIGN sets limits on the maximum length of the alignment as well as the maximum distance between subsequences being aligned in order to reduce the time complexity of the Sankoff algorithm.DYNALIGN [25] implements pairwise RNA secondary structural alignment by determining the common structure to both sequences that has lowest free energy, using a positive (destabilizing) energy heuristic for gaps introduced, in addition to setting bounds on the distance between subsequences being aligned.In particular, the only contribution from nucleotide information in Dynalign is from the nucleotide-dependent free energy parameters for base stacking, dangles, etc. LocARNA (local alignment of RNA) [34,45] is a heuristic implementation of PMcomp [15] which compares the base pairing probability matrices computed by McCaskill's algorithm.Although the software is not maintained, STRAL [5] which is similar to our approach, uses up-and downstream base pairing probabilities as the structural information and combines them with sequence similarity in a weighted fashion.
LARA, mLocARNA (extension of LocARNA), FOLDALIGNM [12,42] (extension of FOLDALIGN), Multilign [46,47] (extension of DYNALIGN) and STRAL support multiple alignment.LARA computes all pairwise sequence alignments and subsequently uses the T-Coffee package [30] to construct multiple alignments.Both FOLDALIGNM and mLocARNA implement progressive alignment of consensus base pairing probability matrices using a guide tree similar to the approach of PMmulti [15].For a set of given sequences, Multilign uses DYNALIGN to compute the pairwise alignment of a single fixed index sequence to each other sequence in the set, and computes a consensus structure.In each pairwise alignment, only the index sequence base pairs found in previous computations are used.More iterations in the same manner with the same index sequence are then used to improve the structure prediction of other sequences.The number of pairwise alignments in Multilign is linear with respect to the number of sequences.STRAL performs multiple alignment in a fashion similar to CLASTALW [40].Table 2 provides an overview of various features, to the best of our knowledge, supported by the software benchmarked in this paper.
RNAmountAlign can perform semiglobal alignments in addition to global and local alignments.As in the RNA tertiary structural alignment software DIAL [7], semiglobal alignment allows the user to perform a query search, where the query is entirely matched to a local portion of the target.Quadratic time alignment using affine gap cost is implemented in RNAmountAlign using the Gotoh method [10] with the following pseudocode, shown for the case of semiglobal alignment.Let g(k) denote an affine cost for size k gap, defined by g(0) = 0 and g(k) = g i + (k − 1) • g e for positive gap initiation [resp.extension] costs g i [resp.g e ].For query a = a 1 , . . ., a n and target b = b 1 , . . ., b m , define (n + 1) × (m + 1) matrices M, P, Q as follows: M i,0 = g(i)  for all 1 ≤ i ≤ n, M 0,j = 0 for all 1 ≤ j ≤ m, while for positive i, j we have and define Q i,0 = 0 and Q i,j = max (M i,j−1 + g i , Q i,j−1 + g e , 0).Determine the maximum semiglobal alignment score in row n, then perform backtracking to obtain an optimal semiglobal (or query search) alignment.
In this paper we provide a very fast, comprehensive software package capable of pairwise/multiple local/global/semiglobal alignment with p-values and E-values for statistical significance.Moreover, due to its speed and relatively good accuracy, the software can be used for whole-genome searches for homologues of a given orphan RNA as query.This is in contrast to Infernal [28], which requires a multiple alignment to construct a covariance model for whole-genome searches.

Incremental ensemble expected mountain height
Introduced in [16], the mountain height1 h s (k) of secondary structure s at position k is defined as the number of base pairs in s that lie between an external loop and k, formally given by The ensemble mountain height h(k) [17] for RNA sequence a = a 1 , . . ., a n at position k is defined as the average mountain height, where the average is taken over the Boltzmann ensemble of all low-energy structures s of sequence a.If base pairing probabilities p i,j have been computed, then it follows that and hence the incremental ensemble mountain height, which for values 1 It is clear that −1 ≤ m a (k) ≤ 1, and that both ensemble mountain height and incremental ensemble mountain height can be computed in time that is quadratic in sequence length n, provided that base pairing probabilities p i,j have been computed.Except for the cubic time taken by a function call of RNAfold from Vienna RNA package [23], the software RNAmountAlign has quadratic time and space requirements.Figure 1 depicts a global alignment of two transfer RNAs, computed by RNAmountAlign, shown as superimposed ensemble mountain height displays with gaps.Since the BRAliBase 2.1 K2 reference (pairwise) alignment [9] has only 28% sequence identity, structural similarity parameter γ was set to 1 in our software RNAmountAlign, which returned the correct alignment.See Methods section for explanation of γ and the algorithm used by RNAmountAlign.

Transforming distance into similarity
In [36], Seller's (distance-based) global pairwise alignment algorithm [32] was rigorously shown to be equivalent to Needleman and Wunsch's (similarity-based) global pairwise alignment algorithm [29].Recalling that Seller's alignment distance is defined as the minimum, taken over all alignments of the sum of distances d(x, y) between aligned nucleotides x, y plus the sum of (positive) weights w(k) for size k gaps, while Needleman-Wunsch alignment similarity is defined as the maximum, taken over all alignments of the sum of similarities s(x, y) between aligned nucleotides x, y plus the sum of (negative) gap weights g(k) for size k gaps, Smith and Waterman [36] show that by defining and by taking the minimum distance, rather than maximum similarity, the Needleman-Wunsch algorithm is transformed into Seller's algorithm.Though formulated here for RNA nucleotides, equivalence holds over arbitrary alphabets and similarity measures (e.g.BLOSUM62).For x, y ∈ { ( , •, ) } from Eq (3) we have Define the distance d 0 (x, y) between characters x, y in the dot-bracket representation of a secondary structure by  ) over all positions i, j for which neither character is a gap symbol, then adding positive weight w(k) for all size k gaps.By Eq (4) and Eq (5), it follows that an equivalent ensemble structural similarity measure between two positions a i , b j , denoted ST RSIM (a i , b j ), is obtained by multiplying d 1 and w(k) by −1: This equation will be used later, since our algorithm RNAmountAlign combines both sequence and ensemble structural similarity.Indeed, −|m a (i) − m b (j)| ∈ [−2, 0] with maximum value of 0 while RIBOSUM85-60, shown in Table 3, has similarity values in the interval [−1.86, 2.22].In order to combine sequence with structural similarity, both ranges should be rendered comparable as shown in the next section.

Pairwise alignment
In order to combine sequence and ensemble structural similarity, we determine a multiplicative scaling factor α seq and an additive shift factor α str such that the mean and standard deviation for the distribution of sequence similarity values from a RIBOSUM matrix [21] (after being multiplied by α seq ) are equal to the mean and standard deviation for the distribution of structural similarity values from STRSIM (after additive shift of α str ).The RIBOSUM85-60 nucleotide similarity matrix used in this paper is given in Table 3, and the distributions for RIBOSUM and STRSIM values are shown in Figure 2 for the 72 nt transfer RNA AL671879.2.Given query [resp.target] nucleotide frequencies p A , p C , p G , p U [p A , p C , p G , p U ] that sum to 1, the mean µ seq and standard deviation σ seq of RIBOSUM nucleotide similarities can be computed by Setting s 0 (x, y) = −d 0 (x, y), where d 0 (x, y) is defined in Eq (7), for given query [resp.target] base pairing probabilities p ( , p • , p ) [resp.p ( , p • , p ) ] of dot-bracket characters, it follows that the mean µ str and standard deviation σ str of structural similarities can be computed by Now we compute a multiplicative factor α seq and an additive shift term α str , both dependent on frequencies p A , p C , p G , p U and p ( , p • , p ) , such that the mean [resp.standard deviation] of nucleotide similarity multiplied by α seq is equal to the mean [resp.standard deviation] of structural similarity after addition of shift term α str : Table 3: RIBOSUM85-60 similarity matrix for RNA nucleotides from [21].
Given the query RNA a = a 1 , . . ., a n and target RNA b = b 1 , . . ., b m with incremental ensemble expected mountain heights m a (1) of b, and user-defined weight 0 ≤ γ ≤ 1, our final similarity measure is defined by where α seq , α str are computed by Eqs (13,14) All benchmarking computations were carried out using γ = 1/2, although it is possible to use position-specific weight γ i,j defined as the average probability that i is paired in a and j is paired in b.
Our structural similarity measure is closely related to that of STRAL, which we discovered only after completing a preliminary version of this paper.Let pl a i = j<i p a j,i and pr a i = j>i p a i,j be the probability that position i of sequence a is paired to a position on the left or right, respectively.The similarity measure used in STRAL is defined by From Eq (15) and Eq (3) our measure can be defined as Though RNAmountAlign was developed independently much later than STRAL, our software offers functionalities unavailable in STRAL, which latter appears to be no longer maintained.

Statistics for pairwise alignment
Karlin-Altschul statistics for local pairwise alignment.For a finite alphabet A and similarity measure s, suppose that the expected similarity x,y∈A p x p y • s(x, y) is negative and that s(x, y) is positive for at least one choice of x, y.In the case of BLAST, amino acid and nucleotide similarity scores are integers, for which the Karlin-Altschul algorithm was developed [19].In contrast, RNAmountAlign similarity scores scores are not integers (or more generally values in a lattice), because Eq (15) combines real-valued α seq -scaled RIBOSUM nucleotide similarities with real-valued α str -shifted STRSIM structural similarities, which depend on query For that reason, we use the following reformulation of a result by Karlin, Dembo and Kawabata [20], the similarity score s(x, y) for RNA nucleotides x, y is defined by Eq (15).
Theorem 1 (Theorem 1 of [20]) Given similarity measure s between nucleotides in alphabet A = {A, C, G, U }, let λ * be the unique positive root of E[e s(x,y) ] = x,y∈A p x p y • e λs(x,y) , and let random variable S k denote the score of a length k gapless alignment.For large z, where M denotes high maximal segment scores for local alignment of random RNA sequences a 1 , . . ., a n and b 1 , . . ., b m , and where Fitting data to probability distributions.Data were fit to the normal distribution (ND) by the method of moments (i.e.mean and standard deviation were taken from data analysis).Data were fit to the extreme value distribution (EVD) by an in-house implementation of maximum likelihood to determine λ, K, as described in supplementary information to [21].Data were fit to the gamma distribution by using the function fitdistr(x,'gamma') from the package MASS in the R programming language, which determines rate and shape parameters for the density function with where α is the shape parameter, the rate is 1/λ, where λ is known as the scale parameter.

Multiple alignment
Suppose p A , p C , p G , p U are the nucleotide probabilities obtained after the concatenation of all sequences.Let p ( , p • , p ) be computed by individually folding each sequence and taking the arithmetic average of probabilities of ( , • and ) over all sequences.The mean and standard deviation of sequence and structure similarity are computed similar to Eqs (9)(10)(11)(12).
Sequence multiplicative scaling factor α seq and the structure additive shift factor α str are computed from these values using Eqs (13,14).
RNAmountAlign implements progressive multiple alignment using UPGMA to construct the guide tree.In UPGMA, one first defines a similarity matrix S, where S[i, j] is equal to (maximum) pairwise sequence similarity of sequences i and j.A rooted tree is then constructed by progressively creating a parent node of the two closest siblings.Parent nodes are profiles (PSSMs) that represent alignments of two or more sequences, hence can be treated as pseudo-sequences in a straightforward adaptation of pairwise alignment to the alignment of profiles.Let's consider an alignment of the probability of occurrence of a nucleotide or gap at column i of alignment A. Then sequence similarity SEQSIM between two columns is defined by where The structural measure for a profile is computed from the incremental ensemble heights averaged over each column.Let m A (i) denote the arithmetic average of incremental ensemble mountain height at column where m a * j (i) is the incremental ensemble mountain height at position i of sequence a * j obtained from Eq (3).Here, let m a * j (i) = 0 if a * ji is a gap.Structural similarity between two columns is defined by Finally, the combined sequence/structure similarity is computed from 2.1 Benchmarking

Accuracy measures
Sensitivity, positive predictive value, and F1-measure for pairwise alignments were computed as follows.Let positive predictive value (P P V )] of a predicted alignment is TP divided by reference alignment length [resp.TP divided by predicted alignment length].The F 1-score is the harmonic mean of sensitivity and PPV, so F 1 = 2 1/Sen+1/P P V .For the computation of Sen ,P P V , and F 1, pairs of the form (X, -) and (-, X) are also counted.In the case of local alignment, since the size of the reference alignment is unknown, only the predicted alignment length and PPV are reported.To compute the accuracy of multiple alignment, we used sum-of-pair-scores (SPS) [41], defined as follows.Suppose that A denotes a multiple alignment of the form ik is aligned with a * jk in both the reference and predicted alignments, and p ijk = 0 otherwise.Sum-of-pairs score SPS is then the sum, taken over all i, j, k, of the p ijk .Though SPS can be considered as the average sensitivity, taken over all sequence pairs in the alignment, this is not technically the case, since our definition of sensitivity also counts pairs of the form (X, -) and (-, X) from the reference alignment.
To measure the conservation of secondary structures in alignments, structural conservation index (SCI) was computed using RNAalifold [3].RNAalifold computes SCI as the ratio of the free energy of the alignment, computed by RNAalifold, with the average minimum free energy of individual structures in the alignment.SCI values close to 1 [resp.0] indicate high [resp.low] structural conservation.All computations made with Vienna RNA Package used version 2.1.7[23] using default Turner 2004 energy parameters [43]).

Dataset for global and local alignment comparison
For pairwise global alignment benchmarking in Table 4 and Figures 3, 4, S2 and S3 all 8976 pairwise alignments in k2 from BRAliBase 2.1 database [9] were used.For multiple global alignment benchmarking in Fig 7, k5 BRAliBase 3 was used [8].This dataset includes 583 reference alignments, each composed of 5 sequences.For pairwise local alignment benchmarking, 75 pairwise alignments having sequence identity ≤ 70% were randomly selected from each of 20 well-known families from the Rfam 12.0 database [27], many of which were considered in a previous study [4], yielding a total of 1500 alignments.Following [39], these alignments were trimmed on the left and right, so that both first and last aligned pairs of the alignment do not contain a gap symbol.For sequences a = a 1 , . . ., a n [resp.b = b 1 , . . ., b m ] from each alignment, random sequences a [resp.b ] were generated with the same nucleotide frequencies, then a random position was chosen in a [resp.b ] in which to insert a [resp.b], thus resulting in a pair of sequences of lengths 4n and 4m.Finally, since sequence identity was at most 70%, the RIBOSUM70-25 similarity matrix was used in RNAmountAlign.Preparation of the benchmarking dataset for local alignment was analogous to the method used in multiple local alignment of [39].We used LocARNA (version 1.8.7),FOLDALIGN (version 2.5), LARA (version 1.3.2) DYNALIGN (from version 5.7 of RNAstructure), and STRAL (in-house implementation due to unavailability) for benchmarking.

Dataset for correlation of p-values for different distribution fits
A pool of 2220 sequences from the Rfam 12.0 database [27] was created as follows.One sequence was selected from each Rfam family having average sequence length at most 200 nt, with the property that the base pair distance between its minimum free energy (MFE) structure and the Rfam consensus structure was a minimum.Subsequently, for each of 500 randomly selected query sequences from the pool of 2220 sequences, 1000 random target sequences of length 400 nt were generated to have the same expected nucleotide frequency as that of the query.For each query and random target, five semiglobal (query search) alignments were created using gap initiation costs of g i ∈ {−1, −2, −3, −4, −5} with gap extension cost g e equal to one-third the gap initiation cost.For each alignment score x for query and random target, the p-value was computed as 1−CDF (x) for ND, EVD and GD, where CDF (x) is the cumulative density function evaluated at x. Additionally, a heuristic p-value was determined by calculating the proportion of alignment scores for given query that exceed x.

Results
We benchmarked RNAmountAlign's performance for pairwise and multiple alignments on BraliBase k2 and k5 datasets, respectively.Running averages of sensitivity, positive predictive value, and F1-measure, averaging over windows of size 11 nt (interval [k − 5, k + 5]), were computed as a function of sequence identity, where it should be noted that the number of pairwise alignments for different values of sequence identity can vary for the BRAliBase 2.1 data (e.g.there are only 35 pairwise alignments having sequence identity < 20%).Default parameters were used for all other software.For our software RNAmountAlign, gap initiation cost was -3, gap extension -1, and sequence/structure weighting parameter γ was 0.5 (value obtained by optimizing on a small set of 300 random alignments from Rfam 12.0, not considered in training or testing set).The sequence-only alignment is computed from RNAmountAlign with the same gap penalties, but for γ = 0.While its accuracy is high, RNAmountAlign is faster by an order of magnitude than LocARNA, LARA, FOLDALIGN, and DYNALIGN -indeed, algorithmic time complexity of our method is O(n 3 ) compared with O(n 4 ) for these methods.Since STRAL could not be compiled on any of our systems, we implemented its algorithm by modifying RNAmountAlign and obtained results for STRAL's default parameter settings.Therefore, the run time of STRAL is identical to RNAmountAlign but we achieve slightly higher F1-measure, sensitivity and PPV.Moreover In addition, Table 4 displays average pairwise global alignment F1 scores for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, and STRAL when benchmarked on 36 families from the BRaliBase K2 database comprising altogether 8976 RNA sequences with average length of 249.33.Averaging over all sequences, the F1 scores for the programs just mentioned were respectively 0.8370, 0.7808, 0.8406, 0.7977, 0.6822, 0.8247; i.e.F1 score 0.8406 of LARA slightly exceeded the F1 score 0.8370 of RNAmountAlign and 0.8247 of STRAL, while other methods trailed by several percentage points.Supplementary Information (SI) Tables S1 and  S2 display values for global alignment sensitivity and positive predictive value, benchmarked on the same data for the same programs -these results are similar to the F1-scores in Tables 2 and 4.

Pairwise alignment
Although there appears to be no universally accepted criterion for quality of local alignments, Table 5 shows pairwise local alignment comparisons for the above-mentioned methods supporting local alignment: RNAmountAlign, FOLDALIGN, and LocARNA.We had intended to include SCARNA LM [39] in the benchmarking of multiple local alignment software; however, SCARNA LM no longer appears to be maintained, since the web server is no longer functional and no response came from our request for the source code.Since the reference alignments for the local benchmarking dataset are not known, and sensitivity depends upon the length of the reference alignment, we only report local alignment length and positive predictive value.Abbreviating RNAmountAlign by MA, FOLDALIGN by FA, and LocARNA by LOC, Table 5  Taken together, these results suggest that RNAmountAlign has comparable accuracy, but much faster run time, hence making it a potentially useful tool for genome scanning applications.Here it should be stressed that all benchmarking results used equally weighted contributions of sequence and ensemble structural similarity; i.e. parameter γ = 1/2 when computing similarity by Eq (15).By setting γ = 1, RNAmountAlign alignments depend wholly on structural similarity (see Figure 1).Indeed, for the following BRAliBase 2.1 alignment with 28% sequence identity, by setting γ = 1, RNAmountAlign returns the correct alignment.extracted from the Rfam 12.0 database [27], and prepared in a manner analogous to that of the dataset used in benchmarking multiple local alignment in [39] -see text for details.Parameters used in Eq (15) of the main text for RNAmountAlign were structural similarity weight γ = 1/2, gap initiation g i = −3, gap extension g e = −1; since reference alignments were required to have at most 70% sequence identity, nucleotide similarity matrix RIBOSUM8570-25 was used in RNAmountAlign. between Rfam sequences and random RNA.As explained earlier, a pool of 2220 sequences from the Rfam 12.0 database [27] was created by selecting one sequence of length at most 200 nt from each family, with the property that base pair distance between its minimum free energy (MFE) structure and the Rfam consensus structure was a minimum.Then 500 sequences were randomly selected from this pool, and for each of five gap initiation and extension costs g i = −5, −4, −3, −2, −1 with g e = gi 3 .Taking each of the 500 sequences successively as query sequence and for each choice of parameters, 1000 random 400 nt RNAs were generated with the same expected nucleotide relative frequency as that of the query.For each alignment score z for query and random target, the p-value was computed as 1 minus the cumulative density function, 1−CDF (z), for fitted normal (ND), extreme value (EVD) and gamma (GD) distributions, thus defining 1000 p-values.Additionally, a heuristic p-value was determined by calculating the proportion of alignment scores for given query that exceed z.For each set of 2.5 million (500 × 5 × 1000) p-values (heuristic, ND, EVD, GD), Pearson correlation values were computed and displayed in the upper triangular portion of Fig 6, with SSRs shown in parentheses.Note that residuals were computed for regression equation row = m • column + b, where column values constitute the independent variable.Assuming that heuristic p-values constitute the reference standard, it follows that p-values computed from the normal distribution correlate best with semiglobal alignment scores computed by RNAmountAlign.

Statistics for pairwise alignment
Earlier studies have suggested that protein global alignment similarity scores using PAM120, PAM250, BLOSUM50, and BLOSUM62 matrices appear to be fit best by the gamma distribution (GD) [31], and that semiglobal RNA sequence alignment similarity scores (with no contribution from structure) appear to be best fit by GD [14].However, in our preliminary studies (not shown), it appears that the type of distribution (ND, EVD, GD) that best fits RNAmountAlign semiglobal alignment depends on the gap costs applied (indeed, for certain choices, EVD provides the best fit).Since there is no mathematical theory concerning alignment score distribution for global or semiglobal alignments, it must be up to the user to decide which distribution provides the most reasonable p-values.

Multiple alignment
We benchmarked RNAmountAlign with the software LARA, mLocARNA, FOLDALIGNM and Multilign for multiple global K5 alignments in Bralibase 3. STRAL is not included since the source code could not be compiled.Fig 7 indicates average SPS and SCI as a function of average pairwise sequence identity (APSI).We used the -sci flag of RNAalifold to compute SCI from the output of each software without reference to the reference alignment.Fig 7 indicates that SCI values for outputs from various alignment algorithms is higher than the SCI value from reference alignments, suggesting that the consensus structure obtained from sequence/structure alignment algorithms has a larger number of base pairs than the the consensus structure  S3.From these values, scaling factor α seq and shift α str , were computed; with structural similarity weight γ = 1/2, the overall similarity function from Eq (15) in the text was determined.obtained from reference alignments (this phenomenon was also in [35]).Fig 7 indicates that RNAmountAlign produces SPS scores comparable to mLocARNA and LARA and higher than Multilign and FOLDALIGNM while the SCI score obtained from RNAmountAlign are slightly lower than other software.Averaging over all sequences, the SPS scores for RNAmountAlign, LARA, mLocARNA, FOLDALIGNM and Multilign were respectively: 0.84 ± 0.17, 0.85 ± 0.17, 0.84 ± 0.17

Conclusion
RNAmountAlign is a new C++ software package for RNA local, global, and semiglobal sequence/structure alignment, which provides accuracy comparable with that of a number of widely used programs, but provides much faster run time.RNAmountAlign additionally computes E-values for local alignments, using Karlin-Altschul statistics, as well as p-values for normal, extreme value and gamma distributions by parameter fitting.

A Supplementary Information
A.1 Software usage RNAmountAlign performs local, semiglobal, and global sequence/structure alignments.By default the global alignment is computed unless flags -local or -semi are used to perform local and semiglobal alignments, respectively.In the simplest case, the program could be run with > ./ RNAmountAlign −f <i n p u t F a s t a > o r > ./ RNAmountAlign −s s e q 1 s e q 2 The parameters that were used to produce the results in the main text are used as the default by the software: structural similarity weight γ = 0.5, gap initiation g i = −3, and gap extension g e = −1.The weight factor γ defines the importance of structural similarity versus sequence similarity.When γ = 0 only sequence similarity is considered, while γ = 1 only uses the incremental ensemble mountain heights for the alignment.As an example, let's consider the following two toy sequences each forming a stem loop secondary structure >s e q 1 AAAAAAAAAACCCCCUUUUUUUUUU ( ( ( ( ( ( ( ( ( (   For local alignments either Karlin-Altschul statistics (default) or EVD fitting can be computed.Let's consider an example of a local alignment between two purine riboswitches with Rfam seed alignment length of 102 and sequence identity 0.58.Random flanking regions with the same nucleotide composition are added to the seed alignment as discussed in the main text to obtain two sequences of length 408 and 400.The local alignment between these two sequences has length 53 with extremely low E-value, with the property that all pairs in the local alignment are found in the reference seed alignment (P P V = 1).E-value from Karlin-Altschul statistics can be obtained very fast from the following command: Our software could also be used for searching a query sequence defined by -qf <fastaFile> in a target sequence defined by -tf <fastaFile>.The search computes semiglobal alignments of the query to sliding windows of the target, and returns the aligned segments of the target sorted by p-value.The query is aligned to windows of a fixed size defined by -window, sliding by steps defined by -step flag.To compute the statistics, random alignment scores are computed and fitted to ND.However, the software does not compute random alignments for each window separately as it would be very slow.Instead, following [21], the range of the GC-content of the target sequence over all the sliding windows is first obtained and binned using bin size defined by -gc.For each GC-content bin, fitting paremeters are precomputed by generating a number of random sequences whose GC-content is equal to the bin midpoint, aligning the query to random sequences, and fitting random alignment scores to normal distribution.For each sliding window the corresponding precomputed parameters are used for the computation of p-value.As an example, a random tRNA from Rfam 12.0 whose minimum free energy structure has the minimum base pair distance to the Rfam consensus structure was selected and used as the query to search E. coli K12 MG1655 genome using window size 300 and step size 200 by the following command.; for each bin 1000 random sequences whose GCcontent are equal to the average GC-content of the bins are generated, aligned to the query and their fitted location (mean) and scale (standard deviation) parameters are precomputed to be used for computation of p-values.From the top 20 hits of our software, the first 18 are reported to be tRNAs by tRNAscan-SE [24].
To see all the full parameter list for the software please use > ./ RNAmountAlign −h    (gap costs g i = −3, g i = −1, γ = 0.5, scaling factor α seq = 0.447648, shift term α str = 0.304766, γ = 1/2).However, this RNA is clearly not a tRNA, since the three loops are not within the scope of a multiloop, and the variable loop is located in the wrong position, and the large positional entropy suggests that there is not an unambiguous structure.Moreover, this sequence is not one of the 40 tRNA genes/pseudogenes on the plus-strand predicted by tRNAscan-SE [24].pA pC pG pU p ( p ) p• std ( std ) std• 0.00 0.00 0.00 1.00 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00 0.00 0.05 0.95 0.000533 0.000533 0.998933 0.000292 0.000292 0.000583 0.00 0.00 0.10 0.90 0.001396 0.001396 0.997209 0.000818 0.000818 0.001636 0.00 0.00 0.15 0.85 0.002704 0.002704 0.994592 0.001548 0.001548 0.003096 0.00 0.00 0.20 0.80 0.004785 0.004785 0.990431 0.002863 0.002863 0.005725 0.00 0.00 0.25 0.75 0.008039 0.008039 0.983922 0.004992 0.004992 0.009983 0.00 0.00 0. Table S3: Initial portion of a table that determines expected base pairing probabilities p ( , p • , p ) as a function of nucleotide probabilities p A , p C , p G , p U .The full table (not shown) has 1770 rows.To determine average base pairing probabilities, given nucleotide probabilities p A , p C , p G , p U , a total of N = 10000 RNA sequences of length n = 200 were randomly generated to have the given expected nucleotide frequency.To compute p ( [ resp.std ( ], a library call of function pf fold() from Vienna RNA Package [23] was made in order to determine P rob[i pairs to right] = n i=1 n j=i+1 p i,j for position in each sequence, and the average [ resp.standard deviation ] was taken over all sequences and values i = 1, . . ., n.In a similar fashion, p • and p ) were determined.

Figure 1 :
Figure 1: Ensemble mountain heights of 72 nt tRNA AL671879.2and 69 nt tRNA D16387.1,aligned together by RNAmountAlign.Since the BRAliBase 2.1 K2 reference (pairwise) alignment[9] has only 28% sequence identity, structural similarity parameter γ was set to 1 in our software RNAmountAlign, which returned the correct alignment.See Methods section for explanation of γ and the algorithm used by RNAmountAlign.
sequence b.Given sequences a = a 1 , . . ., a n and b = b 1 , . . ., b m , let A = m a (1) * • • • m a (N ) * m b (1) * • • • m b (N ) * denote an alignment between the incremental ensemble expected mountain height m a (1) • • • m a (n) of a and and the ensemble incremental expected mountain height m b (1) • • • m b (m) of b.Generalize structural distance d 0 defined in Eq (7) to d 1 defined by d 1 (a i , b j ) = |m a (i) − m b (j)|, where m a (i) and m b (j) are real numbers in the interval [−1, 1], and define ensemble structural alignment distance for A by summing d 1 (a i , b j

Figure 2
depicts the distribution of RIBOSUM85-60 [resp.STRSIM] values in this case, both before and after application of scaling factor α seq [resp.shift α str ] -recall that α seq and α str ] depend on p A , p C , p G , p U , p ( , p • , p ) of tRNA AL671879.2and p A , p C , p G , p U , p ( , p • , p ) of tRNA D16498.1.

n
denotes an alignment, where a i , b i ∈ {A, C, G, U, -}, and the aligned sequences include may contain gap symbols -provided that it is not the case that both a * i and b * i are gaps.The number TP of true positives [resp.FP of false positives] is the number of alignment pairs (a * i , b * i ) in the predicted alignment that belong to [resp.do not belong to] the reference alignment.The sensitivity (Sen) [resp.

Figure 4 :
Figure 4: Run time of pairwise global alignment for RNAmountAlign, LocARNA, LARA, FOLDALIGN, and DYNALIGN.(Left) Log run time is shown as a function of seed length for pairwise alignments in the BRAliBase 2.1 database used for benchmarking.Window size of 51 is used for the computation of moving average.(Right) Actual run time for RNAmountAlign and LARA on the same data.Unlike the left panel the actual run time is shown, rather than log run time, without any moving average taken.

Figures 3 ,
Figures 3, S2 and S3 depict running averages of pairwise global alignment F1-measure, sensitivity, and positive predictive value (PPV) for the software described in this paper, as well as for LocARNA, FOLDALIGN, LARA, DYNALIGN, and STRAL.For pairwise benchmarking, reference alignments of size 2, a.k.a.K2, were taken from the BRAliBase 2.1 database [9].BRAliBase 2.1 K2 data are based on seed alignments of the Rfam 7.0 database, and consist of 8976 alignments of RNA sequences from 36 Rfam families.Running averages of sensitivity, positive predictive value, and F1-measure, averaging over windows of size 11 nt (interval [k − 5, k + 5]), were computed as a function of sequence identity, where it should be noted that the number of pairwise alignments for different values of sequence identity can vary for the BRAliBase 2.1 data (e.g.there are only 35 pairwise alignments having sequence identity < 20%).Default parameters were used for all other software.For our software RNAmountAlign, gap initiation cost was -3, gap extension -1, , RNAmountAlign supports semiglobal and local alignments as well as reporting p-values.The right panel of Fig 4 depicts actual run times of the fastest software, RNAmountAlign, with the next fastest software, LARA.Unlike the graph in the left panel, actual run times are shown, graphed as a function of sequence length, rather than logarithms of moving averages.

Fig 5
Fig 5 shows fits of the relative frequency histogram of alignment scores with the normal (ND), extreme value (EVD) and gamma (GD) distributions, where local [resp.semiglobal] alignment scores are shown in the left [resp.right] panel.The EVD provides the best fit for local alignment sequence-structure similarity scores, as expected by Karlin-Altschul theo [19, 20].Moreover, Fig 6 shows a 96% correlation between (expect) E-values computed by our implementation of the Karlin-Altschul method, and E-values obtained by maximum likelihood fitting of local alignment scores.In contrast, the ND provides the best fit for semiglobal sequence/structure alignment similarity scores, at least for the sequence considered in Fig 5.This is not an isolated phenomenon, as shown in Fig 6, which depicts scatter plots, Pearson correlation values and sums of squared residuals (SSRs) when computing p-values for semiglobal (query search) alignment scores

Figure 5 :
Figure 5: Fits of 30-bin relative frequency histograms of scores for local (left), semiglobal (middle) and global (right) alignments produced by RNAmountAlign for the randomly chosen 5S rRNA AY544430.1:375-465from Rfam 12.0 database having A,C,G,U relative frequency of 0.25, 0.27, 0.26, 0.21.A total of 10,000 random sequences having identical expected nucleotide relative frequencies were generated, each of length 400 nt for local/semiglobal and 100 nt for global.Local (left), semiglobal (middle) and global (right) alignments were computed by RNAmountAlign, in each case fitting the data with the normal (ND), extreme value (EVD) and gamma (GD) distributions.As expected by Karlin-Altschul theory [19], local alignment scores are best fit by EVD, while semiglobal alignment scores are best fit by ND (results supported by data not shown, involving computations of variation distance, symmetrized Kullback-Leibler distance, and χ 2 goodness-of-fit tests).

Figure 6 :
Figure 6: (Left)Pearson correlation values and scatter plots for p-values of semiglobal alignment(query search) scores between Rfam sequences and random RNA.For each score in a set of 2.5 million global pairwise alignment scores, a p-value was computed by direct counts (heuristic), or by data fitting the normal (ND), extreme value (EVD), or gamma (GD) distributions.Pairwise Pearson correlation values were computed and displayed in the upper triangular portion of the figure, with sums of squared residuals shown in parentheses, and histograms of p-values along the diagonal.It follows that ND p-values correlate best with heuristic p-values, where the latter is assumed to be the gold standard.(Right)Scatter plot of expect values E ML , computed by maximum likelihood, following the method described in [21] (y-axis) and expect values E KA , computed by our implementation of the Karlin-Altschul, as described in the text.The regression equation is E ML = 0.1764 + 0.7991 • E KA ; Pearson correlation between E ML and E KA is 96%, with correlation p-value of 2 • 10 −16 .Expect values were determined from local alignment scores computed by the genome scanning form of RNAmountAlign with query tRNA AB031215.1/9125-9195and targets consisting of 300 nt windows (with 200 nt overlap) from E. coli str.K-12 substr.MG1655 with GenBank accession code AKVX01000001.1.From the tRNA query sequence, the values p A , p C , p G , p U for nucleotide relative frequencies, are determined, then average base pairing probabilities p ( , p • , p ) are computed by RNAfold -p [23].For the current 300 nt target window, the nucleotide relative frequencies p A , p C , p G , p U are computed, then precomputed probabilities p ( , p • , p ) are obtained from SI TableS3.From these values, scaling factor α seq and shift α str , were computed; with structural similarity weight γ = 1/2, the overall similarity function from Eq(15) in the text was determined.

Figure 7 :
Figure 7: Sum-of-pairs(SPS) score (left) and structural conservation index (SCI) (right) for multiple global alignments using RNAmountAlign, LARA, mLocARNA, FoldalignM and Multilign .SPS and SCI are shown as a function of average pairwise sequence identity(APSI) in the k5 BRAliBase 3 database used for benchmarking.

Figure 8 :
Figure 8: Run time of multiple global alignment for RNAmountAlign, mLocARNA and LARA, FoldalignM and Multilign.(Left) Log run time is as shown a function of reference alignment length for K5 alignments in Bralibase 3. (Right) Actual run time in seconds for mLocARNA and LARA.
, 0.77 ± 0.22, and 0.84 ± 0.19.The left panel of Fig 8 indicates the run time of all software on a logarithmic scale, while the right panel shows the actual run time in seconds for RNAmountAlign as well as that of the next two fastest algorithms, mLocARNA and LARA.This figure clearly shows that RNAmountAlign has much faster run time than all other software in our benchmarking tests, thus confirming the earlier result from pairwise benchmarking.

1 produces
. . . . . ) ) ) ) ) ) ) ) ) ) ( −2.1) >s e q 2 CCCCCCCAAAAGGGGGGG ( ( ( ( ( ( ( . . . . ) ) ) ) ) ) ) ( −15.7)Running the software considering only sequence similarity with gap initiation and extension penalties of -2 and -1, respectively, by the command> ./ RNAmountAlign −s AAAAAAAAAACCCCCUUUUUUUUUU CCCCCCCAAAAGGGGGGG −gamma 0 −g i −2 −g e −−−−−CCCCCCCAAAAGGGGGGG 18where four C nucleotides are aligned together, regardless of the fact that in the secondary structure for the first sequence, they are found in an apical loop region, while in the secondary structure for the second sequence, they are part of a stem.However, using -gamma 1 −−AAAAGGGGGGG −−− 18 where the opening, closing and unpaired bases are aligned to each other.Finally, using -gamma 0.−−−−−GGGGGGG 18 where both sequence and structural similarity are equally weighted.The default nucleotide similarity matrix is RIBOSUM85-60.Other RIBOSUM matrices are included in the software and can be selected with -m flag based on the user's knowledge of divergence of the input sequences.RNAmountAlign computes the consensus secondary structure by calling alifold() function from libRNA.a in the Vienna RNA Package when flag -alifold is used.For example the following command outputs the consensus structure in addition to the alignment for the same sequences indicated in Fig 1 of the main text.See Fig S1.

>
. / RNAmountAlign −f examples / t r n a .f a − a l i f o l d −g l o b a l Computation of alignment statistics depends on the alignment type.As discussed in the main text, local alignment scores follow extreme value distribution(EVD) while global and semiglobal scores tend to follow normal distribution(ND).Flag -stat can be set to compute both E-values and p-values, where the transformation between E-values and p-values is made by p = 1 − exp(−E).For global and semiglobal alignments, the first (query) sequence is aligned to a number of random RNAs, defined by -num flag, with the same nucleotide composition as the second sequence (target), then the random alignment scores are fitted to normal distribution and a p-value is returned.> ./ RNAmountAlign −f examples / t r n a .f a −g l o b a l −s t a t −num 100 As part of the output, p-value from ND normal fitting of 100 random alignment scores is reported: Normal d i s t r i b u t i o n E−v a l u e : 0 .0 4 7 6 1 4 8 Normal d i s t r i b u t i o n p−v a l u e : 0 .0 4 6 4 9 9 5 2 1 3 7 e −06 K a r l i n −A l t s c h u l p−v a l u e : 2 .5 2 1 3 7 e −06 Computation of E-value from EVD fitting is more accurate but slower: > ./ RNAmountAlign −f examples / RF00167 1 .raw − l o c a l −s t a t −evd −num 200 Extreme v a l u e d i s t r i b u t i o n E−v a l u e : 4 .4 1 4 1 7 e −05 Extreme v a l u e d i s t r i b u t i o n P−v a l u e : 4 .4 1 4 0 8 e −05 RNAmountAlign computes Karlin-Altschul E-values from maximum likelihood method described in the main text, and then multiplies it by the regression coefficient of 0.7991, indicated in the right panel of Fig 6, to obtain an estimated E-value.Therefore, there might be discrepancy between the EVD fitting and Karlin-Altschul E-values.For the most accurate statistics EVD fitting is recommended.

Figure S1 :
Figure S1: Consensus structure for the pairwise alignment indicated in Fig 1 of the main text.The consensus structure is computed by a calling function alifold() from Vienna RNA Package.The figure is obtained from RNAalifold web server.

Figure S2 :
Figure S2: Average sensitivity for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL and sequence-only alignments (γ = 0) for pairwise global alignment.Sensitivity is shown as a function of sequence identity for pairwise alignments in the BRAliBase 2.1 database used for benchmarking.

Figure S3 :
Figure S3: Average positive predictive value (PPV) for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL and sequence-only alignments(γ = 0) for pairwise global alignment.PPV is shown as a function of sequence identity for pairwise alignments in the BRAliBase 2.1 database used for benchmarking.

Figure S4 :
Figure S4: Average pairwise sensitivity (left) and positive predictive value (right) for multiple global alignments using RNAmountAlign, LARA, mLocARNA, FoldalignM and Multilign in the k5 BRAliBase 3 database used for benchmarking.Note that in our definition of Sen and P P V , pairs of the form (X, -) and (-, X) are also counted while SPS is the average pairwise sensitivity only considering aligned residue pairs (Fig 7).However, the results with and without gap counts, indicated in this Fig and Fig 7, respectively, are very close.

Figure S5 :
Figure S5: Illustration of a potential weakness of RNAmountAlign.Using RNAmountAlign genome-scanning software, semiglobal alignments of the query tRNA AB031215.1/9125-9195were made with each 300 nt window (successive window overlap of 200 nt) of the E. coli str.K-12 substr.MG1655 genome.This figure shows the MFE structure, color-coded by positional entropy [18], for the alignment of positions 696097-696164 with score −7.70, p-value of 4.145010 • 10 −6 .(gapcosts g i = −3, g i = −1, γ = 0.5, scaling factor α seq = 0.447648, shift term α str = 0.304766, γ = 1/2).However, this RNA is clearly not a tRNA, since the three loops are not within the scope of a multiloop, and the variable loop is located in the wrong position, and the large positional entropy suggests that there is not an unambiguous structure.Moreover, this sequence is not one of the 40 tRNA genes/pseudogenes on the plus-strand predicted by tRNAscan-SE[24].

Table 2 :
Overview of features in software used in benchmarking tests, where [resp.-] indicates the presence [resp.absence] of said feature, to the best of our knowledge.Average F1 [resp.SPS] scores for the pairwise [resp.multiple] global alignment are given, computed as explained in the text.
an alignment between two arbitrary secondary structures s, t of (possibly different) lengths n, m, where s * i , t * i ∈ { ( , •, ) , −} and − denotes the gap symbol.We define the structural alignment distance for A by summing d 0 (s * i , t * i ) over those positions i where neither character s * i , t * i is a gap symbol, then adding w(k) for all size k gaps in A. Using previous definitions of incremental ensemble expected mountain height from Eq (3), we can generalize structural alignment distance from the simple case of comparing two dot-bracket representations of secondary structures to the more representative case of comparing the low-energy Boltzmann ensemble of secondary structures for RNA sequence a to that of RNA

Table 5 :
Comparison of alignment length and positive predictive value (PPV) for pairwise local alignment by RNAmountAlign against the widely used local alignment software FOLDALIGN and LocARNA.Local alignment benchmarking was performed on 1500 pairwise alignments (75 alignments per family, 20 Rfam families) As indicated, six GC bins are generate in range [0.23 − 0.74]