Fast Pairwise Structural RNA Alignments by Pruning of the Dynamical Programming Matrix

doi:10.1371/journal.pcbi.0030193

Figure 1.

The Cases of Equation 1

(A) Adds a conserved basepair. In the structure, a conserved basepair is indicated using ( and ).

(B) Adds an insert basepair. An insert basepair is indicated using < and >.

(D) Adds aligned unpaired nucleotides. An unpaired nucleotide is indicated with a “.”.

(F) Adds unpaired insert nucleotides.

(J) Equation 1J (shown) joins substructures into one structure. Due to the bifurcation constraint, the nucleotide at position i must be basepaired to a nucleotide in the subsequence i + 1, ..., m, and the nucleotides at positions m + 1 and j must also basepair with each other. The same constraints are placed on the corresponding nucleotides in the other sequence. See Figure 2 for extra details.

Figure 2.

The Bifurcation Constraint

(A) The allowed case. The first nucleotides of the left substructures in both sequence 1 and sequence 2 are basepaired, and the first and last nucleotides in the right substructures are basepaired to each other.

(B) A disallowed case. The score of the joined alignment will not be calculated using the bifurcation-loop calculation (the resulting alignment will, however, be calculated by expanding the alignment result from (A)). There are two reasons: the first nucleotide of the left substructure in sequence 2 is not basepaired, and the first and last nucleotides of the right substructures are not basepaired to each other in both sequences.
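The constraint described in this caption can be checked mechanically on an annotated structure. The following is a minimal sketch, not FOLDALIGN's implementation. It assumes 0-based positions, the ( ) < > . notation from Figure 1, and a split given as (i, m, j) per sequence; the function names are hypothetical.

```python
def pair_table(structure):
    """Return {position: partner} for every basepaired position (0-based)."""
    openers, closers = "(<", ")>"
    stack, partner = [], {}
    for pos, symbol in enumerate(structure):
        if symbol in openers:
            stack.append(pos)
        elif symbol in closers:
            start = stack.pop()  # nearest unmatched opener; assumes balanced input
            partner[start], partner[pos] = pos, start
    return partner


def satisfies_constraint(structure, i, m, j):
    """Check one sequence: i pairs inside i+1..m, and m+1 pairs with j."""
    partner = pair_table(structure)
    left_ok = i < partner.get(i, -1) <= m      # first nucleotide of left part is paired within it
    right_ok = partner.get(m + 1) == j         # right part is closed by the pair (m+1, j)
    return left_ok and right_ok


def join_allowed(struct1, split1, struct2, split2):
    """The bifurcation case applies only if both sequences satisfy the constraint."""
    return (satisfies_constraint(struct1, *split1)
            and satisfies_constraint(struct2, *split2))
```

With the split (0, 4, 9), the structure "(..).(...)" passes the check, while ".(.)..(..)" fails it for both reasons given in (B): position 0 is unpaired, and positions 5 and 9 are not paired to each other.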

Table 1.

The Parameters of the Default Score Matrices for Local and Global Alignment

Figure 3.

The Divide and Conquer Algorithm

(1) In the first step, the region of interest is found using the local-alignment algorithm. During the local alignment, the bifurcation constraint is used to reduce memory use substantially.

(2) In the second step (first step of a global alignment), the region of interest is realigned. The bifurcation constraint is still used, but an additional list of branch points is made. The list stores the six coordinates of the branch point and pointers to the next branch points (one for each of the two substructures). For each subalignment, a pointer to the last branch point is kept.

(3) Using the branch point pointers and coordinates, the alignment is split into shorter unbranched segments. Here A is an initial unbranched segment. MBP 1 (Multi Branch Point) splits the alignment into two segments: EDC and B. C is a new initial unbranched segment. MBP 2 splits the ED segment into the E and D segments.

(4) The five segments are realigned and backtracked without using the bifurcation constraint to save memory. These realignments are fast, as it is not necessary to evaluate the bifurcation part of the algorithm. Finally, the segments are joined together into the final alignment.
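The branch point list of step (2) and the split of step (3) can be illustrated with a small data structure. This is a sketch under stated assumptions, not the program's actual code: the field names (i, m, j for sequence 1 and k, n, l for sequence 2), the window tuples, and the recursive splitting rule are one reading of the caption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BranchPoint:
    # Six coordinates (assumed naming): in sequence 1 the left substructure
    # spans i..m and the right one m+1..j; k, n, l play the same role in sequence 2.
    i: int
    m: int
    j: int
    k: int
    n: int
    l: int
    left: Optional["BranchPoint"] = None   # next branch point inside the left substructure
    right: Optional["BranchPoint"] = None  # next branch point inside the right substructure


def split_segments(outer, bp):
    """Split the window outer = (i, j, k, l) into unbranched segments.

    Each segment is (outer_window, inner_window); inner_window is the region
    handed over to a branch point, or None if the segment contains no branch.
    """
    if bp is None:
        return [(outer, None)]
    segments = [(outer, (bp.i, bp.j, bp.k, bp.l))]
    segments += split_segments((bp.i, bp.m, bp.k, bp.n), bp.left)
    segments += split_segments((bp.m + 1, bp.j, bp.n + 1, bp.l), bp.right)
    return segments


# Hypothetical coordinates mirroring the figure: MBP 2 lies inside the
# left substructure of MBP 1, which gives five unbranched segments.
mbp2 = BranchPoint(10, 22, 35, 10, 22, 35)
mbp1 = BranchPoint(5, 40, 54, 5, 40, 54, left=mbp2)
segments = split_segments((0, 59, 0, 59), mbp1)
```

Two branch points yield five segments here, matching the count in the figure: one initial segment, plus two new segments for each branch point.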

Figure 4.

The Average Run Time

(A) Time requirements as a function of λ. The “real data” curves were made using an SRP dataset containing eight pairs of 1,000-nucleotide-long sequences. Each sequence contains one SRP gene approximately 300 nucleotides long. The “shuffled data” curves were made using shuffled versions of the same dataset. The “2.0” curves were made using the previous version of the program. The “No pruning” curves were made using the current version with pruning disabled (option -no_pruning). The “Pruning” curves were made using the current version with the default pruning values. Without pruning, the alignment time grows steeply with λ, whereas with pruning the run time remains much lower.

(B) The pruning curves with a smaller time scale. The curve for the sequences containing a motif grows much faster than the curves for the shuffled sequences until the point where λ is as long as the motif. After this, the curves appear to grow at the same rate.

Figure 5.

The Average Memory Requirements

Memory requirements as a function of λ. See Figure 4 for details.

Table 2.

The Localization Performance

Figure 6.

The Time Gained by Using Pruning

The global-alignment time without pruning divided by the alignment time with pruning, as a function of the length difference between the input sequences. The SRP dataset and δ = 25 were used. The curve shows the average gain. The points are the individual measurements.
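The quantity plotted here is a simple ratio, which the sketch below computes. It assumes that the average gain is the mean of the per-pair ratios at each length difference; the function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pruning_gain(measurements):
    """Average time gain per length difference.

    measurements: iterable of (length_difference, seconds_without_pruning,
    seconds_with_pruning), one entry per aligned sequence pair.
    """
    gains = defaultdict(list)
    for diff, t_without, t_with in measurements:
        gains[diff].append(t_without / t_with)  # individual point in the plot
    # Averaging the ratios per length difference is an assumption about the curve.
    return {diff: mean(ratios) for diff, ratios in sorted(gains.items())}
```

For example, invented timings of (0, 10.0, 2.0), (0, 30.0, 10.0), and (5, 8.0, 4.0) give individual gains of 5, 3, and 2, and average gains of 4 at length difference 0 and 2 at length difference 5. These numbers are illustrative only and are not taken from the paper.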

Figure 7.

Structure Correlation Coefficient

The structure correlation coefficient as a function of sequence identity. The dataset consists of tRNAs and 5S rRNAs; for details see [12]. The difference in correlation coefficient for “FOLDALIGN” and “FOLDALIGN without pruning” at identities 0.1 and 0.15 is due to three sequence pairs for which no alignment and structure are produced. Rerunning FOLDALIGN for those three pairs without pruning should be trivial. “Stemloc” did not produce an alignment for two pairs. For both FOLDALIGN and Stemloc, the pairs which did not produce an alignment are counted as zero.

Table 3.

Pairwise Methods for Predicting Structures and Alignments of RNAs
