Figure 1.
Longest prefix matches may fail to deliver the position of the optimally scoring local alignment.
Assume a simple scoring scheme that assigns a score of +1 to a single character match and a score of 0 to a single character mismatch, insertion or deletion. Using longest prefix matches bears the risk of ignoring differences in the best, i.e. optimally scoring, local alignment. Its retrieval fails if a longer match can be obtained at another position of the reference sequence by matching a character that is inserted, deleted, or mismatched in the best local alignment. Depending on the length of the reference genome and its nucleotide composition, the probability of such a failure is determined by the length of the substring that can be matched at the position of the best local alignment before the first difference occurs. (A) The optimally scoring alignment of the read P := cttcttcggc begins at position 3 of the reference genome S := atacttcttcggcaga. Let Pi denote the ith suffix of the read P. For each Pi, the starting positions of the longest match in S comprise the position of Pi in the best local alignment (solid green lines). That is, the longest match of P0 begins at position 3, the longest match of P1 begins at position 4, the longest match of P2 begins at position 5, and so forth. (B) For the read P := cttcgtcggc, the retrieval of the best local alignment fails for all Pi, i<5 (dashed red line), because the longest match includes a character that is a mismatch in the optimally scoring local alignment. (C) The read P := cttctgcggc contains, with respect to the best local alignment, a mismatch at position 5 of the read. Here, position 5 of the read is not included in the longest prefix match, and nearly all substrings align correctly to the reference genome.
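The three panels can be checked with a brute-force scan. The following Python sketch is only an illustration under simple assumptions and is not the index-based method of the paper; the function names `longest_prefix_matches` and `failing_suffixes` are hypothetical. For each suffix Pi it collects the start positions of the longest prefix match in S and reports every i for which position 3+i is not among them.

```python
def longest_prefix_matches(pattern, text):
    """Return (length, positions) of the longest prefix of pattern found in text."""
    best_len, best_pos = 0, []
    for start in range(len(text)):
        length = 0
        # Extend the match character by character until the first difference.
        while (length < len(pattern) and start + length < len(text)
               and pattern[length] == text[start + length]):
            length += 1
        if length > best_len:
            best_len, best_pos = length, [start]
        elif length == best_len and length > 0:
            best_pos.append(start)
    return best_len, best_pos


def failing_suffixes(read, reference, alignment_start):
    """Indices i whose longest prefix match misses position alignment_start + i."""
    failed = []
    for i in range(len(read)):
        _, positions = longest_prefix_matches(read[i:], reference)
        if alignment_start + i not in positions:
            failed.append(i)
    return failed


S = "atacttcttcggcaga"
print(failing_suffixes("cttcttcggc", S, 3))  # panel A → []
print(failing_suffixes("cttcgtcggc", S, 3))  # panel B → [0, 1, 2, 3, 4]
print(failing_suffixes("cttctgcggc", S, 3))  # panel C → [5]
```

In panel C only the suffix that starts directly at the mismatched character fails, which agrees with the statement that nearly all substrings align correctly.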
Figure 2.
Matching stems and matching branches.
We give an explanation based on a suffix trie, which is equivalent to the suffix interval tree shown in Fig. 5 (see Methods). The suffix trie for S$ with S := acttcttcggc (left) holds twelve leaves. Each numbered leaf corresponds to exactly one suffix in S. Nodes with only one child are not explicitly shown. Note that internal nodes implicitly represent all leaves in their respective subtrees. Thus, internal nodes can be regarded as sets of suffixes. The right panel holds the longest matches for different matching paths in the trie. Matching the first three suffixes of the read P := cgtcggc results in three different paths in the suffix trie. Each path is equivalent to a sequence of suffix intervals, a matching stem, in the enhanced suffix array. Let Pi denote the ith suffix of P. The qth interval in the matching stem for Pi implicitly represents the set of suffixes in S matching P[i‥i+q−1]. The path for the first suffix P0 is of length two (green solid line). Hence, the equivalent matching stem is a sequence of three intervals, for q = 0, 1 and 2. Since the last of these only represents the suffix S7, the longest prefix match of P0 is of length 2, occurring at position 7 of the reference sequence (right panel). The matching stem for P1 (red solid line) ends with the interval for q = 1. Therefore, matches of length one occur at positions 8 and 9 in S. The longest prefix match for P2 occurs at position 6 of S (dashed orange line). Note that all intervals of its matching stem with q ≥ 3 equivalently represent S6. An alternative path for P0 leads to a match at position 4. The corresponding branch denotes the alternative that accepts the mismatch of g and t at position 1 of P0.
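The paths described in this caption can be traced with plain sets of suffix positions. The Python sketch below is a simplification for illustration only: it stores each step as an explicit set of start positions instead of a suffix interval, and the function name `matching_stem` is hypothetical. Entry q of the returned list holds every position of S$ whose suffix starts with the first q characters of the given read suffix.

```python
def matching_stem(read_suffix, reference):
    """List of position sets; entry q holds all starts matching read_suffix[:q]."""
    text = reference + "$"            # sentinel, as in the suffix trie for S$
    current = set(range(len(text)))   # q = 0: the empty string matches every suffix
    stem = [current]
    for q, ch in enumerate(read_suffix):
        # Keep only the suffixes that can be extended by the next read character.
        nxt = {p for p in current if p + q < len(text) and text[p + q] == ch}
        if not nxt:
            break                     # the path in the trie ends here
        stem.append(nxt)
        current = nxt
    return stem


S = "acttcttcggc"
P = "cgtcggc"
for i in range(3):
    stem = matching_stem(P[i:], S)
    print(i, len(stem) - 1, sorted(stem[-1]))
# → 0 2 [7]
# → 1 1 [8, 9]
# → 2 5 [6]

# Accepting the mismatch of g and t at position 1 of P0 (the alternative path):
alternative = P[:1] + "t" + P[2:]
print(sorted(matching_stem(alternative, S)[-1]))  # → [4]
```

Each printed line gives the suffix index, the length of the longest prefix match, and the positions at which it occurs.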
Figure 3.
Comparison of recall rates and running time for several short read aligners.
Average running time for the different programs (A) in matching runs with 500 000 reads in two different data sets (logarithmic scale; S refers to segemehl). The differences are uniformly distributed and consist of only mismatches (B) or of mismatches, insertions and deletions (C). The recall rate describes the fraction of reads that were mapped to the correct position. All programs were used with default parameters. Bowtie was called with option --all and SOAP with option -r 2.
Figure 4.
segemehl recall rates for varying difference values and distributions.
Recall rates are depicted for k = 0 (dashed) and k = 1 (solid). For terminal-, 3′- and 5′-increased difference distributions, segemehl achieves a recall rate above 80% for reads with 4 errors.
Table 1.
Comparison of the performance of Bowtie, MAQ, and segemehl on two real-life datasets.
Figure 5.
The enhanced suffix array yields a tree structure of nested suffix intervals.
The enhanced suffix array for the sequence S := attcttcggc (left) and its suffix interval tree (right), equivalent to the suffix trie in Fig. 2, are shown. The array suf represents the lexicographical order of the suffixes in S$. In other words, Ssuf[0], Ssuf[1], …, Ssuf[n] is the sequence of suffixes of S$ in ascending lexicographic order. The lcp-table lcp is an array of integers such that for each h, 1≤h≤n, lcp[h] is the length of the longest common prefix of Ssuf[h−1] and Ssuf[h]. A suffix interval [l‥r, h] denotes an interval in the suffix array with lcp[i]≥h for all i, l+1≤i≤r, i.e. all suffixes in the interval [l‥r] share a common prefix of length at least h. Additionally, requiring l = 0 or lcp[l]<h makes the suffix interval left maximal, and requiring r = n or lcp[r+1]<h makes it right maximal. The suffix interval [0‥10, 0] spans the whole suffix array and is equivalent to the root of a suffix interval tree. This interval contains five subintervals, one for each character in S$, with h = 1. Equivalently, the root node of the suffix interval tree has five children. Note that two children, labeled by 0 and 11, are singletons. The child nodes of singletons are not explicitly shown here.
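The definitions of suf, lcp and suffix intervals can be made concrete with a small Python sketch. It is a naive construction for illustration (it sorts the suffixes directly, which a production index would not do), and the names `enhanced_suffix_array`, `char_at` and `child_intervals` are hypothetical. The sequence is passed as a parameter; the usage lines take the string exactly as written in this caption.

```python
def enhanced_suffix_array(s):
    """Return (text, suf, lcp) for s + "$" using naive suffix sorting."""
    text = s + "$"  # "$" sorts before a, c, g, t in ASCII
    suf = sorted(range(len(text)), key=lambda i: text[i:])
    lcp = [0] * len(text)
    for h in range(1, len(text)):
        a, b = text[suf[h - 1]:], text[suf[h]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[h] = k  # longest common prefix of neighbouring suffixes
    return text, suf, lcp


def char_at(text, pos):
    """Character at pos, or "" past the end of the text."""
    return text[pos] if pos < len(text) else ""


def child_intervals(text, suf, l, r, h):
    """Split the suffix interval [l..r, h] by the character at depth h."""
    children, start = [], l
    for i in range(l + 1, r + 2):
        # Close the current child when the interval ends or the character changes.
        if i > r or char_at(text, suf[i] + h) != char_at(text, suf[start] + h):
            children.append((start, i - 1, h + 1))
            start = i
    return children


text, suf, lcp = enhanced_suffix_array("attcttcggc")
root_children = child_intervals(text, suf, 0, len(suf) - 1, 0)
print(suf)
print(lcp)
print(len(root_children))  # → 5, one child per character of S$
```

A child (l, r, h+1) with l = r is a singleton; its only suffix starts at position suf[l] of the sequence.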
Figure 6.
The suffix interval [l‥r, h], representing some string of length h, is split into its children [l‥u, h+1], [u+1‥v, h+1] and [v+1‥r, h+1] by matching an additional character a∈{A, C, T}. We proceed building the matching stem by matching the character C (solid bold green line). Beforehand, alternative suffix intervals are stored in a branch, either representing mismatches (dashed red line), insertions (dashed dotted black line) or deletions (dotted blue line). A further set holds suffix link intervals that in turn branch off. The branch closure holds all such alternative intervals.
Figure 7.
Enumeration of exact and inexact seeds.