Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

doi:10.1371/journal.pcbi.1000502

Figure 1.

Longest prefix matches may fail to deliver the position of the optimally scoring local alignment.

Assume a simple scoring scheme that assigns a score of +1 to a single character match and a score of 0 to a single character mismatch, insertion or deletion. Using longest prefix matches bears the risk of ignoring differences in the best, i.e. optimally scoring, local alignment. Its retrieval fails if a longer match can be obtained at another position of the reference sequence by matching a character that is inserted, deleted, or mismatched in the best local alignment. The probability of such a failure depends on the length and nucleotide composition of the reference genome and on the length of the substring that can be matched at the position of the best local alignment before the first difference occurs. (A) The optimally scoring alignment of the read P := cttcttcggc begins at position 3 of the reference genome S := atacttcttcggcaga. Let P_i denote the i^th suffix of the read P. For each P_i, the starting positions of the longest match in S include the position of P_i in the best local alignment (solid green lines). That is, the longest match of P₀ begins at position 3, the longest match of P₁ begins at position 4, the longest match of P₂ begins at position 5 and so forth. (B) For the read P := cttcgtcggc, the retrieval of the best local alignment fails for all P_i, i<5 (dashed red line) due to the inclusion of a character that results in a mismatch in the optimally scoring local alignment. (C) The read P := cttctgcggc contains, with respect to the best local alignment, a mismatch at position 5 of the read. Here, position 5 of the read is not included in the longest prefix match and nearly all suffixes align correctly to the reference genome.
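The three cases in this caption can be checked with a small brute-force sketch. It is not the index-based method of the paper: it simply scans every reference position, and the function names `longest_prefix_matches` and `suffixes_hitting_alignment` are hypothetical names chosen for illustration. For each suffix P_i it asks whether one of the longest prefix matches starts at the position that P_i occupies in the best local alignment (position 3 + i).

```python
def longest_prefix_matches(read, ref):
    """Return (length, positions) of the longest prefix of `read` found in `ref`.

    Brute force: tries every start position in the reference.
    """
    best_len, best_pos = 0, []
    for start in range(len(ref)):
        k = 0
        while (k < len(read) and start + k < len(ref)
               and read[k] == ref[start + k]):
            k += 1
        if k > best_len:
            best_len, best_pos = k, [start]
        elif k == best_len and k > 0:
            best_pos.append(start)  # keep ties
    return best_len, best_pos


def suffixes_hitting_alignment(read, ref, align_start):
    """Indices i for which a longest match of suffix P_i starts at align_start + i."""
    hits = []
    for i in range(len(read)):
        _, positions = longest_prefix_matches(read[i:], ref)
        if align_start + i in positions:
            hits.append(i)
    return hits


S = "atacttcttcggcaga"
for P in ("cttcttcggc", "cttcgtcggc", "cttctgcggc"):
    print(P, suffixes_hitting_alignment(P, S, 3))
```

Under these assumptions, the read of panel (A) recovers the alignment from every suffix, the read of panel (B) only from suffixes with i ≥ 5, and the read of panel (C) from all suffixes except P₅.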

Figure 2.

Matching stems and matching branches.

We give an explanation based on a suffix trie, which is equivalent to the suffix interval tree shown in Fig. 5 (see Methods). The suffix trie for S$ with S := acttcttcggc (left) holds twelve leaves. Each numbered leaf corresponds to exactly one suffix in S. Nodes with only one child are not explicitly shown. Note that internal nodes implicitly represent all leaves in their respective subtree. Thus, internal nodes can be regarded as sets of suffixes. The right panel holds the longest matches for different matching paths in the trie. Matching the first three suffixes of the read P := cgtcggc results in three different paths in the suffix trie. Each path is equivalent to a sequence of suffix intervals, a matching stem, in the enhanced suffix array. Consider the matching stem for P_i, the i^th suffix of P: its q^th interval implicitly represents the set of suffixes in S matching P[i‥i+q−1]. The path for the first suffix P₀ is of length two (green solid line). Hence, the equivalent matching stem is a sequence of three intervals. Since the last of these intervals only represents the suffix S₇, the longest prefix match of P₀ is of length 2 occurring at position 7 of the reference sequence (right panel). The matching stem for P₁ (red solid line) ends with an interval representing the suffixes S₈ and S₉. Therefore, matches of length one occur at positions 8 and 9 in S. The longest prefix match for P₂ occurs at position 6 of S (dashed orange line). Note that several intervals of this matching stem equivalently represent S₆. An alternative path leads to a match with position 4. The branch denotes the alternative that accepts the mismatch of g and t at position 1 of P₀.
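The idea of a matching stem can be sketched on a plain suffix array. The following is a minimal illustration, not the implementation used by segemehl: the suffix array is built by naive sorting, intervals are narrowed by a linear scan rather than by child-table lookups, and the names `suffix_array` and `matching_stem` are hypothetical. Each stem entry is a triple (l, r, q), meaning that the suffixes at suffix array rows l‥r all start with the first q characters of the pattern.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text` in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def matching_stem(pattern, text, suf):
    """Sequence of intervals (l, r, q) obtained by matching `pattern` char by char.

    Stops at the first character that cannot be matched, so the last
    interval describes the longest prefix match of `pattern`.
    """
    l, r = 0, len(suf) - 1
    stem = [(l, r, 0)]  # root interval: the empty prefix matches everything
    for q, ch in enumerate(pattern):
        rows = [j for j in range(l, r + 1)
                if suf[j] + q < len(text) and text[suf[j] + q] == ch]
        if not rows:
            break
        l, r = rows[0], rows[-1]  # matching rows are contiguous in sorted order
        stem.append((l, r, q + 1))
    return stem


TEXT = "acttcttcggc" + "$"  # '$' sorts before the lowercase letters
SUF = suffix_array(TEXT)
P = "cgtcggc"

for i in range(3):  # the first three suffixes of the read
    stem = matching_stem(P[i:], TEXT, SUF)
    l, r, q = stem[-1]
    positions = sorted(SUF[l:r + 1])
    print(f"suffix {i}: stem={stem} length={q} positions={positions}")
```

For the first suffix of the read, the stem produced by this sketch has three intervals and ends in a singleton that represents the suffix starting at position 7, in line with the caption.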

Figure 3.

Comparison of recall rates and running time for several short read aligners.

Average running time for the different programs (A) in matching runs with 500 000 reads in two different data sets (logarithmic scale; S refers to segemehl). The differences are uniformly distributed and consist of only mismatches (B) or mismatches, insertions and deletions (C). The recall rate describes the fraction of reads that were mapped to the correct position. All programs were used with default parameters. Bowtie was called with option --all and SOAP with option -r 2.

Figure 4.

segemehl recall rates for varying difference values and distributions.

Recall rates are depicted for k = 0 (dashed) and k = 1 (solid). For terminal-, 3′- and 5′-increased difference distributions, segemehl achieves a recall rate above 80% for reads with 4 errors.

Table 1.

Comparison of the performance of Bowtie, MAQ, and segemehl on two real-life datasets.

Figure 5.

The enhanced suffix array yields a tree structure of nested suffix intervals.

The enhanced suffix array for the sequence S := attcttcggc (left) and its suffix interval tree (right), equivalent to the suffix trie in Fig. 2, are shown. The array suf represents the lexicographical order of the suffixes in S$. In other words, S_suf[0], S_suf[1], …, S_suf[n] is the sequence of suffixes of S$ in ascending lexicographic order. The lcp-table lcp is an array of integers such that for each h, 1≤h≤n, lcp[h] is the length of the longest common prefix of S_suf[h−1] and S_suf[h]. A suffix interval [l‥r, h] denotes an interval in the suffix array with lcp[i]≥h for all i, l+1≤i≤r, i.e. all suffixes in the interval [l‥r] share a common prefix of length at least h. Additionally, requiring l = 0 or lcp[l]<h makes the suffix interval left maximal and requiring r = n or lcp[r+1]<h makes it right maximal. The suffix interval [0‥10, 0] spans the whole suffix array and is equivalent to the root of a suffix interval tree. This interval contains five subintervals, one for each character in S$, with h = 1. Equivalently, the root node of the suffix interval tree has five children. Note that two children, labeled by 0 and 11, are singletons. The child nodes of singletons are not explicitly shown here.
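The definitions of suf, lcp and suffix intervals can be made concrete with a short sketch. It builds both tables naively (by sorting suffixes and comparing neighbours character by character), which is only practical for toy inputs and is not how a production index would be constructed. The function names `enhanced_suffix_array` and `child_intervals` are hypothetical, and the input is the sequence S stated in this caption.

```python
def enhanced_suffix_array(s):
    """Return (text, suf, lcp) for s + '$', built naively.

    suf[h] is the start of the h-th smallest suffix; lcp[h] is the length of
    the longest common prefix of suffixes suf[h-1] and suf[h] (lcp[0] = 0).
    """
    text = s + "$"  # '$' sorts before the lowercase letters
    suf = sorted(range(len(text)), key=lambda i: text[i:])
    lcp = [0] * len(text)
    for h in range(1, len(text)):
        a, b = text[suf[h - 1]:], text[suf[h]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[h] = k
    return text, suf, lcp


def child_intervals(lcp, l, r, h):
    """Split the suffix interval [l..r, h] into maximal subintervals at depth h + 1.

    A new child starts wherever neighbouring suffixes share fewer than
    h + 1 characters.
    """
    children, start = [], l
    for i in range(l + 1, r + 1):
        if lcp[i] < h + 1:
            children.append((start, i - 1, h + 1))
            start = i
    children.append((start, r, h + 1))
    return children


text, suf, lcp = enhanced_suffix_array("attcttcggc")
print("suf:", suf)
print("lcp:", lcp)
root = (0, len(suf) - 1, 0)
for l, r, h in child_intervals(lcp, *root):
    print(f"[{l}..{r}, {h}] starts with {text[suf[l]]!r}, suffixes {suf[l:r + 1]}")
```

With this input the root interval splits into five children, one per character of S$, two of which are singletons.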

Figure 6.

The branch closure.

The suffix interval [l‥r, h], representing some string of length h, is split into its children [l‥u, h+1], [u+1‥v, h+1] and [v+1‥r, h+1] by matching an additional character a∈{A, C, T}. We proceed building by matching the character C (solid bold green line). Beforehand, alternative suffix intervals are stored in , either representing mismatches (dashed red line), insertions (dashed dotted black line) or deletions (dotted blue line). holds suffix link intervals that in turn branch off from . The branch closure holds all such alternative intervals.

Figure 7.

Algorithm.

Enumeration of exact and inexact seeds.
