Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.


Introduction
Since the 454 pyrosequencing technology [3] has been introduced to the market, the need for algorithms that efficiently map huge amounts of reads to reference genomes has rapidly increased. Later, high throughput sequencing (HTS) methods such as Illumina [4] and SOLiD (Applied Biosystems) have intensified the demand. The development of read mapping methods decisively depends on specifications and error models of the respective technologies. Unfortunately, little is known about specific error models, and models are likely to change as manufactures are constantly modifying chemistry and machinery. Increasing the read length is a key aim of all vendors -tolerating a trade-off with read accuracy. In a recent investigation on error models of 454 and Illumina technologies, it has been shown that 454 reads are more likely to include insertions and deletions while Illumina reads typically contain mismatches [5,6]. Currently available read mapping programs are specifically designed to allow for mismatches when aligning the reads to the reference genome. Most of the programs, e.g. MAQ [7], SOAP [8], SHRiMP [9] or ELAND (proprietary), use seeding techniques that gain their speed from pre-computed hash look-up tables. Some of these programs, in particular SOAP and MAQ, are specifically designed to map short Illumina or SOLiD reads. Longer sequences cannot be mapped by these tools. The matching models of MAQ, ZOOM [10], SOAP, SHRiMP, Bowtie [11], and ELAND focus on mismatches and largely neglect insertions and deletions. Indels are only considered during subsequent alignment steps but not while searching for seeds. With indels accounting for more than two thirds of all 454 sequencing errors, this is a major shortcoming for these kinds of reads [5]. Only PatMaN [12] and BWA [13] are able to handle a limited number of indels.
Mapping is aggravated by the manufacturers' overestimation of their read accuracies. While an overall error rate of 0.5% has been observed for 454, the error rate increases drastically for reads shorter than 80 bp and longer than 100 bp [5], leading to considerably larger error frequencies in real-life datasets. This implies that, sequencing projects aiming to find short transcripts such as miRNAs lose a substantial fraction of their data, unless a matching strategy is used that takes indels into account. In Illumina reads, error rates of up to 4% have been observed [6]. This differs significantly from Illumina's specification. Compared to 454, the frequency of indels is significantly lower. Moreover, differences between reads and reference genome might also occur due to genomic variations such as SNPs. We present a matching method that uses enhanced suffix arrays to compute exact and inexact seeds. Sufficiently good seeds subsequently trigger a full dynamic programming alignment. Our method is insensitive to errors and contaminations at the ends of a read including 39 and 59 primers and tags. The results section describes the basic ideas and an evaluation of our segemehl software implementing our method. The technical details of the matching model are described in the Methods section at the end of this contribution.

Outline of the Algorithmic Approach
A read aligner should deliver the original position of the read in the reference genome. Such a position will be called the true position in the following. Optimally scoring local alignments of the read and the reference genome can be used to obtain a possible true position, but because an alignment of the read with the reference genome at the true position does not always have an optimal score according to the chosen scoring scheme, this method does not always work. Nevertheless, there are no better approaches available unless further information about the read is at hand.
We present a new read mapping approach that aims at finding optimally scoring local alignments of a read and the reference genome. It is based on computing inexact seeds of variable length and allows to handle insertions, deletions (indels; gaps), and mismatches. Throughout the document the notion of differences refers to mismatches, insertions and deletions in some local alignment of the read and the reference genome, irrespective of whether they arise from technical artifacts or sequence variation. A single difference is either a single mismatch, a single character insertion or a single character deletion. Although not limited to a specific scoring scheme, we have implemented our seed search model in the program segemehl assigning a score of 1 to each match and a score of 21 to each mismatch, insertion or deletion. Our matching strategy derives from a simple and commonly used idea. Assume an optimally scoring local alignment of a read with the reference genome with exactly two differences. If the positions of the differences in the alignment are sufficiently far apart, we can efficiently locate exact seeds which in turn may deliver the position of the optimal local alignment in the reference genome. Likewise, if the distance between the two differences is small, two continuous exact matches at the ends of the read possibly allow to map the read to this position. To exploit this observation, the presented method employs a heuristic based on searches starting at all positions of the read. That is, for each suffix of the read the longest prefix match, i.e. the longest exact match beginning at the first position of the suffix with all substrings of the reference genome is computed. If the longest prefix match is long enough that it only occurs in a few positions of the reference genome, it may be feasible to check all these positions to verify if the longest prefix match is part of a sufficiently good alignment. While this approach works already well for many cases, we need to increase the sensitivity for cases where the computation of the longest prefix match fails to deliver a match at the position of the optimally scoring local alignment. This is the case when a longer prefix match can be obtained at another position of the reference genome by exactly matching characters that would result in a mismatch, insertion or deletion in the optimal local alignment (cf. Fig. 1). Therefore, during the computation of each longest prefix match we check a limited number of differences by enumerating at certain positions all possible mismatches and indels (cf. Fig. 2).
To efficiently compute the longest prefix matches, we exploit their properties for two consecutive suffixes of a read, i.e. for two suffixes starting at position i and i+1. If the suffix starting at position i has a longest prefix match of length ,, then the suffix starting at position i+1 has a longest prefix match of length at least ,21. For example, assume a read ACTGACTG. If the second suffix has a longest prefix match of length 4, i.e. CTGA, with the reference genome, we immediately see that the third suffix has a longest prefix match not shorter than 3-because we already know that the substring TGA exists in the reference genome. Using an enhanced suffix array of the reference sequence, we can easily exploit this fact and determine the longest prefix match of the next suffix without rematching the first ,21 characters. Likewise, the enumeration of mismatches and indels is also restricted to the remaining characters of the suffix in our model.
For each suffix of a read, we thus obtain a set of exact matches and alternative inexact matches and their respective positions in the reference sequence. These exact and inexact matches act as seeds. If a seed occurs more than t times in the reference genome, then it is omitted, where t is a user specified parameter (segemehl option -maxocc). The heuristics rigorously selects the exact or inexact seed with the smallest E-value, computed according to the Blast-statistics [14]. If this E-value is smaller than some user defined threshold (segemehl option -E), the bitvector algorithm of [1] is applied to a region around the genomic position of the seed to obtain an alignment of the read and the reference sequence. While the score based search for local alignment seeds controls the sensitivity of our matching model, the bitvector alignment controls its specificity: if the alignment has more matching characters than some user specified percentage a of the read (segemehl option -A) the corresponding genomic position is reported (see Methods).
The computation of the longest prefix match is implemented by a top-down traversal of a conceptual suffix interval tree, guided by the characters of the read. The suffix interval tree is equivalent to a suffix trie (see Methods). The traversal delivers a matching stem.

Author Summary
The successful mapping of high-throughput sequencing (HTS) reads to reference genomes largely depends on the accuracy of both the sequencing technologies and reference genomes. Current mapping algorithms focus on mapping with mismatches but largely neglect insertions and deletions-regardless of whether they are caused by sequencing errors or genomic variation. Furthermore, trailing contaminations by primers and declining read qualities can be cumbersome for programs that allow a maximum number of mismatches. We have developed and implemented a new approach for short read mapping that, in a first step, computes exact matches of the read and the reference genome. The exact matches are then modified by a limited number of mismatches, insertions and deletions. From the set of exact and inexact matches, we select those with minimum score-based Evalues. This gives a set of regions in the reference genome which is aligned to the read using Myers bitvector algorithm [1]. Our method utilizes enhanced suffix arrays [2] to quickly find the exact and inexact matches. It maps more reads and achieves higher recall rates than previous methods. This consistently holds for reads produced by 454 as well as Illumina sequencing technologies.
Note that for the DNA alphabet there are at most four edges outgoing from each node of the suffix interval tree. To introduce mismatches, the traversal is simply continued with alternative edges, i.e. edges diverging from the matching stem. To introduce insertions, the traversal is not regularly continued, but characters of the read are skipped. Deletions are simulated by skipping nodes of the suffix interval tree and continuing the search at their child nodes (see Methods). We refer to these alternative paths that branch off from the matching stem as branches. The maximum number of branches to be considered is controlled by the seed differences threshold k (segemehl option -D). Note, that while matching character by character along a suffix of a read, the number of branches is expected to decrease quickly.

Performance Tests
segemehl constructs indices either for each chromosome of a genome and the matching is performed chromosome-wise or, depending on the available RAM, chromosomes are combined to larger sequences. Compared to other methods, the index structure used by segemehl is significantly larger. For example, the enhanced suffix array of human chromosome 1 occupies approximately 3 GB of space. As it is stored on disk, the index only needs to be computed once. The construction of the index requires linear time. For example, on a single CPU, the construction of the complete enhanced suffix array for human chromosome 1 takes approximately 15 minutes. For our comparison, we ran segemehl with maximum occurrence parameter t = 500. The maximum E-value for seeds was set to 0.5 and minimum identity threshold to a = 85% which corresponds to a maximum of q0.15?mr differences in an alignment of the read of length m.
We compared segemehl to Bowtie v0.9.7 with option -all, BWA v0.2.0, MAQ v0.7.1, PatMaN v1.2.1 and SOAP v1.11 with option -r 2. MAQ and SOAP are based on ungapped alignments which are computed by hash lookups [7,8,13]. Due to length restrictions, MAQ is limited to Illumina (and SOLiD) reads. It additionally takes quality scores into account. The quality values needed by MAQ were, for all nucleotides, uniformly set to a value corresponding to the error rate. Bowtie [11] and BWA [13] index the reference genome with the Burrows-Wheeler transform. BWA allows a limited number of indels. PatMaN [12] matches the reads by traversing a nondeterministic suffix automaton constructed from the reference genome. Except for PatMaN, all programs only report matches with the smallest edit distance. BWA and Bowtie each need about 10 minutes to build their index. The fastq files needed by MAQ are built in approximately 2 minutes. PatMaN and SOAP require no indexing steps. The options for the other programs were chosen so as to achieve results similar to segemehl. For our comparison, we performed tests on simulated as well as real-life read data sets. For the simulation we generated read sets representing different error rates, types and distributions. We used three distinct error sets, one containing only mismatches, one containing only indels and a last one representing reads with mismatches and indels at a ratio of 1:1. Additionally, different error distributions were used to model error scenarios such as terminal contamination (e.g. linker, poly-A tails) or decreasing read quality. We chose uniform, 59, 39 and terminal error distributions.
Each simulated dataset contained 500 000 simulated reads, each of length 35 bp, sampled from a 50 MB large region of the human genome (chromosome 21). We introduced errors to each simulated read according to previously defined rates, error types and distributions. For the 50 MB region we constructed the indexes required for segemehl and Bowtie. For MAQ we constructed the index for the read set under consideration. Index construction took approximately one minute for Bowtie and BWA. The construction for the enhanced suffix array for segemehl took 3.5 minutes. The binary fastq files for MAQ were created in about 20 seconds.
We ran segemehl with seed differences threshold k = 0 and k = 1. For k = 0, only exact seeds are computed and for k = 1 seeds with at most one difference are computed. All programs were executed single-threaded on the same machine. The results for a uniform error distribution for mismatches only as well as for mismatches and indels are shown in Fig. 3. We measured the performance in terms of running time (Fig. 3 (A)) and recall rates, i.e. the percentage of reads mapped to the correct position. segemehl has recall rates of more than 95% (k = 1) and 80% (k = 0) in each setup with not more than two errors in the reads. With four uniformly distributed errors in the reads, the recall rate drops below 80% (k = 1) and 50% (k = 0), respectively. Hence, for k = 1 segemehl outperforms all other methods in terms of recall rates. For reads containing only mismatches and k = 0, segemehl is comparable to other methods (Fig. 3 (B)) while it has a significantly better recall rate as soon as insertions and deletions are involved (Fig. 3 (C)). As expected, the recall rate of most short read aligners drops if insertions and deletions are introduced into the reads. The running time of segemehl for k = 0 is comparable to other short read aligners. For k = 1, the running time increases by a factor of 10. Figure 1. Longest prefix matches may fail to deliver the position of the optimally scoring local alignment. Assume a simple scoring scheme that assigns a score of +1 to a single character match and a score of 0 to a single character mismatch, a single insertions or deletion. Using longest prefix matches bears the risk of ignoring differences in the best, i.e. optimally scoring, local alignment. Its retrieval fails if a longer match can be obtained at another position of the reference sequence by matching a character, that is inserted, deleted, or mismatched in the best local alignment. Depending on the length of the reference genome and its nucleotide composition the probability is determined by the length of the substring that can be matched to the position of the best local alignment before the first difference occurs. (A) The optimally scoring alignment of the read P: = cttcttcggc begins at position 3 of the reference genome S: = atacttcttcggcaga. Let P i denote the i th suffix of the read P. For each P i , the starting positions of the longest match in S comprise the position of P i in the best local alignment (solid green lines). That is, the longest match of P 0 begins at position 3, the longest match of P 1 begins at position 4, the longest match of P 2 begins at position 5 and so forth. (B) For the read P: = cttcgtcggc, the retrieval of the best local alignment fails for all P i , i,5 (dashed red line) due to the inclusion of a character that results in a mismatch in the optimally scoring local alignment. (C) The read P: = cttctgcggc contains, with respect to the best local alignment, a mismatch at position 5 of the read. Here the position 5 of the read is not included in the longest prefix match and nearly all substrings align correctly to the reference genome. doi:10.1371/journal.pcbi.1000502.g001  In contrast to Bowtie, BWA, MAQ, and SOAP, segemehl reports, by default, multiple matches for a read within the reference genome if the corresponding alignments have an E-value smaller than some user defined threshold. This behavior leads to an increase in the running time and a decrease in specificity. Compared to PatMaN, which is also able to report multiple matches, segemehl can cope with more than two differences and still is on average faster by a factor of 1.7 (k = 1) and 14 (k = 0). As expected, the worst segemehl results are seen for high error rates with a uniform error distribution (Fig. 4). Terminal, 39 and 59 error distributions yield better results, suggesting that segemehl implements a robust method that is insensitive to leading and trailing contaminations. Next, we compared segemehl, Bowtie and MAQ on two real-life data sets. We used Bowtie with optionall and MAQ with option -C 513 as suggested in the manuals to achieve maximum sensitivity. segemehl's sensitivity was controlled by option -M 500 to omit all seeds occurring more than 500 times in the reference sequence.
The data set ERR000475 of 20 million Illumina reads (length 45) for H. sapiens was downloaded from the NCBIs Short Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/). The second data set comprised about 40 000 short 454 reads from the arabidopsis mpss plus database (http://mpss.udel. edu/at/). The average length of the 454 reads was 23 bp.
We partitioned the 454-set into subsets of equal size, to satisfy input requirements for MAQ. An average quality value was assigned to each base.
Mapping multiple reads to a reference genome is a task which can easily be parallelized. Like all other methods, segemehl offers a parallelization option to run the program on multiple cores. segemehl runs for the ERR000475 dataset were carried out in eight parallel threads on a single machine with two Quadcore CPUs and 16GB of RAM. Seven enhanced suffix arrays were constructed representing the whole human genome. segemehl mapped 92% of the reads to the reference sequence while MAQ mapped 85% without and 89% with quality values. The corresponding values for Bowtie are 81% and 89%. The largest difference between the three tools is for the total number of exact matches. Although MAQ was, according to the manual, running in maximum sensitivity mode, segemehl computes 20 times more matches than MAQ (Tab. 1 (a)). Bowtie reports 2.5 billion matches which is much more than the two other tools. As expected, for the 454-set, the difference among the compared programs is even larger. While Bowtie is able to map 71% of all reads, segemehl achieves 95%. MAQ, a program explicitly designed for Illumina reads, matches 79% of the reads. Interestingly, compared to Bowtie, MAQ reports more matches with two mismatches. segemehl mainly achieves this result by mapping more reads with one or two errors. In fact, by allowing insertions and deletions segemehl doubles the number of reads matched at the unit edit distance of 1 ( Tab. 1 (b)).

Discussion
We have presented a novel read mapping approach that is able to efficiently handle 39 and 59 contaminations as well as mismatches, insertions and deletions in short and medium length reads. It is based on a matching model with inexact seeds containing mismatches, insertions and deletions. The sensitivity and specificity of our method is controlled by a maximum seed differences threshold, a maximum occurence threshold, an E-value threshold and an identity threshold. Compared to previous methods, our approach yields improved recall rates especially for reads containing insertions and deletions. Since indels have been reported to be the predominant error type in 454 reads, allowing for indels is most important to achieve a correct mapping. While PatMaN, by default, fully enumerates all matches with up to two differences, segemehl's heuristic reports only best-scoring matches. The price for the gain in sensitivity is an increase in running time: with k = 1 our method is approximately ten times slower than Bowtie, the fastest program in our comparison. As we used enhanced suffix arrays, matching against a large mammalian genome has to be done chromosome by chromosome when offthe-shelf hardware is used. However, the gain in sensitivity for reads with mismatches and the failure of other methods when dealing with indels may be, depending on the users demands, a reasonable trade off for these shortcomings. Our method is not limited to a specific technology or read length. Although quality values are not considered yet, the matching strategy can easily be adapted to evaluate low quality bases specifically. In principle, we show that for k = 0, i.e. exact seeds, our method is sufficiently sensitive to map reads with up to two differences. This is an interesting result since most of the current methods do not tolerate insertions and deletions. In summary, segemehl with k = 0 is among the fastest mapping algorithms. For k = 1, segemehl is able to achieve good recall rates beyond the two error barrier. This is especially interesting since manufacturers try to increase their read lengths at the cost of higher error rates. The increased sensitivity of the presented matching model, along with its ability to handle leading and terminal contaminations is a trade off for the large memory requirements of the enhanced suffix arrays. In the future, compressed index structures like the FM-index [15] may be a suitable framework to implement our matching model with smaller memory requirements.

Methods
Our strategy, based on enhanced suffix arrays, aims to find a best local alignment of short reads and reference sequences with respect to a simple scoring system. It does so by determining, for each suffix of the read, the longest prefix occurring as a substring in the reference sequence. This gives a matching backbone, from which a limited number of branches are derived by mismatches, insertions and deletions (Fig. 2). The concept of a matching backbone is equivalent to the concept of matching statistics introduced in [16]. We introduce the concept of matching backbone and branches via a conceptual tree of suffix intervals. Our heuristic approach delivers a small number of inexact seeds of variable length that are subsequently checked by the bitvector algorithm of Myers [1] to verify the existence of alignments with a limited number of differences. First, a short introduction to the basic notions for sequence processing and enhanced suffix arrays will be given, before the concept of suffix intervals is defined. Subsequently, we introduce our new matching strategy.

Basic Notions for Sequence Processing
We consider sequences over the DNA alphabet S DNA = {A, C, G, T, N}, where N denotes an undetermined base. In our approach the alignment of N with any character, including N itself, results in a mismatch.
Enhanced suffix arrays. First we introduce basic notions for the suffix array and enhanced suffix array. We then formally introduce the concept of a suffix interval.
Suppose that S is a sequence of length n. We index S from position 0. That is, S The concept of suffix arrays is based on lexicographically sorting the suffixes of S$. Suppose that the characters are ordered such that A,C,G,T,N,$. This character order induces an order on all non-empty suffixes of S$, which is captured in the suffix array. Formally, the suffix array suf of S is an array of integers in the range 0 to n, specifying the lexicographic order of the n+1 nonempty suffixes of S$. In other words, S suf[0] , S suf [1] , …, S suf[n] is the sequence of suffixes of S in ascending lexicographic order.
The lcp-  -table and the suffix link table. We now formally introduce the notion of suffix intervals that is at the heart of our matching strategy in enhanced suffix arrays.
An interval [l..r, h] is a suffix interval if the following holds: The notion of suffix intervals slightly generalizes the notion of lcp-intervals, as introduced in [2]. A suffix interval [l..r, h] of width at least 2 is an lcp-interval if, besides condition 1.-5. above, we additionally have lcp[i] = h for at least one i, l+1#i#r. This condition requires that at least one pair of consecutive suffixes in the suffix interval has a longest common prefix of length exactly h (Fig. 5). In other words, a suffix interval [l..r, h] of width 2 which is not an lcp-interval does not have a maximum lcp-value h, implying that [l..r, h+1] is also a suffix interval.
While suffix intervals correspond one-to-one to the nodes of a suffix trie for S$ (cf. [17]), lcp-intervals correspond to the branching nodes of a suffix tree for S$ (cf. [2]). Interpreting the additional condition for lcp-intervals for trees means that in suffix trees nodes with a single child are omitted, while they are allowed in suffix tries.   . The enhanced suffix array yields a tree structure of nested suffix intervals. The enhanced suffix array for the sequence S: = attcttcggc (left) and its suffix interval tree (right), equivalent to the suffix trie in Fig. 2, is shown. The array suf represents the lexicographical order of the suffixes in S$. In other words, S suf[0] , S suf [1] , …, S suf[n] is the sequence of suffixes of S$ in ascending lexicographic order. The lcp-