Conceived and designed the experiments: SMR PL AS MB. Performed the experiments: SMR PL AVD MF. Analyzed the data: SMR PL AVD MF AS MB. Contributed reagents/materials/analysis tools: AS. Wrote the paper: SMR PL AVD MF AS MB.
The authors have declared that no competing interests exist.
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25–70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced
Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data. NGS machines, such as Illumina/Solexa and AB SOLiD, are able to sequence genomes more cheaply by 200-fold than previous methods. One of the main application areas of NGS technologies is the discovery of genomic variation within a given species. The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known (“reference”) genome. Differences between the reference and the reads are indicative either of polymorphisms, or of sequencing errors. Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes. However, these algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms. We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism. Using
Next generation sequencing (NGS) technologies are revolutionizing the study of variation among individuals in a population. The ability of sequencing platforms such as AB SOLiD and Illumina (Solexa) to sequence one billion basepairs (gigabase) or more in a few days has enabled the cheap re-sequencing of human genomes, with the genomes of a Chinese individual
NGS technologies are also opening the door to the study of population genomics of non-model individuals in other species. Various organisms have a wide range of polymorphism rates - from 0.1% in humans to 4.5% in the marine ascidian
Previous short read mapping tools typically allow for a fixed number of mismatches by separating a read into several sections and requiring some number of these to match perfectly, while others are allowed to vary
This becomes especially apparent when working with data from Applied Biosystem's SOLiD sequencing platform (AB SOLiD). AB SOLiD uses a di-base sequencing chemistry that generates one of four possible calls (colors) for each pair of nucleotides. While a sequencing error is a change of one color-call to another, a single SNP will change two adjacent color positions. Hence a read with two (non-adjacent) SNPs and a sequencing error will differ from the reference genome in five different positions. Simultaneously, the nature of the di-base sequencing code allows for the identification (and correction) of sequencing errors, so by carefully analyzing the exact sequence of matches and mismatches within a read, it is possible to determine that the read and the genome differ by two SNPs. While efficient mappers for color-space sequences have been developed
In this paper we develop algorithms for the mapping of short reads to highly polymorphic genomes and methods for the analysis of the mappings. We demonstrate an algorithm for mapping short reads in the presence of a large amount of polymorphism. By employing a fast k-mer hashing step and a simple, very efficient implementation of the Smith-Waterman algorithm, our method conducts a full alignment of each read to all areas of the genome that are potentially homologous. Secondly, we introduce a novel, specialized algorithm for mapping di-base (color-space) reads, which allows for an accurate, non-heuristic alignment of AB SOLiD reads to a reference genome. Finally, we introduce methodology for evaluating the accuracy of discovered alignments. Because a read may match the genome in several locations with variable amounts of polymorphism, we develop a statistical method for scoring the hits, allowing for the selection of the most probable variants, and filtering of false positives.
Our methods are implemented as part of SHRiMP: the SHort Read Mapping Package. To demonstrate the usefulness of SHRiMP we re-sequenced a Japanese
This section is organized as follows: we begin with three methodological sections, in which we first present an overview of the algorithms used in SHRiMP for mapping short reads, explain our specialized algorithm for alignment of di-base sequencing (AB SOLiD) data, and present our framework for computing p-values and other statistics for alignment quality. The data flow for these methods is illustrated in
Candidate mapping locations are first discovered by the seed scanner and then validated by the vectorized Smith-Waterman algorithm, computing only a score. Top scoring hits are then fully aligned by a platform-specific algorithm (i.e. letter-space for Solexa data and color-space for SOLiD data). Statistical confidence for the final mappings are then computed using the PROBCALC utility.
The SHRiMP algorithm draws upon three recent developments in the field of sequence alignment: q-gram filter approaches, introduced by Rasmussen et al
Most heuristic methods for local alignment rely on the identification of seeds – short exact matches between the two sequences. The advantage of using exact matches is that they are easy to find using hash tables, suffix arrays, or related techniques. While classically seeds have been contiguous matches, more recently “spaced” seeds, where
While most older local alignment tools, such as BLAST, use a single matching seed to start a thorough comparison of the strings around the seed, more recently Rassmussen et al
If a particular read has the required number of seeds matching to a window of the genome we conduct a rapid alignment of the two regions to verify the similarity. This alignment is done using the classical Smith-Waterman algorithm
After we finish aligning all of the reads to all of the potential locations, we conduct a final, full alignment of each read to all of the top hits. This final alignment stage differs depending on the specifics of the sequencing technology. Within SHRiMP we have implemented separate final alignment modules for Illumina/Solexa data (this is done with the regular Smith-Waterman algorithm) and for color-space (di-base) data produced by the AB SOLiD instrument (described in the next section). Additionally we have an experimental module for alignment of two-pass sequencing data, where two reads are generated from every genomic location, which is described elsewhere
The AB SOLiD sequencing technology introduced a novel dibase sequencing technique, which reads overlapping pairs of letters and generates one of four colors (typically labelled 0–3) at every stage. Each base is interrogated twice: first as the right nucleotide of a pair, and then as the left one. The exact combinations of letters and the colors they generate are shown in
A: The standard representation, with the first and second letter of the queried pair along the horizontal and vertical axes, respectively. B: The equivalent Finite State Automaton representation, with edges labelled with the readouts and nodes corresponding to the basepairs of the underlying genome.
The AB SOLiD sequencing technology has the remarkable property of differentiating between sequencing errors and biological SNPs (under the assumption that the reference genome has no sequencing errors): a SNP changes two adjacent readouts of the color-space code, while a sequencing error is unlikely to happen at two adjacent positions by chance (the technology does not sequence adjacent letters at adjacent time points). At the same time, however, the color-space code introduces certain complexities.
Let us consider a comparison done by first translating the color-space read code into the letter-space sequence. Notice that a single sequencing error would cause every position after the place of error to be mistranslated (
The reference genome is labeled G and the read R. A: A perfect alignment; B: In case of a sequencing error (the 2 should have been read as a 0) the rest of the read no longer matches the genome in letter-space; C: In case of a SNP two adjacent colors do not match the genome, but all subsequent letters do match. However, D: only 3 of the 9 possible color changes represent valid SNPs; E: the rules for deciding which insertion and deletion events are valid are even more complex, as indels can also change adjacent color readouts.
There is an additional drawback to translating the genome into color-space code: a sequence of matches and mismatches in color-space does not map uniquely into letter-space similarity. For example, a single SNP results in two sequential color-space mismatches. However, given two consecutive colors, there are 9 possible ways to generate two mismatches. Of these, only 3 correspond to a SNP, while the rest lead to DNA strings that completely differ from the reference. This is illustrated in
We propose an alternate approach. Our key observation is that while a color-space error causes the rest of the sequence to be mistranslated, the genome will match one of the other three possible translations. This is illustrated in
A: The Dynamic Programming (DP) representation, B: recurrences, and C: alignment of a letter space sequence to a color-space read with a sequencing error. Within the DP matrix we simultaneously align all of the four possible translations (vertical) to the reference genome (horizontal); however the alignment can transition between translations by paying the crossover penalty. This is illustrated by the fourth recurrence, where the third index (
In the SHRiMP algorithm, we only apply the special color-space Smith-Waterman algorithm in the final stage. For the initial stages, we convert the genome from letter-space to color-space, and search for k-mer matches as well as perform vectorized Smith-Waterman strictly in color-space. In order to better incorporate SNPs in color-space data, we use a spaced seed that allows for two adjacent mismatching colors between the read and the reference genome.
Once all of the reads are mapped, for every read and mate-pair we compute mapping confidence statistics. Initially these are computed for each read; however, they are then combined to compute likelihoods of accidental matches for mate-pairs.
While a very thorough statistical theory for local alignments has been established
The
Our second computation,
Finally, we define the quality measurement of this hit as the
SHRiMP also assigns mate-pair confidence values (akin to the read confidence values predicted by probcalc) by combining the confidence values for individual reads with emprically observed distributions of insert sizes in the library. We compute the distribution of the mapped distances (distance between the mapped positions of the two reads)
A discussion of the implementation steps are included in the SHRiMP README, and a more detailed discussion of the statistical values is included in
In our experiments, we used SHRiMP to compare 135 million 35 bp reads from a tunicate
One of the advantages of the SHRiMP algorithm is the seamless parallelism provided by the fact that we can simply subdivide the reads into separate computational jobs, without affecting the results. This allows us to take full advantage of compute clusters regardless of the amount of memory available at each machine. We took a random subset consisting of 500,000 35 bp
The running times at several parameter settings are summarized in
K-mer | (7,8) | (8,9) | (9,10) | (10,11) | (11,12) | (12,13) |
% K-mer Scan | 10.1% | 16.5% | 18.9% | 13.4% | 9.8% | 7.4% |
% Vectorized SW Filter | 88.8% | 75.4% | 49.8% | 30.2% | 20.1% | 14.9% |
% Full SW Alignment | 1.1% | 8.0% | 30.7% | 55.5% | 68.8% | 76.2% |
Time | 1 d21 h34 m | 6 h18 m | 1 h36 m | 50 m28 s | 37 m52 s | 32 m32 s |
In all cases, two k-mer hits were required within a 41 bp window to invoke the vectorized Smith-Waterman filter.
The primary strength of SHRiMP and other mapping methods based on Smith-Waterman alignments is the ability to map reads containing complex patterns of sequence variation, including insertions, deletions and clusters of closely-space SNPs. Mappers that exclusively produce ungapped alignments can only find SNPs. Furthermore they are more likely to miss dense clusters of SNPs, since the overlapping reads contain many mismatches, and SNPs adjacent to an indel, since only a small fraction of the overlapping reads contain just the SNP. Finally, since SHRiMP produces local alignments, it can map a read even if either end overlaps a large indel or structural variant.
To evaluate the effectiveness of SHRiMP for detecting sequence variation we used it to find polymorphisms in a resequenced
We used the AB SOLiD sequencing platform to generate 135 million reads of length 35 bp from a single
SHRiMP | SOLiD Mapper | |
Uniquely-Mapped Reads | 51,856,904 (38.5%) | 15,268,771 (11.3%) |
Non-Uniquely-Mapped Reads | 64,252,692 (47.7%) | 12,602,387 (9.4%) |
Unmapped Reads | 18,657,736 (13.8%) | 106,896,174 (79.3%) |
Average Coverage (Uniquely-Mapped Reads) | 10.3 | 3.0 |
Median Coverage (Uniquely-Mapped Reads) | 8 | 1 |
SNPs | 2,119,720 | 383,099 |
Deletions (1–5 bp) | 51,592 | 0 |
Insertions (1–5 bp) | 19,970 | 0 |
Non-uniquely-mapped reads have at least two alignments, none of which is significantly better than the others (see
The alignment data contains noise due to two types of errors: sequencing errors and chance alignments. Chance alignments are a significant problem for short reads, particularly with the low alignment score thresholds necessary for mapping reads containing significant variation. Reads containing both sequence variation and sequencing errors are even more likely to map to the wrong position in the reference sequence. To combat the high false-positive rate, for the remaining analysis we focused on a high-quality subset of the data consisting of sequence variants supported by at least four independent reads.
Across the genome SHRiMP detected 2,119,720 SNPs supported by at least four reads. For comparison, we used the SOLiD aligner provided by Applied Biosystems to map the reads to the genome with up to three mismatches, where each mismatch can be either a single color-space mismatch or a pair of adjacent mismatches consistent with the presence of a SNP. Compared to the SOLiD mapper, SHRiMP mapped 4.2 times as many reads and found 5.5 times as many SNPs. The AB mapper, however, was a lot faster, requiring 255 CPU hours to complete the alignments, or roughly 50× faster than SHRiMP. While it is possible to run the mapper with greater sensitivity, allowing for more errors and SNPs, and thus more mapped reads, doing so would surrender much of the runtime advantage and still not overcome its fundamental inability to detect insertion and deletion polymorphisms. SHRiMP, on the other hand, is capable of handling indels, and detected tens of thousands of them.
SHRiMP detected 51,592 deletions and 19,970 insertions of size 1–5 bp. The observed ratio of 2.5× between insertions and deletions for the
(A) and distance between adjacent SNPs (B) detected by SHRiMP. The distance between adjacent SNPs shows a clear 3-periodicity, due to the fact that a significant fraction of the non-repetitive
Mapping
In order to further validate the accuracy of the SHRiMP alignments we have designed simulated experiments, where we sampled random locations from the
Number of SNPs | |||||||||||
0 | 1 | 2 | 3 | 4 | |||||||
Prec. | Rec. | Prec. | Rec. | Prec. | Rec. | Prec. | Rec. | Prec. | Rec. | ||
0 | 85.7 | 83.2 | 84.8 | 81.3 | 83.5 | 76.6 | 80.6 | 65.2 | 75.6 | 46.8 | |
Max | 1 | 83.8 | 79.4 | 82.2 | 74.0 | 79.4 | 62.6 | 72.8 | 43.2 | 63.1 | 24.7 |
Indel | 2 | 83.2 | 77.1 | 80.8 | 69.6 | 77.9 | 56.6 | 68.2 | 36.4 | 56.4 | 18.9 |
Length | 3 | 80.7 | 71.0 | 79.6 | 64.2 | 73.6 | 48.3 | 66.5 | 31.5 | 57.1 | 16.6 |
4 | 78.0 | 65.4 | 76.5 | 56.1 | 71.4 | 41.9 | 60.6 | 23.9 | 50.3 | 12.4 | |
5 | 75.9 | 58.9 | 73.0 | 48.1 | 69.7 | 36.6 | 57.0 | 21.3 | 46.0 | 12.7 |
Each cell shows the precision and recall for mapping simulated reads with varying amounts of polymorphism. SHRiMP was able to accurately map >46% of all reads with either 4 SNPs or 5 bp indels, despite the large number of sequencing errors in our dataset (up to 7% towards the end of the read).
The algorithm starts with a rapid k-mer hashing step to localize potential areas of similarity between the reads and the genome. All of the spaced k-mers present in the reads are indexed. Then for each k-mer in the genome, all of the matches of that particular k-mer among the reads are found. If a particular read has as many or more than a specified number of k-mer matches within a given window of the genome, we execute a vectorized Smith-Waterman step, described in the next section, to score and validate the similarity. The top
A: Overview of the k-mer filtering stage within SHRiMP: A window is moved along the genome. If a particular read has a preset number of k-mers within the window the vectorized Smith-Waterman stage is run to align the read to the genome. B: Schematic of the vectorized-implementation of the Needleman-Wunsch algorithm. The red cells are the vector being computed, on the basis of the vectors computed in the last step (yellow) and the next-to-last (blue). The match/mismatch vector for the diagonal is determined by comparing one sequence with the other one reversed (indicated by the red arrow below). To obtain the set of match/mismatch positions for the next diagonal, the lower sequence needs to be shifted to the right.
We build an index of all spaced k-mers in the reads, and query this index with the genome. Our approach was taken primarily for simplicity: our algorithm can rapidly isolate which reads have several k-mer matches within a small window by maintaining a simple circular buffer of recent positions in the genome that matched the read. Since our targeted compute platform is a cluster of batch processing machines, indexing the reads means that we can easily control memory usage and parallelism by varying the read input size and splitting the read set accordingly. Data is only loaded at program invocation; we do not stream in new reads from disk as the algorithm runs.
The SHRiMP approach relies on a rather liberal initial filtering step, followed by a rigorous, but very fast Smith-Waterman alignment process. By maximizing the speed of the Smith-Waterman comparison, we are permitted to let the algorithm test a larger number of potential regions.
Most contemporary mobile, desktop and server-class processors have special
We propose an alternate method, where the running time of the fully vectorized algorithm is independent of the number of matches and mismatches in the matrix, though it only supports fixed match/mismatch scores (rather than full scoring matrices). Our key observation is that it is possible to completely parallelize the score computation for every diagonal.
Processor type | Unvectorized | Wozniak | Farrar | SHRiMP |
97 | 261 | 335 | 338 | |
105 | 285 | 533 | 537 |
We inserted each into SHRiMP, and used SHRiMP to align 50 thousand reads to a reference genome with default parameters. The improvements of the Core 2 architecture for vectored instructions lead to a significant speedup for our approach and Farrar's, while Wozniak's algorithm slight improvement is due to the slow match/mismatch lookups.
The vectorized Smith-Waterman approach described above is used to rapidly determine if the read has a strong match to the local genomic sequence. The locations of the top
In
Below, we refer to this denominator term
We begin with the mathematical formulation of
The factor that lies at the core of this calculation is
However, the calculation of the number of references to which a read will map with a particular
In order to estimate the correct value for
In
In this section we provide several details for the implementation, usage and statistics of the matepair post-processing step introduced in
Next, we call a matepair
For the
We thank Ziming Weng for preparing the