SAP—A Sequence Mapping and Analyzing Program for Long Sequence Reads Alignment and Accurate Variants Discovery

The third-generation of sequencing technologies produces sequence reads of 1000 bp or more that may contain high polymorphism information. However, most currently available sequence analysis tools are developed specifically for analyzing short sequence reads. While the traditional Smith-Waterman (SW) algorithm can be used to map long sequence reads, its naive implementation is computationally infeasible. We have developed a new Sequence mapping and Analyzing Program (SAP) that implements a modified version of SW to speed up the alignment process. In benchmarks with simulated and real exon sequencing data and a real E. coli genome sequence data generated by the third-generation sequencing technologies, SAP outperforms currently available tools for mapping short and long sequence reads in both speed and proportion of captured reads. In addition, it achieves high accuracy in detecting SNPs and InDels in the simulated data. SAP is available at https://github.com/davidsun/SAP.


Introduction
Next-generation sequencing (NGS) technologies, such as the second-generation NGS including Roche 454, Illumina (Solexa) and ABI SOLiD, and the third-generation NGS including Pacific Bioscience' single molecular sequencing, are transforming today's biology [1]. To fully take advantage of NGS to accelerate today's biological research, a key challenge is to develop efficient and accurate algorithms that can handle and analyze large-scale sequence data produced by NGS. Currently, the widely used algorithms for mapping sequence reads onto a reference genomes can be generally classified into two categories, those using a hash table for indexing the reference genome, such as BLAT [2], MAQ [3], SHRiMP [4] and ZOOM [5], and those implementing Burrows Wheeler transform (BWT)-based techniques for indexing [6], such as BWA [7], BOWTIE [8] and SOAP2 [9]. BWT-based methods create an efficient index structure of the reference genome, making them running much faster with low memory consumption than hash table-based methods. For example, BOWTIE reported a 30-fold increase in computational speed, compared to MAQ [8]. However, these algorithms often allow only a limited number of mismatches in order to speed up the mapping process of short sequence reads, making them inappropriate for identifying genomic regions with high polymorphism rates. On the other hand, they are exclusively designed for aligning reads no longer than 100 bp, limiting their usefulness to meet the needs of mapping long reads generated by the newest NGS technologies.
The Smith-Waterman (SW) algorithm calculates every possible combination of alignment, and guarantees to find the best alignment [10]. However, SW is time-consuming with a complexity of O(nm) where n and m are the length of sequence read and reference sequence, respectively, making it computationally inefficient for mapping huge number of sequence reads against the reference genome. SHRiMP, a short read mapping algorithm, introduced a simplified Smith-Waterman algorithm by ''seeding'' a sequence read to a given genomic region and then determining the exact placement of the sequence reads [4]. However, in our experiment SHRiMP is found to be very timeconsuming. Based on the BWT transformation, Lam et al. [11] developed the BWT-SW algorithm that runs dynamic programming to align two FM-indices [12] created from BWT-transformed suffix tree, and is equivalent to SW yet thousands of times faster. Li and Durbin [13] furthered the BWT-SW idea, and developed BWA-SW for aligning long sequence reads. BWA-SW introduces heuristics during the seeding process by only allowing extension of alignment for high-scoring seeds, and was found to be more accurate and faster than Blat for mapping long sequence reads [2].
In this study, we have developed a new alignment tool named SAP (Sequencing mapping and Alignment Program). SAP contains two component algorithms: the SAP Mapper and the SAP Predictor for mapping sequence reads and predicting polymorphisms, respectively. In SAP Mapper, a hash table is used to index the sequence reads. Then, when aligning the sequence reads, different from other algorithms, a modified SW is used by taking advantage of the hashing structure during the seeding process, which reduces its complexity to approximately O(n) and greatly speeds up the mapping process. The modified SW is similar to the alignment algorithm in FASTA [14]. However, with the hashing structure, SAP Mapper efficiently filter out most ''seeding'' sites before the application of the modified SW, which greatly accelerate the mapping process while ensuring the quality of the alignment. SAP Predictor uses a Bayesian model and a SNP filter for SNP-calling, and computes an alignment score for InDels-calling. Because of the typical memory consumption issue for hashing-based mapping tools, we benchmark SAP for short reads mapping using simulated and real exon sequencing data, and for long reads (,1000 bp) mapping using simulated exon sequencing data and a real E.coli genome data sequenced using the third-generation of NGS technology. The benchmark results show that for short reads mapping, SAP mapper not only outperform MAQ, SOAP2 and Blat in detecting SNPs, but also achieve the fastest speed; for long reads mapping, SAP mapper is superior to BWA-SW and Blat in both the percentage of captured reads and the computational speed (a speed nearly 1006 faster than Blat). In addition, SAP achieves high accuracy in detecting SNPs and InDels in the simulated data, and identifies more ''valid'' SNPs in the real exon sequencing dataset. As such, SAP offers the flexibility to deal with sequence reads generated by currently popular second-generation NGS technologies for polymorphism detection, while meeting the needs for mapping long sequence reads generated by the thirdgeneration NGS technologies.

Mapping reads to the reference
There are two mapping modes in SAP Mapper: FastMap and SlowMap that allow for exact hash look-up and inexact hash lookup, respectively, and are therefore evaluated separately in this study. Table 1 shows the time consumption of the compared methods, and Table 2 shows the proportion of captured reads by each method.
For short reads mapping (, = 150 bp), overall SAP FastMap is the fastest among the five algorithms, followed by SOAP2, SAP SlowMap, MAQ, Blat and SHRiMP. In terms of the proportion of captured reads, in simulated data SAP SlowMap and Blat are comparable, both capturing above 99.5% reads. SAP FastMap captures slightly less number of reads than SAP SlowMap, while MAQ and Blat are significantly worse. Note that MAQ fails to capture any reads of 150 bp, while SOAP2's performance drops significantly when read length increases to 150 bp. In real data (about 54 bp), SHRiMP captures the most number of reads (95%), yet it is nearly 50006 slower than SAP. SAP SlowMap, Blat, MAQ, and SAP FastMap all capture above 92% of reads, with SAP SlowMap being the best of them (93.1%). SOAP2 is noticeably worse than the other methods, capturing only 88.3% reads.
For long reads mapping, in simulated data (1000 bp) SAP FastMap is the fastest method, followed by SOAP2, SAP SlowMap, MAQ, BWA-SW, and Blat. Depending on read coverage, SAP is about 206 faster than BWA-SW, and 60-1006 faster than Blat. Note that the time-consumption of Blat is sensitive to read coverage. In terms of the proportion of captured reads, SAP SlowMap, FastMap, Blat and BWA-SW all capture more than 98.8% reads, with SAP SlowMap being the best one. For real E. coli genome sequencing data (about 900 bp), because they contain large number of errors, SAP FastMap is not evaluated. Blat is faster than SAP and BWA-SW, however this is because it does not find the exact match of most reads, and consequently only captures 2.98% reads. With default settings, BWA-SW captures only 17.5% reads with the time consumption of 30 m; with parameter tuning, it capture 61% reads with the time consumption of 250 m. In contrast, SAP SlowMap captures more than 62% reads by spending 15 m; with parameter tuning, it captures even 79% reads with the time consumption of 43 m. Therefore, SAP outperforms both BWA-SW and Blat for mapping long reads.

SNPs detection
We use SAP Predictor to predict SNPs from the mapping results of SAP Fastmap and Slowmap, Blat, BWA-SW and SOAP2. SOAP2 has its own SNP prediction program, however for format problem it does not work well with its read mapping results. We employ MAQ's own SNP-caller to analyze its mapping result. SHRiMP is not evaluated, because it does not have a SNP Predictor and the format of its output does not fit SAP Predictor's requirement. Here, the mapping results of E. coli sequencing data are not analyzed, because the data contain high errors.
For simulated data, we know in advance where SNPs are located. Therefore, we can evaluate the accuracy and coverage of SNP calling. In general, the accuracy of SNP detection based on the mapping result of all methods except BWA-SW is above 97% irrespective read length and coverage (Table 3). This suggests that SAP predictor is able to accurately identify SNPs. The coverage of SNPs detection increases when read coverage or read length increases (Table 3). For example, based on the mapping results of SAP FastMap, the coverage of SNPs detection is increased from 53% to 94% when read coverage is increased from 56to 206and reads length is fixed at 75 bp, and is increased from 53% to 81% when read length is increased from 75 bp to 1000 bp and read coverage is set at 56. Comparing the coverage of SNP detection in between the mapping results of different methods, we find that for short reads, SAP SlowMap corresponds to the best coverage, followed by SAP FastMap, Blat, SOAP2, and MAQ; for long reads, SAP SlowMap is also the best one, followed by SAP FastMap, Blat and BWA-SW. Thus, in the simulated data SAP is able to achieve the best performance on SNP detection using either short or long reads. For real data, the SNPs of each individual are not known. In order to evaluate a method's performance, here we consider a predicted SNP ''valid'' if it either has already been reported in dbSNP [15] or is predicted by at least two other methods (

InDel detection
The prediction of insertions and deletions is of great importance, for these variations may have a larger impact on gene functions. However, because all the other methods do not report InDels, here we only evaluate SAP's performance on detecting InDels in the simulated datasets (Table 5). SAP can detect InDels with high accuracy irrespective read length: the accuracy is above 97% and 87% for insertions and deletions, respectively, for reads with different length. The coverage of InDel detection is dependent on read coverage and read length, and in general increases as read length and coverage increase, highlighting the necessities of developing new NGS technology to produce longer sequence reads. For example, at 56 read coverage, the prediction coverage of deletions by SAP FastMap is improved from 12.3% to 54.7% when read length is increased from 75 bp to 1000 bp; for reads of 1000 bp, the prediction coverage of deletions is improved

Discussion
With the development of third-generation NGS, sequence reads are expected to become longer and contain more polymorphism information such as SNPs and InDels. For example, the average read length of the recent E. coli genome sequence data is about 904 bp with high error rate. This effectively rules out the use of most currently available reads mapping tools, as they often require very few number of mismatches in order to speed up the mapping process. Though Smith-Waterman algorithm is able to determine the exact displacement of long reads at the reference, it is very time consuming, making it computationally infeasible to apply SW directly for mapping and aligning long sequence reads. Therefore, there is a strong need for new alignment tool for efficiently mapping long reads containing high polymorphism information.  BWA-SW is a recently developed algorithm for long reads mapping, which uses the standard SW for alignment extension only if the ''seed'' sequence reads has a score above a certain threshold in the initial comparison. However, since a read is often mapped to different positions with positive scores, BWA-SW is still time consuming. In this study, we tackle this problem by introducing a modified version of SW that takes advantage of the hashing result to reduce its complexity to approximately O(n). This has significantly speeds up the mapping process, making SAP the fastest method for reads mapping in the benchmarks. On the other hand, since SW is still used, the alignment is of high quality, which results in the high accuracy and coverage of SNP detection by SAP in simulated datasets. This makes SAP a useful tool to meet the needs of third generation NGS technologies for efficiently mapping long reads containing high polymorphisms information or with high error rate.
Although SAP is developed for long reads mapping, it is still well adapted for short reads mapping given that third generation sequencing technologies have not yet been widely applied. What makes SAP different from other ''hash-lookup''-based algorithms are that: (1) SAP uses the ''seeding'' result to optimize the SW algorithm when aligning the read sequence to the reference genome, which greatly reduces the running time while guarantees that each read is mapped to the most possible location on the reference genome; (2) Unlike most other algorithms which require additional processing after alignment when determining SNPS and InDels, in SAP the information of SNPs and InDels are already contained in the alignment with the implementation of the SW algorithm. The above features of SAP make it an accurate and convenient algorithm for short reads mapping.
However, since hashing is used in SAP, it is expected that for large reference, such as the whole human genome sequence, memory consumption will be an issue for SAP, because it usually consumes a memory of about 306 larger than the size of the reference. This may be solved in the future by implementing the BWT techniques to index the genome sequence and apply the modified SW to align two FM-indices. Nevertheless, at current stage, SAP is well suited for mapping sequence reads to relatively small reference sequence, such as the human exome sequence or bacteria genome sequence.

SAP Mapper
Indexing and hash look-up. SAP uses a hash table to index sequence reads before the mapping process. By default, SAP cuts a read into seven pieces each with a length of 15 bp, which can be adjusted by user. For example, a read R (r 1 r 2 ::: r n , where n is the length of the read) is cut into seven pieces. The starting position of each piece is L 1~1 , L 2~m z1, L 3~2 mz1, :::, L 6~5 mz1, , and the piece, P i , is P i~rLi r Liz1 ::: r Liz14 . SAP Mapper searches with each piece, P i , against the reference sequence for a possible starting position of R, and define it as l i , where l i~M P i {L i , and MP i is the mapped position of P i on the reference. There are two mode parameters to control the hash look-up: the FastMap and the SlowMap mode that allow zero or only one mismatch in a given piece according to the reference, respectively. Finally, SAP will keep a read if at least two of the seven pieces are successfully mapped to the reference.
A modified version of the Smith-Waterman algorithm. After hash look-up, l i is sorted. Without the loss of generosity, we can suppose l 1 ƒl 2 ƒ::: l 7 . Then, SAP Mapper compares l 1 with l 7 . If l 1~l7 , then the read does not contain any insertions or deletions, and a direct comparison between read and reference sequence is performed, with the score calculated as S read~2 0 : L map , where L map refers to the number of identical nucleotides between the read and the reference. If l 1 vl 7 , then it indicates the presence of InDels, and a modified version of the SW algorithm will be initiated to align the sequence read to the reference. A typical SW is to optimize function F i,,j , which is defined recursively as where E j is the (jzl 1 {(l 7 {l 1 ))-th position on reference sequences. After optimization of F i,j , the read is aligned to the reference by backtracking. However, the typical SW is very timeconsuming, with a complexity of O(nm), where n is the length of read, and m is the portion of reference that a read can be mapped onto. Here, we introduce a modified version of SW that takes advantage of the hash structure. After analyzing F i,,j during the calculation process, we find that in most cases, when Dj{iD is very large, the value of F i,,j is not useful. Therefore, we require SAP Mapper calculate F i,,j only if Dj{iDƒl 7 {l 1 . Since in most cases l 7 {l 1 is quite small, the complexity of the modified SW is approximately O(n), making it much more computationally efficient than the typical SW.
Finally, F i,,j is transformed into a score: InitialScore R~F i,j 20 : n .

SAP Predictor
After reads mapping, SAP Predictor is used to compute the probability of a given site to be either SNP or InDels.
SNP prediction. At a given site in the reference genome, there are at most two different nucleotides. Therefore, to infer the genotype, we only need to consider the two most frequent nucleotides among the reads across that site. Suppose in an observation D, the two most frequent nucleotides are b and b9, with a frequency of k and n-k reads, respectively, where n is the total number of reads with either b or b9, then there are three possible genotypes (Q): ,b,b., ,b,b9., or ,b9,b9.. For genotype ,b,b9., its probability follows a binomial distribution, i.e., , where P(D)~P g P(DDQ)P(Q), g[(vb,bw,vb,b 0 w, orvb 0 ,b 0 w). If the genotype with the highest probability is different from that on the reference, then it is tentatively called a SNP. SNP filter. To filter out false positive of SNP calls, we develop two SNP filters: a read depth filter, and a score filter. The read depth filter checks the depth of reads at a given site, and by default is set at Cover/5, where Cover is the average coverage of the sequencing data, and is computed as Cover~T otal Length Of Reads Total Length Of Sequenced DNA . The score filter checks the quality of a predicted SNP, with a lower score indicating a better reliability in our model. The score is calculated as S i~0 , 1 genotype reported 1000 ln(P max ) 2 ln(P 1 ) ln(P 2 ) , w1 genotypes reported , where P max is P(Q|D) with the highest probability, and P 1 and P 2 are the other two probabilities. InDel prediction. SW can automatically detect InDels when aligning the reads. Suppose an insertion is detected on a read, the quality score of the insertion is calculated as q ins~1 n X n i~1 q i , where q i is the quality of i-th site on the read. Then, SAP Predictor sums all q ins in that position and compares it with the quality data of the nearby sites. If q ins is greater, an insertion would be reported. The prediction of deletions is done in a similar way. SAP is publically available at https://github.com/davidsun/SAP.

Benchmark SAP
SAP Mapper and SAP Predictor are evaluated separately. The following benchmark datasets are prepared: the simulated human exon sequencing data with both short and long reads, the real human exon sequencing data with short reads, and the real E. coli genome sequence data with long reads.
Simulated exon sequencing dataset. In simulation, we first randomly introduce SNPs and InDels into the human genomic sequence (hg18 assembly), with a probability of 0.02% and 0.009%, respectively. Then, we simulate the exon-capture sequencing process by cutting each exon randomly into small DNA fragments with length of 300-1500 bps. The number of DNA fragments generated from each exon follows a Poisson distribution. Finally, with a given read length, e.g., 75 bp, we generate the single-end reads from either the 39-or 59-end of the fragments with an error rate of 2%. By controlling the number of fragment-cutting and read-generating, we obtain sequencing data with different read coverage. The advantages of using the simulated data are that (1) we know in advance where the SNP and InDels are located in the reference, and (2) we can control the sequencing error, read length, and read coverage, making it possible to evaluate the performance of the program and make appropriate improvements in a dynamic way. Here, for each combination of read coverage (56, 106, and 206) and read length (75 bp, 150 bp, and 1000 bp), we generate three datasets following the above procedures.
Real exon sequencing dataset. We download the exoncapture sequencing data of 16 individuals from a recently published article [16]. There are approximately 8.4 million quality-filtered reads with a length of 76 bp (including a 22 bp primer) for each individual's exome. The read coverage is over 406.
Real E. coli genome sequence data. We download a recently sequenced E. coli O104:H4 genome data using the third generation NGS technologies. The average read length is about 902 bp. The read coverage is about 75.
Evaluation measures. All methods described above are evaluated on an IBM computer server with Intel(R) Xeon(R) E5405 CPU of 2.00 GHz with 16GB memory installed. One CPU thread is used. For the simulated dataset, the prediction accuracy of SNP or InDel detection is defined as TP/P, where TP is the number of true SNPs or InDels predicted by a program, and P is the total number of predicted SNPs or InDels. The prediction coverage is defined as TP/T, where T is the total number of true SNPs or InDels. For each combination of read coverage and read length, the prediction accuracy and coverage are averaged across the three datasets.