ConDeTri - A Content Dependent Read Trimmer for Illumina Data

During the last few years, DNA and RNA sequencing have started to play an increasingly important role in biological and medical applications, especially due to the greater amount of sequencing data yielded from the new sequencing machines and the enormous decrease in sequencing costs. Particularly, Illumina/Solexa sequencing has had an increasing impact on gathering data from model and non-model organisms. However, accurate and easy to use tools for quality filtering have not yet been established. We present ConDeTri, a method for content dependent read trimming for next generation sequencing data using quality scores of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized. Another aspect of the method is to incorporate read trimming in next-generation sequencing data processing and analysis pipelines. It can process single-end and paired-end sequence data of arbitrary length and it is independent from sequencing coverage and user interaction. ConDeTri is able to trim and remove reads with low quality scores to save computational time and memory usage during de novo assemblies. Low coverage or large genome sequencing projects will especially gain from trimming reads. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data. Availability and implementation Freely available on the web at http://code.google.com/p/condetri.


Introduction
Sequencing technologies evolve rapidly. Since Sanger sequencing [1] was introduced, many genomes have been sequenced, including large eukaryotic genomes such as human, mouse and chicken. Recently, several next generation sequencing (NGS) methods have been released and established in biological and medical sciences (see e.g. [2,3]). However, NGS techniques differ from traditional Sanger sequencing among others with respect to the error probabilities of each read. For Illumina sequencing, the probability of sequencing errors increases exponentially from the 59 to the 39 end of a sequencing read [4]. Read accuracy is crucial to consider when using NGS data because it not only affects the assembly and mapping process, but also downstream applications like single nucleotide polymorphism (SNP) discovery and copy number variation (CNV) identification.
Programs that perform de novo assemblies of NGS data generally exploit either an overlapping-consensus approach (see for example [5]) or implement a de Bruijn graph [6]. Popular short read de novo assemblers like VELVET [7] or SOAPDENOVO [8] use the latter approach whereas genome assemblers developed for Sanger or 454 data like CABOG [9] or NEWBLER [10] use the overlappingconsensus technique. Regardless of the technique used for the assembly, all assembly programs take as input sequenced reads and perform an assembly without a reference genome. Generally, base quality is not used in the contig building of the assembly process. However, programs should use base quality information to correct or to remove erroneous bases from the assembly process in order to reduce the search space. This is especially important for programs based on the de Bruijn graph because it will allow them to save computational resources, thereby reduce assembly time and enabling a more correct assembly.
Correcting for sequencing errors can be done in two different ways. Either bases or reads with low quality are removed completely [11] or erroneous bases are corrected without removing them [12,13,14,15,16]. The latter approach assumes very high coverage per sequenced base in order to identify erroneous bases, whereas the first approach can be applied to high and low coverage sequence data. Generally, to remove bases (trimming), the quality value for each base is evaluated and bases are removed if they do not exceed a certain quality threshold. This can be done from either end of the read, or along the whole read.
Here, we present our content depend trimming (CONDETRI, available at http://code.google.com/p/condetri) program designed for read trimming of Illumina data. The program removes potential sequencing errors starting from the 39 read end and also removes reads containing too many low quality bases.

Test environment
To test the performance of CONDETRI, we used two different Drosophila melanogaster data sets obtained by whole genome pairedend sequencing (NCBI SRA:SRR063698 and NCBI SRA: SRR063699) and two different insert sizes (,170 bp and ,280 bp) from a Gallus gallus resequencing project [17] (NCBI SRA:SRX043655 and NCBI SRA:SRX043656). We used SOLEX-AQA [11] to investigate the quality of the data. Data coming from D. melanogaster showed two distinct quality patterns. Generally, data set SRR063698 showed much better and higher Illumina quality scores than the data coming from SRR063699 which makes the two data sets especially useful to test the influence of trimming on 'good' and 'bad' sequencing data. For both sets we also prepared a reduced version, where we selected 25% of the paired-end reads randomly (below referred to as the reduced data or reduced set). Overall, the G. gallus data showed high Illumina quality scores. We did not create a reduced set from this data. Additionally, we used data from the collared flycatcher (Ficedula albicollis) genomesequencing project to test the performance of CONDETRI on a non-model species where no genome sequence is available.
Sequencing reads can be duplicated due to biased PCR replication, sequencing artifacts, and genomic DNA shearing at the same location in different DNA-molecules [18,19,20,21]. Therefore, we scanned each data set for duplicated paired-end reads using the first 50 nucleotides in each read and kept only one unique read-pair. Additional read-pairs were removed. We removed around 3% of all read-pairs in the SRR063698 data. Less than 1% of all read-pairs were removed in the SRR063699, SRX043655, and SRX043656 data. Raw data and filtered data are summarized in Table 1.
We tested CONDETRI against untrimmed data, and against three recently published methods, SOLEXAQA version 1.7 [11], the BWA quality trimming algorithm [22] as implemented in SolexaQA version 1.7 and QUAKE version 0.2 [13] using the D. melanogaster and G. gallus data. SOLEXAQA and BWA quality trimming use a quality based read trimming approach whereas QUAKE performs quality detection and correction of potential sequencing errors. For the three trimming programs we used same parameters for quality cutoff and minimum read length to be able to make a fair comparison between the programs. Quality cutoff was set to 25 and minimum read length to 50. The values were chosen after inspecting several data sets (see Method section). SOLEXAQA and BWA have no other relevant parameters that can be adjusted to improve filtering quality. For each data set, the optimal k parameter according to the genome size was calculated to be able to run COUNT-QMERS from QUAKE. QUAKE could not be run without user interaction because the data does not provide enough sequencing depth to estimate the coverage cutoff parameter. Therefore, we investigated the coverage histograms for each data set manually and chose the best cutoff value according to the QUAKE online manual.
Data that was trimmed based on quality and untrimmed was de novo assembled using SOAPDENOVO version 1.04 [8] with k-mer sizes ranging from 19 to 31. For each method and data set, the best assembly was chosen according to the N50 size of the assembly and then aligned to the D. melanogaster reference genome (Release 5) or G. gallus reference genome (Release WUGSC 2.1/galGal3) using NUCMER version 3.07 (64-bit compiled version) from the MUMMER package [23]. To infer the alignment quality, we used SHOW-TILING from the MUMMER package with a minimum percent identity of 95% to construct a tiling path out of the query scaffolds as mapped to the reference sequences (recall rate) and we estimated the proportion of the assembly that could be aligned onto the reference genome (accuracy rate). Single nucleotide polymorphism (SNP) frequency per base was estimated as follows. First, reads were mapped onto the G. gallus reference genome using BWA version 0.5.9 [24]. Second, SNP calling was done applying the PILEUP command as implemented in SAMTOOLS version 0.1.16 [25]. The coverage cutoff was set to 60 after inspecting the coverage across the genome to avoid false positive SNP calls due to unresolved repeats. Note, for estimating SNP frequencies we took only the covered genome into account and disregarded the proportion that was not covered by reads.

Trimming effect
Theoretically, trimming should reduce the problem complexity of de novo assemblies since it shrinks the size of the de Bruijn graph. This should lead to more accurate de novo assemblies. Still, there are no measures available to quantify the quality of a de novo assembly. Therefore, we estimated the proportion of the assembly that could be mapped to the reference genome (accuracy) and the proportion of the genome that was covered with the assembly (recall). We think that accuracy is more important than recall because it is gives the amount of the assembly that is correctly assembled.
For the 'higher quality' D. melanogaster data set (NCBI SRA:SRR063698) the assembly using no trimming yielded the best accuracy (88%) and the best recall (63%, 106.0 MB of the genome covered). CONDETRI gave quite similar results, with an accuracy of 86% and recall of 61% (102.8 MB of the genome covered). Data filtered using BWA, SOLEXAQA or QUAKE gave slightly less accurate assemblies (83%, 81% and 71%) and also smaller proportions of the genome assembled (recall 59%, 56% and 48%, respectively). Interestingly, the assembly with the longest N50 sizes (QUAKE) gave the smallest genome assembly. Potentially, mis-assemblies are more common in assemblies with longer N50 sizes. Small assembly mistakes tend to produce continuous sequences from contigs that are not actually located close to each other and can thereby greatly reduce the ability of programs to create longer scaffolds. Using the reduced data showed even more pronounced results in favor of the untrimmed assembly (see Fig. 1 and Table 2).
The data from the 'lower quality' D. melanogaster sample (NCBI SRA:SRR063699) gave slightly different results. As the overall quality of the sample was not very good, a smaller proportion of the genome could be assembled. The proportion of the assembly that could accurately mapped onto the reference genome (accuracy) varies between 76% using CONDETRI for trimming and 6% using QUAKE. Accuracy for the untrimmed data was 72%, 66% for SOLEXAQA, and 63% for BWA trimmed data, respectively. Recall was highest in the untrimmed data (45%), followed by CONDETRI (35%), Bwa (34%) and SOLEXAQA (24%). QUAKE had the lowest recall (,1%). Interestingly, using only 25% of the data gave a slightly larger proportion of the assembly that was accurate for CONDETRI (77%), BWA (70%) and SOLEXAQA (70%) but the overall assembled size of the genome drops down. Untrimmed data gave an accuracy of 65%. Recall was highest using the   untrimmed data (24%) reflecting that this data set contains most of the data. BWA, CONDETRI and SOLEXAQA have a recall of 20%, 11% and 7% respectively. Again, the assembled data from QUAKE covered less than 1% of the reference genome. QUAKE was outperformed using the 'lower quality' sample by the read trimming method because the coverage in this sample is not high enough to perform a read correction. According to the authors, coverage should be at least 156 [13]. All results are summarized in Figure 1 and Table 2.
Assembly results for the G. gallus data were quite similar between the different trimming methods and the untrimmed data. Accuracy for the read trimming methods ranged from 78% using BWA to 82% using CONDETRI whereas the untrimmed data gave the most accurate result with 85%. Recall was highest for untrimmed data (82%), followed by CONDETRI (78%), SOLEXAQA (77%) and BWA (75%). Interestingly, CONDETRI was the method that removes the largest amount of reads but it gives the best result among the trimming methods. We think untrimmed data gave the best result in this comparison because the assembly has the lowest N50, which potentially reduces the probability of wrongly assembled regions. Additionally, untrimmed data needed more time and memory for the assembly. Especially the latter point can be crucial for sequencing projects. We were not able to retrieve results using QUAKE on this data set, as it was not able to correct the data from SRX043656. We managed to correct data from SRX043655 but after read correction less than 5% of the original reads were included in the corrected FASTQ files. Results for the G. gallus data are summarized in Table 3.
We also applied a SNP calling method on the G. gallus data set to be able to compare the performance of read trimming versus untrimmed reads. As the individual used for sequencing and resequencing the G. gallus genome is highly inbred [17,26] we expect a much lower heterozygosity rate than in natural populations and that the per base SNP frequency is lower for trimmed data than for untrimmed data because the untrimmed data contains more sequencing errors. We found one SNP every 1,299 bp in the untrimmed data and one every ,1,450 in the trimmed data sets, regardless which trimming method was applied, which was consistent with our predictions. Note that the SNP frequency per base is much lower than the estimated frequency of one SNP in 374 bp [26] in G. gallus. We think that the difference between trimmed and untrimmed data is mainly based on sequencing errors. Per base coverage in the data ranged from 39.56 in the untrimmed data to 33.26 in the data using SOLEXAQA for trimming. Data trimmed using BWA had almost the same coverage per base as the untrimmed data (37.36) whereas CONDETRI data has coverage of 35.96. The differences in per base coverage between the different trimmed and the untrimmed data corresponds 94%, 91% and 84% for BWA, CONDETRI and SOLEXAQA, respectively. As shown before, SNP frequency was almost identical in the trimmed data sets. We conclude that coverage has not a strong impact on SNP calling in our data. Furthermore, we think the difference in SNP frequency between trimmed and untrimmed data, is mainly due to the presence of more sequencing errors in the untrimmed data but more sophisticated tests are needed to verify this findings, which is outside the scope of this study.
To test the effect of filtering on a non-model organism where no reference genome is available, we made use of data from an attempt towards genome sequencing in the collared flycatcher (H. Ellegren et al. unpublished), a small songbird. Although the genome size of this species has not been determined, there is a high degree of genome size conservation among birds with most song birds having an estimated haploid DNA content of 1.1-1.3 Gb [27]. We cannot calculate recall and accuracy for this data set because there is no reference genome available for the collared flycatcher. Therefore, we concentrated on assembly time and memory usage because this should be related to the complexity of the de Bruijn graph. We selected 4 lanes of Illumina Genome Analyzer II data (insert size ,200 bp), which gave a total of 19.9 Gb of untrimmed sequence data. After trimming using CONDETRI, 15.6 Gb sequence data remained for the assembly. We tested only untrimmed data and CONDETRI on this data, as they were the best performers on the D. melanogaster and G. gallus data. Assembly size and N50 size was quite similar between trimmed and untrimmed data (not shown) but running time for the trimmed data set was 152 minutes with a peak memory usage of 39 GB whereas the untrimmed finished within 388 min and a peak memory usage of 76 GB of RAM showing the big impact of sequencing errors on memory usage and running time. Given that the untrimmed data contains around one third more data, the running time is more than twice as long and the memory usage almost doubles in comparison to the trimmed data.
Trimming may not have a big impact on de novo assemblies for smaller genomes, because there is a sufficient amount of per base coverage. Also, the de Bruijn graph is still quite small for genomes about the size of the D. melanogaster data sets that we used so that there is little benefit of trimming reads on the assembly. However, we have shown that trimming has an effect on the assembly process of larger genomes with a more complex genome structure (higher repeat content, and higher proportion of non-coding sequences). Using untrimmed data for the assembly of e.g. mammalian or avian genomes, where genomes sizes exceed 1 Gb, complicates the de Bruijn graph. Sequencing errors introduce k-mers in the graph that do not occur frequently, which increases the number of nodes and edges and can make the graph unwieldy even for powerful computers. One approach to avoid this is to correct for erroneous k-mers, as programs like QUAKE does. However, this is only possible if a sufficient base coverage is reached to be able to correct for sequencing errors. We have shown that the data we have used does not provide enough coverage to be able to correct sequencing errors using a k-mer based approach. Instead it is better to remove bases or reads that do not fulfill a certain quality criteria.

Implementation
CONDETRI is implemented in Perl (required version 5.8.9 or above), is platform independent, has no additional hardware or library requirements, and is distributed under Artistic License/ GPL. It is designed to run single-threaded on desktop computers or on cluster machines. In default mode, it can be run by giving only one FASTQ file for single-end sequencing or two FASTQ files for paired-end sequencing. More advanced options allow the user to control such things as the quality values used for trimming, trimming size, the fraction of a read containing high quality bases, and the quality format used (either Illumina/Solexa FASTQ format or Sanger FASTQ format is chosen by different offset scores).
Our trimming approach does not correct the actual quality scores called by the Illumina pipeline. Instead, it removes bases with quality values lower than a threshold from the 39-end of a read and checks the remaining read for internal low quality bases.
CONDETRI applies two filtering steps on each read. First, each read is trimmed, one base at the time, in an iterative process. Starting from the 39-end of the read, bases are removed if the corresponding quality score is lower than a threshold Q H . When reaching a base with a quality score higher than Q H , the base is kept temporarily while following bases are evaluated. After parsing a certain number of consecutive high quality bases, n H , the trimming is terminated. However, even bases with low quality scores below Q H , recorded before n H is reached, are saved temporarily. Up to n L consecutive low quality bases are accepted when they are surrounded by high quality bases. If n L is overrun, all temporarily saved bases are removed, and the process starts over again. The trimming continues until either n H consecutive high quality bases are found, or the read is trimmed down to length L.
For a trimmed read to be approved, it must contain more than a certain fraction f of bases with a quality score higher than Q H , and no bases with a quality score less than a lower bound threshold Q L . If a base has a quality score lower than Q L the read is removed. When all reads have been trimmed, each read or each read pair is examined. If a single read passes the quality check, it is stored in a new FASTQ file. For paired end reads, pairs where both the reads fulfill the quality demands are saved in new paired FASTQ files. If a pair contains only one read passing the quality requirements, the high quality read is saved in an extra FASTQ. These reads can be used as single end reads. Besides FASTQ files, CONDETRI reports the number of scanned and removed reads and the number of reads that are present as paired-end and as single-end reads. Figure S1 summarizes the algorithm in a flowchart and Figure S2 gives two examples.
Per default, the high quality score (Q H ) is set to 25, which is similar to a sequencing error probability of 0.0032. This value was chosen after inspecting quality score distributions from several data sets with different insert sizes from the collared flycatcher genome-sequencing project, as a level where the number of bases kept are of highest possible quality without having a considerable loss of reads. For the sets inspected, changing the quality threshold to 30 resulted in a loss of the majority of reads during filtering. On the other hand, lowering it to 20 did not increase the number of reads kept significantly, but the per base error probability of those reads will be up to three times higher (,0.01). However, the default value is by no means universal, and the threshold should be set according to the data. The low quality score (Q L ) is set to 10, which equals a probability of a sequencing error of 0.0909, the fraction f of bases with a quality score higher than Q H is set to 80% and L, the minimum number of bases after trimming, is set to 50, to prevent saving reads that are too short for de novo assembly. The parameters n H and n L are set to 5 and 1, respectively. This means that for each low quality-base there must be at least five high quality bases, which is more than the Q H value of 80%. The connection between these numbers must be considered when tweaking the parameters -keeping n H and n L as 5 and 1 but increasing the Q H to 95% results in removing a large proportion of reads in the second step. However, all these settings can be changed as desired. Quality score distribution along reads and read length distribution after trimming for the libraries used for choosing the default values are shown in Figures S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, S16, S17 and Table S1. CONDETRI can read all three different FASTQ quality score standards: Illumina and Solexa (early Illumina) quality scores with an offset of 64 and Sanger standard with an offset of 33.

Conclusion
The main focus of our quality filtering approach was to provide an accurate, standardized and easy to use method for trimming Illumina sequencing data. In comparison to other programs, data filtered with CONDETRI gave better results with respect to the size of the assembled data and also the accuracy of the de novo assembly. In comparison to untrimmed data, less memory and time is needed for de novo assemblies. This is crucial for larger eukaryotic genomes, because affordable computational resources are still a limiting factor in performing assemblies of larger genomes using several insert sizes for paired-end sequencing. Using qualityfiltered data reduces the de Bruijn graph in the assembly process and should improve downstream analyses of NGS data (e.g. SNP calling).  Figure S10, S11, S12, S13, S14, S15, S16. (PNG) Figure S4 (PNG) Figure S5 (PNG) Figure S6 (PNG) Figure S7 (PNG) Figure S8 (PNG) Figure S9 (PNG) Figure S10 Figures S10, S11, S12, S13, S14, S15, S16-Quality plots for backward reads. The backward reads corresponding to Figure S3, S4, S5, S6, S7, S8, S9. For the first flow cell, only 65 bp were sequenced for the backward reads due to technical problems. (PNG) Figure S11 (PNG) Figure S12 (PNG) Figure S13 (PNG) Figure S14 (PNG) Figure S15 (PNG) Figure S16 (PNG) Figure S17 Example length distribution after trimming. Example of read length distribution for the filtered data set corresponding to Figure S3. A majority of the reads are kept at full length. The wave-like pattern in cycles of 5 bp comes from that n H is set to 5. (PNG)

Supporting Information
Table S1 Data before and after filtering. The number of reads before and after filtering for the data used for estimating CONDETRI default parameters. (PDF)