Compression of FASTQ and SAM Format Sequencing Data

Storage and transmission of the data produced by modern DNA sequencing instruments have become a major concern, prompting the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference-based compression (CRAM, Goby) and non-reference-based compression (DSRC, BAM), as well as other recently published competition entries (Quip, SCALCE). The tools are shown to form the new Pareto frontier for FASTQ compression, offering state-of-the-art ratios at affordable CPU cost. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.


SequenceSqueeze results
The evaluation machine used by the competition was an Amazon m2.xlarge instance with a separate 300GB mounted file-system for contest data and temporary storage. Amazon define this instance type as having 6.5 EC2 compute units across 2 virtual cores, on a 64-bit platform. The plots below have been generated from the table of results at www.sequencesqueeze.org. All entries from each author are shown, not just their best. Entries that fail to uncompress without mismatch are omitted, except where an entrant had no programs that were 100% lossless; these are marked appropriately.
The cluster on the far left are the two reference-based encoders: Fastqz and Samcomp. These include the time taken for the entire fastq → compress → decompress → fastq process, so this includes the bowtie2 alignment time. Fastqz is demonstrably faster at performing alignments, as it uses its own built-in aligner rather than bowtie2.

Figure S2
Despite the many varied techniques, it is clear from compression times that there is a limit on compressibility: achieving only a small, linear improvement in ratio requires exponentially more CPU. The asymmetry of gzip (the competition baseline) is clear; most other entries are symmetric.
In the above plot, fqzcomp appears to be the only program matching gzip's decompression speed. We believe this is because both were I/O bound on the AWS test system; our own tests show gzip to be faster at decompression.
Zooming in between ratios 0.17 and 0.19 shows more clearly the trade-off between time and ratio for the non-reference-based compressors. Of these, the Pareto frontier consists of A.J. Pinho's IEETA entry, D. Jones' Quip program and J. Bonfield's fqzcomp. Programs may have been modified since entries closed.
(For example, Fqzcomp is 10-40% faster depending on the options used.)

Figure S4

A similar picture is seen with compression ratio vs memory usage. Compression and decompression memory usage is largely symmetric, so we show only compression memory usage. Note that these memory figures are as quoted by the SequenceSqueeze web site, which erroneously listed them as the number of 1KB blocks; they are in fact the number of 256-byte blocks.
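The unit correction just described means every memory figure quoted on the site should be scaled by 256 bytes, not 1024. A minimal sketch of the conversion (function names are ours, for illustration only):

```python
def reported_blocks_to_bytes(blocks: int) -> int:
    """The SequenceSqueeze site reported memory in 256-byte blocks,
    despite labelling them as 1KB blocks."""
    return blocks * 256

def naive_kb_to_bytes(blocks: int) -> int:
    """What a reader would compute from the erroneous 1KB label."""
    return blocks * 1024

# A figure of 1,000,000 "1KB blocks" is really 256,000,000 bytes (~244 MiB),
# a quarter of the naively computed 1,024,000,000 bytes (~976 MiB).
actual = reported_blocks_to_bytes(1_000_000)
naive = naive_kb_to_bytes(1_000_000)
assert naive == 4 * actual
```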
Once again we rapidly reach a cliff, requiring exponential growth in memory for a linear decrease in size. The two reference-based compression programs must additionally load the reference genome into memory.
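As a back-of-the-envelope illustration of that reference-loading cost (our own figures, not from the competition results; the actual in-memory representation used by each tool may differ): a human-scale reference of roughly 3 Gbp packed at 2 bits per base needs on the order of 750 MB.

```python
def packed_reference_bytes(n_bases: int, bits_per_base: int = 2) -> int:
    """Memory needed to hold a reference packed at a given number of bits
    per base (2 bits suffice for A, C, G, T; ambiguity codes need more)."""
    return (n_bases * bits_per_base + 7) // 8  # round up to whole bytes

human_scale = packed_reference_bytes(3_000_000_000)
assert human_scale == 750_000_000  # ~750 MB at 2 bits/base
```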

Bowtie2 alignment usage
Alignments for Samcomp and the other SAM-based tools were produced using bowtie2.
Bowtie2 was considerably slower than the built-in aligner used by fastqz, but aligns more data.

Fastqz alignment benchmarks
The following results were obtained on the complete SRR062634 file. Producing the alignment adds significant time to the preprocessing stages. However, in the full slow compression mode this reduces the overall time, as the volume of data presented to the ZPAQ stage is smaller.

Fqzcomp parameter space
Fqzcomp has separate parameters controlling the compression level for sequence names (identifiers), basecalls and quality values. Additionally, for base-call compression it may use a single- or double-stranded model, and it may optionally encode using a single model or a pair of low- plus high-order models. This gives a considerable search space to explore.
To choose appropriate low, mid and high compression-ratio parameters, we produced charts with fixed name ("n") and quality ("q") parameters and fixed choices of single vs double strand ("b") and single vs paired ("+") model, varying the sequence order ("s") to chart lines of compression ratio vs time.
We tested this using two Illumina data sets (shallow and deep) and a 454 data set.
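The size of this search space can be sketched by enumerating the option combinations. The value ranges below are illustrative assumptions mirroring the -n/-q/-s/-b/+ flags described in the text, not an exhaustive list of fqzcomp's accepted values:

```python
from itertools import product

# Illustrative parameter grid (assumed ranges; the real tool may accept others).
name_levels    = [1, 2]           # -n: identifier model level (assumed range)
quality_levels = [1, 2, 3]        # -q: quality model level (assumed range)
seq_orders     = range(1, 9)      # -s1 .. -s8: base-call model order
both_strands   = [False, True]    # -b: update sequence model on both strands
paired_model   = [False, True]    # trailing '+': extra low-order model

grid = list(product(name_levels, quality_levels, seq_orders,
                    both_strands, paired_model))
print(len(grid))  # 2 * 3 * 8 * 2 * 2 = 192 combinations
```

Even with these modest assumed ranges, nearly two hundred combinations arise, which is why the text fixes most parameters and varies only the sequence order per chart line.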
"s*" refers to -s1 to -s8 parameters except on slower compression modes where -s6 to -s8 was used (visible in the lines that contain just 3 data points). The model used for predicting base-calls is order 7 + x where x is the value after -s. E.g. -s1 uses an order-8 model and -s8 uses an order-15 model.
"+" refers to -s1+ to -s8+ parameters, indicates the use of an additional shorter order-7 model. No context mixing is used. Instead the program encodes using either the order-7 model or the order-8 to order-15 model (as indicated by the -snum), depending on which appears to have the most extreme probability bias (for any base type, not just the one being encoded).
"b" refers to the -b parameter, specifying that updates to the sequence model should take place on both strands.