Compression of Structured High-Throughput Sequencing Data

doi:10.1371/journal.pone.0079871

Figure 1.

Structured Data Compression Techniques.

We present the techniques that we devised for compressing structured High-Throughput Sequencing (HTS) data. We use a combination of general compression techniques (panel A) and of techniques that take advantage of the information provided by a data schema (B-E). (A) General compression techniques convert structured data to streams of bytes (serialization, typically done one message at a time) and then compressing the resulting stream of bytes with a general purpose compression approach such as Gzip and Bzip2. We use such techniques alone (Gzip and Bzip2 codecs) or in combination with structured data compression (Hybrid codecs, labeled H, H+T or H+T+D according to the technique used). (B) Separate field encoding reorganizes blocks of messages in lists of field values before compressing each field independently. The technique requires compressing blocks of PB messages, or PB chunks. (C) Field Modeling helps compress data by expressing the value of one field as a function of other fields and constants. (D) Template Compression Technique. Here, the data structure is used to detect subset of messages that repeat in the input messages. Fields that vary frequently are ignored from the template. The template values are stored with the number of template repetitions and the values needed to reconstruct the input messages. (E) Domain Modeling Technique. Alignment messages refer to each other with message links (i.e., references between messages) represented here as pair-link messages with three fields: position, target-index and fragment-index of the linked message. We realized that within a PB chunk, it is possible to remove the three fields representing the link and replace them with an integer index that counts how many messages up or down stream is the linked message in the chunk. Links from an entry in a chunk to an entry in another chunk cannot be removed and are stored explicitly with the three original fields.

More »

Expand

Figure 2.

A comprehensive approach to store HTS data during the analysis life-cycle.

This diagram illustrates how HTS data stored with the approaches described in this manuscript support common analysis steps of a typical HTS study. HTS reads (Tier I) are stored in files ending with the.compact-reads extension. These files can be read directly by alignment programs and facilitate efficient parallelization on compute grids. When the reads are aligned to a reference genome, alignment files are written in sets of two or three files. Files ending with.entries contain alignment entries. Each alignment entry describes how a segment of a read aligns against the reference genome. Files ending in.header contain global information about the reads, the reference genome, and the alignment (See Figure S1A in File S1 for the data schema that precisely describes these data structures). An optional.tmh file stores the identity of the reads that matched the reference so many times the aligner did not output matches for them. Aligned reads can be sorted with the ‘sort’ Goby tool, producing a.index file with enough information to support fast random access by genomic position. A permutation file (extension.perm) can also be produced to improve compression of sorted files (see Methods). Files in Tier II are stand-alone and can be transferred across the network for visualization (e.g., IGV). Files in Tier III are available for some specific types of analyses that require linking HTS alignments back to primary read data.

More »

Expand

Table 1.

Benchmark against BZip2 general compression.

More »

Expand

Table 2.

Benchmark against a CRAM baseline.

More »

Expand

Table 3.

Benchmark against a BAM baseline.

More »

Expand