## Abstract

Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data are becoming important concerns for biomedical researchers. We propose a two-pass lossless genome compression algorithm, which highlights the synthesis of complementary contextual models, to improve the compression performance. The proposed framework can handle genome compression with and without reference sequences, and demonstrated performance advantages over the best existing algorithms. The method for reference-free compression led to bit rates of 1.720 and 1.838 bits per base for bacteria and yeast, which were approximately 3.7% and 2.6% better than the state-of-the-art algorithms. Regarding performance with reference, we tested on the first Korean personal genome sequence data set, and our proposed method demonstrated a 189-fold compression rate, reducing the raw file size from 2986.8 MB to 15.8 MB at a decompression cost comparable to that of existing algorithms. DNAcompact is freely available at https://sourceforge.net/projects/dnacompact/ for research purposes.

**Citation: **Li P, Wang S, Kim J, Xiong H, Ohno-Machado L, Jiang X (2013) DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique. PLoS ONE 8(11): e80377. doi:10.1371/journal.pone.0080377

**Editor: **Christos A. Ouzounis, The Centre for Research and Technology, Hellas, Greece

**Received: **February 9, 2013; **Accepted: **October 2, 2013; **Published: ** November 25, 2013

**Copyright: ** © 2013 Li et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding: **XJ, SW, JK, and LOM were funded in part by the National Library of Medicine (http://www.nlm.nih.gov/) (K99LM011392, R01HS019913) and NHLBI (http://www.nhlbi.nih.gov/) (U54HL108460, UH2HL108785) and UL1TR00010003. HX was funded in part by the NSFC (http://www.nsfc.gov.cn/) under grants U1201255, 61271218, and 61228101. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Massively parallel sequencing (MPS) is leading revolutionary advances in the understanding of health and disease in humans. It produces far more sequencing reads at a significantly lower cost than conventional techniques such as Sanger-based capillary sequencing, which contributed to the Human Genome Project that released the first human DNA sequence in 2001 [1]. Each human has two complementary copies of 3.2 Gigabases. The 1000 Genomes Project [2] has produced more than fifty TeraBytes (TB) of data with 1,092 individuals from fourteen populations, toward the goal of sequencing 2,500 individuals in total [3]. The data size of the Sequence Read Archive, an international public archival resource of sequence reads, was expected to exceed 1000 Terabases by the end of 2013 (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement/). The sequencing output is doubling every nine months, surpassing the performance improvements of computation and storage [4]. Data compression methods that reduce storage space and save data transfer bandwidth are becoming crucial for the efficient management of large genome data.

Two previous studies [5], [6] have classified the genome compression problem into three categories based on the data type: (1) unaligned short reads in FASTQ format (e.g. Quip [7], G-SQZ [8], SCALCE [9] and DSRC [10]); (2) aligned short reads in BAM format (e.g. CRAM [11], SlimGene [5], SAMZIP [12], and NGC [6]); and (3) assembly in FASTA format (e.g. XM [13], RLZ [14], GRS [15], compression with the Burrows-Wheeler transform [16] and GReEn [17]). In this paper, we focus on the third category and study the compression issue in two scenarios, i.e., without and with a reference sequence.

De novo sequencing, where the reference sequence is unavailable, is getting a lot of attention from the biomedical community. One such example is in metagenomics [18], where the combined metagenome is enormous even though the individual genome size might be small. In the realm of reference-free genome data compression, two categories of approaches, *dictionary-based algorithms* and *statistics-based algorithms*, are used to tackle this problem. Most dictionary-based methods search for repeating subsequences (including forward and reverse complements) and encode them by referring to a previous subsequence with maximum matching length. The most representative works include the first dedicated DNA compression algorithm Biocompress [19] and CTW+LZ [20]. Alternatively, researchers also use low-order Markov models to encode regions [21]–[24] when the substitutional methods perform unfavorably. The statistical coding algorithms have been evolving: the rudimentary second-order arithmetic encoding [24], the normalized maximum likelihood (NML) algorithm [25], the expert-model (XM) algorithm [13] and the state-of-the-art FCM-C [26] algorithm. NML aims at finding the best regressor block, i.e., approximate repetition or first-order dependencies that have not been considered in the substitutional approaches. XM relies on a mixture of experts to provide symbol-by-symbol probability estimates that are used to drive an arithmetic encoder (AE). FCM-C uses two competing finite-context models to capture different aspects of statistical information along the sequence.

Because the nucleotide diversity within the same species is relatively small (e.g., the difference in humans is around 0.1% [27], i.e., one difference per 1,000 base pairs), most recent improvements in genome compression models are reference-based, such as RLZ [14], GRS [15] and GReEn [17]. RLZ indexes the reference sequence and applies the relative Lempel-Ziv algorithm. GRS applies the Huffman algorithm after checking the differential rate between the target and reference sequences. GReEn is based on arithmetic coding that relies on the copy model, where the pointers to the reference sequence position (highly likely conserved ones) are encoded to generate the probability distribution of the symbols. All these existing reference-based DNA compression algorithms apply identical schemes to mapped regions (those for which repeats can be found in the reference sequence) and to unmapped regions, which leads to a source of redundancy in the compressed file.

Essentially, all compression methods have to make a compromise between compression ratio and complexity. We set up our target applications for a practical scenario, where disk space is the limiting factor but compression time is relatively more tolerable. We propose a novel two-pass lossless DNA compression framework that takes advantage of dictionary-based and statistics-based algorithms to deal with genome compression for scenarios with and without reference sequences. A high-level overview is illustrated in Figure 1.

In the first pass, the COMPACT-NOREF scheme aims to search for self-similarity and complementary palindromes within the raw sequence, while the COMPACT-REF scheme explores the sparse representation of the target sequences in terms of the reference sequence. Both schemes share the second pass to discriminate statistical regularities.

In the first pass of compression without references (COMPACT-NOREF), we search for exact repeats within the raw sequence as well as complementary palindromes and represent them by a compact quadruplet. The first pass of compression with references (COMPACT-REF) is slightly different: there we study variations between the target sequences and reference sequences from the same species. Both COMPACT-NOREF and COMPACT-REF share the same second pass. Our main contributions are two-fold. First, we introduce non-sequential contextual models to capture non-sequential characteristics within DNA sequences, which are not considered in the existing DNA compression methods. Second, contrary to the methods XM [13] and FCM-Mx [28], which combine contextual models by Bayesian averaging and its modifications, our approach for synthesizing contextual models is less likely to produce biased results (as pointed out by Minka [29], Bayesian averaging tends to favor only one model among many with MAP (i.e., Maximum A Posteriori) estimation, while logistic regression assigns "weights" as the expression of the "appropriateness" of all candidate hypotheses to make full use of the information that is available).

## Materials and Methods

### Experimental Materials

We used a public dataset of DNA sequences in Table 1, which has been used in many other DNA compression publications (ftp://ftp.infotech.monash.edu.au/ftp/ftp/software/DNAcompress-XM/XMCompress/dataSet/). The material (www.mfn.unipmn.it/manzini/dnacorpus) shown in Table 2 is the same as the one used by Manzini et al. in dnaX [30]. This corpus contains sequences from four organisms: yeast (*Saccharomyces cerevisiae*, chromosomes 1, 4, 14 and the mitochondrial DNA), mouse (*Mus musculus*, chromosomes 7, 11, 19, X and Y), arabidopsis (*Arabidopsis thaliana*, chromosomes 1, 3 and 4) and human (*Homo sapiens*, chromosomes 2, 13, 22, X and Y). In our experiments, we also used the bacteria DNA sequences (Table 3) collected from the National Center for Biotechnology Information (NCBI) directory (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). In addition, complete DNA sequences of eleven species of various sizes were also used [31]–[41].

For compression with a reference sequence, we used the same data as those listed in [17], [15] to perform a fair comparison with GRS [15] and GReEn [17]. These data include two versions of the individual Korean genome sequences, KOREF_20090131 and KOREF_20090224 [42]. Note that the genome of a Han Chinese is referred to as YH [43]. The human genome reference assembly hg18 was released from the UCSC Genome Browser.

### Algorithm Description

The proposed two-pass framework for DNA compression is depicted in Figure 2. In this section, we start with the description of the proposed first pass, which focuses on substitution-based exact matching coding (EMC). For the second pass, we introduce our pattern-aware contextual modeling technique.

The first pass is designed to maximally reduce repetitions. The second pass handles statistical regularities in the non-repetition zones by synthesizing a mixture of contextual models. "EMC" refers to "Exact Matching Coding" and "EMD" refers to "Exact Matching Decoding".

### The first pass

The goal of this pass is to remove redundancy, such as repetitions, reverse complements (a.k.a., complemented palindromes), etc. Because COMPACT was designed to handle genome data compression with and without references, we will discuss the first pass algorithms for each scenario, separately.

**Without reference.** Figure 3 illustrates how the first-pass algorithm works in situations without a reference. As in the typical Lempel-Ziv (LZ) compression algorithm [44], suppose the initial portion $x_1^{i-1}$ of the input sequence has been compressed, and $x_i^N$ is the remaining sequence to be compressed, where $x_i$ indicates the $i$-th symbol and $N$ is the total length of the input sequence. We denote the portion $x_{i-W}^{i-1}$ as the search window for the remaining sequence $x_i^N$, where $W$ is the predefined sliding window size. The algorithm compares substrings within $x_{i-W}^{i-1}$ and $x_i^N$ to find the longest match of length $M$ (such that $L \le M \le W$), where $L$ is a threshold parameter (we will discuss the "optimal parameter" in the *Results* section).

given a sliding window. The uncertainty of the symbol in red (i.e., the next symbol to be compressed) is reduced with the information provided by the symbol in blue (i.e., the neighboring symbol after the longest match in the sliding window).

As illustrated in Figure 3, each repeat is represented by a quadruplet consisting of a substitution flag, a matching type, an offset position and a matching length. The substitution flag is encoded in the second pass together with the other DNA bases, so it is not discussed here. The matching type requires only one bit and needs no compression. We concentrated on the encoding of the offset position and the matching length. Our algorithm calculates the offset position as the distance (i.e., number of symbols) between the first symbol of the match and the beginning of the search window (see Figure 3).

We used a log-skewed encoding mechanism [30] to store the offset position because (1) the average code length of this method is bounded, and (2) the method assigns fewer bits to smaller offsets when possible. Regarding the longest matching length $M$, we encoded its excess over the threshold $L$ instead of $M$ itself (because $M \ge L$) for better coding efficiency, using a Gamma coding mechanism [45] that writes the binary representation of a value $v$ preceded by $\lfloor \log_2 v \rfloor$ zeros. We replaced each subsequence that has a satisfying match with the corresponding encoded quadruplet, and sent unmatched bases to the second-pass COMPACT coder for further processing.
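The forward-repeat part of this first pass can be illustrated with a minimal Python sketch. It is not the DNA-COMPACT implementation: the function names are hypothetical, the search is a brute-force scan, reverse complements and the log-skewed offset code are left out, and the match is kept entirely inside the window so that $M \le W$ holds.

```python
def elias_gamma(value: int) -> str:
    """Elias gamma code: floor(log2(value)) zeros, then value in binary."""
    if value < 1:
        raise ValueError("gamma coding is defined for positive integers only")
    binary = bin(value)[2:]
    return "0" * (len(binary) - 1) + binary


def longest_match(seq: str, pos: int, window: int, min_len: int):
    """Longest exact forward repeat of seq[pos:] found in the preceding window.

    Returns (offset, length), with offset counted from the beginning of the
    search window, or None when the best match is shorter than min_len.
    Hypothetical helper: brute force, forward repeats only.
    """
    start = max(0, pos - window)
    best_offset, best_len = -1, 0
    for cand in range(start, pos):
        length = 0
        # Extend the match symbol by symbol without leaving the window.
        while (cand + length < pos and pos + length < len(seq)
               and seq[cand + length] == seq[pos + length]):
            length += 1
        if length > best_len:
            best_offset, best_len = cand - start, length
    if best_len < min_len:
        return None
    return best_offset, best_len
```

A quadruplet would then carry the offset and a gamma-coded length for every match that `longest_match` reports; bases for which it returns `None` are left for the second pass.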

**With reference.** We used an adaptive mechanism (denoted by rLZ) to compress genomes when reference sequences are available. Similar to the aforementioned first pass in the reference-free compression, rLZ treats subsequences of the reference sequence as the sliding window, and conducts bidirectional searches from the starting position for the longest and nearest exact repeats of the current DNA fragment in the target sequence (see Figure 4). The bidirectional search ensures the tracking of substitutions, insertions, and deletions at close range between the target sequence and the reference sequence.

We define the range of the sliding window as an interval of left and right offsets around the starting position, e.g., [–12, 650].

Different from the scenario for compression without reference, the match length is usually much longer and the offset value is much smaller. Hence, we encoded the match length by log-skewed coding while encoding the offset value using Gamma coding. Likewise, the remaining bases after substitution or insertion will be sent to the second pass.
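A rough sketch of such a bidirectional search is given below. The function name, the argument names and the tie-breaking rule (longest match first, then the candidate nearest to the starting position) are assumptions made for illustration; the actual rLZ search strategy and its encoding of the result may differ.

```python
def nearest_longest_match(target, t_pos, reference, r_pos, left, right, min_len):
    """Search reference[r_pos - left : r_pos + right] for the longest exact
    match of target[t_pos:], preferring the candidate nearest to r_pos.

    Returns (offset relative to r_pos, match length) or None.
    Hypothetical helper: a brute-force scan, not the rLZ implementation.
    """
    lo = max(0, r_pos - left)
    hi = min(len(reference), r_pos + right + 1)
    best = None  # ((length, -distance), offset, length)
    for cand in range(lo, hi):
        length = 0
        while (cand + length < len(reference)
               and t_pos + length < len(target)
               and reference[cand + length] == target[t_pos + length]):
            length += 1
        key = (length, -abs(cand - r_pos))
        if length >= min_len and (best is None or key > best[0]):
            best = (key, cand - r_pos, length)
    if best is None:
        return None
    return best[1], best[2]
```

Because the target and the reference are highly similar, the returned offset is typically small and the length large, which is consistent with the choice of Gamma coding for offsets and log-skewed coding for lengths described above.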

### The second pass

After the first pass, the remaining uncoded sequence is further compressed in our second pass, in which each contextual model provides a probability for the symbol/bit to be encoded given certain prior knowledge, as shown in Figure 5. Then, the logistic regression model synthesizes these contextual models' predictions. The eventual output is sent to an arithmetic encoder (i.e., a form of entropy encoding that encodes the entire message into a single number).

Multiple context models (e.g., sequential models, non-sequential models, etc.) are combined by logistic regression to feed an arithmetic encoder for compression. "DMC" is the abbreviation of "Dynamic Markov Compression".

The example shown in Figure 6 offers an intuitive way to show the advantage of non-sequential contextual models [46]. Suppose that the alphabet of the DNA sequence complies with the following mappings: T-00, A-01, G-10, C-11. Then, for example, the DNA sequence ATCAT in Figure 6 can be represented by its binary form 0100110100. The left side of Figure 6 shows a third-order sequential contextual model for this binary sequence, where each bit $x_t$ at location $t$ depends on the previous three bits, i.e., $x_{t-1}$, $x_{t-2}$ and $x_{t-3}$ (the red link lines are an example). In contrast, the right side of Figure 6 shows a non-sequential contextual model, where each bit $x_t$ depends only on the bits $x_{t-1}$ and $x_{t-3}$. Furthermore, we define the dependency of a DNA bit as its context $s$. For example, both $x_{t-1}$ and $x_{t-3}$ form the context of $x_t$ in our non-sequential model. We also define $n_0(s)$ and $n_1(s)$ as the numbers of 0 bits and 1 bits observed with the context $s$ in a given DNA sequence; no context is assigned to the first bits of the sequence, which lack a full history. The bits highlighted in yellow on the left side of Figure 6 illustrate these counts for one context. Finally, we denote by $P_e$ the estimated probability under the corresponding context, and use $P_w$ to represent the weighted probability over a set of contexts. Readers can check the CTW algorithm in [47], [48] for more details about the procedures for calculating $P_e$ and $P_w$. In the example illustrated in Figure 6, the probability obtained through the non-sequential model is much higher than that of the sequential model. As a higher probability results in better compression performance (in an arithmetic coder), we expect the non-sequential model to have greater prediction power in this case.

(a, Left) The context tree is created according to a third order sequential context set. The code length generated for the sequence is bits; (b, Right) The context tree is made with the corresponding model excluding the second symbol (i.e., a non-sequential model). The code length generated for the sequence is bits.
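The counting that underlies such a comparison can be sketched in Python. The sketch uses the base-to-bit mapping given above and a plain per-context KT estimator; it does not perform the CTW weighting over a context tree used for Figure 6, so its code lengths are illustrative only. The function names and the `offsets` parameter are assumptions introduced for this example.

```python
import math
from collections import defaultdict

# Mapping stated in the text: T-00, A-01, G-10, C-11.
BASE_TO_BITS = {"T": "00", "A": "01", "G": "10", "C": "11"}


def to_bits(dna: str) -> str:
    """Binary form of a DNA string under the mapping above."""
    return "".join(BASE_TO_BITS[base] for base in dna)


def kt_code_length(bits: str, offsets) -> float:
    """Ideal code length (in bits) when every bit is predicted by a KT
    estimator from the counts of 0s and 1s seen so far under its context.

    offsets lists how far back the context bits lie: (1, 2, 3) is a
    third-order sequential context, (1, 3) skips the second preceding bit.
    The first max(offsets) bits get no context and are not coded here.
    """
    order = max(offsets)
    counts = defaultdict(lambda: [0, 0])  # context -> [n0, n1]
    total = 0.0
    for t in range(order, len(bits)):
        context = tuple(bits[t - k] for k in offsets)
        n0, n1 = counts[context]
        bit = int(bits[t])
        # KT estimator: (count of this bit + 1/2) / (n0 + n1 + 1).
        prob = ((n1 if bit else n0) + 0.5) / (n0 + n1 + 1.0)
        total += -math.log2(prob)
        counts[context][bit] += 1
    return total
```

Calling `kt_code_length(to_bits(seq), (1, 2, 3))` and `kt_code_length(to_bits(seq), (1, 3))` on the same sequence gives a quick, rough indication of which context shape predicts it better.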

Given a DNA sequence , one of the non-sequential models for such sequence can be defined as Where and each bit depends on its previous bits excluding the one (). Similarly, the sequential model of the same sequence can be expressed as .

We denote by and the contexts that satisfy the aforementioned sequential model with the bit equaling and , respectively. The context of our non-sequential model is almost the same as and except for that omitted bit. We further denote by and the number of bit and under the given contexts, respectively. Then, the ratios between the corresponding counts can be expressed as and . The following proposition, which was proved by Dai *et. al.* in [46], gives the sufficient condition by which a non-sequential model can outperform a sequential one, under the following assumptions: (1) the difference between and is small and are not close to zero or one; (2) the ratio factor is close to; (3) the sequence length is large enough.

**Proposition 1.** Under the KT-estimator [47], the non-sequential context model will result in a better coding efficiency than a sequential context model when the distribution (i.e., the counts of bit 1’s and 0’s) of the corresponding context satisfies the following inequality. (1)where is a constant near the value.

### Pattern-Aware Contextual Modeling

Although non-sequential models have advantages under certain conditions, in reality sequential and non-sequential context models are complementary, and both should be taken into consideration for completeness. However, to make computation feasible, we have to use only a small sample of the available context models. First, we define the context order $n$ as the foremost label of the selected symbols ahead of the base that includes the bit to be compressed; e.g., the context order is 6 when the farthest selected symbol lies six positions ahead of that base. According to previous context-based DNA algorithms [26], sequential context models with competing orders (e.g., a low order and a high order) have a significant effect on the final compression performance. Hence, when the maximum context order in the proposed method is set to 16, our models consist of eleven general sequential context models with orders equal to 1, 2, 4, 6, 8, 10, 11, 12, 13, 14 and 16, and a total of eleven non-sequential sparse models performed on the last four bytes (before the base including the bit to be compressed). If we use bit 1 to refer to a picked bit and 0 to an excluded one, the eleven sparse models can be represented as '00F0F0F0', 'F0F0F0F0', '00F8F8F8', 'F8F8F8F8', '00E0E0E0', 'E0E0E0E0', '00F0F0FE', 'AAAAAAAA', 'F00F00FA', 'F000F0FD' and 'F0000F00', which are series of hexadecimal digits (the reason why we chose these non-sequential contexts is explained in Appendix 5 of File S1 combined with Figure S3 in File S1).
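To make the mask notation concrete, the following Python sketch applies a sparse model's hexadecimal mask to a 32-bit history of the last four bytes. How COMPACT packs bases into bytes, and which end of the mask corresponds to the most recent byte, is not stated here, so the bit ordering below (bit 31 as the leftmost hexadecimal digit) is an assumption; the function names are hypothetical.

```python
# The eleven sparse-model masks listed in the text.
SPARSE_MASKS = [
    "00F0F0F0", "F0F0F0F0", "00F8F8F8", "F8F8F8F8", "00E0E0E0", "E0E0E0E0",
    "00F0F0FE", "AAAAAAAA", "F00F00FA", "F000F0FD", "F0000F00",
]


def sparse_context(history: int, mask_hex: str) -> int:
    """Keep only the history bits picked by the mask (1 = picked, 0 = excluded)."""
    return history & 0xFFFFFFFF & int(mask_hex, 16)


def picked_positions(mask_hex: str):
    """Bit positions (31 = leftmost) that a mask keeps, highest first."""
    mask = int(mask_hex, 16)
    return [i for i in range(31, -1, -1) if (mask >> i) & 1]
```

Two histories that agree on every picked bit map to the same sparse context, so their statistics are pooled even when the excluded bits differ. That pooling is what allows a non-sequential model to ignore positions that carry little predictive information.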

For the () model in COMPACT, the prediction of the next outcome bit can be expressed as , where is the number of total context models, and , (i.e., the maximum context order multiplies the bit number of each byte) indicates the contextual dependencies (a.k.a. bit history) for defined in the model. For example, if depends on in the model, otherwise, . Then, can be easily calculated as . In our implementation, we apply three lookup-table based functions (i.e., Run Map, Stationary Map, and Nonstationary Map, referred to http://cs.fit.edu/mmahoney/compression/), to map the bit history to the corresponding probability.

In the rest of this paper, we denote the prediction result of bit to be processed by the context model by instead of . In the next section, we discuss how to find the most likely probabilitygiven the individual predictions of all context models.

### Model synthesis based on logistic regression

According to the Maximum Entropy Principle (i.e. MAXENT [49]), the most likely probability for is the one with the highest entropy [50] as follows, (2)where is the empirical probability of and is subject to (3)

where is a function that returns if equals to the bit being predicted, or returns 0 otherwise. And is the experienced probability of . Eq.(2) with constrains of Eq.(3) can be transformed into the following equation through Lagrange multipliers. which gives (4)By replacing the parameter with , which can be viewed as the weight of the model, we obtain (5)

Given a binary bit , this is exactly a logistic regression model, which can be optimized efficiently using the Newton-Raphson algorithm. We add one diagram Figure S1 in File S1 to show how this model works in each step.
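As an illustration of how several per-model bit probabilities can be synthesized by a logistic model, consider the minimal Python sketch below. It is not the COMPACT implementation: it feeds the log-odds of each model's prediction into the logistic function, in the style of the PAQ mixers from the compression literature, and it swaps the Newton-Raphson optimization mentioned above for a simple online gradient step. The class name, the learning rate and the clamping constant are assumptions.

```python
import math


def stretch(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def squash(x: float) -> float:
    """Logistic function, the inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))


class LogisticMixer:
    """Weighted logistic combination of the models' predictions for bit 1."""

    def __init__(self, n_models: int, learning_rate: float = 0.02):
        self.weights = [0.0] * n_models
        self.learning_rate = learning_rate

    def predict(self, probs):
        # Clamp to avoid infinite log-odds for predictions of exactly 0 or 1.
        inputs = [stretch(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probs]
        mixed = squash(sum(w * s for w, s in zip(self.weights, inputs)))
        return mixed, inputs

    def update(self, probs, bit: int) -> float:
        """Predict, then move the weights along the log-likelihood gradient."""
        mixed, inputs = self.predict(probs)
        error = bit - mixed
        self.weights = [w + self.learning_rate * error * s
                        for w, s in zip(self.weights, inputs)]
        return mixed
```

In use, `update` would be called once per coded bit with the predictions of all context models; the returned probability is what an arithmetic coder would consume. Models that predict well accumulate larger weights, while uninformative models (probability 0.5, log-odds 0) leave the mixture unchanged.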

## Results

### Parameters discussion: model order and minimum match length

We implemented the encoder and decoder of COMPACT in C++, and ran experiments on a workstation with an Intel(R) Xeon(R) 3.6 GHz CPU and 96 GB of RAM. Denoting the model order by $n$ and the minimum match length by $L$ in the proposed algorithm, we explored the relationship between the compression performance (the average number of bits per base, bpb) and these parameters. To study the issue comprehensively, we selected four sequences from different species with different sizes: HUMHPRTB, 56,737 symbols; HEHCMVCG, 229,354 symbols; y-4, 1,531,929 symbols; NC013929, 10,148,695 symbols.

As indicated in Figure 7, almost all sequences demonstrated a decrease in bits per base as the model order increased. Correlation in the sequence was best predicted using context models of moderate orders. We chose 16 as the context model order. The figure also shows that the compression performance does not always improve as the minimum match length grows.

(a) model order n (using ) and (b) minimum match length (using ).

### Performance of compression without reference

In this section, we conducted experiments with the model order equal to 16 on a standard dataset of DNA sequences (Table 1), a DNA corpus consisting of four organisms used by Manzini et al. in [30] (Table 2), the bacteria DNA sequences (Table 3) from the National Center for Biotechnology Information (NCBI) directory, and the complete DNA sequences of ten species with various sizes (see Table 4). We represented time in seconds. In these experiments, the proposed compression algorithm without reference, i.e., COMPACT-NOREF, demonstrated a performance advantage compared to existing models.

### Performance of compression with reference

We tested the performance of COMPACT-REF (i.e. rLZ+COMPACT) with model order 16 and minimum match length in three cases. First, the KOREF_20090224 genome sequence data, whose raw file is 2937.7 MB, were compressed using KOREF_20090131 as reference into a 15.8 MB file, achieving a 189-fold compression rate (Figure 8). Second, Table 5 displays the 177-fold compression result obtained when the genome of a Han Chinese individual (YH) was compressed using KOREF_20090224 as reference. Third, Table 6 displays the compression results of COMPACT-REF and GReEn [17] for three different human genome assemblies (YH, KOREF_20090224 and KOREF_20090131) with transformed alphabets using hg18 (NCBI36) as their common reference; the results for the same datasets with original alphabets are displayed in Table S3 in File S1.

The original sequence alphabets have been preserved. The size of the alphabet in the target sequence is 21 for all chromosomes, except for chrM chromosome whose size is 11. The left y-axis refers to the compression ratio while the right y-axis indicates the compression or decompression time in seconds.

## Discussion

### Discussion for compression without reference

Table 1 compares the compression results in bits per base (bpb). Along with our proposed algorithm, we present the existing algorithms, i.e. DNA3 [30], XM500 [13] and FCM [51]. For short sequences, our COMPACT-NOREF with the proposed LZ method applied in the first pass outperformed the variant using dnaX (a fast algorithm using fingerprints introduced by Manzini et al. [30]), but this advantage became less obvious as the sequence sizes increased. The reason is that encoding the finite repeats with a leading indicator rather than copious repeat locations saves space, and Gamma coding is more efficient than dnaX's continuation bit encoding for short sequences. Hence, we implemented the first pass of the remaining experiments for reference-free scenarios with dnaX.

As for "difficult" sequences like HUMD (i.e., HUMDYSTROP) and HUMH (i.e., HUMHBB), for which we did not gain a performance advantage, we noted that both are human sequences, which often contain approximate repeats rather than exactly duplicated strings. Therefore, COMPACT-NOREF and other similar algorithms such as [23], [30], which only take exact repeats into consideration, did not achieve the best performance for human genome compression. We conducted a test experiment, shown in Figure S2 in File S1, to support this hypothesis. Table 2 further shows that COMPACT-NOREF is slightly inferior to XM for the "difficult" sequences of four organisms (i.e., yeast, mouse, arabidopsis, and human). However, XM adopts a more sophisticated modeling approach (e.g., the combination of various "experts") and a large number of expert models to attain a better representation of DNA sequences at a rapidly increasing cost in memory and time, while ours only picks out certain suitable ones relying on short-term knowledge from the past. In our experiments, XM took around 2.0 GB of memory or more to compress the sequences over 100 MB in the fourth group, which consumed much more resources than the other methods (the proposed method took much less memory; please refer to Table S2 in File S1). We carried out additional experiments on the bacteria DNA sequences from the National Center for Biotechnology Information (NCBI) directory (Table 3). We compared a variation of the proposed algorithm, semi-COMPACT (i.e., COMPACT-NOREF without the first pass), and COMPACT-NOREF with an XM encoder and a state-of-the-art algorithm, FCM-Mx [28], in terms of compression and required time. Table 3 presents the individual compression results on these sequences with 10,000,000 or more bases. The table also includes the average compression result of each algorithm in the last row.

For bacteria, the proposed method demonstrated the best performance among all algorithms. The average compression rates of five sequences reported for XM500 and FCM-Mx were 1.787 bpb and 1.7543 bpb, while the average performance of our method COMPACT-NOREF on the same set was 1.7204 bpb. The time cost of the proposed methods was comparable to that of XM. Results for eleven complete genomes are shown in Table 4. The FCM-S and FCM-M [52] columns contain results provided by the finite-context models and by the multiple competing finite-context models. FCM-S processed DNA sequences using the single finite-context model approach, in which the best context depth was used, whereas FCM-M obtained its results with multiple competing models. The results presented in Table 4 show a pattern similar to that of Table 3. Moreover, Tables 1 to 4 all include the compression results of 'COMPACT-seq', which uses only traditional sequential models (including models based on Markov chains) followed by the logistic regression model. In Table 4 in particular, 'COMPACT-seq' outperforms all other algorithms on ten complete genomes, and 'COMPACT-NOREF' exceeds 'COMPACT-seq' on almost all sequences. This suggests that both the proposed non-sequential models and the logistic regression mixture model contribute to the compression gains.

### Discussion for compression with reference

We compared the performance of the proposed method COMPACT-REF (i.e., rLZ+COMPACT) to that of GRS [15] and GReEn [17], the two most recently proposed approaches for compressing genome resequencing data that handle sequences over arbitrary alphabets. Figure 8 displays the compression performance for the human genome KOREF_20090224 using KOREF_20090131 as reference. COMPACT-REF gave better results in terms of compression ratio, but it is slower than GReEn and GRS. This speed disadvantage deserves a special note. Unlike GReEn, the compression time of COMPACT-REF does not vary linearly with the size of the sequence but rather depends on the degree of similarity between the reference sequence and the target sequence. This is also the reason why some longer sequences, like chr8 and chr9, took a shorter time to compress. On the other hand, Figure 8 also demonstrates that COMPACT-REF can achieve a decompression time comparable to that of GReEn: both the decompression procedure and the decompression runtime of GReEn are identical to those of its compression, while COMPACT-REF saves time in the first pass of decompression because it only locates the match position instead of searching for repeats. Hence, we believe that COMPACT-REF is advantageous in applications for which disk space and decompression time are the limiting factors but the compression time is more tolerable, such as sequence archiving and sequence acquisition. To allow a more specific and clearer comparison, we also present the results in tabular form in File S1 (please refer to Table S1).

To provide a more comprehensive comparison between GRS, GReEn and the proposed compression approaches, we investigated another human genome assembly, YH, which is the genome of a Han Chinese individual. Table 5 displays the compression results of YH using KOREF_20090224 as reference. GRS performed poorly in both compression rate and speed. Our COMPACT-REF approach achieved good results with the appropriate window size, which can be selected by choosing a large window size and gradually shrinking it. Note that the window range only slightly affects the compression performance. The default left and right window sizes are obtained adaptively by calculating the difference percentage, which equals the sum over all bases (i.e., 'A', 'C', 'T', 'G', 'N', 'a', 'c', 't', 'g' and 'n') of the difference between each base's count in the source sequence and in the reference sequence, divided by the base length of the source sequence. If the percentage is smaller than 0.65%, the default window size is set to [–12, 650]; if the percentage is larger than 0.65% but smaller than 5%, the default window size is set to [–12, 812]; otherwise the window size is set to [–12, 11560]. For the compression of KOREF_20090224 using KOREF_20090131 as reference, the default window range is [–12, 650], except for chr1, chr4, chrX and chrY, for which it is [–12, 812]. In the compression of YH using KOREF_20090224 as reference, the default window range is [–12, 11560] for most chromosomes.
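The window selection rule can be written down as a short Python sketch. Two points are assumptions rather than statements from the text: the per-base difference is taken as an absolute value, and a percentage of exactly 0.65% is assigned to the middle window. The function names are hypothetical.

```python
from collections import Counter

# Bases counted by the rule described above.
BASES = "ACTGNactgn"


def difference_percentage(source: str, reference: str) -> float:
    """Sum of per-base count differences, as a percentage of the source length."""
    src_counts, ref_counts = Counter(source), Counter(reference)
    diff = sum(abs(src_counts[b] - ref_counts[b]) for b in BASES)
    return 100.0 * diff / len(source)


def default_window(source: str, reference: str):
    """Default (left, right) window range chosen from the difference percentage."""
    pct = difference_percentage(source, reference)
    if pct < 0.65:
        return (-12, 650)
    if pct < 5.0:
        return (-12, 812)
    return (-12, 11560)
```

Because only base counts are compared, the rule is cheap to evaluate per chromosome before compression starts, at the price of ignoring where along the sequence the differences occur.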

Table 6 summarizes the compression results of COMPACT-REF and GReEn for three different human genome assemblies (YH, KOREF_20090224 and KOREF_20090131) using the same reference, hg18 (NCBI36). KOREF_20090131, KOREF_20090224 and YH are three genome databases generated by two different organizations. YH is the first diploid genome sequence of a Han Chinese individual, a representative of the Asian population, completed by the Beijing Genomics Institute at Shenzhen (BGI-Shenzhen). KOREF_20090131 and KOREF_20090224 are two versions of the first individual Korean genome, released in December 2008 as the result of the Korean reference genome construction project. Consequently, the alphabets of the KOREF and YH datasets differ. Both KOREF_20090131 and KOREF_20090224 consist of 21 symbols, such as ‘A’, ‘C’, ‘T’, ‘G’, ‘N’ and ‘M’, with the symbols other than {‘A’, ‘C’, ‘T’, ‘G’} indicating different sequencing quality or uncertainty. In contrast, all bases in YH are confined to {‘A’, ‘C’, ‘T’, ‘G’, ‘N’}, with only ‘N’ representing uncertain bases. Hence, to keep the alphabet size identical for the sake of comparison, we transformed all characters to lowercase and mapped unknown nucleotides to ‘n’. Table S3 in File S1 displays the compression results of COMPACT-REF and GReEn for the same datasets with the original alphabets. The proposed method significantly reduced the size of each sequence in Table 6, although GReEn appears to perform better. The reason why GReEn generates better compression results than COMPACT-REF in this situation may be that GReEn relies on the probability distribution of characters in the target sequence (assuming that the characters of the target sequence are an exact copy of (parts of) the reference sequence). When we do not eliminate the effect of character case (i.e., uppercase or lowercase), as in Table S3, GReEn shows an obvious disadvantage in compression performance. These experiments demonstrate the applicability of our framework to compressing genomic data sets and encourage further investigation.
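The alphabet normalization used for Table 6 can be illustrated with a short sketch. The function name is hypothetical, and the sketch assumes that every symbol other than a, c, g and t (after lowercasing) counts as an unknown nucleotide; the exact mapping applied in the experiments is not spelled out in the text.

```python
def normalize_alphabet(sequence: str) -> str:
    """Lowercase a sequence and map every non-acgt symbol to 'n'.

    Assumption: any symbol outside {a, c, g, t} is treated as unknown.
    """
    known = set("acgt")
    return "".join(ch if ch in known else "n" for ch in sequence.lower())
```

Under this assumption, a fragment such as `ACgtNMr` becomes `acgtnnn`, so all three assemblies share the same five-symbol alphabet.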

## Supporting Information

### File S1.

**Supporting figures and tables. Figure S1.** Diagram of the logistic regression model that synthesizes different models to obtain a single probability. **Figure S2.** The relationship between the compression rate and the quantity of noise over the sequence HEHCMVCG. **Figure S3.** Schematic diagram of the selected contexts for the eleven non-sequential sparse models. Red blocks mark the selected bits, while the others mark the excluded ones. **Table S1.** Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference. **Table S2.** The evaluation of memory usage in our experiments. **Table S3.** Homo sapiens genome: compression with COMPACT-REF and GReEn of the YH, KOREF_20090224 and KOREF_20090131 versions with original alphabets using hg18 as a reference.

doi:10.1371/journal.pone.0080377.s001

(DOCX)

## Author Contributions

Conceived and designed the experiments: PHL HKX XQJ. Performed the experiments: PHL. Analyzed the data: PHL SW XQJ. Contributed reagents/materials/analysis tools: SW JK LOM. Wrote the paper: PHL SW JK HKX LOM XQJ. Designed the software: PHL SW XQJ.

## References

- 1. Mardis ER (2011) A decade's perspective on DNA sequencing technology. Nature 470(7333): 198–203.
- 2. Altshuler DM, Lander ES, Ambrogio L, Bloom T, Cibulskis K, et al. (2010) A map of human genome variation from population scale sequencing. Nature 467(7319): 1061–1073.
- 3. The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
- 4. Kahn SD (2011) On the future of genomic data. Science 331(6018): 728–729.
- 5. Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G (2011) Compressing genomic sequence fragments using SlimGene. Journal of Computational Biology 18(3): 401–413.
- 6. Popitsch N, Haeseler A (2013) NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Research 41(1): e27.
- 7. Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research 40(22): e171.
- 8. Tembe W, Lowey J, Suh E (2010) G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26(17): 2192–2194.
- 9. Hach F, Numanagić I, Alkan C, Sahinalp SC (2012) SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28(23): 3051–3057.
- 10. Deorowicz S, Grabowski S (2011) Compression of DNA sequence reads in FASTQ format. Bioinformatics 27(6): 860–862.
- 11. Fritz MHY, Leinonen R, Cochrane G, Birney E (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome research 21(5): 734–740.
- 12. Sakib MN, Tang J, Zheng WJ, Huang CT (2011) Improving Transmission Efficiency of Large Sequence Alignment/Map (SAM) Files. PloS one 6(12): e28251.
- 13. Cao MD, Dix TI, Allison L, Mears C (2007) A Simple Statistical Algorithm for Biological Sequence Compression. Data Compression Conference (DCC'07), pages 43–52.
- 14. Kuruppu S, Puglisi S, Zobel J (2010) Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. String Processing and Information Retrieval 6393/2010: 201–206.
- 15. Wang C, Zhang D (2011) A novel compression tool for efficient storage of genome resequencing data. Nucleic acids research 39(7): e45.
- 16. Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012) Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics 28(11): 1415–1419.
- 17. Pinho AJ, Pratas D, Garcia SP (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucleic acids research 40(4): e27.
- 18. Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS computational biology 6(2): e1000667.
- 19. Grumbach S, Tahi F (1993) Compression of DNA sequences. Data Compression Conference (DCC'93), pages 340–350.
- 20. Matsumoto T, Sadakane K, Imai H (2000) Biological sequence compression algorithms. Genome informatics. Workshop on Genome Informatics 11: 43–52.
- 21. Behzadi B, Le Fessant F (2005) DNA compression challenge revisited: a dynamic programming approach. Combinatorial Pattern Matching 3537(2005): 85–96.
- 22. Chen X, Li M, Ma B, Tromp J (2002) DNACompress: fast and effective DNA sequence compression. Bioinformatics (Oxford, England) 18(12): 1696–8.
- 23. Grumbach S, Tahi F (1994) A new challenge for compression algorithms: Genetic sequences. Information Processing & Management 30(6): 875–886.
- 24. Chen X, Kwong S, Li M (2001) A compression algorithm for dna sequences. Engineering in Medicine and Biology Magazine, IEEE 20: 61–66.
- 25. Korodi G, Tabus I (2007) Normalized maximum likelihood model of order-1 for the compression of DNA sequences. Data Compression Conference (DCC'07), Snowbird, Utah, pages 33–42.
- 26. Pratas D, Pinho AJ (2011) Compressing the human genome using exclusively Markov models. 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011), pages 213–220.
- 27. Jorde LB, Wooding SP (2004) Genetic variation, classification and ‘race’. Nature genetics 36: S28–S33.
- 28. Pinho AJ (2011) Bacteria DNA sequence compression using a mixture of finite-context models. IEEE Statistical Signal Processing Workshop (SSP), pages 125–128.
- 29. Minka TP (2000) Bayesian model averaging is not model combination. MIT Media Lab note (7/6/00). Available: http://research.microsoft.com/en-us/um/people/minka/papers/minka-bma-isnt-mc.pdf. Accessed 20 December 2012.
- 30. Manzini G, Rastero M (2004) A simple and fast DNA compressor. Software: Practice and Experience 34(14): 1397–1411.
- 31. Arabidopsis thaliana. ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes. Accessed 10 December 2012.
- 32. Aspergillus nidulans. ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Aspergillus_nidulans_FGSC_A4_uid13961/. Accessed 10 December 2012.
- 33. Candida albicans. http://www.candidagenome.org/download/sequence/C_albicans_SC5314/Assembly21/archived_as_released. Accessed 10 December 2012.
- 34. Escherichia coli. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr_MG1655_uid57779/. Accessed 10 December 2012.
- 35. Methanocaldococcus jannaschii. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanocaldococcus_jannaschii_DSM_2661_uid57713/. Accessed 10 December 2012.
- 36. Mycoplasma genitalium. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Mycoplasma_genitalium_G37_uid57707/. Accessed 10 December 2012.
- 37. Saccharomyces cerevisiae. ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128/. Accessed 10 December 2012.
- 38. Schizosaccharomyces pombe. ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Schizosaccharomyces_pombe_uid127/. Accessed 10 December 2012.
- 39. Staphylococcus aureus. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Staphylococcus_aureus_MSSA476_uid57841/. Accessed 10 December 2012.
- 40. Thermococcus kodakarensis. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Thermococcus_kodakarensis_KOD1_uid58225/. Accessed 10 December 2012.
- 41. Homo sapiens. ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/April_14_2003. Accessed 10 December 2012.
- 42. Ahn SM, Kim TH, Lee S, Kim D, Ghang H, et al. (2009) The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome research 19(9): 1622–1629.
- 43. Wang J, Wang W, Li R, Li Y, Tian G, et al. (2008) The diploid genome sequence of an Asian individual. Nature 456(7218): 60–5.
- 44. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. Information Theory, IEEE Transactions on 23(3): 337–343.
- 45. Elias P (1975) Universal codeword sets and representations of the integers. Information Theory, IEEE Transactions on 21(2): 194–203.
- 46. Dai W, Xiong H, Song L (2008) On Non-sequential Context Modeling with Application to Executable Data Compression. Data Compression Conference (DCC'08), Snowbird, Utah, number 2006, pages 172–181.
- 47. Krichevsky R, Trofimov V (1981) The performance of universal encoding. Information Theory, IEEE Transactions on 27(2): 199–207.
- 48. Willems FMJ, Shtarkov YM, Tjalkens TJ (1995) The context-tree weighting method: Basic properties. Information Theory, IEEE Transactions on 41(3): 653–664.
- 49. Jaynes ET (1957) Information theory and statistical mechanics. Physical review 106(4): 620.
- 50. Mahoney MV (2000) Fast text compression with neural networks. In: FLAIRS Conference, pages 230–234.
- 51. Pinho AJ, Neves AJ, Bastos CA, Ferreira PJ (2009) DNA coding using finite-context models and arithmetic coding. IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, pages 1693–1696.
- 52. Pinho AJ, Ferreira PJ, Neves AJ, Bastos CA (2011) On the representability of complete genomes by multiple competing finite-context (Markov) models. PloS one 6(6): e21588.