DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique

Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data are becoming important concerns for biomedical researchers. We propose a two-pass lossless genome compression algorithm, which highlights the synthesis of complementary contextual models, to improve the compression performance. The proposed framework can handle genome compression with and without reference sequences, and demonstrated performance advantages over the best existing algorithms. The method for reference-free compression led to bit rates of 1.720 and 1.838 bits per base for bacteria and yeast, which were approximately 3.7% and 2.6% better than the state-of-the-art algorithms. Regarding performance with reference, we tested on the first Korean personal genome sequence data set, and our proposed method demonstrated a 189-fold compression rate, reducing the raw file size from 2986.8 MB to 15.8 MB at a decompression cost comparable to that of existing algorithms. DNAcompact is freely available at https://sourceforge.net/projects/dnacompact/ for research purposes.


Introduction
Massively parallel sequencing (MPS) is leading revolutionary advances in the understanding of health and disease in humans. It produces far more sequencing reads at a significantly lower cost than conventional techniques such as Sanger-based capillary sequencing, which contributed to the Human Genome Project that released the first human DNA sequence in 2001 [1]. Each human has two complementary copies of 3.2 Gigabases. The 1000 Genomes Project [2] has produced more than fifty TeraBytes (TB) of data with 1,092 individuals from fourteen populations, toward the goal of sequencing 2,500 individuals in total [3]. The data size of the Sequence Read Archive, an international public archival resource of sequence reads, is expected to exceed 1000 Terabases by the end of 2013 (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement/). The sequencing output is doubling every nine months, surpassing the performance improvements of computation and storage [4]. Data compression methods for reducing the storage space and saving the data transfer bandwidth are becoming crucial for the efficient management of large genome data.
De novo sequencing, where the reference sequence is unavailable, is getting a lot of attention from the biomedical community. One such example is metagenomics [18], where the combined metagenome is enormous even though the individual genome size might be small. In the realm of reference-free genome data compression, two categories of approaches, dictionary-based algorithms and statistics-based algorithms, are used to tackle this problem. Most dictionary-based methods search for repeating subsequences (including forward and reverse complements) and encode them by referring to a previous subsequence with maximum matching length. The most representative works include the first dedicated DNA compression algorithm, Biocompress [19], and CTW+LZ [20]. Alternatively, researchers also use low-order Markov models to encode regions [21][22][23][24] where the substitutional methods perform unfavorably. The statistical coding algorithms have been evolving: the rudimentary second-order arithmetic encoding [24], the normalized maximum likelihood (NML) algorithm [25], the expert-model (XM) algorithm [13] and the state-of-the-art FCM-C [26] algorithm. NML aims at finding the best regressor block, i.e., approximate repetition or first-order dependencies that have not been considered in the substitutional approaches. XM relies on a mixture of experts to provide symbol-by-symbol probability estimates that are used to drive an arithmetic encoder (AE). FCM-C uses two competing finite-context models to capture different aspects of statistical information along the sequence.
Because the nucleotide diversity within the same species is relatively small (e.g., the difference in humans is around 0.1% [27], i.e., one difference per 1,000 base pairs), most recent improvements in genome compression models are reference-based, such as RLZ [14], GRS [15] and GReEn [17]. RLZ indexes the reference sequence and applies the relative Lempel-Ziv algorithm. GRS applies the Huffman algorithm after checking the differential rate between the target and reference sequences. GReEn is based on arithmetic coding that relies on the copy model, where the pointers to the reference sequence position (highly likely conserved ones) are encoded to generate the probability distribution of the symbols. All these existing reference-based DNA compression algorithms apply identical schemes to the mapped regions, for which repeats can be found in the reference sequence, and to the unmapped regions, which leads to a source of redundancy in the compressed file.
Essentially, all compression methods have to compromise between compression ratio and complexity. We set our target applications to a practical scenario, where disk space is the limiting factor but compression time is relatively more tolerable. We propose a novel two-pass lossless DNA compression framework that takes advantage of both dictionary-based and statistics-based algorithms to deal with genome compression for scenarios with and without reference sequences. A high-level overview is illustrated in Figure 1.
In the first pass of compression without references (COMPACT-NOREF), we search for exact repeats within the raw sequence as well as complementary palindromes and represent them by a compact quadruplet. The first pass of compression with references (COMPACT-REF) is slightly different: there, we study variations between the target sequences and reference sequences from the same species. Both COMPACT-NOREF and COMPACT-REF share the same second pass. Our main contributions are two-fold. First, we introduce non-sequential contextual models to capture non-sequential characteristics within DNA sequences, which are not considered in the existing DNA compression methods. Second, in contrast to XM [13] and FCM-Mx [28], which combine contextual models by Bayesian averaging and its modifications, our approach for synthesizing contextual models is less likely to produce biased results. As pointed out by Minka [29], Bayesian averaging tends to favor only one model among many with MAP (i.e., Maximum A Posteriori) estimation, while logistic regression assigns "weights" as the expression of the "appropriateness" of all candidate hypotheses to make full use of the information that is available.
For compression with a reference sequence, we used the same data as those listed in [17,15] to perform a fair comparison with GRS [15] and GReEn [17]. These data include two versions of the individual Korean genome sequences, KOREF_20090131 and KOREF_20090224 [42]. Note that the genome of a Han Chinese is referred to as YH [43]. The human genome reference assembly hg18 was released from the UCSC Genome Browser.

Algorithm Description
The proposed two-pass framework for DNA compression is depicted in Figure 2. In this section, we start with the description of the proposed first pass, which focuses on substitution-based exact matching coding (EMC). For the second pass, we introduce our pattern-aware contextual modeling technique.

The first pass
The goal of this pass is to remove redundancy, such as repetitions, reverse complements (a.k.a. complemented palindromes), etc. Because COMPACT was designed to handle genome data compression with and without references, we will discuss the first-pass algorithms for each scenario separately.
Without reference. Figure 3 illustrates how the first-pass algorithm works in situations without a reference. Like the typical compression algorithm Lempel-Ziv (i.e., LZ) [44], suppose the initial portion $S[1, Z-1]$ of the input sequence has been compressed, and $S[Z, N]$ is the remaining sequence to be compressed, where $Z$ indicates the $Z$-th symbol and $N$ is the total length of the input sequence. We denote the portion $S[\max(1, Z-W), Z-1]$ as the search window for the remaining sequence $S[Z, N]$, where $W$ is the predefined sliding window size. The algorithm compares substrings within $S[\max(1, Z-W), Z-1]$ and $S[Z, Z+W-1]$ to find the longest match $S[Z, Z+M-1]$ (such that $L \le M \le W$), where $L$ is a threshold parameter (we will discuss the "optimal parameter" in the Results section). As illustrated in Figure 3, each repeat is represented by a quadruplet $\langle D, r, P, M \rangle$. Substitution flag 'D' is encoded in the second pass with other DNA bases, so it is not discussed here. Matching type $r$ requires only one bit, which has no need of compression. We concentrated on the encoding of the offset position $P$ and the matching length $M$. Our algorithm calculated the offset position $P$ as the distance (i.e., number of symbols) between the first symbol of the match and the beginning of the search window (see Figure 3).
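The match search described above can be sketched as follows. This is a minimal brute-force illustration, not the authors' implementation: the function names, the 0-based indexing, the restriction of matches to lie entirely inside the search window, and the encoding of the matching type `r` as 0 (forward repeat) or 1 (reverse complement) are all assumptions made for the example.

```python
# Hypothetical sketch of the first-pass match search (brute force, 0-based).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return s.translate(COMPLEMENT)[::-1]


def longest_match(seq, z, window, min_len):
    """Find the longest match for seq[z:] inside the preceding search window.

    Returns (r, P, M) or None, where r is 0 for a forward repeat and 1 for a
    reverse-complement match (assumed encoding), P is the offset of the match
    from the beginning of the search window, and M is the match length with
    min_len <= M <= window.
    """
    start = max(0, z - window)
    history = seq[start:z]                    # search window before position z
    max_len = min(window, len(seq) - z)       # cannot read past the sequence end
    for m in range(max_len, min_len - 1, -1): # try the longest candidates first
        target = seq[z:z + m]
        p = history.find(target)
        if p != -1:
            return (0, p, m)
        p = history.find(reverse_complement(target))
        if p != -1:
            return (1, p, m)
    return None
```

A production coder would replace the quadratic scan with hashing or fingerprints; the sketch only makes the roles of `W`, `L`, `P`, and `M` concrete.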
We used a log-skewed encoding mechanism [30] to store the offset position $P$ because: (1) the average code length of such a method is no more than $\lceil \log_2(P) \rceil$ bits, and (2) the method assigns fewer bits for smaller $P$ when possible. Regarding the longest matching length $M$, we encoded the value of $M - L + 1$ instead of $M$ (because $M \ge L$) for better coding efficiency, using a Gamma coding mechanism [45] that writes the value's binary representation preceded by $\lceil \log_2(M - L + 1) \rceil$ zeros. We replaced each subsequence that has a satisfying match with the corresponding encoded quadruplet, and sent unmatched bases to the second-pass COMPACT coder for further processing.
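To make the length coding concrete, the following sketch implements the standard Elias gamma code and applies it to $M - L + 1$. It is an illustration under assumptions: the standard code uses $\lfloor \log_2 x \rfloor$ leading zeros, which may differ in detail from the variant used in the paper, the function names are hypothetical, and the log-skewed code of [30] is not reproduced here.

```python
# Hypothetical sketch: standard Elias gamma coding of the shifted match length.

def gamma_encode(x):
    """Encode a positive integer as floor(log2 x) zeros followed by x in binary."""
    if x < 1:
        raise ValueError("Elias gamma coding requires x >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits, pos=0):
    """Decode one gamma codeword starting at pos; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


def encode_match_length(m, min_len):
    """Encode M - L + 1 rather than M, since M >= L guarantees a value >= 1."""
    return gamma_encode(m - min_len + 1)
```

Because short codewords go to small values, shifting by the threshold $L$ lets the most common (shortest admissible) matches use the fewest bits.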
With reference. We used an adaptive mechanism (denoted by rLZ) to compress genomes when reference sequences are available. Similar to the aforementioned first pass in the reference-free compression, rLZ treats subsequences of the reference sequence as the sliding window, and conducts bi-directional searches from the starting position for the longest and nearest exact repeats of the current DNA fragment in the target sequence (see Figure 4). The bi-directional search ensures the tracking of substitutions, insertions, and deletions at a close range between the target sequence and the reference sequence.
Unlike in the scenario for compression without reference, the match length $M$ is usually much longer and the offset value $P$ is much smaller. Hence, we encoded the match length $M$ by log-skewed coding while encoding the offset value $P$ using Gamma coding. Likewise, the remaining bases after substitution or insertion are sent to the second pass.

The second pass
After the first pass, the remaining uncoded sequence is further compressed through our second pass, in which each contextual model provides a probability, given certain prior knowledge, for the symbol/bit to be encoded, as shown in Figure 5. Then, the logistic regression model synthesizes these contextual models' predictions. The eventual output is sent to an arithmetic encoder (i.e., a form of entropy encoding that encodes the entire message into a single number).
The example shown in Figure 6 offers an intuitive way to show the advantage of non-sequential contextual models [46]. Suppose that the alphabet of the DNA sequence complies with the following mappings: T-00, A-01, G-10, C-11. Then, for example, the DNA sequence ATCAT in Figure 6 can be represented by its corresponding binary form as 0100110100. The left side of Figure 6 shows a $d = 3$ order sequential contextual model for the given binary DNA sequence, where each bit $y_t$ at location $t$ depends on the previous three bits, i.e., $y_{t-3}^{t-1} = \{y_{t-3}, y_{t-2}, y_{t-1}\}$ (the red link lines are an example). In contrast, a $d = 3$ order non-sequential contextual model can be found in the right side of Figure 6, where each bit $y_t$ only depends on the bits $y_{t-3}$ and $y_{t-1}$. Furthermore, we define the dependency of the DNA bit as the context denoted by $s$. For example, both $s = \{y_{t-3}, y_{t-1}\}$ and $s = \{y_{t-1}\}$ are contexts of $y_t$ in our non-sequential model. We also define $a_s$ and $b_s$ as the number of bit 0's and bit 1's with the context $s$ in a given DNA sequence, for which we do not assign any context to the first $d$ bits in the sequence. For instance, the number of bit 0's and bit 1's with context $s = \{1\}$ (the bits with yellow color) in the left side of Figure 6 is $a_s = 2$ and $b_s = 1$. Finally, we denote by $P_e(a_s, b_s)$ the estimated probability under the corresponding context, and use $P_w(P_e(a_s, b_s), S)$ to represent the weighted probabilities over a set of contexts $S$ (i.e., $\forall s, s \in S$). Readers can check the CTW algorithm in [47][48] for more details about the procedures for calculating $P_e(a_s, b_s)$ and $P_w(P_e(a_s, b_s), S)$. In the example illustrated in Figure 6, the probability $P_w(P_e(a_s, b_s), S)$ obtained through a non-sequential model is much higher than that of a sequential model. As a higher $P_w(P_e(a_s, b_s), S)$ results in better compression performance (in an arithmetic coder), we expect the non-sequential model to have greater prediction power in this case.
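The counts $a_s$, $b_s$ and the estimate $P_e(a_s, b_s)$ can be illustrated with a short sketch. It assumes the Krichevsky-Trofimov (KT) estimator for $P_e$ and simply sums the per-context code lengths for one fixed set of context offsets; the CTW weighting $P_w$ is not implemented, and all function names are hypothetical.

```python
# Hypothetical sketch: per-context bit counts and KT code length (no CTW weighting).
import math
from collections import defaultdict


def kt_block_probability(a, b):
    """KT estimate P_e(a, b) of a block containing a zeros and b ones."""
    p = 1.0
    zeros = ones = 0
    for _ in range(a):
        p *= (zeros + 0.5) / (zeros + ones + 1)
        zeros += 1
    for _ in range(b):
        p *= (ones + 0.5) / (zeros + ones + 1)
        ones += 1
    return p


def context_counts(bits, offsets, skip):
    """Map each context value to [a_s, b_s], ignoring the first `skip` bits.

    `offsets` lists the look-back distances that form the context, e.g.
    (3, 2, 1) for the sequential order-3 model or (3, 1) for the sparse one.
    """
    counts = defaultdict(lambda: [0, 0])
    for t in range(skip, len(bits)):
        ctx = tuple(bits[t - k] for k in offsets)
        counts[ctx][int(bits[t])] += 1
    return dict(counts)


def code_length_bits(bits, offsets, skip):
    """Total ideal code length, in bits, when each context uses its own KT estimate."""
    return sum(-math.log2(kt_block_probability(a, b))
               for a, b in context_counts(bits, offsets, skip).values())


binary_atcat = "0100110100"  # ATCAT under T-00, A-01, G-10, C-11
sequential = code_length_bits(binary_atcat, (3, 2, 1), skip=3)
sparse = code_length_bits(binary_atcat, (3, 1), skip=3)
```

With the single-bit context `offsets=(1,)` and `skip=3`, `context_counts` reproduces the counts quoted in the text for the context $s = \{1\}$, namely $a_s = 2$ and $b_s = 1$.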
Given a DNA sequence $y_1^N$, one of the non-sequential models for such a sequence can be defined through the conditional probability $P(y_t \mid y_{t-D}, \ldots, y_{t-i-1}, y_{t-i+1}, \ldots, y_{t-1})$, where $t = 1, \ldots, N$ and each bit $y_t$ depends on its previous $D$ bits excluding the $i$-th one ($1 \le i \le D-1$). Similarly, the sequential model conditions each bit on all of its previous $D$ bits, $P(y_t \mid y_{t-D}, \ldots, y_{t-1})$. We denote by $s_0$ and $s_1$ the contexts that satisfy the aforementioned sequential model with the $i$-th bit equaling 0 and 1, respectively. The context $s$ of our non-sequential model is the same as $s_0$ and $s_1$ except for the omitted $i$-th bit. We further denote by $a, a_0, a_1$ and $b, b_0, b_1$ the number of bit 0's and 1's under the given contexts $s, s_0, s_1$, respectively. Ratios $u$, $v$ and $r$ between the corresponding counts can then be defined. The following proposition, from [46], gives the sufficient condition under which a non-sequential model can outperform a sequential one, under the following assumptions: (1) the difference between $u$ and $v$ is small, and $u$, $v$ are not close to zero or one; (2) the ratio factor $r$ is close to $\frac{1}{2}$; (3) the sequence length $a + b$ is large enough.

Proposition 1.
Under the KT-estimator [47], the non-sequential context model will result in a better coding efficiency than a sequential context model when the distribution (i.e., the counts of bit 1's and 0's) of the corresponding context satisfies the following inequality.

Pattern-Aware Contextual Modeling:
Although non-sequential models have advantages under certain conditions, in reality, sequential and non-sequential context models are complementary, and both should be taken into consideration for completeness. However, to make computation feasible, we have to use only a small sample of the available context models. First, we define the context order $n$ as the largest offset among the selected symbols preceding the base that contains the bit to be compressed; e.g., the context order is 6 when the context is selected as $\{y_{t-6}, y_{t-4}, y_{t-3}, y_{t-2}, y_{t-1}\}$ for $y_t$. According to previous context-based DNA algorithms [26], sequential context models with competing orders (e.g., a low order and a high order) have a significant effect on the final compression performance. Hence, when the maximum context order in the proposed method is set to 16, our models consist of eleven general sequential context models with orders equal to 1, 2, 4, 6, 8, 10, 11, 12, 13, 14 and 16, and a total of eleven non-sequential sparse models applied to the last four bytes (before the base that contains the bit to be compressed). If we use bit 1 to denote a picked bit and 0 an excluded one, the eleven sparse models can be represented as '00F0F0F0', 'F0F0F0F0', '00F8F8F8', 'F8F8F8F8', '00E0E0E0', 'E0E0E0E0', '00F0F0FE', 'AAAAAAAA', 'F00F00FA', 'F000F0FD' and 'F0000F00', which are series of hexadecimal digits. The reason why we chose these non-sequential contexts is explained in Appendix 5 of File S1, together with Figure S3 in File S1.
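The hexadecimal notation for the sparse models can be unpacked with a few lines of code. The sketch below expands each mask into its 32 pick/exclude bits and applies it to a 32-bit history word. The bit ordering is an assumption (position 0 is taken to be the most significant bit of the four-byte history, and the text does not say which end holds the most recent bit), and the helper names are hypothetical.

```python
# Hypothetical sketch: expanding the eleven sparse-context masks listed in the text.
SPARSE_MASKS = [
    "00F0F0F0", "F0F0F0F0", "00F8F8F8", "F8F8F8F8", "00E0E0E0", "E0E0E0E0",
    "00F0F0FE", "AAAAAAAA", "F00F00FA", "F000F0FD", "F0000F00",
]


def mask_to_bits(mask_hex):
    """Expand an 8-digit hex mask into a 32-character string of 0s and 1s."""
    return format(int(mask_hex, 16), "032b")


def picked_positions(mask_hex):
    """Indices (0 = most significant bit, by assumption) of the picked bits."""
    return [i for i, bit in enumerate(mask_to_bits(mask_hex)) if bit == "1"]


def sparse_context(history, mask_hex):
    """Keep only the picked bits of a 32-bit history word; excluded bits become 0."""
    return history & int(mask_hex, 16)


for mask in SPARSE_MASKS:
    print(mask, mask_to_bits(mask), len(picked_positions(mask)))
```

The masked history word can then serve as the lookup key of the corresponding sparse model, so two histories that differ only in excluded bits share the same statistics.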
For the $i$-th ($i = 1, \ldots, M_c$) model in COMPACT, the prediction of the next outcome bit $y_j$ can be expressed as $P_i(y_j = 1) = p(y_j = 1 \mid c_k^{(i)}(y_{j-k}), \ldots, c_1^{(i)}(y_{j-1}))$, where $M_c$ is the total number of context models, and $c_k^{(i)}(y_{j-k})$, $k = 1, \ldots, 16 \times 8$ (i.e., the maximum context order multiplied by the number of bits in each byte), indicates the contextual dependencies (a.k.a. bit history) for $y_j$ defined in the $i$-th model. For example, $c_k^{(i)}(y_{j-k}) = y_{j-k}$ if $y_j$ depends on $y_{j-k}$ in the $i$-th model; otherwise, $c_k^{(i)}(y_{j-k}) = \text{NULL}$. Then, $P_i(y_j = 0)$ can be easily calculated as $1 - P_i(y_j = 1)$. In our implementation, we apply three lookup-table based functions (i.e., Run Map, Stationary Map, and Nonstationary Map; see http://cs.fit.edu/mmahoney/compression/) to map the bit history to the corresponding probability.
In the rest of this paper, we denote the prediction result for the bit $y_j$ to be processed by the $i$-th context model by $t_i$ instead of $P_i(y_j = 1)$. In the next section, we discuss how to find the most likely probability $P(y_j \mid t_1^{M_c})$ given the individual predictions $t_1^{M_c} = \{t_1, \ldots, t_{M_c}\}$ of all $M_c$ context models.

Model synthesis based on logistic regression
According to the Maximum Entropy Principle (i.e., MAXENT [49]), the most likely probability for $P(y_j \mid t_1^{M_c})$ is the one with the highest entropy [50], where $\tilde{P}(t)$ is the empirical probability of $t$ and the candidate set $P'$ contains the conditional distributions $P(y \mid t)$ that satisfy $\sum_y P(y \mid t) = 1, \forall t$, together with constraints expressed through feature functions $f_i(t, y)$. Here $f_i(t, y)$ is a function that returns $t_i$ if $y$ equals the bit being predicted, or returns 0 otherwise, and $\tilde{P}(t, y)$ is the empirical probability of $(t, y)$.
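For orientation, the textbook form of the conditional maximum-entropy problem is written out below. It is a sketch under assumptions rather than a recovery of the paper's own equations: the tilde notation for empirical probabilities, the Lagrange multipliers $\lambda_i$, and the reading of "the bit being predicted" as $y = 1$ (so that $f_i(t, 1) = t_i$ and $f_i(t, 0) = 0$) are assumed here.

```latex
\begin{align}
P^{*} &= \arg\max_{P \in P'} H(P), \qquad
H(P) = -\sum_{t,y} \tilde{P}(t)\, P(y \mid t) \log P(y \mid t), \\
\text{s.t.}\quad
& \sum_{t,y} \tilde{P}(t)\, P(y \mid t)\, f_i(t,y)
  = \sum_{t,y} \tilde{P}(t,y)\, f_i(t,y), \qquad i = 1, \ldots, M_c, \\
P^{*}(y \mid t) &= \frac{\exp\left(\sum_{i} \lambda_i f_i(t,y)\right)}
                       {\sum_{y'} \exp\left(\sum_{i} \lambda_i f_i(t,y')\right)}, \\
P^{*}(y = 1 \mid t) &= \frac{\exp\left(\sum_i \lambda_i t_i\right)}{1 + \exp\left(\sum_i \lambda_i t_i\right)}
  = \frac{1}{1 + \exp\left(-\sum_{i=1}^{M_c} \lambda_i t_i\right)}.
\end{align}
```

The third line is the usual exponential-family solution obtained with Lagrange multipliers, and the last line specializes it to a binary $y$ under the stated assumption on $f_i$.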
Replacing the parameter $\lambda_i$ with $w_i$, which can be viewed as the weight of the $i$-th model, and given a binary bit $y_j$, the resulting model is exactly a logistic regression model, which can be optimized efficiently using the Newton-Raphson algorithm. Figure S1 in File S1 shows how this model works in each step.
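A minimal sketch of such a logistic-regression synthesis is given below. It is illustrative only: the class name and learning rate are hypothetical, the predictions $t_i$ are used directly as features (following the description of $f_i$), and the weights are updated by simple online gradient ascent on the log-likelihood as a stand-in for the Newton-Raphson optimization mentioned in the text.

```python
# Hypothetical sketch: mixing model predictions with logistic regression.
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


class LogisticMixer:
    def __init__(self, n_models, learning_rate=0.1):
        self.weights = [0.0] * n_models     # w_i, one weight per context model
        self.learning_rate = learning_rate  # assumed value, not from the paper

    def predict(self, predictions):
        """Return P(y_j = 1 | t_1..t_Mc) = sigmoid(sum_i w_i * t_i)."""
        score = sum(w * t for w, t in zip(self.weights, predictions))
        return sigmoid(score)

    def update(self, predictions, bit):
        """Gradient step on the log-likelihood after the true bit is known."""
        error = bit - self.predict(predictions)
        for i, t in enumerate(predictions):
            self.weights[i] += self.learning_rate * error * t


mixer = LogisticMixer(n_models=2)
for _ in range(50):
    mixer.update([0.9, 0.2], bit=1)   # first model is confident and right
mixed = mixer.predict([0.9, 0.2])
```

Because encoder and decoder see the same bits in the same order, both can apply identical updates, so the weights never need to be stored in the compressed file.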

Results
Parameters discussion: model order $n$ and minimum match length $L_m$
We implemented the encoder and decoder of COMPACT in C++, and ran experiments on a workstation with an Intel(R) Xeon(R) 3.6 GHz CPU and 96 GB of RAM. Denoting the model order by $n$ and the minimum match length by $L_m$ in the proposed algorithm, we explored the relationship between the compression performance (the average number of bits per base, bpb) and these parameters. To study the issue comprehensively, we selected four sequences from different species with different sizes: HUMHPRTB, 56,737 symbols; HEHCMVCG, 229,354 symbols; y-4, 1,531,929 symbols; NC013929, 10,148,695 symbols.
As indicated in Figure 7, almost all sequences demonstrated a decrease in bits per base with the increase in model order. Correlation in the sequence was best predicted using context models of moderate orders. We chose 16 as the context model order. The figure also shows that the compression performance does not always improve as the minimum match length grows.

Performance of compression without reference
In this section, we conducted experiments with the model order set to 16 on a standard dataset of DNA sequences (Table 1), a DNA corpus consisting of four organisms used by Manzini et al. in [30] (Table 2), the bacteria DNA sequences (Table 3) from the National Center for Biotechnology Information (NCBI) directory, and the complete DNA sequences of ten species with various sizes (see Table 4). Time is reported in seconds. In these experiments, the proposed compression algorithm without reference, i.e., COMPACT-NOREF, demonstrated a performance advantage over existing models.

Performance of compression with reference
We tested the performance of COMPACT-REF (i.e., rLZ+COMPACT) with model order 16 and minimum match length $L_m = 50$ in three cases. The KOREF_20090224 genome sequence data, for which the raw file is 2937.7 MB, were compressed using KOREF_20090131 as reference into a 15.8 MB file, achieving a 189-fold compression rate (Figure 8). Table 5 displays the 177-fold compression result for another experiment, in which the genome of a Han Chinese individual (YH) was compressed using KOREF_20090224 as reference. Table 6 displays the compression results of COMPACT-REF and GReEn [17] for three different human genome assemblies (YH, KOREF_20090224 and KOREF_20090131) with transformed alphabets using hg18 (NCBI36) as their common reference; the results for the same datasets with the original alphabets are displayed in Table S3 in File S1.

Discussion
Discussion for compression without reference
Table 1 compares the compression results in bits per base (bpb). Along with our proposed algorithm, we present the existing algorithms, i.e., DNA3 [30], XM500 [13] and FCM [51]. Our COMPACT-NOREF with the proposed LZ method applied in the first pass outperformed the variant using dnaX (a fast algorithm using fingerprints introduced by Manzini et al. [30]) for short sequences, but this advantage became less obvious as the sequence sizes increased. The reason is that encoding the finite repeats with a leading indicator rather than copious repeat locations saves space, and Gamma coding is more efficient than dnaX's continuation bit encoding for short sequences. Hence, we implemented the first pass of the remaining experiments for reference-free scenarios with dnaX.
As for "difficult" sequences like HUMD (i.e., HUMDYSTROP) and HUMH (i.e., HUMHBB), on which we did not gain a performance advantage, both of them are human genome sequences, which often contain approximate repeats rather than exact duplicated strings. Therefore, COMPACT-NOREF and other similar algorithms such as [23,30], which only take exact repeats into consideration, did not achieve the best performance for human genome compression. We conducted a test experiment, shown in Figure S2 in File S1, to support this hypothesis. Table 2 further shows that COMPACT-NOREF is slightly inferior to XM for the "difficult" sequences of four organisms (i.e., yeast, mouse, arabidopsis, and human). However, XM adopts a more sophisticated modeling approach (e.g., the combination of various "experts") and a large number of expert models to attain a better representation of DNA sequences at a rapidly increasing cost of memory and time, while ours only picks out certain suitable models relying on short-term knowledge from the past. In our experiments, XM took around or more than 2.0 GB of memory to compress the sequences over 100 MB in the fourth group, which consumed far more resources than the other methods (the proposed method took much less memory; please refer to Table S2 in File S1).
We carried out additional experiments on the bacteria DNA sequences from the National Center for Biotechnology Information (NCBI) directory (Table 3). We compared a variation of the proposed algorithm, semi-COMPACT (i.e., COMPACT-NOREF without the first pass), and COMPACT-NOREF with an XM encoder and a state-of-the-art algorithm, FCM-Mx [28], in terms of compression and required time. Table 3 presents the individual compression results on these sequences with 10,000,000 or more bases. The table also includes the average compression result of each algorithm in the last row. For bacteria, the proposed method demonstrated the best performance among all algorithms. The average compression rates of five sequences reported for XM500 and FCM-Mx were 1.787 bpb and 1.7543 bpb, while the average performance of our method COMPACT-NOREF on the same set was 1.7204 bpb. The time cost for the proposed methods was comparable to that of XM.
Results for eleven complete genomes are shown in Table 4. The FCM-S and FCM-M [52] columns contain results provided by the finite-context models and by the multiple competing finite-context models. FCM-S processed DNA sequences using the single finite-context model approach, in which the best context depth was used, whereas FCM-M obtained the results with the multiple competing models. The results presented in Table 4 show a similar pattern to those in Table 3. Moreover, Tables 1 to 4 all include the compression results of 'COMPACT-seq', which uses only traditional sequential models (including models based on Markov chains) followed by the logistic regression model. In Table 4 in particular, 'COMPACT-seq' outperforms all other algorithms on ten complete genomes, and 'COMPACT-NOREF' exceeds 'COMPACT-seq' on almost all sequences. It can be inferred that both the proposed non-sequential models and the logistic regression mixture model contribute to the compression gains.
Note: The minimum match length $L_m$ in the first pass is set to 25 in this experiment. The unit of time is seconds, except for H.sapiens, whose unit is minutes. The "XM200" column shows the results obtained with the XM algorithm using at most 200 experts. '*' indicates that a model consumes much more memory (around or more than 2.0 GB) than other methods. doi:10.1371/journal.pone.0080377.t004
Figure 8. Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference. The original sequence alphabets have been preserved. The size of the alphabet in the target sequence is 21 for all chromosomes, except for the chrM chromosome, whose size is 11. The left y-axis refers to the compression ratio while the right y-axis indicates the compression or decompression time in seconds. doi:10.1371/journal.pone.0080377.g008
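As a quick arithmetic check of the bacteria averages quoted in the discussion above, the relative gain in bits per base can be computed directly. The helper name is hypothetical; the three averages are the ones stated in the text.

```python
# Relative bpb improvement of COMPACT-NOREF over the quoted baselines.

def relative_gain(baseline_bpb, proposed_bpb):
    """Fraction of bits per base saved relative to the baseline."""
    return (baseline_bpb - proposed_bpb) / baseline_bpb


gain_vs_xm500 = relative_gain(1.787, 1.7204)
gain_vs_fcm_mx = relative_gain(1.7543, 1.7204)
print(f"vs XM500:  {gain_vs_xm500:.1%}")   # prints "vs XM500:  3.7%"
print(f"vs FCM-Mx: {gain_vs_fcm_mx:.1%}")  # prints "vs FCM-Mx: 1.9%"
```

The gain of about 3.7% over XM500 agrees with the figure for bacteria quoted in the abstract.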

Discussion for compression with reference
We compared the performance of the proposed method COMPACT-REF (i.e., rLZ+COMPACT) to that of GRS [15] and GReEn [17], the two most recently proposed approaches for compressing genome resequencing data that handle sequences over arbitrary alphabets. Figure 8 displays the compression performance for the human genome KOREF_20090224 using KOREF_20090131 as reference. COMPACT-REF gave better results in terms of compression ratio, but it is slower than GReEn and GRS. The speed disadvantage deserves a special note. Unlike that of GReEn, the compression time of COMPACT-REF does not vary linearly with the size of the sequence but rather depends on the degree of similarity between the reference sequence and the target sequence. This is also the reason why some longer sequences, such as chr8 and chr9, took less time to compress. On the other hand, Figure 8 also demonstrates that COMPACT-REF can achieve a decompression time comparable to that of GReEn: the decompression procedure and runtime of GReEn are identical to those of its compression, while COMPACT-REF saves time in the first pass of decompression because it only locates the match position instead of searching for repeats. Hence, we believe that COMPACT-REF is advantageous in applications for which disk space and decompression time are the limiting factors but the compression time is more tolerable, such as sequence archiving and sequence acquisition. For a more specific and clearer comparison, we also present the results in tabular form in Table S1 in File S1.
To provide a more comprehensive comparison between GRS, GReEn and the proposed compression approaches, we investigated another human genome assembly, YH, which is the genome of a Han Chinese individual. Table 5 displays the compression results of YH using KOREF_20090224 as reference. GRS performed poorly in both compression rate and speed. Our COMPACT-REF approach achieved good results with the appropriate window size, which can be selected by choosing a large window size and gradually shrinking it. Note that the window range only slightly affects the compression performance. The default left and right window sizes are obtained adaptively by calculating the difference percentage, which equals the sum, over the bases ('A', 'C', 'T', 'G', 'N', 'a', 'c', 't', 'g' and 'n'), of the differences between each base's count in the source sequence and in the reference sequence, divided by the length of the source sequence. If the percentage is smaller than 0.65%, the default window size is set to [-12, 650]; if the percentage is larger than 0.65% but smaller than 5%, the default window size is set to [-12, 812]; otherwise, the window size is set to [-12, 11560]. For the compression of KOREF_20090224 using KOREF_20090131 as reference, the default window range is [-12, 650], except for chr1, chr4, chrX and chrY, for which it is [-12, 812]. In the compression of YH using KOREF_20090224 as reference, the default window range is [-12, 11560] for most chromosomes. Table 6 summarizes the compression results of COMPACT-REF and GReEn for three different human genome assemblies (YH, KOREF_20090224 and KOREF_20090131) using the same choice of reference, hg18 (NCBI36). KOREF_20090131, KOREF_20090224 and YH are three genome databases generated by two different organizations. YH is the first diploid genome sequence of a Han Chinese, a representative of the Asian population, completed by the Beijing Genomics Institute at Shenzhen (BGI-Shenzhen).
KOREF_20090131 and KOREF_20090224 are two versions of the first individual Korean genome, released in December 2008 as the result of the Korean reference genome construction project. Consequently, the alphabets of the datasets from the two organizations are different. Both KOREF_20090131 and KOREF_20090224 consist of 21 symbols, such as 'A', 'C', 'T', 'G', 'N' and 'M', with the additional symbols besides {'A', 'C', 'T', 'G'} indicating different sequencing quality or uncertainty. In contrast, all bases in YH are confined to {'A', 'C', 'T', 'G', 'N'}, with only 'N' used to represent uncertain bases. Hence, to keep the alphabet size identical, we transformed all characters to lowercase and mapped unknown nucleotides to 'n' for the sake of comparison. Table S3 in File S1 displays the compression results of COMPACT-REF and GReEn for the same datasets with the original alphabets. The size of each sequence in Table 6 was reduced significantly by the proposed method, although GReEn shows superior performance in this setting. The reason why GReEn generates better compression results than COMPACT-REF in this situation may be that GReEn relies on the probability distribution of characters in the target sequence (assuming that the characters of the target sequence are an exact copy of (parts of) the reference sequence). When we do not eliminate the effect of character case (i.e., uppercase or lowercase), as in Table S3, GReEn demonstrates an obvious disadvantage in terms of compression performance. These experiments demonstrate the applicability of our framework to compressing genomic data sets and encourage further investigation.
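The adaptive choice of the default window range described in this section can be sketched as follows. The sketch rests on assumptions: the "difference value" of each base is taken to be the absolute difference of its counts, a percentage of exactly 0.65% is assigned to the middle bracket (the text leaves that boundary unspecified), and the function names are hypothetical.

```python
# Hypothetical sketch: default rLZ window range from the difference percentage.
from collections import Counter

BASES = "ACTGNactgn"  # the ten symbols listed in the text


def difference_percentage(source, reference):
    """Sum of per-base count differences, as a percentage of the source length."""
    src_counts, ref_counts = Counter(source), Counter(reference)
    total_diff = sum(abs(src_counts[b] - ref_counts[b]) for b in BASES)
    return 100.0 * total_diff / len(source)


def default_window(percentage):
    """Map the difference percentage to the default [left, right] window range."""
    if percentage < 0.65:
        return (-12, 650)
    if percentage < 5.0:       # assumption: exactly 0.65% falls in this bracket
        return (-12, 812)
    return (-12, 11560)


window = default_window(difference_percentage("ACGTACGTAA", "ACGTACGTAC"))
```

Only base composition enters the statistic, so it is cheap to compute before any alignment-like search, which is presumably why it serves as a coarse similarity proxy.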

Supporting Information
File S1 Supporting figures and tables. Figure S1. The diagram of the logistic regression model synthesizing different models to obtain a single probability. Figure S2. The relationship between the compression rate and the quantity of noise over the sequence HEHCMVCG. Figure S3. The schematic diagram of the selected contexts for the eleven non-sequential sparse models. Red blocks refer to the picked bits while the others refer to the excluded ones. Table S1. Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference. Table S2. The evaluation of memory usage in our experiments.