On the Representability of Complete Genomes by Multiple Competing Finite-Context (Markov) Models

A finite-context (Markov) model of order yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth . Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders (ii) careful programming techniques that allow orders as large as sixteen (iii) adequate inverted repeat handling (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well or even better than state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), contrasting with the statistical models underlying other methods, where the extensive data repetitions in DNA sequences is explored, and therefore have a non-local character.


Introduction
Since the work of Grumbach and Tahi [1], many contributions have been made in the area of DNA data compression (see, for example, [2][3][4][5][6][7][8][9][10] and for a recent review [11]). These works explore the non-stationary nature of DNA sequence data, which are characterized by an alternation between regions of relatively high and low entropy. Typically, there are two compression approaches, one based on Lempel-Ziv-like substitutional procedures [12] (that usually perform well on repetitive, low entropy regions) and another based on low-order context-based (Markov) arithmetic coding (better suited for regions of high entropy).
According to the substitutional paradigm, repeated regions of the DNA sequence are represented by a pointer to a past occurrence of the repetition and by the length of the repeating sequence. Both exact and approximate repetitions have been explored, as well as their inverted complements.
Markov modeling has long been applied to DNA data sequences (see, for example, the works of Borodovsky et al. [13,14] and of Tavaré and Song [15]). Since then, a large number of publications have addressed this topic, although mainly with the aim of proposing techniques for gene finding (some examples can be found in [16][17][18][19][20][21][22]). Other applications, such as the detection of short inverted DNA segments [23], the assessment of the statistical significance of DNA patterns [24] or the identification of CpG islands [25], have also relied on Markov models. However, Markov models have never been used as the sole paradigm for DNA sequence modeling or compression. In this paper, we address a modeling question that we do believe has not been satisfactorily answered before: How well can complete genomes be described using exclusively a combination of Markov models? We seek descriptions that are good in the sense of the minimal description length principle [26], i.e., that require as few bits as possible for representing the information.
To investigate this matter, we developed a method based on multiple competing finite-context models that incorporate features found in DNA sequence data, such as the existence of inverted repeats. Finite-context models are computational models that provide a probability estimate of the next DNA base, given the recent past of the sequence, in accordance with the Markov property.
There is a close connection between compression and modeling. Compression methods depend on statistical models of the data. If a compression method outperforms another, it is because the underlying statistical model is better suited to the data. Conversely, if a statistical data model explains a string of data very well, that is, if it provides good estimates of the distribution of each data symbol, then it is conceivable that its application in sequence compression might lead to good compression performance.
To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure.
Our experimental results show that the ability of multiple competing finite-context models to describe DNA sequences is surprisingly close to that attained by more complex state-of-the-art DNA compression methods, such as XM [10]. In fact, for smallsized sequences, the finite-context models perform better.
XM, the method that we use as the reference to compare the performance of the finite-context models, relies on a mixture of experts for providing symbol by symbol probability estimates, which are then used for driving an arithmetic encoder. The algorithm comprises three types of experts: (1) order-2 Markov models; (2) order-1 context Markov models, i.e., Markov models that use statistical information only of a recent past (typically, the 512 previous symbols); (3) the copy experts, that consider the next symbol as part of a copied region from a particular offset. The probability estimates provided by the set of experts are then combined using Bayesian averaging and sent to the arithmetic encoder.
Besides a global comparison, based on the average of the negative logarithm of the probability estimates (i.e., the average of the per base information content) performed for several genomes of various sizes, we also provide some samples of the local profiles of the so-called information sequences [27]. These information sequences contain the per base information content generated by the models (measured in bits), allowing, for example, the comparative analysis of long DNA sequences [28], the classification of biological sequences [29] or sequence alignment [30]. In addition, we show an example of the context depth profile produced along the sequences, that might have independent interest.
As we mentioned before, in this paper we explore multiple competing finite-context models, with the aim of finding how well complete DNA data sequences can be described exclusively by this modeling paradigm.
As far as we know, this paper provides the first comprehensive investigation of the extent to which Markov models explain DNA data. We believe that this is important because it provides evidence that complete DNA data sequences can be reasonably well described by statistical models that rely only on the immediate past of the sequence. In other words, local, short-range models perform as well as or better than non-local models built in the state-of-theart compression methods. Since the search for better data compression methods is intimately related to the problem of finding better data models, this work contributes to an improved understanding of the laws that govern the DNA data, an objective that has been long pursued (see, for example, [1,[31][32][33]).

DNA data sequences
In this study, we used the complete DNA sequences of eleven species of various sizes. The genomes were obtained from the following sources:

Finite-context models
Consider an information source that generates symbols, s, from a finite alphabet A~fs 1 ,s 2 , . . . ,s jAj g, where jAj denotes the size of the alphabet. In the case of DNA data, A~fA,C,G,Tg and, therefore, jAj~4. Also, consider that the information source has already generated the sequence of n symbols x n~x A finite-context model assigns probability estimates to the symbols of the alphabet, regarding the next outcome of the information source, according to a conditioning context computed over a finite and fixed number, kw0, of the most recent past outcomes c~x n{kz1 . . . x n{1 x n (order-k finite-context model) [34][35][36]. The number of conditioning states of the model is jAj k , determining the model complexity or cost. The context, c, varies along the sequence, i.e., it depends on the position n. However, for alleviating the notation, we considered this dependency implicit and, therefore, when we refer to c we mean the value of the context at the location that should be easily inferred from the formula where it occurs.
The probability estimates, P(X nz1~s jc),V s[A , are usually calculated using symbol counts that are accumulated while the sequence is processed, which makes them dependent not only of the past k symbols, but also of n. In other words, these probability estimates are generally time varying. Table 1 shows an example of how statistical data are usually collected in finite-context modeling. In this example, an order-5 finite-context model is presented (as that of the FCM1 model of Fig. 1). Each row represents a probability model that is used to represent a given symbol according to the last processed symbols (five in this example). The counters are updated each time a symbol is processed.
The theoretical per symbol information content average provided by the finite-context model after having processed n symbols is given by where ''bpb'' stands for ''bits per base''. Recall that the entropy of any sequence of four symbols is limited to two bits per symbol, a value that is obtained when the symbols are independent and equally likely, and that the fewer the number of bits produced the better is the model. One of the drawbacks of implementing the finite-context models using the approach illustrated in Table 1 is that the memory requirements grow exponentially with k. In fact, the total number of counters needed in this case is (jAjz1)jAj k . For DNA data, and even considering only two-byte counters, this would imply about 40 Gbytes of memory for implementing an order-16 model. However, this table would also be very sparse, because the maximum number of different words of size k that can be found in a sequence of length n is clearly upper bounded by n. Using this simple observation and appropriate data structures such as hashtables, we managed to implement a computer program that allows using finite-context models of orders up to sixteen in a laptop computer with 3 Gbytes of memory (the source code of this computer program is publicly available in ftp://www.ieeta.pt/ ,ap/codecs/DNAEnc3.tar.gz).

Updating the inverted complements
Frequently, DNA sequences contain sub-sequences that are reversed and complemented copies of some other sub-sequences. These sub-sequences are named ''inverted repeats''. As mentioned before, this particularity of DNA sequence data is used by most of the DNA compression methods that have been proposed and that rely on the sliding window searching paradigm.
For exploring the inverted repeats of a DNA sequence, besides updating the corresponding counter after encoding a symbol, we also update another counter that we determine in the following way [37]. Consider the example given in Fig. 1 where the context is the string ''ATAGA'' and the symbol to encode is ''C''. Reversing the string obtained by concatenating the context string and the symbol, i.e., ''ATAGAC'', we obtain the string ''CAGATA''. Complementing this string (A<T, C<G), we get ''GTCTAT''. Now we consider the prefix ''GTCTA'' as the context and the suffix ''T'' as the symbol that determines which counter should be updated. Therefore, according to this procedure, we take into consideration the inverted repeats if, after encoding symbol ''C'' of the example FCM1 of Fig. 1, the counters are updated according to Table 2. As shown in [37], this provides additional modeling performance.

Multiple competing models
DNA sequence data are non-stationary. In fact, one of the reasons why most DNA compression algorithms use a mixture of two methods, one based on repetitions and the other relying on low-order finite-context models, is to try to cope with the nonstationary nature of the data. We also follow this line of reasoning, i.e., that of using different models along the sequence. However, unlike the other approaches, we use exclusively the finite-context paradigm for modeling the data, changing only the order of the model as the characteristics of the data change. More precisely, we Figure 1. Example of finite-context models. In this example, A~fA,C,G,Tg and the context depths, k, are k 1~5 and k 2~1 1. The probability of the next outcome, X nz1 , is conditioned by the last k outcomes. When more than one model is running competitively, the particular context depth used is chosen on a block basis. doi:10.1371/journal.pone.0021588.g001 explore an approach based on multiple finite-context models of different orders that compete for encoding the data.
Using several models with different orders allows a better handling of DNA regions with diverse characteristics. Therefore, although these multiple models are continuously updated, only the best one is used for encoding a given region. For convenience, the DNA sequence is partitioned into non-overlapping blocks of fixed size, which are then encoded by one (the best one) of the finitecontext models. Figure 1 shows an example where two competing finite-context models are used. In this example, each model collects statistical information from a context of depth k 1~5 and k 2~1 1, respectively. At time n, the two conditioning contexts are c 1~xn{k1z1 . . . x n{1 x n and c 2~xn{k2z1 . . . x n{1 x n .

Estimating the probabilities
How to estimate probabilities based on counting the occurrences of past events has been a problem addressed by several researchers, going back at least to the works of Bayes and Laplace [38,39].
The central problem is the estimation of the probability of events that have never been observed (this is also known as the pseudocount estimation problem). For that purpose, we use an estimator that is a generalization of earlier formulae (see, for example, [40][41][42][43][44][45]), which is given by where n c s represents the number of times that, in the past, the information source generated symbol s having c as the conditioning context and where the estimator can be rewritten as P a (X nz1~s jc)~m n c s n c z(1{m) revealing a linear interpolation between the maximum likelihood estimator and the uniform distribution. This also shows that when the total number of events, n c , is large, the estimator behaves as a maximum likelihood estimator (when n c ??, m?1), regardless of the value of a. Therefore, the main interest in the estimator of (2) is when n c is small, in which case the value of a plays a key role. Moreover, it can also be seen that the parameter a controls the probability assigned to previously unseen (but possible) events, i.e., the probability when n c s~0 . This probability is given by which decreases faster with n c for smaller values of a. The estimator described in (2) assumes a Dirichlet prior, Dir(a 1 , . . . ,a jAj ), over the probabilities that are being estimated, with a 1~. . .~a jAj~a . Dirichlet mixtures have also been used, for example in the context of protein family modeling (see, for instance, [46,47]).

Results and Discussion
Each organism under analysis was processed using eight competing finite-context models with context depths k~2,4,6, 8,10,12,14,16. The decision of which depth to use was taken on a block by block basis, using blocks of two hundred DNA bases. This block size, although not optimal for every sequence, has revealed to be on average a good compromise.
Since we are interested in evaluating the performance of the models, we used the average number of bits per DNA base (bpb) provided by these models, as a measure of their fitness to the data. This is essentially the value provided by (1) when n equals the length of the sequence, i.e., after processing the whole sequence. Besides this per symbol average information content, the overhead required to indicate the depth of the particular finite-context model used in each data block was also considered. Note that, for blocks of two hundred bases, and without further modeling, this implies a small overhead of 3=200~0:015 bpb (recall that the eight possible context depths can be represented with three bits). Nevertheless, we also used a finite-context model for representing this information in a more efficient way. It was found, experimentally, that an order-4 model was able to provide a good performance.
For comparison, we processed the DNA sequences using the single finite-context model approach. In this case, the best context depth was used. For genomes composed of several chromosomes, the best context depth was determined for each chromosome. The results regarding this approach are presented in the ''FCM-S'' column of Table 3, whereas the results obtained with the multiple competing models are shown in column ''FCM-M''. We used the currently best-performing DNA compression algorithm, XM [10], for evaluating the overall performance of the multiple competing finite-context models in comparison with the state-of-the-art technique for DNA sequence compression. Also, with the aim of providing an additional term of comparison, we include the results attained by another DNA compression method, developd by Manzini et al. [6], because it is a fast, although competitive DNA compressor. This technique is based on fingerprints for fast pattern matching, and relies on fallback mechanisms for encoding the regions where matching fails, which are order-2 (DNA2) or order-3 (DNA3) finite-context models. The results presented in Table 3 correspond to the average number of bits actually generated.
The probabilities associated to the finite-context models were estimated using (2), with a~1 (corresponding to Laplace's estimator) for model orders k~2,4,6,8,10 and with a~0:05 for model orders k~12,14,16. As explained in the previous section, when n c is large, the estimator converges to the maximum likelihood estimator, meaning that the value of a is virtually irrelevant when n c ??. This is what happens for small-order models, because, due to the reduced number of contexts, on average the total number of events that occur associated to each context, i.e., n c , quickly attains a sufficiently high value for rendering m&1 in (4). However, when k is large, then the number of conditioning states, 4 k , is very high. This implies that statistics have to be estimated using only a few observations (small values of n c ), which is the case where the value of a might play an important role.
In fact, during our study, we have found out experimentally that, using the combination of multiple finite-context models, the probability estimates calculated for the higher order models lead to significantly better results when smaller values a are used. We have performed a number of experiments and reached the conclusion that picking a~0:05 would provide, globally, good results. Other values similar to this one would also produce good results, meaning that the performance of the estimator is robust with respect to small variations of a.
The results presented in Table 3 show a clear distinction between organisms with small genomes and organisms with large ones. For small-sized genomes, the modeling ability of the multiple finite-context models is basically the same as the more sophisticated modeling approach provided by the XM algorithm. This is observed in the case of the A. nidulans, C. albicans, S. pombe, S. cerevisiae, E. coli, S. aureus, T. kodakarensis, M. jannaschii and M. genitalium organisms, with some of them being slightly better compressed by the finite-context models. For the H. sapiens and A. thaliana species, the modeling capability of the finite-context models appears to be insufficient, because the XM approach is able to attain lower entropies (about 5% lower when using 200 experts). This result agrees with the known strong repetitive characteristic of the DNA data of the higher organisms, a characteristic that is usually better modeled by the copy expert mechanism provided by XM. On the contrary, the species with small-sized genomes seem to be very well represented exclusively by finite-context models, without needing the help of the copy experts. This observation allow us to conclude that the DNA sequence data of these species can be represented by models that rely only on short-term knowledge of the past, i.e., sixteen bases or less as suggested by the experimental results that we have obtained. Moreover, even in the higher species, the capability of the Markov-only approach seems to be quite significant, since it is able to represent, at least, about 95% of the information of the genome.
For better understanding how the two approaches behave locally, i.e., with and without the copy expert mechanism, Fig. 2 presents the information sequences regarding the first 400000 welldefined bases (i.e., ignoring the ''N'' cases) of the human chromosome number one. The (a), (b) and (d) plots represent the instantaneous number of bits required by each of the two modeling approaches for representing the DNA bases. Consequently, smaller values indicate that the DNA bases in that particular region of the DNA sequence were ''easier'' to represent (i.e., they required less bits) than other bases for which the values of the plot are higher. Note that, for facilitating the visualization of the curves, the data were low-pass filtered.
As can be seen, the curves displayed in Fig. 2(a) and (b) are reasonably similar. These plots exhibit valleys of varying depth mixed with a kind of plateau regions, clearly showing the different complexities that we referred along the paper and that motivated the adoption of the multiple competing finite-context models. The plateau-type regions reveal DNA segments that are difficult to represent, in the sense that they require more bits than average. These regions are typically encoded by the low-order finite context models. On the contrary, the valleys indicate DNA regions easier to represent, and, therefore, requiring less bits per DNA base. These parts of the DNA sequences are usually better handled by the high-order finite-context models or by the repetition-seeking mechanisms of the compression methods that incorporate this paradigm.
Still making use of the analytical power provided by the information sequences, Fig. 2(d) shows how important the value of a is in the probability estimator formula for a good performance of the high-order models. As can be seen in the figure, when using a~1 for all model depths the majority of the valleys is much less deep than when using a~0:05 for the high-order models (k~12, 14,16), showing that the representation of the low complexity regions is strongly affected by this parameter.
Finally, in Fig. 2(c) we display the plot of the variation of the context depth along the sequence when processed with the same parameters as those used to produce the graphic shown in Fig. 2(b). It can be observed that, generally, deeper context models are chosen when the entropy is lower. Nevertheless, this is not always the case, and, therefore, these kind of plots may provide additional information about the structure of the DNA sequence.

Conclusions
We have provided the first comprehensive investigation of the extent to which Markov models explain complete genomes. To explore the potential of Markov models as completely as possible, we have used a model that includes several competing Markov models of different orders. The model adapts to the statistical characteristics of the DNA sequences, which vary widely across the sequence, depending on the nature of the data (consider coding and non-coding DNA segments, for instance). The adaptation reflects itself on the selection of Markov models of different orders for different DNA segments. We have noted that this approach is effective when complemented with the following ideas, that we have come to regard as essential: (a) careful programming, based on hash tables, to cope with the memory demands posed by the Markov models with longer context depth and the inherent sparsity of their associated contexts (b) probability estimates adequate to the wide range of context depths used (c) inverted repeat handling.
To measure the fit of the model at a certain position we adopted the negative logarithm of the probability estimate at that position. This standard measure yields information profiles of the sequences, which are of independent interest, and reveal instantaneous innovation along the sequences (that is, segments through which the sequence behaves in a more random and unpredictable way, as opposed to segments in which the behaviour is more predictable). The average of the measure over the entire sequence reduces to the average number of bits per base to describe the sequence, and works as a global performance measure.
A comparison of the results obtained with our multiple Markov model and state-of-the-art compression models reveals that the Markov-only description is able to explain genomes almost as well or even better. This is surprising for the following reasons. Our method is not intended to be a complete compression method -it does not attempt to explore long-range correlations and it does not take advantage of the presence of segments that are repeated (exactly or approximately) across the sequences. Furthermore, it consists only of Markov models, which are inherently short-range or local. Compression methods do take advantage of local correlations (and commonly resort to Markov models for that purpose) but also employ techniques such as copy experts, that are able to efficiently represent repetitions found along the sequence (potentially at unbounded distances). The fact that the degree of local dependence present in DNA sequences allows representations that compete with advanced compression methods is unexpected. The sequences for which our method gave better performance than state-of-the-art compressors (generally speaking, the shortest sequences) must include those for which short-range dependencies out-weight long-range dependencies. In other words, those that are less rich in exact and approximate repeats.
To conclude, our work provides evidence that complete DNA data sequences can be reasonably well described by statistical models that are inherently local, provided that inverted repeats are accounted for and that the probability estimates are taylored to the wide range of context depths used. Since the search for better data compression methods is closely related to the problem of finding better data models, this work contributes to an improved understanding of the laws that govern the DNA data. Figure 2. Example of information sequences for the first 400000 well-defined bases of human chromosome 1. (a) Information sequence generated by the XM method; (b) Information sequence generated by the multiple competing finite-context models, using a~0:05 for the high-order models (k~12, 14,16) and a~1 for the remainder models; (c) Variation of the depth of the context-model along the sequence, for the same setup as in (b); (d) The effect of parameter a. In this case, we show the information sequence generated by the multiple competing finite-context models with a~1 for all the models. doi:10.1371/journal.pone.0021588.g002