Inherent limitations of probabilistic models for protein-DNA binding specificity

The specificities of transcription factors are most commonly represented with probabilistic models. These models provide a probability for each base occurring at each position within the binding site and the positions are assumed to contribute independently. The model is simple and intuitive and is the basis for many motif discovery algorithms. However, the model also has inherent limitations that prevent it from accurately representing true binding probabilities, especially for the highest affinity sites under conditions of high protein concentration. The limitations are not due to the assumption of independence between positions but rather are caused by the non-linear relationship between binding affinity and binding probability and the fact that independent normalization at each position skews the site probabilities. Generally probabilistic models are reasonably good approximations, but new high-throughput methods allow for biophysical models with increased accuracy that should be used whenever possible.


Author summary
Transcription factors (TFs), a class of DNA-binding proteins, play a central role in the regulation of gene expression. TFs control the rate of transcription by binding to the genome in a sequence-specific manner. Thus, one important aspect in the study of gene regulation mechanism is to model the binding specificities of TFs, namely the features of the DNA sequences that a TF prefers to bind. Multiple models have been proposed to characterize the binding specificities of TFs, among which the class of probabilistic models is the most popular. In this study, we point out several major limitations of the well-established probabilistic model by comparing it with the biophysical model. Through simulations we demonstrate that the probabilistic model is only an approximation of the biophysical model. The latter has most of the advantages of the former, and is a more accurate representation of binding specificities. We propose a shift from the probabilistic model to the biophysical model in future studies of protein-DNA interactions.

Introduction
The study of protein-DNA interactions has a long history and includes binding to both singleand double-stranded DNA and both non-specific and sequence-specific interactions [1]. Our interests are primarily in the sequence-specific interactions of transcription factors (TFs) that bind to DNA to regulate gene expression. Detailed modeling of the in vivo interactions of TFs with genomic DNA that control gene expression requires accounting for many complicating factors, including competition and cooperativity with other TFs, competition with nucleosomes for occupancy of specific DNA regions and how sequences flanking the TF binding sites can affect occupancy [2][3][4][5][6][7][8][9][10][11]. Regardless of the modeling approach, one component is always the specificity of the TF, how its binding affinity varies depending on the DNA sequence of the binding site. Representations of specificity typically employ matrix-based models where the positions within the binding site are assumed to contribute independently to the TF's binding affinity [12][13][14]. In various methods the elements of the matrix may represent probabilities (or log-probabilities) of each base occurring at each position, or energetic contributions from each base at each position, or more generally just abstract scores that are related to the functional contributions of each base at each position [13]. The data used to estimate the matrix parameters may come from many types of either in vivo or in vitro experiments, employing various types of algorithms [13,. To determine the intrinsic specificity of a TF, how its binding affinity varies between different sequences, in vitro methods are preferred because the data are unconfounded by the complications that exist in vivo. In this paper, we compare two methods of matrix representation of TF specificity, a probabilistic model in which the matrix elements are probabilities of each base at each position, and a biophysical model in which the matrix elements are the binding energy contributions of each base at each position. Either type of matrix could be obtained from the same types of data, and we show that there are inherent limitations of the probabilistic model that do not apply to the biophysical model, suggesting that energy matrices should be preferred in general. Probabilistic models (PMs) for DNA binding proteins were initially introduced by Harr et al. for E. coli promoters that even treated variable length binding sites [39]. Soon after, Staden converted to the use of log-probability to put the model into a weight matrix (additive) model, also including parameters for variable spacing [40]. Schneider et al. drew connections between the probabilistic models and information theory and introduced the log-odds model that accounts for the background distribution of bases [41] and later introduced the popular logo graphical representation of specificity [42]. The probabilistic model was also the basis of the earliest motif discovery algorithms [33,34,43]. Since then there have been many different algorithms for motif modeling and discovery using probabilistic models (reviewed in [12,13,15,24,44]).
Even earlier von Hippel introduced an energy-based model of protein-DNA interactions [14]. At the time, there were almost no data on actual binding sites so the paper used first principles to describe the informational specificity required for functional regulatory sites. The paper made simplifying assumptions such as the independence between positions and that every mismatch from the preferred sequence had the same energy difference. The first assumption, of independent contributions, has proven to be a reasonably good approximation for most transcription factors, whereas differences in contributions of alternative bases at each position are now well known and form the basis of most specificity modeling approaches. Berg and von Hippel derived an energy model that was identical to the probabilistic one under some simplifying assumptions and connections between the energy approach and the information theory models of specificity became clear [45][46][47]. Hwa and colleagues put the energy modeling approach into a more general biophysical model that accounts for the effects of protein concentration on binding probabilities [48,49]. Djordjevic et al. pointed out the importance of the biophysical approach in modeling specificity [50]. They further provided an algorithm that is guaranteed, for any collection of known binding sites, to predict the minimum number of additional sites in a genome, thereby minimizing the number of false positive predictions, although the method is not guaranteed to provide a more accurate model of the true specificity [50,51]. Regression methods have been used to find optimal energy parameters and Foat et al. provided the first regression algorithm for motif discovery of optimal energy models [16,52]. Since then several related methods have been developed to determine biophysical (energy) models of protein specificity from various types of high-throughput experimental data [18-20, 25, 27, 28, 31, 37, 38, 53-55].
Despite the development of several high-throughput experimental methods for measuring the specificity of protein-DNA interactions [22,56] and the algorithms described above for modeling them with the biophysical approach, probabilistic models remain the most popular. The purpose of this report is to point out that when good energy models are available there is no advantage to using the probabilistic models. In fact, due to inherent limitations the probabilistic models can be misleading and are highly sensitive to the samples used for inference of the parameters. Energy models can be readily obtained and can easily accommodate non-independent contributions between positions [13,52,57]. We conclude that energy modeling should become the approach generally used for modeling specificity and predicting protein-DNA interactions.

Results
We first introduce the fundamentals of the probabilistic model and the biophysical model and then describe the simulations used to compare the two approaches.

Probabilistic model
This model is based on a probability matrix PM(b, j) for each base b 2 (A, C, G, T) at each position j = 1, 2,Á Á Á, m for an m-long binding site. Any particular DNA sequence S i can be encoded as a similar matrix, S i (b, j), of 1s and 0s, where a 1 represents the base that occurs at position j and all other elements are 0 [13]. From the model, the probability of the sequence S i being among the bound sites is: Often this is converted to a log-odds weight matrix is the background, or prior, probability of base b [13,41]. For simplicity, we assume the prior probability is a constant, 0.25 for each base, and therefore the two approaches give equivalent results. Importantly, this is the probability of observing the sequence S i given a binding site, whereas what is desired is usually the probability that a sequence S i is bound, P(B|S i ). Instead, searching sequences with probabilistic models generally just provides a list of sites within some probability range of the preferred site, including a predicted relative probability for each sequence.

Biophysical model
This model is based on the thermodynamics of the interaction between two molecules, the protein T and a binding site S i (additional details provided in S1 Supporting Information). The association constant, which we refer to as the affinity, can be determined by measuring the concentrations of free reactants (protein and DNA) and of the complex: It is common to assume that the positions contribute independently to the binding affinity, just as the probabilistic model assumes the positions contribute independently to the site probability. This is represented as a matrix of affinity contributions K(b, j) such that From that one can determine the probability of a sequence S i being bound based on the protein concentration (or really the chemical potential of the protein which is related to its free concentration) and the association constant K i : where E i = −ln K i is the free energy of binding to sequence S i and μ = ln[TF] is the chemical potential. The probability of sequence S i in the bound sequences is obtained by Bayes' rule and is dependent on the chemical potential, which differs from the probabilistic approximation (noted above with e P): if P(S i ) is the same for all S i . More importantly the true probability of sequence S i in the bound sequences has a non-linear relationship with its binding affinity. This becomes pronounced at high protein concentrations where the energy can be additive across the positions of the binding site and yet the probabilities of the bases at each position are not independent.

Simulation results
From Eqs (1) and (5), it is clear that probabilistic models of protein binding specificity provide approximations to true binding probabilities, e PðS i jBÞ % PðS i jBÞ. We used simulations (see the details in the Methods section) to measure the accuracy of the approximation under various values of the chemical potential and for different methods of estimating the PM from the observed binding sites. Of particular interest is how well the rank order of binding site probabilities is preserved.
At different protein concentrations, the PMs derived from binding probabilities are usually different. As shown in Eq (4), the binding probability of a sequence S i depends on both its binding affinity K i (or energy E i ) and the protein concentration [T] (or chemical potential μ).
( 1, there is a linear relationship between affinity and probability, but that occurs only when P(B│S i ) ( 0.5, which is unlikely to be the case in vivo for true regulatory sites. At high protein concentrations, where K i [T] > 1 and the preferred binding site is highly occupied, the non-linear relationship between binding probability and affinity has several consequences. One is that the PM itself depends on the protein concentration, whereas the binding energy does not. Fig 1a and 1b show one example from the simulation of an energy matrix and its associated energy logo [13,16]. Note that in the matrix the lowest energy base at each position is assigned energy 0 (using the convention of Berg and von Hippel [45]), and in the logo the average energy for each position is set to 0, with the lower energy (higher affinity) bases on top. Since only the differences in energy matter for relative binding affinities, both representations lead to the same results. Fig 1c and 1d show the information logo [42] and the PM for that protein obtained at very low protein concentration, μ = −3. At low μ the PM corresponds very closely to the independent contributions of each base to the binding affinity (Eq (5) converges to Eq (1)). But at high protein concentration, such as μ = 3, the logo and PM are different (Fig 1e and 1f). The second logo shows that the information is "compressed" at μ = 3, with the mean column information content (MCIC) decreasing from 0.9 bits to 0.7 bits. But the change in probability is not evenly distributed. Comparing the two PMs (Fig 1d and 1f), at high protein concentration the base probabilities tend to move toward the mean, 0.25; the high probability bases decrease in probability and the low probability bases increase. But each position is normalized independently so that the magnitude of the change varies from position to position. For each position, the rank order of probability for the bases remains unchanged, but because the positions are normalized independently, the probabilities of different binding sites may change rank order. That is, even though the binding energy is completely additive across the positions, the probabilities of bases do not factor accurately across the positions.
The rank correlation (the square of the Spearman's rank correlation coefficient) between the predicted and true all-sequence distributions depends on the protein concentration and how the PM is computed. Table 1 shows the mean values and standard deviations of r 2 for 100  simulations of 8-long binding sites with μ = −3, 0 and 3 (which correspond to the preferred sequence being bound at 0.05, 0.5 and 0.95 probability, respectively). The rank correlation is shown for the complete distribution of binding sites based on PMs generated from the full distribution of binding data and from just the top 1% of sites, either weighted or unweighted. At μ = −3 there is a nearly perfect fit to the true ranking when the PM is derived from the entire distribution. However, when it is based on the top 1% of sites, the ranking is slightly less accurate (0.994) when the sites are weighted by their true probability. In both of those cases the PM provides a very good approximation to the true ranking of binding sites. If the top 1% are used unweighted to make the PM, the fit to the true ranking is 0.984 and that is true regardless of the value of μ because the top 1% of sites is the same and their probabilities are ignored. When μ = 0, the results are very similar. For μ = 3 the rank correlation drops to 0.988 when the weighted top 1% of sites are used to obtain the PM. Fig 2 plots the logarithms of the predicted and true relative binding probabilities for the case of the protein of Fig 1 and μ = 3. In each case the overall fit is quite good but the width of the plots indicates some degree of mis-ranking of the binding sites. While the overall rankings are quite good, it is the highest affinity sites that are of primary interest. In fact, all DNA-binding proteins exhibit a non-specific binding affinity [49] such that there is a minimum binding affinity below which the sequence no longer matters. In addition, functional regulatory sites must have sufficient occupancy to fulfill their roles, so only sites within some restricted range of the optimum are likely to be functional. Table 2 shows the rank correlations for the same PMs used in Table 1, but now focusing on the 1% highest affinity binding sites. Note that the values in Table 2 are all lower than for Table 1, indicating that the accuracy is lower when the PM is used to predict the highest affinity sites.  Tables 1 and 2 show that the quality of PMs, their ability to correctly rank binding sites, vary widely depending on both the protein concentration at which the binding data was obtained and the set of binding sites used to derive the PM. The effect of the protein concentration is most evident at high values of μ where the non-linearity of Eq (4) is largest and non-independence of the position probabilities is most pronounced. The effect of site sampling is due to the sensitivity of the PM to the exact set of example sites used.
Another consequence of the non-linear relationship between binding affinity and probability is that pairs (and higher order combinations) of positions have non-independent effects on binding probability, even though the contributions to binding affinity are completely independent. We show this with one example in Fig 4 based on the protein with energy matrix shown in Fig 1 at μ = 3. When the preferred binding site, TGGTAACG with binding probability of 0.95, is mutated to TGGTAAAG, the binding probability decreases to 0.79, about a 17% decrease in binding probability. If the same C to A mutation occurs in another sequence, TGGCAACG to TGGCAAAG, the binding probability decreases from 0.62 to 0.23, a 63% decrease. This apparent non-independence, where the effect of the mutation varies depending its context, is an artifact of the PM because the change in binding affinity (1.69 kT, Fig 1b) is completely independent of context. The correlation between the true distribution (in logarithm) and that predicted by the PM generated from the weighted top 1% binding sites. (c) The correlation between the true distribution (in logarithm) and that predicted by the PM generated from the unweighted top 1% binding sites.

Effect of noise in experimental data
In the preceding simulations, we compared the probabilistic model to the true biophysical model for the specificity of TFs. However, in real experimental data the observed probabilities of binding to specific sites will include noise that will affect both the probabilistic and biophysical models. To measure the effects of noise on the accuracy of rank predictions, we included errors in the observed probabilities from which both the probabilistic and biophysical matrices were obtained (details in S1 Supporting Information). We added random noise, with mean of 0 and standard deviation of 0.5 kT, to the energy of each sequence prior to generating its probability. That amount of noise is larger than what can be obtained with methods such as Specseq, where we typically get standard deviationsof about 0.2 kT, at least for the high affinity sequences [54,58]. The results for the probabilistic models are slightly worse than those without noise, described above. For the biophysical models the fits are much better, decreasing to 0.97 when only the top 1% of sites are used to estimate the parameters (see S1 Supporting Information). This is because the noise is added to each sequence independently, whereas the models include the parameters for each base at each position, which are averaged over all of the sequences containing those bases. One advantage of low-dimensional models, such as all types of positional matrices, is that experimental noise is averaged out. In fact, even if the true interaction is not precisely independent between the positions, by averaging over all of the contexts one can obtain models that are good representations of the true specificity.

Discussion
Probabilistic models of protein-DNA interactions are commonly used because they are easy to obtain and they provide an intuitive representation of specificity. However, they do not provide the information usually desired, the probability that a specific sequence is bound, P(B│S i ), but rather an approximation to the probability of observing a specific sequence given a binding site, e PðS i jBÞ. From that one can obtain a predicted rank order of all possible binding sites and, if one assumes a specific probability, or occupancy, for the preferred sequence, the predicted probabilities for all other sequences. To obtain binding probabilities from the biophysical model one needs to know the chemical potential, but just as with the probabilistic model if one assumes the probability, or occupancy, of the preferred sequence, then the probabilities of all other sequences can be obtained from the model. Since both models really return the same information, a predicted ranked list of binding sites and relative binding probabilities, they should be judged on the accuracy of those predictions and the ease of obtaining the model parameters.
The accuracy of PMs is limited by availability of binding site affinity data. When a PM is based on the entire probability distribution of binding sites it is a good approximation overall, even at high μ. However, it does have discrepancies that include mis-ordering of the ranks of binding sites as well as the appearance of non-independence between positions that are in fact independent. These effects are due to the intrinsic lack of proportionality between binding probability and binding affinity that is most problematic at high protein concentrations. More  severe defects occur due to incomplete information about the binding probability distribution.
Obtaining the full distribution of binding probabilities requires in vitro experiments, such as protein binding microarrays, HT-SELEX (or SELEX-seq) or other high-throughput methods [20,22,27,35,53,[59][60][61][62][63][64], but many algorithms utilize only the highest affinity binding sites. PMs can be derived from in vivo data of binding site locations, and have the advantage of being easily derived from such data using many motif discovery algorithms [13, 24, 32-34, 43, 65]. But in those cases, the PM is derived from only a fraction of the binding sites. Functional regulatory sites will be among the high affinity sequences and in ChIP-seq experiments the peaks will also tend to contain the highest affinity sites. And if the sample size is small, those sites are not even weighted by their binding probabilities. In addition, confounding factors occurring in vivo, such as competition and cooperativity with other proteins, lead to incomplete information about the probability distribution and that causes further inaccuracies in the PMs.
Good binding models are still important after the advent of high-throughput methods and their parameters can be readily determined by using appropriate algorithms. Binding affinities to small numbers of sequences can be obtained with arbitrarily high accuracy using a variety of experimental techniques [66]. If the additivity (positional independence) assumption is valid, the relative affinities, compared to the preferred sequence, of only the 3m single nucleotide variants are needed for the full energy model. Of course, additivity is unlikely to be completely accurate, but there are still only 3m + 9(m − 1) single variants plus double variants at adjacent positions, where the non-additivity is likely to be most prevalent. But multiple high-throughput methods are now available that provide quantitative binding data from which accurate energy models can be obtained by using appropriate algorithms [16, 18-20, 25, 27, 28, 31, 37, 38, 53, 54, 62]. From sufficiently abundant and accurate quantitative binding data one can even skip the modeling and just use the list of relative binding energies to all possible sites (or at least the highest affinity sites that are likely to function as regulatory sites), avoiding approximations entirely (to the degree allowed by the measurement accuracy). However, models are still useful because they provide a compact representation of specificity, usefully visualized with logos [13,16,42], and can provide insight into the mechanisms of binding specificity, such as the contribution of DNA structure to binding specificity [67,68]. It is also important to have good specificity models obtained from in vitro binding experiments to compare to data obtained in vivo. This allows one to identify cases where interacting TFs alter the specificity of individual TFs, which one can only infer by having good models for each TF alone [69][70][71].
We conclude by pointing out that when accurate energy models are available for DNA binding specificity there is no advantage to using probabilistic models, and in fact they can be misleading and provide inaccurate predictions. There are now good high-throughput methods for measuring relative binding affinities to very large collections of sites and good algorithms for determining accurate energy models. We propose that such models become the standard approach for representing specificity and predicting binding sites in vivo.

Methods Simulations
We developed a program, BEnDS (Binding Energy Distribution Simulations), to generate random energy matrices of a user-specified length, m. One base is randomly chosen as the preferred base at each position and assigned an energy of 0. Energies for the other bases are drawn randomly from a normal distribution with a user-specified mean and standard deviation (with default values of N(μ = 2.5, σ = 1.0)). We assume perfect additivity of binding energies so that the energy for any sequence is the sum of the energies for its bases at each position (equivalent to Eq (3)). This model, implemented in different programs, achieved the best overall performance in a test of various programs on modeling the specificity of DNA-binding proteins based on protein binding microarray (PBM) data [25]. Given an energy matrix, probabilities of binding to all possible sites are obtained using Eq (4) for various (user-specified) values of μ. For every set of site probabilities, PMs were determined. This was done both for the entire distribution and from a subset of high affinity sites, such as the top 1% (as might be expected to be functional sites). When only the top 1% of sites are used, PMs from the sites could be obtained either weighted by their probabilities, or just from the list of sites unweighted, as one might expect from a collection of known regulatory sites or from ChIP-seq type of experiment with a limited sample of observed binding sites.
Simulations that include noise in the energies, and therefore the probabilities, of each sequence are described in S1 Supporting Information. In those cases the energy matrix is obtained using non-linear regression on the site probabilities, similar to BEESEM [37] but without the need to infer the binding site position.

Measure of rank correlations
The probabilistic model does not attempt to report the probability that a site is bound, P(B│S i ), it only reports the predicted relative probability, and therefore the rank order, of different sites being bound. We compare the true rank order of the sites from their binding energies to the predicted rank order based on the PM at different protein concentrations. We report the square of the Spearman's rank correlation coefficient, r 2 .