
The authors have declared that no competing interests exist.

Conceived and designed the experiments: MS TM VH. Performed the experiments: MS TM VH. Analyzed the data: MS TM VH. Contributed reagents/materials/analysis tools: MS TM VH. Wrote the paper: MS TM VH.

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to the transcription factor (TF) binding. However, this description ignores correlations between nucleotides at different positions, and is generally inaccurate: analysing fly and mouse

Gene regulatory networks are at the basis of our understanding of cell states and of the dynamics of their response to environmental cues. Central effectors of this regulation are Transcription Factors (TFs), which bind to short regulatory DNA sequences and interact with the transcription apparatus or with histone-modifying proteins to alter target gene expression

Despite the widespread use and success of PWMs, there is mounting evidence that their central hypothesis of independence between positions is not always justified. Several works have reported cases of correlations between nucleotides at different positions in TFBSs

On the computational side, a number of probabilistic models have been proposed to describe nucleotide correlations in TFBSs, generally based on specific simplifying assumptions, such as mutually exclusive groups of co-varying nucleotide positions

Here, using a variety of ChIPseq experiments from both fly and mouse, we first show that the PWM model generally does not reproduce the observed

To this purpose, we propose the general Pairwise Interaction model (PIM), which generalizes the PWM model by accurately reproducing pairwise correlations between nucleotides in addition to position-dependent nucleotide usage. The model derives from the principle of maximum entropy, which has been recently applied with success to a variety of biological problems where correlations play an important role, from the correlated activity of neurons

We find that the

The difference between the PIM and the PWM model lies in the pairwise interaction between nucleotides. Surprisingly, despite significant differences in prediction accuracy between the two models, these interactions are fairly weak, sparse and found dominantly between consecutive nucleotides, in general qualitative agreement with

The PIM only requires a modest computational effort, and the refined description of TFBS that it affords should generally prove useful when enough data is available.

We first tested how well the usual PWM model reproduced the observed TFBS statistics. Specifically, we asked how well the frequencies of different TFBSs were predicted using only single nucleotide frequencies. For this purpose, we used a collection of ChIPseq data available from the literature

An initial Position Weight Matrix (PWM) is used to find a set of binding sites on ChIPseq data. Models are then learned using single-point frequencies (PWM), two-point correlations (PIM) or a mixture of PWM models learned on sites clustered by K-Means with increasing complexity,

We then enquired whether the frequencies of the different binding sites in the set agreed with that predicted by the PWM, as would be the case if the probabilities of observing nucleotides at different positions were independent.
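As a minimal sketch of this comparison (not the authors' code), the following Python snippet estimates single-nucleotide frequencies from a set of aligned sites and uses them to predict the frequency of each distinct site under the independence assumption. The function names and the toy sites are hypothetical, and no pseudocounts or background correction are applied.

```python
from collections import Counter

ALPHABET = "ACGT"

def position_frequencies(sites):
    """Single-nucleotide frequencies at each position of equal-length aligned sites."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({a: counts[a] / len(sites) for a in ALPHABET})
    return freqs

def independent_probability(site, freqs):
    """Probability of a site if positions are independent (the PWM assumption)."""
    p = 1.0
    for i, a in enumerate(site):
        p *= freqs[i][a]
    return p

def observed_frequencies(sites):
    """Fraction of the set made up by each distinct site."""
    counts = Counter(sites)
    return {s: c / len(sites) for s, c in counts.items()}

# Toy example (hypothetical data): two perfectly correlated positions.
sites = ["AA", "AA", "CC", "CC"]
freqs = position_frequencies(sites)
observed = observed_frequencies(sites)
predicted = {s: independent_probability(s, freqs) for s in observed}
```

In this toy set each observed site has frequency 0.5, while the independent model assigns it 0.25 and spreads the remaining probability over the unobserved sites "AC" and "CA". This is the kind of mismatch the comparison is designed to reveal.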

Given a set of TFBSs predicted by the PWM model on ChIP fragments, we computed the TFBS frequencies (how many times a given sequence appears in the set, gray bars), and compared them to the PWM predicted frequencies (blue bars) computed using single nucleotide frequencies alone. We show the results for the

To get a full measure of the discrepancy between the observed distribution of TFBSs and the PWM prediction, we calculated the relative entropy, or Kullback-Leibler divergence (DKL), between the two distributions
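The Kullback-Leibler divergence between an observed distribution and a model distribution can be sketched as follows. This is an illustrative implementation of the textbook definition. The use of base-2 logarithms (bits) and the dictionary-based interface are assumptions, and any finite-sample correction is omitted.

```python
import math

def kl_divergence(p_obs, p_model):
    """D_KL(p_obs || p_model) = sum_s p_obs(s) * log2(p_obs(s) / p_model(s)).

    Both arguments map sequences to probabilities. Every sequence with
    nonzero observed probability must have nonzero model probability.
    """
    total = 0.0
    for s, p in p_obs.items():
        if p > 0.0:
            total += p * math.log2(p / p_model[s])
    return total

# Toy example (hypothetical values): the data contain only "AA" and "CC",
# while an independent model spreads probability evenly over four dimers.
p_obs = {"AA": 0.5, "CC": 0.5}
p_model = {"AA": 0.25, "AC": 0.25, "CA": 0.25, "CC": 0.25}
dkl = kl_divergence(p_obs, p_model)
```

The divergence is zero only when the model reproduces the observed frequencies exactly, and it grows as the model assigns low probability to frequently observed sites.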

Name | |||||||

Bap | |||||||

Bin | |||||||

Mef2 | |||||||

Tin | |||||||

Twi | |||||||

c-Myc | |||||||

E2f1 | |||||||

Esrrb | |||||||

Klf4 | |||||||

Nanog | |||||||

N-Myc | |||||||

Oct4 | |||||||

Smad1 | |||||||

Sox2 | |||||||

STAT3 | |||||||

Tcfcp2l1 | |||||||

Zfx | |||||||

C/EBP-beta | |||||||

CTCF | |||||||

E2f4 | |||||||

Fosl1 | |||||||

Max | |||||||

MyoD | |||||||

Myog | |||||||

NRSF | |||||||

SRF | |||||||

TCF3 | |||||||

USF1 |

The discrepancy between the observed statistics of TFBSs and that predicted by the PWM model calls for a re-evaluation of the model's main hypothesis, namely the independence of bound nucleotides. To account for the correlative structure of TFBS statistics, we wish to construct a model that assigns a frequency to each of the possible

(i) the frequency counts of the 4 nucleotides at each position in the sequence data (e.g. 40% of nucleotides at the third position are C and 60% at the fifth position are T), as the PWM does,

(ii) the frequency counts of each pair of nucleotides in the sequence data (e.g. 40% of pairs of nucleotides at the third and fifth positions are (C,T); that is, a C at the third position is always associated with a T at the fifth position, and not only in 60% of the cases as would be expected for independently bound nucleotides).

There are many models that can achieve these two requirements. In order to precisely specify a single one, we ask that the model probability distribution exactly reproduces the frequency counts of single nucleotides and pairs of nucleotides in the sequence data but is otherwise as unconstrained as possible. This is the principle of maximum entropy. This provides a natural generalization of the PWM model since the PWM model is the maximum entropy model that reproduces the frequencies of single nucleotides (condition i) above). Specifically, call

The PIM binding energy

The first sum comprises the binding energies of the individual nucleotides, with

This is an example of an inverse problem, where energies are devised from observed frequencies. As mentioned in the introduction, such problems have recently been studied in a variety of biological contexts
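To make the structure of a pairwise model concrete, here is a small Python sketch of an energy made of single-nucleotide terms plus pairwise terms, together with the corresponding Boltzmann distribution obtained by brute-force enumeration. The sign convention (lower energy means higher probability), the data layout of `h` and `J`, and all numerical values are assumptions made for illustration. The sketch does not perform the inference of the parameters from data.

```python
import itertools
import math

ALPHABET = "ACGT"
IDX = {a: k for k, a in enumerate(ALPHABET)}

def pim_energy(site, h, J):
    """Assumed convention: H = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    h[i][a]          : single-nucleotide term at position i
    J[(i, j)][a][b]  : pairwise term for positions i < j (absent pairs are zero)
    """
    idx = [IDX[a] for a in site]
    energy = -sum(h[i][a] for i, a in enumerate(idx))
    for (i, j), coupling in J.items():
        energy -= coupling[idx[i]][idx[j]]
    return energy

def pim_distribution(length, h, J):
    """P(s) = exp(-H(s)) / Z, enumerating all 4**length sequences."""
    sites = ["".join(t) for t in itertools.product(ALPHABET, repeat=length)]
    weights = [math.exp(-pim_energy(s, h, J)) for s in sites]
    z = sum(weights)
    return {s: w / z for s, w in zip(sites, weights)}

# Toy model (hypothetical values): no single-site preference and one coupling
# that favours identical nucleotides at positions 0 and 1.
h = [[0.0] * 4 for _ in range(2)]
J = {(0, 1): [[1.0 if a == b else 0.0 for b in range(4)] for a in range(4)]}
probs = pim_distribution(2, h, J)
```

With an empty `J` the distribution factorizes over positions, which is the PWM limit. Enumeration scales as 4 to the power of the site length, so this approach is only practical for short sites.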

Similarly to

Precise predictions of TFBSs are one important output of ChIPseq data. They condition further validation experiments such as gel mobility shift assays or mutagenesis. Therefore, we assessed the difference in TFBS predictions between pairwise (PIM) and independent (PWM) models.

First, we compared the set of ChIP sequences retrieved by the two models at the cutoff of

(A) Venn diagrams showing the overlap between the ChIP predicted by the PWM model (blue) and PIM (red). (B) Difference (one minus the proportion of shared binding sites) between the best binding sites predicted by the PIM and PWM model on ChIPseq peaks (light red), and the same quantity when including the next best predicted binding sites on each peak (dark red). In several cases (

Second, using the set of ChIPseq peaks on which the PIM was learned, we looked for the best predicted binding sites on each ChIPseq bound fragment using both the PWM model and the PIM (

In conclusion, we found that the TFBS predictions made by the two models could differ significantly both in the rank of ChIPseq fragments and in the rank of binding sites on these fragments.
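The site-level comparison can be illustrated with a short Python sketch that finds the best-scoring window on each fragment under two scoring functions and reports one minus the proportion of shared best sites. The scoring functions below are hypothetical stand-ins for the PWM and PIM scores. The sketch scans the forward strand only and resolves ties by taking the first window, which are simplifying assumptions.

```python
def best_site(fragment, width, score):
    """Return (start, site) of the highest-scoring window of a fragment."""
    windows = [(i, fragment[i:i + width]) for i in range(len(fragment) - width + 1)]
    return max(windows, key=lambda w: score(w[1]))

def fraction_differing(fragments, width, score_a, score_b):
    """One minus the proportion of fragments whose best site is shared by both models."""
    differing = sum(
        best_site(f, width, score_a)[0] != best_site(f, width, score_b)[0]
        for f in fragments
    )
    return differing / len(fragments)

# Toy example (hypothetical scores): model A rewards A's, model B rewards C's.
def score_a(site):
    return site.count("A")

def score_b(site):
    return site.count("C")

fragments = ["AATCC", "AAAAA"]
diff = fraction_differing(fragments, 2, score_a, score_b)
```

Here the two toy models disagree on the first fragment and agree on the second, so the reported difference is one half.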

An underlying assumption of the PWM model is that there exists a preferred consensus sequence, of which other sequences are close variants. Some authors have instead analyzed the binding specificity of transcription factors by introducing multiple preferred sequences

(A) Minimisation of the Bayesian information criterion (BIC, see

The best description of Twi ChIPseq data is, for instance, provided by a mixture of 5 PWMs, which corresponds to 184 independent parameters. The mixture model yields a significant improvement over the single-PWM model for Twi, and milder ones for Esrrb and MyoD. In all three cases, however, it does not perform as well as the PIM.
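The parameter count of 184 can be checked with a short calculation. The sketch below assumes that each PWM column carries 3 independent frequencies, that a mixture of K PWMs adds K - 1 independent weights, and that the site width is 12. The width is an assumption chosen because it reproduces the quoted count; the function names are hypothetical.

```python
def pwm_parameters(width):
    """Independent parameters of one PWM: 3 free frequencies per position."""
    return 3 * width

def mixture_parameters(n_pwms, width):
    """K PWMs plus K - 1 independent mixture weights."""
    return n_pwms * pwm_parameters(width) + (n_pwms - 1)

# Assumed site width of 12: 5 * 36 + 4 = 184.
n_params = mixture_parameters(5, 12)
```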

As in the PWM case, the finite size of the datasets leads us to expect fluctuations in the estimation of the DKL. In order to assess the magnitude of these finite-size fluctuations, we computed the average DKL between the best-fitting PIM and a finite-size artificial sample drawn from its own distribution, as shown in

The connection between the PIM and the PWM-mixture model can be further explored by considering the binding energies of all possible L-mers. This can be viewed as the “energy landscape” of the PIM in the space of all possible binding nucleotide sequences. The “energy” of a sequence is defined in terms of its probability (i.e. how frequently it appears in a set of binding sites) by

Using this procedure, we computed the set of PWMs and weights corresponding to the PIM inferred from the 22 TFs for which the PWM did not offer a satisfying description. Examples are shown in

The DNA sequence variety described by each model is illustrated using the software WebLogo

This representation allows one to identify interesting features captured by the PIM. For example, in the case of Twist, most of the correlations come from the two nucleotides at the center of the motif, which take mainly

The inference of the PIM yields explicit values for the interaction parameters

To estimate the strength of interaction between two positions, we used the tool of Direct Information, originally introduced to predict contacts between residues from large-scale correlation data of protein families

(A) Heat maps showing the values of the Normalized Direct Information between pairs of nucleotides. The matrix is symmetric by definition. PWMs are shown on the side for better visualization of the interacting nucleotides. The participation ratio R is indicated below each heat map. (B) Distances between interacting nucleotides. The box plots show the relative importance of the Normalized Direct Information as a function of the distance between interacting nucleotides. Red dots denote average values. (C) Sum of normalized direct informations in the TFBSs at a given position, averaged over all considered factors (blue line). The average site information content relative to background as a function of position is also shown (red line). In both quantities, the average over the two TFBS orientations has been taken.

Name | Part. Ratio DInorm | Part. Ratio MInorm |

Bin | ||

Mef2 | ||

Twi | ||

E2f1 | ||

Esrrb | ||

Klf4 | ||

Nanog | ||

N-Myc | ||

Oct4 | ||

Sox2 | ||

Tcfcp2l1 | ||

Zfx | ||

C/EBP-beta | ||

CTCF | ||

E2f4 | ||

Fosl1 | ||

Max | ||

MyoD | ||

Myog | ||

NRSF | ||

TCF3 | ||

USF1 |

The interaction strength can also be used to measure the typical distance between interacting nucleotides. To that purpose, we computed the relative weight of the Direct Information as a function of the distance between nucleotides (see

Finally, we asked how the interaction strength depended on the position along the sequence. We found that interactions were strongest in the flanking regions of the binding site, in clear anti-correlation with the information content, which concentrates in the central region (

The fact that nearest-neighbor interactions are found to be predominant in our unbiased analysis may suggest that they are in fact sufficient to reproduce the statistics of TFBSs. This is interesting to test since PIMs restricted to nearest-neighbor interactions are equivalent to first-order Markov models, which are computationally more tractable and in widespread use. In order to assess this possibility, we followed the same iterative procedure as for the PIM but only allowed the addition of nearest-neighbor interactions. The results for the resulting Nearest-Neighbor Model (NNM) are shown in

We study the effect of restricting the PIM to nearest-neighbor interactions, resulting in the NNM. (A) The BIC is shown for the PIM (red crosses) and NNM (cyan dots) as a function of the number of interactions added. Shade from light to dark indicates the iteration, similarly to

We also analysed the interaction matrix

This form is reminiscent of the Hopfield model

We wondered how many patterns were necessary to approximate the full interaction matrix

The full interaction matrix

The availability of ChIPseq data for many TFs is an opportunity to revisit the question of nucleotide correlations in TFBSs, and to propose alternative descriptions of TFBS ensembles beyond the PWM

To refine the PWM description, we have proposed and analyzed a model with general pairwise interactions (the PIM), as well as a model using a mixture of PWMs. While the mixture model somewhat improves over the PWM description, the PIM achieves a much more significant and general improvement, and can even be shown to be optimal given the amount of available data. The PIM could account for higher-order correlations than pairwise, superseding explicit descriptions in terms of multiple motifs such as the one provided by the PWM-mixture model.

Several other approaches have previously been proposed to describe nucleotide correlations in TFBSs, usually based on computationally friendly approaches such as Markov models or Bayesian networks

The PIM derives from the principle of maximum entropy, with the constraint that pairwise correlations are accurately described by the model. This approach has already been applied in a variety of biological contexts. The determination of amino acid interactions in protein structures

The inferred parameters of the PIM provide insight into the location and strength of the effective interactions between nucleotides without potential biases coming from model simplifying assumptions. The dominant pairwise interactions are found mainly between consecutive nucleotides in the TFBS flanking regions, in agreement with

However, the physical interpretation of the effective interactions is not clear, since these may combine real physical interactions with genomic correlations. This is similar to the case of protein families, where structural and functional constraints are hard to distinguish from phylogenetic correlations or other observational biases

Independently of these future prospects, we have found that the TFBSs predicted from ChIPseq data depended significantly on the model used to extract them. Since the PIM and the developed workflow significantly improve TFBS description and require a modest computational effort, they should prove worthy tools in future data analyses.

We use both ChIP-on-chip data from

It is important to discriminate the statistics of the motifs proper from that of the background DNA on which motifs are found. Besides particular nucleotide frequencies, the background DNA can exhibit significant nucleotide correlations, for instance arising from CpG depletion in mammalian genomes (Figure S5 in

Along with the ChIPseq data for the different factors, we also retrieved corresponding PWMs from the literature

First, because we restricted ourselves to binding sites of size

The PWM model consists of a matrix of single nucleotide probabilities of size

The Kullback-Leibler divergence is a measure of distance between two probability distributions

Throughout this paper, when a DKL is calculated between a finite sample and a model distribution,

To estimate whether the description of the data by a model (

Information theory offers a principled way to determine the probabilities of a set of states given some measurable constraints. It consists in maximizing a functional known as the entropy

Finally, using the constraint

The normalization constant

The split of the energy

To uniquely prescribe the

This condition can be imposed on any set of energies

These can be imposed on any set of pairwise interactions

The parameters of the model in Eq. (9), giving the energy of an observed sequence of length

To build the model, we start from the PWM description, characterized by the set of initial

Consider a sample

That is, the probability of the model given the data can be inferred from the probability that the data is generated by the model. The latter is obtained by marginalizing the joint distribution of the data and the parameters over the space of parameters

For a unidimensional parameter

In the present case, the sample

The interpretation of Eq. (21) is clear: adding new parameters improves the fit, but also adds new sources of uncertainty about these parameters due to the finite size of the data. This uncertainty disappears as

Finally, Eq. (21) is a functional over models, the chosen model

We investigated an approach based on a mixture of PWMs. For that purpose, we used a setup comparable to that used for the PIM. However, instead of adding correlations to a given PWM, new PWMs were added to a mixture model. More precisely, a mixture of

We defined the basins of attraction of a PIM energy landscape in the following fashion. Let

We computed local-energy-minimum sequences and their basins of attraction for the final set of bound sites obtained with the best PIM. A PWM was learned on each basin of attraction, leading to a set of representative PWMs, with different weights representing different proportions of bound sites in their basins.
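A possible implementation of this grouping is sketched below in Python. It assumes that the basin of a sequence is found by steepest descent over single-nucleotide changes, moving to the lowest-energy neighbour until no neighbour has a strictly lower energy. This descent rule, the toy energy function and all names are illustrative assumptions rather than the exact procedure used for the PIM.

```python
from collections import defaultdict

ALPHABET = "ACGT"

def single_mutants(site):
    """All sequences that differ from site at exactly one position."""
    for i, a in enumerate(site):
        for b in ALPHABET:
            if b != a:
                yield site[:i] + b + site[i + 1:]

def descend(site, energy):
    """Follow steepest descent over single-nucleotide changes to a local minimum."""
    current = site
    while True:
        best = min(single_mutants(current), key=energy)
        if energy(best) >= energy(current):
            return current
        current = best

def basins(sites, energy):
    """Group sites by the local energy minimum they descend to."""
    groups = defaultdict(list)
    for s in sites:
        groups[descend(s, energy)].append(s)
    return dict(groups)

# Toy landscape (hypothetical): two minima, at "AA" and "CC".
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def toy_energy(site):
    return min(hamming(site, "AA"), hamming(site, "CC"))

sites = ["AA", "AT", "CC", "GC"]
grouped = basins(sites, toy_energy)
weights = {m: len(g) / len(sites) for m, g in grouped.items()}
```

A PWM can then be estimated from the sites in each group, with the group weights giving the proportion of bound sites attributed to each representative motif.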

We wanted to build a quantity based solely on direct interactions

The

The normalization of the probabilities

The Direct Information

As there is no upper bound for this direct information, we built a normalized version of the direct information:

The Mutual Information was defined as:
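For illustration, the textbook mutual information between two positions, $MI_{ij} = \sum_{a,b} f_{ij}(a,b) \log \left[ f_{ij}(a,b) / (f_i(a) f_j(b)) \right]$, can be computed from aligned sites as in the Python sketch below. The base-2 logarithm, the absence of pseudocounts and the function name are assumptions.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j of aligned sites."""
    n = len(sites)
    pair = Counter((s[i], s[j]) for s in sites)
    fi = Counter(s[i] for s in sites)
    fj = Counter(s[j] for s in sites)
    mi = 0.0
    for (a, b), count in pair.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi

# Toy examples (hypothetical data).
mi_correlated = mutual_information(["AA", "AA", "CC", "CC"], 0, 1)
mi_independent = mutual_information(["AA", "AC", "CA", "CC"], 0, 1)
```

Perfectly correlated positions with two equally frequent states give one bit, while independent positions give zero.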

For each TF, an interaction weight was defined for each pair of nucleotides as

Similarly, a ‘correlation weight’ can be defined by replacing

The interpretation is simple: if all weights are equal, then

The previously defined interaction weights were averaged over all possible pairs of nucleotides at a given distance

where

In the PIM energy, shown in Eq. (1) from the main text, only

Since the matrix

Denoting by

Finally, the full PIM energy is given by:

The source code used in the present paper is available at


We wish to thank PY Bourguignon and I Grosse for stimulating discussions at a preliminary stage of this work.