The authors have declared that no competing interests exist.

Gene regulatory networks are ultimately encoded by the sequence-specific binding of transcription factors (TFs) to short DNA segments. Although it is customary to represent the binding specificity of a TF by a position-specific weight matrix (PSWM), which assumes each position within a site contributes independently to the overall binding affinity, evidence has been accumulating that there can be significant dependencies between positions. Unfortunately, methodological challenges have so far hindered the development of a practical and generally accepted extension of the PSWM model. On the one hand, simple models that only consider dependencies between nearest-neighbor positions are easy to use in practice, but fail to account for the distal dependencies that are observed in the data. On the other hand, models that allow for arbitrary dependencies are prone to overfitting, requiring regularization schemes that are difficult for non-experts to use in practice. Here we present a new regulatory motif model, called dinucleotide weight tensor (DWT), that incorporates arbitrary pairwise dependencies between positions in binding sites, rigorously from first principles, and free from tunable parameters. We demonstrate the power of the method on a large set of ChIP-seq data-sets, showing that DWTs outperform both PSWMs and motif models that only incorporate nearest-neighbor dependencies. We also demonstrate that DWTs outperform two previously proposed methods. Finally, we show that DWTs inferred from ChIP-seq data also outperform PSWMs on HT-SELEX data for the same TF, suggesting that DWTs capture inherent biophysical properties of the interactions between the DNA binding domains of TFs and their binding sites. We make a suite of DWT tools available at

Gene regulatory networks are ultimately encoded in constellations of short binding sites in the DNA and RNA that are recognized by regulatory factors such as transcription factors (TFs). For several decades, computational analysis of regulatory networks has relied on a model of TF sequence-specificity, the position-specific weight-matrix (PSWM), that assumes different positions in a binding site contribute independently to the total binding energy of the TF. However, in recent years evidence has been accumulating that, at least for some TFs, this assumption does not hold. Here we present a new model for the sequence-specificity of TFs, the dinucleotide weight tensor (DWT), that takes arbitrary dependencies between positions in binding sites into account and show that it consistently outperforms PSWMs on high-throughput datasets on TF binding. Moreover, in contrast to previous approaches, DWTs are directly derived from first principles within a Bayesian framework, and contain no tunable parameters. This allows them to be easily applied in practice and we make a suite of tools available for computational analysis with DWTs.

Gene regulatory networks are a crucial component of essentially all forms of life, allowing organisms to respond and adapt to their environment, and allowing multi-cellular organisms to express a single genotype into many different cellular phenotypes. Transcription factors (TFs) are central players in gene regulatory networks that bind to DNA in a sequence-specific manner. Although the molecular mechanisms through which TFs regulate expression of their target genes involve a complex interplay of interactions between TFs, co-factors, chromatin modifiers, and signaling molecules, gene regulatory networks are ultimately genetically encoded by constellations of transcription factor binding sites (TFBSs) to which the TFs bind in a sequence-specific manner.

Consequently, a key question in the analysis of gene regulatory networks is to find a proper mathematical representation of the sequence-specificities of TFs. That is, for each TF, we want to determine an energy function ^{λE(s)} [^{l}, which is already over a million for relatively short TFBSs of length _{i} is the base occurring at position

With the drastic reduction in costs of DNA sequencing over the last decade and the development of a number of experimental techniques for identifying TFBSs in high-throughput, such as ChIP-seq [

Studies going back over a decade, such as [

Several works have modeled TF binding specificity by including dependence between binding positions. A major challenge is that, when an arbitrary number of dependencies between arbitrary pairs of positions is allowed, the number of possible models and parameters grows rapidly, so that it becomes difficult to reliably identify the best models, and to avoid overfitting. Previous works have taken different approaches for addressing this challenge.

In some approaches, model complexity is directly controlled by only allowing dependencies between adjacent positions, e.g. [

In other approaches, PDs between arbitrary pairs of positions are in principle allowed, but instead of incorporating all possible pairwise dependencies, different

Alternatively, some approaches start from a model without dependencies, and use a greedy algorithm that iteratively adds PDs which maximally improve the model. For example, Sharon

A similar iterative approach is used in the work of Santolini

In spite of these efforts, no model that incorporates PDs has found widespread application in the community so far. Models that only use adjacent positions are attractive for their simplicity, but fail to capture the distal PDs that are clearly evident in the data. In contrast, models that consider arbitrary PDs make use of

Here we present a new Bayesian network model, called dinucleotide weight tensor (DWT), which takes into account all possible PDs within a rigorous probabilistic framework that has no tunable parameters and automatically avoids over-fitting. In particular, in the DWT model all unknown parameters including the topology of the network of direct interactions and the joint probabilities for all dependent pairs of nucleotides within the network are analytically marginalized over, so that binding energies

We demonstrate the power of the DWT approach using a large collection of 121 ChIP-seq data-sets representing 92 different human TFs. First, we show that DWTs outperform PSWMs for a substantial fraction of the TFs, and never perform substantially worse, demonstrating that DWTs automatically avoid over-fitting even without an explicit regularization scheme. Second, we show that DWTs outperform a restricted model that only incorporates dependencies between adjacent positions for the large majority of datasets, demonstrating that distal positions contribute to the accuracy of TFBS prediction. We also show that DWTs substantially outperform two previous approaches [

We here present the dinucleotide weight tensor (DWT) model for describing TF sequence-specificities using arbitrary pairwise dependencies. The DWT model is based on a Bayesian network model that we have applied previously to model interactions between proteins [

Let _{i}) for the individual alignment columns _{i}, i.e. _{i}) is given by an integral over all possible PSWM columns _{i}) = ∫^{i} _{i}|^{i})^{i}), where ^{i}) is a prior probability density on the PSWM column and the integral is over the simplex
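As an illustration of the column integral described above, the following minimal sketch computes the log marginal likelihood of a single alignment column, assuming a symmetric Dirichlet prior on the PSWM column. The function name `log_column_evidence` and the default pseudocount of 0.5 are assumptions made for this example and are not taken from the text.

```python
import math


def log_column_evidence(counts, pseudocount=0.5):
    """Log marginal likelihood of one alignment column.

    counts      -- how often each letter (A, C, G, T) occurs in the column
    pseudocount -- parameter of the symmetric Dirichlet prior (assumed value)

    The PSWM column is integrated out analytically, which gives a ratio of
    gamma functions (the Dirichlet-multinomial form for an ordered column).
    """
    k = len(counts)
    n = sum(counts)
    # Normalization of the prior and of the posterior over the simplex.
    log_p = math.lgamma(k * pseudocount) - math.lgamma(n + k * pseudocount)
    # One factor per letter of the alphabet.
    for c in counts:
        log_p += math.lgamma(c + pseudocount) - math.lgamma(pseudocount)
    return log_p


if __name__ == "__main__":
    # A strongly conserved column is assigned a higher probability
    # than a column in which all four letters are equally frequent.
    print(log_column_evidence([20, 0, 0, 0]))
    print(log_column_evidence([5, 5, 5, 5]))
```

Working with logarithms of gamma functions keeps the calculation stable even for columns built from thousands of sites.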

Here we generalize the PSWM model by assuming that arbitrary pairwise dependencies can occur between pairs of positions. In complete analogy with the calculations for the PSWM above, we can introduce a dinucleotide weight tensor ^{ij} we then obtain the probability _{i}, _{j}) for a pair of columns (

The evidence for dependency in the frequencies of letters at positions (_{ij}:
_{ij} will play a crucial role in the calculations. As a side remark on the interpretation of the dependencies _{ij}, in the limit of a large number of sequences ^{x} exp(−_{ij} is the mutual information of the letter frequencies in columns
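To make the dependency evidence concrete, the sketch below computes the logarithm of the ratio between the probability of a pair of columns under a joint (dinucleotide) model and under two independent columns. It assumes symmetric Dirichlet priors, with pseudocount `lam` for the 16 letter pairs and `4 * lam` for single letters, so that the pair prior marginalizes exactly to the single-column prior. The names `log_dirichlet_evidence` and `log_dependency_ratio` and the value of `lam` are assumptions for illustration only.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def log_dirichlet_evidence(counts, pseudocount):
    """Log marginal likelihood of counts under a symmetric Dirichlet prior."""
    k = len(counts)
    n = sum(counts)
    log_p = math.lgamma(k * pseudocount) - math.lgamma(n + k * pseudocount)
    for c in counts:
        log_p += math.lgamma(c + pseudocount) - math.lgamma(pseudocount)
    return log_p


def log_dependency_ratio(sites, i, j, lam=0.5):
    """Log of P(D_i, D_j) / (P(D_i) P(D_j)) for columns i and j.

    sites -- list of equal-length binding site strings over ACGT
    lam   -- pseudocount for letter pairs (assumed value)
    """
    pair = Counter((s[i], s[j]) for s in sites)
    left = Counter(s[i] for s in sites)
    right = Counter(s[j] for s in sites)
    pair_counts = [pair[(a, b)] for a in ALPHABET for b in ALPHABET]
    left_counts = [left[a] for a in ALPHABET]
    right_counts = [right[a] for a in ALPHABET]
    return (
        log_dirichlet_evidence(pair_counts, lam)
        - log_dirichlet_evidence(left_counts, 4 * lam)
        - log_dirichlet_evidence(right_counts, 4 * lam)
    )


if __name__ == "__main__":
    correlated = ["AA", "CC", "GG", "TT"] * 10
    independent = [a + b for a in ALPHABET for b in ALPHABET] * 5
    print(log_dependency_ratio(correlated, 0, 1))   # positive: dependence favored
    print(log_dependency_ratio(independent, 0, 1))  # negative: independence favored
```

With this choice of priors the ratio equals one when no data are available, and for large numbers of sites its logarithm grows roughly in proportion to the number of sites times the mutual information between the two columns.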

In contrast to the PSWM model, we do not assume that the probability _{i}) for each column _{i}|_{j})_{j}|_{k})_{k}|_{m})⋯. For any such factorization, there is a single ‘root’ position that is not dependent on any other position, and each other position _{i}|_{j}) of column _{i}|_{j}) = _{ij} _{i}), we obtain for the probability _{ij} along the edges (

Instead of assuming one particular factorization ^{l−2} is the number of spanning trees of a complete graph with _{ij} over all edges in the spanning tree

Specifically, the Laplacian _{ii} = 0 with minus the sum of the entries on the row, i.e. _{ii} = −∑_{j≠i} _{ij}, and _{ij} = _{ij} when ^{3}) steps. One complication in practice is that, when there are many sequences in
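The sum over spanning trees can be illustrated with a small self-contained sketch of the matrix-tree theorem, checked against brute-force enumeration. The matrix name `R` and all function names are assumptions for this example. The sketch uses the standard sign convention for the Laplacian (row sums on the diagonal, minus the weights off the diagonal), which differs from the convention in the text by an overall sign of the minor, and it does not include the rescaling needed for numerical stability when the entries become very large.

```python
import itertools


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting, O(n^3)."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def spanning_tree_sum(R):
    """Sum over all spanning trees of the product of edge weights R[i][j]."""
    l = len(R)
    lap = [[-R[i][j] if i != j else 0.0 for j in range(l)] for i in range(l)]
    for i in range(l):
        lap[i][i] = sum(R[i][j] for j in range(l) if j != i)
    # Any first minor of the Laplacian gives the same value.
    minor = [row[1:] for row in lap[1:]]
    return determinant(minor)


def brute_force_tree_sum(R):
    """Same sum by explicit enumeration; only feasible for very small l."""
    l = len(R)
    edges = list(itertools.combinations(range(l), 2))
    total = 0.0
    for subset in itertools.combinations(edges, l - 1):
        parent = list(range(l))

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        weight, is_tree = 1.0, True
        for a, b in subset:
            ra, rb = find(a), find(b)
            if ra == rb:  # edge closes a cycle, so this is not a tree
                is_tree = False
                break
            parent[ra] = rb
            weight *= R[a][b]
        if is_tree:
            total += weight
    return total


if __name__ == "__main__":
    ones = [[1.0] * 5 for _ in range(5)]
    print(spanning_tree_sum(ones))  # number of spanning trees on 5 nodes
```

With all weights equal to one the determinant simply counts spanning trees, so the result can be compared with Cayley's formula for a complete graph.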

We first briefly review binding site prediction using PSWMs. Assume a set of known TFBSs _{i} is the letter at position

These calculations generalize in a straightforward manner to our DWT model. The probability to sample sequence segment

Whereas the probabilities

Finally, as explained in the supporting information, we adapted the rescaling procedure explained above to ensure numerical stability of the ratio of determinants in _{s}

To infer a motif

Specifically, we will assume TF binding is well approximated by a thermodynamic equilibrium model and define, for any length-_{α} the overall frequency of letter _{b}(_{0}) that an isolated sequence segment _{0} is the energy with which the TF can be bound to _{b}(_{0}) is just the sum of the binding probabilities at each of the segments of _{S} is the number of sequence segments in

For a large set of sequence segments sampled from the background distribution

Finally, our desired log-likelihood _{0}) is the log-probability to sample all the sequences _{0}).

We initialize the DWT from a PSWM that can either be specified by the user, e.g. when a known PSWM motif is already available for the TF in question, or it can be obtained by running a standard PSWM motif finder on the input sequences

We then iterate the following steps. First we calculate the binding energies _{0} by finding the root of

Third, we predict binding sites in the sequences _{b}(_{0}) over all sites in which letters (

To visualize DWT models, we propose a graphical representation which generalizes the well-known sequence logo and which we call a ‘dilogo’. As an example,

The top row of the dilogo shows the familiar sequence logo representation of the marginal probabilities _{i}|_{j}) are shown in sequence logo format, with each row corresponding to the identity of the parent letter _{j} and each column showing the probabilities _{i}|_{j}) for the child letter _{i}.

The dilogo first of all shows the classical sequence logo representation of the marginal probabilities

Because it is unwieldy to show the conditional probabilities _{i}|_{j}) for all pairs of positions (

Finally, for those positions _{i}|_{j}) are shown in sequence logo format with one sequence logo (rows in the figure) for each possible state of the parent letter _{j} (shown on the left of the figure). For example, in the NRF1 example, the letters at position 3 through 5 depend on the letter at position 2. If position 2 shows a G, positions 3 − 5 are very likely to show the pattern CGC. However, when position 2 shows a T, positions 3 − 5 are most likely to show the pattern CTC.

To enable easy application of DWT models in motif finding we have made a tool-box with software available for motif inference with DWTs, prediction of TFBSs using DWTs, and visualization of DWT models using dilogos. Source code and executables can be downloaded from

Given a motif model that assigns energies _{s∈S} ^{E(s)}]. One complication is that the HT-SELEX sequences are all very short, i.e. about 20 nucleotides, such that some motifs can be longer than the input HT-SELEX sequences. To deal with this we padded each HT-SELEX sequence with
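The combination of segment energies into a single energy for a longer sequence can be sketched as a log-sum-exp over all segments. The sketch assumes, as the exponential weights in the text suggest, that a larger energy corresponds to stronger binding; the function name `sequence_energy` is hypothetical, and the padding of short sequences is not shown.

```python
import math


def sequence_energy(segment_energies):
    """Energy of a sequence S as log of the sum of exp(E(s)) over segments s.

    Subtracting the maximum before exponentiating avoids overflow when
    individual segment energies are large.
    """
    top = max(segment_energies)
    return top + math.log(sum(math.exp(e - top) for e in segment_energies))


if __name__ == "__main__":
    # One strong site dominates the total energy of the sequence.
    print(sequence_energy([-5.0, -4.0, 8.0, -6.0]))
    # Two equally strong sites raise the energy by log(2).
    print(sequence_energy([8.0, 8.0]))
```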

We assume that, in each round of the HT-SELEX experiment, the probability of sampling a sequence ^{E(S)}. Let _{t}(_{t}(

However, when we applied this calculation we find, for almost all corresponding HT-SELEX/ChIP-seq combinations, that the likelihood _{0} is _{t}(^{E(S)} then the observed log-enrichment log[_{t+1}(_{t}(_{t+1}(_{t}(

To incorporate this observation, we introduce a ‘temperature’ parameter ^{βE(S)}, and calculate a log-likelihood _{0} can be written as
_{S} _{S}(_{t} is the average energy of the sequences in generation ^{βE}〉_{t} is the average selection probability of sequences in generation
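As a rough sketch of this calculation, assume that a sequence present in generation t is selected into generation t+1 with probability proportional to exp(βE(S)). Under that assumption, the log-likelihood per selected sequence, relative to random sampling, is β times the average energy of the selected sequences minus the logarithm of the average of exp(βE) in the previous generation. The function and argument names below are hypothetical, and the exact form used for the reported results may differ.

```python
import math


def excess_log_likelihood(energies_prev, energies_next, beta):
    """Excess log-likelihood per selected sequence relative to random sampling.

    energies_prev -- energies E(S) of the sequences observed in generation t
    energies_next -- energies E(S) of the sequences observed in generation t+1
    beta          -- inverse 'temperature' scaling the energies
    """
    scaled = [beta * e for e in energies_prev]
    top = max(scaled)
    # Logarithm of the average selection weight exp(beta * E) in generation t.
    log_mean_weight = top + math.log(
        sum(math.exp(x - top) for x in scaled) / len(scaled)
    )
    mean_next = sum(energies_next) / len(energies_next)
    return beta * mean_next - log_mean_weight


if __name__ == "__main__":
    prev = [0.0, 0.5, 1.0, 3.0]
    nxt = [1.0, 3.0, 3.0]
    for beta in (0.0, 0.25, 0.5, 1.0):
        print(beta, excess_log_likelihood(prev, nxt, beta))
```

At β = 0 the selection model reduces to random sampling and the excess is zero, so a positive value indicates that the motif model explains part of the observed enrichment.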

For each PSWM and DWT model, we optimize

To compare the performance of DWT models with the performance of PSWMs and other motif models, we analyzed a collection of 121 ChIP-seq datasets for 92 different human TFs from the ENCODE consortium [

We processed each of the ChIP-seq datasets using CRUNCH, an integrated ChIP-seq analysis pipeline that we developed in-house and that includes automated PSWM motif analysis [

Using this PSWM as a starting motif we then iteratively fitted a PSWM and a DWT motif on the training sequences (

We then assess the ability of the fitted DWT and PSWM models to explain the ChIP-seq data. In particular, besides the 500 peak sequences of the test set, we created 2000 random decoy sequences that have the same overall dinucleotide frequencies and distribution of lengths as the binding peaks. For each of these 2500 sequences we calculate an overall binding energy

By systematically varying a cut-off on the binding energy, we obtain a precision-recall curve. For a given cut-off $E_c$, the precision is the fraction of all sequences with $E(S) \geq E_c$ that are true binding peaks, and the recall is the fraction of all true binding peaks that have $E(S) \geq E_c$.
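The precision and recall calculation can be sketched as follows. Sequences are ranked by energy, and every distinct energy value is used once as a cut-off. The function names are hypothetical, and the step-wise summary used for the average precision is one common definition that is assumed here rather than taken from the text.

```python
def precision_recall_curve(energies, is_peak):
    """Return (cutoff, precision, recall) for every distinct energy cut-off.

    energies -- binding energy of each test sequence (peaks and decoys)
    is_peak  -- True for true binding peaks, False for decoys
    Assumes that at least one sequence is a true peak.
    """
    ranked = sorted(zip(energies, is_peak), key=lambda p: p[0], reverse=True)
    total_peaks = sum(1 for flag in is_peak if flag)
    curve = []
    true_positives = 0
    for idx, (energy, flag) in enumerate(ranked):
        if flag:
            true_positives += 1
        # Emit a point only after all sequences tied at this energy are counted.
        last_of_tie = idx + 1 == len(ranked) or ranked[idx + 1][0] != energy
        if last_of_tie:
            selected = idx + 1
            curve.append(
                (energy, true_positives / selected, true_positives / total_peaks)
            )
    return curve


def average_precision(curve):
    """Step-wise area under the precision-recall curve."""
    area, previous_recall = 0.0, 0.0
    for _, precision, recall in curve:
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


if __name__ == "__main__":
    energies = [3.0, 2.0, 1.0, 0.0]
    labels = [True, False, True, False]
    curve = precision_recall_curve(energies, labels)
    print(curve)
    print(average_precision(curve))
```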

We investigated whether TFs for which the DWT most significantly outperforms the PSWM tend to fall within particular structural families and did not find any clear association. Although it is true that DWTs without any clear pairwise dependencies do not outperform PSWMs, the converse does not generally hold. That is, the fact that certain positions show clear dependencies does not guarantee that these dependencies will help distinguish binding sites from decoy sequences. Indeed, there are datasets for which DWTs show pairwise dependencies with very high posterior probability, but where the DWT does not significantly outperform the PSWM. For example, the TF MEF2A shows several pairs of positions with very strong dependency, but the MEF2A DWT does not significantly outperform the corresponding PSWM (see the table with results at

Previous investigations of dependencies between positions in TFBSs have suggested that dependencies between immediately adjacent positions are much more common and significant than dependencies between distal positions [

Whereas the PSWM never substantially outperformed the DWT (the largest difference in average precision being 3%), there is one dataset for which the ADJ model outperformed the DWT by more than 16% in average precision. This is for a ChIP-seq experiment performed in the HeLa cell-line with the chromodomain-like TF CHD2. Notably, the CHD2 TF was also assayed in the GM12878 cell-line, and for this dataset the DWT motif did outperform the ADJ motif. We investigated this case in more detail and found that the DWT had converged to a motif without any significant dependencies, whereas the ADJ had converged to a motif with identical consensus, but with several strong adjacent dependencies. As a test, we reran the DWT motif search on this dataset using the trained ADJ model as a starting motif. We found that the DWT search now converged to a motif that does outperform the ADJ model. That is, there are DWT models that outperform the ADJ for this dataset, and the reason the DWT performed poorly was that the motif search had become stuck in a poor local optimum.

Comparing the performance of DWTs with previously proposed approaches is challenging because readily usable software that can be applied to large-scale ChIP-seq results is often not available, and even when software is available it can be challenging to apply it in a manner that allows meaningful performance comparison. Our discussion of the results on the CHD2 dataset underlined that, in order to compare the performance of different motif models, it is essential that all other sources of variability are kept as constant as possible. That is, not only should we use the exact same training and test data, but the way the motifs are inferred, the way scores of segments are combined to calculate scores of longer sequences, and so on, should also be kept as similar as possible. While it was straightforward to accomplish this for comparing our own PSWM, ADJ, and DWT models in the previous section, this is much more challenging when using software from other groups. Nevertheless, we performed a comparison with two previous methods that allow distal dependencies and for which software was available.

The authors of the FMM method (Sharon

We investigated to what extent pairs of positions that show dependency are restricted to nearest-neighbor interactions. Combining results from all 121 ChIP-seq datasets we calculated the total number of adjacent and non-adjacent pairs at each posterior probability of dependency.

The total number of adjacent (solid red) and distal (solid blue) dependent pairs as a function of a cut-off on the posterior probability of the dependency of the pairs. The dashed lines show the number of adjacent (red) and distal (blue) pairs in randomized data in which DWTs were constructed from sequences sampled from PSWM models.

To confirm the statistical significance of the observed dependencies, we constructed a randomized dataset that should be devoid of dependencies as follows. For each dataset we took the inferred DWT and marginalized it to obtain the corresponding PSWM. We then sampled the same number of binding sites from this PSWM as went into the construction of the DWT, and constructed a new DWT from this set of synthetic binding sites. Finally, we calculated the posterior probabilities of dependency in the set of 121 DWTs so constructed. As shown in
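The sampling step of this randomization can be sketched as below. Each synthetic binding site is generated by drawing every position independently from the corresponding PSWM column, so that any dependency found in the resampled sites arises by chance. The representation of the PSWM as a list of per-position probabilities in the order A, C, G, T, the function name, and the fixed random seed are all assumptions for illustration.

```python
import random

ALPHABET = "ACGT"


def sample_sites_from_pswm(pswm, n_sites, seed=0):
    """Sample n_sites synthetic binding sites with independent positions.

    pswm -- list of columns, each a list of four probabilities (A, C, G, T)
    seed -- seed of the random number generator, for reproducibility
    """
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        letters = [rng.choices(ALPHABET, weights=column, k=1)[0] for column in pswm]
        sites.append("".join(letters))
    return sites


if __name__ == "__main__":
    # Hypothetical three-position motif.
    pswm = [
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.7, 0.1],
        [0.25, 0.25, 0.25, 0.25],
    ]
    for site in sample_sites_from_pswm(pswm, n_sites=5):
        print(site)
```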

Systematic evolution of ligands by exponential enrichment (SELEX) is a well-established

We collected, for each of the TFs assayed in [

To model the HT-SELEX data we assume that, at each round of the experiment, sequences are selected according to their binding energy to the TF as explained in the materials and methods. As a performance measure of a given motif model, we calculate the average excess of the log-likelihood per selected sequence in each HT-SELEX generation relative to a model which assumes random sampling of sequences.

Difference in the log-likelihood per sequence between the DWT and PSWM models for each of the 45 corresponding HT-SELEX/ChIP-seq dataset combinations, ordered from left to right by the difference in log-likelihood per sequence. The inset shows the log-likelihood per sequence for the DWT (vertical axis) against the log-likelihood per sequence for the PSWM (horizontal axis), with each dot corresponding to one dataset combination.

For 35 of the 45 combinations, the DWT outperforms the PSWM model on the HT-SELEX data (

Since its introduction in the early 1980s [

Here we have presented a new motif model, the dinucleotide weight tensor, that is general in that it allows for dependencies between arbitrary positions in the motif, is rigorous in that it is derived from first principles within a Bayesian framework, and avoids over-fitting by explicitly marginalizing over all unknown parameters. In particular, because the model has no parameters that the user needs to tune, it can be easily and robustly applied in practice. Indeed, by inferring DWTs on a large set of ChIP-seq datasets, we have shown that DWTs never perform significantly worse than PSWMs and clearly outcompete them in a substantial fraction of the cases. By showing that, for most datasets, DWTs also outperform a model in which only dependencies between adjacent positions are allowed, we further showed that distal dependencies contribute significantly to the performance of the DWTs. We also showed that DWTs outperform two previously proposed methods that incorporate distal dependencies. Notably, while we were finishing this work, a very interesting new approach was proposed by Siebert and Söding [

The fact that DWT models inferred from ChIP-seq data also outperform PSWMs on HT-SELEX data suggests that the dependencies captured by the DWT reflect something in the biophysics of the interaction between the DNA binding domain of the TF and the DNA sequence of the site. Our observation that, while significant dependencies occur between distal positions, interactions between neighboring positions are the most common, is also consistent with this interpretation. An interesting area for future research is to investigate the possible structural and biophysical basis for the observed direct dependencies. However, we should note that, despite considerable effort on our part to analyze whether the occurrence of dependencies can be related to structural features of the TFs, or to the way that they interact with the DNA, we have so far not been able to uncover any consistent biophysical interpretation of the observed dependencies. It is conceivable that there is no simple biophysical interpretation of the direct dependencies. For example, inspection of some of the DWT models suggests that dependencies often cause combinations of deleterious mutations to reduce the binding energy less than predicted by the PSWM model, and this might be a global effect that is spread across many dependencies, rather than reflecting particular structural features of the TF-DNA interaction.

Our analysis has also shown that, although DWTs strongly outperform PSWMs for some TFs, for the majority of TFs the improvement that the DWT provides is rather modest. This highlights that, for many TFs, PSWMs are sufficiently accurate for TFBS prediction, and few significant dependencies exist. Consequently, robust practical application of more complex motif models requires strong safeguards against over-fitting, precisely because for many TFs there will simply not be many strong dependencies. This is arguably the biggest advantage of the DWT models presented here: DWTs have no parameters to tune, do not overfit, and automatically reduce to a PSWM model when no significant dependencies exist. We believe that these properties make DWTs especially attractive for adoption in practical settings, and we hope that many researchers will start using DWT models in their motif finding and TFBS prediction.


SO thanks Lukas Burger for help with the Bayesian model and its implementation, and Peter Pemberton-Ross and Stephanie Bishop for help with the writing of the manuscript.