Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors

Gene regulatory networks are ultimately encoded by the sequence-specific binding of (TFs) to short DNA segments. Although it is customary to represent the binding specificity of a TF by a position-specific weight matrix (PSWM), which assumes each position within a site contributes independently to the overall binding affinity, evidence has been accumulating that there can be significant dependencies between positions. Unfortunately, methodological challenges have so far hindered the development of a practical and generally-accepted extension of the PSWM model. On the one hand, simple models that only consider dependencies between nearest-neighbor positions are easy to use in practice, but fail to account for the distal dependencies that are observed in the data. On the other hand, models that allow for arbitrary dependencies are prone to overfitting, requiring regularization schemes that are difficult to use in practice for non-experts. Here we present a new regulatory motif model, called dinucleotide weight tensor (DWT), that incorporates arbitrary pairwise dependencies between positions in binding sites, rigorously from first principles, and free from tunable parameters. We demonstrate the power of the method on a large set of ChIP-seq data-sets, showing that DWTs outperform both PSWMs and motif models that only incorporate nearest-neighbor dependencies. We also demonstrate that DWTs outperform two previously proposed methods. Finally, we show that DWTs inferred from ChIP-seq data also outperform PSWMs on HT-SELEX data for the same TF, suggesting that DWTs capture inherent biophysical properties of the interactions between the DNA binding domains of TFs and their binding sites. We make a suite of DWT tools available at dwt.unibas.ch, that allow users to automatically perform ‘motif finding’, i.e. the inference of DWT motifs from a set of sequences, binding site prediction with DWTs, and visualization of DWT ‘dilogo’ motifs.

for all other nodes. Using this contracted matrix R (i,j) , the posterior is given by Note, however, that these calculations assume that each position has 1 parent that it depends on, i.e. it is impossible for a position not to have a dependency. While this is a reasonable approximation for applications where dependencies are common, in our case there are a considerable number of motifs where the PSWM appears to be an excellent approximation, i.e. there is almost no evidence for dependencies, and forcing each position to have a dependency is inappropriate. To account for this, we extended the calculations to not just sum over all possible spanning trees, but over all possible forests of the positions in a site. Spanning forests consist of all factorizations of the positions into one or more trees. Equivalently, each position in the site can either have zero or 1 other position that it depends on. We assign a prior to the space of forests in proportion to the number of edges occurring in the forest, i.e. if φ is a forest of the l positions with n edges in total, we assign a prior probability P (φ) ∝ ρ n (1 − ρ) l−1−n , where ρ can be interpreted as the prior probability that a given position has a dependency. All calculations the apply to sums over spanning trees can be easily extended to sums over forests by replacing the matrix R with a matrix Q given by The contracting of the edge works exactly the same for matrix Q(ρ) as for matrix R and equation (4) is replaced by Note that the expression D(Q(ρ)) corresponds to the the log-likelihood of the sequences S given ρ. Thus, to calculate the posterior probabilities of dependency for a given DWT, we first determine the value ρ * that maximizes D(Q(ρ)) and then calculate posteriors using equation (6) with ρ set to ρ * .

Rescaling of the dependency matrix
When the pair-counts n ij αβ are large, the entries R ij of the dependency matrix R may range over many orders of magnitude. When this happens, the calculation of the determinant D(R) may become numerically unstable. As far as we are aware, there is no principled method for avoiding this numerically instability of determinant calculations and we therefore rely on an ad hoc procedure for ensuring the determinant calculation is numerically stable. In particular, when the largest and smallest values of the R matrix, call them R max and R min , vary by more than a factor e k we rescale all entries in log-space the transformation where . Note that, consequently, the entries of the transformed matrix span a range of e k . In this study we chose k = 25, i.e. the maximum ratio between the largest and smallest entry of the rescaledR is e 25 ≈ 7 * 10 10 . Note that matrix entries for which the dependent and independent models have equal likelihood, i.e. when R ij = 1, are invariant under this rescaling transformation.
As explained in the main text, calculation of the conditional probability P (s|S) involves the ratio of determinants D(R(s, S))/D(R(S)), i.e. see equation (9) of the main text. When both the matrices R(S) and R(s, S) are naively rescaled toR(s, S) andR(S) according to the formula (7), then the resulting P (s|S) may no longer be precisely normalized, i.e. the sum s P (s|S) over all possible sequence segments s is no longer strictly 1. However, for the stability of the iterative motif finding procedure it is essential that the conditional probabilities P (s|S) are strictly normalized. To ensure this we adapted the rescaling procedure as follows.
Note that the conditional probability P (s|S) can also be written as with the conditional probabilities P (s i |s j , S) are given by and the posterior probability P (π|S) of the spanning tree π given alignment S is given by That is, the probability P (s|S) can be written as a weighted sum over all possible spanning trees π of the conditional probability P (s|S, π) given the sequences in S and the spanning tree π, weighing each spanning tree with its posterior probability P (π|S) given the sequences in S. To ensure numerical stability while retaining the strict normalization of P (s|S) we only rescale the entries of R in the expression P (π|S). That is we replace P (π|S) with and substitute this in equation (8). This corresponds to calculating the conditional probabilities P (s|S, π) exactly for each spanning tree π, while letting the rescaling only affect the relative probabilities P (π|S) of the different spanning trees in the sum. Finally, note that if we define the new matrix then equation (8) can be rewritten as i.e. just as equation (9) in the main text.

Scoring of partial site matches
Here we derive an approximation for scoring sequence segments that contain one or more N (i.e. unknown) nucleotides. Formally, let x be a sequence segment that contains one or more N nucleotides and let e E(x) = P (x|M )/P (x|B) correspond to the score of this degenerate sequence. Formally, P (x|M )/P (x|B) corresponds to the average of P (s|M )/P (s|B) over all sequence segments s that are consistent with x, and weighing each possible segment s with probability proportional to its probability under the background model, i.e.
where by a small abuse of notation we also use x to represent the set of sequence segments consistent with x and P (s|x) is given by Combining these equations we find P (x|M ) P (x|B) = s∈x P (s|M ) s∈x P (s|B) .
For the PSWM model the scores are given by simple products, i.e. P (s|M ) = l i=1 w i si and P (s|B) = l i=1 b si . For each position i in sequence x that is N, the sum over all s involves a sum over all possible values that s i can take. Since α w i α = 1, we have i.e. the contribution of all positions i where s i = N just disappears from the sum. The same applies to the probability P (x|B) and, consequently, the score P (x|M )/P (x|B) is simple given by the product of contributions from all letters that are not N: As we saw in equation (8) above, under the DWT model the probability P (s|S) can be written as a weighted sum over spanning trees π, of the conditional probabilities P (s|S, π) given a spanning π. In turn, the probabilities P (s|S, π) can be written as a product over conditional probabilities P (s i |s j , S) for each base s i given its parent base s j , i.e equation (9), and the conditional probability can be written as the product of the PSWM condition, and a factor that incorporates the effect of the dependency Note that the second factor on the right is precisely the factor by which the matrixR(S) is multiplied to obtainR(s, S) in equation (13). Finally, we saw that for the PSWM case, the score for sequences containing N nucleotides are obtained simply by only including the contributions from all nucleotides that are not N in the product over positions. In other words, the contribution P (s i |S) is set to 1 for positions i where s i = N . This generalizes in a straight-forward way to the DWT case. In particular, the whenever letter s i = N , we set P (s i |s j , S) = 1, which is equivalent to setting both P (s i |S) = 1, and the factor P (s i , s j |S)/(P (s i |S)P (s j |S)) = 1. That is, to obtain matrixR(s, S) of equation (13), we only multiplyR ij (S) by the factor P (s i , s j |S)/(P (s i |S)P (s j |S)) when neither s i nor s j are N.

DWTs with only adjacent dependencies: The ADJ model
To assess the contribution of distal dependencies to the motif finding we investigated the performance of a restricted DWT model in which only dependencies between neighboring positions are allowed, which we call the adjacent (ADJ) model. Instead of summing over all spanning trees π, in the adjacent model each position i is only allowed to depend on the immediately adjacent positions (i−1) and (i+1).
Restricting the sum over spanning trees in this way can be easily accomplished by simply setting R ij = 0 whenever i = (j + 1) and i = (j − 1). That is, only the entries with i = j + 1 and i = j − 1 are retained.

Training and testing the PIM model
To train a motif the PIM model of Santolini et al. [2] requires an initial PSWM motif and, for a motif of length l, all l-mers occurring in the training data. Besides using the exact same training and test data for the 121 ChIP-seq datasets, we made sure train the PIM model starting from the exact same PSWM models as were used as a starting points to train the DWT models. However, since the method calculates statistics over all l-mers, and this becomes intractable for long motifs, e.g. l = 20, we needed to prune long motifs. Thus, whenever the initial PSWM motif was longer than PIM's default length of l = 12, we pruned the PSWM to the 12 consecutive columns with the highest information content. In addition, while PIM's motif training typically finished within half an hour, some datasets took many hours, and for 3 of the 121 datasets the training had not converged after several weeks of running. Time constraints necessitated us to terminate these runs and we thus did not obtain PIM results for 3 of the 121 datasets. We set the average precision to 0.2, i.e. equal to random performance, for these 3 datasets. We adapted the PIM MATLAB code to use the trained model to calculate binding energies E(s) for each sequence segment s occurring in the test set and we calculated total binding energies E(S) = log[ s∈s e E(s) ] for each training sequence S in the exact same manner as for the DWT models.

Training and test the FMM model
The FMM method of Sharon et al. [3] differs from the other methods in that it does not require an initial PSWM motif (or a motif length), but in contrast to the other algorithms it requires not only a set of positive sequences but also a set of negative sequences. For this we used a set of 2000 random sequences with the same dinucleotide content as the input sequences, i.e. just as the decoy sequences for testing were created. Because all other methods were asked to only infer one motif, we also instructed the FMM algorithm to infer a single motif.
Eilon Sharon graciously provided us with a python script that calculates FMM scores for every sequence segment s in the input sequences and we used this to calculate, for each sequence S in the test set, a total binding energy E(S) from the binding energy of each segment s.
For 2 datasets the FMM model did not report a motif, presumably because it failed to detect any statistically significant sequence patterns, and we set the average precision to 0.2 for these 2 datasets.