Figure 1.
An initial Position Weight Matrix (PWM) is used to find a set of binding sites on ChIPseq data. Models of increasing complexity, i.e. with an increasing number of features, are then learned using single-point frequencies (PWM), two-point correlations (PIM), or a mixture of PWM models learned on sites clustered by K-means. Finally, the models with the best Bayesian Information Criterion (BIC) are used to predict new binding sites, and the procedure is repeated until convergence to a stable set of TFBSs.
Figure 2.
Observed TFBS frequencies are poorly predicted by a PWM model.
Given a set of TFBSs predicted by the PWM model on ChIP fragments, we computed the TFBS frequencies (how many times a given sequence appears in the set, gray bars) and compared them to the frequencies predicted by the PWM (blue bars), computed using single-nucleotide frequencies alone. We show the results for the most frequent sequences for the TFs Twist (A), Esrrb (B) and MyoD (C). Single-nucleotide frequencies alone fail to reproduce the statistics of the most frequently observed binding sites. (D) Kullback-Leibler divergence (DKL) between the observed probability distribution and the PWM model distribution (blue). As a control, we show the mean (cyan bars) along with two standard deviations of the DKL between the PWM model and a finite sample drawn from it (see Methods). A significant discrepancy between the observed and predicted sequence probabilities is found for 22 out of 28 factors.
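The comparison in this caption can be illustrated with a minimal sketch: build a PWM from single-nucleotide frequencies, compute the probability it assigns to each sequence, and measure the DKL (in bits) between the observed sequence frequencies and the PWM prediction. The function names, the toy sites, and the absence of pseudocounts or a background model are assumptions made for illustration. This is not the procedure described in Methods.

```python
import math
from collections import Counter

def pwm_from_sites(sites):
    """Column-wise nucleotide frequencies of equal-length sites (no pseudocounts)."""
    n = len(sites)
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        pwm.append({base: counts[base] / n for base in "ACGT"})
    return pwm

def pwm_probability(pwm, seq):
    """Probability of seq when every position is treated as independent."""
    p = 1.0
    for column, base in zip(pwm, seq):
        p *= column[base]
    return p

def dkl_observed_vs_pwm(sites, pwm):
    """D_KL(observed || PWM) in bits, summed over the observed sequences."""
    n = len(sites)
    dkl = 0.0
    for seq, count in Counter(sites).items():
        f_obs = count / n
        dkl += f_obs * math.log2(f_obs / pwm_probability(pwm, seq))
    return dkl

# Hypothetical toy set of binding sites (not data from the study).
sites = ["ACGT", "ACGT", "ACGA", "TCGT"]
pwm = pwm_from_sites(sites)
dkl = dkl_observed_vs_pwm(sites, pwm)
print(f"DKL = {dkl:.4f} bits")
```

Because the PWM here is built from the same sites it is scored on, every observed sequence has non-zero predicted probability. A PWM trained on other data would need pseudocounts to keep the logarithm finite.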
Table 1.
Information about the TFs.
Figure 3.
Models with correlations improve TFBS statistics prediction.
As in Figure 2, we show the observed frequencies (gray bars) of the most represented TFBSs for the TFs Twist (A), Esrrb (C) and MyoD (E), together with the probabilities of these sequences predicted by the PWM model (blue bars), the PIM, which takes into account interactions between nucleotides (red bars), and the PWM-mixture model (green bars). (B,D,F) Comparison between the observed frequencies of all binding sequences and the sequence probabilities predicted by the three models (same color code). The probabilities predicted by the PIM, and to a lesser extent by the mixture model, are in much better agreement with the observed frequencies than those of the PWM model.
Figure 4.
Overlap between predicted TFBSs.
(A) Venn diagrams showing the overlap between the binding sites predicted on ChIP peaks by the PWM model (blue) and the PIM (red). (B) Difference (one minus the proportion of shared binding sites) between the best binding sites predicted by the PIM and the PWM model on ChIPseq peaks (light red), and the same quantity when including the next best predicted binding sites on each peak (dark red). In several cases (e.g. Fosl1, Max, N-Myc, Srf, STAT3, Usf1), the difference between predicted binding sites is much smaller when the two best binding sites are considered, indicating that the PIM and the PWM model rank the two best binding sites differently in ChIP peaks with multiple bound sites.
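A minimal sketch of the difference measure in panel (B), under stated assumptions: each model returns, for every peak, a list of site positions ranked from best to worst, and a peak counts as shared when the top-k sites of the two models have at least one position in common. That matching rule, the function name, and the toy data are hypothetical. The caption does not specify how shared sites are counted.

```python
def site_difference(pim_ranked, pwm_ranked, k=1):
    """One minus the proportion of peaks whose top-k predicted sites overlap."""
    peaks = pim_ranked.keys() & pwm_ranked.keys()
    if not peaks:
        raise ValueError("no peaks in common between the two models")
    shared = sum(
        1 for peak in peaks
        if set(pim_ranked[peak][:k]) & set(pwm_ranked[peak][:k])
    )
    return 1 - shared / len(peaks)

# Hypothetical ranked site positions per peak (best first).
pim_ranked = {"peak1": [120, 310], "peak2": [45, 80], "peak3": [200, 15]}
pwm_ranked = {"peak1": [310, 120], "peak2": [45, 90], "peak3": [15, 200]}

best_only = site_difference(pim_ranked, pwm_ranked, k=1)
two_best = site_difference(pim_ranked, pwm_ranked, k=2)
print(best_only, two_best)
```

In this toy example, two of the three peaks disagree on the single best site but agree once the second-best site is included, which is the situation the caption describes for peaks with multiple bound sites.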
Figure 5.
(A) Minimisation of the Bayesian information criterion (BIC, see Methods) is used to select the optimal number of model parameters and avoid over-fitting the training set. The evolution of the BIC is shown for the PIM (red crosses) and the PWM-mixture model (green lines) as a function of the number of model parameters. Shades from light to dark indicate the iteration number (main loop in Figure 1), the darkest shade being assigned to the final model. (B) Kullback-Leibler divergences (DKL) between the PWM, PWM-mixture and PIM distributions and the observed distribution for the different TFs, for the BIC-optimal parameters. In all cases the PIM outperforms both the PWM and PWM-mixture models. The DKL between the PIM and a finite-size sample of sequences drawn from it is also displayed (pink, see Methods) to assess the magnitude of the DKL that is due simply to the finite number of TFBSs in the dataset. The results show that the PIM generally fits the available dataset as well as possible given its finite size. Error bars represent two standard deviations.
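The model selection step in panel (A) can be sketched with the textbook form of the criterion, BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of training sites, and L the maximised likelihood. Whether Methods uses exactly this form is an assumption, and the candidate values below are hypothetical numbers chosen only to show the trade-off.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Textbook BIC: complexity penalty minus twice the log-likelihood."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def select_by_bic(candidates, n_samples):
    """Return the (n_params, log_likelihood) pair with the lowest BIC."""
    return min(candidates, key=lambda c: bic(c[1], c[0], n_samples))

# Hypothetical models of increasing complexity: (n_params, log_likelihood).
candidates = [(30, -5200.0), (45, -5100.0), (60, -5090.0)]
best = select_by_bic(candidates, n_samples=1000)
print(best)
```

The third candidate has the highest likelihood, but its small gain over the second does not offset the penalty for 15 additional parameters, so the intermediate model is selected.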
Figure 6.
PWMs corresponding to the different basins of attraction of the PIM.
The DNA sequence variety described by each model is illustrated using the software WebLogo [52]. Shown are PWMs built from all TFBS, from the PWM-mixture model, and from the basins of attraction of the PIM for Twist (A), Esrrb (B), and MyoD (C). The attractor PWMs are grouped under the mixture PWMs with smallest distance (measured by DKL, in bits). Heatmaps showing the DKLs between attractor PWMs and mixture PWMs are displayed on the right for each factor (minimal DKLs are in black). The proportions of binding sites used for each logo are also indicated and serve to denote the corresponding PWM.
Figure 7.
Location and strength of the nucleotide pairwise interactions.
(A) Heat maps showing the values of the Normalized Direct Information between pairs of nucleotides. The matrix is symmetric by definition. PWMs are shown on the side for better visualization of the interacting nucleotides. The participation ratio R is indicated below each heat map. (B) Distances between interacting nucleotides. The box plots show the relative importance of the Normalized Direct Information as a function of the distance between interacting nucleotides. Red dots denote average values. (C) Sum of the Normalized Direct Information in the TFBSs at a given position, averaged over all considered factors (blue line). The average site information content relative to background as a function of position is also shown (red line). For both quantities, the average over the two TFBS orientations has been taken.
Table 2.
Participation Ratios.
Figure 8.
Comparison with the Nearest-Neighbor Model (NNM).
We study the effect of restricting the PIM to nearest-neighbor interactions, resulting in the NNM. (A) The BIC is shown for the PIM (red crosses) and the NNM (cyan dots) as a function of the number of interactions added. Shades from light to dark indicate the iteration, as in Figure 5. The NNM reaches a less favorable BIC than the PIM, which provides a quantitative basis for the addition of non-neighbor interactions. (B) Comparison between the observed and predicted frequencies of TFBSs according to the PWM, NNM, and PIM. The number of added interactions for the PIM and NNM is given in the legend of each plot. (C) DKLs between the NNM or PIM predicted distributions and the observed distribution, with the number of parameters that is optimal for the NNM (first two bars) and with the number of parameters that is optimal for the PIM (last bar). The improvement yielded by the PIM is clearly seen for factors such as Klf4, CTCF, E2f4 or MyoD. (D) Cumulative distribution of nearest-neighbor (red) and non-nearest-neighbor (black) interactions as a function of the number of interactions added (ranked by strength).
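The curves in panel (D) can be sketched as running counts over interactions ranked by strength. The representation of an interaction as a tuple `(i, j, strength)`, the function name, and the toy values are assumptions made for illustration. Nearest neighbors are taken to be positions with |i - j| = 1.

```python
def cumulative_nn_counts(interactions):
    """Running counts of (nearest-neighbor, other) interactions, strongest first."""
    ranked = sorted(interactions, key=lambda t: t[2], reverse=True)
    nn, non_nn, curve = 0, 0, []
    for i, j, _strength in ranked:
        if abs(i - j) == 1:
            nn += 1
        else:
            non_nn += 1
        curve.append((nn, non_nn))
    return curve

# Hypothetical interactions between site positions: (i, j, strength).
interactions = [(0, 1, 0.9), (2, 5, 0.7), (3, 4, 0.4), (1, 6, 0.2)]
curve = cumulative_nn_counts(interactions)
print(curve)
```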
Figure 9.
Representation of interactions by Hopfield patterns.
The full interaction matrix is approximated by a matrix built from the Hopfield patterns with highest eigenvalue moduli. We show the Normalized Direct Information matrices computed from successive approximations of this kind and from the full matrix. For MyoD, the correspondence between successive pairs of patterns and distinct interaction domains (middle, upper left and bottom right) is particularly clear. In all cases the full Direct Information matrix is already well approximated by such a truncated matrix. The bottom plots show histograms of the eigenvalues of highest moduli (red) and of the other ones (blue). The high eigenvalues lie on both sides of a core of smaller eigenvalues.