A Feature-Based Approach to Modeling Protein–DNA Interactions

Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also developed a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/.

For each dataset, we used the following protocol:
(1) For each cross-validation (CV) group, and for each motif finder: run the motif finder with the training data as input, and obtain putative aligned TFBSs for the best motif.
(2) For each CV group, and for each motif finder: learn both an FMM and a PSSM representation of the best motif from the aligned TFBSs generated in step 1.
(3) For each CV group, and for each motif finder: score each of the test sequences (positive and negative) by the log-likelihood of the best hit of the FMM in that sequence (FMM_score), and similarly with the PSSM (PSSM_score), so that each sequence has two scores. Repeat this for the training sequences as well.
(4) For each CV group, and for each motif finder: rank all test sequences (positive and negative) by their FMM_score, from highest to lowest. Using ROC (receiver operating characteristic) analysis based on this ranking, calculate the AUC (the area under the ROC curve) as a measure of how well the FMM discriminates the positive set from the negative set (an AUC of 0.5 is no better than random; the higher the AUC, the better the discrimination). Call this AUC  (Here the protocol ends).
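Steps 2-4 can be illustrated with a minimal sketch for the PSSM case. This is not the authors' software: the function names, the pseudocount, the toy sequences, and the restriction to forward-strand hits are all assumptions made for illustration. The AUC is computed as the fraction of positive/negative pairs in which the positive sequence scores higher, which is equivalent to the area under the ROC curve.

```python
import math

BASES = "ACGT"

def learn_pssm(aligned_sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length aligned sites.
    The pseudocount is an assumed smoothing choice, not taken from the paper."""
    width = len(aligned_sites[0])
    pssm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: counts[b] / total for b in BASES})
    return pssm

def best_hit_score(pssm, sequence):
    """Log-likelihood of the best-scoring window (forward strand only, by assumption).
    Returns -inf if the sequence is shorter than the motif."""
    width = len(pssm)
    best = -math.inf
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log(pssm[i][base]) for i, base in enumerate(window))
        best = max(best, score)
    return best

def auc(pos_scores, neg_scores):
    """Area under the ROC curve: probability that a positive outranks a negative,
    with ties counted as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy data (hypothetical): aligned sites from step 1, then held-out test sequences.
sites = ["TGACTC", "TGACTC", "TGAGTC", "TGACTA"]
pssm = learn_pssm(sites)
positives = ["AAATGACTCAAA", "CCTGAGTCGG"]
negatives = ["AAAAAAAAAAAA", "CGCGCGCGCGCG"]
pos_scores = [best_hit_score(pssm, s) for s in positives]
neg_scores = [best_hit_score(pssm, s) for s in negatives]
print(auc(pos_scores, neg_scores))
```

Because the AUC depends only on the ranking of sequences, it can be compared across motifs of different widths, whereas the raw best-hit log-likelihoods cannot.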
To follow the above protocol, we sought to compare our motif finder to other motif finders that output aligned TFBSs (as required by step 1). Different motif finders may find motifs of different lengths, so they cannot be compared directly based on the likelihood of their best hits (FMM_score and PSSM_score). Since we expect a true motif to discriminate between the positive and negative sets, we chose the AUC score as the basis for comparison. This score removes the concern of a motif-length-related bias in favor of any of the motif finders. Because we used a discriminative score, and because our FMM motif finder is itself discriminative, we also sought to test our motif finder against a discriminative motif finder. Accordingly, we compared our motif finder's performance with that of three others: AlignACE [1], MDscan [2] and DEME [3]. The first two are non-discriminative and were therefore run using only the positive training sequence sets as input. The last is a state-of-the-art discriminative motif finder and therefore received the negative training sequence sets as well (as did our own motif finder). The three motif finders were run with default parameters. MDscan and DEME require that the motif width be given as input.
For this purpose, we used the lengths of the best motifs found by our motif finder for the datasets (see below in "Supplemental Results: De-Novo Motifs"). The comparison results are summarized in Figures S6-S8. Figure S6 shows the results when we compared PSSMs learned by our motif finder to PSSMs learned by the other tools. In a majority of the cases, our PSSMs were found to better represent the motif. This supports the claim that our motif finder does not produce aligned TFBSs that are unfairly biased against the PSSM representation. Figure S7 shows the results when we compared FMMs learned by our motif finder to PSSMs learned by the other tools. In a majority of the cases, our FMMs were found to better represent the motif. This supports our basic claim that producing FMM motif models using our motif finder has an advantage over the PSSMs that other motif finders produce. Figure S8 shows the results when we compared FMMs learned by our motif finder to FMMs learned by the other tools. In a majority of the cases, our FMMs were found to better represent the motif. This demonstrates the advantage of using our motif finder to learn FMM motif models.
Figure 3) as a function of the number of data instances and the L1 penalty free parameter (α). We observed that the effect of the value of α is, as predicted, much stronger on small datasets. Whereas values of α that are too small might not prevent overfitting (resulting in low average test likelihood), values that are too large might impose too harsh a restriction on the learned features. However, relatively small values of α (α=1) prevented overfitting for PSSM-sampled datasets of size 1000. Based on these results, we selected the value α=1, which gave relatively good performance on all datasets, for our runs.
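To make the role of α concrete, one common way to write an L1-penalized learning objective for a log-linear model is sketched below. The notation is assumed for illustration (θ_k for the feature weights, x^(i) for the N training binding sites) and may differ from the formulation in the main text.

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \;
  \sum_{i=1}^{N} \log P\bigl(x^{(i)} \mid \theta\bigr)
  \;-\; \alpha \sum_{k} \lvert \theta_k \rvert
```

Under this form, the log-likelihood term grows with N while the penalty term does not, which is consistent with the observation that the choice of α matters most on small datasets.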
De-Novo Motifs. Figures S9-S23 show a summary of the motifs found de novo in the examined human and mouse datasets, which are described in Table 1.

Figure S9
A summary of all de-novo found motifs for the c-Myc dataset. "P"/"N" stand