
Conceived and designed the experiments: E. Sharon, S. Lubliner, E. Segal. Performed the experiments: E. Sharon, S. Lubliner, E. Segal. Analyzed the data: E. Sharon, S. Lubliner, E. Segal. Contributed reagents/materials/analysis tools: E. Sharon, S. Lubliner, E. Segal. Wrote the paper: E. Sharon, S. Lubliner, E. Segal.

The authors have declared that no competing interests exist.

Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is the position specific scoring matrix (PSSM), which assumes independence between motif positions.

Transcription factor (TF) protein binding to its DNA target sequences is a fundamental physical interaction underlying gene regulation. Characterizing the binding specificities of TFs is essential for deducing which genes are regulated by which TFs. Recently, several high-throughput methods that measure sequences enriched for TF targets genome-wide were developed. Since TFs recognize relatively short sequences, much effort has been directed at developing computational methods that identify enriched subsequences (motifs) from these sequences. However, little effort has been directed towards improving the representation of motifs. In practice, available motif finding software uses the position specific scoring matrix (PSSM) model, which assumes independence between different motif positions. We present an alternative, richer model, called the feature motif model (FMM), that enables the representation of a variety of sequence features and captures dependencies that exist between binding site positions. We show how FMMs explain TF binding data better than PSSMs on both synthetic and real data. We also present a motif finder algorithm that learns FMM motifs from unaligned promoter sequences and show how de novo FMMs, learned from binding data of the human TFs c-Myc and CTCF, reveal intriguing insights about their binding specificities.

Precise control of gene expression lies at the heart of nearly all biological processes. An important layer in such control is the regulation of transcription. This regulation is performed by a network of interactions between transcription factor proteins (TFs) and the DNA of the genes they regulate. To understand the workings of this network, it is thus crucial to understand the most basic interaction, between a TF and its target site on the DNA. Indeed, much effort has been devoted to detecting TF–DNA binding locations and specificities.

Experimentally, much of the binding specificity information has been determined using traditional methodologies such as footprinting, gel-shift analysis, Southwestern blotting, or reporter constructs. Recently, a number of high-throughput technologies for identifying TF binding specificities have been developed. These methods can be classified into two major classes, in vitro and in vivo methods. In vitro methods can be further classified into methods that select high-affinity binding sequences for a protein of interest.

However, despite these technological advances, distilling the TF binding specificity from these assays remains a great challenge, since in many cases the in vivo measured targets of a TF do not have common binding sites, and in other cases genes that have the known and experimentally determined site for a TF are not measured as its targets. For these reasons, the problem of identifying transcription factor binding sites (TFBSs) has also been the subject of much computational work (reviewed by Elnitski et al.).

The experimental and computational approaches above revealed that TFBSs are short, typically 6–20 base pairs, and that some degree of variability in the TFBSs is allowed. For these reasons, the binding site specificities of TFs are described by a sequence motif.

Despite its successes, the PSSM representation makes the strong assumption that the binding specificities of TFs are position-independent. That is, the PSSM assumes that for any given TF and TFBS, the contribution of a nucleotide at one position of the site to the overall binding affinity of the TF to the site does not depend on the nucleotides that appear in other positions of the site. In theory, it is easy to see where this assumption fails. For example, consider the models described in the figure below.

(A) Eight input TFBSs that the TF recognizes. (B) A PSSM for the input data in (A), showing its log-linear model network representation, probability distributions over each position, and sequence logo. Note that the PSSM assigns a high probability to CG and GC in positions 2 and 3, as expected from the input data, but it also undesirably (and unavoidably) assigns the same high probability to CC and GG in these positions. (C) An FMM for the input data in (A), showing the associated log-linear model network, with three features, and its sequence logo. Note that features 1 and 2 assign a high probability to CG and GC in positions 2 and 3 but not to CC and GG in these positions, as desired.
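To make the point of this example concrete, here is a minimal sketch (the feature weights are our own illustrative values, not learned from data) of why a per-position model must score CC and GG as highly as CG and GC, while a pairwise-feature model need not:

```python
import math
from itertools import product

# Toy version of the figure's example: the eight input sites contain
# CG and GC at the two positions, but never CC or GG.
sites = ["CG", "GC"] * 4

def pssm_prob(seq, sites):
    """Per-position (independent) probability, as a PSSM would assign."""
    p = 1.0
    for i, nuc in enumerate(seq):
        p *= sum(1 for s in sites if s[i] == nuc) / len(sites)
    return p

# Feature model: one feature for "CG", one for "GC" (weights illustrative).
weights = {"CG": 2.0, "GC": 2.0}

def fmm_prob(seq, weights):
    """Log-linear probability: exp(weight of the active feature) / Z."""
    Z = sum(math.exp(weights.get("".join(x), 0.0))
            for x in product("ACGT", repeat=2))
    return math.exp(weights.get(seq, 0.0)) / Z

# The PSSM unavoidably scores CC and GG as highly as CG and GC;
# the feature-based model does not.
print(pssm_prob("CG", sites), pssm_prob("CC", sites))  # both 0.25
print(fmm_prob("CG", weights) > fmm_prob("CC", weights))  # True
```

The PSSM's per-position frequencies (0.5 for C and G at each position) cannot encode the correlation between the two positions, which is exactly what the pairwise features capture.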

From the above discussion, it should be clear that the position-independence assumption of PSSMs is rather strong, and that relaxing this assumption may lead to a qualitatively better characterization of TF motifs. Indeed, recent studies revealed specific cases in which dependencies between positions may exist.

The second class of models was proposed by Barash et al.

Another class of TF binding specificity models, complementary to the above two, is a mixture of models. In the above-mentioned work, Barash et al. also used a mixture of PSSMs to model TFBSs. In this representation, each motif is modeled as a mixture of PSSMs, each defining a different mode of binding. This approach was later extended as part of the LOGOS framework.

Here, we propose a novel approach for modeling TFBS motifs, termed the feature motif model (FMM).

The rest of the paper is organized as follows: we first describe the FMM representation and how it is learned from aligned TFBSs, then present our motif finder for unaligned sequences, and finally evaluate both on synthetic and real data.

We first briefly describe the FMM representation and how it is learned from aligned TFBS sequences. Next, we give a high-level view of our motif finder, which finds motifs in unaligned sequences and allows their representation as FMMs. All of the algorithms described here are available as downloadable software or as an online web service at our web site.

As mentioned above, we represent TF binding specificities as a set of sequence features f_k, each associated with a weight θ_k. The probability that the model assigns to a candidate binding site is proportional to the exponentiated sum of the weights of the features that the site satisfies (see Methods for the formal definition).

A detailed description of the learning process is presented in the Methods section. Briefly, we follow the L1-regularization approach suggested by Lee et al., incorporating an L1 penalty term into our objective function. This process of feature selection is guaranteed to converge. Finally, the output FMM is represented using a simple sequence logo, as in the example given above.

As a proof of concept, we developed novel motif finder software and used it to compare the FMM to the PSSM as models for motif representation within a de novo motif finding process. Our motif finder follows a discriminative methodology, which means that it finds motifs that are enriched in a positive set of unaligned sequences compared to a negative set of unaligned sequences. It receives as input a set of unaligned sequences that a TF binds to (positive set) and a background set of unaligned sequences that are not bound by the TF (negative set). The motif finding scheme consists of two main steps: in the first, we extract all sequences of length K (K-mers) that are enriched in the positive set; in the second, we learn an FMM or a PSSM from the aligned hits of each enriched K-mer set.
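As an illustration of the enrichment computation in the first step, the following sketch scores a K-mer by a plain hypergeometric tail p-value for a fixed positive/negative split. The actual motif finder uses the minimal hypergeometric (MHG) statistic, which scans over thresholds, so this is a simplification; the counts in the usage example are made up.

```python
from math import comb

def hypergeom_pvalue(pos_hits, n_pos, neg_hits, n_neg):
    """P(X >= pos_hits) when n_pos sequences are drawn out of
    n_pos + n_neg, of which pos_hits + neg_hits contain the K-mer."""
    N, K = n_pos + n_neg, pos_hits + neg_hits
    tail = sum(comb(K, k) * comb(N - K, n_pos - k)
               for k in range(pos_hits, min(K, n_pos) + 1))
    return tail / comb(N, n_pos)

# A K-mer hitting 8 of 10 positive sequences but only 2 of 90
# negative ones is strongly enriched:
print(hypergeom_pvalue(8, 10, 2, 90))  # on the order of 1e-8
```

A small p-value indicates that the K-mer's concentration in the positive set is unlikely under random assignment, which is the discriminative signal the first step looks for.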

The algorithm gets as input (1) sets of positive and negative (in terms of TF binding) unaligned sequences. It then (2) computes, for every possible K-mer, its enrichment in the positive set relative to the negative set, and (4) connects pairs of K-mers whose Hamming distance is at most H_Distance, or that can be aligned without mismatches with a relative shift of up to H_Shift (here a blue edge stands for Hamming distance 1 and a dotted green edge for Hamming distance 2). The algorithm then (5) iteratively selects the most significant K-mer cluster, repeating the process for every K_min ≤ K ≤ K_max and again selecting the overall best motif.

We now present an experimental evaluation of our FMM learning approach. First, we used synthetic data to tune the free parameter of the penalty term and to test whether our method can reconstruct sequence features that span multiple positions when these are present. We then compared the ability of our approach with that of PSSMs in learning real binding site specificities of human TFs from two datasets of TFBSs.

Before integrating our algorithm for learning FMMs from aligned TFBS data into our motif finder algorithm, we separately evaluated it in a controlled setting. As an initial test for our method, we wanted to evaluate the ability of our algorithm to learn sequence features that span multiple positions when such exist, and to avoid learning such features when none exist. For this purpose, we manually created eight sequence models of varying weights and features (which we will refer to as "true" models), and learned both PSSMs and FMMs from aligned TFBSs that we sampled from them.

Results are shown for eight manually constructed models, from which we drew samples and constructed FMMs and PSSMs. The presented models, from top down, are three synthetic models, followed by a PSSM and an FMM learned from MacIsaac et al. data.

We first tested the effect of the penalty term free parameter, varying its value from 10^{−6} to 100 while using between 10 and 1,000 input sequences. The results for values in the range of 10^{−1} and above guided our choice of the parameter.

Second, we estimated the minimum number of samples needed for learning FMMs, by sampling different training set sizes in the range of 10–500. In these experiments, we fixed the penalty term free parameter to the value chosen in the previous experiment.

Having validated our approach on synthetic data, we next applied it to TFBS data of human TFs. Our goal was to identify whether FMMs can describe the sequence specificities of human TFs better than PSSMs. To that end, we compared FMMs and PSSMs that were learned from the same sets of aligned TF binding sites. We chose three published sets of aligned binding site sequences of two important human TFs. The first set contains aligned NRSF binding sites published by Johnson et al.

For each input set we tested whether an FMM represents the TFBSs better than a PSSM, using the following 10-fold cross validation (CV) scheme. Each input set was partitioned into ten subsets. Ten CV groups were created, where in each one a different subset was used as test data, while the other nine were used as the training set from which both an FMM and a PSSM were learned. For each CV group, we computed the average likelihood of the test TFBSs according to both the PSSM and the FMM, as a measure of the learned sequence model's success in representing the binding specificities. The difference between the log average FMM likelihood and the log average PSSM likelihood expresses the improvement of the FMM over the PSSM. The mean and standard deviation of these differences were calculated over the ten CV groups. The results for all three input sets are shown in the figure below.
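The CV harness can be sketched as follows. The `learn_pssm` learner and the toy data are illustrative stand-ins (the real comparison additionally learns an FMM on each training fold); the difference between two such held-out scores is the improvement reported in the text.

```python
import math
import random

def learn_pssm(sites, pseudo=0.5):
    """Per-position nucleotide frequencies with a small pseudocount."""
    length = len(sites[0])
    counts = [{n: pseudo for n in "ACGT"} for _ in range(length)]
    for s in sites:
        for i, n in enumerate(s):
            counts[i][n] += 1
    total = len(sites) + 4 * pseudo
    return [{n: c / total for n, c in pos.items()} for pos in counts]

def avg_loglik(model, sites):
    """Average log-likelihood per instance under an independent model."""
    return sum(sum(math.log(model[i][n]) for i, n in enumerate(s))
               for s in sites) / len(sites)

def cross_validate(sites, learn, folds=10, seed=0):
    """Mean held-out average log-likelihood over the CV folds."""
    sites = sites[:]
    random.Random(seed).shuffle(sites)
    scores = []
    for f in range(folds):
        test = sites[f::folds]
        train = [s for i, s in enumerate(sites) if i % folds != f]
        scores.append(avg_loglik(learn(train), test))
    return sum(scores) / folds

# Toy sites: positions 2-3 form a correlated CG/GC pair.
rng = random.Random(1)
data = [rng.choice("ACGT") + rng.choice(["CG", "GC"]) + rng.choice("ACGT")
        for _ in range(200)]
print(cross_validate(data, learn_pssm))  # roughly -4 nats per site
```

Running the same harness with an FMM learner and subtracting the two scores gives the per-instance log-likelihood improvement plotted in the figure.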

(A) Train (green points) and test (blue bars) log-likelihood, shown as the mean and standard deviation improvements in the average log-likelihood per instance compared to a PSSM, for the datasets of NRSF, CTCF predicted sites, and CTCF predicted conserved sites. (B) and (C) show the PSSM and FMM feature expectation logos for CTCF predicted conserved sites, respectively. (D) and (E) show the same for NRSF sites. Each feature in the FMM feature expectation logos ((C) and (E)) is represented by a box. The horizontal position and the letters in the box define the feature. For example, the feature in the purple dashed box in (C) represents the feature "T at position 2 and A at position 7." The height of the feature is linear with respect to its expectation in the probability distribution defined by the model. A gray background marks a double-position feature.

As previously described, our motif finder algorithm consists of two steps. The first step results in a collection of K-mer set Motif Models (KMMs). Each KMM is a set of K-mers that defines an enriched motif, and can be used to extract a set of aligned TFBSs from the input positive sequences (see Methods).

In order to evaluate our motif finder's performance we chose the dataset of Harbison et al.

In order to distinguish between biologically relevant motifs and motifs that can appear by chance, we used the following procedure. We partitioned the Harbison et al. TF-condition sets into 15 bins according to their sizes. The bins were tagged by the center set sizes, [10,20,…,100,120,…,200]. For example, bin "50" contained all TF-condition sets of sizes 45–54. For each bin "X," we generated 1,000 sets of X sequences that were randomly picked out of the entire collection of 6,725 Harbison et al. microarray sequences. For each set, all remaining sequences out of the 6,725 were considered as a negative set. We ran our motif finder on all 123 true and 15,000 random sets and computed the best motif MHG p-value for each.
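The random-set control can be sketched as follows. The set sizes, trial counts, and scoring function are illustrative stand-ins; the real procedure scores each set by the best motif MHG p-value produced by the motif finder.

```python
import random

def empirical_null(universe, set_size, n_trials, score_fn, rng):
    """Score n_trials random positive sets drawn from the universe,
    each against the remaining sequences as the negative set."""
    scores = []
    for _ in range(n_trials):
        pos = rng.sample(universe, set_size)
        pos_set = set(pos)
        neg = [s for s in universe if s not in pos_set]
        scores.append(score_fn(pos, neg))
    return sorted(scores)

# Illustrative scorer: fraction of positive sequences containing a fixed
# K-mer (a stand-in for the best-motif MHG p-value).
def toy_score(pos, neg):
    return sum("CACGTG" in s for s in pos) / len(pos)

rng = random.Random(0)
universe = ["".join(rng.choice("ACGT") for _ in range(50))
            for _ in range(300)]
null = empirical_null(universe, 50, 20, toy_score, rng)
# A real set's score counts as significant if it beats, say, the
# null distribution's 95th percentile.
threshold = null[int(0.95 * len(null))]
```

Comparing the scores of the 123 true sets against this null distribution is what lets a threshold be chosen at a controlled false-positive rate.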

(A) Shown is the fraction of Harbison et al. TF-condition sets, and of random sets, whose best motif passes a given MHG p-value threshold.

Having chosen a biologically relevant motif MHG p-value threshold, we next examined the motifs that passed it.

In a previous section, we compared FMMs and PSSMs learned from aligned TFBS data. Although there are databases that contain sets of aligned TFBSs, such data are limited; we therefore turned to learning FMMs de novo from unaligned sequences.

We searched each dataset for de novo motifs using a 5-fold cross validation scheme. We assumed that each sequence in the positive set has at least one TFBS. Accordingly, for each positive sequence we computed the probability of its best TFBS under the top motif's FMM and under its PSSM, and considered it as the sequence binding likelihood. We show here the improvement of our FMM approach over the PSSM in terms of train (green dots) and test (blue bars) log average likelihood. In the dataset STAT1_IFNg, two different motifs appear as best/second best in different cross validation runs and are marked by one and two asterisks, respectively.

Dataset | Symbol | Enriched protein | Organism | Experiment | Size | Reference
Robertson et al. | STAT1_Unstimulated | STAT1 | Human | ChIP-seq | 11004 |
Robertson et al. | STAT1_IFNg | STAT1 | Human | ChIP-seq | 41582 |
Johnson et al. | NRSF | NRSF | Human | ChIPSeq | 1946 |
Kim et al. | CTCF | CTCF | Human | ChIP-chip | 13804 |
Lee et al. | PRC2_SUZ12 | SUZ12 | Human | ChIP-chip | 3465 |
Wei et al. | P53 | P53 | Human | ChIP-PET | 510 |
Wei et al. | P53_PET3 | P53 | Human | ChIP-PET | 307 |
Zeller et al. | c-Myc | c-Myc | Human | ChIP-PET | 4297 |
Zeller et al. | c-Myc_PET3 | c-Myc | Human | ChIP-PET | 593 |
Loh et al. | Oct4_Loh | Oct4 | Mouse | ChIP-PET | 1051 |
Loh et al. | Nanog_Loh | Nanog | Mouse | ChIP-PET | 2971 |
Boyer et al. | Oct4_Boyer | Oct4 | Human | ChIP-microarray | 603 |
Boyer et al. | Nanog_Boyer | Nanog | Human | ChIP-microarray | 1554 |
Boyer et al. | Sox2_Boyer | Sox2 | Human | ChIP-microarray | 1165 |
Boyer et al. | E2F4_Boyer | E2F4 | Human | ChIP-microarray | 957 |

Note that the Robertson et al. STAT1 sequences contain two sets: an interferon γ stimulated dataset and an unstimulated dataset.

For p53 and c-Myc we considered both the noisier set of sequences that were represented by two PETs, and a smaller, less noisy set (suffixed by "_PET3") of sequences that were represented by at least three PETs. For every dataset we created a negative dataset as described in Methods.

Both ChIP-seq and ChIPSeq (as referred to by the authors) use the Illumina 1G system as their platform. The ChIP-PET methodology is described in the original publications.

We focus next on our results for three important human TFs. For the first two, c-Myc and CTCF, we discuss their best FMM and PSSM motifs, and show how their FMM motifs reveal intriguing insights about their binding specificities that are missed by the PSSM and that may be correlated with previously published experimental results. For the third, STAT1, we found several motifs exhibiting the co-occurrence of STAT1 and other TFs' binding sites.

We first present the FMM and PSSM motifs learned for c-Myc.

(A) c-Myc FMM and PSSM. (B) c-Myc FMM and PSSM learned only from sequences of PET3+ clusters (a cleaner set). The black square in (A) and (B) highlights the E-Box motif. (C) Statistics for the c-Myc FMM feature marked by a dashed line. Expected occurrences are according to the PSSM in (B).

What may be the biological significance of the flanking "C-G" feature? The "CACGTG" E-box is known to be optimal for the binding of not only the c-Myc/Max heterodimer, but also of other basic/helix-loop-helix/leucine zipper (bHLHZ) dimers such as Mad/Max, Max/Max, and USF/USF. Past work has claimed that flanking bases contribute to binding specificities.

We next present the FMM and PSSM motifs learned for CTCF.

Notably, our results are also well correlated with recently published work by Xie et al.

Finally, we sought to emphasize the significance of the features captured by the FMM.

We ran our motif finder on two STAT1 datasets (see the table above): the "STAT1_IFNg" dataset, derived from interferon γ stimulated cells, and the "STAT1_Unstimulated" dataset, derived from unstimulated cells.

In this paper we presented the feature motif model (FMM), a novel approach for modeling the binding site specificities of transcription factors.

To show that FMM models describe TF binding motifs better than PSSM models, we compared their likelihoods over held-out synthetic and real data. The real biological data included both aligned TFBS data and unaligned TF bound regions data. We showed that for all types of data, the FMM representation of motifs outperforms the PSSM representation.

Importantly, FMMs can be presented using a clear and easy-to-understand logo, where important position dependencies are plainly visible. Examining our FMM results for two human TFs, c-Myc and CTCF, we found intriguing dinucleotide features that may be important for their binding. Some of those features are well-correlated with previously published results, while others may provide hypotheses on the binding specificities of these TFs. These hypotheses can be further studied experimentally to gain a better understanding of how the TF recognizes its binding sites. Notably, in those examples, the FMMs hint at the importance of positions that are regarded as uninformative by the PSSMs.

In order to allow de novo FMM motif finding, we developed a novel motif finder. Our motif finder finds motifs that are discriminatively enriched in a positive set of unaligned sequences over a negative set of unaligned sequences. For each motif, it learns either an FMM or a PSSM representation. An important property of our motif finding algorithm is that it extracts enriched sets of K-mers (KMMs), and thus preserves inter-position dependencies that exist in the data.

We demonstrated the benefits of using log-linear models (a representation of Markov networks) for representing important features of TF binding specificities, and suggested a methodology to learn such features from both aligned and unaligned input sequences.

There are several directions for refining and extending our FMM approach. First, our rich framework can model many other types of features; examples include the extent to which the sequence is a palindrome, and the structural curvature of the sequence. Another direction is to add to our learning process the ability to learn binding energies associated with a given set of sequences. Finally, using our models as an improved basic building block, we can integrate them into higher-level regulatory models.

We now present our approach for representing TF binding specificities. Much as in the PSSM representation, our goal is to represent commonalities among the different TFBSs that a given TF can recognize, and to assign a different strength to each potential site, corresponding to the affinity that the TF has for it. The key difference between our approach and the PSSM is that we want to represent more expressive types of motif commonalities than the PSSM representation allows, in which commonalities can only be represented separately for each position of the motif. Intuitively, we think of a TF–DNA interaction as one that can be described by a set of sequence features.

One way to achieve the above task is to represent a probability distribution over the set of all sequences of the length recognized by the given TF. That is, for a motif of length L, we define a distribution over sequences X_1,…,X_L, where each X_i takes one of the four nucleotide values. Representing such a distribution explicitly requires a parameter for each of the 4^L possible sequences, which is clearly infeasible for typical motif lengths.

A more natural approach that can easily capture our above desiderata is the framework of undirected graphical models, such as the log-linear representation of Markov networks (log-linear model), which has been used successfully in an increasingly large number of settings. As it is more intuitive for our setting, we focus our presentation on log-linear models. Let X = X_1,…,X_L denote the motif positions. A log-linear model is defined by a set of features f_k, each a function over (a subset of) the X_i, and an associated weight θ_k for each feature f_k. The model defines the joint distribution P(X) = (1/Z) exp(Σ_k θ_k f_k(X)), where Z = Σ_{x∈X} exp(Σ_k θ_k f_k(x)) is the partition function that normalizes the distribution.
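For short motifs, this distribution can be materialized by brute-force enumeration, which makes the definition concrete. The feature functions and weights below are illustrative, not learned values.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def fmm_distribution(length, features):
    """features: list of (indicator_function, weight) pairs.
    Returns P(x) for every sequence of the given length, computing the
    partition function by explicit enumeration (feasible for short motifs)."""
    def score(x):
        return math.exp(sum(w for f, w in features if f(x)))
    seqs = ["".join(t) for t in product(ALPHABET, repeat=length)]
    Z = sum(score(x) for x in seqs)  # partition function
    return {x: score(x) / Z for x in seqs}

# Illustrative features: one single-position, one pairwise.
features = [
    (lambda x: x[0] == "A", 1.0),                  # "A at position 1"
    (lambda x: x[1] == "C" and x[2] == "G", 2.0),  # "C at 2 AND G at 3"
]
P = fmm_distribution(3, features)
print(abs(sum(P.values()) - 1.0) < 1e-9)  # True: properly normalized
print(P["ACG"] > P["AGG"])                # True: pairwise feature boosts ACG
```

The exponential cost of enumerating Z is exactly the computational difficulty discussed later in this section; the sketch is only practical because L is tiny here.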

Recall that a PSSM defines independent probability distributions over each of the L motif positions. In fact, a PSSM is a special case of a log-linear model: it has one indicator feature f_{ij} for every position i and nucleotide j, whose weight θ_{ij} equals the log-probability of nucleotide j at position i.
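A quick numerical check of this equivalence, using an arbitrary two-position PSSM of our own choosing: setting θ_{ij} = log p_{ij} makes the partition function equal 1, so the log-linear model reproduces the PSSM probabilities exactly.

```python
import math
from itertools import product

# An arbitrary two-position PSSM (illustrative values).
pssm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}]

def pssm_prob(x):
    return math.prod(pssm[i][n] for i, n in enumerate(x))

def loglinear_prob(x):
    """The same PSSM in log-linear form: one indicator feature per
    (position, nucleotide) with weight log p_ij; Z then equals 1."""
    def score(y):
        return math.exp(sum(math.log(pssm[i][n]) for i, n in enumerate(y)))
    Z = sum(score("".join(y)) for y in product("ACGT", repeat=len(pssm)))
    return score(x) / Z

print(abs(pssm_prob("AC") - loglinear_prob("AC")) < 1e-12)  # True
```

The FMM strictly generalizes this: it keeps the log-linear form but allows features spanning more than one position.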

Given a TF that recognizes TFBSs of length L, our goal is thus to learn a set of features f_k, and a weight θ_k for each, that together best describe the TF's binding specificities.

In the previous section, we presented our feature-based model for representing motifs. Given a collection of features, we must estimate their weights; moreover, we must determine which features to include in the model in the first place.

We now present our algorithm for learning a feature-based model from TFBS data. Our approach follows the Markov network structure learning method of Lee et al.

For the parameter estimation task, we assume that we are given as input a dataset D of aligned TFBSs and a fixed set of features f_1,…,f_K. Our goal is to find the parameters θ_1,…,θ_K, where each θ_i weights its corresponding feature f_i, that maximize the likelihood of D under the model.

Although no closed-form solution exists for finding the parameters that maximize Equation 2, the objective function is concave (as discussed by Lee et al.), and can therefore be maximized by standard numerical optimization procedures.

Applying numerical optimization procedures such as gradient ascent requires the computation of the objective function and of its gradient with respect to each of the θ_k. Both computations involve the partition function Z, which sums over all 4^L possible sequences.

Since algorithms for learning log-linear models usually require computation of the partition function, this problem has been intensively researched. Although in some cases the structure of the features may allow the computation to be decomposed efficiently, in the general case it can be shown to be an NP-hard problem and hence requires approximation. Here we suggest a novel strategy for optimizing the objective function. We first use the (known) observation that the gradient of Equation 2 can also be expressed in terms of feature expectations: specifically, the partial derivative with respect to θ_k equals the difference between the empirical expectation of f_k in the data and its expectation under the model distribution.
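This identity can be verified directly on a small example by brute-force enumeration; the feature and the data below are illustrative.

```python
import math
from itertools import product

def model_expectations(length, features, thetas):
    """E_model[f_k], computed by enumerating all sequences."""
    seqs = ["".join(t) for t in product("ACGT", repeat=length)]
    scores = [math.exp(sum(th for f, th in zip(features, thetas) if f(x)))
              for x in seqs]
    Z = sum(scores)
    return [sum(s for x, s in zip(seqs, scores) if f(x)) / Z
            for f in features]

def gradient(data, length, features, thetas):
    """d(avg log-likelihood)/d(theta_k) = E_data[f_k] - E_model[f_k]."""
    emp = [sum(1 for x in data if f(x)) / len(data) for f in features]
    mod = model_expectations(length, features, thetas)
    return [e - m for e, m in zip(emp, mod)]

features = [lambda x: x[0] == "C" and x[1] == "G"]
data = ["CG", "CG", "CG", "AT"]  # feature active in 3/4 of the data

g = gradient(data, 2, features, [0.0])
# At theta=0 the model expectation of "CG" is 1/16, so the gradient
# (3/4 - 1/16) is positive, pushing the feature weight upward.
print(g[0])
```

At the optimum the gradient vanishes, i.e., the model's feature expectations match the empirical ones, which is the familiar moment-matching property of maximum-likelihood log-linear models.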

We further observe that since Equation 2 is a concave function, its restriction to any given line in its domain is also a concave function. We used this observation to apply the conjugate gradient function optimization algorithm.

Above, we developed our approach for estimating the feature parameters of a fixed model, in which the feature set is predetermined. We now turn to the problem of selecting which features to include in the model.

In this paper we followed the Markov network structure learning approach suggested by Lee et al., which is based on L1-regularization over the model parameters. To incorporate the L1-regularization into our model, we introduce a penalty term proportional to the sum of the absolute values of the feature parameters.

It is easy to see that this modified objective function is also concave in the feature parameters θ_i (as it is the sum of a concave function and a concave piecewise-linear penalty). Moreover, the theory of L1-regularized concave functions provides a stopping criterion for the algorithm that leads to the global optimum. L1-regularization has yet another desirable quality for our purpose, as it has a preference for learning sparse models with a limited number of features.
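The sparsifying effect of the L1 penalty can be seen in a small sketch. The grid search below is a crude stand-in for the gradient-based optimizer, and the data, features, and penalty weight are all illustrative.

```python
import math
from itertools import product

def penalized_loglik(data, length, features, thetas, C):
    """Average log-likelihood minus an L1 penalty C * sum |theta_k|."""
    seqs = ["".join(t) for t in product("ACGT", repeat=length)]
    def score(x):
        return sum(th for f, th in zip(features, thetas) if f(x))
    logZ = math.log(sum(math.exp(score(x)) for x in seqs))
    ll = sum(score(x) - logZ for x in data) / len(data)
    return ll - C * sum(abs(th) for th in thetas)

# One informative feature (CG, heavily over-represented) and one weakly
# informative feature.
features = [lambda x: x[0] == "C" and x[1] == "G",
            lambda x: x[0] == "T"]
data = ["CG"] * 12 + ["TA", "TT", "TG", "AC"]

def best_weight(k, C):
    """Crude 1-D grid search over theta_k (the other weight fixed at 0)."""
    best_v, best_w = None, None
    for i in range(-30, 31):
        w = i / 10
        thetas = [0.0, 0.0]
        thetas[k] = w
        v = penalized_loglik(data, 2, features, thetas, C)
        if best_v is None or v > best_v:
            best_v, best_w = v, w
    return best_w

# With the penalty, the weak feature's weight is driven to exactly zero
# while the informative feature keeps a large weight; without the
# penalty, the weak feature picks up a spurious nonzero weight.
print(best_weight(0, C=0.1), best_weight(1, C=0.1), best_weight(1, C=0.0))
```

Weights pinned at exactly zero correspond to features excluded from the model, which is how the penalty performs feature selection.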

Although the method described above is complete, in the sense that it searches over all possible features for the features that are relevant to the optimal solution, the number of possible features grows rapidly with the maximal size of the feature domain D. We therefore restrict the candidate set to features (e.g., "nucleotide n_1 at position p_1 and nucleotide n_2 at position p_2") that actually appear in the input TFBS data.

In the previous sections we described how to learn an FMM model from aligned TFBS data. We now turn to the more complex problem of finding de novo FMM elements that are enriched in a target set of relatively long, unaligned sequences compared with a background set. Recent years have seen the development of several high-throughput methods, reviewed in the introduction. The most dominant methods involve chromatin immunoprecipitation (ChIP) of DNA-bound proteins followed by either DNA microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq).

To properly describe our motif finding algorithm, we introduce the notion of a K-mer set Motif Model (KMM).

In the first step, we start by extracting the hits of all sequences of length K (K-mers) in both input sets. For each K-mer we record the number of hits n_1 out of the N_1 sequences of the positive set, and the number of hits n_2 out of the N_2 sequences of the negative set.

Next we rank the K-mers by their enrichment and use a threshold MHG_threshold, controlled by FDR (in this work we used MHG_threshold = 10^{−3}), to filter the K-mers. We then connect pairs of K-mers whose Hamming distance is at most H_Distance (in this work we used H_Distance = 1), or where the two K-mers can be aligned without mismatches when one is shifted by up to H_Shift base pairs with respect to the other (in this work we used H_Shift = 1).
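The neighbor test used to build the K-mer graph can be sketched as follows; the parameter names follow the text, while the implementation details are our own.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def are_neighbors(a, b, h_distance=1, h_shift=1):
    """Two K-mers are connected if their Hamming distance is at most
    h_distance, or if one matches the other exactly when shifted by up
    to h_shift base pairs (only the overlapping part is compared)."""
    if len(a) == len(b) and hamming(a, b) <= h_distance:
        return True
    for s in range(1, h_shift + 1):
        if a[s:] == b[:len(a) - s] or b[s:] == a[:len(b) - s]:
            return True
    return False

print(are_neighbors("CACGTG", "CACGTT"))  # Hamming distance 1 -> True
print(are_neighbors("CACGTG", "ACGTGA"))  # shift of 1 -> True
print(are_neighbors("CACGTG", "GTGCAC"))  # neither -> False
```

Connected components of this graph group together slight variants of the same underlying site, which is what allows a KMM to capture a motif's variability.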

The node with the best MHG p-value, together with its neighbors, forms a KMM. This procedure is performed for every K in the range K_min ≤ K ≤ K_max (in this work we used K_min = 5 and K_max = 8). At the end of the first step, the best KMMs are passed on to the second step.

In the second step, the motif finder produces either a PSSM or an FMM for each KMM. For each KMM, the algorithm uses all of its hits in the positive set to generate aligned TFBS data.

A special case that our motif finder recognizes and handles is that of dimer motifs. KMMs may represent dimers by holding two different alignment offsets per single K-mer sequence. For a detailed description of how dimer motifs are recognized and produced, see the supporting methods.

The main novelty of our motif finder is its ability to produce FMMs instead of PSSMs. Producing FMMs requires the motif finding algorithm to preserve inter-position dependencies, if they exist in the data. Our KMM methodology of producing motifs from K-mers preserves such dependencies.

Finally, the performance of our motif finder with respect to memory and running time is discussed in the supporting results.

An example of a transition from a KMM to an FMM or PSSM. The KMM in this example contains four short sequences. The length of the KMM sequence alignment is 11 bp; hence, we determine that the motif length will be 11 bp. We next extract all of the hits of each of the KMM K-mers in the positive set. We extend each hit of a K-mer according to the KMM alignment to produce an 11 bp long putative TFBS. For example, hits of "Seq1" are extended two bases to the left and one to the right, due to its position in the alignment. Note that different K-mers may have mutual hits (in the figure, the sequence surrounded by a blue dashed line is a hit for both "Seq2" and "Seq3"). In this way we generate a set of 11 bp long aligned putative TFBS sequences from which we can learn an FMM or a PSSM.

(0.77 MB TIF)
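The KMM-to-aligned-sites transition described in the legend above can be sketched as follows; the KMM encoding (a map from each K-mer to its offset within the alignment) and the example K-mers are hypothetical.

```python
def kmm_to_aligned_sites(kmm, sequences):
    """kmm maps each K-mer to its offset within the KMM alignment;
    the motif length is the total span of the alignment. Every hit of
    every K-mer is extended to that fixed length."""
    motif_len = max(off + len(kmer) for kmer, off in kmm.items())
    sites = []
    for seq in sequences:
        starts = set()  # mutual hits of different K-mers collapse here
        for kmer, off in kmm.items():
            pos = seq.find(kmer)
            while pos != -1:
                left = pos - off
                if 0 <= left <= len(seq) - motif_len:
                    starts.add(left)
                pos = seq.find(kmer, pos + 1)
        sites.extend(seq[s:s + motif_len] for s in sorted(starts))
    return sites

# Hypothetical KMM of three 5-mers whose alignment spans 7 bp.
kmm = {"CCACG": 0, "CACGT": 1, "ACGTG": 2}
print(kmm_to_aligned_sites(kmm, ["TTCCACGTGAA"]))  # ['CCACGTG']
```

Deduplicating by alignment start position is what collapses mutual hits of different K-mers into a single putative TFBS, as in the blue-dashed example of the legend.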

Evaluation of the L1 penalty term free parameter on synthetic data. FMM model performance in terms of the average test set likelihood on eight synthetic datasets (sampled from the models described above).

(11.12 MB TIF)

Supporting methods.

(0.77 MB PDF)

Supporting results.

(11.12 MB PDF)

G2/M checkpoints.

L1-regularization.

L1 vs. L2 regularization, and rotational invariance.