Conceived and designed the experiments: YZ GDS. Performed the experiments: YZ DG. Analyzed the data: YZ DG GDS. Wrote the paper: YZ DG GDS.
The authors have declared that no competing interests exist.
We employ a biophysical model that accounts for the non-linear relationship between binding energy and the statistics of selected binding sites. The model includes the chemical potential of the transcription factor, non-specific binding affinity of the protein for DNA, as well as sequence-specific parameters that may include non-independent contributions of bases to the interaction. We obtain maximum likelihood estimates for all of the parameters and compare the results to standard probabilistic methods of parameter estimation. On simulated data, where the true energy model is known and samples are generated with a variety of parameter values, we show that our method returns much more accurate estimates of the true parameters and much better predictions of the selected binding site distributions. We also introduce a new high-throughput SELEX (HT-SELEX) procedure to determine the binding specificity of a transcription factor in which the initial randomized library and the selected sites are sequenced with next generation methods that return hundreds of thousands of sites. We show that after a single round of selection our method can estimate binding parameters that give very good fits to the selected site distributions, much better than standard motif identification algorithms.
The DNA binding sites of transcription factors that control gene expression are often predicted based on a collection of known or selected binding sites. The most commonly used methods for inferring the binding site pattern, or sequence motif, assume that the sites are selected in proportion to their affinity for the transcription factor, ignoring the effect of the transcription factor concentration. We have developed a new maximum likelihood approach, in a program called BEEML, that directly takes into account the transcription factor concentration as well as non-specific contributions to the binding affinity, and we show in simulation studies that it gives a much more accurate model of the transcription factor binding sites than previous methods. We also develop a new method for extracting binding sites for a transcription factor from a random pool of DNA sequences, called high-throughput SELEX (HT-SELEX), and we show that after a single round of selection BEEML can obtain an accurate model of the transcription factor binding sites.
Sequence-specific DNA binding proteins, including many transcription factors (TFs), are a critical component of transcriptional regulatory networks. Knowing their quantitative specificity, both the preferred binding sites and the relative binding affinity to different sites, can facilitate the understanding of gene expression patterns and how they are affected by altered cell states and variations in the genome sequences. A variety of methods are used to estimate the quantitative specificity of DNA-binding proteins, some of them direct experimental measurements of individual sequences or a few sequences at a time
The bimolecular interaction between a DNA binding protein, TF, and a particular DNA binding sequence,
The equilibrium binding constant of the TF to the site
The specific binding component,
Equation (1) is derived by considering a simple experiment where only a single sequence,
The overall probability of the sequence being bound (
The experiment we model is a binding reaction with a pool of TF molecules and a large pool of different sequences,
As pointed out by Djordjevic
Given a large enough sample of binding sites this experimental procedure could provide good estimates of the binding free energy for each sequence in the initial pool. However, for typical lengths
Equation (1) was used by Djordjevic
This completes the description of the model. By substituting equation (3) into equation (2), and that into equation (6), we obtain the relationship between the statistics of observed binding sites,
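The chain of substitutions described here can be illustrated numerically. The sketch below is not the BEEML implementation. It assumes energies in units of kT, a saturating (Fermi-Dirac-like) binding probability with an additive non-specific term, and Bayes' rule over the prior distribution of the library. The function names, the toy energy matrix, and the parameter values are hypothetical and chosen only for illustration.

```python
import itertools
import math

BASES = "ACGT"

def pwm_energy(seq, energy_matrix):
    """Additive (PWM) energy: sum of per-position base contributions, in kT."""
    return sum(energy_matrix[i][b] for i, b in enumerate(seq))

def prob_bound(e_specific, mu, e_nonspecific):
    """Assumed form: specific and non-specific Boltzmann weights add,
    and the bound probability saturates as w / (1 + w)."""
    w = math.exp(mu - e_specific) + math.exp(mu - e_nonspecific)
    return w / (1.0 + w)

def selected_distribution(energy_matrix, mu, e_nonspecific, prior=None):
    """P(sequence | bound) by Bayes' rule, enumerating all 4^L sequences."""
    length = len(energy_matrix)
    seqs = ["".join(p) for p in itertools.product(BASES, repeat=length)]
    if prior is None:  # assumption: uniform library unless a prior is given
        prior = {s: 1.0 / len(seqs) for s in seqs}
    joint = {
        s: prob_bound(pwm_energy(s, energy_matrix), mu, e_nonspecific) * prior[s]
        for s in seqs
    }
    z = sum(joint.values())  # normalizing sum over the whole library
    return {s: v / z for s, v in joint.items()}

# Hypothetical 3-bp energy matrix; the preferred site is "ACG" (energy 0).
toy_matrix = [
    {"A": 0.0, "C": 2.0, "G": 2.5, "T": 3.0},
    {"A": 1.5, "C": 0.0, "G": 2.0, "T": 2.5},
    {"A": 3.0, "C": 2.0, "G": 0.0, "T": 1.0},
]

low = selected_distribution(toy_matrix, mu=-5.0, e_nonspecific=8.0)
high = selected_distribution(toy_matrix, mu=5.0, e_nonspecific=8.0)
print(f"share of ACG at low mu:  {low['ACG']:.3f}")
print(f"share of ACG at high mu: {high['ACG']:.3f}")
```

In this toy setting, raising the chemical potential saturates the good sites, so the selected pool becomes less enriched for the preferred sequence. This is the non-linear effect that proportional-to-affinity models ignore.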
Given a collection of N bound sequences, we model the relationship between
Maximizing the likelihood function (7) with respect to
This is a non-linear parameter estimation problem and we minimize
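To make the estimation problem concrete, the following sketch evaluates a negative log-likelihood for observed site counts and scans a grid of chemical potentials. It is a simplified stand-in and not the optimizer used in BEEML: the grid scan replaces a proper non-linear minimizer, only the chemical potential is fitted, and the energies, prior, counts, and function names are hypothetical. The binding probability uses the same assumed saturating form with an additive non-specific term.

```python
import math

def bound_prob(energy, mu, e_ns):
    """Assumed saturating binding probability with a non-specific term."""
    w = math.exp(mu - energy) + math.exp(mu - e_ns)
    return w / (1.0 + w)

def site_distribution(energies, prior, mu, e_ns):
    """P(site | bound) from Bayes' rule over the library prior."""
    joint = {s: bound_prob(e, mu, e_ns) * prior[s] for s, e in energies.items()}
    z = sum(joint.values())
    return {s: v / z for s, v in joint.items()}

def neg_log_likelihood(counts, energies, prior, mu, e_ns):
    """Negative log-likelihood of the observed counts (multinomial kernel)."""
    dist = site_distribution(energies, prior, mu, e_ns)
    return -sum(n * math.log(dist[s]) for s, n in counts.items())

# Hypothetical two-position library with known energies (kT) and uniform prior.
energies = {"AA": 0.0, "AC": 1.0, "CA": 2.0, "CC": 3.0}
prior = {s: 0.25 for s in energies}
true_mu, e_ns = 1.0, 6.0

# Idealized data: expected counts under the true parameters (no sampling noise).
true_dist = site_distribution(energies, prior, true_mu, e_ns)
counts = {s: 10000 * p for s, p in true_dist.items()}

# Crude grid scan over mu; a real fit would optimize all parameters jointly.
grid = [0.5 * k for k in range(-8, 9)]
best_mu = min(grid, key=lambda m: neg_log_likelihood(counts, energies, prior, m, e_ns))
print("best mu on grid:", best_mu)
```

With noise-free expected counts the likelihood is maximized at the generating value of the chemical potential. With sampled counts the minimum would shift by an amount that depends on the sample size.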
A practical issue is the calculation of the denominator of equation (6), the partition function. For longer values of
We use the half-site of the Mnt protein to test the method. Mnt is a repressor from phage P22 for which the binding affinities of all single-base variants of the preferred binding sequence have been measured experimentally
(A) Prior distribution of binding energy for Mnt half-site
We also used BEEML to analyze the binding data for the human transcription factor MaxA. Binding affinities to all possible 4-bp half-sites, in the context of the preferred GTG for the other half-site, were determined experimentally by the MITOMI method
The zinc-finger protein Zif268, fused to Glutathione-S-Transferase (GST), was previously purified from an E. coli expression system for use in SELEX experiments
Top Panel (A–C): Effects of
Similar results are obtained for variations of
(A) Fit of the point estimate of binding energy, as done in the Maerkl & Quake paper. (B) BEEML fit with a PWM energy model and a non-specific energy parameter. (C) BEEML fit with a position-specific di-nucleotide energy model and a non-specific energy parameter. (Note that in a previous analysis of this data
The sequencing of the initial library showed a small bias in the composition on the synthetic strand: A = 24.5%; C = 21.0%; G = 27.2%; T = 27.4%. We estimate the prior probabilities of sequences,
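As an illustration of how a compositional bias translates into sequence priors, the sketch below multiplies the reported base frequencies across positions. The independence of positions is an assumption made here for illustration and may differ from the estimator actually used. The function names are hypothetical. The reported percentages sum to 100.1% because of rounding, so they are renormalized first.

```python
import itertools

# Base composition reported for the synthetic strand of the initial library.
BASE_FREQ = {"A": 0.245, "C": 0.210, "G": 0.272, "T": 0.274}

def normalized(freqs):
    """Rescale frequencies so they sum to exactly 1 (removes rounding excess)."""
    total = sum(freqs.values())
    return {b: f / total for b, f in freqs.items()}

def sequence_prior(seq, freqs):
    """Prior probability of a sequence, assuming independent positions."""
    p = 1.0
    for base in seq:
        p *= freqs[base]
    return p

freqs = normalized(BASE_FREQ)

# Sanity check: the priors of all 3-mers should sum to 1.
total_mass = sum(
    sequence_prior("".join(k), freqs) for k in itertools.product("ACGT", repeat=3)
)

# Even a small per-base bias compounds with sequence length.
ratio = sequence_prior("TTT", freqs) / sequence_prior("CCC", freqs)
print(f"total prior mass over 3-mers: {total_mass:.6f}")
print(f"prior ratio TTT / CCC: {ratio:.2f}")
```

Under this assumption a T-rich 3-mer is already more than twice as likely as a C-rich one in the starting pool, which is why the prior cannot be treated as uniform when fitting selected sites.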
Probabilistic models for binding site recognition, such as the fairly standard log-odds method, are popular because of their simplicity and intuitive appeal, and because they are easily implemented in motif discovery algorithms
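For reference, a minimal sketch of the log-odds construction is shown below. Each matrix entry is the logarithm of the observed base frequency at a position divided by the background frequency. The example sites, the pseudocount of 1, the uniform background, and the function names are hypothetical choices for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def log_odds_matrix(sites, background, pseudocount=1.0):
    """Per-position log-odds weights: ln(f_observed / f_background)."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        matrix.append({
            b: math.log(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return matrix

def log_odds_score(seq, matrix):
    """Score of a sequence: sum of the per-position log-odds weights."""
    return sum(matrix[i][b] for i, b in enumerate(seq))

# Hypothetical aligned sites and a uniform background.
sites = ["ACG", "ACG", "ACT", "AAG", "CCG"]
background = {b: 0.25 for b in BASES}

matrix = log_odds_matrix(sites, background)
print(f"score(ACG) = {log_odds_score('ACG', matrix):.3f}")
print(f"score(TTT) = {log_odds_score('TTT', matrix):.3f}")
```

The score is linear in the per-position weights, so it treats site frequencies as directly proportional to affinity and has no parameter that could represent saturation of binding at high transcription factor concentration.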
Djordjevic
In addition to BEEML, this paper introduces a novel HT-SELEX procedure for accurate estimation of binding energies from in vitro selected sites. SELEX, and related methods, have been employed since 1990 to determine the specificity of DNA and RNA binding proteins, as well as for other purposes
Further developments of this approach are underway. While we show that a single round of HT-SELEX is sufficient to get reasonably accurate models of binding energy, we think that including data from additional rounds may provide even better models. Sequencing after each round means that we have good estimates for the prior sequence distributions at each round and since the energy model and non-specific energy will be the same, the only additional parameters to estimate are the chemical potentials at each round,
Other types of data should also be amenable to the BEEML approach. We demonstrated its application to affinity data from MITOMI experiments
A current source of
We thank all members of the Stormo lab for helpful discussions and advice about this work, especially prior work by Dana Homsi and Vineet Gupta that motivated this research.