
Conceived and designed the experiments: YZ GDS. Performed the experiments: YZ DG. Analyzed the data: YZ DG GDS. Wrote the paper: YZ DG GDS.

The authors have declared that no competing interests exist.

We employ a biophysical model that accounts for the non-linear relationship between binding energy and the statistics of selected binding sites. The model includes the chemical potential of the transcription factor, non-specific binding affinity of the protein for DNA, as well as sequence-specific parameters that may include non-independent contributions of bases to the interaction. We obtain maximum likelihood estimates for all of the parameters and compare the results to standard probabilistic methods of parameter estimation. On simulated data, where the true energy model is known and samples are generated with a variety of parameter values, we show that our method returns much more accurate estimates of the true parameters and much better predictions of the selected binding site distributions. We also introduce a new high-throughput SELEX (HT-SELEX) procedure to determine the binding specificity of a transcription factor in which the initial randomized library and the selected sites are sequenced with next generation methods that return hundreds of thousands of sites. We show that after a single round of selection our method can estimate binding parameters that give very good fits to the selected site distributions, much better than standard motif identification algorithms.

The DNA binding sites of transcription factors that control gene expression are often predicted based on a collection of known or selected binding sites. The most commonly used methods for inferring the binding site pattern, or sequence motif, assume that the sites are selected in proportion to their affinity for the transcription factor, ignoring the effect of the transcription factor concentration. We have developed a new maximum likelihood approach, in a program called BEEML, that directly takes into account the transcription factor concentration as well as non-specific contributions to the binding affinity, and we show in simulation studies that it gives a much more accurate model of the transcription factor binding sites than previous methods. We also develop a new method for extracting binding sites for a transcription factor from a random pool of DNA sequences, called high-throughput SELEX (HT-SELEX), and we show that after a single round of selection BEEML can obtain an accurate model of the transcription factor binding sites.
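The difference between selection in proportion to affinity and selection that depends on transcription factor concentration can be illustrated with a small numerical sketch. It assumes a standard two-state equilibrium occupancy of the form 1/(1 + exp(E - mu)), with energies in units of kT and a non-specific term added as a second Boltzmann weight. This functional form, the function names `boltzmann_weight` and `occupancy`, and the toy energies are illustrative assumptions, not the BEEML implementation.

```python
import math

def boltzmann_weight(energy):
    """Selection weight if sites were chosen in proportion to affinity, exp(-E)."""
    return math.exp(-energy)

def occupancy(energy, mu, e_ns=None):
    """Equilibrium probability that a site is bound (assumed two-state form).

    energy: sequence-specific binding energy in kT units (lower = tighter)
    mu:     chemical potential of the TF in kT units (rises with concentration)
    e_ns:   optional non-specific binding energy shared by all sequences
    """
    weight = math.exp(-energy)
    if e_ns is not None:
        # Non-specific binding adds a sequence-independent Boltzmann weight.
        weight += math.exp(-e_ns)
    return 1.0 / (1.0 + math.exp(-mu) / weight)

# Hypothetical energies for a strong, intermediate and weak site.
toy_energies = [0.0, 2.0, 6.0]

for mu in (-10.0, 3.0):
    probs = [occupancy(e, mu) for e in toy_energies]
    # Ratio of strong-site to intermediate-site occupancy at this mu.
    print(f"mu = {mu:+.1f}: occupancy ratio strong/intermediate = {probs[0] / probs[1]:.3f}")

# Ratio expected if selection were simply proportional to affinity.
print(f"Boltzmann ratio = {boltzmann_weight(0.0) / boltzmann_weight(2.0):.3f}")
```

Under these assumptions, the occupancy ratio approaches the Boltzmann ratio only when mu is far below the site energies (low TF concentration). At higher mu the strong sites saturate, so the two selection models diverge.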

Sequence-specific DNA binding proteins, including many transcription factors (TFs), are a critical component of transcriptional regulatory networks. Knowing their quantitative specificity, both the preferred binding sites and the relative binding affinity to different sites, can facilitate the understanding of gene expression patterns and how they are affected by altered cell states and variations in the genome sequences. A variety of methods are used to estimate the quantitative specificity of DNA-binding proteins, some of them direct experimental measurements of individual sequences or a few sequences at a time

The bimolecular interaction between a DNA binding protein, TF, and a particular DNA binding sequence, _{i}_{on}_{off}

The equilibrium binding constant of the TF to the site _{i}_{i}_{i}

The specific binding component,

Equation (1) is derived by considering a simple experiment where only a single sequence,

The overall probability of the sequence being bound (

The experiment we model is a binding reaction with a pool of TF molecules and a large pool of different sequences, _{i}

As pointed out by Djordjevic

Given a large enough sample of binding sites this experimental procedure could provide good estimates of the binding free energy for each sequence in the initial pool. However, for typical lengths

Equation (1) was used by Djordjevic

This completes the description of the model. By substituting equation (3) into equation (2), and that into equation (6), we obtain the relationship between the statistics of observed binding sites,

Given a collection of N bound sequences, we model the relationship between

Maximizing the likelihood function (7) with respect to

This is a non-linear parameter estimation problem and we minimize

A practical issue is the calculation of the denominator of equation (6), the partition function. For longer values of ^{L}

We use the half-site of the Mnt protein to test the method. Mnt is a repressor from phage P22 for which the binding affinities to all single-base variants of the preferred binding sequence have been measured experimentally

(A) Prior distribution of binding energy for Mnt half-site

We also used BEEML to analyze the binding data for the human transcription factor MaxA. Binding affinities to all possible 4-long half-sites, in the context of the preferred GTG for the other half-site, were determined experimentally by the MITOMI method

The zinc-finger protein Zif268, fused to Glutathione-S-Transferase (GST), was previously purified from an E. coli expression system for use in SELEX experiments

Top Panel (A–C): Effects of

Similar results are obtained for variations of ^{6}-fold ratio of non-specific binding affinity compared to the preferred binding site, ^{5}, ^{4},

r^{2} = 0.57. r^{2} = 0.84. r^{2} = 0.96, which is essentially within the measurement error.

(A) Fit of the point estimate of binding energy, as done in the Maerkl & Quake paper. (B) BEEML fit with PWM energy model and non-specific energy parameter. (C) BEEML fit with position-specific di-nucleotide energy model and non-specific energy parameter. (Note that in a previous analysis of this data

The sequencing of the initial library showed a small bias in the composition on the synthetic strand: A = 24.5%; C = 21.0%; G = 27.2%; T = 27.4%. We estimate the prior probabilities of sequences, ^{10} (>10^{6}) 10-mers. Since no significant higher-order biases were observed, we expect that the frequencies of all 10-mers in the initial library are well approximated by the mono-nucleotide composition. An initial BEEML model based on all of the selected binding sites was used to determine the most likely orientation of each site and whether it was entirely within the 10 bp randomized region or overlapped the fixed sequences. Sites that were determined to overlap the fixed regions were eliminated from further analysis, and the remaining sequences were reanalyzed by BEEML. As expected, because of the slight compositional bias and the G-rich consensus for Zif268 (r^{2} = 0.74 for BioProspector, r^{2} = 0.92 for BEEML). Not only are the non-specific and low-affinity sites, which are the majority after only a single round of selection, better predicted by BEEML, but the high-affinity, near-consensus sites are also predicted much more accurately and with very little scatter compared to the BioProspector predictions. BEEML also returns estimates of ^{5} binding sites, we do not expect any over-fitting, but to verify this we performed a 10-fold cross-validation in which we determined the parameters from a random sample of 90% of the sequences and measured the fit to the remaining 10%. Indeed, we find that r^{2} = 0.90 ± 0.05 on those samples.
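The mono-nucleotide approximation for the library prior described above can be sketched as follows. The composition values are those reported for the synthetic strand; the independence of positions follows the stated absence of higher-order biases. Renormalizing the percentages (they sum to 100.1 after rounding) and the names `base_freq` and `prior_probability` are assumptions made for this illustration, not part of the published analysis.

```python
from itertools import product

# Composition reported for the synthetic strand of the initial library (percent).
composition = {"A": 24.5, "C": 21.0, "G": 27.2, "T": 27.4}

# The rounded percentages sum to 100.1, so renormalize them to frequencies
# (an assumption made here so that the prior sums to one).
total = sum(composition.values())
base_freq = {base: value / total for base, value in composition.items()}

def prior_probability(seq):
    """Prior probability of a sequence, assuming independent positions."""
    p = 1.0
    for base in seq:
        p *= base_freq[base]
    return p

# Number of distinct 10-mers in the randomized region.
n_10mers = 4 ** 10

# Sanity check on a shorter length: priors over all 5-mers sum to one.
sum_5mers = sum(prior_probability("".join(s)) for s in product("ACGT", repeat=5))

uniform = 1.0 / n_10mers
for site in ("GGGGGGGGGG", "CCCCCCCCCC"):
    # Enrichment of this 10-mer in the library relative to a uniform prior.
    print(site, prior_probability(site) / uniform)
```

Even this slight compositional bias compounds over ten positions, so G-rich 10-mers start out noticeably more frequent in the library than C-rich ones. This is why the prior has to be accounted for when fitting the selected sites.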

Probabilistic models for binding site recognition, such as the fairly standard log-odds method, are popular because of their simplicity and intuitive appeal, and because they can be easily implemented in motif discovery algorithms

Djordjevic

Besides the introduction of BEEML, this paper also introduces a novel HT-SELEX procedure for accurate estimation of binding energies from in vitro selected sites. SELEX, and related methods, have been employed since 1990 to determine the specificity of DNA and RNA binding proteins, as well as for other purposes _{i})

Further developments of this approach are underway. While we show that a single round of HT-SELEX is sufficient to get reasonably accurate models of binding energy, we think that including data from additional rounds may provide even better models. Sequencing after each round means that we have good estimates for the prior sequence distributions at each round and since the energy model and non-specific energy will be the same, the only additional parameters to estimate are the chemical potentials at each round,

Other types of data should also be amenable to the BEEML approach. We demonstrated its application to affinity data from MITOMI experiments

A current source of

We thank all members of the Stormo lab for helpful discussions and advice about this work, especially prior work by Dana Homsi and Vineet Gupta that motivated this research.