Transcription factors (TFs) are proteins that bind to specific sites on the DNA and regulate gene activity. Identifying where TF molecules bind and how much time they spend on their target sites is key to understanding transcriptional regulation. It is usually assumed that the free energy of binding of a TF to the DNA (the affinity of the site) is highly correlated to the amount of time the TF remains bound (the occupancy of the site). However, knowing the binding energy is not sufficient to infer actual binding site occupancy. This mismatch between the occupancy predicted by the affinity and the observed occupancy may be caused by various factors, such as TF abundance, competition between TFs or the arrangement of the sites on the DNA. We investigated the relationship between the affinity of a TF for a set of binding sites and their occupancy. In particular, we considered the case of the transcription factor lac repressor (lacI) in E.coli, and performed stochastic simulations of the TF dynamics on the DNA for various combinations of lacI abundance and competing TFs that contribute to macromolecular crowding. We also investigated the relationship of site occupancy and the information content of position weight matrices (PWMs) used to represent binding sites. Our results showed that for medium and high affinity sites, TF competition does not play a significant role for genomic occupancy except in cases when the abundance of the TF is significantly increased, or when the PWM displays relatively low information content. Nevertheless, for medium and low affinity sites, an increase in TF abundance (for both cognate and non-cognate molecules) leads to an increase in occupancy at several sites.
Citation: Zabet NR, Foy R, Adryan B (2013) The Influence of Transcription Factor Competition on the Relationship between Occupancy and Affinity. PLoS ONE 8(9): e73714. https://doi.org/10.1371/journal.pone.0073714
Editor: Frances M. Sladek, University of California Riverside, United States of America
Received: May 16, 2013; Accepted: July 31, 2013; Published: September 27, 2013
Copyright: © 2013 Zabet et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the Medical Research Council [G1002110 to N.R.Z.] and the Royal Society (URF to B.A.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
A powerful key to understanding transcriptional regulation is the amount of time a regulatory binding site is occupied by a cognate transcription factor (TF). In particular, this ‘occupancy’ measure can be used to infer relative amounts of transcription of the target gene, and is therefore a more powerful comparative tool than simple sequence searches for ‘preferred binding sites’. Transcription factors have specific affinities for each site on the DNA (computed from the binding energy between the TF protein and the DNA molecule at the target site) and it is often naïvely assumed that this affinity is sufficient to predict the actual occupancy of TFs bound to the DNA . However, recent studies have demonstrated that affinity alone is not always sufficient to accurately predict TF occupancy .
Previous studies have shown that TF abundance can account for the correlation between the normalised affinity and normalised occupancy (“normalised” here refers to setting the maximum observed values to ) –, in the sense that increasing TF abundance increases the number of occupied sites and that those additional sites are of decreasing affinity. This result was explained by the fact that, once the high affinity sites get close to saturation, TF molecules will spend more time bound to lower affinity sites. However, in those studies the spatial organisation of sites on the DNA was disregarded. Such an assumption should be adequate for predicting occupancy in in vitro experiments such as SELEX or PBM , (where there are only short DNA sequences and one TF species), whilst in in vivo studies it could lead to biased predictions.
A popular approach to estimate occupancy is the statistical thermodynamics framework. This method computes the probability that, at equilibrium, one encounters a specific configuration of TF molecules on the DNA –. A number of studies consider a uniform affinity landscape for TFs or other DNA-binding proteins and focus on the occupancy of a single site (or a few sites) in the context of a genome with otherwise constant affinity –. However, TFs display a distribution of affinities to the DNA ,  and, thus, the assumption of a uniform landscape becomes restrictive (and can lead to biases in the results). Wasson and Hartemink  considered non-uniform affinity landscapes and investigated the relationship between the abundance of DNA-binding proteins and their occupancy using a statistical thermodynamics model. Their results confirmed that, when increasing TF abundance, low affinity sites display higher occupancy than that which would be predicted by affinity alone. Furthermore, the addition of other DNA-binding proteins (histones in their case) leads to an overall reduction in occupancy of the TFs of interest. Similarly, Kaplan et al.  applied a combination of a hidden Markov model and a thermodynamic framework and discovered that TF competition does not influence the observed occupancy significantly (at least in the case of their system). Nevertheless, they considered only the competition between various TF species and did not alter the abundance of their TFs of interest (they used the actual TF abundance that was experimentally measured).
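As a minimal sketch of how abundance alone can break the link between normalised affinity and normalised occupancy, consider the simplest statistical thermodynamics treatment, in which every site is an independent two-state system. This is not the model used by any of the studies cited above: the function names, the binding energies (in units of k_B T) and the `abundance` scale are illustrative assumptions, and the sketch ignores site overlap, competition and the position of sites on the DNA.

```python
import math

def site_occupancy(energies, abundance):
    """Equilibrium occupancy of independent two-state sites (toy model).

    energies:  binding energies in k_B*T (more negative = stronger binding).
    abundance: effective free TF concentration in arbitrary units
               (hypothetical scale, absorbs the reference concentration).
    """
    weights = [abundance * math.exp(-e) for e in energies]
    return [w / (1.0 + w) for w in weights]

def normalise(values):
    """Rescale so that the maximum value equals 1."""
    top = max(values)
    return [v / top for v in values]

# Hypothetical energies for a strong, a medium and a weak site.
energies = [-8.0, -5.0, -2.0]

# Affinity-only prediction: relative Boltzmann weights of the sites.
affinity_only = normalise([math.exp(-e) for e in energies])

# Normalised occupancy far from saturation and close to saturation.
low = normalise(site_occupancy(energies, abundance=1e-6))
high = normalise(site_occupancy(energies, abundance=1.0))
```

Far from saturation, `low` tracks `affinity_only` closely; once the strong site saturates, the weaker sites in `high` catch up and the normalised occupancies are no longer proportional to affinity.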
The main assumption of the statistical thermodynamic framework is that the system reaches equilibrium and the transient time (the time to reach equilibrium) is negligible . Nevertheless, there is still no proof that, in the case of the TF search process, equilibrium exists or is reached fast enough to not affect the average behaviour. We use a stochastic simulation of the process by which a TF ‘searches’ for its regulatory binding site by first binding non-specifically to the DNA and then performing a one-dimensional random walk before eventually unbinding. This combination of binding/unbinding to/from the DNA and one-dimensional random walk is known as a facilitated diffusion mechanism  and it is evident that such a process is taking place inside the cell , . The physical advantage of facilitated diffusion over a purely three-dimensional diffusion or a purely one-dimensional random walk is a more rapid target site location (see  for review). Simulating facilitated diffusion can overcome some of the limitations of the statistical thermodynamics model by allowing ‘exact’ in silico measurement of the average occupancy of TF binding sites under various parametrisations of the cellular state (e.g. concentrations of DNA binding proteins), some of which will give rise to deviations from the predictions offered by the statistical thermodynamics model. For example, Chu et al., , demonstrate such deviations when they model TFs as having non-uniform affinity landscapes.
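The search process described above can be illustrated with a deliberately simplified toy: a single TF lands at a random position (standing in for three-dimensional diffusion), slides left or right one base pair at a time, and unbinds with a fixed probability per step. The function name and all parameter values are hypothetical, and the sketch leaves out everything that makes the real problem interesting (sequence-dependent rates, hopping, other bound molecules).

```python
import random

def facilitated_diffusion(dna_length, n_bindings, p_unbind, seed=0):
    """Count visits per position for one TF alternating 3D and 1D search.

    dna_length: number of positions on the toy DNA.
    n_bindings: how many times the TF rebinds from solution.
    p_unbind:   probability of unbinding at each step of the 1D walk.
    """
    rng = random.Random(seed)
    visits = [0] * dna_length
    for _ in range(n_bindings):
        # 3D diffusion is modelled implicitly: land anywhere on the DNA.
        pos = rng.randrange(dna_length)
        while True:
            visits[pos] += 1
            if rng.random() < p_unbind:
                break  # unbind and return to 3D diffusion
            step = rng.choice((-1, 1))
            # Clamp at the ends so the molecule stays on the DNA.
            pos = min(max(pos + step, 0), dna_length - 1)
    return visits

visits = facilitated_diffusion(dna_length=100, n_bindings=50, p_unbind=0.1)
```

With a uniform landscape such as this one, the visit counts carry no sequence information; a sequence-dependent move or unbinding rate is what would turn visit counts into an occupancy profile.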
Here, we used a stochastic simulator that models the facilitated diffusion mechanism and studied the properties of a complete continuous DNA sequence (from the genome of E.coli K-12 ) being bound by both a cognate TF species (lacI in our case) and a non-cognate TF species (aimed to model the presence of other proteins on the DNA which contribute to crowding on the DNA) , . This scenario mimics the behaviour of TF molecules in a live cell performing facilitated diffusion in the search for their target sites. The TF molecules will not only compete with other molecules bound to the DNA for sites, but during the one-dimensional random walk on the DNA, they will slide or hop to nearby sites  and also bypass other bound molecules ,  which act as obstacles and create boundary effects .
Our results confirm that the addition of non-cognate TFs reduces the absolute occupancy of cognate TF binding sites, while their relative occupancy is influenced at relatively few (in the order of tens) low and medium affinity sites, and is unaffected at high affinity sites. That is, for low affinity (“non-specific”) and medium affinity sites, the addition of non-cognate TFs leads to significant differences between the predicted relative occupancy based on affinity (which we call affinity derived occupancy, or ADO) and the relative occupancy measured by stochastic simulation (which we call simulation derived occupancy, or SDO) at several sites, whilst for high affinity sites this relative binding pattern is unaffected. While the mismatch associated with low affinity sites should have little or no influence on gene regulation (unless the cognate TF molecules change conformation when bound to a functional high affinity site ), this may provide an explanation for the noise structure in actual genomic profiles of TF occupancy (e.g. ChIP data).
We further found that differences between ADO and SDO at medium and high affinity sites can arise if the cognate TF abundance is significantly increased or if the information content of the PWM is low. However, for normal bacterial TF abundances (usually in the range of copies ), PWM information content ,  and DNA sizes (e.g., ), the differences between the SDO and ADO are negligible and binding energies are good indicators of occupancy. By contrast, in the case of eukaryotic systems, their high TF abundances ( copies ), their lower information content motifs , and the amount of accessible DNA suggest that significant differences between ADO and SDO are likely to occur. Even so, this increase in occupancy generated by the high abundance of cognate TFs can be reduced, to a certain degree, by a high abundance of non-cognate TF molecules in the system.
In , we found that, under certain conditions, the occupancy in the simulations cannot always be predicted based on the affinity. To systematically assess the source of the mismatch between affinity derived occupancy (ADO) and simulation derived occupancy (SDO), we considered the case of a bacterial TF (lacI) with biologically plausible parameters and investigated the relationship between affinity and occupancy. Figure 1 contains scatter plots of the SDO vs. ADO at individual sites (at resolution) for various crowding levels on the DNA, and various lacI abundances. To eliminate weak sites which will not facilitate the formation of a strong complex with lacI, we recorded only sites with high affinity . We chose this threshold to select the top of sites based on the distribution of binding energies, but the value of the threshold can be selected to match any distribution of binding energies.
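Choosing a cut-off that keeps a fixed top fraction of sites can be sketched as follows. The function name, the `top_fraction` argument and the example scores are hypothetical, and the sketch assumes a convention in which a higher score means a stronger site; with raw binding energies the ordering may need to be reversed.

```python
def affinity_threshold(scores, top_fraction):
    """Return the score cut-off that keeps roughly the top fraction of sites.

    scores:       one affinity score per site (higher = stronger; assumed).
    top_fraction: fraction of sites to keep, in (0, 1].
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    keep = max(1, round(top_fraction * len(ranked)))
    return ranked[keep - 1]

# Illustrative scores only; keep the strongest 10% of 100 sites.
example_scores = list(range(100))
cutoff = affinity_threshold(example_scores, top_fraction=0.1)
selected = [s for s in example_scores if s >= cutoff]
```

Because the cut-off is derived from the ranked scores rather than fixed in advance, the same call adapts to any distribution of binding energies.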
We considered the case of the lac repressor TF and of DNA, which contains the site. Each system was simulated for (which is the average cell cycle time of E.coli , ) and, for each set of parameters, we considered independent simulations. We considered only the sites that have the binding energy at least of the highest value (the strongest sites). () Five different lacI copy numbers: () , () , () , () and () . We assumed the case of copies of non-cognate TFs, which leads to of the DNA being covered. () Five different non-cognate copy numbers: () , () , () , () and () , and copies of lacI.
Figure 1 () shows that for lacI molecule, there is an excellent agreement between ADO and SDO even in the case of crowding on the DNA. The mean ratio of SDO to ADO for lacI molecule with crowding is , within a confidence interval . This suggests that, even in the case of leaky gene expression ( or a few TF molecules), the TF is able to regulate a gene within a cell cycle and the percentage of time the site is occupied is not affected by crowding.
Usually, bacterial TFs number between and copies per cell . In this case, as well as in the case of lacI molecule, the addition of non-cognate TFs does not appear to introduce a significant difference between ADO and SDO.
Finally, a few bacterial TFs are known to exist in high copy numbers (e.g. the copy number of CRP is ) and Figure 1 () confirms that, in the case of highly abundant bacterial TFs, the ADO diverges from the SDO. In particular, we observed a two-fold increase in SDO, compared to ADO; see Table 1. This indicates that certain sites (for example , the second strongest site of lacI) will display a higher degree of occupancy than that predicted by affinity.
Next, we considered the effect of increased crowding of the DNA by non-cognates on the relationship between ADO and SDO. Figure 1 () shows that increasing the crowding level has a negligible effect on this relationship and that ADO is a good approximator of SDO at all levels of non-cognate crowding when 10 lacI molecules are modelled; see also Table 2.
Altogether, non-cognate binding proteins do not affect the occupancy of medium and high affinity sites, in the sense that the SDO of medium and high affinity sites is accurately approximated by the ADO. However, by significantly increasing the abundance of cognate TFs, ADO ceases to be a good approximator of the SDO of medium and high affinity sites. Thus, only cognate abundance influences the occupancy of medium and high affinity sites, while non-cognate TFs have only limited effect.
The results shown in Figure 1 use normalised measures of occupancy (ADO and SDO), which are the relative values with respect to the highest rate of occupancy at the strongest site. When analysing the absolute values for occupancy, Wasson and Hartemink  observed that the addition of non-specific DNA binding proteins (nucleosomes in their studies) will reduce the absolute occupancy of cognate TFs. Figure S4 shows that the absolute value of the SDO increases when the lacI abundance is increased and slightly decreases when the non-cognate abundance is increased, supporting the results from .
Figure 1 considers only sites with an affinity above a specific threshold. Besides providing more clarity, the rationale for this restriction was twofold: First, there is no clear evidence for the biological relevance of extreme low affinity sites, and second, we are only interested in amounts of occupancy that would be detectable in a biochemical assay (i.e. extreme low affinity binding events are likely not detectable), as the theoretical explanation of observed binding profiles is one of the goals of our research.
Figure 2 shows heatmaps representing the number of sites where the ratio between SDO and ADO is higher than a factor SDO/ADO . For example, when , the graph considers the sites where occupancy predicted from affinity underestimates the occupancy observed in the simulations. Interestingly, we did not find any sites where the SDO is lower than the ADO (which we call ‘false negative’ sites), under the various combinations of lacI abundances and crowding levels on the DNA (data not shown).
In this heatmap, we did not consider any affinity cut-off and plotted the number of sites where the ratio between SDO and ADO exceeds for a range of values of . There are four cases: () lacI molecule, () lacI molecules, () lacI molecules and () lacI molecules.
However, we found sites where SDO ADO and we call these sites ‘false positives’. For lacI abundances within [1,100] copies - Figures 2(A-C) - there are tens of sites where the SDO is higher by at least compared to the ADO (). These sites appear only for high levels of crowding (at least ) and their number is increased by increasing the crowding. This means that by increasing the crowding on the DNA, the number of sites where SDO is higher than ADO also increases. We also investigated if there is a particular affinity of the sites where the SDO exceeds ADO and found that these sites are usually distributed amongst the medium and non-specific sites; see Figure S6.
When we looked for larger differences between SDO and ADO we saw that by increasing we observed fewer false positive sites. In particular, for copies of lacI, there is no site where the occupancy in the simulations is higher by (i.e. ) than the value predicted by the affinity. This supports the conclusion from the previous section that the occupancy we observed in the simulations does not significantly deviate from that predicted based on the affinity.
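The counting behind this kind of analysis is simple enough to sketch. Given normalised ADO and SDO values per site, the function below counts, for each ratio factor, how many sites have an SDO/ADO ratio above it. The function name, the variable `factors` and the example values are hypothetical and are not taken from the simulations reported here.

```python
def count_exceeding(sdo, ado, factors):
    """For each factor, count the sites whose SDO/ADO ratio exceeds it.

    sdo, ado: normalised occupancies per site, in the same site order.
    factors:  iterable of ratio thresholds to test.
    Sites with ADO equal to zero are skipped to avoid division by zero.
    """
    ratios = [s / a for s, a in zip(sdo, ado) if a > 0]
    return {f: sum(r > f for r in ratios) for f in factors}

# Illustrative values only: four sites, from strongest to weakest.
example_ado = [1.0, 0.4, 0.1, 0.01]
example_sdo = [1.0, 0.5, 0.3, 0.02]
counts = count_exceeding(example_sdo, example_ado, factors=[1.1, 1.5, 2.5])
```

Raising the factor can only keep or reduce the count, which is why fewer false positive sites are seen at stricter thresholds.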
In the case of 1000 copies of lacI, the results differ. Specifically, there appear to be two regimes, namely: () for and () for . In the first of these (), increasing the number of non-cognate molecules reduces the number of sites where the SDO/ADO . In other words, in this regime, increased crowding on the DNA has the opposite effect to that seen for lower lacI copy numbers (see above): it reduces the number of false positive sites. In the case of copies of lacI, the mean SDO/ADO ratio is (whilst when lacI abundance 100 copies it is approximately ) and by adding non-cognates the number of bound cognate molecules at sites whose SDO/ADO is reduced (see Figure S6). As a result, the mean SDO/ADO ratio is reduced, which explains why the number of false positive sites decreases. In the latter case (), we observe a similar effect to that for lower abundances of lacI, namely that increasing the crowding on the DNA increases the number of bound cognate molecules at sites where SDO/ADO.
Considerations on eukaryotic cells
Eukaryotes typically have TF copies per cell , with some abundances being as high as copies per cell . This higher abundance of TFs compared to prokaryotes appears to reflect that eukaryotic genomes are much longer, giving much greater space in which TFs can bind . However, at any one time large parts of the eukaryotic genome are packed into dense chromatin, and are thus inaccessible to TF binding. For example, in the D. melanogaster embryo, on average only of the euchromatic genome of is accessible during each early developmental stage . This means that, in such eukaryotic cells, we have accessible DNA that is similar in length to that considered in this study (the E.coli genome is approximately ), but with TFs in much greater abundance. This raises the question of whether the relationship between occupancy and affinity that we observe when simulating the prokaryotic case (lacI around the site) is still true in the context of eukaryotic systems with TFs that have copies or more.
It is clear from Figure 1 that increasing the abundance of cognate TFs up to , increases the number of medium affinity sites that display significantly higher occupancy; see also Table 1. This observation remains true for different levels of crowding on the DNA as introduced by the presence of non-cognate TFs (no crowding, low crowding and medium crowding (data not shown)). Furthermore, at such high levels of cognate abundance almost all sites display a much higher occupancy than that predicted from their affinity. For example, the occupancy of the second strongest site of lacI () becomes approximately equal to that of the strongest one (), although there is a large difference in affinity between the two sites. This observation suggests that high TF abundance makes strong and weak sites less distinguishable, which would hinder a quantitative readout for the regulation of gene expression in the cell.
Above, we considered occupancy and affinity at single nucleotide resolution. Figure 3 shows a theoretical TF binding profile over a locus of the E.coli genome as calculated using GRiP, demonstrating the progressive effect on occupancy of increasing TF abundance. (The theoretical profiles are generated using a method described by Kaplan et al.  for modelling ChIP-seq profiles; see File S1). Each chart plots the ADO and SDO, and shows that for low copy numbers ( copies per cell), the profile of the ADO (filled region) matches the profile of SDO (solid line) with high accuracy for the cases of no crowding on the DNA ( non-cognate molecules) and medium crowding on the DNA ( non-cognate molecules). This would imply that, in bacterial cells (i.e. when TF abundance is relatively low), the binding of TFs to their target sites is not affected by competition with other molecules, and occupancy is predominantly a factor of, and is accurately modeled by, affinity. However, when TFs are highly abundant ( copies per cell), as is common in eukaryotic systems, the level of affinity is not the sole determinant of occupancy on the DNA. In other words, the amount of time spent bound is determined not just by the encoded information in the DNA (nucleotide composition of binding sites) and DNA accessibility, but by the abundance of TFs in the system (mainly cognate TF abundance, but small effects from non-cognates were observed).
We considered the case of the lac repressor TF and of DNA, which contains the site. In each chart the solid grey line is the SDO at one of four levels of lacI abundance, and the filled green region is the ADO. The SDO shown is calculated with 0 non-cognate molecules; calculations for 10% and 26% non-cognate abundance show no visible deviation from the 0 non-cognate case (hence not shown). The SDO was calculated at four lacI abundances: () , () , () and () molecules. Each system was simulated for and for each set of parameters we consider independent simulations. We considered only the sites that have the binding energy at least of the highest value (the strongest sites). We converted the single nucleotide resolution into expected ChIP-seq profiles as proposed in ; see File S1.
Finally, bacterial TFs have PWMs with higher information content compared to the eukaryotic TFs , , (e.g., for lacI, . Average information content: bacteria, ; yeast, ; multicellular eukaryotes, ). To investigate the influence of information content on the number of highly occupied sites observed in the simulations, we removed positions from the end of the lacI motif and performed the simulations at various abundances of lacI on naked DNA (i.e. no non-cognate TF molecules). In total, we considered six cases, which resulted in the information content of the reduced lacI motif being: () , () , () , () , () and () ; see Figure S7 and Figure S8. Figure 4 shows that, by selecting an arbitrary threshold (certain percent of the highest value of SDO), the number of sites with SDO higher than the threshold increases both as the abundance of lacI increases (compare the values on each row in Figure 4), and as the information content of the motif decreases (compare the values on each column in Figure 4). Note that the former (the dependence of the SDO on the TF abundance) was already shown in Figure 1 and Figure 3. Hence, in eukaryotic systems, we can expect a two fold increase in the number of sites with high SDO from both the greater TF abundance  and from the likely lower information content of the average eukaryotic PWM .
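For readers unfamiliar with the measure, the information content of a PWM and the effect of trimming columns from its end can be sketched as below. The sketch uses the common definition with a uniform background (2 bits minus the Shannon entropy per column), which is an assumption here and may differ from the exact definition used for the values quoted above; the example matrix is invented and is not the lacI motif.

```python
import math

def information_content(pwm):
    """Total information content in bits of a PWM.

    pwm: list of columns, each a dict mapping base -> probability.
    Assumes a uniform background over the four bases.
    """
    total = 0.0
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy
    return total

def truncate_right(pwm, n_removed):
    """Drop the n_removed rightmost columns of the PWM."""
    return pwm[:len(pwm) - n_removed]

# Invented three-column motif: conserved, two-fold degenerate, conserved.
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
full_ic = information_content(toy_pwm)
reduced_ic = information_content(truncate_right(toy_pwm, 1))
```

Removing a conserved column lowers the total by that column's contribution, so the size of each reduction depends on which positions happen to sit at the end of the motif.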
This heatmap represents the number of sites that display an occupancy in the simulation that is higher than the following thresholds: () , () and () . There were no non-cognate TFs in these cases and occupancy was calculated at abundances of lacI . Information content of the lacI motif was reduced by successively removing the rightmost column of the PWM (see Figure S7 and Figure S8). In general the number of high occupancy sites is increased by both increased lacI abundance (compare the values on each row) and reduced information content (compare the values on each column). In () at the highest lacI abundance, there are several cases where the number of highly occupied sites decreases with reducing the information content (from 16 to 8) contrary to the pattern at other abundances and/or thresholds. This can be explained by the fact that, in order to reduce the information content, we removed certain base pairs from the lacI motif, which can introduce biases in the affinity landscape. These biases can lead to small deviations from the expected results, particularly in the cases where there are few sites and the TF has high abundance. For example, in the case of the copies of lacI with the full motif, there are sites that display an occupancy of , while, in the case of copies of lacI with information content , those sites will display an occupancy of .
Note that by removing certain positions from the end of the lacI motif, we reduced the information content in a biased way and this can lead to small variations in the occupancy, particularly, in the case when there are a few sites that display high occupancy. Nevertheless, this approach to change the information content does not influence the general result: that TFs with lower information content motifs display a more dramatic change in the number of highly occupied sites compared to TFs with higher information content motifs.
Transcription factors perform a combination of three-dimensional diffusion and one-dimensional random walk on the DNA when they search for their target sites. Inherently, this mechanism leads to the binding of TFs not only to their target sites, but also to other, lower affinity sites on the DNA. In this context, it becomes important to understand the relationship between affinity (how strongly a TF binds to a site on the DNA) and occupancy (the residence time of a TF on a site).
Often it is assumed that the relative occupancy of a TF measured experimentally (say, in a ChIP assay) is indicative of the relative affinity, and many studies infer a TF's affinity by de novo motif analysis based on the most highly occupied sites (those showing the strongest ChIP enrichment). This assumption is flawed when there is divergence between occupancy and affinity for these highly occupied sites. Although this approximation proved to have good accuracy in the inference of position weight matrices in many cases (e.g. ), there are also examples where the method seems to fail (e.g. ). These cases refer to situations where false positive prediction (sites that have low affinity but display high occupancy) or false negative prediction (sites that have high affinity but display low occupancy) could have influenced the success of the study.
Our results indicate that by adding non-cognate TFs, the absolute occupancy of binding sites by cognate TF molecules is reduced (see File S1). The reduction in the absolute value of the occupancy is a consequence of the competition of TFs for the limited amount of DNA. Wasson and Hartemink  observed the same effect, although they used a different approach (a statistical thermodynamics model) to estimate the occupancy. However, in their study, they did not look at the occupancy relative to the highest value (the quantitative readout of binding events).
We found that the abundance of non-cognate TFs has a limited effect on the normalised occupancy of low, medium and high affinity sites; see Figure 1 () and Figure 2. Nevertheless, there are several sites (in the order of tens), where the addition of non-cognate TFs leads to significant deviations of the observed occupancy derived from simulation (SDO) from that derived from affinity (ADO). This result is supported by recent experimental evidence, where the authors showed that lac repressor occupancy increases at lower sites (far away from the site), when the crowding in the cell increases (and, thus, the crowding on the DNA increases as well) .
Bacterial TFs are expressed at low copy numbers (between and )  and they have only a few strong sites that are highly specific , . This suggests that, in the case of bacterial gene regulation, affinity controls the relative occupancy of the specific sites (acting as a local fine tuning mechanism), while the crowding level on the DNA controls the global occupancy of the sites (acting as a global regulator).
We also investigated under which conditions the normalised occupancy of the medium and high affinity sites is affected. Our results confirmed that for TFs with copies per cell and approximately of available DNA, the occupancy is higher than that predicted by affinity, irrespective of the abundance of non-cognate TFs. Eukaryotic systems have TFs with high abundance (on average copies per cell)  and although they have much larger genomes, only a small proportion of this is accessible to TFs (e.g., in early developmental stages of D. melanogaster) . This suggests that the rate of false positive binding events (higher occupancy than predicted by affinity) is significant in eukaryotic cells; see Figure 3. Note that our model is applicable only to TFs residing in the nucleoplasm and, thus, when we mention TF abundance in eukaryotic systems we refer to nuclear abundance of TFs .
The dependence of genomic occupancy of TFs on TF abundance is qualitatively similar to the results presented in previous analytical studies, which showed that, by increasing the abundance of TFs, high affinity sites reach saturation and, consequently, lower affinity sites will display a higher occupancy –. This means that the spatial organisation of sites on the DNA has only a limited effect on the genomic occupancy of TFs. Nevertheless, the quantitative differences between the SDO and these analytical solutions need systematic investigation and will be left for further research.
Kaplan et al.  investigated the relationship between experimentally measured occupancy (from ChIP-seq experiments) and that predicted using a hidden Markov model, and found that the highest correlation between the two was on average . To achieve this correlation they assumed real TF abundances that were previously measured in D. melanogaster nuclei , but they did not adapt the abundances of TFs to the size of the analysed DNA segment. In , we showed that, when the number of bound TF molecules is not changed in such a subsystem (a simulated entity smaller than the genome), the correlation coefficient between the occupancy of the full system, and the occupancy of the subsystems, can be as low as . This result is also shown in Figures 1 and 3, which confirm that an increase in cognate TF copy number can lead to a reduction in the correlation between occupancy and affinity landscape. Thus, one method to increase the correlation between the predicted and observed occupancy consists of adapting the abundance levels of the TFs with one of the methods presented in .
In addition, this higher number of highly occupied sites is also influenced by the information content of the motif. In Figure 4, we showed that, by reducing the information content, the number of sites with high SDO increases, but also that the effect of the increase in TF abundance on the highly occupied sites is more dramatic. In other words, by increasing the abundance of a TF with a PWM with lower information content, we observed a larger increase in the number of highly occupied sites compared to the case of a TF with a PWM with higher information content; compare different rows in Figure 4. This suggests that, in the case of eukaryotic systems (which have TFs with lower information content PWMs  and higher abundances ), the effect of TF abundance on the number of ‘false positive’ sites is more severe than in the case of bacterial cells.
Our approach to reducing the information content (by removing positions from the end of the lacI motif) is prone to introducing biases in the results, in particular, at high abundance of the TF and low number of highly occupied sites; see Figure 4 (). A different approach to reducing the information content could be to add non-specific sites uniformly when constructing the PWM, but we anticipate this would lead to similar results, namely: in the case of lower information content motifs, a change in the abundance of TF has more drastic effects on the number of highly occupied sites, compared to the case of higher information content motifs. Nevertheless, the details of such an alternative approach are left for further research, as they are beyond the scope of this manuscript.
Finally, we found that the increase in occupancy caused by the addition of cognate molecules can be reduced by adding non-cognate molecules. Figure 2 () shows that, in the case of empty DNA, most of the sites display an occupancy in the simulations that is higher by at least than that predicted from affinity, whereas, in the case of high crowding on the DNA, only several hundred sites display such a difference between SDO and ADO. However, this difference is still large, on the order of .
Materials and Methods
We use GRiP  to simulate facilitated diffusion of DNA-binding proteins around the DNA, which allows parametrisation with affinity data and measures site occupancy. Briefly, GRiP performs event-driven stochastic simulations ,  of all molecules in the cell, which are explicitly represented. Molecules perform both a three-dimensional diffusion in the cytoplasm (nucleoplasm in the case of eukaryotic cells) and a one-dimensional random walk on the DNA. The three-dimensional diffusion is modelled implicitly by simulating the Chemical Master Equation. This approach was shown to display negligible error if fast rebinding to the DNA is also modelled , and, in GRiP, fast rebinding is modelled through a hopping mechanism of TFs on the DNA. In addition, the model implements steric hindrance, in the sense that any base pair cannot be covered by two TFs simultaneously . The complete set of parameters for the model was previously presented in  and can be found in Table S1 in File S1.
The canonical lacI motif as generated from the three known high affinity sites .
In addition to lacI, the system explicitly represents non-cognate molecules in order to model macromolecular crowding. Each non-cognate molecule covers of DNA and is allowed to perform the facilitated diffusion mechanism in a similar way to cognate molecules . We consider five levels of crowding, namely: () (), () ( and ), () ( and ), () ( and ) and () ( and ). Note that, with the exception of the first case (no crowding on the DNA), all cases display crowding which is within biologically plausible values ( to ).
Before proceeding to investigate the relationship between affinity derived occupancy (ADO) and simulation derived occupancy (SDO), we first need to describe the methods used to estimate these parameters. ADO is computed using the average time a TF molecule spends bound at a certain position on the DNA as derived from an approximation of the binding energy (which is itself calculated from PWM score); see equation (3) in . Briefly, the affinity derived occupancy of a TF bound at the $j$th nucleotide on the DNA is given by

$$\tau_j = \tau_0 \exp\left(-\frac{E_j}{k_B T}\right) \qquad (1)$$

where $\tau_j$ is the average waiting time when bound at site $j$, $E_j$ is the binding energy at position $j$ (which is equal to , where is the lacI PWM score at the $j$th nucleotide), $k_B$ is the Boltzmann constant and $T$ the temperature. In , we computed .
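As an illustration of how equation (1) turns PWM scores into a relative occupancy profile, the following sketch assumes a Boltzmann-type relation in which the waiting time grows exponentially with the PWM score. The sign convention (binding energy taken as the negative of the score, in units of the thermal energy), the prefactor `tau0` and the example scores are assumptions made for this sketch only; they are not the values used in the simulations.

```python
import numpy as np

def affinity_derived_occupancy(pwm_scores, tau0=1.0, kT=1.0):
    """Relative waiting time per DNA position from PWM scores.

    Assumption: binding energy E_j = -w_j, so higher PWM scores give
    longer waiting times. tau0 and kT are placeholder values.
    """
    w = np.asarray(pwm_scores, dtype=float)
    energy = -w                          # assumed sign convention
    tau = tau0 * np.exp(-energy / kT)    # Boltzmann-type waiting time
    return tau / tau.max()               # normalise to the strongest site

# Hypothetical PWM scores along a short stretch of DNA.
scores = [0.0, 1.0, 2.0, 0.5]
print(affinity_derived_occupancy(scores))
```

Normalising to the strongest site removes the dependence on the prefactor, so only differences between PWM scores matter for the relative profile.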
While ADO is computed directly from the PWM (prior to the simulations), the SDO (simulation derived occupancy) is based on the results of our stochastic simulations. There are several ways in which the SDO can be estimated and in the following section we compare these approaches to justify our choice.
Measuring the occupancy
There are three methods to estimate the observed occupancy, namely:
- Ensemble average - Perform a set of stochastic simulations with identical parameters, each running for a time interval (chosen as adequate to reach a stationary behaviour) and record the position of each molecule at the end of the simulation. Using these sets of positions, measure the occupancy by computing the average amount of time the TF spends at each position . [Note: this is effectively the result obtained from a ChIP experiment: the mean behaviour within an ensemble of cells.]
- Time average - Observe a single system for a much longer time interval and compute the occupancy as the average amount of time the TF spends at each position . The time average can take less time to compute and, consequently, is an appealing method to estimate occupancy. In live cells, the activity state of a gene is related to the proportion of time the regulatory region is occupied and, thus, the time average may be a better indicator for biological relevance than ensemble average . Nevertheless, if one wants to replicate the result of ChIP experiments, then the ensemble average is more appropriate.
- Hybrid average - Perform a set of stochastic simulations for a long time interval . For each simulation calculate the time average occupancy and then perform an ensemble average over all time averages. At the population level, there is an ensemble average over the behaviour of all cells, thus the hybrid average is a good indicator of the occupancy when investigating gene regulation at population level.
The ergodic hypothesis states that the time average over a sufficiently long time interval equals the ensemble average. However, this assumption breaks down in certain cases (e.g. the time average differs from the ensemble average in multi-stable systems ). Thus, we need to investigate under what conditions the ergodicity assumption breaks down within our system.
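The three estimators listed above can be sketched on a toy system. The example below is not GRiP: it assumes a single TF hopping between a handful of sites under a simple Metropolis rule (a stand-in chosen only because its stationary distribution is known), and all function names and parameter values are hypothetical. Trajectories are stored as boolean arrays of shape (replicates, time steps, sites).

```python
import numpy as np

def simulate_trajectory(weights, n_steps, rng):
    """Toy single-TF trajectory; stationary occupancy is proportional to weights."""
    weights = np.asarray(weights, dtype=float)
    n_sites = weights.size
    pos = int(rng.integers(n_sites))
    occupied = np.zeros((n_steps, n_sites), dtype=bool)
    for t in range(n_steps):
        proposal = int(rng.integers(n_sites))
        # Metropolis acceptance with a symmetric (uniform) proposal.
        if rng.random() < min(1.0, weights[proposal] / weights[pos]):
            pos = proposal
        occupied[t, pos] = True
    return occupied

def ensemble_average(trajectories):
    """Fraction of replicates in which each site is occupied at the final step."""
    return trajectories[:, -1, :].mean(axis=0)

def time_average(trajectory):
    """Fraction of time each site is occupied within one replicate."""
    return trajectory.mean(axis=0)

def hybrid_average(trajectories):
    """Time average per replicate, then averaged over replicates."""
    return trajectories.mean(axis=1).mean(axis=0)

rng = np.random.default_rng(0)
weights = [1.0, 2.0, 4.0]   # hypothetical relative affinities
runs = np.stack([simulate_trajectory(weights, 2000, rng) for _ in range(20)])

print("ensemble:", ensemble_average(runs))
print("time    :", time_average(runs[0]))
print("hybrid  :", hybrid_average(runs))
```

In this toy system the three estimates agree up to sampling noise because the chain is ergodic; the ensemble estimate is the noisiest, since each replicate contributes a single snapshot.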
Figure 6 () confirms that the time average, hybrid average and ensemble average measures for SDO produce similar results. In this case, the system consists of a DNA molecule and one lacI TF and zero non-cognates. In addition, one can observe that all measures for SDO display negligible differences from ADO.
We considered of DNA, which contains the site (the strongest known binding site for lacI, which is located at position on the E.coli K-12 genome) and: () lac repressor molecule and non-cognate molecules, () lac repressor molecules and non-cognate molecules and () lac repressor molecule and non-cognate molecules. We plotted the sites that have a binding energy at least of the highest value ( strongest sites). () The ensemble average is computed from independent simulations [blue circles]; the time average is computed by running the simulations for [red crosses]; and the hybrid average is computed by running independent simulations for [green triangles]. () The ensemble average is computed from independent simulations [blue circles]; the time average is computed by running the simulations for [red crosses]; and the hybrid average is computed by running independent simulations for [green triangles]. () The ensemble average is computed from independent simulations [blue circles]; the time average is computed by running the simulations for [red crosses]; and the hybrid average is computed by running independent simulations for [green triangles]. Table 3 shows that the three measures for SDO appear to have the same mean.
By increasing the copy number of the TF, the ensemble average and time average diverge. Figure 6 () models 20 lacI molecules and zero non-cognates, and it is clear that in some cases the time average values (red crosses) diverge from their associated ensemble average values (blue circles) and hybrid average values (green triangles). The more dramatic effect, however, is the significant deviation of SDO from ADO for all three measures. This shows that for significantly increased TF copy number, whilst the ergodicity assumption has begun to break down, the differences introduced are insignificant compared to the increased SDO observed at a large number of sites.
The case of increased crowding on the DNA, as modelled by the addition of non-cognate TFs, is shown in Figure 6 (). Here the cognate abundance is kept fixed to one molecule, while 20 non-cognates are modelled. The figure shows that a significant increase in the number of non-cognates has a negligible effect on all three measures of SDO.
Table 3 shows that in the case of naked DNA and one molecule of lacI, the three measurements for SDO (ensemble, time and hybrid averages) have approximately the same mean. However, molecular crowding on the DNA leads to deviations between ensemble and hybrid averages. In particular, in the case of high abundance of cognate TFs (20 molecules of lacI) we observed a mean increase of in the hybrid average compared to the ensemble average, while in the case of high abundance of non-cognate TFs (20 non-cognate molecules) we observed a decrease of in the hybrid average compared to the ensemble average. In addition, in Figure S1 we show that, when the simulation time is increased, the mean ratio of hybrid and ensemble averages tends to and the deviations from the mean are reduced.
Because we are interested in the genomic occupancy of TFs involved in the regulation of transcription and, in particular, in cell population results, we use the hybrid average in all subsequent calculations within this manuscript. Nevertheless, it should be noted that using any of the three methods will lead to similar results.
System size reduction
Our results are obtained by simulating TF occupancy on the of the E.coli K-12 genome  (the DNA locus [300000, 400000]), roughly centered around the site (the most strongly bound site for lacI). In , we proposed two models that are required to adapt the parameters of the subsystem, namely: () copy number model and () association rate model. The former is easier to implement, but can be applied only to highly abundant TFs, while the latter requires an extra set of simulations, but can be applied to TFs with any abundance. Due to the fact that non-cognate TFs are highly abundant in our system, we applied the copy number model to simulate the non-cognate TFs. This leads to the association rate between non-cognate TFs and DNA being unaffected, but the abundances of non-cognate TFs changing to: () for crowding, () for crowding, () for crowding, () for crowding and () for crowding. Note that, in this manuscript, crowding refers to the percentage of the simulated DNA covered by DNA-binding proteins.
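Since crowding is defined here as the percentage of the simulated DNA covered by DNA-binding proteins, the conversion between a number of bound molecules and a crowding level is simple arithmetic. The sketch below shows only that definition, not the copy number model or the association rate model referred to above. The DNA length follows from the locus [300000, 400000]; the footprint and molecule counts are hypothetical placeholders.

```python
def crowding_percent(n_bound, footprint_bp, dna_length_bp):
    """Percentage of the simulated DNA covered by bound molecules."""
    return 100.0 * n_bound * footprint_bp / dna_length_bp

def bound_molecules_for_crowding(target_percent, footprint_bp, dna_length_bp):
    """Number of bound molecules needed to reach a target crowding level."""
    return round(target_percent / 100.0 * dna_length_bp / footprint_bp)

DNA_LENGTH_BP = 400_000 - 300_000   # length of the simulated locus
FOOTPRINT_BP = 50                   # hypothetical footprint per molecule

print(crowding_percent(100, FOOTPRINT_BP, DNA_LENGTH_BP))
print(bound_molecules_for_crowding(5.0, FOOTPRINT_BP, DNA_LENGTH_BP))
```

The calculation assumes that footprints do not overlap, which is consistent with the steric hindrance implemented in the model.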
For lacI, we considered four abundances, namely: , , , . Due to the lower copy number, we used the association rate approach to adjust the parameters of the full system to the subsystem. This leads to the copy number of lacI being unaffected, but its association rate changing from  to the values listed in Table 4. Figure S2 shows the proportion of time spent on the DNA (which is required when computing the association rate). We also confirmed that our system size reduction method leads to a system behaviour that deviates only negligibly from the behaviour of the full system (Figure S3 and Figure S5).
Considerations on the model
Our model uses the PWM score to calculate the binding energy, which has been shown to be a good approximation , . However, Maerkl and Quake  showed that the PWM can underestimate the binding energy; discussed in . In fact, we found that the occupancy at the site is underestimated by our approach; see Figure S4. One solution to overcome this consists of shifting the PWM scores to capture the low affinity sites and increasing the affinity at the known high affinity target sites. This assumes a priori knowledge of the target sites and cannot lead to generalisable results. Thus, in this manuscript we assume that the binding energy is well predicted by the PWM score, but we acknowledge that our results are not an exact representation of the lacI DNA binding system.
Furthermore, our model also discards cooperativity between TFs (modelled by either direct TF-TF interactions or DNA mediated cooperativity) as well as DNA looping. These are known mechanisms that influence the TF binding to DNA, at least in prokaryotic systems , . Interestingly, these mechanisms affect the facilitated diffusion of TFs  and could also explain the fact that the experimentally measured occupancy at the site is higher than the occupancy estimated only by the PWM derived binding energy. The rationale behind our assumptions (i.e. not including in the model TF cooperativity and DNA looping) is that we intended to investigate the contribution that the competition between TFs (for limited space on the DNA) has on the genomic occupancy of TFs and whether binding energy (predicted by PWM alone in our case) is the only determinant of the genomic occupancy of TFs.
Finally, the ensemble average is computed as the occupancy over the E.coli cell cycle (), which is then averaged over replicates. We need to investigate whether the mean occupancy is significantly affected by the transient behaviour of the system or whether we simulate long enough to average out the transient behaviour. Figure S9 shows that by increasing the simulation time, the variability of the occupancy is reduced, while the mean occupancy over the replicates remains the same for a simulation time of at least . This indicates that our choice of replicates, each simulated for captures the equilibrium behaviour.
Comparing the time average to the ensemble average for various abundances of cognate and non-cognate molecules. The system consists of of DNA which contains the site. There are three cases with respect to the numbers of TFs: () lacI molecule and non-cognates, () lacI molecules and non-cognates and () lacI molecules and non-cognates. In addition, we considered three values for the simulation time when computing the time and hybrid averages: () , () and () . (), () and () the boxplots represent the mean of the logarithm of the ratio between the time average and the ensemble average over replicates. A value of indicates that the time average is equal to the ensemble average. (), () and () the boxplots represent the standard deviation of the logarithm of the ratio between the time average and the ensemble average over replicates. The sites that have a binding energy lower than of the highest value () sites were removed. By increasing the simulation time, both the mean and the standard deviation of the logarithm of the ratio between the time average and the ensemble average tend to , showing that a longer simulation time leads to smaller differences between time and ensemble averages.
The percentage of time the lacI molecules spend bound to the DNA in the full system, when the crowding on the DNA is altered by changing the abundance and association rate of non-cognate TFs. We performed a set of simulations of the full system each lasting: () for lacI, () for lacI, () for lacI and () for lacI. The shaded area indicates values that are biologically plausible. The dashed line represents the experimentally measured value of the percent of time lacI stays bound to the DNA .
One dimensional statistics for various levels of non-cognate TFs. We performed a set of simulations of the subsystem each lasting , using the parameters presented in the Materials and Methods section and the parameters from Table S1 in File S1.
ADO and SDO for various abundances of lacI and crowding on the DNA. This is the same as Figure 1, except that the SDO was not normalised to the occupancy of the site, but to the length of the simulation. () is the same as () but plotted on the normal scale, while () is the same as () but plotted on the normal scale.
The average number of bound molecules for various crowding levels and various lacI abundances. We performed a set of simulations of the subsystem each lasting , using the parameters presented in the Materials and Methods section and the parameters from Table S1 in File S1.
Significant deviations between ADO and SDO. We considered the case of the lac repressor TF and of DNA, which contains the site. Each system was simulated for and, for each set of parameters, we considered independent simulations. We considered only the sites that have the binding energy at least of the highest value (the strongest sites). Furthermore, we considered only sites where the occupancy in the simulations is at least times higher than that predicted by the affinity. The number in the parentheses in the legend represents the total number of sites that display an SDO at least times higher than the ADO for each particular case. In each panel, the abundance of lacI is kept constant and the crowding on the DNA is increased from to . The level of crowding on the DNA (implemented through the abundance of non-cognate TF) influences the number of sites that display significant differences between occupancy and affinity. We considered four cases with respect to the number of lacI molecules: () , () , () and () .
Lower information content lacI motifs. The information content of the reduced motifs is: () , () , () , () , () and () ; see Figure S8.
Information content of the reduced lacI motifs.
Behaviour of the time average occupancy for various abundances of cognate and non-cognate molecules. The system consists of of DNA which contains the site. There are three cases with respect to the amounts of TFs: () lacI molecule and non-cognates, () lacI molecules and non-cognates and () lacI molecules and non-cognates. In addition, we considered three values for the simulation time when computing the time average: () , () and () . (), () and (), the boxplots represent the mean over the DNA of the logarithm of the time average over replicates. (), () and (), the boxplots represent the standard deviation of the logarithm of the time average over replicates. The sites that have a binding energy lower than of the highest value () sites were removed. By increasing the simulation time, the variability of both moments is reduced in the cases of 0 non-cognates, an effect not seen in the case of 20 non-cognates.
We would like to thank Mark Calleja for his support with configuring our simulations to run on CamGrid and Robert Stojnic for useful discussions and comments.
Conceived and designed the experiments: NRZ BA. Performed the experiments: NRZ RF. Analyzed the data: NRZ RF. Wrote the paper: NRZ RF BA.
- 1. Segal E, Widom J (2009) From DNA sequence to transcriptional behaviour: a quantitative approach. Nature Reviews Genetics 10: 443–456.
- 2. Kaplan T, Li XY, Sabo PJ, Thomas S, Stamatoyannopoulos JA, et al. (2011) Quantitative models of the mechanisms that control genome-wide patterns of transcription factor binding during early Drosophila development. PLoS Genetics 7: e1001290.
- 3. von Hippel PH, Berg OG (1986) On the specificity of DNA-protein interactions. PNAS 83: 1608–1612.
- 4. Berg OG, von Hippel PH (1987) Selection of DNA binding sites by regulatory proteins statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology 193: 723–750.
- 5. Gerland U, Moroz JD, Hwa T (2002) Physical constraints and functional characteristics of transcription factor-DNA interactions. PNAS 99: 12015–12020.
- 6. Djordjevic M, Sengupta AM, Shraiman BI (2003) A biophysical approach to transcription factor binding site discovery. Genome Research 13: 2381–2390.
- 7. Roider HG, Kanhere A, Manke T, Vingron M (2007) Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 23: 134–141.
- 8. Zhao Y, Granas D, Stormo GD (2009) Inferring binding energies from selected binding sites. PLoS Comput Biol 5: e1000590.
- 9. Stormo GD, Zhao Y (2010) Determining the specificity of protein-DNA interactions. Nature Reviews Genetics 11: 751–760.
- 10. Ackers GK, Johnson AD, Shea MA (1982) Quantitative model for gene regulation by lambda phage repressor. PNAS 79: 1129–1133.
- 11. Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, et al. (2005) Transcriptional regulation by the numbers: models. Current Opinion in Genetics and Development 15: 116–124.
- 12. Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, et al. (2005) Transcriptional regulation by the numbers: applications. Current Opinion in Genetics and Development 15: 125–135.
- 13. Raveh-Sadka T, Levo M, Segal E (2009) Incorporating nucleosomes into thermodynamic models of transcription regulation. Genome Research 19: 1480–1496.
- 14. Stormo GD (2000) DNA binding sites: representation and discovery. Bioinformatics 16: 16–23.
- 15. Wasson T, Hartemink AJ (2009) An ensemble model of competitive multi-factor binding of the genome. Genome Research 19: 2101–2112.
- 16. Berg OG, Winter RB, von Hippel PH (1981) Diffusion-driven mechanisms of protein translocation on nucleic acids. 1. models and theory. Biochemistry 20: 6929–6948.
- 17. Elf J, Li GW, Xie XS (2007) Probing transcription factor dynamics at the single-molecule level in a living cell. Science 316: 1191–1194.
- 18. Hammar P, Leroy P, Mahmutovic A, Marklund EG, Berg OG, et al. (2012) The lac repressor displays facilitated diffusion in living cells. Science 336: 1595–1598.
- 19. Zabet NR, Adryan B (2012) Computational models for large-scale simulations of facilitated diffusion. Molecular BioSystems 8: 2815–2827.
- 20. Chu D, Zabet NR, Mitavskiy B (2009) Models of transcription factor binding: Sensitivity of activation functions to model assumptions. Journal of Theoretical Biology 257: 419–429.
- 21. Riley M, Abe T, Arnaud MB, Berlyn MK, Blattner FR, et al. (2006) Escherichia coli K-12: a cooperatively developed annotation snapshot - 2005. Nucleic Acids Research 34: 1–9.
- 22. Zabet NR, Adryan B (2012) GRiP: a computational tool to simulate transcription factor binding in prokaryotes. Bioinformatics 28: 1287–1289.
- 23. Zabet NR, Adryan B (2012) A comprehensive computational model of facilitated diffusion in prokaryotes. Bioinformatics 28: 1517–1524.
- 24. Mirny L, Slutsky M, Wunderlich Z, Tafvizi A, Leith J, et al. (2009) How a protein searches for its site on DNA: the mechanism of facilitated diffusion. Journal of Physics A: Mathematical and Theoretical 42: 434013.
- 25. Kampmann M (2004) Obstacle bypass in protein motion along DNA by two-dimensional rather than one-dimensional sliding. J Biol Chem 279: 38715–38720.
- 26. Hedglin M, O'Brien PJ (2010) Hopping enables a DNA repair glycosylase to search both strands and bypass a bound protein. ACS Chem Biol 5: 427–436.
- 27. Marcovitz A, Levy Y (2011) Frustration in protein-DNA binding influences conformational switching and target search kinetics. PNAS 108: 17957–17962.
- 28. Wunderlich Z, Mirny LA (2009) Different gene regulation strategies revealed by analysis of binding motifs. Trends in Genetics 25: 434–440.
- 29. Stormo GD, Fields DS (1998) Specificity, free energy and information content in protein-DNA interactions. Trends in Biochemical Sciences 23: 109–113.
- 30. Biggin MD (2011) Animal transcription networks as highly connected, quantitative continua. Developmental Cell 21: 611–626.
- 31. Santillan M, Mackey MC (2004) Influence of catabolite repression and inducer exclusion on the bistable behavior of the lac operon. Biophysical Journal 86: 1282–1292.
- 32. Thomas S, Li XY, Sabo PJ, Sandstrom R, Thurman RE, et al. (2011) Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biology 12: R43.
- 33. Adryan B, Woerfel G, Birch-Machin I, Gao S, Quick M, et al. (2007) Genomic mapping of suppressor of hairy-wing binding sites in Drosophila. Genome Biology 8.
- 34. Zeitlinger J, Zinzen RP, Stark A, Kellis M, Zhang H, et al. (2007) Whole-genome ChIP-chip analysis of Dorsal, Twist, and Snail suggests integration of diverse patterning processes in the Drosophila embryo. Genes & Development 21: 385–390.
- 35. Kuhlman TE, Cox EC (2012) Gene location and DNA density determine transcription factor distributions in Escherichia coli. Molecular Systems Biology 8.
- 36. Fowlkes CC, Hendriks CLL, Keranen SV, Weber GH, Rubel O, et al. (2008) A quantitative spatiotemporal atlas of gene expression in the Drosophila blastoderm. Cell 133: 364–374.
- 37. Zabet NR (2012) System size reduction in stochastic simulations of the facilitated diffusion mechanism. BMC Systems Biology 6: 121.
- 38. Gillespie DT (1976) A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics 22: 403–434.
- 39. Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry 81: 2340–2361.
- 40. van Zon JS, Morelli MJ, Tanase-Nicola S, ten Wolde PR (2006) Diffusion of transcription factors can drastically enhance the noise in gene expression. Biophysical Journal 91: 4350–4367.
- 41. Hermsen R, Tans S, ten Wolde PR (2006) Transcriptional regulation by competing transcription factor modules. PLoS Comput Biol 2: 1552–1560.
- 42. Flyvbjerg H, Keatch SA, Dryden DT (2006) Strong physical constraints on sequence-specific target location by proteins on DNA molecules. Nucleic Acids Research 34: 2550–2557.
- 43. Gillespie DT (2000) The chemical langevin equation. Journal of Chemical Physics 113: 297–306.
- 44. Maerkl SJ, Quake SR (2007) A systems approach to measuring the binding energy landscapes of transcription factors. Science 315: 233–237.
- 45. Rosenfeld N, Young JW, Alon U, Swain PS, Elowitz MB (2005) Gene regulation at the single-cell level. Science 307: 1962–1965.