Conceived and designed the experiments: ANB PLR SNE. Wrote the paper: ANB PLR SNE. Wrote software: ANB. Analyzed results: ANB PLR SNE. Developed mathematical approach: PLR SNE. Assisted coding: PLR.

The authors have declared that no competing interests exist.

Recent whole genome polymerase binding assays in the

Gene activation is an inherently random process because numerous diffusing proteins and DNA must first interact by random association before transcription can begin. For many genes the necessary protein–DNA associations only begin after activation, but it has recently been noted that a large class of genes in multicellular organisms can assemble the initiation complex of proteins on the core promoter prior to activation. For these genes, activation merely releases polymerase from the preassembled complex to transcribe the gene. It has been proposed on the basis of experiments that such a mechanism, while possibly costly, increases both the speed and the synchrony of the process of gene transcription. We study a realistic model of gene transcription, and show that this conclusion holds for all but a tiny fraction of the space of physical rate parameters that govern the process. The improved control of cell-to-cell variations afforded by regulation through a paused polymerase may help multicellular organisms achieve the high degree of coordination required for development. Our approach has also generated tools with which one can study the effects of analogous changes in other molecular networks and determine the relative importance of various molecular binding rates to particular system properties.

Investigations in yeast

This mechanism has been called

It remains an open question why expression of some genes is controlled further downstream than others. Several groups have postulated that pausing may ready a polymerase for rapid induction

When whole-genome studies extended the observation of pausing to cover many key developmental regulatory genes

Recent work by Darzacq and colleagues

The idea that gene expression itself is intrinsically variable (rather than variable as a result of extrinsic fluctuations in upstream quantities) is well established and is a recent focus of theoretical and experimental interest – see

Populations of single-celled organisms have been shown to take advantage of noisy gene expression to achieve clonal yet phenotypically heterogeneous populations

Here we investigate mathematically whether the significant change in the coordination of expression observed in experiment

We do this by constructing continuous-time Markov chain models of PIC assembly with states that correspond to joint configurations of the promoter and the enhancer. The (random) time taken for the chain to pass from a “start” state to an “end” state corresponds to the elapsed time between successive transcription events. The models we construct for the two different modes of regulation have a common set of transition rates, but the particular mode of regulation dictates that certain transitions are disallowed, resulting in two chains with different sets of states accessible from the “start” state. We describe this situation by saying that each model is a

Although there has recently been much work modeling different sources of stochasticity in gene expression, most models refrain from a detailed representation of the different protein–DNA complexes involved in favor of more abstract approximations

We model the intrinsic noise of regulation and polymerase recruitment using biologically derived Markov chain models. We treat this particular piece of the larger process of expression in greater detail than has been done previously, in order to provide a detailed mathematical investigation of the role of promoter-proximal pausing. Unlike simulation methods, our approach provides a tractable way to compute analytic expressions whose interpretation is direct and reliable. Moreover, it does not depend on small-noise or equilibrium assumptions, or require the passage to a continuum limit. Furthermore, the structure of the models we use is determined by biological realism rather than being constrained by mathematical tractability. Our approach is most similar to that of

As a prelude to describing the actual Markov chain model of transcriptional regulation we analyze, we describe a general approach to modeling promoters, enhancers and their interactions, and illustrate this approach with a toy model of transcription that is not too cumbersome to draw – see

We begin with two separate Markov chains, a

Next, to model the regulatory interaction between enhancer and promoter, we designate a particular configuration of the enhancer as the

The composite stochastic process that records the states of both the promoter and enhancer chains is our resulting Markov chain model of transcription. Varying the regulated step leads to alternative topologies for this chain. We stress that, as we change the choice of regulated step, the underlying promoter and enhancer chains remain the same. In particular, the same set of rate parameters are used in both schemes and they have the same meaning. This permits meaningful comparison of different methods of regulation. Two possible regulated steps, labeled “IR gated” and “ER gated”, are shown along with the corresponding Markov chains in

Formally, the general composite Markov chain model is constructed as follows. Consider two promoter configurations, say,

Because there are generally only two promoters per gene active at the same time in a given nucleus, binding of a general transcription factor (TF) at one locus does not decrease the total concentration of the TF in the nucleus sufficiently to affect the rate of binding at the homologous locus. Furthermore, since the observed timescales of variability in induction are shorter than the expected timescale for protein translation and folding, we neglect any feedback from mRNA synthesis which might modify the transition rates. This allows us, in particular, to assume that the jump rates of the Markov chain are homogeneous in time.
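The composite construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the three-state promoter, the single enhancer transition A to B, the rate values, and the names `composite_chain` and `reachable` are all assumptions made for the example. It shows how the same promoter and enhancer rates give two chains whose sets of states accessible from the start state differ, depending only on which step is gated.

```python
from itertools import product

def composite_chain(promoter_rates, enhancer_rates, gated_step, permissive="B"):
    """Return {((p, e), (p2, e2)): rate} for the joint promoter/enhancer chain.

    The promoter step `gated_step` is allowed only while the enhancer is in
    the `permissive` state; all other rates are shared between schemes.
    """
    p_states = sorted({s for edge in promoter_rates for s in edge})
    e_states = sorted({s for edge in enhancer_rates for s in edge})
    rates = {}
    for p, e in product(p_states, e_states):
        # Promoter moves; the enhancer state is unchanged.
        for (a, b), r in promoter_rates.items():
            if a != p:
                continue
            if (a, b) == gated_step and e != permissive:
                continue  # the regulated step is blocked
            rates[((p, e), (b, e))] = r
        # Enhancer moves; the promoter state is unchanged.
        for (a, b), r in enhancer_rates.items():
            if a == e:
                rates[((p, e), (p, b))] = r
    return rates

def reachable(rates, start):
    """Set of composite states accessible from `start`."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for (a, b) in rates:
            if a == s and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

# Hypothetical toy rates (arbitrary units); not values from the paper.
promoter = {(1, 2): 1.0, (2, 1): 0.5, (2, 3): 2.0}
enhancer = {("A", "B"): 0.3}  # assumed single activating transition

ir_chain = composite_chain(promoter, enhancer, gated_step=(1, 2))  # "IR gated"
er_chain = composite_chain(promoter, enhancer, gated_step=(2, 3))  # "ER gated"
ir_states = reachable(ir_chain, (1, "A"))
er_states = reachable(er_chain, (1, "A"))
print(sorted(ir_states))
print(sorted(er_states))
```

Because both chains are built from the same two rate dictionaries, any difference in behavior between them is attributable to the choice of gated step alone.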

We now apply this framework to examine a model of transcription that is more interesting and detailed than the toy model used above for illustrative purposes.

Many general transcription factors (TFs), such as the protein complexes TFIIA, TFIIB, etc., function together in a coordinated fashion to form the pre-initiation complex (PIC) necessary for the proper activation of transcription

The steps of PIC assembly are not fully understood

Each possible complex in the process is enumerated as a state of the promoter Markov chain (see text for a description of each complex). The promoter chain (states 1–8) is combined with the enhancer chain (states A and B) to make the full 16-state model of transcription. Transitions that in some scheme require an activated enhancer (state B) are indicated by a gate,

Since we are interested in the effect of which step of PIC assembly is regulated, and not in the different possible modes of enhancer activation, we use a simple, abstracted two-state model of enhancer activation. A single transition switches the enhancer from the inactive state to the permissive state. For instance, a transition to the permissive state could represent the binding of a TF to the enhancer. This is not likely to be completely realistic, but if a particular step in the actual dynamics of transcription factor assembly and enhancer-promoter interaction is rate-limiting (e.g. the looping rate between a bound enhancer and its target promoter), then its behavior will be well approximated by our minimal model, with the rate of the transition from inactive to permissive corresponding to the rate of this limiting step.

For many paused genes, it is the phosphorylation event which is believed to be regulated

Finally, the scaffold of transcriptional machinery that facilitates polymerase binding does not necessarily dissociate when transcription begins. Thus, reinitiation may occur by binding a new polymerase (at step 5). Because TFIIH is evicted during promoter escape, the new polymerase must still reload TFIIH in order to proceed to step 6 and onward to step 8. Repeated cycles of reinitiation may lead to a burst of mRNAs synthesized from a single promoter opening event. We denote by

Our aim is not to present a definitive model of PIC assembly itself. Rather, we seek to understand the impact of different modes of regulation on a reasonable model that incorporates sufficient detail, and to develop tools that can effectively analyze models of this complexity.

We are interested in the speed and variability of the transcription process, as measured, respectively, by the mean,

However, the ratio of this quantity for the IR scheme to its counterpart for the ER scheme does not depend on our choice of time scale. For any time

We use our model to examine how these three important system properties – speed, synchrony, and transcript count variability – depend on the jump rates and how they differ between an IR and an ER regulation scheme. In both cases, the delay

Denote by

In principle, the transform

In particular, the mean and variance of

It is not necessary to carry out the differentiation in equation (2) explicitly, since (2) becomes

Equation (1) is known as the Feynman–Kac formula

To overcome this obstacle, we develop new analytic techniques that take advantage of the special structure of these matrices. First, we note that chains modeling transcription often have a block structure, in that we can decompose the state space according to the subset of states that must be passed through by any path of positive probability leading from the initial to the final state (we call such states

Our approach has several advantages. First, once we have derived symbolic expressions for features of interest, it is straightforward to substitute in a large number of possibilities for the transition rate vector in order to understand how those features vary with the values of the transition rates. This would be computationally infeasible using simulation and at best very expensive using a numerical version of the naive Feynman–Kac approach. Second, we are able to differentiate the symbolic expressions with respect to the transition rate parameters to determine the sensitivity of each feature to those parameters. Such a sensitivity analysis would be even less feasible using simulation or a numerical Feynman–Kac approach.
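For a chain small enough to write down by hand, the mean and variance of the passage time from the initial to the final state can be checked numerically. The sketch below is not the symbolic method developed in this paper. It assumes the standard linear-system identities for hitting-time moments: with Q the generator restricted to the non-final states, the vector of means m solves Q m = -1 and the vector of second moments s solves Q s = -2 m. The chain, its rates, and the function names `solve` and `passage_moments` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def passage_moments(rates, start, end):
    """Mean and variance of the first-passage time from `start` to `end`.

    rates: {(i, j): rate}. Assumes every non-final state can reach `end`;
    otherwise the restricted generator is singular.
    """
    states = sorted({s for edge in rates for s in edge if s != end})
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]
    for (a, b), r in rates.items():
        if a == end:
            continue
        Q[idx[a]][idx[a]] -= r        # total exit rate on the diagonal
        if b != end:
            Q[idx[a]][idx[b]] += r    # jumps between non-final states
    m = solve(Q, [-1.0] * n)                  # first moments
    s = solve(Q, [-2.0 * mi for mi in m])     # second moments
    i = idx[start]
    return m[i], s[i] - m[i] ** 2

# Hypothetical example: two irreversible steps, each at rate 2.
mean, var = passage_moments({(1, 2): 2.0, (2, 3): 2.0}, start=1, end=3)
print(mean, var)
```

Each evaluation costs one pair of linear solves at a single numerical rate vector, which is why a symbolic expression becomes attractive when many rate vectors, or derivatives with respect to the rates, are needed.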

To get an initial sense of the differences between these two schemes of regulation, we first compared the transcriptional behaviors for a best-guess set of parameters, guided by measurements of promoter binding and escape rates by Darzacq et al.

We used the following rate parameters for the model of

We found the probability density of the amount of time it takes the system to go from induced to actively transcribing, shown in

We also described the number of mRNAs produced over a given period of time at one choice of

In this example,

Our predictions for the time of expression and the number of transcripts in the previous subsection depended on the chosen parameter values such as the association rate of different GTFs and the average burst size of the gene expression. The values of such parameters can, for the most part, be only very approximately estimated. Moreover, they may be expected to vary considerably between different genes and different species.

Since a single vector of parameters simultaneously specifies our models for the two regulation mechanisms, we can systematically explore all possible combinations of promoter strength and enhancer activation rates and ask in each of these cases how the two mechanisms compare in terms of speed, synchrony and variability in transcript counts.
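The kind of parameter-space comparison described here can be sketched on a toy model. Everything specific in the example is an assumption made for illustration: the three-state promoter with a two-state enhancer, the uniform sampling range of 0.1 to 10, the sample size, and the names `toy_chain`, `mean_passage_time` and `compare`. It compares only the mean time to reach the final state, computed by a direct linear solve rather than by the symbolic method of this paper.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def toy_chain(k12, k21, k23, kAB, gated):
    """Composite rates: toy promoter (1<->2->3) x enhancer (A->B)."""
    promoter = {(1, 2): k12, (2, 1): k21, (2, 3): k23}
    rates = {}
    for e in "AB":
        for (a, b), r in promoter.items():
            if (a, b) == gated and e == "A":
                continue  # gated step needs the permissive enhancer state B
            rates[((a, e), (b, e))] = r
    for p in (1, 2, 3):
        rates[((p, "A"), (p, "B"))] = kAB
    return rates

def mean_passage_time(rates, start, end):
    """Mean first-passage time from `start` to `end` (solves Q m = -1)."""
    states = sorted({s for edge in rates for s in edge if s != end})
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]
    for (a, b), r in rates.items():
        if a == end:
            continue
        Q[idx[a]][idx[a]] -= r
        if b != end:
            Q[idx[a]][idx[b]] += r
    return solve(Q, [-1.0] * n)[idx[start]]

def compare(n_samples=1000, seed=0):
    """Fraction of sampled rate vectors for which ER has the smaller mean."""
    rng = random.Random(seed)
    er_faster = 0
    for _ in range(n_samples):
        k = [rng.uniform(0.1, 10.0) for _ in range(4)]  # assumed range
        t_ir = mean_passage_time(toy_chain(*k, gated=(1, 2)), (1, "A"), (3, "B"))
        t_er = mean_passage_time(toy_chain(*k, gated=(2, 3)), (1, "A"), (3, "B"))
        er_faster += t_er < t_ir
    return er_faster / n_samples

fraction = compare()
print(f"ER faster in {fraction:.1%} of sampled rate vectors")
```

In this particular toy the ER mean is smaller for every positive rate vector, since the promoter can advance to state 2 while waiting for the enhancer. The richer model analyzed in the text need not share that property, which is the reason for sampling parameter space rather than arguing from a single case.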

To compare the two kinds of regulation of the model in

In

We emphasize that this conclusion is still consistent with the possibility that a particular initiation regulated gene is expressed in a more synchronous pattern or with more rapid kinetics than some other elongation regulated gene: it is only necessary that the rate parameters are also sufficiently different. However, for the fixed set of rates associated with a given gene, the network topology of the ER scheme always improved synchrony and speed in our model of transcription relative to the corresponding IR scheme for the parameter vectors we sampled.

There is a plausible intuitive explanation for why elongation regulation is almost always faster than initiation regulation (

The effect of the regulatory scheme on the variation in the total amount of expression among cells is perhaps the most interesting and also experimentally untested consequence of regulating release from the paused state. As discussed above, we compute a factor

We explored the logarithm of this ratio (equivalently, the difference of the logarithms of the respective

When the complex is very stable, so that all polymerases find a preassembled scaffold to return to (

When the scaffold is still stable but less so (

When we consider the simplest case with no bursting (

We have found that, regardless of the value of

Consideration of how each chain depends on its starting state suggests an intuitive explanation for this difference. The IR scheme differs more in the amount of time it takes to reach the synthesis state when started with or without a scaffold (state 5 or state 1) than does the ER scheme. Intermediate values of

To further understand why elongation regulation results in faster, more synchronous, and more consistent gene expression over a wide range of parameters we investigated alternative post-initiation regulatory schemes. This allows us to explore how changing certain properties of the model of PIC assembly (the promoter chain) will affect the results: Is the difference large because there are many steps between the IR step and the ER step, or is it because there is no allowed transition leading backward out of the state immediately before the regulated step? To explore these questions, we made modifications to the toy model of

First note that, as is shown in

We performed the same analysis after adding a reverse transition from state 3 back to state 2 (see

We also investigated the case in which the

Small variations in rate parameters between cells will occur if the number of TF or Pol II molecules is small, so it is of interest to investigate how robust the properties of each regulation scheme are to such variation and which jump rates affect each scheme the most. To measure this sensitivity, we compute the gradient of a quantity of interest (e.g. the mean induction speed) with respect to the vector of jump rates, square the entries, and normalize so that the entries sum to one, giving a quantity we refer to as
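The sensitivity measure described in this paragraph (gradient, squared entrywise, normalized to sum to one) can be sketched as follows. The paper differentiates symbolic expressions; this example swaps in central finite differences so that it stays self-contained. The function names, the relative step size, and the example quantity (the mean time to traverse irreversible steps in series, which is the sum of the reciprocal rates) are assumptions chosen for illustration.

```python
def relative_sensitivity(f, rates, rel_step=1e-6):
    """Squared, normalized gradient of f at `rates`.

    The gradient is approximated by central finite differences (an assumed
    stand-in for symbolic differentiation). Assumes the gradient is nonzero.
    """
    grad = []
    for i, k in enumerate(rates):
        h = rel_step * k
        up = list(rates)
        up[i] = k + h
        dn = list(rates)
        dn[i] = k - h
        grad.append((f(up) - f(dn)) / (2 * h))
    sq = [g * g for g in grad]
    total = sum(sq)
    return [s / total for s in sq]

def mean_time_sequential(rates):
    """Mean time for irreversible steps in series: the sum of 1/k."""
    return sum(1.0 / k for k in rates)

# Hypothetical rates spanning two orders of magnitude.
rates = [0.1, 1.0, 10.0]
sens = relative_sensitivity(mean_time_sequential, rates)
for k, s in zip(rates, sens):
    print(f"rate {k:g}: relative sensitivity {s:.3g}")
```

In this sequential example nearly all of the sensitivity falls on the smallest rate, which is the rate-limiting step.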

To explore the sensitivity across parameter space, we computed relative sensitivities for each of the three system properties to all 16 parameters at each of the 10,000 random vectors of transition rates described above. Each of the system properties showed surprisingly similar sensitivity profiles, so we only discuss the results for the mean time to transcription. Marginal distributions of sensitivity of mean time to transcription to each parameter are shown in

Histograms of the marginal distributions of relative sensitivities for both the ER and IR schemes, across uniform random samples from parameter space, as described in the text. The smallest bin of the histogram (values below

As one might expect, for a given parameter vector the parameters to which the behavior of the models is most sensitive are generally those that happen to take the smallest values (and are thus rate-limiting): for each parameter vector, we recorded the sizes of the two parameters with the highest and second-highest sensitivity values and found that their sample means were

Two further observations are evident from this analysis. First, we see which transitions in the process of activating the gene are most sensitive to small fluctuations (due to small number of TF molecules or changes in binding strength). As is apparent from

Second, we also observe that the complex assembly steps which may occur in arbitrary arrival order, namely the recruitment of TFIIE or TFIIF (governed by the jump rates

Speed, synchrony, degree of cell–to–cell variability, and robustness to environmental fluctuations are important features of transcription. They are properties of the system rather than of a particular gene, DNA regulatory sequence, or gene product taken in isolation, and optimizing them can, for instance, reduce the frequency of mis-patterning events that arise due to the inherently stochastic nature of gene expression. Understanding how these properties emerge, the mechanism by which they change, and the tradeoffs involved in optimizing them all require tractable models of transcription.

Through a study of stochastic models of transcriptional activation, we demonstrated that the increased speed and synchrony of paused genes, reported by Yao et al.

We furthermore explored what aspects of ER make this possible. From an examination of the effect of scaffold stability we proposed that elongation regulation should reduce the noise-amplifying nature of bursty expression. By investigating alternative models of post-initiation regulation, we also determined that our predictions depend critically on the stability of the transcriptionally engaged, paused polymerase, and would not be expected from polymerases cycling rapidly on and off the promoter (i.e. polymerase stalling).

Our investigation required us to introduce a general probabilistic framework for analyzing system properties of protein–DNA interactions. Stochastic effects, resulting from molecular fluctuations, are increasingly understood to play important roles in gene control and expression (see

Finally, our approach is not restricted to investigating the assembly of transcriptional machinery, but may also prove useful in studying stochastic properties of a variety of regulatory DNA sequences (such as enhancers). Different assembly topologies, such as sequential versus arbitrary association mechanisms for the component TFs

Sample ChIP. Identification of paused polymerase in Drosophila by ChIP-chip. (A) Gene models are shown at the top, aligned to Pol II chromatin immunoprecipitation signal measurements from a whole-genome tiling array, showing locations in the genome where Pol II is bound in each of three specific tissues (the dorsal ectoderm, the neurogenic ectoderm, and the mesoderm), from Zeitlinger 2007. pnr is expressed only in the dorsal ectoderm; the promoter (highlighted region) is silent in the other tissues. (B) Genome data as in (A) for the region around the gene tup. In this case the promoter region is bound in all three tissue types, even though the rest of the gene is only transcribed in the dorsal ectoderm.

(0.53 MB EPS)

Parameter histograms. Histograms of the distributions of those parameter values where the IR scheme is faster than the ER scheme (top row), more synchronous than the ER scheme (middle row), or less noisy in terms of total transcripts than the ER scheme (bottom row).

(0.30 MB EPS)

Pinchpoint schema. A schematic of the decomposition. The probabilities a_{k}, b_{k}, c_{k}, and d_{k} depend only on the distributions of both adjacent chains X_{k} and X_{k+1}, while the behavior of X between pinch points p_{k-1} and p_{k} only depends on the distribution of X_{k}.

(0.30 MB EPS)

Change topology. Effect of regulated step. (A) Adding a transition k_{32} which enables polymerase to exit the paused state and return to a pre-initiated state. (B) Effect of the added transition on the structure of the composite Markov chains. (C) Comparison between the models over all of parameter space when the transition k_{32} is added. (D) Schematic of changing the regulated step to control promoter escape rather than release from pausing. (E) Resulting composite Markov chains for regulating promoter escape. (F) Comparison between the models over all of parameter space when promoter escape is the regulated step.

(1.25 MB EPS)

Sensitivity analysis for variance in transcription time. The details are the same as for

(0.47 MB EPS)

Sensitivity analysis for transcript count variability. The details are the same as for

(0.47 MB EPS)

Derivation of equations and detailed mathematical approach for rapid inversion of large transition matrices.

(0.45 MB PDF)

MATLAB code to implement the analyses described in the main text and outlined in detail in

(6.06 MB ZIP)

We thank Graham Coop, Mike Levine, George Oster, Dan Rokhsar, Ken Wachter, Michael Cianfrocco and Teppei Yamaguchi for helpful discussions and comments on the manuscript.