Supplemental Information for: Reconciling Kinetic and Equilibrium Models of Bacterial Transcription

S1.1 Derivation of chemical master equation
S1.2 Matrix form of the multi-state chemical master equation
S1.3 General forms for mean mRNA and Fano factor
    S1.3.1 Promoter state probabilities $\langle \vec{m^0} \rangle$
    S1.3.2 First moments $\langle \vec{m} \rangle$ and $\langle m \rangle$
    S1.3.3 Second moment $\langle m^2 \rangle$ and Fano factor $\nu$
    S1.3.4 Summary of general results
S1.4 Nonequilibrium Model One - Poisson Promoter
    S1.4.1 Mean mRNA
    S1.4.2 Fano factor
S1.5 Nonequilibrium Model Two - RNAP Bound and Unbound States
    S1.5.1 Mean mRNA
    S1.5.2 Fano factor
S1.6 Nonequilibrium Model Three - Multistep Transcription Initiation and Escape
    S1.6.1 Mean mRNA
    S1.6.2 Fano factor
    S1.6.3 Generalizing $\nu < 1$ to more fine-grained models
S1.7 Nonequilibrium Model Four - "Active" and "Inactive" States
    S1.7.1 Mean mRNA
    S1.7.2 Fano factor


S1 Derivations for non-bursty promoter models
In this section we detail the calculation of mean mRNA levels, fold-changes in expression, and Fano factors for nonequilibrium promoter models 1 through 4 in Figure 1. These are the results that were quoted but not derived in Sections 2 and 3 of the main text. In each of these four models, the natural mathematicization of their cartoons is as a chemical master equation such as Eq. 12 for model 1. Before jumping into the general computation of the mean mRNA level and the Fano factor, we will work through the derivation of an example master equation, focusing on model 1 from Figure 1. The chemical master equation describes the continuous-time evolution of a continuous or discrete probability distribution function. In our specific case we want to describe the time evolution of the discrete mRNA distribution. What this means is that we want to compute the probability of having $m$ mRNA molecules at time $t + \Delta t$, where $\Delta t$ is a sufficiently small time interval such that at most one of the possible reactions takes place during that interval.
For the example that we will work out here in detail we chose the two-state stochastic simple repression model schematized in Figure S1(A). To derive the master equation we will focus on the representation shown in Figure S1(B), where the transitions between different mRNA counts and promoter states are depicted more explicitly. Given that the DNA promoter can exist in one of two states (a transcriptionally active state, and a state with repressor bound), we will keep track not only of the mRNA count, but also of which state the promoter is in. For this we track two probability distributions: the probability of having $m$ mRNAs at time $t$ when the promoter is in the transcriptionally active state $A$, $p_A(m,t)$, and the equivalent probability when the promoter is in the repressor bound state $R$, $p_R(m,t)$.
Since mRNA production can only take place in the transcriptionally active state, we will focus on this state for our derivation. The repressor bound state will have an equivalent equation without the terms involving mRNA production. We begin by listing the possible state transitions that can occur for a particular mRNA count with the promoter in the active state. For state changes in a small time window $\Delta t$ that "jump into" state $m$ with the promoter in the transcriptionally active state we have

• Produce an mRNA, jumping from $m-1$ to $m$.

• Degrade an mRNA, jumping from $m+1$ to $m$.

• Transition from the repressor bound state $R$ with $m$ mRNAs to the active state $A$ with $m$ mRNAs.

Likewise, for state transitions that "jump out" of state $m$ with the promoter in the transcriptionally active state we have

• Produce an mRNA, jumping from $m$ to $m+1$.

• Degrade an mRNA, jumping from $m$ to $m-1$.

• Transition from the active state $A$ with $m$ mRNAs to the repressor bound state $R$ with $m$ mRNAs.

The mRNA production rate does not depend on the current number of mRNAs, so each production transition occurs with probability $r\Delta t$. The same is true for the promoter state transitions; each occurs with probability $k_R^{\pm}\Delta t$. The mRNA degradation events, by contrast, do depend on the current number of mRNA molecules: the more molecules of mRNA there are, the more will decay during a given time interval. Each molecule has a constant probability $\gamma\Delta t$ of being degraded, so the total probability for an mRNA degradation event to occur is computed by multiplying this probability by the current number of mRNAs.
To see these terms in action let us compute the probability of having $m$ mRNA at time $t + \Delta t$ in the transcriptionally active state. This takes the form
$$
p_A(m, t+\Delta t) = p_A(m,t)
+ r\Delta t\, p_A(m-1,t) - r\Delta t\, p_A(m,t)
+ (m+1)\gamma\Delta t\, p_A(m+1,t) - m\gamma\Delta t\, p_A(m,t)
+ k_R^- \Delta t\, p_R(m,t) - k_R^+ \Delta t\, p_A(m,t),
\tag{S1}
$$
where the successive terms capture, in order, the production, degradation, and promoter-switching transitions listed above. Notice that the second-to-last term on the right-hand side is multiplied by $p_R(m,t)$, since the transition from state $R$ to state $A$ depends on the probability of being in state $R$ to begin with. It is through this term that the dynamics of the two probability distribution functions $p_R(m,t)$ and $p_A(m,t)$ are coupled. An equivalent equation can be written for the probability of having $m$ mRNA at time $t+\Delta t$ while in the repressor bound state; the only differences are that the mRNA production terms are removed and the signs of the promoter state transition terms are inverted. This is
$$
p_R(m, t+\Delta t) = p_R(m,t)
+ (m+1)\gamma\Delta t\, p_R(m+1,t) - m\gamma\Delta t\, p_R(m,t)
+ k_R^+ \Delta t\, p_A(m,t) - k_R^- \Delta t\, p_R(m,t).
\tag{S2}
$$
All we have to do now are simple algebraic steps to simplify the equations. Let us focus on the transcriptionally active state $A$. First we send the term $p_A(m,t)$ to the left-hand side, and then divide both sides of the equation by $\Delta t$. This results in
$$
\frac{p_A(m,t+\Delta t) - p_A(m,t)}{\Delta t}
= r p_A(m-1,t) - r p_A(m,t)
+ (m+1)\gamma p_A(m+1,t) - m\gamma p_A(m,t)
+ k_R^- p_R(m,t) - k_R^+ p_A(m,t).
\tag{S3}
$$
Upon taking the limit $\Delta t \to 0$ we can transform the left-hand side into a derivative, obtaining the chemical master equation
$$
\frac{dp_A(m,t)}{dt}
= r p_A(m-1,t) - r p_A(m,t)
+ (m+1)\gamma p_A(m+1,t) - m\gamma p_A(m,t)
+ k_R^- p_R(m,t) - k_R^+ p_A(m,t).
\tag{S4}
$$
Doing equivalent manipulations for the repressor bound state gives an ODE of the form
$$
\frac{dp_R(m,t)}{dt}
= (m+1)\gamma p_R(m+1,t) - m\gamma p_R(m,t)
+ k_R^+ p_A(m,t) - k_R^- p_R(m,t).
\tag{S5}
$$
In the next section we will write these coupled ODEs in a more compact form using matrix notation.
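As a numerical sanity check (not part of the original derivation), the coupled master equations above can be integrated forward in time until the distribution relaxes, and the resulting mean mRNA compared against the steady-state result $\langle m \rangle = (r/\gamma)\, k_R^-/(k_R^- + k_R^+)$ quoted later. A minimal sketch, using illustrative rate values rather than fitted parameters:

```python
import numpy as np

# Forward-Euler integration of the two-state master equation (Eqs. S4-S5).
# All rate values below are illustrative choices, not fitted numbers.
kR_on, kR_off = 2.0, 1.0      # repressor binding (k_R^+) and unbinding (k_R^-)
r, gamma = 10.0, 1.0          # initiation and degradation rates

M = 200                       # truncate the mRNA copy number at M - 1
m = np.arange(M)
pA = np.zeros(M)
pR = np.zeros(M)
pA[0] = 1.0                   # start with zero mRNA, promoter active

dt = 2e-4
for _ in range(100_000):      # integrate to t = 20/gamma, ample for steady state
    deg_in_A = gamma * np.append(m[1:] * pA[1:], 0.0)   # gamma*(m+1)*pA(m+1)
    deg_in_R = gamma * np.append(m[1:] * pR[1:], 0.0)
    prod_in = np.concatenate(([0.0], r * pA[:-1]))      # r*pA(m-1)
    dpA = (prod_in - r * pA + deg_in_A - gamma * m * pA
           + kR_off * pR - kR_on * pA)
    dpR = deg_in_R - gamma * m * pR + kR_on * pA - kR_off * pR
    pA += dt * dpA
    pR += dt * dpR

mean_mrna = np.sum(m * (pA + pR))
predicted = (r / gamma) * kR_off / (kR_off + kR_on)     # Eq. 13 of the main text
```

The truncation at $M$ copies is safe here because the steady-state distribution has negligible weight near the boundary for these parameter choices.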

S1.2 Matrix form of the multi-state chemical master equation
Having derived an example chemical master equation, we now focus on writing a general matrix form for the kinetic models 1-4 shown in Figure 1(C) of the main text. In each of these four models, the natural mathematicization of their cartoons is as a chemical master equation such as Eq. 12 for model 1, which for completeness we reproduce here as
$$
\frac{d}{dt}
\begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}
=
\begin{pmatrix} -k_R^- & k_R^+ \\ k_R^- & -k_R^+ \end{pmatrix}
\begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}
- \begin{pmatrix} 0 & 0 \\ 0 & r \end{pmatrix}
\begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}
+ \begin{pmatrix} 0 & 0 \\ 0 & r \end{pmatrix}
\begin{pmatrix} p_R(m-1,t) \\ p_U(m-1,t) \end{pmatrix}
+ \gamma(m+1)\begin{pmatrix} p_R(m+1,t) \\ p_U(m+1,t) \end{pmatrix}
- \gamma m \begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}.
\tag{S6}
$$
Here $p_R(m,t)$ and $p_U(m,t)$ are the probabilities of finding the system with $m$ mRNA molecules at time $t$ in the repressor bound or unbound states, respectively. $r$ is the probability per unit time that a transcript will be initiated when the repressor is unbound, and $\gamma$ is the probability per unit time for a given mRNA to be degraded. $k_R^-$ is the probability per unit time that a bound repressor will unbind, while $k_R^+$ is the probability per unit time that an unbound operator will become bound by a repressor. Assuming mass action kinetics, $k_R^+$ is proportional to the repressor copy number $R$.

Next consider the cartoon for nonequilibrium model 2 in Figure 1(C). Now we must track probabilities $p_R$, $p_E$, and $p_P$ for the repressor bound, empty, and polymerase bound states, respectively. By analogy to Eq. S6, the master equation for model 2 can be written
$$
\begin{aligned}
\frac{dp_R(m,t)}{dt} &= -k_R^- p_R(m,t) + k_R^+ p_E(m,t) + \gamma(m+1)p_R(m+1,t) - \gamma m\, p_R(m,t), \\
\frac{dp_E(m,t)}{dt} &= k_R^- p_R(m,t) - (k_R^+ + k_P^+)\, p_E(m,t) + k_P^- p_P(m,t) + \gamma(m+1)p_E(m+1,t) - \gamma m\, p_E(m,t) + r\, p_P(m-1,t), \\
\frac{dp_P(m,t)}{dt} &= k_P^+ p_E(m,t) - k_P^- p_P(m,t) + \gamma(m+1)p_P(m+1,t) - \gamma m\, p_P(m,t) - r\, p_P(m,t).
\end{aligned}
\tag{S7}
$$
$k_P^+$ and $k_P^-$ are defined in close analogy to $k_R^+$ and $k_R^-$, except for RNAP binding and unbinding instead of repressor. Similarly $p_P(m,t)$ is defined for the active (RNAP-bound) state exactly as $p_R(m,t)$ and $p_E(m,t)$ are for the repressor bound and empty (unbound) states, respectively. The new subtlety Eq. S7 introduces compared to Eq. S6 is that when an mRNA is produced, the promoter state also changes. This is encoded by the terms involving $r$, the last term in each of the equations for $p_E$ and $p_P$: the former accounts for arrivals in the empty state and the latter accounts for departures from the RNAP-bound state.
To condense and clarify the unwieldy notation of Eq. S7, it can be rewritten in matrix form. We define the column vector $\vec{p}(m,t)$ as
$$
\vec{p}(m,t) = \begin{pmatrix} p_R(m,t) \\ p_E(m,t) \\ p_P(m,t) \end{pmatrix}
\tag{S8}
$$
to gather, for a given $m$, the probabilities of finding the system in the three possible promoter states. Then all the transition rates may be condensed into matrices which multiply this vector. The first matrix is
$$
\mathbf{K} = \begin{pmatrix}
-k_R^- & k_R^+ & 0 \\
k_R^- & -k_R^+ - k_P^+ & k_P^- \\
0 & k_P^+ & -k_P^-
\end{pmatrix},
\tag{S9}
$$
which tracks all promoter state changes in Eq. S7 that are not accompanied by a change in the mRNA copy number. The two terms accounting for transcription, the only transition that increases the mRNA copy number, must be handled by two separate matrices given by
$$
\mathbf{R}_A = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & r \\ 0 & 0 & 0 \end{pmatrix},
\qquad
\mathbf{R}_D = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & r \end{pmatrix}.
\tag{S10}
$$
$\mathbf{R}_A$ accounts for transitions arriving in a given state while $\mathbf{R}_D$ tracks departing transitions. With these definitions, we can condense Eq. S7 into the single equation
$$
\frac{d\vec{p}(m,t)}{dt}
= \left(\mathbf{K} - \mathbf{R}_D\right)\vec{p}(m,t)
+ \mathbf{R}_A\, \vec{p}(m-1,t)
+ \gamma(m+1)\,\vec{p}(m+1,t)
- \gamma m\, \vec{p}(m,t),
\tag{S11}
$$
which is just Eq. 15 in the main text. Straightforward albeit tedious algebra verifies that Eqs. S7 and S11 are in fact equivalent.
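The conservation-of-probability property that will be exploited below (every column of $\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A$ sums to zero) can be checked directly in a few lines. A sketch with illustrative rate values, not parameters from the paper:

```python
import numpy as np

# Model-2 matrices (Eqs. S9-S10) for illustrative rates; check that every
# column of K - R_D + R_A sums to zero (conservation of probability).
kR_plus, kR_minus = 2.0, 1.0   # repressor binding / unbinding
kP_plus, kP_minus = 3.0, 4.0   # RNAP binding / unbinding
r = 10.0                       # initiation rate

# State ordering: (R, E, P) = (repressor bound, empty, RNAP bound)
K = np.array([[-kR_minus,            kR_plus,       0.0],
              [ kR_minus, -kR_plus - kP_plus,  kP_minus],
              [       0.0,           kP_plus, -kP_minus]])
R_A = np.array([[0.0, 0.0, 0.0],
                [0.0, 0.0,   r],   # arrivals in E when a transcript is made
                [0.0, 0.0, 0.0]])
R_D = np.diag([0.0, 0.0, r])       # departures from P

col_sums = (K - R_D + R_A).sum(axis=0)
```

Since each column collects all departures (on the diagonal) and arrivals (off the diagonal) for one promoter state, `col_sums` should be identically zero.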
Although we derived Eq. S11 for the particular case of nonequilibrium model 2 in Figure 1, in fact the chemical master equations for all of the nonequilibrium models in Figure 1 except for model 5 can be cast in this form. (We treat model 5 separately in Appendix S2.) Model 3 introduces no new subtleties beyond model 2, and Eq. S11 applies equally well to it, simply with different matrices of dimension four instead of three. Models 1 and 4 are both handled by Eq. 12 in the main text, which is just Eq. S11 in the special case $\mathbf{R}_D = \mathbf{R}_A \equiv \mathbf{R}$, since in these two models transcription initiation events do not change the promoter state.
Recalling that our goal in this section is to derive expressions for the mean mRNA and Fano factor for nonequilibrium models 1 through 4 in Figure 1, and since all four of these models are described by Eq. S11, we can save substantial effort by deriving general formulas for the mean mRNA and Fano factor from Eq. S11 once and for all. Then for each model we can simply plug in the appropriate matrices $\mathbf{K}$, $\mathbf{R}_D$, and $\mathbf{R}_A$ and carry out the remaining algebra.
For our purposes it will suffice to derive the first and second moments of the mRNA distribution from this master equation, similar to the treatment in [24], but we refer the interested reader to [42] for an analogous treatment demonstrating an analytical solution for arbitrary moments.

S1.3 General forms for mean mRNA and Fano factor
Our task now is to derive expressions for the first two moments of the steady-state mRNA distribution from Eq. S11. Our treatment of this is analogous to that given in Refs.
[24] and [42]. We first obtain the steady-state limit of Eq. S11, in which the time derivative vanishes, giving
$$
0 = \left(\mathbf{K} - \mathbf{R}_D\right)\vec{p}(m)
+ \mathbf{R}_A\, \vec{p}(m-1)
+ \gamma(m+1)\,\vec{p}(m+1)
- \gamma m\, \vec{p}(m).
\tag{S12}
$$
From this, we want to compute
$$
\langle m \rangle = \sum_{S} \sum_{m=0}^{\infty} m\, p_S(m)
\tag{S13}
$$
and
$$
\langle m^2 \rangle = \sum_{S} \sum_{m=0}^{\infty} m^2\, p_S(m),
\tag{S14}
$$
which define the average values of $m$ and $m^2$ at steady state, where the averaging is over all possible mRNA copy numbers and promoter states $S$. For example, for model 1 in Figure 1(C), the sum on $S$ would cover the repressor bound and unbound states ($R$ and $U$ respectively); for model 2, the sum would cover the repressor bound, polymerase bound, and empty states ($R$, $P$, and $E$); and so on for the other models.
Along the way it will be convenient to define the following conditional moments as
$$
\langle \vec{m} \rangle = \sum_{m=0}^{\infty} m\, \vec{p}(m)
\tag{S15}
$$
and
$$
\langle \vec{m^2} \rangle = \sum_{m=0}^{\infty} m^2\, \vec{p}(m).
\tag{S16}
$$
These objects are vectors of the same size as $\vec{p}(m)$, and each component can be thought of as the expected value of the mRNA copy number, or copy number squared, conditional on the promoter being in a certain state. For example, for model 1 in Figure 1, which has repressor bound and unbound states labeled $R$ and $U$, $\langle \vec{m^2} \rangle$ would be
$$
\langle \vec{m^2} \rangle = \begin{pmatrix} \sum_m m^2\, p_R(m) \\ \sum_m m^2\, p_U(m) \end{pmatrix}.
\tag{S17}
$$
Analogously to $\langle \vec{m} \rangle$ and $\langle \vec{m^2} \rangle$, it is convenient to define the vector
$$
\langle \vec{m^0} \rangle = \sum_{m=0}^{\infty} \vec{p}(m),
\tag{S18}
$$
whose elements are simply the probabilities of finding the system in each of the possible promoter states. It will be convenient to denote by $\vec{1}^\dagger$ a row vector of the same length as $\vec{p}$ whose elements are all 1, such that, for instance, $\vec{1}^\dagger \cdot \langle \vec{m^0} \rangle = 1$, $\vec{1}^\dagger \cdot \langle \vec{m} \rangle = \langle m \rangle$, etc.

S1.3.1 Promoter state probabilities $\langle \vec{m^0} \rangle$

To begin, we will find the promoter state probabilities $\langle \vec{m^0} \rangle$ from Eq. S12 by summing over all mRNA copy numbers $m$, producing
$$
0 = \sum_{m=0}^{\infty} \left[ \left(\mathbf{K} - \mathbf{R}_D\right)\vec{p}(m) + \mathbf{R}_A\, \vec{p}(m-1) + \gamma(m+1)\vec{p}(m+1) - \gamma m\, \vec{p}(m) \right].
\tag{S19}
$$
Using the definitions of $\langle \vec{m^0} \rangle$ and $\langle \vec{m} \rangle$, and noting that the matrices $\mathbf{K}$, $\mathbf{R}_D$, and $\mathbf{R}_A$ are all independent of $m$ and can be moved outside the sum, this simplifies to
$$
0 = \left(\mathbf{K} - \mathbf{R}_D\right)\langle \vec{m^0} \rangle + \mathbf{R}_A \sum_{m=0}^{\infty} \vec{p}(m-1) + \gamma \sum_{m=0}^{\infty} (m+1)\,\vec{p}(m+1) - \gamma \langle \vec{m} \rangle.
\tag{S20}
$$
The last two terms can be handled by reindexing the summations, transforming them to match the definitions of $\langle \vec{m^0} \rangle$ and $\langle \vec{m} \rangle$. For the first, noting $\vec{p}(-1) = 0$ since negative numbers of mRNA are nonsensical, we have
$$
\sum_{m=0}^{\infty} \vec{p}(m-1) = \sum_{m'=0}^{\infty} \vec{p}(m') = \langle \vec{m^0} \rangle.
\tag{S21}
$$
Similarly for the second,
$$
\sum_{m=0}^{\infty} (m+1)\,\vec{p}(m+1) = \sum_{m'=1}^{\infty} m'\, \vec{p}(m') = \sum_{m'=0}^{\infty} m'\, \vec{p}(m') = \langle \vec{m} \rangle,
\tag{S22}
$$
which holds since in extending the lower limit from $m'=1$ to $m'=0$, the extra term we added to the sum is zero. Substituting these back in Eq. S20, the degradation terms cancel and we have
$$
0 = \left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A\right)\langle \vec{m^0} \rangle.
\tag{S24}
$$
So given matrices $\mathbf{K}$, $\mathbf{R}_D$, and $\mathbf{R}_A$ describing a promoter, finding $\langle \vec{m^0} \rangle$ simply amounts to solving this set of linear equations, subject to the normalization constraint $\vec{1}^\dagger \cdot \langle \vec{m^0} \rangle = 1$. It will turn out to be the case that, if the matrix $\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A$ is $n$-dimensional, it will always have only $n-1$ linearly independent rows. Including the normalization condition provides the $n$-th linearly independent equation, ensuring a unique solution.
So when using this equation to solve for $\langle \vec{m^0} \rangle$, we may always drop one row of the matrix equation at our convenience and supplement the system with the normalization condition instead. The reader may find it illuminating to skip ahead and see Eq. S24 in use with concrete examples, e.g., Eq. S52 and what follows it, before continuing on through the general formulas for moments.

S1.3.2 First moments $\langle \vec{m} \rangle$ and $\langle m \rangle$

By analogy to the above procedure for finding $\langle \vec{m^0} \rangle$, we may find $\langle \vec{m} \rangle$ by first multiplying Eq. S12 by $m$ and then summing over $m$. Carrying out this procedure we have
$$
0 = \sum_{m=0}^{\infty} m \left[ \left(\mathbf{K} - \mathbf{R}_D\right)\vec{p}(m) + \mathbf{R}_A\, \vec{p}(m-1) + \gamma(m+1)\vec{p}(m+1) - \gamma m\, \vec{p}(m) \right],
\tag{S25}
$$
and now identifying $\langle \vec{m} \rangle$ and $\langle \vec{m^2} \rangle$ gives
$$
0 = \left(\mathbf{K} - \mathbf{R}_D\right)\langle \vec{m} \rangle + \mathbf{R}_A \sum_{m=0}^{\infty} m\, \vec{p}(m-1) + \gamma \sum_{m=0}^{\infty} m(m+1)\,\vec{p}(m+1) - \gamma \langle \vec{m^2} \rangle.
\tag{S26}
$$
The summations in the last two terms can be reindexed just as we did for $\langle \vec{m^0} \rangle$, freely adding or removing terms from the sum which are zero. For the first term we find
$$
\sum_{m=0}^{\infty} m\, \vec{p}(m-1) = \sum_{m'=0}^{\infty} (m'+1)\,\vec{p}(m') = \langle \vec{m} \rangle + \langle \vec{m^0} \rangle,
\tag{S27}
$$
and similarly for the second,
$$
\sum_{m=0}^{\infty} m(m+1)\,\vec{p}(m+1) = \sum_{m'=0}^{\infty} (m'-1)\, m'\, \vec{p}(m') = \langle \vec{m^2} \rangle - \langle \vec{m} \rangle.
\tag{S28}
$$
Substituting back in Eq. S26 then produces
$$
0 = \left(\mathbf{K} - \mathbf{R}_D\right)\langle \vec{m} \rangle + \mathbf{R}_A\left( \langle \vec{m} \rangle + \langle \vec{m^0} \rangle \right) + \gamma\left( \langle \vec{m^2} \rangle - \langle \vec{m} \rangle \right) - \gamma \langle \vec{m^2} \rangle,
\tag{S29}
$$
or after simplifying,
$$
0 = \left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A - \gamma\mathbb{I}\right)\langle \vec{m} \rangle + \mathbf{R}_A \langle \vec{m^0} \rangle.
\tag{S30}
$$
So like $\langle \vec{m^0} \rangle$, $\langle \vec{m} \rangle$ is also found by simply solving a set of linear equations after first solving for $\langle \vec{m^0} \rangle$ from Eq. S24. Next we can find the mean mRNA copy number $\langle m \rangle$ from $\langle \vec{m} \rangle$ according to
$$
\langle m \rangle = \vec{1}^\dagger \cdot \langle \vec{m} \rangle,
\tag{S31}
$$
where $\vec{1}^\dagger$ is a row vector whose elements are all 1. Eq. S31 holds since the $i$-th element of the column vector $\langle \vec{m} \rangle$ is the mean mRNA value conditional on the system occupying the $i$-th promoter state, so the dot product with $\vec{1}^\dagger$ amounts to simply summing the elements of $\langle \vec{m} \rangle$.
Rather than solving Eq. S30 for $\langle \vec{m} \rangle$ and then computing $\vec{1}^\dagger \cdot \langle \vec{m} \rangle$, we may take a shortcut by multiplying Eq. S30 by $\vec{1}^\dagger$ first. The key observation that makes this useful is that
$$
\vec{1}^\dagger \cdot \left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A\right) = \vec{0}^\dagger.
\tag{S32}
$$
Intuitively, this equality holds because each column of this matrix represents transitions to and from a given promoter state. In any given column, the diagonal encodes all departing transitions and the off-diagonals encode all arriving transitions. Conservation of probability means that each column sums to zero, and summing columns is exactly the operation that multiplying by $\vec{1}^\dagger$ carries out.
Proceeding then to multiply Eq. S30 by $\vec{1}^\dagger$ produces
$$
0 = \vec{1}^\dagger \cdot \left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A - \gamma\mathbb{I}\right)\langle \vec{m} \rangle + \vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle
= -\gamma \langle m \rangle + \vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle,
\tag{S33}
$$
or simply
$$
\langle m \rangle = \frac{1}{\gamma}\, \vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle.
\tag{S34}
$$
We note that in equilibrium models of transcription such as in Figure 1, it is usually assumed that the mean mRNA level is given by the ratio of initiation rate $r$ to degradation rate $\gamma$, multiplied by the probability of finding the system in the RNAP-bound state. Reassuringly, we have recovered exactly this result from the master equation picture: the product $\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle$ picks out the probability of the active promoter state from $\langle \vec{m^0} \rangle$ and multiplies it by the initiation rate $r$.

S1.3.3 Second moment $\langle m^2 \rangle$ and Fano factor $\nu$

Continuing the pattern of the zeroth and first moments, we now find $\langle \vec{m^2} \rangle$ by multiplying Eq. S12 by $m^2$ and then summing over $m$, which explicitly is
$$
0 = \sum_{m=0}^{\infty} m^2 \left[ \left(\mathbf{K} - \mathbf{R}_D\right)\vec{p}(m) + \mathbf{R}_A\, \vec{p}(m-1) + \gamma(m+1)\vec{p}(m+1) - \gamma m\, \vec{p}(m) \right].
\tag{S35}
$$
Identifying the moments $\langle \vec{m^2} \rangle$ and $\langle \vec{m^3} \rangle$ in the first and last terms simplifies this to
$$
0 = \left(\mathbf{K} - \mathbf{R}_D\right)\langle \vec{m^2} \rangle + \mathbf{R}_A \sum_{m=0}^{\infty} m^2\, \vec{p}(m-1) + \gamma \sum_{m=0}^{\infty} m^2(m+1)\,\vec{p}(m+1) - \gamma \langle \vec{m^3} \rangle.
\tag{S36}
$$
Reindexing the sums of the middle two terms proceeds just as it did for the zeroth and first moments. Explicitly, we have
$$
\sum_{m=0}^{\infty} m^2\, \vec{p}(m-1) = \sum_{m'=0}^{\infty} (m'+1)^2\, \vec{p}(m') = \langle \vec{m^2} \rangle + 2\langle \vec{m} \rangle + \langle \vec{m^0} \rangle
\tag{S37}
$$
for the first sum and
$$
\sum_{m=0}^{\infty} m^2(m+1)\,\vec{p}(m+1) = \sum_{m'=0}^{\infty} (m'-1)^2\, m'\, \vec{p}(m') = \langle \vec{m^3} \rangle - 2\langle \vec{m^2} \rangle + \langle \vec{m} \rangle
\tag{S38}
$$
for the second. Substituting the results of the sums back in Eq. S36 gives
$$
0 = \left(\mathbf{K} - \mathbf{R}_D\right)\langle \vec{m^2} \rangle + \mathbf{R}_A\left(\langle \vec{m^2} \rangle + 2\langle \vec{m} \rangle + \langle \vec{m^0} \rangle\right) + \gamma\left(\langle \vec{m^3} \rangle - 2\langle \vec{m^2} \rangle + \langle \vec{m} \rangle\right) - \gamma \langle \vec{m^3} \rangle,
\tag{S39}
$$
and after grouping like powers of $m$ we have
$$
0 = \left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A - 2\gamma\mathbb{I}\right)\langle \vec{m^2} \rangle + \left(2\mathbf{R}_A + \gamma\mathbb{I}\right)\langle \vec{m} \rangle + \mathbf{R}_A \langle \vec{m^0} \rangle.
\tag{S40}
$$
As we found when computing $\langle m \rangle$ from $\langle \vec{m} \rangle$, we can spare ourselves some algebra by multiplying Eq. S40 by $\vec{1}^\dagger$, which then reduces to
$$
0 = -2\gamma\langle m^2 \rangle + 2\,\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle + \gamma \langle m \rangle + \vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle,
\tag{S41}
$$
and noting from Eq. S34 that $\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle = \gamma\langle m \rangle$, we have the tidy result
$$
\langle m^2 \rangle = \frac{1}{\gamma}\,\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle + \langle m \rangle.
\tag{S42}
$$
Finally we have all the preliminary results needed to write a general expression for the Fano factor $\nu$. The Fano factor is defined as the ratio of variance to mean, which can be written as
$$
\nu = \frac{\langle m^2 \rangle - \langle m \rangle^2}{\langle m \rangle}
\tag{S43}
$$
and simplified to
$$
\nu = 1 + \frac{\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle}{\gamma \langle m \rangle} - \langle m \rangle.
\tag{S44}
$$
Note a subtle notational trap here: $\langle m \rangle = \frac{1}{\gamma}\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle$, but the by-eye similar expression is wrong, $\langle m \rangle \neq \frac{1}{\gamma}\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle$, so the last term in Eq. S44 is in general quite nontrivial. For a generic promoter, Eq. S44 may be greater than, less than, or equal to one, as asserted in Section 3. We have not found the general form of Eq. S44 terribly intuitive and instead defer discussion to specific examples.

S1.3.4 Summary of general results
For ease of reference, we collect and reprint here the key results derived in this section that are used in the main text and subsequent subsections. The mean mRNA copy number and Fano factor are given by Eqs. S34 and S44, which are
$$
\langle m \rangle = \frac{1}{\gamma}\,\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m^0} \rangle
\tag{S45}
$$
and
$$
\nu = 1 + \frac{\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle}{\gamma \langle m \rangle} - \langle m \rangle,
\tag{S46}
$$
respectively. To compute these two quantities, we need the expressions for $\langle \vec{m^0} \rangle$ and $\langle \vec{m} \rangle$ given by solving Eqs. S24 and S30, respectively, which are
$$
0 = \left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A\right)\langle \vec{m^0} \rangle
\tag{S47}
$$
and
$$
0 = \left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A - \gamma\mathbb{I}\right)\langle \vec{m} \rangle + \mathbf{R}_A \langle \vec{m^0} \rangle.
\tag{S48}
$$
Some comments are in order before we consider particular models. First, note that to obtain $\langle m \rangle$ and $\nu$, we need not bother solving for all components of the vectors $\langle \vec{m^0} \rangle$ and $\langle \vec{m} \rangle$, but only the components which are multiplied by nonzero elements of $\mathbf{R}_A$. The only component of $\langle \vec{m^0} \rangle$ that ever survives is the transcriptionally active state, and for the models we consider here, there is only ever one such state. This will save us some amount of algebra below.
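The recipe summarized above is entirely mechanical, so it can be captured in a short numerical routine that applies Eqs. S45-S48 to any set of matrices. A sketch (the helper name `mean_and_fano` is ours, and the example rates are illustrative), demonstrated on model 1:

```python
import numpy as np

def mean_and_fano(K, R_A, R_D, gamma):
    """Solve Eqs. S47 and S48 numerically, then apply Eqs. S45 and S46."""
    n = K.shape[0]
    A = K - R_D + R_A
    # <m^0>: replace one (redundant) row of A with the normalization condition
    M0 = A.copy()
    M0[0, :] = 1.0
    b0 = np.zeros(n)
    b0[0] = 1.0
    m0 = np.linalg.solve(M0, b0)                 # promoter state probabilities
    # <m~>: (K - R_D + R_A - gamma*I) <m~> = -R_A <m^0>
    mvec = np.linalg.solve(A - gamma * np.eye(n), -R_A @ m0)
    mean = (R_A @ m0).sum() / gamma              # Eq. S45
    fano = 1.0 + (R_A @ mvec).sum() / (gamma * mean) - mean   # Eq. S46
    return mean, fano

# Example: model 1 with illustrative rates k_R^- = 1, k_R^+ = 2, r = 10, gamma = 1.
K = np.array([[-1.0, 2.0], [1.0, -2.0]])
R = np.array([[0.0, 0.0], [0.0, 10.0]])          # R_A = R_D for model 1
mean, fano = mean_and_fano(K, R, R, gamma=1.0)
```

For these rates the analytic results worked out below give $\langle m \rangle = (r/\gamma)\, k_R^-/(k_R^- + k_R^+) = 10/3$ and, from the Jones et al. expression quoted in Section S1.7.2, $\nu = 1 + r k_R^+/[(k_R^- + k_R^+)(k_R^- + k_R^+ + \gamma)] = 8/3$.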
Also note that we are computing Fano factors to verify the results of Section 3, concerning the constitutive promoter models in Figure 2 which are analogs of the simple repression models in Figure 1. We can translate the matrices from the simple repression models to the constitutive case by simply substituting all occurrences of repressor rates by zero and removing the row and column corresponding to the repressor bound state. The results for $\langle m \rangle$ computed in the repressed case can be easily translated to the constitutive case, rather than recalculating from scratch, by taking the limit $k_R^+ \to 0$, since this amounts to sending the repressor copy number to zero.
Finally, we point out that it would be possible to compute $\langle \vec{m^0} \rangle$ more simply using the diagram methods from King and Altman [40] (also independently discovered by Hill [41]). But to our knowledge this method cannot be applied to compute $\langle \vec{m} \rangle$ or $\nu$, so since we would need to resort to solving the matrix equations anyway for $\langle \vec{m} \rangle$, we choose not to introduce the extra conceptual burden of the diagram methods simply for computing $\langle \vec{m^0} \rangle$.

S1.4 Nonequilibrium Model One - Poisson Promoter

S1.4.1 Mean mRNA

For nonequilibrium model 1 in Figure 1, we have already shown the full master equation in Eq. 10 and Eq. S6, but for completeness we reprint it again as
$$
\frac{d}{dt}
\begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}
=
\begin{pmatrix} -k_R^- & k_R^+ \\ k_R^- & -k_R^+ \end{pmatrix}
\begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}
- \begin{pmatrix} 0 & 0 \\ 0 & r \end{pmatrix}
\begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}
+ \begin{pmatrix} 0 & 0 \\ 0 & r \end{pmatrix}
\begin{pmatrix} p_R(m-1,t) \\ p_U(m-1,t) \end{pmatrix}
+ \gamma(m+1)\begin{pmatrix} p_R(m+1,t) \\ p_U(m+1,t) \end{pmatrix}
- \gamma m \begin{pmatrix} p_R(m,t) \\ p_U(m,t) \end{pmatrix}.
\tag{S49}
$$
This is a direct transcription of the states and rates in Figure 1. It may be converted to the matrix form of the master equation shown in Eq. S11 with matrices
$$
\mathbf{K} = \begin{pmatrix} -k_R^- & k_R^+ \\ k_R^- & -k_R^+ \end{pmatrix},
\qquad
\mathbf{R}_A = \mathbf{R}_D = \begin{pmatrix} 0 & 0 \\ 0 & r \end{pmatrix},
\tag{S50}
$$
where $\mathbf{R}_A$ and $\mathbf{R}_D$ are equal, so we drop the subscript and denote both simply by $\mathbf{R}$. Since our interest is only in steady state, we drop the time dependence as well.
First we need $\langle \vec{m^0} \rangle$. Label its components as $p_R$ and $p_U$, the probabilities of finding the system in either promoter state, and note that only $p_U$ survives multiplication by $\mathbf{R}$, since
$$
\mathbf{R}\,\langle \vec{m^0} \rangle = \begin{pmatrix} 0 & 0 \\ 0 & r \end{pmatrix}\begin{pmatrix} p_R \\ p_U \end{pmatrix} = \begin{pmatrix} 0 \\ r\, p_U \end{pmatrix},
\tag{S51}
$$
so we need not bother finding $p_R$. Then, since $\mathbf{R}_A = \mathbf{R}_D$, Eq. S47 reduces to $\mathbf{K}\langle \vec{m^0} \rangle = 0$, or
$$
\begin{pmatrix} -k_R^- & k_R^+ \\ k_R^- & -k_R^+ \end{pmatrix}\begin{pmatrix} p_R \\ p_U \end{pmatrix} = 0.
\tag{S52}
$$
As mentioned earlier in Section S1.3.1, the two rows are linearly dependent, so taking only the first row and using normalization to set $p_R = 1 - p_U$ gives
$$
-k_R^-\left(1 - p_U\right) + k_R^+\, p_U = 0,
\tag{S53}
$$
which is easily solved to find
$$
p_U = \frac{k_R^-}{k_R^- + k_R^+}.
\tag{S54}
$$
Substituting this into Eq. S51, and the result of that into Eq. S45, we have
$$
\langle m \rangle = \frac{r}{\gamma}\,\frac{k_R^-}{k_R^- + k_R^+},
\tag{S55}
$$
as asserted in Eq. 13 of the main text.

S1.4.2 Fano factor
To verify that the Fano factor for model 1 in Figure 2(A) is in fact 1 as claimed in the main text, note that in the constitutive limit $p_U = 1$ and $\langle m \rangle = r/\gamma$. All elements of $\mathbf{K}$ are zero (there is only a single promoter state), so Eq. S48 reduces to
$$
-\gamma\,\langle \vec{m} \rangle + \mathbf{R}\,\langle \vec{m^0} \rangle = 0
\quad\Longrightarrow\quad
\langle \vec{m} \rangle = \frac{r}{\gamma},
\tag{S56}
$$
or, in other words, since there is only one promoter state, $\langle \vec{m} \rangle = \langle m \rangle$. Then it follows from Eq. S46 that
$$
\nu = 1 + \frac{r\,\langle m \rangle}{\gamma\,\langle m \rangle} - \langle m \rangle = 1 + \frac{r}{\gamma} - \frac{r}{\gamma} = 1,
$$
as claimed.

S1.5 Nonequilibrium Model Two - RNAP Bound and Unbound States

S1.5.1 Mean mRNA

As shown earlier, the full master equation for model 2 in Figure 1 is Eq. S7, which can be condensed to the matrix form of Eq. S11 with matrices given by
$$
\mathbf{K} = \begin{pmatrix}
-k_R^- & k_R^+ & 0 \\
k_R^- & -k_R^+ - k_P^+ & k_P^- \\
0 & k_P^+ & -k_P^-
\end{pmatrix}
\tag{S58}
$$
and
$$
\mathbf{R}_A = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & r \\ 0 & 0 & 0 \end{pmatrix},
\qquad
\mathbf{R}_D = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & r \end{pmatrix}.
\tag{S59}
$$
As for model 1, we must first find $\mathbf{R}_A\langle \vec{m^0} \rangle$. Denote the components of $\langle \vec{m^0} \rangle$ as $p_R$, $p_E$, $p_P$, the probabilities of being found in the repressor bound, empty, or RNAP-bound states, respectively. Only $p_P$ is necessary to find, since
$$
\mathbf{R}_A\,\langle \vec{m^0} \rangle = \begin{pmatrix} 0 \\ r\, p_P \\ 0 \end{pmatrix}.
\tag{S60}
$$
Then Eq. S47 for $\langle \vec{m^0} \rangle$ reads
$$
\begin{pmatrix}
-k_R^- & k_R^+ & 0 \\
k_R^- & -k_R^+ - k_P^+ & k_P^- + r \\
0 & k_P^+ & -k_P^- - r
\end{pmatrix}
\begin{pmatrix} p_R \\ p_E \\ p_P \end{pmatrix} = 0.
\tag{S61}
$$
Discarding the middle row as redundant and incorporating the normalization condition leads to a set of three linearly independent equations, namely
$$
-k_R^-\, p_R + k_R^+\, p_E = 0,
\tag{S62}
$$
$$
k_P^+\, p_E - \left(k_P^- + r\right) p_P = 0,
\tag{S63}
$$
$$
p_R + p_E + p_P = 1.
\tag{S64}
$$
Using $p_R = 1 - p_E - p_P$ to eliminate $p_R$ in the first and solving the resulting equation for $p_E$ gives $p_E = (1 - p_P)\, k_R^-/(k_R^- + k_R^+)$. Substituting this for $p_E$ in the second equation gives an equation in $p_P$ alone, which is
$$
\frac{k_P^+\, k_R^-\left(1 - p_P\right)}{k_R^- + k_R^+} = \left(k_P^- + r\right) p_P,
\tag{S65}
$$
and solving for $p_P$ gives
$$
p_P = \frac{k_R^-\, k_P^+}{\left(k_R^- + k_R^+\right)\left(k_P^- + r\right) + k_R^-\, k_P^+}.
\tag{S66}
$$
Substituting this into Eq. S60 produces
$$
\mathbf{R}_A\,\langle \vec{m^0} \rangle = \begin{pmatrix} 0 \\ r\, p_P \\ 0 \end{pmatrix}
= \begin{pmatrix} 0 \\ \dfrac{r\, k_R^-\, k_P^+}{\left(k_R^- + k_R^+\right)\left(k_P^- + r\right) + k_R^-\, k_P^+} \\ 0 \end{pmatrix},
\tag{S67}
$$
from which, via Eq. S45,
$$
\langle m \rangle = \frac{r}{\gamma}\,\frac{k_R^-\, k_P^+}{\left(k_R^- + k_R^+\right)\left(k_P^- + r\right) + k_R^-\, k_P^+}
\tag{S68}
$$
follows readily, as claimed in Eq. 18 in the main text.
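The closed form for $p_P$ can be cross-checked against a direct numerical solution of the linear system, replacing one redundant row with the normalization condition. A sketch with illustrative rates:

```python
import numpy as np

# Check the closed form for p_P (Eq. S66) against a direct solve of Eq. S61.
# Rates are arbitrary illustrative values.
kRm, kRp = 1.0, 2.0        # k_R^-, k_R^+
kPp, kPm = 3.0, 4.0        # k_P^+, k_P^-
r, gamma = 10.0, 1.0

A = np.array([[-kRm,        kRp,      0.0],
              [ kRm, -kRp - kPp,  kPm + r],
              [ 0.0,        kPp, -kPm - r]])
A_norm = A.copy()
A_norm[0, :] = 1.0                     # replace one row by normalization
b = np.array([1.0, 0.0, 0.0])
pR, pE, pP = np.linalg.solve(A_norm, b)

pP_closed = kRm * kPp / ((kRm + kRp) * (kPm + r) + kRm * kPp)
mean_mrna = (r / gamma) * pP_closed    # Eq. S68
```

For these rates the closed form evaluates to $p_P = 1/15$, so the mean mRNA is $2/3$.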

S1.5.2 Fano factor
To compute the Fano factor, we first remove the repressor bound state from the matrices describing the model, which reduce to
$$
\mathbf{K} = \begin{pmatrix} -k_P^+ & k_P^- \\ k_P^+ & -k_P^- \end{pmatrix},
\qquad
\mathbf{R}_A = \begin{pmatrix} 0 & r \\ 0 & 0 \end{pmatrix},
\qquad
\mathbf{R}_D = \begin{pmatrix} 0 & 0 \\ 0 & r \end{pmatrix}.
\tag{S69}
$$
Similarly we remove the repressor bound state from $\mathbf{R}_A\langle \vec{m^0} \rangle$ and take the $k_R^+ \to 0$ limit, which simplifies to
$$
\mathbf{R}_A\,\langle \vec{m^0} \rangle = \begin{pmatrix} r\, p_P \\ 0 \end{pmatrix},
\qquad
p_P = \frac{k_P^+}{k_P^+ + k_P^- + r}.
\tag{S70}
$$
Then we must compute $\langle \vec{m} \rangle$ from Eq. S48, which with these matrices reads
$$
\begin{pmatrix} -k_P^+ - \gamma & k_P^- + r \\ k_P^+ & -k_P^- - r - \gamma \end{pmatrix}
\begin{pmatrix} m_E \\ m_P \end{pmatrix}
= \begin{pmatrix} -r\, p_P \\ 0 \end{pmatrix},
\tag{S71}
$$
where we labeled the components of $\langle \vec{m} \rangle$ as $m_E$ and $m_P$, since they are the mean mRNA counts conditional upon the system residing in the empty or polymerase bound states, respectively. Unlike for $\langle \vec{m^0} \rangle$, the rows of this matrix are linearly independent, so we simply solve the matrix equation as is. We can immediately eliminate $m_E$ since $m_E = m_P\left(k_P^- + r + \gamma\right)/k_P^+$ from the second row, and substituting into the first row gives an equation for $m_P$ alone, which is
$$
-\left(k_P^+ + \gamma\right)\frac{k_P^- + r + \gamma}{k_P^+}\, m_P + \left(k_P^- + r\right) m_P = -r\, p_P.
\tag{S72}
$$
Expanding the products cancels several terms, and solving for $m_P$ gives
$$
m_P = \frac{r\, p_P\, k_P^+}{\gamma\left(k_P^+ + k_P^- + r + \gamma\right)}.
$$

(S73)
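Since this expression for $m_P$ is reached only after several cancellations, a quick numerical cross-check against a direct solve of the 2x2 system is reassuring. A sketch with illustrative rates, also verifying the simplified Fano factor form quoted as Eq. 36 in the main text:

```python
import numpy as np

# Check m_P (Eq. S73) against a direct solve of Eq. S71 in the constitutive
# limit, then check the simplified Fano factor (Eq. 36 of the main text).
# Rates are arbitrary illustrative values.
kPp, kPm = 3.0, 4.0        # k_P^+, k_P^-
r, gamma = 10.0, 1.0

pP = kPp / (kPp + kPm + r)                        # Eq. S70
A = np.array([[-kPp - gamma,          kPm + r],
              [         kPp, -kPm - r - gamma]])
mE, mP = np.linalg.solve(A, np.array([-r * pP, 0.0]))

mP_closed = r * pP * kPp / (gamma * (kPp + kPm + r + gamma))
mean = (r / gamma) * pP
fano = 1.0 + r * mP / (gamma * mean) - mean       # Eq. S46
fano_closed = 1.0 - r * kPp / ((kPp + kPm + r) * (kPp + kPm + r + gamma))
```

Both routes should agree, and the resulting Fano factor is below one, as the main text asserts for this model.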
Note then that $\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle = r\, m_P$. We also need the constitutive limit of $\langle m \rangle$ from Eq. S68, again found by taking $k_R^+ \to 0$, which is
$$
\langle m \rangle = \frac{r}{\gamma}\,\frac{k_P^+}{k_P^+ + k_P^- + r},
\tag{S74}
$$
and substituting this along with $\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle = r\, m_P$ into Eq. S46 for the Fano factor $\nu$, we find
$$
\nu = 1 + \frac{r\, m_P}{\gamma\,\langle m \rangle} - \langle m \rangle.
\tag{S75}
$$
This simplifies to
$$
\nu = 1 + \frac{r\, k_P^+}{\gamma\left(k_P^+ + k_P^- + r + \gamma\right)} - \frac{r\, k_P^+}{\gamma\left(k_P^+ + k_P^- + r\right)},
\tag{S76}
$$
which further simplifies to
$$
\nu = 1 - \frac{r\, k_P^+}{\left(k_P^+ + k_P^- + r\right)\left(k_P^+ + k_P^- + r + \gamma\right)},
$$
exactly Eq. 36 in the main text.

S1.6 Nonequilibrium Model Three - Multistep Transcription Initiation and Escape

S1.6.1 Mean mRNA

In close analogy to model 2 above, nonequilibrium model 3 from Figure 1(C) can be described by our generic master equation Eq. S11 with promoter transition matrix given by
$$
\mathbf{K} = \begin{pmatrix}
-k_R^- & k_R^+ & 0 & 0 \\
k_R^- & -k_R^+ - k_P^+ & k_P^- & 0 \\
0 & k_P^+ & -k_P^- - k_O & 0 \\
0 & 0 & k_O & 0
\end{pmatrix}
\tag{S77}
$$
and transcription matrices given by
$$
\mathbf{R}_A = \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & r \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix},
\qquad
\mathbf{R}_D = \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & r
\end{pmatrix}.
\tag{S78, S79}
$$
$\langle \vec{m^0} \rangle$ is again given by Eq. S47, which in this case takes the form
$$
\begin{pmatrix}
-k_R^- & k_R^+ & 0 & 0 \\
k_R^- & -k_R^+ - k_P^+ & k_P^- & r \\
0 & k_P^+ & -k_P^- - k_O & 0 \\
0 & 0 & k_O & -r
\end{pmatrix}
\begin{pmatrix} p_R \\ p_E \\ p_C \\ p_O \end{pmatrix} = 0,
\tag{S80}
$$
where the four components of $\langle \vec{m^0} \rangle$ correspond to the four promoter states: repressor bound, empty, RNAP-bound closed complex, and RNAP-bound open complex. As explained in Section S1.3.1, we are free to discard one linearly dependent row from this matrix and replace it with the normalization condition $p_R + p_E + p_C + p_O = 1$. Using normalization to eliminate $p_R$ from the first row gives
$$
p_E = \frac{k_R^-\left(1 - p_C - p_O\right)}{k_R^- + k_R^+}.
\tag{S81}
$$
If we substitute this in the third row, then the last two rows constitute two equations in $p_C$ and $p_O$, given by
$$
\frac{k_P^+\, k_R^-\left(1 - p_C - p_O\right)}{k_R^- + k_R^+} - \left(k_P^- + k_O\right) p_C = 0
\tag{S82}
$$
and
$$
k_O\, p_C - r\, p_O = 0.
\tag{S83}
$$
Solving for $p_C = p_O\, r/k_O$ in the second and substituting into the first gives us our desired single equation in the single variable $p_O$, which is
$$
\frac{k_P^+\, k_R^-}{k_R^- + k_R^+}\left(1 - p_O\,\frac{r}{k_O} - p_O\right) = \left(k_P^- + k_O\right)\frac{r}{k_O}\, p_O,
\tag{S84}
$$
and solving for $p_O$ we find
$$
p_O = \frac{k_R^-\, k_P^+\, k_O}{k_R^-\, k_P^+\left(r + k_O\right) + r\left(k_R^- + k_R^+\right)\left(k_P^- + k_O\right)}.
$$

(S85)
Once again $p_O$, the probability of the transcriptionally active state, is the only component of $\langle \vec{m^0} \rangle$ that survives multiplication by $\mathbf{R}_A$, and applying Eq. S45 gives
$$
\langle m \rangle = \frac{r}{\gamma}\, p_O = \frac{r}{\gamma}\,\frac{k_R^-\, k_P^+\, k_O}{k_R^-\, k_P^+\left(r + k_O\right) + r\left(k_R^- + k_R^+\right)\left(k_P^- + k_O\right)},
\tag{S86}
$$
which equals Eq. 25 in the main text.
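As with model 2, the closed form for $p_O$ can be cross-checked numerically by solving Eq. S80 with one row replaced by the normalization condition. A sketch with illustrative rates:

```python
import numpy as np

# Check the closed form for p_O (Eq. S85) against a direct solve of Eq. S80.
# Rates are arbitrary illustrative values.
kRm, kRp = 1.0, 2.0        # k_R^-, k_R^+
kPp, kPm = 3.0, 4.0        # k_P^+, k_P^-
kO, r = 5.0, 10.0          # open-complex formation and initiation rates

A = np.array([[-kRm,        kRp,        0.0, 0.0],
              [ kRm, -kRp - kPp,        kPm,   r],
              [ 0.0,        kPp, -kPm - kO, 0.0],
              [ 0.0,        0.0,         kO,  -r]])
A_norm = A.copy()
A_norm[0, :] = 1.0
pR, pE, pC, pO = np.linalg.solve(A_norm, np.array([1.0, 0.0, 0.0, 0.0]))

pO_closed = kRm * kPp * kO / (kRm * kPp * (r + kO)
                              + r * (kRm + kRp) * (kPm + kO))
```

For these rates both routes should give $p_O = 1/21$.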

S1.6.2 Fano factor
To compute the Fano factor of the analogous constitutive promoter, we first excise the repressor state and its rates from the problem. More precisely, we construct the matrix $\left(\mathbf{K} - \mathbf{R}_D + \mathbf{R}_A - \gamma\mathbb{I}\right)$ for the three remaining states and substitute it into Eq. S48, which is now
$$
\begin{pmatrix}
-k_P^+ - \gamma & k_P^- & r \\
k_P^+ & -k_P^- - k_O - \gamma & 0 \\
0 & k_O & -r - \gamma
\end{pmatrix}
\begin{pmatrix} m_E \\ m_C \\ m_O \end{pmatrix}
= \begin{pmatrix} -r\, p_O \\ 0 \\ 0 \end{pmatrix},
\tag{S87}
$$
where we labeled the unbound, closed complex, and open complex components of $\langle \vec{m} \rangle$ as $m_E$, $m_C$, and $m_O$, respectively. $p_O$ is given by the limit of Eq. S85 as $k_R^+ \to 0$, which is
$$
p_O = \frac{k_P^+\, k_O}{Z},
\qquad
Z \equiv k_P^+\left(r + k_O\right) + r\left(k_P^- + k_O\right),
\tag{S88}
$$
where we define $Z$ for upcoming convenience, as this sum of terms will appear repeatedly. We can use the second equation to eliminate $m_E$, finding $m_E = m_C\left(k_P^- + k_O + \gamma\right)/k_P^+$, and the third to eliminate $m_C$, which is simply $m_C = m_O\left(r + \gamma\right)/k_O$. Substituting these both into the first equation gives a single equation for the variable of interest, $m_O$,
$$
-\left(k_P^+ + \gamma\right)\frac{\left(r+\gamma\right)\left(k_P^- + k_O + \gamma\right)}{k_O\, k_P^+}\, m_O
+ k_P^-\,\frac{r+\gamma}{k_O}\, m_O + r\, m_O = -r\, p_O.
\tag{S89}
$$
Expanding the denominator and canceling terms leads to
$$
m_O = \frac{r\, p_O\, k_O\, k_P^+}{\gamma\left[k_P^+ k_O + \left(r+\gamma\right)\left(k_P^+ + k_P^- + k_O + \gamma\right)\right]}.
\tag{S90}
$$
Note that $\vec{1}^\dagger \cdot \mathbf{R}_A \langle \vec{m} \rangle = r\, m_O$ and $\langle m \rangle = (r/\gamma)\, p_O$, so if we substitute these two quantities into Eq. S46, we will readily obtain the Fano factor as
$$
\nu = 1 + \frac{r\, m_O}{\gamma\,\langle m \rangle} - \langle m \rangle.
\tag{S91}
$$
Substituting, we see that
$$
\nu = 1 + \frac{r\, k_O\, k_P^+}{\gamma\left[k_P^+ k_O + \left(r+\gamma\right)\left(k_P^+ + k_P^- + k_O + \gamma\right)\right]}
- \frac{r\, k_P^+\, k_O}{\gamma\, Z},
\tag{S92}
$$
and after simplifying, we obtain
$$
\nu = 1 - \frac{r\, k_O\, k_P^+\left(k_P^+ + k_P^- + k_O + r + \gamma\right)}{Z\left[k_P^+ k_O + \left(r+\gamma\right)\left(k_P^+ + k_P^- + k_O + \gamma\right)\right]},
\tag{S93}
$$
as stated in Eq. 37 in the main text.

S1.6.3 Generalizing $\nu < 1$ to more fine-grained models

In the main text we argued that the convolution of multiple exponential distributions should be a distribution with a smaller fractional width than the corresponding exponential distribution with the same mean. This can be made more precise with a result from [76], who showed that the convolution of multiple gamma distributions (of which the exponential distribution is a special case) is, to a very good approximation, also gamma distributed. Using their Eq. 2 for the distribution of the convolution, with shape parameters set to 1 to give exponential distributions, the total waiting time distribution has a ratio of variance to squared mean
$$
\frac{\sigma^2}{\mu^2} = \frac{\sum_i k_i^{-2}}{\left(\sum_i k_i^{-1}\right)^2},
\tag{S94}
$$
where the $k_i$ are the rates of the individual steps. Clearly this is less than 1 for more than one step, and therefore the total waiting time distribution is narrower than the corresponding exponential.
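The variance-to-squared-mean ratio for a sum of independent exponential waiting times can be checked directly by Monte Carlo sampling. A sketch with illustrative step rates:

```python
import numpy as np

# Monte Carlo check: for a waiting time that is the sum of independent
# exponential steps with rates k_i, the ratio sigma^2/mu^2 should equal
# (sum k_i^-2)/(sum k_i^-1)^2 and be below 1. Rates are illustrative.
rng = np.random.default_rng(0)
rates = np.array([1.0, 2.0, 5.0])

n = 1_000_000
waits = sum(rng.exponential(1.0 / k, size=n) for k in rates)

ratio_mc = waits.var() / waits.mean() ** 2
ratio_analytic = np.sum(rates ** -2.0) / np.sum(rates ** -1.0) ** 2
```

The analytic ratio here is $1.29/1.7^2 \approx 0.446$, visibly narrower than the $\sigma^2/\mu^2 = 1$ of a single exponential with the same mean.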
We also claimed in the main text that for a process whose waiting time distribution is narrower than exponential, i.e., has $\sigma^2/\mu^2 < 1$, the distribution of counts should be less variable than a Poisson distribution, leading to a Fano factor $\nu < 1$. This we argue by analogy to photon statistics, where it is known that "antibunched" arrivals, in other words arrivals more uniformly distributed in time relative to uncorrelated arrivals, generally give rise to sub-Poissonian noise [77], [78]. Although loopholes to this result are known to exist, to our knowledge they appear to arise from uniquely quantum effects, so we do not expect they apply to our problem. Nevertheless we refrain from elevating this equivalence of kinetic cycles with sub-Poissonian noise to a "theorem."

S1.7 Nonequilibrium Model Four - "Active" and "Inactive" States

S1.7.1 Mean mRNA

The mathematical specification of this model is almost identical to model 2. The matrix $\mathbf{K}$ is identical, as is $\mathbf{R}_D$. The only difference is that now $\mathbf{R}_A = \mathbf{R}_D$, i.e., both are diagonal, in contrast to model 2 where $\mathbf{R}_A$ has an off-diagonal element, as in Eq. S59. Then, denoting the inactive-to-active and active-to-inactive switching rates by $k^+$ and $k^-$, the analog of Eq. S61 for finding $\langle \vec{m^0} \rangle$ is
$$
\begin{pmatrix}
-k_R^- & k_R^+ & 0 \\
k_R^- & -k_R^+ - k^+ & k^- \\
0 & k^+ & -k^-
\end{pmatrix}
\begin{pmatrix} p_R \\ p_I \\ p_A \end{pmatrix} = 0.
\tag{S95}
$$
In fact we need not do this calculation explicitly and can instead recycle the calculation of the mean mRNA $\langle m \rangle$ from model 2. The matrices are identical except for the relabeling $\left(k_P^- + r\right) \to k^-$ (along with $k_P^+ \to k^+$), and a careful look through the derivation of $\langle m \rangle$ for model 2 shows that the parameters $k_P^-$ and $r$ only ever appear as the sum $k_P^- + r$. So taking $\langle m \rangle$ from model 2, Eq. S68, and relabeling $\left(k_P^- + r\right) \to k^-$ gives us our answer for model 4, simply
$$
\langle m \rangle = \frac{r}{\gamma}\,\frac{k_R^-\, k^+}{\left(k_R^- + k_R^+\right) k^- + k_R^-\, k^+}.
\tag{S96}
$$

S1.7.2 Fano factor
Likewise, for computing the Fano factor of this model we may take a shortcut. Consider the constitutive model 4 from Figure 2, for which we want to compute the Fano factor, and compare it to nonequilibrium model 1 of simple repression in Figure 1. Mathematically these are exactly the same model, just with rates labeled differently and the meaning of the promoter states interpreted differently. Furthermore, nonequilibrium model 1 from Figure 1 was the model considered by Jones et al.
[36], where they derived the Fano factor for that model to be
$$
\nu = 1 + \frac{r\, k_R^+}{\left(k_R^+ + k_R^-\right)\left(k_R^+ + k_R^- + \gamma\right)}.
\tag{S97}
$$
So recognizing that the relabelings $k_R^+ \to k^-$ and $k_R^- \to k^+$ will translate this result to our model 4 from Figure 2, we can immediately write down our Fano factor as
$$
\nu = 1 + \frac{r\, k^-}{\left(k^+ + k^-\right)\left(k^+ + k^- + \gamma\right)}.
\tag{S98}
$$

S2 Derivation for the bursty promoter model

The objective of this section is to write down the steady-state mRNA distribution for model 5 in Figure 2. Our claim is that this model is rich enough that it can capture the expression pattern of bacterial constitutive promoters. Figure S2 shows two different schematic representations of the model. Figure S2(A) shows the promoter cartoon model with burst initiation rate $k_i$, mRNA degradation rate $\gamma$, and mean burst size $b$. For our derivation of the chemical master equation we will focus more on Figure S2(B). This representation is intended to highlight that bursty gene expression allows transitions between mRNA counts $m$ and $m'$ even when $|m - m'| > 1$. To derive the master equation we begin by considering the possible state transitions that "enter" state $m$. There are two possible paths to jump from an mRNA count $m' \neq m$ to a state $m$ in a small time window $\Delta t$:

1. By degradation of a single mRNA, jumping from $m+1$ to $m$.
2. By producing $m - m'$ mRNAs, for $m' \in \{0, 1, \ldots, m-1\}$.

For the "exit" transitions from state $m$ into $m' \neq m$ during a small time window $\Delta t$ we also have two possibilities:

1. By degradation of a single mRNA, jumping from $m$ to $m-1$.
2. By producing $m' - m$ mRNAs, for $m' - m \in \{1, 2, \ldots\}$.

This implies that the probability of having $m$ mRNA at time $t + \Delta t$ can be written as
$$
p(m, t+\Delta t) = p(m,t)
+ \gamma\Delta t\,(m+1)\, p(m+1,t)
- \gamma\Delta t\, m\, p(m,t)
+ k_i \Delta t \sum_{m'=0}^{m-1} G_{m-m'}\, p(m',t)
- k_i \Delta t\, p(m,t) \sum_{m'=m+1}^{\infty} G_{m'-m},
\tag{S99}
$$
where we indicate by $G_{m'-m}$ the probability of having a burst of size $m'-m$, i.e., of the number of mRNAs jumping from $m$ to $m' > m$ due to a single mRNA transcription burst. We suggestively use the letter $G$ as we will assume that these burst sizes are geometrically distributed with parameter $\theta$. This is written as
$$
G_k = \theta\left(1-\theta\right)^k, \qquad k \in \{0, 1, 2, \ldots\}.
\tag{S100}
$$
In Section 3 of the main text we derive this functional form for the burst size distribution. An intuitive way to think about it is that for transcription initiation events that take place instantaneously there are two competing possibilities: producing another mRNA with probability $(1-\theta)$, or ending the burst with probability $\theta$. What this implies is that for a geometrically distributed burst size we have a mean burst size $b$ of the form
$$
b = \sum_{k=0}^{\infty} k\, G_k = \frac{1-\theta}{\theta}.
\tag{S101}
$$
To clean up Eq. S99 we can send the first term on the right-hand side to the left, and divide both sides by $\Delta t$. Upon taking the limit $\Delta t \to 0$ we can write
$$
\frac{dp(m,t)}{dt}
= \gamma(m+1)\, p(m+1,t) - \gamma m\, p(m,t)
+ k_i \sum_{m'=0}^{m-1} G_{m-m'}\, p(m',t)
- k_i\, p(m,t) \sum_{k=1}^{\infty} G_k.
\tag{S102}
$$
Furthermore, given that the timescale for this equation is set by the mRNA degradation rate $\gamma$, we can divide both sides by this rate, obtaining
$$
\frac{dp(m,\tau)}{d\tau}
= (m+1)\, p(m+1,\tau) - m\, p(m,\tau)
+ \lambda \sum_{m'=0}^{m-1} G_{m-m'}\, p(m',\tau)
- \lambda\, p(m,\tau) \sum_{k=1}^{\infty} G_k,
\tag{S103}
$$
where we defined $\tau \equiv t \times \gamma$ and $\lambda \equiv k_i/\gamma$. The last term in Eq. S103 sums all burst sizes except for a burst of size zero. We can re-index the sum to include this term, obtaining
$$
\sum_{k=1}^{\infty} G_k = \sum_{k=0}^{\infty} G_k - G_0.
\tag{S104}
$$
Given the normalization constraint of the geometric distribution, adding the probability of all possible burst sizes, including size zero since we re-indexed the sum, allows us to write
$$
\sum_{k=0}^{\infty} G_k = 1
\quad\Longrightarrow\quad
\sum_{k=1}^{\infty} G_k = 1 - G_0.
\tag{S105}
$$
Substituting this into Eq. S103 results in
$$
\frac{dp(m,\tau)}{d\tau}
= (m+1)\, p(m+1,\tau) - m\, p(m,\tau)
+ \lambda \sum_{m'=0}^{m-1} G_{m-m'}\, p(m',\tau)
- \lambda\left(1 - G_0\right) p(m,\tau).
\tag{S106}
$$
To finally arrive at a more compact version of the equation, notice that the third term in Eq. S106 includes bursts from size $m - m' = 1$ to size $m - m' = m$. We can include the term $\lambda\, G_0\, p(m,\tau)$ in the sum, which then also allows bursts of size $m - m' = 0$. This results in our final form for the chemical master equation
$$
\frac{dp(m,\tau)}{d\tau}
= (m+1)\, p(m+1,\tau) - m\, p(m,\tau)
+ \lambda \sum_{m'=0}^{m} G_{m-m'}\, p(m',\tau)
- \lambda\, p(m,\tau).
\tag{S107}
$$
In order to solve Eq.
S107 we will use the generating function method [79]. The probability generating function is defined as
$$
F(z, \tau) = \sum_{m=0}^{\infty} z^m\, p(m, \tau),
\tag{S108}
$$
where $z$ is just a dummy variable that will help us later on to obtain the moments of the distribution. Let us now multiply both sides of Eq. S107 by $z^m$ and sum over all $m$,
$$
\sum_m z^m \frac{dp(m,\tau)}{d\tau}
= \sum_m z^m (m+1)\, p(m+1,\tau)
- \sum_m z^m m\, p(m,\tau)
+ \lambda \sum_m z^m \sum_{m'=0}^{m} G_{m-m'}\, p(m',\tau)
- \lambda \sum_m z^m p(m,\tau),
\tag{S109}
$$
where we use $\sum_m \equiv \sum_{m=0}^{\infty}$. We can distribute the sum and use the definition of $F(z,\tau)$ to obtain
$$
\frac{\partial F(z,\tau)}{\partial \tau}
= \sum_m z^m (m+1)\, p(m+1,\tau)
- \sum_m z^m m\, p(m,\tau)
+ \lambda \sum_m z^m \sum_{m'=0}^{m} G_{m-m'}\, p(m',\tau)
- \lambda F(z,\tau).
\tag{S110}
$$
We can make use of properties of the generating function to write everything in terms of $F(z,\tau)$. For the first term on the right-hand side of Eq. S110 we define $k \equiv m+1$, which allows us to write
$$
\sum_{m=0}^{\infty} z^m (m+1)\, p(m+1,\tau)
= \sum_{k=1}^{\infty} z^{k-1}\, k\, p(k,\tau)
= \sum_{k=0}^{\infty} z^{k-1}\, k\, p(k,\tau)
= \frac{\partial F(z,\tau)}{\partial z}.
$$
For the second term on the right-hand side of Eq. S110 we similarly find
$$
\sum_{m=0}^{\infty} z^m\, m\, p(m,\tau) = z\,\frac{\partial F(z,\tau)}{\partial z}.
$$
(S119) 0,0 0,0 1,0 1,0 2,0 2,0 3,0 3,0 1,1 1,1 2,1 2,1 3,1 3,1 2,2 2,2 3,2 3,2 3,3 3,3 The third term in Eq. S110 is the most trouble. The trick is to reverse the default order of the sums as (S120) To see the logic of the sum we point the reader to Figure S3. The key is to notice that the double sum where we also substituted the definition of the geometric distribution G k = ✓(1 ✓) k . Redistributing the sums we can write The next step requires us to look slightly ahead into what we expect to obtain. We are working on deriving an equation for the generating function F (z, ⌧ ) that when solved will allow us to compute what we care about, i.e. the probability function p(m, ⌧ ). Upon finding the function for F (z, ⌧ ), we will recover this probability distribution by evaluating derivatives of F (z, ⌧ ) at z = 0, whereas we can evaluate derivatives of F (z, ⌧ ) at z = 1 to instead recover the moments of the distribution. The point here is that when the dust settles we will evaluate z to be less than or equal to one. Furthermore, we know that the parameter of the geometric distribution ✓ must be strictly between zero and one. With these two facts we can safely state that |z(1 ✓)| < 1. Defining n ⌘ m m 0 we rewrite the last sum in Eq. S124 as where we use the geometric series since, as stated before, |z(1 ✓)| < 1. Putting these results together, the PDE for the generating function is @F @⌧ . (S128) Changing variables to ⇠ = 1 ✓ and simplifying gives @F @⌧
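As a quick sanity check on the burst-size story above, the following short simulation (our own sketch, not code from the paper) draws burst sizes by repeatedly continuing a burst with probability 1 − θ and stopping with probability θ, and confirms that the sample mean approaches b = (1 − θ)/θ:

```python
import random

def sample_burst(theta, rng):
    """Draw one geometric burst size: add another mRNA with probability
    (1 - theta), end the burst with probability theta."""
    size = 0
    while rng.random() > theta:  # continue the burst
        size += 1
    return size

rng = random.Random(0)
theta = 0.25
bursts = [sample_burst(theta, rng) for _ in range(200_000)]
mean_b = sum(bursts) / len(bursts)
# The sample mean should be close to b = (1 - theta) / theta = 3.
```

Here P(size = k) = θ(1 − θ)^k, exactly the burst-size distribution G of Eq. S100, so the empirical mean converges to (1 − θ)/θ.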

S2.1.2 Steady-state
To obtain the mRNA distribution at steady state we must solve Eq. S129 with the time derivative set to zero. At steady state the PDE reduces to the ODE

dF/dz = k_i ξ F / (1 − ξz),

which we can integrate by separation of variables. The initial conditions for generating functions can be subtle and confusing. The key fact follows from the definition F(z, t) = Σ_m z^m p(m, t). Clearly normalization of the distribution requires that F(z = 1, t) = Σ_m p(m, t) = 1. A subtlety is that sometimes the generating function may be undefined at z = 1, in which case the limit as z approaches 1 from below suffices to define the normalization condition. We also warn the reader that, while it is frequently convenient to change variables from z to a different independent variable, one must carefully track how the normalization condition transforms.
Continuing on, we evaluate the integrals (producing a constant of integration c), which gives

ln F = −k_i ln(1 − ξz) + c, (S132)

or, upon exponentiating,

F(z) = C (1 − ξz)^{−k_i}. (S133)

Only one choice of the constant can satisfy the normalization condition F(z = 1) = 1, producing

F(z) = ((1 − ξ)/(1 − ξz))^{k_i}. (S134)

S2.1.3 Recovering the steady-state probability distribution
To obtain the steady-state mRNA distribution p(m) we are after, we need to extract it from the generating function F(z) = Σ_m z^m p(m). Taking a derivative with respect to z results in dF/dz = Σ_m m z^{m−1} p(m). Setting z = 0 leaves only the m = 1 term of the sum, since in the limit lim_{x→0⁺} x^x = 1. A second derivative of the generating function would result in d²F/dz² = Σ_m m(m − 1) z^{m−2} p(m); again evaluating at z = 0 gives 2 p(2). In general any p(m) is obtained from the generating function as

p(m) = (1/m!) dᵐF/dzᵐ |_{z=0}. (S140)

Let us now look at the general form of the derivatives of our generating function, Eq. S134. For p(0) we simply evaluate F(z = 0) directly, obtaining p(0) = (1 − ξ)^{k_i}. The first derivative results in

dF/dz = k_i ξ (1 − ξ)^{k_i} (1 − ξz)^{−(k_i + 1)}. (S142)

Evaluating this at z = 0 as required to get p(1) gives p(1) = k_i ξ (1 − ξ)^{k_i}. For the second derivative we find d²F/dz² = k_i (k_i + 1) ξ² (1 − ξ)^{k_i} (1 − ξz)^{−(k_i + 2)}, which evaluated at z = 0 gives p(2) = (1/2) k_i (k_i + 1) ξ² (1 − ξ)^{k_i}. Let us go for one more derivative to see the pattern. The third derivative of the generating function gives d³F/dz³ = k_i (k_i + 1)(k_i + 2) ξ³ (1 − ξ)^{k_i} (1 − ξz)^{−(k_i + 3)}, which again we evaluate at z = 0. If k_i were an integer we could write the product k_i (k_i + 1)(k_i + 2) as a ratio of factorials; since k_i might not be an integer, we can write it using gamma functions as Γ(k_i + 3)/Γ(k_i). Generalizing the pattern, the m-th derivative takes the form

dᵐF/dzᵐ = [Γ(k_i + m)/Γ(k_i)] ξᵐ (1 − ξ)^{k_i} (1 − ξz)^{−(k_i + m)}.

With this result we can use Eq. S140 to obtain the desired steady-state probability distribution function, p(m) = [Γ(m + k_i)/(Γ(m + 1) Γ(k_i))] ξᵐ (1 − ξ)^{k_i}. Note that the ratio of gamma functions is often expressed as a binomial coefficient, but since k_i may be non-integer, that notation would be ill-defined here. Re-expressing this exclusively in our variables of interest, the burst rate k_i and the mean burst size b, and using ξ = b/(1 + b), we have

p(m) = [Γ(m + k_i)/(Γ(m + 1) Γ(k_i))] (b/(1 + b))ᵐ (1/(1 + b))^{k_i}. (S152)
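The distribution of Eq. S152 is a negative binomial in m. A minimal numerical check (our own sketch; the values of the burst rate k_i, in units of γ, and of b are arbitrary) confirms that it is normalized and that its mean equals k_i b, the burst rate times the mean burst size:

```python
from math import gamma

def p_mRNA(m, k_i, b):
    """Steady-state bursty-promoter pmf of Eq. S152:
    p(m) = Gamma(m + k_i)/(Gamma(m + 1) Gamma(k_i)) (b/(1+b))^m (1/(1+b))^k_i."""
    return (gamma(m + k_i) / (gamma(m + 1) * gamma(k_i))
            * (b / (1 + b)) ** m * (1 + b) ** (-k_i))

k_i, b = 2.5, 3.0  # arbitrary illustrative values
probs = [p_mRNA(m, k_i, b) for m in range(150)]
total = sum(probs)                               # should be ~1
mean = sum(m * p for m, p in enumerate(probs))   # should be ~k_i * b = 7.5
```

The truncation at m = 150 is safe here because the tail mass beyond 150 copies is negligible for these parameter values.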

S2.2.1 Deriving the generating function for mRNA distribution
Let us move from a one-state promoter to a two-state promoter, where one state has repressor bound and the other produces transcriptional bursts as above. A schematic of this model is shown as model 5 in Figure 1(C). Although now we have an equation for each promoter state, the master equation otherwise reads similarly to the one-state case, except with additional terms corresponding to transitions between promoter states. Here p_R(m, t) is the probability of the system having m mRNA copies and having repressor bound to the promoter at time t, and p_A(m, t) is the analogous probability of finding the promoter without repressor bound. k_R⁺ and k_R⁻ are, respectively, the rates at which repressors bind to and unbind from the promoter, and γ is the mRNA degradation rate. k_i is the rate at which bursts initiate, and, as before, the geometric distribution of burst sizes has mean b = (1 − θ)/θ. Interestingly, it turns out that this problem maps exactly onto the three-stage promoter model considered by Shahrezaei and Swain in [23], with parameters relabeled. Their approximate solution for protein distributions amounts to the same approximation we make here in regarding the duration of mRNA synthesis bursts as instantaneous, so their solution for protein distributions also solves our problem of mRNA distributions. Let us examine the analogy more closely. They consider a two-state promoter, as we do here, but they model mRNA as being produced one at a time and degraded, with rates v₀ and d₀. They then model translation as occurring with rate v₁, and protein degradation with rate d₁, as shown in Figure S4. Now consider the limit v₁, d₀ → ∞ with the ratio v₁/d₀ held constant. v₁/d₀ resembles the average burst size of translation from a single mRNA: these are the rates of two Poisson processes that compete over a transcript, which matches the story of geometrically distributed burst sizes.
In other words, in our bursty promoter model we can think of the parameter θ as setting one competing process that ends the burst and (1 − θ) as the process that continues the burst. So after taking this limit, on timescales slow compared to v₁ and d₀, it appears that transcription events fire at rate v₀ and produce a geometrically distributed burst of translation with mean size v₁/d₀, which intuitively matches the story we have told above for mRNA, with variables relabeled.
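The competing-Poisson-processes picture can be checked directly with exponential waiting times (again a sketch with arbitrary rates, not the authors' code): each round, a translation clock with rate v₁ races a degradation clock with rate d₀, and the number of translation wins before the first degradation loss should be geometric with mean v₁/d₀:

```python
import random

def proteins_per_transcript(v1, d0, rng):
    """Count rate-v1 (translation) events until the rate-d0 (degradation)
    clock fires first; each race is decided by the smaller exponential."""
    n = 0
    while rng.expovariate(v1) < rng.expovariate(d0):
        n += 1
    return n

rng = random.Random(42)
v1, d0 = 4.0, 2.0
counts = [proteins_per_transcript(v1, d0, rng) for _ in range(200_000)]
mean_burst = sum(counts) / len(counts)
# Each race is won by translation with probability v1/(v1 + d0), so the
# count is geometric with mean v1/d0 = 2.
```

The same logic, with v₁/d₀ relabeled as b, is what justifies treating transcriptional bursts as geometrically distributed in our model.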
To verify this intuitively conjectured mapping between our problem and the solution in [23], we continue with a careful solution for the mRNA distribution using probability generating functions, following the ideas sketched in [23]. It is natural to nondimensionalize the rates in the problem by γ; equivalently, this amounts to measuring time in units of γ⁻¹. We are also only interested in steady state, so we set the time derivatives to zero, where for convenience we keep the same notation for all rates, though they are now expressed in units of the mean mRNA lifetime γ⁻¹.
The probability generating function is defined as before in the constitutive case, except now we must introduce a generating function for each promoter state,

f_R(z) = Σ_m z^m p_R(m),  f_A(z) = Σ_m z^m p_A(m).

Our real objective is the generating function f(z) = f_R(z) + f_A(z) that generates the full mRNA distribution p(m) = p_R(m) + p_A(m). As before we multiply both equations by z^m and sum over all m. Each individual term transforms exactly as did the analogous term in the constitutive case, so the coupled ODEs for the generating functions follow directly, and after changing variables to ξ = 1 − θ as before and rearranging, we obtain a pair of coupled first-order ODEs. We can transform this problem from two coupled first-order ODEs to a single second-order ODE by solving for f_A in the first and substituting into the second, where, to reduce notational clutter, we have dropped the explicit z dependence of f_A and f_R. Upon simplifying, the result can be recognized as the hypergeometric differential equation, with singularities at z = 1, z = ξ⁻¹, and z = ∞. The latter can be verified by a change of variables from z to x = 1/z, being careful with the chain rule, and noting that z = ∞ is a singular point if and only if x = 1/z = 0 is a singular point.
The standard form of the hypergeometric differential equation has its singularities at 0, 1, and ∞, so to take advantage of the standard-form solutions of this ODE we first need to transform variables to put our equation into standard form. However, this is subtle. While any such transformation should work in principle, the solutions are expressed most simply in the neighborhood of z = 0, but the normalization condition that we need to enforce corresponds to z = 1. The easiest path, therefore, is to find a change of variables that maps 1 to 0, ∞ to ∞, and ξ⁻¹ to 1. This is most intuitively done in two steps. First map the z = 1 singularity to 0 by the change of variables v = z − 1. Now two singularities are at v = 0 and v = ∞. The third is determined by (1 + v) = ξ⁻¹, or v = ξ⁻¹ − 1. We want another variable change that maps this third singularity to 1 (without moving 0 or infinity). Changing variables again to w = v/(ξ⁻¹ − 1) = [ξ/(1 − ξ)] v fits the bill. In other words, the combined change of variables

w = [ξ/(1 − ξ)] (z − 1) (S165)

maps z = {1, ξ⁻¹, ∞} to w = {0, 1, ∞} as desired. Plugging in, being mindful of the chain rule and noting that (1 + v) − ξ⁻¹ = [(1 − ξ)/ξ](w − 1), gives an equation in w. This is close to the standard form of the hypergeometric differential equation; some cancellation and rearrangement, followed by a little more algebra, produces the standard form. From this we can read off the solution in terms of hypergeometric functions ₂F₁ from any standard source, e.g. [80], and identify the conventional parameters in terms of our model parameters. We want the general solution in the neighborhood of w = 0 (z = 1), which for a homogeneous linear second-order ODE must be a sum of two linearly independent solutions. More precisely,

f_R(w) = C⁽¹⁾ ₂F₁(α, β, γ; w) + C⁽²⁾ w^{1−γ} ₂F₁(α − γ + 1, β − γ + 1, 2 − γ; w),

with the parameters determined by the model rates and constants C⁽¹⁾ and C⁽²⁾ to be set by boundary conditions. Solving for α and β, we find Eq. S171. Note that α and β are interchangeable in the definition of ₂F₁ and differ only in the sign preceding the radical.
Since the normalization condition requires that f_R be finite at w = 0, we can immediately set C⁽²⁾ = 0 to discard the second solution. This is because all the rate constants are strictly positive, so γ > 1 and therefore w^{1−γ} blows up as w → 0. Now that we have f_R, we would like to find the generating function for the full mRNA distribution, f = f_R + f_A. We can recover f_A from our solution for f_R, where in the second line we transformed our original relation between f_R and f_A to our new, more convenient, variable w. Plugging our solution f_R(w) = C⁽¹⁾ ₂F₁(α, β, γ; w) into f_A, we will require the differentiation rule for ₂F₁, which tells us

∂f_R/∂w = C⁽¹⁾ (αβ/γ) ₂F₁(α + 1, β + 1, γ + 1; w), (S174)

from which f_A, and therefore f = f_R + f_A, follows. To proceed, we need one of the (many) useful identities known for hypergeometric functions, in particular

w (αβ/γ) ₂F₁(α + 1, β + 1, γ + 1; w) = (γ − 1) [₂F₁(α, β, γ − 1; w) − ₂F₁(α, β, γ; w)]. (S177)

Substituting this for the second term in f(w), we find Eq. S178, and since γ − 1 = k_R⁺ + k_R⁻, the first and third terms cancel, leaving only Eq. S179. Now we enforce normalization, demanding f(w = 0) = f(z = 1) = 1. Since ₂F₁(α, β, γ − 1; 0) = 1, this fixes C⁽¹⁾, Eq. S180. Recalling that the mean burst size is b = (1 − θ)/θ = ξ/(1 − ξ) and that w = [ξ/(1 − ξ)](z − 1) = b(z − 1), we can transform back to the original variable z to find the tidy result of Eq. S181, with α and β given above by Eq. S182. Finally we are in sight of the original goal. We can generate the steady-state probability distribution of interest by differentiating the generating function, which follows easily from its definition. Some contemplation reveals that repeated application of the derivative rule used above will produce products of the form α(α + 1)(α + 2) · · · (α + m − 1) in the expression for p(m), and similarly for β and γ. These resemble ratios of factorials, but since α, β, and γ are not necessarily integers, we should express the ratios using gamma functions instead.
More precisely, one finds Eq. S184, which is finally the probability distribution we sought to derive.
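The differentiation rule invoked above, d/dw ₂F₁(α, β, γ; w) = (αβ/γ) ₂F₁(α + 1, β + 1, γ + 1; w), is easy to spot-check numerically with a truncated version of the defining Gauss series (our own sketch; the parameter values are arbitrary):

```python
def hyp2f1(a, b, c, z, terms=200):
    """Gauss series sum_n (a)_n (b)_n / ((c)_n n!) z^n, truncated; adequate
    for |z| well below 1 and moderate parameter values."""
    total, term = 1.0, 1.0
    for n in range(terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
    return total

a, b, c, z, h = 1.3, 0.7, 2.1, 0.4, 1e-6
numeric = (hyp2f1(a, b, c, z + h) - hyp2f1(a, b, c, z - h)) / (2 * h)
analytic = a * b / c * hyp2f1(a + 1, b + 1, c + 1, z)
# numeric and analytic should agree to several digits
```

The same series truncation also foreshadows the numerical difficulties discussed in the next section: it is only this well behaved for small arguments and parameters.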

S2.3 Numerical considerations and recursion formulas

S2.3.1 Generalities
We would like to carry out Bayesian parameter inference on FISH data from [36], using Eq. (S184) as our likelihood. This requires accurate (and preferably fast) numerical evaluation of the hypergeometric function ₂F₁, which is a notoriously hard problem [81], [82], and our particular needs here present a special challenge, as we show below.
The hypergeometric function is defined by its Taylor series

₂F₁(a, b, c; z) = Σ_{n=0}^∞ [(a)_n (b)_n / (c)_n] zⁿ/n!,

for |z| < 1, where (a)_n = a(a + 1) · · · (a + n − 1) is the Pochhammer symbol, and by analytic continuation elsewhere. If z ≲ 1/2 and a and b are not too large (absolute value below 20 or 30), then the series converges quickly and an accurate numerical representation is easily computed by truncating the series after a reasonable number of terms. Unfortunately, we need to evaluate ₂F₁ over mRNA copy numbers fully out to the tail of the distribution, which can easily reach 50, possibly 100. From Eq. (S184), this means evaluating ₂F₁ repeatedly for values of a, b, and c spanning the full range from O(1) to O(10²), even if α, β, and γ in Eq. (S184) are small, with the situation even worse if they are not small. A naive numerical evaluation of the series definition will be prone to overflow and, if any of a, b, c < 0, some successive terms in the series have alternating signs, which can lead to catastrophic cancellations.
One solution is to evaluate ₂F₁ using arbitrary precision arithmetic instead of floating point arithmetic, e.g., using the mpmath library in Python. This is accurate but incredibly slow computationally. To quantify how slow, we found that evaluating the likelihood defined by Eq. (S184) ∼ 50 times (for a typical dataset of interest from [36], with m values spanning 0 to ∼ 50) using arbitrary precision arithmetic is 100–1000 fold slower than evaluating a negative binomial likelihood for the corresponding constitutive promoter dataset.
To claw back ≳ 30-fold of that slowdown, we can exploit one of the many catalogued symmetries involving ₂F₁. The solution involves recursion relations originally explored by Gauss and studied extensively in [81], [82]. They are sometimes known as contiguous relations and relate the values of any set of three hypergeometric functions whose arguments differ by integers. To rephrase this symbolically, consider a set of hypergeometric functions indexed by an integer n,

f_n = ₂F₁(a + ε_i n, b + ε_j n, c + ε_k n; z),

for a fixed choice of ε_i, ε_j, ε_k ∈ {0, ±1} (at least one of ε_i, ε_j, ε_k must be nonzero, else the set of f_n would contain only a single element). Then there exist known recurrence relations of the form

A_n f_{n−1} + B_n f_n + C_n f_{n+1} = 0,

where A_n, B_n, and C_n are some functions of a, b, c, and z. In other words, for fixed ε_i, ε_j, ε_k, a, b, and c, if we can merely evaluate ₂F₁ twice, say for n₀ and n₀ − 1, then we can easily and rapidly generate values for arbitrary n.
This provides a convenient solution for our problem: we need repeated evaluations of ₂F₁(a + m, b + m, c + m; z) for fixed a, b, and c and many integer values of m. The idea is that we can use arbitrary precision arithmetic to evaluate ₂F₁ for just two particular values of m and then generate ₂F₁ for the other 50–100 values of m using the recurrence relation. In fact there are even more sophisticated ways of utilizing the recurrence relations that might have netted another factor of 2 speed-up, and possibly as much as a factor of 10, but the method described here had already reduced the computation time to an acceptable O(1 min), so these more sophisticated approaches did not seem worth the time to pursue.
However, there are two further wrinkles. The first is that a naive application of the recurrence relation is numerically unstable. Roughly, this is because three-term recurrence relations, like second-order ODEs, admit two linearly independent solutions. In a certain eigenbasis, one of these solutions dominates the other as n → ∞, and as n → −∞ the dominance is reversed. If we fail to work in this eigenbasis, our solution of the recurrence relation will be a mixture of these solutions and will rapidly accumulate numerical error. For our purposes, it suffices to know that the authors of [82] derived the numerically stable solutions (so-called minimal solutions) for several possible choices of ε_i, ε_j, ε_k. Running the recurrence in the proper direction using a minimal solution is numerically robust and can be done entirely in floating point arithmetic, so that we only need to evaluate ₂F₁ with arbitrary precision arithmetic to generate the seed values for the recursion.
The second wrinkle is a corollary to the first. The minimal solutions are only minimal for certain ranges of the argument z, and not all of the 26 possible recurrence relations have minimal solutions for all z. This can be solved by using one of the many transformation formulae for ₂F₁ to convert to a different recurrence relation that has a minimal solution over the required domain of z, although this can require some trial and error to find the right transformation, the right recurrence relation, and the right minimal solution.
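The instability of running a three-term recurrence in the wrong direction is not specific to ₂F₁. A classic toy example (our illustration, not from [82]) is the integral I_n = ∫₀¹ xⁿ e^{x−1} dx, which obeys I_n = 1 − n I_{n−1}: forward recursion amplifies the initial rounding error by a factor of n at every step, while backward recursion from even a crude guess damps the error by the same factor:

```python
import math

# Forward recursion from the exact I_0 = 1 - 1/e: the rounding error in
# I_0 is multiplied by n at every step and destroys the answer long
# before n = 20.
I = 1 - math.exp(-1)
for n in range(1, 21):
    I = 1 - n * I
forward_I20 = I

# Backward recursion I_{n-1} = (1 - I_n)/n from the crude guess I_50 = 0:
# the initial error is divided by n at every step, so I_20 comes out
# accurate to near machine precision.
I = 0.0
for n in range(50, 20, -1):
    I = (1 - I) / n
backward_I20 = I
```

The analogue for the hypergeometric recurrences is the same: a minimal solution must be generated by running the recursion in its stable direction, which for our problem below means from large m to small m.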

S2.3.2 Particulars
Let us now demonstrate these generalities for our problem of interest. In order to evaluate the probability distribution of our model, Eq. (S184), we need to evaluate hypergeometric functions of the form ₂F₁(α + m, β + m, γ + m; −b) for values of m ranging from 0 to O(100). The authors of [82] did not derive a recursion relation for precisely this case. We could follow their methods and do so ourselves, but it is much easier to convert to a case that they did consider. The strategy is to look through the minimal solutions tabulated in [82] and search for a transformation we could apply to ₂F₁(α + m, β + m, γ + m; −b) that would place the m's (the variable being incremented by the recursion) in the same arguments of ₂F₁ as in the minimal solution. After some "guess and check," we found that the transformation produces a ₂F₁ on the right-hand side that closely resembles the minimal solutions y_{3,m} and y_{4,m} in Eq. 4.3 of [82]. Explicitly, these are the solutions quoted above, where we have omitted prefactors which are unimportant for now. Which of these two we should use depends on the values z takes on. Equating 1 − z = b/(1 + b) gives z = 1/(1 + b), and since b is strictly positive, z is bounded between 0 and 1. From Eq. 4.5 in [82], y_{4,m} is the minimal solution for real z satisfying 0 < z < 2, so this is the only minimal solution we need.
Now that we have our minimal solution, what recurrence relation does it satisfy? Confusingly, the recurrence relation of which y_{4,m} is a solution increments different arguments of ₂F₁ than does y_{4,m} itself: it increments the first argument only, rather than the first and third. This recurrence relation can be looked up, e.g., Eq. 15.2.10 in [80], which is

(c − a) ₂F₁(a − 1, b, c; z) + (2a − c + (b − a)z) ₂F₁(a, b, c; z) + a(z − 1) ₂F₁(a + 1, b, c; z) = 0. (S191)

Now we must solve for the parameters appearing in the recurrence relation in terms of our parameters, namely by matching arguments and solving to find Eq. (S193). Finally we have everything we need. The minimal solution, where we have now included the necessary prefactors, is a numerically stable solution of the recurrence relation Eq. (S191) if the recursion is run from large m to small m.
Let us finally outline the complete procedure as an algorithm to be implemented: 1. Compute the value of ₂F₁ for the two largest m values of interest using arbitrary precision arithmetic. 2. Run the recurrence relation Eq. (S191) from large m to small m, generating the values of ₂F₁ for all smaller m in ordinary floating point arithmetic. 3. Assemble the probability distribution p(m) of Eq. (S184) from these values.
With 2 F 1 computed, the only remaining numerical danger in computing p(m) in Eq. (S184) is overflow of the gamma functions. This is easily solved by taking the log of the entire expression and using standard routines to compute the log of the gamma functions, then exponentiating the entire expression at the end if p(m) is needed rather than log p(m).
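A sketch of this log-space trick, using a negative binomial as a stand-in for the full Eq. (S184) and arbitrary parameter values: Γ(m + k_i) overflows a double already for arguments ≳ 171, but its logarithm via `math.lgamma` is perfectly tame:

```python
from math import lgamma, log, exp

def log_pmf(m, k_i, b):
    """log p(m) for the constitutive bursty promoter (Eq. S152), assembled
    entirely from log-gamma terms so no intermediate value can overflow."""
    return (lgamma(m + k_i) - lgamma(m + 1) - lgamma(k_i)
            + m * log(b / (1 + b)) - k_i * log(1 + b))

# Direct evaluation of Gamma(500 + k_i) would overflow a double, but the
# log pmf is an ordinary (very negative) float, and exponentiating at the
# end recovers p(m) itself.
lp = log_pmf(500, 2.5, 3.0)
p = exp(lp)
```

Exactly the same pattern applies to Eq. (S184): keep every gamma function in log space and exponentiate only once at the very end, if at all.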

S3.1 The problem of parameter inference
One could argue that the whole goal of formulating theoretical models about nature is to sharpen our understanding from qualitative statements to precise quantitative assertions about the relevant features of the natural phenomena in question [83]. It is in these models that we intend to distill the essential parts of the object of study. Writing down such models leads to a proliferation of mathematical variables that parametrize them. By assigning numerical values to these parameters we can compute concrete predictions that can be contrasted with experimental data. For these predictions to match the data, the parameter values have to be carefully chosen from the whole parameter space. But how do we go about assessing the ability of different regions of parameter space to reproduce the experimental observations? The language of probability, and more specifically of Bayesian statistics, is, we think, the natural language to tackle this question.

S3.1.1 Bayes' theorem
Bayes' theorem is a simple mathematical statement that can apply to any logical conjecture. For two particular events A and B that potentially depend on each other, Bayes' theorem gives us a recipe for how to update our beliefs about one, let us say B, given some state of knowledge, or lack thereof, about A. In its most classic form Bayes' theorem is written as

P(B | A) = P(A | B) P(B) / P(A),

where the vertical line | is read as "given that." So P(B | A) is read as the probability of B given that A took place. A and B can be any logical assertion. In particular, the problem of Bayesian inference focuses on the question of finding the probability distribution of a particular parameter value given the data.
For a given model with a set of parameters θ⃗ = (θ₁, θ₂, . . . , θₙ), the so-called posterior distribution P(θ⃗ | D), where D is the experimental data, quantifies the plausibility of a set of parameter values given our observation of some particular dataset. In other words, through the application of Bayes' formula we update our beliefs about the possible values that the parameters can take upon learning the outcome of a particular experiment. We say "update" because we come to every inference problem with prior information about the plausibility of particular regions of parameter space, even before performing any experiment. Even when we claim as researchers that we are totally ignorant about the values that the parameters in our models can take, we always come to a problem with domain expertise that can be exploited; if this were not the case, it is likely that the formulation of our model would not capture the phenomena we claim to want to understand. This prior information is captured in the prior probability P(θ⃗). The relationship between the parameter values and the data is encoded in the likelihood function P(D | θ⃗). Our theoretical model, whether deterministic or probabilistic, is encoded in this term, which can be intuitively understood as the probability of having observed the particular experimental data at hand given that our model is parametrized with the concrete values θ⃗. Implicitly we are also conditioning on our theoretical model being "true," i.e., on the model itself, if evaluated or simulated in the computer, being capable of generating datasets equivalent to the one we observed in an experiment. In this way Bayesian inference consists of applying Bayes' formula as

P(θ⃗ | D) ∝ P(D | θ⃗) P(θ⃗).

Notice that rather than writing the full form of Bayes' theorem, we limit ourselves to the terms that depend on our quantity of interest, the parameter values θ⃗ themselves, as the denominator P(D) only serves as a normalization constant.
We also emphasize that the dichotomy we have presented between prior and likelihood is more subtle than it first appears. Although it is often stated that our prior knowledge is entirely encapsulated by the obviously named prior probability P(θ⃗), this is usually too simplistic. The form(s) we choose for our likelihood function P(D | θ⃗) also draw heavily on our prior domain expertise, and the assumptions, implicit and explicit, that these choices encode are at least as important as, and often inseparable from, the prior probability, as persuasively argued in [84].

S3.1.2 The likelihood function
As we alluded to in the previous section, it is through the likelihood function P(D | θ⃗) that we encode the connection between our parameter values and the experimental observables. Broadly speaking, there are two classes of models that we might need to encode into our likelihood function:

• Deterministic models: models for which a concrete selection of parameter values gives a single output. Said differently, models with a one-to-one mapping between inputs and outputs.

• Probabilistic models: as the name suggests, models that, rather than having a one-to-one input-output mapping, describe the full probability distribution of possible outputs.
In this paper we focus on inference done with probabilistic models. After all, the chemical master equations we wrote down describe the time evolution of the mRNA probability distribution. So all of our likelihoods P(D | θ⃗) will be given by the steady-state solution of the corresponding chemical master equation. This is rather convenient, as we do not have to worry about layering a statistical model on top of our model to describe deviations from its predictions; instead, our models themselves directly predict this cell-to-cell variation in mRNA counts.

S3.1.3 Prior selection
The different models explored in this work embraced different levels of coarse-graining, which resulted in a different number of parameters for each model. For each of these model configurations, Bayes' theorem demands that we represent our preconceptions about the possible parameter values in the form of the prior P(θ⃗). Throughout this work, for models with more than one parameter we assign independent priors to each of the parameters; that is,

P(θ⃗) = Π_{i=1}^n P(θᵢ).

Although it is not uncommon practice to use non-informative, or maximally uninformative, priors, we are of the mindset that this is a disservice to the philosophical and practical implications of Bayes' theorem. It sounds almost contradictory to claim that we can represent our thinking about a natural phenomenon in the form of a mathematical model (in the context of Bayesian inference this means choosing a form for the likelihoods, and even making this choice presupposes prior understanding or assumptions as to the relevant features of the system under study) but that we have absolutely no idea what the parameter values could or could not be. We therefore make use of our own expertise, many times in the form of order-of-magnitude estimates, to write down weakly informative prior distributions for our parameters.
For our particular case, all of the datasets from [36] used in this paper have O(10³) data points. This implies that our particular choice of priors will not significantly affect our inference, as long as they are broad enough. A way to see why this is the case is to simply look at Bayes' theorem. For N ∼ 1000–3000 data points, all independent of each other, and n ≪ 10³ parameters, Bayes' theorem reads

P(θ⃗ | D) ∝ P(θ⃗) Π_{k=1}^N P(d_k | θ⃗),

where d_k represents the k-th datum. This means that if our priors span a wide range of parameter space, the posterior distribution will be dominated by the likelihood function.
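This prior-insensitivity is easy to see in a toy grid calculation (our own sketch, with made-up Bernoulli data standing in for the real likelihood): with a few thousand data points, two very different priors produce nearly identical posteriors:

```python
import math
import random

random.seed(1)
data = [random.random() < 0.3 for _ in range(2000)]  # synthetic coin-flip data
h = sum(data)
grid = [i / 200 for i in range(1, 200)]              # parameter grid on (0, 1)

def grid_posterior(log_prior):
    """Posterior ~ prior * likelihood, normalized on the grid."""
    log_post = [h * math.log(t) + (len(data) - h) * math.log(1 - t)
                + log_prior(t) for t in grid]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]

flat = grid_posterior(lambda t: 0.0)                          # uniform prior
beta22 = grid_posterior(lambda t: math.log(6 * t * (1 - t)))  # Beta(2,2) prior
max_diff = max(abs(a - b) for a, b in zip(flat, beta22))
# With N = 2000 data points the two posteriors are nearly indistinguishable.
```

The product of N likelihood terms concentrates the posterior so sharply that any broad, smooth prior contributes a negligible correction.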

S3.1.4 Expectations and marginalizations
For models with more than one or two parameters, it is generally difficult to visualize or reason about the full joint posterior distribution P(θ⃗ | D) directly. One of the great powers of Bayesian analysis is marginalization, which allows us to reduce the dimensionality to only the parameters of immediate interest by averaging over the other dimensions. Formally, for a three-dimensional model with parameters θ₁, θ₂, and θ₃, we can for instance marginalize away θ₃ to produce a 2D posterior,

P(θ₁, θ₂ | D) = ∫ dθ₃ P(θ₁, θ₂, θ₃ | D),

or we can marginalize away θ₁ and θ₃ to produce the 1D marginal posterior of θ₂ alone,

P(θ₂ | D) = ∫ dθ₁ ∫ dθ₃ P(θ₁, θ₂, θ₃ | D).

Conceptually, this is what we did in generating the 2D slices of the full 9D model in Figure 4(A). In practice, this marginalization is even easier with Markov Chain Monte Carlo samples in hand: since each sample is simply a list of parameter values, we simply ignore the entries corresponding to the parameters we want to marginalize away [52].
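With samples in hand, marginalization really is just dropping columns. A minimal illustration with fake "MCMC" draws (independent Gaussians standing in for a real posterior):

```python
import random

random.seed(0)
# Stand-in for MCMC output: each draw is a tuple (theta1, theta2, theta3),
# with theta2 ~ Normal(2, 0.5) here by construction.
samples = [(random.gauss(0, 1), random.gauss(2, 0.5), random.gauss(-1, 2))
           for _ in range(100_000)]

# Marginal posterior of theta2: simply ignore the theta1 and theta3 entries.
theta2_draws = [s[1] for s in samples]
theta2_mean = sum(theta2_draws) / len(theta2_draws)
```

Any summary of the marginal (mean, percentiles, histogram) is then computed from the retained column alone, exactly as done for the 2D slices in Figure 4(A).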

S3.1.5 Markov Chain Monte Carlo
The theory and practice of Bayesian inference with Markov Chain Monte Carlo (MCMC) is a rich subject with fascinating and deep analogies to statistical mechanics, even drawing on classical Hamiltonian mechanics and general relativity in its modern incarnations. We refer the interested reader to [52] and [85] for excellent introductions. Here we merely give a brief summary of the MCMC computations carried out in this work.
We used the Python package emcee for most of the MCMC sampling in this work. For the constitutive promoter inference, we also ran the sampling with the excellent Stan modeling language as a check. We did not use Stan for the inference of the simple repression model because implementing the gradients of the hypergeometric function ₂F₁ appearing in Eq. S184, the probability distribution for our bursty model with repression, would have been an immensely challenging task. emcee was more than adequate for our purposes, and we were perhaps lucky that the 9-D posterior of the simple repression model with a bursty promoter was quite well behaved and did not require the extra power of the Hamiltonian Monte Carlo algorithm provided by Stan [86]. Source code for all statistical inference will be made available at https://github.com/RPGroup-PBoC/bursty_transcription.

S3.2 Bayesian inference on constitutive promoters
Having introduced the ideas behind Bayesian inference, we are ready to apply the theoretical machinery to our nonequilibrium models. In particular, in this section we will focus on model 1 and model 5 in Figure 2(A). Model 1, the Poisson promoter, will help us build practical intuition for the implementation of the Bayesian inference pipeline, even though we noted in Section 3 of the main text that this model cannot be reconciled with experimental data through observables such as the Fano factor. In other words, we acknowledge that this model is "wrong," but we still see value in going through the analysis, since the simple nature of the model translates into a neat statistical analysis.

S3.2.1 Model 1 -Poisson promoter
As specified in the main text, the steady-state mRNA distribution for model 1 in Figure 2(A) is Poisson with parameter λ. Throughout this appendix we will appeal to the convenient notation for probability distributions of the form

m ∼ Poisson(λ),

where the symbol "∼" is read as "is distributed according to." So the previous expression reads: the mRNA copy number m is distributed according to a Poisson distribution with parameter λ. Our objective then is to compute the posterior probability distribution P(λ | D). To proceed with the inference problem we need to specify a prior. In this case we are extremely data-rich, as the dataset from Jones et al. [36] has of order 1000–3000 single-cell measurements for each promoter, so our choice of prior matters little here, as long as it is sufficiently broad. A convenient choice for our problem is to use a conjugate prior. A conjugate prior is a special prior that causes the posterior to have the same functional form as the prior, simply with updated model parameters. This makes calculations analytically tractable and also offers a nice interpretation of the inference procedure as updating our knowledge about the model parameters. This makes conjugate priors very useful when they exist. The caveat is that conjugate priors only exist for a very limited number of likelihoods, mostly with only one or two model parameters, so in almost all other Bayesian inference problems we must tackle the posterior numerically.
But, for the problem at hand, a conjugate prior does in fact exist. For a Poisson likelihood of independent and identically distributed data, the conjugate prior is a gamma distribution, as can be looked up in, e.g., [52], Section 2.6. Putting a gamma prior on λ introduces two new parameters α and β which parametrize the gamma distribution itself, and which we use to encode the range of λ values we view as reasonable. Recall that λ is the mean steady-state mRNA count per cell, which a priori could plausibly be anywhere from 0 to a few hundred. α = 1 and β = 1/50 achieve this, since the gamma distribution is strictly positive with mean α/β and standard deviation √α/β. To be explicit, then, our prior is

λ ∼ Gamma(α, β). (S202)

As an aside, note that if we did not know that our prior was a conjugate prior, we could still write down our posterior distribution from its definition,

P(λ | {m_k}) ∝ P({m_k} | λ) P(λ) = [Π_{k=1}^N λ^{m_k} e^{−λ}/m_k!] (β^α/Γ(α)) λ^{α−1} e^{−βλ}.

Without foreknowledge that this in fact reduces to a gamma distribution, this expression might appear rather inscrutable. When conjugate priors are unavailable for the likelihood of interest, which is almost always the case for models with more than one model parameter, this inscrutability is the norm, and making sense of posteriors analytically is almost always impossible. Fortunately, MCMC sampling provides us a powerful method of constructing posteriors numerically, which we will make use of extensively.
Since we did use a conjugate prior, we may simply look up our posterior in any standard reference such as [52], Section 2.6, from which we find that

λ | D ∼ Gamma(α + N m̄, β + N),

where we defined the sample mean m̄ = (1/N) Σ_k m_k for notational convenience. A glance at the FISH data from [36] reveals that N is O(10^3) and ⟨m⟩ ≳ 0.1 for all constitutive strains in [36], so m̄N ≳ 10^2. Therefore, as we suspected, our prior parameters are completely overwhelmed by the data. The prior behaves, in a sense, like extra "data points" with a mean value of (α − 1)/β [52], which gives us some intuition for how much data is needed to overwhelm the prior in this case: enough data N such that β ≪ N and α/N ≪ m̄. In fact, m̄N and N are so large that we can, to an excellent approximation, ignore the α and β dependence and approximate the gamma distribution as a Gaussian with mean m̄ and standard deviation √(m̄/N), giving

λ | D ∼ Normal(m̄, √(m̄/N)), (S206)

which suggests we have inferred our model's one parameter to a precision of order 1%. This is not wrong, but it is not the full story. The model's posterior distribution is tightly constrained, but is it a good generative model? In other words, if we use the model to generate synthetic data in the computer, does it generate data that look similar to our actual data, and is it therefore plausible that the model captures the important features of the data generating process? This intuitive notion can be codified with posterior predictive checks, or PPCs, and we will see that this simple Poisson model fails badly.
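As a concrete sketch of this conjugate update (using simulated counts in place of the FISH data; variable names and the simulated λ are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the FISH data: N cells with true lambda = 5 mRNA/cell
N, lam_true = 2648, 5.0
counts = rng.poisson(lam_true, size=N)

# Prior hyperparameters from the text: alpha = 1, beta = 1/50
alpha, beta = 1.0, 1.0 / 50.0

# Conjugate update: lambda | D ~ Gamma(alpha + N*mbar, beta + N)
mbar = counts.mean()
alpha_post = alpha + N * mbar
beta_post = beta + N

# Compare the exact gamma posterior to the Gaussian approximation
gamma_mean = alpha_post / beta_post             # approx mbar
gamma_std = np.sqrt(alpha_post) / beta_post     # approx sqrt(mbar / N)
print(gamma_mean, mbar)
print(gamma_std, np.sqrt(mbar / N))
```

Running this confirms numerically that, at these sample sizes, the prior hyperparameters are invisible: the posterior mean and width are indistinguishable from m̄ and √(m̄/N).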
The intuitive idea of posterior predictive checks is simple: 1. Make a random draw of the model parameter λ from the posterior distribution.
2. Plug that draw into the likelihood and generate a synthetic dataset {m_k} conditioned on λ.
More formally, the posterior predictive distribution can be thought of as the distribution of future yet-to-be-observed data, conditioned on the data we have already observed. Clearly if the synthetic data appear quite different from the observed data, the model has a problem. Put another way, if we suppose the generative model is true, i.e., we claim that our model explains the process through which our observed experimental data was generated, then the synthetic datasets we generate should resemble the actual observed data. If this is not the case, it suggests the model is missing important features. All the data we consider in this work are 1D (distributions of mRNA counts over a population), so empirical cumulative distribution functions (ECDFs) are an excellent visual means of comparing synthetic and observed datasets. In general, for higher-dimensional datasets, much of the challenge is in merely designing good visualizations that can actually show whether synthetic and observed data are similar or not.
For our example Poisson promoter model then, we merely draw many random numbers, say 1000, from the Gaussian posterior in Eq. S206. For each one of those draws, we generate a dataset from the likelihood, i.e., we draw 2648 (the number of observed cells in the actual dataset) Poisson-distributed numbers for each of the 1000 posterior draws, for a total of 2648000 samples from the posterior predictive distribution.
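This sampling procedure can be sketched in a few lines (the posterior mean and width here are illustrative stand-ins, not the inferred values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose the approximately Gaussian posterior has mean mbar, std sqrt(mbar/N)
N_cells, mbar = 2648, 5.0
post_mean, post_std = mbar, np.sqrt(mbar / N_cells)

# Step 1: draw 1000 parameter values from the posterior
lam_draws = rng.normal(post_mean, post_std, size=1000)

# Step 2: for each draw, generate a synthetic dataset of N_cells Poisson counts
ppc = np.array([rng.poisson(lam, size=N_cells) for lam in lam_draws])

# ppc has shape (1000, 2648): 2,648,000 posterior predictive samples in total,
# from which quantile ECDFs can be computed and overlaid on the observed ECDF
print(ppc.shape, ppc.size)
```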
To compare so many samples with the actual observed data, one excellent visualization for 1D data is ECDFs of the quantiles, as shown for our Poisson model in Figure 3(B) in the main text.

S3.2.2 Model 5 -Bursty promoter
Let us now consider the problem of parameter inference from FISH data for model five from Figure 1(C). As derived in Appendix S2, the steady-state mRNA distribution in this model is a negative binomial distribution, given by

p(m | k_i, b) = [Γ(m + k_i)/(Γ(k_i) m!)] (b/(1 + b))^m (1/(1 + b))^{k_i}, (S207)

where b is the mean burst size and k_i is the burst rate nondimensionalized by the mRNA degradation rate γ. As sketched earlier, we can intuitively think about this distribution through a simple story: the promoter undergoes geometrically distributed bursts of mRNA, where the arrival of bursts is a Poisson process with rate k_i and the mean size of a burst is b.
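A quick numerical check of this parametrization (with arbitrary illustrative values of k_i and b, not the inferred ones): numpy's negative binomial with success probability p = 1/(1 + b) matches Eq. S207, so samples should have mean k_i·b and Fano factor 1 + b.

```python
import numpy as np

rng = np.random.default_rng(2)

k_i, b = 2.0, 2.0   # illustrative burst rate (in units of gamma) and mean burst size

# Eq. S207 in numpy's convention: n = k_i failures-parameter, p = 1/(1+b)
samples = rng.negative_binomial(n=k_i, p=1.0 / (1.0 + b), size=200_000)

mean = samples.mean()
fano = samples.var() / mean
print(mean, fano)   # theory: mean = k_i*b = 4, Fano factor = 1 + b = 3
```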
As for the Poisson promoter model, this expression for the steady-state mRNA distribution is exactly the likelihood we want to use in Bayes' theorem. Again denoting the single-cell mRNA count data as D = {m_1, m_2, . . . , m_N}, here Bayes' theorem takes the form

p(k_i, b | D) ∝ p(D | k_i, b) p(k_i, b),

where the likelihood p(D | k_i, b) is given by the product of N negative binomials as in Eq. S207. We only need to choose priors on k_i and b. For the datasets from [36] that we are analyzing, as for the Poisson promoter model above, we are still data-rich, so the prior's influence remains weak, though not nearly as weak as before, because the dimensionality of our model has increased from one to two.
We follow the guidance of [52], Section 2.9 in opting for weakly-informative priors on k i and b (conjugate priors do not exist for this problem), and we find "street-fighting estimates" [87] to be an ideal way of constructing such priors. The idea of weakly informative priors is to allow all remotely plausible values of model parameters while excluding the completely absurd or unphysical.
Consider k_i. Some of the strongest known bacterial promoters control rRNA genes and initiate transcripts no faster than ∼ 1/sec. It would be exceedingly strange if any of the constitutive promoters from [36] were stronger than that, so we can take that as an upper bound. For a lower bound, if transcripts are produced too rarely, there would be nothing to see with FISH. The datasets for each strain contain of order 10^3 cells, and if ⟨m⟩ = k_i b/γ ≲ 10^{−2}, then the total number of expected mRNA detections would be single digits or less and we would have essentially no data on which to carry out inference. So assuming b is not too different from 1, justified next, and an mRNA lifetime of γ^{−1} ∼ 3-5 min, this gives us soft bounds on k_i/γ of perhaps 10^{−2} and 3 × 10^1.
Next consider the mean burst size b. This parametrization of the geometric distribution allows bursts of size zero (which could represent aborted transcripts and initiations), but it would be quite strange for the mean burst size b to be below ∼ 10^{−1}, for which nearly all bursts would be of size zero or one. For an upper bound, if transcripts are initiating at a rate somewhat slower than rRNA promoters, then it would probably take a time comparable to the lifetime of an mRNA to produce a burst larger than 10-20 transcripts, which would invalidate the model's approximation that the duration of bursts is instantaneous compared to other timescales in the problem. So we will take soft bounds of 10^{−1} and 10^1 for b.
Note that the natural scale for these "street-fighting estimates" is a log scale. It is commonly the case that our prior sense of reasonable and unreasonable parameters is set on a log scale. A natural way to enforce these soft bounds is therefore to use a lognormal prior distribution, with the soft bounds set ±2 standard deviations from the mean. To be explicit, our priors are

log10(k_i/γ) ∼ Normal(−0.26, 0.87), log10(b) ∼ Normal(0, 0.5), (S209)

where the means and standard deviations follow from placing the soft bounds above at ±2 standard deviations.
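This prescription, soft bounds placed at ±2 standard deviations on a log scale, can be sketched as follows (the helper function name is ours):

```python
import numpy as np

def lognormal_prior_params(lower, upper):
    """Return (mu, sigma) in log10 units such that [lower, upper] spans mu +/- 2*sigma."""
    lo, hi = np.log10(lower), np.log10(upper)
    mu = (lo + hi) / 2.0
    sigma = (hi - lo) / 4.0
    return mu, sigma

# Soft bounds from the street-fighting estimates above
mu_ki, sig_ki = lognormal_prior_params(1e-2, 3e1)   # k_i / gamma
mu_b, sig_b = lognormal_prior_params(1e-1, 1e1)     # mean burst size b
print(mu_ki, sig_ki)   # approx (-0.26, 0.87)
print(mu_b, sig_b)     # (0.0, 0.5)
```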
Section 4 in the main text details the results of applying this inference to the single-cell mRNA count data. There we show the posterior distributions of the two parameters for different promoters. Figure S5 shows the posterior predictive checks (see main text for explanation) for all 18 unregulated promoters shown in the main text.

S3.3 Bayesian inference on the simple-repression architecture
As detailed in Section 4 of the main text, the inference on the unregulated promoter served as a stepping stone towards our ultimate goal of inferring repressor rates from the steady-state mRNA distributions of simple-repression architectures. For this we expand the one-state bursty promoter model to a two-state promoter, schematized as model 5 in Figure 1(C). This model adds two new parameters: the repressor binding rate k⁺, solely a function of the repressor concentration, and the repressor dissociation rate k⁻, solely a function of the repressor-DNA binding affinity.
The structure of the data in [36] for regulated promoters allowed these two parameters to be tuned independently. In their work the production of the LacI repressor was under the control of an inducible promoter regulated by the TetR repressor, as schematized in Figure S6. When TetR binds to the small molecule anhydrotetracycline (aTc), it shifts to an inactive conformation unable to bind to the DNA. This translates into an increase in gene expression level. In other words, the higher the concentration of aTc added to the media, the fewer TetR repressors that can control the expression of the lacI gene, and so the higher the concentration of LacI repressors in the cell. So by tuning the amount of aTc in the media where the experimental strains were grown, the authors effectively tune k⁺ in our simple theoretical model. On the other hand, to tune k⁻ the authors swap three different binding sites for the LacI repressor, each with a different, previously characterized repressor-DNA binding affinity [33].

Figure S6. aTc-controlled expression of the LacI repressor. Schematic of the circuit used in [36] to control the expression of the LacI repressor. The lacI gene is under the control of the TetR repressor. As the TetR repressor is inactivated upon binding of anhydrotetracycline (aTc), the more aTc added to the media where cells are growing, the fewer TetR repressors available to control the expression of the lacI gene, resulting in more LacI repressors per cell. LacI in turn controls the expression of the mRNA on which single-molecule mRNA FISH was performed for gene expression quantification.
What this means is that we have access to data with different combinations of k⁻ and k⁺. We could naively try to fit the kinetic parameters individually for each of the datasets, but there is no reason to believe that the binding site identity for the LacI repressor somehow affects its expression level, which is controlled from a completely different location in the genome, nor vice versa. In other words, what makes the most sense is to fit all datasets together to obtain a single value for each of the association and dissociation rates. What this means, as described in Section 4 of the main text, is that we have a seven-dimensional parameter space with four possible association rates k⁺, given the four available aTc concentrations, and three possible dissociation rates k⁻, given the three different binding sites available in the dataset.
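The pooled structure of this joint fit can be sketched as follows. Everything here is illustrative: the rate values are hypothetical, and a simple Poisson pmf evaluated at the model's mean expression, ⟨m⟩ = k_i b k⁻/(k⁺ + k⁻), stands in for the true likelihood derived in Appendix S2.

```python
import numpy as np
from scipy import stats

# Shared parameters: one k+ per aTc concentration, one k- per operator.
# All values are hypothetical placeholders (in units of gamma).
k_plus = {"0.5ng": 0.5, "1ng": 2.0, "2ng": 5.0, "10ng": 10.0}
k_minus = {"O1": 0.1, "O2": 1.0, "O3": 10.0}

def log_pmf(m, kp, km, k_i=0.5, b=2.0):
    # Placeholder likelihood: Poisson with the regulated model's mean,
    # <m> = k_i * b * km / (kp + km); the real pmf is Eq. S184.
    return stats.poisson.logpmf(m, k_i * b * km / (kp + km))

# Each dataset is labeled by its (aTc, operator) condition; counts are made up
datasets = {("1ng", "O2"): np.array([0, 1, 0, 2]),
            ("10ng", "O1"): np.array([0, 0, 1, 0])}

# One scalar log-likelihood: every dataset informs the shared rates it uses
log_like = sum(log_pmf(counts, k_plus[atc], k_minus[op]).sum()
               for (atc, op), counts in datasets.items())
print(log_like)
```

The key design point is that k_plus and k_minus are indexed by condition, not by dataset, so two datasets sharing an operator (or an aTc concentration) constrain the same rate.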
Note that since the repressor copy numbers are not known directly, as explained before, we label the association rates by the concentration of aTc. Bayes' theorem reads simply

p(k⁺, k⁻, k_i, b | D) ∝ p(D | k⁺, k⁻, k_i, b) p(k⁺, k⁻, k_i, b),

where D is the set of all N observed single-cell mRNA counts across the various conditions. We assume that individual single-cell measurements are independent, so that the likelihood factorizes as

p(D | k⁺, k⁻, k_i, b) = ∏_{j=1}^{N} p(m_j | k⁺_j, k⁻_j, k_i, b),

where k±_j represent the appropriate binding and unbinding rates for the j-th measured cell. Our likelihood function, previously derived in Appendix S2, is given by the rather complicated result in Eq. S184, which for completeness we reproduce here as

p(m | k⁺, k⁻, k_i, b) = [Γ(α + m) Γ(β + m) Γ(α + β)] / [Γ(α) Γ(β) Γ(α + β + m)] · (b^m/m!) · ₂F₁(α + m, β + m; α + β + m; −b),

where ₂F₁ is the Gauss hypergeometric function and α and β, defined for notational convenience, are

α, β = (1/2) [ (k_i + k⁻ + k⁺) ± √((k_i + k⁻ + k⁺)² − 4 k_i k⁻) ],

with all rates nondimensionalized by the mRNA degradation rate γ. Next we specify priors. As for the constitutive model, weakly informative lognormal priors are a natural choice for all our rates. We found that if the priors were too weak, our MCMC sampler would often become stuck in regions of parameter space with very low probability density, unable to move. We struck a balance in choosing our prior widths between helping the sampler run while simultaneously verifying that the marginal posteriors for each parameter were not artificially constrained or distorted by the presence of the prior. The only exception to this is the highly informative priors we placed on k_i and b, since we have strong knowledge of them from our inference of constitutive promoters above.
We found that fitting a single operator/aTc concentration at a time with a single binding and unbinding rate did not yield a stable inference for most of the possible operator/aTc combinations. In other words, a single dataset could not independently resolve the binding and unbinding rates, only their ratio as set by the mean fold-change in Figure 1 in the main text. Only by making the assumption of a single unique binding rate for each repressor copy number and a single unique unbinding rate for each binding site, as done in Figure 4(A), was it possible to independently resolve the rates and not merely their ratios.
We also note that we found it necessary to exclude the very weakly and very strongly repressed datasets from Jones et al. [36]. In both cases there was, in a sense, not enough information in the distributions for our inference algorithm to extract, and their inclusion simply caused problems for the MCMC sampler without yielding any new insight. For the strongly repressed data (Oid, 10 ng/mL aTc), with > 95% of cells having zero mRNA, there was quite literally very little data from which to infer rates. And the weakly repressed data, all with the repressor binding site O3, had an unbinding rate so fast that the sampler essentially sampled from the prior; the likelihood had negligible influence, meaning the data were not informing the sampler in any meaningful way, so no inference was possible.