A Computational Framework for Analyzing Stochasticity in Gene Expression

Stochastic fluctuations in gene expression give rise to distributions of protein levels across cell populations. Despite a mounting number of theoretical models explaining stochasticity in protein expression, we lack a robust, efficient, assumption-free approach for inferring the molecular mechanisms that underlie the shape of protein distributions. Here we propose a method for inferring sets of biochemical rate constants that govern chromatin modification, transcription, translation, and RNA and protein degradation from stochasticity in protein expression. We asked whether the rates of these underlying processes can be estimated accurately from protein expression distributions, in the absence of any limiting assumptions. To do this, we (1) derived analytical solutions for the first four moments of the protein distribution, (2) found that these four moments completely capture the shape of protein distributions, and (3) developed an efficient algorithm for inferring gene expression rate constants from the moments of protein distributions. Using this algorithm we find that most protein distributions are consistent with a large number of different biochemical rate constant sets. Despite this degeneracy, the solution space of rate constants almost always informs on the underlying mechanism. For example, we distinguish regimes where transcriptional bursting occurs from regimes reflecting constitutive transcript production. Our method agrees with the current standard approach in the restrictive regime where the standard method operates, and also identifies rate constants not previously obtainable. Even without making any assumptions, we obtain estimates of individual biochemical rate constants, or meaningful ratios of rate constants, in 91% of tested cases. In some cases our method identified all of the underlying rate constants.
The framework developed here will be a powerful tool for deducing the contributions of particular molecular mechanisms to specific patterns of gene expression.


S1 Analytical expressions for the steady state central moments
For convenience, the CME equations for the general ON-OFF model of gene expression shown in the main text are reproduced here:

    dP_0(m,q)/dt = t_off P_1(m,q) + d_m (m+1) P_0(m+1,q) + k_p m P_0(m,q-1)
                   + d_p (q+1) P_0(m,q+1) - (m d_m + m k_p + q d_p + t_on) P_0(m,q)          (S1)

    dP_1(m,q)/dt = t_on P_0(m,q) + k_m P_1(m-1,q) + d_m (m+1) P_1(m+1,q) + k_p m P_1(m,q-1)
                   + d_p (q+1) P_1(m,q+1) - (k_m + m d_m + m k_p + q d_p + t_off) P_1(m,q)   (S2)

For simplicity we separated the CME into two parallel equations: one corresponding to the probabilities of m RNAs and q proteins when the gene is in the OFF state (P_0), and one corresponding to the same probabilities when the gene is in the ON state (P_1).
To derive the generic i-th RNA moment and j-th protein moment, we multiplied both sides of Eqs. S1 and S2 by m^i q^j, then summed across all m and q, as in Sanchez and Kondev [1]. For clarity we perform the following algebra on a parameter-by-parameter basis.
For t_on and the LHS in Eq. S2, the moments are derived from the identity that the i-th moment of n is equal to sum_{n=0}^inf n^i p(n). The subscripts above, e.g., in <m^i q^j>_0, correspond to the partial moments with respect to the promoter state.
Since non-central moments are additive, the sum of the partial moments across all promoter states equals the complete moment.
Continuing the derivation of Eq. 4 in the main text, piecewise, for d_m we have the following. Note that from line 7 to 8 we change the index of summation across RNA in order to remove the m_1 + 1 factor inside the P_1(m_1 + 1, q_1) term. Then, noting for the second summation that the expression also vanishes when m = 0, we re-unite both sides with the summation across m starting at 1. We use this same manipulation in subsequent steps.
For k_m,

Line 15 is exactly true for all j ≠ 0. When j = 0, any choice of i will produce a polynomial of moments plus a constant term. Since this constant corresponds to the following underlying summation, we recognize that it is simply the amount of probability mass contained in the ON state:
    sum_{m_1=0}^inf sum_{q_1=0}^inf P_1(m_1, q_1) = t_on / (t_on + t_off)          (S16)

For k_p, and for d_p, we proceed in the same way. Combining Eqs. 4, 10, 15, 20, 24, and 26, we get Eq. 4 from the main body, reproduced here. Following identical procedures but starting with Eq. S1, we arrive at the second partial moment expansion (Eq. 3 in the main text). To generate the partial moment ODEs we then plugged in the following values of i and j: for each integer n from 0 to 4, we compute all integer moments that satisfy i + j = n for i, j >= 0. For example, when n = 3, we compute the pairs (i=0, j=3), (i=1, j=2), (i=2, j=1), and (i=3, j=0). Excluding the trivial pair (i=0, j=0), whose value is fixed by Eq. S16, the result is 14 combinations of i and j for each promoter state, or 28 partial moment ODEs.
We then set each of the partial moment derivatives (LHS) to zero and solved for the partial moment terms in Mathematica (Wolfram Research, Champaign, IL, 2010). There were 28 unknown partial moments and 28 equations, and we were able to solve exactly for each one in terms of the 6 CDRCs: t_on, t_off, k_m, d_m, k_p, and d_p. From these we computed the 14 moments about zero from the partial moments using Eq. 5.
We examined the mean plus the next 3 non-normalized central moments. The j-th central moment mu_j is defined as mu_j = <(q - <q>)^j>. Expansions of mu_j for j = 2, 3, 4 are written in terms of the moments about zero <q^j>, which are exactly what we computed from the partial moments in Eqs. 29-57. Plugging in these expressions, we obtain the steady-state mean and central moments 2-4. The mean and variance are printed here as Eqs. 65 and 62; they agree with previously published work [2], though our solution is written completely in terms of the CDRCs. This makes it less interpretable but more readily computed.
The two higher moments written in terms of the CDRCs are extremely long. They are provided along with this supplement, coded in both C++ and Matlab for convenience.
An interesting observation is that d_m and d_p are completely interchangeable with their sum and product. That is, if we rewrite Eqs. 65 and 62 in terms of the sum d_m + d_p and the product d_m * d_p, we can completely eliminate the individual variables d_m and d_p.
This result demonstrates that d_m and d_p do not independently contribute to the mean and variance of a gene's protein expression distribution; rather, their individual effects are convolved as a sum and a product. As such, there will always be at least two solutions given mean and variance only, unless d_m = d_p.
We speculate this is the reason why many previous methods require the degradation rate assumption d_m > d_p, as this allows one to select between the two candidate solution pairs.
Notably, the parameter conversion described above does not work for skewness or kurtosis. This indicates that skewness and kurtosis introduce an asymmetry that allows the individual effects of d_m and d_p to be recognized.

S2 Validation of moment equations
Moment equation implementations provided in this supplement have been tested against Gillespie simulations to confirm their accuracy. Example convergences of a Gillespie simulation to the analytical moment solutions are given in Fig. S1. In this figure, relative error is computed between the j-th sample moment of the protein probability mass function at time t and the corresponding analytical moment. Note that even though the average protein count is low (<q> = 216 proteins), the Gillespie simulation is extremely slow to converge; convergence occurs around 2 x 10^6 seconds, or 23.1 days (simulation time, not actual). Even after convergence, the probability mass function exhibits small but significant excursions away from the expected value.
As in Fig. S1, we find that the Gillespie simulations converge on our analytical solutions regardless of the parameters used. We conclude that the analytical solutions for all moments are correct.

S3 Analytically Constrained Exhaustive Search (ACES)
To derive CDRCs from the moments we wrote a numerical search algorithm. Applying a general-purpose fitting algorithm, such as traditional least-squares steepest descent or stochastic approaches like genetic algorithms and simulated annealing, proved problematic. First, optimal behavior of most general-purpose fitters requires an accurate gradient; however, fitting on multiple moments with unknown derivatives makes gradient calculations challenging. It is also unclear how to weight each of the moments in the objective function; for example, is it better to have the mean off by 2% and the skewness off by 20%, or the mean off by 3% and the skewness off by 10%? Moreover, we expected our fitting problem to be underdetermined in most, if not all, scenarios.
The ACES algorithm performs an exhaustive search of CDRC parameter space between each parameter's minimum and maximum value, where each CDRC range has been estimated from the literature (see main document). Given these boundary values for each CDRC, we subdivide the range equally in log space according to a resolution R and test every combination of each parameter value with every other parameter value. Effectively, we test R^6 CDRC combinations in an ACES run at resolution R. The first step of the ACES algorithm is discussed in the Materials and Methods section; the full pseudocode and a more detailed explanation are provided below. Source code for ACES is provided in a separate file.

S3.1 Algorithm
Procedure ACES(Mean_input, Variance_input, Skewness_input, Kurtosis_input)
    ...
    for all x ∈ Vdm
        dp_min ← ..., dp_max ← ...
        dp_min ← max(dp_min, d_p-min), dp_max ← min(dp_max, d_p-max)
        Rreduce ← (dp_max − dp_min) / (d_p-max − d_p-min)
        Vdp ← N_log_space(dp_min, dp_max, ceiling(R * Rreduce))
        for all y ∈ Vdp
            ...
            Rreduce ← (ton_max − ton_min) / (t_on-max − t_on-min)
            Vton ← N_log_space(ton_min, ton_max, ceiling(R * Rreduce))
            for all z ∈ Vton
                t_off ← pstar * Vton(z)

In short, ACES subdivides the first parameter k_m into R values from k_m-min to k_m-max spaced evenly in log space (function N_log_space()). We call this vector Vkm. We then iterate through that vector. For each value Vkm(i), we compute the range of possible values for the next parameter, k_p.
By default, k p will range from its min to its max, as defined by the physiological ranges (user input).
However, given that the particular k m value is fixed at Vkm(i), and that the mean is fixed, it is now possible that certain values within the full physiological range of k p cannot satisfy the mean equation.
The equations immediately following each for loop compute the maximum and minimum values of the current parameter given the parameters that are fixed at that point in the algorithm (i.e., all preceding parameters).
When t of f is reached, we have one equation and one unknown; the remaining parameters are fixed.
Upon solving for t_off, we have a candidate CDRC solution set that is guaranteed to satisfy Eq. 65 with the mean = Mean_input. We then check the higher moments. The functions Varsoln(), Skewsoln(), and Kurtsoln() take a set of CDRCs as input and return the corresponding moment solution.
In principle, the order in which the parameters are scanned should not matter. In practice, order matters slightly, since it changes exactly which values are placed in the CDRC test sets. This difference becomes less substantial as the resolution R increases, as expected. For the resolutions tested (usually R = 125), order did not affect the MD values obtained.
ACES differs in behavior from a conventional fitting algorithm. ACES is deterministic; that is, given the same input moments, physiological ranges for the CDRCs, resolution, and moment tolerance, ACES will always return the same set of results. This is similar to a steepest-descent fitting algorithm like traditional least-squares, but differs in that all fits, rather than one fit, are returned. In principle, if we ran least-squares many times, with initial guesses spanning the same CDRC test sets as in ACES, we might obtain a similar result. However, least-squares operates on the assumption that the solution that best minimizes the objective function is the best fit. We found that among the many solutions returned by ACES, all of which satisfy the moment tolerance cutoff, proximity in moment space does not predict proximity in parameter space. That is, among all solutions that satisfy our relative moment error < 1% threshold, the solutions whose moments most closely match the input moments fail to predict the solutions closest in CDRC space. In principle, as R → ∞ we should eventually test the true solution set, which will exactly match the input moments; but at the resolutions we tested, any solution satisfying our threshold was equally likely to be closest to the true CDRC set.

S3.2 Reverse order algorithm
Fitting ACES on our library revealed a subset of input CDRC sets for which we could not find any solution.
While the low initial resolution of 83 left about 500 library members without a solution, increasing the resolution marginally to 127 proved to be enough to fit the majority of the 500. However, 48 remained without solutions, and even raising the resolution as high as 773 yielded solutions for only a few.
These input sets shared several features: all had a negative third moment, and all pairs of t_on and t_off yielded P_on ≈ 1. In the original ACES formulation, the order of CDRC scanning was k_m, k_p, d_m, d_p, then t_on and t_off. We found that even when ACES proposed CDRC sets in which the first four CDRCs were almost exact, t_on and t_off were always off by several orders of magnitude, resulting in candidate moment sets where the skewness was positive rather than negative. This problem arises because when t_on ≫ t_off, exactly how much larger t_on is than t_off has a minuscule impact on P_on but a large impact on distribution skewness. Thus, small errors in the guesses for the first four parameters propagate into massive errors in t_on and t_off.
To address this, we generated a second version of ACES where we reversed the order in which parameters are tested. With t on and t of f tested first, their search is unbiased by choices of the other four parameters. This allowed ACES to find solutions for all remaining input sets at low resolutions (83 or 127).
The reverse-order ACES formulation is about three times slower than the original formulation, and we did not rerun the remainder of the library using this version of ACES. For these reasons, we suggest the original formulation be used at increasing resolutions up to two or three hundred subdivisions. Above that, we recommend testing whether the reverse-order formulation recovers solutions at lower resolutions.

S4 CDRC input library
We have an analytical solution for producing moments from CDRCs, and an inverse solver, ACES, that attempts to find candidate CDRC sets consistent with a set of moments. To test ACES we generated a library of test CDRC sets. For each CDRC set we calculated its theoretical moments using the analytical solutions. Then, fitting only on these resulting moments, we ran ACES.
Since ACES should work regardless of parameter regime, we generated a coarse-grained library of CDRC input sets in which every rate constant is varied across its entire physiological range, and we test every combination of these subject to the min and max protein counts discussed in the paper. Since ACES follows a similar protocol (ACES tests every combination of each parameter across its physiological range), we were careful not to generate input sets that were guaranteed to be exactly tested by ACES. For example, if the physiological max value for k_m were 0.01, and this were also one of the coarse-grained input values for k_m, ACES would be guaranteed to test exactly k_m = 0.01 and would give falsely optimistic estimates of how well CDRCs can be recovered from moments. To avoid this, we generated evenly distributed values in log space of each parameter across its physiological range, scaled inward from each range's min * 1.5 to max / 1.5.

The library of CDRC input sets is then every combination of values for one row with all values in the other rows. For example, input set 1 might be value 1 for all parameters, input set 2 might be value 2 for one parameter and value 1 for all others, and so on.

S5 Contribution of the 5th moment
One of the central findings of this paper is that four moments appear to completely capture the shape of a distribution. However, there are subtle differences between these distributions. A reasonable question to ask is whether measuring a 5th moment could distinguish good CDRC sets from poor CDRC sets that have the same first four moments.
To address this question, we performed the following analysis on the Gillespie-simulated distributions of our "maximally distant" CDRC sets and their references. First, we calculated the fifth sample non-normalized central moment for each distribution. Then we computed the relative error between the 5th moment of each maximally distant set's distribution and that of its reference. We find that the average relative error of a maximally distant set against its reference was 0.073 ± 0.061. The high variance is consistent with some sets showing modest differences in their fifth moments (as high as 0.27), while for the majority the fifth moments are essentially identical. This is consistent with our observation that the distributions measured in this experiment are roughly identical.
One complication is that a large number of observations is required to accurately estimate higher-order moments. We found we could reliably estimate kurtosis with the Gillespie algorithm when distributions were estimated from one million observations. We therefore asked whether the differences in the 5th moment, which are already quite small, were significantly different from the differences we might observe between two independent Gillespie simulations of the same CDRC set. We chose the reference CDRC set for each of our maximally distant pairs and re-simulated each. We found that the relative error between the 5th moments of two distributions simulated from the same CDRC set was 0.08 ± 0.072, essentially the same difference as between distributions generated from CDRC sets that have different rate constants but identical moments.
We therefore conclude that the fifth moments of these maximally distant pairs and their references are indistinguishable, at least when simulating distributions with a million observations. It is possible the fifth moment could provide additional information if we derived finer-resolution distributions by simulating more observations. However, measuring expression in even a million cells is approaching, or beyond, the upper limit of what is measurable experimentally by flow cytometry, and significantly beyond the number measurable by microscopy.

S6 Gillespie simulations
We implemented Gillespie's direct algorithm [3] in C++ for all simulations. After verifying the accuracy of the analytical solutions for the first four moments, we used agreement between these analytical solutions and the distribution sample moments as the convergence criterion. To ensure that convergence was stable, after each initial convergence we continued the simulation for a discrete period and then re-tested for agreement between the sample and analytical moments.

S7 Identifying gamma distributions across the library
The most direct way of identifying which CDRC sets correspond to gamma distributions would be to Gillespie-simulate all library members. We quickly realized this was intractable, as generating sufficiently high-quality probability mass functions by Gillespie simulation proved too time-intensive. For CDRC sets corresponding to average protein counts over 10,000, simulation took on the order of an hour just to rule out a distribution being gamma-shaped, and much longer to confirm a gamma distribution.
To avoid Gillespie-simulating all 8053 library input CDRC sets, we leveraged the known analytical moments of a gamma distribution as a screening tool. Gamma distribution parameters k and θ can be computed directly from the mean and variance: k = μ₁²/μ₂, θ = μ₂/μ₁, where μ₁ is the mean and μ₂ the variance (second central moment). For each library input set we can derive the corresponding candidate k* and θ*. If a CDRC set results in a gamma distribution, we expect agreement between the gamma distribution skewness (μ₃/μ₂^(3/2) = 2/√k) and the CDRC-derived skewness from our analytical result. We further checked our work by (1) comparing candidates' gamma distribution excess kurtosis (μ₄/μ₂² − 3 = 6/k) with our analytically derived excess kurtosis, and (2) Gillespie-simulating low-protein-count example CDRC sets to generate high-quality probability mass functions. Both exercises revealed excellent agreement with the predicted gamma distribution.
Results of this analysis are reported in Data 1, sheet MVSK, in far right columns.