HIV Promoter Integration Site Primarily Modulates Transcriptional Burst Size Rather Than Frequency

Mammalian gene expression patterns, and their variability across populations of cells, are regulated by factors specific to each gene in concert with its surrounding cellular and genomic environment. Lentiviruses such as HIV integrate their genomes into semi-random genomic locations in the cells they infect, and the resulting viral gene expression provides a natural system to dissect the contributions of genomic environment to transcriptional regulation. Previously, we showed that expression heterogeneity and its modulation by specific host factors at HIV integration sites are key determinants of infected-cell fate and a possible source of latent infections. Here, we assess the integration context dependence of expression heterogeneity from diverse single integrations of an HIV-promoter/GFP-reporter cassette in Jurkat T-cells. Systematically fitting a stochastic model of gene expression to our data reveals an underlying transcriptional dynamic, by which multiple transcripts are produced during short, infrequent bursts, that quantitatively accounts for the wide, highly skewed protein expression distributions observed in each of our clonal cell populations. Interestingly, we find that the size of transcriptional bursts is the primary systematic covariate over integration sites, varying from a few to tens of transcripts across integration sites, and correlating well with mean expression. In contrast, burst frequencies are scattered about a typical value of several per cell-division time and demonstrate little correlation with the clonal means. This pattern of modulation generates consistently noisy distributions over the sampled integration positions, with large expression variability relative to the mean maintained even for the most productive integrations, and could contribute to specifying heterogeneous, integration-site-dependent viral production patterns in HIV-infected cells.
Genomic environment thus emerges as a significant control parameter for gene expression variation that may contribute to structuring mammalian genomes, as well as be exploited for survival by integrating viruses.


I) Processing of cytometry data and assessing uncertainty in experimental distributions
A number of studies have noted the importance of accounting for heterogeneity in cell size and cell cycle in order to resolve the component of expression heterogeneity that may be due to noise sources intrinsic to the biochemical processes involved in gene expression. In [1], this was accomplished by 'in-silico' synchronization of microscopy time courses to select cells whose growth had progressed through a specified fraction of the cell cycle since the previous division, as quantified by image-analysis-based size determination. In [2], where cytometry data similar to our own was analyzed (in yeast), the data were gated to select a narrow range of forward and side scatter values (FSC and SSC, considered as measures of cell size and granularity, respectively). The gate was chosen to minimize the width of the fluorescence distribution over a clonal population of cells and to eliminate correlation between scattering measurements and cellular fluorescence, and this was found to be an effective means of isolating the component of expression heterogeneity believed to be due to intrinsic expression noise (which was also quantified using a dual-reporter system for several test cases in that study). Essentially, the effect of such a procedure is to select cells of uniform size and state. However, this procedure uses only a small fraction of the originally collected data. We therefore wished to optimize a correction for our measurements that accounts for cell size and state in our quantification of expression heterogeneity, but which makes use of a larger fraction of the total cell population to improve statistics.
Our aim was to select an optimal method of distribution extraction that minimizes the uncertainty in inferring a protein expression distribution at a defined value of FSC, where we considered the FSC measure as a proxy for cell size. We chose the mean FSC to specify a fluorescence distribution, referring to this conditional GFP distribution as the 'target' distribution (i.e. the distribution that we aim to reproduce). We adopted a procedure that combines gating, possibly 'correcting' the data to further remove correlations, and smoothing the resulting histogram, and we used a bootstrap approach, outlined below, to quantify the uncertainties associated with candidate processing procedures. Our procedure essentially balances two sources of error: 1) error from counting cells whose scattering measures differ from the mean, and whose associated conditional GFP distribution may therefore differ from the 'target' distribution (we also explore the possibility of correcting for these deviations); and 2) the increased counting error that results from counting only the smaller number of cells within a very narrow gate of scattering measures, which was the procedure followed in [2]. Here we limit our discussion to optimization accounting for FSC variability. We note, however, that we performed a similar analysis to account for variability in SSC, and found a similar gating to be optimal, with a similar quantification of distribution uncertainties. Due to correlation between FSC and SSC, combining the two procedures had minimal effect on distribution shape, and our analysis of fit quality only includes uncertainty derived from the FSC procedure, which we now outline.
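The gating step described above can be sketched as follows. This is a minimal illustration on synthetic stand-in data: the variable names, parameter values, and the simulated FSC-GFP correlation are hypothetical, not the study's measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clone: 10^4 cells with size-correlated fluorescence
# (illustrative stand-in for cytometry data, not the study's data).
n_cells = 10_000
fsc = rng.normal(50_000, 8_000, n_cells)                      # forward scatter
gfp = np.exp(rng.normal(3.0, 0.8, n_cells)) * (fsc / 50_000)  # correlated GFP

def gate_mid_fraction(values, fraction=0.6):
    """Boolean mask keeping cells whose `values` fall in the central
    `fraction` of the distribution (e.g. the mid 60% of FSC)."""
    lo, hi = np.quantile(values, [(1 - fraction) / 2, (1 + fraction) / 2])
    return (values >= lo) & (values <= hi)

mask = gate_mid_fraction(fsc, 0.6)
gated_gfp = gfp[mask]

# Roughly 60% of cells survive the gate; their GFP histogram estimates
# the conditional distribution at the mean FSC.
print(mask.mean(), gated_gfp.size)
```

The `gate_mid_fraction` helper is the substantive piece: quantiles at 20% and 80% keep the central 60% of cells, whose GFP values then estimate the conditional distribution at the mean FSC.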

1)
A 'reference' distribution was generated for each clone: Our goal is to optimize a procedure for estimating the conditional GFP distribution at fixed FSC (or SSC), e.g. for the set of cells with mean FSC values. To optimize a distribution-processing procedure, we began by specifying for each clone a 'reference' distribution -that is, a smooth 2-d distribution that captures the co-dependence of GFP and FSC, and might represent the underlying probability distribution from which the collection of cells in each experimental data set were generated. By re-sampling from these 'reference' distributions, we computationally generated a collection of 'synthetic' data sets (described below in S.I.2), with known underlying probability distribution, and known associated 'target' GFP distribution at fixed FSC, on which we could test and optimize different distribution-processing procedures. Specifically, below we will identify an FSC (or SSC) gate size that, together with further distribution-processing procedures to be discussed, when applied to a 'real' data set, such as that generated by cytometry measurements on 10^4 cells from one of the clones in our study, produces an output distribution whose deviation from the 'target' GFP distribution at fixed FSC is minimized.
We generated 'reference' distributions by taking first the 2-d data set of GFP and FSC values for the 10^4 cells that were initially measured for each clone, and generating a 2-d histogram for each (e.g. Fig. S1.A). We then applied a 2-d low-pass Fourier filter to generate a smooth 2-d distribution, which, after normalization (i.e. the count in each bin was divided by the total number of cells counted for the histogram), we considered to reflect an underlying probability distribution that is 'close' to the actual underlying probability distribution of observing any combination of GFP and FSC values for a randomly selected cell of that clone (e.g. Fig. S1.B). Specifically, the filter was set by adjusting a single parameter specifying the cut-off in Fourier space, such that the resulting 2-d distribution specified a mono-modal GFP distribution at the mean FSC value; this conditional distribution was then taken as the 'target' distribution that our processing procedure aims to reproduce for each clone (e.g. Fig. S1.C).
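A low-pass Fourier smoothing of a 2-d histogram, as described above, can be sketched as follows. This is a simplified illustration on synthetic counts; the bin layout, cutoff value, and the hard (sharp) cutoff shape are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2-d GFP-FSC histogram (64x64 bins) for one clone; the
# first argument plays the role of GFP, the second of FSC. The counts
# are synthetic stand-ins for cytometry data.
counts, _, _ = np.histogram2d(rng.normal(32, 6, 10_000),
                              rng.normal(32, 6, 10_000),
                              bins=64, range=[[0, 64], [0, 64]])

def lowpass_smooth(hist2d, cutoff):
    """Zero out Fourier modes beyond `cutoff` (a single parameter, as in
    the text) and return a normalized, smooth 2-d distribution."""
    f = np.fft.fft2(hist2d)
    kx = np.fft.fftfreq(hist2d.shape[0]) * hist2d.shape[0]
    ky = np.fft.fftfreq(hist2d.shape[1]) * hist2d.shape[1]
    keep = (np.abs(kx)[:, None] <= cutoff) & (np.abs(ky)[None, :] <= cutoff)
    smooth = np.real(np.fft.ifft2(f * keep))
    smooth = np.clip(smooth, 0, None)   # remove small negative ringing
    return smooth / smooth.sum()        # normalize to a probability

reference = lowpass_smooth(counts, cutoff=8)
# The conditional GFP slice at the mean-FSC bin is the 'target' distribution.
target = reference[:, 32] / reference[:, 32].sum()
print(target.sum())
```

The cutoff would be tuned, as in the text, until the conditional GFP slice at the mean FSC is mono-modal.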

2)
Bootstrap-generated synthetic data: We re-sampled from the smooth 2-d 'reference' distributions to computationally generate a collection of 'synthetic' data sets for each clone, with known underlying probability distribution and known associated 'target' GFP distribution at fixed FSC. We note that the distributions in our study were broader than most that had been analyzed in other studies [1,2], with cellular fluorescences spanning a larger than typical range of values relative to the mean fluorescence for any given clone. This is discussed further below in Section S.VIII, where we address extrinsic sources of noise; in the Discussion section of the main text we considered features of the HIV promoter that might account for its observed large expression variability, and how this variability might play an advantageous role in the viral life-cycle dynamics.

4)
Possible 'corrections' to the processing procedure: Two other possible 'size corrections' were considered that might allow us to further increase the range of FSC values included in our histograms, so that our processing procedure could optimally resolve a 'target' GFP distribution at fixed FSC. We followed the same procedure as in Step 3, but subsequent to gating we either subtracted the linear correlation between FSC and GFP from the GFP values ('lin'), or divided each GFP value by the ratio of the cell's FSC value to the mean FSC value ('div'). In each case, we found that agreement with our 'target' distribution was not significantly improved, and a gating window of order 60% remained optimal (Fig. S1.E).

5)
Applying the optimal procedure to the 'real' data: Based on our optimization of a distribution-processing procedure for our 'synthetic' data (which was re-sampled from the original data after smoothing, see S.I.2), we followed the simple 'gating' procedure identified above in S.I.3 to process our experimental data, keeping only cells whose FSC values spanned the mid 60% for each clone. In this way, we optimally extracted a smooth GFP distribution that represents the probability of observing a given cellular fluorescence in cells whose FSC value is fixed at the mean over the population. A similar procedure was applied to select cells whose SSC values span the mid 60% (which was found as well to be nearly optimal for selecting a GFP distribution at the mean SSC for all clones). As a result of gating by both FSC and SSC, data from approximately 3600 cells per clone were used to generate the final histograms that were smoothed and used in our analysis for model fitting.

6)
Estimate of uncertainty in our experimental distributions: The analysis in Step 3 then provides an estimate of the uncertainty in estimating a target GFP distribution at the mean FSC value by using a gate that spans the mid 60%. In other words, the distribution of deviations for our optimized processing procedure, between the processed 'synthetic' data sets and their associated 'target', for each clone (both specified by the 'reference' distribution for each clone), provides an estimate of the deviation that we expect between our optimally processed experimental data and the actual underlying probability distribution of observing a GFP value from that clone for a cell whose FSC value is the average over the population. Specifically, we used the 95% upper bound on the values of distribution deviations (Dev) from the 'target' over the set of 'synthetic' data sets for each clone as a measure of the uncertainty in identifying the desired underlying probability distribution from our cytometry data. This estimate of uncertainty in our experimental distributions was then used to normalize the calculated deviations between our model fits and the processed experimental distributions, as described in the Materials and Methods section of the main text.
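The uncertainty estimate above can be sketched as follows. This is a minimal illustration in which the 'target' is a synthetic discretized distribution, and the deviation measure Dev is taken, for concreteness, to be an L1 (total absolute) difference between histograms; that choice is an assumption here, since Dev is defined in the main text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 'target': a discretized lognormal-like GFP distribution
# over 50 bins (synthetic stand-in, not the study's data).
bins = np.arange(50)
target = np.exp(-(np.log(bins + 1) - 2.5) ** 2 / 0.5)
target /= target.sum()

def bootstrap_deviation_bound(target, n_cells=3600, n_boot=200, q=0.95):
    """Resample `n_boot` synthetic data sets of `n_cells` cells from
    `target`, histogram each, and return the 95% upper bound on the
    deviation (here an L1 distance) from the target."""
    devs = np.empty(n_boot)
    for i in range(n_boot):
        sample = rng.choice(len(target), size=n_cells, p=target)
        hist = np.bincount(sample, minlength=len(target)) / n_cells
        devs[i] = np.abs(hist - target).sum()
    return np.quantile(devs, q)

uncertainty = bootstrap_deviation_bound(target)
print(uncertainty)
```

The returned bound plays the role described above: a normalizing scale for deviations between model fits and processed experimental distributions.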

7)
Quantifying the contribution of cell-size variability to expression variability: As a check on the contribution of cell-size variability to our measurement of GFP expression variability, we calculated the relative contribution to the GFP distribution variance from correlation between GFP and FSC measurements (Fig. S1.F). This quantity is essentially the R^2 value for linear regression of fluorescence measurement against FSC. In general, if we take FSC as a measure of cell size, we conclude that cell-size variability only contributes a few percent of the observed GFP distribution variances, despite the wide gate of FSC values that we have adopted. This suggests that the large expression heterogeneities observed in this study are not due to variability in cell size (see Sec. S.VIII for further discussion).
As an extra check, to be sure that our wide gating is not distorting the shape-features of our processed distributions, we considered the behavior of distribution coefficients of variation ('CV' = σ/µ) and skewness ('skew' = m_3/σ^3, where m_3 is the distribution's 3rd central moment), for varying gate sizes in the FSC/SSC plane (Figs. S1G-H). We find that even at gate sizes as small as 10% (i.e. a factor of 6 smaller in each scattering dimension than the optimized gate discussed above), the coefficient of variation is only typically affected by a few percent, and the skew by approximately 15%. Further, if we normalize our GFP values by subtracting off the average correlation between GFP and FSC (or SSC), as suggested above in Sec. S.I.4, the decrease in average coefficient of variation and skew at smaller gates is approximately cut in half (not shown). In general, the effect on distribution moments of reducing the gate size to 10% demonstrates significant scatter, and a number of clones surprisingly demonstrate larger coefficients of variation and skews for reduced gate sizes. This analysis further confirms that our relatively wide 60% gating is only generating a slight overestimate of distribution noise and skew, and that our choice of gate size is warranted as optimally reproducing a GFP distribution at a fixed value of scattering measure.
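The quantities used in these checks can be computed as follows; a minimal sketch on synthetic data, where the variable names and the simulated correlation strength are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative gated data: skewed GFP weakly correlated with FSC.
n = 3600
fsc = rng.normal(0, 1, n)
gfp = np.exp(rng.normal(0, 1, n)) + 0.1 * fsc

def r_squared(y, x):
    """Fraction of the variance of y explained by linear regression on x."""
    return np.corrcoef(x, y)[0, 1] ** 2

def cv(y):
    """Coefficient of variation, sigma / mu."""
    return y.std() / y.mean()

def skew(y):
    """Skewness, m3 / sigma^3, with m3 the 3rd central moment."""
    m3 = np.mean((y - y.mean()) ** 3)
    return m3 / y.std() ** 3

print(r_squared(gfp, fsc), cv(gfp), skew(gfp))
```

For weakly size-correlated data such as this, `r_squared` is small (a few percent), mirroring the conclusion that cell size contributes little to the observed GFP variance.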
The approach taken here to optimize our distribution-processing procedure, and the role it plays in our analysis of heterogeneity-generating transcriptional dynamics, is summarized in a flow diagram, Table S1.

II) Solving the model
The model in Fig. 2A was solved numerically to identify the model-predicted probability of observing a given number of copies of protein (GFP) in a given cell, for comparison to our processed experimental distributions. To systematically search the parameter space of the model and identify a best-fit parameter combination for each clone, an efficient and precise algorithm was necessary. The algorithm that was used, which we discuss below, was implemented in Matlab (The Mathworks) and is available upon request.
The time evolution of the probability distribution for the model in Fig. 2A -denoted W{φ, m, n}, which specifies the probability of observing a cell in a particular configuration defined by the gene-state (φ_a/r), the number of copies of the transcript present in the cell (m copies), and the number of copies of the protein (n copies) -is given by the Chemical Master Equation (CME) for the system, which is a standard representation for describing well-mixed chemically reacting systems [3]. Marginalizing over the gene-state and transcript number distributions gives the protein number distribution that we wish to calculate for comparison with our experimental cytometry data.
The CME for our system is:

dW{φ_a, m, n}/dt = κ_a W{φ_r, m, n} - κ_r W{φ_a, m, n} + κ_t+ [W{φ_a, m-1, n} - W{φ_a, m, n}] + κ_t- [(m+1) W{φ_a, m+1, n} - m W{φ_a, m, n}] + κ_p+ [m W{φ_a, m, n-1} - m W{φ_a, m, n}] + κ_p- [(n+1) W{φ_a, m, n+1} - n W{φ_a, m, n}],

dW{φ_r, m, n}/dt = κ_r W{φ_a, m, n} - κ_a W{φ_r, m, n} + κ_t- [(m+1) W{φ_r, m+1, n} - m W{φ_r, m, n}] + κ_p+ [m W{φ_r, m, n-1} - m W{φ_r, m, n}] + κ_p- [(n+1) W{φ_r, m, n+1} - n W{φ_r, m, n}].     (1)

The full set of equations specified by eq. 1, for the probability of observing any state, is an infinite linear system of ordinary differential equations (ODEs).
For many-component systems (i.e. large numbers of reacting species), a direct solution to the master equation is not computationally feasible, and stochastic simulation techniques are often used to directly simulate trajectories of the stochastic process under consideration [4], with a large number of simulations generally necessary to sample the desired underlying probability distribution. However, for systems such as ours, with only a few components, a direct solution to the CME becomes feasible, and it can provide a fast and accurate method for calculating the desired probability distributions. This is the approach that we took.
Though an analytic solution exists for the transcript distribution in our model, as well as for a small number of other simple models [5,6,7], none exists for the protein distribution in our model. We therefore solved the system numerically. Here we discuss several important aspects of the numerical algorithm that we used. In particular: 1) Truncating the system at large protein and transcript numbers; 2) A continuum approximation that we used to coarse-grain the distribution; 3) The algorithm that was used to evolve the system in time to steady state; 4) Several checks that were used to ensure that the algorithm converged to the correct distribution.

1)
State-space truncation: In order to numerically solve the infinite system of equations represented by the CME, it was necessary to truncate the system at large values of transcript and protein number to specify a finite set of equations. We chose to truncate the system at transcript and protein numbers with values corresponding to the steady-state mean plus 15X the standard deviation of the transcript and protein distributions, which we analytically calculated (see Sec. S.III below). All probabilities were set to zero beyond these cut-offs, which represent states of the system that are effectively 'almost never' visited by the system over the time-course of our simulations.
The consistency of this choice of cut-off was confirmed in two ways. First, in all test cases where we examined them, the calculated probabilities of observing states at the boundary of our cut-off were of the order of the machine precision. Second, the Finite State Projection (FSP) algorithm, elegantly developed by Munsky and Khammash [8], provides a rigorous method of calculating an upper bound on the error due to state-space truncation. We applied this procedure to several test cases where the system was small enough to be solved exactly by the FSP, and always found the error bound to be several orders of magnitude smaller than the uncertainty in our experimental distributions, over the mid 98% of the distribution.
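The choice of cut-off can be illustrated with the analytic transcript moments. Parameter values below are illustrative, and the transcript variance uses the standard steady-state result for the two-state model.

```python
import numpy as np

# Illustrative two-state model parameters (not fitted values).
k_a, k_r = 0.5, 10.0          # gene activation / deactivation rates
k_t_on, k_t_off = 100.0, 1.0  # transcription / transcript-degradation rates

p_on = k_a / (k_a + k_r)               # fraction of time the gene is active
mean_m = (k_t_on / k_t_off) * p_on     # steady-state mean transcript number
# Steady-state transcript Fano factor for the two-state model:
fano = 1.0 + k_t_on * k_r / ((k_a + k_r) * (k_a + k_r + k_t_off))
var_m = mean_m * fano

# Truncate the state space at the mean plus 15 standard deviations.
m_max = int(np.ceil(mean_m + 15.0 * np.sqrt(var_m)))
print(mean_m, var_m, m_max)
```

States beyond `m_max` carry negligible probability over the simulated time course, which is why setting their probabilities to zero is safe.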

2)
Coarse-graining: The CME is an exact formulation to evolve the probability of observing each individually-enumerated state of a chemically reacting system. However, for systems with large numbers of particles, the CME represents a very large system of equations (even after truncation). Under these conditions, though the system is large, the desired distributions often vary smoothly and admit a continuum approximation. For example, a system-size expansion leads to a Fokker-Planck Equation (FPE) [3,9], which for our system would represent transcript and protein numbers as continuous variables, and their evolution would be described by a pair of Partial Differential Equations (PDEs), rather than an infinite system of ODEs. For systems with large numbers of particles but small numbers of components, a numerical solution to the FPE can often be achieved with acceptable accuracy more efficiently than the solution to the original CME -that is, if a discretization can be achieved that leads to a substantially smaller number of equations through appropriate approximations of the PDF derivatives that are involved in the FPE. On the other hand, many systems sample states of both high and low particle numbers, as is often the case for cellular systems and in particular for our own. For such systems, relevant regions of the state space exist where a continuum approximation provides an accurate computation, while other regions, containing states of low particle number, exist where the continuum approximation is inaccurate and it is necessary to solve the master equation directly to determine the PDF with acceptable accuracy. Hybrid algorithms are therefore often employed to partition the system and apply an appropriate and efficient method in each part of the state space and/or to different subsets of reactions.
Such hybrid algorithms have generally been developed in connection with stochastic simulation algorithms [4,10,11,12,13], while methods that directly use a continuum approximation, such as the FPE, might use a state-space discretization for their numerical algorithm that converges to the CME for small particle numbers [13].
To take advantage of a continuum approximation at large particle numbers in our system, while keeping an exact CME treatment at low particle numbers, we applied a graded coarse-graining procedure. Probabilities were binned together, separately for each gene-state, into 'rectangles', or 'bins', in the transcript-protein plane. Bin sizes increased for larger transcript and protein numbers, where the distributions generally changed more slowly and were approximated well by a continuous distribution that could be expanded in a Taylor series for small changes in protein and transcript number. An approach to grading that we found to yield efficient and stable solutions was coarse-graining with bin sizes increasing in proportion to the square root of the protein or transcript number, subject to the constraint that bins with protein or transcript numbers below 50 contain exactly one state (any single-state threshold of at least order 10 was effective in this respect).
The probability in a 'bin' was specified as the sum of the probabilities of all system configurations in the bin. Transition rates between 'binned' configurations were then specified by linearly approximating the values of the distribution at the boundaries of the bins. Thus, transition rates from one gene-state to another were still calculated exactly, while protein and transcript transitions were accounted for based on this continuum description, whose coarseness increased for larger transcript and protein numbers. For states of small transcript and protein number, this method retains the exact form of the CME. For large transcript and protein numbers, the resulting equations are equivalent to a discretization of the FPE that corresponds to our system. Coarse-graining and interpolation for transition rates was done in such a way that retains the probability-conserving feature of the CME.
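One way to build such a graded binning is sketched below. This is an illustrative implementation; the exact grading used in the study may differ in detail.

```python
import numpy as np

def graded_bin_edges(n_max, single_state_below=50):
    """Bin edges over 0..n_max: single-state bins below
    `single_state_below`, then bin widths growing roughly in proportion
    to the square root of the particle number."""
    edges = list(range(min(single_state_below, n_max) + 1))
    while edges[-1] < n_max:
        width = max(1, int(np.sqrt(edges[-1])))
        edges.append(min(edges[-1] + width, n_max))
    return np.array(edges)

edges = graded_bin_edges(10_000)
widths = np.diff(edges)
# Bins below the threshold hold exactly one state; later bins grow,
# so a state space of 10^4 states collapses to a few hundred bins.
print(len(edges), widths[:3], widths[-1])
```

This square-root grading keeps the exact CME at low copy numbers while compressing the high-copy-number region where the distribution varies smoothly.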
This approximation scheme greatly increased computation speed, and its validity was checked by increasing and decreasing the length-scale of the coarse-graining/binning procedure by a factor of 2, over which the solutions were found to remain stable.

3)
Evolving the system to steady state: The CME for our system, as well as the smaller system that resulted from our coarse-graining procedure, represents a stiff system of ODEs. For this reason, the system is integrated more efficiently with an implicit solver (one that evaluates derivatives based on the function values at the next time step for which it is solving). A method that we found to be fast and stable steps through time by treating, on a given time step, the transitions due to transcript and gene-state dynamics implicitly (using a backward Euler method) and those due to protein dynamics explicitly (using a forward Euler method), and exchanging the terms that are treated implicitly and explicitly in the following time-step. Such methods are used to integrate multi-dimensional PDEs in order to maintain numerical stability while taking longer time steps, and they facilitate the use of fast matrix-inversion methods that take advantage of the banded structure that arises when only a single dimension is treated implicitly at each time step in reaction-diffusion-type problems [14].
The system was initialized with the gene inactive and no transcript or protein present (W{φ_r, 0, 0} = 1). The system was then integrated by the above procedure until a steady-state distribution was reached. Steady state was declared when the relative change in probability for every bin with value greater than 0.1% of the maximum over the set of bins (this generally covered more than the mid 98% of protein values, which we used to compare to our experimental distributions) fell below 10^-5 · dt, where dt is the size of the time step. This represents a change 4 orders of magnitude slower than the slowest time scale in the problem (which was generally the protein degradation time), and the total time for which the system was evolved was always at least 50× the protein degradation time. This steady-state criterion always resulted in good agreement between the moments of the calculated distribution and their theoretical steady-state values (see below).
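A minimal sketch of evolving a truncated CME to steady state is shown below. For brevity this toy omits the protein stage and uses a single fully implicit (backward Euler) operator rather than the alternating implicit/explicit splitting described above; all parameter values are illustrative.

```python
import numpy as np

# Toy CME: two-state gene with transcription (active state only) and
# transcript degradation, truncated at m_max. Illustrative rates:
k_a, k_r, k_t, k_d = 1.0, 5.0, 40.0, 1.0
m_max = 200
size = 2 * (m_max + 1)

def idx(g, m):
    """Flatten (gene-state g in {0: inactive, 1: active}, transcript m)."""
    return g * (m_max + 1) + m

A = np.zeros((size, size))  # CME generator: dW/dt = A @ W
for m in range(m_max + 1):
    # Gene switching between inactive (0) and active (1) states.
    A[idx(1, m), idx(0, m)] += k_a; A[idx(0, m), idx(0, m)] -= k_a
    A[idx(0, m), idx(1, m)] += k_r; A[idx(1, m), idx(1, m)] -= k_r
    # Transcription in the active state (truncated at m_max).
    if m < m_max:
        A[idx(1, m + 1), idx(1, m)] += k_t
        A[idx(1, m), idx(1, m)] -= k_t
    # Transcript degradation at rate m * k_d.
    if m > 0:
        for g in (0, 1):
            A[idx(g, m - 1), idx(g, m)] += m * k_d
            A[idx(g, m), idx(g, m)] -= m * k_d

# Backward Euler: (I - dt A) W_new = W_old.
dt = 0.1
step = np.linalg.inv(np.eye(size) - dt * A)

W = np.zeros(size)
W[idx(0, 0)] = 1.0        # gene inactive, no transcript present
for _ in range(1000):     # evolve well past the slowest relaxation time
    W = step @ W

m_values = np.tile(np.arange(m_max + 1), 2)
mean_m = float(m_values @ W)
print(W.sum(), mean_m)
```

Because the columns of the generator A sum to zero, the backward Euler step conserves total probability exactly, and the long-time mean reproduces the mass-action value k_t/k_d · k_a/(k_a + k_r).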

4)
Solution accuracy: We checked the accuracy of our calculations by multiple methods, including the checks on the various approximations in our algorithm that were mentioned above.
In addition, we compared the first 3 moments of our simulated distributions to their theoretical values, where the distribution moments were analytically calculable as outlined in the next section. Deviations were always less than 0.1% (generally less than the uncertainty in these moments estimated for our experimental data), and often significantly less. In addition, for several test cases where the mean transcript and protein numbers were small enough to calculate an effective steady-state distribution by the FSP algorithm without any approximation -this is an exact calculation with an associated rigorously-calculated error bound [the FSP was discussed above; see 8] -we compared our results to this method. We typically found relative deviations between our solution and the FSP solution to be several orders of magnitude less than the uncertainty in our experimental distributions.

III) Distribution moments
Distribution moments of all orders can be obtained analytically for our system by standard methods [3], and were important for qualitatively analyzing our experimental distributions, characterizing the bursting regime in our theoretical analysis, generating initial guesses at best-fit model parameters, and specifying non-fit model parameters from independent measurements. Here we use a generating-function approach to calculate the distribution moments of the model. The generating function for the system is defined, for each gene-state φ, as:

G_φ(x, y, t) = Σ_{m,n≥0} x^m y^n W{φ, m, n},     (2)

and satisfies the generating-function equations:

∂G_a/∂t = κ_a G_r - κ_r G_a + κ_t+ (x - 1) G_a + [κ_t- (1 - x) + κ_p+ x (y - 1)] ∂G_a/∂x + κ_p- (1 - y) ∂G_a/∂y,

∂G_r/∂t = κ_r G_a - κ_a G_r + [κ_t- (1 - x) + κ_p+ x (y - 1)] ∂G_r/∂x + κ_p- (1 - y) ∂G_r/∂y.     (3)

This is simply a restatement of the infinite system in eq. 1 in multivariate form, and the full set of ODEs that specifies the Master Equation for the system can be recovered by equating coefficients of powers of x and y on both sides of eq. 3.
Moments of the distribution are calculated by taking derivatives of eq. 3 with respect to x and y, and evaluating at x = y = 1. In particular, the mean numbers of transcripts and proteins evolve, respectively, according to:

d⟨m⟩/dt = κ_t+ ⟨φ_a⟩ - κ_t- ⟨m⟩,  d⟨n⟩/dt = κ_p+ ⟨m⟩ - κ_p- ⟨n⟩,     (4)

where the brackets ⟨·⟩ denote an average over all gene states, transcript numbers, and protein numbers, for the probability distribution generated by the two-state model. At steady state, this yields the expressions that would be expected from assuming deterministic, mass-action kinetics. Namely:

⟨m⟩ = (κ_t+/κ_t-) κ_a/(κ_a + κ_r),  ⟨n⟩ = (κ_p+/κ_p-) ⟨m⟩.     (5)

In general, one can generate a hierarchy of algebraic equations to be solved for higher derivatives of the generating function with respect to x and y, which are used to calculate higher moments of the distribution, in terms of lower derivatives, and eventually in terms of model parameters. This allows one to calculate analytic expressions for all moments of the joint distribution for the system in terms of model parameters. In particular, at steady state the variances of the transcript and protein distributions are given according to:

σ_m²/⟨m⟩² = 1/⟨m⟩ + (κ_r/κ_a) κ_t-/(κ_a + κ_r + κ_t-),

σ_n²/⟨n⟩² = 1/⟨n⟩ + (1/⟨m⟩) κ_p-/(κ_t- + κ_p-) + (κ_r/κ_a) κ_t- κ_p- (κ_a + κ_r + κ_t- + κ_p-) / [(κ_a + κ_r + κ_t-)(κ_a + κ_r + κ_p-)(κ_t- + κ_p-)].     (8)

The same expressions for the first two moments of the protein distribution for this model are given in slightly reorganized form in the supplement to [15], and the first two moments of the transcript distribution are given, together with a very elegant analytic derivation of the full transcript distribution, in [5].
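As a numeric illustration of the steady-state means, with illustrative parameter values (not values fitted in the study):

```python
# Steady-state means of the two-state model, evaluated directly.
# All rate values below are illustrative.
k_a, k_r = 0.5, 10.0          # gene activation / deactivation rates
k_t_on, k_t_off = 50.0, 1.0   # transcription / transcript degradation
k_p_on, k_p_off = 20.0, 0.1   # protein production / dilution

p_active = k_a / (k_a + k_r)            # <phi_a>: fraction of time active
mean_m = (k_t_on / k_t_off) * p_active  # mean transcript number
mean_n = (k_p_on / k_p_off) * mean_m    # mean protein number
print(mean_m, mean_n)
```

As the text notes, these are exactly the values expected from deterministic mass-action kinetics, even though the underlying distributions are broad.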
As mentioned in Sec. S.II, though all distribution moments for the two-state model are analytically calculable, the protein distribution itself is not, and was solved for numerically as described above. The first 3 analytic moments were therefore used to assess both the accuracy of our numerical simulations and convergence to steady state.

IV) Bursting regime
The bursting regime is specified by relatively short active-state durations (κ_r >> κ_t-), multiple transcripts produced during each gene-activation event (b = κ_t+/κ_r finite), and moderate frequency of gene activation (in particular, κ_r >> κ_a). Under these conditions, the transcript distribution mean and variance reduce to:

⟨m⟩ = b κ_a/κ_t-,     (9)

σ_m² = ⟨m⟩ (1 + b),     (10)

with the protein variance related to the transcript variance according to:

σ_n²/⟨n⟩² = 1/⟨n⟩ + (σ_m²/⟨m⟩²) κ_p-/(κ_t- + κ_p-).     (11)

The active duration for the two-state model is the average time that the gene remains in the active configuration during a single visit, and is given by:

τ = 1/κ_r

(the distribution of lengths of time for a single visit to the active state is exponential, with decay constant κ_r, and here we omit the normalization by transcript decay time that is used in the main text and model fitting). While the gene remains in the active state, transcript is produced at a fixed rate (i.e. in a Poisson process), and the average number of transcripts produced during a single visit to the active state is given by the product of the production rate (κ_t+) and the average time that the gene remains in the active state (i.e. τ). Namely:

b = κ_t+ τ = κ_t+/κ_r.

This is the quantity that we have defined as the transcriptional burst size.
In the bursting regime, the frequency of gene-activation events reduces to κ a , because κ r >> κ a . Thus, in the bursting regime, κ a specifies the burst frequency, and the mean transcript number is given by the product of burst size and burst frequency (normalized by the rate of transcript degradation, see eq. 9).
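The bursting-regime approximation can be checked numerically; a small sketch with illustrative rate values, chosen so that κ_r >> κ_a and κ_r >> κ_t-:

```python
# Bursting-regime check: mean transcript number equals burst size times
# burst frequency, normalized by the transcript degradation rate.
# Illustrative rates (deep in the bursting regime):
k_a = 2.0        # burst frequency (activations per unit time)
k_r = 100.0      # deactivation rate (fast: k_r >> k_a, k_t_off)
k_t_on = 1000.0  # transcription rate while active
k_t_off = 1.0    # transcript degradation rate

tau = 1.0 / k_r                    # mean active-state duration
b = k_t_on * tau                   # burst size: transcripts per activation
mean_m_burst = b * k_a / k_t_off   # bursting-regime mean

# Exact two-state mean: <m> = (k_t_on/k_t_off) * k_a/(k_a + k_r).
mean_m_exact = (k_t_on / k_t_off) * k_a / (k_a + k_r)
print(b, mean_m_burst, mean_m_exact)
```

For these rates the bursting approximation agrees with the exact two-state mean to within a few percent, and the agreement improves as κ_r grows.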

V) Estimating non-fit model parameters
A number of the processes represented in Fig. 2A of the main text likely occur at locations that are spatially separated from the site of viral integration in our system, and are therefore expected to occur at the same rate for all integration clones. Namely, the rates of transcript degradation (κ_t-), protein production (κ_p+), and protein degradation (κ_p-) are assumed fixed over the full set of integration clones. A conversion factor from protein number to cytometer-measured relative fluorescence units (RFU) is also necessary to compare model results to our experimental data; we refer to this factor as 'α'. Here we discuss how these parameters were estimated experimentally. Protein degradation in our model should more precisely be labeled as protein dilution, and the combination α (κ_p+/κ_p-) relates the mean RFU to the mean transcript number via eq. 5. We thus obtained α (κ_p+/κ_p-) = 2.5 from the measured ratio of population-mean RFU to population-mean transcript number for the analyzed clone. Similarly, in the bursting regime the distribution variances are related according to eq. 11, and the 1st term on the RHS can be dropped in the regime of large protein numbers, yielding:

σ_RFU² = [α (κ_p+/κ_p-)]² σ_m² κ_p-/(κ_t- + κ_p-),

after regrouping terms and using eq. 5 and the above-estimated value of α (κ_p+/κ_p-). From the measured ratio of RFU and transcript variances for the considered clone, we thus obtained the ratio κ_p-/(κ_t- + κ_p-). The remaining split of α (κ_p+/κ_p-) into α and κ_p+/κ_p- separately does not significantly affect distribution shape, and we chose κ_p+/κ_p- ~ 20 for convenience, noting that the experimental values for our system are likely several orders of magnitude greater (e.g. using the average value of κ_p+/κ_t- ~ 1200 estimated by [16] over their data set gives an estimate of κ_p+/κ_p- ~ 4800 for our system).
The effects of uncertainties in the values of our non-fit model parameters were not explicitly considered in our analysis in the main text. Here we comment on how these uncertainties might affect our findings. Uncertainties in determining the ratio α (κ_p+/κ_p-) were analyzed by sampling several values above and below the measured value (differing by up to a factor of 2.5), and then re-applying our fitting procedure to the cytometry data. Our key finding, that transcriptional burst size is the primary feature that varies over viral integrations, is expected to be robust to uncertainty in the non-fit model parameters.

VI) Fitting procedure and fit quality
The routine that we used to search the model parameter space for a best-fit combination for each clone, which is discussed in the main text, required a parameter estimate for initialization. This initialization was accomplished by estimating model parameters based on the first two distribution moments, and assuming transcriptional bursting, as follows.
In the bursting regime, distributions are effectively described by the transcriptional burst size and burst frequency (with other non-fit parameters fixed at their separately measured values, see Sec. S.V above). More generally, for a fixed value of active-state duration (τ), the model is specified by only two integration-dependent parameters (in any regime), which reduce to the transcriptional burst size and frequency in the bursting regime. Since the mean and variance are analytically calculable for any combination of parameters (see Sec. S.III above), their analytic forms (given by eqs. 5, 8) could be inverted to specify the remaining integration-dependent model parameters in terms of the distribution moments. Thus, we were able to analytically calculate a set of model parameters that reproduce the first two moments of each experimental distribution, for any fixed active duration, for each clone. In the bursting regime, this led to an analytically calculable transcriptional burst size and burst frequency. These calculated 'moment-fit' parameters were used to initialize the more systematic fitting routine that was used to obtain the results discussed in the main text. To be certain that the results of our fitting procedure were not dependent on the moment-based initial guess of the fit parameters, the initial-guess parameters were varied randomly by up to a factor of two, and the routine always converged to the same set of best-fit parameters for each clone, to within our estimate of experimental accuracy.
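As an illustrative sketch of the moment inversion in the bursting regime (not the actual fitting code; the function name is hypothetical), a negative-binomial-like bursting distribution with mean = f·b and variance = f·b·(1 + b) inverts in closed form to give the burst size b and burst frequency f:

```python
def moment_init(mean, var):
    """Moment-based initial guess for bursting-regime parameters.

    Assumes a negative-binomial-like transcript distribution with
    mean = f * b and variance = f * b * (1 + b), where b is the
    transcriptional burst size and f the burst frequency in units
    of the transcript degradation rate.
    """
    if var <= mean:
        raise ValueError("sub-Poissonian moments: bursting model inapplicable")
    b = var / mean - 1.0   # burst size (transcripts per burst)
    f = mean / b           # burst frequency (per transcript lifetime)
    return b, f

# Example: a mean of 50 transcripts with variance 550
b, f = moment_init(50.0, 550.0)   # b = 10.0, f = 5.0
```

These closed-form estimates served only as starting points; the systematic routine then refined them against the full distribution shape.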
The quality of the final fits was such that the deviations from the experimental distributions (Dev) for a number of clones were comparable to the uncertainty in our data (Fig. S2.A), and most were significantly improved over the initial moment-based guess (Fig. S2.B).
In particular, the moment-fits generally did not account for the full skew of the experimental distributions, while the more systematic fitting routine, which minimized the deviation between the full model and experimental distributions (as described in the main text), only approximated the mean and variance, allowing small deviations in these features in order to better capture other distribution features such as skew. Nevertheless, some systematic deviation remained, with the systematic model fits still often demonstrating smaller skews than the data; that is, model fits often peaked to the right of the data and underestimated the right tail of the distribution at larger fluorescence (Fig. S2.C). Where the distribution deviations were maximal, their magnitude generally remained larger than expected from uncertainty in the data, even for fits whose overall deviation (Dev) fell below the value expected from experimental uncertainty (Fig. S2.D).
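The precise form of the Dev statistic is defined in the main text; as an illustrative stand-in (names hypothetical), a counting-error-weighted squared deviation between binned model and experimental frequencies behaves in the way described above, approaching order 1 per bin when residuals are comparable to counting uncertainty:

```python
import math

def dev(model_freq, expt_counts):
    """Counting-error-weighted deviation between a normalized model
    distribution and an experimental histogram (an illustrative
    stand-in for the Dev statistic defined in the main text).

    model_freq : normalized model probabilities per bin
    expt_counts: raw experimental counts per bin
    """
    n = sum(expt_counts)
    total = 0.0
    for p_model, c in zip(model_freq, expt_counts):
        p_expt = c / n
        sigma = math.sqrt(max(c, 1.0)) / n   # Poisson counting error per bin
        total += ((p_model - p_expt) / sigma) ** 2
    return total / len(expt_counts)   # ~1 when residuals match counting error

# A model that exactly reproduces the data gives Dev = 0
d0 = dev([0.25, 0.5, 0.25], [250, 500, 250])   # 0.0
```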
Investigation of other processes that could be added to the model to account for these deviations will be the subject of a future study.

VII) Summary of sources of uncertainty
The primary sources of uncertainty considered in our analysis are counting error (due to the finite number of cells counted to generate our experimental histograms), and error due to uncertainty in specifying a fluorescence distribution at fixed values of scattering measurements, as outlined above in section S.I.
Another potentially important source of uncertainty in our measurements is distribution drift over time, since our analysis assumes that our measurements represent steady-state expression profiles. We carried out longitudinal studies on 6 clones, in which GFP expression profiles were measured daily by cytometry over the course of approximately one week. We found expression profiles to be relatively stable over time, with mean expression fluctuating by a few percent and no significant correlations between clones (Fig. S3.A). However, some systematic drift was evident in the processed distributions. The primary effect of this drift on distribution shape was to effectively scale all fluorescence values by a factor of 1 ± a few percent, such that the distributions were shifted with their shapes preserved over the log-binning of our cytometry measurements (for a log-binned histogram, scaling a distribution effectively translates it by a fixed number of bins while maintaining its shape). For small variations, such a distribution scaling implies that the relative change in the distribution variance is twice the relative change in the distribution mean. This proportionality is clearly seen in our data, as demonstrated in Fig. S3.B.
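The proportionality between relative variance change and relative mean change under a small overall scaling can be checked directly (a minimal numerical sketch; the data values are arbitrary, not measurements):

```python
# Scaling all fluorescence values by c multiplies the mean by c and the
# variance by c**2, so for small epsilon = c - 1 the relative variance
# change, c**2 - 1 = 2*epsilon + epsilon**2, is approximately twice the
# relative mean change, epsilon.
data = [120.0, 340.0, 95.0, 410.0, 230.0, 180.0]
c = 1.03                                   # a few-percent drift factor

mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)

scaled = [c * x for x in data]
mean_s = sum(scaled) / len(scaled)
var_s = sum((x - mean_s) ** 2 for x in scaled) / len(scaled)

rel_mean = mean_s / mean - 1.0             # ~0.03
rel_var = var_s / var - 1.0                # ~0.0609, roughly 2 * 0.03
```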
Indeed, day-to-day distribution variability appeared qualitatively as a translation of the plotted fluorescence histograms along the log-binned fluorescence axis.

Similarly, uncertainty generated by our instrumentation contributes to uncertainty in identifying an experimental distribution for model fitting. The point-spread function (PSF) for our cytometer was much sharper than our experimental distributions, and its uncertainty, which we estimate to be on the order of 10% in relative width, would affect all distributions similarly. Theoretical analysis suggests that uncertainty in the PSF width leads to an uncertainty in determining the transcriptional burst frequency from our fitting procedure, with the effect being quadratic; that is, an uncertainty of order 10% in the relative width of the PSF would typically lead to a corresponding uncertainty of order 1% in the transcriptional burst frequency for each fit. As with uncertainty due to distribution drift over time, this uncertainty is of the same order as or smaller than that considered in our analysis due to our distribution processing procedure. Further, uncertainty in measuring the cytometer PSF would affect all distribution fits in the same way, scaling all inferred burst frequencies by a similar factor. Thus, even larger uncertainties in the cytometer PSF would not affect our inference of trends in burst-parameter variation over the set of clones.
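The quadratic smallness of the PSF contribution follows from variances adding under convolution (a minimal sketch; the widths below are illustrative, not measured values):

```python
import math

# The measured width is the convolution of the true distribution with the
# cytometer point-spread function: sigma_obs**2 = sigma_true**2 + sigma_psf**2.
# When the PSF is much sharper than the distribution, a relative error d in
# sigma_psf perturbs sigma_obs only at second order.
sigma_true = 1.0
sigma_psf = 0.1 * sigma_true          # PSF ~10x sharper than the distribution
d = 0.10                              # 10% uncertainty in the PSF width

sigma_obs = math.sqrt(sigma_true**2 + sigma_psf**2)
sigma_obs_pert = math.sqrt(sigma_true**2 + (sigma_psf * (1 + d))**2)
rel_shift = sigma_obs_pert / sigma_obs - 1.0   # ~0.001: far below d itself
```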
As a final consideration of uncertainties in our model inference that might be due to our optimized gating procedure, which was applied in the main text to specify an experimental GFP distribution at fixed scattering measure, we re-applied our fitting procedure to the experimental distributions obtained with a 10% square gate in the FSC/SSC plane (discussed in Sec. S.I.7), which is a factor of 6 narrower in each scattering dimension than the optimal gating used in the main text. The fit parameters obtained for this set of distributions are compared in Fig. S4 to those obtained in the main text. In general, the best-fit model parameters for these 'narrow-gated' distributions agreed with those obtained in the main text, approximately to within the 95% confidence intervals of the main-text analysis. Further, regression analysis indicates no significant change in the calculated trends of burst-parameter variation with expression mean, confirming that the main results of our analysis are robust to uncertainties due to our choice of distribution gating.
Overall, the above qualitative discussion, and additional systematic fits, suggest that the trends in burst-parameter variation over integration position inferred in the main text are a robust result of our analysis, as are the specific values of the inferred transcriptional dynamic parameters for each clone (though the latter to a lesser degree).

VIII) Considering extrinsic sources of heterogeneity
Our analysis has assumed that expression heterogeneities in our system arise entirely from the intrinsic processes represented by the model in Fig. 2A, namely from the probabilistic nature of the biochemical reactions involved in gene activation, transcription, and translation at fixed values of the model parameters. Extrinsic sources of heterogeneity may be thought of as cell-to-cell variability in the parameters of the model; that is, the model parameters themselves may be random variables. Here we discuss features of our data and simulation results that support our intrinsic-noise-based analysis.
In earlier work, we experimentally considered the effects of specific extrinsic factors in a similar HIV model system, including stage in cell cycle, cell size, aneuploidy, and mitotic cell division, finding little contribution to expression heterogeneity [17]. Here, we have found little correlation between cellular fluorescence and cell size, as measured by FSC, further indicating that cell growth is not a significant source of expression heterogeneity (e.g. Fig. S1.F).
We found that the contribution of cell-size variability to expression heterogeneity can be minimized by a coarse gating that includes cells whose FSC values lie in the mid 60% of the population (see Sec. S.I.3), in contrast to the narrow gating that Newman & Weissman found necessary to isolate the intrinsic component of expression variability [2]. We have also noted that the large coefficient of variation of approximately 60% for the 'typical' distribution in our system is atypical when compared to the results of a number of studies that quantified expression heterogeneities from large numbers of promoters [1,16,18]. Here we continue to develop the argument that the large expression heterogeneities observed in our system make an account based on extrinsic sources of heterogeneity less plausible.
Extrinsic fluctuations in processes that affect the expression of all genes should contribute similarly to expression heterogeneities from any promoter. This reasoning was applied to the large data sets in [2,16], where it was suggested that extrinsic fluctuations therefore set a lower limit to the amount of expression heterogeneity that should be generated by any promoter, and should thus make a dominant contribution only to expression heterogeneities from promoters with the sharpest expression profiles. The expression variability noted for the less 'noisy' promoters in those studies thus sets an upper bound on the relative contribution of extrinsic variability to the expression variability from an arbitrary promoter. Our measurements, on the other hand, represent the opposite end of the spectrum: the HIV promoter, for any of the sampled integrations (and especially for brighter ones), is much 'noisier' than the 'typical' eukaryotic promoter considered in the mentioned studies. We have no reason to suspect that the HIV promoter has any unique properties in terms of how it propagates extrinsic noise. We therefore argue that the increased expression noise demonstrated by the HIV promoter in our system, above the lower limit set by the class of least 'noisy' promoters analyzed in the mentioned studies, reflects intrinsic expression heterogeneities that are specified by features of the HIV promoter and its coupling to the host-cell environment.
From a theoretical point of view, we might consider the effects of sources of external 'noise' on the modeled processes in our system, as follows. Fluctuations in transcription rate are modeled in our system via fluctuations in promoter configuration between an active and repressed state. Any further fluctuations in this dynamic would simply be represented as a more complex stochastic process governing the gene-state dynamics and/or the inclusion of more states. We acknowledge that the representation of gene-state dynamics in our analysis is a great simplification. However, it captures the fundamental process of gene activation, and allows us to effectively capture the expression profiles that we observe experimentally. A two-state promoter, whose dynamics are described as in our model, thus represents a parsimonious and biologically motivated account of our data. Any proposition of a more complex underlying promoter dynamic, perhaps based on hypothesized contributions of extrinsic fluctuations, would need to be directly motivated by observations that are not accounted for by the current model.
For the present, there is no direct evidence to motivate consideration of a more complex promoter dynamic in our analysis.
On the other hand, even if we assume a two-state-promoter model to be correct, the transition rates between the states of our model should depend on concentrations of transcription factors that themselves fluctuate, and one could imagine that these fluctuations would generate extrinsic contributions to expression heterogeneity. To consider this possibility, we note that noise from each process in our model is filtered by the next in the chain of reactions that represents gene expression. That is, our observed protein expression heterogeneities result from much larger underlying heterogeneities in transcript distributions (see, for example, the sample distributions in Fig. 2B of the main text). In particular, if we take the coefficient of variation (CV) as a measure of expression heterogeneity, then in the bursting regime, using eqs. 14 and 15, we find that the CVs of the protein and transcript distributions are related as CV_p² = [κp−/(κp− + κt−)] CV_t² = CV_t²/5, where we have inserted our experimentally measured value for the ratio of transcript and protein degradation rates (κt−/κp− = 4) in the last equality.
Similarly, the coefficient of variation of the transcription-rate distribution over the population (the transcription-rate dynamics are determined by the gene-state dynamics, with the transcription rate taking the value κt+ in the active state and 0 in the inactive state) is given by CV_state² = κr/κa, which becomes large in the bursting regime (where κr >> κa). The coefficient of variation for the gene state (transcription rate) is related to that of the transcript distribution according to CV_state² = [B/(1 + B)](κr/κt−) CV_t², where B = κt+/κr is the transcriptional burst size and the bursting regime has again been assumed. The first factor on the RHS becomes small only for burst sizes of order 1 or smaller, i.e. when transcripts are effectively produced one at a time and their distribution becomes indistinguishable from that of a continuous Poisson process. For large burst sizes, this factor approaches 1, while the second factor becomes large in the bursting regime.
Thus, in the bursting regime, for burst sizes that are not small, CV_state > CV_t > CV_p. Therefore, for our experimental distributions, our analysis predicts that the 'noisy' protein expression distributions that we measured are generated by yet 'noisier' underlying transcript distributions, which are in turn generated by yet 'noisier' underlying gene-state dynamics. This is generally the case for a cascade of stochastic processes of the type we are considering here: for noise in an upstream process to make a significant contribution to heterogeneity in the distribution of a downstream component, the upstream process must be significantly noisier (and/or slower) than the downstream process to which it couples. The situation becomes more complicated if feedback is considered, but because our reporter system only produces GFP, we do not expect any feedback effects.
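The predicted ordering of noise down the expression cascade can be checked numerically from the bursting-regime CV expressions (an illustrative sketch; the parameter values are hypothetical, quoted in units of the transcript degradation rate κt− = 1):

```python
# Illustrative two-state-model rates, in units of the transcript
# degradation rate (kappa_t_minus = 1), chosen to give a burst size of 10
# and a burst frequency of a few per transcript lifetime.
k_act = 3.0      # kappa_a: activation rate (burst frequency)
k_rep = 100.0    # kappa_r: repression rate (>> k_act: bursting regime)
k_tx = 1000.0    # kappa_t_plus: transcription rate while active
k_deg_p = 0.25   # kappa_p_minus: protein dilution (4x slower than mRNA decay)

burst_size = k_tx / k_rep   # B = 10 transcripts per burst

# Squared coefficients of variation, bursting-regime expressions:
cv2_state = k_rep / k_act                            # gene state
cv2_mrna = (1 + burst_size) / (k_act * burst_size)   # transcript distribution
cv2_prot = cv2_mrna * k_deg_p / (k_deg_p + 1.0)      # time-averaging filter

# Noise decreases down the cascade: CV_state > CV_mrna > CV_protein
```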
We thus conclude that for fluctuations in upstream components, such as transcription factors that directly couple to the LTR, to contribute significantly to the expression heterogeneity from the LTR, their dynamics should be noisier than the predicted gene-state dynamics of the LTR. But the gene-state dynamics of the LTR are necessarily noisier than its protein expression dynamics, which in turn are likely noisier than the protein expression dynamics of most cellular proteins, since the LTR is noisier than the vast majority of promoters considered in other studies, as discussed above. Thus, we suggest that it is unlikely that fluctuations in the expression of upstream proteins that interact with the LTR significantly affect the expression distributions that we observe from the LTR.
We make one final comment concerning extrinsic fluctuations in our system. Since a standard experimental test for extrinsic fluctuations is the use of two-reporter systems [19], one would expect significant extrinsic fluctuations to produce expression correlations between two independent reporters in such a system. It remains an open and important problem in eukaryotic transcriptional biology to understand the large-scale genomic and nuclear dynamics that affect the expression dynamics of multiple genes simultaneously, and, more specifically, to identify and distinguish between the contributions of extrinsic and intrinsic sources of expression noise in a system where these two contributions may be highly integrated.