
Current address: Department of Molecular and Cell Biology, Beckman Research Institute of the City of Hope, Duarte, California, United States of America.

Conceived and designed the experiments: RS JCB DVS APA. Performed the experiments: JCB JEF. Analyzed the data: RS. Contributed reagents/materials/analysis tools: RS JCB JEF DVS. Wrote the paper: RS JCB JEF DVS APA.

The authors have declared that no competing interests exist.

Mammalian gene expression patterns, and their variability across populations of cells, are regulated by factors specific to each gene in concert with its surrounding cellular and genomic environment. Lentiviruses such as HIV integrate their genomes into semi-random genomic locations in the cells they infect, and the resulting viral gene expression provides a natural system to dissect the contributions of genomic environment to transcriptional regulation. Previously, we showed that expression heterogeneity and its modulation by specific host factors at HIV integration sites are key determinants of infected-cell fate and a possible source of latent infections. Here, we assess the integration context dependence of expression heterogeneity from diverse single integrations of an HIV-promoter/GFP-reporter cassette in Jurkat T-cells. Systematically fitting a stochastic model of gene expression to our data reveals an underlying transcriptional dynamic, by which multiple transcripts are produced during short, infrequent bursts, that quantitatively accounts for the wide, highly skewed protein expression distributions observed in each of our clonal cell populations. Interestingly, we find that the size of transcriptional bursts is the primary systematic covariate over integration sites, varying from a few to tens of transcripts across integration sites, and correlating well with mean expression. In contrast, burst frequencies are scattered about a typical value of several per cell-division time and demonstrate little correlation with the clonal means. This pattern of modulation generates consistently noisy distributions over the sampled integration positions, with large expression variability relative to the mean maintained even for the most productive integrations, and could contribute to specifying heterogeneous, integration-site-dependent viral production patterns in HIV-infected cells. Genomic environment thus emerges as a significant control parameter for gene expression variation that may contribute to structuring mammalian genomes, as well as be exploited for survival by integrating viruses.

Cellular gene expression is a fundamentally stochastic process due to the intrinsic randomness of the underlying biochemical reactions involved. The resulting stochastically generated expression heterogeneities have important biological consequences and also encode information about the underlying dynamics that generate them. A fundamental goal of transcriptional biology is to understand the quantitative regulation of gene-expression dynamics, which in eukaryotes depends on factors specific to each gene in concert with its surrounding cellular and genomic environment. We investigated the regulatory effects of variable genomic environments by quantitatively measuring expression heterogeneity from diverse single genomic integrations of the HIV promoter in cultured cells. Systematically fitting a model of stochastic gene expression to our measurements reveals transcript production in bursts as the underlying dynamic that accounts for the large heterogeneities observed within single-integration clonal cell populations, with the size of transcriptional bursts as the primary feature that varies over genomic integrations. Our findings implicate genomic environment as an important quantitative control parameter that eukaryotic cells might use to shape gene-expression patterns, and that lentiviruses such as HIV, whose genomes are semi-randomly integrated into the genomes of the host cells they infect, may exploit to sample diverse and heterogeneous expression patterns that evade treatment.

The life cycle dynamics of HIV-1 within a host are shaped by numerous apparently stochastic processes, from the statistics of immune cell infection in humans, to mutation during reverse transcription, semi-random integration of the proviral DNA into the host-cell chromosome, and stochastic viral gene expression thereafter.

The semi-random integration of HIV-1 into the host genome provides an ideal opportunity to dissect the relative contribution of genomic environment as a fundamental element of expression regulation that may contribute importantly to expression dynamics and heterogeneities in eukaryotes. It is now well established that HIV-1 integration is biased towards actively transcribed chromosomal locations.

The discrete and stochastic nature of gene expression has been appreciated for some time.

Despite the apparent complexity of cellular transcriptional regulation, for many genes across a broad range of cell types, the patterns of cell-to-cell expression variability within isogenic populations are remarkably well described by simple stochastic models that represent the gene – including the associated genomic environment, chromatin structure, transcriptional regulators, and transcriptional machinery – as existing in a small number of discrete configurations, or states, with expression heterogeneities depending on probabilistic transitions between states and on probabilistic transcript and protein production and degradation.

A wide range of transcriptional dynamics have been revealed by such analyses, from continuous

While the above studies have begun to characterize the dependence of gene-expression dynamics on a number of cellular inputs, a systematic, quantitative investigation of the contribution of genomic environment over a broad range of genomic integration positions remains to be conducted. Furthermore, the contrasting observations as to whether transcriptional activation frequency, transcriptional burst size, or some other feature of transcriptional dynamics represents the primary variable that cells modulate to control expression patterns in these diverse systems raise key questions of how important features of genetic, epigenetic, and regulatory architecture may differ in yeast and mammalian cells.

Here we explore the fundamental relationship between genomic environment and expression heterogeneity from a diverse set of semi-random single integrations of a model HIV-1-promoter/GFP-reporter construct in cultured Jurkat T-cells. Systematically and rigorously fitting a model of stochastic gene expression allows us to infer the underlying expression dynamics that account for the single-gene expression distributions that we measure from single-integration clonal populations. Our analysis reveals that transcript production in bursts accounts for the wide, highly skewed, expression profiles that we observe, and importantly that transcriptional burst size is the primary feature that varies across viral integrations. These results interestingly suggest that the virus samples a particularly ‘noisy’ range of possible expression profiles across cellular integrations, and open a number of important questions for further study. We propose several qualitative models that may explain this inferred variation of transcriptional dynamics with genomic environment and discuss the implications of our findings for HIV dynamics, and for cellular gene expression in general.

Although HIV-1 requires transactivation by the virally-encoded protein Tat to amplify its expression

^{2}) and mean (μ): ^{2}_{0}_{0}/_{0}

The shape features of our experimental distributions (such as mean, variance, skewness, etc.) are diagnostic of the underlying expression dynamics that generate them – and of the regulatory role of various molecular ‘inputs’ such as integration position (as well as promoter structure and concentrations of transcription factors, which were the ‘inputs’ considered in several other elegant studies:

To visualize additional features of the expression distributions over the set of clones, we translated each to a common mean fluorescence, and correspondingly scaled its fluorescence values about that mean based on the variance regression in

A simple stochastic model that captures a number of essential features of transcriptional biology, and that can reproduce a range of single-gene expression profiles, assumes that the promoter may exist in either an activated state (_{a}_{t}^{+}_{r}_{a}_{r}
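As a minimal illustration of the two-state model described above, the following Python sketch simulates promoter switching, transcription from the active state, and transcript degradation with the Gillespie algorithm. The rate names (`k_a`, `k_r`, `k_t`, `d_t`) and the numerical values are illustrative assumptions rather than the fitted parameters of this study, and protein production is omitted for brevity.

```python
import random


def simulate_two_state(k_a, k_r, k_t, d_t, t_end, seed=0):
    """Gillespie simulation of a two-state promoter (transcripts only).

    k_a: gene activation rate (inactive -> active), assumed name
    k_r: gene inactivation rate (active -> inactive), assumed name
    k_t: transcription rate while the promoter is active, assumed name
    d_t: per-transcript degradation rate, assumed name
    Returns the time-averaged transcript number over [0, t_end].
    """
    rng = random.Random(seed)
    t, active, n = 0.0, False, 0
    area = 0.0  # integral of transcript number over time
    while True:
        r_switch = k_r if active else k_a
        r_tx = k_t if active else 0.0
        r_deg = d_t * n
        total = r_switch + r_tx + r_deg
        dt = rng.expovariate(total)  # waiting time to the next reaction
        if t + dt >= t_end:
            area += n * (t_end - t)
            break
        area += n * dt
        t += dt
        u = rng.random() * total  # choose which reaction fires
        if u < r_switch:
            active = not active
        elif u < r_switch + r_tx:
            n += 1
        elif n > 0:
            n -= 1
    return area / t_end


if __name__ == "__main__":
    # Illustrative bursting regime: inactivation is fast relative to activation.
    mean_n = simulate_two_state(k_a=1.0, k_r=10.0, k_t=100.0, d_t=1.0, t_end=2000.0)
    analytic = (1.0 / (1.0 + 10.0)) * 100.0 / 1.0
    print(f"simulated mean {mean_n:.2f}, analytic mean {analytic:.2f}")
```

In this sketch the simulated mean can be compared with the steady-state expectation of the two-state model, the active fraction multiplied by the ratio of transcription to degradation rates.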


In the analysis that follows, we always consider steady-state model distributions, since longitudinal measurements over the course of a week on several clones indicate that distribution shapes are relatively stable over our time scale of interest (see ^{−1}, see

The qualitative expression regimes of the two-state gene model fundamentally depend on the relative values of the gene-state transition rate constants (

Another dynamic regime that has received considerable attention can be termed the transcriptional ‘bursting’ regime, in which the gene inactivation rate is fast (

Though the relatively slow time scale of protein degradation in our system (

While the analysis above provides intuition as to the dynamics and regulation that may underlie our experimental observations of the HIV LTR, it is solely a qualitative assessment based on the assumption of ‘burstiness’ and comparisons to ‘stereotypical’ model distributions. In reality, model distributions vary continuously between regimes, and means and variances provide an incomplete characterization of the actual distribution shapes. Therefore, to better determine the degree to which transcriptional ‘bursting’ best accounts for our experimental distributions, and the degree to which it can be distinguished from other possible dynamic regimes, we used a systematic fitting routine to identify the best-fit combination of transcription rate and gene-state transition rates for each distribution. Transcript degradation, protein production, and protein degradation rates (_{t}^{−}, _{p}^{+}, and _{p}^{−},

An important indicator of the dynamic regime of our system is the average time that the promoter remains in the active configuration following a gene activation event, relative to the average life time of a transcript (see _{t}^{−}_{r}_{r}_{r}

We find that the optimal agreement between model and experiment always occurs at short active-state durations (sample fits given in _{r}_{Max}_{Max}_{Max}_{r}_{r}^{Opt}


From the optimal fits above we identified best-fit transcriptional burst sizes and frequencies that specify the predicted transcriptional dynamics for each integration clone. Importantly, we find that the transcriptional burst size is the primary feature that varies over the set of genomic environments sampled by our 31 viral integrations, increasing from a few transcripts in very dim clones to tens of transcripts in very bright clones (_{10}(^{2}_{10}(_{a}^{2}

Best-fit transcriptional burst size (_{a}_{r}_{r}_{a}^{2}_{a}^{2}

Our findings thus indicate that burst-size variation makes the dominant contribution in controlling single-gene expression profiles and represents the primary feature of transcriptional dynamics whose modulation distinguishes typical bright from dim clones. Importantly, the trends noted in

Another recent study has also considered a two-state model to analyze expression variability from the HIV LTR.

A correlate of our findings – that transcription in short bursts underlies basal expression heterogeneities from the HIV LTR in the absence of Tat – is that the active promoter configuration is short-lived. This implies that the promoter would be observed in the active configuration for only a small fraction of cells in a clonal population at any given time at steady state. The value of this fraction in the two-state model, which we refer to as the ‘active fraction’, _{a}_{Max}

The best-fit restriction of τ below τ^{Max}

Transcriptional burst size – defined by the ratio of the transcription rate to promoter-inactivation rate (or the product of transcription rate and the active duration

However, our analysis predicts that each ‘Mode’ of control leads to a distinct pattern of active-fraction variation over the set of integration clones (
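The distinction between the two 'Modes' can be made concrete with a short Python sketch. It assumes that one Mode changes burst size through the transcription rate and the other through the promoter-inactivation rate, and it uses the standard steady-state active fraction of a two-state promoter. The class and field names (`TwoStateRates`, `k_a`, `k_r`, `k_t`) and all numerical values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TwoStateRates:
    k_a: float  # gene activation rate (assumed name)
    k_r: float  # gene inactivation rate (assumed name)
    k_t: float  # transcription rate in the active state (assumed name)


def burst_size(r: TwoStateRates) -> float:
    """Ratio of transcription rate to promoter-inactivation rate."""
    return r.k_t / r.k_r


def active_duration(r: TwoStateRates) -> float:
    """Mean time spent in the active state per activation event."""
    return 1.0 / r.k_r


def active_fraction(r: TwoStateRates) -> float:
    """Steady-state probability that the promoter is active."""
    return r.k_a / (r.k_a + r.k_r)


if __name__ == "__main__":
    base = TwoStateRates(k_a=0.5, k_r=10.0, k_t=50.0)
    more_productive = replace(base, k_t=100.0)  # double the transcription rate
    more_stable = replace(base, k_r=5.0)        # halve the inactivation rate
    for label, r in [("base", base), ("productivity", more_productive),
                     ("stability", more_stable)]:
        print(label, burst_size(r), round(active_fraction(r), 4))
```

Both modified parameter sets double the burst size, but only the change in inactivation rate alters the active fraction, which is why the active fraction could in principle discriminate between the two Modes.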

Our findings, that expression from the HIV promoter is characterized by transcript production in bursts and that the site of viral integration primarily modulates transcriptional burst size, contribute to an emerging paradigm for transcriptional regulation that emphasizes the importance of stochastic/probabilistic dynamics

Transcript production in bursts is a particularly ‘noisy’ transcriptional dynamic that can generate significant cell-to-cell expression variability, which is reflected in wide and highly skewed single-gene expression distributions across clonal populations (

Similar to HIV expression shortly after infection, our system lacks the viral transcriptional activator Tat. In the absence of Tat the LTR has been observed to bind repressive factors that maintain a non-conducive chromatin configuration

Intriguingly, a recent mammalian genome-wide mapping of HAT and HDAC association found them simultaneously bound to a large number of active promoters, suggesting that simultaneous regulation by competitive epigenetic regulators may be more common than previously thought

Our findings suggest that transcriptional burst size is a more ‘locally’ determined property, more sensitive to those features of genomic environment that vary significantly between integration sites, whereas transcriptional burst frequency is, by comparison, a more ‘globally’ determined feature, specified by interactions with the cellular environment that may be more promoter-specific but less significantly integration-site dependent. Burst frequency reflects the statistics of assembling the more active promoter configuration from an inactive one, and we might speculate that this transition depends in part on large-scale chromatin reorganization and dynamics that are coordinated globally across the nucleus

At a more basic level, a feature of transcriptional burst size that could more generally account for a greater sensitivity to genomic environment is its complementary dependence on transcriptional productivity and the stability of the active promoter configuration. We had noted earlier that this complementary dependence specifies two distinct ‘Modes’ by which surrounding genomic regions may differentially affect the resulting transcriptional dynamics (see

Burst-size variation with promoter induction level from a tetracycline-inducible construct at a single genomic position has been noted in another study using mammalian cells

The observation that integration site primarily modulates transcriptional burst size from the HIV promoter implies that viral integrants sample a ‘noisy’ set of basal expression distributions by semi-randomly integrating in the genome. Specifically, relative distribution widths (i.e. the coefficient of variation) are approximately maintained and comparable between ‘dim’ and ‘bright’ integrations. This contrasts with the naive expectation that dimmer integrations should demonstrate greater relative expression heterogeneity due to larger relative fluctuations typically generated by smaller numbers of molecules, as would be the case if burst frequency were the primary covariate over viral integrations (see
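The contrast between burst-size and burst-frequency modulation can be illustrated at the transcript level with a short Python sketch. It assumes the idealized bursting limit of the two-state model, in which steady-state transcript numbers have mean f·b/d and Fano factor 1 + b for burst frequency f, mean burst size b, and transcript degradation rate d. The function names and numerical values are illustrative assumptions, and protein-level noise, which is what the cytometry data report, is not modeled here.

```python
def transcript_mean(freq, size, deg=1.0):
    """Mean transcript number in the idealized bursting limit."""
    return freq * size / deg


def cv_squared(freq, size, deg=1.0):
    """Squared coefficient of variation: Fano factor (1 + size) over the mean."""
    return (1.0 + size) / transcript_mean(freq, size, deg)


if __name__ == "__main__":
    # Raise the mean 25-fold, from 6 to 150 transcripts, in two different ways.
    by_size = [(3.0, 2.0), (3.0, 50.0)]   # vary burst size at fixed frequency
    by_freq = [(3.0, 2.0), (75.0, 2.0)]   # vary burst frequency at fixed size
    for label, pairs in [("burst size", by_size), ("burst frequency", by_freq)]:
        for f, b in pairs:
            print(label, transcript_mean(f, b), round(cv_squared(f, b), 3))
```

Under these assumptions, raising the mean through burst size leaves the squared coefficient of variation close to d/f, whereas raising it through burst frequency reduces the relative noise in proportion to the mean.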

The basal expression patterns, and their associated expression noise, that were measured here reflect the range of expression dynamics that may be generated initially from an HIV infection after its semi-random integration into the host genome but prior to significant production of viral proteins

We anticipate that certain ranges of parameters representing integration-site dependent basal fluctuations in promoter activity may act to specify distinct infected-cell fates, as illustrated in

Possible decomposition of the ‘space’ of basal burst-parameters inferred by the current analysis into ranges of parameter combinations that, in the presence of positive feedback from Tat, may lead to active viral replication vs. latent fates.

While other studies have considered the effects of genomic environment on mean expression, we have analyzed its effects on expression heterogeneities. By applying an integrated computational and experimental approach, we have characterized the modulation of underlying transcriptional dynamics by genomic environment in human cells. Since classes of human promoters often share common enhancer and repressor motifs, it is possible that two such promoters at different genomic loci would demonstrate significantly different transcriptional dynamics, as we have observed from different integrations of a single promoter in our system. In this way, genomic architecture would provide an additional axis of expression regulation complementary to that specified by individual promoter sequence architectures, and promoter and genomic architectures might evolve in parallel to optimize their coupled contributions to transcriptional control

The HIV-1 based lentiviral plasmid, pCLG, (encoding the HIV-1 LTR and GFP) was packaged and harvested in HEK 293T cells using 10 µg of vector, 5 µg pMDLg/pRRE, 3.5 µg pVSV-G, and 1.5 µg pRSV-Rev, as previously detailed ^{7} and 10^{8} infectious units/ml. Approximately 10^{3}–10^{6} infectious units of concentrated virus were used to infect 3×10^{5} Jurkat cells. Six days after infection, gene expression of infected cells was transactivated by incubating Jurkats with a combination of 20 ng/ml TNFα, 400 nM TSA, and 12.5 µg exogenous Tat protein

Forty-eight single GFP+ LTR-GFP (LG) Jurkats (clones) were sorted on a DAKO-Cytomation MoFlo Sorter into 96-well plates and cultured for at least 4 weeks to allow for clonal expansion. Infected cultures were analyzed via flow cytometry on a Beckman-Coulter EPICS XL-MCL cytometer (

Cytometry measurements on 10^{4} cells for each clone quantified GFP fluorescence as well as forward and side scatter (FSC and SSC). Live cells were selected by standard gating of FSC and SSC, and further gated to select the mid 60% of FSC and SSC values. This gating was optimized using a bootstrap approach to resolve the GFP profile at the mean scattering measure, while eliminating significant correlation between GFP distribution and scattering (see
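The gating step can be sketched in Python as follows. The sketch interprets the 'mid 60%' as the central 60% of the ranked FSC and SSC values, which is an assumption, and it does not reproduce the bootstrap optimization of the gate width. The function names and the synthetic data are hypothetical.

```python
def central_bounds(values, fraction):
    """Lower and upper bounds enclosing the central `fraction` of ranked values."""
    ranked = sorted(values)
    n = len(ranked)
    cut = round(n * (1.0 - fraction) / 2.0)  # number of values trimmed per tail
    return ranked[cut], ranked[n - cut - 1]


def gate_cells(cells, fraction=0.6):
    """Keep cells whose FSC and SSC both fall within the central fraction.

    `cells` is a list of (fsc, ssc, gfp) tuples; the layout is an assumption.
    """
    fsc_lo, fsc_hi = central_bounds([c[0] for c in cells], fraction)
    ssc_lo, ssc_hi = central_bounds([c[1] for c in cells], fraction)
    return [c for c in cells
            if fsc_lo <= c[0] <= fsc_hi and ssc_lo <= c[1] <= ssc_hi]


if __name__ == "__main__":
    # Synthetic example: 100 cells with perfectly anti-correlated scatter values.
    cells = [(float(i), float(99 - i), 10.0 + i) for i in range(100)]
    gated = gate_cells(cells, fraction=0.6)
    print(len(gated), "of", len(cells), "cells retained")
```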

The model in

Fit parameters (_{a}_{t}^{+}_{r}_{r}

A number of model parameters quantify processes occurring at spatially separate locations from the integrated LTR. These were assumed to be the same for all integrations, and were specified separately. Methods developed independently from this study allowed us to calibrate relevant non-fit model parameters via comparison between the measured transcript distribution for a single clone, and the corresponding cytometry-based GFP distributions (Foley, et al. manuscript in preparation). A conversion factor between transcript number and RFU could be estimated from the measured ratio of means, as ^{−1}, which served as an effective protein degradation rate (

A bootstrap procedure was used to estimate a 95% upper-bound on the value of _{data}_{fit}_{data}_{r}_{fit}_{data}_{r}_{r}
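For readers unfamiliar with bootstrap bounds, the following Python sketch shows a generic percentile bootstrap for a one-sided 95% upper bound on a statistic. It is not the procedure used in this study, which involves re-fitting the model to resampled distributions. The function name, the default settings, and the example data are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_upper_bound(data, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap upper bound on `statistic` evaluated on `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on data sets resampled with replacement.
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    index = min(int(level * n_resamples), n_resamples - 1)
    return estimates[index]


if __name__ == "__main__":
    measurements = [float(x) for x in range(1, 21)]  # synthetic values
    bound = bootstrap_upper_bound(measurements, statistics.mean)
    print(f"sample mean {statistics.mean(measurements):.2f}, "
          f"95% upper bound {bound:.2f}")
```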

Distribution processing. A) 2-d histogram of fluorescence and forward scatter (FSC) values, as measured by cytometry from 10^{4} cells, for a sample clone. FSC is binned on a linear axis covering values between 1 and 1024 (10 bits) in arbitrary units (AU), and fluorescence values were log-binned over 4 orders of magnitude in relative fluorescence units (RFU). B) Smooth 2-d histograms were generated using a low-pass Fourier filter. The dashed line highlights correlation between fluorescence and FSC measures (we aim to account for this correlation in a distribution-processing procedure), and the green line is drawn at the mean FSC value, which specifies C) the ‘target’ GFP distribution at fixed FSC that we aim to extract by our processing procedure. D) Optimized gating. For each clone, a bootstrap approach was used to determine the optimal fraction of the FSC range to gate the data by (% Gate), which, for each clone, minimizes the average over the set of re-sampled (synthetic) data of the deviation between each processed ‘synthetic’ data set and the ‘target’ distribution. The distribution deviation is defined as in the main text, as ^{2} value for a linear regression between FSC and GFP, equal to _{3}^{3}, _{3}^{rd} distribution central moment, H), was calculated for each clone, relative to the value measured at the 60% gate used in the main text. The mean value of this ratio over the set of integration clones is given by the red line at each Gate %, with the box marking the inter-quartile range (iqr) and the bar marking 1.5*iqr.

(1.09 MB TIF)

Fit quality and deviations. A) Fit uncertainty. The relative fit deviation (_{r}_{r}_{r}

(1.87 MB TIF)

Distribution stability over time. A) Distribution variation over time is not correlated among clones. Six clones and a control with no plasmid that quantifies cellular autofluorescence (‘Aut’) were followed over 6 consecutive days by cytometry. Daily fluctuations in fluorescence mean (μ_{t}, normalized by the value on the first day μ_{0}, for each clone) are uncorrelated over the sampled populations for any pair of time points (P>0.5). B) Distribution shape variations over time for any clone are approximated by a distribution scaling of all fluorescence by a constant value, such that the variance (σ^{2}) changes approximately as the mean squared (μ^{2}). For small deviations, this translates as the relative variance (σ_{t}^{2} at each time, normalized by its value on the first day, σ_{0}^{2} for each clone) changing in proportion to twice the change in mean, which is plotted as a reference line (‘Scaling’). C) Distribution variability over time for a sample clone approximately demonstrates a ‘scaling’ variation, as noted in B, which is equivalent to translating the distribution on the log-binned fluorescence axis on which the histogram is plotted. D) Distribution rescaling. For the sample clone in C, the fluorescence values each day are scaled by the ratio of the mean on the first day to the mean on that day. This rescaling leads to improved distribution stability over time. In particular, the distribution variability is now approximately within the experimental uncertainty due to our distribution-processing procedure. This suggests that distribution drift over time can be treated as a simple scaling of fluorescence values, perhaps due to metabolic drift, as discussed in the text. E, F) Best-fit model parameter variability over time is comparable to 95% confidence intervals calculated for sources of uncertainty considered in the main text. For each clone, the fitting procedure of the main text was applied to each processed experimental distribution, for each of the six days. Best-fit transcriptional burst frequencies (E) and burst sizes (F) for each clone, relative to the value obtained by fitting the average of its distribution over time, are plotted against the log expression mean (averaged over the six days). Bars about the value of 1 represent 95% confidence intervals, as calculated in the main text, which do not include uncertainty due to distribution variability over time.
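The scaling argument in panels B and D can be checked with a short Python sketch. It assumes that day-to-day drift multiplies every fluorescence value by a common factor, and the function name and synthetic values are hypothetical.

```python
import statistics


def rescale_to_reference(values, reference_mean):
    """Scale fluorescence values so that their mean equals `reference_mean`."""
    factor = reference_mean / statistics.mean(values)
    return [v * factor for v in values]


if __name__ == "__main__":
    day0 = [1.0, 2.0, 3.0, 4.0, 10.0]   # synthetic fluorescence values
    drift = 1.1                          # assumed 10% multiplicative drift
    day1 = [drift * v for v in day0]

    # A common scale factor multiplies the variance by the factor squared,
    # so a small relative change in mean roughly doubles in the variance.
    var_ratio = statistics.pvariance(day1) / statistics.pvariance(day0)
    print(f"variance ratio {var_ratio:.3f}, linear estimate {1 + 2 * (drift - 1):.3f}")

    rescaled = rescale_to_reference(day1, statistics.mean(day0))
    print(f"rescaled variance {statistics.pvariance(rescaled):.3f}, "
          f"day-0 variance {statistics.pvariance(day0):.3f}")
```

With a 10% drift the variance ratio is 1.1 squared, close to the linear 'Scaling' estimate of 1.2, and rescaling by the ratio of means restores both the mean and the variance of the first day.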

(0.86 MB TIF)

Gating for cell size does not significantly affect inferred trends in burst-parameter variation with integration position. A, B) The experimental distributions obtained by applying a 10% square gate in the FSC/SSC plane (as discussed in Sec. S.I.7) were fit following the procedure in the main text (‘narrow gate’), and the resulting best-fit model parameters compared to those obtained for each clone based on our optimized distribution processing procedure (‘optimal gate’, = 60%), that were given in

(0.55 MB TIF)

No significant correlation between transcriptional burst size and frequency for the HIV LTR. The best-fit transcriptional burst frequencies (_{a}_{a} b_{a}^{2}

(0.35 MB TIF)

Work flow.

(0.97 MB TIF)

Supplement to “HIV-Promoter Integration Site Primarily Modulates Transcriptional Burst Size, Rather Than Frequency.”

(0.89 MB PDF)

We thank Sharon Aviran, Kathryn Miller-Jensen, and Siddharth Dey for critical readings of the manuscript.