What can we learn when fitting a simple telegraph model to a complex gene expression model?

In experiments, the distributions of mRNA or protein numbers in single cells are often fitted to the random telegraph model which includes synthesis and decay of mRNA or protein, and switching of the gene between active and inactive states. While commonly used, this model does not describe how fluctuations are influenced by crucial biological mechanisms such as feedback regulation, non-exponential gene inactivation durations, and multiple gene activation pathways. Here we investigate the dynamical properties of four relatively complex gene expression models by fitting their steady-state mRNA or protein number distributions to the simple telegraph model. We show that despite the underlying complex biological mechanisms, the telegraph model with three effective parameters can accurately capture the steady-state gene product distributions, as well as the conditional distributions in the active gene state, of the complex models. Some effective parameters are reliable and can reflect realistic dynamic behaviors of the complex models, while others may deviate significantly from their real values in the complex models. The effective parameters can also be applied to characterize the capability for a complex model to exhibit multimodality. Using additional information such as single-cell data at multiple time points, we provide an effective method of distinguishing the complex models from the telegraph model. Furthermore, using measurements under varying experimental conditions, we show that fitting the mRNA or protein number distributions to the telegraph model may even reveal the underlying gene regulation mechanisms of the complex models. The effectiveness of these methods is confirmed by analysis of single-cell data for E. coli and mammalian cells. All these results are robust with respect to cooperative transcriptional regulation and extrinsic noise. In particular, we find that faster relaxation speed to the steady state results in more precise parameter inference under large extrinsic noise.


Parameter inference for the telegraph model
We use the "fminsearch" function in MATLAB to minimize the log-likelihood function where θ is the parameter set, N (n) denotes the number of cells with n copies of mRNA or protein, and P n (θ) denotes the mRNA or protein number distribution with parameter set θ.In this way, we can obtain estimates of the three parameters λ, γ, and ρ of the telegraph model (assuming that d = 1).One of the most important steps for optimization is to select the initial values λ0 , γ0 , and ρ0 of the three parameters.The conventional moment estimation method determines λ0 , γ0 , and ρ0 as functions of the first three moments of gene expression data across a cell population by solving three algebraic equations [5,6].Although this method is straightforward, it may produce negative values of λ0 , γ0 , and ρ0 [6].To avoid the possible negative initial values, we modify the moment estimation method as follows.If ρ0 is known, then the two initial values λ0 and γ0 can be determined by matching the mean and variance of gene product fluctuations [7,8], which leads to where M is the mean copy number and η is the coefficient of variation squared.Note that ρ/d is the typical gene product number in the active gene state.Hence in previous papers [7], ρ0 is commonly chosen as the maximum number of gene product molecules within the cell population.However, this choice may still give rise to negative values of λ0 and γ0 in Eq. ( 6) when the maximum number of molecules ρ0 < ηM + M − 1.If this happens, then we simply reset ρ0 = ηM + M .To test this method, we use the telegraph model to generate synthetic data of gene products numbers for N = 10 5 cells using the SSA under 10 3 randomly selected parameter sets within the regions λ, γ ∈ [0.1, 4] and ρ ∈ [10, 30].The synthetic data are then fitted to the telegraph model using the maximum-likelihood method with the initial values λ0 , γ0 , and ρ0 being chosen as above.For all 10 3 parameter sets, the estimated values of λ, γ, and ρ are almost the same as their real values (Supplementary Fig. S5).This shows that our parameter inference method is highly reliable.The readers may also ask how frequently the resetting of ρ0 occurs for real data.To see this, we generate synthetic data for N = 10 2 , 10 3 , 10 4 , 10 5 cells under 10 3 randomly selected parameter sets within the same regions mentioned above.Interestingly, we find that the resetting of ρ0 fails to be observed for all the tested sample sizes and parameter sets.This indicates that such resetting should rarely occur for real smFISH or scRNA-seq data.

Mean active and inactive periods for feedback models
We next compute the the mean active and inactive durations for the two feedback models.In [3], it has been shown that no matter there is autoregulation or not, the mean of gene product numbers can always be represented by the mean active and inactive periods as where T on and T off are the mean active and inactive periods of the gene, and the second term on the right-hand side represents the probability of the gene being active.This formula is intuitive for unregulated genes.However, for regulated genes, it is highly non-trivial and was proved in [3] using the analytical distributions of gene product numbers.The mean protein number n = F (1) can be easily recovered by taking the derivatives of the generating function given in Eqs. ( 3) and ( 4) at z = 1.For the positive feedback model, the gene expression mean n has the form of [3] Moreover, the mean active duration for the positive feedback model is given by T on = 1/γ.Substituting this equation into Eq.( 7) gives the explicit expression of the mean inactive duration, i.e.
Similarly, for the negative feedback model, the gene expression mean is given by [3] Moreover, the mean inactive duration for the negative feedback model is given by T off = 1/λ.Substituting this equation into Eq.( 7) gives the explicit expression of the mean active duration, i.e.
. Distinguishing the dynamic differences between a complex model and its effective telegraph model.Under the initial condition that the gene is off and there is no gene product molecules in the cell, the three-state and positive feedback models have smaller time-dependent second moment curve compared to the effective telegraph model, while the cross-talk pathway and negative feedback models have larger time-dependent second moment curve.However, common fluctuation indicators such as gene expression noise and the Fano factor, present less easily distinguishable dynamic differences between complex models and their effective telegraph models.Fit complex models with extrinsic noise to the telegraph model.We take into account the extrinsic noise by adding noise to the initiation rate ρ in the three-state, crosstalk pathway, positive feedback and negative feedback models.Here, we reset ρ as a log-normal distributed random variable with its mean being the original value of ρ and standard deviation being equal to 0.05, 0.1, 0.5 of the mean (presenting 5%, 10%, 50% extrinsic noise, respectively).For each of complex models with a extrinsic noise level, synthetic data of gene product numbers are generated under 625 parameter sets using the SSA.In steady state, the Hellinger distance (HD) between the simulated distribution and its telegraph model approximation is shown as a function of the mean expression level for the 625 parameter sets.Under the all tested parameter sets and extrinsic noise levels, the HD is less than 0.035 for the three-state model, less than 0.1 for both cross-talk and positive feedback models, less than 0.021 for the negative feedback model.

Fig H.
Distinguishing dynamic differences between complex models under extrinsic noise and their effective telegraph models.We take into account the extrinsic noise by adding noise to the initiation rate ρ in each of the three-state, cross-talk pathway, positive feedback and negative feedback models.Here, we reset ρ as a log-normal distributed random variable with its mean being the original value of ρ and standard deviation being equal to 0.05, 0.1, 0.5 of the mean (presenting 5%, 10%, 50% extrinsic noise, respectively).Under the initial condition that the gene is off and there is no gene product molecules in the cell, we generate synthetic data of gene product numbers for each complex model and each extrinsic noise level under 625 parameter sets using the SSA.We fit the steady-state synthetic data to obtain the effective telegraph models, and plot time-dependent gene products means for both synthetic data and the effective model.For all tested extrinsic noise levels, the three-state and positive feedback models have smaller time-dependent mean curve compared to the effective telegraph model, while the cross-talk pathway and negative feedback models have larger time-dependent mean curve.We take into account the extrinsic noise by adding noise to the initiation rate ρ in each of the negative feedback model.Here, we reset ρ as a log-normal distributed random variable with its mean being the original value of ρ and standard deviation being equal to 0.05, 0.1, 0.5 of the mean (presenting 5%, 10%, 50% extrinsic noise, respectively).Tuning a single parameter of the negative feedback model under each extrinsic noise level can generate a series of distributions of gene product numbers, along with different mean expression levels.Fitting these distributions by the telegraph model leads to a series of effective parameters λ, γ and ρ.Plotting λ, γ and ρ as a function of the corresponding mean expression level reveals how the effective parameters vary when the single parameter of the negative feedback model is tuned.There are general variation patterns of effective rates except for turning ρ under 50% extrinsic noise.

Fig A.
Fig A.Kullback-Leiber divergence (KLD) for quantifying the fit of the gene product distributions of complex models to the telegraph model.For each complex model, synthetic data of gene product numbers are generated under 625 parameter sets using the SSA.For each complex model, the KL divergence between the simulated distribution and its telegraph model approximation is shown as a function of the mean expression level for the 625 parameter sets.The KL divergence is less than 0.025 for all complex models.

Fig E.
Fig E. High accuracy of the method for inferring parameters of the telegraph model.(a) 3 parameter sets of ( λ, γ, ρ) are randomly selected within the regions λ, γ ∈ [0.1, 4] and ρ ∈ [10, 30].(b) For each of parameter sets, the synthetic distribution data is fitted to the telegraph model using the maximum-likelihood method, where the initial values λ0, γ0, and ρ0 for optimization are chosen by our method in the main text (see details in Method).The estimated values of λ, γ, and ρ are almost as the same as their real values, quantified by very small relative errors between real and inferred parameter values.

Fig I.
Fig I. Variation pattern of effective parameters under different induction conditions in the three-state model under extrinsic noise.We take into account the extrinsic noise by adding noise to the initiation rate ρ in each of the three-state model.Here, we reset ρ as a log-normal distributed random variable with its mean being the original value of ρ and standard deviation being equal to 0.05, 0.1, 0.5 of the mean (presenting 5%, 10%, 50% extrinsic noise, respectively).Tuning a single parameter of the three-state model under each extrinsic noise level can generate a series of distributions of gene product numbers, along with different mean expression levels.Fitting these distributions by the telegraph model leads to a series of effective rate parameters λ, γ and ρ.Plotting λ, γ and ρ as a function of the corresponding mean expression level reveals how the effective parameters vary when the single parameter of the three-state model is tuned.There are general variation patterns of effective rates except for turning λ1 (or λ2) under 50% extrinsic noise.