From noise to models to numbers: Evaluating negative binomial models and parameter estimations in single-cell RNA-seq

doi:10.1371/journal.pcbi.1014014

Fig 1.

Schematic comparison of the telegraph model of gene expression and two of its limiting cases, illustrating how distinct mechanisms can converge to the same negative binomial mRNA distribution.

(a) Schematic illustrating the telegraph model of gene expression. A gene switches between active (green) and inactive states (red) with rates and . Synthesis of transcripts occurs from the active state with rate ρ. The transcripts are subsequently degraded with rate 1. The rates are all normalized by the degradation rate. The steady-state distribution of transcript numbers is a Beta-Poisson compound distribution; (b) Schematic showing the special case where the gene is always in the active state and the transcription rate ρ varies from cell to cell according to a Gamma distribution. In this case, the mRNA distribution predicted by the telegraph model reduces to a Gamma-Poisson compound distribution (an NB distribution); (c) Schematic showing the special case where the gene spends most of its time in the inactive state () which leads to transcription occurring in short-lived bursts that are well separated from each other. All cells are identical, i.e., the rate constants do not vary from cell to cell. In this case, the mRNA distribution predicted by the telegraph model also reduces to an NB distribution.
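
The Gamma-Poisson equivalence in panel (b) can be checked numerically. The following is a minimal sketch, not the authors' code: it integrates a Poisson distribution over a Gamma-distributed transcription rate and compares the result with the closed-form NB probability mass function. The names `shape` and `scale` and their values are assumptions chosen for illustration.

```python
import math

def nb_pmf(n, shape, scale):
    """Closed-form Gamma-Poisson (NB) pmf for a Gamma(shape, scale) rate."""
    log_p = (math.lgamma(n + shape) - math.lgamma(n + 1) - math.lgamma(shape)
             + n * math.log(scale / (1.0 + scale))
             - shape * math.log(1.0 + scale))
    return math.exp(log_p)

def gamma_poisson_pmf(n, shape, scale, upper=200.0, steps=20000):
    """Midpoint-rule integral of Poisson(n | rho) * Gamma(rho; shape, scale)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        rho = (i + 0.5) * h  # transcription rate at the midpoint of the cell
        log_pois = n * math.log(rho) - rho - math.lgamma(n + 1)
        log_gamma = ((shape - 1.0) * math.log(rho) - rho / scale
                     - math.lgamma(shape) - shape * math.log(scale))
        total += math.exp(log_pois + log_gamma)
    return total * h

# Illustrative (assumed) Gamma parameters for the cell-to-cell rate variation.
shape, scale = 2.0, 5.0
max_abs_diff = max(
    abs(nb_pmf(n, shape, scale) - gamma_poisson_pmf(n, shape, scale))
    for n in range(30)
)
print("largest pmf discrepancy:", max_abs_diff)
```

The quadrature assumes `shape >= 1` so that the Gamma density stays bounded at zero; smaller shapes would need a different integration scheme.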


Fig 2.

Comparison of the steady-state mRNA distribution of the telegraph model (Eq (1) denoted as ) with the effective NB distribution (Eq (6) denoted as ) for the case of perfectly identical cells (parameters do not vary from cell-to-cell).

The two parameters of the effective NB distribution are chosen so that its first and second moments of mRNA counts exactly agree with those of the telegraph model (the values of the two parameters, and , are stated to 2 decimal places in the figure). In all four parameter cases, the effective NB distribution fits the corresponding telegraph distribution exceptionally well, yet only in case (a) do we have (the classical case of transcriptional bursting). This demonstrates that a good fit of an NB distribution to the telegraph model distribution does not imply the presence of transcriptional bursting.
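
The moment-matching step described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's Eq (6): the names `s_on`, `s_off` and `rho` (activation, inactivation and transcription rates in units of the degradation rate) and the NB parameterization by a shape `r` and a scale `b` are hypothetical choices. The moments used are the standard steady-state mean and Fano factor of the telegraph model.

```python
def telegraph_moments(s_on, s_off, rho):
    """Steady-state mean and variance of mRNA counts in the telegraph model.

    All rates are expressed in units of the mRNA degradation rate.
    """
    total = s_on + s_off
    mean = rho * s_on / total
    fano = 1.0 + rho * s_off / (total * (total + 1.0))
    return mean, fano * mean

def nb_from_moments(mean, var):
    """Return NB parameters (r, b) with mean = r * b and var = mean * (1 + b)."""
    if var <= mean:
        raise ValueError("moment matching to an NB needs var > mean")
    b = var / mean - 1.0
    r = mean / b
    return r, b

# Bursty regime (assumed values): gene mostly off, short intense bursts.
r_burst, b_burst = nb_from_moments(*telegraph_moments(0.5, 1000.0, 5000.0))
# Non-bursty regime (assumed values): comparable on and off rates.
r_flat, b_flat = nb_from_moments(*telegraph_moments(1.0, 1.0, 10.0))
print(r_burst, b_burst)
print(r_flat, b_flat)
```

In the bursty example the matched `r` and `b` land close to `s_on` and `rho / s_off`. In the second example they do not, even though an NB can still be matched to the first two moments.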


Fig 3.

The mRNA distribution of the telegraph model converges to that of the effective NB distribution as the sum of the gene-state switching rates relative to the mRNA degradation rate () increases.

In the dashed box, we show that the effective NB distribution (blue solid line) exhibits a lower KL divergence to the telegraph model distribution compared to the Poisson approximation (orange dashed line) as grows. The distributions of the telegraph model (green dots), effective NB (blue solid lines), and Poisson (orange dashed lines) are shown for Points A-E, as indicated in the KL divergence plot. The other parameters are fixed at and . Point A: , Point B: , Point C: , Point D: , Point E: .
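
A comparison of this kind can be sketched numerically. The snippet below is an assumption-laden illustration rather than the authors' pipeline: it builds the telegraph distribution as a Beta-Poisson mixture by simple quadrature, moment-matches an NB, and computes the KL divergence from the telegraph distribution to the NB and to a Poisson with the same mean. The parameter names and the two switching-rate settings are hypothetical, and the quadrature assumes both switching rates are at least 1.

```python
import math

def beta_poisson_pmf(n, s_on, s_off, rho, steps=2000):
    """Telegraph pmf as a Beta-Poisson mixture (midpoint quadrature)."""
    log_norm = math.lgamma(s_on) + math.lgamma(s_off) - math.lgamma(s_on + s_off)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # fraction of the maximal transcription rate
        log_beta = (s_on - 1) * math.log(x) + (s_off - 1) * math.log(1 - x) - log_norm
        lam = rho * x
        log_pois = n * math.log(lam) - lam - math.lgamma(n + 1)
        total += math.exp(log_beta + log_pois)
    return total * h

def nb_pmf(n, r, b):
    """NB pmf with mean r * b and Fano factor 1 + b."""
    return math.exp(math.lgamma(n + r) - math.lgamma(n + 1) - math.lgamma(r)
                    + n * math.log(b / (1 + b)) - r * math.log(1 + b))

def poisson_pmf(n, mean):
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def kl(p, q):
    """KL divergence sum_n p_n log(p_n / q_n) over a truncated support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def compare(s_on, s_off, rho=15.0, n_max=80):
    """Return (KL to moment-matched NB, KL to Poisson with the same mean)."""
    p = [beta_poisson_pmf(n, s_on, s_off, rho) for n in range(n_max + 1)]
    total = s_on + s_off
    mean = rho * s_on / total
    b = rho * s_off / (total * (total + 1.0))  # Fano factor minus one
    r = mean / b
    q_nb = [nb_pmf(n, r, b) for n in range(n_max + 1)]
    q_pois = [poisson_pmf(n, mean) for n in range(n_max + 1)]
    return kl(p, q_nb), kl(p, q_pois)

slow = compare(2.0, 2.0)    # slower switching (assumed values)
fast = compare(20.0, 20.0)  # faster switching (assumed values)
print("slow switching:", slow)
print("fast switching:", fast)
```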


Fig 4.

Benchmarking aeBIC against E[BIC] shows that a single-sample-based criterion can reliably recover the expected model-selection landscape across sample sizes and telegraph-model regimes.

(a) Cartoon illustrating a computational approach to compare the aeBIC (top) with E[BIC], the expected value of the BIC (bottom). The aeBIC utilizes a single score to select the best model distribution (telegraph, NB or Poisson) given that the ground-truth mRNA distribution is that of the telegraph model. The BIC method assigns a score to each sample of data simulated from the telegraph model; these scores are then averaged to give E[BIC]. (b) The relative error (RE) of aeBIC compared to E[BIC] for two distributions (Poisson and NB) as a function of sample size for 10 parameter sets (see Table A in S1 Text for the values of and ; ρ is fixed to 15). Error bars show the standard error of the mean. (c) Phase diagram showing the regions of parameter space where the telegraph, NB and Poisson distributions are selected as optimal by the aeBIC, given that the ground-truth mRNA distribution is that of the telegraph model. Here is the sample size, is the sum of gene-state switching rates normalised by the degradation rate of mRNA, and is the fraction of time spent in the active state. The fraction of the total parameter space occupied by the region where the NB distribution is optimally selected is shown on the plots. Note that the transcription rate ρ is fixed to 15, which implies that the maximum mean number of transcripts in the phase plots is 15.
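
For orientation, the ordinary per-sample BIC that gets averaged into E[BIC] can be sketched as below. This illustrates only the standard definition, BIC = k ln N − 2 ln L, and not the aeBIC itself, whose construction is not given in these captions. The count data are invented for the example, and the NB parameters are obtained by moment matching rather than maximum likelihood, which is a simplifying assumption.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def poisson_loglik(counts, mean):
    return sum(c * math.log(mean) - mean - math.lgamma(c + 1) for c in counts)

def nb_loglik(counts, r, b):
    """Log-likelihood of an NB with mean r * b and Fano factor 1 + b."""
    return sum(math.lgamma(c + r) - math.lgamma(c + 1) - math.lgamma(r)
               + c * math.log(b / (1 + b)) - r * math.log(1 + b)
               for c in counts)

# Hypothetical overdispersed counts for one gene across 20 cells.
counts = [0, 0, 1, 3, 8, 2, 0, 15, 4, 0, 6, 1, 0, 9, 2, 0, 5, 12, 1, 0]
n_obs = len(counts)
mean = sum(counts) / n_obs
var = sum((c - mean) ** 2 for c in counts) / n_obs

# Poisson: one parameter (the mean).
bic_pois = bic(poisson_loglik(counts, mean), 1, n_obs)

# NB: two parameters, here set by moment matching (requires var > mean).
b = var / mean - 1.0
r = mean / b
bic_nb = bic(nb_loglik(counts, r, b), 2, n_obs)

best = "NB" if bic_nb < bic_pois else "Poisson"
print(bic_pois, bic_nb, best)
```

Repeating this over many simulated samples and averaging the scores would approximate E[BIC] for each candidate distribution.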


Fig 5.

The binomial capture model for scRNA-seq reveals how incomplete transcript capture systematically shifts the effective model-selection landscape across telegraph-model parameter regimes.

(a) Schematic illustrating the binomial capture model for scRNA-seq. Transcripts in each cell are captured with some probability . This causes a downsampling of the distribution of mRNA counts. (b) Phase diagram showing the regions of parameter space where the telegraph, NB and Poisson distributions are selected as optimal by the aeBIC, given that the ground-truth mRNA distribution is that of the telegraph model. The phase diagrams are shown for three different values of (values stated next to the plots). Here is the sample size, is the sum of gene-state switching rates normalised by the degradation rate of mRNA, and is the fraction of time spent in the active state. The fraction of the total parameter space occupied by the region where the NB distribution is optimally selected is shown on the plots. Note that the transcription rate is fixed to which implies that the maximum mean number of transcripts in the phase plots is and for the phase plots in rows 1, 2 and 3, respectively. These phase plots do not appreciably change if aeBIC is determined using moment-matching instead of MLE (Fig B in S1 Text).
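
The downsampling in panel (a) can be written as a binomial thinning of the true count distribution. The sketch below is an illustration with assumed values (a Poisson with mean 15 and a capture probability of 0.3, neither taken from the figure). It applies the thinning exactly to a truncated distribution and checks the known result that a thinned Poisson is again Poisson with the mean scaled by the capture probability.

```python
import math

def poisson_pmf(n, mean):
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def thin(pmf, p_capture):
    """Binomial capture: each transcript is kept independently with p_capture.

    `pmf` is a list of probabilities for counts 0..len(pmf)-1.
    """
    observed = [0.0] * len(pmf)
    for n, p_n in enumerate(pmf):
        for m in range(n + 1):
            weight = math.comb(n, m) * p_capture ** m * (1 - p_capture) ** (n - m)
            observed[m] += p_n * weight
    return observed

true_mean, p_capture = 15.0, 0.3  # assumed illustrative values
true_pmf = [poisson_pmf(n, true_mean) for n in range(121)]
observed_pmf = thin(true_pmf, p_capture)

# A thinned Poisson(mean) should equal Poisson(p_capture * mean).
expected_pmf = [poisson_pmf(m, p_capture * true_mean) for m in range(121)]
max_abs_diff = max(abs(a - b) for a, b in zip(observed_pmf, expected_pmf))
observed_mean = sum(m * q for m, q in enumerate(observed_pmf))
print(max_abs_diff, observed_mean)
```

The same `thin` function can be applied to any truncated count distribution, which is one way to see how capture lowers the observed mean in proportion to the capture probability.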


Fig 6.

Heterogeneity in scRNA-seq capture efficiency across cells systematically alters the effective model-selection landscape, shifting the regions in which telegraph, NB and Poisson distributions are favoured.

(a) Schematic illustrating the binomial capture model for scRNA-seq with a probability of mRNA capture, , that varies between cells according to some distribution. (b) We consider three different distributions, all with mean but with varying coefficient of variation (CV): (i) Dirac() with CV = 0; (ii) Beta() with CV = 0.11; (iii) Beta() with CV = 0.21. (c) Phase diagram showing the regions of parameter space where the telegraph, NB and Poisson distributions are selected as optimal by the aeBIC, given that the ground-truth mRNA distribution is that of the telegraph model with effective transcription rate where is sampled from the three distributions mentioned above. Here is the sample size, is the sum of gene-state switching rates normalised by the degradation rate of mRNA, and is the fraction of time spent in the active state. The fraction of the total parameter space occupied by the region where the NB distribution is optimally selected is shown on the plots. Note that the transcription rate is fixed to which implies that the maximum mean number of transcripts in the phase plots is .


Fig 7.

Technical-noise correction for heterogeneous capture efficiency reshapes the aeBIC model-selection landscape and reveals its impact on the accuracy of inferred bursting kinetics.

(a) Illustration of the differences between the standard and technical-noise-corrected models; (b) Phase diagrams produced by aeBIC model selection based on standard or corrected models. For both, the ground-truth model for observed data is the telegraph model with distributed according to the distribution which has mean and CV=. The transcription rate ρ is fixed to 15; the maximum mean number of transcripts is . The labels “Tele”, “NB” and “Pois” denote the regions selected using the aeBIC procedure with corrected models. The dashed lines demarcate the same regions but using the aeBIC procedure with standard models. The “Pois” area is divided into a white part (where both aeBIC procedures select the Poisson distribution) and a grey part (where the aeBIC with standard models selects the NB distribution while the aeBIC with corrected models selects the Poisson distribution). The heatmap shows the magnitude of the relative errors in the estimated burst frequency () and burst size () in the NB-optimal region (using the aeBIC with corrected models). The errors are computed using Eq (17); note that this approach assumes full knowledge of the distribution of the capture probability, an ideal case. In the plots, denotes sample size, is the sum of gene-state switching rates normalised by the degradation rate of mRNA, and is the fraction of time spent in the active state.


Fig 8.

Technical-noise-corrected inference in the NB-optimal regime reveals that relative ordering of gene bursting parameters can be robustly recovered.

(a) Scatter plot of the estimated ratio of burst parameters of a pair of genes against the ground-truth ratio of the burst parameters of the same pair of genes. All parameter sets are sampled from the region of parameter space where the technical-noise-corrected NB distribution is optimally selected using the aeBIC approach (purple regions in Fig C in S1 Text). Points marked in blue are those pairs of genes for which the estimation led to an incorrect ordering of genes by the size of the burst parameter. The order was correctly inferred for gene pairs corresponding to orange and green points. Perfect ratio estimation is shown by the solid red line; gene pairs corresponding to orange (green) points overestimate (underestimate) the distance between the burst parameters of the gene pairs. (b) Distributions of the relative errors in burst frequency and burst size for those pairs of genes for which the order was correctly inferred (orange and green points) in (a). Note that the inference and model selection approach here assumes full knowledge of the distribution of the capture probability, an ideal case.


Fig 9.

Analysis of mouse fibroblast scRNA-seq data reveals that many genes best fitted by the NB distribution are not transcriptionally bursty.

(a) Model selection using BIC on 21,684 genes (670 cells) after preprocessing shows that ∼80% of genes are best fitted by the NB model. (b,c) For these NB-fitted genes, Bayesian inference of was performed using the -modified telegraph model, with reliability assessed by the confidence interval criterion (CI range/median <0.4). The fraction of NB-fitted genes classified as transcriptionally bursty depends on the threshold chosen for : 0.1 in (b) and 0.25 in (c). In both cases, a substantial fraction of genes are best fitted by the NB model yet are not transcriptionally bursty.
