Fig 1.
Two-state model of bursty gene expression.
The distribution of gene expression in a population of cells is partially caused by bursts of gene expression. (A) In the two-state model of bursty gene expression, a gene stochastically turns on at rate Kon and off at rate Koff. When the gene is on, it transcribes mRNA at rate Kt. Note that all three rates are normalised to the rate of mRNA degradation (Kd). (B) These kinetic parameters control the shape of the gene expression distribution. In this figure, we keep Kt = 100 and vary Kon and Koff. At high Kon, the distribution of mRNA transcripts is similar to a Poisson distribution (i); at high Koff, it is similar to a negative binomial distribution (ii); and at low values of Kon and Koff, it is bimodal (iii).
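The two-state model in (A) can be explored with a short stochastic simulation. The sketch below is a minimal Gillespie-style simulation written for illustration only; it is not the code used to produce the figure. The function name `simulate_two_state`, the simulated duration and the seed are assumptions. All rates are expressed in units of the mRNA degradation rate, so each transcript degrades at rate 1.

```python
import random

def simulate_two_state(k_on, k_off, k_t, t_end=2000.0, seed=0):
    """Simulate one gene and return {mRNA count: fraction of time spent at that count}.

    Illustrative sketch: rates are normalised to the mRNA degradation rate (k_d = 1).
    """
    rng = random.Random(seed)
    t, gene_on, mrna = 0.0, False, 0
    occupancy = {}
    while t < t_end:
        switch = k_off if gene_on else k_on   # promoter toggles state
        make = k_t if gene_on else 0.0        # transcription only while on
        decay = float(mrna)                   # each transcript degrades at rate 1
        total = switch + make + decay
        dt = min(rng.expovariate(total), t_end - t)
        occupancy[mrna] = occupancy.get(mrna, 0.0) + dt
        t += dt
        if t >= t_end:
            break
        r = rng.random() * total
        if r < switch:
            gene_on = not gene_on
        elif mrna == 0 or r < switch + make:
            mrna += 1
        else:
            mrna -= 1
    return {count: weight / t_end for count, weight in occupancy.items()}

distribution = simulate_two_state(k_on=1.0, k_off=1.0, k_t=100.0)
mean_mrna = sum(count * frac for count, frac in distribution.items())
```

The time-weighted mean should lie close to Kt·Kon/(Kon + Koff), the steady-state mean of the model. Varying `k_on` and `k_off` at fixed `k_t` gives the qualitative regimes described in (B).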
Fig 2.
Pipeline for the analysis of single cell gene expression data.
This paper develops several new tools for single cell analysis. (A) The kinetic parameters are estimated using a large lookup table. The likelihoods for each possible parameter set are found by multiplying the experimental data by the lookup table, and then the maximum likelihood is identified. (B) The SABEC algorithm is used to identify clusters of cells with uniform bursting kinetics; this iterative algorithm alternates between assigning cells to clusters and estimating bursting kinetic parameters for each cluster. This clustering algorithm is run 50 times and the results are summarized in a consensus matrix, which represents the frequency with which each pair of cells is found in the same cluster. (C) Finally, the EPiK tool identifies the set of parameters most likely to have varied for a gene across two different cell populations, using a combination of the Bayesian Information Criterion (BIC), Marginal Probability (MP), and a subsampling-based method. BIC calculates the likelihood for each possible set of parameters, penalised by the number of free parameters. The MP score calculates the likelihood of each parameter changing, independent of the behaviour of the other parameters. In the subsampling method, random sets of cells are selected and the kinetic parameters are estimated for each population of cells; then, the distributions of estimated kinetic parameters are compared. (D) These new tools are combined to form a pipeline for analyzing single cell qPCR data. The details for each step of this pipeline are described in the Methods.
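The lookup-table estimation in (A) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation. It assumes the table stores log-probabilities of each transcript count for every parameter set, computed here from the standard Poisson-beta form of the two-state steady state by simple midpoint integration. The function names, the grid and the integration settings are hypothetical.

```python
import math

def two_state_log_pmf(k_on, k_off, k_t, n_max, steps=400):
    """log P(n) for n = 0..n_max, using a Poisson-beta mixture (midpoint rule)."""
    log_norm = math.lgamma(k_on + k_off) - math.lgamma(k_on) - math.lgamma(k_off)
    pmf = [0.0] * (n_max + 1)
    for i in range(steps):
        p = (i + 0.5) / steps  # latent beta-distributed variable in (0, 1)
        weight = math.exp(log_norm + (k_on - 1) * math.log(p)
                          + (k_off - 1) * math.log(1 - p)) / steps
        lam = k_t * p          # Poisson rate given p
        for n in range(n_max + 1):
            pmf[n] += weight * math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))
    return [math.log(max(q, 1e-300)) for q in pmf]

def build_lookup_table(parameter_grid, n_max):
    """Precompute one row of log-probabilities per (k_on, k_off, k_t) triple."""
    return {params: two_state_log_pmf(*params, n_max) for params in parameter_grid}

def max_likelihood(count_histogram, table):
    """count_histogram[n] is the number of cells observed with n transcripts.

    The log-likelihood of each parameter set is the dot product of the
    histogram with that parameter set's row of the table.
    """
    scores = {params: sum(c * lp for c, lp in zip(count_histogram, row))
              for params, row in table.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical coarse grid; a real table would be far larger.
grid = [(a, b, c) for a in (1.0, 2.0, 4.0) for b in (1.0, 2.0, 4.0) for c in (10.0, 20.0)]
table = build_lookup_table(grid, n_max=40)
```

Because the table is computed once, scoring a new gene or cell population only requires a histogram of its counts and one weighted sum per parameter set.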
Fig 3.
In silico validation of single cell analysis pipeline.
To estimate the accuracy of the new tools developed in this paper, each tool was evaluated against simulated datasets. (A-C) The kinetic parameter estimation method was tested against simulated data in which 90% of the data was randomly discarded, to simulate the loss of biological material through inefficient cDNA preparation. The known kinetic parameters and the estimated kinetic parameters were compared for Kon (A), Koff (B) and Kt (C). (D-F) Next, SABEC was tested against simulated datasets that had the same kinetic parameters as those estimated for the cell populations in the Moignard single cell dataset (124 simulated cells for each of the 5 cell populations). SABEC was tested on 100 such simulated datasets and the robustness of the clustering was measured by calculating the proportion of ambiguously clustered cells (PAC). This figure depicts a hierarchical clustering of the consensus matrices that come from (D) the dataset whose clustering had the worst PAC score, (E) a randomly selected dataset, and (F) the dataset whose clustering had the best PAC score. The coloured bars along the side and bottom represent the true class labels of each cell. (G-H) Next, the true positive and false positive rates were calculated for each proposed component of EPiK, the union of the MP and subset methods, and the intersection of all three methods. For the MP and subset methods, thresholds were set so that the false positive rate would be approximately 2%; receiver operating characteristic (ROC) curves for these are found in S13 and S14.
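One way to mimic the 90% loss described for (A-C) is binomial thinning, in which each simulated transcript is kept independently with probability 0.1. Whether the original simulations discarded material in exactly this way is an assumption; the function name `thin_counts` and its defaults are hypothetical.

```python
import random

def thin_counts(counts, keep_prob=0.1, seed=0):
    """Keep each transcript independently with probability keep_prob.

    Assumed stand-in for inefficient cDNA preparation, not the authors' code.
    """
    rng = random.Random(seed)
    return [sum(1 for _ in range(c) if rng.random() < keep_prob) for c in counts]

true_counts = [100] * 200            # hypothetical transcript counts for 200 cells
observed_counts = thin_counts(true_counts)
```

Under this assumption, the thinned counts retain roughly 10% of the transcripts on average, and parameter estimation is then run on `observed_counts` rather than `true_counts`.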
Fig 4.
Application of SABEC to hematopoietic stem cells and progenitor populations.
SABEC was used to identify subpopulations of cells in the Moignard et al. dataset and to identify specific cells that do not cluster well with the other cells of the same type. (A) Hierarchical clustering was applied to the SABEC consensus matrix to reveal six subpopulations of HSC. (Bi) The expected hematopoiesis differentiation tree is shown; the names of the cell populations are color-coded to match the vertical and horizontal bars in A. (Bii) We hypothesized a possible alternative differentiation tree, based on the clustering results. (C) Another application of this clustering method is to remove possible outliers, which may have arisen due to poorly sorted cells or extrinsic variability. Within each cell population (i.e. CLP (i), GMP (ii), HSC (iii), LMPP (iv) and PreM (v)), we sorted the cells by how often they are clustered with other cells from the same FACS label (% match). The vertical lines depict thresholds selected to approximately correspond to the region with the steepest slope; cells to the left are disregarded as outliers in the EPiK analysis.
Fig 5.
Transcription factor-specific kinetic parameter changes.
For each specific gene, it is possible to use EPiK to predict which kinetic parameters vary throughout differentiation. (A-D) For each transition between populations of cells, the kinetic parameter changes are predicted for each TF using all six methods shown in the key, which we represent by six colored rectangles; the color corresponds to the parameters predicted to have changed, as designated in the key. If too few cells express a given gene to calculate parameter changes, the rectangles are shown in white. The results are shown for a selection of TFs: Eto2 (A), Mitf (B), Nfe2 (C), and PU.1 (D). (E-H) To demonstrate that EPiK provides reasonable results, we depict the maximum likelihood estimates for Kt and Kon for each population of cells (pruned dataset). If there is a predominant change in Kon (red), Kt (yellow), or both variables (orange), then these populations are connected by lines of the corresponding color.
Fig 6.
Most significant kinetic parameter variations during hematopoiesis.
This figure illustrates kinetic parameter change predictions that are consistent across all six methods illustrated in the key of Fig 5; in other words, this is the intersection of the BIC, MP, and subset methods, as applied to both pruned and un-pruned datasets. Just as in Fig 5, red boxes designate that Kon varies, yellow boxes designate that Kt varies and orange boxes designate that both parameters change. A positive/negative sign signifies that gene expression increases/decreases along the line (top to bottom; or, in the case of horizontal lines, left to right). (A) The most significant kinetic parameter changes are shown for the expected hematopoietic tree. (B) Based on the SABEC clustering analysis, there appeared to be two distinct HSC subpopulations; the differences between these two subpopulations and the LMPP and PreM subpopulations were also calculated.
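The BIC component of this intersection can be illustrated with the standard definition BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of cells and L the maximised likelihood. The sketch below only shows how competing hypotheses about which parameters changed would be ranked. The hypothesis names, log-likelihood values and cell count are invented for illustration and do not come from the paper.

```python
import math

def bic(log_likelihood, n_free_params, n_cells):
    """Bayesian Information Criterion; lower values indicate a preferred model."""
    return n_free_params * math.log(n_cells) - 2.0 * log_likelihood

# Hypothetical maximised log-likelihoods for two populations analysed jointly.
# "none" shares Kon, Koff and Kt (3 free parameters); each parameter allowed to
# differ between the populations adds one free parameter.
hypotheses = {
    "none":   (-1250.0, 3),
    "Kon":    (-1238.0, 4),
    "Kon+Kt": (-1236.5, 5),
    "all":    (-1236.0, 6),
}
n_cells = 200  # hypothetical total number of cells

scores = {name: bic(ll, k, n_cells) for name, (ll, k) in hypotheses.items()}
best_hypothesis = min(scores, key=scores.get)
```

With these made-up numbers, allowing only Kon to differ is preferred: the additional gain in likelihood from freeing further parameters does not offset the ln(n) penalty per parameter.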
Fig 7.
Kinetic parameter differences between leukemic and healthy cells.
This figure illustrates kinetic parameter change predictions for the single cell qPCR data in Guo et al. Just as in Figs 5 and 6, red boxes indicate that Kon varies, yellow boxes indicate that Kt varies and orange boxes signify that both parameters change. The blue box indicates that Koff varies. A positive/negative sign signifies that gene expression increases/decreases along the line (top to bottom; or, in the case of horizontal lines, left to right).
Fig 8.
Consensus clustering comparison of SABEC and K-means.
In the case of the Moignard dataset, SABEC is better than K-means at grouping cells by their FACS label. (A, B) For each pair of cells, we can count the number of times the pair clusters together; a cumulative distribution function of these values, for K varying from 5 to 10, is shown for SABEC (A) and K-means (B). Pairs of cells that are grouped together more than 10% of the time and less than 90% of the time, represented by the vertical lines in (A) and (B), are considered to be ambiguously clustered. (C) The proportion of cells that are ambiguously clustered (PAC) for SABEC and K-means is used to evaluate the robustness of the clustering approach. (D, E) Next, we quantify the accuracy of the clusters, as compared to the labelled values (as per FACS). Specifically, this was done using the variation of information (VI) (D) and the corrected Rand index (corrected Rand) (E). (F, G) The full consensus matrix, ordered by cell type label according to FACS, for K = 6 is shown for SABEC (F) and K-means (G).
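The consensus matrix and the PAC score described in (A-C) can be computed from any set of repeated clustering runs. The sketch below is a minimal illustration with hypothetical function names and toy labels; it does not implement SABEC or K-means, and simply takes the cluster labels from each run as given.

```python
def consensus_matrix(runs):
    """runs: list of label lists, one per clustering run, all of the same length.

    Entry [i][j] is the fraction of runs in which cells i and j share a cluster.
    """
    n_cells = len(runs[0])
    together = [[0] * n_cells for _ in range(n_cells)]
    for labels in runs:
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / len(runs) for count in row] for row in together]

def pac(matrix, lower=0.1, upper=0.9):
    """Proportion of cell pairs whose consensus lies strictly between the thresholds."""
    n_cells = len(matrix)
    pairs = [(i, j) for i in range(n_cells) for j in range(i + 1, n_cells)]
    ambiguous = sum(1 for i, j in pairs if lower < matrix[i][j] < upper)
    return ambiguous / len(pairs)

# Toy example: four cells, four clustering runs (labels are arbitrary integers).
toy_runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
toy_consensus = consensus_matrix(toy_runs)
toy_pac = pac(toy_consensus)
```

A lower PAC means that most pairs of cells are either almost always or almost never clustered together, which is the sense in which the clustering is considered robust.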
Fig 9.
Properties of two-state model and kinetic parameter estimation strategy.
Kinetic parameters control the shape of the gene expression distribution. Kinetic parameters are estimated by calculating the maximum likelihood from Eq 2. (A-B) To illustrate some properties of the two-state model, the likelihoods of various kinetic parameter sets were calculated for a toy example, specifically GFI1b gene expression data in hematopoietic stem cells (HSC). (A) For each possible Kon and Koff value, we calculated the maximum log likelihood across all possible Kt values. The overall maximum log likelihood was 603.66 and only values that were within 0.5 of this value were coloured in, with the maximum value in dark red. (B) The maximum log likelihood across all possible Kon values was also calculated, as Koff and Kt were varied. There is a region in subfigure B where Koff and Kt can compensate for one another. (C) An example of parameter compensation was simulated: the original distribution (grey) has visible changes when Koff (blue) or Kt (red) is varied alone, but the distributions are barely distinguishable when both parameters are varied (purple). (D) Finally, we show our method’s estimates of Kon and ln(Koff/Kt) for each TF in each population of FACS cells.
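The compensation between Koff and Kt noted in (B) and (C) can be seen from the first two moments of the two-state model. With rates normalised to mRNA degradation, the steady-state mean is Kt·Kon/(Kon + Koff) and the Fano factor is 1 + Kt·Koff/((Kon + Koff)(Kon + Koff + 1)). The sketch below evaluates these standard expressions for hypothetical parameter values chosen for illustration, not taken from the figure.

```python
def two_state_moments(k_on, k_off, k_t):
    """Steady-state mean and Fano factor (variance / mean) of the two-state model."""
    mean = k_t * k_on / (k_on + k_off)
    fano = 1.0 + k_t * k_off / ((k_on + k_off) * (k_on + k_off + 1.0))
    return mean, fano

# Hypothetical parameter sets in the high-Koff (bursty) regime.
baseline = two_state_moments(k_on=2.0, k_off=50.0, k_t=100.0)
koff_only = two_state_moments(k_on=2.0, k_off=100.0, k_t=100.0)  # Koff doubled
kt_only = two_state_moments(k_on=2.0, k_off=50.0, k_t=200.0)     # Kt doubled
both = two_state_moments(k_on=2.0, k_off=100.0, k_t=200.0)       # both doubled
```

Doubling either parameter alone shifts the mean substantially, whereas doubling both leaves the mean and the Fano factor within a few percent of the baseline. This is consistent with Koff and Kt being identifiable mainly through their ratio when Koff is large, which is presumably why (D) reports ln(Koff/Kt).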