Determining Physical Mechanisms of Gene Expression Regulation from Single Cell Gene Expression Data

Many genes are expressed in bursts, which can contribute to cell-to-cell heterogeneity. It is now possible to measure this heterogeneity with high throughput single cell gene expression assays (single cell qPCR and RNA-seq). These experimental approaches generate gene expression distributions which can be used to estimate the kinetic parameters of gene expression bursting, namely the rate that genes turn on, the rate that genes turn off, and the rate of transcription. We construct a complete pipeline for the analysis of single cell qPCR data that uses the mathematics behind bursty expression to develop more accurate and robust algorithms for analyzing the origin of heterogeneity in experimental samples, specifically an algorithm for clustering cells by their bursting behavior (Simulated Annealing for Bursty Expression Clustering, SABEC) and a statistical tool for comparing the kinetic parameters of bursty expression across populations of cells (Estimation of Parameter changes in Kinetics, EPiK). We applied these methods to hematopoiesis, including a new single cell dataset in which transcription factors (TFs) involved in the earliest branchpoint of blood differentiation were individually up- and down-regulated. We could identify two unique sub-populations within a seemingly homogenous group of hematopoietic stem cells. In addition, we could predict regulatory mechanisms controlling the expression levels of eighteen key hematopoietic transcription factors throughout differentiation. Detailed information about gene regulatory mechanisms can therefore be obtained simply from high throughput single cell gene expression data, which should be widely applicable given the rapid expansion of single cell genomics.

Nevertheless, preliminary studies have shown that it is possible to utilize the shape of the distribution of gene expression at a single point in time to estimate the kinetic parameters of the two-state model. Raj et al. [11] used a variation of the mathematical analysis done by Peccoud and Ycart [7] to estimate the kinetic parameters in fluorescent in situ hybridization (FISH)-based expression studies, and Kim and Marioni [12] developed a strategy to estimate kinetic parameters in RNA-seq data. In addition, Teles et al. [13] applied a similar approach to a single cell qPCR dataset in the context of a hematopoietic developmental system.
As it is now accepted that bursting dynamics can in principle be resolved from single cell snapshot datasets, we can take advantage of this type of analysis to develop new algorithms that can answer a wide range of biological questions. In this paper, we apply our understanding of bursting gene expression to develop a more effective clustering algorithm for single cell gene expression data, which we call Simulated Annealing for Bursty Expression Clustering (SABEC). This can help researchers identify previously uncharacterized sub-populations of cells within their single cell data. Secondly, we develop a statistical tool for identifying whether the burst frequency (K on or K off ) or burst magnitude (K t ) are the source of gene expression variations between experimental samples. Instead of simply observing whether gene expression increases or decreases across two populations of cells, this Estimation of Parameter changes in Kinetics (EPiK) toolkit allows researchers to make inferences about the underlying regulatory mechanisms affecting gene expression. We provide a complete pipeline in R for analyzing single cell qPCR data, including data normalization steps, SABEC clustering and EPiK cluster comparisons. This pipeline can be utilized in a wide range of biological contexts. We apply it to study how some of the key transcription factors in hematopoiesis are being regulated, discovering that the hematopoietic stem cells studied in Moignard et al. [14] formed two distinct sub-populations, distinguishable by Tel and Gata1 expression levels. Then, by manipulating the expression levels of two of the TFs involved in early differentiation (Gfi1 and Gata2), we predict that the primary mechanisms by which these TFs influence their downstream targets is by manipulating K on . Finally, we compare the mechanisms by which cell surface markers are regulated in healthy and leukemic cells based on data previously reported in Guo et al. [15].

A pipeline for analyzing single cell gene expression data
We have developed a pipeline for analyzing single cell qPCR data in order to determine the mechanism by which TFs were regulated during differentiation (Fig 2).
The input data consists of single cell qPCR data for a set of genes, where the cells have been sorted into their expected cell populations, using fluorescence-activated cell sorting (FACS). There are two main outputs of the pipeline. Firstly, cells are divided into populations (clusters) based on their bursting behaviour-each gene has the similar rates of activation (K on ), rates of The likelihoods for each possible parameter set are found by multiplying the experimental data by the lookup table, and then the maximum likelihood is identified. (B) The SABEC algorithm is used to identify clusters of cells with uniform bursting kinetics-this iterative algorithm alternates between assigning cells to clusters and estimating bursting kinetic parameters for each cluster. This clustering algorithm is run 50 times and the results are summarized in a consensus matrix, which represents the frequency of each pair of cells being found in the same cluster. (C) Finally, the EPiK tool identifies the most likely set of parameters to have varied for a gene across two different cell populations, using a combination of the Bayesian Information Criterion (BIC), Marginal Probability (MP), and a Subsampling-based method, as described in (C). BIC calculates the likelihood for each possible set of parameters, penalised by the number of free parameters. The MP score calculates the likelihood of each parameter changing, independent of the behaviour of the other parameters. In the subsampling method, random sets of cells are selected, the kinetic parameters estimated for each population of cells; then, the distributions of estimated kinetic parameters are compared. (D) These new tools are combined to form a pipeline for analyzing single cell qPCR data. The details for each step of this pipeline are described in the Methods. inactivation (K off ) and rates of transcription (K t ) across all cells within each cluster. Secondly, the pipeline compares between each pair of cell populations and predicts the mechanism by which each gene changes its expression-by changing the rate the gene turned on and off or by changing the rate of transcription once the gene is already active.
A key component in the pipeline is our strategy for estimating the kinetic parameters. Previous approaches for estimating the kinetic parameters were slow to compute, because they involved iterative algorithms, such as simulated annealing [13], expectation maximisation [11], or Gibbs Sampling [12]. Meanwhile, we have a pre-computed look-up table for P(x|K on , K off , K t ), the probability of seeing x mRNA transcripts, given a set of kinetic parameters (Fig 2A and Eq 1). This table can be used to compute the most likely set of kinetic parameters using a single matrix multiplication-a very efficient computation (Eq 2).
In the pipeline, first we normalize the measurements from the qPCR and transform them into estimated mRNA counts (See Eq 6). Next, these measurements are input into a newly developed unsupervised clustering algorithm (SABEC) which identifies populations of cells with uniform bursting kinetics. SABEC is an iterative algorithm that starts by randomly assigning cells to clusters; next, the algorithm alternates between estimating the kinetic parameters for each gene within each cluster (using Eq 2) and probabilistically re-assigning cells to clusters based on their likelihood of belonging to each cluster (using Eq 9), until the algorithm converges-i.e. when fewer than 5% of cells swap clusters. SABEC is run fifty times using different initial sets of randomly assigned clusters, and the results are summarised as a consensus matrix-a matrix that records the number of times two cells were grouped together. The number of clusters can be determined by one of the following methods: the proportion ambiguously clustered (PAC), Variable Information (VI) or the corrected Rand score (See Fig 2B).
Finally, each pair of populations is compared using a statistical tool we developed called EPiK, in order to identify whether each gene is regulated by modifying the rate the gene turns on (burst frequency) or the level of transcription once the gene is active (burst magnitude). EPiK is a compilation of three different methods: the Bayesian Information Criteria (BIC), a Marginal Probability (MP) Score, and a subsampling method. Since for each gene there can be between 0 and 3 parameters that can differ across a pair of populations, BIC identifies the set of parameters most likely to differ after penalising parameter sets that are larger (See Eq 12). The MP score separately calculates the likelihood of each parameter changing, independent of whether the other parameters change or stay the same (See Eq 15). In the subsampling method, the kinetic parameters are calculated for small random subsets of cells, then the distributions of estimated kinetic parameters for the subsets are compared by the Kolmagorov-Smirnov statistic( Fig 2C).
All together, our R data processing scripts, SABEC, and EPiK come together to form a pipeline that utilizes the mathematics of bursting gene expression to determine how genes are differentially regulated across populations of cells ( Fig 2D).

Validation of methods with in silico datasets
These methods were all tested in silico using data sets that have similar structures to the experimental datasets. Each of the three methods introduced in this paper (ML approach for kinetic parameter estimation, SABEC and EPiK) required a different set of simulated data.
The ML method for kinetic parameter estimation was benchmarked populations of cells with known kinetic parameters to allow us to quantify the accuracy of the method. 3,000 parameter sets were randomly selected (uniformly distributed) with K on between 0 and 5, K off between 0 and 20, and K t between 0 and 600. We randomly sampled 10% of the transcripts from each of the simulated cell, to represent the technical noise caused by the loss of 90% of the starting material during sample preparation. We included a stochastic loss of mRNA transcripts to account for material loss during cDNA library construction-Islam et al. [33] estimates that only 48% of transcripts are reverse transcribed into cDNA, and Wu et al. [34] could capture 42% of the total unique transcripts that were identified in bulk RNA-seq.
The simulated datasets for testing the SABEC method were chosen to be as similar to the experimental datasets as possible. 100 simulation sets were generated, with each completely parallel to the Moignard dataset; Each simulation set consisted of five populations of 124 cells with 18 genes each, with their kinetic parameters equal to those estimated by the ML method for the experimental data.
Finally, to test the final method of comparing kinetic parameters between two populations of cells, 1600 simulated datasets were generated, each one consisted of a pair of populations of 124 cells with 1 gene each. There are eight combinations of kinetic parameters that can change (none, all, three ways one parameter can change and three ways two parameters can change). Two hundred pairs of populations were selected for each of these eight scenarios, with 100 simulations with 0 < K off < 5 and 100 simulations with 5 < K off < 10. The range of the other parameters were 0 < K on < 5 and 0 < K t < 600. In these simulations we also simulated the random loss of 90% of the mRNAs.
Even after simulating 90% loss of biological material, the predicted parameters correlated well with their real values, although K off was the most difficult parameter to predict (See Fig  3A-3C), especially at wider parameter ranges (S1 and S2 Figs), but it still performed better than the Kim and Marioni Method (S3 Fig). These in silico results emphasize that the cDNA library efficiency can have a large impact on the absolute values of predicted parameters, which suggests that raw parameter values are not comparable across different experiments. SABEC also accurately predicted clusters of simulated datasets, with similar population structures and parameter ranges to the experimental dataset (Fig 3D-3F). Finally, we found that the EPiK method was very conservative, with approximately a 0.05% false positive rate when the methods were intersected (Fig 3G and 3H). Further details about the validations and comparisons with alternative methods are described in depth in the Methods. Next, we applied these methods to experimental datasets.

Single cell analysis of hematopoietic stem cells and progenitors
The dataset we use to test our pipeline comes from Moignard et al. [14], which is a high quality single cell qPCR dataset that includes approximately 124 cells each from five different populations of cells during hematopoeisis. Hematopoiesis is the process by which hematopoietic stem cells (HSC) in the bone marrow differentiate into different types of red and white blood cell types. This process can be depicted as a differentiation tree, in which each cell must make multiple "decisions" at each branching point in the tree that will determine its final cell fate. Moignard et al. [14] includes two key branching points: i) HSC cells can become lymphoidprimed multipotential progenitors (LMPPs) or premegakaryocytes (PreMegEs, also referred to as PreMs in figures) and ii) LMPP cells can become granulocyte-macrophage progenitors (GMPs) or common lymphoid progenitors (CLP).
The focus of this study is on the densely interconnected TF regulatory network that has been shown to contribute to cell fate decisions. There is strong evidence that at least seven of these TFs directly interact with one another, potentially forming a SCL/Lyl1/Gata2/Runx1/ Lmo2/Fli-1/Erg heptad, with some TFs directly binding to the DNA (such as Gata2 and SCL) and other TFs serving a bridge between the DNA bound components (such as LMO2) [16].
Other key TFs that were profiled in Moignard et al. [14] were PU.1, Meis1, Hhex, Tel, Nfe2, Eto2, Mitf and Ldb1. In silico validation of single cell analysis pipeline. In order to estimate the accuracy of the new tools developed in this paper, each tool was evaluated against simulated datasets. (A-C) The kinetic parameter estimation method was tested against simulated data in which 90% of the data was randomly discarded, to simulate the loss of biological material through inefficient cDNA preparation. The known kinetic parameters and the estimated kinetic parameters were compared for K on (A), K off (B) and K t (C). (D-F) Next, SABEC was tested against simulated datasets that had the same kinetic parameters as those estimated for the cell populations in the Moignard single cell dataset (124 simulated cells for each of the 5 cell populations)-SABEC was tested on 100 of such simulated datasets and the robustness of the clustering was measured by calculating the proportion of ambiguously clustered cells (PAC). This figure depicts a hierarchical clustering of the consensus matrices that come from (D) the dataset whose clustering had the worst PAC score, (E) a randomly selected dataset, and (F) the dataset whose clustering had the best PAC score. The coloured bars along the side and bottom represent the true class labels of each cell. (G-H) Next, the true positive and false positive rates were calculated for each proposed component of EPiK, the union of the MP and subset method, and the intersection of all three methods. For the MP and Subset methods, thresholds were set so the false positive rate would be approximately 2%-receiver operating characteristic (ROC) curves for these are found in S13 and S14. There are a number of open questions about the differentiation of HSC into the various progenitor cell populations. Firstly, although the cells were assigned to their populations via FACS, it is unclear whether these sub-populations are truly uniform. For instance, some HSC cells may be biased towards self-renewal or producing cells in the lymphoid or myeloid lineages.
Secondly, the specific mechanisms by which these TFs are being regulated are as yet uncharacterized. Influencing the rate at which genes turns on (K on ) could increase the proportion of cells with an active copy of a gene; whereas manipulating the transcription rate (K t ) would result in there being higher concentrations of mRNA, while maintaining the same proportion of cells with an active gene. In hematopoiesis, cellular TF concentrations help determine which branch the cell will take as it differentiates towards its final cell fate. Therefore, choosing to manipulate K on instead of K t could influence the proportion of cells that enter a certain differentiation trajectory. In addition, since the expression of these TFs is tightly linked, one gene's bursting dynamics could have repercussions on the dynamics of the entire network.

Hematopoietic stem cell populations have heterogeneous bursting kinetics
Although the cells that we study have been sorted into distinct subpopulations using FACS, this does not guarantee that the populations are indeed transcriptionally uniform. Some cells may be misclassified by FACS (expected misclassification rate is 1% for the Moignard dataset), and some of the known populations may be composed of as yet unidentified subpopulations. In addition, extrinsic variability, such as having cells in different stages of the cell cycle, could cause the populations to be heterogeneous.
It can be difficult to accurately identify homogenous subpopulations using standard clustering approaches like K-means or hierarchical clustering (S5 Fig). For instance, K-means is most effective when each cluster is normally distributed and has similar variances. Due to bursting, gene expression is unlikely to come from such a distribution; gene expression distributions can look like a Poisson distribution, a negative binomial distribution, or even a bimodal distribution, depending on K on , K off , and K t . Therefore, we developed a simulated annealing strategy (Simulated Annealing for Bursty Expression Clustering, SABEC), which takes into account bursting gene expression, in order to have more robust clustering. SABEC was rigorously tested against simulated datasets that were designed to be as close as possible to the structure of the experimental data (S6 Fig). Additionally, the algorithm was tested on a wide range of simulated datasets, to see if this method could perform well under varied conditions, such as on data with different numbers of genes measured in each cell and different numbers of cells in each population (S7 Fig). Next, SABEC was applied to the experimental single cell qPCR data from Moignard et al. [14]. The final clustering is depicted in Fig 4A. Although the predicted hematopoiesis differentiation tree is expected to look like Fig 4Bi, we found that HSC is divided into two distinct subpopulations. One of these populations had 12.8% of cells expressing Gata1, while the other population had none. Since Gata1 is uncommon in HSC cells, but often found in multi-potent progenitor populations (MPP, the earliest progenitor population formed by HSCs [17]), we hypothesized that the populations of cells without Gata1 may be self-renewal HSCs and the other may be HSCs poised for differentiation.
Our clustering with SABEC was based solely on the expression of 18 TFs. However, Moignard et al. [14] also profiled cell surface protein tyrosine kinase c-Kit. It is known that low levels of c-Kit correlate to greater self-renewal potential in HSCs [18]. Even though all of the HSC cells were positive for the c-Kit protein according to FACS, not all the cells had high levels of c-Kit transcripts. Our hypothesized self-renewal HSC population had significantly lower levels of c-Kit than our poised-to-divide population (p-value 9.117e-06 with Kolmogorov-Smirnov test). Even though c-Kit was not one of the genes used to cluster the cells, there was substantial difference in expression levels, suggesting the tree topology depicted in Fig 4Bii. Further evidence for this arrangement is in S9 Fig. The SABEC algorithm has provided us with subpopulations that appear to have uniform bursting kinetics. In the next section we will identify which kinetic parameters were most likely adjusted across each pair of cell populations, using EPiK. EPiK works best when cell populations have uniform bursting dynamics; S11 Fig shows how having mixed cell types can influence estimates of kinetic parameters. Therefore, we remove cells that do not cluster well with other cells of the same type, with cutoff thresholds designated by the vertical lines in Fig   which is determined by the region of the curve with the steepest slope. However, this pruning protocol could bias parameter estimation if it were to eliminate cells that are falsely identified as outliers (S10 Fig). Therefore, we run EPiK both on the pruned and unpruned datasets, and only consider parameters that are consistently found to have changed under both conditions. It is important to note that the proposed outlier cells may be of biological interest-for instance, they may be rare cell types or cells in the process of differentiation. The pruned dataset is only for use to boost the accuracy of predictions with EPiK, but all cells should be observed for other types of analysis.

Transcription factors involved in hematopoiesis are regulated by different mechanisms at each stage of differentiation
EPiK incorporates three different metrics for evaluating whether kinetic parameter changes are significant. By taking the intersection of these three prediction methods, the false positive rate decreases without a significant drop in the true positive rate. This gives us a very conservative list of probable kinetic parameter changes (a 0.05% false positive rate with our simulated dataset).
The first method is the Bayesian Information Criterion (BIC) (S12 Fig), and the second method is the marginal probability (MP), which is the log likelihood of a certain parameter being varied, independent of whether the other two parameters vary or stay the same (S13 Fig). In the third method, we repeatedly subsample cells from each population and estimate the kinetic parameters for each subset, comparing the maximum distance between the cumulative density functions of the distributions (S14 Fig).
These three methods were applied to the experimental data from Moignard et al. [14]. Each of the methods is based on slightly different assumptions, so they each have different distributions of parameters being varied (See S15 Fig). For instance, BIC predicts that K off is adjusted in a few cases, but this is not deemed significant by either the MP or subset methods. In addition, there are more significant parameter changes in the case of the pruned populations compared to the complete populations, because these populations are more distinct from one another.
We can now take a closer look at a few examples of TFs that have predicted kinetic parameter changes during differentiation ( Fig 5). The predictions from each method are drawn along the branches of the hematopoiesis differentiation tree (Fig 5A-5E). To demonstrate the magnitude and direction of these kinetic parameter changes, Fig 5E-5H illustrates the kinetic parameter estimates for each population of cells.
In Fig 5A and 5F, four out of the six methods predict that Eto2 is up-regulated by increasing K on (red) during the HSC to PreM transition, which is consistent with Eto2 having a role in increasing the proportion of cells that differentiate into PreM [19].
One striking feature of Fig 5A-5E is that Eto2, Mitf, and PU.1 are regulated by different kinetic parameters in different stages of blood differentiation. PU.1 has known cell-type-specific enhancer elements [20], and our results may suggest that each of these may regulate PU.1 through different mechanisms. On the other hand, Nfe2 is consistently regulated by K t throughout differentiation.
In some instances, all six methods come to a consensus as to which kinetic parameter was the source of gene expression variability for a particular TF, and these are shown in Fig 6A. Throughout hematopoiesis, most of the differences in gene expression come from changes of K on , but Lmo2, Nfe2 and Meis1 are regulated by K t in the transition from LMPP to GMP. Recall that in the previous section, we identified that HSC forms two distinct sub-populations. We compared the two HSC sub populations with their child populations (LMPP and PreM) (see Fig 6). Between HSC1 and HSC2 only Tel seemed to consistently change its kinetic parameter (K on ). As expected, the cell population with higher Gata1 and c-Kit expression has more in common with LMPP and PreM cells than the other HSC subpopulation.
In summary, our methods have allowed us to not only identify which genes have changed their expression, but also what physical mechanism caused that change.

Gata2 and Gfi1 may regulate transcription by influencing the rate genes turn on
In the previous section, we identified TFs that were regulated by K on or K t during blood differentiation, but it is unclear which TFs were controlling these changes. It may be possible to discover the specific mechanistic role of a TF by manipulating its expression experimentally and then calculating the change in kinetic parameters of its downstream targets.
We decided to focus on the Gfi1-Gfi1b-Gata2 subnetwork that was identified as being important at the first branching point of HSCs to LMPP and PreMegEs [14]. Primary HSCs are difficult to isolate in large quantities, culture and manipulate, so we turned to HPC7 cells, a model cell line for hematopoietic stem and progenitor cells which has some differentiation potential towards more mature blood cells [16,21]. Similarly to HSCs, HPC7 cells express Gata2 and Gfi1b, but little or no Gfi1. We therefore up-regulated Gfi1 expression and downregulated Gata2 expression and performed single cell gene expression analysis for the gene set described by Moignard et al. [14], as well as some additional genes involved in HSC differentiation. We analysed 81 cells expressing an shRNA against Gata2 and 77 cells control cells expressing an shRNA against Luciferase, and 72 cells overexpressing Gfi1 and 45 control cells expressing an empty vector.
There are a number of strategies by which a TF could reduce gene expression: by decreasing the rate a gene turns on, by increasing the rate a gene turns off, or by decreasing the rate of transcription of an active gene, and each of these strategies would result in different temporal dynamics of gene expression bursting. All six methods came to a consensus that up-regulating Gfi1 seemed to significantly alter Erg, Gfi1b, Hhex and Mpl, by lowering K on . Previous research suggests that Gfi1 is usually a repressor that either keeps the chromatin in a condensed state or actively competes for binding with activators [22]. Both of these mechanisms of action are consistent with Gfi1 down regulating its targets by lowering K on . Based on ChIP-seq experiments from Sanchez-Castillo et al. [23], Gfi1 binds in or near all four of these potential targets (see S1  Table).
In the other experiment, Gata2 was down-regulated; however, this was not a complete knockdown, with only a slight overall decrease in expression (S20 Fig). All six methods suggest that Gfi1b expression was decreased via a change in K on as Gata2 levels decreased. In the pruned population of cells, Procr (also known as EPCR, a known target of Gata2) had higher K on after the knockdown [24].
When Gata2 and Gfi1 were down-and up-regulated, our methods could only detect changes in K on . Therefore, we can hypothesize that this could be the mechanism of action of these two TFs.

Gene expression variations in leukemia
Our pipeline for the identification of differential kinetic parameter values can also be applied to compare healthy and diseased cell populations. In particular, we apply it to four of the cell populations isolated in Guo et al. [15]: two cell types from a healthy mouse (GMP, Lin+) and two from a leukemic mouse (LGMP, LLin+), whose leukemic cells came from a MLL-AF9 fusion protein [25]. The most significant kinetic parameter changes are shown in Fig 7. Some of the genes profiled by Guo et al. [15] were found in fewer than 10 cells in one or more cell population, and these were excluded from kinetic parameter comparisons.
Some of the genes are enriched in a single cell population; for instance, TLR9, a factor whose expression influences the prognosis of leukemia [26], is found predominantly in the LLin+ cells, and appears to be regulated by K on . Most of the identified kinetic parameter changes were in K on or a combination of K on and K t . However, MUC13 and SIGLEC5 were predicted to have been regulated by only K t and IL3RA was predicted to be regulated by K off . Interestingly, IL3RA is the only example where all methods predict changes in K off , which suggests that this is a promising gene to focus on in future research.

Discussion
In this study, we exploit established mathematical models of bursting gene expression to develop a new pipeline for analyzing single cell qPCR data to more robustly cluster biological samples and provide insight into the mechanics of gene regulation. We apply these methods to study gene regulation in hematopoietic stem cell and progenitor populations. Even though single cell qPCR data can only provide snapshots of gene expression in a population of cells across different time points, we can infer the temporal dynamics of gene expression in these cells, and use this information to infer the population substructure (via SABEC) or regulatory mechanisms (via EPiK). This pipeline can be applied to study how genes are regulated during the natural process of differentiation or as cells progress into a diseased state. In addition, by manipulating the expression of a TF within its cell culture, we can infer its specific regulatory role. These algorithms perform well in simulated datasets (S1, S2, S6, S7, S8, S12, S13 and S14 Figs), performing better than similar computational tools (Figs 8, S3 and S5).
Instead of simply observing how much gene expression heterogeneity there is in a sample, it is now possible to predict the specific regulatory mechanisms that contributed to heterogeneity. Most crucially, this type of analysis can be done in a high-throughput manner. In addition, commonplace clustering algorithms like K-means and hierarchical clustering are not meant to cluster data drawn from Poisson, negative binomal, and bimodal distributions, as is the case for single cell gene expression data. For instance, K-means performs best on data that comes from normal distributions with similar standard deviations. For this reason, it is critical to use a clustering algorithm that incorporates information about the shape of the gene expression distributions when analyzing single cell resolution datasets.
However, it is unclear how the cell cycle could influence our results, so future experiments ought to include cell cycle markers as controls [27]. In addition, our approach assumes that the gene expression distributions are close to equilibrium. Fortunately, Peccoud and Ycart [7] demonstrated that the two state model approaches the equilibrium distribution very rapidly (at an exponential rate), so this assumption likely holds. Other researchers have attempted to fit a multi-state model to data, instead of a two state model [28], but unless there is strong evidence for a more complex model, it is wise to use the simplest approach to avoid over-fitting the data. In the future, it may be possible to modify the model to detect cells that are in the process of transitioning between cell populations-for instance, these may be cells that have estimated kinetic parameters that are between two other populations. In addition, Teles et. al. [13] estimated the probability of transition between cell populations using machine learning methods, but this is beyond the scope of this paper.
One particularly useful application of this pipeline is the validation of assumptions used to model specific sub-networks of genes important in differentiation. Often, these models assume certain mechanisms by which the TFs influence one another. For instance, Narula et al. [29] constructed a mathematical model of a hematopoietic sub-network under the assumption that K off was the parameter that is biologically regulated, in the absence of any experimental data. Instead of arbitrarily selecting a modeling strategy, we can now choose one that fits the data best. In addition, we have discovered that different TFs are regulated through different mechanisms in each stage of differentiation, implying that a single model of a gene network might not universally apply. Therefore, these strategies would allow us to make more biologically plausible models of gene subnetworks. Having a better understanding of gene regulation processes is important in order to learn how these are perturbed in disease, and also to develop protocols that produce desired cell types for cell therapy.
A modified version of this pipeline for RNA-seq data would be an important future development. The kinetic parameter estimation strategy and EPiK may be applied to single cell RNAseq, as long as there is sufficient sequencing depth to capture the population-wide distribution of gene expression. Both these methods scale approximately linearly with the number of cells and genes under study, and could also be run in parallel for very large datasets. However, SABEC would not scale well with the large number of genes analysed in RNA-seq experiments, so alternative clustering approaches must be tested in an RNA-seq context.
The transcription process is a multi-step chemical reaction: chromatin must enter the correct state, TFs must bind in the right places, the general transcription machinery must be recruited and initiated, etc. It is currently impossible to distinguish the effects of all of these mechanisms in a high-throughput way. Our pipeline provides a first attempt to understand how the kinetic parameters underlying complex transcriptional processes influence heterogeneity within and across cell populations, through the analysis of single cell gene expression data.
HPC7 cells were infected with retrovirus by centrifugation at 800 xg at 32°C for 1.5 hours with 4 μg/ml polybrene (Sigma), after which the retroviral supernatant was replaced with fresh media and cells were cultured as normal. Transduction efficiency was monitored by flow cytometry for GFP.

Single cell gene expression analysis
Single GFP+ cells were sorted by FACS into individual wells of 96 well plates and single cell RT-qPCR was carried out as described previously [14]. Cells were captured 48 hours after retroviral transduction for Gfi1 overexpression and 72 hours after transduction for Gata2 knockdown (S1 Table).

Estimation of kinetic parameters
Bursting gene expression can lead to a number of different distributions of gene expression in a population of cells (See Fig 1B), ranging from a bimodal distribution (when K on and K off are low) to a Poisson distribution (when K on is much higher than K off ) to an exponential decay-like distribution (when K off is much higher than K on ). The probability of having x mRNA molecules in a cell with kinetic parameters K on , K off and K t is given by the analytical solution developed by [11]: In this equation, 1F1 represents the confluent hypergeometric function of the first kind, a summation over an infinite series that is time intensive to compute. To improve our runtimes, we precomputed an extensive lookup table of values of P(x|K on , K off , K t ). Let us say that we have n cells, each with an mRNA molecule count (for a particular gene) of x i , and let X = {x 0 , x 1 , . . .x n }. Given this list of mRNA counts from a semi-uniform population, we can assess the log likelihood of each possible set of parameters: The set of kinetic parameters that has the maximum likelihood is chosen. However, it is important to note that some areas of the parameter space are more sensitive to parameter changes than others. For instance, [11] notes that at large values of K off the equation of P(x|K on , K off , K t ) approaches: GðK on ÞGðx þ 1Þ This equation depends of the ratio of K t /K off rather than K t and K off separately, which means that K off and K t are more difficult to distinguish as K off increases. The practical implication of this observation is illustrated in Fig 9A and 9B, which depict the log likelihoods at different kinetic parameter values for GFI1b in HSC cells, with regions that have log likelihoods close to the peak value coloured in (specifically, within 0.5 of the maximum likelihood). Although there is only a narrow range of possible K on values (A), there is a range of values where K off and K t can partially compensate for one another (B). A specific example is illustrated with simulated data in Fig 9C: while the original distribution of gene expression (grey) varies visibly when K off (blue) or K t (red) are varied, it is possible to change both variables (purple) and almost recover the original distribution. Furthermore, this analysis suggests that it would be difficult to incorporate any additional parameters into the system and find unique estimates for each of them.
In S1 Fig, we test the performance of this strategy of kinetic parameter estimation on simulated data, including artificial technical noise. Although K off cannot be accurately identified when K off > 5, the ratio of K off to K t is accurately predicted (S2 Fig). Our method also performs better than the Gibbs Sampling based approach developed by Kim and Marioni, 2012 (S3 Fig).
Next, we applied this maximum likelihood approach for kinetic parameter estimation on hematopoietic stem cell and progenitor populations (CLP, GMP, CLP, LMPP and PreM cells) from [14]. The values for K on and ln(K off /K t ) for the Moignard data are shown in Fig 9F. Note that many of the TFs were particularly chosen due to their importance in early differentiation of HSCs, so one would expect a lower level of gene expression (and therefore a higher K off /K t ratio) in later stage progenitor populations such as CLP and GMP. There is a wide range of estimated kinetic parameter values across the TFs in each cell population; however, we need to ensure that these kinetic parameter estimates are not being skewed by mis-classified cells before we can evaluate whether these differences are statistically significant.

Parameter range and look-up table
It is crucial to select a table of appropriate range and point density for applying to the experimental data. The criteria for selecting this table were: i) fewer than 5% of the experimental points were at the maximum parameter value for the K on and K t parameters. ii) K off had to be high enough in order to include the parameter range in which only the ratio of K t to K off matter ii) the density of points was sufficient to minimise artifacts arising from the discretization of the parameter space.
Given these constraints, the range of parameters was chosen to be 0 < K on < 5, 0 < K off < 20 and 0 < K t < 200 and possible mRNA counts as 0 < x < 200. The sampling density for K on was every 0.1, for K off every 0.4 and for K t every 5 for a total of 395000 possible parameter sets.
We deposited the look-up table, the code used to generate the table in Mathematica and the code for calculating the maximum likelihood in R on Github: https://github.com/ezer/ SingleCellPipelineOverview.

Scaling qPCR data to mRNA counts
The main source of data analysed in this paper comes from Moignard et al. [14], and contains single cell qPCR data from five populations of hematopoietic stem cell and progenitor cells (CLP, GMP, HSC, LMPP, PreM), as determined by FACS, with approximately 124 cells in each population. The genes profiled by the qPCR include 18 TFs with crucial roles in cell fate, which were normalised to two "housekeeping" genes (PolII and Ubc), as described in Moignard et al. [14]. The maximum log likelihood of all possible K on values were also calculated, as K off and K t were varied. There is a region in subfigure B where K off and K t can compensate for one another. (C) An example of parameter compensation was simulated: the original distribution (grey) has visible changes when K off (blue) and K t (red) are varied, but the distributions are barely distinguishable when both parameters are varied (purple). (D) Finally, we show our method's estimates of K on and ln(K off /K t ) for each TF in each population of FACS cells.
The outcome of a PCR experiment is a normalised C t value, which relates to mRNA molecules (x) as follows: where a and b are constants. For each TF, we chose b: where x max = 200 is the maximum number of mRNA pre-calculated in our lookup table and X TF is the set of mRNA counts for the particular TF. This choice of b TF stretches the values of x to have as wide a range as possible. It also removes the need to set an a parameter, since the equation for calculating x can be simplified to: It is important to note that while a different b parameter was chosen for each TF, this b factor is consistent across all five populations of cells, which is crucial for our later attempts to compare kinetic parameter values between cell populations. Even though the choice of b changes the absolute value of the kinetic parameters that are estimated, it has minimal effects on the strength of the linear correlation between the known and estimated values.

Simulated Annealing for Bursty Expression Clustering (SABEC)
SABEC begins by assigning each cell to a random population 1 to K. Note that the total number of clusters K must be set at the start of the algorithm. Next, SABEC iteratively calculates the kinetic parameters of each of the K populations, and the cells are reassigned to new populations probabilistically.
In a standard Expectation Maximisation clustering algorithm, the probability of assigning a cell to a population is proportional to the ratio of the likelihoods of the cell coming from each population. Let the kinetic parameter set for a single population be S i (g) = (K on (g), K off (g), K t (g)), where g is the gene (between 1 and G) and i labels the population (between 1 and K), and let x(g) be the vector of mRNA counts for each gene, for a particular cell. The log likelihood for a certain population can be calculated as follows: If the sum of the likelihoods for each population is L tot , this value can be scaled as such: Since it would take a long time for this algorithm to converge if the clusters are close to one another, we add a temperature parameter, so that initially it is easy for cells to be assigned to different clusters, but it becomes harder and harder to swap clusters over time: where τ is the temperature parameter and t is the iteration number of the algorithm (a counter that increments each time the cells are reassigned to new clusters). A cell is probabilistically assigned to a new cluster based on the relative values of L@(S i |x) for each population. The algorithm terminates when fewer than 5% of the cells swap clusters in an iteration, or after 100 iterations. S4 Fig shows how the accuracy of the algorithm depends of the temperature parameter, τ.
Since this algorithm is randomized and since it is possible for certain runs of the algorithm to fall into local optima, we run this algorithm 50 times and conduct a secondary consensus clustering step. In this step, a consensus matrix is produced, in which each cell of the matrix represents the number of times that two cells are found in the same cluster. A plot of the cumulative density function of the values of this matrix can help visualise the robustness of the clustering (See Fig 8A). For comparison and to show the necessity for SABEC, consensus clustering of K-means was also conducted (See Fig 8B). Fig 8 shows the cumulative density functions for the number of times two cells cluster together for SABEC (A) and K-means (B). [32] determined that the most robust metric for comparing the consistency of consensus clustering methods is the Proportion Ambiguously Clustered (PAC) score, which is defined as the proportion of cell pairs that cluster together in 10% and 90% of the repeated runs of the algorithm. This corresponds to the proportion of cell pairs that lie between the vertical lines in the cumulative distribution functions in Fig 8(A) and  8(B). The PAC scores for SABEC and K-means are compared in C, illustrating that SABEC usually has more consistent outcomes than K-means, with the fewest ambiguously clustered cells at K = 7.
In addition, to estimate the accuracy of our method, we can assume that the expected cluster assignment (as determined by FACS) is our gold standard. By comparing the results of our clustering approach with the gold standard, we can estimate the accuracy of our method on the experimental data. To do this, we cluster each of our consensus matrices into five clusters using partitioning around medoids (PAM), a clustering approach similar to K-means (but more consistent since it uses data points as centres). We can then compare our results to the gold standard labels using metrics such as VI (See Fig 8D) and the corrected Rand index (See Fig 8E). SABEC performs better than K-means by these two metrics, with the estimated number of clusters equal to 6. The one exception is that the corrected Rand index suggests that K-means performs slightly better than SABEC when K = 5; however, SABEC provides substantially better outcomes at K = 6. Based on simulated datasets with values similar to the experimental data, we determined that these latter two methods provide more accurate estimates of the number of clusters than the PAC method, which can sometimes overestimate the appropriate number of clusters (S8 Fig).
Figs 8F and 2G compare the consensus matrices for K = 6 for SABEC (F) and K-means (G), with the cells sorted by their FACS-determined labels. SABEC results appear more consistent, with cells frequently clustering with other cells of the same type. In addition, any cells that are "misclassified" tend to cluster instead with cell populations of their parent or children populations. For instance, some HSC cells cluster with their child populations (PreM and LMPP), and some GMP and CLP cells cluster with their parent population (LMPP).
The R script for SABEC is available in Github: https://github.com/ezer/ SingleCellPipelineOverview, including a sample input file to run in parallel on a Condor cluster, and appropriately merge the outputs.

Parameters chosen for K-means and hierarchical clustering
The SABEC method was compared to hierarchical clustering and K-means approaches. The hierarchical clustering approach used was the default one associated with the heatmap function in R (Euclidean distance metric and complete clustering). The K-means approach used the default algorithm in R (Hartigan and Wong), but the maximum iterations was increased to 100 in order to be more comparable to the SABEC approach. K-means was repeated 50 times and an additional consensus clustering step was taken, in order to provide a fairer comparison to SABEC. Note that the input to both the hierarchical clustering and K-means algorithms were the normalised C t values, while the input to SABEC is the scaled mRNA counts. These algorithm and distance metrics were chosen since they are the most commonly used. Other variations of hierarchical clustering and K-means were tested, but none of the results were significantly better or different than the ones shown.

Estimation of Pairwise changes in Kinetics (EPiK)
The R script for determining which kinetic parameters vary across populations is available in Github: https://github.com/ezer/SingleCellPipelineOverview. Three methods were tested on simulated datasets of genes with randomly selected kinetic parameters. We conducted these simulation tests for two different ranges of the kinetic parameters to illustrate that there is different sensitivity to parameter changes in different regions of the parameter space (S12, S13 and S14 Figs). These results suggest that changes in K off cannot be accurately detected when K off > 5. In addition, the methods often performs better when fewer kinetic parameters change at once (S12 and S17 Figs). The correctly identified simulated cases were those that have the largest magnitude of kinetic parameter change (S18 Fig) and had average kinetic parameter values closer to the origin, where the kinetic parameters are more accurately estimated (S19 Fig).
Two of the methods (MP and Subset methods), have continuous-valued outputs, and so a threshold must be set for determining whether or not a change in a kinetic parameter is likely significant. The thresholds were set to have approximately a 2% false positive rate, based on the in silico validation tests. The threshold values for the MP method when K off < 5 are −6.3, −8.5 and −6.8 for K on , K off and K t , and −4.9 and −6.0 for K on and K t when K off > 5. For the subset method, the thresholds are 0.77, 0.91 and 0.86 (K on , K off and K t , respectively) when K off < 5, and 0.681 and 0.870 (K on and K t ) when K off > 5.
BICs-based method. There are eight possible parameter sets that could have varied across the two populations of cells for each TF (no parameters adjusted, all parameters adjusted, three cases with one parameter adjusted and three cases with two parameters adjusted). It is possible to calculate the maximum likelihood of each scenario occurring. Let us assume we have a matrix L p,m that lists the likelihoods for each population of cells p = 1, 2 and parameter set in the lookup table m = 1 to 356,000. The maximum likelihood score when no parameter is being adjusted is: Meanwhile, the maximum likelihood score when all the parameters are being adjusted is: Clearly, L all will always be greater than or equal to L none . Therefore, we must adjust these likelihoods to take into account that fact that they involve different numbers of parameters being adjusted. This is done as follows: Where i indices the set of parameters being adjusted, N i is the number of parameters presumed adjusted, n is the number of cells and L i is the likelihoods, for example, L none or L all calculated above.
Marginal probability-based method. To calculate an MP score for K on , let us make two matrices A(i, j) and B(i, j), one for each of the two populations of cells we are trying to compare.
Each contains L(K on , K off , K t |X), with i indexing the unique set of K on values (with i = 1 to I) and j indexing a unique set of (K off , K t ) pairs (with j = 1 to J).
B 0 (i) can be calculated in the same way as A 0 (i).
However, it is important to implement this strategy in such a way as to minimize the rounding errors caused by taking the exponent of negative number with large magnitude. This can be accomplished thanks to the following statement: By choosing x to be close to a or b, it is possible to avoid such errors. This strategy is employed in all the calculations above involving summing exponents.
Subsetting method. In the third method, we repeatedly subsample 25% of the cells (100 times) and estimate the kinetic parameters for each subset. For each of the three options for the kinetic parameter, we calculate the absolute distance between the cumulative density functions of the two distributions, aka the Kolmogorov-Smirnov Statistic (S14 Fig). Note that we cannot use the p-values from the Kolmogorov-Smirnov test, because we re-sample points many times, so this statistic would be biased towards lower p-values. The threshold for the K-S statistic that was deemed significant was the one with a 2% false positive rate. Changing the size of the subsamples only had a small effect on the accuracy of this strategy (S16 Fig). Supporting Information S1 Fig. Parameter estimates for simulated data. Subfigures A-C show parameter estimates for simulated parameter sets randomly drawn from 0 < K on < 5, 0 < K off < 10 and 0 < K t < 600, for K on (A), K off (B) and K t (C). Subfigures D-F show the same, but with 10 < K off < 20, also for K on (D), K off (E) and K t (F). Note that the known and estimated kinetic parameter values are all linearly correlated, except for K off when it is greater than approximately 5. The kinetic parameters of the simulated datasets were estimated using the Gibb's Sampling approach introduced by [12]. This method was designed for RNA-seq, so the gene length is a required input, but 10,000bp was included for all genes and all the mRNA counts were multiplied by this value. K on is always between 0 and 5, and K t is always between 0 and 600, and 90% of the mRNA molecules are randomly removed. Subfigures A-C are for simulations with 0 < K off < 2, D-F are for 0 < K off < 10 and G-I are for 10 < K off < 20. K on is estimated in subfigures A, D, G, K off is estimated in subfigures B, E, H and K t is estimated in subfigures C, F, I. (TIF) S4 Fig. Choosing the temperature parameter. SABEC requires a choice of a temperature parameter (we chose 10 for most of this paper), which speeds convergence of the algorithm. 100 simulation datasets were clustered with SABEC (each repeated 50 times). The y-axes of A shows the average number of iterations before convergence across these 50 repeats. Convergence is defined as the number of iterations of the algorithm until fewer than 5% of the cells swapping clusters, but it is capped at a maximum of 100 iterations. The grey shaded area represents the full range of average values across all 100 simulated datasets and the black line represents the overall average number of iterations. This subfigure illustrates that the larger the temperature, the quicker the algorithm converges. Subfigures B,C illustrate how temperature influences the accuracy of the algorithm, as per the average variable information (VI) and corrected Rand index across all 100 simulated datasets. Both these metrics illustrate that the higher the temperature, the less accurate the algorithm. The aberration when the temperature parameter has a value of 6 comes from the fact that the number of iterations is capped at 100, so in some cases the algorithm did not fully converge. Note that we choose a temperature of 10 elsewhere in this paper. Out of the 100 simulated datasets that were generated, three examples are illustrated here, with their consensus matrices shown in subfigures A-C and the clustered heatmaps of these in D-F. These include the dataset that had the worst robustness by PAC score (A,D), the best robustness (C, F) and a randomly selected third example (B,E). The colours in D-F are blue for CLP, red for GMP, green for HSC, purple for LMPP and orange for PreM. Note that the clustering is more robust than the experimental dataset. In fact, manual inspection of 100 clusterings found no example of HSC being split into two clusters, while GMP/CLP being clustered together (the scenario observed in the experimental dataset), suggesting that the subdivision of HSC into two clusters is probably not an artifact of the SABEC method. (TIF)

S7 Fig. SABEC applied to simulated datasets with different numbers of genes and cells.
First, we generated a list of 100 parameter sets that were randomly selected from a normal distribution around the experimentally determined kinetic parameter values, with a standard deviation equal to 5% of the parameter range in our look-up table (specifically, 0.25, 1 and 10, for K on , K off and K t respectively). This created a kinetic parameter distribution that was similar to the distribution estimated for the experimental data by [14]. For each simulated dataset, we randomly selected kinetic parameter sets from this list, varying the number of genes and the number of cells, but keeping the number of populations at 5. For each choice of number of genes and number of cells, we repeated this procedure with 5 different simulated datasets. In each case, the SABEC method was used to cluster the dataset (including the consensus clustering step). Subfigure A illustrates the variable information (VI) and B shows the corrected Rand index. Subfigure C shows the average number of iterations of SABEC until convergence. (TIF) S8 Fig. Estimating K for simulated datasets. We wished to determine how reliable PAC, VI and the corrected Rand index were at predicting the correct number of clusters when interpreting SABEC data. For instance, it could be that the SABEC method consistently causes there to appear to be more clusters than there actually are, which could mean that the division of HSC into two clusters is spurious. After predicting the number of clusters in the simulated datasets, with each of the 100 resulting curves drawn for PAC (A), VI (B) and the corrected Rand index (C), we see that PAC often predicts a spurious cluster, but VI and the correct Rand index do not. This is one of the reasons we selected K = 6, trusting VI and the corrected Rand index more than the PAC score. Cells that did not cluster well with other cells of their labelled population were removed in a pruning step, because these cells may not have uniform gene expression bursting kinetics compared to the other cells in their population. However, we were worried whether this would create a consistent bias in the parameter estimates. The line of best fit is designated as the red dashed line. Subfigures A-D compare the original estimates of the kinetic parameters on the complete dataset with the parameter estimates for the pruned datasets in the [14] data, for K on (A), K off (B), K t (C) and K t /K off (D). Subfigures E-H also compare the parameter estimate changes in the complete and pruned datasets, but for the 100 simulated datasets. Note that removing outliers does bias the kinetic parameter estimates, in the same direction as observed in the experimental data, but to a much smaller extent. In particular, estimated K on parameters increase and K off and K t values decrease after the pruning procedure, in both cases. Note that the bias observed in S10 Fig is seen prominently here: K on estimates are higher in mixed cell populations, while K off and K t estimates are lower, and the magnitude of the change is much more consistent with that seen in the experimental data. Nevertheless, we cannot neglect the fact that removing false negative cells can also bias the kinetic parameter estimates. (TIF) S12 Fig. BIC results for simulated datasets. 800 simulated datasets with K off < 5 (A) and 800 simulated datasets with 5 < K off < 10 (B) were generated, with 100 examples of each possible type of kinetic parameter set change (x-axis). For each of these, BIC was used to predict the kinetic parameter change, as shown by the colors in the stacked bar plot. Grey represents the case that none of the parameters change and black represents the case that all the kinetic parameters change. The primary colors (red, blue, yellow) correspond with the case of one kinetic parameter changes (K on , K off and K t , respectively). The secondary colors correspond with the cases where two kinetic parameters change, with purple as K on and K off , orange as K on and K t and green as K off and K t . Clearly, BIC enriches for the correct set of parameters, although it is a bit conservative when more than one kinetic parameter is varied. Note that when K off > 5, K off is often mislabeled as none and sometimes as K t . (TIF) S13 Fig. MP results for simulated datasets. These are the ROC curves for K off < 5 (A-C) and K off > 5 (D-F), with the area under the curve (AUC) listed for each figure. Based on the high values of AUC, this method is predictive for K on and K t , but less so for K off . Note that predictive power for K off completely disappears at higher values, although it does not seem to substantially increase K t false positive rates. (TIF)  Fig 4(Bi). The colors designate which kinetic parameter was adjusted, yellow for K t , red for K on , blue for K off , orange for K t and K on , green for K t and K off , purple for K on and K off , grey for no changes and black for all parameters. Subfigures A-B are for BIC, C-D for marginal probability (MP) and E-F for the subset method (Sub). Each of the methods was applied to the dataset where potential outliers were pruned (A, C and E) and the complete datasets before the pruning step (B, D and F).