Identifying cell states in single-cell RNA-seq data at statistically maximal resolution

Single-cell RNA sequencing (scRNA-seq) has become a popular experimental method to study variation of gene expression within a population of cells. However, obtaining an accurate picture of the diversity of distinct gene expression states that are present in a given dataset is highly challenging because of the sparsity of the scRNA-seq data and its inhomogeneous measurement noise properties. Although a vast number of different methods is applied in the literature for clustering cells into subsets with ‘similar’ expression profiles, these methods generally lack rigorously specified objectives, involve multiple complex layers of normalization, filtering, feature selection, dimensionality-reduction, employ ad hoc measures of distance or similarity between cells, often ignore the known measurement noise properties of scRNA-seq measurements, and include a large number of tunable parameters. Consequently, it is virtually impossible to assign concrete biophysical meaning to the clusterings that result from these methods. Here we address the following problem: Given raw unique molecule identifier (UMI) counts of an scRNA-seq dataset, partition the cells into subsets such that the gene expression states of the cells in each subset are statistically indistinguishable, and each subset corresponds to a distinct gene expression state. That is, we aim to partition cells so as to maximally reduce the complexity of the dataset without removing any of its meaningful structure. We show that, given the known measurement noise structure of scRNA-seq data, this problem is mathematically well-defined and derive its unique solution from first principles. We have implemented this solution in a tool called Cellstates which operates directly on the raw data and automatically determines the optimal partition and cluster number, with zero tunable parameters. We show that, on synthetic datasets, Cellstates almost perfectly recovers optimal partitions. On real data, Cellstates robustly identifies subtle substructure within groups of cells that are traditionally annotated as a common cell type. Moreover, we show that the diversity of gene expression states that Cellstates identifies systematically depends on the tissue of origin and not on technical features of the experiments such as the total number of cells and total UMI count per cell. In addition to the Cellstates tool we also provide a small toolbox of software to place the identified cellstates into a hierarchical tree of higher-order clusters, to identify the most important differentially expressed genes at each branch of this hierarchy, and to visualize these results.


A Detailed Derivations A1 Likelihood of partitions
We consider an scRNA-seq dataset D characterized by a matrix of UMI counts n gc corresponding to the number of UMIs for gene g in cell c.We denote by ρ partitions of the cells into non-overlapping subsets and want to determine a likelihood function P (D|ρ) under the assumption that, for each subset s ∈ ρ of the partition, all cells c ∈ s have the same gene expression state.To explain how this likelihood function is calculated, we first discuss how the mRNA counts m g of a cell, and observed UMI counts n g in an scRNA-seq experiment depend on the gene expression processes in the recent history of the cell, as previously introduced in [1].
Given the inherent thermodynamic fluctuations affecting the molecules inside the cell, and the Brownian motion that they are subject to, even a comprehensive description of the current 'state' of the cell in terms of the number of molecules of each type in each cellular compartment only determines the rates with which different molecular reactions occur.For the mRNA levels of a given gene g, the relevant rates are the transcription rate λ g and the mRNA decay-rate µ g .It is well established that for a gene g with constant transcription rate λ g and constant mRNA decay-rate µ g , the number of mRNA molecules in the cell m g follows a Poisson distribution with mean ⟨m g ⟩ = a g = λ g /µ g , e.g.[2].More generally, when µ g and λ g are arbitrary time-dependent functions µ g (t), λ g (t), with λ g (t) denoting the transcription rate a time t in the past of the cell, and µ g (t) the decay rate of mRNAs for gene g a time t in the past, the probability distribution for the current number of mRNAs m g in the cell is still a Poisson distribution with mean [1]: which we call the 'transcription activity' of gene g.Note that time is measured backwards from the present (t = 0) to the distant past (t = ∞) in the history of the cell.Thus, a single parameter a g for each gene g is sufficient to fully characterize the distribution of mRNA numbers in a cell at any given time point.
The remaining uncertainty about the actual numbers is due to random thermodynamic fluctuations in events such as RNA polymerase binding or mRNA degradation.To conclude, given the expression state of the cell as defined by the vector of transcription activities ⃗ a, the probability of a count vector of cellular mRNAs across all genes ⃗ m is therefore a product of Poisson distributions: We will assume that the measured UMI counts n g correspond to a random sample of the cell's total mRNA pool m g with some unknown capture rate p per mRNA.As will be discussed below, our model remains valid if the capture rate p varies between cells or has gene-dependent biases.Given these assumptions, the likelihood for the observed UMI count vector ⃗ n is still a Poisson distribution, albeit with a different mean: Following [1], we now define the transcription quotients α g = a g /A, with A = g a g the total transcription activity of the cell.Note that α g corresponds to the expected fraction of transcripts from gene g among all transcripts in the cell.For a cell with a total UMI count N = g n g , we have Conditioned on the total count N , the distribution of all measured counts ⃗ n is a multinomial in the transcription quotients: This is the form of the likelihood of the UMI counts of a single cell as a function of the transcription quotient vector ⃗ α.In our model, we use the transcription quotient vector ⃗ α to represent the 'expression state' of a cell and the key ingredient of our model is that, given a partition ρ, all cells within each subset s of the partition have the same transcription quotient vector ⃗ α.The likelihood for the counts D s of a subset of cells s that have equal transcription quotients ⃗ α, is given by where ⃗ n s = c∈s ⃗ n c is the vector of total UMI counts among all cells in the subset s.To calculate the likelihood P (D|ρ) of a partition, we will marginalize over the unknown transcription quotient vector ⃗ α for each of the subsets s in ρ and to do this we have to define a prior distribution over possible transcription quotient vectors ⃗ α.We will use a Dirichlet prior, which corresponds to a maximal ignorance prior in the sense that it is the unique prior that is invariant under arbitrary rescaling of the transcription quotients α g → λ g α g .Moreover, it is the conjugate prior to the multinomial distribution, allowing us to analytically marginalize over the ⃗ α.In particular, for each subset s we characterize our prior information regarding its transcription quotient vector ⃗ α by the same Dirichlet prior: where the θ g are the parameters of the Dirichlet prior and Θ = g θ g .We can now marginalize over ⃗ α and obtain with N s = g n gs is the total number of UMI summed over all cells in subset s.The likelihood of a partition ρ is now obtained by simply taking the product of this expression over all subsets s ∈ ρ: Note that this expression is very similar to likelihood functions derived previously for clustering DNA sequences [3].Essentially the only change is that the 4-letter DNA alphabet is here replaced by the 'alphabet' of G genes.Using a uniform prior on the partitions ρ and the parameters ⃗ θ of the Dirichlet priors, the posterior distribution P (ρ, ⃗ θ|D) is now simply proportional to the likelihood P (D|ρ, ⃗ θ).Ideally we would further marginalize over the parameters ⃗ θ to obtain a posterior distribution over partitions only, but this is analytically intractable.We thus aim to find the combination (ρ, ⃗ θ) that jointly maximizes both the posterior P (ρ, ⃗ θ|D) and likelihood P (D|ρ, ⃗ θ), i.e. optimizing over both the partition ρ and the parameters θ g for each gene individually.Without loss of generality, we can rewrite the parameters of the prior θ g as the product of an overall scale vector Θ and a normalized vector ⃗ ϕ with g ϕ g = 1.Second, we note that for the trivial partition in which all cells are put into a single cluster, the optimal ϕ g are given by i.e. the prior parameter ϕ g simply equals the fraction of UMIs for gene g in the entire dataset.In order to keep the optimization numerically tractable, we will approximate the optimization of the prior's parameters by fixing ⃗ ϕ to this vector, setting ⃗ θ = Θ ⃗ ϕ, and only optimize the scale factor Θ ∈ R + , while leaving the ϕ g fixed for a given dataset.Setting the prior in this way ensures that, for each subset s, the expected direction of the transcription quotient vector ⃗ α matches the overall UMI counts in the entire dataset, while optimizing Θ allows to tuning of the expected amount of variability around this 'average' vector of transcription quotients.
With this chosen form of the prior, we finally get an expression for the likelihood of the whole dataset D that only depends on the scale factor Θ and the partition of cells into subsets ρ: Finally, we return to discuss our simplifying assumption that the mRNA capture rate p is constant across genes and cells and show that this assumption can be significantly relaxed without affecting the results.In particular, we will assume that the probability p gc of capturing (and successfully amplifying and sequencing) an mRNA for gene g in cell c can be written as where p c is a cell-specific overall capture rate and q g describes gene-dependent biases that may be specific to the particular experiment, but are assumed constant across the cells in the experiment.With this capture efficiency, the expected UMI count for gene g in cell c is ⟨n gc ⟩ = p c q g a g .Thus, the expected fraction of counts from gene g becomes where the last equality defines αg .From here, we can proceed the derivation exactly as before with αg replacing α g .As we marginalize out this variable in Equation (10), the final result is invariant.Note that, given that we separately marginalize over the ⃗ α g of each subset, the result is even invariant when different subsets s have different gene-bias vectors ⃗ q, as long as all cells within a subset have the same bias.This suggests that our likelihood over partitions is not only insensitive to fluctuations in overall capture efficiency across cells, but will also be quite robust to fluctuations in gene-dependent capture efficiency as long as cells with equal expression states have equal biases.

A2 Posterior of transcription quotients
Although for calculating likelihoods over partitions ρ, we marginalize over the GES ⃗ α s for each subset s, we can of course also obtain posterior distributions over these GES for each cluster.Given a subset of cells s, the posterior distribution for its GES ⃗ α s can be obtained from equations (6), ( 7) and (10) to find: From this expression, we can derive expressions for mode, mean and variance of ⃗ α s : Often we are interested in the log-transcription quotients δ gs = log(α gs ) rather than the transcription quotients themselves.For these we find: where ψ is the digamma function (the derivative of the logarithm of the Gamma function) and ψ 1 is the first derivative of the digamma function.

A3 Cellstate similarities and hierarchical clustering
We define the similarity between two subsets of cells S a and S b as the ratio of the likelihoods of the partitions with both subsets merged and each subset separate: Note that this similarity metric does not behave like a usual distance metric in that similar clusters will have a large σ and dissimilar clusters a small σ close to 0. To build a hierarchical tree of the cellstates, we start by setting the leaf clusters of the tree to the cellstates of the optimal partition and then calculate all pairwise similarities σ ab between all pairs of leaf clusters.We then iteratively merge the pair of clusters with the highest similarity, and recalculate the pairwise similarities with the newly formed cluster until all clusters have merged into a single cluster.For plotting, we save the resulting tree in the Newick format with distances set to positive log-similarities d ab = max[− log(σ ab ), 0].

A4 Differentially expressed genes between pairs of cellstate clusters
To describe the differences between the GES of different clusters, and to help give biological interpretation, it is useful to quantify which genes are most significantly differentially expressed between the clusters.In our framework, we can quite naturally define the extent of differential expression of genes by decomposing Equation (25) into contributions of individual genes, i.e. σ ab = g σ ab,g .A low value of σ ab,g for a gene g indicates that the UMI counts are much more different between the two clusters a and b than would be expected from noise.A high value σ ab,g , in contrast, indicates that counts are within the expected noise levels.
To obtain such a decomposition, we start by decomposing the cluster likelihood of Equation (10) into contributions from individual genes P (⃗ n|N, Θ) = g P (n g |N, Θ).In the multinomial model, the likelihood is conditioned on the total number of captured mRNA in the cell N , so that, formally, the counts n gc are correlated for all pairs of genes.However, since this correlation is generally quite weak, we can make the assumption that the expression noise is independent between genes.Thus, We can find P (⃗ n|α g ) by marginalizing over the other variables: where N = g n g .We can further marginalize over {n g ′ ̸ =g }, requiring that N is constant, and normalize to obtain the binomial distribution: Next, we find P (α g |Θ): where Finally, we have all ingredients of Equation ( 27) above.Unlike in Equation (10) where we perform the integral subject to the constraint g α g = 1, we integrate over all α g separately.With the high dimensionality of ⃗ α and assuming the likelihood function has a sharp peak, the error will be small.
This likelihood function clearly has a separate contribution P (n g |N, Θ) for each gene: Comparing to Equation (10), we see that this is equivalent to taking the ratio of the cluster likelihood with gene g and without g.
Finally, we can use this expression for P (n g |N, Θ) in Equation ( 25) and define a gene-specific score for differential expression where the subscripts a, b refer to two subsets of cells.Genes with σ ab,g < 1 have counts that poorly fit a model with a single transcription quotient for both clusters, compared to a model where the two clusters have distinct transcription quotients.In contrast, genes with a score σ ab,g > 1 favour a model with a single transcription quotient and are not differentially expressed between the clusters.

A5 Marker genes for distinguishing between pairs of cellstate clusters
In the previous section we derived a score σ ab,g that quantifies how significant the difference is in expression of gene g between the set of cellstates a and the set of cellstates b.Importantly, this significance measure depends on the average expression of gene g in the sets a and b and is independent of how the expression of gene g varies across cells in the sets a and b.Consequently, a gene g can be highly significant even if the distributions of its expression level across the cells of sets a and b are highly overlapping.
Thus, besides a method for finding genes whose average expression differs most significantly between the cells of a and b, it is also useful to have a method for identifying genes whose expression can reliably distinguish cells from set a from cells of set b.That is, we want to identify genes whose distributions of expression levels across sets a and b exhibit least overlap.
Given two sets of objects a and b, and a real-valued quantity x that is given for each object, we want to define a measure that quantifies how well the variable x can classify objects from set a from those of set b.A reasonable measure for this classification ability is the area under the ROC curve which is obtained by varying a cut-off q and plotting the fraction of objects in a that have x < q on the vertical axis (often called true positive rate), and the fraction of objects in b that have x < q (false positive rate) on the horizontal axis.Notably, this area under the ROC is equal to the probability that if we pick one random object o a from set a and one random object o b from set b, that the x value of o a is less than the x value of o b .A variable x is a good classifier when this probability is either near 1 (i.e. the x values of the a objects are systematically below the x values of the b objects) or near 0 (when the x values of the a objects are systematically larger than those of the objects in b).
We will use this measure to quantify to what extent the expression of a given gene g separates cells in set a from cells in set b.The complication is that we do not know the precise expression levels of each of the cells.Instead, for a cell from cellstate c, we only have a posterior distribution P (α g |c) for the transcription quotient of gene g.Thus, if we define |c| as the number of cells in cell state c, and P (α g |c) as the posterior distribution of the transcription quotient of gene g given the UMI counts of the cells in cell-state c, then the probability we want to calculate is the following: αg 0 with |a| and |b| the total number of cells in the sets of cellstates a and b, respectively.Thus, for each pair of cell-states (c 1 , c 2 ) from sets a and b, we calculate the probability that the transcription quotient β g in cell-state c 1 is less than the transcription quotient α g in cell-state c 2 .Then we take a weighted average over these pair scores, weighted with the number of cells in each cell state.This then gives the probability that if we sample a random cell from one of the cell states in a and one random cell from one of the cell states in b, that the transcription quotient α g of gene g is lower in the first cell than in the second cell.Whenever this probability P g (a < b) is either close to zero or close to 1, then the gene g is a good marker gene for distinguishing the sets a and b.
For a given gene, let n g be its UMI count in the cellstate c, N the total UMI count of cellstate c, θ g the parameter of the global Dirichlet prior for gene g, and Θ = g θ g .The posterior probability for the transcription quotient α g in cellstate c is then given by the beta-distribution For a given pair of cell states (c 1 , c 2 ), the integrals that we have to calculate are of the form Although, as far as we can tell, these integrals are not analytically tractable, we can make progress by noting that, generally, the transcription quotient of a given gene g is much less than 1 in both cellstates , i.e. α g1 ≪ 1 and α g2 ≪ 1.Under this assumption, the above Beta-distributions are well approximated by Gamma-distributions: If we make this approximation then the integrals can be performed analytically and the final result is a cumulative distribution of a beta-distribution: where n g1 and n g2 are the UMI counts of gene g in cellstates c 1 and c 2 , and N 1 and N 2 are the total UMI counts in cellstates c 1 and c 2 .Equation ( 43) is just the cumulative beta-distribution with parameters the UMI counts for the gene in the two cell states, i.e. n g1 + θ g and n g2 + θ g , evaluated at a point that is the fraction of the total UMI counts that are in the first cell state, i.e. (N 1 + Θ)/(N 1 + N 2 + 2Θ).
To calculate a final marker score S g for a gene to distinguish between the set of cell states a and the set of cell states b we calculate the probability P g (a < b) as: The function I α,β (x) is the cumulative probability of the beta-distribution with parameters α and β evaluated at x.The gene g is a good marker when P g (a < b) is either close to zero or close to one.So to get a final target score S g we use the absolute value of the log-ratio between P g (a < b) and 1 − P g (a < b) We have implemented this marker score in a C program that is distributed with Cellstates and have provided example notebooks of its use on the associated GitHub page.

B1 MCMC Algorithm for maximizing the likelihood
Our aim is to identify the combination of the prior's scale factor Θ and partition ρ that jointly maximize the likelihood P (D|ρ, Θ) given in Equation ( 14).Notably, since the likelihood function is a very complex function of both the partition ρ and the prior's scale factor Θ, this optimization requires a fairly expensive search.The general structure of our optimization algorithm is to first start with an initial guess for Θ and to then iteratively 1. Search for the partition ρ * that maximizes the likelihood with the current value of Θ, 2. Given this optimal partition ρ * , find the value of Θ that maximizes the likelihood P (D|ρ * , Θ), until Θ and ρ * no longer change.
First, with respect to the optimization of Θ, we have found empirically that both the optimal partition ρ * and the likelihood P (D|ρ, Θ) depend only weakly on Θ, so that the optimal partition ρ * does not depend sensitively on Θ, and it suffices to only roughly estimate the optimal value of Θ.Since ρ has to be re-optimized for every change in Θ, we thus limit ourselves to only consider values of Θ = 2 q , q ∈ N.That is, we only estimate the optimal Θ to within a factor 2. Furthermore, for the initial value of Θ we take q = ⌊log 2 (⟨N U M I ⟩) + 0.5⌋ where ⟨N U M I ⟩ is the average number of total UMI counts per cell.That is, our initial guess for Θ corresponds to the average total UMI count per cell.This means that strength of the influence of the prior is about equal to the influence of the data from a single-cell.
During the development of Cellstates, we have experimented with a number of different schemes for searching the optimal partition ρ.This included greedy algorithms that iteratively fuse pairs of cellstates so as to maximize the increase in the likelihood P (D|ρ, Θ) and Gibbs sampling schemes in which at each time step a random cell is chosen and then moved into a cluster with probability proportional to the likelihood of the resulting partition.However, we found that the best trade-off between running time and optimality of the final partition is obtained using a simple Markov chain Monte Carlo (MCMC) algorithm that we developed previously for sampling partitions of short DNA sequences [3].
In particular, given a value of Θ we optimize the partition by starting from the partition in which each cell forms its own cluster.Starting from this partition, we then sample the space of partitions using a Metropolis-Hastings scheme.To explain this sampling scheme, we conceptualize partitions as distributions of the C cells over C 'boxes' such that all cells in the same box form a cluster and different boxes correspond to different clusters.We thus initially assign each of the C cells to one of C boxes and the space of partitions is searched by moving cells between the C boxes.Specifically we iterate the following steps: 1.A cell is chosen uniformly at random and taken out of its current box.
2. One of the C − 1 other boxes is chosen uniformly at random and we consider the partition ρ ′ that is created by moving the cell into this box.
3. We calculate the likelihood ratio of the new to old partitions: P move = P (D|ρ ′ , Θ)/P (D|ρ, Θ). 5.If P move * P bias ≥ 1, the move is accepted.Otherwise, we draw a random number p in the range p ∈ [0, 1) and accept the move if p < P move * P bias .
Note that the additional factor P bias (which appears only when the move reduces the number of clusters) is needed to ensure detailed-balance of the chain as explained in detail in the supplementary material of [3].
These steps are iterated until the likelihood has stopped increasing for a sufficient number of steps.In the current implementation of the algorithm, the stopping criterion is controlled by two parameters: a 'step number' S and 'try number' T .Each round a total number of S × T moves are attempted.This is repeated if at least S of the S × T trials led to an accepted move, i.e. the partition was changed at least S times.If less than S moves were made in the S × T trials, the value of S is reduced by 10 and the value of T is multiplied by 10, and new rounds of trials are started.This is continued until S falls below 10, after which the MCMC moves are stopped.Note that the algorithm keeps track of the partition with the highest likelihood it has seen at any point during the sampling, and sets the partition to this highest likelihood partition at the end of the rounds of MCMC moves.By default we set S = C, i.e. equal to the number of cells C, and T = 1000, but these values can be changed by the user.
After the MCMC sampling has completed, a final uphill walk is performed as follows.For each pair of clusters existing in the partition, we calculate the likelihood change that would occur if the clusters were merged into one.We then iteratively merge clusters until no more mergers are left that would increase the likelihood.Finally, for each cell we calculate the likelihood change that would occur if the cell were moved into any of the currently existing clusters, and move the cell to the cluster that maximizes the likelihood.
It is important to note that, although the MCMC sampling scheme is designed to sample partitions in proportion to their likelihood, in practice we find that P move * P bias is almost always either very large, or very small.Therefore, in practice this sampling scheme effectively performs a random uphill walk.

B2 Simulated Datasets
The simulated datasets were created based on the inferred cellstates of real datasets.Of the 36 datasets which were analyzed for this paper, we selected those 18 that have less than 6000 cells, more than 3 cellstates with more than 10 cells in them, and a median number of UMIs per cell greater than 1000.Our aim was to simulate datasets based on the set of transcription quotients inferred to be present in the real datasets.Additionally, we wanted to make sure that the clusters can only be identified by differences in transcription quotients (i.e.relative gene expression levels) and not by differences in total UMI counts.The total number of UMI for each cell c, N c , was therefore drawn independently from a log-normal distribution that was fitted to the experimental distribution of N c in the corresponding dataset.
For each cellstate s, we determined the mean expression transcription quotient vector ⟨⃗ α s ⟩ and then sampled the UMI count vectors ⃗ n c of each cell c in the cellstate s from a multinomial distribution with mean ⟨⃗ α s ⟩.However, we found that the maximal likelihood partition of the simulated dataset often differed from the partition generating the dataset, especially for very small and singlet clusters.In particular, when cells that were in a singlet state were given a lower total UMI count in the simulation, these cells were often no longer statistically significantly different from other states, and this to a lesser extent also affected small clusters.To mitigate this problem, we retained only cellstates with more than 10 cells.The number of cells in each cellstate was kept the same as in the experimental data.With this simulation procedure, we made sure that most simulated cellstates would be statistically distinct, even when total UMI counts were lower than in the original dataset.
In this way, three separate simulated datasets were generated for each experimental dataset.Lastly, an additional three "down-sampled" simulations were carried out with ⟨log 10 (N c )⟩ = 3 and var(log 10 (N c )) = 0.1 fixed for all datasets.

B3 Further Discussion of the results on simulated data
In Figure B we show, for each of the three simulations (marked by their color) generated per experimental dataset (different columns), the detailed outcomes of three independent Cellstates runs per simulation.The top panel shows the difference in log-likelihood ∆L between the inferred partition and the partition used to generate the simulated data.Negative scores mean that the inferred partition had a lower likelihood than the simulated one and can be attributed to a failure of the algorithm to find the optimal partition.As can be seen, such errors are rare and there is always at least one out of the three runs that has ∆L ≥ 0, i.e. a partition at least as good as the one used to generate the dataset was always found.Positive scores indicate that a partition was found with a higher likelihood than the one used to generate the dataset.This can happen for example if there are not enough counts in the simulated cells to statistically distinguish cellstates.To test this hypothesis, we looked at the "down-sampled" simulations with fewer UMI per cell which would make cells with similar, but distinct transcription quotients indistinguishable.Indeed, the results shown in Figure C confirm this hypothesis: For most down-sampled simulations, the best-scoring partition is different from the ground-truth.Also, they tend to score low on homogeneity but high on completeness -which means that inferred clusters are unions of clusters used to generate the simulated data.

C Benchmarking of other clustering tools
For each dataset we started from the raw UMI count table.Cellstates was run directly on the UMI count table and obtained the optimal partition ρ * .In addition, we hierarchically merged cellstate clusters until the number of clusters matched the number of clusters in the reference annotation for the corresponding dataset.
For all other clustering tools we first normalized and log-transfered the data as follows.Let n gc be the raw UMI count for gene g in cell c, N c = g N gc the total UMI count of cell c, and N med the median of N c across the cells in the dataset.We then obtain the normalized expression levels Each of the clustering tools was then run on the matrix of expression levels x gc using all default parameters.However, for most algorithms the user needs to provide a parameter to control the number of clusters.We set this parameter for each method so as to set the number of clusters to the number of clusters in the annotation of the corresponding dataset.Finally, since SuperCell is designed to produce fine-grained clusterings, we obtained two separate partitions for this method for each dataset: one with the number of clusters matching the number of clusters that Cellstates produced on the dataset and one where the number of clusters matched the number of clusters in the reference annotation.
The performance was assessed by calculating homogeneity and completeness for the partitions of each method on each dataset.

D Substructure within cell types found by Cellstates does not correspond to batch effects
It is conceivable that the detailed structure in gene expression states found by Cellstates could reflect batch effects to some extent, e.g. cells from different batches could consistently separate into different cellstates.To investigate to what extent cells from different batches separate into different cellstates we used two datasets.First, the dataset of Zeisel et al. [4] contains cells from batches of animals with different age and sex and we investigated the composition of each cellstate in terms of age (younger or older than 24 months) or sex (male, female, or not determined).We find that cellstates that contain multiple cells generally correspond to mixtures cells of different age and sex.In particular, Fig. K shows the sex and age composition of all interneuron cellstates.To confirm that this observation extends to other datasets we used a dataset consisting of eight samples from the ventricular-subventricular zone (V-SVZ) of the mouse brain that were single-cell sequenced using DropSeq [5].Of the eight samples four were from females and four from males.Furthermore, two replicate samples where from the lateral wall (LW) and two from the septal wall (SW) of the V-SVZ from each sex.We ran Cellstates with default parameters on the raw matrix of UMI counts across genes and cells and constructed the hierarchical tree of cellstates.Figure L shows the resulting hierarchy of cellstates together with the composition of the clusters in terms of the 8 batches (pie charts in Fig. L).Note that the pie charts are normalized so as to reflect the differing total number of cells from each batch.
One can observe that, also for this dataset, most cellstates are relatively balanced in terms of the batches from which the cells derive.For example pericytes and venous endothelial cells show a roughly uniform distribution over all samples.In contrast neuroblasts as well as microglia show a more biased distribution.In the case of neuroblasts we can see an enrichment in the LW samples, which is expected since the LW is more neurogenic than the SW.One can furthermore observe a few highly biased clusters, sometimes being almost entirely comprised from one sample.However, we did not manage to unambiguously assign a known cell type to these clusters.
In summary, these analysis support that cellstates do not reflect batch effects and that cells from different batches commonly mix within single cellstates.For each of the 18 real datasets, three simulations were generated (red, blue, and green) and Cellstates was run three times on each simulated dataset.In the top panel, the difference in log-likelihood delta LL between the inferred and simulated partitions is shown (with a positive difference meaning that a partition was found with higher log-likelihood than the one used to simulate the data).The corresponding homogeneity and completeness of the inferred compared to the simulated partitions are given in the two lower panels.Detailed results from Cellstates runs on down-sampled simulated data based on the set of GESs from various indicated datasets, but with a median of only 1000 UMI per cell.For each of the 18 GES-sets, three simulations were generated (red, blue, and green) and Cellstates was run three times on each simulation.In the top panel, the difference in log-likelihood delta LL between the inferred partition and the partition used to generate the simulated data is shown (positive meaning that a partition was found with higher log-likelihood than the one used to generate the data).The corresponding homogeneity and completeness of the inferred partitions compared to the partitions used to generate the data are shown in the two lower two panels.[4] (see legend), each pie chart corresponds to one higher-order cluster, and the area in each pie chart is proportional to the number of cells with the corresponding annotation in that cluster.Note that cells annotated by [4] as microglia and endothelial-mural cells are merged into one cluster at this level of the hierarchy.The tree structure indicates how these clusters are related upon further merging.

D
Substructure within cell types found by Cellstates does not correspond to batch effects 11 E Supplementary Figures 13

4 .
If the new partition ρ ′ has N clus clusters and ρ had N clus + 1 clusters, we set P bias = (C − N clus ), otherwise P bias = 1.

Fig A .
Fig A. Runtimes of Cellstates (vertical axis) as a function of the number of cells in the dataset (horizontal axis) for all datasets analyzed in this study.Note that both axes are shown on a logarithmic scale.
Fig C.Detailed results from Cellstates runs on down-sampled simulated data based on the set of GESs from various indicated datasets, but with a median of only 1000 UMI per cell.For each of the 18 GES-sets, three simulations were generated (red, blue, and green) and Cellstates was run three times on each simulation.In the top panel, the difference in log-likelihood delta LL between the inferred partition and the partition used to generate the simulated data is shown (positive meaning that a partition was found with higher log-likelihood than the one used to generate the data).The corresponding homogeneity and completeness of the inferred partitions compared to the partitions used to generate the data are shown in the two lower two panels.
comparison of partitions to best out of 5 cellstates runs

Fig D .
Fig D. Reproducibility of independent Cellstates runs.Cellstates was run 5 times on each of 34 scRNA-seq datasets and the highest-likelihood partition that was found was compared with those of the 4 other runs.The resulting homogeneity and completeness scores are shown twice.On the left as a scatter-plot with markers colored by dataset, illustrating which outliers belong to the same dataset.On the right, a 2D-histogram shows the fraction of runs in bins of size 0.025 × 0.025, showing that most fall within the top right corner with homogeneity and completeness > 0.95.
Fig E.Tissue of origin is predictive for cellstate diversity.For 29 different datasets from 9 tissues the median cellstate abundance f cellstate (top panel) and the entropy of the distribution of cellstate abundances is shown stratified by tissue (horizontal axis).Although there is large variability across datasets, datasets from the same tissue tend to have similar cellstate diversity.Tissues are sorted roughly from most to least diverse from left to right.

FigF.
Fig F.Cellstate diversity does not depend on technical features of the experiment.The mean of the distribution of cellstate abundances f cellstate (top row panels), the fraction of singlet cellstates (second row panels), the median (third row panels) and the entropy (bottom row panels) of f cellstate are shown as a function of the median total number of UMIs per cell (N U M I ) (left column panels) and the total number of cells in each dataset (right column panels).Marker colors indicate the tissue from which the samples originate.These results show that there is no correlation between the various diversity measures and either sequencing depth (total UMI count) or the number of cells sequenced, with the exception of a weak negative correlation between the median cellstate abundance Median[f cellstates ] and the number of cells n cells .This correlation is explained by the fact that the majority of cellstates are singlets in many experiments.

Fig L .
Fig L.Hierarchical tree of cellstates for the data of[5] consisting of cells of 8 samples from the ventricular-subventricular zone (V-SVZ).The pie chart of each cellstate indicates the composition of its cells from the 8 different batches, which derived from individuals of different sex and corresponding to different locations in the V-SVZ (see legend).The figure shows that most cellstates correspond to relatively well balanced mixtures of cells from the different samples.Note that the pie charts are normalized to reflect the total number of cells from different batches in the data.

Table A .
Summary of most important variables used in this section.a g Transcription activity for gene g m g Number of mRNAs of gene g in a cell n g Number of UMIs measured for gene g in a cell α g Transcription quotient for gene g θ g Parameters of the Dirichlet prior, θ g = Θϕ g Θ Scale of Dirichlet prior parameters ϕ g Unit vector of Dirichlet prior parameter D Dataset, full table of UMI counts n g ρ Partition of cells into a set of subsets s [4]15 higher-order cellstate clusters cells annotated in[4]as microglia and endothelial-mural separate.Colors indicate different annotations from[4](see legend) and the area in each pie chart is proportional to the number of cells in the corresponding cluster with the corresponding annotation.Note that at this level of the hierarchy cells with a common annotation in[4]tend to separate into multiple higher-order cellstate clusters.Cells in cellstates of the Zeisel dataset[4]derive from batches with different age and sex.Shown is the sample composition in terms of age of the donor (i.e.either older or younger than 24 months of age, top panel) or its sex (bottom panel) for all the interneuron cellstates.The sizes of the discs at the leaves correspond to the numbers of cells in the corresponding cellstates and the colors indicate the fractions of cells of coming from old/young or male/female donors.Note that almost all cellstates that contain more than one cell consist of mixtures from different batches.
sex Fig K.