Quantifying pluripotency landscape of cell differentiation from scRNA-seq data by continuous birth-death process

Modeling cell differentiation from omics data is an essential problem in systems biology research. Although many algorithms have been established to analyze scRNA-seq data, approaches to infer the pseudo-time of cells or quantify their potency have not yet been satisfactorily solved. Here, we propose the Landscape of Differentiation Dynamics (LDD) method, which calculates cell potentials and constructs their differentiation landscape by a continuous birth-death process from scRNA-seq data. From the viewpoint of stochastic dynamics, we exploited the features of the differentiation process and quantified the differentiation landscape based on the source-sink diffusion process. In comparison with other scRNA-seq methods in seven benchmark datasets, we found that LDD could accurately and efficiently build the evolution tree of cells with pseudo-time, in particular quantifying their differentiation landscape in terms of potency. This study provides not only a computational tool to quantify cell potency or the Waddington potential landscape based on scRNA-seq data, but also novel insights to understand the cell differentiation process from a dynamic perspective.


Input:
A preprocessed dataset matrix X ∈ R n×m , where n is the number of samples and m is the number of genes; Output: The number of clusters, K; The cell type (cluster indices) for each sample, C; The weight matrix between clusters,P ; The potential value for each cell type,V ; Steps: 1: Construct a k-nearest-neighbor (kNN) graph with the data matrix X; 2: On the kNN graph, compute the diffusion distance matrix between samples, and construct a Markov chain between nodes with a transition matrix named P ; 3: Apply a suitable clustering method to cluster samples into K groups, and note the cluster indices of each sample as C; 4: Generate the coarse-grained matrixP , which represents the transfer between clusters, and the weight matrixP , which denotes the transition path between clusters; 5: Compute the net-flowR for each cluster; 6: Compute the pseudo-time potentialV of each cluster byP andR; 7: return K, C,V ,P .

Notes:
If we have suitable clustering of samples in advance, the number of clusters K and the cell type C can be set as inputs, and Step 3 can be omitted. A detailed flowchart can be found in Fig. I.
The codes of LDD and the pre-processed datasets used in the paper can be downloaded from https://github. com/smsxiaomayi/LDD/blob/master/LDDcode.zip.

S2. SUPPLEMENTARY NOTES
Note S1. Details of simulated models The drift-diffusion process evolves as where x = (x 1 , x 2 , . . . , x 50 ) T is the position of a particle in 50 dimensions, F is a potential function, D is the noise amplitude, and ξ is a normal random vector standing for noise. We choose is set as the source region, where particles are added uniformly, and the exit region is set as {x|x 2 ≤ −2}, where particles disappear once crossing x 2 = −2. A reflection boundary is set at {x|x 2 = 1}. If n particles exit at time t, n new particles are added in the source region to maintain a steady system. In total, 400 particles were simulated from t = 0 to t = 10 5 . In preprocessing, principal component analysis (PCA) was used to reduce the dimension into two. Subsequently, we scaled the data to make the maximum absolute value of its elements to 1 (So are the following other examples). For this dataset, the second principal component was taken as the true differentiation time.
For the two-gene regulatory network shown in Fig. 2C, where two genes inhibit each other, we simulated the data with the following rules: 1. x i decreases with the rate 0.003x i , i = 1, 2.
2. x i increases with the rate max{ (0.01xi) 4

5.
The system is simulated by the stochastic differentiation equation with white noise. Noise amplitude is 1.3, and the time interval is from t = 0 to T = 5 × 10 4 with the step size ∆t = 1.
6. 1000 cells are simulated. Gene expression in the new cell is sampled from a normal distribution with mean x = (500, 500, 40, 40, 40, 40) and standard derivation 10 (samples with negative element will be resampled). If n samples cross {x|x i = 0, i = 1, 2, . . . , 6}, they will be removed, and n new samples are added from the source.
There are four branches. An increasing in four cases -(x 1 , x 3 ) or (x 1 , x 4 ) or (x 2 , x 5 ) or (x 2 , x 6 ), will occur, and in each case, the other four genes are few. For the six-gene example in this paper, we use − min{x 1 , x 2 } as the measure of the true time direction.

Note S2. Description of real datasets
Guo's dataset was derived from [3]. In total, 46 selected genes and 438 samples from mouse stem cells, consisting of 7 stages named oocyte, 2-cell, 4-cell, 8-cell, morula, E3.5 blastocyst, and E4.25 blastocyst were included. Further, three types, including trophectoderm (TE), primitive endoderm (PE), and epiblast (EPI) were formed during the differentiation process. The PE and EPI made up the inner cell mass (ICM) and were almost distinguishable. In our example, we observed two branches: TE and ICM. We conducted PCA to reduce to three dimensions, and four groups clustered by k-means. Figures B(a-b) show two groups on the main branch and other two groups belonging to two sublineages.
Nef's dataset was reported in [8], which used XY mouse gonads to study mouse sex determination. It is accessible from GEO: GSE97519. In total, 400 cells passed the quality control testing in [8]. After log-normalization, 8210 important genes were chosen, which had standard derivations larger than 1. It consisted of five stages named E10.5, E11.5, E12.5, E13.5, and E16.5. The interstitial progenitor cells and Sertoli cells formed two branches; however, we removed the fetal Leydig cell lineage, which consisted of very few samples. In our example, after reducing to two dimensions by PCA, we clustered the samples into five groups and computed their potential values (Figs. B(c-e)).
Xu's dataset originated from [11], which is accessible from GEO: GSE90047. The original data contained 447 samples and 40824 genes from the mouse embryo. After log-normalization and gene selection, 1140 effective genes, which showed expression in all samples, were detected. Seven stages from E10.5 to E17.5 described bipotential hepatoblasts differentiation into hepatocytes and cholangiocytes. Five clusters were found. The potential values in two-dimensional PCA space and differentiation paths are shown in Figs. 3A-C.
Furlan's dataset was derived from [2], which is accessible by GEO: GSE99933. In brief, 337 cells revealed that Schwann cell precursors (SCPs), a type of peripheral glial stem cell, had the ability to differentiate into large numbers of chromaffin cells in the adrenal medulla. Further, 500 genes, which highly correlated with Chga, a characteristic gene in the differentiation process, were selected. In our example, 4 clusters formed in the two-dimensional PCA space. A linear process from SCPs to chromaffin cells through a moderate differentiation bridge could be detected, as shown in Figs. 3D-F.
A summary of information derived from real datasets is listed in Table A. For the operator L ε,α defined by eqn. (17), we will prove that under the large sample limit N → ∞, it will converge to a continuous backward operator, i.e.
Proof: The large sample limit N → ∞ is the basic assumption. Under this assumption and the Laplacian approximation where φ(x) is any smooth function, we can approximate eqns. (13) and (15) as The shift operator, which is the limitation of the transition matrix eqn. (16) for infinite samples, has the asymptotic expansion as Thus, we have the infinitesimal backward operator as If we define L * α as the conjugate operator of L α under the weighted inner product ·, · r , with the same procedure above we obtain where L T ε,α is the transpose of L ε,α . L * α is also called the infinitesimal forward operator. Some interesting cases include: when α = 1, L 1 = ∆, which is the Laplace-Beltrami operator; when α = 1/2, L 1/2 = L * 1/2 = ∇ log r(x) · ∇ + ∆, which is exactly the backward operator in eqn. (11).
Note S4. Details on the approximation of the net-flow R In this note, the method to approximate the net-flow for each cluster is described. First, we list the original continuous birth-death process, i.e.
where x is a vector representing the gene expression, c(x, t) is the probability density function (pdf) of cells at x, F (x) is a potential function, D is the noise amplitude, and R(x) is the net-flow of cells at state x. ∇, ∇· and ∆ denote the gradient, divergence, and Laplace operators, respectively. With the conditional probability density function (cpdf) we obtain the evolution of r s (x, t) as where is the forward operator. In the long time limit, c(x, t) and r s (x, t) will converge to the steady distribution p(x) and r s (x), respectively. When t → ∞, using and eqn. (S1), we get Assuming that samples have separated into different metastable states, ∇F (x) and dw are orthogonal, where dw is the outer normal vector on the boundary of Ω s . We can also obtain by the divergence theorem. Finally, after we combine eqns. (S6) and (S7), and let t → ∞ in eqn. (S3), the net-flow rateR s of cluster s can be calculated asR In the aspect of numerical computation, we know that samples {x 1 , x 2 , . . . , x N } ∈ Ω s have cpdf r s (x). When the sample size is N → ∞, the limitations s ] is the support set of r Thus, by approximating r To approximate a coarse-grained transfer operatorP = (p st ) K×K on the clusters space, we drive from theapproximation equation where T ε, 1 2 is the transfer operator satisfying After we apply the diffusion map theory on N samples with transition matrix p ij = P ε,1/2 (x i , x j ) and computê where n s is the number of samples in Ω s , we can derive the discrete numerical equation of eqn. (S12) as whereV = (V 1 ,V 2 , · · · ,V K ) T andR = (R 1 ,R 2 , · · · ,R K ) T . Eqns. (S13) and (S14) are required for eqns. (20) and (28) in the main text.

Note S6. Description of other pseudo-time methods
Six popular pseudo-time algorithms are compared in this study. Below, we provide a short description and list some notes on their applications. Corresponding figures are included in the following supplemental document. TSCAN [6] ("Tools for Single Cell Analysis") first clusters cells into groups and then constructs a minimum spanning tree (MST). The pseudo-time of TSCAN is determined by the order of samples' projections on the MST. The number of clusters is needed as the input. When applying TSCAN to our datasets, we used the same number of clusters as LDD, except using five clusters for Simu2. The pseudo-times of some samples were not derived by TSCAN. Therefore, we only considered those with pseudo-times for comparison. The two-dimensional plots for every dataset are shown in Fig. C. TSCAN has PCA plot different from LDD or Slingshot due to additional scaling applied on the dataset.
Monocle2 [7] uses the reversed graph embedding (RGE) technique to learn a graph's structure. A root cell is needed as one of the inputs, and the pseudo-time is calculated by the geodesic distance between the projections on the tree structure. Therefore, choosing the distribution family for the gene expression matrix is important. As Monocle2 does log-normalization in the algorithm by itself, we used "exp(data)" as the input and chose the distribution family "tobit()", except for Simu2 and Simu3 which used "data" as the input along with the parameter "negbinomial.size()". The results are shown in Fig. D.
Diffusion map [1,4] and DPT [5] are two similar methods. The former uses the first coordinate of the diffusion map's reduction as the pseudo-time, and the latter computes the so-called "Diffusion pseudotime" by the accumulation of the transition matrix. Duplicated samples must be eliminated, and a root cell is needed for DPT. Figures E-F show the corresponding results for the two algorithms on seven datasets.
SLICER [10] uses the local linear embedding (LLE) for dimension reduction. A kNN graph is constructed, and the shortest path distance from a root cell on the graph is considered as the pseudo-time. The final results of SLICER for seven datasets are shown in Fig. G.
Slingshot [9] is a recently published method. Slingshot can fit a principal curve for each lineage obtained by MST to generate the pseudo-time. Figure H shows the results of Slingshot for the seven datasets.
Among the six algorithms, if a root cell as the start node is needed, the first one with the smallest true time label in the samples will be chosen. Some of the six algorithms may generate pseudo-times that are negatively correlated with the true time labels, because they could not automatically determine the correct direction. In that case, a minus sign is added to the pseudo-times to create a positive correlation.