Trajectory inference from single-cell genomics data with a process time model

doi:10.1371/journal.pcbi.1012752

Fig 1.

Chronocell overview.

The input of Chronocell comprises three components: 1) the trajectory structure, which outlines the states each lineage traverses as paths on a directed graph; 2) the sampling assumption, which defines the prior distribution of the latent variables, namely lineages and process time, with a default uniform distribution over both; and 3) the scRNA-seq data, consisting of unspliced and spliced count matrices. The Chronocell model consists of an expression model with piecewise-constant transcription rates and a Bernoulli measurement model. Each state s is associated with a transcription rate α_s for each gene, as well as an exit time τ_k denoting the switching time to the next state, where k is the index of the time segment. Inference uses the EM algorithm, which alternates between E-steps and M-steps. The results of Chronocell primarily include the estimated parameters and the posterior distributions over latent variables for each cell.
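To make the piecewise-constant transcription model concrete, the following is a minimal sketch of how mean unspliced and spliced counts could evolve along a single lineage. It assumes the standard two-species splicing dynamics du/dt = α − βu and ds/dt = βu − γs with β ≠ γ; the splicing rate β, the function names, and the argument layout are illustrative assumptions and are not taken from the Chronocell code.

```python
import math

def step(u0, s0, alpha, beta, gamma, dt):
    """Advance the mean unspliced (u) and spliced (s) counts by dt
    under a constant transcription rate alpha.
    Assumed dynamics: du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s,
    with beta != gamma."""
    eb = math.exp(-beta * dt)
    eg = math.exp(-gamma * dt)
    u_ss = alpha / beta    # steady state of u for this segment
    s_ss = alpha / gamma   # steady state of s for this segment
    u = u_ss + (u0 - u_ss) * eb
    s = (s_ss + (s0 - s_ss) * eg
         + beta * (u0 - u_ss) / (gamma - beta) * (eb - eg))
    return u, s

def mean_counts(t, alphas, taus, beta, gamma, u0=0.0, s0=0.0):
    """Mean (u, s) at process time t along one lineage.
    alphas[k] is the transcription rate on the segment [taus[k], taus[k+1]];
    taus holds the switching times in increasing order."""
    u, s = u0, s0
    for k, alpha in enumerate(alphas):
        start, end = taus[k], taus[k + 1]
        if t <= start:
            break
        # Integrate only up to t if t falls inside this segment.
        u, s = step(u, s, alpha, beta, gamma, min(t, end) - start)
    return u, s
```

For example, `mean_counts(0.75, [2.0, 0.0], [0.0, 0.5, 1.0], beta=1.0, gamma=0.5)` evaluates a gene that is transcribed in the first segment and switched off in the second.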


Fig 2.

Demonstration of inference on simulation.

a The ground truth trajectory structure. Cells jump from the starting state (1) to the next state (2) at τ₀ = 0, and then bifurcate into two lineages with different ending states (3 and 4) at τ₁. The process ends at τ₂ = 1. Out of 200 total genes, 100 are non-variable, with the same distribution over time. b The falsely assumed structure, which does not know that the first two states should be merged into one. All genes are assumed to vary over time. c The ELBO scores of 100 random initializations (blue dots) compared to that of the warm start (red line). The x axis is the Pearson correlation between the mean process time of each random initialization and the true time. d ELBO scores over fitting iterations for both the warm start (red line) and the best random initialization (blue line), with the ELBO calculated with the true parameters (gray line) as reference. e Heatmaps of inferred posterior distributions. The x axis is time grids, and the y axis is cells ordered by their true times, with true transcription states shown on the left. The color intensity indicates the weights of the posterior distributions of cells on the grids. Heatmaps of cells from τ₀ to τ₁ use a gray color palette. Heatmaps of cells from τ₁ to τ₂ use a purple color palette. Heatmaps of cells from τ₂ to τ₃ use blue for the first lineage and red for the second. RMSE stands for root mean square error. f The averaged posterior distribution across cells (dark blue) and the true empirical distribution (gray) of process time. g Inferred parameter values compared to true values. For non-variable genes, only are identifiable and compared. Error is the mean normalized error as described in the text, and the mean is computed across genes. h The confusion matrix for gene selection. i α values of selected genes over states. j The two models and the distribution of the model chosen by (train) ELBO, AIC, BIC, and test ELBO, calculated on 20 samples, each with a different set of parameters.


Fig 3.

Inference results for Forebrain data.

a Schematics of Forebrain data and PCA plot of cells colored by cell type annotations. b The AIC scores of the trajectory and cluster models. The x axis is the mean process time correlations of 100 random initializations (blue dots). The AIC scores of random initializations are compared to those of warm start (red line) as well as 3 Poisson mixture model (yellow line). AP stands for average precision. c Posterior distributions of process time of the trajectory model with warm start. Cells are ordered along the y axis by their inferred mean process time, and the left bar displays their cell types using the colors in a. The histogram below shows the posterior distribution averaged over cells. The entropy of the average posterior distribution was calculated using its weights on the 100 discretized time grids.
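The entropy of the average posterior described in panel c can be sketched as follows. This is a minimal illustration, assuming the posterior is stored as a cells-by-grid array whose rows each sum to 1, and assuming the natural logarithm (the caption does not state the base); the function name is hypothetical.

```python
import numpy as np

def average_posterior_entropy(posterior):
    """Shannon entropy (in nats, an assumption) of the posterior
    averaged over cells.
    posterior: array of shape (n_cells, n_grid); each row is one cell's
    posterior weights over the discretized time grid."""
    avg = np.asarray(posterior, dtype=float).mean(axis=0)
    avg = avg / avg.sum()   # guard against small numerical drift
    nz = avg[avg > 0]       # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum())
```

Under these assumptions, an average posterior spread uniformly over 100 grid points attains the maximum value log(100), while one concentrated on a single grid point gives 0.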


Fig 4.

Inference results for Erythroid data.

a Schematics of Erythroid data and PCA plot of cells colored by cell type annotations. b The fitted trajectory structure and inferred mean process time from random initialization, indicated in blue on the same PCA plot as in a. c Posterior distributions of process time. Cells are ordered along the y axis by their inferred mean process time, and the left bar displays their cell types using the colors in a. The histogram below shows the posterior distribution averaged over cells. The entropy of the average posterior distribution was calculated using its weights on the 100 discretized time grids. d Averaged posterior distribution across cells from different experimental time points. n is the number of cells. e α values of 24 selected genes over states. f Phase plots of the top 5 DE genes among the 24 selected genes. The x axis is the raw unspliced counts and the y axis is the raw spliced counts. The blue curve is the fitted mean of the product Poisson distributions of unspliced and spliced counts over process time, and its darkness corresponds to the value of process time.


Fig 5.

Inference results for cell cycle data.

a Schematics of cell cycle data and scatter plot of the Geminin-GFP and Cdt1-RFP of RPE1 cells colored by cell type annotations. b The fitted trajectory structure and inferred mean process time from random initialization, indicated in blue on the same scatter plot as in a. c Posterior distributions of process time. Cells are ordered along the y axis by their inferred mean process time, and the left bar displays their cell types using the colors in a. The histogram below shows the posterior distribution averaged over cells. The entropy of the average posterior distribution was calculated using its weights on the 100 discretized time grids. d α values of 84 selected genes over states. e Comparison of γ estimates for 84 selected genes with estimates derived from metabolic RNA labeling data. CCC stands for concordance correlation coefficient. n is the number of genes for which estimates are available in each respective paper.
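For reference, the concordance correlation coefficient in panel e can be computed as in the sketch below. It uses Lin's definition with population (1/n) variances and covariance; whether the γ estimates were compared on a raw or log scale is not stated in the caption, so any transformation is left to the caller. The function name is hypothetical.

```python
import numpy as np

def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two paired
    vectors, using population (1/n) moments:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(2.0 * cov / (x.var() + y.var() + (mx - my) ** 2))
```

Unlike the Pearson correlation, this quantity penalizes systematic shifts and scale differences between the two sets of estimates, so it equals 1 only when the paired values agree exactly.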
