Fig 1.
Demonstration workflow using scedar to analyze an scRNA-seq dataset with 3,005 mouse brain cells and 19,972 genes generated using the STRT-Seq UMI protocol by Zeisel et al. [53].
Procedures and parameters that are not directly related to data analysis are omitted. The full version of the demo is available at https://github.com/logstar/scedar/tree/master/docs/notebooks.
Fig 2.
The minimum description length iteratively regulated agglomerative clustering (MIRAC) algorithm.
MIRAC extends hierarchical agglomerative clustering (HAC) in a divide-and-conquer manner for scRNA-seq data. Given raw or dimensionality-reduced scRNA-seq data as input, MIRAC first builds an HAC tree (Lines 1–3). The tree is then divided into small sub-clusters (Lines 4–5), which are iteratively merged into clusters (Lines 9–37). The rationales and detailed procedures are described in the Methods section.
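The first two stages named in this legend, building an HAC tree and dividing it into small sub-clusters, can be sketched with SciPy. This is a minimal illustration and not the MIRAC implementation: the helper `split_tree`, the size bound `max_size`, the complete-linkage and Euclidean settings, and the toy data are all assumptions made for the example, and the description-length-guided merging stage is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def split_tree(node, max_size):
    """Divide an HAC tree into subtrees holding at most max_size leaves.

    Hypothetical helper: a simple size-based split used as a stand-in for
    the sub-cluster division step; it returns lists of sample indices.
    """
    if node.is_leaf() or node.get_count() <= max_size:
        return [node.pre_order()]  # leaf (sample) indices under this node
    return (split_tree(node.get_left(), max_size)
            + split_tree(node.get_right(), max_size))


# Toy data: 3 groups of 10 samples in 5 dimensions (not real scRNA-seq data).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.1, size=(10, 5))
                  for c in (0.0, 3.0, 6.0)])

# Build the HAC tree (linkage method and metric are assumed here).
tree = to_tree(linkage(data, method="complete", metric="euclidean"))

# Divide the tree into small sub-clusters; a merging stage would follow.
sub_clusters = split_tree(tree, max_size=5)
```

Each entry of `sub_clusters` is a list of sample indices; together they partition the input, which is the starting point for an iterative merge.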
Fig 3.
Clustering method benchmarks on experimental datasets.
(A) Runtimes. (B) CCRs on different datasets. Each point for a dataset corresponds to a different number of clusters. For each dataset, the numbers of clusters are the same across all compared clustering methods.
Table 1.
Real scRNA-seq datasets for benchmark.
Fig 4.
Gene dropout imputation method benchmarks.
(A) Runtimes on 40 simulated 10x Genomics datasets. (B) ROC curves (± standard deviation) of dropout detection on the simulated 10x Genomics datasets. (C) t-SNE scatter plots of the Zeisel et al. [53] dataset after gene dropout imputation.
Fig 5.
KNN rare transcriptomic profile detection on the Zeisel et al. [53] dataset.
(A) t-SNE scatter plot with colors labeling cell types and markers labeling common or rare transcriptomic profiles. 9.3% of cells are marked as rare. (B) Pairwise cosine distance heatmap with the left strip showing MIRAC labels and the upper strip showing common or rare transcriptomic profile labels. (C) Pairwise cosine distance heatmap with rare transcriptomic profiles removed.
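A rough sketch of how the quantities in panels B and C could be computed is given below. It is a simplified, single-pass illustration and not the scedar implementation: the function `flag_rare_profiles`, the rule "fewer than `k` neighbors within cosine distance `d`", the default values of `k` and `d`, and the toy matrix are assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def flag_rare_profiles(expr, k=2, d=0.3):
    """Flag cells with fewer than k other cells within cosine distance d.

    Hypothetical helper: returns the pairwise cosine distance matrix and
    a boolean mask that is True for cells treated as rare.
    """
    dist = squareform(pdist(expr, metric="cosine"))
    within = dist <= d
    np.fill_diagonal(within, False)  # a cell is not its own neighbor
    n_neighbors = within.sum(axis=1)
    return dist, n_neighbors < k


# Toy expression matrix: two groups of three similar cells and one outlier.
expr = np.array([
    [1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1],
    [0.1, 1.0, 0.0], [0.0, 1.0, 0.1], [0.1, 0.9, 0.1],
    [0.0, 0.0, 1.0],
])

dist, rare = flag_rare_profiles(expr, k=2, d=0.3)

# Distance matrix restricted to common profiles, as in panel C.
dist_common = dist[np.ix_(~rare, ~rare)]
```

In this toy example only the last cell is flagged, so `dist_common` keeps the six remaining cells.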
Fig 6.
Identified genes separating the MIRAC clusters 1, 15, and 22 of the Zeisel et al. [53] dataset.
(A) t-SNE scatter plot with color as MIRAC cluster labels and marker shape as compared or not compared. (B) t-SNE scatter plots of the compared clusters with color as log2(read count + 1) of the corresponding gene and marker shape as MIRAC clusters. (C) Transcription level heatmap of the top 100 most important cluster-separating genes in the compared cells, with rows as cells ordered by cluster labels and columns as genes ordered by importance. The color gradient is log2(clip(read count, 1, 100)), where clip(read count, 1, 100) sets any read count below 1 to 1 and any read count above 100 to 100, so that genes at different transcription levels can be compared on a common scale.
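The two color transforms named in this legend can be written with NumPy as follows. This is a minimal sketch: the function names and the example counts are illustrative and are not taken from scedar.

```python
import numpy as np


def log2_plus_one(read_counts):
    """Panel B color scale: log2(read count + 1)."""
    return np.log2(np.asarray(read_counts, dtype=float) + 1)


def clipped_log2(read_counts, lower=1, upper=100):
    """Panel C color scale: log2(clip(read count, lower, upper))."""
    counts = np.asarray(read_counts, dtype=float)
    # Counts below `lower` become `lower`; counts above `upper` become `upper`.
    return np.log2(np.clip(counts, lower, upper))


# Illustrative read counts for one gene across five cells.
example = np.array([0, 1, 8, 100, 2500])
scatter_values = log2_plus_one(example)
heatmap_values = clipped_log2(example)
```

With the clip in place, a count of 0 and a count of 1 both map to 0, and every count of 100 or more maps to log2(100), so a few very highly expressed genes do not dominate the color range.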