A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa
Fig 1
Strategies of clustering multiple single-cell populations.
In the example, four cell types are shown in four different colors. (A) Ground Truth. 2D plot of a pool of single cells combined from three single-cell populations with identical distributions, separated by the true marker genes A and B. (B) Simulated Single-cell Populations. 3D plots of the three single-cell populations, plotted on the marker genes A and B and the non-marker gene X. The simulated data are generated from the ground truth data by applying rotation and scaling, which represent technical biases and biological variation, and by adding 998 random genes to gene A and gene B (1000 genes in total). Additional noise is also introduced. Three clustering strategies are shown in (C), (D) and (E). (C) Pooled Clustering. 2D plot, using the true marker genes A and B, of pooled data that simply combines the three single-cell populations for clustering. Even with the correct marker selection, cells of different types are still mixed because of the rotation, scaling and noise. (D) Separated Clustering. 2D plot of each individual cell population. With the limited single-cell sample size and skewed cell-type distribution, incorrect marker genes may be selected, shown as genes P, Q and R. (E) Multitask Clustering and Embedded Feature Selection. The proposed multitask clustering can both identify the true marker genes and correctly cluster the individual cells into their respective types in each population. The clustering of each dataset is reinforced by the results from the other two datasets, shown as the connected clusters across the three experiments.
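The simulation described in panel (B) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `simulate_populations`, the cluster centers, the rotation and scaling ranges, the background gene distribution and the noise level are all hypothetical choices, since the caption specifies only the overall recipe (four cell types, three populations, rotation, scaling, 998 random genes and additional noise).

```python
import numpy as np


def simulate_populations(n_cells=200, n_pops=3, n_genes=1000, noise_sd=0.3, seed=0):
    """Return a list of (expression_matrix, cell_type_labels), one per population.

    All numeric settings here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Ground truth: four cell types as blobs in the (gene A, gene B) plane.
    centers = np.array([[2.0, 2.0], [2.0, 8.0], [8.0, 2.0], [8.0, 8.0]])
    n_types = len(centers)

    populations = []
    for _ in range(n_pops):
        # Every population is drawn from the same ground-truth distribution.
        labels = rng.integers(0, n_types, size=n_cells)
        markers = centers[labels] + rng.normal(0.0, 0.5, size=(n_cells, 2))

        # Population-specific rotation and scaling, standing in for
        # technical biases and biological variation.
        theta = rng.uniform(-np.pi / 6, np.pi / 6)
        rotation = np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta), np.cos(theta)]])
        scale = rng.uniform(0.7, 1.3)
        markers = scale * (markers @ rotation.T)

        # 998 random non-marker genes that carry no cell-type signal.
        background = rng.normal(5.0, 1.0, size=(n_cells, n_genes - 2))

        # Columns 0 and 1 are genes A and B; add extra noise to every gene.
        expr = np.hstack([markers, background])
        expr += rng.normal(0.0, noise_sd, size=expr.shape)

        populations.append((expr, labels))
    return populations
```

Calling `simulate_populations()` yields three matrices of 200 cells by 1000 genes. Concatenating them row-wise corresponds to the pooled setting in (C), while clustering each matrix on its own corresponds to the separated setting in (D).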