Mcadet: A feature selection method for fine-resolution single-cell RNA-seq data based on multiple correspondence analysis and community detection

doi:10.1371/journal.pcbi.1012560

Fig 1.

Schematic of Mcadet workflow.

A. Matrix pre-processing. B. MCA decomposition. C. Leiden community detection (clustering). D. Calculation and ranking of Euclidean distances. E. Statistical testing.

Table 1.

Selected existing feature selection methods in scRNA-seq analysis.

Fig 2.

Jaccard similarity index for comparing feature selection performance on PBMC (A and B) and simulated datasets (C and D). The Jaccard similarity index measures the accuracy of selecting true (or semi-true) HVGs by different FS methods. Each dot in the graph represents a dataset, and the dashed horizontal line represents the baseline mean for all methods. The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.
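
As a minimal sketch of the metric named in this caption, the Jaccard similarity index between a selected gene set and a reference (true or semi-true) HVG set is the size of their intersection divided by the size of their union. The function name `jaccard_index` and the gene lists below are hypothetical and serve only as an illustration; they are not taken from the paper's code.

```python
def jaccard_index(selected_genes, reference_genes):
    """Return |A ∩ B| / |A ∪ B| for two collections of gene names."""
    selected, reference = set(selected_genes), set(reference_genes)
    union = selected | reference
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(selected & reference) / len(union)


# Hypothetical gene lists, for illustration only.
selected = ["CD3E", "MS4A1", "NKG7", "GNLY"]
true_hvgs = ["CD3E", "MS4A1", "NKG7", "LYZ", "CD14"]
score = jaccard_index(selected, true_hvgs)  # 3 shared genes out of 6 distinct
```

A value of 1 means the selected genes match the reference set exactly, and 0 means the two sets share no genes.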

Fig 3.

The trend of the Jaccard similarity index as the number of selected genes increases on PBMC (A and B) and simulated datasets (C and D). The number of selected genes ranges from 200 to 3,000. The Brennecke method, which does not allow for specifying the number of HVGs needed, is excluded from this comparison.

Table 2.

Number of HVGs selected by different feature selection methods by default.

Fig 4.

F1 score: performance evaluation on minority cell populations in PBMC (A and B) and simulated datasets (C and D). The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.

Fig 5.

Comparison of the mean Jaccard similarity of genes selected by different FS methods with semi-true HVGs in PBMC fine-resolution datasets after splitting with a probability of ε = 0.5.

Error bars represent standard deviations. Data 1 and Data 2 are the two datasets obtained by splitting each original dataset.
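
The caption does not spell out the splitting mechanism. One common way to split a count matrix with probability ε is binomial thinning, where each count n is divided into k ~ Binomial(n, ε) and n − k. The sketch below assumes that scheme; the function `split_counts`, its `seed` parameter, and the toy matrix are hypothetical and may differ from what the authors actually did.

```python
import random


def split_counts(counts, epsilon=0.5, seed=0):
    """Split each count n into (k, n - k) with k ~ Binomial(n, epsilon).

    Assumption: binomial thinning; the caption does not confirm this scheme.
    """
    rng = random.Random(seed)
    part1, part2 = [], []
    for row in counts:
        # Each of the n molecules goes to part 1 with probability epsilon.
        row1 = [sum(rng.random() < epsilon for _ in range(n)) for n in row]
        part1.append(row1)
        part2.append([n - k for n, k in zip(row, row1)])
    return part1, part2


# Hypothetical 2-cell x 3-gene count matrix.
counts = [[4, 0, 10], [1, 7, 3]]
data1, data2 = split_counts(counts, epsilon=0.5)
```

By construction, the two parts sum back to the original counts entry by entry, so no reads are created or lost by the split.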

Fig 6.

Density plot of log-mean expression for selected genes in PBMC fine-resolution datasets.

The light lavender density represents the true HVGs in each panel. The dashed vertical blue line represents the mean for each distribution.

Fig 7.

Frequency bar plot of informative genes discovered by Mcadet exclusively.

Fig 8.

Comparison of the mean expression of the gene SPINT2 across fine-resolution PBMC cell types.

The horizontal red dashed line represents the overall mean.

Fig 9.

Averaged clustering metrics for comparing feature selection performance on PBMC (A and B) and simulated datasets (C and D).

The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.

Fig 10.

Density plot of the analytical Spearman correlations.

Density distribution of the analytical Spearman correlations for genes selected by Mcadet (red) compared to all genes (blue) across 50 generated continuous scRNA-seq datasets.
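
The caption does not define how the "analytical" Spearman correlation is obtained. As a generic illustration only, a Spearman correlation is the Pearson correlation of the rank vectors, with tied values given their average rank. In the sketch below, the variable `pseudotime` (a continuous ordering of cells) and the expression values are hypothetical stand-ins, not quantities taken from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical continuous ordering of four cells and one gene's expression.
pseudotime = [0.1, 0.4, 0.5, 0.9]
expression = [1, 3, 2, 8]
rho = spearman(pseudotime, expression)
```

Because only ranks enter the calculation, any strictly monotone transformation of the expression values leaves `rho` unchanged.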

Fig 11.

Comparison of feature selection methods using mean analytical Spearman correlations.

The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.

Fig 12.

2D biplot of a coarse-resolution PBMC dataset.

The x-axis and y-axis are the first two PCs of the standard row coordinates of cells (dots) and the principal coordinates of genes (+, labeled with text). The black arrows represent the Euclidean distance from genes to the cell centroid.

Fig 13.

The Euclidean distances from different marker genes to the centroid of each coarse-resolution PBMC cell type.

The top 60 PCs were used to calculate the Euclidean distances in the embedded biplot space.
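
The distance described here can be sketched as follows, assuming genes and cells already have coordinates in a shared embedding: average the cell coordinates of one cell type to get its centroid, then take the Euclidean distance from the gene to that centroid over the first `n_pcs` dimensions. The helper names and the toy 3-dimensional coordinates are hypothetical; the caption itself uses the top 60 PCs.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def gene_to_centroid_distance(gene_coords, cell_coords, n_pcs):
    """Euclidean distance from a gene to the centroid of a set of cells,
    using only the first n_pcs dimensions of the shared embedding."""
    center = centroid([c[:n_pcs] for c in cell_coords])
    return math.dist(gene_coords[:n_pcs], center)


# Hypothetical coordinates: two cells of one cell type and one gene.
cells = [[0.0, 0.0, 5.0], [2.0, 0.0, -5.0]]
gene = [4.0, 4.0, 9.0]
d = gene_to_centroid_distance(gene, cells, n_pcs=2)  # third dimension ignored
```

A small distance places the gene close to that cell type in the embedding, which is the sense in which marker genes are compared across cell types in this figure.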

Fig 14.

UMAP visualization of a fine-resolution PBMC dataset with true annotated labels by different FS methods.

A-G: HVGs of semi-ground truth, HVGs selected by Mcadet, HVGs selected by Scry, HVGs selected by NBDrop, HVGs selected by Brennecke, HVGs selected by M3Drop, HVGs selected by Seurat Disp, HVGs selected by Seurat Vst, HVGs selected by Seurat Mvp.

Fig 15.

UMAP visualizations of the same fine-resolution PBMC dataset as in Fig 14, colored by k-means clustering labels by different FS methods.

A-G: HVGs of semi-ground truth, HVGs selected by Mcadet, HVGs selected by Scry, HVGs selected by NBDrop, HVGs selected by Brennecke, HVGs selected by M3Drop, HVGs selected by Seurat Disp, HVGs selected by Seurat Vst, HVGs selected by Seurat Mvp.
