Fig 1.
A. Matrix pre-processing. B. MCA decomposition. C. Leiden community detection (clustering). D. Calculation and ranking of Euclidean distances. E. Statistical testing.
Table 1.
Selected existing feature selection methods in scRNA-seq analysis.
Fig 2.
Jaccard similarity index for comparing feature selection performance on PBMC (A and B) and simulated datasets (C and D). The Jaccard similarity index measures how accurately each FS method selects the true (or semi-true) HVGs. Each dot represents a dataset, and the dashed horizontal line represents the baseline mean across all methods. The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.
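The Jaccard similarity index used in this figure can be illustrated with a minimal Python sketch. It assumes the selected genes and the reference HVGs are available as collections of gene names. The function name `jaccard_index` and the two gene lists are hypothetical and serve only as an illustration; they are not taken from the Mcadet implementation or from the study's data.

```python
def jaccard_index(selected, reference):
    """Return |A ∩ B| / |A ∪ B| for two collections of gene names."""
    a, b = set(selected), set(reference)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical gene lists, for illustration only.
selected_genes = ["CD3E", "CD79A", "NKG7", "LYZ"]
true_hvgs = ["CD3E", "CD79A", "NKG7", "MS4A1", "GNLY"]

# Three genes are shared and six distinct genes appear overall.
score = jaccard_index(selected_genes, true_hvgs)
```

A score of 1 means the selected set matches the reference set exactly, and a score of 0 means the two sets share no genes.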
Fig 3.
Trend of the Jaccard similarity index as the number of selected genes increases on PBMC (A and B) and simulated datasets (C and D). The number of selected genes ranges from 200 to 3,000. The Brennecke method, which does not allow the number of HVGs to be specified, is excluded from this comparison.
Table 2.
Number of HVGs selected by different feature selection methods by default.
Fig 4.
F1 score: performance evaluation on minority cell populations on PBMC (A and B) and simulated datasets (C and D). The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.
Fig 5.
Comparison of the mean Jaccard similarity of genes selected by different FS methods with semi-true HVGs in PBMC fine-resolution datasets after splitting with a probability of ε = 0.5.
Error bars represent standard deviations. Data 1 and Data 2 are the two datasets obtained by splitting each original dataset.
Fig 6.
Density plot of log-mean expression for selected genes in PBMC fine-resolution datasets.
The light lavender density represents the true HVGs in each panel. The dashed vertical blue line represents the mean for each distribution.
Fig 7.
Frequency bar plot of informative genes discovered by Mcadet exclusively.
Fig 8.
Comparison of the mean expression of the gene SPINT2 across fine-resolution PBMC cell types.
The horizontal red dashed line represents the overall mean.
Fig 9.
Averaged clustering metrics for comparing feature selection performance on PBMC (A and B) and simulated datasets (C and D).
The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.
Fig 10.
Density plot of the analytical Spearman correlations.
Density distribution of the analytical Spearman correlations for genes selected by Mcadet (red) compared to all genes (blue) across 50 generated continuous scRNA-seq datasets.
Fig 11.
Comparison of feature selection methods using mean analytical Spearman correlations.
The p-values, obtained through one-sided pairwise t-tests, indicate the significance of the differences between the Mcadet method and each other FS method. NS: not significant (p ≥ 0.05); *: 0.01 ≤ p < 0.05; **: 0.001 ≤ p < 0.01; ***: p < 0.001.
Fig 12.
2D biplot of a coarse-resolution PBMC dataset.
The x-axis and y-axis are the first two PCs of the standard row coordinates of cells (dots) and the principal coordinates of genes (+, labeled with text). The black arrows represent the Euclidean distance from each gene to the cell centroid.
Fig 13.
Euclidean distances from different marker genes to the centroid of each coarse-resolution PBMC cell type.
The top 60 PCs were used to calculate the Euclidean distances in the embedded biplot space.
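The distance computation described in this caption can be sketched in Python as follows. The sketch assumes that gene and cell coordinates in the shared biplot space have already been computed, and it does not perform the MCA decomposition itself. The helper names `centroid` and `distance_to_centroid` and all coordinate values are hypothetical, and the toy example uses 2 dimensions instead of the 60 PCs used for the figure.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def distance_to_centroid(gene_coords, cell_coords, n_dims):
    """Euclidean distance from one gene to the centroid of a group of cells,
    using only the first n_dims coordinates of each point."""
    center = centroid([cell[:n_dims] for cell in cell_coords])
    return math.dist(gene_coords[:n_dims], center)


# Hypothetical coordinates for the cells of one cell type (three dimensions
# stored, only the first two used below).
cells_of_one_type = [
    [0.0, 0.0, 9.0],
    [2.0, 0.0, 9.0],
    [0.0, 2.0, 9.0],
    [2.0, 2.0, 9.0],
]
# Hypothetical coordinates for one gene.
gene = [4.0, 5.0, 0.0]

# The cell centroid in the first two dimensions is (1, 1).
d = distance_to_centroid(gene, cells_of_one_type, n_dims=2)
```

Under this setup, a smaller distance places a gene closer to a cell type's centroid in the embedded space.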
Fig 14.
UMAP visualization of a fine-resolution PBMC dataset with true annotated labels by different FS methods.
A-I: HVGs of the semi-ground truth, followed by HVGs selected by Mcadet, Scry, NBDrop, Brennecke, M3Drop, Seurat Disp, Seurat Vst, and Seurat Mvp.
Fig 15.
UMAP visualizations of the same fine-resolution PBMC dataset as in Fig 14, colored by k-means clustering labels, for different FS methods.
A-I: HVGs of the semi-ground truth, followed by HVGs selected by Mcadet, Scry, NBDrop, Brennecke, M3Drop, Seurat Disp, Seurat Vst, and Seurat Mvp.