mbkmeans: Fast clustering for single cell data using mini-batch k-means

doi:10.1371/journal.pcbi.1008625

Fig 1.

mbkmeans uses less memory and is faster than k-means.

Performance evaluation (y-axis) of (A) maximum memory (RAM) used (GB) and (B) elapsed time (minutes) (repeated 10 times) for increasing sizes of datasets (x-axis) with N = 75,000, 150,000, 300,000, 500,000, 750,000, and 1,000,000 observations and G = 5,000 genes, using our desktop configuration. Results for mbkmeans are in green (in-memory) and blue (on-disk); k-means is in red. We used k = 15 for both algorithms and used a batch size of b = 500 observations for mbkmeans.


Fig 2.

The accuracy of mbkmeans depends on batch size.

Performance evaluation (y-axis) with (A) adjusted Rand index (ARI) and (B) within-cluster sum of squares (WCSS) for increasing batch sizes ranging from 75 to 1,000 cells (x-axis), using simulated gene expression data (G = 1,000) with a fixed k = 3 true centroids and three dataset sizes (N = 5,000, 10,000, and 25,000). (C) WCSS (y-axis) for increasing batch sizes (x-axis) using real scRNA-seq gene expression data from 10X Genomics and k = 15 for both algorithms. ARI and WCSS are reported as averages across 50 runs.
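
To make the two accuracy metrics in this caption concrete, the following is a minimal sketch of how ARI and WCSS could be computed for a range of batch sizes. It assumes a generic mini-batch k-means update with per-centroid learning rates; it is not the mbkmeans package's implementation, and the function names (`mini_batch_kmeans`, `wcss`, `adjusted_rand_index`, `simulate`), the two-dimensional toy data, and the chosen batch sizes are illustrative assumptions rather than the paper's simulation setup.

```python
import random
from collections import Counter
from math import comb


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def mini_batch_kmeans(data, k, batch_size, n_iter=50, seed=0):
    """Generic mini-batch k-means (illustrative; not the mbkmeans code)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, k)]  # random initialisation
    counts = [0] * k  # number of points each centroid has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(data, min(batch_size, len(data)))
        labels = [nearest(p, centroids) for p in batch]  # assign, then update
        for p, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centroid learning rate
            centroids[j] = [(1 - eta) * c + eta * x
                            for c, x in zip(centroids[j], p)]
    return centroids


def wcss(data, centroids):
    """Within-cluster sum of squares over the full dataset."""
    return sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in data)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def simulate(n_per_cluster=100, seed=1):
    """Toy data: three well-separated 2-D Gaussian clusters (assumption)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    data, truth = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            data.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            truth.append(label)
    return data, truth


data, truth = simulate()
for b in (10, 50, 200):  # hypothetical batch sizes for the toy data
    centroids = mini_batch_kmeans(data, k=3, batch_size=b)
    predicted = [nearest(p, centroids) for p in data]
    print(b, round(wcss(data, centroids), 1),
          round(adjusted_rand_index(truth, predicted), 3))
```

In this sketch each batch size is evaluated once with a fixed seed; averaging over repeated runs with different seeds, as the caption describes, would amount to wrapping the loop body in a further loop over `seed`.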


Fig 3.

The speed and memory usage of mbkmeans depend on batch size.

Performance evaluation (y-axis) of (A) maximum memory (RAM) used (GB) and (B) elapsed time (minutes) for increasing batch sizes (x-axis) with b = 75, 150, 300, 500, 1,000, 1,500, 3,000, 5,000, 7,500, 10,000, 20,000, 50,000, 100,000, and 200,000 with a dataset of size N = 1,000,000 observations using our desktop configuration. Results for mbkmeans are in red (in-memory) and blue (on-disk). We used k = 15 for the number of centroids.


Fig 4.

The speed and memory usage of the on-disk mbkmeans implementation depend on the structure of the on-disk file.

Performance evaluation (y-axis) of (A) maximum memory (RAM) used (GB) and (B) elapsed time (minutes) (repeated 10 times) for increasing sizes of datasets (x-axis) with N = 75,000, 150,000, 300,000, 500,000, 750,000, and 1,000,000 observations using our desktop configuration. Results for an HDF5 file indexed by gene are in blue, by cell in red, as a single chunk in purple, and with the default indexing in green. The single-chunk configuration could only be run for the smallest dataset size (N = 75,000). We used k = 15 and a batch size of b = 500 observations for mbkmeans.


Fig 5.

Results of full analysis on 1.3 million mouse brain cells.

(A) Hexbin plot [54] of the UMAP representation of the 1.3 million cells, color coded by the clusters found via mbkmeans. (B) Heatmap of the average gene expression of each of the 15 clusters found by mbkmeans for 42 marker genes.
