Single-cell data integration across weakly linked modalities

doi:10.1371/journal.pcbi.1014231

Fig 1.

Overview pipeline of MMIHCL.

(a) The two input modality matrices share linked features. Weighted adjacency cell graphs are constructed through the akNN module. (b) Linked features are embedded via the hypergraph operator to generate initial embeddings, followed by a Many-to-Many (M-to-M) matching process to obtain the initial matching. (c) The workflow iteratively updates the cell embeddings and the matching. Each loop involves an embedding update, Canonical Correlation Analysis (CCA) [27], and M-to-M matching. The final outputs are the optimized joint embeddings and the 1-to-N matching. (d) Feature similarity is calculated to move from a constant k in standard kNN to a cell-specific adaptive k, yielding the final graph. (e) For a target cell i, the local and global message-passing branches each generate an embedding. Contrastive learning is applied between the two branches to make the representation more robust, followed by a fusion step to obtain the final learned embedding.
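
As a rough illustration of panel (d), the sketch below shows one way a constant k could be replaced by a cell-specific k. The selection rule (keep candidates whose cosine similarity is at least the mean over the candidates), the function name `adaptive_knn`, and the parameters `k_max` and `k_min` are assumptions made for this example; the caption does not specify the rule MMIHCL uses.

```python
import numpy as np

def adaptive_knn(X, k_max=10, k_min=2):
    """Hypothetical adaptive kNN; an illustration, NOT the MMIHCL rule.

    For each cell, take its k_max most similar cells (cosine similarity)
    and keep those whose similarity is at least the mean over these
    candidates, but never fewer than k_min. Returns a weighted adjacency
    matrix and the per-cell neighbor counts.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Row-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # a cell is never its own neighbor
    k_max = min(k_max, n - 1)
    A = np.zeros((n, n))
    k = np.zeros(n, dtype=int)
    for i in range(n):
        cand = np.argsort(-S[i])[:k_max]   # candidates, most similar first
        sims = S[i, cand]
        keep = sims >= sims.mean()         # assumed cell-specific cutoff
        keep[:min(k_min, k_max)] = True    # always keep the top k_min
        k[i] = int(keep.sum())
        A[i, cand[keep]] = sims[keep]      # edge weight = similarity
    return A, k

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))          # synthetic cells x features
A_demo, k_demo = adaptive_knn(X_demo, k_max=8, k_min=2)
```

Here `k_demo` holds one neighbor count per cell, so cells in dense, homogeneous regions can end up with more neighbors than isolated cells.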

Fig 2.

Comprehensive performance evaluation on weak linkage datasets.

(a) Boxplots show the weighted overall scores, which balance biological fidelity and multimodal alignment. The bold red dashed horizontal line marks the mean score of MMIHCL, used as the baseline, and the values in parentheses are mean differences relative to MMIHCL. In each boxplot, the bounds of the box are the 25th and 75th percentiles, so the box spans the Inter-Quartile Range (IQR). The whiskers extend to 1.5 times the IQR beyond the box: the minimum is the 25th percentile minus 1.5 times the IQR, and the maximum is the 75th percentile plus 1.5 times the IQR. The black line and the red dotted line inside the box mark the median and the mean, respectively. Statistical significance for (a) and (b) was determined by two-sided Wilcoxon signed-rank tests (***: P < 0.001, **: P < 0.01, *: P < 0.05, ns: not significant). (b) Scatter plot illustrates the balance between biological fidelity (S_bio) and batch effect removal (S_batch). Diamonds, error bars, and shaded 95% confidence ellipses represent centroids, standard deviations, and the distribution of results across five repetitions, respectively. Statistical significance symbols in the legend also use MMIHCL as the reference. (c-g) Radar charts display performance across seven metrics (ACC, 1-FOSCTTM, NMI, ARI, ASW_label, GC, and ASW_batch) for: (c) CITE-seq PBMC, (d) TEA-seq PBMC, (e) AB-seq BMC, (f) CITE-seq BMC, and (g) CODEX tonsil. Significance markers on individual metrics indicate statistical differences between the two top-ranked methods (two-sided Wilcoxon signed-rank tests). NOTE: Unless otherwise specified, the definitions of plot elements and significance testing are the same for similar figures throughout this study.
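
The boxplot elements defined in panel (a) can be computed as in the following minimal sketch. The function name `boxplot_stats` and the five input scores are made up for illustration and are not results from the study.

```python
import numpy as np

def boxplot_stats(scores):
    """Box bounds, whisker limits, median and mean as defined in the caption."""
    scores = np.asarray(scores, dtype=float)
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # "minimum" in the caption
    upper = q3 + 1.5 * iqr   # "maximum" in the caption
    outliers = scores[(scores < lower) | (scores > upper)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower": lower, "upper": upper,
        "mean": scores.mean(), "outliers": outliers.tolist(),
    }

# Illustrative scores from five hypothetical repetitions.
stats = boxplot_stats([0.62, 0.65, 0.66, 0.68, 0.95])
print(stats)
```

With these inputs the box runs from 0.65 to 0.68, so any score outside roughly 0.605 to 0.725 (here 0.95) falls beyond the whisker limits.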

Fig 3.

UMAP visualization on CITE-seq PBMC dataset.

Panels in the first and third rows are colored by data modality, and panels in the second and fourth rows are colored by cell type. The other UMAP figures are arranged in the same way.

Fig 4.

Comprehensive performance evaluation on strong linkage datasets.

(a) Boxplots show the weighted overall scores balancing biological fidelity and multimodal alignment. (b) Scatter plot illustrates the balance between biological fidelity (S_bio) and batch effect removal (S_batch). (c-e) Radar charts display performance across seven metrics (ACC, 1-FOSCTTM, NMI, ARI, ASW_label, GC, and ASW_batch) for: (c) CITE-seq & CyTOF PBMC, (d) CyTOF human H1N1 & IFNG, and (e) 10X-Multiome PBMC.

Fig 5.

UMAP visualization on CyTOF human H1N1 & IFNG dataset.

Fig 6.

Performance evaluation of cross-modality feature prediction on the CITE-seq PBMC dataset.

(a) Violin plots displaying the distribution of cell-wise PCCs between ground truth and predicted surface protein abundances. The white dot represents the median PCC, and the thick bar indicates the interquartile range. The numbers above the violins indicate the difference in median PCC relative to MMIHCL. Statistical significance was determined using two-sided Wilcoxon signed-rank tests (***: P < 0.001, **: P < 0.01, *: P < 0.05, ns: not significant). (b) CDF curves illustrating the proportion of cells (y-axis) surpassing specific PCC thresholds (x-axis). The translucent shading surrounding each curve represents the standard deviation, and the values in parentheses within the legend denote the AUC for each method. (c) Side-by-side heatmaps comparing the z-scored expression of 10 representative surface proteins (e.g., CD3.1, CD19, CD14) across annotated cell types for ground truth, MMIHCL prediction, and MaxFuse prediction. Rows represent protein markers, and columns represent individual cells sorted by cell type. (d) UMAP visualizations of ground truth versus predicted expression for two lineage-specific protein markers: CD3.1 (T cells) and CD19 (B cells).
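
The quantities behind panels (a) and (b) can be sketched as follows: a Pearson correlation coefficient (PCC) per cell, then the fraction of cells whose PCC exceeds each threshold. The function names, the toy matrices, and the threshold grid on [0, 1] used for the area under the curve are assumptions for this example; the caption does not state the threshold range.

```python
import numpy as np

def cellwise_pcc(truth, pred):
    """Pearson correlation per cell (row) between truth and prediction."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    t = truth - truth.mean(axis=1, keepdims=True)
    p = pred - pred.mean(axis=1, keepdims=True)
    num = (t * p).sum(axis=1)
    denom = np.sqrt((t ** 2).sum(axis=1) * (p ** 2).sum(axis=1))
    out = np.full(truth.shape[0], np.nan)  # NaN where a row is constant
    ok = denom > 0
    out[ok] = num[ok] / denom[ok]
    return out

def fraction_above(pcc, thresholds):
    """Proportion of cells whose PCC surpasses each threshold."""
    pcc = pcc[~np.isnan(pcc)]
    return np.array([(pcc > t).mean() for t in thresholds])

def area_under_curve(x, y):
    """Trapezoidal area under the curve y(x)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

# Toy data: 3 cells x 4 proteins (not from the study).
truth = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
pred = [[2, 4, 6, 8], [4, 3, 2, 1], [1, 1, 1, 1]]
pcc = cellwise_pcc(truth, pred)
thresholds = np.linspace(0.0, 1.0, 11)   # assumed threshold grid
curve = fraction_above(pcc, thresholds)
auc = area_under_curve(thresholds, curve)
```

In the toy data the first cell is predicted perfectly, the second is anti-correlated, and the third has a constant prediction, so its PCC is undefined and it is excluded from the curve.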

Fig 7.

Application of MMIHCL in disease classification and drug target discovery.

(a) UMAP visualization of joint embeddings on the HPAP dataset. Panels in the top row are colored by disease status (Control vs. T1D), and panels in the bottom row are colored by cell type. The clustering performance metrics (NMI and ARI) in the top row are calculated using all cells, whereas those in the bottom row are computed using only the T1D subpopulation. (b) Split violin plots comparing the expression distributions of three representative interferon-stimulated genes (CXCL10, CXCL11, and CCL2) from the Kang18 PBMC dataset across Seurat, MARIO, and MMIHCL. The values annotated above the violins are P-values from Welch’s t-tests. (c) Volcano plot of the differentially expressed genes (DEGs) identified by MMIHCL on the Kang18 PBMC dataset. Significantly up-regulated and down-regulated genes are highlighted in color, whereas non-significant genes are displayed in gray.

Table 1.

Ablation study of MMIHCL under weak and strong linkage scenarios.

Fig 8.

Comprehensive analysis of the akNN mechanism.

(a) Distribution of cell type proportions in the CITE-seq PBMC dataset. (b) Box plots of the learned neighbor counts (k) across cell types, with the Spearman correlation indicating the relationship between cluster size and k. (c) Density estimation of the learned k values for each cell type using kernel density estimation (KDE). Curves are normalized independently so that distribution shapes can be compared across populations of different sizes. (d) UMAP visualization labeled with the proportion of each cell population.
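
The Spearman correlation in panel (b) is the Pearson correlation of the ranks of the two variables. The sketch below computes it with NumPy only; the function names and the five cluster sizes and k values are invented for illustration and are not taken from the figure.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        tie = x == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical cluster sizes and median learned k per cell type.
cluster_size = [5000, 3000, 1200, 400, 90]
median_k = [28, 30, 19, 12, 8]
rho = spearman(cluster_size, median_k)
print(round(rho, 3))  # 0.9
```

A value near 1 would mean that larger cell populations tend to receive larger k, while a value near 0 would mean that k is largely independent of cluster size.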

Fig 9.

Comprehensive analysis of computational cost and scalability.

(a) Benchmarking of Running Time (RT) in minutes across datasets. (b) Benchmarking of peak Memory Consumption (MC) in Gigabytes (GB). (c, d) Scalability analysis of RT and MC on datasets ranging from 10k to 100k cells. The red dashed lines in (c) and (d) mark the 4-hour time limit and the 64 GB memory limit, respectively.
