A benchmark of semi-supervised scRNA-seq integration methods in real-world scenarios

doi:10.1371/journal.pcbi.1014008

Fig 1.

Design of semi-supervised integration benchmarking.

Schematic diagram of the benchmarking workflow. In this study, ten data integration methods, comprising five semi-supervised algorithms and five unsupervised baselines, are evaluated across six integration datasets. Five label scenarios are designed to reflect diverse real-world conditions, including randomly missing labels, randomly wrong labels, missing and mixing at edge, partially annotated batches, automatically generated labels, and labels with varied granularity, together with the baseline setting in which the five unsupervised methods are used and the five semi-supervised methods are provided with full labels. Integration results are assessed using nine metrics that evaluate batch effect removal and conservation of biological variance with respect to the true cell type labels (label conservation).

Fig 2.

Baseline setting: unsupervised methods and semi-supervised methods with full labels.

(a) Bar plots showing the performance of all methods across six datasets under this setting. Each bar represents the overall weighted score of a method; triangles and circles indicate the batch correction and biological conservation scores, respectively. In each dataset sub-figure, the left panel displays the performance of the five semi-supervised methods provided with full labels, represented by solid-colored bars. The right panel shows the performance of the five unsupervised methods, depicted with empty (unfilled) bars. (b) Bubble plot of the relative ranking of all methods by overall weighted score, bio-conservation score, and batch correction score. The first column of each color group represents the average performance of each method across all six datasets. (c) Scatter plot of the average overall batch correction score against the average overall bio-conservation score for each method across all datasets. The error bars represent the standard error of the mean for each method and score type across all datasets. The horizontal red dashed line represents the average bio-conservation score across all methods, while the vertical blue dotted line represents the average batch correction score.
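
The error bars in panel (c) can be illustrated with a minimal sketch of the standard error of the mean. The scores below are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the caption does not state which estimator was used.

```python
import math
import statistics


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) for per-dataset scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two datasets to estimate the SEM")
    mean = statistics.fmean(scores)
    # Assumption: sample standard deviation (n - 1) divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem


# Hypothetical bio-conservation scores of one method on six datasets.
bio_scores = [0.70, 0.74, 0.66, 0.72, 0.68, 0.70]
mean, sem = mean_and_sem(bio_scores)
print(f"mean = {mean:.3f}, SEM = {sem:.4f}")
```

Each point in the scatter plot would then sit at the pair of means (batch correction, bio-conservation), with one such error bar along each axis.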

Fig 3.

Partial Label Scenario I: Randomly Missing Labels.

(a) Bar plots showing the performance of all methods across six datasets under this setting. Each bar represents the overall weighted score of a method; triangles and circles indicate the batch correction and biological conservation scores, respectively. The vertical dashed lines divide the bars into four groups: semi-supervised methods at missing proportions of 30%, 50%, and 70%, and unsupervised approaches. The five unsupervised methods are shown on the right, represented by unfilled bars. (b) Scatter plot of the scaled batch correction score against the scaled bio-conservation score for each method at each missing proportion, averaged across six datasets. For each dataset and missing proportion, the scaled score is the ratio of a method's overall bio-conservation (or batch-mixing) score to the mean of the corresponding score over the five unsupervised methods. The detailed scaling procedure can be found in Methods Section 2.2. Scaled scores for unsupervised methods are also included as unfilled triangles. Colors indicate the methods, and point size represents the missing proportion. The horizontal red dashed line represents the average bio-conservation score across all methods (both semi-supervised and unsupervised), while the vertical blue dotted line represents the average batch correction score. (c) Radar plots showing the performance of all methods on individual metrics for the human pancreas, macaque, human immune, and lung atlas datasets, averaged over the three proportions for semi-supervised methods. Metrics include biological conservation (red) and batch correction (blue). As scCRAFT achieved the highest overall performance among unsupervised methods, only its scores are shown for clarity; radar plots for the remaining methods are provided in Section 3 of S1 Text.
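
The scaling used in panel (b) can be sketched as follows. This is only a reading of the caption, not the procedure from Methods Section 2.2; the function names and all numbers are hypothetical.

```python
from statistics import fmean


def scaled_score(method_score, unsupervised_scores):
    """Ratio of a method's score to the mean score of the unsupervised methods."""
    baseline = fmean(unsupervised_scores)
    if baseline == 0:
        raise ValueError("unsupervised baseline mean is zero; ratio undefined")
    return method_score / baseline


def mean_scaled_score(per_dataset):
    """Average the scaled score over datasets.

    per_dataset: list of (method_score, unsupervised_scores) pairs,
    one pair per dataset at a fixed missing proportion.
    """
    return fmean(scaled_score(m, u) for m, u in per_dataset)


# Hypothetical bio-conservation scores on two datasets at one proportion.
example = [
    (0.80, [0.6, 0.7, 0.8, 0.7, 0.7]),  # unsupervised mean 0.7
    (0.60, [0.5, 0.5, 0.5, 0.5, 0.5]),  # unsupervised mean 0.5
]
avg_scaled = mean_scaled_score(example)
print(round(avg_scaled, 4))
```

Under this reading, a scaled score above 1 means the method scores higher than the average unsupervised method on that axis, and a score below 1 means it scores lower.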

Fig 4.

Scenario III: Missing and Mixing at Edge.

Plots of semi-supervised and unsupervised methods across three selected datasets under varying levels of label reassignment in Scenario III. Results are shown for three datasets: (a) human immune, (b) lung atlas, and (c) lung two species. Each row corresponds to a single dataset and presents a bar plot (left), a scatter plot of scaled scores (middle), and a radar plot of individual metrics (right). Left Panels (Bar Plots): The overall weighted score is shown for semi-supervised methods (solid bars) at different label reassignment proportions (controlled by threshold γ) and for unsupervised methods (unfilled bars). Overlaid circles and triangles indicate the biological conservation and batch correction scores, respectively. Middle Panels (Scatter Plots): The trade-off between the scaled batch correction score (x-axis) and the scaled bio-conservation score (y-axis) is visualized. Methods are color-coded, and the size of the data points corresponds to the reassignment percentage. Solid circles represent semi-supervised results at different γ thresholds; unfilled triangles represent unsupervised methods. The horizontal (red) and vertical (blue) dashed lines mark the mean bio-conservation and batch correction scores, respectively. The scaling procedure is detailed in Methods Section 2.2. Right Panels (Radar Plots): Performance is broken down across individual biological conservation (red) and batch correction (blue) metrics. For semi-supervised methods, scores are averaged across all proportions. For clarity, scCRAFT is shown as a representative high-performing unsupervised method. (Complete radar plots for all methods are provided in Section 3 of S1 Text.)

Fig 5.

Scenario IV: Partially Annotated Batches.

(a) Bar plots showing the performance of all methods across four datasets under this setting. Each bar represents the overall weighted score of a method; triangles and circles indicate the batch correction and bio-conservation scores, respectively. The vertical dashed lines separate the bars into four groups: semi-supervised methods at 30%, 50%, and 70%, and unsupervised methods. The five unsupervised methods are shown on the right, depicted with empty (unfilled) bars. (b) Scatter plot of the scaled batch correction score against the scaled bio-conservation score for each method under the partially annotated batches setting at each proportion, averaged across four datasets. For each dataset and missing proportion, the scaled score is the ratio of a method's overall bio-conservation (or batch-mixing) score to the mean of the corresponding score over the five unsupervised methods. The detailed scaling procedure can be found in Methods Section 2.2. Scaled scores for unsupervised methods are also included as unfilled triangles. Colors indicate the methods, and point size represents the proportion of missing batches. The horizontal red dashed line represents the average bio-conservation score across all methods (both semi-supervised and unsupervised), while the vertical blue dotted line represents the average batch correction score. (c) Radar plots showing the performance of all methods on individual metrics for the human pancreas, macaque, human immune, and lung atlas datasets, averaged over the three proportions for semi-supervised methods. Metrics include biological conservation (red) and batch correction (blue). As scCRAFT achieved the highest overall performance among unsupervised methods, only its scores are shown for clarity; radar plots for the remaining methods are provided in Section 3 of S1 Text.
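
One way to picture the partially annotated batches setting is the sketch below, which removes the labels of a randomly chosen fraction of batches. The function name, the "Unknown" placeholder for masked cells, the rounding rule, and the random selection of batches are all assumptions made for illustration; the caption does not describe how batches were chosen.

```python
import random


def mask_batches(labels, batches, proportion, seed=0, unknown="Unknown"):
    """Replace the labels of a random subset of batches with a placeholder.

    labels, batches: per-cell lists of equal length.
    proportion: fraction of batches whose labels are removed (e.g. 0.3).
    Returns (masked_labels, set_of_masked_batches).
    """
    if len(labels) != len(batches):
        raise ValueError("labels and batches must have the same length")
    rng = random.Random(seed)
    unique_batches = sorted(set(batches))
    # Assumption: the number of masked batches is rounded to the nearest integer.
    n_mask = round(proportion * len(unique_batches))
    masked = set(rng.sample(unique_batches, n_mask))
    masked_labels = [
        unknown if batch in masked else label
        for label, batch in zip(labels, batches)
    ]
    return masked_labels, masked


# Hypothetical toy data: eight cells in four batches.
labels = ["T", "B", "T", "NK", "B", "T", "NK", "B"]
batches = ["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4"]
masked_labels, masked = mask_batches(labels, batches, proportion=0.5)
print(masked, masked_labels)
```

In contrast to randomly missing labels, every cell in a masked batch loses its annotation here, so a semi-supervised method has no label information at all for those batches.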

Table 1.

Number of Cell Types Predicted by Auto-Annotation Tools.

Fig 6.

Scenario V: Integration with Auto-annotated Labels.

(a) Bar plots showing the performance of all methods across four datasets under this setting. Each bar represents the overall weighted score of a method; triangles and circles indicate the batch correction and biological conservation scores, respectively. The vertical dashed lines divide the bars into four groups: semi-supervised methods using labels from Azimuth, CellAssign, and SingleR, and unsupervised approaches. The five unsupervised methods are shown on the right, represented by unfilled bars. (b) Scatter plot of scaled batch correction scores versus scaled biological conservation scores for each method, averaged across the four applicable datasets. Colors indicate methods, and point shapes represent the origin of the labels. For each dataset and set of auto-annotated labels, the scaled score is the ratio of a method's overall bio-conservation (or batch-mixing) score to the mean of the corresponding score over the five unsupervised methods. The detailed scaling procedure can be found in Methods Section 2.2. The horizontal red dashed line marks the average biological conservation score across all methods, while the vertical blue dotted line marks the average batch correction score. (c) Radar plots showing the performance of all methods on individual metrics for the human pancreas, lung two species, human immune, and lung atlas datasets, averaged over all annotation types for semi-supervised methods. Metrics include biological conservation (red) and batch correction (blue). As scCRAFT achieved the highest overall performance among unsupervised methods, only its scores are shown for clarity; radar plots for the remaining methods are provided in Section 3 of S1 Text.

Table 2.

Hierarchical Mapping of Original Fine-grained Labels to Coarse-grained Lineages.

Fig 7.

Impact of inconsistent label granularity on integration performance.

(a-c) Benchmarking results on the human immune dataset where 30% (a), 50% (b), and 70% (c) of batches were assigned coarse-grained labels, while the remainder retained fine-grained labels. Bar plots show the overall integration score (y-axis) across three strategies: coarse label (harmonizing all batches to the coarse level), mixing coarse (using the inconsistent labels as-is), and unannotated (masking coarse labels as “Unknown”). The performance of semi-supervised methods using oracle labels and the performance of unsupervised methods are also plotted as references. Colors represent different integration methods. Note that generative methods (e.g., scDREAMER, scGEN) generally outperform the unannotated baseline when using mixed labels, whereas reference-based methods (e.g., scANVI) show sensitivity to label inconsistency. Dotted lines represent the baseline performance of unsupervised integration.
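
The three labeling strategies compared in this figure can be sketched as below. The function, the strategy keys, and the fine-to-coarse mapping are hypothetical stand-ins (the actual mapping is the one in Table 2); only the behavior of each strategy follows the caption.

```python
def apply_strategy(fine_labels, batches, coarse_batches, fine_to_coarse, strategy):
    """Build the label vector passed to a semi-supervised method.

    coarse_batches: batches that only have coarse-grained annotation.
    strategy: "coarse_label", "mixing_coarse", or "unannotated".
    """
    out = []
    for label, batch in zip(fine_labels, batches):
        if strategy == "coarse_label":
            # Harmonize every batch to the coarse level.
            out.append(fine_to_coarse[label])
        elif strategy == "mixing_coarse":
            # Keep the inconsistent labels as-is: coarse in some batches, fine in others.
            out.append(fine_to_coarse[label] if batch in coarse_batches else label)
        elif strategy == "unannotated":
            # Mask the coarse-labeled batches as "Unknown".
            out.append("Unknown" if batch in coarse_batches else label)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return out


# Hypothetical mapping and toy data (not taken from Table 2).
fine_to_coarse = {"CD4+ T": "T cell", "CD8+ T": "T cell", "Naive B": "B cell"}
fine = ["CD4+ T", "CD8+ T", "Naive B", "CD4+ T"]
batches = ["b1", "b1", "b2", "b2"]
coarse_batches = {"b1"}

for name in ("coarse_label", "mixing_coarse", "unannotated"):
    print(name, apply_strategy(fine, batches, coarse_batches, fine_to_coarse, name))
```

The sketch makes the trade-off explicit: harmonizing to coarse labels discards fine-grained information everywhere, masking discards all information in the coarse batches, and mixing keeps everything at the cost of inconsistent label vocabularies across batches.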

Table 3.

Summary of Datasets Used in Integration.
