Fig 1.
The overall framework for evaluating the reliability of MIL models follows a three-step process.
First, a MIL model is trained on a weakly-supervised task for predicting slide-level labels. Next, the trained model is applied to predict scores for individual image patches. Finally, the reliability value is computed from the predicted patch scores and their corresponding patch-level annotations; in the annotation visualization, tumor patches are highlighted in green and normal patches in orange.
Fig 2.
(I) The test-30 slide with ground truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MAX-POOL, showing the distribution of predicted patch scores from low (blue) to high (red).
The annotation and heatmap are spatially aligned for comparison.
Table 1.
Average reliability, classification, and computational metrics (± standard deviation) over five repetitions for the CAMELYON16 dataset.
Table 2.
Average reliability, classification, and computational metrics (± standard deviation) over five repetitions for the CAMELYON16 dataset using additive models.
Fig 3.
(I) The test-40 slide with ground truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MEAN-POOL-INS, showing the distribution of predicted patch scores from low (blue) to high (red).
The annotation and heatmap are spatially aligned for comparison.
Fig 4.
(I) A slide from CATCH with ground truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MAX-POOL, showing the distribution of predicted patch scores from low (blue) to high (red).
The annotation and heatmap are spatially aligned for comparison.
Table 3.
Average reliability, classification, and computational metrics (± standard deviation) over five repetitions for CATCH.
Table 4.
Average reliability, classification, and computational metrics (± standard deviation) over five repetitions for CATCH using additive models.
Fig 5.
(I) A slide from CATCH with ground truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by ACMIL/4, showing the distribution of predicted patch scores from low (blue) to high (red).
The annotation and heatmap are spatially aligned for comparison.
Table 5.
Average reliability, classification, and computational metrics (± standard deviation) over five repetitions for TCGA BRCA.
Fig 6.
(I) A slide from TCGA BRCA with ground truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MADMIL/3, showing the distribution of predicted patch scores from low (blue) to high (red).
The annotation and heatmap are spatially aligned for comparison.
Fig 7.
(I) A slide from TCGA BRCA with ground truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MEAN-POOL-INS, showing the distribution of predicted patch scores from low (blue) to high (red).
The annotation and heatmap are spatially aligned for comparison.
Table 6.
Average reliability, classification, and computational metrics (± standard deviation) over five repetitions for TCGA BRCA using additive models.
Fig 8.
Bar plots comparing the models across the different evaluation metrics.
Table 7.
The overall mean of the average reliability, classification, and computational metrics across the CAMELYON16, CATCH, and TCGA BRCA datasets.
Table 8.
The overall mean of the average reliability, classification, and computational metrics for additive models across the CAMELYON16, CATCH, and TCGA BRCA datasets.
Table 9.
Analysis of reliability metrics showing the effect of excluding each metric on model ranking.
Rankings are obtained by summing the scores of the selected reliability metrics (with equal weight) across CAMELYON16, CATCH, and TCGA BRCA datasets, and ranking models based on the aggregated score.
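The ranking procedure above (equal-weight summation of reliability metric scores across the three datasets, then sorting models by the aggregated total) can be sketched as follows. This is a minimal illustration, not the paper's code: the model names are taken from the figures, but the score values and the number of metrics are invented placeholders.

```python
import numpy as np

# Hypothetical per-model scores with shape (n_models, n_datasets, n_metrics).
# Datasets: CAMELYON16, CATCH, TCGA BRCA; values are illustrative only.
models = ["MAX-POOL", "MEAN-POOL-INS", "ACMIL/4", "MADMIL/3"]
scores = np.array([
    [[0.72, 0.65], [0.58, 0.61], [0.70, 0.66]],
    [[0.68, 0.70], [0.62, 0.59], [0.73, 0.71]],
    [[0.75, 0.69], [0.66, 0.64], [0.69, 0.68]],
    [[0.71, 0.73], [0.60, 0.63], [0.72, 0.70]],
])

# Equal-weight aggregation: sum the selected reliability metrics
# across all datasets for each model.
aggregated = scores.sum(axis=(1, 2))

# Rank models by aggregated score, higher is better.
ranking = sorted(zip(models, aggregated), key=lambda t: -t[1])
for rank, (name, total) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {total:.2f}")
```

Dropping a metric (as in the exclusion analysis of Table 9) amounts to slicing it out of the last axis before the sum.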