Fig 1.
BioFuse architecture and workflow.
The upper section illustrates the training process: input samples are preprocessed and fed into multiple foundation models to generate embeddings, which are then concatenated and used to train a classifier. The lower section shows the evaluation process using the same BioFuseModel architecture on a validation set, followed by performance assessment of the trained model.
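The training and evaluation flow in this caption can be sketched in a few lines. The following is a minimal illustration, not the BioFuse implementation: the two extractor functions are hypothetical stand-ins for frozen foundation models, the data are toy values, and a nearest-centroid classifier is swapped in for whatever classifier the actual pipeline trains.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Extractor = Callable[[Sequence[float]], Vector]


# Hypothetical stand-ins for frozen foundation models: each maps a
# preprocessed sample to a fixed-length embedding.
def model_a(x: Sequence[float]) -> Vector:
    return [sum(x) / len(x), max(x)]


def model_b(x: Sequence[float]) -> Vector:
    return [min(x), x[0] - x[-1], float(len(x))]


def fuse(sample: Sequence[float], extractors: List[Extractor]) -> Vector:
    """Concatenate the embeddings from every extractor into one vector."""
    fused: Vector = []
    for extract in extractors:
        fused.extend(extract(sample))
    return fused


def fit_centroids(features: List[Vector], labels: List[int]) -> Dict[int, Vector]:
    """Stand-in classifier (assumption): one mean vector per class."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def predict(centroids: Dict[int, Vector], vec: Vector) -> int:
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(center: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(center, vec))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


# Toy data, purely illustrative.
extractors = [model_a, model_b]
train_x = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.2], [0.9, 0.8, 0.7], [0.8, 0.9, 0.9]]
train_y = [0, 0, 1, 1]
val_x = [[0.15, 0.2, 0.25], [0.85, 0.9, 0.8]]
val_y = [0, 1]

# Training: embed, concatenate, fit. Evaluation: reuse the same extractors.
train_emb = [fuse(x, extractors) for x in train_x]
val_emb = [fuse(x, extractors) for x in val_x]
centroids = fit_centroids(train_emb, train_y)
val_pred = [predict(centroids, e) for e in val_emb]
val_accuracy = sum(p == y for p, y in zip(val_pred, val_y)) / len(val_y)
```

The key point is that the validation set passes through the same list of extractors in the same order, so the fused vectors have the same layout as the ones the classifier was trained on.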
Table 1.
Summary of foundation models supported in BioFuse.
Fig 2.
Visual overview of the MedMNIST+ 2D datasets.
Sample images at 224×224 resolution showcase the diverse collection spanning multiple medical imaging modalities, illustrating the wide range of diagnostic tasks represented in the benchmark.
Table 2.
Overview of MedMNIST+ 2D datasets.
Fig 3.
High-level workflow of BioFuse within a typical machine learning pipeline.
BioFuse serves as an embedding generator that accepts training and validation sets as inputs. It outputs embeddings for both sets, along with a configured BioFuseModel, which then generates embeddings for the test set.
Table 3.
Performance comparison between BioFuse and the best existing methods on MedMNIST+. Bold values indicate the best performance for each dataset.
Fig 4.
Heatmap of test AUC, showing single model performance across MedMNIST+ datasets.
CLIP is included as a baseline general-purpose foundation model for comparison. A white box indicates the top performer, while red boxes highlight the second- and third-best performers for each dataset. More than three models may be highlighted if they share identical values.
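The highlighting rule can be read as a ranking over distinct score values, which is why ties can push the number of highlighted cells above three. The sketch below shows one such reading. It is an assumption about how the boxes are assigned, not the authors' plotting code, and the model names, scores, and rounding precision are placeholders.

```python
from typing import Dict, List


def highlight_ranks(scores: Dict[str, float], decimals: int = 3) -> Dict[str, List[str]]:
    """Assign 'white' to the best value and 'red' to the next two distinct values.

    Models that share a value after rounding share a highlight, so more
    than three models can be marked for one dataset.
    """
    if not scores:
        return {"white": [], "red": []}
    rounded = {name: round(value, decimals) for name, value in scores.items()}
    # Distinct values, best first.
    distinct = sorted(set(rounded.values()), reverse=True)
    top_value = distinct[0]
    runner_up_values = distinct[1:3]  # second- and third-best distinct values
    white = [name for name, value in rounded.items() if value == top_value]
    red = [name for name, value in rounded.items() if value in runner_up_values]
    return {"white": white, "red": red}


# Placeholder scores for one dataset (not results from the paper).
example = {"m1": 0.95, "m2": 0.93, "m3": 0.93, "m4": 0.91, "m5": 0.80}
marks = highlight_ranks(example)
```

With these placeholder scores, `m2` and `m3` tie for second place, so four models are highlighted in total: one white and three red.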
Fig 5.
Heatmap of test accuracy, showing single model performance across MedMNIST+ datasets and ImageNet-1K (top-1).
CLIP is included as a baseline general-purpose foundation model for comparison. White boxes indicate top performers, while red boxes highlight the second- and third-best performers for each dataset. When multiple models achieve identical performance, more than three boxes may be highlighted.
Table 4.
Performance comparison of biomedical foundation models on ImageNet-1K. Bold values indicate best performance.
Table 5.
Ablation 1 – Self-attention fusion vs. concatenation.
The “Concat” columns reproduce the BioFuse baseline. The Model combination column lists the backbones selected for the self-attention run only; the corresponding concat combinations are given in Table 3. Δ = Self-attn − Concat. Backbones printed in bold overlap with the concat baseline.
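To make the two fusion strategies in this ablation concrete, the sketch below contrasts plain concatenation with a bare-bones scaled dot-product self-attention over one token per backbone. It rests on several assumptions not stated in the caption: all backbone embeddings have already been projected to a common dimension, the query, key, and value projections are the identity (a trained module would learn them), and the attended tokens are mean-pooled.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def concat_fuse(tokens: List[Vector]) -> Vector:
    """Concatenation: output length is the sum of the token lengths."""
    return [x for token in tokens for x in token]


def self_attention_fuse(tokens: List[Vector]) -> Vector:
    """Single-head self-attention with identity projections, then mean pooling.

    tokens: one embedding per backbone, all of the same dimension (assumption).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attended: List[Vector] = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens.
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    # Mean-pool the attended tokens into one fused vector.
    return [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]


# Three hypothetical backbone embeddings of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_concat = concat_fuse(tokens)
fused_attn = self_attention_fuse(tokens)
```

Under these assumptions the attention output keeps the per-token dimension regardless of how many backbones are fused, whereas the concatenated vector grows with each added backbone.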
Table 6.
Ablation 2 – Oracle-single vs. concatenation.
Concat figures are the main BioFuse results; Oracle-single uses only the best individual model for each dataset. Δ = Oracle − Concat.
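The Δ column follows directly from the definition in this caption. The sketch below computes it for placeholder scores; the dataset names, model names, and values are hypothetical, and the caption does not say which data split is used to pick the best individual model, so that choice is left out.

```python
from typing import Dict


def oracle_vs_concat(
    single_scores: Dict[str, Dict[str, float]],
    concat_scores: Dict[str, float],
) -> Dict[str, Dict[str, object]]:
    """For each dataset, pick the best single model and compute Oracle - Concat."""
    rows: Dict[str, Dict[str, object]] = {}
    for dataset, per_model in single_scores.items():
        best_model = max(per_model, key=per_model.get)
        oracle = per_model[best_model]
        concat = concat_scores[dataset]
        rows[dataset] = {
            "oracle_model": best_model,
            "oracle": oracle,
            "concat": concat,
            "delta": oracle - concat,  # negative means concatenation is ahead
        }
    return rows


# Placeholder values only (not results from the paper).
single = {
    "dataset_x": {"model_a": 0.90, "model_b": 0.92},
    "dataset_y": {"model_a": 0.88, "model_b": 0.80},
}
concat = {"dataset_x": 0.94, "dataset_y": 0.85}
table = oracle_vs_concat(single, concat)
```

With this sign convention, a negative Δ means the concatenated embedding outperforms the best individual model on that dataset.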
Table 7.
Robustness on MedMNIST-C (lower is better; raw ratio scale).
Fig 6.
End-to-end runtime per dataset.
Stacked bars show total wall-clock time (hours) to reproduce each BioFuse experiment. The blue segment indicates embedding extraction for the nine backbones, the green segment the XGBoost evaluation runs, and the orange segment the hyperparameter sweep on Weights & Biases.