Abstract
The biomedical field has witnessed a surge in pre-trained foundation models excelling in specific subdomains such as radiology and histopathology. While integrating these models promises a more comprehensive understanding of biomedical data, it poses challenges in model compatibility and feature fusion. We present BioFuse, a novel open-source framework designed to generate optimised biomedical embeddings. BioFuse utilises a pool of 9 state-of-the-art (SOTA) foundation models to create task-specific embeddings. It employs grid search to automatically identify the optimal combination of models, fusing their embeddings through vector concatenation. On the MedMNIST+ benchmark, using XGBoost as the downstream classifier, BioFuse outperforms several existing methods, achieving SOTA AUC in 5/12 datasets, while maintaining near-SOTA performance across most remaining datasets. Remarkably, our experiments reveal unexpected cross-modal capabilities, with histopathology and radiology models showing strong performance when applied to other imaging modalities. BioFuse features a high-level API for immediate deployment and an extensible architecture to incorporate future models and fusion techniques. We anticipate BioFuse will not only enhance the utility of foundation models in biomedicine but also open new avenues for uncovering cross-modal relationships in biomedical data.
Citation: Hossain MN, Harris-Birtill D (2026) BioFuse: an embedding fusion framework for biomedical foundation models. PLoS One 21(3): e0320989. https://doi.org/10.1371/journal.pone.0320989
Editor: Xu Yanwu, South China University of Technology, CHINA
Received: March 10, 2025; Accepted: February 16, 2026; Published: March 18, 2026
Copyright: © 2026 Hossain, Harris-Birtill. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: 1. MedMNIST+: https://zenodo.org/records/10519652 2. BioFuse Embedding Cache for MedMNIST+ 2D Datasets: https://doi.org/10.5281/zenodo.16732578 3. BioFuse Embedding Cache for ImageNet-1K: https://doi.org/10.5281/zenodo.14930584 4. BioFuse GitHub repository: https://github.com/mnhcorp/biofuse.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Artificial Intelligence (AI) is driving advances in biomedical research and clinical practice [1–3]. Foundation models – large, pre-trained models capable of performing a wide variety of tasks with minimal adaptation – have emerged as key tools in this domain. Trained on vast, diverse datasets [4], these models have demonstrated promising performance in various biomedical subdomains [5].
In medical imaging, foundation models are used to detect and classify abnormalities in radiographs [5], generate diagnostic reports [6], segment images to delineate organs and tumours [7], analyse histopathology slides for disease diagnosis [8,9], and integrate information across imaging modalities to enhance diagnostic accuracy [10]. In clinical natural language processing (NLP), they are employed for clinical text summarization [11] and information extraction [12], while in genomics, they help decipher the language of non-coding DNA [13] and allow accurate cell type annotation [14].
1.1 Motivation
Typically, researchers manually select foundation models tailored for specific tasks [15]. These models increasingly serve as feature extractors, transforming input data into dense numerical representations called embeddings [16]. These embeddings encapsulate complex features of the input data in high-dimensional vectors, allowing efficient computation and comparison.
However, foundation models are often trained on single modalities, with research groups developing specialized architectures for specific domains [17–19]. While this approach enables exceptional performance within designed scopes, these models may have learned features relevant beyond their original domains. As domain-specific models become increasingly refined [20], there is a growing need to explore their broader applicability across biomedical tasks and modalities.
Combining embeddings from multiple models allows us to exploit modality-specific encodings while gaining new perspectives through cross-modal integration. Various approaches have demonstrated the value of leveraging multiple modalities – for instance, ConVIRT [21] utilises contrastive learning between paired images and text to enhance chest X-ray classification performance. Such cross-modal techniques suggest that strategic integration of specialized models could generate more comprehensive representations of biomedical data and unlock insights inaccessible to single-modality approaches.
This leads us to ask two key questions:
- Does combining the embeddings of multiple foundation models trained on different biomedical modalities improve task performance, as measured by AUC (area under the curve) and accuracy?
- How effectively do foundation models transfer knowledge across different biomedical imaging modalities, as evidenced by comparative AUC and accuracy scores in cross-modal applications?
Despite the potential benefits, effectively integrating multiple foundation models presents major challenges. Retraining models on combined data from multiple modalities demands extensive computational resources and expertise [22]. While combining pre-trained model embeddings offers an alternative [23,24], the rapidly growing number of foundation models [25] makes identifying optimal combinations increasingly complex. Moreover, the lack of standardized frameworks for embedding fusion hinders our ability to fully exploit these models’ collective knowledge.
1.2 Contributions
In response to these challenges, we introduce BioFuse, an open-source framework designed to harness the collective power of diverse biomedical foundation models. This paper makes the following key contributions:
- Enhanced Performance on the MedMNIST+ Benchmark. Using embeddings generated by BioFuse combined with XGBoost classification, we outperform several existing baselines on the MedMNIST+ benchmark, achieving the highest test AUC in 5 of 12 datasets and best accuracy in 2 datasets. The framework’s embeddings maintain near-SOTA performance (within 2% margin) across most remaining datasets, demonstrating the effectiveness of our embedding fusion approach.
- Revealing Cross-Modal Capabilities in Biomedical Foundation Models. Our experiments reveal unexpected cross-modal transfer in both histopathology and radiology models, showcasing the potential for models trained on one imaging modality to generate effective embeddings for others.
- An Extensible Framework for Optimised Biomedical Embedding Generation. BioFuse provides a high-level API that simplifies the generation of task-specific embeddings. It automatically identifies optimal combinations of foundation models and fuses their outputs, addressing the challenges of model integration. The framework’s architecture allows for easy incorporation of new models and fusion techniques, ensuring adaptability to emerging developments in biomedical AI.
By providing a standardized approach to embedding fusion and automating model selection, BioFuse enhances the utility of foundation models in biomedicine while opening new avenues for uncovering relationships between diverse biomedical data representations.
2. Background and related work
2.1 Foundation models in biomedicine
Foundation models, driven by Large Language Models (LLMs), have ushered in a new era in deep learning [26]. Three key factors enabled their development: the Transformer architecture [27] for processing sequential data, advances in GPU capabilities [28], and extensive training datasets [29,30]. These billion-parameter models [12] require substantial computing infrastructure for training [31], but can then solve tasks beyond their original training objectives [32]. While they support zero-shot inference and fine-tuning [26], they are commonly used as feature extractors for classification models [16,33].
Foundation models are typically trained using self-supervised learning (SSL) techniques, including Contrastive Learning [34], Masked Image Modelling (MIM) [35], and self-distillation [36]. These methods create proxy tasks from unlabelled data: contrastive learning differentiates between similar and dissimilar samples, MIM reconstructs masked image regions, and self-distillation leverages the model’s own predictions as training targets. By utilizing the inherent structure of data, SSL enables training on much larger datasets without manual annotation. This approach has demonstrated superior performance and transferability compared to traditional supervised learning [34,35,37], making it particularly valuable in medical imaging where labelled data is scarce.
2.1.1 Unimodal foundation models.
Biomedical imaging foundation models have demonstrated remarkable effectiveness when trained on single modalities (unimodal training). In computational pathology, models like UNI [38], UNI2 [39], Prov-GigaPath [40], and Hibou-B [41] have set new benchmarks through increasingly larger-scale training. UNI outperforms prior models across 34 clinical tasks, while its successor UNI2 expands capabilities to both H&E and Immunohistochemistry (IHC). Prov-GigaPath achieves state-of-the-art performance in cancer subtyping and mutation prediction, and Hibou-B demonstrates high adaptability across pathology applications. In radiology, RAD-DINO [42], pre-trained on a vast corpus of chest X-rays, excels in detecting conditions and capturing detailed features crucial for biomarker discovery and prognosis prediction. Built on architectures like Vision Transformers (ViT) [43] and trained with self-supervised learning, these models demonstrate how large-scale training captures modality-specific nuances and enables superior biomedical imaging performance.
2.1.2 Vision language models.
Vision-language models (VLMs) integrate visual and textual data to learn joint representations for tasks like image classification, retrieval, and captioning. BioMedCLIP [44] and PubMedCLIP [45] achieve state-of-the-art performance across various biomedical benchmarks, while CheXagent [46] excels in chest X-ray interpretation tasks and CONCH [47] demonstrates superior performance across 13 histopathology benchmarks. These models demonstrate how integrating visual and textual information enhances biomedical imaging analysis.
2.1.3 Limitations of foundation models.
Despite their capabilities, biomedical foundation models face serious limitations including restricted cross-domain applicability, high computational demands, and challenges in interpretability crucial for clinical decision-making. While using them as feature extractors mitigates the risk of hallucinations [20], effectively leveraging the collective knowledge of multiple specialized models remains an open challenge that could enhance performance and provide more robust biomedical insights.
2.2 Related work
Prior work relevant to BioFuse spans three key areas: embedding fusion approaches that combine features from multiple models, cross-modal methods that transfer knowledge between domains, and automated selection techniques that optimize model combinations.
2.2.1 Embedding fusion.
Recent works have demonstrated the benefits of combining embeddings from multiple models. In histopathology, Neidlinger et al. [24] showed that combining four foundation models improved tumor classification and survival prediction across 13 patient cohorts, but their approach was limited to a single modality and lacked a framework for optimal model selection. Similarly, Zarif et al. [48] and Dong et al. [49] combined two and four pre-trained CNNs respectively for breast and liver cancer classification, but their approaches used bespoke architectures specific to their tasks. In clinical NLP, BioFLAIR [50] combines two embedding models (FLAIR and BioELMo) for biomedical named entity recognition, while the Concatenated BioMed-Transformers [51] fuses three transformer models for medical text classification, but both were limited to single modalities.
2.2.2 Cross-modal transfer.
Cross-modal approaches have attempted to bridge different modalities in biomedical data. Zipper [52] combines two pre-trained unimodal models for speech recognition and text-to-speech tasks using multi-tower decoders with cross-attention, but its computational complexity limits practical application. BioBridge [53] uses knowledge graphs to connect two unimodal foundation models for protein sequence-text retrieval and drug design tasks, yet fails to fully exploit the embedded knowledge across different biomedical domains.
2.2.3 Automated model selection.
In automated model selection, ACE [54] uses reinforcement learning to select optimal combinations of word embedding models for NLP tasks like named entity recognition and part-of-speech tagging. However, its focus on text-only tasks fails to address the multimodal nature of biomedical data, and its computational demands make it impractical for high-dimensional data where rapid model selection is often needed.
2.2.4 Limitations of related work.
While these approaches have made impressive strides, they often focus on specific domains or tasks, lacking the versatility to leverage image extraction capabilities across models trained on different modalities. Moreover, the untapped potential in combining foundation models to enhance performance and generalisability across diverse modalities remains a non-trivial challenge. BioFuse aims to address these limitations by providing a flexible framework that can integrate multiple foundation models across various biomedical domains, facilitating both embedding fusion and cross-modal transfer while automating the selection of optimal model combinations.
3. Methods
3.1 Ethics statement
This research was conducted under ethical approval from the institutional review board (approval code CS17485).
3.2 BioFuse framework
3.2.1 Overview.
BioFuse is an open-source framework that automates the selection, extraction, and fusion of embeddings from multiple pre-trained biomedical foundation models across diverse modalities such as radiographs, histopathology slides, and clinical text. Its modular design supports seamless integration of new models and fusion methods, keeping pace with rapid advancements.
BioFuse’s primary goal is to generate optimal embeddings by leveraging multiple models via vector concatenation. These fused embeddings encapsulate multimodal information in a unified format. To assess their quality, BioFuse employs an approach similar to linear probing [55], but with a more sophisticated classifier. Specifically, we use XGBoost [56], a powerful tree-based machine learning algorithm, to train on the frozen, fused embeddings for various downstream tasks. This evaluation method provides a reliable measure of embedding quality without requiring computationally expensive fine-tuning of the foundation models themselves.
Users can utilise these embeddings as input features for custom models tailored to specific research questions or clinical applications. By offering optimised embeddings and demonstrating their effectiveness through advanced probing, BioFuse serves as a versatile feature extraction framework for biomedical imaging tasks.
3.2.2 System architecture.
BioFuse’s architecture comprises three components: pre-trained foundation models, the BioFuseModel, and the search module (see Fig 1):
The upper section illustrates the training process: input samples are preprocessed and fed into multiple foundation models to generate embeddings, which are then concatenated and used to train a classifier. The lower section shows the evaluation process using the same BioFuseModel architecture on a validation set, followed by performance assessment of the trained model.
- Pre-trained Foundation Models. Existing, pre-trained models from various biomedical domains are loaded without fine-tuning to preserve their original capabilities.
- BioFuseModel. The core component that processes and integrates embeddings from multiple pre-trained models in a single forward pass per input. It preprocesses each model’s input, extracts embeddings, and fuses them via vector concatenation, preserving the strengths of each model while ensuring efficiency.
- Search Module. Evaluates embedding performance from different model combinations (each represented by a BioFuseModel) by computing validation accuracy using XGBoost as a lightweight downstream classifier. This step automates identifying optimal configurations.
3.2.3 Foundation models in BioFuse.
Foundation models in BioFuse were selected based on architectural diversity, performance, efficiency, and accessibility. Both unimodal and vision-language models (VLMs) are included to capture diverse biomedical information, with VLMs chosen for their pre-training on paired text-image data. Models trained on high-quality biomedical datasets in various imaging modalities, from macroscopic radiological images to microscopic histological data, were preferred for a comprehensive representation of knowledge.
The selection process prioritized models documented in peer-reviewed literature with demonstrated excellence across multiple biomedical applications. We carefully balanced computational efficiency against performance to ensure BioFuse remains practical for real-world implementation. Additionally, we favored models with standardized implementations on platforms such as Hugging Face [57] to enhance accessibility and community support. Through this thoughtful curation, BioFuse incorporates a diverse array of foundation models that collectively provide robust and adaptable capabilities across the biomedical domain.
Table 1 summarizes the foundation models supported by BioFuse, including dataset size, pre-training methods, and encoder types.
3.2.4 Embedding extraction.
BioFuse streamlines the extraction of embeddings from diverse pre-trained foundation models across multiple biomedical imaging modalities. The framework leverages established libraries including Hugging Face Transformers [59], OpenCLIP [60], and PyTorch Image Models [61] to facilitate this process. The system automatically handles model and preprocessor loading while applying appropriate model-specific preprocessing—such as image resizing and normalization—before generating embeddings.
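The extraction step described above can be sketched as a small batching loop. This is an illustrative sketch rather than BioFuse's implementation: the `model_fn` and `preprocess_fn` callables stand in for a loaded foundation-model encoder (e.g. a Hugging Face or timm ViT) and its model-specific preprocessor.

```python
import numpy as np

def extract_embeddings(model_fn, preprocess_fn, images, batch_size=32):
    """Apply model-specific preprocessing, then run images through the
    encoder in batches and stack the resulting embedding vectors."""
    embeddings = []
    for start in range(0, len(images), batch_size):
        batch = [preprocess_fn(img) for img in images[start:start + batch_size]]
        # In BioFuse this forward pass would run on the GPU, with
        # intermediate outputs released immediately after use.
        embeddings.append(model_fn(np.stack(batch)))
    return np.concatenate(embeddings, axis=0)
```

With a real backbone, `preprocess_fn` would resize and normalise each image and `model_fn` would return one embedding row per input; here any pair of callables with those shapes will do.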
The framework employs GPU acceleration to enhance inference speed when processing large-scale datasets. Memory management is optimised through immediate release of intermediate outputs after use, enabling the system to scale effectively with increasingly complex inputs. These technical enhancements ensure BioFuse can efficiently handle diverse biomedical imaging modalities while maintaining computational efficiency in resource-constrained environments.
3.2.5 Fusion methodology.
BioFuse fuses embeddings from multiple pre-trained models using vector concatenation, combining embeddings into a single representation.
Let $\mathbf{e}_i \in \mathbb{R}^{d_i}$ be the embedding from the $i$-th model, where $d_i$ is its dimensionality. Given $n$ models, the fused embedding is:

$\mathbf{e}_{\text{fused}} = \mathbf{e}_1 \oplus \mathbf{e}_2 \oplus \cdots \oplus \mathbf{e}_n$

where $\oplus$ denotes concatenation along the feature dimension, yielding a total size:

$d_{\text{fused}} = \sum_{i=1}^{n} d_i$

For instance, if $n = 2$ with $d_1 = 768$ and $d_2 = 512$, then $d_{\text{fused}} = 1280$.
Advantages of vector concatenation:
- Efficient & Simple: No additional parameters or projection layers, reducing computational complexity.
- Feature Preservation: Retains all modality-specific features without loss.
- Scalable: New models can be seamlessly integrated without retraining fusion components.
- Interpretable: Maintains per-model feature separation, aiding downstream classifiers.
Feature-scale alignment. Although the nine foundation models differ in objective, domain, and ViT scale, the length of their flattened embeddings is still within a single order of magnitude (512–1536 dimensions). Any scale mismatch across those vectors could, in principle, bias the downstream learner. Two properties mitigate this risk: (i) concatenation keeps each model’s features in isolated sub-blocks, so no cross-model normalisation is required; (ii) the downstream classifier is XGBoost, a tree-based ensemble whose splitting decisions depend primarily on the relative ordering of feature values rather than their absolute magnitudes [56,62]. While feature importance scores may retain some scale sensitivity, empirical studies on gradient-boosted trees demonstrate that standard scaling transformations (z-score, min–max) typically change test accuracy by less than 0.5 percentage points [63], indicating that scale heterogeneity is not a limiting factor for BioFuse’s performance.
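The fusion step itself reduces to a single array operation. A minimal NumPy sketch, using two hypothetical backbones with 768- and 512-dimensional embeddings (the dimensions are illustrative, not tied to specific BioFuse models):

```python
import numpy as np

def fuse_embeddings(embedding_list):
    """Fuse per-model embeddings of shape (n_samples, d_i) by
    concatenating along the feature dimension; each model's features
    remain in an isolated sub-block of the fused vector."""
    return np.concatenate(embedding_list, axis=1)

# Two hypothetical backbones embedding the same 10 samples:
e1 = np.random.rand(10, 768)
e2 = np.random.rand(10, 512)
fused = fuse_embeddings([e1, e2])
assert fused.shape == (10, 1280)  # d_fused = 768 + 512
```

Because no projection layer is learned, adding a third model only widens the fused vector; nothing needs retraining.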
3.2.6 Automated selection of models.
BioFuse automates model selection, identifying the optimal set for each task without manual intervention. It evaluates all model combinations, leveraging GPU-accelerated XGBoost to efficiently search this exponential space. Embeddings are fused and rapidly assessed using gradient boosting. GPU acceleration mitigates computational overhead by parallelizing tree construction and handling large feature matrices efficiently. This enables BioFuse to evaluate model combinations in seconds and process larger datasets within minutes, ensuring thorough exploration of the combinatorial space.
An on-disk cache stores previously generated embeddings to avoid redundant computations. BioFuse selects the combination with the highest validation accuracy, ensuring embeddings are tailored to the specific task while maintaining generalisation within the dataset.
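The search described above can be sketched as an exhaustive sweep over non-empty model subsets with a per-model embedding cache. This is a simplified sketch: `score_fn` stands in for training GPU-accelerated XGBoost on the fused training embeddings and returning validation accuracy, and the in-memory dict stands in for BioFuse's on-disk cache.

```python
from itertools import combinations
import numpy as np

def search_best_combination(model_names, embed_fn, score_fn):
    """Evaluate every non-empty subset of models and return the subset
    whose fused embeddings achieve the highest validation score."""
    cache = {}  # stands in for BioFuse's on-disk embedding cache

    def embeddings_for(name):
        if name not in cache:
            cache[name] = embed_fn(name)  # computed once per model
        return cache[name]

    best_combo, best_score = None, -np.inf
    for k in range(1, len(model_names) + 1):
        for combo in combinations(model_names, k):
            fused = np.concatenate([embeddings_for(m) for m in combo], axis=1)
            score = score_fn(fused)  # e.g. XGBoost validation accuracy
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score
```

The cache makes the dominant cost the $2^n - 1$ classifier fits rather than repeated embedding extraction, which is what keeps the exhaustive sweep tractable for nine models.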
3.3 Dataset
As illustrated in Fig 2, our study uses 12 2D datasets from the MedMNIST+ benchmark [64], a large-scale, MNIST-like collection of standardized biomedical images designed for various machine learning tasks in the medical domain. The dataset was accessed periodically between June 1, 2024, and February 28, 2025, for research purposes. Key features of the dataset include:
- Multi-modal: MedMNIST+ covers a wide range of biomedical imaging modalities, including X-Ray, OCT, Ultrasound, CT, Histopathology and Electron Microscopy images.
- Standardized: All images are pre-processed into a uniform 224 × 224 resolution for 2D datasets, with corresponding classification labels. The datasets also come with consistent train-validation-test splits.
- Multi-task: The dataset supports various machine learning tasks, including binary and multi-class classification, ordinal regression, and multi-label classification.
- Multi-scale: MedMNIST includes approximately 708K 2D images, with dataset sizes ranging from 780 to 236,386 samples.
Sample images at 224x224 resolution showcase the diverse collection spanning multiple medical imaging modalities, illustrating the wide range of diagnostic tasks represented in the benchmark.
The decision to focus on 2D datasets was driven by the use of a simple classifier which is well-suited for 2D image analysis tasks. Table 2 provides an overview of the MedMNIST2D datasets used in this study.
MedMNIST+ enables a comprehensive evaluation of BioFuse across multi-class, binary, and multi-label classification, spanning diverse dataset sizes and class distributions. For consistency, we treat the ordinal regression task (RetinaMNIST) as a multi-class classification problem, a standard approach for handling ordinal data in machine learning [74].
3.4 Experiments
3.4.1 Objectives.
Our experiments assess:
- Performance improvement: Evaluate whether fusing embeddings via vector concatenation outperforms individual models on MedMNIST+ using AUC and Accuracy.
- Cross-modal transfer: Test if features from one modality (e.g., histopathology) enhance performance when combined with another (e.g., radiographs), measured by AUC and Accuracy.
- Generalisation: Assess BioFuse’s ability to maintain strong performance across diverse biomedical domains using nine foundation models.
- Fusion method validation: Compare vector concatenation against alternative fusion strategies (self-attention) and single-model baselines to justify the chosen approach.
- Robustness evaluation: Assess BioFuse’s performance under realistic image corruptions using MedMNIST-C.
These objectives establish BioFuse as a versatile framework for multimodal biomedical imaging, leveraging complementary information to improve classification performance across datasets.
3.4.2 Hardware configuration.
The experiments were conducted on a server with an NVIDIA A100 GPU (80GB VRAM), an AMD EPYC 7713 CPU (64-core, 128-thread CPU), and 1 TB RAM. While the CPU and memory were shared, the GPU was dedicated to BioFuse, ensuring uninterrupted computation for embedding extraction, model fusion, and classification.
3.4.3 Model selection classifier.
For internal evaluation (see Fig 1), we used XGBoost [56], a scalable gradient-boosted decision tree algorithm known for its efficiency in binary and multi-class classification. The model was trained on fused embeddings from multiple foundation models.
Although embedding order can influence performance, we found this effect negligible. Optimizing order introduces substantial complexity: with $n$ models there are $2^n - 1$ model combinations and up to $k!$ possible orders for a combination of $k$ models, making exhaustive search impractical. To maintain computational tractability, we fixed the concatenation order.
Hyperparameters were manually selected following [75], balancing efficiency and accuracy. We set n_estimators = 250, learning_rate = 0.1, and max_depth = 6 to prevent overfitting while capturing biomedical data patterns. The objective function was multi:softprob for multi-class, binary:logistic for binary classification, and a one-versus-rest strategy for multi-label tasks. Training used XGBoost’s GPU acceleration to reduce computation time.
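The fixed configuration above can be expressed as a small helper. The parameter values (n_estimators = 250, learning_rate = 0.1, max_depth = 6) and objectives follow the text; the function itself is an illustrative sketch, not BioFuse's API, and the `device` key assumes XGBoost 2.x (older releases express GPU training differently).

```python
def xgb_params(task, n_classes=None):
    """Return the XGBoost configuration used for model selection.

    task: "binary", "multiclass", or "multilabel". Multi-label tasks
    are handled one-versus-rest, i.e. one binary classifier per label.
    """
    params = {
        "n_estimators": 250,
        "learning_rate": 0.1,
        "max_depth": 6,    # shallow trees to limit overfitting
        "device": "cuda",  # GPU-accelerated tree construction (XGBoost >= 2.0)
    }
    if task == "multiclass":
        params["objective"] = "multi:softprob"
        params["num_class"] = n_classes
    else:  # binary, or each head of a one-vs-rest multi-label setup
        params["objective"] = "binary:logistic"
    return params
```

The resulting dict can be unpacked into `xgboost.XGBClassifier(**params)` when fitting on the fused embeddings.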
3.4.4 Training and evaluation workflow.
Our evaluation of BioFuse follows a structured workflow, as illustrated in Fig 3:
- Dataset Preparation: BioFuse is provided with training and validation sets for each dataset in MedMNIST+ 2D.
- Model Selection: It identifies the optimal combination of foundation models for each dataset (see Fig 1).
- Embedding Generation: Using the selected model combination, BioFuse generates training and validation embeddings and returns a configured BioFuseModel (embedding generator).
- Classifier Selection and Tuning: For the final evaluation, we use XGBoost, tuning its hyperparameters on training embeddings and selecting the configuration that achieves the highest validation accuracy.
- Test Embedding Generation: The trained BioFuseModel generates embeddings for the test set, ensuring consistency with the training and validation process.
- Final Evaluation: The tuned XGBoost model (from Step 4) is evaluated on the test set embeddings to assess overall performance.
BioFuse serves as a sophisticated embedding generator, accepting training and validation sets as inputs. As outputs, BioFuse provides embeddings for both the training and validation sets, along with a configured BioFuseModel. This BioFuseModel acts as an embedding generator for subsequent use on the test set.
To ensure a comprehensive final evaluation on the test sets, we conducted independent hyperparameter tuning for each dataset in the MedMNIST+ benchmark (as described in Step 4). The XGBoost hyperparameter search range is detailed in S2 Table.
Performance was measured using test AUC and accuracy, with results averaged over three independent runs to enhance robustness.
3.4.5 Evaluation metrics.
We employ two complementary metrics in our evaluation framework:
Accuracy is used for internal model selection during BioFuse's automated search process: $\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$.
Area Under the ROC Curve (AUC-ROC) serves as our primary reporting metric for final test set performance. This choice aligns with MedMNIST+ benchmarks and provides a threshold-independent assessment of model quality. AUC-ROC represents the probability that a randomly chosen positive example ranks higher than a negative one [76], with values ranging from 0.5 (random performance) to 1.0 (perfect classification).
For comprehensive comparison, we report both AUC-ROC and accuracy in our final results tables.
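The probabilistic reading of AUC-ROC above can be computed directly for a small example. This pairwise formulation is a didactic sketch of the definition (in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
import numpy as np

def auc_roc(scores, labels):
    """AUC as the probability that a randomly chosen positive example
    is scored higher than a randomly chosen negative one; tied scores
    count as half a win."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# A perfectly separated classifier attains AUC = 1.0:
assert auc_roc([0.9, 0.8, 0.7, 0.4, 0.2], [1, 1, 0, 0, 0]) == 1.0
```

For a multi-class dataset, this binary quantity is computed per class and averaged, matching the MedMNIST+ evaluation protocol.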
4. Results
4.1 Overview
On the MedMNIST+ benchmark, BioFuse outperforms several existing methods, achieving the highest test AUC on 5 of 12 datasets and near-SOTA performance (within 2% margin) on the remaining datasets. It also achieves the best test accuracy on 2 of 12 datasets and near-SOTA accuracy on another 4 (see Table 3). These results highlight the effectiveness of embedding fusion from diverse foundation models in improving performance across a broad range of biomedical image classification tasks. The exact XGBoost parameters are listed in S3 Table.
A particularly interesting finding is the versatility of histopathology and radiology foundation models. The histopathology models UNI, Hibou-B, CONCH, and Prov-GigaPath, and the radiology models CheXagent and RAD-DINO, consistently appear in the top-performing model combinations, even when applied to modalities different from their pre-training data. This cross-modal effectiveness suggests these models learn generalisable features useful across various biomedical imaging tasks.
4.2 Cross-Modal performance
4.2.1 Transfer to other medical imaging modalities.
Fig 4 (AUC-ROC) and Fig 5 (accuracy) list the top three single-model scores per dataset. Hyperparameters for XGBoost were frozen (see 3.4 Experiments). CLIP [32] serves as our non-medical baseline.
CLIP is included as a baseline general-purpose foundation model for comparison. A white box indicates the top performer, while red boxes highlight the second- and third-best performers for each dataset; more than three models may be highlighted if they share identical values.
CLIP is included as a baseline general-purpose foundation model for comparison. White boxes indicate top performers, while red boxes highlight second and third best performers for each dataset. When multiple models achieve identical performance, more than three boxes may be highlighted.
Fixed hyperparameters in single-model heatmaps. For the single-model heatmaps, we use a fixed XGBoost configuration across all experiments. Running a full Bayesian search for every (dataset, backbone) pair would require additional fits. Extending this to all backbone combinations would increase the count by orders of magnitude. Since the heatmaps aim to rank the relative quality of individual backbones, a shared configuration suffices and isolates performance differences to the backbone itself. For fusion experiments (concatenation, self-attention, oracle-single), we perform full Bayesian search as only one backbone set per dataset requires tuning, keeping the computational budget manageable.
Radiology → Histopathology. Models whose pre-training focus is radiology demonstrate strong transfer to histopathology tasks:
- CheXagent (CA)—originally trained on chest X-rays—achieves first place in both metrics on DermaMNIST and places second on TissueMNIST for both accuracy and AUC.
Radiology → Microscopy/Ophthalmology. Radiology models also transfer effectively to other imaging modalities:
- CheXagent (CA) achieves first place on RetinaMNIST (first in both metrics) and reaches third place on BloodMNIST for accuracy (0.974) and tied second place for AUC (0.999).
Histopathology → Radiology. Conversely, histopathology models also transfer effectively to radiology benchmarks:
- Prov-GigaPath (PG) achieves first in AUC (0.945) and second in accuracy (0.878) on BreastMNIST, while placing tied second on ChestMNIST accuracy.
- Hibou-B (HB) ranks second on BreastMNIST AUC (0.929) and third on OrganSMNIST accuracy (0.756).
- UNI (UN) enters the top-three on BreastMNIST (third accuracy), OrganSMNIST (second in both metrics), and PneumoniaMNIST (third AUC).
- CONCH (CO) secures third place on BreastMNIST AUC (0.911).
Histopathology → Microscopy/Ophthalmology. Histopathology models also excel on microscopy and ophthalmology tasks:
- Hibou-B (HB) achieves first place on BloodMNIST with perfect AUC (1.000) and top accuracy (0.985).
- UNI and UNI2 place second and third, respectively, on BloodMNIST accuracy, with UNI also second in AUC. UNI2 reaches third place on RetinaMNIST for both metrics.
- Prov-GigaPath (PG) ranks second on RetinaMNIST for both accuracy (0.640) and AUC (0.862).
Baseline. CLIP lags behind specialist medical models but still reaches the top-three on OrganAMNIST for both accuracy (0.894) and AUC (0.993).
Taken together, these heatmaps demonstrate genuine multidirectional transfer: radiology-centric models can excel on histopathology and ophthalmology tasks, while histopathology models perform competitively on radiology and microscopy benchmarks. No single model dominates every modality—hence the motivation for the fusion strategy deployed in BioFuse.
4.3 Transfer to natural images
To comprehensively evaluate the generalisation capabilities of medical foundation models, we assess their performance on the ImageNet-1K [79] test set, a benchmark dataset widely used in computer vision. While we do not expect medical models to match the performance of models trained specifically on natural images, their performance provides valuable insights into their general visual understanding capabilities. Surprisingly, some medical models demonstrate remarkable transfer ability, with Prov-GigaPath and CheXagent achieving 78.2% and 66.0% top-1 accuracy respectively, outperforming CLIP (58.1%) despite CLIP being trained on 400M natural image-text pairs. Table 4 presents the complete ImageNet-1K test set performance results.
4.4 Ablation studies
4.4.1 Ablation 1: Self-attention fusion.
To test whether a learnable fusion layer can outperform the simple concatenation used in BioFuse, we replaced concatenation with a multi-head self-attention (MHSA) block applied across the per-model embeddings (attention formulation as in [27]). Following common practice ([43]), we swept over 4, 8, and 12 heads; the projected dimension is therefore constrained to be divisible by the head count. For each dataset we selected the best-performing projection size (column proj_dim in Table 5) and fed those embeddings into the same XGBoost hyper-parameter search that was used for the concatenation baseline.
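A minimal NumPy sketch of this fusion variant is given below. The weights are random placeholders rather than the trained parameters of the ablation, the backbone widths (768, 1024, 1536) are illustrative, and the mean-pooling readout is an assumed design choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def mhsa_fuse(embeddings, proj_dim, n_heads):
    """Fuse per-model embeddings with one multi-head self-attention block.

    embeddings : list of 1-D vectors, one per backbone (widths may differ).
    Each vector is projected to a shared proj_dim, treated as one token,
    attended across models, and mean-pooled into a single fused vector.
    """
    assert proj_dim % n_heads == 0, "proj_dim must be divisible by n_heads"
    d_h = proj_dim // n_heads

    # Project each backbone's embedding to the shared width (one token each).
    tokens = np.stack([
        e @ rng.standard_normal((e.shape[0], proj_dim)) / np.sqrt(e.shape[0])
        for e in embeddings
    ])                                              # (n_models, proj_dim)

    # Shared Q/K/V projections, split into heads.
    Wq, Wk, Wv = (rng.standard_normal((proj_dim, proj_dim)) / np.sqrt(proj_dim)
                  for _ in range(3))
    def heads(x):                                   # (n_models, n_heads, d_h)
        return x.reshape(x.shape[0], n_heads, d_h)
    Q, K, V = heads(tokens @ Wq), heads(tokens @ Wk), heads(tokens @ Wv)

    # Scaled dot-product attention per head, softmax over the model axis.
    scores = np.einsum("ihd,jhd->hij", Q, K) / np.sqrt(d_h)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = np.einsum("hij,jhd->ihd", attn, V).reshape(-1, proj_dim)

    return out.mean(axis=0)                         # fused (proj_dim,) vector

# Three backbones with different native widths, fused into one vector.
fused = mhsa_fuse([rng.standard_normal(d) for d in (768, 1024, 1536)],
                  proj_dim=768, n_heads=8)
print(fused.shape)  # (768,)
```

Unlike concatenation, this block introduces learnable projection and attention weights, which is the source of the over-fitting risk discussed below.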
The “Concat” columns reproduce the BioFuse baseline; the Model combination column lists the backbones selected for the self-attention run only—the corresponding concat combinations are given in Table 3. Δ = (Self-attn – Concat). Backbones printed in bold overlap with the concat baseline.
Discussion. Self-attention (SA) fusion helps on only two datasets—OCTMNIST (+1.7 pp ACC) and PathMNIST (+0.4 pp AUC)—and trails concatenation elsewhere. The largest deficits occur on OrganCMNIST (–26 pp AUC, –52 pp ACC) and OrganSMNIST (–1 pp AUC, –3.8 pp ACC). We identify three interacting factors:
- Extra parameters vs. class-wise sample size. SA introduces a substantial number of new weights, which are difficult to learn reliably when each class provides only a few hundred training images (as in OrganC/S). Parameter-free concatenation avoids this over-fitting risk.
- Radiology–histology model mix. The BioFuse pool combines radiology backbones (RD, CA) with histology-centric ones (HB, UN, UN2, PG, CO). When the target domain is a radiology CT variant such as OrganC/S, features from the histology models are often weak or misaligned. Concatenation leaves these channels untouched; SA, however, learns cross-model attention weights that can amplify conflicting cues, sharply reducing accuracy.
- Feature coherence across models. The two datasets that improve (OCTMNIST, PathMNIST) have strong texture cues extracted consistently by all backbones, allowing SA to reinforce aligned signals and yield modest gains.
Given this trade-off and the added tuning cost, plain concatenation remains the recommended fusion strategy for BioFuse under the current settings. The exact XGBoost parameters for SA are listed in S4 Table. Future work will explore sparsity-constrained or modality-aware attention to curb overfitting when radiology and histology embeddings diverge.
4.4.2 Ablation 2: Oracle-single.
In normal BioFuse training we brute-force all backbone combinations, pick the set that gives the highest validation accuracy, and then tune XGBoost on that fused feature space. For the “oracle-single” we restrict the search to individual backbones only. The single model that tops the validation leaderboard for each dataset is treated as the oracle, tuned with the same XGBoost procedure, and evaluated on the test set.
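The two search regimes can be sketched as follows. `val_score` is a hypothetical stand-in for the XGBoost validation-accuracy step, and the toy scoring function (vector length) is illustrative only:

```python
from itertools import combinations

def best_backbone_set(embeddings, val_score, singles_only=False):
    """Exhaustive combination search, as in normal BioFuse training.

    embeddings   : dict mapping backbone name -> feature vector (list of floats).
    val_score    : callable(fused_vector) -> validation score; placeholder for
                   the XGBoost validation-accuracy evaluation.
    singles_only : True restricts the search to individual backbones,
                   i.e. the "oracle-single" setting.
    """
    names = sorted(embeddings)
    max_k = 1 if singles_only else len(names)
    best_set, best_score = None, float("-inf")
    for k in range(1, max_k + 1):
        for subset in combinations(names, k):
            # Fusion by plain vector concatenation along the feature axis.
            fused = [x for name in subset for x in embeddings[name]]
            score = val_score(fused)
            if score > best_score:
                best_set, best_score = subset, score
    return best_set, best_score

# Toy example: 3 backbones; scoring by fused length favours fusing all models.
emb = {"HB": [0.1] * 4, "UN": [0.2] * 4, "CA": [0.3] * 4}
full = best_backbone_set(emb, val_score=len)
oracle = best_backbone_set(emb, val_score=len, singles_only=True)
print(full[0], oracle[0])   # ('CA', 'HB', 'UN') ('CA',)
```

With the nine backbones in the actual pool, the unrestricted search visits the 511 non-empty subsets evaluated in the main experiments, while the oracle-single variant visits only nine.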
S5 Table lists the tuned hyper-parameters; Table 6 compares oracle-single performance with the concatenation baseline.
Concat figures are the main BioFuse results; Oracle-single uses only the best individual model for each dataset. Δ = Oracle − Concat.
Discussion. Using only the single best backbone generally underperforms concatenation (both metrics decrease on eight of twelve datasets). A modest AUC gain is observed on PathMNIST, but large drops occur on OrganCMNIST and OrganSMNIST, confirming that fusion captures complementary information not present in any single model. Consequently, BioFuse’s concatenation scheme remains the strongest overall configuration.
4.5 Robustness to realistic image corruptions
Benchmark. We assess robustness with MedMNIST-C [80], which adds eleven medically motivated corruptions (five severity levels each) to every MedMNIST dataset, mirroring ImageNet-C [81]. Unless stated otherwise, we evaluate the BioFuse concatenation baseline—namely the best backbone combination for each dataset listed in Table 3—trained only on the clean MedMNIST+ data; corrupted images are simply resized to each backbone’s expected input resolution, and no further fine-tuning is performed. All metrics are computed with the official medmnistc-api toolkit [82].
Metrics. MedMNIST-C reports three error-based quantities (lower is better) (Table 7):
- Clean score – balanced error on the uncorrupted test set.
- BE – AlexNet-normalised balanced error averaged over all 55 corruption–severity pairs (AlexNet = 1).
- rBE – AlexNet-normalised increase in error relative to the clean set (AlexNet = 1). Values < 1 mean the model degrades less than AlexNet under corruption.
Note: the original paper multiplies these ratios by 100 (AlexNet = 100). We report the raw ratios from medmnistc-api for transparency; multiply by 100 to obtain the paper’s scale.
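Assuming these quantities follow the ImageNet-C normalisation convention cited above, the two ratios can be sketched as below. This is a simplified version that pools all corruption–severity pairs for brevity, not the per-corruption averaging performed by the medmnistc-api toolkit:

```python
def robustness_scores(model_err, alexnet_err, model_clean, alexnet_clean):
    """Raw (un-scaled) AlexNet-normalised ratios; multiply by 100 for the
    original paper's scale.

    model_err / alexnet_err : balanced errors over corruption-severity pairs.
    model_clean / alexnet_clean : balanced errors on the clean test set.
    """
    # BE: corrupted error normalised by AlexNet's corrupted error.
    be = sum(model_err) / sum(alexnet_err)
    # rBE: *increase* in error over clean, normalised by AlexNet's increase.
    rbe = (sum(e - model_clean for e in model_err)
           / sum(e - alexnet_clean for e in alexnet_err))
    return be, rbe

# Toy numbers: three corruptions, model degrades less than AlexNet.
model = [0.30, 0.35, 0.40]
alex = [0.50, 0.55, 0.60]
be, rbe = robustness_scores(model, alex, model_clean=0.20, alexnet_clean=0.30)
print(round(be, 3), round(rbe, 3))
```

In this toy case both ratios fall below 1, i.e. the model degrades less than AlexNet under corruption, matching the interpretation of rBE above.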
Discussion. Averaged across all datasets, BioFuse achieves BE = 1.64 and rBE = 2.07. This is higher (worse) than the single-backbone ViT-B/16 benchmark (BE ≈ 0.763) [80], mirroring the robustness–accuracy trade-off previously reported on CIFAR-C [81].
Fusion combines complementary features on clean data, but when corruptions affect the CNN and ViT backbones differently their representations can diverge, lowering performance. The clearest example is OrganCMNIST (BE = 3.35), where colour-contrast shifts likely drive conflicting cues between backbones. Conversely, on PathMNIST (BE < 0.6) all backbones respond similarly to texture-dominated corruptions, so fusion remains robust. Robustness is therefore highly dataset dependent.
Overall, our results underline a key design challenge: while fusion improves representational capacity on clean data, additional mechanisms are needed to maintain robustness. Future work will explore corruption-aware fusion, adaptive backbone weighting under distribution shift, and robustness-preserving training strategies for multi-backbone models.
5. Discussion
5.1 Key findings
Based on the comprehensive results in Table 3, we would like to highlight some important observations:
- State-of-the-art performance. BioFuse demonstrates exceptional performance on the MedMNIST+ benchmark, achieving new state-of-the-art (SOTA) scores across multiple datasets. Specifically, it achieves the highest AUC in five datasets and the best accuracy in two datasets. These improvements, achieved in a highly competitive benchmark, underscore BioFuse’s ability to extract and combine more informative features from multiple pre-trained models.
- Strong performance on DermaMNIST. A notable improvement was observed on the DermaMNIST dataset, where BioFuse achieved state-of-the-art performance in both AUC and accuracy. Despite the previous SOTA AUC already standing at 0.97, BioFuse delivered a further improvement in AUC and a gain of ∼1% in accuracy. This demonstrates BioFuse’s ability to improve performance even on tasks where the margin for improvement is small.
- Near SOTA AUC performance. BioFuse demonstrated highly competitive performance across multiple datasets, achieving AUC scores within 2% of state-of-the-art methods. This consistent performance was observed across various CT imaging tasks (OrganAMNIST, OrganCMNIST, OrganSMNIST) and microscopy datasets (TissueMNIST, PathMNIST). While BioFuse did not achieve the highest test accuracy for most datasets, it notably achieved SOTA performance on PathMNIST. The minimal gap in AUC scores compared to best-performing methods suggests BioFuse maintains robust class separation capabilities across these diverse medical imaging tasks.
- Strong AUC scores even when accuracy suffers. For datasets such as OCTMNIST, PneumoniaMNIST, and ChestMNIST, BioFuse’s test accuracy lagged behind existing models, but it outperformed them in AUC scores. Specifically, BioFuse matched the SOTA AUC of 0.992 for OCTMNIST, 0.944 (vs. 0.939) for BreastMNIST, and 0.835 (vs. 0.822) for ChestMNIST, showing that even when accuracy lags, BioFuse’s embeddings allow the model to make confident and accurate class distinctions.
- Efficacy of model ensembles. Across all datasets, the best-performing combinations involved multiple foundation models. No single model was able to outperform the ensembles, underscoring the importance of model diversity in feature representation. This finding suggests that the combined outputs of different models allow BioFuse to better capture complementary information across multiple modalities, resulting in more robust performance across a variety of biomedical imaging tasks.
5.2 Cross-modal transfer
Biomedical foundation models exhibited remarkable cross-modal transfer capabilities, often excelling in tasks well beyond their original training domains. This generalisation likely stems from shared visual feature representations that transcend specific imaging modalities. Notably, models trained on a single modality (such as histopathology-specific models) demonstrated stronger cross-modal transfer than those trained on multiple modalities, suggesting a potential benefit to focused, modality-specific pre-training for developing generalisable representations.
Histopathology models — particularly UNI, Hibou-B, and Prov-GigaPath — demonstrated exceptional cross-modal performance, likely due to their Vision Transformer-based architectures [38,40,41]. These models develop capacity to process multi-scale visual features during pre-training, from micron-scale cellular details to millimeter-scale tissue structures. This multi-scale capability appears particularly transferable across modalities, as the feature extraction mechanisms that identify cellular boundaries and tissue organization in histopathology may transfer effectively to detecting analogous structural patterns in retinal images and radiological scans, despite differences in visual appearance.
CheXagent’s sophisticated architecture, which integrates an 8-billion parameter clinical language model with a vision encoder and vision-language bridge [46], may contribute to its strong cross-modal adaptability. Pre-trained on 6 million instruction-image-answer triplets across 65 diverse datasets, this model appears to develop both modality-specific and modality-agnostic feature representations. The language component potentially functions as a semantic intermediary, facilitating knowledge transfer between distinct imaging domains. This architectural advantage correlates with CheXagent’s impressive performance in non-radiological tasks such as dermatology (AUC 0.960) and microscopy (AUC 0.999).
While our findings demonstrate substantial cross-modal capabilities, this transfer is not universal across all biomedical tasks. We observed performance decreases in highly specialized domains such as CT scan interpretation, where OrganAMNIST and OrganCMNIST showed the widest performance gaps between specialized and cross-modal models. These limitations suggest that effective cross-modal transfer depends on both model architecture and the inherent similarity between source and target domains. Future work should explore these boundaries systematically, potentially guiding the development of more universally transferable biomedical foundation models.
5.3 Computational considerations
Understanding BioFuse’s computational demands is essential for practical implementation. Following [83], we analyzed total response time as a realistic computational performance metric. Figure 6 decomposes the computational requirements for BioFuse experiments into three sequential stages: embedding extraction, exhaustive XGBoost evaluation across 511 combinations, and 100-trial Bayesian optimization of nine hyperparameters. A detailed per-dataset runtime breakdown is provided in S6 Table.
Stacked bars show total wall-clock time (hours) to reproduce each BioFuse experiment. The blue segment indicates embedding extraction for the nine backbones, the green segment the XGBoost evaluation runs, and the orange segment the hyperparameter sweep on Weights & Biases.
Hyperparameter optimization constitutes the primary computational requirement across all datasets. Large datasets like TissueMNIST (165K training samples) and those with many labels (ChestMNIST with 14 labels) require 70–80% of total runtime for optimization (approximately 35h and 40h, respectively) due to longer training times per trial, while the smallest datasets complete in under one hour. For small datasets like BreastMNIST and PneumoniaMNIST (several hundred images each), fixed-cost embedding extraction completes within minutes, making the 511-combination evaluation the second-largest computational component. Embedding extraction itself represents a bounded computational component: batched inference at 128 images per step maintains extraction below 7h even for the largest datasets when aggregated across all models. For individual models, extraction remains under 2h for most model-dataset combinations, with CheXagent on TissueMNIST (3.4h) and ChestMNIST (1.6h) representing the longest single-model extraction times (S7 Fig).
These results indicate that with cached embeddings, the primary computational bottleneck lies in downstream model selection. Beyond runtime considerations, we also examined GPU memory requirements to assess hardware accessibility. Our analysis showed that 7 of 9 foundation models operated within 5 GB of VRAM, making them suitable for consumer-grade GPUs. However, CheXagent, with 8 billion parameters, required 33.8 GB VRAM, necessitating server-grade GPU hardware (S8 Fig). Although memory requirements scale with dataset size, the relative differences between models remain consistent.
To support reproducibility and eliminate the need for time-intensive embedding extraction, we provide the complete embedding cache via Zenodo for MedMNIST+ [84] and ImageNet-1K [85].
5.4 Limitations and future directions
- Evaluation scope and task generalisation. BioFuse is currently evaluated only on classification tasks (binary, multi-class, and multi-label) using MedMNIST+. Testing on segmentation, detection, regression, and real-world clinical data would give a fuller picture. Rigorous out-of-distribution evaluation on clean external datasets remains important future work, as does extending BioFuse to time-series modalities and multimodal clinical records.
- Dependency on pre-trained models. Performance is bounded by the quality and diversity of the underlying backbones. Expanding the pool to MRI, ultrasound, genomics and clinical-text models would reduce bias and enable broader multimodal fusion. Systematic bias evaluation across demographic groups remains essential future work.
- Computational overhead. Extracting embeddings from many large backbones (e.g., CheXagent, 33 GB VRAM) and running exhaustive searches is GPU-intensive. More efficient search heuristics or progressive fusion strategies are an important next step.
- Interpretability challenges. Combining multiple backbones obscures how each modality contributes to a prediction. Integrating saliency maps, feature attribution or attention visualisation will be critical for clinical trust and regulatory acceptance.
By addressing these areas through collaboration between AI researchers and clinicians, BioFuse can become a more efficient, interpretable and generalisable tool for multimodal biomedical applications.
6. Conclusion
The integration of diverse foundation models represents a promising frontier for advancing biomedical imaging analysis. In this work, we introduced BioFuse, a novel framework that systematically fuses embeddings from multiple biomedical foundation models to generate optimised representations for downstream tasks. Evaluated across 12 diverse imaging modalities in the MedMNIST+ benchmark, BioFuse with XGBoost classification outperformed existing methods, achieving the highest AUC in five datasets and maintaining near-SOTA performance in most others. Notably, it demonstrated exceptional performance in dermatology classification (DermaMNIST) and revealed unexpected cross-modal transfer capabilities in histopathology and radiology models like UNI, Hibou-B, Prov-GigaPath, and CheXagent.
These results highlight the benefits of leveraging multiple pre-trained models rather than relying on a single foundation model. BioFuse’s ability to automatically identify and integrate complementary representations from diverse models suggests significant potential for healthcare applications requiring comprehensive image interpretation across modalities. The framework’s extensible architecture ensures adaptability to future foundation models as they emerge.
While demonstrating clear advantages, BioFuse faces challenges, including computational overhead from grid search and potential redundancy in concatenated embeddings. Future work should explore more efficient fusion strategies, expand applications beyond classification to segmentation and detection tasks, and incorporate interpretability mechanisms essential for clinical adoption and regulatory approval.
By harnessing the collective strengths of multiple foundation models through a systematic approach to embedding fusion, BioFuse not only improves performance on benchmark tasks but also opens new avenues for cross-modal knowledge transfer in biomedical imaging. This contribution moves us closer to more reliable and comprehensive AI-assisted medical decision-making systems that can integrate information across the diverse imaging modalities encountered in clinical practice.
Supporting information
S2 Table. XGBoost hyperparameter search ranges used for Bayesian optimization.
https://doi.org/10.1371/journal.pone.0320989.s002
(PDF)
S3 Table. Optimal XGBoost hyperparameter configurations for concatenation fusion for all MedMNIST+ datasets.
https://doi.org/10.1371/journal.pone.0320989.s003
(PDF)
S4 Table. Optimal XGBoost hyperparameter configurations for self-attention fusion for all MedMNIST+ datasets.
https://doi.org/10.1371/journal.pone.0320989.s004
(PDF)
S5 Table. Optimal XGBoost hyperparameter configurations for oracle-single models using best individual backbones for all MedMNIST+ datasets.
https://doi.org/10.1371/journal.pone.0320989.s005
(PDF)
S6 Table. Computational runtime breakdown for the concatenation fusion experiment, showing embedding extraction, model training, hyperparameter sweep, and total processing time for each dataset.
https://doi.org/10.1371/journal.pone.0320989.s006
(PDF)
S7 Fig. Heatmap of embedding extraction times (hours) across nine foundation models and twelve MedMNIST+ datasets.
https://doi.org/10.1371/journal.pone.0320989.s007
(PDF)
S8 Fig. Peak GPU VRAM usage (GB) for each foundation model during embedding extraction.
https://doi.org/10.1371/journal.pone.0320989.s008
(PDF)
S9 Appendix. Pretraining dataset details for all nine biomedical foundation models used in BioFuse.
https://doi.org/10.1371/journal.pone.0320989.s009
(PDF)
S10 Appendix. MedMNIST+ dataset descriptions including imaging modalities, clinical tasks, and dataset statistics.
https://doi.org/10.1371/journal.pone.0320989.s010
(PDF)
References
- 1. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019;25(1):30–6. pmid:30617336
- 2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56. pmid:30617339
- 3. Chen X, Wang X, Zhang K, Fung K-M, Thai TC, Moore K, et al. Recent advances and clinical applications of deep learning in medical image analysis. Med Image Anal. 2022;79:102444. pmid:35472844
- 4. Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, et al. On the opportunities and risks of foundation models. 2021. Available from: https://arxiv.org/abs/2108.07258
- 5. Azad B, Azad R, Eskandari S, Bozorgpour A, Kazerouni A, Rekik I. Foundational models in medical imaging: a comprehensive survey and future vision. 2023. Available from: https://arxiv.org/abs/2310.18689
- 6. Wang Z, Wu Z, Agarwal D, Sun J. MedCLIP: contrastive learning from unpaired medical images and text. 2022. Available from: https://arxiv.org/abs/2210.10163
- 7. Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun. 2024;15(1):654.
- 8. Breen J, Allen K, Zucker K, Godson L, Orsi NM, Ravikumar N. Histopathology foundation models enable accurate ovarian cancer subtype classification. 2024. Available from: https://arxiv.org/abs/2405.09990
- 9. Vorontsov E, Bozkurt A, Casson A, Shaikovski G, Zelechowski M, Liu S. Virchow: a million-slide digital pathology foundation model. 2023. Available from: https://arxiv.org/abs/2309.07778
- 10. Saab K, Tu T, Weng WH, Tanno R, Stutz D, Wulczyn E. Capabilities of Gemini models in medicine. 2024. Available from: https://arxiv.org/abs/2404.18416
- 11. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. pmid:31501885
- 12. Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6):bbac409. pmid:36156661
- 13. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20. pmid:33538820
- 14. Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell. 2022;4(10):852–66.
- 15. Wang Z, Liu C, Zhang S, Dou Q. Foundation model for endoscopy video analysis via large-scale self-supervised pre-train. Springer Nature Switzerland; 2023. p. 101–11.
- 16. Wölflein G, Ferber D, Meneghetti AR, Nahhas OSME, Truhn D, Carrero ZI. Benchmarking pathology feature extractors for whole slide image classification. 2024. Available from: https://arxiv.org/abs/2311.11772
- 17. Tiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng. 2022;6(12):1399–406. pmid:36109605
- 18. Wang X, Yang S, Zhang J, Wang M, Zhang J, Yang W, et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med Image Anal. 2022;81:102559. pmid:35952419
- 19. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L. Towards expert-level medical question answering with large language models. 2023. Available from: https://arxiv.org/abs/2305.09617
- 20. Khan W, Leem S, See KB, Wong JK, Zhang S, Fang R. A comprehensive survey of foundation models in medicine. 2024. Available from: https://arxiv.org/abs/2406.10729
- 21. Zhang Y, Jiang H, Miura Y, Manning CD, Langlotz CP. Contrastive learning of medical visual representations from paired images and text. 2022. Available from: https://arxiv.org/abs/2010.00747
- 22. Li W, Peng Y, Zhang M, Ding L, Hu H, Shen L. Deep model fusion: a survey. 2023. Available from: https://arxiv.org/abs/2309.15698
- 23. Chen MF, Fu DY, Adila D, Zhang M, Sala F, Fatahalian K. Shoring up the foundations: fusing model embeddings and weak supervision. 2022. Available from: https://arxiv.org/abs/2203.13270
- 24. Neidlinger P, Nahhas OSME, Muti HS, Lenz T, Hoffmeister M, Brenner H. Benchmarking foundation models as feature extractors for weakly-supervised computational pathology. 2024. Available from: https://arxiv.org/abs/2408.15823
- 25. Qiu J, Li L, Sun J, Peng J, Shi P, Zhang R, et al. Large ai models in health informatics: Applications, challenges, and the future. IEEE J Biomed Health Inform. 2023;27(12):6074–87. pmid:37738186
- 26. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P. Language models are few-shot learners. 2020. Available from: https://arxiv.org/abs/2005.14165
- 27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. 2023. Available from: https://arxiv.org/abs/1706.03762
- 28. Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, et al. Scaling laws for neural language models. 2020. Available from: https://arxiv.org/abs/2001.08361
- 29. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–20. pmid:24071849
- 30. Xie Y, Zhou C, Gao L, Wu J, Li X, Zhou HY. MedTrinity-25M: a large-scale multimodal dataset with multigranular annotations for medicine. 2024. Available from: https://arxiv.org/abs/2408.02900
- 31. Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-LM: training multi-billion parameter language models using model parallelism. 2020. Available from: https://arxiv.org/abs/1909.08053
- 32. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. 2021. Available from: https://arxiv.org/abs/2103.00020
- 33. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2019. Available from: https://arxiv.org/abs/1810.04805
- 34. Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. 2020. Available from: https://arxiv.org/abs/2002.05709
- 35. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. 2021. Available from: https://arxiv.org/abs/2111.06377
- 36. Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V. DINOv2: learning robust visual features without supervision. 2024. Available from: https://arxiv.org/abs/2304.07193
- 37. Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P. Emerging properties in self-supervised vision transformers. 2021. Available from: https://arxiv.org/abs/2104.14294
- 38. Chen RJ, Ding T, Lu MY, Williamson DFK, Jaume G, Song AH, et al. Towards a general-purpose foundation model for computational pathology. Nat Med. 2024;30(3):850–62. pmid:38504018
- 39. Mahmood Lab. UNI2: pathology foundation model. 2025. Available from: https://github.com/mahmoodlab/UNI/tree/uni2
- 40. Xu H, Usuyama N, Bagga J, Zhang S, Rao R, Naumann T, et al. A whole-slide foundation model for digital pathology from real-world data. Nature. 2024;630(8015):181–8. pmid:38778098
- 41. Nechaev D, Pchelnikov A, Ivanova E. Hibou: a family of foundational vision transformers for pathology. 2024. Available from: https://arxiv.org/abs/2406.05074
- 42. Pérez-García F, Sharma H, Bond-Taylor S, Bouzid K, Salvatelli V, Ilse M. RAD-DINO: exploring scalable medical image encoders beyond text supervision. 2024. Available from: https://arxiv.org/abs/2401.10815
- 43. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: transformers for image recognition at scale. 2021. Available from: https://arxiv.org/abs/2010.11929
- 44. Zhang S, Xu Y, Usuyama N, Xu H, Bagga J, Tinn R, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. 2024. Available from: https://arxiv.org/abs/2303.00915
- 45. Eslami S, Meinel C, de Melo G. PubMedCLIP: how much does CLIP benefit visual question answering in the medical domain? In: Findings of the Association for Computational Linguistics: EACL 2023. 2023. pp. 1181–93. Available from: https://aclanthology.org/2023.findings-eacl.88
- 46. Chen Z, Varma M, Delbrouck JB, Paschali M, Blankemeier L, Veen DV, et al. CheXagent: towards a foundation model for chest x-ray interpretation. 2024. Available from: https://arxiv.org/abs/2401.12208
- 47. Lu MY, Chen B, Williamson DFK, Chen RJ, Liang I, Ding T. Towards a visual-language foundation model for computational pathology. 2023. Available from: https://arxiv.org/abs/2307.12914
- 48. Zarif S, Abdulkader H, Elaraby I, Alharbi A, Elkilani WS, Pławiak P. Using hybrid pre-trained models for breast cancer detection. PLoS One. 2024;19(1):e0296912. pmid:38252633
- 49. Dong X, Li M, Zhou P, Deng X, Li S, Zhao X, et al. Fusing pre-trained convolutional neural networks features for multi-differentiated subtypes of liver cancer on histopathological images. BMC Med Inform Decis Mak. 2022;22(1):122. pmid:35509058
- 50. Sharma S, Daniel R Jr. BioFLAIR: pretrained pooled contextualized embeddings for biomedical sequence labeling tasks. 2019. Available from: https://arxiv.org/abs/1908.05760
- 51. Yogarajan V, Pfahringer B, Smith T, Montiel J. Improving predictions of tail-end labels using concatenated BioMed-transformers for long medical documents. 2021. Available from: https://arxiv.org/abs/2112.01718
- 52. Zayats V, Chen P, Ferrari M, Padfield D. Zipper: a multi-tower decoder architecture for fusing modalities. 2024. Available from: https://arxiv.org/abs/2405.18669
- 53. Wang Z, Wang Z, Srinivasan B, Ioannidis VN, Rangwala H, Anubhai R. BioBridge: bridging biomedical foundation models via knowledge graphs. 2024. Available from: https://arxiv.org/abs/2310.03320
- 54. Wang X, Jiang Y, Bach N, Wang T, Huang Z, Huang F. Automated concatenation of embeddings for structured prediction. 2021. Available from: https://arxiv.org/abs/2010.05006
- 55. Alain G, Bengio Y. Understanding intermediate layers using linear classifier probes. 2018. Available from: https://arxiv.org/abs/1610.01644
- 56. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. Available from: https://doi.org/10.1145/2939672.2939785
- 57. Jiang W, Synovic N, Hyatt M, Schorlemmer TR, Sethi R, Lu Y-H, et al. An empirical study of pre-trained model reuse in the Hugging Face deep learning model registry. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2023. pp. 2463–75. Available from: https://doi.org/10.1109/icse48619.2023.00206
- 58. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P. Training language models to follow instructions with human feedback. 2022. Available from: https://arxiv.org/abs/2203.02155
- 59. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. HuggingFace’s transformers: state-of-the-art natural language processing. 2020. Available from: https://arxiv.org/abs/1910.03771
- 60. Ilharco G, Wortsman M, Wightman R, Gordon C, Carlini N, Taori R. OpenCLIP. 2021. Available from: https://doi.org/10.5281/zenodo.5143773
- 61. Wightman R. PyTorch image models. 2019. Available from: https://github.com/rwightman/pytorch-image-models
- 62. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Belmont: Wadsworth International Group; 1984.
- 63. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W. LightGBM: a highly efficient gradient boosting decision tree. 2017. Available from: https://arxiv.org/abs/1711.04684
- 64. Yang J, Shi R, Wei D, Liu Z, Zhao L, Ke B, et al. MedMNIST v2 - a large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci Data. 2023;10(1):41. pmid:36658144
- 65. Kather JN, Krisam J, Charoentong P, Luedde T, Herpel E, Weis C-A, et al. Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Med. 2019;16(1):e1002730. pmid:30677016
- 66. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017. pp. 3462–71. Available from: https://doi.org/10.1109/cvpr.2017.369
- 67. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161. pmid:30106392
- 68. Kermany DS, Goldbaum M, Cai W, Valentim CCS, Liang H, Baxter SL, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122-1131.e9. pmid:29474911
- 69. Liu R, Wang X, Wu Q, Dai L, Fang X, Yan T, et al. DeepDRiD: diabetic retinopathy-grading and image quality estimation challenge. Patterns (N Y). 2022;3(6):100512. pmid:35755875
- 70. Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Dataset of breast ultrasound images. Data Brief. 2019;28:104863. pmid:31867417
- 71. Acevedo A, Merino A, Alférez S, Molina Á, Boldú L, Rodellar J. A dataset of microscopic peripheral blood cell images for development of automatic recognition systems. Data Brief. 2020;30:105474. pmid:32346559
- 72. Ljosa V, Sokolnicki KL, Carpenter AE. Annotated high-throughput microscopy image sets for validation. Nat Methods. 2012;9(7):637. pmid:22743765
- 73. Bilic P, Christ P, Li HB, Vorontsov E, Ben-Cohen A, Kaissis G, et al. The Liver Tumor Segmentation Benchmark (LiTS). Med Image Anal. 2023;84:102680. pmid:36481607
- 74. Gutierrez PA, Perez-Ortiz M, Sanchez-Monedero J, Fernandez-Navarro F, Hervas-Martinez C. Ordinal regression methods: survey and experimental study. IEEE Trans Knowl Data Eng. 2016;28(1):127–46.
- 75. XGBoost Parameters. XGBoost 2.1.1 documentation. Available from: https://xgboost.readthedocs.io/en/stable/parameter.html
- 76. Classification: ROC and AUC. Machine Learning Crash Course. 2023. Available from: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- 77. Doerrich S, Di Salvo F, Brockmann J, Ledig C. Rethinking model prototyping through the MedMNIST dataset collection. 2024. Available from: https://arxiv.org/abs/2404.15786
- 78. Manzari ON, Ahmadabadi H, Kashiani H, Shokouhi SB, Ayatollahi A. MedViT: a robust vision transformer for generalized medical image classification. Comput Biol Med. 2023;157:106791. pmid:36958234
- 79. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. 2015. Available from: https://arxiv.org/abs/1409.0575
- 80. Di Salvo F, Wang H, Xu M, Feng Z, He X, Wang K. MedMNIST-C: a benchmark for robust medical imaging classification under corruptions. 2024. Available from: https://arxiv.org/abs/2406.17536
- 81. Hendrycks D, Dietterich TG. Benchmarking neural network robustness to common corruptions and perturbations. 2019. Available from: https://arxiv.org/abs/1903.12261
- 82. Di Salvo F. medmnistc-api [software]. 2024. Available from: https://github.com/francescodisalvo05/medmnistc-api
- 83. Harris-Birtill D, Harris-Birtill R. Understanding computation time: a critical discussion of time as a computational performance metric. In: Misztal A, Harris PA, Parker JA, editors. Time in variance. Leiden: Brill; 2021. pp. 220–48. Available from: https://doi.org/10.1163/9789004470170_014
- 84. Hossain MN, Harris-Birtill D. BioFuse embedding cache for MedMNIST 2D datasets. 2025. Available from: https://doi.org/10.5281/zenodo.16732578
- 85. Hossain MN, Harris-Birtill D. BioFuse embedding cache for ImageNet-1K. 2025. Available from: https://doi.org/10.5281/zenodo.14930584