
MedZeroSeg: Zero-shot medical image segmentation via vision foundation models

  • Ronghui Zhang,

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft

    Affiliation Concord University College Fujian Normal University, Fuzhou, China

  • Min Huang ,

    Roles Data curation, Formal analysis, Funding acquisition, Methodology, Software, Writing – review & editing

    2018011@fjjxu.edu.cn (MH); rui.li@jmu.edu.cn (RL)

    Affiliations College of Electronic Information Science, Fujian Jiangxia University, Fuzhou, China, Smart Home Information Collection and Processing on Internet of Things Laboratory of Digital Fujian, Fuzhou, China

  • Rui Li

    Roles Data curation, Funding acquisition, Methodology, Resources, Writing – review & editing


    Affiliation School of Finance and Economics, Jimei University, Xiamen, China

Abstract

A novel medical image segmentation framework, MedZeroSeg, is proposed to address key challenges in the field. Leveraging vision foundation models such as CLIP (Contrastive Language-Image Pre-training) and SAM (Segment Anything Model), it achieves zero-shot segmentation, accurately delineating previously unseen medical images without requiring additional labeled data. This significantly reduces reliance on large-scale annotated datasets. At its core, MedZeroSeg introduces a Dual-Path Feature Extraction Module that captures both fine anatomical details and global contextual information through the integration of local and global perception mechanisms, enhancing robustness against the complexity and variability inherent in medical imaging. Additionally, a Context-Enhanced Hard-Negative Contrastive Loss is introduced to enhance contrastive learning by exploiting contextual cues and refining hard-negative sampling, leading to better representations and higher efficiency. The key innovation of MedZeroSeg lies in its ability to leverage generalizable knowledge from CLIP and SAM without any task-specific fine-tuning, making it highly adaptable across different medical imaging modalities. Extensive experiments on three publicly available datasets, including cardiac MRI (ACDC), multi-organ abdominal CT (Synapse), and chest X-ray (COVID-QU-Ex), demonstrate that MedZeroSeg achieves superior results in both zero-shot and weakly supervised segmentation settings, showcasing strong generalization capabilities and minimal data dependency. The framework represents a significant advancement in medical image analysis and opens up promising directions for future research in applying advanced foundation models and innovative learning strategies to healthcare applications.

Introduction

Medical image segmentation plays a fundamental role in contemporary clinical workflows, facilitating disease diagnosis, treatment planning, and quantitative analysis of pathological structures [1, 2, 3, 4, 5]. Although deep learning-based segmentation models [6–9] have achieved remarkable success in recent years, several longstanding challenges remain unresolved. These include the scarcity of well-annotated datasets, the limited generalization ability of models trained on restricted domains, and insufficient robustness when faced with variations in imaging modalities, acquisition conditions, and patient-specific anatomical differences. As annotation in medical imaging is expensive and time-consuming—often requiring expert radiologists—there is an urgent demand for methods capable of achieving accurate segmentation with minimal or even no task-specific labels while maintaining broad adaptability across diverse clinical environments.

With the emergence of large-scale vision-language models such as CLIP (Contrastive Language-Image Pre-training) [10] and foundation segmentation models like SAM (Segment Anything Model) [11], new opportunities have arisen for zero-shot and cross-domain segmentation [12,13]. These models demonstrate impressive generalization on natural images and have sparked interest in their potential use in medical imaging [14–18]. However, transferring such models to medical domains remains nontrivial. The gap between natural and medical images—manifested in modality-specific visual characteristics, subtle pathological cues, and complex anatomical structures—raises ongoing debates regarding how effectively these models can be adapted without large-scale medical data or domain-specific fine-tuning [19,20]. As a result, the feasibility and reliability of zero-shot segmentation in clinical settings remain open research questions.

Another challenge concerns the imbalance between local detail extraction and global contextual understanding. Classical architectures such as U-Net [21], and other CNN-based models [22–24] excel at capturing local features but often fail to encode long-range contextual information, which is critical for distinguishing ambiguous anatomical boundaries or small lesions [25]. Multi-scale feature fusion techniques [26] and attention-based methods [27] partially alleviate this issue, yet they tend to increase computational cost and still rely heavily on annotated data. The tension between preserving fine anatomical detail and capturing global structure remains a central point of discussion in the community.

In parallel, contrastive learning has emerged as a powerful self-supervised paradigm for representation learning [28]. However, traditional contrastive objectives such as InfoNCE [29] often suffer from inefficient or noisy negative sample selection, limiting their discriminative power, especially in the context of subtle medical variations. Recent work on learning from AI-generated annotations [30] has shown promise in reducing annotation burden, yet the integration of such approaches with zero-shot foundation models remains underexplored. The community continues to debate how to incorporate contextual cues, meaningful hard-negative mining, and multi-modal information into contrastive learning to maximize its benefit in clinical applications.

To address these challenges, we propose MedZeroSeg, a novel framework that integrates the strengths of CLIP and SAM to achieve zero-shot medical image segmentation in radiological tasks. By leveraging these models, MedZeroSeg performs accurate segmentation without additional annotations, substantially reducing reliance on large-scale labeled datasets. To enhance the balance between local detail extraction and global context understanding, we propose a Dual-Path Feature Extraction Module (DPFEM). This module combines local and global perception pathways to capture both fine anatomical structures and broader contextual information. The dual-path design enables the model to better comprehend subtle details and overall image layouts, thereby improving segmentation accuracy and robustness. To optimize contrastive learning, we introduce a new loss function named the Context-Enhanced Hard-Negative Contrastive (CEHNC) loss. This loss function improves contrastive learning efficiency by utilizing contextual information more effectively and refining the selection of hard-negative samples. By sharpening the discrimination between positive and negative samples, it improves both training efficiency and segmentation accuracy, allowing the model to learn more discriminative feature representations.

Our approach makes three primary contributions:

  1. Zero-shot Medical Segmentation: By integrating the capabilities of CLIP and SAM, our framework achieves zero-shot segmentation of medical images without additional annotations, dramatically reducing dependence on large-scale labeled data.
  2. DPFEM: We design a feature extraction module that integrates local and global perception to capture fine anatomical details and rich contextual information, thereby enhancing the model’s capability to process complex medical images.
  3. CEHNC Loss Function: We propose a novel CEHNC Loss that enhances contextual information utilization and improves hard-negative sample selection to optimize contrastive learning, thereby boosting training efficiency and segmentation accuracy.

Through comprehensive validation across various medical image segmentation tasks, including cardiac MRI (ACDC) [31], multi-organ abdominal CT (Synapse) [32], and chest X-ray (COVID-QU-Ex) [33], MedZeroSeg demonstrates outstanding performance. It showcases significant potential as an advanced tool for next-generation medical image segmentation, proving its efficacy and versatility in diverse clinical settings.

Materials and methods

Fig 1 presents an overview of the Contextual Zero Medical Image Segmentation Network (MedZeroSeg), which consists of three main stages: fine-tuning the BiomedCLIP model using the proposed CEHNC loss, performing zero-shot segmentation guided by textual prompts and the DPFEM, and applying weakly supervised segmentation to further refine potential labeling results.

Fig 1. An overview of the proposed MedZeroSeg framework.

This figure shows the MedZeroSeg architecture for zero-shot medical image segmentation, highlighting the CEHNC loss for fine-tuning, the DPFEM for feature extraction, and the textual prompt-guided segmentation process.

https://doi.org/10.1371/journal.pone.0344978.g001

Context-enhanced hard negative contrastive loss

To address the limitations of current contrastive learning methods in medical image segmentation—particularly their inefficiency in data utilization and inability to distinguish subtle differences—we introduce the CEHNC loss. This newly designed loss function incorporates additional components aimed at improving the model’s discriminative capability and training efficiency.

Unlike traditional contrastive losses such as InfoNCE [29], which typically treat all negative samples equally or use a fixed weighting scheme, CEHNC dynamically adjusts the contribution of each negative sample based on its context and difficulty. Traditional contrastive losses may fail to effectively utilize hard negatives—samples that are visually similar but belong to different categories—which are crucial for enhancing model generalization and learning efficiency.

Zero-shot and weakly supervised learning

Adaptive hard negative selection

In contrastive learning, conventional approaches often rely on random or fixed strategies to select negative samples—those semantically different from the anchor (i.e., positive) sample. However, such methods may fail to identify hard negatives, which are visually similar yet belong to different categories and are essential for improving model generalization and learning efficiency. To address this limitation, the CEHNC loss introduces an adaptive mechanism to select hard negatives.

As shown in Fig 2, the core idea of this mechanism is to dynamically adjust the weight assigned to each candidate negative sample within a training batch. Specifically, at the start of each batch, the similarity between the anchor and all other samples is computed, typically using metrics such as cosine similarity. Based on these similarity scores, the method identifies the most challenging negative samples for the current anchor. Then, considering the overall distribution of the batch, the algorithm adaptively updates two difficulty parameters that control the difficulty curve of the selected negatives. These parameters ensure that sufficient attention is given to hard cases while still allowing for the inclusion of easier samples. As a result, the CEHNC loss enables the model to better focus on distinguishing fine-grained differences, ultimately leading to improved segmentation performance.
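The weighting step above can be sketched in NumPy; this is an illustrative implementation that assumes hardness is scored by an exponential of the anchor-negative Euclidean distance, with hypothetical difficulty parameters `alpha` and `beta` (not the paper's exact formulation):

```python
import numpy as np

def hard_negative_weights(anchor, batch, alpha=1.0, beta=2.0):
    """Weight each candidate negative by its closeness to the anchor.

    anchor: (d,) L2-normalized embedding; batch: (M, d) L2-normalized embeddings.
    Returns (M,) weights that grow as a negative gets closer (i.e. harder).
    alpha and beta are illustrative difficulty parameters.
    """
    sims = batch @ anchor                                       # cosine similarity (unit vectors)
    dists = np.sqrt(np.clip(2.0 - 2.0 * sims, 0.0, None))       # Euclidean distance on the sphere
    return alpha * np.exp(-beta * dists)                        # near (hard) negatives weigh more

anchor = np.array([1.0, 0.0])
near = np.array([0.8, 0.6])      # cosine 0.8 to the anchor: a hard negative
far = np.array([-1.0, 0.0])      # cosine -1: an easy negative
weights = hard_negative_weights(anchor, np.stack([near, far]))
```

Under this weighting, the hard negative receives a substantially larger weight than the easy one, which is the behavior the adaptive selection mechanism targets.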

Fig 2. Illustration of the proposed CEHNC loss pipeline.

This figure presents the CEHNC loss mechanism, which dynamically selects hard-negative samples and updates the difficulty parameters to enhance fine-grained feature discrimination during training.

https://doi.org/10.1371/journal.pone.0344978.g002

Integration of contextual information

Beyond conventional image-text pairs, the CEHNC loss further incorporates domain-specific contextual information from medical data. This is achieved by integrating relevant clinical metadata, such as diagnostic reports or patient history, into the feature learning process. These additional contextual features are encoded using a dedicated context encoder and subsequently fused with both image and text embeddings before being fed into the contrastive loss function.

To implement this, all auxiliary data sources are first processed through a specialized context encoder, which transforms them into high-dimensional feature representations. These encoded features are then combined with the original visual and textual embeddings to construct a more comprehensive and semantically rich feature space. This integration not only enhances the model’s understanding within each individual modality but also facilitates cross-modal consistency and complementarity.

For example, when a patient’s prior medical records and personal health context are incorporated, the model can better interpret ambiguous findings in a single X-ray image, leading to more accurate identification of pathological conditions. By embedding such rich contextual knowledge, the CEHNC loss significantly improves the system’s ability to recognize complex disease patterns. Moreover, it enhances the robustness and interpretability of the model, enabling predictions that are not only more accurate but also more transparent and clinically meaningful.
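The fusion step described above can be illustrated with a minimal sketch; the linear projection `W` and the embedding dimensions are hypothetical stand-ins for the paper's dedicated context encoder and fusion layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_with_context(img_emb, ctx_emb, W):
    """Concatenate an image embedding with an encoded context vector and
    project back to the shared embedding dimension (illustrative linear fusion)."""
    joint = np.concatenate([img_emb, ctx_emb])      # (d_img + d_ctx,)
    fused = W @ joint                               # (d,) fused representation
    return fused / np.linalg.norm(fused)            # re-normalize for contrastive use

d_img, d_ctx, d = 512, 128, 512                     # illustrative dimensions
W = rng.standard_normal((d, d_img + d_ctx)) / np.sqrt(d_img + d_ctx)
img_emb = rng.standard_normal(d_img)                # e.g. BiomedCLIP image embedding
ctx_emb = rng.standard_normal(d_ctx)                # e.g. encoded diagnostic report
z = fuse_with_context(img_emb, ctx_emb, W)
```

The same fusion would apply symmetrically to the text embedding before both enter the contrastive objective.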

Loss function formulation

The CEHNC loss is designed as a dual-direction contrastive objective that jointly optimizes image-to-text and text-to-image alignment. It is defined as:

$$\mathcal{L}_{\mathrm{CEHNC}} = \mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I} \tag{1}$$

where each term represents a directional component responsible for pulling positive pairs closer while pushing apart negative ones based on their embedding distances. Following standard contrastive learning conventions, the positive pair term appears in the denominator of each component. Specifically:

$$\mathcal{L}_{I \rightarrow T} = -\frac{1}{M} \sum_{n=1}^{M} \log \frac{\exp(z_n \cdot s_n / \tau)}{\exp(z_n \cdot s_n / \tau) + \sum_{m \neq n} w_{I \rightarrow T}(z_n, s_m)\,\exp(z_n \cdot s_m / \tau)} \tag{2}$$

$$\mathcal{L}_{T \rightarrow I} = -\frac{1}{M} \sum_{n=1}^{M} \log \frac{\exp(s_n \cdot z_n / \tau)}{\exp(s_n \cdot z_n / \tau) + \sum_{m \neq n} w_{T \rightarrow I}(s_n, z_m)\,\exp(s_n \cdot z_m / \tau)} \tag{3}$$

with $z_n$ and $s_n$ denoting the n-th normalized embeddings from the vision and language encoders respectively, M the batch size, and $\tau$ a learnable scaling factor analogous to temperature. To dynamically modulate the influence of negative samples, we introduce two adaptive weighting functions:

$$w_{I \rightarrow T}(z_n, s_m) = \alpha \exp\!\left(-\beta\,\lVert z_n - s_m \rVert\right) \tag{4}$$

$$w_{T \rightarrow I}(s_n, z_m) = \alpha \exp\!\left(-\beta\,\lVert s_n - z_m \rVert\right) \tag{5}$$

where $\alpha$ and $\beta$ are hyperparameters that control the sensitivity to hard negatives. These weights suppress the contribution of distant negatives exponentially with respect to their Euclidean distance in the joint embedding space. Unlike temperature scaling ($\tau$), which uniformly scales all similarity scores, the weighting functions $w_{I \rightarrow T}$ and $w_{T \rightarrow I}$ apply sample-specific, distance-dependent weights that selectively emphasize hard negatives (samples closer in embedding space). For normalized embeddings, Euclidean distance and cosine similarity are monotonically related ($\lVert u - v \rVert^2 = 2(1 - u \cdot v)$ for unit vectors), so using Euclidean distance provides a direct hardness signal while maintaining consistency with the cosine similarity used in the contrastive objective. The learnable scaling factor $\tau$ is initialized following the default temperature setting of the original CLIP model and is optimized jointly with other model parameters during training. In summary, the CEHNC loss leverages dynamic weighting and context-aware hard negative mining to enhance the model's ability to handle subtle differences in medical images, offering significant improvements over traditional contrastive losses.
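A minimal NumPy sketch of a bidirectional contrastive loss with the properties described here (positive pair in the denominator, exponentially distance-weighted negatives). The exact weighting form and the values of `tau`, `alpha`, and `beta` are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def cehnc_loss(Z, S, tau=0.07, alpha=1.0, beta=2.0):
    """Bidirectional contrastive loss with distance-weighted hard negatives.

    Z, S: (M, d) L2-normalized image / text embeddings, matched by row.
    The positive pair appears in each denominator, and every negative's
    exp-similarity is scaled by a weight that decays with its Euclidean
    distance to the anchor (nearby, i.e. hard, negatives weigh more).
    """
    sim = (Z @ S.T) / tau                                        # scaled cosine similarities
    dist = np.sqrt(np.clip(2.0 - 2.0 * (Z @ S.T), 0.0, None))    # distances on the unit sphere
    w = alpha * np.exp(-beta * dist)                             # emphasize near negatives
    np.fill_diagonal(w, 1.0)                                     # positives keep unit weight

    def directional(logits, weights):
        pos = np.diag(logits)                                    # positive-pair logits
        denom = np.log((weights * np.exp(logits)).sum(axis=1))   # positive + weighted negatives
        return (-(pos - denom)).mean()

    return directional(sim, w) + directional(sim.T, w.T)         # image→text + text→image

rng = np.random.default_rng(1)
Z = rng.standard_normal((4, 8))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
matched = cehnc_loss(Z, Z)                    # perfectly aligned pairs
mismatched = cehnc_loss(Z, np.roll(Z, 1, 0)) # deliberately misaligned pairs
```

As a sanity check, the loss is lower when image and text embeddings are correctly paired than when the pairing is shuffled.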

To fine-tune the BiomedCLIP model [22] using the CEHNC loss function, we utilized the public MedPix dataset, which contains multiple radiological modalities. We chose the base version of the Vision Transformer as the image encoder and PubMedBERT [34] as the text encoder. To ensure data quality and consistency, we performed detailed preprocessing on the MedPix dataset, including removing all special characters, trimming leading and trailing margins, and excluding samples with captions shorter than 20 characters. These steps help reduce noise and improve the efficiency of model training.

Next, all images are uniformly resized to a fixed input resolution and normalized using the mean and standard deviation of the RGB channels as defined in the original CLIP model. This normalization strategy ensures better alignment and comparability of visual features across different modalities. Following preprocessing, the dataset is split into training and validation subsets at an 85%/15% ratio, yielding 20,292 images for training and 3,515 for validation. This partitioning guarantees a sufficiently large training set while maintaining a representative validation set for reliable performance evaluation.

During the fine-tuning stage, a relatively low initial learning rate is employed, combined with a stepwise learning rate decay of 50%, allowing the model to converge more steadily. Additionally, a batch size of 64 is used throughout training, which provides a balanced trade-off between computational efficiency and accurate gradient estimation. Through these carefully selected hyperparameters and training configurations, we aim to enhance the BiomedCLIP model's effectiveness in medical image analysis tasks [35].

Zero-shot segmentation guided by DPFEM-enhanced textual cues

Using the fine-tuned BiomedCLIP model, we propose a zero-shot generalized medical image segmentation strategy. The strategy integrates the XAI technique, gScoreCAM [8], with BiomedCLIP to generate text-related visual saliency maps guided by anatomical or pathological cues. While gScoreCAM has shown superior accuracy and specificity over Grad-CAM [36] for natural images, we extend its use to radiology tasks and further enhance the generated saliency maps through the DPFEM (Fig 3) before post-processing them with Conditional Random Field (CRF) filters.

Fig 3. Architecture of the proposed DPFEM.

This figure presents the DPFEM architecture, which captures both local details and global contextual information through parallel perception branches and dynamic gated feature fusion to enhance medical image segmentation.

https://doi.org/10.1371/journal.pone.0344978.g003

The gScoreCAM heatmaps are converted to SAM bounding box prompts through the following steps: (1) the heatmap is normalized to [0,1] using min-max normalization; (2) a threshold of 0.5 is applied to obtain a binary mask; (3) morphological closing with a 3×3 kernel is applied to fill small holes and smooth boundaries; (4) the tightest axis-aligned bounding box enclosing all non-zero pixels is computed; (5) this bounding box is provided as the prompt input to SAM for mask generation.
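The five steps can be sketched directly; this assumes a NumPy heatmap and uses SciPy's `binary_closing` for the morphological step:

```python
import numpy as np
from scipy.ndimage import binary_closing

def heatmap_to_bbox(heatmap, threshold=0.5):
    """Convert a saliency heatmap into an axis-aligned box (x0, y0, x1, y1)."""
    h = heatmap.astype(np.float64)
    h = (h - h.min()) / (h.max() - h.min() + 1e-8)        # (1) min-max normalize to [0, 1]
    mask = h >= threshold                                  # (2) binarize at 0.5
    mask = binary_closing(mask, structure=np.ones((3, 3))) # (3) 3x3 morphological closing
    ys, xs = np.nonzero(mask)                              # (4) all non-zero pixels
    if len(xs) == 0:
        return None                                        # nothing salient enough
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The returned box would then serve as the prompt input to SAM (step 5).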

Fig 3 shows an overview of the proposed DPFEM, which captures both local details and global contextual information. The framework operates in several key steps. First, in the input preprocessing stage, the original image passes through an initial convolutional layer, which extracts preliminary spatial features and reduces data dimensionality, laying the foundation for deeper feature learning. Next, the local perception branch applies Depth-wise Separable Convolution (DSC) combined with a SiLU activation, allowing the model to retain spatial structure information at high computational efficiency, which is crucial for understanding subtle but important local features of the image. Meanwhile, the parallel global perception branch introduces a Single-head Self-Attention (SHSA) mechanism, which explores correlations across the whole image region to better grasp the overall semantic content and the interactions between its parts. The final step is dynamic gated attention fusion, in which feature maps from the local and global perception branches are combined through a dynamic gating mechanism that weighs the importance of local and global features according to their relevance and confidence scores. Unlike traditional fusion methods that simply concatenate or sum features, the dynamic gate adjusts weights in real time, providing a more nuanced, context-aware feature representation that improves both segmentation accuracy and generalization across diverse imaging modalities.

The comprehensive representation rich in both fine-grained information and macroscopic vision is thus formed, which greatly enhances the model’s comprehension of the complex medical image as well as the expressiveness of the segmentation task.

Input image preprocessing.

The input image I first undergoes preprocessing through an initial convolutional layer [26]. This step extracts preliminary spatial features and reduces the data dimension for subsequent processing. The output of this step can be represented as:

$$\mathbf{F}_0 = \mathrm{Conv}(\mathbf{I}) \tag{6}$$

where $\mathrm{Conv}(\cdot)$ denotes the convolution operation.

Local perception branch.

This branch utilizes Depth-wise Separable Convolution (DSC) along with an activation function such as SiLU (Sigmoid Linear Unit), to preserve spatial structure information. DSC decomposes a standard convolution into two stages: a depth-wise convolution followed by a point-wise convolution. The process can be formulated as:

$$\mathbf{F}_{\mathrm{local}} = \mathrm{SiLU}\big(\mathrm{PWConv}(\mathrm{DWConv}(\mathbf{F}_0))\big) \tag{7}$$

where $\mathrm{DWConv}(\cdot)$ and $\mathrm{PWConv}(\cdot)$ represent the depth-wise and point-wise convolution operations, respectively.

Global perception branch.

In parallel with the local perception branch, the global perception branch incorporates a Single-head Self-Attention (SHSA) mechanism, which focuses on learning global dependencies. The SHSA mechanism can be expressed as:

$$\mathbf{F}_{\mathrm{global}} = \mathrm{SHSA}(\mathbf{F}_0) \tag{8}$$

where $\mathrm{SHSA}(\cdot)$ represents the single-head self-attention operation. We adopt SHSA rather than multi-head attention for computational efficiency, as our dual-path architecture already provides feature diversity through parallel local and global perception branches. Empirically, SHSA achieves comparable performance to multi-head attention at lower computational cost for our zero-shot segmentation task, while still effectively capturing long-range dependencies between different regions of the image.

Feature fusion.

The features from both the local and global perception branches are then fused together. A common approach to fusion is to concatenate the feature maps and pass them through a fully connected layer or another convolutional layer. The fusion process can be described as:

$$\mathbf{F}_{\mathrm{fused}} = \mathcal{F}\big(\mathrm{Concat}(\mathbf{F}_{\mathrm{local}}, \mathbf{F}_{\mathrm{global}})\big) \tag{9}$$

where $\mathcal{F}(\cdot)$ represents the fully connected or convolutional fusion operation.

We apply a Conditional Random Field (CRF) filter to the fused feature map Ffused for post-processing, to obtain the initial coarse segmentation result. The CRF filter can smooth the segmentation boundaries and remove noise, thereby improving the segmentation quality. By integrating these two complementary paths, the network is not only able to precisely capture local details but also efficiently learn the global structural information within the images. This overcomes the limitations inherent in traditional CNNs and Vision Transformers (ViTs). Ultimately, the fused features result in a more comprehensive and richer representation, enhancing the model’s performance in tasks such as medical image segmentation. The dual-path design ensures computational efficiency while enhancing the model’s ability to parse complex medical images.
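The dual-path design can be sketched as a PyTorch module; layer widths, the stem stride, and the exact gating layout are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class DPFEM(nn.Module):
    """Sketch of the dual-path module: a depth-wise separable local branch,
    a single-head self-attention global branch, and gated fusion."""

    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, stride=2, padding=1)   # initial conv (preprocessing)
        # local branch: depth-wise conv then point-wise conv, with SiLU activation
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pw = nn.Conv2d(ch, ch, 1)
        self.act = nn.SiLU()
        # global branch: single-head self-attention over flattened spatial tokens
        self.attn = nn.MultiheadAttention(ch, num_heads=1, batch_first=True)
        # dynamic gate: per-pixel weight balancing local vs. global features
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        f0 = self.stem(x)
        local = self.act(self.pw(self.dw(f0)))                     # fine-detail path
        b, c, h, w = f0.shape
        tokens = f0.flatten(2).transpose(1, 2)                     # (B, HW, C) token sequence
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)            # back to a feature map
        g = self.gate(torch.cat([local, glob], dim=1))             # dynamic gating weights
        return self.fuse(g * local + (1 - g) * glob)               # gated fusion

out = DPFEM()(torch.randn(2, 3, 64, 64))                           # fused feature map
```

The gate `g` plays the role of the dynamic weighting between the two branches; the subsequent CRF post-processing is omitted here.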

Weakly-supervised segmentation for potential labeling improvement

On the basis of zero-shot segmentation, we further introduce a weakly supervised learning strategy to improve the accuracy and robustness of the segmentation results. The specific implementation steps are as follows: first, the initial rough segmentation results are obtained by using the fine-tuned BiomedCLIP model and the saliency map generated by gScoreCAM, combined with the DPFEM and the Conditional Random Field (CRF) filter for post-processing. These rough segmentation results are used as pseudo-labels. More accurate pseudo-masks are generated by passing these pseudo-labels to the SAM. These pseudo-masks are used as the results of zero-shot segmentation to provide the basis for subsequent weakly-supervised segmentation training with potential labeling improvements.

In the weakly supervised segmentation training, we use the pseudo-masks generated above as training data. Although these pseudo-masks are not fully accurate labels, they provide enough information to guide the learning of the model. We choose a residual U-Net [23] as the model for weakly supervised training: an efficient convolutional neural network architecture that provides high-quality segmentation results while maintaining computational efficiency. During training, the Cross-Entropy Loss function measures the difference between model predictions and the pseudo-labels. Parameter updates are performed using the Adam optimizer with a learning rate of 1e-4 and an appropriate decay schedule. To increase the generalization ability of the model, we applied various data augmentation techniques such as random rotation, flipping, and scaling during training. After weakly supervised training, the model is able to generate finer and more accurate segmentation results, which not only increase boundary accuracy but also reduce noise and mis-segmentation.
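A single illustrative optimization step on pseudo-masks might look as follows; the one-layer convolution is a placeholder for the residual U-Net, and the data are random stand-ins:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Conv2d(1, 2, kernel_size=3, padding=1)     # placeholder for the residual U-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                     # compares logits to class-index masks

images = torch.randn(4, 1, 64, 64)                    # batch of 2D slices (placeholder data)
pseudo_masks = torch.randint(0, 2, (4, 64, 64))       # pseudo-labels from the zero-shot stage

optimizer.zero_grad()
loss = criterion(model(images), pseudo_masks)         # prediction vs. pseudo-label
loss.backward()
optimizer.step()
```

In the actual pipeline the same loop would run over SAM-refined pseudo-masks with the augmentations and decay schedule described above.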

Results

Experimental setup, datasets, and validation metrics

We evaluate MedZeroSeg on three public medical imaging datasets: ACDC for cardiac MRI, Synapse for multi-organ CT, and COVID-QU-Ex for chest X-ray (pulmonary segmentation). All models are implemented in PyTorch and trained on NVIDIA A100 GPUs with the same hyperparameter settings for fair comparison. All quantitative results are reported as mean ± standard deviation over three independent runs with different random seeds to account for training variability.

Notably, MedZeroSeg performs 2D slice-wise inference for inherently 3D datasets (ACDC and Synapse), due to the 2D-only input requirement of the pre-trained CLIP and SAM models. Consequently, 3D/2.5D inference or volumetric post-processing is not applied. This design choice preserves the strict zero-shot setting without introducing task-specific heuristics.

To evaluate the quality of BiomedCLIP fine-tuning, we performed tests on the ROCO dataset [37], which contains about 7,042 multimodal medical images covering a wide range of clinical cases. We validated the top-1 and top-2 matching retrieval accuracy in both image-to-text and text-to-image directions. For the experiments, we performed five runs, each using a batch size of 50, and ensured diverse combinations of text and images within each batch by randomly shuffling the batch (generating a total of 70,420 rearranged examples). During the fine-tuning process, we compared several different loss functions, including the InfoNCE loss [29], DCL [38], HN-NCE [39], and our proposed CEHNC loss. To ensure fairness, all strategies are trained with the same hyperparameters (identical temperature and learning rate). For HN-NCE and CEHNC, we set a uniform difficulty coefficient. In addition, we also consider the pre-trained BiomedCLIP [22], PMC-CLIP [16], and CLIP [10] as the baseline models.

To evaluate the segmentation performance under zero-shot and weakly supervised conditions, as well as the effectiveness of different design components of MedZeroSeg, we selected three public datasets (covering three different imaging modalities) that provide ground truth segmentation labels (cardiac structures, abdominal organs, and lung fields) and are divided into training, validation, and test sets. The details are as follows:

  • Synapse [32]: Synapse is a public multi-organ segmentation dataset. There are 30 contrast-enhanced abdominal clinical CT cases in this dataset. Following the settings in [40], 18 cases are used for training and 12 for testing. The annotation of each image includes 8 abdominal organs.
  • ACDC [31]: ACDC is a public cardiac MRI dataset consisting of 100 exams. For each exam, there are two different modalities, and the corresponding label includes left ventricle, right ventricle, and myocardium. The dataset is split into 80 for training and 20 for testing.
  • Chest X-ray (COVID-QU-Ex) [33]: The COVID-QU-Ex database includes 16,280, 1,372, and 957 chest radiographs (including normal, lung opacities, viral pneumonia, and COVID-19 cases) for training, validation, and testing, respectively.

Based on the above datasets, we conducted a detailed comparative analysis to evaluate the segmentation quality of the initial labels, zero-shot pseudo-masks, and weakly supervised methods generated from the gScoreCAM results after DPFEM-guided conditional random field processing on each test set. For the ablation study of the zero-shot segmentation task, we paid particular attention to two aspects: first, the effect of BiomedCLIP model fine-tuning on the final segmentation results; and second, the performance differences between two class activation map (CAM) techniques, gScoreCAM and Grad-CAM, in generating segmentation masks. These ablation experiments were all performed on test subsets of the three previously mentioned independent datasets.

To ensure the consistency of the experimental conditions, we uniformly used the same SAM architecture, a fixed selection of target layers, a consistent text-prompting strategy, and considered only the top 60 most salient channels from each input image for the CAM analysis, regardless of the setting. In addition, several widely recognized quality evaluation metrics were used to measure the performance of the segmentation algorithms throughout the study, including Intersection over Union (IoU), Dice Similarity Coefficient (DSC), and Area Under the Curve of the Receiver Operating Characteristic (AUC). Their formulas are as follows.

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{10}$$

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN} \tag{11}$$

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\; d(\mathrm{FPR}) \tag{12}$$

where TP, FP, and FN denote the numbers of true-positive, false-positive, and false-negative pixels, and TPR and FPR are the true- and false-positive rates along the ROC curve.

By implementing a paired-samples t-test, we were able to scientifically verify whether the observed phenomena and trends were statistically significant. Specifically, when the p-value is less than 0.05, it can be considered that there is a significant difference between the two sets of data, thus supporting our experimental conclusions. Metrics such as HD95 and ASSD were not reported, as they are highly sensitive to small boundary discrepancies and annotation noise, which are pronounced in medical images and exacerbated in zero-shot settings. Therefore, Dice, IoU, and AUC were selected as more reliable and clinically meaningful evaluation metrics.
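The pixel-wise IoU and Dice metrics can be computed directly from confusion-matrix counts; a minimal NumPy implementation:

```python
import numpy as np

def iou_dice(pred, gt):
    """Binary IoU and Dice Similarity Coefficient from prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()     # true positives
    fp = np.logical_and(pred, ~gt).sum()    # false positives
    fn = np.logical_and(~pred, gt).sum()    # false negatives
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice

iou, dice = iou_dice(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]]))
```

For the example masks (one true positive, one false positive, no false negatives), IoU is 0.5 and Dice is 2/3, illustrating that Dice weights the overlap more generously than IoU.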

Data preprocessing.

Prior to model training and evaluation, comprehensive data preprocessing was applied to ensure data quality and consistency across all datasets. For the ROCO dataset, images were resized to a fixed resolution and normalized using ImageNet statistics. Text captions were lowercased, tokenized using the BiomedBERT tokenizer, and truncated or padded to a maximum sequence length of 77. Duplicate or ambiguous image-text pairs were removed through automated filtering and manual inspection. For the segmentation datasets, we performed modality-specific preprocessing: (1) in Synapse and ACDC, CT and MRI volumes were resampled to isotropic spacing (1 mm3 voxels), intensity-normalized using per-volume z-score normalization, and sliced into axial 2D slices for training; (2) for the chest X-ray dataset, radiographs were resized to a fixed resolution and pixel intensities were scaled to [0, 1]. To reduce bias, all datasets were checked for patient overlap between train/validation/test splits, and data augmentation (random horizontal flipping, rotation, and brightness jittering) was applied only to the training sets during fine-tuning.

Quantitative and qualitative results

Table 1 demonstrates the performance of the BiomedCLIP model on the ROCO dataset in cross-modal retrieval tasks including text-to-image and image-to-text after fine-tuning the model using a number of different loss functions. To provide a benchmark comparison, we also list three pre-trained CLIP models as references, namely, the standard CLIP [10], the medical domain-specific PMC-CLIP [16], and the original version of BiomedCLIP [22]. Through a paired McNemar statistical test, we found that the BiomedCLIP model fine-tuned using our CEHNC loss function significantly outperforms the other available loss functions and all pre-trained baseline models. This result demonstrates the effectiveness of the CEHNC loss function in improving cross-modal retrieval accuracy.

Table 1. Top-K cross-modal retrieval accuracy for CLIP models on the ROCO dataset.

https://doi.org/10.1371/journal.pone.0344978.t001
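As an aside on how Top-K cross-modal retrieval accuracy of the kind reported in Table 1 is typically computed: given a query-by-candidate similarity matrix in which the matching pair shares the same index, a minimal sketch is (toy values, not our actual CLIP similarity scores):

```python
import numpy as np

def topk_retrieval_accuracy(sim, k=5):
    """Fraction of queries whose matching candidate (same row/column index)
    appears among the k highest-similarity candidates."""
    topk = np.argsort(-sim, axis=1)[:, :k]            # indices of k best matches
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy 4x4 image-to-text similarity matrix (rows: queries, cols: candidates).
sim = np.array([[0.9, 0.1, 0.2, 0.0],
                [0.3, 0.2, 0.8, 0.1],   # correct match only ranked 2nd
                [0.1, 0.7, 0.6, 0.2],   # correct match only ranked 2nd
                [0.0, 0.1, 0.2, 0.9]])
print(topk_retrieval_accuracy(sim, k=1))  # 0.5
print(topk_retrieval_accuracy(sim, k=2))  # 0.75
```

The same function covers both retrieval directions: passing the transposed similarity matrix computes text-to-image accuracy.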

Table 2 details the accuracy of MedZeroSeg under different settings in the zero-shot segmentation task. Specifically, we compare two factors: first, the pre-trained BiomedCLIP versus BiomedCLIP fine-tuned with our context-aware hard-negative mining; second, the effectiveness of gScoreCAM versus Grad-CAM for generating SAM bounding-box prompts. The experimental results show that bounding-box prompts generated with gScoreCAM are significantly better than those from Grad-CAM, suggesting that gScoreCAM improves segmentation accuracy more effectively. In addition, fine-tuning BiomedCLIP with our proposed CEHNC loss function not only further verified its superior performance across multiple task types and image modalities, but also yielded an overall improvement in segmentation quality. This confirms the effectiveness of the CEHNC loss function for enhancing BiomedCLIP in medical image processing tasks. In Table 2, we also include the MIS-Net method [41], a SAM-based weakly supervised baseline, against which our improvements are clearly demonstrated.

Table 2. Comparison of different models and CAM techniques with standard deviation and training time.

https://doi.org/10.1371/journal.pone.0344978.t002

In Table 3, we show in detail the segmentation accuracy of the proposed methods under zero-shot and weakly-supervised settings, with fully-supervised segmentation as a reference benchmark. This comparison helps to comprehensively evaluate the performance of different methods in real applications.

Table 3. Performance of various methods compared to fully supervised baseline.

https://doi.org/10.1371/journal.pone.0344978.t003

For the zero-shot segmentation task, we compare two different ways of generating initial labels: one is based on the initial labels generated by the gScoreCAM saliency map (“saliency map”), and the other is a pseudo-mask generated via SAM (“saliency map + SAM”). The experimental results show that the method combining BiomedCLIP and SAM exhibits significant superiority in all evaluation metrics, significantly improving segmentation quality (p < 0.05). This suggests that by integrating these two techniques, critical regions in the image can be captured more efficiently, thus improving the accuracy and reliability of segmentation.
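The saliency-to-prompt step in the "saliency map + SAM" variant can be illustrated with a short sketch: threshold the gScoreCAM map, take the bounding box of the salient region, and pass it to SAM as a box prompt. The threshold value below is an illustrative assumption, not the exact heuristic used in our pipeline:

```python
import numpy as np

def saliency_to_box(saliency, thresh=0.5):
    """Convert a [0, 1] saliency map into an (x0, y0, x1, y1) bounding box
    enclosing all pixels >= thresh; returns None if nothing is salient."""
    ys, xs = np.where(saliency >= thresh)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy saliency map with a hot 3x4 region (rows 5-7, cols 2-5).
sal = np.zeros((16, 16))
sal[5:8, 2:6] = 0.9
box = saliency_to_box(sal)
print(box)  # (2, 5, 5, 7)

# The box can then prompt SAM via the official `segment_anything` package
# (model loading and image setup omitted here):
#   predictor.set_image(image)
#   masks, scores, _ = predictor.predict(box=np.array(box),
#                                        multimask_output=False)
```

Because SAM receives only a coarse box rather than the noisy saliency mask itself, it is free to snap the final mask to image edges, which is what drives the improvement over using the saliency map alone.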

Under the weakly supervised setting, the performance comparison between zero-shot and weakly supervised methods shows dataset- and metric-dependent trends. For the Chest X-ray dataset, the weakly supervised ResUNet substantially outperforms the zero-shot approach across all metrics (IoU: 75.95 vs. 48.85, DSC: 85.92 vs. 64.24). In contrast, for ACDC and Synapse, the zero-shot method (Saliency Maps + DPFEM + CRF) achieves competitive or superior performance compared to the weakly supervised ResUNet in terms of IoU and DSC. Specifically, on ACDC, the zero-shot method achieves IoU of 57.52 vs. 40.99 for weakly supervised, and on Synapse, IoU of 49.91 vs. 41.55. It should be noted that the statistical significance markers in Table 3 indicate comparisons against the fully supervised ResUNet baseline, not against the weakly supervised method.

Although current fully-supervised deep learning models [23] provide state-of-the-art accuracy in the field of medical image segmentation, our MedZeroSeg zero-shot segmentation method still demonstrates competitiveness on some specific tasks. Specifically, MedZeroSeg outperforms the ResUNet-based fully supervised method in the ACDC and Synapse segmentation tasks. This finding suggests that our method can provide high-quality segmentation results even in the absence of large amounts of labeled data.

However, the fully-supervised method still performs well in the lung radiograph segmentation task, outperforming the zero-shot method in terms of IoU, DSC, and AUC. This suggests that sufficient labeled data remains a key factor in improving segmentation accuracy in certain application scenarios.

Fig 4 presents representative qualitative results for the zero-shot segmentation setting on the ACDC cardiac MRI dataset. Each example includes the original image, the corresponding ground-truth label, and the predictions produced by BiomedCLIP, the SAM-based weakly supervised baseline MIS-Net, and the proposed MedZeroSeg framework. Under the same zero-shot segmentation configuration, MedZeroSeg yields cardiac masks that more faithfully follow the contours of the left ventricle, right ventricle, and myocardium, with more complete coverage of thin myocardial walls and clearer separation between adjacent structures. In contrast, BiomedCLIP often under-segments the ventricular cavities or fails to capture subtle myocardial boundaries, while MIS-Net tends to generate fragmented or overly smooth contours that deviate from the ground truth. Across diverse cardiac views and varying contrast levels, MedZeroSeg produces fewer spurious predictions in the background and better preserves small anatomical details. These qualitative observations are consistent with the quantitative improvements in IoU, DSC, and AUC reported in Table 2, and jointly demonstrate the superiority of MedZeroSeg over BiomedCLIP and MIS-Net for zero-shot cardiac segmentation on ACDC.

Fig 4. The cardiac MRI segmentation results on the ACDC dataset.

Representative qualitative comparison among BiomedCLIP, the SAM-based weakly supervised baseline MIS-Net, and the proposed MedZeroSeg in the zero-shot segmentation setting. For each cardiac MRI slice, the original image, ground-truth label, and predicted masks from the three methods are shown. MedZeroSeg yields more accurate and complete delineations of the left ventricle, right ventricle, and myocardium, with clearer boundaries and fewer spurious predictions than BiomedCLIP and MIS-Net.

https://doi.org/10.1371/journal.pone.0344978.g004

Fig 5 shows qualitative zero-shot segmentation results on the Synapse multi-organ abdominal CT dataset, again comparing MedZeroSeg with BiomedCLIP and MIS-Net. The examples cover several representative abdominal organs, including both large structures and smaller, shape-irregular targets. Under identical zero-shot segmentation conditions, MedZeroSeg produces organ masks that more closely align with the ground-truth annotations, particularly around complex boundaries and regions with low contrast or ambiguous intensity transitions. BiomedCLIP frequently under-segments target organs or confuses neighboring tissues, while MIS-Net tends to suffer from boundary leakage and missing small structures, leading to incomplete or noisy segmentations. By contrast, MedZeroSeg maintains coherent organ shapes and reduces both false positives and false negatives across different slices and patients. Together with the performance gains summarized in Table 2, the qualitative results in Fig 5 further highlight that MedZeroSeg generalizes more effectively than BiomedCLIP and MIS-Net to challenging multi-organ CT segmentation tasks in the zero-shot setting.

Fig 5. The multi-organ CT segmentation results on the Synapse dataset.

Qualitative zero-shot segmentation results comparing BiomedCLIP, MIS-Net, and the proposed MedZeroSeg on representative abdominal CT slices. Each example includes the original image, ground truth annotations, and the corresponding predictions from the three methods. Across multiple abdominal organs, MedZeroSeg produces masks that better align with the ground truth, particularly around challenging boundaries and small structures, while BiomedCLIP and MIS-Net tend to under-segment targets or leak into adjacent tissues.

https://doi.org/10.1371/journal.pone.0344978.g005

Beyond cardiac MRI and abdominal CT, we further evaluate MedZeroSeg in the zero-shot segmentation setting on the COVID-QU-Ex chest X-ray dataset, which contains chest radiographs from normal, lung opacity, viral pneumonia, and COVID-19 cases. As illustrated in Fig 6, each example shows the original radiograph, the ground-truth lung or lesion mask, and the corresponding predictions from BiomedCLIP, the SAM-based weakly supervised baseline MIS-Net, and the proposed MedZeroSeg. Under identical zero-shot segmentation conditions, BiomedCLIP tends to under-estimate the spatial extent of diffuse pulmonary opacities or miss faint peripheral lesions, while MIS-Net often produces fragmented or overly smooth masks with noticeable leakage into ribs, heart shadows, or mediastinal regions. In contrast, MedZeroSeg generates more contiguous and anatomically plausible segmentations, better delineating the bilateral lung fields and pathological opacities while suppressing background structures. These qualitative findings are consistent with the quantitative results in Table 2, where MedZeroSeg achieves the highest IoU, DSC, and AUC among the three methods on the chest X-ray dataset, further demonstrating its advantage in zero-shot chest X-ray segmentation.

Fig 6. The chest X-ray segmentation results on the COVID-QU-Ex dataset.

Representative qualitative comparison of lung or lesion segmentation in the zero-shot segmentation setting using BiomedCLIP, the SAM-based weakly supervised baseline MIS-Net, and the proposed MedZeroSeg. For each chest radiograph, the original image, ground-truth mask, and predicted masks from the three methods are presented. Compared with BiomedCLIP and MIS-Net, MedZeroSeg more accurately captures the extent of pulmonary opacities and preserves anatomically plausible lung boundaries, while reducing false positives in ribs, cardiac regions, and other background structures.

https://doi.org/10.1371/journal.pone.0344978.g006

Discussion

The proposed MedZeroSeg represents an early effort to integrate CLIP and SAM models for generic radiology segmentation tasks. Unlike prior adaptations that combine CLIP or SAM individually for specific medical datasets, our method is specifically designed under a zero-shot setup, where no domain-specific fine-tuning is performed.

By introducing the XAI technique gScoreCAM into the medical imaging domain, our approach establishes a unique saliency-to-prompt conversion mechanism that bridges semantic text embeddings and spatial localization cues. This enables textual cue–based interaction, easy adaptation to unseen data domains or tasks, and efficient utilization of pre-trained models for medical image segmentation. In addition, the proposed dual-path refinement structure and CEHNC loss collaboratively enhance feature alignment between the image and text modalities, achieving both superior segmentation accuracy and robust generalization.

A major contribution of this study is the design of the new CEHNC loss function, which allows more efficient fine-tuning of the BiomedCLIP model than current state-of-the-art loss functions, especially under small batch sizes (see Table 1). Although we have demonstrated the application of the CEHNC loss in unsupervised CLIP model fine-tuning, future work will explore its potential for full model training.

When using BiomedCLIP and gScoreCAM to generate saliency maps, we used simple textual prompts such as “brain tumor” to describe the segmentation task. However, we note that the quality of these saliency maps can be further improved through more sophisticated text prompt engineering, incorporating richer anatomical or pathological descriptors (e.g., shape and location). This opens promising possibilities for interactive radiology education.
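As an illustration of the kind of prompt enrichment we have in mind, simple templates with anatomical or pathological slots might look like this (the templates and slot values are hypothetical examples, not the prompts used in our experiments):

```python
# Hypothetical prompt templates with richer descriptors (illustrative only).
templates = [
    "{target}",
    "an {modality} scan showing a {target}",
    "a {target} in the {location} with an {shape} margin",
]
slots = dict(target="brain tumor", modality="MRI",
             location="left hemisphere", shape="irregular")
prompts = [t.format(**slots) for t in templates]
for p in prompts:
    print(p)
# brain tumor
# an MRI scan showing a brain tumor
# a brain tumor in the left hemisphere with an irregular margin
```

Each enriched prompt would be fed to the text encoder in place of the bare target name, and the resulting saliency maps could be compared or ensembled to localize the target more precisely.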

As seen in the ablation study, both gScoreCAM and the fine-tuned BiomedCLIP play critical roles in the success of our approach. Weakly supervised segmentation improved accuracy primarily in radiograph-based lung segmentation, while the complex contrast in ultrasound and the inherently 3D nature of the Synapse dataset suggest that volumetric segmentation methods may be more suitable for such data.

Generalizability limitations. While MedZeroSeg was evaluated on three publicly available benchmark datasets (ACDC, Synapse, and COVID-QU-Ex), the generalizability of our method to clinical data from different imaging vendors, acquisition protocols, or patient populations remains to be validated. In particular, the COVID-QU-Ex dataset is likely enriched with severe disease patterns such as diffuse bilateral ground-glass opacities commonly observed in severe COVID-19 cases, which may not be fully representative of routine clinical chest radiographs. Additionally, vendor-specific imaging characteristics and center-dependent biases inherent in public datasets may limit the direct transferability of our results to real-world multi-center settings.

Annotation quality considerations. The contrasting performance trends observed across datasets may be partially attributable to differences in annotation granularity and precision. The COVID-QU-Ex lung masks are generated using standardized, pixel-level lung field annotations with clear anatomical boundaries, which naturally favor supervised learning approaches. In contrast, ACDC and Synapse involve more complex anatomical structures (cardiac chambers, multiple abdominal organs) where annotations may be coarser or subject to inter-observer variability. Under such conditions, a zero-shot model leveraging pretrained medical priors may appear more competitive relative to supervised models trained to replicate imperfect labels.

Notably, the recent MedSAM [19] has shown excellent performance in medical applications. However, since MedSAM was fine-tuned on a large number of publicly available medical datasets (including our test set), its direct use would violate the zero-shot assumption of our framework. Nevertheless, given the strong baseline performance of SAM in our system, we plan to further explore integrating MedSAM into MedZeroSeg to assess its potential benefits.

Finally, this study was validated across three segmentation tasks and imaging modalities. In future work, we will extend our evaluation to a broader range of medical domains and imaging types to comprehensively assess the generalization ability and practical applicability of MedZeroSeg. External validation on multi-center clinical datasets will be essential to further assess real-world robustness. We hope these efforts will further advance innovation in medical image segmentation and understanding.

Conclusions

In this study, we introduced MedZeroSeg, a zero-shot medical image segmentation framework that leverages the complementary strengths of CLIP and SAM. Without any domain-specific fine-tuning, MedZeroSeg achieves accurate segmentation across multiple modalities, effectively reducing reliance on large-scale annotated datasets. Our design integrates a DPFEM to capture both local anatomical details and global contextual information, and employs a novel CEHNC loss to enhance contrastive learning through context-aware hard negative selection. Together, these components enable more robust and efficient feature alignment between image and text modalities. Experimental results on diverse public datasets, including MRI, CT, and X-ray, demonstrate that MedZeroSeg delivers competitive segmentation accuracy and strong cross-domain generalization. While promising, our current framework remains limited by its 2D design, prompt dependency, and potential bias from SAM-based mask proposals. Future work will focus on extending MedZeroSeg to 3D volumetric reasoning, adaptive prompt learning, and uncertainty estimation, as well as validating it on multi-center clinical datasets to further assess real-world robustness and generalizability. In summary, MedZeroSeg provides an efficient, extensible, and annotation-free solution for medical image segmentation, contributing to the broader adoption of foundation models in clinical imaging analysis [18, 20, 25].

References

  1. Fernandes SL, Tanik UJ, Rajinikanth V, Karthik KA. A reliable framework for accurate brain image examination and treatment planning based on early diagnosis support for clinicians. Neural Comput & Applic. 2019;32(20):15897–908.
  2. Arun N, Gaw N, Singh P, Chang K, Aggarwal M, Chen B, et al. Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiol Artif Intell. 2021;3(6):e200267. pmid:34870212
  3. Bae W, Noh J, Kim G. Rethinking class activation mapping for weakly supervised object localization. In: European Conference on Computer Vision. 2020. p. 618–34. https://doi.org/10.1007/978-3-030-58555-6_37
  4. Xie W, Ye Y, Hong Q, Yao J, Wu S, Zhou R, et al. Endo-HDR: Dynamic endoscopic reconstruction with deformable 3D Gaussians and hierarchical depth regularization. Knowledge-Based Systems. 2026;332:114914.
  5. Xie W, Yao J, Cao X, Lin Q, Tang Z, Dong X, et al. SurgicalGaussian: Deformable 3D Gaussians for high-fidelity surgical scene reconstruction. In: 2024. p. 617–27. https://doi.org/10.1007/978-3-031-72089-5_58
  6. Baevski A, Babu A, Hsu WN, Auli M. Efficient self-supervised learning with contextualized target representations for vision, speech and language. In: 2023. p. 1416–29. https://doi.org/10.5555/3618408.3618467
  7. Byra M, Jarosik P, Szubert A, Galperin M, Ojeda-Fournier H, Olson L, et al. Breast mass segmentation in ultrasound with selective kernel U-Net convolutional neural network. Biomed Signal Process Control. 2020;61:102027. pmid:34703489
  8. Chen P, Li Q, Biaz S, Bui T, Nguyen A. gScoreCAM: What objects is CLIP looking at? In: Lecture Notes in Computer Science. Springer Nature Switzerland; 2023. p. 588–604. https://doi.org/10.1007/978-3-031-26316-3_35
  9. Chen T, Mai Z, Li R, Chao WL. Segment Anything Model (SAM) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:2305.05803. 2023. https://arxiv.org/abs/2305.05803
  10. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. 2021. p. 8748–63. http://proceedings.mlr.press/v139/radford21a.html
  11. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 4015–26. https://doi.org/10.1109/ICCV51070.2023.00371
  12. Wang Z, Yang Y, Chen Y, Yuan T, Sermesant M, Delingette H, et al. Mutual information guided diffusion for zero-shot cross-modality medical image translation. IEEE Trans Med Imaging. 2024;43(8):2825–38. pmid:38551825
  13. Li Y, Shao H-C, Liang X, Chen L, Li R, Jiang S, et al. Zero-shot medical image translation via frequency-guided diffusion models. IEEE Trans Med Imaging. 2024;43(3):980–93. pmid:37851552
  14. Li S, Cao J, Ye P, Ding Y, Tu C, Chen T. ClipSAM: CLIP and SAM collaboration for zero-shot anomaly segmentation. Neurocomputing. 2025;618:129122.
  15. Li Y, Wang H, Duan Y, Zhang J, Li X. A closer look at the explainability of contrastive language-image pre-training. Pattern Recognition. 2025;162:111409.
  16. Lin W, Zhao Z, Zhang X, Wu C, Zhang Y, Wang Y, et al. PMC-CLIP: Contrastive language-image pre-training using biomedical documents. In: 2023. p. 525–36. https://doi.org/10.1007/978-3-031-43993-3_51
  17. Liu J, Lin Z, Padhy S, Tran D, Bedrax Weiss T, Lakshminarayanan B. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems. 2020;33:7498–512.
  18. Yang X, Gong X. Foundation model assisted weakly supervised semantic segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024. p. 523–32. https://doi.org/10.1109/WACV57701.2024.00058
  19. Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nature Communications. 2024;15(1):654.
  20. Hu X, Xu X, Shi Y. How to efficiently adapt large segmentation model (SAM) to medical images. arXiv preprint arXiv:2306.13731. 2023. https://arxiv.org/abs/2306.13731
  21. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. 2015. p. 234–41. https://doi.org/10.1007/978-3-319-24574-4_28
  22. Zhang S, Xu Y, Usuyama N, Xu H, Bagga J, Tinn R, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915. 2023. https://arxiv.org/abs/2303.00915
  23. Zhang Z, Liu Q, Wang Y. Road extraction by deep residual U-Net. IEEE Geosci Remote Sensing Lett. 2018;15(5):749–53.
  24. Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. 2016. p. 694–711. https://doi.org/10.1007/978-3-319-46475-6_43
  25. Huang Z, Liu H, Zhang H, Li X, Liu H, Xing F, et al. Push the boundary of SAM: A pseudo-label correction framework for medical segmentation. arXiv preprint arXiv:2308.00883. 2023. https://arxiv.org/abs/2308.00883
  26. Zhang B, Wang Y, Ding C, Deng Z, Li L, Qin Z, et al. Multi-scale feature pyramid fusion network for medical image segmentation. Int J Comput Assist Radiol Surg. 2023;18(2):353–65. pmid:36042149
  27. Zhu D, Sun D, Wang D. Dual attention mechanism network for lung cancer images super-resolution. Comput Methods Programs Biomed. 2022;226:107101. pmid:36367483
  28. Chae G, Lee J, Kim SB. Contrastive learning with hard negative samples for chest X-ray multi-label classification. Applied Soft Computing. 2024;165:112101.
  29. Oord AV, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. 2018. https://arxiv.org/abs/1807.03748
  30. Song Y, Liu Y, Lin Z, Zhou J, Li D, Zhou T, et al. Learning from AI-generated annotations for medical image segmentation. IEEE Trans Consumer Electron. 2025;71(1):1473–81.
  31. Zhang R. ACDC dataset. 2016. Database: figshare [Internet]. https://doi.org/10.6084/m9.figshare.31287772
  32. Zhang R. Synapse dataset. 2016. Database: figshare [Internet]. https://doi.org/10.6084/m9.figshare.31287538
  33. Tahir M, Chowdhury MEH, Qiblawey Y, Khandakar A, Rahman T, Kiranyaz S, et al. COVID-QU-Ex. Kaggle. 2016. https://www.kaggle.com/datasets/anasmohammedtahir/covidqu
  34. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare. 2021;3(1):1–23.
  35. Yu X, Zhang L, Wu Z, Zhu D. Core-periphery multi-modality feature alignment for zero-shot medical image analysis. IEEE Trans Med Imaging. 2025;44(10):3973–83. pmid:39418140
  36. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 618–26. https://doi.org/10.1109/ICCV.2017.74
  37. Pelka O, Koitka S, Rückert J, Nensa F, Friedrich CM. Radiology objects in context (ROCO): A multimodal image dataset. In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. https://doi.org/10.1007/978-3-030-01364-6_20
  38. Yeh CH, Hong CY, Hsu YC, Liu TL, Chen Y, LeCun Y. Decoupled contrastive learning. In: 2022. p. 668–84. https://doi.org/10.1007/978-3-031-19809-0_38
  39. Radenovic F, Dubey A, Kadian A, Mihaylov T, Vandenhende S, Patel Y, et al. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 6967–77. https://doi.org/10.1109/CVPR52729.2023.00673
  40. Chen T, Kornblith S, Swersky K, Norouzi M, Hinton GE. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems. 2020;33:22243–55.
  41. Häkkinen I, Melekhov I, Englesson E, Azizpour H, Kannala J. Medical image segmentation with SAM-generated annotations. In: 2024. p. 51–62. https://doi.org/10.1007/978-3-031-92089-9_4