MedZeroSeg: Zero-shot medical image segmentation via vision foundation models

doi:10.1371/journal.pone.0344978

Fig 1.

An overview of the proposed MedZeroSeg framework.

This figure shows the MedZeroSeg architecture for zero-shot medical image segmentation, highlighting the CEHNC loss for fine-tuning, the DPFEM for feature extraction, and the textual prompt-guided segmentation process.

Fig 2.

Illustration of the proposed CEHNC loss pipeline.

This figure presents the CEHNC loss mechanism, which dynamically selects hard-negative samples and updates its difficulty parameters to enhance fine-grained feature discrimination during training.
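
To make the idea of emphasising hard negatives concrete, the following is a minimal sketch of a generic hard-negative-weighted contrastive (InfoNCE-style) loss. It is not the CEHNC loss itself, whose exact form is not given in this caption. The function name `hard_negative_info_nce` and the parameters `temperature` and `beta` are hypothetical, chosen only for illustration; `beta` stands in for a single "difficulty" knob.

```python
import math


def hard_negative_info_nce(pos_sim, neg_sims, temperature=0.1, beta=1.0):
    """Generic hard-negative-weighted InfoNCE for one anchor (illustrative only).

    pos_sim  : similarity between the anchor and its positive pair.
    neg_sims : similarities between the anchor and each negative.
    beta     : hypothetical difficulty parameter; 0 gives plain InfoNCE,
               larger values put more weight on high-similarity negatives.
    """
    # Weight each negative by exp(beta * similarity), normalised to mean 1.
    raw = [math.exp(beta * s) for s in neg_sims]
    mean_raw = sum(raw) / len(raw)
    weights = [r / mean_raw for r in raw]

    pos_term = math.exp(pos_sim / temperature)
    neg_term = sum(
        w * math.exp(s / temperature) for w, s in zip(weights, neg_sims)
    )
    return -math.log(pos_term / (pos_term + neg_term))


# Made-up similarities: one hard negative (0.7) and two easy ones.
plain_loss = hard_negative_info_nce(0.8, [0.7, 0.1, -0.3], beta=0.0)
hard_loss = hard_negative_info_nce(0.8, [0.7, 0.1, -0.3], beta=2.0)
```

With `beta=0.0` every negative has weight 1 and the expression reduces to the standard InfoNCE loss; raising `beta` shifts weight toward the negative that is most similar to the anchor, so `hard_loss` is larger than `plain_loss` in this toy case.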

Fig 3.

Architecture of the proposed DPFEM.

This figure presents the DPFEM architecture, which captures both local details and global contextual information through parallel perception branches and dynamic gated feature fusion to enhance medical image segmentation.
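
As a rough illustration of gated feature fusion, the sketch below blends a "local" and a "global" feature vector with an elementwise sigmoid gate. It is a generic formulation under stated assumptions, not the DPFEM design: the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical, and a real module would learn the gate from data and operate on feature maps rather than short lists.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_feat, global_feat, gate_weights, gate_bias=0.0):
    """Blend two equal-length feature vectors with a per-element gate.

    gate  = sigmoid(w_local * local + w_global * global + bias)
    fused = gate * local + (1 - gate) * global
    """
    w_local, w_global = gate_weights
    fused = []
    for loc, glob in zip(local_feat, global_feat):
        gate = sigmoid(w_local * loc + w_global * glob + gate_bias)
        # The gate decides how much of each branch survives at this element.
        fused.append(gate * loc + (1.0 - gate) * glob)
    return fused


# Toy inputs (made-up values). Zero gate weights give an even 50/50 blend.
local_branch = [1.0, 0.0]
global_branch = [0.0, 1.0]
even_blend = gated_fusion(local_branch, global_branch, gate_weights=(0.0, 0.0))
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the two branch values at that position, so neither branch can be amplified beyond its input range.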

Table 1.

Top-K cross-modal retrieval accuracy for CLIP models on the ROCO dataset.
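
For readers unfamiliar with the metric, Top-K retrieval accuracy can be sketched as follows. The sketch assumes a square similarity matrix in which the correct match for query i is candidate i (for example, image i paired with caption i); the function name `top_k_accuracy` and the toy scores are illustrative and are not values from the paper.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose true match is among the k best-scoring candidates.

    similarity[i][j] is the score between query i and candidate j;
    the true match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix (made-up numbers).
scores = [
    [0.9, 0.1, 0.3],  # query 0: true match ranked first
    [0.8, 0.2, 0.5],  # query 1: true match ranked third
    [0.1, 0.7, 0.6],  # query 2: true match ranked second
]
top1 = top_k_accuracy(scores, 1)
top2 = top_k_accuracy(scores, 2)
top3 = top_k_accuracy(scores, 3)
```

In this toy case one of three queries is matched at K=1, two at K=2, and all three at K=3, which shows why accuracy can only stay level or rise as K grows.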

Table 2.

Comparison of different models and CAM techniques with standard deviation and training time.

Table 3.

Performance of various methods compared to fully supervised baseline.

Fig 4.

The cardiac MRI segmentation results on the ACDC dataset.

Representative qualitative comparison among BiomedCLIP, the SAM-based weakly supervised baseline MIS-Net, and the proposed MedZeroSeg in the zero-shot segmentation setting. For each cardiac MRI slice, the original image, ground-truth label, and predicted masks from the three methods are shown. MedZeroSeg yields more accurate and complete delineations of the left ventricle, right ventricle, and myocardium, with clearer boundaries and fewer spurious predictions than BiomedCLIP and MIS-Net.

Fig 5.

The multi-organ CT segmentation results on the Synapse dataset.

Qualitative zero-shot segmentation results comparing BiomedCLIP, MIS-Net, and the proposed MedZeroSeg on representative abdominal CT slices. Each example includes the original image, ground truth annotations, and the corresponding predictions from the three methods. Across multiple abdominal organs, MedZeroSeg produces masks that better align with the ground truth, particularly around challenging boundaries and small structures, while BiomedCLIP and MIS-Net tend to under-segment targets or leak into adjacent tissues.

Fig 6.

The chest X-ray segmentation results on the COVID-QU-Ex dataset.

Representative qualitative comparison of lung or lesion segmentation in the zero-shot segmentation setting using BiomedCLIP, the SAM-based weakly supervised baseline MIS-Net, and the proposed MedZeroSeg. For each chest radiograph, the original image, ground-truth mask, and predicted masks from the three methods are presented. Compared with BiomedCLIP and MIS-Net, MedZeroSeg more accurately captures the extent of pulmonary opacities and preserves anatomically plausible lung boundaries, while reducing false positives in ribs, cardiac regions, and other background structures.
