
SegMan-based dual-prior network with boundary-augmented hybrid attention for robust skin lesion segmentation

  • Jiayue Wang,

    Roles Conceptualization, Data curation, Formal analysis, Software, Writing – original draft, Writing – review & editing

    Affiliation Beijing University of Chinese Medicine, Beijing, China

  • Tianlu Zhang,

    Roles Data curation, Formal analysis, Methodology, Writing – review & editing

    Affiliation Beijing Hospital of Traditional Chinese Medicine, Capital Medical University, Beijing Institute of Traditional Chinese Medicine, Beijing, China

  • Ping Li

    Roles Resources, Writing – review & editing

    bjtcmzyyy@yeah.net

    Affiliation Beijing University of Chinese Medicine, Beijing, China

Abstract

Skin lesion segmentation is a crucial component of dermoscopic computer-aided diagnosis, yet challenges such as boundary ambiguity, morphological diversity, and noise interference under complex imaging conditions still limit the accuracy and robustness of existing methods. To address these issues, we propose a dual-prior hybrid segmentation network that integrates both boundary priors and shape priors. In the encoder, a gradient-driven Boundary-augmented Hybrid Attention module is constructed to jointly capture long-range contextual information through explicit boundary enhancement, self-attention, and state space–inspired modeling. In the decoder, a Multi-scale Lesion Shape Prior module is designed to impose global structural constraints on the segmentation mask via multi-scale shape priors and a unified loss formulation, thereby balancing fine-grained contour precision with overall morphological consistency. Evaluated on three public datasets—ISIC2018, HAM10000, and PH2—the proposed method achieves IoU/DSC scores of 92.4%/96.0%, 87.3%/93.2%, and 95.2%/97.5%, respectively, outperforming the strongest baseline by an average margin of 1.4 percentage points in IoU while reducing HD95 and ASD by approximately 0.8 and 0.06 on average. Moreover, with only 3.79G FLOPs, our method surpasses a range of state-of-the-art Transformer and CNN–Transformer hybrid architectures, demonstrating its comprehensive advantages in accuracy, boundary quality, and computational efficiency.

1. Introduction

Skin cancer and high-risk dermatological lesions have shown a continuous upward trend worldwide, and their early detection together with accurate lesion segmentation plays a crucial clinical role in improving patient survival rates and reducing medical burden [1,2]. With the increasing availability of dermoscopic imaging devices and the rapid development of medical digitalization, large-scale skin lesion image data have been accumulated, providing a solid foundation for the advancement of computer-aided diagnosis systems [3]. Among these components, automated skin lesion segmentation acts as a key prerequisite for subsequent tasks such as lesion classification, lesion evolution assessment, and treatment planning. Achieving robust and fine-grained segmentation under complex imaging conditions and diverse lesion morphologies has therefore become an important research direction at the intersection of medical imaging and computer vision.

However, existing deep learning–based skin lesion segmentation methods still face several challenges [4]. On one hand, lesion boundaries often exhibit ambiguous transitions, low contrast, and irregular shapes, and are easily affected by artifacts such as body hair, illumination shadows, skin texture, and imaging noise. As a result, networks relying solely on local texture or global semantics may struggle to capture the true lesion contours accurately. On the other hand, lesions captured across different patients and imaging devices vary significantly in size, shape, and appearance, making it difficult for models to balance local detail preservation and overall structural constraints [5]. This often leads to fragmented boundaries, missed detection of small lesions, or either over-segmentation or under-segmentation of lesion regions [6]. Although some existing methods introduce boundary auxiliary branches or shape priors, they often utilize these cues only at a specific stage of the network and lack a unified framework for jointly modeling boundary information, global context, and multi-scale shape structure, thereby limiting their generalization and robustness in complex scenarios.

To address these issues, we propose a dual-prior hybrid segmentation network that jointly incorporates boundary priors and shape priors to enhance the modeling of lesion structures in both the encoding and decoding stages. In the encoder, a Boundary-augmented Hybrid Attention (BAHA) module fuses explicit boundary information extracted by gradient operators and edge convolutions with local self-attention and convolutional gated features, while a state-space-inspired mechanism propagates boundary and semantic cues across long spatial ranges to obtain multi-scale representations that capture both fine-grained edges and global context. In the decoder, a Multi-scale Lesion Shape Prior (MLSP) module derives multi-scale lesion shape priors using IDCNN and edge-enhanced operators and injects them into the decoding features through gated projection, which explicitly constrains the global shape and structural consistency of the predicted segmentation mask. Under the joint optimization of the main segmentation loss, a boundary consistency loss, and a shape prior loss, the overall framework achieves more precise boundary delineation together with improved robustness and interpretability.
To clarify the novelty over prior boundary-aware or shape-prior segmentation networks, we emphasize that our key contribution is not simply adding a boundary branch or an auxiliary prior head, but introducing a dual-prior coupling mechanism across both stages: (i) BAHA performs explicit boundary modeling and local self-attention while further integrating a state-space global mixing pathway that propagates boundary/semantic cues over long spatial ranges, which is fundamentally different from purely CNN-based boundary refinement or standard CNN–Transformer feature fusion; (ii) MLSP constructs multi-scale lesion shape priors and injects them into decoding features through a global gating vector derived from contour responses, enabling consistent coarse-to-fine morphological constraints rather than stage-specific or single-scale priors. Table 1 summarizes which parts of our framework are fundamentally new and which follow common design practices.

Table 1. Novelty clarification of the proposed framework (high-level).

https://doi.org/10.1371/journal.pone.0344622.t001

The main contributions of this work can be summarized as follows:

  1. (1) We design a dual-prior hybrid segmentation network for skin lesion analysis that incorporates explicit boundary priors and multi-scale shape priors into an end-to-end architecture, thereby capturing fine-grained local details while preserving coherent global structural information.
  2. (2) We design the Boundary-augmented Hybrid Attention module in the encoder, which combines gradient-guided boundary enhancement, self-attention mechanisms, and state space modeling to significantly improve feature representation under challenging conditions such as blurry boundaries, small lesions, and complex background noise.
  3. (3) We construct the Multi-scale Lesion Shape Prior module in the decoder along with a unified loss formulation, enabling explicit shape-based constraints during decoding. Extensive experiments on multiple public datasets, including ISIC2018, HAM10000, and PH2, covering quantitative comparisons, ablation studies, statistical significance tests, visualizations, and noise robustness evaluations, thoroughly demonstrate the effectiveness and generalization capability of the proposed approach.

2. Related work

2.1. Skin lesion segmentation and deep learning paradigms

In recent years, skin lesion segmentation has received increasing attention in computer-aided diagnosis, and a large number of studies have focused on how to extract reliable lesion structures from complex dermoscopic images [7]. Deep learning methods have rapidly become mainstream, evolving from U-Net and its variants to deeper and more sophisticated architectures. For example, MFSNet [8] and MALUNet [9] enhance feature representation through multi-focus and multi-attention mechanisms, while hybrid architectures combining visual Transformers and CNNs further improve multi-scale modeling and long-range dependency learning [10], and recent attention-based Transformer designs further strengthen spatially-aware representation for challenging medical boundaries [11]. In addition, lightweight and deployable designs have emerged as important trends in recent years. Approaches such as EA-Net [12], multi-scale fusion U-Net [13], and IoMT-oriented hybrid models [14] demonstrate improved adaptability to real clinical settings, while data-efficient paradigms such as active learning based on image similarity provide another practical route to reduce annotation cost in skin lesion segmentation [15]. However, these methods generally rely on texture or global semantic cues, and still exhibit limitations in modeling the most challenging aspects of medical images—such as blurry boundaries, fine-grained lesion structures, and high variability in lesion shape—leading to noticeable performance degradation under conditions of boundary ambiguity, color gradients, or hair occlusion.

Meanwhile, a new generation of segmentation approaches incorporating emerging architectural paradigms has gained traction. These include hybrid models based on residual attention [16], quantum-enhanced learning frameworks [17], and recent explorations of Mamba state-space models in medical imaging [18], as well as hybrid CNN–state-space designs that leverage Mamba-style global mixing for robust medical segmentation [19]. Although these methods strengthen multi-scale dependency modeling, structural alignment, and feature aggregation from different perspectives, they still lack explicit modeling of the clinically critical boundary structures and shape priors. Furthermore, recent methods such as GS-TransUNet [20], which integrate high-dimensional geometric modeling with Transformers, improve overall semantic representation but remain insufficient in capturing fine-grained boundaries and maintaining morphological consistency; similarly, skin-lesion-specific networks that emphasize mixed feature perception and multi-scale fusion [21] still face challenges in simultaneously stabilizing boundary delineation and enforcing coherent lesion morphology across scales. From a complementary perspective, classical variational formulations and their modern adaptations also highlight the importance of structural regularization and shape constraints in medical image segmentation [22]. In summary, despite the significant progress of deep learning in skin lesion segmentation, clear gaps remain in boundary refinement and structural shape consistency, highlighting the necessity and practical importance of developing hybrid frameworks that incorporate boundary enhancement and multi-scale shape priors.

2.2. Boundary modeling and shape prior learning in medical image segmentation

In medical image segmentation, lesion boundaries often exhibit blurry transitions, low contrast, and structural adhesion, motivating extensive efforts to explicitly model boundary information to enhance contour delineation. Boundary-aware U-Net introduces boundary branches and constraints into both the encoding and decoding paths, strengthening contour responses while predicting region masks [23]. In semi-supervised scenarios, boundary-aware uncertainty suppression incorporates boundary-specific uncertainty modeling with pseudo-label quality control to mitigate noise stemming from limited annotations [24]. Moreover, structure-preserving boundary segmentation methods seek to explicitly maintain structural consistency under ambiguous boundary conditions [25], while lightweight edge-aware networks and contour-aware multi-expert models couple edge features, contour context, and backbone representations to ensure efficient inference [26,27]. Boundary-guided networks further introduce dedicated boundary-guidance pathways, using boundary predictions as strong supervisory signals fed back to the region segmentation branch [28]. However, these approaches largely rely on auxiliary branches or loss functions to perform posterior correction of backbone features, which provides only indirect encoding of boundary cues within global context. Consequently, under complex texture interference and cross-scale structural variations, the stability and generalization ability of boundary representations remain limited.

Parallel to boundary modeling, shape priors and multi-scale structural constraints constitute another key direction for improving segmentation robustness in medical imaging. For organs or lesions with relatively stable geometric structures, SAFE-Net integrates shape awareness with feature enhancement to suppress false responses and local missing regions in polyp segmentation [29]. For 3D scenarios, UNETR++ incorporates richer spatial and structural modeling within an efficient Transformer-based framework to balance accuracy and computational cost in volumetric segmentation [30]. With the increasing adoption of large visual models in medical applications, dual visual prompt tuning frameworks aim to inject task-specific structural and shape information into pretrained backbones without modifying their parameters [31]. Meanwhile, topological optimization approaches employ topological invariants such as Euler characteristics to regularize segmentation outputs, providing global geometric priors for shape preservation and connectivity [32]. For low-quality and high-noise ultrasound myocardium imaging, shape-consistency networks utilize multi-scale structural reconstruction and shape-consistency regularization to alleviate noise-induced degradation [33]. The self-guided multi-scale Transformer MedScale-Former further emphasizes unified modeling of structure and context across multi-scale feature spaces [34]. Despite the progress achieved in explicit boundary modeling, shape-consistency preservation, and multi-scale structural constraints, boundary and shape priors are often treated independently. This separation leads to a lack of a unified mechanism for jointly modeling “boundary refinement–multi-scale shape structure–global context,” which is particularly limiting in skin lesion segmentation where lesion morphology is highly diverse and spans substantial scale variations. 
Consequently, such decoupled designs underexploit the potential of prior knowledge in fine-grained segmentation.

3. Method

3.1. Overall model architecture

This study aims to address the key challenges in skin lesion segmentation, including boundary ambiguity, substantial shape variability, and cross-scale structural inconsistency. Formally, given a skin lesion image $I \in \mathbb{R}^{H \times W \times 3}$, the task is to learn a mapping function $f_{\theta}$ that produces a pixel-wise prediction mask $P \in [0,1]^{H \times W}$, where $P(x)$ represents the per-pixel probability of lesion regions. The overall framework, illustrated in Fig 1, consists of an encoder and a decoder. The encoder contains a four-stage hierarchical downsampling structure, where each stage is equipped with a Boundary-augmented Hybrid Attention (BAHA) module to enhance local edge cues and model global semantic context, thereby alleviating boundary blurring and texture interference. The multi-scale features are then fed into the decoder, which first constructs a cross-level feature pyramid to unify semantic information across scales. A gradient extraction module is employed to obtain explicit boundary responses, while a Multi-scale Lesion Shape Prior (MLSP) module is introduced to capture structural consistency and shape distributions of lesion regions. The fused representations are processed by a multi-layer perceptron (MLP) for nonlinear combination in high-dimensional space, followed by upsampling operations to recover spatial resolution consistent with the input, ultimately generating high-precision lesion segmentation results. This architecture jointly optimizes boundary refinement, multi-scale structural consistency, and global semantic representation, thereby improving the overall accuracy and robustness of skin lesion segmentation.

Fig 1. The overall framework consists of an encoder driven by Boundary-augmented Hybrid Attention (BAHA) and a decoder guided by multi-scale shape priors, achieving joint modeling of lesion boundary features, cross-scale structural information, and global semantic context.

The decoder further integrates a feature pyramid structure, a gradient extraction module, and a multi-layer perceptron to produce high-precision skin lesion segmentation results with enhanced boundary refinement and improved shape consistency.

https://doi.org/10.1371/journal.pone.0344622.g001

3.2. Boundary-augmented Hybrid Attention

In skin lesion segmentation, networks must simultaneously capture fine-grained boundary information and global semantic associations. Relying solely on standard convolutions or self-attention mechanisms often fails to maintain stable representations under noisy conditions, small-scale structures, and ambiguous contours. To address these issues, we design the Boundary-augmented Hybrid Attention (BAHA) module, which is built upon a feature branch and a gradient branch and achieves multi-level fusion through explicit boundary modeling, local self-attention, convolutional gating, and global state-space modeling. The overall architecture of this module is shown in Fig 2.

Fig 2. The structure of the Boundary-augmented Hybrid Attention (BAHA) module is illustrated, where the feature branch and gradient branch are first encoded using 1D CNNs and then processed by local self-attention and convolutional gating units to extract semantically relevant local contextual information.

The gradient features generate a boundary mask through Laplacian and EdgeConv operations, which guides channel and spatial reweighting, and the result is concatenated with the global context produced by the state-space fusion branch to form boundary-enhanced features for subsequent segmentation.

https://doi.org/10.1371/journal.pone.0344622.g002

Let the encoder input feature map be $F \in \mathbb{R}^{H \times W \times C}$ and the gradient feature extracted by the gradient extraction network be $G \in \mathbb{R}^{H \times W \times C}$. The spatial dimensions are first flattened and embedded via one-dimensional convolutions:

$X_f = \phi_f(\mathrm{Flatten}(F)), \quad X_g = \phi_g(\mathrm{Flatten}(G)), \quad X_f, X_g \in \mathbb{R}^{N \times C}$ (1)

where H and W denote spatial resolution, C is the channel dimension, $N = HW$ is the number of tokens after flattening, and $\phi_f$ and $\phi_g$ denote shared or independent 1D convolutional encoders (1D CNNs) on the feature and gradient branches to perform local pre-aggregation and unify the channel space.

In the feature branch, BAHA first applies local self-attention to explicitly model short-range dependencies. For the sequence feature $X_f$, the query, key, and value vectors are obtained through linear projections:

$Q = X_f W_Q, \quad K = X_f W_K, \quad V = X_f W_V$ (2)

where $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ are learnable projection matrices and d denotes the attention subspace dimension. To avoid the quadratic complexity of global attention, BAHA computes attention weights only within a local neighborhood $\mathcal{N}(i)$:

$\alpha_{ij} = \dfrac{\exp\!\big(q_i^{\top} k_j / \sqrt{d}\big)}{\sum_{j' \in \mathcal{N}(i)} \exp\!\big(q_i^{\top} k_{j'} / \sqrt{d}\big)}, \quad j \in \mathcal{N}(i)$ (3)

where $\alpha_{ij}$ denotes the contribution of position j in the neighborhood of i. The weighted sum of values yields the local attention feature:

$a_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, v_j$ (4)

where $a_i$ encodes local structural and textural information around each position and is reshaped back into a spatial feature map $A \in \mathbb{R}^{H \times W \times C}$.
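The neighborhood-restricted attention of Eqs. (2)–(4) can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation: the function name `local_attention`, the window radius, and the random weight matrices are assumptions for demonstration.

```python
import numpy as np

def local_attention(x, wq, wk, wv, radius=1):
    """1D local self-attention over a token sequence (Eqs. 2-4 spirit).
    x: (N, C); wq/wk/wv: (C, d); each token attends only to tokens within
    +/- radius positions, avoiding quadratic global attention cost."""
    q, k, v = x @ wq, x @ wk, x @ wv
    n, d = q.shape
    out = np.zeros((n, d))
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        logits = (k[lo:hi] @ q[i]) / np.sqrt(d)   # scaled dot-product scores
        w = np.exp(logits - logits.max())          # stable softmax
        w /= w.sum()
        out[i] = w @ v[lo:hi]                      # convex combination of values
    return out
```

Because the softmax weights are non-negative and sum to one, each output token is a convex combination of the value vectors in its neighborhood, which is what keeps the representation stable near noisy regions.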

To explicitly incorporate boundary priors, the gradient branch fuses Laplacian operators and EdgeConv to construct a boundary response map:

$E = \mathrm{Lap}(G) + \mathrm{EdgeConv}(G)$ (5)

where $\mathrm{Lap}(\cdot)$ denotes a Laplacian convolution operator highlighting strong intensity transitions, and $\mathrm{EdgeConv}(\cdot)$ represents the EdgeConv operation based on local k-nearest-neighbor aggregation to capture irregular contours and fine structures. The boundary response is then mapped into a normalized boundary mask:

$M_b = \sigma\!\big(\mathrm{Conv}_{1\times1}(E)\big)$ (6)

where $\mathrm{Conv}_{1\times1}$ is a convolution and $\sigma(\cdot)$ is the sigmoid function, producing $M_b \in [0,1]^{H \times W}$ that represents the confidence of each pixel belonging to boundary regions. The boundary mask guides local attention fusion:

$A_b = M_b \odot A + (1 - M_b) \odot F$ (7)

where ⊙ denotes element-wise multiplication. This operation preserves high responses near boundaries while maintaining robust semantics within lesion interiors.
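A minimal NumPy sketch of the boundary-mask pathway of Eqs. (5)–(7) follows, under two stated simplifications: only the Laplacian response is used (the EdgeConv term is omitted), and the mask-guided fusion form is an assumption based on the surrounding description rather than the paper's exact formula.

```python
import numpy as np

def laplacian_response(g):
    """3x3 Laplacian filtering of a single-channel map (zero padding)."""
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
    h, w = g.shape
    gp = np.pad(g, 1)
    out = np.zeros_like(g, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(gp[i:i + 3, j:j + 3] * k)
    return out

def boundary_mask(g):
    """Normalized boundary confidence: sigmoid of the Laplacian response."""
    e = laplacian_response(g)
    return 1.0 / (1.0 + np.exp(-e))

def boundary_guided_fusion(attn_feat, base_feat, m_b):
    """Hypothetical gated fusion: emphasize attention features near
    boundaries, keep base semantics inside lesion interiors."""
    return m_b * attn_feat + (1.0 - m_b) * base_feat
```

On a step edge the mask responds most strongly at the transition columns, which is the behaviour the boundary prior relies on.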

In the middle pathway, BAHA employs a convolution–normalization–activation–convolution pipeline and an MLP-based gating mechanism to reweight both channel and spatial features. Specifically, two convolutional blocks are applied to $A_b$:

$U = \mathrm{Conv}_2\!\big(\delta(\mathrm{GN}(\mathrm{Conv}_1(A_b)))\big)$ (8)

where $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are convolutions, $\mathrm{GN}(\cdot)$ is Group Normalization, and $\delta(\cdot)$ is the activation function. Global average pooling is then applied to U and passed to an MLP to generate channel attention:

$s = \sigma\!\big(\mathrm{MLP}(\mathrm{GAP}(U))\big)$ (9)

where $\mathrm{GAP}(\cdot)$ is global average pooling, and $s \in (0,1)^{C}$ denotes the channel-wise attention vector. The enhanced feature is computed as

$U' = s \odot U$ (10)

with s broadcast spatially to emphasize informative channels and suppress redundant ones.
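The channel-gating step of Eqs. (9)–(10) is essentially a squeeze-and-excitation style reweighting. The sketch below, assuming a two-layer MLP with illustrative weight shapes and omitting the convolutional blocks of Eq. (8), shows the pooling, gating, and broadcast:

```python
import numpy as np

def channel_attention(u, w1, w2):
    """SE-style channel gating (Eqs. 9-10 spirit).
    u: (C, H, W) feature map; w1: (C, r), w2: (r, C) MLP weights."""
    s = u.mean(axis=(1, 2))                    # GAP: one scalar per channel
    hidden = np.maximum(s @ w1, 0.0)           # ReLU bottleneck layer
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # sigmoid channel weights in (0,1)
    return u * gate[:, None, None]             # broadcast over spatial dims
```

Since every gate value lies in (0, 1), the reweighting can only attenuate channels, never amplify them, which matches the "suppress redundant channels" role described above.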

In the bottom pathway, BAHA incorporates a state-space-model-based global mixing structure to perform long-range modeling on the sequence representations from the feature and gradient branches. The refined features and gradient features are reshaped into sequences $\{x_t^{f}\}_{t=1}^{N}$ and $\{x_t^{g}\}_{t=1}^{N}$, and the hidden states at each position t are updated as:

$h_t^{f} = a_f \odot h_{t-1}^{f} + b_f \odot x_t^{f}, \quad h_t^{g} = a_g \odot h_{t-1}^{g} + b_g \odot x_t^{g}$ (11)

where $h_t^{f}$ and $h_t^{g}$ are hidden states, and $a_f, a_g$ and $b_f, b_g$ are learnable decay and input-gating parameters. A hybrid gating mechanism fuses the two hidden states into a global contextual representation:

$z_t = \lambda_f \odot h_t^{f} + \lambda_g \odot h_t^{g}$ (12)

where $\lambda_f, \lambda_g$ are learnable fusion coefficients balancing structural semantics and gradient priors. Stacking all $z_t$ yields $Z \in \mathbb{R}^{N \times C}$, which is linearly transformed and concatenated to produce the final BAHA output:

$F_{\mathrm{BAHA}} = \mathrm{Linear}\big([\,U' \,\|\, Z\,]\big)$ (13)

where $[\,\cdot \,\|\, \cdot\,]$ denotes channel-wise concatenation, and the Linear layer maps the concatenated features back to channel dimension C. The output integrates boundary-guided local attention, convolution-gated spatial and channel selection, and globally modeled long-range dependencies, providing a rich and structurally enhanced representation for subsequent decoding and shape-prior modeling.
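The dual-stream recurrence of Eqs. (11)–(12) can be written as a simple per-channel scan. The sketch below is a sketch only: the function name `gated_scan` and the treatment of the decay/gate parameters as fixed per-channel vectors (rather than learned, input-dependent quantities) are assumptions.

```python
import numpy as np

def gated_scan(x_f, x_g, a_f, b_f, a_g, b_g, lam_f, lam_g):
    """Diagonal state-space style scan over two token streams.
    x_f, x_g: (N, C) sequences; a_*: per-channel decay; b_*: input gates;
    lam_*: fusion coefficients mixing the two hidden streams (Eq. 12)."""
    n, c = x_f.shape
    h_f = np.zeros(c)
    h_g = np.zeros(c)
    z = np.zeros((n, c))
    for t in range(n):
        h_f = a_f * h_f + b_f * x_f[t]     # feature-branch recurrence (Eq. 11)
        h_g = a_g * h_g + b_g * x_g[t]     # gradient-branch recurrence (Eq. 11)
        z[t] = lam_f * h_f + lam_g * h_g   # fused global context token (Eq. 12)
    return z
```

With zero decay and unit gates the scan degenerates to a pass-through of the selected stream, which is a convenient sanity check; nonzero decay lets each token accumulate evidence from arbitrarily distant earlier positions at linear cost in N.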

3.3. Multi-scale lesion shape prior module

Based on the encoder and BAHA module described above, the network has obtained intermediate representations with explicit boundary awareness. However, relying solely on local boundary cues and semantic information is insufficient to capture the substantial shape variability and scale differences exhibited by skin lesions across different cases. To address this, we design a Multi-scale Lesion Shape Prior Module in the decoding stage, which explicitly extracts contour structures from the input features and constructs multi-scale shape priors, while maintaining end-to-end training. The architecture of this module is illustrated in Fig 3.

Fig 3. The structure of the Multi-scale Lesion Shape Prior module is illustrated, where the input features first pass through two stages of IDCNN to model long-range local context and are then transformed linearly to form the main decoding features, while the lower branch extracts shape-related responses through an edge-enhancement operator and convolutional networks to generate a global gating vector θ for channel importance modulation.

In the bottom pathway, the decoder features are processed by three scale-projection operators to obtain multi-scale shape representations, which are concatenated and fused through gating to produce shape-enhanced features, thereby injecting explicit contour and scale priors into the decoding stage.

https://doi.org/10.1371/journal.pone.0344622.g003

Let the input features from the encoder or BAHA module be $F \in \mathbb{R}^{H \times W \times C}$. We first apply a one-dimensional convolutional network (IDCNN) for serialization and local context modeling:

$S_0 = \mathrm{IDCNN}_1\!\big(\mathrm{Flatten}(F)\big)$ (14)

where $\mathrm{Flatten}(\cdot)$ flattens the spatial dimensions into a sequence of length N, and $\mathrm{IDCNN}_1$ denotes a dilated one-dimensional convolutional encoder for aggregating local neighboring information in the sequence domain. A second IDCNN is then applied for deeper feature extraction, forming a residual-enhanced representation with the original encoding:

$S = S_0 + \mathrm{IDCNN}_2(S_0)$ (15)

where $\mathrm{IDCNN}_2$ is the second IDCNN branch. The resulting $S$ preserves both shallow textures and deep semantics, serving as the unified baseline feature for the subsequent upper and lower sub-paths.

In the upper main branch of the module, the Multi-scale Lesion Shape Prior module applies two stages of linear transformations to perform global channel reorganization on $S$, generating semantically consistent decoding features. A first linear projection yields:

$D_1 = S W_1 + b_1$ (16)

where $W_1$ and $b_1$ are learnable parameters used for channel-wise recombination. A second linear layer with a nonlinear activation further enhances representational capacity:

$D = \delta(D_1 W_2 + b_2)$ (17)

where $W_2$ and $b_2$ are learnable parameters, $\delta(\cdot)$ denotes an element-wise nonlinear function (e.g., GELU or ReLU), and D represents the main decoding features before injecting shape priors. This branch corresponds to the two Linear layers and residual structure shown in the figure.

Parallel to the main branch, the bottom pathway constructs a contour-based shape prior stream that explicitly extracts lesion boundary structures from $S$. An edge-enhancement operator $\mathcal{E}(\cdot)$, combining gradient and Laplacian responses, is first applied:

$E_s = \mathcal{E}(S)$ (18)

where $\mathcal{E}(\cdot)$ captures high-intensity transitions and their surrounding neighborhoods. Two cascaded 2D convolutional blocks are then used to suppress noise and refine contour semantics:

$C_r = \mathrm{Conv}_2\!\big(\mathrm{Conv}_1(\mathrm{Reshape}(E_s))\big)$ (19)

where $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ denote convolutional modules with normalization and nonlinear activation, and $\mathrm{Reshape}(\cdot)$ restores the sequence into a spatial feature map. The refined contour representation $C_r$ aggregates local boundary cues and global structure. A convolution followed by a sigmoid activation generates the normalized shape mask:

$M_s = \sigma\!\big(\mathrm{Conv}_{1\times1}(C_r)\big)$ (20)

where $\mathrm{Conv}_{1\times1}$ compresses channels into a single response, and $\sigma(\cdot)$ is the sigmoid function. Thus, $M_s$ serves as the lesion shape confidence map. A lightweight linear mapping is then applied to construct a gating vector θ:

$\theta = \mathrm{MLP}_{\theta}\!\big(\mathrm{GAP}(M_s \odot C_r)\big)$ (21)

where $\mathrm{GAP}(\cdot)$ is global average pooling, and $\mathrm{MLP}_{\theta}$ is a small fully-connected network. The resulting θ characterizes the importance of each channel under the global shape-prior constraint, corresponding to the Sigmoid and θ elements in the figure.

To map the shape prior from a global representation to multi-scale spatial structures, the bottom part of the module applies three scale-projection operators to the intermediate decoder feature map $D_{\mathrm{f}}$ (denoted as Features in the figure), generating shape-related features at different receptive fields:

$P_k = \Phi_k(D_{\mathrm{f}}), \quad k = 1, 2, 3$ (22)

where each $\Phi_k$ uses combinations of convolution, pooling, or upsampling operations to produce multi-scale representations. To unify all scales, each feature map is transformed by an operator $\Psi_k$:

$\tilde{P}_k = \Psi_k(P_k), \quad k = 1, 2, 3$ (23)

where $\Psi_k$ may include bilinear interpolation, transposed convolutions, or stride convolutions. After alignment, the multi-scale features are concatenated channel-wise:

$P_{\mathrm{ms}} = \big[\tilde{P}_1 \,\|\, \tilde{P}_2 \,\|\, \tilde{P}_3\big]$ (24)

where $P_{\mathrm{ms}}$ simultaneously encodes fine-grained local shapes and coarse global contour structures, consistent with the $\Psi_k$ and Concat blocks in the figure.

Finally, the module modulates the multi-scale shape prior using the gating vector θ and fuses it with the semantic features from the main branch to produce the final shape-enhanced decoding feature. The vector θ is broadcast to match the channel dimension of $P_{\mathrm{ms}}$:

$\hat{P} = \tilde{\theta} \odot P_{\mathrm{ms}}$ (25)

where $\tilde{\theta}$ is the broadcast version of θ, and ⊙ denotes element-wise multiplication. Channels consistent with the global shape prior are enhanced, while irrelevant channels are suppressed. A linear mapping then projects $\hat{P}$ back to the same channel dimension as D, followed by fusion in the sequence domain:

$D_{\mathrm{out}} = D + W_p\, \mathrm{Flatten}(\hat{P})$ (26)

where $W_p$ is a linear layer or convolution, and $\mathrm{Flatten}(\cdot)$ flattens the multi-scale prior tensor into a sequence. The output $D_{\mathrm{out}}$ is the final output of the Multi-scale Lesion Shape Prior module. After reshaping back into a spatial feature map, it is passed to the subsequent upsampling decoder head, enabling joint modeling of lesion regions in terms of both shape and semantics, thereby significantly improving segmentation stability and consistency for small lesions, cracked boundaries, and complex morphological patterns.
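The multi-scale projection and channel gating of Eqs. (22)–(25) can be sketched as below. This is an assumption-laden sketch: the scale projections $\Phi_k$/$\Psi_k$ are approximated with plain average pooling and nearest-neighbour upsampling (no learned convolutions), spatial sizes are assumed divisible by 4, and the function names are illustrative.

```python
import numpy as np

def avg_pool2(x):
    """2x downsampling by average pooling; x: (C, H, W), H and W even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """2x nearest-neighbour upsampling; x: (C, H, W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def multiscale_shape_prior(feat, theta):
    """Eqs. (22)-(25) spirit: project decoder features to three scales,
    re-align each to full resolution, concatenate along channels, then
    gate channels with the broadcast vector theta."""
    p1 = feat                                               # fine scale
    p2 = upsample2(avg_pool2(feat))                         # medium scale
    p3 = upsample2(upsample2(avg_pool2(avg_pool2(feat))))   # coarse scale
    p_ms = np.concatenate([p1, p2, p3], axis=0)             # channel concat (Eq. 24)
    return theta[:, None, None] * p_ms                      # channel gating (Eq. 25)
```

The coarse branches blur fine detail but preserve the overall lesion silhouette, which is exactly the information the gated fusion is meant to inject back into the decoder.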

3.4. Training objective and loss function

To fully exploit the synergistic effect between the Boundary-augmented Hybrid Attention (BAHA) module and the Multi-scale Lesion Shape Prior (MLSP) module, we construct a joint optimization objective consisting of the main segmentation loss, a boundary consistency constraint, and a multi-scale shape prior constraint. This enables the network to balance pixel-wise classification accuracy, boundary refinement capability, and overall structural consistency during training. Let the training set be denoted as

$\mathcal{D} = \{(I_n, Y_n)\}_{n=1}^{N}$ (27)

where $I_n$ represents the n-th skin lesion image, $Y_n \in \{0,1\}^{H \times W}$ is the corresponding binary lesion mask, and N is the number of samples. After processing through the encoder, BAHA, and the MLSP-enhanced decoder, the network predicts a probability mask

$P = f_{\theta}(I), \quad P(x) \in [0,1]$ (28)

where $f_{\theta}$ denotes the overall network mapping function, θ represents all learnable parameters, and $P(x)$ is the predicted probability that pixel x belongs to the lesion foreground.

To ensure overall segmentation accuracy while addressing the imbalance between foreground and background regions, we adopt a combination of binary cross-entropy (BCE) loss and Dice loss as the main segmentation loss. Denote the set of all pixel positions as Ω, then the segmentation loss is defined as

$\mathcal{L}_{\mathrm{seg}} = \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}}$ (29)

where the BCE loss is

$\mathcal{L}_{\mathrm{BCE}} = -\dfrac{1}{|\Omega|} \sum_{x \in \Omega} \big[\, Y(x) \log P(x) + (1 - Y(x)) \log (1 - P(x)) \,\big]$ (30)

and the Dice loss, used to enhance the overlap of foreground regions, is defined as

$\mathcal{L}_{\mathrm{Dice}} = 1 - \dfrac{2 \sum_{x \in \Omega} P(x) Y(x) + \varepsilon}{\sum_{x \in \Omega} P(x) + \sum_{x \in \Omega} Y(x) + \varepsilon}$ (31)

where ε is a smoothing constant to avoid division by zero. This combined loss stabilizes pixel-wise optimization while explicitly encouraging accurate prediction of small lesion regions under class imbalance.
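The combined segmentation loss of Eqs. (29)–(31) is straightforward to implement; the sketch below follows the standard BCE + Dice formulation (the function name and the clipping-based numerical stabilization are implementation choices of this sketch):

```python
import numpy as np

def bce_dice_loss(p, y, eps=1e-6):
    """Main segmentation loss (Eq. 29): BCE + Dice on a probability
    mask p and a binary ground-truth mask y of the same shape."""
    p = np.clip(p, eps, 1.0 - eps)          # avoid log(0)
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # Eq. 30
    inter = np.sum(p * y)
    dice = 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps)  # Eq. 31
    return bce + dice
```

A perfect prediction drives both terms toward zero, while the Dice term keeps the gradient informative even when foreground pixels are scarce relative to background.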

To complement the explicit boundary modeling embedded in the BAHA module, we further introduce a boundary consistency constraint to ensure spatial alignment between the predicted and the ground-truth lesion contours. Let $\mathcal{B}(\cdot)$ denote a boundary extraction operator (implemented using Sobel or Laplacian convolution), then the predicted and ground-truth boundary maps are defined as

$B_P = \mathcal{B}(P), \quad B_Y = \mathcal{B}(Y)$ (32)

where $B_P$ and $B_Y$ represent the boundary response intensities. The boundary loss is then computed as

$\mathcal{L}_{\mathrm{bd}} = \dfrac{1}{|\Omega|} \sum_{x \in \Omega} \big| B_P(x) - B_Y(x) \big|$ (33)

which directly enforces accuracy in boundary regions, ensuring that the edge-enhanced features learned by BAHA are reflected in the final predictions, thus alleviating boundary ambiguity and local misclassification.
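A NumPy sketch of the boundary consistency term follows. Two assumptions are made explicit here: the boundary operator $\mathcal{B}$ is realized with Sobel filters (one of the two options named above), and the penalty between the two boundary maps is taken as L1, which the source does not pin down.

```python
import numpy as np

def sobel_boundary(m):
    """Boundary response |grad m| of a mask via 3x3 Sobel filters (Eq. 32)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = m.shape
    mp = np.pad(m, 1, mode="edge")          # replicate borders
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = mp[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(win * kx)      # horizontal gradient
            gy[i, j] = np.sum(win * ky)      # vertical gradient
    return np.sqrt(gx ** 2 + gy ** 2)

def boundary_loss(p, y):
    """L1 discrepancy between predicted and ground-truth boundary maps
    (the L1 choice is an assumption of this sketch)."""
    return np.mean(np.abs(sobel_boundary(p) - sobel_boundary(y)))
```

The loss is zero exactly when the two contour maps coincide, and grows as the predicted contour drifts away from the ground-truth contour.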

On the other hand, the MLSP module outputs multi-scale shape-guided features during decoding and generates several low-resolution auxiliary masks that capture the coarse-to-fine lesion morphology across scales. Let $P_k$ denote the predicted probability mask at the k-th scale, and let $Y_k$ denote the corresponding downsampled ground-truth mask. The multi-scale shape prior loss is defined as

$\mathcal{L}_{\mathrm{shape}} = \dfrac{1}{K} \sum_{k=1}^{K} \mathcal{L}_{\mathrm{Dice}}(P_k, Y_k)$ (34)

where K is the number of scales (three in our implementation), and $\mathcal{L}_{\mathrm{Dice}}(\cdot, \cdot)$ is identical to the Dice loss defined earlier. This loss enforces global morphological consistency at coarse scales and improves reconstruction of fine structural details and hollow regions at higher resolutions, thereby enhancing the stability of the shape prior and suppressing false responses.
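The multi-scale term of Eq. (34) amounts to averaging a Dice loss over auxiliary predictions against progressively downsampled ground truth. In the sketch below, the downsampling scheme (average pooling by powers of two) and the convention that `preds[k]` lives at scale $2^{-k}$ are assumptions for illustration:

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss between a prediction p and a target y."""
    inter = np.sum(p * y)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps)

def downsample(y, k):
    """Average-pool the ground truth by a factor of 2**k (assumed scheme)."""
    for _ in range(k):
        h, w = y.shape
        y = y.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y

def shape_prior_loss(preds, y):
    """Eq. (34): mean Dice loss over K auxiliary scales; preds[k] is
    assumed to match y downsampled k times."""
    return np.mean([dice_loss(p, downsample(y, k)) for k, p in enumerate(preds)])
```

Averaging over scales means a coarse mask that captures the lesion silhouette is rewarded even while fine boundary details are still being learned at full resolution.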

Combining the three components above, the total training objective is formulated as

$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{seg}} + \lambda_2 \mathcal{L}_{\mathrm{bd}} + \lambda_3 \mathcal{L}_{\mathrm{shape}}$ (35)

where $\lambda_1, \lambda_2, \lambda_3$ are weighting coefficients that balance the main segmentation loss, boundary consistency constraint, and multi-scale shape prior constraint, respectively. Through this joint optimization objective, the boundary-enhanced features learned by BAHA and the multi-scale shape priors encoded by MLSP are jointly supervised during training, enabling the model to simultaneously achieve fine contour delineation and robust structural understanding in skin lesion segmentation.

3.5. Algorithm of the proposed framework

To make the implementation procedure explicit, Algorithm 1 presents the overall computation flow, including feature encoding, boundary-aware attention enhancement, multi-scale shape prior decoding, and the joint optimization objective used for training.

Algorithm 1 Overall training and inference procedure of the proposed BAHA–MLSP framework

Require: Training set $\mathcal{D} = \{(I_i, G_i)\}_{i=1}^{N}$; network with encoder $E$, BAHA module $\mathcal{A}$, and MLSP-enhanced decoder $D$; total iterations $T$; weighting coefficients $\lambda_1, \lambda_2, \lambda_3$.

Ensure: Trained parameters $\theta$; predicted mask $P$ for a test image $I$.

1: Initialize parameters $\theta$.
2: for $t = 1$ to $T$ do
3:   Sample a mini-batch $\{(I, G)\}$ from $\mathcal{D}$.
4:   Encoding: extract multi-level features $\{F_l\} = E(I)$.
5:   Boundary cues: compute the ground-truth boundary map $B_G = \partial(G)$; the predicted boundary response $B_P$ is derived from the prediction in step 9.
6:   BAHA enhancement: obtain boundary-augmented features $\{\tilde{F}_l\} = \mathcal{A}(\{F_l\})$.
7:   MLSP decoding: produce multi-scale predictions $\{P_k\}_{k=1}^{K}$ and the full-resolution prediction $P = D(\{\tilde{F}_l\})$.
8:   Segmentation loss: compute $\mathcal{L}_{seg}$ using BCE–Dice between $P$ and $G$.
9:   Boundary consistency: compute $B_P = \partial(P)$ and $\mathcal{L}_{bd}$ between $B_P$ and $B_G$.
10:  Multi-scale shape prior: downsample $G$ to $\{G_k\}$ and compute $\mathcal{L}_{sp}$ across scales.
11:  Joint objective: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{seg} + \lambda_2 \mathcal{L}_{bd} + \lambda_3 \mathcal{L}_{sp}$.
12:  Update $\theta$ by back-propagation and an optimizer step.
13: end for
14: Inference: for a test image $I$, output $P = D(\mathcal{A}(E(I)))$ and binarize $P$ to obtain the final mask.

3.6. Computational complexity analysis in $\mathcal{O}$ notation

We analyze the computational complexity of our method in $\mathcal{O}$ notation by isolating the additional overhead introduced by BAHA and MLSP. Let the baseline segmentation network be denoted as a mapping with computational cost $\mathcal{C}_{base}(H, W)$ for processing an input of resolution $H \times W$. Our full model augments the baseline with BAHA and MLSP, and thus its total cost can be written as

$\mathcal{C}_{total} = \mathcal{C}_{base}(H, W) + \mathcal{C}_{\partial} + \mathcal{C}_{BAHA} + \mathcal{C}_{MLSP}.$   (36)

Boundary extraction operator. The boundary operator $\partial(\cdot)$ is implemented by a fixed small-kernel filter. Its complexity scales linearly with the number of pixels, i.e.,

$\mathcal{C}_{\partial} = \mathcal{O}(HW).$   (37)

BAHA module overhead. BAHA applies boundary-augmented hybrid attention on intermediate feature maps. Denote the BAHA input feature at the operating stage by $F \in \mathbb{R}^{H' \times W' \times C}$, where $H' \times W'$ is the spatial size at that stage and $C$ is the channel width. In our implementation, BAHA is realized by lightweight spatial/channel reweighting and point-wise channel mixing. Therefore, its additional complexity can be upper-bounded by

$\mathcal{C}_{BAHA} = \mathcal{O}(H'W'C^2) + \mathcal{O}(H'W'C),$   (38)

where $\mathcal{O}(H'W'C^2)$ corresponds to the dominant point-wise channel mixing and $\mathcal{O}(H'W'C)$ corresponds to spatial pooling and gating. Importantly, this overhead grows linearly with the number of spatial locations at the operating stage.

MLSP module overhead. MLSP introduces $K$ auxiliary predictions at multiple scales to enforce shape consistency. Let the $k$-th auxiliary head operate on a feature map of size $H_k \times W_k$ with channel width $C_k$. If the auxiliary head is implemented as a light projection followed by up/down-sampling, its overhead is

$\mathcal{C}_{MLSP}^{(k)} = \mathcal{O}(H_k W_k C_k),$   (39)

and in a more general case that includes channel mixing inside the head, an upper bound is

$\mathcal{C}_{MLSP} = \mathcal{O}\!\left( \sum_{k=1}^{K} H_k W_k C_k^2 \right).$   (40)

Since $H_k W_k$ decreases at coarser scales, the multi-scale overhead is bounded and typically remains a small fraction of $\mathcal{C}_{base}$.

Comparison to the baseline. Let $\mathcal{C}_{base}$ denote the complexity of the baseline network without BAHA and MLSP. Then the increase in complexity introduced by our design is

$\Delta\mathcal{C} = \mathcal{C}_{total} - \mathcal{C}_{base} = \mathcal{O}(HW) + \mathcal{O}(H'W'C^2) + \mathcal{O}\!\left( \sum_{k=1}^{K} H_k W_k C_k^2 \right).$   (41)

This explicitly shows that our modifications add only linear-in-spatial-size overhead terms, while keeping the baseline backbone unchanged. Therefore, the proposed BAHA–MLSP framework improves boundary delineation and multi-scale structural consistency with a controllable and architecture-agnostic computational overhead.
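The additive overhead terms above can be tallied with a small helper; this is a back-of-the-envelope multiply-accumulate estimator under the stated upper bounds, not a FLOPs profiler.

```python
def baha_mlsp_overhead(H, W, C, stage_hw, scale_specs):
    """Sum the extra multiply-accumulate terms of Eqs. (37)-(41).

    stage_hw:    (H', W') spatial size at the BAHA operating stage.
    scale_specs: list of (H_k, W_k, C_k) tuples, one per MLSP auxiliary head.
    """
    boundary = H * W                          # O(HW): fixed-kernel filtering
    hp, wp = stage_hw
    baha = hp * wp * C * C + hp * wp * C      # channel mixing + pooling/gating
    mlsp = sum(hk * wk * ck * ck for hk, wk, ck in scale_specs)
    return boundary + baha + mlsp
```

Because every term is linear in the number of spatial locations at its stage, the estimate stays a small fraction of the backbone cost for typical feature-map sizes.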

4. Datasets and evaluation metrics

4.1. Datasets

4.1.1. ISIC2018 [35].

The ISIC2018 dataset, released by the International Skin Imaging Collaboration, is a large-scale skin lesion analysis dataset containing high-resolution dermoscopic images with finely annotated lesion masks, and is widely used for automated skin tumor analysis. It encompasses multiple common lesion categories and presents real clinical challenges such as illumination variation, noise interference, blurry boundaries, and substantial morphological diversity. The dataset provides standardized training, validation, and test splits, offering a unified benchmark for evaluating and comparing skin lesion segmentation models, as illustrated in Fig 4.

4.1.2. HAM10000 [36].

The HAM10000 dataset is a dermoscopic image collection acquired from multiple centers and devices, containing a large number of diverse skin lesion samples that cover various common conditions such as nevi, melanoma, and keratoses, along with reliable expert annotations and lesion masks. Characterized by complex imaging conditions, substantial texture variability, and notable differences in lesion size, the dataset provides rich scenarios for evaluating model generalization in real clinical environments. It has therefore been widely used for research and performance validation in skin lesion classification and segmentation tasks, as illustrated in Fig 5.

4.1.3. PH2 [37].

The PH2 dataset was collected by the Dermatology Service of Pedro Hispano Hospital in Matosinhos, Portugal, under standardized imaging conditions using the Tuebinger Mole Analyzer system at 20× magnification, with all images captured as 8-bit RGB dermoscopic images at a resolution of 768 × 560 pixels. It contains 200 high-quality melanocytic lesion images, including 80 common nevi, 80 atypical nevi, and 40 melanomas, accompanied by expert-annotated lesion segmentation masks. With its uniform imaging protocol and precise annotations, the PH2 dataset is widely used for research and benchmarking in skin lesion segmentation and classification, as illustrated in Fig 6.

4.2. Dataset statistics

This study employs three publicly available datasets widely used in skin lesion segmentation—ISIC2018, HAM10000, and PH2—which differ substantially in image quantity, imaging sources, and lesion categories, thereby providing diverse conditions for assessing model generalization and robustness. ISIC2018 offers the most challenging dermoscopic images collected from real clinical environments, where 2,594 training images and 1,000 test images from the official split are used in this study; HAM10000 contains 10,015 multi-center and multi-device dermoscopic images, which are randomly divided into training and test sets with an 8:2 ratio; the PH2 dataset is smaller, consisting of 200 high-quality melanocytic lesion images, also split into training and test subsets using an 8:2 ratio. The differences in dataset scale, lesion morphology, and imaging conditions allow for more comprehensive performance evaluation across various realistic clinical scenarios. The detailed statistics of these datasets are shown in Table 2.
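The 8:2 random split applied to HAM10000 and PH2 can be sketched as follows; the fixed seed here is an illustrative assumption consistent with the reproducibility setup described in Section 5.1.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Random train/test split (8:2 by default), as applied to HAM10000 and PH2.

    A fixed seed keeps the split reproducible across runs."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]
```

For the 200-image PH2 dataset this yields 160 training and 40 test images; ISIC2018 instead uses the official split directly.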

Table 2. Dataset statistics and train/test splits used in this study.

https://doi.org/10.1371/journal.pone.0344622.t002

4.3. Evaluation metrics

To comprehensively assess the performance of the segmentation model in terms of boundary precision, region consistency, and overall segmentation quality, this study adopts four commonly used quantitative metrics: IoU, HD95, ASD, and DSC. We also report the model-related computational indicators, including FLOPs and the number of parameters.

Intersection over Union (IoU). IoU measures the overlap between the predicted region and the ground-truth region and is one of the most widely used region-level performance metrics in medical image segmentation.

$\mathrm{IoU} = \dfrac{|P \cap G|}{|P \cup G|},$   (42)

where $P$ denotes the predicted mask and $G$ denotes the ground-truth mask. A higher value indicates more accurate differentiation between the foreground and background.

Dice Similarity Coefficient (DSC). DSC evaluates the similarity between the predicted region and the ground-truth region, and it is particularly suitable for handling the class imbalance commonly seen in medical image segmentation.

$\mathrm{DSC} = \dfrac{2\,|P \cap G|}{|P| + |G|}.$   (43)

This metric is similar to IoU but places more emphasis on the overlapping region, making it more sensitive to small-target segmentation.

95th Percentile Hausdorff Distance (HD95). HD95 quantifies the near-maximal deviation between the predicted boundary and the ground-truth boundary, taking the 95th percentile of the boundary point distances to serve as a robust distance metric for evaluating boundary errors.

$\mathrm{HD95} = \max\left\{ d_{95}(S_P, S_G),\; d_{95}(S_G, S_P) \right\},$   (44)

where $d_{95}(S_P, S_G)$ denotes the 95th percentile of the distances from points in the predicted boundary set $S_P$ to the ground-truth boundary set $S_G$. A lower HD95 indicates better contour alignment while reducing the impact of outlier points on boundary evaluation.

Average Surface Distance (ASD). ASD measures the average distance between the predicted boundary and the ground-truth boundary, serving as another geometric indicator of boundary consistency.

$\mathrm{ASD} = \dfrac{1}{|S_P| + |S_G|} \left( \sum_{p \in S_P} d(p, S_G) + \sum_{g \in S_G} d(g, S_P) \right),$   (45)

where $S_P$ and $S_G$ represent the sets of predicted and ground-truth boundary points, respectively, and $d(p, S)$ is the distance from point $p$ to the nearest point in set $S$. A lower ASD indicates that the generated lesion contour more closely approximates the true anatomical structure.
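The four metrics can be computed directly from binary masks and boundary point sets. The following NumPy sketch uses a brute-force pairwise-distance computation for clarity; it is an illustrative implementation of Eqs. (42)–(45), not the evaluation code used in the paper.

```python
import numpy as np

def iou_dsc(p, g):
    """Region-overlap metrics for binary masks, Eqs. (42)-(43)."""
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union, 2.0 * inter / (p.sum() + g.sum())

def surface_metrics(sp, sg, pct=95):
    """HD95 and ASD, Eqs. (44)-(45), from two (N, 2) boundary-point arrays."""
    # Full pairwise distance matrix between the two boundary point sets.
    d = np.linalg.norm(sp[:, None, :] - sg[None, :, :], axis=-1)
    d_pg, d_gp = d.min(axis=1), d.min(axis=0)   # point-to-nearest-set distances
    hd95 = max(np.percentile(d_pg, pct), np.percentile(d_gp, pct))
    asd = (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))
    return float(hd95), float(asd)
```

The brute-force matrix is fine for contour-sized point sets; production evaluation pipelines typically use distance transforms instead for speed.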

FLOPs. FLOPs reflect the number of floating-point operations required for a single forward pass of the model and are used to measure the computational complexity of the algorithm.

Params. The number of parameters (Params) indicates the total count of trainable parameters in the model and reflects the model size and its demand for storage resources.

5. Experimental results and analysis

5.1. Experimental setup

All experiments in this study are conducted under a unified hardware environment to ensure consistent and reproducible comparisons across different models and components. Both training and inference are implemented using a PyTorch-based framework, running on a single NVIDIA A100 GPU. CUDA 12.1 and cuDNN 8.9 are used to accelerate computation. To ensure experimental stability, all random seeds are fixed, and all models adopt the same data augmentation and preprocessing pipeline, including normalization, random flipping, and scale transformation.

For hyperparameter configurations, all models use the same optimizer, learning rate schedule, and number of training epochs. The backbone network is trained using the AdamW optimizer with a batch size of 8 for a total of 200 epochs; the base learning rate and weight decay are listed in Table 3. For the proposed joint loss design, the segmentation loss, boundary consistency loss, and multi-scale shape prior loss are controlled by three loss weights $\lambda_1$, $\lambda_2$, and $\lambda_3$, respectively, whose values were selected as the most stable setting after multiple validation rounds. The detailed settings are shown in Table 3.

Table 3. Hardware environment and training hyperparameters used in this study.

https://doi.org/10.1371/journal.pone.0344622.t003

5.2. Experimental results compared with other models

To comprehensively verify the effectiveness of the proposed method across different network architectures and feature modeling paradigms, this study includes a diverse set of representative semantic segmentation models for horizontal comparison. The selected models cover traditional strong baselines such as OCRNet; mask-based structured prediction frameworks including MaskFormer and Mask2Former; lightweight and efficient convolutional architectures such as SegNext; Transformer-based multi-scale representation models such as VWFormer and EDAFormer; as well as the recent structurally enhanced medical-oriented approach CGRSeg-T. By conducting systematic comparisons with these models under a unified dataset setup and consistent hyperparameter configurations, we can more clearly evaluate the differences in boundary modeling, shape representation, and cross-scale consistency across architectures, thereby validating the overall advantages and applicability of the proposed model. The experimental results on the ISIC2018 dataset are first presented in Table 4.

Table 4. Comparison with state-of-the-art segmentation models on the ISIC 2018 dataset.

https://doi.org/10.1371/journal.pone.0344622.t004

From an overall perspective, the compared models—built respectively on convolutional architectures, hybrid attention mechanisms, and query-based segmentation paradigms—each exhibit certain advantages in terms of regional consistency and structural reconstruction of skin lesions. However, most of these approaches primarily rely on either local texture cues or global semantic representations for region classification, and therefore still show limitations when dealing with lesion areas characterized by ambiguous boundaries, fragmented structures, or significant cross-scale morphological variations. In contrast, the model proposed in this study introduces an explicit gradient-sensitive boundary enhancement mechanism during the encoding stage, enabling the network to obtain more stable contour cues at early feature extraction layers. Furthermore, the decoding stage incorporates multi-scale shape prior modeling, which constrains and completes the lesion structure, thereby maintaining higher segmentation reliability especially on challenging samples.

In addition, when considering the structural complexity of different models, the FLOPs and parameter counts vary substantially across architectures, indicating notable differences in deployment cost for real-world applications. The proposed model achieves a lightweight design while naturally integrating boundary modeling and shape consistency modeling, allowing it to represent fine-grained contours, local irregular regions, and large-scale lesion morphology under a unified framework with relatively low computational overhead. Owing to this architectural synergy, the model demonstrates more stable generalization across diverse types of skin lesion images and successfully balances boundary precision with accurate recovery of global lesion structure, all while maintaining a compact parameter size.

Furthermore, this paper presents experimental results for two other datasets, as shown in Tables 5 and 6.

Table 5. Comparison with state-of-the-art segmentation models on the HAM10000 dataset.

https://doi.org/10.1371/journal.pone.0344622.t005

Table 6. Comparison with state-of-the-art segmentation models on the PH2 dataset.

https://doi.org/10.1371/journal.pone.0344622.t006

On the more diverse and visually variable HAM10000 and PH2 datasets, the differences among methods become more pronounced in terms of boundary delineation capability, structural restoration quality, and architectural efficiency. Overall, most compared models rely on global attention mechanisms or multi-scale convolutional structures for feature fusion, yet they still show limitations when handling lesion regions characterized by blurry boundaries, uneven pigmentation, or significant cross-scale morphological variations. The method proposed in this study incorporates a gradient-driven boundary-aware mechanism in the encoding stage, enabling the model to maintain more stable contour extraction even in noisy dermatological images. In the decoding stage, the integration of multi-scale shape priors further constrains the global structure, allowing the network to preserve geometric consistency across lesions of different sizes. Owing to the synergistic effect between boundary modeling and shape priors, the proposed approach demonstrates strong adaptability across both high-quality and low-contrast images. As a result, it achieves more stable generalization on both datasets and delivers reliable structural consistency and refined boundary precision while maintaining relatively low computational cost.

Finally, this paper also presents the training and validation loss curves over epochs on the three datasets. These curves further demonstrate that the proposed algorithm achieves good convergence. The experimental results are shown in Fig 7.

From the curves on the three datasets (ISIC2018, HAM10000, and PH2), it can be observed that the training process is overall stable and exhibits good convergence: within the first 30–40 epochs, both the training loss and validation loss decrease rapidly, indicating that the network can quickly learn effective low-level texture cues and an initial representation of lesion regions; thereafter, the losses enter a stage of slower decline and gradually become stable, suggesting that the optimization progressively shifts toward more challenging structural cues such as boundary refinement and morphological consistency. Across all three plots, the validation loss remains slightly higher than the training loss with a small gap, implying a controllable generalization gap and no obvious signs of overfitting. Overall, these phenomena demonstrate that the adopted training strategy maintains consistent optimization trends across datasets of different scales and difficulties, and achieves smooth convergence with reliable validation performance in the later stage.

5.3. Module ablation experimental results

To further investigate the contribution of different structural components within the model, this section conducts a step-by-step ablation study under a unified training configuration. By individually removing the boundary enhancement module and the multi-scale shape prior module, we can examine how each component influences the overall architectural design and feature representation capability. Such analysis helps clarify the sources of performance improvement and verifies the necessity and effectiveness of the proposed structural design. The ablation results are presented in Table 7.

Table 7. Systematic ablation study on ISIC 2018, HAM10000, and PH2 datasets.

BP denotes the boundary prior (gradient-driven boundary enhancement), HA denotes the hybrid attention in the encoder, and SP denotes the shape prior introduced by MLSP.

https://doi.org/10.1371/journal.pone.0344622.t007

From the systematic ablation results across the three datasets, the performance gains brought by the three structural factors exhibit a largely consistent trend, and clear complementarity can be observed. Using SegMan as the baseline, the model already achieves stable lesion localization, yet there remains room for improvement in terms of boundary precision and shape consistency. Enabling the boundary prior alone or the hybrid attention alone yields only modest improvements, indicating that boundary-guided cues and global dependency modeling can each enhance feature representation, but their isolated effects are limited. When the boundary prior and hybrid attention are jointly activated, IoU/DSC are further improved on all three datasets, accompanied by noticeable reductions in HD95 and ASD, suggesting that boundary-relevant details are more reliably propagated across scales through attention-based feature interaction, thereby effectively suppressing boundary localization errors. This paper further presents the qualitative results of the ablation experiment, as shown in Fig 8.

Furthermore, introducing the shape prior consistently improves IoU/DSC while simultaneously reducing HD95/ASD across the three datasets, demonstrating that multi-scale shape constraints help preserve contour continuity and region completeness during decoding, and mitigate shape fragmentation caused by cross-scale fusion. When the shape prior is combined with the boundary prior or with hybrid attention, the overall performance is generally superior to settings with a single prior, highlighting the synergistic effect between structural priors and attention-based modeling in balancing local boundary fidelity and global structural coherence. Ultimately, the full configuration achieves the best or near-best overall performance on all three datasets, exhibiting simultaneous advantages on both boundary-sensitive metrics (HD95/ASD) and overlap-based metrics (IoU/DSC). These results verify that the proposed framework jointly models fine-grained boundary information and global structural constraints, leading to more stable generalization under variations in data sources and lesion morphologies.

5.4. Statistical significance test

To further verify whether the performance contributions of each module are statistically significant, this study conducts pairwise significance tests between different module combinations and the final model across the three datasets. Specifically, we perform paired statistical comparisons based on four core evaluation metrics and compute the corresponding p-value matrices to quantify the degree of performance difference among the module configurations. The visualized p-value heatmaps provide a more intuitive illustration of the importance of each module for boundary modeling and shape consistency modeling, confirming that the performance gains of the final model do not arise from random fluctuations but instead exhibit stable statistical significance. The results of this analysis are shown in Fig 9.

Fig 9. This figure illustrates the pairwise significance test results of the base model and different module combinations relative to the final model on three datasets.

By visualizing the p-value matrix, the statistical contribution of each module to the performance differences can be intuitively assessed, and the significance of the performance improvement can be verified.

https://doi.org/10.1371/journal.pone.0344622.g009

From the p-value heatmaps across the three datasets, it can be observed that the performance differences between the baseline SegMan and the final model pass strict pairwise significance tests for all four evaluation metrics, indicating that the original architecture alone is insufficient to achieve the overall boundary delineation and shape recovery performance of the proposed framework. In the single-module comparisons on HAM10000, adding either the boundary enhancement unit or the multi-scale shape prior unit still results in statistically significant differences from the final model across most metrics, suggesting that the complete architecture provides stable and verifiable improvements even under more complex imaging conditions and diverse sample sources. On ISIC 2018 and PH2, most metrics also exhibit significant differences between the single-module variants and the final model, with only a few cases—such as DSC on ISIC 2018 and several boundary-related metrics on PH2—falling short of the significance threshold. These exceptions indicate that individual modules may already approach the performance of the full model in certain measures, yet the synergistic combination of both modules yields a more robust and statistically consistent advantage overall.

5.5. Hyperparameter sensitivity experimental results

This study also conducts a hyperparameter analysis on the ISIC2018 dataset, focusing on the three loss weights $\lambda_1$, $\lambda_2$, and $\lambda_3$. The corresponding experimental results are presented in Table 8.

Table 8. Hyperparameter sensitivity analysis of loss weights $\lambda_1$, $\lambda_2$, and $\lambda_3$.

https://doi.org/10.1371/journal.pone.0344622.t008

From the overall trend across different combinations of loss weights, the boundary constraint and the shape prior exhibit complementary regulatory effects during model training. When both weights are set to low values, the model relies predominantly on the region classification loss, resulting in insufficient sensitivity to local contours. Moderately increasing the boundary term strengthens the model’s response to fine-grained structural details, producing clearer and more stable lesion boundaries, while the incorporation of the shape prior provides global geometric consistency during cross-scale fusion and prevents fragmented or geometrically distorted segmentation outputs. As the weights of these two priors continue to increase, the model’s capability for boundary and shape modeling improves simultaneously; however, excessively large weights may lead to overemphasis on local boundary textures or overly smoothed structural shapes. Consequently, the optimal performance is achieved when the two weights remain balanced, highlighting the coordinated roles of region, boundary, and shape cues in medical image segmentation.

5.6. Qualitative experimental results

To further illustrate the differences in model performance on real images, this section presents visualized segmentation results of various methods on representative skin lesion samples. Compared with relying solely on quantitative metrics, visual inspection provides a more intuitive understanding of how each model handles boundary details, structural completeness, and small-scale lesion regions. These comparative examples allow for a clearer observation of how the proposed modules influence segmentation quality under complex lesion conditions. The visualization results on the ISIC2018 dataset are first presented, as shown in Fig 10.

Fig 10. Qualitative Experimental Results of the ISIC2018 Dataset.

https://doi.org/10.1371/journal.pone.0344622.g010

From the visual comparisons on the ISIC 2018 dataset, it can be observed that different methods exhibit varying levels of boundary adherence and regional consistency when dealing with lesion areas characterized by blurred contours, fragmented structures, and complex texture patterns. Traditional approaches often struggle with contour shrinkage, local discontinuities, and inadequate suppression of small-scale noise, whereas the model proposed in this study maintains more stable boundary responses under complex texture backgrounds and produces more complete and coherent lesion structures overall. This behavior is largely attributed to the synergistic effect of the explicit boundary enhancement mechanism and the multi-scale shape prior, which enables the segmentation results to remain robust and structurally consistent even in cases of low contrast, strong interference, or highly irregular lesion morphology.

Furthermore, this paper presents qualitative experimental results for two other datasets, as shown in Figs 11 and 12.

Fig 11. Qualitative Experimental Results of the HAM10000 Dataset.

https://doi.org/10.1371/journal.pone.0344622.g011

Fig 12. Qualitative Experimental Results of the PH2 Dataset.

https://doi.org/10.1371/journal.pone.0344622.g012

From the visual results on the HAM10000 and PH2 datasets, it is evident that different methods exhibit noticeable performance variations when handling samples with high texture complexity, diverse color distributions, and substantial morphological differences. Competing models often suffer from boundary expansion, contour shrinkage, or structural fragmentation when processing lesions with blurry edges or irregular shapes. In contrast, the model proposed in this study consistently produces clearer boundary delineation and more stable regional coherence across diverse scenarios. Benefiting from the combined effect of the boundary-aware enhancement mechanism and the multi-scale shape prior, the model is able to generate segmentation outputs that remain structurally complete, smoothly contoured, and closely aligned with the true lesion boundaries even under high-noise backgrounds, weak edge contrast, or uncertain lesion morphology. These visual findings further demonstrate the robustness and generalization capability of the proposed method under cross-dataset conditions.

5.7. Grad-CAM experimental results

To further examine the spatial distribution of the regions attended to by the models during feature learning, this section presents a Grad-CAM–based visualization of the attention responses across different methods. By comparing the activation hotspot patterns generated during forward propagation, we can more intuitively understand how each module contributes to boundary information extraction and structural prior modeling. This visualization not only helps interpret the decision-making basis of the models but also provides additional insights into their feature focusing behaviors when confronted with complex skin lesion scenarios. The corresponding visual results are shown in Fig 13.

Fig 13. Experimental results of Grad-CAM on three datasets.

https://doi.org/10.1371/journal.pone.0344622.g013

Across the three datasets, the Grad-CAM visualizations show that the models consistently concentrate their attention on the core lesion regions and their surrounding boundaries under various imaging conditions and lesion morphologies, while exhibiting noticeably weaker responses to distracting areas such as background skin texture, body hair, or illumination artifacts. In the ISIC 2018 and HAM10000 examples, the heatmaps closely overlap with the segmentation masks regardless of whether the lesion appears as a large irregular patch or a compact small region, indicating that the network bases its decisions primarily on features closely related to lesion contours and internal structures. Even in PH2, where color contrast may be subtle and the surrounding texture more complex, the Grad-CAM overlays still display a focused activation distribution over the target regions. From an interpretability perspective, these patterns demonstrate that the discriminative features learned by the model align well with the clinically relevant lesion locations.

5.8. Model robustness analysis

In real-world dermatological lesion segmentation scenarios, images are often affected by factors such as sensor noise, transmission interference, and environmental illumination variations, making it essential to evaluate the robustness of the model under noise perturbations. Therefore, in this subsection, we keep the network architecture and training configurations unchanged, and systematically analyze the model’s stability and reliability under complex imaging conditions by injecting Gaussian noise of varying intensities into the input images. In addition, by comparing the segmentation performance across different noise variance settings, we further examine whether the proposed method can consistently maintain accurate lesion depiction and boundary coherence in the presence of random disturbances. The corresponding experimental results are presented in Fig 14.
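The perturbation protocol can be sketched in NumPy: zero-mean Gaussian noise of a given standard deviation is injected into the normalized input, then clipped back to the valid intensity range. The fixed seed here is an illustrative assumption for reproducible perturbations.

```python
import numpy as np

def add_gaussian_noise(img, sigma, seed=0):
    """Inject zero-mean Gaussian noise with standard deviation `sigma`
    into an image normalized to [0, 1], then clip back to the valid range."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)
```

Sweeping `sigma` over increasing values and re-running inference on the perturbed test set produces the metric-versus-noise curves reported below.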

Fig 14. The curves showing the changes in IoU, DSC, HD95, and ASD metrics of the proposed model on the ISIC 2018, HAM10000, and PH2 datasets under different Gaussian noise intensities demonstrate its robustness to lesion segmentation performance under random perturbations.

https://doi.org/10.1371/journal.pone.0344622.g014

From Fig 14, it can be observed that as the intensity of Gaussian noise gradually increases, the IoU and DSC metrics on all three datasets exhibit an overall downward trend, while HD95 and ASD rise accordingly. This pattern indicates that random perturbations substantially increase the difficulty of accurately locating lesion regions and maintaining boundary alignment. Comparing across datasets, HAM10000 shows the largest fluctuations under noise enhancement, suggesting that its more diverse imaging conditions and lesion morphologies make it more sensitive to noise. In contrast, PH2 presents the smoothest curves, reflecting stronger noise resistance on this relatively homogeneous dataset, while ISIC 2018 falls between the two. Overall, the metrics remain within a relatively high range under low to moderate noise levels, demonstrating that the proposed model is still capable of stably capturing lesion regions and their boundary structures in the presence of random disturbances.

5.9. Cross-dataset generalization experiment results

To evaluate the robustness of the proposed framework under domain shifts, we conduct cross-dataset generalization experiments by training the model on one dataset and directly testing it on another without any target-domain fine-tuning. This setting simulates practical deployment scenarios where acquisition devices, imaging protocols, and population distributions differ across clinical sources, thereby challenging both boundary delineation and morphological consistency modeling. The quantitative results are summarized in Table 9.

Table 9. Cross-domain generalization results on skin lesion segmentation.

https://doi.org/10.1371/journal.pone.0344622.t009

From the cross-dataset transfer results, the proposed method exhibits a certain degree of generalization across domains, although its performance is still noticeably affected by domain discrepancies. Using ISIC2018 as the source domain yields the best transfer performance; in particular, ISIC2018→PH2 achieves the highest IoU/DSC (52.1/66.9) while maintaining relatively low HD95/ASD (20.3/3.12), indicating that when the target-domain samples are more standardized or the distribution shift is relatively mild, the model better preserves boundary localization and overall region consistency. In contrast, HAM10000→ISIC2018 and PH2→ISIC2018 show a marked drop in IoU/DSC (38.7/52.3 and 33.5/48.0), accompanied by substantially increased HD95 and ASD (31.4/4.29 and 35.7/4.72, respectively), suggesting that stronger imaging-style differences, color distribution shifts, or lesion morphology deviations between source and target domains make the model more prone to contour misalignment, boundary fragmentation, and local mis-segmentation. Overall, the consistent trend that IoU/DSC decreases as HD95/ASD increases implies that cross-domain degradation is mainly manifested as amplified boundary errors and weakened morphological consistency, which further highlights the necessity of imposing stronger boundary and shape-prior constraints to enhance robustness under domain-shift scenarios.

5.10. PR curve experimental results

To support the following discussion, the precision–recall (PR) curves on the three datasets are presented in Fig 15.

As shown in Fig 15, the PR curve for the background class on all three datasets stays close to the top-right corner, with precision remaining high across nearly the full recall range, indicating that the model discriminates background pixels stably and with a low false-positive rate. In contrast, the lesion curve declines noticeably in the high-recall interval, and on HAM10000 the decline starts earlier and is steeper, indicating that as the model attempts to cover more of the lesion area it becomes more prone to false positives, with a correspondingly larger loss of precision. Overall, the lesion curves on PH2 and ISIC2018 are closer to the ideal shape, suggesting that the model characterizes lesion boundaries and fine-grained textures more fully under these two data distributions. Although performance differs somewhat across datasets, the overall trend is consistent.
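For reference, each point on a per-class PR curve is obtained by thresholding the pixel scores and counting true/false positives. A minimal sketch, assuming flattened pixel probabilities and binary labels (in practice one would use a library routine such as sklearn.metrics.precision_recall_curve, which sweeps all distinct score thresholds):

```python
def pr_points(scores, labels, thresholds):
    """Precision/recall of the positive (lesion) class at each threshold.

    scores: predicted lesion probabilities per pixel; labels: 1/0 ground truth.
    """
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        prec = tp / (tp + fp) if (tp + fp) else 1.0   # no positive predictions
        rec = tp / (tp + fn) if (tp + fn) else 1.0    # no positive pixels
        pts.append((prec, rec))
    return pts
```

Lowering the threshold trades precision for recall, which is exactly the high-recall drop-off visible in the lesion curves.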

6. Discussion

To address practical deployment requirements beyond FLOPs, we further discuss the runtime and memory behavior of the proposed BAHA–MLSP framework under consumer-assisted or point-of-care constraints. In our implementation, we profile end-to-end inference latency at the evaluation input resolution and batch size on two representative hardware settings: a workstation GPU (NVIDIA RTX 3090, CUDA 11.8) and a commodity CPU-only machine (Intel Core i7-10700). The measured average inference time is ms/image on the GPU and ms/image on the CPU, while peak activation memory during inference is 428 MB of GPU memory and 612 MB of CPU RAM.
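As a framework-agnostic illustration of such a profiling protocol, the sketch below times a callable after warm-up runs and records peak Python-heap usage with the standard library. It is a stand-in, not the actual harness: GPU latency would instead be measured with CUDA event timers and peak device memory with torch.cuda.max_memory_allocated.

```python
import time
import tracemalloc

def profile(fn, n_warmup=2, n_runs=10):
    """Return (mean latency in ms, peak traced heap in MB) for fn()."""
    for _ in range(n_warmup):          # discard cold-start iterations
        fn()
    tracemalloc.start()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        fn()
    elapsed_ms = (time.perf_counter() - t0) * 1000 / n_runs
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed_ms, peak / 2**20
```

Warm-up runs matter on GPU in particular, where the first forward pass pays kernel-compilation and allocator costs that would otherwise inflate the average.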

At the study level, a few mild limitations are worth noting. First, while our method demonstrates stable improvements across HAM10000, PH2, and ISIC2018, additional verification on more diverse acquisition settings and annotation protocols would further strengthen the generality of the conclusions. Second, our quantitative evaluation is mainly based on standard segmentation metrics and PR behavior; future work could complement these results with more fine-grained error characterization to better reflect clinical-facing quality perception. Finally, although the framework is designed to remain efficient for point-of-care usage, the module composition still introduces several tunable components, and minor dataset-specific adjustment may be beneficial to consistently achieve the best trade-off between accuracy and efficiency.

7. Conclusion

This study addresses the challenge of skin lesion segmentation under complex imaging conditions by proposing a dual-prior hybrid segmentation network that integrates both boundary priors and shape priors. Specifically, a Boundary-augmented Hybrid Attention (BAHA) module is introduced in the encoder, while a Multi-scale Lesion Shape Prior (MLSP) module is incorporated in the decoder, collectively enhancing the model’s ability to represent lesion structures from the perspectives of feature extraction and mask reconstruction. Through the synergistic effects of boundary enhancement, self-attention, and state space–inspired modeling, as well as the explicit constraints imposed by multi-scale shape priors and a unified loss design, the proposed method achieves superior segmentation performance to mainstream CNN-, Transformer-, and hybrid-based approaches on public datasets such as ISIC2018, HAM10000, and PH2. It consistently delivers more precise boundary delineation and more stable overall mask structure across metrics including IoU, DSC, HD95, and ASD, while maintaining controllable computational complexity and demonstrating strong generalization and robustness.

Despite the comprehensive experimental results, this work still has several limitations. It is validated only on 2D dermoscopic images, without incorporating patient clinical information or additional modalities, and beyond the preliminary cross-dataset transfer experiments it does not explore weakly supervised or semi-supervised settings that better reflect real-world conditions. Future work will investigate extending our method to multimodal and multi-center scenarios, enhancing its transferability and reliability in clinical environments by incorporating clinical metadata, adaptive domain alignment, and learning from incomplete annotations. In addition, lightweight architectural design and inference acceleration will be explored to promote deployment on edge devices and point-of-care decision support systems, ultimately providing more practical technological support for early skin lesion screening and precision diagnosis.

References

  1. 1. Wunderlich K, Suppa M, Gandini S, Lipski J, White JM, Del Marmol V. Risk Factors and Innovations in Risk Assessment for Melanoma, Basal Cell Carcinoma, and Squamous Cell Carcinoma. Cancers (Basel). 2024;16(5):1016. pmid:38473375
  2. 2. Chen Y, Gui H, Yao H, Adu-Brimpong J, Javitz S, Golovko V, et al. Single-Lesion Skin Cancer Risk Stratification Triage Pathway. JAMA Dermatol. 2024;160(9):972–6. pmid:38922597
  3. 3. Lindsay D, Soyer HP, Janda M, Whiteman DC, Osborne S, Finnane A, et al. Cost-Effectiveness Analysis of 3D Total-Body Photography for People at High Risk of Melanoma. JAMA Dermatol. 2025;161(5):482–9. pmid:40136266
  4. 4. Toptaş B. Enhanced Skin Lesion Segmentation via Attentive Reverse-Attention U-Net. Symmetry. 2025;17(11):2002.
  5. 5. Yu L, Min W, Wang S. Boundary-Aware Gradient Operator Network for Medical Image Segmentation. IEEE J Biomed Health Inform. 2024;28(8):4711–23. pmid:38776204
  6. 6. Jiang H, Li L-F, Yang X, Wang X, Luo M-X. BSNet: a boundary-aware medical image segmentation network. Eur Phys J Plus. 2025;140(1).
  7. 7. Mirikharaji Z, Abhishek K, Bissoto A, Barata C, Avila S, Valle E, et al. A survey on deep learning for skin lesion segmentation. Med Image Anal. 2023;88:102863. pmid:37343323
  8. 8. Basak H, Kundu R, Sarkar R. MFSNet: A multi focus segmentation network for skin lesion segmentation. Pattern Recognition. 2022;128:108673.
  9. 9. Ruan J, Xiang S, Xie M, Liu T, Fu Y. MALUNet: A Multi-Attention and Light-weight UNet for Skin Lesion Segmentation. In: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2022. 1150–6. http://dx.doi.org/10.1109/bibm55620.2022.9995040
  10. 10. Gulzar Y, Khan SA. Skin Lesion Segmentation Based on Vision Transformers and Convolutional Neural Networks—A Comparative Study. Applied Sciences. 2022;12(12):5990.
  11. 11. Yadav DP, Sharma B, Webber JL, Mehbodniya A, Chauhan S. EDTNet: A spatial aware attention-based transformer for the pulmonary nodule segmentation. PLoS One. 2024;19(11):e0311080. pmid:39546546
  12. 12. Cheng D, Gai J, Mao Y, Gao X, Zhang B, Jing W, et al. EA-Net: Research on skin lesion segmentation method based on U-Net. Heliyon. 2023;9(12):e22663. pmid:38076196
  13. 13. Liu Z, Hu J, Gong X, Li F. Skin lesion segmentation with a multiscale input fusion U-Net incorporating Res2-SE and pyramid dilated convolution. Sci Rep. 2025;15(1):7975. pmid:40055411
  14. 14. Akram A, Rashid J, Jaffar MA, Faheem M, Amin RU. Segmentation and classification of skin lesions using hybrid deep learning method in the Internet of Medical Things. Skin Res Technol. 2023;29(11):e13524. pmid:38009016
  15. 15. Shu X, Li Z, Tian C, Chang X, Yuan D. An active learning model based on image similarity for skin lesion segmentation. Neurocomputing. 2025;630:129690.
  16. 16. Almuayqil SN, Arnous R, Sakr N, Fadel MM. A new hybrid model for segmentation of the skin lesion based on residual attention U-Net. Computers, Materials & Continua. 2023;75(3).
  17. 17. Kumar SS, Shanmugam K, Jyothi V, Deepthi TV, Rao PS, Devi RS. Quantum-Enhanced Deep Learning Framework (QDLF): A Hybrid Approach for Advanced Skin Cancer Detection and Image Classification. Frontiers in Health Informatics. 2024;13(4).
  18. 18. Hou S, Hou J, Pang Y, Xia A, Hou B. MSAMamba-UNet: A Lightweight Multi-Scale Adaptive Mamba Network for Skin Lesion Segmentation. J Bionic Eng. 2025;22(6):3209–25.
  19. 19. Xiong Y, Shu X, Liu Q, Yuan D. HCMNet: A Hybrid CNN-Mamba Network for Breast Ultrasound Segmentation for Consumer Assisted Diagnosis. IEEE Trans Consumer Electron. 2025;71(3):8045–54.
  20. 20. Kumar A, Kanthen KR, John J. GS-TransUNet: integrated 2D Gaussian splatting and transformer UNet for accurate skin lesion analysis. In: Medical Imaging 2025: Computer-Aided Diagnosis, 2025. 790–800.
  21. 21. Xiong Y, Yuan D, Li L, Shu X. MFPNet: Mixed Feature Perception Network for Automated Skin Lesion Segmentation. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2024. 105–17.
  22. 22. Shu X, Li Z, Chang X, Yuan D. Variational methods with application to medical image segmentation: A survey. Neurocomputing. 2025;639:130260.
  23. 23. Alahmadi MD. Boundary Aware U-Net for Medical Image Segmentation. Arab J Sci Eng. 2022;48(8):9929–40.
  24. 24. Li C, Zhang J, Niu D, Zhao X, Yang B, Zhang C. Boundary-Aware Uncertainty Suppression for Semi-Supervised Medical Image Segmentation. IEEE Trans Artif Intell. 2024;5(8):4074–86.
  25. 25. Lee HJ, Kim JU, Lee S, Kim HG, Ro YM. Structure boundary preserving segmentation for medical image with ambiguous boundary. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. 4817–26.
  26. 26. Wu Z, Leng L. Edge-Aware Lightweight Network for Medical Image Segmentation. In: International Conference on Multimedia Information Technology and Applications, 2025. 79–85.
  27. 27. Wang J, Zhou C, Huang Y. Contour-Aware Multi-Expert Model for Ambiguous Medical Image Segmentation. IEEE Trans Med Imaging. 2025;44(8):3284–98. pmid:40232920
  28. 28. Xu R, Xu C, Li Z, Zheng T, Yu W, Yang C. Boundary guidance network for medical image segmentation. Sci Rep. 2024;14(1):17345. pmid:39069513
  29. 29. Yu J, Qi L. SAFE-Net: Shape-aware and feature enhancement network for polyp segmentation. Biomedical Signal Processing and Control. 2025;99:106906.
  30. 30. Shaker A, Maaz M, Rasheed H, Khan S, Yang M-H, Shahbaz Khan F. UNETR++: Delving Into Efficient and Accurate 3D Medical Image Segmentation. IEEE Trans Med Imaging. 2024;43(9):3377–90. pmid:38722726
  31. 31. Cui R, Liu L, Zou J, Hu X, Pei J, Qin J. Taming large vision model for medical image segmentation via Dual Visual Prompt Tuning. Comput Med Imaging Graph. 2025;124:102608. pmid:40695060
  32. 32. Li L, Ma Q, Ouyang C, Paetzold JC, Rueckert D, Kainz B. Topology Optimization in Medical Image Segmentation With Fast χ Euler Characteristic. IEEE Trans Med Imaging. 2025;44(12):5221–32. pmid:40720275
  33. 33. Cui R, Liang S, Zhao W, Liu Z, Lin Z, He W, et al. A Shape-Consistent Deep-Learning Segmentation Architecture for Low-Quality and High-Interference Myocardial Contrast Echocardiography. Ultrasound Med Biol. 2024;50(11):1602–10. pmid:39147622
  34. 34. Karimijafarbigloo S, Azad R, Kazerouni A, Merhof D. MedScale-Former: Self-guided multiscale transformer for medical image segmentation. Med Image Anal. 2025;103:103554. pmid:40209553
  35. 35. Wen H, Xu R, Zhang T. ISIC 2018 - A method for lesion segmentation. 2018. https://arxiv.org/abs/1807.07391
  36. 36. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161. pmid:30106392
  37. 37. Mendonca T, Ferreira PM, Marques JS, Marcal ARS, Rozeira J. PH² - a dermoscopic image database for research and benchmarking. Annu Int Conf IEEE Eng Med Biol Soc. 2013;2013:5437–40. pmid:24110966
  38. 38. Yuan Y, Chen X, Wang J. In: European conference on computer vision, 2020. 173–90.
  39. 39. Cheng B, Schwing A, Kirillov A. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems. 2021;34:17864–75.
  40. 40. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2102.04306
  41. 41. Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022. 1290–9.
  42. 42. Guo MH, Lu CZ, Hou Q, Liu Z, Cheng MM, Hu SM. Segnext: Rethinking convolutional attention design for semantic segmentation. Advances in Neural Information Processing Systems. 2022;35:1140–56.
  43. 43. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. In: International MICCAI brainlesion workshop, 2021. 272–84.
  44. 44. Yan H, Wu M, Zhang C. Multi-scale representations by varying window attention for semantic segmentation. arXiv preprint. 2024. https://doi.org/arXiv:240416573
  45. 45. Yu H, Cho Y, Kang B, Moon S, Kong K, Kang SJ. In: European Conference on Computer Vision, 2024. 92–110.
  46. 46. Ni Z, Chen X, Zhai Y, Tang Y, Wang Y. In: European Conference on Computer Vision, 2024. 239–55.
  47. 47. Fu Y, Lou M, Yu Y. SegMAN: Omni-scale context modeling with state space models and local attention for semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025. 19077–87.