
DSCANet: Integrating dual encoder and spatial cross-attention for polyp segmentation

  • Jun Su,

    Roles Conceptualization, Formal analysis, Software, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation School of Computer Science, Hubei University of Technology, Wuhan, China

  • Tiantian Shi,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft

    Affiliation School of Computer Science, Hubei University of Technology, Wuhan, China

  • Bogdan Adamyk

    Roles Software, Supervision, Validation, Visualization, Writing – review & editing

    b.adamyk@aston.ac.uk

    Affiliation Aston Business School, Aston University, Birmingham, United Kingdom

Abstract

Colonoscopy is a crucial clinical procedure for detecting colorectal polyps, which are strongly associated with the development of colon cancer. This endoscopic technique plays a vital role in both cancer prevention and early diagnosis. Accurate and efficient polyp segmentation is critical for enhancing the diagnostic reliability and clinical utility of colonoscopy. However, achieving precise segmentation presents significant challenges, primarily due to the diversity of polyps in their size and shape, coupled with poorly defined boundaries between polyps and surrounding tissues. To address these challenges, we propose a novel segmentation network, named DSCANet, which is a dual-branch encoder-structured network designed to efficiently fuse body and edge features for high-precision medical image segmentation. DSCANet integrates four key modules: a dual-branch encoder, a spatial cross-attention (SCA) module, a bipolar fusion (BF) module, and a flexible axis-attention (FAA) module. The dual-branch encoder consists of separate body and edge encoders, which extract respective features independently. The SCA module bridges the semantic gap between the two encoders’ features. The BF module fuses the shallowest and deepest features, while the FAA module assists the decoder in extracting semantic information from high-level features. DSCANet achieved superior performance on multiple colorectal polyp segmentation datasets. The code is available at https://github.com/Shantdst/DSCANet.

1. Introduction

Colorectal cancer (CRC) is a leading gastrointestinal malignancy that predominantly develops from the malignant transformation of colorectal polyps. Given the high incidence of CRC and its significant preventability, it has become a central focus of research in cancer prevention and treatment [1]. Numerous studies have demonstrated that the development of CRC is closely associated with the cancerous transformation of adenomatous polyps. Therefore, early detection and timely resection of these lesions are crucial for preventing and treating CRC. Clinical practice confirms that endoscopic resection of adenomatous polyps can reduce the risk of CRC by up to 90%, demonstrating the critical importance of early detection in CRC prevention and control. At present, colonoscopy is the preferred method for screening, diagnosing, and treating colorectal polyps. While the widespread use of this technique has reduced CRC incidence by approximately 30%, it still has certain limitations.

Deep learning-based medical image segmentation has significantly advanced clinical diagnosis [2,3]. While convolutional neural networks (CNNs) remain predominant in current segmentation approaches [4,5], their limited receptive fields restrict them to local feature extraction, often failing to model long-range spatial dependencies. Transformers [6] mitigate this limitation through self-attention mechanisms, enabling global contextual modeling.

In addition, polyp segmentation remains a challenging task, attributed to the variability in polyp size and shape, as well as poor target-to-background contrast ratio. As illustrated by the sample images of a polyp in Fig 1, the morphological diversity of polyps complicates their segmentation from the surrounding tissue. The substantial variability in size and the indistinct, often blurry boundaries further exacerbate the difficulty of polyp segmentation. To overcome these limitations, we developed an innovative multi-scale feature integration framework, termed DSCANet, optimized for polyp segmentation accuracy. The key contributions of this work include:

Fig 1. Some examples of typical polyps.

(a) and (b) show samples with blurred boundaries, (c) shows a tiny polyp sample, (d) a large polyp sample, and (e) polyp samples with different colors.

https://doi.org/10.1371/journal.pone.0345515.g001

  • We propose a hybrid network with a dual-branch encoder architecture, which is designed to separately extract edge and body features.
  • We introduce the Spatial Cross-Attention (SCA) module to bridge the semantic gap between the edge and body branches before feature fusion, thereby enhancing segmentation accuracy.
  • The model includes a two-stage fusion module that optimally combines shallow and deep feature representations.
  • We integrate a lightweight Flexible Axis-Attention (FAA) layer into the decoder, achieving significant performance improvements with negligible computational overhead.

2. Related works

2.1. Traditional approaches in polyp segmentation

Polyp segmentation approaches primarily fall into two paradigms: conventional image processing techniques and machine learning-based methods. Conventional approaches employ methods such as thresholding [7], edge detection [8], and region-based segmentation [9] by utilizing low-level features (e.g., color, texture, and shape).

In contrast, machine learning-based methods demonstrate superior performance in extracting discriminative color and texture features for polyp segmentation. For example, Li et al. [10] enhanced polyp segmentation by projecting image features into higher-dimensional spaces via machine learning. Maghsoudi et al. [11] proposed a vectorization method that clusters pixels with similar features for efficient segmentation.

While these conventional methods [12] are valuable, they depend on handcrafted features and exhibit limited generalizability for colorectal polyp analysis. This is particularly evident in scenarios characterized by complex contours, significant morphological diversity, and low-contrast conditions, which consequently lead to limited precision in polyp localization and mucosal boundary differentiation.

2.2. Deep learning methods in polyp segmentation

In recent years, the rapid advancement of deep learning techniques has enabled convolutional neural networks (CNNs) to significantly contribute to polyp segmentation through their powerful multi-scale feature encoding capabilities [13]. A seminal architecture, U-Net [14], introduced by Ronneberger et al., established an encoder-decoder symmetry with skip connections. Building upon this, Zhou et al. [15] developed U-Net++, which incorporates dense skip connections to significantly enhance multi-scale feature fusion and effectively bridge the semantic gap across network hierarchies. Other U-Net variants, including Levit-U-Net [16], PDAtt-U-Net [17], ERD-U-Net [18], and ADS-U-Net [19], further incorporate attention mechanisms, pyramid pooling, and advanced feature aggregation to optimize the framework for polyp segmentation challenges.

In computer vision, Transformers are recognized for their superiority in extracting global information and capturing long-range dependencies, whereas CNNs excel at modeling local features. Consequently, a series of novel Transformer-CNN hybrid architectures have been developed for polyp segmentation to leverage the complementary strengths of both paradigms. For instance, TransFuse [20] employs a parallel branching structure to fuse CNN and Transformer features, while SwinENet [21] couples EfficientNet’s local feature extraction with the Swin Transformer’s long-range dependency modeling. This design preserves hierarchical feature representations across both global and local scales, thereby significantly enhancing segmentation robustness and accuracy. The MIA-Net framework [22] achieves multi-scale information fusion through the parallel processing of Transformer attention mechanisms and convolutional feature maps. By incorporating dedicated feature extraction and cross-modal fusion modules, MIA-Net effectively improves segmentation accuracy. HSNet [23] synergistically integrates CNN and Transformer architectures for comprehensive local-to-global feature representation. In contrast, ColonFormer [24] employs an efficient Transformer encoder for multiscale feature learning and uses a CNN-based decoder to enable hierarchical feature aggregation, thereby improving accuracy for small polyps.

To fully leverage the rich information in medical images and significantly enhance segmentation precision, we propose DSCANet, an advanced dual-encoder hybrid model for polyp segmentation. Our model is designed to address the semantic gap prior to feature fusion, thereby enhancing the integration of body and edge information. Furthermore, it overcomes the limitations of previous feature fusion techniques and effectively extracts multi-level and fine-grained features, demonstrating outstanding segmentation performance.

3. The proposed method

The overall architecture of our proposed DSCANet is illustrated in Fig 2, which primarily consists of four key components: a dual-branch encoder, a Spatial Cross-Attention (SCA) module, a Bipolar Fusion (BF) module, and a Flexible Axis-Attention (FAA) module. The SCA module bridges the semantic gap between the two branching encoders prior to feature fusion, while the BF module is designed to effectively integrate high-level and low-level feature maps. The following subsections will provide a detailed exploration of each component.

Fig 2. The structure of the proposed DSCANet.

https://doi.org/10.1371/journal.pone.0345515.g002

3.1. Dual-branch encoder

The proposed dual-encoder architecture comprises two components: (1) a Swin Transformer-based body encoder [25] for global contextual modeling, and (2) a CNN-based edge encoder that employs Pixel Difference Convolution (PDC) [26] for local boundary delineation.

3.1.1. Edge encoder.

To address the limitations of conventional medical image segmentation approaches in edge feature extraction, we introduce a dedicated boundary-aware module to enhance segmentation precision. As illustrated in Fig 2, the edge encoder employs a four-stage hierarchical architecture. Each stage integrates four Pixel Difference Convolution (PDC) blocks for multiscale feature extraction, with max-pooling operations between stages progressively reducing feature map dimensions to construct a pyramidal representation. The initial stage transforms the three-channel input into a C-dimensional feature space while performing 4 × spatial downsampling to ensure dimensional consistency with the body encoder.

The core component, the PDC module, is designed to overcome the limited edge perception of standard convolutions by incorporating gradient-aware operations. While vanilla convolution performs a weighted accumulation of pixel intensities within a kernel, PDC computes the weighted sum of pixel value differences, thereby explicitly capturing edge information. The mathematical formulations for vanilla convolution and PDC are provided in Equations 1 and 2, respectively. Each PDC block consists of a ReLU activation followed by depthwise and pointwise convolutional layers. Residual connections are incorporated to alleviate the vanishing gradient problem and facilitate more stable gradient flow through the deep network.

$y = \sum_{i=1}^{k \times k} w_i \cdot x_i \quad (1)$

$y = \sum_{(x_i, x_i') \in \mathcal{P}} w_i \cdot (x_i - x_i') \quad (2)$

where $w_i$ denotes the weights within the k × k convolution kernel, $x_i$ and $x_i'$ represent the pixels covered by the kernel, and $\mathcal{P}$ refers to the set of pixel pairs selected within the kernel’s local coverage area. To further enhance edge delineation, we employ a supervised strategy that generates an edge map from the output features of each stage and penalizes its deviation from the ground truth.
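To make the pixel-difference idea concrete, the following is a minimal PyTorch sketch of a central-difference PDC layer; the class name and the central-difference variant are our assumptions, and the paper's PDC blocks additionally contain ReLU, depthwise/pointwise layers, and residual connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralPDC(nn.Module):
    """Central pixel-difference convolution (illustrative sketch): each kernel
    weight multiplies the difference between a neighbour pixel and the centre
    pixel, y = sum_i w_i * (x_i - x_c). This equals a vanilla convolution minus
    a 1x1 convolution whose weights are the per-channel kernel-weight sums."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)

    def forward(self, x):
        vanilla = self.conv(x)
        w_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (out, in, 1, 1)
        centre = F.conv2d(x, w_sum)  # centre pixel scaled by summed weights
        return vanilla - centre
```

On a constant input, all pixel differences vanish, so the response away from the padded border is zero — the layer reacts only to intensity changes, i.e., edges.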

3.1.2. Body encoder.

The body encoder utilizes the Swin Transformer architecture to generate high-level feature representations for global context modeling. Its shifted window mechanism enables efficient inter-region feature interaction, which has demonstrated superior performance in segmenting irregular organ structures. As indicated by the red blocks in Fig 2, the Swin Transformer encoder consists of four hierarchical stages. The initial stage employs a patch embedding layer to partition the input image into non-overlapping P × P patches. These patches are then linearly projected into a C-dimensional embedding space, incorporating positional encoding to preserve spatial information. The resulting sequence of token embeddings is processed by a series of Swin Transformer blocks.

In subsequent stages, patch merging operations downsample the feature maps while increasing their dimensionality, allowing the network to capture more complex features at larger receptive fields. Each stage is composed of multiple Swin Transformer blocks. Critically, these blocks are arranged in alternating pairs: one block employs a Window-based Multi-head Self-Attention (W-MSA) mechanism, while the subsequent block uses a Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism to enable cross-window communication. Each block comprises a Layer Normalization (LN) layer, a multi-head self-attention module (either W-MSA or SW-MSA), a Multilayer Perceptron (MLP), and residual connections skipping both the attention and MLP layers. The detailed computational procedures for these blocks are formalized in Equations (3) through (6).

$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1} \quad (3)$

$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l} \quad (4)$

$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l} \quad (5)$

$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1} \quad (6)$

Patch merging is conducted between stages to downsample the feature map, thereby capturing essential contextual features. Adjacent 2 × 2 patches are combined into a single larger patch, effectively reducing the number and size of patches to minimize the loss of information. Patch merging reduces the feature scale by a downsampling ratio of two. Given an initial image size of H × W × 3, the resulting feature dimensions for each processing stage are denoted as H/4 × W/4 × C, H/8 × W/8 × 2C, H/16 × W/16 × 4C, and H/32 × W/32 × 8C, respectively.
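The stage-wise feature pyramid above can be reproduced with a standard Swin-style patch-merging sketch (a minimal re-implementation for illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging sketch: concatenate each 2x2 neighbourhood
    along channels (C -> 4C), then linearly project to 2C, halving H and W."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                    # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)
```

Chaining three merging layers after the first stage yields exactly the H/8 × W/8 × 2C, H/16 × W/16 × 4C, and H/32 × W/32 × 8C shapes listed above.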

3.2. Spatial cross-attention module (SCA)

The proposed Spatial Cross-Attention (SCA) module takes n feature levels from the edge encoder as input to generate enhanced representations. These refined features, particularly from the highest and lowest levels, are then connected to the corresponding stages of the body encoder, thereby providing richer semantic information.

As illustrated in Fig 3, the SCA module operates in two sequential phases:

  1. Multi-scale Feature Tokenization: In this phase, multi-scale patch embedding modules process the input feature maps from the edge encoder to transform them into a sequence of tokenized representations.
  2. Cross-Attention Feature Refinement: The subsequent phase applies the core Spatial Cross-Attention mechanism to these tokens, enabling the modeling of long-range dependencies across the spatial domain and refining the features based on global context.

Initially, multi-scale patches are extracted from the four hierarchical stages of the edge encoder. Given the hierarchically scaled encoder stages $E_i$ and corresponding patch sizes $p_i$, where i = 1, 2, 3, 4, we extract patches using 2D average pooling with pooling and stride size $p_i$. Subsequently, we apply a 1 × 1 depthwise convolutional mapping to the flattened 2D patches:

$T_i = \text{DWConv}_{1 \times 1}(\text{Flatten}(\text{AvgPool}_{p_i}(E_i))) \quad (7)$

where $T_i$ (i = 1, 2, 3, 4) denotes the flattened patch tokens of the i-th encoder stage. It is worth noting that the patch count P remains invariant across all stages, allowing the use of cross-attention among these tokens.
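A minimal sketch of this pooling-based tokenization, assuming stage resolutions of H/4 down to H/32 and a target token grid equal to the smallest stage (the 1 × 1 depthwise mapping is omitted for brevity, and the function name is ours):

```python
import torch
import torch.nn.functional as F

def tokenize_stages(feats, target=7):
    """Multi-scale patch embedding sketch: average-pool each encoder stage with
    a scale-dependent window so every stage yields the same token count, then
    flatten to (B, P, C_i). `target` is the side of the shared token grid."""
    tokens = []
    for f in feats:                          # f: (B, C_i, H_i, W_i)
        p = f.shape[-1] // target            # pooling (= stride) size per stage
        t = F.avg_pool2d(f, kernel_size=p, stride=p)
        tokens.append(t.flatten(2).transpose(1, 2))  # (B, P, C_i)
    return tokens
```

Because the pooling window scales with the stage resolution, every stage produces the same number of tokens P, which is exactly what permits cross-attention among them.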

The SCA mechanism is shown in Fig 4. Given the reshaped outputs $T_i$ (i = 1, 2, ..., n), layer normalization is applied, followed by concatenation across the channel dimension. We use the concatenated tokens as the query and key, and each individual token as the value, applying a 1 × 1 depthwise convolutional projection to the queries, keys, and values:

Fig 4. The spatial cross-attention mechanism.

https://doi.org/10.1371/journal.pone.0345515.g004

$Q = \text{DWConv}_{1 \times 1}(\text{Concat}(T_1, \ldots, T_n)) \quad (8)$

$K = \text{DWConv}_{1 \times 1}(\text{Concat}(T_1, \ldots, T_n)) \quad (9)$

$V_i = \text{DWConv}_{1 \times 1}(T_i) \quad (10)$

where Q, K, and $V_i$ are the projected query, key, and value, respectively. Then SCA can be represented as

$\text{SCA}(Q, K, V_i) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V_i \quad (11)$

in this context, Q, K, and $V_i$ denote the matrices of query, key, and value embeddings, respectively, and $\sqrt{d_k}$ is the scaling factor. In the multi-head configuration, $d_k = C/h$, where h represents the total number of heads. The output of the SCA is then processed through a depthwise convolution, followed by layer normalization and GeLU activation. Finally, the n outputs of the SCA blocks are connected to the corresponding decoder sections via an upsampling layer, followed by a sequence of 1 × 1 convolution, batch normalization, and ReLU activation. Cross-attention differs from self-attention in that it creates an attention map by fusing multi-scale encoder features rather than operating on each stage individually, which allows it to capture long-range relationships across distinct encoder stages.

3.3. Bipolar fusion module (BF)

Currently, a major challenge lies in effectively fusing multi-level features from CNNs and Swin Transformers while maintaining feature consistency. Traditional approaches involve summing the features from CNN levels and integrating the associated Swin Transformer layers into the decoder, subsequently producing a segmentation map. However, this method makes it difficult to ensure feature consistency across different levels, leading to suboptimal performance. Consequently, we introduce a novel module named BF, which effectively addresses this issue by accepting the shallowest and deepest levels as inputs and utilizing a cross-attention mechanism to combine multi-scale information.

Typically, shallower levels contain more precise localization information, whereas deeper levels contain more semantic content, which is ideally matched for the decoder. To efficiently fuse multi-scale features and reduce computational overhead, we selectively incorporate only the shallowest and deepest levels into the feature fusion process. In the proposed BF module, the class token plays a pivotal role by aggregating comprehensive information from the input features. These class tokens are generated through the Global Average Pooling (GAP) applied to each level’s features. The process for obtaining class tokens is described as follows:

$x_{cls}^{i} = \text{GAP}(x^{i}) \quad (12)$

where $x^{i}$ denotes the features of level i. The class tokens are subsequently concatenated with the corresponding level embeddings before being passed to the transformer encoders. S transformer encoders are deployed at the smaller level, and L transformer encoders at the larger level, for global self-attention computation. Importantly, learnable position embeddings are added to each token at both levels, injecting positional information into the transformer encoders' learning process.

Upon transmitting the embedding information through the transformer encoder, the features of each level are integrated through the cross-attention module. Prior to fusion, the class tokens of the two levels are exchanged, meaning that the class tokens from one level are linked to those of the other level. Each newly generated embedding is then integrated via the module and subsequently reprojected to its corresponding level. Interactions between tokens from distinct levels allow class tokens to propagate extensive information across the levels.

In particular, this interaction for the smaller level is illustrated in Fig 5. The class token of the smaller level is first projected to the dimension of the larger level, yielding $x_{cls}'$. This projected token is concatenated with the patch tokens $x_{patch}$ of the larger level to form the key and value for the cross-attention calculation, while $x_{cls}'$ alone serves as the query. Given that only the class token is queried, the cross-attention mechanism runs in linear time, producing the output $y_{cls}$, which can be expressed mathematically as:

$y_{cls} = x_{cls}' + \text{MCA}(\text{LN}([x_{cls}' \, \| \, x_{patch}])) \quad (13)$

$\text{MCA}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \quad (14)$
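A minimal single-head sketch of this class-token cross-attention, assuming the class token has already been projected to the larger level's width (the function name and residual form are ours):

```python
import torch

def class_token_cross_attention(cls_small, patches_large):
    """BF interaction sketch for the smaller level: the projected class token
    is the sole query; key and value are that token concatenated with the
    larger level's patch tokens. With one query, attention cost is linear in
    the number of patch tokens."""
    kv = torch.cat([cls_small, patches_large], dim=1)     # (B, 1+N, C)
    attn = torch.softmax(
        cls_small @ kv.transpose(1, 2) / cls_small.shape[-1] ** 0.5, dim=-1)
    return cls_small + attn @ kv                          # residual, (B, 1, C)
```

Since only the single class token attends over the patch tokens, the attention matrix is 1 × (N+1) rather than (N+1) × (N+1), which is what keeps the fusion cheap.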

3.4. Flexible axial-attention for decoder (FAA)

To enhance the decoder’s extraction of semantically rich high-level feature information, we employ an FAA mechanism inspired by axial attention in visual recognition. To accommodate computational constraints, operations are restricted to the height and width axes of the decoder feature map, implementing axial attention along these two axes. This approach notably improves computational efficiency while approximating full self-attention.

Specifically, the axial attention mechanism in our study efficiently processes non-local contexts while achieving significant computational efficiencies. The method incorporates positional biases into its framework and effectively encodes remote interactions within the input feature maps. However, this approach performs best on large-scale datasets, as axial attention is more adept at learning positional biases among keys, queries, and values. For small-scale datasets, particularly in medical image segmentation, learning positional biases proves challenging. Consequently, we introduce an enhanced axial attention block to reduce the impact of positional bias on non-local context encoding. The attention mechanism applied to the width axis is formally expressed in Equation 15, with a similar formulation applying to the height dimension.

$y_{ij} = \sum_{w=1}^{W} \text{softmax}\left(q_{ij}^{\top} k_{iw} + G_Q \, q_{ij}^{\top} r_{iw}^{q} + G_K \, k_{iw}^{\top} r_{iw}^{k}\right) \left(G_{V_1} v_{iw} + G_{V_2} \, r_{iw}^{v}\right) \quad (15)$

The proposed framework incorporates an adjustable gating mechanism. The gates $G_Q$, $G_K$, $G_{V_1}$, and $G_{V_2}$ are learnable parameters that are updated automatically, collectively controlling the influence of the learned relative position encodings on non-local context encoding. Typically, if a relative position encoding is learned accurately, the gating mechanism assigns it a higher weight than one that is not accurately learned. Fig 6 illustrates the feedforward process of the gating mechanism in axial attention.
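A compact sketch of gated axial attention along one axis, following the description above; the class name, single-head form, and scaling details are our assumptions, and the height axis and multi-head structure are omitted for brevity:

```python
import torch
import torch.nn as nn

class GatedAxialAttention1D(nn.Module):
    """Gated axial attention sketch (width axis): learnable scalar gates scale
    the relative-position terms, so poorly learned positional biases can be
    down-weighted on small datasets."""

    def __init__(self, dim, length):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        # relative position encodings for query, key, and value along the axis
        self.r_q = nn.Parameter(torch.zeros(length, dim))
        self.r_k = nn.Parameter(torch.zeros(length, dim))
        self.r_v = nn.Parameter(torch.zeros(length, dim))
        # the gates G_Q, G_K, G_V1, G_V2 from the equation above
        self.g_q = nn.Parameter(torch.ones(1))
        self.g_k = nn.Parameter(torch.ones(1))
        self.g_v1 = nn.Parameter(torch.ones(1))
        self.g_v2 = nn.Parameter(torch.ones(1))

    def forward(self, x):                    # x: (B, L, C), one row of the map
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(1, 2)
                  + self.g_q * (q @ self.r_q.T)
                  + self.g_k * (k @ self.r_k.T)) / q.shape[-1] ** 0.5
        attn = torch.softmax(logits, dim=-1)
        return self.g_v1 * (attn @ v) + self.g_v2 * (attn @ self.r_v)
```

In practice the same block is applied once along the height axis and once along the width axis, which reduces attention cost from O((HW)^2) to O(HW(H+W)).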

3.5. Loss function

To accommodate the two-branch structure of our network, we define a composite loss function consisting of two specialized components, $\mathcal{L}_{edge}$ and $\mathcal{L}_{body}$, each tailored to a specific branch.

3.5.1. Edge supervision loss.

Since most edge detection samples yield negative results, we use an annotator-robust loss [27]. We implement this loss function across all hierarchical edge maps produced by the encoder’s multi-stage architecture. Specifically, for each spatial coordinate i in the j-th level edge detection output , the loss is computed through a conditional evaluation framework defined as follows:

$l_i^j = \begin{cases} -\alpha \cdot \log(1 - p_i^j), & y_i = 0 \\ 0, & 0 < y_i < \gamma \\ -\beta \cdot \log(p_i^j), & \text{otherwise} \end{cases} \quad (16)$

In our formulation, $p_i^j$ denotes the predicted value of the i-th pixel in the j-th edge map. A predefined threshold $\gamma$ governs the sample labeling: pixels annotated as positive but with confidence scores below $\gamma$ are regarded as ambiguous and contribute nothing to the loss. The dataset’s class imbalance is quantified by $\beta$, representing the proportion of negative samples, which serves as the weight on positive samples. To balance the contributions of positive and negative samples, we introduce the weighting factor $\alpha = \lambda \cdot (1 - \beta)$ on negative samples, where $\lambda$ is a hyperparameter. The hyperparameters for the edge loss are γ = 0.6, β = 0.8, λ = 2.0. The total edge loss is computed as the summation of per-pixel losses across all edge maps:

$\mathcal{L}_{edge} = \sum_{j} \sum_{i} l_i^j \quad (17)$
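The edge loss described above can be sketched directly; the exact weighting in the paper's implementation may differ, so treat this as an illustration of the thresholded, class-balanced form:

```python
import torch

def edge_loss(pred, target, gamma=0.6, beta=0.8, lam=2.0):
    """Annotator-robust edge loss sketch: pixels labelled positive but with
    ground-truth confidence below gamma are ignored; beta (the proportion of
    negatives) weights positives, and alpha = lam * (1 - beta) weights
    negatives."""
    alpha = lam * (1.0 - beta)
    pred = pred.clamp(1e-6, 1 - 1e-6)          # numerical safety for the logs
    pos = (target >= gamma).float()
    neg = (target == 0).float()                # 0 < target < gamma contributes 0
    loss = -(beta * pos * torch.log(pred) + alpha * neg * torch.log(1 - pred))
    return loss.sum()
```

The middle branch of Equation (16) shows up as the pixels that fall in neither mask: a pixel with confidence 0.3 (between 0 and γ = 0.6) leaves the loss unchanged no matter what the network predicts there.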

3.5.2. Body supervision loss.

The inherent class imbalance in polyp image segmentation poses significant challenges. To mitigate this issue, we utilize a hybrid loss function that combines Binary Cross-Entropy Loss and Dice Loss, which effectively balances the learning between foreground and background regions.

The Binary Cross-Entropy Loss quantifies pixel-wise prediction errors through a logarithmic penalty, making it widely applicable for semantic segmentation tasks. Its formulation is given by:

$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \quad (18)$

In medical image analysis where foreground-background imbalance is prevalent, the Dice Loss function offers particular advantages by focusing on region-based overlap rather than pixel-level classification. The loss function is mathematically expressed as:

$\mathcal{L}_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} p_i y_i + \varepsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} y_i + \varepsilon} \quad (19)$

where $p_i$ and $y_i$ denote the predicted probability and ground-truth label of pixel i, and $\varepsilon$ is a small smoothing constant.

Our hybrid loss function combines binary cross-entropy and Dice loss through linear combination:

$\mathcal{L}_{body} = \lambda_1 \mathcal{L}_{BCE} + \lambda_2 \mathcal{L}_{Dice} \quad (20)$

The overall loss function L integrates both edge-aware and region-based components through a balancing hyperparameter:

$\mathcal{L} = \mathcal{L}_{body} + \lambda_3 \mathcal{L}_{edge} \quad (21)$

We set the weight parameters λ₁, λ₂, and λ₃ to 0.6, 0.4, and 0.3, respectively.
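Putting the body terms together, here is a hedged sketch of the hybrid objective; the edge term below is a simple BCE stand-in for the annotator-robust loss of Section 3.5.1, and the function names are ours:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss over probability maps, with smoothing constant eps."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(pred, target, edge_pred=None, edge_gt=None,
               l1=0.6, l2=0.4, l3=0.3):
    """Composite objective sketch: L_body = l1*BCE + l2*Dice, plus the edge
    branch weighted by l3 when edge supervision is available."""
    body = (l1 * F.binary_cross_entropy(pred, target)
            + l2 * dice_loss(pred, target))
    if edge_pred is None:
        return body
    edge = F.binary_cross_entropy(edge_pred, edge_gt)  # stand-in edge term
    return body + l3 * edge
```

Combining the region-overlap Dice term with pixel-wise BCE is a common way to keep gradients informative when the polyp occupies only a small fraction of the frame.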

4. Experiments

4.1. Dataset details

We employ two benchmark medical image segmentation datasets, Kvasir-SEG [28] and CVC-ClinicDB [29], to evaluate our approach.

  1. The Kvasir-SEG dataset, focusing on pixel-level segmentation of colorectal polyps, comprises 1000 images of gastrointestinal polyps and their corresponding segmentation masks. The dataset is split with 80% allocated to the training set and the remaining 20% reserved for testing.
  2. The CVC-ClinicDB dataset, used for segmenting lesion regions in colonoscopy images, includes 612 raw images and their corresponding segmentation masks. The data is divided into training and test sets in a 75% to 25% ratio.

To improve model robustness and prevent overfitting, we applied data augmentations including random horizontal/vertical flipping, random rotation, and color jittering in brightness, contrast, and saturation. These measures help the model generalize across variations in polyp shape, size, and visual appearance.

4.2. Implementation details

We implemented the proposed method using the PyTorch framework and conducted all experiments on an NVIDIA GeForce RTX 4060 (8 GB) GPU. To enhance computational efficiency, we resized all training, testing, and validation images to 224 × 224. The proposed model is trained using the Adam optimizer with a weight decay of 0.01 and an initial learning rate of 0.01. The learning rate is scheduled by the ReduceLROnPlateau algorithm with patience = 5, factor = 0.5, and min_lr = 1e-6. For comparison, we also conducted experiments using the SGD optimizer with an initial learning rate of 0.01 under the same scheduling settings. The batch size was set to 16 for both the Kvasir-SEG and CVC-ClinicDB datasets.
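The optimizer and scheduler configuration above translates directly into PyTorch; in this sketch the `Conv2d` is a stand-in for DSCANet and the training loop is elided:

```python
import torch

# Stand-in model; in practice this would be the DSCANet network.
model = torch.nn.Conv2d(3, 1, 3, padding=1)

# Adam with lr = 0.01 and weight decay = 0.01, as described above.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)

# ReduceLROnPlateau: halve the lr after 5 epochs without improvement,
# never going below 1e-6.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5, min_lr=1e-6)

# At the end of each epoch, step the scheduler on the validation loss:
val_loss = 1.0  # placeholder value
scheduler.step(val_loss)
```

`ReduceLROnPlateau` reacts to the monitored metric rather than the epoch index, so the schedule adapts automatically if training stalls early or late.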

All experiments were conducted using the aforementioned datasets, with Dice loss as the primary loss function. Performance was evaluated using the Dice Similarity Coefficient (DSC), mean Intersection over Union (mIoU), Precision, Recall, and F1 Score (F1). We trained both the proposed model and the comparison models from scratch for 200 epochs to ensure optimal performance on each dataset.

Below are the mathematical formulations for each of these metrics.

$\text{DSC} = \frac{2TP}{2TP + FP + FN} \quad (22)$

$\text{mIoU} = \frac{TP}{TP + FP + FN} \quad (23)$

$\text{Precision} = \frac{TP}{TP + FP} \quad (24)$

$\text{Recall} = \frac{TP}{TP + FN} \quad (25)$

$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (26)$

In this context, TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively. These metrics are essential statistical measures derived from the comparison of predicted categories with actual categories.
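These definitions translate directly into code; a small sketch computing all five metrics from raw counts (note that for binary segmentation the DSC and F1 formulas are algebraically identical):

```python
def seg_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics of Equations (22)-(26) from the
    confusion-matrix counts of a binary segmentation."""
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"DSC": dsc, "IoU": iou, "Precision": precision,
            "Recall": recall, "F1": f1}
```

For example, 8 true positives with 2 false positives and 2 false negatives give DSC = 16/20 = 0.8 and IoU = 8/12 ≈ 0.667, illustrating that IoU penalizes the same errors more heavily than DSC.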

4.3. Comparative experiment

To rigorously evaluate DSCANet’s performance advantages, we compare it with eight classic medical image segmentation models: U-Net [14], ResUnet++ [30], MultiResUnet [31], R2U-Net [32], VNet [33], DoubleU-Net [34], TransUNet [35], and BEFUnet [36].

4.3.1. Results of Kvasir-SEG.

Table 1 presents a comparison between our proposed DSCANet and previous state-of-the-art (SOTA) methods on the Kvasir-SEG dataset. The experimental results indicate that the proposed method demonstrates exceptional segmentation accuracy, with scores of 95.28% (DSC) and 90.98% (mIoU). Compared to U-Net, its variants, and BEFUnet (which similarly employs a dual-encoder structure), our method demonstrates significant improvements in the DSC evaluation metric. Furthermore, DSCANet outperforms all competing methods in Recall. Notably, DSCANet surpasses both CNN-based methods (which rely solely on local information) and transformer-based approaches, demonstrating superior feature representation and edge prediction capabilities.

Table 1. Experimental results on the Kvasir-SEG dataset.

https://doi.org/10.1371/journal.pone.0345515.t001

Fig 7 compares visualized segmentation results obtained by several representative methods and our approach on the Kvasir-SEG dataset. The results highlight the limitations of existing methods in segmenting polyp edges, whereas our method effectively extracts edge features through its dedicated edge encoder, delivering superior segmentation performance.

Fig 7. Visual illustration of the proposed method applied to the Kvasir-SEG dataset.

Incorrectly segmented polyp regions are outlined in red.

https://doi.org/10.1371/journal.pone.0345515.g007

4.3.2. Results of CVC-ClinicDB.

Through experimental comparison of DSCANet with classical models on the CVC-ClinicDB dataset, our method maintains superior performance. As shown in Table 2, DSCANet outperforms all competing methods across all metrics, achieving segmentation accuracies of 95.45% (DSC) and 91.28% (mIoU), demonstrating balanced segmentation capability. The Precision and Recall metrics (95.86% and 96.29%, respectively) are 0.67 and 0.51 percentage points higher than those of DoubleU-Net, which is renowned for its excellent segmentation performance.

Table 2. Experimental results on the CVC-ClinicDB.

https://doi.org/10.1371/journal.pone.0345515.t002

Fig 8 presents a visual comparison of segmentation results from multiple models on the CVC-ClinicDB dataset. Notably, our method produces results that more closely match the ground truth compared to other approaches. Evaluation across both datasets demonstrates that DSCANet delivers comprehensive and robust segmentation performance, consistently achieving improved accuracy. Specifically, DSCANet addresses the limitations of previous methods by effectively extracting and fusing both main body and edge features through multi-scale integration, enabling high-accuracy segmentation.

Fig 8. Visual illustration of the proposed method applied to the CVC-ClinicDB dataset.

https://doi.org/10.1371/journal.pone.0345515.g008

4.4. Qualitative analysis

Tables 1 and 2 present the segmentation results of different models on the Kvasir-SEG and CVC-ClinicDB datasets. The results demonstrate that our method significantly outperforms the classical U-Net model, achieving Dice scores of 95.28% and 95.45% on the Kvasir-SEG and CVC-ClinicDB datasets, respectively. Among the compared methods, DoubleU-Net and TransUNet show outstanding segmentation performance on the Kvasir-SEG dataset. Compared to DoubleU-Net, our method improves the Dice score by 0.77 percentage points, mIoU by 1.35 percentage points, and Recall by 0.46 percentage points. Similarly, DoubleU-Net and ResUNet achieve excellent segmentation results on the CVC-ClinicDB dataset. Our method attains mIoU scores that are 1.32 and 2.13 percentage points higher than those of DoubleU-Net and ResUNet, respectively. Additionally, our method achieves the highest F1 scores on both datasets, confirming the superior overall performance of DSCANet.

4.5. Analysis of computational complexity

We evaluated computational efficiency along three dimensions: the number of parameters (Params), floating-point operations (FLOPs), and inference time. As summarized in Table 3, TransUNet has the largest parameter count (105.37M), indicating the highest memory demand among all competitors. Meanwhile, DoubleU-Net exhibits the highest computational cost at 130.93 GFLOPs, making it the least efficient model in terms of inference complexity. In contrast, our DSCANet requires only 31.25M parameters and 58.63 GFLOPs, with an inference time of 0.0442, significantly lower than most counterparts. This demonstrates that DSCANet maintains competitive performance while achieving superior computational efficiency and a reduced memory footprint.

Table 3. Computational complexity comparison of the evaluated models in terms of FLOPs and Params, with an input resolution of 224 × 224. The most efficient results for each column are shown in bold.

https://doi.org/10.1371/journal.pone.0345515.t003

4.6. Ablation study

4.6.1. Module ablation study.

To validate the effectiveness of the key components in DSCANet, we conducted a comprehensive series of ablation experiments on the Kvasir-SEG and CVC-ClinicDB datasets. The baseline model consists of a two-branch encoder and decoder. We systematically enhanced the baseline by sequentially integrating the SCA, BF, and FAA modules, evaluating six distinct combinations: baseline + SCA, baseline + BF, baseline + FAA, baseline + SCA + BF, baseline + SCA + FAA, and baseline + BF + FAA. As demonstrated in Tables 4 and 5, each module contributes significantly to improving the baseline model’s performance on both datasets. The visualization results are presented in Fig 9.

Table 4. The module ablation experiments results on the Kvasir-SEG dataset.

https://doi.org/10.1371/journal.pone.0345515.t004

Table 5. The module ablation experiments results on the CVC-ClinicDB dataset.

https://doi.org/10.1371/journal.pone.0345515.t005

Fig 9. Visual illustration of module ablation study.

https://doi.org/10.1371/journal.pone.0345515.g009

SCA Module Influence: Model M2 (enhanced with SCA) achieves significant improvements in segmentation accuracy on both the Kvasir-SEG and CVC-ClinicDB datasets. The Dice scores improve by 2.34 percentage points and 1.78 percentage points, respectively, compared to the baseline model. Furthermore, SCA enhances all evaluated metrics, including mIoU, Precision, and Recall. The SCA module effectively bridges the semantic gap between the two encoders and facilitates multi-scale feature fusion, thereby significantly improving the model’s overall segmentation performance. As shown in Fig 10a, the SCA module markedly improves the baseline model’s boundary detection capability for large polyps, resulting in enhanced segmentation accuracy.

Fig 10. The detailed visual results of the module ablation experiments.

The wrong target polyp is shown with a red outline.

https://doi.org/10.1371/journal.pone.0345515.g010
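The SCA module's equations are not restated in this section; as a hedged illustration of the general mechanism, the sketch below implements a minimal spatial cross-attention in NumPy, in which body-branch features query edge-branch features. Learned projection layers and normalization are omitted for brevity, so this is our simplification of the idea, not the exact DSCANet formulation:

```python
import numpy as np

def spatial_cross_attention(body, edge):
    """Minimal spatial cross-attention: body features attend to edge features.

    body, edge : (C, H, W) feature maps of the same shape.
    Returns a fused (C, H, W) map via a residual connection.
    """
    c, h, w = body.shape
    q = body.reshape(c, h * w).T              # (HW, C) queries from the body branch
    k = edge.reshape(c, h * w).T              # (HW, C) keys from the edge branch
    v = k                                     # values share the edge features here
    scores = q @ k.T / np.sqrt(c)             # (HW, HW) spatial affinity map
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)   # softmax over edge positions
    out = (attn @ v).T.reshape(c, h, w)       # edge context aggregated per position
    return body + out                         # residual fusion with body features
```

The residual form means that when the edge branch carries no signal, the body features pass through unchanged; attention only injects edge context where the two branches agree spatially.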

BF Module Influence: The proposed BF module effectively integrates lowest- and highest-level features, addressing the limitation of inadequate feature extraction in prior methods. When integrated with the baseline model on the Kvasir-SEG dataset, BF improves the DSC metric by 1.09 percentage points. Furthermore, when combined with SCA and FAA modules on the CVC-ClinicDB dataset, BF increases mIoU by 1.93 and 2.04 percentage points, respectively. As shown in Fig 10b, the baseline model is susceptible to noise interference, particularly in small polyp segmentation, resulting in reduced accuracy. By contrast, BF incorporation significantly enhances model robustness and improves detection performance for small polyps.

FAA Module Influence: The FAA component enhances the decoder’s ability to extract high-level semantic features. Notably, the baseline + SCA + FAA combination achieves the second-best performance on Kvasir-SEG (after DSCANet), with a DSC of 94.85%, mIoU of 90.72%, Precision of 94.37%, and Recall of 95.22%, representing significant improvements over the baseline. Comparative analysis shows that removing FAA from DSCANet reduces Precision by 1.46 and 2.69 percentage points on Kvasir-SEG and CVC-ClinicDB, respectively. Fig 10c demonstrates that while the baseline model struggles with noise artifacts in dark backgrounds, FAA integration dramatically improves robustness and segmentation precision.

4.6.2. Loss function ablation study.

To validate the efficacy of the proposed body encoder loss strategy, we conducted comprehensive ablation studies while keeping the edge encoder’s loss function fixed. We evaluated the body encoder using both Binary Cross-Entropy (BCE) Loss and Dice Loss (DL) on the Kvasir-SEG and CVC-ClinicDB datasets. The quantitative results are presented in Table 6.

Table 6. Loss ablation experiments results on the Kvasir-SEG and CVC-ClinicDB datasets.

https://doi.org/10.1371/journal.pone.0345515.t006

When employing Binary Cross-Entropy (BCE) loss alone for the body encoder, the model demonstrates significantly inferior segmentation performance compared to our hybrid loss strategy. Although BCE is widely adopted in semantic segmentation tasks, it suffers from a critical limitation: in scenarios with severe class imbalance (where foreground pixels are substantially outnumbered by background pixels), the background components dominate the loss function, leading to model bias toward background prediction and suboptimal performance.

This limitation is particularly evident in polyp segmentation, where polyp sizes exhibit considerable variation. While BCE achieves satisfactory results for larger polyps, its performance deteriorates significantly for smaller polyps. To address this challenge, we incorporate DL, which effectively mitigates the impact of foreground-background imbalance and enables more accurate segmentation of smaller polyps.

Notably, when employing either BCE or DL alone as the body encoder’s loss function, the model exhibits slower convergence compared to their combined usage. Moreover, within limited training epochs, our proposed hybrid loss strategy achieves superior performance, outperforming the standalone DL approach by 0.75 and 0.78 percentage points on the Kvasir-SEG and CVC-ClinicDB datasets, respectively, as measured by Dice scores.

Our proposed hybrid loss function, which combines weighted BCE and DL, effectively handles the highly imbalanced foreground-background ratios in polyp segmentation. As shown in Fig 11, the hybrid loss strategy demonstrates significantly faster convergence compared to using BCE alone, while maintaining stable optimization behavior throughout training.
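The exact weighting of the hybrid loss is not restated in this section; the sketch below illustrates the general form of a weighted BCE + Dice combination in NumPy. Equal weights are an illustrative assumption on our part, not necessarily the paper's setting:

```python
import numpy as np

def bce_loss(p, g, eps=1e-7):
    """Binary cross-entropy over predicted probabilities p and binary targets g."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(g * np.log(p) + (1 - g) * np.log(1 - p))

def dice_loss(p, g, eps=1e-7):
    """Soft Dice loss: insensitive to foreground/background imbalance."""
    inter = (p * g).sum()
    return 1 - (2 * inter + eps) / (p.sum() + g.sum() + eps)

def hybrid_loss(p, g, w_bce=0.5, w_dice=0.5):
    """Weighted combination of BCE and Dice (illustrative equal weights)."""
    return w_bce * bce_loss(p, g) + w_dice * dice_loss(p, g)
```

Because the Dice term is normalized by the total foreground mass rather than the pixel count, it keeps the gradient signal meaningful for small polyps, while the BCE term supplies dense per-pixel gradients that stabilize early training; this complementarity is what the faster convergence in Fig 11 reflects.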

Fig 11. Training loss curve of different loss strategies on the Kvasir dataset.

(a) Training loss curve of Binary Cross-Entropy Loss. (b) Training loss curve of combination of BCE and DL.

https://doi.org/10.1371/journal.pone.0345515.g011

5. Conclusions

In this study, we propose DSCANet, a novel dual-branch encoder architecture designed for efficient multi-scale feature fusion. Our framework comprises three key components: a SCA module that bridges the body and edge encoders, effectively minimizing semantic gaps during feature fusion while enhancing feature integration; a BF module that hierarchically connects the shallowest and deepest features to combine multi-resolution information; and an FAA layer in the decoder that extracts high-level semantic features with computational efficiency. Extensive experiments on three public datasets demonstrate that DSCANet significantly outperforms state-of-the-art methods across all evaluation metrics. The model exhibits remarkable generalization capabilities, particularly in segmenting irregular and boundary-ambiguous regions, while maintaining competitive computational efficiency.

References

  1. Sung H, Ferlay J, Siegel RL. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: Cancer J Clin. 2021;71(3):209–49.
  2. Subramanian AAV, Venugopal JP. A deep ensemble network model for classifying and predicting breast cancer. Comput Intell. 2022;39(2):258–82.
  3. He D, Li Y, Chen L, Xiao X, Xue Y, Wang Z, et al. Dual-guided network for endoscopic image segmentation with region and boundary cues. Biomed Signal Process Control. 2024;91:106059.
  4. Lou A, Guan S, Loew M. DC-UNet: Rethinking the U-Net architecture with dual channel efficient CNN for medical image segmentation. In: Medical Imaging 2021: Image Processing. SPIE; 2021. pp. 758–68.
  5. Srivastava A, Jha D, Chanda S, Pal U, Johansen H, Johansen D, et al. MSRF-Net: a multi-scale residual fusion network for biomedical image segmentation. IEEE J Biomed Health Inform. 2022;26(5):2252–63. pmid:34941539
  6. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inform Process Syst. 2017;30.
  7. Xia S, Krishnan SM, Tjoa MP. A novel methodology for extracting colon’s lumen from colonoscopic images. J Syst Cybern Inform. 2003;1(2):7–12.
  8. Wang Z, Li L, Anderson J, Harrington DP, Liang Z. Computer-aided detection and diagnosis of colon polyps with morphological and texture features. In: Medical Imaging 2004: Image Processing. SPIE; 2004. pp. 972–9. https://doi.org/10.1117/12.535664
  9. Chowdhury TA, Ghita O, Whelan PF. A statistical approach for robust polyp detection in CT colonography. In: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE; 2006. pp. 2523–6.
  10. Iakovidis DK, Koulaouzidis A. Automatic lesion detection in wireless capsule endoscopy — A simple solution for a complex problem. In: 2014 IEEE International Conference on Image Processing (ICIP). IEEE; 2014. pp. 2236–40. https://doi.org/10.1109/icip.2014.7025453
  11. Maghsoudi OH. Superpixel based segmentation and classification of polyps in wireless capsule endoscopy. In: 2017 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE; 2017. pp. 1–4.
  12. Jin Y, Hu Y, Jiang Z, Zheng Q. Polyp segmentation with convolutional MLP. Vis Comput. 2022;39(10):4819–37.
  13. Liu F, Hua Z, Li J, Fan L. MFBGR: Multi-scale feature boundary graph reasoning network for polyp segmentation. Eng Appl Artif Intell. 2023;123:106213.
  14. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer International Publishing; 2015. pp. 234–41.
  15. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N. UNet++: A nested U-Net architecture for medical image segmentation. In: International Workshop on Deep Learning in Medical Image Analysis. Cham: Springer International Publishing; 2018. pp. 3–11.
  16. Xu G, Zhang X, He X. LeViT-UNet: Make faster encoders with transformer for medical image segmentation. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Singapore: Springer Nature Singapore; 2023. pp. 42–53.
  17. Bougourzi F, Distante C, Dornaika F, Taleb-Ahmed A. PDAtt-Unet: pyramid dual-decoder attention Unet for Covid-19 infection segmentation from CT-scans. Med Image Anal. 2023;86:102797. pmid:36966605
  18. Li H, Zhai D-H, Xia Y. ERDUnet: an efficient residual double-coding Unet for medical image segmentation. IEEE Trans Circuits Syst Video Technol. 2024;34(4):2083–96.
  19. Yang Y, Dasmahapatra S, Mahmoodi S. ADS_UNet: a nested UNet for histopathology image segmentation. Exp Syst Appl. 2023;226:120128.
  20. Li J, Gao G, Yang L, Bian G, Liu Y. DPF-Net: a dual-path progressive fusion network for retinal vessel segmentation. IEEE Trans Instrum Meas. 2023;72:1–17.
  21. Park K-B, Lee JY. SwinE-Net: hybrid deep learning approach to novel polyp segmentation using convolutional neural network and Swin Transformer. J Comput Design Eng. 2022;9(2):616–32.
  22. Li W, Zhao Y, Li F, Wang L. MIA-Net: multi-information aggregation network combining transformers and convolutional feature learning for polyp segmentation. Knowl-Based Syst. 2022;247:108824.
  23. Zhang W, Fu C, Zheng Y, Zhang F, Zhao Y, Sham C-W. HSNet: A hybrid semantic network for polyp segmentation. Comput Biol Med. 2022;150:106173. pmid:36257278
  24. Duc NT, Oanh NT, Thuy NT, Triet TM, Dinh VS. ColonFormer: an efficient transformer based method for colon polyp segmentation. IEEE Access. 2022;10:80575–86.
  25. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin Transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. pp. 9992–10002. https://doi.org/10.1109/iccv48922.2021.00986
  26. Yang J, Bai L, Sun Y, Tian C, Mao M, Wang G. Pixel difference convolutional network for RGB-D semantic segmentation. IEEE Trans Circuits Syst Video Technol. 2024;34(3):1481–92.
  27. Liu Y, Cheng MM, Hu X, et al. Richer convolutional features for edge detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 3000–9.
  28. Jha D, Smedsrud PH, Riegler MA, et al. Kvasir-SEG: A segmented polyp dataset. In: International Conference on Multimedia Modeling. Cham: Springer International Publishing; 2019. pp. 451–62.
  29. Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, Rodríguez C, Vilariño F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput Med Imaging Graph. 2015;43:99–111. pmid:25863519
  30. Jha D, Smedsrud PH, Riegler MA. ResUNet: An advanced architecture for medical image segmentation. In: 2019 IEEE International Symposium on Multimedia (ISM). IEEE; 2019. pp. 225–2255.
  31. Ibtehaz N, Rahman MS. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020;121:74–87. pmid:31536901
  32. Alom MZ, Hasan M, Yakopcic C. Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation. arXiv preprint. 2018.
  33. Abdollahi A, Pradhan B, Alamri A. VNet: an end-to-end fully convolutional neural network for road extraction from high-resolution remote sensing data. IEEE Access. 2020;8:179424–36.
  34. Jha D, Riegler MA, Johansen D. DoubleU-Net: A deep convolutional neural network for medical image segmentation. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS). 2020. pp. 558–64.
  35. Chen J, Lu Y, Yu Q. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint. 2021.
  36. Manzari ON, Kaleybar JM, Saadat H. BEFUnet: A hybrid CNN-transformer architecture for precise medical image segmentation. arXiv preprint. 2024.