
DHAFormer: Dual-channel hybrid attention network with transformer for polyp segmentation

  • Xuejie Huang,

    Roles Conceptualization, Data curation, Methodology, Writing – original draft, Writing – review & editing

    Affiliation School of Computer Science and Technology, Xinjiang University, Urumqi, China

  • Liejun Wang,

    Roles Formal analysis, Validation, Writing – review & editing

    wljxju@xju.edu.cn

    Affiliation School of Computer Science and Technology, Xinjiang University, Urumqi, China

  • Shaochen Jiang,

    Roles Software, Validation, Visualization

    Affiliation School of Computer Science and Technology, Xinjiang University, Urumqi, China

  • Lianghui Xu

    Roles Formal analysis, Software, Visualization

    Affiliation School of Computer Science and Technology, Xinjiang University, Urumqi, China

Abstract

The accurate early diagnosis of colorectal cancer relies significantly on the precise segmentation of polyps in medical images. Current convolution-based and transformer-based segmentation methods show promise but still struggle with the varied sizes and shapes of polyps and the often low contrast between polyps and their background. This research addresses these challenges by proposing a Dual-Channel Hybrid Attention Network with Transformer (DHAFormer). Our proposed framework features a multi-scale channel fusion module, which excels at recognizing polyps across a spectrum of sizes and shapes. Additionally, the framework's dual-channel hybrid attention mechanism is designed to reduce background interference and improve the foreground representation of polyp features by integrating local and global information. DHAFormer demonstrates significant improvements in polyp segmentation compared to currently established methods.

Introduction

Colorectal cancer (CRC) is a prevalent and lethal form of cancer worldwide, responsible for over 694,000 deaths annually. It ranks high in terms of cancer incidence and mortality rates, posing a substantial risk to public health [1]. The prevailing medical consensus holds that CRC typically develops from adenomatous polyps through several stages. Early screening and removal of colon polyps can reduce the risk of CRC [2, 3]. Effective early screening and preventive strategies are critical in reducing the incidence and mortality rates associated with CRC. Nonetheless, the variability in polyp sizes and shapes, inconsistent image quality, and the presence of indistinct features in medical imaging complicate the accuracy of colonoscopy procedures. This introduces difficulties and risks in CRC screening and prevention.

The development of artificial intelligence has spurred research into learning algorithms for computer-aided diagnostic (CAD) systems, aiming to detect and delineate polyps autonomously. This advancement could improve physicians' ability to identify lesions and reduce missed detection rates [4–6]. However, polyp segmentation, crucial for enhancing the efficiency and quality of colonoscopy, encounters numerous technical challenges. The difficulty in differentiating polyps from the surrounding mucosa during colonoscopy is often attributed to their similar color, texture, and shape, particularly under variable illumination and in situations involving flat lesions or inadequate bowel preparation.

With the advent of deep learning, convolutional neural networks (CNNs) have established the foundation for contemporary polyp segmentation methods. The fully convolutional network (FCN) [7] was initially proposed for semantic segmentation, and its variants [8, 9] later made great strides in polyp segmentation tasks. Most segmentation models employ an encoder-decoder structure based on the UNet [10] architecture, typically built from convolutional layers. Despite the dominance of UNet in polyp segmentation, it and its subsequent variants [11–13] share a fundamental limitation of CNN models: a lack of modeling ability for global correlations. This limitation stems from the fact that convolutions extract only local information and cannot effectively capture global correlations.

In computer vision, researchers are exploring the Transformer [14] architecture, known for utilizing self-attention mechanisms to establish connections among distant elements in the input data. The Vision Transformer (ViT) [15] adapts this architecture to image recognition by dividing an image into patches and processing them as a sequence, which reduces computational costs and enables the processing of large images. ViT has proven effective in various image segmentation tasks. Recent studies, such as Polyp-PVT [16], SSFormer [17], FCBFormer [18], and DuAT [19], indicate that Transformer-based models achieve exceptional performance in polyp segmentation. However, despite their enhanced segmentation accuracy, these models often struggle with indistinct polyp boundaries. This is partly due to the small scale of existing polyp datasets, which do not represent the full range of polyp sizes and can lead to pixel imbalance because polyp pixels occupy a low proportion of the overall image. Another challenge is the shape of polyps: their irregular and jagged contours can make it difficult for networks to identify edge pixels accurately. Classical networks may therefore be limited in effectively segmenting polyps of various sizes.

Based on the aforementioned factors, we offer a novel method called DHAFormer, which features a dual-channel hybrid attention (DHA) module, enhancing the model’s capacity for foreground perception of polyps alongside global and local information processing. Crucially, the DHAFormer incorporates a multi-scale channel fusion module (MCFM) designed to aggregate features across multiple scales, bolstering the detection and delineation of polyps of varying sizes. The MCFM functions by emphasizing salient features while diminishing less relevant ones, thereby sharpening the polyp’s visibility and segmentation accuracy. It integrates a channel attention mechanism that assigns adaptive weights to each channel, providing a nuanced feature representation based on the polyp’s contextual surroundings. This multi-faceted approach allows for more precise and detailed analysis, substantially improving the performance of polyp segmentation.

Our significant contributions are as follows:

  1. We design the MCFM, which integrates multi-scale feature extraction with a channel attention mechanism to optimize local detail perception and enhance sensitivity to polyps of various sizes.
  2. We propose a DHA module that combines global and local features, thereby enhancing the model’s sensitivity to information and improving its ability to recognize foreground information effectively.
  3. We validated the DHAFormer through comprehensive experimentation on two widely used public datasets, and the results demonstrated the effectiveness of our proposed model.

Related work

Convolutional neural networks

CNNs are a cornerstone of deep learning in medical imaging, excelling in tasks like target detection [20], classification [21], and semantic segmentation [22]. UNet [10] stands out for its effective spatial hierarchy management and localization precision in image segmentation. UNet++ [11] evolved from UNet [10] by introducing nested skip pathways and deep supervision for improved feature propagation and segmentation accuracy. UNet3+ [12] employs full-scale skip connections for detailed information extraction and deep supervision for improved feature representation in polyp segmentation. UACANet [3] and DCRNet [23] explored the region of uncertainty and the relationships within and across image contexts, respectively. Jain et al. [24] conducted a comparative study of deep learning-based segmentation models, demonstrating the effectiveness of UNet and SegNet architectures using MobileNetV1 for polyp localization in wireless capsule endoscopy (WCE) images. WCENet [25] features a two-phase process that classifies WCE images into four categories and uses an attention-based CNN with a SegNet-based localization framework. Jain et al. [26] propose a CNN with meta-feature learning for WCE image classification, which handles intra-class variability and proficiently categorizes gastrointestinal images as either normal or abnormal. Despite their proficiency in local feature extraction, convolutional operations are limited in their ability to capture global image information. In contrast, global information is crucial for separating foreground from background in polyp segmentation. Therefore, relying on convolutional operations alone may lead to poor-quality segmentation.

Attention mechanism

To improve the feature representation capabilities of CNNs, some researchers have recently introduced attention mechanisms, which enable networks to prioritize salient aspects of the input data. For example, AG-Sononet [13] created an attention gate module that permits the network to concentrate on important information while preserving computational efficiency. To enhance UNet++ [11] for polyp segmentation, AG-ResUNet++ [27] combines attention gates with the ResNet [28] backbone. The reverse attention module used by PraNet [29] forces the network to focus on the boundary separating a polyp from its surroundings. CoInNet [30] proposes a novel attention mechanism with convolution, involution, and statistical feature attention units for polyp segmentation. Huang et al. [31] proposed a polyp segmentation network using hybrid channel-spatial attention and pyramid global context guided feature fusion, achieving significant improvements in segmentation accuracy across multiple datasets. Overall, attention modules can bring performance gains to most CNNs. Nonetheless, even with attention enhancements, CNNs face difficulties in capturing the extensive spatial relationships between distant parts of the input.

Vision transformer

The Transformer [14] has revolutionized natural language processing with its ability to capture long-range dependencies in input sequences through self-attention mechanisms. Its application has expanded to medical imaging tasks such as polyp segmentation, demonstrating its versatility. For example, TransFuse [32] employs a dual-branch structure combining Transformer and CNN to leverage both global and local feature extraction. Polyp-PVT [16] integrates a pyramid vision transformer to enhance feature robustness. Segtran [33] proposes a squeezed attention block to regularize self-attention and expansion blocks to learn diversified representations. SSFormer [17] aggregates local and global features stepwise, improving the model's processing ability. USegTransformer-P and USegTransformer-S [34] integrate transformer-based and convolution-based encoders to enhance precision in medical image segmentation tasks, combining local and global features effectively. A recent survey of transformer-based medical image analysis (MIA) [35] explores the adoption of transformers in MIA, highlighting their utility in improving classification, segmentation, and other MIA tasks through their ability to handle complex data and enhance feature extraction. WDFF-Net [36] proposes scale-aware feature fusion to address large variations in polyp size and shape. Wang et al. [37] propose a new architecture for polyp segmentation that uses CNNs and transformers as encoders to capture local information and long-range dependencies. These models show improved handling of polyp boundaries and feature robustness but still face challenges with irregular polyp shapes.

DHAFormer differentiates itself from other methods by integrating an MCFM and a DHA mechanism, which together enhance segmentation accuracy and robustness by effectively capturing both local and global features.

Methodology

Overall DHAFormer

Fig 1a illustrates the network's general design, which uses two parallel branches: the FCN branch (FCB) and the transformer branch (TB). The FCB mainly outputs full-size feature maps for extracting local information. We use the BiFormer [38] architecture as the encoder in the TB. The TB outputs reduced-size semantic feature maps, focusing on relevant regions through an MCFM for extracting global information, and then up-samples them to full-size features. The improved prediction head (PH+) then processes the combined features of the two branches. To better focus on the foreground polyp region and capture global dependencies at various scales, we designed a DHA module for the PH+ module, enabling the model to identify and segment the polyp region more accurately. The FCB follows the design of FCBFormer [18]. A minimal sketch of this two-branch data flow is given below.
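The following minimal PyTorch sketch illustrates the two-branch composition described above. The submodules tb, fcb, and ph_plus are placeholders whose internals are detailed in the subsections that follow; the channel-wise concatenation of the two branches is an assumption inherited from the FCBFormer design, not an exact reproduction of our released implementation.

```python
import torch
import torch.nn as nn

class DHAFormerSketch(nn.Module):
    """Two parallel branches whose outputs are fused by an improved prediction head."""
    def __init__(self, tb: nn.Module, fcb: nn.Module, ph_plus: nn.Module):
        super().__init__()
        self.tb = tb            # BiFormer encoder + PLD+ decoder (global information)
        self.fcb = fcb          # fully convolutional branch (full-size local features)
        self.ph_plus = ph_plus  # improved prediction head containing the DHA module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.tb(x)                    # reduced-size semantic maps, up-sampled to full size
        l = self.fcb(x)                   # full-size local feature maps
        fused = torch.cat([g, l], dim=1)  # combine the two branches along channels
        return self.ph_plus(fused)        # final segmentation prediction
```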

Fig 1. Overall network architecture.

(a) DHAFormer. (b) Transformer branch. (c) Fully convolutional branch. (d) Improved prediction head module. (e) Local emphasis module. (f) Residual block.

https://doi.org/10.1371/journal.pone.0306596.g001

Fully convolutional branch (FCB)

The overall structure of FCB is shown in Fig 1c. We adopted the same parameters for the FCB of our network as FCBFormer [18], which permits the fusion of multi-scale features and, when combined with features extracted from the transformer branch, enables more precise prediction of full-size segmentation maps.

Transformer branch (TB)

Transformer encoder.

In this study, unlike the approach in FCBFormer [18], an ImageNet pre-trained BiFormer [38] serves as the image encoder within the TB, substituting the previously used pyramid vision transformer v2 (PVTv2) [39]. The selected BiFormer model is the 'base' variant, with 56.8 million parameters, and leverages bi-level routing attention. This mechanism facilitates dynamic, per-query sparse attention, allowing an enhanced focus on pertinent key regions.

The overall architecture of the transformer encoder is shown in Fig 1b. We obtain four distinct feature pyramid levels (E1, E2, E3, E4), ranging from coarse to fine, via the BiFormer encoder. E1, E2, and E3 are categorized as low-level features, which amalgamate detailed feature data with some degree of noise and irrelevant detail. Enhancing and examining these features allows them to offer fine-grained details that strengthen the high-level features. E4 is a high-level feature that serves as the input to the advanced decoder and allows exact localization of the target area.

Transformer decoder.

Similar to the FCBFormer [18] settings, the transformer encoder returns features at four levels, which we use as inputs to an improved progressive locality decoder (PLD+) to obtain features at multiple scales. The PLD+ consists of four local emphasis (LE) modules, an MCFM, and a stepwise feature aggregation (SFA) module, with each LE module handling features at one level of the feature pyramid. The role of the LE modules is to enhance the local features in the feature representation, as transformer-based models are relatively weak in this respect. After the fourth-level features are processed by their LE module, the MCFM is applied to enhance the network's processing of these high-level features. We then fuse the outputs of the three LE modules and the MCFM into a multi-scale feature map for predicting polyp regions in the image via the SFA module, as sketched below. Compared to the traditional transformer structure, this alternative can more effectively utilize local features in the image to enhance segmentation.
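A hedged sketch of this PLD+ data flow follows; the internals of the LE, MCFM, and SFA modules are injected placeholders (they are described in this and the surrounding subsections), so the sketch shows only the routing of the four pyramid levels.

```python
import torch.nn as nn

class PLDPlusSketch(nn.Module):
    """Four LE modules, an MCFM on the highest level, and stepwise aggregation."""
    def __init__(self, le_modules, mcfm: nn.Module, sfa: nn.Module):
        super().__init__()
        self.les = nn.ModuleList(le_modules)  # one LE module per pyramid level E1..E4
        self.mcfm = mcfm
        self.sfa = sfa

    def forward(self, e1, e2, e3, e4):
        f1, f2, f3 = self.les[0](e1), self.les[1](e2), self.les[2](e3)
        f4 = self.mcfm(self.les[3](e4))    # MCFM enhances the high-level features
        return self.sfa([f1, f2, f3, f4])  # stepwise fusion into a multi-scale map
```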

Multi-scale channel fusion module (MCFM).

Generally, polyps vary widely in size and shape, so a segmentation method that can adapt to different scales is needed. We propose a multi-scale channel fusion module, which combines multi-scale features with channel attention to deal with multi-scale problems effectively. Specifically, we first extract multi-scale features using different convolution branches, then compute channel attention to adaptively weight the features of different channels, and finally combine the multi-scale features with the channel attention. In this way, the model can better adapt to objects and features at different scales.

Fig 2 shows the detailed design of MCFM. The input of MCFM is the feature map $F_{in}$ produced by the LE module. Firstly, multi-scale feature extraction is carried out to capture information at different scales. Each branch performs a convolution operation: branches 1 and 2 use a 3×3 convolution kernel, and branch 3 uses a 5×5 convolution kernel. These operations extract feature details to accommodate polyps of different sizes and shapes. The features obtained through these three branches are then added together. This operation can be expressed as:

$$F_1 = \mathrm{BN}(\mathrm{Conv}_{3\times3}(F_{in})) \tag{1}$$

$$F_2 = \mathrm{BN}(\mathrm{Conv}_{3\times3}(F_{in})) \tag{2}$$

$$F' = F_1 \oplus F_2 \oplus \mathrm{BN}(\mathrm{Conv}_{5\times5}(F_{in})) \tag{3}$$

where BN indicates the BatchNorm [40] operation and ⊕ denotes element-wise addition.

Next, the channel attention mechanism calculates a channel attention weight through two convolutional layers. This weight adaptively re-weights the features of different channels to determine which channels are most critical for the segmentation task. The channel attention is calculated by two convolution operations followed by a Sigmoid [41] activation function:

$$W = \sigma(\mathrm{Conv}_{1\times1}(\mathrm{Conv}_{1\times1}(\mathrm{Avg}(F')))) \tag{4}$$

where σ stands for the Sigmoid [41] function, which converts the convolutional output to a weight between 0 and 1, $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution operation used to change the dimension of the feature channels, and Avg stands for adaptive average pooling, which reduces the spatial dimensions to facilitate global computation of channel attention.

In the multi-scale feature fusion stage, features from the three branches are combined to integrate information at different scales. This helps the module adapt to multi-scale objects regardless of how their size changes. The weighted feature $F_{weight}$ is obtained by multiplying the fused feature $F'$ and the channel weight $W$ element by element:

$$F_{weight} = F' \otimes W \tag{5}$$

where ⊗ denotes element-wise multiplication.

Finally, we combine the weighted features with the original input features through a learnable parameter α and generate $F_{out}$ via the ReLU activation function:

$$F_{out} = \mathrm{ReLU}(F_{in} \oplus \alpha F_{weight}) \tag{6}$$

where α is a learnable parameter that allows the model to balance the original input features against the multi-scale fusion features.

This step makes the module more adaptable in segmentation tasks, especially for polyps with multiple scales and significant contrast variations. Combining multi-scale features with channel attention effectively improves intestinal polyp segmentation performance.
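For reference, a minimal PyTorch sketch of MCFM following Eqs (1)–(6) is given below. The channel-reduction ratio inside the attention path and the zero initialization of α are illustrative assumptions rather than prescribed settings.

```python
import torch
import torch.nn as nn

class MCFM(nn.Module):
    """Multi-scale channel fusion: Eqs (1)-(3) branches, (4) attention, (5)-(6) fusion."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.branch2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.branch3 = nn.Sequential(nn.Conv2d(channels, channels, 5, padding=2), nn.BatchNorm2d(channels))
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                # Avg: global spatial pooling
            nn.Conv2d(channels, channels // 4, 1),  # first 1x1 conv (reduction ratio assumed)
            nn.Conv2d(channels // 4, channels, 1),  # second 1x1 conv
            nn.Sigmoid(),                           # sigma: weights in (0, 1)
        )
        self.alpha = nn.Parameter(torch.zeros(1))   # learnable balance parameter
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        f_prime = self.branch1(f_in) + self.branch2(f_in) + self.branch3(f_in)  # Eqs (1)-(3)
        w = self.attn(f_prime)                          # Eq (4): channel attention weight
        f_weight = f_prime * w                          # Eq (5): element-wise re-weighting
        return self.relu(f_in + self.alpha * f_weight)  # Eq (6)
```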

Improved prediction head (PH+).

The overall structure of PH+ takes the TB and FCB outputs as its input, as illustrated in Fig 1d. Unlike the PH module in FCBFormer [18], we incorporate our DHA module into the PH module, forming PH+. The role of PH+ is to further process the two-branch features for more accurate segmentation prediction, and the addition of DHA enhances this purpose. By fusing global and local features, the DHA module intensifies the focus on the foreground region, which is crucial for successful segmentation prediction. In conclusion, when integrated into the PH module, the DHA module strengthens the network's focus on the foreground, thus providing more semantically expressive features for the final segmentation prediction.

Dual-channel hybrid attention (DHA).

Recent research shows that attention mechanisms are crucial for enhancing the effectiveness of deep learning models. We propose a new DHA mechanism to better model both the overall relationships and the specific characteristics of the lesion location. This mechanism consists of a global context branch and a local lesion branch, and it is applied within the PH+ module of our model.

The DHA module architecture, inspired by transformer components, is depicted in Fig 3. Unlike traditional self-attention, our DHA layer bifurcates into a global context branch (GCB) and a local lesion branch (LLB). The global context branch applies adaptive average pooling with output sizes of 1 × 1, 3 × 3, and 5 × 5 to capture multi-scale spatial features from the decoder feature map, creating a global feature representation by reshaping the pooled outputs and concatenating them into S tokens (S ≪ N, where N = H × W). This process effectively expands the receptive field and enriches the feature map with broad contextual information obtained by a pyramid pooling procedure [42].

For the local lesion branch, an initial segmentation mask is applied to the decoder feature map D using element-wise multiplication, followed by a custom sum pooling operation to distill focused lesion features into keys and values $K_2$ and $V_2$. The features from the two branches are then integrated by summing:

$$K = K_1 \oplus K_2, \quad V = V_1 \oplus V_2 \tag{7}$$

The output of the DHA layer is formulated as:

$$\mathrm{DHA}(D) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\,\phi_o \tag{8}$$

where each $\mathrm{head}_j$ represents the output of an individual attention head, computed as:

$$\mathrm{head}_j = \mathrm{Attention}(D\phi_q^j, K\phi_k^j, V\phi_v^j) \tag{9}$$

where $\phi_o$, $\phi_q^j$, $\phi_k^j$, and $\phi_v^j$ refer to linear projections and n denotes the number of heads. The attention formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{10}$$

where $d_k$ is the dimension of each head.
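A hedged PyTorch sketch of the DHA layer following Eqs (7)–(10) is shown below. The pyramid pooling sizes follow the text; the source of the initial mask, the bin-area approximation of sum pooling, and the head count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DHASketch(nn.Module):
    """Queries from the full decoder map D; keys/values from GCB + LLB tokens."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # channels must be divisible by heads for multi-head attention
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.pool_sizes = [1, 3, 5]  # pyramid pooling output sizes (Fig 3)

    def forward(self, d: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        b, c, h, w = d.shape
        # GCB: multi-scale average pooling, flattened into S tokens (S << H*W).
        k1 = torch.cat([F.adaptive_avg_pool2d(d, s).flatten(2)
                        for s in self.pool_sizes], dim=2)
        # LLB: mask-weighted pooling; scaling the average by the bin area
        # approximates the custom sum pooling described in the text.
        k2 = torch.cat([F.adaptive_avg_pool2d(d * mask, s).flatten(2) * (h * w / (s * s))
                        for s in self.pool_sizes], dim=2)
        kv = (k1 + k2).transpose(1, 2)    # Eq (7): branch features summed, shape (B, S, C)
        q = d.flatten(2).transpose(1, 2)  # one query per spatial position, shape (B, N, C)
        out, _ = self.attn(q, kv, kv)     # Eqs (8)-(10): multi-head scaled dot-product attention
        return out.transpose(1, 2).reshape(b, c, h, w)
```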

Datasets and metrics

The datasets

Kvasir-SEG [44] and CVC-ClinicDB [43] are two open-access datasets commonly used for gastrointestinal polyp image segmentation. The Kvasir-SEG dataset contains a sizable number of colonoscopy images labeled by medical professionals: 1000 colonoscopy images, along with reference segmentation masks. The labeling and verification of these images were performed by experienced gastroenterologists, making the dataset a valuable resource for constructing and evaluating gastrointestinal polyp segmentation algorithms. The CVC-ClinicDB dataset, which consists of 612 images from 29 colonoscopy sequences, is mainly used for polyp detection in colonoscopy recordings. Both datasets are open access and include polyps of various shapes, making them valuable data sources for research in medical image segmentation.

Evaluation metrics

The tests assess network performance using Dice, IoU, Precision, and Recall, defined as follows:

$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN} \tag{11}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{12}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{14}$$

where TP indicates that the classifier predicts a positive result and the sample is actually positive, FP indicates that the classifier predicts a positive result but the sample is negative, TN denotes that the classifier predicts a negative result and the sample is negative, and FN denotes a negative prediction for a sample whose actual value is positive.
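As a reference, a straightforward PyTorch implementation of these pixel-wise metrics for binary masks follows; the small eps term for numerical stability is our addition.

```python
import torch

def segmentation_metrics(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Dice, IoU, Precision, and Recall (Eqs 11-14) from pixel-wise confusion counts."""
    pred, target = pred.bool(), target.bool()
    tp = (pred & target).sum().float()    # predicted positive, actually positive
    fp = (pred & ~target).sum().float()   # predicted positive, actually negative
    fn = (~pred & target).sum().float()   # predicted negative, actually positive
    return {
        "dice": (2 * tp / (2 * tp + fp + fn + eps)).item(),
        "iou": (tp / (tp + fp + fn + eps)).item(),
        "precision": (tp / (tp + fp + eps)).item(),
        "recall": (tp / (tp + fn + eps)).item(),
    }
```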

Results and analysis

Implementation details

In this study, we implemented DHAFormer using the PyTorch framework. The network is trained with a combination of the BCE loss and the Dice loss. Training was conducted with the AdamW optimizer [45], starting from a learning rate of 1e-4; if validation performance does not improve for 10 epochs, the learning rate is halved. The input resolution was set to 448 × 448, and the batch size was set to 2. We trained DHAFormer for a total of 200 epochs. Following the recommendations of [17, 29, 46, 47], an 80%/10%/10% random train/validation/test split was used. The data augmentations employed in this study closely resemble those used by the authors of the original FCBFormer [18].
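A sketch of this training objective and schedule is given below; the 1:1 weighting between the two loss terms and the exact plateau criterion are assumptions.

```python
import torch
from torch import nn

def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sum of binary cross-entropy (on logits) and soft Dice loss."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return bce + dice

model = nn.Conv2d(3, 1, 1)  # placeholder; substitute a DHAFormer instance
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=10)  # halve LR after 10 stagnant epochs
# per epoch: train on the 80% split, then call scheduler.step(validation_dice)
```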

Comparative experiment

To further illustrate how well the proposed DHAFormer works for segmenting lesions, we also trained and assessed several well-known and state-of-the-art models using the same datasets and evaluation measures. These included advanced CNN-based networks such as UNet [10], ResUNet [48], ResUNet++ [46], PraNet [29], and HarDNet-MSEG [49], and transformer-based architectures such as Polyp-PVT [16], SSFormer [17], FCBFormer [18], and DuAT [19]. To guarantee the impartiality of the experimental comparisons, the same parameter settings and computing environments were used throughout, and the findings for the two datasets are displayed in Table 1. Although some of the earlier models do not perform as well as the latest models on some metrics, they still have application value.

Table 1 presents a quantitative comparison of the various methods on the Kvasir-SEG and CVC-ClinicDB datasets, with the best results highlighted in bold. The results show that our model achieves the best Dice, IoU, Precision, and Recall on both datasets. The traditional CNN-based methods still perform well, but our method clearly outperforms them. SSFormer [17] uses the Transformer architecture for global context modeling of image features and enhances the feature representation using spatial concentration and channel attention; FCBFormer [18], which uses an FCN and a transformer to jointly perform feature extraction and segmentation of the input image, achieves significant competitive advantages. DHAFormer comprehensively considers the extraction of both local and global features to achieve better polyp segmentation results. Compared with other methods, DHAFormer has a larger parameter count and more FLOPs, but it offers significant advantages in segmentation accuracy and recall.

Fig 4a presents qualitative comparisons of different approaches on the Kvasir-SEG dataset. This comparison reveals that standard convolutional methods perform poorly in global modeling, making it challenging to identify complicated boundaries in difficult scenarios. Transformer-based approaches alleviate these problems; however, they have weaker local modeling capability, as seen in the predicted segmentation maps of SSFormer [17], whose segmentation contours are coarse. As shown in Fig 4, DHAFormer identifies polyp edges more accurately, and its segmentation contours are smoother and more consistent with the growth characteristics of polyps. The effectiveness of DHAFormer is thus verified by qualitative analysis.

Fig 4. Qualitative comparison results.

(a) Qualitative comparison results on Kvasir. (b) Qualitative comparison results on CVC-ClinicDB.

https://doi.org/10.1371/journal.pone.0306596.g004

Fig 4b displays illustrative qualitative results produced by various techniques on some complex cases from the CVC-ClinicDB dataset. As indicated by the first row of qualitative results, DHAFormer effectively measures the relationship between background and foreground information and improves segmentation. As demonstrated in the second and third rows of Fig 4b, the segmentation of lesion-region boundaries by our proposed DHAFormer is significantly better than that of the commonly used CNN segmentation networks and the more advanced transformer-based methods. Our method effectively enhances foreground information while suppressing background information, as shown in Fig 4b, verifying the feasibility of DHAFormer for segmentation through comparison with other methods.

Ablation studies

Impact of key components on DHAFormer performance.

We conducted ablation tests on the Kvasir-SEG and CVC-ClinicDB datasets and contrasted our model with the baseline (FCBFormer) to more clearly illustrate the impact of each of our components. The experimental findings, shown in Table 2, demonstrate the importance of the MCFM and DHA modules in this model. Removing either of these modules in the ablation trials resulted in a decrease in network performance. We therefore conclude that adding the MCFM and DHA modules is essential for enhancing the performance of the polyp segmentation model.

Our DHAFormer method surpasses the baseline FCBFormer in all four indices measured on the Kvasir-SEG dataset. Specifically, our method improves Dice, IoU, Precision, and Recall by 2.02%, 2.15%, 2.01%, and 1.12%, respectively. These results clearly establish the superiority of our approach. The improvement in the Dice and IoU indices suggests that the enhanced model effectively captures foreground information, and the improved Precision and Recall indicate better control of false alarms and omissions. As seen in Fig 5, Baseline+MCFM segments polyp contours more accurately than the Baseline: the boundary of the foreground region is more accurate and smooth with MCFM, and the segmentation of large areas is closer to the label map. In the first row of Fig 5, it can be observed that Baseline+DHA enhances the identification of the foreground polyp region compared to the Baseline. Combining the MCFM and DHA modules enables better capture of foreground information and suppression of background information, producing segmentation results that closely align with the labeled image.

On the CVC-ClinicDB dataset, the segmentation results of Baseline+MCFM+DHA are superior to those of the Baseline, Baseline+MCFM, and Baseline+DHA: Dice, IoU, Precision, and Recall improve by 1.60%, 2.21%, 0.64%, and 1.78% over the baseline FCBFormer, respectively. To correctly identify polyp boundary information, the network must focus more on extracting local details. As shown in the fourth row of Fig 6, compared with the Baseline, Baseline+MCFM identifies polyp areas more accurately but introduces some redundant information; Baseline+MCFM+DHA enhances the foreground information and eliminates some of this redundancy. The combination of the MCFM and DHA modules grasps polyp boundary information more accurately and performs better local modeling.

Optimal configuration of MCFM.

To further examine the importance of the number and size of convolution operations in MCFM, we conducted additional ablation experiments: we used three 3 × 3 convolution branches; two 3 × 3 and one 5 × 5 convolution branches; and one 3 × 3, one 5 × 5, and one 7 × 7 convolution branch in MCFM, respectively, to illustrate the effects of varying the number and size of convolution kernels within MCFM.

In the ablation study in Table 3, the DHAFormer model demonstrated clear performance differences across the evaluated MCFM configurations. The results show that the best Dice scores on the Kvasir-SEG and CVC-ClinicDB datasets were obtained using the 3 × 3, 3 × 3, and 5 × 5 convolution combination: 92.94% and 94.65%, respectively. This configuration not only achieves the best Dice score and precision but also maintains a reasonable balance in the number of parameters (64.71M) and FLOPs (258.52G). In contrast, although the 3 × 3, 5 × 5, and 7 × 7 configuration has the highest FLOPs (262.53G), its performance gains are not significant. Therefore, the 3 × 3, 3 × 3, and 5 × 5 convolution combination is considered the most efficient configuration, improving segmentation performance while maintaining high computational efficiency.

Table 3. Ablation experiments of different configurations of MCFM.

https://doi.org/10.1371/journal.pone.0306596.t003

MFCM placement studies.

In the ablation experiment in Table 4, we evaluated the impact of placing the multi-scale channel fusion module (MCFM) at different locations on model performance, demonstrating its important role in multi-level feature fusion. Although adding an MCFM at the end of every LE module yields the highest Dice score of 93.01%, this configuration requires more FLOPs (271.52G), increasing the computing cost. We instead chose to add the MCFM after the fourth LE module, achieving a Dice score of 92.94% while maintaining a reasonable balance in the number of parameters (64.71M) and FLOPs (258.52G). This placement preserves high computational efficiency while guaranteeing high performance.

Table 4. Ablation experiments of MCFM at different locations on the Kvasir-SEG dataset.

https://doi.org/10.1371/journal.pone.0306596.t004

DHA internal ablation experiment.

We configured both a GCB and an LLB in the DHA module. To isolate the effect of each component of DHA, we conducted ablation experiments inside DHA to make their individual contributions to model performance clearer.

In the internal ablation experiment in Table 5, the DHAFormer model demonstrated significant performance improvement when evaluating the independent contributions of the GCB and the LLB. The results show that the configuration combining GCB and LLB achieved the highest Dice scores on the Kvasir-SEG and CVC-ClinicDB datasets: 93.23% and 94.56%, respectively. This configuration not only performs well in Dice score and precision but also maintains a reasonable balance in the number of parameters (64.72M) and FLOPs (332.56G). In contrast, configurations using only the GCB or only the LLB also improve over the baseline, but less significantly than the combination of the two. Therefore, the configuration combining GCB and LLB is considered the most effective, improving segmentation performance while maintaining high computational efficiency.

Generalizability tests.

We conducted a generalization test of DHAFormer, following the conventions outlined by [17, 18, 47]. Specifically, in this test, we assessed the performance of the model trained in Kvasir-SEG on CVC-ClinicDB and vice versa. The results of these generalizability tests, which can be found in Table 6, indicate that DHAFormer excelled in processing images with slightly different distributions compared to the training dataset. Notably, it outperformed existing models in most metrics.
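A minimal sketch of this cross-dataset protocol follows, reusing the segmentation_metrics helper defined earlier; the loader variable and the 0.5 binarization threshold are illustrative assumptions.

```python
import torch

@torch.no_grad()
def cross_dataset_dice(model, loader, threshold: float = 0.5) -> float:
    """Mean Dice of a trained model over an unseen dataset's loader."""
    model.eval()
    scores = []
    for image, mask in loader:  # e.g. all of CVC-ClinicDB for a Kvasir-SEG-trained model
        pred = (torch.sigmoid(model(image)) > threshold).float()
        scores.append(segmentation_metrics(pred, mask)["dice"])
    return sum(scores) / len(scores)
```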

Conclusion

In this study, we introduce a dual-channel hybrid attention network with transformer (DHAFormer). This novel polyp segmentation architecture adopts a multi-scale channel fusion module (MCFM) and dual-channel hybrid attention (DHA) for dense prediction. Our goal is to improve the model's ability to identify and segment polyps accurately. On the one hand, the MCFM combines multi-scale features and channel attention to increase the network's sensitivity to polyp size. On the other hand, the DHA module models both global and local features to enhance the network's attention to the foreground polyp area, enabling the model to efficiently identify and segment hidden polyp areas that are easily overlooked. The combination of the MCFM and DHA modules demonstrated superior performance compared to the baseline model, as evidenced by improvements in the Dice, IoU, Precision, and Recall metrics. This underscores the effectiveness of the proposed DHAFormer network for lesion segmentation. In future work, we aim to optimize the network for efficiency while deepening its modeling of local information.

While our proposed DHAFormer model demonstrates superior performance in polyp segmentation tasks, it has a higher number of parameters and FLOPs compared to some state-of-the-art methods. This increased complexity could impact its applicability in real-time or resource-constrained environments. We acknowledge this limitation and will address it in our future research by optimizing the model to reduce its computational demands while maintaining high segmentation accuracy, making it more suitable for practical applications.

References

  1. Bernal J, Tajkbaksh N, Sánchez FJ, Matuszewski BJ, Chen H, Yu L, et al. Comparative Validation of Polyp Detection Methods in Video Colonoscopy: Results From the MICCAI 2015 Endoscopic Vision Challenge. IEEE Transactions on Medical Imaging. 2017;36(6):1231–1249. pmid:28182555
  2. Kim NH, Jung YS, Jeong WS, Yang HJ, Park SK, Choi K, et al. Miss rate of colorectal neoplastic polyps and risk factors for missed polyps in consecutive colonoscopies. Intestinal Research. 2017;15(3):411–418. pmid:28670239
  3. Kim T, Lee H, Kim D. UACANet: Uncertainty augmented context attention for polyp segmentation. In: Proceedings of the 29th ACM International Conference on Multimedia; 2021. p. 2167–2175.
  4. Mesejo P, Pizarro D, Abergel A, Rouquette O, Beorchia S, Poincloux L, et al. Computer-aided classification of gastrointestinal lesions in regular colonoscopy. IEEE Transactions on Medical Imaging. 2016;35(9):2051–2063. pmid:28005009
  5. Zhou G, Liu X, Berzin TM, Brown JRG, Li L, Zhou C, et al. A real-time automatic deep learning polyp detection system increases polyp and adenoma detection during colonoscopy: A prospective double-blind randomized study. Gastroenterology. 2019;156:S1511.
  6. Kudo Se, Mori Y, Misawa M, Takeda K, Kudo T, Itoh H, et al. Artificial intelligence and colonoscopy: Current status and future perspectives. Digestive Endoscopy. 2019;31(4):363–371. pmid:30624835
  7. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 3431–3440.
  8. Brandao P, Mazomenos E, Ciuti G, Caliò R, Bianchi F, Menciassi A, et al. Fully convolutional neural networks for polyp segmentation in colonoscopy. In: Medical Imaging 2017: Computer-Aided Diagnosis. vol. 10134. SPIE; 2017. p. 101–107.
  9. Akbari M, Mohrekesh M, Nasr-Esfahani E, Soroushmehr SR, Karimi N, Samavi S, et al. Polyp segmentation in colonoscopy images using fully convolutional network. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2018. p. 69–72.
  10. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer; 2015. p. 234–241.
  11. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: A nested U-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer; 2018. p. 3–11.
  12. Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, et al. UNet 3+: A full-scale connected UNet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020. p. 1055–1059.
  13. Schlemper J, Oktay O, Schaap M, Heinrich M, Kainz B, Glocker B, et al. Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis. 2019;53:197–207. pmid:30802813
  14. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
  15. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net; 2021. Available from: https://openreview.net/forum?id=YicbFdNTTy.
  16. Dong B, Wang W, Fan DP, Li J, Fu H, Shao L. Polyp-PVT: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932. 2021.
  17. Wang J, Huang Q, Tang F, Meng J, Su J, Song S. Stepwise feature fusion: Local guides global. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2022. p. 110–120.
  18. Sanderson E, Matuszewski BJ. FCN-transformer feature fusion for polyp segmentation. In: Annual Conference on Medical Image Understanding and Analysis. Springer; 2022. p. 892–907.
  19. Tang F, Xu Z, Huang Q, Wang J, Hou X, Su J, et al. DuAT: Dual-Aggregation Transformer Network for Medical Image Segmentation. In: Liu Q, Wang H, Ma Z, Zheng W, Zha H, Chen X, et al., editors. Pattern Recognition and Computer Vision—6th Chinese Conference, PRCV 2023, Xiamen, China, October 13-15, 2023, Proceedings, Part V. vol. 14429 of Lecture Notes in Computer Science. Springer; 2023. p. 343–356. Available from: https://doi.org/10.1007/978-981-99-8469-5_27.
  20. Shen Z, Liu Z, Li J, Jiang YG, Chen Y, Xue X. DSOD: Learning deeply supervised object detectors from scratch. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 1919–1927.
  21. Anthimopoulos M, Christodoulidis S, Ebner L, Christe A, Mougiakakou S. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging. 2016;35(5):1207–1216. pmid:26955021
  22. Chen H, Qi X, Yu L, Heng PA. DCAN: deep contour-aware networks for accurate gland segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2487–2496.
  23. Yin Z, Liang K, Ma Z, Guo J. Duplex Contextual Relation Network For Polyp Segmentation. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI); 2022. p. 1–5.
  24. Jain S, Seal A, Ojha A. Localization of Polyps in WCE Images Using Deep Learning Segmentation Methods: A Comparative Study. In: Raman B, Murala S, Chowdhury AS, Dhall A, Goyal P, editors. Computer Vision and Image Processing—6th International Conference, CVIP 2021, Rupnagar, India, December 3-5, 2021, Revised Selected Papers, Part I. vol. 1567 of Communications in Computer and Information Science. Springer; 2021. p. 538–549. Available from: https://doi.org/10.1007/978-3-031-11346-8_46.
  25. Jain S, Seal A, Ojha A, Yazidi A, Bures J, Tacheci I, et al. A deep CNN model for anomaly detection and localization in wireless capsule endoscopy images. Computers in Biology and Medicine. 2021;137:104789. pmid:34455302
  26. Jain S, Seal A, Ojha A. A Convolutional Neural Network with Meta-feature Learning for Wireless Capsule Endoscopy Image Classification. Journal of Medical and Biological Engineering. 2023;43(4):475–494.
  27. Hung NB, Duc NT, Van Chien T, Sang DV. AG-ResUNet++: an improved encoder-decoder based method for polyp segmentation in colonoscopy images. In: 2021 RIVF International Conference on Computing and Communication Technologies (RIVF). IEEE; 2021. p. 1–6.
  28. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 4700–4708.
  29. Fan DP, Ji GP, Zhou T, Chen G, Fu H, Shen J, et al. PraNet: Parallel reverse attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2020. p. 263–273.
  30. Jain S, Atale R, Gupta A, Mishra U, Seal A, Ojha A, et al. CoInNet: A convolution-involution network with a novel statistical attention for automatic polyp segmentation. IEEE Transactions on Medical Imaging. 2023. pmid:37768798
  31. Huang X, Zhuo L, Zhang H, Yang Y, Li X, Zhang J, et al. Polyp segmentation network with hybrid channel-spatial attention and pyramid global context guided feature fusion. Computerized Medical Imaging and Graphics. 2022;98:102072. pmid:35594809
  32. Zhang Y, Liu H, Hu Q. TransFuse: Fusing transformers and CNNs for medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer; 2021. p. 14–24.
  33. Li S, Sui X, Luo X, Xu X, Liu Y, Goh R. Medical Image Segmentation using Squeeze-and-Expansion Transformers. In: Zhou ZH, editor. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. International Joint Conferences on Artificial Intelligence Organization; 2021. p. 807–815. Available from: https://doi.org/10.24963/ijcai.2021/112.
  34. Dhamija T, Gupta A, Gupta S, Anjum, Katarya R, Singh G. Semantic segmentation in medical images through transfused convolution and transformer networks. Applied Intelligence. 2023;53(1):1132–1148. pmid:35498554
  35. Liu Z, Lv Q, Yang Z, Li Y, Lee CH, Shen L. Recent progress in transformer-based medical image analysis. Computers in Biology and Medicine. 2023;164:107268. pmid:37494821
  36. Cao J, Wang X, Qu Z, Zhuo L, Li X, Zhang H, et al. WDFF-Net: Weighted Dual-branch Feature Fusion Network for Polyp Segmentation with Object-aware Attention Mechanism. IEEE Journal of Biomedical and Health Informatics. 2024.
  37. Wang Z, Liu Z, Yu J, Gao Y, Liu M. Multi-scale nested UNet with transformer for colorectal polyp segmentation. Journal of Applied Clinical Medical Physics. 2024; p. e14351. pmid:38551396
  38. Zhu L, Wang X, Ke Z, Zhang W, Lau RW. BiFormer: Vision Transformer with Bi-Level Routing Attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023. p. 10323–10333.
  39. Wang W, Xie E, Li X, Fan DP, Song K, Liang D, et al. PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media. 2022;8(3):415–424.
  40. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR; 2015. p. 448–456.
  41. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings; 2011. p. 315–323.
  42. Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 2881–2890.
  43. Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, Rodríguez C, Vilariño F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics. 2015;43:99–111. pmid:25863519
  44. Jha D, Smedsrud PH, Riegler MA, Halvorsen P, de Lange T, Johansen D, et al. Kvasir-SEG: A segmented polyp dataset. In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26. Springer; 2020. p. 451–462.
  45. Loshchilov I, Hutter F. Fixing weight decay regularization in Adam. 2018.
  46. Jha D, Smedsrud PH, Riegler MA, Johansen D, De Lange T, Halvorsen P, et al. ResUNet++: An advanced architecture for medical image segmentation. In: 2019 IEEE International Symposium on Multimedia (ISM). IEEE; 2019. p. 225–2255.
  47. Srivastava A, Jha D, Chanda S, Pal U, Johansen HD, Johansen D, et al. MSRF-Net: a multi-scale residual fusion network for biomedical image segmentation. IEEE Journal of Biomedical and Health Informatics. 2021;26(5):2252–2263.
  48. Zhang Z, Liu Q, Wang Y. Road extraction by deep residual U-net. IEEE Geoscience and Remote Sensing Letters. 2018;15(5):749–753.
  49. Huang CH, Wu HY, Lin YL. HarDNet-MSEG: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean Dice and 86 FPS. arXiv preprint arXiv:2101.07172. 2021.