Abstract
Since Transformers have demonstrated excellent performance in the segmentation of two-dimensional medical images, recent works have also introduced them into 3D medical segmentation tasks. For example, hierarchical transformers such as Swin UNETR have reintroduced several inductive priors of convolutional networks, further enhancing volumetric segmentation on three-dimensional medical datasets. The effectiveness of these hybrid architectures is largely attributed to their large number of parameters and the large receptive fields of non-local self-attention. We believe that large-kernel volumetric depthwise convolutions can obtain large receptive fields with fewer parameters. In this paper, we propose a lightweight three-dimensional convolutional network, LKDA-Net, for efficient and accurate three-dimensional volumetric segmentation. The network adopts a large-kernel depthwise convolution attention mechanism to simulate the self-attention mechanism of Transformers. First, inspired by the Swin Transformer module, we investigate large-kernel convolution attention mechanisms of different sizes to obtain larger global receptive fields, and replace the MLP in the Swin Transformer with the Inverted Bottleneck with Depthwise Convolutional Augmentation to reduce channel redundancy and enhance feature expression and segmentation performance. Second, we propose a skip connection fusion module to achieve smooth feature fusion, enabling the decoder to effectively utilize the features of the encoder. Finally, in experimental evaluations on three public datasets, namely Synapse, BTCV and ACDC, LKDA-Net outperforms existing models of various architectures in segmentation performance while using fewer parameters. Code: https://github.com/zouyunkai/LKDA-Net.
Citation: Li M, Ma J, Zhao J (2025) LKDA-Net: Hierarchical transformer with large Kernel depthwise convolution attention for 3D medical image segmentation. PLoS One 20(8): e0329806. https://doi.org/10.1371/journal.pone.0329806
Editor: Fatih Uysal, Kafkas University: Kafkas Universitesi, TÜRKIYE
Received: March 19, 2025; Accepted: July 22, 2025; Published: August 8, 2025
Copyright: © 2025 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Our paper utilizes three publicly available datasets for medical image segmentation, namely: Synapse, ACDC, and BTCV. Specific descriptions of these datasets are provided in the “Experiments and Results” section of this paper. Additionally, the datasets used in our model have been uploaded to the figshare website. The Synapse dataset can be accessed via the following DOI: 10.6084/m9.figshare.29073904. The ACDC dataset can be accessed via the following DOI: 10.6084/m9.figshare.29071418. The BTCV dataset can be accessed via the following DOI: 10.6084/m9.figshare.29077214.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Since 3D medical images possess richer and more detailed spatial information than 2D images, 3D voxel segmentation is a critical technique that enables the visualization of medical images, aids in diagnosis, and facilitates treatment planning [1, 2]. Hierarchical transformer models such as the Swin Transformer have been introduced into the segmentation of 3D medical images and have achieved excellent performance on multiple volumetric segmentation benchmarks [3–6]. These models reincorporate some priors of convolutional neural networks, such as local connectivity and translation invariance, enhancing the applicability of transformers to 3D medical images. Their strategy of splitting the input into patches and using window-level self-attention to model global dependencies further enlarges the receptive field and strengthens the feature representation [7]. As a result, these hybrid transformer-convolution frameworks have achieved significant improvements in performance [8].
However, directly applying 3D transformer models as general backbone networks has issues such as high computational overhead and high memory requirements [3]. The computational complexity of the global self-attention mechanism grows quadratically with the increase in input resolution [9]. High-resolution 3D medical images often contain a large amount of fine-grained information, imposing higher requirements on the feature extraction capabilities of models [10]. Therefore, the design of efficient 3D medical image analysis models still remains to be explored [11, 12]. We believe that compared with transformers, 3D convolutional networks can simulate the behavior of large receptive fields with fewer parameters through depthwise convolutions. The depthwise convolution within local regions can mimic the window-level self-attention computation of transformers [3]. Large-kernel depthwise separable convolutions can provide global-level receptive fields for feature extraction, replacing the expensive global self-attention operations of transformers [13].
Based on the above considerations, this paper proposes a lightweight three-dimensional convolutional network named LKDA-Net, which uses a large-kernel depthwise convolution attention mechanism to simulate the Transformer self-attention mechanism, aiming to achieve efficient and accurate three-dimensional volumetric segmentation. Specifically, inspired by the design of the Swin Transformer module, we have investigated large-kernel convolution attention mechanisms of different sizes to obtain a larger global receptive field. Furthermore, we use the Inverted Bottleneck with Depthwise Convolutional Augmentation to replace the MLP in the Swin Transformer module, which can enhance the feature expression ability and improve the segmentation performance while reducing the redundancy among channels. We evaluated the LKDA-Net on three public three-dimensional medical image segmentation datasets, and the results show that it outperforms current models with various architectures and has fewer parameters. Our main contributions are as follows:
- We propose the LKDA-Net, which is a lightweight convolutional network for three-dimensional medical image segmentation based on the LKDA-Net Block. The LKDA-Net Block explores large-kernel convolution attention mechanisms of different sizes to obtain a larger global receptive field. In addition, we design and use the Inverted Bottleneck with Depthwise Convolution Augmentation to replace the multi-layer perceptron (MLP) to enhance the expression of channel features, so as to reduce the number of parameters and improve the segmentation performance.
- We propose a skip connection fusion module to achieve smooth feature fusion, enabling the decoder to effectively utilize the feature information obtained by the encoder and optimize the feature processing effect.
- We evaluated the segmentation performance of our LKDA-Net on three public datasets and provided visual analysis and parameter quantity comparison. The experimental results demonstrate that our model outperforms current models with various architectures in terms of performance and has significantly fewer parameters.
Related work
CNN-based segmentation approaches
In the field of deep learning-based image segmentation, the primary network architectures fall into three categories: Convolutional Neural Network (CNN)-based methods, Transformer-based methods, and CNN-Transformer hybrid architectures. CNN-based methods leverage the robust spatial feature extraction capabilities of CNNs to effectively identify and segment anatomical structures in medical images. CNNs [14–16] have emerged as a powerful tool for medical image segmentation, owing to their capacity to capture multiscale representations, local semantic details, and texture. Çiçek et al. [17] extended the U-Net framework by replacing 2D convolutions with 3D operations to segment volumetric images with sparse labels. Isensee et al. [16] proposed nnU-Net, a U-Net-based model that incorporates automated configuration to extract features from images at multiple hierarchical levels.
Furthermore, researchers have explored the acquisition of local-global information via pure CNN architectures, such as deformable convolutions [18–20], depthwise convolutions [21–23] and large kernel convolutions [13, 24]. For example, in the detection of bipolar disorder using OCT images, Attention TurkerNeXt employs interpretable attention mechanisms and feature visualization techniques to identify critical anatomical regions that drive the model’s decision-making process [25]. Ho et al. [26] used relatively large kernels within skip connections to segment splenic regions. Similarly, Li et al. [21] proposed the Lkau-net architecture, integrating extensive depthwise convolutions and large-kernel convolutions within the decoder for medical image volume segmentation. Unfortunately, as the kernel size increases, both model parameters and FLOPs increase significantly, thereby affecting training and inference efficiency. To improve the effectiveness of large kernels, ConvNeXt [27] exploits large kernel depthwise convolutions, a technique originating from natural image processing. However, in volume segmentation, the application of large-kernel depthwise convolutions is still relatively under-explored. Given the large receptive field provided by these depthwise convolutions, we believe that they have the potential to emulate Transformer behaviors, making them suitable for volume segmentation tasks.
Transformers-based segmentation approaches
Following the remarkable success of Transformers in natural language processing (NLP), Transformer-based approaches for medical image segmentation have emerged as a key research focus. By leveraging self-attention mechanisms, these methods capture global dependencies to improve the understanding and segmentation accuracy of complex medical images, particularly those with challenging backgrounds and irregular anatomical structures [28]. Recent progress in Vision Transformers [29] has addressed long-range dependency issues, especially in medical image segmentation [6, 30]. A significant innovation in this area is Swin-Unet [28], which has a U-shaped encoder-decoder structure enhanced with Swin Transformer blocks. Likewise, Jiang et al. introduced SwinBTS [31], which uses improved Transformer modules to extract detailed features. Zhou et al.’s nnFormer [6] retains convolutional layers for local image details and a hierarchical structure for multi-scale features. However, Transformer-based volumetric segmentation models have a large number of parameters and long training times, and the high computational complexity of multi-scale feature extraction makes the situation worse [32]. This limitation prompts a reconsideration of whether convolutional neural networks can effectively mimic the advantages of Transformers for efficient feature extraction.
CNN-transformer hybrid segmentation approaches
Recently, researchers have begun exploring methods based on CNN-Transformer hybrid architectures. These hybrid approaches aim to integrate the efficiency of CNNs in processing local image features with the capability of Transformers to capture global dependencies, thereby enhancing both the efficiency and precision of medical image segmentation [33]. Some research efforts have aimed to develop hybrid architectures that combine the U-Net model with transformers [2, 5]. This combination uses convolutions for local feature extraction and global self-attention to capture comprehensive global-local contextual information. One such innovation is TransUNet [33], which introduced an encoder with a hybrid CNN-Transformer architecture. It improves segmentation performance by integrating convolutional neural networks into the Transformer framework, which helps to recover local spatial information. TransFuse [34] adopted a parallel integration approach that combines Transformers and CNNs to capture global information more effectively. Furthermore, UNETR [5] introduced a new approach to semantic segmentation of medical images that uses Transformers and redefines the task as a 1D sequence-to-sequence prediction problem. Another notable contribution is 3D UX-NET [2], which proposes a lightweight volumetric ConvNet module with large kernel depth-wise convolutions. This module refines hierarchical features, ultimately leading to improved volumetric segmentation results.
Attention mechanism
The attention mechanism enables the model to focus flexibly and specifically on the critical parts of the image. In the field of medical image segmentation, attention mechanisms fall into two main categories: spatial attention and channel attention. Channel attention primarily concentrates on the objects of significance [35–37], while spatial attention emphasizes salient regions [38, 39]. Currently, most Transformer-based methods directly use the self-attention mechanism of the Transformer to capture global feature information. For example, MedT [40] proposed a model based on gated axial attention, which extends the existing architecture by incorporating a control mechanism into self-attention. The guided self-attention mechanism proposed by Sinha captures richer context dependencies more accurately [41]. The global spatial attention module constructed by TransAttUnet [42] combines global spatial attention with self-attention, allowing the model to obtain long-range context interaction information. To alleviate the high computational burden caused by self-attention, SegNeXt [43] proposes an efficient multi-scale convolutional attention mechanism. PraNet [44] introduces reverse attention to improve the accuracy of segmentation boundaries. However, these techniques focus only on spatially salient regions and neglect, to some extent, attention to important objects in the channel dimension. For instance, although CBAM integrates channel and spatial information, its spatial attention is obtained through channel compression, resulting in a uniform distribution of spatial attention weights among channels [45]. MA-Unet [46] uses attention gates to resolve the semantic ambiguity introduced by skip connections and acquires global information at different scales by means of multi-scale prediction fusion.
Method
Fig 1 illustrates the network architecture of our LKDA-Net constructed based on the LKDA-Net Block. Firstly, we utilize the large kernel projection to extract patch-wise features and input them into the encoder composed of the LKDA-Net Block and the downsampling block. Subsequently, the decoder consisting of the Skip Connection Fusion Module and the upsampling module further extracts features and performs upsampling. Meanwhile, the feature information at different scales of the encoder is fused through the Skip Connection Fusion Module.
We adopt a multi-scale hierarchical encoder-decoder structure. Firstly, the feature map is projected into multiple patches that are embedded by means of large-kernel 3D convolution. Each stage within the encoder consists of LKDA-Net Blocks and downsampling modules, which are responsible for handling feature maps at different scales. The decoder is made up of the Skip Connection Fusion Module and upsampling modules. The specific structure of the LKDA-Net Block is illustrated in Figs 2 and 3. The specific structure of the Skip Connection Fusion Module is shown in Fig 4.
For a more detailed comparison, Table 1 comprehensively illustrates the architectural differences between the proposed LKDA-Net and existing models from multiple perspectives. Transformer-based models (e.g., Swin-Unet, nnFormer) rely on self-attention mechanisms with quadratic computational complexity, which hinders efficient and continuous processing of high-resolution 3D volumetric data. CNN-Transformer hybrid models (e.g., TransUNet, UNETR, Swin-UNETR) utilize Transformers for global context modeling and CNNs for local feature extraction. However, they underutilize the inherent inductive bias of convolutions and introduce redundant cross-channel parameters through MLP modules. Meanwhile, large-kernel convolutional models (e.g., MedNext) capture global information via expansive kernels but fail to explicitly model channel-wise and spatial dependencies, resulting in suboptimal segmentation performance for complex small targets. In contrast, the proposed LKDA-Net addresses these limitations through three key innovations: (1) Large Kernel Depthwise Convolution Attention (LKD Attention) enables global context modeling with linear complexity O(N), preserving 3D spatial continuity while expanding the effective receptive field; (2) Inverted Bottleneck with Depthwise Convolution Augmentation (DWCA) refines local feature representation by decoupling channel interactions and eliminating redundancy; (3) A Group Convolution-based Skip Connection Fusion Module aligns multi-scale encoder-decoder features, effectively mitigating semantic gap issues caused by resolution mismatches.
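The complexity and parameter arguments above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the paper: the channel count `C = 48` and kernel size `K = 7` are illustrative assumptions, and the functions count only weights or interactions, ignoring biases and constant factors.

```python
def dense_conv3d_params(c_in, c_out, k):
    """Weights of a standard 3D convolution (bias ignored)."""
    return c_in * c_out * k ** 3


def depthwise_conv3d_params(channels, k):
    """Weights of a depthwise 3D convolution: one k*k*k filter per channel."""
    return channels * k ** 3


def global_attention_pairs(num_tokens):
    """Token-pair interactions in global self-attention: quadratic in N."""
    return num_tokens ** 2


def depthwise_conv_taps(num_voxels, k):
    """Kernel taps evaluated per channel by a depthwise conv: linear in N."""
    return num_voxels * k ** 3


# Hypothetical sizes chosen only for illustration.
C, K = 48, 7
print(dense_conv3d_params(C, C, K))    # 790272
print(depthwise_conv3d_params(C, K))   # 16464

# Doubling the number of voxels N quadruples the attention cost
# but only doubles the depthwise-convolution cost.
N = 1000
print(global_attention_pairs(2 * N) // global_attention_pairs(N))        # 4
print(depthwise_conv_taps(2 * N, K) // depthwise_conv_taps(N, K))        # 2
```

For a fixed kernel size, the depthwise variant needs a factor of `C` fewer weights than a dense convolution, which is the sense in which a large receptive field can be had with fewer parameters.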
LKDA-Net block
Inspired by ConvNeXt and Swin Transformers, we propose the LKDA-Net Block for 3D medical image segmentation. As directly applying transformers as a universal backbone has the problem of high computational complexity, we propose to simulate the self-attention mechanism of transformers for global relationship modeling based on the large kernel depthwise convolution attention mechanism, so as to efficiently extract global features. Furthermore, we have designed an efficient 3D medical image segmentation network. Fig 2 compares the differences between the Swin Transformer Block and our proposed LKDA-Net Block. The distinct designs of the LKDA-Net Block mainly include Large Kernel Depthwise Convolution Attention (LKD Attention) and Inverted Bottleneck with Depthwise Convolution Augmentation (DWCA).
Fig 2. (b) Architecture of the proposed LKDA-Net Block. Compared with the Swin Transformer block, the LKDA-Net Block obtains a larger receptive field through Large Kernel Depthwise Convolution Attention (LKD Attention) and Inverted Bottleneck with Depthwise Convolutional Augmentation (DWCA), while also improving the quality of feature representation.
Large Kernel Depthwise Convolution Attention (LKD Attention).
Swin Transformer employs Window-based Multi-head Self-Attention (W-MSA) to capture local dependencies and further utilizes Shifted Window MSA to explore the dependencies among different windows, thereby obtaining a global receptive field. We have found that there is a significant similarity between the per-channel computation of Depthwise Convolution and the weighted summation of self-attention. We believe that using Large Kernel Depthwise Convolution Attention (LKD Attention) can acquire a large receptive field just like MSA does, thus capturing global dependencies.
The LKD Attention proposed by us adaptively utilizes channel-level and spatial-level contextual information through a large receptive field. Specifically, multiple large kernel depthwise convolutions are employed to extract multi-scale features. Moreover, we cascade these large kernel depthwise convolutions, endowing them with increasing dilation rates and growing kernel sizes. On the one hand, this design can recursively aggregate contextual information within the receptive field. On the other hand, the features extracted within deeper and larger receptive fields contribute more to the output, enabling the LKD Attention to capture more effective features.
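To illustrate how cascading dilated depthwise convolutions enlarges the receptive field, the sketch below applies the standard formula for stacked stride-1 convolutions, RF = 1 + Σ (k − 1)·d. The dilation rates 1 and 3 follow the description of LKD Attention in this section; the kernel sizes 5 and 7 are assumptions made for illustration only.

```python
def effective_receptive_field(layers):
    """Per-axis receptive field of cascaded stride-1 convolutions.

    layers: iterable of (kernel_size, dilation) pairs, applied in order.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each layer widens the field by (k - 1) * d voxels along an axis.
        rf += (kernel_size - 1) * dilation
    return rf


# Assumed kernel sizes (5, then 7); dilation rates 1 and 3.
cascade = [(5, 1), (7, 3)]
print(effective_receptive_field(cascade))  # 23
```

Under these assumed sizes, two depthwise layers cover a 23-voxel span per axis while storing only 5³ + 7³ weights per channel.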
The LKD Attention is shown in Fig 3. We first use a convolutional layer to perform a projection that halves the number of channels, which reduces the cost of the subsequent convolutions: the input feature of layer l, with C channels and spatial dimensions H, W and D, is projected to a feature with C/2 channels (C is the number of channels, and H, W, D are the dimensions of the 3D image). Secondly, we apply two large-kernel depthwise convolutions (DW Conv) to the projected feature maps. The first has a dilation rate of 1, and the second has a dilation rate of 3. The large convolutional kernels capture global feature dependencies within a local region, simulating the window-level self-attention calculation in transformers. Meanwhile, the channel independence of depthwise convolutions is similar to the operation of self-attention on each patch.
By cascading the two large-kernel depthwise convolutions, the LKD Attention obtains a large effective receptive field. The two resulting feature maps are concatenated to restore the original number of channels. Then, the global spatial relationships of these features are modeled by applying average pooling and maximum pooling along the channels of the concatenated feature.
Then, a depthwise convolution is used to mix the pooled information across the different spatial feature representations. Finally, the Sigmoid activation function is applied to obtain the weight values w1 and w2.
These weights are used to adaptively select the features from the different large kernels and calibrate them.
Subsequently, after a convolution is applied to the calibrated feature, an SE module weights the feature map, explicitly modeling the interdependencies among the channels of its convolutional features to improve the quality of the feature representation. Finally, a residual connection is used to generate the output feature.
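The steps of LKD Attention described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the kernel sizes (5, 7, and 7 for the mixing convolution), the 1 × 1 × 1 projection and restoring convolutions, and the SE reduction ratio are all hypothetical. The text describes the mixing convolution as depthwise; a dense 2-to-2 convolution is used here so that the two pooled maps can interact, which is also an assumption.

```python
import torch
import torch.nn as nn


class SqueezeExcite3d(nn.Module):
    """Channel re-weighting: global average pool, bottleneck, sigmoid gate."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)


class LKDAttentionSketch(nn.Module):
    """Hypothetical sketch of LKD Attention; kernel sizes are assumptions."""

    def __init__(self, channels, k1=5, k2=7, dilation2=3, mix_kernel=7):
        super().__init__()
        half = channels // 2  # channels is assumed to be even
        # Projection that halves the number of channels.
        self.project = nn.Conv3d(channels, half, kernel_size=1)
        # Cascaded depthwise convolutions: dilation 1, then dilation 3.
        self.dw1 = nn.Conv3d(half, half, k1, padding=k1 // 2, groups=half)
        self.dw2 = nn.Conv3d(half, half, k2, padding=dilation2 * (k2 // 2),
                             dilation=dilation2, groups=half)
        # Mixes the channel-wise average and max maps into two weight maps.
        self.mix = nn.Conv3d(2, 2, mix_kernel, padding=mix_kernel // 2)
        self.restore = nn.Conv3d(half, channels, kernel_size=1)
        self.se = SqueezeExcite3d(channels)

    def forward(self, x):
        p = self.project(x)
        u1 = self.dw1(p)
        u2 = self.dw2(u1)                        # cascaded on the first branch
        u = torch.cat([u1, u2], dim=1)           # back to the original channels
        avg_map = u.mean(dim=1, keepdim=True)    # average pooling along channels
        max_map = u.amax(dim=1, keepdim=True)    # max pooling along channels
        w = torch.sigmoid(self.mix(torch.cat([avg_map, max_map], dim=1)))
        selected = u1 * w[:, 0:1] + u2 * w[:, 1:2]  # weights w1 and w2
        return x + self.se(self.restore(selected))  # residual connection


block = LKDAttentionSketch(channels=8)
volume = torch.randn(1, 8, 12, 12, 12)
print(block(volume).shape)  # torch.Size([1, 8, 12, 12, 12])
```

The paddings are chosen so that every convolution preserves the spatial size, which keeps the residual addition well defined.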
Inverted Bottleneck with Depthwise Convolution Augmentation (DWCA).
In the Transformer structure, the MLP module is designed as an inverted bottleneck, that is, its hidden layer is four times wider than the input dimension. MobileNetV2 [47] first applied the inverted bottleneck structure to ConvNets, and several advanced ConvNet models have since adopted a similar design. Therefore, we adopt a similar inverted bottleneck structure and use a depthwise convolution with a size of 1 × 1 × 1 to process the features. Inspired by vision transformers, instead of the batch normalization (BN) commonly used in ConvNets, we use Layer Normalization (LN) for the normalization operation. In addition, we replace the ReLU activation function with the smoother GELU activation function.
In the proposed Inverted Bottleneck structure, the features passing through the Layer Normalization layer are first expanded to four times the number of input channels through a depthwise convolution. After the result is processed by the GELU activation function, a depthwise convolution with a size of 1 × 1 × 1 independently scales each channel feature back to the original number of input channels. Subsequently, a residual connection with the original feature map yields the output of the Inverted Bottleneck. Because the dimensions of each channel are expanded and compressed independently, this reduces the redundancy among channels while enhancing the feature expression ability. Therefore, we define the outputs of layers l and l + 1 of the LKDA-Net Block as follows:
where the intermediate features at layers l and l + 1 are the outputs of the DWConv layers at different depths. SE represents the Squeeze-and-Excitation block, LN denotes layer normalization, and DWCA represents the Inverted Bottleneck with Depthwise Convolution Augmentation.
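A possible reading of the DWCA structure is sketched below in PyTorch. The class name `DWCASketch` is hypothetical, and realizing the per-channel expansion and compression as grouped 1 × 1 × 1 convolutions with `groups` equal to the channel count is an assumption about how "independent" channel scaling is implemented.

```python
import torch
import torch.nn as nn


class DWCASketch(nn.Module):
    """Hypothetical inverted bottleneck with per-channel (depthwise) scaling."""

    def __init__(self, channels, expansion=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        # groups=channels: each input channel is expanded into `expansion`
        # channels without mixing information across channels.
        self.expand = nn.Conv3d(channels, expansion * channels,
                                kernel_size=1, groups=channels)
        self.act = nn.GELU()
        # Each group of `expansion` channels is compressed back to one channel.
        self.reduce = nn.Conv3d(expansion * channels, channels,
                                kernel_size=1, groups=channels)

    def forward(self, x):
        # LayerNorm acts on the last axis, so move channels last and back.
        y = self.norm(x.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        y = self.reduce(self.act(self.expand(y)))
        return x + y  # residual connection with the original feature map


dwca = DWCASketch(channels=8)
features = torch.randn(1, 8, 6, 6, 6)
print(dwca(features).shape)  # torch.Size([1, 8, 6, 6, 6])
```

With this grouping, the two convolutions hold 8C weights in total (4C each), compared with 8C² for a dense four-times MLP, which is one way to see the claimed reduction in cross-channel redundancy.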
LKDA-Net encoder
The encoder takes a 3D volume from the training set as input and is divided into five stages. In the first stage, instead of adopting linear projection embedding, we use a large-kernel depthwise convolution layer to perform patch embedding and obtain the initial feature map. This patch-based feature extraction aims to simulate the operation of first performing patch segmentation and then learning embeddings in vision transformers. Compared with the global self-attention in transformers, which has a high computational cost, our design provides an efficient alternative for capturing global dependencies within local regions. The remaining four stages are each composed of the LKDA-Net Block and downsampling. In the LKDA-Net Block, the Large Kernel Depthwise Convolution Attention is used to model global dependencies, followed by the Inverted Bottleneck with Depthwise Convolutional Augmentation, which expands and compresses the dimensions of each channel independently, enhancing the feature expression ability while reducing the redundancy among channels. Instead of using an MLP, we use a 3D convolution with a stride of 2 to halve the resolution of the feature map so that the information of each channel can be better fused. The third, fourth, and fifth stages follow the same operations, each further halving the feature-map resolution. Multi-scale hierarchical feature representations are extracted at each stage and are further utilized in the segmentation of 3D voxel data.
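The way stride-2 convolutions shrink the feature map from stage to stage can be checked with the standard convolution output-size formula. In the sketch below, the input size of 96 voxels per axis and the downsampling kernel size of 2 are assumptions for illustration, and the patch embedding is assumed to leave the resolution unchanged.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis of a convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def encoder_resolutions(input_size, num_downsamples=4, kernel=2, stride=2):
    """Per-axis resolution after each stride-2 downsampling convolution."""
    sizes = [input_size]
    for _ in range(num_downsamples):
        sizes.append(conv_output_size(sizes[-1], kernel, stride))
    return sizes


# Hypothetical 96-voxel input axis, four downsampling steps.
print(encoder_resolutions(96))  # [96, 48, 24, 12, 6]
```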
LKDA-Net decoder
Our decoder utilizes the Skip Connection Fusion Module to further extract hierarchical visual feature representations and optimize the segmentation results. Specifically, firstly, the feature outputs of each encoder stage are concatenated and fused with the upsampled decoding features. Secondly, the results of multi-step fusion are input into a convolutional layer with a softmax activation function to predict the final segmentation probabilities. The application of the Skip Connection Fusion Module in the decoder enables our model to align the semantic information of features at different resolutions, so as to optimize the segmentation of fine-grained volumetric data.
Skip connection fusion module.
Traditional skip connections usually rely on ordinary convolution operations for feature fusion. Such operations are simple but lack flexibility and specificity, placing more computational load on the encoder and decoder. The skip connection fusion module we propose instead adopts group convolution as a key component to address these problems. As shown in Fig 4, we divide the convolution operation into two groups. One group is dedicated to refined “feature-to-feature” extraction on the encoder features in the skip connection, while the other performs the same operation on the upsampled decoder features. Here, the kernel size of the group convolution is set to 3 × 3 × 3, the stride is 1, and the padding is 1. To achieve more thorough feature fusion, we add two inverted bottleneck pointwise convolutions after the group convolution. The grouping lets the module process the encoder and decoder features separately before fusion, according to their own characteristics. The efficient, dense pointwise convolutions then carry out the main task of feature fusion.
Fig 4. Skip Connection Fusion Module. The module employs group convolution to separately process encoder and decoder features, followed by inverted bottleneck pointwise convolutions for adaptive fusion.
Moreover, inside the skip connection fusion module, each convolution operation is followed by a GELU activation function layer and a BatchNorm normalization layer, so as to further optimize the feature processing effect and enhance the stability and adaptability of the module. The definition of the Skip Connection Fusion Module Block is as follows:
where $f_E$ and $f_D$ are the features of the encoder and the decoder respectively, $f_c$ is the concatenated feature, and $f_{fusion}$ is the output fused feature map of the Skip Connection Fusion Module.
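The fusion module described in this subsection can be sketched in PyTorch as below. This is an illustrative reading rather than the authors' code: the class name, the expansion factor of 2 in the inverted bottleneck, and the assumption that encoder and decoder features have the same number of channels are all hypothetical. The 3 × 3 × 3 group convolution with two groups, stride 1 and padding 1, and the GELU followed by BatchNorm after each convolution follow the text.

```python
import torch
import torch.nn as nn


def conv_gelu_bn(c_in, c_out, kernel, groups=1):
    """Convolution followed by GELU and BatchNorm, as described in the text."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel, stride=1,
                  padding=kernel // 2, groups=groups),
        nn.GELU(),
        nn.BatchNorm3d(c_out),
    )


class SkipFusionSketch(nn.Module):
    """Hypothetical sketch of the Skip Connection Fusion Module."""

    def __init__(self, channels, expansion=2):
        super().__init__()
        both = 2 * channels
        # groups=2: the first group sees only encoder channels,
        # the second group sees only decoder channels.
        self.grouped = conv_gelu_bn(both, both, kernel=3, groups=2)
        # Inverted bottleneck of two dense pointwise convolutions.
        self.expand = conv_gelu_bn(both, expansion * both, kernel=1)
        self.reduce = conv_gelu_bn(expansion * both, channels, kernel=1)

    def forward(self, f_e, f_d):
        f_c = torch.cat([f_e, f_d], dim=1)  # concatenated feature
        return self.reduce(self.expand(self.grouped(f_c)))


fusion = SkipFusionSketch(channels=4)
f_e = torch.randn(2, 4, 8, 8, 8)   # encoder feature from the skip connection
f_d = torch.randn(2, 4, 8, 8, 8)   # upsampled decoder feature
print(fusion(f_e, f_d).shape)  # torch.Size([2, 4, 8, 8, 8])
```

Because the concatenation order is encoder first, decoder second, the two groups of the grouped convolution line up exactly with the two feature sources; mixing between them happens only in the pointwise layers.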
Upsampling block.
The Upsampling Block consists of an upsampling layer, a convolution layer, a batch normalization layer, and a ReLU activation function. For upsampling, we employ bilinear interpolation to upscale the feature map by a factor of 2, which increases the resolution of the feature map smoothly and efficiently. The convolution layer within the block uses a stride of 1 and a padding of 1, which enables the extraction and refinement of local features. Together, the interpolation, convolution, batch normalization, and ReLU activation increase the resolution of the feature map while retaining its essential features.
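A minimal PyTorch sketch of such an upsampling block is given below. Two points are assumptions: the kernel size of 3 (chosen because, with stride 1 and padding 1, it preserves the spatial size), and the use of `mode="trilinear"`, which is the interpolation mode PyTorch provides for 5D volumetric tensors and is taken here as the 3D counterpart of the bilinear interpolation named in the text. The class name is hypothetical.

```python
import torch
import torch.nn as nn


class UpsampleBlockSketch(nn.Module):
    """Hypothetical upsampling block: interpolate, conv, BatchNorm, ReLU."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            # Doubles D, H and W; trilinear is the volumetric analogue of bilinear.
            nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False),
            # Assumed 3x3x3 kernel; stride 1 and padding 1 keep the size.
            nn.Conv3d(c_in, c_out, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm3d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


up = UpsampleBlockSketch(c_in=8, c_out=4)
coarse = torch.randn(2, 8, 4, 4, 4)
print(up(coarse).shape)  # torch.Size([2, 4, 8, 8, 8])
```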
Loss function
To calculate the loss between the predicted 3D voxels and the ground truth, we utilize a combination of cross-entropy loss and soft Dice loss, leveraging the advantages of both loss functions. The loss function for LKDA-Net is defined as follows:
where $I$ represents the total number of 3D voxels, and $N$ is the number of predicted classes. $Z_{i,j}$ denotes the ground truth value of class $j$ at voxel $i$, while $P_{i,j}$ represents the predicted probability output of class $j$ at voxel $i$ by the model.
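For reference, one widely used form of a combined soft Dice and cross-entropy loss that is consistent with the symbols defined above is shown here. It is a sketch under the assumption that the two terms are summed with equal weight and that the Dice denominator uses squared terms; the exact weighting and denominator used for LKDA-Net may differ.

```latex
\mathcal{L}(Z, P) = 1
  - \frac{2}{N} \sum_{j=1}^{N}
    \frac{\sum_{i=1}^{I} Z_{i,j}\, P_{i,j}}
         {\sum_{i=1}^{I} Z_{i,j}^{2} + \sum_{i=1}^{I} P_{i,j}^{2}}
  - \frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{N} Z_{i,j} \log P_{i,j}
```

The second term is the soft Dice score averaged over classes, and the third is the voxel-averaged cross-entropy.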
Experiments and results
Datasets
Our method was evaluated on three datasets: Synapse, BTCV, and ACDC, with both quantitative metrics and visual analysis results presented. These datasets cover diverse anatomical regions including abdominal multi-organ and cardiac regions, and incorporate imaging modalities such as CT and MRI, enabling comprehensive validation of the model’s segmentation performance and generalization capabilities. As widely recognized and extensively utilized benchmarks in the medical image segmentation community, these datasets provide a reliable reference standard to ensure the credibility and comparability of our experimental findings.
Synapse Dataset. The Synapse dataset is widely used in medical image analysis and consists of abdominal CT scans from 30 patients. The data partitioning follows TransUNet. Specifically, 18 cases are allocated to the training set, from which the model learns the feature patterns of abdominal organs. The remaining 12 cases serve as the test set, used to assess the accuracy of the trained model and its generalization to unseen data. This dataset covers the delineation of eight organs: the spleen, left kidney, pancreas, stomach, aorta, liver, gallbladder, and right kidney.
BTCV Dataset. The BTCV Dataset (Beyond-the-Cranial-Vault Abdominal CT Organ Segmentation) comprises 30 training instances and 20 testing instances. Notably, this dataset not only encompasses the eight organs present in the Synapse dataset but also encompasses structures of the esophagus, inferior vena cava, portal and splenic veins, as well as the right and left adrenal glands.
ACDC Dataset. The ACDC Dataset (Automated Cardiac Diagnosis Challenge) comprises MRI scan images from numerous patients, featuring a composition of 70 training samples, 10 validation samples, and 20 test samples. Within each MRI image, delineations are made for the myocardium (MYO), right ventricle (RV), and left ventricle (LV) regions.
Implementation details
Our development environment consists of Ubuntu 18 and PyTorch 1.12. We train on a single GeForce RTX 3090 GPU with 24 GB of memory, using the AdamW optimizer with a maximum of 40,000 iterations and an initial learning rate of 0.0001. As a preprocessing step, all images are resampled to a uniform voxel spacing and then cropped to a fixed size. During training, we apply several data augmentation techniques: scaling, rotation, brightness and contrast adjustment, Gaussian noise, and Gaussian blurring. For evaluation, we use the average DSC as the primary metric. Additionally, we employ the 95% Hausdorff Distance (HD95) to measure the boundary error of the segmentation results, which further reveals the model’s sensitivity to anatomical boundaries. A lower HD95 value indicates higher agreement between the segmentation boundaries and the ground truth annotations. For a given semantic class i, the ground truth and predicted values are denoted by $G_i$ and $P_i$, respectively, and the ground truth and predicted surface point sets are denoted by $G'_i$ and $P'_i$, respectively. The DSC metric is defined as:

$$\mathrm{DSC}(G_i, P_i) = \frac{2\,|G_i \cap P_i|}{|G_i| + |P_i|}$$
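As a rough illustration of the two evaluation metrics, the sketch below computes DSC on voxel sets and HD95 on surface point sets using only the Python standard library. It is a minimal example under stated assumptions, not the evaluation code used in the paper: the function names, the nearest-rank percentile rule, and the convention that two empty masks score 1.0 are all choices made here for illustration, and distances are in voxel units because no voxel spacing is applied.

```python
import math

def dice(gt, pred):
    """Dice Similarity Coefficient between two voxel sets: 2|G n P| / (|G| + |P|)."""
    if not gt and not pred:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(gt & pred) / (len(gt) + len(pred))

def directed_distances(src, dst):
    """For every point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def percentile(values, q):
    """Nearest-rank percentile (one of several common definitions; assumed here)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def hd95(gt_surface, pred_surface):
    """Symmetric 95th-percentile Hausdorff distance between two surface point sets."""
    forward = percentile(directed_distances(gt_surface, pred_surface), 95)
    backward = percentile(directed_distances(pred_surface, gt_surface), 95)
    return max(forward, backward)

# Toy example: two 2x2 patches that overlap in half of their voxels.
gt = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 1, 1), (0, 0, 2), (0, 1, 2)}
print(dice(gt, pred))   # overlap-based score
print(hd95(gt, pred))   # boundary-based score, in voxels
```

In practice the surface point sets would be extracted from the mask boundaries and scaled by the voxel spacing so that HD95 is reported in millimetres.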
During training, we adopted deep supervision. Specifically, we used the cross-entropy loss and the Dice loss together as the final loss, computed by comparing the upsampled outputs from different decoder resolutions with the corresponding ground truth. The final training objective is therefore the weighted sum of the losses at five resolutions:

$$\mathcal{L}_{all} = \sum_{k=1}^{5} \alpha_k \mathcal{L}_k$$

where $\alpha_k$ denotes the loss weight for the features at the k-th resolution. A greater weight is assigned to the loss of feature maps with higher resolutions, which facilitates faster convergence and better segmentation results. The loss weights sum to 1.
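The weighted deep-supervision objective described above can be sketched as follows. The specific weight values are not recoverable from the text, so this example assumes a halving scheme (each coarser resolution gets half the weight of the previous one, then all weights are normalised to sum to 1), which is a common choice in deep-supervision setups but not necessarily the one used here. The function names and the per-scale loss values are hypothetical.

```python
def deep_supervision_weights(num_scales=5):
    """Assumed scheme: halve the weight at each coarser scale, then normalise to sum to 1."""
    raw = [0.5 ** k for k in range(num_scales)]  # index 0 = highest resolution
    total = sum(raw)
    return [w / total for w in raw]

def total_loss(per_scale_losses, weights):
    """Weighted sum of per-resolution losses (each one already cross-entropy + Dice)."""
    if len(per_scale_losses) != len(weights):
        raise ValueError("need exactly one weight per resolution")
    return sum(w * loss for w, loss in zip(weights, per_scale_losses))

weights = deep_supervision_weights(5)
# Made-up per-resolution loss values, ordered from highest to lowest resolution.
losses = [0.40, 0.55, 0.70, 0.90, 1.10]
print([round(w, 4) for w in weights])
print(round(total_loss(losses, weights), 4))
```

Any monotonically decreasing weights that sum to 1 would satisfy the two properties stated in the text; the halving rule is only one convenient instance.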
Comparison with state-of-the-art methods
We evaluated the performance of LKDA-Net on three segmentation datasets. These segmentation datasets vary in complexity, the number of structures to be segmented, image modalities (CT, MRI), and spatial and phenotypic heterogeneity. This experimental design emphasizes the experimental effect and generalization ability of LKDA-Net in different segmentation tasks. For a detailed evaluation and comparison, we compared the performance with various recent state-of-the-art (SOTA) segmentation models. These models were trained according to the optimal parameters provided in their papers and evaluated through the same 5-fold cross-validation.
- CNN-based methods: U-Net [48], nnUNet [16].
- Transformer-based methods: Swin-Unet [28], MISSFormer [30], nnFormer [6].
- Hybrid CNN-Transformer-based methods: TransUNet [33], UNETR [5], Swin-UNETR [9].
- Large convolutional kernel-based methods: 3D UX-NET [2], MedNext [11].
Table 2, Table 3 and Table 4 respectively present the performance comparisons on the Synapse multi-organ segmentation dataset, the Automated Cardiac Diagnosis dataset and the BTCV abdominal multi-organ segmentation task. When evaluating the model’s efficiency metrics, we conducted experiments under the same experimental environment and preprocessing pipeline as during the training phase. Table 5 presents the comparative results of parameter counts and computational costs across different models, where the inference time corresponds to the total time required by the model to segment 12 test samples from the Synapse dataset.
Synapse Dataset: LKDA-Net achieved an overall average Dice Similarity Coefficient (DSC) of 87.21% on the Synapse dataset, the best and most robust overall performance among the compared state-of-the-art (SOTA) methods. In particular, it clearly outperformed the CNN-based U-Net baseline, which achieved an average DSC of 76.85%. MedNext employs large-kernel pure convolutions to expand the receptive field, but its lack of explicit modeling of channel-wise and spatial dependencies limits segmentation accuracy for small targets (e.g., adrenal glands). In contrast, LKDA-Net leverages an Inverted Bottleneck with Depthwise Convolutional Augmentation (DWCA) to enable the model to concentrate on critical anatomical regions. In addition, the average DSC obtained by LKDA-Net was significantly higher than that of pure Vision Transformer methods such as Swin-Unet, MISSFormer, and nnFormer. Although nnFormer had a lower computational complexity, LKDA-Net achieved better segmentation results while reducing the number of model parameters by two-thirds.
Compared with hybrid CNN-Transformer methods that had already reached the state-of-the-art level in various segmentation tasks, LKDA-Net also achieved superior overall performance. Although UNETR required fewer floating-point operations, LKDA-Net achieved a higher average DSC. While Swin-UNETR captures long-range dependencies through windowed self-attention (W-MSA), its computational complexity grows quadratically with input resolution, and its MLP modules introduce redundant cross-channel parameters. In contrast, LKDA-Net’s LKD Attention significantly mitigates boundary ambiguity (Fig 6) and demonstrates superior segmentation performance without relying on the high computational costs associated with Transformer architectures. 3D UX-NET proposed replacing Transformer modules with large-kernel convolutional modules; LKDA-Net achieved better performance at a much lower computational complexity, which supports the effectiveness and efficiency of the proposed LKDA-Net Block and overall architecture. While MedNext’s large-kernel pure convolutional design yields lower computational complexity and shorter inference time than LKDA-Net, LKDA-Net achieves significantly higher segmentation accuracy. LKDA-Net achieves the lowest HD95 of 7.33 mm, significantly outperforming Swin-UNETR (10.55 mm) and 3D UX-NET (11.23 mm). This indicates that our model not only improves volumetric overlap but also reduces extreme segmentation errors at organ boundaries. The average DSC of LKDA-Net improved significantly across all organ-specific segmentation tasks, indicating that LKDA-Net has strong feature extraction and representation capabilities for different organs.
It can be seen from Table 5 that LKDA-Net has only 44.52 million parameters, significantly fewer than the pure Transformer models. The use of LKD Attention within the LKDA-Net Block substantially reduces the model’s parameter count; however, the computationally intensive nature of LKD Attention results in relatively higher FLOPs for our model. This design ensures efficient segmentation while mitigating the risks of overfitting from excessive model parameters and the overconsumption of computational resources. In terms of inference time, LKDA-Net achieves an end-to-end inference time (encompassing data preprocessing, GPU computation, and data transfer) of 1.65 seconds for segmenting 12 test samples, which is notably faster than Swin-UNETR (3.72 seconds) and nnUNet (3.41 seconds), despite their similar FLOPs. Furthermore, 3D UX-NET and MedNext, which rely on large-kernel pure convolutional operations, have parameter counts comparable to LKDA-Net but fall short in overall accuracy and inference efficiency.
ACDC Dataset: LKDA-Net achieved an overall average DSC of 92.79% on the ACDC dataset, demonstrating strong and robust overall performance compared with other state-of-the-art (SOTA) methods. In particular, it outperformed the CNN-based U-Net baseline, which achieved an average DSC of 87.55%. In addition, the average DSC of LKDA-Net was significantly higher than that of pure Vision Transformer methods such as Swin-Unet and MISSFormer. Even though nnFormer is currently the best-performing method on the ACDC dataset, LKDA-Net still surpassed it in both parameter count and overall segmentation performance. LKDA-Net also achieved superior overall performance compared with the hybrid CNN-Transformer methods. Models such as TransUNet, Swin-Unet, and UNETR each have advantages in certain aspects: TransUNet has a relatively low number of parameters and computational complexity, Swin-Unet performs fairly well on some indicators, and UNETR requires fewer floating-point operations. Nevertheless, LKDA-Net achieved a higher average DSC than all of them. Moreover, compared with the current mainstream models, the average DSC of LKDA-Net for the right ventricle (RV), myocardium (Myo), and left ventricle (LV) improved significantly. This indicates that LKDA-Net extracts and represents features of different cardiac structures well and delineates the regions in cardiac magnetic resonance images more accurately, thereby providing more reliable and precise data support for cardiac-related medical research or clinical diagnosis.
BTCV Dataset: LKDA-Net also demonstrates excellent segmentation performance on the BTCV dataset. The pure Convolutional Neural Network model nnUNet has an average DSC of 83.16%, the best among the current mainstream models. LKDA-Net reaches an average DSC of 83.23% on the BTCV dataset, outperforming nnUNet with fewer model parameters. Compared with the pure Vision Transformer method nnFormer, LKDA-Net has a significant advantage in average DSC and captures the feature information in images more effectively, which translates into more accurate segmentation results. LKDA-Net also stands out among the hybrid CNN-Transformer methods. While TransUNet integrates CNNs and Transformers, its encoder-decoder framework employs a simplistic fusion strategy (direct concatenation of features), leading to insufficient alignment between shallow local features and deep semantic information. In contrast, LKDA-Net achieves more precise segmentation through a skip-connection fusion module (combining grouped convolutions with inverted bottleneck pointwise convolutions). Although TransUNet and UNETR have relatively lower floating-point operations, LKDA-Net achieves better overall segmentation performance. Notably, LKDA-Net attains an HD95 of 4.85 mm, surpassing nnFormer (5.15 mm) and MedNext (9.34 mm). The reduced HD95 demonstrates the effectiveness of our skip connection fusion module in preserving anatomical details. Across the tissue and organ segmentation tasks in the BTCV dataset, LKDA-Net outlines the target regions more accurately, and its overall segmentation quality surpasses that of the current mainstream baseline models.
Fig 5 presents the visual comparison between LKDA-Net and other models in the Synapse multi-organ segmentation task. UNETR exhibits discontinuities in segmentation boundaries for the pancreas, Swin-UNETR demonstrates erroneous segmentation in the pancreatic head region, and nnFormer suffers from incomplete segmentation of the kidneys. In contrast, LKDA-Net’s predictions align more closely with the ground truth, particularly in capturing subtle boundary structures, where its contour continuity and spatial consistency significantly outperform competing models. These observations validate the enhanced global context modeling capability of LKDA-Net’s large-kernel depthwise separable convolution attention (LKD Attention).
The parts where our model outperforms other models are marked in yellow. Compared to other models, LKDA-Net demonstrates higher structural continuity in segmenting complex organ boundaries, such as the pancreas and kidneys, achieving results closer to the ground truth.
Fig 6 shows the visual comparison between LKDA-Net and other models in the BTCV abdominal multi-organ segmentation task. Here, nnFormer and nnUNet display incomplete segmentation of the portal splenic vein (PSV) with boundary blurring, while Swin-UNETR fails to accurately reconstruct the morphology of the right adrenal gland (RAG). In contrast, LKDA-Net effectively integrates multi-scale features through its Skip Connection Fusion Module, preserving fine-grained anatomical details to enhance segmentation precision while significantly mitigating edge ambiguity.
Zoomed-in boxed regions highlight significant differences in segmentation quality. LKDA-Net exhibits superior contour alignment with the ground truth for small anatomical structures, such as the esophagus (ESO), gallbladder (GAL), and adrenal glands (RAG, LAG), while substantially reducing mis-segmentation artifacts.
Ablation experiment
We conducted ablation studies on the LKDA-Net Block and the Skip Connection Fusion Module respectively. In the ablation studies, we evaluated the parameters and segmentation performance of different architectural configurations on the Synapse dataset. We adopted Swin-UNETR as the baseline model and conducted the following experimental configurations: (1) Replacing the Windowed Multi-Head Self-Attention (W-MSA) in Swin Transformer Blocks with our proposed LKD Attention (3×3×3 DWConv + 5×5×5 DWConv). (2) Using LKD Attention with larger kernels (5×5×5 DWConv + 7×7×7 DWConv). (3) Further scaling kernel sizes in LKD Attention (7×7×7 DWConv + 9×9×9 DWConv). (4) Substituting the MLP in Swin Transformer Blocks with our Inverted Bottleneck with Depthwise Convolutional Augmentation (DWCA). (5) Replacing Swin Transformer Blocks with LKDA-Net Blocks. (6) Combining LKDA-Net Blocks with the Skip Connection Fusion Module in the full architecture.
Table 6 shows the effect of adding the two modules proposed in this paper to the Swin-UNETR baseline model on the Synapse dataset. The Swin-UNETR baseline has 62.83M parameters and an average DSC of 83.48%. When the Windowed Multi-Head Self-Attention (W-MSA) in the Swin Transformer Block of Swin-UNETR is replaced with the proposed LKD Attention, the number of parameters is significantly reduced and the average DSC also improves. In addition, we conducted experiments with different kernel sizes for the two depthwise convolutions (DWConv) in the LKD Attention. The LKD Attention module achieved a mean Dice Similarity Coefficient (DSC) of 85.21% with the combination of 5×5×5 and 7×7×7 convolutional kernels, outperforming the other configurations. This superiority can be attributed to two key factors. From the perspective of multiscale context capture, the 5×5×5 kernel effectively captures midrange contextual information within local regions, while the 7×7×7 kernel extends the receptive field to model global anatomical dependencies. Their cascaded design recursively aggregates multiscale features, enhancing the model’s capacity to represent complex organ boundaries and heterogeneous regions. From the standpoint of balancing parameter efficiency and feature representation, although larger 9×9×9 kernels further expand the receptive field, they introduce a significant parameter increase (50.13M) and risk incorporating redundant noise. Conversely, smaller 3×3×3 kernels fail to adequately cover global dependencies. The 5×5×5 + 7×7×7 configuration achieves the best balance between receptive field coverage and feature precision while maintaining a lower parameter count (43.44M).
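The trade-off between kernel size, parameter count, and receptive field can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: it counts the weights of depthwise 3D convolutions for the three kernel pairs from the ablation and computes the per-axis receptive field of two cascaded convolutions, assuming stride 1 and no dilation (the actual LKD Attention may differ, for example by using dilation). The channel width of 48 and the function names are hypothetical, and the numbers are per layer, not the whole-model totals reported in Table 6.

```python
def depthwise_conv3d_params(channels, kernel, bias=True):
    """Depthwise 3D convolution: one k*k*k filter per channel (plus optional bias)."""
    return channels * kernel ** 3 + (channels if bias else 0)

def dense_conv3d_params(c_in, c_out, kernel, bias=True):
    """Standard 3D convolution: every output channel sees every input channel."""
    return c_in * c_out * kernel ** 3 + (c_out if bias else 0)

def cascaded_receptive_field(kernels):
    """Per-axis receptive field of stacked stride-1, dilation-1 convolutions."""
    return 1 + sum(k - 1 for k in kernels)

channels = 48  # hypothetical channel width, chosen only for illustration
for pair in [(3, 5), (5, 7), (7, 9)]:
    params = sum(depthwise_conv3d_params(channels, k) for k in pair)
    field = cascaded_receptive_field(pair)
    print(f"kernels {pair}: {params} depthwise weights, receptive field {field} per axis")

# For comparison, a single dense 7x7x7 convolution at the same width:
print(dense_conv3d_params(channels, channels, 7))
```

Under these assumptions the 5×5×5 + 7×7×7 cascade covers 11 voxels per axis with far fewer weights than a single dense 7×7×7 convolution, which is the intuition behind using large depthwise kernels.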
Furthermore, when the Multi-Layer Perceptron (MLP) in the Swin Transformer Block is replaced with the Inverted Bottleneck with Depthwise Convolutional Augmentation (DWCA) module in the LKDA-Net Block, both the number of parameters and the segmentation performance of the model slightly increase. This module not only improves performance metrics but also addresses the semantic gap by processing the encoder’s shallow local features and the decoder’s deep semantic features separately through grouped convolutions, combined with inverted bottleneck pointwise convolutions to adaptively calibrate feature weights. It also employs 3×3×3 grouped convolutions and GELU activation functions to enhance feature extraction, preserving edge details and subtle structures. By leveraging grouped convolutions to reduce redundant computations, the module achieves efficient fusion with minimal parameter overhead.
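To see why grouped convolutions keep the parameter overhead low, the following sketch counts the weights of a 3×3×3 convolution with and without grouping. It is a hedged illustration only: the channel count of 96 and the choice of two groups (one for encoder features, one for decoder features) are assumptions made here, not values taken from the paper, and the function name is hypothetical.

```python
def grouped_conv3d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Weight count of a 3D convolution whose channels are split into `groups` groups.

    Each output channel only sees c_in / groups input channels, so the weight
    count shrinks by a factor of `groups` relative to a standard convolution.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    return (c_in // groups) * c_out * kernel ** 3 + (c_out if bias else 0)

channels = 96  # hypothetical width of the concatenated encoder + decoder features
standard = grouped_conv3d_params(channels, channels, 3, groups=1)
grouped = grouped_conv3d_params(channels, channels, 3, groups=2)
print(standard, grouped)
```

With two groups, the convolution weights are halved while the encoder and decoder halves of the concatenated features are processed separately, which matches the motivation given in the text.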
Moreover, after the LKDA-Net Block is added to Swin-UNETR, the Skip Connection Fusion Module is added to the decoder. Although the number of parameters increases to 48.5M, the average DSC improves to 87.21%. These results demonstrate that the LKDA-Net Block not only reduces the number of parameters by using LKD Attention but also extracts multi-scale features with a larger receptive field to improve segmentation performance. In addition, the Skip Connection Fusion Module effectively fuses the shallow local information at each encoder stage with the deep semantic information in the decoder, further improving segmentation accuracy.
Discussion
To understand the limitations of LKDA-Net, we analyzed its segmentation results and those of other SOTA methods on the Synapse and BTCV datasets, focusing on the cases where the Dice scores were the lowest. Table 7 presents the Dice scores of these failure cases. The failure cases were highly consistent across methods: almost all methods showed relatively low segmentation accuracy on the same instances within the two datasets. Specifically, in the BTCV multi-organ segmentation task, U-Net and TransUNet had their lowest segmentation accuracy on the first case and similarly low performance on the second case. In contrast, Swin-UNETR and nnFormer had their lowest Dice scores on the second case while also performing poorly on the first case. This comparison covers a variety of model architectures, including the pure convolutional U-Net, the pure Transformer architecture nnFormer, and the hybrid CNN-Transformer frameworks TransUNet and Swin-UNETR. Because such different architectures fail on the same cases, these limitations might not stem from the architectural design or training. We speculate that certain anatomical features or pathological characteristics may inherently pose difficulties for segmentation algorithms.
Conclusions
We propose the LKDA-Net, which is an effective network architecture for 3D medical image segmentation. The design of LKDA-Net aims to model global relationships by simulating the self-attention mechanism of Transformers based on the LKDA-Net Block, so as to efficiently extract global features and then achieve accurate and efficient 3D volume segmentation. The encoder of LKDA-Net consists of the LKDA-Net Block and downsampling. Specifically, the Large Kernel Depthwise Convolution Attention (LKD Attention) in the LKDA-Net Block extracts multi-scale features by employing multiple large kernel depthwise convolutions. It recursively aggregates the contextual information within the receptive field and captures more effective features in deeper and larger receptive fields at the same time. The Inverted Bottleneck with Depthwise Convolution Augmentation (DWCA) in the LKDA-Net Block reduces the redundancy among channels while enhancing the feature expression ability by independently expanding and compressing the dimensions of each channel. The decoder utilizes the Skip Connection Fusion Module to further extract hierarchical visual feature representations and optimize the segmentation results. Inside it, effective operations such as group convolution are adopted to reasonably allocate features for fusion. The upsampling module improves the resolution of the feature map and retains key features through techniques like bilinear interpolation. We have conducted a comprehensive evaluation of LKDA-Net on multiple publicly available 3D medical image segmentation datasets. The results show that compared with the current Transformer models, LKDA-Net reduces the computational complexity while improving the segmentation performance.
References
- 1. Azad R, Aghdam EK, Rauland A, Jia Y, Avval AH, Bozorgpour A. Medical image segmentation review: the success of U-Net. arXiv preprint 2022. https://arxiv.org/abs/2211.14830
- 2. Lee HH, Bao S, Huo Y, Landman BA. 3D UX-Net: a large kernel volumetric ConvNet modernizing hierarchical transformer for medical image segmentation. arXiv preprint 2022. https://arxiv.org/abs/2209.15076
- 3. Tang Y, Yang D, Li W, Roth HR, Landman B, Xu D, et al. Self-supervised pre-training of swin transformers for 3D medical image analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 20730–40.
- 4. Wang W, Chen C, Ding M, Yu H, Zha S, Li J. Transbts: multimodal brain tumor segmentation using transformer. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I. 2021. p. 109–19.
- 5. Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B. Unetr: transformers for 3d medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022. p. 574–84.
- 6. Zhou HY, Guo J, Zhang Y, Yu L, Wang L, Yu Y. Nnformer: interleaved transformer for volumetric segmentation. arXiv preprint 2021.
- 7. Zhang Z, Zhang H, Zhao L, Chen T, Arik SÖ, Pfister T. Nested hierarchical transformer: towards accurate, data-efficient and interpretable visual understanding. AAAI. 2022;36(3):3417–25.
- 8. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 10012–22.
- 9. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. Swin unetr: swin transformers for semantic segmentation of brain tumors in mri images. In: International MICCAI Brainlesion Workshop. 2021. p. 272–84.
- 10. Peiris H, Hayat M, Chen Z, Egan G, Harandi M. A robust volumetric transformer for accurate 3D tumor segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2022. p. 162–72.
- 11. Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F. Mednext: transformer-driven scaling of convnets for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. 2023. p. 405–15.
- 12. Vasu PKA, Gabriel J, Zhu J, Tuzel O, Ranjan A. FastViT: a fast hybrid vision transformer using structural reparameterization. arXiv preprint 2023.
- 13. Li H, Nan Y, Del Ser J, Yang G. Large-Kernel attention for 3D medical image segmentation. Cognit Comput. 2024;16(4):2063–77. pmid:38974012
- 14. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging. 2020;39(6):1856–67. pmid:31841402
- 15. Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y. UNet 3+: a full-scale connected UNet for medical image segmentation. In: ICASSP. IEEE; 2020. p. 1055–9.
- 16. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203–11. pmid:33288961
- 17. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II. 2016. p. 424–32.
- 18. Yang X, Li Z, Guo Y, Zhou D. DCU-net: a deformable convolutional neural network based on cascade U-net for retinal vessel segmentation. Multim Tools Appl. 2022;81(11):15593–607.
- 19. Yang C, Zhang Z. PFD-Net: pyramid fourier deformable network for medical image segmentation. Comput Biol Med. 2024;172:108302. pmid:38503092
- 20. Azad R, Niggemeier L, Hüttemann M, Kazerouni A, Aghdam EK, Velichko Y, et al. Beyond self-attention: deformable large kernel attention for medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2024. p. 1287–97.
- 21. Li H, Nan Y, Yang G. LKAU-Net: 3D large-kernel attention-based u-net for automatic MRI brain tumor segmentation. In: Annual Conference on Medical Image Understanding and Analysis. Springer; 2022. p. 313–27.
- 22. Liu Y, Zhang Z, Yue J, Guo W. SCANeXt: Enhancing 3D medical image segmentation with dual attention network and depth-wise convolution. Heliyon. 2024;10(5):e26775. pmid:38439873
- 23. Zeng N, Fang J, Wang X, Lu X, Huang J, Miao H, et al. Factoring 3D convolutions for medical images by depth-wise dependencies-induced adaptive attention. In: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2022. p. 883–6. https://doi.org/10.1109/bibm55620.2022.9995195
- 24. Shang J, Zhou S. LK-UNet: large Kernel design for 3D medical image segmentation. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2024. p. 1576–80.
- 25. Arslan S, Kaya MK, Tasci B, Kaya S, Tasci G, Ozsoy F, et al. Attention TurkerNeXt: investigations into bipolar disorder detection using OCT images. Diagnostics (Basel). 2023;13(22):3422. pmid:37998558
- 26. Huo Y, Xu Z, Bao S, Bermudez C, Moon H, Parvathaneni P, et al. Splenomegaly segmentation on multi-modal MRI using deep convolutional networks. IEEE Trans Med Imaging. 2019;38(5):1185–96. pmid:30442602
- 27. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. p. 11976–86.
- 28. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q. Swin-unet: unet-like pure transformer for medical image segmentation. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III. Springer; 2023. p. 205–18.
- 29. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint 2020. https://arxiv.org/abs/2010.11929
- 30. Huang X, Deng Z, Li D, Yuan X. MISSFormer: an effective medical image segmentation transformer. CoRR. 2021.
- 31. Jiang Y, Zhang Y, Lin X, Dong J, Cheng T, Liang J. SwinBTS: a method for 3D multimodal brain tumor segmentation using swin transformer. Brain Sci. 2022;12(6):797. pmid:35741682
- 32. Shamshad F, Khan S, Zamir SW, Khan MH, Hayat M, Khan FS, et al. Transformers in medical imaging: a survey. Med Image Anal. 2023;88:102802. pmid:37315483
- 33. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y. Transunet: transformers make strong encoders for medical image segmentation. arXiv preprint 2021. https://arxiv.org/abs/2102.04306
- 34. Zhang Y, Liu H, Hu Q. Transfuse: fusing transformers and cnns for medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I. 2021. p. 14–24.
- 35. Chen Y, Wang K, Liao X, Qian Y, Wang Q, Yuan Z, et al. Channel-Unet: a spatial channel-wise convolutional neural network for liver and tumors segmentation. Front Genet. 2019;10:1110. pmid:31827487
- 36. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. p. 7132–41.
- 37. Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 11534–42.
- 38. Jaderberg M, Simonyan K, Zisserman A. Spatial transformer networks. Adv Neural Inf Process Syst. 2015;28.
- 39. Guo C, Szemenyei M, Yi Y, Wang W, Chen B, Fan C. SA-UNet: spatial attention U-Net for retinal vessel segmentation. In: 2020 25th International Conference on Pattern Recognition (ICPR). 2021. p. 1236–42. https://doi.org/10.1109/icpr48806.2021.9413346
- 40. Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM. Medical transformer: gated axial-attention for medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I. 2021. p. 36–46.
- 41. Sinha A, Dolz J. Multi-scale self-guided attention for medical image segmentation. IEEE J Biomed Health Inform. 2021;25(1):121–30. pmid:32305947
- 42. Chen B, Liu Y, Zhang Z, Lu G, Kong AWK. Transattunet: multi-level attention-guided u-net with transformer for medical image segmentation. IEEE Trans Emerg Topics Comput Intell. 2023.
- 43. Guo MH, Lu CZ, Hou Q, Liu Z, Cheng MM, Hu SM. Segnext: rethinking convolutional attention design for semantic segmentation. Adv Neural Inf Process Syst. 2022;35:1140–56.
- 44. Fan DP, Ji GP, Zhou T, Chen G, Fu H, Shen J, et al. Pranet: parallel reverse attention network for polyp segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2020. p. 263–73.
- 45. Woo S, Park J, Lee JY, Kweon IS. Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. p. 3–19.
- 46. Cai Y, Wang Y. Ma-unet: an improved version of unet based on multi-scale and attention mechanism for medical image segmentation. In: Third International Conference on Electronics and Communication; Network and Computer Technology (ECNCT 2021). 2022. p. 205–11.
- 47. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. MobileNetV2: inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 4510–20. https://doi.org/10.1109/cvpr.2018.00474
- 48. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. Springer; 2015. p. 234–41.