Abstract
Medical image segmentation plays an important role in medical diagnosis and treatment. Most recent medical image segmentation methods are based on a convolutional neural network (CNN) or Transformer model. However, CNN-based methods are limited by locality, whereas Transformer-based methods are constrained by the quadratic complexity of attention computations. Alternatively, the state-space model-based Mamba architecture has garnered widespread attention owing to its linear computational complexity for global modeling. However, Mamba and its variants are still limited in their ability to extract local receptive field features. To address this limitation, we propose a novel residual spatial state-space (RSSS) block that enhances spatial feature extraction by integrating global and local representations. The RSSS block combines the Mamba module for capturing global dependencies with a receptive field attention convolution (RFAC) module to extract location-sensitive local patterns. Furthermore, we introduce a residual adjust strategy to dynamically fuse global and local information, improving spatial expressiveness. Based on the RSSS block, we design a U-shaped SA-UMamba segmentation framework that effectively captures multi-scale spatial context across different stages. Experiments conducted on the Synapse, ISIC17, ISIC18 and CVC-ClinicDB datasets validate the segmentation performance of our proposed SA-UMamba framework.
Citation: Liu L, Huang Z, Wang S, Wang J, Liu B (2025) SA-UMamba: Spatial attention convolutional neural networks for medical image segmentation. PLoS One 20(6): e0325899. https://doi.org/10.1371/journal.pone.0325899
Editor: Panos Liatsis, Khalifa University of Science and Technology, UNITED ARAB EMIRATES
Received: September 7, 2024; Accepted: May 19, 2025; Published: June 12, 2025
Copyright: © 2025 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code to replicate this study’s findings is available at https://github.com/xiaomengxin123/SA-UMamba. The data sets used in this study are publicly available as follows. The Synapse dataset at https://www.synapse.org/Synapse:syn3193805/files/; the ISIC17 dataset at https://challenge.isic-archive.com/data/#2017; the ISIC18 dataset at https://challenge.isic-archive.com/data/#2018; and the CVC-ClinicDB dataset at https://polyp.grand-challenge.org/CVCClinicDB/. We confirm that others would be able to access these data and the authors did not have any special access privileges.
Funding: This work was supported in part by the Natural Science Foundation of Hebei Province (No. F2022201013); the Scientific Research Program of Anhui Provincial Ministry of Education (No. 2024AH051686); the Science and Technology Program of Huaibei (No. 2023HK037); the Anhui Shenhua Meat Products Co., Ltd. Cooperation Project (No. 22100084); and the Entrusted Project by Huaibei Mining Group (2023). There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Medical segmentation plays a crucial role in modern clinical applications, such as assisting in diagnosis, helping to formulate treatment plans, and guiding the implementation of treatment methods [1–3]. In practice, the analysis of medical segmentation results relies on experienced doctors. However, owing to subjective and objective differences in doctors’ judgments [4], it is difficult to obtain assessments that are both rapid and accurate, leading to discrepancies between the analysis results and reality. Therefore, accurate and fast medical image segmentation methods are very important, as they can not only enhance diagnostic efficiency [5], but also ensure the accuracy of the results.
In recent years, convolutional neural network (CNN)- and Transformer-based deep learning methods [6–9] have significantly contributed to advancing the field of medical image segmentation. Among them, U-Net [6], as a representative CNN method, has been applied in subsequent medical image segmentation studies because of its simple framework. Many subsequent studies [11–14] have adopted this U-shaped structure and achieved promising results. However, CNN-based methods [7, 8, 11] are limited by their local receptive fields, which prevent them from effectively modeling long-range dependencies. Inspired by the success of Transformer-based methods [15–17] on natural language processing (NLP) and image tasks, TransUNet [9] was the first to apply a Vision Transformer (ViT) to medical image segmentation, demonstrating the powerful performance of Transformers in this domain. Subsequently, TransFuse [18] was designed with a ViT [16] in the encoder stage and a CNN in the decoder stage, capturing both global and local features simultaneously. Additionally, Swin-Unet [10] employs the Swin Transformer [17] in both the encoder and decoder stages; it was demonstrated to outperform pure convolutional and hybrid Transformer-based methods. Although Transformer-based methods excel in global modeling, their self-attention mechanism introduces quadratic computational complexity. Particularly in medical image segmentation, as image resolution and data volume increase, the computational requirements of the model significantly increase, thereby creating a substantial computational burden [16, 19, 20]. Therefore, there is an urgent need for a new medical image segmentation architecture that can effectively capture long-range dependencies while maintaining linear computational complexity, thus mitigating the rising computational cost.
Recently, state-space models (SSMs) [21, 22] have demonstrated their effectiveness in long-sequence modeling. Unlike Transformers, their computational complexity scales linearly with the sequence length. Mamba [23] emerged as a result of applying an SSM to end-to-end neural networks, retaining the abilities of the SSM while significantly enhancing the inference capability beyond that feasible for Transformers. Owing to Mamba’s success in continuous long-data analysis in fields such as NLP and genomic analysis, researchers have recently begun exploring its application in the visual domain [24–31], consequently achieving successful outcomes. Vision Mamba [24] constructed a Mamba framework similar to the ViT structure and was the first to apply Mamba in the visual domain. Subsequently, the VMamba [25] architecture introduced the 2D selective scanning (SS2D) module, which enables image patch scanning in four directions and thus enhances Mamba’s ability to understand spatial information in images. Mamba was then integrated into U-shaped medical image segmentation architectures. Specifically, U-Mamba [26] was the first to apply the original Mamba module to medical image segmentation, whereas VM-UNet [28] and Swin-UMamba [29] leveraged the Swin-Unet architecture and VMamba’s pre-trained weights, outperforming methods like Swin-Unet on medical segmentation datasets and thus showcasing strong competitiveness. Recently, some Mamba-based methods [29, 30] have incorporated spatial and channel attention; however, they focus on global spatial or channel attention rather than concentrating on spatial attention within the receptive field, which limits their ability to effectively capture fine-grained spatial features in the local context. Furthermore, co-optimization of the feature extraction capability of Mamba and spatial attention remains a challenge.
To comprehensively address the limitations of existing methods in balancing global context and local spatial detail, we propose a novel residual spatial state-space (RSSS) block as an extension of the VSS block from VMamba. The RSSS block integrates a receptive-field attention convolution (RFAC) module to enhance spatial awareness by capturing position-specific features within the receptive field. We have also incorporated learnable parameters, positioning them at both ends of the VSS block and spatial module to act as adjustment factors for the residual connections and to co-optimize the Mamba-derived features and spatial attention. Based on the RSSS block, we construct a U-shaped architecture similar to VM-UNet, named SA-UMamba, which enhances segmentation performance. Briefly, the contributions of this paper are as follows:
- We propose an RSSS block, which efficiently integrates Mamba with the RFAC module to simultaneously capture global dependencies and spatially diverse local features;
- We have introduced learnable parameters to optimize the residual connections within the RSSS block and more effectively balance the Mamba-derived features and spatial attention to realize more comprehensive feature capture;
- We build SA-UMamba based on the RSSS block, consequently improving the Dice similarity coefficient (DSC) performance on the Synapse dataset, with the DSC increasing from 80.50% to 82.54%, and Hausdorff distance (HD95) decreasing from 22.37 mm to 16.80 mm. This demonstrates the leading effectiveness of the Mamba method on Synapse, as well as its competitiveness with the latest Transformer-based methods.
Related works
U-like medical image segmentation
Most early works in medical image segmentation were CNN-based methods inspired by U-Net [6]. Many subsequent works [7, 32–35] also applied this architecture, such as ResUNet [7] and UNet++ [32]. These methods were also extended to the field of 3D medical image segmentation, employing models such as the 3D U-Net [36] and V-Net [37]. These works benefit from the strong feature extraction capability of convolution, but they are also limited by the inability of convolution to extract global features. Initially, Transformers were only applied for translation tasks in NLP [15]. However, researchers recently began successfully applying them to the visual domain [16, 17], offering a new approach to medical image segmentation. For example, TransUNet [9] combines Transformer and CNN models to aggregate global and local feature information. Swin-Unet [10] employs Transformer blocks to form a complete Transformer segmentation framework on the basis of the U-shaped structure, achieving excellent segmentation performance based on global context. However, the attention mechanism in Transformers involves a high level of computational complexity, which means that Transformer-based methods require expensive hardware resources. Recently, the emergence of Mamba, with its linear computational complexity, has attracted widespread interest from researchers. Thus, Mamba was quickly adopted in the field of medical image segmentation, and Mamba-based models [23, 38] like VM-UNet [28], VM-UNetV2 [30] and Swin-UMamba [29] have achieved excellent segmentation performance in the medical segmentation field, offering a promising solution to the computational challenges of Transformer-based methods. Despite Mamba’s promising potential, its current applications remain limited, particularly in fully leveraging spatial attention mechanisms within the receptive field. 
Our work addresses this gap by introducing an optimized approach to integrate spatial attention within the receptive field, enhancing Mamba’s feature extraction capability for more precise segmentation.
Attention mechanism
As an effective means of enhancing model performance, the attention mechanism allows models to focus more on key features. It is widely used in the fields of vision and NLP, where a relatively complete system has been established. Wang et al. [39] proposed a convolutional block attention module (CBAM), which utilizes channel and spatial attention modules to capture both channel and spatial features. As a portable module, it not only enhances network performance, but can also be easily applied to CNNs. Although the CBAM has already demonstrated good performance, Hou et al. [40] identified information loss within the CBAM and proposed the lightweight coordinate attention (CA) mechanism to address this problem. Zhang et al. [41] argued that previously developed attention mechanisms do not consider multi-scale feature information in channels and space. Thus, they proposed the PSA and SPC modules to enable multi-scale feature extraction in the channel and spatial dimensions, respectively. Although existing attention mechanisms can effectively enhance local feature representation, they often overlook the variations among different spatial positions within the local receptive field. Inspired by RFAC [42], we explored a new approach to spatial attention (referred to as spatial attention in this paper), using receptive-field spatial features to capture fine-grained positional variations, thereby further improving network performance. Additionally, to co-optimize spatial features and Mamba-derived features and thus enhance feature extraction capability, we introduced adjustment factors, which also serve to enhance the comprehensiveness of the model. Unlike most existing approaches [30, 45] that simply add attention blocks, our work establishes receptive-field spatial attention as an effective method that focuses on balancing these features, ensuring a more comprehensive and effective integration that builds upon current research.
Methods
In this section, we first introduce the concept of Mamba, and then describe our network structure in detail. Fig 1 shows the network structure of our proposed SA-UMamba, which employs an encoder-decoder architecture. Note that the encoder and decoder networks both comprise Mamba-based RSSS blocks.
Mamba preliminaries
Current SSMs, namely, structured state space sequence models (S4) and Mamba, both rely on continuous systems. These systems map a 1D input function or sequence (denoted as $x(t) \in \mathbb{R}$) to the output $y(t) \in \mathbb{R}$ through an intermediate hidden state function $h(t) \in \mathbb{R}^{N}$. This process can be represented by a linear ordinary differential equation:

$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t) + \mathbf{D}x(t), \tag{1}$$

where $\mathbf{A} \in \mathbb{R}^{N \times N}$ represents the state matrix, $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ represent the linear projection parameters, and $\mathbf{D} \in \mathbb{R}$ represents the skip connection.
Mamba and S4 discretize the continuous system to better adapt it to deep learning environments. They introduce the time-scale parameter $\Delta$ and apply consistent discretization rules to convert $\mathbf{A}$ and $\mathbf{B}$ into the discrete parameters $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$, respectively. The discretization method applies the zero-order hold approach, specifically represented as follows:

$$\overline{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \overline{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\big(\exp(\Delta\mathbf{A}) - \mathbf{I}\big)\,\Delta\mathbf{B}. \tag{2}$$
After discretization, SSM-based models (S4 models) can be computed in two different ways: linear recursion or global convolution. These two computation methods are represented by Eqs 3 and 4, respectively, as shown below:

$$h_t = \overline{\mathbf{A}}h_{t-1} + \overline{\mathbf{B}}x_t, \qquad y_t = \mathbf{C}h_t, \tag{3}$$

$$\overline{\mathbf{K}} = \big(\mathbf{C}\overline{\mathbf{B}},\; \mathbf{C}\overline{\mathbf{A}}\overline{\mathbf{B}},\; \ldots,\; \mathbf{C}\overline{\mathbf{A}}^{L-1}\overline{\mathbf{B}}\big), \qquad y = x * \overline{\mathbf{K}}, \tag{4}$$

where $\overline{\mathbf{K}} \in \mathbb{R}^{L}$ represents the structured convolutional kernel, and $L$ denotes the length of the input sequence $x$.
However, the SSM (S4) cannot make adjustments based on the input content, which significantly limits its ability to capture context and makes it difficult to achieve the same effect as Transformers. In contrast, the selective SSM, also referred to as Mamba or S6, improves upon this by introducing input-dependent parameters, i.e., $\mathbf{B} = S_B(x)$, $\mathbf{C} = S_C(x)$, and $\Delta = S_\Delta(x)$. This allows the model to adapt to more complex inputs and enables Mamba to maintain efficient training and inference through the scan algorithm.
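The two computation modes above can be checked against each other in a few lines. Below is a minimal NumPy sketch, under the assumption of a diagonal state matrix (as S4 and Mamba use in practice) and hypothetical toy dimensions; it applies the zero-order hold discretization and verifies that linear recursion and global convolution produce the same output:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                              # state size and sequence length (toy values)
a = -np.abs(rng.standard_normal(N))       # diagonal of the state matrix A (kept stable)
B = rng.standard_normal(N)
C = rng.standard_normal(N)
delta = 0.1                               # time-scale parameter

# Zero-order hold discretization, diagonal case:
# A_bar = exp(delta*A), B_bar = (delta*A)^-1 (exp(delta*A) - I) delta*B
A_bar = np.exp(delta * a)
B_bar = (A_bar - 1.0) / a * B

x = rng.standard_normal(L)

# (a) Linear recursion: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C @ h

# (b) Global convolution with the structured kernel K_bar = (C B_bar, C A_bar B_bar, ...)
K = np.array([C @ (A_bar**k * B_bar) for k in range(L)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

print(np.allclose(y_rec, y_conv))  # the two computation modes agree
```

The recursion is what enables O(L) inference, while the equivalent convolution form enables parallel training; the selective (S6) variant replaces the fixed B, C, and delta with input-dependent functions, which breaks the convolution form and motivates the scan algorithm.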
SA-UMamba framework
The overall architecture of SA-UMamba is shown in Fig 1 (a). SA-UMamba adopts an asymmetric design similar to VM-UNet [28]. Specifically, SA-UMamba comprises a patch embedding layer, an encoder, a decoder, a final projection layer, and skip connections, and the number of blocks in each encoder and decoder stage is determined based on VM-UNet.
First, the input image is fed into the patch embedding layer. The patch embedding layer divides the image into non-overlapping 4 × 4 patches and applies a linear projection layer to expand the channels of each patch to $C$, which was initially set to 96. Through this process, the original image is reshaped into a feature map $X \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$, where $H$ and $W$ are the original height and width of the input image, respectively. Next, layer normalization [43] is applied to $X$, and the normalized $X$ is fed to the encoder for feature extraction.
The encoder comprises four stages, each using RSSS blocks to further extract features from the input. Additionally, in the first three stages, we apply patch-merging operations to process the input features, reducing the width and height by half while also doubling the number of channels. We apply [2,2,2,2] RSSS blocks in the four stages of the encoder, with the number of channels for each stage established as [C,2C,4C,8C].
The decoder also comprises four stages, similar to the encoder, using RSSS blocks to process the features. However, unlike the encoder, which processes features in the first three stages, the decoder uses patch-expanding operations in the last three stages to double the width and height of the feature maps while also halving the number of channels. In these four stages, we apply [2,2,2,1] RSSS blocks, with the number of channels for each stage established as [8C,4C,2C,C]. After being processed by the decoder and final linear layer, the width, height, and number of channels of the feature map are restored to the original image size, and this feature map is used to match the segmentation targets.
Standard residual connections are applied for the skip connections. Before the features are fed to the RSSS blocks of each encoder stage, skip connections are established at the corresponding positions in the decoder. These skip connections help to transmit information from the early stages of the encoder to the subsequent decoder stages, facilitating the network’s ability to retain and utilize these low-level features.
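The stage-wise feature-map shapes implied by this design can be traced with simple arithmetic. The sketch below assumes a hypothetical 224 × 224 input (the exact input size depends on the dataset) and the stated initial channel count C = 96:

```python
# Trace encoder/decoder feature-map shapes (hypothetical 224x224 input, C = 96).
H = W = 224
C = 96

# Patch embedding: non-overlapping 4x4 patches -> (H/4, W/4, C)
enc = [(H // 4, W // 4, C)]
for _ in range(3):                     # patch merging in the first three encoder stages
    h, w, c = enc[-1]
    enc.append((h // 2, w // 2, 2 * c))  # halve spatial size, double channels

dec = [enc[-1]]
for _ in range(3):                     # patch expanding in the last three decoder stages
    h, w, c = dec[-1]
    dec.append((2 * h, 2 * w, c // 2))   # double spatial size, halve channels

print(enc)  # [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
print(dec)  # [(7, 7, 768), (14, 14, 384), (28, 28, 192), (56, 56, 96)]
```

The matching shapes at corresponding encoder and decoder stages are what make the skip connections dimensionally consistent.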
Residual spatial state-space block
Previous Mamba-based medical image segmentation studies [28–30] generally adopted VSS blocks as the backbone for the encoder and decoder of the U-shaped network structure, with the VSS block design taken from VMamba [25]. Considering the results of various Transformer-based studies [44, 45], we believe that the features extracted by the VSS block can be further enhanced through additional spatial attention mechanisms. Therefore, a new structure specifically designed for Mamba-based medical image segmentation networks holds promise.
Thus, we propose the RSSS block as an enhancement of the Mamba-based VSS block. Specifically, we have applied RFAC as a new paradigm for spatial attention to enhance Mamba’s ability to extract spatial features. Our key idea is to facilitate feature co-optimization between global Mamba-derived spatial representations and local spatial attention and this is achieved by embedding residual connections with learnable adjustment factors, enabling adaptive information fusion.
As shown in Fig 1 (b), given the input feature $X^{l-1}$, we first apply LayerNorm for normalization, and then use the vision state-space module (VSSM) to capture long-term spatial dependencies. Additionally, to co-optimize the Mamba-derived spatial features and spatial attention features, we apply the learnable adjustment factor $s_1$ (denoted as "scale" in Fig 1 (b)) as a means to control the information from the residual connection. This operation allows the network to control the relative contribution of the original input to the enhanced representation and can be represented by the following equation:

$$X^{l}_{1} = \mathrm{VSSM}\big(\mathrm{LN}(X^{l-1})\big) + s_1 \cdot X^{l-1}, \tag{5}$$

where $l$ denotes the $l$-th RSSS block and LN represents the LayerNorm process. Subsequently, $X^{l}_{1}$ is further normalized and fed into the RFAC module, which performs local spatial attention refinement. To integrate these spatially refined features with the global representation from the VSSM, we introduce a second learnable scaling parameter $s_2$ in the residual connection, enabling dynamic co-optimization of global and local cues. This process can be represented by the following equation:

$$X^{l} = \mathrm{RFAC}\big(\mathrm{LN}(X^{l}_{1})\big) + s_2 \cdot X^{l}_{1}. \tag{6}$$
Through this two-stage integration strategy, the RSSS block explicitly models and balances the interaction between Mamba modeling and spatial attention enhancement, making it more adaptive and expressive for medical image segmentation tasks. Next, we provide a detailed explanation of the two core components that constitute the RSSS block: the Vision State-Space Module (VSSM) and the Receptive-Field Attention Convolution (RFAC) module.
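The two-stage residual structure described above can be sketched in a few lines. In this NumPy sketch, the VSSM and RFAC modules are replaced by identity stand-ins, since only the scaled-residual wiring (the learnable factors s1 and s2) is being illustrated:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm over the channel (last) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rsss_block(x, vssm, rfac, s1=1.0, s2=1.0):
    # Stage 1: global modeling with a learnable-scaled residual (s1)
    x1 = vssm(layer_norm(x)) + s1 * x
    # Stage 2: receptive-field spatial attention with a second scaled residual (s2)
    return rfac(layer_norm(x1)) + s2 * x1

x = np.random.default_rng(0).standard_normal((8, 8, 96))
out = rsss_block(x, vssm=lambda t: t, rfac=lambda t: t)  # identity stand-ins
print(out.shape)  # shape is preserved: (8, 8, 96)
```

In the real block, s1 and s2 are trained jointly with the rest of the network, letting it weigh the untouched input against the VSSM and RFAC outputs at each depth.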
Vision state-space module
Inspired by Mamba’s success in maintaining linear computational complexity while also modeling long-range dependencies, as previously mentioned, we constructed the VSSM to extract features from Mamba. As can be seen in Fig 1 (c), we followed the VMamba methodology [25] and chose to process the input features through two parallel branches.
In the first branch, we initially expand the channel number of the features through a linear layer to $\lambda$ times that of the original, i.e., $\lambda C$, where $\lambda$ is a predefined channel expansion factor. Then, the features pass through a 3 × 3 depth-wise convolutional layer [46] followed by the SiLU [47] activation function. Subsequently, the output of the SS2D module is normalized by using LayerNorm and combined with the information from the other branch.

In the second branch, the channel number of the features is expanded to $\lambda C$, using the same technique as in the first branch, followed by processing via the SiLU activation function. Then, the features from the first and second branches are combined by using the Hadamard product to aggregate the information. Lastly, a linear layer projects the channel number back to the original input feature channels $C$, yielding the output $X_{out}$ with the same shape as the input. This process can be represented by the following equation:

$$X_{out} = \mathrm{Linear}\Big(\mathrm{LN}\big(\mathrm{SS2D}(\mathrm{SiLU}(\mathrm{DWConv}(\mathrm{Linear}(X))))\big) \odot \mathrm{SiLU}\big(\mathrm{Linear}(X)\big)\Big). \tag{7}$$
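The two-branch flow described above can be sketched as follows. The depth-wise convolution and SS2D module are identity stand-ins here (only the branch wiring and shapes are illustrated), and the expansion factor lam = 2 is an assumed value:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def vssm(x, W1, W2, W_out, dwconv, ss2d):
    b1 = layer_norm(ss2d(silu(dwconv(x @ W1))))  # branch 1: expand -> DWConv -> SiLU -> SS2D -> LN
    b2 = silu(x @ W2)                            # branch 2: expand -> SiLU (gating branch)
    return (b1 * b2) @ W_out                     # Hadamard product, then project back to C

rng = np.random.default_rng(0)
C, lam = 96, 2                                   # lam: channel expansion factor (assumed)
x = rng.standard_normal((8, 8, C))
W1 = rng.standard_normal((C, lam * C)) * 0.01
W2 = rng.standard_normal((C, lam * C)) * 0.01
W_out = rng.standard_normal((lam * C, C)) * 0.01

out = vssm(x, W1, W2, W_out, dwconv=lambda t: t, ss2d=lambda t: t)  # identity stand-ins
print(out.shape)  # back to the input shape: (8, 8, 96)
```

The Hadamard product acts as a gate: the second branch modulates, position by position and channel by channel, the globally contextualized features produced by the first branch.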
2D selective scan module
The original Mamba is a unidirectional model that can only process 1D information and is thus substantially limited in handling 2D image information. To adapt Mamba for visual tasks and better utilize 2D spatial information, we chose to apply the SS2D module from VMamba as the foundational component of the VSSM, as shown in Fig 2.
(a) SS2D expansion operation; (b) Core component of Mamba (S6); (c) SS2D merging operation.
Specifically, the SS2D module obtains image patches by unfolding the 2D image in four directions (i.e., from top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right). These patches are then processed into four different 1D sequences, each of which is processed by an SSM (S6). Lastly, the sequences from the different directions are merged to reestablish a complete 2D feature map. Given the input feature $X \in \mathbb{R}^{H \times W \times C}$, SS2D processing can be represented by the following equation:

$$X_v = \mathrm{Expand}(X, v), \qquad \overline{X}_v = \mathrm{S6}(X_v), \qquad X_{out} = \mathrm{Merge}\big(\overline{X}_1, \overline{X}_2, \overline{X}_3, \overline{X}_4\big), \tag{8}$$

where $v \in \{1, 2, 3, 4\}$ represents the four different scanning directions; Expand, S6, and Merge denote the expansion, scanning, and merging processes, respectively. The S6 module captures long-range information for each sequence according to Eqs 1 and 2, and $X_{out} \in \mathbb{R}^{H \times W \times C}$ represents the output feature map of the SS2D module. For more details regarding S6, refer to [25].
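The expand/merge bookkeeping can be made concrete with a small NumPy sketch. As an assumed simplification, the four traversal orders are modeled as a row-major scan, a column-major scan, and their reverses; with an identity stand-in for S6, merging the four sequences should recover the input summed four times:

```python
import numpy as np

def expand(x):
    # Four 1D sequences from an (H, W, C) map: row-major, column-major, and reverses
    h, w, c = x.shape
    f = x.reshape(h * w, c)                        # row-major scan
    g = x.transpose(1, 0, 2).reshape(h * w, c)     # column-major scan (assumed ordering)
    return [f, f[::-1], g, g[::-1]]

def merge(seqs, h, w, c):
    # Invert each traversal order, then sum the four direction-wise outputs
    f = seqs[0] + seqs[1][::-1]
    g = (seqs[2] + seqs[3][::-1]).reshape(w, h, c).transpose(1, 0, 2)
    return f.reshape(h, w, c) + g

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 3))
s6 = lambda seq: seq                               # identity stand-in for the S6 scan
out = merge([s6(s) for s in expand(x)], 4, 5, 3)
print(np.allclose(out, 4 * x))  # True: four scans, inverted and summed back
```

In the actual SS2D module, each of the four sequences passes through its own selective scan before merging, so every spatial position aggregates context from four causal directions.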
RFAC module
Existing standard convolution operations extract features by sliding a convolutional kernel over the input. However, adjacent sliding windows often overlap, and the kernel parameters are shared across all window positions. Consequently, the convolution operation cannot capture position-related information discrepancies in the image, which constrains its effectiveness. Receptive-field features are the feature maps obtained by transforming the original feature map into non-overlapping sliding windows. The spatial attention mechanism can learn the importance of features at different positions through an attention map. Therefore, effectively combining the spatial attention mechanism with the convolution operation overcomes the limitations of standard convolution and paves the way for further development of spatial attention.
RFAC utilizes non-overlapping windows within the receptive field to extract features, as shown in Fig 3. RFAC multiplies the attention map by the transformed receptive-field features. Specifically, the attention map is computed by first applying average pooling (AvgPool) to aggregate the global information of features within each receptive field, then performing a 1 × 1 convolution to obtain interaction information, and, lastly, applying the softmax function to emphasize the importance of each feature within the receptive field. The transformed receptive-field features are obtained by applying a fast group convolution (Group Conv) operation, which efficiently performs grouped convolutions by splitting channels into groups and processing them in parallel. RFAC can be represented by the following equation:

$$F = \mathrm{Softmax}\big(g^{1 \times 1}(\mathrm{AvgPool}(X))\big) \times \mathrm{ReLU}\big(\mathrm{Norm}(g^{k \times k}(X))\big) = A_r \times F_r, \tag{9}$$

where $A_r$ represents the attention map and $F_r$ represents the transformed receptive-field features obtained via the Group Conv operation, $g^{i \times i}$ indicates a group convolution with kernel size $i \times i$, $k$ indicates the size of the convolutional kernel, and Norm refers to the batch normalization process. Therefore, RFAC can generate a corresponding attention map for each receptive field, which is then multiplied by the convolutional features. This eliminates the insensitivity of standard convolution operations to positional changes within the receptive field, thereby allowing the capture of different features at different positions in the image and the realization of positional information awareness.
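The core idea can be illustrated with a simplified NumPy sketch: split the map into non-overlapping k × k receptive fields, derive a softmax attention over the k² positions of each field, and weight the field features by it. For brevity this sketch derives the attention logits from pooled window context only (the 1 × 1 interaction convolution and the group convolution branch are omitted):

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rfac_sketch(x, k=3):
    # x: (H, W, C) with H and W divisible by k
    h, w, c = x.shape
    win = x.reshape(h // k, k, w // k, k, c).transpose(0, 2, 1, 3, 4)
    win = win.reshape(h // k, w // k, k * k, c)     # non-overlapping receptive fields
    # Attention over the k*k positions of each field (pooled context, softmax-normalized)
    att = softmax(win.mean(axis=-1, keepdims=True), axis=2)
    return att, att * win                           # attention map, weighted features

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))
att, weighted = rfac_sketch(x, k=3)
print(att.shape, weighted.shape)          # (2, 2, 9, 1) (2, 2, 9, 4)
print(np.allclose(att.sum(axis=2), 1.0))  # per-field attention sums to 1
```

Because every receptive field gets its own normalized attention over its k² positions, the same kernel position can be weighted differently at different image locations, which is exactly the position sensitivity standard convolution lacks.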
Loss function
The mask images used in medical image segmentation are binary or multi-class; thus, medical image segmentation can be viewed as a pixel-level classification task. Therefore, we chose to apply binary cross-entropy (Bce) loss, Dice loss, and cross-entropy (Ce) loss as the loss functions for different tasks, as shown in Eq 10. For the Synapse dataset, we applied Dice loss and Ce loss to form the CeDice loss function and extend their applicability to multi-class problems, as shown in Eq 11. For the ISIC dataset, we applied Dice loss and Bce loss to form the BceDice loss function, as shown in Eq 12.

$$\mathrm{Bce} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big], \quad \mathrm{Ce} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log \hat{y}_{i,c}, \quad \mathrm{Dice} = 1 - \frac{2|X \cap Y|}{|X| + |Y|}, \tag{10}$$

$$\mathrm{CeDice} = \lambda_1 \mathrm{Ce} + \lambda_2 \mathrm{Dice}, \tag{11}$$

$$\mathrm{BceDice} = \lambda_1 \mathrm{Bce} + \lambda_2 \mathrm{Dice}, \tag{12}$$

where $N$ represents the total number of samples and $C$ represents the number of classes; $y_i$ and $\hat{y}_i$ represent the ground truth and predicted values, respectively; $y_{i,c}$ indicates whether the $i$-th sample belongs to the class $c$ (1 if sample $i$ belongs to class $c$, and 0 otherwise); $\hat{y}_{i,c}$ is the model’s predicted probability that sample $i$ belongs to class $c$; $X$ and $Y$ denote the ground truth and predicted masks, respectively; and $\lambda_1$ and $\lambda_2$ are the weight parameters of the loss functions, which were initially set to 1 (see [28] for details).
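A minimal NumPy sketch of the combined binary-task loss, with the weights set to 1 as in the text (the smoothing constant and the clipping epsilon are implementation assumptions):

```python
import numpy as np

def bce(y, p, eps=1e-7):
    # Binary cross-entropy; clip to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice(y, p, smooth=1.0):
    # Soft Dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|), with an assumed smoothing term
    inter = (y * p).sum()
    return 1.0 - (2.0 * inter + smooth) / (y.sum() + p.sum() + smooth)

def bce_dice(y, p, lam1=1.0, lam2=1.0):
    # BceDice = lam1 * Bce + lam2 * Dice (both weights set to 1 in the paper)
    return lam1 * bce(y, p) + lam2 * dice(y, p)

y = np.array([1.0, 1.0, 0.0, 0.0])            # toy ground-truth mask
good = np.array([0.99, 0.98, 0.02, 0.01])     # confident, correct prediction
bad = np.array([0.10, 0.20, 0.90, 0.80])      # confident, wrong prediction
print(bce_dice(y, good) < bce_dice(y, bad))   # better predictions yield lower loss
```

Combining a pixel-wise term (Bce/Ce) with a region-overlap term (Dice) is a common way to keep gradients informative for small structures while still optimizing overall overlap.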
Experimental results
Implementation details
All experiments with our proposed SA-UMamba were conducted in an environment with Ubuntu 22.04, Python 3.8, CUDA 12.2, PyTorch 2.1.0, and two RTX 3090 GPUs. Following previous works, we used image sizes of 256 × 256 pixels for the ISIC17, ISIC18 and CVC-ClinicDB datasets, and 224 × 224 pixels for the Synapse dataset. Data augmentation techniques, including random flipping and random rotation, were employed to prevent overfitting. We used the BceDice loss function for the ISIC17, ISIC18 and CVC-ClinicDB datasets and the CeDice loss function for the Synapse dataset. The batch size was set to 32, and the AdamW [48] optimizer was applied. The initial learning rate was set to 10−3, with a minimum learning rate of 10−5, and the CosineAnnealingLR scheduler [49] was employed. The maximum number of iterations was set to 50, and the training epoch size was set to 300. For pre-trained weights, we used the weights of VMamba-S [25] pre-trained on ImageNet-1k to initialize the parameters of the encoder and decoder in our model.
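For reference, the learning-rate trajectory implied by these settings follows the standard cosine annealing formula. A small sketch, assuming the stated maximum of 50 iterations corresponds to the scheduler period T_max:

```python
import math

def cosine_lr(t, t_max=50, eta_max=1e-3, eta_min=1e-5):
    # eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / t_max))
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / t_max))

# Decays smoothly from the initial rate (1e-3) to the minimum rate (1e-5)
print(cosine_lr(0), cosine_lr(25), cosine_lr(50))
```

This matches the behavior of PyTorch's CosineAnnealingLR with T_max = 50 and eta_min = 1e-5 applied to an initial learning rate of 1e-3.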
Applied datasets
In this study, testing involved four datasets commonly used in the medical segmentation field: the Synapse multi-organ segmentation dataset (Synapse), the International Skin Imaging Collaboration 2017 dataset (ISIC17), the International Skin Imaging Collaboration 2018 dataset (ISIC18), and the CVC-ClinicDB dataset.
The Synapse dataset [50] is a publicly available multi-organ segmentation dataset comprising data on eight types of abdominal organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach). It includes 30 abdominal CT cases, comprising 3779 axial CT images. Following previous studies, we applied 18 cases as the training set and 12 cases as the test set. For this dataset, we used the DSC and HD95 as evaluation metrics for the model.
The ISIC17 and ISIC18 datasets [51, 52] are skin lesion segmentation datasets released by the International Skin Imaging Collaboration (ISIC) in 2017 and 2018, respectively. ISIC17 contains 2150 images with segmentation mask labels, and ISIC18 contains 2694 images with segmentation mask labels. Following previous studies [28, 45], we split the datasets into training and testing sets at a 7:3 ratio. For these two datasets, we conducted detailed analyses by using several evaluation metrics, i.e., the mean Intersection over Union (mIoU), DSC, accuracy (Acc), specificity (Spe), and sensitivity (Sen).
The CVC-ClinicDB dataset [53] is the training dataset for the Colonoscopy Polyp Detection Challenge of the MICCAI 2015 competition. It is primarily used for the early detection and diagnosis of colorectal cancer and consists of 612 high-resolution colonoscopy images. In our work, we followed the approach used in BRAU-Net++ [45] and randomly split the dataset into 490 images for training, 61 images for validation, and 61 images for testing. We evaluated performance on the CVC-ClinicDB dataset using several metrics, including the mean Intersection over Union (mIoU), Dice similarity coefficient (DSC), accuracy (Acc), precision (Pre), and recall.
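The evaluation metrics used for the binary segmentation datasets all derive from the confusion matrix of a predicted mask against the ground truth. A minimal sketch (the toy masks are purely illustrative; for mIoU, the per-class IoU shown here is averaged over classes or images):

```python
import numpy as np

def seg_metrics(y, p):
    # y, p: binary ground-truth and predicted masks (0/1 arrays)
    tp = np.sum((y == 1) & (p == 1))
    tn = np.sum((y == 0) & (p == 0))
    fp = np.sum((y == 0) & (p == 1))
    fn = np.sum((y == 1) & (p == 0))
    return {
        "IoU": tp / (tp + fp + fn),              # averaged over classes/images -> mIoU
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Spe": tn / (tn + fp),
        "Sen": tp / (tp + fn),                   # sensitivity == recall
    }

y = np.array([1, 1, 0, 0])
p = np.array([1, 0, 0, 0])
m = seg_metrics(y, p)
print(m["DSC"])  # 2*1 / (2*1 + 0 + 1) ≈ 0.667
```

Note that DSC weights true positives twice relative to IoU, so DSC is always at least as large as IoU for the same prediction.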
Segmentation results for Synapse dataset
The experimental results show that our SA-UMamba outperformed the CNN- and Transformer-based methods, as well as the latest CNN-Transformer hybrid methods, on both evaluation metrics. It also outperformed our baseline model, VM-UNet, indicating that applying Mamba with spatial attention can result in more efficient modeling. As shown in Table 1, relative to the Transformer-based methods TransUNet [9] and Swin-Unet [10], SA-UMamba significantly increased the DSC by 6.53% and 4.31%, and reduced HD95 by 14.89 and 4.75 mm, respectively. Relative to the Mamba-based VM-UNet method, SA-UMamba increased the DSC by 2.49% and reduced HD95 by 5.57 mm. This demonstrates that Mamba’s global modeling capability is similar to that of Transformers, and that the introduction of spatial attention helps Mamba to more effectively learn spatial semantic information.
Specifically, SA-UMamba outperformed other methods on most organ segmentation tasks, especially the Gallbladder and Kidney (R) images. As shown in Table 1, our SA-UMamba achieved the best DSC of 82.53%. Moreover, compared with the latest CNN-Transformer hybrid method BRAU-Net++ [45], SA-UMamba achieved a lower HD95, demonstrating a level of performance that is comparable to that of the latest Transformer-based methods.
Fig 4 enables visual analysis of the results for different methods on the Synapse dataset. As shown in Fig 4, our method yielded smoother segmentation results than other methods, as the final images closely resembled the GT images. Even under the condition of small target features, SA-UMamba was able to leverage spatial features to more effectively learn global and long-range semantic information, resulting in superior segmentation performance.
Segmentation results for ISIC dataset
For the ISIC datasets, we compared SA-UMamba with existing methods. As shown in Table 2, our method outperformed the previous state-of-the-art Mamba model VM-UNet, as well as recent CNN and Transformer models, across various metrics (i.e., mIoU, DSC, and Acc) on the ISIC17 and ISIC18 datasets. Relative to VM-UNet, our proposed SA-UMamba achieved improvements of 0.785% on ISIC17 and 0.746% on ISIC18. Although these improvements are modest, they demonstrate the effectiveness and robustness of our method on different datasets.
Segmentation results for CVC-ClinicDB dataset
For the CVC-ClinicDB dataset, we also compared SA-UMamba with existing methods, as shown in Table 3. SA-UMamba outperformed most methods, particularly in DSC (90.41%) and precision (90.59%), surpassing TransUNet and Attention U-Net. Although its recall (86.37%) is slightly lower than that of Attention U-Net (90.10%), SA-UMamba strikes a strong balance between precision and recall. Compared with the Mamba-based VM-UNet, our approach improved all performance metrics on this dataset, demonstrating that our work significantly enhances the stability of Mamba-based methods. Overall, SA-UMamba shows superior performance across the key metrics, particularly precision, and proves to be a more robust and promising approach for medical image segmentation.
Ablation study
To investigate the impact of various factors on model performance, we conducted ablation studies on the Synapse dataset. Specifically, we independently evaluated the effectiveness of the pre-trained weights, the RSSS block design choices, the attention mechanism, and the encoder-decoder configuration.
Effectiveness of pre-trained weights.
Our framework was built on that of VMamba, for which pre-trained model weights of three different sizes are available. To explore the impact of the different VMamba pre-trained weights on SA-UMamba, we conducted an ablation study under four configurations: training the model from scratch and initializing it with each of the three scales of pre-trained VMamba weights. The model without pre-trained weights is referred to as Original, whereas the models initialized with the three sizes of pre-trained weights are referred to as Tiny, Small, and Base.
Table 4 details the experimental results for the different pre-trained weights on the Synapse dataset. As shown in Table 4, using pre-trained weights led to more favorable results than training from scratch. Specifically, the Small model demonstrated significant improvement over Original, with the DSC increasing by 5.48% and the HD95 decreasing by 28.87%, and it achieved the best DSC and HD95 overall. However, the metrics for the Base model declined somewhat relative to the Small model. This indicates that the pre-trained weights significantly influence the performance of the Mamba model. Thus, for SA-UMamba, initialization with the Small-sized pre-trained weights is optimal.
Effectiveness of RSSS block design choice.
The influence of the RSSS block, as the core module of SA-UMamba, can vary according to its components. Thus, we removed different components within the block to investigate their effects on the model.
In Table 5, Baseline denotes the model without any additional components. Scale only uses only the residual balancing factor; on its own, it provides no significant improvement and may even have a negative impact. RFAC only uses only the RFAC module, which yields moderate improvements in model performance. When the two components are combined (Scale+RFAC (Ours)), the model improves significantly, indicating that optimal performance of a Mamba-based medical segmentation model requires co-optimization of the spatial attention and the Mamba-derived features.
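The residual adjust strategy evaluated here can be sketched as follows; the branch functions are stand-ins for the actual Mamba and RFAC modules, and the exact fusion form is illustrative rather than our implementation:

```python
import numpy as np

def rsss_fuse(x, mamba_branch, rfac_branch, scale):
    """Residual adjust sketch: a learnable scalar `scale` balances the
    global (Mamba) branch against the local (RFAC) branch before the
    residual connection. Branch functions are placeholders for the
    real modules; only the fusion pattern is illustrated."""
    g = mamba_branch(x)        # global, long-range features
    l = rfac_branch(x)         # location-sensitive local features
    return x + g + scale * l   # scale co-optimizes the two paths
```

With identity-like stand-in branches, the fusion reduces to a simple weighted residual sum, which is what the learnable balancing factor in Table 5 controls.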
Operation of the attention mechanism.
Attention mechanisms are a simple and effective way to improve model performance, and our design was inspired by Transformer-based medical segmentation methods that employ different attention computation schemes. We therefore replaced the RFAC spatial attention with several common spatial and channel attention modules to evaluate the influence of the attention computation method on our model. The results are shown in Table 6.
Table 6 shows that channel attention [59] and the lightweight CA [40] afforded moderate performance gains, whereas CBAM [39] negatively impacted model performance. Choosing an appropriate attention computation method is therefore crucial. For Mamba-based medical segmentation tasks, we believe that RFAC [42] not only mitigates the locality problem of CNNs but also emphasizes the feature differences within each receptive field by using adjustment factors to co-optimize the spatial attention and Mamba-derived features. Consequently, RFAC contributes significantly to enhancing Mamba’s performance.
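The idea behind receptive-field attention can be sketched in a simplified, single-channel form as follows; the full RFAC module [42] derives its attention weights from pooled features and feeds the weighted receptive fields to a standard convolution, so this is only an illustration of the softmax-over-receptive-field weighting:

```python
import numpy as np

def receptive_field_attention(x, k=3):
    """Single-channel sketch of receptive-field attention: each k x k
    receptive field gets its own softmax weights, so different fields
    are weighted differently (unlike a shared conv kernel)."""
    H, W = x.shape
    p = k // 2
    xp = np.pad(x, p, mode="edge")            # replicate-pad the borders
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]      # the local receptive field
            w = np.exp(patch - patch.max())   # numerically stable softmax
            w /= w.sum()
            out[i, j] = (w * patch).sum()     # attention-weighted aggregation
    return out
```

Because each output value is a convex combination of its receptive field, the result stays within the input's value range while emphasizing the dominant features of each field.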
Encoder-decoder architectures.
Given that the number of RSSS blocks in both the encoder and decoder can significantly influence model performance, we conducted an ablation study to evaluate how variations in the number of these blocks affect the model’s effectiveness. The results are shown in Table 7.
As shown in Table 7, increasing the number of RSSS blocks in both the encoder and decoder raises model complexity, resulting in a higher parameter count and computational load. However, segmentation performance declines as complexity grows. For example, when the configuration changes from {2,2,2,2}–{2,2,2,1} to {2,2,9,2}–{2,2,2,1}, the DSC decreases from 82.54% to 81.74% and the HD95 increases from 16.80 mm to 20.60 mm. This indicates that the added complexity yields no corresponding performance gain and may even harm the model’s generalization ability. Additionally, we observed a clear performance advantage of the asymmetric structure over the symmetric one. Based on these findings, we designed SA-UMamba with an asymmetric structure; specifically, {2,2,2,2} RSSS blocks were assigned to the encoder and {2,2,2,1} to the decoder.
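The selected asymmetric configuration can be expressed as a simple per-stage depth specification; the names below are illustrative, not our implementation:

```python
# Hypothetical configuration sketch matching the {2,2,2,2}-{2,2,2,1}
# setting selected in Table 7 (identifiers are illustrative).
ENCODER_DEPTHS = (2, 2, 2, 2)
DECODER_DEPTHS = (2, 2, 2, 1)

def build_stages(depths, side):
    """Expand per-stage depths into a flat list of block labels."""
    return [f"{side}-stage{s}-rsss{b}"
            for s, d in enumerate(depths)
            for b in range(d)]

# 8 encoder blocks + 7 decoder blocks = 15 RSSS blocks in total
blocks = build_stages(ENCODER_DEPTHS, "enc") + build_stages(DECODER_DEPTHS, "dec")
```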
Comparison of Computational Costs with SOTA methods.
To further validate the effectiveness of our proposed method, we compared it with several SOTA models in terms of parameter count and computational complexity, as summarized in Table 8. Although our model has more parameters than the conventional convolutional U-Net and the Transformer-based Swin-Unet, it achieves significantly lower FLOPs while attaining superior segmentation performance in both the DSC and HD95 metrics. Compared with VM-UNet, a pure Mamba-based model, our method introduces slightly more parameters and computational cost owing to the integration of convolutional operations; however, this design enhances the feature extraction capability and contributes to the best overall performance. Furthermore, compared with BRAU-Net++, which combines convolution and Transformer modules, our model reduces the computational burden to nearly half while delivering a better DSC. These results demonstrate that, by leveraging the high efficiency of the Mamba architecture along with careful architectural design, our method achieves a favorable trade-off between performance and computational cost.
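Parameter and FLOP comparisons such as those in Table 8 are typically accumulated layer by layer. As an illustration, the cost of a single standard convolution layer can be estimated as follows, using the common convention FLOPs ≈ 2 × multiply-accumulate operations:

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out):
    """Parameter count and approximate FLOPs of a standard k x k
    convolution (bias included) producing an (h_out, w_out) map.
    Illustrative sketch using the FLOPs ~= 2 * MACs convention."""
    params = c_in * c_out * k * k + c_out          # weights + biases
    macs = c_in * k * k * c_out * h_out * w_out    # multiply-accumulates
    return params, 2 * macs
```

Summing this quantity over every layer of a network (with pooling, normalization, and activation costs usually treated as negligible) yields the kind of totals reported in Table 8.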
Discussion
In this work, we propose SA-UMamba, which integrates the global sequence modeling capabilities of Mamba with the local feature extraction strengths of convolutional layers to improve medical image segmentation performance. Although our method outperforms baseline models on most evaluation metrics, it performs slightly worse on a few metrics. We attribute this to specific artifacts in the datasets, such as hair occlusion in ISIC18, which may hinder the extraction of reliable local features and reduce the overall representation quality.
To address this, our future work will explore pure Mamba architectures with inherent global-local modeling capabilities, aiming to eliminate reliance on additional modules and further enhance model generalization. We also plan to reduce computational overhead by designing lightweight Mamba variants that avoid multi-directional scanning. Moreover, we will extend SA-UMamba to a broader range of medical imaging tasks, such as detection and registration, to evaluate its versatility and practical applicability.
Conclusions
In this paper, we have introduced SA-UMamba, a Mamba-based U-shaped medical segmentation model with spatial attention. We utilized RSSS blocks to construct the encoder and decoder, employing learnable scaling factors to co-optimize Mamba-derived and spatial features. The results of extensive experiments confirmed that our SA-UMamba model outperforms previous Mamba-based medical models. In the future, we will explore optimization strategies for the Mamba model to reduce the number of parameters and further decrease the computational resource requirements, thereby extending the applicability of SA-UMamba.
Acknowledgments
The authors would like to thank the Academic Editor and anonymous reviewers for their valuable comments.
References
- 1. Bai W, Suzuki H, Huang J, Francis C, Rueckert D. A population-based phenome-wide association study of cardiac and aortic structure and function. Nat Med. 2020;26:1654–62.
- 2. Mei X, Lee H-C, Diao K-Y, Huang M, Lin B, Liu C, et al. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19. Nat Med. 2020;26(8):1224–8. pmid:32427924
- 3. Jungo A, Meier R, Ermis E, Blatti-Moreno M, Herrmann E, Wiest R, et al. On the effect of inter-observer variability for a reliable estimation of uncertainty of medical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention–MICCAI 2018, Granada, Spain. 2018, pp. 682–90.
- 4. Joskowicz L, Cohen D, Caplan N, Sosna J. Inter-observer variability of manual contour delineation of structures in CT. Eur Radiol. 2019;29(3):1391–9. pmid:30194472
- 5. Tang H, Chen X, Liu Y, Lu Z, You J, Yang M, et al. Clinically applicable deep learning framework for organs at risk delineation in CT images. Nat Mach Intell. 2019;1(10):480–91.
- 6. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 2015, pp. 234–41.
- 7. Xiao X, Lian S, Luo Z, Li S. Weighted Res-UNet for high-quality retina vessel segmentation. In: Proceedings of the International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19-21 October 2018, pp. 327–31.
- 8. Guan S, Khan AA, Sikdar S, Chitnis PV. Fully dense UNet for 2-D sparse photoacoustic tomography artifact removal. IEEE J Biomed Health Inform. 2020;24(2):568–76. pmid:31021809
- 9. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, et al. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv, preprint, 2021. https://arxiv.org/abs/2102.04306
- 10. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: Proceedings of the European Conference on Computer Vision, online. 2023, pp. 205–18.
- 11. Ibtehaz N, Rahman MS. MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020;121:74–87. pmid:31536901
- 12. Zhou H-Y, Guo J, Zhang Y, Han X, Yu L, Wang L, et al. nnFormer: volumetric medical image segmentation via a 3D transformer. IEEE Trans Image Process. 2023;32:4036–45. pmid:37440404
- 13. Guo J, Zhou HY, Wang L, Yu Y. UNet-2022: exploring dynamics in non-isomorphic architecture. In: Proceedings of the International Conference on Medical Imaging and Computer-Aided Diagnosis–MICAD 2022, 2023, pp. 465–76.
- 14. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203–11. pmid:33288961
- 15. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, 2019, pp. 4171–86.
- 16. Dosovitskiy A, et al. An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the 2021 International Conference on Learning Representations, Vienna, Austria, 2021, pp. 1–21.
- 17. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 2021, pp. 10012–22.
- 18. Zhang Y, Liu H, Hu Q. TransFuse: fusing transformers and CNNs for medical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention–MICCAI 2021, Strasbourg, France, 2021, pp. 14–24.
- 19. Madebo MM, Abdissa CM, Lemma LN, Negash DS. Robust tracking control for quadrotor UAV with external disturbances and uncertainties using neural network based MRAC. IEEE Access. 2024;12:36183–201.
- 20. Ayalew W, Menebo M, Merga C, Negash L. Optimal path planning using bidirectional rapidly-exploring random tree star-dynamic window approach (BRRT*-DWA) with adaptive Monte Carlo localization (AMCL) for mobile robot. Eng Res Express. 2024;6(3):35212.
- 21. Gu A, Goel K, Ré C. Efficiently modeling long sequences with structured state spaces. In: Proceedings of the 2022 International Conference on Learning Representations, Virtual, 25-29 April 2022, pp. 1–27.
- 22. Gu A, Johnson I, Goel K, Saab K, Dao T, Rudra A, et al. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In: Proceedings of the 35th Annual Conference on Neural Information Processing Systems, Virtual, 6-14 December 2021, pp. 572–85.
- 23. Gu A, Dao T. Mamba: linear-time sequence modeling with selective state spaces. arXiv, preprint, 2023. https://arxiv.org/abs/2312.00752
- 24. Zhu L, Liao B, Zhang Q, Wang X, Liu W, Wang X. Vision mamba: efficient visual representation learning with bidirectional state space model. In: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024, pp. 62429–42.
- 25. Liu Y, Tian Y, Zhao Y, Yu H, Xie L, Wang Y, et al. VMamba: visual state space model. arXiv, preprint, 2024. https://arxiv.org/abs/2401.10166
- 26. Ma J, Li F, Wang B. U-Mamba: enhancing long-range dependency for biomedical image segmentation. arXiv, preprint, 2024. https://arxiv.org/abs/2401.04722
- 27. Xing Z, Ye T, Yang Y, Liu G, Zhu L. SegMamba: Long-range sequential modeling mamba for 3D medical image segmentation. arXiv, preprint, 2024. https://arxiv.org/abs/2401.13560
- 28. Ruan J, Xiang S. VM-UNet: Vision mamba UNet for medical image segmentation. arXiv, preprint, 2024. https://arxiv.org/abs/2402.02491
- 29. Liu J, et al. Swin-UMamba: mamba-based UNet with ImageNet-based pretraining. arXiv, preprint, 2024. https://arxiv.org/abs/2402.03302
- 30. Zhang M, Yu Y, Gu L, Lin T, Tao X. VM-UNET-V2: Rethinking vision mamba unet for medical image segmentation. arXiv, preprint, 2024. https://arxiv.org/abs/2403.09157
- 31. Wang Z, Zheng J, Zhang Y, Cui G, Li L. Mamba-UNet: UNet-like pure visual mamba for medical image segmentation. arXiv, preprint, 2024. https://arxiv.org/abs/2402.05079
- 32. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: A nested U-Net architecture for medical image segmentation. In: Proceedings of the 4th International Workshop on Deep Learning in Medical Image Analysis (DLMIA 2018) and the 8th International Workshop on Multimodal Learning for Clinical Decision Support (ML-CDS 2018), Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018, pp. 3–11.
- 33. Gu R, Wang G, Song T, Huang R, Aertsen M, Deprest J, et al. CA-Net: comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Trans Med Imaging. 2021;40(2):699–711. pmid:33136540
- 34. Shan T, Yan JS. SCA-Net: A spatial and channel attention network for medical image segmentation. IEEE Access. 2021;9:160926–37.
- 35. Liu X, Gao P, Yu T, Wang F, Yuan R-Y. CSWin-UNet: Transformer UNet with cross-shaped windows for medical image segmentation. Information Fusion. 2025;113:102634.
- 36. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In: Proceedings of the 19th International Conference on Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016, Athens, Greece, 2016, pp. 424–32.
- 37. Milletari F, Navab N, Ahmadi SA. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, 2016, pp. 565–71.
- 38. Zhang H, Zhu Y, Wang D, Zhang L, Chen T, Wang Z, et al. A survey on visual mamba. Appl Sci. 2024;14(13):5683.
- 39. Woo S, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018, pp. 3–19.
- 40. Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, 20-25 June 2021, pp. 13708–17.
- 41. Zhang H, Zu K, Lu J, Zou Y, Meng D. EPSANet: An efficient pyramid squeeze attention block on convolutional neural network. In: Proceedings of the Asian Conference on Computer Vision, Macao, China, 4-8 December 2022, pp. 541–57.
- 42. Zhang X, Liu C, Yang D, Song T, Ye Y, Li K, et al. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv, preprint, 2023. https://arxiv.org/abs/2304.03198
- 43. Ba JL, Kiros JR, Hinton GE. Layer normalization. arXiv, preprint, 2016. https://arxiv.org/abs/1607.06450
- 44. Huang X, Deng Z, Li D, Yuan X, Fu Y. MISSFormer: an effective transformer for 2D medical image segmentation. IEEE Trans Med Imaging. 2023;42(5):1484–94. pmid:37015444
- 45. Lan L, Cai P, Jiang L, Liu X, Li Y, Zhang Y. BRAU-Net++: U-shaped hybrid CNN-transformer network for medical image segmentation. arXiv, preprint, 2024. https://arxiv.org/abs/2401.00722
- 46. Guo Y, Li Y, Wang L, Rosing T. Depthwise convolution is all you need for learning multiple visual domains. In: Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, 27 January-1 February 2019, pp. 8368–75.
- 47. Nishiyama T, Kumagai A, Kamiya K, Takahashi K. SILU: strategy involving large-scale unlabeled logs for improving malware detector. In: Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 2020, pp. 1–7.
- 48. Loshchilov I, Hutter F. Decoupled weight decay regularization. In: Proceedings of the International Conference on Learning Representations, New Orleans, LA, 6-9 May 2019, pp. 1–10.
- 49. Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the International Conference on Learning Representations, Toulon, France, 2017, pp. 1–13.
- 50. Multi-atlas labeling beyond the cranial vault–workshop and challenge in MICCAI 2015, Munich, Germany, 2015. https://www.synapse.org/Synapse:syn3193805/wiki/89480
- 51. Berseth M. ISIC 2017–skin lesion analysis towards melanoma detection. arXiv, preprint, 2017. https://arxiv.org/abs/1703.00523
- 52. Codella N, et al. Skin lesion analysis toward melanoma detection. arXiv, preprint, 2019. https://arxiv.org/abs/1902.03368
- 53. Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, Rodríguez C, Vilariño F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput Med Imaging Graph. 2015;43:99–111. pmid:25863519
- 54. Alom MZ, Yakopcic C, Hasan M, Taha TM, Asari VK. Recurrent residual U-Net for medical image segmentation. J Med Imaging (Bellingham). 2019;6(1):014006. pmid:30944843
- 55. Oktay O, et al. Attention U-Net: learning where to look for the pancreas. In: Proceedings of the 1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands, 2018, pp. 1–10.
- 56. Gao Y, Zhou M, Liu D, Yan Z, Zhang S, Metaxas DN. A data-scalable transformer for medical image segmentation: architecture, model efficiency, and benchmark. arXiv, preprint, 2022. https://arxiv.org/abs/2203.00131
- 57. Ruan J, Xiang S, Xie M, Liu T, Fu Y. MALUNet: A multi-attention and light-weight UNet for skin lesion segmentation. In: Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, 2022, pp. 1150–6.
- 58. Cai P, Lu J, Li Y, Lan L. Pubic symphysis-fetal head segmentation using pure Transformer with bi-level routing attention. arXiv, preprint, 2023. https://arxiv.org/abs/2310.00289
- 59. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 18-23 June 2018, pp. 7132–41.