Abstract
Accurate segmentation of lesions in prostate magnetic resonance images (MRI) is important for assessing patient health and personalizing treatment in the clinic. However, the traditional UNet segmentation network has low segmentation accuracy because of fuzzy boundaries and low contrast. Therefore, we propose a Lightweight Mamba-UNet (LM-UNet) prostate MRI image segmentation method. Initially, the encoder-decoder backbone structure consists of parallel vision mamba (PV-Mamba) and efficient multi-scale attention (EMA). Constructing PV-Mamba reduces the number of model parameters while extracting long-distance correlations between features. The EMA is then used to learn different spatial features in groups and to construct cross-spatial information aggregation methods for richer feature aggregation. Subsequently, we construct the edge feature extraction (EFE) and edge feature fusion (EFF) modules to achieve feature fusion at different levels in the encoder. Ultimately, we suggest multi-stage and multi-level skip connections (MMSC) to achieve multi-level fusion between the encoder and decoder, thereby reducing semantic discrepancies between contextual features and improving segmentation accuracy. Experimental results demonstrate that on the PROMISE12 dataset, LM-UNet outperforms seven comparative segmentation methods in terms of parameter count, computational memory requirements, and precise segmentation of lesion margins.
Citation: Xu K, Zhou S, Chen Y, Chen J, Zhang N, Liao Y (2026) LM-UNet: Lightweight Mamba-UNet Prostate MRI image segmentation network. PLoS One 21(3): e0339719. https://doi.org/10.1371/journal.pone.0339719
Editor: Kumaradevan Punithakumar, University of Alberta, CANADA
Received: July 30, 2025; Accepted: December 11, 2025; Published: March 23, 2026
Copyright: © 2026 Xu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The prostate MRI dataset used in this study is available at: https://zenodo.org/records/8014041.
Funding: This study was supported by Guizhou Provincial Natural Science Research Project (Youth Project) in the form of a grant awarded to K.X. (No. Qian Jiao Ji[2024]279) and Guizhou Provincial Education Science Planning Project (Youth Project) in the form of a salary for K.X. (2024C018). The specific roles of this author are articulated in the ‘author contributions’ section. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Prostate cancer is the most common malignant tumor in men and one of the leading causes of male mortality. Approximately 1.1 million men worldwide are diagnosed with prostate cancer each year [1,2]. Currently, with continuing population aging, the incidence and mortality of prostate cancer in China are both gradually rising [3]. Early-stage prostate cancer can be effectively treated through surgery with good therapeutic outcomes. In contrast, advanced prostate cancer may metastasize, and once the cancer cells have spread, they are difficult to cure. Therefore, early screening, early detection, and early treatment of prostate cancer have significant clinical importance [4,5].
MRI has been widely used as a non-invasive imaging tool for the screening of prostate cancer. The clinician obtains the location, size, and shape of the region of interest in the prostate MRI image by segmenting it, enabling the diagnosis of the malignancy of prostate cancer [6]. In this process, the diagnostic results are affected by the qualifications of clinicians, and diagnostic consistency between different doctors is low [7,8]. Therefore, it is of great significance to develop an algorithm that can automatically segment prostate MRI images.
Deep learning techniques have shown good performance in medical image segmentation, mainly due to the use of convolutional neural networks (CNN), a powerful modeling technique [9–11]. To address the challenge of low accuracy in medical image segmentation, Ronneberger et al. [12] proposed the fully convolutional UNet segmentation network. This architecture employs skip connections to integrate low-resolution features from the encoder downsampling process with high-resolution features generated through the decoder upsampling operations, effectively improving segmentation precision. However, due to the variability of lesions in medical images, the multiple down-sampling steps of the UNet segmentation network lead to the loss of detailed information such as edges. Song et al. [13] suggested a dual-branch framework comprising a global feature reconstruction and a local feature reconstruction, to preserve the global detail features of the input image. Yin et al. [14] developed a guided filtering module integrated after each downsampling and upsampling operation in the standard UNet architecture, establishing hierarchical feature guidance through inter-stage information propagation and achieving effective transmission of different features.
Furthermore, to resolve training instability caused by semantic discrepancies during the fusion of encoder-derived downsampled features and decoder-processed upsampled features, Asadi et al. [15] advanced a bidirectional convolutional long short-term memory that performs nonlinear integration of multi-scale representations by bidirectionally coupling hierarchical features from the encoding path with progressively refined outputs in the decoding pathway. To address edge information degradation in lesion segmentation, Zhu et al. [16] employed a boundary-weighted domain adaptive neural network that enhances boundary sensitivity during prediction, thereby achieving more precise extraction of boundary features in pathological images through adaptive feature recalibration. For precise delineation of anatomical structures in prostate lesion segmentation, Wang et al. [17] studied a boundary encoding network that learns discriminative representations of organ edges through multi-scale boundary-aware learning, establishing dense contextual dependencies between boundary semantics and regional features to guide pixel-wise classification and achieving enhanced delineation accuracy of glandular contours in histopathology images.
Some researchers have improved the UNet and achieved good performance in prostate MRI image segmentation. However, methods that use fully convolutional structures can only extract local features during the feature extraction process and lack the ability to capture global features. Zhang et al. [18] employed a TransFuse method with a parallel branch architecture, which can effectively capture inter-image dependencies and low-level spatial details, thereby improving the accuracy of traditional CNNs for prostate segmentation. Hung et al. [19] designed a cross-slice attention using the transformer approach, which can be integrated with any skip-connection-based network architecture to achieve context information fusion. Pollastri et al. [20] investigated a transformer model based on long-distance self-supervised learning that integrates contextual information across various anatomical planes. Research has shown that transformer-based segmentation methods only exhibit significant performance improvements when trained on large datasets [21,22]. Furthermore, the computational complexity of transformer-based segmentation methods is proportional to the square of the sequence length, which results in low efficiency when processing long sequences. This makes the network require a substantial amount of time and memory during both training and inference.
To overcome the transformer's massive parameter count, high training costs, and low efficiency in processing long sequences, some scholars have focused on proposing lightweight Mamba-based methods [23–26]. Mamba employs a selective mechanism that dynamically adjusts state transition parameters based on input content, enabling focused detection of critical information such as lesion margins and fine structures. Simultaneously, leveraging its selective state space model, it achieves dynamic focusing on key lesion regions within medical images and long-range context modeling with linear computational complexity, ensuring global consistency and accuracy of the segmentation results while significantly reducing parameter and computational costs.
Overall, for prostate MRI image segmentation, the existing UNet has several shortcomings. (1) Prostate MRI datasets are small, while UNet networks constructed entirely from convolutions have many parameters, so overfitting easily occurs during model training. (2) During the encoding stage of the UNet network, multiple downsampling operations are performed, which leads to the loss of detailed edge information of the lesion and thus to unsmooth edges in the segmentation result. (3) In the skip connection, the low-resolution information from the encoder is fused directly with the high-resolution information in the decoder; since there is a semantic difference between information at these two resolutions, direct fusion loses part of the spatial information, which degrades the segmentation result. Therefore, we propose the LM-UNet method, which can effectively segment lesions in prostate MRI images.
The contributions of this work are summarized as follows:
- We propose an LM-UNet method for prostate MRI image segmentation. LM-UNet consists of the parallel vision mamba (PV-Mamba), efficient multi-scale attention (EMA), edge feature extraction (EFE), edge feature fusion (EFF), and multi-stage and multi-level skip connections (MMSC) components. This method improves the accuracy of MRI image segmentation and captures correlations across long feature sequences while reducing network parameters and memory requirements.
- We construct the EFE and EFF modules. First, the EFE fuses the shallow texture details with the deep, abstract features from the encoder. Then, the EFF fuses each encoder output to enrich the diversity of the feature space.
- We put forward the MMSC to realize multi-stage and multi-level fusion of the fused features from the encoder with the decoder, reducing the semantic differences between encoder and decoder features and thus improving segmentation accuracy.
- The proposed LM-UNet method is validated on the publicly available medical dataset PROMISE12 [27] and compared with several classical segmentation methods; the experimental results show that our segmentation method achieves the best segmentation performance.
2. Methods
In this section, we present a method for segmenting prostate MRI images using an LM-UNet, based on the encoding and decoding ideas of the UNet network. As demonstrated in Fig 1, the LM-UNet is primarily made up of an encoder, a decoder, and a skip connection.
The encoder and decoder of LM-UNet consist of PV-Mamba and EMA, with skip connections formed by EFF and MMSC. EFE is employed to fuse edge features from different levels of the encoder.
2.1. Method overview
To efficiently segment prostate MRI images, we have designed a novel LM-UNet prostate MRI image segmentation network based on the encoding and decoding ideas of the UNet network, as observed in Fig 1. LM-UNet is mainly composed of an encoder, a decoder, and MMSC. The encoder and decoder are composed of PV-Mamba, EMA, and EFE. As a selective state space model with linear computational complexity, PV-Mamba can effectively capture correlations across long sequences while reducing network parameters and memory requirements [28]. EMA uses cross-spatial learning to process feature aggregation in parallel, which avoids the overfitting caused by an overly deep network. EFE embeds low-level edge features into high-level semantic features, effectively realizing the fusion of shallow and deep features. MMSC performs multi-stage and multi-level fusion of the fused features from the encoder and connects them to the decoder to reduce semantic differences between contexts and enhance feature expressiveness, thereby improving segmentation accuracy.
2.2. Encoder-Decoder
The encoder and decoder comprise the PV-Mamba, EMA, and EFE components. The functions of each component are as follows.
2.2.1. PV-Mamba.
To reduce the computational complexity of the network and capture correlations across long sequences, we introduce the PV-Mamba in layers 4–6 of LM-UNet. As indicated in Fig 2, PV-Mamba is a selective state space model with linear computational complexity that excels at long sequence modeling. Through its global receptive field and dynamic weighting mechanism, PV-Mamba effectively alleviates the limitations of convolutional neural networks and achieves long-distance modeling ability comparable to the transformer. The detailed steps of the PV-Mamba operation are as follows.
- (1) Apply LayerNorm (LN) processing to the feature map X output from the third-layer encoder to reduce internal covariate shift and enhance stability. The normalized feature map is divided into four sub-feature maps X_1, X_2, X_3, and X_4; each sub-feature map has one-fourth the number of channels of the original feature map, i.e., C/4, which reduces the computational complexity per sub-feature map.
- (2) Each sub-feature map is fed into the Mamba for processing. The PV-Mamba optimizes feature representations by introducing dynamic weights and a global receptive field, enabling the network to capture richer feature information [29]. The Mamba output is added to a scaled residual of the original sub-feature map to obtain the feature X̂_i; specifically, X̂_i is formed by summing the original sub-feature map scaled by an adjustment factor α with the Mamba output. This helps mitigate the vanishing gradient problem in deep networks and allows the model to flexibly adjust the weighting between original and learned features.
- (3) The information from each sub-feature map is concatenated to obtain the feature map X_c.
- (4) The concatenated feature map X_c is passed through an LN layer to ensure consistency in feature distribution. A projection operation is then applied to adjust the feature map dimensionality, ensuring it meets the input requirements of subsequent network layers. The calculation formulas for PV-Mamba are as follows:

[X_1, X_2, X_3, X_4] = Split(LN(X))　(1)
X̂_i = α · X_i + Mamba(X_i), i = 1, …, 4　(2)
X_c = Cat(X̂_1, X̂_2, X̂_3, X̂_4)　(3)
X_out = Proj(LN(X_c))　(4)

where LN is the LayerNorm, Split means the split operation, α indicates the adjustment factor for the residual addition, Cat represents the concat operation, and Proj denotes the projection operation. From Eqs (1)–(4), it can be seen that PV-Mamba keeps the total number of processed channels constant and reduces the parameter count while maintaining high accuracy.
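The split → per-group Mamba → scaled residual → concat pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration of the channel arithmetic only: `mamba_fn` is a hypothetical stand-in for the real selective state-space layer, and the LayerNorm and projection steps are omitted.

```python
import numpy as np

def pv_mamba_block(x, mamba_fn, alpha=0.5):
    """Simplified PV-Mamba sketch: split the channels into four groups,
    run each group through a (stand-in) Mamba layer, add a scaled
    residual, and concatenate. LN and projection are omitted."""
    B, C, H, W = x.shape
    assert C % 4 == 0, "channels must split evenly into 4 groups"
    groups = np.split(x, 4, axis=1)                    # four C/4-channel sub-maps
    outs = [alpha * g + mamba_fn(g) for g in groups]   # residual scaled by alpha
    return np.concatenate(outs, axis=1)                # channel count is preserved

x = np.random.rand(1, 24, 56, 56)
y = pv_mamba_block(x, mamba_fn=lambda g: 0.1 * g)      # identity-like stand-in
print(y.shape)  # (1, 24, 56, 56)
```

The key property Eqs (1)–(4) rely on is that the four C/4-channel groups reassemble into exactly C channels, so PV-Mamba can be dropped into the encoder without changing downstream layer shapes.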
2.2.2. Efficient multi-scale attention.
To preserve complete channel information while effectively learning features, we introduce an EMA after each encoder backend. This alleviates the information bottleneck caused by increased network depth while enhancing cross-spatial feature interaction and aggregation [30]. As demonstrated in Fig 3, EMA first partitions the input features into multiple groups along the channel dimension to reduce computational complexity while preserving complete channel information. Subsequently, a parallel subnetwork extracts spatial and channel-wise attention weights for the different groups, thereby capturing cross-regional semantic dependencies. Ultimately, a cross-spatial information fusion mechanism, through multi-scale feature interaction and weighted aggregation, dynamically integrates discriminative local features from different groups with global contextual information. This approach significantly enhances the model's representational capacity while preserving feature resolution. The operation is as follows.
Feature Grouping: The EMA uniformly partitions the input feature map along the channel dimension, dividing it into g mutually exclusive sub-feature groups X_0, …, X_{g-1}, where each sub-feature has a dimension of C/g × H × W. This grouping strategy achieves structural decoupling of features while avoiding channel compression and preserving the integrity of the original feature information.
Parallel Sub-Network: EMA extracts attention weights for the grouped features through three functionally complementary parallel paths. Two paths perform average pooling along the height and width of the feature map, respectively, yielding two directional feature maps; these are refined using concatenation and convolution, and a Sigmoid activation generates attention weights that are multiplied element-wise with the original input feature map to weight it. The weighted feature maps then undergo group normalization, are processed with Softmax to generate normalized attention weights, and are combined with the feature maps via matrix multiplication (Matmul). The third path employs 3×3 convolutions, focusing on capturing multi-scale local spatial features to enhance feature distinguishability. Finally, the outputs of all paths are multiplied element-wise to generate the final output feature map. This parallel architecture integrates global context with local details, thereby providing more robust attention weights for subsequent feature aggregation.
Cross-Space Learning: During the cross-spatial learning stage, EMA fuses features from different spatial dimensions. For the outputs of the two convolutions, a two-dimensional global average pooling operation encodes global spatial information, thereby compressing the feature map of each channel into a feature with a global receptive field. Meanwhile, the output features of the 3×3 branch undergo dimensionality transformation to match the former. Ultimately, the output features of EMA encompass both the captured long-range spatial dependencies and the extracted multi-scale local details, enabling cross-channel fusion.
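The grouping and directional-pooling steps above can be sketched with NumPy. This is a simplified illustration of the data flow only: the learned 1×1/3×3 convolutions, group normalization, and Softmax/Matmul stages are replaced by plain sigmoid gating, so it shows shapes and weighting, not the full EMA module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ema_directional_weights(x):
    """Sketch of EMA's two pooling paths: average-pool along the width
    and height to get directional descriptors, then gate the input with
    sigmoid weights broadcast back over the full map."""
    pool_h = x.mean(axis=3, keepdims=True)   # (B, C, H, 1), pooled over width
    pool_w = x.mean(axis=2, keepdims=True)   # (B, C, 1, W), pooled over height
    return x * sigmoid(pool_h) * sigmoid(pool_w)

x = np.random.rand(2, 8, 16, 16)
g = 4                                         # number of channel groups
groups = np.split(x, g, axis=1)               # each group has C/g channels
out = np.concatenate([ema_directional_weights(s) for s in groups], axis=1)
print(out.shape)  # (2, 8, 16, 16)
```

As with PV-Mamba, the grouped processing preserves the total channel count, so EMA can sit behind each encoder stage without reshaping the backbone.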
2.2.3. Edge feature extraction module.
A good edge prior can improve the accuracy of segmentation results. Although low-level features contain rich edge details [31], they also introduce numerous edges from regions of no interest. To enhance features along edges of interest, an edge-aware module is proposed to extract edge information from prostate MRI images. As shown in Fig 4, EFE combines the low-level feature F_l and the high-level feature F_h within the encoder. F_l and F_h are processed by PV-Mamba respectively, and F_l is then integrated with the upsampled F_h through concatenation. Finally, feature enhancement is performed using two convolutional layers, and the edge features F_e are obtained by applying PV-Mamba and a Sigmoid activation.
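The EFE data flow (upsample the deep feature, concatenate with the shallow one, squash to an edge map) can be sketched as follows. This is a NumPy shape-level illustration under stated simplifications: the PV-Mamba and convolutional refinement stages are replaced by a channel mean, and the feature names `f_low`/`f_high` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling for (B, C, H, W) arrays."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def edge_feature_extraction(f_low, f_high):
    """EFE sketch: bring the deep feature to the shallow resolution,
    concatenate along channels, and produce a single-channel edge prior
    via sigmoid. Learned layers are replaced by a channel mean."""
    fused = np.concatenate([f_low, upsample2x(f_high)], axis=1)
    edge = sigmoid(fused.mean(axis=1, keepdims=True))
    return edge

f_low = np.random.rand(1, 8, 32, 32)    # shallow, texture-rich feature
f_high = np.random.rand(1, 64, 16, 16)  # deep, semantic feature
edge = edge_feature_extraction(f_low, f_high)
print(edge.shape)  # (1, 1, 32, 32)
```

The sigmoid bounds the edge prior to (0, 1), which is what lets EFF later use it as a multiplicative mask.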
2.3. Multi-stage multi-level skip connections
MMSC includes the EFF, multi-stage channel attention (MCA), and multi-stage and multi-scale spatial attention (MMSA) components. The EFF fuses the shallow information with each encoder output and embeds the edge information into each encoder to provide the decoder with prior knowledge. The MCA realizes multi-stage and multi-scale information enhancement in the encoder to efficiently extract target information of different sizes in prostate MRI. The MMSA module enhances features along the spatial dimension, thereby strengthening the feature representation.
2.3.1. Edge feature fusion.
To enrich the diversity of the feature space, the EFF injects the EFE output into the feature representation to improve the representation of lesion structure semantics. As shown in Fig 5, given the input feature E_i (the output of the i-th encoder) and the edge feature F_e, the features from E_i and the upsampled F_e are multiplied element-wise and connected with E_i through a residual connection. A convolution is then used to learn the fused features, which are fed into PV-Mamba to construct long-distance dependencies and enhance the correlation between features. To strengthen the fused representation, the fused features are processed with average pooling (Avg), the corresponding channel attention weights are obtained by 1D convolution and a Sigmoid function, and these weights are multiplied with the PV-Mamba output to obtain the fusion F_i of each encoder with the edge features. The EFF is calculated as follows:

P_i = PV(Conv(E_i ⊗ Up(F_e) + E_i))
F_i = σ(Conv1D(Avg(P_i))) ⊗ P_i
where E_i is the output of the i-th encoder, F_e denotes the edge feature, PV indicates the PV-Mamba operation, σ is the Sigmoid function, and ⊗ means element-wise multiplication.
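The channel-attention tail of EFF (global average pooling, a 1D convolution across channels, a sigmoid gate) can be sketched in NumPy. The fixed averaging kernel below is a hypothetical stand-in for the learned 1D convolution, and the edge-injection and PV-Mamba steps are assumed to have already produced `f`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eff_channel_gate(f, kernel=3):
    """EFF channel-attention sketch: per-channel global average pooling,
    a 1D convolution over the channel axis (fixed averaging weights as a
    stand-in), a sigmoid, then channel-wise re-weighting."""
    B, C, H, W = f.shape
    desc = f.mean(axis=(2, 3))                  # (B, C) channel descriptor
    w = np.ones(kernel) / kernel                # stand-in for learned 1D conv
    attn = np.stack([np.convolve(d, w, mode="same") for d in desc])
    return f * sigmoid(attn)[:, :, None, None]  # reweight each channel

f = np.random.rand(2, 16, 8, 8)
gated = eff_channel_gate(f)
print(gated.shape)  # (2, 16, 8, 8)
```

Because the gate operates only on the (B, C) descriptor, its cost is independent of spatial resolution, which keeps EFF cheap even at the shallow, high-resolution encoder stages.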
2.3.2. Multi-stage channel attention.
Obtaining multi-stage and multi-scale information in the encoder is crucial for segmenting targets of different sizes in prostate MRI. The fusion of multi-stage and multi-scale information has been proven key to improving segmentation performance [25,32]. Therefore, an MCA is proposed at the channel level to generate a channel attention map, which is produced by connecting features of different stages along the channel axis. As shown in Fig 6(a), to better integrate multi-stage and multi-scale information from the encoder, MCA combines local information fusion (a 1D convolution operation) and global information fusion (a different fully connected layer for each stage) to provide more informative attention feature maps. The specific operation of MCA is as follows:
where GAP stands for global average pooling, E_i represents the feature map of stage i obtained from the encoder, Concat is the join operation along the channel dimension, Conv1D indicates a one-dimensional convolution operation, FC_i is the fully connected layer of stage i, σ is the Sigmoid function, and ⊗ is element-wise multiplication.
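The local-fusion path of MCA (GAP per stage, concatenate the descriptors, fuse across channels with a 1D convolution, sigmoid) can be sketched as follows. This NumPy illustration covers only that path: the averaging kernel stands in for the learned 1D convolution, and the per-stage fully connected layers of the global path are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mca_weights(stage_feats, kernel=3):
    """MCA local-fusion sketch: GAP each stage's feature map, concatenate
    the channel descriptors across stages, run a 1D convolution over the
    joint channel axis, and emit per-channel attention gates."""
    descs = [f.mean(axis=(2, 3)) for f in stage_feats]   # each (B, C_i)
    joint = np.concatenate(descs, axis=1)                # (B, sum of C_i)
    w = np.ones(kernel) / kernel                         # stand-in for learned conv
    fused = np.stack([np.convolve(d, w, mode="same") for d in joint])
    return sigmoid(fused)

# Two encoder stages with different channel counts and resolutions
feats = [np.random.rand(1, 8, 16, 16), np.random.rand(1, 16, 8, 8)]
weights = mca_weights(feats)
print(weights.shape)  # (1, 24)
```

Concatenating the pooled descriptors rather than the raw maps is what lets stages of different spatial resolutions interact at negligible cost.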
2.3.3. Multi-stage and multi-scale spatial attention.
The MMSA is proposed to generate spatial attention maps for each stage to integrate multi-scale information along the spatial dimension [33]. The MMSA is shown in Fig 6(b). To begin with, average pooling and maximum pooling operations are used to map the features of each stage. Moreover, a dilated convolution (dilation rate 3, kernel size 7) and Sigmoid normalization generate the spatial attention of each stage. Lastly, the generated spatial attention map is multiplied element-wise with the original feature map to enhance the features of important regions and suppress those of unimportant regions. The enhanced feature map is added to the original feature map through a residual connection, enhancing the spatial attention information while retaining the original features. The MMSA operation is as follows:
where F_i is the feature map of stage i, AvgPool represents average pooling, MaxPool indicates maximum pooling, Concat denotes the concatenation operation, σ is the Sigmoid function, and ⊗ means element-wise multiplication.
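The MMSA steps above can be sketched in NumPy. This simplified illustration replaces the dilated 7×7 convolution over the concatenated pooling maps with their sum, purely to show the pooling, gating, and residual pattern.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mmsa_stage(f):
    """MMSA sketch: channel-wise average and max pooling give two spatial
    maps; their combination (stand-in for the dilated convolution over
    their concatenation) is squashed to a spatial attention map, applied
    multiplicatively, and added back via a residual connection."""
    avg_map = f.mean(axis=1, keepdims=True)   # (B, 1, H, W)
    max_map = f.max(axis=1, keepdims=True)    # (B, 1, H, W)
    attn = sigmoid(avg_map + max_map)         # spatial attention in (0, 1)
    return f + f * attn                       # residual keeps original features

f = np.random.rand(1, 32, 14, 14)
out = mmsa_stage(f)
print(out.shape)  # (1, 32, 14, 14)
```

The residual form `f + f * attn` means the attention can only amplify regions, never zero them out, which preserves the original features as the text describes.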
2.4. Loss function
In image segmentation, the goal is to enable the model to distinguish different regions. The cross-entropy loss function helps the model learn to distinguish categories more accurately by minimizing the difference between the predicted probability distribution and the true distribution. Therefore, this paper uses the cross-entropy loss to constrain model training, defined as follows:

L_CE = −[y log ŷ + (1 − y) log(1 − ŷ)]
where y is the true label and ŷ denotes the prediction. Medical image segmentation must also contend with unbalanced data samples; if the cross-entropy loss alone constrains model training without accounting for class imbalance, segmentation quality suffers. The dice similarity coefficient measures the degree of overlap between the predicted and true segmentations. Combining the cross-entropy loss and the dice loss simultaneously optimizes the classification accuracy of the model and the consistency of the segmentation, which is especially important for medical images requiring accurate segmentation [34]. The dice coefficient is defined as follows:

Dice = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|)
where |Y ∩ Ŷ| is the dot-product summation between the predicted segmentation results and the labels, and the dice loss is:

L_Dice = 1 − Dice
In this paper, a combined loss function is used to constrain model training, combining the two functions to optimize the network's loss both locally and holistically and thereby improve segmentation accuracy. The combined loss is defined as follows:

L = L_CE + L_Dice
where L_CE is the cross-entropy loss function and L_Dice denotes the dice loss function.
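The combined loss can be written out directly. This is a minimal NumPy sketch for binary masks; an equal 1:1 weighting of the two terms is assumed here, since no weighting coefficients are given above.

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy over a predicted probability map."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice_loss(y, p, eps=1e-7):
    """Soft Dice loss: 1 minus the overlap coefficient."""
    inter = np.sum(y * p)
    return 1 - (2 * inter + eps) / (np.sum(y) + np.sum(p) + eps)

def combined_loss(y, p):
    """Equal-weight sum of cross-entropy and Dice terms (assumed 1:1)."""
    return bce_loss(y, p) + dice_loss(y, p)

y = np.array([[1.0, 1.0], [0.0, 0.0]])   # ground-truth mask
p = np.array([[0.9, 0.8], [0.1, 0.2]])   # predicted probabilities
print(round(combined_loss(y, p), 3))     # 0.314
```

The cross-entropy term penalizes per-pixel confidence, while the Dice term penalizes region-level overlap, so the sum balances pixel accuracy against mask consistency on imbalanced data.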
2.5. Architecture comparison
Table 1 illustrates the differences between LM-UNet and UVM-UNet. LM-UNet incorporates EMA and EFE into the encoder to extract edge features at distinct hierarchical levels. The skip connections encompass EFF, MCA, and MMSA, enhancing cross-spatial interactions and multi-level feature fusion. This precisely captures prostate edges, thereby improving segmentation performance.
3. Experimental results and analysis
3.1. Datasets
3.1.1. Data acquisition.
In this study, we validated all segmentation methods using the PROMISE12 dataset, a widely recognized public medical image dataset. In prostate MRI data acquisition, transverse T2-weighted MRI was selected as the primary imaging modality due to its significant advantages in anatomical detail. This imaging technique is widely adopted in clinical diagnosis and research for its high soft tissue contrast and ability to clearly display the prostate and its surrounding structures. It provides detailed views of the prostate's internal architecture, including differentiation between the prostatic capsule, central gland, and peripheral gland. PROMISE12 comprises data from four centers: University College London (UCL), the Beth Israel Deaconess Medical Center (BIDMC), Haukeland University Hospital (HK), and the Radboud University Nijmegen Medical Centre (RUNMC) [27]. Detailed information regarding data collection is provided in Table 2.
3.1.2. Data description.
To validate the effectiveness of LM-UNet, the experiments use the PROMISE12 dataset proposed by the Medical Image Computing and Computer Assisted Intervention Society (MICCAI) at the prostate MRI image segmentation challenge held in 2012. The training set consists of 1,473 images, while the test set consists of 473 images. Moreover, tissue sections from the same patient do not appear in both the training and testing sets.
3.1.3. Data preprocessing.
We preprocessed the PROMISE12 dataset to improve image quality and enhance feature information. The preprocessing steps are as follows:
Step 1: Convert NIFTI data to PNG format. To enhance image clarity and reduce noise interference, we employ histogram equalization to adjust brightness and contrast, thereby boosting overall image contrast and achieving a more uniform brightness distribution.
Step 2: After clinicians locate the mask with the largest diameter for each patient, the center of the mask is determined using OpenCV. A bounding rectangle is then drawn around the lesion area with the center coordinates as the origin, ensuring the lesion is fully enclosed. Then, each of the top, bottom, left, and right edges of the rectangle is expanded by 5 pixels to ensure that the edges near the lesion area are fully encompassed.
Step 3: Apply the expanded rectangle coordinates to smaller mask images of other lesion areas in the same patient, then crop the corresponding regions to obtain the entire lesion area. Overlay the obtained mask region onto the corresponding MRI image to isolate the relevant lesion area.
Step 4: To meet the input requirements of deep neural network models, the lesion area is resized to a uniform pixel size, and all data are three-channel images.
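The bounding-rectangle expansion in Step 2 can be sketched without OpenCV. This NumPy equivalent of the `cv2.boundingRect`-plus-margin step uses the 5-pixel margin stated in the text and clips to the image bounds; the synthetic mask is illustrative.

```python
import numpy as np

def expanded_bbox(mask, margin=5):
    """Bounding rectangle of the mask's nonzero region, expanded by
    `margin` pixels on every side and clipped to the image bounds.
    Returns (y0, y1, x0, x1), inclusive coordinates."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    h, w = mask.shape
    return (max(y0 - margin, 0), min(y1 + margin, h - 1),
            max(x0 - margin, 0), min(x1 + margin, w - 1))

mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:30, 25:40] = 1                 # synthetic lesion mask
print(expanded_bbox(mask))             # (15, 34, 20, 44)
```

Per Step 3, the same expanded coordinates would then be reused to crop the smaller masks and the corresponding MRI slices of the same patient, keeping all crops spatially aligned.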
3.2. Implementation details
In this experiment, all methods were trained on a Linux 6.1.85 system equipped with an Nvidia L4 graphics card, using the PyTorch 1.13 + cu117 deep learning framework. During training, all models employ identical configurations: the batch size is 16 and training runs for 100 epochs. The AdamW optimizer is applied uniformly, with the same hyperparameter settings across all methods. The input channel sizes of the encoder are 3, 8, 16, 24, 32, and 48, respectively; the corresponding output channels are 8, 16, 24, 32, 48, and 64. The input and output dimensions of the encoder are shown in Table 3, and the decoder outputs correspond to the encoder outputs. The output of Encoder 3 converts the two-dimensional feature map into a one-dimensional sequence via grid scanning; after transposition, it becomes [B, 3136, 24] and is input into Mamba. The state dimension is 16, with a minimum discretization step size of 0.001 and a maximum of 0.1. To evaluate the model's generalization capability, we employed 5-fold cross-validation during LM-UNet training. This method was used to select the optimal model and hyperparameters, thereby enhancing the model's reliability and performance.
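The grid-scan flattening that feeds Encoder 3's output into Mamba is a reshape-plus-transpose. A minimal NumPy sketch using the dimensions stated above (24 channels, 56 × 56 spatial grid, so L = 3136):

```python
import numpy as np

# Encoder-3 output: (B, C, H, W) = (1, 24, 56, 56). Grid scanning flattens
# the spatial grid row by row; transposing yields the (B, L, C) layout a
# sequence model expects, with L = 56 * 56 = 3136.
feat = np.random.rand(1, 24, 56, 56)
seq = feat.reshape(1, 24, -1).transpose(0, 2, 1)
print(seq.shape)  # (1, 3136, 24)
```

Each sequence position l = h * 56 + w carries the 24-channel vector of one spatial location, so the inverse transpose-and-reshape recovers the 2D feature map exactly for the decoder.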
3.3. Validation metrics
To evaluate the performance of the different segmentation methods, we employed the following metrics: Dice Similarity Coefficient (DSC), Intersection over Union (IoU), Accuracy, Specificity, Sensitivity, 95th-percentile Hausdorff Distance (HD95), Precision, and Average Symmetric Surface Distance (ASSD), to objectively and quantitatively evaluate the segmentation algorithms.
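The two overlap metrics used most heavily below, DSC and IoU, can be computed directly from binary masks. A minimal NumPy sketch (the distance-based metrics HD95 and ASSD require surface extraction and are omitted here):

```python
import numpy as np

def dsc(pred, gt):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())

def iou(pred, gt):
    """Intersection over union: |A ∩ B| / |A ∪ B|."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 1, 0], [0, 0, 0]])
print(dsc(pred, gt), iou(pred, gt))  # 0.8 and 2/3
```

Note that DSC ≥ IoU for any pair of masks (they agree only at 0 and 1), which is why reported DSC values in Table 4 sit above the corresponding IoU values.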
3.4. Experimental result
To further validate the effectiveness of LM-UNet, we conducted comparative analyses with segmentation methods, including UNet [12], ResUNet [35], PraNet [36], TransUNet [37], SwinUNet [38], VM-UNet [39], and UVM-UNet [28].
The evaluation metrics for the different segmentation methods are shown in Table 4. LM-UNet achieved the best DSC, IoU, Accuracy, HD95, Precision, and ASSD, reaching 92.75, 86.81, 96.57, 4.37, 92.95, and 3.95, respectively; its Specificity and Sensitivity were second best at 97.74 and 93.30. These results show that LM-UNet has the best overall segmentation performance on the PROMISE12 dataset.
The visualization of prostate segmentation results for each method is shown in Fig 7; the first and second columns are the prostate MRI images and the corresponding masks. As seen in Fig 7, ResUNet produces the worst results, with frequent mis-segmentation, and in some cases the lesion area is not segmented at all. The reason may be that the limited data lead to overfitting during model training. UNet and PraNet can segment the general area of the lesion, but some regions are under-segmented, and the segmentation edges are blurred with many burrs. The main reason is that CNN-based segmentation methods can only extract local image features and lose global information. In addition, in the skip connections, the results of multiple down-samplings in the encoder are fused directly with the decoder; the large semantic gap between features of different resolutions causes the decoder's final segmentation results to exhibit more burred edges. Compared with the CNN-based methods, models combining transformer and CNN, such as TransUNet and SwinUNet, have certain advantages because they combine local modeling with long-range modeling. However, the visualized results still show edge errors as well as under- and mis-segmented regions. The reason is that transformer-based segmentation methods require large amounts of data to obtain ideal results; for medical image segmentation, where data samples are few and the structures in the data vary widely, their segmentation results are poor.
Compared with CNN- and transformer-based methods, Mamba-based methods have clear advantages: they combine the strengths of CNNs and transformers while having fewer parameters, faster inference, and lower resource consumption. The segmentation results of VM-UNet and UVM-UNet show that the lesion area in the prostate image is segmented effectively and their outputs are closer to the masks, but slight over- and under-segmentation still occurs. The primary reason may be that VM-UNet and UVM-UNet focus too heavily on global dependencies, weakening their perception of local details; this compromises edge accuracy and sensitivity to small targets. As shown in Fig 7, their segmentation boundaries extend beyond the lesion area.
Fig 7 also shows that the proposed LM-UNet achieves the best segmentation results. A novel PV-Mamba, which combines long-range modeling ability with a low parameter count, is introduced into the UNet, while the encoder and decoder are built with CNNs. Because the edge information of the lesion is important to the segmentation result, the EFE is constructed between the shallowest and deepest layers of the encoder to extract edge information at the different encoder resolutions. The EFF then fuses this edge information with the output of each encoder, so that every encoder stage produces features enriched with edge cues. To alleviate the semantic gap between the encoder and the decoder, the MMSC performs multi-stage and multi-level fusion between them, thereby improving segmentation accuracy. The proposed LM-UNet segmentation model therefore has great application potential in prostate segmentation.
The bar chart in Fig 8 illustrates the performance of the different segmentation methods, using DSC, IoU, Accuracy, Specificity, Sensitivity, HD95, Precision, and ASSD as evaluation metrics. LM-UNet is the best overall model. Compared with UNet, it improves DSC, IoU, Accuracy, Sensitivity, and Precision by 1.89, 3.04, 0.91, 3.05, and 0.34, respectively, and reduces HD95 and ASSD by 1.51 and 0.99, while Specificity decreases by 0.16. Relative to SwinUNet, DSC, IoU, Accuracy, Specificity, and Precision improve by 5.12, 7.97, 2.89, 4.72, and 10.37, respectively, and HD95 and ASSD fall by 4.42 and 2.19, while Sensitivity dips by 1.63. Compared with VM-UNet, DSC, IoU, Accuracy, Specificity, Sensitivity, and Precision improve by 2.03, 3.35, 1.02, 0.47, 2.11, and 1.48, respectively, while HD95 and ASSD drop by 1.79 and 1.08. Compared with UVM-UNet, the gains in DSC, IoU, Accuracy, Specificity, Sensitivity, and Precision are 1.42, 2.36, 0.75, 0.39, 1.10, and 1.35, respectively, with HD95 and ASSD decreasing by 1.13 and 0.86. Moreover, LM-UNet exhibits the lowest standard deviation on each evaluation metric, indicating smaller data fluctuations, a more concentrated distribution, and higher consistency, which demonstrates its superior stability.
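HD95 and ASSD are boundary-distance metrics, which is why reductions in them are improvements. A minimal sketch of their usual definitions (assuming `scipy` is available; the function names are illustrative, and masks are assumed non-empty) is:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface_distances(a, b):
    """Distances from each boundary pixel of a to the nearest boundary pixel of b."""
    a, b = a.astype(bool), b.astype(bool)
    a_border = a ^ binary_erosion(a)          # morphological gradient = boundary
    b_border = b ^ binary_erosion(b)
    dist_to_b = distance_transform_edt(~b_border)  # 0 exactly on b's boundary
    return dist_to_b[a_border]

def hd95_assd(pred, mask):
    """95th-percentile Hausdorff distance and average symmetric surface distance."""
    d = np.concatenate([surface_distances(pred, mask),
                        surface_distances(mask, pred)])
    return np.percentile(d, 95), d.mean()
```

For identical masks both metrics are zero. Distances here are in pixels; reported values are typically converted to millimetres using the voxel spacing.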
3.5. Ablation study
UVM-UNet is used as the baseline, and the LM-UNet components are divided into EMA, EFE, EFF, and MMSC. To verify the improvement strategy of the proposed segmentation algorithm, the influence of each component on performance is evaluated in turn. The experimental results are shown in Table 5. When the EMA is introduced into the encoder of the baseline network, the segmentation performance improves over the baseline, showing that the EMA aggregates features effectively through parallel cross-spatial learning. When the EFE and the EFF are introduced separately, segmentation performance improves slightly, indicating that the PV-Mamba has certain advantages in long-distance modeling. When the EFE and the EFF perform feature extraction and fusion together, segmentation performance improves markedly: compared with the baseline, DSC, IoU, Accuracy, Specificity, Sensitivity, and Precision increase by 1.42, 2.36, 0.75, 0.39, 1.10, and 1.38, respectively, and HD95 and ASSD decrease by 1.13 and 0.86.
Compared with baseline + EMA, DSC, IoU, Accuracy, Specificity, Sensitivity, and Precision increase by 0.81, 1.21, 0.30, 0.07, 0.86, and 0.48, respectively, and HD95 and ASSD decrease by 0.50 and 0.41. In addition, the standard deviations of LM-UNet are mostly the lowest, indicating that the model has good stability.
3.6. Complexity analysis
To analyze the complexity of the different segmentation methods, we quantitatively compared their parameters, FLOPs, and the inference time required when training for 100 epochs, as shown in Table 6 and Fig 9. Fig 9(a) shows the number of parameters of each method: TransUNet has the most at 105.277 M, UVM-UNet the fewest at only 0.049 M, and LM-UNet, at 0.45 M, has the second fewest. Fig 9(b) shows the FLOPs: PraNet is highest at 47.325 G, UVM-UNet lowest at only 0.046 G, and LM-UNet requires 6.908 G. Fig 9(c) shows the inference time over 100 training epochs: VM-UNet takes the longest at 11,732 s, while LM-UNet takes 4,992 s. Overall, LM-UNet is not the best in parameters, FLOPs, or inference time, and it does differ from UVM-UNet in these respects. However, the EMA, EFE, and EFF components in LM-UNet effectively enhance boundary feature extraction while integrating features from different levels. A comprehensive analysis of its segmentation results shows that the method achieves an excellent balance between segmentation quality and computational complexity.
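The parameter counts in Table 6 come from profiling the trained models; the arithmetic behind such counts can be illustrated for a single convolution layer (a back-of-the-envelope sketch, not the paper's measurement code):

```python
def conv2d_params(in_ch, out_ch, k, bias=True):
    """Learnable parameters of a k x k 2D convolution:
    one k x k kernel per (input channel, output channel) pair, plus biases."""
    return in_ch * out_ch * k * k + (out_ch if bias else 0)

# A single 3x3 conv from 64 to 64 channels already costs ~36.9 K parameters,
# which is why stacks of wide convolutions dominate a model's parameter budget
# and why narrow, Mamba-style blocks keep the total count low.
p = conv2d_params(64, 64, 3)  # 36928
```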
4. Conclusion
The traditional UNet segmentation network cannot effectively extract edge information from prostate MRI images, whose lesions have fuzzy boundaries and low contrast; it also loses edge detail after repeated downsampling and struggles to fuse encoder and decoder features of different resolutions. To address these problems, we propose LM-UNet, a prostate MRI image segmentation method. The method introduces the PV-Mamba in the deep layers of the encoder-decoder; as a linear state-selection model, PV-Mamba not only reduces the number of model parameters but also exhibits strong long-range modeling capability. We construct the EFE between the shallowest and deepest layers of the encoder to extract edge detail of lesions at different encoder levels, and the EFF to fuse this edge detail with the output of each encoder. Additionally, the MMSC achieves multi-level and multi-scale fusion between encoders and decoders, reducing contextual semantic discrepancies and yielding smoother segmentation results. The experimental results show that the proposed LM-UNet not only achieves higher segmentation accuracy but also has a smaller parameter size and lower computational memory.
This study has several limitations: (1) The experiments were conducted on a single dataset, which cannot adequately reflect the complexity and diversity of medical imaging data encountered in clinical practice. (2) The research focused primarily on segmentation performance and paid less attention to real-time processing; in clinical settings, algorithm execution speed significantly affects physicians' diagnostic and treatment decisions. (3) The experiments used only 2D slices, disregarding 3D contextual information. Our future work will expand both the validation datasets and the scope of the research.
References
- 1. Culp MB, Soerjomataram I, Efstathiou JA, Bray F, Jemal A. Recent Global Patterns in Prostate Cancer Incidence and Mortality Rates. Eur Urol. 2020;77(1):38–52. pmid:31493960
- 2. Abbasi AA, Hussain L, Awan IA, Abbasi I, Majid A, Nadeem MSA, et al. Detecting prostate cancer using deep learning convolution neural network with transfer learning approach. Cogn Neurodyn. 2020;14(4):523–33. pmid:32655715
- 3. Bai L, Wushouer H, Huang C, Luo Z, Guan X, Shi L. Health Care Utilization and Costs of Patients With Prostate Cancer in China Based on National Health Insurance Database From 2015 to 2017. Front Pharmacol. 2020;11:719. pmid:32587512
- 4. Li C, Wan ZQ, Zheng DB, Wang YL. Effects of laparoscopic radical prostatectomy on wound infection of surgery in patients with prostate cancer: A meta-analysis. International Wound Journal. 2024;21(2):e14774.
- 5. Chen J, He L, Ni Y, Yu F, Zhang A, Wang X, et al. Prevalence and associated risk factors of prostate cancer among a large Chinese population. Sci Rep. 2024;14(1):26338. pmid:39487298
- 6. Wang W, Pan B, Ai Y, Li G, Fu Y, Liu Y. ParaCM-PNet: A CNN-tokenized MLP combined parallel dual pyramid network for prostate and prostate cancer segmentation in MRI. Comput Biol Med. 2024;170:107999. pmid:38244470
- 7. Glazer DI, Mayo-Smith WW, Sainani NI, Sadow CA, Vangel MG, Tempany CM, et al. Interreader Agreement of Prostate Imaging Reporting and Data System Version 2 Using an In-Bore MRI-Guided Prostate Biopsy Cohort: A Single Institution’s Initial Experience. AJR Am J Roentgenol. 2017;209(3):W145–51. pmid:28657843
- 8. Song WH, Kim TU, Ryu HS, Yun MS, Park S-W. Enhancement of inter-/intra-reader agreement using the Prostate Imaging Reporting and Data System version 2.1 for prostate cancer detection in magnetic resonance imaging/transrectal ultrasound software fusion prostate biopsy. Investig Clin Urol. 2025;66(5):405–15. pmid:40897659
- 9. Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans Pattern Anal Mach Intell. 2022;44(7):3523–42. pmid:33596172
- 10. Younas HI, Bukhari S, Bukhari F, Aslam N, Badar HMS, Kajla NI. Classifying invasive ductal carcinoma using transfer learning: An efficient methodology. Journal of Computing and Biomedical Informatics. 2022;4(01):236–50.
- 11. Zafar UB, Hamza H, Kajla NI, Badar HMS, Nawaz SA, Siddique MN. Advancing glioblastoma diagnosis through innovative deep learning image analysis in histopathology. Journal of Computing and Biomedical Informatics. 2024;ICASET 2024 Special Issue.
- 12. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. Springer; 2015. p. 234–41.
- 13. Song J, Chen X, Zhu Q, Shi F, Xiang D, Chen Z, et al. Global and Local Feature Reconstruction for Medical Image Segmentation. IEEE Trans Med Imaging. 2022;41(9):2273–84. pmid:35324437
- 14. Yin P, Yuan R, Cheng Y, Wu Q. Deep Guidance Network for Biomedical Image Segmentation. IEEE Access. 2020;8:116106–16.
- 15. Asadi-Aghbolaghi M, Azad R, Fathy M, Escalera S. Multi-level context gating of embedded collective knowledge for medical image segmentation. arXiv preprint arXiv:2003.05056. 2020.
- 16. Zhu Q, Du B, Yan P. Boundary-Weighted Domain Adaptive Neural Network for Prostate MR Image Segmentation. IEEE Trans Med Imaging. 2020;39(3):753–63. pmid:31425022
- 17. Wang S, Liu M, Lian J, Shen D. Boundary Coding Representation for Organ Segmentation in Prostate Cancer Radiotherapy. IEEE Trans Med Imaging. 2021;40(1):310–20. pmid:32956051
- 18. Zhang Y, Liu H, Hu Q. TransFuse: Fusing transformers and CNNs for medical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I. 2021. p. 14–24.
- 19. Hung ALY, Zheng H, Miao Q, Raman SS, Terzopoulos D, Sung K. CAT-Net: A Cross-Slice Attention Transformer Model for Prostate Zonal Segmentation in MRI. IEEE Trans Med Imaging. 2023;42(1):291–303. pmid:36194719
- 20. Pollastri F, Cipriano M, Bolelli F, Grana C. Long-Range 3D Self-Attention for MRI Prostate Segmentation. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). 2022. p. 1–5. https://doi.org/10.1109/isbi52829.2022.9761448
- 21. Li X, Ding H, Yuan H, Zhang W, Pang J, Cheng G, et al. Transformer-Based Visual Segmentation: A Survey. IEEE Trans Pattern Anal Mach Intell. 2024;46(12):10138–63. pmid:39074008
- 22. Strudel R, Garcia R, Laptev I, Schmid C. Segmenter: Transformer for Semantic Segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. p. 7242–52. https://doi.org/10.1109/iccv48922.2021.00717
- 23. Ma J, Li F, Wang B. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722. 2024.
- 24. Wang Z, Zheng JQ, Zhang Y, Cui G, Li L. Mamba-UNet: UNet-like pure visual Mamba for medical image segmentation. arXiv preprint arXiv:2402.05079. 2024.
- 25. Wu R, Liu Y, Liang P, Chang Q. H-vmunet: High-order Vision Mamba UNet for medical image segmentation. Neurocomputing. 2025;624:129447.
- 26. Badar HMS, Kajla NI, Arshad J, Saher N, Ahmad M, Jamil MA. Lightweight intrusion detection for IoD infrastructure using deep learning. Journal of Computing & Biomedical Informatics. 2024;ICASET 2024 Special Issue:1–12.
- 27. Litjens G, Toth R, van de Ven W, Hoeks C, Kerkstra S, van Ginneken B, et al. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Med Image Anal. 2014;18(2):359–73. pmid:24418598
- 28. Wu R, Liu Y, Liang P, Chang Q. Ultralight VM-UNet: Parallel vision Mamba significantly reduces parameters for skin lesion segmentation. arXiv preprint arXiv:2403.20035. 2024.
- 29. Liao W, Zhu Y, Wang X, Pan C, Wang Y, Ma L. LightM-UNet: Mamba assists in lightweight UNet for medical image segmentation. arXiv preprint arXiv:2403.05246. 2024.
- 30. Ouyang D, He S, Zhang G, Luo M, Guo H, Zhan J, et al. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023. p. 1–5. https://doi.org/10.1109/icassp49357.2023.10096516
- 31. Sun Y, Wang S, Chen C, Xiang TZ. Boundary-guided camouflaged object detection. arXiv preprint arXiv:2207.00794. 2022.
- 32. Wu R, Liang P, Huang X, Shi L, Gu Y, Zhu H, et al. MHorUNet: High-order spatial interaction UNet for skin lesion segmentation. Biomedical Signal Processing and Control. 2024;88:105517.
- 33. Puttagunta RS, Kathariya B, Li Z, York G. Multi-Scale Feature Fusion using Channel Transformers for Guided Thermal Image Super-Resolution. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2024.
- 34. Kajla NI, Missen MMS, Luqman MM, Coustaty M, Mehmood A, Choi GS. Additive Angular Margin Loss in Deep Graph Neural Network Classifier for Learning Graph Edit Distance. IEEE Access. 2020;8:201752–61.
- 35. Diakogiannis FI, Waldner F, Caccetta P, Wu C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS Journal of Photogrammetry and Remote Sensing. 2020;162:94–114.
- 36. Fan DP, Ji GP, Zhou T, Chen G, Fu H, Shen J, et al. PraNet: Parallel reverse attention network for polyp segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2020. Springer; 2020. p. 263–73.
- 37. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2102.04306
- 38. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-UNet: UNet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision (ECCV). Springer; 2022. p. 205–18.
- 39. Ruan J, Li J, Xiang S. VM-UNet: Vision Mamba UNet for medical image segmentation. arXiv preprint arXiv:2402.02491. 2024.