
DEFIF-Net: A lightweight dual-encoding feature interaction fusion network for medical image segmentation

  • Zhanlin Ji,

    Roles Conceptualization, Funding acquisition, Investigation, Software, Validation, Writing – review & editing

    Affiliations College of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou, Zhejiang, China, TRC/ECE, University of Limerick, Limerick, Ireland

  • Shengnan Hao,

    Roles Data curation, Methodology, Software, Visualization, Writing – original draft

    Affiliation Hebei Key Laboratory of Industrial Intelligent Perception, North China University of Science and Technology, Tangshan, Hebei, China

  • Quanming Zhao,

    Roles Data curation, Investigation, Methodology, Resources

    Affiliation Hebei Key Laboratory of Industrial Intelligent Perception, North China University of Science and Technology, Tangshan, Hebei, China

  • Zidong Yu,

    Roles Conceptualization, Methodology, Validation, Writing – original draft

    Affiliation Hebei Key Laboratory of Industrial Intelligent Perception, North China University of Science and Technology, Tangshan, Hebei, China

  • Hongjiu Liu ,

    Roles Conceptualization, Formal analysis, Supervision, Validation

    joe_hunter@zafu.edu.cn (HL); tslilei@sina.com (LL); ivan.ganchev@ul.ie (IG)

    Affiliation College of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou, Zhejiang, China

  • Lei Li ,

    Roles Formal analysis, Investigation, Project administration, Writing – review & editing


    Affiliation Department of Cardiology, Tangshan Workers’ Hospital, Tangshan, Hebei, China

  • Ivan Ganchev

    Roles Data curation, Funding acquisition, Project administration, Supervision, Writing – review & editing


    Affiliations TRC/ECE, University of Limerick, Limerick, Ireland, Faculty of Mathematics and Informatics, University of Plovdiv “Paisii Hilendarski”, Plovdiv, Bulgaria, Institute of Mathematics and Informatics—Bulgarian Academy of Sciences, Sofia, Bulgaria

Abstract

Medical image segmentation plays a crucial role in computer-aided diagnosis. By segmenting pathological tissues in medical images, doctors can observe anatomical structures more clearly, thereby achieving more accurate disease diagnoses. However, existing medical image segmentation networks have issues such as insufficient capability to extract features from target areas, as well as a high parameter count and computational complexity. To address these issues, a lightweight Dual-Encoding Feature Interaction Fusion network (DEFIF-Net) is proposed in this paper for medical image segmentation. Firstly, in the encoding stage of DEFIF-Net, a global dependency fusion branch is introduced as an additional encoder to capture distant feature dependencies, whereby the neighboring and distant feature dependencies are effectively integrated by the newly designed feature interaction fusion convolution. Secondly, between the encoder and decoder, channel feature reconstruction modules (CFRMs) are used to enhance the feature representation of important channels. Additionally, a novel multi-branch ghost module (MBGM) is used in the bottleneck layer of the network to enhance its efficiency in capturing and retaining different types of feature information. Lastly, a novel residual feature enhancement (RFE) decoder is utilized to emphasize boundary features, thereby increasing the network’s sensitivity to lesion boundaries. The segmentation performance of the proposed DEFIF-Net network is evaluated in two different medical image segmentation tasks. The obtained experimental results demonstrate that, compared to state-of-the-art networks, DEFIF-Net exhibits superior segmentation performance on all three datasets used, while also having a lower parameter count and computational complexity.

1 Introduction

Medical image segmentation is a crucial step in the skin lesion segmentation process. It can segment important pathological areas from images, assisting doctors in clinical judgments [1]. Traditional skin lesion segmentation methods overly rely on doctors manually extracting features from images, which significantly increases labor and time costs [2]. Therefore, in order to enhance the efficiency and performance of skin lesion image segmentation and reduce dependence on manual intervention, automated segmentation methods have become the focus and trend of current research [3].

In recent years, researchers have been actively exploring automated skin lesion image segmentation methods based on deep learning. These methods leverage deep neural networks to automatically learn representations of features in images, providing more reliable support for diagnosis and treatment of skin diseases [4,5]. In 2015, Ronneberger et al. [6] proposed the U-Net network to tackle the semantic segmentation challenges in medical image segmentation tasks with limited sample sizes. U-Net adopts an encoder−decoder architecture and introduces skip connections between the encoder and decoder, which allows it to leverage limited sample data more effectively, leading to excellent performance on small-sample datasets such as those containing medical images. The unique design of U-Net has also inspired many subsequent networks, such as MultiResUNet [7], UNET 3+ [8], etc. They adopt similar structures to enhance their adaptability to small sample data and improve their ability to perceive subtle features in medical images. However, their methods of processing input images are somewhat limited, especially when dealing with structurally complex skin disease images, which may result in inadequate modeling of long-range dependencies [9]. To better capture long-distance dependencies between sequences, Chen et al. [10] combined the structures of Transformer and U-Net [6], designing a Transformer-based medical image segmentation network, called TransUNet, which employs a self-attention mechanism to capture global dependencies while also retaining the advantages of U-Net, enabling it to more accurately capture pathological structural information in medical image segmentation tasks. However, since Transformer-based models typically contain a large number of parameters and complex computational structures, they require more computational resources and time for training and inference, which limits their scalability and practicality in real-world applications [11]. 
In addition, some lightweight skin lesion segmentation models (such as I-UNeXt [12], MALUNet [13], etc.) have a lower parameter count and faster processing speed, but perform poorly in segmentation tasks and exhibit low generalization ability. Currently, despite researchers’ continuous exploration and experimentation with new methods to improve the performance and efficiency of skin lesion image segmentation, striking a balance between segmentation performance and efficiency remains a huge challenge.

To simultaneously lower the computational costs and improve the segmentation performance, this paper proposes a novel lightweight medical image segmentation network, named DEFIF-Net. The proposed network significantly enhances the feature capture and integration capabilities through a dual encoding mechanism, and introduces a novel RFE decoder to optimize the feature space, thereby improving its accuracy in handling regions with high detail complexity in images. Additionally, by integrating specially designed lightweight modules, DEFIF-Net is able to effectively acquire complex feature information from images within low-computational environments.

The main contributions of this article are as follows:

  1. A novel global dependency fusion (GDF) branch is elaborated as a second encoder in the proposed network to capture the long-range feature dependencies that could be otherwise missed. Additionally, in this branch, a novel feature interaction fusion convolution (FIFConv) is utilized to fully integrate the dependency relationships between features at different distances, thereby enhancing the network’s understanding of the overall image structure.
  2. A novel channel feature reconstruction module (CFRM) is designed for the proposed network to break the independence between feature channels and promote information transmission between channels.
  3. A novel RFE decoder is designed to enhance the proposed network’s perception and discrimination ability of subtle boundary features, thereby allowing it to effectively capture the boundary texture features of pathological regions.
  4. A novel multi-branch ghost module (MBGM) is elaborated and utilized at the bottleneck layer of the proposed network. This module is capable of preserving feature diversity and reducing information loss while maintaining low computational complexity.

2 Methods

2.1 Related work

Medical image segmentation plays a crucial role in computer-aided diagnosis, providing essential support for lesion detection, tissue analysis, and surgical planning. In recent years, deep learning networks, particularly Convolutional Neural Networks (CNNs) [14], have made significant advancements in the field of medical image segmentation. U-Net [6] is one of the earliest and most widely used networks for medical image segmentation. It employs an encoder−decoder structure and utilizes skip connections to pass detailed features between the encoder and decoder, thus retaining more spatial information in low-resolution features. Subsequently, variants like UNet++ [15] and Attention U-Net [16] have further enhanced the feature representation capabilities of U-Net. UNet++ introduces nested and dense skip connections to capture more details, while Attention U-Net adaptively focuses on regions of interest through attention mechanisms, thus improving segmentation performance. Subsequently, in order to improve the ability to segment skin lesions, Ibtehaz et al. introduced the Inception block [17] into the U-Net network and designed an enhanced U-Net based network, called MultiResUNet [7]. Compared with other methods that directly increase the depth and width of the network, MultiResUNet can effectively utilize parameters, reduce redundancy, and improve efficiency of skin lesion segmentation, which makes it a particularly good representative of skin lesion segmentation networks. In terms of lightweight skin lesion segmentation network design, networks like MALUNet [13], EGE-UNet [18], and VM-UNet [19] offer high computational efficiency, making them suitable for applications with real-time requirements. In a recent study, Ruan et al. [20] have designed MEW-UNet by replacing the self-attention in ViT with a newly designed Multi-axis External Weights block, which achieved excellent performance on multiple medical image datasets. Chen et al. 
[21] designed SCSONet to balance the computational efficiency and segmentation effect of skin lesion segmentation. SCSONet effectively enhances the recognition of irregular skin lesion areas and reduces the computational burden by learning complementary attention weights. Although these networks have improved segmentation efficiency, they have not yet reached a satisfactory level in terms of segmentation performance. Currently, despite the progress made by many such networks, challenges remain in skin lesion image segmentation, particularly in handling fine structures, complex backgrounds, and cross-dataset generalization. This paper builds upon the encoder−decoder network structure to further improve feature fusion strategies and structural design, achieving a better balance between performance and inference efficiency. These innovations enhance segmentation performance and demonstrate broad applicability in complex medical imaging scenarios.

2.2 DEFIF-Net

2.2.1 Overall structure.

The proposed DEFIF-Net network has a unique asymmetric encoder−decoder structure. In order to enhance the network’s expression of complex pathological features and optimize its ability to parse pathological structures in images, newly designed modules are integrated into DEFIF-Net to improve its segmentation performance and robustness. This innovative design enables DEFIF-Net to maintain high-performance segmentation even when operating within resource-constrained environments, significantly improving its applicability in the field of medical image segmentation. The specific implementation details of these modules are described in the following subsections.

The DEFIF-Net structure consists of four main parts: a dual encoder, a bottleneck layer, skip connections, and RFE decoders, as illustrated in Fig 1. Initially, the network receives an input feature map and feeds this into the dual encoder, where each encoding layer performs layer-wise bilinear downsampling on the input feature map through two different branches: (1) the original U-Net encoder [6] (the left part of the dual encoder), responsible for extracting local feature dependencies in the input feature map; and (2) the newly designed GDF branch (the right part of the dual encoder), focused on capturing long-range feature dependencies in the feature map. These two types of dependencies are fused after each encoding layer to form a richer feature representation. At the end of the encoding process, the final feature maps from the two encoders are merged and passed to the bottleneck layer to further integrate and enhance the feature information. Additionally, each encoding layer is equipped with a max pooling layer to abstract the input feature map, thereby enhancing the network’s processing efficiency. In the bottleneck layer, diverse feature information in the encoder’s output feature map is preserved through two convolutions and a newly designed MBGM. In the skip connections between the encoder and decoder, novel CFRMs are used to further emphasize the important feature channels output by the GDF branches. By integrating an upsampling layer with the RFE decoder modules, the network’s decoding unit merges and enhances features from CFRM and the bottleneck layer, enabling the network to focus more on the boundary texture of pathological targets, thereby achieving finer boundary texture depiction. Finally, a convolution is utilized to adjust the channel numbers based on the number of categories and generate the final binary or multi-class masks.

2.2.2 Dual encoder.

In the process of processing medical images, the network’s judgment of pathological regions needs to comprehensively consider both local features and spatial features [22]. Among these, the local features contain structural and textural information about the lesion areas, which are critical for identifying the type and characteristics (e.g., benign, malignant, and severity) of the lesion areas. Spatial features provide the network with overall structural information of the pathological area and its relationship with the surrounding normal skin positions, helping the network recognize pathological areas of different shapes and accurately locate lesion positions [23]. Based on this, a novel GDF branch was designed specifically to capture long-distance feature dependencies, forming a complementary relationship with the output of the U-Net original encoder, thus enabling the network to simultaneously focus on both local and spatial features.

The structure of the DEFIF-Net dual encoder is shown in Fig 2a. First, the input image is separately fed into the U-Net original encoder and the newly proposed GDF branch. In the former, adjacent feature dependencies within a smaller receptive field are captured through two 3 × 3 convolution layers, and the feature maps are dimensionally reduced by a max pooling layer. Meanwhile, a convolution is utilized by the GDF branch to extract remote feature dependencies from a broader receptive field, which are then fused with the adjacent feature dependencies captured by the U-Net original encoder. However, simple element-wise addition fusion operations may overlook deeper feature dependencies, potentially limiting the network’s ability to comprehend global information [24]. Therefore, a novel type of convolution, FIFConv, is introduced in the GDF branch to further fuse and extract features, aiming to explore the dependencies between features. Then, the fully fused feature maps are sent separately to the max pooling layer and the skip connection section, completing a layer of feature encoding. In DEFIF-Net, this encoding process is repeated several times (four times in the conducted experiments). After completing the feature encoding of the last layer, the final output features from both the U-Net original encoder and the GDF branch are fused and passed to the bottleneck layer of the network. This innovative design significantly enhances the network’s ability to delineate lesion boundaries, improving the segmentation performance and robustness w.r.t. complex lesion morphologies.

Fig 2. The DEFIF-Net’s: (a) dual encoder; (b) RFE decoder.

https://doi.org/10.1371/journal.pone.0324861.g002
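One encoding layer of this dual encoder can be sketched in PyTorch as follows. This is an illustrative sketch rather than the authors' code: the GDF branch's kernel size is an assumption, and FIFConv is replaced by a plain 3 × 3 convolution stand-in.

```python
import torch
import torch.nn as nn

class DualEncoderLayer(nn.Module):
    """Sketch of one dual-encoder layer (GDF kernel size is an assumption;
    a plain 3x3 convolution stands in for FIFConv)."""
    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.local = nn.Sequential(          # U-Net branch: two 3x3 convolutions
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.gdf = nn.Conv2d(cin, cout, 7, padding=3)   # assumed large-kernel conv
        self.fif = nn.Conv2d(cout, cout, 3, padding=1)  # FIFConv stand-in
        self.pool = nn.MaxPool2d(2)

    def forward(self, x_local, x_gdf):
        l = self.local(x_local)            # adjacent (local) feature dependencies
        g = self.fif(self.gdf(x_gdf) + l)  # fuse distant + local dependencies
        # the fused map feeds both the next layer and the skip connection
        return self.pool(l), self.pool(g), g
```

Stacking four such layers, as in the conducted experiments, yields the full dual encoder.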

2.2.3 RFE decoder.

As the boundary textures of pathological regions in medical images often exhibit diversity, accurately identifying the boundaries of these regions is crucial for improving the network segmentation performance [25]. Therefore, we combined FIFConv with residual structures to design a novel RFE decoder (Fig 2b), so as to help the network better extract and emphasize complex boundary features, thereby improving its segmentation performance. Additionally, the residual branch also helps the network converge more easily during training, thereby enhancing the network training efficiency.

Firstly, the RFE decoder restores the compressed feature maps through upsampling and adjusts the channels using convolutions to match the size of the feature maps after the CFRM processing. Subsequently, the adjusted feature map is combined with the feature map passed through skip connections to form a richer feature representation. Next, multi-level semantic features are obtained through convolutions, and FIFConv is utilized to enhance and refine the boundaries of target regions, aiding the network in better extracting boundary features. Finally, a residual branch is combined with feature maps after the initial fusion to emphasize important features, thereby further improving the network’s ability to express boundary features. During the decoding stage, the above process is repeated at each decoding layer.
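The decoding steps described above can be sketched as a single PyTorch module. The kernel sizes are assumptions, and FIFConv is again stood in for by a plain 3 × 3 convolution:

```python
import torch
import torch.nn as nn

class RFEDecoder(nn.Module):
    """Sketch of one RFE decoding step (kernel sizes are assumptions;
    a plain 3x3 convolution stands in for FIFConv)."""
    def __init__(self, cin: int, cskip: int, cout: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.adjust = nn.Conv2d(cin, cskip, 1)    # match the CFRM output channels
        self.convs = nn.Sequential(               # multi-level semantic features
            nn.Conv2d(2 * cskip, cout, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fif = nn.Conv2d(cout, cout, 3, padding=1)  # FIFConv stand-in
        self.res = nn.Conv2d(2 * cskip, cout, 1)        # residual branch

    def forward(self, x, skip):
        x = self.adjust(self.up(x))               # restore resolution, adjust channels
        fused = torch.cat([x, skip], dim=1)       # combine with the skip connection
        return self.fif(self.convs(fused)) + self.res(fused)
```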

2.2.4 FIFConv.

When integrating local and spatial features in the GDF branch, using only an element-wise addition might ignore potential relational patterns and pathological structures among the features, thus limiting the network’s analytical and understanding capabilities w.r.t. input images. Directly adding feature maps may lead to imbalance in feature importance, particularly when there are conflicts or overlaps between features, causing some features to be overemphasized or underestimated [26]. Therefore, to fully integrate detail features and spatial features, we propose here the feature interaction fusion convolution (FIFConv), shown in Fig 3, for capturing rich dependency relationships among features.

FIFConv employs a unique feature extraction strategy of tearing, fusion, tearing again, and recombining, effectively achieving interactive fusion of features from different scales and spatial extents. Specifically, the input feature map X ∈ ℝC×H×W is first equally divided into two sub-feature maps along the channel dimension during the tearing stage. Then, the sub-feature maps are transformed and fused using a convolution to generate new feature maps. Subsequently, grouped convolutions are used to perform deep feature learning within each channel group at a lower computational cost, and the result is added to the feature map processed by the preceding convolution. Simultaneously, the other sub-feature map is enhanced through a residual connection to preserve more original feature information. Subsequently, these two feature maps are fused along the channel dimension and processed through an average pooling and a sigmoid activation function. The calculated weight coefficients are then used to adjust the importance of each channel, thereby refining the expression of features. Finally, to prevent the network from overly relying on information from specific channels, the merged feature map is once again torn along the channel dimension and reassembled through an element-wise addition to enhance the synergistic effect among different channel features, yielding the output feature map Z ∈ ℝC×H×W of FIFConv. The above process is represented as follows:

(1)–(6)

where the operations involved are: splitting and merging of feature maps along the channel dimension, an element-wise multiplication, a regular convolution, a grouped convolution that divides the input channels into groups, the sigmoid activation function, and an average pooling.
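Based on the description above, a FIFConv sketch in PyTorch might look as follows. The kernel sizes, the group count, and the final reassembly step are not fully specified in the text, so the choices below are assumptions, marked in the comments:

```python
import torch
import torch.nn as nn

class FIFConv(nn.Module):
    """Sketch of the feature interaction fusion convolution (FIFConv).

    Kernel sizes, group count, and the final reassembly step are assumptions.
    """
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        half = channels // 2
        self.conv1 = nn.Conv2d(half, half, kernel_size=1)  # assumed 1x1 fusion conv
        self.conv2 = nn.Conv2d(half, half, kernel_size=1)
        self.gconv = nn.Conv2d(half, half, kernel_size=3, padding=1,
                               groups=groups)              # assumed 3x3 grouped conv
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)           # "tearing": split along channels
        f1, f2 = self.conv1(x1), self.conv2(x2)
        a = self.gconv(f1) + f1              # grouped conv added to the fused map
        b = f2 + x2                          # residual-enhanced branch
        y = torch.cat([a, b], dim=1)         # fuse along the channel dimension
        y = y * torch.sigmoid(self.pool(y))  # channel weights via AP + sigmoid
        y1, y2 = y.chunk(2, dim=1)           # tear again; the reassembly below
        # (swap the halves and add them back) is an assumed cross-channel mix
        return y + torch.cat([y2, y1], dim=1)
```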

2.2.5 CFRM.

By selecting task-relevant channels from the feature map and enhancing their performance, the interference of irrelevant information on the decoder can be reduced [27]. Therefore, to break the isolation between channels in the feature map, promote interaction between channels, and enhance the network’s utilization of important channel features, we designed a novel CFRM module with the structure depicted in Fig 4. To avoid increasing the number of network parameters, one branch in CFRM uses an average pooling to embed global features, while the other branch employs group normalization to enhance the network’s understanding of spatial relationships and pathological structures. Subsequently, the network’s adaptability to different data is enhanced by adaptively learning the features of two different channel groups using two learnable parameters. Moreover, by enhancing and reconstructing features of each channel, CFRM allows the proposed network to concentrate more on learning and extracting critical channel features, thus enhancing its sensitivity to pathological regions. Algorithm 1 details the operational process of the CFRM in a pseudocode form.

Specifically, in CFRM, to reduce the computational burden on the network, the feature map is first divided into two sub-channel groups along the channel dimension, and different feature extraction operations are performed on these two sub-channel groups. In the first feature extraction branch, to avoid increasing the network’s parameter count, an average pooling is used to embed global features. Then, by using learnable parameters to adaptively adjust the network’s selection of different features, the network can more flexibly capture subtle differences in various data, thereby maintaining high segmentation performance. Next, the importance of each pixel is calculated using the sigmoid function to enhance the network’s focus on important features, while reducing its attention to background information (irrelevant features) in the images.

In the other feature extraction branch, a grouped convolution is used to enhance the feature representation capabilities within each channel group. Additionally, group normalization is used to improve the network’s understanding of spatial relationships and structures, and learnable parameters along with the sigmoid function are used to dynamically adjust feature weights. Following this, by multiplying the two feature maps in the second branch, the network’s attention to important channels is enhanced, producing the second branch’s output feature map. Then, the output feature maps of the two branches are concatenated, and the channel order of the concatenated feature map is shuffled to promote information exchange between feature maps, resulting in the output feature map of CFRM.

Algorithm 1: The PyTorch-style pseudo-code of CFRM

#Input: X, the feature map with shape [B, C, H, W]

#Output: Z, the feature map with shape [B, C, H, W]

#Operator: AP, Average pooling operation

     GN, Group normalization operation

     Sigmoid, Sigmoid function

1: X1, X2 = X.chunk(2, dim=1)

2: Y1 = X1 * Sigmoid(AP(X1) * Parameter(torch.zeros(1, C//2, 1, 1)) + Parameter(torch.ones(1, C//2, 1, 1)))

3: Y2 = Conv1(X2) * Sigmoid(GN(X2) * Parameter(torch.zeros(1, C//2, 1, 1)) + Parameter(torch.ones(1, C//2, 1, 1)))

4: Y = torch.cat([Y1, Y2], dim=1)

5: B, C, H, W = Y.shape

6: Z = Y.reshape(B, 2, -1, H, W).permute(0, 2, 1, 3, 4).reshape(B, -1, H, W)
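Algorithm 1 can be wrapped into a self-contained PyTorch module. The sketch below follows the pseudo-code line by line; the kernel size of Conv1 and the number of GroupNorm groups are not stated in the pseudo-code, so they are assumptions:

```python
import torch
import torch.nn as nn

class CFRM(nn.Module):
    """Runnable version of Algorithm 1 (channel feature reconstruction module).

    Conv1's kernel size and the GroupNorm group count are assumptions.
    """
    def __init__(self, channels: int, gn_groups: int = 4):
        super().__init__()
        half = channels // 2
        # learnable scale (init 0) and bias (init 1), as in Algorithm 1
        self.w1 = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.b1 = nn.Parameter(torch.ones(1, half, 1, 1))
        self.w2 = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.b2 = nn.Parameter(torch.ones(1, half, 1, 1))
        self.conv1 = nn.Conv2d(half, half, kernel_size=1)  # assumed 1x1
        self.gn = nn.GroupNorm(gn_groups, half)
        self.ap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)                               # line 1
        y1 = x1 * torch.sigmoid(self.ap(x1) * self.w1 + self.b1)  # line 2
        y2 = self.conv1(x2) * torch.sigmoid(self.gn(x2) * self.w2 + self.b2)  # line 3
        y = torch.cat([y1, y2], dim=1)                           # line 4
        b, c, h, w = y.shape                                     # line 5
        # line 6: channel shuffle with two groups
        return y.reshape(b, 2, c // 2, h, w).permute(0, 2, 1, 3, 4).reshape(b, c, h, w)
```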

2.2.6 MBGM.

Convolution operations have issues with fixed receptive fields and insufficient information fusion [28]. Therefore, employing regular convolutions in the bottleneck layer fails to adequately capture the feature representations in the encoder [29]. To address this issue, a newly designed MBGM is introduced in the bottleneck layer of the proposed network. The MBGM design was inspired by the Ghost module [30], which utilizes a parallel combination of convolutions with different kernel sizes, allowing each branch to focus on capturing different information from the input features. This enables the network to better capture and preserve diverse feature information with lower computational cost. The elaborated MBGM structure is shown in Fig 5.

MBGM first extracts higher-level feature information from the input feature map by using a convolution. Afterwards, the resulting feature map is passed to the Ghost module, where two sets of convolutions and residual branches are used to merge the feature information of the two feature maps. To further extract local features, an additional feature extraction branch is used after the first convolution in the Ghost module, and the extracted features are fused with the Ghost module’s output. Finally, the fused feature map is refined through a convolution. Additionally, in MBGM, batch normalization and ReLU activation functions are utilized after each convolution to introduce non-linearity and maintain network stability. MBGM effectively integrates feature information from the encoder and preserves relatively complete feature representations. This innovative design improves the segmentation performance of the proposed network w.r.t. various pathological images, while maintaining efficient network operation. The MBGM processing is represented as follows:

(7)(8)
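A hedged PyTorch sketch of MBGM, following the description above: the individual kernel sizes are assumptions, and the Ghost pair is modeled as a primary convolution plus a cheap depthwise convolution, in the spirit of the Ghost module.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k, groups=1):
    """Convolution followed by BN and ReLU, as used after each MBGM conv."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class MBGM(nn.Module):
    """Sketch of the multi-branch ghost module (kernel sizes are assumptions)."""
    def __init__(self, channels: int):
        super().__init__()
        self.pre = conv_bn_relu(channels, channels, 3)   # assumed 3x3 pre-conv
        # Ghost-style pair: a primary conv and a cheap depthwise conv
        self.primary = conv_bn_relu(channels, channels, 1)
        self.cheap = conv_bn_relu(channels, channels, 3, groups=channels)
        self.extra = conv_bn_relu(channels, channels, 3)  # extra local branch
        self.refine = conv_bn_relu(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.pre(x)
        p = self.primary(f)
        # fuse the Ghost pair, the extra branch, and the residual
        out = self.cheap(p) + p + self.extra(p) + f
        return self.refine(out)
```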

3 Experiments and results

3.1 Datasets and data preprocessing

Three datasets were used in the experiments conducted to evaluate the segmentation performance of DEFIF-Net in comparison to state-of-the-art networks. The two datasets used in the first group of experiments were the ISIC2016 [31] and ISIC2017 [32] public skin disease datasets, provided by the International Skin Imaging Collaboration, containing 1279 and 2150 images, respectively. These are two authoritative and widely used resources in the field of skin disease research. Additionally, we evaluated the generalization performance of DEFIF-Net, in comparison to state-of-the-art networks, in various segmentation tasks using the DSB2018 public microscopic cell dataset.

In the experiments, the aforementioned datasets were randomly split into training, validation, and test sets at a ratio of 7:2:1, as specified in Table 1. Additionally, in order to address overfitting during training and dataset imbalance, we employed data augmentation techniques such as random rotation, flipping, cropping, and adding random body-hair noise during image preprocessing. These operations expand the representation of the lesion area and increase the network’s robustness to foreground lesions.

Table 1. Splitting the utilized datasets into training, validation, and test sets.

https://doi.org/10.1371/journal.pone.0324861.t001
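The 7:2:1 split can be reproduced with a simple seeded shuffle. The helper below is illustrative (the augmentation pipeline, being library-specific, is not shown):

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Seeded random train/val/test split at the given ratios (default 7:2:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For the 1279 images of ISIC2016, for example, this yields 895/255/129 training/validation/test images.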

3.2 Experimental environment

DEFIF-Net was developed on a computer equipped with a Xeon(R) Platinum 8255C CPU and an RTX 3080 GPU, using PyTorch under Python 3.9. In the experiments, to ensure that the network is adequately trained, we adopted the following settings: the number of epochs was set to 200 (as the network training effect remained stable after reaching this number), the batch size was set to 8, the initial learning rate was set to 0.0001, the learning rate decay factor was one-half, and an improved combined binary cross entropy (BCE) [33] − Dice [34] loss function was used for network training, represented as:

(9)

where p_i denotes the predicted value of pixel i, y_i denotes its ground-truth value, and N denotes the total number of pixels.
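A plain implementation of the combined BCE−Dice loss reads as follows. Note that this is the standard combination; the paper's specific "improvement" is not detailed here, so this sketch should be taken as illustrative:

```python
import torch

def bce_dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Plain combined BCE-Dice loss over N pixels (pred and target in [0, 1])."""
    pred = pred.clamp(eps, 1 - eps)  # avoid log(0)
    bce = -(target * pred.log() + (1 - target) * (1 - pred).log()).mean()
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice
```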

3.3 Evaluation metrics

In the experiments conducted, the Intersection over Union (IoU), Dice Similarity Coefficient (DSC), accuracy (Acc), and recall (Rec) were used to objectively evaluate the segmentation performance of the proposed network in comparison to state-of-the-art networks.
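For reference, these four metrics can be computed from the confusion-matrix counts of binary masks; the helper below is illustrative rather than the authors' evaluation code:

```python
import numpy as np

def seg_metrics(pred: np.ndarray, gt: np.ndarray):
    """IoU, DSC, accuracy, and recall for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    eps = 1e-9  # guard against empty masks
    iou = tp / (tp + fp + fn + eps)
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    acc = (tp + tn) / (tp + tn + fp + fn)
    rec = tp / (tp + fn + eps)
    return iou, dsc, acc, rec
```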

3.4 Experimental results and analysis

3.4.1 Performance comparison on ISIC2016 and ISIC2017 datasets.

In the first set of experiments, we conducted a quantitative comparison of DEFIF-Net with multiple medical image segmentation networks, performed on the ISIC2016 and ISIC2017 datasets. The networks compared include both classic networks (U-Net [6], UNet++ [15], Attention U-Net [16]) and mainstream networks (TransUNet [10], AttaNet [35], I-UNeXt [12], MALUNet [13], EGE-UNet [18], DCSAU-Net [36], MultiResUNet [7], SCSONet [21]). The experimental results obtained are shown in Table 2.

Table 2. Comparative experimental results, obtained on ISIC2016 and ISIC2017 datasets.

https://doi.org/10.1371/journal.pone.0324861.t002

From the experimental results in Table 2, it is evident that DEFIF-Net achieved the best segmentation performance on both datasets, according to three out of four evaluation metrics including the two major ones used in the field of image segmentation, namely IoU and DSC.

Specifically, on the ISIC2016 dataset, DEFIF-Net achieved an IoU value of 85.06%, a DSC value of 91.60%, and an accuracy value of 95.21%, which are respectively higher by 0.51, 0.34, and 0.12 percentage points than those of the second best performing network (i.e., DCSAU-Net according to IoU and DSC, and AttaNet according to accuracy). Additionally, when compared to the lightweight EGE-UNet network, DEFIF-Net’s IoU, DSC, and accuracy values are better by 2.14, 1.41, and 0.96 percentage points, respectively. However, in terms of recall, DEFIF-Net slightly lags behind the leader (AttaNet) and first runner-up (I-UNeXt), thus taking third place.

On the ISIC2017 dataset, the proposed network also leads in IoU, DSC, and accuracy, achieving 83.39%, 90.02%, and 95.17%, respectively, which makes it better than the second best performing network by: 1.05 percentage points (w.r.t. DCSAU-Net) based on IoU, 0.70 percentage points (w.r.t. Attention U-Net) according to DSC, and 0.32 percentage points (w.r.t. TransUNet) based on accuracy. According to recall, DEFIF-Net takes third place, closely following the leader (I-UNeXt) and first runner-up (EGE-UNet).

Overall, DEFIF-Net demonstrates strong competitiveness in skin lesion segmentation tasks, and its excellent performance on the ISIC2016 and ISIC2017 datasets clearly demonstrates its potential in the field of skin lesion image analysis.

A sample visualization of the segmentation results, achieved by the compared networks on the ISIC2016 and ISIC2017 datasets, is shown in Fig 6.

Fig 6. Visualization comparison of network segmentation results on the ISIC2016 and ISIC2017 datasets.

https://doi.org/10.1371/journal.pone.0324861.g006

Firstly, when observing the comparative images shown in the first and second columns of Fig 6, it can be noticed that DEFIF-Net effectively suppresses the background noise around the skin lesion area, whereas some other networks do not perform well in this regard. Secondly, from the third and fourth columns, it can be noticed that when faced with large lesion regions, DEFIF-Net can more completely segment the skin lesion area, while other networks can only segment the regions with more obvious lesion features in the images. In particular, U-Net incorrectly identifies the entire image in the third column as a normal skin area. Furthermore, by observing the comparative images in the fifth column of Fig 6, it can be clearly seen that DEFIF-Net performs better than other networks in extracting boundary pixels of skin lesions. Finally, from the sixth column, displaying skin lesion images with body-hair noise, it can be seen that other networks might incorrectly identify areas with abundant body hair as lesion areas, whereas the proposed network localizes the target around the lesion area and its surroundings, resulting in minimal false positives. Overall, DEFIF-Net demonstrates significant superiority over other networks in all aspects, particularly in effectively suppressing background noise, segmenting large lesion areas, and extracting boundary pixels, which establishes it as a more precise tool for skin lesion segmentation.

Table 3 shows the average Hausdorff distances (HD) of the networks compared on the ISIC2016 and ISIC2017 datasets. On both datasets, the proposed DEFIF-Net network achieved the smallest average HD values of 18.60 and 19.56, respectively. Moreover, even though the average HD values of AttaNet (18.90 on ISIC2016) and TransUNet (19.95 on ISIC2017) are close to those of DEFIF-Net, they fluctuate considerably between the two datasets. In contrast, the HD values of DEFIF-Net remain more stable on both datasets.
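As a reference for how this metric is computed, the symmetric Hausdorff distance between two binary masks can be sketched as follows. This is a brute-force NumPy illustration, not the paper's evaluation code; in practice, KD-tree-based implementations or the 95th-percentile variant (HD95) are often preferred for speed and robustness to outliers.

```python
import numpy as np

def hausdorff_distance(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance (in pixels) between two non-empty binary masks."""
    pts_a = np.argwhere(mask_a.astype(bool))  # (row, col) foreground coordinates
    pts_b = np.argwhere(mask_b.astype(bool))
    # Pairwise Euclidean distances between all foreground pixels
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    # Directed distances in both directions; the larger one is the symmetric HD
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

# Toy example: two single-pixel masks 3 rows and 4 columns apart -> HD = 5.0
a = np.zeros((8, 8)); a[0, 0] = 1
b = np.zeros((8, 8)); b[3, 4] = 1
hd = hausdorff_distance(a, b)
```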

Table 3. The average Hausdorff distances (in pixels) of networks compared on the ISIC2016 and ISIC2017 datasets.

https://doi.org/10.1371/journal.pone.0324861.t003

3.4.2 Performance comparison on DSB2018 dataset.

In medical imaging, skin lesions typically manifest as singular and relatively large feature areas, whereas cell nuclei are represented as small and densely packed pathological structures. Therefore, in the second group of experiments, we selected a cell microscopy image dataset with significantly different characteristics from skin lesions to verify the generalizability of the proposed DEFIF-Net network in other medical imaging tasks. From the previous group of experiments, we selected several networks that performed well in cell segmentation tasks, including U-Net, AttaNet, I-UNeXt, MALUNet, EGEU-Net, TransUNet, and DCSAU-Net, for a quantitative comparison with the proposed network. Experimental results are presented in Table 4.

Table 4. Comparative experimental results, obtained on the DSB2018 dataset.

https://doi.org/10.1371/journal.pone.0324861.t004

As can be seen from Table 4, DEFIF-Net outperformed all other networks in the task of cell microscopy image segmentation according to three out of four evaluation metrics, including the two major ones (IoU and DSC). Specifically, DEFIF-Net achieved an IoU value of 81.37%, a DSC value of 89.17%, and a recall value of 91.86%, which are respectively higher by 0.30, 0.81, and 1.64 percentage points than those of the second-best performing network (i.e., DCSAU-Net according to IoU and DSC, and U-Net according to recall). This indicates that DEFIF-Net possesses outstanding generalization ability and versatility when dealing with different types of medical image tasks. Only in terms of accuracy was the proposed network surpassed by some of the other networks.
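For reference, all four evaluation metrics can be derived from the confusion counts of a predicted binary mask against the ground truth. A minimal NumPy sketch (our illustration, not the exact evaluation code used in the experiments):

```python
import numpy as np

def seg_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute IoU, DSC, recall, and accuracy for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # foreground hit
    fp = np.logical_and(pred, ~gt).sum()     # false alarm
    fn = np.logical_and(~pred, gt).sum()     # miss
    tn = np.logical_and(~pred, ~gt).sum()    # background hit
    eps = 1e-7  # guard against empty masks
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),
        "recall": tp / (tp + fn + eps),
        "accuracy": (tp + tn) / (tp + fp + fn + tn + eps),
    }

# Toy 4x4 masks: prediction covers the ground truth plus one extra pixel
pred = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
gt   = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
m = seg_metrics(pred, gt)
```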

A sample visualization of the segmentation results, achieved by the compared networks on the DSB2018 dataset, is shown in Fig 7. This visualization further confirms DEFIF-Net's excellent performance in cell microscopy image segmentation tasks. Relative to the other networks, DEFIF-Net achieves a more comprehensive segmentation of cells within microscopy images. Specifically, among all compared networks, only DEFIF-Net and DCSAU-Net successfully segmented the cells in the micrographs of the third column in Fig 7, and compared to DCSAU-Net, the segmentation results of DEFIF-Net are clearer. This comparison highlights the advantages of DEFIF-Net in processing fine structures, enabling more detailed segmentation of microstructures.

Fig 7. Visualization comparison of network segmentation results on DSB2018 dataset.

https://doi.org/10.1371/journal.pone.0324861.g007

3.5 Ablation study

We adopted U-Net as the baseline network and conducted extensive ablation experiments on the same three public datasets used in the other experiments, so as to explore the influence of the newly designed components on the proposed network's performance. Table 5 presents the results of the ablation study.

Table 5. Results of ablation study experiments.

https://doi.org/10.1371/journal.pone.0324861.t005

Firstly, we separately added the GDF branch (GDFB), CFRM, and MBGM to the baseline, which resulted in improvement on both metrics used in this experiment (i.e., IoU and DSC) on all three datasets, whereby the MBGM addition performed the best on ISIC2016 and ISIC2017, whereas the CFRM addition performed the best on DSB2018. This indicates that these three newly designed components can effectively enhance the baseline network's performance in segmentation tasks. Next, we retained GDFB and replaced the U-Net original decoder with the newly designed RFE decoder (RFED), which led to an additional improvement of both IoU and DSC on all three datasets. Subsequently, we incorporated CFRM together with MBGM into the baseline, which worked better than the previous step on ISIC2016 and ISIC2017, but worse on DSB2018. In addition, to explicitly ablate the use of FIFConv by GDFB, we replaced FIFConv with a standard convolution. As expected, this replacement decreased both metrics on all three datasets, except for DSC on DSB2018. This is because FIFConv fully integrates the detail and spatial features and reduces feature loss, so that more features are available to the next stage of the network. Therefore, we refrained from applying this replacement in the following steps of the ablation study, where components were added in triples to the baseline. Next, on the basis of the network structure with the dual encoder and the RFE decoder, we separately added CFRM and MBGM, and then both of them in combination (resulting in the proposed DEFIF-Net network). Experimental results show that these two components have a good synergistic effect on enhancing the network structure, thus further improving the segmentation performance.
Finally, the proposed DEFIF-Net network, formed by incorporating all four newly designed components into the baseline, achieved the highest results for both metrics on all three datasets, clearly demonstrating the effectiveness of these novel components for medical image segmentation. For completeness, we conducted one additional experiment with all components added to the baseline, but with GDFB using a standard convolution instead of the newly proposed FIFConv. Again, as expected, this replacement decreased both metrics on all three datasets, further confirming the benefit of using FIFConv instead of a standard convolution.

Fig 8 shows the IoU significance comparison of the effect of adding the newly designed components to the baseline network (U-Net). The significance level is marked with "***" in the figure to indicate whether the difference in the network performance reached a statistically significant level. As can be seen from the figure, the addition of all components to the baseline achieved a "***" result (except for the first two steps performed on the DSB2018 dataset), indicating that each improvement in the network performance after adding the corresponding component(s) is statistically significant. In addition, as components were gradually added to the network, the standard deviation gradually decreased, which further proves the important role of the newly designed components in improving the network stability.

Fig 8. IoU significance comparison of the effect of adding the newly designed components to the baseline (the group numbers on the X axis correspond to the step numbers in Table 5; the left bar of each group represents the IoU of the baseline, whereas the right bar represents the IoU of the resultant network after adding the corresponding component(s) to the baseline; the asterisks on each group represent the significance level).

https://doi.org/10.1371/journal.pone.0324861.g008

4 Discussion

In the field of medical image segmentation, the parameter count and computational complexity of networks have attracted much attention from researchers [37]. Numerous studies have been dedicated to developing deep learning networks with fewer parameters and reduced computational complexity, aimed at improving the efficiency of medical image segmentation [38]. However, although many of these networks bring good improvements w.r.t. parameter count and computational complexity, their performance in image segmentation tasks is not ideal. Accurate recognition of subtle pathological structures and target boundaries typically requires networks with strong feature representation capabilities, and networks with a low number of parameters may not adequately capture this critical information, which affects their segmentation performance [39]. During the implementation of DEFIF-Net, we designed several lightweight modules aimed at enhancing the network's ability to perceive complex structures and subtle features, so as to compensate for any information loss that might occur due to the simplified network structure. The careful design of these modules allows the proposed network to achieve better segmentation performance while maintaining a low parameter count and computational complexity. The experimental results obtained in this regard are shown in Table 6.

Table 6. Network comparison w.r.t. parameter count, floating-point operations (FLOPs), and inference time.

https://doi.org/10.1371/journal.pone.0324861.t006

Among the compared networks, DEFIF-Net stands out with a parameter count of only 0.24M (while also achieving a good FLOPs value of 0.33G), surpassing in this regard even the lightweight networks I-UNeXt and AttaNet, which highlights its lightweight design. In addition, Table 6 compares the inference time of DEFIF-Net with that of the other networks on the ISIC2017 dataset (calculated over all images in the test set under the same environment). As can be seen from the table, w.r.t. inference time, DEFIF-Net is far ahead of the mainstream networks (UNet++, Attention U-Net, TransUNet, MultiResUNet) and comparable to the lightweight networks (AttaNet, I-UNeXt, MALUNet, EGEU-Net). This advantage not only enables DEFIF-Net to be applied in resource-constrained environments, but also effectively improves its actual deployment value.
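To illustrate how figures such as 0.24M parameters and 0.33G FLOPs arise, the contribution of a single 2D convolution layer can be counted analytically (a sketch under the common convention of two operations per multiply-accumulate; tools such as thop or fvcore automate this for whole networks):

```python
def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Learnable parameters of a k x k 2D convolution layer."""
    return c_out * (c_in * k * k + (1 if bias else 0))

def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """FLOPs of the convolution, counting one multiply-accumulate as 2 operations."""
    return 2 * c_in * k * k * c_out * h_out * w_out

# Example: a 3x3 convolution from 3 to 16 channels producing a 256x256 output map
p = conv2d_params(3, 16, 3)           # 16 * (3*3*3 + 1) = 448 parameters
f = conv2d_flops(3, 16, 3, 256, 256)  # roughly 56.6M FLOPs
```

Summing such per-layer counts over a network yields the totals reported in Table 6, which is why reducing channel widths and kernel sizes in lightweight designs shrinks both columns at once.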

Through the visualization of parameter count, FLOPs, and IoU, shown in Fig 9, one can more intuitively observe the performance of the compared networks across the three datasets used. For instance, the large network TransUNet demonstrated top segmentation performance, which corresponds to its larger parameter count and FLOPs. However, it is important to note that lightweight networks show considerable variability in performance across various datasets, with good results in some of these and poorer outcomes in others, suggesting relatively weaker generalization abilities. In comparison, the proposed network exhibited the best performance on all three datasets. It not only surpassed larger networks in segmentation performance but also matched lightweight networks in terms of parameter count and FLOPs. This indicates that DEFIF-Net has achieved a good balance between segmentation performance and inference speed, offering remarkably high practicality and applicability in medical image segmentation tasks.

Fig 9. Parameter count, FLOPs, and IoU of networks, compared on: (a) ISIC2016 dataset; (b) ISIC2017 dataset; (c) DSB2018 dataset.

https://doi.org/10.1371/journal.pone.0324861.g009

However, although DEFIF-Net already has fewer parameters and lower computational complexity, there is still room for further optimization. Future research will focus on exploring ways to further reduce the parameter count and computational complexity of DEFIF-Net while maintaining top segmentation performance, aiming to enhance its lightweight nature and inference efficiency.

In order to verify the impact of the loss function on the network segmentation performance, we conducted experiments with the Dice loss and the BCE loss, used separately and in combination with different weights assigned to each. Table 7 shows the weight combination that achieved the best segmentation results. However, when the weight of the Dice loss was increased beyond this point, the network segmentation performance did not improve significantly. In addition, when the weight of the Dice loss was decreased and the weight of the BCE loss was correspondingly increased, the network segmentation performance followed a downward trend. These results provide an important reference for further optimizing the training strategy of segmentation networks.
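A minimal NumPy sketch of such a weighted BCE-Dice combination (the default weight values below are illustrative placeholders, not the combination selected in Table 7):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 minus the Dice overlap between prediction and target."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy, with clipping to avoid log(0)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean())

def combined_loss(pred, target, w_dice=0.5, w_bce=0.5):
    # w_dice / w_bce are placeholder weights for illustration only
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(pred, target)

target = np.array([1.0, 0.0, 1.0, 0.0])
perfect = combined_loss(target, target)              # near 0 for a perfect prediction
uncertain = combined_loss(np.full(4, 0.5), target)   # higher for an uninformative one
```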

Table 7. Segmentation performance results of DEFIF-Net using different weights for the Dice loss and the BCE loss in the combined loss function.

https://doi.org/10.1371/journal.pone.0324861.t007

Fig 10 shows the training curves of the combined loss function using different weights for the Dice loss and the BCE loss. It can be observed that, initially, the combined loss rapidly decreases and, after reaching a certain point, starts to decrease more gradually. With the weights used in our study, the combined loss is the lowest, while also exhibiting the fastest convergence.

Fig 10. Training curves of the combined loss function using different weights for the Dice loss and the BCE loss.

https://doi.org/10.1371/journal.pone.0324861.g010

However, the proposed DEFIF-Net network still has limitations w.r.t. some aspects. First, although DEFIF-Net has achieved a good balance between segmentation performance and efficiency, especially when dealing with medical images with complex backgrounds and subtle structures, there remains a certain gap in its segmentation performance compared with some more complex, heavyweight deep learning networks. Such networks usually have more parameters and deeper structures, which enable them to capture more subtle image features and achieve top segmentation performance in some challenging tasks. Although DEFIF-Net performs well in most cases, it may still not reach the level of heavyweight networks in extremely complex or detailed image segmentation tasks. Second, in the presented study, DEFIF-Net was mainly evaluated on two types of datasets: skin lesions and cells. Although the utilized datasets cover common scenes in medical images of both types, the performance of the proposed network on other types of medical images, such as CT and MRI images, has not been tested. Therefore, the universality of DEFIF-Net and its application potential in other medical fields still need to be further explored. In the future, we plan to apply DEFIF-Net to a wider range of clinical medical tasks, including but not limited to tumor detection, organ segmentation, and blood vessel detection, so as to better evaluate its robustness and performance in different medical scenarios.

5 Conclusion

This paper has proposed a novel lightweight medical image segmentation network, named DEFIF-Net, with an asymmetric structure. The clever embedding of the newly designed components, presented in the paper, enables DEFIF-Net to achieve top segmentation performance in medical image segmentation tasks. Moreover, these components effectively reduce the network’s parameter count and computational complexity, thus significantly improving the network’s computational speed during the training process. Firstly, to compensate for the absence of spatial features in the network, a novel global dependency fusion (GDF) branch is used as an additional encoder in the proposed network. This branch complements the local features captured by the U-Net original encoder, forming an integrated information structure. Furthermore, this branch effectively fuses distant and nearby feature dependencies by means of a novel feature interaction fusion convolution (FIFConv). Secondly, in the skip connection section of the network, novel channel feature reconstruction modules (CFRMs) are employed to enhance the network’s perception of important channels and facilitate information sharing among different channels. In the bottleneck layer of the network, a novel multi-branch ghost module (MBGM) is employed to ensure the diversity and integrity of the feature information, thus enhancing the network’s robustness and generalization ability. Finally, replacing the U-Net original decoder with the novel residual feature enhancement (RFE) decoder allows the proposed network to focus more on learning and capturing boundary features, thereby improving its adaptability to complex pathological regions. Additionally, employing a BCE−Dice combined loss function during the network training has improved the network’s ability to recognize and segment different pathological categories. 
Experimental results, obtained on two public dermatological datasets, demonstrated that the proposed network achieves a good balance between parameter count, computational complexity, and segmentation performance, delivering state-of-the-art results. Additional microscopic cell segmentation experiments demonstrated the strong generalization ability of DEFIF-Net.

References

  1. Qureshi I, Yan J, Abbas Q, Shaheed K, Riaz AB, Wahid A. Medical image segmentation using deep semantic-based methods: a review of techniques, applications and emerging trends. Inf Fusion. 2023;90:316–52.
  2. Narayan V, Faiz M, Mall PK, Srivastava S. A comprehensive review of various approach for medical image segmentation and disease prediction. Wireless Pers Commun. 2023;132(3):1819–48.
  3. Butoi V, Ortiz J, Ma T, Sabuncu M, Guttag J, Dalca A. UniverSeg: universal medical image segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Jun 18-22; Vancouver, Canada. IEEE; 2023. p. 21438–51.
  4. Mall PK, Singh PK, Srivastav S, Narayan V, Paprzycki M, Jaworska T, et al. A comprehensive review of deep neural networks for medical image processing: recent developments and future opportunities. Healthcare Analytics. 2023;4:100216.
  5. Dhar T, Dey N, Borra S, Sherratt RS. Challenges of deep learning in medical image analysis—improving explainability and trust. IEEE Trans Technol Soc. 2023;4(1):68–75.
  6. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference; 2015 Oct 5-9; Munich, Germany. Springer; 2015. p. 234–41.
  7. Ibtehaz N, Rahman MS. MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020;121:74–87. pmid:31536901
  8. Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y. UNet 3+: a full-scale connected UNet for medical image segmentation. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2020 May 4-9; Barcelona, Spain. IEEE; 2020.
  9. Xiao H, Li L, Liu Q, Zhu X, Zhang Q. Transformers in medical image segmentation: a review. Biomedical Signal Processing and Control. 2023;84:104791.
  10. Chen J, et al. TransUNet: transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306. 2021.
  11. Azad R, Kazerouni A, Heidari M, Aghdam EK, Molaei A, Jia Y, et al. Advances in medical image analysis with vision transformers: a comprehensive review. Med Image Anal. 2024;91:103000. pmid:37883822
  12. Luo B, Shu Y, Nie Y, Chang D, Pan Y, Shi H. I-UNeXt: a skin lesion segmentation network based on inception and UNeXt. Information Technology, Networking, Electronic and Automation Control Conference; 2023 Feb 24-26; Chongqing, China. IEEE; 2023. p. 1269–73.
  13. Ruan J, Xiang S, Xie M, Liu T, Fu Y. MALUNet: a multi-attention and light-weight UNet for skin lesion segmentation. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2022 Dec 6-9; Las Vegas, NV. IEEE; 2022. p. 1150–6. https://doi.org/10.1109/bibm55620.2022.9995040
  14. Krichen M. Convolutional neural networks: a survey. Computers. 2023;12(8):151.
  15. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: a nested U-Net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018; 2018 Sep 20; Granada, Spain. Springer; 2018. p. 3–11.
  16. Oktay O, Schlemper J, Folgoc L, Lee M, Heinrich M, Misawa K. Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. 2018. https://doi.org/10.48550/arXiv.1804.03999
  17. Szegedy C, Vanhoucke V, Ioffe S. Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 26; Las Vegas, USA. IEEE; 2016. p. 2818–26.
  18. Ruan J, Xie M, Gao J, Liu T, Fu Y. EGE-UNet: an efficient group enhanced UNet for skin lesion segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023 Oct 8-12; Vancouver, Canada. Springer; 2023. p. 481–90.
  19. Ruan J, Li J, Xiang S. VM-UNet: vision Mamba UNet for medical image segmentation. arXiv preprint arXiv:2402.02491. 2024.
  20. Ruan J, Xie M, Xiang S, Liu T, Fu Y. MEW-UNet: multi-axis representation learning in frequency domain for medical image segmentation. arXiv preprint arXiv:2210.14007. 2022.
  21. Chen H, Li Z, Huang X, Peng Z, Deng Y, Tang L, et al. SCSONet: spatial-channel synergistic optimization net for skin lesion segmentation. Front Phys. 2024;12.
  22. Zhang S, Metaxas D. On the challenges and perspectives of foundation models for medical image analysis. Med Image Anal. 2024;91:102996. pmid:37857067
  23. Ma J, Li F, Wang B. U-Mamba: enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722. 2024.
  24. Zhao X, Jia H, Pang Y, Lv L, Tian F, Zhang L. M2SNet: multi-scale in multi-scale subtraction network for medical image segmentation. arXiv preprint arXiv:2303.10894. 2023.
  25. Alahmadi M. Boundary aware U-Net for medical image segmentation. Arab J Sci Eng. 2023;48(8):9929–40.
  26. Lu Z, She C, Wang W, Huang Q. LM-Net: a light-weight and multi-scale network for medical image segmentation. Comput Biol Med. 2024;168:107717. pmid:38007973
  27. Li W, Zhang Y, Wang G, Huang Y, Li R. DFENet: a dual-branch feature enhanced network integrating transformers and convolutional feature learning for multimodal medical image fusion. Biomedical Signal Processing and Control. 2023;80:104402.
  28. Taye MM. Theoretical understanding of convolutional neural network: concepts, architectures, applications, future directions. Computation. 2023;11(3):52.
  29. Vaswani A, Shazeer N, Parmar N. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
  30. Han K, Wang Y, Tian Q, Guo J, Xu C, Xu C. GhostNet: more features from cheap operations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020 Jun 14-19; Seattle, United States. IEEE; 2020. p. 1577–86.
  31. Gutman D, Codella N, Celebi E, Helba B, Marchetti M, Mishra N. Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1605.01397. 2016.
  32. Codella NCF, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). 2018 IEEE 15th International Symposium on Biomedical Imaging; 2018 Apr 4-7; Washington, USA. IEEE; 2018. p. 168–72. https://doi.org/10.1109/isbi.2018.8363547
  33. Ruby U, Yendapalli V. Binary cross entropy with deep learning technique for image classification. Adv Trends Comput Sci Eng. 2020;9:5359–97.
  34. Zhao R, Qian B, Zhang X, Li Y, Wei R, Liu Y, et al. Rethinking dice loss for medical image segmentation. 2020 IEEE International Conference on Data Mining (ICDM); 2020 Nov 17-20; Sorrento, Italy. IEEE; 2020. p. 851–60. https://doi.org/10.1109/icdm50108.2020.00094
  35. Song Q, Mei K, Huang R. AttaNet: attention-augmented network for fast and accurate scene parsing. Proceedings of the AAAI Conference on Artificial Intelligence; 2021 Feb 2-9; New York, USA. 2021. p. 2567–75.
  36. Xu Q, Ma Z, He N, Duan W. DCSAU-Net: a deeper and more compact split-attention U-Net for medical image segmentation. Comput Biol Med. 2023;154:106626. pmid:36736096
  37. Zhang Y, Chen Z, Yang X. Light-M: an efficient lightweight medical image segmentation framework for resource-constrained IoMT. Comput Biol Med. 2024;170:108088. pmid:38320339
  38. Hafiz AM. A survey on light-weight convolutional neural networks: trends, issues and future scope. J Mobile Multimedia. 2023:1277–98.
  39. Liang B, Tang C, Zhang W, Xu M, Wu T. N-Net: an UNet architecture with dual encoder for medical image segmentation. Signal Image Video Process. 2023;17(6):3073–81.