Abstract
Urban villages, as a typical phenomenon in the process of urbanization, play a significant role in urban planning and sustainable development. However, their high-density structures and complex boundaries pose significant challenges for extraction tasks based on remote sensing imagery. To address these challenges, this paper proposes a Multi-domain Enhancement and Boundary Awareness Network (MEBANet) for urban village extraction. MEBANet consists of three core blocks: 1) The spatial-frequency-channel feature extraction block (SFCB), which simultaneously enhances feature representation in the spatial, frequency, and channel domains; 2) The multi-scale boundary awareness block (MBAB), which leverages dense atrous spatial pyramid pooling (DenseASPP) and multi-directional Sobel operator convolution to strengthen the perception of complex boundaries; and 3) The deep supervision block (DSB), which accelerates model convergence through multi-level supervision signals. Experiments were conducted on three publicly available datasets from Beijing, Xi’an, and Shenzhen. The results demonstrate that MEBANet outperforms existing methods in terms of precision, recall, F1-score, and IoU. Additionally, cross-dataset transfer experiments validate the robustness and generalization capability of MEBANet. Ablation studies further confirm the effectiveness of each block. This study provides a high-accuracy and automated solution for urban village extraction from high-resolution remote sensing imagery, offering valuable insights for urban planning and management.
Citation: Chang F, Fan X, Xu R, Wang S, Qin K, Gao X (2025) MEBANet: A Multi-domain Enhancement and Boundary Awareness Network for urban village extraction from high-resolution imagery. PLoS One 20(10): e0330302. https://doi.org/10.1371/journal.pone.0330302
Editor: Xu Yanwu, South China University of Technology, CHINA
Received: April 11, 2025; Accepted: July 26, 2025; Published: October 22, 2025
Copyright: © 2025 Chang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The urban village extraction dataset that supports the findings of this study is publicly available from the Figshare repository at https://doi.org/10.6084/m9.figshare.29832671.
Funding: This study was supported by The Central Government-Guided Local Science and Technology Development Fund Project under Grant 236Z6101G, and The Hebei Transportation Investment Group Co., Ltd. Major Scientific Research and Development Project under Grant 20245051001S. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Urban villages typically refer to residential settlements originally of rural nature that, despite being located within urban areas, have not been fully integrated into formal urban governance and infrastructure systems during the process of urbanization [1,2]. With the rapid advancement of urbanization, urban villages have emerged as a unique socio-spatial phenomenon at the intersection of rural and urban development [3]. Characterized by high-density, small-scale buildings, urban villages accommodate a significant portion of the urban population and resources, providing affordable housing for low-income residents [4]. However, the negative impacts associated with urban villages cannot be overlooked. These areas often pose serious social, economic, and environmental challenges. For instance, their densely packed and disorderly layout disrupts the urban landscape; inadequate infrastructure lowers residents’ quality of life; and poor sanitation conditions, coupled with limited public services, hinder sustainable urban development [5,6]. Therefore, accurately delineating the spatial extent of urban villages is not only a critical task for urban planning and governance but also a fundamental prerequisite for achieving refined urban management and sustainable development.
Traditional methods for mapping the spatial extent of urban villages primarily rely on field surveys [4,7]. Although these methods yield reliable results, they are constrained by high costs, intensive labor requirements, and long execution periods, factors that limit their scalability when applied to widely distributed urban villages [8]. With the rapid advancement of remote sensing technology, high-resolution imagery has emerged as a promising alternative for urban village extraction. Remote sensing imagery offers wide spatial coverage and frequent updates, making it an increasingly important data source for extracting urban village spatial information [1,9]. However, most traditional approaches still depend on visual interpretation. While this avoids the complexities of field surveys, visual interpretation remains limited by inconsistent interpretation standards and delayed data updates, thus failing to meet the demands of large-scale and efficient urban village monitoring [4]. Consequently, there is an urgent need for automated solutions.
Currently, mainstream approaches for automated feature extraction can be broadly categorized into traditional machine learning methods and deep learning methods [10]. Traditional machine learning algorithms, such as Support Vector Machines (SVM) and Random Forests (RF), are known for their computational simplicity, fast processing speed, and relatively low data requirements [11]. These methods have been widely applied in urban village extraction tasks. For example, Duque et al. [5] evaluated the performance of logistic regression, SVM, and RF in cities such as Buenos Aires (Argentina), Medellín (Colombia), and Recife (Brazil), demonstrating that SVM achieved the best results. Matarira et al. [1] successfully delineated urban villages in Durban, South Africa, by combining the random forest algorithm with Sentinel-2A imagery. Similarly, Park et al. [12] mapped the spatial distribution of urban villages in Ulaanbaatar from 1990 to 2013 using QuickBird and Landsat images supported by SVM. Gevaert et al. [6] further applied SVM to extract urban villages in Kigali, Rwanda, and Maldonado, Uruguay. Nevertheless, these traditional machine learning methods heavily rely on handcrafted features, which are often incomplete and prone to overfitting. Moreover, they struggle to model complex non-linear relationships in heterogeneous environments, ultimately limiting classification accuracy [13]. With the rise of deep learning, significant advancements have been made in this domain. Deep learning constructs robust nonlinear neural networks capable of automatically learning hierarchical feature representations, eliminating the need for manual feature engineering [7,9,14]. As a result, an increasing number of deep neural network-based models have been employed in extraction tasks for identifying objects of interest in remote sensing imagery, achieving remarkable performance. 
This includes widely used classical semantic segmentation networks such as FCN [15], U-Net [16], and DeepLabv3+ [17], as well as specialized models designed for specific land cover types or challenges. For instance, Wang et al. [18] proposed RFENet to address the vulnerability of deep learning models in aerial image segmentation to adversarial patch attacks. Zhang et al. [19] introduced C_ASegformer to improve multi-scale feature integration and contextual awareness, achieving superior results compared to existing models. In the context of urban village extraction, several deep learning-based studies have demonstrated promising results. Verma et al. [2] proposed the use of pre-trained convolutional networks for detecting urban villages in Mumbai. Persello et al. [20] applied FCN to extract urban villages in Dar es Salaam, Tanzania. Similarly, Pan et al. [4] achieved delineation of urban villages in Guangzhou, China, using a U-Net-based method. Wei et al. [7] compared the performance of FCN, U-Net, and ResUNet for urban village extraction at the junction of Dongguan, Huizhou, and Guangzhou. These studies primarily rely on the direct application of existing semantic segmentation networks without considering the unique characteristics of urban villages, thereby limiting further improvements in accuracy.
To address these limitations, some researchers have proposed tailored architectures. For example, Ansari et al. developed a composite model that integrates U-Net with a multiscale contourlet transform to improve extraction accuracy in the cities of Mumbai and Pune, India. Fan et al. [13] introduced UisNet, which enhances extraction accuracy by utilizing building-level information. Based on a Transformer architecture, UisNet incorporates multimodal data including spatial building footprints and floor numbers to extract urban villages in Shenzhen. Du et al. [21] proposed STMNet, which fuses texture features derived from gray-level co-occurrence matrices with original images to extract urban villages in Beijing. Zhang et al. [22] introduced UV-SAM to address inaccurate boundary extraction by incorporating a prompt generation module into a foundational model, achieving promising results in Beijing and Xi’an. Furthermore, Li et al. [23] addressed the limited global context modeling of CNNs and the computational complexity of Transformers by introducing the Mamba architecture for the first time in this task, resulting in UV-Mamba.
Despite recent advancements, most deep learning-based methods for urban village extraction rely primarily on spatial-domain representations, with limited exploration of the frequency domain. Spatial-domain approaches operate on raw pixel intensities and spatial arrangements, which may not fully reveal the underlying structural patterns, especially in highly heterogeneous and visually complex urban scenes [24,25]. In contrast, frequency-domain representations offer a complementary perspective by decomposing images into components with distinct frequency characteristics. Low-frequency components capture global structures and smooth variations, while high-frequency components emphasize fine-grained textures, edges, and discontinuities, features that are often critical for delineating urban villages with irregular layouts and noisy visual characteristics. Integrating frequency-domain information enables models to enhance their perception of subtle yet important features that may be overlooked in the spatial domain, thereby improving the overall robustness and precision of segmentation [26]. As such, a combined spatial-frequency perspective allows for richer and more discriminative feature representation, which is particularly beneficial in complex urban extraction tasks.
Encouragingly, recent studies have attempted to incorporate frequency-domain information into semantic segmentation tasks. Existing approaches can be broadly categorized into two strategies. The first involves parallel use of frequency-domain and spatial-domain feature extraction modules, often implemented as dual-branch networks or as separate modules inserted into the encoder and decoder stages, respectively [27,28]. The second approach introduces frequency-domain features as auxiliary enhancement modules to support spatial feature learning [29,30]. However, the former design increases computational cost and limits timely feature interaction, while the latter cannot fully exploit frequency information.
To address the aforementioned challenges, this study proposes a novel method named multi-domain enhancement and boundary awareness network (MEBANet). MEBANet is primarily composed of three core modules. The spatial-frequency-channel feature extraction block (SFCB) enhances multi-domain feature representations, particularly by incorporating frequency-domain processing to effectively extract critical yet often overlooked features. The multi-scale boundary awareness block (MBAB) leverages multi-scale feature extraction and boundary awareness to improve the accuracy of boundary delineation. Finally, a deep supervision block (DSB) is introduced to accelerate network convergence. The main contributions of this study can be summarized as follows:
- 1). We propose a novel MEBANet architecture that integrates multi-domain feature enhancement and multi-scale boundary awareness mechanisms to capture the complex spatial characteristics of urban villages more accurately. Its effectiveness is demonstrated through experiments on datasets from Beijing, Xi’an, and Shenzhen.
- 2). We design the SFCB module to simultaneously enhance features in the spatial, frequency, and channel domains. Unlike existing studies, our approach fuses spatial-domain enhancement submodule, frequency-domain enhancement submodule, and channel-domain enhancement submodule into a unified block that enables comprehensive feature interaction during extraction. Moreover, the design allows for flexible stage-wise integration within the network. Experimental results validate the effectiveness of SFCB in enhancing representational capacity.
- 3). We propose the MBAB module, which integrates DenseASPP and Sobel convolution to enhance boundary awareness. DenseASPP captures multi-scale features to address the spatial heterogeneity and scale variation of urban villages, while Sobel convolution refines boundary localization. This design significantly improves segmentation accuracy by reducing false positives and missed detections around complex boundaries.
Proposed methods
Overall structure
The architecture of the proposed MEBANet is illustrated in Fig 1, which consists of an encoder and a decoder. In the encoder, a 3 × 3 convolution is first applied for initial feature extraction. Subsequently, the proposed SFCB is employed as the backbone to hierarchically extract multi-level features. The encoder comprises four SFCBs, each consisting of a frequency-domain enhancement submodule (FDE), a spatial-domain enhancement submodule (SDE), and a channel-domain enhancement submodule (CDE). These extractors output four levels of features, denoted as F1, F2, F3, and F4. Assuming the input image has a size of (H × W × 3), the output features from the encoder are of dimensions (H/2 × W/2 × C1), (H/4 × W/4 × C2), (H/8 × W/8 × C3), and (H/16 × W/16 × C4), where H and W denote the height and width of the input image, and C1, C2, C3, and C4 denote the channel dimensions of the four stages.
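To make the stage-wise shapes concrete, the progression can be sketched in a few lines of Python. Note that the per-stage channel widths used below are hypothetical placeholders for illustration only; the paper does not fix them in this section, and only the halving of spatial resolution per stage is assumed.

```python
def encoder_shapes(h, w, channels):
    """Spatial size after each of the four SFCB stages,
    assuming every stage halves height and width via its max-pooling step."""
    shapes = []
    for i, c in enumerate(channels, start=1):
        shapes.append((h // 2**i, w // 2**i, c))
    return shapes

# Hypothetical channel widths C1..C4 for a 512 x 512 input patch.
print(encoder_shapes(512, 512, [64, 128, 256, 512]))
# [(256, 256, 64), (128, 128, 128), (64, 64, 256), (32, 32, 512)]
```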
In the decoder, instead of adopting a complex decoding structure, all features from the encoder are upsampled to match the input resolution. Each upsampled feature map is adjusted to have a uniform channel dimension of C, and the four maps are then concatenated to form a unified feature representation of size (H × W × 4C). To enhance the model’s adaptability to the shape and scale variations of urban villages, the MBAB is incorporated before generating the final output. Finally, the DSB is introduced to accelerate model convergence and enhance training efficiency. It leverages the intermediate features F1, F2, F3, and F4 from the encoder, combined with the output of the MBAB and the result of a 1 × 1 convolution, to provide auxiliary supervision signals that facilitate model optimization [31]. The subsequent sections of this chapter provide a detailed explanation of the design and implementation of the SFCB, MBAB, and DSB.
Feature extraction using SFCB
SFCB: The design of the SFCB aims to achieve efficient feature extraction and representation through multi-dimensional enhancement across the frequency, spatial, and channel domains. As illustrated in Fig 2, the input feature X first passes through a 3 × 3 convolution for initial processing. Then, two 1 × 1 convolutions are applied to generate two feature maps, which are respectively fed into the FDE and SDE. The 1 × 1 convolutions serve to reduce the number of channels, thereby improving computational efficiency, while also separating out features that are better suited to spatial-domain and frequency-domain enhancement. As a result, two distinct feature maps are obtained: Xs, which is passed to the SDE, and Xf, which is passed to the FDE.
The SDE outputs a refined feature map Fs, while the FDE produces Ff. These two outputs are fused through a cross-attention mechanism (CAM) to generate Fc, which is then passed into the CDE. This module recalibrates the importance of each feature channel, resulting in the output feature Fe. Finally, a 3 × 3 convolution followed by a max-pooling operation is applied to downsample the features, producing the SFCB output Fout, whose spatial resolution is half that of the input. This process can be summarized as follows:
Xs = Conv1×1(Conv3×3(X)), Xf = Conv1×1(Conv3×3(X))
Fs = SDE(Xs), Ff = FDE(Xf)
Fc = CAM(Fs, Ff)
Fe = CDE(Fc)
Fout = MaxPool(Conv3×3(Fe))
where Conv3×3, Conv1×1, and MaxPool denote the 3 × 3 convolution, 1 × 1 convolution, and max-pooling operations, respectively, and SDE(·), FDE(·), CAM(·), and CDE(·) represent the SDE, FDE, CAM, and CDE, respectively.
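The composition above can be sketched as a plain data-flow function, with every submodule stubbed out by a placeholder callable. This illustrates only the wiring of the block, not the real layers:

```python
import numpy as np

def sfcb(x, conv3, conv1a, conv1b, sde, fde, cam, cde, pool):
    """Sketch of the SFCB data flow described above; the submodules are
    passed in as callables (here stubbed), not the paper's real layers."""
    y = conv3(x)
    xs, xf = conv1a(y), conv1b(y)   # split into spatial / frequency paths
    fs, ff = sde(xs), fde(xf)       # per-domain enhancement
    fc = cam(fs, ff)                # cross-domain fusion
    fe = cde(fc)                    # channel recalibration
    return pool(conv3(fe))          # 3x3 conv + max-pool downsampling

identity = lambda a: a
halve = lambda a: a[:, ::2, ::2]    # toy stand-in for 2x2 max-pooling
x = np.ones((8, 16, 16))            # (channels, height, width)
out = sfcb(x, identity, identity, identity, identity, identity,
           lambda a, b: a + b, identity, halve)
print(out.shape)  # (8, 8, 8)
```

The stubs make the downsampling behavior explicit: a (8, 16, 16) feature map leaves the block at half spatial resolution.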
SDE: The structure of the SDE is illustrated in Fig 3. Its core lies in leveraging the Transformer module to extract global spatial information. To enhance computational efficiency, we adopted a combination of the Mix-FFN and Efficient Self-Attention mechanism (ESA) [32]. The Mix-FFN automatically computes positional encodings from the input features using 3 × 3 convolutions, eliminating the complexity of adding positional encodings separately, as in Vision Transformer [33]. This design not only preserves positional information but also significantly improves computational efficiency. This process can be described as follows:
Xout = MLP(GELU(Conv3×3(MLP(Xin)))) + Xin
where Xin and Xout represent the input and output features of the Mix-FFN, respectively, GELU denotes the GELU activation function, and MLP refers to the multi-layer perceptron.
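A minimal NumPy sketch of this Mix-FFN pattern follows; the weights are random placeholders rather than trained parameters, and a depthwise 3 × 3 convolution stands in for the positional-mixing convolution:

```python
import numpy as np

def conv3x3_depthwise(x, w):
    """x: (C, H, W); w: (C, 3, 3). Zero-padded depthwise 3x3 convolution."""
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + h, j:j + wd]
    return out

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mix_ffn(x, w_in, w_dw, w_out):
    """Sketch of Mix-FFN: MLP -> depthwise 3x3 conv (implicit positional
    encoding) -> GELU -> MLP, with a residual connection."""
    h = np.einsum('chw,cd->dhw', x, w_in)        # first MLP (1x1 projection)
    h = gelu(conv3x3_depthwise(h, w_dw))         # positional mixing + GELU
    return np.einsum('dhw,dc->chw', h, w_out) + x  # second MLP + residual

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = mix_ffn(x, rng.standard_normal((4, 16)) * 0.1,
            rng.standard_normal((16, 3, 3)) * 0.1,
            rng.standard_normal((16, 4)) * 0.1)
print(y.shape)  # (4, 8, 8)
```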
The ESA differs from the Vision Transformer in that it introduces a reduction ratio R to compress the key and value matrices, significantly reducing the computational cost of multi-head attention. First, the query (Q), key (K), and value (V) are computed, a process that can be defined as follows:
Q = MLP_Q(X), K = MLP_K(X), V = MLP_V(X)
where MLP_Q, MLP_K, and MLP_V represent the multi-layer perceptrons used to compute Q, K, and V, respectively. Typically, Q, K, and V share the same dimensions (B × h × N × d), where B denotes the batch size, h represents the number of attention heads, N is the sequence length, and d is the embedding dimension per head. The introduction of R first reshapes K and V into (B × h × N/R × d·R), and then uses a multi-layer perceptron to transform K and V into the size (B × h × N/R × d). This process can be described as follows:
K' = MLP(Reshape(K)), V' = MLP(Reshape(V))
The output features of the ESA can be described as:
ESA(Q, K', V') = Softmax(Q K'ᵀ / √d) V'
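The sequence-reduction trick can be illustrated in NumPy as follows. The projection matrix stands in for the MLP and is a random placeholder; the point is that the attention matrix shrinks from N × N to N × N/R:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_self_attention(q, k, v, r, w_k, w_v):
    """Sketch of ESA-style reduction: K and V of shape (N, d) are reshaped
    to (N/r, d*r) and projected back to d columns, so the attention matrix
    has shape (N, N/r) instead of (N, N). w_k and w_v play the role of the
    MLPs; here they are placeholder weights."""
    n, d = k.shape
    k = k.reshape(n // r, d * r) @ w_k    # (N/r, d)
    v = v.reshape(n // r, d * r) @ w_v    # (N/r, d)
    attn = softmax(q @ k.T / np.sqrt(d))  # (N, N/r) attention weights
    return attn @ v                       # (N, d)

rng = np.random.default_rng(1)
n, d, r = 64, 8, 4
q, kk, vv = (rng.standard_normal((n, d)) for _ in range(3))
w = rng.standard_normal((d * r, d)) * 0.1
out = efficient_self_attention(q, kk, vv, r, w, w)
print(out.shape)  # (64, 8)
```

A single head is shown; the batch and head dimensions of the paper's (B × h × N × d) tensors would simply be broadcast over this computation.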
FDE: The structure of the FDE is illustrated in Fig 4. The primary distinction from the SDE lies in the use of the Fast Fourier Transform (FFT) to map the input features from the spatial domain to the frequency domain before feeding them into the Transformer module. In the frequency domain, the features are adjusted and then remapped back to the spatial domain via the Inverse Fast Fourier Transform (IFFT) [34]. This process effectively captures frequency information that is difficult to extract in the spatial domain.
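The FFT–adjust–IFFT round trip can be sketched as below. Note that the frequency-domain Transformer used in the FDE is replaced here by a simple high-frequency gain mask, purely to illustrate the mapping between domains; the gain and radius values are arbitrary:

```python
import numpy as np

def frequency_enhance(x, gain_high=1.5, radius=4):
    """Toy version of the FDE idea: FFT the feature map, amplify
    high-frequency coefficients (edges, fine textures), and map back
    to the spatial domain with the inverse FFT."""
    f = np.fft.fftshift(np.fft.fft2(x))           # centre the spectrum
    h, w = x.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    mask = np.where(dist > radius, gain_high, 1.0)  # boost high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

x = np.outer(np.sin(np.linspace(0, 8 * np.pi, 32)), np.ones(32))
y = frequency_enhance(x)
print(y.shape)  # (32, 32)
```

With a gain of 1.0 the round trip is the identity, which is a handy sanity check on the FFT/IFFT pairing.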
CAM: The structure of the CAM is illustrated in Fig 5. CAM integrates the outputs of the FDE and SDE through a cross-attention mechanism [35]. This process can be described as follows:
Fc = Softmax(Fs Ffᵀ / √d) Ff
where Softmax represents the softmax activation function, and d is the channel dimension of Ff.
CDE: The structure of the CDE is illustrated in Fig 6. The input features are first processed through max-pooling and average-pooling, followed by a multi-layer perceptron and softmax to compute the importance weights for each channel. These weights are then multiplied with the original input features to produce the output feature Fe [36]. This process can be described as follows:
Fe = Softmax(MLP(AvgPool(Fc)) + MLP(MaxPool(Fc))) ⊗ Fc
where AvgPool denotes the average-pooling operation, and MaxPool represents the max-pooling operation. Finally, the channel-enhanced feature Fe is processed through a 3 × 3 convolutional layer followed by a max-pooling operation, generating the output feature Fout of the SFCB.
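The pooling-and-reweighting idea can be sketched as follows; the MLP between pooling and softmax is omitted here, so this is a deliberately simplified variant of the CDE:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_enhance(x):
    """Sketch of channel-domain enhancement: per-channel average- and
    max-pooled descriptors are combined into softmax importance weights,
    which then rescale every channel of the input."""
    avg = x.mean(axis=(1, 2))   # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))     # (C,) max-pooled descriptor
    w = softmax(avg + mx)       # channel importance weights, sum to 1
    return w[:, None, None] * x

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 16, 16))  # (channels, height, width)
y = channel_enhance(x)
print(y.shape)  # (8, 16, 16)
```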
MBAB for enhancing boundary information
The primary objective of the MBAB is to enhance the model’s adaptability to the shape and scale variations of urban villages while strengthening the extraction of boundary details. The structure of this module is illustrated in Fig 7, and its core consists of two components: dense atrous spatial pyramid pooling (DenseASPP) [37] and Sobel operator convolution [38].
The DenseASPP comprises six parallel branches. One branch employs a residual connection, directly passing the input features without any processing, while the other five branches utilize atrous convolutions with dilation rates of 3, 6, 12, 18, and 24, respectively. This multi-scale design captures semantic information at different receptive fields, enabling adaptation to the complex shape and scale variations in urban village regions. Additionally, dense connections between the branches facilitate top-down multi-scale feature aggregation, ensuring that the output of each branch not only retains semantic information at the current scale but also integrates features from adjacent branches, significantly enhancing the overall feature representation capability. To further improve the extraction of boundary details, we introduce multi-directional Sobel operator convolution after each atrous convolution branch. Specifically, the input features are processed using eight directional Sobel convolution kernels, and the results from all directions are summed to produce the final feature. The eight directional Sobel kernels correspond to gradient directions of 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°; each kernel is obtained by rotating the border coefficients of the horizontal kernel [[−1, 0, 1], [−2, 0, 2], [−1, 0, 1]] in 45° steps. The Sobel operator enhances edges in the feature map by computing pixel gradient changes, effectively improving the model’s sensitivity to boundary information [39]. This multi-directional convolution design not only captures edge features at different angles but also strengthens the model’s perception of complex boundary structures.
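A sketch of the eight-direction kernel construction and the aggregated edge response is given below. Absolute responses are summed here because opposite-direction Sobel kernels are negatives of each other and their raw responses would cancel; how the paper aggregates signs is not spelled out, so this is an assumption:

```python
import numpy as np

def sobel_kernels():
    """Eight directional 3x3 Sobel kernels: the horizontal kernel's border
    coefficients are rotated in 45-degree steps around the (zero) centre."""
    base = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ring_idx = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    ring = np.array([base[i, j] for i, j in ring_idx])
    kernels = []
    for k in range(8):
        rolled = np.roll(ring, k)
        kern = np.zeros((3, 3))
        for (i, j), val in zip(ring_idx, rolled):
            kern[i, j] = val
        kernels.append(kern)
    return kernels

def edge_response(x, kernels):
    """Sum of absolute valid-mode 3x3 responses over all eight directions."""
    h, w = x.shape
    total = np.zeros((h - 2, w - 2))
    for kern in kernels:
        resp = np.zeros((h - 2, w - 2))
        for i in range(3):
            for j in range(3):
                resp += kern[i, j] * x[i:i + h - 2, j:j + w - 2]
        total += np.abs(resp)
    return total

ks = sobel_kernels()
img = np.zeros((6, 6))
img[:, 3:] = 1.0  # a vertical step edge
print(edge_response(img, ks).max())
```

Rotating the ring by four steps yields the negated base kernel (the 180° direction), which the construction makes easy to verify.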
DSB for network training
Deep Supervision is a training strategy for neural networks that not only utilizes the final output of urban village extraction to provide supervision signals but also incorporates intermediate outputs from different layers of the network to guide the training process. This strategy aims to address issues such as gradient vanishing and exploding in network training, while also improving the convergence speed and overall performance of the network [31,40].
We first upsample the four features from the encoder, concatenate them, and then process them through the MBAB module and a 1 × 1 convolution to obtain the final output result. This process can be described as:
Y = Conv1×1(MBAB(Concat(Up(F1), Up(F2), Up(F3), Up(F4))))
where Up denotes the upsampling operation and Concat represents feature concatenation. Next, we utilize F1, F2, F3, and F4 along with the MBAB and 1 × 1 convolution to generate urban village extraction results at different scales as intermediate outputs of MEBANet. This process can be described as:
Yi = Conv1×1(MBAB(Fi)), i = 1, 2, 3, 4
Finally, we denote the ground truth corresponding to Y as G, and apply max-pooling operations to G to obtain the ground truth Gi for each additional output result. This process can be described as:
Gi = MaxPool_i(G), i = 1, 2, 3, 4
where MaxPool_i downsamples G to the resolution of the corresponding intermediate output Yi.
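Building multi-scale ground truth by repeated 2 × 2 max pooling can be sketched as below (a minimal version for binary masks; max pooling keeps a pixel positive if any pixel in its 2 × 2 block is positive):

```python
import numpy as np

def maxpool_labels(gt, times):
    """Downsample a binary ground-truth mask with repeated 2x2 max pooling,
    producing a supervision target for a lower-resolution output."""
    for _ in range(times):
        h, w = gt.shape
        gt = gt.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return gt

gt = np.zeros((8, 8), dtype=int)
gt[2:4, 2:4] = 1          # a small urban-village region in the full mask
print(maxpool_labels(gt, 2))
```

After two pooling steps the 8 × 8 mask becomes 2 × 2, with the positive region preserved rather than averaged away.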
In this study, the loss for each component is calculated using the cross-entropy loss function [41], which can be defined as:
Lce = −(1/N) Σn [yn log(pn) + (1 − yn) log(1 − pn)]
where yn represents the ground truth label, pn denotes the predicted probability from the model, and N is the total number of pixels. Therefore, the overall loss function for this study can be defined as:
L = Lce(Y, G) + λ Σi Lce(Yi, Gi)
where λ is a balancing coefficient used to adjust the weight between the supervision of the final result and the intermediate results.
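Putting the pieces together, a minimal sketch of the deep-supervision objective follows (binary cross-entropy over pixels, with λ = 0.25 as the default, matching the value reported in the Implementation details):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over all N pixels."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(final_pred, final_gt, aux_preds, aux_gts, lam=0.25):
    """Deep-supervision objective: main loss plus a lambda-weighted sum of
    the auxiliary (intermediate-output) losses."""
    aux = sum(bce(p, g) for p, g in zip(aux_preds, aux_gts))
    return bce(final_pred, final_gt) + lam * aux

y = np.array([0.0, 1.0, 1.0, 0.0])   # ground-truth pixel labels
p = np.array([0.1, 0.9, 0.8, 0.2])   # predicted probabilities
print(total_loss(p, y, [p], [y]))
```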
Experiments and results
Dataset and interpretation signs
Dataset: In this study, we conducted experimental validation using three datasets collected from Beijing, Xi’an, and Shenzhen. The remote sensing images include data from Beijing in 2016, Xi’an in 2018, and Shenzhen in 2020, with corresponding labels generated by professionals through visual interpretation and cross-validation methods. To fully leverage the dataset and evaluate the model’s generalization ability, we combined the training, testing, and validation sets and employed a 5-fold cross-validation approach to assess model accuracy [42]. During the experiments, the original images and their corresponding labels were cropped into non-overlapping 512 × 512-pixel patches to facilitate model training and testing. To enhance the diversity of the training data, we applied a series of image preprocessing methods for data augmentation, including: 1) Image Rotation: Rotating the images and labels by 90°, 180°, and 270°; 2) Image Flipping: Horizontal flipping and vertical flipping; 3) Noise Addition: Gaussian noise, salt-and-pepper noise; 4) Image Blurring: Gaussian blur, mean blur, median blur [31].
Interpretation signs: Due to the disorderly construction of urban villages, they exhibit distinct features in images, as shown in Fig 8. These features include: 1) Irregular Shapes: The lack of planning in the construction of urban villages results in irregular shapes, distinguishing them from commercial or well-planned residential areas. This leads to complex geometric forms in remote sensing imagery. 2) Internal Objects: In urban villages, buildings are densely packed with narrow gaps between them, resulting in large, continuous, and compact rooftop-covered areas in the imagery. The individual buildings are typically small in scale, and the internal roads are often very narrow, winding, and lack proper planning. Additionally, public green spaces and centralized green areas are extremely scarce. 3) Rough Textures: The texture features are frequently changing and chaotic, with high spatial complexity, exhibiting strong contrast with the surrounding environment.
Implementation details
The experiments in this study were conducted on a 64-bit Windows 10 operating system with an Intel(R) Core (TM) i9-11900K @ 3.50 GHz 16-core processor and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The programming language used was Python 3.8, with PyCharm as the development environment and PyTorch 2.0.0 as the deep learning framework.
During the training phase, manual tuning was employed to identify the optimal batch size, optimizer, learning rate, and λ. The final batch size was set to 16, with AdamW selected as the optimizer. The learning rate was set to 1e-3, and λ was set to 0.25. A cosine annealing strategy was applied to dynamically adjust the learning rate throughout training. The cross-entropy loss function was used for model optimization. To ensure that the loss curve stabilized, the maximum number of training epochs was set to 100.
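The cosine annealing rule is the standard one; a sketch with the hyperparameters reported above (initial learning rate 1e-3, 100 epochs, and an assumed minimum of zero, which the paper does not state):

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """Standard cosine annealing schedule: the learning rate decays from
    lr_max at epoch 0 to lr_min at total_epochs along a half cosine."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

for e in (0, 50, 100):
    print(e, cosine_lr(e, 100))
```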
To evaluate the model’s performance, we selected four commonly used metrics: Precision, Recall, F1-score, and IoU [43,44]. Additionally, we evaluated the model’s complexity by measuring floating point operations (FLOPs), parameters, and the inference time required for processing images of size 3 × 512 × 512.
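The four metrics follow their standard definitions over the binary confusion counts; a minimal sketch (not tied to the paper's implementation):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Precision, Recall, F1-score, and IoU for a pair of binary masks."""
    tp = np.sum((pred == 1) & (gt == 1))  # true positives
    fp = np.sum((pred == 1) & (gt == 0))  # false positives
    fn = np.sum((pred == 0) & (gt == 1))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
print(segmentation_metrics(pred, gt))  # precision, recall, F1, IoU
```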
To assess the performance of the proposed method, we compared it with two publicly available urban village extraction methods, namely UisNet [13] and UV-SAM [22], as well as three classic land cover classification methods: ABCNet [45], CMTFNet [46], and UNetFormer [43]. To ensure the stability and reliability of the results, all experiments in this study were repeated five times and the results averaged.
Experimental results
Qualitative analysis: Figs 9–11 present the extraction results of the six methods on the Beijing, Xi’an, and Shenzhen datasets, respectively. From the figures, it is evident that urban villages exhibit significant heterogeneity in shape and scale, posing considerable challenges for accurate identification. Among the compared methods, except for MEBANet, the other methods generally suffer from issues such as false positives, false negatives, irregular boundaries, and internal holes. In contrast, MEBANet demonstrates superior performance in its extraction results. It not only captures the complete contours of urban villages with clear and continuous boundaries but also achieves the highest alignment with the ground truth labels, with almost no noticeable false positives or false negatives. Furthermore, MEBANet exhibits strong robustness in handling the complex internal structures and scale variations of urban villages, effectively reducing internal voids. These results further validate the superiority of MEBANet in urban village extraction tasks.
The figure was produced by the authors of this study using the publicly available dataset from [22].
The figure was produced by the authors of this study using the publicly available dataset from [22].
The figure was produced by the authors of this study using the publicly available dataset from [13].
Quantitative Evaluation: Table 1 presents the quantitative evaluation results of the six methods on the Beijing dataset. It can be observed that the proposed MEBANet achieves the highest performance across all evaluation metrics, with a Precision of 89.28%, Recall of 88.47%, F1-score of 88.87%, and IoU of 79.97%. Among the remaining methods, UisNet ranks second, achieving 86.74% in Precision, 87.96% in Recall, 87.35% in F1-score, and 77.53% in IoU. In contrast, UNetFormer performs the worst in three of the four metrics, while CMTFNet records the lowest score in the remaining metric at 85.59%. Table 2 summarizes the quantitative evaluation results on the Xi’an dataset. MEBANet again demonstrates superior performance, with a Precision of 92.78%, Recall of 89.89%, F1-score of 91.31%, and IoU of 84.01%. UisNet ranks second, achieving 90.63% in Precision, 89.39% in Recall, 90.01% in F1-score, and 81.83% in IoU. ABCNet performs the worst across all metrics, with scores of 89.03%, 86.83%, 87.92%, and 78.44%, respectively. Table 3 reports the evaluation results on the Shenzhen dataset. MEBANet consistently achieves the best performance, with a Precision of 88.44%, Recall of 87.29%, F1-score of 87.86%, and IoU of 78.35%. CMTFNet ranks second in both F1-score and IoU, with values of 87.08% and 76.23%, respectively, while UisNet achieves the second-best Precision and Recall at 85.53% and 86.09%. ABCNet shows the lowest scores in three of the four metrics, with values of 84.08%, 84.71%, and 73.48%, respectively, while UNetFormer records the lowest value in the remaining metric at 85.18%.
The above results represent the average performance over five repeated experiments. Figs 12–14 present the mean and standard deviation (SD) of the accuracy for different methods on the Beijing, Xi’an, and Shenzhen datasets, respectively. It can be observed that MEBANet achieves the highest mean accuracy across all datasets. Although it does not obtain the lowest SD on the Beijing dataset, where it ranks slightly behind UisNet and ABCNet, it still achieves the best overall accuracy. On the Xi’an and Shenzhen datasets, MEBANet demonstrates both the highest accuracy and the most stable performance, highlighting its robustness and strong generalization capability. Combining the quantitative evaluation results and qualitative analysis, it is evident that MEBANet demonstrates the best performance in urban village extraction tasks. Across all three test datasets, MEBANet significantly outperforms the other compared methods.
Discussion
Ablation experiment
To validate the effectiveness of the constructed blocks, we conducted quantitative evaluation experiments on the Beijing dataset and Xi’an dataset, focusing on the performance contributions of three sub-modules: SFCB, MBAB, and DSB. Among these, MBAB and DSB are plug-and-play modules, while SFCB, as the core feature extraction structure, is an indispensable component of the model. Since traditional methods primarily focus on spatial domain feature extraction and often overlook the importance of frequency and channel domains, we specifically analyzed the contributions of FDE and CDE in detail. The ablation study results for MBAB and DSB are shown in Table 4, and the ablation results for the SFCB sub-module are presented in Table 5.
From Table 4, it can be observed that on the Beijing dataset, using the MBAB alone increases the by 0.77 percentage points (from 87.38% to 88.15%) and the
by 1.23 percentage points (from 77.58% to 78.81%). Using the DSB alone increases the
by 0.55 percentage points (from 87.38% to 87.93%) and the
by 0.88 percentage points (from 77.58% to 78.46%). When both MBAB and DSB are used together, the
increases by 1.49 percentage points (from 87.38% to 88.87%), and the
increases by 2.39 percentage points (from 77.58% to 79.97%).
On the Xi’an dataset, using the MBAB alone increases the F1-score by 1.45 percentage points (from 88.59% to 90.04%) and the IoU by 2.37 percentage points (from 79.52% to 81.89%). Using the DSB alone increases the F1-score by 0.64 percentage points (from 88.59% to 89.23%) and the IoU by 1.03 percentage points (from 79.52% to 80.55%). When both MBAB and DSB are used together, the F1-score increases by 2.72 percentage points (from 88.59% to 91.31%), and the IoU increases by 4.49 percentage points (from 79.52% to 84.01%). These results demonstrate that both the MBAB and DSB play significant roles in improving model performance.
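As a side note, the F1-score and IoU figures reported in the ablation tables can be cross-checked against each other: for binary segmentation, the F1 (Dice) score and IoU (Jaccard) score are deterministically related by F1 = 2·IoU / (1 + IoU). A minimal sketch of this consistency check:

```python
def f1_from_iou(iou: float) -> float:
    """Dice/F1 score implied by a Jaccard/IoU score (binary segmentation)."""
    return 2 * iou / (1 + iou)

def iou_from_f1(f1: float) -> float:
    """Inverse mapping: IoU implied by an F1 score."""
    return f1 / (2 - f1)

# Example: the Table 4 baseline reports IoU = 77.58% and F1 = 87.38%,
# which agree to rounding precision.
assert abs(f1_from_iou(0.7758) - 0.8738) < 1e-3
```

Applying the same check to every (F1, IoU) pair in Tables 4 and 5 confirms the reported numbers are internally consistent.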
Fig 15 illustrates the accuracy variations during the training process with and without DSB. The results demonstrate that while DSB introduces instability during training, it significantly accelerates the convergence of MEBANet. Moreover, the incorporation of DSB further improves the accuracy.
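The way DSB injects multi-level supervision signals can be sketched as a weighted sum of the main loss and auxiliary losses computed at intermediate decoder levels. The uniform auxiliary weight below is an illustrative assumption, not the paper's exact formulation:

```python
from typing import Sequence

def deep_supervision_loss(level_losses: Sequence[float],
                          aux_weight: float = 0.25) -> float:
    """Combine a main (full-resolution) loss with auxiliary losses from
    intermediate levels.

    level_losses[0] is the main loss; the rest are auxiliary losses.
    A single uniform aux_weight is an illustrative simplification --
    deep-supervision schemes often use per-level or decaying weights.
    """
    main, *aux = level_losses
    return main + aux_weight * sum(aux)
```

During training, each auxiliary head gets its own gradient signal, which is what shortens the effective gradient path and accelerates convergence, at the cost of the extra loss terms perturbing early training.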
From the first three rows of Table 5, it can be observed that the accuracy is relatively low when each module is used independently. However, from the fourth and fifth rows, it becomes evident that the combined use of the modules enhances feature extraction capabilities, leading to improved accuracy in urban village extraction. Specifically, on the Beijing dataset, as indicated by the comparison between the first and fourth rows, the use of FDE results in an increase of 1.25 percentage points in F1-score (from 87.24% to 88.49%) and 1.99 percentage points in IoU (from 77.36% to 79.35%). As shown in the comparison between the second and fourth rows, the use of SDE increases the F1-score by 1.42 percentage points (from 87.07% to 88.49%) and the IoU by 2.26 percentage points (from 77.09% to 79.35%). Further, the inclusion of CDE, as seen in the comparison between the fourth and fifth rows, results in a further improvement of 0.38 percentage points in F1-score (from 88.49% to 88.87%) and 0.62 percentage points in IoU (from 79.35% to 79.97%).
On the Xi’an dataset, the use of FDE, as shown in the comparison between the first and fourth rows, increases the F1-score by 1.71 percentage points (from 88.78% to 90.49%) and the IoU by 2.81 percentage points (from 79.82% to 82.63%). The comparison between the second and fourth rows indicates that SDE results in an increase of 2.21 percentage points in F1-score (from 88.28% to 90.49%) and 3.61 percentage points in IoU (from 79.02% to 82.63%). Finally, as shown in the comparison between the fourth and fifth rows, the incorporation of CDE leads to a further increase of 0.82 percentage points in F1-score (from 90.49% to 91.31%) and 1.38 percentage points in IoU (from 82.63% to 84.01%).
Backbone effectiveness analysis
To validate the effectiveness of the proposed feature extractor, SFCB, we conducted comparative experiments on the Beijing and Xi’an datasets, comparing SFCB with several popular feature extractors: Transformer-based models such as Vision Transformer [33], Swin Transformer [47], and Mix-Transformer [32], as well as CNN-based models such as DenseNet [48], ResNet50 [49], and Xception [50]. The experimental results are shown in Table 6. Specifically, on the Beijing dataset, among the Transformer-based feature extractors, Vision Transformer performs the worst, but its F1-score (87.55%) and IoU (77.86%) are still higher than those of the best-performing CNN-based feature extractor, ResNet50 (F1-score: 86.98%, IoU: 76.96%), by 0.57 and 0.90 percentage points, respectively. On the Xi’an dataset, among the Transformer-based feature extractors, Mix-Transformer performs the worst, but its F1-score (88.77%) and IoU (79.81%) are still higher than those of the best-performing CNN-based feature extractor, DenseNet (F1-score: 88.36%, IoU: 79.15%), by 0.41 and 0.66 percentage points, respectively.
Overall, SFCB significantly outperforms the other compared models on both datasets, validating its superiority in feature extraction. At the same time, Transformer-based feature extractors generally outperform CNN-based feature extractors, likely because Transformers are better at capturing global contextual information, thereby demonstrating stronger feature extraction capabilities in complex scenarios.
To further investigate the impact of the number of SFCBs on the urban village extraction task, we compared model performance when using 3, 4, and 5 SFCBs. The experimental results are shown in Table 7, which demonstrates that the model achieves optimal performance with 4 SFCBs on both datasets. Specifically, on the Beijing dataset, using 4 SFCBs results in an F1-score of 88.87% and an IoU of 79.97%, representing improvements of 1.32 and 2.11 percentage points, respectively, compared to using 3 SFCBs, and improvements of 0.86 and 1.38 percentage points, respectively, compared to using 5 SFCBs. On the Xi’an dataset, using 4 SFCBs results in an F1-score of 91.31% and an IoU of 84.01%, representing improvements of 1.68 and 2.79 percentage points, respectively, compared to using 3 SFCBs, and improvements of 1.00 and 1.67 percentage points, respectively, compared to using 5 SFCBs.
Model transferability analysis
To investigate the robustness and transferability of MEBANet, we trained the model on the Beijing dataset and validated it on the Xi’an dataset. The experimental results are shown in Table 8. Due to differences in architectural styles between the two cities, the accuracy of all models decreased compared to the results in Table 1. Nevertheless, MEBANet still demonstrated the best performance, achieving precision, recall, F1-score, and IoU of 77.74%, 73.82%, 75.73%, and 60.94%, respectively. The second-best performance was achieved by UisNet, with precision, recall, F1-score, and IoU of 75.55%, 70.71%, 73.05%, and 57.54%, respectively, which are 2.19, 3.11, 2.68, and 3.40 percentage points lower than those of MEBANet. The worst performance was observed with ABCNet, whose precision, F1-score, and IoU of 69.05%, 69.81%, and 53.63%, respectively, represent decreases of 8.69, 5.92, and 7.31 percentage points compared to MEBANet. Additionally, UV-SAM had the lowest recall, at 67.29%, which is 6.53 percentage points lower than that of MEBANet.
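The four transfer metrics are mutually consistent: F1 is the harmonic mean of precision and recall, and, in the binary case, IoU follows directly from F1. A quick check in code:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def iou_from_f1(f1: float) -> float:
    """Jaccard index implied by a Dice/F1 score (binary segmentation)."""
    return f1 / (2 - f1)

# MEBANet's transfer results from Table 8: P = 77.74%, R = 73.82%
f1 = f1_score(0.7774, 0.7382)   # matches the reported 75.73%
iou = iou_from_f1(f1)           # matches the reported 60.94%
```

The same identities reproduce UisNet's 73.05% F1 and 57.54% IoU from its precision and recall, so the table's columns check out.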
Fig 16 shows the visual results, and a comparison with Fig 9 reveals that training and validating on different datasets lead to more false positives, false negatives, irregular boundaries, and internal holes across all methods, including the proposed MEBANet. Nevertheless, among all the methods, MEBANet performs the best, particularly in the second row.
The figure was produced by the authors of this study using the publicly available dataset from [22].
Parameter fine-tuning
The selection of appropriate model parameters is essential for achieving optimal performance. In this study, manual parameter tuning was employed to determine the optimal values for the batch size, optimizer, learning rate, and a model structure parameter. The tuning process was carried out in two stages: first, the parameters directly related to model training, namely the batch size, optimizer, and learning rate, were determined; then, the model structure parameter was adjusted. Initially, a learning rate of 0.005, the Adam optimizer, and a structure parameter value of 0.25 were selected. The results, shown in Fig 17(a), indicate that the optimal batch size was 16, which is also the maximum feasible batch size under our hardware constraints. Subsequently, with the batch size fixed at 16 and the learning rate and structure parameter held constant, the optimizer was varied. The results, shown in Fig 17(b), demonstrate that AdamW was the most effective optimizer. Following this, with the optimizer fixed, the learning rate was adjusted, and the best performance was achieved at a learning rate of 0.001, as shown in Fig 17(c). Finally, with the first three parameters fixed, the structure parameter was fine-tuned, and the results presented in Fig 17(d) reveal that a value of 0.25 yielded the optimal performance.
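The one-parameter-at-a-time procedure in Fig 17 amounts to a greedy coordinate search. The sketch below captures that loop; `evaluate` is a hypothetical stand-in for training the model under a given configuration and returning a validation score.

```python
def tune(evaluate, search_space, initial):
    """Greedy coordinate search: holding all other parameters fixed,
    sweep one parameter at a time in the listed order and keep the
    best value before moving on. `evaluate` is a hypothetical callback
    returning a validation score (higher is better)."""
    config = dict(initial)
    for name, candidates in search_space.items():
        best_val, best_score = config[name], float("-inf")
        for value in candidates:
            trial = {**config, name: value}
            score = evaluate(trial)
            if score > best_score:
                best_val, best_score = value, score
        config[name] = best_val
    return config
```

With the search space ordered as batch size, optimizer, learning rate, then the structure parameter, this mirrors the two-stage procedure above. Note that greedy coordinate search only finds the joint optimum when the parameters interact weakly, which is the implicit assumption behind tuning them sequentially.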
Model complexities
Table 9 presents the computational complexity and efficiency of the different models. MEBANet sits at a moderate level in terms of model efficiency. It has 26.71M parameters, slightly more than the lightweight models ABCNet and UNetFormer, far fewer than UV-SAM, and a similar number to UisNet and CMTFNet. In terms of computational complexity, MEBANet is significantly lower than UisNet and UV-SAM, while remaining comparable to CMTFNet. Notably, MEBANet’s inference time is 20.54 ms, only slightly slower than ABCNet and UNetFormer, and a significant advantage over UisNet, CMTFNet, and UV-SAM.
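To put the 26.71M parameter figure in context, layer-wise parameter counts follow standard formulas: a 2-D convolution contributes k·k·C_in·C_out weights (plus C_out biases), and a fully connected layer contributes D_in·D_out (plus D_out). A small helper, for illustration only:

```python
def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameter count of a standard 2-D convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)

# A single 3x3 convolution from 64 to 128 channels already costs
# ~74k parameters, so a few dozen such layers plus attention blocks
# readily reach the tens-of-millions range reported in Table 9.
print(conv2d_params(64, 128, 3))  # 73856
```

Parameter count is only one axis of Table 9; FLOPs and inference time depend additionally on feature-map resolution and memory access patterns, which is why MEBANet can have mid-range parameters yet competitive latency.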
Limitations and future work
Although the proposed MEBANet achieves high accuracy in the urban village extraction task and its effectiveness has been validated through experiments, several areas remain worthy of further investigation:
- 1). Dataset expansion: This study was tested only on datasets from three cities in China. While the results are promising, the scope of the research is currently limited. To enhance the generalizability of the model, future work will involve testing it on datasets from additional regions, including cities with different architectural styles and urban features, such as those from Africa and South America. This will help validate the model’s adaptability to diverse geographical and cultural contexts.
- 2). Utilization of multi-modal data: This study relies solely on optical remote sensing images for urban village extraction. Considering that multi-modal data provide additional perspectives, future research will explore the integration of multi-modal data, such as building height data and SAR imagery, to further enhance the model’s recognition accuracy and robustness.
- 3). Transfer learning issue: Although MEBANet has been tested with transfer learning and shows better performance than other methods, the issue of significant accuracy degradation on unseen data remains unresolved. In future research, we will address this limitation by exploring domain adaptation and self-supervised learning techniques to enhance the model’s robustness and ensure consistent performance across diverse datasets.
Conclusions
This study addresses the complexity and challenges of urban village extraction by proposing an innovative method named MEBANet. The proposed MEBANet significantly enhances the accuracy and robustness of urban village extraction through the synergistic integration of three core modules: the SFCB, the MBAB, and the DSB. Experimental results demonstrate that MEBANet consistently outperforms existing methods on the Beijing, Xi’an and Shenzhen datasets in terms of overall performance. Cross-dataset transfer experiments further validate its strong generalization capability. Ablation studies confirm the individual contributions and effectiveness of each module. Future research directions include testing the accuracy of urban village extraction on different regions, incorporating multi-modal data into the task to improve accuracy, and addressing the challenge of generalization to unseen data.
References
- 1. Matarira D, Mutanga O, Naidu M. Google earth engine for informal settlement mapping: a random forest classification using spectral and textural information. Remote Sens. 2022;14(20):5130.
- 2. Verma D, Jana A, Ramamritham K. Transfer learning approach to map urban slums using high and medium resolution satellite imagery. Habitat Intern. 2019;88:101981.
- 3. Ansari RA, Malhotra R, Buddhiraju KM. Identifying informal settlements using contourlet assisted deep learning. Sensors (Basel). 2020;20(9):2733. pmid:32403308
- 4. Pan Z, Xu J, Guo Y, Hu Y, Wang G. Deep learning segmentation and classification for urban village using a worldview satellite image based on U-Net. Remote Sens. 2020;12(10):1574.
- 5. Duque J, Patino J, Betancourt A. Exploring the potential of machine learning for automatic slum identification from VHR imagery. Remote Sens. 2017;9(9):895.
- 6. Gevaert CM, Persello C, Sliuzas R, Vosselman G. Informal settlement classification using point-cloud and image-based features from UAV data. ISPRS J Photogram Remote Sens. 2017;125:225–36.
- 7. Wei C, Wei H, Crivellari A, Liu T, Wan Y, Chen W, et al. Gaofen-2 satellite image-based characterization of urban villages using multiple convolutional neural networks. Inter J Remote Sens. 2023;44(24):7808–26.
- 8. Zhang C, Xing J, Li J, Du S, Qin Q. A new method for the extraction of tailing ponds from very high-resolution remotely sensed images: PSVED. Inter J Digital Earth. 2023;16(1):2681–703.
- 9. Lu W, Shi X, Lu Z. A new two-step road extraction method in high resolution remote sensing images. PLoS One. 2024;19(7):e0305933. pmid:39024329
- 10. Hao M, Dong X, Jiang D, Yu X, Ding F, Zhuo J. Land-use classification based on high-resolution remote sensing imagery and deep learning models. PLoS One. 2024;19(4):e0300473. pmid:38635663
- 11. Xi Y, Ren C, Tian Q, Ren Y, Dong X, Zhang Z. Exploitation of time series sentinel-2 data and different machine learning algorithms for detailed tree species classification. IEEE J Sel Top Appl Earth Observ Remote Sens. 2021;14:7589–603.
- 12. Park H, Fan P, John R, Ouyang Z, Chen J. Spatiotemporal changes of informal settlements: ger districts in Ulaanbaatar, Mongolia. Landsc Urban Plan. 2019;191:103630.
- 13. Fan R, Li F, Han W, Yan J, Li J, Wang L. Fine-scale urban informal settlements mapping by fusing remote sensing images and building data via a transformer-based multimodal fusion network. IEEE Trans Geosci Remote Sens. 2022;60:1–16.
- 14. Chang F, Ma T, Wang D, Zhu S, Li D, Feng S, et al. Method for building segmentation and extraction from high-resolution remote sensing images based on improved YOLOv5ds. PLoS One. 2025;20(3):e0317106. pmid:40100935
- 15. Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(4):640–51. pmid:27244717
- 16. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: 18th International conference on medical image computing and computer-assisted intervention (MICCAI), Munich, Germany. Cham: Springer; 2015.
- 17. Chen LC, Zhu YK, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV); 2018.
- 18. Wang Z, Wang B, Zhang C, Liu Y. Defense against adversarial patch attacks for aerial image semantic segmentation by robust feature extraction. Remote Sens. 2023;15(6):1690.
- 19. Zhang S, Chen T, Su F, Xu H, Li Y, Liu Y. Deep layered network based on rotation operation and residual transform for building segmentation from remote sensing images. Sensors (Basel). 2025;25(8):2608. pmid:40285301
- 20. Persello C, Stein A. deep fully convolutional networks for the detection of informal settlements in VHR images. IEEE Geosci Remote Sens Lett. 2017;14(12):2325–9.
- 21. Du S, Xing J, Wang S, Wei L, Zhang Y. STMNet: scene classification-assisted and texture feature-enhanced multiscale network for large-scale urban informal settlement extraction from remote sensing images. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2024;17:13169–87.
- 22. Zhang X, Liu Y, Lin Y, Liao Q, Li Y. UV-SAM: adapting segment anything model for urban village identification. AAAI. 2024;38(20):22520–8.
- 23. Li L, Chen B, Zou X, Xing J, Tao P. UV-mamba: A DCN-enhanced state space model for urban village boundary identification in high-resolution remote sensing images. In: ICASSP 2025 - 2025 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE; 2025. 1–5.
- 24. Zhang J, Shao M, Wan Y, Meng L, Cao X, Wang S. Boundary-aware spatial and frequency dual-domain transformer for remote sensing urban images segmentation. IEEE Trans Geosci Remote Sens. 2024;62:1–18.
- 25. Gao F, Fu M, Cao J, Dong J, Du Q. Adaptive frequency enhancement network for remote sensing image semantic segmentation. IEEE Trans Geosci Remote Sens. 2025;63:1–15.
- 26. Yang Y, Yuan G, Li J. SFFNet: a wavelet-based spatial and frequency domain fusion network for remote sensing segmentation. IEEE Trans Geosci Remote Sens. 2024;62:1–17.
- 27. Liu H, Wang C, Zhao J, Chen S, Kong H. Adaptive fourier convolution network for road segmentation in remote sensing images. IEEE Trans Geosci Remote Sens. 2024;62:1–14.
- 28. Zhang H, Xie G, Li L, Xie X, Ren J. Frequency-domain guided swin transformer and global–local feature integration for remote sensing images semantic segmentation. IEEE Trans Geosci Remote Sensing. 2025;63:1–11.
- 29. Zhang F, Xia X. Efficient semantic segmentation of remote sensing images through global-local feature integration. IEEE Access. 2025;13:115653–68.
- 30. Yan KY, Shen F, Li ZY. Enhancing landslide segmentation with guide attention mechanism and fast fourier transformer. In: 20th International conference on intelligent computing (ICIC), Tianjin, China. Springer; 2024.
- 31. Zhang C, Yue P, Tapete D, Jiang L, Shangguan B, Huang L, et al. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J Photogram Remote Sens. 2020;166:183–200.
- 32. Xie EZ, Wang WH, Yu ZD, Anandkumar A, Alvarez JM, Luo P. SegFormer: simple and efficient design for semantic segmentation with transformers. In: 35th Conference on neural information processing systems (NeurIPS); 2021.
- 33. Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint. 2020. https://arxiv.org/abs/2010.11929
- 34. Zhang T, Dick RP. Spatial-frequency network for segmentation of remote sensing images. In: 30th IEEE international conference on image processing (ICIP), Kuala Lumpur, Malaysia; 2023.
- 35. Ren B, Liu B, Hou B, Wang Z, Yang C, Jiao L. SwinTFNet: dual-stream transformer with cross attention fusion for land cover classification. IEEE Geosci Remote Sensing Lett. 2024;21:1–5.
- 36. Woo SH, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. In: 15th European conference on computer vision (ECCV), Munich, Germany; 2018.
- 37. Yang MK, Yu K, Zhang C, Li ZW, Yang KY. DenseASPP for semantic segmentation in street scenes. In: 31st IEEE/CVF conference on computer vision and pattern recognition (CVPR), Salt Lake City, UT; 2018.
- 38. Zhao D, Ni L, Zhou K, Lv Z, Qu G, Gao Y, et al. A study of the improved A* algorithm incorporating road factors for path planning in off-road emergency rescue scenarios. Sensors (Basel). 2024;24(17):5643. pmid:39275555
- 39. Jing Y, Zhang T, Liu Z, Hou Y, Sun C. Swin-ResUNet+: an edge enhancement module for road extraction from remote sensing images. Comp Vision Image Understand. 2023;237:103807.
- 40. Liu T, Li J, Cao W, Tang M, Yang G. MLCNet: multitask level-specific constraint network for building change detection. IEEE J Sel Top Appl Earth Observ Remote Sens. 2024;17:11823–38.
- 41. Sun P, Lu Y, Zhai J. Mapping land cover using a developed U-Net model with weighted cross entropy. Geocarto Inter. 2021;37(25):9355–68.
- 42. Xiao Y, Zhao Z, Huang J, Huang R, Weng W, Liang G, et al. The illusion of success: test set disproportion causes inflated accuracy in remote sensing mapping research. Inter J Appl Earth Observ Geoinform. 2024;135:104256.
- 43. Wang L, Li R, Zhang C, Fang S, Duan C, Meng X, et al. UNetFormer: a UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J Photogram Remote Sens. 2022;190:196–214.
- 44. Gao M, Dong W, Chen L, Wu Z. Automatic extraction of water body from SAR images considering enhanced feature fusion and noise suppression. Appl Sci. 2025;15(5):2366.
- 45. Li R, Zheng S, Zhang C, Duan C, Wang L, Atkinson PM. ABCNet: attentive bilateral contextual network for efficient semantic segmentation of fine-resolution remotely sensed imagery. ISPRS J Photogram Remote Sens. 2021;181:84–98.
- 46. Wu H, Huang P, Zhang M, Tang W, Yu X. CMTFNet: CNN and multiscale transformer fusion network for remote-sensing image semantic segmentation. IEEE Trans Geosci Remote Sens. 2023;61:1–12.
- 47. Liu Z, Lin YT, Cao Y, Hu H, Wei YX, Zhang Z. Swin Transformer: hierarchical vision transformer using shifted windows. In: 18th IEEE/CVF International conference on computer vision (ICCV); 2021.
- 48. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: 30th IEEE/CVF Conference on computer vision and pattern recognition (CVPR), Honolulu, HI; 2017.
- 49. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), 2016. 770–8.
- 50. Chollet F. Xception: deep learning with depthwise separable convolutions. In: 30th IEEE/CVF Conference on computer vision and pattern recognition (CVPR), Honolulu, HI; 2017.