Abstract
Urban villages, as a typical phenomenon in the process of urbanization, play a significant role in urban planning and sustainable development. However, their high-density structures and complex boundaries pose significant challenges for extraction tasks based on remote sensing imagery. To address these challenges, this paper proposes a Multi-domain Enhancement and Boundary Awareness Network (MEBANet) for urban village extraction. MEBANet consists of three core blocks: 1) The spatial-frequency-channel feature extraction block (SFCB), which simultaneously enhances feature representation in the spatial, frequency, and channel domains; 2) The multi-scale boundary awareness block (MBAB), which leverages dense atrous spatial pyramid pooling (DenseASPP) and multi-directional Sobel operator convolution to strengthen the perception of complex boundaries; and 3) The deep supervision block (DSB), which accelerates model convergence through multi-level supervision signals. Experiments were conducted on three publicly available datasets from Beijing, Xi’an, and Shenzhen. The results demonstrate that MEBANet outperforms existing methods in terms of precision, recall, F1-score, and IoU. Additionally, cross-dataset transfer experiments validate the robustness and generalization capability of MEBANet. Ablation studies further confirm the effectiveness of each block. This study provides a high-accuracy and automated solution for urban village extraction from high-resolution remote sensing imagery, offering valuable insights for urban planning and management.
Citation: Chang F, Fan X, Xu R, Wang S, Qin K, Gao X (2025) MEBANet: A Multi-domain Enhancement and Boundary Awareness Network for urban village extraction from high-resolution imagery. PLoS One 20(10): e0330302. https://doi.org/10.1371/journal.pone.0330302
Editor: Xu Yanwu, South China University of Technology, CHINA
Received: April 11, 2025; Accepted: July 26, 2025; Published: October 22, 2025
Copyright: © 2025 Chang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The urban village extraction dataset that supports the findings of this study is publicly available from the Figshare repository at https://doi.org/10.6084/m9.figshare.29832671.
Funding: This study was supported by The Central Government-Guided Local Science and Technology Development Fund Project under Grant 236Z6101G, and The Hebei Transportation Investment Group Co., Ltd. Major Scientific Research and Development Project under Grant 20245051001S. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Urban villages typically refer to residential settlements originally of rural nature that, despite being located within urban areas, have not been fully integrated into formal urban governance and infrastructure systems during the process of urbanization [1,2]. With the rapid advancement of urbanization, urban villages have emerged as a unique socio-spatial phenomenon at the intersection of rural and urban development [3]. Characterized by high-density, small-scale buildings, urban villages accommodate a significant portion of the urban population and resources, providing affordable housing for low-income residents [4]. However, the negative impacts associated with urban villages cannot be overlooked. These areas often pose serious social, economic, and environmental challenges. For instance, their densely packed and disorderly layout disrupts the urban landscape; inadequate infrastructure lowers residents’ quality of life; and poor sanitation conditions, coupled with limited public services, hinder sustainable urban development [5,6]. Therefore, accurately delineating the spatial extent of urban villages is not only a critical task for urban planning and governance but also a fundamental prerequisite for achieving refined urban management and sustainable development.
Traditional methods for mapping the spatial extent of urban villages primarily rely on field surveys [4,7]. Although these methods yield reliable results, they are constrained by high costs, intensive labor requirements, and long execution periods, factors that limit their scalability when applied to widely distributed urban villages [8]. With the rapid advancement of remote sensing technology, high-resolution imagery has emerged as a promising alternative for urban village extraction. Remote sensing imagery offers wide spatial coverage and frequent updates, making it an increasingly important data source for extracting urban village spatial information [1,9]. However, most traditional approaches still depend on visual interpretation. While this avoids the complexities of field surveys, visual interpretation remains limited by inconsistent interpretation standards and delayed data updates, thus failing to meet the demands of large-scale and efficient urban village monitoring [4]. Consequently, there is an urgent need for automated solutions.
Currently, mainstream approaches for automated feature extraction can be broadly categorized into traditional machine learning methods and deep learning methods [10]. Traditional machine learning algorithms, such as Support Vector Machines (SVM) and Random Forests (RF), are known for their computational simplicity, fast processing speed, and relatively low data requirements [11]. These methods have been widely applied in urban village extraction tasks. For example, Duque et al. [5] evaluated the performance of logistic regression, SVM, and RF in cities such as Buenos Aires (Argentina), Medellín (Colombia), and Recife (Brazil), demonstrating that SVM achieved the best results. Matarira et al. [1] successfully delineated urban villages in Durban, South Africa, by combining the random forest algorithm with Sentinel-2A imagery. Similarly, Park et al. [12] mapped the spatial distribution of urban villages in Ulaanbaatar from 1990 to 2013 using QuickBird and Landsat images supported by SVM. Gevaert et al. [6] further applied SVM to extract urban villages in Kigali, Rwanda, and Maldonado, Uruguay. Nevertheless, these traditional machine learning methods heavily rely on handcrafted features, which are often incomplete and prone to overfitting. Moreover, they struggle to model complex non-linear relationships in heterogeneous environments, ultimately limiting classification accuracy [13]. With the rise of deep learning, significant advancements have been made in this domain. Deep learning constructs robust nonlinear neural networks capable of automatically learning hierarchical feature representations, eliminating the need for manual feature engineering [7,9,14]. As a result, an increasing number of deep neural network-based models have been employed in extraction tasks for identifying objects of interest in remote sensing imagery, achieving remarkable performance. 
This includes widely used classical semantic segmentation networks such as FCN [15], U-Net [16], and DeepLabv3+ [17], as well as specialized models designed for specific land cover types or challenges. For instance, Wang et al. [18] proposed RFENet to address the vulnerability of deep learning models in aerial image segmentation to adversarial patch attacks. Zhang et al. [19] introduced C_ASegformer to improve multi-scale feature integration and contextual awareness, achieving superior results compared to existing models. In the context of urban village extraction, several deep learning-based studies have demonstrated promising results. Verma et al. [2] proposed the use of pre-trained convolutional networks for detecting urban villages in Mumbai. Persello et al. [20] applied FCN to extract urban villages in Dar es Salaam, Tanzania. Similarly, Pan et al. [4] achieved delineation of urban villages in Guangzhou, China, using a U-Net-based method. Wei et al. [7] compared the performance of FCN, U-Net, and ResUNet for urban village extraction at the junction of Dongguan, Huizhou, and Guangzhou. These studies primarily rely on the direct application of existing semantic segmentation networks without considering the unique characteristics of urban villages, thereby limiting further improvements in accuracy.
To address these limitations, some researchers have proposed tailored architectures. For example, Ansari et al. developed a composite model that integrates U-Net with a multiscale contourlet transform to improve extraction accuracy in the cities of Mumbai and Pune, India. Fan et al. [13] introduced UisNet, which enhances extraction accuracy by utilizing building-level information. Based on a Transformer architecture, UisNet incorporates multimodal data including spatial building footprints and floor numbers to extract urban villages in Shenzhen. Du et al. [21] proposed STMNet, which fuses texture features derived from gray-level co-occurrence matrices with original images to extract urban villages in Beijing. Zhang et al. [22] introduced UV-SAM to address inaccurate boundary extraction by incorporating a prompt generation module into a foundational model, achieving promising results in Beijing and Xi’an. Furthermore, Li et al. [23] addressed the limited global context modeling of CNNs and the computational complexity of Transformers by introducing the Mamba architecture for the first time in this task, resulting in UV-Mamba.
Despite recent advancements, most deep learning-based methods for urban village extraction rely primarily on spatial-domain representations, with limited exploration of the frequency domain. Spatial-domain approaches operate on raw pixel intensities and spatial arrangements, which may not fully reveal the underlying structural patterns, especially in highly heterogeneous and visually complex urban scenes [24,25]. In contrast, frequency-domain representations offer a complementary perspective by decomposing images into components with distinct frequency characteristics. Low-frequency components capture global structures and smooth variations, while high-frequency components emphasize fine-grained textures, edges, and discontinuities, features that are often critical for delineating urban villages with irregular layouts and noisy visual characteristics. Integrating frequency-domain information enables models to enhance their perception of subtle yet important features that may be overlooked in the spatial domain, thereby improving the overall robustness and precision of segmentation [26]. As such, a combined spatial-frequency perspective allows for richer and more discriminative feature representation, which is particularly beneficial in complex urban extraction tasks.
Encouragingly, recent studies have attempted to incorporate frequency-domain information into semantic segmentation tasks. Existing approaches can be broadly categorized into two strategies. The first involves parallel use of frequency-domain and spatial-domain feature extraction modules, often implemented as dual-branch networks or as separate modules inserted into the encoder and decoder stages, respectively [27,28]. The second approach introduces frequency-domain features as auxiliary enhancement modules to support spatial feature learning [29,30]. However, the former design increases computational cost and limits timely feature interaction, while the latter cannot fully exploit frequency information.
To address the aforementioned challenges, this study proposes a novel method named multi-domain enhancement and boundary awareness network (MEBANet). MEBANet is primarily composed of three core modules. The spatial-frequency-channel feature extraction block (SFCB) enhances multi-domain feature representations, particularly by incorporating frequency-domain processing to effectively extract critical yet often overlooked features. The multi-scale boundary awareness block (MBAB) leverages multi-scale feature extraction and boundary awareness to improve the accuracy of boundary delineation. Finally, a deep supervision block (DSB) is introduced to accelerate network convergence. The main contributions of this study can be summarized as follows:
- 1). We propose a novel MEBANet architecture that integrates multi-domain feature enhancement and multi-scale boundary awareness mechanisms to capture the complex spatial characteristics of urban villages more accurately. Its effectiveness is demonstrated through experiments on datasets from Beijing, Xi’an, and Shenzhen.
- 2). We design the SFCB module to simultaneously enhance features in the spatial, frequency, and channel domains. Unlike existing studies, our approach fuses spatial-domain enhancement submodule, frequency-domain enhancement submodule, and channel-domain enhancement submodule into a unified block that enables comprehensive feature interaction during extraction. Moreover, the design allows for flexible stage-wise integration within the network. Experimental results validate the effectiveness of SFCB in enhancing representational capacity.
- 3). We propose the MBAB module, which integrates DenseASPP and Sobel convolution to enhance boundary awareness. DenseASPP captures multi-scale features to address the spatial heterogeneity and scale variation of urban villages, while Sobel convolution refines boundary localization. This design significantly improves segmentation accuracy by reducing false positives and missed detections around complex boundaries.
Proposed methods
Overall structure
The architecture of the proposed MEBANet is illustrated in Fig 1, which consists of an encoder and a decoder. In the encoder, a 3 × 3 convolution is first applied for initial feature extraction. Subsequently, the proposed SFCB is employed as the backbone to hierarchically extract multi-level features. The encoder comprises four SFCBs, each consisting of a frequency-domain enhancement submodule (FDE), a spatial-domain enhancement submodule (SDE), and a channel-domain enhancement submodule (CDE). These extractors output four levels of features, denoted as F1, F2, F3, and F4. Assuming the input image has a size of (H × W × 3), the output features from the encoder are of dimensions (H/2 × W/2 × C1), (H/4 × W/4 × C2), (H/8 × W/8 × C3), and (H/16 × W/16 × C4), where H and W denote the height and width of the input image, and C1, C2, C3, and C4 denote the channel dimensions of the four stages.
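To make the stage-wise shapes concrete, the progression can be sketched in a few lines of Python. Note that the per-stage channel widths used below are hypothetical placeholders for illustration only; the paper does not fix them in this section, and only the halving of spatial resolution per stage is assumed.

```python
def encoder_shapes(h, w, channels):
    """Spatial size after each of the four SFCB stages,
    assuming every stage halves height and width via its max-pooling step."""
    shapes = []
    for i, c in enumerate(channels, start=1):
        shapes.append((h // 2**i, w // 2**i, c))
    return shapes

# Hypothetical channel widths C1..C4 for a 512 x 512 input patch.
print(encoder_shapes(512, 512, [64, 128, 256, 512]))
# [(256, 256, 64), (128, 128, 128), (64, 64, 256), (32, 32, 512)]
```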
In the decoder, instead of adopting a complex decoding structure, all features from the encoder are upsampled to match the input resolution. Each upsampled feature map is adjusted to have a uniform channel dimension of C, and the four maps are then concatenated to form a unified feature representation of size (H × W × 4C). To enhance the model’s adaptability to the shape and scale variations of urban villages, the MBAB is incorporated before generating the final output. Finally, the DSB is introduced to accelerate model convergence and enhance training efficiency. It leverages the intermediate features F1, F2, F3, and F4 from the encoder, combined with the output of the MBAB and the result of a 1 × 1 convolution, to provide auxiliary supervision signals that facilitate model optimization [31]. The subsequent sections of this chapter provide a detailed explanation of the design and implementation of the SFCB, MBAB, and DSB.
Feature extraction using SFCB
SFCB: The design of the SFCB aims to achieve efficient feature extraction and representation through multi-dimensional enhancement across the frequency, spatial, and channel domains. As illustrated in Fig 2, the input feature X first passes through a 3 × 3 convolution for initial processing. Then, two 1 × 1 convolutions are applied to generate two feature maps, which are respectively fed into the FDE and SDE. The 1 × 1 convolutions serve to reduce the number of channels, thereby improving computational efficiency, while also separating out features that are better suited to spatial-domain and frequency-domain enhancement. As a result, two distinct feature maps are obtained: Xs, which is passed to the SDE, and Xf, which is passed to the FDE.
The SDE outputs a refined feature map Fs, while the FDE produces Ff. These two outputs are fused through a cross-attention mechanism (CAM) to generate Fc, which is then passed into the CDE. This module recalibrates the importance of each feature channel, resulting in the output feature Fe. Finally, a 3 × 3 convolution followed by a max-pooling operation is applied to downsample the features, producing the SFCB output Fout, whose spatial resolution is half that of the input. This process can be summarized as follows:
Xs = Conv1×1(Conv3×3(X)), Xf = Conv1×1(Conv3×3(X))
Fs = SDE(Xs), Ff = FDE(Xf)
Fc = CAM(Fs, Ff)
Fe = CDE(Fc)
Fout = MaxPool(Conv3×3(Fe))
where Conv3×3, Conv1×1, and MaxPool denote the 3 × 3 convolution, 1 × 1 convolution, and max-pooling operations, respectively, and SDE(·), FDE(·), CAM(·), and CDE(·) represent the SDE, FDE, CAM, and CDE, respectively.
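The composition above can be sketched as a plain data-flow function, with every submodule stubbed out by a placeholder callable. This illustrates only the wiring of the block, not the real layers:

```python
import numpy as np

def sfcb(x, conv3, conv1a, conv1b, sde, fde, cam, cde, pool):
    """Sketch of the SFCB data flow described above; the submodules are
    passed in as callables (here stubbed), not the paper's real layers."""
    y = conv3(x)
    xs, xf = conv1a(y), conv1b(y)   # split into spatial / frequency paths
    fs, ff = sde(xs), fde(xf)       # per-domain enhancement
    fc = cam(fs, ff)                # cross-domain fusion
    fe = cde(fc)                    # channel recalibration
    return pool(conv3(fe))          # 3x3 conv + max-pool downsampling

identity = lambda a: a
halve = lambda a: a[:, ::2, ::2]    # toy stand-in for 2x2 max-pooling
x = np.ones((8, 16, 16))            # (channels, height, width)
out = sfcb(x, identity, identity, identity, identity, identity,
           lambda a, b: a + b, identity, halve)
print(out.shape)  # (8, 8, 8)
```

The stubs make the downsampling behavior explicit: a (8, 16, 16) feature map leaves the block at half spatial resolution.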
SDE: The structure of the SDE is illustrated in Fig 3. Its core lies in leveraging the Transformer module to extract global spatial information. To enhance computational efficiency, we adopted a combination of the Mix-FFN and Efficient Self-Attention mechanism (ESA) [32]. The Mix-FFN automatically computes positional encodings from the input features using 3 × 3 convolutions, eliminating the complexity of adding positional encodings separately, as in Vision Transformer [33]. This design not only preserves positional information but also significantly improves computational efficiency. This process can be described as follows:
Xout = MLP(GELU(Conv3×3(MLP(Xin)))) + Xin
where Xin and Xout represent the input and output features of the Mix-FFN, respectively, GELU denotes the GELU activation function, and MLP refers to the multi-layer perceptron.
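A minimal NumPy sketch of this Mix-FFN pattern follows; the weights are random placeholders rather than trained parameters, and a depthwise 3 × 3 convolution stands in for the positional-mixing convolution:

```python
import numpy as np

def conv3x3_depthwise(x, w):
    """x: (C, H, W); w: (C, 3, 3). Zero-padded depthwise 3x3 convolution."""
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + h, j:j + wd]
    return out

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mix_ffn(x, w_in, w_dw, w_out):
    """Sketch of Mix-FFN: MLP -> depthwise 3x3 conv (implicit positional
    encoding) -> GELU -> MLP, with a residual connection."""
    h = np.einsum('chw,cd->dhw', x, w_in)        # first MLP (1x1 projection)
    h = gelu(conv3x3_depthwise(h, w_dw))         # positional mixing + GELU
    return np.einsum('dhw,dc->chw', h, w_out) + x  # second MLP + residual

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = mix_ffn(x, rng.standard_normal((4, 16)) * 0.1,
            rng.standard_normal((16, 3, 3)) * 0.1,
            rng.standard_normal((16, 4)) * 0.1)
print(y.shape)  # (4, 8, 8)
```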
The ESA differs from the Vision Transformer in that it introduces a reduction ratio R to compress the key and value matrices, significantly reducing the computational cost of multi-head attention. First, the query (Q), key (K), and value (V) are computed, a process that can be defined as follows:
Q = MLP_Q(X), K = MLP_K(X), V = MLP_V(X)
where MLP_Q, MLP_K, and MLP_V represent the multi-layer perceptrons used to compute Q, K, and V, respectively. Typically, Q, K, and V share the same dimensions (B × h × N × d), where B denotes the batch size, h represents the number of attention heads, N is the sequence length, and d is the embedding dimension per head. The introduction of R first reshapes K and V into (B × h × N/R × d·R), and then uses a multi-layer perceptron to transform K and V into the size (B × h × N/R × d). This process can be described as follows:
K' = MLP(Reshape(K)), V' = MLP(Reshape(V))
The output features of the ESA can be described as:
ESA(Q, K', V') = Softmax(Q K'ᵀ / √d) V'
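The sequence-reduction trick can be illustrated in NumPy as follows. The projection matrix stands in for the MLP and is a random placeholder; the point is that the attention matrix shrinks from N × N to N × N/R:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_self_attention(q, k, v, r, w_k, w_v):
    """Sketch of ESA-style reduction: K and V of shape (N, d) are reshaped
    to (N/r, d*r) and projected back to d columns, so the attention matrix
    has shape (N, N/r) instead of (N, N). w_k and w_v play the role of the
    MLPs; here they are placeholder weights."""
    n, d = k.shape
    k = k.reshape(n // r, d * r) @ w_k    # (N/r, d)
    v = v.reshape(n // r, d * r) @ w_v    # (N/r, d)
    attn = softmax(q @ k.T / np.sqrt(d))  # (N, N/r) attention weights
    return attn @ v                       # (N, d)

rng = np.random.default_rng(1)
n, d, r = 64, 8, 4
q, kk, vv = (rng.standard_normal((n, d)) for _ in range(3))
w = rng.standard_normal((d * r, d)) * 0.1
out = efficient_self_attention(q, kk, vv, r, w, w)
print(out.shape)  # (64, 8)
```

A single head is shown; the batch and head dimensions of the paper's (B × h × N × d) tensors would simply be broadcast over this computation.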
FDE: The structure of the FDE is illustrated in Fig 4. The primary distinction from the SDE lies in the use of the Fast Fourier Transform (FFT) to map the input features from the spatial domain to the frequency domain before feeding them into the Transformer module. In the frequency domain, the features are adjusted and then remapped back to the spatial domain via the Inverse Fast Fourier Transform (IFFT) [34]. This process effectively captures frequency information that is difficult to extract in the spatial domain.
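The FFT–adjust–IFFT round trip can be sketched as below. Note that the frequency-domain Transformer used in the FDE is replaced here by a simple high-frequency gain mask, purely to illustrate the mapping between domains; the gain and radius values are arbitrary:

```python
import numpy as np

def frequency_enhance(x, gain_high=1.5, radius=4):
    """Toy version of the FDE idea: FFT the feature map, amplify
    high-frequency coefficients (edges, fine textures), and map back
    to the spatial domain with the inverse FFT."""
    f = np.fft.fftshift(np.fft.fft2(x))           # centre the spectrum
    h, w = x.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    mask = np.where(dist > radius, gain_high, 1.0)  # boost high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

x = np.outer(np.sin(np.linspace(0, 8 * np.pi, 32)), np.ones(32))
y = frequency_enhance(x)
print(y.shape)  # (32, 32)
```

With a gain of 1.0 the round trip is the identity, which is a handy sanity check on the FFT/IFFT pairing.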
CAM: The structure of the CAM is illustrated in Fig 5. CAM integrates the outputs of the FDE and SDE through a cross-attention mechanism [35]. This process can be described as follows:
Fc = Softmax(Fs Ffᵀ / √d) Ff
where Softmax represents the softmax activation function, and d is the channel dimension of Ff.
CDE: The structure of the CDE is illustrated in Fig 6. The input features are first processed through max-pooling and average-pooling, followed by a multi-layer perceptron and softmax to compute the importance weights for each channel. These weights are then multiplied with the original input features to produce the output feature Fe [36]. This process can be described as follows:
Fe = Softmax(MLP(AvgPool(Fc)) + MLP(MaxPool(Fc))) ⊗ Fc
where AvgPool denotes the average-pooling operation, and MaxPool represents the max-pooling operation. Finally, the channel-enhanced feature Fe is processed through a 3 × 3 convolutional layer followed by a max-pooling operation, generating the output feature Fout of the SFCB.
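The pooling-and-reweighting idea can be sketched as follows; the MLP between pooling and softmax is omitted here, so this is a deliberately simplified variant of the CDE:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_enhance(x):
    """Sketch of channel-domain enhancement: per-channel average- and
    max-pooled descriptors are combined into softmax importance weights,
    which then rescale every channel of the input."""
    avg = x.mean(axis=(1, 2))   # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))     # (C,) max-pooled descriptor
    w = softmax(avg + mx)       # channel importance weights, sum to 1
    return w[:, None, None] * x

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 16, 16))  # (channels, height, width)
y = channel_enhance(x)
print(y.shape)  # (8, 16, 16)
```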
MBAB for enhancing boundary information
The primary objective of the MBAB is to enhance the model’s adaptability to the shape and scale variations of urban villages while strengthening the extraction of boundary details. The structure of this module is illustrated in Fig 7, and its core consists of two components: dense atrous spatial pyramid pooling (DenseASPP) [37] and Sobel operator convolution [38].
The DenseASPP comprises six parallel branches. One branch employs a residual connection, directly passing the input features without any processing, while the other five branches utilize atrous convolutions with dilation rates of 3, 6, 12, 18, and 24, respectively. This multi-scale design captures semantic information at different receptive fields, enabling adaptation to the complex shape and scale variations in urban village regions. Additionally, dense connections between the branches facilitate top-down multi-scale feature aggregation, ensuring that the output of each branch not only retains semantic information at the current scale but also integrates features from adjacent branches, significantly enhancing the overall feature representation capability. To further improve the extraction of boundary details, we introduce multi-directional Sobel operator convolution after each atrous convolution branch. Specifically, the input features are processed using eight directional Sobel convolution kernels, and the results from all directions are summed to produce the final feature. The eight directional Sobel kernels correspond to gradient directions of 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°; each kernel is obtained by rotating the border coefficients of the horizontal kernel [[−1, 0, 1], [−2, 0, 2], [−1, 0, 1]] in 45° steps. The Sobel operator enhances edges in the feature map by computing pixel gradient changes, effectively improving the model’s sensitivity to boundary information [39]. This multi-directional convolution design not only captures edge features at different angles but also strengthens the model’s perception of complex boundary structures.
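A sketch of the eight-direction kernel construction and the aggregated edge response is given below. Absolute responses are summed here because opposite-direction Sobel kernels are negatives of each other and their raw responses would cancel; how the paper aggregates signs is not spelled out, so this is an assumption:

```python
import numpy as np

def sobel_kernels():
    """Eight directional 3x3 Sobel kernels: the horizontal kernel's border
    coefficients are rotated in 45-degree steps around the (zero) centre."""
    base = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ring_idx = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    ring = np.array([base[i, j] for i, j in ring_idx])
    kernels = []
    for k in range(8):
        rolled = np.roll(ring, k)
        kern = np.zeros((3, 3))
        for (i, j), val in zip(ring_idx, rolled):
            kern[i, j] = val
        kernels.append(kern)
    return kernels

def edge_response(x, kernels):
    """Sum of absolute valid-mode 3x3 responses over all eight directions."""
    h, w = x.shape
    total = np.zeros((h - 2, w - 2))
    for kern in kernels:
        resp = np.zeros((h - 2, w - 2))
        for i in range(3):
            for j in range(3):
                resp += kern[i, j] * x[i:i + h - 2, j:j + w - 2]
        total += np.abs(resp)
    return total

ks = sobel_kernels()
img = np.zeros((6, 6))
img[:, 3:] = 1.0  # a vertical step edge
print(edge_response(img, ks).max())
```

Rotating the ring by four steps yields the negated base kernel (the 180° direction), which the construction makes easy to verify.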
DSB for network training
Deep Supervision is a training strategy for neural networks that not only utilizes the final output of urban village extraction to provide supervision signals but also incorporates intermediate outputs from different layers of the network to guide the training process. This strategy aims to address issues such as gradient vanishing and exploding in network training, while also improving the convergence speed and overall performance of the network [31,40].
We first upsample the four features from the encoder, concatenate them, and then process them through the MBAB module and a 1 × 1 convolution to obtain the final output result. This process can be described as:
Y = Conv1×1(MBAB(Concat(Up(F1), Up(F2), Up(F3), Up(F4))))
where Up denotes the upsampling operation and Concat represents feature concatenation. Next, we utilize F1, F2, F3, and F4 along with the MBAB and 1 × 1 convolution to generate urban village extraction results at different scales as intermediate outputs of MEBANet. This process can be described as:
Yi = Conv1×1(MBAB(Fi)), i = 1, 2, 3, 4
Finally, we denote the ground truth corresponding to Y as G, and apply max-pooling operations to G to obtain the ground truth Gi for each additional output result. This process can be described as:
Gi = MaxPool_i(G), i = 1, 2, 3, 4
where MaxPool_i downsamples G to the resolution of the corresponding intermediate output Yi.
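Building multi-scale ground truth by repeated 2 × 2 max pooling can be sketched as below (a minimal version for binary masks; max pooling keeps a pixel positive if any pixel in its 2 × 2 block is positive):

```python
import numpy as np

def maxpool_labels(gt, times):
    """Downsample a binary ground-truth mask with repeated 2x2 max pooling,
    producing a supervision target for a lower-resolution output."""
    for _ in range(times):
        h, w = gt.shape
        gt = gt.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return gt

gt = np.zeros((8, 8), dtype=int)
gt[2:4, 2:4] = 1          # a small urban-village region in the full mask
print(maxpool_labels(gt, 2))
```

After two pooling steps the 8 × 8 mask becomes 2 × 2, with the positive region preserved rather than averaged away.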
In this study, the loss for each component is calculated using the cross-entropy loss function [41], which can be defined as:
Lce = −(1/N) Σn [yn log(pn) + (1 − yn) log(1 − pn)]
where yn represents the ground truth label, pn denotes the predicted probability from the model, and N is the total number of pixels. Therefore, the overall loss function for this study can be defined as:
L = Lce(Y, G) + λ Σi Lce(Yi, Gi)
where λ is a balancing coefficient used to adjust the weight between the supervision of the final result and the intermediate results.
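Putting the pieces together, a minimal sketch of the deep-supervision objective follows (binary cross-entropy over pixels, with λ = 0.25 as the default, matching the value reported in the Implementation details):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over all N pixels."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(final_pred, final_gt, aux_preds, aux_gts, lam=0.25):
    """Deep-supervision objective: main loss plus a lambda-weighted sum of
    the auxiliary (intermediate-output) losses."""
    aux = sum(bce(p, g) for p, g in zip(aux_preds, aux_gts))
    return bce(final_pred, final_gt) + lam * aux

y = np.array([0.0, 1.0, 1.0, 0.0])   # ground-truth pixel labels
p = np.array([0.1, 0.9, 0.8, 0.2])   # predicted probabilities
print(total_loss(p, y, [p], [y]))
```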
Experiments and results
Dataset and interpretation signs
Dataset: In this study, we conducted experimental validation using three datasets collected from Beijing, Xi’an, and Shenzhen. The remote sensing images include data from Beijing in 2016, Xi’an in 2018, and Shenzhen in 2020, with corresponding labels generated by professionals through visual interpretation and cross-validation methods. To fully leverage the dataset and evaluate the model’s generalization ability, we combined the training, testing, and validation sets and employed a 5-fold cross-validation approach to assess model accuracy [42]. During the experiments, the original images and their corresponding labels were cropped into non-overlapping 512 × 512-pixel patches to facilitate model training and testing. To enhance the diversity of the training data, we applied a series of image preprocessing methods for data augmentation, including: 1) Image Rotation: Rotating the images and labels by 90°, 180°, and 270°; 2) Image Flipping: Horizontal flipping and vertical flipping; 3) Noise Addition: Gaussian noise, salt-and-pepper noise; 4) Image Blurring: Gaussian blur, mean blur, median blur [31].
Interpretation signs: Due to the disorderly construction of urban villages, they exhibit distinct features in images, as shown in Fig 8. These features include: 1) Irregular Shapes: The lack of planning in the construction of urban villages results in irregular shapes, distinguishing them from commercial or well-planned residential areas. This leads to complex geometric forms in remote sensing imagery. 2) Internal Objects: In urban villages, buildings are densely packed with narrow gaps between them, resulting in large, continuous, and compact rooftop-covered areas in the imagery. The individual buildings are typically small in scale, and the internal roads are often very narrow, winding, and lack proper planning. Additionally, public green spaces and centralized green areas are extremely scarce. 3) Rough Textures: The texture features are frequently changing and chaotic, with high spatial complexity, exhibiting strong contrast with the surrounding environment.
Implementation details
The experiments in this study were conducted on a 64-bit Windows 10 operating system with an Intel(R) Core (TM) i9-11900K @ 3.50 GHz 16-core processor and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The programming language used was Python 3.8, with PyCharm as the development environment and PyTorch 2.0.0 as the deep learning framework.
During the training phase, manual tuning was employed to identify the optimal batch size, optimizer, learning rate, and λ. The final batch size was set to 16, with AdamW selected as the optimizer. The learning rate was set to 1e-3, and λ was set to 0.25. A cosine annealing strategy was applied to dynamically adjust the learning rate throughout training. The cross-entropy loss function was used for model optimization. To ensure that the loss curve stabilized, the maximum number of training epochs was set to 100.
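The cosine annealing rule is the standard one; a sketch with the hyperparameters reported above (initial learning rate 1e-3, 100 epochs, and an assumed minimum of zero, which the paper does not state):

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """Standard cosine annealing schedule: the learning rate decays from
    lr_max at epoch 0 to lr_min at total_epochs along a half cosine."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

for e in (0, 50, 100):
    print(e, cosine_lr(e, 100))
```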
To evaluate the model’s performance, we selected four commonly used metrics: Precision, Recall, F1-score, and IoU [43,44]. Additionally, we evaluated the model’s complexity by measuring floating point operations (FLOPs), parameters, and the inference time required for processing images of size 3 × 512 × 512.
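The four metrics follow their standard definitions over the binary confusion counts; a minimal sketch (not tied to the paper's implementation):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Precision, Recall, F1-score, and IoU for a pair of binary masks."""
    tp = np.sum((pred == 1) & (gt == 1))  # true positives
    fp = np.sum((pred == 1) & (gt == 0))  # false positives
    fn = np.sum((pred == 0) & (gt == 1))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
print(segmentation_metrics(pred, gt))  # precision, recall, F1, IoU
```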
To assess the performance of the proposed method, we compared it with two publicly available urban village extraction methods, namely UisNet [13] and UV-SAM [22], as well as three classic land cover classification methods: ABCNet [45], CMTFNet [46], and UNetFormer [43]. To ensure the stability and reliability of the results, all experiments in this study were repeated five times and the results averaged.
Experimental results
Qualitative analysis: Figs 9–11 present the extraction results of the six methods on the Beijing, Xi’an, and Shenzhen datasets, respectively. From the figures, it is evident that urban villages exhibit significant heterogeneity in shape and scale, posing considerable challenges for accurate identification. Among the compared methods, except for MEBANet, the other methods generally suffer from issues such as false positives, false negatives, irregular boundaries, and internal holes. In contrast, MEBANet demonstrates superior performance in its extraction results. It not only captures the complete contours of urban villages with clear and continuous boundaries but also achieves the highest alignment with the ground truth labels, with almost no noticeable false positives or false negatives. Furthermore, MEBANet exhibits strong robustness in handling the complex internal structures and scale variations of urban villages, effectively reducing internal voids. These results further validate the superiority of MEBANet in urban village extraction tasks.
The figure was produced by the authors of this study using the publicly available dataset from [22].
The figure was produced by the authors of this study using the publicly available dataset from [22].
The figure was produced by the authors of this study using the publicly available dataset from [13].
Quantitative Evaluation: Table 1 presents the quantitative evaluation results of the six methods on the Beijing dataset. It can be observed that the proposed MEBANet achieves the highest performance across all evaluation metrics, with a Precision of 89.28%, Recall of 88.47%, F1-score of 88.87%, and IoU of 79.97%. Among the remaining methods, UisNet ranks second, achieving 86.74% in Precision, 87.96% in Recall, 87.35% in F1-score, and 77.53% in IoU. In contrast, UNetFormer performs the worst in three of the four metrics, while CMTFNet records the lowest score in the remaining metric at 85.59%. Table 2 summarizes the quantitative evaluation results on the Xi’an dataset. MEBANet again demonstrates superior performance, with a Precision of 92.78%, Recall of 89.89%, F1-score of 91.31%, and IoU of 84.01%. UisNet ranks second, achieving 90.63% in Precision, 89.39% in Recall, 90.01% in F1-score, and 81.83% in IoU. ABCNet performs the worst across all metrics, with scores of 89.03%, 86.83%, 87.92%, and 78.44%, respectively. Table 3 reports the evaluation results on the Shenzhen dataset. MEBANet consistently achieves the best performance, with a Precision of 88.44%, Recall of 87.29%, F1-score of 87.86%, and IoU of 78.35%. CMTFNet ranks second in both F1-score and IoU, with values of 87.08% and 76.23%, respectively, while UisNet achieves the second-best Precision and Recall at 85.53% and 86.09%. ABCNet shows the lowest scores in three of the four metrics, with values of 84.08%, 84.71%, and 73.48%, respectively, while UNetFormer records the lowest value in the remaining metric at 85.18%.
The above results represent the average performance over five repeated experiments. Figs 12–14 present the mean and standard deviation (SD) of the accuracy for different methods on the Beijing, Xi’an, and Shenzhen datasets, respectively. It can be observed that MEBANet achieves the highest mean accuracy across all datasets. Although it does not obtain the lowest SD on the Beijing dataset, where it ranks slightly behind UisNet and ABCNet, it still achieves the best overall accuracy. On the Xi’an and Shenzhen datasets, MEBANet demonstrates both the highest accuracy and the most stable performance, highlighting its robustness and strong generalization capability. Combining the quantitative evaluation results and qualitative analysis, it is evident that MEBANet demonstrates the best performance in urban village extraction tasks. Across all three test datasets, MEBANet significantly outperforms the other compared methods.
Discussion
Ablation experiment
To validate the effectiveness of the constructed blocks, we conducted quantitative evaluation experiments on the Beijing dataset and Xi’an dataset, focusing on the performance contributions of three sub-modules: SFCB, MBAB, and DSB. Among these, MBAB and DSB are plug-and-play modules, while SFCB, as the core feature extraction structure, is an indispensable component of the model. Since traditional methods primarily focus on spatial domain feature extraction and often overlook the importance of frequency and channel domains, we specifically analyzed the contributions of FDE and CDE in detail. The ablation study results for MBAB and DSB are shown in Table 4, and the ablation results for the SFCB sub-module are presented in Table 5.
From Table 4, it can be observed that on the Beijing dataset, using the MBAB alone increases the by 0.77 percentage points (from 87.38% to 88.15%) and the
by 1.23 percentage points (from 77.58% to 78.81%). Using the DSB alone increases the
by 0.55 percentage points (from 87.38% to 87.93%) and the
by 0.88 percentage points (from 77.58% to 78.46%). When both MBAB and DSB are used together, the
increases by 1.49 percentage points (from 87.38% to 88.87%), and the
increases by 2.39 percentage points (from 77.58% to 79.97%).
On the Xi’an dataset, using the MBAB alone increases the F1-score by 1.45 percentage points (from 88.59% to 90.04%) and the IoU by 2.37 percentage points (from 79.52% to 81.89%). Using the DSB alone increases the F1-score by 0.64 percentage points (from 88.59% to 89.23%) and the IoU by 1.03 percentage points (from 79.52% to 80.55%). When both MBAB and DSB are used together, the F1-score increases by 2.72 percentage points (from 88.59% to 91.31%), and the IoU increases by 4.49 percentage points (from 79.52% to 84.01%). These results demonstrate that both the MBAB and DSB play significant roles in improving model performance.
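As a side note, the F1-score and IoU figures reported in the ablation tables can be cross-checked against each other: for binary segmentation, the F1 (Dice) score and IoU (Jaccard) score are deterministically related by F1 = 2·IoU / (1 + IoU). A minimal sketch of this consistency check:

```python
def f1_from_iou(iou: float) -> float:
    """Dice/F1 score implied by a Jaccard/IoU score (binary segmentation)."""
    return 2 * iou / (1 + iou)

def iou_from_f1(f1: float) -> float:
    """Inverse mapping: IoU implied by an F1 score."""
    return f1 / (2 - f1)

# Example: the Table 4 baseline reports IoU = 77.58% and F1 = 87.38%,
# which agree to rounding precision.
assert abs(f1_from_iou(0.7758) - 0.8738) < 1e-3
```

Applying the same check to every (F1, IoU) pair in Tables 4 and 5 confirms the reported numbers are internally consistent.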
Fig 15 illustrates the accuracy variations during the training process with and without DSB. The results demonstrate that while DSB introduces instability during training, it significantly accelerates the convergence of MEBANet. Moreover, the incorporation of DSB further improves the accuracy.
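The way DSB injects multi-level supervision signals can be sketched as a weighted sum of the main loss and auxiliary losses computed at intermediate decoder levels. The uniform auxiliary weight below is an illustrative assumption, not the paper's exact formulation:

```python
from typing import Sequence

def deep_supervision_loss(level_losses: Sequence[float],
                          aux_weight: float = 0.25) -> float:
    """Combine a main (full-resolution) loss with auxiliary losses from
    intermediate levels.

    level_losses[0] is the main loss; the rest are auxiliary losses.
    A single uniform aux_weight is an illustrative simplification --
    deep-supervision schemes often use per-level or decaying weights.
    """
    main, *aux = level_losses
    return main + aux_weight * sum(aux)
```

During training, each auxiliary head gets its own gradient signal, which is what shortens the effective gradient path and accelerates convergence, at the cost of the extra loss terms perturbing early training.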
From the first three rows of Table 5, it can be observed that the accuracy is relatively low when each module is used independently. However, from the fourth and fifth rows, it becomes evident that the combined use of the modules enhances feature extraction capabilities, leading to improved accuracy in urban village extraction. Specifically, on the Beijing dataset, as indicated by the comparison between the first and fourth rows, the use of FDE results in an increase of 1.25 percentage points in F1-score (from 87.24% to 88.49%) and 1.99 percentage points in IoU (from 77.36% to 79.35%). As shown in the comparison between the second and fourth rows, the use of SDE increases the F1-score by 1.42 percentage points (from 87.07% to 88.49%) and the IoU by 2.26 percentage points (from 77.09% to 79.35%). Further, the inclusion of CDE, as seen in the comparison between the fourth and fifth rows, results in a further improvement of 0.38 percentage points in F1-score (from 88.49% to 88.87%) and 0.62 percentage points in IoU (from 79.35% to 79.97%).
On the Xi’an dataset, the use of FDE, as shown in the comparison between the first and fourth rows, increases the F1-score by 1.71 percentage points (from 88.78% to 90.49%) and the IoU by 2.81 percentage points (from 79.82% to 82.63%). The comparison between the second and fourth rows indicates that SDE results in an increase of 2.21 percentage points in F1-score (from 88.28% to 90.49%) and 3.61 percentage points in IoU (from 79.02% to 82.63%). Finally, as shown in the comparison between the fourth and fifth rows, the incorporation of CDE leads to a further increase of 0.82 percentage points in F1-score (from 90.49% to 91.31%) and 1.38 percentage points in IoU (from 82.63% to 84.01%).
Backbone effectiveness analysis
To validate the effectiveness of the proposed feature extractor, SFCB, we conducted comparative experiments on the Beijing and Xi’an datasets, comparing SFCB with several popular feature extractors: Transformer-based models such as Vision Transformer [33], Swin Transformer [47], and Mix-Transformer [32], as well as CNN-based models such as DenseNet [48], ResNet50 [49], and Xception [50]. The experimental results are shown in Table 6. Specifically, on the Beijing dataset, among the Transformer-based feature extractors, Vision Transformer performs the worst, but its F1-score (87.55%) and IoU (77.86%) are still higher than those of the best-performing CNN-based feature extractor, ResNet50 (F1-score: 86.98%, IoU: 76.96%), by 0.57 and 0.90 percentage points, respectively. On the Xi’an dataset, among the Transformer-based feature extractors, Mix-Transformer performs the worst, but its F1-score (88.77%) and IoU (79.81%) are still higher than those of the best-performing CNN-based feature extractor, DenseNet (F1-score: 88.36%, IoU: 79.15%), by 0.41 and 0.66 percentage points, respectively.
Overall, SFCB significantly outperforms the other compared models on both datasets, validating its superiority in feature extraction. At the same time, Transformer-based feature extractors generally outperform CNN-based feature extractors, likely because Transformers are better at capturing global contextual information, thereby demonstrating stronger feature extraction capabilities in complex scenarios.
To further investigate the impact of the number of SFCBs on the urban village extraction task, we compared model performance when using 3, 4, and 5 SFCBs. The experimental results are shown in Table 7, which demonstrates that the model achieves optimal performance with 4 SFCBs on both datasets. Specifically, on the Beijing dataset, using 4 SFCBs results in an F1-score of 88.87% and an IoU of 79.97%, representing improvements of 1.32 and 2.11 percentage points, respectively, compared to using 3 SFCBs, and improvements of 0.86 and 1.38 percentage points, respectively, compared to using 5 SFCBs. On the Xi’an dataset, using 4 SFCBs results in an F1-score of 91.31% and an IoU of 84.01%, representing improvements of 1.68 and 2.79 percentage points, respectively, compared to using 3 SFCBs, and improvements of 1.00 and 1.67 percentage points, respectively, compared to using 5 SFCBs.
Model transferability analysis
To investigate the robustness and transferability of MEBANet, we trained the model on the Beijing dataset and validated it on the Xi’an dataset. The experimental results are shown in Table 8. Due to differences in architectural styles between the two cities, the accuracy of all models decreased compared to the results in Table 1. Nevertheless, MEBANet still demonstrated the best performance, achieving precision, recall, F1-score, and IoU of 77.74%, 73.82%, 75.73%, and 60.94%, respectively. The second-best performance was achieved by UisNet, with precision, recall, F1-score, and IoU of 75.55%, 70.71%, 73.05%, and 57.54%, respectively, which are 2.19, 3.11, 2.68, and 3.40 percentage points lower than those of MEBANet. The worst performance was observed with ABCNet, whose precision, F1-score, and IoU of 69.05%, 69.81%, and 53.63%, respectively, represent decreases of 8.69, 5.92, and 7.31 percentage points compared to MEBANet. Additionally, UV-SAM had the lowest recall, at 67.29%, which is 6.53 percentage points lower than that of MEBANet.
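The four transfer metrics are mutually consistent: F1 is the harmonic mean of precision and recall, and, in the binary case, IoU follows directly from F1. A quick check in code:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def iou_from_f1(f1: float) -> float:
    """Jaccard index implied by a Dice/F1 score (binary segmentation)."""
    return f1 / (2 - f1)

# MEBANet's transfer results from Table 8: P = 77.74%, R = 73.82%
f1 = f1_score(0.7774, 0.7382)   # matches the reported 75.73%
iou = iou_from_f1(f1)           # matches the reported 60.94%
```

The same identities reproduce UisNet's 73.05% F1 and 57.54% IoU from its precision and recall, so the table's columns check out.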
Fig 16 shows the visual results, and a comparison with Fig 9 reveals that training and validating on different datasets lead to more false positives, false negatives, irregular boundaries, and internal holes across all methods, including the proposed MEBANet. Nevertheless, among all the methods, MEBANet performs the best, particularly in the second row.
The figure was produced by the authors of this study using the publicly available dataset from [22].
Parameter fine-tuning
The selection of appropriate model parameters is essential for achieving optimal performance. In this study, manual parameter tuning was employed to determine the optimal values for the batch size, optimizer, learning rate, and a model structure parameter. The tuning process was carried out in two stages: first, the parameters directly related to model training, namely the batch size, optimizer, and learning rate, were determined; then, the model structure parameter was adjusted. Initially, a learning rate of 0.005, the Adam optimizer, and a structure parameter value of 0.25 were selected. The results, shown in Fig 17(a), indicate that the optimal batch size was 16, which is also the maximum feasible batch size under our hardware constraints. Subsequently, with the batch size fixed at 16 and the learning rate and structure parameter held constant, the optimizer was varied. The results, shown in Fig 17(b), demonstrate that AdamW was the most effective optimizer. Following this, with the optimizer fixed, the learning rate was adjusted, and the best performance was achieved at a learning rate of 0.001, as shown in Fig 17(c). Finally, with the first three parameters fixed, the structure parameter was fine-tuned, and the results presented in Fig 17(d) reveal that a value of 0.25 yielded the optimal performance.
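The one-parameter-at-a-time procedure in Fig 17 amounts to a greedy coordinate search. The sketch below captures that loop; `evaluate` is a hypothetical stand-in for training the model under a given configuration and returning a validation score.

```python
def tune(evaluate, search_space, initial):
    """Greedy coordinate search: holding all other parameters fixed,
    sweep one parameter at a time in the listed order and keep the
    best value before moving on. `evaluate` is a hypothetical callback
    returning a validation score (higher is better)."""
    config = dict(initial)
    for name, candidates in search_space.items():
        best_val, best_score = config[name], float("-inf")
        for value in candidates:
            trial = {**config, name: value}
            score = evaluate(trial)
            if score > best_score:
                best_val, best_score = value, score
        config[name] = best_val
    return config
```

With the search space ordered as batch size, optimizer, learning rate, then the structure parameter, this mirrors the two-stage procedure above. Note that greedy coordinate search only finds the joint optimum when the parameters interact weakly, which is the implicit assumption behind tuning them sequentially.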
Model complexities
Table 9 presents the computational complexity and efficiency of the different models. MEBANet sits at a moderate level in terms of model efficiency. It has 26.71M parameters, slightly more than the lightweight models ABCNet and UNetFormer, far fewer than UV-SAM, and a similar number to UisNet and CMTFNet. In terms of computational complexity, MEBANet is significantly lower than UisNet and UV-SAM, while remaining comparable to CMTFNet. Notably, MEBANet’s inference time is 20.54 ms, only slightly slower than ABCNet and UNetFormer, and a significant advantage over UisNet, CMTFNet, and UV-SAM.
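To put the 26.71M parameter figure in context, layer-wise parameter counts follow standard formulas: a 2-D convolution contributes k·k·C_in·C_out weights (plus C_out biases), and a fully connected layer contributes D_in·D_out (plus D_out). A small helper, for illustration only:

```python
def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameter count of a standard 2-D convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)

# A single 3x3 convolution from 64 to 128 channels already costs
# ~74k parameters, so a few dozen such layers plus attention blocks
# readily reach the tens-of-millions range reported in Table 9.
print(conv2d_params(64, 128, 3))  # 73856
```

Parameter count is only one axis of Table 9; FLOPs and inference time depend additionally on feature-map resolution and memory access patterns, which is why MEBANet can have mid-range parameters yet competitive latency.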
Limitations and future work
Although the proposed MEBANet achieves high accuracy in the urban village extraction task and its effectiveness has been validated through experiments, several areas remain worthy of further investigation:
- 1). Dataset expansion: This study was tested only on datasets from three cities in China. While the results are promising, the scope of the research is currently limited. To enhance the generalizability of the model, future work will involve testing it on datasets from additional regions, including cities with different architectural styles and urban features, such as those from Africa and South America. This will help validate the model’s adaptability to diverse geographical and cultural contexts.
- 2). Utilization of multi-modal data: This study relies solely on optical remote sensing images for urban village extraction. Considering that multi-modal data provide additional perspectives, future research will explore the integration of multi-modal data, such as building height data and SAR imagery, to further enhance the model’s recognition accuracy and robustness.
- 3). Transfer learning issue: Although MEBANet has been tested with transfer learning and shows better performance than other methods, the issue of significant accuracy degradation on unseen data remains unresolved. In future research, we will address this limitation by exploring domain adaptation and self-supervised learning techniques to enhance the model’s robustness and ensure consistent performance across diverse datasets.
Conclusions
This study addresses the complexity and challenges of urban village extraction by proposing an innovative method named MEBANet. The proposed MEBANet significantly enhances the accuracy and robustness of urban village extraction through the synergistic integration of three core modules: the SFCB, the MBAB, and the DSB. Experimental results demonstrate that MEBANet consistently outperforms existing methods on the Beijing, Xi’an and Shenzhen datasets in terms of overall performance. Cross-dataset transfer experiments further validate its strong generalization capability. Ablation studies confirm the individual contributions and effectiveness of each module. Future research directions include testing the accuracy of urban village extraction on different regions, incorporating multi-modal data into the task to improve accuracy, and addressing the challenge of generalization to unseen data.
References
- 1. Matarira D, Mutanga O, Naidu M. Google earth engine for informal settlement mapping: a random forest classification using spectral and textural information. Remote Sens. 2022;14(20):5130.
- 2. Verma D, Jana A, Ramamritham K. Transfer learning approach to map urban slums using high and medium resolution satellite imagery. Habitat Intern. 2019;88:101981.
- 3. Ansari RA, Malhotra R, Buddhiraju KM. Identifying informal settlements using contourlet assisted deep learning. Sensors (Basel). 2020;20(9):2733. pmid:32403308
- 4. Pan Z, Xu J, Guo Y, Hu Y, Wang G. Deep learning segmentation and classification for urban village using a worldview satellite image based on U-Net. Remote Sens. 2020;12(10):1574.
- 5. Duque J, Patino J, Betancourt A. Exploring the potential of machine learning for automatic slum identification from VHR imagery. Remote Sens. 2017;9(9):895.
- 6. Gevaert CM, Persello C, Sliuzas R, Vosselman G. Informal settlement classification using point-cloud and image-based features from UAV data. ISPRS J Photogram Remote Sens. 2017;125:225–36.
- 7. Wei C, Wei H, Crivellari A, Liu T, Wan Y, Chen W, et al. Gaofen-2 satellite image-based characterization of urban villages using multiple convolutional neural networks. Inter J Remote Sens. 2023;44(24):7808–26.
- 8. Zhang C, Xing J, Li J, Du S, Qin Q. A new method for the extraction of tailing ponds from very high-resolution remotely sensed images: PSVED. Inter J Digital Earth. 2023;16(1):2681–703.
- 9. Lu W, Shi X, Lu Z. A new two-step road extraction method in high resolution remote sensing images. PLoS One. 2024;19(7):e0305933. pmid:39024329
- 10. Hao M, Dong X, Jiang D, Yu X, Ding F, Zhuo J. Land-use classification based on high-resolution remote sensing imagery and deep learning models. PLoS One. 2024;19(4):e0300473. pmid:38635663
- 11. Xi Y, Ren C, Tian Q, Ren Y, Dong X, Zhang Z. Exploitation of time series sentinel-2 data and different machine learning algorithms for detailed tree species classification. IEEE J Sel Top Appl Earth Observ Remote Sens. 2021;14:7589–603.
- 12. Park H, Fan P, John R, Ouyang Z, Chen J. Spatiotemporal changes of informal settlements: ger districts in Ulaanbaatar, Mongolia. Landsc Urban Plan. 2019;191:103630.
- 13. Fan R, Li F, Han W, Yan J, Li J, Wang L. Fine-scale urban informal settlements mapping by fusing remote sensing images and building data via a transformer-based multimodal fusion network. IEEE Trans Geosci Remote Sens. 2022;60:1–16.
- 14. Chang F, Ma T, Wang D, Zhu S, Li D, Feng S, et al. Method for building segmentation and extraction from high-resolution remote sensing images based on improved YOLOv5ds. PLoS One. 2025;20(3):e0317106. pmid:40100935
- 15. Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(4):640–51. pmid:27244717
- 16. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: 18th International conference on medical image computing and computer-assisted intervention (MICCAI), Munich, Germany. Cham: Springer; 2015.
- 17. Chen LC, Zhu YK, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV); 2018.
- 18. Wang Z, Wang B, Zhang C, Liu Y. Defense against adversarial patch attacks for aerial image semantic segmentation by robust feature extraction. Remote Sens. 2023;15(6):1690.
- 19. Zhang S, Chen T, Su F, Xu H, Li Y, Liu Y. Deep layered network based on rotation operation and residual transform for building segmentation from remote sensing images. Sensors (Basel). 2025;25(8):2608. pmid:40285301
- 20. Persello C, Stein A. deep fully convolutional networks for the detection of informal settlements in VHR images. IEEE Geosci Remote Sens Lett. 2017;14(12):2325–9.
- 21. Du S, Xing J, Wang S, Wei L, Zhang Y. STMNet: scene classification-assisted and texture feature-enhanced multiscale network for large-scale urban informal settlement extraction from remote sensing images. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2024;17:13169–87.
- 22. Zhang X, Liu Y, Lin Y, Liao Q, Li Y. UV-SAM: adapting segment anything model for urban village identification. AAAI. 2024;38(20):22520–8.
- 23. Li L, Chen B, Zou X, Xing J, Tao P. UV-mamba: A DCN-enhanced state space model for urban village boundary identification in high-resolution remote sensing images. In: ICASSP 2025 - 2025 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE; 2025. 1–5.
- 24. Zhang J, Shao M, Wan Y, Meng L, Cao X, Wang S. Boundary-aware spatial and frequency dual-domain transformer for remote sensing urban images segmentation. IEEE Trans Geosci Remote Sens. 2024;62:1–18.
- 25. Gao F, Fu M, Cao J, Dong J, Du Q. Adaptive frequency enhancement network for remote sensing image semantic segmentation. IEEE Trans Geosci Remote Sens. 2025;63:1–15.
- 26. Yang Y, Yuan G, Li J. SFFNet: a wavelet-based spatial and frequency domain fusion network for remote sensing segmentation. IEEE Trans Geosci Remote Sens. 2024;62:1–17.
- 27. Liu H, Wang C, Zhao J, Chen S, Kong H. Adaptive fourier convolution network for road segmentation in remote sensing images. IEEE Trans Geosci Remote Sens. 2024;62:1–14.
- 28. Zhang H, Xie G, Li L, Xie X, Ren J. Frequency-domain guided swin transformer and global–local feature integration for remote sensing images semantic segmentation. IEEE Trans Geosci Remote Sensing. 2025;63:1–11.
- 29. Zhang F, Xia X. Efficient semantic segmentation of remote sensing images through global-local feature integration. IEEE Access. 2025;13:115653–68.
- 30. Yan KY, Shen F, Li ZY. Enhancing landslide segmentation with guide attention mechanism and fast fourier transformer. In: 20th International conference on intelligent computing (ICIC), Tianjin, China. Springer; 2024.
- 31. Zhang C, Yue P, Tapete D, Jiang L, Shangguan B, Huang L, et al. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J Photogram Remote Sens. 2020;166:183–200.
- 32. Xie EZ, Wang WH, Yu ZD, Anandkumar A, Alvarez JM, Luo P. SegFormer: simple and efficient design for semantic segmentation with transformers. In: 35th Conference on neural information processing systems (NeurIPS); 2021.
- 33. Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint. 2020. https://arxiv.org/abs/2010.11929
- 34. Zhang T, Dick RP. Spatial-frequency network for segmentation of remote sensing images. In: 30th IEEE international conference on image processing (ICIP), Kuala Lumpur, Malaysia; 2023.
- 35. Ren B, Liu B, Hou B, Wang Z, Yang C, Jiao L. SwinTFNet: dual-stream transformer with cross attention fusion for land cover classification. IEEE Geosci Remote Sensing Lett. 2024;21:1–5.
- 36. Woo SH, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. In: 15th European conference on computer vision (ECCV), Munich, Germany; 2018.
- 37. Yang MK, Yu K, Zhang C, Li ZW, Yang KY. DenseASPP for semantic segmentation in street scenes. In: 31st IEEE/CVF conference on computer vision and pattern recognition (CVPR), Salt Lake City, UT; 2018.
- 38. Zhao D, Ni L, Zhou K, Lv Z, Qu G, Gao Y, et al. A study of the improved A* algorithm incorporating road factors for path planning in off-road emergency rescue scenarios. Sensors (Basel). 2024;24(17):5643. pmid:39275555
- 39. Jing Y, Zhang T, Liu Z, Hou Y, Sun C. Swin-ResUNet+: an edge enhancement module for road extraction from remote sensing images. Comp Vision Image Understand. 2023;237:103807.
- 40. Liu T, Li J, Cao W, Tang M, Yang G. MLCNet: multitask level-specific constraint network for building change detection. IEEE J Sel Top Appl Earth Observ Remote Sens. 2024;17:11823–38.
- 41. Sun P, Lu Y, Zhai J. Mapping land cover using a developed U-Net model with weighted cross entropy. Geocarto Inter. 2021;37(25):9355–68.
- 42. Xiao Y, Zhao Z, Huang J, Huang R, Weng W, Liang G, et al. The illusion of success: test set disproportion causes inflated accuracy in remote sensing mapping research. Inter J Appl Earth Observ Geoinform. 2024;135:104256.
- 43. Wang L, Li R, Zhang C, Fang S, Duan C, Meng X, et al. UNetFormer: a UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J Photogram Remote Sens. 2022;190:196–214.
- 44. Gao M, Dong W, Chen L, Wu Z. Automatic extraction of water body from SAR images considering enhanced feature fusion and noise suppression. Appl Sci. 2025;15(5):2366.
- 45. Li R, Zheng S, Zhang C, Duan C, Wang L, Atkinson PM. ABCNet: attentive bilateral contextual network for efficient semantic segmentation of fine-resolution remotely sensed imagery. ISPRS J Photogram Remote Sens. 2021;181:84–98.
- 46. Wu H, Huang P, Zhang M, Tang W, Yu X. CMTFNet: CNN and multiscale transformer fusion network for remote-sensing image semantic segmentation. IEEE Trans Geosci Remote Sens. 2023;61:1–12.
- 47. Liu Z, Lin YT, Cao Y, Hu H, Wei YX, Zhang Z. Swin Transformer: hierarchical vision transformer using shifted windows. In: 18th IEEE/CVF International conference on computer vision (ICCV); 2021.
- 48. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: 30th IEEE/CVF Conference on computer vision and pattern recognition (CVPR), Honolulu, HI; 2017.
- 49. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), 2016. 770–8.
- 50. Chollet F. Xception: deep learning with depthwise separable convolutions. In: 30th IEEE/CVF Conference on computer vision and pattern recognition (CVPR), Honolulu, HI; 2017.