Abstract
Deeplabv3+ is currently one of the most representative semantic segmentation models. However, Deeplabv3+ tends to ignore targets of small size and usually fails to identify precise segmentation boundaries in the UAV remote sensing image segmentation task. To handle these problems, this paper proposes a semantic segmentation algorithm of UAV remote sensing images based on edge feature fusing and multi-level upsampling integrated with Deeplabv3+ (EMNet). EMNet uses MobileNetV2 as its backbone and adds an edge detection branch in the encoder to provide edge information for semantic segmentation. In the decoder, a multi-level upsampling method is designed to retain high-level semantic information (e.g., the target's location and boundary information). The experimental results show that the mIoU and mPA of EMNet improved over Deeplabv3+ by 7.11% and 6.93% on the UAVid dataset, and by 0.52% and 0.22% on the ISPRS Vaihingen dataset.
Citation: Li X, Li Y, Ai J, Shu Z, Xia J, Xia Y (2023) Semantic segmentation of UAV remote sensing images based on edge feature fusing and multi-level upsampling integrated with Deeplabv3+. PLoS ONE 18(1): e0279097. https://doi.org/10.1371/journal.pone.0279097
Editor: Daniel Capella Zanotta, Universidade do Vale do Rio dos Sinos, BRAZIL
Received: August 16, 2022; Accepted: November 30, 2022; Published: January 20, 2023
Copyright: © 2023 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: UAVid dataset: Data available from the following link: https://uavid.nl/. Other conditions of data usage include that: You must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license. ISPRS Vaihingen dataset: Data are available from the Working Group (WG) III/4 of ISPRS from Vaihingen area of Germany. The link is https://www.isprs.org/education/benchmarks/UrbanSemLab/default.aspx. Other conditions of data usage include that: 1) The data must not be used for other than research purposes. Any other use is prohibited. 2) The data must not be distributed to third parties. Any person interested in the data may obtain them via ISPRS WG III/4. 3) Any scientific papers whose results are based on the Vaihingen test data must cite [17] and must contain the following acknowledgement: “The Vaihingen data set was provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) [Cramer, 2010]: http://www.ifp.uni-stuttgart.de/dgpf/DKEP-Allg.html.”
Funding: This research was supported by the National Natural Science Foundation of China (Nos. 42261078, 42174055, 41962018) and the East China University of Technology Foundation (No. DHYC-202217). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Nowadays, UAV low-altitude remote sensing has become an essential technical tool for rapid national natural resources investigation [1], emergency mapping [2], and disaster monitoring [3]. However, its high spatial resolution brings problems such as complicated feature categories, large variations in target scale, rich texture details, and intricate contour boundaries, all of which pose great challenges to image segmentation [4]. Therefore, it is crucial to develop algorithms that can achieve high-precision intelligent segmentation of UAV low-altitude remote sensing images.
In the last decade, image segmentation based on deep learning (DL) has achieved promising application results. Convolutional neural networks (CNNs) [5] are the most commonly used DL models in image segmentation. Fully convolutional network (FCN) [6] achieves high segmentation accuracy on standard datasets (PASCAL VOC) by replacing the fully connected layer of CNN with a fully convolutional layer, allowing images of arbitrary size as inputs [7], and demonstrates the powerful performance of deep convolutional neural networks in semantic segmentation.
The Deeplab semantic segmentation network was improved from FCN and has been developed into Deeplabv3+ [8], which combines the advantages of the encoder-decoder structure and the atrous spatial pyramid pooling (ASPP) [9] module and has recently shown excellent comprehensive performance in semantic segmentation. Wang et al. [10] investigated the application of Deeplabv3+ in remote sensing of forest fires and achieved satisfactory segmentation performance and running speed; Zhang et al. [11] performed urban land use classification based on Deeplabv3+ and optimized the classification results using a fully connected conditional random field (CRF); Wang et al. [12] integrated a class feature attention mechanism into Deeplabv3+ and improved the segmentation accuracy, but their model still cannot accurately segment small targets and has numerous parameters. The above studies show that Deeplabv3+ performs quite well in semantic segmentation of remote sensing images, but its network structure is complex and requires considerable computational resources and time to converge during training. In addition, its large upsampling factor leads to severe loss of pixel information [13]. For semantic segmentation of high-resolution remote sensing images, it still suffers from low accuracy in small-target recognition and poor edge recognition.
The lightweight BiSeNetV2 [14] uses a detail branch and a semantic branch to balance low-level and high-level semantic information. The detail branch captures low-level detail and generates high-resolution feature representations. The semantic branch is a lightweight convolutional model that uses fast downsampling to expand the receptive field while designing contextual embedding blocks. Although it substantially reduces the number of parameters, its segmentation accuracy is not promising.
To solve the above problems, an improved Deeplabv3+ is proposed in this paper, which uses the lightweight MobileNetV2 [15] as the backbone network, and improves the accuracy of semantic segmentation using edge features provided by edge branches. Meanwhile, the decoding part uses a multi-level upsampling to enhance the tight connection between the encoder and decoder to retain the target’s location and boundary information more completely. Experimental results on the publicly available datasets UAVid [16] and ISPRS Vaihingen [17] show that the proposed model is more effective and robust than mainstream segmentation models.
The main contributions of this work are summarized as follows:
- A semantic segmentation algorithm for UAV remote sensing images based on improved Deeplabv3+ is proposed to effectively utilize edge features and low-level image features.
- The edge detection network built by 6 gating mechanism modules (Gate) can effectively extract edge features to improve the segmentation performance.
- A multi-level upsampling method is designed in the decoder to retain the target’s position and boundary information when restoring the feature map more completely.
2. Literature review
With the development of aviation technology, satellite remote sensing technology is favoured by researchers because of its low cost and easy access [18]. In the past few years, more and more research has used DL to process remote sensing images, such as land cover classification based on hyperspectral images [19], multi-scale geospatial target detection [20], and semantic segmentation of urban scenes [21]; DL has proven to be effective in processing remote sensing images.
Aerial imaging has become a common approach to acquiring data with the advent of Unmanned Aerial Vehicles (UAV). Compared with satellite-based aerospace remote sensing, UAV remote sensing can fly at low altitudes under clouds, making up for the fact that clouds often block satellite optical remote sensing from obtaining high-quality images [22]. Manual visual detection of multiple objects in an image is a time-consuming, biased and inaccurate operation. Therefore, designing algorithms that can quickly and accurately obtain information from images of this kind is a recent major challenge. Many researchers have proposed various image segmentation methods, which can be divided into three categories: traditional methods, methods based on machine learning and methods based on DL.
For remote sensing images, traditional segmentation methods mainly include threshold segmentation algorithms and edge detection segmentation algorithms. In order to improve the real-time performance of segmentation, Cheng et al. [23] proposed a threshold segmentation algorithm based on sample space reduction and interpolation methods. Xu et al. [24] used the traditional edge detection operator to solve the two-dimensional function and then selected the corresponding threshold to extract the edges of the image to realize the segmentation of UAV remote sensing images. Traditional methods are also effective in solving image segmentation tasks when dealing with images of desirable quality.
In addition, machine learning algorithms such as K-nearest neighbors, decision trees, random forests, and support vector machines are also used for image segmentation tasks.
Cariou et al. [25] improved the K-nearest neighbor method for density-based pixel clustering of hyperspectral remote sensing images for image segmentation. Yang et al. [26] combined the image digital surface model (DSM) and texture information to extract rice fallout areas using the maximum likelihood method and a decision tree classification model. Feng et al. [27] applied random forest and texture analysis to urban vegetation mapping of UAV remote sensing; Ma et al. [28] combined random forest and support vector machine for UAV remote sensing land cover classification. Although the above methods perform well in some cases, they are usually only applicable to a small range of data and cannot be validated on large datasets due to poor generalization ability [29].
DL has been widely used in semantic segmentation tasks in recent years and has performed well. As a result, many semantic segmentation methods based on DL have been applied to remote sensing image segmentation, as shown in Table 1. Ghorbanzadeh et al. [30] used CNN for landslide detection; Yang et al. [31] used CNN to extract mature rice areas and estimate rice production automatically; Su et al. [32] improved the CNN and proposed a new rice lodging identification method; Wang et al. [21] combined convolution with a transformer to achieve semantic segmentation of urban scene imagery. The Deeplab series of algorithms has shown outstanding performance in semantic segmentation in recent years. Based on Deeplabv1 [33], researchers have proposed Deeplabv2 [9], Deeplabv3 [34], and Deeplabv3+ [8], gradually improving segmentation performance by optimizing the network structure. Wang et al. [10] performed remote sensing of forest fires based on Deeplabv3+ and achieved good segmentation performance; Zhang et al. [11] achieved promising results in urban land use classification based on Deeplabv3+ and UAV remote sensing technology; Wang et al. [12] added a class feature attention mechanism to Deeplabv3+ and achieved high overall segmentation accuracy; Du et al. [35] incorporated Deeplabv3+ and an object-based image analysis strategy to label remote sensing images, achieving impressive accuracy.
In addition, the balance between accuracy and efficiency of detection models in large-scale remote sensing image segmentation tasks is also a research point of interest. Yao et al. [36] combined the channel attention mechanism with a lightweight deep convolutional neural network (DCNN) to achieve efficient cloud detection on remote sensing images. For the convenience of readers, we summarize the above methods in Table 1. The above studies have improved Deeplabv3+ to make it more suitable for remote sensing image semantic segmentation tasks, but there is still room for improvement in edge fineness and small-target recognition accuracy.
3. Methodology
Currently, Deeplabv3+ is a well-performing deep semantic segmentation model that uses the ASPP module and an encoder-decoder structure: the former captures multi-scale contextual information by pooling feature layers at different resolutions, while the latter recovers clearer object boundaries. In ASPP, multi-scale features are captured by parallel atrous convolutions with different dilation rates. The concatenated feature maps are then fed into a 1×1 convolutional layer, whose output serves as the output of the encoder. In the decoding part, the feature maps output by the encoder are first upsampled bilinearly by a factor of 4 and then concatenated with low-level feature maps of the corresponding size extracted from the Xception [38] backbone network. Here, another 1×1 convolution is applied to the low-level features to reduce the number of channels. After concatenation, the features are refined using a 3×3 convolution, and 4-fold bilinear upsampling is then performed again so that the output segmentation map is as large as the original image.
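As a concrete reference, the ASPP design described above can be sketched in PyTorch as follows. This is a minimal sketch: the dilation rates (6, 12, 18) and the global-pooling branch follow the original Deeplabv3+ paper and are assumptions here, as this text does not list them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel atrous convolutions with
    different dilation rates, concatenated and fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1, bias=False)])
        for r in rates:  # one 3x3 atrous conv per dilation rate
            self.branches.append(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False))
        self.image_pool = nn.Sequential(  # global-context branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))
```

The branch outputs are concatenated along the channel dimension and fused by the final 1×1 projection, matching the encoder output described in the text.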
However, during the downsampling of the feature map by the encoder, the resolution of the feature map gradually decreases as the network deepens, and the features of small targets are gradually blurred. At the same time, the atrous convolutions with large dilation rates in ASPP are not well suited to segmenting low-resolution feature maps [39]. In the upsampling phase, the decoder does not fully use the multi-level feature maps generated by the encoder and directly applies 4-fold bilinear upsampling to the feature map, which is not conducive to recovering pixel-level information.
To solve the above problems, an improved EMNet based on Deeplabv3+ is proposed. As shown in Fig 1, EMNet mainly consists of an encoder and a decoder, and the encoder contains a semantic segmentation module and an edge detection module.
3.1. Semantic segmentation module
As shown in Fig 1, the semantic segmentation module consists of a backbone feature extraction network and an ASPP module. In order to reduce the model computation and memory footprint so that image features can be mined more efficiently and quickly [36], EMNet uses the lightweight MobileNetV2 network as the backbone feature extraction network. Compared with the Xception network of Deeplabv3+, this network has shallower layers, fewer parameters, lower model complexity, and faster convergence. The structure of MobileNetV2 network is shown in Table 2, where t is the multiplication factor (i.e., expansion factor) of the input channels, c denotes the number of output channels, n represents the number of repetitions of the module, while s is the step size.
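The building block that Table 2 parameterizes (expansion factor t, output channels c, repeats n, stride s) can be sketched as below. This is a generic MobileNetV2-style inverted residual block, not the authors' exact implementation: a 1×1 expansion by factor t, a 3×3 depthwise convolution with stride s, and a 1×1 linear projection to c channels.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2 building block: 1x1 expansion (factor t), 3x3 depthwise
    convolution (stride s), 1x1 linear projection to c output channels.
    A residual connection is used when the block keeps shape (s=1, in==out)."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))  # linear bottleneck: no activation here

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```

Stacking n such blocks with the (t, c, s) values of Table 2 reproduces the backbone's stage structure.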
3.2. Edge detection module
Deeplabv3+ captures colour, shape, and texture information together using a DCNN, so all the different types of information related to the recognition target are aggregated at the bottom layer of the network, which reduces segmentation accuracy. In comparison, the edge detection branch of EMNet captures and learns the edge features of the input image alone, which helps to obtain more detailed information and thus provides adequate edge information for semantic segmentation.
The edge detection module (EDM) takes the output of each layer of the MobileNetV2 network as its input. Borrowing from the literature [40], the EDM is designed to consist of six gating mechanism modules (Gate); the specific structure of the Gate is shown in Fig 2.
St denotes the edge stream, Tt denotes the semantic stream, || denotes the concatenation of feature maps, C denotes a convolutional operation, and αt can be considered an attention map that assigns greater weight to regions with important boundary information. The Gate first uses a residual block and a 1×1 convolutional block to extract, downsample and upsample the input edge feature stream St, and reduces the dimension of the input semantic stream Tt using a 1×1 convolutional block. The features of these two streams are then fused, and the output feature map is reduced in dimension using two 1×1 convolutional blocks. Finally, the sigmoid function S restricts the output to the range [0, 1], so that each value in the output vector represents the weight of its corresponding channel feature in the input feature, as implemented in Eq (1):

αt = S(C(C(Vt || C(Tt))))  (1)
Vt denotes the edge stream St after processing by the residual block and the 1×1 convolutional block; Vt and αt are then combined through a residual connection. Finally, channel-wise weighting with the kernel wt yields a feature map with prominent edges:

St+1 = ((Vt ⊙ αt) + Vt) ∗ wt  (2)

where ⊙ denotes element-wise multiplication.
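A minimal PyTorch sketch of the Gate described above is given below. The channel counts, the single-channel attention map, and the assumption that both streams already share a spatial size are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn

class Gate(nn.Module):
    """Gating unit of the edge detection module (a sketch of the
    Gated-SCNN-style design described in the text): the processed edge
    stream V_t and the 1x1-reduced semantic stream T_t are concatenated,
    squeezed to an attention map alpha_t by two 1x1 convolutions and a
    sigmoid, and alpha_t re-weights the edge stream residually."""
    def __init__(self, edge_ch, sem_ch):
        super().__init__()
        self.edge_conv = nn.Conv2d(edge_ch, edge_ch, 1)       # S_t -> V_t
        self.reduce_sem = nn.Conv2d(sem_ch, edge_ch, 1)       # shrink T_t
        self.attn = nn.Sequential(                            # two 1x1 convs
            nn.Conv2d(2 * edge_ch, edge_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(edge_ch, 1, 1), nn.Sigmoid())           # alpha_t in [0,1]
        self.weight = nn.Conv2d(edge_ch, edge_ch, 1)          # kernel w_t

    def forward(self, s_t, t_t):
        v_t = self.edge_conv(s_t)
        alpha = self.attn(torch.cat([v_t, self.reduce_sem(t_t)], dim=1))
        return self.weight(v_t * alpha + v_t)  # residual gating, Eq (2) style
```

In practice the semantic stream would be resampled to the edge stream's resolution before the concatenation, as the text's mention of downsampling and upsampling suggests.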
On the one hand, the edge feature map obtained by the EDM is upsampled back to the input image size after channel downsampling, and the edge extraction process is supervised using edge labels transformed from the semantic segmentation labels. On the other hand, the edge feature map is transferred to the ASPP module and fused with high-level semantic features to provide edge information for semantic segmentation. Moreover, as shown in Fig 1, we first use the Canny edge detection operator to obtain the edges of the semantic segmentation label images, then take these edges as the image gradient, fuse them with the edge features output by the EDM, and finally transfer the fused features to the ASPP module. This enhances the edge weight of the feature map and thus mitigates the edge information loss caused by downsampling during feature extraction.
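The edge labels used to supervise the EDM can be derived from the segmentation labels as sketched below. The paper applies the Canny operator (e.g., `cv2.Canny`) to the label image; on a label map every class transition is an edge, so this numpy sketch of the same idea simply marks pixels whose right or lower neighbour belongs to a different class.

```python
import numpy as np

def edge_label_from_mask(mask: np.ndarray) -> np.ndarray:
    """Binary edge label (0/1) derived from a semantic segmentation label
    map: a pixel is an edge if its right or lower neighbour has a
    different class id (a simple stand-in for Canny on a label image)."""
    edge = np.zeros_like(mask, dtype=np.uint8)
    edge[:, :-1] |= (mask[:, :-1] != mask[:, 1:]).astype(np.uint8)  # horizontal
    edge[:-1, :] |= (mask[:-1, :] != mask[1:, :]).astype(np.uint8)  # vertical
    return edge
```

The resulting binary map serves as the ground truth for the binary cross-entropy loss of the edge branch.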
3.3. Decoder module
During the gradual downsampling of the image in encoding, the boundary information of the target is gradually blurred, and after the upsampling of the feature map by the decoder, the edges of the target are even more blurred, resulting in poor segmentation performance. Compared with satellite remote sensing images, higher accuracy of boundary contour extraction is required when semantic segmentation is performed on UAV remote sensing images.
In Fig 3(A), Deeplabv3+ recovers the feature maps of high-level semantic features directly by 4-fold upsampling in the decoding process. This decoding method performs well on satellite remote sensing images, but loses much detailed information on UAV remote sensing images, degrading the network's segmentation performance. In the encoder, the input images are gradually transformed from low-level features to high-level semantic features through feature extraction. By fusing the high-level semantic features of the decoder with the low-level encoder features of the corresponding size, it is possible to compensate for some of the target's location and boundary information lost while recovering the feature map. Therefore, EMNet is designed to integrate a multi-level upsampling module (Multi-level, MultiL).
(a) The Deeplabv3+ decoding module performs 4-fold bilinear upsampling; (b) The EMNet decoding module performs multi-level upsampling.
As shown in Figs 1 and 3(B), the information obtained from the EDM is transferred to the ASPP module, where it is fused with the high-level semantic features output by the semantic segmentation backbone network to provide edge information for the semantic segmentation task. The high-level semantic features output by the ASPP module are recovered by applying 2-fold upsampling twice. After each upsampling operation, the semantic features are summed with the feature map of the same size from the encoding stage, enhancing the tight connection between the encoder and decoder. The number of channels remains unchanged after this summation, while the number of parameters is reduced. At the same time, the location and boundary information of the target can be retained more thoroughly.
MultiL can be described by Eq (3), where U denotes the upsampling operation, k denotes the upsampling factor, and C denotes the convolutional operation. The feature layer Di in the decoder is first upsampled; the feature map Ei of the corresponding size in the encoder is then brought to the same channel count by a 1×1 convolution and added to it for feature fusion:

Di+1 = Uk(Di) + C(Ei)  (3)
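One step of the MultiL decoder described above can be sketched as follows; the channel counts in the usage are illustrative, not the network's actual values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLStep(nn.Module):
    """One step of the multi-level upsampling (MultiL) decoder: the decoder
    feature D_i is upsampled by 2, the same-size encoder feature E_i is
    brought to the same channel count by a 1x1 convolution, and the two
    are summed so the channel count is unchanged after fusion."""
    def __init__(self, enc_ch, dec_ch):
        super().__init__()
        self.match = nn.Conv2d(enc_ch, dec_ch, 1)  # up-dimension E_i

    def forward(self, d_i, e_i):
        d_up = F.interpolate(d_i, scale_factor=2, mode="bilinear",
                             align_corners=False)
        return d_up + self.match(e_i)  # element-wise fusion, Eq (3) style
```

Applying this step twice realizes the two 2-fold upsamplings that replace the single 4-fold upsampling of the Deeplabv3+ decoder.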
3.4. Loss function
Inspired by the idea of multi-task learning, we combine the prediction losses of semantic segmentation and edge detection modules as the final loss:
L = LS + ρ·Le  (4)
where LS is the loss of the semantic segmentation task, Le denotes the loss of the edge detection task, and ρ represents the weight of the loss of the edge detection task.
A multi-class cross-entropy function is used to calculate the loss for the semantic segmentation task. As shown in Eq (5), N denotes the number of pixels, LS denotes the loss over all pixels, and lspixel is the loss of a single pixel:

LS = (1/N)·∑ lspixel  (5)
lspixel can be calculated as:
lspixel = −∑k=1..C y(i,j),k·log P(i,j),k  (6)

where C is the number of predicted categories, y(i,j),k is the true label of the pixel at location (i, j) for category k, and P(i,j),k is the predicted probability of the corresponding category k at location (i, j).
The task loss of the edge detection branch is calculated using a binary cross-entropy function. As shown in Eq (7), N is the number of pixels, Le denotes the loss over all pixels, and lepixel denotes the loss of a single pixel:

Le = (1/N)·∑ lepixel  (7)
lepixel can be calculated as

lepixel = −α1·y(i,j)·log P(i,j) − α2·(1 − y(i,j))·log(1 − P(i,j))  (8)

where y(i,j) ∈ {0, 1} denotes the true label of the pixel at position (i, j), P(i,j) ∈ (0, 1) denotes the predicted probability of the positive label at position (i, j), and α1 and α2 denote the weights of the labels:
α1 = λ·|Y−| / (|Y+| + |Y−|)  (9)

α2 = |Y+| / (|Y+| + |Y−|)  (10)
where |Y+| denotes the number of pixel points at the edge in the image, |Y−| indicates the number of pixel points at the non-edge in the image. Considering that the number of edge pixels in the edge detection task is small, inspired by Liu et al. [41], we use λ to adjust the weight of positive labels.
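The combined loss of Eq (4) with the class-balanced edge loss of Eqs (7)-(10) can be sketched as below. The λ and ρ values are placeholders, and the exact form of the balancing weights follows the formulation of Liu et al. [41] as an assumption.

```python
import torch
import torch.nn.functional as F

def edge_loss(pred, target, lam=1.1):
    """Class-balanced binary cross-entropy for the edge branch. Following
    Eqs (9)-(10): positive (edge) pixels are weighted by lam*|Y-|/|Y| and
    negatives by |Y+|/|Y|, so the scarce edge class is boosted. `pred`
    holds probabilities in (0, 1); `target` holds binary edge labels."""
    n_pos = target.sum()
    n_neg = target.numel() - n_pos
    total = float(target.numel())
    a1 = lam * n_neg / total              # weight for positive (edge) pixels
    a2 = n_pos / total                    # weight for negative pixels
    weight = torch.where(target > 0.5, a1, a2)
    return F.binary_cross_entropy(pred, target, weight=weight)

def total_loss(seg_logits, seg_target, edge_prob, edge_target, rho=1.0):
    """Combined multi-task loss of Eq (4): L = L_S + rho * L_e."""
    l_s = F.cross_entropy(seg_logits, seg_target)  # multi-class CE, Eqs (5)-(6)
    return l_s + rho * edge_loss(edge_prob, edge_target)
```

The weights α1 and α2 are recomputed per batch from the actual edge/non-edge pixel counts, as Eqs (9)-(10) prescribe.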
3.5. Evaluation metrics
UAV remote sensing image segmentation is a sub-task of semantic segmentation, so we can directly adopt the evaluation criteria commonly used in semantic segmentation: Mean Pixel Accuracy (mPA) and Mean Intersection over Union (mIoU). PA evaluates pixel-level classification accuracy for each category, and mPA averages it over all categories. IoU evaluates the segmentation effectiveness of a model for each category separately, and mIoU averages it over all categories. Higher values of mPA and mIoU represent better overall segmentation performance. For each category i, TPi is the number of pixels of category i correctly predicted as category i; FPi is the number of pixels of other categories wrongly predicted as category i; FNi is the number of pixels of category i wrongly predicted as other categories; TNi is the number of pixels correctly predicted as not belonging to category i; and k is the number of segmentation categories:
mPA = (1/k)·∑i=1..k TPi / (TPi + FNi)  (11)

mIoU = (1/k)·∑i=1..k TPi / (TPi + FPi + FNi)  (12)
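The two metrics can be computed from a confusion matrix as sketched below, matching the per-category definitions above:

```python
import numpy as np

def miou_mpa(pred: np.ndarray, label: np.ndarray, k: int):
    """mIoU and mPA from a k x k confusion matrix: per class i,
    PA_i = TP_i/(TP_i+FN_i) and IoU_i = TP_i/(TP_i+FP_i+FN_i),
    each averaged over the k categories."""
    cm = np.bincount(k * label.ravel() + pred.ravel(),
                     minlength=k * k).reshape(k, k)
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp              # ground-truth pixels missed
    fp = cm.sum(axis=0) - tp              # pixels wrongly claimed
    pa = tp / np.maximum(tp + fn, 1)
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean(), pa.mean()
```

Classes absent from both prediction and label contribute zero here; in practice such classes are usually excluded from the average.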
4. Experimental evaluation
4.1. Dataset
The models are trained and tested on two publicly available datasets: the UAVid dataset and the ISPRS Vaihingen dataset.
In the UAVid dataset [16], the shooting scene is urban; the camera angle is about 45 degrees, and the flight height is about 50 metres above the ground. The image resolution is 3840×2160 or 4096×2160, consisting of red, green and blue bands. There are 270 images in the dataset, labelled with eight categories: building, road, static car (s car), tree, low vegetation (low veg), human, moving car (m car) and background clutter (clutter). To fully utilize the image data, the images and labels were manually cropped into 960×720-pixel chunks, yielding 3240 samples, which were divided into training and validation sets at a 9:1 ratio. To facilitate subsequent network training, each sample was uniformly resized to 512×512 pixels in the data preparation step before training. As shown in Fig 4, the first row shows the cropped original images, and the second row shows the corresponding labels.
Reprinted from [16] under a CC BY license, with permission from [UAVid], original copyright [2020].
The Vaihingen dataset [17] used in this paper was provided by Working Group (WG) III/4 of ISPRS from the Vaihingen area of Germany in the context of the “ISPRS test project on urban classification and 3D building reconstruction”. The Vaihingen dataset contains 33 remotely sensed images extracted from a larger top-level orthophoto. There are 6 categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter. The images are 8-bit TIFF files with a ground sample resolution of 0.09 m. The three bands of the TIFF files correspond to the near infrared, red and green bands delivered by the camera. The images vary in pixel size, with an average of 2494×2064. To augment the data and fit the hardware environment, we cropped the images with overlap (width overlap 370, height overlap 320). Each image and label was manually cropped to 512×512 pixels, yielding 3269 samples, with a 9:1 training/validation ratio. As shown in Fig 5, the first row shows the cropped original image, and the second row shows the corresponding labels.
Reprinted from [17] under a CC BY license, with permission from [DGPF], original copyright [2010].
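The overlapped cropping described for Vaihingen (512×512 tiles, width overlap 370, height overlap 320) can be sketched as follows; shifting the final row and column back so that every tile is full-size is our assumption about how image borders are handled.

```python
import numpy as np

def crop_tiles(img: np.ndarray, tile=512, ox=370, oy=320):
    """Crop a large image into tile x tile patches with horizontal overlap
    ox and vertical overlap oy (the Vaihingen values from the text). The
    last row/column of tiles is shifted back so every patch is full-size."""
    h, w = img.shape[:2]
    sx, sy = tile - ox, tile - oy          # strides implied by the overlaps
    xs = sorted({min(x, w - tile) for x in range(0, w, sx)})
    ys = sorted({min(y, h - tile) for y in range(0, h, sy)})
    return [img[y:y + tile, x:x + tile] for y in ys for x in xs]
```

The same function with different overlap values would cover the UAVid chunking as well; labels are cropped with identical offsets so image/label pairs stay aligned.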
4.2. Experimental settings
We performed the experiments on a desktop running Ubuntu 18.04 with a 2.50 GHz Intel Xeon E5-2678 CPU, 32 GB memory, and an NVIDIA 1080Ti graphics card. The experiments were run with PyTorch 1.6. During training, we chose stochastic gradient descent (SGD) as the optimizer and set the momentum and weight decay factor to 0.9 and 0.0004, respectively. In addition, based on the results of comparison experiments, we set the initial learning rate and batch size to 0.03 and 6, respectively.
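The optimizer settings reported above can be reproduced as follows; the model here is a placeholder, and the single training step is only for illustration.

```python
import torch
import torch.nn as nn

# SGD with the reported hyper-parameters (momentum 0.9, weight decay 0.0004,
# initial learning rate 0.03, batch size 6). The model is a stand-in module.
model = nn.Conv2d(3, 8, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=4e-4)

x = torch.randn(6, 3, 32, 32)            # batch size 6, as in the paper
loss = model(x).pow(2).mean()            # dummy loss for the illustration
optimizer.zero_grad()
loss.backward()
optimizer.step()
```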
4.3. Experiment analysis
4.3.1. Effectiveness analysis of EMNet.
To verify the validity of EMNet, Deeplabv3+_Xception [8] (Dv_Xtion), Deeplabv3+_MobileNetV2 (Dv_Mnetv2) and BiSeNetV2 [14] models were selected for experimental comparative analysis.
From Fig 6, it can be seen that the segmentation accuracies of Dv_Xtion and Dv_Mnetv2 are relatively low, as they lack edge information and fail to locate small targets and recover edge information. The same problem exists with BiSeNetV2: it fails to recognize the “human” objects, and its predicted segmentation boundaries for “s car” and “low veg” are not precise enough. In contrast, EMNet addresses these problems by adding the EDM and the MultiL structure, making the model more applicable to high-resolution UAV remote sensing images.
(a) The original image; (b) The label; (c) The segmentation result of Dv_Xtion; (d) The segmentation result of Dv_Mnetv2; (e) The segmentation result of BiSeNetV2; (f) The segmentation result of EMNet. Reprinted from [16] under a CC BY license, with permission from [UAVid], original copyright [2020].
Table 3 shows the mIoU and mPA values of different models on the UAVid test set. Table 3 also presents the number of parameters (Parameters) and floating point operations (FLOPs) of each model; these two statistics are independent of the dataset. We can see that EMNet outperforms Dv_Xtion, Dv_Mnetv2 and BiSeNetV2 in both mIoU and mPA. Moreover, EMNet is superior to Dv_Xtion in terms of Parameters and FLOPs. EMNet has slightly more parameters than Dv_Mnetv2 and larger FLOPs than Dv_Mnetv2 and BiSeNetV2, because EMNet is based on multi-task learning and performs both edge detection and semantic segmentation. Considering these evaluation metrics collectively, we can see that EMNet achieves a good balance between computational efficiency and segmentation accuracy.
As shown in Table 4, EMNet has the highest IoU in all categories, especially in small-target segmentation such as “human”. These experimental results show that EMNet has notable segmentation performance on UAV remote sensing images. We also performed a t-test (p = 5.23×10−5 < 0.05), which indicates that our method significantly outperforms the baseline methods.
To further verify the validity of EMNet, experiments were also conducted on the publicly available ISPRS Vaihingen dataset; the comparison in Fig 7 also shows that the segmentation accuracy of EMNet is higher than that of the other models.
(a) The original image; (b) The label; (c) The segmentation result of Dv_Xtion; (d) The segmentation result of Dv_Mnetv2; (e) The segmentation result of BiSeNetV2; (f) The segmentation result of EMNet. Reprinted from [17] under a CC BY license, with permission from [DGPF], original copyright [2010].
From Table 5, we can see that both the mIoU and mPA of EMNet outperform those of Dv_Xtion, Dv_Mnetv2, and BiSeNetV2.
4.3.2. Ablation analysis of EDM and MultiL modules.
Ablation experiments were conducted to verify the effectiveness of EDM module and MultiL structure in EMNet. Under the same experimental conditions, we regard Deeplabv3+ using MobileNetV2 backbone feature extraction network (Dv_Mnetv2) as the baseline. The segmentation results in ablation experiments are shown in Fig 8.
(a) The original image; (b) The label; (c) The segmentation result of Dv_Mnetv2; (d) The segmentation result of Dv_Mnetv2+EDM; (e) The segmentation result of Dv_Mnetv2+MultiL; (f) The segmentation result of EMNet. Reprinted from [16] under a CC BY license, with permission from [UAVid], original copyright [2020].
As can be seen in Fig 8(C), the baseline network fails to accurately identify all the “human” objects in the original image, while both Dv_Mnetv2+EDM and Dv_Mnetv2+MultiL show improved segmentation results. EMNet, which combines the advantages of EDM and MultiL, has superior performance in image segmentation.
As shown in Table 6, compared with the baseline, the model incorporating an EDM (Dv_Mnetv2+EDM) improved the mIoU and mPA on the test set by 1.15% and 1.57%, respectively; meanwhile, the model containing the MultiL structure (Dv_Mnetv2+MultiL) improved the mIoU and mPA on the test set by 0.33% and 1.08%, respectively.
5. Conclusions
Based on DeepLabv3+, the proposed EMNet model uses an edge detection branch in the encoder to extract edge features and provide edge information for semantic segmentation. A multi-level upsampling method is designed in the decoder to retain the target's location and boundary information when recovering the feature map. Compared to DeepLabv3+, EMNet is more accurate in identifying small-sized targets and segmenting edges. The experimental results show that the mIoU and mPA of EMNet are 71.46% and 80.46% on the UAVid dataset, and 91.80% and 95.42% on the ISPRS Vaihingen dataset. EMNet outperforms the other baseline models on all of these metrics and can better perform the semantic segmentation task of UAV remote sensing images.
Acknowledgments
The Vaihingen data set was provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) [17]: http://www.ifp.uni-stuttgart.de/dgpf/DKEP-Allg.html.
References
- 1. Gibril M. B. A., Kalantar B., Al-Ruzouq R., Ueda N., Saeidi V., Shanableh A., et al. "Mapping heterogeneous urban landscapes from the fusion of digital surface model and unmanned aerial vehicle-based images using adaptive multiscale image segmentation and classification." Remote Sensing 12 (2020): 1081. https://doi.org/10.3390/rs12071081.
- 2. Williams J. G., Rosser N. J., Kincey M. E., Benjamin J., Oven K. J., Densmore A. L., et al. "Satellite-based emergency mapping using optical imagery: Experience and reflections from the 2015 nepal earthquakes." Nat. Hazards Earth Syst. Sci. 18 (2018): 185–205. https://doi.org/10.5194/nhess-18-185-2018.
- 3. Siam M., Elkerdawy S., Jagersand M. and Yogamani S. "Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges." Presented at 2017 IEEE 20th international conference on intelligent transportation systems (ITSC), 2017. IEEE, 1–8.
- 4. Kotaridis I. and Lazaridou M. "Remote sensing image segmentation advances: A meta-analysis." ISPRS Journal of Photogrammetry and Remote Sensing 173 (2021): 309–22. https://doi.org/10.1016/j.isprsjprs.2021.01.020.
- 5. Simonyan K. and Zisserman A. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014): https://doi.org/10.48550/arXiv.1409.1556.
- 6. Long J., Shelhamer E. and Darrell T. "Fully convolutional networks for semantic segmentation." Presented at 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. IEEE Computer Society, 3431–40.
- 7. Everingham M., Van Gool L., Williams C. K. I., Winn J. and Zisserman A. "The pascal visual object classes (voc) challenge." International Journal of Computer Vision 88 (2010): 303–38. https://doi.org/10.1007/s11263-009-0275-4.
- 8. Chen L.-C., Zhu Y., Papandreou G., Schroff F. and Adam H. "Encoder-decoder with atrous separable convolution for semantic image segmentation." Presented at Proceedings of the European conference on computer vision (ECCV), 2018. 801–18.
- 9. Chen L.-C., Papandreou G., Kokkinos I., Murphy K. and Yuille A. L. "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs." IEEE transactions on pattern analysis and machine intelligence 40 (2018): 834–48. pmid:28463186
- 10. Wang Z., Peng T. and Lu Z. "Comparative research on forest fire image segmentation algorithms based on fully convolutional neural networks." Forests 13 (2022): 1133. https://doi.org/10.3390/f13071133.
- 11. Zhang C., Li M., Wei D. and Wu B. "Enhanced deeplabv3+ for urban land use classification based on uav-borne images." Presented at 2022 7th International Conference on Image, Vision and Computing (ICIVC), 2022. 449–54.
- 12. Wang Z., Wang J., Yang K., Wang L., Su F. and Chen X. "Semantic segmentation of high-resolution remote sensing images based on a class feature attention mechanism fused with deeplabv3+." Computers & Geosciences 158 (2022): 104969. https://doi.org/10.1016/j.cageo.2021.104969.
- 13. Su Y., Lin Y., Fang X. and Zhong L. "Improved deeplabv3+ network segmentation method for urban road scenes." Presented at 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), 2022. 10, 1274–80.
- 14. Yu C., Gao C., Wang J., Yu G., Shen C. and Sang N. "Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation." International Journal of Computer Vision 129 (2021): 3051–68. https://doi.org/10.1007/s11263-021-01515-2.
- 15. Sandler M., Howard A., Zhu M., Zhmoginov A. and Chen L.-C. "Mobilenetv2: Inverted residuals and linear bottlenecks." Presented at 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. IEEE, 4510–20.
- 16. Lyu Y., Vosselman G., Xia G.-S., Yilmaz A. and Yang M. Y. "Uavid: A semantic segmentation dataset for uav imagery." ISPRS Journal of Photogrammetry and Remote Sensing 165 (2020): 108–19.
- 17. Cramer M. "The dgpf-test on digital airborne camera evaluation overview and test design." Photogrammetrie-Fernerkundung-Geoinformation (2010): 73–82.
- 18. Zhang N., Zhang X., Yang G., Zhu C., Huo L. and Feng H. "Assessment of defoliation during the dendrolimus tabulaeformis tsai et liu disaster outbreak using uav-based hyperspectral images." Remote Sensing of Environment 217 (2018): 323–39. https://doi.org/10.1016/j.rse.2018.08.024.
- 19. AL-Alimi D., Al-qaness M. A. A., Cai Z., Dahou A., Shao Y. and Issaka S. "Meta-learner hybrid models to classify hyperspectral images." Remote Sensing 14 (2022): 1038. https://doi.org/10.3390/rs14041038.
- 20. AL-Alimi D., Shao Y., Feng R., Al-qaness M. A. A., Elaziz M. A. and Kim S. "Multi-scale geospatial object detection based on shallow-deep feature extraction." Remote Sensing 11 (2019): 2525. https://doi.org/10.3390/rs11212525.
- 21. Wang L., Li R., Zhang C., Fang S., Duan C., Meng X. and Atkinson P. M. "Unetformer: A unet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery." ISPRS Journal of Photogrammetry and Remote Sensing 190 (2022): 196–214. https://doi.org/10.1016/j.isprsjprs.2022.06.008.
- 22. Osco L. P., Marcato Junior J., Marques Ramos A. P., de Castro Jorge L. A., Fatholahi S. N., de Andrade Silva J., Matsubara E. T., Pistori H., Gonçalves W. N. and Li J. "A review on deep learning in uav remote sensing." International Journal of Applied Earth Observation and Geoinformation 102 (2021): 102456. https://doi.org/10.1016/j.jag.2021.102456.
- 23. Cheng H., Shi X. and Glazier C. "Real-time image thresholding based on sample space reduction and interpolation approach." Journal of computing in civil engineering 17 (2003): 264–72. https://doi.org/10.1061/(ASCE)0887-3801(2003)17:4(264).
- 24. Xu D., Zhao Y., Jiang Y., Zhang C., Sun B. and He X. "Using improved edge detection method to detect mining-induced ground fissures identified by unmanned aerial vehicle remote sensing." Remote Sensing 13 (2021): 3652. https://doi.org/10.3390/rs13183652.
- 25. Cariou C., Le Moan S. and Chehdi K. "Improving k-nearest neighbor approaches for density-based pixel clustering in hyperspectral remote sensing images." Remote Sensing 12 (2020): 3745. https://doi.org/10.3390/rs12223745.
- 26. Yang M.-D., Huang K.-S., Kuo Y.-H., Tsai H. P. and Lin L.-M. "Spatial and spectral hybrid image classification for rice lodging assessment through uav imagery." Remote Sensing 9 (2017): 583. https://doi.org/10.3390/rs9060583.
- 27. Feng Q., Liu J. and Gong J. "Uav remote sensing for urban vegetation mapping using random forest and texture analysis." Remote Sensing 7 (2015): 1074–94. https://doi.org/10.3390/rs70101074.
- 28. Ma L., Fu T., Blaschke T., Li M., Tiede D., Zhou Z., Ma X. and Chen D. "Evaluation of feature selection methods for object-based land cover mapping of unmanned aerial vehicle imagery using random forest and support vector machine classifiers." ISPRS International Journal of Geo-Information 6 (2017): 51. https://doi.org/10.3390/ijgi6020051.
- 29. Wang S., Mu X., Yang D., He H. and Zhao P. "Attention guided encoder-decoder network with multi-scale context aggregation for land cover segmentation." IEEE Access 8 (2020): 215299–309.
- 30. Ghorbanzadeh O., Blaschke T., Gholamnia K., Meena S. R., Tiede D. and Aryal J. "Evaluation of different machine learning methods and deep-learning convolutional neural networks for landslide detection." Remote Sensing 11 (2019): 196. https://doi.org/10.3390/rs11020196.
- 31. Yang Q., Shi L., Han J., Zha Y. and Zhu P. "Deep convolutional neural networks for rice grain yield estimation at the ripening stage using uav-based remotely sensed images." Field Crops Research 235 (2019): 142–53. https://doi.org/10.1016/j.fcr.2019.02.022.
- 32. Su Z., Wang Y., Xu Q., Gao R. and Kong Q. "Lodgenet: Improved rice lodging recognition using semantic segmentation of uav high-resolution remote sensing images." Computers and Electronics in Agriculture 196 (2022): 106873. https://doi.org/10.1016/j.compag.2022.106873.
- 33. Chen L.-C., Papandreou G., Kokkinos I., Murphy K. and Yuille A. L. "Semantic image segmentation with deep convolutional nets and fully connected crfs." arXiv preprint arXiv:1412.7062 (2014).
- 34. Chen L.-C., Papandreou G., Schroff F. and Adam H. "Rethinking atrous convolution for semantic image segmentation." arXiv preprint arXiv:1706.05587 (2017): https://doi.org/10.48550/arXiv.1706.05587.
- 35. Du S., Du S., Liu B. and Zhang X. "Incorporating deeplabv3+ and object-based image analysis for semantic segmentation of very high resolution remote sensing images." International Journal of Digital Earth 14 (2021): 357–78. https://doi.org/10.1080/17538947.2020.1831087.
- 36. Yao X., Guo Q. and Li A. "Light-weight cloud detection network for optical remote sensing images with attention-based deeplabv3+ architecture." Remote Sensing 13 (2021): 3617. https://doi.org/10.3390/rs13183617.
- 37. AL-Alimi D., Al-qaness M. A. A., Cai Z., Dahou A., Shao Y. and Issaka S. "Meta-learner hybrid models to classify hyperspectral images." Remote Sensing 14 (2022): 1038. https://doi.org/10.3390/rs14041038.
- 38. Chollet F. "Xception: Deep learning with depthwise separable convolutions." Presented at 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. IEEE Computer Society, 1800–07.
- 39. Baheti B., Innani S., Gajre S. and Talbar S. "Semantic scene segmentation in unstructured environment with modified deeplabv3+." Pattern Recognition Letters 138 (2020): 223–29. https://doi.org/10.1016/j.patrec.2020.07.029.
- 40. Takikawa T., Acuna D., Jampani V. and Fidler S. "Gated-scnn: Gated shape cnns for semantic segmentation." Presented at Proceedings of the IEEE/CVF international conference on computer vision, 2019. 5229–38.
- 41. Liu Y., Cheng M. M., Hu X., Bian J. W., Zhang L., Bai X. and Tang J. "Richer convolutional features for edge detection." IEEE transactions on pattern analysis and machine intelligence 41 (2019): 1939–46. pmid:30387723