A feature fusion deep-projection convolution neural network for vehicle detection in aerial images

With the rapid development of Unmanned Aerial Vehicles, vehicle detection in aerial images plays an important role in different applications. Compared with general object detection problems, vehicle detection in aerial images is still a challenging research topic since it is plagued by various unique factors, e.g. different camera angles, small vehicle sizes and complex backgrounds. In this paper, a Feature Fusion Deep-Projection Convolution Neural Network is proposed to enhance the ability to detect small vehicles in aerial images. The backbone of the proposed framework utilizes a novel residual block named stepwise res-block to explore high-level semantic features while conserving low-level detail features. A specially designed feature fusion module is adopted in the proposed framework to further balance the features obtained from different levels of the backbone. A deep-projection deconvolution module is used to minimize the impact of the information contamination introduced by down-sampling/up-sampling processes. The proposed framework has been evaluated on the UCAS-AOD, VEDAI, and DOTA datasets. According to the evaluation results, the proposed framework outperforms other state-of-the-art vehicle detection algorithms for aerial images.


Introduction
As one of the core research topics of computer vision, object detection is widely used in automatic driving, crowd flow counting, topographic exploration, environmental pollution monitoring, etc. The task of object detection is to find the various targets in an image and determine the locations and categories of these targets. Because the appearance of objects changes significantly according to various factors [1][2][3][4], object detection is commonly regarded as one of the most challenging tasks in the field of computer vision.
The traditional object detection algorithms are normally based on hand-crafted features or textures. R.M. Haralick et al. proposed textural features for image classification in 1973 [5]. D. G. Lowe proposed scale-invariant feature transform (SIFT) in 1999 [6]. In 2014, T. Moranduzzo and F. Melgani used SIFT to count the number of vehicles and trained a support vector
1. A novel residual block named stepwise res-block is proposed. To explore high-level semantic features, a single stepwise res-block contains three 3x3 convolutional layers. These three layers form a hierarchical structure that produces features processed by various numbers of convolution layers. Compared with the original res-block, the proposed stepwise res-block increases the depth of the network while keeping the parameter number at a relatively low level. In addition to the shortcut connection added to the output, the proposed stepwise res-block introduces another shortcut to conserve the low-level features.
2. Based on the proposed stepwise res-block, a backbone for small object detection is designed. This backbone is composed of 34 stepwise res-blocks. Compared with the original ResNet or Res2Net with roughly the same parameter number, the proposed backbone provides more convolutional layers, which enhance its capacity for extracting complex high-level semantic features. Meanwhile, the combination of multiple stepwise res-blocks enables the output features to be processed by various numbers of convolutional layers, and conserves the low-level features which are important in small object detection. Thus the features generated by the proposed backbone contain both low-level detailed information and high-level semantic information at the same time.
3. A new feature fusion and detection network is proposed. To make full use of the high-level features and low-level features generated by the backbone, a feature fusion and multi-scale detection network is designed. The outputs from different levels of the proposed backbone are collected into the feature library through the spatial pyramid structure, and the features suitable for detection are obtained through an automatic selection process of the network. The selected features are then fed into the multi-scale detection module for detection. To reduce the information contamination introduced by up-sampling/down-sampling processes, a deep-projection deconvolution module is adopted in the proposed feature fusion network.
The rest of this paper is organized as follows. Section 2 reviews the state-of-the-art vehicle detection algorithms for aerial images. The implementation details of the proposed Feature Fusion Deep-Projection Convolution Neural Network are described in Section 3. Section 4 contains the evaluation results of the proposed framework using three different datasets. Section 5 concludes this paper.

Related work
Due to the significant performance advantages of Convolution Neural Networks (CNNs), most recently proposed vehicle detection algorithms for aerial images are based on various CNNs. In 2017, T. Tang et al. designed an improved Faster-RCNN to solve the difficulties of locating small vehicles and distinguishing vehicles from the background [22]. The authors use a hyper region proposal network (HRPN) combined with different levels of feature maps to extract vehicles, and an enhanced classifier to reduce false detections.
In 2018, S. Liu et al. proposed a Path Aggregation Network (PANet) which aims at solving the problem of boosting information flow in a proposal-based instance segmentation framework [23]. By extending the bottom-up path, an accurate positioning signal can be obtained at the bottom layer. Thereby the entire feature hierarchy is enhanced, and the information path between the bottom layer and the topmost feature is shortened. The authors also proposed an adaptive feature pool to link all features, so that the useful information of each layer can be directly transmitted to the proposal subnetwork.
M.Y. Yang et al. presented a novel double focal loss convolutional neural network framework (DFL-CNN) in 2018 [24]. The proposed framework uses skip connection and focal loss functions to improve detection performance. At the same time, a new large-scale vehicle detection dataset named ITCVD was proposed in the same paper. According to the authors, DFL-CNN obtains good performance on the ITCVD dataset.
In 2018, Y. Koga et al. applied Hard Example Mining (HEM) to Stochastic Gradient Descent (SGD) in the process of training neural networks [12]. According to the authors, the application of HEM can improve detection accuracy by training the networks with more informative samples.
In 2018, G. Cheng et al. tried to solve the detection problems related to object rotation, within-class diversity and inter-class similarity by adding a rotation-invariant regularizer and a Fisher discrimination regularizer to existing neural networks [25]. The authors claimed that, compared with other state-of-the-art algorithms, their method achieved good performance on various datasets.
In 2019, M. Mandal et al. designed a one-stage vehicle detection network (AVDNet) [11]. The proposed network used ConvRes to hold the features of small objects on multiple scales. The authors also proposed a recurrent-feature aware visualization (RFAV) technique to analyze the layers in the network.
W. Liu et al. also introduced an object detection algorithm for aerial images in 2019 [26]. A feature introducing strategy based on oriented response dilated convolution is used to make the model adaptable to multi-scale object detection.
H. Qiu et al. proposed a novel end-to-end Adaptively Aspect Ratio multi-scale Network (A2RMNet) in 2019, which focuses on detecting objects with various sizes and aspect ratios [27]. The authors designed a feature gate fusion network to integrate multi-scale feature maps, and an aspect ratio attention network to prevent changes in objects' aspect ratios.
B. Artacho et al. proposed a new semantic segmentation network in 2019 [28]. The main structure of the network is named the "Waterfall" Atrous Spatial Pooling (WASP) architecture. Compared with the traditional spatial pyramid structure, WASP can reduce the number of parameters used in the network while increasing the receptive field of the network.
W. Li et al. proposed a network to detect and count vehicles simultaneously in aerial images in 2019 [4]. The authors utilized the combination of bottom-up cues and top-down attention mechanisms to maximize the use of mutual information between object categories and features. An effective loss function is used in the proposed network to push the anchors toward matching the ground-truth boxes as closely as possible.
In 2020, Y. Yang et al. proposed a framework to solve the perspective distortion problem of aerial images [29]. The proposed framework utilized a reverse perspective network to evaluate perspective distortion and evenly distort the image to obtain similar example scales. At the same time, the framework forces the regressor to learn from the augmented density maps via an adversarial network to further solve the problem of scale distortion in dense regions.
J. Rabbi et al. proposed a novel small object detection framework in 2020 by combining a generative adversarial network (GAN)-based model called enhanced super-resolution GAN (ESRGAN), an Edge-Enhancement Network (EEN) and a detection network [30]. The proposed framework achieved promising performance on both public datasets and a self-assembled dataset.
A brief summary of some related works and their contributions is shown in Table 1.

Introduction of proposed framework
As shown in Fig 1, the proposed Feature Fusion Deep-Projection Convolution Neural Network (FFDP-CNN) can be roughly divided into two parts: a backbone network composed of stepwise res-blocks, and the feature fusion network combined with the detection network.
Thanks to the stepwise res-block, the proposed backbone produces features processed by various numbers of convolution layers, which helps to conserve low-level detailed information and produce high-level semantic information. The feature fusion network fuses the low-level and high-level feature maps and selects, from the different feature maps, the information that is effective for vehicle detection.

Introduction of stepwise res-block
The original residual block [31] was proposed to solve the vanishing gradient problem which appears as the depth of the network increases. By stacking up multiple residual blocks, networks with various depths have been proposed to better explore high-level semantic features. However, in small object detection, low-level detailed information is considered as important as high-level semantic information. Although the side connection in the original residual block helps to increase the depth of the network, it cannot prevent the disappearance of low-level detailed information caused by convolutions. To obtain the low-level detailed information and high-level semantic information at the same time, a stepwise hierarchical convolution structure is created to replace the 3×3 convolution layer in the original residual block. Because the input feature is divided and processed hierarchically, like climbing stairs, the proposed res-block is named stepwise res-block. The structure of the proposed stepwise res-block is shown in Fig 2. $S^{i}_{j}$ indicates a feature subset, where $i \in \{0, 1, 2, 3\}$ is the number of convolution layers the subset has been processed by and $j \in \{0, 1, 2, 3\}$ is the subset number. $F_{i}$ indicates a feature map assembled from feature subsets, where $i$ is the number of the feature map. $C^{j}_{i}$ indicates an $i \times i$ convolution layer, with $j$ indexing the layer. The input of the stepwise res-block is first processed by $C^{0}_{1}$ to decrease the channel number (to 1/4 of the input's original channel number), and is then split into even quarters ($S^{0}_{0}$, $S^{0}_{1}$, $S^{0}_{2}$, $S^{0}_{3}$) along the channel dimension. $S^{0}_{0}$ is directly superimposed onto $F_{4}$ without processing, while $S^{0}_{1}$, $S^{0}_{2}$, $S^{0}_{3}$ pass through $C^{0}_{3}$ to obtain $S^{1}_{1}$, $S^{1}_{2}$, $S^{1}_{3}$ (collectively called $F_{1}$). Then, $S^{1}_{1}$ is directly superimposed onto $F_{4}$ without processing, while $S^{1}_{2}$, $S^{1}_{3}$ pass through $C^{1}_{3}$ to obtain $S^{2}_{2}$, $S^{2}_{3}$ (collectively called $F_{2}$). After that, $S^{2}_{2}$ is directly superimposed onto $F_{4}$ without processing, while $S^{2}_{3}$ passes through $C^{2}_{3}$ to obtain $S^{3}_{3}$ (also named $F_{3}$), which is superimposed onto $F_{4}$. In other words, $S^{0}_{0}$, $S^{1}_{1}$, $S^{2}_{2}$ and $S^{3}_{3}$ are concatenated together to form the feature map $F_{4}$. At the end of this process, $F_{4}$ is processed by another 1×1 convolution layer $C^{1}_{1}$ to restore the channel number to the input's original channel number. A side connection is also applied in the stepwise res-block after $F_{4}$ passes through $C^{1}_{1}$. Different subsets of the output are processed by different numbers of convolution layers, so different subsets contain features with different receptive fields. Among them, subsets with small receptive fields, such as $S^{0}_{0}$ and $S^{1}_{1}$, pass through fewer convolution layers and contain more detail information, which is important for small object detection. Subsets with large receptive fields, such as $S^{2}_{2}$ and $S^{3}_{3}$, blur the detail information but can explore deep semantic information, which is also important for detection [21]. The two 1×1 convolution layers ($C^{0}_{1}$ and $C^{1}_{1}$) are used at the beginning and end of the proposed block to automatically select the suitable features and implement a bottleneck that reduces the parameter number.
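The data flow described above can be traced symbolically. The following is a minimal sketch rather than the authors' implementation: the function name `trace_stepwise_block` is hypothetical, and it assumes that each 3×3 layer is a single convolution applied to the concatenation of the remaining subsets and that it preserves their channel count, which the text does not state explicitly.

```python
def trace_stepwise_block(in_channels):
    """Trace which subsets reach F4 and how many 3x3 convolutions each has seen.

    Assumption: every 3x3 layer acts on the concatenated remaining subsets
    and keeps their channel count unchanged.
    """
    if in_channels % 16:
        raise ValueError("in_channels must be divisible by 16 for an even split")
    reduced = in_channels // 4      # channels after the first 1x1 convolution
    width = reduced // 4            # channels in each of the four subsets
    remaining = 4                   # subsets still moving up the "stairs"
    kept = []                       # (subset name, channels, number of 3x3 convs seen)
    convs = []                      # (layer name, input channels, output channels)
    for depth in range(4):
        # One subset leaves the stairs at every step and is concatenated into F4.
        kept.append((f"S^{depth}_{depth}", width, depth))
        remaining -= 1
        if remaining:
            # The other subsets pass through one more 3x3 convolution.
            convs.append((f"C^{depth}_3", remaining * width, remaining * width))
    return kept, convs


kept, convs = trace_stepwise_block(256)
for name, channels, depth in kept:
    print(name, channels, depth)
```

For a 256-channel input, the four subsets concatenated into F4 carry 16 channels each and have passed through 0, 1, 2 and 3 of the 3×3 layers respectively, which is the hierarchy of receptive fields the block is designed to produce.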
Compared with the original res-block, the stepwise res-block provides a greater maximum depth while keeping the parameter number relatively low. Assuming that the dimension of the input and the output is n × n × m, the comparison of parameter numbers between the original res-block and the stepwise res-block is shown in Table 2.

PLOS ONE
Compared with the original residual module (as shown in Table 2), the stepwise res-block adds two more convolution layers to the network (ignoring the 1×1 convolution layers), while the number of parameters increases by only 30%, which makes it more suitable for composing deep networks and exploring the underlying semantic information. The effectiveness and efficiency of the proposed stepwise res-block are validated in the experiments shown in section 4.4.

Introduction of backbone based on stepwise res-block
Table 2. Parameter comparison between original res-block and stepwise res-block. (Rows: Original res-block; Stepwise res-block.)

The stepwise res-block can be used in mainstream networks to replace the convolution layer or the residual module, but to fully explore its advantages, a backbone based on the proposed stepwise res-block is designed in this section. The network structure is shown in Table 3.
Taking full advantage of the stepwise res-block, the proposed backbone can explore high-level semantic information and conserve low-level detail information at the same time. To explore high-level semantic information, 34 stepwise res-blocks (each containing three 3x3 convolutional layers) are used in the proposed backbone. As a result, the maximum depth of the proposed backbone is 102 3x3 convolutional layers. As a comparison, the widely used ResNet101 contains only 33 3x3 convolutional layers. A deeper network can generate better nonlinear expressions and semantic features. At the same time, the task of each convolutional layer becomes clearer as the network goes deeper [21].
On the other hand, thanks to the stepwise res-block, the output of the proposed backbone has hierarchical receptive fields. Given that each stepwise res-block divides its input equally among the paths processed by 0-3 convolutional layers, the output of the proposed backbone contains features generated by various numbers of convolutional layers (0-102, distributed equally if 1x1 convolutional layers are ignored), which increases the diversity of the features contained in the output. As a result, in addition to high-level semantic features, there are plentiful low-level and mid-level features in the output.
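The depth figures quoted above follow from simple counting, sketched below. The helper name `max_3x3_depth` is hypothetical, and the ResNet-101 stage layout (3, 4, 23 and 3 bottleneck blocks, each with one 3x3 convolution) is taken from the standard architecture rather than from this paper.

```python
# Bottleneck blocks per stage in the standard ResNet-101; each holds one 3x3 convolution.
RESNET101_BLOCKS = (3, 4, 23, 3)


def max_3x3_depth(num_blocks, convs_per_block):
    """Longest chain of 3x3 convolutions through a stack of identical blocks."""
    return num_blocks * convs_per_block


stepwise_depth = max_3x3_depth(34, 3)                      # 34 stepwise res-blocks
resnet101_depth = max_3x3_depth(sum(RESNET101_BLOCKS), 1)  # 33 bottleneck blocks

# Each stepwise block contributes between 0 and 3 layers to a path, so the
# reachable path depths span every integer from 0 to the maximum.
reachable_depths = range(0, stepwise_depth + 1)

print(stepwise_depth, resnet101_depth, len(reachable_depths))
```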

Introduction of feature fusion deep-projection module and multi-scale detection network
Since the feature maps obtained from different levels of the proposed backbone have different dimensions, a feature fusion module is required to mix features generated by different levels of the proposed backbone, so that each detector receives feature maps containing full information about the targets. To avoid feature loss, deep-projection deconvolution is used in the down/up-sampling of the feature maps. As shown in Fig 3, feature maps with different scales go through the feature fusion deep-projection module and the multi-scale detection network to generate detection results. The feature fusion deep-projection module is responsible for mixing feature maps of different levels and restoring them to the specified resolution. On the one hand, low-level and high-level features can be fused together to generate feature maps containing both semantic and detail information for the detectors. On the other hand, the deep-projection deconvolution module implemented in the down/up projection unit reduces the information contamination introduced by up-sampling/down-sampling steps. The detailed process of the proposed feature fusion deep-projection module is presented in section 3.4.1.

The multi-scale detection network of the proposed framework remains the same as in YOLOv3, while the loss function is changed to mine hard samples. The detailed introduction of the multi-scale detection network can be found in section 3.4.2.

Introduction of feature fusion deep-projection module.
In general, the outputs from different levels of the proposed backbone are resampled and concatenated into a collection of feature maps. Multiple 1×1 convolutions are applied before and after the obtained collection of feature maps to implement a feature selection process. The selected features are then restored to feature maps with different resolutions, which are finally sent to the various detectors. Because of the proposed feature fusion network, each detector can use both the high-level and low-level information from the backbone.
Because of the down-sampling and up-sampling steps used in the proposed feature fusion network, information contamination is unavoidable. To reduce the impact of information contamination, inspired by the deep-projection unit used in super-resolution image reconstruction algorithms [32], a deep-projection deconvolution module is implemented in the proposed feature fusion network. The detailed structure of the proposed deep-projection deconvolution module is shown in Fig 4. It contains two different components: the up projection unit (for up-sampling) and the down projection unit (for down-sampling). In the up projection unit, as shown in Fig 4(A), the low-resolution feature maps $LR_1$ first go through a deconvolution layer to produce the high-resolution feature maps $HR_1$. Then $HR_1$ passes through a convolution layer to obtain the low-resolution feature maps $LR_2$. The residual between $LR_2$ and the initial $LR_1$ goes through another deconvolution layer to obtain the high-resolution feature maps $HR_2$. Finally, $HR_2$ (after passing through a convolution layer) and the initially obtained $HR_1$ are added together to obtain the final high-resolution feature maps $HR_3$. In the down projection unit, as shown in Fig 4(B), the high-resolution feature maps $HR_1$ first go through a convolution layer to produce the low-resolution feature maps $LR_1$. Then $LR_1$ passes through a deconvolution layer to obtain the high-resolution feature maps $HR_2$. The residual between $HR_2$ and the initial $HR_1$ goes through a convolution layer to obtain the low-resolution feature maps $LR_2$. Finally, $LR_2$ (after passing through a convolution layer) and the initially obtained $LR_1$ are added together to obtain the final low-resolution feature maps $LR_3$. These projection units continuously correct themselves and produce accurate feature maps, which are later fed into the 3 detectors. The effectiveness of the proposed feature fusion deep-projection module is validated in the experiments shown in section 4.4.
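The order of operations in the two projection units can be sketched on a 1-D signal. This is only an illustration under stated assumptions: the learned deconvolution and strided convolution layers are replaced by fixed stand-ins (`upsample`, `downsample`), the extra convolution applied before the final addition is the hypothetical `refine` argument (identity by default), and the residual is taken as the re-projected signal minus the original, a sign convention the text does not specify.

```python
def upsample(x):
    """Stand-in for the deconvolution layer: nearest-neighbour x2 up-sampling."""
    return [v for v in x for _ in range(2)]


def downsample(x):
    """Stand-in for the strided convolution layer: average of each adjacent pair."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def identity(x):
    """Stand-in for the convolution applied just before the final addition."""
    return list(x)


def up_projection(lr1, up=upsample, down=downsample, refine=identity):
    hr1 = up(lr1)                                   # LR1 -> HR1
    lr2 = down(hr1)                                 # HR1 -> LR2 (project back)
    residual = [a - b for a, b in zip(lr2, lr1)]    # error in low resolution
    hr2 = refine(up(residual))                      # error -> HR2
    return [a + b for a, b in zip(hr1, hr2)]        # HR3 = HR1 + HR2


def down_projection(hr1, up=upsample, down=downsample, refine=identity):
    lr1 = down(hr1)                                 # HR1 -> LR1
    hr2 = up(lr1)                                   # LR1 -> HR2 (project back)
    residual = [a - b for a, b in zip(hr2, hr1)]    # error in high resolution
    lr2 = refine(down(residual))                    # error -> LR2
    return [a + b for a, b in zip(lr1, lr2)]        # LR3 = LR1 + LR2


print(up_projection([1.0, 2.0]))
print(down_projection([1.0, 3.0, 2.0, 2.0]))
```

With these fixed stand-ins the back-projected signal matches the input exactly in the up unit, so the residual is zero and no correction is applied. In the actual module the layers are learned, and the residual branch carries whatever the first projection failed to preserve.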

Introduction of improved multi-scale detection network.
The detection network used in the proposed framework is improved from the YOLOv3 framework. The anchor boxes used in the proposed detection network are optimized according to the target sizes of the evaluation datasets, since the original anchor boxes used in YOLOv3 are obtained by clustering the target sizes in the COCO dataset and are not suitable for vehicle detection in aerial images. Multi-scale detection is kept in the proposed detection network. Each scale uses different anchors, as shown in Fig 5, and is responsible for detecting a different group of targets. The 80×80 scale detector, which is responsible for detecting the mainstream targets, is assigned anchors with medium sizes, e.g. (65×40), (44×75), (40×65). The 20×20 scale detector, which is responsible for detecting relatively large targets, is assigned anchors with relatively large sizes, e.g. (75×42), (50×92), (108×40). The 160×160 scale detector, which is responsible for detecting relatively small targets, is assigned anchors with relatively small sizes, e.g. (38×39), (34×59), (59×35).
The loss function is used to compute the error between the predicted value and the true value. The loss function in our framework is improved from the loss function used in YOLOv3. The original loss function consists of three parts: location prediction box loss, IOU loss and classification loss. A detailed description can be found in [33].
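To make the relation between the three detector scales and their anchors concrete, the sketch below derives each grid's stride from the 640×640 input resolution used in the experiments and picks the anchor whose shape best overlaps a given box. The best-IoU matching rule is the usual YOLO-style convention and is an assumption here, as are the helper names `stride`, `shape_iou` and `best_anchor`; the anchor sizes are the examples listed above.

```python
INPUT_SIZE = 640  # network input resolution used in the experiments

# Example anchors (width, height) per detector grid size, as listed in the text.
ANCHORS = {
    160: [(38, 39), (34, 59), (59, 35)],   # relatively small targets
    80: [(65, 40), (44, 75), (40, 65)],    # mainstream targets
    20: [(75, 42), (50, 92), (108, 40)],   # relatively large targets
}


def stride(grid_size):
    """Number of input pixels covered by one grid cell along each axis."""
    return INPUT_SIZE // grid_size


def shape_iou(box, anchor):
    """IoU of two boxes compared by width and height only (centres aligned)."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def best_anchor(box):
    """Return (grid_size, anchor) with the highest shape IoU for the given box."""
    candidates = ((g, a) for g, anchors in ANCHORS.items() for a in anchors)
    return max(candidates, key=lambda ga: shape_iou(box, ga[1]))


print({g: stride(g) for g in ANCHORS})
print(best_anchor((36, 40)))
```

Under this rule a 36×40 target is matched to the (38×39) anchor of the 160×160 detector, whose cells each cover 4×4 input pixels.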

$$\mathrm{Loss} = \mathrm{Loss}_{location} + \mathrm{Loss}_{class} + \mathrm{Loss}_{confidence} \tag{1}$$

Here, $\lambda_{coord}$ and $\lambda_{noord}$ represent the weights of the corresponding losses, $s^2$ represents the total number of grids in the image, $B$ represents the total number of predicted bounding boxes in each grid, and $\mathbb{1}^{obj}_{ij} = 1$ indicates that the $j$-th predicted bounding box in the $i$-th grid is an effective detection. $\lambda_{coord}$ and $\lambda_{noord}$ are both set to 1 in our network. Since a one-stage network without a region proposal network (RPN) is usually plagued by overwhelming negative samples, a hard sample mining strategy is added to our classification loss. The improved classification loss is shown below.
$\alpha$ represents a constant that is usually equal to the score of vehicle determination. When a sample is a typical negative sample, $P_i(c)$ is close to 0, and this sample has only a small effect on the back propagation process. When a sample is a hard sample, $P_i(c)$ is close to the score of vehicle determination, so $M_{HSM}$ takes a high value, which strengthens its effect on the back propagation process. Therefore, the hard sample mining strategy suppresses the impact of the large number of negative samples on classification, and at the same time strengthens the impact of hard samples on classification.

Brief introduction.
The increase of datasets has been one of the driving forces for the rapid development of object detection algorithms in recent years. Commonly used datasets, such as VOC [34] and COCO [35], not only provide data for various algorithms, but also a baseline for performance comparisons.
Compared with the images contained in datasets designed for general object detection or image classification, aerial images have special characteristics. It is unlikely that vehicle detection algorithms for aerial images can be trained only on datasets designed for general object detection or image classification. As a result, several aerial image datasets, e.g. RSOD [36], INRIA [37], UCAS-AOD [19], VEDAI [18] and DOTA [20], have been released as vehicle detection in aerial images has developed significantly in recent years. In this paper, three commonly used datasets designed for vehicle detection in aerial images, i.e. UCAS-AOD, VEDAI and DOTA, are used in the evaluation section. Since images from these three datasets are taken by UAVs or satellites from high altitudes, the sizes of the targets in these three datasets vary while the imaging angle of these targets remains the same.
The UCAS-AOD dataset was proposed in [19] in 2015. The images contained in UCAS-AOD are selected from Google Earth, and labelled by the Patterns and Intelligent System Development Laboratory in the University of Chinese Academy of Sciences. This dataset contains two categories of targets: vehicles and aircraft. Some details of the dataset are shown in Table 4. All the targets labelled as vehicles are used in the evaluation process.

VEDAI dataset.
The VEDAI dataset was proposed in [18] in 2015. The images of the VEDAI dataset were taken from the satellite in the spring of 2012. The authors manually selected 1210 images of 1024×1024 pixels from the large pool of original images. All of these images were taken at the same height and the same shooting angle from the sky. The selected images have a variety of backgrounds, e.g. fields, grass, mountains, urban areas, etc. As shown in Table 5, the VEDAI dataset contains nine different categories of vehicles: car, pick-up, truck, plane, boat, camping car, tractor, van, and others. These categories are divided into two meta-classes. The "small land vehicles" class consists of the "car", "pick-up", "tractor" and "van" categories. The "large land vehicles" class consists of the "truck" and "camping car" categories. In this paper, all the objects from the "small land vehicles" meta-class are used in the evaluation process.
Since targets in some categories of the DOTA dataset, e.g. "baseball diamond", "tennis court", "basketball court", "harbor", "bridge", are irrelevant to the theme of this paper, only the targets labelled as "small vehicle", "ship" and "plane" are used in this partition to evaluate the performance of the proposed algorithm on targets with various scales. In these three categories, small vehicles have relatively stable scales (from 2×9 to 89×124), planes have various scales (from 9×8 to 856×842) but distinctive shapes, while the scales of ships vary widely (from 5×12 to 939×1750).

Training detail and preprocessing on dataset
The proposed FFDP-CNN is an end-to-end one-stage algorithm. It is implemented on the TensorFlow 2.0 deep learning framework, and trained and evaluated using an NVIDIA GeForce RTX 2080Ti with 11 GB of memory. The weight decay and the momentum are set to 1e-4 and 0.9 respectively. In the experiments, the learning rate is set to 1e-3 initially, and gradually decays to 1e-4 and 1e-5.
Images from the three datasets are pre-processed before being fed into the network. Through the sliding window, padding, and stride mechanism, all datasets are processed into images with a resolution of 640×640. Besides, the annotations of all datasets are optimized. The optimization process corrects the vehicle labels if targets are obviously missing. This is a common dataset optimization method that is also used in [21,27].
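The sliding-window step can be illustrated by computing where the 640×640 windows start along one image axis and how much padding the last window needs. This is a sketch under assumptions: the stride of 512 pixels is a hypothetical value chosen for illustration (the stride is not stated here), and the helper names `tile_origins` and `required_padding` are not from the paper.

```python
import math


def tile_origins(size, window=640, stride=512):
    """Start offsets of sliding windows covering an axis of `size` pixels.

    The stride of 512 is a placeholder value, not one given in the paper.
    """
    if size <= window:
        return [0]
    steps = math.ceil((size - window) / stride)
    return [i * stride for i in range(steps + 1)]


def required_padding(size, window=640, stride=512):
    """Pixels of padding needed so that the last window fits entirely."""
    return tile_origins(size, window, stride)[-1] + window - size


# A 1024-pixel axis (the VEDAI image size) with the placeholder stride.
print(tile_origins(1024), required_padding(1024))
```

Applying the same offsets along both axes yields the grid of crops; an image smaller than the window is covered by a single padded crop.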

Evaluation method
To evaluate the performance of the proposed framework, Average Precision (AP), mean Average Precision (mAP) and F1 are used in the experiments.

P-R curve, AP and mAP.
The main purpose of object detection is to find targets and classify them into various categories. The evaluation indicators for these two tasks are "the proportion of the correct targets detected to all targets" and "the correct rate of the classification of detected targets", which can be represented by recall (abbreviated as R) and precision (abbreviated as P). These two measurements are defined using true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
The precision is calculated as shown in formula (7):
$$P = \frac{TP}{TP + FP} \tag{7}$$
The recall is calculated as shown in formula (8):
$$R = \frac{TP}{TP + FN} \tag{8}$$
When comparing the performance of different frameworks, the precision and recall indicators sometimes appear contradictory. Moreover, it is necessary to measure the accuracy of a detector with a single index. Therefore, AP (average precision) was proposed in [34]. The detections are sorted by score, P and R are computed at each rank, and the resulting points are plotted in a rectangular coordinate system with P and R as the coordinates. This plot is called the P-R curve. The area under the P-R curve is AP, and mAP (mean average precision) is the average of the APs of multiple categories. AP measures the quality of the detector in each category, while mAP measures the quality of the detector across all categories. As a result, in this paper, AP is used when targets are from only one category (i.e. evaluations using the UCAS-AOD and VEDAI datasets), and mAP is used when targets are from multiple categories (i.e. evaluation using the DOTA dataset).
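The construction of the P-R curve and the area under it can be sketched as follows. The exact interpolation used in the experiments is not stated, so this sketch assumes the common all-point scheme in which precision is first made non-increasing in recall; the function names and the toy detections are illustrative only.

```python
def precision_recall_points(detections, num_gt):
    """Return (precision, recall) after each detection, ranked by score.

    `detections` is a list of (score, is_true_positive) pairs and `num_gt`
    is the number of ground-truth targets (TP + FN).
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / num_gt))
    return points


def average_precision(points):
    """Area under the P-R curve using the all-point interpolated envelope."""
    precisions = [p for p, _ in points]
    recalls = [r for _, r in points]
    # Replace each precision by the best precision reachable at equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


# Toy example: three detections for two ground-truth vehicles.
toy = [(0.9, True), (0.8, False), (0.7, True)]
points = precision_recall_points(toy, num_gt=2)
print(points, average_precision(points))
```

mAP is then the plain mean of the per-category AP values.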

F-Measure.
F-Measure (also known as F-Score) is another commonly used indicator for object detectors. The F-Measure is a weighted harmonic average of Precision (P) and Recall (R), controlled by a weight α. When α equals 1, the F-Measure is called F1, which reduces to $F1 = 2PR / (P + R)$. F1 combines the results of P and R: a higher F1 means that the detector is more effective. The P and R values of the proposed framework on various datasets are also provided for comprehensive evaluation.
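A small helper makes the relation between the weighted F-Measure and F1 explicit. The weighted form used below, $(1 + \alpha^2) P R / (\alpha^2 P + R)$, is the usual textbook definition and is assumed here, since the paper's own equation is not reproduced in this text; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, alpha=1.0):
    """Weighted harmonic mean of precision and recall (assumed textbook form).

    With alpha = 1 this is F1 = 2PR / (P + R).
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is detected correctly
    a2 = alpha * alpha
    return (1 + a2) * precision * recall / (a2 * precision + recall)


print(f_measure(0.5, 1.0))
```

Because the mean is harmonic, F1 stays close to the smaller of P and R, so a detector cannot obtain a high score by maximizing only one of the two.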

Results on UCAS-AOD dataset
The effectiveness of the proposed framework has been validated on the UCAS-AOD dataset. The APs of the proposed FFDP-CNN detector and 18 other state-of-the-art vehicle detection algorithms for aerial images are shown in Table 7. It can be seen that the proposed FFDP-CNN detector achieves an AP of 97.34%, which is roughly 1.2% higher than the APs achieved by UCAS + NWPU + VS-GANs 2019 [42] and Improved FBPN-Based Detection Network [38]. The P-R curve and the detailed performance of the proposed FFDP-CNN detector are shown in Fig 6 and Tables 7 and 8, respectively.
To demonstrate the effectiveness of the proposed stepwise res-block and feature fusion deep-projection module, two other algorithms (FFDP-CNN with Res2Net and FPN with Stepwise Res-block) have been evaluated, and the evaluation results are shown in Table 8.

Effectiveness of stepwise res-block.
Compared with the proposed FFDP-CNN (which utilizes the Stepwise Res-block and FFDP), FFDP-CNN with Res2Net utilizes Res2Net blocks to build its backbone while the other parts remain the same as in FFDP-CNN. In other words, the only difference between FFDP-CNN and FFDP-CNN with Res2Net is the res-block used in their backbones. As a result, the effectiveness of the proposed Stepwise Res-block can be demonstrated by comparing the evaluation results of FFDP-CNN and FFDP-CNN with Res2Net. According to the evaluation results shown in Table 8, FFDP-CNN with Res2Net achieves an AP of 95.38%, which is 1.96 percentage points lower than the AP obtained by FFDP-CNN. In other words, compared with Res2Net, the Stepwise res-block gains an advantage of 1.96% AP in the evaluation.
According to Table 8, it can be seen that compared with the standard FPN, the feature fusion deep-projection module (FFDP) achieves a solid performance advantage of 2.48% AP.

Parameter number and processing speed.
To demonstrate the efficiency of the proposed stepwise res-block, the parameter numbers of the proposed FFDP-CNN and other algorithms are shown in Table 9. It can be seen that FFDP-CNN has fewer parameters than other state-of-the-art algorithms (i.e. YOLO, Faster RCNN and An Improved FBPN).
Because of the complexity of the proposed feature fusion and deep-projection module, the processing speed of the proposed FFDP-CNN for 640×640 color images is 14.01 FPS on an NVIDIA 2080Ti GPU.

Performance analysis.
The detection results of the proposed FFDP-CNN are analyzed manually to get a comprehensive understanding of its performance. In the detection results, most targets can be found correctly regardless of their orientations, colours and scales, which also demonstrates the effectiveness of the proposed framework. However, the proposed algorithm sometimes fails when parts of the targets are blocked by trees or building roofs in the input images, especially when more than half of the target is blocked.

Results on VEDAI dataset
The proposed FFDP-CNN is trained and evaluated on objects labelled as "small land vehicles" in VEDAI dataset. The APs of the proposed detector and other 13 state-of-the-art algorithms

Results on DOTA dataset
The proposed framework has also been evaluated on the DOTA dataset. The evaluation results of the proposed FFDP-CNN detector and 16 other state-of-the-art algorithms are shown in Table 12. In general, it can be seen that the proposed framework achieves an mAP of 89.86%, which is higher than the mAPs obtained by the other state-of-the-art algorithms. The P-R curves of the proposed FFDP-CNN algorithm on the DOTA dataset are shown in Fig 8. Table 13 shows the detailed evaluation results of the proposed framework. Table 12 also demonstrates that the proposed FFDP-CNN achieves the best performance when detecting small vehicles (with an AP of 89.78%), and the second-best performance when detecting planes and ships (with APs of 91.33% and 88.48% respectively). As a result, compared with other algorithms, the performance of FFDP-CNN remains relatively stable when the scales of the targets vary widely.

Conclusion and future works
In this paper, a Feature Fusion Deep-Projection Convolution Neural Network (FFDP-CNN) is proposed to detect vehicles in aerial images. The main contribution of this research has three aspects. Firstly, a novel residual block named stepwise res-block is designed in this paper. Thanks to its special hierarchical structure, the output of the stepwise res-block contains features processed by 0-3 convolutional layers. Besides, the parameter number is kept at a relatively low level. Secondly, based on the proposed stepwise res-block, the backbone of the proposed FFDP-CNN is designed. By composing 34 stepwise res-blocks, the proposed backbone produces features processed by 0-102 3x3 convolutional layers. It can explore high-level semantic features and conserve low-level detail features at the same time. Last but not least, a feature fusion module is used to mix features generated from different levels of the backbone, and further balance the low-level and high-level features. A specially designed deep-projection deconvolution module is utilized in the proposed feature fusion module to reduce the impact of information contamination introduced by down-sampling/up-sampling processes. According to the evaluation results, the proposed FFDP-CNN outperforms other state-of-the-art algorithms on the UCAS-AOD, VEDAI and DOTA datasets. According to the detection results, although the proposed FFDP-CNN can detect targets regardless of their orientations, colours and scales, its ability to detect partially blocked targets can still be improved. In future work, GANs will be used to generate more partially blocked training data to train the proposed framework to detect partially blocked targets, and attention mechanisms will be implemented in the stepwise res-block to further increase the detection accuracy.

Author Contributions
Conceptualization: Bin Xu.
Formal analysis: Bin Xu.
Funding acquisition: Bin Wang.
Project administration: Bin Wang.