Abstract
In the field of UAV aerial image processing, ensuring accurate detection of tiny targets is essential. Current UAV aerial image target detection algorithms face the challenge of simultaneously achieving low computational demands, high accuracy, and fast detection speeds. To address these issues, we propose an improved, lightweight algorithm: LCFF-Net. First, we propose the LFERELAN module, designed to enhance the extraction of tiny target features and optimize the use of computational resources. Second, a lightweight cross-scale feature pyramid network (LC-FPN) is employed to further enrich feature information, integrate multi-level feature maps, and provide more comprehensive semantic information. Finally, to increase model training speed and achieve greater efficiency, we propose a lightweight, detail-enhanced, shared convolution detection head (LDSCD-Head) to optimize the original detection head. Moreover, we present different scale versions of the LCFF-Net algorithm to suit various deployment environments. Empirical assessments conducted on the VisDrone dataset validate the efficacy of the proposed algorithm. The LCFF-Net-n model outperforms the baseline-s model, achieving a 2.8% increase in the mAP50 metric and a 3.9% improvement in the mAP50–95 metric while reducing parameters by 89.7%, FLOPs by 50.5%, and computation delay by 24.7%. Thus, LCFF-Net offers high accuracy and fast detection speeds for tiny target detection in UAV aerial images, providing an effective lightweight solution.
Citation: Tang D, Tang S, Fan Z (2024) LCFF-Net: A lightweight cross-scale feature fusion network for tiny target detection in UAV aerial imagery. PLoS ONE 19(12): e0315267. https://doi.org/10.1371/journal.pone.0315267
Editor: Yile Chen, Macau University of Science and Technology, MACAO
Received: October 5, 2024; Accepted: November 23, 2024; Published: December 19, 2024
Copyright: © 2024 Tang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The VisDrone dataset referenced in this study is publicly accessible and can be retrieved from https://github.com/VisDrone/VisDrone-Dataset. The source code for the LCFF-Net model is available at https://github.com/Tdzdele/LCFF-Net.
Funding: This research was funded by Heilongjiang Postdoctoral Fund to pursue scientific research (grant number LBH-Z23025), Collaborative Innovation Achievement Program of Double First-class Disciplines in Hei-longjiang Province (grant number LJGXCG2022-085), National College Students’ Innovation and Entrepreneurship Training Program of China (No.202410240065, No.202410240030, No.202410240080 and No.202310240082) and Heilongjiang Provincial College Students’ Innovation and Entrepreneurship Training Programme (No.S202410240020). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Recent advancements in control systems and target detection technologies for unmanned vehicles have had a substantial impact on various sectors, most notably transportation [1]. Among unmanned vehicle types, Unmanned Aerial Vehicles (UAVs) have become essential assets across diverse domains, including agriculture, surveillance, disaster response, and infrastructure inspection. The unique aerial perspectives provided by UAVs, combined with their efficiency and flexibility in data collection, underscore their value in these applications. Nevertheless, the majority of state-of-the-art deep neural network architectures, including SSD [2], R-CNNs [3–6], DETRs [7–9], and YOLOs [10–19], are predominantly designed and benchmarked on manually collected image datasets, such as MS-COCO [20].
These image datasets were primarily captured manually by photographers, resulting in most images being taken from a human perspective, with a limited number featuring overhead aerial views. The targets within the images generally occupy a significant portion of the frame, and the photographs were typically selected under optimal lighting conditions to minimize issues such as glare, overexposure, and underexposure. Additionally, each image tends to include a limited number of targets, with photographers often striving to exclude extraneous objects or backgrounds unrelated to the primary subject.
In contrast, images obtained from UAVs exhibit several distinct characteristics compared to those captured manually. Firstly, the target object generally constitutes only a minor fraction of the overall image area, which can complicate target detection and analysis. Additionally, lighting conditions in UAV-captured images often vary widely; some images suffer from glare or overexposure, while others may have insufficient lighting. These images are predominantly taken from high-altitude, overhead perspectives, with few captured from ground level or elevated angles. Moreover, the backgrounds in these images are often complex, featuring numerous targets that are densely arranged and overlap significantly, which can lead to challenges in distinguishing between similar-looking objects and may induce perturbations in image processing and analytic tasks.
Beyond the unique characteristics of image data, UAV aerial target detection methods are applied in two distinct scenarios. The first scenario involves using a high-performance desktop computer or server to process image data captured by UAVs in order to ensure a high level of detection accuracy. The second scenario requires real-time processing of aerial image data using the UAV’s onboard embedded microcomputer, which is typically employed for obstacle avoidance and automated mission planning. Neural network-based target detection methods must be tailored to the specific requirements of each scenario. For desktop environments, maximizing detection accuracy is crucial, whereas in embedded environments, models must balance accuracy with limited computational and memory resources. Consequently, the neural network methods for target detection should be specifically optimized to tackle the distinct complexities inherent in UAV-captured aerial imagery, thereby fulfilling the diverse requirements of various application scenarios.
Considering the unique challenges that UAV aerial imagery poses for target detection algorithms, this paper proposes a lightweight, efficient, and detail-enhanced detection algorithm, LCFF-Net. The proposed method further explores and exploits granular detail and intrinsic features within the channel domain through techniques such as reparameterisation, shared convolution, and multi-scale feature fusion. These strategies significantly improve detection accuracy and optimize model efficiency.
The primary contributions of this paper can be succinctly categorized into three fundamental areas.
- A novel LFECB (Lightweight Feature Extraction Convolution Block) is introduced, which serves as the foundation for the proposed LFERELAN (Lightweight Feature Extraction Reparameterised Efficient Layer Aggregation Network). This network incorporates concepts from CSPNet, GELAN, and the reparameterised convolution technique to extract features at a lower cost.
- LR-NET and LC-FPN, both grounded in the LFERELAN architecture, were engineered to optimize the backbone and neck networks of the model. These enhancements are designed to bolster the model’s capacity for detecting minute targets in UAV aerial imagery, all while maintaining an efficient architectural structure. In addition, a LDSCD-Head (Lightweight Detail-Enhanced Shared Convolution Detection Head), is proposed to further streamline the model’s complexity.
- A series of LCFF-Net models of varying scales is proposed to accommodate different application scenarios. These range from ultra-small models optimized for deployment on embedded devices under extreme conditions to high-precision, large-parameter models intended for desktop platforms.
The structure of the paper is as follows: Section Related work offers an extensive review of related literature. Section Methods presents a detailed explanation of the enhanced LCFF-Net detection methodology. Section Results delineates the experimental setup and parameter configurations, followed by a comprehensive series of comparative evaluations, ablation analyses, and visual comparisons to substantiate the merits of the proposed model. Section Discussion and conclusion offers a synthesis of the experimental results and explores potential avenues for future research.
Related work
Amid ongoing advancements in deep learning-driven target detection technologies, numerous real-time detection algorithms have been introduced. Among these, the YOLO (You Only Look Once) [10] algorithm, introduced in 2015, has garnered significant attention due to its accuracy and detection speed. The algorithm achieves high efficiency, flexibility, and good generalization performance through multi-scale feature fusion and multi-level prediction, ensuring robust detection accuracy.
Jocher et al. [14] proposed YOLOv5, which introduced the CSP (Cross Stage Partial) structure and the focus module in the backbone network. Moreover, this study extended the CSP architecture of CSPNet to the neck and integrated the PAFPN (Path Aggregation Network) module to bolster the network’s feature fusion capabilities, while the original SPP (Spatial Pyramid Pooling) was substituted with the more efficient SPPF (Fast SPP) structure. Li et al. [15] developed YOLOv6, employing the Rep-PAN structure in the neck network to maintain robust multi-scale feature fusion. Furthermore, an Efficient Decoupled Head was engineered in this work to accelerate model convergence. Wang et al. [16] further advanced the series with YOLOv7 by introducing the E-ELAN (Extended Efficient Layer Aggregation Network). This network optimally leverages model parameters by controlling the shortest and longest gradient pathways, thus augmenting the model’s ability to learn multi-scale features. Building upon YOLOv5, Jocher et al. [17] introduced YOLOv8, which features the C2f (CSP Bottleneck with 2 Convolutions) fusion module in the Backbone structure. This module integrates the advantages of the ELAN and C3 fusion mechanisms. Furthermore, an anchor-free approach and Decoupled-Head architecture were employed to decouple the classification and detection heads, achieving an optimal trade-off between accuracy and model efficiency. Wang et al. [18] proposed YOLOv9, integrating PGI (Programmable Gradient Information) and a GELAN (General Efficient Layer Aggregation Network) to address the issue of information loss during the feedforward process. This approach facilitates model updates and enhances detection accuracy. In the latest iteration, YOLOv10, Wang et al. [19] introduced NMS-free training and a dual allocation strategy, which reduces inference delay and computational redundancy.
Target detection in UAV imagery poses distinct challenges, including object occlusion, background noise, and fluctuating lighting conditions [21]. Unlike conventional images, UAV aerial imagery is often captured from top-down perspectives, complicating object detection [22].
Zhang et al. [23] proposed Drone-YOLO, based on YOLOv8-l, incorporating the RepVGG reparameterized convolution module as the downsampling layer. In the neck, a small-sized object detection head was added by expanding the PAFPN structure to three layers. This improvement enhances the model’s capacity to spatially localize and classify target objects, thus boosting detection accuracy. Yue et al. [24] designed the LHGNet backbone based on the HGNetv2 concept, integrating the lightweight LHG block to enhance spatial feature fusion and expand the receptive field. This study further incorporated the GSConv, LGS bottleneck, LGSCSP fusion module, and LGSneck, which synergistically enhanced detection accuracy and reinforced the model’s lightweight design, particularly for small to medium-sized targets in UAV imagery. Huang et al. [25] developed EDGS-YOLO, an improvement on YOLOv8, incorporating the DDetect detection head with deformable convolution (DCNv2) to minimize feature information loss and enhance local detail collection. The use of EMA in the neck network further increases accuracy by focusing on critical regions in the image. Additionally, the C3Ghost module, which combines GhostConv and C3 modules, effectively reduced model size without sacrificing performance. Zhao et al. [26] advanced ITD-YOLOv8 by replacing conventional convolution in the neck structure with the lightweight AKConv, reducing model complexity. This work also introduced the VoVGSCSP and CoordAtt attention mechanisms, which enhance global contextual information and multi-scale features, thus improving accuracy and robustness in detecting infrared-occluded targets in complex environments.
Contemporary UAV-based detection algorithms aim to maximize the accuracy of tiny object detection; however, they often incur increased model parameters and higher computational resource demands. In light of these challenges, this paper introduces a series of efficient, lightweight structures (including LFERELAN, LR-NET, LC-FPN, and LDSCD-Head) designed to reduce model parameters and computational resource demands while enhancing the accuracy of tiny object detection.
Methods
The architecture of the LCFF-Net model introduced in this paper constitutes a significant enhancement over the YOLO framework. This model employs the LR-NET, LC-FPN, and LDSCD-Head as its backbone network, neck network, and detection head, respectively; these components are introduced for the first time in this study. A detailed illustration of the LCFF-Net model’s architecture is presented in Fig 1.
Furthermore, this study refined the model’s scale configurations to develop a series of LCFF-Net models of varying scales. Significantly, the LCFF-Net-a model, distinguished by its remarkably low parameter count and computational overhead, is highly optimized for embedded deployment in resource-limited environments. This series of models demonstrated strong detection performance on the VisDrone dataset, even without employing the widely adopted attention mechanism.
Lightweight feature extraction reparameterised efficient layer aggregation network
In the YOLO network architecture, the C2f (CSP Bottleneck with 2 Convolutions) module plays a crucial role in implementing cross-stage partial fusion. However, the bottleneck design in the C2f module incurs significant resource overhead. To mitigate this challenge, this study proposes an LFECB (Lightweight Feature Extraction Convolutional Block) that leverages PConv (Partial Convolution) [27] and CGLU (Convolutional Gated Linear Unit) [28] to aggregate multi-scale information more effectively while minimizing resource consumption.
PConv is an advanced convolutional technique tailored to minimize computational redundancy and memory consumption by selectively applying standard convolution operations to a subset of input channels dedicated to spatial feature extraction, while preserving the remaining channels unchanged. The CGLU functions as a channel mixer that incorporates a 3 × 3 depth-wise separable convolution prior to the activation function of the GLU (Gated Linear Unit) within the gated branch. This architecture allows each token to receive a unique gating signal derived from its closest fine-grained features, thereby promoting more efficient and precise information processing. In LFECB, the output of PConv undergoes further processing through the CGLU, which optimally utilizes all channel information while minimizing computational redundancy. Additionally, DropPath is implemented following CGLU to randomly discard certain branch information, thereby enhancing the robustness of the model. Shortcut connections are also employed to improve gradient propagation and facilitate information flow.
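The channel-splitting idea behind PConv can be sketched as follows. This is a minimal, hypothetical re-implementation for illustration only (the class name, the default split ratio `n_div = 4`, and the use of a plain 3×3 convolution are assumptions, not the authors' code): a regular convolution is applied to only the first `dim // n_div` channels, while the remaining channels pass through untouched.

```python
import torch
import torch.nn as nn

class PConvSketch(nn.Module):
    """Partial convolution sketch: convolve only a subset of channels,
    leave the rest unchanged, then concatenate. Hypothetical code."""
    def __init__(self, dim: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = dim // n_div          # channels that get convolved
        self.dim_keep = dim - self.dim_conv   # channels passed through as-is
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, 1, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

x = torch.randn(1, 64, 32, 32)
y = PConvSketch(64)(x)   # shape is preserved: (1, 64, 32, 32)
```

With the default split, only 16 of the 64 channels incur convolution FLOPs, which is the source of the computational saving described above.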
The LFERELAN (Lightweight Feature Extraction Reparameterised Efficient Layer Aggregation Network) proposed in this paper integrates the core concepts of CSPNet [29] and GELAN [18] architectures, while also introducing the LFECB and RepConv (reparameterized convolution) [30].
LFERELAN utilizes 1×1 convolutional layers at both the input and output stages to modulate the channel dimensions. The processing results at each step occur in a branched manner: one branch directs the output immediately, while the other branch continues sequentially. Initially, the input with the adjusted number of channels undergoes a RepConv operation, followed by several cascaded LFECBs. After passing through a final 1×1 convolution, the output merges with the other branch from the preceding layers before reaching the final output.
The core idea of the LFERELAN structural design is to reduce computational and memory overhead during inference by using RepConv to reparameterize multi-branch convolutional layers into single-branch layers, while also employing LFECB in an optimized network structure for efficient feature fusion and extraction. A comprehensive illustration of the LFECB and LFERELAN architectures is presented in Fig 2.
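The branching pattern described above can be illustrated with a structural sketch. All internal operators here are stand-ins (plain 3×3 convolutions in place of RepConv and the LFECBs), and the class name and channel choices are assumptions; the point is only the aggregation topology, in which every intermediate result is both passed onward and kept for the final concatenation.

```python
import torch
import torch.nn as nn

class LFERELANSketch(nn.Module):
    """Structural sketch of the LFERELAN aggregation topology.
    1x1 convs adjust channels at input/output; intermediate stages feed
    the next stage and are retained for the final concat. Hypothetical
    approximation, not the authors' implementation."""
    def __init__(self, c_in: int, c_out: int, c_mid: int = 32, n_blocks: int = 2):
        super().__init__()
        self.cv_in = nn.Conv2d(c_in, 2 * c_mid, 1)
        self.rep = nn.Conv2d(c_mid, c_mid, 3, 1, 1)   # stands in for RepConv
        self.blocks = nn.ModuleList(
            nn.Conv2d(c_mid, c_mid, 3, 1, 1) for _ in range(n_blocks))  # stand-ins for LFECBs
        # 2 initial branches + RepConv stage + n_blocks LFECB stages are concatenated
        self.cv_out = nn.Conv2d((3 + n_blocks) * c_mid, c_out, 1)

    def forward(self, x):
        y = list(self.cv_in(x).chunk(2, dim=1))   # split into two branches
        y.append(self.rep(y[-1]))                 # RepConv-style stage
        for blk in self.blocks:                   # cascaded block stages
            y.append(blk(y[-1]))
        return self.cv_out(torch.cat(y, dim=1))   # merge all branches

x = torch.randn(1, 64, 40, 40)
y = LFERELANSketch(64, 128)(x)   # -> (1, 128, 40, 40)
```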
Lightweight reparameterised net
LR-NET (Lightweight Reparameterised Net) serves as the backbone network of LCFF-Net. It builds upon the original YOLO backbone architecture by replacing the C2f structure with the LFERELAN structure and introducing spatial-channel decoupled downsampling (SCDown) [19], which consists of only two convolutional operations.
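The two-convolution structure of SCDown can be sketched as below, following the YOLOv10 design it is cited from: a 1×1 pointwise convolution handles the channel change, and a stride-2 depthwise 3×3 convolution handles the spatial downsampling. The class name and the omission of normalization/activation layers are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SCDownSketch(nn.Module):
    """Spatial-channel decoupled downsampling sketch (after YOLOv10):
    channel mixing and spatial reduction are performed by two cheap,
    separate convolutions. Hypothetical re-implementation."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.pw = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)               # channel change
        self.dw = nn.Conv2d(c_out, c_out, 3, 2, 1, groups=c_out, bias=False)  # stride-2 depthwise

    def forward(self, x):
        return self.dw(self.pw(x))

x = torch.randn(1, 64, 40, 40)
y = SCDownSketch(64, 128)(x)   # -> (1, 128, 20, 20)
```

Decoupling the two roles keeps the parameter cost at roughly `c_in*c_out + 9*c_out`, far below the `9*c_in*c_out` of a single stride-2 3×3 convolution.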
The following equations detail the parameter calculations for each module within LR-NET. Specifically, Eq (1) defines the parameters for the Conv module, Eq (2) for the RepConv module in the LFERELAN module, Eq (3) for the LFECB module, Eq (4) for the LFERELAN module, Eq (5) for the SCDown module, and Eq (6) for the SPPF module. In these equations, C denotes the number of input and output feature channels, K indicates the convolution kernel size used within the modules, and n, Ex, and Ms represent the proportional parameters relevant to the LFERELAN module.
(1)
(2)
(3)
(4)
(5)
(6)
The core objective of LR-NET is to replace the bulkier C2f with the lighter and more efficient LFERELAN module, which significantly decreases both model parameters and computational overhead while improving performance. Additionally, the introduction of SCDown further reduces parameters and resource consumption. Consequently, the lightweight LR-NET enhances feature extraction and fusion capabilities, minimizing overall parameters and computational demands, thereby increasing its applicability for deployment in embedded systems.
The efficacy of LR-NET was demonstrated through the ablation study discussed in Section Ablation experiment. A detailed depiction of the LR-NET architecture is provided in Fig 3.
Lightweight cross-scale feature pyramid network
The original neck structure of the YOLO model does not fully exploit the image features extracted at each layer of the backbone network. This limitation leads to suboptimal performance in certain specialised application scenarios, such as object detection in UAV imagery. To mitigate this challenge, this paper proposes an LC-FPN (Lightweight Cross-scale Feature Pyramid Network).
The LC-FPN is inspired by the CCFF (CNN-based Cross-scale Feature Fusion) [8] structure. It further adjusts and optimizes the feature fusion method and architecture. Additionally, it incorporates the LFERELAN proposed in this paper, along with SCDown. The LC-FPN is designed to function as the neck network of LCFF-Net. It first standardizes the dimensionality of image features extracted from each layer of the backbone network, followed by a hierarchical extraction and fusion of these features. Finally, downsampling is applied to output the image features at each layer for the detection head. Through this process, the information extracted by the backbone network is comprehensively utilized, particularly enhancing the detection of tiny objects by focusing on their specific feature representations.
The effectiveness of LC-FPN has also been verified through the ablation experiment detailed in Section Ablation experiment. In comparison to the original neck architecture of the YOLOv8 model, LC-FPN notably reduces the parameter count while enhancing accuracy, particularly in detecting tiny objects within UAV aerial imagery. The LC-FPN architecture is depicted in Fig 4.
Lightweight detail-enhanced shared convolution detection head
In the original YOLO model, three detection heads independently extract image features through two branches, each consisting of two consecutive 3×3 convolutions followed by a 1×1 convolution. However, this architecture substantially inflates the model’s parameter count, with the detection head alone accounting for one-fifth of the total parameters in the entire YOLO algorithm. Furthermore, the conventional single-scale prediction structure employed by the original YOLO detection head proves insufficient for multi-scale target detection. This shortcoming arises from the model’s dependence on predictions from a single feature map scale, thus overlooking the potential contributions of multi-scale features. To mitigate this challenge, this paper proposes a novel detection head structure termed LDSCD-Head (Lightweight Detail-Enhanced Shared Convolution Detection Head).
The LDSCD-Head introduces a modification to the original YOLO architecture by replacing the multiple independent convolutions of the three detection heads with a shared group-normalized DEConv (Detail-Enhanced convolution) [31] and two additional shared group-normalized convolutions, as depicted in Fig 5 (blue and green sections). To address the issue of inconsistent target scales detected by each head, a low-computation-cost scaling layer is incorporated at the input of each detection head, aligning it to a unified dimension (illustrated in the yellow section of Fig 5). This architectural adjustment significantly reduces the number of parameters while leveraging richer feature scales for target detection, thereby enhancing the multi-scale sensing capabilities of the detection heads and improving overall detection performance.
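The weight-sharing idea can be sketched as follows. This is an illustrative simplification under stated assumptions: DEConv is approximated by a plain 3×3 convolution, the per-level scaling is modeled as a learnable scalar, and the neck is assumed to deliver equal channel counts at every pyramid level. None of the names below come from the authors' code.

```python
import torch
import torch.nn as nn

class ScaleLayer(nn.Module):
    """Cheap per-level learnable scalar, standing in for the scale layer."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
    def forward(self, x):
        return x * self.scale

class SharedHeadSketch(nn.Module):
    """Shared-convolution head sketch: one group-normalised conv stack and
    one 1x1 predictor are reused across all pyramid levels; only the
    scale layers are level-specific. Hypothetical code."""
    def __init__(self, ch: int, n_out: int, n_levels: int = 3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1, bias=False), nn.GroupNorm(16, ch), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, 1, 1, bias=False), nn.GroupNorm(16, ch), nn.SiLU(),
        )
        self.pred = nn.Conv2d(ch, n_out, 1)  # shared prediction conv
        self.scales = nn.ModuleList(ScaleLayer() for _ in range(n_levels))

    def forward(self, feats):
        return [s(self.pred(self.shared(f))) for f, s in zip(feats, self.scales)]

feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
outs = SharedHeadSketch(64, 68)(feats)   # three outputs, one per level
```

Because the convolution weights appear once rather than three times, the head's parameter count is roughly a third of the independent-head design, which is the reduction the text describes.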
Proposed models
To adapt the LCFF-Net algorithm to various task environments, this paper proposes several scales of the model, ranging from the smallest to the largest, denoted by configurations a, t, n, s, m, and l. These configurations are distinguished through adjustments in key architectural parameters: depth, width, and the maximum number of channels.
The depth parameter determines the number of repetitions for each feature extraction stage in the model. Increasing the depth enhances the model’s complexity and its capacity for representation; however, it also results in greater computational demand and extended inference times. The width parameter controls the number of convolutional channels within each layer, effectively setting the width of each layer’s feature space. While a larger width enables more comprehensive feature extraction at each layer, it also increases computational requirements and the total number of parameters. Finally, the maximum number of channels parameter sets an upper limit on the number of channels within each convolutional layer, ensuring that the number of channels does not exceed this specified threshold. By constraining the channel count, the model avoids excessive depth-induced channels, reducing both computational load and memory usage.
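The interaction of the three parameters can be illustrated with two small helpers. The exact rounding rule is an assumption (a YOLO-style round to the nearest multiple of 8); the function names and example multipliers are hypothetical.

```python
def scaled_channels(base_ch: int, width: float, max_channels: int) -> int:
    """Apply the width multiplier, cap at max_channels, and round to a
    multiple of 8 (assumed YOLO-style rounding)."""
    c = min(base_ch * width, max_channels)
    return max(8, int(round(c / 8)) * 8)

def scaled_depth(base_n: int, depth: float) -> int:
    """Apply the depth multiplier to a stage's repeat count (at least 1)."""
    return max(1, round(base_n * depth))

# With hypothetical multipliers width=0.25, depth=0.33, max_channels=1024:
scaled_channels(256, 0.25, 1024)    # 64-channel layer
scaled_depth(3, 0.33)               # stage repeated once
scaled_channels(1536, 1.0, 1024)    # capped at the channel ceiling
```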
Table 1 summarizes the main differences among these configurations, including standard parameter counts and computation volumes when the model is configured for 80 input detection categories.
Among the different scales of the LCFF-Net model, LCFF-Net-a stands out for its lightweight design and low computational requirements, while preserving a commendable level of accuracy, rendering it ideal for deployment in resource-constrained or extreme environments. Both LCFF-Net-n and LCFF-Net-t exhibit reduced parameter counts, offer satisfactory performance, and deliver faster processing speeds, making them applicable to a broader range of scenarios. In contrast, LCFF-Net-l features the highest parameter count and computational complexity, alongside superior detection accuracy. It is most appropriate for deployment on high-performance desktop computer servers.
Results
Dataset
Compiled by the AISKYEYE team at the Machine Learning and Data Mining Lab of Tianjin University, China, the VisDrone [32] dataset is meticulously designed to facilitate target detection, tracking, and counting tasks from aerial UAV perspectives. The dataset was curated from an extensive range of UAV-mounted camera footage captured across 14 cities in China, encompassing vast geographical expanses spanning thousands of kilometers. It comprises 288 video clips totaling 261,908 frames, together with 10,209 still images. Image resolutions range from 480×360 pixels to 2000×1500 pixels, capturing diverse environments such as crowded streets, intersections, and public places. It encompasses various lighting conditions (e.g., daytime, nighttime, and bright light exposure), target types (e.g., people, cars, and tricycles), weather conditions (e.g., sunny, cloudy, and rainy), and densities (e.g., crowded or sparse scenarios).
Over 2.6 million bounding boxes corresponding to common objects such as people, cars, motorcycles, and tricycles were manually annotated on these images, exemplifying the diversity and complexity of real-world urban environments. The perspective of these images captured from UAVs differs significantly from that of ground-level datasets (e.g., MS-COCO [20] and VOC2012 [33]) regarding camera angle, object scale, background, and weather conditions. This variability increases the dataset’s complexity, establishing it as a crucial resource for assessing the performance and robustness of computer vision models. Fig 6 presents samples from the VisDrone dataset.
To thoroughly evaluate the model’s performance, the dataset was divided into three subsets. First, the training subset, containing 6,471 images, was used to optimize the model’s parameters and facilitate robust feature learning. Second, a validation subset of 548 images was used to assess the model’s training and adjust hyperparameters, thereby preventing overfitting. Finally, the remaining 1,610 images formed the test subset, an independent set designed to evaluate the model’s generalization capability on unseen data. Table 2 presents a detailed composition of the dataset, including image and object counts for each subset and the distribution of object categories across the dataset.
Experimental setup and hyperparameter configuration
The experimental setup is thoroughly outlined in Table 3, with the hyperparameters for model training specified in Table 4. Unless otherwise indicated, the experimental environment and hyperparameter configurations remain consistent across the training, testing, and validation stages. Notably, none of the experiments utilized pre-trained parameters for these networks.
In the embedded application experiments, the Taishan Pi development board, produced by JLC Technology Group, was employed as the experimental environment. This board features an RK3566 SoC (system-on-chip). Detailed hardware and software specifications are provided in Table 5.
Experiment metrics
The experiments assess the proposed method by examining its detection accuracy, parameter footprint, and computational complexity. The evaluation metrics encompass Params (M) (millions of parameters), FLOPs (G) (giga floating-point operations), and mAP (mean average precision).
$P = \frac{TP}{TP + FP}$ (7)

$R = \frac{TP}{TP + FN}$ (8)
P (Precision) is defined as the ratio of correctly identified targets to the total number of detected targets, as outlined in Eq (7). R (Recall) represents the proportion of correctly detected targets relative to the total number of actual targets, as described in Eq (8).
$AP = \int_0^1 P(R)\,\mathrm{d}R$ (9)
In these equations, TP (true positive) represents the count of accurately predicted targets, FP (false positive) denotes the count of incorrect predictions, while FN (false negative) represents the instances where actual targets were present but failed to be correctly identified.
$mAP = \frac{1}{K}\sum_{i=1}^{K} AP_i$ (10)
The AP (Average Precision) quantifies the area under the precision-recall curve, as formalized in Eq (9). Meanwhile, the mAP extends this concept by averaging the AP values across all categories, as indicated in Eq (10), where K refers to the total number of classes and APi denotes the individual average precision for each class.
The evaluation metrics utilized in this study include two distinct mean average precision measures: mAP50 and mAP50–95. For mAP50, a predicted bounding box is considered accurate if the IoU (Intersection over Union) between the predicted and ground truth boxes exceeds a 0.50 threshold. Conversely, mAP50–95 computes the average precision over a range of IoU thresholds, from 0.50 to 0.95 in 0.05 increments, with the mean precision calculated across these thresholds.
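These metric definitions can be made concrete with a short worked example. The helper names are illustrative; the formulas are the standard ones given in Eqs (7)–(10) and the standard IoU definition.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision (Eq 7) and recall (Eq 8) from raw detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# mAP50 uses the single IoU threshold 0.50; mAP50-95 averages AP over ten:
thresholds = [0.50 + 0.05 * i for i in range(10)]   # 0.50, 0.55, ..., 0.95

p, r = precision_recall(tp=80, fp=20, fn=20)        # p = 0.8, r = 0.8
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))       # 50 / 150 = 1/3
```

A box pair with `overlap = 1/3` would therefore count as correct under no mAP50–95 threshold, illustrating why tiny-target localization errors are penalized heavily by the stricter metric.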
Comparison
In this subsection, the experimental results are compared against algorithms from the YOLO family and several recently developed methods using the VisDrone dataset. Specifically, the algorithms included were YOLOv8 [17], YOLOv10 [19], MFFSODNet [34], LE-YOLO [24], MPE-YOLO [35], LHRYNet [36], FocusDet [37], TA-YOLO [38], APNet [39], EUAVDet [40], MFEFNet [41], BRSTD [42], SOD-YOLO-n [43], AMFEF-DETR [44], UAV-YOLO [45], MVT-B [46], and HSP-YOLO [47]. These algorithms served as the baseline for our experimental comparison.
Initially, this study presents a comparative analysis of the proposed LCFF-Net algorithm against the YOLOv8 and YOLOv10 algorithms across various model scales using the VisDrone-val dataset. The performance evaluation was based on four key metrics: Params (M), FLOPs (G), mAP50, and mAP50–95. As demonstrated in Table 6, the LCFF-Net algorithm outperforms the baseline models in terms of overall performance.
Specifically, when comparing the LCFF-Net-n model with the YOLOv8-s model, the LCFF-Net-n model exhibits a 2.8% enhancement in mAP50 and a 3.9% improvement in mAP50–95. Furthermore, it achieves an 89.7% reduction in model parameters and a 50.5% decrease in computational demands, significantly optimizing both efficiency and resource utilization. When assessing the LCFF-Net-m model against the YOLOv8-x model, the LCFF-Net-m shows an enhancement of 14.3% in the mAP50 metric and 15.1% in the mAP50–95 metric, accompanied by an 88.0% reduction in model parameters and a 52.0% decrease in computational requirements, further streamlining resource efficiency and computational performance.
Similarly, when comparing the LCFF-Net-n model with the YOLOv10-s model (identified as one of the more effective baseline models), the LCFF-Net-n demonstrates an improvement of 1.0% in the mAP50 metric and 2.0% in the mAP50–95 metric. Moreover, the model exhibits an 84.2% reduction in parameter count, coupled with a 34.1% decrease in computational complexity, significantly optimizing both resource efficiency and operational performance. When assessing the LCFF-Net-m model against the YOLOv10-x model, the LCFF-Net-m shows an enhancement of 11.0% in the mAP50 metric and 12.7% in the mAP50–95 metric, accompanied by a 72.2% reduction in model parameters and a 22.8% decrease in computational demands, further enhancing both scalability and computational efficiency.
This study subsequently compared the performance of the LCFF-Net algorithm with other recently proposed algorithms using the same evaluation metrics on the VisDrone-val dataset, as detailed in Table 7. The LCFF-Net-l model, the largest in the LCFF-Net series, achieved the highest mAP50 and mAP50–95 metrics among all models. In comparison to MFEFNet, which demonstrates the highest mAP50 score among the baseline models, the LCFF-Net-l model achieved a 2.3% enhancement in mAP50 and an 11.4% improvement in mAP50–95, while concurrently reducing the required parameters by 63.6%. The LCFF-Net-n model, a more compact variant of the LCFF-Net algorithm, outperforms MPE-YOLO with a 2.3% increase in mAP50 and a 3.5% improvement in mAP50–95. Additionally, it achieved these gains while reducing the number of required parameters by 74.1%. Furthermore, the LCFF-Net algorithm consistently outperforms the baseline models across all model sizes, exhibiting notable superiority in key metrics such as accuracy and computational efficiency.
Embedded environment deployment experiment
In the embedded environment deployment experiment, the models YOLOv8-n, YOLOv8-s, YOLOv10-n, and YOLOv10-s were selected as baseline methods for comparison with the proposed LCFF-Net-a, LCFF-Net-t, and LCFF-Net-n models. The evaluation metrics used for this comparison are Params (M), FLOPs (G), mAP50, mAP50–95, and average computation time (Embedded Latency (s) and Latency on server (s)).
The experimental results, presented in Table 8, indicate that the LCFF-Net-a model achieves the fastest computation speed among all evaluated models. Notably, the LCFF-Net-n model is over 20% faster than YOLOv8-s and YOLOv10-s models while sustaining high accuracy. Furthermore, a comparison of computational delays between the embedded and experimental server environments reveals that the delay in the YOLOv8 and YOLOv10 series models increases by more than 5000 times in the embedded environment, whereas the LCFF-Net experiences an increase of just over 3000 times. These findings suggest that, relative to the baseline models, LCFF-Net is better suited for environments with constrained computational resources, although its delay in the embedded environment remains significantly higher than in the server setting. This underscores the substantial impact that limited computing power in embedded systems has on model calculation speed.
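The average computation times reported above follow the standard warm-up-then-average timing procedure. A minimal Python sketch of such a measurement loop is shown below; the `model` callable and the warm-up and run counts are illustrative stand-ins, not the actual benchmarking harness used in our experiments.

```python
import time

def average_latency(model, inputs, warmup=10, runs=100):
    """Average per-call latency in seconds, measured after warm-up runs."""
    for _ in range(warmup):          # warm-up: let caches/JIT stabilize
        model(inputs)
    start = time.perf_counter()
    for _ in range(runs):            # timed region: only the forward calls
        model(inputs)
    return (time.perf_counter() - start) / runs

# Toy stand-in for a detector's forward pass.
dummy_model = lambda x: [v * 2 for v in x]
latency = average_latency(dummy_model, [1, 2, 3], warmup=2, runs=20)
```

In practice the same loop would be run once on the embedded device and once on the server to obtain the two latency columns.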
Ablation experiment
In the ablation experiments, YOLOv8-n was selected as the baseline model. Various structures from LCFF-Net were incrementally introduced into this baseline model, with multiple performance metrics evaluated on the VisDrone-val dataset for each modification. This approach was employed to assess the effectiveness of each structural improvement in LCFF-Net. Throughout the experiment, the experimental environment and all hyperparameter settings were maintained constant, in alignment with the experimental conditions described earlier.
As shown in Table 9, the most significant improvement in LCFF-Net is attributed to the introduction of LC-FPN. Specifically, integrating LC-FPN into the baseline model yielded a 19.0% enhancement in mAP50 and a 21.1% boost in mAP50–95, while concurrently reducing the parameter count by 34.6%. The additional structural optimizations play a pivotal role in decreasing the model’s parameter count and computational complexity, while simultaneously maintaining or improving accuracy metrics.
Additionally, the LCFF-Net model does not incorporate a commonly utilized attention mechanism. For a specific task, adding a targeted attention mechanism may further enhance performance. To test this hypothesis, MLCA [48], EMA [49], CAA [50], and AA [51] were each added after the SPPF module in the backbone network of the LCFF-Net model, and experiments were conducted. As shown in Table 10, even though the attention mechanisms were applied directly to the backbone without further optimization, performance on the current task improved modestly. These findings indicate that incorporating a task-specific attention mechanism into the LCFF-Net model can indeed enhance its performance.
Visual analysis
To compare the object detection performance of various models visually, YOLOv8-x and YOLOv10-x are selected as baseline models for comparison against the LCFF-Net-l model proposed in this paper. The comparison is conducted across multiple criteria, including precision-recall (PR) curves, confusion matrices, heat maps of representative network layers, and direct visual assessments.
Fig 7 presents the PR curves for YOLOv8-x, YOLOv10-x, and the proposed LCFF-Net-l model. The PR curve, a critical tool for evaluating model performance, is particularly useful for imbalanced datasets. It is generated by plotting precision against recall, thereby demonstrating how the model performs under varying decision thresholds.
As illustrated in Fig 7, the PR curve of the LCFF-Net-l model lies closer to the upper-right corner than those of the YOLOv8-x and YOLOv10-x models and encloses a larger area under the curve, indicating superior detection performance. Nevertheless, the detection accuracy for certain categories, specifically bicycles and awning tricycles, remains suboptimal, indicating that a significant portion of these targets is missed during detection.
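The precision and recall values underlying such a curve are obtained by sweeping the confidence threshold over the ranked detections. The following is a minimal, generic Python sketch of this computation, not the exact evaluation code used in our experiments; `labels` is assumed to mark each detection as a true positive (1) or false positive (0) after IoU matching.

```python
def pr_curve(scores, labels):
    """Precision/recall pairs as the confidence threshold is lowered.
    scores: detection confidences; labels: 1 = true positive, 0 = false positive.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)          # total ground-truth matches among detections
    tp = fp = 0
    precision, recall = [], []
    for i in order:                  # sweep from highest to lowest confidence
        if labels[i]:
            tp += 1
        else:
            fp += 1
        precision.append(tp / (tp + fp))
        recall.append(tp / total_pos)
    return precision, recall

p, r = pr_curve([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
# p == [1.0, 1.0, 2/3, 0.75]; r == [1/3, 2/3, 2/3, 1.0]
```

Plotting `p` against `r` yields one PR curve per class; average precision is the area under it.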
Fig 8 presents the confusion matrices for YOLOv8-x, YOLOv10-x, and the proposed LCFF-Net-l models to illustrate the accuracy of object detection. In these matrices, the rows denote the actual class labels, while the columns correspond to the predicted labels. The matrix values are normalized to lie within the range of 0 to 1, where the diagonal elements signify the proportion of accurate classifications for each category, and the off-diagonal elements capture the fraction of misclassifications across classes.
As illustrated in Fig 8, the confusion matrix of the proposed LCFF-Net-l model demonstrates higher intensity along the diagonal and lower intensity along the lower edge compared to the YOLOv8-x and YOLOv10-x models. This signifies that the LCFF-Net-l model exhibits superior object detection capabilities in comparison to the baseline models. Although the model achieves an overall improvement in detection accuracy, reflected in a lower missed detection rate across most categories, it continues to exhibit certain errors in specific categories where the baseline models do not. This underscores the presence of residual challenges and suggests potential for further refinement in detection accuracy.
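The row-wise normalization used in these matrices (rows are actual classes, columns are predictions, each row summing to 1) can be illustrated with a short, self-contained Python sketch; the class indices and predictions below are toy values for illustration only.

```python
def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: rows are actual classes,
    columns are predicted classes, and each non-empty row sums to 1."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1                 # count each (actual, predicted) pair
    for row in m:
        s = sum(row)
        if s:                        # skip classes with no ground-truth samples
            for j in range(n_classes):
                row[j] /= s
    return m

cm = normalized_confusion([0, 0, 1, 1], [0, 1, 1, 1], 2)
# cm == [[0.5, 0.5], [0.0, 1.0]]
```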
Fig 9 presents the Grad-CAM heatmap visualizations of representative network layers (P2, P3, P4, P5, and the Head layer) in the YOLOv8-x, YOLOv10-x, and the proposed LCFF-Net-l models. As illustrated in Fig 9(a)–9(j), both the YOLOv8-x and YOLOv10-x models tend to focus on features with a broader scope, thereby failing to adequately capture the features of tiny objects. In contrast, Fig 9(k)–9(o) demonstrate that the LCFF-Net-l model significantly improves the focus on tiny objects, highlighting their key features with superior accuracy and comprehensiveness. These visual heatmaps clearly indicate that the LCFF-Net-l model surpasses the YOLOv8-x and YOLOv10-x baseline models in accuracy when detecting tiny targets. The Grad-CAM heatmap visualizations provide critical insights into the model’s decision-making process, further corroborating the exceptional performance of the LCFF-Net-l model in accurately detecting small-scale targets.
(a–e) show the Grad-CAM heat maps for the prediction process of the YOLOv8-x model; (f–j) show the Grad-CAM heat maps for the YOLOv10-x model; and (k–o) present the Grad-CAM heat maps for the LCFF-Net-l model.
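Grad-CAM, as used for these visualizations, weights each feature-map channel by the spatial average of the gradient of the class score with respect to that channel's activation, then applies a ReLU to the weighted sum. A minimal NumPy sketch of this weighting step, using random stand-in activations and gradients rather than real network tensors, is:

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: arrays of shape (C, H, W) from a chosen layer.
    Returns an (H, W) heat map normalized to [0, 1]."""
    weights = gradients.mean(axis=(1, 2))    # global-average-pool the gradients per channel
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)  # ReLU
    if cam.max() > 0:
        cam /= cam.max()                      # rescale to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
heat = grad_cam(rng.standard_normal((8, 5, 5)), rng.standard_normal((8, 5, 5)))
```

The resulting map is upsampled to the input resolution and overlaid on the image to produce heat maps like those in Fig 9.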
To assess the robustness and effectiveness of the LCFF-Net-l model in challenging scenarios, its capacity to detect tiny targets across diverse and complex environments was rigorously analyzed and compared against the performance of the YOLOv8-x and YOLOv10-x models. The significant discrepancies in detection results between models are highlighted in orange boxes in the accompanying Figs, with enlarged views provided for clarity. Six complex scenes were selected from the VisDrone-val (Fig 10) and VisDrone-test (Fig 11) datasets, incorporating diverse lighting conditions (e.g., daytime, nighttime, glare, and shadows), target types (pedestrians, vehicles, bicycles, and tricycles), and environmental settings (sparse versus crowded areas), which are common in UAV aerial imagery yet notoriously difficult to detect accurately.
As demonstrated in Figs 10a, 10b, 11a and 11b, the YOLOv8-x and YOLOv10-x models tend to overlook tiny objects in distant, crowded, and dimly lit environments, leading to challenges in accurately detecting these objects. Conversely, the LCFF-Net-l model proposed in this study facilitates more efficient and accurate detection of tiny objects across a range of challenging conditions, such as distant, densely populated, high-glare, or low-light environments.
In summary, the findings highlight the LCFF-Net model’s versatility and efficacy in detecting tiny aerial targets within intricate urban landscapes. The model consistently achieves high accuracy and computational efficiency, even under demanding conditions, underscoring its robustness and adaptability across a wide spectrum of real-world scenarios.
Discussion and conclusion
This paper presents LCFF-Net, a lightweight cross-scale feature fusion network designed specifically for the detection of tiny targets in UAV aerial imagery. The algorithm addresses key challenges in UAV image target detection, such as varying lighting conditions, complex environments, and tiny target sizes. To mitigate these challenges, a Lightweight Feature Extraction Convolutional Block (LFECB) was designed and a Lightweight Feature Extraction Reparameterized Efficient Layer Aggregation Network (LFERELAN) was developed; this structure efficiently extracts target features while reducing computational costs. Furthermore, the LC-FPN and LR-NET architectures were introduced to enhance the backbone and neck structures of the baseline model, respectively. These modifications enable low-cost fusion of multi-level network features with bottom-layer features rich in spatial information, significantly improving the model's ability to detect tiny objects. Additionally, the Lightweight Detail-Enhanced Shared Convolution Detection Head (LDSCD-Head) was designed to improve the detection head of the baseline model. This detection head enables efficient multi-scale feature fusion and sharing, optimizing the utilization of information across scales while substantially reducing the model's parameter count and computational overhead. Empirical results on the VisDrone-val dataset demonstrate that the proposed LCFF-Net-l model outperforms the baseline on both the mAP50 and mAP50–95 metrics. The smaller LCFF-Net-n model improves mAP50 by 2.8% and mAP50–95 by 3.9% compared to the baseline, while reducing model parameters by 89.7% and computational costs by 50.5%. Furthermore, the LCFF-Net algorithm achieves faster computational speeds in embedded environments than the baseline. Notably, our network improvement method does not incorporate the widely used attention mechanism, suggesting potential for further task-specific optimization.
The LCFF-Net algorithm includes models of varying scales. The smaller models are optimized for embedded deployment under constrained computational resources, whereas the larger models are designed for desktop computing platforms, making them suitable for different application scenarios. While LCFF-Net has been optimized for embedded environments, particularly in extreme conditions, its performance still shows potential for further enhancement. Future work could focus on refining the model’s computational efficiency to improve its robustness in challenging scenarios. Additionally, incorporating task-specific attention mechanisms may offer further performance gains with minimal computational overhead.
Supporting information
S1 File. The VisDrone dataset referenced in this study is publicly accessible and can be retrieved from https://github.com/VisDrone/VisDrone-Dataset.
https://doi.org/10.1371/journal.pone.0315267.s001
(DOCX)
S2 File. The source code for the LCFF-Net model is available at https://github.com/Tdzdele/LCFF-Net.
https://doi.org/10.1371/journal.pone.0315267.s002
(DOCX)
References
- 1. Alam F, Mehmood R, Katib I, Altowaijri SM, Albeshri A. Taawun: a decision fusion and feature specific road detection approach for connected autonomous vehicles. Mobile Networks and Applications. 2023;28(2):636–652. Available from: https://doi.org/10.1007/s11036-019-01319-2.
- 2.
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: Single Shot MultiBox Detector. In: Computer Vision—ECCV 2016. Cham: Springer International Publishing; 2016. p. 21–37.
- 3.
Girshick R, Donahue J, Darrell T, Malik J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. CVPR’14. USA: IEEE Computer Society; 2014. p. 580–587. Available from: https://doi.org/10.1109/CVPR.2014.81.
- 4.
Girshick R. Fast R-CNN. In: International Conference on Computer Vision (ICCV); 2015.
- 5. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(6):1137–1149. pmid:27295650
- 6.
He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV); 2017. p. 2980–2988.
- 7.
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-End Object Detection with Transformers; 2020. Available from: https://arxiv.org/abs/2005.12872.
- 8.
Zhao Y, Lv W, Xu S, Wei J, Wang G, Dang Q, et al. DETRs Beat YOLOs on Real-time Object Detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024. p. 16965–16974.
- 9. Lv W, Zhao Y, Chang Q, Huang K, Wang G, Liu Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer; 2024. Available from: https://arxiv.org/abs/2407.17140.
- 10. Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, Real-Time Object Detection; 2016.
- 11.
Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger; 2016.
- 12.
Redmon J, Farhadi A. YOLOv3: An Incremental Improvement; 2018.
- 13.
Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: Optimal Speed and Accuracy of Object Detection; 2020.
- 14.
Jocher G. Ultralytics YOLOv5; 2020. Available from: https://github.com/ultralytics/yolov5.
- 15.
Li C, Li L, Geng Y, Jiang H, Cheng M, Zhang B, et al. YOLOv6 v3.0: A Full-Scale Reloading; 2023.
- 16.
Wang CY, Bochkovskiy A, Liao HYM. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023. p. 7464–7475.
- 17.
Jocher G, Chaurasia A, Qiu J. Ultralytics YOLOv8; 2023. Available from: https://github.com/ultralytics/ultralytics.
- 18.
Wang CY, Yeh IH, Liao HYM. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information; 2024.
- 19.
Wang A, Chen H, Liu L, Chen K, Lin Z, Han J, et al. YOLOv10: Real-Time End-to-End Object Detection; 2024. Available from: https://arxiv.org/abs/2405.14458.
- 20.
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common Objects in Context. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer Vision—ECCV 2014. Cham: Springer International Publishing; 2014. p. 740–755.
- 21. Jadhav A, Mukherjee P, Kaushik V, Lall B. Aerial multi-object tracking by detection using deep association networks; 2019.
- 22. Ding J, Xue N, Xia GS, Bai X, Yang W, Yang MY, et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(11):7778–7796. pmid:34613910
- 23. Zhang Z. Drone-YOLO: An Efficient Neural Network Method for Target Detection in Drone Images. Drones. 2023;7(8).
- 24. Yue M, Zhang L, Huang J, Zhang H. Lightweight and Efficient Tiny-Object Detection Based on Improved YOLOv8n for UAV Aerial Images. Drones. 2024;8(7).
- 25. Huang M, Mi W, Wang Y. EDGS-YOLOv8: An Improved YOLOv8 Lightweight UAV Detection Model. Drones. 2024;8(7).
- 26. Zhao X, Zhang W, Zhang H, Zheng C, Ma J, Zhang Z. ITD-YOLOv8: An Infrared Target Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles. Drones. 2024;8(4).
- 27.
Chen J, Kao Sh, He H, Zhuo W, Wen S, Lee CH, et al. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023. p. 12021–12031.
- 28.
Shi D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers; 2024. Available from: https://arxiv.org/abs/2311.17132.
- 29.
Wang CY, Mark Liao HY, Wu YH, Chen PY, Hsieh JW, Yeh IH. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2020. p. 1571–1580.
- 30.
Ding X, Zhang X, Ma N, Han J, Ding G, Sun J. RepVGG: Making VGG-style ConvNets Great Again. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 13728–13737.
- 31. Chen Z, He Z, Lu ZM. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Transactions on Image Processing. 2024;33:1002–1015. pmid:38252568
- 32. Zhu P, Wen L, Du D, Bian X, Fan H, Hu Q, et al. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021;44(11):7380–7399.
- 33. Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision. 2015;111(1):98–136.
- 34. Jiang L, Yuan B, Du J, Chen B, Xie H, Tian J, et al. MFFSODNet: Multiscale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Transactions on Instrumentation and Measurement. 2024;73:1–14.
- 35. Su J, Qin Y, Jia Z, Liang B. MPE-YOLO: enhanced small target detection in aerial imaging. Scientific Reports. 2024;14.
- 36. Tang L, Yun L, Chen Z, Cheng F. HRYNet: A Highly Robust YOLO Network for Complex Road Traffic Object Detection. Sensors. 2024;24(2).
- 37. Shi Y, Jia Y, Zhang X. FocusDet: an efficient object detector for small object. Scientific Reports. 2024;14. pmid:38730236
- 38. Li M, Chen Y, Zhang T, Huang W. TA-YOLO: a lightweight small object detection model based on multi-dimensional trans-attention module for remote sensing images. Complex & Intelligent Systems. 2024;10.
- 39. Zhang P, Zhang G, Yang K. APNet: Accurate Positioning Deformable Convolution for UAV Image Object Detection. IEEE Latin America Transactions. 2024;22(4):304–311.
- 40. Wu W, Liu A, Hu J, Mo Y, Xiang S, Duan P, et al. EUAVDet: An Efficient and Lightweight Object Detector for UAV Aerial Images with an Edge-Based Computing Platform. Drones. 2024;8(6).
- 41. Zhou L, Zhao S, Wan Z, Liu Y, Wang Y, Zuo X. MFEFNet: A Multi-Scale Feature Information Extraction and Fusion Network for Multi-Scale Object Detection in UAV Aerial Images. Drones. 2024;8(5).
- 42. Huang S, Lin C, Jiang X, Qu Z. BRSTD: Bio-Inspired Remote Sensing Tiny Object Detection. IEEE Transactions on Geoscience and Remote Sensing. 2024;62:1–15.
- 43. Li Y, Li Q, Pan J, Zhou Y, Zhu H, Wei H, et al. SOD-YOLO: Small-Object-Detection Algorithm Based on Improved YOLOv8 for UAV Images. Remote Sensing. 2024;16(16). Available from: https://www.mdpi.com/2072-4292/16/16/3057
- 44. Wang S, Jiang H, Yang J, Ma X, Chen J. AMFEF-DETR: An End-to-End Adaptive Multi-Scale Feature Extraction and Fusion Object Detection Network Based on UAV Aerial Images. Drones. 2024;8(10). Available from: https://www.mdpi.com/2504-446X/8/10/523
- 45. Tan S, Duan Z, Pu L. Multi-scale object detection in UAV images based on adaptive feature fusion. PLOS ONE. 2024;19(3):1–21. Available from: https://doi.org/10.1371/journal.pone.0300120 pmid:38536859
- 46. Jing S, Lv H, Zhao Y, Liu H, Sun M. MVT: Multi-Vision Transformer for Event-Based Small Target Detection. Remote Sensing. 2024;16(9). Available from: https://www.mdpi.com/2072-4292/16/9/1641
- 47. Zhang H, Sun W, Sun C, He R, Zhang Y. HSP-YOLOv8: UAV Aerial Photography Small Target Detection Algorithm. Drones. 2024;8(9).
- 48. Wan D, Lu R, Shen S, Xu T, Lang X, Ren Z. Mixed local channel attention for object detection. Engineering Applications of Artificial Intelligence. 2023;123:106442. Available from: https://www.sciencedirect.com/science/article/pii/S0952197623006267
- 49.
Ouyang D, He S, Zhang G, Luo M, Guo H, Zhan J, et al. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In: ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2023. p. 1–5.
- 50.
Cai X, Lai Q, Wang Y, Wang W, Sun Z, Yao Y. Poly Kernel Inception Network for Remote Sensing Detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024. p. 27706–27716.
- 51.
Shi D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024. p. 17773–17783.