Abstract
The accuracy of road object detection is crucial for ensuring the safe driving of autonomous vehicles. Current road object detection models commonly suffer from missed detection of small objects, excessive parameters, low accuracy, and poor robustness. To address these problems, a road object detection model named FRCP-YOLO, developed on the basis of YOLOv8n, is proposed in the present study. First, to reduce the model’s parameters and complexity, the C2f module in the backbone network is replaced with the lightweight FasterNet Block, which accelerates image feature extraction; the proposed R-CA module, built from a residual block combined with the Coordinate Attention (CA) mechanism, is then introduced to strengthen the model’s focus on objects of interest and improve its feature-learning capability. Second, to enhance small object detection performance, a high-resolution feature-extraction branch and a detection head for processing these features are introduced, improving the model’s robustness. Finally, PIoU v2 is selected as the bounding box regression loss function to effectively prevent anchor box enlargement, enhance the focus on anchor boxes, and further improve overall detection accuracy. Comparison experiments between FRCP-YOLO and other mainstream algorithms were carried out on the KITTI dataset: FRCP-YOLO achieves object detection accuracies of 0.924 and 0.667 (in terms of mAP@50 and mAP@50–95) on the test set, representing improvements of 5.0% and 6.6% over the baseline model, while reducing parameters by 4%. Comparative experiments were also conducted on the BDD100K dataset of complex road scenes.
FRCP-YOLO outperforms other mainstream algorithms in challenging scenarios such as dense traffic, occlusion, and nighttime conditions, which verifies its generalization ability and highlights its reliable and effective object detection in complex scenarios.
Citation: Liu D, Wang C, Li X, Zhao X, Yin C, Liu Y, et al. (2026) FRCP-YOLO: Road object detection algorithm based on improved YOLOv8n. PLoS One 21(2): e0342084. https://doi.org/10.1371/journal.pone.0342084
Editor: Ankit Gupta, CCET: Chandigarh College of Engineering and Technology, INDIA
Received: April 6, 2025; Accepted: January 16, 2026; Published: February 20, 2026
Copyright: © 2026 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Link to KITTI dataset: https://www.cvlibs.net/datasets/kitti/ Link to BDD100K dataset: https://github.com/WCC0810/Minimal-dataset__BDD100K-subset/releases.
Funding: This study was financially supported by the Jilin Provincial Enterprise “Science and Technology Innovation Commissioner (Deputy Chief Technology Officer)” Program in 2024 in the form of an award received by DL, CW, XL and XZ. No additional external funding was received for this study. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
As autonomous driving technology advances rapidly [1–3], research on road object detection, which perceives the surrounding environment [4,5] and traffic conditions [6,7] and supports accurate driving decisions, has gained prominence. Road object detection aims to accurately perceive the external environment, especially to identify and locate various objects in complex road environments such as dense occlusion, nighttime, and rainy conditions [8]. By detecting vehicles [9,10], pedestrians [11,12], lane lines [13,14], and other objects on the road, potential dangers can be discovered in time, thereby enabling early warning, reducing the risk of traffic accidents [15], and improving the driving experience. As a core component of the autonomous driving system, the accuracy of road object detection technology is crucial for ensuring vehicle safety. Accurate road object detection supports a more intelligent and efficient traffic management system [16], and has significant practical value in improving road capacity and safety, optimizing traffic flow management, and raising the level of intelligent driving.
Road object detection algorithms based on deep learning have attracted much attention from researchers for their accuracy and efficiency. Current work mainly explores and improves multiple directions, such as the loss function, small object feature extraction methods, and network architecture, in an effort to further improve object detection performance. Yang et al. [17] embedded a deformable convolution network (DCN) in the C2f module of the Backbone network to improve the model’s feature extraction ability under complex background conditions, and added a global attention mechanism (GAM) to the Neck network to highlight important feature information. Wang et al. [18] embedded the BiFormer attention mechanism in the Neck layer of the original YOLOv8n model to effectively capture the associations and dependencies between features; meanwhile, WIoU v3 was used as the loss function to enhance the model’s focus on high-quality anchor boxes. Dang et al. [19] added an attention module to the YOLOv5 backbone network to suppress non-critical information and used CIoU as the loss function to improve vehicle detection accuracy. Chen et al. [20] improved the detection accuracy of small objects by adding a small object detection head to the YOLOX network, and used depthwise separable convolution to optimize the Neck network, further improving the model’s computational efficiency. Li et al. [21] introduced space-to-depth convolution (SPD-Conv) to replace the original convolution layer to address small object detection, and added the NAM module after the C3 module in the backbone network so that the network pays more attention to the key information in the feature map. Liu et al. [22] proposed a high-resolution detection network (HRDNet) for detecting small objects, in which high-resolution input is sent to a shallow network to retain more position information, and low-resolution input is sent to a deep network to extract more semantics. Liu et al. [23] proposed a feature-sharing detection head to reduce the redundant parameters of the model and used the DWR module to enhance the feature extraction capability of the backbone network. Wang et al. [24] proposed an improved feature pyramid network, AF-FPN, which uses an adaptive attention module (AAM) and a feature enhancement module (FEM) to reduce information loss during feature map generation and enhance the representation capability of the feature pyramid. Zhang et al. [25] reconstructed the YOLOv7 backbone network using the Res3Unit structure to improve the model’s ability to capture nonlinear features, and added the mixed attention mechanism module ACmix after the SPPCSPC layer to enhance the model’s attention to vehicles. These improved algorithms have achieved good results in road object detection tasks, but there is still room for improvement in small object detection, the number of model parameters, and detection accuracy, while robustness in complex traffic scenarios also needs to be improved.
The main contributions of the present study are as follows:
- 1). To effectively reduce the model’s parameters and computational complexity while improving computational efficiency, the lightweight and fast FasterNet Block is utilized for road object feature extraction. Furthermore, to overcome gradient disappearance and performance degradation, enhance the detection of objects of interest, and improve detection accuracy, an efficient R-CA module is proposed by combining the residual block of the residual network with the CA mechanism.
- 2). To enhance small object detection and mitigate missed detections, the small object feature extraction network within the Neck network is improved. Specifically, a dedicated branch for capturing small object features is incorporated. Additionally, a specialized detection head for small object feature processing is integrated into the Head network, thereby boosting model performance in complex road scenes and overall detection accuracy.
- 3). PIoU v2 is used as the bounding box regression loss function, which enhances the bounding box regression speed and improves the road object positioning accuracy of the algorithm. Incorporating a penalty factor into PIoU v2 effectively prevents anchor boxes from becoming excessively large, while the introduction of a non-monotonic attention function further improves focus on anchor boxes.
2 Related work
2.1 YOLOv8n
YOLOv8 [26] achieves precise localization and detection of objects by optimizing the network architecture, adopting a lightweight design, and utilizing multiscale feature fusion. It is well suited for autonomous driving application scenarios that demand real-time performance and high accuracy, and it is straightforward to use and deploy. According to parameter count and performance, YOLOv8 is divided into five versions, among which YOLOv8n is the lightest and fastest. Because the computing resources of autonomous driving on-board equipment are limited, the object detection model must be both lightweight and fast; YOLOv8n meets this requirement well, has achieved excellent detection accuracy on public datasets, and is suitable for detecting objects in different road scenarios. Therefore, YOLOv8n is selected as the base model, which consists of three parts: the Backbone, Neck, and Head networks. The overall structure is shown in Fig 1.
2.1.1 Backbone network.
The backbone network of YOLOv8n adopts a convolutional neural network primarily for image feature extraction. Relative to the YOLOv5 [27] network, it replaces the C3 module with the C2f module, making gradient flow information richer and improving the convergence speed of the model; the C2f structure is shown in Fig 1. Moreover, the C2f module is combined with convolutional layers to extract high-level semantic features from the input image [28], further enhancing the network’s feature extraction capability.
2.1.2 Neck network.
The Neck network serves as a crucial link between the Backbone and Head networks, comprising the SPPF layer and an improved Path Aggregation Network (PANet) [29]. Its primary function is to fuse the multi-scale features extracted by the Backbone, which is essential for enhancing overall object detection performance. Specifically, the SPPF layer processes features at different scales, which not only improves feature expression capability but also speeds up model inference compared with SPP. The PANet structure comprises two components: the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). FPN adopts a top-down path to connect deep and shallow feature maps, while PAN adopts a bottom-up path, which facilitates the transfer of low-level localization information to higher levels and fully integrates multiscale features.
2.1.3 Head network.
The Head network includes three detection heads corresponding to the feature maps of three different scales in the Neck network; they are mainly responsible for processing the extracted feature maps and generating the final prediction results, including bounding box coordinates, object confidence scores, and category labels. The loss function uses Binary Cross Entropy Loss (BCE Loss) as the classification loss, and Distribution Focal Loss (DFL) and Complete Intersection over Union (CIoU) [30] Loss as the regression loss. The Head module of YOLOv8 improves on YOLOv5 in two ways: 1) the Coupled-Head structure is replaced by a Decoupled-Head structure, extracting position information and category information separately; 2) the Anchor-Based approach is replaced by an Anchor-Free approach, directly predicting the center point and width and height of the object. These improvements enhance the detection speed and accuracy of the YOLOv8 network model.
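As a minimal illustration of the Anchor-Free idea, a box predicted as a center point plus width and height converts to corner coordinates as follows (a generic sketch, not YOLOv8’s exact DFL-based decoding):

```python
def decode_center_wh(cx, cy, w, h):
    """Convert a center/size box prediction to (x1, y1, x2, y2) corners."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# a 20x10 box centered at (50, 40)
assert decode_center_wh(50, 40, 20, 10) == (40.0, 35.0, 60.0, 45.0)
```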
3 FRCP-YOLO
Although the YOLOv8n network exhibits satisfactory performance in general object detection tasks, it still suffers from missed and false detections of small objects, insufficient focus capability, and limited detection accuracy in road object detection for autonomous driving scenarios. To address these issues, an improved detection framework named FRCP-YOLO is proposed, as illustrated in Fig 2. The specific improvements over the YOLOv8n network are described in Sections 3.1–3.3.
3.1 Improvement of the FRCP-YOLO backbone network
This section provides an in-depth analysis of the primary components of the FRCP-YOLO backbone network, including the FasterNet Block and the R-CA module, and presents a comparative analysis of the backbone structure before and after the improvements, as shown in Fig 3. The FasterNet Block is introduced to reduce the number of model parameters and improve computational efficiency. The R-CA module is proposed to address gradient disappearance during model training and enhance the model’s learning ability.
3.1.1 FasterNet block.
The C2f convolution module in the YOLOv8n network significantly improves the model’s feature extraction capability by increasing the number of Bottleneck modules, but this also increases the number of parameters. Meanwhile, the feature maps of different channels are highly similar, generating redundant channel information, which causes the model to access memory frequently during operation and perform redundant computation. To solve these problems, the FasterNet Block is adopted, which is built from Partial Convolution (PConv) and Pointwise Convolution (PWConv) as proposed in the FasterNet [31] neural network. The structure is shown in Fig 4.
Lightweight networks such as MobileNets, ShuffleNets, and GhostNet primarily utilize depthwise convolutions (DWConv) and group convolutions (GConv) for feature extraction. While these methods have achieved notable reductions in model complexity, they often lead to frequent memory accesses and increased computational overhead [31]. In contrast to these conventional lightweight convolutions, PConv applies convolution for feature extraction on only a part of the channels of the input feature map and leaves the remaining channels untouched, which reduces the computation of redundant feature maps, largely reduces memory accesses, and improves the detection speed of the model. To more effectively utilize the information of all channels, PWConv is introduced; the overall receptive field resembles a T-shaped convolution, which enhances the model’s attention to the central area of the image. The structure is shown in Fig 5.
Compared with modules such as C2f, the FasterNet Block not only solves the problems of frequent memory access and redundant channel computation, but also reduces model complexity and improves training efficiency. It achieves a better balance between parameter count and detection accuracy, thereby enabling more efficient spatial feature extraction.
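The channel-splitting idea behind PConv can be sketched with a NumPy toy: only the first C/n_div channels pass through a (here fixed, averaging) 3 × 3 kernel, while the remaining channels are copied through unchanged. This illustrates the data flow only; the real module uses a learned convolution.

```python
import numpy as np

def partial_conv(x, n_div=4):
    """Toy Partial Convolution: convolve only the first C/n_div channels
    of a (C, H, W) feature map; pass the remaining channels through."""
    c, h, w = x.shape
    cp = c // n_div                          # channels that get convolved
    out = x.copy()                           # untouched channels kept as-is
    # fixed 3x3 averaging kernel with zero padding (stand-in for a learned conv)
    pad = np.pad(x[:cp], ((0, 0), (1, 1), (1, 1)))
    conv = np.zeros((cp, h, w))
    for i in range(3):
        for j in range(3):
            conv += pad[:, i:i + h, j:j + w]
    out[:cp] = conv / 9.0
    return out

x = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
y = partial_conv(x, n_div=4)
# with 4 channels and n_div=4, only channel 0 is convolved
assert np.array_equal(y[1:], x[1:])
assert not np.array_equal(y[0], x[0])
```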
3.1.2 R-CA module.
Attention mechanism is one of the common methods to improve the performance of deep neural networks. The widely adopted Squeeze-and-Excitation attention mechanism [32] (SE) reduces computational cost; however, it solely considers channel information and overlooks crucial positional information. In contrast, the CBAM attention mechanism [33] incorporates positional information but is limited to capturing local relationships and involves a large number of parameters. Coordinate Attention (CA) [34] is a lightweight attention mechanism suitable for mobile devices, which enhances accuracy by integrating positional information into channel attention, as illustrated in Fig 6.
Specifically, for any input $X \in \mathbb{R}^{C \times H \times W}$, where $H$ and $W$ represent the height and width of the input feature map, position information is accurately encoded along the horizontal and vertical directions using two pooling kernels of sizes (H, 1) and (1, W) for each channel $c$, and the attention feature maps in the two directions are obtained. The outputs for channel $c$ are shown in Eqs. (1) and (2), respectively:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$
To fully utilize the obtained information, the above two feature maps are concatenated and convolved to produce an intermediate feature map $f$, as shown in Eq. (3):

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right) \tag{3}$$

where $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension, $\delta$ is the nonlinear activation function, and $F_1$ is the convolutional transformation function. The tensor $f$ is then divided into two separate tensors $f^h$ and $f^w$ along the spatial dimension, followed by the application of convolutional transformation and nonlinear activation, respectively, as shown in Eqs. (4) and (5):
$$g^h = \sigma\left(F_h\left(f^h\right)\right) \tag{4}$$

$$g^w = \sigma\left(F_w\left(f^w\right)\right) \tag{5}$$

where $\sigma$ is the sigmoid function, $F_h$ and $F_w$ perform convolutional transformations on $f^h$ and $f^w$ respectively, and $g^h$ and $g^w$ are used as the attention weights in the horizontal and vertical directions. The output of the CA mechanism is shown in Eq. (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$
This demonstrates that the CA mechanism captures long-range dependencies along one spatial direction while preserving precise positional information along the other, thereby enhancing the model’s feature extraction capabilities.
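The directional pooling and re-weighting of the CA mechanism (Eqs. (1), (2), and (6)) can be sketched in NumPy. The learned 1 × 1 convolutional transformations are omitted, so this toy shows only the data flow, not the trained module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x):
    """Simplified CA data flow for a (C, H, W) feature map: pool along each
    spatial direction, turn the pooled maps into weights, and re-weight x."""
    zh = x.mean(axis=2)                  # Eq. (1): pool along width  -> (C, H)
    zw = x.mean(axis=1)                  # Eq. (2): pool along height -> (C, W)
    gh = sigmoid(zh)[:, :, None]         # horizontal attention weights
    gw = sigmoid(zw)[:, None, :]         # vertical attention weights
    return x * gh * gw                   # Eq. (6): broadcast re-weighting

x = np.random.default_rng(0).random((8, 16, 16))
y = coordinate_attention(x)
assert y.shape == x.shape
# sigmoid weights lie in (0, 1), so nonnegative inputs can only be attenuated
assert np.all(y <= x)
```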
As the number of network layers increases, the model encounters issues such as gradient disappearance, which makes the network difficult to train and reduces model performance. Therefore, the residual block from residual networks [35] is adopted to effectively address the deterioration in model performance caused by an increase in network layers. Combining the CA mechanism with the residual block, the R-CA module is proposed; its structure is shown in Fig 7, and its expressions are shown in Eqs. (7)–(9):

$$F = \mathrm{Conv}_{3 \times 3}\left(\mathrm{Conv}_{1 \times 1}(X)\right) \tag{7}$$

$$F' = \mathrm{CA}(F) \tag{8}$$

$$Y = X + F' \tag{9}$$

where $X$ represents the input feature map, $\mathrm{Conv}_{1 \times 1}$ and $\mathrm{Conv}_{3 \times 3}$ denote standard convolution operations with kernel sizes of 1 × 1 and 3 × 3, respectively, $F$ is the output feature map after standard convolution, $F'$ is the feature map processed by the CA mechanism, and $Y$ is the final output feature map.
Incorporating the CA mechanism within the residual block combines the strengths of both modules, enabling the model to focus on critical image regions and enhance its recognition accuracy. The CA mechanism can make the gradient flow propagate better in the model by weighting and adjusting the features. Simultaneously, the residual connection allows the network to effectively capture the difference between input and output, reinforce shallow feature learning, achieve more comprehensive learning of the object features in the image, and enhance the model’s generalization capability.
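The residual-plus-attention pattern can be sketched as follows; the convolutional path and the CA mechanism are stand-ins (identity and a fixed 0.5 scaling), chosen purely to make the skip connection visible:

```python
import numpy as np

def r_ca_block(x, ca=lambda t: 0.5 * t):
    """Toy R-CA pattern: residual connection around an attended conv path.
    The conv path and CA below are placeholders, not trained modules."""
    conv_out = x                 # stand-in for Conv1x1 -> Conv3x3
    refined = ca(conv_out)       # stand-in for the CA mechanism
    return x + refined           # residual connection preserves gradient flow

x = np.ones((2, 4, 4))
y = r_ca_block(x)
assert np.allclose(y, 1.5)       # identity path (1.0) + attended path (0.5)
```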
3.2 Improving small object feature extraction network
When the YOLOv8 model detects road objects of varying sizes, small objects are easily occluded or confused and occupy only a small feature map area, which increases the chance of missed detections. This issue is particularly prevalent in complex road scenes containing small objects. To address these problems and improve the model’s small object detection capability, a dedicated network for small object feature extraction is introduced, as highlighted within the green dotted box in Fig 2. First, the PANet structure in the Neck network is improved by adding an additional branch for small object feature extraction to capture finer details in the image. The number of branches is increased from three to four, yielding a high-resolution 160 × 160 small object feature map rich in detail. Additionally, a detection head is added to the original Head network to process shallow feature maps containing more detailed information, further improving the model’s ability to detect small objects.
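For a 640 × 640 input, the feature map size of each detection branch follows from standard YOLO downsampling strides; under that assumption, the added stride-4 branch is what produces the 160 × 160 map:

```python
# feature-map sizes per detection branch for a 640x640 input
input_size = 640
strides = [4, 8, 16, 32]                  # stride 4 is the added branch
feature_maps = [input_size // s for s in strides]
assert feature_maps == [160, 80, 40, 20]  # 160x160 serves small objects
```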
With these improvements to the small object feature extraction network, the model fuses multiscale features more effectively, reduces the loss of object features, detects objects of various sizes comprehensively and accurately, and copes with interference from complex scenes such as lighting changes and object occlusion, thereby improving the robustness of the model.
3.3 Improving the bounding box regression loss function
The bounding box regression loss function quantifies the discrepancy between the predicted and ground truth bounding boxes, serving as a pivotal component of the object detection algorithm. A lower loss value indicates that the predicted box is closer to the ground truth box, which results in better detection performance.
The Complete Intersection over Union (CIoU) [30] bounding box regression loss function is employed in YOLOv8n, which comprehensively accounts for the distance, overlap area, and aspect ratio between bounding boxes. Fig 8 illustrates the intersection between the predicted and ground truth boxes, and Eqs. (10) and (11) present the corresponding calculation formulas:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{10}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v} \tag{11}$$

where $L_{CIoU}$ is the intersection loss of the predicted and ground truth boxes, $b$ and $b^{gt}$ denote the centroids of the predicted and ground truth boxes, respectively, $c$ represents the diagonal length of the minimum enclosing rectangle, and $\rho$ is the distance between the centroids of the two boxes; $w$, $h$ and $w^{gt}$, $h^{gt}$ are the widths and heights of the predicted and ground truth boxes; $v$ represents the correction factor, and $\alpha$ denotes the weighting coefficient.
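The CIoU computation described above can be sketched in plain Python for boxes in (x1, y1, x2, y2) format (assuming non-degenerate boxes with nonzero union):

```python
import math

def ciou_loss(pred, gt):
    """CIoU loss for two boxes in (x1, y1, x2, y2) format; gt is ground truth."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # intersection over union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    iou = inter / (pw * ph + gw * gh - inter)
    # squared centroid distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio correction v and weighting coefficient alpha
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v) if (1 - iou + v) > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

# identical boxes: IoU = 1, zero distance, zero aspect penalty -> loss 0
assert abs(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))) < 1e-9
# disjoint boxes are penalized
assert ciou_loss((0, 0, 2, 2), (5, 5, 7, 7)) > 0
```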
Since the CIoU loss function involves multiple classes of operations, requires substantial computation time and resources, and exhibits insufficient focus on small objects, it fails to meet the requirements for fast and accurate road object detection in autonomous driving scenarios. Therefore, the efficient Powerful-IoU v2 (PIoU v2) [36] loss function is introduced. The penalty factor is computed as shown in Eq. (12), and the intersection between the predicted and ground truth boxes is shown in Fig 9:

$$P = \frac{1}{4}\left(\frac{dw_1}{w^{gt}} + \frac{dw_2}{w^{gt}} + \frac{dh_1}{h^{gt}} + \frac{dh_2}{h^{gt}}\right) \tag{12}$$

$$L_{PIoU} = 1 - IoU + 1 - e^{-P^2} \tag{13}$$

where $P$ is the penalty factor, and $dw_1$, $dw_2$, $dh_1$, $dh_2$ represent the absolute distances between the corresponding sides of the predicted and ground truth boxes.
Existing bounding box regression loss functions, such as CIoU, EIoU, and SIoU, suffer from inappropriate penalty factors, which lead to enlarged anchor boxes, slow convergence, and insufficient consideration of object scale [36]. To overcome these limitations, the penalty factor $P$ incorporated into PIoU v2 steers the anchor box back to the ground truth box in a direct and effective way, preventing oversized anchor boxes and making the anchor box match the ground truth box more closely. Additionally, a non-monotonic attention function $u(x) = 3x \cdot e^{-x^2}$ is introduced to adaptively adjust gradients according to anchor box quality, thereby improving the focus on medium-quality and high-quality anchor boxes. The PIoU v2 loss function is given by Eq. (14):

$$L_{PIoU\,v2} = u(\lambda q) \cdot L_{PIoU} = 3\lambda q \, e^{-(\lambda q)^2} \cdot L_{PIoU}, \qquad q = e^{-P} \tag{14}$$

Here the penalty factor $P$ is replaced by $q = e^{-P}$, with the hyperparameter $\lambda$ set to 1.3. This modification enhances the convergence speed and detection accuracy of the PIoU v2 bounding box regression loss function.
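A minimal Python sketch of the PIoU v2 computation described above, following the Powerful-IoU formulation for boxes in (x1, y1, x2, y2) format (assuming non-degenerate boxes):

```python
import math

def piou_v2_loss(pred, gt, lam=1.3):
    """PIoU v2 loss for two boxes in (x1, y1, x2, y2) format; gt is ground truth."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    gw, gh = gx2 - gx1, gy2 - gy1
    iou = inter / ((px2 - px1) * (py2 - py1) + gw * gh - inter)
    # penalty factor: edge distances normalized by the ground-truth box size
    p = (abs(px1 - gx1) / gw + abs(px2 - gx2) / gw
         + abs(py1 - gy1) / gh + abs(py2 - gy2) / gh) / 4.0
    l_piou = 1 - iou + 1 - math.exp(-p ** 2)                 # PIoU loss
    q = math.exp(-p)                                          # replaced penalty
    return 3 * lam * q * math.exp(-(lam * q) ** 2) * l_piou   # PIoU v2

# perfectly matched boxes: P = 0, L_PIoU = 0, so the v2 loss is also 0
assert abs(piou_v2_loss((0, 0, 10, 10), (0, 0, 10, 10))) < 1e-9
# a misaligned box is penalized
assert piou_v2_loss((0, 0, 10, 10), (0, 0, 10, 12)) > 0
```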
4 Experimental results and analysis
4.1 Dataset
The autonomous driving dataset KITTI [37], jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago, is used in this experiment. It contains a large number of overlapping and small objects, covering a variety of road scenarios such as countryside, city center, and highway, with nine categories in total: Tram, Truck, Car, Person sitting, Pedestrian, Van, Cyclist, Misc, and DontCare. After comprehensively considering the practical needs of autonomous driving systems, Car, Tram, Truck, and Van are uniformly classified as Car; Person sitting and Pedestrian are uniformly classified as Pedestrian; Cyclist is retained; and Misc and DontCare are deleted, giving a total of 3 categories (Car, Pedestrian, and Cyclist). Since the test set images in KITTI are unlabeled, this experiment uses only the labeled training set images, a total of 7481 images, randomly divided into training, validation, and test sets at a ratio of 8:1:1.
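The class consolidation above can be sketched as a simple mapping; the label spellings (e.g. `Person_sitting`) follow common KITTI label-file conventions, and this is an illustrative sketch rather than the authors’ preprocessing script:

```python
# 9 KITTI labels -> 3 detection categories; Misc and DontCare are dropped (None)
REMAP = {
    "Car": "Car", "Van": "Car", "Truck": "Car", "Tram": "Car",
    "Pedestrian": "Pedestrian", "Person_sitting": "Pedestrian",
    "Cyclist": "Cyclist",
    "Misc": None, "DontCare": None,
}

def remap_labels(labels):
    """Apply the consolidation, discarding labels mapped to None."""
    return [REMAP[l] for l in labels if REMAP.get(l) is not None]

assert remap_labels(["Van", "DontCare", "Cyclist", "Person_sitting"]) == \
    ["Car", "Cyclist", "Pedestrian"]
```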
4.2 Experimental environment and parameter settings
The platform configuration for this experiment is presented in Table 1. All models were trained for 200 epochs with a batch size of 16 samples. The input image size was set to 640 × 640, the learning rate to 0.01, momentum to 0.937, and the weight decay coefficient to 0.0005.
4.3 Evaluation metrics
The present study uses four key object detection evaluation metrics to comprehensively evaluate the detection performance of different models: recall (R), precision (P), mean average precision (mAP), and the number of parameters.
The corresponding formulas are presented below:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,dR$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where true positives (TP) is the number of samples correctly classified as positive, false positives (FP) is the number of samples incorrectly classified as positive, and false negatives (FN) is the number of samples incorrectly classified as negative. $AP_i$ denotes the average precision of the $i$-th category, and $N$ represents the total number of categories. The number of parameters is used to assess model complexity: as the number of parameters increases, the model becomes more complex and demands greater computational resources.
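The metric definitions above reduce to a few lines of Python, shown here on made-up counts for illustration:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from TP/FP/FN counts."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_ap(ap_per_class):
    """mAP: mean of per-class average precisions over N categories."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=80, fp=20, fn=20)
assert p == 0.8 and r == 0.8
assert abs(mean_ap([0.9, 0.8, 0.7]) - 0.8) < 1e-9
```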
4.4 Experimental results
4.4.1 Ablation experiment.
To explore the impact of the proposed FasterNet Block, improved small object feature extraction network, and R-CA module on detection performance, the effectiveness of the improvements is verified one by one based on the baseline model. Five groups of ablation experiments are designed based on the KITTI dataset. The test results are shown in Table 2. Fig 10 shows the comparative relationship between the evaluation metrics of different ablation experiments.
In Model 2, the FasterNet Block is introduced to lighten the original Backbone network, which significantly reduces the complexity of the model, solves the problems of frequent memory access and redundant channel information, and improves training efficiency, though detection accuracy is affected to a certain extent. Model 3 enhances the small object feature extraction network of Model 2, which, as can be seen in Fig 10, improves all evaluation metrics, especially R (from 0.791 to 0.835) and mAP@50–95 (from 0.584 to 0.64); this effectively reduces missed detections of small objects on the road, and detection performance is substantially improved. Compared to Model 2, Model 4 incorporates the R-CA module, resulting in improvements across all evaluation metrics. Specifically, P and R increase by 1.2% and 3.5%, while mAP@50 and mAP@50–95 improve by 1.1% and 1.7%, respectively. These results suggest that the network places greater emphasis on key regions in the image and more effectively learns residual information. Finally, the three proposed improvements are integrated into Model 5. Compared with the original YOLOv8n model, P remains almost unchanged, while R, mAP@50, and mAP@50–95 increase by 7.0%, 3.7%, and 5.9%, respectively, which significantly improves road object detection performance and demonstrates effectiveness across object categories.
The ablation experiment results demonstrate that the FasterNet Block, improved small object feature extraction network, and R-CA module proposed in this study effectively improve the performance of the FRCP-YOLO model without any conflict.
4.4.2 Comparative evaluation of bounding box regression loss functions.
Using Model 5, which achieved the best detection performance in Section 4.4.1, as the baseline model, experiments were conducted on common loss functions under the same training conditions to study the impact of the loss function on object detection performance. Table 3 gives the object detection evaluation metrics of the PIoU v2 loss function and other bounding box loss functions on the KITTI dataset.
As shown in Table 3, the PIoU v2 loss function delivers the best detection performance on Model 5. Compared with the CIoU loss function used by the original YOLOv8n network, R, mAP@50, and mAP@50–95 increase by 1.6%, 1.3%, and 0.7%, respectively, while P is slightly reduced. Overall, PIoU v2 improves model performance more than the other loss functions: it not only prevents anchor box enlargement, but also enhances the focus on anchor boxes. The experimental results show that the PIoU v2 loss function is effective in improving the performance of Model 5.
4.4.3 Comparative experiments of different mainstream algorithms.
To evaluate the effectiveness of the proposed algorithm in road object detection for autonomous driving scenarios, a comparative experiment was conducted using FRCP-YOLO, YOLOv8, and other mainstream object detection algorithms on the KITTI test set. The results are presented in Table 4 and Fig 11.
Experimental results demonstrate that FRCP-YOLO achieves excellent detection performance while maintaining a lightweight architecture. With only 2.966M parameters, FRCP-YOLO is significantly smaller than larger models such as YOLOv3-tiny (8.671M) and YOLOv7-tiny (6.013M), yet it achieves the highest detection performance among them. Compared with lightweight models such as YOLOv5n, YOLOv9t, YOLOv10n, YOLOv11n, and YOLOv13n, FRCP-YOLO delivers superior detection accuracy while maintaining a compact design. Furthermore, compared with Transformer-based object detectors such as RT-DETR-l and RT-DETR-ResNet50, FRCP-YOLO demonstrates a clear advantage by achieving a substantial reduction in parameter count and computational cost while sustaining leading detection accuracy. Relative to the baseline YOLOv8n model, FRCP-YOLO reduces the number of parameters by 4% while significantly enhancing detection performance, with R increased by 8.6%, and mAP@50 and mAP@50–95 improved by 5.0% and 6.6%, respectively. Overall, these results clearly demonstrate that FRCP-YOLO effectively combines a lightweight design with robust and accurate detection capability for road object detection tasks.
4.5 Verification of the generalization ability of the FRCP-YOLO model
To further verify the generalization ability of FRCP-YOLO, road detection experiments were conducted on the BDD100K [38] dataset, released by the Berkeley AI Lab (BAIR) in 2018, which covers various weather conditions (such as sunny, cloudy, and rainy) and different times of day and night. Considering that the BDD100K dataset contains few images with Motor and Bike objects and that the two classes share similar characteristics, they are merged into a Two-Wheelers class; Pedestrian, Car, Truck, Bus, and Two-Wheelers are then used as the road object detection categories. A random selection of 10,000 images from the BDD100K dataset is used to evaluate the model’s generalization ability, divided into training, validation, and test sets at a ratio of 8:1:1. The performance of FRCP-YOLO was compared with the mainstream algorithms from Section 4.4.3 on this dataset, with the experimental results presented in Table 5 and Fig 12.
The experimental results show that the proposed road object detection model outperforms the other network models across all metrics. Compared with the baseline model YOLOv8n, FRCP-YOLO has higher detection performance: P increases by 2.2%, R by 8.8%, and mAP@50 and mAP@50–95 by 6.6% and 4%, respectively. These extensive comparative results show that FRCP-YOLO can accurately perceive the external environment in complex road scenes and detect road objects accurately, with strong adaptability and generalization ability.
5 Conclusion
Aiming at the problems of missed detection of small objects, a large number of parameters, and low detection accuracy of existing models in complex road scenes, a lightweight and efficient road object detection model named FRCP-YOLO is proposed. Specifically, the lightweight and fast FasterNet Block extracts features from only a portion of the input channels, cutting down the model’s parameters and redundant computation. An efficient R-CA module, combining the Coordinate Attention mechanism with a residual block, is proposed to enhance the detection of objects of interest and alleviate gradient vanishing and performance degradation. The small object feature extraction network is improved to capture fine details in the image, reduce the loss of object information, mitigate the impact of complex road scenes, and improve the robustness of the model. Finally, PIoU v2 is introduced as the bounding box loss function to guide anchor boxes to regress along efficient paths, yielding faster convergence and improved detection accuracy.
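As a rough illustration of the partial-convolution idea behind the FasterNet Block [31] — convolving only a fraction of the input channels and passing the rest through untouched — the following framework-free Python sketch uses plain lists and a fixed kernel; the paper’s actual module is a learned convolution inside the network, and the helper names here are hypothetical:

```python
def conv2d_same(channel, kernel):
    """3x3 convolution with zero padding on one 2-D channel (lists of lists)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding at borders
                        s += channel[ii][jj] * kernel[di + 1][dj + 1]
            out[i][j] = s
    return out

def partial_conv(feature_map, kernel, ratio=0.25):
    """FasterNet-style partial convolution: convolve only the first
    `ratio` fraction of channels; the remaining channels pass through
    as an identity path."""
    c = len(feature_map)
    n_conv = max(1, int(c * ratio))
    convolved = [conv2d_same(ch, kernel) for ch in feature_map[:n_conv]]
    untouched = [[row[:] for row in ch] for ch in feature_map[n_conv:]]  # copy
    return convolved + untouched
```

Because only a `ratio` fraction of the channels is convolved, the FLOPs and parameters of this operation drop roughly by that same fraction relative to a full convolution over all channels, which is the lightweighting effect exploited in the backbone.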
Experimental results show that FRCP-YOLO performs well in road object detection tasks on both the KITTI dataset and the BDD100K dataset, which contains complex road environments. Compared with the baseline model, mAP@50 increased by 5.0% and 4.0% on the two datasets, respectively, while R increased by 8.6% and 8.8%, and the number of parameters decreased by 4%. Compared with other mainstream algorithms, FRCP-YOLO achieves superior detection accuracy while remaining lightweight, generalizes better, and effectively alleviates the problems of missed detection of small objects, large parameter counts, and low detection accuracy in complex road scenarios, making it suitable for object detection in autonomous vehicles on complex roads. This study evaluates the improved model along multiple dimensions, such as detection accuracy and parameter count, and provides new ideas and methods for subsequent research on autonomous driving visual perception. Despite these promising results, the present study still has certain limitations, and future work will focus on optimizing the model in the following two aspects:
- (1). In the experiment, only major road objects such as pedestrians and motor vehicles were detected. In the future, road objects such as traffic lights and traffic signs will be included in the research scope.
- (2). The next step is to add a semantic segmentation network on top of the FRCP-YOLO model to perform lane line detection, realizing joint detection of road objects and lane lines for autonomous vehicles.
References
- 1. Cai Z, Chen R, Wu Z, Xue W. YOLOv8n-FAWL: Object Detection for Autonomous Driving Using YOLOv8 Network on Edge Devices. IEEE Access. 2024;12:158376–87.
- 2. Liang S, Wu H, Zhen L, Hua Q, Garg S, Kaddoum G, et al. Edge YOLO: Real-Time Intelligent Object Detection System Based on Edge-Cloud Cooperation in Autonomous Vehicles. IEEE Trans Intell Transport Syst. 2022;23(12):25345–60.
- 3. Wang H, Liu C, Cai Y, Chen L, Li Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans Instrum Meas. 2024;73:1–16.
- 4. Natan O, Miura J. DeepIPC: Deeply Integrated Perception and Control for an Autonomous Vehicle in Real Environments. IEEE Access. 2024;12:49590–601.
- 5. Wu W, Liu C, Zheng H. A panoramic driving perception fusion algorithm based on multi-task learning. PLoS One. 2024;19(6):e0304691. pmid:38833435
- 6. Wang L, Law KLE, Lam C-T, Ng B, Ke W, Im M. Automatic Lane Discovery and Traffic Congestion Detection in a Real-Time Multi-Vehicle Tracking Systems. IEEE Access. 2024;12:161468–79.
- 7. Su G, Shu H. Traffic flow detection method based on improved SSD algorithm for intelligent transportation system. PLoS One. 2024;19(3):e0300214. pmid:38483877
- 8. Zhao J, Zhao W, Deng B, Wang Z, Zhang F, Zheng W, et al. Autonomous driving system: A comprehensive survey. Exp Syst Appl. 2024;242:122836.
- 9. Chen C, Wang C, Liu B, He C, Cong L, Wan S. Edge Intelligence Empowered Vehicle Detection and Image Segmentation for Autonomous Vehicles. IEEE Trans Intell Transport Syst. 2023;24(11):13023–34.
- 10. Kang L, Lu Z, Meng L, Gao Z. YOLO-FA: Type-1 fuzzy attention based YOLO detector for vehicle detection. Exp Syst Appl. 2024;237:121209.
- 11. Yin Y, Zhang Z, Wei L, Geng C, Ran H, Zhu H. Pedestrian detection algorithm integrating large kernel attention and YOLOV5 lightweight model. PLoS One. 2023;18(11):e0294865. pmid:38019827
- 12. Zhang X, Zhang X, Wang J, Ying J, Sheng Z, Yu H, et al. TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection. IEEE Trans Neural Netw Learn Syst. 2025;36(7):13276–90. pmid:39178072
- 13. Deng T, Wu Y. Simultaneous vehicle and lane detection via MobileNetV3 in car following scene. PLoS One. 2022;17(3):e0264551. pmid:35245342
- 14. Liu J, Deng B, Zou C, Zeng B, Zhang W, Tan J. Multi-lane detection by combining line anchor and feature shift for urban traffic management. Eng Appl Artif Intell. 2023;123:106238.
- 15. Muhammad K, Ullah A, Lloret J, Ser JD, de Albuquerque VHC. Deep Learning for Safe Autonomous Driving: Current Challenges and Future Directions. IEEE Trans Intell Transport Syst. 2021;22(7):4316–36.
- 16. Liu Q, Liu Y, Lin D. Revolutionizing Target Detection in Intelligent Traffic Systems: YOLOv8-SnakeVision. Electronics. 2023;12(24):4970.
- 17. Yang B, Hu Z. Automatic driving target detection based on YOLOv8n improved algorithm. Control Eng China. 2025;32(12):2177–84.
- 18. Wang B, Li Y-Y, Xu W, Wang H, Hu L. Vehicle–Pedestrian Detection Method Based on Improved YOLOv8. Electronics. 2024;13(11):2149.
- 19. Dong X, Yan S, Duan C. A lightweight vehicles detection network model based on YOLOv5. Eng Appl Artif Intell. 2022;113:104914.
- 20. Chen S, Ni S, Tong L. Improved YOLOX Algorithm for Object Detection in Autonomous Driving Scenarios. J Comput Eng Appl. 2024;60(12):225–33.
- 21. Li F, Zhao Y, Wei J, Li S, Shan Y. SNCE-YOLO: An Improved Target Detection Algorithm in Complex Road Scenes. IEEE Access. 2024;12:152138–51.
- 22. Liu Z, Gao G, Sun L, Fang Z. HRDNet: High-Resolution Detection Network for Small Objects. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). 2021. 1–6.
- 23. Liu X, Wang Y, Yu D, Yuan Z. YOLOv8-FDD: A Real-Time Vehicle Detection Method Based on Improved YOLOv8. IEEE Access. 2024;12:136280–96.
- 24. Wang J, Chen Y, Dong Z, Gao M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput Applic. 2022;35(10):7853–65.
- 25. Zhang Y, Sun Y, Wang Z, Jiang Y. YOLOv7-RAR for Urban Vehicle Detection. Sensors (Basel). 2023;23(4):1801. pmid:36850399
- 26. Ultralytics. YOLOv8. Available from: https://github.com/ultralytics/ultralytics
- 27. Khanam R, Hussain M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv preprint arXiv:2407.20892.
- 28. Yaseen M. What is YOLOv8: An in-depth exploration of the internal features of the next-generation object detector. arXiv preprint. 2024.
- 29. He K, Zhang X, Ren S, Sun J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16. pmid:26353135
- 30. Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. AAAI. 2020;34(07):12993–3000.
- 31. Chen J, Kao SH, He H, Zhuo W, Wen S, Lee CH. Run, don’t walk: chasing higher FLOPS for faster neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 12021–31.
- 32. Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 7132–41.
- 33. Woo S, Park J, Lee JY, Kweon IS. Cbam: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. p. 3–19.
- 34. Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 13713–22.
- 35. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 770–8.
- 36. Liu C, Wang K, Li Q, Zhao F, Zhao K, Ma H. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Netw. 2024;170:276–84. pmid:38000311
- 37. Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012. p. 3354–61.
- 38. Yu F, Chen H, Wang X, Xian W, Chen Y, Liu F, et al. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 2636–45.