Abstract
With the rapid development of intelligent connected vehicles, driver assistance systems place increasing demands on onboard hardware and software. Most current vehicles are constrained by the hardware resources of their onboard systems and mainly process single-task, single-sensor data, which makes complex panoramic driving perception difficult to achieve. While the panoramic driving perception algorithm YOLOP achieves outstanding performance in multi-task processing, it suffers from poor adaptability of the feature map pooling operation and loss of detail during downsampling. To address these issues, this paper proposes a panoramic driving perception fusion algorithm based on multi-task learning. Model training introduces different loss functions and a series of processing steps for lidar point cloud data. The perception information from lidar and vision sensors is then fused to achieve synchronized processing of multi-task, multi-sensor data, thereby effectively improving the performance and reliability of the panoramic driving perception system. To evaluate the performance of the proposed algorithm in multi-task processing, the BDD100K dataset is used. The results demonstrate that, compared to the YOLOP model, the multi-task learning network performs better in lane detection, drivable area detection, and vehicle detection. Specifically, lane detection accuracy improves by 11.6%, the mean Intersection over Union (mIoU) for drivable area detection increases by 2.1%, and the mean Average Precision at 50% IoU (mAP50) for vehicle detection improves by 3.7%.
Citation: Wu W, Liu C, Zheng H (2024) A panoramic driving perception fusion algorithm based on multi-task learning. PLoS ONE 19(6): e0304691. https://doi.org/10.1371/journal.pone.0304691
Editor: Chenchu Xu, Anhui University, CANADA
Received: October 10, 2023; Accepted: May 16, 2024; Published: June 4, 2024
Copyright: © 2024 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All code and datasets can be obtained from the GitHub address, which is https://github.com/liucq666/yolop_improved.
Funding: This work was supported in part by Guangxi Science and Technology Base and Talent Project under Grant GuiKeAD23026199, in part by Guangxi Science and Technology Major Project under Grant GuikeAA23073006-02, in part by a grant from Guangxi Key Laboratory of Machine Vision and Intelligent Control under Grant 2022B02, in part by Guangxi Minzu University Xiangsi Lake Youth Scholar Innovation Team Funding under Grant 2023GXUNXSHQN06, in part by the National Natural Science Foundation of China under Grant 62241302.
Competing interests: The authors have declared that no competing interests exist.
Introduction
In recent years, under the guidance of the low-carbon economy concept, the global automotive industry has been developing continuously towards energy diversification, intelligence, and greening. The advent of the 5G era has greatly promoted the development of intelligent connected vehicles, while also placing higher requirements on the intelligence of existing vehicles. As intelligent connected vehicle technology continues to evolve, panoramic driving perception systems, as one of its key components, are advancing as well. A panoramic driving perception system senses the surroundings of the vehicle through a variety of sensors (such as cameras and lidar), providing reliable data support for intelligent connected vehicles. However, in practical applications, considerations extend beyond accuracy and robustness to computational efficiency and overall performance in order to meet the requirements of low-cost autonomous driving. Within the panoramic driving perception framework, lane line detection, drivable area detection, and vehicle detection are the pivotal technological tasks.
In the realm of lane detection, traditional computer vision algorithms are progressively being supplanted by deep learning methodologies. Conventional lane detection techniques typically hinge on computer vision technologies, such as edge detection and morphological transformations. In contrast, deep learning accomplishes more precise lane detection by extracting more intricate image features, as exemplified by LaneATT [1], Enet [2] (Efficient Net), and LATR [3] (LAne detection TRansformer), among others. As for drivable area detection, the use of multi-scale algorithms like Swin-APT [4] (Swin-Transformer Adaptor for Intelligent Transportation) and DeepLabv3+ [5] can effectively augment its accuracy and robustness. In the realm of vehicle detection, the advent of deep learning has seen a gradual replacement of traditional vehicle detection methods with deep learning networks, including EnsembleNet [6], You Only Look Once (YOLO), and Swin Transformer [7]. While these tasks individually exhibit excellent performance, their combined performance often needs to be improved, struggling to balance accuracy and robustness. The YOLO series of object detection algorithms, such as You Only Look Once version 5 [8] (YOLOv5) and You Only Look Once version 8 [9] (YOLOv8), as current mainstream detection algorithms, provide clear directions for the multi-task learning network in this paper.
In order to enhance the performance and practicality of panoramic driving perception systems, numerous studies have been dedicated to designing more efficient and accurate multi-task learning networks. In recent research, multi-task learning networks such as DeMT [10] (Deformable Mixer Transformer), SMNet [11] (Symmetric Multi-task Network), YOLOP [12], HybridNets [13], and YOLOPv2 [14] have gradually integrated single tasks into multi-task models and processed them simultaneously to improve performance. However, these multi-task learning networks still face certain challenges in current low-cost autonomous driving applications. Designing a high-performance multi-task learning network with a small parameter count that is suitable for traffic scenarios remains an urgent problem.
Most modern cars are equipped with single-type sensors, providing crucial data sources for automotive driver assistance systems. However, any system that relies on a single sensor as its only data source involves a trade-off between advantages and shortcomings. For example, Tesla relies solely on cameras for assisted driving, which leads to poor environmental adaptability. It is easily disturbed by weather, such as rain, snow, fog, and dust, and cannot meet the requirements for autonomous driving in different weather conditions and at higher speeds.
Therefore, a single sensor cannot resolve all issues, and data fusion from multiple sensors is inevitably a trend. Multi-sensor fusion mainly includes pixel-level fusion, feature-level fusion, and decision-level fusion [15]. Common fusion methods include wavelet transform methods [16], clustering methods [17], and logical theory [18]. By fusing data from multiple sensors, the accuracy of detection and the system’s safety can be effectively improved. This approach is less affected by the limitations of sensors, thereby enhancing the performance of the car’s driver assistance system.
Addressing the aforementioned issues, this paper proposes a panoramic driving perception fusion algorithm based on multi-task learning, which comprehensively handles multiple tasks such as lane detection, drivable area detection, and vehicle detection. It adopts a multi-sensor fusion strategy, specifically, the fusion of lidar and visual sensors, to achieve synchronization of different sensors in time and space, enhancing the accuracy and robustness of the panoramic driving perception system. Furthermore, it bolsters computational efficiency and overall system performance, bearing significant implications for low-cost autonomous driving applications.
The main contributions of this paper can be summarized as follows:
- This paper proposes a panoramic driving perception fusion algorithm based on multi-task learning. This algorithm enables the simultaneous processing of multi-task and multi-sensor data. It achieves feature-level fusion of lidar and visual sensors, leading to a comprehensive enhancement in the driving perception performance of vehicles. Additionally, it presents a feasible technical solution for autonomous driving.
- In order to address the limitations of the YOLOP network, this study introduces the C2f, SPPF, and ConvTranspose2D structures. These structures are aimed at improving the adaptability of the feature map pooling operation and minimizing the loss of details during downsampling. Through optimization of the original network structure and loss function, the paper effectively resolves the issues related to adaptability and detail, thus significantly enhancing the detection performance and robustness of the multi-task learning network.
- A data fusion algorithm for lidar and visual sensors is devised to overcome the limitations of using a single sensor. This algorithm reduces the redundancy of sensor data, facilitates the sharing of perception information between sensors, and leads to a substantial improvement in the performance and accuracy of perception information subsequent to multi-sensor data fusion.
Related content
Multi-task learning network-YOLOP
YOLOP network structure mainly consists of an encoder and three decoders. The encoder is used to extract target features, while the decoders are used for target detection. The network structure is shown in Fig 1.
The YOLOP network comprises two main parts: the encoder and the decoder. It utilizes the Cross Stage Partial Darknet [19] (CSPDarknet) as the Backbone network, while the Neck network comprises Spatial Pyramid Pooling [20] (SPP) and Feature Pyramid Network [21] (FPN). The decoder is composed of a vehicle detection head, a drivable area segmentation head, and a lane line detection head. They each receive the output features of the encoder for downsampling and upsampling, and through a series of convolutional layers and Bottleneck Cross Stage Partial [22] (BottleneckCSP) modules, the detection of drivable areas and lane lines is achieved. Finally, the prediction results are divided into multiple scales to improve detection accuracy.
The entire YOLOP network structure allows for end-to-end training, where it can directly input images and obtain the results of the three tasks in the output. Therefore, based on YOLOP, this paper addresses the issue of sub-optimal network performance by improving various modules and optimizing the network structure. This achieves higher accuracy and robustness, making it perform better in practical applications.
Lidar and vision joint calibration
In processing raw lidar point cloud data, the inherent sparsity poses significant challenges. Utilizing time-synchronized image data effectively addresses this challenge. For precise object perception, the point cloud data should be projected onto the image, ensuring overlap. This operation mandates a joint calibration between lidar and vision sensors. Consequently, leveraging the detailed information from image data, combined with point cloud data accuracy, augments object detection and classification performance.
In many vehicular systems, the acquired lidar data predominantly comprises parameters such as horizontal rotation angle and distance. For enhanced fusion of lidar and vision data, conversion of point cloud data from polar coordinates (angle and distance) to Cartesian coordinates (x, y, z) is imperative.
The distribution of the coordinates is shown in Fig 2, and their conversion relationship can be expressed as:

$x = r\cos\omega\sin\alpha,\quad y = r\cos\omega\cos\alpha,\quad z = r\sin\omega \qquad (1)$
Among them, r is the measured distance, ω is the vertical angle of the lidar, and α is the horizontal rotation angle of the lidar. The coordinates x, y, z represent the projections of polar coordinates onto the X, Y, Z axes respectively.
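As a concrete illustration, Eq (1) can be written as a short NumPy routine. This is a minimal sketch; the axis convention (y pointing forward, x to the right, z up) is an assumption and must be matched to the actual sensor frame.

```python
import numpy as np

def spherical_to_cartesian(r, omega, alpha):
    """Convert a lidar return to Cartesian coordinates (Eq 1).

    r     : measured distance (m)
    omega : vertical angle of the laser channel (rad)
    alpha : horizontal rotation angle (rad)
    """
    x = r * np.cos(omega) * np.sin(alpha)
    y = r * np.cos(omega) * np.cos(alpha)
    z = r * np.sin(omega)
    return np.stack([x, y, z], axis=-1)

# Example: one return at 10 m, 2 degrees vertical, 45 degrees horizontal.
point = spherical_to_cartesian(10.0, np.deg2rad(2.0), np.deg2rad(45.0))
```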
Establishing accurate coordinate transformations among the lidar, three-dimensional world, camera, image, and pixel coordinate systems is crucial for the successful integration of lidar and vision. The fusion process between lidar and visual sensors primarily involves translating measurements from diverse sensor coordinate systems into a standardized one. Fundamentally, sensor calibration overlays the perception outputs of the lidar onto the image. This procedure can be delineated into the subsequent three steps:
- Transforming the world coordinate system into the camera coordinate system, the conversion relationship can be expressed as:
$\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} R_{3\times 3} & T_{3\times 1} \\ 0_{1\times 3} & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2)$
Where Xc, Yc, Zc are the camera coordinates, X, Y, Z are the world coordinates of the lidar, R3x3 and T3x1 are the rotation matrix and translation vector of the camera respectively, and 01x3 is a 1x3 zero matrix.
- Transforming the camera coordinate system into the image coordinate system (Fig 3), the conversion relationship can be expressed as:
$x = f\frac{X_c}{Z_c},\quad y = f\frac{Y_c}{Z_c} \qquad (3)$
Among them, x, y are the image coordinates, f is the camera focal length, and Xc, Yc, Zc are consistent with Formula (2).
- Transforming the image coordinate system into the pixel coordinate system, the conversion relationship can be expressed as:
$u = \frac{x}{dx} + u_0,\quad v = \frac{y}{dy} + v_0 \qquad (4)$
Where u, v are the pixel coordinates, dx, dy are the length and width of a single pixel in the image plane, and u0, v0 are the coordinates of the origin of the image coordinate system in the pixel coordinate system.
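Putting the three steps together, the full lidar-to-pixel projection can be sketched in a few lines of NumPy. The sketch below is illustrative only; the symbol names mirror Eqs (2)–(4), with fx = f/dx and fy = f/dy denoting the focal length in pixel units.

```python
import numpy as np

def project_lidar_to_pixels(points_xyz, R, T, fx, fy, u0, v0):
    """Project lidar points into pixel coordinates via Eqs (2)-(4).

    points_xyz : (N, 3) points in the lidar/world frame
    R, T       : 3x3 rotation matrix and 3-vector translation (extrinsics)
    fx, fy     : focal length in pixel units (f/dx, f/dy)
    u0, v0     : principal point in pixel coordinates
    """
    cam = points_xyz @ R.T + T.reshape(1, 3)      # world -> camera (Eq 2)
    cam = cam[cam[:, 2] > 0]                      # keep points in front of the camera
    u = fx * cam[:, 0] / cam[:, 2] + u0           # camera -> image -> pixel (Eqs 3-4)
    v = fy * cam[:, 1] / cam[:, 2] + v0
    return np.stack([u, v], axis=-1), cam[:, 2]   # pixel coordinates and depths
```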
Fusion perception algorithm
This paper proposes a multi-sensor fusion perception algorithm built upon the framework of the multi-task learning network. A backbone network is employed to handle three tasks: lane line, vehicle, and drivable area detection. Through the fusion of lidar and vision sensor data, the features of both are merged, resulting in a more comprehensive and informative representation. The architecture of the perception fusion network is illustrated in Fig 4. In this section, we present the designed multi-task learning network, the lidar point cloud processing procedure, and the feature-level fusion strategy of the two sensors, based on the relevant content mentioned above.
Improved YOLOP algorithm
The multi-task learning algorithm in this paper introduces the C2f structure on the basis of the YOLOP network structure. The C2f structure aims to expedite both the model’s training and inference. Additionally, the SPPF module is incorporated to address the adaptability issue encountered during the feature map pooling operation, while ConvTranspose2D is introduced to tackle the problem of detail loss. Furthermore, in order to enhance the detection performance of multi-task learning, adjustments and optimizations are made to the drivable area and lane detection modules. Specifically, an additional layer is added to the lane detection component, augmenting the network’s capacity for processing intricate information. The network structure is visualized in Fig 5. This section outlines the designed multi-task learning network structure and elaborates on the specific optimizations and enhancements made to the original modules, network structure, and loss functions, all based on the YOLOP multi-task learning network.
Improved module
1.C2f module. In our modified network architecture, the BottleneckCSP module is supplanted by the C2f module [23], as depicted in Fig 6. This alteration significantly curtails computational overhead by channel reduction, leading to expedited model training and inference. Moreover, it substantially mitigates GPU memory consumption. The C2f module stands out, restricting information flux, averting information degradation, and amplifying both robustness and the model’s generalization capability. Such attributes are invaluable for object detection in intricate settings. Refinements in network structures, coupled with the elimination of superfluous computations, further diminish model parameters and computational intricacy, all the while preserving model precision and efficacy.
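For reference, the sketch below shows a simplified PyTorch implementation of a C2f-style block, following the YOLOv8 design that the module is borrowed from; the exact channel ratios, kernel sizes, and bottleneck counts used in the improved network may differ.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic unit used below."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C2f(nn.Module):
    """Split the input, pass one half through n residual bottlenecks, then
    concatenate every intermediate output and fuse with a 1x1 convolution."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, k=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(ConvBNSiLU(self.c, self.c), ConvBNSiLU(self.c, self.c))
            for _ in range(n))
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, k=1)
    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.blocks:
            y.append(m(y[-1]) + y[-1])   # residual bottleneck on the latest branch
        return self.cv2(torch.cat(y, dim=1))
```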
2.SPPF module. In our enhanced network, the SPP module is substituted by a more efficient SPPF module [24], illustrated in Fig 7. This module offers a tangible reduction in both computational requirements and storage demands. The SPPF module integrates convolutional strata with fully connected layers. When juxtaposed with the SPP module, it boasts a leaner parameter set and an accelerated computation rate. Additionally, it magnifies the model’s receptive ambit and expressiveness. The SPPF module adaptively conducts pooling of varying magnitudes on the feature map, ensuring extraction of multi-scale feature information. This adaptability circumvents issues associated with information omission or redundancy during drastic input image size variations. Furthermore, the SPPF module possesses a heightened ability for information amalgamation, adeptly integrating multi-scale data, subsequently enhancing model performance.
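The SPPF idea is compact enough to state directly. In the sketch below (a minimal version with activation layers omitted), three chained 5×5 max-pools reuse intermediate results to emulate the parallel 5/9/13 pooling windows of SPP at lower cost.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Serial max-pooling with a shared 5x5 kernel: successive pools widen the
    receptive field while each pooling result is computed only once."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)
    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```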
3. ConvTranspose2D. Within the lane line detection branch, Upsample is superseded by ConvTranspose2D [25]. Leveraging an adept convolution computation library, this module batch processes convolution operations on an array of input imagery, maximizing hardware capabilities like GPU for parallel computation, resulting in a marked improvement in processing speed. Contrary to Upsample, ConvTranspose2D consolidates both convolution and upsampling operations, streamlining the process. The Upsample method typically employs a static interpolation technique, devoid of the capability to glean the optimal transformation from datasets, rendering it less adaptable to specific tasks. Conversely, ConvTranspose2D, tailored to task-specific requirements, utilizes modifiable convolution kernel parameters, culminating in superior outcomes.
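The replacement itself is a one-line change in the lane-line branch; in the sketch below the channel count is purely illustrative.

```python
import torch.nn as nn

# Fixed interpolation: no learnable parameters, the kernel cannot adapt to the task.
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

# Learnable 2x upsampling: the transposed-convolution weights are trained together
# with the rest of the lane-line branch, so the upsampling adapts to the data.
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=64, kernel_size=2, stride=2)
```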
Optimize network structure
1. Introduce path aggregation network (PAN). ① To strengthen the fusion and representation of feature maps at different scales, a PAN [26] is added on the basis of the original FPN; it aligns feature maps of different scales to a common resolution, facilitating subsequent processing. The FPN organically combines feature information of different scales and improves the representation ability of image features, thereby improving the recognition accuracy of the model.
② To further improve the efficiency and accuracy of object detection and segmentation tasks, add the drivable area detection module before FPN. This can preliminarily process the input image, remove some invalid areas, better integrate the information of the drivable area into the overall features, ensure the accuracy of the detection results, and increase the robustness of the model.
③ Connect the lane line detection module after FPN to avoid excessive computation of low-level features, which can improve computational efficiency and make the information of lane lines clearer after higher-level feature extraction and fusion, further improving the accuracy of lane line detection results.
2. Improve network structure. ① Enhancements to the model’s computational speed are achieved by reducing the convolution kernel’s size and stride, which lowers the feature map resolution and the computational complexity of the convolutions. In the head network segment, the number of convolutional, pooling, and deconvolution layers is minimized. Using larger convolution and deconvolution kernels along with a deeper network structure increases the segmentation head’s resolution and precision on the feature map.
② Incorporating an additional upsampling layer within the drivable area detection branch transitions the architecture from the initial three-layer upsampling to a more refined four-layer structure. This modification effectively addresses the feature omission challenges engendered by superficial feature layers. Within the segmentation head network, a deconvolution layer of size 4 is integrated. By judiciously reducing model parameters, both the model’s speed and stability witness comprehensive improvements.
Improve loss function
During training, different loss functions and training strategies are adaptively adjusted for the different tasks and the added datasets, and the loss functions of all tasks are summed with weights to achieve joint learning of multiple tasks. This design allows the algorithm to perform well across different scenes and datasets.
The weighted sum that forms the total loss function can be expressed as:

$L_{sum} = \alpha_1 L_{det} + \alpha_2 L_{da\_seg} + \alpha_3 L_{lane\_seg} + \alpha_4 L_{lane\_iou} \qquad (5)$
The total loss function, vehicle detection loss function, drivable area segmentation loss function, lane line segmentation loss function, and lane line detection loss function are represented by Lsum, Ldet, Lda_seg, Llane_seg, and Llane_iou respectively; α1, α2, α3, α4 are the weights of each loss function. Because the structure of the drivable area and lane line branches has been optimized and adjusted, α2 and α3 are adjusted from the original value of 0.2 to 0.3 and 0.5 respectively, with the aim of enhancing attention to detail. α1 and α4 remain unchanged and are set at 1 and 0.2, respectively.
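In training code, Eq (5) amounts to a single weighted sum; a minimal sketch using the weights stated above (the individual loss terms are assumed to be computed elsewhere):

```python
# Loss weights from Eq (5): detection, drivable-area seg., lane seg., lane IoU.
ALPHA = {"det": 1.0, "da_seg": 0.3, "lane_seg": 0.5, "lane_iou": 0.2}

def total_loss(l_det, l_da_seg, l_lane_seg, l_lane_iou):
    """Weighted multi-task loss of Eq (5)."""
    return (ALPHA["det"] * l_det
            + ALPHA["da_seg"] * l_da_seg
            + ALPHA["lane_seg"] * l_lane_seg
            + ALPHA["lane_iou"] * l_lane_iou)
```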
The weighted sum of the loss function for vehicle detection can be expressed as:

$L_{det} = \alpha_5 L_{box} + \alpha_6 L_{obj} + \alpha_7 L_{cls} \qquad (6)$
In the equation, Lbox, Lobj, and Lcls represent the bounding box regression loss function, object confidence loss function, and classification loss function respectively; α5, α6, α7 are the weights of each loss function. In order to ensure consistency with the weight settings of the YOLOP network, the weights are specifically defined as 0.05, 1.0, and 0.5, respectively. To improve the effectiveness of vehicle detection, the loss functions Lobj and Lcls incorporate the utilization of the Focal loss.
Focal Loss can effectively alleviate the class balance problem between fewer class samples (vehicles, drivable areas, and lane lines) and a large number of background samples. Unlike traditional losses, Focal Loss assigns higher weights to samples that are difficult to classify and misclassified, effectively improving the network’s robustness to difficult samples and noisy data.
The Focal loss function [27, 28] can be expressed as:

$L_{Focal}(p_t) = -\alpha_8 (1 - p_t)^{\gamma} \ln(p_t) \qquad (7)$
In the equation, α8 is a balancing factor used to solve the problem of imbalance between positive and negative samples; γ is an adjustment factor used to adjust the weights of easy and difficult samples; pt is the predicted probability output by the network, and ln(pt) is the logarithm of the predicted value. During the training phase of the model, the parameter α8 is assigned a value of 0.25, which allows for the adjustment of the weight assigned to negative samples to be four times greater than that assigned to positive samples. This adjustment prevents the model from overly prioritizing the more prevalent negative samples. Additionally, the parameter γ is set to 2 in order to amplify the weight of challenging samples. This amplification facilitates the model’s focus on samples that are difficult to classify, ultimately leading to an enhancement in the performance of the model.
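As an illustration, a minimal binary form of Eq (7) in PyTorch is sketched below; in the actual detection head it would be applied element-wise to the objectness and classification maps, so this is not the exact implementation.

```python
import torch

def focal_loss(p, target, alpha8=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss following Eq (7).

    p      : predicted probability of the positive class, float tensor
    target : ground-truth labels in {0, 1}, same shape
    """
    p = p.clamp(eps, 1.0 - eps)
    p_t = p * target + (1.0 - p) * (1.0 - target)              # probability of the true class
    alpha_t = alpha8 * target + (1.0 - alpha8) * (1.0 - target)
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()
```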
Lane line detection has a certain degree of difficulty in multi-task learning networks because the shapes and colors of lane lines vary greatly, and they are often obscured. Before adopting the Dice Loss, the CrossEntropy Loss (CE Loss) was used as the loss function for the lane line detection task, and lane line detection was achieved through the method of positive sample matching. However, this method has a significant problem: the ratio of the number of lane line pixels to background pixels is extremely imbalanced, which makes the model more inclined to predict background pixels and ignore lane line pixels.
Dice Loss calculates the similarity between the predicted result and the true value on a per-pixel basis [29, 30], which can effectively solve the problem of pixel number imbalance. The loss function can be expressed as:

$L_{Dice} = 1 - \frac{2\,|y \cap \hat{y}|}{|y| + |\hat{y}|} \qquad (8)$
In the equation, y and ŷ respectively represent the ground-truth labels and the labels predicted by the model. Using Dice Loss as the loss function effectively measures the degree of match between the model’s predicted results and the actual values, and makes the model pay more attention to the detection of lane line pixels. Leveraging Focal Loss can further ameliorate the issue of an excessive number of negative samples in lane line detection. This enhancement heightens the network’s attention to the lane lines, thereby improving the accuracy of lane line detection.
To further optimize the performance of lane line detection in multi-task learning networks, a combination loss function combining Focal Loss and Dice Loss [31, 32] is used for model training. Dice Loss and Focal Loss respectively target the influence of factors such as diverse lane line shapes, colors, and frequent obscuration, emphasize the detection of lane line pixels, and improve the model’s attention to lane lines, while optimizing the problem of a large number of background pixels in the training dataset. The combination of these two loss functions can fully utilize their complementary advantages, effectively improving the accuracy and robustness of the lane line detection task.
The combined loss function of Focal Loss and Dice Loss can be expressed as:

$L_{lane\_seg} = L_{Focal} + \lambda L_{Dice} \qquad (9)$
In the equation, λ is a constant that controls the weight ratio of Dice loss in the lane line loss function. To enhance the model’s focus on the accuracy of lane lines and the overlap of areas, while paying relatively less attention to class imbalance, the parameter λ is set to 2. By increasing the weight ratio of the Dice Loss, the model can make more precise predictions of lane lines, consequently improving the overall performance.
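Reusing the focal_loss sketch above, Eqs (8) and (9) for a per-pixel lane mask could look as follows; the smoothing constant eps and the flattened-tensor interface are implementation assumptions, not part of the paper’s formulation.

```python
def dice_loss(p, target, eps=1.0):
    """Soft Dice loss of Eq (8): 1 - 2|y ∩ ŷ| / (|y| + |ŷ|)."""
    inter = (p * target).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + target.sum() + eps)

def lane_seg_loss(p, target, lam=2.0):
    """Combined lane segmentation loss of Eq (9): focal + λ · dice, with λ = 2."""
    return focal_loss(p, target) + lam * dice_loss(p, target)
```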
Point cloud processing algorithm
To provide more accurate and reliable scene understanding and perception results, the original laser point cloud data is processed through a series of steps. The processing workflow is shown in Fig 8. This section introduces the technical approach used to process the lidar point cloud.
PassThrough filtering
To improve the processing effect on point cloud data and reduce the complexity of subsequent point cloud data processing, PassThrough filtering [33] is introduced as a preprocessing step for laser point cloud. PassThrough filtering is a preprocessing method used for laser point cloud processing, aiming to remove invalid data in the vertical or horizontal direction, and is suitable for tasks of extracting specific attribute areas in the point cloud. For example, the perception range around the vehicle body can be extracted by setting a distance range.
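Functionally, PassThrough filtering is a coordinate-range mask. A NumPy sketch is given below; the axis indices are assumptions that must be matched to the actual sensor frame, and the example ranges are those used later in the experiments.

```python
import numpy as np

def passthrough_filter(points, axis, lo, hi):
    """Keep only points whose coordinate along `axis` lies within [lo, hi].

    points : (N, 3) array of lidar points in the sensor frame
    """
    mask = (points[:, axis] >= lo) & (points[:, axis] <= hi)
    return points[mask]

# Example (assuming x = forward, z = up): keep 0.3-20 m ahead and -0.9-5 m in height.
# cloud = passthrough_filter(cloud, axis=0, lo=0.3, hi=20.0)
# cloud = passthrough_filter(cloud, axis=2, lo=-0.9, hi=5.0)
```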
Radius outlier removal (ROR) filtering
During the driving process of a car, the onboard sensors may be affected by various factors such as weather conditions, obstructions, or sensor failures, causing the panoramic driving system’s perception performance to decline. Therefore, ROR filtering [34] can remove abnormal interference points and maintain local structural information of point cloud data, such as vehicle shape or road contour. ROR filtering can also reduce the interference of noise and stray points in point cloud data on obstacle detection and scene understanding, thereby improving the accuracy of the perception system for obstacles and enhancing driving safety performance.
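If Open3D is available, ROR filtering reduces to a single library call; the sketch below uses the radius and neighbor count reported in the experiment section, although the original pipeline may rely on PCL instead.

```python
import numpy as np
import open3d as o3d

def ror_filter(points, radius=0.1, min_neighbors=10):
    """Radius outlier removal: drop points that have fewer than `min_neighbors`
    neighbors within `radius` (the settings quoted in the experiments)."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    filtered, kept_idx = pcd.remove_radius_outlier(nb_points=min_neighbors, radius=radius)
    return np.asarray(filtered.points)
```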
Region growing clustering
In autonomous driving, the car driving system needs to understand the complex road environment in which the vehicle is located, including elements such as lane lines, traffic signs, and pedestrians. In order to achieve a more comprehensive and accurate understanding of the driving environment, key scene elements are identified and extracted through the region growing clustering algorithm [35], and obstacles are segmented based on the local information and features of the point cloud, clustering adjacent points together to form accurate obstacle boundaries, thereby achieving more reliable obstacle perception and tracking. The region growing clustering algorithm can adaptively determine the parameters of clustering, such as the minimum or maximum clustering range, the number of neighborhood points. This enables the algorithm to produce satisfactory results across various scenes and datasets.
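To make the idea concrete, a didactic sketch of normal-based region growing is given below. It omits the curvature test, seed ordering by curvature, and minimum/maximum cluster-size checks used by the full PCL-style algorithm, so it is only an approximation of the procedure described above.

```python
import numpy as np
from scipy.spatial import cKDTree

def region_growing(points, normals, k=20, angle_thresh_deg=70.0):
    """Greedy region growing: a neighbor joins the current cluster when the angle
    between its normal and the current point's normal is below the threshold."""
    tree = cKDTree(points)
    k = min(k, len(points))
    labels = np.full(len(points), -1, dtype=int)
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    cluster_id = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = cluster_id
        queue = [seed]
        while queue:
            idx = queue.pop()
            _, nbrs = tree.query(points[idx], k=k)
            for j in np.atleast_1d(nbrs):
                if labels[j] == -1 and abs(np.dot(normals[idx], normals[j])) >= cos_thresh:
                    labels[j] = cluster_id
                    queue.append(j)
        cluster_id += 1
    return labels
```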
Fusion algorithm
The fusion perception algorithm uses the high-precision distance data provided by lidar to identify and locate potential obstacles, providing an important reference for the vehicle’s obstacle avoidance decision-making. At the same time, the algorithm can accurately distinguish different categories of objects, such as vehicles, pedestrians, or traffic signs, and assign them semantic labels. In addition, the fusion of visual information and lidar distance information can provide the vehicle with accurate lane line positioning and trajectory tracking results, and can accurately detect the vehicle’s drivable area, enhancing the vehicle’s understanding and planning ability of the road environment. Such a feature-level fusion strategy provides more comprehensive and accurate information, enhancing the performance of the panoramic driving perception system and the reliability of decision-making. The fusion perception process is shown in Fig 9. This section introduces the strategy design for feature-level fusion of lidar and visual sensor data.
Experiment
In order to verify the effectiveness and reliability of the perception fusion algorithm proposed in this paper, this section not only trains the multi-task learning network model but also compares the training results with different networks (including CNN and transformer models) and different datasets (BDD100K and KITTI) to highlight the superiority and generalizability of the multi-task network. In addition, the feasibility and reliability of the multi-sensor fusion strategy are validated, and an analysis of the final visualization results is conducted.
Experimental setup
The experiments are all implemented in an environment with an RTX3090 24G GPU, using the PyTorch 1.9.0 deep learning framework. The experiments involved in this study were conducted using the publicly available BDD100K dataset. This dataset consists of 100,000 images of driving routes covering approximately 100,000 kilometers. The dataset was divided into training, testing, and validation sets using a 7:2:1 ratio to ensure the experiment’s results are reliable and reproducible. During the model training process, a cosine annealing strategy is used to adjust the learning rate, the number of training iterations is set to 200, the batch size is set to 32, and the time for one iteration is about 32 minutes. The first three training cycles are set as warm-up training to further optimize the model performance during the training process.
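A minimal sketch of the described learning-rate schedule is shown below; the optimizer choice, base learning rate, and stand-in model are assumptions, and only the cosine annealing, 200 epochs, and 3-epoch warm-up follow the setup above.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Conv2d(3, 8, 3)                              # stand-in for the real network
base_lr, epochs, warmup_epochs = 1e-2, 200, 3
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # cosine annealing over training

for epoch in range(epochs):
    if epoch < warmup_epochs:                           # first 3 epochs: linear warm-up
        for g in optimizer.param_groups:
            g["lr"] = base_lr * (epoch + 1) / warmup_epochs
    # ... one pass over the training set with batch size 32 would go here ...
    if epoch >= warmup_epochs:
        scheduler.step()
```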
The performance assessment of the experimental results mainly includes mAP50, Recall, mIoU, Accuracy, and IoU. Among them, mAP50 and Recall are the evaluation metrics for vehicle detection, mIoU is for drivable area detection, and Accuracy and IoU are for lane line detection. Their calculation formulas can be expressed as:

$mAP50 = \frac{1}{k}\sum_{i=1}^{k}\sum_{n}(R_n - R_{n-1})P_n \qquad (10)$

$Recall = \frac{TP}{TP + FN} \qquad (11)$

$mIoU = \frac{1}{k}\sum_{i=1}^{k}\frac{TP}{TP + FP + FN} \qquad (12)$

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (13)$

$IoU = \frac{TP}{TP + FP + FN} \qquad (14)$
Among them, Pn and Rn denote the precision and recall at the nth threshold respectively. Rn and Rn-1 correspond to two contiguous yet distinct intervals on the abscissa. TP (True Positive) refers to the quantity of pixels predicted as positive samples, which coincide with the actual annotations. FN (False Negative) pertains to the count of pixels predicted as negative samples that nonetheless overlap with the actual annotations. FP (False Positive) signifies the number of pixels predicted as positive samples yet bear no overlap with the actual annotations. k represents the quantity of samples. Lastly, TN (True Negative) refers to the number of pixels that are predicted as negative samples and align with the actual annotations.
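For the segmentation-side metrics, the computation reduces to confusion-matrix counts over a binary mask; a brief sketch follows (mAP50 additionally requires per-class precision-recall integration over confidence thresholds, which is omitted here).

```python
import numpy as np

def seg_metrics(pred_mask, gt_mask):
    """Pixel-wise Accuracy and IoU (Eqs 13-14) for binary prediction/ground-truth masks."""
    tp = np.sum((pred_mask == 1) & (gt_mask == 1))
    tn = np.sum((pred_mask == 0) & (gt_mask == 0))
    fp = np.sum((pred_mask == 1) & (gt_mask == 0))
    fn = np.sum((pred_mask == 0) & (gt_mask == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    return accuracy, iou
```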
Analysis of model training results
Model performance analysis.
The experiment compares the performance of four different networks in multi-task processing, as illustrated in Table 1. Among these networks, the proposed multi-task network significantly reduces the parameter count compared to the HybridNets and YOLOPv2 networks, differing by only 3.1M parameters from the YOLOP network. However, the improved network surpasses the YOLOP network in various key metrics, including recall, mean Intersection over Union (mIoU), and accuracy. Additionally, its performance exceeds that of both the transformer-based model proposed by Wenjie Zhu and the YOLO-ODL model, and the improved network is faster than both the HybridNets and YOLOPv2 networks while performing on par with the YOLOP network. This facilitates more precise and efficient handling of multiple tasks while utilizing fewer computing resources. Moreover, it showcases tremendous potential and competitiveness in practical application scenarios.
Analysis of vehicle detection results
The BDD100K dataset includes many traffic objects, such as buses, trucks, and trains, which are similar and correlated to cars in terms of shape and size. To increase the diversity of training data, these traffic objects are merged into a single car category when processing the dataset, improving the accuracy and robustness of vehicle detection. The improved network is compared with traditional multi-task and single-task detection networks, as shown in Table 2.
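A hypothetical sketch of this label merging is given below; the exact class names and annotation format of the BDD100K labels used in the pipeline are assumptions.

```python
# Map the vehicle-like BDD100K categories onto a single "car" class before training.
MERGE_INTO_CAR = {"car", "bus", "truck", "train"}

def remap_category(category: str) -> str:
    """Return the training label for one annotated object."""
    return "car" if category in MERGE_INTO_CAR else category
```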
The results show that the networks based on the YOLO series perform well in vehicle detection. The improved network improves the mAP50 indicator by 20.0% compared to the traditional multi-task detection network MultiNet, and increases by 3.0% compared to the single-task network YOLOv5s, and it also surpasses the transformer model by Wenjie Zhu with a 4.4% improvement. Although it is slightly reduced compared to YOLOPv2 in terms of mAP50, the overall performance is still excellent, surpassing most vehicle detection networks.
Analysis of drivable area detection results
The comparison of the drivable area detection network experiment results is shown in Table 3. The results show that networks based on the YOLO series, such as YOLOP, HybridNets, and YOLOPv2, perform well on the BDD100K dataset. Compared to other networks, the proposed multi-task network exhibits outstanding performance in the drivable area detection task, surpassing the multi-task network DLT-Net by 22.3%, the single-task network PSPNet by 4%, and the performance of Team Host_29005 on the BDD100K challenge website by a remarkable margin of 10%. Moreover, it improves by 6.2% over the transformer-based multi-task model by Xiwen Liang, thereby showcasing exceptional detection capabilities. In conclusion, the enhanced network demonstrates outstanding performance in the drivable area detection task.
Analysis of lane line detection results
The comparison of lane line detection experiment results is shown in Table 4. The results show that, compared to other networks (such as ENet, SCNN, ENet-SAD, and YOLOP), the improved network performs better; in particular, compared to the single-task network ENet, the performance is improved by nearly 48%. In comparison to Wenjie Zhu’s multi-task transformer model, the performance improvement is 7.2%. Although the model presented in this paper shows a slight decrease in performance compared to the HybridNets and YOLOPv2 models, it still offers certain advantages: its parameter count is reduced by 1.8M and 27.9M relative to HybridNets and YOLOPv2, respectively, and it achieves higher FPS (frames per second) than both. Therefore, the improved network is relatively small while performing excellently in lane line detection, making localized processing on edge devices feasible with good accuracy and detail-capture ability.
Ablation experiment
Based on the YOLOP network, the network performance is improved through the improvement and optimization of the network structure, hyperparameters, and loss functions. At the same time, through quantitative and qualitative comparison experiments, the improved multi-task network has significantly improved in various performance indicators compared to the YOLOP network. The comparison of ablation experiment results is shown in Table 5. During the network training process, the effects of different training methods and loss functions on network performance are fully considered.
Analysis of training results on the KITTI dataset
To assess the generalization capability of the proposed multi-task learning network model, we conducted experiments using the KITTI dataset. Given that the KITTI dataset solely provides object detection data and lacks drivable areas and lane line datasets, our focus was exclusively on validating the object detection aspect of the multi-task learning network. The dataset comprises 7,481 images, and the data was partitioned in a ratio of 7.5:1.5. Only the data partition ratio was varied in the experiment, while maintaining consistent settings with the aforementioned experiments. The experiment results are presented in Table 6.
According to the experimental results in Table 6, the multi-task learning network proposed in this paper demonstrates superior performance in object detection compared to YOLOP. The parameter count is reduced by 2.1M, while the model training time is reduced by 3.8-fold. Overall, the evaluation metrics surpass those of the YOLOP model. Consequently, the training results on the KITTI dataset robustly validate the generalization capability and superiority of the proposed multi-task learning network.
Joint calibration of lidar and vision
Vision sensor calibration.
The vision sensor used in the experiment is a 640x480 pixel USB camera, and the internal and external parameters of the camera are obtained using the camera calibration tool in Autoware, as shown in Fig 10.
Among them, X represents the situation of left and right movement in the field of view, Y represents the situation of up and down movement, Size represents the situation of the field of view being full, and Skew represents the situation of angle change. When the progress bar turns green and is full, the calibration is completed. Finally, the internal parameters and distortion data of the camera are calculated. The final internal parameter matrix A and distortion parameters B are as follows:
Joint calibration of lidar and vision
The lidar used in the experiment is a RoboSense 16-line hybrid solid-state lidar with a measuring distance of up to 150 meters, a horizontal measuring angle of 360°, up to 300,000 points per second, and a vertical measuring angle of −15° to +15°. The combined lidar-and-vision platform is mounted about 1 meter above the ground, as shown in Fig 11.
Before joint calibration, a point cloud packet of the calibration board at different positions needs to be recorded, and by replaying the recorded point cloud packet, 9 different pixel point cloud pairs are selected. These data are used to obtain the external parameter matrix of the combination platform, namely the rotation matrix and the translation matrix. The final external parameter matrix C is as follows:
Processing of lidar point cloud
PassThrough filtering.
To be consistent with the view in front of the car, as shown in Fig 12, the PassThrough filter is used to limit the original point cloud (shown in Fig 13) to a forward distance of 0.3 to 20 meters. At the same time, to avoid interference from ground points, the original point cloud is limited to a height range of −0.9 to 5 meters. The number of points is reduced from the original 28,800 to 9,124.
ROR filtering.
To facilitate the construction of the KD-tree and reduce the computational load of the algorithm, the experiment sets the ROR filter search radius to 0.1 and requires a point to have at least 10 neighboring points within this radius in order to be retained. This filters out most of the interference caused by environmental factors.
Region growing clustering.
The experiment obtains the geometric and surface feature information of the point cloud by constructing a KD-tree and computing normals, and uses the normal information for region growing clustering. In the experiment, the minimum and maximum cluster sizes are set to 30 and 10,000 respectively, the number of neighbors searched is set to 20, the smoothness threshold is set to 70, and the curvature threshold is set to 1.0. The number of points after clustering is 2,648. All of the above parameter settings are verified by comparative experiments. The lidar point cloud processing results are shown in Fig 14.
Construction of three-dimensional (3D) prediction boxes.
The construction of 3D prediction boxes is a fundamental task in panoramic driving perception, as it offers crucial input for key tasks, including object detection, tracking, and decision-making. In this experiment, objects within the laser point cloud are segmented into distinct regions, and relevant features like geometric properties and point cloud density are extracted from each region. Utilizing these features, three-dimensional bounding boxes are constructed to precisely depict information such as the object’s position, size, and orientation. The results of this construction process are displayed in Fig 15.
Fusion of lidar and vision.
Based on the data obtained from previous experiments, a preliminary fusion of lidar and vision is performed, i.e., the lidar point cloud coordinates are converted into pixel coordinates, as shown in Fig 16, ensuring that the point cloud and image have a consistent coordinate system.
Through a series of processes such as point cloud filtering and clustering, features of perceived objects can be extracted, including important information such as the position and actual distance of the object. The point cloud processing results are shown in Fig 17. The 3D prediction boxes of the objects are then fused with vision to obtain more accurate information about the object’s size and position. The results of this fusion are shown in Fig 18.
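As one illustrative way to realize this pairing (the paper does not spell out the exact matching rule), each clustered lidar object can be projected into the image with the calibration of Eqs (2)–(4), and the median range of the projected points falling inside a 2D detection box can be attached to that box as its distance estimate; a hypothetical sketch:

```python
import numpy as np

def attach_distance(box_2d, cluster_uv, cluster_depth):
    """Attach a lidar distance estimate to one 2D detection box.

    box_2d        : (x1, y1, x2, y2) from the vision branch
    cluster_uv    : (N, 2) projected pixel coordinates of one lidar cluster
    cluster_depth : (N,) corresponding ranges in meters
    """
    x1, y1, x2, y2 = box_2d
    inside = ((cluster_uv[:, 0] >= x1) & (cluster_uv[:, 0] <= x2) &
              (cluster_uv[:, 1] >= y1) & (cluster_uv[:, 1] <= y2))
    if not np.any(inside):
        return None                                   # no lidar support for this detection
    return float(np.median(cluster_depth[inside]))    # robust distance estimate
```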
Analysis of visualization results
Analysis of vision perception visualization results.
The improved YOLOP model was compared with several state-of-the-art panoramic driving perception technologies. A unified confidence threshold of 0.25 and an IoU threshold of 0.45 were used to filter out inaccurate predicted boxes, ensuring high-quality detection objects. The experiments validated the effectiveness of the improved model in different environments and clarity scenes. Fig 19 presents a visual comparison of the experimental results. From the comparison in Fig 19, it is evident that the improved model outperforms the YOLOP and HybridNets models in terms of lane line and drivable area detection. It also exhibits superior robustness in lane line detection compared to YOLOPv2. In terms of vehicle detection, the HybridNets and YOLOPv2 models display a higher false positive rate based on the visual results. Thus, it can be concluded that the improved model outperforms the majority of existing models in terms of performance, while maintaining high robustness and accuracy.
Analysis of fusion perception visualization results.
The fusion of perception visualization results is depicted in Fig 20. The top left of the predicted bounding boxes provides information regarding the class and confidence of the perceived object, while the top right displays distance information acquired from lidar perception. The class and confidence information of the predicted bounding boxes play a crucial role in identifying the types and potential levels of danger of surrounding objects, thereby influencing driving decisions and ensuring safety. Combining the 3D predicted bounding boxes obtained from lidar perception with the corresponding 2D predicted bounding boxes acquired from visual perception allows for improved accuracy in object recognition, tracking, pose estimation, and precise localization in traffic scenarios, thereby enhancing the perception capabilities and safety of the autonomous driving system. The distance information obtained from lidar perception measures the spatial relationship between perceived objects and the vehicle, facilitating obstacle avoidance and path planning.
To emphasize the performance of our model, we compared the visualization results in various scenarios, including single-task and multi-task as well as single-sensor and multi-sensor setups, as depicted in Fig 21. In the single-task scenario, the YOLOv5 object detection algorithm was employed, while the OpenPCDet [48] lidar 3D object detection algorithm was used in the single-sensor scenario. Fig 21 illustrates that YOLOv5 solely detects the object category, whereas our multi-task model not only identifies object categories but also detects lane lines and drivable areas. Conversely, OpenPCDet only detects 3D objects and lacks the ability to precisely perceive relevant information such as object distance and category. Therefore, the proposed multi-task perception fusion algorithm, integrating lidar and visual sensors, enables vehicles to attain more comprehensive and accurate perception results, thus enhancing the panoramic driving system’s understanding of the surrounding environment.
Conclusion
The presented research introduces a panoramic driving perception fusion algorithm hinged on multi-task learning. The experimental results demonstrate that this algorithm exhibits exceptional detection performance not only on the BDD100K dataset but also on the KITTI dataset, outperforming the majority of CNN-based and transformer-based models. Furthermore, it showcases improved overall performance, high accuracy, and robustness. The fusion technique, which synergizes lidar and visual sensors, significantly augments the holistic perception and comprehension of the ambient environment. Lidar’s proficiency in delivering pinpoint distance metrics is instrumental in sculpting precise environmental networks and obstacle detection. Concurrently, visual sensors excel in discerning objects, lane demarcations, and navigable terrains. The amalgamation of data and characteristics from both lidar and visual sensors markedly enhances perception accuracy and robustness, effectively addressing the challenge of achieving precise panoramic driving perception on limited hardware resources. This provides a fundamental basis of support for applications such as autonomous driving and intelligent connected vehicles.
Looking forward, endeavors will concentrate on refining the fusion algorithm’s structure, bolstering perception accuracy, and its eventual integration into real-world autonomous driving ecosystems. Additionally, avenues like the confluence of lidar, millimeter-wave radar, and visual sensor data, alongside multi-target trajectory tracking, will be explored. Such investigative trajectories aim to amplify the efficacy and applicability of panoramic driving perception, catalyzing advancements in autonomous driving innovations.
References
- 1. Tabelini L., Berriel R., Paixao T. M., Badue C., De Souza A. F., & Oliveira-Santos T. (2021). Keep your eyes on the lane: Real-time attention-guided lane detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 294–302).
- 2. Li B., Zhao Y., & Lou L. (2022). Fast Lane Detection Based on Improved Enet for Driverless Cars. In Advances in Computational Intelligence Systems: Contributions Presented at the 20th UK Workshop on Computational Intelligence, September 8–10, 2021, Aberystwyth, Wales, UK 20 (pp. 379–389). Springer International Publishing.
- 3. Luo Y., Zheng C., Yan X., Kun T., Zheng C., Cui S., et al. (2023). Latr: 3d lane detection from monocular images with transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7941–7952).
- 4. Liu Y., Wu C., Zeng Y., Chen K., & Zhou S. (2023). Swin-APT: An Enhancing Swin-Transformer Adaptor for Intelligent Transportation. Applied Sciences, 13(24), 13226.
- 5. Wang Y., Wang C., Wu H., & Chen P. (2022). An improved Deeplabv3+ semantic segmentation algorithm with multiple loss constraints. Plos one, 17(1), e0261582. pmid:35045083
- 6. Mittal U., Chawla P., & Tiwari R. (2023). EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models. Neural Computing and Applications, 35(6), 4755–4774.
- 7. Deshmukh P., Satyanarayana G. S. R., Majhi S., Sahoo U. K., & Das S. K. (2023). Swin transformer based vehicle detection in undisciplined traffic environment. Expert Systems with Applications, 213, 118992.
- 8. Xie Z., Li Y., Xiao Y., Diao Y., Liao H., Zhang Y., et al. (2023). Sugarcane stem node identification algorithm based on improved YOLOv5. Plos one, 18(12), e0295565. pmid:38079443
- 9. Kim J. H., Kim N., & Won C. S. (2023, June). High-Speed Drone Detection Based On Yolo-V8. In ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–2). IEEE.
- 10. Xu Y., Yang Y., & Zhang L. (2023, June). DeMT: Deformable mixer transformer for multi-task learning of dense prediction. In Proceedings of the AAAI conference on artificial intelligence (Vol. 37, No. 3, pp. 3072–3080).
- 11. Niu Y., Guo H., Lu J., Ding L., & Yu D. (2023). SMNet: symmetric multi-task network for semantic change detection in remote sensing images based on CNN and transformer. Remote Sensing, 15(4), 949.
- 12. Wu D., Liao M. W., Zhang W. T., Wang X. G., Bai X., Cheng W. Q., et al. (2022). Yolop: You only look once for panoptic driving perception. Machine Intelligence Research, 19(6), 550–562.
- 13. Vu D., Ngo B., & Phan H. (2022). Hybridnets: End-to-end perception network. arXiv preprint arXiv:2203.09035.
- 14. Han C., Zhao Q., Zhang S., Chen Y., Zhang Z., & Yuan J. (2022). Yolopv2: Better, faster, stronger for panoptic driving perception. arXiv preprint arXiv:2208.11434.
- 15. Wang X., Li K., & Chehri A. (2023). Multi-sensor fusion technology for 3D object detection in autonomous driving: A review. IEEE Transactions on Intelligent Transportation Systems.
- 16. Chen Y., Wang J., & Li G. (2022, December). A efficient predictive wavelet transform for LiDAR point cloud attribute compression. In 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP) (pp. 1–5). IEEE.
- 17. Zhao Y., Zhang X., & Huang X. (2021). A technical survey and evaluation of traditional point cloud clustering methods for lidar panoptic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2464–2473).
- 18. Chen Y., & De Luca G. (2021). Technologies supporting artificial intelligence and robotics application development. Journal of Artificial Intelligence and Technology, 1(1), 1–8.
- 19. Lu K., Zhao F., Xu X., & Zhang Y. (2023). An object detection algorithm combining self-attention and YOLOv4 in traffic scene. PLoS one, 18(5), e0285654. pmid:37200376
- 20. Huang Z., Wang J., Fu X., Yu T., Guo Y., & Wang R. (2020). DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Information Sciences, 522, 241–258.
- 21. Deng C., Wang M., Liu L., Liu Y., & Jiang Y. (2021). Extended feature pyramid network for small object detection. IEEE Transactions on Multimedia, 24, 1968–1979.
- 22. Xiong C., Hu S., & Fang Z. (2022). Application of improved YOLOV5 in plate defect detection. The International Journal of Advanced Manufacturing Technology, 1–13.
- 23. Li Y., Fan Q., Huang H., Han Z., & Gu Q. (2023). A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones, 7(5), 304.
- 24. Tang H., Liang S., Yao D., & Qiao Y. (2023). A visual defect detection for optics lens based on the YOLOv5-C3CA-SPPF network model. Optics Express, 31(2), 2628–2643. pmid:36785272
- 25. Weng W., & Zhu X. (2021). INet: convolutional networks for biomedical image segmentation. IEEE Access, 9, 16591–16603.
- 26. Zhou L., Rao X., Li Y., Zuo X., Qiao B., & Lin Y. (2022). A lightweight object detection method in aerial images based on dense feature fusion path aggregation network. ISPRS International Journal of Geo-Information, 11(3), 189.
- 27. Gan X., Qu J., Yin J., Huang W., Chen Q., & Gan W. (2021). Road damage detection and classification based on M2det. In Advances in Artificial Intelligence and Security: 7th International Conference, ICAIS 2021, Dublin, Ireland, July 19–23, 2021, Proceedings, Part I 7 (pp. 429–440). Springer International Publishing.
- 28. Gao Z. (2023, February). YOLOCa: Center aware yolo for dense object detection. In Journal of Physics: Conference Series (Vol. 2425, No. 1, p. 012019). IOP Publishing.
- 29. Cui Y., Jia M., Lin T. Y., Song Y., & Belongie S. (2019). Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9268–9277).
- 30. Wong V. W. H., Ferguson M., Law K. H., Lee Y. T. T., & Witherell P. (2022). Segmentation of additive manufacturing defects using U-net. Journal of Computing and Information Science in Engineering, 22(3), 031005.
- 31. Chen M., Fang L., & Liu H. (2019, April). FR-NET: Focal loss constrained deep residual networks for segmentation of cardiac MRI. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) (pp. 764–767). IEEE.
- 32. Prencipe B., Altini N., Cascarano G. D., Brunetti A., Guerriero A., & Bevilacqua V. (2022). Focal dice loss-based V-Net for liver segments classification. Applied Sciences, 12(7), 3247.
- 33. Ran W., Lan Y., Dai X., Gu J., Liu B., Geng L., et al. (2022). Obstacle detection system for autonomous vineyard robots based on passthrough filter. International Journal of Precision Agricultural Aviation, 5(1).
- 34. Duan Y., Yang C., & Li H. (2021). Low-complexity adaptive radius outlier removal filter based on PCA for lidar point cloud denoising. Applied Optics, 60(20), E1–E7. pmid:34263788
- 35. del Río-Barral P., Soilán M., González-Collazo S. M., & Arias P. (2022). Pavement crack detection and clustering via region-growing algorithm from 3D MLS point clouds. Remote Sensing, 14(22), 5866.
- 36. Zhu W., Li H., Cheng X., & Jiang Y. (2023). A multi-task road feature extraction network with grouped convolution and attention mechanisms. Sensors, 23(19), 8182. pmid:37837012
- 37. Guo J., Wang J., Wang H., Xiao B., He Z., & Li L. (2023). Research on road scene understanding of autonomous vehicles based on multi-task learning. Sensors, 23(13), 6238. pmid:37448087
- 38. Teichmann M., Weber M., Zoellner M., Cipolla R., & Urtasun R. (2018, June). Multinet: Real-time joint semantic reasoning for autonomous driving. In 2018 IEEE intelligent vehicles symposium (IV) (pp. 1013–1020). IEEE.
- 39. Qian Y., Dolan J. M., & Yang M. (2019). DLT-Net: Joint detection of drivable areas, lane lines, and traffic objects. IEEE Transactions on Intelligent Transportation Systems, 21(11), 4670–4679.
- 40. Ren S., He K., Girshick R., & Sun J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
- 41. Yao J., Li Y., Liu C., & Tang R. (2023). Ehsinet: Efficient High-Order Spatial Interaction Multi-task Network for Adaptive Autonomous Driving Perception. Neural Processing Letters, 1–18.
- 42. Zhao H., Shi J., Qi X., Wang X., & Jia J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881–2890).
- 43. Yu Y., Lu Y., Wang P., Han Y., Xu T., & Li J. (2023). Drivable Area Detection in Unstructured Environments based on Lightweight Convolutional Neural Network for Autonomous Driving Car. Applied Sciences, 13(17), 9801.
- 44. Liang X., Niu M., Han J., Xu H., Xu C., & Liang X. (2023). Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9611–9621).
- 45. https://eval.ai/web/challenges/challenge-page/1875/leaderboard/4414
- 46. Schirrmeister R. T., Springenberg J. T., Fiederer L. D. J., Glasstetter M., Eggensperger K., Tangermann M.,… & Ball T. (2017). Deep learning with convolutional neural networks for EEG decoding and visualization. Human brain mapping, 38(11), 5391–5420. pmid:28782865
- 47. Hou Y., Ma Z., Liu C., & Loy C. C. (2019). Learning lightweight lane detection cnns by self attention distillation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1013–1021).
- 48. Nikolovski G., Reke M., Elsen I., & Schiffer S. (2021, July). Machine learning based 3D object detection for navigation in unstructured environments. In 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops) (pp. 236–242). IEEE.