Abstract
Multimodal 3D object detection is crucial for autonomous systems but suffers from high delay due to significant computational demands. To address this, we propose an edge computing-assisted framework that balances load between terminal devices and edge servers. We introduce dynamic threshold tuning and resolution-adaptive offloading algorithms to optimize performance. Experimental results demonstrate that our approach significantly reduces delay by minimizing offloading frequency while maintaining high accuracy, achieving a superior delay-accuracy trade-off. Furthermore, the framework exhibits robust adaptability across various models and bandwidth conditions, ensuring effectiveness in dynamic environments.
Citation: Zhao K, Gou Z, Liu H (2026) Edge-assisted adaptive offloading algorithm for 3D object detection tasks. PLoS One 21(4): e0345876. https://doi.org/10.1371/journal.pone.0345876
Editor: Tri-Hai Nguyen, Van Lang University: Truong Dai hoc Van Lang, VIET NAM
Received: October 13, 2025; Accepted: March 11, 2026; Published: April 16, 2026
Copyright: © 2026 Zhao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files. Source code: https://github.com/kangkangsister/plos.git.
Funding: This work was supported by the Sichuan Science and Technology Program (grant number 2024JDRC0053 to [KLZ]) and the Aba Normal University Innovative Research Team (grant number AS-KCTD2026-1 to [KLZ]). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: AP, Average Precision; CAVs, Connected and Autonomous Vehicles; C-V2X, Cellular Vehicle-to-Everything; FTAR, Fixed Threshold Adaption Resolution; IoT, Internet of Things; mAP, mean Average Precision; MEC, Multi-access Edge Computing; OB, Object-Based resolution selection; OD, Only Device; PSO, Particle Swarm Optimization; STC, Spatio-Temporal Lidar Point Cloud Compression; VA, dynamic adaptive offloading for Video Analytics; V2I, Vehicle-to-Infrastructure; V2X, Vehicle-to-Everything.
Section 1: Introduction
In recent years, the number of autonomous vehicles has been steadily increasing, with projections indicating that by 2030, the global fleet of autonomous vehicles will reach 2.2 million [1]. Autonomous driving has brought new industry development opportunities to the traditional automotive sector. Its diversified application scenarios, operational efficiency, and immense development potential have led many countries to view it as one of the key technologies for improving urban operations and creating economic value. For instance, the UK government announced its Connected and Automated Mobility (CAM) Roadmap in 2022, aiming to promote the adoption of autonomous vehicles by 2025 [2]. The U.S. Department of Transportation plans to fully implement cellular-based V2X (Vehicle-to-Everything) coverage by 2036 to further support higher levels of autonomous driving services [3]. Japan plans to achieve a digital transformation in the automotive sector with software-defined vehicles between 2030 and 2035 [4].
Autonomous driving relies on various sensors and algorithms to quickly perceive the environment and make rapid decisions, ensuring the timeliness and reliability of driving decisions. 3D object detection, which collects data through sensors such as LiDAR or RGB-D cameras, enables object recognition and localization in 3D space [5]. It has become one of the most commonly used object recognition technologies in the field of autonomous driving. Algorithms allow the vehicle to quickly understand environmental data, enabling it to avoid potential hazards [6]. In addition, point cloud object detection plays an irreplaceable role in perceiving the three-dimensional world, driving advancements across diverse fields. In autonomous driving, it supports critical tasks such as pavement data acquisition and analytics, enhancing vehicle safety and infrastructure monitoring [7]. Beyond automotive applications, it is essential in intelligent robotics, enabling precise object identification on platforms like Nao robots and facilitating multimodal detection using depth and image data for manufacturing parts [8]. Furthermore, these detection technologies synergize with Internet of Things (IoT) frameworks to support cooperative models for emergency transportation planning. To ensure efficient deployment in these complex scenarios, optimal inferential control of convolutional neural networks is often employed to balance performance and computational cost. Collectively, these applications demonstrate the broad adaptability and significance of 3D object detection in modern intelligent systems.
Multimodal 3D object detection technology fully combines the advantages of deep learning and multimodal data, significantly improving model accuracy [9] and overcoming the inherent limitation of single-modal detection in acquiring depth information. However, this method requires significant computational resources. Given the limited computational power of in-vehicle terminal devices, requiring them to execute all object detection tasks leads to a decline in inference speed and accuracy. Furthermore, it fails to meet the stringent delay requirements for task processing in autonomous driving. Meanwhile, LiDAR supports point cloud sampling rates of up to 8 million points per second, and this high sampling rate leads to a significant increase in inference delay [10]. To improve the processing efficiency of 3D point cloud data, some studies [11–12] have employed lightweight model techniques, such as knowledge distillation or neural network pruning, to accelerate the inference process. However, lightweight models may not provide sufficient performance to support complex learning tasks and may introduce additional communication and computational overhead for retraining [13]. Connected and Autonomous Vehicles (CAVs) [14] can leverage cellular vehicle-to-everything (C-V2X) communication [15], vehicle-to-infrastructure (V2I) communication, and multi-access edge computing (MEC) [16] to enable data exchange and task transmission between vehicles, edge nodes, and cloud servers. This facilitates the acquisition of complementary environmental information and the allocation of computational resources, and provides a solid technological foundation for the application of edge computing in vehicular scenarios. The vehicular edge computing paradigm [17] allocates computational and storage resources to network edge servers and deploys multimodal deep learning models.
By offloading part or all of the tasks from in-vehicle terminals to edge servers for proximal processing, it improves task computational efficiency. However, 3D object detection tasks based on edge computing still face several technical challenges, such as the significant bandwidth consumption required for transmitting multimodal data, which leads to severe delay issues.
Designing efficient and energy-saving 3D object detection models is critical for ensuring the safety of autonomous driving [7,18]. Due to the large and complex nature of point cloud data, which demands high computational power, edge computing has become an ideal computing paradigm for rapidly addressing 3D object detection tasks. Considering that information exchange between endpoints can enhance both the accuracy and speed of object detection, several scholars [6,19,20] have thoroughly explored the design of end-to-end frameworks, capturing correlations between tracked objects and facilitating the interaction of computational tasks. Additionally, given the limited resources in edge computing environments, some studies have focused on optimizing the computation offloading mechanisms to improve the efficiency of collaborative edge-based object detection systems. Parameters such as frame resolution, frame rate, and confidence thresholds [21–22] are now being utilized to enable adaptive task offloading decisions, aimed at meeting the performance requirements of dynamic scenarios, including accuracy, throughput [23], and energy consumption [24].
However, current edge-assisted object detection technologies are primarily based on image data, with insufficient research on point cloud data generated by LiDAR. In fact, point cloud object detection plays an irreplaceable role in perceiving the three-dimensional world, driving the advancement and application of autonomous driving, intelligent robotics, virtual gaming, and other fields [25–26]. Unlike image data, the quality of point cloud data is characterized by the point cloud resolution parameter, and the relationship between this resolution and accuracy and delay is still unclear. Therefore, existing offloading mechanisms cannot be directly applied to 3D object detection scenarios.
Based on the aforementioned motivations, this paper proposes a highly efficient computational algorithm to quantify the data upload priority of different modalities. Furthermore, our proposed framework enhances computational efficiency by dynamically distributing the inference workload: the lightweight terminal model filters out easy samples, reducing the frequency of expensive edge inferences, while the adaptive resolution mechanism minimizes the data volume for necessary offloading, optimizing both computational and communication resources. Experimental results show that point cloud resolution and confidence significantly influence the performance of 3D object detection. On the one hand, point cloud resolution, defined as the number of points per unit volume, characterizes the fine-grained details of the point cloud. Fig 1 illustrates the inference performance of the unimodal model PointPillar [27] and the multimodal model DeepInteraction [28] under different point cloud resolutions. In this study, “Device” refers to the efficiency of the point cloud-based unimodal detection model (PointPillar) deployed on local devices, “Edge” denotes the efficiency of the image-point cloud-based multimodal detection model (DeepInteraction) deployed on edge devices, and “Transmit” signifies the delay associated with data offloading and transmission to edge devices during edge-collaborative multimodal inference. As shown in Fig 2, as the point cloud resolution increases from 1 × 10⁴ pts/m³ to 4 × 10⁴ pts/m³, the inference delay of the unimodal model rises by 0.1 seconds, the multimodal inference delay increases by 0.07 seconds, and the transmission delay increases by 0.4 seconds. Fig 1 and Fig 2 illustrate that higher point cloud resolutions improve the accuracy of both the unimodal and multimodal models; however, they also lead to a significant increase in transmission and inference delay.
Therefore, considering the impact of point cloud resolution is essential for optimizing the edge-device collaborative 3D object detection model. The experimental setup is detailed in Section 4.
Image resolution is represented by pixel density, while point cloud resolution reflects the density and distribution of points in 3D space. To assess the effect of image resolution on detection performance, we varied the image resolution and observed the corresponding changes in accuracy, as summarized in Table 1. As illustrated in Table 1, reducing the image resolution from the original 1600 × 900 to 400 × 300 results in a decrease in Average Precision (AP) of less than 0.01, indicating that image resolution has a minimal impact on accuracy. We therefore exclude image resolution and instead select point cloud resolution and confidence threshold as the key metrics. The confidence threshold serves as the decision trigger: when the confidence of local detection falls below this threshold, the terminal offloads the multimodal data to the edge server for higher-precision inference. However, due to the high-speed movement of vehicles, network conditions change rapidly and traffic patterns become unpredictable, posing challenges in designing effective task offloading mechanisms for edge-assisted 3D object detection frameworks [29–30]: the transmission delay of point cloud data at the same resolution varies with the network environment, the confidence of object detection changes dynamically with the detection content, and an excessively high confidence threshold can lead to frequent offloading, causing excessive transmission delay.
To address the above challenges, the main contributions of this paper are summarized as follows:
- This paper proposes an edge-assisted 3D object detection framework that leverages the heterogeneous resource characteristics of terminal devices and edge servers to enable collaborative inference of unimodal and multimodal detection models.
- A detailed analysis is provided on the impact of point cloud resolution and confidence threshold on the delay and accuracy of object detection. Based on this, a mathematical optimization problem is formulated to minimize delay while meeting accuracy requirements.
- A comprehensive offloading algorithm is proposed, which includes accuracy estimation, threshold updating, and resolution adjustment. To evaluate accuracy in real time, a confidence-based accuracy estimation method is designed, and a confidence–accuracy mapping relationship is established using fitted polynomial functions. Based on this, dynamic threshold updating and adaptive offloading algorithms are proposed, which fine-tune the threshold according to the accuracy estimated from the previous cycle and dynamically adjust the resolution according to real-time bandwidth to balance accuracy and delay.
- An edge-terminal collaborative hardware experimental platform is introduced, on which unimodal and multimodal object detection models are implemented on the terminal device and edge server, respectively. The performance of the models is evaluated using the nuScenes [31] and KITTI [32] datasets. Experimental results demonstrate that, compared with five baseline algorithms, the proposed approach achieves the best overall performance in terms of both delay and accuracy under various testing scenarios.
The remainder of this paper is organized as follows: Section 2 introduces related work. Section 3 describes the system model and algorithm design. Section 4 discusses the experimental platform and verifies the effectiveness of the proposed algorithm. Finally, Section 5 concludes the paper.
Section 2: Related works
Autonomous vehicles, intelligent robots, and logistics systems depend on multi-sensor perception to make informed decisions for safety and efficiency [33–34]. While deep learning has significantly advanced 3D object detection, particularly through one-stage methods like PointNet [35] and PointGNN [36], and two-stage approaches like PointPillars [27], achieving a balance between accuracy and real-time performance remains a challenge. Advanced methods often rely on complex operations such as sparse 3D convolutions or transformers [13], which demand substantial computational resources. Although these techniques achieve high accuracy in experimental settings, their high parameter counts and computational overhead frequently create inference bottlenecks, making them difficult to deploy on resource-constrained terminal devices where real-time responsiveness is critical [37–38].
In the realm of multimodal data fusion, edge-cloud collaborative frameworks typically employ early, intermediate, or late fusion strategies. Early fusion [39–40] optimizes data quality but consumes excessive bandwidth, while late fusion [41] minimizes bandwidth usage but often sacrifices accuracy by ignoring intrinsic inter-modal correlations. Intermediate fusion [6,42] offers a promising middle ground by balancing accuracy and delay through feature-level transmission. However, research in this area remains insufficient; most existing studies focus solely on model feature extraction without adequately addressing the implications of rapidly changing communication environments. Furthermore, current approaches often prioritize average accuracy metrics while neglecting critical performance indicators such as end-to-end delay [43–44], which are essential for safety-critical applications.
From the perspective of application scenarios, existing methodologies often lack the environmental adaptability required for real-world deployment. First, many methods depend on stable, high-quality network environments and struggle to cope with significant fluctuations or signal interruptions, leading to unacceptable delay spikes [45]. Second, current collaborative frameworks typically rely on rigid task partitioning that makes terminals dependent on edge servers for inference; if the edge becomes unavailable or overloaded, the terminal cannot operate independently [39,46]. Finally, existing works primarily optimize model structures for theoretical accuracy, ignoring the impact of external environmental factors—such as dynamic bandwidth availability and computing resource contention—on system performance. Consequently, these methods are often ill-suited for the dynamic and variable environments characteristic of autonomous driving and intelligent robotics. Table 2 presents the similarities and differences between our study and existing studies.
The aforementioned research effectively leverages the heterogeneous resources of terminal devices, achieving a reduction in the transmission volume of multimodal data while maintaining accuracy. Nevertheless, certain areas still necessitate urgent enhancements:
- (1) Existing methods rely on a high-quality network environment to achieve real-time data transmission, which makes them unable to cope with scenarios involving significant network fluctuations, signal interruptions, etc. This results in extremely high delay, making them unsuitable for applications that require very low delay.
- (2) The existing edge-collaboration frameworks are based on task or model partitioning, requiring the terminal to rely on the edge for subsequent model inference. This dependency limits the terminal’s ability to operate independently and may result in increased delay.
- (3) Current approaches primarily focus on designing collaboration strategies that consider the structural characteristics of the model in relation to accuracy, while neglecting the inherent connection between external environmental factors and model performance. For instance, fluctuations in bandwidth and the availability of computational resources are often overlooked, rendering these frameworks inadequately equipped to adapt to the dynamic and fluctuating nature of real-time environments.
In contrast to the aforementioned studies, this paper proposes a 3D object detection framework based on the collaboration of single-modal and multimodal models. In this framework, lightweight single-modal models and computationally intensive multimodal models are deployed at resource-constrained terminals and edge servers with higher computational capacity, respectively. On the terminal side, the lightweight single-modal model, with its high efficiency, can quickly respond to and process local data, providing preliminary and reliable prediction results for subsequent multimodal inference. This design significantly reduces dependence on network bandwidth while alleviating the computational burden on edge servers. On the edge side, the complex multimodal model can integrate various data sources to perform in-depth feature extraction and fusion, further improving the accuracy and robustness of object detection. Additionally, this paper delves into the impact of several key variables on delay and accuracy. Through a series of experimental results, the inherent relationships between these variables are analyzed, providing a theoretical basis for offloading decision-making. Based on the aforementioned research, a lightweight offloading algorithm is designed to dynamically adjust key parameters such as confidence thresholds and real-time resolution, based on real-time predictions of model accuracy, in order to adaptively balance delay and accuracy in real-time environments, thus meeting the diverse demands of various application scenarios.
Section 3: System model and algorithm design
3.1. System model
In this section, a framework for a 3D object detection system based on edge-assisted technology is proposed, consisting primarily of terminal devices and edge servers. In contrast to existing edge-assisted architectures, the proposed framework introduces three key innovations: (1) A heterogeneous model deployment strategy that pairs a lightweight single-modal model on the terminal with a complex multimodal model on the edge to optimize resource utilization; (2) A dynamic confidence threshold update mechanism that adapts to real-time accuracy requirements, acting as an intelligent gatekeeper for offloading; and (3) An adaptive resolution selection algorithm that minimizes transmission delay by optimizing data volume based on current network bandwidth and delay constraints. These components work synergistically to balance accuracy and efficiency in dynamic environments.
As shown in Fig 3, the terminal device is composed of four modules: data collection, local inference, offloading controller, and parameter configuration updater. First, the data collection module is responsible for acquiring multimodal data, including video and point cloud data. In autonomous driving scenarios, the terminal device is typically a mobile vehicle equipped with various sensors, including onboard cameras and LiDAR. These sensors collect image and point cloud data at a fixed sampling frequency. The local inference module deploys a single-modal inference model, PointPillar, based on point cloud data. We selected PointPillar specifically for its lightweight architecture and low parameter count, enabling efficient real-time execution on resource-limited terminal hardware while maintaining sufficient accuracy for initial filtering. This module processes the point cloud data and provides the results and confidence levels for the detection tasks. The offloading controller makes decisions regarding the offloading of detection tasks and the point cloud resolution based on the confidence threshold and accuracy requirements. The point cloud resolution is defined as the number of points divided by the volume of the object, which determines the density of the point cloud data and influences the size of the point cloud data to be transmitted during the offloading process.
As shown in Fig 4, the size of the point cloud data is linearly correlated with its resolution. When the confidence level of the local inference result does not meet the threshold, multimodal data is offloaded to the server for further inference. The threshold updater is responsible for dynamically updating the confidence threshold. On the edge server side, a multimodal detection model is deployed to assist the terminal device with inference, thereby improving detection accuracy.
Typically, multimodal models provide better accuracy than single-modal models. Fig 5 and Fig 6 present performance comparisons between single-modal and multimodal inference models on terminal devices, with the terminal device being the NVIDIA Jetson AGX Xavier. It is evident that the accuracy of the three multimodal models, DeepInteraction, Transfusion [47], and BEVFusion [48], surpasses that of the single-modal model, PointPillar, with significantly higher inference delay as well. Additionally, Fig 7 and Fig 8 compare the performances of various single-modal models on the terminal device. Although PointPillar has the lowest accuracy among the three single-modal models, it also exhibits the lowest inference delay. Moreover, under the proposed offloading strategy, higher-accuracy single-modal models reduce offloading frequency by correctly classifying more samples locally. As a result, using these models on the terminal does not necessarily outperform PointPillar when considering the overall balance between accuracy and latency. Consequently, to balance delay and accuracy, PointPillar is chosen as the inference model for the terminal device. Furthermore, DeepInteraction, which has the highest inference accuracy, is used as the inference model on the edge side to compensate for the lower accuracy of terminal model inference in certain scenarios.
Next, the specific processing flow of the proposed edge-assisted collaborative object detection framework is described as follows and shown in Fig 3. In the first step, the data collection module gathers multimodal data through the onboard sensors. In the second step, the local inference model performs single-modal object detection on the collected point cloud data to obtain the detection confidence on the terminal device. In the third step, based on the terminal's detection confidence, the offloading controller makes the offloading decision. If the confidence level exceeds the preset threshold, the terminal's detection result is used as the final result. Otherwise, based on real-time bandwidth and accuracy requirements, the point cloud resolution is dynamically adjusted, and the point cloud data, along with the image data, is compressed and sent to the edge server for further processing. In the fourth step, the edge server receives the compressed data packet, decodes it, processes it with the multimodal model, and returns the result to the terminal. During this process, the threshold updater collects and analyzes the results and periodically updates the confidence threshold based on the established update strategy. The threshold update strategy is described in detail in the algorithm design of Section 3.3.
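The four-step flow above can be sketched as a terminal-side control loop. This is a minimal illustration only: the model and network calls are stubs standing in for PointPillar, the edge-side DeepInteraction, and the V2X link, and none of the function names come from the paper's released code.

```python
def downsample(points, resolution, base_resolution=4e4):
    """Keep a fraction of points proportional to the target resolution."""
    keep = max(1, int(len(points) * resolution / base_resolution))
    return points[:keep]

def process_frame(point_cloud, image, theta, resolution, local_model, edge_model):
    """One cycle of the edge-assisted flow: local inference, gate, offload."""
    result, confidence = local_model(point_cloud)   # step 2: unimodal inference
    if confidence >= theta:                         # step 3: confidence gate
        return result, "local"
    sparse = downsample(point_cloud, resolution)    # adapt resolution before upload
    return edge_model(sparse, image), "edge"        # step 4: multimodal inference

# Stub models for demonstration only (not PointPillar/DeepInteraction).
local_model = lambda pts: ("car", 0.55)
edge_model = lambda pts, img: "car (refined)"

res, where = process_frame(list(range(1000)), "frame.jpg",
                           theta=0.7, resolution=2e4,
                           local_model=local_model, edge_model=edge_model)
print(res, where)   # confidence 0.55 < 0.7, so the frame is offloaded
```

With a lower threshold (e.g. `theta=0.5`) the same frame would be resolved locally, which is exactly the lever the threshold updater exploits.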
The overall computational complexity analysis of the proposed edge-assisted adaptive offloading algorithm is as follows:
Offloading decision module: The offloading decision for each task involves evaluating the local inference cost and the edge inference cost (including transmission delay). For N tasks and K candidate edge servers, the complexity of the offloading decision process is O(NK).
Point cloud resolution adaptation module: The resolution selection involves searching over R candidate resolution levels to optimize the accuracy-delay trade-off, contributing a complexity of O(R) per task.
Overall complexity: The total complexity of the proposed algorithm is O(N(K + R)). Since K and R are typically small constants in practical deployment scenarios, the algorithm scales linearly with the number of tasks N, ensuring computational efficiency suitable for real-time applications.
3.2. Problem description
In this section, a detailed analysis of the above-mentioned offloading process is conducted, and a theoretical optimization model is established. The definitions of the symbols used in this article are shown in Table 3.
When sample $i$ is processed by the local model at the original resolution $r_0$, its processing delay equals the inference delay of the local model, as shown in equation (1):

$$t_i = t_i^{\mathrm{loc}} \tag{1}$$

Equation (2) indicates that when a sample with resolution $r$ is offloaded to the edge server, its processing delay is the sum of the transmission delay and the inference delay at the edge server:

$$t_i = t_i^{\mathrm{tr}}(r) + t_i^{\mathrm{edge}}(r) \tag{2}$$

where $t_i^{\mathrm{tr}}(r)$ represents the transmission delay for sample $i$, calculated as the sample size divided by the transmission rate, as shown below:

$$t_i^{\mathrm{tr}}(r) = \frac{\gamma r}{B} \tag{3}$$

Here $\gamma$ is a proportional coefficient used to calculate the size of the point cloud data, and $B$ is the real-time transmission rate, which dynamically changes with network conditions. $t_i^{\mathrm{edge}}(r)$ is the inference delay at the edge server for sample $i$ with resolution $r$.

To model the computational cost efficiently, we analyzed the relationship between inference delay and point cloud resolution. As illustrated in Fig 2, extensive empirical tests on the NVIDIA GeForce RTX 3090 using the BEVFusion model reveal a strong linear correlation between resolution and inference delay. Consequently, we adopt a linear model for real-time delay estimation. This approximation significantly reduces optimization complexity while maintaining sufficient accuracy for the offloading decision. Thus, $t_i^{\mathrm{edge}}(r)$ exhibits a linear relationship with the resolution $r$, i.e., $t_i^{\mathrm{edge}}(r) = a r + b$, where the parameters $a$ and $b$ can be obtained through curve fitting.
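The delay model can be made concrete with a small numerical sketch. The coefficient values below (the size coefficient `GAMMA` and the linear-fit parameters `A_FIT`, `B_FIT`) are illustrative assumptions, not the paper's measured values.

```python
# Assumed coefficients: GAMMA maps resolution to payload size (MB),
# (A_FIT, B_FIT) are the linear fit of edge inference delay vs. resolution.
GAMMA = 1.6e-5
A_FIT, B_FIT = 2.0e-6, 0.05

def transmission_delay(r, bandwidth_mbps):
    """Eq. (3): payload size (MB converted to Mb) divided by the link rate."""
    return GAMMA * r * 8 / bandwidth_mbps

def offload_delay(r, bandwidth_mbps):
    """Eq. (2): transmission delay plus fitted edge inference delay."""
    return transmission_delay(r, bandwidth_mbps) + (A_FIT * r + B_FIT)

# Raising the resolution from 1e4 to 4e4 pts/m^3 inflates both components.
print(round(offload_delay(1e4, 20), 3))   # -> 0.134
print(round(offload_delay(4e4, 20), 3))   # -> 0.386
```

The 3–4x jump between the two printed delays mirrors the resolution sensitivity reported for Fig 1 and Fig 2.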
Based on the above, the problem addressed in this paper is defined as follows: given a multimodal data set $\mathcal{D}$, a set of selectable point cloud resolutions $\mathcal{R}$, and a target accuracy requirement $A_{\mathrm{req}}$, the objective is to minimize the total system delay by optimizing the point cloud resolution $r$ and the confidence threshold $\theta$. The specific mathematical expression is as follows:

$$\min_{r \in \mathcal{R},\, \theta} \; \sum_{i \in \mathcal{D}} t_i \tag{4}$$

Subject to:

$$t_i \le t_{\max}, \quad \forall i \in \mathcal{D}$$

$$A(r, \theta) \ge A_{\mathrm{req}}$$

where the first constraint ensures that the processing delay for each data frame does not exceed the maximum allowable delay $t_{\max}$, and the second constraint ensures that the achieved accuracy $A(r, \theta)$ is no less than the given accuracy requirement $A_{\mathrm{req}}$. Since $r$ is a discrete variable and $\theta$ is a continuous variable, this model constitutes a mixed-integer optimization problem, which is difficult to solve optimally in polynomial time.
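Because the candidate resolution set is small and discrete, the structure of this optimization can be illustrated with a naive grid search over (r, θ) pairs. The delay and accuracy models below are toy stand-ins for illustration only, not the paper's fitted functions.

```python
def solve_grid(resolutions, thetas, delay_fn, accuracy_fn, acc_req):
    """Enumerate (r, theta) pairs; keep the feasible pair with minimum delay."""
    best, best_delay = None, float("inf")
    for r in resolutions:
        for theta in thetas:
            if accuracy_fn(r, theta) < acc_req:
                continue                       # accuracy constraint violated
            d = delay_fn(r, theta)
            if d < best_delay:
                best, best_delay = (r, theta), d
    return best, best_delay

# Toy stand-in models: raising theta or r improves accuracy but adds delay.
delay_fn = lambda r, th: 0.05 + th * (0.1 + 4e-6 * r)
accuracy_fn = lambda r, th: min(1.0, 0.7 + 0.2 * th + 2e-6 * r)

best, d = solve_grid([1e4, 2e4, 4e4], [0.3, 0.5, 0.7, 0.9],
                     delay_fn, accuracy_fn, acc_req=0.85)
print(best)   # -> (10000.0, 0.7)
```

Exhaustive search like this is only viable offline; the online algorithms in Section 3.3 avoid it by updating θ and r incrementally.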
In Section 1, the impact of point cloud resolution on system performance has been discussed in detail. As shown in Fig 9, both delay and accuracy exhibit a nonlinear positive correlation with the confidence threshold. Specifically, as shown in Fig 10, in dynamic network environments, the inference delay and accuracy under fixed bandwidth and point cloud resolution still vary dynamically. Therefore, dynamically optimizing the threshold and point cloud resolution in an unstable network environment to improve system performance is an important research issue that urgently needs to be addressed.
3.3. Algorithm design
3.3.1. Confidence-based accuracy estimation method.
In real-time environments, obtaining the true labels of data is often challenging, making it difficult to accurately assess model performance. To address this issue, this study proposes a confidence-based accuracy estimation method, whose fundamental idea is to establish a functional mapping between confidence and accuracy using historical observation data.
Specifically, the method uniformly partitions the confidence interval [0,1] into multiple sub-intervals. Each sample is then assigned to a corresponding interval based on its confidence value, and the average sample accuracy within each confidence interval is computed. Based on the observed distribution pattern between confidence and accuracy, a cubic function is employed for curve fitting. Fig 11 and Fig 12 illustrate the confidence–accuracy fitting relationships for the terminal single-modal model and the edge multimodal model, respectively. It can be observed that the two models exhibit different curve shapes and concavity/convexity characteristics.
On this basis, confidence–accuracy fitting functions are established for both the single-modal and multimodal models, expressed mathematically as follows:

$$\hat{A}^{\mathrm{loc}}_i = f_{\mathrm{loc}}(c_i) \tag{5}$$

$$\hat{A}^{\mathrm{edge}}_i = f_{\mathrm{edge}}(c_i) \tag{6}$$

Here, $c_i$ and $\hat{A}_i$ denote the confidence and estimated accuracy of sample $i$ at the terminal/edge, respectively, and $f_{\mathrm{loc}}$ and $f_{\mathrm{edge}}$ are the fitted cubic polynomials. Based on Equations (5) and (6), the accuracy of samples can be estimated in real time, providing a basis for the subsequent threshold update algorithm. The proposed confidence-based accuracy estimation method is applicable to arbitrary single-modal and multimodal models.
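The binning-and-fitting procedure might be sketched as follows. The confidence/correctness data here is synthetic and purely illustrative; the bin count and polynomial degree mirror the cubic fit described above.

```python
import numpy as np

def fit_confidence_accuracy(confidences, correct, n_bins=10, degree=3):
    """Bin samples by confidence, average accuracy per bin, fit a polynomial."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    centers, acc = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():                        # skip empty confidence bins
            centers.append((edges[b] + edges[b + 1]) / 2)
            acc.append(correct[mask].mean())
    return np.poly1d(np.polyfit(centers, acc, degree))

# Synthetic data: detections are correct with probability equal to confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.0, 1.0, 2000)
hit = (rng.uniform(0.0, 1.0, 2000) < conf).astype(float)

f_loc = fit_confidence_accuracy(conf, hit)    # mapping for the terminal model
print(float(f_loc(0.9)) > float(f_loc(0.2)))  # -> True
```

In deployment the same routine would be run twice, once on terminal-model history and once on edge-model history, yielding the two curves of Fig 11 and Fig 12.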
3.3.2. Dynamic threshold updating algorithm.
This paper proposes an adaptive threshold update algorithm, which evaluates the average accuracy of all frames from the previous cycle and, based on this evaluation, adjusts the current cycle's confidence threshold to meet the required accuracy. Specifically, when the average accuracy exceeds the requirement, the confidence threshold is lowered to reduce the offloading frequency, thereby minimizing delay. Conversely, when the average accuracy falls below the requirement, the confidence threshold is raised to increase the offloading frequency, thereby enhancing accuracy.
Table 4 presents the flowchart of the dynamic threshold updating algorithm, which consists of two main steps. In the first step, the average accuracy of the previous cycle is estimated using Equations (5) and (6). In the second step, this estimate is compared with the required accuracy: when the estimate exceeds the requirement, the threshold should be lowered, and Equation (7) calculates the updated threshold:
where the step size is a fixed threshold increment. When the estimated accuracy is less than or equal to the required accuracy, the threshold is raised according to the following update formula:
In this context, a proportional factor regulates the rate of threshold increase to be greater than the rate of decrease. This design choice reflects the fact that accuracy is treated as a primary constraint, with the fulfillment of accuracy requirements prioritized over reducing delay and conserving network bandwidth. Additionally, the algorithm imposes upper and lower bounds on the threshold to prevent excessively high or low values, which would otherwise result in overly frequent or infrequent offloading.
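The update rule can be sketched in Python. This is an approximation of Equations (7) and (8), assuming hypothetical names for the base increment (`delta`), proportional factor (`gamma`), and threshold bounds (`tau_min`, `tau_max`); the concrete values are those chosen in the paper’s experiments:

```python
def update_threshold(tau, est_acc, req_acc, delta,
                     gamma=3.0, tau_min=0.1, tau_max=0.9):
    """One periodic confidence-threshold update (illustrative sketch).

    est_acc: average estimated accuracy of the previous cycle
    delta:   base threshold increment
    gamma:   proportional factor (> 1) making the increase faster
             than the decrease, prioritizing the accuracy constraint
    """
    if est_acc > req_acc:
        tau -= delta           # accuracy surplus: offload less, cut delay
    else:
        tau += gamma * delta   # accuracy shortfall: offload more, quickly
    # Clamp to [tau_min, tau_max] to avoid overly frequent or
    # overly infrequent offloading
    return min(max(tau, tau_min), tau_max)
```

Samples are offloaded only when their terminal confidence falls below `tau`, so lowering `tau` reduces the offloading frequency and raising it increases it.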
3.3.3. Offloading algorithm.
Based on the dynamic threshold updating algorithm, this paper designs an adaptive offloading algorithm to dynamically determine the point cloud resolution that needs to be transmitted during the offloading process. The fundamental idea is to adjust the resolution in real-time based on dynamic network bandwidth, while ensuring that the target accuracy requirements are met, thereby minimizing the overall delay as much as possible.
Algorithm 2 illustrates the flow of the adaptive offloading algorithm. As shown in the algorithm, the offloading conditions are assessed explicitly against the confidence threshold, which acts as a decision gate: if the local detection confidence falls below the threshold, the task is marked for offloading to leverage the edge server’s superior accuracy; otherwise, the local result is accepted to preserve real-time performance.
First, for the current data to be detected, Algorithm 2 computes the terminal confidence using the unimodal model. Second, the offloading conditions are assessed, and the terminal detection result is returned directly if either of the following holds:
- the terminal confidence is no lower than the current threshold, indicating that the terminal detection result meets the accuracy requirement;
- the sum of the transmission and inference delays required for offloading exceeds the maximum delay constraint, meaning the network bandwidth conditions are insufficient for offloading.
Otherwise, the algorithm proceeds to the next step: for the data to be offloaded, it iterates through the available resolution set, selects the maximum resolution that satisfies the constraints, offloads at that resolution to the edge server, and returns the edge detection result.
The specific implementation of Algorithm 2 is shown as Table 5:
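As a complement, the per-frame decision logic can be sketched as follows. The helper callables (`local_infer`, `edge_infer`, `upload_size`, `edge_delay`) are hypothetical stand-ins for the paper’s terminal model, edge model, point-cloud compression, and profiled edge inference delay, not actual APIs:

```python
def detect_with_offloading(frame, tau, delay_max, resolutions,
                           local_infer, edge_infer,
                           upload_size, bandwidth, edge_delay):
    """One frame of the adaptive offloading loop (illustrative sketch).

    local_infer(frame) -> (result, confidence) on the terminal
    upload_size(r)     -> bytes to transmit at point-cloud resolution r
    edge_delay(r)      -> edge inference delay at resolution r (seconds)
    bandwidth          -> current uplink rate (bytes/second)
    """
    result, conf = local_infer(frame)
    if conf >= tau:
        return result  # terminal result already meets the requirement
    # If even the lowest resolution violates the delay constraint,
    # the network cannot support offloading this frame.
    if min(upload_size(r) / bandwidth + edge_delay(r)
           for r in resolutions) > delay_max:
        return result
    # Otherwise offload at the largest feasible resolution.
    best = max(r for r in resolutions
               if upload_size(r) / bandwidth + edge_delay(r) <= delay_max)
    return edge_infer(frame, best)
```

Under this gate, the dynamic threshold `tau` from the previous subsection directly controls how often frames are offloaded.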
Section 4: Experimental results and analysis
4.1. Experimental setup
To validate the effectiveness of the proposed solution, this paper establishes an experimental testing platform for 3D object detection based on edge-assisted computation. The experiments adhere to the standard data processing protocols for 3D data as established within the OpenPCDet and MMDetection3D frameworks, encompassing methodologies such as PointPillar, DeepInteraction, BEVFusion, and TransFusion. These standardized processes ensure consistency and efficacy in data preprocessing, thereby providing a solid foundation for the research. As shown in Fig 13, the terminal device is equipped with the edge computing accelerator NVIDIA Jetson AGX Xavier, while the edge side employs the more powerful GPU server NVIDIA GeForce RTX 3090.
The comparison of the device parameters is shown in Table 6. As in Section 3, the 3D object detection models deployed on the terminal and edge sides are the single-modal model PointPillar and the multimodal model DeepInteraction, respectively. To validate the effectiveness of our proposed framework, we utilize two widely used benchmark datasets: KITTI and nuScenes, both of which provide high-precision annotations for multimodal data collected in real-world traffic environments.
- (1) KITTI Dataset: A widely recognized benchmark in autonomous driving, capturing diverse road scenarios (urban, rural, highway) with synchronized image and LiDAR data.
- (2) nuScenes Dataset: A large-scale dataset featuring 1000 scenes of 20 seconds each, collected in diverse weather and lighting conditions (e.g., rain, night) with 360-degree sensor coverage.
These datasets were selected as standard benchmarks to allow for direct comparison with state-of-the-art methods [6,34,43,49,50]. For evaluation, we use mAP as the primary metric for object detection accuracy [2,43]. Additionally, we measure computation delay to evaluate the efficiency of our proposed edge computing framework [43,50]. In this paper, a continuous sequence of 1200 frames of multimodal data from the dataset is selected as the experimental test data, primarily for detecting vehicle targets in the scene.
To evaluate the performance of the algorithm, this paper adopts the following performance metrics:
1. AP: This metric is the primary indicator used to assess the performance of object detection algorithms, and is expressed as follows:
Here, precision is evaluated at a given IoU threshold, where IoU represents the overlap between the predicted and ground-truth targets; recall takes values in the range [0,1]; and AP integrates precision over the covered detection range. AP typically provides a better reflection of the quality of object detection than a single operating point.
2. Delay: Defined as the total processing delay for all data divided by the total amount of data. The lower this metric, the more efficient the system’s processing capability.
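The two metrics can be illustrated with a minimal sketch. The trapezoidal integration below is one common way to compute the area under the precision-recall curve; the paper’s exact interpolation scheme may differ:

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve (trapezoidal rule).

    Assumes recalls are sorted ascending over [0, 1], with one
    precision value per recall point.
    """
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * \
              (precisions[i] + precisions[i - 1]) / 2.0
    return ap

def average_delay(per_frame_delays):
    """Total processing delay divided by the number of processed frames."""
    return sum(per_frame_delays) / len(per_frame_delays)
```

A detector that keeps precision at 1.0 across all recall levels attains the maximum AP of 1.0; any precision drop reduces the integrated area.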
Regarding algorithm comparison, this paper employs five baseline algorithms, which are described as follows:
- OD: All samples are processed for inference on the terminal device.
- AR: Resolution is adaptively adjusted under the constraint of delay, and data is offloaded to the edge server for inference. If the delay constraint is not satisfied, inference is performed on the terminal device.
- FTAR: When both a fixed confidence threshold and delay constraints are met, resolution is adaptively adjusted and offloaded to the edge server for inference. Otherwise, inference is performed on the terminal device. In this case, a fixed confidence threshold of 0.5 is set.
- OB [51]: The target accuracy is set according to the accuracy of the object being detected, and the PSO algorithm is used to select the highest resolution that satisfies both delay and accuracy constraints for offloading to the edge server.
- VA [52]: A cost function is established based on resolution and accuracy. A greedy algorithm is used to traverse selectable resolutions, finding the minimum resolution that satisfies the target accuracy, which is then offloaded to the edge server.
In terms of parameter settings, this paper tests performance under various accuracy requirements and different confidence increments. The testing environment is a dynamic network, whose bandwidth fluctuation curve is shown in Fig 14.
We set the confidence increment across six different ranges and test the delays and accuracies of our algorithm. As shown in Fig 15, the algorithm achieves the highest accuracy and lowest average delay in one of these ranges, which is therefore adopted in the subsequent experiments. Additionally, the scale factor is set to 3 based on empirical testing. This value provides a good balance between responsiveness to new channel conditions and robustness against transient noise: a higher factor makes the system too volatile, while a lower one causes lag in response. Unless otherwise stated, all experiments are conducted with this default parameter configuration.
4.2. Performance comparisons
4.2.1. Performances of different algorithms.
This paper tests and compares the performance of various offloading algorithms under a required accuracy of 0.85, with the experimental results shown in Fig 16 and Fig 17.
In terms of delay, the OD algorithm achieved the minimum value because it only performs inference locally. On the other hand, OB experienced the maximum delay, as the PSO algorithm requires multiple iterations to select the optimal resolution. The VA algorithm employs a traversal method for resolution selection, which introduces additional delay during the traversal process. The AR algorithm also had a longer delay, as it only considers resolution adaptation, leading to frequent offloading. FTAR, which accounts for a fixed confidence threshold and resolution adaptation, reduces the amount of offloaded data compared to AR, resulting in a smaller delay.
In terms of accuracy, OB achieved the best AP value due to the finer search granularity provided by the particle swarm iterations, enabling it to find the maximum resolution that meets the delay constraint. VA, however, uses a coarser traversal granularity compared to OB, resulting in a slightly lower accuracy. Meanwhile, OD, which performs inference locally, resulted in an AP value lower than the required accuracy of 0.85. The proposed algorithm in this paper, which combines dynamic threshold tuning and resolution adaptation, minimizes delay while meeting the accuracy requirement, achieving a better balance compared to the benchmark algorithms.
4.2.2. Performances of different edge inference models.
To verify the scalability of the algorithm, this paper replaces the multimodal model at the edge with BEVFusion and TransFusion. As shown in Fig 5 and Fig 6, compared to DeepInteraction, the AP values of BEVFusion and TransFusion decrease significantly. If the target accuracy were still maintained at 0.85, frequent offloading would result in severe delays. Therefore, in this set of tests, the target accuracy is adjusted to 0.8.
Fig 18 and Fig 19 show the performance when the edge model is switched to BEVFusion. Compared to Fig 16 and 17, except for the OD algorithm, the AP values of the other algorithms decrease to varying degrees. The OD algorithm, which does not perform task offloading, keeps both accuracy and delay at their lowest and most stable values. AR, OB, and VA still achieve relatively high AP performance, but at the cost of higher delays. Notably, due to the lower accuracy requirement, the AP value of FTAR just meets the target, giving it the lowest delay among the algorithms that satisfy the accuracy requirement. The proposed algorithm, by adjusting the offloading strategy in a timely manner according to the target accuracy, meets the accuracy requirement while achieving the second-best delay performance among the qualifying algorithms.
Fig 20 and Fig 21 show the performance when the edge model is switched to TransFusion. TransFusion’s detection accuracy and delay are both lower than those of BEVFusion; therefore, the overall AP value decreases compared to the BEVFusion case. Notably, the FTAR algorithm cannot meet the accuracy requirement. The proposed algorithm, however, adjusts the offloading frequency adaptively by tuning the confidence threshold and resolution, still satisfying the 0.8 accuracy requirement while achieving the lowest delay among the algorithms that meet it. The above analysis shows that the proposed algorithm generalizes across different edge models.
The proposed algorithm achieves delays of 0.42s and 0.35s when using BEVFusion_L and TransFusion_L as terminal models, both higher than the delay with PointPillar as the terminal model (approximately 0.31s, as shown in Fig 16 and Fig 17). This is because both BEVFusion_L and TransFusion_L have longer inference delays than PointPillar (as shown in Fig 7 and Fig 8). However, when BEVFusion_L and TransFusion_L are used as the terminal models, their AP values show no significant advantage over PointPillar, because the proposed edge-terminal collaboration framework effectively compensates for the terminal models’ insufficient inference accuracy on complex samples through the edge model.
Additionally, as shown in Figs 22–25, compared to other methods, the proposed offloading algorithm achieves the lowest delay while meeting the accuracy requirement, achieving the best balance. This further validates the generalization ability of the proposed algorithm across different terminal model scenarios.
4.2.3. Performances of different datasets.
In this section, algorithm performance tests were conducted on the autonomous driving benchmark dataset KITTI to verify the generalization ability of the algorithm across datasets. Compared to nuScenes, the KITTI dataset has a more limited perspective: nuScenes image data is captured from six viewpoints and its point cloud covers a 360-degree field of view over a large range, whereas KITTI image data is captured from a single viewpoint and its point cloud covers a 180-degree field of view over a smaller range. Because the data formats of nuScenes and KITTI are inconsistent, the DeepInteraction multimodal model originally deployed at the edge is incompatible with KITTI. Therefore, the edge model was replaced with Focals Conv-F [53], one of the best-performing models on the KITTI dataset, while the terminal continued to use the PointPillar model. The accuracy comparison of these two models on the KITTI dataset is shown in Fig 26. As shown in Fig 26, due to the single viewpoint of KITTI’s image and point cloud data, the data is less rich, and the Focals Conv-F multimodal model achieves an accuracy of about 0.85 on KITTI, a significant decrease compared to the nuScenes dataset. Therefore, in this section, the target accuracy was adjusted downward to 0.8.
The performance comparison results of different algorithms on the KITTI dataset are shown in Fig 27 and Fig 28. Similar to its behavior on nuScenes, the FR algorithm, which uses the fixed original resolution, produces an excessively large data volume that cannot meet the delay threshold set for offloading, so its inference tasks can only be performed on the terminal; its performance is therefore similar to that of the OD algorithm, with the lowest accuracy and the lowest delay. Although the FTAR algorithm can adaptively adjust the resolution, its fixed confidence threshold prevents it from filtering offloaded data, so its accuracy shows no significant improvement over OD. The AR algorithm adjusts the resolution adaptively to increase the likelihood of offloading and improve accuracy (about 0.81), but the large amount of offloaded data leads to higher delay. In the OB algorithm, the multiple iterations of PSO result in the highest delay (about 0.52s), while its preference for higher resolutions yields the highest accuracy, approximately 0.82. The VA algorithm finds the minimum resolution that satisfies the target accuracy, resulting in a slightly lower delay than OB. The algorithm proposed in this paper maintains the most balanced performance, achieving the lowest delay (about 0.43s) while meeting the accuracy requirement.
4.2.4. Performances of different point cloud compression algorithms.
In this section, the proposed point cloud resolution compression algorithm is compared with a real-time spatial compression algorithm, STC [54], in terms of performance. STC removes redundant information from point cloud data by combining spatial and temporal encoding to reduce the size of the point cloud data. This study selected 1200 frames of point cloud data from the KITTI dataset, applying different compression algorithms and comparing various metrics. As shown in Table 5, during the compression phase, the STC algorithm has a larger delay due to its complex computation process, while the delay of the proposed point cloud compression strategy is negligible. In terms of compression effectiveness, as shown in Table 5, the proposed compression algorithm generates smaller data sizes compared to STC. Specifically, even when the point cloud resolution is set to 80,000, the data size obtained by the proposed method is still smaller than that of the STC method. Moreover, as shown in Fig 29, the proposed method achieves better accuracy during terminal model inference with a smaller data size.
4.2.5. Performances of different bandwidth.
To verify the adaptability of the algorithm, the performance was tested under different network environments. The experimental results are shown in Fig 30 and Fig 31. The OD algorithm, due to its lack of offloading, is unaffected by the network environment, maintaining the lowest accuracy and delay without fluctuations. As the transmission rate increases, the resolution offloaded by the AR algorithm also increases, leading to higher delay and accuracy. Similarly, FTAR increases its offloading resolution to improve accuracy as the transmission rate rises, but still fails to meet the target accuracy requirements. Both OB and VA achieve optimal accuracy, but their delay costs are the highest. In contrast, the proposed algorithm maintains the target accuracy while achieving the lowest delay across all network conditions, thus balancing both accuracy and delay.
Section 5: Conclusions
In the field of autonomous driving, 3D object detection has high requirements for both accuracy and real-time performance. Currently, relying solely on the limited computational capacity of on-board terminals cannot meet these demands. To address this challenge, this paper proposes a novel edge-assisted 3D object detection framework that leverages the heterogeneous computational resources of terminal devices and edge servers to achieve collaborative inference between single-modal and multimodal models. Based on this, the paper analyzes in detail the impact of point cloud resolution and confidence thresholds on delay and accuracy, providing decision guidance for offloading inference tasks based on these two metrics. Specifically, a mixed-integer programming model with point cloud resolution and confidence threshold as optimization variables is established, aiming to minimize delay while meeting accuracy requirements. To assess accuracy in real time, a confidence-accuracy function mapping is established based on historical confidence-accuracy distribution patterns and cubic fitting functions. To adapt to dynamic network environments, a dynamic threshold update algorithm is proposed, which adjusts the threshold by periodically evaluating accuracy. Additionally, an adaptive decision algorithm is designed, which adjusts the point cloud resolution in real time to achieve a balance between delay and accuracy. Finally, an edge-terminal collaborative hardware experimental platform is built, and the proposed solution is validated on real autonomous driving datasets, nuScenes and KITTI. Experimental results show that, under various testing scenarios, including different network bandwidths and model deployments, the proposed solution achieves a more comprehensive performance in terms of both delay and accuracy compared to the baseline solutions.
However, this study has not yet fully addressed the feasibility of real-time implementation in hardware-constrained systems, accuracy in small object detection, and performance under adverse weather conditions. Future research will focus on assessing the applicability of this algorithm in real-time systems, particularly for autonomous driving and robotics. This will involve exploring advanced calibration techniques to enhance model confidence, which is vital for systems where reliability is of extreme importance. Additionally, we plan to address the specific challenges of insufficient feature representation and background interference in small object detection, and evaluate the system’s robustness under adverse weather conditions, such as rain or fog, to ensure consistent performance in dynamic environments. Furthermore, to overcome the limitations of optical sensors in adverse weather conditions, future research will explore the integration of wireless sensing modalities. Recent advancements in 6G Integrated Sensing and Communication (ISAC) [55] have shown that wireless signals (e.g., 6.75 GHz) can generate high-fidelity point clouds via Channel State Information (CSI). Fusing these wireless-generated point clouds with visual and LiDAR data could drastically improve system robustness and safety in all-weather autonomous driving scenarios. In this work, network parameters (e.g., available bandwidth and transmission delay) are empirically measured on real hardware platforms under actual network conditions, inherently accounting for factors like path loss, multipath fading, and interference. Future research will extend these measurements to high-mobility scenarios through field tests at various device velocities, incorporating Doppler effect modeling to analyze the impact of mobility on transmission reliability, offloading latency, and detection accuracy. 
This will facilitate the development of mobility-aware adaptive offloading strategies for dynamic edge-device collaboration.
References
- 1.
Grand View Research. Autonomous vehicle market (2025-2030). 1. 2025. https://www.grandviewresearch.com/industry-analysis/autonomous-vehicles-market
- 2. Aung NHH, Sangwongngam P, Jintamethasawat R, Shah S, Wuttisittikulkij L. A review of LiDAR-based 3D object detection via deep learning approaches towards robust connected and autonomous vehicles. IEEE Trans Intell Veh. 2025;10(1):526–47.
- 3.
United States Department of Transportation USDOT. National deployment plan for vehicle-to-everything (V2X) technologies to reduce death and serious injuries on America’s roadways. 8. 2024. https://www.transportation.gov/briefing-room/usdot-releases-national-deployment-plan-vehicle-everything-v2x-technologies-reduce
- 4.
Institute of Science and Technology Strategy and Consulting CA of S. Japan releases digital transformation strategy for the automotive industry. 2024. https://casisd.cas.cn/zkcg/ydkb/kjzcyzxkb/2024/zczxkb202407/202410/t20241029_7409918.html
- 5. Zhang Y, Wang Y, Cui Y, Chau L-P. 3DGeoDet: general-purpose geometry-aware image-based 3D object detection. IEEE Trans Multimedia. 2025;27:6235–47.
- 6. Chen J-H, Shieh J-L, Haq MA, Ruan S-J. Monocular 3D object detection utilizing auxiliary learning with deformable convolution. IEEE Trans Intell Transport Syst. 2024;25(3):2424–36.
- 7. Ryan M. The future of transportation: ethical, legal, social and economic impacts of self-driving vehicles in the year 2025. Sci Eng Ethics. 2020;9:1185–208.
- 8. Asad M, Azeem W, Jiang H, Tayyab Mustafa H, Yang J, Liu W. 2M3DF: advancing 3D industrial defect detection with multi-perspective multimodal fusion network. IEEE Trans Circuits Syst Video Technol. 2025;35(7):6803–15.
- 9. Hung CC, Ananthanarayanan G, Bodik P. VideoEdge: processing camera streams using hierarchical clusters. In: 2018 IEEE/ACM Symposium on Edge Computing (SEC), 2018. 115–31.
- 10. Kim J, Yi J, Kim W, Shin S, Lee Y. Combinational point sampling for fast and accurate on-device LiDAR 3D object detection. In: IEEE INFOCOM 2025 - IEEE conference on computer communications, 2025. 1–10.
- 11. Li F, Min C, Wang J, Xiao L, Zhao D, Nie Y, et al. Efficient distillation using channel pruning for point cloud-based 3D object detection. IEEE Trans Intell Transport Syst. 2025;26(10):16144–58.
- 12. Ding R, Qi X, Shi S, Wang Z, Yang J. Towards efficient 3D object detection with knowledge distillation. In: Advances in neural information processing systems 35, 2022. 21300–13.
- 13. Lu Y, Hao X, Li Y, Chai W, Sun S, Velipasalar S. Range-aware attention network for LiDAR-based 3D object detection with auxiliary point density level estimation. IEEE Trans Veh Technol. 2025;74(1):292–305.
- 14. Ren R, Li H, Han T, Tian C, Zhang C, Zhang J, et al. Vehicle crash simulations for safety: Introduction of connected and automated vehicles on the roadways. Accid Anal Prev. 2023;186:107021. pmid:36965209
- 15. Chen S. Cellular vehicle-to-everything (C-V2X). Berlin, Germany: Springer; 2023.
- 16. Singh R, Sukapuram R, Chakraborty S. A survey of mobility-aware multi-access edge computing: challenges, use cases and future directions. Ad Hoc Networks. 2023;140:103044.
- 17. Ferrer AJ, Marquès JM, Jorba J. Towards the decentralised cloud: Survey on approaches and challenges for mobile, ad hoc, and edge computing. ACM Comput Surv. 2019;6:1–36.
- 18. Cui Y, Chen R, Chu W, Chen L, Tian D, Li Y, et al. Deep learning for image and point cloud fusion in autonomous driving: a review. IEEE Trans Intell Transport Syst. 2022;23(2):722–39.
- 19. Ding S, Schneider L, Cordts M, Gall J. ADA-Track++: end-to-end multi-camera 3D multi-object tracking with alternating detection and association. IEEE Trans Pattern Anal Mach Intell. 2026;48(1):482–99. pmid:40982527
- 20. Su H, Arakawa S, Murata M. Cooperative 3D multi-object tracking with cross-agent data association for connected and automated vehicles. IEEE Trans Veh Technol. 2025;74(11):16990–7004.
- 21. Yang S, Yang P, Chen J, Ye Q, Zhang N, Shen X. Delay-optimized multi-user VR streaming via end-edge collaborative neural frame interpolation. IEEE Trans Netw Sci Eng. 2024;11(1):284–98.
- 22. Wu D, Zhang D, Zhang M, Zhang R, Wang F, Cui S. ILCAS: imitation learning-based configuration- adaptive streaming for live video analytics with cross-camera collaboration. IEEE Trans on Mobile Comput. 2024;23(6):6743–57.
- 23. Xu J, Wei Z, Yuan X. Delay-constrained multicast throughput maximization in MEC networks for high-speed railways. In: International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2023. 308–28.
- 24. Yan M, Zhang L, Jiang W, Chan CA, Gygax AF, Nirmalathas A. Energy consumption modeling and optimization of UAV-Assisted MEC networks using deep reinforcement learning. IEEE Sensors J. 2024;24(8):13629–39.
- 25. Xie T, et al. FARP-Net: local-global feature aggregation and relation aware proposals for 3D object detection. IEEE Trans Multimedia. 2023;5:1027–40.
- 26. Wang S, Lu K, Xue J, Zhao Y. DA-Net: density-aware 3D object detection network for point clouds. IEEE Trans Multimedia. 2025;27:665–78.
- 27. Lang AH, Vora S, Caesar H, Zhou L, Yang J, Beijbom O. PointPillars: fast encoders for object detection from point clouds. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2019. 12689–97.
- 28. Yang Z, Chen J, Miao Z. DeepInteraction: 3D object detection via modality interaction. Adv Neural Inform Process Syst. 2022;11:1–14.
- 29. Mohajer A, Sorouri F, Mirzaei A, Ziaeddini A, Rad KJ, Bavaghar M. Energy-aware hierarchical resource management and backhaul traffic optimization in heterogeneous cellular networks. IEEE Systems Journal. 2022;16(4):5188–99.
- 30. Mohajer A, Hajipour J, Leung VCM. Dynamic offloading in mobile edge computing with traffic-aware network slicing and adaptive TD3 strategy. IEEE Commun Lett. 2025;1:95–9.
- 31. Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, et al. nuScenes: a multimodal dataset for autonomous driving. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020. 11618–28.
- 32. Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition, 2012. 3354–61.
- 33. Wei W, Yang R, Gu H, Zhao W, Chen C, Wan S. Multi-objective optimization for resource allocation in vehicular cloud computing networks. IEEE Trans Intell Transp Syst. 2022;23(12):25536–45.
- 34. Ouyang Z, Dong X, Cui J, Niu J, Guizani M. PV-EncoNet: fast object detection based on colored point cloud. IEEE Trans Intell Transport Syst. 2022;23(8):12439–50.
- 35. Sun Y, Lu B, Liu Y, Yang Z, Behera A, Song R, et al. Exploiting label uncertainty for enhanced 3D object detection from point clouds. IEEE Trans Intell Transport Syst. 2024;25(6):6074–89.
- 36. Shi W, Rajkumar R. Point-GNN: graph neural network for 3D object detection in a point cloud. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020. 1708–16.
- 37. Zhang Y, Hu Q, Xu G, Ma Y, Wan J, Guo Y. Not all points are equal: learning highly efficient point-based detectors for 3D LiDAR point clouds. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022. 18931–40.
- 38. Deng J, Shi S, Li P, Zhou W, Zhang Y, Li H. Voxel R-CNN: towards high performance voxel-based 3D object detection. AAAI. 2021;35(2):1201–9.
- 39. Liu L, Zhao Z, Feng J, Xu F, Zhang Y, Pei Q, et al. Distributed collaborative computing for task completion rate maximization in vehicular edge computing. IEEE Trans Intell Transport Syst. 2025;26(10):18070–82.
- 40. Chen Q, Tang S, Yang Q, Fu S. Cooper: cooperative perception for connected autonomous vehicles based on 3D point clouds. In: 2019 IEEE 39th International conference on distributed computing systems (ICDCS), 2019. 514–24.
- 41. Rawashdeh ZY, Wang Z. Collaborative automated driving: a machine learning-based method to enhance the accuracy of shared information. In: 2018 21st International conference on intelligent transportation systems (ITSC), 2018. 3961–6.
- 42. Yang K, Yang D, Zhang J, Li M, Liu Y, Liu J, et al. Spatio-temporal domain awareness for multi-agent collaborative perception. In: 2023 IEEE/CVF International conference on computer vision (ICCV), 2023. 23326–35.
- 43. Zhang H, Mekala MS, Yang D, Isaacs J, Nain Z, Park JH, et al. 3D harmonic loss: towards task-consistent and time-friendly 3D object detection on edge for V2X orchestration. IEEE Trans Veh Technol. 2023;72(12):15268–79.
- 44. He C, Li R, Li S, Zhang L. Voxel set transformer: a set-to-set approach to 3D object detection from point clouds. In: 2022 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), 2022. 8407–17.
- 45. Ren C, Gong C, Liu L. Task-oriented multimodal communication based on cloud–edge–UAV collaboration. IEEE Internet Things J. 2024;11(1):125–36.
- 46. Thornton S, Dey S. Multi-modal data and model reduction for enabling edge fusion in connected vehicle environments. IEEE Trans Veh Technol. 2024;73(8):11979–94.
- 47. Bai X, Hu Z, Zhu X, Huang Q, Chen Y, Fu H, et al. TransFusion: robust LiDAR-camera fusion for 3D object detection with transformers. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022. 1080–9.
- 48. Liu Z, Tang H, Amini A, Yang X, Mao H, Rus DL, et al. BEVFusion: multi-task multi-sensor fusion with unified bird’s-eye view representation. In: 2023 IEEE International conference on robotics and automation (ICRA), 2023. 2774–81.
- 49. Hoang Duy L, Kim G-W. AEC3D: an efficient and compact single stage 3D multiobject detector for autonomous driving. IEEE Trans Intell Transport Syst. 2022;23(12):23422–32.
- 50. Ye W, Xia Q, Wu H, Dong Z, Zhong R, Wang C, et al. Fade3D: fast and deployable 3D object detection for autonomous driving. IEEE Trans Intell Transport Syst. 2025;26(9):12934–46.
- 51. Wang C, Yang P, Lin J, Wu W, Zhang N. Object-based resolution selection for efficient edge-assisted multi-task video analytics. In: GLOBECOM 2022 - 2022 IEEE global communications conference, 2022. 5081–6.
- 52. Murad T, Nguyen A, Yan Z. DAO: dynamic adaptive offloading for video analytics. In: Proceedings of the 30th ACM International conference on multimedia, 2022. 3017–25.
- 53. Chen Y, Li Y, Zhang X, Sun J, Jia J. Focal sparse convolutional networks for 3D object detection. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022. 5418–27.
- 54. Feng Y, Liu S, Zhu Y. Real-time spatio-temporal LiDAR point cloud compression. In: 2020 IEEE/RSJ International conference on intelligent robots and systems (IROS), 2020. 10766–73.
- 55. Bazzi A, Ying M, Kanhere O, Rappaport TS, Chafii M. ISAC imaging by channel state information using ray tracing for next generation 6G. IEEE J Selected Topics Electromag Ant Propag. 2025;1(1):135–49.