Fig 1.
Our LGMMfusion framework first extracts features from the LiDAR point cloud and the camera images through their respective backbone networks.
The accurate depth information from the LiDAR then guides the multi-view images in generating image BEV representations, which are finally fused with the LiDAR BEVs into a unified representation.
Fig 2.
Multi-head multi-scale self- and cross-attention block.
Fig 3.
Multi-head adaptive cross-attention block.
Fig 4.
In the BEV coordinate system, we construct a three-dimensional grid of sampling points.
The illustrations show these 3D grid points projected onto the 2D image planes of the different camera views.
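The projection described above can be sketched as a standard pinhole-camera transform: each 3D grid point is moved into a camera frame by an extrinsic rotation and translation, then mapped to pixel coordinates by the intrinsics. The grid extents, intrinsic matrix `K`, and extrinsics `R`, `t` below are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 points from the ego/BEV frame onto a camera image plane.
    K: 3x3 intrinsics; R, t: ego-to-camera rotation and translation."""
    cam = points_3d @ R.T + t          # transform into the camera frame
    in_front = cam[:, 2] > 1e-6        # keep only points with positive depth
    uvw = cam @ K.T                    # apply the intrinsic matrix
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective divide -> pixel coords
    return uv, in_front

# Hypothetical 3D grid of BEV sampling points (x forward, y left, z up)
xs, ys, zs = np.meshgrid(np.arange(1, 5), np.arange(-2, 3),
                         np.arange(0, 2), indexing="ij")
grid = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3).astype(float)

# Hypothetical front camera: intrinsics and an ego-to-camera axis swap
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],   # camera x = -ego y
              [0.0, 0.0, -1.0],   # camera y = -ego z
              [1.0, 0.0, 0.0]])   # camera z (depth) = ego x
t = np.zeros(3)

uv, valid = project_points(grid, K, R, t)
```

Repeating this per camera, with each camera's own `K`, `R`, `t`, yields the multi-view projections that the figure visualizes.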
Table 1.
Results on the nuScenes validation set. The modalities are camera (C) and LiDAR (L).
Table 2.
Results on the nuScenes validation set. The methods are compared by modality (Mod), using LiDAR (L) and camera (C). Performance is evaluated both overall (mAP, NDS) and per class [75]. The classes are Car, Truck (Tru), Construction Vehicle (C.V.), Bus, Trailer (Tra), Barrier (Bar), Motorcycle (Motor), Bike, Pedestrian (Ped.), and Traffic Cone (T.C.). The small-object categories (pedestrian, traffic cone, and bicycle) show the most notable mAP improvements.
Fig 5.
Examples of 3D annotations; each row represents a scene.
The first row shows a tunnel during the day, and the second row a tunnel at night.
Table 3.
Extended ablation study results on the nuScenes validation set. The table compares LGMMfusion variants, covering the impact of the image BEV (I-BEV), the BEV query (BEV-Q), the attention modules (MHMS-SA and MHA-CA), and the attention parameters (the number of heads H and scales S). The I-BEV variants differ in their use of BatchNorm (BN) and ReLU activations.