Table 1.
Comparison between different sensors for perception.
Fig 1.
Traditional RL/DL-based UAV navigation systems struggle with unseen environments, require expensive sensors, and rely on external instructions. In comparison, VLM-Nav offers better generalization, lower cost, independence from external inputs, and minimal training needs, making it a more adaptable and efficient navigation approach.
Table 2.
Recent studies in vision-based UAV navigation.
Fig 2.
Navigation setup used in this study.
Fig 3.
Our system is validated in three environments: (a) a simple environment with a single obstacle between two walls, used to train the navigator model; (b) the Blocks environment provided by AirSim; and (c) an environment created using the Downtown West pack from the UE marketplace.
Fig 4.
Overview of the proposed VLM-Nav method.
(Left) First, RGB scene images are captured and converted into a depth map. (Top right) This depth map is then analyzed by the vision-language model (VLM), which provides the corresponding action response. (Bottom right) Lastly, the VLM’s feedback, along with the relative heading angle, left and right distance sensor measurements, and proximal object detection output, is sent to the navigation model.
Fig 5.
Examples of depth image estimation from Depth-Anything v2.
Fig 6.
(Left) The depth map is estimated from the RGB scene, then normalized and rescaled to the range 0–255.
(Middle) The depth map is sent to the VLM along with the preset prompt (P). (Right) Based on the VLM feedback (R), the suggested direction to avoid the obstacle is extracted using keyword search.
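The rescaling and keyword-search steps in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the min-max normalization, and the keyword set (`left`, `right`, `up`) are assumptions, since the caption gives neither the normalization formula nor the keyword list. Pure Python is used for self-containment; a real pipeline would more likely operate on NumPy arrays.

```python
import re

# Hypothetical keyword set; the caption does not list the actual keywords.
_DIRECTION = re.compile(r"\b(left|right|up)\b")


def rescale_depth(depth):
    """Min-max normalize a 2D depth map (list of rows) and rescale to 0-255 ints."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat map: avoid division by zero
        return [[0 for _ in row] for row in depth]
    scale = 255.0 / (hi - lo)
    return [[int(round((v - lo) * scale)) for v in row] for row in depth]


def extract_direction(response):
    """Return the first direction keyword found in the VLM response, else None."""
    match = _DIRECTION.search(response.lower())
    return match.group(1) if match else None
```

Matching on word boundaries keeps a word such as "setup" from being read as "up". A response containing no keyword yields `None`, which the caller would need to handle (for example, by continuing straight).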
Fig 7.
(Left) The process of proximal object detection in VLM-Nav: First, the depth map is cropped, and the pixel values are binarized using a threshold, followed by connected component analysis.
Finally, the output indicates whether any connected groups exist within the three defined regions. (Right) An example of the process with two scenarios.
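The crop, binarize, and connected-component steps described in this caption can be sketched as below. Several details are assumptions rather than facts from the paper: that larger pixel values mean nearer objects, that the three regions are the left, center, and right thirds of the cropped map, that components are 4-connected, and the hypothetical `crop` and `min_size` parameters. A production version would typically use an image-processing library instead of a hand-written flood fill.

```python
from collections import deque


def proximal_flags(depth, threshold, crop=None, min_size=1):
    """Return [left, center, right] booleans for near-object components.

    depth: 2D list of 0-255 values; larger is ASSUMED to mean nearer.
    crop: (top, bottom, left, right) bounds; defaults to the full map.
    min_size: hypothetical noise filter on component pixel count.
    """
    top, bottom, left, right = crop or (0, len(depth), 0, len(depth[0]))
    # Crop, then binarize: True where the pixel counts as "near".
    mask = [[v >= threshold for v in row[left:right]] for row in depth[top:bottom]]
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    flags = [False, False, False]
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Flood-fill one 4-connected component, recording its columns.
            queue, cols = deque([(r0, c0)]), []
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                cols.append(c)
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(cols) >= min_size:
                # Flag every third of the crop that this component touches.
                for c in cols:
                    flags[min(c * 3 // w, 2)] = True
    return flags
```

A component that spans two regions sets both flags, so a wide obstacle directly ahead would be reported in every third it overlaps.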
Table 3.
Details of input parameters for the navigator model.
Fig 8.
Navigator model architecture.
Table 4.
Hyperparameters of the Navigator FCN model.
Table 5.
VLM-Nav Configurations.
Table 6.
Performance comparison of depth estimation algorithms.
Fig 9.
Minimum distance to obstacle detection for different threshold values.
Table 7.
Performance of navigator module.
Table 8.
Overall navigation performance of VLM-Nav.
Table 9.
Inference time of VLM-Nav.
Fig 10.
Five example flight paths generated by VLM-Nav in three different environments are presented.
The starting and target coordinates are each marked by a distinct symbol. Selected points (indicated by circles) along the flight paths are shown from the UAV’s front camera perspective. At these points, the direction the UAV moves to avoid obstacles is depicted with red arrow symbols (for yaw right, yaw left, and upward movement).
Fig 11.
Flight path comparison between VLM-Nav and human-controlled flight in the (a) Blocks and (b) Downtown West environments.
The starting and target locations are shown using colored markers.
Table 10.
Comparison of our approach with recent studies in UAV navigation.
Fig 12.
Ablation case studies.