VLM-Nav: Mapless UAV navigation using monocular vision driven by vision-language models

doi:10.1371/journal.pone.0345778

Table 1.

Comparison between different sensors for perception.

Fig 1.

Compared with traditional RL/DL-based UAV navigation systems, which struggle with unseen environments, require expensive sensors, and rely on external instructions, VLM-Nav offers better generalization, cost efficiency, independence from external inputs, and minimal training needs, making it a more adaptable and efficient navigation approach.

Table 2.

Recent studies in vision-based UAV navigation.

Fig 2.

Navigation setup used in this study.

Fig 3.

Our system is validated in three environments: (a) a simple environment with a single obstacle between two walls, which is used to train the navigator model; (b) the block environment provided by AirSim; and (c) an environment created using the Downtown West pack from the UE marketplace.

Fig 4.

Overview of the proposed VLM-Nav method.

(Left) First, RGB scene images are captured and converted into a depth map. (Top right) This depth map is then analyzed by the vision-language model (VLM), which provides the corresponding action response. (Bottom right) Lastly, the VLM’s feedback is sent to the navigation model, along with the relative heading angle, the left and right distance sensor measurements, and the proximal object detection output.
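
The caption lists the relative heading angle as one of the navigator inputs but does not define it here. A minimal sketch of one plausible definition follows, assuming it is the signed angle between the UAV's current yaw and the bearing to the target in the horizontal plane. The function name `relative_heading_deg`, the angle convention, and the wrap range are all illustrative assumptions, not the authors' implementation.

```python
import math


def relative_heading_deg(uav_xy, target_xy, yaw_deg):
    """Signed angle (degrees) from the UAV's heading to the target bearing.

    Assumed convention (not taken from the paper): yaw and bearing are
    measured counter-clockwise from the +x axis, and the result is wrapped
    to the half-open interval [-180, 180).
    """
    dx = target_xy[0] - uav_xy[0]
    dy = target_xy[1] - uav_xy[1]
    bearing_deg = math.degrees(math.atan2(dy, dx))
    # Shift by 180, wrap with modulo, then shift back to centre on zero.
    return (bearing_deg - yaw_deg + 180.0) % 360.0 - 180.0
```

Wrapping the difference keeps the input bounded and avoids a discontinuity when the yaw crosses 0/360 degrees, which is usually convenient for a small fully connected model.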

Fig 5.

Examples of depth image estimation from Depth-Anything v2.

Fig 6.

(Left) The depth map is estimated from the RGB scene, then normalized and rescaled to the range 0–255. (Middle) The depth map is sent to the VLM along with the preset prompt (P). (Right) Based on the VLM feedback (R), the suggested direction to avoid the obstacle is extracted using keyword search.
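
The two non-VLM steps in this figure can be sketched as follows: min-max rescaling of the depth map to 0–255, and a keyword search over the VLM response. This is a rough illustration only. The keyword table `DIRECTION_KEYWORDS`, the action labels, and both function names are hypothetical, and the paper's actual prompt and keyword list are not reproduced here.

```python
import re

import numpy as np

# Hypothetical keyword-to-action table (an assumption for illustration).
DIRECTION_KEYWORDS = {
    "left": "yaw_left",
    "right": "yaw_right",
    "up": "go_up",
    "upward": "go_up",
}


def depth_to_uint8(depth):
    """Min-max normalize a depth map and rescale it to integers in 0..255."""
    depth = np.asarray(depth, dtype=np.float64)
    lo, hi = depth.min(), depth.max()
    if hi == lo:
        # A constant map carries no relative depth; return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


def extract_direction(response):
    """Return the action for the first direction keyword in the response.

    Matching is done on whole lowercase words, so "upright" does not
    match "up". Returns None when no keyword is present.
    """
    for word in re.findall(r"[a-z]+", response.lower()):
        if word in DIRECTION_KEYWORDS:
            return DIRECTION_KEYWORDS[word]
    return None
```

A first-match rule is the simplest choice. A response that mentions several directions (for example, "do not go left, go right") would need a stricter prompt or a more careful parser.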

Fig 7.

(Left) The process of proximal object detection in VLM-Nav: first, the depth map is cropped and the pixel values are binarized using a threshold, followed by connected component analysis.

Finally, the output indicates whether any connected groups exist within the three defined regions. (Right) An example of the process with two scenarios.
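
The crop, binarize, and connected-component steps described above can be sketched as below. Several details are assumptions made only for illustration: the crop keeps a central horizontal band, larger pixel values mean closer surfaces, the three regions are the left, center, and right thirds of the cropped band, and components smaller than `min_pixels` are treated as noise. The threshold value and all names are hypothetical; the paper's own settings are not shown in this caption.

```python
from collections import deque

import numpy as np


def label_components(mask):
    """Label 4-connected components of a boolean mask with a BFS flood fill."""
    n_rows, n_cols = mask.shape
    labels = np.zeros(mask.shape, dtype=np.int32)
    count = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r, c] or labels[r, c]:
                continue
            count += 1
            labels[r, c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < n_rows and 0 <= nx < n_cols
                    if inside and mask[ny, nx] and not labels[ny, nx]:
                        labels[ny, nx] = count
                        queue.append((ny, nx))
    return labels, count


def proximal_objects(depth_u8, threshold=200, crop=(0.25, 0.75), min_pixels=4):
    """Return (left, center, right) flags for nearby objects.

    All defaults are illustrative assumptions, not values from the paper.
    """
    depth_u8 = np.asarray(depth_u8)
    n_rows = depth_u8.shape[0]
    top, bottom = int(n_rows * crop[0]), int(n_rows * crop[1])
    band = depth_u8[top:bottom]      # crop to a central horizontal band
    mask = band >= threshold         # binarize: True means "close"
    labels, count = label_components(mask)
    keep = np.zeros(mask.shape, dtype=bool)
    for k in range(1, count + 1):
        component = labels == k
        if component.sum() >= min_pixels:  # drop tiny, noisy components
            keep |= component
    thirds = np.array_split(keep, 3, axis=1)
    return tuple(bool(region.any()) for region in thirds)
```

In practice a library routine such as `scipy.ndimage.label` would replace the hand-written flood fill; it is written out here only to keep the sketch dependent on NumPy alone.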

Table 3.

Details of input parameters for the navigator model.

Fig 8.

Navigator model architecture.

Table 4.

Hyperparameters of the Navigator FCN model.

Table 5.

VLM-Nav configurations.

Table 6.

Performance comparison of depth estimation algorithms.

Fig 9.

Minimum distance to obstacle detection for different threshold values.

Table 7.

Performance of navigator module.

Table 8.

Overall navigation performance of VLM-Nav.

Table 9.

Inference time of VLM-Nav.

Fig 10.

Five example flight paths generated by VLM-Nav in three different environments.

The starting and target coordinates are each marked by a symbol on the flight path. Selected points (indicated by circles) along the flight paths are shown from the UAV’s front camera perspective. At these points, the direction the UAV moves to avoid obstacles is depicted with red arrow symbols for yaw right, yaw left, and upward motion.