Fig 1.
ScenicOrNot image ratings plotted at their georeferenced coordinates for the entirety of Great Britain and the Isle of Man.
Values range from 1 to 10, where 10 is the most scenic, with an average scenicness rating of 4.43.
Fig 2.
Rated landscape prompt annotation workflow.
Voters are asked to imagine landscapes that they like or dislike. They are then asked to write a description of each landscape and to give it a rating between 1 and 10. The resulting dataset is a collection of landscape descriptions and their associated ratings from every voter.
Fig 3.
Simplified overview of VLMs.
VLMs use two separate encoders, which map images and text to the same latent space.
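The dual-encoder matching described above can be sketched numerically: both encoders emit vectors in a shared latent space, and image-text agreement is measured by cosine similarity. The sketch below uses random vectors as stand-ins for real encoder outputs; the array names and dimensionality (512) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalise(x):
    # Project embeddings onto the unit sphere so that the dot
    # product between two embeddings equals their cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the two encoder outputs: one image embedding and
# three candidate text-prompt embeddings in the same latent space.
rng = np.random.default_rng(0)
image_emb = l2_normalise(rng.normal(size=(1, 512)))
text_embs = l2_normalise(rng.normal(size=(3, 512)))

# Cosine similarity between the image and each prompt; the
# highest-scoring prompt is the best textual match for the image.
similarities = image_emb @ text_embs.T  # shape (1, 3)
best_prompt = int(similarities.argmax())
```

In an actual VLM such as CLIP, `image_emb` and `text_embs` would come from the frozen vision and text encoders respectively; only the similarity computation stays the same.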
Fig 4.
Prediction pipeline for the contrastive prompting method.
We first define a positive and a negative prompt with a shared prompt context. We then rescale the model's confidence in the positive prompt to the range 1 to 10, yielding the scenicness prediction for a given image.
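The contrastive prompting step can be sketched as follows: a softmax over the image's similarities to the positive and negative prompts yields a confidence in the positive prompt, which is then linearly rescaled from [0, 1] to the 1-10 scenicness range. The `temperature` value and example similarities are illustrative assumptions.

```python
import numpy as np

def contrastive_score(sim_pos, sim_neg, temperature=100.0):
    """Turn image-prompt similarities into a 1-10 scenicness score.

    sim_pos / sim_neg are cosine similarities between an image and the
    positive / negative prompt. A softmax over the pair gives the
    model's confidence in the positive prompt, which is rescaled
    linearly from [0, 1] to the SON rating range [1, 10].
    """
    logits = temperature * np.array([sim_pos, sim_neg])
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    confidence_pos = probs[0]
    return 1.0 + 9.0 * confidence_pos

# An image more similar to the positive prompt than to the negative
# one lands above the midpoint of the scale.
score = contrastive_score(sim_pos=0.28, sim_neg=0.21)
```

Swapping the two similarities produces the mirrored score, so the method is symmetric around the midpoint of the rating scale.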
Fig 5.
Methods of ensembling for generating image ratings from landscape prompts.
In early ensembling, we use the likelihood that any given prompt matches an image. The likelihood of each prompt is then multiplied by its voter-provided scenicness rating to determine the scenicness score of a given image. In late ensembling, we consider this weighted likelihood for the prompts of each voter individually to calculate a voter-specific scenicness score. We then average across all voter scenicness scores to calculate the scenicness score for the image.
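The two ensembling strategies above differ only in where the averaging happens: early ensembling softmaxes over all prompts from all voters at once, while late ensembling first computes a likelihood-weighted rating per voter and then averages across voters. A minimal sketch, with the temperature and the two hypothetical voters' similarities and ratings as assumed inputs:

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    x = x - x.max()                 # numerical stability
    e = np.exp(x)
    return e / e.sum()

def early_ensemble(prompt_sims, prompt_ratings, temperature=100.0):
    """Early ensembling: one softmax over ALL prompts, then the
    likelihood-weighted average of the voter-provided ratings."""
    sims = np.concatenate(prompt_sims)
    ratings = np.concatenate(prompt_ratings)
    probs = softmax(temperature * sims)
    return float(probs @ ratings)

def late_ensemble(prompt_sims, prompt_ratings, temperature=100.0):
    """Late ensembling: a weighted rating per voter first, then the
    mean over the per-voter scenicness scores."""
    voter_scores = []
    for sims, ratings in zip(prompt_sims, prompt_ratings):
        probs = softmax(temperature * np.asarray(sims))
        voter_scores.append(probs @ np.asarray(ratings))
    return float(np.mean(voter_scores))

# Two hypothetical voters: each list entry holds one voter's
# image-prompt similarities and the 1-10 ratings they attached
# to their own prompts.
prompt_sims = [np.array([0.30, 0.10]), np.array([0.25])]
prompt_ratings = [np.array([9.0, 2.0]), np.array([6.0])]

early = early_ensemble(prompt_sims, prompt_ratings)
late = late_ensemble(prompt_sims, prompt_ratings)
```

Note that late ensembling gives every voter equal weight regardless of how many prompts they wrote, whereas early ensembling lets prolific voters dominate the softmax.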
Fig 6.
Results for the few- and zero-shot prediction methods.
The black line shows the performance of a ViT-L/14 model initialised with CLIP pre-trained weights and trained on the complete SON dataset. The line represents the same model, where only the linear probe is fine-tuned in a few-shot setting. The red line shows the performance of the ViT-L/14 vision transformer pre-trained using CLIP. While both the ConvNeXt-Large and ViT-L/14 models provide adequate few-shot performance, the transformer model with web-scale pre-training and more parameters performs substantially better. When using 25 samples to estimate the best prompt combination, our zero-shot contrastive prompting method (shown in magenta) shows superior ranking performance compared to the few-shot models, although its predictions deviate further from the reference values, as evidenced by the high RMSE.
Fig 7.
Distribution of Kendall’s τ of each of the six prompt contexts of the contrastive prompting method when evaluated on the entire dataset (left) and when applied to only the n = 25 samples from the few-shot learning case (right).
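Kendall's τ, the ranking metric reported above, counts concordant versus discordant pairs between two orderings. A minimal sketch of the simplest variant (τ-a, which ignores ties; the paper may use a tie-corrected variant such as τ-b):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    A pair (i, j) is concordant when both sequences order the two
    items the same way, discordant when they disagree.
    """
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Identical rankings give tau = 1, fully reversed rankings give -1.
tau_same = kendall_tau([1, 2, 3, 4], [2, 4, 6, 8])
tau_flip = kendall_tau([1, 2, 3, 4], [8, 6, 4, 2])
```

Because τ depends only on pair orderings, it rewards a method that ranks images correctly even when its absolute predictions are offset from the reference scale, which is why it can diverge from RMSE.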
Table 1.
Metric performance of the top-five contrastive prompting configurations evaluated on the calibration set.
The high-performing prompts in this setting are similar to those observed for the full dataset.
Table 2.
Comparison of our LPE method with SON image scores.
Late ensembling results in image scenicness ratings that are closer to the SON image ratings, and including more voters results in a higher degree of agreement with SON, even when each voter provides fewer prompts.
Fig 8.
Maps of all methods tested in our research compared to the SON reference data (shown in panel a). The first row (panels b, d, f) shows methods based on the image ratings of SON, while the second row (panels c, e, g) shows methods based on the descriptions provided by our volunteers. The predicted scenicness ratings of each method vary greatly, though the main trends are similar across models, e.g., rugged wilderness being considered more beautiful than man-made areas such as cities.
Fig 9.
Comparison of ground truth average image ratings as plotted for each class in SON (left) with the image ratings generated by the LPE method that most closely matched SON in ranking performance (right).
The relative rankings for each land cover class are highly similar, though the LPE mean rating is far higher than in SON. Details about the classes are given in S2 Appendix.