Probing the link between vision and language in material perception using psychophysics and unsupervised learning
Fig 5
Effect of the number of unique words, language models, and material names on individual behavioral results.
(A) Distribution of the number of unique words participants used in the Verbal Description task. (B) Comparison of vision-language correlations across different language models. For each individual, we computed the within-person Spearman's correlation between the Vision and Text RDMs. The Text RDM was built by embedding the verbal descriptions with one of four pre-trained language models: CLIP's text encoder, Sentence-BERT, GPT-2, and OpenAI Embedding V3-small. The blue bars show the correlation values when all text features are used to construct the Text RDM; the gray bars show the correlation values when the "material name" is excluded. Asterisks indicate FDR-corrected p-values: *** p < 0.001, ** p < 0.01, and * p < 0.05.
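To make the analysis in (B) concrete, the following is a minimal sketch of how a per-participant Text RDM could be built from sentence embeddings and correlated with a Vision RDM. The choice of Sentence-BERT checkpoint, the cosine-distance metric, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: build a Text RDM from one participant's verbal descriptions
# and compute its Spearman correlation with a Vision RDM.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def text_rdm(descriptions: list[str]) -> np.ndarray:
    """Embed each description and return a stimulus-by-stimulus dissimilarity matrix."""
    # Assumed Sentence-BERT variant; the paper does not specify this checkpoint.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(descriptions)  # shape: (n_stimuli, embedding_dim)
    # Pairwise cosine distances, reshaped into a square RDM.
    return squareform(pdist(embeddings, metric="cosine"))

def rdm_spearman(rdm_a: np.ndarray, rdm_b: np.ndarray) -> float:
    """Spearman's correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)  # off-diagonal entries only
    rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho
```

Correlating only the upper triangles avoids the trivial matches on the diagonal and the duplicate entries of a symmetric RDM; repeating `rdm_spearman` per participant yields the within-person correlations summarized by the bars in (B).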