Fig 1.
Our framework to investigate the link between vision and language in material perception.
We built an expandable Space of Morphable Material Appearance based on the unsupervised image generation model StyleGAN (see details in Fig 2). Our method allows us to synthesize images of diverse material appearances in a controllable manner. We created six categories of material images: three original materials (i.e., soap, toy, rock) learned directly from real photos, and three morphed materials obtained by cross-material morphing (i.e., soap-to-rock, rock-to-toy, and soap-to-toy). The displayed images are examples of each of the six categories. Sampling stimuli from the Space of Morphable Material Appearance, we measured material perception with two psychophysical tasks: visual material similarity judgment and verbal description. We quantitatively evaluated the representations across modalities using Representational Similarity Analysis (RSA) and the unsupervised alignment method Gromov-Wasserstein Optimal Transport (GWOT). Within each participant, we compared Representational Dissimilarity Matrices (RDMs) between the visual judgment (i.e., Vision RDM) and verbal description (i.e., Text RDM) results. We also compared each participant's visual judgment behavior with the image-feature representations (i.e., Image-feature RDMs) of the stimuli extracted from self-supervised models (e.g., DINO) or weakly supervised models (e.g., the text-guided model CLIP) pre-trained on large-scale datasets.
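As a concrete illustration of the RSA step, the minimal sketch below (Python; the variable names and random placeholder RDMs are ours, not the study's data) correlates the unique off-diagonal entries of two 72 × 72 RDMs with Spearman's rank correlation.

```python
# Minimal RSA sketch: correlate two RDMs by their unique off-diagonal
# entries. vision_rdm / text_rdm are illustrative placeholders; in the
# study they come from Multiple Arrangement and embedded descriptions.
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Stand-in 72 x 72 symmetric RDMs with zero diagonal (one per modality).
vision_rdm = squareform(rng.random(72 * 71 // 2))
text_rdm = squareform(rng.random(72 * 71 // 2))

# RSA compares only the unique pairwise dissimilarities (upper triangle).
iu = np.triu_indices(72, k=1)
r_s, p = spearmanr(vision_rdm[iu], text_rdm[iu])
print(f"Spearman r_s = {r_s:.3f}, p = {p:.3g}")
```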
Fig 2.
Overview of the synthesis pipeline for morphable material appearances.
(A) Training datasets. (B) Transfer learning pipeline. After training, we obtained models that generate images for three material classes. We can generate images of a desired material (e.g., soaps) by injecting the latent codes (e.g., w_soap ∈ W_soap) into the corresponding material generator (e.g., G_soap). (C) Illustration of cross-category material morphing. By linearly interpolating between a soap and a rock, we obtain a morphed material, “soap-to-rock,” produced from its latent code w_soap-to-rock and generator G_soap-to-rock. (D) Illustration of the Space of Morphable Material Appearance. (E) Examples of generated images from the Space of Morphable Material Appearance. These images are a subset of the stimuli used in our psychophysical experiments, covering two major lighting conditions (i.e., strong and weak lighting).
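The morphing in (C) can be sketched as follows; this is a minimal PyTorch illustration under assumed names (G_soap, G_rock, the generator calling convention), not the paper's actual implementation. Both the latent code and the generator weights are linearly interpolated between the two material classes.

```python
# Hedged sketch of cross-category morphing: linearly interpolate both the
# latent code and the generator weights. G_soap / G_rock stand for
# pre-trained StyleGAN generators (torch.nn.Module); the real training and
# model classes follow the paper's pipeline, not this sketch.
import copy
import torch

def lerp(a, b, alpha):
    return (1.0 - alpha) * a + alpha * b

@torch.no_grad()
def morph(G_soap, G_rock, w_soap, w_rock, alpha=0.5):
    # Interpolate latent codes: w_soap-to-rock at the midpoint (alpha = 0.5).
    w_mix = lerp(w_soap, w_rock, alpha)

    # Interpolate generator weights parameter-by-parameter to obtain
    # G_soap-to-rock (assumes both generators share one architecture).
    G_mix = copy.deepcopy(G_soap)
    rock_params = dict(G_rock.named_parameters())
    for name, p in G_mix.named_parameters():
        p.copy_(lerp(p, rock_params[name], alpha))

    return G_mix(w_mix)  # rendered "soap-to-rock" image (schematic call)
```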
Fig 3.
Illustration of the psychophysical experiment interfaces.
(A) The Multiple Arrangement task. Participants (N = 16) arranged images within a circle based on their judgment of the visual similarity of material properties. In the first trial, participants were presented with all 72 material images. In each subsequent trial, a subset of images was presented, chosen iteratively by an adaptive sampling algorithm [49]. (B) The Verbal Description task. Using free-form text input, participants were asked to describe the material shown in the image in five aspects: material name, color, optical properties, mechanical properties, and surface texture. The gray text shows example responses from one trial.
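For intuition, the sketch below shows how a single arrangement trial yields dissimilarities: the on-screen Euclidean distance between two items is read as their perceived dissimilarity. The adaptive aggregation across trials [49] is omitted, and all names and values are illustrative.

```python
# One arrangement trial -> dissimilarities, as a minimal sketch. The full
# method combines many trials via the adaptive algorithm of [49]; that
# aggregation step is not shown here.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# (x, y) positions of the arranged images within the circular arena,
# e.g. all 72 items in the first trial; values here are placeholders.
positions = np.random.default_rng(1).uniform(-1, 1, size=(72, 2))

# Pairwise on-screen distances, normalized to [0, 1] as in the
# individual RDMs.
d = pdist(positions, metric="euclidean")
rdm = squareform(d / d.max())
```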
Fig 4.
Vision-based similarity judgment and verbal description of materials are moderately correlated.
(A) RDMs of visual material similarity judgment via Multiple Arrangement (Vision RDMs) and Verbal Description (Text RDMs). Top: Vision RDMs. Bottom: Text RDMs. From left to right: RDMs for three participants and the group average RDM across all participants. In each RDM, on both the x- and y-axes, the images are organized by the type of material generator, spanning from the learned original materials (i.e., soap, toy, rock) to the morphed midpoint materials (i.e., soap-to-rock, rock-to-toy, and soap-to-toy). The dissimilarities in the individual RDMs are normalized to the range [0, 1], and the average RDMs are computed from the normalized values. Green indicates low dissimilarity between pairwise combinations of materials, whereas pink indicates high dissimilarity. The Spearman correlation (r_s) between the corresponding Vision and Text RDMs is annotated in the box below. (B) Two-dimensional embeddings from MDS of the group average Vision and Text RDMs, color-coded by the six types of material generator depicted in (A). The percentage of explained variance is shown in parentheses.
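The embedding in (B) can be reproduced in spirit with classical (Torgerson) MDS, whose eigenvalue spectrum also yields the per-dimension explained variance; a minimal sketch, assuming `rdm` is a symmetric dissimilarity matrix such as the group average Vision RDM:

```python
# Classical (Torgerson) MDS on an RDM, with per-dimension explained
# variance from the eigenvalues.
import numpy as np

def classical_mds(rdm, n_dims=2):
    n = rdm.shape[0]
    # Double-center the squared dissimilarities.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (rdm ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    coords = evecs[:, :n_dims] * np.sqrt(np.maximum(evals[:n_dims], 0))
    # Fraction of positive eigenvalue mass captured by each dimension.
    explained = evals[:n_dims] / evals[evals > 0].sum()
    return coords, explained
```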
Fig 5.
Effect of the number of unique words, language models, and material names on individual behavioral results.
(A) Distribution of the number of unique words participants used in the Verbal Description task. (B) Comparison of vision-language correlations across different language models. For each individual, we computed the within-person Spearman correlation between the Vision and Text RDMs. The Text RDM is built by embedding the verbal descriptions with four different pre-trained language models: CLIP’s text encoder, Sentence-BERT, GPT-2, and OpenAI Embedding V3-small. The blue bars indicate the correlation values when all text features are included in constructing the Text RDM. The gray bars indicate the correlation values when the “material name” is excluded from constructing the Text RDM. Asterisks indicate FDR-corrected p-values: *** p < 0.001, ** p < 0.01, and * p < 0.05.
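As an illustration of the Text RDM construction, here is a minimal sketch with the sentence-transformers library; the specific checkpoint ("all-MiniLM-L6-v2"), the cosine distance, and the joining of description fields are our assumptions for illustration, not necessarily the study's exact settings.

```python
# Hedged sketch: build a Text RDM from one participant's descriptions with
# a Sentence-BERT encoder (one of the four model families compared in (B)).
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import pdist, squareform

model = SentenceTransformer("all-MiniLM-L6-v2")  # a standard SBERT model

# One free-form description per stimulus; for the "material name excluded"
# condition (gray bars), drop that field before joining. Placeholders here.
descriptions = [
    "soap, pale yellow, translucent, soft, smooth",
    "rock, gray, matte, rigid, rough",
]

embeddings = model.encode(descriptions)
text_rdm = squareform(pdist(embeddings, metric="cosine"))
```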
Fig 6.
Annotated MDS of the group average Vision RDM.
For “Colorfulness”, we display each data point as its image stimulus. For “Optical properties”, “Surface texture”, “Mechanical properties”, and “Material name”, we display the most frequently used word for each image, aggregated across all participants (the same word is always shown in the same color). The color schemes of words are not comparable across MDS plots. An interactive version of this plot is provided in the S1 File.
Fig 7.
The similarity structures between visual judgment and verbal description align on the coarse categorical level but lack precise one-to-one stimulus-level mapping.
(A) Illustration of Gromov-Wasserstein Optimal Transport (GWOT), an unsupervised alignment method to compare two similarity structures (e.g., similarity structures 1 and 2). D_ij denotes the dissimilarity between stimuli i and j in one RDM, and D'_kl denotes the dissimilarity between stimuli k and l in the other RDM. Solving the minimization problem of the Gromov-Wasserstein distance (GWD) yields an optimal transportation plan Γ. The group average Vision (top left) and Text (top right) RDMs are shown as examples. (B) Optimal transportation plan Γ between the group average Vision RDM and Text RDM. Each element of the Γ matrix indicates the probability that an image in the similarity structure of verbal description corresponds to an image in the similarity structure of visual material similarity judgment. A fully purple diagonal would indicate perfect alignment at the image-to-image level. Only a small fraction of the diagonal elements of Γ show high values, indicating the lack of one-to-one mapping between verbal descriptions and visual judgments of the stimuli. The Spearman correlation (r_s) between the Vision and Text RDMs is noted in the bottom left corner of the Γ matrix. (C) Γ matrices computed from individual participants’ Vision and Text RDMs. The samples on the x- and y-axes follow the same order as the Vision and Text RDMs.
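A minimal sketch of the GWOT computation using the POT library (https://pythonot.github.io) is shown below; the study's full procedure (e.g., any entropic regularization or hyperparameter search) may be richer, and the uniform marginals are an assumption.

```python
# Plain Gromov-Wasserstein alignment between two RDMs with POT. Gamma
# minimizes sum_{i,j,k,l} (D_ij - D'_kl)^2 * Gamma_ik * Gamma_jl subject
# to the marginal constraints p and q.
import numpy as np
import ot

def gwot_align(rdm_vision, rdm_text):
    n = rdm_vision.shape[0]
    p = np.full(n, 1.0 / n)  # uniform mass over the 72 stimuli (assumed)
    q = np.full(n, 1.0 / n)
    gamma, log = ot.gromov.gromov_wasserstein(
        rdm_vision, rdm_text, p, q, loss_fun="square_loss", log=True)
    return gamma, log["gw_dist"]  # transport plan and GWD value
```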
Fig 8.
Comparison of human visual judgments with image representations.
(A) Image-feature RDMs of the 72 stimuli created from pre-trained models: LPIPS (perceptual image similarity), DINO-ViT-s8 (self-supervised model), and OpenCLIP-ViT-L/14 (visual-semantic model). The elements in the RDMs follow the same order as the Vision and Text RDMs. (B) Spearman correlation between an individual’s Vision RDM and the Image-feature RDM from each vision encoder. The bars represent the average correlations across participants. The black dots represent individual participants. The red dotted line indicates the lower bound of the noise ceiling of the human visual judgment results. On top of each bar, * indicates p < 0.005 for model-specific one-sided signed-rank tests against zero. The horizontal black bar indicates p < 0.05 for two-sided pairwise signed-rank tests between vision encoder models. (C) Optimal transportation plan (Γ) matrix between the group average human visual judgment and the image-feature embeddings of the 72 psychophysical stimuli. The Spearman correlation (r_s) between the Image-feature RDM and the group average human Vision RDM is marked in the bottom left corner. (D) Combining verbal descriptions with image features from the visual-semantic model OpenCLIP-ViT-L/14 improves the prediction of visual material similarity judgment for all participants. The y-axis represents individual participants. The x-axis represents the explained variance.
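As an illustration of how an Image-feature RDM in (A) can be built, the sketch below extracts features from the public DINO ViT-S/8 checkpoint via torch.hub and takes pairwise cosine distances; the preprocessing choices here are ours and may differ from the authors' exact settings.

```python
# Hedged sketch of an Image-feature RDM from DINO ViT-S/8 features.
import torch
import torchvision.transforms as T
from scipy.spatial.distance import pdist, squareform

model = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
model.eval()

# Standard ImageNet-style preprocessing (an assumption for illustration).
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def image_feature_rdm(images):           # images: list of PIL images
    batch = torch.stack([preprocess(im) for im in images])
    feats = model(batch)                 # one embedding per stimulus
    return squareform(pdist(feats.numpy(), metric="cosine"))
```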