Unsupervised learning reveals interpretable latent representations for translucency perception

doi:10.1371/journal.pcbi.1010878

Fig 1.

The Translucent Appearance Generation (TAG) model.

(A) Given natural images as input, the TAG framework, which is based on the StyleGAN2-ADA generator and pSp encoder architectures, learns to synthesize perceptually convincing images of translucent objects. The model maps photos of translucent objects into the W+ latent space, which disentangles the effects of scene attributes (e.g., shape, material, and body color) and predicts human perception of translucency. (B) The detailed process of embedding a photo into StyleGAN’s W+ latent space, which allows us to generate an image at a particular location in the latent space. (C) Human-understandable scene attributes that emerge in the layer-wise latent space. Without supervision, W+ spontaneously disentangles three salient scene attributes: material, shape/orientation, and body color. In each row, an original generated image (left) is gradually manipulated by modifying its latent vectors at specific layers. Early-layer (w₁ to w₆) manipulation of W+ transforms the shape and orientation of the object. Middle-layer (w₇ to w₉) manipulation modifies the material appearance. Later-layer (w₁₀ to w₁₈) manipulation changes the body color (the color of the diffuse component of the surface reflection).


Fig 2.

Experiment 1: Real-versus-generated discrimination.

(A) Examples of real photographs and model-synthesized images of soaps. The “generated” soaps were synthesized by embedding a real photograph into the W+ latent space of the trained StyleGAN2-ADA using the pSp encoder. We used 150 real photographs and 150 generated images as stimuli for Experiments 1 and 2. (B) The procedure of Experiment 1. (C) Overall correct and error rates for judging real and generated images. An error rate of 50% corresponds to pure guessing. (D) Distribution of the percentage of observers who misjudged each real and generated image. The x-axis shows the percentage of observers who misjudged an image, and the y-axis shows the percentage of images misjudged at that rate. Gray denotes real images and green denotes generated images.
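As a rough illustration of how the quantities in panels (C) and (D) could be computed, the sketch below assumes that responses are stored as one list of booleans per image, where True means an observer misjudged that image. The data layout, the toy values, and the function names are hypothetical and are not taken from the study.

```python
def misjudgment_rates(responses):
    """Per-image percentage of observers who misjudged the image."""
    return [100.0 * sum(r) / len(r) for r in responses]


def overall_error_rate(responses):
    """Percentage of all judgments (pooled over images) that were wrong."""
    total = sum(len(r) for r in responses)
    errors = sum(sum(r) for r in responses)
    return 100.0 * errors / total


def histogram(rates, bin_width=25.0):
    """Percentage of images falling into each misjudgment-rate bin."""
    n_bins = int(100 / bin_width)
    counts = [0] * n_bins
    for rate in rates:
        # A rate of exactly 100% is placed in the last bin.
        idx = min(int(rate // bin_width), n_bins - 1)
        counts[idx] += 1
    return [100.0 * c / len(rates) for c in counts]


# Toy data: 4 images, 4 observers each (True = misjudged).
toy = [
    [True, False, False, False],
    [True, True, False, False],
    [False, False, False, False],
    [True, True, True, True],
]
rates = misjudgment_rates(toy)
print(rates, overall_error_rate(toy), histogram(rates))
```

The same functions would be applied separately to the real and the generated images to obtain the two distributions compared in panel (D).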


Fig 3.

Experiment 2: Material attribute rating.

(A) The user interface of Experiment 2. (B) The distribution of the mean normalized attribute ratings across observers. For each observer, we normalized the attribute ratings to the range of 0 to 1. The x-axis represents the normalized ratings of an attribute averaged over 20 observers, and the y-axis shows the percentage of images. (C) Scatter plots of ratings for each pair of material attributes, with the Pearson correlations shown at the top. All correlation coefficients are statistically significant (p < 0.001). In both (B) and (C), gray and green represent results for real and generated images, respectively. (D) Examples of real and generated images judged to have different levels of translucency. We grouped the images based on the mean normalized translucency rating: high (0.6 to 1), intermediate (0.2 to 0.6), and low (0 to 0.2).
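The rating pipeline described in this caption can be sketched as follows. The caption states only that each observer's ratings are normalized to the range 0 to 1; min-max scaling is assumed here, as is the handling of the bin edges at 0.2 and 0.6. All function names and toy values are hypothetical.

```python
from math import sqrt


def normalize_observer(ratings):
    """Min-max scale one observer's ratings to [0, 1] (assumed method)."""
    lo, hi = min(ratings), max(ratings)
    if hi == lo:  # observer gave the same rating to every image
        return [0.0 for _ in ratings]
    return [(r - lo) / (hi - lo) for r in ratings]


def mean_normalized(ratings_by_observer):
    """Average the normalized ratings over observers, image by image."""
    normalized = [normalize_observer(r) for r in ratings_by_observer]
    n_obs = len(normalized)
    return [sum(col) / n_obs for col in zip(*normalized)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def translucency_group(score):
    """Bins from the caption; edge values go to the upper bin (assumed)."""
    if score >= 0.6:
        return "high"
    if score >= 0.2:
        return "intermediate"
    return "low"


# Toy ratings: 2 observers x 3 images for two attributes.
translucency = mean_normalized([[1, 3, 5], [2, 4, 6]])
see_through = mean_normalized([[1, 2, 3], [1, 2, 3]])
print(translucency, [translucency_group(s) for s in translucency])
print(pearson(translucency, see_through))
```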


Fig 4.

Experiment 3: Perceptual evaluation of emerged scene attributes in the layer-wise latent space.

(A) Examples of morphed image sequences used in Experiment 3, generated by linearly interpolating between the latent codes of the source (w_A) and target (w_B) separately at the early (w₁ to w₆), middle (w₇ to w₉), and later (w₁₀ to w₁₈) layers. λ is the interpolation step away from the source image. Source-target pairs were picked under three conditions based on the soaps’ material properties: opaque-translucent (OT), opaque-opaque (OO), and translucent-translucent (TT). (B) The user interface of Experiment 3. (C) Perceptual results showing how different layers correspond to scene attributes. The number in each cell is the average percentage of times observers chose a visual attribute as the most prominent one that changes in the image sequence generated by the corresponding layer manipulation. The standard deviation across observers is shown in parentheses. Each row of the heat map accounts for 50 image sequences. (D) The representation of lighting in the latent space. Top panel: manipulation of the early layers (layers 4 to 6) also changes the direction of lighting. From left to right, the lighting direction rotates clockwise. Bottom panel: manipulation of the middle layers (layers 7 to 9) alters the lighting environment. From left to right, the strength of backlighting gradually decreases.
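The layer-wise morphing in panel (A) can be sketched as below. The sketch assumes λ is a fraction in [0, 1], so that each interpolated layer is (1 − λ)·w_A + λ·w_B while all other layers keep the source values. The layer ranges come from the caption; the function names and the toy two-dimensional vectors, which stand in for the real per-layer latent vectors, are hypothetical.

```python
# 1-based layer ranges taken from the caption.
LAYER_GROUPS = {
    "early": range(1, 7),    # w1 to w6
    "middle": range(7, 10),  # w7 to w9
    "later": range(10, 19),  # w10 to w18
}


def interpolate_layers(w_a, w_b, group, lam):
    """Interpolate only the chosen layer group; other layers stay at w_a."""
    out = [list(v) for v in w_a]
    for layer in LAYER_GROUPS[group]:
        i = layer - 1  # convert to 0-based index
        out[i] = [(1 - lam) * a + lam * b for a, b in zip(w_a[i], w_b[i])]
    return out


def morph_sequence(w_a, w_b, group, n_steps):
    """Latent codes for an evenly spaced sequence from lam=0 to lam=1."""
    return [
        interpolate_layers(w_a, w_b, group, k / (n_steps - 1))
        for k in range(n_steps)
    ]


# Toy 18-layer codes with 2-D vectors per layer.
w_a = [[0.0, 0.0] for _ in range(18)]
w_b = [[1.0, 2.0] for _ in range(18)]
mid = interpolate_layers(w_a, w_b, "middle", 0.5)
seq = morph_sequence(w_a, w_b, "middle", 5)
print(mid[6], mid[0], len(seq))
```

Each code in the sequence would then be passed to the generator to render one frame of the morph; that step is omitted because it depends on the trained network.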


Fig 5.

The middle layers of the W+ latent space can effectively modulate the translucency of generated images and predict human perception.

(A) Illustration of a trained layer-specific support vector machine (SVM) classifier for milky-versus-glycerin soap discrimination. (B) Scatter plots of the model prediction values versus the human mean normalized attribute ratings for each generated image in Experiment 2. Green, blue, and orange represent the data for translucency, see-throughness, and glow, respectively. (C) The tuning curve of correlation coefficients (correlation between model prediction and human perceptual rating, r_hc) over all layers in the W+ latent space. Model prediction values using the middle layers’ decision boundaries (d₇, d₈, and d₉) strongly correlate with human attribute ratings. “*” indicates that the correlations at that layer are not statistically significant at the 95% confidence level. (D) Examples of translucency-modulated sequences. Top: Manipulating the layer-9 latent vector of the original image (left end) along the normal of the learned decision boundary has a coherent effect on the translucent material appearance of the object. Left: Moving in the positive direction of the normal of the decision boundary makes the soap appear more opaque. Right: Moving in the negative direction of the normal of the decision boundary makes the soap appear more translucent. Bottom: Manipulating the layer-12 latent vector of the original image along the normal of the learned decision boundary does not fundamentally change the translucent appearance.
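The manipulation in panel (D) amounts to shifting one layer's latent vector along the unit normal of a linear decision boundary. The sketch below assumes a linear boundary n·w + b = 0 and treats the signed distance to that boundary as the model prediction value; the caption does not define the prediction value, so this is an assumption. The normal, bias, and toy vectors are hypothetical placeholders for values a trained classifier would supply.

```python
from math import sqrt


def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def signed_distance(w_layer, normal, bias):
    """Signed distance from a latent vector to the boundary n.w + b = 0."""
    norm = sqrt(sum(x * x for x in normal))
    return (sum(a * b for a, b in zip(w_layer, normal)) + bias) / norm


def move_along_normal(w_plus, layer, normal, alpha):
    """Shift one 1-based layer by alpha along the unit normal.

    Per the caption, positive alpha moves toward the opaque side and
    negative alpha toward the translucent side.
    """
    n_hat = unit(normal)
    out = [list(v) for v in w_plus]
    i = layer - 1
    out[i] = [x + alpha * n for x, n in zip(w_plus[i], n_hat)]
    return out


# Toy 18-layer code with 2-D vectors and a hypothetical layer-9 boundary.
w_plus = [[1.0, 1.0] for _ in range(18)]
normal, bias = [3.0, 4.0], -5.0
before = signed_distance(w_plus[8], normal, bias)
edited = move_along_normal(w_plus, 9, normal, 2.0)
after = signed_distance(edited[8], normal, bias)
print(before, after)
```

Because the shift is along the unit normal, the signed distance changes by exactly alpha, which makes the step size directly comparable across layers.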


Fig 6.

Visualization of the generative process of the network.

The impression of translucency emerges at the early stages of the image synthesis process, while more details of the appearance are added in the later stages. Each row corresponds to the intermediate generative outputs from a sequence of tRGB layers at different spatial resolutions in StyleGAN2-ADA’s generative network. Translucency-related features are established as early as 32 pixels × 32 pixels (layers 7 and 8) and 64 pixels × 64 pixels (layer 9). Surface reflective properties such as specular highlights are only clearly visible at 128 pixels × 128 pixels (layers 11 to 12). The body color of the soap is finalized at the resolution of 1024 pixels × 1024 pixels (layers 17 to 18).
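The layer-to-resolution pairs quoted in this caption follow a simple pattern, sketched below. It assumes two latent layers per resolution, starting at 4 × 4 pixels and doubling at each stage; this assumption reproduces every pair given in the caption. The function name is hypothetical.

```python
def layer_resolution(layer):
    """Spatial resolution (pixels per side) for a 1-based latent layer.

    Assumes two layers per resolution, starting at 4 x 4 and doubling.
    """
    if not 1 <= layer <= 18:
        raise ValueError("layer must be between 1 and 18")
    return 4 * 2 ** ((layer - 1) // 2)


# Print the full mapping for the 18 layers.
for layer in range(1, 19):
    print(layer, layer_resolution(layer))
```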


Fig 7.

Visualization of features for translucency.

(A) The intermediate generative results (tRGB layer output at 64 pixels × 64 pixels resolution) of the images from the high-translucency dataset. The images are resized for display. (B) Middle-layer ICA kernels obtained by training a system of 64 basis functions on 24 pixels × 24 pixels image patches extracted from the images in (A). The kernels are of size 24 × 24. (C) Visualization of the three-dimensional convolution of a real photograph of a translucent soap with the individual ICA kernels in (B). (D) The resulting filtered images of four different soaps with selected chromatic kernels. The mid-to-low spatial frequency chromatic kernels can capture features of translucency such as “chromatic caustics” (row 2, column 1), “glowing edges” (row 1, column 4), and “inner glow” (row 1, columns 1 and 4). The orientation-free kernel in the last row reveals the variation of color over a relatively coarse spatial scale, which is also diagnostic of translucency.
