
Fig 1.

Example results.

Stimulus (top) and reconstructions (bottom) from brain activity in V1, V4 and IT. Face images in this figure are replaced for copyright reasons. The original version of the figure can be accessed here.


Fig 2.

Neural coding.

The transformation between sensory stimuli and brain responses via an intermediate feature space. Neural encoding is factorized into a nonlinear “analysis” and a linear “encoding” mapping. Neural decoding is factorized into a linear “decoding” and a nonlinear “synthesis” mapping.
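
A minimal sketch of this factorization (array names and shapes are illustrative assumptions, not the authors' code): encoding applies a linear map to nonlinear stimulus features, and decoding applies another linear map from responses back to feature space before a nonlinear generator synthesizes the image.

```python
import numpy as np

# Illustrative shapes (assumptions): 100 stimuli, 512-dim feature space, 960 recording sites.
n_stim, n_feat, n_units = 100, 512, 960
rng = np.random.default_rng(0)

features = rng.standard_normal((n_stim, n_feat))     # nonlinear "analysis": stimulus -> feature space
responses = rng.standard_normal((n_stim, n_units))   # recorded brain responses

# Linear "encoding": feature space -> responses (least squares stands in for the fitted linear map).
W_enc, *_ = np.linalg.lstsq(features, responses, rcond=None)
pred_responses = features @ W_enc

# Linear "decoding": responses -> feature space; a nonlinear "synthesis" step (e.g., a generator)
# would then map the decoded features back to an image.
W_dec, *_ = np.linalg.lstsq(responses, features, rcond=None)
decoded_features = responses @ W_dec
```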


Fig 3.

StyleGAN3 generator architecture.

The generator takes a 512-dim. z-latent (entangled or correlated dimensions) as input and maps this to its 512-dim. w-latent (disentangled or decorrelated dimensions) via the MLP, f(), for feature disentanglement. Then, the w-latent is transformed into a 1024 × 1024 px RGB image. Face images in this figure are replaced for copyright reasons. The original version of the figure can be accessed here.
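
A rough PyTorch sketch of the two-stage mapping described above (the modules below are toy stand-ins, not the actual StyleGAN3 implementation): an MLP f maps the z-latent to the w-latent, and a synthesis network maps the w-latent to an RGB image.

```python
import torch
import torch.nn as nn

class MappingMLP(nn.Module):
    """Stand-in for f(): maps the entangled z-latent to the disentangled w-latent."""
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class ToySynthesis(nn.Module):
    """The real synthesis network is a convolutional generator; a single linear layer
    only illustrates the w -> RGB-image mapping and its output shape."""
    def __init__(self, dim=512, res=64):  # 64 px instead of 1024 px to keep the toy cheap
        super().__init__()
        self.res = res
        self.fc = nn.Linear(dim, 3 * res * res)

    def forward(self, w):
        return self.fc(w).view(-1, 3, self.res, self.res)

z = torch.randn(1, 512)    # entangled z-latent
w = MappingMLP()(z)        # disentangled w-latent
img = ToySynthesis()(w)    # RGB image, shape (1, 3, 64, 64)
```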


Fig 4.

Passive fixation task.

The monkey fixated a red dot on a gray background for 300 ms, followed by a fast sequence of four face images (500 × 500 pixels): 200 ms of stimulus presentation and a 200 ms inter-trial interval. The stimuli were shifted slightly to the lower right such that the fovea corresponded with pixel (150, 150). The monkey was rewarded with juice if fixation was maintained for the whole sequence.


Fig 5.

Encoding performance.

The effectiveness of each encoding model is assessed using the Pearson correlation coefficients between predicted and recorded neural responses. For each dataset, the first and second graphs denote discriminative and generative representations, respectively. The correlation distribution of each encoding model shows a robust level of accuracy.
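
For reference, the per-unit evaluation described here amounts to a Pearson correlation between predicted and recorded responses across test stimuli (a generic sketch, not the authors' code):

```python
import numpy as np

def encoding_score(pred, recorded):
    """Pearson r between predicted and recorded responses of one unit (arrays over test stimuli)."""
    return np.corrcoef(pred, recorded)[0, 1]

# Example with illustrative data: 100 test stimuli for one microelectrode unit.
rng = np.random.default_rng(0)
recorded = rng.standard_normal(100)
pred = 0.6 * recorded + 0.8 * rng.standard_normal(100)  # partially correct prediction
print(round(encoding_score(pred, recorded), 2))
```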


Fig 6.

Generative-based encoding performance.

For each individual microelectrode unit, we fit three encoding models based on three distinct feature representations: z-, w- and CLIP-latent representations. As such, we fit 3 × 960 independent encoders, resulting in 3 × 960 predicted neural responses, because there were seven, four and four microelectrode arrays (64 units each) for V1, V4 and IT, respectively (i.e., 7 × 64 = 448 in V1, 4 × 64 = 256 in V4 and 4 × 64 = 256 in IT). The scatterplots display the prediction-target correlation (r) of one encoding model on the x-axis and another encoding model on the y-axis to investigate the relationship between the two. Each dot represents the performance of one modeled microelectrode unit in terms of both encoding models (960 dots per plot in total). Negative correlation values were set to zero. The diagonal represents equal performance between both models. The critical r-value at the Bonferroni-corrected α = 5.21e−5 is r = 0.3895 for faces (df = 100) and r = 0.2807 for natural images (df = 200), and is denoted by the shaded area. w-latents outperform both z- and CLIP-latents, as most dots lie in the direction of the w-axis (above the diagonal). The stars indicate the mean correlation coefficient per region of interest based on the data points outside the shaded area. Face images in this figure are replaced for copyright reasons. The original version of the figure can be accessed here.
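
The critical r-values quoted above can be reproduced approximately from the t-distribution; the sketch below assumes a two-tailed test at the stated degrees of freedom, which is an assumption about the exact convention used.

```python
import numpy as np
from scipy import stats

def critical_r(alpha, df):
    """Smallest |r| that reaches significance for a two-tailed test at the given alpha and df."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / np.sqrt(t_crit**2 + df)

alpha = 0.05 / 960             # Bonferroni correction over 960 units, approx. 5.21e-5
print(critical_r(alpha, 100))  # faces: approx. 0.39
print(critical_r(alpha, 200))  # natural images: approx. 0.28
```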


Fig 7.

w-based encoding performance across visual areas.

The left panel presents the distribution of correlation coefficients for face images using a swarm plot, with mean values indicated for V1 (0.53), V4 (0.52) and IT (0.53). The right panel displays the distribution for natural images, with mean values for V1 (0.40), V4 (0.47) and IT (0.56).


Fig 8.

w-latents explain high-level brain activity.

Three encoding models were fit on early (1; layer 2/16), middle (3; layer 7/16) and deep (5; layer 13/16) feature representations of VGG16 pretrained for face/object recognition. Each microelectrode unit was assigned the representation that led to the highest encoding performance, resulting in a complexity gradient in which low-level and high-level representations are assigned to earlier and more downstream brain areas, respectively (see the leftmost graph for reference). In each of the three plots, one VGG16 representation was replaced by the w-latent representation to see where it falls on the complexity gradient. The results illustrate that w-latents predominantly accounted for neural responses in downstream IT. Face images in this figure are replaced for copyright reasons. The original version of the figure can be accessed here.
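
The assignment rule described here, giving each unit whichever representation yields the best encoding performance, reduces to an argmax over models per unit (an illustrative sketch; the array values are made up):

```python
import numpy as np

# Rows = candidate representations, columns = microelectrode units,
# entries = encoding correlation of that model for that unit.
model_names = ["early", "middle", "deep"]
scores = np.array([
    [0.55, 0.30, 0.10],   # "early" layer's correlations for three units
    [0.20, 0.45, 0.25],   # "middle"
    [0.10, 0.25, 0.60],   # "deep"
])

best_model = scores.argmax(axis=0)            # index of the winning representation per unit
print([model_names[i] for i in best_model])   # ['early', 'middle', 'deep']
```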


Fig 9.

Qualitative reconstruction results: The 100 test set stimuli (top row) and their reconstructions from brain activity in V1, V4 and IT (bottom row) via w-latents. Face images in this figure are replaced for copyright reasons. The original version of the figure can be accessed here.


Fig 10.

Qualitative reconstruction results: The 200 test set stimuli (top row) and their reconstructions from brain activity in V1, V4 and IT (bottom row) via w-latents.


Table 1.

Quantitative results.

The upper and lower blocks display model performance (mean ± std. error) when reconstructing face images and natural images, respectively, in terms of six metrics: perceptual cosine similarity using the five MaxPool layer outputs of VGG16 pretrained for face recognition (face images) or object recognition (natural images), and latent cosine similarity between the w-latents of stimuli and their reconstructions. The rows display decoding performance when using the recordings from all recording sites (i.e., V1, V4 and IT together) or the recordings within a specific brain area.
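
Both kinds of metric reduce to a cosine similarity between flattened feature vectors, computed either on VGG16 MaxPool activations (perceptual) or on w-latents (latent); a generic sketch, not the authors' evaluation code:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened feature vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Latent metric: similarity between the w-latent of a stimulus and of its reconstruction.
w_stim, w_recon = np.random.randn(512), np.random.randn(512)
latent_score = cosine_similarity(w_stim, w_recon)

# Perceptual metric: the same similarity on a MaxPool activation map of VGG16
# (stand-in arrays here); the table averages such scores over the test set per layer.
feat_stim, feat_recon = np.random.randn(512, 14, 14), np.random.randn(512, 14, 14)
perceptual_score = cosine_similarity(feat_stim, feat_recon)
```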


Fig 11.

Time-based decoding.

A For each trial, responses were recorded for 300 ms with stimulus onset at 100 ms. Rather than taking the average response within the original time windows (see the three color-coded windows for V1, V4 and IT), we slid a 100 ms window with a stride of 25 ms over the entire time course, resulting in nine average responses across time. B, D Two stimulus-reconstruction examples evolving over time for faces and natural images, respectively. C, E Decoding performance over time for faces and natural images, respectively. The error bars denote the standard error of the cosine similarities between features of stimuli and reconstructions. Note that V1 performance rises slightly earlier in time than that of the other two visual areas. For faces, IT outperforms V1 and V4 in most instances. For natural images, V1 outperforms V4 and IT for low-level feature similarity, after which V4 and IT rise together and outperform V1 for the more high-level feature similarity metrics. Face images in this figure are replaced for copyright reasons. The original version of the figure can be accessed here.
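
The sliding-window averaging described in panel A works out as follows (a sketch with assumed sampling: 1 ms bins over a 300 ms trial, a 100 ms window and a 25 ms stride, giving nine windows):

```python
import numpy as np

window, stride, trial_len = 100, 25, 300      # in ms; 1 ms bins assumed
responses = np.random.randn(960, trial_len)   # illustrative: units x time bins for one trial

starts = range(0, trial_len - window + 1, stride)   # 0, 25, ..., 200 -> nine windows
windowed = np.stack([responses[:, s:s + window].mean(axis=1) for s in starts], axis=1)
print(windowed.shape)  # (960, 9): nine average responses per unit across time
```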


Fig 12.

Linear operations to latent codes.

Row 1 shows linear operations on two ground-truth w-latents and row 2 on two predicted w-latents from brain activity. The linearly manipulated latents were then fed to the generator for image generation. (A1, A2) Face images; these panels also contain vector arithmetic. (B) As (A1, A2) but for natural images. Face images in this figure are replaced for copyright reasons. The original version of the figure can be accessed here.
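
The linear operations referred to here are interpolation and vector arithmetic in w-space before synthesis; a minimal sketch (the generator call and variable names are placeholders):

```python
import numpy as np

w_a, w_b = np.random.randn(512), np.random.randn(512)  # two w-latents (ground truth or brain-decoded)

# Interpolation: intermediate latents along the line between w_a and w_b.
interpolated = [(1 - t) * w_a + t * w_b for t in np.linspace(0, 1, 5)]

# Vector arithmetic: add the difference between two latents to a third one,
# e.g., transferring whatever attribute that difference captures.
w_c = np.random.randn(512)
w_arithmetic = w_c + (w_b - w_a)

# Each manipulated latent would then be fed to the generator, e.g.:
# image = generator.synthesis(w)   # placeholder call; the actual API depends on the generator used
```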
