High-level visual prediction errors in early visual cortex

doi:10.1371/journal.pbio.3002829

Fig 1.

Paradigm.

(A) A single trial, showing a letter cue (500 ms) followed by an image (500 ms) and a variable ITI (approximately 5,000 ms). The image was expected or unexpected given the preceding letter cue. Participants responded by button press to the images, indicating whether the entity in the image was animate or inanimate. (B) TPM determining the associations between cues and images. Each of the 8 images was associated with one of the 8 letter cues. The expected image was 7 times more likely to appear than any other image given its cue. Numbers in each cell indicate the number of trials per run. The specific cue-image associations were randomized and differed between participants. Moreover, the set of 8 images also varied for different participants. (C) Two cycles of a localizer trial. During the localizer, one image was presented repeatedly (500 ms on, 300 ms off) for 12,000 ms. The identity of the images was not predictable. Participants responded to a high brightness version of the images, which was shown once during each trial for one cycle. ISI, interstimulus interval; ITI, intertrial interval; TPM, transitional probability matrix.
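The 7:1 ratio in panel B fixes the conditional probabilities of the design. A minimal sketch follows, assuming relative weights of 7 and 1 rather than the actual per-run trial counts (which are shown only in the figure), and placing the expected pairings on the diagonal purely for illustration, since the real cue-image associations were randomized per participant.

```python
import numpy as np

N_ITEMS = 8          # 8 letter cues and 8 images (from the caption)
EXPECTED_WEIGHT = 7  # expected image is 7x as likely as any single other image

# Relative weights: 7 for the expected pairing, 1 for every other image.
# The diagonal layout is an illustrative assumption, not the real pairing.
weights = np.ones((N_ITEMS, N_ITEMS))
np.fill_diagonal(weights, EXPECTED_WEIGHT)

# Normalize each row (one row per cue) to get P(image | cue).
tpm = weights / weights.sum(axis=1, keepdims=True)

p_expected = tpm[0, 0]    # 7 / (7 + 7 * 1)
p_unexpected = tpm[0, 1]  # 1 / (7 + 7 * 1)
print(f"P(expected | cue) = {p_expected:.3f}")
print(f"P(each unexpected image | cue) = {p_unexpected:.3f}")
```

Under these assumptions the expected image follows its cue on half of the trials, and the remaining half is spread evenly over the other 7 images.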


Fig 2.

Behavioral facilitation due to prediction.

(A) RTs were faster to expected images than to unexpected images, whether the unexpected image required the same response as the expected stimulus or a different one. (B) Accuracy was high overall, but responses were more accurate to expected images than to unexpected images that required a different response than the expected stimulus. Error bars indicate the 95% within-subject confidence intervals. *** p < 0.001, ** p < 0.01, * p < 0.05. Data and code that support these findings are available at: https://doi.org/10.34973/8e49-2012. RT, reaction time.


Fig 3.

Convergence of DNN and visual cortical representations showing a gradient from low- to high-level features in prediction-free contexts.

(A) RSA of visual responses shows that (feedforward) visual responses during the prediction-free localizer were best explained by a gradient of low-level to high-level visual features going up the ventral visual hierarchy. EVC responses aligned more closely with early DNN layers, indicative of low-level visual feature processing (depicted in cold colors: purple to blue). HVC areas, like the fusiform gyrus, show a greater correlation with late DNN layers, representing high-level visual feature processing (illustrated in warm colors: yellow to red). Analysis was masked to visual cortex and thresholded at z > 3.1 (p < 0.001, uncorrected) of the RSA. (B) ROI masks, depicting voxels included in the anatomically and functionally defined masks. Color indicates the ROI: Blue = V1, Purple = LOC, Orange = HVC. Opacity indicates the proportion of participants whose individual masks included the voxel. For visualization, full opacity corresponds to a proportion of 0.1, with voxel inclusion proportions reaching up to ~0.7. Data and code that support these findings are available at: https://doi.org/10.34973/8e49-2012. DNN, deep neural network; EVC, early visual cortex; HVC, higher visual cortex; ROI, region of interest; RSA, representational similarity analysis; V1, primary visual cortex.
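The RSA logic in panel A can be sketched as follows: build a representational dissimilarity matrix (RDM) for the neural patterns and one for a DNN layer, then correlate their off-diagonal entries. This is an illustrative sketch on synthetic data only; the correlation-distance RDM, the rank-based comparison, and all function names are assumptions rather than the authors' pipeline.

```python
import numpy as np

def rdm(patterns):
    """Correlation-distance RDM: 1 - Pearson r between rows (one row per image)."""
    return 1.0 - np.corrcoef(patterns)

def upper(mat):
    """Off-diagonal upper triangle of a square matrix, as a vector."""
    i, j = np.triu_indices(mat.shape[0], k=1)
    return mat[i, j]

def rank(v):
    """Ordinal ranks (ties broken by position; adequate for continuous data)."""
    return np.argsort(np.argsort(v)).astype(float)

def rsa_score(neural, model):
    """Rank correlation between the upper triangles of two RDMs."""
    a = rank(upper(rdm(neural)))
    b = rank(upper(rdm(model)))
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
n_images = 8
layer_feats = rng.normal(size=(n_images, 50))   # synthetic "DNN layer" activations
# Synthetic "voxel" patterns that share the layer's geometry, plus noise.
voxels = layer_feats + 0.1 * rng.normal(size=(n_images, 50))
unrelated = rng.normal(size=(n_images, 50))     # a model with unrelated geometry

score_matched = rsa_score(voxels, layer_feats)
score_unrelated = rsa_score(voxels, unrelated)
print(f"matched model: {score_matched:.2f}, unrelated model: {score_unrelated:.2f}")
```

Repeating this comparison for every DNN layer and keeping the best-fitting layer per region would give a layer-preference map in the spirit of the one shown in the figure.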


Fig 4.

Analytical approach.

(A) If you expect to see the first guitar on the left, what kind of visual features does visual cortex predict? Low-level visual features, illustrated next to the expected image, concern local oriented edges, spatial frequency, and similar properties. High-level visual features entail more complex visual representations, such as texture-like features [32], core object parts and their relationships, and features commonly shared between instances of an object, irrespective of the specific depiction. Depending on which features are predicted, the 2 “seen” images will result in different prediction error magnitudes. The image of the woman is very different in high-level visual features, but shares some local orientation with the expected guitar, hence resulting primarily in high-level visual surprise. On the other hand, the image of the other guitar is very different in terms of low-level visual features, as it is rotated differently from the expected guitar, but it is still a guitar and thus shares high-level visual features. The key question of the analysis is whether and where in the visual system low-level or high-level surprise results in larger prediction errors. (B) Analysis procedure. The left side shows a single trial with a letter cue predicting a specific image. Below, multiple unexpected images are illustrated, which were presented on other trials. The right side depicts the extraction of low-level and high-level visual feature dissimilarity from layer 2 (low-level) and layer 8 (high-level) of the visual DNN. Dissimilarity of the unexpected seen image relative to the expected stimulus was then added as a parametric modulator in the first-level GLMs of the fMRI analysis. For illustration purposes, the graph at the bottom uses data from one example participant to depict how BOLD responses (example data from V1; ordinate) are modulated as a function of low-level (blue) and high-level (red) visual dissimilarity (abscissa).
A positive slope thus indicates that the more dissimilar a seen image was relative to the expected stimulus the more vigorous the neural response. Dots indicating individual dissimilarities (i.e., distances of seen unexpected images compared to the expected image) are arranged in rows, reflecting the granularity of the available surprise distances for this example participant. In the example data, the image of the unexpected woman would hence result in larger prediction errors compared to the image of the unexpected guitar, because of the larger high-level surprise. Critically, this analysis only uses BOLD data from unexpected image trials. Additional control variables for task relevance (animacy) and word meaning, discussed in more detail later, were also included. We performed this parametric modulation analysis in a voxel-wise fashion across the whole brain. DNN, deep neural network; GLM, general linear model.
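The dissimilarity regressors described in panel B can be sketched as follows. For each unexpected trial, the correlation distance between the DNN features of the seen image and those of the expected image is computed separately for a low-level and a high-level layer. The feature arrays are random placeholders, the trial sequence is hypothetical, and the mean-centering step is a common convention for parametric modulators that the caption does not confirm.

```python
import numpy as np

def correlation_distance(u, v):
    """1 - Pearson correlation between two feature vectors."""
    return 1.0 - np.corrcoef(u, v)[0, 1]

rng = np.random.default_rng(1)
# Placeholder activations for 8 images; real ones would come from the DNN.
features = {
    "layer2": rng.normal(size=(8, 200)),  # stands in for low-level features
    "layer8": rng.normal(size=(8, 100)),  # stands in for high-level features
}

expected_idx = 0                        # image predicted by the cue (hypothetical)
unexpected_trials = [3, 5, 1, 3, 7, 2]  # hypothetical unexpected images, in trial order

modulators = {}
for layer, feats in features.items():
    d = np.array([
        correlation_distance(feats[expected_idx], feats[img])
        for img in unexpected_trials
    ])
    # Mean-center so the modulator is decoupled from the mean response (assumed step).
    modulators[layer] = d - d.mean()

for layer, values in modulators.items():
    print(layer, np.round(values, 3))
```

Because each cue has only 7 possible unexpected images, a participant contributes at most 7 distinct distance values per cue and layer, which is why the dots in the example graph fall into rows.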


Fig 5.

Prediction error magnitude scales with high-level visual feature surprise.

(A) Whole-brain results assessing the modulation of surprise responses as a function of high-level (top row) and low-level (bottom row) visual feature dissimilarity. The top row shows that surprise responses to unexpected images were increased if the image was more distant from the expected image in terms of high-level visual features. Color indicates the beta parameter estimate of the parametric modulation, with red and yellow representing increased responses. Black outlines denote statistically significant clusters (GRF cluster corrected). No significant modulation of sensory responses was observed by low-level visual surprise. (B) ROI analysis zooming in on ROIs in early visual (V1), intermediate (LOC), and HVC (encompassing occipito-temporal sulcus and fusiform cortex). Results mirror those of the whole-brain analysis, with significant modulations of the visual responses by high-level visual surprise (red), but not low-level visual surprise (blue). Error bars indicate the 95% within-subject confidence intervals. Gray dots denote individual subjects. P values are FDR corrected. *** p < 0.001, ** p < 0.01, * p < 0.05, ⁼ BF₁₀ < 1/3. (C) Prediction errors preferentially scale with high-level visual features (layer 8 and layer 7) throughout most of the visual system, including EVC, LOC and HVC. Color indicates the DNN layer with the largest effect (explained variance) on scaling the neural responses to surprising inputs. Cold colors (purple–blue) represent early layers (i.e., low-level visual features), while warm colors (yellow–red) indicate late layers (i.e., high-level visual features). Analysis was masked to visual cortex and thresholded at a liberal z ≥ 1.96 (i.e., p < 0.05, two-sided) to explore the landscape of prediction error modulations across DNN layers. Results strongly contrast with those observed for prediction-free visual responses during the localizer (Fig 3A). 
(D, E) ROI analysis regressing BOLD responses (D) or decoded true class probability (E) onto high-level visual dissimilarity. Results indicate a monotonic relationship between high-level surprise and BOLD responses across all 3 ROIs, as well as decoding performance in V1 and HVC. The chance level for decoding the true class probability is 0.125. For display purposes dissimilarities were ranked and averaged across participants, while regression models were fit per participant on the correlation distances. Data and code that support these findings are available at: https://doi.org/10.34973/8e49-2012. DNN, deep neural network; EVC, early visual cortex; HVC, higher visual cortex; LOC, lateral occipital complex; ROI, region of interest; V1, primary visual cortex.
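Two details of panels D and E can be made concrete: the chance level of 0.125 follows from 8 equally likely image classes, and the regressions were fit per participant before any averaging. The sketch below uses synthetic data with an arbitrary true slope; the participant count, trial count, noise level, and dissimilarity range are all illustrative assumptions.

```python
import numpy as np

N_CLASSES = 8
chance = 1.0 / N_CLASSES  # 0.125, as stated in the caption

rng = np.random.default_rng(2)
n_participants, n_trials = 5, 40  # illustrative sizes only
true_slope = 0.5                  # arbitrary value used to simulate data

slopes = np.empty(n_participants)
for p in range(n_participants):
    # Synthetic high-level dissimilarities and BOLD responses for one participant.
    dissim = rng.uniform(0.2, 1.2, size=n_trials)
    bold = true_slope * dissim + rng.normal(scale=0.1, size=n_trials)
    # Ordinary least-squares line; polyfit returns [slope, intercept] for deg=1.
    slope, intercept = np.polyfit(dissim, bold, deg=1)
    slopes[p] = slope

mean_slope = slopes.mean()
print(f"chance level: {chance}")
print(f"per-participant slopes: {np.round(slopes, 2)}")
print(f"group mean slope: {mean_slope:.2f}")
```

Fitting per participant and then testing the resulting slopes at the group level keeps between-participant differences in overall response amplitude from driving the estimated relationship.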


Fig 6.

Prediction error magnitudes are best explained by high-level visual feature dissimilarity.

(A) Whole-brain contrasts of the high-level visual feature model (layer 8) contrasted against 4 control variables. The top row shows that high-level visual models performed significantly better than low-level visual models (layer 2). Similarly, high-level visual surprise better accounted for prediction error magnitudes than the task-relevant animacy category of the unexpected stimuli (second row) and the semantic, word category surprise model (word2vec; third row). The bottom row shows that high-level visual dissimilarity significantly better explained prediction error magnitudes compared to an untrained but otherwise identical DNN layer 8. (B) ROI analysis including primary (V1), intermediate (LOC), and high-level visual cortex (HVC). Results confirm the whole-brain results, showing significant modulations of BOLD responses by high-level visual surprise (red) compared to low-level visual (blue), response category (green), and word category surprise (purple). Error bars indicate the 95% within-subject confidence intervals. Gray dots denote individual subjects. P values are FDR corrected. *** p < 0.001, ** p < 0.01, * p < 0.05. Data and code that support these findings are available at: https://doi.org/10.34973/8e49-2012. DNN, deep neural network; HVC, higher visual cortex; LOC, lateral occipital complex; ROI, region of interest; V1, primary visual cortex.
