
Fig 1.

Experimental design.

(A) The stimulus set includes 2 domains, animals and scenes, each comprising 6 identity conditions (4 images per condition). Due to copyright restrictions, the images shown here are royalty-free examples downloaded from https://unsplash.com/, chosen based on the same criteria used to select the original stimuli. The pictures of animals were carefully selected so that background information could not be informative for object identification (e.g., the polar bear and the gorilla have very similar neutral backgrounds). To control for shape, we further divided the animal categories into 3 subsets along the animacy continuum (2 mammals, 2 birds, and 2 small rounded animals). Within each subset, animals are matched for body shape (e.g., gorilla and polar bear), but each animal is paired with a different scene. As an example, the passerine bird and the seagull have similar body shapes but are associated with two different backgrounds. As for the pictures of scenes, 3 of the backgrounds are characterized by rich navigational properties, with no object in focus in the middle of the image: seashores, ice landscapes, and jungle forests. The other 3 backgrounds are object-like scenes with little navigational layout information: anemones, leaves, and tree branches. Concurrently, animal and scene conditions were selected based on their frequent co-occurrence in real-world images: polar bears live in ice landscapes and gorillas live in jungle forests, thus allowing the creation of 6 specific object-scene contextual pairs. (B) The ROIs included in the brain RSA analysis comprised visual areas (for their relevance in object recognition) and frontoparietal areas (for their relevance in goal-directed behavior): BA17, posterior ventral-temporal cortex (VTC), anterior VTC, lateral VTC, occipital place area (OPA), parahippocampal place area (PPA), retrosplenial cortex (RSC), intraparietal sulcus (IPS), and dorsal prefrontal cortex (DPFC). See Methods for details on the localization procedure. (C) Four models were tested: GIST, condition, domain, and co-occurrence.
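
For illustration only, a minimal sketch of how the three categorical model RDMs (condition, domain, co-occurrence) could be built as binary dissimilarity matrices from the stimulus labels; the label layout and the contextual-pair index below are assumptions made for this example, not the authors' code, and the GIST model (computed from low-level image descriptors) is not shown.

```python
import numpy as np

# Hypothetical label layout: 12 conditions (6 animals, 6 scenes), 4 images each = 48 stimuli.
conditions = np.repeat(np.arange(12), 4)         # condition model labels
domains = np.repeat([0] * 6 + [1] * 6, 4)        # 0 = animal, 1 = scene
pairs = np.repeat(np.tile(np.arange(6), 2), 4)   # assumed contextual pair index (e.g., polar bear <-> ice)

def binary_rdm(labels):
    """Model RDM: dissimilarity 0 if two stimuli share the label, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

condition_rdm = binary_rdm(conditions)      # each identity condition vs the rest
domain_rdm = binary_rdm(domains)            # animals vs scenes
cooccurrence_rdm = binary_rdm(pairs)        # contextually paired animal-scene stimuli treated as similar
```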


Fig 2.

The representational hierarchy for separation and interaction of objects and scenes in the brain.

The ROI-based (A, B) and whole-brain (C) RSA results for the 4 models (GIST, condition, domain, co-occurrence) are shown for the brain data. Results reveal a strong separation of domain (scene and animal) representations in most ventral regions. The effect of animal-scene co-occurrence emerges in frontoparietal areas. (A) For group-averaged results, filled bars indicate significant values against baseline (p < 0.001) computed with permutation tests (10,000 randomizations of stimulus labels). (B) For individual-subject results, reliability boundaries (in gray) indicate the highest expected correlation given signal noise (see Methods), and error bars indicate SEM. Filled bars indicate significant values against baseline (p < 0.005, corrected for the number of ROIs) calculated with pairwise t-tests across subjects (n = 19). For each ROI, the neural dissimilarity matrix (1 − r) is shown below. (C) The random-effects whole-brain RSA results corrected with Threshold-Free Cluster Enhancement [TFCE; 37] are displayed separately for each individual model against baseline [BrainNet Viewer; 38]. Note that for some of these maps (e.g., co-occurrence vs domain), the direct contrast did not reveal a significant difference.
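
A hedged sketch of the ROI-based RSA step described above, assuming condition-wise activation patterns per ROI: the neural RDM is taken as 1 − r between condition patterns, its lower triangle is correlated with a model RDM, and significance is assessed with a stimulus-label permutation test. The array shapes, function names, and choice of Spearman correlation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import spearmanr

def neural_rdm(patterns):
    """Neural dissimilarity matrix (1 - Pearson r) between condition-wise activation patterns.

    patterns: (n_conditions, n_voxels) array for one ROI.
    """
    return 1 - np.corrcoef(patterns)

def rsa_corr(rdm_a, rdm_b):
    """Spearman correlation between the lower triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return spearmanr(rdm_a[iu], rdm_b[iu]).correlation

def label_permutation_test(brain_rdm, model_rdm, n_perm=10_000, seed=0):
    """Null distribution built by randomizing stimulus labels (rows/columns of the model RDM)."""
    rng = np.random.default_rng(seed)
    observed = rsa_corr(brain_rdm, model_rdm)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(model_rdm.shape[0])
        null[i] = rsa_corr(brain_rdm, model_rdm[np.ix_(perm, perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```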


Fig 3.

Similar representational hierarchy in the brain and DCNNs.

(A) The DCNN RSA results for the 4 models (GIST, condition, domain, co-occurrence) are shown for 4 DCNNs (AlexNet, VGG16, GoogLeNet, ResNet-50). The network's depth is shown on the x axis. For each graph and each model, color-coded lines indicate significant effects relative to all remaining models (p < 0.001) computed with pairwise permutation tests (10,000 randomizations of stimulus labels). For each DCNN, the representational dissimilarity matrix (1 − r) is shown for the last fully connected layer. (B) Correlation matrices show second-order relationships among representational patterns in the brain's ROIs and individual DCNN layers. Color-coded line boxes highlight the ROIs and DCNN layers where each model reaches significance. For brain areas, significance for each model is shown relative to baseline (p < 0.0001), calculated with permutation tests (10,000 randomizations of stimulus labels). The order in which ROIs are shown does not imply a strict correspondence with the computational hierarchy in the brain. For DCNN layers, significance for each model is shown relative to all remaining models (p < 0.001), calculated with permutation tests (10,000 randomizations of stimulus labels). Both systems show similar transformations of the representational space: early on, the object space reflects low-level visual image properties (GIST model, color-coded yellow); it then shifts towards the animal-scene domain division (domain model, color-coded light blue); and finally it reveals animal-scene co-occurrence effects (co-occurrence model, color-coded purple).
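
As a hedged illustration of the DCNN side of this analysis (not the authors' pipeline): activations for the stimulus images can be read out from a chosen layer, turned into a 1 − r dissimilarity matrix, and related to a brain ROI's RDM via a second-order correlation over the lower triangles. The torchvision node name, preprocessing, and placeholder arrays below are assumptions for the sketch.

```python
import numpy as np
import torch
from scipy.stats import spearmanr
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

# Placeholder batch standing in for the 48 preprocessed stimulus images (3 x 224 x 224).
stimuli = torch.rand(48, 3, 224, 224)

# ImageNet-pretrained AlexNet; "classifier.6" is the last fully connected layer.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
extractor = create_feature_extractor(alexnet, return_nodes={"classifier.6": "last_fc"})

with torch.no_grad():
    activations = extractor(stimuli)["last_fc"].numpy()   # (48, n_units)

layer_rdm = 1 - np.corrcoef(activations)                  # 1 - r dissimilarity matrix for this layer

# Second-order relationship with a brain ROI RDM (random placeholder here),
# computed over the lower triangle as in the ROI-based analysis.
roi_rdm = np.random.rand(48, 48)
iu = np.triu_indices_from(layer_rdm, k=1)
second_order_r = spearmanr(layer_rdm[iu], roi_rdm[iu]).correlation
```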


Fig 4.

DCNNs acquire human-like conceptual biases.

(A) The correlational plot shows the degree of similarity between behavioral judgments and each of the 4 DCNN architectures. (B) MDS spaces (metric stress) show consistent object-scene clusters in human behavioral judgments as well as in DCNNs (last fully connected layer). (C) The object-scene cluster analysis (right) for behavioral and DCNN (last layer) data shows a consistent, significant effect of congruency (lower distance) for object-scene stimulus pairs. Dashed bars show the averaged data for the 6 stimulus pairs.
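
A minimal sketch of a metric-stress MDS embedding computed from a precomputed dissimilarity matrix, as underlies the MDS spaces in panel B; the use of scikit-learn and the random placeholder RDM are assumptions for illustration, not the authors' toolchain.

```python
import numpy as np
from sklearn.manifold import MDS

# Placeholder symmetric dissimilarity matrix over the 12 conditions
# (in the analysis it would come from behavioral judgments or a DCNN layer).
rng = np.random.default_rng(0)
d = rng.random((12, 12))
rdm = (d + d.T) / 2
np.fill_diagonal(rdm, 0)

# Metric (metric-stress) MDS on precomputed dissimilarities, embedded in 2D for plotting.
mds = MDS(n_components=2, metric=True, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(rdm)   # (12, 2) coordinates underlying an MDS plot
```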


Fig 5.

The effect of increasing levels of object-scene co-occurrence in DCNN object space.

We trained a DCNN multiple times (N = 5) under 4 training conditions with increasing levels of object-background regularity, from 0% to 100%. (A) Accuracy on the validation set. (B) The RSA results for the 4 models (GIST, condition, domain, co-occurrence) are shown for AlexNet's fc7 layer after training regimes with increasing co-occurrence (0%, 58%, 87%, 100%) between objects and backgrounds (n = 6). Error bars indicate SEM. In the dissimilarity matrices, orange represents animal conditions and red represents background conditions.
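
Purely as an illustration of what increasing object-background regularity could mean at the level of stimulus pairing (the actual image-generation pipeline is described in the Methods, and the category names below are placeholders): each animal category has one assumed congruent background, and at co-occurrence level p an exemplar keeps that background with probability p, otherwise receiving one of the other backgrounds.

```python
import random

# Illustrative placeholders: 6 animal categories, each with one assumed congruent background
# (e.g., polar bear <-> ice landscape in the actual stimulus set).
ANIMALS = [f"animal_{i}" for i in range(6)]
CONGRUENT_BG = {animal: f"scene_{i}" for i, animal in enumerate(ANIMALS)}

def sample_background(animal, p, rng=random):
    """With probability p return the congruent background, otherwise one of the other five."""
    if rng.random() < p:
        return CONGRUENT_BG[animal]
    others = [bg for a, bg in CONGRUENT_BG.items() if a != animal]
    return rng.choice(others)

# The 4 training regimes with increasing object-background regularity.
for p in (0.0, 0.58, 0.87, 1.0):
    training_pairs = [(a, sample_background(a, p)) for a in ANIMALS for _ in range(100)]
```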


Fig 6.

Multiple domain-specific object spaces.

The RSA results for the 4 models (GIST, condition, animacy continuum, navigational layout) are shown for group-averaged brain data (left) and DCNNs (right). For the neural data, only ROIs where the domain model reached significance were included (see Fig 2A). The same DCNN architecture (GoogLeNet) was trained either on object recognition (ImageNet) or on scene recognition (Places365). For comparison with the first RSA analysis (see Fig 3A), gray shaded areas indicate the network layers in which the domain model significantly outperformed the remaining models. Color-coded lines on top of the bars/graphs indicate the network layers/ROIs where each model significantly outperformed the remaining models (p < 0.001), computed with pairwise permutation tests (10,000 randomizations of stimulus labels).
