Fig 1.
Training generic fitting models on embodied visual experiences.
(a) First-person visual experiences collected from human adults wearing head-mounted cameras. The videos were recorded in natural environments containing large numbers of objects, people, and places. (b-e) We used representational dissimilarity matrices (RDMs) to evaluate whether generic fitting models have color-based versus shape-based representational spaces. (b) Color scores were computed by averaging across RDM cells where objects matched in color (filled cells in Panel b). (c) Shape scores were computed by averaging across RDM cells where objects matched in shape (filled cells in Panel c). (d) Untrained fitting models had color-based representational spaces, as shown in the RDMs (left) and color/shape scores (right). (e) Trained fitting models developed shape-based representational spaces. The one exception was the smallest (1H) model, which failed to develop shape perception. (f) t-SNE visualizations showed that untrained generic fitting models group objects by color, whereas (g) trained generic fitting models group objects by shape. The images in the t-SNE visualizations correspond to frames captured as the object rotated in the simulation. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
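For concreteness, the sketch below shows one way the RDM-based color and shape scores of Fig 1b, c could be computed from model embeddings. The function and variable names (`compute_rdm`, `match_score`, `color_labels`, `shape_labels`) are illustrative rather than the code used in the experiments, and whether similarity or dissimilarity is averaged over the matched cells is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def compute_rdm(embeddings):
    """Representational dissimilarity matrix (1 - Pearson correlation)
    over model embeddings, one row per stimulus image."""
    return squareform(pdist(embeddings, metric="correlation"))

def match_score(rdm, labels):
    """Mean similarity (1 - dissimilarity) over off-diagonal RDM cells
    whose row and column stimuli share the same label."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)            # ignore self-comparisons
    return (1.0 - rdm[same]).mean()

# embeddings: (n_images, n_features) activations for the test stimuli
# color_labels / shape_labels: one label per test image (illustrative names)
# rdm = compute_rdm(embeddings)
# color_score = match_score(rdm, color_labels)
# shape_score = match_score(rdm, shape_labels)
```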
Fig 2.
(a) To test whether our results generalize to realistic objects, we evaluated the models on objects with more complex features. (b) Untrained transformers had color-based representational spaces, as shown in the RDMs (left) and color/shape scores (right). (c) Trained transformers developed shape perception, grouping objects by shape rather than color. The exception was the smallest (1H) model, which developed only partial shape perception. (d) t-SNE visualizations showed that untrained models group objects based on color, whereas (e) trained models group objects based on shape. These models were trained on the human adult data from Experiment 1. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
Fig 3.
Comparing learning across newborn animals and generic fitting models.
(a) Wood [11] showed that newborn chicks develop shape perception when reared with a single object. During the training phase, the chick’s environment contained a single object. During the test phase, the chambers measured whether the chicks had developed shape perception. (b) To simulate the chick study, we created virtual animal chambers and simulated the first-person visual experiences available to the chicks during the training phase. As in the chick study, the virtual chambers presented one object that rotated continuously, revealing a different color and shape on each of its two sides. (c) We trained generic fitting models (transformers) using the simulated visual experiences from the virtual chamber. (d) To test the models, we simulated the visual experiences available to the chicks during the test phase. (e) Visualization of the representational spaces of untrained versus trained fitting models. Untrained fitting models (ViT-CoT 6H shown here) organized objects by color, whereas models trained on the visual experiences of newborn chicks organized objects by shape. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
Fig 4.
Controlled-rearing experiments on generic fitting models.
(a) In the ‘dense exploration’ condition, the agent moved freely around the virtual controlled-rearing chamber, using head movements to densely sample the visual experiences available in the chamber. (b) In the ‘shuffled images’ condition, we randomized the order of the simulated training images to test the importance of temporal continuity for the development of shape-based vision. (c) Untrained generic fitting models had color-based representational spaces, as shown in the RDMs (top) and color/shape scores (bottom). (d) Generic fitting models trained in the dense exploration condition developed robust shape perception. The one exception was the smallest (1H) model, which failed to develop shape perception. (e) In contrast, generic fitting models trained in the shuffled images condition failed to develop shape perception, highlighting the importance of temporal continuity for shape learning. For all RDMs, the images used to make the RDMs were the same as those used in Fig 1. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
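A minimal sketch of how the shuffled images control could be constructed, assuming the simulated visual experience is stored as an ordered list of frames and that learning operates on temporally adjacent frame pairs; the function name and pairing scheme are illustrative, not the exact training pipeline.

```python
import random

def make_training_pairs(frames, shuffle_order=False, seed=0):
    """Build (frame_t, frame_t+1) training pairs from one episode.

    shuffle_order=False -> dense-exploration style: pairs preserve the
    temporal continuity of the simulated first-person video.
    shuffle_order=True  -> 'shuffled images' control: the frame order is
    randomized first, so adjacent pairs are no longer temporally related.
    """
    frames = list(frames)
    if shuffle_order:
        random.Random(seed).shuffle(frames)
    return list(zip(frames[:-1], frames[1:]))
```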
Fig 5.
Ablating head movements and transitional views.
(a) In the ‘no head movements’ condition, the agent moved freely around the virtual chamber, but rather than performing head movements at each location (as in the dense exploration condition, left), the agent stared directly at the object (right). (b) In the ‘no transitional views’ condition, the chamber was divided into 49 equally spaced locations, and the agent teleported to each location. The colored circles and associated images show sample views obtained from five locations. Unlike in the dense exploration condition, the agent did not acquire transitional views while moving from place to place. (c) Generic fitting models trained in the no head movements condition developed partial sensitivity to shape, as shown in the RDMs (top) and color/shape scores (bottom). This differs from models trained in the dense exploration condition, which developed robust shape perception (Fig 4d). (d) Generic fitting models trained in the no transitional views condition also developed only partial sensitivity to shape, again unlike models trained in the dense exploration condition (Fig 4d). In both conditions, the models partially reduced their weighting of color features (compared with untrained models, Fig 4c). For all RDMs, the images used to make the RDMs were the same as those used in Fig 1. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
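A minimal sketch of how the 49 equally spaced teleport locations in the no transitional views condition might be laid out, assuming a 7 x 7 grid over the chamber floor; the chamber dimensions are placeholders, not the simulation's actual values.

```python
import numpy as np

def teleport_locations(width=1.0, depth=1.0, n_per_side=7):
    """Return 7 x 7 = 49 equally spaced (x, z) floor locations. In the
    'no transitional views' condition the agent jumps between these points
    instead of walking, so no in-between views are rendered."""
    xs = np.linspace(0.0, width, n_per_side)
    zs = np.linspace(0.0, depth, n_per_side)
    grid_x, grid_z = np.meshgrid(xs, zs)
    return np.stack([grid_x.ravel(), grid_z.ravel()], axis=1)  # shape (49, 2)
```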
Fig 6.
Ablating lateral and depth transitions between views.
(a) In the ‘no side-to-side transitions’ condition, the agent teleported to locations within the horizontal red stripe. Because the agent was limited to the red stripe, it could not collect views showing side-to-side transitions of the object. (b) In the ‘no depth transitions’ condition, the agent teleported to locations within the vertical red stripe. Because the agent was limited to the red stripe, it could not collect views showing depth transitions of the object. (c) Generic fitting models trained in the no side-to-side transitions condition developed partial sensitivity to shape, as shown in the RDMs (top) and color/shape scores (bottom). This differs from models trained in the dense exploration condition, which developed robust shape perception (Fig 4d). (d) Generic fitting models trained in the no depth transitions condition also developed only partial sensitivity to shape, again unlike models trained in the dense exploration condition (Fig 4d). The one exception was the 3H model, which developed shape perception. In both conditions, the models partially reduced their weighting of color features (compared with untrained models, Fig 4c). For all RDMs, the images used to make the RDMs were the same as those used in Fig 1. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
Fig 7.
Hardcoding spatial priors into fitting models.
Unlike Vision Transformers (ViTs), convolutional neural networks (CNNs) have hardcoded spatial priors. To measure the impact of hardcoding spatial priors into fitting models, we compared ViTs and CNNs across all of the controlled-rearing conditions. The ViTs and CNNs had the same temporal learning objective, so the models differed only in architecture. Unlike the ViTs, which required diverse views to learn shape perception (Figs 4–6), the CNNs learned shape perception in nearly all of the controlled-rearing conditions. The one exception was when the training views were shuffled, which prevented shape learning in both CNNs and ViTs. The same pattern emerged with both imprinting objects (top and bottom rows). Thus, hardcoding spatial priors into fitting models reduces their reliance on view diversity for learning shape perception. For all RDMs, the images used to make the RDMs were the same as those used in Fig 1. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
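The legend notes that the ViTs and CNNs shared the same temporal learning objective. As an illustration, the sketch below shows a contrastive-through-time objective that treats temporally adjacent frames as positives and can be applied to either backbone unchanged; the exact objective used in the experiments is not specified in this legend, so the InfoNCE formulation and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(encoder, frames_t, frames_t1, temperature=0.1):
    """InfoNCE-style loss over a batch: the embedding of frame t should be
    closest to the embedding of the temporally adjacent frame t+1 from the
    same episode and far from frames drawn from other time points.
    `encoder` may be a ViT or a CNN; the objective is architecture-agnostic."""
    z_t = F.normalize(encoder(frames_t), dim=1)    # (B, D)
    z_t1 = F.normalize(encoder(frames_t1), dim=1)  # (B, D)
    logits = z_t @ z_t1.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(z_t.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```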
Fig 8.
Training generic fitting models with different artificial image augmentations.
(a) Generic fitting models trained with no artificial image augmentations developed color-based representational spaces, as shown in the RDMs (top) and color/shape scores (bottom). (b-d) Likewise, generic fitting models trained with Gaussian blur, horizontal flip, or random cropping developed color-based representational spaces. (e-f) Conversely, generic fitting models trained with color jitter or grayscale developed shape-based representational spaces. For all RDMs, the images used to make the RDMs were the same as those used in Fig 1. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
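A sketch of the six augmentation conditions in Fig 8, expressed as torchvision transforms; the specific parameter values (blur kernel, crop size, jitter strengths) are placeholders rather than the settings used in the experiments.

```python
from torchvision import transforms

# One transform per training condition in Fig 8; parameter values are
# illustrative placeholders, not the experiments' settings.
augmentations = {
    "none":            transforms.Compose([]),
    "gaussian_blur":   transforms.GaussianBlur(kernel_size=9),
    "horizontal_flip": transforms.RandomHorizontalFlip(p=0.5),
    "random_crop":     transforms.RandomResizedCrop(size=64),
    "color_jitter":    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                              saturation=0.4, hue=0.1),
    "grayscale":       transforms.RandomGrayscale(p=1.0),
}
```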
Fig 9.
Filtering raw visual experiences through artificial retinas.
(a) We created six artificial retinas, each with a different-sized foveal region. The foveal region had color filters, mimicking the high density of cone (color) receptors in the fovea of biological retinas. The periphery had grayscale filters, mimicking the high density of rod cells in the retinal periphery. The filter with a foveal radius of 7.5 units is the closest match to the human eye. (b) Color and shape sensitivity of generic fitting models trained on human visual experiences filtered through artificial retinas. All models except the smallest model (1H) developed shape perception when trained using artificial retinas with small foveal regions (akin to human eyes). Models trained using artificial retinas with larger foveal regions failed to develop shape perception. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
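A minimal sketch of the foveal color / peripheral grayscale filtering described in Fig 9a, assuming an RGB image array and a circular fovea centered on the image; the mapping from foveal "units" to pixels and the luminance weights are assumptions.

```python
import numpy as np

def retina_filter(image, fovea_radius, center=None):
    """Keep full color inside a circular foveal region (cone-dominated)
    and convert the periphery to grayscale (rod-dominated).

    image: (H, W, 3) RGB array in [0, 1]; fovea_radius in pixels (the
    conversion from the paper's units to pixels is an assumption).
    """
    h, w, _ = image.shape
    cy, cx = center if center is not None else (h / 2, w / 2)
    ys, xs = np.mgrid[0:h, 0:w]
    in_fovea = (ys - cy) ** 2 + (xs - cx) ** 2 <= fovea_radius ** 2
    gray = image @ np.array([0.299, 0.587, 0.114])      # luminance
    periphery = np.repeat(gray[..., None], 3, axis=2)
    return np.where(in_fovea[..., None], image, periphery)
```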
Fig 10.
(a) Realistic artificial retinas convert RGB images into retinal-formatted images, akin to the transformations performed by biological retinas. For this experiment, the retinas used fovea sizes of 15 and 30 units; 99% of cones in the fovea and 1% in the periphery; 1% of rods in the fovea and 99% in the periphery; larger receptive fields in the periphery than in the fovea; visual crowding in the periphery; and a dynamic fovea that moved to salient regions across successive images. (b) We tested the untrained and trained models on the three stimulus sets. (c) Untrained models had color-based representational spaces, whereas (d-e) trained models developed forms of shape perception. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.
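For reference, the retina parameters listed in Fig 10a are collected below into a single configuration sketch; the field names are illustrative, not the authors' API.

```python
# Parameters of the realistic artificial retina described in Fig 10a,
# gathered into one config; field names are illustrative placeholders.
realistic_retina_config = {
    "fovea_radii": [15, 30],            # the two fovea sizes tested
    "cone_fraction": {"fovea": 0.99, "periphery": 0.01},
    "rod_fraction":  {"fovea": 0.01, "periphery": 0.99},
    "receptive_fields": "larger in periphery than in fovea",
    "peripheral_crowding": True,        # visual crowding in the periphery
    "dynamic_fovea": True,              # fovea shifts to salient regions
}
```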
Table 1.
Hyperparameter details of the models.
Table 2.
Results of paired t-tests comparing shape and color scores within each model architecture and training condition.
Table 3.
Results of Welch’s unpaired two-sample t-tests comparing shape and color scores between trained and untrained models.
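A sketch of the statistical comparisons reported in Tables 2 and 3 using SciPy; the score arrays are placeholders rather than the paper's data, and the pairing scheme for the within-model test is an assumption.

```python
import numpy as np
from scipy import stats

# Illustrative score arrays (placeholders, not the paper's data): one value
# per RDM cell (or per model seed) for a single architecture and condition.
shape_scores = np.array([0.62, 0.58, 0.65, 0.60])
color_scores = np.array([0.31, 0.35, 0.29, 0.33])

# Table 2: paired t-test comparing shape vs. color scores within a model.
t_within, p_within = stats.ttest_rel(shape_scores, color_scores)

# Table 3: Welch's two-sample t-test comparing scores from trained vs.
# untrained models (equal_var=False applies the Welch correction).
untrained_shape_scores = np.array([0.30, 0.28, 0.34, 0.27])
t_between, p_between = stats.ttest_ind(shape_scores, untrained_shape_scores,
                                       equal_var=False)
```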