Fig 1.
(1) Run controlled-rearing experiments testing how view-invariant object recognition develops in newborn chicks. (2) Simulate the visual experiences available to the chicks during the training phase. (3) Train self-supervised DNNs with the simulated images, then freeze the DNN weights to prevent further learning. (4) Simulate the visual experiences available to the chicks during the test phase. (5) Evaluate the DNN’s view-invariant object recognition performance using a linear classifier, which is trained and tested with the simulated test images in a cross-validated design. (6) Compare the view-invariant recognition performance of the chicks and DNNs.
Fig 2.
(A) The four train/test conditions presented to the chicks. (B) The schematic shows how the virtual objects were presented for sample 4-hr periods. During the training phase, a single virtual object appeared on one display wall at a time (indicated by blue segments on the timeline), switching walls every 2 hr after a 1-min period of darkness (black segments). The object rotated back and forth through a limited 60° viewpoint range. (C) During the test phase, two virtual objects (one imprinted, the other novel) were shown simultaneously, one on each display wall, for 20 min per hour (orange segments). The illustrations below the timeline are examples of paired test objects displayed in four of the test trials. The test objects rotated through a 60° viewpoint range. Each test trial was followed by a 40-min rest period (blue segments). During the rest periods, the imprinting stimulus from the training phase was shown on one display wall, and the other display wall was blank. The illustrations show the displays seen by chicks that were imprinted to Object A: Front View (see Panel A).
Fig 3.
(A) View-invariant recognition performance of the five self-supervised CNN models across the four rearing conditions. The red horizontal line shows the chicks’ performance, with the ribbon representing standard error. (B) Linear classifier training. We evaluated the accessibility of the view-invariant features by training/testing linear classifiers on different numbers of viewpoint ranges. We used a cross-validated design, with different viewpoint ranges in the training (blue) versus test (orange) image sets. Thus, all results reflect the generalization performance of CNNs across novel views. (C) The models successfully recognized the object across novel views, even when the linear classifiers were trained on a single viewpoint range. Error bars represent standard error of model performances across validation folds.
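The cross-validated linear readout described in panel B can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the features are synthetic stand-ins for frozen CNN outputs, the classifier is a ridge-regularised least-squares readout chosen for brevity, and the names `make_toy_features`, `fit_linear_readout`, and `cross_view_accuracy` are hypothetical.

```python
import numpy as np

def make_toy_features(n_views=12, n_per=20, dim=16, seed=0):
    """Synthetic stand-in for frozen-encoder features (assumption, not real data)."""
    rng = np.random.default_rng(seed)
    obj_dir = rng.normal(size=dim)                 # direction separating the two objects
    X, y, v = [], [], []
    for view in range(n_views):
        view_offset = 0.3 * rng.normal(size=dim)   # viewpoint-specific shift
        for obj in (0, 1):
            centre = (2 * obj - 1) * obj_dir + view_offset
            X.append(centre + 0.1 * rng.normal(size=(n_per, dim)))
            y.append(np.full(n_per, obj))
            v.append(np.full(n_per, view))
    return np.vstack(X), np.concatenate(y), np.concatenate(v)

def fit_linear_readout(X, y, ridge=1.0):
    """Ridge least-squares linear classifier; labels 0/1 are mapped to -1/+1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    targets = 2.0 * y - 1.0
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ targets)

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)

def cross_view_accuracy(X, y, view_ids, n_train_views=1, n_folds=5, seed=0):
    """Train on n_train_views viewpoint ranges, test on all held-out ranges."""
    rng = np.random.default_rng(seed)
    views = np.unique(view_ids)
    accs = []
    for _ in range(n_folds):
        train_views = rng.choice(views, size=n_train_views, replace=False)
        train = np.isin(view_ids, train_views)
        w = fit_linear_readout(X[train], y[train])
        accs.append(np.mean(predict(w, X[~train]) == y[~train]))
    accs = np.array(accs)
    sem = accs.std(ddof=1) / np.sqrt(len(accs))    # standard error across folds
    return accs.mean(), sem

X, y, view_ids = make_toy_features()
mean_acc, sem_acc = cross_view_accuracy(X, y, view_ids, n_train_views=1)
print(f"held-out-view accuracy: {mean_acc:.3f} +/- {sem_acc:.3f}")
```

Because the test images always come from viewpoint ranges the classifier never saw, the reported accuracy reflects generalization across novel views rather than memorisation of the training views.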
Fig 4.
(A) Architecture of the GreedyInfoMax (GIM) model. The CNN is divided into separate gradient-isolated modules, each with its own contrastive loss function. A gradient blocker between modules stops the backward flow of gradients, preventing end-to-end backpropagation across modules. The total loss is the sum of the individual module losses. (B) View-invariant recognition performance of newborn chicks and different GIM architecture sizes, across the four rearing conditions presented to the chicks. The red horizontal line shows the chicks’ performance. (C) Comparison of untrained versus trained GIM models across the three architecture sizes. All GIM models showed large learning gains, indicating that CNNs without backpropagation can learn view-invariant features in the impoverished environments faced by newborn chicks. Error bars represent standard error of model performances across validation folds.
Fig 5.
(A) We increased the number of hardcoded spatial operations by adding more layers to the CNN architecture. To create different architecture sizes, we systematically added and removed residual blocks and bridge connections between blocks from the original ResNet architecture from Experiment 1. (B) Performance of the untrained CNNs (grey bars) increased with the number of layers, indicating that more hardcoded spatial operations improved performance in untrained models. The colored bars show the learning gains that emerged in SimCLR (blue bars) and autoencoders (pink bars). Once the models were trained, performance decreased as a function of architecture size. Learning allowed smaller CNNs to achieve similar (or better) fits to the environment than larger CNNs, despite the smaller CNNs starting with weaker hardcoded spatial knowledge. Error bars represent standard error of model performances across validation folds. The red line shows the chicks’ performance, with the ribbon representing standard error. (C) To test whether learning plateaued in the models, we varied the number of images used to train the CNNs. Most models achieved similar performance when trained on 5,000 to 80,000 images, and a few algorithms (SimCLR, BYOL) showed modest performance gains with more training.
Fig 6.
Measuring the impact of dense sampling of the visual environment on view-invariant recognition performance.
(A) We trained the best performing CNN from the prior experiments (10-layer SimCLR) on datasets containing different numbers of images sampled from the virtual chamber. (B) Recognition performance improved systematically when the SimCLR model was trained on larger numbers of unique images. Denser sampling of the visual environment produced more accurate view-invariant object features. The red line shows the chicks’ performance, with the ribbon representing standard error. (C) The autoencoder algorithm did not achieve better performance when trained on larger numbers of images. This indicates that some learning algorithms (e.g., SimCLR) can leverage the unique views present in embodied data streams to build up accurate view-invariant features. (D) Two-dimensional projections of the feature representations from the untrained and trained CNNs. Each point represents a CNN representation of an input image containing a single object. Colors denote the identities and viewpoint ranges of the objects; warm colors (red-yellow) represent Object 1 and cold colors (green-purple) represent Object 2. Denser sampling of the visual environment led to more clustered representations in the embedding space.
Fig 7.
Visualizing the representations learned by models.
t-SNE embeddings of the representations in the last layer of a CNN (10 layers, SimCLR algorithm, 80K training images). (A) For this visualization, the agent was stationary in front of the monitor while viewing Object A (red dots) or Object B (blue dots) from different test viewpoints. The CNN learned a structured feature space for representing both object identity and viewpoint. The images on the right correspond to the colored dots on the left. (B) For this visualization, the agent started at the front of the chamber (facing the object on Monitor M1), then moved straight backwards. The colored dots (left) denote the distance of the agent from the object. The images on the right correspond to the colored dots. The CNN learned a structured feature space for representing object distance and position in the chamber.
Fig 8.
(A) Contrastive Learning Through Time (CLTT) model. Each image is passed through a ResNet backbone, preserving the temporal order of images. Encoded features are aligned in the feature space using a temporal learning window of 3 frames. This window mimics the spike-timing-dependent plasticity learning window of biological visual systems (~300 ms). (B) View-invariant recognition performance of newborn chicks and SimCLR-CLTT models. We evaluated two architecture sizes (4-layer and 10-layer), across the four rearing conditions presented to the chicks. The red horizontal line shows the chicks’ performance. CNNs showed substantial learning gains over untrained CNN performance (untrained 4-layer CNN performance = 52.5%; untrained 10-layer CNN performance = 60.1%). CNNs can leverage time as a teaching signal to learn in impoverished environments. Error bars represent standard error of model performances across validation folds.
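The idea of using a temporal window to define positive pairs, as in panel A, can be illustrated with a small NumPy sketch. It assumes an InfoNCE-style loss in which frames closer than `window` steps apart count as positives and all other frames in the sequence act as negatives; the function name `temporal_contrastive_loss`, the temperature value, and the random-walk embeddings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def temporal_contrastive_loss(z, window=3, temperature=0.1):
    """InfoNCE-style loss where frames less than `window` steps apart are positives.

    z: array of shape (n_frames, dim), embeddings kept in temporal order.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-length embeddings
    sim = z @ z.T / temperature                        # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # a frame is not its own candidate
    # Row-wise log-softmax, computed stably.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_norm
    # Positive mask: different frames that fall inside the temporal window.
    t = np.arange(len(z))
    gap = np.abs(t[:, None] - t[None, :])
    positives = (gap > 0) & (gap < window)
    return -log_prob[positives].mean()

# Toy check: embeddings that drift smoothly over time (a random walk) should
# give a lower loss than the same embeddings presented in shuffled order.
rng = np.random.default_rng(0)
ordered = np.cumsum(rng.normal(size=(60, 8)), axis=0)
shuffled = ordered[rng.permutation(len(ordered))]
loss_ordered = temporal_contrastive_loss(ordered)
loss_shuffled = temporal_contrastive_loss(shuffled)
print(f"ordered: {loss_ordered:.3f}  shuffled: {loss_shuffled:.3f}")
```

The comparison shows why temporal order matters as a teaching signal: when neighbouring frames already have similar embeddings, the loss is low, so minimising it pushes an encoder to map temporally adjacent views to nearby features.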
Fig 9.
(A) The vision transformer architecture. Images are first divided into smaller 8×8 patches and then reshaped into a sequence of flattened patches. A learnable positional embedding is added to each flattened patch, and a class token (CLS_Token) is added to the sequence. The resulting embedding is then sequentially processed by transformer blocks while also being analyzed in parallel by attention heads, which generate attention filters shown next to each head. The learned representation of the image is adjusted based on the contrastive learning through time loss function. (B) View-invariant recognition performance of newborn chicks and different ViT-CoT architecture sizes, across the four rearing conditions presented to the chicks. The red horizontal line shows the chicks’ performance. (C) Comparison of untrained versus trained ViT-CoT models across the four architecture sizes. All ViT-CoT models showed large learning gains, indicating that vision transformers can learn view-invariant features in the impoverished environments faced by chicks. Error bars represent standard error of model performances across validation folds.
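The patch-embedding step in panel A can be sketched in NumPy as below. The 64×64 RGB input, the 32-dimensional embedding, the linear projection `w_proj`, and the choice to prepend the class token before adding positional embeddings are all assumptions made for illustration; in a real model the projection, class token, and positional embeddings would be learned parameters rather than random arrays.

```python
import numpy as np

def patchify(image, patch=8):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)               # (patch_rows, patch_cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)      # (num_patches, patch * patch * C)

def embed(image, w_proj, cls_token, pos_embed, patch=8):
    """Project patches, prepend a class token, and add positional embeddings."""
    tokens = patchify(image, patch) @ w_proj             # (num_patches, dim)
    tokens = np.vstack([cls_token[None, :], tokens])     # (num_patches + 1, dim)
    return tokens + pos_embed

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                  # hypothetical input size
dim = 32                                         # hypothetical embedding width
patches = patchify(image)
w_proj = rng.normal(size=(patches.shape[1], dim)) * 0.02
cls_token = np.zeros(dim)
pos_embed = rng.normal(size=(patches.shape[0] + 1, dim)) * 0.02
tokens = embed(image, w_proj, cls_token, pos_embed)
print(patches.shape, tokens.shape)
```

The resulting token sequence is what the transformer blocks would consume; with 8×8 patches on a 64×64 image there are 64 patch tokens plus one class token.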
Fig 10.
Behavioral Consistency Analysis.
Representational similarity between the chicks and models. We measured representational similarity as the correlation between each model’s performance across the 12 test viewpoints and average chick performance across the 12 test viewpoints. We show each chick’s correlation to average chick performance (red dots) and each model’s correlation to average chick performance (black dots). Average and standard error for each model architecture are shown as bars and error bars, respectively. The lower and upper bounds of the chicks’ average correlation are shown as red lines with shading in between. The upper bound shows the mean correlation between each chick and the group-averaged performance across viewpoints. The lower bound shows the mean correlation between each chick and the remaining chicks’ group-averaged performance across viewpoints. The chicks and models generally showed the same pattern of successes and failures across the test viewpoints.
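The correlation measure and its upper and lower bounds can be made concrete with a short sketch. The data here are randomly generated placeholders (10 hypothetical chicks, 12 viewpoints), Pearson correlation is assumed as the correlation measure, and the function names are illustrative; only the definitions of the two bounds follow the caption.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D performance profiles."""
    return np.corrcoef(a, b)[0, 1]

def chick_consistency_bounds(chicks):
    """Return (lower, upper) bounds from an (n_chicks, n_viewpoints) array.

    upper: mean correlation of each chick with the full group average.
    lower: mean correlation of each chick with the average of the other chicks.
    """
    group_mean = chicks.mean(axis=0)
    upper = np.mean([pearson(c, group_mean) for c in chicks])
    lower = np.mean([
        pearson(chicks[i], np.delete(chicks, i, axis=0).mean(axis=0))
        for i in range(len(chicks))
    ])
    return lower, upper

def model_consistency(model_perf, chicks):
    """Correlation between a model's viewpoint profile and the average chick profile."""
    return pearson(model_perf, chicks.mean(axis=0))

# Placeholder data: a shared viewpoint profile plus individual noise.
rng = np.random.default_rng(0)
shared_profile = 0.7 + 0.1 * rng.normal(size=12)
chicks = shared_profile + 0.05 * rng.normal(size=(10, 12))
model_perf = shared_profile + 0.05 * rng.normal(size=12)

lower, upper = chick_consistency_bounds(chicks)
model_r = model_consistency(model_perf, chicks)
print(f"lower={lower:.3f} upper={upper:.3f} model={model_r:.3f}")
```

The lower bound excludes each chick from the average it is compared against, so it cannot exceed the upper bound; a model whose correlation falls between the two is about as consistent with the group as an individual chick is.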
Fig 11.
Comparison of deep neural networks trained in (A) natural visual environments versus (B) controlled-rearing chambers. (C) View-invariant recognition performance of the image-based CNN models from Experiment 1. (D) View-invariant recognition performance of the CNN space-time fitting models from Experiment 3. (E) View-invariant recognition performance of the transformer space-time fitting models from Experiment 4. Models trained in natural environments generally performed at levels similar to those of models trained in controlled-rearing chambers. Error bars represent standard error of model performances across validation folds. The sample images shown in panel A are for illustrative purposes only and were not used for training the models.
Table 1.
Hyperparameters of self-supervised learning algorithms used in Experiments 1–5.
Table 2.
ResNet with different architecture sizes.
Table 3.
ViT-CoT with different architecture sizes.