Computational origins of shape perception

doi:10.1371/journal.pcbi.1013674

Fig 7

Hardcoding spatial priors into fitting models.

Unlike Vision Transformers (ViTs), convolutional neural networks (CNNs) have hardcoded spatial priors. To measure the impact of hardcoding spatial priors into fitting models, we compared ViTs and CNNs across all of the controlled-rearing conditions. The ViTs and CNNs had the same temporal learning objective, so the models differed only in architecture. Unlike the ViTs, which learned shape perception by leveraging view diversity (Figs 4–6), the CNNs learned shape perception in nearly all of the controlled-rearing conditions. The one exception was the condition in which the training views were shuffled, which prevented shape learning in both CNNs and ViTs. The same pattern emerged with both imprinting objects (top and bottom rows). Hardcoding spatial priors into fitting models therefore reduces the models' reliance on view diversity for learning shape perception. All RDMs were computed from the same images as those used in Fig 1. Error bars denote standard error for each model across the color cells and shape cells shown in Fig 1b, c.

doi: https://doi.org/10.1371/journal.pcbi.1013674.g007
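
For readers unfamiliar with RDMs (representational dissimilarity matrices), the following minimal sketch shows one common way to build one from model embeddings. It assumes a dissimilarity of 1 minus the Pearson correlation between embedding vectors; the caption does not state which metric was used, so this choice, the function names `pearson` and `compute_rdm`, and the toy embeddings are all illustrative assumptions rather than the authors' procedure.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length vectors.

    Assumes both vectors have nonzero variance; a constant vector
    would cause a division by zero.
    """
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def compute_rdm(embeddings):
    """Return a symmetric matrix of pairwise dissimilarities.

    Entry [i][j] is 1 - Pearson correlation between the embeddings of
    image i and image j (an assumed metric, not taken from the paper).
    """
    n = len(embeddings)
    rdm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - pearson(embeddings[i], embeddings[j])
            rdm[i][j] = d
            rdm[j][i] = d
    return rdm


# Hypothetical embeddings for three test images (one row per image).
embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same pattern as image 0, scaled
    [4.0, 3.0, 2.0, 1.0],  # reversed pattern of image 0
]
rdm = compute_rdm(embeddings)
```

Under this assumed metric, images whose embeddings follow the same pattern have dissimilarity near 0, and images with opposite patterns have dissimilarity near 2. A model that has learned shape would be expected to show lower dissimilarity between images of the same shape than between images of different shapes.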