High-performing neural network models of visual cortex benefit from high latent dimensionality

Geometric descriptions of deep neural networks (DNNs) have the potential to uncover core representational principles of computational models in neuroscience. Here we examined the geometry of DNN models of visual cortex by quantifying the latent dimensionality of their natural image representations. A popular view holds that optimal DNNs compress their representations onto low-dimensional subspaces to achieve invariance and robustness, which suggests that better models of visual cortex should have lower dimensional geometries. Surprisingly, we found a strong trend in the opposite direction—neural networks with high-dimensional image subspaces tended to have better generalization performance when predicting cortical responses to held-out stimuli in both monkey electrophysiology and human fMRI data. Moreover, we found that high dimensionality was associated with better performance when learning new categories of stimuli, suggesting that higher dimensional representations are better suited to generalize beyond their training domains. These findings suggest a general principle whereby high-dimensional geometry confers computational benefits to DNN models of visual cortex.

Question: Is the relationship between ED and encoding performance a trivial statistical consequence of high ED?
It is important to emphasize that the relationship between ED and encoding performance cannot be explained as a trivial statistical consequence of models with high ED. First, all models were evaluated using cross-validation, which means that the only way for a model to perform well is by explaining meaningful variance that generalizes to held-out data. If models with high ED were simply overfit to the training data, their performance on the held-out test data would be poor. Second, our ED metric characterizes the distribution of variance in the eigenspectrum, but it does not directly indicate the number of available dimensions in a system, nor does it change the number of parameters in the model. In fact, all models examined here were full rank, meaning that their image representations spanned the maximum number of latent dimensions. Thus, in our analyses, ED alone has no direct relationship to the maximum number of latent dimensions that could potentially be used in a regression.
Finally, the data that we modeled come from a high-level visual region (IT) whose image-evoked responses have long been a challenging target for computational modelers. Indeed, decades of effort to model the representations of this brain region directly led to the advent of deep learning approaches in the computational neuroscience of vision [8,3]. If any model with high ED could trivially explain the representations of IT, then neuroscientists would have no need for deep neural networks. One could, instead, solve the challenging problem of modeling IT by running linear regression on RGB pixel values and adding polynomials or interaction terms until ED was high enough to account for the variance in neural responses. The reason that such an approach would not work is that the space of all possible image representations is infinite: there is an unlimited variety of arbitrary computations that could be used to add dimensions to a model. Models that achieve high ED through arbitrary computations would have a negligible probability of overlapping with the representations of visual cortex. We thus suspect that the use of performance-optimized DNN architectures is critical for constraining the computations of encoding models and increasing their overlap with cortical representations.
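For concreteness, the following minimal sketch illustrates the distinction between ED and rank, taking ED to be the participation ratio of the PCA eigenspectrum (one common definition consistent with how ED is described here); the toy data and variable names are illustrative assumptions, not an analysis from the paper:

```python
import numpy as np

def effective_dimensionality(features: np.ndarray) -> float:
    """Participation-ratio ED of a (n_stimuli, n_units) activation matrix:
    ED = (sum_i lambda_i)**2 / sum_i lambda_i**2, where lambda_i are the
    eigenvalues of the feature covariance (the PCA variance spectrum)."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the PCA variances;
    # ED is scale invariant, so the 1/(n-1) normalization can be dropped.
    eigvals = np.linalg.svd(centered, compute_uv=False) ** 2
    return eigvals.sum() ** 2 / np.square(eigvals).sum()

# A representation can be full rank yet have low ED: when variance is
# concentrated in a few dimensions, ED falls far below the rank.
rng = np.random.default_rng(0)
variances = np.geomspace(1.0, 1e-4, num=100)           # fast-decaying spectrum
X = rng.standard_normal((1000, 100)) * np.sqrt(variances)
print(effective_dimensionality(X))                     # ~20, far below 100
print(np.linalg.matrix_rank(X))                        # 100: still full rank
```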
Question: Why should we care about latent dimensionality if ImageNet classification performance also correlates with encoding performance?
Prior work has found that a model's object classification accuracy (e.g., on ImageNet) strongly correlates with its encoding performance for certain brain regions [8,9]. Could we simply focus on classification accuracy, or is latent dimensionality theoretically important in its own right? While object classification accuracy seems to account for the encoding performance of current models, it is worth asking whether it can remain a viable theory going forward. To say that it is a complete explanation is to say that the correlation between classification accuracy and encoding performance will hold indefinitely, across the space of all current and future models (i.e., a model that performs better on object classification will always obtain higher encoding performance). This is unlikely to be the case. Optimal object classification requires that a model be invariant to features unrelated to object identity, such as orientation and position, which would only contribute noise to the classifier [4]. However, we know that the brain represents orientation, position, and a host of other features unrelated to object identity. Therefore, the object classification theory of encoding performance must break down in some regime, and the true dimensionality of visual cortex must be higher than what ideal object classification models would predict. Indeed, initial results suggest that the relationship between object classification and encoding performance breaks down past a certain ImageNet classification accuracy [5,6]. A theory based on latent dimensionality (and alignment pressure) has the potential to explain the encoding performance of both current and future models on richer neural datasets, and it may help us understand why the relationship between encoding performance and classification accuracy breaks down at the highest levels of classification performance.
An interesting question arising from this discussion is whether the observed relationship between classification accuracy and encoding performance might be overly optimistic due to the limited space of DNNs available for computational neuroscientists to examine. Most of the DNNs in visual neuroscience are trained on ImageNet or similar image databases, and we do not have DNNs that can perform open-ended tasks in complex, real-world environments. If we did have DNNs that handled more complex and naturalistic visual behaviors, we postulate that they would surpass the encoding performance of our best object-classification models (and also have higher dimensionality). With the current space of state-of-the-art DNNs dominated by (a) supervised object classification and (b) self-supervised objectives that learn invariances tailored to object classification, we are bound to observe the current correlation between object classification performance and encoding performance. Object recognition is undoubtedly one important problem that biological vision solves, but, importantly, it is only one of many complex problems solved by the representations of visual cortex.
Finally, ED and classification performance are two fundamentally different levels of explanation that are mutually compatible, because high classification accuracy itself is explained by certain geometric properties of a representation, of which ED is an important one (see Sections 2.4 and 2.5 in the main manuscript). We elaborate on this point in S10, where we examine the relationship between classification performance, encoding performance, and ED in our models.

Question: Does effective dimensionality really represent the number of accurately encoded visual features?
Essential to our theory is the assumption that the variance of a representation along a particular dimension is proportional to the meaningfulness of the feature it encodes. In such cases, it is valid to say that ED roughly quantifies the number of encoded visual features. This interpretation is central to popular dimensionality reduction techniques, such as PCA, and it has good theoretical support given that high-variance dimensions are typically more robust and would thus be best suited for carrying the signal in a population code. Importantly, recent findings in the DNN literature have shown that neural networks expand variance along dimensions that are useful for solving their tasks and contract variance along noise dimensions left over from their random initialization [2,4,1]. There is, thus, a straightforward relationship between the number of meaningful latent dimensions in a neural network and the shape of the principal component variance spectrum.
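As a toy illustration of that relationship (our construction, not an analysis from the paper), consider a representation with a fixed number of expanded, task-relevant dimensions embedded among contracted, near-noise dimensions; the participation-ratio ED of the resulting variance spectrum approximately tracks the number of expanded dimensions:

```python
import numpy as np

def participation_ratio(eigvals: np.ndarray) -> float:
    return eigvals.sum() ** 2 / np.square(eigvals).sum()

rng = np.random.default_rng(1)
n_stimuli, n_units = 2000, 200
for n_signal in (5, 20, 80):
    # n_signal expanded "meaningful" dimensions among low-variance
    # residual dimensions left over from initialization.
    variances = np.r_[np.full(n_signal, 1.0), np.full(n_units - n_signal, 0.01)]
    X = rng.standard_normal((n_stimuli, n_units)) * np.sqrt(variances)
    eigvals = np.linalg.svd(X - X.mean(0), compute_uv=False) ** 2
    print(n_signal, round(participation_ratio(eigvals), 1))  # ED ~ n_signal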
Nonetheless, there is no guarantee that high-variance dimensions correspond to meaningful signal and that low-variance dimensions correspond to random features or noise. Furthermore, ED is an imperfect summary statistic of the rate of decay of the eigenspectrum and does not directly quantify the number of meaningful features. An interesting direction for future work would be the development of dimensionality metrics that explicitly differentiate between meaningful and random dimensions in DNNs, as can be done for neural data when using repeated stimulus presentations [7].
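As a sketch of what such a metric might look like, assuming repeated presentations of identical stimuli are available: estimate principal axes from one repeat and count the dimensions whose projections replicate in the other repeat. The function, criterion, and threshold below are hypothetical illustrations, not the method of [7]:

```python
import numpy as np

def reliable_dimensions(rep1: np.ndarray, rep2: np.ndarray,
                        threshold: float = 0.1) -> int:
    """Count principal dimensions whose stimulus-driven variance
    replicates across two repeats of the same stimuli
    (illustrative criterion; rep1, rep2 are stimulus-by-unit)."""
    rep1 = rep1 - rep1.mean(axis=0)
    rep2 = rep2 - rep2.mean(axis=0)
    # Principal axes estimated from the first repeat only.
    _, _, axes = np.linalg.svd(rep1, full_matrices=False)
    proj1, proj2 = rep1 @ axes.T, rep2 @ axes.T
    # A dimension counts as "meaningful" if the projections of both
    # repeats onto it correlate above the (arbitrary) threshold.
    reliabilities = np.array([np.corrcoef(proj1[:, i], proj2[:, i])[0, 1]
                              for i in range(axes.shape[0])])
    return int((reliabilities > threshold).sum())

# Demo: 10 shared signal dimensions plus independent repeat-specific noise.
rng = np.random.default_rng(2)
signal = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 100))
rep1 = signal + 0.5 * rng.standard_normal((500, 100))
rep2 = signal + 0.5 * rng.standard_normal((500, 100))
print(reliable_dimensions(rep1, rep2))  # roughly 10
```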