The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex

Is visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge acquired from learning a set of objects is only transferable to new objects that share properties with the old, then the recognition system’s optimal organization must be one containing specialized modules for different object classes. Our analysis starts from a premise we call the invariance hypothesis: that the computational goal of the ventral stream is to compute an invariant-to-transformations and discriminative signature for recognition. The key condition enabling approximate transfer of invariance without sacrificing discriminability turns out to be that the learned and novel objects transform similarly. This implies that the optimal recognition system must contain subsystems trained only with data from similarly-transforming objects and suggests a novel interpretation of domain-specific regions like the fusiform face area (FFA). Furthermore, we can define an index of transformation-compatibility, computable from videos, that can be combined with information about the statistics of natural vision to yield predictions, in agreement with the available data, for which object categories ought to have domain-specific regions. The result is a unifying account linking the large literature on view-based recognition with the wealth of experimental evidence concerning domain-specific regions.

It is hypothesized that properties of the ventral stream are determined by these three factors. We are not the only ones to identify them in this way. For example, Simoncelli and Olshausen distinguished the same three factors [1]. The crucial difference between their efficient coding hypothesis and our invariance hypothesis is the particular computational task that we consider. In their case, the task is to provide an efficient representation of the visual world. In our case, the task is to provide an invariant signature supporting object recognition.
The new theory of architectures for object recognition [2], applied here to the ventral stream, is quite general. It encompasses many non-biological hierarchical networks in the computer vision literature in addition to ventral stream models like HMAX. It also implies the existence of a wider class of hierarchical recognition algorithms that has not yet been fully explored. The conjecture with which this paper is concerned is that the algorithm implemented by the ventral stream's feedforward processing is in this class. The theory can be developed from four postulates: (1) Computing a representation that is unique to each object and invariant to identity-preserving transformations is the main computational problem to be solved by an object recognition system, i.e., by the ventral stream. (2) The ventral stream's feedforward, hierarchical operating mode is sufficient for recognition [3][4][5]. (3) Neurons can compute high-dimensional dot products between their inputs and a stored vector of synaptic weights [6]. (4) Each layer of the hierarchy implements the same basic "HW-module", performing filtering and pooling operations via the scheme proposed by Hubel and Wiesel for the wiring of V1 simple cells to complex cells [7].
We argue that as long as these postulates are approximately correct, the algorithm implemented by the (feedforward) ventral stream is in the class described by the theory, and this is sufficient to explain its domain-specific organization.

The first regime: generic invariance
First, consider the (compact) group of 2D in-plane rotations $G$. With some abuse of notation, we use $g$ to indicate both an element of $G$ and its unitary representation acting on images. The orbit of an image $I$ under the action of the group is $O_I = \{gI \mid g \in G\}$. The orbit is invariant and unique to the object depicted in $I$. That is, $O_I = O_{I'}$ if and only if $I' = gI$ for some $g \in G$. For example, let $I$ be an image. Its orbit $O_I$ is the set of all images obtained by rotating $I$ in the plane. Now consider $g_{90^\circ} I$, its rotation by $90^\circ$. The two orbits are clearly the same, i.e., $O_I = O_{g_{90^\circ} I}$: the set of images obtained by rotating $I$ is the same as the set of images obtained by rotating $g_{90^\circ} I$.
The fact that orbits are invariant and unique (for compact groups) suggests a recognition strategy. Simply store the orbit for each familiar object. Then, for each new image, check what orbit it is in. Such a strategy would yield invariant representations for familiar objects. However, it could only be used in cases where we had already stored the entire orbit for all objects of interest. How could this approach work in the more realistic setting where only one sample from the test object's orbit is available?
The key property that enables this approach to object recognition is the following condition. For a stored template $t$ with unit norm, $\langle gI, t \rangle = \langle I, g' t \rangle$, $\exists g' \in G$, $\forall g \in G$. (1) It is true whenever $g$ is unitary, since in that case $g' = g^{-1}$. It implies that it is not necessary to have the orbit of $I$ in advance. Instead, the orbit of $t$ is sufficient. Eq. (1) enables the invariance learned from observing a set of templates to transfer to new images. Consider the case where the full orbits of several templates $t_1, \ldots, t_K$ were stored. Let $I$ be a completely novel image. Let $P$ be a function mapping sets of real numbers to $\mathbb{R}$. For example, we can choose $P = \max(\cdot)$. An invariant signature $\mu(\cdot)$ can be defined as . . .
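As an illustration of how eq. (1) lets stored template orbits yield an invariant signature for a novel image, the following minimal sketch uses the four-element group of 90° in-plane rotations, which acts exactly on square pixel arrays. It assumes a signature of the form $\mu_k(I) = P(\{\langle I, g t_k \rangle \mid g \in G\})$ with $P = \max$. The function names, the random templates, and the 8×8 image size are illustrative assumptions and not part of the original model.

```python
import numpy as np

def orbit(template):
    """Orbit of a square template under the four 90-degree rotations."""
    return [np.rot90(template, k) for k in range(4)]

def signature(image, templates, pool=np.max):
    """Assumed form: mu_k(I) = P({<I, g t_k> : g in G}) for each template t_k."""
    return np.array([
        pool([np.sum(image * g_t) for g_t in orbit(t)])
        for t in templates
    ])

# Hypothetical data: three stored templates and one novel image.
rng = np.random.default_rng(0)
templates = [rng.standard_normal((8, 8)) for _ in range(3)]
image = rng.standard_normal((8, 8))

mu = signature(image, templates)
mu_rotated = signature(np.rot90(image), templates)  # same image, rotated
```

In this sketch `mu` and `mu_rotated` agree up to floating-point error, because rotating the image only permutes the set of dot products over which $P$ pools. Only the orbits of the templates are needed, never the orbit of the novel image.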
So far, this analysis has only applied to compact groups. Essentially the only interesting one is in-plane rotation. We need an additional idea in order to consider more general groups: most transformations are only observed through a limited range of transformation parameters. For example, in principle, one could translate arbitrary distances. But in practice, all translations are contained within some finite window. That is, rather than considering the full orbit under the action of $G$, we consider partial orbits under the action of a subset $G_0 \subset G$ (note: $G_0$ is not a subgroup).
We can now define the basic module that will repeat through the hierarchy. As mentioned in the main text, an HW-module consists of one C-unit and all of its afferent S-units. For an image $I$, the output of the $k$-th HW-module is $\mu_k(I) = P(\{\langle I, g t_k \rangle \mid g \in G_0\})$. The subset $G_0$ is called the HW-module's pooling domain. Note that if $G_0$ is a set of translations, the pooling domain has the same interpretation as a spatial region as in HMAX.
Consider, for simplicity, the case of 1D images (centered in zero) transforming under the 1D locally compact group of translations. What are the conditions under which an HW-module will be invariant over the range . . . where $\eta$ is a positive, bijective function. The $k$-th component of the signature vector will then be . . . Suppose we transform the image $I$ (or equivalently, the template) by a translation of $\bar{x} > 0$, implemented by $T_{\bar{x}}$. Under what conditions does $\mu_k(I) = \mu_k(T_{\bar{x}} I)$? Note first that $\langle I, T_x t_k \rangle = (I * t_k)(x)$, where $*$ indicates convolution. By the properties of the convolution operator, we have . . . This observation allows us to write a condition for the invariance of the signature vector components with respect to the translation $T_{\bar{x}}$ (see also Fig. 2). For a nonlinearity $\eta$ that is positive (no cancellations in the sum) and bijective (the support of the dot product is unchanged by applying $\eta$), the condition for invariance is:

Figure 2. Localization condition of the S-unit response for invariance under the transformation $T_{\bar{x}}$.
Eq. 3 is a localization condition on the S-unit response. It is necessary and sufficient for invariance. In this case, eq. (1) is trivial since we are considering group transformations.
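The role of localization can be checked numerically with a small sketch. It assumes, as a simplification, that pooling is a sum of $\eta$ applied to the S-unit responses over a window of shifts $[-b, b]$, with $\eta$ chosen as squaring (positive, and bijective on the nonnegative responses used here). The signals, the window half-width `b`, and the helper names are hypothetical choices made only for illustration.

```python
import numpy as np

def shift(signal, dx):
    """Translate a 1D signal by dx samples, zero-filling (no wrap-around)."""
    out = np.zeros_like(signal)
    if dx >= 0:
        out[dx:] = signal[:len(signal) - dx]
    else:
        out[:dx] = signal[-dx:]
    return out

def hw_module(image, template, b, eta=np.square):
    """Assumed pooling: sum of eta(<image, T_x template>) over x in [-b, b]."""
    responses = np.array(
        [np.dot(image, shift(template, x)) for x in range(-b, b + 1)]
    )
    return float(np.sum(eta(responses)))

# Hypothetical compactly supported image and template on a 101-sample line.
image = np.zeros(101)
image[48:53] = [1, 2, 3, 2, 1]
template = np.zeros(101)
template[49:52] = [1, 2, 1]

b = 10  # pooling half-width; the S-unit response is nonzero only for |x| <= 3
mu_original = hw_module(image, template, b)
mu_small_shift = hw_module(shift(image, 5), template, b)  # response stays inside the window
mu_large_shift = hw_module(shift(image, 9), template, b)  # response partly leaves the window
```

With these choices the module output is unchanged by the 5-sample translation, since the shifted response is still fully contained in the pooling window, whereas the 9-sample translation pushes part of the response outside the window and changes the output.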

The second regime: class-specific invariance
So far, we have explained how the localization properties of the S-response allow invariance in the case of partially observed group transformations. Next, we show how localization still enables approximate invariance ($\epsilon$-invariance) even in the case of non-group (smooth) transformations. However, as will be shown below, in order for eq. (1) to be (approximately) satisfied, the class of templates needs to be much more constrained than in the group case.
Consider a smooth transformation $T_r$ parametrized by $r \in \mathbb{R}$. Its Taylor expansion w.r.t. $r$ around, e.g., zero is: $T_r(I) = T_0(I) + J^I(I)\, r + O(r^2) = L^I_r(I) + O(r^2)$, where $J^I$ is the Jacobian of the transformation $T$, and $L^I_r(\cdot) = e(\cdot) + J^I(\cdot)\, r$. The operator $L^I_r$ corresponds to the best linearization around the point $r = 0$ of the transformation $T_r$. Let $R$ be the range of the parameter $r$ such that $T_r(I) \approx L^I_r(I)$. If the localization condition holds for a subset of the transformation parameters contained in $R$, i.e.
and as long as the pooling range $P$ in the $r$ parameter is chosen so that $P \subseteq R$, then we are back in the group case. Thus the same reasoning used above for translation will still apply. However, this is not the case for eq. (1). The tangent space of the image's orbit is given by the Jacobian, and it clearly depends on the image itself. Since the tangent spaces of the image and of the template will generally be different (see Fig. 3), eq. (1) cannot in general be satisfied. More formally, for $r \in R$: . . . That is, eq. (1) is only satisfied when the image and template "transform the same way" (see Fig. 3).

To summarize, the following three conditions are needed to have invariance for non-group transformations:
1. The transformation must be differentiable (the Jacobian must exist).
2. A localization condition of the form in eq. (5) must hold to allow a linearization of the transformation.
3. The image and templates must transform "in the same way", i.e., the tangent spaces of their orbits (in the localization range) must be equal. This is equivalent to $J^I \equiv J^{t_k}$.
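To make the notion of a linearization range concrete, the following sketch compares a smooth, non-group transformation with its first-order approximation around $r = 0$. The particular warp (resampling a 1D image at $x + r \sin x$), the Gaussian image, and all function names are hypothetical choices for illustration only. The Jacobian is written out analytically for this specific image, which also shows that it depends on the image being transformed.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 601)

def image(u):
    """Hypothetical 1D image: a Gaussian bump."""
    return np.exp(-u ** 2)

def transform(r):
    """T_r(I): sample the image on a smoothly warped grid (not a group action)."""
    return image(x + r * np.sin(x))

# Jacobian at r = 0, i.e. d/dr T_r(I) at r = 0, which equals I'(x) * sin(x).
# It depends on the image through I'(x).
jacobian = -2.0 * x * image(x) * np.sin(x)

def linearized(r):
    """First-order approximation: identity plus Jacobian times r."""
    return image(x) + jacobian * r

def linearization_error(r):
    """Largest pointwise gap between T_r(I) and its linearization."""
    return float(np.max(np.abs(transform(r) - linearized(r))))

error_small = linearization_error(0.01)
error_large = linearization_error(0.1)
```

The error vanishes at $r = 0$ and grows roughly quadratically with $r$, so the approximation is only useful within a limited range of the parameter. A template with a different shape would have a different Jacobian under the same warp, which is why the third condition above restricts the templates to objects that transform like the image.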
Remark: The exposition of the theory given here is specialized to the case of the general theory that is relevant here. In general, we allow each "element" of the signature (as defined here) to be a vector representing a distribution of one-dimensional projections of the orbit. See [2] for details.

Illumination is also a class-specific transformation. The appearance of an object after a change in lighting direction depends both on the object's 3D structure and on its material properties (e.g. reflectance, opacity, specularities). Figure 4 displays the results from a test of illumination-invariant recognition on three different object classes, which can be thought of as statues of heads made from different materials: A, wood; B, silver; and C, glass. The results of this illumination-invariance test follow the same pattern as the 3D rotation-invariance test. In both cases the view-based model improves on the pixel-based model's performance when the template and test images are from the same class (fig. 4, plots on the diagonal). Using templates of a different class than the test class actually lowered performance below that of the pixel-based model in some of the tests, e.g. train A / test B and train B / test C (fig. 4, off-diagonal plots). This simulation suggests that these object classes have high ψ with respect to illumination transformations. However, the weak performance of the view-based model on the silver objects indicates that ψ for this class is not as high as for the others (see the table below). This is because the small differences in 3D structure that define individual heads give rise to more extreme changes in specular highlights under the transformation.
Where σ is the Gaussian's variance parameter. The class-specific layer takes in any vector representation of an image as input. We investigated two hierarchical architectures built on different layers of the HMAX model (C1 and C2-global) [8], referred to in fig. 5 as the V1-like and IT-like models, respectively.
For the pose-invariant body recognition task, the template images were drawn from a subset of the 44 bodies, rendered in all poses. In each of 10 cross-validation splits, the testing set contained images of 10 bodies that never appeared in the model-building phase, again rendered in all poses (fig. 5).
The HMAX models perform almost at chance. Adding the class-specific mechanism significantly improves performance: models without class-specific features were unable to perform this difficult invariant recognition task, while class-specific features enabled good performance (fig. 5).
Downing and Peelen (2011) argued that the extrastriate body area (EBA) and fusiform body area (FBA) "jointly create a detailed but cognitively unelaborated visual representation of the appearance of the human body". These are perceptual regions: they represent body shape and posture but do not explicitly represent high-level information about "identities, actions, or emotional states" (as had been claimed by others in the literature [9]). The model of body-specific processing suggested by the simulations presented here is broadly in agreement with this view of EBA and FBA's function. It computes, from an image, a body-specific representation that could underlie many further computations, e.g. action recognition, emotion recognition, etc.

We consider three different arbitrary choices for the distributions of objects from five different categories: faces, bodies, vehicles, chairs, and animals (see table 2). Importantly, one set of simulations used statistics which were strongly biased against the appearance of faces as opposed to other objects.

Figure 9. The percentage of objects in the first N clusters containing the dominant category object (clusters sorted by number of objects in dominant category). A, B and C are, respectively, the "realistic" distribution, the uniform distribution, and the biased-against-faces distribution (see table 2). 100% of the faces go to the first face cluster; only a single face cluster developed in each experiment. Bodies were more "concentrated" in a small number of clusters, while the other objects were all scattered in many clusters, thus their curves rise slowly. These results were averaged over 5 repetitions of each clustering simulation using different randomly chosen objects.

Figure 10. The classification performance on face recognition, a subordinate-level task (top row), and car vs. airplane, a basic-level categorization task (bottom row), using templates from each cluster. 5-fold cross-validation; for each fold, the result from the best-performing cluster of each category is reported. A, B and C indicate "realistic", uniform, and biased distributions, respectively (see table 2). Note that performance on the face recognition task is strongest when using the face cluster, while the performance on the basic-level car vs. airplane task is not stronger with the vehicle cluster (mostly cars and airplanes) than with the others.

Stimuli
Illumination: Within each class the texture and material properties were exactly the same for all objects. We used Blender to render images of each object with the scene's sole light source placed in different locations. The 0 position was set to be in front of the object's midpoint; the light was translated vertically. The most extreme translations brought the light source slightly above or below the object. Material data files were obtained from the Blender Open Material Repository (http://matrep.parastudios.de/). 40 heads were rendered with each material type. For each repetition of the experiment, 20 were randomly chosen to be templates and 20 to be testing objects. Each experiment was repeated 20 times with different template and testing sets.

Bodies / pose
DAZ 3D Studio was used to render each of 44 different human bodies under 32 different poses, i.e., 44 × 32 = 1408 images in total.

Body-pose experiments
For the body-pose invariance experiments (fig. 5), the task was identical to the test for unfamiliar faces and novel object classes. The same classifier (Pearson correlation) was used for this experiment. Unlike rotation-in-depth, the body-pose transformation was not parameterized.

Clustering by transformation compatibility
Pseudocode for the clustering algorithm is given below (algorithm 1).
Let $A_i$ be the $i$-th frame of the video of object $A$ transforming and $B_i$ be the $i$-th frame of the video of object $B$ transforming. The Jacobian can be approximated by the "video" of difference images: $J_A(i) = |A_i - A_{i+1}|$ ($\forall i$). The "instantaneous" transformation compatibility is $\psi$ . . . The transformation compatibility $\psi$ of a cluster $C$ was defined as the average of the pairwise compatibilities $\psi(A, B)$ of all objects in $C$: $\psi(C) := \mathrm{mean}(\psi(A, B))$ for all pairs of objects $(A, B)$ from $C$.
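The quantities above can be sketched in a few lines. The difference-image Jacobian and the cluster-level average follow the definitions in the text. The pairwise compatibility, however, is an assumption made here for illustration: the sketch takes the instantaneous compatibility to be the normalized dot product between the two difference images at the same frame index and averages it over frames. All function names and the toy bar videos are hypothetical.

```python
import itertools
import numpy as np

def jacobian_video(frames):
    """J(i) = |frame_i - frame_{i+1}| for a video of shape (n_frames, h, w)."""
    frames = np.asarray(frames, dtype=float)
    return np.abs(frames[:-1] - frames[1:])

def pairwise_compatibility(frames_a, frames_b):
    """ASSUMED form of psi(A, B): mean over frames of the cosine similarity
    between J_A(i) and J_B(i). Both videos must have the same shape."""
    ja = jacobian_video(frames_a)
    jb = jacobian_video(frames_b)
    ja = ja.reshape(len(ja), -1)
    jb = jb.reshape(len(jb), -1)
    num = np.sum(ja * jb, axis=1)
    den = np.linalg.norm(ja, axis=1) * np.linalg.norm(jb, axis=1)
    # Frames with no change in either video contribute zero compatibility.
    inst = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(inst.mean())

def cluster_compatibility(videos):
    """psi(C): mean of psi(A, B) over all pairs of objects in the cluster."""
    pairs = itertools.combinations(videos, 2)
    return float(np.mean([pairwise_compatibility(a, b) for a, b in pairs]))

def bar_video(vertical, n_frames=5, size=8):
    """Toy video: a one-pixel-wide bar moving one pixel per frame."""
    frames = np.zeros((n_frames, size, size))
    for i in range(n_frames):
        if vertical:
            frames[i, :, i] = 1.0
        else:
            frames[i, i, :] = 1.0
    return frames

video_a = bar_video(vertical=True)
video_b = bar_video(vertical=False)
```

Under this assumed definition, an object is maximally compatible with itself, and the two toy bars, which change different sets of pixels as they move, are less compatible with each other than either is with itself.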