Current address: Volen Center for Complex Systems, Brandeis University, Waltham, Massachusetts, United States of America

Conceived and designed the experiments: PB RET MS. Performed the experiments: PB. Analyzed the data: PB. Wrote the paper: PB RET MS.

The authors have declared that no competing interests exist.

The visual system must learn to infer the presence of objects and features in the world from the images it encounters, and as such it must, either implicitly or explicitly, model the way these elements interact to create the image. Do the response properties of cells in the mammalian visual system reflect this constraint? To address this question, we constructed a probabilistic model in which the identity and attributes of simple visual elements were represented explicitly and learnt the parameters of this model from unparsed, natural video sequences. After learning, the behaviour and grouping of variables in the probabilistic model corresponded closely to functional and anatomical properties of simple and complex cells in the primary visual cortex (V1). In particular, feature identity variables were activated in a way that resembled the activity of complex cells, while feature attribute variables responded much like simple cells. Furthermore, the grouping of the attributes within the model closely parallelled the reported anatomical grouping of simple cells in cat V1. Thus, this generative model makes explicit an interpretation of complex and simple cells as elements in the segmentation of a visual scene into basic independent features, along with a parametrisation of their moment-by-moment appearances. We speculate that such a segmentation may form the initial stage of a hierarchical system that progressively separates the identity and appearance of more articulated visual elements, culminating in view-invariant object recognition.

When we look at a visual scene, neurons in our eyes “fire” short, electrical pulses in a pattern that encodes information about the visual world. This pattern passes through a series of processing stages within the brain, eventually leading to cells whose firing encodes high-level aspects of the scene, such as the identity of a visible object regardless of its position, apparent size or angle. Remarkably, features of these firing patterns, at least at the earlier stages of the pathway, can be predicted by building “efficient” codes for natural images: that is, codes based on models of the statistical properties of the environment. In this study, we have taken a first step towards extending this theoretical success to describe later stages of processing, building a model that extracts a structured representation in much the same way as does the visual system. The model describes discrete, persistent visual elements, whose appearance varies over time—a simplified version of a world built of objects that move and rotate. We show that when fit to natural image sequences, features of the “code” implied by this model match many aspects of processing in the first cortical stage of the visual system, including: the individual firing patterns of types of cells known as “simple” and “complex”; the distribution of coding properties over these cells; and even how these properties depend on the cells' physical proximity. The model thus brings us closer to understanding the functional principles behind the organisation of the visual system.

It is well established that the receptive fields (RFs) of neurons in the early visual cortex depend on the statistics of sensory input and can be modified by perturbations of those statistics during development

The generative modelling approach takes a complementary functional view. It is based on the Helmholtzian account of perception as inverse inference (sometimes called analysis-by-synthesis). That is, that the goal of the perceptual system is to infer from sensation the environmental causes most likely to be responsible for producing the sensory experience

Many models that have been formulated in terms of the optimisation of an objective function could also be viewed as implementing inference within an appropriate generative model: the assumptions and structure of the model are implicit in the objective function. Thus, recoding based on the sparseness objective corresponds to inference within a generative model in which a number of independent, sparsely active causes combine linearly to form the image

A remarkable success of these functional models, whether formulated generatively or in terms of a representational objective function, is that, when used to learn an appropriate representation from a set of natural images, they yield elements that mirror a number of response properties of primary visual cortical neurons (though some notable discrepancies do remain

In the present study we focused on one basic structural aspect of the environment: the visual world is largely composed of discrete objects, each of which contributes a set of discrete visual features to the image. Moreover, the objects, and therefore their associated features, usually remain in view for some time, although their precise appearances might change gradually owing to changes in viewpoint, lighting, or the object's position. We thus formulated a model in which the

We fitted this model to natural video images, without using any additional information about which elements were present or what their transformations might be. We found that the model naturally learned biologically plausible features, with low dimensional manifolds of attributes. Many aspects of the learnt representation corresponded closely to both anatomical and functional observations regarding simple and complex cells in the primary visual cortex (V1). Thus, the model offers a functional interpretation for the presence of two main classes of cells in V1. Complex cells represent the probability of presence of an oriented feature, while simple cells parametrise the precise appearance of the feature in the visual input. We speculate that a similar representation in the form of feature identities and attributes may continue up the visual hierarchy, ultimately contributing to view-independent object recognition.

A) Each visual element is represented by a binary-valued identity variable

The set of partial images associated with all of the active elements then combine through a function

In this abstract form the model is very powerful, and provides an intuitively satisfying generative structure for images. Unfortunately, for manifolds and combination functions modelling the appearance of entire complex objects and the interactions between them as illustrated in

The complete probabilistic generative model for image sequences includes probability distributions over the identity and attribute variables. We chose distributions in which objects or features appeared independently of one another, and where the probability of appearance at time

The parameters of the model specify the partial images generated by each feature (represented by the basis vectors

Probabilistic models are often fit by adjusting the parameters to maximise the probability given to the observed data—called the

More complex models can always be adjusted to give higher probability to any data set, and so the maximum likelihood approach would always favour a model with greater dimensionality. This effect can lead to

Bayesian estimation is well-defined only if a

We used this model to investigate the visual elements that compose natural images, comparing features of the representation learnt by the model when fit to natural image sequences to the representation found in V1. The input data were a subset of the CatCam recordings

Computational constraints prevented us from modelling the entire video sequence. Instead, we fit the model to the time-series defined by the pixel intensities within fixed windows of size

Given an observed image sequence, the model could be used to infer a posterior probability distribution over the values of the identity and attribute variables at each point in time. We compared the means of these distributions to the firing rates of neurons in the visual cortex. The use of the mean was necessarily arbitrary, since there is no generally agreed theory linking probabilistic models to neural activity. The brain may well represent more than a single point from this distribution. For example, information about the uncertainty in that value would be necessary to weight alternative interpretations of the data. Once the model had been fit to the data, however, we found that the attribute variable distributions estimated from high-contrast stimuli were strongly concentrated around their means. Thus, many different choices of neural correlates would have given essentially identical results. It is also worth mentioning here that although the identity variables describe the presence or absence of a feature in the generative model and are thus binary-valued, the posterior probability of the feature being present (which is the same as the posterior mean of the binary identity variable) is continuous. Thus, neurons presumed to encode these posterior means would respond to stimuli with graded responses, which would take uncertainty about feature identity (e.g., under conditions of low contrast) into account.
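The graded response of a binary identity variable can be illustrated with a minimal sketch. It assumes a hypothetical one-pixel toy model, not the model fitted in this paper: when the feature is present the input is a zero-mean Gaussian attribute plus Gaussian noise, and when it is absent the input is noise alone. The function names, the prior, and the variances are all illustrative assumptions.

```python
import math


def gauss_pdf(x, var):
    """Density of a zero-mean Gaussian with variance `var`, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_present(x, prior=0.1, attr_var=1.0, noise_var=0.05):
    """P(feature present | x) in a hypothetical one-pixel toy model.

    Present (s = 1): x = a + noise, with attribute a ~ N(0, attr_var).
    Absent  (s = 0): x = noise.
    Noise is N(0, noise_var) in both cases. All values are assumptions.
    """
    like_on = gauss_pdf(x, attr_var + noise_var)   # attribute marginalised out
    like_off = gauss_pdf(x, noise_var)
    return prior * like_on / (prior * like_on + (1.0 - prior) * like_off)


# Increasing input amplitude ("contrast") gives a graded, not binary, response.
contrasts = [0.0, 0.2, 0.4, 0.8, 1.6]
responses = [posterior_present(c) for c in contrasts]
for c, r in zip(contrasts, responses):
    print(f"input {c:4.1f} -> P(present | x) = {r:.3f}")
```

In this toy case the posterior depends on the input only through its square, so it is insensitive to the sign of the input, and at low amplitude it stays close to (here, below) the prior rather than jumping to 0 or 1.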

A) The posterior mean basis vectors

A) Basis vectors corresponding to one of the identity variables in the learnt model (row 14 in

A–C) Distribution of orientation, frequency, and phase for RFs computed for pairs of attribute variables associated with the same identity variable. D–F) Similar plots for pairs of simple cell RFs recorded from the same electrode in area 17 of the cat visual cortex. Reproduced with permission from DeAngelis et al., 1999

To explore this connection further we compared the properties of simple cell RFs in V1 as reported in the physiological literature with the ‘RFs’ of the attribute variables. The RF of an attribute variable was defined by analogy to the conventional physiological definition. We fixed the posterior distribution over the parameters of the model to that estimated by VBEM from the natural data, and then examined the values of the attribute variables that were inferred given coloured Gaussian noise input. The RF was defined to be the best linear approximation to the mapping from this input to the inferred mean attribute value, a procedure equivalent to finding the “corrected spike-triggered average” or Wiener filter

The distributions of preferred frequencies and orientations in the RFs of attribute variables are shown in

Comparison between the joint distribution of normalised RF width and length in our model (blue circles) and as reported by Ringach

The model was initialised using a representation that contained 6-dimensional attribute manifolds for each feature. However, in the posterior distribution identified by VBEM, the probability of the basis vectors corresponding to many of these dimensions being non-zero vanished—that is, a model in which the image data were described using fewer dimensions was found to be more probable. In fact, the VB posterior representation was only slightly overcomplete, with 96 basis vectors representing an 81-dimensional input space, and with the dimensionality of most feature manifolds lying between 2 and 4 (

A) Distribution of the dimensionality of the attribute manifold. Attribute filters with norm

A key aspect of our model is the temporal dependence of the identity and attribute variables. To assess the influence of this temporal structure on the learnt feature basis vectors, we shuffled the order of frames in the CatCam database and then refit the model using exactly the same procedure as before. With unshuffled data, the learning process adapted the feature manifolds so that the inferred values of identity variables persisted in time, while the inferred attribute variables changed smoothly. In the shuffled data, no such persistent and smooth representation can be found. Instead, learning adjusts the attribute manifolds so as to maximise the independence of the associated identity variables, grouping together attribute dimensions that tend to co-occur in single frames. This is similar in spirit to Independent Subspace Analysis

A) Basis vectors

Histogram of the fractional error of fit (sum of squares of the residuals divided by the sum of squares of the RFs) in simple cells as reported by DeAngelis

Despite finding a larger number of basis vectors, the model described a larger proportion of the shuffled data as noise, thereby fitting them more poorly. We evaluated the probability given to 50 new batches of 3000 frames each by the parameter distributions learnt from the shuffled and unshuffled data. As estimated by the VB approach, the probability assigned by the unshuffled model was more than

It is interesting to note that despite these deficiencies in the representation learnt from shuffled sequences, the basis vectors of the attribute variables still resembled simple cell RFs. This observation stands in contrast to results from previous models of complex cells based on temporal stability, which had assumed a hierarchical organisation similar to the classical energy model

We have investigated a new generative model for images that makes explicit the separation between the identity of a visual element and the attributes that determine its appearance. This structure within the model makes it possible to extract and bind together attributes that belong to the same visual element, and at the same time to construct an invariant representation of the element itself. We modelled identity with a set of binary-valued variables, each coding for the presence or absence of a different feature. Their appearances were described by manifolds, parametrised by a set of attribute variables. Both identity and attribute variables were assumed to exhibit temporal dependence within image sequences. We were also interested in determining the size of the model, i.e., the number of attribute and identity variables required to optimally describe the input data. This was achieved by performing a Bayesian analysis of the model, which avoids over-fitting and involves defining an appropriate prior distribution over the generating basis vectors. As a result, after convergence of an iterative algorithm, only the basis elements needed to effectively match the data remained active and all redundant attribute directions were pruned away. The algorithm was applied to natural image sequences in order to learn a low-level representation of visual scenes. The filters associated with the individual attribute variables were shown to have characteristics similar to those of simple cells in V1. The RFs of attributes associated with the same identity variable had similar positions, orientations, and frequencies, but different phases. As a consequence, the corresponding identity variable became invariant to phase change and behaved like a complex cell. In the standard energy model of complex cells and in several previous functional models, complex and simple cells form a hierarchy. Simple cells have the role of subunits and are regarded as an intermediate step on the way to the complex cell. Their phase-dependent information is then discarded as a first step towards the construction of an invariant representation. Here, complex and simple cells do not form a hierarchy but rather constitute two parallel, interacting populations of cells with two different functional roles: the former coding for the presence or absence of oriented features in their RFs, the latter describing local parameters of the features (mainly their phase). A formal analysis of the model reveals that, indeed, the interaction between identity and attribute variables in our model is richer than in the energy model. In addition to a quadratic term similar to the one in the energy model inside an exponential, the interaction includes a divisive normalisation term, and dependence on the statistics of natural input and the prior probability of the feature encoded by the identity variable being present (

In

The computational power of a class of models similar to the one in this paper has been investigated by Tenenbaum and Freeman

In the model described here, the appearance manifolds associated with each feature are linear, and they combine additively to form the image. These choices are a matter of computational tractability, and have two main limitations. First, the additive combination function

It may be possible to extend the model developed here so as to represent more complex visual elements. One approach is illustrated in

The dotted line represents cases where the attributes influence the presence of object parts. For example, in the case of a face seen from behind, the nose, mouth, and eyes would not be visible and thus would not need to be generated.

Algorithms related to the temporal stability principle have also been applied with some success to learning a high-level object representation

In the

This paper has presented a first step toward including constraints regarding the structure of the visual environment in computational models of vision. By taking into account the conceptual distinction between identity and attributes of visual elements, we were able to match more closely the physiological and anatomical organisation of V1. Further steps in this direction will hopefully lead us toward the development of a more complete, probabilistic account of visual inference.

The generative model describes the probability of a sequence

The generative process maps these hidden identity and attribute variables to observations according to Eq. 2. Assuming Gaussian noise with variance

The prior distributions over the variables were defined according to the intuitions described in the

Our intuition that objects are persistent in time is respected when the probability of remaining in the current state is larger than that of switching, i.e. when the transition probabilities

The matrices

The priors on the basis vectors

These zero-centred Gaussian prior distributions discouraged large components within the basis vectors. The widths of the distributions are set by the

For the remaining parameters we also chose conjugate priors. Conjugacy means that the posterior distribution has the same functional form as the prior, resulting in tractable integrals. Conjugate priors are intuitively equivalent to having previously observed a number of imaginary

Circles represent random variables, and squares represent hyperparameters; the grey-shaded circle represents the observed image; light grey nodes and symbols represent variables associated with neighbouring frames. The variables within the dashed rectangular box are those associated solely with the

In the Bayesian formulation the parameters of the model are formally equivalent to hidden variables, differing only in that their number does not increase with the number of data points. The goal of learning is then to infer the posterior joint distribution over variables and parameters given the data:
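In generic notation (the symbols here are chosen for illustration and are assumptions, not necessarily those used elsewhere in the paper), write $Y$ for the observed image sequence, $H$ for all hidden identity and attribute variables, and $\Theta$ for the parameters. Bayes' rule then gives the joint posterior as:

```latex
P(H, \Theta \mid Y)
  = \frac{P(Y \mid H, \Theta)\, P(H \mid \Theta)\, P(\Theta)}{P(Y)},
\qquad
P(Y) = \int \mathrm{d}H \, \mathrm{d}\Theta \;
       P(Y \mid H, \Theta)\, P(H \mid \Theta)\, P(\Theta)
```

Here the integral over $H$ is understood to include a sum over the binary identity variables. The normalising constant $P(Y)$ is in general intractable for a model of this kind, which is what motivates an approximate treatment.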

The maximisation over

The key factorisation underlying the VBEM algorithm (Beal, 2003) is the one between hidden variables and parameters

Given this basic factorisation, the algorithm proceeds in a way similar to Expectation Maximisation (EM) by iteratively inferring the hidden variable distribution

The input data to our model were taken from the CatCam videos

We initialised the model with 30 identity variables (

Parameters were identical for the fit to shuffled data, the only difference being that the selected frames were not consecutive in time. At the end of the VBEM iterations we compared the free energy of the original model to that of the time-shuffled model on a novel set of 50 batches of 3000 frames each, taken from the CatCam data as described above. The free energies were computed for each batch separately.
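The effect of the shuffling control on temporal structure can be sketched as follows. This is a minimal illustration on synthetic data rather than the CatCam frames: `make_smooth_sequence` and `lag1_autocorrelation` are hypothetical helpers, and the slowly varying sequence is an assumed stand-in for consecutive video frames.

```python
import random


def make_smooth_sequence(n_frames, n_pixels, rho=0.95, seed=0):
    """Synthetic frames in which each pixel follows a slow AR(1) process."""
    rng = random.Random(seed)
    scale = (1.0 - rho * rho) ** 0.5
    frame = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]
    frames = [frame]
    for _ in range(n_frames - 1):
        frame = [rho * v + scale * rng.gauss(0.0, 1.0) for v in frame]
        frames.append(frame)
    return frames


def lag1_autocorrelation(frames):
    """Correlation between each pixel and itself one frame later, averaged over pixels."""
    n_pixels = len(frames[0])
    total = 0.0
    for p in range(n_pixels):
        series = [f[p] for f in frames]
        a, b = series[:-1], series[1:]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
        var_a = sum((u - mean_a) ** 2 for u in a)
        var_b = sum((v - mean_b) ** 2 for v in b)
        total += cov / (var_a * var_b) ** 0.5
    return total / n_pixels


frames = make_smooth_sequence(n_frames=2000, n_pixels=9)

# Shuffle the frame order only: every frame is kept, but consecutive
# frames in the shuffled sequence are no longer neighbours in time.
shuffled = frames[:]
random.Random(1).shuffle(shuffled)

print("lag-1 autocorrelation, original:", lag1_autocorrelation(frames))
print("lag-1 autocorrelation, shuffled:", lag1_autocorrelation(shuffled))
```

Shuffling leaves the single-frame statistics untouched, since the same frames are present, while removing the frame-to-frame dependence that a temporally structured model could otherwise exploit.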

We also ran one additional fit (not shown) to check that the results obtained for shuffled data were not strongly influenced by our choice of priors on

In order to compare the properties of the learnt units to those of cortical neurons we proceeded in a way similar to that reported in the experimental literature. In electrophysiological recordings one does not have access to the complete input-output function of a neuron,

Given coloured noise data

For visualisation and analysis, the filters were projected back in image space using the pseudoinverse of the PCA matrix.
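The idea of estimating an RF as the best linear map from coloured noise input to a unit's response can be sketched in a few lines. This is a minimal sketch under stated assumptions: the "unit" is a hypothetical noisy linear filter rather than an inferred attribute variable, the input correlations follow an assumed AR(1) pattern across pixels, and all function names are illustrative. It shows why the raw response-weighted average must be corrected by the inverse input covariance when the input is not white.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def coloured_noise(n_samples, dim, rho, rng):
    """Gaussian samples whose neighbouring dimensions are correlated."""
    scale = (1.0 - rho * rho) ** 0.5
    data = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(dim - 1):
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        data.append(x)
    return data


def wiener_filter(inputs, outputs):
    """Return (raw response-weighted average, covariance-corrected filter)."""
    n, dim = len(inputs), len(inputs[0])
    mean_x = [sum(x[i] for x in inputs) / n for i in range(dim)]
    mean_y = sum(outputs) / n
    cov = [[0.0] * dim for _ in range(dim)]
    raw = [0.0] * dim
    for x, y in zip(inputs, outputs):
        dx = [x[i] - mean_x[i] for i in range(dim)]
        dy = y - mean_y
        for i in range(dim):
            raw[i] += dx[i] * dy / n
            for j in range(dim):
                cov[i][j] += dx[i] * dx[j] / n
    return raw, solve_linear(cov, raw)   # corrected = cov^{-1} raw


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5


rng = random.Random(0)
true_filter = [0.0, 1.0, -2.0, 1.0, 0.0, 0.0]   # hypothetical ground truth
stimuli = coloured_noise(4000, len(true_filter), rho=0.7, rng=rng)
responses = [sum(w * xi for w, xi in zip(true_filter, x)) + 0.1 * rng.gauss(0.0, 1.0)
             for x in stimuli]

raw, corrected = wiener_filter(stimuli, responses)
print("similarity to true filter, raw average:", cosine(raw, true_filter))
print("similarity to true filter, corrected  :", cosine(corrected, true_filter))
```

The raw average is blurred by the input correlations, whereas the corrected estimate recovers the underlying filter; for a unit whose response is not linear in its input, the same procedure returns the best linear approximation in the least-squares sense.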

Optimal parameters for the RFs derived in this way were computed by fitting a Gabor function to them. Gabor functions are defined as

The parameters

Basis vectors, filters, and Gabor fit of the main experiment

(0.08 MB PDF)

Comparison of the distribution of relative modulation in our results and in electrophysiological experiments

(0.01 MB PDF)

Basis vectors, filters, and Gabor fit for the time-shuffled experiment

(0.08 MB PDF)

Relation to the energy model of complex cells

(0.03 MB PDF)

Technical details of the identity/attribute model

(0.17 MB PDF)

Effect of dimensionality reduction

(0.19 MB PDF)

We thank Peter Latham and Yee Whye Teh for valuable comments on the manuscript, and Jörg Lücke for help with