
Conceived and designed the experiments: RCC PD OS. Performed the experiments: RCC OS. Analyzed the data: RCC. Wrote the paper: RCC PD OS.

The authors have declared that no competing interests exist.

Spatial context in images induces perceptual phenomena associated with salience and modulates the responses of neurons in primary visual cortex (V1). However, the computational and ecological principles underlying contextual effects are incompletely understood. We introduce a model of natural images that includes grouping and segmentation of neighboring features based on their joint statistics, and we interpret the firing rates of V1 neurons as performing optimal recognition in this model. We show that this leads to a substantial generalization of divisive normalization, a computation that is ubiquitous in many neural areas and systems. A main novelty in our model is that the influence of the context on a target stimulus is determined by their degree of statistical dependence. We optimized the parameters of the model on natural image patches, and then simulated neural and perceptual responses on stimuli used in classical experiments. The model reproduces some rich and complex response patterns observed in V1, such as the contrast dependence, orientation tuning and spatial asymmetry of surround suppression, while also allowing for surround facilitation under conditions of weak stimulation. It also mimics the perceptual salience produced by simple displays, and leads to readily testable predictions. Our results provide a principled account of orientation-based contextual modulation in early vision and its sensitivity to the homogeneity and spatial arrangement of inputs, and lend statistical support to the theory that V1 computes visual salience.

One of the most important and enduring hypotheses about the way that mammalian brains process sensory information is that they are exquisitely attuned to the statistical structure of the natural world. This allows them to come, over the course of development, to represent inputs in a way that reflects the facets of the environment that were responsible. We focus on the case of information about the local orientation of visual input, a basic level feature for which a wealth of phenomenological observations are available to constrain and validate computational models. We suggest a new account which focuses on the statistics of orientations at nearby locations in visual space, and captures data on how such contextual information modulates both the responses of neurons in the primary visual cortex, and the corresponding psychophysical percepts. Our approach thus helps elucidate the computational and ecological principles underlying contextual processing in early vision; provides a number of predictions that are readily testable with existing experimental approaches; and indicates a possible route for examining whether similar computational principles and operations also support higher-level visual functions.

Contextual influences collectively denote a variety of phenomena associated with the way information is integrated and segmented across the visual field. Spatial context strongly modulates the perceptual salience of even simple visual stimuli, as well as influencing cortical responses, as early as in V1

At least two main lines of theoretical inquiry have addressed these influences from different perspectives. First, computational models have related perceptual salience to low-level image features

However, although collectively covering a huge range of computational, psychophysical, and neural data, these two theoretical approaches have not been unified. To this end, we introduce a computational model of the statistical dependencies of neighboring regions in images, rooted in recent advances in computer vision

Top row: cartoon of a divisive normalization model that accounts for surround modulation of V1 responses. In a textured, homogeneous visual stimulus, the center and surround of a V1 neuron's RF (schematically illustrated by red and orange circles, respectively) receive similar inputs. The model pools together the corresponding outputs (computed by oriented linear filters), and combines them (here generically denoted by a function

We focused in particular on the dependencies between orientations across space, optimizing model parameters based on natural scenes. We used the resulting model to simulate neural and perceptual responses to stimuli used in physiological and perceptual experiments to test orientation-based surround modulation. Note that we did not fit the model to experimental data from individual cells or subjects, but rather compared the qualitative behavior of a model trained on ecologically relevant stimuli with general properties of early visual surround modulation.

There is a wealth of experimental results on surround modulation, some classical, and some still subject to debate. We chose to model a set of findings that we take to be canonical examples of both sorts of results. We included data that have been the subject of previous theoretical treatments, but paid particular attention to phenomena that lie at the boundary between integration and segmentation. Indeed, one claim arising from our approach is that subtleties at this border might help explain some of the complexities of the experimental findings. We make predictions for regions of the stimulus space that have not yet been fully tested in experiments.

In sum, we show that the statistical principles introduced can account for a range of neural phenomena that demand tuned surround suppression as well as facilitation, and encompass V1 as a salience map

We first illustrate a class of characteristic statistical dependencies across space in natural scenes, and we outline a model of such dependencies. We provide implementation details and model equations, and describe how model parameters were optimized for an ensemble of natural scenes. Further, we explain how the model relates to contextual modulation in the visual cortex. We specifically focus on relating the statistical model of images to V1 neural firing rates, and also illustrate how this constitutes a generalized form of divisive normalization.

We concentrated on the statistical dependencies between V1-like filters (or receptive fields, RFs; see also

Conditional histograms of the output of one RF (x_1) given the output of the other RF (x_2). We computed these conditional histograms based on 100,000 Gaussian white noise image patches, normalizing the histogram of x_1 for each given value of x_2. We matched the average RMS contrast of noise and natural images; the larger range of RF responses to natural images reflects the abundance of oriented features that are optimal for the RFs. The insets illustrate the orientation and relative position of the RFs: the bowtie shape of the histograms shows that the variance of x_1 depends on the magnitude of x_2, which is typical of natural images. Further, we report the Pearson correlation coefficient between x_1 and x_2 at the bottom: the stronger linear dependence between collinear filters reflects the predominance of elongated edges and contours in scenes.

We extended a well-known probabilistic model of the variance dependence of x_1 and x_2, the Gaussian Scale Mixture (GSM), in which the RF outputs are expressed as x_1 = v g_1 and x_2 = v g_2, where (g_1, g_2) are zero-mean, correlated Gaussian variables and v is a positive scalar (the mixer) shared by both outputs.
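The bowtie dependence that motivates the GSM can be reproduced with a few lines of simulation. This is a sketch under assumed parameter values (a Rayleigh prior on the mixer and a linear correlation of 0.3 between the Gaussian variables, neither taken from the fitted model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Correlated Gaussian pair (g1, g2), as for collinear filters;
# the correlation value here is illustrative, not fitted to scenes.
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])
g = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# A shared positive mixer v scales both variables (Rayleigh is one
# common choice of mixer prior; the model's exact prior may differ).
v = rng.rayleigh(scale=1.0, size=n)
x1, x2 = v * g[:, 0], v * g[:, 1]

# The shared mixer induces a dependence between the magnitudes of x1
# and x2: the conditional variance of x1 grows with |x2|, producing
# the "bowtie" shape of the conditional histograms.
corr_mag = np.corrcoef(np.abs(x1), np.abs(x2))[0, 1]
print(f"correlation of magnitudes: {corr_mag:.2f}")
```

Plotting a 2D histogram of x_1 against x_2 from this simulation recovers the bowtie shape described above.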

We then introduced a variant of the GSM (see also

We implemented the model with a bank of 72 bandpass oriented filters or RFs taken from the first level of a steerable pyramid
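As a rough sketch of this linear front end (not the paper's actual filters: Gabor-like filters stand in for the steerable-pyramid subbands, and the filter size, frequency, and surround-ring radius below are illustrative assumptions):

```python
import numpy as np

def oriented_filter(size=16, theta=0.0, sigma=2.0, freq=0.25):
    """A Gabor-like bandpass filter; a stand-in for one subband of the
    first level of a steerable pyramid."""
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half] + 0.5
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    f = env * np.cos(2 * np.pi * freq * xr)
    return f - f.mean()          # subtract the mean: zero-DC, i.e. bandpass

# 4 orientations at the center; 8 surround positions on a ring mark
# where surround RFs would sit (the ring is not used further here).
thetas = [k * np.pi / 4 for k in range(4)]
ring = [(8 * np.cos(a), 8 * np.sin(a))
        for a in np.linspace(0, 2 * np.pi, 8, endpoint=False)]

def rf_outputs(patch):
    """Dot products of a patch with the 4 centered oriented filters."""
    return np.array([np.sum(patch * oriented_filter(size=patch.shape[0],
                                                    theta=t))
                     for t in thetas])
```

For example, a patch matching the 0-radian filter drives that orientation channel most strongly, as expected of an oriented RF bank.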

We used a bank of linear filters, depicted as colored bars in the top row, comprising 4 orientations at 1 center position and 8 surround positions. We grouped surround RFs according to their orientation, each labeled with a different color. A surround group can be either co-assigned with the center group (i.e., the model assumes dependence between center and surround groups, and includes them both in the normalization pool for the center, as in

The parameters governing RF interactions – i.e. the covariance matrices and the prior probability of each component of the MGSM – were optimized by maximizing the likelihood of an ensemble of natural scenes (downloadable from

Top row: co-assigned components (from left to right,

However, to explore the effects of cardinal axis biases and variability across different scenes, we also trained the model separately on each of 40 natural images from the Berkeley Segmentation Dataset (

To relate the model to contextual modulation in the visual cortex we assume that firing rates in V1 represent information about the Gaussian variables

The MGSM is a statistical model of images, which is sometimes called a generative or graphics model

In practice, we obtained model responses in two steps: we first collected the linear RF outputs for a given input stimulus, and then used them to compute the Gaussian estimates; this is an abstract schematization of how V1 firing rates are produced, conceptually similar to the canonical linear-nonlinear scheme

For a given input stimulus, we first collected the outputs of the center RFs x_k; to ensure numerical stability, we added a small constant (10^{−10}), several orders of magnitude smaller than the smallest value observed for the other term under the square root.

Finally, we inferred the posterior probability,

The inferred model responses (Equations 2,4,5) thus constitute a form of divisive normalization
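The divisive form of the inferred responses can be sketched as follows. This is a deliberately simplified estimator that keeps only the structure described here (filter outputs divided by an inverse-covariance-weighted pooled signal); the paper's exact posterior-mean expression includes additional prior-dependent factors that we omit:

```python
import numpy as np

def gsm_normalized_response(x, cov, eps=1e-10):
    """Simplified divisive-normalization estimate of the Gaussian
    variables in a single GSM component: the outputs x are divided by a
    normalization signal that pools all outputs weighted by the inverse
    covariance, so strongly correlated RFs contribute less to it."""
    x = np.asarray(x, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Mahalanobis norm of the outputs (the small eps guards against
    # division by zero for blank inputs).
    lam = np.sqrt(x @ np.linalg.solve(cov, x) + eps)
    return x * np.sqrt(x.size) / lam
```

With independent RFs (identity covariance), adding a surround output shrinks the center response (suppression); with large center-surround covariance, the surround's contribution to the pooled signal is discounted and the suppression is weaker.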

However, our model generalizes standard divisive normalization in two substantial ways. First, consider the effect of covariance. Normalization allows cells to discount a global stimulus property that is shared across RFs, i.e. contrast in simple divisive normalization, or the mixer value in the GSM. The mixer corresponds to RFs that are statistically coordinated and tend to be high or low together in their absolute value. However, in the generative model, large outputs from RFs that often respond together (i.e. RFs with large covariance or linear correlations) could be generated either by a large value of the mixer or by similar draws from the correlated Gaussians; whereas similar, large outputs from RFs that rarely respond together (small covariance) are more likely to have been generated by a large value of the mixer. Therefore, linearly correlated RFs should contribute less to the estimate of the mixer (which is loosely proportional to the normalization signal). This arises in the model since the covariance matrices learned from scenes weight the contribution of the RFs to the normalization signal through the inverse covariance Σ^{−1}

Second, Equation 2 uses a stimulus-dependent normalization pool: for any given input stimulus, only RFs that are inferred to be statistically coordinated, and thus to share a common mixer, are jointly normalized (the normalization pool comprises different RFs for each mixture component in Equations 4,5). The same RFs can be coordinated for some stimuli and not for others.
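The stimulus-dependent pool can be illustrated by mixing two normalization regimes, weighted by the co-assignment probability. The estimator below is a simplified stand-in for the posterior-mean responses of Equations 3-5, and the numerical values in the usage are hypothetical:

```python
import numpy as np

def mgsm_response(x_center, x_surround, cov_joint, cov_center,
                  p_coassign, eps=1e-10):
    """Posterior-weighted mixture of two normalization pools. When center
    and surround are co-assigned, surround outputs join the pool;
    otherwise the center is normalized on its own. A simplified divisive
    estimator replaces the model's exact posterior mean."""
    x_joint = np.concatenate([x_center, x_surround])
    # Co-assigned: surround included in the normalization pool.
    lam_joint = np.sqrt(x_joint @ np.linalg.solve(cov_joint, x_joint) + eps)
    r_co = x_center[0] * np.sqrt(x_joint.size) / lam_joint
    # Not co-assigned: center normalized by center RFs only.
    lam_c = np.sqrt(x_center @ np.linalg.solve(cov_center, x_center) + eps)
    r_no = x_center[0] * np.sqrt(x_center.size) / lam_c
    return p_coassign * r_co + (1 - p_coassign) * r_no
```

With a strong surround output, driving the co-assignment probability from 0 to 1 moves the response from the unsuppressed to the surround-suppressed regime, mirroring the (dis)engagement of surround modulation described in the text.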

The computation involved, which in the model is distinct from the evaluation of the corresponding normalization signals (Equation 6), does not necessarily have to be segregated in the biological system, and may be achieved by a number of neural mechanisms. One possibility may be that the normalization signals are computed by inhibitory interneurons that pool the outputs of distinct subpopulations: the different firing thresholds or diversity across types of interneurons

For a given input stimulus, we inferred the posterior co-assignment probabilities of each mixture component,
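This co-assignment inference is Bayes' rule over the mixture components. A minimal sketch follows, with zero-mean Gaussians standing in for the component densities (a simplification: the true MGSM components additionally marginalize over the mixer):

```python
import numpy as np

def log_gauss(x, cov):
    """Log density of a zero-mean multivariate Gaussian, used here as a
    simplified stand-in for each mixture component's density."""
    x = np.asarray(x, dtype=float)
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet
                   + x @ np.linalg.solve(cov, x))

def coassignment_posterior(x, covs, priors):
    """Bayes' rule over mixture components: P(k | x) is proportional to
    xi_k p_k(x), with one covariance per component and priors xi_k."""
    logp = np.array([np.log(p) + log_gauss(x, c)
                     for c, p in zip(covs, priors)])
    logp -= logp.max()          # log-domain shift for numerical stability
    w = np.exp(logp)
    return w / w.sum()
```

With one component encoding a dependent (correlated) center-surround pair and one encoding independence, similar center and surround outputs favor co-assignment, while dissimilar outputs favor the independent component, matching the intuition in the text.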

Note also that, unlike standard divisive normalization, which is inherently suppressive, our model encompasses both surround suppression and facilitation, as summarized in

Modulation of simulated model responses when the surround is co-assigned (R_{co-assign}) relative to when it is not (R_{no-assign}), quantified by the modulation index (R_{co-assign}−R_{no-assign})/R_{no-assign}.

The full distribution of the RF variables under the MGSM is given by the mixture model p(x_c, x_S) = Σ_k ξ_k p_k(x_c, x_S), where ξ_k denotes the prior probability of the k-th assignment component and p_k the corresponding GSM density over the center outputs x_c and surround outputs x_S.

Similarly, we can derive the distribution conditional on

The parameters governing RF interactions – i.e. the covariance matrices and the prior probability of each component of the MGSM – need to be optimized for an ensemble of natural scenes. The training data were obtained by randomly sampling 25000 patches from an ensemble of 5 natural images from a database of standard images used in image compression benchmarks (known as Einstein, boats, goldhill, mountain, crowd; see
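The likelihood maximization can be sketched with expectation-maximization. The version below fits a mixture of zero-mean Gaussians rather than the MGSM proper (whose E- and M-steps must additionally handle the mixer variable), so it only illustrates the structure of the training loop:

```python
import numpy as np

def fit_mixture_covariances(X, n_components=2, n_iter=50, seed=0):
    """Simplified EM sketch: learn component covariances and prior
    probabilities from patches X of shape (n_samples, n_dims). Each
    component is a zero-mean Gaussian here, a stand-in for a GSM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    priors = np.full(n_components, 1.0 / n_components)
    # Initialize with diagonally-jittered copies of the data covariance.
    base = X.T @ X / n
    covs = [base + 0.1 * np.diag(rng.random(d)) for _ in range(n_components)]

    for _ in range(n_iter):
        # E-step: responsibilities via Bayes' rule, in the log domain.
        logp = np.empty((n, n_components))
        for k in range(n_components):
            sign, logdet = np.linalg.slogdet(covs[k])
            maha = np.einsum('ij,ij->i', X @ np.linalg.inv(covs[k]), X)
            logp[:, k] = np.log(priors[k]) - 0.5 * (logdet + maha)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update priors and responsibility-weighted covariances.
        nk = resp.sum(axis=0)
        priors = nk / n
        covs = [(X * resp[:, k:k + 1]).T @ X / nk[k]
                for k in range(n_components)]
    return covs, priors
```

Run on synthetic patches drawn from two covariance regimes, the loop recovers valid priors and symmetric covariance estimates; the full training algorithm is given in the supporting text.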

Our hypothesis, illustrated schematically in

First, the model uses a normalization pool (i.e., the group of RFs whose outputs contribute to the normalization signal) that depends on the input stimulus (see Equations 2,4,5): when center and surround RF outputs are statistically coordinated (i.e. they lead to a high posterior co-assignment probability; Equation 10), then both are included in the normalization pool for the center; whereas, when they are independent, the surround RFs are excluded from the normalization pool. For a given input stimulus, we measured the degree of homogeneity (and therefore involvement of the surround RFs) by the probability of co-assignment. Intuitively, this probability is larger for stimuli that elicit similar outputs for center and surround RFs; and this indeed turns out to be true of Bayesian inference in our model. In addition, the probability tends to be larger for high contrast. The grating patch of

Second, the linear correlations between center and surround RFs, captured by their covariance matrix Σ, enter the normalization signal through its inverse Σ^{−1} (see Equation 6). Thus, the normalization signal can be reduced, and the model response enhanced, when the covariance between RFs is large or when a given RF variance is larger (as described in

We learned the parameters of the model (i.e., the covariance matrices and the prior co-assignment probabilities) entirely from the natural scenes, and then fixed them. We first evaluated the model by comparing its response (Equation 3) to presentations of different forms and sizes of sinusoidal gratings with those described in previous neurophysiological studies of spatial contextual modulation. Physiology experiments have made extensive use of gratings to show that stimuli presented in regions of visual space that do not drive the neuron (i.e. the surround) can still strongly modulate the responses to a stimulus presented within the RF.

First, we tested gratings of variable size and contrast whose orientation and spatial frequency match the chosen RF. Experiments in cat and macaque monkey V1

The contrast-dependence of size tuning has previously been ascribed to divisive normalization

We next addressed three sets of data related to the orientation tuning of surround modulation (

We first verified that the model produces contrast-invariant orientation tuning curves (

We also explored the model behavior as a function of the stimulus contrast and the sizes of the central grating patch and surround annulus (see

The responses of the model (

Interestingly, the presence in the model of multiple surround groups also allowed it to reproduce data on surround modulation of a non-optimal center stimulus.

Finally, we compared the orientation tuning curves measured with small gratings confined to the center RF, and large homogeneous (single orientation) gratings that also covered the surround. Chen et al.

We then addressed the spatial organization of surround modulation, to illustrate the interplay between the assignment process and the spatial structure of correlations learned by the model from natural images. Different regions of the RF surround can produce different levels of modulation when stimulated with grating patches oriented similarly to the center (called positional bias). Walker et al.

To elucidate the origins of this response of the model, we systematically tested a larger range of contrast and size values than has so far been experimentally examined (

Our model explains the observed spatial asymmetry of modulation by the form of the covariance matrices learned from natural scenes. In the mixture component with dependent center and vertical surround groups, the variances of the vertical center RF and its collinear neighbors were similar, and higher than the variances of the parallel neighbors (

Collinear facilitation with grating stimuli has been observed in V1, although with the contrast not matched between center and surround; for instance, Polat et al.

In addition to the effect we discussed, Cavanaugh et al.

However, there is considerable scatter of positional biases due to cell-to-cell variability in

The data presented so far provided an illustration of the biological significance of the two key components of the model, i.e. the assignment and the covariances. As a control, we presented the same stimuli to a reduced model that ignored the assignments and assumed that center RFs were always normalized by the vertical surround group. Such a reduced model produced a much weaker contrast-dependence of size tuning of the surround compared to

We then asked whether the same statistically motivated computational principles underlie perceptual salience: local image regions are deemed salient when e.g. they can be detected more readily, appear to have higher luminance, or attract human gaze more often compared to the remaining parts of the image (see

To address saliency effects, we combined model responses corresponding to all four center orientations, as is typical in saliency computations. Specifically, for a given image, at each input location we computed the responses as in Equation 3 separately for each orientation, and then took the maximum
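The max-combination across orientation channels is straightforward; the sketch below uses hypothetical response values for a pop-out display (one horizontal target on a vertical-texture background), not actual model output:

```python
import numpy as np

def salience_map(responses):
    """Combine model responses across orientations into a salience map:
    at each location, take the maximum response over the center
    orientation channels (responses has shape (n_orientations, H, W))."""
    return np.asarray(responses).max(axis=0)

# Toy illustration with made-up numbers: a uniform vertical background
# and one horizontal target at the center. The target drives a different
# orientation channel and, in the full model, is normalized less by its
# dissimilar surround, so its response is larger and it pops out.
H = W = 5
resp = np.zeros((4, H, W))
resp[0] = 1.0            # vertical channel: uniform background response
resp[2, 2, 2] = 1.8      # horizontal channel: less-suppressed target
resp[0, 2, 2] = 0.0      # no vertical energy at the target location
sal = salience_map(resp)
```

The resulting map is flat except at the target location, where the salience peak marks the pop-out.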

The stimulus shown in

Another important class of saliency effects involves collinear facilitation. We tested the model with simple arrangements of bars that are known to produce higher perceptual salience for collinear (

First, we consider the bars on the boundary regions. In

In addition, we observed a larger salience enhancement for the middle row of bars, relative to the homogeneous region, when they were collinear (

The two examples above are to some extent combined in the border effect in stimuli such as

We have addressed spatial contextual influences in early vision by introducing a statistical model that is rich enough to capture the coordination that exists amongst spatially distributed V1-like RFs in natural scenes. Our model took explicit account of the spatial heterogeneity of scenes. Inference in the model amounted to a generalized form of divisive normalization from each RF's surround. The model reproduced a host of V1 neural surround nonlinearities, including both suppression and facilitation, such as RF expansion at low contrast (

Many important mechanistic accounts of contextual interactions have included ideas about excitatory and inhibitory connections and their links to divisive normalization, e.g.

In addition, surround modulation was more often engaged when stimulating collinear than parallel regions of the surround; but, when surround modulation was engaged, collinear responses were enhanced relative to parallel. These behaviors, due to the form of the covariances learned from natural scenes, may also find a basis in the known specificity and anisotropy of horizontal (e.g.

Two main statistical innovations were critical for capturing the full set of biological data. First, our model provides a formal treatment of the idea of statistical homogeneity vs heterogeneity of visual inputs. We proposed that V1 is sensitive to whether center and surround responses to a given stimulus are inferred to be dependent or independent based on the statistical structure of natural scenes. This sensitivity is actually required to capture the stronger statistical dependence apparent within spatially extended visual objects. In our model, it implies that surround modulation is fully engaged by stimuli comprising extended single objects, but disengaged by stimuli with entirely independent center and surround.

This (dis)engagement proved to be crucial to obtain the neural effects in

The flexibility of the normalization pool also implied that the model could exhibit surround facilitation. While a strong (and co-assigned) surround was suppressive, a weak (and co-assigned) surround was facilitative, when compared to responses in the absence of surround (not co-assigned). The magnitude of the modulation was larger for weaker center RF outputs (

The second key statistical feature in our model, namely the RF covariance matrices, reflects the structure of linear correlations found in natural images. For image patches that the model determined to be homogeneous, we observed larger variance for the central RF and its collinear neighbors, and larger covariance between the collinear neighbors, than for any other surround location (

At high contrast, collinear surrounds suppress targets more than parallel surrounds. The model reproduced this characteristic (

Furthermore, the model generated salience maps that exhibited collinear enhancement and texture border assignment with simple bar displays (

Our statistical modeling relates most closely to a number of recent powerful approaches in computer vision and statistical learning which have not been directly applied to contextual effects in neuroscience. These include hierarchical models for unsupervised learning of statistical structure in images (e.g.

The model of Schwartz et al.

Certain other statistically-minded approaches to computer vision have addressed the implications for biological vision. Schwartz and Simoncelli

A natural extension of our approach would be to optimize the sampling of visual features and test how well the model predicts V1 responses to complex scenes; and their effects on perceptually assessable factors such as salience detection under controlled manipulation of the assignments, and gaze patterns on natural images

The natural images used to train the model.

(TIF)

Orientation tuning of the center is contrast-invariant. Model tuning curves at different contrasts are approximately scaled versions of each other, a characteristic feature of V1 tuning. This holds across different stimulus sizes, both smaller and larger than the receptive field size.

(EPS)

Perceptual pop-out in the model depends on the bars' length, separation, and contrast. The red box corresponds to the configuration used in

(TIF)

Equations and examples of the linear filters used in the simulations.

(PDF)

Mathematical details on the training algorithm.

(PDF)

Extended simulation results on surround orientation tuning as a function of contrast, and center and surround stimulus size.

(PDF)

Additional simulation results on the positional bias. Cardinal axis effects and variability across different natural images can produce a scatter of positional biases consistent with that observed in V1 populations.

(PDF)

We are very grateful to M. Cohen, A. Kohn, A. Sanborn, M. Smith, J. Solomon, C. Williams, and L. Zhaoping for discussion and helpful comments on the manuscript; and to J.A. Movshon and J.R. Cavanaugh for providing us with the data reproduced in