Fig 1.

Crowding.

a. In crowding, the perception of a target element deteriorates in the presence of nearby elements. When fixating the left cross, the target letter V on the right is hard to identify because of the nearby flankers. b. The task is easier than in (a) because the flankers are farther away from the target letter V. Bouma’s law states that crowding occurs only when flankers are sufficiently close to the target, within the so-called Bouma’s window. c. Crowding is a ubiquitous phenomenon since elements are rarely seen in isolation. For example, when fixating the central red dot, the child on the left is easier to detect because, unlike the child on the right, it is not surrounded by nearby flankers.

Fig 2.

a. Standard view of visual processing. First, edges are detected by low-level neurons with small receptive fields. Higher-level neurons pool signals from lower-level neurons in a hierarchical, feedforward manner, creating higher-level representations of objects by combining low-level features [25,26]. For example, two low-level edge detectors may be combined to create a “corner” representation. Four such corner detectors can be assembled to create a rectangle representation. Receptive field size naturally increases along this pathway since, for example, a rectangle covers a larger part of the visual field than the lines making up the rectangle. b. Uncrowding. Observers performed a vernier discrimination task. The y-axis shows the threshold at which observers correctly discriminate the vernier offset in 75% of trials (so performance is good when the threshold is low). First, only a vernier is presented, an easy task (performance for this condition is shown as the dashed horizontal line). Then, a flanking square is added, making the task much more difficult (leftmost stimulus). This is a classic crowding effect. Importantly, adding more flanking squares gradually improved performance: the more squares were presented, the better the performance [19]. We call this effect uncrowding. c. The global configuration of the entire stimulus determines crowding. Performance is strongly affected by elements far away from the target, as shown in these examples [15]. d. Performance is not determined by local interactions only. In this display, fine-grained vernier acuity of about 200” depends on elements as far away as 8.5 degrees, a difference of two orders of magnitude, extending far beyond Bouma’s window.
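
The two-orders-of-magnitude figure in (d) can be checked by expressing both quantities in arcseconds (1 degree = 3600 arcseconds). This is only a worked conversion of the numbers quoted in the caption:

```latex
\[
8.5^{\circ} \times 3600''/{}^{\circ} = 30\,600'', \qquad
\frac{30\,600''}{200''} = 153 \approx 10^{2.2}
\]
```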

Fig 3.

Tested models and their characteristics.

Models may integrate information locally or globally, and the interference mechanism may be pooling, substitution, or other. Models are feed-forward or recurrent, and may or may not compute grouping-like aspects of the stimulus. The aim of the current work is to investigate which models can explain the global effects of crowding.

Fig 4.

Stimulus categories.

We used 40 different stimuli from 11 different categories. The task was always to report the offset direction of the central vernier. This figure shows one example from each category. The stimulus database is tailored to test for global effects such as uncrowding. Human data were taken from previous work [10,11,15,17,19,20]. Human and model results are summarized in the discussion (Fig 14 shows the results for all stimuli and models).

Fig 5.

Epitomes.

a. Illustration of the epitome model. An image (left) is compressed into an epitome (center), a summary of local features. The image on the right is reconstructed from the epitome. b. As an example of the classic texture evaluation, we show the stimulus and reconstructed image for the 1- and 7-square conditions. Human vernier offset thresholds are better (lower) for the 7-square than for the 1-square condition. The model does not produce uncrowding because the vernier offset direction in the output is not easier to make out in the 7-square than in the 1-square case (according to the authors’ judgment). c. Example of our performance measure. Human and model thresholds (see main text for how the model threshold was computed) for the vernier alone, single square and 7 squares conditions. The model’s 7-square threshold is higher than its 1-square threshold, in contrast with human performance. Note: the model outputs a number quantifying how different the left and right vernier offset versions of the input are (so the higher this difference, the better the performance). To make comparison with the human threshold easier, we applied the following monotonic transformation to the output: “threshold-like output” = 1/“raw output”. Then, we scaled the result to be in the same range as the human results. This monotonic re-scaling cannot change the conclusions because monotonic outputs are mapped onto monotonic performance, and the same is true for U-shaped functions (see methods).
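
The transformation described in the note to (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `threshold_like`, the example raw outputs, and the human range of 100 to 1000 are hypothetical, and linear min-max scaling is only one assumed way of bringing values "into the same range".

```python
def threshold_like(raw_outputs, human_min, human_max):
    """Map raw model outputs (higher = better) to threshold-like values (lower = better).

    Assumes every raw output is strictly positive.
    """
    # Step 1: order-reversing monotonic transform,
    # "threshold-like output" = 1 / "raw output".
    inverted = [1.0 / r for r in raw_outputs]

    # Step 2: linear rescaling into the human threshold range
    # (min-max scaling is an assumption made for this sketch).
    lo, hi = min(inverted), max(inverted)
    if hi == lo:
        # All outputs identical: no spread to rescale.
        return [human_min for _ in inverted]
    scale = (human_max - human_min) / (hi - lo)
    return [human_min + (v - lo) * scale for v in inverted]


# Hypothetical raw outputs for: vernier alone, 1 square, 7 squares.
raw = [4.0, 2.0, 1.0]
thresholds = threshold_like(raw, human_min=100.0, human_max=1000.0)
print(thresholds)
```

Because both steps are monotonic, the ordering of conditions is preserved (reversed once by the inversion), so a model whose raw output keeps falling as squares are added still shows a rising threshold after the transformation.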

Fig 6.

Texture Synthesis and Texture Tiling Model.

a. A texture (right) synthesized from the input on the left using the Portilla & Simoncelli [29] summary statistics. The output resembles crowding. Pooling- and substitution-like effects occur. b. In the Texture Tiling Model (TTM), instead of applying the summary statistics process to the whole image at once, only local patches of the image are processed, yielding a local summary statistics model. The local patches are thought to reflect V2 receptive fields. c. Whole-field summary statistics. From left to right: stimuli and Portilla & Simoncelli textures for the vernier, 1-square and 7-square conditions. The vernier offset is easy to determine from the texture in the vernier alone condition, and slightly harder in the crowded condition (a right-offset is discernible in the middle top of the display). Across all data, the model consistently produces crowding but no uncrowding, as exemplified by the rightmost condition, in which no offset is visible at all. d. Texture Tiling Model. The left column shows three synthesized examples from the 1-square condition. On the right is the 7-square case. The model cannot produce uncrowding: since the stimulus on the right is less crowded than the stimulus on the left in the human data, the direction of the vernier should be easier to make out on the right than on the left. However, this is not the case.

Fig 7.

Deep textures.

a. In the deep textures algorithm, the correlation between a deep neural network’s unit activities is used as a summary statistic. Textures are then synthesized to match that statistic. b. Original stimuli and textures synthesized from these stimuli using the deep textures algorithm by Gatys et al. [35]. The vernier offset is poorly visible. Therefore, despite its clear success at synthesizing textures, the model in its present form is not suitable for modeling crowding with our stimuli. We tried different zooms on our stimuli but the results did not change.

Fig 8.

Wilson and Cowan network with end-stopped receptive fields.

a. Structure of the network in [39], which we augmented with end-stopped receptive fields. An excitatory and an inhibitory layer of neurons are activated by the stimulus and interact with one another. The output of the excitatory layer is cross-correlated with a vernier template to measure performance. b. Output for the squares category (with psychometric function fitted on the squares category). In accordance with human results, performance is better in the 7 squares than in the 1 square case. c. Output for the irregular category (with psychometric function fitted on the squares category). Performance is marginally better in the 7 irregular1 than in the 1 irregular1 case. d. Output for the stars category (with psychometric function fitted on the squares category). There is no uncrowding for this stimulus. Uncrowding occurs only for specific kinds of stimuli, where element size regularities seem important. Further, performance depends strongly on which data are used for the training set (i.e., for fitting the psychometric function), which suggests overfitting. e. Model output images. Columns are different stimuli: vernier, 1 square and 7 squares. The first row shows the stimuli, and the three subsequent rows show the model output for the short, medium and long end-stopped receptive fields. The crucial result is that the vernier is better represented in the short and medium populations in the 7 squares than in the 1 square condition (i.e., uncrowding occurs). As mentioned, uncrowding occurred for very few stimulus categories. In cases that did not show uncrowding, the vernier representation deteriorated further when flankers were added (see results on the online repository). Note: the model outputs a cross-correlation quantifying how similar the model output is to the model output in the vernier alone condition (so the higher this cross-correlation, the better the performance). To make comparisons with human thresholds easier, we applied the same linking hypothesis as Hermens et al. [39]: we fitted a psychometric function to link model outputs to behavioural results, as explained in the main text.
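
The cross-correlation readout mentioned in the note can be illustrated with a small sketch. The caption does not specify the normalization, so the zero-lag, mean-subtracted (Pearson-style) form below is an assumption, and the function name `normalized_cross_correlation` and the toy 3x3 "model outputs" are hypothetical. The psychometric-function fit is not shown.

```python
import math


def normalized_cross_correlation(output, template):
    """Zero-lag normalized cross-correlation between two equally sized images.

    Returns a value in [-1, 1]; 1 means the output matches the template
    up to brightness and contrast (assumed normalization for this sketch).
    """
    a = [p for row in output for p in row]
    b = [p for row in template for p in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a flat image carries no pattern to correlate
    return numerator / (norm_a * norm_b)


# Hypothetical model output for the vernier alone (used as the template).
vernier_alone = [[0, 1, 0],
                 [0, 1, 0],
                 [0, 0, 1]]
# Hypothetical output in which flanker activity corrupts the top row.
crowded = [[1, 1, 1],
           [0, 1, 0],
           [0, 0, 1]]

same = normalized_cross_correlation(vernier_alone, vernier_alone)
degraded = normalized_cross_correlation(crowded, vernier_alone)
print(same, degraded)
```

A higher value means the vernier is better preserved in the model output, which is the sense in which the note equates high cross-correlation with good performance.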

Fig 9.

V1 Segmentation model.

a. The input is sampled at each grid position by neurons tuned to 12 orientations, mimicking V1 simple cells. b. The connectivity pattern between cells depends on their relative position and orientation as shown here. Solid lines indicate excitation and dashed lines indicate inhibition. As shown, each neuron excites aligned neurons and inhibits non-aligned neurons. Each neuron has the same connectivity pattern, suitably rotated and translated. c. Output images for the squares category. Each small oriented bar shows the maximally active orientation at this grid position. d. Results for the squares category. The dashed red bar shows the vernier threshold, which is matched for humans and the model. As shown, uncrowding does not occur in the model, because performance is worse for the 7 squares than for the 1 square stimulus. Note: the model outputs a cross-correlation quantifying how similar the model output is to the model output in the vernier alone condition (so the higher this cross-correlation, the better the performance). To make comparison with the human threshold easier, we applied the same procedure as we did for the epitomes, i.e., we applied the following monotonic transformation to the output: “threshold-like output” = 1/“raw output”. Then we scaled the result to be in the same range as the human results. This monotonic re-scaling does not change the conclusions: it cannot create or remove uncrowding.

Fig 10.

The LAMINART variation.

a. Activity in the LAMINART model. Colors represent the most active orientation (red: vertical, green: horizontal). When a stimulus is presented, segmentation starts to propagate along connected (illusory or actual) contours from two locations marked by attentional selection signals. Visual elements linked together by illusory contours form a group. After dynamic, recurrent processing, the stimulus is represented by three distinct neural populations, one for each group. Crowding is high if other elements are grouped in the same population as the vernier, and low if the vernier is alone. On the left, the flanker is hard to segment because of its proximity to the vernier. Across trials, the selection signals often overlap with the whole stimulus, which is then treated as a single group. Therefore, the flanker interferes with the vernier in most trials, and crowding is high. On the right, the flankers are linked by illusory contours and form a group that spans a large surface. In this case, segmentation signals can easily hit the flanker group without hitting the vernier. The vernier thus ends up alone in its group in most trials and crowding is low. b. The left row shows human performance with the square flanker stimuli. The right row is the output of the LAMINART model. It fits the data very well. The same holds true for a majority of our stimuli. To compute the LAMINART’s output values, we used the same linking hypothesis as in the original description of the model [45]: template matching is used to decide if the target vernier offset is left or right, and this result is monotonically transformed into a threshold-like measure. c. Sometimes flankers group together (illusory contours are formed) when they should not, erroneously predicting uncrowding for this condition. d. Sometimes flankers group with the vernier when they should not. Here, weak illusory contours connect the central flanker and the vernier. No uncrowding can be produced for this condition because segmentation always spreads to the vernier, independently of the success of the selection signals.

Fig 11.

Alexnet.

a. Stimuli consisted of verniers alone, verniers surrounded by a single square, or verniers surrounded by seven squares. The stimuli had varying sizes, vernier offsets and positions. Alexnet’s architecture and a classifier are shown on the right (there was a classifier at each layer). The boxes correspond to the input (leftmost box) and activated neuron layers (see [49] for the detailed architecture of Alexnet). We trained softmax classifiers on all ReLU layers following the convolution layers and on the last fully connected layer to detect vernier orientation from the layer’s activity. b. Accuracy of softmax classifiers trained to detect vernier orientation from different layers in the deep neural network Alexnet. Across all layers, the offsets in crowded stimuli (1 square flanker) are always better detected than offsets in uncrowded stimuli (7 square flankers). This runs contrary to human performance. NB. This model only produces percent correct; there is no output image.
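
The per-layer readout in (a) amounts to fitting a linear softmax classifier on a layer's activity. The sketch below shows that idea on toy data only. It is not the authors' training code: the Alexnet forward pass is omitted, the two-dimensional "activations" are hypothetical stand-ins for a layer's activity vector, and the learning rate and epoch count are arbitrary choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear score for each class given one activation vector."""
    return [sum(w * xi for w, xi in zip(weights[k], x)) + biases[k]
            for k in range(len(weights))]


def train_softmax(features, labels, n_classes=2, lr=0.5, epochs=200):
    """Fit a linear softmax classifier by stochastic gradient descent."""
    n_features = len(features[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for k in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the class-k score.
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for j in range(n_features):
                    weights[k][j] -= lr * grad * x[j]
    return weights, biases


def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return scores.index(max(scores))


# Hypothetical layer activations; label 0 = left offset, 1 = right offset.
activations = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
offsets = [0, 0, 1, 1]

weights, biases = train_softmax(activations, offsets)
predictions = [predict(weights, biases, x) for x in activations]
accuracy = sum(p == y for p, y in zip(predictions, offsets)) / len(offsets)
print(accuracy)
```

In the setting the caption describes, one such classifier would be fitted per layer, and its accuracy on 1-square versus 7-square stimuli would be the quantity plotted in (b).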

Fig 12.

Hierarchical Sparse Selection model.

a. The model posits that receptive fields along the visual hierarchy are large and dense. This allows for “lossless” transmission of information through the visual system. For instance, the offset of the vernier in this illustration is not corrupted by pooling thanks to the density of the receptive fields (blue and red circles). Crowding arises only when we try to access this information, because only a few sparse receptive fields are used for readout (red circles). This sparse readout can occur at any stage of visual processing, from low-level features (shown here) to faces. b. Uncrowding does not occur in the Hierarchical Sparse Selection model because the model performs worse on the 7 squares than on the 1 square condition, contrary to human performance. NB. This model only produces a scalar output; there is no output image.

Fig 13.

Fourier model.

Left. The Fourier model computes Fourier transforms for the left- and right-offset versions of each stimulus. If these transforms are very different, crowding is low because the offset direction is easy to decode in Fourier space [15]. Right. Output of the Fourier model. The model failed on most stimuli [15]. NB. This model only produces a scalar output; there is no output image.
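
The comparison described above can be sketched as follows. The caption does not say how the two transforms are compared, so the Euclidean distance between complex spectra used here is an assumption, as are the function names `dft2` and `fourier_difference` and the toy 4x4 vernier images. The hand-written DFT keeps the sketch dependency-free; in practice a library routine such as `numpy.fft.fft2` would be used.

```python
import cmath
import math


def dft2(image):
    """Unnormalized 2D discrete Fourier transform of a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        spectrum_row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    phase = -2j * cmath.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(phase)
            spectrum_row.append(total)
        spectrum.append(spectrum_row)
    return spectrum


def fourier_difference(left_image, right_image):
    """Euclidean distance between the two spectra (assumed comparison)."""
    f_left, f_right = dft2(left_image), dft2(right_image)
    squared = sum(abs(a - b) ** 2
                  for row_l, row_r in zip(f_left, f_right)
                  for a, b in zip(row_l, row_r))
    return math.sqrt(squared)


# Toy verniers: upper and lower line segments offset by one pixel.
left_offset = [[0, 1, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 1, 0]]
right_offset = [[0, 0, 1, 0],
                [0, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 1, 0, 0]]

difference = fourier_difference(left_offset, right_offset)
identical = fourier_difference(left_offset, left_offset)
print(difference, identical)
```

A larger value corresponds to an offset direction that is easier to decode, i.e., lower crowding. Note that with this particular distance, Parseval's theorem makes the result proportional to the pixel-space distance between the two images, so it is only a stand-in for whatever comparison the model in [15] actually uses.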

Fig 14.

Summary of results.

Results for all models (columns). In black, the left panel displays all crowding stimuli and the right panel displays all uncrowding stimuli (i.e., better performance when extra elements are added to the crowded condition) as observed in human data (rows). Superscript numbers indicate which publication the results are taken from (1: Sayim, Westheimer & Herzog [17]; 2: Manassi et al. [11]; 3: Manassi, Sayim & Herzog [19]; 4: Manassi et al. [15]). Red indicates that the model predicts crowding, green indicates uncrowding and gray indicates that we did not run the model on the stimulus. A perfect model would have only red in the left half of the table and only green in the right half. Only the LAMINART is capable of producing uncrowding consistently. The Fourier model and the Wilson and Cowan network produce uncrowding, but suffer from overfitting (see discussion). For these two models, we provide the results for the best parameters. For example, the Wilson and Cowan network with different parameters can explain the lines category, but then it cannot explain the squares and irregular1 categories.

Fig 15.

Model comparison.

All models produce crowding, but only the Fourier, Wilson and Cowan, and LAMINART models produce uncrowding. The Fourier and the Wilson and Cowan models overfit and thus do not capture general principles. The LAMINART is the only model that explicitly computes grouping-like aspects and segments the image into different layers.
