Fig 1.
A potential implementation of ORA in the brain.
ORA uses object reconstruction as top-down attention feedback and iteratively routes the most relevant visual information in successive steps of a feed-forward object recognition process. This reconstruction-guided attention operates at two levels: 1) spatial attention, a long-range projection that spatially constrains attention to the most likely object location, and 2) feature-based attention, more local feedback that biases feature binding strengths to favor the formation of a better object reconstruction. See main text for more details.
Fig 2.
Object reconstructions and generated spatial masks for three forward processing steps (out of a possible five in total).
The spatial mask serves to limit ORA’s attention to the shape of the most likely object. In this example, the model’s most likely digit hypothesis changed from a 5 to a 3. The ground-truth class is 3.
Fig 3.
Step-wise visualizations of reconstruction-based feature binding.
The matrices show binding coefficients between object slots (rows) and feature slots (columns) for three time steps. The binding coefficients were initialized to one at t = 1, resulting in the dark matrix at the top, but in subsequent time steps there is a rapid selective suppression of coefficients (matrix cells becoming lighter) as the model learns the object features that lead to higher reconstruction accuracy. In the illustrated example, the coefficient matrix becomes sparser with each iteration as ORA focuses its attention on the features of the digit 3 (darker line forming along the fourth row). The middle column of bar graphs shows the hypothesis yielding the highest reconstruction score (in blue), and the rightmost column shows the class likelihood adjusted by the reconstruction score. For clarity, only 5 object slots and 20 feature slots are illustrated, but the full figure can be found in S1 Fig.
Table 1.
Model comparison results using MNIST-C and MNIST-C-shape datasets.
Recognition accuracy (means and standard deviations from 5 trained models, hereafter referred to as model “runs”) for ORA and two CNN baselines, which were trained using identical CNN encoders (one a 2-layer CNN and the other a ResNet-18), and for a CapsNet model following the implementation in [51].
Fig 4.
Model accuracy and reaction time for specific corruption types from MNIST-C-shape.
A: ORA vs our CNN using each model’s best performing encoder (ResNet-18 and a 2-layer CNN, respectively). B: ORA’s reaction time (RT), estimated as the number of forward processing steps required to reach a confidence threshold. While many digits could be recognized in only one feed-forward pass, feedback mechanisms become useful for addressing specific types of noise, such as fog, or when the identity of the input is notably ambiguous, as demonstrated in S5 Fig. Error bars indicate standard deviations from 5 different model runs. Asterisks (*) indicate statistical significance at p < 0.05.
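The reaction-time estimate described in panel B can be illustrated with a minimal sketch. The function name `steps_to_threshold`, the threshold value of 0.9, and the per-step confidence values are all hypothetical and chosen only for illustration; the caption does not state the threshold used.

```python
import numpy as np

def steps_to_threshold(confidences, threshold=0.9):
    """Return the 1-based index of the first forward step whose confidence
    meets the threshold. If the threshold is never met, return the total
    number of steps (an assumed convention, not taken from the paper)."""
    confidences = np.asarray(confidences, dtype=float)
    reached = np.flatnonzero(confidences >= threshold)
    return int(reached[0]) + 1 if reached.size else len(confidences)

# Hypothetical per-step confidences for two inputs (five forward steps each).
clean_digit = [0.95, 0.97, 0.98, 0.98, 0.99]   # confident after one pass
foggy_digit = [0.40, 0.55, 0.72, 0.91, 0.93]   # needs feedback iterations

rt_clean = steps_to_threshold(clean_digit)
rt_foggy = steps_to_threshold(foggy_digit)
print(rt_clean, rt_foggy)
```

Under these made-up values the clean digit is resolved in a single feed-forward pass, while the foggy digit needs several steps, which mirrors the pattern the caption describes.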
Fig 5.
Results from ablating (-) ORA’s reconstruction-based spatial masking and feature binding components.
Leftmost bars in pink indicate the complete (no ablation) ORA model. Left: Recognition accuracy. Right: Reaction time, approximated by the number of forward processing steps taken to recognize digits with confidence. Error bars are standard deviations from 5 different model runs. Asterisks (*) indicate statistical significance at p < 0.05 after Bonferroni correction.
Fig 6.
Human recognition experiment and results.
A: Overview of the behavioral paradigm for measuring recognition reaction time (first button press) and accuracy (second button press). B: Stimuli were grouped into easy, medium, and difficult levels of recognition difficulty based on the number of forward processing steps needed for ORA to reach a predetermined recognition confidence threshold (dashed line; left panel). Colored regions indicate the corresponding standard deviations (SDs) across images. The unnumbered ticks on the x-axis indicate three local iterations of feature-based attention occurring within each forward processing step (i.e., each global spatial-masking iteration). The right panel shows distributions of human accuracy (averaged over participants) for the three difficulty conditions. C: Correlation between human RT and ORA’s RT, with the normalized marginal distribution of each variable shown on the top and right panels. Error bars represent standard error estimates bootstrapped from 1000 samples. Correlation plots for all MNIST-C corruption types can be found in S6 Fig.
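A bootstrapped standard error of the kind mentioned for panel C can be sketched as follows. This assumes that "1000 samples" means 1000 resamples drawn with replacement and that the statistic is a mean; the function name `bootstrap_se`, the seed, and the reaction-time values are hypothetical.

```python
import numpy as np

def bootstrap_se(values, n_boot=1000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of idx selects one bootstrap resample of the same size as the data.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # The spread of the resampled means approximates the standard error.
    return float(boot_means.std(ddof=1))

# Hypothetical human reaction times in seconds for one difficulty level.
human_rt = np.array([0.61, 0.58, 0.72, 0.66, 0.70, 0.64, 0.59, 0.75, 0.68, 0.63])
se = bootstrap_se(human_rt)
print(f"mean RT = {human_rt.mean():.3f} s, bootstrap SE = {se:.3f} s")
```

For a mean, this estimate should land close to the analytic value (sample SD divided by the square root of the sample size); the bootstrap is more useful for statistics without a simple closed form.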
Fig 7.
ORA often hallucinates a nonexistent pattern out of noise, and these errors are perceived as more likely by humans compared to errors made by our CNN baseline.
A: Example of ORA’s explain-away behavior and the resulting errors in classification. B: Two-alternative forced-choice experimental procedure requiring participants to choose which of two model predictions, one from ORA and the other from our CNN baseline (but both errors), is the more plausible human interpretation of the corrupted digit. C: The relative likelihood of ORA’s errors being perceived as more like those from a human, quantified as an odds ratio (a higher value indicates a more plausible ORA error). Error bars indicate standard errors, and asterisks (*) indicate odds ratios greater than 1 at a significance level of p < 0.05.
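One simple reading of the quantity in panel C, offered as an assumption rather than the authors' stated analysis, is the odds of participants choosing ORA's error over the CNN's error, compared against the chance odds of 1. The sketch below computes these odds and the standard error of the log-odds from choice counts; the function name `choice_odds` and the counts are hypothetical.

```python
import math

def choice_odds(n_ora, n_cnn):
    """Odds of choosing ORA's error over the CNN's error, plus the
    standard error of the log-odds (large-sample approximation)."""
    if n_ora <= 0 or n_cnn <= 0:
        raise ValueError("both counts must be positive")
    odds = n_ora / n_cnn
    se_log_odds = math.sqrt(1.0 / n_ora + 1.0 / n_cnn)
    return odds, se_log_odds

# Hypothetical counts: 60 trials favored ORA's error, 40 favored the CNN's.
odds, se_log = choice_odds(n_ora=60, n_cnn=40)
z = math.log(odds) / se_log   # z-score against the chance odds of 1
print(f"odds = {odds:.2f}, SE(log odds) = {se_log:.3f}, z = {z:.2f}")
```

Odds above 1 mean ORA's errors were chosen more often than the CNN's; testing on the log scale keeps the comparison against 1 symmetric.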
Fig 8.
Model performance on the ImageNet-C dataset.
Classification accuracy is shown on the y-axes and level of corruption on the x-axes. ORA often demonstrates a clear advantage over the CNN baseline (ResNet-50), particularly with high levels of noise corruption.
Table 2.
Model details.