Beyond core object recognition: Recurrent processes account for object recognition under occlusion

doi:10.1371/journal.pcbi.1007001

Fig 1.

Temporal dynamics of object recognition under various levels of occlusion.

(a) Multivariate pattern classification of MEG data. We extracted MEG signals from -200 ms to 700 ms relative to stimulus onset. At each time point (ms resolution), we computed the average pairwise classification accuracy between all exemplars. (b) Time courses of pairwise decoding accuracies for the three occlusion levels (without backward masking), averaged across 15 subjects. Thicker lines indicate decoding accuracy significantly above chance (right-sided signrank test, FDR corrected across time, p < 0.05), showing that MEG signals can discriminate between object exemplars. Shaded error bars represent the standard error of the mean (SEM). The two vertical shaded areas show the time from onset to peak for 60% occluded and 0% occluded objects; the two intervals are largely non-overlapping. The onset latency is 79±3 ms (mean ± SD) in the no-occlusion condition and 123±15 ms in the 60% occlusion condition; the difference between onset latencies is significant (p < 10⁻⁴, two-sided signrank test). Arrows above the curves indicate peak latencies. The peak latencies are 139±1 ms and 199±3 ms for the 0% occluded and partially occluded (60%) objects, respectively. The difference between the peak latencies is also statistically significant (p < 10⁻⁴). Images shown to participants are available at https://github.com/krajaei/Megocclusion/blob/master/Sample_occlusion_dataset.png.
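
The significance marking and latency read-out described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the FDR step is assumed to be the Benjamini-Hochberg procedure, onset is assumed to be the first significant time point, and the accuracy curve and p-values are synthetic.

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg: boolean mask of p-values significant at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = p.size
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank that meets its threshold
        mask[order[:k + 1]] = True
    return mask

def onset_and_peak(times_ms, accuracy, significant):
    """Onset = first significant time point (assumption); peak = time of maximum accuracy."""
    sig_idx = np.nonzero(significant)[0]
    onset = times_ms[sig_idx[0]] if sig_idx.size else None
    peak = times_ms[int(np.argmax(accuracy))]
    return onset, peak

# Synthetic example: a decoding curve that rises above chance (50%) around 140 ms.
times = np.arange(-200, 701)                         # 1 ms steps, as in the caption
acc = 50 + 20 * np.exp(-((times - 140) / 40.0) ** 2)
pvals = np.where(acc > 51, 1e-4, 0.5)                # made-up p-values for illustration
sig = fdr_bh(pvals, q=0.05)
onset, peak = onset_and_peak(times, acc, sig)
print(onset, peak)
```

In a real analysis the p-values would come from a right-sided signed-rank test across subjects at each time point, and latencies would be estimated per subject or by resampling to obtain the reported spread.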

Fig 2.

Temporal generalization patterns of object recognition with and without occlusion.

(a) Time-time decoding analysis. The procedure is similar to the calculation of pairwise decoding accuracies explained in Fig 1, except that here the classifier is trained at a given time point and then tested at all time points. In other words, for each pair of time points (t_x, t_y), an SVM classifier is trained on N-1 MEG pattern vectors at time t_x and tested on the remaining pattern vector at time t_y, resulting in an 801x801 time-time decoding matrix. (b-c) Time-time decoding accuracy and plausible processing architecture for no-occlusion and 60% occlusion. The results are for MEG trials without backward masking. The horizontal axis indicates testing times and the vertical axis indicates training times. Color bars represent percent decoding accuracy (chance level = 50%); please note that in the time-time decoding matrices, the color bar ranges for 0% occlusion and 60% occlusion are different. Within the time-time decoding matrices, decoding accuracies significantly above chance are surrounded by the white dashed contour lines (right-sided signrank test, FDR corrected across the whole 801x801 decoding matrix, p < 0.05). For each time-time decoding matrix, we also show the plausible processing architecture corresponding to it. These are derived from the observed patterns of temporal generalization from onset-to-peak decoding (shown by the gray dashed rectangles) [see Fig 5 of [52]]. Generalization patterns for the no-occlusion condition are consistent with a hierarchical feedforward architecture, whereas for the occluded objects (60%) the temporal generalization patterns are consistent with a hierarchical architecture with recurrent connections. (d) Difference in time-time decoding accuracies between no-occlusion and occlusion conditions. Differences significantly above zero are surrounded by the white dashed contour lines (right-sided signrank test, FDR corrected across the [80–240] ms matrix at p < 0.05).
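
The train-at-t_x, test-at-t_y procedure in panel (a) can be sketched as below. To stay dependency-free, the sketch swaps the SVM for a linear nearest-class-mean classifier, which is a stand-in and not the classifier used in the paper. The function name, the choice to hold out one trial per exemplar, and the synthetic data are likewise assumptions.

```python
import numpy as np

def time_time_decoding(xa, xb):
    """Leave-one-out temporal generalization for one exemplar pair.

    xa, xb: arrays of shape (n_trials, n_sensors, n_times), one per exemplar.
    Returns (n_times, n_times) accuracies in percent
    (rows = training time, columns = testing time).
    """
    n_trials, _, n_times = xa.shape
    correct = np.zeros((n_times, n_times))
    for held_out in range(n_trials):
        keep = np.arange(n_trials) != held_out
        # Class means of the N-1 training trials at every training time point.
        mean_a = xa[keep].mean(axis=0)                      # (n_sensors, n_times)
        mean_b = xb[keep].mean(axis=0)
        # Nearest-mean rule written as a linear boundary: w.x > bias -> class a.
        w = mean_a - mean_b
        bias = 0.5 * np.sum((mean_a + mean_b) * w, axis=0)  # (n_train_times,)
        # Apply each training-time boundary to the held-out pattern at every test time.
        score_a = w.T @ xa[held_out] - bias[:, None]        # (n_train, n_test)
        score_b = w.T @ xb[held_out] - bias[:, None]
        correct += (score_a > 0).astype(float) + (score_b < 0).astype(float)
    return 100.0 * correct / (2 * n_trials)

# Synthetic data: an exemplar-specific sensor pattern appears from time index 10 on.
rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 20, 10, 30
pattern = rng.normal(size=n_sensors)
xa = rng.normal(size=(n_trials, n_sensors, n_times))
xb = rng.normal(size=(n_trials, n_sensors, n_times))
xa[:, :, 10:] += 2.0 * pattern[:, None]
xb[:, :, 10:] -= 2.0 * pattern[:, None]
tg = time_time_decoding(xa, xb)
```

The diagonal of `tg` corresponds to the ordinary decoding time course of Fig 1. Because the synthetic pattern is sustained, the toy matrix shows a square block of high accuracy rather than the narrower diagonal pattern one would expect from a strictly feedforward sequence of representations.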

Fig 3.

Generalization across time and occlusion levels.

(a) The classifier is trained on one occlusion level (e.g. 0% occlusion) and tested on the other occlusion level (e.g. 60% occlusion). Time-points with significant decoding accuracy are enclosed by the dashed contours (right-sided signrank test, FDR-corrected across time, p < 0.05). The contour of significant time-points is shifted towards the upper side of the diagonal when the classifier is trained on 0% occlusion and tested on 60% occlusion (63% of significant time points lie above the diagonal), whereas the lower right matrix shows the opposite pattern (66% of significant time points lie below the diagonal). (b) The two color maps below the decoding matrices show the difference between the two decoding matrices located above them.
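
The asymmetry percentages quoted in panel (a) can be computed from a boolean significance mask along these lines. This is a hedged sketch: the function name is hypothetical, diagonal cells are excluded from the denominator by assumption, and whether "test later than train" appears above or below the plotted diagonal depends on the axis orientation of the figure.

```python
import numpy as np

def diagonal_asymmetry(sig_mask):
    """Percent of significant off-diagonal cells tested later / earlier than trained.

    sig_mask: boolean (n_times, n_times), rows = training time, columns = testing time.
    """
    sig = np.asarray(sig_mask, dtype=bool)
    later = np.triu(sig, k=1).sum()     # column index > row index
    earlier = np.tril(sig, k=-1).sum()  # column index < row index
    total = later + earlier
    if total == 0:
        return 0.0, 0.0
    return 100.0 * later / total, 100.0 * earlier / total

# Toy mask: three significant cells on one side of the diagonal, one on the other.
mask = np.zeros((5, 5), dtype=bool)
mask[0, 1] = mask[0, 2] = mask[1, 3] = True
mask[2, 0] = True
pct_later, pct_earlier = diagonal_asymmetry(mask)  # 75.0, 25.0
```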

Fig 4.

Backward masking significantly impairs object decoding under occlusion, but has no significant effect on object decoding under no occlusion.

(a) Time-courses of the average pairwise decoding accuracies under no-occlusion. Thicker lines indicate significant time-points (right-sided signrank test, FDR corrected across time, p < 0.05). Shaded error bars indicate SEM (standard error of the mean). Downward pointing arrows indicate peak decoding accuracies. There is no significant difference between the decoding time-courses for mask and no-mask trials under no-occlusion. (b) Time-courses of the average pairwise decoding under 60% occlusion (for 80% occlusion see S5 Fig). Under occlusion, the decoding onset latency for the no-mask trials is 123±15 ms, with peak decoding accuracy at 199±3 ms, whereas the time-course for the masked trials does not reach statistical significance, demonstrating that backward masking significantly impairs object recognition under occlusion. Black horizontal lines below the curves show the time-points at which the two decoding curves are significantly different. This is particularly evident around the peak latency of the no-mask trials [from 185 ms to 237 ms]. (c, d) Time-time decoding matrices of 60% occluded and un-occluded (0%) objects with and without backward masking. Horizontal axes indicate testing times and vertical axes indicate training times. Color bars show percent decoding accuracy. Please note that in the time-time decoding matrices, the color bar ranges for 0% occlusion and 60% occlusion are different. Decoding accuracies significantly above chance are surrounded by the white dashed contour lines (right-sided signrank test, FDR corrected across the whole 801x801 decoding matrix, p < 0.05). (f) Difference between time-time decoding matrices with and without backward masking. Statistically significant differences are surrounded by the black dotted contours (right-sided signrank test, FDR corrected across time at p < 0.05). There are significant differences between mask and no-mask only under occlusion.

Fig 5.

Comparing human MEG and behavioral data with feedforward and recurrent computational models of visual hierarchy.

(a) Time-varying representational similarity analysis between human MEG data and the computational models. We first obtained representational dissimilarity matrices (RDMs) for each computational model (using feature values of the layer before the softmax operation) and for the MEG data at each time-point. For each subject, the MEG RDMs were correlated (Spearman's R) with the computational model RDMs (i.e. AlexNet and HRRN) across time; the results were then averaged across subjects. (b, c) Time-courses of RDM correlations between the models and the human MEG data. HRRN readout stage 0 represents the purely feedforward version of HRRN. Thicker lines show significant time points (right-sided signrank test, FDR-corrected across time, p ≤ 0.05). We indicate peak correlation latencies by numbers (mean ± SD) above the downward pointing arrows. Under no-occlusion, AlexNet and HRRN show similar time-courses, except that the peak latency for HRRN (249±10 ms) is significantly later than the peak latency for AlexNet (219±12 ms). However, under occlusion, only HRRN showed a significant correlation with the MEG data, with a peak latency of 182±19 ms. (d) Object recognition performance of humans (mask and no-mask trials) and models [AlexNet, HRRN-ReadoutStage-0 (feedforward), and HRRN (recurrent)] across different levels of occlusion. We evaluated model accuracies on a multiclass recognition task similar to the multiclass behavioral experiment done in humans (S6 Fig). The models' performances were calculated by holding out one occlusion level for testing and training an SVM classifier on the remaining levels of occlusion. Error bars are SEM.
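
A minimal sketch of the time-varying RSA in panel (a) is given below. Several choices are assumptions made for illustration and are not stated in the caption: the model RDM is built as 1 minus the Pearson correlation between feature vectors, Spearman's R is computed as the Pearson correlation of ranks without tie handling, and all function names and data are hypothetical.

```python
import numpy as np

def rdm_from_features(features):
    """RDM as 1 - Pearson correlation between condition feature vectors (rows)."""
    return 1.0 - np.corrcoef(features)

def upper_triangle(rdm):
    """Vectorize the off-diagonal upper triangle of a square RDM."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def rsa_time_course(meg_rdms, model_rdm):
    """Correlate one model RDM with the MEG RDM at every time point.

    meg_rdms: (n_times, n_conditions, n_conditions)
    model_rdm: (n_conditions, n_conditions)
    """
    model_vec = upper_triangle(model_rdm)
    return np.array([spearman(upper_triangle(r), model_vec) for r in meg_rdms])

# Synthetic example: 12 conditions, 50 model features, 5 time points.
rng = np.random.default_rng(1)
features = rng.normal(size=(12, 50))
model_rdm = rdm_from_features(features)
meg_rdms = rng.normal(size=(5, 12, 12))  # unstructured "MEG" RDMs
meg_rdms[3] = model_rdm                  # time point 3 shares the model's geometry
rsa = rsa_time_course(meg_rdms, model_rdm)
```

Following the caption, this would be run once per subject and per model, and the resulting time courses averaged across subjects before testing for significance.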

Fig 6.

Hierarchical recurrent ResNet (HRRN) in unfolded form is equivalent to an ultra-deep ResNet.

(a) A hierarchy of convolutional layers with local recurrent connections. This hierarchical structure models the feedforward and local recurrent connections along the hierarchy of the ventral visual pathway (e.g. V1, V2, V4, IT). (b) Each recurrent unit is equivalent to a deep ResNet with an arbitrary number of layers, depending on the unfolding depth. h_t is the layer activity at a specific time (t) and K_t represents a sequence of nonlinear operations (e.g. convolution, batch normalization, and ReLU). [see [63] for more info].
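
The equivalence stated in panel (b) can be illustrated with a toy example. The sketch assumes a residual update of the form h_t = h_{t-1} + K(h_{t-1}), which is not spelled out in the caption, and replaces K_t with a simple linear map followed by ReLU instead of convolution and batch normalization. All names are hypothetical.

```python
import numpy as np

def k_op(h, weights):
    """Stand-in for K_t: linear map followed by ReLU (the real K_t uses convolutions)."""
    return np.maximum(weights @ h, 0.0)

def recurrent_unit(h0, weights, n_steps):
    """One recurrent layer iterated in time: h_t = h_{t-1} + K(h_{t-1}), shared weights."""
    h = h0
    for _ in range(n_steps):
        h = h + k_op(h, weights)
    return h

def unfolded_resnet(h0, layer_weights):
    """Feedforward ResNet with one residual block per entry of layer_weights."""
    h = h0
    for w in layer_weights:
        h = h + k_op(h, w)
    return h

rng = np.random.default_rng(2)
h0 = rng.normal(size=8)
w = 0.1 * rng.normal(size=(8, 8))
depth = 4
out_recurrent = recurrent_unit(h0, w, depth)
out_unfolded = unfolded_resnet(h0, [w] * depth)  # same weights in every block
```

Running the recurrent unit for `depth` time steps gives the same output as a `depth`-block ResNet whose blocks share weights, which is why a longer unfolding corresponds to a deeper network.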

Fig 7.

Contribution of the feedforward and recurrent models in explaining MEG data under 60% occlusion.

(a) Correlation between the model RDMs and the average MEG RDM over two different time bins. (b) Unique contribution of each model (semipartial correlation) in explaining the MEG data. Error bars represent SEM (standard error of the mean). Correlations and semipartial correlations significantly above zero, and significant differences between the two models, are indicated by stars. * = p<0.05; ** = p<0.01; *** = p<0.001.
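
The semipartial correlation in panel (b) can be sketched as follows: one model's RDM is residualized against the other model's RDM, and the residual is correlated with the MEG RDM. Working on ranks (a Spearman-style variant), the function names, and the synthetic vectors are assumptions for illustration; the inputs stand for vectorized upper triangles of RDMs.

```python
import numpy as np

def ranks(x):
    """Rank transform (assumes no ties)."""
    return np.argsort(np.argsort(x)).astype(float)

def semipartial_corr(target, model, control):
    """Correlation of target with the part of model not linearly explained by control."""
    t, m, c = ranks(target), ranks(model), ranks(control)
    design = np.column_stack([np.ones_like(c), c])
    coef, *_ = np.linalg.lstsq(design, m, rcond=None)
    residual = m - design @ coef  # model variance unique with respect to control
    return np.corrcoef(t, residual)[0, 1]

# Synthetic RDM vectors (66 entries = 12 conditions taken pairwise).
rng = np.random.default_rng(3)
model_a = rng.normal(size=66)
model_b = 0.6 * model_a + 0.8 * rng.normal(size=66)  # partly overlaps with model_a
meg = model_a + 0.3 * rng.normal(size=66)            # "MEG" driven by model_a only
unique_a = semipartial_corr(meg, model_a, model_b)
unique_b = semipartial_corr(meg, model_b, model_a)
```

In this toy setup `unique_a` is clearly positive while `unique_b` stays near zero, mirroring the logic of asking which model explains MEG variance that the other cannot.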
