Invariant recognition drives neural representations of action sequences

doi:10.1371/journal.pcbi.1005859

Fig 1.

Action recognition stimulus set.

Sample frames from the action recognition dataset, which consists of 2 s video clips depicting five actors performing five actions (top row: drink, eat, jump, run and walk). Actions were recorded at five different viewpoints (bottom row: 0 (frontal), 45, 90 (side), 135 and 180 degrees with respect to the normal to the focal plane). All actions were performed on a treadmill, and actors held a water bottle and an apple in their hands regardless of the action they performed, in order to minimize low-level object/action confounds. Actors were centered in the frame and the background was held constant regardless of viewpoint. The authors who collected the videos identified themselves and the purpose of the videos to the people being video recorded. The individuals agreed to have their videos taken and potentially published.

Fig 2.

Spatiotemporal Convolutional Neural Networks.

Schematic overview of the class of models we used: Spatiotemporal Convolutional Neural Networks (ST-CNNs). ST-CNNs are hierarchical feature extraction architectures. Input videos go through layers of computation, and the output of each layer serves as input to the next layer. The output of the last layer constitutes the video representation used in downstream tasks. The models we considered consisted of two pairs of convolutional and pooling layers, denoted Conv1, Pool1, Conv2 and Pool2. Convolutional layers performed template matching with a shared set of templates at all positions in space and time (spatiotemporal convolution), and pooling layers increased robustness through max-pooling operations. The templates of a convolutional layer can be fixed a priori, sampled or learned. In this example, templates in the first layer, Conv1, are fixed and depict moving Gabor-like receptive fields, while templates in the second convolutional layer, Conv2, are sampled from a set of videos containing actions filmed at different viewpoints. The authors who collected the videos identified themselves and the purpose of the videos to the people being video recorded. The individuals agreed to have their videos taken and potentially published.
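
As a rough illustration of the Conv/Pool stack described in this caption, the following sketch implements spatiotemporal template matching and max-pooling with NumPy. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `st_conv` and `st_max_pool`, the toy input size, the template sizes, the plain dot-product matching and the non-overlapping pooling windows are all hypothetical, and random arrays stand in for the Gabor-like and sampled templates.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def st_conv(x, templates):
    """Spatiotemporal template matching.

    x:         (channels, time, height, width) input
    templates: (n_templates, channels, t, h, w), shared across all positions
    Returns the dot product of every template with every input patch,
    shape (n_templates, time', height', width').
    """
    patches = sliding_window_view(x, templates.shape[1:])
    # patches has shape (1, time', height', width', channels, t, h, w)
    return np.einsum("zabcdijk,ndijk->nabc", patches, templates)


def st_max_pool(responses, size):
    """Non-overlapping max-pooling over time and space, per channel."""
    n, T, H, W = responses.shape
    pt, ph, pw = size
    T2, H2, W2 = T // pt, H // ph, W // pw
    trimmed = responses[:, :T2 * pt, :H2 * ph, :W2 * pw]
    blocks = trimmed.reshape(n, T2, pt, H2, ph, W2, pw)
    return blocks.max(axis=(2, 4, 6))


rng = np.random.default_rng(0)
video = rng.random((1, 16, 32, 32))                     # toy clip, one channel
conv1_templates = rng.standard_normal((4, 1, 3, 5, 5))  # stand-in for fixed Gabor-like templates
conv2_templates = rng.standard_normal((6, 4, 3, 3, 3))  # stand-in for sampled templates

conv1 = st_conv(video, conv1_templates)
pool1 = st_max_pool(conv1, (2, 2, 2))
conv2 = st_conv(pool1, conv2_templates)
pool2 = st_max_pool(conv2, conv2.shape[1:])  # pool over the full extent
features = pool2.ravel()                     # video representation
```

Each layer's output feeds the next, and `features` plays the role of the representation passed to downstream tasks.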

Fig 3.

Structured and unstructured pooling.

We introduced modifications to the basic ST-CNN to increase robustness to changes in 3D viewpoint. Qualitatively, Spatiotemporal Convolutional Neural Networks detect the presence of a certain video segment (a template) in their input stimulus. The 3D orientation of this template is discarded by the pooling mechanism in our structured pooling model, analogous to how position in space is discarded in a traditional CNN. a) In models with Structured Pooling (model 3 in the main text), the template set for Conv2 layer cells was sampled from a set of videos containing four actors performing five actions at five different viewpoints (see Materials and Methods). All templates sampled from videos of a specific actor performing a specific action were pooled together by one Pool2 layer unit. b) Models employing Unstructured Pooling (model 2 in the main text) allowed Pool2 cells to pool over the entire spatial extent of their input as well as across channels. These models used exactly the same templates as the models relying on Structured Pooling and matched them in the number of templates wired to each pooling unit. However, the assignment of templates to pooling units was randomized (uniformly, without replacement) and did not reflect any semantic structure. The authors who collected the videos identified themselves and the purpose of the videos to the people being video recorded. The individuals agreed to have their videos taken and potentially published.
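
The difference between the two wiring schemes can be sketched as follows. This is a hypothetical illustration, not the authors' code: it assumes a single Conv2 template per (actor, action, viewpoint) combination and a single scalar response per template (already maximized over space and time), and the helper names are invented for the example.

```python
import numpy as np

N_ACTORS, N_ACTIONS, N_VIEWS = 4, 5, 5

# Assumption: one Conv2 template per (actor, action, viewpoint) combination.
labels = [(actor, action, view)
          for actor in range(N_ACTORS)
          for action in range(N_ACTIONS)
          for view in range(N_VIEWS)]


def structured_groups(labels):
    """Group template indices by (actor, action), i.e. across viewpoints."""
    groups = {}
    for idx, (actor, action, _view) in enumerate(labels):
        groups.setdefault((actor, action), []).append(idx)
    return list(groups.values())


def unstructured_groups(n_templates, group_size, seed=0):
    """Random assignment, uniform without replacement, same group size."""
    perm = np.random.default_rng(seed).permutation(n_templates)
    return [perm[i:i + group_size].tolist()
            for i in range(0, n_templates, group_size)]


def pool(responses, groups):
    """One Pool2 unit per group: the max response over its templates."""
    return np.array([responses[group].max() for group in groups])


responses = np.random.default_rng(1).random(len(labels))  # toy Conv2 responses
groups_s = structured_groups(labels)
groups_u = unstructured_groups(len(labels), group_size=N_VIEWS)
pooled_structured = pool(responses, groups_s)
pooled_unstructured = pool(responses, groups_u)
```

Both schemes use the same templates and the same number of templates per pooling unit; only the structured grouping ties each unit to one actor performing one action seen from all viewpoints.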

Fig 4.

Action recognition: Viewpoint match condition.

We trained a supervised machine learning classifier to discriminate videos based on their action content, using the feature representation computed by each of the Spatiotemporal Convolutional Neural Network models we considered. This figure shows the prediction accuracy of a classifier trained and tested using videos recorded at the same viewpoint. The classifier was trained on videos depicting four actors performing five actions at either the frontal or the side view. Its accuracy was then assessed using new, unseen videos of a new, unseen actor performing those same five actions. No generalization across changes in 3D viewpoint was required of the feature extraction and classification system. Here we report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates significantly outperform models with fixed templates on this task. Chance is 1/5 and is indicated by a horizontal line. Horizontal lines at the top indicate a significant difference between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Materials and Methods).
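
The evaluation protocol in this caption (hold out one actor, train on the other four, report mean and standard error over the five choices) can be sketched as below. The data layout, the function names and the per-fold accuracies are hypothetical and only illustrate the bookkeeping; no classifier is trained here.

```python
import statistics


def leave_one_actor_out(videos, view):
    """Yield (held_out_actor, train, test) using videos from a single viewpoint."""
    same_view = [v for v in videos if v["view"] == view]
    for held_out in sorted({v["actor"] for v in same_view}):
        train = [v for v in same_view if v["actor"] != held_out]
        test = [v for v in same_view if v["actor"] == held_out]
        yield held_out, train, test


def mean_and_sem(accuracies):
    """Mean and standard error of the mean over cross-validation folds."""
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, sem


# Toy index: one clip per (actor, action, viewpoint), an assumption for brevity.
ACTIONS = ("drink", "eat", "jump", "run", "walk")
videos = [{"actor": actor, "action": action, "view": view}
          for actor in range(5)
          for action in ACTIONS
          for view in ("frontal", "side")]

folds = list(leave_one_actor_out(videos, view="frontal"))

# Made-up per-fold accuracies, only to show the summary statistic.
mean_acc, sem_acc = mean_and_sem([0.8, 0.6, 1.0, 0.8, 0.8])
```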

Fig 5.

Action recognition: Viewpoint mismatch condition.

This figure shows the prediction accuracy of a machine learning classifier trained and tested using feature representations of videos at opposed viewpoints. Hierarchical models were constructed using convolutional templates sampled or learned from videos showing all five viewpoints. During the training and testing of the classifier, however, mismatching viewpoints were used: when the classifier was trained using videos at, say, the frontal viewpoint, its accuracy in discriminating new, unseen videos was established using videos recorded at the side viewpoint. Here we report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates resulted in significantly higher accuracy in this task. Among models with fixed templates, Spatiotemporal Convolutional Neural Networks employing Structured Pooling outperformed both purely convolutional and Unstructured Pooling models. Chance is 1/5 and is indicated by a horizontal line. Horizontal lines at the top indicate a significant difference between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Materials and Methods).
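
A sketch of how the mismatched train/test splits could be enumerated is given below. It assumes, hypothetically, that each video is indexed by actor, action and viewpoint and that the two viewpoints involved are the frontal and side views; the function name is invented for the example.

```python
from itertools import permutations


def cross_view_folds(videos, views=("frontal", "side")):
    """Yield folds where the classifier trains on one viewpoint and is tested
    on the other, with the test actor excluded from training."""
    actors = sorted({v["actor"] for v in videos})
    for train_view, test_view in permutations(views, 2):
        for held_out in actors:
            train = [v for v in videos
                     if v["view"] == train_view and v["actor"] != held_out]
            test = [v for v in videos
                    if v["view"] == test_view and v["actor"] == held_out]
            yield train_view, test_view, held_out, train, test


# Toy index: one clip per (actor, action, viewpoint), an assumption for brevity.
videos = [{"actor": actor, "action": action, "view": view}
          for actor in range(5)
          for action in ("drink", "eat", "jump", "run", "walk")
          for view in ("frontal", "side")]

folds = list(cross_view_folds(videos))
```

Any accuracy above chance on such folds requires the feature representation itself to generalize across viewpoint, since the classifier never sees the test viewpoint during training.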

Fig 6.

Feature representation empirical dissimilarity matrices.

We extracted feature representations with the four Spatiotemporal Convolutional Neural Network models from 50 videos depicting five actors performing five actions at two different viewpoints, frontal and side. In addition, we obtained Magnetoencephalography (MEG) recordings of human subjects’ brain activity while they were watching these same videos, and used these recordings as a proxy for the neural representation of these videos. These videos were not used to construct or learn any of the models. For each of the six representations of each video (four artificial models, a categorical oracle and one neural recording) we constructed an empirical dissimilarity matrix using linear correlation and normalized it between 0 and 1. Empirical dissimilarity matrices on the same set of stimuli constructed with video representations from a) Model 1: Purely Convolutional model, b) Model 2: Unstructured Pooling model, c) Model 3: Structured Pooling model, d) Model 4: Learned templates model, e) Categorical oracle and f) Magnetoencephalography brain recordings.
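
One way to build such a matrix is sketched below. The caption specifies only "linear correlation" and normalization between 0 and 1, so the specific choices here are assumptions: dissimilarity is taken as one minus the Pearson correlation between feature vectors, rescaling is min-max, and the categorical oracle is taken to be 0 for videos sharing an action label and 1 otherwise. Function names and the random toy features are hypothetical.

```python
import numpy as np


def dissimilarity_matrix(features):
    """features: (n_videos, n_features). Returns an (n_videos, n_videos) matrix
    of 1 - Pearson correlation, min-max rescaled to [0, 1]."""
    d = 1.0 - np.corrcoef(features)
    return (d - d.min()) / (d.max() - d.min())


def oracle_matrix(action_labels):
    """Assumed categorical oracle: 0 if two videos share an action, else 1."""
    labels = np.asarray(action_labels)
    return (labels[:, None] != labels[None, :]).astype(float)


rng = np.random.default_rng(0)
toy_features = rng.standard_normal((50, 30))  # 50 videos, 30 made-up features
rdm = dissimilarity_matrix(toy_features)
oracle = oracle_matrix(["run", "run", "walk"])
```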

Fig 7.

Representational Similarity Analysis between model representations and human neural data.

We computed the Spearman Correlation Coefficient (SCC) between the lower triangular portion of the dissimilarity matrix constructed with each of the artificial models we considered and that of the dissimilarity matrix constructed with neural data (shown and described in Fig 6). We assessed the uncertainty of this measure by resampling the rows and columns of the matrices we constructed. To give the SCC score a meaningful interpretation, we report a normalized score: the SCC is normalized so that the noise ceiling is 1 and the noise floor is 0. The noise ceiling was assessed by computing the SCC between each individual human subject’s dissimilarity matrix and the average dissimilarity matrix over the rest of the subjects. The noise floor was computed by assessing the SCC between the lower triangular portion of the dissimilarity matrix constructed using each of the model representations and a scrambled version of the neural dissimilarity matrix. This normalization embeds the intuition that we cannot expect artificial representations to match human data better than an individual human subject’s data matches the mean of other humans, and that we should only be concerned with how much better the models we considered are, on this scale, than a random guess. Models with learned templates agree with the neural data significantly better than models with fixed templates. Among the latter, models with Structured Pooling outperform both purely Convolutional and Unstructured Pooling models. Horizontal lines at the top indicate a significant difference between two conditions (p < 0.05) based on group ANOVA or Bonferroni corrected paired t-test (see Materials and Methods).
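
The normalized score described in this caption can be sketched as below. Several details are assumptions rather than the authors' procedure: the noise ceiling is averaged over subjects, the "scrambled" neural matrix is produced by jointly permuting its rows and columns, the noise floor is averaged over a fixed number of such permutations, and all function names are hypothetical. Spearman correlation is computed as the Pearson correlation of tie-averaged ranks.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def lower_triangle(matrix):
    """Entries strictly below the diagonal, as a vector."""
    rows, cols = np.tril_indices(matrix.shape[0], k=-1)
    return matrix[rows, cols]


def noise_ceiling(subject_matrices):
    """Mean SCC between each subject and the average of the other subjects."""
    sccs = []
    for i, own in enumerate(subject_matrices):
        others = np.mean([m for j, m in enumerate(subject_matrices) if j != i], axis=0)
        sccs.append(spearman(lower_triangle(own), lower_triangle(others)))
    return float(np.mean(sccs))


def noise_floor(model_matrix, neural_matrix, n_shuffles=100, seed=0):
    """Mean SCC between the model and stimulus-scrambled neural matrices."""
    rng = np.random.default_rng(seed)
    model_vec = lower_triangle(model_matrix)
    sccs = []
    for _ in range(n_shuffles):
        perm = rng.permutation(neural_matrix.shape[0])
        scrambled = neural_matrix[np.ix_(perm, perm)]
        sccs.append(spearman(model_vec, lower_triangle(scrambled)))
    return float(np.mean(sccs))


def normalized_score(scc, floor, ceiling):
    """Rescale so that the noise floor maps to 0 and the noise ceiling to 1."""
    return (scc - floor) / (ceiling - floor)
```

For example, a raw SCC halfway between the floor and the ceiling maps to a normalized score of 0.5.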
