Invariant recognition drives neural representations of action sequences
Fig 5
Action recognition: Viewpoint mismatch condition.
This figure shows the prediction accuracy of a machine learning classifier trained and tested on feature representations of videos recorded at mismatched viewpoints. Hierarchical models were constructed using convolutional templates sampled or learned from videos showing all five viewpoints. During training and testing of the classifier, however, the viewpoints were mismatched: when the classifier was trained on videos at, say, the frontal viewpoint, its accuracy in discriminating new, unseen videos was evaluated on videos recorded at the side viewpoint. Here we report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates achieved significantly higher accuracy in this task. Among models with fixed templates, Spatiotemporal Convolutional Neural Networks employing Structured Pooling outperformed both purely convolutional and Unstructured Pooling models. Chance is 1/5, indicated by a horizontal line. Horizontal lines at the top indicate a significant difference between two conditions (p < 0.05) based on group ANOVA or Bonferroni-corrected paired t-test (see Materials and Methods).
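The evaluation protocol described in this caption can be illustrated with a minimal sketch. It is not the authors' pipeline. The nearest-centroid classifier, the synthetic features, training on the four remaining actors, and all function names (`viewpoint_mismatch_accuracy`, `make_synthetic_features`, and so on) are assumptions introduced here for illustration. The sketch trains on one viewpoint, tests on a different viewpoint with a held-out actor, and reports the mean and standard error over the five choices of test actor.

```python
import numpy as np


def nearest_centroid_fit(X, y):
    """Fit a nearest-centroid classifier (a stand-in, not the paper's classifier)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def nearest_centroid_predict(model, X):
    """Assign each row of X to the class with the closest centroid."""
    classes, centroids = model
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[sq_dist.argmin(axis=1)]


def viewpoint_mismatch_accuracy(features, actions, actors, viewpoints,
                                train_view, test_view):
    """Train on train_view, test on test_view, holding out one actor at a time.

    Assumption: the classifier is trained on all actors except the test actor.
    Returns the mean accuracy, its standard error, and per-actor accuracies.
    """
    accuracies = []
    for test_actor in np.unique(actors):
        train_mask = (actors != test_actor) & (viewpoints == train_view)
        test_mask = (actors == test_actor) & (viewpoints == test_view)
        model = nearest_centroid_fit(features[train_mask], actions[train_mask])
        predicted = nearest_centroid_predict(model, features[test_mask])
        accuracies.append(np.mean(predicted == actions[test_mask]))
    accuracies = np.asarray(accuracies)
    sem = accuracies.std(ddof=1) / np.sqrt(len(accuracies))
    return accuracies.mean(), sem, accuracies


def make_synthetic_features(n_actions=5, n_actors=5, views=("frontal", "side"),
                            clips=3, dim=16, seed=0):
    """Hypothetical stand-in for model features: action prototype plus noise."""
    rng = np.random.default_rng(seed)
    prototypes = rng.normal(size=(n_actions, dim))
    X, y, who, view_labels = [], [], [], []
    for action in range(n_actions):
        for actor in range(n_actors):
            for view in views:
                for _ in range(clips):
                    X.append(prototypes[action] + 0.1 * rng.normal(size=dim))
                    y.append(action)
                    who.append(actor)
                    view_labels.append(view)
    return np.array(X), np.array(y), np.array(who), np.array(view_labels)


features, actions, actors, viewpoints = make_synthetic_features()
mean_acc, sem_acc, per_actor = viewpoint_mismatch_accuracy(
    features, actions, actors, viewpoints, train_view="frontal", test_view="side"
)
print(f"mean accuracy = {mean_acc:.3f}, standard error = {sem_acc:.3f}")
```

Because the synthetic features are viewpoint-invariant by construction, this toy example says nothing about the models compared in the figure. It only makes the train/test split and the averaging over test actors concrete.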