Invariant recognition drives neural representations of action sequences
Fig 4
Action recognition: Viewpoint match condition.
We trained a supervised machine learning classifier to discriminate videos by their action content, using the feature representation computed by each of the Spatiotemporal Convolutional Neural Network models we considered. This figure shows the prediction accuracy of a classifier trained and tested on videos recorded at the same viewpoint. The classifier was trained on videos of four actors performing five actions at either the frontal or the side view. Its accuracy was then assessed on new, unseen videos of a fifth, held-out actor performing the same five actions. The feature extraction and classification system was therefore not required to generalize across changes in 3D viewpoint. We report the mean and standard error of the classification accuracy over the five possible choices of test actor. Models with learned templates significantly outperform models with fixed templates on this task. Chance is 1/5 and is indicated by a horizontal line. Horizontal lines at the top indicate a significant difference between two conditions (p < 0.05) based on group ANOVA or Bonferroni-corrected paired t-test (see Materials and Methods section).
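The evaluation protocol described in this caption (train on four actors, test on the held-out fifth, then report mean and standard error over the five folds) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the caption does not say which classifier was used, so a nearest-centroid rule stands in for it, and the feature vectors are synthetic random data rather than outputs of the models in the figure. The function names `nearest_centroid_predict` and `leave_one_actor_out` are hypothetical.

```python
import numpy as np

def nearest_centroid_predict(train_X, train_y, test_X):
    # Stand-in classifier (assumption): assign each test clip to the action
    # whose mean training feature vector is closest in Euclidean distance.
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def leave_one_actor_out(X, actions, actors):
    # One fold per actor: train on all other actors, test on the held-out one.
    fold_accs = []
    for held_out in np.unique(actors):
        test = actors == held_out
        pred = nearest_centroid_predict(X[~test], actions[~test], X[test])
        fold_accs.append(np.mean(pred == actions[test]))
    fold_accs = np.array(fold_accs)
    # Standard error of the mean across folds (sample standard deviation).
    sem = fold_accs.std(ddof=1) / np.sqrt(len(fold_accs))
    return fold_accs.mean(), sem, fold_accs

# Synthetic stand-in data (assumption): 5 actors x 5 actions x 4 clips,
# 16-dimensional features clustered around one random mean per action.
rng = np.random.default_rng(0)
n_actors, n_actions, n_clips, dim = 5, 5, 4, 16
action_means = 3.0 * rng.normal(size=(n_actions, dim))
actors = np.repeat(np.arange(n_actors), n_actions * n_clips)
actions = np.tile(np.repeat(np.arange(n_actions), n_clips), n_actors)
X = action_means[actions] + rng.normal(size=(len(actions), dim))

mean_acc, sem_acc, fold_accs = leave_one_actor_out(X, actions, actors)
print(f"accuracy: {mean_acc:.3f} +/- {sem_acc:.3f} (chance = {1 / n_actions:.2f})")
```

In the viewpoint match condition this loop would be run separately on clips from a single viewpoint (frontal or side), so training and test clips always share the same view.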