Invariant recognition drives neural representations of action sequences
Fig 3
Structured and unstructured pooling.
We modified the basic ST-CNN to increase its robustness to changes in 3D viewpoint. Qualitatively, Spatiotemporal Convolutional Neural Networks (ST-CNNs) detect the presence of a certain video segment (a template) in their input stimulus. In our structured pooling model, the pooling mechanism discards the 3D orientation of this template, analogous to how a traditional CNN discards position in space. (a) In models with Structured Pooling (model 3 in the main text), the template set for Conv2 layer cells was sampled from a set of videos containing four actors performing five actions at five different viewpoints (see Materials and Methods). All templates sampled from videos of a specific actor performing a specific action were pooled together by one Pool2 layer unit. (b) Models employing Unstructured Pooling (model 2 in the main text) allowed Pool2 cells to pool over the entire spatial extent of their input as well as across channels. These models used exactly the same templates as the Structured Pooling models and matched them in the number of templates wired to each pooling unit. However, the assignment of templates to pooling units was randomized (uniform without replacement) and did not reflect any semantic structure. The authors who collected the videos identified themselves and explained the purpose of the videos to the people being recorded. The individuals agreed to have their videos taken and potentially published.
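The difference between the two wiring schemes in panels (a) and (b) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one template per video (4 x 5 x 5 = 100 templates), assumes that each template response has already been reduced to a single number, and assumes max pooling as the pooling operation. The function names `template_labels`, `structured_groups`, `unstructured_groups`, and `pool` are hypothetical.

```python
import itertools
import numpy as np

# Counts taken from the caption: four actors, five actions, five viewpoints.
N_ACTORS, N_ACTIONS, N_VIEWS = 4, 5, 5

def template_labels():
    """One (actor, action, view) label per template.
    Assumption: a single template is sampled from each video."""
    return list(itertools.product(range(N_ACTORS), range(N_ACTIONS), range(N_VIEWS)))

def structured_groups(labels):
    """Structured pooling: one pooling unit per (actor, action) pair,
    wired to all templates of that pair across viewpoints."""
    groups = {}
    for idx, (actor, action, _view) in enumerate(labels):
        groups.setdefault((actor, action), []).append(idx)
    return list(groups.values())

def unstructured_groups(labels, group_size, rng):
    """Unstructured pooling: same templates and same group size, but the
    assignment is a uniform random draw without replacement."""
    perm = rng.permutation(len(labels))
    return [perm[i:i + group_size].tolist() for i in range(0, len(labels), group_size)]

def pool(responses, groups):
    """Assumed max pooling over the templates wired to each pooling unit."""
    return np.array([responses[g].max() for g in groups])

labels = template_labels()
rng = np.random.default_rng(0)
responses = rng.random(len(labels))  # placeholder Conv2 template responses

structured = structured_groups(labels)
unstructured = unstructured_groups(labels, N_VIEWS, rng)
pooled_structured = pool(responses, structured)
pooled_unstructured = pool(responses, unstructured)
```

Under these assumptions both schemes yield 20 pooling units with five templates each, so they differ only in which templates share a unit: the structured units span the five viewpoints of one actor and action, while the unstructured units mix actors, actions, and viewpoints arbitrarily.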