Invariant recognition drives neural representations of action sequences
Fig 2
Spatiotemporal Convolutional Neural Networks.
Schematic overview of the class of models we used: Spatiotemporal Convolutional Neural Networks (ST-CNNs). ST-CNNs are hierarchical feature extraction architectures. Input videos pass through successive layers of computation, and the output of each layer serves as input to the next. The output of the last layer constitutes the video representation used in downstream tasks. The models we considered consisted of two pairs of convolutional and pooling layers, denoted Conv1, Pool1, Conv2 and Pool2. Convolutional layers performed template matching with a shared set of templates at all positions in space and time (spatiotemporal convolution), and pooling layers increased robustness through max-pooling operations. The templates of a convolutional layer can be fixed a priori, sampled or learned. In this example, the templates in the first layer, Conv1, are fixed and depict moving Gabor-like receptive fields, while the templates in the second convolutional layer, Conv2, are sampled from a set of videos containing actions filmed from different viewpoints. The authors who collected the videos identified themselves and explained the purpose of the videos to the people being recorded. The individuals agreed to have their videos taken and potentially published.
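
The Conv1, Pool1, Conv2, Pool2 pipeline described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`st_conv`, `st_max_pool`, `sample_templates`, `st_cnn`), the template and pooling sizes, the plain dot product used for template matching, and the random arrays standing in for videos and for the fixed Gabor-like Conv1 templates are all assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def st_conv(x, templates):
    """Spatiotemporal template matching ("valid" positions only).

    x:         (C, T, H, W) input channels over time and space
    templates: (K, C, t, h, w) shared set of K templates
    returns:   (K, T-t+1, H-h+1, W-w+1) dot-product responses
    """
    _, _, t, h, w = templates.shape
    # windows has shape (C, T', H', W', t, h, w)
    windows = sliding_window_view(x, (t, h, w), axis=(1, 2, 3))
    return np.einsum("cxyzijl,kcijl->kxyz", windows, templates)


def st_max_pool(x, size):
    """Non-overlapping max pooling over time and space; x: (K, T, H, W)."""
    pt, ph, pw = size
    k, t, h, w = x.shape
    t2, h2, w2 = t // pt, h // ph, w // pw
    x = x[:, : t2 * pt, : h2 * ph, : w2 * pw]  # drop incomplete borders
    return x.reshape(k, t2, pt, h2, ph, w2, pw).max(axis=(2, 4, 6))


def sample_templates(x, k, size, rng):
    """Crop k random (C, t, h, w) patches from a feature map x: (C, T, H, W)."""
    t, h, w = size
    _, big_t, big_h, big_w = x.shape
    patches = []
    for _ in range(k):
        a = rng.integers(0, big_t - t + 1)
        b = rng.integers(0, big_h - h + 1)
        c = rng.integers(0, big_w - w + 1)
        patches.append(x[:, a : a + t, b : b + h, c : c + w])
    return np.stack(patches)


def st_cnn(video, conv1, conv2, pool=(2, 2, 2)):
    """Conv1 -> Pool1 -> Conv2 -> Pool2; returns a flat feature vector."""
    x = video[np.newaxis]  # single input channel: (1, T, H, W)
    x = st_max_pool(st_conv(x, conv1), pool)
    x = st_max_pool(st_conv(x, conv2), pool)
    return x.ravel()


rng = np.random.default_rng(0)

# Assumption: random arrays stand in for real videos (frames x height x width).
video = rng.standard_normal((16, 32, 32))
template_video = rng.standard_normal((16, 32, 32))

# Assumption: random placeholders for the fixed, Gabor-like Conv1 templates.
conv1 = rng.standard_normal((4, 1, 3, 5, 5))

# Conv2 templates are sampled from the Pool1 output of another video.
pool1_out = st_max_pool(st_conv(template_video[np.newaxis], conv1), (2, 2, 2))
conv2 = sample_templates(pool1_out, k=8, size=(3, 3, 3), rng=rng)

features = st_cnn(video, conv1, conv2)  # features.shape == (576,)
```

With these illustrative sizes, the Pool2 output has 8 channels of size 2 x 6 x 6, flattened into a 576-dimensional vector that would serve as the video representation for a downstream task.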