Table 1.
Summary of related work.
Fig 1.
Video-level overview of our STA-TSN.
The input video is divided into multiple segments (shown in different colors), and a multi-scale spatial focus features enhancement strategy produces a global feature representation enhanced with spatial focus features. Key frames are then explored using an LSTM, and a temporal-attention regularization guides the model toward the key frames. The final class score is obtained by fusing the scores of all segments. The same process is applied to each modality. Reprinted from [10] under a CC BY license, with permission from IEEE publisher, original copyright 2018.
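The fusion step described above can be sketched as follows. The caption does not state the fusion rule, so this sketch assumes plain averaging over segments and a weighted average over modalities; the function names and the modality weights are hypothetical, not taken from the paper.

```python
def fuse_segments(segment_scores):
    """Average per-class scores over all segments (assumed fusion rule)."""
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [sum(seg[c] for seg in segment_scores) / num_segments
            for c in range(num_classes)]

def fuse_modalities(modality_scores, weights):
    """Weighted average of per-class scores across modalities (weights are hypothetical)."""
    total = sum(weights)
    num_classes = len(modality_scores[0])
    return [sum(w * scores[c] for w, scores in zip(weights, modality_scores)) / total
            for c in range(num_classes)]

# Two segments, two classes, for one modality.
rgb = fuse_segments([[1.0, 3.0], [3.0, 1.0]])
# A second modality (for example optical flow) with already fused scores.
flow = [0.0, 4.0]
video_scores = fuse_modalities([rgb, flow], weights=[1.0, 1.5])
predicted_class = max(range(len(video_scores)), key=video_scores.__getitem__)
```

Averaging keeps every segment equally influential; a different consensus function would only change `fuse_segments`.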
Fig 2.
Details of our multi-scale spatial focus features enhancement strategy.
The input of the module is the output of the last convolutional layer. First, a soft attention mechanism with spatial pyramid pooling (SPP) is used to obtain the multi-scale spatial focus features. Then, the spatial focus features are summed with the original features, and global average pooling (GAP) is used to obtain the global feature representation with multi-scale spatial focus features enhancement.
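One plausible reading of this pipeline is sketched below in plain Python. The caption does not specify how attention scores are computed or how the pyramid levels are combined, so several details are assumptions: each pyramid level average-pools the map into a `bins x bins` grid, a hypothetical scoring vector `w` scores each cell, a softmax over cells gives the attention weights, and the weighted features from all levels are added to the original map before GAP.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_cells(fmap, bins):
    """Average-pool an H x W x C map into bins*bins cells (requires bins <= H and bins <= W)."""
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])
    sums = [[0.0] * channels for _ in range(bins * bins)]
    counts = [0] * (bins * bins)
    cell_of = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            k = (i * bins // height) * bins + (j * bins // width)
            cell_of[i][j] = k
            counts[k] += 1
            for c in range(channels):
                sums[k][c] += fmap[i][j][c]
    cells = [[v / counts[k] for v in sums[k]] for k in range(bins * bins)]
    return cells, cell_of

def spatial_focus_gap(fmap, w, scales=(2, 4)):
    """Add attention-weighted features at each scale to the original map, then apply GAP."""
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])
    enhanced = [[list(fmap[i][j]) for j in range(width)] for i in range(height)]
    for bins in scales:
        cells, cell_of = pooled_cells(fmap, bins)
        # Assumed scoring: dot product of each pooled cell with the vector w.
        attn = softmax([sum(wc * xc for wc, xc in zip(w, cell)) for cell in cells])
        for i in range(height):
            for j in range(width):
                a = attn[cell_of[i][j]]
                for c in range(channels):
                    enhanced[i][j][c] += a * fmap[i][j][c]
    area = height * width
    return [sum(enhanced[i][j][c] for i in range(height) for j in range(width)) / area
            for c in range(channels)]

# A uniform 4 x 4 map with 2 channels and a zero scoring vector (uniform attention).
fmap = [[[1.0, 1.0] for _ in range(4)] for _ in range(4)]
feature = spatial_focus_gap(fmap, w=[0.0, 0.0])
```

In a trained network the scoring vector would be learned; here it is a fixed placeholder so that the data flow (attention, residual sum, GAP) is visible.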
Fig 3.
Details of key frames exploration.
The input of the module is the global feature representations of the frames sampled from each segment. First, the LSTM is used to obtain the temporal dynamic features. Then, the temporal attention weights are obtained using the soft attention mechanism. Finally, the segment feature representation is computed as the attention-weighted combination of the frame features.
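The attention-weighting step can be illustrated with a minimal sketch. It assumes the LSTM hidden states are already computed and passed in, that each state is scored by a dot product with a hypothetical vector `w`, and that the weights are applied to a separate list of frame features; the caption does not say whether the hidden states or the input features are weighted, so both are accepted as arguments.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention_pool(hidden_states, frame_features, w):
    """Score each time step, normalize with softmax, and return the weighted sum."""
    scores = [sum(wk * hk for wk, hk in zip(w, h)) for h in hidden_states]
    alphas = softmax(scores)  # one temporal attention weight per frame
    dim = len(frame_features[0])
    segment = [sum(a * f[d] for a, f in zip(alphas, frame_features))
               for d in range(dim)]
    return alphas, segment

# Two frames with 2-dimensional states; a zero scoring vector gives uniform weights.
hidden = [[1.0, 0.0], [0.0, 1.0]]
alphas, segment = temporal_attention_pool(hidden, hidden, w=[0.0, 0.0])

# A scoring vector that favors the first frame shifts the weight toward it.
peaked_alphas, _ = temporal_attention_pool(hidden, hidden, w=[5.0, 0.0])
```

Frames with larger weights contribute more to the segment feature, which is how key frames come to dominate the representation.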
Table 2.
Performance of the baseline and our proposed method on UCF101 (split 1), HMDB51 (split 1), and JHMDB (split 1).
Fig 4.
Per-class accuracy on the test sets of the three datasets (split 1) using our STA-TSN.
(a) UCF101 dataset, (b) HMDB51 dataset, and (c) JHMDB dataset. The horizontal axis represents the classes, and the vertical axis shows the test-set accuracy of the corresponding class.
Fig 5.
Confusion matrices for the three datasets using our STA-TSN.
(a) UCF101 dataset, (b) HMDB51 dataset, and (c) JHMDB dataset. The horizontal axis represents the predicted class, the vertical axis represents the actual class, and the main diagonal holds the true positives. The brighter a cell on the main diagonal, the more true positives that class has.
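For readers unfamiliar with this layout, the following sketch builds a confusion matrix with actual classes as rows and predicted classes as columns, and derives the per-class accuracy from its diagonal. The labels are toy values chosen for illustration, not results from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the actual class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

def per_class_accuracy(matrix):
    """Diagonal count divided by the row total (0.0 for classes with no samples)."""
    accuracies = []
    for k, row in enumerate(matrix):
        total = sum(row)
        accuracies.append(row[k] / total if total else 0.0)
    return accuracies

# Toy labels for three classes.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
accuracies = per_class_accuracy(matrix)
```

Each diagonal entry counts the samples of a class that were predicted correctly, so a bright diagonal corresponds to high per-class accuracy.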
Table 3.
Performance of our STA-TSN on all three splits of UCF101, HMDB51, and JHMDB.
Table 4.
Comparison with the state-of-the-art on UCF101 (average over three splits).
Table 5.
Comparison with the state-of-the-art on HMDB51 (average over three splits).
Table 6.
Comparison with the state-of-the-art on JHMDB (average over three splits).
Fig 6.
The visualization results of our STA-TSN for “shoot ball” in HMDB51.
The first row shows RGB images center-cropped to 224×224. The second row shows the same images with spatial attention masks, where brightness indicates the level of spatial focus. The third row shows a histogram of the temporal attention weights of the corresponding frames. Reprinted from [10] under a CC BY license, with permission from IEEE publisher, original copyright 2018.