Fig 1.
Architecture of the proposed model.
Pre-trained CNNs and LSTMRes were shared across inputs.
Fig 2.
Architecture of LSTM with residual learning.
Residual learning is implemented in the LSTM with a shortcut connection placed after old information is forgotten and new information is added. The dotted line indicates the shortcut connection.
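The cell update described in this caption can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the shortcut is assumed to add the previous cell state back after the forget/input updates, and the stacked gate layout (`W`, `U`, `b`) is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_res_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step with a residual shortcut on the cell state.

    Hypothetical sketch: the shortcut (the dotted line in Fig 2) adds
    c_prev back after old information is forgotten and new information
    is added, i.e. c_t = f*c_{t-1} + i*g + c_{t-1}. W (4n x d), U (4n x n),
    and b (4n) hold the stacked parameters of the i, f, o, g gates.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b           # stacked gate pre-activations
    i = sigmoid(z[0:n])                  # input gate
    f = sigmoid(z[n:2 * n])              # forget gate
    o = sigmoid(z[2 * n:3 * n])          # output gate
    g = np.tanh(z[3 * n:4 * n])          # candidate cell update
    c = f * c_prev + i * g + c_prev      # residual shortcut after forget + add
    h = o * np.tanh(c)
    return h, c
```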
Table 1.
Scenarios used in the experiments.
Fig 3.
Training and validation errors of LSTM, LSTMRes, LSTMResKim, and ConvLSTM.
From the start to the end of training, LSTMResKim performed worse than the other models.
Fig 4.
Accuracy of the proposed model with different pre-trained CNNs.
Average recognition rates of the proposed model: VGG-19 (intermediate): 90.40%; VGG-16 (intermediate): 90.90%; VGG-19 (fine-tuned intermediate): 88.38%; VGG-16 (fine-tuned intermediate): 91.41%; VGG-19 (last): 92.92%; VGG-16 (last): 94.69%.
Fig 5.
Average accuracy rate of the proposed model.
Comparison between MSLSTMRes (94.69%) and MV-DNN (95.70%).
Fig 6.
Average accuracy rate of the proposed model using LSTMRes with feature-fusion and score-fusion techniques.
Accuracy of the shared-weights LSTMRes with feature fusion (95.71%) and with score fusion using the arithmetic mean (97.22%) and the geometric mean (93.93%).
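The two score-fusion rules compared in this caption can be sketched as follows. This is a minimal illustration under assumed conventions (function name and the per-view probability layout are not from the paper); note that the geometric mean lets a single near-zero view suppress a class, which is one plausible reason for its lower accuracy here.

```python
import numpy as np

def fuse_scores(view_probs, method="arithmetic"):
    """Fuse per-view class-probability vectors into one score vector.

    view_probs: array of shape (n_views, n_classes), each row a
    probability distribution over classes from one camera view.
    """
    p = np.asarray(view_probs, dtype=float)
    if method == "arithmetic":
        fused = p.mean(axis=0)                 # average the probabilities
    elif method == "geometric":
        # geometric mean in log space, clipped to avoid log(0)
        fused = np.exp(np.log(np.clip(p, 1e-12, 1.0)).mean(axis=0))
    else:
        raise ValueError(f"unknown method: {method}")
    return fused / fused.sum()                 # renormalise to a distribution
```

The fused class is then `np.argmax(fuse_scores(view_probs))`.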
Table 2.
Average accuracy gain (%) with the new configuration.
Fig 7.
Improvement of recognition rate with the new configuration.
There was no improvement in recognizing the “sit down,” “get up,” and “pick up” actions, as the model with the previous structure had already achieved a perfect recognition rate on them. The highest accuracy gain was for the “wave” action (25%).
Fig 8.
Comparison between multi-view and single-view approaches.
Recognition rate of proposed model using multi-view inputs (96.37±3.39%) and single-view inputs from Cam 1 (80.34±7.57%), Cam 2 (82.48±8.10%), Cam 3 (79.70±6.66%), Cam 4 (79.27±9.56%), and Cam 5 (59.19±16.37%).
Table 3.
Comparison of recognition rates (%) between the proposed model and state-of-the-art methods on IXMAS.
Table 4.
Comparison of recognition rates (%) between the proposed model and state-of-the-art methods on i3DPost.
Fig 9.
Average accuracy rate of the proposed model on i3DPost.
The rows and columns represent actions: walk (1), run (2), jump (3), bend (4), hand-wave (5), jump-in-place (6), sit-stand up (7), run-fall (8), walk-sit (9), run-jump-walk (10), handshake (11), and pull (12). Panels A and B show the experimental results on 10 and 12 actions, respectively.
Table 5.
Average F1-score of the proposed model with 11 and 13 actions of IXMAS, and 10 and 12 actions of i3DPost.
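The averaged F1-score reported in Table 5 can be sketched as follows; the paper does not specify micro- vs. macro-averaging, so this hypothetical sketch uses the macro average (per-class F1 scores weighted equally).

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1-score over `n_classes` action classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives
        fp = np.sum((y_pred == c) & (y_true != c))   # false positives
        fn = np.sum((y_pred != c) & (y_true == c))   # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```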
Fig 10.
Example of ambiguous-action clips.
A: image sequence from the early part of a watch-checking action. B: sequence of ambiguous actions (transition from a punching to a kicking action).
Fig 11.
Average accuracy rate of the proposed model in online classification.
A: accuracy of the proposed model with varying sliding-window sizes: t = 10 (64.24 ± 4.26%), t = 20 (63.55 ± 3.45%), t = 30 (69.36 ± 5.43%), t = 40 (72.60 ± 5.15%), and t = 50 (73.64 ± 7.15%). B: average F1-score of the proposed model with varying sliding-window sizes: t = 10 (0.63 ± 0.08), t = 20 (0.62 ± 0.10), t = 30 (0.70 ± 0.06), t = 40 (0.72 ± 0.07), and t = 50 (0.73 ± 0.11).
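The windowing behind these online-classification results can be sketched as follows; the helper name and stride are assumptions for illustration, not the paper's code. Each window of t frames is classified independently, so a larger t gives more temporal context (consistent with the higher accuracy and F1 at t = 50) at the cost of a longer decision delay.

```python
def sliding_windows(frames, t, stride=1):
    """Yield fixed-length windows of `t` consecutive frames.

    Hypothetical helper for online classification: the classifier is
    applied to each window as the video streams in.
    """
    for start in range(0, len(frames) - t + 1, stride):
        yield frames[start:start + t]
```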
Fig 12.
Percentage of labels in the dataset and accuracy rate of the proposed model.
A: percentage of classes in the IXMAS dataset segmented with t = 50. B: accuracy of the proposed model for each class with t = 50.