Fig 1.
Architecture of the proposed model.
Pre-trained CNNs and LSTMRes were shared across inputs.
Fig 2.
Architecture of LSTM with residual learning.
Residual learning is implemented in the LSTM with a shortcut connection placed after old information is forgotten and new information is added. The dotted line indicates the shortcut connection.
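The cell update described in this caption can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the shortcut is assumed to add the previous cell state back after the forget/input updates, and the stacked gate layout (`W`, `U`, `b`) is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_res_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step with a residual shortcut on the cell state.

    Hypothetical sketch: the shortcut (the dotted line in Fig 2) adds
    c_prev back after old information is forgotten and new information
    is added, i.e. c_t = f*c_{t-1} + i*g + c_{t-1}. W (4n x d), U (4n x n),
    and b (4n) hold the stacked parameters of the i, f, o, g gates.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b           # stacked gate pre-activations
    i = sigmoid(z[0:n])                  # input gate
    f = sigmoid(z[n:2 * n])              # forget gate
    o = sigmoid(z[2 * n:3 * n])          # output gate
    g = np.tanh(z[3 * n:4 * n])          # candidate cell update
    c = f * c_prev + i * g + c_prev      # residual shortcut after forget + add
    h = o * np.tanh(c)
    return h, c
```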
Table 1.
Scenarios used in the experiments.
Fig 3.
Training and validation errors of LSTM, LSTMRes, LSTMResKim, and ConvLSTM.
From the start to the end of training, LSTMResKim performed worse than the other models.
Fig 4.
Accuracy of the proposed model with different pre-trained CNNs.
Average recognition rates of the proposed model: VGG-19 (intermediate): 90.40%; VGG-16 (intermediate): 90.90%; VGG-19 (fine-tuned intermediate): 88.38%; VGG-16 (fine-tuned intermediate): 91.41%; VGG-19 (last): 92.92%; VGG-16 (last): 94.69%.
Fig 5.
Average accuracy rate of the proposed model.
Comparison between MSLSTMRes (94.69%) and MV-DNN (95.70%).
Fig 6.
Average accuracy rate of the proposed model using LSTMRes with feature-fusion and score-fusion techniques.
Accuracy of the shared-weights LSTMRes with feature fusion (95.71%) and with score fusion using the arithmetic mean (97.22%) and the geometric mean (93.93%).
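The two score-fusion rules compared in this caption can be sketched as follows. This is a minimal illustration under assumed conventions (function name and the per-view probability layout are not from the paper); note that the geometric mean lets a single near-zero view suppress a class, which is one plausible reason for its lower accuracy here.

```python
import numpy as np

def fuse_scores(view_probs, method="arithmetic"):
    """Fuse per-view class-probability vectors into one score vector.

    view_probs: array of shape (n_views, n_classes), each row a
    probability distribution over classes from one camera view.
    """
    p = np.asarray(view_probs, dtype=float)
    if method == "arithmetic":
        fused = p.mean(axis=0)                 # average the probabilities
    elif method == "geometric":
        # geometric mean in log space, clipped to avoid log(0)
        fused = np.exp(np.log(np.clip(p, 1e-12, 1.0)).mean(axis=0))
    else:
        raise ValueError(f"unknown method: {method}")
    return fused / fused.sum()                 # renormalise to a distribution
```

The fused class is then `np.argmax(fuse_scores(view_probs))`.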
Table 2.
Average accuracy gain (%) with the new configuration.
Fig 7.
Improvement of recognition rate with the new configuration.
There was no improvement in recognizing the “sit down,” “get up,” and “pick up” actions, as the model with the previous structure had already achieved a perfect recognition rate on them. The highest accuracy gain was for the “wave” action (25%).
Fig 8.
Comparison between multi-view and single-view approaches.
Recognition rate of proposed model using multi-view inputs (96.37±3.39%) and single-view inputs from Cam 1 (80.34±7.57%), Cam 2 (82.48±8.10%), Cam 3 (79.70±6.66%), Cam 4 (79.27±9.56%), and Cam 5 (59.19±16.37%).
Table 3.
Comparison of recognition rates (%) between the proposed model and state-of-the-art methods on IXMAS.
Table 4.
Comparison of recognition rates (%) between the proposed model and state-of-the-art methods on i3DPost.
Fig 9.
Average accuracy rate of the proposed model on i3DPost.
The rows and columns represent actions: walk (1), run (2), jump (3), bend (4), hand-wave (5), jump-in-place (6), sit-stand up (7), run-fall (8), walk-sit (9), run-jump-walk (10), handshake (11), and pull (12). Panels A and B show the experimental results on 10 and 12 actions, respectively.
Table 5.
Average F1-score of the proposed model with 11 and 13 actions of IXMAS, and 10 and 12 actions of i3DPost.
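The averaged F1-score reported in Table 5 can be sketched as follows; the paper does not specify micro- vs. macro-averaging, so this hypothetical sketch uses the macro average (per-class F1 scores weighted equally).

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1-score over `n_classes` action classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives
        fp = np.sum((y_pred == c) & (y_true != c))   # false positives
        fn = np.sum((y_pred != c) & (y_true == c))   # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```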
Fig 10.
Example of ambiguous-action clips.
A: image sequence from the early part of a watch-checking action. B: sequence of ambiguous actions (transition from a punching to a kicking action).
Fig 11.
Average accuracy rate of the proposed model in online classification.
A: accuracy of the proposed model with varying sliding-window sizes: t = 10 (64.24 ± 4.26%), t = 20 (63.55 ± 3.45%), t = 30 (69.36 ± 5.43%), t = 40 (72.60 ± 5.15%), and t = 50 (73.64 ± 7.15%). B: average F1-score of the proposed model with varying sliding-window sizes: t = 10 (0.63 ± 0.08), t = 20 (0.62 ± 0.10), t = 30 (0.70 ± 0.06), t = 40 (0.72 ± 0.07), and t = 50 (0.73 ± 0.11).
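The windowing behind these online-classification results can be sketched as follows; the helper name and stride are assumptions for illustration, not the paper's code. Each window of t frames is classified independently, so a larger t gives more temporal context (consistent with the higher accuracy and F1 at t = 50) at the cost of a longer decision delay.

```python
def sliding_windows(frames, t, stride=1):
    """Yield fixed-length windows of `t` consecutive frames.

    Hypothetical helper for online classification: the classifier is
    applied to each window as the video streams in.
    """
    for start in range(0, len(frames) - t + 1, stride):
        yield frames[start:start + t]
```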
Fig 12.
Percentage of labels in the dataset and accuracy rate of the proposed model.
A: percentage of classes in the IXMAS dataset segmented with t = 50. B: accuracy of the proposed model for each class with t = 50.