OM-VST: A video action recognition model based on optimized downsampling module combined with multi-scale feature fusion | PLOS One

Fig 1. 3D convolution.

Fig 2. Self-attention module.

Fig 3. Transformer model network structure.

Fig 4. VST Block network structure.

Fig 5. OM-VST model network structure.

Fig 6. Optimized Downsampling network structure.

Fig 7. Patch merging network structure.

Fig 8. Multi-scale feature information fusion module network structure.

Table 1. Experiment parameter configuration.

Table 2. Performance comparison of various categories.

Fig 9. Confusion matrix.

Fig 10. ROC curves.

Fig 11. Comparison of model accuracy.

Fig 12. P-R curve.

Table 3. Performance comparison of different models.

Table 4. Comparison of model parameters.

Table 5. Comparison of ablation experiment results.