Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

< Back to Article

A comparative analysis of video vision transformers on word-level sign language datasets

Fig 3

Five self-attention blocks of TimeSformer [56].

Each variant illustrates a different way of modeling spatial and temporal relationships in video data using transformer blocks. (1) Space Attention (S): attends only across spatial dimensions in each frame. (2) Joint Space-Time Attention (ST): computes attention jointly across space and time. (3) Divided Space-Time Attention (T+S): separates temporal and spatial attention sequentially. (4) Sparse Local-Global Attention (L+G): combines local and global spatial attention for broader context. (5) Axial Attention (T+W+H): factors attention across time, width, and height axes independently. Each block outputs updated video representations used for downstream tasks.

Fig 3

doi: https://doi.org/10.1371/journal.pone.0341909.g003