A comparative analysis of video vision transformers on word-level sign language datasets

doi:10.1371/journal.pone.0341909

Fig 3

Five self-attention blocks of TimeSformer [56].

Each variant models spatial and temporal relationships in video data differently. (1) Space Attention (S): each patch attends only to patches in the same frame. (2) Joint Space-Time Attention (ST): each patch attends to all patches across all frames. (3) Divided Space-Time Attention (T+S): temporal attention is applied first, followed by spatial attention within each frame. (4) Sparse Local-Global Attention (L+G): combines attention over a local neighborhood of patches with sparse global attention for broader context. (5) Axial Attention (T+W+H): factorizes attention into separate steps along the time, width, and height axes. Each block outputs updated video representations used for downstream tasks.
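To make the difference between the schemes concrete, the following minimal sketch lists which patch positions a single query patch attends to under four of the five variants and counts the resulting comparisons. It is an illustration under stated assumptions, not the TimeSformer implementation: the function names, the `(t, h, w)` indexing, and the example grid of 8 frames with 14 x 14 patches are hypothetical, the classification token is ignored, and the L+G variant is omitted because its neighborhood and stride settings are not given in the caption.

```python
from itertools import product


def attended_positions(scheme, query, T, H, W):
    """Return one set of key positions per attention step for a query patch.

    Positions are (t, h, w) indices into a T x H x W grid of patches.
    The scheme labels follow the caption: "S", "ST", "T+S", "T+W+H".
    """
    t, h, w = query
    # All patches in the query's own frame.
    same_frame = {(t, i, j) for i, j in product(range(H), range(W))}
    # The patch at the same spatial location in every frame.
    same_location_over_time = {(k, h, w) for k in range(T)}

    if scheme == "S":
        return [same_frame]
    if scheme == "ST":
        return [set(product(range(T), range(H), range(W)))]
    if scheme == "T+S":
        # Two sequential steps: temporal first, then spatial.
        return [same_location_over_time, same_frame]
    if scheme == "T+W+H":
        # Three sequential steps, one per axis.
        along_width = {(t, h, j) for j in range(W)}
        along_height = {(t, i, w) for i in range(H)}
        return [same_location_over_time, along_width, along_height]
    raise ValueError(f"unsupported scheme: {scheme}")


def comparisons_per_query(scheme, T, H, W, query=(0, 0, 0)):
    """Total number of query-key comparisons summed over all steps."""
    return sum(len(step) for step in attended_positions(scheme, query, T, H, W))


if __name__ == "__main__":
    # Hypothetical clip: 8 frames, each split into 14 x 14 patches.
    for scheme in ("S", "ST", "T+S", "T+W+H"):
        print(scheme, comparisons_per_query(scheme, T=8, H=14, W=14))
```

For this example grid, joint attention compares each query against all 8 x 14 x 14 = 1568 patches, whereas the divided scheme needs only 8 + 196 = 204 comparisons per query, which is the main motivation for factorizing attention.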

doi: https://doi.org/10.1371/journal.pone.0341909.g003