
Fig 1.

Architecture of frame rate-corrected dataset construction, recognition, and benchmarking.

This figure illustrates the end-to-end workflow adopted for isolated sign language recognition across diverse datasets. The process begins with data collection and preprocessing, including clip segmentation, frame extraction, and frame rate correction. After preparing the dataset through frame selection, resizing, and tensor generation, data augmentation techniques such as horizontal flipping are applied to improve model generalization. Pretrained video transformer models (VideoMAE, ViViT, and TimeSformer) are then fine-tuned and evaluated using standard metrics, including accuracy, precision, recall, and F1-score. The results are compared with existing state-of-the-art approaches to assess model performance.
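As a rough illustration of the frame-preparation steps named here (fixed-length frame selection as a simple form of frame rate correction, resizing, tensor generation, and horizontal-flip augmentation), a minimal sketch assuming OpenCV and PyTorch is given below; the function names, target frame count, and resolution are placeholders rather than values taken from the paper.

```python
import cv2
import numpy as np
import torch

def load_clip_as_tensor(path: str, num_frames: int = 16, size: int = 224) -> torch.Tensor:
    """Read a clip, sample a fixed number of frames, resize, and build a tensor.

    Uniform temporal sampling acts as a simple frame rate correction: clips
    recorded at different FPS all end up with `num_frames` frames.
    """
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    # Pick `num_frames` indices spread evenly across the clip.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    sampled = [cv2.resize(frames[i], (size, size)) for i in idx]

    # (T, H, W, C) uint8 -> (C, T, H, W) float tensor in [0, 1].
    video = torch.from_numpy(np.stack(sampled)).permute(3, 0, 1, 2).float() / 255.0
    return video

def horizontal_flip(video: torch.Tensor) -> torch.Tensor:
    # Augmentation step from the pipeline: flip along the width axis.
    return torch.flip(video, dims=[-1])
```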


Fig 2.

Frame count vs. number of short clips in the BdSLW60 dataset.

This figure displays the distribution of frame counts across 9,307 short video clips in the BdSLW60 dataset. Each bar represents the frame count of an individual MP4 clip, sorted in ascending order. Frame counts range from 9 to 164 frames per clip, with the minimum and maximum highlighted in red and blue, respectively.
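For orientation, per-clip frame counts of this kind can be gathered with a short script, for example with OpenCV as sketched below; the dataset path is a placeholder and not the authors' directory layout.

```python
import glob
import cv2

# Placeholder path; count frames in each MP4 clip and sort ascending,
# mirroring the per-clip bars in Fig 2.
frame_counts = []
for path in glob.glob("BdSLW60/**/*.mp4", recursive=True):
    cap = cv2.VideoCapture(path)
    frame_counts.append(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)))
    cap.release()

frame_counts.sort()
print(f"{len(frame_counts)} clips, min = {frame_counts[0]}, max = {frame_counts[-1]}")
```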


Table 1.

Comparison of different video transformer models and their architecture details.


Fig 3.

Five self-attention blocks of TimeSformer [56].

Each variant illustrates a different way of modeling spatial and temporal relationships in video data using transformer blocks. (1) Space Attention (S): attends only across spatial dimensions in each frame. (2) Joint Space-Time Attention (ST): computes attention jointly across space and time. (3) Divided Space-Time Attention (T+S): separates temporal and spatial attention sequentially. (4) Sparse Local-Global Attention (L+G): combines local and global spatial attention for broader context. (5) Axial Attention (T+W+H): factors attention across time, width, and height axes independently. Each block outputs updated video representations used for downstream tasks.
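To make the divided space-time variant (T+S) concrete, the following is a minimal PyTorch sketch of one such block, assuming patch tokens of shape (batch, frames, patches, dim); the class name, head count, and tensor sizes are illustrative, and the released TimeSformer additionally handles a classification token and other details omitted here.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Sketch of TimeSformer-style divided space-time attention (T+S).

    Input: patch tokens of shape (B, T, N, D) with B = batch, T = frames,
    N = spatial patches per frame, D = embedding dimension.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape

        # Temporal attention: each spatial location attends across the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_out, _ = self.attn_t(self.norm_t(xt), self.norm_t(xt), self.norm_t(xt))
        x = x + t_out.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention: the N patches of each frame attend to one another.
        xs = x.reshape(B * T, N, D)
        s_out, _ = self.attn_s(self.norm_s(xs), self.norm_s(xs), self.norm_s(xs))
        x = x + s_out.reshape(B, T, N, D)

        # Standard transformer MLP with residual connection.
        return x + self.mlp(self.norm_mlp(x))

# Example: 2 clips, 8 frames, 196 patches (14x14), 768-dim tokens.
tokens = torch.randn(2, 8, 196, 768)
block = DividedSpaceTimeBlock(dim=768)
print(block(tokens).shape)  # torch.Size([2, 8, 196, 768])
```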


Table 2.

Training hyperparameters.


Table 3.

Dataset splitting configurations.


Table 4.

Performance of VideoMAE, ViViT, and TimeSformer with and without augmentation.


Table 5.

Experimental results on different models and datasets - Part 1.


Table 6.

Experimental results on different models and datasets - Part 2.


Fig 4.

Loss curve for fold 6 of the BdSLW60 dataset.

This figure shows the training and validation loss curves on BdSLW60 using a split-axis visualization to highlight the initial loss spike and subsequent convergence. After a brief early spike, both losses decrease rapidly and remain closely aligned, indicating stable learning, minimal overfitting, and good generalization throughout training.


Fig 5.

Confusion matrix for fold 6 of the BdSLW60 test set.

This confusion matrix shows how well the model classified each of the 60 sign language gestures. The diagonal cells represent correct predictions, with darker shades indicating better performance. Most predictions fall along this diagonal, showing that the model accurately recognized the majority of signs. The few lighter cells outside the diagonal indicate occasional misclassifications, but overall, the results suggest strong and consistent performance across all classes.
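A matrix of this kind can be computed from the fold's test predictions with standard tooling, for example scikit-learn as sketched below; the label arrays are placeholders standing in for the model's ground-truth and predicted classes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Placeholders standing in for the fold's ground-truth and predicted labels.
y_true = np.random.randint(0, 60, size=500)
y_pred = y_true.copy()  # replace with the model's test-set predictions

cm = confusion_matrix(y_true, y_pred, labels=np.arange(60))

plt.imshow(cm, cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.colorbar()
plt.show()
```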


Fig 6.

Confusion matrix for the BdSLW401 test set (first 50 classes visualized).

This confusion matrix presents the model’s classification results for the first 50 classes out of a total of 401 in the BdSLW401 test set. Each row represents the actual class label, while each column shows the predicted label. The darker diagonal cells indicate correct predictions, suggesting the model has learned to recognize many of the signs accurately within this subset. The few lighter off-diagonal entries represent misclassifications, pointing to some confusion between certain signs. This visualization provides insight into the model’s performance on a portion of the full class set.


Table 7.

Dataset details with FPS, models, SR and clip durations.


Table 8.

Correlation summary across datasets and models.


Fig 7.

Long-tail nature of WLASL-2000 with the Kinetics-pretrained VideoMAE model.

This figure illustrates the long-tail effect in WLASL-2000, showing class-wise F1-score versus training-set class frequency (log scale), grouped into head, middle, and tail classes. A weak but statistically significant positive correlation is observed (Spearman r = 0.122, p < 0.001), with higher variability among tail classes.
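In principle, the reported statistic pairs each class's training frequency with its per-class F1-score and applies Spearman's rank correlation; a minimal SciPy sketch is given below, with placeholder arrays rather than the paper's data.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholders: per-class training-sample counts and per-class test F1-scores
# (the latter could come from sklearn.metrics.f1_score(..., average=None)).
train_freq = np.random.randint(1, 200, size=2000)
per_class_f1 = np.random.rand(2000)

rho, p_value = spearmanr(train_freq, per_class_f1)
print(f"Spearman r = {rho:.3f}, p = {p_value:.3g}")
```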
