Table 1.
Subject characteristics of the dataset used in this study.
Fig 1.
Illustration of a Vision Transformer (ViT), a neural network architecture designed for computer vision tasks.
The diagram depicts the structure of the ViT, emphasizing the attention mechanisms and positional encoding that allow a transformer-based framework to process image data efficiently.
Fig 2.
The workflow of the proposed Hybrid-RViT model.
Fig 3.
Overview of the Hybrid-RViT architecture.
The process begins with input images of dimensions 224×224, which are fed to ResNet-50 for feature extraction. The extracted features are passed to the ViT as a sequence of patches of shape (N+1)×D. A transformer encoder is then applied to perform self-attention. Finally, the learned features are passed through an MLP classifier for classification. The validation set is used to tune the model during training, while the test dataset is used to evaluate the model’s performance on unseen data.
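As a rough illustration of the shapes involved in this pipeline, the sketch below computes the token-sequence size that reaches the transformer encoder, reading the sequence as N patch tokens plus one extra token, each of dimension D. The function name `hybrid_token_shape` and all default values are hypothetical: the caption does not state where the ResNet-50 features are taken from, how they are split into patches, or the value of D, so the stride of 32, one token per feature-map position, and D = 768 are assumptions made only for this example.

```python
# Shape walk-through for a ResNet-50 -> ViT hybrid (illustrative assumptions only).

def hybrid_token_shape(image_size=224, backbone_stride=32, patch_size=1, embed_dim=768):
    """Return (N + 1, D), the assumed token-sequence shape fed to the encoder.

    Assumptions (not stated in the source):
      - features come from the last ResNet-50 stage, whose overall stride is 32;
      - each patch_size x patch_size block of the feature map becomes one token;
      - one extra token (e.g. a class token) is prepended, giving N + 1 tokens;
      - the embedding dimension D is 768.
    """
    if image_size % backbone_stride != 0:
        raise ValueError("image_size must be divisible by backbone_stride")
    feature_side = image_size // backbone_stride   # 224 // 32 = 7
    if feature_side % patch_size != 0:
        raise ValueError("feature map side must be divisible by patch_size")
    tokens_per_side = feature_side // patch_size   # 7 tokens per side
    num_patches = tokens_per_side ** 2             # N = 49
    return num_patches + 1, embed_dim              # (N + 1, D)


seq_len, dim = hybrid_token_shape()
print(seq_len, dim)
```

Under these assumptions a 224×224 input yields a 7×7 feature map, so N = 49 and the encoder would see a 50×768 sequence. A different tap point in the backbone or a different patch size would change N accordingly.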
Table 2.
Hyper-parameters used during training of the proposed Hybrid-RViT.
Fig 4.
Plot of training accuracy and loss.
Fig 4A displays training accuracy and validation accuracy, while Fig 4B shows training loss and validation loss.
Table 3.
Evaluation metrics of the Hybrid-RViT.
Fig 5.
Confusion matrix for the Hybrid-RViT model on the test dataset.
Fig 6.
Comparative analysis of the accuracy of the proposed Hybrid-RViT model against VGG-TSwift, SMIL-DeiT, Efficient+ViT, and ViT.
Fig 7.
Results of the ablation study comparing ResNet-50 versus ResNet-101.
Fig 7A shows the training and validation accuracy of Hybrid-RViT after replacing ResNet-50 with ResNet-101, while Fig 7B shows the corresponding training and validation loss.