
Attention to detail: A conditional multi-head transformer for traffic sign recognition

  • Isra Naz,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing – original draft

    Affiliations Department of Computer Science, COMSATS University Islamabad, Wah Campus, Islamabad, Pakistan, Department of Computer Science, University of Wah, Wah Cantt, Pakistan

  • Jamal Hussain Shah,

    Roles Methodology, Software, Supervision, Visualization

    Affiliation Department of Computer Science, COMSATS University Islamabad, Wah Campus, Islamabad, Pakistan

  • Ali Tahir,

    Roles Conceptualization, Project administration, Writing – review & editing

    Affiliation Department of Computer Science, College of Engineering and Computer Science, Jazan University, Jazan, Kingdom of Saudi Arabia

  • Mahatma Reddy Marri,

    Roles Methodology, Software, Writing – review & editing

    Affiliation Independent Researcher, Texas, United States of America

  • Rabia Saleem,

    Roles Conceptualization, Formal analysis, Supervision, Writing – review & editing

    Affiliation Department of Information Technology, Government College University Faisalabad, Gurunanakpura, Faisalabad, Pakistan

  • Mutaz Elradi S. Saeed

    Roles Investigation, Resources, Validation

    mutelr@nilevalley.edu.sd

    Affiliation Department of Computer Science, Nile Valley University, Atbara, Sudan

Abstract

The challenge of traffic sign detection and recognition for driving vehicles has become more critical with recent advances in autonomous and assisted driving technologies. Although object recognition problems, particularly traffic sign recognition, have been extensively studied, most Vision Transformer (ViT) models still rely on static attention mechanisms with fixed projection matrices (Q, K, and V). This static mechanism limits the ability of ViTs to handle real-world problems such as object detection and traffic sign recognition. Conditions such as partially or fully obscured signs, changes in illumination, and adverse weather result in subpar feature extraction, which compounds the misclassification problem. To overcome this challenge, a Conditional Visual Transformer (CViT) is proposed in this research, which dynamically adapts feature aggregation, Q, K, and V projections, and attention mechanisms based on the input sign type. Its core is a fail-controlled deep learning model that targets specific types of traffic signs through varying feature extraction and attention adjustments, resulting in high classification performance and fewer misclassifications. Furthermore, an adaptive gating technique is employed that optimally adjusts the projection matrices across different traffic signs. The proposed CViT achieved an overall accuracy of 99.87%, with a Micro Precision of 99.07%, a Macro Recall of 94.3%, and a Macro F1 Score of 99.07%. These results demonstrate the potential of CViT to improve both the efficiency and reliability of traffic sign recognition in autonomous driving applications.

1 Introduction

The rapid development of autonomous vehicles has underscored the critical importance of computer vision systems, especially for accurate traffic sign recognition. Traffic Sign Recognition (TSR) enables fully automated driving systems to understand and respond appropriately to road signs and regulations [1]. Unfortunately, deep learning models, including convolutional neural networks (CNNs), have not been able to fully overcome the problem of misclassification [2]. In the case of TSR, misclassification can have a devastating impact, from collisions and legal infractions to undermining public trust in autonomous technology [3]. While current deep learning models excel at achieving high accuracy rates, their reliance on traditional performance metrics, such as confusion matrices, often overlooks the need to minimize misclassification. All classifiers, including CNNs, always produce an output regardless of their prediction confidence [4]. This always-predict behavior works for many applications; however, in safety-critical scenarios such as autonomous driving, an uncertain or incorrect label poses unacceptable risks.

Although CNNs have been the dominant approach to TSR for the last decade [5], they struggle to capture global dependencies in complex and intricate visual environments. These limitations are further amplified in real-world conditions such as occlusions, illumination changes, or motion blur [6]. Moreover, decision-making and motion planning frameworks increasingly rely on accurate perception modules, where errors in traffic sign recognition can propagate into unsafe maneuvers [7].

Transformer-based architectures, originally developed for natural language processing, have recently been adapted for vision tasks to address these challenges [8]. These models employ self-attention methods designed to capture long-range dependencies, making them more effective at visual understanding [9]. Nonetheless, standard Vision Transformers (ViTs) treat all input tokens homogeneously and overlook input-dependent context, which is suboptimal, especially in noisy or cluttered backgrounds. The risk of misclassification remains a significant concern, particularly in safety-critical scenarios such as autonomous vehicles. Incorrect recognition of a traffic sign can lead to erroneous decision-making, potentially endangering passengers and pedestrians. To address these limitations, we propose a Conditional Vision Transformer (CViT) with an integrated fail-control mechanism. Unlike conventional classifiers that always produce an output, the fail-control mechanism withholds predictions when confidence is low, thereby preventing harmful errors. This capability enables the model to distinguish between confidently classifiable samples and those that should be deferred for further verification or human oversight.

2 Research contributions

The contributions of this study are as follows:

  • Unlike standard ViTs that use a fixed cls_token, the proposed CViT dynamically chooses between the cls_token and Global Average Pooling (GAP) to ensure optimal feature representation across diverse traffic sign categories.
  • Instead of fixed projection matrices, CViT employs a gating mechanism that learns adaptive Q, K, and V representations conditioned on the input sign type, thereby enhancing classification robustness.
  • CViT integrates an adaptive attention mechanism that adjusts attention weights based on traffic sign characteristics (e.g., shape, color, or partial occlusion). This specialization ensures that each sign type receives tailored feature processing, reducing misclassifications and improving recognition accuracy.

The remainder of this paper is organized as follows: Section 3 provides background information on ML classifiers and their limitations in critical systems. The architecture of the proposed system is presented in Section 4. The experimental setup and results are detailed in Section 5, while Section 6 discusses future work and concludes the paper.

3 Related work

TSR is a critical component of autonomous driving systems, enabling vehicles to interpret and respond to road regulations effectively. Recent advances in deep learning, particularly CNNs, have significantly improved the accuracy and robustness of TSR systems. [10] illustrated how modern CNN architectures such as VGG16 and EfficientNet have led performance on benchmark datasets, including the German Traffic Sign Recognition Benchmark (GTSRB). However, these systems have notable weaknesses: TSR performance degrades under everyday conditions such as illumination changes and occlusions, as well as under adversarial attacks [11,12]. The integration of TSR systems into autonomous vehicles has also been extensively studied. As noted in [13], the real-time processing demands and risk profile of autonomous vehicles are a concern for TSR systems, stressing the importance of risk management throughout the AI lifecycle. Although adversarial training and augmentation [14] have enhanced robustness, most CNN-based approaches still treat misclassification as secondary, focusing primarily on accuracy rather than reliability, an omission that is particularly dangerous in safety-critical contexts. Traditional evaluation metrics, such as accuracy, precision, and recall, focus on correctly classified instances but fail to adequately address the impact of misclassification [15,16]. This limitation is particularly problematic in safety-critical applications, such as autonomous driving, where even a single misclassification can lead to catastrophic outcomes. Recent work by [17] introduced tailored data augmentation strategies to address class imbalance and instance scarcity. While these methods improve robustness by enriching training data, they do not alter the decision process itself.
[18] employed convolutional autoencoders as a dual-purpose defense, detecting adversarial perturbations and reconstructing corrupted inputs. While effective against attacks like FGSM and PGD, such defenses focus on input sanitization rather than output reliability. Similarly, localization-oriented models like RID-LIO have achieved robust LiDAR-based SLAM performance in degraded environments, highlighting the need for dependable perception in all conditions [19].

Takeaway 1: CViT introduces dynamic attention and conditional projections (Q, K, V) to overcome the limitations of traditional Vision Transformers in handling real-world traffic sign recognition challenges.

Recent research has shifted toward transformer-based models, specifically vision transformers (ViTs) [20], to overcome the limitations of traffic sign detection and recognition models; these architectures leverage self-attention mechanisms to capture long-range dependencies and global contextual information. ViTs are emphasized here because their patch-based representation and self-attention enable them to capture global context that CNNs typically miss. This makes them particularly promising for TSR, where signs may appear in cluttered, partially occluded, or dynamically changing environments. Unlike CNNs, ViTs treat images as patches and process them similarly to natural language sequences, enabling a more holistic understanding of visual patterns. [21] proposed a Tokens-to-Token ViT architecture, which refines the patch embedding process and improves the stability of training on ImageNet from scratch. Yin et al. [22] introduced A-ViT, an adaptive token sampling strategy to reduce computational overhead while maintaining accuracy. These models have paved the way for the application of ViTs in dense classification tasks such as TSR. Several researchers have explored ViT-based models to address real-world challenges in the domain of traffic sign recognition. In addition, [23] presented a local ViT tailored for TSR, achieving superior results compared to traditional CNN models, particularly under conditions with high intraclass similarity and environmental noise. Similarly, [24] proposed an attention-guided CAM method that enhanced model explainability by utilizing self-attention weights to more precisely localize salient image regions. [25] proposed TSD-YOLO, a robust detection model that integrates ViTs within a YOLO framework to improve performance under adverse weather conditions such as fog, rain, and snow. Hybrid models that combine the strengths of CNNs and transformers have also gained attention [26]. Ghouse et al. [27] compared standard CNNs with transfer learning-based ViTs and concluded that ViTs offer better generalization. However, ViTs are not without drawbacks. Their reliance on large datasets and extensive pretraining limits scalability in real-world TSR deployments. Moreover, treating all patches equally overlooks context-dependent importance, which can reduce interpretability in safety-critical settings. To address this issue, researchers have explored knowledge distillation, data augmentation, and token pruning techniques. Beyond vision-based methods, multimodal frameworks such as YCANet have demonstrated the effectiveness of combining camera and LiDAR data for robust target detection in complex traffic scenes [28].

[29] explored cross-domain few-shot in-context learning with multimodal large language models (MLLMs) to reduce reliance on extensive labeled data. This improves generalization across different countries’ traffic sign datasets but does not address the reliability of outputs under uncertainty. [30] proposed E-MobileViT, a lightweight ViT variant that combines convolutional and transformer layers with efficient local attention modules. This design improves efficiency and accuracy on benchmarks such as GTSRB and BTSD. However, despite architectural optimizations, E-MobileViT still outputs a prediction for every input, without mechanisms to handle low-confidence cases. Overall, ViT represents a promising paradigm for traffic sign recognition, offering improved accuracy, interpretability, and robustness compared to traditional CNNs. However, to enable large-scale deployment in real-world autonomous systems, further optimization in terms of computational efficiency, model generalization, and safety-critical interpretability is required.

Given these limitations, safety requires not only accurate classification but also mechanisms that prevent unreliable predictions from influencing vehicle decisions. Fail-controlled systems, widely applied in avionics and railways, offer this capability by discarding uncertain outputs. In the context of machine learning, fail-controlled classifiers (FCCs) have gained traction as a means to mitigate the risks associated with misclassification. A recent work [31] formalized the concept of FCCs, proposing a framework that combines self-checking mechanisms, input/output processors, and safety wrappers to enable FCC behavior. [32] empirically assessed classifier reliability by analyzing the distribution of binary classifier misclassifications rather than simply counting them. These studies highlight the potential of FCCs to improve the safety and reliability of autonomous systems. However, further research is needed to adapt these approaches to specific domains, such as traffic sign recognition. [33] revisited physical-world adversarial attacks on commercial TSR systems, showing that real deployments remain vulnerable even when benchmark performance is high. Their findings highlight the gap between experimental accuracy and system-level safety. Also, human drivers often rely on subjective risk assessments when navigating uncertain environments, and similar safety-driven considerations must be embedded into autonomous systems [34]. Inspired by the success of transformers in NLP, [35] proposed the ViT, which uses self-attention mechanisms to segment an image into patches and process them as tokens. ViT and its variants have demonstrated impressive results on various vision benchmarks [36,37]. Nonetheless, they assume equal importance for all input patches and do not account for context-dependent relevance, a limitation that is particularly impactful in safety-critical tasks such as TSR, where focusing on meaningful regions is essential.
A comprehensive comparison of the techniques discussed in this section is shown in Table 1 below:

Table 1. Comparison of some existing approaches to Traffic Sign Recognition (TSR).

https://doi.org/10.1371/journal.pone.0335341.t001

4 Materials and methods

The proposed CViT comprises three major phases. The first phase is the preprocessing phase, in which the input traffic sign images are divided into patches. Each patch is then flattened, and positional embeddings are added to each flattened patch. This phase yields a complete input token sequence including the class token. The feature extraction phase is the second phase, in which the sequence of patch embeddings is processed through multiple conditional transformer layers. Each layer applies a multi-head attention (MHA) mechanism that extracts rich context-aware features. The last phase is the classification phase, in which the class token extracted through the second phase is passed through a fully connected layer to produce the final prediction for the traffic sign image. The proposed methodology is shown in Fig 1 below:

Fig 1. Proposed Conditional Visual Transformer (CViT) Model.

https://doi.org/10.1371/journal.pone.0335341.g001

The detailed steps involved in each phase of the proposed methodology are discussed in the following section.

4.1 Data preprocessing.

In the data processing phase, the CViT prepares the input raw traffic sign image by dividing it into small patches, converting them into embeddings, and then adding positional information to these embeddings. This step transforms the traffic sign image into a structured sequence that the CViT can process easily. Fig 2 shows the steps performed in this phase, and they are described in detail as follows:

4.1.1 Patch Extraction.

The first step in the data preprocessing phase is to collect a dataset of traffic sign images, each labeled with the correct class. These images are first resized to a fixed resolution, e.g., 224 × 224 pixels. Each resized image $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ denote height and width and $C$ denotes the number of channels (e.g., $C = 3$ for RGB images), is divided into non-overlapping patches of size $P \times P$. The total number of patches extracted from the image can be calculated as follows:

$$N = \frac{H \times W}{P^{2}} \tag{1}$$

Then, each patch is flattened into a vector of size $P^{2} \cdot C$.

4.1.2 Patch Embedding and Positional Encoding.

In this second step, to obtain patch embeddings, each flattened patch $x_p^{i} \in \mathbb{R}^{P^{2} \cdot C}$ is projected via a trainable linear layer into a higher-dimensional embedding space. Then, positional encodings are added so that the spatial relationships between patches are retained. The patch embedding can be represented as follows:

$$z_i = E\, x_p^{i} + b \tag{2}$$

Where $E \in \mathbb{R}^{D \times (P^{2} \cdot C)}$ and $b \in \mathbb{R}^{D}$ are learnable parameters, and $D$ is the embedding dimension. Adding the positional encoding to each patch embedding can be represented as follows:

$$z_i \leftarrow z_i + p_i \tag{3}$$

Where $p_i \in \mathbb{R}^{D}$ represents the positional encoding of patch $i$.

4.1.3 Forming the Token Sequence.

The next step of the preprocessing phase is to generate a learnable class token $x_{\text{cls}}$ that represents global information. A learnable positional embedding is added to this class token. The complete input sequence is then formed by prepending the class token to the patch tokens:

$$Z = [\,x_{\text{cls}};\, z_1;\, z_2;\, \ldots;\, z_N\,] \tag{4}$$

This sequence is the input to the conditional transformer encoder.
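As a concrete illustration of the preprocessing phase (Eqs 1–4), the following is a minimal NumPy sketch; the patch size, embedding dimension, and random weight matrices are illustrative stand-ins for the learned parameters, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224; C = 3; P = 16; D = 64          # illustrative sizes
N = (H * W) // (P * P)                      # Eq. (1): number of patches

img = rng.standard_normal((H, W, C))        # stand-in traffic sign image

# Split into N non-overlapping P x P patches and flatten each to P^2 * C.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)

# Eq. (2): trainable linear projection E and bias b (random stand-ins).
E = rng.standard_normal((P * P * C, D)) * 0.02
b = np.zeros(D)
emb = patches @ E + b                       # (N, D) patch embeddings

# Eq. (3): learnable positional encodings (one per token, incl. cls).
pos = rng.standard_normal((N + 1, D)) * 0.02

# Eq. (4): prepend the class token, then add positions.
cls = rng.standard_normal((1, D)) * 0.02
Z = np.concatenate([cls, emb], axis=0) + pos   # (N + 1, D) token sequence

print(Z.shape)   # (197, 64)
```

With a 224 × 224 image and 16 × 16 patches, this yields N = 196 patch tokens plus the class token, matching the standard ViT tokenization.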

4.2 Feature extraction through conditional transformer layer

The second phase is feature extraction through the conditional transformer layer. This phase focuses on extracting contextual features from the token sequence using a conditional attention mechanism. A gating network computes the adaptive Q, K, and V representations. These dynamic projections help the model attend differently depending on the type of traffic sign in the input. Phase 2 of the proposed methodology is shown in Fig 3, and each step is described in the next sub-sections.

Fig 3. Gating Network for Conditional Q, K, and V Projection.

https://doi.org/10.1371/journal.pone.0335341.g003

4.2.1 Gating network (Global features to weights).

In the first step, the global features $g$ are processed through a fully connected (FC) layer with softmax activation to generate the gating weights:

$$w = \mathrm{softmax}(W_g\, g + b_g) \tag{5}$$

Where

  • $W_g$ is the learnable weight matrix of the FC layer,
  • $b_g$ is the learnable bias vector that shifts the FC layer outputs before the softmax,
  • $w$ is the resulting normalized weight vector such that $\sum_k w_k = 1$.
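The gating step can be sketched in a few lines of NumPy; the global feature, FC weights, and number of experts here are illustrative stand-ins for the learned quantities.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 64          # embedding dimension
K = 4           # number of expert projections (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Eq. (5): an FC layer followed by softmax maps the global feature g
# to a normalized weight vector w over the K experts.
W_g = rng.standard_normal((K, D)) * 0.02   # learnable FC weight (stand-in)
b_g = np.zeros(K)                          # learnable FC bias

g = rng.standard_normal(D)                 # global feature (e.g., token mean)
w = softmax(W_g @ g + b_g)                 # (K,), non-negative, sums to 1
```

The softmax guarantees the constraint from the bullet list: every gating weight is non-negative and the weights sum to one.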

4.2.2 Conditional Q, K, V projection.

In the proposed CViT, adaptive Q, K, and V matrices are computed through a weighted sum over several expert matrices instead of fixed projection matrices. The effective projections are represented in Eq (6) as follows:

$$W^{Q} = \sum_{k=1}^{K} w_k\, W_k^{Q}, \qquad W^{K} = \sum_{k=1}^{K} w_k\, W_k^{K}, \qquad W^{V} = \sum_{k=1}^{K} w_k\, W_k^{V} \tag{6}$$

Where $W_k^{Q}$, $W_k^{K}$, and $W_k^{V}$ are the expert projection matrices and $w_k$ are the gating weights from Eq (5). This allows the conditional layer to integrate traffic sign-relevant features with the current target embeddings (the patch embeddings).
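The weighted-sum-of-experts idea in Eq (6) reduces to a single tensor contraction; the sketch below uses random stand-ins for the expert matrices and a fixed gating vector for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 64, 4          # embedding dim and number of experts (illustrative)

# K expert projection matrices for each of Q, K, and V
# (random stand-ins for the learned expert weights).
WQ_experts = rng.standard_normal((K, D, D)) * 0.02
WK_experts = rng.standard_normal((K, D, D)) * 0.02
WV_experts = rng.standard_normal((K, D, D)) * 0.02

w = np.array([0.1, 0.2, 0.3, 0.4])        # gating weights from Eq. (5)

# Eq. (6): the effective projection is the w-weighted sum of experts.
WQ = np.tensordot(w, WQ_experts, axes=1)  # (D, D)
WK = np.tensordot(w, WK_experts, axes=1)
WV = np.tensordot(w, WV_experts, axes=1)

X = rng.standard_normal((197, D))          # token sequence from Phase 1
Q, Kproj, V = X @ WQ, X @ WK, X @ WV       # conditional Q, K, V
```

Because the mixing happens at the weight level, only one matrix multiply per role is needed at attention time, regardless of the number of experts.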

4.2.3 Conditional Multi-Head Attention (MHA).

In the next step, adaptive matrices are fed into the MHA mechanism to compute the attention score and output representation. The self-attention computation can be described as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{7}$$

Where $d_k$ is the dimension of the key vectors. Then, a multi-head mechanism is applied in which the projections are split into multiple heads, and each head computes its attention. The block inside the dotted region is repeated $L$ times, where $L$ denotes the number of stacked transformer encoder layers. The outputs are then concatenated and linearly projected back to the original embedding dimension. The conditional multi-head attention with residual connections and feed-forward network is depicted in Fig 4.

Fig 4. Conditional multi-head attention, residual connections, and feed-forward network.

https://doi.org/10.1371/journal.pone.0335341.g004
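The per-head attention computation of Eq (7) can be sketched as follows; the conditional Q, K, and V are replaced by random stand-ins, and the head count is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
T, D, h = 197, 64, 8        # tokens, embedding dim, attention heads
dk = D // h                  # per-head key dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Conditional Q, K, V from Eq. (6) (random stand-ins here).
Q = rng.standard_normal((T, D))
Kproj = rng.standard_normal((T, D))
V = rng.standard_normal((T, D))

def split_heads(X):
    return X.reshape(T, h, dk).transpose(1, 0, 2)   # (h, T, dk)

Qh, Kh, Vh = split_heads(Q), split_heads(Kproj), split_heads(V)

# Eq. (7): scaled dot-product attention, computed per head.
A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dk))   # (h, T, T)

# Concatenate heads back to the original embedding dimension.
out = (A @ Vh).transpose(1, 0, 2).reshape(T, D)
```

Each row of every head's attention matrix is a probability distribution over tokens, and concatenating the heads restores the original embedding dimension before the final linear projection.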

4.2.4 Residual Connections and Feed-Forward Network (FFN).

After applying the attention mechanism, a residual connection and layer normalization are applied, followed by a feed-forward network to refine the traffic sign features further. The Residual & Normalization after Attention, Feed-Forward Network (FFN), and Residual & Normalization after FFN follow the standard transformer encoder formulation [38,39], which can be mathematically represented as follows:

$$Z' = \mathrm{LayerNorm}\big(Z + \mathrm{MHA}(Z)\big) \tag{8}$$

$$F = \mathrm{FFN}(Z') = \max(0,\ Z' W_1 + b_1)\, W_2 + b_2 \tag{9}$$

$$Z'' = \mathrm{LayerNorm}(Z' + F) \tag{10}$$

The output from repeated blocks of conditional attention and FFN forms the final encoded representation of the traffic sign image. This contextual representation of the input traffic sign image encodes the dynamic attention-based relationships among patches and is further passed for classification.
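The residual-and-normalize pattern of the standard encoder block (Eqs 8–10) can be sketched as below; the attention output is a random stand-in, and the FFN width and ReLU activation are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
T, D, Dff = 197, 64, 256    # tokens, model dim, FFN hidden dim (illustrative)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

X = rng.standard_normal((T, D))          # encoder block input
attn_out = rng.standard_normal((T, D))   # stand-in for the MHA output

# Eq. (8): residual connection and normalization after attention.
Z1 = layer_norm(X + attn_out)

# Eq. (9): position-wise feed-forward network (ReLU used here).
W1 = rng.standard_normal((D, Dff)) * 0.02; b1 = np.zeros(Dff)
W2 = rng.standard_normal((Dff, D)) * 0.02; b2 = np.zeros(D)
ffn = np.maximum(Z1 @ W1 + b1, 0) @ W2 + b2

# Eq. (10): second residual connection and normalization.
Z2 = layer_norm(Z1 + ffn)
```

Stacking this block repeatedly, with the conditional attention of Eqs 5–7 inside it, produces the final encoded representation passed on to the classifier.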

4.3 Classification of traffic signs.

In the final phase of CViT, the refined global feature representation is used to classify traffic signs into different categories. This phase condenses the transformer encoder output into a decision through the classifier. In the first step, the final class-token feature $z_{\text{cls}}$ is extracted from the output of CViT. Then, this global feature is fed into a classification head, a fully connected layer that maps it to class logits. The mapping from logits to prediction can be represented as follows:

$$y = W_c\, z_{\text{cls}} + b_c \tag{11}$$

$$\hat{y} = \arg\max_j \big(\mathrm{softmax}(y)_j\big) \tag{12}$$

Here, $y$ are the logits and the softmax produces class probabilities for the traffic sign. The proposed CViT’s complete algorithm can be described as follows:

input: Traffic sign image (X)

output: Class label $\hat{y}$

begin

1.Pre-processing Phase

  i) The input image X is resized to a fixed resolution (e.g., 224 × 224).

  ii) Divide image X into N non-overlapping patches of size P × P.

  iii) Flatten each patch into a vector of size $P^{2} \cdot C$, where $C$ = number of channels.

  iv) Project each patch into the embedding space (Eq 2).

  v) Add positional encoding to each embedding to retain spatial information (Eq 3).

  vi) Generate a learnable class token and add its positional encoding.

  vii) Form the complete input sequence to the transformer (Eq 4).

2.Feature Extraction Phase using Conditional Transformer

  i. Compute the global features from the input sequence Z.

  ii. Pass global features through a gating network with softmax to obtain the weights (Eq 5).

  iii. Compute adaptive projections using the weighted sum of expert matrices (Eq 6).

  iv. Apply conditional MHA (Eq 7).

  v. Add the residual connection and layer normalization (Eq 8).

  vi. Pass through the Feed-Forward Network (Eq 9).

  vii. Apply the second residual connection and normalization (Eq 10).

  viii. Steps (ii)–(vii) are repeated through multiple Conditional Transformer blocks.

3.Classification Phase

  i) Extract the final encoded class token from the output of the transformer.

  ii) Pass it through a fully connected classification head to generate logits (Eq 11).

  iii) Apply softmax to obtain class probabilities and predict the label (Eq 12).

End

The fail-control strategy is also integrated as a post-classification decision layer in the CViT model. Following the model’s output probabilities, a confidence threshold is applied to identify predictions with insufficient certainty. If the maximum softmax probability falls below this predefined threshold $\tau$, the output is intentionally omitted to avoid risking a low-confidence and potentially incorrect classification. The fail-control mechanism is shown in Fig 5. This rejection mechanism is inspired by previous work on fail-safe classifiers, particularly those discussed in safety-critical machine learning frameworks [31,32,35–37]. By treating omission as a valid and intentional outcome, the system can balance accuracy and operational safety, especially in real-world deployments where uncertainty cannot be ignored.

The classifier generates an initial class prediction for input data (traffic sign image). This prediction is passed to an output checker, which evaluates whether the result is “trustable” based on confidence thresholds and fail-control metrics. If trustable, the prediction is accepted as part of the final classification results. If not trustable, the prediction is discarded to avoid misleading or unsafe outputs. The mechanism integrates metrics from Table 7 to decide on trustworthiness. For instance, α measures how many predictions are accepted, ε tracks error, while φ quantifies critical omissions. This layered decision process improves robustness by ensuring only reliable predictions are used in downstream applications such as autonomous driving.
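The classification head plus the fail-control check (Eqs 11–12 and the threshold test) can be sketched as a single accept-or-reject function; the threshold value and the example logits below are illustrative, not values from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Eqs. (11)-(12) plus the fail-control check: the head produces logits,
# softmax turns them into class probabilities, and a prediction is
# accepted only if its top probability clears the confidence threshold
# tau (0.9 here is an illustrative value).
def fail_controlled_predict(logits, tau=0.9):
    p = softmax(logits)
    conf = p.max()
    if conf >= tau:
        return int(p.argmax()), conf   # trustable: emit the class label
    return None, conf                  # non-trustable: omit the output

label, conf1 = fail_controlled_predict(np.array([8.0, 1.0, 0.5]))  # confident
rej, conf2 = fail_controlled_predict(np.array([1.0, 0.9, 0.8]))    # ambiguous
print(label, rej)   # 0 None
```

The first input is accepted because one logit dominates; the second is rejected because the probability mass is spread almost evenly, exactly the case the output checker routes away from downstream decision-making.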

5 Experimental setup and results

The proposed CViT was experimentally evaluated on the German Traffic Sign Recognition Benchmark (GTSRB) dataset, which comprises over 50,000 images of 43 distinct traffic sign classes. Images are resized to 32 × 32 pixels and augmented using random horizontal flips, rotations, and color jitter for the training set, while standard resizing is used for validation and testing. The dataset was divided into training, validation, and testing subsets using an 80:10:10 split, ensuring that class distribution was maintained across subsets. A custom PyTorch Dataset class reads the image paths and labels from the CSV files. A WeightedRandomSampler is applied during training to address class imbalance. The CViT architecture includes patch embedding, a gating network, and multiple transformer encoder layers with conditional MHA using expert-based query, key, and value projections. The proposed model comprises approximately 12.3 million trainable parameters, reflecting its computational complexity and capacity to capture fine-grained patterns in traffic sign images. The model is trained for 50 epochs using the Adam optimizer and cross-entropy loss. To avoid overfitting and ensure convergence, early stopping and learning rate scheduling were employed. The model was trained using the PyTorch deep learning framework on a machine equipped with an NVIDIA RTX GPU, 16 GB RAM, and an AMD Ryzen processor.

To evaluate the performance of the proposed CViT, several experiments were conducted on the publicly available GTSRB dataset, which contains 43 distinct traffic sign categories. The classification performance is assessed using different performance metrics, such as accuracy, precision, recall, and F1-score, for each class. In addition, macro, micro, and weighted averages are measured to demonstrate the model’s overall performance.

Initial experiments were conducted using a simple ViT architecture to evaluate the effectiveness of the proposed CViT. This baseline model served as a reference to assess how well the conditional attention mechanism improves performance. The same dataset and evaluation metrics were then applied to the CViT model. These results are discussed one by one as follows:

5.1 Experimental results using simple Vision Transformer (ViT)

First, the traffic sign images are passed to a simple ViT model. Although the simple ViT achieved an overall accuracy of 94.25%, the class-wise results show room for improvement. The performance measures of the top 5 best-performing classes and the top 5 least-performing classes are shown in Table 2. T21, T2, and T41 exhibited the lowest accuracy values, with T21 being the most misclassified, achieving an accuracy of 91.60%. Here, T1–T43 denote the 43 traffic sign categories defined in the GTSRB dataset (e.g., T1 = Speed Limit 20, T2 = Speed Limit 50, T21 = Keep Right, T41 = End of No Passing, etc.). Compared with the best-performing class T8 (98.61%), T21 shows a 7.01-percentage-point drop in accuracy, indicating more frequent confusion during classification.

Table 2. Top 5 Best-performing and top 5 least-performing classes for the simple ViT.

https://doi.org/10.1371/journal.pone.0335341.t002

T21 and T2 also have the lowest precision and recall scores (around 60%–63%), indicating high misclassification rates in both directions. The recall for T21 (59%) is the lowest across all classes, indicating that the model frequently fails to correctly identify true instances of this class. Only the five best-performing and five least-performing classes are shown to highlight extremes in performance. The micro average, macro average, and weighted average are also measured, as shown in Table 3 below.

Table 3. Micro average, macro average, and weighted average of simple ViT.

https://doi.org/10.1371/journal.pone.0335341.t003

5.2 Conditional Vision Transformer (CViT) Results

After collecting the classification results of traffic sign images through a simple ViT model, the CViT model is evaluated. To ensure a fair comparison, both models were trained and evaluated on the same dataset under similar conditions. The results reveal significant improvements in overall accuracy, per-class stability, and generalizability with CViT. The performance measures of the top 5 best-performing classes and the top 5 least-performing classes are shown in Table 4. The CViT significantly outperformed the simple ViT, achieving an overall accuracy of 99.87%. Its per-class consistency and high precision-recall values indicate a robust classification capability. T21, T2, and T41 still exhibit slightly lower accuracy than other classes, with T21 achieving the lowest accuracy of 99.27%.

Table 4. Top 5 best-performing and top 5 least-performing classes for CViT.

https://doi.org/10.1371/journal.pone.0335341.t004

Takeaway 2: The model achieves outstanding performance on the GTSRB dataset with 99.87% accuracy, significantly outperforming the baseline ViT and recent approaches.

The accuracy difference between T21 and the top-performing class T12 (99.89%) is just 0.62%, indicating remarkable consistency across classes. While all precision and recall values are above 92%, T21’s recall (92%) is still the lowest, suggesting room for improvement in identifying true positives for this class. Unlike the Simple ViT, the F1 scores for all classes are tightly grouped between 92% and 99%, indicating balanced prediction quality. Only the five best-performing and five least-performing classes are shown to highlight extremes in performance. The micro average, macro average, and weighted average are also measured, as shown in Table 5 below:

Table 5. Micro average, macro average, and weighted average of CViT.

https://doi.org/10.1371/journal.pone.0335341.t005

The comparison of simple ViT and CViT is shown in Table 6. The comparison in the table shows that the proposed CViT clearly outperforms the simple ViT. CViT achieved a higher overall accuracy (99.87% vs. 94.25%) and improved recall for the most misclassified class (T21) from 59% to 92%. It also narrowed the accuracy gap between the best- and worst-performing classes from 7.01% to only 0.62%, reflecting more balanced learning across all categories. Moreover, CViT consistently delivered stronger precision, recall, and F1-scores, and raised the performance of even the weakest classes (from ~67.4% to ~95.2% in average F1). These results highlight that CViT not only achieves top-level accuracy but also handles challenging cases more effectively, making it highly reliable for autonomous driving applications.

Table 6. Comparison of the performance measures of simple ViT and CViT.

https://doi.org/10.1371/journal.pone.0335341.t006

Takeaway 3: CViT demonstrates the potential for fail-controlled, interpretable, and highly accurate models in autonomous driving and other safety-critical applications.

To validate the effectiveness of the proposed CViT, a comparative analysis is performed on the GTSRB dataset against several recent approaches. Fig 6 compares the performance of the recent models [40–42] along with the Simple ViT and the proposed CViT. Although these existing models achieved notable accuracies ranging from 98.41% to 99.66%, the CViT outperforms them all, attaining 99.87% accuracy. Specifically, [41] shows the highest accuracy among the models in the figure, achieving 99.66%, followed by [42] at 98.5% and [40] at 98.41%, while the Simple ViT baseline achieved an accuracy of 94.25%. The CViT thus boosts classification accuracy by 5.62% over the Simple ViT and by 0.21% over [41]. These results highlight the robustness and reliability of the proposed CViT in classifying traffic signs across diverse image conditions by refining the feature extraction process. The key reason for this improvement is CViT’s adaptive attention mechanism, which dynamically adjusts the Q, K, and V projections based on input variations, unlike fixed-attention models that struggle with occlusions, illumination changes, and diverse traffic sign appearances.

Fig 6. Graphical representation of comparison with existing ViT-based techniques.

https://doi.org/10.1371/journal.pone.0335341.g006
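The input-conditioned Q, K, and V projections described above can be illustrated with a small single-head sketch. The paper does not give the exact gating formulation here, so the mixture-of-projections design and the mean-pooled gating signal below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_attention(x, Wg, Wq, Wk, Wv):
    """Single-head attention whose Q/K/V projections are a gated
    mixture of expert matrices, conditioned on the input tokens.

    x  : (n_tokens, d)           token embeddings
    Wg : (d, n_experts)          gating weights
    Wq, Wk, Wv : (n_experts, d, d)  per-expert projection matrices
    """
    # Gate on a global summary of the input (mean-pooled tokens).
    gate = softmax(x.mean(axis=0) @ Wg)        # (n_experts,)
    # Blend the expert projection matrices by the gate weights.
    Q = x @ np.tensordot(gate, Wq, axes=1)     # (n_tokens, d)
    K = x @ np.tensordot(gate, Wk, axes=1)
    V = x @ np.tensordot(gate, Wv, axes=1)
    d = x.shape[1]
    attn = softmax(Q @ K.T / np.sqrt(d))       # (n_tokens, n_tokens)
    return attn @ V, gate
```

The design choice is that the projection matrices themselves, not just the attention weights, become a function of the input, which is what lets the model specialize its feature extraction per sign type.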

5.3 Fail-control mechanism on the CViT model

The fail-control strategy is embedded in the CViT output layer through a threshold-based confidence filter. After the model generates class probabilities, predictions with softmax scores below the confidence threshold are rejected. These inputs are labeled as “non-trustable” and omitted from classification. To assess the effectiveness of the fail-control mechanism, several key evaluation metrics are used, as shown in Table 7 below:

Table 7. Key evaluation metrics used for evaluating the fail-control mechanism.

https://doi.org/10.1371/journal.pone.0335341.t007
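The threshold-based confidence filter amounts to argmax acceptance gated by the softmax score. A minimal sketch follows; the threshold value and the sentinel label for rejected inputs are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

NON_TRUSTABLE = -1  # sentinel label for rejected inputs (name assumed)

def fail_controlled_predict(probs, threshold=0.9):
    """Accept the argmax class only when its softmax score clears the
    confidence threshold; otherwise flag the input as non-trustable.

    probs : (n_samples, n_classes) softmax outputs
    """
    probs = np.asarray(probs)
    preds = probs.argmax(axis=1)            # candidate class per sample
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, preds, NON_TRUSTABLE)
```

For example, with a 0.9 threshold an input scored [0.6, 0.4] would be rejected rather than classified, trading coverage for correctness.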

These metrics provide a clear picture of the trade-off between predictive accuracy and system reliability under fail-controlled operation. Table 8 summarizes the results of applying the fail-control mechanism to the classification output, showing the behavior of the model under both standard and fail-controlled conditions.

Table 8. Results of applying the fail control mechanism to the CViT model: α (Acceptance Rate), ε (Error Rate), φ (Critical Omissions), αw (Weighted Acceptance Rate), εw (Weighted Error Rate), φc (Critical Omissions – Corrected), φm (Missed Predictions), εgain (Error Reduction), φm_ratio (Missed Prediction Ratio).

https://doi.org/10.1371/journal.pone.0335341.t008

The base model achieved an overall accuracy of 82.1%. However, by activating the fail-control mechanism, the system reduced its error rate from 17.9% to 7.9%, a relative error reduction of 55.9%. Additionally, approximately 54.5% of misclassifications were effectively omitted, demonstrating the rejection strategy’s ability to prevent incorrect predictions from reaching the decision stage. Although some correct predictions were also omitted (8.3%), this trade-off is considered acceptable in safety-critical contexts. The slight reduction in accuracy among accepted predictions (from 82.1% to 73.8%) reflects a more cautious classification strategy that prioritizes correctness over coverage. The FC-CViT model demonstrates the benefits of integrating fail control into high-performing architectures. Although its overall accuracy (α = 82.1%) is lower due to deliberate omissions, its controlled accuracy on accepted predictions remains high at 73.8%. More importantly, the rejection mechanism reduces the error rate by 55.9%, and over half (54.5%) of potential misclassifications are effectively eliminated. These improvements are particularly impactful in real-world applications where the cost of a wrong prediction is far greater than that of no prediction.
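As a worked check of the reported error reduction, εgain is the relative drop in error rate, computed here with the rates from Table 8:

```python
def error_reduction(eps_base, eps_fc):
    """Relative error reduction achieved by the rejection filter:
    eps_gain = (eps_base - eps_fc) / eps_base."""
    return (eps_base - eps_fc) / eps_base

# Error rates reported for the CViT before and after fail control.
gain = error_reduction(0.179, 0.079)
print(f"{gain:.1%}")  # prints 55.9%
```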

6 Conclusion

In this study, a CViT model is proposed, comprising conditional query-key-value (Q, K, V) projections and attention-based mechanisms for robust and accurate traffic sign classification in autonomous vehicles. It also employs an adaptive gating network for input-dependent projection selection. The proposed method is evaluated on the publicly available GTSRB dataset, a multi-class dataset comprising 43 unique traffic sign classes. In these experiments, the proposed model significantly outperformed the baseline ViT, achieving an overall accuracy of 99.87%, precision of 99.07%, recall of 99.07%, and F1-score of 99.07%, demonstrating its effectiveness across all traffic sign classes. The proposed method is also compared with recent state-of-the-art approaches, and the results confirm that CViT not only provides superior accuracy but also maintains balanced precision, recall, and F1 performance. The findings highlight the importance of adaptive attention mechanisms in ViT architectures for tasks such as object detection and traffic sign recognition. This work opens a new direction for adopting such mechanisms in fail-controlled, interpretable, and highly accurate detection and recognition for real-world autonomous driving systems and other safety-critical applications.

In the future, this work can be extended by incorporating fail-control mechanisms more explicitly to reduce misclassification in safety-critical environments, integrating explainable AI (XAI) techniques for better interpretability of decisions, and adapting the CViT framework for real-time deployment on embedded systems in autonomous vehicles. Furthermore, adversarial testing frameworks such as critical-scenario generation offer a promising avenue for evaluating the robustness of fail-controlled recognition models under safety-critical conditions [43].

Supporting information

References

  1. Triki N, Karray M, Ksantini M. A Comprehensive Survey and Analysis of Traffic Sign Recognition Systems With Hardware Implementation. IEEE Access. 2024;12:144069–81.
  2. Sensoy M, Saleki M, Julier S, Aydogan R, Reid J. Misclassification risk and uncertainty quantification in deep classifiers. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021. 2484–92.
  3. Chen R, Hei L, Lai Y. Image Recognition and Safety Risk Assessment of Traffic Sign Based on Deep Convolution Neural Network. IEEE Access. 2020;8:201799–805.
  4. Wang T, Chen J, Lu J, Liu K, Zhu A, Snoussi H, et al. Synchronous Spatiotemporal Graph Transformer: A New Framework for Traffic Data Prediction. IEEE Trans Neural Netw Learn Syst. 2023;34(12):10589–99. pmid:35522636
  5. Wali SB, Abdullah MA, Hannan MA, Hussain A, Samad SA, Ker PJ, et al. Vision-Based Traffic Sign Detection and Recognition Systems: Current Trends and Challenges. Sensors (Basel). 2019;19(9):2093. pmid:31064098
  6. Farzipour A, Nejati Manzari O, Shokouhi SB. Traffic sign recognition using local Vision Transformer. In: 2023 13th International Conference on Computer and Knowledge Engineering (ICCKE). IEEE; 2023. p. 191–6.
  7. Li Z, Hu J, Leng B, Xiong L, Fu Z. An Integrated of Decision Making and Motion Planning Framework for Enhanced Oscillation-Free Capability. IEEE Trans Intell Transport Syst. 2024;25(6):5718–32.
  8. Xu S, Chang D, Xie J, Ma Z. GRAD-CAM Guided Channel-Spatial Attention Module for Fine-Grained Visual Classification. In: 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), 2021. 1–6.
  9. Ghouse M, Farag S, Butt U. Traffic sign recognition based on CNN vs different transfer learning techniques. In: Fifth Symposium on Pattern Recognition and Applications (SPRA 2024), 2025. 17.
  10. Fawole O, Rawat D. Recent Advances in Vision Transformer Robustness Against Adversarial Attacks in Traffic Sign Detection and Recognition: A Survey. ACM Comput Surv. 2025;57(10):1–33.
  11. Zhao R, Tang SH, Shen J, Supeni EEB, Rahim SA. Enhancing autonomous driving safety: A robust traffic sign detection and recognition model TSD-YOLO. Signal Processing. 2024;225:109619.
  12. Khalid S, Shah JH, Sharif M, Dahan F, Saleem R, Masood A. A Robust Intelligent System for Text-Based Traffic Signs Detection and Recognition in Challenging Weather Conditions. IEEE Access. 2024;12:78261–74.
  13. Hussain M, Hong JE. Evaluating and improving adversarial robustness of deep learning models for intelligent vehicle safety. IEEE Trans Reliab. 2024.
  14. Meshram V, Suryawanshi Y, Meshram V, Patil K. Addressing misclassification in deep learning: A Merged Net approach. Software Impacts. 2023;17:100525.
  15. Kheirandish M, Zhang S, Catanzaro DG, Crudu V. Quantifying uncertainty in deep learning binary classification with discrete noise in inputs for risk-based decision making. IISE Trans. 2025;(just-accepted):1–20.
  16. Smith MR, Martinez T. Improving classification accuracy by identifying and removing instances that should be misclassified. In: The 2011 International Joint Conference on Neural Networks, 2011. 2690–7.
  17. Alsiyeu U, Duisebekov Z. Enhancing traffic sign recognition with tailored data augmentation: addressing class imbalance and instance scarcity. arXiv preprint. 2024.
  18. Martinović I, Mateo Sanguino T de J, Jovanović J, Jovanović M, Djukanović M. One Possible Path Towards a More Robust Task of Traffic Sign Classification in Autonomous Vehicles Using Autoencoders. Electronics. 2025;14(12):2382.
  19. Sun T, Guo R, Chen G, Wang H, Li E, Zhang W. RID-LIO: robust and accurate intensity-assisted LiDAR-based SLAM for degenerated environments. Meas Sci Technol. 2025;36(3):036313.
  20. Wang Y, Deng Y, Zheng Y, Chattopadhyay P, Wang L. Vision Transformers for Image Classification: A Comparative Survey. Technologies. 2025;13(1):32.
  21. Zeng F, Yu D, Kong Z, Tang H. Token transforming: A unified and training-free token compression framework for vision transformer acceleration. arXiv preprint. 2025. https://arxiv.org/abs/2506.05709
  22. Ishibashi R, Meng L. Automatic pruning rate adjustment for dynamic token reduction in Vision Transformer. Appl Intell. 2025;55(5):342.
  23. Parse MV, Pramod D, Kumar D. A hybrid model combining depthwise separable convolutions and vision transformers for traffic sign classification under challenging weather conditions. Int J Syst Assur Eng Manag. 2025:1–23.
  24. Leem S, Seo H. Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention. AAAI. 2024;38(4):2956–64.
  25. He S, Yu W, Tang T, Wang S, Li C, Xu E. FOS-YOLO: Multiscale context aggregation with attention-driven modulation for efficient target detection in complex environments. IEEE Transactions on Instrumentation and Measurement. 2025.
  26. Long H. Hybrid Design of CNN and Vision Transformer: A Review. In: Proceedings of the 2024 7th International Conference on Computer Information Science and Artificial Intelligence, 2024. 121–7.
  27. Mehdipour S, Mirroshandel SA, Tabatabaei SA. Vision transformers in precision agriculture: A comprehensive survey. 2025. https://arxiv.org/abs/2504.21706
  28. Shen Z, He Y, Du X, Yu J, Wang H, Wang Y. YCANet: Target Detection for Complex Traffic Scenes Based on Camera-LiDAR Fusion. IEEE Sensors J. 2024;24(6):8379–89.
  29. Borthakur A, Srivastava A, Kar A, Dewan D, Sheet D. Fantom: Federated Adversarial Network for Training Multi-Sequence Magnetic Resonance Imaging in Semantic Segmentation. In: 2024 IEEE International Conference on Image Processing (ICIP), 2024. 3972–8.
  30. Song S, Ye X, Manoharan S. E-MobileViT: a lightweight model for traffic sign recognition. Ind Artif Intell. 2025;3(1).
  31. Zoppi T, Khokhar FA, Ceccarelli A, Montecchi L, Bondavalli A. Fail-Controlled Classifiers: Do they Know when they don’t Know?. In: 2024 IEEE 29th Pacific Rim International Symposium on Dependable Computing (PRDC), 2024. 43–54.
  32. Gharib M, Zoppi T, Bondavalli A. On the Properness of Incorporating Binary Classification Machine Learning Algorithms Into Safety-Critical Systems. IEEE Trans Emerg Topics Comput. 2022;10(4):1671–86.
  33. Wang N, Xie S, Sato T, Luo Y, Xu K, Chen QA. Revisiting physical-world adversarial attack on traffic sign recognition: a commercial systems perspective. arXiv preprint. 2024.
  34. Song D, Zhao J, Zhu B, Han J, Jia S. Subjective Driving Risk Prediction Based on Spatiotemporal Distribution Features of Human Driver’s Cognitive Risk. IEEE Trans Intell Transport Syst. 2024;25(11):16687–703.
  35. Yin H, Vahdat A, Alvarez JM, Mallya A, Kautz J, Molchanov P. A-ViT: Adaptive tokens for efficient Vision Transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2022. p. 10809–18.
  36. Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang ZH, et al. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 558–67.
  37. Fu Z. Vision Transformer: ViT and its derivatives. arXiv preprint. 2022.
  38. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems, 2017.
  39. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv preprint. 2020.
  40. Mingwin S, Shisu Y, Wanwag Y, Huing S. Revolutionizing traffic sign recognition: unveiling the potential of vision transformers. arXiv preprint. 2024.
  41. Santhiya P, Jebadurai IJ, Paulraj GJL, A J, Jawahar ED, V ENaveen. Implementing ViT Models for Traffic Sign Detection in Autonomous Driving Systems. In: 2024 5th International Conference on Recent Trends in Computer Science and Technology (ICRTCST), 2024. 382–7.
  42. Lim XR, Lee CP, Lim KM, Ong TS. Enhanced Traffic Sign Recognition with Ensemble Learning. JSAN. 2023;12(2):33.
  43. Zhu B, Tang R, Zhao J, Zhang P, Li W, Cao X, et al. Critical scenarios adversarial generation method for intelligent vehicles testing based on hierarchical reinforcement architecture. Accid Anal Prev. 2025;215:108013. pmid:40121971