
Explainable AI for sign language recognition models: Integrating Grad-CAM, LIME, and Integrated Gradients

  • Fatima-Zahrae El-Qoraychy ,

    Roles Methodology, Visualization, Writing – original draft, Writing – review & editing

    fatima.el-qoraychy@utbm.fr

    Affiliation Université de Technologie de Belfort Montbéliard, UTBM, CIAD UR 7533, Belfort, France

  • Yazan Mualla,

    Roles Methodology, Supervision, Writing – review & editing

    Affiliation Université de Technologie de Belfort Montbéliard, UTBM, CIAD UR 7533, Belfort, France

  • Hui Zhao,

    Roles Writing – review & editing

    Affiliation Department of Computer Science and Technology, Tongji University, Shanghai, China

  • Mahjoub Dridi,

    Roles Supervision, Writing – review & editing

    Affiliation Université de Technologie de Belfort Montbéliard, UTBM, CIAD UR 7533, Belfort, France

  • Jean-Charles Créput,

    Roles Supervision, Writing – review & editing

    Affiliation Université de Technologie de Belfort Montbéliard, UTBM, CIAD UR 7533, Belfort, France

  • Luca Longo

    Roles Writing – review & editing

    Affiliations Artificial Intelligence and Cognitive Load Research Lab, University College Cork, Cork, Ireland, School of Computer Science and Information Technology, University College Cork, Western Gateway Building, Cork, Ireland

Abstract

Sign language recognition is crucial in bridging the communication gaps between hearing and deaf communities. In this study, we build on an existing sign language classification model based on the VGG19 architecture, enhancing its robustness through dataset augmentation and alternative data representations. We introduce a segmentation-based approach that utilizes hand masks generated by the U-Net model, replacing depth images to mitigate noise and improve classification accuracy. We use an Explainable Artificial Intelligence (XAI) approach that incorporates Grad-CAM, LIME, and integrated gradients methods to interpret the model’s decision-making process, ensuring transparency and reliability in real-world applications. Our comparative analysis between Red-Green-Blue (RGB) and mask-based models demonstrates that while the RGB model benefits from richer texture and color information, the mask-based model effectively focuses on hand shape and structure. The integration of XAI further validates our results by highlighting key regions of the image that influence the model’s predictions and by enabling a multi-perspective analysis that captures complementary aspects of the input, including region-based attention, pixel-level attribution, and structural shape analysis, thereby facilitating a deeper understanding of the model’s internal representations and potential failure modes. This work contributes to advancing American Sign Language recognition by improving model generalisation and explainability, ultimately fostering greater trust and usability in assistive technologies.

Introduction

Human-Computer Interaction (HCI) is based on exchanging information and commands between human users and technological systems. It explores how humans interact with various technologies, including computers, machines, artificial intelligence (AI), agents, and robots. These interactions can take multiple forms: cooperation, collaboration, team dynamics, symbiosis, and integration. Understanding these relationships is essential for designing systems that enhance user experience while improving the transparency and trustworthiness of technological systems [1]. The early studies on HCI primarily focused on designing user-friendly interfaces for command-line systems [2]. Over time, HCI has expanded to include multimodal interaction methods such as touch interfaces, voice recognition, and gesture-based controls [3]. The growing presence of computing devices, from personal computers to smartphones, makes HCI ubiquitous in everyday life. This increased reliance on various interaction forms requires continuous innovation to ensure technology remains intuitive, accessible, and efficient for users. There are various forms of HCI, such as text input, voice commands, and gestures. Text-based inputs, like typing on a keyboard, remain fundamental but inefficient, especially for users with physical or cognitive impairments. Voice recognition technologies, such as virtual assistants (Siri, Alexa, etc.), offer hands-free interaction, valuable in contexts where manual input is impractical [4]. Gesture recognition, which underpins Sign Language Recognition (SLR), is vital in bridging the communication gap between the deaf or hard-of-hearing community and the hearing population by converting gestures into text or speech. Early approaches to SLR used rule-based systems or sensor-based inputs. 
However, advances in AI, particularly Convolutional Neural Networks (CNNs), have enabled the development of more sophisticated models for recognising sign language from video sequences or images [5–7]. These models significantly improve accessibility for individuals with hearing impairments, promoting inclusivity across various sectors, including education and customer service. Despite these advancements, deploying AI-driven SLR systems remains challenging due to deep learning models’ “black box” nature [8]. While CNNs excel in classification tasks, the inferential mechanism they learn from input to output remains opaque. This makes it difficult for developers and users to understand why a model behaves in certain ways, especially when misclassifications occur. This limited transparency raises concerns about trust, reliability, and adaptability, particularly when dealing with complex gestures, individual variations, or regional differences in sign language. For example, signs may vary due to factors like signing speed, personal style, or regional dialects, further complicating classification. Thus, addressing these challenges requires enhancing the AI models’ explainability.

Explainable Artificial Intelligence (XAI) is key to overcoming this barrier. XAI includes techniques designed to make AI systems more interpretable by clearly explaining their decisions and actions [9,10]. In the context of SLR, XAI can help identify which gesture features influence a model’s prediction, allowing users and developers to better understand and improve the system. Moreover, XAI ensures that AI systems are fair, transparent, and free from hidden biases, which is particularly important in SLR, where accessibility and inclusivity are paramount. It is increasingly recognised that fostering effective communication and understanding between humans and AI systems is vital for their successful integration into various applications [11,12]. Explainability is essential for technical improvement and fostering trust with users who rely on these systems for critical communication. The Defence Advanced Research Projects Agency (DARPA) launched the “XAI Program” in 2017 [13], sparking significant advancements in AI explainability. Notable contributions include the HAExA architecture [14], which provides transparent agent decision explanations, while RISE [15] generates importance maps to highlight key neural network elements. These initiatives underscore the growing importance of XAI across various domains, including SLR, where it improves model performance and trustworthiness by offering insights into decision-making processes. Several techniques have emerged in XAI, each offering distinct advantages depending on the application. Methods such as Local Interpretable Model-Agnostic Explanations (LIME) [16–19], SHapley Additive exPlanations (SHAP) [20–22], and Gradient-weighted Class Activation Mapping (Grad-CAM) [23–26] have been widely used to explain AI inferential mechanisms in various domains.

In this article, we explore how these techniques can be applied to interpret CNN-based SLR systems, focusing on enhancing both model transparency and performance. The primary objective of this work is to develop a real-time prediction system for American Sign Language (ASL) alphabet recognition, as illustrated in Fig 1 (the referenced ASL alphabet chart is extracted from [27]). The system is based on RGB and mask images to predict letters, while integrating XAI techniques to improve the interpretation of the input–output mapping learned by the models. Our proposed approach combines three complementary explainability techniques, each offering distinct insights, as illustrated in Fig 2. These include:

  • Region-based Explanation: it highlights key areas in an input image that strongly influence classification outcomes;
  • Pixel-Level Importance: it evaluates individual pixel contributions to the final decision;
  • Shape Analysis: it focuses on structural elements, understanding how variations in form impact classification results.
Fig 2. Real-time ASL recognition approach with integrated visual explanation using Grad-CAM, LIME, and Integrated Gradients.

https://doi.org/10.1371/journal.pone.0336481.g002

The structure of this paper is as follows. Sect 2 presents an overview of the basic concept of SLR, existing classification models, and relevant XAI techniques. Sect 3 outlines the experimental setup, detailing how the classification and XAI models are integrated. In Sect 4, a discussion of the results, focusing on the insights gained through XAI and the improvements made to the classification model, is presented. Finally, Sect 5 concludes the article by summarizing the contributions and proposing future research directions.

Related work

SLR has attracted growing interest due to its potential to bridge communication barriers for the deaf and hard-of-hearing communities. Recent advances in deep learning, particularly CNNs, have significantly improved the performance of SLR systems, enabling the automatic recognition of hand gestures from RGB images, videos, and depth data [28–31].

Early works focused primarily on handcrafted features and skin colour segmentation in various colour spaces such as RGB and HSV [32]. These methods improved gesture localisation but were highly sensitive to lighting variations, background clutter, and inter-user variability. For example, Tripathi et al. [33] developed a gesture-to-text translation system for Indian Sign Language using speeded-up robust features for feature extraction, combined with edge detection, skin masking, and a Bag of Visual Words model followed by classical machine learning classifiers. Similarly, [34] proposed an SVM-based recognition system trained on a custom dataset. While effective in structured settings, these traditional approaches do not scale well to complex or dynamic gestures. In parallel, some early approaches leveraged sensor-based systems to capture hand kinematics directly, bypassing the need for visual input. For example, Wang et al. [35] proposed an ASL recognition system that combined data from a CyberGlove and a Flock of Birds motion tracker within a multi-dimensional Hidden Markov Model (HMM) framework, enabling the recognition of both static and dynamic gestures. Similarly, Raheem [27] utilised a sensory glove in conjunction with a Multi-Layer Perceptron (MLP) to classify ASL hand gestures and analyse recognition accuracy. While both systems demonstrated high performance in controlled conditions, they share key limitations: a reliance on costly and intrusive hardware, poor generalisation to diverse signers or environments, and limited scalability beyond laboratory settings. Furthermore, these sensor-dependent approaches are tightly coupled to specific modalities, making them less adaptable to current vision-based, marker-free SLR trends. To overcome such limitations, researchers have explored skeletal data obtained from depth sensors like Microsoft Kinect [36,37], which capture 3D joint coordinates for robust spatio-temporal modelling. 
Although these methods enhance recognition of dynamic signs, they require specialised hardware, limiting their use in real-world or low-resource environments. More recently, deep learning approaches have emerged as the dominant paradigm. For example, Talaat et al. [38] proposed a real-time Arabic Sign Language translation system using YOLOv8 for gesture detection and an animated avatar for visual feedback, achieving high accuracy across multiple datasets. Najib [39] developed MSLI, a multilingual system capable of recognising signs from 11 different sign languages using a two-stage pipeline for language detection and gesture classification, demonstrating strong generalisation capabilities. In the context of real-time applications, CNN-based pipelines have been successfully deployed [40,41]. However, many of these models rely on large annotated datasets that often lack diversity in terms of hand shape, skin tone, and signing styles [42,43]. As a result, generalisation to heterogeneous populations remains a challenge. Moreover, deep learning models often act as black boxes, limiting transparency and raising concerns about fairness and reliability, especially in assistive or safety-critical settings. In sign language, subtle gesture variations can alter meaning, amplifying the need for interpretable models. Recent studies have integrated Explainable Artificial Intelligence (XAI) techniques into SLR frameworks [44–46]. Methods such as LIME, SHAP, and Grad-CAM help visualise the relationship between input features and model predictions. McCleary et al. [47] utilised micro-Doppler radar signatures and implemented a custom explainability algorithm to localise relevant signal regions. While their method achieves impressive accuracy using a compact dataset, the explainability remains limited to signal-domain saliency. Ridwan et al.
[48] employed transfer learning and SHAP-based interpretability in a multi-model architecture, demonstrating enhanced trust in gesture classification systems. While existing works provide valuable contributions, most rely on a single post-hoc XAI method and do not offer comparative evaluations or explore the effect of explanation strategies across varying gesture complexities. Furthermore, although some studies use XAI to uncover biases or improve debugging, there remains a lack of methodological frameworks that integrate explainability into the model design process itself rather than treating it as a post-hoc diagnostic tool. This gap hinders the development of more trustworthy and generalizable SLR systems as presented in the review [49,50]. To provide a comprehensive overview of prior work, Table 1 presents a chronological summary of representative SLR systems, detailing their methodologies, the incorporation of explainability techniques, and the reported performance. This selection covers a wide range of approaches, from traditional handcrafted pipelines and rule-based segmentation to CNNs, skeleton-based models, and classical machine learning techniques. It also reflects the diversity of input modalities (e.g., RGB, depth, glove-based sensors, and segmented images) and the application of XAI methods, such as modular feedback, LIME, and SHAP, either as post-hoc analysis tools or integrated components within the training and validation process.

Table 1. Chronological Summary of SLR Methods, Explainability, and Reported Accuracy.

https://doi.org/10.1371/journal.pone.0336481.t001

Despite notable progress in SLR, many existing systems still face key limitations, including limited generalisation to diverse users and environments and a lack of interpretability in model predictions. Moreover, most prior works employ a single explainability technique as a post-hoc diagnostic tool, without exploring how different methods might complement each other. To address these gaps, our work proposes a real-time prediction system for ASL alphabet recognition, leveraging both RGB and segmentation-based representations. In contrast to traditional colour-based segmentation methods, which are often sensitive to lighting variations and skin tone, we employ a U-Net-based segmentation model to generate robust binary hand masks. This enables more consistent gesture localisation across varied backgrounds and user profiles, enhancing generalisation. Furthermore, we integrate three distinct XAI methods (Grad-CAM, LIME, and integrated gradients) to provide interpretability and to analyse their respective strengths in capturing region-level, pixel-level, and structural information. By applying these techniques systematically to the same inputs, we demonstrate how their complementary perspectives can offer a richer understanding of model behaviour, highlight failure cases, and guide model refinement.

Methodology

The starting point for the development of an explainable SLR system is the work developed by Damion Joyner (https://www.kaggle.com/code/damionjoyner/sign-language-classification-cnn-vgg19), which endeavours to classify a set of RGB and depth images of ASL using a CNN model based on the Visual Geometry Group 19 (VGG19) architecture. This model is trained using the ASL alphabet dataset, which comprises over 100,000 images of English letters in sign language collected from five individuals. With 24 letters represented in the dataset (the dynamic signs ‘j’ and ‘z’ are excluded) and images provided by five pairs of hands, the model must be capable of classifying images based on the different letters.

Classification model structure

The classification model is based on a combination of the pre-trained VGG19 model, enriched with additional layers designed specifically for the image classification task. The CNN architecture devised by the Visual Geometry Group (VGG) is widely recognised for its depth, characterised by many convolution layers. It has played a key role in the evolution of object recognition models, outperforming many benchmarks on various tasks and datasets. It is pre-trained on the ImageNet dataset and generalises well to unseen datasets, making it one of the most used image recognition architectures. Multiple variants exist for such a network, including VGG-16 and VGG-19, differing only in the total number of layers. VGG19 is an advanced version of the VGG architecture, incorporating 19 weight layers (16 convolutional and three fully connected) [51]. It consists of several convolutional blocks, each comprising multiple convolutional layers followed by pooling layers. The model utilises small filters (3×3) with a stride of 1 and padding of 1 to preserve extracted feature sizes. Using this pattern for the convolutional layers means that the convolutional filters move one pixel at a time across the input data, and one layer of zero pixels is added around the input to maintain its size during the convolution process. After the convolutional blocks, the network connects to fully connected layers for classification. While the VGG19 architecture is relatively established and widely used in image classification tasks, it was chosen for this project due to its depth, simplicity, and ability to generalise across unseen datasets. VGG19 remains a competitive architecture for tasks involving image recognition, and its extensive use allows for leveraging pretrained weights on ImageNet, speeding up training and improving accuracy. In this work, the classification model comprises both the VGG19 model and additional layers.
The VGG19 layers adhere to the standard architecture, featuring blocks of convolutional layers followed by pooling layers. These layers progressively capture features across different scales and complexities. Additional layers are then introduced after the VGG19 layers to tailor the model for the image classification task. These supplementary layers include:

  • a flatten layer to transform the outputs into a one-dimensional vector;
  • dense (fully connected) layers for final classification, including dropout layers for regularisation;
  • batch normalisation layers for normalising activations and stabilising learning;
  • the final dense layer, with neurons corresponding to the number of classes (letters) in the classification task.

In our implementation, the input image size is (64,64,3), and the feature extraction part strictly follows the VGG19 configuration. It consists of five blocks of convolutional layers with small 3×3 kernels and a stride of 1. After flattening the output of the final block, we added two dense (fully connected) layers with 512 units each. Both are activated using ReLU and followed by a Dropout layer with a rate of 0.5 to reduce overfitting. Each dense block also includes Batch Normalisation to stabilise learning and improve convergence. The final output layer contains 26 neurons (one per ASL letter), activated using Softmax. The network was trained using the Adam optimiser with a learning rate of 10⁻⁴ and categorical cross-entropy loss. Fig 3 shows the full architecture of the proposed model in a horizontal layout for clarity and reproducibility.
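The architecture described above can be sketched in Keras as follows. This is a minimal sketch under our reading of the text (Dense → Dropout → Batch Normalisation within each dense block); `weights=None` is used here so the sketch runs offline, whereas the paper starts from the ImageNet-pretrained base (`weights="imagenet"`).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(num_classes=26, input_shape=(64, 64, 3), weights=None):
    # VGG19 convolutional base without its original dense head;
    # the paper uses weights="imagenet" for transfer learning
    base = tf.keras.applications.VGG19(
        include_top=False, weights=weights, input_shape=input_shape)
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),            # regularisation, rate 0.5
        layers.BatchNormalization(),    # stabilise learning
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.BatchNormalization(),
        layers.Dense(num_classes, activation="softmax"),  # one neuron per letter
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```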

Fig 3. Architecture of the proposed CNN model based on a modified VGG19 structure.

The network consists of five convolutional blocks (Conv1–Conv5), each followed by ReLU activations and max-pooling operations. After the feature extraction phase, the feature maps are flattened and passed through two fully connected (Dense) layers of 512 units each, followed by ReLU activations, Dropout (rate = 0.5), and Batch Normalisation. The final output layer uses a Softmax activation with 26 neurons to classify ASL alphabet signs.

https://doi.org/10.1371/journal.pone.0336481.g003

Model limitations and motivation for improvement

The results obtained show a test accuracy of 95%; however, when evaluated on out-of-distribution images, not present in the original dataset, the model exhibits a clear divergence in performance (https://www.kaggle.com/code/damionjoyner/sign-language-classification-cnn-vgg19). It fails even to recognise simple hand signs, revealing significant limitations in its generalisation ability. A more in-depth analysis of the model behaviour uncovers several fundamental weaknesses that severely undermine its capacity for real-world deployment. The training dataset is imbalanced in class distribution and limited in subject diversity, as it includes samples from only five individuals, all with similar skin tones. This dual constraint introduces a strong learning bias: the model tends to overfit to frequent classes and person-specific features, rather than learning gesture-invariant representations. As a result, its performance drops notably for rare classes and when applied to new users not represented in the training set. Furthermore, the small size of the dataset, combined with its imbalance, significantly increases the risk of overfitting. Although the overall accuracy appears high, this metric can be misleading. The model may simply memorise dataset-specific patterns rather than learning transferable features. This concern is reinforced by the inconsistencies observed during additional evaluations, particularly on underrepresented classes. Another critical limitation lies in the way multimodal information is handled. Although the model uses both RGB and depth data, the fusion is performed through simple concatenation, which does not fully leverage the complementary strengths of these modalities. This naive fusion strategy may cause the model to overlook essential modality-specific cues or to resolve ambiguities inadequately, shortcomings that more advanced fusion mechanisms could potentially overcome. 
These interconnected limitations expose the structural weaknesses of the current approach and motivate the need for methodological improvements. Specifically, they justify the development of a more robust architecture that employs advanced multimodal fusion techniques, applies targeted data augmentation to mitigate class imbalance, and incorporates regularisation strategies to promote generalisation. These enhancements form the basis of the improvements proposed in the following section.

An enhanced interpretable solution for SLR

To overcome the limitations of existing ASL datasets and improve the robustness and interpretability of gesture recognition models, we propose a three-step approach: (1) data augmentation and diversification, (2) a mask-based preprocessing strategy, and (3) an explainable classification framework. These three steps are described in the following subsections.

Data augmentation and diversification.

The accuracy of ASL recognition models is often hindered by the limited variability in existing datasets, particularly with respect to hand shape, skin tone, background, and lighting conditions. To address this, we substantially expanded and diversified the dataset. Instead of relying on a standard collection of 100,000 images covering 24 letters, we compiled a new dataset of 325,000 cropped RGB images representing all 26 letters of the ASL alphabet, as illustrated in Fig 4.

Fig 4. Comparison of label distributions between the initial and the newly collected ASL letter datasets.

The left bar chart shows the original dataset with imbalanced counts across 24 letters, while the right chart illustrates the new dataset, which includes 12,500 per letter and covers all 26 letters of the ASL alphabet with uniform distribution.

https://doi.org/10.1371/journal.pone.0336481.g004

These images were gathered from four publicly available sources to ensure heterogeneity and cover a wide range of environments and hand configurations.

The dataset was constructed by merging multiple publicly available ASL alphabet datasets. After removing duplicates and balancing the classes, we obtained approximately 325,000 images (12,500 per class). Since the images were mixed and reorganized into a unified dataset, we report only the final distribution. The processed dataset and split files are made publicly available at (https://www.kaggle.com/datasets/fatimaelqoraychy/dataset). Although ‘j’ and ‘z’ are dynamic ASL signs requiring motion to be fully expressed, some of the datasets we employed include static image approximations for these letters. In this study, we used these static representations to cover the complete 26-letter ASL alphabet. While these approximations provide useful proxies, they cannot fully reproduce the motion aspect of the signs. In this study, we did not employ geometric or photometric data augmentation (e.g., rotations, shifts, zooms, flips, or brightness variations). The only preprocessing applied was pixel-value rescaling (1/255), performed online during training using the Keras ImageDataGenerator. This ensured consistent normalization across the dataset without altering the original distribution of hand images.
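The rescale-only preprocessing can be reproduced with the Keras `ImageDataGenerator` named in the text; `standardize` applies the configured rescaling to a single image array.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale-only preprocessing, as described in the text: pixel values go from
# [0, 255] to [0, 1]; no geometric or photometric augmentation is configured.
datagen = ImageDataGenerator(rescale=1.0 / 255)

# standardize() applies the generator's rescaling to one image array
frame = np.full((64, 64, 3), 255.0, dtype="float32")  # dummy white image
normalized = datagen.standardize(frame)
```

In training, the same generator would feed the model via `flow_from_directory` with `target_size=(64, 64)`; the exact directory layout is not stated in the paper.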

Segmentation-based preprocessing with U-Net.

Previous modelling methods often rely on depth images to capture 3D hand structures. However, depth data is highly sensitive to background clutter, lighting inconsistencies, and camera angle variations, making it less reliable for robust gesture recognition. To overcome these challenges, we propose a segmentation-based preprocessing approach that focuses on binary hand masks instead of depth images. We employed the U-Net architecture [52] for this task. U-Net is a convolutional neural network designed for precise image segmentation. Its encoder-decoder structure, combined with skip connections, allows it to retain both high-level semantics and spatial resolution, making it particularly effective for extracting hand regions from complex scenes. To train this network and produce a segmentation model, we used the HGR1 dataset (https://sun.aei.polsl.pl//~mkawulok/gestures/), which contains 899 labelled RGB images with corresponding binary masks. This dataset includes diverse hand poses captured under varying lighting conditions, backgrounds, and camera types, making it suitable for training a robust and generalizable segmentation model. The resulting binary masks isolate the hand region, removing irrelevant background information. This facilitates a more focused and noise-free input for classification, allowing the model to better capture critical gesture features such as finger positions and hand contours.

The segmentation module is implemented as a five-level U-Net with skip connections, designed to predict binary hand masks from RGB inputs. Each encoder stage consists of two convolutional layers with ReLU activation and “same” padding, followed by max pooling for downsampling. The number of feature channels increases with depth at each encoder stage. The decoder mirrors this structure: at each stage, an up-sampling layer is concatenated with the corresponding encoder features and followed by two Conv–ReLU layers. A final convolution with sigmoid activation produces a per-pixel probability map of the hand region. The network is optimised with RMSprop and trained with mean squared error (MSE) loss for 50 epochs using a batch size of 32. All input images are normalised to the [0,1] range. To ensure fair evaluation and prevent identity leakage, we apply a subject-disjoint split for the train, validation, and test sets, and monitor validation performance during training. The predicted hand masks serve two purposes within the proposed framework. First, they constitute the input to the mask-based classifier, enabling the comparison between RGB and silhouette representations. Second, they support the quantitative explainability analysis by providing a reference structure to verify whether attribution maps align with the hand regions. Representative qualitative results, including input images, predicted masks, and overlays, are reported to illustrate both typical successes and failure cases.
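A sketch of this five-level U-Net in Keras follows. The exact channel sequence and input resolution are not restated in the text, so `base_filters=16` and the 128×128 input here are illustrative assumptions; the optimiser and loss match the description (RMSprop, MSE).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(input_shape=(128, 128, 3), base_filters=16, depth=5):
    # base_filters and input_shape are illustrative; the paper does not
    # restate the exact channel sequence in this passage.
    inputs = layers.Input(shape=input_shape)
    x, skips, filters = inputs, [], base_filters
    # Encoder: two Conv-ReLU ("same" padding) per level, then max pooling
    for _ in range(depth - 1):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
        filters *= 2
    # Bottleneck
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
    # Decoder: upsample, concatenate the skip connection, two Conv-ReLU layers
    for skip in reversed(skips):
        filters //= 2
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
    # Per-pixel hand probability via 1x1 convolution with sigmoid
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="mse")
    return model
```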

Model training and evaluation.

Following the segmentation process, two distinct gesture recognition models were trained. The first model was trained on the original RGB images, while the second model utilised the binary hand masks generated by the U-Net segmentation model. The dataset collection and processing workflow began with acquiring a total of 325,000 labelled images. Each image underwent two distinct preprocessing stages: first, it was used as an RGB image in its original form, and second, it was passed through the segmentation model to generate the corresponding hand masks. All images were resized to a fixed resolution of 64×64 pixels, which is the required input dimension for the VGG19-based architecture used in our models, as presented in Fig 5. To enable reliable model evaluation, the dataset was split into 76.5% for training, 13.5% for validation, and 10% for testing, using random shuffling and a fixed random seed (SEED = 99) to ensure reproducibility. This guarantees consistent data partitioning across different runs. To monitor and mitigate overfitting during training, we recorded both training and validation losses at each epoch across all 25 training epochs. Early stopping was applied by monitoring the validation loss: training was halted once no improvement was detected over a set number of consecutive epochs. The test set was used to evaluate the final model after training was complete. A summary of the training hyperparameters is presented in Table 2. All experiments were conducted using the Kaggle cloud-based environment, which provides NVIDIA Tesla P100 GPUs. Each model was trained for approximately six hours.
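The split and early-stopping setup can be sketched as follows. The patience value is an assumption, since the text only mentions "a set number of consecutive epochs".

```python
import numpy as np
import tensorflow as tf

SEED = 99  # fixed seed used for reproducible partitioning

def split_indices(n_samples, train=0.765, val=0.135, seed=SEED):
    # Shuffle once with a fixed seed, then slice into
    # train / validation / test partitions (76.5% / 13.5% / 10%)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(round(n_samples * train))
    n_val = int(round(n_samples * val))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Early stopping on validation loss; patience=3 is an assumed value
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
```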

Fig 5. Overview of the gesture classification process using RGB images and segmented hand masks.

Both input types are preprocessed and passed through a CNN for gesture prediction.

https://doi.org/10.1371/journal.pone.0336481.g005

Table 2. Hyperparameters used in training the classification models.

https://doi.org/10.1371/journal.pone.0336481.t002

Real-time prediction interface

We developed a real-time prediction interface using OpenCV and TensorFlow to evaluate model performance in real-world settings. The interface captures live images of hand gestures through a webcam and performs on-the-fly predictions using the pre-trained model. It simultaneously displays the predicted ASL letter and overlays visual explanations generated through integrated XAI methods.

The interface operates by continuously capturing frames, resizing them to 64×64 pixels to match the model’s input dimensions, and normalising the pixel values. The processed frame is then passed through the model to obtain prediction probabilities, and the class with the highest score is selected as the predicted letter. The predicted class index is converted back into a human-readable label using a loaded label binarizer object.
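The per-frame pipeline can be sketched as below. In the actual interface the frame comes from OpenCV's `cv2.VideoCapture` and is resized with `cv2.resize`; here a nearest-neighbour resize and a stand-in model keep the sketch self-contained, and the class list simply assumes the 26 alphabet labels.

```python
import numpy as np

CLASSES = [chr(c) for c in range(ord('a'), ord('z') + 1)]  # 26 ASL letters

def resize_nearest(frame: np.ndarray, size: int = 64) -> np.ndarray:
    """Nearest-neighbour resize, standing in for cv2.resize."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frame[rows][:, cols]

def predict_frame(frame: np.ndarray, model) -> str:
    """Resize to the model's 64x64 input, normalise to [0, 1],
    run the model, and map the argmax back to a letter."""
    x = resize_nearest(frame).astype(np.float32) / 255.0
    probs = model(x[None, ...])[0]          # shape (26,)
    return CLASSES[int(np.argmax(probs))]

# Stand-in model that always favours class 0 ('a'):
dummy_model = lambda batch: np.eye(26)[[0]]
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
print(predict_frame(frame, dummy_model))  # a
```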

Performance evaluation. To quantify the real-time usability of the system, we measured both frame rate (frames per second, FPS) and per-frame latency (milliseconds). On our test hardware (Intel CPU), the interface achieved an average of 10 FPS with a mean latency of approximately 100 ms per frame.
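One simple way to obtain such FPS and latency figures is to time repeated predictions on a fixed frame; the predictor below is a cheap stand-in for the CNN, used only to make the sketch runnable.

```python
import time
import numpy as np

def measure_latency(predict, frame, n_frames: int = 50):
    """Average per-frame latency (ms) and throughput (FPS) over n_frames runs."""
    start = time.perf_counter()
    for _ in range(n_frames):
        predict(frame)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / n_frames
    fps = n_frames / elapsed
    return latency_ms, fps

# Stand-in predictor: a cheap numpy operation instead of the trained model.
frame = np.random.rand(64, 64, 3).astype(np.float32)
latency_ms, fps = measure_latency(lambda f: f.mean(), frame)
print(f"{latency_ms:.3f} ms/frame, {fps:.0f} FPS")
```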

Explainable AI for SLR model enhancement

We apply XAI techniques to our SLR models to enhance transparency and interpretability. In ASL recognition, it is crucial to ensure that a model focuses on semantically meaningful regions of the input, such as hand shapes and orientations, rather than background artefacts. This is particularly important for model validation and for fostering confidence among users and system developers. Explainability also serves as a practical tool for model debugging and for detecting potential biases or spurious correlations learned during training [44]. For instance, if a model consistently misclassifies a gesture due to attention on irrelevant background elements, explainability methods can help identify and correct such failures. Furthermore, it provides traceability for misclassifications by highlighting decision-relevant input regions. However, due to ASL gestures’ complexity and spatial variability, no single XAI technique is sufficient. We therefore employ three complementary and widely recognised XAI methods: Grad-CAM, LIME, and integrated gradients.

Gradient-weighted class activation mapping.

Gradient-weighted Class Activation Mapping (Grad-CAM) [23] is a widely used post-hoc explainability technique that provides visual explanations for CNNs by highlighting image regions most relevant to a model’s decision. Given an input image and a target class c, Grad-CAM produces a heatmap that localises the spatial regions with the strongest influence on the output score \(y^c\). This is particularly useful in visual tasks like SLR, where fine-grained hand features drive classification. To compute Grad-CAM, we first calculate the gradient of the output score \(y^c\) with respect to the feature maps \(A^k\) of a chosen convolutional layer:

\[ \frac{\partial y^c}{\partial A^k_{ij}} \]

These gradients are globally averaged to obtain importance weights for each feature map:

\[ \alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}} \]

where Z is the total number of pixels in the feature map.

The class-discriminative localization map is then computed as:

\[ L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_k \alpha^c_k A^k \right) \]

The ReLU operation retains only features with a positive influence on the class score. The resulting heatmap is then upsampled and overlaid on the input image using a JET colourmap, where red and yellow indicate high importance and blue indicates low relevance.
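The aggregation step of these equations can be sketched in NumPy, assuming the feature maps and gradients have already been extracted from the network (in practice via autodiff, e.g. `tf.GradientTape`); the function name and toy tensors are our own illustration.

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM aggregation step.

    feature_maps: (H, W, K) activations A^k of the chosen conv layer.
    gradients:    (H, W, K) gradients dy^c / dA^k for the target class.
    Returns the (H, W) class-discriminative localisation map.
    """
    alpha = gradients.mean(axis=(0, 1))            # global-average importance weights
    cam = np.tensordot(feature_maps, alpha, axes=([2], [0]))
    cam = np.maximum(cam, 0.0)                     # ReLU: keep positive influence only
    if cam.max() > 0:
        cam = cam / cam.max()                      # normalise to [0, 1] for overlay
    return cam

# Toy example: only feature map 0 has a positive gradient for the class.
A = np.zeros((4, 4, 2)); A[1:3, 1:3, 0] = 1.0
g = np.zeros((4, 4, 2)); g[..., 0] = 1.0
cam = grad_cam(A, g)
print(cam[1, 1], cam[0, 0])  # 1.0 0.0
```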

Local interpretable model-agnostic explanations.

Local Interpretable Model-agnostic Explanations (LIME) [16] is a widely used technique designed to interpret the predictions of any black-box model by approximating it locally with an interpretable model. Unlike Grad-CAM, which is specific to CNNs, LIME is model-agnostic and can be applied to any classifier. The core idea is to understand a model’s decision for a specific instance x by generating a set of perturbed samples around x and observing how the model’s predictions vary. For each perturbed sample \(z'\), the model predicts an output \(f(z')\), and a locality-aware weight \(\pi_x(z')\) is assigned based on the proximity to the original instance x. A simple, interpretable model g, such as a linear regression or decision tree, is trained on this locally weighted dataset to approximate the complex model f in the vicinity of x. The optimisation objective is:

\[ \xi(x) = \underset{g \in G}{\arg\min}\; \mathcal{L}(f, g, \pi_x) + \Omega(g) \]

where \(\mathcal{L}(f, g, \pi_x)\) measures the fidelity of g to f locally, \(\pi_x\) defines the locality, and \(\Omega(g)\) is a complexity penalty for g. In image classification, LIME segments the image into superpixels and perturbs it by randomly turning these superpixels on or off. It then identifies which superpixels most strongly influence the prediction by observing the model’s output across these perturbations.
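The procedure can be illustrated end-to-end on a toy image with four quadrant "superpixels"; the kernel width, sample count, and ridge penalty are illustrative choices, and the black-box model is a stand-in that depends on a single superpixel so the expected attribution is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 8x8 image split into 4 quadrant "superpixels" (2x2 grid).
image = rng.random((8, 8))

def superpixel_mask(z):
    """Expand a binary on/off vector over the 4 quadrants into a pixel mask."""
    m = np.zeros((8, 8))
    for k, on in enumerate(z):
        r, c = divmod(k, 2)
        m[r*4:(r+1)*4, c*4:(c+1)*4] = on
    return m

# Black-box model f: responds only to the top-right quadrant (superpixel 1).
def f(img):
    return img[0:4, 4:8].mean()

# 1) Perturb: random on/off patterns z' over superpixels.
Z = rng.integers(0, 2, size=(200, 4)).astype(float)
y = np.array([f(image * superpixel_mask(z)) for z in Z])

# 2) Locality weights pi_x(z'): exponential kernel on distance to the
#    unperturbed instance (all superpixels on).
dist = 1.0 - Z.mean(axis=1)            # fraction of superpixels removed
w = np.exp(-(dist ** 2) / 0.25)

# 3) Interpretable surrogate g: weighted ridge regression.
W = np.diag(w)
lam = 1e-6
coef = np.linalg.solve(Z.T @ W @ Z + lam * np.eye(4), Z.T @ W @ y)
print(int(np.argmax(coef)))  # 1  -> superpixel 1 drives the prediction
```

The surrogate's largest coefficient correctly singles out the one superpixel the black-box model actually uses, which is exactly the diagnostic LIME provides for the classifiers in this study.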

Integrated gradients.

Integrated Gradients [53] is a model-specific explainability method designed to attribute a deep neural network’s prediction to its input features. It addresses the limitations of standard gradients, which can be noisy or vanish due to saturation. The method works by computing the average gradients as the input transitions from a baseline \(x'\) (typically a zero or black image) to the actual input. For an input x and baseline \(x'\), the attribution for the i-th feature is defined as:

\[ \mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha \]

This integral is approximated numerically via summation over m steps:

\[ \mathrm{IG}_i(x) \approx (x_i - x'_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\big(x' + \tfrac{k}{m}(x - x')\big)}{\partial x_i} \]

The integrated gradients method satisfies desirable axioms such as sensitivity and implementation invariance, making it theoretically sound for attribution tasks.
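The m-step approximation can be sketched as follows; a midpoint rule is used here for numerical accuracy, and the analytic gradient of a toy model stands in for autodiff (in practice the gradient would come from e.g. `tf.GradientTape`).

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, m=100):
    """m-step (midpoint) Riemann approximation of Integrated Gradients.

    grad_fn: callable returning dF/dx at a given point.
    """
    diff = x - baseline
    total = np.zeros_like(x)
    for k in range(1, m + 1):
        total += grad_fn(baseline + ((k - 0.5) / m) * diff)
    return diff * total / m

# Toy model F(x) = sum(x**2), so dF/dx = 2x and IG_i should equal x_i**2.
grad_fn = lambda x: 2.0 * x
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(grad_fn, x, baseline)
# Completeness axiom: attributions sum to F(x) - F(baseline) = 14.
print(np.round(attr.sum(), 6))
```

The completeness check (attributions summing to the difference in model output between input and baseline) is a useful sanity test when applying the method to the real classifiers.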

Evaluation metrics for saliency maps

To quantitatively assess the compactness and dispersion of the explanations, we computed two complementary metrics on the saliency maps produced by Grad-CAM, LIME, and Integrated Gradients.


Energy concentration.

Let \(S \in \mathbb{R}^{H \times W}\) denote the normalized saliency map, with \(S_{ij} \geq 0\) and \(\sum_{i,j} S_{ij} = 1\). After sorting all pixel intensities in descending order \(s_{(1)} \geq s_{(2)} \geq \dots \geq s_{(HW)}\), the energy concentration at fraction p (e.g., p = 0.05 for top-5%) is defined as:

\[ E(p) = \sum_{k=1}^{\lceil p \cdot HW \rceil} s_{(k)} \tag{1} \]

A higher E(p) indicates that a small fraction of pixels captures most of the attribution energy, i.e., the explanation is more compact and localized.
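A minimal implementation of E(p), assuming the saliency map is first normalised to sum to one:

```python
import numpy as np

def energy_concentration(saliency: np.ndarray, p: float = 0.05) -> float:
    """Fraction of total attribution energy in the top-p% most salient pixels."""
    s = np.abs(saliency).ravel()
    s = s / s.sum()                     # normalise to a distribution
    s_sorted = np.sort(s)[::-1]         # descending intensities s_(1) >= s_(2) >= ...
    k = int(np.ceil(p * s.size))
    return float(s_sorted[:k].sum())

# A map where one pixel holds half the energy is highly concentrated.
heat = np.full((10, 10), 0.5 / 99)
heat[0, 0] = 0.5
print(round(energy_concentration(heat, p=0.05), 2))  # 0.52
```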

Entropy.

Considering the normalized heatmap S as a discrete probability distribution, the entropy is defined as:

\[ H(S) = -\sum_{i,j} S_{ij} \log\big(S_{ij} + \varepsilon\big) \tag{2} \]

where ε is a small constant to prevent numerical instability. Lower entropy values correspond to more focused saliency maps, while higher values reflect more diffuse explanations.
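The entropy measure admits an equally short implementation, under the same normalisation assumption:

```python
import numpy as np

def saliency_entropy(saliency: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of a heatmap treated as a probability distribution."""
    s = np.abs(saliency).ravel()
    s = s / s.sum()
    return float(-(s * np.log(s + eps)).sum())

# A uniform map is maximally diffuse; a one-hot map is maximally focused.
uniform = np.full((10, 10), 1.0)
focused = np.zeros((10, 10)); focused[0, 0] = 1.0
print(saliency_entropy(uniform) > saliency_entropy(focused))  # True
```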

A multi-level interpretability strategy for SLR

The decision to integrate multiple XAI techniques, rather than rely on a single method, is motivated by the complementary nature of the insights each one provides. Grad-CAM, LIME, and integrated gradients represent distinct families of interpretability methods: Grad-CAM leverages convolutional activations to visualise class-specific saliency maps, LIME builds local surrogate models through feature perturbation, and integrated gradients quantify the cumulative contribution of each input pixel via a path-integrated gradient from a baseline. Each method brings specific strengths and limitations. Grad-CAM produces intuitive heatmaps but may lack fine-grained spatial resolution. LIME offers detailed, instance-level attributions but can be unstable across similar inputs. The integrated gradients method provides pixel-level precision but is sensitive to baseline selection and may miss high-level spatial structure. By combining all three, we obtain a multi-level perspective of the model’s internal logic—from broad spatial focus to localised feature attribution and cumulative influence. This combined framework is particularly crucial in SLR, where classification hinges on nuanced finger position, contour, and orientation differences. Relying on a single method may result in incomplete or misleading interpretations. Therefore, the integration of Grad-CAM, LIME, and integrated gradient methods is not redundant—it is essential for achieving robust, trustworthy interpretability, enabling the diagnosis of both successful and failed predictions with higher fidelity.

For quantitative evaluation, we report precision, recall, and F1-score. These metrics are particularly appropriate for our context due to class imbalance and the presence of visually ambiguous gestures. Precision measures the exactness of the predictions, recall assesses completeness, and F1-score offers a harmonic balance of both. Unlike overall accuracy, these metrics provide a more nuanced assessment of model behaviour, especially on minority or error-prone classes. To guide both interpretability and performance analysis, we selected representative subsets of ASL letters:

  • Simple Signs: ‘a’ and ‘l’
  • Challenging Pairs: ‘i’ vs. ‘j’, and ‘m’ vs. ‘n’

The first pair (‘a’ and ‘l’) serves as a baseline to verify that the model correctly handles straightforward gestures. The second and third sets represent more ambiguous cases with high visual similarity. These allow us to test the model’s discriminative ability and examine how explanation methods respond in both correct and incorrect classifications under challenging conditions. All scripts and code used to generate the results in this study are available on Kaggle at https://www.kaggle.com/code/fatimaelqoraychy/sign-language-classification-cnn-vgg19-1. The notebook is publicly accessible and can be used to reproduce the experiments reported in this manuscript.
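For reference, the per-class metrics used throughout the evaluation can be computed as sketched below (a pure-NumPy stand-in for, e.g., scikit-learn's `classification_report`); the toy labels mimic the ‘m’/‘n’ confusion examined in the results.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, labels):
    """Per-class precision, recall, and F1 from label arrays."""
    out = {}
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[c] = (prec, rec, f1)
    return out

# Toy case: every true 'n' is predicted as 'm'.
y_true = np.array(list("mmmnnn"))
y_pred = np.array(list("mmmmmm"))
m = per_class_metrics(y_true, y_pred, ["m", "n"])
print(round(m["m"][0], 2), m["n"][1])  # 0.5 0.0
```

Note how this failure mode is invisible to overall accuracy alone (50% here) but is exposed by precision on ‘m’ and recall on ‘n’, which is why these metrics are reported per letter.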

Results

This section introduces the results of the empirical work described in the previous sections. Such results are grouped into two parts: one related to the model’s accuracy and one related to its explainability.

Model for gesture recognition

As shown in Figs 6 and 7, both models converge early during training. The RGB model reached optimal performance at epoch 13, while the mask-based model continued improving until epoch 24. This difference reflects the RGB model’s faster learning due to richer visual features, whereas the mask model, relying solely on shape information, required more epochs to generalise effectively. In both cases, the use of early stopping ensured stable convergence and prevented overfitting. Both models demonstrated high performance in recognising ASL gestures during the test, with strong average precision, recall, and F1-scores. While Table 4 provides an overview of global performance, a more detailed breakdown is presented in Table 3, which reports the exact values of each metric for every letter in the alphabet.

Fig 6. Training and validation accuracy and loss curves for the RGB model.

https://doi.org/10.1371/journal.pone.0336481.g006

Fig 7. Training and validation accuracy and loss curves for the Mask-based model.

https://doi.org/10.1371/journal.pone.0336481.g007

Table 3. Classification of ASL letters in the testing phase and comparison of classification performance (Precision, Recall, F1-score) between the RGB and Mask models.

https://doi.org/10.1371/journal.pone.0336481.t003

Table 4. Performance comparison between the RGB and Mask models for ASL gesture recognition.

https://doi.org/10.1371/journal.pone.0336481.t004

To evaluate model performance in a real-world scenario, we implemented a real-time prediction interface capable of recognising ASL letters, as presented in the Real-time prediction interface section. During this phase, we observed that the RGB model consistently predicted the correct letter, even for complex signs; notably, it provided the correct prediction for ‘i’ and ‘j’ despite their similarity. To better understand the basis of these predictions, and of the misclassifications reported below, we applied explainability methods to analyse which image regions the model relied on during prediction.

Explainability and model validation

To better understand our system’s prediction mechanisms, we analyze visual explanations generated by Grad-CAM, LIME, and integrated gradient methods across the two classification models. While the RGB model achieves consistently high accuracy across all letters, the mask-based model occasionally struggles, particularly confusing visually similar signs such as ‘n’ and ‘m’. Interestingly, it correctly classifies ‘m’ and successfully distinguishes other subtle variations such as ‘i’ and ‘j’. To systematically investigate this behaviour, we selected representative signs for qualitative analysis: simple and clearly distinguishable gestures (‘a’, ‘l’), a visually similar but correctly classified pair (‘i’, ‘j’), and a confused pair (‘m’, ‘n’). This strategy allows us to assess what features the model learns, how it fails, and how explanations differ between RGB and mask-based models.

Simple Signs ‘a’ and ‘l’: For the simple letters ‘a’ and ‘l’, all three methods show strong alignment between the model’s focus and the relevant gesture areas, as illustrated in Fig 8. Grad-CAM consistently highlights the hand’s central region, capturing the gesture’s overall shape, particularly the closed fist for ‘a’ and the raised thumb and index for ‘l’. This suggests the model is using spatially relevant information to make correct predictions. LIME provides contour-based insights, outlining the hand boundaries precisely. The highlighted regions correspond well with the hand’s silhouette, indicating that the model is sensitive to the global shape and the number of extended fingers. The integrated gradients method shows high activation in areas with clear structural edges, such as knuckles, fingertips, and wrists. These results confirm that pixel-level contributions are coherent with human intuition. This consistency across methods confirms the model’s robustness in recognising simple, well-separated gestures. These signs serve as effective baselines for validating that the explainability tools reflect correct model behaviour.

Fig 8. Explainability results for the simple ASL letters ‘a’ and ‘l’ using Grad-CAM, LIME, and integrated gradients methods.

Each column corresponds to a different explanation technique, while each row shows original inputs and their corresponding binary masks.

https://doi.org/10.1371/journal.pone.0336481.g008

Challenging Signs ‘i’ vs. ‘j’: The distinction between ‘i’ and ‘j’ lies primarily in the little finger’s slight curvature or motion-related aspect. Despite their visual similarity, both the RGB and mask-based models correctly classify these signs (Fig 9). Grad-CAM tends to produce similar heatmaps for both letters, often centred on the palm and lower fingers, suggesting reliance on coarse spatial features. LIME adds nuance: while ‘i’ has tightly bound contours around the little finger, ‘j’ frequently shows extended contours, sometimes trailing into the curved stroke direction, hinting that LIME can capture implied motion. The integrated gradients method exposes localised pixel importance, with ‘j’ often showing extended activation beyond the little finger, consistent with its curved nature. The combined analysis suggests that while Grad-CAM alone may miss subtle motion cues, LIME and integrated gradients provide additional evidence that the model captures fine variations in finger trajectory.

Fig 9. Explainability analysis for visually similar ASL signs: ‘i’ vs. ‘j’ and ‘m’ vs. ‘n’ using Grad-CAM, LIME, and integrated gradients methods.

Each column corresponds to a different explanation technique, while each row shows original inputs and their corresponding binary masks.

https://doi.org/10.1371/journal.pone.0336481.g009

Challenging Signs ‘m’ vs. ‘n’: The pair ‘m’ and ‘n’ presents a fine-grained classification challenge: both involve fingers folded over the thumb, differing primarily in the number of visible knuckles, three for ‘m’ and two for ‘n’. While the RGB model correctly classifies both signs, the segmentation-based model systematically misclassified ‘n’ as ‘m’, as shown in Fig 9. Grad-CAM shows that in the mask-based model, both ‘m’ and ‘n’ result in nearly identical heatmaps, with broad activation spread over the entire hand region. This suggests that the model fails to localise the discriminative area, the number of folded fingers, which is critical for distinguishing the two letters. In contrast, the RGB model shows slightly more concentrated activation toward the finger area for ‘m’, reflecting finer spatial awareness. LIME provides a more informative contrast. For the RGB model, LIME outlines distinct superpixels around three folded fingers for ‘m’ and two for ‘n’, helping the model separate the two signs. However, in the mask-based model, the superpixel boundaries for ‘n’ are either missing, merged, or appear very similar to those for ‘m’. This visual similarity may explain why the model maps both gestures to the same internal representation. The integrated gradients method confirms this limitation. In the RGB model, the attribution maps for ‘m’ and ‘n’ highlight different finger segments with clear separation: three highlighted areas for ‘m’, two for ‘n’. However, in the mask-based model, both maps show overlapping attribution across the top of the hand, with weak contrast between the two- and three-finger zones. This blurs the structural boundary between the two classes. These results suggest that the mask model fails to attend to the correct discriminative region (i.e., finger count), likely due to the absence of texture, shading, and depth cues. Without this fine-grained information, the visual difference between ‘m’ and ‘n’ collapses, and the model overgeneralizes, classifying both as ‘m’. In contrast, the RGB model benefits from additional visual signals that help preserve subtle structural differences.

To quantitatively assess the compactness and dispersion of the explanations, we computed two complementary metrics on the saliency maps produced by Grad-CAM, LIME, and Integrated Gradients. Energy concentration was defined as the proportion of attribution energy contained in the top-k% most salient pixels (Eq 1). Higher values indicate that the explanation is concentrated on fewer, more relevant regions. Entropy was calculated over the normalized heatmap (Eq 2) to measure dispersion. Lower entropy values correspond to more focused explanations, while higher values indicate more diffuse attention. These metrics were computed for six representative ASL letters (a, l, i, j, m, n) and across both models (RGB- and mask-based), then averaged to provide a global comparison. Table 5 summarizes the averaged results across all letters. The results reveal distinct behaviors across the three explanation techniques. LIME produces the most compact explanations, with the highest concentration values (e.g., 21.6% at top-10%) and the lowest entropy (7.9), indicating highly localized attribution maps. Grad-CAM, by contrast, exhibits more diffuse saliency maps, reflected in the highest entropy (9.3) and only moderate energy concentration (13.5% at top-10%). Integrated Gradients yields the lowest energy concentration overall (12.4% at top-10%), with entropy values (8.3) falling between LIME and Grad-CAM, suggesting smoother and more spread attributions. Overall, these results confirm that the three XAI methods provide complementary perspectives: LIME highlights precise local dependencies, Grad-CAM emphasizes broader spatial regions, and Integrated Gradients distributes importance across both local and contextual cues. This quantitative analysis strengthens the qualitative observations and underlines the benefit of using multiple explanation techniques for robust model auditing in sign language recognition.

Table 5. Average energy concentration (computed at top-k% pixels) and entropy.

Values are averaged across six representative letters (a, l, i, j, m, n) and across both input modalities (RGB and mask). Higher energy concentration indicates more compact explanations; lower entropy indicates less diffuse maps.

https://doi.org/10.1371/journal.pone.0336481.t005

Implications for big data

Recent advances in sign language recognition highlight the increasing availability of large-scale datasets covering multiple sign languages, signing styles, and recording conditions. Such variability reflects a Big Data challenge, where models must generalize across highly heterogeneous samples. Our work is relevant in this context because it integrates explainability into SLR, providing tools to validate and interpret models when trained on large and diverse datasets. In particular, the integration of multiple XAI methods (Grad-CAM, LIME, and Integrated Gradients) facilitates model auditing at scale by enabling transparent validation of predictions, even in massive datasets where manual inspection of each sample is infeasible. Moreover, as highlighted in recent studies on the AI-powered evolution of big data [54], explainability is increasingly recognized as a key requirement for deploying trustworthy models in real-world, data-intensive environments. Therefore, we view our contribution as a step toward scalable and interpretable SLR systems capable of operating under Big Data conditions.

Discussion

The experimental results presented in this study demonstrate high predictive performance for both RGB and mask-based models and expose deeper insights into model behaviour through the lens of explainability. While the overall classification accuracy appears strong for both models (RGB: 99%, Mask: 98%), a deeper examination reveals key differences in generalisation and robustness. The RGB-based model consistently achieves better performance across ambiguous sign pairs, such as ‘m’ vs. ‘n’, which differ primarily by the number of extended fingers. This advantage stems from its access to rich visual cues, such as texture, shading, and skin tone, that enable the capture of subtle features. In contrast, despite eliminating background noise, the segmentation-based model lacks this visual depth and hence struggles with fine-grained discrimination. Moreover, the mask-based model demonstrates better robustness under noisy or cluttered environments, particularly in real-time settings where background conditions are variable. Its reliance on binary shape information makes it less sensitive to colour variations, lighting inconsistencies, or camera artefacts. This trade-off between robustness and discriminative power highlights the need for hybrid strategies that can balance visual richness with noise resilience.

We did not include a quantitative comparison with the baseline VGG19 model of Joyner, as this model failed to provide valid predictions on our datasets. This limitation was also reported by the original author on the project website, where it is noted that the model struggles with out-of-distribution samples. Our proposed models, by contrast, achieved consistent and interpretable results under the same conditions. One of this work’s most salient contributions is the comparative use of three XAI methods: Grad-CAM, LIME, and integrated gradients. Each explainability method brings a unique perspective: Grad-CAM identifies broader spatial attention, LIME reveals local sensitivity to superpixel-level features, and integrated gradients provide cumulative pixel-level attribution. While Grad-CAM offers intuitive heatmaps, it often lacks the resolution to distinguish fine-grained structural differences. LIME, though more precise, can vary between samples. The integrated gradients method excels at identifying localised saliency but struggles to convey global context. Their combined use triangulates the model’s behaviour, helping mitigate the limitations of any one method. Results show that for simple and clearly separated gestures such as ‘a’ and ‘l’, all three methods align in highlighting the correct hand regions across both models. This consistency supports the model’s ability to learn relevant features and provides confidence in its correct predictions. However, for visually similar gestures, particularly ‘m’ vs. ‘n’, divergence emerges. While the RGB model correctly classifies both signs with precise attribution maps (e.g., three vs. two finger contours), the mask-based model frequently misclassified ‘n’ as ‘m’. Explainability analysis reveals the root cause: Grad-CAM fails to focus on the relevant discriminative region; LIME outlines are inconsistent or overlapping; and integrated gradients show blurred attribution across the fingers.
These results suggest that the mask-based model lacks the resolution and richness to differentiate subtle anatomical differences due to its reliance solely on shape. In contrast, both models perform well for ‘i’ vs. ‘j’. This is likely because the primary distinction, little finger curvature or implied motion, is still partially preserved in the binary mask. This highlights that not all structural differences are equally affected by the loss of RGB information; the ability to recognise dynamic or curvature-based features may remain intact in segmented representations, whereas subtle finger count distinctions do not. The RGB model outperforms the mask-based model not only in quantitative metrics but also in interpretability fidelity. The presence of colour, texture, and shading allows the RGB model to extract richer features and localise attention more effectively. These visual cues are especially crucial in gestures with fine-grained variations. Conversely, the binary mask strips away such information, which hinders the model’s ability to resolve ambiguous patterns like those in ‘m’ and ‘n’. This analysis underscores that segmentation-based input alone is insufficient for high-resolution gesture classification. To address this, several strategies are recommended:

  • Feature Enhancement: Preprocessing techniques such as edge enhancement or skeletal overlays could improve the visibility of finger contours.
  • Hybrid Input Models: Combining RGB and mask inputs could provide complementary benefits, shape consistency from masks, and detail richness from RGB.
  • Attention Mechanisms: Using attention-based networks may help the model focus more selectively on discriminative regions.
  • Explainability Guided Retraining: Insights from Grad-CAM or integrated gradients could guide targeted data augmentation or hard example mining during training.

In contrast to recent approaches such as SignExplainer [45], which primarily relies on LIME and SHAP to interpret feature contributions within ensemble classifiers, or the method of [48], which introduces self-supervised learning with vision transformers and interprets decisions using SHAP, our work offers a multimodal, multi-technique explainability framework tailored to deep learning architectures. More recently, FedXAI-ISL [55] has proposed a federated and explainable framework for Indian Sign Language, and an explainable real-time SLR pipeline [56] has been introduced to address deployment at scale under big data conditions. While these contributions highlight the growing relevance of explainability in sign language recognition, most rely on a single explainability paradigm or are limited to specific learning setups.

By integrating Grad-CAM, LIME, and Integrated Gradients across RGB and segmentation-based models, we provide spatial, local, and structural insights beyond feature attribution and support deeper model auditing. Trust and transparency are critical in the context of assistive technology. An accurate yet opaque model is insufficient for user acceptance. The explainability framework presented here allows developers and users alike to understand model limitations, validate predictions, and adapt the system more effectively to real-world scenarios. A final implication is to integrate explainability into the model design pipeline rather than treating it as a post-hoc tool. Future models could incorporate attention mechanisms or self-supervised learning schemes explicitly guided by XAI feedback. For instance, models could be penalised during training if their attention diverges from annotated hand regions, encouraging alignment between learned representations and interpretable features. A limitation of our approach is the treatment of dynamic ASL signs ‘j’ and ‘z’. Since our models were trained on static images, we relied on dataset-provided approximations of these letters. While this enables classification of all 26 alphabet signs, the absence of true temporal modeling may affect recognition performance for such dynamic gestures. Future work will address this gap by incorporating video-based datasets and spatio-temporal architectures (e.g., 3D CNNs, temporal convolutional networks, or recurrent models).

Conclusion

This study investigated the integration of XAI techniques into ASL recognition models to improve interpretability and transparency. Two distinct classification models were developed: one trained on RGB images and the other on binary hand masks obtained using a U-Net segmentation model. The need to enhance model interpretability and better understand the visual cues underlying prediction decisions motivated this study. We employed three complementary XAI methods to evaluate performance and decision-making mechanisms: Grad-CAM, LIME, and integrated gradients. Grad-CAM identifies broad spatial regions of interest, LIME captures local dependencies through input perturbation, and integrated gradients quantify the contribution of individual pixels. Together, these methods were chosen because they offer a multifaceted view of how each model interprets ASL gestures. Experimental results indicate that both models achieved strong classification performance, with the RGB-based model slightly outperforming the mask-based model, especially for visually similar hand gestures. Nevertheless, the mask-based model demonstrated a consistent focus on gesture-relevant regions by eliminating background noise, an advantage in real-world scenarios with variable lighting or clutter. A core contribution of this work lies in showing how XAI methods can elucidate internal model logic. The visualisations confirmed that the models generally focus on semantically meaningful hand regions and also exposed inconsistencies in handling ambiguous gestures (e.g., ‘i’ vs. ‘j’, ‘m’ vs. ‘n’). These inconsistencies highlight regions where the model’s attention drifts from semantically relevant features, underscoring the limitations of relying solely on accuracy metrics. Such findings have practical implications: they increase model transparency, support trust building in user-facing applications, and help identify avenues for improvement. 
For example, attention maps revealed that the segmentation model is more sensitive to hand pose variations, suggesting that incorporating pose normalisation or fusing RGB and mask data could improve robustness. Additionally, the models sometimes failed to distinguish dynamic or highly similar gestures due to the static nature of input images. This limitation could be addressed by integrating temporal features (e.g., via recurrent or 3D convolutional networks) or using attention mechanisms to reinforce focus on discriminative regions. In summary, the use of explainability tools validated the model’s correct focus on gesture areas and provided actionable insights for improving ASL recognition. Future work will explore hybrid and temporal modelling approaches guided by these findings, aiming for more robust, generalizable, and interpretable gesture recognition systems.

In addition, this work highlights the value of combining Grad-CAM, LIME, and Integrated Gradients to provide complementary insights into model behavior. These explainability tools not only validated that the models focused on semantically meaningful regions but also revealed their limitations in challenging cases, such as distinguishing between ‘m’ and ‘n’. This multi-technique strategy supports more reliable model auditing and contributes to a deeper understanding of both successes and failures.

Acknowledgments

We gratefully acknowledge the support of Begüm Aksoy and Piotr Kluczyński for their assistance during the explainability phase of this study.

References

  1. Picard A, Mualla Y, Gechter F, Galland S. Human-computer interaction and explainability: intersection and terminology. In: World Conference on Explainable Artificial Intelligence. 2023. p. 214–36.
  2. Card SK. The psychology of human-computer interaction. CRC Press; 2018.
  3. Turk M. Multimodal interaction: a review. Pattern Recognition Letters. 2014;36:189–95.
  4. Hoy MB. Alexa, Siri, Cortana, and more: an introduction to voice assistants. Med Ref Serv Q. 2018;37(1):81–8. pmid:29327988
  5. Pigou L, Dieleman S, Kindermans PJ, Schrauwen B. Sign language recognition using convolutional neural networks. In: Computer Vision – ECCV 2014 Workshops. 2015. p. 572–8.
  6. Starner T, Weaver J, Pentland A. Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans Pattern Anal Machine Intell. 1998;20(12):1371–5.
  7. Antonowicz P, Kasperek D, Podpora M. Sign language recognition—dataset cleaning for robust word classification in a landmark-based approach. IEEE Access. 2025;13:81877–88.
  8. Vilone G, Rizzo L, Longo L. A comparative analysis of rule-based, model-agnostic methods for explainable artificial intelligence. In: Proceedings of the 28th Irish Conference on Artificial Intelligence and Cognitive Science (AICS 2020). 2020. p. 85–96.
  9. Angelov PP, Soares EA, Jiang R, Arnold NI, Atkinson PM. Explainable artificial intelligence: an analytical review. WIREs Data Min & Knowl. 2021;11(5).
  10. Loh HW, Ooi CP, Seoni S, Barua PD, Molinari F, Acharya UR. Application of explainable artificial intelligence for healthcare: a systematic review of the last decade (2011–2022). Comput Methods Programs Biomed. 2022;226:107161. pmid:36228495
  11. Mualla Y, Kampik T, Tchappi IH, Najjar A, Galland S, Nicolle C. Explainable agents as static web pages: UAV simulation example. In: International Workshop on Explainable, Transparent Autonomous Agents and Multi-Agent Systems. Springer; 2020. p. 149–54.
  12. Mualla Y, Najjar A, Kampik T, Tchappi I, Galland S, Nicolle C. Towards explainability for a civilian UAV fleet management using an agent-based approach. arXiv preprint 2019. https://arxiv.org/abs/1909.10090
  13. Gunning D. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA). 2017.
  14. Mualla Y, Tchappi I, Kampik T, Najjar A, Calvaresi D, Abbas-Turki A, et al. The quest of parsimonious XAI: a human-agent architecture for explanation formulation. Artificial Intelligence. 2022;302:103573.
  15. Petsiuk V, Das A, Saenko K. RISE: randomized input sampling for explanation of black-box models. arXiv preprint 2018. https://arxiv.org/abs/1806.07421
  16. 16. Ribeiro MT, Singh S, Guestrin C. Why should i trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 1135–44.
  17. 17. Zafar MR, Khan N. Deterministic local interpretable model-agnostic explanations for stable explainability. MAKE. 2021;3(3):525–41.
  18. 18. Todoriki M, Shingu M, Yano S, Tolmachev A, Komikado T, Maruhashi K. Semi-automatic reliable explanations for prediction in graphs. In: 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC); 2021. p. 311–20.
  19. 19. Barr Kumarakulasinghe N, Blomberg T, Liu J, Saraiva Leao A, Papapetrou P. Evaluating local interpretable model-agnostic explanations on clinical machine learning classification models. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS). 2020. p. 7–12. https://doi.org/10.1109/cbms49503.2020.00009
  20. 20. Antwarg L, Miller RM, Shapira B, Rokach L. Explaining anomalies detected by autoencoders using Shapley Additive Explanations. Expert Systems with Applications. 2021;186:115736.
  21. 21. Nohara Y, Matsumoto K, Soejima H, Nakashima N. Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Comput Methods Programs Biomed. 2022;214:106584. pmid:34942412
  22. 22. Ekanayake IU, Meddage DPP, Rathnayake U. A novel approach to explain the black-box nature of machine learning in compressive strength predictions of concrete using Shapley additive explanations (SHAP). Case Studies in Construction Materials. 2022;16:e01059.
  23. 23. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017.
  24. 24. Das P, Ortega A. Gradient-weighted class activation mapping for spatio temporal graph convolutional network. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022. p. 4043–7. https://doi.org/10.1109/icassp43922.2022.9746621
  25. 25. Yin S, Wang L, Shafiq M, Teng L, Laghari AA, Khan MF. G2Grad-CAMRL: an object detection and interpretation model based on gradient-weighted class activation mapping and reinforcement learning in remote sensing images. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2023;16:3583–98.
  26. 26. Zhao Y, Cao L, Ji Y, Wang B, Wu W. Interpretable EEG emotion classification via CNN model and gradient-weighted class activation mapping. Brain Sci. 2025;15(8):886. pmid:40867216
  27. 27. Raheem FA, Raheem H. ASL recognition quality analysis based on sensory gloves and MLP neural network. American Scientific Research Journal for Engineering, Technology, and Sciences. 2018;47(1):1–20.
  28. 28. Bragg D, Koller O, Bellard M, Berke L, Boudreault P, Braffort A, et al. Sign language recognition, generation, and translation: an interdisciplinary perspective. In: Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility. ASSETS ’19. New York, NY, USA: Association for Computing Machinery; 2019. p. 16–31.
  29. 29. Ryumin D, Ivanko D, Axyonov A. Cross-language transfer learning using visual information for automatic sign gesture recognition. Int Arch Photogramm Remote Sens Spatial Inf Sci. 2023;XLVIII-2/W3-2023:209–16.
  30. 30. Papastratis I, Chatzikonstantinou C, Konstantinidis D, Dimitropoulos K, Daras P. Artificial intelligence technologies for sign language. Sensors (Basel). 2021;21(17):5843. pmid:34502733
  31. 31. Tao W, Leu MC, Yin Z. American sign language alphabet recognition using convolutional neural networks with multiview augmentation and inference fusion. Engineering Applications of Artificial Intelligence. 2018;76:202–13.
  32. 32. Chen Z, Kim J-T, Liang J, Zhang J, Yuan Y-B. Real-time hand gesture recognition using finger segmentation. ScientificWorldJournal. 2014;2014:267872. pmid:25054171
  33. 33. Tripathi KM, Kamat P, Patil S, Jayaswal R, Ahirrao S, Kotecha K. Gesture-to-text translation using SURF for Indian sign language. ASI. 2023;6(2):35.
  34. 34. Veeraiah D, Basha SJ, Deepthi KS, Sathvik T, Ganesh P. Enhancing communication for deaf and dumb individuals through sign language detection: a comprehensive dataset and SVM-based model approach. In: 2024 3rd International Conference on Applied Artificial Intelligence and Computing (ICAAIC). 2024. p. 947–54. https://doi.org/10.1109/icaaic60222.2024.10575397
  35. 35. Wang H, Leu MC, Oz C. American sign language recognition using multi-dimensional hidden Markov models. Journal of Information Science and Engineering. 2006;22(5):1109–23.
  36. 36. 36 Devineau G, Moutarde F, Xi W, Yang J. Deep learning for hand gesture recognition on skeletal data. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). 2018. p. 106–13. https://doi.org/10.1109/fg.2018.00025
  37. 37. Kumar EK, Kishore PVV, Kiran Kumar MT, Kumar DA. 3D sign language recognition with joint distance and angular coded color topographical descriptor on a 2 – stream CNN. Neurocomputing. 2020;372:40–54.
  38. 38. Talaat FM, El-Shafai W, Soliman NF, Algarni AD, Abd El-Samie FE, Siam AI. Real-time Arabic avatar for deaf-mute communication enabled by deep learning sign language translation. Computers and Electrical Engineering. 2024;119:109475.
  39. 39. Najib FM. A multi-lingual sign language recognition system using machine learning. Multimed Tools Appl. 2024;84(24):27987–8011.
  40. 40. Rahman MdM, Islam MdS, Rahman MdH, Sassi R, Rivolta MW, Aktaruzzaman M. A new benchmark on American sign language recognition using convolutional neural network. In: 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI). 2019. p. 1–6. https://doi.org/10.1109/sti47673.2019.9067974
  41. 41. Bantupalli K, Xie Y. American sign language recognition using deep learning and computer vision. In: 2018 IEEE International Conference on Big Data (Big Data). 2018. p. 4896–9. https://doi.org/10.1109/bigdata.2018.8622141
  42. 42. Ikram S, Dhanda N. American sign language recognition using convolutional neural network. In: 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON). 2021. p. 1–12. https://doi.org/10.1109/gucon50781.2021.9573782
  43. 43. Tao W, Leu MC, Yin Z. American Sign Language alphabet recognition using Convolutional Neural Networks with multiview augmentation and inference fusion. Engineering Applications of Artificial Intelligence. 2018;76:202–13.
  44. 44. Longo L, Brcic M, Cabitza F, Choi J, Confalonieri R, Ser JD, et al. Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Information Fusion. 2024;106:102301.
  45. 45. Kothadiya DR, Bhatt CM, Rehman A, Alamri FS, Saba T. SignExplainer: an explainable AI-enabled framework for sign language recognition with ensemble learning. IEEE Access. 2023;11:47410–9.
  46. 46. Paudyal P, Lee J, Kamzin A, Soudki M, Banerjee A, Gupta SK. Learn2Sign: explainable AI for sign language learning. In: IUI Workshops; 2019.
  47. 47. McCleary J, Garcia LP, Ilioudis C, Clemente C. Sign language recognition using micro-doppler and explainable deep learning. In: 2021 IEEE Radar Conference (RadarConf21). 2021. p. 1–6. https://doi.org/10.1109/radarconf2147009.2021.9455257
  48. 48. Ridwan AEM, Chowdhury MI, Mary MM, Abir MTC. Deep neural network-based sign language recognition: a comprehensive approach using transfer learning with explainability. arXiv preprint 2024. https://arxiv.org/abs/2409.07426
  49. 49. Al Abdullah BA, Amoudi GA, Alghamdi HS. Advancements in sign language recognition: a comprehensive review and future prospects. IEEE Access. 2024;12:128871–95.
  50. 50. Raji NR, Kumar RMS, Biji CL. Explainable machine learning prediction for the academic performance of deaf scholars. IEEE Access. 2024;12:23595–612.
  51. 51. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint 2015. https://arxiv.org/abs/1409.1556
  52. 52. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. Springer; 2015. p. 234–41.
  53. 53. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: Precup D, Teh YW, editors. Proceedings of the 34th International Conference on Machine Learning. vol. 70 of Proceedings of Machine Learning Research. PMLR; 2017. p. 3319–28.
  54. 54. Kumar Y, Marchena J, Awlla AH, Li JJ, Abdalla HB. The AI-powered evolution of big data. Applied Sciences. 2024;14(22):10176.
  55. 55. Ghosh A, Krishnamoorthy P. FedXAI-ISL: explainable artificial intelligence-based federated model in recognition and community decentralized learning of indian sign language. In: Hassanien AE, Anand S, Jaiswal A, Kumar P, editors. Innovative Computing and Communications. Singapore: Springer Nature Singapore; 2024. p. 385–93.
  56. 56. Joseph T, William MM, Jericho K, Isaac MM, Lule E, Kimbugwe N, et al. Explainable real-time sign language to text translation. In: Congress on Intelligent Systems. Springer; 2024. p. 213–42.