Evaluating Explainable Artificial Intelligence (XAI) techniques in chest radiology imaging through a human-centered Lens

doi:10.1371/journal.pone.0308758

Table 1.

Performance metrics (on the test set) of deep learning models on Dataset 1 (Chest X-ray images for pneumonia detection).

Table 2.

Performance metrics (on the test set) of deep learning models on Dataset 2 (CT Scans for COVID-19 detection).

Fig 1.

Accuracy and loss for the deep learning models used for Dataset 1 (Chest X-ray images for pneumonia detection).

The plots illustrate the model’s performance and convergence during the training process.

Fig 2.

Accuracy and loss for the deep learning model used for Dataset 2 (CT Scans for COVID-19 detection).

The plots illustrate the model’s performance and convergence during the training process.

Fig 3.

Receiver Operating Characteristic (ROC) (green) with Area under the ROC Curve (AUC) generated for the clinical case studies (Datasets 1 and 2).

ROC and AUC plots are generated for the best performing deep learning models for classifying Chest X-rays with and without Pneumonia, and CT scans with and without COVID-19. (a) ROC curve for Dataset 1 (Chest X-ray Scans for Pneumonia detection), (b) ROC curve for Dataset 2 (Chest CT Scans for COVID-19 detection).
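
As a rough illustration of how an ROC curve and its AUC can be derived from a classifier's outputs, the following minimal sketch sweeps a decision threshold over the predicted scores and integrates the resulting curve with the trapezoidal rule. The function names `roc_curve_points` and `auc`, and the toy labels and scores, are assumptions made for this example; they are not the authors' code or data.

```python
def roc_curve_points(y_true, y_score):
    """Return (fpr, tpr) lists obtained by lowering the threshold score by score.

    y_true holds binary labels (1 = positive, e.g. pneumonia), y_score holds
    the model's predicted scores. Both are hypothetical inputs here.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC needs at least one positive and one negative sample")

    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    prev = None
    for i in order:
        # Emit a point only when the score changes, so ties share one threshold.
        if prev is not None and y_score[i] != prev:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        prev = y_score[i]
    fpr.append(fp / neg)
    tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


# Toy example (illustrative values only, not results from the study).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
example_fpr, example_tpr = roc_curve_points(labels, scores)
example_auc = auc(example_fpr, example_tpr)  # 0.75 for this toy input
```

An AUC of 1.0 corresponds to perfect separation of the two classes, and 0.5 to a classifier that ranks cases no better than chance.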

Fig 4.

Constructed questions for questionnaire—part 1.1.

Questions aim to collect basic information concerning participants’ medical speciality, medical imaging knowledge, and overall experience in the medical field.

Fig 5.

Constructed questions for questionnaire—part 1.2.

Questions aim to explore participants’ familiarity with the concept of XAI in medical imaging.

Fig 6.

Explainable AI (XAI) visualization results for clinical case study one.

This figure illustrates XAI techniques (Grad-CAM (b and f) and LIME (d and h)) applied to chest X-ray images for pneumonia detection, highlighting the regions and features of the images that the deep learning model focuses on to make its predictions.

Fig 7.

Explainable AI (XAI) visualization results for clinical case study two.

This figure illustrates XAI techniques (Grad-CAM (b and f) and LIME (d and h)) applied to chest CT images for COVID-19 detection, highlighting the regions and features of the images that the deep learning model focuses on to make its predictions.

Fig 8.

Constructed questions for questionnaire—part 3.1.

Questions aim to assess the quality of the explanations provided by Grad-CAM and LIME, and their effectiveness in influencing clinical decision-making within the radiology workflow.

Fig 9.

Constructed questions for questionnaire—part 3.2.

Questions aim to assess the impact of the coloring scheme on the XAI visual results.

Fig 10.

Constructed questions for questionnaire—part 4.1.

Questions aim to collect recommendations for improving the explainability of AI models in medical imaging from the users’ perspective.

Fig 11.

Distribution of participants’ total medical experience.

The figure indicates that 18 participants have more than 10 years of experience, showcasing the overall experience levels within the participant group.

Fig 12.

Distribution of participants’ experience analyzing radiology images.

The histogram indicates that 13 participants have more than 10 years of experience, highlighting the expertise level within the group.

Fig 13.

Distribution of participants’ experience with AI-based medical imaging tools.

The figure reveals that 16 participants have zero experience with AI-based medical imaging tools, highlighting a significant portion of the group with no prior exposure to this technology.

Fig 14.

Distribution of participants’ familiarity with AI.

The figure shows that 14 participants reported being “little familiar” with AI, highlighting the varying levels of AI knowledge among the participants.

Fig 15.

Distribution of participants’ comfort with the general widespread use of AI.

The figure shows that most participants feel very comfortable with the general widespread use of AI.

Fig 16.

Distribution of participants’ comfort with the medical decisions generated from AI-based tools.

The figure shows that opinions are almost evenly split between “Not sure” and “Comfortable”.

Fig 17.

Distribution of participants’ confidence in AI-based diagnostic tools.

The figure shows that most participants, fourteen in total, reported poor confidence in AI-based diagnostic decisions.

Fig 18.

Distribution of participants’ support for understanding the decision-making process of AI algorithms used in medical imaging.

The figure illustrates that nineteen participants consider it crucial for medical practitioners to understand the rationale of AI decisions in medical imaging systems, while only five participants view this aspect as unimportant.

Fig 19.

Distribution of participants’ awareness of XAI.

The figure shows that most participants reported poor familiarity with XAI in medical imaging.

Fig 20.

Distribution of participants’ belief in the effectiveness of insights from XAI tools.

The figure shows that most participants didn’t respond to this question due to their poor familiarity with the XAI concept.

Fig 21.

Grad-CAM clinical relevance (Usefulness).

The figure shows that most participants expressed positive evaluations on the usefulness of the Grad-CAM method in explaining the AI results.

Fig 22.

Participants’ views on Grad-CAM colouring scheme.

The figure shows that thirteen participants indicated that the colored heatmaps had a negative impact on the readability of the XAI results.

Fig 23.

LIME clinical relevance (Usefulness).

The figure shows that nine participants scored LIME below 2 on the usefulness criterion.

Fig 24.

Grad-CAM comprehensibility.

The figure shows that twenty-two participants rated the heatmap visualisations positively, with scores of three or higher.

Fig 25.

LIME comprehensibility.

The figure shows that only six participants assigned a score of 4 or 5 on the comprehensibility criterion.

Fig 26.

Participants’ preference between Grad-CAM and LIME.

The figure shows that nineteen participants favoured Grad-CAM (heatmap) over LIME visualisations.

Fig 27.

Grad-CAM and LIME confidence.

The figure shows that nine participants expressed confidence in the accuracy of the XAI visualisations, while seven participants lacked confidence in the results.

Fig 28.

Impact of XAI on improving trust in AI.

The figure shows that twelve participants expressed uncertainty about the impact on their trust in AI results in medical imaging, and eleven participants reported an improvement in their trust in AI systems after reviewing XAI visualizations.
