
The impact of AI feedback on the accuracy of diagnosis, decision switching and trust in radiography

  • Clare Rainey ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    c.rainey@ulster.ac.uk

    Affiliation Ulster University, School of Health Sciences, York St, Northern Ireland

  • Raymond Bond,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Ulster University, School of Computing, York St, Northern Ireland

  • Jonathan McConnell,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation University of Salford, School of Health and Society, Manchester, United Kingdom

  • Avneet Gill,

    Roles Writing – review & editing

    Affiliation Ulster University, School of Health Sciences, York St, Northern Ireland

  • Ciara Hughes,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Ulster University, School of Health Sciences, York St, Northern Ireland

  • Devinder Kumar,

    Roles Methodology, Software

    Affiliation Head – MLOps, Layer6 AI/School of Medicine, Stanford University, Toronto, Canada

  • Sonyia McFadden

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Ulster University, School of Health Sciences, York St, Northern Ireland

Abstract

Artificial intelligence (AI) decision support systems have been proposed to assist a struggling National Health Service (NHS) workforce in the United Kingdom, and their implementation in UK healthcare systems has been identified as a priority for deployment. Few studies have investigated the impact of the feedback from such systems on the end user. This study investigated the impact of two forms of AI feedback (saliency/heatmaps and AI diagnosis with percentage confidence) on student and qualified diagnostic radiographers’ accuracy when determining a binary diagnosis on skeletal radiographs. The AI feedback proved beneficial to accuracy in all cases except when the AI was incorrect and, in the student group, for pathological cases. The self-reported trust of all participants decreased from the beginning to the end of the study. The findings of this study should guide developers in providing the most advantageous forms of AI feedback and direct educators in tailoring education to highlight weaknesses in human interaction with AI-based clinical decision support systems.

Introduction

The current backlog and delay in the reporting of radiographs has driven investigations into the adoption of new technologies that could increase efficiency and “free up clinicians” to spend more time with patients [1,2]. Artificial intelligence (AI) has been proposed as a solution for automating the diagnosis of pathology on radiographic images, e.g., in breast and chest imaging [3–5]. Dramatic developments in computer technology and processing power have permitted ever more sophisticated and useful applications of AI. The latest technologies mimic the way the human brain functions, so that the AI can ‘learn’ from experience. AI systems have been shown to have a high degree of accuracy in the detection of abnormality on radiographic images; however, clinical utilisation remains incomplete because of the lack of transparency in how these systems make decisions, resulting in trust issues between users and the system.

Background

The first paper detailing the use of computers to assist in the diagnosis of pathology from radiographic images was published in the 1960s [6]. Rapidly increasing computational power has permitted the development of ever more sophisticated pathology detection systems, from the computer-aided detection systems used in mammography in the 1980s to proposals for autonomous triage systems in the present day [7]. Differing methods of analysis of radiographic, and other, images have been proposed. Deep learning (DL) systems using convolutional neural networks (CNNs) are one of the most recent and seemingly most promising forms of AI for detecting disease on radiographic images.

The use of AI has been targeted as an area of focus for modernising and future-proofing the NHS in the UK with proposed tasks such as image interpretation, autonomous triage and natural language processing [8]. This is particularly important in worldwide healthcare systems coping with the current and ongoing pressures of the COVID-19 global pandemic, where resources are limited [9].

Promising accuracies of DL using CNNs for the detection of pathology from plain radiographs have been reported for chest imaging [3,4] and mammography [10]; however, the possibilities for determining diagnosis from skeletal radiographs have been less extensively investigated [11]. This is despite plain radiography being the initial modality of choice for skeletal imaging, with recent figures indicating that plain radiography made up in excess of 23 million of a total of 44.7 million radiographic imaging examinations a year in the UK (in the period May 2018 to May 2019 alone) [12]. In the USA, the number of imaging examinations involving radiation continues to rise [13], although more detailed national data are not available.

The first publication of promising experimental results for detecting fractures on skeletal extremity radiographs was in 2017 [14]. Since then, other findings have been published evidencing the impressive performance of CNNs for pathology detection in comparison to, and in conjunction with, human experts [11,15–17].

Despite reported accuracies and benefits, clinicians’ trust in AI remains a barrier to AI implementation in the healthcare setting [7,18]. This is particularly the case with the use of DL systems. DL algorithms make use of multiple neural layers to analyse and process image data, but a number of these layers are hidden from the user. It is not entirely apparent, therefore, how the algorithm reaches its ultimate decision. This raises ethical and legal issues as well as having implications for users’ trust in the system: if the user does not fully understand how the AI has reached its decision, can the clinician be expected to assume ultimate responsibility for the outcome [19]? Additional information provided by the AI system, such as percentage confidence in diagnosis, triage recommendation and suggestions for further imaging, has been proposed as other useful AI outputs [20].

Attempts are currently being made to make the DL decision-making process more transparent using visual representations to highlight the areas on the image that the AI is attending to, for example, attention/saliency heatmaps and regions of interest superimposed onto the radiograph [21,22]. It is proposed that a user may be able to calibrate their trust in the AI if the user can see the area/s on the image that the AI focussed on when making its decision.

Decision switching occurs when a decision-maker changes their initial image interpretation or diagnosis based on new information, or by assessing the same information from a different perspective. In the field of medical imaging, an overreliance on computer input has been found to cause errors of commission and omission [23,24]. This is known as automation bias and is defined as a human naively over-relying on computer information. This happens when the human has more faith in the machine rather than their own cognitive conclusions [23,24]. This would not be a problem in a perfect system, but this is not reflective of real life, where errors can occur in both humans and computer systems. Using AI may also cause the user to choose to change their mind in a positive direction, resulting in the desirable outcome of an increase in diagnostic accuracy as a result of interaction with the AI.

This study investigates the effect of feedback from an AI algorithm on diagnostic accuracy, decision switching and trust in student and qualified radiographers. The latest census from both the Royal College of Radiologists [25] and the Society and College of Radiographers [26] identify shortages of imaging professionals of up to 17% across the UK. With increased numbers of newly qualified and student imaging professionals in the NHS to fill this gap, it is important to understand how they, as well as currently practicing radiographers, will interact with new technologies being integrated into the imaging department. This study focuses on diagnostic radiographers, but these findings may be useful in benchmarking the impact of different forms of AI feedback on accuracy, decision switching and trust in all clinicians who use radiographic images for diagnosis.

To the authors’ knowledge, no study has investigated the impact of the type of AI feedback on the diagnostic accuracy of radiographers and the impact that level of experience has on the acceptability of the AI decision.

Summary

AI is already present in healthcare and will be increasingly utilised in the future. This study aims to clarify how radiographers and student radiographers are affected by feedback from a poorly functioning AI system. This is particularly important as the literature is brimming with potentially promising results of AI performance. This study uses an AI model which performed well in the laboratory (test set) but poorly on more clinically relevant images (clinical dataset). In addition, any difference in the perceived trust and acceptance of AI-aided diagnoses between students and qualified radiographers was also investigated. The findings are intended to provide direction for educating undergraduate and practicing clinicians to maximise the promise, and recognise the pitfalls, of integrating AI into the clinical setting. It is envisaged that these findings will indicate areas where caution should be exercised and aid developers in incorporating the most useful forms of AI feedback in their systems.

Aim and objectives

This experimental study aimed to discover how a binary diagnosis and visual feedback from an AI algorithm affects the diagnostic accuracy of radiographers with differing levels of expertise when interpreting radiographic images of the upper extremities.

The principal aim was to quantify the impact on performance, decision switching and trust in an AI algorithm following exposure to two different forms of AI feedback. The two forms of AI feedback assessed were:

  • AI feedback type 1: an attention map that shows where on the image the AI is attending when making its decision (Fig 1); and
  • AI feedback type 2: a simple binary diagnosis, i.e., the model suggests that there is either a pathology or no pathology (with a percentage confidence in its decision).

This study has the following objectives:

  1. to determine the baseline diagnostic accuracy of radiographers of differing levels of expertise when interpreting a selection of radiographic images of the upper appendicular skeleton;
  2. to expose the participants to both binary and visual feedback from an AI algorithm;
  3. to investigate the impact of the AI feedback on diagnostic accuracy;
  4. to investigate the effect of this AI feedback on decision switching;
  5. to investigate participants’ perceptions of trust in the AI system.

Methods

Ethical approvals

Ethical permission for this study was granted by the Ulster University Nursing and Health Research Ethics Filter Committee (FCNUR-20–035). Online informed consent was gathered from all participants prior to commencement of the study, via an initial slide presentation detailing the background, aims and objectives of the study. There were no minors participating in this study. Participants were permitted to exit the study at any point, but they were informed that their submissions up to that point would be included in the analysis. Ethical permission to use the clinical dataset images for research purposes had been granted previously (Monash University, Clayton, Australia, 2011).

Model training

MURA, a large dataset for abnormality detection in musculoskeletal radiographs, was used for training and testing of our AI model. MURA consists of 40,561 images from 14,863 studies of the upper extremity, with each study labelled by a radiologist as either normal or abnormal. For this binary classification task, musculoskeletal radiographs from seven upper-extremity regions (shoulder, humerus, elbow, forearm, wrist, hand and finger) were used. The dataset is divided into training and validation sets, with 9,045 normal and 5,818 abnormal radiographic studies divided between the two. The training set contains 11,184 patients with 13,457 studies and 36,808 images. The images in all sets vary in resolution and aspect ratio, and there is no overlap of patients between the training and validation sets.

Test set

As MURA has no explicit test set, we used half of the validation set (783 patients, 1,199 studies and 3,197 images) as our test set and the rest as the validation set, again ensuring that there was no overlap of patients between any of the sets. The test set was chosen to contain approximately half of each of the upper-extremity regions for adequate and balanced representation of each class.
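The patient-level split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: the function name and the use of a seeded shuffle are our own assumptions; the key property it demonstrates is that partitioning by unique patient ID guarantees no patient appears in both sets.

```python
import random

def split_by_patient(patient_ids, test_fraction=0.5, seed=0):
    """Partition unique patient IDs so no patient spans two partitions."""
    unique = sorted(set(patient_ids))      # deduplicate: split patients, not images
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * test_fraction)
    test_patients = set(unique[:cut])
    val_patients = set(unique[cut:])
    return val_patients, test_patients

# 1,566 hypothetical patients halved into validation and test partitions
val_p, test_p = split_by_patient([f"p{i}" for i in range(1566)])
assert val_p.isdisjoint(test_p)            # no patient overlap between sets
```

Splitting at the patient level (rather than the image or study level) is what prevents information leakage between validation and test performance.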

AI model

In this study we used a convolutional neural network (CNN), specifically ResNet-152 pretrained on ImageNet. During training, one or more views of a study are presented to the CNN and the arithmetic mean of the outputs is taken to determine whether the study is abnormal or normal, similar to the original MURA study [40]. Any mean probability greater than 0.5 is deemed abnormal. Using this criterion, the model was trained on the training set until the network stopped improving, at which point training was halted under an early stopping criterion. For optimisation, the Adam optimizer was used with an initial learning rate of 10^-4.
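The study-level decision rule above (average the per-view outputs, then threshold at 0.5) can be illustrated with a small sketch. The function name is hypothetical; the per-view probabilities would come from the ResNet-152 described in the text.

```python
def study_prediction(view_probabilities):
    """Aggregate per-view CNN abnormality probabilities into one study-level
    binary diagnosis: mean probability > 0.5 labels the study abnormal."""
    mean_prob = sum(view_probabilities) / len(view_probabilities)
    label = "abnormal" if mean_prob > 0.5 else "normal"
    return label, mean_prob

# e.g. three views of one wrist study
label, conf = study_prediction([0.91, 0.76, 0.83])  # -> ("abnormal", ~0.83)
```

Averaging across views means one ambiguous projection cannot override a clear finding on the others, mirroring how a human reader weighs a whole examination.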

Saliency map

To understand the model’s predictions and for use in this study, we created a binary saliency map for each image, output alongside its abnormality score. Each saliency map was created using the binary map creation technique described by Kumar et al., 2018 [21]. In the binary saliency map, we use a heatmap overlay in which white indicates the strongest regions and black indicates null values. The spatial location of the binary saliency map (and the associated heatmap) indicates the area of the input radiographic image which the model used to produce the given output, and the strength of the heatmap indicates how strongly each spatial region contributed towards the given abnormality score. This form of explainable AI allows the participant to determine whether the binary output of the AI relates to the appropriate area of the image, or is based on another area that the user deems incorrect or inconsequential.
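A minimal sketch of the binarisation step described above: an activation map is normalised to [0, 1] and thresholded so that only the strongest regions survive (white, value 1) while the rest become null (black, value 0). This illustrates the general idea of a binary saliency overlay rather than reproducing Kumar et al.'s exact technique; the threshold value is an assumption.

```python
import numpy as np

def binary_saliency(activation, threshold=0.5):
    """Normalise an activation map to [0, 1] and keep only strong regions."""
    a = activation.astype(float)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)  # min-max normalisation
    return (a >= threshold).astype(np.uint8)        # 1 = strong, 0 = null

# toy 2x2 activation map; the two right-hand cells exceed the threshold
act = np.array([[0.1, 0.9],
                [0.4, 0.7]])
mask = binary_saliency(act)  # [[0, 1], [0, 1]]
```

In practice the binary mask (or a graded heatmap) would be resized to the radiograph's resolution and alpha-blended over it for display to the participant.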

Test dataset

The test dataset consists of radiographic images of the upper appendicular skeleton, obtained from real patients presenting to a hospital in Australia and used as part of another PhD study [41]. The radiographic examinations in the dataset have all been anonymised: identifiable information such as the patient’s name, date of birth and health and care number has been removed from each image. The images do not contain any rare abnormalities or pathologies which could readily identify an individual.

The full dataset contains a total of 268 examinations, with approximately a 3:7 split of pathology to no pathology. Twenty-one examinations were chosen at random for inclusion in this study.

The participants were blinded to the ground truth at all stages of the study, to avoid bias [42].

The radiographic examinations were used to determine diagnosis. Three to five radiological reports from radiologists and reporting radiographers were available for each examination. A consensus binary diagnosis (fracture/no fracture) was determined by inspection of the radiology reports, and this consensus is used as ‘ground truth’ in this study. Agreement of the participants in this study with the ground truth has been termed ‘accuracy’.
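The consensus rule described above (three to five binary reports per examination, with the majority label taken as ground truth) can be sketched as a simple majority vote. The function and label names are our own; the paper does not publish its consensus code.

```python
def consensus_diagnosis(reports):
    """Majority vote over binary radiology reports ('fracture'/'no fracture').
    With 3-5 reports there is always a strict majority for one label."""
    fracture_votes = sum(1 for r in reports if r == "fracture")
    return "fracture" if fracture_votes > len(reports) / 2 else "no fracture"

consensus_diagnosis(["fracture", "fracture", "no fracture"])  # -> "fracture"
```

Participant 'accuracy' in the study is then simply the proportion of participant decisions that agree with this consensus label.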

The AI model described above was used to obtain diagnosis for each examination. Predictions were produced as a binary diagnosis (i.e., pathology/no pathology) and percentage confidence of the AI in its decision. A heatmap overlay (GradCAM) was also provided on each image (Fig 6).

Of the 21 examinations included in this study, the AI made the correct prediction on 12 examinations (57.1% accuracy). There was pathology present on nine out of the 21 examinations (42.9%) (Available in ‘Supporting Information’, S1)

Patient-public involvement

A PPI group was set up to help drive the direction of this study and to ensure the study is relevant and useful to the public and radiographers in clinical practice. The group consisted of two student radiographers, two practicing radiographers of differing levels of experience (approximately 40 years’ and 15 years’ experience) and one patient with a clinical history of repeated attendance for plain radiographic examinations of the appendicular skeleton due to repeated sports injuries.

Pilot study

Six images were selected randomly from each anatomical region in the test dataset (fingers, hand, wrist, forearm, elbow and humerus) and embedded into Qualtrics® for interpretation by seven participants. Purposive sampling was used to select pilot participants who represented the target respondents, to ensure all potential participants would understand what was expected of them. Representation from each year of a UK diagnostic radiography programme (Ulster University) was obtained, along with qualified radiographers with differing lengths of clinical experience. Participants were asked to comment on the acceptability of the study design, the quality of the images in the survey and the time taken to complete the survey, ensuring face and content validity and the acceptability of the time required to complete the study. This information was used to build the survey for the full study.

Qualtrics survey

The number of images in the survey was chosen based on an acceptable estimated time for completion (approximately 15 minutes). From the test dataset (n=21 examinations), the randomiser function in Qualtrics was used to allocate three radiographic examinations to each participant. Three examinations were chosen to encourage participation, as this completion time was deemed acceptable to participants and likely to encourage thoughtful responses, therefore avoiding random responses and premature cessation of the survey [43].

Each examination contained two or more radiographic images. Each image in the examination was presented, and the participant was asked to determine if there was a pathology present on the image. The participant was then presented with the heatmap overlay and asked again if they felt there was a pathology present, and whether the AI heatmap had caused them to change their mind from their initial decision. This was repeated for each image in the examination. When all images had been presented, the participant was shown the images again and asked if they felt there was a pathology present. The responses to this question were not included in the analysis; the step was provided to ensure the participant had access to all images again to best determine the impact of the binary feedback, which was determined for the entire examination and not per image.

Following the above, participants were given the AI binary diagnosis and asked if they would like to change their mind from their first evaluation of the images. When all images, heatmaps and binary diagnoses had been presented, the participants were asked to determine if they felt there was a pathology present. This question was included to represent the clinical scenario, where clinicians would have the opportunity to view all images to determine a final diagnosis. They were then provided with the AI binary decision and asked if they now believed there to be a pathology present on the image. They were asked if the binary feedback had caused them to change their mind from their initial decision, and to indicate their trust in the AI following exposure to all images and AI feedback for that examination (Available in ‘Supporting Information’).

Participant selection

The study was open to all diagnostic radiographers currently in clinical practice, including students. The landing page of the Qualtrics® survey provided participants with information on the study rationale and aim, along with a brief precis of the relevant literature on the subject. Informed consent was requested via a yes/no response indicating the participant’s desire to proceed. If the participant indicated that they did not give their consent, the ‘skip logic’ function exited them from the study. A final page notified respondents of the submission of their responses, although a full review of responses was not given. The study was promoted via the European Congress of Radiology (ECR) Research Hub (open from 2nd March to 12th April 2021) and on social media (Twitter® and LinkedIn®). The last response included in the analysis was collected on 2nd November 2021; data was, therefore, collected between 2nd March and 2nd November 2021. Due to the lack of research in the area, this method of convenience and snowball sampling was felt to be appropriate to gain insight upon which to base future studies. A power calculation was not carried out due to the lack of previous studies in this area; however, ‘rule-of-thumb’ estimates indicate that there should be 10–15 participants in each group for quantitative studies [44,45].

Participants were grouped into broader experience categories (student radiographers/qualified radiographers) to ensure an adequate sample size in each group and to allow for more meaningful outcomes from statistical analyses.

Statistics and reproducibility

Tests of normality (Kolmogorov-Smirnov and Shapiro-Wilk) were conducted. Skewness and kurtosis were assessed visually by inspection of histograms and distribution curves, and the mean and median were compared for each condition in both the student and radiographer groups. Data was found to be normally distributed, and parametric tests were therefore used for inferential statistics (Table 2).
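The normality checks described above can be sketched with SciPy, assuming it is available; the accuracy data here are simulated, not the study's. Shapiro-Wilk tests the null hypothesis that the sample is normally distributed, and a one-sample Kolmogorov-Smirnov test compares the standardised sample against a standard normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
accuracy = rng.normal(loc=65, scale=10, size=40)  # simulated % accuracies

# Shapiro-Wilk: null hypothesis = data are normally distributed
w_stat, p_shapiro = stats.shapiro(accuracy)

# Kolmogorov-Smirnov on the standardised sample vs the standard normal
z = (accuracy - accuracy.mean()) / accuracy.std(ddof=1)
ks_stat, p_ks = stats.kstest(z, "norm")

# If p > 0.05 in both tests, there is no evidence against normality,
# justifying parametric inference (t-tests, repeated measures ANOVA).
```

Note that standardising with sample estimates before the KS test is a common simplification; a Lilliefors correction would be more rigorous, which is one reason Shapiro-Wilk is usually preferred for samples of this size.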

Descriptive statistics are used to describe the impact of the AI feedback on participants’ accuracy, sub-divided by experience category (i.e., student and radiographer) and condition (i.e., instances where the AI was correct or incorrect, and pathological and non-pathological cases). Data is presented per examination, as each examination contained a differing number of images. Participants were allocated three examinations at random; therefore, data is analysed as percentage accuracy rather than by total number of decision points, although the data pertaining to the total number of decision points is given in Table 3.

Table 3. Impact of A.I feedback on student and qualified diagnostic radiographers’ diagnostic accuracy.

https://doi.org/10.1371/journal.pone.0322051.t003

Participant accuracy was not considered as related to the individual, but rather as a group: student or radiographer (Fig 2). Diagnostic accuracy was determined at three points; before any AI feedback, following exposure to the AI generated heatmap and following the AI binary diagnosis. The findings are tabulated, in full, in Table 3.

Fig 2. Graphical representation of data analysis – t-test.

https://doi.org/10.1371/journal.pone.0322051.g002

The feedback was provided sequentially, i.e., pre-heatmap (no AI feedback), post-heatmap and post-AI binary diagnosis; therefore, repeated measures ANOVA was used to investigate the impact of the type of AI feedback provided (Fig 3). Post-hoc pairwise comparisons were conducted to determine the specific factors responsible for any differences. Combined effects of experience level (students, radiographers) were used to investigate any differences in accuracy in response to the AI feedback. The effect size of any statistically significant finding was estimated using partial eta squared (ηp² = SS_effect / (SS_effect + SS_error)). Effect sizes are reported using an established ‘rule of thumb’:

Fig 3. Graphical representation of data analysis – ANOVA.

https://doi.org/10.1371/journal.pone.0322051.g003

ηp2 = 0.01 indicates a small effect

ηp2 = 0.06 indicates a medium effect

ηp2 = 0.14 indicates a large effect [46]

Two-tailed t-tests investigated the significance of any differences between the accuracy of the student and radiographer groups under each of the four investigated conditions: (i) AI correct, (ii) AI incorrect, (iii) pathological cases and (iv) non-pathological cases. Cohen’s d was used to estimate the effect size of any statistically significant result (small: 0.2, moderate: 0.5, large: 0.8) [47].
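Cohen's d for the between-group comparison above is the difference in group means divided by the pooled standard deviation. A self-contained sketch, with illustrative numbers rather than the study's data:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# hypothetical % accuracies for two groups
students = [55, 60, 58, 62, 50, 57]
radiographers = [65, 70, 66, 72, 60, 68]
d = cohens_d(radiographers, students)  # well above 0.8 -> large effect
```

Because d is expressed in pooled-standard-deviation units, it can flag a practically meaningful difference even when small samples (as in some conditions here) leave the t-test non-significant.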

Repeated measures ANOVA was used to investigate any statistically significant difference in the impact of the type of AI feedback on diagnostic accuracy. The rate of decision switching is presented using descriptive statistics for the collective group for each of these scenarios: where the (i) AI was correct, (ii) AI was incorrect, and where the image was (iii) pathological or (iv) non-pathological. This was repeated for each group (students and radiographers). The direction of the switch for each group under each condition ((i)–(iv)) was inferred from the impact of the AI on accuracy: if the accuracy of the group increased, the AI feedback had a positive effect on the diagnostic accuracy of the participants.

Data was tabulated and graphically represented, where appropriate.

The impact of the different forms of AI feedback on the propensity of the participants to change their mind from their initial diagnosis was investigated. All participants were asked if the AI feedback had caused them to change their mind from their initial diagnosis. This question was posed following the AI feedback in the form of the heatmap, and again following provision of the AI binary diagnosis.

Results

All data analysis was conducted on SPSS® v 27 [27] and Microsoft® Excel® [28].

Demographics

Full demographic details of the participants are given in Table 1. Following cleaning of the data, 94 participants were included in the analyses. Responses were included if at least part of the study was completed, and removed if the participant did not give consent via the Qualtrics platform or did not complete any part of any of the questions. Of the 94 participants, 57.4% (n=54) were students and 42.6% (n=40) were radiographers, representing a range of experience levels from year one of an undergraduate degree programme to greater than 20 years’ clinical experience. Most respondents were from the UK (England, Scotland, Northern Ireland) or Ireland (85%, n=80).

Accuracy

‘Ground truth’ was determined as the consensus diagnosis from at least three out of five reporting radiographers and radiologists (see Methods section). ‘Accuracy’ is defined in this study as the agreement of the participants with the ground truth diagnosis. Percentage accuracy is reported for each of the two experience levels: student radiographers (‘students’) and qualified radiographers (‘radiographers’). Descriptive statistics are reported for each experience level. Data was found to be normally distributed (see Methods section and Table 2), and the significance of any relationships in the data was investigated using t-tests (α=0.05) comparing the accuracy of the student and radiographer groups, and repeated measures ANOVA investigating the impact of each form of AI feedback (Figs 2 and 3). Full results are presented in Figs 4–12 and Tables 3–5. Dotted lines in all figures represent trendlines (lines of best fit).

Fig 4. Impact of AI feedback on all participants’ diagnostic accuracy.

https://doi.org/10.1371/journal.pone.0322051.g004

Fig 5. Impact of AI feedback on all participants’ diagnostic accuracy, when the AI feedback is correct and incorrect.

https://doi.org/10.1371/journal.pone.0322051.g005

Fig 6. Impact of AI feedback on all participants’ diagnostic accuracy in pathological and non-pathological cases.

https://doi.org/10.1371/journal.pone.0322051.g006

Fig 7. Impact of AI feedback on student participants’ diagnostic accuracy.

https://doi.org/10.1371/journal.pone.0322051.g007

Fig 8. Impact of AI feedback on student participants’ diagnostic accuracy, when the AI feedback is correct and incorrect.

https://doi.org/10.1371/journal.pone.0322051.g008

Fig 9. Impact of AI feedback on student participants’ diagnostic accuracy in pathological and non-pathological cases.

https://doi.org/10.1371/journal.pone.0322051.g009

Fig 10. Impact of AI feedback on radiographer participants’ diagnostic accuracy.

https://doi.org/10.1371/journal.pone.0322051.g010

Fig 11. Impact of AI feedback on radiographer participants’ diagnostic accuracy, when the AI feedback is correct and incorrect.

https://doi.org/10.1371/journal.pone.0322051.g011

Fig 12. Impact of AI feedback on radiographer participants’ diagnostic accuracy in pathological and non-pathological cases.

https://doi.org/10.1371/journal.pone.0322051.g012

The qualified radiographers had greater accuracy across all examinations, under all conditions; however, this difference was not statistically significant. Initial accuracy was used as a baseline to clarify the impact of the AI feedback (Figs 4–12). The standard deviation is presented as error bars on all graphs and listed numerically in Table 3. These are large because accuracy differed across examinations, i.e., some examinations may be more ‘difficult’ to interpret than others, although ‘difficulty’ of the task was not included in the analysis for this study. Figs 4–12 illustrate the impact of the AI feedback on participants collectively, followed by more granular analysis of students and radiographers under each of the four conditions: when the AI agrees with ground truth (‘AI correct’), when it disagrees with ground truth (‘AI incorrect’), in pathological cases and in non-pathological cases. The initial point on each graph represents the initial accuracy of the users collectively, and further points illustrate the impact of the heatmap and binary forms of feedback, sequentially.

Further interrogation of the data revealed that although there was no statistically significant difference in the two groups’ diagnostic accuracy, there was a small to moderate effect size under all conditions – when the AI feedback was correct, when the AI was incorrect, in pathological cases and in non-pathological cases. This disparity between statistical significance and effect size may be due to small sample size in some cases (n=16–26). The findings are presented in full in Table 4.

Table 4. t-tests comparing students’ and radiographers’ accuracy in determining diagnosis from radiographic images, following AI decision support, across four conditions: AI correct/incorrect and pathological/non-pathological cases.

https://doi.org/10.1371/journal.pone.0322051.t004
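The comparison reported in Table 4 can be sketched as follows. This is a minimal illustration, not the study’s analysis code, and the accuracy values are invented placeholders: it computes Welch’s t statistic (no equal-variance assumption) and Cohen’s d with a pooled standard deviation for two independent groups.

```python
# Illustrative sketch of an independent-samples comparison with an effect size.
# The accuracy values are placeholders, NOT data from this study.
from statistics import mean, stdev
from math import sqrt

students = [0.55, 0.60, 0.48, 0.62, 0.58, 0.50]        # per-examination accuracy
radiographers = [0.61, 0.66, 0.57, 0.70, 0.63, 0.59]

n1, n2 = len(students), len(radiographers)
s1, s2 = stdev(students), stdev(radiographers)

# Welch's t statistic (does not assume equal group variances)
t = (mean(radiographers) - mean(students)) / sqrt(s1**2 / n1 + s2**2 / n2)

# Cohen's d with pooled standard deviation; ~0.2/0.5/0.8 = small/medium/large
pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (mean(radiographers) - mean(students)) / pooled_sd
print(f"t = {t:.2f}, d = {d:.2f}")
```

A small-to-moderate d alongside a non-significant p is exactly the pattern described above when group sizes are small.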

Impact of AI feedback

Two forms of AI feedback were provided in sequence in this study: 1) a ‘heatmap’ overlay and 2) a binary diagnosis with the model’s % confidence in its decision. The heatmap (or ‘saliency’ map) provides a visual indication of the area(s) of the image that the system found most important in determining its diagnosis (Figs 13–16).

Fig 13. Patient 11 – Pathological examination: AI correct (83.6% confidence).

https://doi.org/10.1371/journal.pone.0322051.g013

Fig 14. Patient 2 – Pathological examination: AI incorrect (99.3% confidence in incorrect diagnosis).

https://doi.org/10.1371/journal.pone.0322051.g014

Fig 15. Patient 16 – Non-pathological examination: AI correct (97.0% confidence).

https://doi.org/10.1371/journal.pone.0322051.g015

Fig 16. Patient 12 – Non-pathological examination: AI incorrect (65.24% confidence in incorrect diagnosis).

https://doi.org/10.1371/journal.pone.0322051.g016

There is a statistically significant difference in participant accuracy across the AI feedback stages (i.e., pre-AI feedback, post-heatmap and post-binary feedback) when the AI is correct (p=.002) and when the examination has been determined as demonstrating pathology (p=.013) (available in ‘Supporting Information’).

Pairwise comparisons indicate that, when the AI is correct, there is a significant improvement in participants’ performance between the stage before presentation of any AI feedback and the stage following presentation of the binary AI feedback (p=.007) (i.e., between the ‘plain’ image and the textual AI feedback: ‘The AI system determined that this examination/imaging series DID/DID NOT contain evidence of pathology with x % certainty’). In pathological examinations, there was a statistically significant difference both between the pre-AI feedback and post-heatmap stages (p=.015) and between the post-heatmap and post-binary feedback stages (p=.013). Inspection of the descriptive statistics indicates a statistically significant decrease in performance following presentation of the heatmap, followed by an increase exceeding the unaided performance (no AI feedback 65.85%, post-heatmap 57.62%, post-binary feedback 72.35%), suggesting that the heatmap was detrimental to performance in pathological cases.

Decision switching

Students were more likely than radiographers to change their mind following heatmap feedback (23.5% students, 14.3% radiographers – difference 9.2%) (Figs 17 and 18). The student group were also more likely to change their mind following binary feedback, with a greater difference between the two experience groups than for heatmap provision only (32.7% students, 19.3% radiographers – difference 13.4%). There was also a difference in the instances where participants felt they would reconsider their initial opinion following both heatmap and binary diagnosis (19.8% students, 11.0% radiographers – difference 8.8%; 27.0% students, 12.9% radiographers – difference 14.1%, for heatmap and binary AI feedback respectively) (Figs 17 and 18). This indicates that AI feedback is more likely to cause students to change their minds and to feel uncertainty about their initial decision.

Fig 17. Impact of heatmap feedback on students’ and radiographers’ propensity to change their mind from their original decision.

https://doi.org/10.1371/journal.pone.0322051.g017

Fig 18. Impact of binary feedback on students’ and radiographers’ propensity to change their mind from their original decision.

https://doi.org/10.1371/journal.pone.0322051.g018

The Mann-Whitney U test was conducted to investigate the statistical significance of these findings. The decision switching rate of student radiographers differed significantly from that of radiographers following presentation of the heatmap, for yes (p=.023), no (p=.002) and reconsider responses (p=.008), with the student group reporting that they changed their mind or reconsidered their initial diagnosis more often than the radiographer group. The radiographer group reported not changing their mind from their initial decision more often than students following both heatmap and binary AI feedback. A medium effect size (Pearson’s r) was found in all cases. Full results are presented in Table 5a.

Table 5a. Mann-Whitney U test applied to differences in rates of decision switching (instances of yes, no and reconsider responses expressed as a proportion of the total responses) between students and radiographers. Mean ranks are reported and effect size has been reported using Pearson’s r, with effect sizes: small 0.1–0.3, medium 0.3–0.5 and large 0.5 and over (Cohen, 1988).

https://doi.org/10.1371/journal.pone.0322051.t005
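The effect size reported in Table 5a can be illustrated as follows. This is a hedged sketch rather than the study’s code: it derives Pearson’s r from the Mann-Whitney U statistic via the normal approximation (r = |Z|/√N), with Cohen’s (1988) thresholds. The U value and group sizes below are illustrative, not study values.

```python
# Illustrative sketch: Pearson's r from a Mann-Whitney U statistic.
# U and the group sizes are placeholders, NOT values from this study.
from math import sqrt

def mann_whitney_effect_size(u: float, n1: int, n2: int) -> float:
    """Pearson's r from Mann-Whitney U via the normal approximation (no tie correction)."""
    mu = n1 * n2 / 2                              # mean of U under the null hypothesis
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # SD of U under the null hypothesis
    z = (u - mu) / sigma
    return abs(z) / sqrt(n1 + n2)                 # r = |Z| / sqrt(N)

r = mann_whitney_effect_size(u=120, n1=26, n2=16)
band = "small" if r < 0.3 else "medium" if r < 0.5 else "large"
print(f"r = {r:.2f} ({band})")
```

With a U further from its null mean, Z grows and r moves from the small band into the medium band described in the table caption.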

As this data was self-reported by the participants, further analysis was conducted on the respondents’ diagnoses to determine the rate and direction of the decision switch, i.e., whether a change of mind was positive (switching from an incorrect decision to a correct one) or negative (Table 5b, Figs 17 and 18). The direction of the switch was recorded as positive (more correct), negative (less correct), or no change, where participants did not change their minds. Data was again analysed collectively for the two groups (students and radiographers), as the number of decision points varied across participants. The direction of the switch was determined for three comparisons: (i) pre- and post-heatmap (the impact of the heatmap only), (ii) pre-heatmap and post-binary feedback (the effect of all AI feedback), and (iii) post-heatmap and post-binary feedback (the effect of binary feedback only).

Automation bias

Automation bias was investigated by determining the negative impact of each type of feedback. The student group was more likely to change their mind to a more incorrect response following AI feedback. Figs 4 to 12 represent the impact of each type of AI feedback on the accuracy of participant interpretation. Additional analysis of the direction of the switch is given in Table 5b, obtained by subtracting the participants’ initial diagnostic accuracy from their final diagnostic accuracy. The AI feedback (i.e., heatmap and binary AI decision) proved beneficial to participants except where the AI was incorrect and, in the student group, in pathological examinations (decrease in accuracy of 3.4%). The negative impact of incorrect AI feedback was greater in the student group (9.6% decrease).

Table 5b. Decision switching before any AI feedback and after all AI feedback, reported as the % difference in participants’ diagnostic accuracy (i.e., the difference between diagnostic accuracy before AI feedback and after all AI feedback). Grey highlighted cells represent instances where the AI feedback had a net negative impact on diagnostic accuracy for the examination.

https://doi.org/10.1371/journal.pone.0322051.t005b
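The Table 5b computation is a simple subtraction, sketched below. The figures are illustrative placeholders (only the 9.6-point student decrease mirrors a value reported in the text); a negative change flags a net automation-bias effect.

```python
# Illustrative sketch of the net-impact calculation behind Table 5b:
# final accuracy after all AI feedback minus initial accuracy, in percentage points.
# The numbers are placeholders, NOT the study's full results.
def net_impact(initial_pct: float, final_pct: float) -> tuple[float, bool]:
    """Return (percentage-point change, True if the AI feedback was net negative)."""
    change = final_pct - initial_pct
    return change, change < 0

conditions = {
    "AI correct": (65.9, 72.4),
    "AI incorrect (students)": (60.0, 50.4),   # illustrative 9.6-point drop
}
for name, (before, after) in conditions.items():
    change, harmful = net_impact(before, after)
    print(f"{name}: {change:+.1f} pp{' (net negative)' if harmful else ''}")
```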

Trust analysis

Trust perception (0 representing no trust and 5 representing absolute trust) was gathered at several points during the study:

  • At the beginning of the study, when participants had no access to any of the images or AI feedback provided as part of this study.
  • Following exposure to all images, heatmap and binary feedback in each complete examination, i.e., three times per participant.
  • Finally, at the end of the study, when the participant had engaged with the full study, consisting of three complete examinations including all images and AI feedback contained therein.

(Table 6, Fig 19).

Table 6. Trust perception of student radiographers and radiographers at the beginning of the study (before accessing any AI feedback), following exposure to the AI, and at the end of the study, following exposure to all examinations and all associated AI feedback.

https://doi.org/10.1371/journal.pone.0322051.t006

Fig 19. Students’ and radiographers’ trust perception before, during and after AI feedback.

https://doi.org/10.1371/journal.pone.0322051.g019

This data was analysed using descriptive statistics: firstly, to determine differences in the mean trust of all participants at the beginning, following each examination and again at the end of the study; and secondly, through sub-analysis of the two groups, students and radiographers.

Initial mean trust is lower for the radiographer group than the student group (mean=4.1, n=54, SD 0.9; mean=3.9, n=40, SD 1.1, for students and radiographers respectively). Trust at the end of the study, compared to the beginning, decreased in both groups (mean=3.4, n=44, SD 1.3; mean=3.2, n=34, SD 1.1, a decrease of 0.7 for both students and radiographers respectively). Overall, mean trust is higher in the student group than the radiographer group during the image assessments, i.e., when asked after each heatmap and each AI binary feedback (3.5, n=142, SD=0.6; 3.0, n=101, SD=0.7 for students and radiographers respectively).

Accuracy

The level of diagnostic accuracy was lower than expected overall and does not compare well with performance reported in recent literature [29,30]. Participants accessed the study on personal devices, which would not permit the optimal viewing conditions available in the clinical setting. The images were presented one at a time, which, although reflecting how images are acquired in the clinical setting, did not permit the participant to revisit a previous image until the end of the examination. This is unlike the clinical scenario, where the radiographer would refer to all images when making a decision. This was intentional, to glean insight into the impact of the different forms of feedback offered.

There was no statistically significant difference in diagnostic accuracy between students and radiographers in this study, although radiographers were more accurate in their diagnoses across all conditions (AI correct/incorrect, pathology/no pathology). Each examination was different, and the difficulty of diagnosis may have had an impact on the relatively low level of accuracy in places. In this study, radiographers and students were grouped together irrespective of the amount of clinical experience they had. This may in part explain these findings, as newly qualified radiographers may have diagnostic accuracy similar to that of final-year students. Supporting this, amongst the participants in this study there is greatest representation in the ‘greater than or equal to six, but less than 11 years’ experience’ group (Table 1). Other studies have investigated the impact of computer feedback on user performance but have further categorised the experience level of participants. Goddard et al. (2014) [23] investigated computer-assisted decision support in medicines prescribing and found that automated decision support improved the accuracy of all participants, independent of experience. The accuracy of the AI also did not seem to have an impact on either students or radiographers: on all occasions (inaccurate and accurate AI feedback), the heatmap caused a decrease in accuracy before presentation of the binary AI decision rectified the loss in performance, often surpassing the initial accuracy. The accuracy of the AI (correct or incorrect) did not affect this increase in performance.

The necessity for visual forms of AI explainability has been mooted by clinical professionals [31,32]. Opinion is shifting from explainability being central to the successful adoption of AI to some questioning its value [33]. A recent study [31] investigated the agreement between the area of pathology and the area identified by a number of different types of AI heatmap. The study found that some forms of heatmap (GradCAM) broadly matched the area identified by human experts, but noted that all heatmaps tested were ‘coarse’ and lacking in detail. The authors concluded that the heatmaps tested were not yet precise enough to be relied upon for diagnostic assistance or explainability. This may explain why the heatmap caused some degree of confusion in this study, even in instances where the binary diagnosis from the AI was correct. This study supports the recognition that any form of explainability should be treated carefully, and that the impact of differing forms of AI explainability should be carefully researched before clinical adoption. This study only investigated the heatmap form of visual explainability; further study should therefore investigate whether different visual representations of the focus of the AI might be of better use in this technologically proficient profession. Previous work by Rainey et al. [20,34,35] has reported that the preference of this population (radiographers) may be for the AI to provide data relating to the accuracy of the system being used and a degree of confidence of the system in making its diagnosis. This is supported here experimentally by the increase in accuracy across all conditions when participants were provided with the binary diagnosis, including the % confidence of the system in its decision. The exception to this benefit was when the AI was incorrect and, within the student group only, in pathological cases; however, this decrease in accuracy was small (-3.4%).

The reason for the increase in accuracy following the binary diagnosis is not immediately clear, although it may be related to the timing of the AI feedback, with the binary diagnosis provided, by necessity, at the end of the examination when the participant had viewed all images. This may also be the case in the clinical situation. A study by Gaube et al. (2021) [36] found no difference in participants’ (radiologists and non-expert physicians) tendency to follow advice whether from a human or an AI source, despite participants indicating a preference for human-derived decision support. This was found to encourage confirmation and anchoring biases, which may be due to the discursive nature of true human-to-human interactions that exist organically in the clinical setting. The user should, therefore, be encouraged to seek the advice of a decision support tool rather than have it presented automatically, potentially reducing cognitive and automation biases.

Decision switching

As noted above, in general, exposure to the heatmap caused the diagnostic accuracy of the participants to fall, then increase again when presented with the binary AI diagnosis and % confidence of the system. This indicates that participants made a negative decision switch when presented with the heatmap. Automation bias has been defined by Goddard et al. (2014) and Bond et al. (2018) [23,24] as the ‘changing of mind’ to a less correct response as a result of computer intervention. This would not be a problem in a perfect system, where the AI is always correct, since any change of decision would be positive; however, even the best systems in use today are less than 100% accurate and may have inherent biases of which the user should be mindful.

As expected, there was a greater propensity for the study participants to change their mind in a positive direction following AI feedback when the model was correct. This finding is not fully supported in other studies, where the accuracy of the AI feedback provided was not related to the propensity of the user to follow the advice given [36]. Goddard et al. (2014) [23] found that experienced users were less likely to change their mind from their initial decision. This may mean that they are less likely to gain advantage from the use of the system; however, it was not possible to elicit this detail here due to the broad experience ranges used to classify the experience groups in this study. The radiographers were, nonetheless, less likely to change their mind following the presentation of either type of AI feedback across all conditions, and the students were more likely to reconsider their initial decision (Table 5a, Figs 17 and 18). Interestingly, the radiographers in this study benefitted more than the students from the AI feedback, with the greatest net change in accuracy in the ‘AI correct’ and ‘pathological’ conditions (Table 5b). This may be useful in radiography, where radiographer reporting results in high diagnostic accuracies [30,37], although there may be a greater propensity to underdiagnose pathology (‘false negatives’) [29].

Automation bias

In most conditions there was a positive impact from the AI feedback despite the poor performance of the model. The heatmaps were more likely to cause the user to be unsure of their diagnosis but, overall, the net effect of the AI feedback on diagnostic accuracy was positive in both student and radiographer groups. The exception was where the AI was incorrect, in which case the AI had a negative impact on participants’ accuracy. A greater impact was seen in the student group, where accuracy fell by 9.6% compared with 4.3% amongst radiographers, suggesting that, in this sample, the student group was more susceptible to automation bias (Table 5b). This has been found in other studies, where the prevalence and likelihood of automation bias and decision switching are greater in less experienced clinicians [23,24].

Trust

As reported in other studies, the qualified radiographers had a lower level of trust in AI than the students. This may cause them to become anchored to their initial decision [24,36]. It could be assumed that those in the radiographer group were, on average, older than those in the student group. Generation Z (born mid 1990s – mid 2010s) are more likely to trust technology, but are also more likely to recognise the potentials and pitfalls of the technology they are using [20]. This age group is also more likely to expect computer assistance in many avenues of their life, with work being no exception [38]. Both groups’ trust perception in AI systems fell following participation in the study, perhaps indicating that they were able to detect that the AI, or some aspects of it, were inaccurate. However, this is at odds with the increased accuracy reported above and with other studies indicating that even experienced clinicians are unable to detect inaccuracies in decision support systems [36].

For staff to develop appropriate trust in AI systems as used in clinical practice, it may be beneficial to have some degree of exposure to situations where the AI is incorrect. Example cases where the AI is not always correct, such as those reported here, may be useful. Users should be exposed to cases which highlight the potential weaknesses in the system in order to calibrate trust. Cases presented by equipment manufacturers/software developers may not include those where the AI performs poorly, and therefore appropriate trust cannot be calibrated by the user. The common benchmark of a 30/70 split of correct/incorrect cases has been shown to allow users of technology to determine appropriate trust, and to neither over- nor under-rely on the system [23,39]. This split should be considered when training new users of AI systems for clinical decision support.

Limitations

There was a relatively small number of participants interpreting each examination; however, this was intentional, to encourage participation within an acceptable time frame and reduce the within-study attrition rate. There were 21 examinations included in this study. This number was chosen to provide exposure across a range of examinations without having to specifically select examinations, which could have introduced bias.

Convenience sampling was used to recruit the maximum number of participants. However, this sampling method can mean that the results are not generalisable to the wider profession. Additionally, this means of sampling resulted in more students than radiographers participating in the study. The reason for this is not clear, although it should be noted that this may skew the findings. The ‘student’ and ‘radiographer’ groups were analysed separately in an attempt to partially mitigate this; however, future studies should adopt purposive sampling across a wide experience range.

There was a lack of granular analysis of the participants’ levels of experience. This was to allow a greater number of decision points for each interpretation. Furthermore, there was a lack of information gathered on the participants’ work experience/history and its duration, which may impact the findings. This should be investigated in future studies.

Examinations were not presented on high-quality ‘reporting monitors’ as would have been the case in the clinical environment; participants were able to access the study on any device of their choosing. This was due to the constraints of conducting an experimental study during the restrictions arising from the COVID-19 pandemic, and may explain why the participants’ diagnostic accuracies were lower than reported in the literature. Further study should consider determining the impact of the difficulty of the examination on the effect of AI feedback for participants from different experience levels and clinical backgrounds.

Conclusions

Radiographers’ and student radiographers’ accuracy in diagnosis can be improved with the use of AI, even with a poorly performing system. However, participants in this study tended to follow the diagnosis from the system, resulting in decreased accuracy in the diagnostic task in some cases. This indicates that more education should be provided to undergraduate radiographers and other clinicians undertaking radiographic image interpretation.

Appropriate trust should be reached through exposure to imperfect AI. Trust in this imperfect AI decreased following exposure to feedback from the system, indicating that the user was aware of its fallibility. Biases inherent in both the model and the user will exist and maximum benefit can be derived from acknowledgement of both.

AI will be beneficial, for example in diagnostic accuracy and workflow efficiency, when used appropriately in synergy with the clinician. This will be possible when the user can recognise cases where the AI is incorrect or not useful. Knowledge of the strengths and weaknesses of the system will allow the clinician to determine its appropriateness for each task.

Supporting information

S1 File. Characteristics of the AI performance, study transcript, and findings from ANOVA tests with post-hoc pairwise comparisons.

https://doi.org/10.1371/journal.pone.0322051.s001

(DOCX)

References

  1. NHS Digital. Health secretary: ambitious tech overhaul will make NHS most advanced health and care system in the world. 2019. Available at: https://digital.nhs.uk/news/2019/health-secretary-ambitious-tech-overhaul-will-make-nhs-most-advanced-health-and-care-system-in-the-world. Accessed 17th June 2022.
  2. NHS Long Term Plan. 2019. Available at: https://www.longtermplan.nhs.uk/ Accessed 24th November 2020.
  3. Tang Y-X, Tang Y-B, Peng Y, Yan K, Bagheri M, Redd BA, et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. NPJ Digit Med. 2020;3:70. pmid:32435698
  4. Qin C, Yao D, Shi Y, Song Z. Computer-aided detection in chest radiography based on artificial intelligence: a survey. BioMed. 2018;17(1):113. pmid:30134902
  5. Guan Y, Wang X, Li H, Zhang Z, Chen X, Siddiqui O, et al. Detecting asymmetric patterns and localizing cancers on mammograms. Patterns (N Y). 2020;1(7):100106. pmid:33073255
  6. Lodwick G, Keats TE, Dorst JP. The coding of roentgen images for computer analysis as applied to lung cancer. Radiology. 1963;81(2):185–200. Available at: https://pubs.rsna.org/doi/10.1148/81.2.185 Accessed 15th June 2019. pmid:14053755
  7. Fazal MI, Patel ME, Tye J, Gupta Y. The past, present and future role of artificial intelligence in imaging. Eur J Radiol. 2018;105:246–250. pmid:30017288
  8. NHS. The Topol Review. Health Education England. 2019. Available at: https://topol.hee.nhs.uk/ Accessed 5th May 2021.
  9. Greenspan H, San José Estépar R, Niessen WJ, Siegel E, Nielsen M. Position paper on COVID-19 imaging and AI: From the clinical needs and technological challenges to initial AI solutions at the lab and national level towards a new era for AI in healthcare. Medical Image Analysis. 2020;66:101800. pmid:32890777
  10. Lamb LR, Lehman CD, Gastounioti A, Conant EF, Bahl M. Artificial Intelligence (AI) for Screening Mammography. AJR. 2022;219(3):369–80. pmid:35018795
  11. Rainey C, McConnell J, Hughes C, Bond R, McFadden S. Artificial intelligence for diagnosis of fractures on plain radiographs: a scoping review of current literature. Intelligence-Based Medicine. 2021;5:100033.
  12. NHS. Diagnostic imaging dataset statistical release. 2020. Available at: https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2020/07/Provisional-Monthly-Diagnostic-Imaging-Dataset-Statistics-2020-07-23.pdf Accessed 2nd July 2022.
  13. Smith-Bindman R, Kwan ML, Marlow EC, Theis MK, Bolch W, Cheng SY, et al. Trends in use of medical imaging in US health care systems and in Ontario, Canada, 2000-2016. JAMA. 2019;322(9):843–856. pmid:31479136
  14. Olczak J, Fahlberg N, Maki A, Razavian AS, Jilert A, Stark A, et al. Artificial intelligence for analyzing orthopedic trauma radiographs: deep learning algorithms—are they on par with humans for diagnosing fractures?. Acta Orthopaedica. 2017;581(6):581–6. pmid:28681679
  15. Sim Y, Chung MJ, Kotter E, Yune S, Kim M, Do S, et al. Deep convolutional neural network–based software improves radiologist detection of malignant lung nodules on chest radiographs. Radiol. 2020;294(1):199–209.
  16. Mawatari T, Hayashida Y, Katsuragawa S, Yoshimatsu Y, Hamamura T, Anai K, et al. The effect of deep convolutional neural networks on radiologists’ performance in the detection of hip fractures on digital pelvic radiographs. Eur J Radiol. 2020;130:109188. pmid:32721827
  17. Liu P, et al. Artificial intelligence to detect the femoral intertrochanteric fracture: The arrival of the intelligent-medicine era. Front Bioeng Biotechnol. 2022;6:927926. pmid:36147533
  18. Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digit Med. 2020;3:17. pmid:32047862
  19. Geis JR, Brady AP, Wu CC, Spencer J, Ranschaert E, Jaremko JL, et al. Ethics of artificial intelligence in radiology: a summary of the joint European and North American multi-society statement. J Am College Radiol. 2019;293(2):1–6. pmid:31573399
  20. Rainey C, O’Regan T, Matthew J, Skelton E, Woznitza N, Chu K-Y, et al. Beauty Is in the AI of the beholder: are we ready for the clinical integration of artificial intelligence in radiography? An exploratory analysis of perceived AI knowledge, skills, confidence, and education perspectives of UK radiographers. Front Digit Health. 2021;3:739327. pmid:34859245
  21. Kumar D, Wong A, Taylor GW. Explaining the unexplained: a Class-Enhanced Attentive Response (CLEAR) approach to understanding deep neural networks. 2018. Available at: https://ieeexplore.ieee.org/Xplore/home.jsp Accessed 10th August 2019.
  22. Blüthgen C. Detection and localization of distal radius fractures: Deep learning system versus radiologists. Eur J Radiol. 2020;126:108925.
  23. Goddard K, Roudsari A, Wyatt JC. Automation bias: empirical results assessing influencing factors. Int J Med Inform. 2014;83(5):368–375. pmid:24581700
  24. Bond RR, et al. Automation bias in medicine: The influence of automated diagnoses on interpreter accuracy and uncertainty when reading electrocardiograms. J Electrocardiol. 2018;51(6):S6–S11.
  25. The Royal College of Radiologists. Clinical radiology UK workforce census 2020. 2020. Available at: https://www.rcr.ac.uk/system/files/publication/field_publication_files/clinical-radiology-uk-workforce-census-2020-report.pdf Accessed 15th June 2021.
  26. The Society of Radiographers. Radiography census highlights staff bravery amid workforce shortages. 2021. Available at: https://www.sor.org/news/college-of-radiographers/radiography-census-highlights-staff-bravery-amid-w Accessed 20th February 2022.
  27. IBM SPSS statistical package for Windows, version 23, IBM Corporation, Armonk, New York, USA; 2019.
  28. Microsoft Corporation. Microsoft Excel. 2018. Available at: https://office.microsoft.com/excel
  29. Verrier W, Pittock LJ, Bodoceanu M, Piper K. Accuracy of radiographer preliminary clinical evaluation of skeletal trauma radiographs, in clinical practice at a district general hospital. Radiography. 2022;28(2):312–318. pmid:35012880
  30. Woznitza N, Ghimire B, Devaraj A, Janes SM, Piper K, Rowe S, et al. Impact of radiographer immediate reporting of X-rays of the chest from general practice on the lung cancer pathway (radioX): a randomised controlled trial. Thorax. 2022. pmid:36351688
  31. Saporta A, et al. Benchmarking saliency methods for chest X-ray interpretation. Nat Mach Intell. 2022;4:867–878.
  32. Zhang Y, Weng Y, Lund J. Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics. 2022;12(2):237. pmid:35204328
  33. Kitamura FC, Marques O. Trustworthiness of artificial intelligence models in radiology and the role of explainability. American College of Radiology. 2021. pmid:33676912
  34. Rainey C, O’Regan T, Matthew J, Skelton E, Woznitza N, Chu K-Y, et al. An insight into the current perceptions of UK radiographers on the future impact of AI on the profession: a cross-sectional survey. J Med Imag Radiation Sci. 2022;53(3): pmid:35715359
  35. Rainey C, O’Regan T, Matthew J, Skelton E, Woznitza N, Chu K-Y, et al. UK reporting radiographers’ perceptions of AI in radiographic image interpretation – current perspectives and future developments. Radiography. 2022;28(4):881–8. pmid:35780627
  36. Gaube S, Suresh H, Raue M, Merritt A, Berkowitz SJ, Lermer E, et al. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ Digit Med. 2021;4(1):31. pmid:33608629
  37. Culpan G, Culpan A-M, Docherty P, Denton E. Radiographer reporting: a literature review to support cancer workforce planning in England. Radiography. 2019;25(2):155–163. pmid:30955689
  38. Advanced. The digital natives report. 2019. Available at: https://www.oneadvanced.com/trends-report/digital-natives-report-2019-2020/ Accessed 29th June 2021.
  39. Moray N, Inagaki T, Itoh M. Adaptive automation, trust, and self-confidence in fault management of time-critical tasks. J Exp Psychol Appl. 2000;6(1):44–58. pmid:10937311
  40. Rajpurkar P, et al. MURA: large dataset for abnormality detection in musculoskeletal radiographs. 2018. Available at: https://arxiv.org/abs/1712.06957 Accessed 15th May 2020.
  41. McConnell J, Devaney C, Gordon M. Queensland radiographer clinical descriptions of adult appendicular musculo-skeletal trauma following a condensed education programme. Radiography. 2013;19(1):48–55.
  42. Brealey S, King D, Warnock N. Methodological standards in radiographer plain film reading performance studies. British J Radiol. 2002;75:107–113.
  43. Revilla M, Ochoa C. Ideal and maximum length for a web survey. Int J Market Res. 2017;59(5):557–565.
  44. Obuchowski NA. How many observers are needed in clinical studies of medical imaging?. AJR Am J Roentgenol. 2004;182(4):867–9. pmid:15039154
  45. Allyn and Bacon. Participants, subjects and sampling. 2008. Available at: http://people.uncw.edu/caropresoe/EDN523/523_Spring_08_Spring_09/McM_Ch5-Rv.ppt. Accessed 28th June 2021.
  46. Field A. Discovering statistics using IBM SPSS statistics, 4th edn. London: Sage; 2013.
  47. Pallant J. SPSS Survival Manual, 3rd edn. Berkshire: Open University Press/McGraw-Hill; 2007.