Visual explanations for polyp detection: How medical doctors assess intrinsic versus extrinsic explanations

Deep learning has achieved immense success in computer vision and has the potential to help physicians analyze visual content for disease and other abnormalities. However, current deep learning models are very much black boxes, making medical professionals skeptical about integrating these methods into clinical practice. Several methods have been proposed to shed some light on these black boxes, but there is no consensus on how the medical doctors who will consume these explanations perceive them. This paper presents a study asking medical professionals about their opinion of current state-of-the-art explainable artificial intelligence methods when applied to a gastrointestinal disease detection use case. We compare two categories of explanation methods, intrinsic and extrinsic, and gauge the doctors' opinion of the current value of these explanations. The results indicate that intrinsic explanations are preferred and that physicians see value in the explanations. Based on the feedback collected in our study, future explanations of medical deep neural networks can be tailored to the needs and expectations of doctors. Hopefully, this will contribute to solving the issue of black box medical systems and lead to the successful implementation of this powerful technology in the clinic.


Introduction
Deep learning is becoming an increasingly popular method for analyzing medical data to perform tasks like lesion detection or disease classification. However, despite its prevalent use in medical research, deep learning is rarely deployed in a clinical setting [1]. Several factors make using deep learning-based systems in medicine problematic, like the potential legal ramifications of incorrect diagnoses or the presence of unintended biases against a specific race or gender. Many of these issues stem from a general lack of explainability and interpretability in the employed deep learning algorithms. Deep neural networks are complex statistical models that consist of millions, if not billions, of parameters, making it difficult to understand what reasoning lies behind a specific prediction. Explainable artificial intelligence (XAI) aims to solve the issue of explainability and interpretability by providing methods that explain the internal decision process of the neural network in a more digestible and understandable manner. Several XAI methods have been proposed, where SHapley Additive exPlanations (SHAP) [2] and saliency-based explanations like Gradient-weighted Class Activation Mapping (GradCAM) [3] are among the most popular techniques for image-based models. These methods provide an overlay that signifies which regions of the image contributed to the predicted output, making them relatively easy to understand for a non-technical audience. Several studies stress the importance of explanations that can be interpreted by non-tech-savvy users like medical doctors or clinicians so that they can better understand the underlying reasoning behind a prediction [4]. However, there is no consensus on which explanation methods are preferred or whether medical professionals actually find them useful. Similar studies have been done on the general population [5]; however, to the best of our knowledge, a study on the opinion of domain experts on XAI has yet to be conducted.
This paper presents a study gathering feedback from medical doctors regarding the current state-of-the-art XAI methods used to explain the predictions of computer-vision-based deep learning models. The study was conducted using automatic detection of colon polyps as a use-case, where a deep learning-based model is tasked with classifying images as either containing a polyp or not. Polyps are lesions within the bowel detectable as mucosal outgrowths. Polyps are flat, elevated, or pedunculated and are distinguished from normal mucosa by color and surface pattern. Most bowel polyps are harmless, but some have the potential to grow into cancer. Therefore, detection and removal of polyps are important to prevent the development of colorectal cancer. Since doctors may overlook polyps, automatic detection would most likely improve examination quality. In live endoscopy, information about the endoscope configuration helps determine the current localization of the endoscope tip (and thereby also the polyp site) within the length of the bowel. Automatic computer-aided detection of polyps would be valuable for diagnosis, assessment, and reporting, and is currently a very popular research area in medical artificial intelligence (AI) [6, 7]. Due to its timeliness and clear objective, we find that the gastrointestinal (GI) use-case makes a perfect case-study for evaluating XAI explanation methods for medical use-cases. Please note that this study only looks at the explanation of 2-dimensional visual prediction models.
The rest of this paper is organized as follows: In Section 2, we provide background information on the state-of-the-art explanation methods and their current state in the medical sciences, and define our research questions. In Section 3, we describe the process through which we have designed, implemented, and run a subjective user study involving 54 participants. In Section 4, we present the results of the qualitative user study. In Section 5, we discuss our findings and derive a number of generalizable insights regarding the applicability of XAI in medicine. We also put this in context with current work on XAI in the medical domain and its usefulness. Some argue that XAI is unimportant, or essential, but most of these works express personal opinions rather than conclusions based on proper studies; we want to check how their claims compare to our findings. In Section 6, we conclude the paper.

Background and related work
XAI is the sub-field of AI dedicated to explaining AI systems that are opaque or non-intuitive to humans. Different end-users have different needs for explanations, ranging from the developer who wishes to improve the system and ensure its robustness to the doctor using the system as decision support in the clinic and wanting to verify the veracity of the system's findings and potentially communicate this to the patient. Recent advances in the XAI literature almost exclusively concern methods designed to explain the behavior of complex machine learning (ML) models. The XAI literature is large and rapidly developing, and we do not attempt to give an overview here. For the sake of simplicity, we compare two categories of XAI methods, intrinsic and extrinsic explanations. Intrinsic explanations cover the methods that aim to explain a model's internals by analyzing the model weights, which mostly includes saliency-based methods. Extrinsic explanations aim to explain the model using external input, like SHAP or Local Interpretable Model-Agnostic Explanations (LIME). In the following, we give a brief explanation of intrinsic and extrinsic explanation methods, and explain which explanation method we use to represent each category.

Intrinsic explanations
As previously explained, intrinsic explanations aim to explain the predictions of a model by looking at the internal weights to provide some reasoning behind a specific output; this is the most common approach for explaining deep neural networks. There is a large variety of such methods available, including [8, 3, 9, 10, 11, 12, 13, 14], and it is not obvious which, if any, method is superior. In this study, we use GradCAM, which is arguably the most popular method for intrinsic visual explanations. Moreover, GradCAM has passed several sanity checks, as opposed to other popular intrinsic explanation methods [15]. GradCAM highlights the important parts of the image for a predicted class based on the activated neurons in a specific layer of a neural network model. First, the gradients for the class are computed with respect to the feature maps of a layer in the model. The weights that are important for predicting the selected class are then obtained in order to identify which parts of the image contribute to the prediction, which can then be mapped back to the input. The heat maps follow the standard Jet color mapping, a gradient from red through green to blue, where red indicates the most important areas of the image, yellow indicates less important areas, and blue marks the least important areas. Examples from the study can be seen in row 2 of Figure 1. The user selects which layer in the model to extract the heat maps from. Usually, later convolutional layers, i.e., layers that are close to the output layer, are helpful in order to highlight higher-level details. Moreover, the heat maps will depend on the model architecture, since this affects the activations for the selected layer. Consequently, many different heat maps can be generated for the same image. This means that some heat maps might be regarded as useful by the users, while others might not. An advantage of GradCAM is that it visually explains the inner workings of the neural network model, making it easier to understand.
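The two GradCAM steps described above (class gradients with respect to a layer's feature maps, followed by a weighted, ReLU-ed sum mapped back to the input) can be sketched in a few lines of PyTorch. The tiny CNN below is purely illustrative and stands in for the actual ResNet model used in the study; all names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-class CNN standing in for the ResNet used in the paper.
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, 2)

    def forward(self, x):
        fmaps = self.features(x)                      # feature maps of the chosen layer
        logits = self.fc(self.pool(fmaps).flatten(1))
        return logits, fmaps

def grad_cam(model, image, target_class):
    logits, fmaps = model(image)
    fmaps.retain_grad()                               # keep gradients for the feature maps
    logits[0, target_class].backward()                # step 1: class gradients w.r.t. fmaps
    weights = fmaps.grad.mean(dim=(2, 3), keepdim=True)       # per-channel importance
    cam = F.relu((weights * fmaps).sum(dim=1, keepdim=True))  # step 2: weighted sum
    cam = F.interpolate(cam, size=image.shape[2:],            # map back to input size
                        mode="bilinear", align_corners=False)
    cam = cam - cam.min()
    return (cam / (cam.max() + 1e-8)).squeeze().detach()      # normalize to [0, 1]

model = TinyCNN().eval()
img = torch.rand(1, 3, 64, 64)
heatmap = grad_cam(model, img, target_class=1)
```

The normalized `heatmap` can then be rendered with a Jet colormap and overlaid on the input image, as in row 2 of Figure 1.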

Extrinsic explanations
Extrinsic explanation methods, which also fall into the categories referred to as model-agnostic and model-independent, treat the model as a phenomenon and present information about its emergent behavior. In the case of image classification, the presentation is visually similar to the aforementioned class of methods, i.e., a heat map superimposed on the image, but the underlying mechanism involves occlusion [16] and perturbation of image segments, and significantly fewer methods of this kind are available. The SHAP [2] library is widely used for explaining ML models and is popular for its solid theoretical basis in game theory. The name of the SHAP package [2] is an acronym for SHapley Additive exPlanations, and the method is based on the Shapley decomposition, a solution concept from cooperative game theory. SHAP simulates feature absence by sampling from a background dataset. The resulting SHAP value therefore indicates how much the value of a feature causes the model's prediction to move away from the average prediction across the data. For images, systematically removing every individual pixel is infeasible, so SHAP instead groups pixels based on their relative characteristics. For this, SHAP uses an external computer-vision-based segmentation algorithm [17]. The coarseness of the final SHAP heat map is user-adjusted, and smaller pixel groups require more computational resources. To summarize, SHAP, when applied to images, produces a heat map indicating which parts of the image support and oppose a classification relative to a chosen background dataset. The built-in color scheme, also used in our study, uses shades of pink and blue superimposed on the image to indicate that a region supports (pink) or opposes (blue) the model prediction. Examples from the study can be seen in row 3 of Figure 1.
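The core idea, attributing a prediction shift to pixel groups via the Shapley decomposition, can be illustrated without the SHAP library itself. The sketch below uses a toy scoring function and four hand-made "superpixels"; it computes exact Shapley values over all orderings, which is feasible only because the number of segments is tiny (the real SHAP library samples coalitions instead). All names and the toy model are our own illustrative assumptions.

```python
import itertools
import math
import numpy as np

# Toy stand-in "model": scores an 8x8 image by the mean intensity of a fixed
# central region, mimicking a classifier that reacts to a bright lesion.
def model_score(image):
    return float(image[2:6, 2:6].mean())

# Four hand-made quadrant "superpixels", standing in for the external
# segmentation algorithm that SHAP delegates to for images.
def quadrants(shape):
    seg = np.zeros(shape, dtype=int)
    h, w = shape[0] // 2, shape[1] // 2
    seg[h:, :w], seg[:h, w:], seg[h:, w:] = 1, 2, 3
    return seg

def shapley_values(image, background):
    seg = quadrants(image.shape)
    players = list(np.unique(seg))
    values = np.zeros(len(players))
    # Exact Shapley decomposition: average each segment's marginal
    # contribution over every ordering of the segments.
    for order in itertools.permutations(players):
        masked = background.copy()      # "absent" features come from background
        prev = model_score(masked)
        for p in order:
            masked[seg == p] = image[seg == p]   # reveal segment p
            cur = model_score(masked)
            values[p] += cur - prev
            prev = cur
    return values / math.factorial(len(players))

img = np.ones((8, 8))
bg = np.zeros((8, 8))
vals = shapley_values(img, bg)   # each quadrant contributes equally here
```

By construction, the values sum to the difference between the model's score on the image and on the background, which is exactly the additivity property that makes the SHAP heat maps interpretable as "support" (pink) versus "opposition" (blue).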

Study design
The primary goal of this study was to quantify the value of current state-of-the-art XAI methods from the perspective of medical doctors. Over the course of five months (September 2021 to February 2022), we sent out a survey invitation to a number of different medical doctors located in different parts of the world, together with a short video explaining the study. No compensation was given for taking part in the study. In this section, we describe the motivation and thought process behind building the study. This includes the development of the online survey, the implementation of the deep learning model used to generate the question cases, and the dataset used.

Survey
The survey was built using the open-source framework Huldra, a framework for collecting crowdsourced feedback on multimedia assets. The framework allows for the collection of participant responses in a storage bucket hosted in the cloud, from where they can be retrieved in real-time by survey organizers, using credentials, immediately after the first interaction of each participant. The survey consisted of four distinct parts: registration, orientation, case questionnaire, and feedback.
The first part of the survey asked participants to register with their name (optional), email (optional), country, academic degree(s), their field of expertise, and how many years they have been active in the field. The second part oriented the participants on what they could expect from the survey and gave some background information on the two explanation methods that would be compared during the survey. The main part of the survey consisted of 10 cases in which a model predicted whether an image from the GI tract contained a polyp. The prediction was shown together with the image, alongside the two explanation methods that support the prediction. Here, the participants were asked to select which of the two explanation methods they found most helpful. The cases were shuffled on a per-participant basis, meaning the order in which the cases were shown was not the same between two participants. As attention span may differ between the first and last case, we wanted to avoid any bias that could be introduced through the ordering of the cases. The last part of the survey contained a feedback form consisting of 14 questions (see Table 1) meant to derive a summary of the doctors' overall perception of the two explanation methods. The participants were also shown a summary of their previous answers, with the option to go back and review or change the selection for specific cases.

Table 1: The questions that were asked of the participants in the final feedback form. Please note that Explanation A refers to the intrinsic explanation and Explanation B to the extrinsic explanation.

Likert scale (1-10):
• Explanation (A) increased my understanding of the result.
• Explanation (B) increased my understanding of the result.
• Explanation (A) increased my trust in the AI model.
• Explanation (B) increased my trust in the AI model.
• I found the colors used to visualize explanation (A) to be appropriate.
• I found the colors used to visualize explanation (B) to be appropriate.
• Explanation (A) frequently highlighted the correct area in the image.
• Explanation (B) frequently highlighted the correct area in the image.
• It is important that an explanation accompanies a prediction.

Multiple choice:
• Do you prefer to have an explanation, or would you rather only know the prediction?
• Which type of explanation would be useful in clinical practice?
• Would you prefer that explanations for the predictions be shown during or after the procedure?

Free form:
• What do you think of explanation method (A)?
• What do you think of explanation method (B)?
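The per-participant shuffling described above can be implemented deterministically by seeding a private random generator with a participant identifier, so that each participant sees a fixed but typically distinct order. This is only a sketch of one possible approach, not the Huldra internals; all names are our own.

```python
import random

# Ten survey cases, mirroring the ten medical cases in the study.
CASES = [f"case_{i:02d}" for i in range(1, 11)]

def case_order(participant_id: str):
    """Return a case order that is stable for a given participant but
    typically differs between participants."""
    order = CASES.copy()
    random.Random(participant_id).shuffle(order)   # deterministic per id
    return order

order_a = case_order("participant-a")
```

Seeding per participant keeps the ordering reproducible for analysis while still spreading any position-related attention bias across cases.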

Dataset
The dataset used to sample the case images and train the deep neural network was Kvasir [18], an open GI dataset consisting of different findings from the GI tract. Kvasir consists of images annotated and verified by medical doctors (experienced endoscopists), including several classes showing anatomical landmarks, pathological findings, or endoscopic procedures in the GI tract, with hundreds of images for each class. The anatomical landmarks include the Z-line, pylorus, cecum, etc., while the pathological findings include esophagitis, polyps, ulcerative colitis, etc. In addition, several sets of images related to the removal of lesions are also provided, like dyed and lifted polyps and dyed resection margins. The dataset contains images with resolutions ranging from 720 × 576 up to 1920 × 1072 pixels. Some of the included classes of images have a green box in the lower-left corner that illustrates the position and configuration of the endoscope inside the bowel, obtained using an electromagnetic imaging system (ScopeGuide, Olympus Europe). Examples from the dataset can be seen in row 1 of Figure 1.

Implementation of explanation methods
The model used to classify the images and generate the explanations was a convolutional neural network (CNN) based on the ResNet [19] architecture, implemented in PyTorch and trained on a modified version of the aforementioned Kvasir [18] dataset. The dataset was modified to accommodate the use-case of distinguishing between images containing polyps and images of a clean colon. As for the explanation methods, we used Captum [20], provided by PyTorch, for the extrinsic explanations and an open implementation of GradCAM for the intrinsic explanations. The model was trained on what can be considered consumer-grade hardware, containing an Nvidia RTX 3090 GPU and an Intel i9 processor. The source code and more details on the implementation of the model used to generate the explanations can be found in our GitHub repository.

Figure 1: Five cases taken from the survey presented to each participant. The top row shows the image that was passed through the model to generate a prediction. The second row shows the intrinsic explanation method used to explain the prediction. The last row shows the extrinsic explanation method used to explain the prediction.

Survey results
The survey collected a total of 57 responses. Of these, 54 were used in the final analysis. Among the initial responses were a few non-medical workers, including AI specialists and marketers. As the primary motivation behind this study was to better understand the opinion of medical professionals working with AI, we decided to filter these out and only keep the responses of participants working in the medical field. Apart from non-medical participants, we also filtered out any incomplete submissions. In the end, the remaining participants came from eight different countries, with experience in the medical field ranging from just a few years to over 50 years. Figure 2 shows plots of the participant statistics in terms of active years in the field, obtained degree(s), and the country that the participants come from. The rest of this section is organized by question category, where we present a summary of the participants' responses for the explanation case questions, Likert questions, multiple-choice questions, and free-form questions.

Explanation case responses
To get a better understanding of the agreement between the different participants, we performed an inter-rater reliability test over all explanation case responses (see Table 1). Intra-class correlation (ICC) is one of the most common ways to investigate inter-rater reliability for ordinal variables [21]. An ICC value of 1 corresponds to perfect agreement. Values between 0.70 and 0.79 are regarded as fair, values between 0.80 and 0.89 as good, and values of 0.90 and above as excellent with respect to clinical relevance [22]. One of the strengths of ICC is that it takes the magnitude of disagreement between the raters into account, meaning that large disagreements result in lower ICC values than small disagreements [22].
The ICC was calculated for all the explanation cases in order to assess the level of agreement between the answers from the study participants. From Table 2, we see that the average measures ICC is 0.794 [0.559, 0.938], which means that the agreement is fair. The Fleiss' kappa value in Table 3 is, however, 0.049. This corresponds to poor agreement between the study participants. Note that the agreement metrics reflect the agreement regarding the intrinsic and extrinsic explanation methods. Poor agreement in this context means that the study participants do not prefer the same explanation method.
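As a concrete illustration of the second metric, Fleiss' kappa can be computed directly from a matrix of rating counts. The sketch below uses made-up toy counts, not the study data, and the function name is our own.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts,
    assuming the same number of raters per item."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts[0].sum()
    # Observed per-item agreement.
    p_item = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    # Expected chance agreement from the marginal category proportions.
    p_e = np.square(counts.sum(axis=0) / counts.sum()).sum()
    return (p_item.mean() - p_e) / (1 - p_e)

# Made-up toy counts: 4 cases rated by 5 raters, two categories
# (prefer intrinsic, prefer extrinsic).
ratings = [[4, 1], [2, 3], [3, 2], [3, 2]]
kappa = fleiss_kappa(ratings)   # close to zero: agreement near chance level
```

Values near zero, as observed in our study, mean the observed agreement is roughly what would be expected by chance, i.e., the raters do not systematically prefer the same explanation method.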

Likert scale responses
Figure 3 shows a collection of violin plots showcasing the answers collected from the Likert questions asked in the survey. Comparing the plots asking about increased trust in the model for each respective method (Figures 3a and 3b), we see that there generally seems to be more agreement that the intrinsic method induces more trust in the underlying model across all experience groups. This pattern continues when comparing the plots for understanding (Figures 3c and 3d), spatial relevance (Figures 3e and 3f), and color choices (Figures 3g and 3h).

Multiple-choice responses
According to the histogram plots in Figure 4a, the majority of participants preferred to see model explanations both during and after the procedure, and explanations were preferred over no explanations (see Figure 4b). From Figure 4c, we see that 30 participants answered that the intrinsic explanation method would be useful in clinical practice, 14 participants wanted both explanation methods, and 3 wanted the extrinsic explanation method. A further 3 participants answered that neither explanation method would be useful, and 4 answered that either one or the other method would be useful.

Free-form responses
The survey included three questions with free-form responses, i.e., parts of the form where the participants could formulate their responses and elaborate freely. The three questions were:

• Do you prefer to have an explanation, or would you rather only know the prediction?
• What do you think of explanation method A?
• What do you think of explanation method B?

We provide a summary of the responses to each question below.

Do you prefer to have an explanation, or would you rather only know the prediction?
One participant described explainability as a "reasonable and popular expectation of AI systems", but highlighted as a challenge that we humans, upon being presented with an explanation, "make human interpretations/assumptions around what the explanation indicates about the underlying AI process [although] these interpretations may not actually be an accurate reflection of what is actually happening", along with a reference to [23].
Another participant pointed out that "in clinical practice we do not have enough time to check the explanation during colonoscopy". A third contrasted clinical work with research, stating that they prefer no explanation during the former but an explanation during the latter.

What do you think of explanation method A?
As indicated by the Likert scale results, most respondents prefer the intrinsic explanation method, which was reflected in their free-form answers. Participants who had positive sentiments towards method A described it using terms such as "understandable", "user friendly", "helpful", and "intuitive".
We find that the following selection of free-form answers represents the main reasons participants liked explanation method A: "easy to distinguish the red (interesting) areas from the other areas which are not as important.", "Logical and a method that I have experienced with other examination modalities", "I find it visually easier to understand and it pinpoints exactly what it is reacting to so it is easy to double-check the data.", "It is simple and easy to understand which part of image should be focused on.". Our interpretation of these and similar responses is that the saliency map produced by the intrinsic explanation is visually intuitive and appealing, and is therefore possibly also used in other applications the participants may be familiar with.
On the other hand, the positive assessment of explanation method A might depend strongly upon the model prediction being correct. One participant stated that method A is "Easy to understand, the red part is mostly in the same spot as the lesion.", suggesting that the assessment would have been different had the model not identified the polyp. Another participant's answer, "It helped me identify important areas.", supports this notion.
Still, two participants wrote "I think it is the best one to help you focus on the area that the AI system has identified as a suspected polyp." and "The red colour doesn't show up where the polyp is (. . . ) so for someone used to identify polyps in colonoscopy [this doesn't indicate] that the program is able to really identify the polyp", indicating that domain experts could use this explanation method to evaluate the machine learning model.
Among the responses giving method A a negative evaluation, one participant described the method as "sensitive but less specific" and another stated that it is "Sometimes (. . . ) a bit intense and difficult to interpret". One participant complained about the accuracy of method A, stating that "I prefer method A over B, but the poor accuracy of the method makes the explanation method (A) annoying rather than helpful.". Finally, one participant stated that they "Did not get an explanation", which we interpret as alluding to the fact that highlighting which information goes into a decision is not sufficient to actually explain it.

What do you think of explanation method B?
As the Likert scale responses indicate, the study participants preferred the intrinsic explanation to the extrinsic explanation. Based on similar free-form answers from several participants, describing method B as "complicated", "confusing", "hard to interpret", "hard to understand", "doesn't feel natural", and "harder to grasp visually", we conclude that the extrinsic explanation's functionality of highlighting in which direction each collection of image pixels drives the prediction is counter-intuitive to domain experts not familiar with such a way of representing information. One participant also stated that "It is difficult to understand where should we pay attention to. There are several green spots in one image.". Even though this indicates that the extrinsic explanation is not preferred by domain experts, it does not mean that the method does not provide value.
Before participating in the study, the participants were given an introduction to each explanation method, but it seems that a short brief is not enough to become comfortable with the visualizations that the extrinsic explanation produces, as supported by one participant's statement "The image is messy. I understand the method as explained, but the method makes no sense to me.", and another's "(. . . ) hard for my brain to wrap itself around the red/green 'type of data in agreement or not' paradigm used for this explanation". Another stated that the method is "Harder to understand - however after a while you get a hang of it", indicating that more time spent contemplating the method or studying several examples could have a strong positive effect on the evaluation of the method by medical doctors. This notion is supported by participants stating that method B "helps trust the system as it is based on the data" and "Makes more sense in terms of how data is trained". The participants with positive sentiments towards explanation method B described it using terms such as "Interesting", "intriguing", and "more specific".
Further, it seems that the choice of color map as well as the superpixel size could be adapted to better suit the end-users, as some participants stated that method B has "not the best colours", is "confusing with the large amount of boxes not as pleasing to the eye", and one suggested that "colors should be opposite. Red for disease, green for healthy.". The latter response indicates that this particular participant had misunderstood the method, as the extrinsic method's coloring indicates agreement with the model prediction, not the label, and consequently that method B is not sufficiently intuitive, as discussed above. One participant also voiced concern regarding the aptness of this method for color-blind people. We have not taken this aspect into account in our study, but stress that, in general, any visualization method should abide by the principles of universal design, including color blindness accessibility.

Discussion
In general, the answers to the free-form questions align with the responses to the Likert and multiple-choice questions. Most doctors prefer intrinsic explanations as they more easily align with their expectations in terms of spatial relevance and visual presentation. The participants found the intrinsic explanations more intuitive and user-friendly, and found that the visualizations more closely matched their preconceived notions of what they expected the model would react to. Some specifically state that they prefer the intrinsic explanation because it more accurately highlights the lesion. The problem here is that the explanations are not there to detect objects in an image but rather to explain why a specific prediction was made. If the doctors expect the explanations to always align with the object in question, the explanations may hinder adoption and trust. As for the extrinsic explanations, several doctors were confused by the visualizations and found them hard to interpret, somewhat defeating the purpose of the explanation in the first place. Some referenced the choice of colors and that the superpixels were not pleasing to look at. At the same time, there seems to be a varying level of AI knowledge among the participants, with some mentioning that they prefer the extrinsic explanation method due to it providing more information about how the model was trained. Perhaps the superficial aspects of the explanation could be improved by involving potential end-users in the development process to tailor the explanations to fit their use-case and needs. By taking a human-centered approach to generating explanations of AI systems, the explanations may be regarded as more useful by the end-users [24]. As for explanations in general, the study participants preferred that explanations be provided together with the model predictions (Figure 4b), but what they regarded as the best explanation method varied between the participants. The human factor is important when developing model explanations [4]. What is regarded as a useful explanation by one person might not be so for another. Consequently, subjective preferences might help explain why the medical experts who participated in the study did not prefer the same type of model explanation.

Conclusion
This paper presents a study comparing intrinsic and extrinsic explanation methods from the perspective of medical doctors. The study was conducted using a GI use-case involving explanations of a machine learning model used to predict polyps in images. Study participants were gathered from different parts of the globe to complete a survey consisting of model predictions accompanied by two explanation methods for ten different medical cases. Our results show that the intrinsic explanations are preferred. However, the free-form responses in our survey strongly suggest that the underlying reason for the doctors' preference for this method may be more superficial than an actual understanding of what information the different explanations convey. This suggests that a certain level of training or practice is required for doctors to fully exploit the usefulness of ML model explanations, although we might naïvely expect all image-based explanations to be sufficiently intuitive to be useful without prior training. We highlight that any form of explanation targeted at non-technical end-users, such as doctors, must be developed with the end-user in mind, ideally also involving the end-user. This includes abiding by the principles of universal design in order to accommodate specific needs. To conclude, medical doctors recognize the usefulness of visual explanations for deep learning-based computer-vision models, but a limited understanding of how an explanation functions and the reasoning behind it may lead to unwarranted judgements based on the wrong principles.
Figure 2: Plots presenting some statistics about the participants included in the study: (a) number of active years in the medical field, (b) the degree(s) obtained by the participants, and (c) the country that each participant comes from.

Figure 3: A collection of violin plots that presents an overview of the responses collected from the Likert questions. The answers are grouped by the number of years the person has been active in the medical field.

Figure 4: Histograms of the responses to the multiple-choice questions.