Fig 1.
Overview of the topics explored in the paper.
Our paper explores the challenges of explaining image classifiers, especially when they are applied to data with biological variation. We train a convolutional network for grain defect detection and discuss the challenges and questions that arise when applying post-hoc explanation methods to this model. The choice of explainability method, for example, is crucial. We evaluate the quality of the explanations, perform an extensive analysis of several of these choices, and present results showing how large an impact each choice has.
Fig 2.
Examples of the images and human annotations.
Images of grains with pink fusarium disease (left) and skinned damage (right) with human annotations shown in green.
Fig 3.
Effect of image segmentation on an explanation.
LIME (Local Interpretable Model-Agnostic Explanations, [18]) explanations of the same image with two different segmentations.
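The dependence on segmentation can be illustrated without running LIME itself: aggregating the same per-pixel signal over two different "superpixel" layouts already yields different segment-level explanations. The sketch below is a simplified illustration (square segments, random relevance values); LIME's actual segmentation and perturbation procedure is more involved.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical per-pixel relevance map standing in for the model's signal.
relevance = rng.normal(size=(8, 8))

def segment_means(relmap, cell):
    # Average relevance within square "superpixels" of side `cell`.
    h, w = relmap.shape
    return relmap.reshape(h // cell, cell, w // cell, cell).mean(axis=(1, 3))

coarse = segment_means(relevance, 4)  # 2x2 grid of segments
fine = segment_means(relevance, 2)    # 4x4 grid of segments
```

The coarse and fine segmentations attribute relevance to different regions even though the underlying signal is identical, mirroring the effect shown in the figure.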
Fig 4.
Effect of hyperparameter choices on an explanation.
Heatmaps of three LRP (Layer-wise Relevance Propagation, [19,20]) methods used in this paper. These methods differ only in the propagation rules used for each type of layer. For details about the individual methods see Sect 3.3.
Fig 5.
Effect of channel pooling strategies on the appearance of an explanation.
Mean (left) and max (right) pooling of Gradients [25].
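The two pooling strategies can be sketched in a few lines. This is a minimal illustration, assuming a gradient explanation with one attribution value per channel per pixel; whether absolute values are taken first is a further design choice, assumed here.

```python
import numpy as np

# Hypothetical gradient explanation for one RGB image: shape (H, W, C).
rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 4, 3))

# Channel pooling collapses per-channel attributions into one heatmap:
# mean pooling averages over channels, max pooling keeps the largest
# per-pixel attribution.
abs_grad = np.abs(grad)
mean_pooled = abs_grad.mean(axis=-1)  # shape (H, W)
max_pooled = abs_grad.max(axis=-1)    # shape (H, W)
```

Max pooling emphasizes the strongest single-channel response, which tends to produce sparser, higher-contrast heatmaps than the mean.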
Fig 6.
Effect of normalization strategies on the appearance of an explanation.
LRP (Layer-wise Relevance Propagation, [19,20]) EpsilonGammaBox explanation with mean pooling and different normalization strategies, from left to right: the maximum of the specific explanation, the maximum over all explanations in the dataset, the 99th percentile of the specific explanation, and the 99th percentile over all explanations in the dataset.
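The four normalization strategies differ only in the reference value used as denominator. A minimal sketch, assuming pooled non-negative heatmaps and clipping to [0, 1] (the clipping is an assumption, not specified in the caption):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical pooled explanations for a small dataset: N maps of shape (H, W).
explanations = rng.exponential(size=(10, 8, 8))

def normalize(expl, denom):
    # Divide by the chosen reference value and clip to [0, 1].
    return np.clip(expl / denom, 0.0, 1.0)

e = explanations[0]
per_image_max = normalize(e, e.max())                       # max of this explanation
dataset_max   = normalize(e, explanations.max())            # max over the dataset
per_image_p99 = normalize(e, np.percentile(e, 99))          # 99th pct, this explanation
dataset_p99   = normalize(e, np.percentile(explanations, 99))  # 99th pct, dataset
```

Dataset-wide references make heatmaps comparable across images, while percentile-based references are less sensitive to single outlier pixels than the maximum.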
Fig 7.
Histograms of per-pixel values of each explainability method for one image.
From left to right: Layer-wise Relevance Propagation (LRP) [19,20] EpsilonAlpha2Beta1Flat, Deconvolution [23], and Local Interpretable Model-Agnostic Explanations (LIME) [18].
Fig 8.
Overview of the explainability pipeline and associated challenges in the context of real-world biological image classification.
The process begins with input data (e.g., grain images), which are used both to generate expert-provided human annotations (serving as approximate ground truth), and to train a neural network classifier. After inference, the model outputs categorical predictions (e.g., pink fusarium, skinned), which are then interpreted using post-hoc explainability methods. These explanations may vary significantly depending on the choice of method and hyperparameters, leading to further steps such as visualization (e.g., saliency maps) and aggregation (e.g., combining multiple explanation techniques). The final stage involves evaluation, which is broken into two core components: (1) metrics assessing explanation quality, either without access to ground truth (e.g., robustness, faithfulness, complexity) or by comparing to the human annotations; and (2) comparison, where explanation methods are ranked for each metric individually and combined into an overall ranking. Components outlined in red represent stages of the pipeline that are particularly prone to uncertainty or variation. This includes subjective human annotations, the sensitivity of post-hoc methods to tuning and input transformations, and the lack of consensus on appropriate evaluation metrics in the absence of reliable ground truth. These areas reflect key open challenges in the deployment of explainability methods on complex, noisy, and under-annotated biological datasets.
Table 1.
Number of images in each dataset used for training the models.
Fig 9.
Model architecture used in this paper.
“Conv2D, 7x7, ch=64, s=2” is a two-dimensional convolutional layer with a 7x7 filter, 64 channels, and a stride of 2; “BatchNorm2D” is two-dimensional batch normalization; “ReLU” (Rectified Linear Unit) is an activation function; “Max pooling, 3x3, s=2” is a max pooling layer with a 3x3 filter and a stride of 2; “Dense, 2” denotes a fully-connected layer with 2 neurons; and “Softmax” is an activation function.
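The spatial effect of the kernel-size and stride hyperparameters named above follows the standard convolution output-size formula. A minimal sketch, assuming a 224x224 input and common padding values (both are assumptions; the caption does not specify them):

```python
import math

def conv_out(size, kernel, stride, padding=0):
    # Standard formula for the spatial output size of a conv/pool layer.
    return math.floor((size + 2 * padding - kernel) / stride) + 1

# First block of the architecture as described in the caption.
h = conv_out(224, kernel=7, stride=2, padding=3)  # Conv2D, 7x7, ch=64, s=2
h = conv_out(h, kernel=3, stride=2, padding=1)    # Max pooling, 3x3, s=2
# h is the spatial side length after the initial conv + pooling block (56
# for a 224x224 input under these padding assumptions).
```

Each stride-2 layer halves the spatial resolution, so the 224x224 input is reduced to 56x56 after the first block.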
Table 2.
Overview of the metrics used for evaluation of explainability methods.
Table 3.
Final ordering of explainability methods.