CAManim: Animating end-to-end network activation maps

Deep neural networks have been widely adopted in numerous domains due to their high performance and accessibility to developers and application-specific end-users. Fundamental to image-based applications is the development of Convolutional Neural Networks (CNNs), which possess the ability to automatically extract features from data. However, comprehending these complex models and their learned representations, which typically comprise millions of parameters and numerous layers, remains a challenge for both developers and end-users. This challenge arises due to the absence of interpretable and transparent tools to make sense of black-box models. There exists a growing body of Explainable Artificial Intelligence (XAI) literature, including a collection of methods denoted Class Activation Maps (CAMs), that seek to demystify what representations the model learns from the data, how it informs a given prediction, and why it, at times, performs poorly in certain tasks. We propose a novel XAI visualization method denoted CAManim that seeks to simultaneously broaden and focus end-user understanding of CNN predictions by animating the CAM-based network activation maps through all layers, effectively depicting from end-to-end how a model progressively arrives at the final layer activation. Herein, we demonstrate that CAManim works with any CAM-based method and various CNN architectures. Beyond qualitative model assessments, we additionally propose a novel quantitative assessment that expands upon the Remove and Debias (ROAD) metric, pairing the qualitative end-to-end network visual explanations assessment with our novel quantitative “yellow brick ROAD” assessment (ybROAD). This builds upon prior research to address the increasing demand for interpretable, robust, and transparent model assessment methodology, ultimately improving an end-user’s trust in a given model’s predictions. Examples and source code can be found at: https://omni-ml.github.io/pytorch-grad-cam-anim/.


Introduction
The popularization of deep learning has led to the rapid adoption of these methodologies in disparate fields of scientific research. Convolutional Neural Networks (CNNs) are a class of deep learning models that use convolutions to extract image features, achieving high performance in numerous computer vision applications [1]. However, due to the intrinsic network structure and the complexity of features leveraged for model predictions, CNNs are often labeled as uninterpretable or 'black-box' models. Interpretability is crucial for applications in high-criticality fields such as medicine [2], where model decisions have the potential to cause excessive harm if incorrect. In order to be deployed, models must be trustworthy both in their class predictions and in the features used to make those predictions. Therefore, there is a definitive impetus to develop trustworthy explanations of model decisions.
Presently, there exists extensive literature on the use of state-of-the-art deep learning methodologies within healthcare systems and applications. Indeed, there exist entire subfields of computer science and biomedical engineering on computational medicine and medical image analysis. Notable examples from the literature include online medical pre-diagnosis systems [3], 3D deep learning on medical images [4], the development of medical transformers for chest x-ray diagnosis [5], and an emergent trend to adopt generative methods in these high-criticality fields (e.g. GPT-3 as a data generator for medical dialogue summarization [6]). With the emergence of large language models (LLMs) such as the GPT-3 and GPT-4 models developed by OpenAI and made broadly available through the ChatGPT platform, early adopters are actively promoting the transformative opportunities of these AI systems within the healthcare space [7] while others issue active calls for caution in their use [8]. Fundamentally, it is paramount to develop increasingly transparent methods to assist medical practitioners in their use of, critical oversight of, and reliance upon deep learning models.
There have been numerous methods proposed to improve the interpretability of CNNs. Zeiler and Fergus initially investigated network interpretability by using a deconvolutional network to identify pixels activated in CNN feature maps [9]. Thereafter, gradient-based methods were used to develop saliency maps indicating important image regions based on the desired output class [10][11][12]. Class Activation Maps (CAMs) are a group of methods that linearly combine weighted feature activation maps from a given CNN layer [13][14][15][16][17][18][19][20][21][22]. Typically, only the final layer(s) are visualized to confer trustworthiness and describe what image features are used for model predictions. However, this provides little detail on the learning process of the model. In addition, selecting the correct final layer to visualize from each CNN model is not straightforward and is often done arbitrarily.
To better interpret how a given model evaluates a given image through each of its layers, we propose expanding these existing Explainable Artificial Intelligence (XAI) methodologies by individually visualizing and analyzing the model's layer-wise activation maps. In a natural extension of this idea, these layer-wise activation maps can be combined as individual frames of a video animating the end-to-end network activation maps; a method we propose in this article and denote CAManim. We develop local and global normalization to understand learned network features on a layer-wise (local perspective) and network-wise scale (global perspective). We experiment with and quantify the layer-wise performance of CAManim across numerous CNN models and CAM variations to show performance in a variety of experimental conditions, including medical applications wherein model understanding and trustworthiness are critical.
Our contributions are as follows:
• We propose CAManim, a novel visualization method that creates activation maps for each layer in a given CNN. CAManim can be applied to any existing CNN and CAM for any classification task.
• We introduce local and global normalization to understand important learned features at both a layer-wise and network-wise level.
• We perform extensive experimentation to determine the expected time and space requirements to run CAManim.
• We demonstrate the usefulness of CAManim across multiple CAM variations and CNN models, and in high-criticality fields.
• We quantitatively evaluate the performance of each CAM visualization generated per model layer with an analytical process denoted "yellow brick ROAD" (ybROAD) that seeks to improve the understanding of how CNNs learn. This is further extended to selecting the most accurate feature map representation from all possible layers of a CNN.

Related Work
The topic of explainable and trustworthy AI has been researched extensively. Lipton et al. [23] described the importance of trustworthy and interpretable models, while Ribeiro et al. [24] conducted human-based trials to quantify users' degree of trust in classifier predictions. Computationally, numerous methods have investigated the improvement of CNN interpretation. In this section, we provide an overview of proposed methods and how CAManim addresses a gap in the current literature.
Earliest Explainable AI Studies: One of the earliest efforts to interpret CNNs was made by Zeiler and Fergus [9]. In this study, feature maps from convolutional layers are used as input to a deconvolutional network to identify activated pixels in the original image space. Simonyan et al. [10] approached network explainability in two ways. First, they proposed class models, which are images generated through gradient ascent that maximize the score for a given class [10]. Next, they produced class-specific saliency maps, calculated using the gradient of the input image with respect to a given class [10].
Guided Backpropagation and Gradient-Based Methods: Springenberg et al. [11] extended Simonyan's work to Guided Backpropagation, which excludes all negative gradients to produce improved saliency maps. Despite calculating gradients with respect to individual classes, Selvaraju et al. showed that the visualizations produced by Guided Backpropagation are not class-discriminative (i.e. there is little difference between images generated using different class nodes) [14]. Sundararajan et al. [12] proposed integrated gradients, calculated through the integral of the gradient between a given image and a baseline, to satisfy the axioms of sensitivity and implementation invariance. FullGrad is another gradient-based method that is non-discriminative and uses the gradients of bias layers to produce saliency maps [25].
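For readers unfamiliar with integrated gradients, the idea can be illustrated on a toy differentiable function. The sketch below is a minimal midpoint-rule approximation of the path integral, not the reference implementation; the function name `integrated_gradients` and the closed-form toy gradient are illustrative assumptions.

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate integrated gradients of a scalar function.

    IG_i(x) = (x_i - b_i) * integral_0^1 df/dx_i(b + a*(x - b)) da,
    approximated here with a midpoint Riemann sum over `steps` points.
    """
    alphas = (np.arange(steps) + 0.5) / steps           # midpoint rule
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, d) points
    grads = np.stack([f_grad(p) for p in path])         # gradient at each point
    return (x - baseline) * grads.mean(axis=0)

# Toy example: f(x) = sum(x**2), so df/dx = 2x, and by the completeness
# axiom the attributions should sum to f(x) - f(baseline).
f = lambda x: np.sum(x ** 2)
f_grad = lambda x: 2 * x
x = np.array([1.0, 2.0])
baseline = np.zeros(2)
ig = integrated_gradients(f_grad, x, baseline, steps=200)
print(ig.sum(), f(x) - f(baseline))  # the two values nearly match
```

For a real CNN, `f_grad` would be replaced by a backward pass through the network with respect to the target class score.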
Gradient-Free Methods: While gradient-based methods are quite popular in the field of explainable AI, some studies argue that these methods produce noisy visualizations due to gradient saturation [26,27]. For this reason, gradient-free methods have been investigated by a number of studies. Zhou et al. [28] identified K images with the highest activation at a given neuron in a convolutional layer and occluded patches of each image to determine the object detected by the neuron. Morcos et al. [29] used an ablation analysis to remove individual neurons or feature maps from a CNN and quantify the effect on network performance. This study demonstrated that neurons with high class selectivity (i.e. highly activated for a single class) may indicate poor network generalizability. Zhou et al. [30] extended this work to show that ablating neurons with high class selectivity may cause large differences in individual class performance.
Class Activation Maps: A popular class of CNN visualizations are Class Activation Maps (CAMs), which produce explainable visualizations through a linearly weighted sum of feature maps at a given CNN layer [13]. The original CAM was proposed for a specific CNN model, consisting of convolutional, global average pooling, and dense layers at the end of the network [13]. The dense layer weights were used to determine the weighted importance of individual feature maps [13]. However, this required a specific CNN architecture and was not applicable to numerous high-performing models. This led to the development of CNN model-agnostic CAM methods.
Gradient-based methods were the first variation of the original CAM [14][15][16][17][18][19]. These methods determine importance weights by calculating averaged or element-wise gradients of the output of a class with respect to the feature maps at the desired layer. As discussed previously, gradient methods may produce noisy visualizations due to gradient saturation [20-22, 26, 27]; as a result, perturbation CAM methods have been proposed [20,21]. In this case, importance weights are calculated by perturbing the original input image by the feature maps and measuring the change in prediction score. In addition, non-discriminative approaches have been investigated to eliminate the reliance of class-discriminative methods upon correct class predictions. For example, EigenCAM produces its CAM visualization using the principal components of the activation maps at the desired layer [22].
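The gradient-free EigenCAM idea can be sketched in a few lines: project a layer's feature maps onto their first principal component. The `eigen_cam` helper below is a simplified sketch under our own assumptions (the centring, ReLU, and rescaling details follow common practice, not necessarily the reference implementation):

```python
import numpy as np

def eigen_cam(activations):
    """Gradient-free CAM via the first principal component (EigenCAM-style).

    activations: array of shape (K, H, W) -- K feature maps from one layer.
    Returns an (H, W) saliency map: the projection of each pixel's K-vector
    onto the first principal component, followed by ReLU and [0, 1] rescaling.
    """
    k, h, w = activations.shape
    flat = activations.reshape(k, h * w).T   # (H*W, K): one row per pixel
    flat = flat - flat.mean(axis=0)          # centre before PCA
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    heat = (flat @ vt[0]).reshape(h, w)      # project onto 1st component
    heat = np.maximum(heat, 0)               # ReLU, as in most CAMs
    return heat / heat.max() if heat.max() > 0 else heat

rng = np.random.default_rng(0)
cam_map = eigen_cam(rng.standard_normal((8, 7, 7)))
print(cam_map.shape)  # (7, 7), values in [0, 1]
```

Because no class score enters the computation, the same map is produced for every target class, which is why EigenCAM is non-discriminative.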
While most studies have developed saliency map and/or CAM formulations for a single layer, LayerCAM demonstrated how aggregating feature maps from multiple layers can refine the final CAM visualization to include more fine-grained information [19,31]. Gildenblat extended this idea across multiple existing CAM and saliency map methods [17]. While conceptually similar, to the best of our knowledge, our study is the first to consider the individual feature maps generated from every CNN layer and combine them into an end-to-end network explanation. Moreover, this end-to-end layer-wise analysis enables a unique view of local and global perspectives and a natural integration of both qualitative and quantitative network-wide explainability. Figure 1 provides a conceptual overview of the CAManim method proposed in this work.

Materials and methods
In this section, we first recall the general formulation for Class Activation Maps and outline notation preliminaries. Next, we explain the generation of CAManim using CAM visualizations from each layer of a CNN, depicted in Figure 1. The concepts of global and local normalization are introduced, and the computational requirements of CAManim are described from large-scale experiments. Lastly, we define the quantitative performance metric for individual CAM visualizations, and propose ybROAD for analyzing end-to-end layer-wise CAManim.

Individual CAM Formulation
The general formulation for any CAM method consists of taking a linearly weighted sum of feature maps as follows: for a given input image x and CNN model f(·), a CAM visualization L is generated at layer l through the α-weighted summation of the k activation feature maps A:

$L_l = \sum_k \alpha_k A_l^k$

Class-discriminative CAM methods further define L per predicted class c, i.e. $L_l^c = \sum_k \alpha_k^c A_l^k$. To exclude negative activations, most CAM formulations follow this sum with a ReLU operation.
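The weighted-sum formulation above can be sketched directly in code. This is a generic illustration (the helper name `cam` and the display rescaling are our additions); the method-specific part of any CAM variant is solely how the weights α are obtained.

```python
import numpy as np

def cam(activations, weights, apply_relu=True):
    """Generic CAM: weighted sum of K feature maps, then optional ReLU.

    activations: (K, H, W) feature maps A_k from the chosen layer.
    weights:     (K,) importance weights alpha_k (method-specific, e.g.
                 averaged gradients for Grad-CAM, dense-layer weights for
                 the original CAM).
    """
    heatmap = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    if apply_relu:
        heatmap = np.maximum(heatmap, 0)
    # rescale to [0, 1] for display
    if heatmap.max() > heatmap.min():
        heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
    return heatmap

A = np.stack([np.eye(4), np.ones((4, 4))])  # two toy 4x4 feature maps
alpha = np.array([1.0, 0.5])
heat = cam(A, alpha)
print(heat.shape)  # (4, 4)
```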

End-to-End Layerwise Activation Maps
To formulate CAManim, CAM visualizations are first generated for every differentiable layer l within a given CNN with a total number of layers N:

$\{L_1^c, L_2^c, \ldots, L_N^c\}$

Each CAM visualization is subsequently saved as a PNG image $I_l$ and the frames are concatenated to create the final CAManim video:

$\mathrm{CAManim} = \big\Vert_{l=1}^{N} I_l$

For clarity, the concatenation operator, ∥, is defined in this work in a way analogous to the summation operator, Σ, and product operator, Π, to concisely express the sequential organization of individual frames into the resulting animated video.
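The per-layer concatenation step can be sketched as below. This is a minimal sketch under our own assumptions: we only assemble the per-layer heatmaps into an (N, H, W) frame sequence, whereas the full pipeline would colourise each frame, overlay it on the input image, save it as a PNG, and encode the video (e.g. with imageio/ffmpeg).

```python
import numpy as np

def camanim_frames(layer_cams):
    """Concatenate per-layer CAM heatmaps into one (N, H, W) frame sequence.

    layer_cams: iterable of (H, W) heatmaps, one per differentiable layer,
    in forward order; frame i of the output is the CAM of layer i.
    """
    frames = [np.asarray(c, dtype=np.float32) for c in layer_cams]
    assert len({f.shape for f in frames}) == 1, "all frames must share H x W"
    return np.stack(frames)

cams = [np.full((4, 4), i / 2.0) for i in range(3)]  # 3 mock layer CAMs
video = camanim_frames(cams)
print(video.shape)  # (3, 4, 4)
```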

Global-vs. Local-Level Normalization
For a model end-user to correctly interpret what importance the model attributes to particular pixels at a given layer in the network, they must be provided the appropriate context. To this end, the model interpreter may wish to know "what importance does the model place on particular pixels at a given layer?" or "what importance does the model place upon particular pixels overall?". Consequently, two normalization approaches can be leveraged, each with the intent of correctly relaying information to answer one of these two questions, and both complementary to the other. Thus, CAManim visualizes the CAM activations of each layer using two different types of normalization: Local-Level Normalization and Global-Level Normalization. Local normalization rescales each layer's activation map by that layer's own minimum and maximum activation values, whereas global normalization is performed using the minimum and maximum activation value across all layers of the network. Figure 2 shows the difference between global and local normalization for the first denseblock of DenseNet161 [32]. The global normalization (right) displays an attenuated version of the local normalization (left). This example demonstrates that the layer-wise information focuses upon learning small pattern-like features, whereas the network-wise information indicates that the activations of this layer are generally attenuated with respect to all other layers within the DenseNet161 model.
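The two normalization modes described above can be sketched as follows; the function name `normalize_layerwise` is our own illustrative choice, and the sketch assumes raw (pre-normalization) CAMs that all share the same spatial size.

```python
import numpy as np

def normalize_layerwise(layer_cams, mode="local"):
    """Rescale raw per-layer CAMs to [0, 1] locally or globally.

    mode="local":  each layer is scaled by its own min/max (what does this
                   layer emphasise?).
    mode="global": all layers share the network-wide min/max (how strong is
                   this layer relative to the whole network?).
    """
    cams = np.stack([np.asarray(c, dtype=np.float64) for c in layer_cams])
    if mode == "global":
        lo, hi = cams.min(), cams.max()
        return (cams - lo) / (hi - lo)
    lo = cams.min(axis=(1, 2), keepdims=True)   # per-layer minima
    hi = cams.max(axis=(1, 2), keepdims=True)   # per-layer maxima
    return (cams - lo) / (hi - lo)

raw = [np.array([[0., 1.], [2., 3.]]), np.array([[0., 10.], [20., 30.]])]
print(normalize_layerwise(raw, "local")[0].max())   # 1.0: every layer peaks at 1
print(normalize_layerwise(raw, "global")[0].max())  # 0.1: first layer attenuated
```

The global view makes the first toy layer look ten times weaker, mirroring the attenuation effect seen in Figure 2.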

Model-& CAM-Specific Interpretation and Computational Complexity
To appreciate how varying CNN architectures and CAM methods produce differing visual explanations for a given image x, target class c, and CNN model f(·), we ran large-scale experiments producing numerous layer-wise and globally normalized CAManim videos/image sequences. Consequently, this additionally enabled the benchmarking of key model-specific metrics such as the layer-level number of parameters and CAManim run-time.
Figure 3 illustrates the layer-wise parameter number along a log-scale where we can explicitly visualize the four general DenseBlocks comprised of a varying number of DenseLayers. Since CAManim computation will vary by layer number n, layer-wise parameters p, and CAM-specific compute runtime r, we generally estimate that CAManim will have a simplified computational time complexity of $O(n\bar{p}\bar{r})$. For clarity, the overbar notation expresses averages for the number of parameters and CAM-specific compute time, respectively. Image-specific dimensions will also impact runtime; however, given that the majority of models reshape their input to a consistent size, this constant factor may be subsumed within the estimate. To provide general estimates on the overall runtime for a given CNN and CAM, we tabulate these benchmarks in Table 1.
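A back-of-envelope version of this runtime model can be sketched as below. This is purely illustrative (the helper name and inputs are our assumptions): per the $O(n\bar{p}\bar{r})$ model, the per-layer CAM time r grows with the layer's parameter count p, and the total runtime is approximately the layer count times the average per-layer CAM time.

```python
def estimate_camanim_runtime(layer_params, layer_cam_seconds):
    """Back-of-envelope runtime model for CAManim.

    layer_params:      parameter count of each differentiable layer.
    layer_cam_seconds: measured (or assumed) per-layer CAM compute time, in
                       seconds; this is where the parameter count enters in
                       practice, since heavier layers take longer per CAM.
    Returns (n, mean params, mean seconds, estimated total seconds).
    """
    n = len(layer_params)
    p_mean = sum(layer_params) / n
    r_mean = sum(layer_cam_seconds) / n
    return n, p_mean, r_mean, n * r_mean  # total ~ n * average per-CAM time

n, p_bar, r_bar, total = estimate_camanim_runtime(
    [1000, 2000, 3000], [0.1, 0.2, 0.3])
print(n, total)  # 3 layers, ~0.6 s total
```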

Quantitative Evaluation
To quantitatively evaluate the performance of each CAM visualization and demonstrate the information gained through deeper layers in a CNN, we calculate the Remove and Debias (ROAD) score [33]. This metric has superior computational efficiency and prevents the data leakage found with other CAM performance metrics [33]. ROAD perturbs images through noisy linear imputations, blurring regions of the image based on neighbouring pixel values [33]. The confidence increase or decrease C in classification score using the perturbed image with the least relevant pixels (LRP) or most relevant pixels (MRP) removed is then used to evaluate the accuracy of a CAM visualization. Since the percentage of pixels perturbed affects the ROAD performance, we evaluate ROAD at p = 20%, 40%, 60% and 80% pixel perturbation thresholds. As proposed by Gildenblat [17], we combine the LRP and MRP scores for our final metric:

$\mathrm{ROAD} = \frac{1}{|P|} \sum_{p \in P} \frac{C_{\mathrm{LRP}}(p) - C_{\mathrm{MRP}}(p)}{2}$

A ROAD score is calculated for each CAM generated. Therefore, for N differentiable layers in a CNN, there will be N ROAD scores calculated within CAManim. Given that this network-wide sequence of ROAD values represents a journey-like traversal of the network, we denote this series of ROAD values as the 'yellow brick ROAD', or ybROAD for brevity:

$\mathrm{ybROAD} = (\mathrm{ROAD}_1, \mathrm{ROAD}_2, \ldots, \mathrm{ROAD}_N)$

The ybROAD scores can be used to analyze the performance of an experiment with a given image x, target class c, and CNN model f(·) over all layers of the network. Consequently, this analysis enables the quantitative identification of the CNN layer that maximally visualizes features with the largest impact on model performance through max(ybROAD). The mean(ybROAD) score is also calculated to summarize the overall model end-to-end ROAD performance. Interestingly, variant metrics derived from ybROAD values may yield new insights into the quantification of a model's ability to predict particular classes.
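The ybROAD computation can be sketched as follows, with mock classifier confidences standing in for actual model evaluations (the helper names and the toy numbers are our assumptions; the combined score follows the LRP-minus-MRP averaging described above):

```python
import numpy as np

def road_combined(conf_lrp, conf_mrp):
    """Combined ROAD score: average over perturbation thresholds of
    (confidence with least-relevant pixels removed minus confidence with
    most-relevant pixels removed) / 2. Higher is better: a faithful CAM
    survives LRP removal but collapses under MRP removal."""
    return float(np.mean((np.asarray(conf_lrp) - np.asarray(conf_mrp)) / 2))

def ybroad(per_layer_confidences):
    """ybROAD: one combined ROAD score per differentiable layer, plus the
    argmax layer and the mean score used to summarise the whole network."""
    scores = [road_combined(lrp, mrp) for lrp, mrp in per_layer_confidences]
    return scores, int(np.argmax(scores)), float(np.mean(scores))

# Mock confidences at p = 20/40/60/80% perturbation for a 3-layer network.
trials = [
    ([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.2]),  # layer 0
    ([0.9, 0.9, 0.8, 0.7], [0.2, 0.2, 0.1, 0.1]),  # layer 1: most faithful
    ([0.8, 0.7, 0.6, 0.5], [0.4, 0.4, 0.3, 0.3]),  # layer 2
]
scores, best_layer, mean_score = ybroad(trials)
print(best_layer)  # 1: max(ybROAD) selects this layer's CAM
```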

Results & Discussion
In this section, we first define the pre-trained models and datasets used to evaluate CAManim. Next, we demonstrate CAManim in high-criticality fields using a ResNet50 model fine-tuned to perform breast cancer classification. We then show example visualizations from CAManim for 10 different CAM variations and discuss abnormal visualizations. Lastly, we discuss the ybROAD performance of CAManim and future directions building upon this work.

Pre-trained Models and Datasets
To evaluate CAManim, we use models from PyTorch pre-trained on the 2012 ImageNet-1K dataset [34]. Specifically, results are shown for AlexNet [35], ConvNeXT [36], DenseNet161 [32], EfficientNet-b7 [37], MaxViT-t [38], and SqueezeNet [39]. The CAManim videos for an additional 14 models and publicly available code can be found here: https://omni-ml.github.io/pytorch-grad-cam-anim/. All results in this study (apart from the high-criticality case study) leverage a popular image of a brown bear typically used in the CAM research community; following the emergent standard, the image is preprocessed by resizing to 224 × 224 and normalized. Next, we demonstrate the utility of CAManim in a high-criticality field.
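The normalization half of this preprocessing is sketched below with the channel statistics commonly used for ImageNet-pretrained torchvision models (resizing is elided; in practice both steps would be handled by torchvision transforms):

```python
import numpy as np

# Channel-wise ImageNet statistics commonly used with torchvision models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_imagenet(image):
    """image: (H, W, 3) float array in [0, 1] -> channel-normalised array,
    matching the standard (x - mean) / std preprocessing step."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD

img = np.full((224, 224, 3), 0.5)   # mock mid-grey 224 x 224 input
out = normalize_imagenet(img)
print(out.shape)
```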
Case Study: End-to-End BC-ResNet50 Visualization for Malignant Tumour Prediction

Pre-processing and training steps are selected based on MONAI recommendations. Following fine-tuning, CAManim is run with an example test image of the malignant class to visualize and interpret how the resultant CNN arrives at producing the correct prediction of malignant cancer. Figure 4 illustrates the layer-wise activations that BC-ResNet50 considers when determining the 'malignant' tumour.
It is important to emphasize that for high-criticality applications such as medical imagery, the initial resizing of input imagery can dramatically alter the information available to the model and impact both model output and its interpretability. While this work builds upon previous XAI literature and adopts their methodological approach, we recommend that for high-criticality applications, the input image size be kept closely aligned with the original image dimensions (with no or limited downsizing) so as not to degrade image resolution, thereby providing a more faithful visual explanation for clinical decision support.
As expected, Figures 5 & 6 depict early model layers as activating general patterns and edges, while middle and final layers progressively focus the activation maps on regions highly characteristic of the brown bear contained within. Such a layer-wise approach enables the pair-wise or multi-wise comparison of visual-explanation methods and of how these individual activation maps compare globally across all activation maps.

Layer-Type Visualization Issues
Certain differentiable layers may produce unanticipated CAM visualizations, as depicted in Figure 7. In these layers, images are compressed to 1-dimensional (1D) representations; consequently, 2D feature visualization of a non-convolutional layer is effectively meaningless. Instead, individual neurons that are highly activated show up as vertical or horizontal lines across the image. While these images are uninformative, they simply depict visualizations of 1D vectors and should be filtered out.

ybROAD Quantitative Evaluation

Figure 8 displays the ybROAD for 11 trials of generating CAManim for the bear image using ResNet152 (mean ybROAD: 0.204; max ybROAD: 0.473 at layer 402). Initially, the layer-wise ROAD performance is very high (∼0.4). At this point, the CNN layer is activating many small regions throughout the image; when each of these areas is perturbed, it is difficult to correctly classify the image, and the ROAD score increases.
As the network starts learning larger features, less of the bear image is perturbed, and the ROAD score decreases. Towards the end of the network, the ROAD score increases again and reaches its maximum value as the small activations are combined together to encapsulate the entire bear. This demonstrates how the ybROAD score can provide more information on how the network progressively learns. Interestingly, the layer-wise depiction of ROAD values may be used to investigate how various model layers contribute to the overall discrimination of a given target class within an image for a pre-trained model of a given architecture and selected CAM. To quantify the improvement of our ybROAD method against standard practice (i.e., considering the activation map of the final layer of a model), we summarize 12 diverse experimental conditions in Table 2. The difference between the ybROAD and final-layer ROAD values is indicative of the performance improvement from our proposed layer-wise approach. Figure 9 additionally depicts the general improvement and convergence of ROAD values across all model layers. Interestingly, this layer-wise series of values affords greater insight into the general functioning and utility of various model layer contributions across experiments. While Figures 9A, B, D, and E all generally improve in ROAD performance from the beginning to the end of the model, Figures 9C and F both appear relatively consistent in their value distribution, perhaps indicating that in these instances the model/CAM combination had greater difficulty discriminating the target class within the given image. Certainly, across all experiments we observe a noisy time-series signal, suggesting that future work investigate moving-average smoothing as a technique to make these curves more interpretable, albeit as a trade-off against the layer-specific resolution of ROAD values.
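The moving-average smoothing suggested above for the noisy ybROAD curve can be sketched as follows (the helper name and window choice are illustrative assumptions):

```python
import numpy as np

def smooth_ybroad(scores, window=5):
    """Centred moving average over the layer-wise ROAD series.

    Smoothing makes the noisy ybROAD curve easier to read, at the cost of
    layer-specific resolution; near the edges the average covers fewer
    real layers (zero-padded by np.convolve's 'same' mode)."""
    scores = np.asarray(scores, dtype=np.float64)
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")  # one value per layer

noisy = np.array([0.4, 0.1, 0.3, 0.15, 0.35, 0.2, 0.45])
print(smooth_ybroad(noisy, window=3).round(3))
```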
The combined consideration of quantitative ROAD and qualitative CAM at every layer enables end-users to identify the best representation for their particular image, target class, and model in a manner less arbitrary than selecting one of several terminal layers. For example, a healthcare professional might identify a better representative feature map for a predicted tumour than they might otherwise obtain from a potentially poorer last-layer visualization. This approach effectively allows an end-user to peer across the network and determine the layers that best capture the story, as opposed to relying on the final output alone. We caution that this may introduce additional risk of confirmation bias; however, this has broadly been a challenge within the XAI community.

CAM Failure Cases
Interestingly, EigenCAM incorrectly highlights the dog in the image, instead of the desired cat class. This explains the negative ROAD value for EigenCAM in Figure 10. EigenCAM is a non-discriminative CAM method that uses principal components to create activation maps. However, when there are multiple classes within the same image, the order of principal components must be specified (e.g., first principal component vs. second principal component). EigenCAM performs well on images with a single subject, but otherwise requires a user to determine the number and rank of the various components within an image to perform successfully. This requires a level of hand-engineering and data leakage to correctly align the appropriate principal component with the intended class. The ybROAD plots proposed within this work can be leveraged to better understand whether a model adequately distinguishes a given class or whether it fails across all layers of the model. As visualized in the ybROAD plots of Figure 11, a mean layer-wise ROAD value of approximately 0 effectively demonstrates that the model was unable to identify the correct class within the image. Consequently, the quantitative metrics derived from the ybROAD plots may be useful in elucidating the impact of model architecture (and its learned parameters) on a class-specific basis. As part of future work, this concept could be extended to consider epoch-wise ybROAD plots to better determine how specific layers contribute to the discrimination of the target class throughout model training.

Future Directions
The proposed future directions for research represent individual contributions that can significantly advance the use of CAMs for CNNs. Foremost, conducting more in-depth studies on the activation map statistics at different layers and for different images can provide a better understanding of how CNNs attend to images in varying applications and contexts. Secondly, designing an algorithm to efficiently compute CAM-based videos would greatly improve the applicability of this technique in various fields, particularly those that require inference or interpretability in near-realtime. Thirdly, using activation map sequences to identify redundant layers/filters represents a novel approach for network compression. Fourthly, exploring the behavior of activation map sequences for incorrect classes and finding ways to exploit this information for classification purposes is a unique contribution. Lastly, coupling CAM-based videos with expert feedback in specific applications can result in a more interpretable and accurate model. Overall, these individual research contributions have the potential to improve the performance, efficiency, and interpretability of CNN models, leading to advancements in various image classification tasks and promoting large-scale and transparent adoption of these models.

Conclusion
This work proposes CAManim as a novel XAI visualization method enabling end-users to better interpret CNN predictions by animating the CAM-based network activation maps through all layers. We demonstrate that CAManim works with any CAM-based method and various CNN architectures. We additionally introduce a quantitative end-to-end assessment inspired by the ROAD metric, denoted "yellow brick ROAD" (ybROAD). Our experiments demonstrate the utility of these methods for improved interpretation and understanding of CNN predictions, not only in their final layers but across both layer-specific (local) and network-wise (global) perspectives. Visualizations and source code can be found at: https://omni-ml.github.io/pytorch-grad-cam-anim/

Fig 2. Difference between local and global normalization for the feature map generated from layer features.denseblock1 in DenseNet161.

Fig 4. Visualization of the activation maps from BC-ResNet50 to visually depict how the model predicts the 'malignant' tumour class. Only the 10th-percentile layers are illustrated for concision.

Fig 9. Quantitative determination of ybROAD & visualization of model convergence to the target class.

Table 1. Total number of parameters, CAManim runtime, and average parameters and runtime across all layers included in CAManim, calculated for six CNN models using HiResCAM.

Table 2. Quantitative comparison of ybROAD values and SOTA CAM methods for various model architectures, images, and target classes.