Uncertainty-aware deep learning in healthcare: A scoping review

Mistrust is a major barrier to implementing deep learning in healthcare settings. Trust could be earned by conveying model certainty, or the probability that a given model output is accurate, but the use of uncertainty estimation to build trust in deep learning is largely unexplored, and there is no consensus regarding optimal methods for quantifying uncertainty. Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for specifying certainty of deep learning predictions. We searched Embase, MEDLINE, and PubMed databases for articles relevant to study objectives, complying with PRISMA guidelines, rated study quality using validated tools, and extracted data according to modified CHARMS criteria. Among 30 included studies, 24 described medical imaging applications. All imaging model architectures used convolutional neural networks or a variation thereof. The predominant method for quantifying uncertainty was Monte Carlo dropout, which produces predictions from multiple networks for which different neurons have dropped out and measures variance across the distribution of resulting predictions. Conformal prediction offered similarly strong performance in estimating uncertainty, along with ease of interpretation and applicability not only to deep learning but also to other machine learning approaches. Among the six articles describing non-imaging applications, model architectures and uncertainty estimation methods were heterogeneous, but predictive performance was generally strong, and uncertainty estimation was effective in comparing modeling methods. Overall, the use of model learning curves to quantify epistemic uncertainty (attributable to model parameters) was sparse. Heterogeneity in reporting methods precluded meta-analysis.
Uncertainty estimation methods have the potential to identify rare but important misclassifications made by deep learning models and compare modeling methods, which could build patient and clinician trust in deep learning applications in healthcare. Efficient maturation of this field will require standardized guidelines for reporting performance and uncertainty metrics.


Introduction
Deep learning is increasingly important in healthcare. Deep learning prediction models that leverage electronic health record data have outperformed other statistical and regression-based methods [1,2]. Computer vision models have matched or outperformed physicians for several common and essential clinical tasks, albeit in select circumstances [3,4]. These results suggest a potential role for clinical implementation of deep learning applications in healthcare.
Mistrust is a major barrier to clinical implementation of deep learning predictions [5,6]. Efforts to restore and build trust in machine learning have focused primarily on improving model explainability and interpretability. These techniques build clinicians' trust, especially when model outputs and important features correlate with logic, scientific evidence, and domain knowledge [7,8]. Another critically important step in building trust in deep learning is to convey model uncertainty, or the probability that a given model output is inaccurate [8]. Deep learning models that typically perform well make rare but egregious errors [9]. If a model could calculate the uncertainty in its predictions on a case-by-case basis, patients and clinicians would be afforded opportunities to make safe, effective, data-driven decisions regarding the utility of model outputs, and either ignore predictions with high uncertainty or triage them for detailed, human review. Unfortunately, there is a paucity of literature describing effective mechanisms for calculating model uncertainty for healthcare applications, and no consensus regarding best methods exists.
Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for optimizing certainty in deep learning predictions. Herein, we perform a scoping review of salient literature, critically evaluate methods for quantifying uncertainty in deep learning, and use insights gained from the review process to develop a conceptual framework.

Materials and methods
Article inclusion is illustrated in Fig 1, a PRISMA flow diagram. We searched Embase, MEDLINE, and PubMed databases, chosen for their specificity to the healthcare domain, for articles with "deep learning" and "confidence" or "uncertainty" in the title or abstract and for articles with "deep learning" and "conformal prediction" in the title or abstract, identifying 37 unique articles. Two investigators independently screened all article abstracts for relevance to review objectives, removing three articles. Full texts of the remaining 34 articles were reviewed. Study quality was independently rated by two investigators using quality assessment tools specific to the design of the study in question (available at: https://www.nhlbi.nih.gov/health-topics/study-quality-assessment-tools). Only studies describing healthcare applications and rated good or fair quality were retained, which removed four articles, leaving 30 in the final analysis. Data extraction was performed according to a modification of CHARMS criteria, which included methods for measuring uncertainty in deep learning predictions [10]. The search was performed according to Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines, as listed in S1 PRISMA Checklist.
During screening, there were disagreements between the two investigators regarding the exclusion of five articles; all disagreements were resolved by discussion of review objectives without a third-party arbiter. Cohen's kappa statistic summarizing interrater agreement regarding article screening was 0.358 (observed agreement = 0.848, expected agreement = 0.764), suggesting that screening agreement between reviewers was fair [11,12]. During full text review, there was a disagreement between the two investigators regarding the exclusion of one article, which was resolved by discussion of review objectives without a third-party arbiter. Cohen's kappa statistic summarizing interrater agreement regarding full text review could not be calculated because both observed and expected agreement were 0.964, but this high value suggests that agreement between reviewers was substantial.
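The interrater statistics above follow directly from the standard kappa formula, κ = (p_o − p_e) / (1 − p_e). A minimal Python sketch, using the screening-stage agreement values reported above (the small difference from the reported 0.358 presumably reflects rounding of the published inputs; note the formula is undefined when expected agreement equals 1, as in the full-text-review case):

```python
def cohens_kappa(observed: float, expected: float) -> float:
    """Chance-corrected interrater agreement: (p_o - p_e) / (1 - p_e)."""
    return (observed - expected) / (1.0 - expected)

# Screening-stage agreement values reported above.
kappa = cohens_kappa(observed=0.848, expected=0.764)
print(round(kappa, 3))  # ~0.36, "fair" on the Landis-Koch scale
```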

Results
Included articles are summarized in Table 1. Notably, the use of uncertainty estimation in these articles was rarely applied to building trust in deep learning among patients, caregivers, and clinicians. Therefore, the presentation of results will focus primarily on the content of the articles, and opportunities to use uncertainty-aware deep learning to build trust will be discussed further in the Discussion section as a novel application of established techniques.
Among 30 included studies, 24 described medical imaging applications and six described non-imaging applications; these categories are evaluated and reported separately. First, important themes from included articles are synthesized into a conceptual framework.

Conceptual framework for optimizing certainty in deep learning predictions
Deep learning uncertainty can be classified as epistemic (i.e., attributable to uncertainty regarding model parameters or lack of knowledge) or aleatoric (i.e., attributable to stochastic variability and noise in data). Epistemic and aleatoric uncertainty have overlapping etiologies, as variability and noise in data can contribute to uncertainty regarding optimal model parameters and knowledge regarding ground truth. In addition, epistemic and aleatoric uncertainty may be amenable to similar mitigation strategies: collecting and analyzing more data may allow for more effective identification and imputation of outlier and missing values, reducing aleatoric uncertainty, and may also allow for more effective parameter searches. Beyond these overlapping etiologies and mitigation strategies, epistemic and aleatoric uncertainty have some unique and potentially important attributes. Epistemic uncertainty can be seen as a lack of information about the best model and can be reduced by adding more training data [13]. Learning curves stratified by number of training samples offer an intuitive approach to visualizing epistemic uncertainty: using more data typically results not only in more accurate models, but also in more stable loss when trained for the same number of epochs. In stochastic models, parameter estimates also become more stable with increasing amounts of training data. In addition to increasing knowledge through larger sample sizes, it may also be possible to reduce epistemic uncertainty by adding input features, especially multi-modal features (e.g., using not only vital signs to predict hospital mortality, but also laboratory values, imaging data, and unstructured text from clinician notes), or by modifying the algorithm to learn from additional nonlinear combinations of variables.
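The claim that parameter estimates stabilize as training data grow can be illustrated with a toy experiment: repeatedly refit a deliberately simple one-parameter "model" (the sample mean) on samples of increasing size and measure the spread of the fitted parameter across refits. This is a hedged sketch with made-up data, not a method from the reviewed studies:

```python
import random
import statistics

random.seed(0)

def fitted_parameter(sample):
    """Toy one-parameter 'model': the sample mean."""
    return statistics.fmean(sample)

def parameter_spread(n_train: int, n_refits: int = 200) -> float:
    """Refit the toy model on independent draws of size n_train; the
    spread of the fitted parameter across refits is a proxy for
    epistemic uncertainty at that training-set size."""
    fits = [
        fitted_parameter([random.gauss(10.0, 2.0) for _ in range(n_train)])
        for _ in range(n_refits)
    ]
    return statistics.stdev(fits)

# Epistemic uncertainty shrinks as the training sample grows.
spreads = {n: parameter_spread(n) for n in (10, 100, 1000)}
for n, s in spreads.items():
    print(n, round(s, 3))
```

Plotting these spreads against sample size yields exactly the kind of learning curve described above, with the curve flattening as the epistemic limit is approached.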
Once an epistemic uncertainty limit has been reached, quantifying the remaining aleatoric uncertainty in predictions could augment clinical application by allowing patients and providers to understand whether predictions have suitable accuracy and certainty for incorporation in shared decision-making, or are too severely compromised by aleatoric uncertainty to be useful, regardless of overall model accuracy [13]. These concepts are illustrated in Fig 2. This explanation considers transforming a given model into a stochastic ensemble through Bernoulli sampling of weights at model test time, giving rise to a measure of epistemic uncertainty for each sample.

Medical imaging applications
Among the 24 studies describing medical imaging applications, 12 (50%) used magnetic resonance imaging (MRI) features for model training and testing; 11 of those 12 (92%) involved the brain or central nervous system. The next most common sources of model features were retinal or fundus images (5 of 24, 21%) and endoscopic images of colorectal polyps (3 of 24, 13%). The remaining studies used computed tomography images, breast ultrasound images, lung microscopy images, or facial expressions. All model architectures included convolutional neural networks or a variation thereof (e.g., U-Net). The predominant method for quantifying uncertainty in model predictions was Monte Carlo dropout, as originally described by Gal and Ghahramani as a Bayesian approximation of probabilistic Gaussian processes [14]. Briefly, during testing, multiple predictions are generated from a given network for which different neurons have dropped out. The neuron dropout rate is calibrated during model development according to training data sparsity and model complexity. Each forward pass uses a different set of neurons, so the outcome is an ensemble of different network architectures that can generate a posterior distribution for which high variance suggests high uncertainty and low variance suggests low uncertainty. Studies assessing the efficacy of uncertainty measurements provided reasonable evidence that uncertainty estimations were useful. In applying a Bayesian convolutional neural network to diagnose ischemic stroke using brain MRI images, Herzog et al [15] found that uncertainty measurements improved model accuracy by approximately 2%. In applying a convolutional neural network to estimate brain and cerebrospinal fluid intracellular volume, Qin et al [16] reported highly significant correlations (all p<0.001) between uncertainty estimations and observed error based on ground truth values.
Finally, in applying a convolutional neural network for differentiating among glioma, multiple sclerosis, and healthy brain, Tanno et al [17] found that uncertainty-based classification correctly identified 96% of all predictions that had high risk for error; this error was likely attributable to aleatoric uncertainty from noise and variability in data. Valiuddin et al [18] used Monte Carlo simulations in depicting the performance of a probabilistic U-Net performing density modeling of thoracic computed tomography and endoscopic polyp images, learning aleatoric uncertainty as a distribution of possible annotations using a probabilistic segmentation model. This approach was effective in increasing predictive performance, measured by generalized energy distance and intersection over union, by up to 14%. Collectively, these findings suggest that Monte Carlo dropout methods can accurately estimate uncertainty in predictions made by convolutional neural networks that make rare but potentially important misclassifications on medical imaging data, and corroborate prior evidence that Monte Carlo dropout can also offer predictive performance advantages, especially on external validation, by mitigating risk for overfitting.
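The test-time mechanics of Monte Carlo dropout described above can be sketched in plain Python: run many stochastic forward passes through a network with Bernoulli dropout left active, then treat the spread of the resulting predictions as the uncertainty estimate. The tiny network and its weights below are hypothetical, chosen only to keep the sketch self-contained; real applications would use a trained model:

```python
import random
import statistics

random.seed(42)

# Hypothetical fixed weights for a tiny 3-input, 4-hidden, 1-output network.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.2, 0.7], [0.1, -0.6, 0.9]]
W2 = [0.6, -0.3, 0.5, 0.2]

def forward(x, dropout_p=0.5):
    """One stochastic forward pass: each hidden neuron is dropped with
    probability dropout_p; surviving ReLU activations are rescaled."""
    hidden = []
    for weights in W1:
        if random.random() < dropout_p:
            hidden.append(0.0)  # neuron dropped out on this pass
        else:
            pre = sum(w * xi for w, xi in zip(weights, x))
            hidden.append(max(0.0, pre) / (1.0 - dropout_p))
    return sum(w * h for w, h in zip(W2, hidden))

def mc_dropout_predict(x, n_passes=500):
    """Mean prediction and predictive spread across stochastic passes."""
    preds = [forward(x) for _ in range(n_passes)]
    return statistics.fmean(preds), statistics.stdev(preds)

mean, spread = mc_dropout_predict([1.0, 0.5, -0.3])
print(round(mean, 3), round(spread, 3))  # high spread suggests high uncertainty
```

Each pass samples a different sub-network, so the collection of outputs approximates the posterior predictive distribution described in the text.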
Conformal prediction, used in two studies, demonstrated strong performance in estimating uncertainty. Wieslander et al [19] applied convolutional neural networks to investigate drug distribution on microscopy images of rat lungs following different doses and routes of medication administration, finding that conformal prediction explained 99% of the variance in predicted versus actual error. In another study by Athanasiadis et al [20], conformal prediction improved audio-visual emotion classification for a semi-supervised generative adversarial network compared with a similar network using the classifier alone.
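The coverage property that makes conformal prediction easy to interpret can be shown with a split-conformal sketch: a quantile of held-out calibration residuals supplies an interval half-width with a guaranteed marginal coverage. The point model and data here are hypothetical stand-ins, not the models from the cited studies:

```python
import math
import random

random.seed(1)

def predictor(x: float) -> float:
    """Hypothetical point model standing in for a trained network."""
    return 2.0 * x

# Calibration set: noisy observations held out from model fitting.
calibration = [(float(x), 2.0 * x + random.gauss(0.0, 1.0)) for x in range(200)]
scores = sorted(abs(y - predictor(x)) for x, y in calibration)

def conformal_interval(x: float, alpha: float = 0.2):
    """Split conformal interval: the ceil((n+1)(1-alpha))-th smallest
    calibration residual gives a half-width with at least (1 - alpha)
    marginal coverage on exchangeable data."""
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    return predictor(x) - q, predictor(x) + q

lo, hi = conformal_interval(10.0)
print(round(lo, 2), round(hi, 2))  # interval centered on the point prediction
```

Because the procedure only needs a point predictor and held-out residuals, it applies unchanged to random forests or gradient boosting, which is the cross-model flexibility noted in the Discussion.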
Two studies used uncertainty estimation to compare modeling methods. Graham et al [21] used uncertainty measurements to demonstrate that a hierarchical approach to labeling regions and sub-regions of the brain produced similar predictive performance with greater certainty compared with a flat labeling approach, at any level of the labeling tree. Alternatively, to evaluate similarity between functional brain networks, Ktena et al [22] used convolutional neural network architectures to derive a novel similarity metric on irregular graphs, demonstrating improved overall classification. Sedghi et al [23] calculated variance in displacement for different image classifications of brain MRIs, demonstrating good Dice similarity values for intra-subject pairs with consistently good results when simulating resections on the images, suggesting utility for challenging clinical scenarios.

Non-imaging applications
The six studies describing non-imaging medical applications were heterogeneous. Five of the studies endeavored to predict and classify biochemical and molecular properties for pharmacologic applications, each with somewhat different model architectures (i.e., ensembles of deep neural networks, convolutional neural networks, and multi-layer perceptrons). Three of these five studies generated posterior distributions and assessed variance across those distributions to approximate prediction uncertainty. In one instance, there was almost no gain in predictive performance; in another, by Cortes-Ciriano and Bender, there was strong correlation between estimated confidence levels and the percentage of confidence intervals that encompassed the ground truth (R² > 0.99, p<0.001) [24]. This difference in performance may have been attributable to differences in model features: the less successful model used bit strings to represent molecular structures, while the more successful model used high-granularity bioactivity features, with 203-5,207 data points per protein. A third study in the molecular property class also used Monte Carlo dropout techniques and reported relatively low test error values [25]. Two studies used conformal prediction to estimate uncertainty, one of which used conformal prediction to classify active and inactive compounds, generating single-label predictions for about 90% of all instances with overall confidence of 80% or greater. Best results were demonstrated for deep neural networks rather than random forest or light gradient boosting machine models, and conformal prediction offered a controllable error rate and better recall for all three model types [26].
Cortes-Ciriano and Bender [27] leveraged conformal predictions in analyzing errors on ensembles of predictions generated by dropout, reporting strong correlation between confidence levels and error rates (R² > 0.99, p<0.001), with results similar to those reported in their Deep Confidence work [24]. The remaining non-imaging study predicted neurodegenerative disease progression using multi-source clinical, imaging, genetic, and biochemical data, reporting variable predictive performance across different outcomes, but overall strong performance [28]. Compared with the biochemical prediction models, this study used a unique method for quantifying uncertainty, measuring variance across predictions made by an ensemble of possible patient forecasts using a generative network. Collectively, these findings suggest that unique model architectures and methods for estimating uncertainty can be applied to a variety of non-pixel-based input features, producing occasional predictive performance advantages and accurate uncertainty estimations.

Discussion
This review found that the uncertainty inherent in deep learning predictions is most commonly estimated for medical imaging applications using Monte Carlo dropout methods on convolutional neural networks. In addition, unique model architectures and uncertainty estimation methods can apply to non-pixel features, simultaneously improving predictive performance (presumably by mitigating risk for overfitting, in the case of Monte Carlo dropout) while accurately estimating uncertainty. Unsurprisingly, for medical imaging applications, larger datasets of training images were associated with greater predictive performance [15,21,29-38]. We could not perform meta-analyses on predictive performance or uncertainty estimations because performance metrics and methods for quantifying uncertainty were heterogeneous, despite relative homogeneity in model architectures (primarily convolutional neural networks) and in methods for estimating uncertainty (primarily Monte Carlo dropout) [14]. Uncertainty estimations for non-imaging applications were both sparse and heterogeneous. Yet the weight of evidence suggests that a variety of methods can estimate uncertainty in predictions on non-pixel features, offering greater performance and reasonably accurate uncertainty estimations. Conformal prediction also demonstrated efficacy in uncertainty estimation, is easy to interpret (e.g., at a confidence level of 80%, at least 80% of the predicted confidence intervals contain the true value), and applies not only to deep learning but also to other machine learning approaches such as random forest modeling.
For both imaging and non-imaging applications, uncertainty estimations are poised to augment clinical application by identifying rare but potentially important misclassifications made by deep learning models. First, mistrust of machine learning predictions must be overcome. Model explainability, interpretability, and consistency with logic, scientific evidence, and domain knowledge are critically important in building trust [7,8]. Yet, even when a model is easy to understand, generates predictions consistent with medical knowledge, and has 90% overall accuracy, patients and providers may wonder: is this prediction among the 1 in 10 that is incorrect? Can the model tell me whether it is certain or uncertain of this particular prediction? To address these questions and build trust, it seems prudent to include model uncertainty estimations in shared decision-making processes. Therefore, we believe that uncertainty estimations are a critical element in the safe, effective clinical implementation of deep learning in healthcare. In performing this review, we sought to summarize evidence regarding the efficacy of uncertainty estimation in building trust in deep learning among patients, caregivers, and clinicians, but we found little evidence thereof. Therefore, we propose uncertainty-aware deep learning as a novel approach to building trust.
We found no previous systematic or scoping reviews on the same topic, though several authors have described important components of estimating uncertainty in deep learning predictions. Common statistical measures of spread (e.g., standard deviation and interquartile range) are undefined for single point predictions. Entropy, however, does apply to probability distributions. Therefore, most uncertainty estimation methods generate probability distributions around point estimations. Monte Carlo dropout, as originally described by Gal and Ghahramani, offers an elegant solution [14]. During testing, multiple stochastic predictions are generated from a given network for which different neurons have dropped out with specified probability. This dropout rate is calibrated during model development according to training data sparsity and model complexity. When training, dropping out different sets of neurons at different steps harbors the additional advantage of mitigating overfitting. When testing, each forward pass uses a different set of neurons; therefore, the outcome is an ensemble of different network architectures that can be represented as a posterior distribution. Variance across the distribution of predictions can be analyzed by several methods (e.g., entropy, variation ratios, standard deviation, mutual information). High variance suggests high uncertainty; low variance suggests low uncertainty.
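Two of the spread measures named above, variation ratio and entropy, can be computed directly from the class labels predicted across stochastic forward passes. A minimal sketch, assuming a hypothetical list of per-pass labels rather than output from a real model:

```python
import math
from collections import Counter

def predictive_entropy(labels):
    """Shannon entropy of the empirical label distribution across passes."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def variation_ratio(labels):
    """Fraction of stochastic passes disagreeing with the modal prediction."""
    mode_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - mode_count / len(labels)

# Hypothetical labels from 10 stochastic forward passes on two cases.
confident = ["stroke"] * 9 + ["normal"]
uncertain = ["stroke"] * 5 + ["normal"] * 5

print(round(variation_ratio(confident), 2), round(variation_ratio(uncertain), 2))
```

Both measures are maximal when the passes split evenly across classes and near zero when they agree, matching the high-variance/high-uncertainty interpretation in the text.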
This review was limited by heterogeneity in model performance metrics and methods for quantifying uncertainty. To identify the optimal methods for estimating uncertainty in deep learning predictions, it would be necessary to perform a meta-analysis or comparative effectiveness analyses. This would be facilitated by achieving consensus regarding core performance and uncertainty metrics. The field of deep learning uncertainty estimation is maturing rapidly; it would be advantageous to establish reporting guidelines, as has been done for prediction modeling, causal inference, and machine learning trials [39-42]. Finally, beyond uncertainty estimations, it may be useful to quantify how similar an individual patient is to other patients in the training data, so that users can understand whether uncertainty is attributable to variability in outcomes relative to similar features in the training data or due to a patient having outlier features that are not well represented in the training data.
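The similarity-to-training-data idea raised above could be prototyped as a simple k-nearest-neighbor distance in feature space, where large distances flag patients whose features are poorly represented in training. This is a speculative sketch with hypothetical feature vectors, not a method from the reviewed studies:

```python
import math

def knn_distance(query, train, k=3):
    """Mean Euclidean distance to the k nearest training points; large
    values flag patients under-represented in the training data."""
    dists = sorted(math.dist(query, point) for point in train)
    return sum(dists[:k]) / k

# Hypothetical standardized feature vectors (e.g., age, heart rate, lactate).
train = [[0.1, 0.0, -0.2], [0.2, -0.1, 0.0], [0.0, 0.1, 0.1], [-0.1, 0.2, -0.1]]
typical = [0.05, 0.0, 0.0]
outlier = [3.0, -2.5, 4.0]

print(round(knn_distance(typical, train), 3), round(knn_distance(outlier, train), 3))
```

Reporting such a score alongside an uncertainty estimate would help distinguish the two causes of doubt the text describes: noisy outcomes among similar patients versus an out-of-distribution patient.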

Conclusions
For convolutional neural network predictions on medical images, Monte Carlo dropout methods accurately estimate uncertainty. For non-imaging applications, limited evidence suggests that several uncertainty estimation methods can improve predictive performance and accurately estimate uncertainty. Using uncertainty estimations to gain the trust of patients and clinicians is a novel concept that warrants empirical investigation. The rapid maturation of deep learning uncertainty estimations in medical literature could be facilitated by achieving consensus regarding performance and uncertainty metrics and standardizing reporting guidelines. Once standardized and validated, uncertainty estimates have the potential to identify rare but important misclassifications made by deep learning models in clinical settings, augmenting shared decision-making processes toward improved healthcare delivery.
Supporting information

S1 PRISMA Checklist. Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist. (DOCX)