
Uncertainty-aware deep learning in healthcare: A scoping review

  • Tyler J. Loftus,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Visualization, Writing – original draft

    tyler.loftus@surgery.ufl.edu

    Affiliations Department of Surgery, University of Florida Health, Gainesville, Florida, United States of America, Intelligent Critical Care Center, University of Florida, Gainesville, Florida, United States of America

  • Benjamin Shickel,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Visualization, Writing – original draft

    Affiliation Department of Biomedical Engineering, University of Florida, Gainesville, Florida, United States of America

  • Matthew M. Ruppert,

    Roles Data curation, Investigation, Resources, Software, Visualization, Writing – review & editing

    Affiliations Intelligent Critical Care Center, University of Florida, Gainesville, Florida, United States of America, Department of Medicine, University of Florida Health, Gainesville, Florida, United States of America

  • Jeremy A. Balch,

    Roles Data curation, Investigation, Resources, Software, Visualization, Writing – review & editing

    Affiliation Department of Surgery, University of Florida Health, Gainesville, Florida, United States of America

  • Tezcan Ozrazgat-Baslanti,

    Roles Data curation, Investigation, Resources, Software, Visualization, Writing – review & editing

    Affiliations Intelligent Critical Care Center, University of Florida, Gainesville, Florida, United States of America, Department of Medicine, University of Florida Health, Gainesville, Florida, United States of America

  • Patrick J. Tighe,

    Roles Investigation, Supervision, Visualization, Writing – review & editing

    Affiliation Departments of Anesthesiology, Orthopedics, and Information Systems/Operations Management, University of Florida Health, Gainesville, Florida, United States of America

  • Philip A. Efron,

    Roles Investigation, Project administration, Resources, Software, Supervision, Visualization, Writing – review & editing

    Affiliations Department of Surgery, University of Florida Health, Gainesville, Florida, United States of America, Intelligent Critical Care Center, University of Florida, Gainesville, Florida, United States of America

  • William R. Hogan,

    Roles Investigation, Project administration, Resources, Software, Supervision, Visualization, Writing – review & editing

    Affiliation Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, United States of America

  • Parisa Rashidi,

    Roles Funding acquisition, Investigation, Project administration, Resources, Software, Supervision, Visualization, Writing – review & editing

    Affiliations Intelligent Critical Care Center, University of Florida, Gainesville, Florida, United States of America, Departments of Biomedical Engineering, Computer and Information Science and Engineering, and Electrical and Computer Engineering, University of Florida, Gainesville, Florida, United States of America

  • Gilbert R. Upchurch Jr.,

    Roles Investigation, Project administration, Resources, Software, Supervision, Visualization, Writing – review & editing

    Affiliation Department of Surgery, University of Florida Health, Gainesville, Florida, United States of America

  • Azra Bihorac

    Roles Funding acquisition, Investigation, Project administration, Resources, Software, Supervision, Visualization, Writing – review & editing

    Affiliations Intelligent Critical Care Center, University of Florida, Gainesville, Florida, United States of America, Department of Medicine, University of Florida Health, Gainesville, Florida, United States of America

Abstract

Mistrust is a major barrier to implementing deep learning in healthcare settings. Entrustment could be earned by conveying model certainty, or the probability that a given model output is accurate, but the use of uncertainty estimation for deep learning entrustment is largely unexplored, and there is no consensus regarding optimal methods for quantifying uncertainty. Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for specifying certainty of deep learning predictions. We searched Embase, MEDLINE, and PubMed databases for articles relevant to study objectives, complying with PRISMA guidelines, rated study quality using validated tools, and extracted data according to modified CHARMS criteria. Among 30 included studies, 24 described medical imaging applications. All imaging model architectures used convolutional neural networks or a variation thereof. The predominant method for quantifying uncertainty was Monte Carlo dropout, producing predictions from multiple networks for which different neurons have dropped out and measuring variance across the distribution of resulting predictions. Conformal prediction offered similarly strong performance in estimating uncertainty, along with ease of interpretation and application not only to deep learning but also to other machine learning approaches. Among the six articles describing non-imaging applications, model architectures and uncertainty estimation methods were heterogeneous, but predictive performance was generally strong, and uncertainty estimation was effective in comparing modeling methods. Overall, the use of model learning curves to quantify epistemic uncertainty (attributable to model parameters) was sparse. Heterogeneity in reporting methods precluded meta-analysis. Uncertainty estimation methods have the potential to identify rare but important misclassifications made by deep learning models and compare modeling methods, which could build patient and clinician trust in deep learning applications in healthcare. Efficient maturation of this field will require standardized guidelines for reporting performance and uncertainty metrics.

Author summary

Deep learning prediction models perform better than traditional prediction models for several healthcare applications. For deep learning to achieve its greatest impact on healthcare delivery, patients and providers must trust deep learning models and their outputs. This article describes the potential for deep learning to earn trust by conveying model certainty, or the probability that a given model output is accurate. If a model could convey not only its prediction but also its level of certainty that the prediction is correct, patients and providers could make an informed decision to incorporate or ignore the prediction. The use of uncertainty estimation for deep learning entrustment is largely unexplored, and there is no consensus regarding optimal methods for quantifying uncertainty. Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for specifying certainty of deep learning predictions. We systematically reviewed published scientific literature and summarized results from 30 studies, and found that uncertainty estimation methods have the potential to identify rare but important misclassifications made by deep learning models and compare modeling methods, which could build patient and clinician trust in deep learning applications in healthcare.

Introduction

Deep learning is increasingly important in healthcare. Deep learning prediction models that leverage electronic health record data have outperformed other statistical and regression-based methods [1,2]. Computer vision models have matched or outperformed physicians for several common and essential clinical tasks, albeit in select circumstances [3,4]. These results suggest a potential role for clinical implementation of deep learning applications in healthcare.

Mistrust is a major barrier to clinical implementation of deep learning predictions [5,6]. Efforts to restore and build trust in machine learning have focused primarily on improving model explainability and interpretability. These techniques build clinicians’ trust, especially when model outputs and important features correlate with logic, scientific evidence, and domain knowledge [7,8]. Another critically important step in building trust in deep learning is to convey model uncertainty, or the probability that a given model output is inaccurate [8]. Even deep learning models that typically perform well can make rare but egregious errors [9]. If a model could calculate the uncertainty in its predictions on a case-by-case basis, patients and clinicians would be afforded opportunities to make safe, effective, data-driven decisions regarding the utility of model outputs, and either ignore predictions with high uncertainty or triage them for detailed human review. Unfortunately, there is a paucity of literature describing effective mechanisms for calculating model uncertainty for healthcare applications, and no consensus regarding best methods exists.

Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for optimizing certainty in deep learning predictions. Herein, we perform a scoping review of salient literature, critically evaluate methods for quantifying uncertainty in deep learning, and use insights gained from the review process to develop a conceptual framework.

Materials and methods

Article inclusion is illustrated in Fig 1, a PRISMA flow diagram. We searched Embase, MEDLINE, and PubMed databases, chosen for their specificity to the healthcare domain, for articles with “deep learning” and “confidence” or “uncertainty” in the title or abstract and for articles with “deep learning” and “conformal prediction” in the title or abstract, identifying 37 unique articles. Two investigators independently screened all article abstracts for relevance to review objectives, removing three articles. Full texts of the remaining 34 articles were reviewed. Study quality was independently rated by two investigators using quality assessment tools specific to the design of the study in question (available at: https://www.nhlbi.nih.gov/health-topics/study-quality-assessment-tools). Only studies describing healthcare applications that were good or fair quality were included in the final analysis, which removed four articles, leaving 30 total articles in the final analysis. Data extraction was performed according to a modification of CHARMS criteria, which included methods for measuring uncertainty in deep learning predictions [10]. The search was performed according to Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines, as listed in S1 PRISMA Checklist.

During screening, there were disagreements between the two investigators regarding the exclusion of five articles; all disagreements were resolved by discussion of review objectives without a third-party arbiter. Cohen’s kappa statistic summarizing interrater agreement regarding article screening was 0.358 (observed agreement = 0.848, expected agreement = 0.764), suggesting that screening agreement between reviewers was fair [11,12]. During full text review, there was a disagreement between the two investigators regarding the exclusion of one article, which was resolved by discussion of review objectives without a third-party arbiter. Cohen’s kappa statistic summarizing interrater agreement regarding full text review was uninformative because observed and expected agreement were both 0.964, which forces kappa to zero regardless of how high the raw agreement is; the high raw agreement nonetheless suggests that agreement between reviewers was substantial.
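For reference, the kappa values above follow directly from the reported agreement proportions. The minimal sketch below (Python, purely illustrative) applies the standard formula described by De Vries et al [11] to the screening values; the small difference from the reported 0.358 presumably reflects rounding of the agreement proportions.

```python
# Minimal sketch: Cohen's kappa computed from observed and expected agreement.
# The inputs are the abstract-screening agreement proportions reported above.
def cohens_kappa(observed: float, expected: float) -> float:
    """Kappa = (p_o - p_e) / (1 - p_e)."""
    return (observed - expected) / (1.0 - expected)

print(round(cohens_kappa(0.848, 0.764), 3))  # ~0.356 for abstract screening
```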

Results

Included articles are summarized in Table 1. Notably, the use of uncertainty estimation in these articles was rarely applied to building trust in deep learning among patients, caregivers, and clinicians. Therefore, the presentation of results will focus primarily on the content of the articles, and opportunities to use uncertainty-aware deep learning to build trust will be discussed further in the Discussion section as a novel application of established techniques.

Table 1. Summary of included studies, classified as imaging or non-imaging applications.

https://doi.org/10.1371/journal.pdig.0000085.t001

Among 30 included studies, 24 described medical imaging applications and six described non-imaging applications; these categories are evaluated and reported separately. First, important themes from included articles are synthesized into a conceptual framework.

Conceptual framework for optimizing certainty in deep learning predictions

Deep learning uncertainty can be classified as epistemic (i.e., attributable to uncertainty regarding model parameters or lack of knowledge) or aleatoric (i.e., attributable to stochastic variability and noise in data). Epistemic and aleatoric uncertainty have overlapping etiologies, as variability and noise in data can contribute to uncertainty regarding optimal model parameters and knowledge regarding ground truth. In addition, epistemic and aleatoric uncertainty may be amenable to similar mitigation strategies: collecting and analyzing more data may allow for more effective identification and imputation of outlier and missing values, reducing aleatoric uncertainty, and may also allow for more effective parameter searches. Beyond these overlapping etiologies and mitigation strategies, epistemic and aleatoric uncertainty have some unique and potentially important attributes. Epistemic uncertainty can be seen as a lack of information about the best model and can be reduced by adding more training data [13]. Learning curves stratified by number of training samples offer an intuitive approach to visualizing epistemic uncertainty: models trained on more data are typically not only more accurate but also exhibit more stable loss when trained for the same number of epochs. In stochastic models, parameter estimates also become more stable with increasing amounts of training data. In addition to increasing knowledge through larger sample sizes, it may also be possible to reduce epistemic uncertainty by adding input features, especially multi-modal features (e.g., predicting hospital mortality using not only vital signs but also laboratory values, imaging data, and unstructured text from notes written by clinicians), or by modifying the algorithm to learn from additional nonlinear combinations of variables. Once an epistemic uncertainty limit has been reached, quantifying the remaining aleatoric uncertainty in predictions could augment clinical application by allowing patients and providers to understand whether predictions have suitable accuracy and certainty for incorporation in shared decision-making, or are too severely compromised by aleatoric uncertainty to be useful, regardless of overall model accuracy [13]. These concepts are illustrated in Fig 2. This framework considers transforming a given model into a stochastic ensemble through Bernoulli sampling of weights at model test time, giving rise to a measure of epistemic uncertainty for each sample.
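To make the learning-curve idea above concrete, the sketch below is a minimal, hypothetical example assuming scikit-learn; the small multi-layer perceptron and synthetic dataset are stand-ins for a clinical model and cohort, not taken from any included study. The spread of cross-validated scores typically narrows as the training set grows, a practical signature of shrinking epistemic uncertainty.

```python
# Hypothetical sketch: visualizing epistemic uncertainty with learning curves.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a tabular clinical dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, _, val_scores = learning_curve(
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=200, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# The spread of validation scores across folds tends to shrink as the
# training set grows, one practical signature of decreasing epistemic uncertainty.
for n, scores in zip(train_sizes, val_scores):
    print(f"n_train={n:5d}  mean acc={scores.mean():.3f}  sd={scores.std():.3f}")
```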

Fig 2. A conceptual framework for optimizing certainty in deep learning predictions by quantifying and minimizing aleatoric and epistemic uncertainty.

https://doi.org/10.1371/journal.pdig.0000085.g002

Medical imaging applications

Among the 24 studies describing medical imaging applications, 12 (50%) used magnetic resonance imaging (MRI) features for model training and testing, 11 of which (92%) involved the brain or central nervous system. The next most common sources of model features were retinal or fundus images (5 of 24, 21%) and endoscopic images of colorectal polyps (3 of 24, 13%). The remaining studies used computed tomography images, breast ultrasound images, lung microscopy images, or facial expressions. All model architectures included convolutional neural networks or a variation thereof (e.g., U-Net).

The predominant method for quantifying uncertainty in model predictions was Monte Carlo dropout, as originally described by Gal and Ghahramani as a Bayesian approximation of probabilistic Gaussian processes [14]. Briefly, during testing, multiple predictions are generated from a given network for which different neurons have dropped out. The neuron dropout rate is calibrated during model development according to training data sparsity and model complexity. Each forward pass uses a different set of neurons, so the outcome is an ensemble of different network architectures that can generate a posterior distribution for which high variance suggests high uncertainty and low variance suggests low uncertainty. Studies assessing the efficacy of uncertainty measurements provided reasonable evidence that uncertainty estimations were useful. In applying a Bayesian convolutional neural network to diagnose ischemic stroke using brain MRI images, Herzog et al [15] found that uncertainty measurements improved model accuracy by approximately 2%. In applying a convolutional neural network to estimate brain and cerebrospinal fluid intracellular volume, Qin et al [16] reported highly significant correlations (all p<0.001) between uncertainty estimations and observed error based on ground truth values. Finally, in applying a convolutional neural network for differentiating among glioma, multiple sclerosis, and healthy brain, Tanno et al [17] found that uncertainty-based classification correctly identified 96% of all predictions that were at high risk for error; this error was likely attributable to aleatoric uncertainty from noise and variability in data. Valiuddin et al [18] used Monte Carlo simulations to characterize a probabilistic U-Net performing density modeling of thoracic computed tomography and endoscopic polyp images, learning aleatoric uncertainty as a distribution of possible annotations produced by a probabilistic segmentation model. This approach was effective in increasing predictive performance, measured by generalized energy distance and intersection over union, by up to 14%. Collectively, these findings suggest that Monte Carlo dropout methods can accurately estimate uncertainty in predictions made by convolutional neural networks that make rare but potentially important misclassifications on medical imaging data, and corroborate prior evidence that Monte Carlo dropout can also offer predictive performance advantages, especially on external validation, by mitigating risk for overfitting.
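For illustration, the sketch below shows test-time Monte Carlo dropout in the spirit of Gal and Ghahramani [14], assuming PyTorch; the toy two-layer classifier, dropout rate, and number of stochastic passes are illustrative choices rather than settings drawn from any of the included studies.

```python
# Minimal sketch of Monte Carlo dropout at test time (after Gal & Ghahramani [14]).
import torch
import torch.nn as nn

# Toy classifier; the dropout layer is what makes test-time sampling stochastic.
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 2),
)

def mc_dropout_predict(model, x, n_samples=50):
    """Return mean class probabilities and their variance over stochastic passes."""
    model.train()  # keep dropout active even though we are predicting
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.var(dim=0)  # high variance -> high uncertainty

x = torch.randn(1, 32)  # one hypothetical input
mean_p, var_p = mc_dropout_predict(model, x)
```

Keeping dropout active at prediction time yields a distribution of outputs for each case, and the variance of that distribution serves as the per-prediction uncertainty signal described above.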

Conformal prediction–used in two studies–demonstrated strong performance in estimating uncertainty. Wieslander et al [19] applied convolutional neural networks to investigate drug distribution on microscopy images of rat lungs following different doses and routes of medication administration, finding that conformal prediction explained 99% of the variance in predicted versus actual error. In another study by Athanasiadis et al [20], conformal prediction improved audio-visual emotion classification for a semi-supervised generative adversarial network compared with a similar network using the classifier alone.
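As a brief illustration of why conformal prediction is easy to interpret, the sketch below implements a split (inductive) conformal procedure for a generic classifier, assuming only NumPy; the simulated calibration probabilities, labels, and the 80% confidence level are illustrative, and in practice the probabilities would come from a held-out calibration set scored by the trained model.

```python
# Hypothetical sketch of split (inductive) conformal prediction for a classifier.
import numpy as np

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=500)   # stand-in for model outputs
cal_y = rng.integers(0, 3, size=500)              # stand-in for calibration labels
test_probs = rng.dirichlet(np.ones(3), size=5)

alpha = 0.20  # target error rate -> 80% confidence level
# Nonconformity score: 1 minus the probability assigned to the true class.
scores = 1.0 - cal_probs[np.arange(len(cal_y)), cal_y]
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

# Prediction set: all classes whose score would not exceed the calibrated threshold.
prediction_sets = [np.where(1.0 - p <= q)[0] for p in test_probs]
```

Under standard exchangeability assumptions, prediction sets built this way cover the true class at the chosen confidence level on average, regardless of the underlying model.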

Two studies used uncertainty estimation to compare modeling methods. Graham et al [21] used uncertainty measurements to demonstrate that a hierarchical approach to labeling regions and sub-regions of the brain produced predictive performance similar to a flat labeling approach, with greater certainty at any level of the labeling tree. To evaluate similarity between functional brain networks, Ktena et al [22] used convolutional neural network architectures to derive a novel similarity metric on irregular graphs, demonstrating improved overall classification. Sedghi et al [23] calculated variance in displacement for different image classifications of brain MRIs, demonstrating good Dice values for intra-subject pairs and consistently good results when simulating resections on the images, suggesting utility for challenging clinical scenarios.

Non-imaging applications

The six studies describing non-imaging medical applications were heterogeneous. Five of the studies endeavored to predict and classify biochemical and molecular properties for pharmacologic applications, each with somewhat different model architectures (i.e., ensembles of deep neural networks, convolutional neural networks, and multi-layer perceptrons). Three of these five studies generated posterior distributions and assessed variance across those distributions to approximate prediction uncertainty. In one instance, there was almost no gain in predictive performance; in another, by Cortes-Ciriano and Bender, there was strong correlation between estimated confidence levels and the percentage of confidence intervals that encompassed the ground truth (R2 > 0.99, p<0.001) [24]. This difference in performance may have been attributable to differences in model features: the less successful model used bit strings to represent molecular structures, whereas the more successful model used high-granularity bioactivity features, with 203–5,207 data points per protein. A third study in the molecular property class also used Monte Carlo dropout techniques and reported relatively low test error values [25]. Two studies used conformal prediction to estimate uncertainty. One predicted active and inactive compound classes, generating single-label predictions for about 90% of all instances at an overall confidence of 80% or greater; the best results were demonstrated for deep neural networks rather than random forest or light gradient boosting machine models, and conformal prediction offered a controllable error rate and better recall for all three model types [26]. Cortes-Ciriano and Bender [27] leveraged conformal prediction to analyze errors on ensembles of predictions generated by test-time dropout, reporting strong correlation between confidence levels and error rates (R2 > 0.99, p<0.001), with results similar to those reported in their Deep Confidence work [24]. The remaining non-imaging study predicted neurodegenerative disease progression using multi-source clinical, imaging, genetic, and biochemical data, reporting variable predictive performance across different outcomes but overall strong performance [28]. Compared with the biochemical prediction models, this study used a unique method for quantifying uncertainty, measuring variance across an ensemble of possible patient forecasts produced by a generative network. Collectively, these findings suggest that unique model architectures and methods for estimating uncertainty can be applied to a variety of non-pixel-based input features, producing occasional predictive performance advantages and accurate uncertainty estimations.
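The ensemble-variance idea used in several of these studies can be sketched briefly as follows, assuming PyTorch; the tabular descriptor features, network size, and ensemble size are illustrative, and the training loop is omitted.

```python
# Minimal sketch of ensemble-based uncertainty for a non-imaging (tabular) task.
import torch
import torch.nn as nn

def make_regressor():
    # Small network predicting one continuous property from 16 descriptors.
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# Train several independently initialized networks (training omitted here),
# then treat the spread of their predictions as an uncertainty estimate.
ensemble = [make_regressor() for _ in range(10)]

x = torch.randn(4, 16)  # four hypothetical compounds with 16 descriptors each
with torch.no_grad():
    preds = torch.stack([m(x).squeeze(-1) for m in ensemble])  # (members, batch)

mean_pred = preds.mean(dim=0)
uncertainty = preds.std(dim=0)  # wider spread -> less certain prediction
```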

Discussion

This review found that the uncertainty inherent in deep learning predictions is most commonly estimated for medical imaging applications using Monte Carlo dropout methods on convolutional neural networks. In addition, unique model architectures and uncertainty estimation methods can apply to non-pixel features, simultaneously improving predictive performance (presumably by mitigating risk for overfitting, in the case of Monte Carlo dropout) while accurately estimating uncertainty. Unsurprisingly, for medical imaging applications, larger datasets of training images were associated with greater predictive performance [15,21,29–38]. We could not perform meta-analyses on predictive performance or uncertainty estimations because performance metrics and methods for quantifying uncertainty were heterogeneous, despite relative homogeneity in model architectures, which were primarily based on convolutional neural networks, and in methods for estimating uncertainty, which were primarily based on Monte Carlo dropout [14]. Uncertainty estimations for non-imaging applications were both sparse and heterogeneous, yet the weight of evidence suggests that a variety of methods can estimate uncertainty in predictions on non-pixel features, offering greater performance and reasonably accurate uncertainty estimations. Conformal prediction also demonstrated efficacy in uncertainty estimation, is easy to interpret (e.g., at a confidence level of 80%, at least 80% of the predicted confidence intervals contain the true value), and applies not only to deep learning but also to other machine learning approaches such as random forest modeling.

For both imaging and non-imaging applications, uncertainty estimations are poised to augment clinical application by identifying rare but potentially important misclassifications made by deep learning models. First, mistrust of machine learning predictions must be overcome. Model explainability, interpretability, and consistency with logic, scientific evidence, and domain knowledge are critically important in building trust [7,8]. Yet, even when a model is easy to understand, generates predictions consistent with medical knowledge, and has 90% overall accuracy, patients and providers may wonder: is this prediction among the 1 in 10 that is incorrect? Can the model tell me whether it is certain or uncertain of this particular prediction? To address these questions and build trust, it seems prudent to include model uncertainty estimations in shared decision-making processes. Therefore, we believe that uncertainty estimations are a critical element in the safe, effective clinical implementation of deep learning in healthcare. In performing this review, we sought to summarize evidence regarding the efficacy of uncertainty estimation in building trust in deep learning among patients, caregivers, and clinicians, but we found little evidence thereof. Therefore, we propose uncertainty-aware deep learning as a novel approach to building trust.

We found no previous systematic or scoping reviews on the same topic, though several authors have described important components of estimating uncertainty in deep learning predictions. Common statistical measures of spread (e.g., standard deviation and interquartile range) are undefined for single point predictions. Entropy, however, does apply to probability distributions. Therefore, most uncertainty estimation methods generate probability distributions around point estimations. Monte Carlo dropout, as originally described by Gal and Ghahramani, offers an elegant solution [14]. During testing, multiple stochastic predictions are generated from a given network for which different neurons have dropped out with specified probability. This dropout rate is calibrated during model development according to training data sparsity and model complexity. When training, dropping out different sets of neurons at different steps harbors the additional advantage of mitigating overfitting. When testing, each forward pass uses a different set of neurons; therefore, the outcome is an ensemble of different network architectures that can be represented as a posterior distribution. Variance across the distribution of predictions can be analyzed by several methods (e.g., entropy, variation ratios, standard deviation, mutual information). High variance suggests high uncertainty; low variance suggests low uncertainty.
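The variance summaries listed above can be computed directly from the stack of stochastic predictions. The sketch below, assuming NumPy, shows predictive entropy and mutual information; the simulated probabilities stand in for repeated dropout forward passes such as those in the earlier Monte Carlo dropout sketch.

```python
# Hypothetical sketch: summarizing an MC-dropout posterior with entropy and
# mutual information. `probs` simulates (passes, batch, classes) probabilities.
import numpy as np

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(2), size=(50, 1))  # 50 stochastic passes, 1 case

mean_probs = probs.mean(axis=0)
eps = 1e-12

# Predictive entropy: total uncertainty of the averaged prediction.
predictive_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)

# Expected entropy of individual passes; subtracting it isolates the
# disagreement between passes (mutual information, an epistemic signal).
expected_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
mutual_information = predictive_entropy - expected_entropy
```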

This review was limited by heterogeneity in model performance metrics and methods for quantifying uncertainty. To identify the optimal methods for estimating uncertainty in deep learning predictions, it would be necessary to perform a meta-analysis or comparative effectiveness analyses. This would be facilitated by achieving consensus regarding core performance and uncertainty metrics. The field of deep learning uncertainty estimation is maturing rapidly; it would be advantageous to establish reporting guidelines, as has been done for prediction modeling, causal inference, and machine learning trials [39–42]. Finally, beyond uncertainty estimations, it may be useful to quantify how similar an individual patient is to other patients in the training data, so that users can understand whether uncertainty is attributable to variability in outcomes relative to similar features in the training data or due to a patient having outlier features that are not well represented in the training data.
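One simple way the similarity idea could be operationalized is sketched below, assuming NumPy; the feature matrix, dimensionality, and choice of a Mahalanobis distance are illustrative, not a method drawn from the included studies.

```python
# Hypothetical sketch: flag patients whose features lie far from the training
# distribution using a Mahalanobis distance on (illustrative) tabular features.
import numpy as np

rng = np.random.default_rng(2)
train_features = rng.normal(size=(1000, 8))  # stand-in for training-set features
new_patient = rng.normal(size=8)

mu = train_features.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_features, rowvar=False))

diff = new_patient - mu
distance = float(np.sqrt(diff @ cov_inv @ diff))
# A large distance suggests the patient is an outlier relative to the training
# data, so model uncertainty may be driven by unfamiliar features.
```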

Conclusions

For convolutional neural network predictions on medical images, Monte Carlo dropout methods accurately estimate uncertainty. For non-imaging applications, limited evidence suggests that several uncertainty estimation methods can improve predictive performance and accurately estimate uncertainty. Using uncertainty estimations to gain the trust of patients and clinicians is a novel concept that warrants empirical investigation. The rapid maturation of deep learning uncertainty estimation in the medical literature could be facilitated by achieving consensus regarding performance and uncertainty metrics and standardizing reporting guidelines. Once standardized and validated, uncertainty estimates have the potential to identify rare but important misclassifications made by deep learning models in clinical settings, augmenting shared decision-making processes toward improved healthcare delivery.

Supporting information

S1 PRISMA Checklist. Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist.

https://doi.org/10.1371/journal.pdig.0000085.s001

(DOCX)

Acknowledgments

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  1. Shickel B, Loftus TJ, Adhikari L, Ozrazgat-Baslanti T, Bihorac A, Rashidi P. DeepSOFA: A Continuous Acuity Score for Critically Ill Patients using Clinically Interpretable Deep Learning. Sci Rep. 2019;9(1):1879. pmid:30755689
  2. Tiwari P, Colborn KL, Smith DE, Xing F, Ghosh D, Rosenberg MA. Assessment of a Machine Learning Model Applied to Harmonized Electronic Health Record Data for the Prediction of Incident Atrial Fibrillation. JAMA Netw Open. 2020;3(1):e1919396. pmid:31951272
  3. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8. pmid:28117445
  4. McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89–94. pmid:31894144
  5. Stubbs K, Hinds PJ, Wettergreen D. Autonomy and common ground in human-robot interaction: A field study (vol 22, pg 42, 2007). IEEE Intell Syst. 2007;22(3):3–.
  6. Linegang MP, Stoner HA, Patterson MJ, Seppelt BD, Hoffman JD, Crittendon ZB, et al. Human-Automation Collaboration in Dynamic Mission Planning: A Challenge Requiring an Ecological Approach. Proceedings of the Human Factors and Ergonomics Society Annual Meeting. 2006;50(23):2482–6.
  7. Miller T. Explanation in artificial intelligence: Insights from the social sciences. Artif Intell. 2019;267:1–38.
  8. Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. In: Finale D-V, Jim F, Ken J, David K, Rajesh R, Byron W, et al., editors. Proceedings of the 4th Machine Learning for Healthcare Conference; Proceedings of Machine Learning Research: PMLR; 2019. p. 359–80.
  9. Rosenfeld A, Zemel R, Tsotsos J. The Elephant in the Room. arXiv:1808.03305 [cs.CV]. 2018.
  10. Moons KG, de Groot JA, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med. 2014;11(10):e1001744. pmid:25314315
  11. De Vries H, Elliott MN, Kanouse DE, Teleki SS. Using pooled kappa to summarize interrater agreement across many items. Field Methods. 2008;20(3):272–82.
  12. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. pmid:843571
  13. Hullermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach Learn. 2021;110(3):457–506.
  14. Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: Maria Florina B, Kilian QW, editors. Proceedings of The 33rd International Conference on Machine Learning; Proceedings of Machine Learning Research: PMLR; 2016. p. 1050–9.
  15. Herzog L, Murina E, Durr O, Wegener S, Sick B. Integrating uncertainty in deep neural networks for MRI based stroke analysis. Med Image Anal. 2020;65:101790. pmid:32801096
  16. Qin Y, Liu Z, Liu C, Li Y, Zeng X, Ye C. Super-Resolved q-Space deep learning with uncertainty quantification. Med Image Anal. 2021;67:101885. pmid:33227600
  17. Tanno R, Worrall DE, Kaden E, Ghosh A, Grussu F, Bizzi A, et al. Uncertainty modelling in deep learning for safer neuroimage enhancement: Demonstration in diffusion MRI. Neuroimage. 2021;225:117366. pmid:33039617
  18. Valiuddin M, Viviers CG, van Sloun RJ, Sommen Fvd. Improving Aleatoric Uncertainty Quantification in Multi-annotated Medical Image Segmentation with Normalizing Flows. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Perinatal Imaging, Placental and Preterm Image Analysis: Springer; 2021. p. 75–88.
  19. Wieslander H, Harrison PJ, Skogberg G, Jackson S, Friden M, Karlsson J, et al. Deep Learning With Conformal Prediction for Hierarchical Analysis of Large-Scale Whole-Slide Tissue Images. IEEE J Biomed Health Inform. 2021;25(2):371–80. pmid:32750907
  20. Athanasiadis C, Hortal E, Asteriadis S. Audio-visual domain adaptation using conditional semi-supervised Generative Adversarial Networks. Neurocomputing. 2020;397:331–44.
  21. Graham MS, Sudre CH, Varsavsky T, Tudosiu P-D, Nachev P, Ourselin S, et al. Hierarchical brain parcellation with uncertainty. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis: Springer; 2020. p. 23–31.
  22. Ktena SI, Parisot S, Ferrante E, Rajchl M, Lee M, Glocker B, et al., editors. Distance metric learning using graph convolutional networks: Application to functional brain networks. International Conference on Medical Image Computing and Computer-Assisted Intervention; 2017: Springer.
  23. Sedghi A, Kapur T, Luo J, Mousavi P, Wells WM. Probabilistic image registration via deep multi-class classification: characterizing uncertainty. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging and Clinical Image-Based Procedures: Springer; 2019. p. 12–22.
  24. Cortes-Ciriano I, Bender A. Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks. J Chem Inf Model. 2019;59(3):1269–81. pmid:30336009
  25. Scalia G, Grambow CA, Pernici B, Li YP, Green WH. Evaluating Scalable Uncertainty Estimation Methods for Deep Learning-Based Molecular Property Prediction. J Chem Inf Model. 2020;60(6):2697–717. pmid:32243154
  26. Zhang J, Norinder U, Svensson F. Deep Learning-Based Conformal Prediction of Toxicity. J Chem Inf Model. 2021;61(6):2648–57. pmid:34043352
  27. Cortes-Ciriano I, Bender A. Reliable Prediction Errors for Deep Neural Networks Using Test-Time Dropout. J Chem Inf Model. 2019;59(7):3330–9. pmid:31241929
  28. Teng X, Pei S, Lin YR. StoCast: Stochastic Disease Forecasting with Progression Uncertainty. IEEE J Biomed Health Inform. 2020;PP.
  29. Carneiro G, Pu LZCT, Singh R, Burt A. Deep learning uncertainty and confidence calibration for the five-class polyp classification from colonoscopy. Med Image Anal. 2020;62. pmid:32172037
  30. Hu X, Guo R, Chen J, Li H, Waldmannstetter D, Zhao Y, et al. Coarse-to-Fine Adversarial Networks and Zone-Based Uncertainty Analysis for NK/T-Cell Lymphoma Segmentation in CT/PET Images. IEEE J Biomed Health Inform. 2020;24(9):2599–608. pmid:32054593
  31. Ayhan MS, Kuhlewein L, Aliyeva G, Inhoffen W, Ziemssen F, Berens P. Expert-validated estimation of diagnostic uncertainty for deep neural networks in diabetic retinopathy detection. Med Image Anal. 2020;64:101724. pmid:32497870
  32. Cao X, Chen H, Li Y, Peng Y, Wang S, Cheng L. Uncertainty Aware Temporal-Ensembling Model for Semi-Supervised ABUS Mass Segmentation. IEEE Trans Med Imaging. 2021;40(1):431–43. pmid:33021936
  33. Wang X, Tang F, Chen H, Luo L, Tang Z, Ran AR, et al. UD-MIL: Uncertainty-Driven Deep Multiple Instance Learning for OCT Image Classification. IEEE J Biomed Health Inform. 2020;24(12):3431–42. pmid:32248132
  34. Araujo T, Aresta G, Mendonca L, Penas S, Maia C, Carneiro A, et al. DR|GRADUATE: Uncertainty-aware deep learning-based diabetic retinopathy grading in eye fundus images. Med Image Anal. 2020;63:101715. pmid:32434128
  35. Edupuganti V, Mardani M, Vasanawala S, Pauly J. Uncertainty Quantification in Deep MRI Reconstruction. IEEE Trans Med Imaging. 2021;40(1):239–50. pmid:32956045
  36. Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for Multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557. pmid:31677438
  37. Natekar P, Kori A, Krishnamurthi G. Demystifying Brain Tumor Segmentation Networks: Interpretability and Uncertainty Analysis. Front Comput Neurosci. 2020;14:6. pmid:32116620
  38. Seebock P, Orlando JI, Schlegl T, Waldstein SM, Bogunovic H, Klimscha S, et al. Exploiting Epistemic Uncertainty of Anatomy Segmentation for Anomaly Detection in Retinal OCT. IEEE Trans Med Imaging. 2020;39(1):87–98. pmid:31170065
  39. Rivera SC, Liu XX, Chan AW, Denniston AK, Calvert MJ, SPIRIT-AI and CONSORT-AI Working Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit Health. 2020;2(10):E549–E60. pmid:33328049
  40. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, SPIRIT-AI and CONSORT-AI Working Group, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364–74. pmid:32908283
  41. Leisman DE, Harhay MO, Lederer DJ, Abramson M, Adjei AA, Bakker J, et al. Development and Reporting of Prediction Models: Guidance for Authors From Editors of Respiratory, Sleep, and Critical Care Journals. Crit Care Med. 2020;48(5):623–33. pmid:32141923
  42. Lederer DJ, Bell SC, Branson RD, Chalmers JD, Marshall R, Maslove DM, et al. Control of Confounding and Reporting of Results in Causal Inference Studies. Guidance for Authors from Editors of Respiratory, Sleep, and Critical Care Journals. Ann Am Thorac Soc. 2019;16(1):22–8. pmid:30230362