Evaluation of Prediction Models for Decision-Making: Beyond Calibration and Discrimination

Lars Holmberg and Andrew Vickers discuss the importance of ensuring prediction models lead to better decision making in light of new research into breast, endometrial, and ovarian cancer risk by Ruth Pfeiffer and colleagues. Please see later in the article for the Editors' Summary

In this week's issue of PLOS Medicine, Ruth Pfeiffer and colleagues present risk prediction models for breast, endometrial, and ovarian cancer [1]. Improvement of existing models and a new model for endometrial cancer can, as the authors say, be useful for several purposes. However, the paper also raises issues about the challenges of model improvement, interpretation, and application to public health and to clinical decision-making.
Ruth Pfeiffer and colleagues present models for absolute risks and thereby avoid the common mistake of proclaiming a substantial relative risk as clinically relevant without considering the background risk. For example, a relative risk of 3.0 corresponding to a risk increase from 1% to 3% may have quite different implications than an increase from 10% to 30% [2].
The key claim of the paper is that the models ''may assist in clinical decisionmaking.'' While the examples in the paper predominately concern prevention, rather than what many readers would intuitively think of as clinical decision-making-situations such as primary treatment of early prostate cancer or choice of adjuvant chemotherapy for early breast cancerthe emphasis on decision-making is laudable. What we want from models is that they help us make better decisions, leading to better outcomes for our patients. This raises the question of how to evaluate whether a model does indeed improve decision-making.
As the authors state, good calibration is essential for good decision-making. A model is well calibrated if, for every 100 individuals given a risk of x%, close to x will indeed have the event of interest. Calibration concerns average risk in a population and a well-calibrated model may assist in prevention decisions, but a miscalibrated model may lead to situations where an individual at high risk is assigned a low predicted probability, and thus forgoes effective preventive intervention. However, calibration is necessary but not sufficient for clinical utility, as the example of mammography screening shows: breast cancer risk prediction models are rarely used to determine eligibility for screening, which is instead based predominately on age, because very large differences in risk between women would be needed to justify separating women into higher versus lower intensities of mammography.
The statistical measure of how well a model separates risk is known as discrimination. But traditional analyses of risk factors are, on their own, not well suited to discriminate prognostic groups in a way that is useful for clinical decision-making [3,4]. Discrimination is often described in terms of the area under the receiver operating characteristic curve (AUC), taken from the receiver operating characteristics. The AUC is often a useful first step in evaluating a model or in comparing two diagnostic or prognostic models against each other. But like calibration, the AUC value is insufficient to demonstrate that a model would improve decision-making [5]. The calculation of AUC assumes that sensitivity is of equal value to specificity, whereas typically the consequences of a false negative (such as a missed cancer) are dramatically different from those of a false positive (such as an unnecessary biopsy). One example is a classifier of aggressive prostate cancer associated with a clearly elevated relative risk of lethal cancer that has an AUC statistically significantly over 0.5, but that still has an unacceptable rate of false negatives that could imply missed treatment opportunities for the ranges where it is reasonable to use [6].
The paper by Pfeiffer and colleagues raises the critical issue of how we should determine the clinical utility of a model, whether it changes decisions, and whether Ruth Pfeiffer and colleagues describe models to calculate absolute risks for breast, endometrial, and ovarian cancers for white, nonhispanic women over 50 years old using easily obtainable risk factors. those decisions are good ones. This is an issue that touches a variety of different areas in medical prediction, including comparisons of models and the value of novel molecular markers. Recent years have seen numerous methodological developments, going above and beyond a clear recommendation that clinical utility should be formally assessed [7] to actual statistical techniques for doing so [8,9]. One of us (A. V.) developed a method for evaluating prediction models called decision curve analysis [10], a straightforward technique with readily available software (http://www.decisioncurveanalysis.org).
Thus, there are now quantitative techniques available that can determine whether a model does more good than harm given reasonable assumptions about the consequences of false negatives com-pared to false positives. This takes us substantially further than the (unsolved) debate about whether model evaluation should prioritize calibration or discrimination [5,[11][12][13][14]. Use of novel decision analytic techniques can also avoid the sort of problems raised by statements such as ''[w]ell-calibrated risk models, even those with modest discriminatory accuracy, have public health applications'' [1]: it is difficult to know what counts as ''modest'' discrimination, and how much discrimination would have to improve to outweigh a given level of miscalibration. For instance, the discrimination estimated by Pfeiffer and colleagues as judged by the AUC would, to many, seem weak rather than modest.
The paper by Pfeiffer and colleagues is one of many current papers illustrating the need for quantitative evaluation of the clinical value of prediction models, needed to arm us to transfer rapidly growing medical knowledge to sound decisionmaking to the benefit of patients. Hopefully, we can have models and model evaluations that illuminate the whole spectrum, from public health decisions for groups of people to the vision of individualized medicine with individually tailored treatments.

Author Contributions
Wrote the first draft of the manuscript: LH. Contributed to the writing of the manuscript: LH AV. ICMJE criteria for authorship read and met: LH AV.