Figure 1.
An example of outputs for two probabilistic classifiers and their ROC curves, which do not evaluate calibration.
In (a) and (b), # indicates the observation index, and each observation is shown with its class membership and its probability estimate. In (c) and (d), each red circle corresponds to a threshold value. Note that probabilistic classifier B has the same ROC curve as probabilistic classifier A, but their calibration differs dramatically: the estimates from B are ten times lower than those from A.
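The contrast between classifiers A and B follows from the fact that the ROC curve depends only on the ranking of the estimates, not on their values. The short sketch below (with made-up labels and estimates, not the ones in the figure) shows that dividing every estimate by ten leaves the AUC unchanged while pulling the average estimate far below the observed event rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                            # observed class labels
p_a = np.array([0.20, 0.40, 0.30, 0.65, 0.70, 0.80, 0.90, 0.60])  # classifier A's estimates
p_b = p_a / 10.0                                                   # classifier B: same ranking, smaller values

print(roc_auc_score(y, p_a))              # 0.9375
print(roc_auc_score(y, p_b))              # 0.9375: identical ROC/AUC
print(p_a.mean(), p_b.mean(), y.mean())   # B's average estimate is far from the 0.5 event rate
```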
Figure 2.
The horizontal line shows the sorted probabilistic estimates, or "scores" s. In (a) and (b), we show the ROC curve and the AUC for a classifier built from an artificial dataset. In (c) and (d), we show concordant and discordant pairs, where concordant means that the estimate for a positive observation is higher than the estimate for a negative one. The AUC can be interpreted in the same way as the c-index: the proportion of concordant pairs. Note that x_i corresponds to an observation, s_i represents its predicted score, and y_i represents its observed class label, i.e., the gold standard. The AUC is calculated as the fraction of concordant pairs out of the total number of instance pairs in which one element is positive and the other is negative. Note that 1[·] is the indicator function.
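Written out with this notation (P and N here denote the index sets of positive and negative observations; this is generic notation, not necessarily the symbols used in the original figure), the c-index form of the AUC is:

\[
\mathrm{AUC} \;=\; \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N} \mathbf{1}\!\left[\, s_i > s_j \,\right]
\]

When ties are counted as half a concordance, the summand becomes 1[s_i > s_j] + ½·1[s_i = s_j].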
Figure 3.
Reliability diagrams and the two types of Hosmer-Lemeshow (HL) tests.
In (a), (b), and (c), we visually illustrate the reliability diagram and the groupings used for the HL-H test and the HL-C test, respectively.
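As a rough guide to how the two groupings differ, here is a minimal sketch (our own, not the authors' implementation), assuming 10 groups and the textbook form of the HL chi-square statistic; the names hl_statistic, hl_c, and hl_h are ours.

```python
import numpy as np

def hl_statistic(y, p, groups):
    """Hosmer-Lemeshow chi-square, given a group index for every observation.
    Assumes each group's mean predicted probability is strictly between 0 and 1."""
    stat = 0.0
    for g in np.unique(groups):
        idx = groups == g
        n_g = idx.sum()
        obs = y[idx].sum()        # observed positives in group g
        exp = p[idx].sum()        # expected positives = sum of the estimates
        pi_g = exp / n_g          # mean predicted probability in group g
        stat += (obs - exp) ** 2 / (n_g * pi_g * (1.0 - pi_g))
    return stat

def hl_c(y, p, g=10):
    """HL-C grouping: groups of roughly equal size, cut at percentiles of the estimates."""
    cuts = np.quantile(p, np.linspace(0, 1, g + 1)[1:-1])
    return hl_statistic(y, p, np.digitize(p, cuts))

def hl_h(y, p, g=10):
    """HL-H grouping: fixed cut-points on the probability scale (0-0.1, 0.1-0.2, ...)."""
    groups = np.minimum((p * g).astype(int), g - 1)
    return hl_statistic(y, p, groups)

# The p-value comes from a chi-square distribution with g - 2 degrees of
# freedom, e.g. scipy.stats.chi2.sf(hl_c(y, p), 8) for 10 groups.
```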
Figure 4.
Discrimination plots (ROC curves) and Calibration plots for simulated models.
(a) Perfect discrimination (i.e., AUC = 1) requires a classifier with perfect dichotomous predictions, which in the calibration plot has only one point at (0,0) for negative observations and one point at (1,1) for positive observations. (b) Poor discrimination (i.e., AUC = 0.53 ± 0.02) and poor calibration (i.e., HL statistics of 251.27 and 65.2, both with p < 1e−10). (c) Good discrimination (i.e., AUC = 0.83 ± 0.03) and excellent calibration (i.e., HL statistics of 10.02 and 4.42, with p = 0.26 and 0.82). (d) Excellent discrimination (i.e., AUC = 0.96 ± 0.01) and mediocre calibration (i.e., HL statistics of 34.46 and 2.77, with p ≈ 0 and 0.95). Note that an HL statistic smaller than 13.36 indicates that the model fits well at the significance level of 0.1.
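The cutoff of 13.36 quoted above, and the reported p-values, can be checked against a chi-square distribution with 8 degrees of freedom (the usual reference for a 10-group HL test); the snippet below is our own sanity check, not part of the figure.

```python
from scipy.stats import chi2

print(chi2.ppf(0.90, df=8))    # ~13.36: HL values below this are not rejected at alpha = 0.1
print(chi2.sf(10.02, df=8))    # ~0.26: the "excellent calibration" case in panel (c)
print(chi2.sf(34.46, df=8))    # ~3e-5: the rejected HL statistic in panel (d)
```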
Figure 5.
Tradeoffs between calibration and discrimination.
(a) Perfect calibration may harm discrimination under a three-group binning. The numbers above each bar indicate the percentages of negative observations (green) and positive observations (orange) in each prediction group (0–0.33, 0.33–0.67, and 0.67–1). Note that the small red arrows in the left figure indicate discordant pairs, in which negative observations are ranked higher than positive observations. (b) Enforcing discrimination may also hurt calibration. The blue curve and error bars correspond to the AUC, while the green curve and error bars represent the p-values of the Hosmer-Lemeshow C (HL-C) test. Initially, as discrimination increases, the p-value of the HL-C test (calibration) also increases, but it quickly drops after hitting the global maximum. We use red arrows in Figure 5(b) to indicate the location of optimal calibration and discrimination for the simulated data.
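To make the point in (a) concrete, here is a toy example with made-up labels and scores (not the simulation behind the figure): replacing each score with the observed event rate of its three-group bin gives estimates that are calibrated in-sample, but the ties introduced within bins lower the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1, 0])                                # observed labels
s = np.array([0.05, 0.15, 0.25, 0.40, 0.45, 0.55, 0.60, 0.75, 0.85, 0.95])  # original scores

bins = np.digitize(s, [1/3, 2/3])                            # groups 0-0.33, 0.33-0.67, 0.67-1
calibrated = np.array([y[bins == b].mean() for b in bins])   # within-bin event rates

print(roc_auc_score(y, s))            # 0.68 with the original scores
print(roc_auc_score(y, calibrated))   # 0.64 after the three-group "perfect" calibration
```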
Table 1.
Confusion matrix of a classifier based on the gold standard of class labels.
Table 2.
Details of the training and test datasets in our first experiment.
Figure 6.
Performance comparison between models using GSE2034 and GSE2990.
Table 3.
Details of the training and test datasets in our second experiment.
Figure 7.
Performance comparison among three different models using the breast cancer datasets.