Figure 1.
An example of outputs for two probabilistic classifiers and their ROC curves, which do not evaluate calibration.
In (a) and (b), # indicates the observation index, and each observation is shown with its class membership and its probability estimate. In (c) and (d), each red circle corresponds to a threshold value. Note that probabilistic classifier B has the same ROC curve as probabilistic classifier A, but their calibration differs dramatically: the estimates from B are ten times lower than those from A.
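The contrast between classifiers A and B follows from the fact that the ROC curve depends only on the ranking of the estimates, not on their values. The short sketch below (with made-up labels and estimates, not the ones in the figure) shows that dividing every estimate by ten leaves the AUC unchanged while pulling the average estimate far below the observed event rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                            # observed class labels
p_a = np.array([0.20, 0.40, 0.30, 0.65, 0.70, 0.80, 0.90, 0.60])  # classifier A's estimates
p_b = p_a / 10.0                                                   # classifier B: same ranking, smaller values

print(roc_auc_score(y, p_a))              # 0.9375
print(roc_auc_score(y, p_b))              # 0.9375: identical ROC/AUC
print(p_a.mean(), p_b.mean(), y.mean())   # B's average estimate is far from the 0.5 event rate
```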
Figure 2.
The horizontal line shows the sorted probabilistic estimates, or "scores" s. In (a) and (b), we show the ROC curve and the AUC for a classifier built from an artificial dataset. In (c) and (d), we show concordant and discordant pairs, where concordant means that the estimate for a positive observation is higher than the estimate for a negative one. The AUC can be interpreted in the same way as the c-index: the proportion of concordant pairs. Note that x_i corresponds to an observation, s_i represents its predicted score, and y_i represents its observed class label, i.e., the gold standard. The AUC is calculated as the fraction of concordant pairs out of the total number of instance pairs in which one element is positive and the other is negative. Note that 1[·] is the indicator function.
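Written out with this notation (P and N here denote the index sets of positive and negative observations; this is generic notation, not necessarily the symbols used in the original figure), the c-index form of the AUC is:

\[
\mathrm{AUC} \;=\; \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N} \mathbf{1}\!\left[\, s_i > s_j \,\right]
\]

When ties are counted as half a concordance, the summand becomes 1[s_i > s_j] + ½·1[s_i = s_j].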
Figure 3.
Reliability diagrams and the two types of Hosmer-Lemeshow (HL) tests.
In (a), (b), and (c), we visually illustrate the reliability diagram and the groupings used for the HL-H test and the HL-C test, respectively.
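As a rough guide to how the two groupings differ, here is a minimal sketch (our own, not the authors' implementation), assuming 10 groups and the textbook form of the HL chi-square statistic; the names hl_statistic, hl_c, and hl_h are ours.

```python
import numpy as np

def hl_statistic(y, p, groups):
    """Hosmer-Lemeshow chi-square, given a group index for every observation.
    Assumes each group's mean predicted probability is strictly between 0 and 1."""
    stat = 0.0
    for g in np.unique(groups):
        idx = groups == g
        n_g = idx.sum()
        obs = y[idx].sum()        # observed positives in group g
        exp = p[idx].sum()        # expected positives = sum of the estimates
        pi_g = exp / n_g          # mean predicted probability in group g
        stat += (obs - exp) ** 2 / (n_g * pi_g * (1.0 - pi_g))
    return stat

def hl_c(y, p, g=10):
    """HL-C grouping: groups of roughly equal size, cut at percentiles of the estimates."""
    cuts = np.quantile(p, np.linspace(0, 1, g + 1)[1:-1])
    return hl_statistic(y, p, np.digitize(p, cuts))

def hl_h(y, p, g=10):
    """HL-H grouping: fixed cut-points on the probability scale (0-0.1, 0.1-0.2, ...)."""
    groups = np.minimum((p * g).astype(int), g - 1)
    return hl_statistic(y, p, groups)

# The p-value comes from a chi-square distribution with g - 2 degrees of
# freedom, e.g. scipy.stats.chi2.sf(hl_c(y, p), 8) for 10 groups.
```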
Figure 4.
Discrimination plots (ROC curves) and Calibration plots for simulated models.
(a) Perfect discrimination (i.e., AUC = 1) requires a classifier with perfect dichotomous predictions, which in the calibration plot has only one point at (0,0) for negative observations and one point at (1,1) for positive observations. (b) Poor discrimination (i.e., AUC = 0.53 ± 0.02) and poor calibration (i.e., HL statistics of 251.27 and 65.2, both with p < 1e−10). (c) Good discrimination (i.e., AUC = 0.83 ± 0.03) and excellent calibration (i.e., HL statistics of 10.02 and 4.42, with p = 0.26 and 0.82). (d) Excellent discrimination (i.e., AUC = 0.96 ± 0.01) and mediocre calibration (i.e., HL statistics of 34.46 and 2.77, with p ≈ 0 and 0.95). Note that an HL statistic smaller than 13.36 indicates that the model fits well at the significance level of 0.1.
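The cutoff of 13.36 quoted above, and the reported p-values, can be checked against a chi-square distribution with 8 degrees of freedom (the usual reference for a 10-group HL test); the snippet below is our own sanity check, not part of the figure.

```python
from scipy.stats import chi2

print(chi2.ppf(0.90, df=8))    # ~13.36: HL values below this are not rejected at alpha = 0.1
print(chi2.sf(10.02, df=8))    # ~0.26: the "excellent calibration" case in panel (c)
print(chi2.sf(34.46, df=8))    # ~3e-5: the rejected HL statistic in panel (d)
```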
Figure 5.
Tradeoffs between calibration and discrimination.
(a) Perfect calibration may harm discrimination under a three-group binning. The numbers above each bar indicate the percentages of negative observations (green) and positive observations (orange) in each prediction group (0–0.33, 0.33–0.67, and 0.67–1). Note that the small red arrows in the left figure indicate discordant pairs, in which negative observations are ranked higher than positive observations. (b) Enforcing discrimination may also hurt calibration. The blue curve and error bars correspond to the AUC, while the green curve and error bars represent the p-values of the Hosmer-Lemeshow C (HL-C) test. Initially, as discrimination increases, the p-value of the HL-C test (calibration) also increases, but it quickly drops after hitting the global maximum. We use red arrows in Figure 5(b) to indicate the location of optimal calibration and discrimination for the simulated data.
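To make the point in (a) concrete, here is a toy example with made-up labels and scores (not the simulation behind the figure): replacing each score with the observed event rate of its three-group bin gives estimates that are calibrated in-sample, but the ties introduced within bins lower the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1, 0])                                # observed labels
s = np.array([0.05, 0.15, 0.25, 0.40, 0.45, 0.55, 0.60, 0.75, 0.85, 0.95])  # original scores

bins = np.digitize(s, [1/3, 2/3])                            # groups 0-0.33, 0.33-0.67, 0.67-1
calibrated = np.array([y[bins == b].mean() for b in bins])   # within-bin event rates

print(roc_auc_score(y, s))            # 0.68 with the original scores
print(roc_auc_score(y, calibrated))   # 0.64 after the three-group "perfect" calibration
```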
Table 1.
Confusion matrix of a classifier based on the gold standard of class labels.
Table 2.
Details of the training and test datasets in our first experiment.
Figure 6.
Performance comparison between models using GSE2034 and GSE2990.
Table 3.
Details of the training and test datasets in our second experiment.
Figure 7.
Performance comparison among three different models using the breast cancer datasets.