Recalibrating probabilistic forecasts of epidemics

doi:10.1371/journal.pcbi.1010771

Fig 1.

Densities of probability integral transform (PIT) distributions for five sample forecasters, when the true distribution is a standard normal.

Fig 2.

An illustration of recalibration.

The original, underconfident forecast density is f(y), while the true density is h(y). By calculating the PIT density g and producing a recalibrated forecast as the product g(F(y)) ⋅ f(y), where F is the forecast CDF, we recover the true h(y).
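
The identity in this caption can be checked numerically. The sketch below is illustrative only: the two normal distributions and the function names are assumptions, not values taken from the figure, and it uses the exact PIT density, whereas in practice g would have to be estimated from past PIT values.

```python
from statistics import NormalDist

# Assumed for illustration: the caption does not state these parameters.
forecast = NormalDist(mu=0.0, sigma=2.0)  # f and F: wider than the truth, so underconfident
truth = NormalDist(mu=0.0, sigma=1.0)     # h


def pit_density(u):
    """Density g of the PIT value U = F(Y) when Y is drawn from the true density h."""
    y = forecast.inv_cdf(u)
    return truth.pdf(y) / forecast.pdf(y)


def recalibrated_density(y):
    """Recalibrated forecast density g(F(y)) * f(y)."""
    return pit_density(forecast.cdf(y)) * forecast.pdf(y)


# The recalibrated density should match the true density at every point.
for y in (-1.5, 0.0, 0.7):
    print(y, recalibrated_density(y), truth.pdf(y))
```

Because g(u) = h(F⁻¹(u)) / f(F⁻¹(u)), the factor f(y) cancels and the product reduces to h(y), which is what the printed columns show up to floating-point error.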

Fig 3.

Mean log score, averaged over all forecasters, for the different recalibration methods.

A window size of k corresponds to training recalibration on forecasts within k weeks of the given forecast week (inclusive), where available. Log score is averaged over 9 seasons, 11 locations, and 29 weeks (higher log score is better). The largest window sizes slightly hurt the performance of the parametric model, and the smallest window sizes significantly hurt the nonparametric model. Averaged over all forecasters, the improvement in performance due to recalibration is roughly equal to the improvement gained by reducing the forecast horizon by a week.
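
One possible reading of the window rule is sketched below. The week labels 1 to 29 and the function name `training_weeks` are hypothetical, and the sketch assumes the window is symmetric and includes the forecast week itself.

```python
def training_weeks(forecast_week, k, weeks=range(1, 30)):
    """Weeks within k of forecast_week, inclusive, restricted to the weeks available."""
    return [w for w in weeks if abs(w - forecast_week) <= k]


print(training_weeks(1, 2))   # [1, 2, 3]: window truncated at the start of the season
print(training_weeks(15, 2))  # [13, 14, 15, 16, 17]: full window of 2k + 1 weeks
```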

Fig 4.

Improvement in mean log score, for the different recalibration methods.

Log score is averaged over all 27 FluSight forecasters, 9 seasons, 11 locations, and 29 weeks (higher log score is better). The ensemble recalibration method improves accuracy for every target.

Fig 5.

Proportion of forecasters for which recalibration improves mean log score (left) and entropy of the PIT values (right).

The ensemble method improves accuracy for all forecasters on the short-term targets, and for most forecasters on the seasonal targets. It also improves calibration (as measured by entropy) for most forecasters and most targets. The ensemble method outperforms both the nonparametric and parametric methods.

Fig 6.

Improvement in mean log score versus improvement in entropy for each of the 27 FluSight forecasters and short-term targets.

There is a clear linear trend (with slope approximately 1) between the improvement in calibration and the improvement in accuracy.

Fig 7.

Entropy and mean log score before and after recalibration, for each of the 27 FluSight forecasters and short-term targets.

The tail of each arrow represents a quantity before recalibration, and the head the same quantity after recalibration. The dotted lines show the central 90% interval of the entropy of a comparably sized sample of standard uniform random variables. For all but two forecasters (the eight bottom-most line segments), the ensemble recalibration method achieves almost perfect calibration, as evidenced by a near-zero PIT entropy, and this is accompanied by significant improvements in accuracy.

Fig 8.

Improvement in mean log score after recalibration, averaged over all 27 FluSight forecasters, by number of training seasons.

We perform three runs for each of the nine available seasons and n ∈ {1, 2, 4, 8}, where a run consists of randomly sampling n other seasons to train recalibration for each of the 27 FluSight forecasters. Each point in the plot is averaged over 9 × 3 = 27 runs. As expected, the parametric method is more robust to limited training data than the nonparametric method.
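
The sampling scheme described in this caption might be enumerated as in the sketch below. The season labels, the seed, and the function name `training_runs` are hypothetical; the recalibration and scoring steps are omitted.

```python
import random

SEASONS = list(range(9))  # hypothetical labels for the nine available seasons


def training_runs(seasons, n, runs_per_season=3, seed=0):
    """Return (evaluated_season, training_seasons) pairs for one value of n."""
    rng = random.Random(seed)
    runs = []
    for evaluated in seasons:
        others = [s for s in seasons if s != evaluated]
        for _ in range(runs_per_season):
            # Sample n of the other seasons, without replacement, to train recalibration.
            runs.append((evaluated, sorted(rng.sample(others, n))))
    return runs


for n in (1, 2, 4, 8):
    print(n, len(training_runs(SEASONS, n)))  # 9 seasons x 3 runs = 27 runs per n
```

With n = 8 every run trains on all of the other seasons, so the three runs for a given season coincide.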

Fig 9.

Mean log score for the two different approaches to recalibrating the FluSight ensemble forecaster, with C-E and E-C reflecting the order of recalibration and ensembling.

Both the C-E and E-C models outperform the original ensemble (with no recalibration), but ensembling followed by recalibration performs best. When forecast performance is viewed as a function of time, recalibration improves performance by about as much as roughly two days' time would.
