Recalibrating probabilistic forecasts of epidemics

Distributional forecasts are important for a wide variety of applications, including forecasting epidemics. Often, forecasts are miscalibrated, or unreliable in assigning uncertainty to future events. We present a recalibration method that can be applied to a black-box forecaster given retrospective forecasts and observations, as well as an extension that makes this method more effective in recalibrating epidemic forecasts. This method is guaranteed to improve calibration and log score performance when trained and measured in-sample. We also prove that the increase in expected log score of a recalibrated forecaster is equal to the negative entropy of the original forecaster's PIT distribution. We apply this recalibration method to the 27 influenza forecasters in the FluSight Network and show that recalibration reliably improves forecast accuracy and calibration. This method, available on GitHub, is effective, robust, and easy to use as a post-processing tool to improve epidemic forecasts.
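As an illustrative sketch of the core idea (not the implementation released on GitHub), one simple nonparametric variant composes the original forecast CDF F with the empirical CDF G of retrospective PIT values, F_r = G ∘ F. The Python below assumes a binned forecast representation; the function names and the clipping constant are ours for illustration, and the manuscript's parametric and smoothed variants differ in how G is estimated.

    import numpy as np

    def recalibrate_binned(bin_probs, train_pit):
        """Recalibrate a binned forecast by composing its CDF with the empirical
        CDF of retrospective PIT values: F_r = G o F (nonparametric sketch)."""
        train_pit = np.sort(np.asarray(train_pit))
        G = lambda u: np.searchsorted(train_pit, u, side="right") / len(train_pit)
        cdf = np.cumsum(bin_probs)                  # original forecast CDF at bin edges
        cdf_recal = G(cdf)                          # recalibrated CDF
        probs = np.diff(np.concatenate(([0.0], cdf_recal)))
        probs = np.clip(probs, 1e-10, None)         # avoid zero bins when log scoring
        return probs / probs.sum()

Intuitively, if a forecaster's retrospective PIT values cluster near 0.5 (underconfidence), the implied PIT density is large near 0.5, and this transform concentrates more probability mass near the forecast median, sharpening the forecast.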

For illustration, we provide the densities of the PIT distributions for each of the 27 FluSight forecasters and four short-term targets, before recalibration (Fig S1) and after recalibration (Fig S2). The original forecasters' PIT distributions fall mostly into one of two categories: underconfident with a mode around 0.5, and overconfident with a minimum around 0.5 and peaks at 0 and 1. The outlier with a peak around 0.1 is the PIT distribution of the uniform forecaster. The recalibrated forecasters' PIT distributions are mostly flat, indicating that the PIT values are distributed nearly uniformly.
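For reference, a PIT value for a binned forecast can be computed as in the sketch below. The uniform draw within the observed bin (a "randomized PIT") is one common convention for discrete forecasts and is our assumption here, not necessarily the convention used in the manuscript.

    import numpy as np

    def pit_value(bin_probs, obs_bin, rng=np.random.default_rng(0)):
        """Forecast CDF evaluated at the observation; a uniform draw within the
        observed bin breaks ties for discrete (binned) forecasts."""
        cdf = np.cumsum(bin_probs)
        lower = cdf[obs_bin - 1] if obs_bin > 0 else 0.0
        return rng.uniform(lower, cdf[obs_bin])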

PIT Variance of Original and Recalibrated Forecasts
We show the variance of the PIT distributions of the original and recalibrated forecasts in Fig S3. The variance of the standard uniform distribution is 1/12, and a forecaster whose PIT values have a variance of 1/12 is referred to as neutrally dispersed. If the variance is greater than 1/12, the forecaster is underdispersed ("overconfident"), and if the variance is less than 1/12, the forecaster is overdispersed ("underconfident") [1]. After recalibration, nearly all forecasters converge to a PIT variance close to 1/12; overconfident forecasters generally remain slightly overconfident, and underconfident forecasters generally remain slightly underconfident.
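A minimal check of this classification is sketched below; the tolerance is an arbitrary illustration, not a threshold used in the manuscript.

    import numpy as np

    def dispersion_label(pit_values, tol=0.005):
        """Compare the sample variance of PIT values to 1/12 (the variance of a
        standard uniform) to label a forecaster's dispersion."""
        v = np.var(pit_values)
        if v > 1/12 + tol:
            return "underdispersed (overconfident)"
        if v < 1/12 - tol:
            return "overdispersed (underconfident)"
        return "approximately neutrally dispersed"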

Recalibration Performance in a True Retrospective Setting
While there are many seasons available for recalibration training in the FluSight Challenge, this may not be the case for other epidemics. In Fig S4, we show how the component recalibration methods perform in a true retrospective setting, in which they can train only on past seasons. The parametric method improves performance with just one season of training data, and the nonparametric method improves performance after three seasons of training data. Because influenza seasons can differ substantially, these results are subject to high variance and may not generalize. For example, comparing this figure to Fig 8 in the main manuscript, the improvement in 2011 is substantially higher than the average improvement after one training season in Fig 8, while the improvement in 2018 is substantially lower than the average improvement after eight training seasons. We speculate that this season-to-season variability is also the reason that improvement decreases after 2014, despite the increase in available training data.
Note that, for consistency, we trained only on the PIT values available within a 3-week window on either side of the target week. In a real application with only one season available, the bias-variance tradeoff would likely favor a larger window, and therefore better performance.
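A sketch of this training-set construction is given below, assuming the PIT history is stored as (season, week, pit) tuples; this data layout is hypothetical, and week wraparound across the new year is ignored for simplicity.

    def training_pits(history, target_season, target_week, window=3):
        """Collect PIT values from strictly earlier seasons whose forecast week
        lies within +/- `window` weeks of the current target week."""
        return [pit for season, week, pit in history
                if season < target_season and abs(week - target_week) <= window]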

Ensemble Weights
Our recalibration ensemble fits weights to three components: a parametric method, a nonparametric method, and a null method. In general, ensemble weights do not necessarily correlate with performance, and we find that to be the case here. We had expected poor forecasters to have higher weights on the nonparametric and parametric methods than good forecasters, because they rely more heavily on recalibration. However, the correlations between original forecaster performance and component weights are weak.
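For illustration, weights over a fixed set of components can be fit with standard EM updates, as sketched below. This is a generic mixture-weight sketch, not necessarily the fitting procedure used in the manuscript; component_probs[i, k] is assumed to hold the probability that component k assigned to the observed outcome of training forecast i.

    import numpy as np

    def fit_mixture_weights(component_probs, n_iter=200):
        """EM for the weights of a mixture with fixed components: the E-step
        computes responsibilities, the M-step averages them."""
        n_obs, n_comp = component_probs.shape
        w = np.full(n_comp, 1.0 / n_comp)
        for _ in range(n_iter):
            resp = w * component_probs
            resp /= resp.sum(axis=1, keepdims=True) + 1e-12   # guard against all-zero rows
            w = resp.mean(axis=0)
        return w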

Improvement in Mean Log Score and PIT Entropy on Seasonal Targets
As mentioned in the discussion, forecasts of seasonal targets are difficult to recalibrate. After the season onset and peak have been observed, the true value is essentially determined, subject only to data revisions. In that case, a forecaster will place 100% of its mass in the correct bin, resulting in a PIT distribution that is a Dirac delta distribution δ(0.5). Such a forecaster has a PIT entropy of −∞ but a perfect log score, which violates our earlier assumption that improving the PIT entropy through recalibration improves the log score. In practice, there is still a strong positive correlation between improvement in PIT entropy and improvement in mean log score, as shown in Fig S6. However, unlike the short-term targets, where the linear relationship had a slope of approximately 1, as theoretically expected, the slope for the seasonal targets is about 0.8. We suspect that the improvement in accuracy falls short of the theoretical expectation because of forecast behavior at the end of the season.
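As a sketch of why a slope of 1 is expected (in our notation, with f the forecast density, F its CDF, and g the density of the PIT F(Y)): the recalibrated density is f_r(x) = g(F(x)) f(x), so

E[log f_r(Y)] − E[log f(Y)] = E[log g(F(Y))] = ∫₀¹ g(u) log g(u) du = −H(g).

Provided recalibration brings the PIT distribution close to uniform (entropy ≈ 0), the expected log score therefore improves by the same amount that the PIT entropy rises. The seasonal-target edge case above breaks this reasoning because H(g) = −∞ for g = δ(0.5) even though the log score is already maximal.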
However, a comparison of Fig S7 and Fig 7 in the main manuscript shows that recalibration on seasonal targets is much less effective than on short-term targets: no forecaster achieves a PIT entropy close to that of a uniform distribution. We believe this is because, at the end of the season, the PIT distribution approaches δ(0.5), and composing any CDF transform with a Dirac delta distribution yields another Dirac delta distribution.

Fig S7 (caption, partial): … FluSight forecasters and seasonal targets. The tail of each arrow represents a quantity before recalibration, and the head after recalibration. The dotted lines show the central 90% interval of the entropy of a comparably sized sample of standard uniform random variables, for comparison. Recalibration is much less effective for seasonal targets than for short-term targets (compare to Fig 7 in the main manuscript).