Scoring epidemiological forecasts on transformed scales

doi:10.1371/journal.pcbi.1011393

Fig 1.

Numerical comparison of different measures of relative error: Absolute percentage error (APE), relative error (RE), symmetric absolute percentage error (SAPE) and the absolute error applied to log-transformed predictions and observations.

We denote the predicted value by ŷ and display errors as a function of the ratio of observed to predicted value. A: x-axis shown on a linear scale. B: x-axis shown on a logarithmic scale.
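
The four measures compared in Fig 1 can be evaluated numerically as follows. This is a minimal sketch, writing y for the observed and ŷ for the predicted value and assuming the common definitions APE = |y − ŷ| / |y|, RE = |y − ŷ| / |ŷ| and SAPE = |y − ŷ| / ((|y| + |ŷ|) / 2). These definitions are assumptions that should be checked against the main text, and the function name `relative_errors` is hypothetical.

```python
import math

def relative_errors(observed, predicted):
    """Return APE, RE, SAPE and the absolute error of the logs.

    Assumes strictly positive inputs so that the logarithm is defined.
    """
    abs_err = abs(observed - predicted)
    return {
        "APE": abs_err / abs(observed),
        "RE": abs_err / abs(predicted),
        "SAPE": abs_err / ((abs(observed) + abs(predicted)) / 2),
        "log_AE": abs(math.log(observed) - math.log(predicted)),
    }

# Observation at twice vs. half the prediction: under these definitions only
# SAPE and the log-scale absolute error treat the two cases symmetrically.
over = relative_errors(observed=200.0, predicted=100.0)
under = relative_errors(observed=50.0, predicted=100.0)
```

Under these assumed definitions, APE and RE swap values between the two cases, while SAPE and the absolute error on the log scale are unchanged when the ratio of observed to predicted value is inverted.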

Fig 2.

Expected CRPS scores as a function of the mean and variance of the forecast quantity.

We computed expected CRPS values for three different distributions, assuming an ideal forecaster with predictive distribution equal to the true underlying (data-generating) distribution. These expected CRPS values were computed for different predictive means based on 10,000 samples each and are represented by dots. Solid lines show the corresponding approximations of the expected CRPS from Eqs (16) and (17). S3 Fig shows the quality of the approximation in more detail. The first distribution (red) is a truncated normal distribution with constant variance (we chose σ = 1 in order to only obtain positive samples). The second (green) is a negative binomial distribution with size parameter θ = 10 and variance σ² = μ + 0.1μ². The third (blue) is a Poisson distribution with σ² = μ. To make the scores for the different distributions comparable, scores were normalised to one, meaning that the mean score for every distribution (red, green, blue) is one. A: Normalised expected CRPS for ideal forecasts with increasing means for three distributions with different relationships between mean and variance. Expected CRPS was computed on the natural scale (left), after applying a square-root transformation (middle), and after adding one and applying a log-transformation to the data (right). B: Same as A, but with x- and y-axes on the log scale.
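
To illustrate how the expected CRPS of an ideal forecaster depends on the mean, the sketch below treats the Poisson case only. The caption describes a sample-based computation; as a swapped-in alternative, this example uses the identity E[CRPS] = 0.5 · E|X − X′| = ∫ F(x)(1 − F(x)) dx, which for a count distribution reduces to a sum over the gaps between neighbouring (transformed) values. The function names and the truncation point are assumptions made for illustration, and no normalisation to one is applied.

```python
import math

def poisson_cdf_values(mu, upper):
    """Return [F(0), F(1), ..., F(upper)] for a Poisson(mu) distribution."""
    cdf, total = [], 0.0
    for k in range(upper + 1):
        # pmf evaluated in log space to avoid underflow for large mu
        total += math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
        cdf.append(min(total, 1.0))
    return cdf

def expected_crps_ideal(mu, transform=lambda x: x):
    """Expected CRPS of an ideal Poisson(mu) forecaster after `transform`.

    `transform` must be increasing on the non-negative integers.
    """
    upper = int(mu + 20 * math.sqrt(mu) + 50)  # truncation point (assumption)
    cdf = poisson_cdf_values(mu, upper)
    return sum(
        cdf[k] * (1.0 - cdf[k]) * (transform(k + 1) - transform(k))
        for k in range(upper + 1)
    )

for mu in (10, 100):
    natural = expected_crps_ideal(mu)
    root = expected_crps_ideal(mu, math.sqrt)
    log1p = expected_crps_ideal(mu, math.log1p)
    print(f"mu={mu}: natural={natural:.3f}, sqrt={root:.3f}, log(x+1)={log1p:.3f}")
```

For the Poisson distribution, the expected score grows with the mean on the natural scale, stays roughly constant after the square-root transformation, and shrinks after the log(x + 1) transformation.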

Fig 3.

Illustration of the effect of the log-transformation on the ranking for a single forecast.

Shown are CRPS (or WIS, respectively) values as a function of the observed value for two forecasters. Model A issues a geometric distribution (a negative binomial distribution with size parameter θ = 1) with mean μ = 10 and variance σ² = μ + μ² = 110, while Model B issues a Poisson distribution with mean and variance equal to 10. Zeroes in this illustrative example were handled by adding one before applying the natural logarithm.
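
The comparison of the two forecasters can be reproduced approximately with the sketch below. It evaluates the CRPS as the integral of (F(x) − 1{x ≥ y})² for a count distribution, either on the natural scale or after the log(x + 1) transformation. The function names and the truncation of the support at 2000 are assumptions made for illustration; the figure itself may have been produced differently.

```python
import math

def geometric_cdf(mu, upper):
    """CDF values F(0..upper) of a geometric distribution with mean mu
    (negative binomial with size 1), so that the variance is mu + mu**2."""
    p = 1.0 / (1.0 + mu)
    return [1.0 - (1.0 - p) ** (k + 1) for k in range(upper + 1)]

def poisson_cdf(mu, upper):
    """CDF values F(0..upper) of a Poisson(mu) distribution."""
    cdf, pmf, total = [], math.exp(-mu), 0.0
    for k in range(upper + 1):
        total += pmf
        cdf.append(min(total, 1.0))
        pmf *= mu / (k + 1)  # recursion pmf(k + 1) = pmf(k) * mu / (k + 1)
    return cdf

def crps_count(cdf, observed, transform=lambda x: x):
    """CRPS of a count forecast for an integer `observed` within the support,
    evaluated on the scale given by the increasing function `transform`."""
    return sum(
        (cdf[k] - (1.0 if k >= observed else 0.0)) ** 2
        * (transform(k + 1) - transform(k))
        for k in range(len(cdf))
    )

UPPER = 2000  # truncation of the support (assumption; tail mass is negligible)
model_a = geometric_cdf(10, UPPER)  # Model A: geometric, mean 10
model_b = poisson_cdf(10, UPPER)    # Model B: Poisson, mean 10

for y in (0, 10, 40):
    nat = (crps_count(model_a, y), crps_count(model_b, y))
    log = (crps_count(model_a, y, math.log1p), crps_count(model_b, y, math.log1p))
    print(f"y={y}: natural A/B = {nat[0]:.2f}/{nat[1]:.2f}, "
          f"log A/B = {log[0]:.2f}/{log[1]:.2f}")
```

Scanning `y` over a finer grid shows where the ranking of the two models changes on each scale.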

Fig 4.

Forecasts and scores for two-week-ahead predictions from the EuroCOVIDhub-ensemble made in Germany.

Missing values are due to data anomalies that were removed. A, E: 50% and 90% prediction intervals and observed values for cases and deaths on the natural scale. B, F: Corresponding scores. C, G: Forecasts and observations on the log scale. D, H: Corresponding scores.
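
As a rough illustration of how scores such as those in panels B, D, F and H can be obtained from a median and central prediction intervals, the sketch below computes the weighted interval score (WIS) and its dispersion, overprediction and underprediction components, once on the natural scale and once after applying log(x + 1) to forecasts and observation. The forecast values are made up, the function names are hypothetical, and attributing the median's absolute error to over- or underprediction is an assumption about the decomposition.

```python
import math

def weighted_interval_score(median, intervals, observed):
    """Return (wis, dispersion, overprediction, underprediction).

    `intervals` maps alpha to (lower, upper) for the central (1 - alpha)
    prediction interval, e.g. {0.5: (l, u)} for the 50% interval.
    """
    dispersion = 0.0
    # The absolute error of the median enters with weight 1/2.
    over = 0.5 * max(median - observed, 0.0)
    under = 0.5 * max(observed - median, 0.0)
    for alpha, (lower, upper) in intervals.items():
        weight = alpha / 2.0
        dispersion += weight * (upper - lower)
        over += weight * (2.0 / alpha) * max(lower - observed, 0.0)
        under += weight * (2.0 / alpha) * max(observed - upper, 0.0)
    norm = len(intervals) + 0.5
    dispersion, over, under = dispersion / norm, over / norm, under / norm
    return dispersion + over + under, dispersion, over, under

def log_scale(value):
    """Transformation applied before scoring: natural log after adding one."""
    return math.log(value + 1.0)

# Hypothetical forecast: median 100, 50% interval [90, 110], 90% interval [70, 130].
median, observed = 100.0, 140.0
intervals = {0.5: (90.0, 110.0), 0.1: (70.0, 130.0)}

natural = weighted_interval_score(median, intervals, observed)
logged = weighted_interval_score(
    log_scale(median),
    {a: (log_scale(l), log_scale(u)) for a, (l, u) in intervals.items()},
    log_scale(observed),
)
```

Because the transformation is monotone, transforming the quantiles of the forecast yields the quantiles of the forecast for the transformed quantity, so the same scoring function can be reused on both scales.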

Fig 5.

Observations and scores across locations and forecast horizons for the European COVID-19 Forecast Hub data.

Locations are sorted according to the mean observed value in that location. A: Average (across all time points) of observed cases and deaths for different locations. B: Corresponding boxplot (y-axis on log-scale) of all cases and deaths. C: Scores for two-week-ahead forecasts from the EuroCOVIDhub-ensemble (averaged across all forecast dates) for different locations, evaluated on the natural scale as well as after transforming counts by adding one and applying the natural logarithm. D: Corresponding boxplots of all individual scores of the EuroCOVIDhub-ensemble for two-week-ahead predictions. E: Boxplots for the relative change of scores for the EuroCOVIDhub-ensemble across forecast horizons. For any given forecast date and location, forecasts were made for four different forecast horizons, resulting in four scores. All scores were divided by the score for forecast horizon one. To enhance interpretability, the range of visible relative changes in scores (relative to horizon = 1) was restricted to [0.1, 10].
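
The normalisation behind panel E can be sketched as follows: scores are grouped by forecast date and location, and each score is divided by the horizon-one score of the same group. The data layout, the function name `relative_to_horizon_one` and the example values are assumptions for illustration only.

```python
from collections import defaultdict

def relative_to_horizon_one(scores):
    """Divide each score by the horizon-1 score of the same forecast.

    `scores` maps (forecast_date, location, horizon) to a score. Forecasts
    without a positive horizon-1 score are skipped.
    """
    by_forecast = defaultdict(dict)
    for (date, location, horizon), score in scores.items():
        by_forecast[(date, location)][horizon] = score
    relative = {}
    for (date, location), per_horizon in by_forecast.items():
        base = per_horizon.get(1)
        if base is None or base <= 0:
            continue
        for horizon, score in per_horizon.items():
            relative[(date, location, horizon)] = score / base
    return relative

# Made-up scores for one forecast date and location, horizons 1 to 4 weeks.
example = {
    ("2021-05-03", "DE", h): s
    for h, s in zip((1, 2, 3, 4), (50.0, 80.0, 140.0, 200.0))
}
rel = relative_to_horizon_one(example)
```

The restriction to [0.1, 10] mentioned in the caption concerns the visible axis range only and is therefore not part of this computation.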

Fig 6.

Mean WIS in different locations for different transformations applied before scoring.

Locations are sorted according to the mean observed value in that location. Shown are scores for two-week-ahead forecasts of the EuroCOVIDhub-ensemble. On the natural scale (with no transformation prior to applying the WIS), scores correlate strongly with the average observed value in a given location. The same is true for scores obtained after applying a square-root transformation, or after applying a log-transformation with a large offset a. For illustrative purposes, a was chosen to be 101,630 for cases and 530 for deaths, 10 times the respective median observed value. For large values of a, log(x + a) grows roughly linearly in x, meaning that we expect to observe the same patterns as in the case with no transformation. For decreasing values of a, we give more relative weight to scores in small locations.
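
The near-linearity of log(x + a) for a large offset follows from a first-order expansion of the logarithm, sketched here for 0 ≤ x ≪ a:

```latex
\log(x + a) = \log a + \log\left(1 + \frac{x}{a}\right) \approx \log a + \frac{x}{a}, \qquad x \ll a .
```

Differences between transformed forecasts and observations are then approximately the natural-scale differences divided by a, so scores behave roughly like natural-scale scores rescaled by 1/a. At the other extreme, when x ≫ a, log(x + a) ≈ log x and the offset has little effect.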

Table 1.

Coefficients of three regressions for the effect of the magnitude of the median forecast on expected scores.

The first regression was log[WIS(F, y)] = α + β ⋅ log[median(F)], where F is the predictive distribution and y the observed value. The second one was WIS(F_log, log y) = α_log + β_log ⋅ log[median(F)], where F_log is the predictive distribution for log y. The third one was WIS(F_sqrt, √y) = α_sqrt + β_sqrt ⋅ √median(F), where F_sqrt is the predictive distribution for √y.
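
The first of these regressions can be sketched with ordinary least squares as below. The data are synthetic and noise-free, chosen only so that the fitted coefficients are easy to verify; they are not values from Table 1, and the helper `ols_fit` is a hypothetical name.

```python
import math

def ols_fit(x, y):
    """Ordinary least squares for y = alpha + beta * x; returns (alpha, beta)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta

# Synthetic data for illustration only: WIS = 0.3 * median ** 0.9.
medians = [10.0, 50.0, 200.0, 1000.0, 5000.0]
wis_natural = [0.3 * m ** 0.9 for m in medians]

# Regress log(WIS) on log(median), as in the first regression.
alpha, beta = ols_fit(
    [math.log(m) for m in medians],
    [math.log(w) for w in wis_natural],
)
```

In this log-log form, a slope β close to one means that scores grow roughly in proportion to the median forecast, while in the regressions on transformed scales a slope close to zero indicates that scores do not depend systematically on the magnitude of the forecast.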

Fig 7.

Relationship between median forecasts and scores.

Black dots represent WIS values for two-week-ahead predictions of the EuroCOVIDhub-ensemble. Drawn in red are the regression lines as discussed in the main text and shown in Table 1. A: WIS for two-week-ahead predictions of the EuroCOVIDhub-ensemble against median predicted values. B: Same as A, with scores obtained after applying a square-root transformation to the data. C: Same as A, with scores obtained after applying a log-transformation to the data.

Fig 8.

Correlations of rankings on the natural and logarithmic scale.

A: Average Spearman rank correlation of scores for individual forecasts. For every individual target (defined by a combination of forecast date, target type, horizon, and location), one score was obtained per model. Then, for every forecast target, the Spearman rank correlation was computed between scores on the natural scale and on the log scale for all the models that had made a forecast for that specific target. These individual rank correlations were then averaged across locations and time and are displayed stratified by horizon and target type, representing average accordance of model ranks for a single forecast target on the natural and on the log scale. B: Correlation between relative skill scores. For every forecast horizon and target type, a separate relative skill score was computed per model using pairwise comparisons, which is a measure of performance of a model relative to the others for a given horizon and target type that accounts for missing values. The plot shows the correlation between the relative skill scores on the natural vs. on the log scale, representing accordance of overall model performance as judged by scores on the natural and on the log scale.
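
The per-target computation in panel A can be sketched as follows: for one forecast target, the scores of all models on the natural scale are rank-correlated with their scores on the log scale. The scores below are made up, and the helper names are hypothetical; averaging the resulting correlations across targets would be a separate step.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for i in order[start:end + 1]:
            ranks[i] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if all values in x or in y are tied.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up scores of four models for a single forecast target.
natural_scores = [120.0, 95.0, 300.0, 150.0]
log_scores = [0.40, 0.55, 0.90, 0.35]
rho = spearman(natural_scores, log_scores)
```

A value of one means that the models are ranked identically on both scales for that target; values near zero mean that the two rankings are essentially unrelated.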

Fig 9.

Changes in model ratings as measured by relative skill for two-week-ahead predictions for cases (top row) and deaths (bottom row).

A: Relative skill scores for case forecasts from different models submitted to the European COVID-19 Forecast Hub computed on the natural scale. B: Change in rankings as determined by relative skill scores when moving from an evaluation on the natural scale to one on the logarithmic scale. Red arrows indicate that the relative skill scores deteriorated when moving from the natural to the log scale, green arrows indicate they improved. C: Relative skill scores based on scores on the log scale. D: Difference in relative skill scores computed on the natural and on the logarithmic scale, ordered as in C. E: Relative contributions of the different WIS components (overprediction, underprediction, and dispersion) to overall model scores on the natural and the logarithmic scale. F, G, H, I, J: Analogously for deaths.
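
A relative skill score based on pairwise comparisons can be sketched as below. The specific construction used here (ratio of mean scores on the targets shared by two models, combined via a geometric mean that includes the self-comparison) is an assumption for illustration and may differ in detail from the procedure used for this figure. The scores are made up, and positive scores are assumed.

```python
import math

def relative_skill(scores):
    """Relative skill per model from pairwise comparisons.

    `scores` maps model -> {target: score}. For each pair, mean scores are
    compared only on the targets both models forecast, so that missing
    forecasts do not distort the comparison. Smaller values are better.
    """
    models = sorted(scores)
    skill = {}
    for m in models:
        log_ratios = []
        for other in models:
            shared = scores[m].keys() & scores[other].keys()
            if not shared:
                continue  # no overlap: this pair carries no information
            mean_m = sum(scores[m][t] for t in shared) / len(shared)
            mean_o = sum(scores[other][t] for t in shared) / len(shared)
            log_ratios.append(math.log(mean_m / mean_o))
        # Geometric mean of the pairwise ratios (self-comparison has ratio 1).
        skill[m] = math.exp(sum(log_ratios) / len(log_ratios))
    return skill

# Made-up WIS values; model "C" did not forecast target "t3".
example = {
    "A": {"t1": 10.0, "t2": 20.0, "t3": 30.0},
    "B": {"t1": 20.0, "t2": 40.0, "t3": 60.0},
    "C": {"t1": 10.0, "t2": 20.0},
}
skill = relative_skill(example)
```

Running the same function on natural-scale and on log-scale scores gives two sets of relative skill values whose difference and rank changes correspond to the quantities compared in panels B and D.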
