Interval forecasts of weekly incident and cumulative COVID-19 mortality in the United States: A comparison of combining methods

Background
A combined forecast from multiple models is typically more accurate than an individual forecast, but there are few studies of combining in infectious disease forecasting. We investigated the accuracy of different ways of combining interval forecasts of weekly incident and cumulative coronavirus disease-2019 (COVID-19) mortality.

Methods
We considered weekly interval forecasts, for 1- to 4-week prediction horizons, with out-of-sample periods of approximately 18 months ending on 8 January 2022, for multiple locations in the United States, using data from the COVID-19 Forecast Hub. Our comparison involved simple and more complex combining methods, including methods that involve trimming outliers or performance-based weights. Prediction accuracy was evaluated using interval scores, weighted interval scores, skill scores, ranks, and reliability diagrams.

Results
The weighted inverse score and median combining methods performed best for forecasts of incident deaths. Overall, the leading inverse score method was 12% better than the mean benchmark method in forecasting the 95% interval and, considering all interval forecasts, the median was 7% better than the mean. Overall, the median was the most accurate method for forecasts of cumulative deaths. Compared to the mean, the median's accuracy was 65% better in forecasting the 95% interval, and 43% better considering all interval forecasts. For all combining methods except the median, combining forecasts from only compartmental models produced better forecasts than combining forecasts from all models.

Conclusions
Combining forecasts can improve the contribution of probabilistic forecasting to health policy decision making during epidemics. The relative performance of combining methods depends on the extent of outliers and the type of models in the combination. The median combination has the advantage of being robust to outlying forecasts. Our results support the Hub's use of the median, and we recommend further investigation into the use of weighted methods.


Introduction
The coronavirus disease-2019 (COVID-19) pandemic has overwhelmed health services and caused excess death rates, prompting governments to impose extreme restrictions in attempts to control the spread of the virus [1-3]. These interventions have resulted in multiple economic, health and societal problems [4,5]. This has generated intense debate among experts about the best way forward [6].
Governments and their advisors have relied upon forecasts from models of the numbers of COVID-19 cases, hospitalisations and deaths to help decide what actions to take [7]. Using models to lead health policy has been controversial, but it is recognised that modelling is potentially valuable when used appropriately [1,8-10]. Numerous models have been developed to forecast different COVID-19 data, e.g. [11-13].
Models should provide probabilistic forecasts, as point forecasts are inherently uncertain [9,14]. A 95% interval forecast is a common and useful form of probabilistic forecast [15,16]. Models may be constructed for prediction or scenario analysis. Prediction models forecast the most likely outcome in the current circumstances. Multiple models may reflect different approaches to answering the same question [11], and conflicting forecasts may arise. Rather than asking which is the best model [17], a forecast combination can be used, such as the mean, which is often used and hard to beat [18,19].
Forecast combining harnesses the 'wisdom of the crowd' [20] by producing a collective forecast from multiple models that is typically more accurate than forecasts from individual models. Combining pragmatically synthesises the information underlying different prediction methods, diversifying the risk inherent in relying on an individual model, and it can offset statistical bias, potentially cancelling out overestimation and underestimation [21]. These advantages are well established in many applications outside health care [22-25]. This has encouraged the more recent applications of combining in infectious disease prediction [14,26-29], including online platforms that present visualisations of combined probabilistic forecasts of COVID-19 data from the U.S., reported by the Centers for Disease Control and Prevention (CDC), and from Europe, reported by the European Centre for Disease Prevention and Control (ECDC).
Other examples of combined probabilistic forecasts are in vaccine trial planning [30] and diagnosing disease [31]. These examples have mainly focused on simple mean and median 'ensembles' and, in the case of prediction of COVID-19 data, published studies have primarily involved short periods of data, which rules out the consideration of more sophisticated methods, such as those weighted by historical accuracy.
By comparing the accuracy of different combining methods over longer forecast evaluation periods than other studies, our broad aims were to: (a) investigate whether combining methods involving weights determined by prior forecast accuracy, or different ways of excluding outliers, are more accurate than simple methods of combining, and (b) establish the relative accuracy of the mean and median combining methods. Previously, we reported several new weighted methods in a comparison of combining methods applied to probabilistic predictions of weekly cumulative COVID-19 mortality in U.S. locations over the 40-week period up to 23 January 2021 [32]. We found that weighted methods were the most accurate overall and that the mean generally outperformed the median, except in the first ten weeks. In this paper, we test further by comparing the combining methods on a dataset of interval forecasts of cumulative mortality and a dataset of incident mortality, both over periods of more than 80 weeks. We also include individual models in the comparison, and explore the impacts of reporting patterns of death counts and outlying forecasts on forecast accuracy.

Dataset
The COVID-19 Forecast Hub (the Hub) carries out screening tests for inclusion in its 'ensemble' forecast combinations. Screening excludes forecasts with an incomplete set of quantiles or prediction horizons, and improbable forecasts.
The definition of improbable forecasts relates to cumulative deaths, and currently includes decreasing quantiles over the forecast horizons, decreasing cumulative deaths over time (except for an adjustment due to reporting revisions, which is permitted up to a maximum of 10% in the 1-week-ahead forecasts), and forecasts of cumulative deaths for a particular location exceeding the size of its population. Before the week of 28 July 2020, the Hub also excluded outlying forecasts, which were identified by a visual check against the actual number of deaths. We only included forecasts that passed the Hub's screening tests.
Our dataset included forecasts projected from forecast origins at midnight on Saturdays between 9 May 2020 and 8 January 2022 for forecasts of cumulative COVID-19 deaths (88 weeks of data), and between 6 June 2020 and 8 January 2022 for forecasts of incident deaths (84 weeks of data). Forecasts of incident deaths were not screened by the Hub in the weeks ending 9 May 2020 to 30 May 2020. We included forecasts of cumulative deaths in this period because we wished to use all the available data, and because the sets of incident and cumulative forecasts differed in terms of the included models and locations. For the actual weekly COVID-19 mortality, for each location and week, we used the values made available on 15 January 2022. We studied forecasts of COVID-19 mortality for the U.S. as a whole and 51 U.S. jurisdictions, including the 50 states and the District of Columbia.
For simplicity, we refer to these as 51 states.
Our analysis included forecasts from 60 forecasting models and the Hub's ensemble model. In the early weeks of our dataset, the majority were susceptible-exposed-infected-removed (SEIR) compartmental models, but as the weeks passed, other model types became more common (Fig 1). These involved methods such as neural networks, agent-based modelling, time series analysis, and curve fitting techniques.

Fig 2. Data availability for forecasts of incident COVID-19 deaths
Several combining methods required parameter estimation, which we performed for each location and forecast origin. We defined the in-sample estimation period as being initially the first 10 weeks, and then expanding week by week. This resulted in out-of-sample forecasts produced from 78 weekly forecast origins for the cumulative deaths series, and 74 weekly forecast origins for incident deaths.

Evaluating the interval forecasts
We evaluated out-of-sample prediction accuracy and calibration, with reference to the reported death counts on 15 January 2022, thus producing a retrospective evaluation. Calibration was assessed by the percentage of actual deaths that fell below each bound of the interval forecasts. As each bound is a quantile, this amounted to assessing the calibration of the 23 quantiles for which the teams submitted forecasts. We present this using reliability diagrams. To evaluate prediction accuracy of an interval forecast, we used the interval score (IS), given by the following expression [33,34]:

IS_t = (u_t - l_t) + (2/α)(l_t - y_t) I(y_t < l_t) + (2/α)(y_t - u_t) I(y_t > u_t)

where l_t is the interval's lower bound, u_t is its upper bound, y_t is the observation in period t, I is the indicator function (1 if the condition is true and 0 otherwise), and α is the ideal probability of the observation falling outside the interval. We report the IS for the 95% interval forecasts (for which α = 5%). Lower values of the IS reflect greater interval forecast accuracy. The unit of the IS is deaths. As each forecasting team provides forecasts for 23 different quantiles, the following K = 11 symmetric interval forecasts can be considered: 98%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20% and 10%.
To summarise prediction accuracy for all these intervals, we used the weighted IS (WIS) [16]:

WIS_t = (1/(K + 1/2)) [ (1/2)|y_t - m_t| + Σ_{k=1}^{K} (α_k/2) IS_{α_k,t} ]

where α_k is the ideal exceedance probability for the kth interval, IS_{α_k,t} is the corresponding interval score, and m_t is the forecast of the median. The IS and the WIS are useful for comparing methods, and although their units are deaths, these scores are not interpretable in absolute terms. Averaging each of these two scores across weeks provided the mean IS (MIS) and the mean WIS (MWIS).
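As an illustration, the IS and WIS defined above can be computed as follows. This is a sketch in Python (not the Stata/GAUSS code used in the study), with quantile levels rounded to three decimal places to avoid floating-point key mismatches; the function names are ours.

```python
def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) interval: the width plus
    penalties, scaled by 2/alpha, when the observation falls outside."""
    width = upper - lower
    below = (2.0 / alpha) * (lower - y) if y < lower else 0.0
    above = (2.0 / alpha) * (y - upper) if y > upper else 0.0
    return width + below + above

def weighted_interval_score(levels, forecasts, y):
    """WIS from the K symmetric intervals plus the median forecast, with
    weight 1/2 on the absolute error of the median and alpha_k/2 on each
    interval score, normalised by K + 1/2."""
    q = {round(lvl, 3): f for lvl, f in zip(levels, forecasts)}
    median = q[0.5]
    alphas = sorted(round(2 * lvl, 3) for lvl in q if lvl < 0.5)
    total = 0.5 * abs(y - median)
    for a in alphas:
        total += (a / 2.0) * interval_score(q[round(a / 2.0, 3)],
                                            q[round(1.0 - a / 2.0, 3)], y, a)
    return total / (len(alphas) + 0.5)
```

For example, a 95% interval of (10, 20) with observation 25 scores 10 + (2/0.05)(25 - 20) = 210, showing how heavily a miss is penalised relative to the interval width.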
We also averaged the scores across forecast horizons. We did this for conciseness, and because we had a relatively short analysis period, which is a particular problem when evaluating forecasts of extreme quantiles. To show the consistency across horizons, we present a set of results by horizon for interval forecasts for both the incident and cumulative deaths data. For this analysis, because we were looking at individual horizons, we were able to use the Diebold-Mariano statistical test [35], adapted to test across multiple series. This test statistic was originally designed to apply to the difference between the mean of an accuracy measure for two methods for a single time series. To compare the difference averaged across multiple time series, we calculated the variance of the sampling distribution by first summing each variance of the sampling distribution from the Diebold-Mariano test applied to each series, and then dividing by the square of the number of series. To summarise results averaged across the four horizons, we were unable to use the adapted Diebold-Mariano test, so we applied the statistical test proposed by Koning et al. [36]. This test compares the rank of each method, averaged across multiple series, with the corresponding average rank of the most accurate method. Statistical testing was based on a 5% significance level.
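The multi-series adaptation described above can be sketched as follows. This is a simplified illustration, assuming no autocorrelation correction in each series' variance estimate (the full Diebold-Mariano test adjusts for autocorrelation at longer horizons); the function name and arguments are ours.

```python
import math
from statistics import mean, variance

def dm_stat_multi(loss_a, loss_b):
    """Diebold-Mariano-type statistic for the loss difference between two
    methods, averaged across several series. loss_a and loss_b are lists of
    per-series loss sequences (e.g. weekly interval scores). Each series'
    sampling variance of the mean difference is summed, then divided by the
    square of the number of series, as described in the text."""
    diffs = [[a - b for a, b in zip(sa, sb)] for sa, sb in zip(loss_a, loss_b)]
    avg_diff = mean(mean(d) for d in diffs)              # averaged across series
    var_sum = sum(variance(d) / len(d) for d in diffs)   # per-series Var of the mean
    return avg_diff / math.sqrt(var_sum / len(diffs) ** 2)
```

The returned statistic can then be compared against standard normal critical values at the 5% level.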
We present results of the forecast accuracy evaluation in terms of the 95% interval MIS, MWIS, ranks and skill scores, which are calculated as the percentage by which a given method is superior to the mean combination. The mean is a common choice of benchmark in combining studies. We report results for the series of total U.S. deaths, as well as results averaged across all 52 locations. In addition, to avoid scores for some locations dominating, we also present results averaged for three categories, each including 17 states: high, medium and low mortality states. This categorisation was based on the number of cumulative COVID-19 deaths on 15 January 2022. All results are for the out-of-sample period, and to provide some insight into the potential change in ranking of methods over time, we present MWIS results separately for the first and second halves of the out-of-sample period.
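A skill score of this kind can be sketched as below. This is a minimal illustration, in which the exact percentage convention is our assumption based on the description above: scores are negatively oriented (lower is better), so positive skill means the method beats the benchmark.

```python
def skill_score(score_method, score_benchmark):
    """Percentage by which a method's score (e.g. MIS or MWIS) improves on
    the benchmark's; positive values mean the method is more accurate than
    the benchmark (here, the mean combination)."""
    return 100.0 * (score_benchmark - score_method) / score_benchmark
```

For example, a method with MIS of 88 against a benchmark MIS of 100 has a skill score of 12%, matching the "12% better than the mean" phrasing used in the results.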
We evaluated the effects of changes in reporting patterns on forecast accuracy. Changes in reporting patterns may involve reporting delays of death counts and changes in the definitions of COVID-19 deaths, both of which may lead to backdating of death counts and steep increases or decreases.
Backdating of death counts would produce a problematic assessment in our retrospective evaluation of forecast accuracy, and sudden changes in death counts might cause some forecasting models to misestimate, particularly time series models. To obtain some insight, we compared reports of cumulative death counts for each location in files that were downloaded at multiple time points between 20 June 2020 and 15 January 2022. Locations for which there were notable effects of reporting patterns were excluded in sensitivity analysis. We also examined the effect of outlying forecasts on forecast accuracy by comparing the performance of the mean and median, and visually comparing plots of the MWIS of the mean and median forecasts by location.
Data preparation and descriptive analysis were carried out using Stata version 16, and the forecasts were combined using version 19 of the GAUSS programming language.

Forecast combining methods
All the interval combining methods are applied to each interval bound separately, and for each mortality series, forecast origin and prediction horizon. The comparison included several interval combining methods that do not rely on the availability of records of past accuracy for individual models. These methods include the well-established mean and median combinations [37-39], and the more novel symmetric, exterior, interior and envelope trimming methods, which exclude a particular percentage of forecasts and then average the remaining forecasts of each bound [40]. We also implemented two inverse score methods that do rely on the availability of a record of historical accuracy for each individual model. For any combining method that involved a parameter, such as a trimming parameter, we optimised its value for each location by minimising the MIS calculated over the in-sample period. The following combining methods were included in our study:
(a) Mean combination. We calculated the average of the forecasts of each bound. This combining method is also known as the simple average.
(b) Median combination. We calculated the median of the forecasts of each bound. This method is robust to outliers.
(c) Ensemble. This is the COVID-19 Hub ensemble forecast, which was originally the mean combination of the eligible forecasts until the week commencing 28 July 2020, when the ensemble forecast became the median combination; then, in the week commencing 27 September 2021, the Hub switched to a weighted ensemble method. The use of eligibility screening implies that the ensemble is constructed with the benefit of a degree of trimming, which initially involved some subjectivity and was later formalised more objectively. The median and the Hub ensemble should be identical for the post-sample period, but there were some minor discrepancies, including a small number of eligible forecasts that were not available on the Hub. The results for the median and the Hub ensemble will be similar, as the latter method was the median combination for around 90% of the out-of-sample period.
(d) Symmetric trimming. This method deals with outliers. For each bound, it involves trimming the N lowest-valued and N highest-valued forecasts, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts, where β is a trimming parameter. The median combination is an extreme form of symmetric trimming.
(e) Exterior trimming. This method targets overly wide intervals. It involves removing the N lowest-valued lower bound forecasts and the N highest-valued upper bound forecasts, where N is the largest integer less than or equal to the product of the trimming parameter β and the number of forecasts. When this resulted in a lower bound being above the upper bound, we replaced the two bounds by their average.
(f) Interior trimming. This method targets overly narrow intervals. It involves removing the N highest-valued lower bound forecasts and the N lowest-valued upper bound forecasts, where N is defined as for exterior trimming.
(g) Envelope method. The interval is constructed using the lowest-valued lower bound forecast and the highest-valued upper bound forecast. This method is an extreme form of interior trimming.
(h) Inverse interval score method. This method weights the model forecasts by historical accuracy, with the weight for each model inversely proportional to that model's historical MIS [32], calculated over the in-sample period. With the shortest in-sample period being 10 weeks, we considered only forecasting teams for which we had forecasts for at least five past forecast origins.
Larger numbers led to the elimination of many forecasters for the early weeks in our out-of-sample period. The following expression gives the weight on forecasting model i at forecast origin t:

w_{i,t} = (1/MIS_{i,t}) / Σ_{j=1}^{J} (1/MIS_{j,t})

where MIS_{i,t} is the historical MIS computed at forecast origin t from model i, and J is the number of forecasting models included in the combination.
(i) Inverse interval score with tuning. This method has weights inversely proportional to the MIS raised to the power of a tuning parameter, λ > 0, which controls the influence of the score on the combining weights [32]. The following expression gives the weight on forecasting model i at forecast origin t:

w_{i,t} = MIS_{i,t}^{-λ} / Σ_{j=1}^{J} MIS_{j,t}^{-λ}

If λ is close to zero, the combination reduces to the mean combination, whereas a large value of λ leads to the selection of the model with the best historical accuracy. The parameter λ was optimised using the same expanding in-sample periods as for the trimming combining methods. Due to the extent of missing forecasts, we pragmatically computed MIS_{i,t} using all available past forecasts, rather than limiting the computation to periods for which forecasts from all models were available. For models whose forecasts were not available for at least five past periods, we set MIS_{i,t} to be equal to the mean of MIS_{i,t} for all other models. An alternative approach, which we employed in our earlier study [32], is to omit from the combination any model for which there is only a very short or non-existent history of accuracy. The disadvantage of this is that it omits potentially useful forecast information, as our empirical results showed.
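Several of the combining methods above can be sketched as follows for a single bound. This is a simplified Python illustration (the study itself used GAUSS); the function name, arguments, and method labels are ours, not the paper's.

```python
import numpy as np

def combine_bound(forecasts, method="mean", beta=0.0, past_mis=None, lam=1.0):
    """Combine one interval bound (lower or upper) across models, for one
    location, forecast origin and horizon. `past_mis` holds each model's
    in-sample MIS, aligned element-by-element with `forecasts`."""
    f = np.asarray(forecasts, dtype=float)
    n = len(f)
    if method == "mean":
        return float(f.mean())
    if method == "median":
        return float(np.median(f))
    if method == "symmetric_trim":
        # trim the N lowest- and N highest-valued forecasts, N = floor(beta/2 * n)
        trim = int(beta / 2 * n)
        return float(np.sort(f)[trim:n - trim].mean())
    if method == "inverse_score":
        # weight each model inversely proportional to its in-sample MIS
        w = 1.0 / np.asarray(past_mis, dtype=float)
        return float((w / w.sum()) @ f)
    if method == "inverse_score_tuned":
        # lam near 0 recovers the mean; large lam selects the best past model
        w = np.asarray(past_mis, dtype=float) ** -lam
        return float((w / w.sum()) @ f)
    raise ValueError(f"unknown method: {method}")
```

Applying the chosen function to the lower and upper bounds separately yields the combined interval, mirroring the paper's bound-by-bound approach.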

Comparison with individual models
A comparison of the results of the combining methods with those of individual models is complicated by the fact that none of the individual models provided forecasts that passed the Hub's screening for all past periods and locations. We addressed this in two ways. Firstly, we included a 'previous best' method, which at each forecast origin and location selected the interval forecast of the individual model with the lowest in-sample MIS. The aim of this is, essentially, to obtain the interval forecasts of the best of the individual models. Secondly, in our results, we also summarise a comparison of the mean and median combinations with individual models for which we had forecasts for at least half the locations and at least half the out-of-sample period. Our inclusion criterion here is rather arbitrary, but the resulting analysis does help us compare the combining methods with the models of the more active individual teams. In this comparison, we excluded the COVID Hub baseline model, as it is designed only to be a comparison point for the models submitted to the Hub and not a true forecast.
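The 'previous best' selection reduces to a single minimisation per forecast origin and location; a minimal sketch, in which the dictionary keys are made-up model names used purely for illustration:

```python
def previous_best(in_sample_mis, interval_forecasts):
    """At a given forecast origin and location, return the interval forecast
    of the individual model with the lowest in-sample MIS. Both arguments
    are dicts keyed by model name; interval values are (lower, upper) pairs."""
    best = min(in_sample_mis, key=in_sample_mis.get)
    return best, interval_forecasts[best]
```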

Main results for incident deaths
Table 1 presents the MIS for 95% interval forecasts and the MWIS for the 74-week out-of-sample period for incident mortality. Table 2 presents the corresponding mean skill scores, and Table 3 provides the mean ranks and results of the statistical test proposed by Koning et al. [36]. The weighted inverse score methods, the ensemble and the median combination were the best-performing methods. Overall (for all 52 locations), Table 2 shows that the performance of the inverse score method was almost 12% better than the mean in forecasting the 95% interval and, considering all interval forecasts, the ensemble and median were around 7% better than the mean. Of the trimming methods, symmetric trimming performed best overall, and was quite competitive with the leading methods. The 'previous best' method was not competitive against most of the combining methods. The worst results were produced by the envelope method.
Tables 1 to 3 report results averaged across the four forecast horizons (1 to 4 weeks ahead). We found similar relative performances of the methods when looking at each forecast horizon separately (S2 Table).

Changes over time in performance for incident deaths
In Table 4, the MWIS skill scores are shown separately for the first and second halves of the 74-week out-of-sample period. Recalling that the skill scores assess performance relative to the mean combining method, the table shows that this method was notably more competitive in the second half of the out-of-sample period than in the first half. Comparing the other methods, we see that the methods that performed particularly well for the first half of the data were also the best methods for the second half. An exception was the inverse score with tuning method, which performed worse in the second half. This is perhaps surprising, as one might expect the tuning parameter to be better estimated for the second half, when more data was available for estimation. The inverse score method without tuning would appear to be more robust for this dataset. The consistently good performance of the median emphasises the importance of robustness.

Performance by model type for incident deaths
To evaluate performance by model type, for each category of mortality series (all, U.S., high, medium and low mortality), Table 5 tabulates MWIS skill scores for the combining methods applied separately to each of the following three sets of individual models: all models, compartmental models only, and non-compartmental models only. For each category of mortality series, to enable a comparison of the combining methods applied to the different sets of individual models, we computed the skill scores using the same benchmark, which we set as the mean combination of all models. Note that we have omitted the ensemble from Table 5, because the forecasts from this method were determined by the Hub, and so we were not in control of which individual models it combines. The first point to note from Table 5 is that combining only non-compartmental models led to poorer results for almost all combining methods and categories of mortality series. A second point is that, for the all, high, medium and low categories of series, combining only compartmental models was preferable to combining all models, unless the combining method was the median. For the median, combining all available models was preferable. It is interesting that the two inverse score methods, when applied only to the compartmental models, became competitive with the median.

Performance of individual models for incident deaths
Table 6 reports the performance of the 27 individual models for which we had forecasts of incident deaths for at least half the out-of-sample period and at least half of the 52 locations. The table summarises skill scores based on scores calculated for the individual model and the benchmark method using only those weeks for which forecasts were available from the individual model. Table 6 reports results for the skill score calculated using the mean combination as the benchmark, as in our previous tables, but also results for the skill score calculated using the median combination as the benchmark. The skill scores of these individual models were highly variable, and generally negative, implying that they were not competitive against the mean or median in any category. The only notable exception was an individual model that was almost 17% better than the mean for the 95% interval forecasts in the high mortality locations.

Calibration results for incident deaths
As we stated earlier, with each bound of the interval forecasts being a quantile, we assess calibration for each of the 23 quantiles for which the teams submitted forecasts. We do this in Fig 4, which presents reliability diagrams for each category of mortality series for the mean, median and inverse score with tuning combining methods. Reasonable calibration can be seen in the plot relating to all 52 locations, and there is good calibration at the extreme quantiles in each plot, except the one for the low mortality locations. Most methods had calibration that was too low for the U.S. and high mortality locations, and most methods displayed calibration that was too high for the low mortality locations, particularly for the lower quantiles. For the medium mortality locations, the mean and inverse score with tuning performed better than the median, for which calibration was slightly too low. For all methods, S3-S7 Tables provide the calibration for the five categories of locations: all 52 locations, U.S., high mortality, medium mortality and low mortality, respectively.

Forecasting cumulative deaths
In this section, for the cumulative deaths data, we report tables and figures analogous to those presented for the incident deaths data.

Main results for cumulative deaths
Table 7 presents the MIS for 95% interval forecasts and the MWIS for the 78-week out-of-sample period for cumulative mortality. The corresponding skill scores are in Table 8. The median and ensemble approaches were the best-performing methods in terms of both metrics. Overall, their performance for the 95% interval was about 65% better than the mean and, considering all interval forecasts, the ensemble and median were around 43% better than the mean. The very poor performance of the mean suggests the presence of outlying forecasts. These would also undermine the weighted inverse score methods, as they involve weighted averages. The inverse score with tuning method was the best method for the U.S. series. Interior trimming performed better than the inverse score methods for the 95% interval, which suggests that there were large numbers of 95% intervals that were too narrow. Both metrics showed symmetric trimming performing almost as well as the median. Table 9 reports mean ranks. Using the statistical test proposed by Koning et al. [36], we identified that, in terms of the mean rank, most methods were statistically significantly worse than the median-based approaches. We found similar relative performances of the methods when looking at each forecast horizon separately (S8 Table).

Performance of individual models for cumulative deaths
For forecasts of cumulative mortality, 25 models provided forecasts for at least half the locations for at least half the weeks in the out-of-sample period. As can be seen in Table 12, the performance of these individual models was highly variable. The upper half of the table shows that, particularly for the 95% interval, a good number of the individual models were able to outperform the mean. However, the lower half of the table shows that the individual models were not competitive with the median, except in the case of the 95% interval for the total U.S. mortality series. The reliability diagrams show that the mean produced quantile forecasts that tended to be too low for the U.S. series, and too high for the other four categories. The inverse score with tuning method was very well calibrated, except for the U.S. series, and the median method also performed reasonably well, although it tended to produce quantile forecasts that were generally a little low. For all methods, S9-S13 Tables show the calibration for the five categories of locations: all 52 locations, U.S., high mortality, medium mortality and low mortality, respectively.

Impact of reporting patterns and outliers on forecast accuracy
We observed changes in reporting patterns of historical death counts in 15 locations. We cannot rule out there having been changes in reporting patterns for these and other locations, as we did not have a complete set of files of reported death counts for each week. When these locations were excluded in sensitivity analysis, compared with the results in Tables 2 and 8 (where no locations were excluded), there were improvements for all methods and slight changes in rankings, but no changes in the overall conclusions.
The differences between the performance of the mean and median forecasts described in previous sections are consistent with the adverse impact of outlying forecasts on the mean.

Discussion
The weighted inverse scores and the median methods performed best for forecasts of incident deaths.
They produced moderate improvements in performance over the common benchmark mean combination. With the forecasts of cumulative deaths, improvements over the mean were much higher, and for the median, they were substantial. For all combining methods except the median, combining forecasts from only compartmental models produced better forecasts than combinations of all models. Furthermore, considering combinations of compartmental models only, inverse score combining was more competitive against the median for both mortality datasets. We found that the individual models were not competitive with the better combining methods. The presence of outlying forecasts had an adverse impact on the performance of the mean and the inverse score methods, which involve weighted averaging. The adverse effects of reporting patterns on performance were minor.
We presented the inverse score methods in an earlier study of forecasts of cumulative COVID-19 deaths [32]. The current paper considers both incident and cumulative forecasts, uses a far longer period of data than the earlier study, and involves a different set of forecasting models, as we now only include forecasts that passed the screening tests of the COVID-19 Hub. In our earlier study, the inverse score methods were the most accurate overall and the mean generally outperformed the median. The mean was also competitive against the inverse score methods for many locations. The results of the current study for forecasts of cumulative deaths were not consistent with those of the earlier study, although much better results were achieved for the mean by combining only compartmental models, and when combining forecasts of incident deaths from all models, the relative performance of the inverse score methods was considerably better. In the current study of cumulative deaths, the leading methods were generally the median-based methods and symmetric trimming (for which the median is an extreme case). These methods are robust to outliers. The results of our two studies illustrate that, particularly for forecasts of cumulative deaths, the relative performance of combining methods depends on the extent of outlying forecasts, and that outlying forecasts were clearly more prevalent in the dataset for the current study.
Another relevant previous study is that by Bracher et al. [29], who compared forecasts produced by the mean combination, the median combination and a weighted combination for COVID-19 deaths in Germany and Poland. They found that the combined methods did not perform better than individual models.
However, that study was limited by an evaluation period of only ten weeks. It is also worth noting that it used just thirteen individual models in the combinations. In our previous work [32], we found that accuracy improved notably as the number of individual models rose, plateauing at around twenty models.
When it is not clear which model has the best performance, the statistical expectation is that the average method will score far better than a model chosen randomly, or chosen on the basis of no prior history. This was the case at the start of the COVID-19 pandemic.
The existence of outlying forecasts presents challenges to forecast combining. These can arise due to model-based factors or factors involving the actual number of deaths. The former include computational model errors, which can happen occasionally, and incorrect model assumptions, which will typically apply in the early stages of a pandemic. The latter include data updates and changes in definitions. Some models can be adapted to allow for data anomalies. The removal of outlying forecasts could be added to the pre-combining screening process, but screening criteria for outliers may be arbitrary and subjective. A more objective way to tackle outlying forecasts is to use the median combination, and that was the approach taken by the COVID-19 Hub in July 2021, having previously relied on a mean ensemble. Our earlier study suggested that factoring historical accuracy into forecast combinations may achieve greater accuracy than the median combination [32].
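The robustness of the median combination can be seen in a minimal sketch with illustrative numbers (not data from the study): when one model produces an outlying quantile forecast, the quantile-wise mean is pulled far towards the outlier, while the quantile-wise median is unaffected.

```python
# Minimal sketch: quantile-wise mean vs. median combination when one
# model produces an outlying forecast. Numbers are illustrative only.
import statistics

# Upper 97.5% quantile forecasts from five models; the last is an outlier.
upper_bounds = [120.0, 125.0, 118.0, 130.0, 900.0]

mean_combined = statistics.mean(upper_bounds)      # pulled far upward: 278.6
median_combined = statistics.median(upper_bounds)  # unaffected: 125.0
```

The same contrast holds at every quantile, which is why the median combination degrades gracefully when a small number of models misfire.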
Both our studies have involved the use of performance-weighted mean methods, and our current study has shown that they are not sufficiently robust to outliers. We recommend further research into weighted methods and into the effect of model type on the relative performance of combining methods.

Fig 1. Number and types of models at each forecast origin for each mortality dataset.

Fig 3. Illustration of interval forecast combining methods that do not rely on past historical accuracy. Each pair of shapes represents an interval forecast produced by an individual model.

Fig 4. For incident mortality, reliability diagrams showing calibration of the 23 quantiles for the mean, median and inverse score with tuning methods. The 23 quantiles include all bounds on the interval forecasts and the median.

Fig 5 presents reliability diagrams for each category of the mortality series to summarise the calibration.

Fig 5. For cumulative mortality, reliability diagrams showing calibration of the 23 quantiles for the mean, median and inverse score with tuning methods. The 23 quantiles include all bounds on the interval forecasts and the median.

Fig 6 shows examples of six locations where updates to death counts were particularly notable. We found evidence of backdating in Delaware, Ohio, Rhode Island and Indiana. Backdating of historical death counts is shown as dashed lines. We noted a sharp drop in death counts in West Virginia in May 2021, suggesting a redefinition of COVID-19 deaths. There were sharp increases in death counts in Oklahoma in early April 2021 and in Delaware in late July 2021. We also observed sharp increases in the death counts of two other locations (Missouri and Nebraska).

Fig 6. Numbers of reported cumulative deaths in six states where there were noticeable changes in reporting patterns. Based on reported death counts at multiple data points between 20 June 2020 and 15 January 2022.

Fig 7. For incident mortality, MWIS for high, medium and low mortality states for three selected combining methods.

Fig 8 suggested a problem with outliers, particularly for cumulative deaths.

Fig 9 provides some insight into an outlying set of 23 quantile forecasts. Each line shows the probability

Fig 9. Two examples of an outlying set of quantile forecasts of cumulative deaths for one week ahead from the forecast origin for the week ending on 18 July 2020.

Table provides a list of all the models.
a best method in each column.

Table 2. For incident mortality, skill scores for 95% interval MIS and MWIS.

Table 3. For incident mortality, average ranks of the 95% interval MIS and MWIS. Lower values are better. a best method in each column; b significantly worse than the best method, at the 5% significance level.

Table 4. For incident mortality, skill scores for MWIS calculated separately for the first and second halves of the 74-week out-of-sample period. Shows percentage change from the mean. Higher values are better. a best method in each column.

Table 5. For incident mortality, skill scores for MWIS for combining methods applied to forecasts of all models, compartmental models only, and non-compartmental models only. Shows percentage change from the mean. Higher values are better. a best method in each of the five mortality categories.

Table 6. For incident mortality, summary statistics of skill scores for individual models.

Table 7. For cumulative mortality, 95% interval MIS and MWIS for the 78-week out-of-sample period. a best method in each column.

Table 8. For cumulative mortality, skill scores for 95% interval MIS and MWIS for the 78-week out-of-sample period.

Table 12. For cumulative mortality, summary statistics of skill scores for individual models, using mean and median combining as benchmarks.
Table and S15 Table for incident and cumulative deaths respectively. Compared to the MWIS skill scores presented in Tables