Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S.

Seasonal influenza results in substantial annual morbidity and mortality in the United States and worldwide. Accurate forecasts of key features of influenza epidemics, such as the timing and severity of the peak incidence in a given season, can inform public health response to outbreaks. As part of ongoing efforts to incorporate data and advanced analytical methods into public health decision-making, the United States Centers for Disease Control and Prevention (CDC) has organized seasonal influenza forecasting challenges since the 2013/2014 season. In the 2017/2018 season, 22 teams participated. A subset of four teams created a research consortium called the FluSight Network in early 2017. During the 2017/2018 season they worked together to produce a collaborative multi-model ensemble that combined 21 separate component models into a single model using a machine learning technique called stacking. This approach creates a weighted average of predictive densities where the weight for each component is determined by maximizing overall ensemble accuracy over past seasons. In the 2017/2018 influenza season, one of the largest seasonal outbreaks in the last 15 years, this multi-model ensemble performed better on average than all individual component models and placed second overall in the CDC challenge. It also outperformed the baseline multi-model ensemble created by the CDC that took a simple average of all models submitted to the forecasting challenge. This project shows that collaborative efforts between research teams to develop ensemble forecasting approaches can bring measurable improvements in forecast accuracy and important reductions in the variability of performance from year to year. Efforts such as this, that emphasize real-time testing and evaluation of forecasting models and facilitate the close collaboration between public health officials and modeling researchers, are essential to improving our understanding of how best to use forecasts to improve public health response to seasonal and emerging epidemic threats.


Component models
See Table A.

Supplemental evaluation metrics

Bias and MSE
We compared the FSNetwork-TTW ensemble model's accuracy in 2017/2018 to that of the top-performing model from each team in the training phase. We used root mean-squared error (RMSE) and average bias to measure the accuracy of point estimates. Note that the ensemble weights were optimized solely to maximize the log-score, so these accuracy scores do not indicate how well the ensemble could do if it were optimized to minimize point-estimate error. Consistent with the CDC scoring rules, we only evaluated point estimates within the "scoring bounds" specific to each target, region, and season (see Methods in the main manuscript).
Overall, the FSNetwork-TTW model ranked second in both RMSE and average bias, behind the LANL-DBM model (Figure A). All selected models showed a negative bias (i.e., underestimation on average) for the targets on the wILI scale (week-ahead incidence and peak percentage). The CU-EKF SIRS model showed particularly low bias for 1- and 2-week-ahead forecasts, although greater variability led to lower ranks for RMSE.
In general, these results suggest that using separate weighting schemes for point estimates and predictive distributions may be valuable.
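As a point of reference, both metrics are straightforward to compute. The following is a minimal sketch assuming point forecasts and observed wILI values are held in NumPy arrays; the function and variable names are illustrative, not taken from the FluSight codebase.

import numpy as np

def rmse(point_forecasts, observed):
    """Root mean-squared error of point estimates."""
    errors = np.asarray(point_forecasts) - np.asarray(observed)
    return float(np.sqrt(np.mean(errors ** 2)))

def average_bias(point_forecasts, observed):
    """Mean signed error; negative values indicate underestimation on average."""
    return float(np.mean(np.asarray(point_forecasts) - np.asarray(observed)))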

Probability integral transform
The probability integral transform (PIT) is an evaluation metric that can be used to assess the calibration of a predictive model. A common application of the PIT is testing whether a set of values from an unknown target distribution is accurately modeled by one or more predictive distributions. In brief, statistical theory tells us that if we plug the observed values (which come from the unknown target distribution) into the cumulative distribution function of the predictive distribution, the outputs, otherwise known as the PIT values, should be uniformly distributed if the predictive distribution matches the true distribution [12,13]. Therefore, systematic departures of the PIT values from uniformity indicate that the predictive distributions are miscalibrated.
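As a minimal sketch of this check, the following assumes each forecast is summarized by its cumulative distribution function; the normal predictive distribution and the Kolmogorov-Smirnov test are illustrative choices, not the evaluation used in the challenge.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
observed = rng.normal(loc=2.0, scale=1.0, size=500)

# PIT values: observed data plugged into the predictive CDF. Here the
# predictive distribution matches the data-generating one, so the PIT
# values should be approximately uniform on [0, 1].
pit_values = stats.norm(loc=2.0, scale=1.0).cdf(observed)

# A histogram of pit_values, or a formal uniformity test, checks calibration.
print(stats.kstest(pit_values, "uniform"))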

Comparison of ensemble model performance in testing and training phases
To compare the training- and test-phase performance of the ensemble models, we plotted their relative performance (Figure E). This shows that the FSNetwork-TTW model had the highest overall score in both the training and the test phase.

EM Algorithm for Weighted Density Ensembles
The following is adapted from [14,15] and describes the use of the Expectation Maximization (EM) algorithm for constructing weighted average ensemble models in the context of infectious disease forecasting. Our goal is to develop a weighted density ensemble that combines the full predictive distributions in such a way as to optimize the score of the resulting model average.
To use the EM algorithm to find optimal weights, we formulate the question as a missing data problem.
We consider a data-generating process in which an observed target is generated from $f(z)$ by choosing one of the $f_c(z)$ component distributions as a random draw from a multinomial distribution with probabilities $\pi_c$. Here we suppress the subscripts for target, region, and week ($t$, $r$, and $w$) for simplicity. The problem is that we do not know, for each observed datapoint $z_i^*$, which component the observation was drawn from. However, conditional on the data and our current estimates of the $\pi_c$, we can make a best guess of how often each component was chosen. This is the "E step". Then, based on these guesses, we can update our estimate of the $\pi_c$. This is the "M step".
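Written out explicitly, this is the standard latent-variable formulation of a mixture model; the component indicator $C_i$ is notation we introduce here for illustration:
$$ C_i \sim \mathrm{Multinomial}(1;\, \pi_1, \dots, \pi_C), \qquad z_i^* \mid C_i = c \;\sim\; f_c(z), $$
so that marginally $f(z) = \sum_{c=1}^{C} \pi_c f_c(z)$, and the unobserved $C_i$ play the role of the missing data.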
The "E step" of the EM algorithm can be thought of as determining, for each component $c$, the expected number of times, across our $N$ observed datapoints, that component $c$ was chosen as the contributor to $f(z)$:
$$ m_c = \sum_{i=1}^{N} \frac{\pi_c f_c(z_i^*)}{\sum_{c'=1}^{C} \pi_{c'} f_{c'}(z_i^*)}. $$
Heuristically, we can think of the expression $f_c(z)$ equivalently as $\Pr(z \mid \mathrm{model}_c)$, or in words, the likelihood of seeing the value $z$ given that component $c$ is the "chosen" model.
The "M step" of the EM algorithm simply calculates, conditional on the "complete data" (i.e., the $z_i^*$ and the estimated number of times each component was chosen), the fraction of times each component was chosen. Therefore, if we simply divide the quantity from the E step by $N$, our total number of observations, we obtain a new estimate of this probability:
$$ \pi_c^{\mathrm{new}} = \frac{m_c}{N}. $$

Assume that we have a set of $C$ fitted predictive densities evaluated at the observed data $z_i^*$ for $i = 1, \dots, N$. In our application, we let the $f_c(z_i^*)$ be computed as the probabilities associated with the modified scores described in the main manuscript. As an example, for season peak percentage and the short-term forecasts, probabilities assigned to wILI values within 0.5 units of the observed value are counted as correct, so the modified score becomes
$$ f_c(z_i^*) = \int_{z_i^* - 0.5}^{z_i^* + 0.5} f_c(z \mid x)\, dz. $$
We will notate these scores as $f_c(z_i^* \mid x)$. There will be $C \cdot N$ total observations, as each model must have an associated score (a probability between 0 and 1) for each observed data point.
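Because the challenge forecasts are submitted as binned probability distributions, this integral reduces in practice to a sum of bin probabilities. The following minimal sketch illustrates that computation; the bin representation and the tolerance argument are assumptions for illustration, not the exact challenge file format.

import numpy as np

def modified_score(bin_probs, bin_starts, observed, tolerance=0.5):
    """Sum the predicted probability in bins within `tolerance` of `observed`."""
    bin_starts = np.asarray(bin_starts, dtype=float)
    window = np.abs(bin_starts - observed) <= tolerance
    return float(np.asarray(bin_probs)[window].sum())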
We wish to obtain a set of optimal weights $\hat{\pi} = \{\hat{\pi}_1, \hat{\pi}_2, \dots, \hat{\pi}_C\}$ for combining the models such that $\hat{\pi}_c \ge 0$ for all $c$ and $\sum_{c=1}^{C} \hat{\pi}_c = 1$. The weights can then be used to combine the component models into an ensemble model as $f(z \mid \pi) = \sum_{c=1}^{C} \pi_c f_c(z)$.
We define a function $\ell(\pi)$ that computes the log-likelihood of the resulting ensemble as follows:
$$ \ell(\pi) = \sum_{i=1}^{N} \log \left( \sum_{c=1}^{C} \pi_c f_c(z_i^* \mid x) \right). $$
Below, we define one procedure to obtain a set of weights for the ensemble:

1. Initialize the weights, for example setting $\pi_c = 1/C$ for every component $c$.
2. Set $\Delta = 1$, or another arbitrary constant.
3. Set $\epsilon$ to be a very small positive number strictly less than $\Delta$.
4. While $\Delta > \epsilon$, repeat: update the weights with one E step and one M step as described above, recompute $\ell(\pi)$, and set $\Delta$ to the resulting change in $\ell(\pi)$.
5. Return the final weights $\hat{\pi}$.
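The procedure above translates directly into code. The following is a minimal sketch, assuming the component scores are collected in an $N \times C$ NumPy array F with F[i, c] = $f_c(z_i^* \mid x)$; the function and variable names are illustrative, and this is not the FluSight Network production implementation.

import numpy as np

def em_weights(F, epsilon=1e-8, max_iter=10_000):
    """Estimate ensemble weights pi by maximizing l(pi) with the EM algorithm."""
    n, n_components = F.shape
    pi = np.full(n_components, 1.0 / n_components)  # start with all nonzero weights
    delta, loglik = 1.0, -np.inf                    # Delta = 1, an arbitrary constant
    for _ in range(max_iter):
        if delta <= epsilon:                        # stop once the gain is tiny
            break
        # E step: expected membership of each observation in each component.
        weighted = F * pi
        memberships = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: new weight = fraction of times each component was "chosen".
        pi = memberships.sum(axis=0) / n
        # Delta is the change in the ensemble log-likelihood l(pi).
        new_loglik = float(np.sum(np.log(F @ pi)))
        delta, loglik = new_loglik - loglik, new_loglik
    return pi

Because each EM iteration cannot decrease $\ell(\pi)$, $\Delta$ is nonnegative and the loop terminates once successive iterations yield essentially no improvement.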
We note that this application of the EM algorithm is a very simple example of the standard EM, which in general does not guarantee a global maximum. However, this particular log-likelihood function (a sum of logarithms of weighted sums) is a concave function of its parameters, the $\pi_c$, and concavity ensures that any local maximum is a global maximum [16]. In particular, we can ensure that there is a global, finite maximum and that the EM algorithm finds it by including a uniform component and making the algorithm start with all nonzero weights.