The authors have declared that no competing interests exist.

Accurate and reliable predictions of infectious disease dynamics can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task, using different model structures, covariates, and targets for prediction. Experience has shown that the performance of these models varies; some tend to do better or worse in different seasons or at different points within a season. Ensemble methods combine multiple models to obtain a single prediction that leverages the strengths of each model. We considered a range of ensemble methods that each form a predictive density for a target of interest as a weighted sum of the predictive densities from component models. In the simplest case, equal weight is assigned to each component model; in the most complex case, the weights vary with the region, prediction target, week of the season when the predictions are made, a measure of component model uncertainty, and recent observations of disease incidence. We applied these methods to predict measures of influenza season timing and severity in the United States, both at the national and regional levels, using three component models. We trained the models on retrospective predictions from 14 seasons (1997/1998–2010/2011) and evaluated each model’s prospective, out-of-sample performance in the five subsequent influenza seasons. In this test phase, the ensemble methods showed average performance that was similar to the best of the component models, but offered more consistent performance across seasons than the component models. Ensemble methods offer the potential to deliver more reliable predictions to public health decision makers.

Public health agencies such as the US Centers for Disease Control and Prevention would like to have as much information as possible when planning interventions intended to reduce and prevent the spread of infectious disease. For instance, accurate and reliable predictions of the timing and severity of the influenza season could help with planning how many influenza vaccine doses to produce and by what date they will be needed. Many different mathematical and statistical models have been proposed to model influenza and other infectious diseases, and these models have different strengths and weaknesses. In particular, one or another of these model specifications is often better than the others in different seasons, at different times within the season, and for different prediction targets (such as different measures of the timing or severity of the influenza season). In this article, we explore ensemble methods that combine predictions from multiple “component” models. We find that these ensemble methods do about as well as the best of the component models in terms of aggregate performance across multiple seasons, but that the ensemble methods have more consistent performance across different seasons. This improved consistency is valuable for planners who need predictions that can be trusted under all circumstances.

The practice of combining predictions from different models has been used for decades by climatologists and geophysical scientists. These methods have subsequently been adapted and extended by statisticians and computer scientists in diverse areas of scientific inquiry. In recent years, these “ensemble” forecasting approaches frequently have been among the top methods used in prediction challenges across a wide range of applications.

Ensembles are a natural choice for noisy, complex, and interdependent systems that evolve over time. In these settings, no single model is likely to capture and predict the full set of complex relationships that drive future observations from a particular system of interest. Instead, “specialist” or “component” models can be relied on to capture distinct features or signals from a system and, when combined, represent a nearly complete range of possible outcomes. In this work, we develop and compare a collection of ensemble methods for combining predictive densities. This enables us to quantify the improvement in predictions achieved by using ensemble methods with varying levels of complexity.

To illustrate these ensemble methods, we present time-series forecasts for infectious disease, specifically for influenza in the United States. The international significance of emerging epidemic threats in recent decades has highlighted the importance of understanding and being able to predict infectious disease dynamics. With the revolution in science driven by the promise of “big” and real-time data, there is an increased focus on and hope for using statistics to inform public health policy and decision-making in ways that could mitigate the impact of future outbreaks. Some of the largest public health agencies in the world, including the US Centers for Disease Control and Prevention (CDC), have openly endorsed using models to inform decision making, saying “with models, decision-makers can look to the future with confidence in their ability to respond to outbreaks and public health emergencies” [

There is a large literature on prediction methods for influenza. We will give a brief overview of this literature here, and refer the reader to Chretien

The ensemble methods that we explore in the present work are designed to combine predictions from multiple models, which could use a variety of different model structures and covariates to generate predictions. Development of the methods presented in this manuscript was motivated by the observation that certain prediction models for infectious disease consistently performed better than other models at certain times of year. We observed in previous research that early in the influenza season, simple models of historical incidence often outperformed more standard time-series prediction models such as a seasonal auto-regressive integrated moving average (SARIMA) model [

A large number of ensemble methods have been developed for a diverse array of tasks including regression, classification, and density estimation. These methods are broadly similar in that they combine results from multiple component models. However, details differ between ensemble methods. We suggest Polikar [

While there are many different methods for combining models, all ensemble models discussed in this paper use an approach called stacking [

In structured prediction settings such as time series forecasting, ensemble methods may benefit from taking advantage of the data structure. For example, it may be the case that different models offer a better representation of the data at different points in time. A common idea in these settings is to use model weights that change over time. For instance, model weights may vary as a function of how well each model did in recent predictions [

Using component models that generate predictive densities for outcomes of interest, we have implemented a series of ensembles using different methods for choosing the weights for each model. Specifically, we compare three different approaches. The first approach simply takes an equally weighted average of all models. The second approach estimates constant but not necessarily equal weights for each model. The third approach is a novel method for determining model weights based on features of the system at the time predictions are made. The overarching goal of this study is to create a systematic comparison between ensemble methods to study the benefits of increasing complexity in ensemble weighting schemes.

We are aware of two previous articles that developed ensemble methods for infectious disease prediction. Yamana

This paper presents a novel ensemble method that determines optimal model combinations based on (a) observed data at the time predictions are made and (b) aspects of the predictive distributions obtained from the component models. We refer to models built using this approach as “feature-weighted” ensembles. This approach fuses aspects of different ensemble methods: it uses model stacking [

Using seasonal influenza outbreaks in the US health regions as a case-study, we developed and applied our ensemble models to predict several attributes of the influenza season at each week during the season. By illustrating the utility of these approaches to ensemble forecasting in a setting with complex population dynamics, this work highlights the importance of continued innovation in ensemble methodology.

This paper presents a comparison of methods for determining weights for weighted density ensembles, applied to forecasting specific features of influenza seasons in the US. First, we present a description of the influenza data we use in our application and the prediction targets. Next, we discuss the three component models utilized by the ensemble framework. We then turn to the ensemble framework itself, describing the different ensemble model specifications used.

We obtained publicly available data on seasonal influenza activity in the United States between 1997 and 2016 from the US Centers for Disease Control and Prevention (CDC) (

The full data include observations aggregated to the national level and for ten smaller regions. Here we plot only the data at the national level and in two of the smaller regions; data for the other regions are qualitatively similar. Missing data are indicated with vertical blue bars. The vertical red dashed lines indicate the cutoff time between the training and testing phases; five seasons of data were held out for testing.

The CDC defines the influenza season onset as the first of three consecutive weeks of the season for which wILI is greater than or equal to a threshold that is specific to the region and season. This threshold is the mean percent of patient visits where the patient had ILI during low incidence weeks for that region in the past three seasons, plus two standard deviations [
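As a minimal sketch of the onset definition above (not the CDC's or the authors' code), the following Python functions compute a baseline threshold and scan a season of weekly wILI values for the first run of three consecutive weeks at or above it. The function names, the 1-based week indexing, and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def onset_threshold(low_incidence_wili: Sequence[float]) -> float:
    """Mean of low-incidence weeks plus two standard deviations.

    Assumption: the sample standard deviation is used.
    """
    return mean(low_incidence_wili) + 2 * stdev(low_incidence_wili)


def onset_week(wili: Sequence[float], threshold: float,
               run_length: int = 3) -> Optional[int]:
    """Return the 1-based week that starts the first run of `run_length`
    consecutive weeks with wILI >= threshold, or None if there is no onset."""
    run = 0
    for index, value in enumerate(wili):
        # Extend the current run, or reset it when a week falls below threshold.
        run = run + 1 if value >= threshold else 0
        if run == run_length:
            # Convert the 0-based index of the run's last week to the
            # 1-based index of its first week.
            return index - run_length + 2
    return None
```

For example, `onset_week([1.0, 2.5, 1.9, 2.2, 2.4, 3.0], threshold=2.1)` identifies week 4, because week 2 is an isolated exceedance and weeks 4 to 6 form the first qualifying run.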

Each predictive distribution was represented by probabilities assigned to bins associated with different possible outcomes. For onset week, the bins are represented by integer values for each possible season week plus a bin for “no onset”. For peak week, the bins are represented by integer values for each possible season week. For peak incidence, the bins capture incidence rounded to a single decimal place, with a single bin to capture all incidence over 12.95. Formally, the incidence bins are as follows: [0, 0.05), [0.05, 0.15), …, [12.85, 12.95), [12.95, 100]. These bins were used in the 2016-2017 influenza prediction contest run by the CDC [
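The incidence bins listed above can be indexed with a small helper. This is an illustrative sketch only: the name `incidence_bin` and the 0-based bin numbering are assumptions, and values that fall exactly on a bin boundary are subject to floating-point rounding.

```python
import math

# 130 bins for incidence rounded to 0.0, 0.1, ..., 12.9,
# plus one final bin covering [12.95, 100].
N_BINS = 131


def incidence_bin(value: float) -> int:
    """Return the 0-based index of the bin containing `value`.

    Bin k (k = 0..129) holds values that round to k / 10;
    bin 130 holds everything from 12.95 up to 100.
    """
    if not 0.0 <= value <= 100.0:
        raise ValueError("incidence must lie in [0, 100]")
    if value >= 12.95:
        return N_BINS - 1
    # Round half up to one decimal place, expressed in tenths.
    return int(math.floor(value * 10 + 0.5))
```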

We measure the accuracy of predictive distributions using the log score. The log score is a proper scoring rule [
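For a binned predictive distribution, the log score is the logarithm of the probability assigned to the bin containing the observed outcome. The sketch below illustrates this; the function name is hypothetical, and returning negative infinity for a zero-probability bin is an assumption about handling that case rather than a description of the analysis in this article.

```python
import math
from typing import Sequence


def log_score(bin_probs: Sequence[float], observed_bin: int) -> float:
    """Log of the probability that the forecast assigned to the observed bin."""
    p = bin_probs[observed_bin]
    # A forecast that gave the observed bin zero probability scores -infinity.
    return math.log(p) if p > 0 else float("-inf")
```

Higher (less negative) values indicate that the forecast placed more probability on what actually happened.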

We used three component models to generate probabilistic predictions of the three prediction targets. The first model was a seasonal average model that utilized kernel density estimation (KDE) to estimate a predictive distribution for each target. The second model utilized kernel conditional density estimation (KCDE) and copulas to create a joint predictive distribution for incidence in all remaining weeks of the season, conditional on recent observations of incidence [

The simplest of the component models uses kernel density estimation [

To create an empirical predictive distribution from a KDE fit to a data vector $y_{1:K}$ (for example, this might be the vector of peak week values from the $K$ training seasons), we drew Monte Carlo samples based on $y_{1:K}$, yielding a new vector of $10^5$ sampled values. In theory, this model assigns non-zero probability to every possible outcome; however, in a few cases the empirical predictive distribution resulting from this Monte Carlo sampling approach assigned probability zero to some of the bins.
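One generic way to draw Monte Carlo samples from a kernel density estimate is to resample a data point at random and perturb it with noise from the kernel. The sketch below assumes a Gaussian kernel with a bandwidth supplied by the caller; the function name, the kernel choice, and the bandwidth are assumptions for illustration and are not taken from the article.

```python
import random
from typing import List, Optional, Sequence


def sample_from_gaussian_kde(data: Sequence[float], bandwidth: float,
                             n_samples: int,
                             seed: Optional[int] = None) -> List[float]:
    """Draw samples from a Gaussian KDE centered on `data`.

    Each draw picks one data point uniformly at random and adds
    Gaussian noise with standard deviation `bandwidth`.
    """
    rng = random.Random(seed)
    return [rng.gauss(rng.choice(data), bandwidth) for _ in range(n_samples)]
```

Tabulating such samples into the target's bins gives an empirical predictive distribution. With a finite number of draws, some low-probability bins can receive no samples at all, which is consistent with the zero-probability bins mentioned above.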

It is important to note that the predictions from this model do not change as new data are observed over the course of the season.

We used kernel conditional density estimation and copulas to estimate a joint predictive distribution for flu incidence in each future week of the season, and then calculated predictive distributions for each target from that joint distribution [

To predict seasonal quantities (onset, peak timing, and peak incidence), we simulate $10^5$ trajectories of disease incidence from this joint predictive distribution. For each simulated incidence trajectory, we compute the onset week, peak week, and peak incidence. We then aggregate these values to create predictive distributions for each target. This procedure for obtaining predictive distributions for the targets of interest can be formally justified as an appropriate Monte Carlo integral of the joint predictive distribution for disease incidence in future weeks (see [
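The aggregation step described above can be sketched as follows for the two peak targets. The function names are hypothetical, ties in peak incidence are resolved to the earliest week (an assumption), and the rounding here uses Python's built-in `round` rather than the contest bins.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def peak_targets(trajectory: Sequence[float]) -> Tuple[int, float]:
    """Return (1-based peak week, peak incidence rounded to one decimal)."""
    peak_incidence = max(trajectory)
    # Assumption: if the maximum occurs more than once, use the first week.
    peak_week = list(trajectory).index(peak_incidence) + 1
    return peak_week, round(peak_incidence, 1)


def empirical_distribution(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Share of simulated trajectories that produced each distinct value."""
    values = list(values)
    counts = Counter(values)
    return {value: count / len(values) for value, count in sorted(counts.items())}


def target_distributions(trajectories: Sequence[Sequence[float]]) -> Dict[str, Dict]:
    """Summarize simulated trajectories into distributions for each target."""
    targets = [peak_targets(t) for t in trajectories]
    return {
        "peak_week": empirical_distribution(week for week, _ in targets),
        "peak_incidence": empirical_distribution(inc for _, inc in targets),
    }
```

With four toy trajectories whose peaks fall in weeks 2, 3, 2, and 2, the peak week distribution places probability 0.75 on week 2 and 0.25 on week 3.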

We fit seasonal ARIMA models [

Similar to KCDE, forecasts were obtained by sampling $10^5$ trajectories of wILI values over the rest of the season (using the

We used data from 14 seasons (1997/1998 through 2010/2011) to train the models. Data from five seasons (2011/2012 through 2015/2016) were held out when fitting the models and used exclusively in the testing phase. To avoid overfitting our models, we made predictions for the test phase only once [

Estimation of the ensemble models (discussed in the next subsection) requires cross-validated measures of performance of each of the component models in order to accurately gauge their relative performance. For each region, we estimated the parameters of each component model 15 times: 14 fits were obtained excluding one training season at a time, and another fit used all of the training data. For each fit obtained leaving one season out, we generated a set of three predictive distributions (one for each of the prediction targets) at each week in the held-out season. We were not able to generate predictions from the SARIMA and KCDE models for some seasons in the training phase because those models used lagged observations from previous seasons that were missing in our data set. The component model fits based on all of the training data were used to generate predictions for the test phase.

All of the ensemble models we consider in this article work by averaging predictions from the component models to obtain the ensemble prediction. Additionally, these methods are stacked model ensembles because they use leave-one-season-out predictions from the independently estimated component models as inputs to estimate the model weights [

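A single set of notation can be used to describe all of the ensemble frameworks implemented here. Let $f^{(m)}(y_t \mid \mathbf{x}_t^{(m)})$ denote the predictive density from component model $m$, $m = 1, \ldots, M$, for the value of a random variable $Y_t$ conditional on observed variables $\mathbf{x}_t^{(m)}$. $Y_t$ could for example represent the peak incidence for a given season and region; in our application to predicting seasonal quantities, the same outcome $y_t$ will be realized for all weeks within a given season. In the context of time series predictions, the covariate vector $\mathbf{x}_t^{(m)}$ is indexed by the superscript $(m)$, which reflects the fact that each component model may use a different set of covariates.

The combined predictive density $f(y_t \mid \mathbf{x}_t)$ for a particular target can be written as

$$f(y_t \mid \mathbf{x}_t) = \sum_{m=1}^{M} \pi_m(\mathbf{x}_t) \, f^{(m)}(y_t \mid \mathbf{x}_t^{(m)}).$$

In this expression, the $\pi_m$ are the model weights, which are allowed to vary as a function of observed features in $\mathbf{x}_t$. We define $\mathbf{x}_t$ to be a vector of all observed quantities that are used by any of the component models or in calculating the model weights. In order to guarantee that $f(y_t \mid \mathbf{x}_t)$ is a probability distribution we require that $\pi_m(\mathbf{x}_t) \geq 0$ for all $m$ and $\sum_{m=1}^{M} \pi_m(\mathbf{x}_t) = 1$ for all $\mathbf{x}_t$.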
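For binned predictive distributions, the weighted combination described above reduces to a weighted average of bin probabilities. The following is a minimal sketch with a hypothetical function name; the weights are passed in directly, so it applies equally to equal, constant, or feature-dependent weights evaluated at one time point.

```python
from typing import List, Sequence


def combine_predictions(component_probs: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[float]:
    """Weighted average of component bin probabilities.

    component_probs[m][b] is the probability model m assigns to bin b;
    weights[m] is the weight on model m.
    """
    if len(component_probs) != len(weights):
        raise ValueError("need exactly one weight per component model")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n_bins = len(component_probs[0])
    return [
        sum(w * probs[b] for w, probs in zip(weights, component_probs))
        for b in range(n_bins)
    ]
```

Because the weights are non-negative and sum to 1, the result is itself a probability distribution whenever each component distribution is.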

The distributions illustrated here have density bins of 1 wILI unit, which differs from those used in the manuscript for illustrative purposes only. Panel A shows the predictive distributions from three component models. Panel B shows scaled versions of the distributions from A, after being multiplied by model weights. In Panel C, the scaled distributions are literally stacked to create the final ensemble predictive distribution.

In the following subsection, we propose a framework for estimating

We used four distinct methodologies to define weights to use for the stacking models:

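Equal Weights (EW): $\pi_m(\mathbf{x}_t) = 1/M$ for all values of $\mathbf{x}_t$.

Constant model weights via degenerate EM (CW): $\pi_m(\mathbf{x}_t) = c_m$, a constant where $\sum_{m=1}^{M} c_m = 1$. A separate set of weights is estimated for each region and prediction target.

Feature-weighted (FW): $\pi_m(\mathbf{x}_t)$ depends on features including week of the season and model uncertainty for the KCDE and SARIMA models. A separate set of weighting functions is estimated for each region and prediction target.

Feature-weighted with regularization: $\pi_m(\mathbf{x}_t)$ depends on features, but with regularization discouraging the weights from taking extreme values or from varying too quickly as a function of $\mathbf{x}_t$. A separate set of weighting functions is estimated for each region and prediction target. We fit three variations on this ensemble model, using different sets of features:

(FW-reg-w) week of the season;

(FW-reg-wu) week of the season and model uncertainty for the KCDE and SARIMA models;

(FW-reg-wui) week of the season, model uncertainty for the KCDE and SARIMA models, and current wILI.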
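To make the constant-weight estimation concrete, the sketch below shows a standard EM iteration for mixture weights when the component distributions are held fixed, which is one common reading of "degenerate EM". It is an illustration under that assumption, not the authors' implementation; the function name, the starting values, and the stopping rule are hypothetical choices.

```python
import math
from typing import List, Sequence


def degenerate_em(probs: Sequence[Sequence[float]], tol: float = 1e-10,
                  max_iter: int = 10_000) -> List[float]:
    """Estimate constant weights that maximize the ensemble log score.

    probs[t][m] is the probability component model m assigned to the
    outcome observed at training time t. Assumes every row has at
    least one positive entry.
    """
    n_models = len(probs[0])
    weights = [1.0 / n_models] * n_models  # start from equal weights
    previous = -math.inf
    for _ in range(max_iter):
        # Ensemble probability of each observed outcome under current weights.
        mix = [sum(w * p for w, p in zip(weights, row)) for row in probs]
        log_lik = sum(math.log(v) for v in mix)
        if log_lik - previous < tol:
            break
        previous = log_lik
        # New weight for model m: its average share of the ensemble probability.
        weights = [
            sum(w * row[m] / denom for row, denom in zip(probs, mix)) / len(probs)
            for m, w in enumerate(weights)
        ]
    return weights
```

Each update keeps the weights non-negative and summing to 1, so no separate normalization step is needed. A component that assigns higher probability to every observed outcome than its competitors ends up with nearly all of the weight.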

All in all, this leads to 6 ensemble models, summarized in the table below according to the variables on which the weights $\pi_m(\mathbf{x}_t)$ depend. We will discuss the regularization strategies used in

Component model weights vary with the variables marked X in each row.

| Model | Region | Prediction Target | Week of Season | SARIMA Uncertainty | KCDE Uncertainty | Current wILI |
|---|---|---|---|---|---|---|
| EW | | | | | | |
| CW | X | X | | | | |
| FW | X | X | X | X | X | |
| FW-reg-w | X | X | X | | | |
| FW-reg-wu | X | X | X | X | X | |
| FW-reg-wui | X | X | X | X | X | X |

As discussed above, leave-one-season-out prediction results from the three component models are inputs to the ensemble estimation routines. During ensemble estimation, we dropped any training set time points for which cross-validated predictions from all three component models were not available. After the training phase, each of the six ensemble models, along with the three component models, is used to generate predictions in every season-week of each of the five testing seasons, assuming perfect reporting. These predictions are then used to evaluate the prospective predictive performance of each of the ensemble methods. In total, we evaluate 9 models in 11 regions over 5 seasons and 3 targets of interest.

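In this section we introduce the particular specification of the parameter weight functions $\pi_m(\mathbf{x}_t)$ that we use for the

In order to ensure that the $\pi_m$ are non-negative and sum to 1 for all values of $\mathbf{x}_t$, we parameterize them in terms of the softmax transformation of real-valued latent functions $\rho_m$:

$$\pi_m(\mathbf{x}_t) = \frac{\exp\{\rho_m(\mathbf{x}_t)\}}{\sum_{m'=1}^{M} \exp\{\rho_{m'}(\mathbf{x}_t)\}}.$$

For a pair of models $l$ and $m$, $\rho_l(\mathbf{x}_t) > \rho_m(\mathbf{x}_t)$ indicates that model $l$ receives more weight than model $m$ at the given value of $\mathbf{x}_t$. The functions $\rho_m(\mathbf{x}_t)$ could be parameterized and estimated using many different techniques, such as a linear specification in the features, splines, and so on. We chose to estimate the functions $\rho_m(\mathbf{x}_t)$ using gradient tree boosting.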
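The softmax transformation mentioned above can be sketched in a few lines. The function name is hypothetical, and subtracting the maximum latent value before exponentiating is a standard numerical safeguard that leaves the resulting weights unchanged.

```python
import math
from typing import List, Sequence


def softmax_weights(latent_values: Sequence[float]) -> List[float]:
    """Map real-valued latent scores (one per model) to weights summing to 1."""
    largest = max(latent_values)
    # Shifting by the maximum avoids overflow in exp() and cancels in the ratio.
    exps = [math.exp(value - largest) for value in latent_values]
    total = sum(exps)
    return [e / total for e in exps]
```

Equal latent values yield equal weights, and a model with a larger latent value than another always receives the larger weight of the two.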

Gradient tree boosting uses a forward stagewise additive modeling algorithm to iteratively and incrementally construct a series of regression trees that, when added together, create a function designed to minimize a given loss function. In our application, the algorithm builds up the functions $\rho_m(\mathbf{x}_t)$ that minimize the negative log score of the stacked predictions $f(y_t \mid \mathbf{x}_t)$ across all times $t$.

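Specifically, we define a single tree as

$$T(\mathbf{x}_t; \theta) = \sum_{j=1}^{J} \gamma_j \, \mathbb{I}(\mathbf{x}_t \in R_j),$$

where the regions $R_j$ partition the space of feature vectors $\mathbf{x}_t$, and the indicator $\mathbb{I}(\mathbf{x}_t \in R_j)$ equals 1 if $\mathbf{x}_t \in R_j$ and 0 otherwise, so that the tree takes the constant value $\gamma_j$ on region $R_j$. The function $\rho_m(\mathbf{x}_t)$ is obtained as the sum of the trees constructed across the boosting iterations.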

In each iteration

Gradient tree boosting is appealing as a method for estimating the functions $\rho_m$ because it offers a great deal of flexibility in how the weights can vary as a function of the features $\mathbf{x}_t$. On the other hand, this flexibility can lead to overfitting the training data. In order to limit the chances of overfitting, we have explored the use of three regularization parameters:

The number of boosting iterations

An $L_1$ penalty on the number of tree leaves, which limits how flexibly the weights can vary as a function of $\mathbf{x}_t$.

An $L_1$ penalty on the regression constants $\gamma_j$. A large penalty encourages these constants to be small, so that the overall model weights change less in each boosting iteration.

We selected values for these regularization parameters using a grid search optimizing leave-one-season-out cross-validated model performance.
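A grid search with leave-one-season-out cross-validation can be sketched generically as follows. The `fit` and `score` callables, the parameter names, and the use of a simple mean across held-out seasons are placeholders chosen for illustration; they do not describe the authors' software.

```python
from itertools import product
from statistics import mean
from typing import Any, Callable, Dict, List, Sequence, Tuple


def grid_search_loso(seasons: Sequence[Any],
                     grid: Dict[str, List[Any]],
                     fit: Callable[[List[Any], Dict[str, Any]], Any],
                     score: Callable[[Any, Any], float]
                     ) -> Tuple[Dict[str, Any], float]:
    """Return the parameter combination with the best mean held-out score.

    fit(training_seasons, params) returns a fitted model;
    score(model, held_out_season) returns a value where higher is better,
    such as a mean log score.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out in seasons:
            # Fit on every season except the one being held out.
            training = [s for s in seasons if s != held_out]
            fold_scores.append(score(fit(training, params), held_out))
        cv_score = mean(fold_scores)
        if cv_score > best_score:
            best_params, best_score = params, cv_score
    return best_params, best_score
```

In the setting described above, the grid would range over the three regularization parameters and the score would be a cross-validated log score.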

We used R version 3.2.2 (2015-08-14) for all analyses [

To evaluate overall model performance, we computed log scores for all predictions made by each model across all regions and test phase seasons. Predictions made before the season peak (for predictions of peak incidence or peak timing) or before the season onset (for predictions of season onset timing) are the most relevant to decision makers using the predictions as inputs to set public policy. We therefore focus our comparison of model performance on results for predictions made before the target event occurred within each of the test phase seasons. Plots of the full predictive distributions at the national level from the

As discussed in the methods section, our test set contained predictions from each model for 3 targets over 5 seasons in 11 spatial units. To ensure that seasons with later onsets or later peaks would not count more heavily than seasons with earlier onsets and peaks, and to simplify the analysis in the presence of serial autocorrelation in model performance over consecutive weeks, we summarized model performance within each season by the mean log score for all predictions made before the peak or onset week (as appropriate for the prediction target). This led to 165 observations of model performance for each model, corresponding to the unique combinations of prediction target, season, and spatial unit.
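The season-level summary described above can be sketched as follows. The function name and the data layout are hypothetical, and treating "before" as strictly earlier than the event week is an assumption about the exact cutoff.

```python
from statistics import mean
from typing import Mapping, Optional


def pre_event_mean_log_score(weekly_log_scores: Mapping[int, float],
                             event_week: int) -> Optional[float]:
    """Mean log score over predictions made before the target event.

    weekly_log_scores maps the season week in which a prediction was made
    to its log score; event_week is the observed onset or peak week.
    Returns None when no predictions precede the event.
    """
    scores = [s for week, s in weekly_log_scores.items() if week < event_week]
    return mean(scores) if scores else None
```

Applying this once per prediction target, region, and test season yields one summary value per combination for each model.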

Model weights are shown for predictions of onset timing (panel A), peak timing (panel B), and peak incidence (panel C) at the national level. The upper plot within each panel shows mean, minimum, and maximum log scores achieved by each component model for predictions of the given prediction target at the national level in each week of the season, summarizing across all seasons in the training phase when all three component models produced predictions. The lower plot within each panel shows model weights from the CW and FW-reg-w ensemble methods at each week in the season.

The model weights assigned by the feature weighted ensemble models generally track these trends in relative model performance (

Aggregating across all combinations of prediction target, region, and season in the test phase, the best component models and the best ensemble models had similar performance (

For each combination of 3 prediction targets, 11 regions, and 5 test phase seasons, we calculated the mean log score for all predictions made by each method in weeks before the event being predicted occurred. Panel A presents the overall mean of these values for each method; higher mean log scores indicate better performance. Panel B displays the difference in mean log scores for each pair of models. Positive values indicate that the model on the vertical axis outperformed the model on the horizontal axis on average. A permutation test was used to obtain approximate p-values for these differences (see

As noted above, our test set included only 5 seasons, and the effective sample size for model comparison is smaller than the 165 combinations of prediction target, region, and test phase season due to correlations in predictive performance across regions and seasons. This may have contributed to our inability to detect statistically significant differences between the best models, and may limit the generalizability of these results; we will return to this point in the discussion.
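The details of the permutation test are not reproduced in this section. Purely as an illustration of the general idea, the sketch below implements a generic paired sign-flip permutation test on season-level differences in mean log score between two models. It is not the authors' procedure: the function name, the two-sided test statistic, the number of permutations, and the add-one p-value correction are all assumptions, and the sketch treats the differences as exchangeable.

```python
import random
from statistics import mean
from typing import Sequence


def paired_permutation_p_value(differences: Sequence[float],
                               n_permutations: int = 2000,
                               seed: int = 0) -> float:
    """Approximate two-sided p-value for a mean paired difference of zero.

    Under the null hypothesis the two models are interchangeable, so the
    sign of each paired difference can be flipped at random.
    """
    rng = random.Random(seed)
    observed = abs(mean(differences))
    extreme = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in differences]
        # Small tolerance so that ties with the observed value count as extreme.
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)
```

Correlation across regions and seasons would violate the exchangeability assumed here, which is one reason p-values of this kind are best read as approximate.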

Although the aggregate performance of these models is quite similar, some differences between the methods begin to emerge when we examine performance in more detail. Predictions that are used in setting public policy must be of consistent quality across all regions and seasons. We observed that the component models showed more variability and lower worst-case performance than the ensemble methods. The discussion in this subsection presents results of an exploratory analysis of the results, and all p-values are from post-hoc hypothesis tests.

To examine consistency of predictive performance, for each combination of prediction target, region, and test phase season we calculated the difference in mean log scores between each method and the method with median performance for that target, region, and season. This measure of model performance relative to the median can be compared across prediction targets, regions, and seasons that may be predicted with varying levels of difficulty.
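The relative performance measure described above, together with a worst-case summary per method, can be sketched as follows. The function names and data layout are hypothetical. With an odd number of methods, such as the nine compared here, the median score is the score of an actual method; with an even number, `statistics.median` averages the two middle scores.

```python
from statistics import median
from typing import Dict, Hashable, Mapping


def difference_from_median(
        scores: Mapping[Hashable, Mapping[str, float]]
) -> Dict[Hashable, Dict[str, float]]:
    """For each (target, region, season) key, subtract the median method's
    mean log score from every method's mean log score."""
    result = {}
    for combination, by_method in scores.items():
        middle = median(by_method.values())
        result[combination] = {m: s - middle for m, s in by_method.items()}
    return result


def worst_case_by_method(
        differences: Mapping[Hashable, Mapping[str, float]]
) -> Dict[str, float]:
    """Minimum difference from the median for each method across all keys."""
    worst: Dict[str, float] = {}
    for by_method in differences.values():
        for method, diff in by_method.items():
            worst[method] = min(diff, worst.get(method, diff))
    return worst
```

Larger (less negative) worst-case values indicate a method whose poorest showing relative to its peers is less severe.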

We calculate the difference in log scores for a given method and the method with median performance for each combination of prediction target, region, and test phase season; each density curve summarizes results across all 165 combinations of 3 prediction targets, 11 regions, and 5 test phase seasons. Positive values indicate better performance than the median model. For legibility, we only show results for the two component models with best mean performance (KCDE and SARIMA) and for the two ensemble models with best mean performance (CW and FW-reg-w).

We can quantify this observation by comparing the minimum performance relative to the median across all prediction targets, regions, and seasons for each method (

For each combination of 3 prediction targets, 11 regions, and 5 test phase seasons, we calculated the difference in mean log scores between each method and the method with median performance for that target, region, and season. Panel A presents the minimum difference from the median model for each method across all combinations of target, region, and season. Larger values of this quantity indicate that the given model has better worst-case performance. Panel B displays the difference in this measure of worst-case performance for each pair of models. Positive values indicate that the model on the vertical axis had better worst-case performance than the model on the horizontal axis. A permutation test was used to obtain approximate p-values for these differences (see

The regularization of feature-weighted ensembles improved early-season prediction accuracy. A comparison of the

In this work we have examined the potential for ensemble methods to improve infectious disease predictions. We explored a nested series of ensemble methods, focusing on methods that computed weighted averages of predictive distributions for seasonal targets of public health interest, such as the peak intensity of the outbreak and the timing of both season onset and peak. The methods we examined ranged from using equal model weights to more complex schemes with weights that varied as functions of multiple covariates. The best of these ensemble methods achieved overall performance that was about as good as the best of the individual component models, with increased stability in model performance across different regions and seasons.

Increased stability in predictive accuracy can provide decision makers with more confidence when using predictions as inputs to set policy. For example, if a single model does well in most seasons but occasionally fails badly, planning decisions may be negatively impacted in those failing years. This may be particularly important in a public health setting where the events that are most important to get right are those relatively rare cases when incidence is much larger than usual or the season timing is earlier or later than usual. This reduction in variability of model performance achieved by ensemble methods is therefore important for ensuring that our predictions are reliable under a variety of conditions.

Among the different ensemble specifications we considered, the

All hypothesis tests we conducted related to worst-case performance were post-hoc tests conducted after an exploratory analysis of relative model performance, and these results should be confirmed in future studies. Additionally, the permutation test we used accounts for serial autocorrelation in model performance within a region-season, but does not account for correlation across region or seasons; thus the p-values discussed throughout this work should be regarded as only approximate indicators of statistical significance.

The feature-weighted ensemble models presented in this article use a novel scheme to estimate feature-dependent model weights that sum to 1 and are therefore suitable for use in combining predictive distributions. This general method could be applied to combine distribution estimates in any context, and is not limited to time-series or infectious disease applications. Furthermore, comparing an implementation of the feature-weighting that smoothed the model weights to one that did not showed consistent improvements in model performance. This result suggests that future work on feature-weighted ensemble implementations should consider regularized estimation.

Infectious disease predictions are only useful to public health officials if they are communicated effectively in real time. Predictions from an early version of the

A central challenge of working with infectious disease data sets is the limited number of years of data available for model estimation and evaluation. We have used approximately one fourth of our data set for model evaluation, which left us with only 14 seasons of training data and 5 seasons of testing data. Additionally, we had fewer than 14 seasons of leave-one-season-out predictions to use in estimating the model weighting functions for the

Another limitation of this work is the small selection of component models used. Theoretical results and applications have demonstrated that ensemble methods are most effective when using a diverse set of component models [

Our exploration of feature-weighted ensembles is also limited by the relatively restricted feature sets we used for the weighting functions. We selected a few features based on exploratory analysis of the training phase results, and set all ensemble model formulations before obtaining any predictions for the test phase. It is possible that other weighting features not considered in this work may be more informative than those we have used. Some ideas for weighting covariates to use in future work include the largest incidence so far this season; the onset threshold; alternative summaries of the predictive distributions from the component models such as the probability at the mode or the modal value; the predominant flu strain; or the distribution of incidence in age groups.

The performance of the ensemble methods might be improved by subsetting the training data for the ensembles to the most important observations. The discrepancy in this work between the times used to train the ensembles (all leave-one-season-out predictions) and the times used for model comparison (only predictions made before the season onset or peak) may have led to an artificial decline in performance for the ensembles; this may be especially so for the relatively inflexible

This work provides a rigorous and comprehensive evaluation of ensemble methods for averaging probabilistic predictions for features of infectious disease outbreaks. A range of models, both single component models and ensemble models that combined component model predictions, demonstrated the ability to make more accurate predictions than a seasonal average baseline model. Additionally, systematic comparisons of simple and complex prediction models highlight a crucial added value of ensemble modeling, namely increased stability and consistency of model performance relative to the component models. Continued investigation, application, and innovation are necessary to strengthen our understanding of how to best leverage combinations of models to assist decision makers in fields, such as public health and infectious disease surveillance, that require data-driven rapid response.

(PDF)

Predictions are shown for just the FW-reg-w method at the national level, facetted by test phase season.

(PDF)

Predictions are shown for just the FW-reg-w method at the national level, facetted by test phase season.

(PDF)

Predictions are shown for just the FW-reg-w method at the national level, facetted by test phase season.

(PDF)

For each week of the season, log scores are summarized across all seasons in the training phase when all three component models produced predictions. The thick line is a smoothed estimate of mean log score at each week in the season; the shaded region indicates the convex hull of log scores achieved by each model; and the actual log scores achieved in each week are indicated with points.

(PDF)

Model uncertainty is measured by the number of bins required to cover 90% of the predictive distribution. The plot summarizes results across all seasons in the training phase when all three component models produced predictions. The thick line is a smoothed estimate of mean log score at each value of model uncertainty; the shaded region indicates the convex hull of log scores achieved by each model; and the actual log scores achieved in each week are indicated with points. The KCDE and SARIMA models condition on all previously observed data within the current season, and generally have high certainty when the target event (season onset or season peak) has almost occurred or has already occurred.

(PDF)

The plot summarizes results across all seasons in the training phase when all three component models produced predictions. The thick line is a smoothed estimate of mean log score at each week in the season; the shaded region indicates the convex hull of log scores achieved by each model; and the actual log scores achieved in each week are indicated with points.

(PDF)

Weights are shown for the prediction of season peak incidence at the national level. There are three weighting functions (one for each component model) represented in each row of the figure. The value of the weight is depicted by the color. Each function depends on three features: the week of the season at the time when the predictions are made, KCDE model uncertainty, and SARIMA model uncertainty. Model uncertainty represents the minimum number of predictive distribution bins required to cover 90% probability of the predictive distribution, so the higher this number is the more uncertain the model is.
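The uncertainty measure defined above, the minimum number of bins needed to cover 90% of a predictive distribution, can be computed by accumulating bin probabilities from largest to smallest. The function name and the small floating-point tolerance are assumptions made for this sketch.

```python
from typing import Sequence


def bins_to_cover(bin_probs: Sequence[float], coverage: float = 0.9) -> int:
    """Smallest number of bins whose total probability reaches `coverage`."""
    total = 0.0
    # Taking the most probable bins first gives the minimum count.
    for count, prob in enumerate(sorted(bin_probs, reverse=True), start=1):
        total += prob
        # Tolerance guards against rounding error in the running sum.
        if total >= coverage - 1e-12:
            return count
    return len(bin_probs)
```

A sharply peaked distribution such as `[0.5, 0.3, 0.15, 0.05]` needs 3 bins, while a uniform distribution over 10 bins needs 9, so larger values correspond to more diffuse, less certain predictions.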

(PDF)

Predictions are aggregated across all regions and test phase seasons. The horizontal axis represents the difference in log scores achieved by the FW-reg-w and SARIMA models for predictions made in a particular week; positive values indicate that FW-reg-w outperformed SARIMA for that prediction. The vertical line indicates the mean log score difference for all predictions made before the onset or season peak occurred.

(PDF)

For each combination of 3 prediction targets, 11 regions, and 5 test phase seasons, we calculated the difference in mean log scores between each method and the method with median performance for that target, region, and season. Panel A presents the 10th percentile of these differences from the median model for each method across all combinations of target, region, and season. Larger values of this quantity indicate that the given model has better worst-case performance. Panel B displays the difference in this measure of worst-case performance for each pair of models. Positive values indicate that the model on the vertical axis had better worst-case performance than the model on the horizontal axis. A permutation test was used to obtain approximate p-values for these differences (see

(PDF)

Only predictions made before the target (season onset or peak) occurred are included. Averages are taken across all regions.

(PDF)