Fast and accurate influenza forecasting in the United States with Inferno

Infectious disease forecasting is an emerging field with the potential to improve public health through anticipatory resource allocation, situational awareness, and mitigation planning. To explore and operationalize disease forecasting, the U.S. Centers for Disease Control and Prevention (CDC) has hosted FluSight, an annual flu forecasting challenge, since the 2013/14 flu season. Since FluSight's onset, forecasters have developed and improved forecasting models in an effort to provide more timely, reliable, and accurate information about the likely progression of an outbreak. While improving the predictive performance of these forecasting models is often the primary objective, it is also important for a forecasting model to run quickly, facilitating further model development and improvement while providing flexibility when deployed in a real-time setting. In this vein, I introduce Inferno, a fast and accurate flu forecasting model inspired by Dante, the top performing model in the 2018/19 FluSight challenge. When pseudoprospectively compared to all models that participated in FluSight 2018/19, Inferno would have placed 2nd in the national and regional challenge as well as the state challenge, behind only Dante. Inferno, however, runs in minutes and is trivially parallelizable, while Dante takes hours to run, representing a significant operational improvement with minimal impact on performance. Forecasting challenges like FluSight should continue to monitor and evaluate how they can be modified and expanded to incentivize the development of forecasting models that benefit public health.

FluSight solicits short-term (1- through 4-week-ahead) forecasts of ILI for states or wILI for HHS regions and the U.S. (collectively referred to as (w)ILI), as well as the week of flu season onset, the week the flu season will peak, and the peak value of (w)ILI for the flu season. FluSight uses a modified log scoring rule to evaluate forecasts [26]. The modified log scoring rule evaluates probabilistic forecasts, requiring forecasters not only to provide a prediction of what they think will happen in the future but also to quantify how sure they are of that prediction. The choice made by the U.S. CDC to use a modified log scoring rule makes clear its position that uncertainty quantification is of value to public health. Given a set of forecasting targets and an evaluation metric, forecasters participating in FluSight develop models capable of forecasting those targets within the real-time operational constraints of the challenge, with the goal of maximizing their model's forecast evaluation score.
Forecasting challenges are powerful incentivization engines: how they are structured encourages or requires models to have certain properties that align with public health needs. For instance, if public health needs forecasting models capable of short-term and long-term forecasting, selecting short-term and long-term/seasonal targets incentivizes the development of models that can do both of those things well. If public health needs probabilistic forecasting models that quantify their uncertainty, selecting a scoring rule that rewards appropriate uncertainties and penalizes overly confident or overly conservative forecasts incentivizes probabilistic model development. If public health needs forecasting models to support rapid response decision making, increasing the forecast submission cadence (e.g., from weekly to daily), reducing the amount of time between the release of new data and the forecast submission deadline, and/or expanding the scope of forecasting geographies (e.g., from HHS regions to states to counties) incentivizes the development of forecasting models that run quickly.
In this paper, I focus on improving the runtime of flu forecasting models while maintaining high prediction standards through the presentation of Inferno, a fast and accurate flu forecasting model. Inferno is a parallelizable, Bayesian forecasting model inspired by Dante, the top performing model in FluSight 2018/19 [14]. The goal of Inferno, which it achieves, is to maintain the high predictive performance of Dante while substantially decreasing the runtime. As will be discussed later, in a pseudoprospective comparison, Inferno would have placed 2nd only to Dante in the 2018/19 FluSight challenge but runs in minutes rather than hours, constituting a significant speed-up in operational performance.
In the remainder of this paper, I describe the details of Inferno (Section 2) and present Inferno's forecasting performance as compared to all participating models in FluSight 2018/19 (Section 3).

Dante background
Dante is a multiscale, probabilistic, influenza forecasting model. It requires historical data of past flu seasons to effectively learn patterns and leverage those patterns for forecasting. Dante has two sub-models: a state forecasting model and an aggregation model which combines state forecasts to produce forecasts for HHS regions and the United States.
Dante's state forecasting model is

$$y_{rst} \mid \theta_{rst}, \lambda_r \sim \mathrm{Beta}(\lambda_r\theta_{rst},\ \lambda_r(1 - \theta_{rst})) \quad \text{(Eq 1 of [14])}$$

where $y_{rst}$ is ILI/100 for week $t$ for state $r$ during season $s$, and $\theta_{rst}$, the conditional expectation of $y_{rst}$ given $\theta_{rst}$ and $\lambda_r$, is modeled as a function of four components: an overall trend component ($m^{\mathrm{all}}_t$), a state-specific deviation component ($m^{\mathrm{state}}_{rt}$), a season-specific deviation component ($m^{\mathrm{season}}_{st}$), and a state- and season-specific deviation component ($m^{\mathrm{interaction}}_{rst}$). These four components are each modeled as random or reverse-random walks, flexible time series models that capture temporal correlation (for more details and non-infectious-disease applications of reverse-random walks, see [27] and [28]). By modeling all states and past flu seasons jointly, Dante is able to borrow information across seasons and space. By modeling the HHS regional and United States forecasts as U.S. Census population-weighted averages of state forecasts, Dante ensures self-consistency across geographic scales. For more details on Dante, see [14].
Dante is a fully Bayesian model, capturing uncertainty in all model parameters, latent states, and forecasts through its posterior (predictive) distribution. The fully Bayesian formulation and self-consistency of Dante comes at a computational price, however. Dante represents a large model that will grow each year as more historical data are added and is not well-positioned to scale with possible future changes/expansions to FluSight (e.g., county-level forecasting). Nothing is precomputed and due to its interconnected model structure, it is not obvious how to break up Dante to exploit parallelization.
Inferno was developed to address these computational shortcomings. Inferno, while motivated by Dante, deviates from it in two main ways. First, Inferno is fit separately to each geographical unit. This allows Inferno to leverage parallel computing architectures, but at the expense of modeling correlations across states. Second, Inferno precomputes many of its parameters via a heuristic estimation procedure, reducing the number of parameters and latent model components that need to be sampled via Markov chain Monte Carlo (MCMC). These two choices result in significant computational speed-ups with only a moderate loss in forecast accuracy. In Section 2.2, I describe the Inferno forecasting model.

Inferno
Inferno is fit to each geographical unit separately and can be viewed as a simplified version of Dante, where Dante's state-specific components ($m^{\mathrm{state}}_{rt}$ and $m^{\mathrm{interaction}}_{rst}$) are removed, certain parameters are kept fixed at predetermined values, and the random walk model on $m^{\mathrm{season}}_{st}$ is replaced with a multivariate normal model. Specifically, let $y_{s,t} \in (0, 1)$ be ILI/100 for states or wILI/100 for HHS regions and the United States for season $s = 1, 2, \ldots, S$ and week $t = 1, 2, \ldots, T = 35$, where $t = 1$ corresponds to Morbidity and Mortality Weekly Report (MMWR) week 40, roughly the beginning of October, and $T = 35$ roughly corresponds to the end of May. Inferno's generative model is defined as follows, with all parameters that are not assigned a prior distribution set to fixed values (e.g., $\gamma_t$, $\sigma^2_S$; see below):

$$y_{s,t} \mid \theta_{s,t}, \alpha \sim \mathrm{Beta}(\alpha\theta_{s,t},\ \alpha(1 - \theta_{s,t})) \quad (1)$$

where
• $y_{s,t}$ is the noisy, observable measurement of (w)ILI/100 on week $t$ of season $s$.
• $\theta_{s,t}$ models the true but unobservable value of (w)ILI/100 on week $t$ of season $s$.
• $\gamma_t$ models the typical (w)ILI/100 value on week $t$ on the logit scale.
• $\boldsymbol{\delta}_s$ is modeled with a multivariate normal (MVN) distribution with mean $\mu_s\mathbf{1}$ (a scalar $\mu_s$ times a $T \times 1$ vector of ones) and covariance matrix $\mathbf{S}$.
In this paper, bold quantities represent vectors or matrices, while non-bold quantities represent scalars. Because Inferno is applied to each geographical unit $r$ separately, the subscript $r$ is suppressed throughout. The Beta distribution of Eq 1 requires $y_{s,t} \in (0, 1)$, but there is no guarantee that observed (w)ILI/100 avoids 0 and 1. Thus, all $y_{s,t}$ below a low threshold $l$ are set equal to $l$ and all $y_{s,t}$ above $1 - l$ are set to $1 - l$. For this work, $l = 0.0005$, and $y_{s,t}$ is thresholded by $l$ for all observations before modeling begins.
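The thresholding rule above can be sketched directly. This is a minimal illustration of the stated rule (values below $l$ set to $l$, values above $1 - l$ set to $1 - l$); the function name is my own, not the paper's.

```python
import numpy as np

def threshold_ili(y, l=0.0005):
    """Clamp (w)ILI/100 values into [l, 1 - l] so every observation
    is a valid Beta-distributed value strictly inside (0, 1)."""
    y = np.asarray(y, dtype=float)
    return np.clip(y, l, 1.0 - l)

# Example: a raw weekly series containing an exact zero and an exact one.
raw = np.array([0.0, 0.012, 0.034, 1.0])
print(threshold_ili(raw))  # all values now lie in [0.0005, 0.9995]
```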
The parameters kept fixed in the above generative model ($\alpha$, $\boldsymbol\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_T)'$, $\sigma^2_\mu$, $\sigma^2_S$, $\lambda$, and $\phi$) are estimated from past seasons' (w)ILI data with a heuristic estimation procedure (at least two past seasons are required to heuristically estimate all Inferno parameters). As will be shown, this heuristic estimation procedure works well in practice to produce forecasts, Inferno's primary goal, as Inferno's forecast performance is competitive with Dante's. While parameter estimates from the heuristic estimation procedure are presented, inference is not the focus of this work, and using the heuristic parameter estimation procedure for inference is not advised. Parameter estimates are presented to support the intuition motivating the modeling choices and to provide relative comparisons of parameter estimates across states. The S1 Appendix provides a simulation study and discussion of the inferential limits of the heuristic parameter estimation procedure. Alternative heuristic estimation choices could be made and will be pointed out throughout the paper.
In this paper, $s^*$ will denote the flu season being forecasted. The past flu seasons (flu seasons occurring before season $s^*$) used to estimate the parameters will be denoted with a subscript $s$. In practice and in this paper, when forecasting season $s^*$, parameters are estimated from seasons $s^* - 1$ and earlier. In what follows, I outline a six-step procedure to estimate the unknown parameters $\alpha$, $\boldsymbol\gamma$, $\sigma^2_\mu$, $\sigma^2_S$, $\lambda$, and $\phi$ and describe how to sample and forecast from Inferno's posterior predictive distribution via MCMC.

2.2.1
Step 1: Estimate θ s,t . The purpose of Step 1 is to estimate θ s,t . Estimating θ s,t is not of value by itself, but is important as it facilitates the estimation of Inferno's hyperparameters.
The estimate of $\theta_{s,t}$, namely $\hat\theta_{s,t}$, is itself computed as a combination of two other quantities: $\hat{b}_{s,t}$ and $\hat\tau_t$. All computed quantities in Step 1 are based on training seasons only.
For a given geographic unit (e.g., state, HHS region, or the U.S.) and forecast season $s^*$, let $y_{s,t}$ be (w)ILI/100 for training season $s \in 1, 2, \ldots, S = s^* - 1$ and week of season $t$. First, compute $\hat{b}_{s,t}$ as a 3-week moving average of $y_{s,t}$. Fig 1 shows the moving average fit to ILI/100 in Illinois. The purpose of $\hat{b}_{s,t}$ is to capture the time series trend in season $s$ with a smooth, simple function that can be used to separate trend from noise in $y_{s,t}$. By construction, the moving average captures the shape of the ILI/100 curve. Alternative smoothing functions, like smoothing splines [29], generalized ridge regression [30], or, with additional model assumptions, Kalman filtering [31], could also be used. The degree of smoothness in these alternative methods is controlled by one or more tuning parameters and can be learned through cross-validation. I found a 3-week moving average worked well and, due to its simplicity, was appealing. The moving average, however, can miss sharp changes in $y_{s,t}$ caused by differences in reporting practices over holidays. For instance, the moving average most often underestimates $y_{s,t}$ the week of Christmas ($t = 13$, or MMWR week 52). To capture the systematic sharp changes in $y_{s,t}$ that are common across training seasons, Inferno estimates the holiday adjustment $\hat\tau_t$. Finally, the quantity $\hat\theta_{s,t}$ combines the trend in $y_{s,t}$ ($\hat{b}_{s,t}$) with the holiday effects ($\hat\tau_t$), thresholded by the small number $l$ (again, $l = 0.0005$ in this paper) to ensure $0 < \hat\theta_{s,t} < 1$.
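Step 1 can be sketched as follows. The paper's exact formulas are not reproduced above, so this sketch makes two illustrative assumptions: a centered 3-week window (with 2-week averages at the endpoints) and an additive, week-specific holiday deviation averaged across training seasons. Neither choice is necessarily the paper's exact definition.

```python
import numpy as np

def moving_average_3wk(y):
    """Centered 3-week moving average of one season's weekly series;
    endpoints use a 2-week average (an illustrative edge choice)."""
    y = np.asarray(y, dtype=float)
    b = np.empty_like(y)
    b[0] = y[:2].mean()
    b[-1] = y[-2:].mean()
    for t in range(1, len(y) - 1):
        b[t] = y[t - 1:t + 2].mean()
    return b

def holiday_effect(Y, B):
    """Assumed additive holiday adjustment: the week-specific deviation
    of the data from its trend, averaged across training seasons
    (rows = seasons, columns = weeks)."""
    return (np.asarray(Y, float) - np.asarray(B, float)).mean(axis=0)

# Under these assumptions, theta_hat would combine the two and be
# clipped into (0, 1):  theta_hat = np.clip(b + tau, l, 1 - l)
```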

2.2.2
Step 2: Estimate $\alpha$. Inferno computes $\hat\theta_{s,t}$ in order to facilitate the estimation of the other unknown quantities of Inferno's generative model. The expectation and variance of Inferno's data model (Eq 1) are

$$E(y_{s,t} \mid \theta_{s,t}, \alpha) = \theta_{s,t}, \qquad \mathrm{Var}(y_{s,t} \mid \theta_{s,t}, \alpha) = \frac{\theta_{s,t}(1 - \theta_{s,t})}{\alpha + 1}.$$

The parameter $\alpha$ controls the variance of the data model, capturing the week-to-week variability in the ILI data. The larger $\alpha$ is, the smaller the variance, reflecting less week-to-week noise in the ILI data; the smaller $\alpha$ is, the larger the variance, reflecting more week-to-week noise. $\alpha > 0$ is estimated by maximizing the likelihood of Inferno's data model (or, equivalently, minimizing the negative log likelihood), where $\log(x)$ is the natural log of $x$,

$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} \quad (16)$$

and $\Gamma(\cdot)$ is the gamma function. Fig 4 shows $\hat\alpha$ for all states, territories, and cities (collectively referred to as states). States like the U.S. Virgin Islands, North Dakota, and Puerto Rico have the smallest $\hat\alpha$s, reflecting that they have the largest week-to-week noise in their ILI data, while states like California, Illinois, and New York City have the largest $\hat\alpha$s, reflecting that they have the smallest week-to-week noise in their ILI data. Fig 5 shows summaries of the data model $\mathrm{Beta}(\hat\alpha\hat\theta_{s,t},\ \hat\alpha(1 - \hat\theta_{s,t}))$ for North Dakota, Nevada, and Illinois, illustrating the different levels of week-to-week noise in ILI data across states.
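The maximum likelihood step can be sketched numerically. This is an illustrative re-implementation, not the paper's code: it fixes the Beta mean at $\hat\theta_{s,t}$ and minimizes the negative log likelihood over $\alpha$; optimizing on the log scale to enforce $\alpha > 0$ is my choice for numerical convenience.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

def estimate_alpha(y, theta_hat):
    """MLE of the Beta concentration alpha with the mean fixed at
    theta_hat (a sketch of Step 2; names are illustrative)."""
    y = np.asarray(y, dtype=float)
    th = np.asarray(theta_hat, dtype=float)

    def nll(log_alpha):
        a = np.exp(log_alpha)  # optimize on the log scale so alpha > 0
        p, q = a * th, a * (1.0 - th)
        # Beta log density: (p-1)log y + (q-1)log(1-y) - log B(p, q)
        ll = (p - 1) * np.log(y) + (q - 1) * np.log(1 - y) - betaln(p, q)
        return -ll.sum()

    res = minimize_scalar(nll, bounds=(-2.0, 12.0), method="bounded")
    return float(np.exp(res.x))
```

With simulated Beta data whose true concentration is known, the estimate lands near the truth, mirroring how larger $\hat\alpha$ corresponds to less week-to-week noise.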

2.2.3
Step 3: Estimate γ t . Seasonal flu has a typical shape to it in the United States. ILI starts at low levels early in the season, rises to a peak between December and March, and reverts to low levels by the end of May. The role of γ is to capture this typical seasonal flu profile. Inferno computes γ t as follows: where logit(p) = log(p/(1 − p)).
The density of the multivariate normal model for $\boldsymbol\delta_s$ is

$$f(\boldsymbol{\delta}_s \mid \mu_s, \mathbf{S}) = (2\pi)^{-T/2}\,|\mathbf{S}|^{-1/2}\exp\!\left(-\tfrac{1}{2}(\boldsymbol{\delta}_s - \mu_s\mathbf{1})'\,\mathbf{S}^{-1}(\boldsymbol{\delta}_s - \mu_s\mathbf{1})\right)$$

where $\mathbf{1}$ is a $T \times 1$ vector of ones, $\mathbf{S}$ is a $T \times T$ positive semi-definite matrix, $|\mathbf{S}|$ is the determinant of $\mathbf{S}$, and $\mathbf{S}^{-1}$ is the inverse of $\mathbf{S}$. The mean of the multivariate normal distribution, $\mu_s$, is modeled as normally distributed with variance $\sigma^2_\mu$. Step 4 describes how to estimate $\sigma^2_\mu$: first compute the empirical deviations $\hat\delta_{s,t}$ and their season-level means $\hat\mu_s$; by construction, $\sum_{s=1}^{S}\hat\delta_{s,t} = 0$ for each $t$. The quantity $\hat\sigma^2_\mu$ is then computed as the unbiased sample variance of the $\hat\mu_s$. Step 5 estimates the covariance matrix $\mathbf{S}$, which combines a squared exponential covariance function with an overdispersion term, the sum of the output variance and the overdispersion parameter giving the total variance $\sigma^2_S$. While the standard squared exponential parameterization of Eqs 25 and 26 is arguably more intuitive than the parameterization of Eqs 5 and 6, I found parameterizing $\sigma'^2_S$ and $\sigma'^2_\epsilon$ as $\phi\sigma^2_S$ and $(1 - \phi)\sigma^2_S$, respectively, offered more numerical stability to the optimization described below, as a result of $\phi$ being bounded between 0 and 1.
The left column of Fig 9 plots $\hat{\boldsymbol\delta}_s - \hat\mu_s\mathbf{1}$ for North Dakota, Nevada, and Illinois. North Dakota exhibits more variability than Illinois, as can be seen in its wider range of values. Inferno estimates $\sigma^2_S$ as a measure of how far $\boldsymbol\delta_s - \mu_s\mathbf{1}$ typically deviates from 0. The remaining parameters of $\mathbf{S}$ are $\phi$ and $\lambda$. They collectively capture two different characteristics of $\boldsymbol\delta_s$. The parameter $\phi \in [0, 1]$ captures the roughness of $\boldsymbol\delta_s$: the larger $(1 - \phi)/\phi$ is, the rougher $\boldsymbol\delta_s$ is. For instance, $\hat{\boldsymbol\delta}_s$ for North Dakota in Fig 7 is much rougher than $\hat{\boldsymbol\delta}_s$ for Illinois. The second characteristic of $\boldsymbol\delta_s$ captured by $\phi$ and $\lambda$ is the correlation between entries of $\boldsymbol\delta_s$; under this parameterization, the correlation between $\delta_{s,t}$ and $\delta_{s,t+1}$ is $\phi\exp(-1/(2\lambda^2))$. The fitted multivariate normal model tracks the empirical quantities $\hat{\boldsymbol\delta}_s - \hat\mu_s\mathbf{1}$, suggesting the multivariate normal distribution is a defensible generative model for $\boldsymbol\delta_s$.
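The covariance structure described above can be sketched as code. Eqs 5 and 6 are not reproduced in the text, so this sketch assumes the stated split: the squared exponential kernel carries $\phi\sigma^2_S$ of the variance and an independent overdispersion (nugget) term carries the remaining $(1 - \phi)\sigma^2_S$; the function name and week indexing are illustrative.

```python
import numpy as np

def build_S(sigma2_S, phi, lam, T=35):
    """Assumed form of the T x T covariance matrix S: a squared
    exponential kernel with output variance phi * sigma2_S plus a
    diagonal overdispersion term (1 - phi) * sigma2_S."""
    t = np.arange(T)
    d2 = (t[:, None] - t[None, :]) ** 2         # squared week distances
    K = phi * sigma2_S * np.exp(-d2 / (2.0 * lam ** 2))
    return K + (1.0 - phi) * sigma2_S * np.eye(T)
```

Under this form the diagonal equals $\sigma^2_S$ and the lag-1 correlation equals $\phi\exp(-1/(2\lambda^2))$, matching the roughness interpretation of $(1 - \phi)/\phi$ in the text.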

2.2.6
Step 6: Sample forecasts from Inferno. The sixth and final step of Inferno is to replace parameters with their estimates and sample from the posterior predictive distribution. Recall $s^*$ is the forecast season and all parameters were estimated with data from seasons $s^* - 1$ and earlier. The generative model with parameters replaced by their estimates is

$$y_{s^*,t} \mid \theta_{s^*,t}, \hat\alpha \sim \mathrm{Beta}(\hat\alpha\theta_{s^*,t},\ \hat\alpha(1 - \theta_{s^*,t})) \quad (30)$$

Given (w)ILI/100 observations for the first $t$ weeks of flu season $s^*$ (i.e., given $y_{s^*,1:t}$), Inferno forecasts the remainder of the flu season (weeks $(t + 1)$ through $T$) by sampling from the posterior predictive distribution (Eq 36). The probabilistic programming language JAGS [32] is used to execute the MCMC sampling. JAGS is called with functions from the rjags package [33] in the programming language R [34]. The results are $J$ draws from the posterior predictive distribution of Eq 36. For this paper, forecasts are based on $J = 25{,}000$ samples, discarding the first 12,500 samples as burn-in and thinning the remaining 12,500 samples by two, resulting in forecasts based on 6,250 MCMC samples. A Markov chain should draw enough samples to achieve adequate estimation of the distribution(s) of interest. In general, when estimating quantiles of distributions, more samples are needed as the quantile of interest moves out into the tails of the distribution (i.e., it takes more samples to estimate the 99th percentile of a distribution well than it does to estimate the median well). With more samples, however, comes increased runtime. I selected 25,000 samples as a practical balance between runtime and tail estimation quality. In practice, the amount of time available to run the MCMC will impact the number of samples a user selects. The JAGS code that implements Inferno can be found in the S1 Appendix.
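The burn-in and thinning arithmetic above can be made concrete. This is a minimal sketch of the stated post-processing (drop the first 12,500 draws, keep every second remaining draw); the function name is my own, and the array stands in for actual JAGS output.

```python
import numpy as np

def postprocess_chain(draws, burn_in=12_500, thin=2):
    """Apply the burn-in and thinning described in the text:
    25,000 draws -> drop the first 12,500 -> keep every 2nd."""
    return np.asarray(draws)[burn_in::thin]

chain = np.arange(25_000)   # stand-in for one MCMC chain's output
kept = postprocess_chain(chain)
print(len(kept))            # 6250
```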

Results
To evaluate Inferno's forecasting performance, Inferno is pseudoprospectively compared to all models that participated in the U.S. CDC's 2018/19 National and Regional FluSight challenge as well as the State challenge. Forecasting follows the guidelines outlined by the CDC FluSight challenge; see [26] for details. The forecasts and the evaluation procedure are briefly described below.
Forecasts are made for four short-term targets (1, 2, 3, and 4 weeks ahead) and three seasonal targets (the peak week, the peak percentage, and the onset week; onset is not forecasted for the State challenge). All forecast targets are binned, and a probability is assigned to each bin such that the probabilities over all bins for a target sum to 1. The bins for the onset week and the peak week are one week wide; the bins for the short-term targets and the peak percentage are tenths of a percent (e.g., a bin from 2.0% (included) to 2.1% (excluded)) from 0 to 13%, with one large bin from 13% to 100%.

[Fig 11 caption: Posterior mean (black line) and 95% prediction intervals (ribbons) are displayed, along with $y_{s^*,1:t}$ (grey points) and $y_{s^*,(t+1):T}$, the future (w)ILI/100 values being forecasted (black points). The ribbon for times 1 to $t$ summarizes the fit to data $y_{s^*,1:t}$, while the ribbon for times $t + 1$ to $T$ summarizes the forecast for season $s^*$. https://doi.org/10.1371/journal.pcbi.1008651.g011]
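The binning scheme for (w)ILI targets can be sketched as a small helper. This is a hypothetical illustration of the scheme described above (tenth-of-a-percent bins up to 13%, then one wide bin); the index convention is my own, not FluSight's.

```python
def wili_bin_index(value):
    """Map a (w)ILI percentage to a bin index: tenth-of-a-percent
    bins [0.0, 0.1), [0.1, 0.2), ..., up to 13%, then one large
    final bin covering [13, 100]."""
    if value >= 13.0:
        return 130                    # the single wide final bin
    # a small epsilon guards against floating-point artifacts
    # (e.g., 0.3 * 10 evaluating to 2.999...96)
    return int((value + 1e-9) * 10)

print(wili_bin_index(2.05))  # 20  -> the bin [2.0, 2.1)
print(wili_bin_index(13.7))  # 130 -> the bin [13, 100]
```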
Define bin $b$ as the bin containing the correct target, $B$ as the set of all bins that will be scored (where $b \in B$), and $p_B \in [0, 1]$ as the sum of the probabilities assigned to all the bins in $B$. The modified log score used by FluSight is computed as $\max(-10, \log(p_B))$. When $B = \{b\}$, the modified log score is called the single-bin log score and is the scoring criterion used starting with the 2019/20 FluSight challenge. When $b \in B$ but $B \neq \{b\}$, the log score is called the multi-bin log score and was the scoring criterion used in the 2018/19 FluSight challenge. The multi-bin log score essentially scores the forecast probability assigned not only to the correct target bin but also to all target bins that are "close" to the correct target bin. The change from multi-bin to single-bin log score is motivated by the topic of proper/improper scoring rules [35]; for a recent, detailed discussion, the interested reader is directed to [24] and [25]. Finally, multi-bin skill and single-bin skill are derived by exponentiating the multi-bin and single-bin log scores, respectively. Single- and multi-bin skills lie in (0, 1], with larger skills being better. The (w)ILI data are subject to weekly revisions. As a result, it is important to use the (w)ILI estimates that were available at the time in order to make faithful comparisons to models that participated in the real-time FluSight challenges. Data available on historical dates are made available by the Carnegie Mellon University Delphi group's API [36] and were used to produce the results of the pseudoprospective comparison to real-time FluSight participating models. Fig 12 and Table 1 show the multi- and single-bin skills for Inferno and all models that participated in the 2018/19 FluSight challenges. Inferno would have placed 2nd only to Dante in the 2018/19 FluSight National and Regional as well as State challenges.
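The scoring rule and the skill derivation above translate directly into code. A minimal sketch of the stated definitions (the function names are mine):

```python
import math

def modified_log_score(p_B):
    """FluSight's modified log score: the natural log of the
    probability mass assigned to the scored bin(s) B, floored at -10."""
    if p_B <= 0.0:
        return -10.0
    return max(-10.0, math.log(p_B))

def skill(log_score):
    """Skill exponentiates the log score and lies in (0, 1]."""
    return math.exp(log_score)

# A forecast that put 25% probability on the scored bins:
s = modified_log_score(0.25)
print(round(skill(s), 2))  # 0.25 -> skill recovers the probability mass
```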
Fig 12: The FluSight 2018/19 challenge evaluated models using multi-bin skill (x-axis) but, starting with the FluSight 2019/20 challenge, will use single-bin skill (y-axis). Skill scores are presented overall (left column) as well as by seasonal targets (middle column) and short-term targets (right column). Inferno is a leading forecasting model overall, excelling in short-term forecasting, with good but not leading seasonal forecasting performance.
https://doi.org/10.1371/journal.pcbi.1008651.g012
FluSight 2018/19 used multi-bin skill as the forecast evaluation metric. Starting with FluSight 2019/20, single-bin skill will be used. While single-bin and multi-bin skills are correlated, as can be seen in Fig 12, the relationship is not perfect. Models can rise or fall in the relative rankings depending on which evaluation metric is used for scoring, highlighting that the evaluation metric selected by the forecasting challenge's organizing body is of consequence. Inferno and Dante both perform better under the multi-bin skill evaluation than under single-bin skill, but both are top-4 models by either metric. Most importantly, the drop in predictive performance from Dante to Inferno is small.
The small drop in performance from Dante to Inferno in 2018/19 is largely consistent with other seasons. Fig 13 shows Inferno's skill relative to Dante's when retrospectively compared for seasons 2013/14 through 2017/18 (using data from MMWR week 40 of 2010 through the forecast date for training/fitting). For the majority of seasons and targets, Inferno's performance is worse than Dante's by a small margin. From Fig 13, we can see that, relative to Dante, Inferno performed better than expected in 2018/19 for short-term targets at the state level. For all other regions and targets, however, Inferno's drop in performance relative to Dante in 2018/19 is consistent with the drop seen in other seasons, suggesting the relatively small drop in performance for Inferno is typical. For context, Inferno's average overall multi-bin skill was 94% of Dante's overall multi-bin skill for the National and Regional challenge. If each model that participated in the 2016/17, 2017/18, or 2018/19 National and Regional FluSight challenge had its overall multi-bin skill reduced by 6%, the average drop in rank was just over 1 position (i.e., if a model finished in Xth place in the challenge, a 6% reduction in its skill would, on average, result in that same model finishing in (X+1)th place). The drop in rank increases from 1 position to almost 3 positions when considering only the models that finished in the top 10, indicating that a 6% drop in skill has a greater impact on relative rank for better performing models than for worse performing ones. The retrospective comparison shown in Fig 13 ignores revisions made to the data in real-time (i.e., the validation data are used for forecasting, as the data that would have been available in real-time are not available).

Table 1. The rank by challenge and target for Inferno and Dante as measured by single-bin and multi-bin skill.
Inferno would have placed 2nd in both the National and Regional and the State challenges as measured by multi-bin skill, finishing behind only Dante. Inferno would have placed 4th (National and Regional) and 3rd (State) were the forecasts evaluated with single-bin skill. For both challenges and both evaluation metrics, Inferno achieved better short-term than seasonal performance. The small drop in predictive performance from Dante to Inferno is offset by Inferno's significant improvement in runtime for real-time forecasting and its readiness for future scalability to more granular forecasting geographies (e.g., county-level). Fig 14 shows the runtime comparison between Dante and Inferno at different stages of the flu season and different numbers of cores when drawing 25,000 MCMC samples for all 64 geographies (53 states, 10 HHS regions, and the United States). Dante takes between 110 and 120 minutes, while Inferno takes between 20 and 70 minutes if run serially on one core. Inferno, however, can be trivially parallelized for real-time forecasting. As a result, Inferno can draw the same 25,000 MCMC samples for all geographies in 30 seconds to 2 minutes when fully parallelized (running one geography per core). Fig 14 shows that Inferno improves runtime relative to Dante in two ways: by being a simpler model with fewer parameters and latent quantities to sample (comparing Dante to 1-core Inferno runtimes) and by being parallelizable (comparing Dante to 8-, 32-, and 64-core runtimes).
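The per-geography parallelization is simple to sketch because the 64 fits share no state. This illustration uses a thread pool and a stand-in fit function for brevity; a real deployment of CPU-bound MCMC would dispatch one geography per core (e.g., with a process pool), and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# 53 states + 10 HHS regions + the U.S. = 64 independent fits
GEOGRAPHIES = [f"geo_{i:02d}" for i in range(64)]

def fit_inferno(geography):
    """Stand-in for one per-geography Inferno fit; real code would run
    the six estimation steps and the MCMC for this geography alone."""
    return geography, "fitted"

def fit_all(max_workers=8):
    # Because geographies are modeled independently, fits can be
    # dispatched to a worker pool with no cross-geography communication.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fit_inferno, GEOGRAPHIES))
```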

Discussion
In this paper, I argued that while predictive performance is the most important measure of a forecasting model, other factors like runtime are important for model development, scalability, and meeting real-time operational timelines. Developing a model with leading predictive performance but drastically improved runtime was the motivation behind Inferno. I laid out a six-step procedure to heuristically estimate the parameters of Inferno from historical (w)ILI data, greatly reducing the MCMC computations executed by the probabilistic programming language JAGS. Furthermore, by forecasting each geography separately, Inferno can take advantage of parallelization, improving forecast runtimes today while remaining scalable and well-positioned for the more spatially granular future of flu forecasting (e.g., county-level forecasting). Inferno's predictive performance was comparable to but worse than Dante's. This may be for a couple of reasons, both of which are addressable. First, Dante explicitly models revisions; previous work has shown that accounting for and modeling revisions can result in improved predictive performance [16,20]. Similar modeling can be incorporated into Inferno at little additional computational cost. Second, Dante models correlation across states within a season by coupling states within a hierarchical framework. This coupling comes at a computational cost, and the price Inferno pays to achieve significant computational speed-ups is the loss of that coupling. There has been some recent work that takes independently generated probabilistic forecasts and, using principles of coherence, produces self-consistent forecasts with improved predictive performance [37][38][39]. The goal of this two-stage approach is to achieve the computational speed-ups parallelization offers to independently generated forecasts and then, through post-hoc coupling (i.e., coherence), recover some of the lost forecast performance.
The combination of revision modeling and coherence exploitation may result in equal or even better predictive performance at minimal computational cost.
In this paper, I discussed the importance of forecasting challenges in helping direct forecasting model development. Forecasting models are tools to help us answer questions. Forecasting challenges articulate what questions we want to answer and help define what properties we want our forecasting tools to have. They do this by selecting data sources, targets, scoring rules, geographic scope, and timelines that incentivize the development of models that optimize a forecast score while meeting these operational constraints. With the recently announced U.S. CDC Center for Forecasting and Outbreak Analytics [40], there are exciting opportunities for forecasting challenges to grow in scope and influence.
Supporting information

S1 Appendix. The JAGS code implementing Inferno's MCMC sampling routine is found in Section 1. A simulation study illustrating the inferential limits of Inferno's heuristic parameter estimation procedure is found in Section 2. (PDF)

Supervision: Dave Osthus.