Improving probabilistic infectious disease forecasting through coherence

With an estimated $10.4 billion in medical costs and 31.4 million outpatient visits each year, influenza poses a serious burden of disease in the United States. To provide insight into and advance warning of the spread of influenza, the U.S. Centers for Disease Control and Prevention (CDC) runs a challenge for forecasting weighted influenza-like illness (wILI) at the national and regional levels. Many models produce independent forecasts for each geographical unit, ignoring the constraint that the national wILI is a weighted sum of regional wILI, where the weights correspond to the population size of each region. We propose a novel algorithm that transforms a set of independent forecast distributions to obey this constraint, which we refer to as probabilistically coherent. Enforcing probabilistic coherence led to an increase in forecast skill for 79% of the models we tested over multiple flu seasons, highlighting the importance of respecting the forecasting system’s geographical hierarchy.

We examined the effect of imposing the geographical constraint that national wILI is a population-weighted sum of regional wILI on the sets of independent forecasts made publicly available by the CDC. We developed a novel method that transforms forecast densities to obey this geographical constraint while respecting the correlation structure between geographical units. The method showed consistent improvement across 90% of models, and the improvement held when stratified by target and test season. Our method can be applied to other forecasting systems with a geographical hierarchy, both within and outside an infectious disease context.

Introduction
Seasonal influenza is a persistent and serious contributor to global morbidity and mortality, hospitalizing over half a million people worldwide every year [1]. The United States alone reported approximately 80,000 influenza-related deaths in the 2017/2018 influenza season, with the most serious consequences falling on vulnerable populations such as children and the elderly [2].
As part of a larger forecasting initiative, the U.S. Centers for Disease Control and Prevention (CDC) hosts an annual influenza forecasting challenge, called the FluSight challenge, that is open to the public [3] [4]. As part of this challenge, forecasters supply probabilistic forecasts for short-term and seasonal targets at both the national and regional levels for weighted influenza-like illness (wILI), which measures the proportion of outpatient doctor visits at reporting health care facilities where the patient had an influenza-like illness (ILI), weighted by state population. At the national level, wILI can be computed directly using state-population-weighted ILI or, equivalently, using region-population-weighted wILI. The CDC estimates ILI as the ratio of patients presenting with a cough and a fever of 100° Fahrenheit or above to the total number of patients presenting at health care providers [5]. Participants in the FluSight challenge have harnessed a variety of models and methods to forecast the short-term and seasonal targets under consideration. These efforts have included time series models [6], mechanistic disease transmission models [7] [8], and machine learning techniques [9] [10] [11] [12]. Teams have also incorporated external data, such as internet search queries or point-of-care data, to improve forecasts [13] [14] [15] [16]. FluSight challenge participation has grown in popularity since the inaugural challenge in 2013, with twenty-four teams submitting forecasts from thirty-three models for the 2018/2019 season [17]. Model submission files from past FluSight challenges are publicly available [17], providing the opportunity for retrospective analysis and the potential for improved forecasting.
Multiple procedures have been proposed in the literature to transform independently generated, incoherent forecasts into coherent forecasts, a process also called forecast reconciliation [18] [19]. Projection matrix forecasting is a popular reconciliation approach [20]. It projects the original set of forecasts onto a subspace that respects the known hierarchical relationship of the forecasting system. This approach uses forecasts from all levels of the hierarchy and does not discard any information, as opposed to, say, a bottom-up approach, where the national forecast is ignored and the national estimate is simply a linear combination of regional estimates. However, the demonstrated benefits of coherence in the point prediction setting do not necessarily translate to the probabilistic forecasting realm [19]. We propose a novel algorithm that transforms a set of independent forecast densities to satisfy the coherence property and demonstrate that the resulting collection of forecast densities has consistently higher forecast skill when broken down by season and target.

Figure 1: At any given epiweek, the national wILI (black) is a weighted sum of regional wILI, where the weights correspond to the population size of the region. wILI is highly seasonal and varies heavily by region. Region population sizes (in millions) are given next to the region in the legend.

US ILI Surveillance Data
For the FluSight challenge, the CDC provides wILI data at both the national level and broken down into 10 Health and Human Services (HHS) regions. As shown in Figure 1, wILI varies by region but maintains a relatively consistent winter peak. The CDC reports the wILI data using epidemic weeks, called epiweeks, instead of calendar weeks [21], which allows for consistent week numbering across multiple seasons. Epiweek 40 is usually the first week of October, marking the start of the flu season, and epiweek 20 usually falls in May, marking the end of the season.

Pointwise Forecast Coherence
The partitioning of national data into HHS regions facilitates geographically localized forecasts, augmenting their usefulness for local public health officials. A consequence of this partitioning, however, is the creation of a hierarchical structure in the forecasting system: national wILI data is a linear combination of HHS regional wILI data. Region population sizes (in millions) are given next to the region in the legend of Figure 1, as reported by the 2010 U.S. Census [22].
We notate the true wILI value as $y_{r,s,w} \in [0, 100]$, a percentage for region $r$ in flu season $s$ corresponding to epiweek $w$. Throughout the paper, $r = 11$ corresponds to the nation, while $r = 1, 2, \ldots, 10$ corresponds to HHS region $r$. Let $\alpha_r \in [0, 1]$ be a weight corresponding to HHS region $r$, proportional to the population of HHS region $r$, such that $\sum_{r=1}^{10} \alpha_r = 1$. The hierarchical nature of the national/regional partitioning of forecasts for any season and epiweek is equivalent to the following constraint: $y_{11,s,w} = \sum_{r=1}^{10} \alpha_r y_{r,s,w}$.
For convenience, define the collection of point forecasts for all regions for season $s$ and epiweek $w$ as $\tilde{y}_{s,w} = (\tilde{y}_{1,s,w}, \tilde{y}_{2,s,w}, \ldots, \tilde{y}_{11,s,w})$. We say that the forecast $\tilde{y}_{s,w}$ is coherent if $\tilde{y}_{11,s,w} = \sum_{r=1}^{10} \alpha_r \tilde{y}_{r,s,w}$, and $\tilde{y}_{s,w}$ is incoherent if $\tilde{y}_{11,s,w} \neq \sum_{r=1}^{10} \alpha_r \tilde{y}_{r,s,w}$.
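As a concrete illustration of this definition, the short Python sketch below checks point-wise coherence for a vector of forecasts ordered as the regions followed by the nation. The weights, tolerance, and example values are illustrative assumptions (a two-region toy is used for brevity).

```python
import numpy as np

def is_coherent(forecast, weights, tol=1e-8):
    """Point-wise coherence check: the national value (last entry) must equal
    the population-weighted sum of the regional values (all earlier entries)."""
    regional, national = forecast[:-1], forecast[-1]
    return abs(national - float(np.dot(weights, regional))) < tol

weights = np.array([0.5, 0.5])                          # illustrative two-region weights
print(is_coherent(np.array([1.2, 0.8, 1.0]), weights))  # True:  1.0 = 0.5*1.2 + 0.5*0.8
print(is_coherent(np.array([1.2, 0.8, 1.5]), weights))  # False: 1.5 != 1.0
```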
Though the existence of the hierarchical structure in the wILI forecasting system is known, many FluSight challenge forecasts are made independently. Forecasting each geographic region independently gives the forecaster more flexibility to cater models to specific regions and avoids modeling the correlation between regions explicitly, but it leaves the resulting forecasts vulnerable to incoherence, as the true, coherent data generating process is not respected. In this paper, we use $\tilde{y}$ to represent independently generated and (likely) incoherent forecasts and $\hat{y}$ to represent coherent forecasts.
For an independently generated set of forecasts $\tilde{y}$, the corresponding coherent projection matrix forecast is
$$\hat{y}_{s,w} = X\left(X^{\top} V^{-1} X\right)^{-1} X^{\top} V^{-1}\, \tilde{y}_{s,w}, \qquad (5)$$
where $X$ is a design matrix corresponding to the hierarchical relationship of the data generating process and $V$ is a weight matrix. Specifically for the FluSight challenge, the design matrix stacks the $10 \times 10$ identity matrix over the row of regional weights,
$$X = \begin{bmatrix} I_{10} \\ \alpha_1 \;\; \alpha_2 \;\; \cdots \;\; \alpha_{10} \end{bmatrix},$$
where $\alpha_r$ is the weight for the $r$th region.

Figure 2: Projection of original forecasts (red) onto the space satisfying the constraint that region-level forecasts sum to the national-level forecast (blue). The three different projection methods defined by the various choices of weight matrices result in different projections of the red point onto the blue plane. The blue plane represents the set of points that satisfy the coherence constraint, namely that the weighted combination of region-level forecasts equals the national-level forecast. In particular, the ordinary least squares (OLS) projection maps the red point onto the blue point such that the resulting distance is minimized, leading to an orthogonal projection.
A special case of projection matrix forecasting is the ordinary least squares (OLS) projection, produced by setting $V$ equal to the identity matrix $I$ in Equation 5. This special case has the property that the resulting coherent point forecast $\hat{y}$ has mean squared error (MSE) no worse than the independently generated forecast $\tilde{y}$ [19]. That is, when $V = I$ in Equation 5, $\mathrm{MSE}(\hat{y}) \leq \mathrm{MSE}(\tilde{y})$ for any $y$ and $\tilde{y}$ (see Appendix for proof).
For illustration and clarity, consider an example with two low level regions (HHS1 and HHS2) and one top level region (Nation). This example is illustrated in Figure 2.
Here $\alpha_1 = \alpha_2 = 0.5$ and the true values are $y = [1, 1, 1]^{\top}$, ordered as HHS1, HHS2, Nation. Assume the independently generated forecasts $\tilde{y}$ are incoherent, i.e., $\tilde{y}_{\mathrm{Nat}} \neq \alpha_1 \tilde{y}_{\mathrm{HHS1}} + \alpha_2 \tilde{y}_{\mathrm{HHS2}}$. The OLS projection matrix forecast, computed as $\hat{y} = X\left(X^{\top} X\right)^{-1} X^{\top} \tilde{y}$, is $\hat{y} = [2/3, 2/3, 2/3]^{\top}$, which is, by construction, coherent. The effect of the projection matrix forecast is a reduction in MSE relative to $\tilde{y}$. Note that the MSE for $\hat{y}_{\mathrm{HHS1}}$ and $\hat{y}_{\mathrm{HHS2}}$ improved relative to $\tilde{y}_{\mathrm{HHS1}}$ and $\tilde{y}_{\mathrm{HHS2}}$, respectively, while the MSE for $\hat{y}_{\mathrm{Nat}}$ got worse relative to $\tilde{y}_{\mathrm{Nat}}$, resulting in an overall, but not uniform, improvement in MSE for $\hat{y}$ relative to $\tilde{y}$.
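To make the arithmetic of this example concrete, here is a minimal Python sketch of the OLS projection. The incoherent forecast vector used below is an illustrative assumption (chosen so that the projection lands on [2/3, 2/3, 2/3]); everything else follows the construction above.

```python
import numpy as np

# Design matrix for two regions plus the nation (last row holds the weights).
alpha = np.array([0.5, 0.5])
X = np.vstack([np.eye(2), alpha])                 # shape (3, 2)

# OLS projection matrix onto the coherent subspace (V = I in Equation 5).
P = X @ np.linalg.inv(X.T @ X) @ X.T              # shape (3, 3)

y_true = np.array([1.0, 1.0, 1.0])                # true values [HHS1, HHS2, Nation]
y_tilde = np.array([0.5, 0.5, 1.0])               # assumed incoherent forecast
y_hat = P @ y_tilde                               # coherent projection

print(np.round(y_hat, 3))                                     # [0.667, 0.667, 0.667]
print("MSE incoherent:", np.mean((y_true - y_tilde) ** 2))    # ~0.167
print("MSE coherent:  ", np.mean((y_true - y_hat) ** 2))      # ~0.111
```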
The projection matrix forecast is a useful tool for transforming independently generated, incoherent forecasts into coherent forecasts. When $V = I$ in Equation 5, the MSE of the resulting $\hat{y}$ is guaranteed to be no greater than the MSE of $\tilde{y}$. When $V \neq I$ in Equation 5, the resulting forecast is still coherent, but no such guarantee of MSE improvement exists.

The FluSight Forecast Coherence Dilemma
Unlike the situations demonstrated in the point forecasting literature, forecasts for the FluSight challenge are required to be probabilistic rather than point estimates, and the probabilistic forecasts are evaluated using a multi-bin scoring rule, not MSE. In practice, probabilistic forecasts are generated as a collection of $n$ forecast samples for $y_{r,s,w}$. In this paper, we use the index $i = 1, 2, \ldots, n$ to denote draw $i$ from the forecast distribution, resulting in a collection of realizations $\tilde{y}_{r,s,w,i}$. In probabilistic settings we can no longer rely on the coherence definition given in Equation 3. Although various definitions for probabilistic coherence exist [23] [18], in this paper we choose the intuitive presentation of Gamakumara et al. [23]. We say that the density $f(\tilde{y}_{s,w})$ is probabilistically coherent if $f(\tilde{y}_{s,w}) = 0$ whenever $\tilde{y}_{11,s,w} \neq \sum_{r=1}^{10} \alpha_r \tilde{y}_{r,s,w}$.
Here $f$ represents the joint density over all regions and the national level of the hierarchy. Note the close correspondence with the point forecast definition; probabilistic coherence does, however, require specification of a joint distribution over regions. Intuitively, this definition says that any point in the support of $f$ that does not obey point-wise coherence is assigned zero probability (i.e., has measure zero).
We score the probabilistic forecasts using multi-bin skill, rather than multi-bin log score as used in the FluSight challenge. Multi-bin skill is on the interpretable [0,1] probability scale, unlike the multi-bin log score.
Like the multi-bin log score, skill is a thresholding scoring rule, where a forecast is deemed correct if it is within a certain distance of the truth. The probability assigned to the true target $Z_t$ (e.g., a one-week-ahead forecast) corresponding to region $r$ in flu season $s$ and epiweek $w$ is computed as
$$p_{r,s,w,t} = \Pr\left(\left|\tilde{y}_{r,s,w} - Z_t\right| \leq b\right),$$
for the true target $Z_t$ and a predefined threshold $b$. For short-term forecasts (e.g., the one- through four-week-ahead targets), $b$ is set equal to 0.5. Using a thresholding evaluation metric breaks the guarantee of equal or improved performance that the coherent forecasts carry when evaluated with MSE.
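Before turning to why the MSE guarantee can fail under this metric, here is a minimal sketch of the thresholded scoring, assuming a forecast supplied as a binned probability mass function (as in CDC submission files). The helper name, bin width, and Gaussian-shaped example pmf are illustrative assumptions.

```python
import numpy as np

def multibin_probability(bin_starts, bin_probs, truth, b=0.5):
    """Total forecast probability assigned to bins whose start value lies
    within +/- b of the truth (the thresholded 'correct' window)."""
    bin_starts = np.asarray(bin_starts)
    bin_probs = np.asarray(bin_probs)
    return bin_probs[np.abs(bin_starts - truth) <= b].sum()

# Illustrative binned forecast: 0.1-wide wILI bins with a Gaussian-shaped pmf.
bin_starts = np.arange(0, 13, 0.1)
bin_probs = np.exp(-0.5 * ((bin_starts - 2.4) / 0.6) ** 2)
bin_probs /= bin_probs.sum()

p = multibin_probability(bin_starts, bin_probs, truth=2.2, b=0.5)

# Forecast skill is the exponentiated average log score (a geometric mean of
# the probabilities assigned to the truth); with one evaluation point it is p.
skill = np.exp(np.mean(np.log([p])))
print(round(float(p), 3), round(float(skill), 3))
```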
To see why, consider again the example from Section 1.2 with $y^{\top} = [1, 1, 1]$. For $i = 1, 2, \ldots, n = 10000$, we independently draw forecast samples $\tilde{y}_i$ and project them to obtain $\hat{y}_i$; Figure 3 shows the $n$ draws of $\tilde{y}$ and $\hat{y}$. The skill counts all forecasts within a predefined threshold of the true value as correct and all forecasts outside the predefined threshold as incorrect. For this example, the correct forecasts must fall within plus or minus 0.1 of the true value $y$. For every realization $i$, the MSE of $\hat{y}_i$ is less than the MSE of $\tilde{y}_i$. However, the skill for the coherent forecasts is 0, while the skill for the incoherent forecasts is 0.32, and the skill for the incoherent forecasts is always at least as good as the skill for the coherent forecasts. This is because all of the correct national forecasts are projected out of the correct region, while the incorrect regional forecasts are moved closer to the correct region, but not close enough to fall inside it. The result is a collection of coherent forecasts with better (lower) MSE, but also lower forecast skill than the incoherent forecasts.
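The following simulation sketch illustrates this phenomenon. The sampling distribution (regional forecasts biased low with small spread, national forecast unbiased with larger spread) is an assumption chosen to reproduce the qualitative behavior described above, not the exact numbers in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
X = np.vstack([np.eye(2), alpha])
P = X @ np.linalg.inv(X.T @ X) @ X.T               # OLS projection onto coherent subspace

y_true = np.array([1.0, 1.0, 1.0])                 # [HHS1, HHS2, Nation]
n = 10_000

# Assumed forecast distribution: regional forecasts biased low, national unbiased.
y_tilde = rng.normal(loc=[0.5, 0.5, 1.0], scale=[0.05, 0.05, 0.25], size=(n, 3))
y_hat = y_tilde @ P.T                              # project every sample (P is symmetric)

per_draw_no_worse = np.sum((y_hat - y_true) ** 2, axis=1) <= np.sum((y_tilde - y_true) ** 2, axis=1)
hit_rate = lambda d: np.mean(np.abs(d - y_true) <= 0.1, axis=0)   # per-location hit rate

print("squared error no worse for every projected draw:", bool(np.all(per_draw_no_worse)))
print("hit rates, incoherent:", np.round(hit_rate(y_tilde), 2))   # roughly [0, 0, 0.3]
print("hit rates, coherent:  ", np.round(hit_rate(y_hat), 2))     # roughly [0, 0, 0]
print("MSE incoherent vs coherent:",
      round(float(np.mean((y_tilde - y_true) ** 2)), 3),
      round(float(np.mean((y_hat - y_true) ** 2)), 3))
```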

Problem Statement
On one hand, we have a guarantee that the MSE of point forecasts projected into the data generating process space can get no worse under the OLS projection method. On the other, we have an explicit example of forecast skill decreasing as a result of forecasts projected into the data generating process space. This seeming inconsistency leads us to the central question of this paper: Can probabilistic forecast coherence be used to improve forecast skill when forecasting influenza? The remainder of this paper will investigate this question.

Methods
In order to investigate the question posed in Section 1.4, we developed two methods to sample from probabilistically coherent forecast densities. To begin, we define our collection of original forecast densities, drawn independently for each region, as $f(\tilde{y}) = \prod_{r=1}^{11} f_r(\tilde{y}_r)$. Previous approaches have factored $f(\tilde{y})$ into a bottom-up density, $f(\tilde{y}) = \delta(\tilde{y}_{11} \mid \tilde{y}_{1:10})\, h(\tilde{y}_{1:10} \mid \theta)$, where $\delta$ is a Dirac delta density centered at $\tilde{y}_{11} = \sum_{r=1}^{10} \alpha_r \tilde{y}_r$, $\theta$ is a parameter (or parameters) estimated from training data, and $h$ is a joint density over all 10 regions [18]. In what follows we consider two approaches for sampling from the probabilistically coherent density, $f^{*}(\hat{y})$.
We first consider the scenario where errors in the forecast distributions are assumed to be uncorrelated across regions. This allows us to sample from the original forecast distribution $f(\tilde{y})$ and apply a point-wise coherence projection matrix to each independently drawn sample $\tilde{y}_i$. In the second approach, we assume the geographical units are positively correlated; that is, the forecasting model is more likely to underestimate or overestimate locations simultaneously than to have independent errors across locations. This correlation structure reflects our knowledge that, during an epidemic, forecast models that tend to under-predict wILI at the regional level will also do so at the national level.

Unordered Ordinary Least Squares
Our first approach requires samples from the independent forecast distributions, $\tilde{y}_{r,i} \sim f_r(\tilde{y}_r)$, where $i$ indexes samples and where we have dropped the season, epiweek, and model index for simplicity. We then apply the projection matrix to the column vector $\tilde{y}_i = \left(\tilde{y}_{1,i}, \tilde{y}_{2,i}, \ldots, \tilde{y}_{11,i}\right)^{\top}$. This approach is graphically demonstrated in Figure 4. Using the projection matrix on each sample guarantees that the resulting empirical probability mass function satisfies the probabilistic coherence definition of Equation 16. Algorithm 1 outlines how to produce samples from $f^{*}(\hat{y})$. In practice, the CDC submission files specify probability distributions as binned probability mass functions, and the independent forecasts $\tilde{y}_i$ are sampled from these binned probability mass functions.
Algorithm 1 Unordered OLS sampling from probabilistically coherent joint distribution given a collection of marginal distributions.
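Below is a minimal Python sketch of the unordered OLS sampling procedure described by Algorithm 1, assuming forecasts are provided as binned probability mass functions ordered as the 10 HHS regions followed by the nation. The helper names, 0.1 bin width, and within-bin uniform sampling are illustrative assumptions.

```python
import numpy as np

def sample_from_binned_pmf(bin_starts, bin_probs, n, rng):
    """Draw n samples from a binned probability mass function by choosing a
    bin with its stated probability and adding a uniform offset within the
    bin (a 0.1 wILI bin width is assumed here)."""
    bin_starts = np.asarray(bin_starts)
    idx = rng.choice(len(bin_starts), size=n, p=bin_probs)
    return bin_starts[idx] + rng.uniform(0.0, 0.1, size=n)

def unordered_ols(binned_forecasts, alpha, n=1000, seed=0):
    """Unordered OLS (Algorithm 1 sketch): independently sample each of the
    11 locations (10 HHS regions followed by the nation), then project every
    sampled vector onto the coherent subspace with the OLS projection matrix."""
    rng = np.random.default_rng(seed)
    X = np.vstack([np.eye(len(alpha)), alpha])       # design matrix: I_10 over weight row
    P = X @ np.linalg.inv(X.T @ X) @ X.T             # OLS projection matrix (V = I)
    samples = np.column_stack([
        sample_from_binned_pmf(starts, probs, n, rng)
        for starts, probs in binned_forecasts        # one (bin_starts, bin_probs) per location
    ])                                               # shape (n, 11)
    return samples @ P.T                             # probabilistically coherent samples
```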

Ordered Ordinary Least Squares
The unordered OLS approach assumes no correlation between the error structures regionally and nationally.
That is, each sample is generated independently. In the second approach we induce a correlation structure between the forecast errors. We begin with the same set of samples as defined in Equation 23. However, before applying the projection matrix we first compute the order statistics. We then apply the projection matrix to the column vector of aligned order statistics, $\tilde{y}_{(i)} = \left(\tilde{y}_{1(i)}, \tilde{y}_{2(i)}, \ldots, \tilde{y}_{11(i)}\right)^{\top}$, where $\tilde{y}_{r(i)}$ is the $i$th order statistic of the empirical distribution $f_r(\tilde{y}_r)$ for region $r$.
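A sketch of the ordered OLS variant under the same assumptions as the unordered sketch above (and reusing its sample_from_binned_pmf helper): the only change is that each location's samples are sorted before projection, aligning the order statistics so that low regional draws are paired with low national draws.

```python
import numpy as np

def ordered_ols(binned_forecasts, alpha, n=1000, seed=0):
    """Ordered OLS sketch: sample each location independently, sort each
    location's samples to align order statistics, then project the aligned
    vectors onto the coherent subspace."""
    rng = np.random.default_rng(seed)
    X = np.vstack([np.eye(len(alpha)), alpha])
    P = X @ np.linalg.inv(X.T @ X) @ X.T
    samples = np.column_stack([
        np.sort(sample_from_binned_pmf(starts, probs, n, rng))  # order statistics
        for starts, probs in binned_forecasts
    ])
    return samples @ P.T   # coherent samples with induced positive error correlation
```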
Both the ordered and unordered OLS approaches lead to empirical distributions that are probabilistically coherent; however, the ordered OLS approach induces a correlation structure where low regional wILI forecasts are tied to low national wILI forecasts and vice versa, similar to the Schaake Shuffle [24]. In practice, the ordered OLS algorithm amounts to first sorting the samples drawn independently at each region and then applying the projection matrix to the sorted samples, as outlined in Algorithm 2. The forecasts used for evaluation are summarized in Table 1, where we define an evaluation point as a unique region, season, model, epiweek, and target combination. As we can see from Table 1, the sample sizes are quite substantial when aggregating over evaluation points. We use the 2010 U.S. Census weights across all seasons as an estimate of $\alpha_r$ for each region [22]. The correlation between the weighted combination of the regionally reported wILI and the nationally reported wILI exceeds 0.99, so using the 2010 U.S. Census weights is a reasonable approximation to the weights used by the CDC.

Results
In what follows we consider the combination of a model and a season as the fundamental unit of analysis on which to base our conclusions. As a forecaster, the main question under consideration is whether applying forecast coherence will improve the average forecast skill of a given model in an upcoming season. As we can see from Figure 5, both the unordered and ordered OLS methods improved short-term forecasting skill for the majority of model/seasons, averaged over short-term targets, regions, and epiweeks. The unordered OLS method saw an improvement for two-thirds of the models submitted, while the ordered OLS method showed an improvement in forecast skill for 90% of the models submitted.
The results are consistent across targets (see Figure 6 A), where a majority of model/season forecast skills improve over the forecast skill of the independent forecasts. Interestingly, forecast horizon has differing effects on the two methods, with ordered OLS improving as the forecast horizon increases and unordered OLS improving as the forecast horizon decreases. In particular, out of the 83 unique model/season combinations, the ordered OLS method improved over 75% of them for each target, showing that ordered OLS can consistently improve forecast skill across targets.
We also break down the results by season (Figure 6 B). Here we see that both methods improve over the independent forecasts when stratified by season. Additionally, the ordered OLS method showed the greatest improvement in the season that was hardest to predict of the three: the 2017/2018 season [12]. Though the ordered OLS method shows consistent improvement in short-term forecasting skill across model/seasons (74/83) and when broken down by target (Figure 6 A) and season (Figure 6 B), consistent improvement in forecast skill does not hold when broken down by region, as seen in Figure 6 C. As demonstrated in Section 1.2, in the point prediction setting there is no guarantee that the MSE of each region will improve, only that the average MSE across regions will improve. This property seems to translate to the probabilistic forecasting realm as well: there is no systematic improvement in Figure 6 C. In fact, the breakdown obscures any improvement at all, a consequence of using forecast skill as the primary metric, which is a geometric mean of the probabilities assigned to the truth (i.e., the exponentiated average log score).
We next turn to the question of why the ordered OLS method performed better than the unordered OLS method. As noted in Section 2.2, the ordered OLS method assumes a correlation structure exists between the errors of the independent forecast densities. In particular, the ordered OLS method assumes that when the regional forecasts are low, the national forecast is similarly low, and vice versa. When examining the observed correlation between the average error of the regional forecasts (averaged over all regions) and the error of the national forecasts in the evaluation seasons (Figure 7), we can clearly see a positive correlation that the ordered OLS method is able to exploit: when the average regional error is high, the national error is likely to be high, and vice versa. This helps explain the overall increase in performance seen with the ordered OLS method.

Figure 5 (caption excerpt): The unordered OLS method has only a 66% chance of being better than the independent forecasts. The ordered OLS method, however, shows an even greater chance of improvement over the independent forecasts, with 90% of models performing better than their independent forecast counterparts.

Figure 6: A: Difference between the forecast skill of each projection method and the forecast skill of the independent forecasts, averaged over all regions and epiweeks and broken down by target. Notice that only the ordered OLS method is significantly to the right of the zero line. Numbers represent the number of models that improved relative to the independent forecast skill. B: The same difference averaged over all regions, epiweeks, and targets (1-4 week ahead), broken down by season. Notice that the 2017/2018 season, which was the hardest of the three seasons to predict, saw the largest increase in forecast skill and the largest relative difference between the ordered OLS and the unordered OLS methods. Numbers represent the number of models that improved relative to the independent forecast skill. C: The same difference averaged over epiweeks and targets (1-4 week ahead), broken down by region. Here we see no systematic improvement. However, we do not expect to see one, since forecast coherence does not necessarily improve the forecast skill of every region, but rather the regions in aggregate. When breaking down the results by region, the geometric mean obscures any average benefit across regions.

Discussion
Forecast coherence is a simple tool to improve forecast skill of short-term predictions in systems with hierarchical structures. In order to demonstrate this, we first defined probabilistic coherence, and showed that the results in the literature surrounding point forecast coherence do not naively transfer over to probabilistic forecasting.
Guarantees for improvements in MSE do not directly transfer over to the CDC FluSight forecast skill metric of probabilistic forecast performance. However, by leveraging the definition of probabilistic coherence, we were able to generate coherent samples by first sampling from a collection of independent forecast distributions over all regions and projecting them onto a coherent subspace. This projection method is generic and allows for both correlated and uncorrelated projections. By exploiting the correlation structure of the true data generating process, we were able to improve average forecast skill when broken down by model, season and target.
In practice, the unordered and ordered OLS methods are very appealing because they require no training data and operate directly on submitted forecasts, without requiring adjustments to the model code itself. No knowledge of the process model used to generate the forecasts is required, only the resulting predictive density. Even though the benefits in forecast skill are small in magnitude, there is little cost to implementing coherence in practice and, especially for the ordered OLS method, the frequency of improvement is high (90%).
Our experiments lead to the following conclusions:
• Forecast coherence can benefit forecast skill, but the benefits are small. Using the ordered OLS method, we can improve short-term forecast skill with high likelihood (90% of model/seasons) and when partitioned by target (over 75% of all model/seasons improved for each target). We see a small but consistent improvement, with little to no cost in terms of parameter estimation and implementation difficulty. This makes the ordered OLS method a clear choice when submitting forecasts to the CDC FluSight challenge.
• Careful consideration is required when accounting for the correlation structure of the errors. The ordered OLS method is able to account for a positive correlation structure by applying the projection matrix to the order statistics from the independent forecast distributions.
• Coherence is most beneficial during a difficult season. We saw that the forecast skill benefit for both methods was largest in the 2017/2018 season. In addition, the relative difference between the unordered OLS method and the ordered OLS method was also largest during this season, suggesting that exploiting the correlation structure matters most in difficult seasons [12]. This is intuitive: during an epidemic season both the regional and the national wILI are elevated beyond normal levels, so the forecasting models tend to under-predict at all levels of the hierarchy.
Although the above method is simple to implement and results in small but significant changes in forecast skill, there is still significant room for improvement. Recent work by Taieb et al. has explored copula-based techniques to combine the independent forecast distributions from the regional level into a joint distribution with a specified covariance structure [18]. Wickramasuriya et al. have also explored various projections using a weighted least squares method [19]. Their weight matrix also represents the correlation structure between the independent forecasts but can be estimated from historical forecast accuracy. It is clear that, unlike in the point forecast setting, the choice of correlation structure in the probabilistic setting has a drastic effect on the results. Given historical training data, one could estimate the error correlation specific to a given process model, which could potentially lead to an even greater increase in forecast skill. However, in the absence of historical training data, forecasters can still leverage coherence to improve probabilistic forecast skill. In all, these results suggest that simple and fast methods can improve probabilistic forecasts of systems where the available data has a natural hierarchy. Using the example of seasonal influenza forecasts in the US, we show that enforcing coherence provides a high likelihood of improvement in forecast accuracy, and in general may provide opportunities for improvement in forecast accuracy in this and other real-world application settings.
Proof. Properties for the proof: Property 1: Let X be an n × p matrix such that rank(X) = p; then X^T X is invertible and the projection matrix P = X(X^T X)^{-1} X^T is well defined.
Property 2a: P is idempotent.
Property 2b: P is symmetric.
Property 2c: I − P is idempotent.
Property 2d: I − P is symmetric.
Property 3: If A is a symmetric matrix, then A = QΛQ T where Q is a matrix whose columns are equal to the eigenvectors of A and Λ is a diagonal matrix whose diagonal elements λ are the eigenvalues of A.
Property 4: If A is idempotent, then its eigenvalues are either 0 or 1.
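A sketch of how these properties yield the MSE guarantee stated in Section 1.2, under the assumption that the truth $y$ is itself coherent (so that $y = Py$ with $P = X(X^{\top}X)^{-1}X^{\top}$):

```latex
\begin{align*}
\|y - \tilde{y}\|^2
  &= \|P(y - \tilde{y}) + (I - P)(y - \tilde{y})\|^2 \\
  &= \|P(y - \tilde{y})\|^2 + \|(I - P)(y - \tilde{y})\|^2
     && \text{cross term vanishes since } P^{\top}(I - P) = P - P^{2} = 0 \\
  &\ge \|P(y - \tilde{y})\|^2
     = \|Py - P\tilde{y}\|^2
     = \|y - \hat{y}\|^2 .
\end{align*}
```

Dividing both sides by the number of forecast locations gives $\mathrm{MSE}(\hat{y}) \le \mathrm{MSE}(\tilde{y})$; Properties 2a-2d justify the vanishing cross term, and Properties 3 and 4 give the same bound via the eigenvalues of $I - P$.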