Neural network models for influenza forecasting with associated uncertainty using Web search activity trends

Influenza affects millions of people every year, causing a considerable number of medical visits and hospitalisations as well as hundreds of thousands of deaths. Forecasting influenza prevalence with good accuracy can significantly help public health agencies react in a timely manner to seasonal or novel-strain epidemics. Although significant progress has been made, influenza forecasting remains a challenging modelling task. In this paper, we propose a methodological framework that improves over the state-of-the-art forecasting accuracy of influenza-like illness (ILI) rates in the United States. We achieve this by using Web search activity time series in conjunction with historical ILI rates as observations for training neural network (NN) architectures. The proposed models incorporate Bayesian layers to produce uncertainty intervals alongside their forecast estimates, positioning themselves as legitimate complements to more conventional approaches. The best performing NN, referred to as the iterative recurrent neural network (IRNN) architecture, reduces mean absolute error by 10.3% and improves Skill by 17.1% on average in nowcasting and forecasting tasks across 4 consecutive flu seasons.


Introduction
Forecasting the spread of infectious diseases can inform public health policy decisions. The potential impact of forecasting was highlighted during the COVID-19 pandemic, where disease incidence and mortality projections led governments to initiate lockdowns [1][2][3]. There continues to be considerable interest in forecasting, particularly of influenza [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. According to the World Health Organization, influenza remains a strong candidate for a pandemic and is typically responsible for 290,000 to 650,000 deaths worldwide each year. Forecasting its prevalence allows policy makers to, for example, identify when to recommend the prescription of anti-viral drugs [22]. The United States of America (US), like many other countries, has a syndromic surveillance network, coordinated by the Centers for Disease Control and Prevention (CDC), that tracks the rate of influenza-like illness (ILI). Models for influenza forecasting considered by the CDC [23][24][25] incorporate associated uncertainty estimates as an essential component for deployment within a decision support system. After all, an unexpected forecast with high uncertainty is very different from an unexpected forecast with low uncertainty. The latter might trigger a public health intervention, while the former might be ignored.
Neural networks (NNs) have been shown to be competitive with the state-of-the-art in forecasting tasks [26][27][28]. However, their application to epidemic forecasting has been limited [29], in part because estimating uncertainty with NNs can be challenging. The uncertainty in a forecast produced by a machine learning model is attributed to two sources [30]. Data or aleatoric uncertainty is inherent in the data, such as measurement noise. Model or epistemic uncertainty deals with confidence in the model's parameters [31]. In our work, we use Bayesian neural networks (BNNs) to estimate the model uncertainty. We place a distribution over the NN's parameters (weights), and sample from that distribution to create model parameter instances. The variation in outputs across the model instances is used to derive the model uncertainty. Data uncertainty is estimated by outputting the parameters of the data distribution, that is a mean prediction and its variance, rather than a single point estimate. We then combine the two uncertainty mechanisms.
At its simplest, a forecast may be based on its own historical values. However, in many cases, better accuracy is achieved by incorporating additional exogenous data. Prior work has demonstrated that the current influenza rate can be accurately estimated (nowcasted) from a variety of different data sources, including social media posts and Web search activity [32][33][34][35][36]. These streams of information have very low latency. In theory, their daily frequencies can be determined with a delay of about 24 hours, i.e. right after the completion of a day. In contrast, syndromic surveillance networks for influenza (including CDC's) report ILI rates with latencies of about two weeks, i.e. the ILI rate today is not known until two weeks later. Hence metadata about online user activity, when used appropriately, can facilitate more timely disease rate inferences, which perhaps is the most important factor for incorporating this exogenous information into conventional epidemiological approaches. Our proposed NN architectures for disease rate forecasting can efficiently and effectively incorporate frequency time series of Web search queries. We use the daily frequency of a variety of search terms (keywords or phrases) related to ILI. These include symptoms, remedies, general advice seeking, and other relevant categories (S1 Table).
However, the different latencies of health reporting and Web search activity can introduce a level of confusion with respect to model configuration and evaluation. Generally in forecasting we have a set of observed data points (samples) up to and including time (day) t0, and aim to predict a future value (here a disease rate) at time t > t0. Due to the different data reporting latencies, we can obtain historical ILI rates up to time t0 and exogenous data up to t0 + δ, where δ is typically 14 days. When we refer to the number of days ahead to be forecast, i.e. the forecast horizon denoted by γ, we need to specify from what time. For that purpose, we can either use t0 (time point of the last available ILI rate) or t0 + δ (time point of the most recent exogenous information). Here, we adopt the convention from prior literature and use t0. As such, a 7 days ahead forecast, i.e. for day t = t0 + 7, with a latency of δ = 14 days may actually use exogenous data that is available after the forecast horizon (days t0 + 8, ..., t0 + δ). This is a curious situation, but we note that it is accepted practice within the ILI forecasting community (often referred to as hindcasting), and hence we have chosen to include these results. Obviously, for forecast horizons greater than or equal to δ, no "future" exogenous data is available, which makes the outcomes of these experiments more relevant in practical terms.
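The timing convention above (forecast horizons counted from t0, the day of the last available ILI rate) can be made concrete with a short sketch. The function below is illustrative only; its name and return layout are ours, not the paper's:

```python
from datetime import date, timedelta

def data_availability(t0, gamma, delta=14):
    """Given t0 (last day with a published ILI rate), a forecast horizon
    gamma, and the exogenous-data lead delta, return the forecast target
    date, the last day with available Web search data, and whether the
    task is a 'hindcast' (search data extends past the target date)."""
    target = t0 + timedelta(days=gamma)       # forecast is for day t0 + gamma
    last_search = t0 + timedelta(days=delta)  # search data lags by ~24h only
    return target, last_search, last_search > target
```

For example, with t0 on 1 January and δ = 14, a γ = 7 forecast is a hindcast (search data up to 15 January extends past the 8 January target), while for γ ≥ 14 no "future" exogenous data is available.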
We propose and evaluate the performance of three NN architectures, namely a simple feedforward network (FF) and two forms of recurrent neural networks (denoted SRNN and IRNN; see Methods), all of which incorporate the frequency time series of various Web search terms as exogenous variables, and provide uncertainty estimates by deploying BNN layers and inference techniques. The forecast targets are US national ILI rates as published by the CDC. Evaluation is performed for the four flu seasons from 2015/16 to 2018/19 (both inclusive). For the overall best performing NN model, IRNN, we also confirm that the incorporation of exogenous data significantly improves performance. The best performing networks for each forecasting horizon, SRNN when γ = 7 and IRNN otherwise, are then compared with Dante [21], a state-of-the-art conventional ILI forecasting model. Our experiments show that the proposed NN architectures that incorporate Web search activity can significantly reduce forecasting error and provide significantly earlier insights about emerging ILI trends.

Results
We first provide a comparative performance analysis of the NN-based models. Then, we compare with the established state-of-the-art in ILI forecasting. Details about the models, training, and evaluation can be found in the Methods section.

Forecasting performance of NNs
We investigate the performance of three Bayesian NN architectures: a feedforward network (FF), a simple recurrent NN optimised for a single forecast horizon (SRNN), and an iterative RNN which feeds back daily forecasts to itself up to and including the horizon window (IRNN). We forecast the national level weighted ILI rate (wILI; see definition in the Methods section) in the US over four flu seasons, namely 2015/16 to 2018/19, from late October until June (exact dates are provided in S2 Table). We evaluate our models for four forecast horizons, γ = 7, 14, 21, and 28 days ahead of the last available ILI rate. The input to all NNs comprises both past ILI rates and time series of Web search query frequencies. In addition, for a more complete comparison, we also report performance results for the best performing NN, IRNN, after excluding Web search activity data. We deploy six metrics to compare estimated forecasts to reported ILI rates (ground truth). Mean absolute error (MAE) and bivariate correlation (r) compare forecasts without considering the associated uncertainty. Negative log likelihood (NLL), continuous ranked probability score (CRPS; a generalisation of MAE), and Skill weight the error by its corresponding uncertainty. The (dis)advantages of the various weightings are discussed in S1 Appendix. For NLL, CRPS, and MAE a lower score is better, while for r and Skill higher scores are better. When average metrics are calculated across several seasons or forecast horizons, the arithmetic mean is used for all metrics besides Skill, where the geometric mean is used [15].
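As an illustration of two of the less common evaluation choices above, the sketch below implements the closed-form CRPS of a Gaussian forecast against an observation, and the geometric-mean averaging used for Skill. This is not the paper's evaluation code; function names are ours:

```python
import math

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against an
    observed value y. Lower is better; as sigma -> 0 it tends to |y - mu|,
    which is why CRPS is viewed as a generalisation of MAE."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def mean_skill(skills):
    """Geometric mean, used when averaging Skill across seasons/horizons."""
    return math.exp(sum(math.log(s) for s in skills) / len(skills))
```

Note that a perfectly centred forecast (y = mu) still incurs a CRPS proportional to sigma, penalising needlessly wide uncertainty.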
Table 1 enumerates the performance metrics for the three NNs in each flu season and forecast horizon. The IRNN performs best for all forecast horizons, except for γ = 7 days ahead where SRNN is the best performing model. As we detail in Methods, this is expected given the model design. IRNN, contrary to SRNN and FF, does not use future query frequencies (from the 7 days following the target forecast date) for the hindcasting task (γ = 7). Interestingly, we also observe that the performance of IRNN does not change between γ = 7 and 14, something that can probably be explained by a model behaviour that gives significantly more importance to the more recent inputs (search query frequencies are ahead of the past ILI rates by δ = 14 days). IRNN, the most advanced NN that we propose, reduces error compared to the next best NN architecture by 14.87% in terms of MAE and 20% in terms of CRPS, and improves Skill by 32.48%, when averaged across all test seasons and forecasting horizons γ = 14, 21, and 28 days. IRNN yields further improvements in the rest of the metrics, although these have a more limited interpretability. The fact that IRNN's gains in CRPS exceed those in MAE (by 4.15 percentage points) means that it is also a better model for the uncertainty bounds compared to FF and SRNN. In addition to the three NNs, we also provide performance metrics for an IRNN variant that does not use any search query frequency data (denoted by IRNN0) as well as for a simple persistence model (denoted by PER; see S1 Appendix for a definition). IRNN consistently performs better than IRNN0, which confirms our hypothesis that Web search activity information provides a significant performance improvement. On the other hand, IRNN0 displays competitive performance when compared to SRNN or FF, which highlights that IRNN is a more suitable model for handling search query frequency time series. In S4 Table, we have also provided an additional baseline comparison with an elastic net [37] model that, in line with our previous work [34], provides inferior performance (S4 Table and S6 Fig). A fair comparison with Gaussian Process models [38], which we have also deployed in the past [36,39], was not practically tractable given the high dimensionality of the task and the relatively large number of training samples. Finally, the persistence model baseline is always inferior to at least one of the NN models. Forecasts from IRNN in every season and forecast horizon are shown in Fig 2, whereas forecasts from the FF and SRNN architectures are shown in the Supporting Information (S7 and S8 Figs, respectively). The expected decline in accuracy as the forecast horizon increases is visually evident for all models. Interestingly, forecasts from the FF NN closely follow the estimates of a persistence model (i.e. shifted ground truth), and also have quite pronounced uncertainty bounds for γ = 21 and 28. SRNN provides smoother but generally flatter forecasts that in principle may capture the underlying ILI trend. However, they quite often underestimate the exact flu rate and are over-confident (tight uncertainty bounds). The IRNN makes more independent forecasts that do not necessarily follow previous trends in recently observed ILI rates. Uncertainty bounds increase slightly with γ, albeit we note that this model does not directly differentiate between forecasting horizons. Overall, forecasts from IRNN correspond better to the ILI rate range and provide an early flu onset warning (in at least 3 of the 4 test seasons).
Fig 3 shows the calibration of the confidence intervals (CIs) for each of the NNs. The x-axis represents the expected frequency with which the ground truth will fall in a specified region of confidence, while the y-axis represents the empirical frequency as measured from the test results. Recall that each forecast has an associated uncertainty represented by a Gaussian distribution. For a specified probability ρ, we can determine the confidence region around each forecast such that we expect the ground truth to fall within it with probability ρ. ρ can be computed as ρ = cdf(n) − cdf(−n), where n is the number of standard deviations away from the mean, and cdf denotes the cumulative distribution function. For a given probability (on the x-axis), we compute the empirical probability for each of the four test seasons. The diagonal line (y = x) represents perfect calibration, i.e. the expected and empirical probabilities are the same. Points above the diagonal indicate that the uncertainty estimates are too large. Conversely, points below it indicate that the uncertainty estimates are too small. The shadow around the calibration curve shows the variation due to different initialisation seeds over 10 NN training runs (see Methods for further details). Uncertainties produced by the IRNN are closer to the diagonal (i.e. better estimates of uncertainty) for horizon windows greater than 7. Overall, we see that FF is an under-confident model, SRNN an over-confident model, and IRNN generally more balanced, although its error in confidence increases for the largest forecast horizon (γ = 28).
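A single point on such a calibration curve can be computed as follows; this is a minimal sketch with illustrative function names, assuming Gaussian forecast distributions as described above:

```python
import math

def norm_cdf(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_coverage(n):
    """rho = cdf(n) - cdf(-n): probability mass within n standard deviations."""
    return norm_cdf(n) - norm_cdf(-n)

def empirical_coverage(y_true, mu, sigma, n):
    """Fraction of ground-truth points that fall inside each forecast's
    +/- n standard deviation interval (the y-axis of the calibration plot)."""
    hits = sum(abs(y - m) <= n * s for y, m, s in zip(y_true, mu, sigma))
    return hits / len(y_true)
```

Plotting empirical_coverage against expected_coverage over a grid of n values traces one calibration curve; points above y = x indicate over-wide intervals, points below indicate over-confidence.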

Comparison with state-of-the-art
We compare our best model for each forecasting horizon, i.e. SRNN for γ = 7 and IRNN for γ ≥ 14, to a state-of-the-art ILI rate forecasting model known as 'Dante' [21]. In its original implementation, Dante produces a binned forecast and does not permit comparison based on CRPS or NLL (see S1 Appendix). Therefore, for this analysis we restrict the performance metrics to Skill, MAE, and bivariate correlation.
To be consistent with prior published literature and conduct a fair comparison, we adopt exactly the same training setup as proposed in the paper that introduced Dante [21]. However, we would like to make the reader aware of various caveats in this comparison. First, Dante's national US ILI rate forecasts are based on ILI rates from 63 subnational US geographical regions (50 US states, 10 Health and Human Services regions, the District of Columbia, Puerto Rico, and Guam) as well as ILI rates at the national level. The NNs use only national US ILI rates, augmented with a US national aggregate of Web search activity data. The latter is more recent, i.e. search query frequencies are available until t0 + δ, which is after the last observed ILI rate (t0). To remove this temporal advantage, we do not use Web search activity data generated after t0 when training models for comparison with Dante. Secondly, Dante is trained using a leave-one-flu-season-out methodology, training on all other flu seasons (past and future) but the test one. Thus, for example, for the test season 2016/17, Dante will use historical data prior to 2016 and after 2016/17. We do not consider this appropriate as, in practice, a deployed system has no knowledge of future seasons. However, for comparison purposes, we train our models using leave-one-flu-season-out as well. We note that we were not able to successfully train Dante when restricting training data to exclude future seasons; Dante's performance was too poor to be considered for a comparison. We emphasise that training on dates after the test season is only done when comparing to Dante. Another caveat is that Dante exploits regional ILI prevalence to produce a national forecast; this can sometimes provide an earlier warning, as outbreaks will first be recorded sub-nationally. Our models are not built this way, and cannot leverage this information. The final remark is that Dante is retrained prior to conducting each forecast. Although that is possible for the NN models as well, running complete experiments (across many seasons, different NN architectures, and different initialisation seeds) with retraining prior to every forecast would have taken a considerable amount of time. Hence, NNs make forecasts for an entire flu season without retraining.
Table 2 shows the metrics for the best NN for each forecast horizon γ, trained with leave-one-flu-season-out and with search data from t ≤ t0, and results for Dante taken on identical forecast dates. When averaged over all forecasting tasks, the NNs have 11.93% higher Skill, 4.97% lower MAE, and 5.96% higher correlation than Dante. Dante has a better calibrated uncertainty compared to IRNN, but this can be explained by its significantly larger uncertainty estimates, which are sometimes over 2 times greater than the ones produced by IRNN (S9 Fig). In general, a better calibrated uncertainty is less important when forecast error metrics indicate an overall inferior performance. The last column (NNb) of Table 2 provides an expanded comparison (full results are shown in S4 Table) whereby we have enabled training with Web search activity data that maintains its actual latency (t0 + δ). As expected, the performance benefits increase, with 33.52% higher Skill, 14.37% lower MAE, and 8.78% higher correlation compared to Dante. Disabling leave-one-flu-season-out training on our end also results in better performance compared to Dante, which maintains its knowledge of future flu seasons (see column NNa of Table 2).

Discussion
We have demonstrated the ability of neural networks to forecast ILI rates by incorporating exogenous Web search activity data while providing uncertainty estimates. IRNN exhibits superior performance (averaged over all test years) for forecast horizons greater than 7 days, whereas SRNN is superior for the γ = 7 days ahead forecast horizon, a prediction task also referred to as hindcasting. As discussed extensively (see Methods and Results), this is expected because when γ = 7 days, SRNN uses all the available Web search activity data, which extends 7 days beyond the target forecasting horizon. We have also demonstrated that the proposed forecasting framework can provide very competitive performance that is better than the established state-of-the-art in ILI rate forecasting. Our experiments highlight the importance of including Web search activity for forecasting ILI rates, with or without its expected temporal advantage. This is consistent with previous literature whereby the added value of online user-generated data streams (e.g. Web search, but also social media) has been evaluated [5,34,40]. However, our experiments present the most comprehensive analysis to date, assessing performance over 4 consecutive flu seasons, and utilising an open-ended, non-manually curated set of search queries. In addition, we have cross-examined accuracy with a number of different error metrics, including CRPS and NLL, which can incorporate the validity of uncertainty estimates. We have seen that adding Web search information not only improves accuracy, but also provides better estimates of confidence (S1 Fig).
By examining ILI seasons in our training and test sets, we can deduce that the 2015/16 test season is the least similar to previously seen seasons (mean bivariate correlation of 0.74), whereas 2018/19 is the most similar (mean bivariate correlation of 0.81). With that in mind, we observe that, in comparison to Dante, the NNs that utilise Web search activity perform better when the flu season has a more novel prevalence trajectory (Table 2). As Dante utilises ILI rates only (including subnational ones), it is expected to be more narrowly focused on previously seen ILI rate trajectories. In contrast, the search query frequency time series provide an opportunity to capture more complex underlying patterns, and hence seem to be a more informative source during novel flu seasons.
From an epidemiological perspective, accurate forecast estimates might not always be the sole determinant of model superiority. Although our model performance analysis is comprehensive and, contrary to most of the related literature, provides a clean depiction of seasonal forecasts, it does focus on the accuracy of a forecast and its associated uncertainty. Table 3 attempts to address that partially by offering a few additional comparative insights, following aspects of a similar analysis for ILI rate nowcasting models in England [22]. Focusing on the most challenging forecasting horizons (γ = 21 and 28 days), we compute the delay in forecasting the peak of the flu season as well as the difference in magnitude between the predicted and the estimated peak ILI rate. We see that Dante is making either

Existing disease forecasting frameworks are difficult to scale, and incorporating additional features or more training data can result in excessive computational cost. This results in a trade-off between model flexibility and the number of exogenous variables a model can handle effectively [11,41,42]. An advantage of neural networks is that they are easy to scale; increasing the amount of training instances often results in better overall performance [43]. Overfitting issues, which become more apparent when working with relatively small data sets, are alleviated to an extent by the deployment of a Bayesian layer, which averages over parameter values instead of making single point estimates [44]. A lingering disadvantage, however, is that there is no current consensus on estimating uncertainty with NNs in a principled manner. Our methodological approach, presented in the following section, has attempted to address that by considering two modes of uncertainty (epistemic and aleatoric). In addition, given the relatively restricted amount of samples for training neural networks, our experimental approach provides novel insights for model derivation, training, and hyperparameter validation for similar time series forecasting tasks.
It is equally important to acknowledge the limitations of our methodological approach, and more broadly, of this research task as a whole. We note that the retrospective analysis provided in this paper cannot be the only determinant for model deployment within established syndromic surveillance systems. This would also require real-time assessments during ongoing influenza seasons in collaboration with public health organisations. Furthermore, an ILI consultation rate is not always representative of the true influenza rate in a population. It is a proxy indicator, and as such it might oftentimes be biased [45,46]. Therefore, any model that is trained and evaluated based on these rates is inherently limited by this property. An additional factor that could arguably yield misleading inferences is the co-existence of COVID-19 and influenza, given their similar symptom profiles. Although this is outside the remit of this paper, early results from our ILI models for England during the 2022/23 flu season have showcased that ILI rates can be accurately estimated during COVID-19 outbreaks [47]. From a methodological perspective, we note that our approach to estimating uncertainty can be improved; IRNN, the best performing NN, is currently not explicitly aware of the actual forecasting horizon (γ) when conducting a prediction (see Methods). Addressing this in an appropriate way will most likely result in better calibrated uncertainty estimates. From an empirical evaluation perspective, our experiments have been conducted on the US at a national level. Hence, although we expect that these results will generalise sub- and inter-nationally, we have no evidence of this, apart from the fact that past research on similar types of models has shown promise in various different US subregions or countries [36,39,[48][49][50]. Finally, the application presented in this paper relies on the existence of Web search activity data. Access to these data is not assured, as it depends both on sufficient Internet usage rates and on the willingness of private corporations to provide this information for research and epidemiological modelling. Nonetheless, the presented forecasting models do provide a general machine learning approach applicable to different input (e.g. social media activity, body sensors) and output streams of information (e.g. different disease indicators).

Neural network architectures
The three NN architectures we have deployed are described next. Each NN outputs two values, namely an ILI rate forecast estimate (ŷ) and an associated data uncertainty (σ̂). Each architecture also has an additional Bayesian layer where the weights are specified by an associated probability distribution. Multiple models are instantiated by sampling from the weight distribution, and the outputs from these model instances are used to estimate the model uncertainty.
Feedforward Neural Network (FF). The FF model has two hidden feedforward neural layers with a ReLU (max(0, x)) activation function, and a Bayesian layer (S12 Fig). The input to the network is a window of τ + 1 days of ILI rates and m search query frequencies. ILI rates are reported with a collection delay of δ days, i.e. the ILI rate for day t0 is published about δ days later; the delay is assumed to be δ = 14 days throughout our experiments. Thus, at day t0 (the day of the last available ILI rate), the input to the network consists of ILI rates F_{t0−τ} to F_{t0}, and search query frequencies Q_{t0+δ−τ} through Q_{t0+δ}. We ignore the temporal structure of the data and use an (m + 1) × (τ + 1) vector as the input to the neural network. The output of the network is an estimate of the ILI rate and corresponding data uncertainty γ days ahead.
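The flattening of the (m + 1) × (τ + 1) input into a single vector can be sketched as below; the helper name and argument layout are illustrative, not the paper's implementation:

```python
def build_ff_input(ili, queries, tau):
    """Flatten the last tau+1 days of the ILI series and of each of the m
    search-query frequency series into one (m+1)*(tau+1) vector, since the
    FF network ignores the temporal structure of its input."""
    def window(series):
        return series[-(tau + 1):]   # most recent tau+1 days of the series
    x = list(window(ili))
    for q in queries:                # queries: list of m frequency series
        x.extend(window(q))
    return x
```

In practice the ILI and query windows would end at different calendar days (t0 versus t0 + δ); here each series is assumed to already be trimmed to its own last available day.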
Simple Recurrent Neural Network (SRNN). This is a recurrent neural network which observes a time series of ILI rates and search query frequencies (S13 Fig). The input to the network is the same as for FF, but without flattening into a vector. We feed the (m + 1) × (τ + 1) input matrix into a Gated Recurrent Unit (GRU) layer one day at a time. The final output of the GRU is passed to a dense layer with a distribution over its weights.
Iterative Recurrent Neural Network (IRNN). This is a recurrent neural network which makes forecasts of the ILI rate and search query frequencies one day at a time, basing each forecast on its own previous forecasts. IRNN comprises a recurrent GRU layer and a feedforward Bayesian layer (Fig 4). Observations, beginning from time point (day) t0 − τ, are fed into the network a day at a time, where τ denotes the window size of past observations that we consider (τ + 1 = 56 days). Due to the reporting delay of the ILI rates, when ILI rates are available up to day t0, search query frequencies are available up to day t0 + δ, where δ = 14 days in our experiments. For days t0 − τ to t0, IRNN enters a warm-up phase where it sets the hidden states in the RNN layer without making any predictions. For days t0 to t0 + δ, we can observe search query frequencies, but we cannot observe ILI rates. At this stage, IRNN performs nowcasting with respect to input Q. During nowcasting, the estimated ILI rate F̂_t is combined with the true search query frequencies Q_t to form the input for the next time step; the search query frequency estimates are discarded, as the true values are known to us. For days t0 + δ + 1 to t0 + γ, where γ denotes the forecasting horizon, IRNN conducts pure forecasting, as neither search query frequencies nor ILI rates are known for that period. Forecasted values for both are used as inputs for subsequent time steps. The full sequence of both predicted ILI rates and search query frequencies is used in the training loss. Note that IRNN does not use future (for a period of 7 days after the target forecast) search query frequencies when γ = 7. Hence, for both γ = 7 and 14, the difference in performance is expected to be minor, attributable only to the more recent past ILI rate inputs, given that search query frequencies are always the more recent information source (as opposed to past ILI rates). This is also empirically confirmed by our experiments (see Table 1). A caveat of the current formulation of IRNN is that the model is agnostic of the actual forecast horizon, and hence its uncertainty might be underestimated for larger forecasting horizons.
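The warm-up, nowcasting, and forecasting phases of the rollout can be sketched as follows. Here `step` stands in for one GRU plus output-layer step; all names and the index bookkeeping are illustrative rather than the paper's implementation:

```python
def irnn_rollout(step, ili, queries, t0, delta, gamma):
    """Sketch of the IRNN rollout. `step(state, ili_t, q_t)` is a stand-in
    for one recurrent step and returns (new_state, ili_pred, q_pred).
    `ili` is observed up to index t0; `queries` up to index t0 + delta."""
    state = None
    # Warm-up (days t0 - tau .. t0): both inputs observed; predictions
    # are produced but only used to seed the feedback loop.
    for t in range(t0 + 1):
        state, ili_hat, q_hat = step(state, ili[t], queries[t])
    # Iterative prediction up to the horizon t0 + gamma.
    for t in range(t0 + 1, t0 + gamma + 1):
        # Nowcasting while queries are still observed; pure forecasting after.
        q_in = queries[t] if t <= t0 + delta else q_hat
        state, ili_hat, q_hat = step(state, ili_hat, q_in)
    return ili_hat   # ILI estimate for day t0 + gamma
```

A single rollout therefore serves any horizon γ, which is also why the model itself is agnostic of γ.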

Uncertainty estimation using a Bayesian NN layer
We first describe the data uncertainty, then model uncertainty, and finally how the two uncertainties are combined.
Data uncertainty. Data or aleatoric uncertainty is caused by noisy observations. It can be further divided into homoscedastic (constant for all inputs) and heteroscedastic (dependent on the input) uncertainty [53]. We estimate heteroscedastic uncertainty in the output layer of the neural network by approximating the parameters of a Gaussian distribution [54]:

f_Φ(x) = N(ŷ, σ̂) = N(a_1, softplus(a_2)),     (1)

where a_1 and a_2 are the outputs of the neural network. Softplus is defined by

softplus(x) = (1/s) ln(1 + exp(s·x + c)),     (2)

where s > 0 is a scaling factor, and c = ln(e^s − 1) is an offset which makes the output equal to 1 when a_2 = 0. The hyperparameters are jointly optimised using Bayesian optimisation (see "Hyperparameter optimisation"). The network is trained by minimising the NLL given by

NLL(y, ŷ, σ̂) = Σ_{t=1}^{T} [ (y_t − ŷ_t)² / (2σ̂_t²) + (1/2) ln(2π σ̂_t²) ],     (3)

where y = {y_1, ..., y_T} is a series of T flu rates (ground truth), ŷ = {ŷ_1, ..., ŷ_T} is a series of T forecasts, and σ̂ = {σ̂_1, ..., σ̂_T} is a series of associated standard deviations (data uncertainty). The first component of Eq 3 contains a residual term equivalent to the mean squared error and an uncertainty normalisation term. The second component prevents the model from predicting an infinitely large σ̂. Minimising the NLL allows us to train an NN despite not having ground truth estimates of the data uncertainty.

Model uncertainty. Model or epistemic uncertainty is inherent in the parameters of the model. It is caused by having insufficient data to exactly set the model's parameters. To estimate model uncertainty, we deploy a BNN trained using variational inference [55]. BNNs have a distribution over their parameters. For our purposes, we restrict the Bayesian component to the last layer of a network, while preceding layers use deterministic weights and biases. We do this as we found it to be more stable and faster compared to training with a distribution over all the parameters of the model [56]. We use a Gaussian prior with a diagonal covariance matrix, p(Φ) = N(0, σ_p I), where σ_p ∈ [0.0001, 0.1] is a hyperparameter. The form of the distributions is an implementation choice. We specify the posterior distribution to have the same form (Gaussian with diagonal covariance) as the prior, such that we have a conjugate prior [57]. The mean μ_w and standard deviation σ_w of each weight in the posterior are learned. θ holds all the layer's parameters that are updated during training; the means in the posterior constitute half of its parameters and the softplus of the standard deviations constitutes the other half. Instances of the model are created by sampling the weights in the Bayesian layer. For each instantiation of the model, a forecast and associated data uncertainty are determined. The variation in the forecast across the model instances provides an estimate of the model uncertainty.
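The data-uncertainty head and its training loss can be sketched as follows. The exact parameterisation of the scaled softplus is our reconstruction of the garbled original (chosen so that the output is 1 when the raw network output is 0); the NLL follows the standard Gaussian form described in the text:

```python
import math

def scaled_softplus(x, s=1.0):
    """Scaled softplus used for the predicted standard deviation.
    The offset c = ln(e^s - 1) makes scaled_softplus(0) = 1 for any s > 0."""
    c = math.log(math.exp(s) - 1.0)
    return math.log(1.0 + math.exp(s * x + c)) / s

def gaussian_nll(y, y_hat, sigma_hat):
    """Negative log likelihood of ground-truth rates y under the predicted
    Gaussians N(y_hat_t, sigma_hat_t^2): a precision-weighted residual term
    plus a term that penalises inflating sigma_hat."""
    return sum(
        (yt - mt) ** 2 / (2.0 * st ** 2) + 0.5 * math.log(2.0 * math.pi * st ** 2)
        for yt, mt, st in zip(y, y_hat, sigma_hat)
    )
```

Note the trade-off the loss encodes: a larger σ̂ shrinks the residual term but grows the log term, so the model cannot escape errors by predicting arbitrarily large uncertainty.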
Combining data and model uncertainty. Each time we sample from the posterior distribution of the model's parameters (which in "Experiments" will be denoted by $q_\theta(\Phi)$), we create a model instance $k$ that forecasts a mean $\hat{y}'_k$ and standard deviation $\hat{\sigma}'_k$. After drawing $K$ samples, the predictions are combined to create a single distribution. The mean $\hat{y}$ is the mean of the $K$ forecasts, i.e.

$\hat{y} = \frac{1}{K} \sum_{k=1}^{K} \hat{y}'_k$, (4)

and the variance is given by [31]

$\hat{\sigma}^2 \approx \frac{1}{K} \sum_{k=1}^{K} (\hat{y}'_k)^2 - \left( \frac{1}{K} \sum_{k=1}^{K} \hat{y}'_k \right)^2 + \frac{1}{K} \sum_{k=1}^{K} (\hat{\sigma}'_k)^2$. (5)

In Eq 5, the first two terms define the variance of the means, i.e. the model uncertainty. The third term is the mean of the variances, i.e. the data uncertainty.
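Eqs 4 and 5 amount to moment-matching a mixture of K Gaussians into a single Gaussian. A minimal sketch (function name is ours):

```python
import numpy as np

def combine_predictions(means, stds):
    """Moment-match K Gaussian predictions N(means[k], stds[k]^2) into one
    Gaussian. The variance splits into the variance of the means (model
    uncertainty) plus the mean of the variances (data uncertainty)."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    y_hat = means.mean()                           # Eq 4
    model_var = np.mean(means ** 2) - y_hat ** 2   # first two terms of Eq 5
    data_var = np.mean(stds ** 2)                  # third term of Eq 5
    return y_hat, model_var + data_var
```

For example, two instances that agree on the mean but carry different data uncertainties yield zero model uncertainty, while disagreeing means contribute variance even when each instance is perfectly confident.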

Experiments
We first introduce the training setup for a BNN and the variations used for the different architectures; we then discuss hyperparameter optimisation and, finally, how the evaluation is performed in our experiments.

Training a BNN. Bayesian inference is used to learn the posterior distribution $p(\Phi | D)$ over the model parameters $\Phi$, where $D$ is the training data containing inputs and targets. In theory, the posterior can be derived using Bayes' rule, i.e.

$p(\Phi | D) = \frac{p(D | \Phi) \, p(\Phi)}{p(D)}$, (6)

where $p(\Phi)$ is a prior distribution which expresses prior belief about the parameter distribution before training data are observed. However, we cannot obtain an exact estimate of the posterior because the integral $p(D) = \int_{\Phi} p(D, \Phi') \, d\Phi'$ is unavailable in closed form and requires exponential time to compute [55]. We instead use variational inference, replacing Eq 6 with an optimisation task. The posterior $p(\Phi | D)$ is constrained to a family of distributions $F$ [55], and instead of finding the exact posterior, a variational distribution $q_\theta(\Phi)$ is used. $\theta$ describes the distribution [58], e.g. the mean and variance of a Gaussian. The goal of training becomes to learn $\theta$ by minimising the Kullback-Leibler (KL) divergence to the true posterior $p(\Phi | D)$, as in

$\operatorname*{argmin}_{\theta} D_{\mathrm{KL}}\!\left[ q_\theta(\Phi) \,\|\, p(\Phi | D) \right] \;\; \text{such that} \;\; q_\theta(\Phi) \in F$. (7)

The KL divergence represents the average number of additional bits required to encode a sample from $p(\Phi | D)$ using $q_\theta(\Phi)$. Computing it directly requires $p(D)$. However, we can instead maximise the evidence lower bound (ELBO), given by

$\mathrm{ELBO} = \mathbb{E}_{q_\theta(\Phi)}\!\left[ \ln p(D | \Phi) \right] - D_{\mathrm{KL}}\!\left[ q_\theta(\Phi) \,\|\, p(\Phi) \right]$, (8)

which is equivalent to Eq 7 up to a constant and is tractable. The first component of Eq 8 is computed using Eq 3 and measures how well the model fits the training data. The second component of Eq 8 is the KL divergence between the posterior and the prior distribution. The KL divergence term behaves similarly to a regulariser and encourages the model to choose a simple $q_\theta(\Phi)$. When the ELBO is computed using minibatch gradient descent, the gradients are averaged over each minibatch. Therefore, to compute the ELBO, the KL divergence term is weighted by the total number of minibatches during training [59]. We also use an additional KL annealing term $\xi$, which limits the regularisation effect of the KL term. For the FF and SRNN models, $\xi = 1/M$, where $M$ is the number of minibatches multiplied by $\mathrm{KL}_w$ [60]. For the IRNN model, $M$ is the number of minibatches multiplied by $\gamma = 28$ and $\mathrm{KL}_w$. This is because the model makes 28 outputs, each of which requires calling the dense Bayesian layer. The constant $\mathrm{KL}_w$ is chosen by hyperparameter optimisation. Thus, the negative of Eq 8 with the KL term annealed becomes our training loss,

$\mathcal{L} = \mathrm{NLL}(y, \hat{y}, \hat{\sigma}) + \xi \, D_{\mathrm{KL}}\!\left[ q_\theta(\Phi) \,\|\, p(\Phi) \right]$. (9)

When training the FF and SRNN models, each training step takes an $[(m + 1) \times (\tau + 1)]$-dimensional input (where $m$ denotes the number of search queries, and $\tau + 1$ denotes the window of days, from $t_0$ and back, for which query frequencies and ILI rates are used) and produces a forecast estimate $\hat{y}_{t_0 + \gamma}$ containing both a mean and a standard deviation for the ILI rate at time (day) $t_0 + \gamma$. The parameters $\Phi$ are updated by minimising Eq 9. During each training step, one sample is taken from $q_\theta(\Phi)$ and used to compute the loss. We use backpropagation to compute gradients and update the parameters in both the Bayesian and non-Bayesian layers.
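Because both the variational posterior and the prior are diagonal Gaussians, the KL term in Eq 9 has a closed form. The sketch below uses the standard closed-form expression (our assumption, not taken from the paper's code) together with the annealed loss:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, sigma_p):
    """Closed-form KL(q || p) for a diagonal-Gaussian posterior
    q = N(mu_q, diag(sigma_q^2)) and a prior p = N(0, sigma_p^2 I),
    summed over all weights."""
    mu_q = np.asarray(mu_q, dtype=float)
    sigma_q = np.asarray(sigma_q, dtype=float)
    return float(np.sum(np.log(sigma_p / sigma_q)
                        + (sigma_q ** 2 + mu_q ** 2) / (2.0 * sigma_p ** 2)
                        - 0.5))

def annealed_loss(nll, kl, xi):
    """Training objective of Eq 9: the NLL plus the KL term scaled by the
    annealing factor xi (xi = 1/M, with M as defined in the text)."""
    return nll + xi * kl
```

The KL term vanishes when the posterior matches the prior exactly and grows as the learned weights move away from it, which is the regularising behaviour described above.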
The model is retrained for each time horizon $\gamma$, where $\gamma$ = 7, 14, 21, or 28 days, and for each test period. The output of the IRNN is a sequence of ILI rates and search query frequencies. Although we have search data from $t_0$ to $t_0 + \delta$, we use the full sequence of estimated query frequencies when backpropagating the loss (Eq 9) through time. When evaluating the model's performance, we are only concerned with the model's ILI rate forecasts. The Bayesian layer is called once for each iterative prediction.
Hyperparameter optimisation. We use Bayesian hyperparameter optimisation [61] with 5-fold cross validation, where each fold is a 365-day period covering a full flu season (see S3 Table and S3 and S4 Figs). We tune the hyperparameters once before the first test period, and keep the same hyperparameters for all subsequent test seasons. For the FF and SRNN, the hyperparameters are re-tuned for each of the four forecast horizons. For the IRNN, the hyperparameters are tuned once, considering all four forecasting horizons (the average NLL is computed across them). The hyperparameters are the following: the size of the hidden NN layers $\in$ [25, 125], the number of queries $m \in$ [20, 150], the weighting of the KL divergence term in the loss $\mathrm{KL}_w \in$ [0.0001, 1.0], the scaling factor of the output's standard deviation $s \in$ [1.0, 100], the prior standard deviation $\sigma_p \in$ [0.0001, 0.1], the number of epochs $\in$ [10, 100], and the learning rate $\in$ [0.0001, 0.01] for training the NNs. After the hyperparameters are tuned, we retrain the model using the full training set for the chosen number of epochs. The derived model is then used for forecasting on the test set. Note that hyperparameters are not re-tuned for the comparison with Dante (when Web search activity data that are more recent than the last observed ILI rate are removed), which may have disadvantaged our NN models.
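The search space above can be summarised as a configuration sketch. The dictionary keys and the integer/continuous split are our own naming; the ranges are quoted from the text:

```python
# Search space for Bayesian hyperparameter optimisation, using the ranges
# quoted in the text. Which parameters are integer-valued is our assumption.
search_space = {
    "hidden_layer_size": (25, 125),     # units per hidden NN layer (integer)
    "num_queries_m": (20, 150),         # number of search queries (integer)
    "kl_weight": (1e-4, 1.0),           # KL_w in the training loss
    "softplus_scale_s": (1.0, 100.0),   # scaling of the output std. dev.
    "prior_std_sigma_p": (1e-4, 0.1),   # std. dev. of the Gaussian prior
    "epochs": (10, 100),                # training epochs (integer)
    "learning_rate": (1e-4, 1e-2),
}
```

Any generic Bayesian optimisation library can consume such a space of (lower, upper) bounds, evaluating the 5-fold cross-validated NLL at each proposed configuration.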
Inference. When making an estimate with a BNN based on inputs $X$, and with training data $D$, the goal is to compute an output $\hat{y}$ marginalised over the entire distribution of $\Phi$:

$p(\hat{y} | X, D) = \int_{\Phi} p(\hat{y} | X, \Phi') \, p(\Phi' | D) \, d\Phi'$. (10)

In practice, $p(\hat{y} | X, D)$ is estimated using Monte Carlo sampling from $p(\Phi | D)$ [58]. At prediction time, the posterior distribution over the weights is sampled $K$ times, each sample giving an output $\mathcal{N}(\hat{y}', \hat{\sigma}')$. The $K$ estimates are combined using Eq 5, which produces an estimate of the combined model and data uncertainty. $K$ is chosen by sampling until the final estimate of $\hat{y}$ stabilises. Initially, we sample 10 times and produce an estimate using Eq 5. We then run the model a further 10 times and produce a new estimate using the 20 samples. We repeat this process until increasing $K$ by 10 does not change the estimated mean by more than 0.1%. Despite averaging over $K$ instances of the model, we observed some instability in training the models.
To resolve this, each model was trained 10 times with different initialisation seeds, i.e. the seed controlling the initial parameter values of the NN. The means of the 10 estimates of the forecasts and associated variances are our final forecast and variance. We considered alternative methods of combining the estimates, such as Eq 5 or averaging the probability density functions. Ultimately, we found that averaging the means and variances gave the best final forecasts. Thus, the total number of samples used to make a forecast is $\sum_{i=1}^{10} d_i$, where $i \in \{1, \ldots, 10\}$ denotes a different seed, and $d_i \geq 20$ is the number of samples required for seed $i$ to converge.
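The sampling procedure for choosing K can be sketched as follows. The stopping rule (batches of 10 until the mean moves by no more than 0.1%) is as described above; the `draw_forecast` interface and the `max_samples` cap are our own assumptions:

```python
import numpy as np

def sample_until_stable(draw_forecast, batch=10, tol=1e-3, max_samples=1000):
    """Draw forecasts in batches of `batch` until adding another batch
    changes the running mean by no more than tol (0.1% relative).
    `draw_forecast()` returns one sampled forecast (a scalar here)."""
    samples = [draw_forecast() for _ in range(batch)]
    while len(samples) < max_samples:
        prev_mean = np.mean(samples)
        samples += [draw_forecast() for _ in range(batch)]
        new_mean = np.mean(samples)
        if abs(new_mean - prev_mean) <= tol * abs(prev_mean):
            break
    return float(np.mean(samples)), len(samples)
```

In the real models each draw corresponds to resampling the Bayesian layer's weights and running a forward pass; here a scalar-returning callable stands in for that.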
To make an estimate with the SRNN and FF models, the inputs are passed through the model's layers up to the Bayesian layer. The weights in the Bayesian layer are then sampled $K$ times, and the estimates from the $K$ samples are combined with Eq 5, as discussed in the previous paragraph. Making an estimate with the IRNN has three distinct phases: warm-up, nowcasting, and forecasting (S14 Fig). During the warm-up phase, the model observes ILI rates and $m$ search query frequencies from $t_0 - \tau$ to $t_0$. This sets the hidden states of the GRU layer based on all ILI rates and search frequencies from those days. The output of the GRU is fed into the Bayesian layer (denoted by FC BNN in Fig 4), which estimates the input for the next time step. The Bayesian layer estimates model and data uncertainty, and has $2 \times (m + 1)$ units: the first half estimate the means of the query frequencies and the ILI rate, and the second half estimate the corresponding standard deviations. The estimated ILI rate is a distribution, which cannot be directly interpreted by an NN layer. Therefore, a sample from this distribution is combined with the true search query frequencies and fed back into the GRU layer. This is repeated from $t_0$ to $t_0 + \delta$ (nowcasting phase). After time $t_0 + \delta$, no more search query frequencies are available. The estimated search query frequencies and ILI rates from each time step are then fed back into the model to make subsequent forecasts. The process of making daily estimates can be repeated indefinitely, so $\gamma$, the forecasting horizon, could be increased arbitrarily.
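The three phases can be sketched schematically. Here `step` is a hypothetical stand-in for the GRU plus Bayesian output layer (sampled once per call), and the interface is our assumption rather than the paper's implementation:

```python
import numpy as np

def irnn_predict(step, ili, queries, future_queries, gamma):
    """Schematic IRNN inference.
    step(x, h) consumes one day's input vector [ILI rate, m query
    frequencies] and a hidden state, returning (sampled ILI rate,
    estimated query frequencies, new hidden state).

    Warm-up:     observed ILI rates + queries set the hidden state.
    Nowcasting:  the estimated ILI rate is combined with *observed* queries.
    Forecasting: both ILI rate and query estimates are fed back."""
    h, f_hat, q_hat = None, None, None
    for f, q in zip(ili, queries):            # warm-up: days t0 - tau .. t0
        f_hat, q_hat, h = step(np.concatenate(([f], q)), h)
    forecasts = []
    for q in future_queries:                  # nowcasting: t0 .. t0 + delta
        f_hat, q_hat, h = step(np.concatenate(([f_hat], q)), h)
        forecasts.append(float(f_hat))
    while len(forecasts) < gamma:             # forecasting: beyond t0 + delta
        f_hat, q_hat, h = step(np.concatenate(([f_hat], q_hat)), h)
        forecasts.append(float(f_hat))
    return forecasts
```

The same loop structure extends to arbitrary horizons, since the forecasting phase only consumes the model's own outputs.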
Evaluation. We evaluate the performance for forecasting horizons $\gamma$ = 7, 14, 21, and 28 days ahead. We choose weekly test dates starting from week 45 and lasting for 25 weeks. We use the 2015/16, 2016/17, 2017/18, and 2018/19 flu seasons to evaluate our models. We did not consider running experiments on data from 2019/20 or 2020/21, as flu prevalence significantly declined and ILI rate estimates from the CDC became less reliable during the COVID-19 pandemic. We train models on the period from 05/06/2004 until the Wednesday of the 33rd week of the year in which the test flu season starts (around mid-August). We test the models on the period from the Sunday of week 44 until the Saturday of week 23 in the following year. Exact training and test periods are provided in the Supporting Information (S2 Table), and the corresponding ILI rates are displayed in S2 Fig. To compare our NN models to Dante, we evaluate the model scores on the same test weeks as specified in Reich et al. (2019) [15]. When comparing the best performing NNs to Dante, the training set included all seasons except the test season, i.e. it also included data after the test season (models NN and NN$_b$ in Table 2). We did not re-tune hyperparameters to account for training on future seasons. As discussed earlier, we do not consider training on data after the test period to be appropriate, but it allows the most direct comparison to the training setup used by Dante. We also report the performance of our best performing NNs when trained using only data prior to the test season (model NN$_a$ in Table 2).

Fig 1 provides an alternative visual summary of the forecasting performance metrics of the different NN models when averaged over the four flu seasons (NLL and MAE are depicted; the rest of the metrics are displayed in S1 Fig).

Fig 1. Negative log-likelihood (NLL) and mean absolute error (MAE) for each NN model, averaged over all four test flu seasons (2015/16 to 2018/19). Scores for different forecast horizons ($\gamma$) are shown; lower values are better. We also provide a comparison with the IRNN trained without using any Web search activity data (IRNN$_0$), and a simple persistence model (PER). Note that NLL cannot be determined for PER as it does not provide an associated uncertainty. S1 Fig shows the results for all metrics. https://doi.org/10.1371/journal.pcbi.1011392.g001

Fig 3. Calibration plots for the forecasts made by the three NN models (FF, SRNN, and IRNN), averaged over the four test periods (2015/16 to 2018/19) and shown for the 4 forecasting horizons ($\gamma$). The lines show how frequently the ground truth falls within a confidence interval (CI) of the same level. To be more precise, a point (x, y) denotes that the proportion y $\in$ [0, 1] of the forecasts, when combined with a CI at the x $\times$ 100% level, include the ground truth (successful forecasts). The optimal calibration is shown by the diagonal black line. Points above or below the diagonal indicate an over- or under-estimation of uncertainty, and hence an under- or over-confident model, respectively. The shadows show the upper and lower quartiles of the calibration curves when the models are trained multiple times with different initialisation seeds. The plot broken out into separate test periods is shown in the Supporting Information (S11 Fig). https://doi.org/10.1371/journal.pcbi.1011392.g003

Table 3 .
Meta-analysis of ILI rate forecasts around the peak of a flu season for Dante, NN (the best NN variant when the temporal advantage of Web search activity data is removed), and NN$_b$ (same as NN but after reinstating the temporal advantage of Web search activity data). $\delta$-p denotes the temporal difference (in days) in forecasting the peak of the flu seasons 2015/16, 2016/17, 2017/18, and 2018/19, respectively. Negative / positive values indicate an earlier / later forecast; averaging $\delta$-p across the 4 test flu seasons would remove this information, which is why we enumerate all 4 values. Avg. $\delta$-y$_p$ measures the average magnitude difference in the estimate of the peak of the flu season between a forecasting model and the CDC. MAE-p is the MAE when the ILI rate is above the seasonal mean plus one standard deviation. SMAPE-p (%) is the symmetric mean absolute percentage of error for the same time periods. Outcomes that yield an unfavourable interpretation for the underlying forecasting model are shown in bold. Detailed outcomes for all NNs are shown in S5 Table.

We have also described how model training works with pseudocode in the Supporting Information (S14 Fig). Given its special structure, IRNN does not incorporate

Fig 4. Diagram of the IRNN architecture, where for the recurrent layers (RNN) we have used a Gated Recurrent Unit. An ILI rate, $F \in [0, 1]$, and $m$ search query frequencies, $Q \in \mathbb{R}_{\geq 0}^m$, beginning from time point (day) $t_0 - \tau$, are fed into the network one day at a time. $\tau$ denotes the window size of past observations that we consider ($\tau + 1 = 56$ days). Owing to the reporting delay of the ILI rates, when ILI rates are available up to day $t_0$, search query frequencies are available up to day $t_0 + \delta$, where $\delta = 14$ days in our experiments. Dashed arrow lines denote that the model is called for multiple time steps (where a time step is a day). For days $t_0 - \tau$ to $t_0$, the IRNN enters a warm-up phase where it sets the hidden states in the RNN layer without making any predictions. For days $t_0$ to $t_0 + \delta$, we can observe search query frequencies, but we cannot observe ILI rates. At this stage, the IRNN performs nowcasting with respect to input $Q$. During nowcasting, the estimated ILI rate $\hat{F}_t$ is combined with the true search frequencies $Q_t$ and used as the input for the next time step. The search query frequency estimates which are not used (as they are known to us) are shown by a faded box. For days $t_0 + \delta + 1$ to $t_0 + \gamma$, where $\gamma$ denotes the forecasting horizon, the IRNN conducts pure forecasting, as neither search query frequencies nor ILI rates are known for that period. Forecasted values for both are used as inputs for subsequent time steps. The full sequence of both predicted ILI rates and search query frequencies is used in the training loss.
an under- or over-confident model, respectively. The shadows show the upper and lower quartiles of the calibration curves when the models are trained multiple times with different initialisation seeds. (EPS)

S8 Fig. SRNN Forecasts. SRNN forecasts for all 4 test seasons (2015/16 to 2018/19) and forecasting horizons ($\gamma$ = 7, 14, 21, and 28). Confidence intervals (uncertainty estimates) are shown at the 50% and 90% levels, and are visually distinguished by darker and lighter colour overlays, respectively. The influenza-like illness (ILI) rate (ground truth) is shown by the black line. The flu seasons are shown in different colours, which correspond with the calibration plots on the right. The calibration lines show how frequently the ground truth falls within a confidence interval (CI) of the same level. To be more precise, a point (x, y) denotes that the proportion y $\in$ [0, 1] of the forecasts, when combined with a CI at the x $\times$ 100% level, include the ground truth (successful forecasts). The optimal calibration is shown by the diagonal black line. Points above or below the diagonal indicate an over- or under-estimation of uncertainty, and hence an under- or over-confident model, respectively. The shadows show the upper and lower quartiles of the calibration curves when the models are trained multiple times with different initialisation seeds. (EPS)

S9 Fig. Dante Forecasts. Dante forecasts for all 4 test seasons (2015/16 to 2018/19) and forecasting horizons ($\gamma$ = 7, 14, 21, and 28). Confidence intervals (uncertainty estimates) are shown at the 50% and 90% levels, and are visually distinguished by darker and lighter colour overlays, respectively. The influenza-like illness (ILI) rate (ground truth) is shown by the black line. The flu seasons are shown in different colours, which correspond with the calibration plots on the right. The calibration lines show how frequently the ground truth falls within a confidence interval (CI) of the same level. To be more precise, a point (x, y) denotes that the proportion y $\in$ [0, 1] of the forecasts, when combined with a CI at the x $\times$ 100% level, include the ground truth (successful forecasts). The optimal calibration is shown by the diagonal black line. Points above or below the diagonal indicate an over- or under-estimation of uncertainty, and hence an under- or over-confident model, respectively. (EPS)

S10 Fig. IRNN Forecasts. IRNN forecasts with leave-one-flu-season-out training and using all available Web search data for all 4 test seasons (2015/16 to 2018/19) and forecasting horizons ($\gamma$ = 7, 14, 21, and 28). Confidence intervals (uncertainty estimates) are shown at the 50% and 90% levels, and are visually distinguished by darker and lighter colour overlays, respectively. The influenza-like illness (ILI) rate (ground truth) is shown by the black line. The flu seasons are shown in different colours, which correspond with the calibration plots on the right. The calibration lines show how frequently the ground truth falls within a confidence interval (CI) of the same level. To be more precise, a point (x, y) denotes that the proportion y $\in$ [0, 1] of the forecasts, when combined with a CI at the x $\times$ 100% level, include the ground truth (successful forecasts). The optimal calibration is shown by the diagonal black line. Points above or below the diagonal indicate an over- or under-estimation of uncertainty, and hence an under- or over-confident model, respectively. The shadows show the upper and lower quartiles of the calibration curves when the models are trained multiple times with different initialisation seeds. (EPS)

S11 Fig. Detailed calibration plots for neural network models. Calibration plots for the forecasts made by the three NN models (FF, SRNN, and IRNN) for each of the four test periods (2015/16 to 2018/19) and forecasting horizons ($\gamma$). The lines show how frequently the ground truth falls within a confidence interval (CI) of the same level.

Table 1. Regression performance metrics for the three Bayesian NN models for four forecast horizons ($\gamma$ = 7, 14, 21, and 28 days ahead). Negative log-likelihood (NLL), continuous ranked probability score (CRPS), and Skill compare the accuracy weighted by the uncertainty of forecasts. MAE is the mean absolute error, and r is the bivariate correlation between forecasts and reported ILI rates. Best results for each metric and forecast horizon are shown in bold. The last three columns are performances averaged over the four test flu seasons (from 2015/16 to 2018/19).

Table 2. Forecasting performance metrics for the best-performing neural network (SRNN for $\gamma$ = 7, IRNN for $\gamma \geq$ 14) compared with Dante. The NNs are trained using search query frequencies generated only up to the last available ILI rate (the 2-week advantage of using Web search data is removed). We use leave-one-flu-season-out training, similarly to Dante. The best results for this comparison are shown in bold. The very last column (NN$_b$) presents the average performance of NNs where the temporal advantage of Web search activity information is maintained (see also S10 Fig, which depicts the IRNN's forecasts when leave-one-flu-season-out training is applied). The penultimate column (NN$_a$) holds results for the same experiment as NN$_b$, with the addition of disabling leave-one-flu-season-out training. https://doi.org/10.1371/journal.pcbi.1011392.t002

estimates (e.g. 70 days prior to the actual peak) or otherwise lags by 1 or 2 weeks (i.e. no early warning), whereas the NN models tend to always provide reasonable early warnings of the peak. While there is no definitive winner in estimating the magnitude of the ILI rate peak, by examining forecasts when the ILI rate was relatively high (above the seasonal mean plus one standard deviation), we observed that Dante's estimates were significantly worse in terms of MAE and relative MAE (symmetric mean absolute percentage of error). A similar analysis across NN variants is provided in S5 Table, highlighting the expected superiority of the IRNN.