Influenza forecasting for French regions combining EHR, web and climatic data sources with a machine learning ensemble approach

Effective and timely disease surveillance systems have the potential to help public health officials design interventions to mitigate the effects of disease outbreaks. Currently, healthcare-based disease monitoring systems in France offer influenza activity information that lags real time by one to three weeks. This temporal data gap introduces uncertainty that prevents public health officials from having a timely perspective on population-level disease activity. Here, we present a machine-learning modeling approach that produces real-time estimates and short-term forecasts of influenza activity for the twelve continental regions of France by leveraging multiple disparate data sources that include Google search activity, real-time and local weather information, flu-related Twitter micro-blogs, electronic health records data, and historical disease activity synchronicities across regions. Our results show that all data sources contribute to improving influenza surveillance and that machine-learning ensembles that combine all data sources lead to accurate and timely predictions.


Introduction
Influenza is a major public health problem causing up to five million severe cases and 500,000 deaths per year worldwide [1][2][3]. In France alone, the epidemic of 2018-2019 caused 9,500 deaths. During epidemic peaks, large increases in visits to general practitioners and to emergency departments are observed and often lead to disruptions to healthcare delivery, and thus increase the risk of undesirable outcomes in patients with influenza infections. To reduce the impact of influenza outbreaks in the population and to better design timely public health interventions, surveillance systems that produce accurate real-time and short-term forecasts of disease activity may prove to be instrumental. In France, an important influenza monitoring system was implemented by the Sentinelles network in 1984 [4,5]. This system centralizes information obtained from a group of volunteer general practitioners (1,314 in 2018) and pediatricians (116 in 2018) who report each week the proportion of patients with Influenza-Like-Illness (ILI, any acute respiratory infection with fever above 38 °C, cough and onset within the last ten days) seeking medical attention. Data collection, processing, aggregation and distribution of this information, at the national and regional levels, introduce delays of up to three weeks in the availability of flu activity information. This temporal data gap prevents public health officials from having the most up-to-date epidemiological information, and thus leads to the design of interventions that do not take into consideration recent changes in disease activity [2,6]. For example, if estimates were available in real time, information campaigns and vaccination prevention could be deployed earlier and could lead to greater impact. Additionally, healthcare facilities could be better prepared to respond to unexpected increases in the flux of high-risk patients during time periods of increased disease activity.
With the motivation to alleviate this time delay, mathematical modeling and machine learning approaches have been proposed to produce disease estimates in real time and ahead of healthcare-based surveillance systems in multiple nations around the world. Most of these studies have been designed and tested in developed nations, such as the United States and France, where information on disease outbreaks has been collected historically for decades [2]. Numerous research studies have been conducted on the use of traditional statistical methods, like time series or compartmental methods, as well as the inclusion of disparate data sources such as meteorological or demographic data to track flu activity, as discussed in Nsoesie et al. 2014 and Yang and Shaman 2014 [7,8]. In recent years, many more studies have emerged exploring the use of Internet-based data sources that capture aspects of human behavior and environmental factors to track the spread of diseases. With over 3.2 billion web users, data flows from the internet are huge and of all types. Some studies have used data from Google [2,3,9-13], Twitter [14][15][16][17][18] or Wikipedia [19][20][21][22] to monitor flu specifically.
One of the first and most prominent studies on the use of internet data for monitoring influenza epidemics is Google Flu Trends (GFT) [23,24]. This web-based platform, created in 2009 and designed and deployed by Google, used the volume of selected Google search terms to estimate ILI activity in real time. GFT led to multiple prediction errors during the 2009 H1N1 flu pandemic (due to changes in people's search behaviour as a result of the exceptional nature of the pandemic) and later produced large overestimations during the 2012-2013 US flu season (due to the announcement of a pandemic that finally did not appear). These events show the lack of robustness of its algorithm and led to the eventual discontinuation of this disease monitoring platform [25]. Since then, multiple research teams have proposed improved methodologies that are capable of extracting information more efficiently from flu-related Google searches and produce improved flu estimates [2,3,9-13]. Among these methods, the work of Shihao Yang et al. [2] explored a penalized regression methodology, called ARGO, that dynamically combines historical flu activity with Google search activity to better predict flu.
Additional data sources have been explored to monitor flu activity, such as clinicians' searches, electronic health records (EHR) and crowd-sourced flu monitoring apps [26][27][28]. Among these, electronic health records have been shown to track flu accurately and in a timely manner in the US and France [6,29-31]. Specifically, in the United States, Santillana et al. [6] showed that a model leveraging EHR data and a machine learning algorithm was capable of monitoring flu activity at multiple spatial resolutions, including the regional level. In France, Poirier et al. [29] similarly showed that multiple statistical models that incorporate EHR and Internet-search data can yield accurate ILI incidence rates in real time at the national level.
In early 2019, Fred S. Lu et al. [9] extended the ARGO methodology to accurately track flu activity in multiple states of the United States. In their approach, they included Google search data, EHRs and historical flu trends. They also developed a spatial network approach, called Net, to capture the synchronicity observed historically in flu activity between states. Finally, by dynamically combining estimates from ARGO and Net, they showed that an ensemble approach, named ARGONet, led to improved results.

Our contribution
In this study, we propose a forecasting platform that combines multiple data sources and statistical models to track flu activity in France at a spatial resolution that, to our knowledge, has not been explored before. Our forecasting platform produces accurate region-specific real-time and short-term flu activity forecasts for the twelve continental French regions. In our approach, we incorporated data sources such as Google data or Twitter microblogs, electronic health record data, and weather, that were not considered in the US study [9]. In addition, the EHR data used here came directly from a clinical data warehouse rather than from a cloud-based billing and EHR company, which required integrating structured and unstructured clinical data. Additionally, historical synchronicities across regions are captured with a Network model. A machine learning ensemble approach is proposed to improve predictions by dynamically combining estimates from these two distinct approaches. Near real-time estimates as well as one- and two-week ahead forecasts are presented.

Materials and methods
All the data used for this research were fully anonymized. For the EHR data, the IRB ethics committee from the Rennes Academic Hospital approved this research (Approval number 16.69) and the data were fully anonymized before we accessed them. All other data sources are publicly available and appropriately anonymized. The data collected from Google and Twitter complied with the terms and conditions of each website.

Data sources
Sentinelles network data. We obtained weekly ILI incidence rates (per 100,000 inhabitants) for the twelve French regions from the French Sentinelles network (websenti.u707.jussieu.fr/sentiweb). In August 2018, we retrieved these data for the period from 05 January 2004 to 13 March 2017. We considered these data as the gold standard and as the target of our prediction models.
Google data. We obtained the weekly frequency of the 100 internet queries by French users most correlated with ILI activity (keeping those with correlation ≥ 0.60) from Google Correlate (https://www.google.com/trends/correlate). Because our prediction period spans 05 January 2015 to 20 February 2017, we used the ILI signal of each French region from January 2004 to December 2014 to obtain the most highly correlated search terms with Google Correlate. In this way, we obtained different search terms for each individual region. The signals obtained correspond to queries performed by French users at the national level. We retrieved Google Correlate data in August 2018 for the period from 05 January 2004 to 13 March 2017.
Electronic health record data. We retrieved EHR data from the clinical data warehouse (CDW) of Rennes University Hospital (France). This CDW, called eHOP, integrates structured (laboratory test results, prescriptions, ICD-10 diagnoses) and unstructured (discharge letter, pathology reports, operative reports) patients' data. It includes data from 1.2 million inpatients and outpatients and 45 million documents that correspond to 510 million structured elements. eHOP consists of a powerful search engine system that can identify patients with specific criteria by querying unstructured data with keywords, or structured data with querying codes based on terminologies.
The first approach to obtain eHOP data connected with ILI was to perform different manual queries to retrieve patients who had at least one document in their EHR that matched the following search criteria: (1) Queries directly connected with flu or ILI with the keywords "flu" or "ILI"; (2) Queries connected with flu symptoms with the keywords "fever", "pyrexia", "body aches" or "muscular pain"; (3) Queries connected with flu drugs with the keyword "Tamiflu"; (4) Queries with the ICD-10 terminology; (5) Queries connected with flu tests, positive or negative results.
In total, we performed 34 manual queries. For each query, the eHOP search engine returned all documents containing the chosen keywords (often, several documents for one patient and one stay). For query aggregation, we kept the oldest document for one patient and one stay and then calculated, for each week, the number of stays with at least one document mentioning the keyword contained in the query.
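The aggregation rule described above (keep the oldest document per patient and stay, then count stays per week) can be sketched as follows. This is a minimal illustration, not the eHOP implementation: the tuple layout `(patient_id, stay_id, document_date)`, the function names and the choice of Monday as the start of the week are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Monday of the week containing d (assumed week convention)."""
    return d - timedelta(days=d.weekday())


def weekly_stay_counts(documents):
    """Count, per week, the stays with at least one document matching a query.

    documents: iterable of (patient_id, stay_id, document_date) tuples
    (hypothetical layout) returned by one keyword query.
    """
    # Keep only the oldest matching document for each (patient, stay) pair.
    earliest = {}
    for patient_id, stay_id, doc_date in documents:
        key = (patient_id, stay_id)
        if key not in earliest or doc_date < earliest[key]:
            earliest[key] = doc_date

    # Each stay is counted once, in the week of its oldest document.
    counts = defaultdict(int)
    for doc_date in earliest.values():
        counts[week_start(doc_date)] += 1
    return dict(sorted(counts.items()))


docs = [
    ("p1", "s1", date(2015, 1, 7)),
    ("p1", "s1", date(2015, 1, 5)),   # same stay, older document
    ("p2", "s9", date(2015, 1, 6)),
    ("p1", "s2", date(2015, 1, 14)),  # same patient, different stay
]
counts = weekly_stay_counts(docs)
print(counts)
```

Running one such aggregation per query would give one weekly time series per manual query.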
From the CDW eHOP, we also built a database of time series constructed from the structured data, 1,335,347 time series in all. As with Google Correlate, we calculated the Pearson correlation between the ILI signal of each region and each time series in the database. The second approach was thus to retrieve, for each region, the 100 signals most correlated with the ILI signal. Because our test period runs from 05 January 2015 to 20 February 2017, we calculated the correlations over the period from January 2004 to December 2014.
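The correlation-based selection can be illustrated with a short sketch. It ranks candidate series by their Pearson correlation with the ILI signal using only the training weeks, so that the test period does not influence the choice of predictors. The function names, the dictionary layout and the optional `min_corr` threshold are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_correlated(ili, candidates, train_len, k=100, min_corr=None):
    """Names of the k candidate series most correlated with ILI on the training weeks.

    ili: list of weekly incidence rates for one region.
    candidates: dict mapping a series name to a list aligned with ili.
    train_len: number of leading weeks used to compute correlations.
    """
    target = ili[:train_len]
    scored = [(pearson(target, series[:train_len]), name)
              for name, series in candidates.items()]
    if min_corr is not None:
        scored = [item for item in scored if item[0] >= min_corr]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


ili = [1, 2, 3, 4, 5, 0, 0]           # the last two weeks belong to the test period
candidates = {
    "a": [2, 4, 6, 8, 10, 99, 99],    # perfectly correlated on the training weeks
    "b": [5, 4, 3, 2, 1, 0, 0],       # anti-correlated
    "c": [1, 2, 2, 4, 5, 7, 7],       # strongly but not perfectly correlated
}
selected = top_correlated(ili, candidates, train_len=5, k=2)
print(selected)
```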
As a result, for each region, we obtained 134 variables from the CDW eHOP where there are at least 34 variables common to all regions (manual queries). We retrieved retrospective data in August 2018 for the period going from 03 January 2005 to 13 March 2017.
Weather data. We obtained region-specific weather data from the French climatological website Info Climat (https://www.infoclimat.fr). Several studies have shown that humidity is correlated with the spread of influenza [32]. In the absence of humidity data on the Info Climat website, we retrieved precipitation and temperature data. This choice was made knowing that both variables [33,34] can be used as a proxy for humidity, since they are directly related by the Clausius-Clapeyron relation [35]. We obtained daily temperature and precipitation for the largest city of each region, and calculated the weekly mean of each. We retrieved climatic data in August 2018 for the period from 07 January 2008 to 13 March 2017.

Statistical models
The ARGO model. The ARGO model is a regularized regression, dynamically recalibrated every week, that uses the LASSO method [36] to combine multiple external data sources with historical flu information. We performed the LASSO regression with the R package caret and the associated function fit with the method glmnet [37,38]. We optimized the shrinkage parameter lambda via ten-fold cross-validation. To test the stationarity and whiteness of residuals, we used the Dickey-Fuller and Box-Pierce tests available from the R packages tseries and stats [39]. The formulation of our model is:
• Real-time estimates:
$$y_{it} = \sum_{j=1}^{52} \eta_j\, y_{i,t-j} + \sum_{k=1}^{10} \alpha_k\, x_{kit} + \sum_{l=1}^{10} \beta_l\, z_{lit} + \sum_{p=1}^{11} \gamma_p\, v_{pt} + \sum_{m=1}^{2} \delta_m\, w_{mit} + \epsilon_t$$
• One-week ahead forecast: the same right-hand side, with $y_{i,t+1}$ as the target.
• Two-week ahead forecast: the same right-hand side, with $y_{i,t+2}$ as the target.
Here $y_{it}$ is the flu incidence rate at time $t$ for region $i$; $\sum_{j=1}^{52} \eta_j y_{i,t-j}$ corresponds to the historical flu incidence rates for region $i$; $\sum_{k=1}^{10} \alpha_k x_{kit}$ to the 10 most correlated variables from Google data for region $i$; $\sum_{l=1}^{10} \beta_l z_{lit}$ to the 10 most correlated variables from hospital data for region $i$; $\sum_{p=1}^{11} \gamma_p v_{pt}$ to Twitter data; $\sum_{m=1}^{2} \delta_m w_{mit}$ to climatic data for region $i$; and $\epsilon_t$ to the residuals. We applied this model to each region. The model was dynamically recalibrated every week by incorporating all data available, so the size of our training dataset increases every week. We obtained estimates from January 2011 to March 2017.
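To make the weekly recalibration concrete, the sketch below refits a LASSO regression on an expanding training window before each new estimate. It is an illustration under stated assumptions and not the pipeline used in this study: the LASSO is solved with a small hand-written coordinate-descent routine instead of caret and glmnet, the penalty `lam` is fixed rather than chosen by ten-fold cross-validation, and the toy predictors are synthetic. All function and parameter names are hypothetical.

```python
def soft_threshold(value, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent.

    X is a list of rows. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered columns
    resid = [v - y_mean for v in y]                             # residual for b = 0
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes column j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_b
    intercept = y_mean - sum(m * b for m, b in zip(x_mean, beta))
    return intercept, beta


def expanding_window_estimates(X, y, first_week, horizon=0, lam=0.01):
    """Refit every week on all pairs (X[s], y[s + horizon]) whose target precedes week t.

    Returns a dict mapping the target week t + horizon to its estimate.
    """
    estimates = {}
    for t in range(first_week, len(X) - horizon):
        train = range(0, t - horizon)      # the training set grows by one week each step
        b0, beta = lasso_fit([X[s] for s in train],
                             [y[s + horizon] for s in train], lam)
        estimates[t + horizon] = b0 + sum(w * v for w, v in zip(beta, X[t]))
    return estimates


# Synthetic check: the target depends on the first predictor only.
X = [[(7 * t) % 11, (3 * t) % 5] for t in range(30)]
y = [1.0 + 2.0 * row[0] for row in X]
estimates = expanding_window_estimates(X, y, first_week=20)
print(max(abs(estimates[t] - y[t]) for t in estimates))
```

In the setting described above, each row of `X` would hold the 52 autoregressive lags together with the Google, hospital, Twitter and climatic variables for one region and one week.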
The Net model. The Net model is a LASSO model, dynamically recalibrated every week, that uses the relationships between regions to assess how synchronicity could improve forecasts. Indeed, S1 Fig in S1 File (heatmap of pairwise correlations between all regions) shows that the flu incidence rates of the different regions are correlated. For each region, we used the historical data of all regions and the estimates obtained with the ARGO model for all regions except the region to be predicted. The formulation of our model is:
• Real-time estimates:
$$y_{it} = \sum_{l=1}^{2} \sum_{j=1}^{12} \alpha_j\, y_{j,t-l} + \sum_{\substack{j=1 \\ j \neq i}}^{12} \beta_j\, \hat{y}_{jt} + \epsilon_t$$
• One-week ahead forecast: the same form, with $y_{i,t+1}$ as the target.
• Two-week ahead forecast: the same form, with $y_{i,t+2}$ as the target.
Here $y_{it}$ is the flu incidence rate at time $t$ for region $i$; $\sum_{l=1}^{2} \sum_{j=1}^{12} \alpha_j y_{j,t-l}$ corresponds to two weeks of historical flu incidence rates for all regions; $\sum_{j=1, j \neq i}^{12} \beta_j \hat{y}_{jt}$ to the ARGO predictions for all regions except the region $i$ to be predicted; and $\epsilon_t$ to the residuals. We applied this model to each region. We used a two-year training dataset. We obtained estimates from January 2013 to March 2017.
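As an illustration of the inputs described for the Net model, the sketch below assembles one predictor row for a given region and week: two weeks of observed incidence for every region, followed by the ARGO estimates of every other region. The dictionary layout and the function name are assumptions made for the example, which uses three toy regions instead of twelve.

```python
def net_features(history, argo_estimates, region, t, n_lags=2):
    """Build the Net predictor row for `region` at week t.

    history[r][t]: observed incidence of region r at week t.
    argo_estimates[r][t]: ARGO estimate for region r at week t.
    """
    regions = sorted(history)
    row = []
    # Lagged observed incidence of all regions, the target region included.
    for lag in range(1, n_lags + 1):
        row.extend(history[r][t - lag] for r in regions)
    # ARGO estimates for all regions except the one being predicted.
    row.extend(argo_estimates[r][t] for r in regions if r != region)
    return row


history = {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": [100, 200, 300, 400]}
argo = {"A": [0, 0, 0, 4.5], "B": [0, 0, 0, 41], "C": [0, 0, 0, 390]}
row = net_features(history, argo, region="A", t=3)
print(row)
```

With twelve regions, each row would contain 24 lagged incidence values and 11 ARGO estimates, which would then be passed to a LASSO fit on a two-year window.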
The ARGONet model. The ARGONet model is an ensemble approach combining the predictive power of the ARGO and Net models. To combine the results of both models, we tested three methods:
• First, for a given week, we choose ARGO's estimate if it leads to the lowest mean prediction error in the previous K weeks (compared to the Net model's estimate). Otherwise, we choose Net's estimate. The values of K were inspired by the study of Lu et al. [9] and verified using cross-validation during the training time period.
• A second method consists of calculating the mean value of the estimates produced by the ARGO and Net models for a given week.
• In the final method, for a given week, ARGONet's estimate is built as a linear combination of estimates produced by ARGO and Net. The coefficients are dynamically calculated each week to best predict new ground truth data available each week.
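The three combination rules can be sketched as follows. This is a hedged illustration: the error used for the selection rule is taken to be the mean absolute error, and the linear combination is fitted by ordinary least squares without intercept on the previous `k` weeks. Neither choice is specified in the text above, and all function names are hypothetical.

```python
def mean_abs_error(pred, truth, weeks):
    """Mean absolute error of pred over the given week indices."""
    return sum(abs(pred[w] - truth[w]) for w in weeks) / len(weeks)


def argonet_select(argo, net, truth, t, k):
    """Method 1: keep the estimate of the model with the lower error over the last k weeks."""
    weeks = range(t - k, t)
    if mean_abs_error(argo, truth, weeks) <= mean_abs_error(net, truth, weeks):
        return argo[t]
    return net[t]


def argonet_mean(argo, net, t):
    """Method 2: average of the two estimates."""
    return (argo[t] + net[t]) / 2


def argonet_linear(argo, net, truth, t, k):
    """Method 3: weights fitted by least squares on the last k weeks, applied at week t."""
    weeks = range(t - k, t)
    saa = sum(argo[w] ** 2 for w in weeks)
    snn = sum(net[w] ** 2 for w in weeks)
    san = sum(argo[w] * net[w] for w in weeks)
    say = sum(argo[w] * truth[w] for w in weeks)
    sny = sum(net[w] * truth[w] for w in weeks)
    det = saa * snn - san ** 2
    if abs(det) < 1e-12:               # collinear estimates: fall back to the mean
        return argonet_mean(argo, net, t)
    w_argo = (say * snn - sny * san) / det
    w_net = (sny * saa - say * san) / det
    return w_argo * argo[t] + w_net * net[t]


truth = [10, 12, 15, 20, 18, 0]        # the value for week 5 is not yet observed
argo = [11, 13, 14, 19, 19, 17]
net = [8, 15, 18, 25, 14, 21]
print(argonet_select(argo, net, truth, t=5, k=4))
print(argonet_mean(argo, net, t=5))
print(argonet_linear(argo, net, truth, t=5, k=4))
```

Only weeks before `t` enter the error and weight calculations, so each combined estimate remains out-of-sample.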
The autoregressive model. To assess the importance of external data sources, we built an autoregressive model of order 52 (AR(52)). We used the LASSO regression with the previous 52 weeks of ILI incidence rates to predict the current week and the two weeks after.
• Real-time estimates:
$$y_{it} = \sum_{j=1}^{52} \alpha_j\, y_{i,t-j} + \epsilon_t$$
• One-week ahead forecast: the same right-hand side, with $y_{i,t+1}$ as the target.
• Two-week ahead forecast: the same right-hand side, with $y_{i,t+2}$ as the target.
Here $y_{it}$ is the flu incidence rate at time $t$ for region $i$, $\sum_{j=1}^{52} \alpha_j y_{i,t-j}$ corresponds to the previous 52 weeks, and $\epsilon_t$ to the residuals. We applied this model to each region. We used a six-year training dataset. The model was dynamically recalibrated every week.
The baseline model. Finally, we included a baseline model that simply predicts that the number of new flu cases in a week will be exactly the number of cases observed in the past week.
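The two reference models rely only on past incidence. The sketch below shows how lagged predictor rows for an autoregressive model could be built, together with the persistence rule of the baseline. It uses three lags for brevity where the AR(52) model uses 52. The function names and the `horizon` argument are assumptions made for the example.

```python
def lagged_rows(series, n_lags, horizon=0):
    """Return (X, y) where each row of X holds series[t-1], ..., series[t-n_lags]
    and the matching target is series[t + horizon]."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon):
        X.append([series[t - j] for j in range(1, n_lags + 1)])
        y.append(series[t + horizon])
    return X, y


def persistence_estimate(series, t):
    """Baseline: the estimate for week t is the value observed in the previous week."""
    return series[t - 1]


incidence = [5, 8, 13, 21, 34, 55]
X, y = lagged_rows(incidence, n_lags=3)
print(X, y)
print(persistence_estimate(incidence, t=4))
```

The rows returned by `lagged_rows` could then be passed to any LASSO solver, with `horizon` set to 1 or 2 for the forecasts.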

Evaluation
Our test period consists of 115 weeks, from January 2015 to March 2017.
Metrics. To assess the performance of the models, we compared estimates to the official incidence rates from the Sentinelles network by calculating two metrics: the root mean squared error (RMSE) and the Pearson correlation coefficient (PCC).
• $\mathrm{RMSE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
• $\mathrm{PCC} = \dfrac{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
where $\hat{y}_i$ is the predicted value for week $i$, $\bar{\hat{y}}$ is the mean of the predicted values, $y_i$ is the real value for week $i$, and $\bar{y}$ is the mean of the real values. We also estimated the relative efficiency of the ARGONet model compared to the autoregressive model AR(52), with a 95% confidence interval (CI), by using a bootstrap method. A relative efficiency, calculated as $\mathrm{RMSE}_{\mathrm{AR52}} / \mathrm{RMSE}_{\mathrm{ARGONet}}$, higher than one suggests increased predictive power of ARGONet compared to the autoregressive model AR(52). The CI and relative efficiency were computed from 100 bootstrap samples of length 52. The 52 weeks were randomly selected from the estimates from January 2015 to February 2017.
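The metrics and the bootstrap procedure can be sketched in a few lines. This is an illustration under assumptions: the 52 weeks of each bootstrap sample are drawn with replacement and the interval is a simple percentile interval, two details the text does not specify. The function names, the seed and the toy series are hypothetical.

```python
import math
import random


def rmse(pred, truth):
    """Root mean squared error."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(pred, truth)) / len(truth))


def pcc(pred, truth):
    """Pearson correlation coefficient between predictions and observations."""
    n = len(truth)
    mean_p, mean_y = sum(pred) / n, sum(truth) / n
    sxy = sum((p - mean_p) * (y - mean_y) for p, y in zip(pred, truth))
    sxx = sum((p - mean_p) ** 2 for p in pred)
    syy = sum((y - mean_y) ** 2 for y in truth)
    return sxy / math.sqrt(sxx * syy)


def relative_efficiency_ci(ar, argonet, truth, n_boot=100, sample_len=52, seed=0):
    """Return (point estimate, lower bound, upper bound) of RMSE_AR52 / RMSE_ARGONet."""
    rng = random.Random(seed)
    n = len(truth)
    ratios = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(sample_len)]   # weeks drawn with replacement
        ratios.append(rmse([ar[i] for i in idx], [truth[i] for i in idx])
                      / rmse([argonet[i] for i in idx], [truth[i] for i in idx]))
    ratios.sort()
    lower = ratios[int(0.025 * n_boot)]
    upper = ratios[min(n_boot - 1, int(0.975 * n_boot))]
    return rmse(ar, truth) / rmse(argonet, truth), lower, upper


# Toy series over 115 weeks: ARGONet is off by 0.5 and AR(52) by 2 every week.
truth = [float(10 + (w % 7)) for w in range(115)]
argonet = [v + (0.5 if w % 2 else -0.5) for w, v in enumerate(truth)]
ar = [v + (2.0 if w % 2 else -2.0) for w, v in enumerate(truth)]
point, lower, upper = relative_efficiency_ci(ar, argonet, truth)
print(point, lower, upper)
```

An interval lying entirely above one would indicate that the error of the autoregressive model exceeds that of ARGONet across the resampled weeks.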
Comparisons. First, we assessed the importance of adding external data sources by comparing:
• RMSE and PCC of the AR(52) model and the ARGO model including historical data plus the ten most correlated variables from hospital data and Google data. The individual contribution of hospital data and Google data has already been shown in a previous study [29]. Nevertheless, we added two comparisons in the appendices: one with the ten most correlated variables from hospital data and one with the ten most correlated variables from Google data.
• RMSE and PCC of the AR(52) model and the ARGO model including historical data plus climatic data.
• RMSE and PCC of the AR(52) model and the ARGO model including historical data plus Twitter data.
Second, we compared the baseline model, AR(52) model, ARGO model (including all the data sources), Net model and ARGONet model.

Evaluation of data sources as predictors
In order to assess the predictive value of each external data source, we compared ARGO models that incrementally included external data sources with an autoregressive model, AR(52), that only uses historical information as input. As shown in the next sections, we found that all external data sources improve flu estimates, especially EHR data and Google data.
EHR data and Google data. Our first modeling experiment involved comparing ARGO models that use Google search and EHR data simultaneously with the AR(52) in all French regions. A detailed analysis on the individual contribution of Google data and EHR data into predictions, separately, is provided for completeness in the supplementary materials. Our findings suggest that each of these data sources individually improves predictions in all time-horizons. This is consistent with the findings of a previous study conducted at the national-level and the French region of Brittany [29], where both Google and EHR information were found meaningful, but EHR data was shown to possess a stronger predictive power.
The joint contribution of EHR and Google data to predictions is presented below. In real time (Table 1), estimates produced using EHR data and Google data improve accuracy for all regions in terms of correlation and for nine regions in terms of error metrics. The combination of both sources leads to correlation improvements of up to 5% for the region Bretagne and decreases in error of up to 20% for the region Provence-Alpes-Côte d'Azur.
For one-week ahead estimates (Table 2), estimates obtained with EHR and Google data are more accurate or comparable for eleven of the twelve regions in terms of correlation and nine of the twelve regions in terms of error metrics. The combination of both sources leads to correlation improvements of up to 15% for the region Bourgogne and decreases in error of up to 25% for the region Provence-Alpes-Côte d'Azur.
For two-week ahead predictions (Table 3), estimates obtained with EHR and Google data are more accurate for all regions in terms of correlation and for eleven of the twelve regions in terms of error metrics. The combination of both sources leads to correlation improvements of up to 30% for the region Centre and decreases in error of up to 25% for the region Provence-Alpes-Côte d'Azur.
Climatic data. Combining climatic data with historical activity via ARGO was shown to consistently improve prediction results across all regions (Table 4). However, this improvement is lower than the one observed with EHR and Google data. Indeed, climatic data lead to correlation improvements of 2% for the region Pays de la Loire and decreases in error of 5% for the region Hauts-de-France.
For one-week ahead estimates (Table 5), results obtained with climatic data are better or comparable for all regions in terms of correlation and error. Climatic data lead to correlation improvements of up to 5% for the region Bourgogne-Franche-Comté and decreases in error of up to 7% for the region Hauts-de-France.
For two-week ahead estimate (Table 6), results obtained with Climatic data are better for all the regions. Climatic data lead to correlation improvements of up to 25% and decreases in error of up to 11% for the region Bourgogne-Franche-Comté.
Twitter data. Overall, we found that national-level flu-related Twitter data improve prediction results for all regions. In real time (Table 7), results obtained with Twitter data and with AR(52) are comparable. Twitter data lead to correlation improvements of 2% for the regions Occitanie and Pays de la Loire and decreases in error of 5% for the region Centre.
For one-week ahead estimate (Table 8), estimates obtained with Twitter data are more accurate for all the regions in terms of correlation and for ten regions in terms of error metrics. Twitter data lead to correlation improvements of 10% for the region Pays de la Loire and decreases in error of 7% for the region Bretagne.
For two-week ahead estimate (Table 9), results obtained with Twitter data are more accurate for all the regions in terms of correlation and for nine regions in terms of error metrics. Twitter data lead to correlation improvements of 20% and decreases in error of 6% for the region Bourgogne-Franche-Comté.

Evaluation of statistical models
Here, we compare the predictive performance of five different modeling approaches (the baseline model, AR(52), ARGO, Net, and ARGONet) for three time horizons: real-time, one-week ahead and two-week ahead.

[…] Table 10. It confirms, region by region, that the best PCC and RMSE are mostly obtained with ARGONet. Nevertheless, for real-time estimates, ARGO shows good performance. For seven regions, ARGO is the model with the second lowest […]

To assess the statistical significance of the improved predictive power of ARGONet, we constructed a 95% confidence interval for the relative efficiency of ARGONet compared to the autoregressive model AR(52) (the error of ARGONet is in the denominator). Table 11 shows that in real time, the improvement obtained with the ARGONet model compared to the […]

One-week ahead estimate. […] ARGONet model, which implies a reduction of the error from 22% to 67% compared to the baseline. In comparison to the best results, AR(52) and the baseline are the models giving the highest errors and lowest correlations. In contrast to real-time estimates, the ARGO and Net models are comparable. Indeed, for seven regions, the Net model is the model with the second lowest […]

Table 11. Real-time estimate: a relative efficiency higher than one suggests increased predictive power of ARGONet compared to the autoregressive model AR(52).

Table 13 shows that the improvement obtained with the ARGONet model compared to the autoregressive model AR(52) is statistically significant for all regions for one-week ahead estimates. Depending on the region, ARGONet reduces the error by 18% to 59%. Fig 11 shows the one-week ahead estimates obtained for the French region Nouvelle-Aquitaine. On this plot, we can see that AR(52) still has a lag of one or two weeks, in contrast to the ARGONet model. The heatmap in Fig 12 shows that the ARGO model mostly uses seven variables: three variables from Google data, two variables from hospital data, one variable from climatic data and one variable from historical data.
Two-week ahead estimate. Table 14 shows results for two-week ahead estimates for the time period January 2015-March 2017. Over this time period, the 90% CI of the best correlation is [0.825;0.935] with a median value of 0.885. The 90% CI of the best relative error is [59.28;105.77] with a median value of 79.53. As for real-time and one-week ahead estimates, ARGONet is the model giving the best results in terms of correlation and error, whereas AR(52) and the baseline model give the least accurate results. ARGONet reduces the error by 37% to 67% compared to the baseline. For most of the French regions, the ARGONet method giving the highest correlation and lowest error is the one using the mean of the estimates obtained from the ARGO and Net models. Fig 13 shows that ARGONet is the best model for all regions in terms of correlation and error. These results are confirmed by the distribution of correlation and error of each model, obtained by calculating the PCC and RMSE for each flu season and each region (Figs 5 and 6).

Table 15 shows that the improvement obtained with the ARGONet model compared to the AR(52) model is statistically significant for all regions for two-week ahead estimates. Depending on the region, ARGONet reduces the error by 27% to 57%. Fig 14 shows two-week ahead estimates for the region Nouvelle-Aquitaine. As for one-week ahead estimates, we can see that the estimates obtained with AR(52) are still delayed. It is not the […]

Discussion
We have introduced a machine learning ensemble methodology that combines multiple data sources and multiple statistical approaches to accurately track flu activity in the twelve continental regions of France. To the best of our knowledge, this is a spatial resolution for which no forecasting approaches have been explored before in France. Our methodology provides real-time estimates as well as one- and two-week ahead forecasts. The success of our approach comes from the ability to dynamically identify the appropriate method and data sources to produce the best disease activity estimates for a given location and time horizon in a prospective way (out-of-sample). Specifically, we show that the ARGO model alone (one that does not incorporate flu activity from neighboring regions) yields accurate results for real-time estimates but fails to produce optimal predictions for longer-term time horizons. We find that the Net model (one that leverages information from neighboring regions alone) leads to reasonable flu predictions but tends to overestimate epidemic peaks. The proposed ensemble approach, named ARGONet (which combines information from both the ARGO and Net models), an extension of a model proposed in the USA [9], produces forecasts with the lowest errors and highest correlation, as captured by Fig 1. In particular, the most reliable longer-term forecasts are obtained with the ARGONet method that uses the mean of the estimates from the ARGO and Net models. This machine-learning ensemble approach displays both accuracy and robustness in estimating ILI activity up to two weeks ahead of time at the French regional level.

Table 13. One-week ahead estimate: a relative efficiency higher than one suggests increased predictive power of ARGONet compared to the autoregressive model AR(52).

Our methodological approach was inspired by Lu et al. [9], who used the ARGO, Net and ARGONet methods to track flu activity at the state level in the United States. However, Rangarajan et al. [40] have shown that potential improvements can be achieved in data-driven forecasting methods by exploiting sparsity structures in the predictors. Future studies could explore the efficacy of these techniques for flu prediction in France. Prediction error reductions are observed when using ARGONet over its autoregressive counterpart (AR(52)) (up to 32% across regions) in real-time predictions. As the time horizon of prediction increases, the improvements in predictions are more evident, leading to up to 60% error reductions when comparing ARGONet with AR(52), and up to 70% error reduction when comparing ARGONet with the baseline (Tables 12 and 14). S2 through S13 Figs in S1 File show these results graphically. As expected, autoregressive approaches show "within-range" prediction values that consistently lag behind the observed disease activity and lead to under-predictions close to peak activity.
We find that all external data sources contribute to improving local flu estimates when compared to the autoregressive model (AR(52)), especially for longer-term forecasts. Indeed, for the two-week ahead estimates, the combination of EHR data and Google data leads to correlation improvements of up to 30% and decreases in error of up to 25%. For climatic data, this improvement reaches 20% for correlation and 11% for the error. For Twitter data, it reaches 20% for correlation and 7% for the error.

Table 15. Two-week ahead estimate: a relative efficiency higher than one suggests increased predictive power of ARGONet compared to the AR(52) model.

By analyzing the heatmaps (S5, S9 and S13 Figs in S1 File) obtained for ARGO models, we can see that the contribution of different predictors (data sources) changes over time and with the time horizon of prediction, but all data sources appear to possess predictive power. Indeed, the most important data sources are EHR data and Google data, both in real time and for longer-term forecasts. Historical data is consistently used in real time, but less used for longer-term forecasting. Conversely, climatic data and Twitter data are used more prominently for longer-term forecasts than for real-time estimates. The fact that we could only access EHR data from Rennes University Hospital, and thus from the Brittany region, prevented us from quantifying the added value of region-specific EHR information on flu predictions in their respective regions. This should be evaluated in future research efforts. On the other hand, we find it interesting that data from a hospital in Rennes can improve flu forecasting in other regions. Indeed, S4-S6 Tables in S1 File show that forecasts that include EHR information from Rennes, up to two weeks ahead, are more accurate for all regions when compared to the autoregressive model (AR(52)). EHR data appears to be more relevant for some regions than others. For example, it appears to be an important predictor in the Brittany region (which contains Rennes), as expected, as well as in Normandy, which shares a border with Brittany. For Occitanie, EHR data from Rennes improves predictions, which is in alignment with the fact that historical information shows that flu activity tends to occur synchronously (with a correlation of 0.93), as seen in S1 Fig in S1 File. We hypothesize that having access to region-specific EHR data from all the French regions will lead to prediction improvements across the board.
Twitter data was collected at the national level given the sparsity of relevant flu-related Tweets at the regional level. This was the case because we only had access to the publicly available data shared by Twitter's API, which only allows users to view up to 5% of all geo-coded Tweets (themselves a small fraction, about 5%, of the total corpus of all Tweets). We also suspect that gaining access to higher volumes of Tweets at the regional level could improve our forecasts.
For climatic data, we only had access to weekly local temperature and precipitation. Future studies may explore incorporating other climatic indicators known to be more directly related to the transmission of the virus, such as humidity [32].
For historical data, the variables with the highest predictive power include lags of 52 weeks (one year). However, some other long-term lags also show up as important in predictions (such as lags 42 and 43). Given the short time period of our study, we suspect that the flu seasons that we studied may have had specific trends (an early season combined with a late season) that could cause our methods to identify a meaningful influence of lags that are shorter than the intuitive 52-week lag.
Data retrieved from Google Correlate is normalized by Google in a (frequently) distinct sample and over different time periods depending on the data request. This pre-normalization can affect our results, but as shown in [2] the process of dynamic training minimizes the impact of this instability.
Our methods were designed to produce point estimates, and decision-makers who are potential end-users of the output of our models would benefit from a quantification of the confidence we have in our predictions. For such purposes, by conducting a historical analysis of the errors between out-of-sample predictions and subsequent observations, we find that for a forward-looking prediction, the bracket $(\hat{y}_t - \mathrm{RMSE},\ \hat{y}_t + \mathrm{RMSE})$ can be thought of as a 95% confidence interval for each region at every point in time (see S80-S82 Figs in S1 File). This is consistent with previous work by Yang et al. [2], where the collection of observed errors between (out-of-sample) predictions and subsequent observations was fitted using a Gaussian distribution (over a moving window of about 2 years, 104 observations) and the RMSE was found to be comparable to the standard deviation of that distribution (see S83-S85 Figs in S1 File). This is an empirical result and suggests that in 100 out-of-sample observations, 95 will fall within the suggested bracket around the point prediction. As shown in S80-S82 Figs in S1 File, most observations fall within the proposed confidence intervals prospectively, confirming the validity of our approach.
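The empirical check described above can be sketched as follows: for each week, the RMSE of the errors over a preceding window defines a bracket around the point prediction, and the share of subsequent observations falling inside the bracket is counted. The function name, the `width` multiplier and the toy numbers are assumptions made for the example, which uses a window of 4 weeks instead of 104.

```python
import math


def rmse_band_coverage(preds, truth, window=104, width=1.0):
    """Fraction of observations inside pred[t] +/- width * RMSE, where the RMSE is
    computed from the out-of-sample errors of the `window` weeks preceding t."""
    inside = total = 0
    for t in range(window, len(truth)):
        past_sq = [(preds[s] - truth[s]) ** 2 for s in range(t - window, t)]
        half_width = width * math.sqrt(sum(past_sq) / window)
        total += 1
        if abs(truth[t] - preds[t]) <= half_width:
            inside += 1
    return inside / total


preds = [10.0] * 8
truth = [11.0, 9.0, 11.0, 9.0, 10.5, 12.0, 10.0, 9.2]
coverage = rmse_band_coverage(preds, truth, window=4)
print(coverage)
```

Because only past errors enter each bracket, the resulting coverage is a prospective measure that can be compared across regions and time horizons, and `width` can be varied to examine wider or narrower brackets.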
To conclude, we have shown that Internet-based data sources can yield accurate influenza estimates in the twelve continental regions in France. Operational implementations of these methods may prove to be useful for public health officials in the face of public health threats. Our regional-level flu estimates may contribute to better management of patients' flow in general practitioners' offices and in hospitals, particularly emergency departments.