
Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance

  • Mauricio Santillana,

    msantill@fas.harvard.edu

    Affiliations: Harvard School of Engineering and Applied Sciences, Cambridge, Massachusetts, United States of America, Boston Children’s Hospital Informatics Program, Boston, Massachusetts, United States of America, Harvard Medical School, Boston, Massachusetts, United States of America

  • André T. Nguyen,

    Affiliation: Harvard School of Engineering and Applied Sciences, Cambridge, Massachusetts, United States of America

  • Mark Dredze,

    Affiliation: Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America

  • Michael J. Paul,

    Affiliation: Department of Information Science, University of Colorado, Boulder, Colorado, United States of America

  • Elaine O. Nsoesie,

    Affiliations: Department of Global Health, University of Washington, Seattle, Washington, United States of America, Institute for Health Metrics and Evaluation, Seattle, Washington, United States of America

  • John S. Brownstein

    Affiliations: Boston Children’s Hospital Informatics Program, Boston, Massachusetts, United States of America, Harvard Medical School, Boston, Massachusetts, United States of America


Abstract

We present a machine learning-based methodology capable of providing real-time (“nowcast”) and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illnesses (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC’s ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013–2014 (retrospective) and 2014–2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method’s predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT’s real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons.

Author Summary

The aggregated activity patterns of Internet users have enabled the detection and tracking of multiple population-wide events such as disease outbreaks, financial markets performance, and preferences in online movie selections. As a consequence, a collection of mathematical models aiming at monitoring and predicting these events in real-time have been proposed in the past decade. As we discover new methods and data sources suitable to track these events, it is not clear whether more information will lead to improved predictions. In the context of digital disease detection at the population level, we show that it is advantageous to combine the information from multiple flu activity predictors in the US instead of simply choosing the best performing flu predictor. Our findings suggest that the information from multiple data sources such as Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system, complement one another and produce the most accurate and robust set of flu predictions when combined optimally.

Introduction

Predicting the dynamics of seasonal and non-seasonal influenza outbreaks remains a great challenge [1]. These outbreaks cause up to 500,000 deaths a year worldwide and an estimated 3,000 to 50,000 deaths a year in the United States of America (US) [2]. Frequently, their severity cannot be assessed in a timely manner, and thus, systems capable of providing estimates of influenza incidence are critical to allow health officials to properly prepare for and respond to influenza-like illness (ILI) outbreaks. The US Centers for Disease Control and Prevention (CDC) continuously monitors the level of ILI circulation in the US population by gathering information from physicians’ reports that record the percentage of patients seen in clinics who exhibit ILI symptoms. While CDC ILI data provides public health officials with an important proxy of influenza activity in the population, its availability has a known lag time of at least 7 to 14 days. This means that by the time the data is available, the information is already 1 or 2 weeks old.

Many attempts have been made to estimate the ILI activity in the US ahead of the release of CDC reports, some using a combination of statistical and mechanistic SIR models [3,4,5] and others using non-traditional Internet-based information systems such as: Google [6,7], Yahoo [8], and Baidu [9] Internet searches, Twitter posts [10,11,12], Wikipedia article views [13,14], Flu Near You [15,16], and clinicians’ databases (such as UpToDate) queries [17]. We will focus on non-traditional Internet-based approaches here. Google Flu Trends (GFT) [6], a widely accepted digital disease detection system that uses the Google search volume of specific terms to predict ILI in the US and other countries, continuously provides real-time estimates of ILI. Even though GFT was initially hailed as a success, its inaccuracies in multiple time periods of high ILI have led to doubts about the utility of these data [18]. While Google and external researchers have worked to update and reevaluate the methodology behind GFT [19, 20, 21, 22, 23, 24], alternative and independent methods to estimate ILI in real-time are still needed.

We propose a methodology based on machine learning algorithms capable of providing real-time (“nowcast”) and forecast estimates of ILI by leveraging data from multiple sources including: Google searches, nearly real-time hospital visit records provided by athenahealth, Twitter posts, and data from Flu Near You, a participatory surveillance system. While models using these data sources to predict ILI may capture different flu incidence signals in the population, we show that they complement one another when we combine them to predict CDC’s ILI. Our main contribution consists of optimally combining multiple ILI estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC’s ILI reports, effectively producing forecasts three weeks into the future. We evaluate the predictive ability of our ensemble approach during the 2013–2014 and 2014–2015 flu seasons for each of the four weekly time horizons.

Data

We collected CDC-reported ILI, considered the ground truth for this study, from the ILINet website (http://gis.cdc.gov/grasp/fluview/fluportaldashboard.html). We used five independent data sets to develop our ILI weak predictors: (a) near real-time hospital visit records from athenahealth, a medical practices management company; (b) Google Trends, a Google service that provides approximate search volumes for specific queries (www.google.com/trends); (c) influenza-related Twitter microblogging posts; (d) FluNearYou, a participatory surveillance system to self-report ILI; and (e) Google Flu Trends. All datasets were accessed and downloaded on March 16, 2015.

CDC data.

The CDC compiles data on the weekly number of people seeking medical attention with ILI symptoms in the United States. CDC’s ILI data is freely distributed and available through ILINet, via the online FluView tool, which posts both new and historical data (http://gis.cdc.gov/grasp/fluview/fluportaldashboard.html). Typically, new CDC reports provide a first estimate of %ILI and, as more reports are received, revised CDC reports are released and become the official %ILI. We used the revised CDC reports for weeks 1/10/04 to 02/21/15 as our gold standard for validation purposes. For training our models, we used the (then available) unrevised CDC reports. Weekly tables released on week W of the X-Y season are available at the following URLs: http://www.cdc.gov/flu/weekly/weeklyarchivesX-Y/data/senAllregtW.htm

See [11] for details on obtaining and using historical CDC data.

Additionally, the CDC reports the number and percentage of laboratory tests that are positive for influenza types A and B, using data reported by WHO and NREVSS collaborating laboratories across the United States. This virology data is not part of the ensemble but is used for comparison. Similar to the ILI data, the virology data is subject to weekly revisions, which can be obtained through weekly tables available at: http://www.cdc.gov/flu/weekly/weeklyarchivesX-Y/data/whoAllregtW.html

athenahealth data.

We obtained weekly nationally aggregated data reporting the total number of people seeking medical attention with ILI symptoms in medical practices managed by athenahealth, from Jul 2009 to Feb 2015. athenahealth data is typically available at least one week ahead of CDC ILI reports. By dynamically finding the best linear model to historically map athenahealth’s ILI onto CDC’s ILI, we were able to produce (out-of-sample) ILI estimates using athenahealth’s data as a predictor, one week ahead of CDC reports during our study period. We refer to this data as ATH in the plots and tables. We used ATH data for weeks 6/28/09 to 02/21/15.
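The idea of dynamically refitting a linear map from a proxy signal onto CDC’s ILI can be illustrated with a minimal sketch. It assumes a simple one-variable least squares fit that is refit every week on all earlier weeks only, so that each estimate is out-of-sample. The function names (`fit_line`, `dynamic_linear_estimates`), the `min_train` parameter, and the toy numbers are hypothetical, and the sketch ignores the reporting lag of the target; it is not necessarily the procedure used in this study.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def dynamic_linear_estimates(proxy, target, min_train=2):
    """Out-of-sample estimates of `target` from `proxy`.

    For each week t >= min_train, the line is refit on weeks 0..t-1 only
    and then applied to the proxy value observed in week t.
    """
    estimates = []
    for t in range(min_train, len(proxy)):
        a, b = fit_line(proxy[:t], target[:t])
        estimates.append(a + b * proxy[t])
    return estimates


# Hypothetical toy series (not real athenahealth or CDC values).
ath = [1.0, 2.0, 3.0, 4.0]
cdc = [2.0, 4.0, 6.0, 8.0]
estimates = dynamic_linear_estimates(ath, cdc)
```

Because the fit never sees the week it is asked to estimate, the resulting series can be scored against the revised target without leaking information from the future.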

GT data.

Following the methodology proposed in [17] and [22], we used data from Google Trends (GT) as a proxy of the volume of query searches for 100 search terms and then utilized a dynamic multivariate approach to predict flu activity for the time period Jul 2013—Feb 2015. The logit transform utilized in [22] was not used to produce our out-of-sample predictions since the identity transformation [17] showed better performance. We used GT data for weeks 1/10/04 to 02/21/15.

Twitter data.

We used the Twitter (TWT) influenza classification system introduced by [25,26], which identifies Twitter messages that express an influenza infection. The logistic regression classifiers were trained on approximately 12,000 tweets annotated for relevance, distinguishing tweets that indicated an infection rather than discussing influenza in other contexts. The normalized weekly volumes of influenza tweets are available from HealthTweets.org [27]. ILI predictions are then created by including the influenza tweet volumes in a linear autoregression exogenous (ARX) model, as described in [6], using the previous three weeks of CDC-reported ILI. The Twitter data spans 11/27/11–2/15/15 and the CDC data (starting three weeks before Twitter) spans 11/06/11–2/08/15. The ARX model is trained using data from the 2011–2012 and the 2012–2013 flu seasons.

FNY data.

FluNearYou (FNY) [15,16] compiles weekly data of ILI activity in the United States. They achieve this by conducting weekly, year-round, Internet-based surveys of voluntary participants who indicate whether they are healthy or have any of the following symptoms: fever, cough, sore throat, shortness of breath, chills/night sweats, fatigue, nausea/vomiting, diarrhea, body aches, headaches. FNY also collects data on the participant’s location, vaccination status, gender, and age. We produced FNY ILI national estimates following the methodology introduced in Smolinski et al 2015 [16]. We used FNY data for weeks 10/24/11 to 02/21/15.

GFT data.

Google Flu Trends’ weekly ILI national estimates are freely available through the Google Flu Trends website (www.google.org/flutrends). GFT data is the result of Google’s proprietary algorithm that combines the volume of specific Google search queries to estimate the level of ILI activity in a given region [6, 28, 29]. We used GFT data for weeks 11/10/12 to 02/21/15, obtained from the http://www.google.org/flutrends/ website. This historical dataset was produced with the corresponding GFT engine active at the time the data was originally posted [29] (https://www.google.org/flutrends/about/how.html).

Methods

We chose three different machine learning algorithms: Stacked linear regression, Support Vector Machine regression, and AdaBoost with Decision Trees Regression, in order to optimally combine the five ILI estimates, produced independently with the five available data sources. We chose this set of machine learning algorithms since each one of them is known to have distinct strengths in combining information [30]. While the linearity assumption may be restrictive, we chose Stacked Linear Regression for simplicity. We chose Support Vector Machines (SVM) with radial basis function kernels because they map the input space to an infinite dimensional, nonlinear feature space, thus allowing more freedom on the functional relationship between the target and independent variables. Both Stacked Linear Regression and Support Vector Machines are global methods that apply the same rules to all of the data. We chose AdaBoost with Decision Trees because it has the power to learn local rules.

In the following paragraphs we describe the main features of each methodology.

Stacked linear regression

Stacked linear regression is a machine learning methodology commonly used in finance to combine weak predictors of stock prices [30, 31]. The goal of this methodology is not to identify which (so called) “weak predictor”, vk(t), is the best one to predict the quantity y(t) (in our case flu activity), but to linearly combine the information contained in all the “weak predictors” to obtain a more accurate and robust single predictor of a quantity y(t). A multivariate approach is used to determine the best linear combination of weak predictors capable of producing the best prediction of the quantity y(t) over a training period. Since the weak predictors are, by construction, highly correlated (indeed, each individual predictor was designed to minimize the square error between the predictions and flu activity), a way to discard redundant information is needed. Regularized approaches that penalize the size of the multiplying coefficients, αk, in the multivariate regression, such as Ridge or LASSO regularizations (L2 and L1, respectively), are good candidates to handle this. We chose LASSO regularization for our ensemble approach since we are interested in identifying models with the smallest number of independent variables (vk(t)). Additionally, a non-negative constraint for each multiplicative coefficient αk is imposed. This linear combination is then used to predict the value of y(t) for values of t outside of the training period.
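As an illustration of the stacking step, the following sketch fits non-negative, L1-penalized combination weights αk by coordinate descent. Several things here are assumptions made for the example rather than details taken from the study: the solver (plain coordinate descent), the absence of an intercept, the penalty value `lam`, the function names, and the toy data.

```python
def fit_nonneg_lasso(V, y, lam, n_iter=200):
    """Coordinate descent for the problem

        minimize  (1 / 2n) * sum_i (y_i - sum_k alpha_k * V[i][k])**2
                  + lam * sum_k alpha_k
        subject to alpha_k >= 0.

    V is a list of rows: one row per week, one column per weak predictor.
    No intercept is fitted (a simplification for this sketch).
    """
    n, p = len(V), len(V[0])
    alpha = [0.0] * p
    # Mean squared value of each column (curvature along each coordinate).
    z = [sum(V[i][k] ** 2 for i in range(n)) / n for k in range(p)]
    for _ in range(n_iter):
        for k in range(p):
            if z[k] == 0.0:
                continue
            # Correlation of column k with the residual that ignores column k.
            rho = 0.0
            for i in range(n):
                partial = sum(alpha[j] * V[i][j] for j in range(p) if j != k)
                rho += V[i][k] * (y[i] - partial)
            rho /= n
            # Shrink by the penalty, then clip at zero for non-negativity.
            alpha[k] = max(0.0, (rho - lam) / z[k])
    return alpha


def combine(alpha, row):
    """Ensemble prediction for one week from that week's weak predictions."""
    return sum(a * v for a, v in zip(alpha, row))


# Hypothetical toy data: predictor 0 tracks y, predictor 1 moves against it.
V = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
y = [1.0, 2.0, 3.0, 4.0]
alpha = fit_nonneg_lasso(V, y, lam=0.01)
```

In this toy case the penalty and the non-negativity constraint together drive the weight of the uninformative predictor to exactly zero, which is the sparsity behavior that motivates choosing LASSO over Ridge.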

Support Vector Machine regression

Support Vector Machine (SVM) models [32] are similar to multivariate linear regression models with the important difference that non-linear functions can be chosen as the best relationship between the variables. This is achieved by introducing transformations (called kernels) that map the independent variables to higher dimensional feature spaces. The independent variables can even be mapped to an infinite dimensional feature space with the use of a radial basis function (RBF) kernel. SVM models are fitted by minimizing an epsilon-insensitive cost function where errors (between the predictions and the observed values) of magnitude less than epsilon are ignored in the cost function. This approach typically leads to better generalization of the chosen model on out-of-sample data. The SVM kernel type, margin width, and regularization hyper parameters were chosen via cross-validation on the training data.
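Two ingredients mentioned above, the RBF kernel and the epsilon-insensitive cost, are simple enough to write down directly. The sketch below is only illustrative: the function names and the parameters `gamma` and `epsilon` are chosen for this example, and it shows the building blocks rather than a full SVM solver.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Mean epsilon-insensitive loss: errors smaller than epsilon cost nothing."""
    losses = [max(0.0, abs(o - p) - epsilon) for o, p in zip(observed, predicted)]
    return sum(losses) / len(losses)
```

With `epsilon = 0.1`, for instance, a prediction that misses by 0.05 contributes nothing to the loss, while one that misses by 0.5 contributes 0.4.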

AdaBoost regression with decision trees

Decision Tree models are created by recursively splitting the input space, creating local models in each region of the input space. Decision trees, however, have been shown to be unstable as small changes in the data can lead to drastically different tree structures. Boosting methods, such as Adaptive Boosting (AdaBoost), are often employed to fix this problem. Adaptive Boosting (AdaBoost) regression [33] fits a sequence of weak learners (in this case decision trees) on sequentially reweighted versions of the training data. At each iteration, the weights are individually modified so that the training examples incorrectly predicted by the previous decision tree are given more importance when training the next decision tree. The final prediction is obtained by taking the weighted median of the predictions outputted by the ensemble of weak learners (AdaBoost.R2 algorithm: [33]).
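The weighted median used to aggregate the weak learners’ outputs can be sketched as follows. The function name and the toy numbers are hypothetical; in AdaBoost.R2 the weight of each learner is derived from its training error, which is not computed here.

```python
def weighted_median(predictions, weights):
    """Smallest prediction whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(predictions, weights))
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


# Hypothetical predictions from three trees, first with equal weights,
# then with most of the weight on the tree that predicted 1.8.
equal = weighted_median([2.1, 1.8, 2.6], [1.0, 1.0, 1.0])
skewed = weighted_median([2.1, 1.8, 2.6], [0.2, 3.0, 0.5])
```

With equal weights this reduces to the ordinary median; as one learner’s weight grows, the aggregate moves toward that learner’s prediction without being pulled by outlying predictions the way a weighted mean would be.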

Independent variables

In all of the aforementioned regression approaches the goal was to use all available information, at a given point in time, to produce accurate predictions of CDC’s %ILI one, two, three, and four weeks ahead of the release of CDC reports, effectively predicting ILI three weeks into the future. At a given point in time, historical values up to two weeks prior to the current date were available for all data sources (CDC, FNY, ATH, GT, GFT, and TWT). In addition, real-time ILI estimates were available, with a one-week lag, for ATH, GT, GFT, and TWT. With this information, we produced predictions for every week starting on July 06, 2013 and up to February 21, 2015. For our first prediction, on the week of July 06, 2013, the first training set included 31 weeks’ worth of historical data from all data sources. For subsequent weeks, we dynamically increased the training set to include all available information at the given date, from all data sources.
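The growing training set described above corresponds to an expanding-window evaluation scheme, sketched below. The generator name and its arguments are hypothetical; the value 31 for the first training window comes from the text, while the total of 35 weeks is an arbitrary number chosen to keep the example small.

```python
def expanding_window_splits(n_weeks, initial_train):
    """Yield (train_indices, test_index) pairs with a growing training set.

    The first split trains on weeks 0..initial_train-1 and predicts week
    initial_train; each later split adds one more week of history.
    """
    for t in range(initial_train, n_weeks):
        yield list(range(t)), t


# 31 weeks of initial history, as in the text; 35 total weeks is arbitrary.
splits = list(expanding_window_splits(n_weeks=35, initial_train=31))
```

Each model is refit on the training indices of a split and scored only on the single held-out week, so every reported prediction is out-of-sample.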

Baseline predictions

As a reference, we produced ILI predictions using only historical CDC reported ILI. We achieved this via an autoregressive model with three weekly lagged components as independent variables (equation 1 in Paul et al 2014 [11]). We trained this model for the time period 11/06/11–2/08/15, and produced out-of-sample predictions for the four weekly time horizons during the time period of our study. We used the same procedure as the ARX model for Twitter, training on the 2011–2012 and 2012–2013 flu seasons, and producing predictions on the 2013–2014 and 2014–2015 flu seasons. These predictions were used to assess the added value provided by our digital disease detection systems’ information.
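In the y(t) notation used in this paper, an autoregressive model with three weekly lagged components can be written as below. This is a sketch only: the intercept β0 and the coefficient names are assumptions made here, and the exact form used is the one given as equation 1 in Paul et al 2014 [11].

```latex
\[
\hat{y}(t) = \beta_0 + \beta_1\, y(t-1) + \beta_2\, y(t-2) + \beta_3\, y(t-3)
\]
```

The coefficients are estimated by least squares over the training period, and the fitted model is then applied to weeks outside that period.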

Evaluation metrics

We report five evaluation metrics to compare the performance of the five independent predictors and the multiple ensemble methods: Pearson correlation, root mean squared error (RMSE), maximum absolute percent error (MAPE), root mean squared percent error (RMSPE), and hit rate.

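The definitions of all evaluation metrics are given below. Our notation is as follows: $y_i$ denotes the observed value of the CDC’s ILI at time $t_i$, $x_i$ denotes the predicted value by any model at time $t_i$, $\bar{y}$ denotes the mean or average of the values $\{y_i\}$, and similarly $\bar{x}$ denotes the mean or average of the values $\{x_i\}$.

Pearson Correlation, a measure of the linear dependence between two variables during a time period $[t_1, t_n]$, is defined as:

\[
r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\]

Root Mean Squared Error (RMSE), a measure of the difference between predicted and true values, is defined as:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i)^2}
\]

Root Mean Squared Percent Error (RMSPE), a measure of the percent difference between predicted and true values, is defined as:

\[
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(\frac{y_i - x_i}{y_i}\right)^2} \times 100
\]

Maximum Absolute Percent Error (MAPE), a measure of the magnitude of the maximum percent difference between predicted and true values, is defined as:

\[
\mathrm{MAPE} = \max_{i} \left(\frac{|y_i - x_i|}{y_i}\right) \times 100
\]

Hit Rate, a measure of how well the algorithm predicts the direction of change in the signal (independently of the magnitude of the change), is defined as:

\[
\text{Hit Rate} = \frac{\sum_{i=2}^{n} \left[\operatorname{sign}(y_i - y_{i-1}) == \operatorname{sign}(x_i - x_{i-1})\right]}{n-1} \times 100
\]

where the symbol == denotes an if statement that returns the value 1 if the signs of the predicted and observed changes are the same, and 0 otherwise.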

These metrics were calculated for the time period: July 06, 2013 to February 21, 2015.
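The five metrics described in this section can be computed with a few lines of code. The sketch below is an illustration under the assumption that the percent-based metrics are scaled by 100 and that the hit rate compares the signs of week-to-week changes; the function names and the toy series are hypothetical.

```python
import math


def pearson(y, x):
    """Pearson correlation between observed y and predicted x."""
    n = len(y)
    mean_y, mean_x = sum(y) / n, sum(x) / n
    cov = sum((yi - mean_y) * (xi - mean_x) for yi, xi in zip(y, x))
    dev_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    dev_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    return cov / (dev_y * dev_x)


def rmse(y, x):
    """Root mean squared error."""
    return math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(y))


def rmspe(y, x):
    """Root mean squared percent error, in percent."""
    return 100.0 * math.sqrt(
        sum(((yi - xi) / yi) ** 2 for yi, xi in zip(y, x)) / len(y)
    )


def mape(y, x):
    """Maximum absolute percent error, in percent."""
    return 100.0 * max(abs(yi - xi) / yi for yi, xi in zip(y, x))


def hit_rate(y, x):
    """Percent of weeks in which predicted and observed changes share a sign."""
    def sign(v):
        return (v > 0) - (v < 0)

    hits = sum(
        sign(y[i] - y[i - 1]) == sign(x[i] - x[i - 1]) for i in range(1, len(y))
    )
    return 100.0 * hits / (len(y) - 1)


# Hypothetical observed and predicted %ILI values for three weeks.
observed = [1.0, 2.0, 4.0]
predicted = [1.0, 2.5, 3.0]
```

Note that RMSPE and MAPE divide by the observed value, so they are only meaningful when the observed %ILI is strictly positive.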

Results

Real time estimates

Table 1 presents the performance of the 5 real-time (nowcast) weak predictors as measured by each individual evaluation metric. This table is labeled “last week” since at a given point in time the revised version of all these estimates is only available on the Sunday of the reported week (or Monday of the subsequent week) and thus the information effectively predicts the %ILI of last week. For context, we included the metrics of three additional real-time predictions: (1) the baseline autoregressive predictions described in the previous section; (2) the CDC’s Virology data, and (3) the best real-time ensemble method predictions, produced with a support vector machine (with RBF kernel).

Table 1. Similarity metrics between CDC’s ILI and the 5 real-time weak predictors: Flu Near You, athenahealth, Google Trends, Google Flu Trends, and Twitter, for the time period Aug 2013—Feb 2015.

For reference, an autoregressive model (AR3) was utilized as a baseline. Pearson correlation and Hit rate for CDC’s Virology data are shown. The best performing model per metric is bold faced.

http://dx.doi.org/10.1371/journal.pcbi.1004513.t001

As Table 1 shows, the real-time ensemble predictions outperform any individual weak predictor in all but one metric (the hit rate). A 0.989 Pearson correlation and an average error of about 0.176 %ILI (RMSE) make the ensemble approach a very accurate predictor. The ensemble predictions are also robust, as indicated by the size of the MAPE, which measures how far off-target the ensemble method is, at worst, with respect to the revised CDC ILI estimates. The worst performance was 23.6%, which is comparable to the LASSO’s 20.2% MAPE (see Table 2). This error is smaller than two thirds of the smallest MAPE of any of the individual weak predictors. In terms of hit rate, which reflects the ability of the method to predict the upward or downward tendency of the CDC’s ILI (in addition to the Pearson correlation and independently of producing an accurate point estimate, as captured by RMSE), athenahealth data (ATH) offers the best results.

Furthermore, Table 1 quantitatively shows the added value of using real-time digital disease detection information over a simple historical autoregressive approach. This can be seen in the improvement of the Pearson correlation from 0.930 to 0.989, the nearly three-fold reduction in the RMSE, and the halving of the maximum absolute error.

The top panel of Fig 1 graphically shows the revised CDC’s ILI along with the predictions of the 5 data sources, the baseline, and the best ensemble approach (SVM RBF), as a function of time. The errors for each predictor are displayed in the bottom panel of Fig 1. The real-time estimates produced with our ensemble method are capable of predicting the timing and magnitude of the two peaks of the 2014–2015 season exactly, whereas they predict the peak of the 2013–2014 season with a one-week lag. Overall, the predictions track the CDC’s revised %ILI very accurately. This can also be seen in the top left panel of Fig 2.

Fig 1. The CDC’s %ILI (influenza-like illnesses), the performance of the 5 available predictors, the baseline predictions, and the performance of the best ensemble method for last week’s predictions are displayed as a function of time (top).

The errors associated with each weak predictor and the ensemble approach are shown (bottom).

http://dx.doi.org/10.1371/journal.pcbi.1004513.g001

Fig 2. The best performing ensemble approach is shown in red side by side to the CDC’s % ILI for all time horizons: last week (top left), current week (top right), next week (bottom left), and two weeks from current (bottom right).

The dark error bars correspond to the relative root mean squared error (RRMSE) and the light error bars correspond to the relative maximum absolute error.

http://dx.doi.org/10.1371/journal.pcbi.1004513.g002

Forecasts

Since none of the five weak predictors produce predictions into the future (forecasts), we do not have the equivalent of Table 1 for the three forecast time horizons (labeled “this week”, “next week”, and “in two weeks”). Table 2 presents the performance of 4 different machine learning ensemble approaches and the baseline autoregressive predictions for the four time horizons. Figs 2, 3 and 4 show these results graphically. Ensemble predictions produced with the AdaBoost method show the best accuracy (lowest RMSE) and robustness (lowest MAPE) for the three forecast time horizons. Correlation is also highest with AdaBoost in all three horizons. While the hit rate is highest for different methods in different time horizons, AdaBoost has the best overall performance, as observed in Figs 3 and 4. We highlight the fact that our ensemble predictions one week into the future, labeled “this week”, have comparable accuracy to real-time GFT predictions, as measured by RMSE.

Fig 3. The CDC’s %ILI (influenza-like illnesses) and the performance of multiple machine learning ensemble approaches that combine the 5 weak predictors to produce a single estimate are displayed for comparison for the four time horizons: last week (top left), current week (top right), next week (bottom left), and two weeks from current (bottom right).

The red curve displays the performance of the best method for a given time horizon. As expected, the accuracy and robustness of the predictions decrease as the time horizon increases.

http://dx.doi.org/10.1371/journal.pcbi.1004513.g003

Fig 4. Errors associated with each ensemble approach are displayed for all time horizons: last week (top left), current week (top right), next week (bottom left), and two weeks from current (bottom right).

http://dx.doi.org/10.1371/journal.pcbi.1004513.g004

Table 2. Similarity metrics between CDC’s ILI and 4 machine learning ensemble methods for last week (top), this week (second), next week (third), and two weeks from now (bottom), for the time period Aug 2013—Feb 2015.

For reference, an autoregressive model (AR3) was utilized as a baseline. The best performing model per metric is bold faced.

http://dx.doi.org/10.1371/journal.pcbi.1004513.t002

As shown in Table 2, our ensemble approach produces better results than the baseline AR3 autoregressive model in all similarity metrics and all time horizons. This fact shows quantitatively the value of social media and crowd-sourced data for improving predictions of future %ILI. Specifically, the average error (RMSE) of our ensemble predictions is nearly half that of the autoregressive predictions in all time horizons. Pearson correlations of our ensemble approach predictions improve on their autoregressive counterparts, from 0.845 to 0.960 in the one-week forecast, from 0.759 to 0.927 in the two-week forecast, and from 0.683 to 0.904 in the three-week forecast. Note also that our forecast estimates in all time horizons (up to four weeks ahead of the release of CDC’s reports) show at least comparable accuracy to “real-time” estimates obtained with a purely autoregressive model.

The ability of the ensemble approach forecasts to capture the timing and magnitude of the peaks in the flu seasons decays as the time horizon increases, as observed in Fig 2. Indeed, one-week forecasts predict the 2013–2014 peak with a one-week lag and with a percent error of about 10%, and they predict the two 2014–2015 peaks with a one-week lag and with percentage errors less than 2%. The two-week forecasts capture the 2013–2014 peak with a one-week lag and show percentage errors of about 10%, and they predict the two 2014–2015 peaks with a two-week lag and percentage errors up to 20%. Finally, the three-week forecasts capture the 2013–2014 peak with a two-week lag and show percentage errors of about 20%, and they predict the two 2014–2015 peaks with a two- to three-week lag and with percentage errors up to 25–30%.

Discussion

Our results show that our real-time ensemble predictions outperform every real-time flu predictor constructed independently with each data source. This fact suggests that combining information from multiple independent flu predictors is advantageous over simply choosing the best performing predictor. This is the case not only for real-time predictions but also for the one, two and three week forecasts presented.

Specifically, we show that our methodology can produce predictions one week ahead of GFT’s real-time estimates with comparable accuracy. We also show that our ensemble forecasts (up to three weeks into the future) always improve on predictions produced with a baseline autoregressive model, thus demonstrating quantitatively the added value of incorporating search and social media data in our flu prediction models.

It is interesting to highlight that the correlation and RMSE between the ensemble approach real-time predictions and the revised CDC reports (Corr: 0.989 and RMSE: 0.176) are similar to those between the unrevised and revised CDC reports (Corr: 0.993 and RMSE: 0.162). This means that our real-time ensemble model is as accurate a predictor of the revised CDC’s ILI estimates as the unrevised CDC data is. Thus, it is possible that we may be reaching the limit of what is possible, in terms of producing an accurate predictor of revised CDC’s ILI.

Our ensemble estimates correlate better with CDC’s ILI than CDC’s Virology data (which measures lab-confirmed cases of influenza) does with CDC’s ILI. This suggests that our (search and social media) data sources, when combined appropriately, track closely people showing symptoms and not necessarily those that are confirmed with influenza. It is important to mention that CDC’s Virology data (http://www.cdc.gov/vaccines/pubs/surv-manual/chpt06-influenza.html) is not necessarily considered to be a good predictor of ILI and tends to be even more lagged than CDC’s ILI due to the slowness of laboratory testing [34,35,36].

Doubts have emerged regarding the value of digital disease detection methods as a consequence of the multiple discrepancies between GFT’s predictions and the observed CDC’s ILI estimates [18,19, 20, 21, 22, 24, 28, 29]. We highlight the fact that even when one of the independent predictors produces unreliable estimates, our ensemble estimates are robust and accurate. This is observed specifically during the 2014–2015 flu season when ATH and GFT overestimated the flu season peak magnitude by more than 30% and approximately 15%, respectively, and the real-time ensemble approach estimates were right on target.

An additional attribute of our approach is that even if the ground truth (now CDC reports) were chosen to come from a different (and potentially more appropriate) source, our methodology would seamlessly adapt to predicting any target signal.

While the results presented here are for influenza-like illnesses at the national level within the US, our approach shows promise to be easily extended to accurately track not only influenza in other countries where multiple data sources may be available [37,38] but also other infectious diseases. Indeed, infectious diseases such as Dengue [39, 40, 41] or Malaria [42], for which multiple surveillance methods are in place would benefit from combining information in a similar way to the one proposed here. Moreover, disease surveillance data at finer spatial resolutions tend to be scarcer and often unreliable [43], and thus, approaches like ours may help produce more accurate and robust disease incidence estimates, at higher spatial resolutions, by drawing data from multiple sources.

Limitations

Using weekly information from reports published by the CDC as our gold standard for national flu activity may not necessarily be ideal. Indeed, two data sources considered in this study, athenahealth and Flu Near You, aim at tracking the percentage of the general population with ILI symptoms independently. While athenahealth can be thought of as a subsample of the CDC-reported %ILI (since it calculates the %ILI in a similar fashion to the CDC, except with the information from those patients seeking medical attention in facilities managed by athenahealth), Flu Near You aims at providing an estimate of flu activity from a potentially distinct population (people willing to report their health status in weekly surveys via a mobile phone app). Interestingly, while the sectors of the population sampled by the CDC and FNY may be distinct (they may overlap when people report their symptoms using the FNY app and also seek medical attention), Fig 1 and a recent study [16] show that their ILI estimates track one another quite well (Pearson correlation of 0.948), suggesting that both the FNY and CDC datasets may be good proxies of ILI activity in the population. Finally, the best ensemble methodology may change in future flu seasons; thus, the performance of the multiple methodologies should be monitored continuously as new predictions are produced.
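For readers unfamiliar with the agreement measure quoted above, the following is a minimal sketch of how a Pearson correlation between two weekly series can be computed. The two short series below are made-up placeholder values, not the FNY or CDC data, and the function name `pearson_correlation` is our own.

```python
import numpy as np


def pearson_correlation(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("both series must have the same length")
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denominator = np.sqrt((x_centered @ x_centered) * (y_centered @ y_centered))
    return float((x_centered @ y_centered) / denominator)


# Hypothetical weekly %ILI values for two surveillance systems (placeholders only).
series_one = [1.2, 1.5, 2.1, 3.0, 4.2, 3.6, 2.4, 1.7]
series_two = [1.0, 1.6, 2.4, 2.9, 4.5, 3.3, 2.6, 1.5]

correlation = pearson_correlation(series_one, series_two)
print("Pearson correlation:", round(correlation, 3))
```

A value close to 1 indicates that the two series rise and fall together; it does not by itself indicate that they agree in magnitude, since the measure is insensitive to a constant scaling or offset of either series.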

Conclusion

We presented a methodology that optimally combines the information from multiple real-time flu predictors to produce more accurate and robust real-time flu predictions than any other existing system. Moreover, our ensemble approach is capable of using real-time and historical information to accurately forecast flu estimates one, two, and three weeks into the future.

Author Contributions

Conceived and designed the experiments: MS JSB ATN MD EON. Performed the experiments: MS ATN MJP. Analyzed the data: MS ATN. Contributed reagents/materials/analysis tools: MS ATN MD MJP. Wrote the paper: MS.

References

  1. Lipsitch M, Finelli L, Heffernan RT, Leung GM, and Redd S. Improving the evidence base for decision making during a pandemic: the example of 2009 influenza A/H1N1. Biosecurity and bioterrorism: biodefense strategy, practice, and science. 2011; 9(2), 89–115.
  2. WHO (2015) Influenza (Seasonal), Fact Sheet Number 211. Available at http://www.who.int/mediacentre/factsheets/fs211/en/index.html.
  3. Cobb L, Krishnamurthy A, Mandel J, and Beezley JD. Bayesian tracking of emerging epidemics using ensemble optimal statistical interpolation. Spatial and spatio-temporal epidemiology. 2014; 10: 39–48.
  4. Yang W, Karspeck A, and Shaman J. Comparison of filtering methods for the modeling and retrospective forecasting of influenza epidemics. PLoS computational biology. 2014; 10, no. 4: e1003583. doi: 10.1371/journal.pcbi.1003583. pmid:24762780
  5. Yang W, Lipsitch M, and Shaman J. Inference of seasonal and pandemic influenza transmission dynamics using ‘big’ surveillance data. Proceedings of the National Academy of Sciences. 2015; 112(9): 2723–2728. doi: 10.1073/pnas.1415012112
  6. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, and Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009; 457, 1012–1014. doi: 10.1038/nature07634. pmid:19020500
  7. Scarpino SV, Dimitrov NB, and Meyers LA. Optimizing provider recruitment for influenza surveillance networks. PLoS Comput Biol. 2012; 8, no. 4: e1002472. doi: 10.1371/journal.pcbi.1002472. pmid:22511860
  8. Polgreen PM, Chen Y, Pennock DM, Nelson FD, and Weinstein RA. Using internet searches for influenza surveillance. Clinical Infectious Diseases. 2008; 47(11): 1443–1448. doi: 10.1086/593098. pmid:18954267
  9. Yuan Q, Nsoesie EO, Lv B, Peng G, Chunara R, Brownstein JS. Monitoring influenza epidemics in China with search query from Baidu. PLoS One. 2013; 8: e64323. doi: 10.1371/journal.pone.0064323. pmid:23750192
  10. Signorini A, Segre AM, and Polgreen PM. The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE. 2011; 6, e19467. doi: 10.1371/journal.pone.0019467. pmid:21573238
  11. Paul MJ, Dredze M, and Broniatowski D. Twitter Improves Influenza Forecasting. PLoS currents. 2014; 6. doi: 10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117
  12. Chen L, Tozammel Hossain KSM, Butler P, Ramakrishnan N, and Prakash BA. Flu Gone Viral: Syndromic Surveillance of Flu on Twitter using Temporal Topic Models. IEEE International Conference In Data Mining (ICDM), 2014; pp. 755–760. IEEE.
  13. McIver DJ and Brownstein JS. Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS Comput. Biol. 2014; 10, e1003581. doi: 10.1371/journal.pcbi.1003581. pmid:24743682
  14. Generous N, Fairchild G, Deshpande A, Del Valle SY, and Priedhorsky R. Global disease monitoring and forecasting with wikipedia. PLoS computational biology. 2014; 10(11), e1003892. doi: 10.1371/journal.pcbi.1003892. pmid:25392913
  15. Crawley AW. Flu near you: Comparing crowd-sourced reports of influenza-like illness to the CDC outpatient influenza-like illness surveillance network, October 2012 to March 2014. In 2014 CSTE Annual Conference. CSTE, 2014.
  16. Smolinski MS, Crawley AW, Baltrusaitis K, Chunara R, Olsen JM, Wojick O, et al. Flu Near You: Crowdsourced Symptom Reporting Spanning Two Influenza Seasons. American Journal of Public Health. 2015; e1–e7. doi: 10.2105/ajph.2015.302696
  17. Santillana M, Nsoesie EO, Mekaru SR, Scales D, Brownstein JS. Using Clinicians’ Search Query Data to Monitor Influenza Epidemics. Clinical Infectious Diseases. 2014; 59(10): 1446–1450. doi: 10.1093/cid/ciu647. pmid:25115873
  18. Butler D. When Google got flu wrong. Nature. 2013; 494(7436): 155. doi: 10.1038/494155a. pmid:23407515
  19. Cook S, Conrad C, Fowlkes AL, and Mohebbi MH. Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS ONE. 2011; 6, e23610. doi: 10.1371/journal.pone.0023610. pmid:21886802
  20. Olson DR, Konty KJ, Paladini M, Viboud C, Simonsen L. Reassessing Google flu trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales. PLoS Comput. Biol. 2013; 9, e1003256. doi: 10.1371/journal.pcbi.1003256. pmid:24146603
  21. Lazer DM, Kennedy R, King L, Vespignani A. The parable of Google flu: traps in big data analysis. Science. 2014; 343, 1203–1205. doi: 10.1126/science.1248506. pmid:24626916
  22. Santillana M, Zhang DW, Althouse BM, and Ayers JW. What can digital disease detection learn from (an external revision to) Google flu trends? Am. J. Prev. Med. 2014; 47, 341–347. doi: 10.1016/j.amepre.2014.05.020. pmid:24997572
  23. Davidson M, Haim DA, and Radin JM. Using Networks to Combine Big Data and Traditional Surveillance to Improve Influenza Predictions. Sci. Rep. 2015; 5. doi: 10.1038/srep08154
  24. Yang S, Santillana M, and Kou SC. ARGO: a model for accurate estimation of influenza epidemics using Google search data. 2015. arXiv preprint arXiv:1505.00864.
  25. Lamb A, Paul MJ, and Dredze M. Separating Fact from Fear: Tracking Flu Infections on Twitter. HLT-NAACL. 2013.
  26. Broniatowski DA, Paul MJ, and Dredze M. National and local influenza surveillance through twitter: An analysis of the 2012–2013 influenza epidemic. PloS one. 2013; 8(12), e83672. doi: 10.1371/journal.pone.0083672. pmid:24349542
  27. Dredze M, Cheng R, Paul M, and Broniatowski D. HealthTweets.org: A Platform for Public Health Surveillance using Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014.
  28. Copeland P, Romano R, Zhang T, Hecht G, Zigmond D, Stefansen C. Google disease trends: an update. Int Soc Negl Trop Dis. 2013; 3.
  29. Stefansen C. Google Flu Trends gets a brand new engine. Google Research Blog. 2014.
  30. Friedman J, Hastie T, and Tibshirani R. The elements of statistical learning. 2009; Vol. 2, No. 1. New York: Springer.
  31. Breiman L. Stacked regressions. Machine Learning. 1996; 24, 49–64. doi: 10.1007/bf00117832
  32. Smola A, and Vapnik V. Support vector regression machines. Advances in neural information processing systems. 1997; 9: 155–161.
  33. Freund Y, and Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences. 1997; 55.1: 119–139. doi: 10.1006/jcss.1997.1504
  34. Brownstein JS, and Mandl KD. Reengineering real time outbreak detection systems for influenza epidemic monitoring. In AMIA Annual Symposium Proceedings. American Medical Informatics Association. 2006; Vol. 2006, p. 866.
  35. Olson DR, Heffernan RT, Paladini M, Konty K, Weiss D, and Mostashari F. Monitoring the impact of influenza by age: emergency department fever and respiratory complaint surveillance in New York City. PLoS Med. 2007; 4(8), e247. doi: 10.1371/journal.pmed.0040247. pmid:17683196
  36. Viboud C, Charu V, Olson D, Ballesteros S, Gog J, Khan F, et al. Demonstrating the use of high-volume electronic medical claims data to monitor local and regional influenza activity in the US. PLOS One. 2014; 9(7): e102429. doi: 10.1371/journal.pone.0102429. pmid:25072598
  37. Paolotti D, Carnahan A, Colizza V, Eames K, Edmunds J, Gomes G, et al. Web-based participatory surveillance of infectious diseases: the Influenzanet participatory surveillance experience. Clinical Microbiology and Infection. 2014; 20(1), 17–21. doi: 10.1111/1469-0691.12477. pmid:24350723
  38. Dalton C, Durrheim D, Fejsa J, Francis L, Carlson S, d'Espaignet ET, et al. Flutracking: a weekly Australian community online survey of influenza-like illness in 2006, 2007 and 2008. Commun Dis Intell Q Rep. 2009; 33(3): 316–22. doi: 10.3201/eid1612.100935. pmid:20043602
  39. Chan EH, Sahai V, Conrad C, Brownstein JS. Using Web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance. PLoS Negl Trop Dis. 2011; 5: e1206. doi: 10.1371/journal.pntd.0001206. pmid:21647308
  40. Madoff LC, Fisman DN, Kass-Hout T. A new approach to monitoring dengue activity. PLoS Negl Trop Dis. 2011; 5: e1215. doi: 10.1371/journal.pntd.0001215. pmid:21647309
  41. Gluskin RT, Johansson M, Santillana M, Brownstein JS. Evaluation of Internet-based dengue query data: Google Dengue Trends. PLoS neglected tropical diseases. 2014; 8.2: e2713. doi: 10.1371/journal.pntd.0002713
  42. Ocampo AJ, Chunara R, and Brownstein JS. Using search queries for malaria surveillance, Thailand. Malaria journal. 2013; 12.1: 390. doi: 10.1186/1475-2875-12-390
  43. Nagar R, Yuan Q, Freifeld CC, Santillana M, Nojima A, Chunara R, et al. A Case Study of the New York City 2012–2013 Influenza Season With Daily Geocoded Twitter Data From Temporal and Spatiotemporal Perspectives. Journal of medical Internet research. 2014; 16(10).