Skip to main content
  • Loading metrics

Reappraising the utility of Google Flu Trends


Estimation of influenza-like illness (ILI) using search trends activity was intended to supplement traditional surveillance systems, and was a motivation behind the development of Google Flu Trends (GFT). However, several studies have previously reported large errors in GFT estimates of ILI in the US. Following recent release of time-stamped surveillance data, which better reflects real-time operational scenarios, we reanalyzed GFT errors. Using three data sources—GFT: an archive of weekly ILI estimates from Google Flu Trends; ILIf: fully-observed ILI rates from ILINet; and, ILIp: ILI rates available in real-time based on partial reporting—five influenza seasons were analyzed and mean square errors (MSE) of GFT and ILIp as estimates of ILIf were computed. To correct GFT errors, a random forest regression model was built with ILI and GFT rates from the previous three weeks as predictors. An overall reduction in error of 44% was observed and the errors of the corrected GFT are lower than those of ILIp. An 80% reduction in error during 2012/13, when GFT had large errors, shows that extreme failures of GFT could have been avoided. Using autoregressive integrated moving average (ARIMA) models, one- to four-week ahead forecasts were generated with two separate data streams: ILIp alone, and with both ILIp and corrected GFT. At all forecast targets and seasons, and for all but two regions, inclusion of GFT lowered MSE. Results from two alternative error measures, mean absolute error and mean absolute proportional error, were largely consistent with results from MSE. Taken together these findings provide an error profile of GFT in the US, establish strong evidence for the adoption of search trends based 'nowcasts' in influenza forecast systems, and encourage reevaluation of the utility of this data source in diverse domains.

Author summary

Google Flu Trends (GFT) was proposed as a method to estimate influenza-like illness (ILI) in the general population and to be used in conjunction with traditional surveillance systems. Several previous studies have documented that GFT estimates were often overestimates of ILI. In this study, using a recently released archive of data of provisional incidence from a large surveillance system in the US (ILINet), we report errors in GFT alongside errors from ILINet’s initial estimates of ILI. This comparison using information available in real-time allows for a more nuanced assessment of GFT errors. Additionally, we describe a method to correct errors in GFT and show that the corrected GFT estimates are at least as accurate as initial estimates from ILINet. Finally, we show that inclusion of corrected GFT while forecasting ILI in the next four weeks considerably improves forecast accuracy. Taken together, our results indicate that the GFT model could have added value to traditional surveillance and forecasting systems, and a reevaluation of the utility of the underlying search trends data, which is now more openly accessible, in fields beyond influenza is warranted.

This is a PLOS Computational Biology Methods paper.


Surveillance of seasonal influenza and other respiratory illnesses deservedly receives significant attention from public health agencies in the United States. To complement traditional surveillance systems, both internet- [17] and non-internet-based [811] proxy indicators of incidence have been developed. Among these, of note is Google Flu Trends (GFT) [1, 12], which estimated influenza-like illness (ILI) from online search activity. GFT estimates from an initial model and subsequent revisions to the model were publicly available until 2015, when the service was discontinued [13]. Although Google has not offered reasons for the termination, one contributing factor could well have been the widely reported propensity of GFT to over-estimate ILI, which effectively morphed it in the public perception from a poster child for the power and utility of big data to one of its hubris [1420].

However, this perception is probably misplaced. The most comprehensive and commonly cited study of GFT errors for locations in the United States was published by Lazer et al [14], following an anomalous season during which the errors were much larger than previously observed. These findings were supported by several other studies that were smaller in scope but reported errors of approximately the same magnitude at different locations and geographical resolutions [21, 22]. In this paper, using newly available surveillance data, we revisit GFT estimates for locations in the US and show that its errors are less substantial than previously reported.

The severity of a respiratory viral infection in an individual depends on multiple factors, and in most cases the symptoms are mild and do not require medical attention. As a consequence, the more widely used surveillance systems in the US–the Centers for Disease Control and Prevention (CDC)’s ILINet and FluSurv-NET systems, for example–only capture infections that are severe enough to precipitate a visit to a physician's office or hospital. On the other hand, the relationship between the severity of a respiratory infection and the likelihood that an individual initiates an online search session for related information, is unknown; hence, the signals that drive GFT and the surveillance systems are intrinsically different. Nonetheless, as GFT used incidence data from ILINet as its response variable, it has been a common practice, and one that we follow in this study, to use these rates as reference or ground truth when reporting the accuracy of GFT estimates.

However, in reporting US national and regional errors, most previous studies, including Lazer et al [14], did not account for delayed reporting to ILINet. The fully observed ILINet rates (ILIf) are finalized no sooner than 2–3 weeks after the conclusion of a surveillance week, as some of the surveillance network data are submitted late, and in some instances, revisions can even occur several months later. The rates released in the interim are estimates based on partial observations (ILIp) and the magnitude of difference between ILIp and ILIf, as we report here, varies by location, influenza season and the phase of a season.

Given these reporting delays and revisions, it is ILIp rather than ILIf that informs real-time decisions. Thus, a more appropriate error analysis, one that better reflects an operational scenario, should compare errors of GFT (ILIf—GFT) to errors of ILIp (ILIf—ILIp). An archive of ILIp at US national and Health and Human Service (HHS)[23] regional levels for 6 seasons has been made available [24] recently, and in this study we used this archive to recompute GFT and ILIp errors[25, 26]. Additionally, we extended the analysis to finer geographical resolutions as ILIf is now also available for US states. Finally, we report for the first time, errors from the final GFT model, updated in fall 2014, before the service was discontinued.

Google's recent initiative to provide access to its search trends through an API[13] supports more open data sharing. This effectively decouples data from model and facilitates the development of alternative models to GFT. Through the analysis described here, we hope to establish an error profile of GFT that can serve as a baseline for comparing these alternative models.

More importantly, although GFT was proposed by its developers as a supplement to traditional surveillance systems and not a replacement, the focus to date has been disproportionately on evaluating GFT's ability to mimic surveillance systems rather than on evaluating its utility when deployed in conjunction with these systems in operational settings. Previous findings suggest that the errors in GFT can be reduced by combining GFT estimates with lagged surveillance rates [14, 27, 28]. Here we propose a similar remedial step with a parsimonious regression model and show that the corrected GFT is more accurate than ILIp.

A natural extension is to assess whether GFT, its errors thus corrected, could have improved longer term forecasts by providing more timely outbreak information than traditional surveillance systems. For this purpose, we generated forecasts of ILI one to four weeks in the future using ILIp alone, and using both ILIp and error corrected GFT. We demonstrate that the inclusion of GFT considerably improves the accuracy of near-term forecasts and thus adds value to traditional surveillance systems.

Materials and methods

In this section we describe in detail the two data sources used—an outpatient surveillance system and GFT—access information for the two sources, and the measures used to calculate errors of these estimates. We then describe the autoregressive model framework used to generate near term forecasts, followed by details of the forecast generation and validation process.

US influenza outpatient surveillance network (ILINet)

The ILINet surveillance system [29], developed and supported by the CDC, collects data from nearly 3000 healthcare providers in the US on outpatient visits for ILI, which is defined as fever (temperature above 100°F) co-occurring with cough and/or sore throat. Weekly counts of patients seen for ILI and for any reason are submitted to the system. These count data are used to calculate the percentage of outpatient visits due to ILI. In this study, by ILI rate we refer to population-weighted aggregates of ILI.

A Morbidity and Mortality Weekly Report (MMWR)[30] surveillance week runs from Sunday thru Saturday and aggregated ILI rates at US state-, HHS regional- and national levels are publicly released through the FluView [31] website on Friday (6 days after a week concludes). The system allows for delayed reporting from providers and the delayed data are included in subsequent weekly releases. Hence, the ILI estimates for a week can change for multiple weeks following initial release. We refer to the ILI rates calculated from incomplete reporting as partially observed ILI rates, and in this paper denote the rates as per the first week of release as ILIp. An archive of revisions for the 2009/10 season onwards has been recently made available [24, 32] and for the 2013/14 season and later, these data are also accessible through the DELPHI group's epidata API [33].

Although ILIf is available for the US, HHS regions and states, ILIp is not currently available at the state level. ILI rates for the 2009/10 to 2014/15 seasons that were available on FluView at the end of surveillance week 20 of the 2017/18 season (May 13–19, 2018) were considered to be ILIf. This date is over two years after the end of the time period studied, and hence we assume that it is very unlikely that these rates would be further revised. Note that both ILIp and ILIf are rates, and ILIp can over or underestimate ILIf.

Google Flu Trends (GFT)

Originally developed in 2008, GFT estimated ILI rates in a population based on the frequency of a selected set of queries to the Google search engine [1]. The 2008 model used 45 queries, whose search frequencies were historically well correlated [34, 35] with ILI rates, as explanatory variables. To generate the estimates for the US, ILI rates were used as the response variable in the model. In response to observed deficiencies in the predictions, revisions to the model, including updates to the feature set, were made in 2009, August 2013 and August 2014. GFT estimates that were published in real-time from September 2008 through August 2015, along with estimates from revised models applied to past seasons continue to be hosted publicly [12].

Fig 1 shows the availability of GFT, ILIf and ILIp at different locations in the US. For US and HHS regions, GFT, ILIf and ILIp are available for six seasons—2009/10 to 2014/15—and for the states ILIf and GFT are available for the last 5 of these 6 seasons. The vertical lines indicate the time points of revisions to the GFT model; therefore estimates for seasons 2009/10 thru 2012/13 seasons, season 2013/14, and season 2014/15 are from different models.

Fig 1. Availability of GFT, ILIf, and ILIp at US national, regional and state levels in the US.

At the regional level, GFT and ILIf were available from 2003, and ILIp were available from 2009/10 season onwards, excluding off-season weeks. For states, ILIp were never available to the public, and ILIf is available from 2011/12 season onwards. Updates to GFT model are indicated by the vertical lines.

Unlike ILINet, GFT estimates for a week are finalized at the end of the week. Furthermore, as the GFT estimates were completely automated, and computed in real-time, they did not have the 6-day lag between the end of a week and the release of data as is the case with ILINet. This translates to GFT providing weekly incidence estimates for at least one more week than ILINet, at any given point of time. The estimate for this one additional week is sometimes referred to as a nowcast.

Error measures

For each week and location, error is defined as , where y is the reference, ILIf, and the estimate from GFT or ILIp. Aggregate error measures, Mean Squared Error (MSE), Mean Absolute Error (MAE) and Mean Absolute Proportional Error (MAPE) are respectively the mean of the square of errors, of the absolute error and of absolute error as a proportion of the reference value, and are reported across all seasons and locations, or for each season (across all location) and each location (across all seasons). During the study period, the reference value was never zero, and hence MAPE was computable. Formally,

As the errors in 2012/13 are reportedly much larger than during the other seasons included in the study, inclusion of this season could obscure overall results, and hence we report aggregate measures both with and without this season.

Seasonal autoregressive integrated moving average (ARIMA) model

A non-seasonal ARIMA model is specified by three parameters—p, the order of the autoregressive component; q, the order of the moving average component; and d, the degree of differencing required to make the given time series stationary. For a time series, Y, let y denote the time series obtained by d degree differencing. Thus, an ARIMA(p, d, q) is a model of the form: where the elements, εi, represent the forecast errors at the ith time step. Elements c, ϕ1, , ϕp, θ1, …,θq can be estimated through maximum likelihood estimation. As influenza in the US has strong yearly seasonality, a seasonal ARIMA model may often, though not always, be a better fit. Seasonal ARIMA models are specified with three additional parameters P, D, Q where D denotes seasonal differencing and P, Q are analogous to p, q, respectively, as defined above.

We used an implementation of an iterative method proposed by Hyndman and Khandakar [36] from the R [37] forecast [38] package to find an appropriate order for the ARIMA models. Briefly, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [39] and extended Canova-Hansen test [40] are used to determine an appropriate d and D respectively. To find values for the remaining parameters, an iterative process is initiated with the model that has the lowest Akaike’s Information Criterion (AIC) [41] amongst a small default set of models, as the candidate model. In each subsequent step, the parameters of the candidate model are varied by ±1 within a pre-specified parameter space (p, q: (0, 5); P, Q: (0, 2)) and the variant with the lowest AIC becomes the new candidate model. The process is terminated when the parameter space is exhausted or all variants of the candidate model result in a higher AIC.

Error correction and forecast generation

Retrospective near-term forecasts were generated for US National and the 10 HHS regions during the 2010/11 to 2014/15 influenza seasons for MMWR weeks 41 through week 20. Traditionally an influenza season is considered to run from MMWR week 40 thru MMWR week 39 of the following calendar year. Late spring and summer weeks (MWWR week 20 onwards) experience low incidence and hence were excluded in this study. Separate models were fit for each location and week. Models for each location are isolated as they do not use observations from any other location.

Let Xi and Zi denote the log transformed ILI rates and GFT estimates at week i respectively. All ILI/GFT values less than 2 (per 100000) were rounded up to 2 before log transformation. As described in a previous section, when forecasts are generated operationally at the end of week t, X1,⋯,Xt and Z1,⋯,Zt+1 would be available; i = 1 indicates MMWR week 40 of 2009/10 season. For a given week w ≤ t, Xw is ILIp if w and t belong to the same season, and ILIf otherwise. Corrected GFT, , is estimated using a random forest [4144] regression model with explanatory variables Xt,Xt−1,Xt−2,Zt+1,Zt,Zt−1,Zt−2 and response variable Xt+1. S4 Fig shows corrected GFT at the US national level, and its error with respect to ILIf.

To generate near term forecasts, two models were developed: the first using ILIp only (ILIp), and the second using ILIp and corrected GFT (ILIp+GFT). For a given week t, the ILIp models were trained on the time series X1, , Xt and used to forecast rates for weeks t+1, , t+4, denoted by . The corresponding ILIp+GFT ARIMA models were fit using the time series and forecast rates for doubles as the 1-week ahead forecast. For both models MSE, MAE and MAPE, as defined above, were calculated with ILIf as reference.

For example, the ILIp models for week 46 of 2011/12 season were fit using ILIf from the 2009/10 and 2010/11 seasons and ILIp from weeks 40 to 46 of the 2011/12 season, and were used to forecast rates for weeks 47 through 50. The GFT correction model for week 46 was fit using training instances compiled with ILI and GFT through week 46 of 2011/12 season and used to estimate, with test instance (X46,X45,X44,Z47,Z46,Z45,Z44). ILIp+GFT models used as an additional observation, and forecast rates for weeks 48 to 50. Therefore, the week 50 forecast from ILIp ARIMA model was a 4-week ahead forecast but a 3-week ahead forecast for the ILIp+GFT ARIMA model. Forecast errors for both model forms were then calculated using ILIf for weeks 47 to 50 as reference.


GFT as an estimator of ILIf

Table 1 shows that the MSE of GFT is on average 2.5 times that of ILIp with considerable variability by location. Region 9, where the mean squared errors were nearly equal, had the smallest difference between GFT and ILIp, whereas Region 4 had the largest difference, with GFT error about 7.6 times as large as that of ILIp. Similar variability was observed across seasons, with the largest difference by far occurring during the 2012/13 season, and the smallest during 2009/10. As previously reported [14], GFT estimates for weeks around the peak of the 2012/13 season were large over-estimates, which contributed considerably to the high mean errors.

Table 1. Aggregated squared error.

Mean (standard deviation, [25th–75th percentile]) for the entire study period, disaggregated by location and season. US national has ILIp has 11 fewer dates than the regions. Overall and location aggregations exclude 2012/13 season.

The corresponding difference in MAPE (S1 Table) is slightly smaller overall (GFT error 1.8 times ILIp error), with the GFT error actually lower than that of ILIp for Region 9. In reporting Table 1 (and S1 Table) we excluded season 2012/13 for Overall and regional aggregations; see S2 Table for aggregations across all seasons.

Fig 2 shows MSE with the final version of the GFT system for the 2014/15 season and the average GFT errors in all regions (denoted by the black triangle) are larger than corresponding ILIp errors. But as indicated by the data points above the diagonal, ILIp does not consistently have lower errors for all weeks. As supported by S2 Fig, during weeks very early (blue data points) or towards the end (red data points) of the season, the difference between GFT and ILIp is relatively small (data points closer to the diagonal). The larger errors for both ILIp and GFT occur during weeks of increased ILI activity around the peak week (green and grey data points). S1 Fig has the corresponding MAPE errors for the 2014/15 season. On the whole, errors during the 2014/15 season are in line with some of the previous seasons, and the final GFT model was not a marked improvement over previous models.

Fig 2. Squared errors from GFT and ILIp for HHS regions during the 2014/15 season.

The green data points show the error during the week of maximum weekly ILIf—the peak week—and the remaining data points are color coded by their distance from peak week. The black triangles show the mean error for the season. The black line is the y = x line; points below this line have larger errors from GFT than from ILIp. In all regions, the mean error from GFT falls below the line.

Looking at GFT errors at the state-level (Fig 3, S3 Fig), the errors are much larger than the errors at the corresponding HHS regions (black horizontal mark). Overall (top left panel), states with low (< 2 million) and medium (2–6 million) population sizes, tend to have larger GFT errors than high population states. We know from previous work [5] with search trends from Google's Health Trends API, that terms/queries whose search frequencies do not meet a predetermined threshold limit are reported as 0. If GFT used a dataset that was based on similar criteria, low population states where search volumes are smaller, would have had sparser feature spaces.

Fig 3. Mean squared error of GFT observed in US states.

The top left panel, Overall, shows average errors across 5 seasons and each of the other panels is limited to one season. The data points are color coded by population size and ordered by overall error (high to low). The black line shows the errors from corresponding HHS regions.

Similar patterns were seen when the errors are disaggregated by season. It is interesting to note that among all seasons studied, the season with the smallest differential in MSE between state and regional errors was the anomalous 2012/13 season, where the large increase in GFT regional errors was not accompanied by a proportionate increase in errors for states. In a few cases, the errors for a state were smaller than the errors at the corresponding region.

Nowcast using lagged ILIp and GFT

As shown in Table 2 and S4 Fig, considerable reduction in GFT MSE was achieved through regression on lagged data. An overall reduction of 44% was observed across the 11 locations and 4 seasons. The large reduction during 2012/13 reiterates the utility of this additional step as a check against extreme failures of GFT. It is also interesting to note that this step reduces GFT errors below that of ILIp i.e. the use of search trend data can not only provide an estimate of incidence a week earlier than ILINet, but can do so more accurately than ILINet's own initial estimate of incidence. S3 Table shows the corresponding overall reductions in MAE and MAPE, and the findings noted with MSE hold.

Table 2. Mean squared error in one-week ahead estimates.

Change column indicates percentage reduction in mean error by regressing ILIf on lagged ILIp+GFT. 2012/13 was excluded while aggregating overall and by region. Paired Wilcoxon signed rank tests for the hypothesis that the median of errors in GFT (Z) are greater than the median of errors in corrected GFT () were performed; cases where p > .05 are denoted by an asterisk (*).

There is considerable variability in the magnitude of improvement in nowcast quality across locations and seasons, and with a few exceptions the decrease in errors was significant (P < 0.05) per a paired Wilcoxon signed rank test [4547].

Near-term forecasts using nowcasts

Table 3 shows the MSE for near-term forecasts generated with ILIp alone and using both ILIp and corrected GFT (ILIp+GFT). At all targets (1- to 4-week ahead estimates) and seasons, and for all but two regions (Fig 4), inclusion of GFT lowered MSE. The overall MAPE with the ILIp+GFT models is also lower (S4 Table, S5 Fig), although the relative advantage over ILIp with different regional or seasonal disaggregation criteria is more mixed. The overall reduction in errors when aggregated by target or region is not limited to reduction from the anomalous 2012/13 season; ILIp+GFT errors continue to be lower and significant when the 2012/13 season is excluded (S5 Table).

Fig 4. Mean squared error of near term forecasts for ILIp and ILIp+GFT models.

The data points are color coded by target. Points below the diagonal (broken black line) indicate instances where forecast quality improved with the use of GFT. Each panel is for one of the locations.

Table 3. Mean squared error of near-term forecasts.

ILIp was generated with ILIp alone and ILIp+GFT by appending corrected GFT to ILIp. The lower error in each row is underscored. P-values from a paired Wilcoxon signed rank test that the median of error in ILIp forecasts are greater than the median of errors in ILIp+GFT forecasts are also shown; cases where p > .05 are denoted by an asterisk (*).

For all three measures, the accuracy of the regression model's nowcast either matches or exceeds that of the 1-week ahead ARIMA forecast. Reduction of errors at longer horizons is larger and this is quite likely due to the k week ahead forecast of the ILIp+GFT model being lined up with the k+1 week ahead forecast of the ILIp model, as ARIMA errors tend to increase with increasing horizon.


The increasing availability of big data has naturally led to the development of experimental applications in several domains, including those such as public health surveillance that have traditionally relied on more robust, but also labor intensive, data collection processes. Google Flu Trends was developed as an alternative method to measure ILI in the general population, to be used in conjunction with traditional surveillance methods when and where they exist. Given its prospects for use (and misuse) GFT appropriately received wide attention; but it is our belief that it has been adjudged wanting against goals it was not designed to meet.

Reporting errors of ILIp rates alongside GFT errors, helps quantify the transient errors in ILINet due to delayed reporting and provides a more appropriate baseline for comparing the accuracy of GFT (and alternative nowcast models) in operational settings. The use of ILINet rates as ground truth, here and in previous studies, is appropriate simply because these are the targets GFT was designed to estimate and a more reliable system for estimating ILI broadly in the US does not exist. However, when assessing the validity of alternatives methods for influenza estimation, we must remain cognizant of the deficiencies of ILINet in capturing influenza transmission at metapopulation scales–for instance, its passive data collection process, broad symptom definition that is geared towards ILI rather than influenza, and estimation of incidence from visit counts without a requirement for virologic confirmation.

The opening up of Google Trends API directly addresses one major obstacle in improving nowcasts over the GFT models, namely, the non-availability of public search trends data. Additionally, US state level ILINet rates were not available prior to the 2017/18 season, and previously required some form of extrapolation from regional ILI rates to state ILI rates in order to build state-level nowcast models. With these data now being released in real time, nowcast models for states should be able to identify more reliable predictor variables, and the accuracy of these nowcast estimates can be expected to improve over GFT estimates. Furthermore, fine-grained nowcast estimates, say at city or county scales, or for large hospital settings, are possible when reliable ILI rates exist.

Our results show that a regression model with lagged ILIp and GFT predictors can adequately correct errors in search trend based nowcasts and thereby avoid catastrophic failures, and the model estimated rates are at least as accurate as partially observed surveillance rates in the US. Indeed, during the 2017/18 and 2018/19 influenza seasons, which saw atypical, large, sustained outbreaks, our search trends based nowcasts did not exhibit large errors. Use of this data source alongside other data sources like twitter, electronic medical records, Wikipedia logs etc. [3, 4], can further reduce the risk of such failures by making the nowcasts less reliant on any single source.

Results with the near-term forecasts show that the provision of an additional week of observation to the ARIMA models considerably improves forecast quality. Forecasts generated with ILIp and corrected GFT also improve over those generated with ILIp and uncorrected GFT (S6 Table). Both the random forest and ARIMA models used here were standard implementations from open source statistical packages with no domain specific tailoring, and we have no reason to believe that these improvements and the ensuant findings are specific to the models used. Other mechanistic or time series models may offer similar improvements in accuracy, and some recent results are suggestive of such improvements [48, 49]. Our choice of ARIMA as the forecast model should not be construed as a vote in favor of its optimality in forecasting ILI; on the contrary, as ARIMA is not informed by any of the transmission dynamics of ILI, we include it as a non-naïve reference method. Researchers proposing alternative methods tailored for ILI should be expected to show that they do at least as well as ARIMA.

Overall, we believe that the results presented here provide sufficient evidence to encourage continued efforts to improve search trend based nowcasts for influenza and make a case for their more wide-spread adoption in operational forecasting systems. At a minimum, they show that reports of the failure of GFT are not unequivocal and they should not deter use of Google Trends API in areas other than ILI estimation.

Supporting information

S1 Table. Aggregated absolute error and absolute Proportional Error.

Mean (standard deviation, [25th–75th percentile]) for the entire study period, disaggregated by location and season. US national has ILIp has 11 fewer dates than the regions. Overall and location aggregations exclude 2012/13 season.


S2 Table. Aggregated measures including 2012/13 season.

Mean (SD) in Squared Error, Absolute Proportional Error and Absolute Error for all locations. Unlike Table 1 and S1 Table, this includes 2012/13 season.


S3 Table. MAPE and MAE in one-week ahead estimates.

Change column indicates percentage reduction in mean error by regressing GFT on lagged ILIp and lagged GFT. 2012/13 was excluded while aggregating overall and by region. Paired Wilcoxon signed rank tests for the hypothesis that errors in GFT (Z) are greater than errors in corrected GFT () were performed; cases where p > .05 are denoted by * and † for MAPE and MAE respectively.


S4 Table. MAPE and MAE of near-term forecasts.

The lower error in each row is underscored. An asterisk (*) indicates P > .05 with a paired Wilcoxon signed rank test for MAPE and † indicates P > .05 for MAE.


S5 Table. MSE, MAPE and MAE of near-term forecasts generated for ILIp and ILIp+GFT.

The lower error in each row is underlined. Unlike Table 3 and S4 Table, this excludes 2012/13 season. Disaggregation by season is not shown as they are identical to errors reported in Table 3 and S4 Table.


S6 Table. MSE of near-term forecasts generated for ILIp and ILIp+uncorrGFT.

The last two columns show mean squared errors with the 2012/13 season excluded from the aggregations. Due to the large errors in GFT during the 2012/13 season, in aggregations that include forecast errors from this season all other forecasts are overwhelmed and ILIp models almost always outperform. With 2012/13 excluded errors between ILIp and ILIp+ uncorrGFT are comparable.


S1 Fig. Absolute proportional errors from GFT (x-axis) and ILIp during the 2014/15 season.

The green data point shows the error during week of maximum weekly ILIf—the peak week—and the remaining data points are color coded by their distance from peak week. The black triangle shows the mean error for the entire season. 2014/15 was the only season for which GFT estimates were generated with the final version of the GFT model.


S2 Fig. Squared errors of GFT estimates and ILIp.

x-axis shows week relative to peak, with the negative sign indicating weeks preceding peak. The box shows the interquartile range, the horizontal line indicates the median.


S3 Fig. MAPE of GFT observed in US states.

The top left panel, Overall, shows average errors across 5 seasons and each of the other panels is limited to one season. The data points are color coded by population size and ordered by overall error (high to low). The black line shows the errors from corresponding HHS regions.


S4 Fig. Comparison of estimates and errors at US national during 5 seasons.

A) Plot of estimates from ILIf (in black), ILIp (in blue), GFT (in red) and Corrected GFT (in orange). The vertical line indicates the week of peak ILIf; B) Corresponding errors relative to ILIf as reference.


S5 Fig. MAPE of near term forecasts for ILIp and ILIp+GFT models.

The data points are color coded by target. Points below the diagonal (broken black line) indicate instances where forecast quality improved with the use of GFT. Each panel is for one of the locations.


S1 Data. An archive of near-term forecasts and corresponding errors ILIp and ILIp+GFT models.



We thank Google Health Trends team for access to the API and coordinators of the Epidemic Prediction Initiative for access to versioned ILI incidence data.


  1. 1. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012–4. pmid:19020500
  2. 2. Dredze M, Cheng R, Paul MJ, Broniatowski D, editors. HealthTweets. org: a platform for public health surveillance using Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence; 2014.
  3. 3. Farrow D. Modeling the Past, Present, and Future of Influenza [Doctoral dissertation]: Carnegie Mellon University; 2016.
  4. 4. Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, Brownstein JS. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol. 2015;11(10):e1004513. pmid:26513245
  5. 5. Kandula S, Hsu D, Shaman J. Subregional nowcasts of seasonal influenza using search trends. Journal of Medical Internet Research. 2017;19(11).
  6. 6. Smolinski MS, Crawley AW, Baltrusaitis K, Chunara R, Olsen JM, Wójcik O, et al. Flu Near You: crowdsourced symptom reporting spanning 2 influenza seasons. American journal of public health. 2015;105(10):2124–30. pmid:26270299
  7. 7. Polgreen PM, Chen Y, Pennock DM, Nelson FD, Weinstein RAJCid. Using internet searches for influenza surveillance. 2008;47(11):1443–8.
  8. 8. Vergu E, Grais RF, Sarter H, Fagot J-P, Lambert B, Valleron A-J, et al. Medication sales and syndromic surveillance, France. 2006;12(3):416.
  9. 9. Viboud C, Charu V, Olson D, Ballesteros S, Gog J, Khan F, et al. Demonstrating the use of high-volume electronic medical claims data to monitor local and regional influenza activity in the US. 2014;9(7):e102429.
  10. 10. Althouse BM, Scarpino SV, Meyers LA, Ayers JW, Bargsten M, Baumbach J, et al. Enhancing disease surveillance with novel data streams: challenges and opportunities. 2015;4(1):17.
  11. 11. Simonsen L, Gog JR, Olson D, Viboud CJTJoid. Infectious disease surveillance in the big data era: towards faster and locally relevant systems. 2016;214(suppl_4):S380–S5.
  12. 12. Flu Trends Team. Google Flu Trends Available from:
  13. 13. Google Trends Team. Extended Health Trends API Available from:
  14. 14. Lazer D, Kennedy R, King G, Vespignani A. The parable of Google Flu: traps in big data analysis. Science. 2014;343(6176):1203–5. pmid:24626916
  15. 15. Lohr S. Google Flu Trends: The Limits of Big Data. The New York Times. 2014.
  16. 16. Arthur C. Google Flu Trends is no longer good at predicting flu, scientists find. The Guardian. 2014.
  17. 17. Ward A. Google catches cold as debate over 'big data hubris' rages. The Financial Times. 2014.
  18. 18. Hodson H. Google Flu Trends gets it wrong three years running. Elsevier; 2014.
  19. 19. Salzberg S. Why Google flu is a failure. Forbes com [online]. 2014:03–24.
  20. 20. Walsh B. Google’s Flu Project shows the failings of big data. Time com. 2014;1.
  21. 21. Olson DR, Konty KJ, Paladini M, Viboud C, Simonsen L. Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales. PLoS Comput Biol. 2013;9(10):e1003256. pmid:24146603
  22. 22. Pollett S, Boscardin WJ, Azziz-Baumgartner E, Tinoco YO, Soto G, Romero C, et al. Evaluating Google Flu Trends in Latin America: important lessons for the next phase of digital disease detection. Clinical Infectious Diseases. 2017;64(1):34–41. pmid:27678084
  23. 23. U.S. Department of Health & Human Services Regional Offices. Available from:
  24. 24. PHI_Research_Lab. Epidemic Prediction Initiative: FluSight 2015–2016 Data archive [Available from:
  25. 25. Brooks LC, Farrow DC, Hyun S, Tibshirani RJ, Rosenfeld RJPcb. Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions. 2018;14(6):e1006134.
  26. 26. Reich NG, Brooks LC, Fox SJ, Kandula S, McGowan CJ, Moore E, et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. 2019;116(8):3146–54.
  27. 27. Paul MJ, Dredze M, Broniatowski D. Twitter improves influenza forecasting. PLOS Currents Outbreaks. 2014.
  28. 28. Santillana M, Zhang DW, Althouse BM, Ayers JW. What can digital disease detection learn from (an external revision to) Google Flu Trends? American journal of preventive medicine. 2014;47(3):341–7. pmid:24997572
  29. 29. Centers for Disease Control and Prevention: Overview of Influenza Surveillance in the United States; (08/10/2016). Available from:
  30. 30. Centers for Disease Control and Prevention, National Notifiable Diseases Surveillance System: MMWR Weeks Available from:
  31. 31. Centers for Disease Control and Prevention: FluView Interactive (08/10/2016). Available from:
  32. 32. PHI Research Lab. Epidemic Prediction Initiative Available from:
  33. 33. DELPHI Research Group. Epidemiological Data API Available from:
  34. 34. Google Correlate Available from:
  35. 35. Mohebbi M, Vanderkam D, Kodysh J, Schonberger R, Choi H, Kumar S. Google correlate whitepaper. 2011.
  36. 36. Hyndman R, Khandakar Y. Automatic Time Series Forecasting: The Forecast Package for R 7. 2008. 2007.
  37. 37. R Core Team. R: A language and environment for statistical computing. 3.5.0 ed. Vienna, Austria: R Foundation for Statistical Computing; 2018.
  38. 38. Hyndman R. Forecasting functions for time series and linear models, R package version 6.1. 2015.
  39. 39. Kwiatkowski D, Phillips PC, Schmidt P, Shin Y. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of econometrics. 1992;54(1–3):159–78.
  40. 40. Canova F, Hansen BE. Are seasonal patterns constant over time? A test for seasonal stability. Journal of Business & Economic Statistics. 1995;13(3):237–52.
  41. 41. Akaike H. A new look at the statistical model identification. IEEE transactions on automatic control. 1974;19(6):716–23.
  42. 42. Breiman L. Random forests. Machine learning. 2001;45(1):5–32.
  43. 43. Breiman L. Manual on setting up, using, and understanding random forests v3. 1. Statistics Department University of California Berkeley, CA, USA. 2002;1.
  44. 44. Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.
  45. 45. Bauer DF. Constructing confidence sets using rank statistics. Journal of the American Statistical Association. 1972;67(339):687–90.
  46. 46. Hollander M, Wolfe DA. Nonparametric statistical methods. 1999.
  47. 47. Wilcoxon F. Individual comparisons by ranking methods. Biometrics bulletin. 1945;1(6):80–3.
  48. 48. Kandula S, Yamana T, Pei S, Yang W, Morita H, Shaman JJJoTRSI. Evaluation of mechanistic and statistical methods in forecasting influenza-like illness. 2018;15(144):20180174.
  49. 49. Osthus D, Daughton AR, Priedhorsky RJPcb. Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited. 2019;15(2):e1006599.