Figure 1.
System for estimating influenza prevalence from Twitter.
This figure shows the components of our system for estimating influenza prevalence from Twitter. A stream of tweets matching hundreds of health-related keywords is passed through three classification filters that remove irrelevant tweets. Tweet locations are identified with our geolocation system, Carmen, and only tweets from the location of interest are retained. The volume of the retained tweets is then normalized by the total volume of tweets in a random sample of Twitter to produce a prevalence measure.
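The filtering, geolocation, and normalization steps described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the tweet dictionaries, the `week` field, the predicate functions in `filters`, and the `resolve_location` callable (a stand-in for a geolocation step such as Carmen) are all hypothetical.

```python
from collections import Counter


def keep_tweet(tweet, filters, resolve_location, target_location):
    """Return True if the tweet passes every filter and resolves to the target location.

    `filters` is a sequence of hypothetical predicates standing in for the
    three classification filters; `resolve_location` is a hypothetical
    stand-in for the geolocation step.
    """
    if not all(passes(tweet) for passes in filters):
        return False
    return resolve_location(tweet) == target_location


def weekly_prevalence(candidate_tweets, sample_tweets, filters,
                      resolve_location, target_location):
    """Divide the weekly count of retained tweets by the weekly count in a random sample."""
    kept = Counter(
        t["week"] for t in candidate_tweets
        if keep_tweet(t, filters, resolve_location, target_location)
    )
    totals = Counter(t["week"] for t in sample_tweets)
    # Only weeks present in the random sample get a value, so the
    # denominator is never zero; weeks with no retained tweets yield 0.0.
    return {week: kept[week] / total for week, total in sorted(totals.items())}
```

The sketch divides by the full random sample for each week. Whether the sample should also be restricted to the location of interest is not stated in this caption, so that choice is left open here.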
Figure 2.
2012–2013 national influenza rates from Twitter and CDC surveillance.
This figure shows national influenza rates for the United States as estimated by two Twitter-based algorithms, alongside influenza-like illness (ILI) surveillance network data from the US Centers for Disease Control and Prevention (CDC). The dashed blue line is the estimate from a simple keyword-matching model, while the solid blue line is the estimate from our new infection detection model. Our new algorithm tracks the CDC data (solid black line) more closely, whereas the simpler keyword model produces spurious spikes caused by other Twitter chatter, for example in early December and early April.
Figure 3.
Cross-correlation between Twitter infection rates and CDC ILI rates.
This figure shows the cross-correlation function for the residuals of the second-order-differenced CDC and Twitter infection data, as described in the “Time Series Analysis” section. These results show that the Twitter estimates neither lead nor lag the CDC ILI rates, although the Twitter data are publicly available up to two weeks earlier than CDC data.
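The lag analysis summarized in this caption can be illustrated with a short sketch. It is a simplification under stated assumptions: it differences both series twice and computes the sample cross-correlation directly, and it omits the model-fitting step that produces the residuals described in the "Time Series Analysis" section. The function names and the default `max_lag` are hypothetical.

```python
import numpy as np


def cross_correlation(x, y, max_lag):
    """Sample cross-correlation of x and y at lags -max_lag..max_lag.

    Lag k pairs x[t + k] with y[t], so a peak at a negative lag means
    that x moves before y. Both series must be non-constant.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be one-dimensional arrays of equal length")
    x = x - x.mean()
    y = y - y.mean()
    n = len(x)
    denom = n * x.std() * y.std()
    ccf = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            num = np.sum(x[k:] * y[:n - k])
        else:
            num = np.sum(x[:n + k] * y[-k:])
        ccf[k] = num / denom
    return ccf


def differenced_ccf(twitter, cdc, max_lag=3):
    """Second-order difference both weekly series, then cross-correlate them."""
    d_twitter = np.diff(twitter, n=2)
    d_cdc = np.diff(cdc, n=2)
    return cross_correlation(d_twitter, d_cdc, max_lag)
```

Under this convention, a cross-correlation function that peaks at lag 0 corresponds to the finding reported here, that the Twitter estimates neither lead nor lag the CDC ILI rates.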