Global Disease Monitoring and Forecasting with Wikipedia

Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.


Motivation and overview
Infectious disease remains extremely costly in both human and economic terms. For example, the majority of global child mortality is due to conditions such as acute respiratory infection, measles, diarrhea, malaria, and HIV/AIDS [1]. Even in developed countries, infectious disease has great impact; for example, each influenza season costs the United States between 3,000 and 49,000 lives [2] and an average of $87 billion in reduced economic output [3].
Effective and timely disease surveillance (that is, detecting, characterizing, and quantifying the incidence of disease) is a critical component of prevention and mitigation strategies that can save lives, reduce suffering, and minimize impact. Traditionally, such monitoring takes the form of patient interviews and/or laboratory tests followed by a bureaucratic reporting chain; while generally considered accurate, this process is costly and introduces a significant lag between observation and reporting.
These problems have motivated new surveillance techniques based upon internet data sources such as search queries and social media posts. Essentially, these methods use large-scale data mining techniques to identify health-related activity traces within the data streams, extract them, and transform them into some useful metric. The basic approach is to train a statistical estimation model against ground truth data, such as ministry of health disease incidence records, and then apply the model to generate estimates when the true data are not available, e.g., when forecasting or when the true data have not yet been published. This has proven effective and has spawned operational systems such as Google Flu Trends (http://www.google.org/flutrends/). However, four key challenges remain before internet-based disease surveillance models can be reliably integrated into a decision-making toolkit:
C1. Openness. Models should afford review, replication, improvement, and deployment by third parties. This guarantees a high-quality scientific basis, continuity of operations, and broad applicability. These requirements imply that model algorithms (in the form of source code, not research papers) must be generally available, and they also imply that complete input data must be available. The latter is the key obstacle, as terms are dictated by the data owner rather than the data user; this motivated our exploration of Wikipedia access logs. To our knowledge, no models exist that use both open data and open algorithms.
C2. Breadth. Dozens of diseases in hundreds of countries have sufficient impact to merit surveillance; however, adapting a model from one disease-location context to another can be costly, and resources are often, if not usually, unavailable to do so. Thus, models should be cheaply adaptable, ideally by simply entering new incidence data for training. While most published models afford this flexibility in principle, few have been expressly tested for this purpose.
C3. Transferability. Many contexts have insufficient reliable incidence data to train a model (for example, the relevant ministry of health might not track the disease of interest), and in fact these are the contexts where new approaches are of the greatest urgency. Thus, trained models should be translatable to new contexts using alternate, non-incidence data such as a bilingual dictionary or census demographics. To our knowledge, no such models exist.
C4. Forecasting. Effective disease response depends not only on the current state of an outbreak but also its future course. That is, models should provide not only estimates of the current state of the world (nowcasts) but also forecasts of its future state.
While recent work in disease forecasting has made significant strides in accuracy, forecasting the future of an outbreak is still a complex affair that is sharply limited in contexts with insufficient data or insufficient understanding of the biological processes and parameters underpinning the outbreak. In these contexts, a simpler statistical approach based on leading indicators in internet data streams may improve forecast availability, quality, and time horizon. Prior evaluations of such approaches have yielded conflicting results and to our knowledge have not been performed at time granularity finer than one week.
In order to address these challenges, we propose a new approach based on freely available Wikipedia article access logs. In the current proof of concept, we use language as a proxy for location, but we hope that access data explicitly aggregated by geography will become available in the future. (Our implementation is available as open source software: http://github.com/reidpr/quac.) To demonstrate the feasibility of techniques built upon this data stream, we built linear models mapping daily access counts of encyclopedia articles to case counts for 7 diseases in 9 countries, for a total of 14 contexts. Even a simple article selection method was successful in 8 of the 14 contexts, yielding models with r² up to 0.89 in nowcasting and 0.92 in forecasting, with most of the successful contexts having forecast value up to the tested limit of 28 days. Specifically, we argue that approaches based on this data source can overcome the four challenges as follows:
C1. Anyone with relatively modest computing resources can download the complete Wikipedia dataset and keep it up to date. The data can also be freely shared with others.
C2. In cases where estimation is practical, our approach can be adapted to a new context by simply supplying a reliable incidence time series and selecting input articles. We demonstrate this by computing effective models for several different contexts even with a simple article selection procedure. Future, more powerful article selection procedures will increase the adaptability of the approach.
C3. In several instances, our models for the same disease in different locations are very similar; i.e., correlations between different language versions of the same article and the corresponding local disease incidence are similar. This suggests that simple techniques based on inter-language article mappings or other readily available data can be used to translate models from one context to another without re-training.
C4. Even our simple models show usefully high r² when forecasting a few days or weeks into the future. This suggests that the general approach can be used to build short-term forecasts with reasonably tight confidence intervals.
In short, this paper offers two key arguments. First, we evaluate the potential of an emerging data source, Wikipedia access logs, for global disease surveillance and forecasting in more detail than was previously available, and we argue that the openness and other properties of these data have important scientific and operational benefits. Second, using simple proof-of-concept experiments, we demonstrate that statistical techniques effective for estimating disease incidence using previous internet data are likely to also be effective using Wikipedia access logs.
We turn next to a more thorough discussion of prior work, both to set the stage for the current work and to outline in greater detail the state of the art's relationship to the challenges above. Following that, we cover our methods and data sources, results, and a discussion of implications and future work.

Related work
Our paper draws upon prior scholarly and practical work in three areas: traditional patient- and laboratory-based disease surveillance, Wikipedia-based measurement of the real world, and internet-based disease surveillance.

Traditional disease surveillance
Traditional forms of disease surveillance are based upon direct patient contact or biological tests taking place in clinics, hospitals, and laboratories. The majority of current systems rely on syndromic surveillance data (i.e., about symptoms) including clinical diagnoses, chief complaints, school and work absenteeism, illness-related 911 calls, and emergency room admissions [4].
For example, a well-established measure for influenza surveillance is the fraction of patients with influenza-like illness, abbreviated simply ILI. A network of outpatient providers reports the total number of patients seen and the number who present with symptoms consistent with influenza that have no other identifiable cause [5]. Similarly, other electronic resources have emerged, such as the Electronic Surveillance System for the Early Notification of Community Based Epidemics (ESSENCE), based on real-time data from the Department of Defense Military Health System [6], and BioSense, based on data from the Department of Veterans Affairs, the Department of Defense, retail pharmacies, and Laboratory Corporation of America [7]. These systems are designed to facilitate early detection of disease outbreaks as well as response to harmful health effects, exposure to disease, or related hazardous conditions.
Clinical labs play a critical role in surveillance of infectious diseases. For example, the Laboratory Response Network (LRN), consisting of over 120 biological laboratories, provides active surveillance of a number of disease agents in humans ranging from mild (e.g., non-pathogenic E. coli and Staphylococcus aureus) to severe (e.g., Ebola and Marburg), based on clinical or environmental samples [4]. Other systems monitor non-traditional public health indicators such as school absenteeism rates, over-the-counter medication sales, 911 calls, veterinary data, and ambulance run data. For example, the Early Aberration Reporting System (EARS) provides national, state, and local health departments alternative detection approaches for syndromic surveillance [8].
The main value of these systems is their accuracy. However, they have a number of disadvantages, notably cost and timeliness: for example, each ILI datum requires a practitioner visit, and ILI data are published only after a delay of 1-2 weeks [5].

Wikipedia
Wikipedia is an online encyclopedia that has, since its founding in 2001, grown to contain approximately 30 million articles in 287 languages [9]. In recent years, it has consistently ranked as a top-10 website; as of this writing, it is the 6th most visited website in the world and the most visited site that is not a search engine or social network [10], serving roughly 850 million article requests per day [11]. For numerous search engine queries, a Wikipedia article is the top result.
Wikipedia contrasts with traditional encyclopedias on two key dimensions: it is free of charge to read, and anyone can make changes that are published immediately; review is performed by the community after publication. (This is true for the vast majority of articles. Particularly controversial articles, such as "George W. Bush" or "Abortion", have varying levels of edit protection.) While this surprising inversion of the traditional review-publish cycle would seem to invite all manner of abuse and misinformation, Wikipedia has developed effective measures to deal with these problems and is of similar accuracy to traditional encyclopedias such as Britannica [12].
Wikipedia article access logs have been used for a modest variety of research. The most common application is detection and measurement of popular news topics or events [13][14][15][16][17]. The data have also been used to study the dynamics of Wikipedia itself [18][19][20]. Social applications include evaluating toponym importance in order to make type size decisions for maps [21], measuring the flow of concepts across the world [22], and estimating the popularity of politicians and political parties [23]. Finally, economic applications include attempts to forecast movie ticket sales [24] and stock prices [25]. The latter two applications are of particular interest because they include a forecasting component, as the present work does.
In the context of health information, the most prominent research direction focuses on assessing the quality of Wikipedia as a health information source for the public, e.g., with respect to cancer [26,27], carpal tunnel syndrome [28], drug information [29], and kidney conditions [30]. To our knowledge, only four health studies exist that make use of Wikipedia access logs. Tausczik et al. examined public "anxiety and information seeking" during the 2009 H1N1 pandemic, in part by measuring traffic to H1N1-related Wikipedia articles [31]. Laurent and Vickers evaluated Wikipedia article traffic for disease-related seasonality and in relation to news coverage of health issues, finding significant effects in both cases [32]. Aitken et al. found a correlation between drug sales and Wikipedia traffic for a selection of approximately 5,000 health-related articles [33]. None of these three studies proposes a time-series model mapping article traffic to disease metrics.
The fourth study is a recent article by McIver and Brownstein, which uses statistical techniques to estimate the influenza rate in the United States from Wikipedia access logs [34]. In the next section, we compare and contrast this article with the present work in the context of a broader discussion of such techniques.
In summary, use of Wikipedia access logs to measure real-world quantities is beginning to emerge, as is interest in Wikipedia for health purposes. However, to our knowledge, use of the encyclopedia for quantitative disease surveillance remains at the earliest stages.

Internet-based disease surveillance
Recently, new forms of surveillance based upon the social internet have emerged; these data streams are appealing in large part because of their real-time nature and the low cost of information extraction, properties complementary to traditional methods. The basic insight is that people leave traces of their online activity related to health observations, and these traces can be captured and used to derive actionable information. Two main classes of trace exist: sharing, such as social media mentions of face mask use [35], and health-seeking behavior, such as Web searches for health-related topics [36]. (In fact, there is evidence that the volume of internet-based health-seeking behavior dwarfs traditional avenues [37,38].)
In this section, we focus on the surveillance work most closely related to our efforts, specifically, that which uses existing single-source internet data feeds to estimate some scalar disease-related metric. For example, we exclude from detailed analysis work that provides only alerts [39,40], measures public perception of a disease [41], includes disease dynamics in its model [42], evaluates a third-party method [43], uses non-single-source data feeds [39,44], or crowd-sources health-related data ("participatory disease surveillance") [45,46]. We also focus on work that estimates biologically-rooted metrics. For example, we exclude metrics based on seasonality [47,48] and over-the-counter drug sales volume, itself a proxy [49].
These activity traces are embedded in search queries [36,, social media messages [77][78][79][80][81][82][83][84][85][86][87][88][89][90][91][92], and web server access logs [34,72,93].At a basic level, traces are extracted by counting query strings, words or phrases, or web page URLs that are related to the metric of interest, forming a time series of occurrences for each item.A statistical model is then created that maps these input time series to a time series estimating the metric's changing value.This model is trained on time period(s) when both the internet data and the true metric values are available and then applied to estimate the metric value over time period(s) when it is not available, i.e., forecasting the future, nowcasting the present, and anti-forecasting the past (the latter two being useful in cases where true metric availability lags real time).
Closely related to the present work is an independent, simultaneous effort by McIver & Brownstein to measure influenza in the United States using Wikipedia access logs [34]. This study used Poisson models fitted with LASSO regression to estimate ILI over a 5-year period. The results, Pearson's r of 0.94 to 0.99 against official data, depending on model variation, compare quite favorably to prior work that tries to replicate official data. More generally, this article's statistical methods are more sophisticated than those employed in the present study. However, we offer several key improvements:
• We evaluate 14 location-disease contexts around the globe, rather than just one. In doing so, we test the use of language as a location proxy, which was noted briefly as future work in McIver & Brownstein. (However, as we detail below, we suspect this is not a reliable geo-location method for the long term.)
• We test our models for forecasting value, which was again mentioned briefly as future work in McIver & Brownstein.
• We evaluate models for translatability from one location to another.
• We present negative results and use these to begin exploring when internet-based disease surveillance methods might and might not work.
• We offer a systematic, well-specified, and simple procedure to select articles for model inclusion.
• We normalize article traffic by total language traffic rather than using a few specific articles as a proxy for it.
• Our software is open source and has only freely available dependencies, while the McIver & Brownstein code is not available and depends on proprietary components (Stata).
Finally, the goals of the two studies differ. McIver & Brownstein wanted to "develop a statistical model to provide near-time estimates of ILI activity in the US using freely available data gathered from the online encyclopedia Wikipedia" [34, p. 2]. Our goals are to assess the applicability of these data to global disease surveillance for operational public health purposes and to lay out a research agenda for achieving this end.
These methods are the basis for at least one deployed, widely used surveillance system. Based upon search query data, Google Flu Trends offers near-real-time estimates of influenza activity in 29 countries across the world (15 at the province level); another facet of the same system, Google Dengue Trends (http://www.google.org/denguetrends/), estimates dengue activity in 9 countries (2 at the province level) in Asia and Latin America.
Having laid out the space of quantitative internet disease surveillance as it exists to the best of our knowledge, we now consider this prior work in the context of our four challenges:
C1. Openness. Deep access to search queries from Baidu, a Chinese-language search engine serving mostly the Chinese market (http://www.baidu.com) [64,74,76]; Google [36, 50-54, 56-60, 65-67, 69-73, 75]; Yahoo [55,68]; and Yandex, a search engine serving mostly Russia and Slavic countries in Russian (http://www.yandex.ru), English (http://www.yandex.com), and Turkish (http://www.yandex.com.tr) [75], as well as purpose-built health website search queries [61][62][63] and access logs [72,93], is available only to those within the organizations, upon payment of an often-substantial fee, or by some other special arrangement. While tools such as Baidu Index (http://index.baidu.com), Google Trends (http://www.google.com/trends/), Google Correlate (http://www.google.com/trends/correlate/), and Yandex's WordStat (http://wordstat.yandex.com) provide a limited view into specific search queries and/or time periods, as do occasional lower-level data dumps offered for research, neither affords the large-scale, broad data analysis that drives the most effective models.
The situation is only somewhat better for surveillance efforts based upon Twitter [77][78][79][80][81][82][83][84][85][86][87][88][89][90][91][92]. While a small portion of the real-time message stream (1%, or 10% for certain grandfathered users) is available outside the company without substantial fees, terms of use prohibit sharing historical data needed for calibration between researchers. Access rules are similar or significantly more restrictive for alternative social media platforms such as Sina Weibo, the leading Chinese microblogging site (http://weibo.com), and Facebook. Consistent with this, we were able to find no research meeting our inclusion criteria based on either of these extremely popular systems.
We identified only one prior effort making use of open data, McIver & Brownstein with Wikipedia access logs [34]. Open algorithms in this field of inquiry are also very limited. Of the works cited above, again only one, Althouse et al. [50], claims general availability of their algorithms in the form of open source code.
Finally, we highlight the quite successful Google Flu and Dengue Trends as a case study in the problems of closed data and algorithms. First, because their data and algorithms are proprietary, there is little opportunity for the wider community of expertise to offer peer review or improvements (for example, the list of search terms used by Dengue Trends has never been published, even in summary form); the importance of these opportunities is highlighted by the system's well-publicized estimation failures during the 2012-2013 flu season [94] as well as more comprehensive scholarly criticisms [43]. Second, only Google can choose the level of resources to spend on Trends; no one else, regardless of their available resources, can add new contexts or take on operational responsibility should Google choose to discontinue the project.
C2. Breadth. While in principle these surveillance approaches are highly generalizable, nearly all extant efforts address a small set of diseases in a small set of countries, without testing specific methods to expand these sets.
The key exception is Paul & Dredze [91], which proposes a content-based method, the ailment topic aspect model (ATAM), to automatically discover a theoretically unbounded set of medical conditions mentioned in Twitter messages. This unsupervised machine learning algorithm, similarly to latent Dirichlet allocation (LDA) [95], accumulates co-occurring words into probabilistic topics. Lists of health-related lay keywords, as well as the text of health articles written for a lay audience, are used to ensure that the algorithm builds topics related to medical issues. A test of the method discovered 15 coherent condition topics including infectious diseases such as influenza, non-infectious diseases such as cancer, and non-specific conditions such as aches and pains. The influenza topic's time series correlated very well with ILI data in the United States.
However, we identify three drawbacks of this approach. First, significant curated text input data in the target language are required; second, output topics require expert interpretation; and third, the ATAM algorithm has several parameters that require expert tuning. That is, in order to adapt the algorithm to a new location and/or language, expertise in both machine learning and the target language is required.
In summary, to our knowledge, no disease measurement algorithms have been proposed that are extensible to new disease-location contexts solely by adding examples of desired output. We propose a path to such algorithms.
C3. Transferability. To our knowledge, no prior work offers trained models that can be translated from one context to another. We propose using the inter-language article links provided in Wikipedia to accomplish this translation.
C4. Forecasting. A substantial minority of the efforts in this space test some kind of forecasting method. (Note that many papers use the term predict, and some even misuse forecast, to indicate nowcasting.) In addition to forecasting models that incorporate disease dynamics (recall that these are out of scope for the current paper), two basic classes of forecasting exist: lag analysis, where the internet data are simply time-shifted in order to capture leading signals, and statistical forecast models such as linear regression.
Lag analysis has shown mixed results in prior work. Johnson et al. [93], Pelat et al. [67], and Jia-xing et al. [64] identified no reliable leading signals. On the other hand, Polgreen et al. [68] used lag analysis with a shift granularity of one week to forecast positive influenza cultures as well as influenza and pneumonia mortality with a horizon of 5 weeks or more (though these indicators may trail the onset of symptoms significantly). Similarly, Xu et al. [72] reported evidence that lag analysis may be able to forecast HFMD by up to two months, and Yang et al. [73] used lag analysis with a granularity of one month to identify search queries that lead suicide incidence by up to two months.
The more complex method of statistical forecast models appears potentially fruitful as well. Dugas et al. tested several statistical methods using positive influenza tests and Google Flu Trends to make 1-week forecasts [57], and Kim et al. used linear regression to forecast influenza on a horizon of 1 month [86].
In summary, while forecasts based upon models that include disease dynamics are clearly useful, sometimes this is not possible because important disease parameters are insufficiently known. Therefore, it is still important to pursue simple methods. The simplest is lag analysis; our contribution is to evaluate leading information more quantitatively than previously attempted. Specifically, we are unaware of previous analysis with shift granularity less than one week; in contrast, our analysis tests daily shifting even if official data are less granular, and each shift is an independently computed model; thus, our ±28-day evaluation results in 57 separate models for each context.
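The daily-shift evaluation described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a single input series rather than a set of articles, uses the r² of a one-variable least-squares fit as the quality measure, and adopts the (assumed) convention that a positive shift pairs traffic on day t with incidence on day t + shift. The function names `r_squared` and `shifted_r_squared` are hypothetical.

```python
def r_squared(x, y):
    """r^2 of a one-variable least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no usable signal
    return (sxy * sxy) / (sxx * syy)


def shifted_r_squared(traffic, incidence, max_shift=28):
    """Fit one independent model per daily shift in [-max_shift, max_shift].

    Both inputs are daily series of equal length. A positive shift pairs
    traffic on day t with incidence on day t + shift (assumed to mean
    forecasting); a negative shift pairs it with earlier incidence.
    Returns {shift: r^2}; with max_shift=28 that is 57 models.
    """
    n = len(traffic)
    results = {}
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            x, y = traffic[: n - shift], incidence[shift:]
        else:
            x, y = traffic[-shift:], incidence[: n + shift]
        results[shift] = r_squared(x, y)
    return results
```

If official data are coarser than daily, they would first need to be expanded to a daily series; how best to do that is left open here.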
Overall, significant gaps remain with respect to the challenges blocking a path to an open, deployable, quantitative internet-based disease surveillance system. In this paper, we propose a path to overcoming these challenges and offer evidence demonstrating that this path is plausible.

Methods
We used two data sources, Wikipedia article access logs and official disease incidence reports, and built linear models to analyze approximately 3 years of data for each of 14 disease-location contexts. This section details the nature, acquisition, and processing of these data as well as how we computed the estimation models and evaluated their output.

Wikipedia article access logs
Access logs for all Wikipedia articles are available in summary form to anyone who wishes to use them. We used the complete logs available at http://dumps.wikimedia.org/other/pagecounts-raw/. Web interfaces offering a limited view into the logs, such as http://stats.grok.se, are also available. These data are referred to using a variety of terms, including article views, article visits, pagecount files, page views, pageviews, page view logs, and request logs.
These summary files contain, for each hour from December 9, 2007 to present and updated in real time, a compressed text file listing the number of requests for every article in every language, except that articles with no requests are omitted. (This request count differs from the true number of human views due to automated requests, proxies, pre-fetching, people not reading the article they loaded, and other factors. However, this commonly used proxy for human views is the best available.) We analyzed data from March 7, 2010 through February 1, 2014 inclusive, a total of 1,428 days. This dataset contains roughly 34,000 data files totaling 2.7 TB. A total of 266 hours, or 0.8% of the data, are missing, with the largest gap being 85 hours. These missing data were treated as zero; because they were few, this has minimal effect on our analyses.
Table 1. The 7 diseases in 9 locations analyzed, for a total of 14 disease-location contexts. For each context, we list the language used as a location proxy, the inclusive start and end dates of analysis, the resolution of the disease incidence data, and one or more citations for those data.
We normalized these request counts by language. This yielded, for each article, a time series containing the number of requests for that article during each hour, expressed as a fraction of the hour's total requests for articles in the language. This normalization also compensates for periods of request undercounting, when up to 20% fewer requests were counted than served [96]. Finally, we transposed the data using Map-Reduce [97] to produce files from which the request count time series of any article can be retrieved efficiently.
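As a rough illustration of the per-language normalization, the following sketch assumes the hourly counts have already been parsed into in-memory dictionaries keyed by (language, title); that layout and the function names are hypothetical and are not taken from our software. Missing hours and unlisted articles yield zero, matching the treatment of missing data described above.

```python
from collections import defaultdict


def normalize_hour(counts):
    """Convert one hour of raw counts into per-language fractions.

    counts maps (language, title) -> request count for a single hour.
    Each count is divided by that hour's total for the same language.
    """
    totals = defaultdict(int)
    for (lang, _title), n in counts.items():
        totals[lang] += n
    return {
        key: n / totals[key[0]]
        for key, n in counts.items()
        if totals[key[0]] > 0
    }


def article_series(hours, counts_by_hour, lang, title):
    """Build the normalized hourly series for one article.

    hours is the ordered list of hours to cover; counts_by_hour maps an
    hour to that hour's counts dictionary. Hours with no data, and
    articles absent from an hour, are treated as zero.
    """
    series = []
    for hour in hours:
        fractions = normalize_hour(counts_by_hour.get(hour, {}))
        series.append(fractions.get((lang, title), 0.0))
    return series
```

At the scale of the full dataset this computation would be distributed rather than held in memory; the sketch only shows the arithmetic.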

Disease incidence data
Our goal was to evaluate a broad selection of diseases in a variety of countries across the world, in order to test the global applicability and disease agnosticism of our proposed technique. For example, we sought diseases with diverse modes of transmission (e.g., airborne droplet, vector, sexual, and fecal-oral), biology (virus, bacteria, protozoa), types of symptoms, length of incubation period, seasonality, and prevalence. Similarly, we sought locations in both the developed and developing world in various climates. Finally, we wanted to test each disease in multiple countries, to provide an opportunity for comparison.
These comprehensive desiderata were tempered by the realities of data availability. First, we needed reliable data establishing incidence ground truth for specific diseases in specific countries and at high temporal granularity; such official data are frequently not available for locations and diseases of interest. We used official epidemiological reports available on websites of government public health agencies as well as the World Health Organization (WHO).
Second, we needed article access counts for specific countries. This information is not present in the Wikipedia article access logs (i.e., request counts are global totals). However, a proxy is sometimes available in that certain languages are mostly limited to one country of interest; for example, a strong majority of Thai speakers are in Thailand, and the only English-speaking country where plague appears is the United States. In contrast, Spanish is spoken all over the world and thus largely unsuitable for this purpose.
Third, the language edition needs to have articles related to the disease of interest that are mature enough to evaluate and generate sufficient traffic to provide a reasonable signal. With these constraints in mind, we used our professional judgement to select diseases and countries. The resulting list of 14 disease-location contexts, which is designed to be informative rather than comprehensive, is enumerated in Table 1.
These incidence data take two basic forms: (a) tabular files such as spreadsheets mapping days, weeks, or months to new case counts or the total number of infected persons or (b) graphs presenting the same mapping. In the latter case, we used plot digitizing software (Plot Digitizer, http://plotdigitizer.sourceforge.net) to extract a tabular form. We then translated these diverse tabular forms to a consistent spreadsheet format, yielding for each disease-location context a time series of disease incidence (these series are available in supplemental data S1).

Article selection
The goal of our models is to create a linear mapping from the access counts of some set of Wikipedia articles to a scalar disease incidence for some disease-location context. To do so, a procedure for selecting these articles is needed; for the current proof-of-concept work, we used the following:
1. Examine the English-language Wikipedia article for the disease and enumerate the linked articles. Select for analysis the disease article itself along with linked articles on relevant symptoms, syndromes, pathogens, conditions, treatments, biological processes, and epidemiology. For example, the articles selected for influenza include "Influenza", "Amantadine", and "Swine influenza", but not "2009 flu pandemic".
2. Identify the corresponding article in each target language by following the inter-language wiki link; these appear at the lower left of Wikipedia articles under the heading "Languages". For example, the Polish articles selected for influenza include "Grypa", "Amantadyna", and "Świńska grypa", but not "Pandemia grypy A/H1N1 w latach 2009-2010", respectively.
3. Translate each article title into the form that appears in the logs. Specifically, encode the article's Unicode title using UTF-8, percent-encode the result, and replace spaces with underscores. For example, the Polish article "Choroby zakaźne" becomes Choroby_zaka%C5%BAne. This procedure is accomplished by simply copying the article's URL from the web browser address bar.
This procedure has two potential complications. First, an article may not exist in the target language; in this case, we simply omit it. Second, Wikipedia contains null articles called redirects that merely point to another article, called the target of the redirect. These are created to catch synonyms or common misspellings of an article. For example, in English, the article "Flu" is a redirect to "Influenza". When a user visits http://en.wikipedia.org/wiki/Flu, the content served by Wikipedia is actually that of the "Influenza" article; the server does not issue an HTTP 301 response nor require the reader to manually click through to the redirect target.
This complicates our analysis because this arrangement causes the redirect itself ("Flu"), not the target ("Influenza"), to appear in the access log. While in principle we could sum redirect requests into the target article's total, reliably mapping redirects to targets is a non-trivial problem because this mapping changes over time, and in fact Wikipedia's history for redirect changes is not complete [110]. Therefore, we have elected to leave this issue for future work; this choice is supported by our observation below that when target and redirect are reversed, traffic to "Dengue fever" in Thai follows the target.
If we encounter a redirect during the above procedure, we use the target article. The complete selection of articles is available in the supplementary data S1.
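The title translation in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study (which copied URLs from the browser address bar); the function name to_log_title is hypothetical, and the handling of punctuation such as parentheses or colons is an assumption that may differ from what actually appears in the logs.

```python
from urllib.parse import quote


def to_log_title(title: str) -> str:
    """Approximate the log form of a displayed article title (hypothetical helper)."""
    # Replace spaces with underscores, then percent-encode the UTF-8 bytes.
    # quote() encodes as UTF-8 by default and leaves "_" and "/" untouched.
    return quote(title.replace(" ", "_"))


encoded = to_log_title("Choroby zakaźne")
print(encoded)  # Choroby_zaka%C5%BAne
```

The output matches the Polish example given in step 3; titles containing other punctuation should be checked against real log entries before relying on this mapping.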

Building and evaluating each model
Our goal was to understand how well traffic for a simple selection of articles can nowcast and forecast disease incidence. Accordingly, we implemented the following procedure in Python to build and evaluate a model for each disease-location context.
1. Align the hourly article access counts with the daily, weekly, or monthly disease incidence data by summing the hourly counts for each day, week, or month in the incidence time series. This yields article and disease time series with the same frequency, making them comparable. (We ignore time zone in this procedure. Because Wikipedia data are in UTC and incidence data are in unspecified, likely local time zones, this leads to a temporal offset error of up to 23 hours, a relatively small error at the scale of our analysis. Therefore, we ignore this issue for simplicity.)
2. For each candidate article in the target language, compute Pearson's correlation r against the disease incidence time series for the target country.
3. Order the candidates by decreasing |r| and select the best 10 articles.
4. Use ordinary least squares to build a linear multiple regression model mapping accesses to these 10 articles to the disease time series. No other variables were incorporated into the model. Below, we report r² for the multi-article models as well as a qualitative evaluation of success or failure. We also report r for individual articles in the supplementary data S1.
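Steps 2 through 4 can be sketched in plain Python as follows. This is an illustrative sketch rather than the code used in the study: the function names (pearson_r, select_articles, fit_ols), the toy traffic numbers, and the choice to solve the normal equations directly are all assumptions made for the example, and the example keeps 2 articles instead of 10 only because the toy data set is tiny.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def select_articles(traffic, incidence, k=10):
    """Steps 2-3: rank articles by decreasing |r| and keep the best k."""
    scores = {title: pearson_r(series, incidence) for title, series in traffic.items()}
    ranked = sorted(scores, key=lambda t: abs(scores[t]), reverse=True)
    return ranked[:k], scores


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular (e.g. collinear articles).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(columns, y):
    """Step 4: ordinary least squares with an intercept; returns (coefficients, r2)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Toy data (invented for illustration): six intervals of incidence and traffic.
incidence = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
traffic = {
    "Influenza": [10, 21, 29, 41, 50, 61],
    "Fever": [5, 3, 6, 2, 7, 4],
    "Noise": [1, 1, 2, 1, 2, 1],
}
top, scores = select_articles(traffic, incidence, k=2)
beta, r2 = fit_ols([traffic[t] for t in top], incidence)
print(top, round(r2, 3))
```

Solving the normal equations directly is numerically fragile when article series are nearly collinear; a production implementation would more likely rely on a tested least-squares routine from a numerical library.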
In order to test forecasting potential, we repeat the above with the article time series time-shifted from 28 days forward to 28 days backward in 1-day increments. For example, to build a 4-day forecasting model (that is, a model that estimates disease incidence 4 days in the future), we would shift the article time series later by 4 days so that article request counts for a given day are matched against disease incidence 4 days in the future. The choice of ±28 days for lag analysis is based upon our a priori hypothesis that these statistical models are likely effective for a few weeks of forecasting.
We refer to models that estimate current (i.e., same-day) disease incidence as nowcasting models and those that estimate past disease incidence as anti-forecasting models; for example, a model that estimates disease incidence 7 days ago is a 7-day anti-forecasting model. (While useless at first glance, effective anti-forecasting models that give results sooner than official data can still reduce the lead time for action. Also, it is valuable for understanding the mechanism of internet-based models to know the temporal location of predictive information.) We report r² for each time-shifted multi-article model.
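The time shift described above amounts to pairing article traffic on day t with incidence on day t + lag. The following is a minimal sketch under the assumption that both series are daily and aligned on the same start date; the function name shift_pairs and the toy numbers are hypothetical.

```python
def shift_pairs(article, incidence, lag):
    """Pair article[t] with incidence[t + lag].

    lag > 0 gives a forecasting alignment, lag == 0 nowcasting,
    and lag < 0 anti-forecasting. Days without a partner are dropped.
    """
    n = min(len(article), len(incidence))
    xs, ys = [], []
    for t in range(n):
        target = t + lag
        if 0 <= target < n:
            xs.append(article[t])
            ys.append(incidence[target])
    return xs, ys


# Toy daily series (invented for illustration).
article = [1, 2, 3, 4, 5]
incidence = [10, 20, 30, 40, 50]

forecast_x, forecast_y = shift_pairs(article, incidence, 2)   # 2-day forecast
anti_x, anti_y = shift_pairs(article, incidence, -1)          # 1-day anti-forecast

# A full lag analysis would refit the model once per offset, e.g.:
# for lag in range(-28, 29): xs, ys = shift_pairs(article, incidence, lag)
```

Note that each non-zero lag shortens the usable series by |lag| intervals, so r² values at different offsets are computed on slightly different samples.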
Finally, to evaluate whether translating models from one location to another is feasible, we compute a metric r_t for each pair of locations tested on the same disease. This meta-correlation is simply the Pearson's r computed between the correlation scores r of each article found in both languages; the intent is to give a gross notion of similarity between models computed for the same disease in two different languages. A value of 1 means that the two models are identical, 0 means they have no relationship, and -1 means they are opposite. We ignore articles found in only one language because the goal is to obtain a sense of feasibility: given favorable conditions, could one train a model in one location and apply it to another? Table 2 illustrates an example.
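The meta-correlation can be reproduced from the Table 2 example values with a short Python sketch. The function names pearson_r and meta_correlation are hypothetical, and the per-article correlations are the simplified Japanese and Thai influenza figures quoted for Table 2; this is an illustration of the definition, not the study's code.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def meta_correlation(scores_a, scores_b):
    """r_t: Pearson's r over the per-article r values of articles present in both languages."""
    shared = sorted(set(scores_a) & set(scores_b))
    return pearson_r([scores_a[t] for t in shared], [scores_b[t] for t in shared])


# Per-article correlations from the Table 2 example, keyed by English title.
japanese = {"Fever": 0.23, "Chills": 0.59, "Headache": -0.10, "Influenza": 0.85}
thai = {"Fever": 0.21, "Headache": 0.15, "Influenza": 0.77}

r_t = meta_correlation(japanese, thai)
print(round(r_t, 2))  # 0.97
```

"Chills" is dropped automatically because it has no Thai counterpart, leaving the two three-element vectors described in the example.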

Results
Among the 14 disease-location contexts we analyzed, we found three broad classes of results. In 8 cases, the model succeeded, i.e., there was a usefully close match between the model's estimate and the official data. In 3 cases, the model failed, apparently because patterns in the official data were too subtle to capture, and in a further 3, the model failed apparently because the signal-to-noise ratio (SNR) in the Wikipedia data was too low. Recall that this success/failure classification is based on subjective judgement; that is, in our exploration, we discovered that r² is insufficient to completely evaluate a model's goodness of fit, and a complementary qualitative evaluation was necessary. Below, we discuss the successful and failed nowcasting models, followed by a summary and evaluation of transferability. (No models failed at nowcasting but succeeded at forecasting, so we omit a detailed forecasting discussion for brevity.)
Table 2. This table shows simplified models for influenza in two locations: Japan, where Japanese is spoken, and Thailand, where Thai is spoken. The Japanese model yielded correlations for Japanese versions of the articles "Fever", "Chills", "Headache", and "Influenza" of 0.23, 0.59, −0.10, and 0.85, respectively. The Thai model yielded correlations of 0.21, 0.15, and 0.77 for "Fever", "Headache", and "Influenza", respectively. Note that the article "Chills" is not currently present in the Thai Wikipedia. Therefore, the correlation vectors are {0.23, −0.10, 0.85} and {0.21, 0.15, 0.77} for the two languages. The meta-correlation, r_t, for these two vectors, which provides a gross estimate of how similar the models are, is 0.97. Extending this computation to the full models yields r_t = 0.81, as noted below in Table 4.

Successful nowcasting
Model and official data time series for selected successful contexts are illustrated in Figure 1. The method's good performance on dengue and influenza is consistent with voluminous prior work on these diseases; this offers evidence for the feasibility of Wikipedia access as a data source.
Success in the United States is somewhat surprising. Given the widespread use of English across the globe, we expected that language would be a poor location proxy for the United States. We speculate that the good influenza model performance is due to the high levels of internet use in the United States, perhaps coupled with similar flu seasons in other Northern Hemisphere countries. Similarly, in addition to Brazil, Portuguese is spoken in Portugal and several other former colonies, yet problems again failed to arise. In this case, we suspect a different explanation: the simple absence of dengue from other Portuguese-speaking countries.
The case of dengue in Brazil is further interesting because it highlights the noise inherent in this social data source, a property shared by many other internet data sources. That is, noise in the input articles is carried forward into the model's estimate. We speculate that this problem could be mitigated by building a model on a larger, more carefully selected set of articles rather than just 10.
Finally, we highlight tuberculosis in China as an example of a marginally successful model. Despite the apparently low r² of 0.66, we judged this model successful because it captured the high baseline disease level excellently and the three modest peaks well. However, it is not clear that the model provides useful information at the time scale analyzed. This result suggests that additional quantitative evaluation metrics may be needed, such as root mean squared error (RMSE) or a more complex analysis considering peaks, valleys, slope changes, and related properties.
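As a minimal sketch of the RMSE metric mentioned above (not a metric computed in this study), the following Python function measures the typical size of the estimation error in the same units as the incidence data. The function name rmse and the toy case counts are hypothetical.

```python
import math


def rmse(estimate, official):
    """Root mean squared error between model estimates and official incidence."""
    squared = [(e - o) ** 2 for e, o in zip(estimate, official)]
    return math.sqrt(sum(squared) / len(squared))


# Toy series with a high, nearly flat baseline (invented for illustration).
official = [100, 102, 98, 101]
estimate = [101, 101, 99, 100]
error = rmse(estimate, official)
print(error)  # 1.0
```

Unlike r², which depends on how much of the variation around the mean is explained, RMSE rewards a model for matching the baseline level, which is the property highlighted for the tuberculosis model in China.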
Forecasting and anti-forecasting performance of the four selected contexts is illustrated in Figure 2. In the case of dengue and influenza, the models contain significant forecast value through the limit of our 28-day analysis, often with the maximally effective lag comprising a forecast. We offer three possible reasons for this. First, both diseases are seasonal, so readers may simply be interested in the syndrome for this reason; however, the fact that models were able to correctly estimate seasons of widely varying severity provides counterevidence for this theory. Second, readers may be interested due to indirect reasons such as news coverage. Prior work disagrees on the impact of such influences; for example, Dukic et al. found that adding news coverage to their methicillin-resistant Staphylococcus aureus (MRSA) model had a limited effect [58], but recent Google Flu Trends failures appear to be caused in part by media activity [94]. Finally, both diseases have a relatively short incubation period (influenza at 1-4 days and dengue at 3-14); soon-to-be-ill readers may be observing the illness of their infectors or those who are a small number of degrees removed. It is the third hypothesis that is most interesting for forecasting purposes, and evidence to distinguish among them might be obtained from studies using simulated media and internet data, as suggested by Culotta [82]. Tuberculosis in China is another story. In this case, the model's effectiveness is poorer as the forecast interval increases; we speculate that this is because seasonality is absent and the incubation period of 2-12 weeks is longer, diluting the effect of the above two mechanisms.

Figure 2. Forecasting effectiveness for selected successful models. This figure shows model r² compared to temporal offset in days: positive offsets are forecasting, zero is nowcasting (marked with a dotted line), and negative offsets are anti-forecasting. As above, figures for the four successful contexts not included here are in the supplemental data S1.

Failed nowcasting
Figure 3 illustrates the three contexts where the model was not effective because, we suspect, it was not able to discern meaningful patterns in the official data. These suggest a few patterns that models might have difficulty with:
1. Noise. True patterns in data may be obscured by noise. For example, in the case of HIV/AIDS in China, the official data vary by a factor of 2 or more throughout the graph, and the model captures this fairly well, but the pattern seems epidemiologically strange and thus we suspect it may be merely noise. The other two contexts appear to also contain significant noise. (Note that we distinguish noisy official data from an unfavorable signal-to-noise ratio, which is discussed below.)
2. Too slow. Disease incidence may be changing too slowly to be evident in the chosen analysis period. In all three contexts shown in Figure 3, the trend of the official data is essentially flat, with HIV/AIDS in Japan especially so. The models have captured this flat trend fairly well, but even doing so excellently provides little actionable value over traditional surveillance. Both HIV/AIDS and tuberculosis infections progress quite slowly. A period of analysis longer than three years might reveal meaningful patterns that could be captured by this class of models.

Table 3. This table summarizes the performance of our estimation models. For each disease and location, we list the subjective success/failure classification as well as model r² at nowcasting (0-day forecast) and 7-, 14-, and 28-day forecasts. We also list the temporal offset in days of the best model (again, a positive offset indicates forecasting) along with that model's r².
In the case of Ebola, there are relatively few direct observations (a major outbreak has tens of cases), and the path to these observations becoming internet traces is hampered by poor connectivity in the sub-Saharan countries where the disease is active. On the other hand, the disease is one of general ongoing interest; in fact, one can observe on the graph a pattern of weekly variation (higher in midweek, lower on the weekend), which is common in online activity. In combination, these yield a completely useless model.
The United States has good internet connectivity, but plague has even lower incidence (the peak on the graph is three cases) and this disease is also popularly interesting, resulting in essentially the same effect. The cholera outbreak in Haiti differs in that the number of cases is quite large (the peak of the graph is 4,200 cases in one day). However, internet connectivity in Haiti was already poor even before the earthquake, and the outbreak was a major world news story, increasing noise, so the signal was again lost.

Performance summary
Table 3 summarizes the performance of our models in the 14 disease-location contexts tested. Of these, we classified 8 as successful, producing useful estimates for both nowcasting and forecasting, and 6 as unsuccessful. Performance roughly broke down along disease lines: all influenza and dengue models were successful, while two of the three tuberculosis models were, and cholera, Ebola, HIV/AIDS, and plague proved unsuccessful. Given the relatively simple model building technique used, this suggests that our Wikipedia-based approach is sufficiently promising to explore in more detail. (Another hypothesis is that model performance is related to popularity of the corresponding Wikipedia language edition. However, we found no relationship between r² and either a language's total number of articles or total traffic.)
At a higher level, we posit that a successful estimation model based on Wikipedia access logs or other social internet data requires two key elements. First, it must be sensitive enough to capture the true variation in disease incidence data. Second, it must be sensitive enough to distinguish between activity traces due to health-related observations and those due to other causes. In both cases, further research on modeling techniques is likely to yield sensitivity improvements. In particular, a broader article selection procedure (for example, using big data methods to test all non-trivial article time series for correlation, as Ginsberg et al. did for search queries [36]) is likely to prove fruitful, as might a non-linear statistical mapping.

Discussion
Human activity on the internet leaves voluminous traces that contain real and useful evidence of disease dynamics. Above, we pose four challenges currently preventing these traces from informing operational disease surveillance activities, and we argue that Wikipedia data are one of the few social internet data sources that can meet all four challenges. Specifically:
C1. Openness. Open data and algorithms are required, in order to offer reliable science as well as a flexible and robust operational capability. Wikipedia access logs are freely available to anyone.
C2. Breadth. Thousands of disease-location contexts, not dozens, are needed to fully understand the global disease threat. We tested simple disease estimation models on 14 contexts around the world; in 8 of these, the models were successful with r² up to 0.92, suggesting that Wikipedia data are useful in this regard.
C3. Transferability. The greatest promise of novel disease surveillance methods is the possibility of use in contexts where traditional surveillance is poor or nonexistent. Our analysis uncovered pairs of same-disease, different-location models with similarity up to 0.81, suggesting that translation of trained models using Wikipedia's mappings of one language to another may be possible.
C4. Forecasting. Effective response to disease depends on knowing not only what is happening now but also what will happen in the future. Traditional mechanistic forecasting models often cannot be applied due to missing parameters, motivating the use of simpler statistical models. We show that such statistical models based on Wikipedia data have forecasting value through our maximum tested horizon of 28 days.
This preliminary study has several important limitations. These comprise an agenda for future research work:
1. The methods need to be tested in many more contexts in order to draw lessons about when and why this class of methods is likely to work.
2. A better article selection procedure is needed. In the current paper, we tried a simple manual process yielding at most a few dozen candidate articles in order to establish feasibility. However, real techniques should use a comprehensive process that evaluates thousands, millions, or all plausible articles for inclusion in the model. This will also facilitate content analysis studies that evaluate which types of articles are predictive of disease incidence.
3. Better geo-location is needed. While language as a location proxy works well in some cases, as we have demonstrated, it is inherently weak. In particular, it is implausible for use at a finer scale than country-level. What is needed is a hierarchical geographic aggregation of article traffic. The Wikimedia Foundation, operators of Wikipedia and several related projects, could do this using IP addresses to infer location before the aggregated data are released to the public. For most epidemiologically-useful granularities, this will still preserve reader privacy.
4. Statistical estimation maps from article traffic to disease incidence should be more sophisticated. Here, we tried simple linear models mapping a single interval's Wikipedia traffic to a single interval's disease incidence. Future directions include testing non-linear and multi-interval models.
5. Wikipedia data have a variety of instabilities that need to be understood and compensated for. For example, Wikipedia shares many of the problems of other internet data, such as highly variable interest-driven traffic caused by news reporting and other sources.
Wikipedia has its own data peculiarities that can also cause difficulty. For example, during preliminary exploration for this paper in spring 2013, we used the inter-language link on the English article "Dengue fever" to locate the Thai version, "ไข้เลือดออกเด็งกี" (roughly, "dengue hemorrhagic fever"); article access logs indicated several hundred accesses per day for this article in the month of June 2013. When we repeated the same process in March 2014, the inter-language link led to a page with the same content, but a different title, "ไข้เด็งกี" (roughly, "dengue fever"). As none of the authors are Thai speakers, and Google Translate renders both versions as "dengue fever", we did not notice that the title of the Thai article had changed and were alarmed to discover that the article's traffic in June 2013 was essentially zero.
The explanation is that before July 23, 2013, "ไข้เด็งกี" was a redirect to "ไข้เลือดออกเด็งกี"; on that day, the direction of the redirect was reversed, and almost all accesses moved over to the new redirect target over a period of a few days. That is, the article was the same all along, but the URL under which its accesses were recorded changed.
Possible techniques for compensation include article selection procedures that exclude such articles or maintaining a time-aware redirect graph so that different aliases of the same article can be merged. Indeed, when we tried the latter approach by manually summing the two URLs' time series, it improved nowcast r² from 0.55 to 0.65. However, the first technique is likely to discard useful information, and the second may not be reliable because complete history for this type of article transformation is not available [110].
In general, ongoing, time-aware re-training of models will likely be helpful, and limitations of the compensation techniques can be evaluated with simulation studies that inject data problems.
6. We have not explored the full richness of the Wikipedia data. For example, complete histories of each language edition are available, which include editing metadata (timestamps, editor identity, and comments), the text of each version, and conversations about the articles; these would facilitate analysis of edit activity as well as the articles' changing text. Also, health-related articles are often mapped to existing ontologies such as the International Statistical Classification of Diseases and Related Health Problems (ICD-9 or ICD-10).
7. Transferability of models should be tested using more realistic techniques, such as simply building a model in one context and testing its performance in another.
Finally, it is important to recognize the biases inherent in Wikipedia and other social internet data sources. Most importantly, the data strongly over-represent people and places with good internet access and technology skills; demographic biases such as age, gender, and education also play a role. These biases are sometimes quantified (e.g., with survey results) and sometimes completely unknown. As noted above, simulation studies using synthetic internet data can quantify the impact and limitations of these biases.
Despite these limitations, we have established the utility of Wikipedia access logs for global disease monitoring and forecasting, and we have outlined a plausible path to a reliable, scientifically sound, operational disease surveillance system. We look forward to collaborating with the scientific and technical community to make this vision a reality.

Figure 1. Selected successful model nowcasts. These graphs show official epidemiological data and nowcast model estimate (left Y axis) with traffic to the five most-correlated Wikipedia articles (right Y axis) over the 3-year study periods. The Wikipedia time series are individually self-normalized. Graphs for the four remaining successful contexts (dengue in Thailand, influenza in Japan, influenza in Thailand, and tuberculosis in Thailand) are included in the supplemental data file S1.

Figure 3. Nowcast attempts where the model was unable to capture a meaningful pattern in official data.

Table 1. Disease-location contexts analyzed, with data sources.

Table 2. Transferability r_t example.

Table 3. Model performance summary.

Table 4. Transferability scores r_t for paired models. This table lists the transferability scores r_t for each tested pair of countries within a disease. Countries that did not share enough articles to compute a meaningful r_t are indicated with n/a.

Table 4 lists the transferability scores r_t for each pair of countries tested on the same disease. Because this paper is concerned with establishing feasibility, we focus on the highest scores. These are encouraging: in the case of influenza, both Japan/Thailand and Thailand/United States are promising. That is, it seems plausible that careful source model selection and training techniques may yield useful models in contexts where no training data are available (e.g., official data are unavailable or unreliable). These early results suggest that research to quantitatively test methods for translating models from one disease-location context to another should be pursued.