Surveillance Sans Frontières: Internet-Based Emerging Infectious Disease Intelligence and the HealthMap Project

John Brownstein and colleagues discuss HealthMap, an automated real-time system that monitors and disseminates online information about emerging infectious diseases.


Health in Action
The Opportunity
As developed nations continue to strengthen their electronic disease surveillance capacities [1], the parts of the world that are most vulnerable to emerging disease threats still lack essential public health information infrastructure [2,3]. The existing network of traditional surveillance efforts managed by health ministries, public health institutes, multinational agencies, and laboratory and institutional networks has wide gaps in geographic coverage and often suffers from poor and sometimes suppressed information flow across national borders [4]. At the same time, an enormous amount of valuable information about infectious diseases is found in Web-accessible information sources such as discussion sites, disease reporting networks, and news outlets [5,6,7]. These resources can support situational awareness by providing current, highly local information about outbreaks, even from areas relatively invisible to traditional global public health efforts [8]. These data are plagued by a number of potential hazards that must be studied in depth, including false reports (mis- or disinformation) and reporting bias. Yet these data hold tremendous potential to initiate epidemiologic follow-up studies and provide complementary epidemic intelligence context to traditional surveillance sources. This potential is already being realized, as a majority of outbreak verifications currently conducted by the World Health Organization (WHO)'s Global Outbreak Alert and Response Network are triggered by reports from these nontraditional sources [5,6].
In one of the most frequently cited examples [9], early indications of the severe acute respiratory syndrome (SARS) outbreak in Guangdong Province, China, came in November 2002 from a Chinese article that alluded to an unusual increase in emergency department visits with acute respiratory illness [9,10]. This was followed by media reports of a respiratory disease among health care workers in February 2003, all captured by the Public Health Agency of Canada's Global Public Health Intelligence Network (GPHIN) [10,11,12]. In parallel, online discussions on the ProMED-mail system referred to an outbreak in Guangzhou, well before official government reports were issued [13].
These Web-based data sources not only facilitate early outbreak detection, but also support increasing public awareness of disease outbreaks prior to their formal recognition. Through low-cost and real-time Internet data-mining, combined with openly available and user-friendly technologies, both participation in and access to global disease surveillance are no longer limited to the public health community [14,15]. The availability of Web-based news media provides an alternative public health information source in under-resourced areas. However, the myriad diverse sources of infectious disease information across the Web are not structured or organized; public health officials, nongovernmental organizations, and concerned citizens must routinely search and synthesize a continually growing number of disparate sources in order to use this information. With the aim of creating an integrated global view of emerging infections based not only on traditional public health datasets but also on all available information sources, we developed HealthMap, a freely accessible, automated electronic information system for organizing data on outbreaks according to geography, time, and infectious disease agent [16] (Figure 1).

Summary Points
[...] diseases is found in Web-accessible information sources such as discussion forums, mailing lists, government Web sites, and news outlets.
[...] sources can play an important role in early event detection and support situational awareness by providing current, highly local information about outbreaks, even from areas relatively invisible to traditional global public health efforts.
[...] useful, information overload and difficulties in distinguishing "signal from noise" pose substantial barriers to fully utilizing this information.
[...] automated real-time system that monitors, organizes, integrates, filters, visualizes, and disseminates online information about emerging diseases.
[...] real-time intelligence on a broad range of emerging infectious diseases for a diverse audience, from public health officials to international travelers.
[...] and other nontraditional sources of surveillance data can facilitate early outbreak detection, increase public awareness of disease outbreaks prior to their formal recognition, and provide an integrated and contextualized view of global health information.

The System
Operating since September 2006, HealthMap (http://www.healthmap.org/) is a multistream real-time surveillance platform that continually aggregates reports on new and ongoing infectious disease outbreaks [17]. The system performs extraction, categorization, filtration, and integration of these reports, facilitating knowledge management and early detection (Figure 2). Through this approach, we seek a unified and comprehensive view of current infectious disease outbreaks in space and time worldwide.
HealthMap is designed to provide a starting point for real-time intelligence on a broad range of emerging infectious diseases for a diverse range of end users, from public health officials to international travelers [18,19,20]. The system currently serves as a direct information source for approximately 20,000 unique visitors per month, as well as a resource for libraries, local health departments, governments (e.g., the US Department of Health and Human Services and Department of Defense), and multinational agencies (e.g., the United Nations), which use the HealthMap data stream for day-to-day surveillance activities. Many regular users come from the WHO, the US Centers for Disease Control and Prevention, and the European Centre for Disease Prevention and Control.
Knowledge sources. HealthMap relies on a variety of electronic media sources, including online news sources through aggregators such as Google News, expert-curated discussion such as ProMED-mail [13,21,22], and validated official reports from organizations such as the WHO. Currently, the system collects reports from 14 sources, which in turn represent information from over 20,000 Web sites, every hour, 24 hours a day. Internet search criteria include disease names (scientific and common), symptoms, keywords, and phrases. The system collects an average of 300 reports per day, with the majority acquired from news media sources (85.1%). Although most of the reports collected to date have been in English, HealthMap also monitors information sources in Chinese, Spanish, Russian, and French, with additional languages such as Hindi, Portuguese, and Arabic under development. As HealthMap reports are acquired solely from free news sources, operational costs are minimal. The Web site is freely accessible on the Internet without subscription fees.
The use of international news media for public health surveillance has a number of potential biases that merit consideration. While local news sources may report on incidents involving a few cases that would not be picked up at the national level, such sources may be less reliable, lacking resources and training, and may report stories without adequate confirmation. Furthermore, other biases may be intentionally introduced for political reasons through disinformation campaigns (false positives) or state censorship of information relating to outbreaks (false negatives). We have attempted to better understand some of these issues through ongoing analysis and evaluation research. We ran a 43-week evaluation of HealthMap data, covering the period of October 1, 2006, through July 18, 2007. We found that pathogen diversity was substantial across news sources, with 141 unique infectious disease categories reported through the Google News feed alone (Table 1). We found the frequency of reports about particular pathogens to be related not to their associated morbidity or mortality impact, but rather to the direct or potential economic and social disruption caused by the outbreak.
For instance, we found substantial skew towards reporting on stories about avian influenza and food-borne illnesses. Over the evaluation time period, 174 countries had reports of infectious disease outbreaks, with the greatest reporting from the United States (n = 4351), the United Kingdom (n = 1018), Canada (n = 880), and China (n = 737) (Figure 3A). There was a clear bias towards increased reporting from countries with higher numbers of media outlets, more developed public health resources, and greater availability of electronic communication infrastructure (approximated by number of Internet hosts) (Figure 3B). These trends are highly relevant for users of the system, and thus the individual impact of these factors on surveillance will form the basis of a detailed user guide currently under development.
Knowledge extraction. The system characterizes disease outbreak reports by means of a series of text mining algorithms. (A complete technical description of the system may be found elsewhere [23].) Characterization stages include: (a) identifying disease and location; (b) determining relevance, namely, whether a given report refers to any current outbreak; and (c) grouping similar reports together while removing exact duplicates. Once the reports are automatically processed, curators correct the misclassifications of the system where necessary (Figure 2). Currently, only one analyst reviews and corrects the posts. However, additional resources would enable more detailed multilingual curation and annotation of collected reports. Extracting location and disease names from text reports presents the most formidable challenge. HealthMap draws from a continually expanding dictionary of pathogens (human, plant, and animal diseases) and geographic names (country, province, state, and city) to classify outbreak alert information. However, disease and place names are often ambiguous, colloquial, and subject to change, and may have multiple spellings (e.g., diarrhea, common in the US, and diarrhoea, common in the UK). Thus, the expansion and editing of the database requires extensive manual data entry.
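The dictionary lookup described above can be illustrated with a minimal sketch. This is not the HealthMap implementation (which is described in [23]); the dictionaries, the function name `extract`, and the whole-word matching rule are assumptions made for illustration only. The sketch shows how spelling variants and colloquial names can be mapped to one canonical label.

```python
import re

# Hypothetical mini-dictionaries: each surface form maps to a canonical label.
# A real dictionary would be far larger and manually curated.
DISEASES = {
    "diarrhea": "diarrhea",
    "diarrhoea": "diarrhea",          # UK spelling folded into one label
    "avian influenza": "avian influenza",
    "bird flu": "avian influenza",    # colloquial name
    "cholera": "cholera",
}
PLACES = {
    "nigeria": "Nigeria",
    "guangdong": "Guangdong, China",
    "guangzhou": "Guangzhou, China",
}

def _compile(terms):
    # Longest terms first so multiword names win over shorter alternatives.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

DISEASE_RE = _compile(DISEASES)
PLACE_RE = _compile(PLACES)

def extract(text):
    """Return (diseases, places) as sets of canonical labels found in text."""
    diseases = {DISEASES[m.group(0).lower()] for m in DISEASE_RE.finditer(text)}
    places = {PLACES[m.group(0).lower()] for m in PLACE_RE.finditer(text)}
    return diseases, places
```

Simple matching of this kind cannot resolve ambiguous names (for example, a place name shared by several regions), which is one reason manual curation remains necessary.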
Once location and disease have been identified, articles are automatically tagged according to their relevance. Specifically, we identify whether a given report refers to a current outbreak ("breaking news"), as opposed to reporting on other infectious disease-related news, such as vaccination campaigns, scientific research, or public health policy. In this case, HealthMap makes use of a Bayesian machine learning algorithm, trained on manually characterized existing reports, to automatically tag and separate breaking news. Finally, duplicate reports are filtered, identified, and grouped based on the similarity of the article's headline, body text, and disease and location categories. Using a similarity score threshold, the system groups related articles into clusters that provide the collective information on a given outbreak.
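As a rough illustration of these two steps, the sketch below pairs a multinomial naive Bayes classifier (one common form of Bayesian text classifier) with a greedy similarity-threshold grouping. The article does not specify which Bayesian model, similarity measure, or threshold HealthMap uses, so the add-one smoothing, the Jaccard word-overlap measure on headlines only, the 0.5 threshold, and all names here are assumptions.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Lowercase alphabetic words only; numbers and punctuation are dropped.
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}         # label -> Counter of word frequencies
        self.doc_counts = Counter()   # label -> number of training reports
        self.vocab = set()

    def train(self, text, label):
        words = tokens(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        words = tokens(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def jaccard(a, b):
    # Word-set overlap between two headlines, from 0.0 to 1.0.
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cluster(headlines, threshold=0.5):
    """Greedy grouping: a headline joins the first cluster whose
    first member it resembles at or above the threshold."""
    clusters = []
    for h in headlines:
        for c in clusters:
            if jaccard(h, c[0]) >= threshold:
                c.append(h)
                break
        else:
            clusters.append([h])
    return clusters
```

A fuller version would also compare body text and the extracted disease and location categories, as the text describes, rather than headlines alone.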
Knowledge integration and dissemination. HealthMap is particularly focused on providing users with news of immediate interest and reducing information overload. Overwhelming public health officials with information on outbreaks of low public health impact may distract them from investigating outbreaks of greater priority that might receive reduced media attention. Thus, only articles classified as breaking news are posted to the site. Although they are filtered from the initial display, other article types and duplicate articles are shown in a related information window, providing a situational report on an ongoing outbreak as well as recent reports concerning either the same disease or location, and links for further research (Figure 4).
HealthMap also addresses the computational challenges of integrating multiple sources of unstructured information by generating meta-alerts of disease outbreaks. As false alarms can often be reduced by thorough aggregation and cross-validation of reported information, a composite activity score (or heat index) is calculated based on (a) the reliability of the data source (for instance, increased weight is given to WHO reports and reduced weight to local media reports); and (b) the number of unique data sources, with increased weight to multiple types of information (e.g., discussion sites and media reports on the same outbreak). This meta-alert derivation is based on the idea that multiple sources of information about an incident provide greater confidence in the reliability of the report than any one source alone.
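The article does not give the formula behind the composite activity score, so the following is only one possible sketch of the two stated ingredients: a reliability weight per source and a bonus for corroboration across source types. The weight values, the `type_bonus` parameter, and the function name are hypothetical.

```python
# Hypothetical reliability weights per source type; the article states only
# that official reports weigh more and local media reports weigh less.
RELIABILITY = {
    "official": 1.0,
    "expert_forum": 0.8,
    "national_media": 0.5,
    "local_media": 0.3,
}

def activity_score(reports, type_bonus=0.5):
    """Score one outbreak cluster.

    reports: list of (source_name, source_type) pairs.
    """
    best_by_source = {}
    for name, kind in reports:
        weight = RELIABILITY.get(kind, 0.1)  # unknown types get a low weight
        # Count each unique source once, so repeated articles from a
        # single outlet do not inflate the score.
        best_by_source[name] = max(best_by_source.get(name, 0.0), weight)
    base = sum(best_by_source.values())
    n_types = len({kind for _, kind in reports})
    # Corroboration across different source types raises the score.
    return base * (1 + type_bonus * max(n_types - 1, 0))
```

Under these assumed weights, an outbreak reported by both an official source and an expert forum scores well above one carried repeatedly by a single local outlet, which matches the intent that independent sources lend more confidence than any one source alone.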

The Future
A wide range of further improvements are currently being developed across all components of the HealthMap system. In particular, population and geography gaps in the coverage of the monitored sources need to be better understood and accounted for. For example, there are critical gaps in media reporting in tropical and lower-latitude areas, including major parts of Africa and South America, the very regions that have the greatest burden and risk of emerging infectious diseases (Figure 3) [3]. Monitoring other Internet-based sources such as blogs, discussion sites, and listservs could complement news coverage. The use of click-stream data and individual search queries is also a promising new surveillance source [24]. Multilingual surveillance is critical for capturing greater geographic coverage and for providing earlier and more comprehensive reporting from local news media.
Potential future challenges include the possibility that news data sources that are freely available now will no longer be available if current business models change. In addition, the way news is reported online (content, format, communication standards, etc.) may change and develop in the coming years, which will require a re-tooling of the system in order to capture the appropriate information. Potential future benefits of technological advances include better meta-data tagging if/when the semantic Web becomes a reality. Also, as location-based services become more widespread, including on portable devices, HealthMap feeds can be tailored and targeted to specific users and their locations.
Future work must also focus on improving natural language processing capability to clearly identify the pathogen, filter nonpertinent reports and duplicates, and enhance the spatial resolution of location extraction. However, while improvements in machine learning techniques are undoubtedly critical, they cannot currently replace human analysis. The success of Wikipedia has shown that leveraging collaborative human networks of trained public health professionals has the potential to support improved classification, severity assignment, conflict resolution, geocoding, and confirmation of reports on rare or unknown infectious diseases [25]. A recently established collaboration between HealthMap and ProMED-mail (http://www.healthmap.org/promed) is helping to pave the way for such a bidirectional system of classification and curation of information flow [26].
Figure 4. All articles related to a given outbreak are aggregated by text similarity matching in order to provide a situational awareness report. Furthermore, other outbreaks occurring in the same geographic area or involving the same pathogen are provided. The window also provides links to further research on the subject. In this example, we show all alerts relating to a recent cholera outbreak in Nigeria.
Continued system evaluation is also essential. The fundamental characteristics of different news source types need to be quantified, including sensitivity, specificity, and timeliness [27,28,29,30]. Consideration should also be given to integrating unstructured online information sources with other health indicator data to provide a broader context for reports. Pertinent data sets include mortality and morbidity estimates, laboratory data, field surveillance (e.g., vector and animal reservoir distribution), environmental predictors (e.g., climate and vegetation), population density and mobility, and pathogen seasonality and transmissibility. Such integration could yield a more precise relevance score for a given report, define populations at risk, and predict disease spread.
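For readers less familiar with the evaluation measures named above, the sketch below shows the standard definitions of sensitivity and specificity, together with one simple way of expressing timeliness as lead time in days. The function names and the choice of an official report as the reference event are assumptions; the article does not prescribe an evaluation protocol.

```python
from datetime import date

def sensitivity(true_pos, false_neg):
    # Share of real outbreaks that the source reported.
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    # Share of non-outbreak periods in which the source raised no alert.
    return true_neg / (true_neg + false_pos)

def timeliness_days(first_media_report, official_report):
    """Lead time in days; positive values mean the media report
    preceded the official report."""
    return (official_report - first_media_report).days
```

For example, a source that reported 8 of 10 verified outbreaks would have a sensitivity of 0.8 under this definition; such counts would have to come from comparison against an independent reference standard.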

Conclusion
HealthMap is a member of a new generation of surveillance systems that mine media sources in near real time for reports of infectious disease outbreaks, including GPHIN [10,12]; MedISys, developed by the Directorate General Health and Consumer Affairs of the European Commission [31]; the US government-funded Argus [32]; and EpiSPIDER [33]. While Internet-based online media sources are becoming a critical tool for global infectious disease surveillance, important challenges still need to be addressed. Since regions with the least advanced communication infrastructure also tend to carry the greatest infectious disease burden and risk, system development must be aimed at closing the gaps in these critical areas. Hence, achieving global coverage requires attention to creating and capturing locally feasible channels of communication. It also involves making the outputs of the system more accessible to users in these regions through user interfaces in additional languages and low-bandwidth display options, including mobile phone alerts.
Ultimately, the monitoring of diverse media-based sources will augment epidemic intelligence with information derived outside the traditional public health infrastructure, yielding a more comprehensive and timely global view of emerging infectious disease threats. A truly open and accessible system can also assist users in overcoming existing geographical, organizational, and societal barriers to information, a process that can lead to greater empowerment, involvement, and democratization across the increasingly interconnected global health sphere.