All authors are with the Children's Hospital Informatics Program at the Harvard–MIT Division of Health Sciences and Technology; the Division of Emergency Medicine, Children's Hospital Boston; and the Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, United States of America.
The authors have declared that no competing interests exist.
John Brownstein and colleagues discuss HealthMap, an automated real-time system that monitors and disseminates online information about emerging infectious diseases.
As developed nations continue to strengthen their electronic disease surveillance capacities [
Valuable information about infectious diseases is found in Web-accessible information sources such as discussion forums, mailing lists, government Web sites, and news outlets.
Web-based electronic information sources can play an important role in early event detection and support situational awareness by providing current, highly local information about outbreaks, even from areas relatively invisible to traditional global public health efforts.
While these sources are potentially useful, information overload and difficulties in distinguishing “signal from noise” pose substantial barriers to fully utilizing this information.
HealthMap is a freely accessible, automated real-time system that monitors, organizes, integrates, filters, visualizes, and disseminates online information about emerging diseases.
The goal of HealthMap is to deliver real-time intelligence on a broad range of emerging infectious diseases for a diverse audience, from public health officials to international travelers.
Ultimately, the use of news media and other nontraditional sources of surveillance data can facilitate early outbreak detection, increase public awareness of disease outbreaks prior to their formal recognition, and provide an integrated and contextualized view of global health information.
In one of the most frequently cited examples [
These Web-based data sources not only facilitate early outbreak detection, but also support increasing public awareness of disease outbreaks prior to their formal recognition. Through low-cost and real-time Internet data-mining, combined with openly available and user-friendly technologies, both participation in and access to global disease surveillance are no longer limited to the public health community [
Operating since September 2006, HealthMap (
(1) Web-based data are acquired from a variety of Web sites every hour, 7 days a week (ranging from rumors on discussion sites to news media to validated official reports). (2) The extracted articles are then categorized by pathogen and location of the outbreak in question. (3) Articles are then analyzed for duplication and content. Duplicate articles are removed, while those that discuss new information about an ongoing situation are integrated with other related articles and added to the interactive map. (4) Once classified, articles are filtered by their relevance into five categories. Only “breaking news” articles are added as markers to the map.
HealthMap is designed to provide a starting point for real-time intelligence on a broad range of emerging infectious diseases for a diverse range of end users, from public health officials to international travelers [
HealthMap relies on a variety of electronic media sources, including online news sources through aggregators such as Google News, expert-curated discussion such as ProMED-mail [
The use of international news media for public health surveillance has a number of potential biases that merit consideration. While local news sources may report on incidents involving a few cases that would not be picked up at the national level, such sources may be less reliable, lacking resources and training, and may report stories without adequate confirmation. Furthermore, other biases may be intentionally introduced for political reasons through disinformation campaigns (false positives) or state censorship of information relating to outbreaks (false negatives). We have attempted to better understand some of these issues through ongoing analysis and evaluation research. We ran a 43-week evaluation of HealthMap data, covering the period of October 1, 2006 through July 18, 2007. We found that pathogen diversity was substantial across news sources, with 141 unique infectious disease categories reported through the Google News feed alone (
Infectious Disease Occurrences Extracted from Google News Searches from October 1, 2006 through July 18, 2007
For instance, we found substantial skew towards reporting on stories about avian influenza and food-borne illnesses. Over the evaluation time period, 174 countries had reports of infectious disease outbreaks, with the greatest reporting from the United States (
Geographic coverage of English language news reports of infectious disease outbreaks collected through Google News from October 1, 2006 through July 18, 2007. (A) displays counts of disease outbreak reports; (B) displays population-adjusted outbreak reporting as number of reports per million inhabitants.
The system characterizes disease outbreak reports by means of a series of text mining algorithms. (A complete technical description of the system may be found elsewhere [
Extracting location and disease names from text reports presents the most formidable challenge. HealthMap draws from a continually expanding dictionary of pathogens (human, plant, and animal diseases) and geographic names (country, province, state, and city) to classify outbreak alert information. However, disease and place names are often ambiguous, colloquial, and subject to change, and may have multiple spellings (e.g., diarrhea, common in the US, and diarrhoea, common in the UK). Thus, the expansion and editing of the database requires extensive manual data entry.
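The dictionary-based approach described above can be illustrated with a minimal sketch. It assumes a small hand-built lookup table in which each variant spelling maps to one canonical name; the tables `DISEASE_TERMS` and `PLACE_TERMS` and the functions `find_terms` and `tag_report` are hypothetical names chosen for illustration, not part of the HealthMap system, whose actual dictionaries and matching rules are not described here.

```python
import re

# Hypothetical mini-dictionaries: each variant spelling maps to a canonical name.
DISEASE_TERMS = {
    "diarrhea": "diarrhea",
    "diarrhoea": "diarrhea",          # UK spelling folded into the US form
    "cholera": "cholera",
    "avian influenza": "avian influenza",
    "bird flu": "avian influenza",    # colloquial name
}
PLACE_TERMS = {
    "nigeria": "Nigeria",
    "united states": "United States",
    "boston": "Boston",
}


def find_terms(text, dictionary):
    """Return sorted canonical names whose variants occur as whole words or phrases."""
    lowered = text.lower()
    found = set()
    for variant, canonical in dictionary.items():
        # Word boundaries avoid matching a term inside a longer word.
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


def tag_report(text):
    """Tag a report with the diseases and locations found in its text."""
    return {
        "diseases": find_terms(text, DISEASE_TERMS),
        "locations": find_terms(text, PLACE_TERMS),
    }
```

Calling `tag_report("Diarrhoea and cholera cases rise in Nigeria")` tags the report with the canonical disease names `cholera` and `diarrhea` and the location `Nigeria`. Exact matching of this kind does nothing to resolve ambiguous names, which is one reason the real dictionary requires manual curation.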
Once location and disease have been identified, articles are automatically tagged according to their relevance. Specifically, we identify whether a given report refers to a current outbreak (“breaking news”), as opposed to reporting on other infectious disease–related news, such as vaccination campaigns, scientific research, or public health policy. For this step, HealthMap uses a Bayesian machine learning algorithm, trained on manually characterized existing reports, to automatically tag and separate breaking news. Finally, duplicate reports are identified, grouped, and filtered based on the similarity of the article's headline, body text, and disease and location categories. Using a similarity score threshold, the system groups related articles into clusters that provide the collective information on a given outbreak.
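To make the relevance-tagging step concrete, the following is a minimal sketch of one common Bayesian text classifier, a multinomial naive Bayes model with add-one smoothing. The text states only that a Bayesian algorithm trained on manually characterized reports is used, so the specific model, the two labels, the class name `NaiveBayesTagger`, and the four training headlines are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> training documents
        self.vocabulary = set()

    def train(self, labeled_reports):
        for text, label in labeled_reports:
            words = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            # Log prior plus summed log likelihoods of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-labeled training headlines.
TRAINING_REPORTS = [
    ("Cholera outbreak kills dozens in northern state", "breaking"),
    ("Officials confirm new avian influenza cases in poultry", "breaking"),
    ("Ministry launches measles vaccination campaign", "other"),
    ("Researchers publish study on malaria vaccine", "other"),
]

tagger = NaiveBayesTagger()
tagger.train(TRAINING_REPORTS)
```

With this toy training set, `tagger.classify("New cholera cases confirmed as outbreak spreads")` returns `"breaking"`, while `tagger.classify("Study on vaccine campaign")` returns `"other"`. A production classifier would need far more training data and a richer set of categories than the two used here.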
HealthMap is particularly focused on providing users with news of immediate interest and reducing information overload. Overwhelming public health officials with information on outbreaks of low public health impact may distract them from investigating outbreaks of greater priority that might receive reduced media attention. Thus, only articles classified as breaking news are posted to the site. Although they are filtered from the initial display, other article types and duplicate articles are shown in a related information window, providing a situational report on an ongoing outbreak as well as recent reports concerning either the same disease or location, and links for further research (
All articles related to a given outbreak are aggregated by text similarity matching in order to provide a situational awareness report. Furthermore, other outbreaks occurring in the same geographic area or involving the same pathogen are provided. The window also provides links to further research on the subject. In this example, we show all alerts relating to a recent cholera outbreak in Nigeria.
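The grouping of related articles by a similarity score threshold can be sketched as follows. This is an illustrative simplification, not the system's actual method: it assumes Jaccard similarity over headline words only (the text also mentions body text), a hypothetical threshold of 0.5, and a greedy single pass in which each report joins the first cluster whose first report shares its disease and location tags.

```python
import re


def token_set(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items over all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_reports(reports, threshold=0.5):
    """Greedily group reports with matching tags and similar headlines."""
    clusters = []
    for report in reports:
        tokens = token_set(report["headline"])
        for cluster in clusters:
            seed = cluster[0]  # compare against the cluster's first report
            same_tags = (seed["disease"], seed["location"]) == (
                report["disease"],
                report["location"],
            )
            if same_tags and jaccard(tokens, token_set(seed["headline"])) >= threshold:
                cluster.append(report)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([report])
    return clusters


# Hypothetical example reports, already tagged with disease and location.
sample_reports = [
    {"headline": "Cholera outbreak kills 20 in Nigeria",
     "disease": "cholera", "location": "Nigeria"},
    {"headline": "Cholera outbreak in Nigeria kills 20 people",
     "disease": "cholera", "location": "Nigeria"},
    {"headline": "Avian influenza found in poultry farm",
     "disease": "avian influenza", "location": "Indonesia"},
]
```

Here `cluster_reports(sample_reports)` places the two cholera reports in one cluster and the avian influenza report in a second. Raising the threshold makes clusters stricter, so fewer near-duplicates are merged; lowering it risks merging reports about distinct events.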
HealthMap also addresses the computational challenges of integrating multiple sources of unstructured information by generating meta-alerts of disease outbreaks. As false alarms can often be reduced by thorough aggregation and cross-validation of reported information, a composite activity score (or heat index) is calculated based on (a) the reliability of the data source (for instance, increased weight is given to WHO reports and reduced weight to local media reports); and (b) the number of unique data sources, with increased weight to multiple types of information (e.g., discussion sites and media reports on the same outbreak). This meta-alert derivation is based on the idea that multiple sources of information about an incident provide greater confidence in the reliability of the report than any one source alone.
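One way to picture such a composite score is the sketch below. The text gives only the two ingredients, source reliability and the number and variety of unique sources, so the formula, the weights in `SOURCE_RELIABILITY`, and the `TYPE_DIVERSITY_BONUS` value are all hypothetical and should not be read as the HealthMap heat index itself.

```python
# Hypothetical reliability weights per source type (higher = more trusted).
SOURCE_RELIABILITY = {
    "official": 1.0,            # e.g., WHO reports
    "curated_discussion": 0.8,  # e.g., expert-moderated mailing lists
    "news_media": 0.5,
    "local_media": 0.3,
}
DEFAULT_RELIABILITY = 0.1       # hypothetical weight for unknown source types
TYPE_DIVERSITY_BONUS = 0.25     # hypothetical boost per additional source type


def activity_score(alerts):
    """Score an outbreak from a list of (source_name, source_type) alerts.

    Each unique source contributes its reliability weight once, and the
    total is scaled up when several different source types corroborate.
    """
    unique_sources = {}
    for name, source_type in alerts:
        # Repeated alerts from the same source are counted only once.
        unique_sources[name] = SOURCE_RELIABILITY.get(source_type, DEFAULT_RELIABILITY)
    base = sum(unique_sources.values())
    n_types = len({source_type for _, source_type in alerts})
    return base * (1 + TYPE_DIVERSITY_BONUS * max(n_types - 1, 0))
```

Under these assumed weights, a single local media report scores 0.3 however many times that outlet repeats it, whereas the same outbreak reported by both a local outlet and an official source scores higher than either alone, reflecting the idea that independent corroboration increases confidence.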
A wide range of further improvements are currently being developed across all components of the HealthMap system. In particular, population and geography gaps in the coverage of the monitored sources need to be better understood and accounted for. For example, there are critical gaps in media reporting in tropical and lower-latitude areas, including major parts of Africa and South America—the very regions that have the greatest burden and risk of emerging infectious diseases (
Potential future challenges include the possibility that news data sources that are freely available now will no longer be available if current business models change. In addition, the way news is reported online (content, format, communication standards, etc.) may change and develop in the coming years, which will require a re-tooling of the system in order to capture the appropriate information. Potential future benefits of technological advances include better meta-data tagging if/when the semantic Web becomes a reality. Also, as location-based services become more widespread, including on portable devices, HealthMap feeds can be tailored and targeted to specific users and their locations.
Future work must also focus on improving natural language processing capability to clearly identify the pathogen, filter non-pertinent reports and duplicates, and enhance the spatial resolution of location extraction. However, while improvements in machine learning techniques are undoubtedly critical, they cannot currently replace human analysis. As the success of Wikipedia suggests, collaborative human networks can be highly effective; leveraging such networks of trained public health professionals has the potential to support improved classification, severity assignment, conflict resolution, geocoding, and confirmation of reports on rare or unknown infectious diseases [
Continued system evaluation is also essential. The fundamental characteristics of different news source types need to be quantified, including sensitivity, specificity, and timeliness [
HealthMap is a member of a new generation of surveillance systems that mine media sources in near real-time for reports of infectious disease outbreaks, including GPHIN [
Ultimately, the monitoring of diverse media-based sources will augment epidemic intelligence with information derived outside the traditional public health infrastructure, yielding a more comprehensive and timely global view of emerging infectious disease threats. A truly open and accessible system can also assist users in overcoming existing geographical, organizational, and societal barriers to information, a process that can lead to greater empowerment, involvement, and democratization across the increasingly interconnected global health sphere.
GPHIN, Global Public Health Intelligence Network
SARS, severe acute respiratory syndrome
WHO, World Health Organization