Syndromic surveillance promotes the early detection of disease outbreaks. Although syndromic surveillance has expanded in developing countries, its outbreak detection performance, particularly for multi-stream surveillance, has scarcely been evaluated in rural areas.
This study introduces a temporal simulation model based on healthcare-seeking behaviors to evaluate the performance of multi-stream syndromic surveillance for influenza-like illness.
Data were obtained in six towns of rural Hubei Province, China, from April 2012 to June 2013. A Susceptible-Exposed-Infectious-Recovered model generated 27 scenarios of simulated influenza A (H1N1) outbreaks, which were converted into corresponding simulated syndromic datasets through the healthcare-seeking behaviors model. We then superimposed the converted syndromic datasets onto the observed baselines to create the testing datasets. The outbreak detection performance of single-stream surveillance of clinic visits, over-the-counter drug purchase frequency, and school absenteeism, and of multi-stream surveillance of their combinations, was evaluated using receiver operating characteristic curves and activity monitoring operation curves.
In the six towns examined, clinic visit surveillance and school absenteeism surveillance detected outbreaks better than over-the-counter drug purchase frequency surveillance; multi-stream surveillance outperformed single-stream surveillance, particularly at low specificity (Sp < 90%).
Citation: Fan Y, Wang Y, Jiang H, Yang W, Yu M, Yan W, et al. (2014) Evaluation of Outbreak Detection Performance Using Multi-Stream Syndromic Surveillance for Influenza-Like Illness in Rural Hubei Province, China: A Temporal Simulation Model Based on Healthcare-Seeking Behaviors. PLoS ONE 9(11): e112255. https://doi.org/10.1371/journal.pone.0112255
Editor: Javier Sanchez, Univ. Prince Edward Island Atlantic Veterinary College, Canada
Received: March 31, 2014; Accepted: October 3, 2014; Published: November 19, 2014
Copyright: © 2014 Fan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are within the paper and its Supporting Information files.
Funding: This work has received financial support from the European Union Framework Program 7 (grant no. 241900) (http://ec.europa.eu/research/participants/portal/desktop/en/home.html). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Syndromic surveillance collects information about health-related events prior to official diagnosis, and promotes early detection of outbreaks. Such surveillance is commonplace in developed countries. It is often conducted by collecting information through multiple data streams, each of which can contribute to detection. Although developing countries and rural areas have attempted to create surveillance systems, their outbreak detection performance has rarely been evaluated. Of particular interest is the performance of the different data streams used in a surveillance system.
In outbreak detection, the choice of data streams determines whether detection is valid and timely, and is therefore worth investigating. Various data streams exist, including clinic visits, sales of over-the-counter (OTC) drugs, school/work absence, calls to help lines, environmental data, ambulance dispatch data, and others. To optimize detection performance, policy makers must know which data streams are superior, and whether they are more efficient when used in parallel.
Most syndromic surveillance studies have evaluated outbreak detection performance by comparing surveillance signals with a gold standard of surveillance data, such as laboratory pathogen surveillance or conventional confirmed case surveillance. In resource-poor settings, however, governments cannot afford expensive surveillance. In rural China, village clinics are equipped with simple instruments, and are unable to administer laboratory tests for disease confirmation. Furthermore, the Chinese Information System for Diseases Control and Prevention (CISDCP), a conventional routine reporting system for selected infectious diseases, cannot monitor village populations in a timely manner because the hierarchical nature of the system dictates that villages must first send cases to township staff to be recorded in the system. Thus, it is difficult to assess the performance of syndromic surveillance using limited “gold standard” data in rural areas.
An alternative approach is to use simulated data for assessment. Many studies have evaluated the performance of single-stream surveillance through simulated outbreaks. Multi-stream surveillance, however, has seldom been evaluated, because simulated outbreak data cannot be superimposed directly onto different syndromic data baselines (which represent different health-related events, such as visiting clinics, OTC drug purchasing, or absence due to illness). Simulated outbreak data must first be converted into corresponding simulated syndromic data prior to superimposition.
Because all syndromic data streams are associated with one another, we proposed a simulation method based on healthcare-seeking behaviors that can capture the inner linkages between outbreak data and the various syndromic data. Specifically, a discrete probability distribution of the healthcare-seeking behaviors of symptomatic individuals may be used to convert simulated outbreak data into multiple simulated syndromic datasets.
We introduced an evaluation method based on the healthcare-seeking behaviors model for multi-stream syndromic surveillance (Figure 1). We designated influenza A (H1N1) as the hypothetical disease because it is highly infectious and has received considerable attention from public health agencies worldwide since the 2009 influenza A (H1N1) pandemic.
First, simulated outbreak datasets are generated by the SEIR model. Second, the simulated outbreak datasets are converted into three kinds of syndromic datasets according to a discrete probability distribution of healthcare-seeking behaviors. Third, the converted syndromic datasets are superimposed onto the corresponding syndromic baseline datasets to create testing datasets. Next, detection algorithms are run on the testing datasets to detect the simulated outbreaks. Last, relevant indicators are calculated to evaluate the detection performance. OTC: over-the-counter; ROC: receiver operating characteristic; SEIR: Susceptible – Exposed – Infectious – Recovered model; AMOC: activity monitoring operation curves.
Simulated outbreaks model
We generated simulated influenza outbreak data based on the Susceptible – Exposed – Infectious – Recovered (SEIR) model. This model represents four main health states in disease progression. First, individuals begin in the susceptible state (S) and progress to the exposed state (E) at rate β when in contact with infectious individuals. Second, following an incubation period, exposed individuals move to the infectious state (I) at rate ω. Finally, at the end of the infectious period, infectious individuals enter the recovered state (R) at rate γ. This process may be modeled using the following equations:

$$\frac{dS(t)}{dt} = -\beta \frac{S(t)\,I(t)}{N(t)}$$

$$\frac{dE(t)}{dt} = \beta \frac{S(t)\,I(t)}{N(t)} - \omega E(t)$$

$$\frac{dI(t)}{dt} = \omega E(t) - \gamma I(t)$$

$$\frac{dR(t)}{dt} = \gamma I(t)$$

where N(t) is the total population at time t, and S(t), E(t), I(t), and R(t) are the numbers of individuals at time t in each of the four states. The reciprocals 1/ω and 1/γ are the mean incubation and infectious periods, respectively. β is the transmission rate, which reflects the diffusion intensity of a disease; it is usually expressed through the reproductive number (R0), the number of secondary cases generated by each primary case: R0 = β/γ.
To simulate influenza A (H1N1) outbreak data, we defined the values of R0, 1/ω, and 1/γ. Previous studies estimated R0 for influenza A (H1N1) to be in the range of about 1.0–3.0. We therefore set R0 at three levels: 1.5, 2.0, and 2.5. We set 1/ω to 1, 2, and 3 days, and 1/γ to 3, 5, and 7 days, according to the natural progression of influenza A (H1N1). We then varied each parameter in turn while holding the others fixed, so that every combination of the three levels was covered, resulting in 27 (3 × 3 × 3) scenarios of simulated outbreaks (Table 1). Xu et al. reported that the Chinese population has very low pre-existing immunity to influenza A (H1N1) virus; thus, we assumed that the entire population in our target sites was susceptible at the beginning of the outbreaks. In the simulations, we assumed equal infectiousness among individuals and did not take mortality or possible interventions (hospitalizations or treatments) into consideration.
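As an illustration of this step, the following Python sketch advances the SEIR compartments in daily steps and builds the 27 parameter combinations. It is not the authors' VBA implementation: the population size, the single initial infectious case, the 200-day horizon, and the one-day Euler step are assumptions chosen only to make the example runnable, so the resulting case counts will not match Table 1.

```python
from itertools import product

def simulate_seir(r0, incubation_days, infectious_days,
                  population=10_000, initial_infectious=1, days=200):
    """Daily-step SEIR sketch; returns new infectious cases (E -> I) per day.

    population, initial_infectious and days are illustrative assumptions.
    """
    omega = 1.0 / incubation_days   # rate E -> I
    gamma = 1.0 / infectious_days   # rate I -> R
    beta = r0 * gamma               # from R0 = beta / gamma
    s = float(population - initial_infectious)
    e, i, r = 0.0, float(initial_infectious), 0.0
    new_cases = []
    for _ in range(days):
        n = s + e + i + r
        infections = beta * s * i / n   # S -> E
        onsets = omega * e              # E -> I
        recoveries = gamma * i          # I -> R
        s -= infections
        e += infections - onsets
        i += onsets - recoveries
        r += recoveries
        new_cases.append(onsets)
    return new_cases

# Three levels each of R0, 1/omega and 1/gamma give 3 x 3 x 3 = 27 scenarios.
scenarios = list(product([1.5, 2.0, 2.5], [1, 2, 3], [3, 5, 7]))
outbreaks = {params: simulate_seir(*params) for params in scenarios}
```

Each entry of `outbreaks` is a daily series of new cases that can be passed on to the healthcare-seeking conversion step.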
Healthcare-seeking behaviors model
To superimpose the simulated outbreak data onto syndromic baselines, we needed to convert them into the corresponding syndromic data using the healthcare-seeking behaviors model. This model simulated the occurrence and timing of three types of healthcare-seeking behaviors following syndrome onset: visiting clinics, OTC drug purchasing, and school absenteeism.
Figure 2 illustrates the principle of converting simulated outbreak data into syndromic data, using clinic visit data as an example: ai is the probability of patients visiting doctors on the ith day after onset (day 1 being the day of onset); ni is the number of new outbreak cases at day i. As the figure shows, the number of new cases at day 1 is n1. Of these, a1n1 cases will visit doctors in clinics on that day; a2n1 cases will visit doctors on the second day; a3n1 will do so on the third day (light blue pillars), etc. On day 2, there are n2 new cases, of which a1n2 cases will visit doctors in clinics on that day; a2n2 cases will do so on the following day; a3n2 will do so on the day after that (green pillars), etc. As a result, the visiting volume on day 1 is a1n1; on day 2, it is a2n1 + a1n2; on day 3, it is a3n1 + a2n2 + a1n3; etc. Thus, we formulated the relationship between outbreak data and visiting volume data as follows:

$$V_{i \times 1} = N_{i \times j} A_{j \times 1} = \begin{pmatrix} n_1 & 0 & 0 & \cdots & 0 \\ n_2 & n_1 & 0 & \cdots & 0 \\ n_3 & n_2 & n_1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_i & n_{i-1} & n_{i-2} & \cdots & n_{i-j+1} \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_j \end{pmatrix}$$

where Ni×j is a matrix of the time-series data of new outbreak cases, and ni is the count of new cases at day i. The matrix Aj×1 represents the discrete probability distribution of visiting clinics for each day following the onset of symptoms, and ai is the probability of patients visiting doctors on the ith day. Vi×1 is the converted time-series data for visiting volume.
Clinic visit data are used as an example; ai is the probability of patients visiting doctors on the ith day after onset; ni is the number of new outbreak cases at day i. Light blue pillars show the visits made by the cases with onset on day 1, and green pillars show the visits made by the cases with onset on day 2.
Similarly, the converted data of OTC drug purchase frequency (Oi×1) can be formulated as follows:

$$O_{i \times 1} = N_{i \times j} B_{j \times 1}$$

where Bj×1 is the discrete probability distribution of OTC drug purchasing for each day following the onset of symptoms.
School absenteeism surveillance concerns only the school-aged population. We assumed homogeneous population mixing in our models, which simplifies the process of disease transmission across different populations. Therefore, we used the proportion of school-aged children within the population (p) to estimate school-aged infections in simulated outbreaks. The school absenteeism data (Si×1) can be formulated as follows:

$$S_{i \times 1} = p \, N_{i \times j} C_{j \times 1}$$

where Cj×1 is the discrete probability distribution of absence from school for each day following the onset of symptoms, and p is the proportion of school-aged children within the population.
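The conversion described above is a discrete convolution of the daily new-case series with the day-by-day probability of a behavior. The Python sketch below shows this under stated assumptions; the function name `convert_outbreak` and all numeric inputs are hypothetical and are not the survey values reported in Table 3.

```python
def convert_outbreak(new_cases, behavior_probs, proportion=1.0):
    """Convert daily new outbreak cases into a daily syndromic count.

    new_cases[i]      : n_(i+1), new cases with onset on day i+1
    behavior_probs[k] : a_(k+1), probability of the behavior k days after onset
    proportion        : p, share of cases in the monitored population
                        (1.0 for clinic visits and OTC purchases)
    """
    converted = []
    for day in range(len(new_cases)):
        total = 0.0
        for delay, prob in enumerate(behavior_probs):
            onset_day = day - delay
            if onset_day >= 0:
                total += prob * new_cases[onset_day]
        converted.append(proportion * total)
    return converted

# Hypothetical inputs for illustration only.
n = [10, 20, 30]
a = [0.5, 0.3, 0.2]
visits = convert_outbreak(n, a)                       # clinic visits
absences = convert_outbreak(n, [0.4, 0.1], proportion=0.1)  # school absence
```

With these inputs, day 1 gives a1n1, day 2 gives a2n1 + a1n2, and day 3 gives a3n1 + a2n2 + a1n3, matching the worked description in the text. Behaviors that would fall after the last simulated day are dropped in this sketch.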
Because detailed data for our target population regarding the probability of seeking care and the delay in seeking care were not available in the literature, a survey was conducted to obtain the probability and time of these behaviors. We randomly sampled 10 households in each town and 5 households in each village within the study areas. A total of 2,473 participants (including 171 school-aged children) were sampled regarding whether and when they visited doctors, purchased OTC drugs, or were absent from school, once they had symptoms of influenza-like illness (fever + cough or sore throat).
The Integrated Surveillance System (ISS), the first electronic syndromic surveillance system for infectious diseases in rural China, was launched for our field experiment in six towns in Hubei Province on April 1, 2012. The ISS collects daily syndromic information from three data streams: chief complaints from health clinics, medication sales from retail pharmacies, and primary school absences. Chief complaints surveillance focuses on patients' main symptoms and basic information including age, gender, home address, and visiting time. Medication sale surveillance concerns daily sales of 98 drugs. School absence surveillance concerns the daily numbers of, and reasons given for, absences of students from primary school. Further details of the ISS may be found in previous studies.
We used ISS daily counts from three syndromic data streams (clinic visits, OTC drug purchase frequency, and primary school absence) in six towns in Hubei, China (Longwang, Zhangjing, Shiqiao, Zengji, Hougang, and Xiongkou; total population in the target regions was 326,984, population density was about 326 persons/km², sex ratio (male:female) was 1.03, and per capita GDP was $1,691). The ISS routinely collected data from 152 health centers (6 township hospitals, 146 village clinics), 11 township drugstores, and 26 primary schools (6 township and 20 village schools). The clinic visit (CV) data stream recorded the daily count of patients with symptoms of influenza-like illness. The OTC drug purchase frequency (OTC) data stream recorded the daily count of consumers who purchased three categories of drugs related to respiratory symptoms: antipyretics, compound cold medicine, and cough suppressants (a consumer was counted only once per purchase, even if he/she bought drugs from several categories). The school absence (SA) data stream recorded the daily number of absent students claiming to experience influenza-like illness symptoms (weekends and vacations excluded, because schools are closed). We used these three data streams from April 1, 2012, to June 30, 2013 (no outbreak took place during this period), to provide realistic baseline data.
Generation of semi-synthetic testing data
Given the complexities of real data, a semi-synthetic approach is preferable to a fully synthetic approach in assessment; the former superimposes simulated outbreak cases onto a realistic baseline and then uses this combination as testing data. In other words, the number of cases in the testing data at day t is the sum of the cases in the simulated outbreak data and in the realistic baseline data at day t. Testing datasets can be calculated using the following formulas:

$$TV(t) = BV(t) + V(t)$$

$$TO(t) = BO(t) + O(t)$$

$$TS(t) = BS(t) + S(t)$$

where TV(t), TO(t), and TS(t) are the numbers of cases in the testing datasets of clinic visit, OTC drug purchase frequency, and school absence at day t, respectively. BV(t), BO(t), and BS(t) are the numbers of cases in the baseline datasets of each data stream at day t. V(t), O(t), and S(t) are the numbers of cases in the simulated datasets of each data stream at day t.
Data obtained during April 1–9, 2012, provided background counts for the detection algorithm. The superimposing process began on April 10, 2012. To avoid bias due to seasonality and day-of-the-week effects, this process was repeated every day from April 10, 2012, to June 30, 2013, for each of the three realistic syndromic baselines. This yielded 447 testing datasets per scenario per data stream, for a total of 36,207 (447 × 27 × 3) datasets for analysis (Figure 3).
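The superimposition and the dataset count can be checked with a short Python sketch. The helper `superimpose` is a hypothetical name, and truncating outbreak days that run past the end of the baseline is an assumption; the text does not say how such days were handled.

```python
from datetime import date

def superimpose(baseline, simulated, start):
    """Add a simulated syndromic series onto a baseline from index `start`."""
    testing = list(baseline)
    for offset, extra in enumerate(simulated):
        day = start + offset
        if day < len(testing):   # assumption: days past the baseline end are dropped
            testing[day] += extra
    return testing

# One release per day from April 10, 2012, to June 30, 2013 (inclusive).
first_release = date(2012, 4, 10)
last_release = date(2013, 6, 30)
release_days = (last_release - first_release).days + 1

# 27 outbreak scenarios and 3 data streams per release day.
datasets_total = release_days * 27 * 3
```

Here `release_days` reproduces the 447 release days and `datasets_total` the 36,207 testing datasets stated in the text.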
A) Simulated Outbreak Data (generated by SEIR model) and converted syndromic data (generated by healthcare-seeking behaviors model); B) Testing CV Data (simulated CV + CV baseline); C) Testing OTC Frequency Data (simulated OTC + OTC baseline); D) Testing SA Data (simulated SA + SA baseline). The pink epidemic was one of the simulated outbreaks generated by the SEIR model. This could be converted into simulated clinic visits (green), simulated OTC drug purchase frequency (red), and simulated school absence (black), according to the healthcare-seeking behaviors model. The first simulated outbreak was released on 2012/4/10. Simulated syndromic data were superimposed onto corresponding baselines on the same release period (see B, C, and D). Every simulation released one outbreak to generate three testing datasets, including testing CV data, testing OTC data, and testing SA data. The simulation was repeated day by day during the whole surveillance period (2012/4/10–2013/6/30). Testing SA data on vacation breaks were defaulted as “0.” CV: clinic visits; OTC: over-the-counter; SA: school absence; SEIR: Susceptible – Exposed – Infectious – Recovered model.
Because the ISS has only been in use in rural China since 2012, we did not have long-term historical data as a background for our algorithm. Thus, a non-historical model from the Early Aberration Reporting System (EARS) was suitable for our data, which covered <2 years of background. The EARS has been increasingly used as a standard syndromic surveillance system in both the USA and China. The EARS models are based on the cumulative sum (CUSUM) method and consist of three algorithms, C1, C2, and C3, of increasing sensitivity (C3 being the most sensitive). The CUSUM statistic is computed from Xt, the count of cases at day t, and from μt and σt, the moving sample mean and standard deviation of the baseline, respectively. The C1 baseline is obtained from the 7 days closest to the current day (day t-7 through day t-1). C2 uses a 7-day baseline from day t-9 through day t-3. C3 is the sum of the C2 values for the past 3 days.
To obtain the receiver operating characteristic (ROC) curve and activity monitoring operation curves (AMOC), we set seven thresholds for each algorithm (0.1, 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0). These thresholds indicate the critical CUSUM value levels when the number of current cases exceeds the baseline mean by three standard deviations.
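The baseline windows described above can be made concrete with a Python sketch. The exact form of the statistic is not reproduced in this text, so the standardized deviation (Xt − μt)/σt used below is an assumption, as are the function names and the guard for a flat baseline. C3 follows the wording above (a plain sum of C2 over three days); some published descriptions of EARS truncate or offset the summed terms.

```python
from statistics import mean, stdev

def _standardized(counts, t, first_lag, last_lag):
    """(X_t - mean) / sd over the baseline days t-first_lag .. t-last_lag."""
    if t - first_lag < 0:
        raise ValueError("not enough background days before day t")
    window = counts[t - first_lag : t - last_lag + 1]
    sd = stdev(window)
    if sd == 0:
        sd = 1.0   # assumption: avoid division by zero on a flat baseline
    return (counts[t] - mean(window)) / sd

def c1(counts, t):
    return _standardized(counts, t, 7, 1)   # baseline: day t-7 .. t-1

def c2(counts, t):
    return _standardized(counts, t, 9, 3)   # baseline: day t-9 .. t-3

def c3(counts, t):
    # Sum of the C2 values for days t-2, t-1 and t.
    return sum(c2(counts, d) for d in range(t - 2, t + 1))

def flagged(counts, t, statistic, threshold):
    """True when the chosen statistic exceeds the threshold on day t."""
    return statistic(counts, t) > threshold

# Hypothetical daily counts: a stable baseline followed by a jump on the last day.
counts = [10, 12, 8, 10, 12, 8, 10, 10, 10, 10, 10, 40]
signal_today = flagged(counts, 11, c1, 3.0)
```

Nine background days are enough to compute C1 and C2 on the tenth day, which is consistent with using April 1–9, 2012, as background; C3 as written needs two further days.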
We designed seven surveillance strategies according to combinations of three data streams, with three single-stream surveillance strategies: (1) Clinic Visit Only, (2) OTC drug purchase frequency Only, (3) School Absence Only; and four multi-stream surveillance strategies: (4) Clinic Visit + OTC Frequency, (5) Clinic Visit + School Absence, (6) OTC Frequency + School Absence, (7) Clinic Visit + OTC Frequency + School Absence. We defined the multi-stream signal as the earliest signal generated in any sub-data stream. The performance of different strategies was compared to allow us to judge which strategy was the best and whether multi-stream surveillance was more efficient for outbreak detection.
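The seven strategies and the earliest-signal rule can be sketched in Python as follows. The stream labels follow the abbreviations CV, OTC, and SA used in this paper; the function name and the example flag days are hypothetical.

```python
from itertools import combinations

STREAMS = ("CV", "OTC", "SA")

# All non-empty combinations of the three streams: 3 single + 4 multi-stream.
strategies = [combo
              for size in (1, 2, 3)
              for combo in combinations(STREAMS, size)]

def multi_stream_signal(first_flag_day, strategy):
    """Earliest day on which any stream in the strategy flagged.

    first_flag_day maps stream -> first flagged day, or None if never flagged.
    Returns None when no stream in the strategy flagged.
    """
    days = [first_flag_day[s] for s in strategy if first_flag_day[s] is not None]
    return min(days) if days else None

# Hypothetical first-flag days for one simulated outbreak.
first_flag = {"CV": 6, "OTC": None, "SA": 4}
combined_day = multi_stream_signal(first_flag, ("CV", "OTC", "SA"))
```

Under this rule a multi-stream strategy can never signal later than its fastest component stream, but it also inherits the false alarms of every stream it includes.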
The metrics used to evaluate performance were the receiver operating characteristic (ROC) curve and activity monitoring operation curves (AMOC); these were drawn using the 7 thresholds mentioned above. ROC curves were plotted using 1-specificity and sensitivity at each threshold. Similarly, AMOC was plotted using 1-specificity and proportional timeliness (time to detection divided by the outbreak duration). Proportional timeliness enables the direct comparison of detection timeliness across different outbreak scenarios. We calculated these indicators by averaging the detection outcome across all 12,069 (447 × 27) analysis runs in each surveillance strategy.
Sensitivity (Se) was defined as the number of flagged aberrations that correctly corresponded to simulated outbreaks (f), divided by the total number of simulated outbreaks (s):

$$Se = \frac{f}{s}$$

Specificity (Sp) was defined as the total number of days that did not contain simulated outbreaks and remained unflagged (d), divided by the total number of days that did not contain simulated outbreaks (D):

$$Sp = \frac{d}{D}$$

Proportional timeliness (Pt) was defined as the time to detection (t, the number of days between the beginning of an outbreak and the first day the outbreak was flagged), divided by the outbreak duration (T):

$$Pt = \frac{t}{T}$$
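One way to turn these definitions into a single ROC/AMOC point for a given threshold is sketched below in Python. The record layout and the function name are hypothetical. Two choices are assumptions not stated in the text: specificity is pooled over all runs rather than averaged run by run, and an undetected outbreak is assigned Pt = 1.

```python
def summarize_runs(runs):
    """Summarize analysis runs at one threshold into (1-Sp, Se, Pt).

    Each run is a dict with keys:
      'detected'          : bool, outbreak flagged at least once
      'delay'             : t, days from outbreak start to first flag (None if missed)
      'duration'          : T, outbreak duration in days
      'unflagged'         : d, unflagged days without simulated outbreak
      'non_outbreak_days' : D, all days without simulated outbreak
    """
    f = sum(1 for run in runs if run["detected"])
    se = f / len(runs)
    # Assumption: pool d and D over runs before dividing.
    sp = (sum(run["unflagged"] for run in runs)
          / sum(run["non_outbreak_days"] for run in runs))
    # Assumption: a missed outbreak counts as detected only at its end (Pt = 1).
    pts = [run["delay"] / run["duration"] if run["detected"] else 1.0
           for run in runs]
    pt = sum(pts) / len(pts)
    return {"1-Sp": 1.0 - sp, "Se": se, "Pt": pt}

# Hypothetical runs for illustration only.
runs = [
    {"detected": True, "delay": 5, "duration": 20,
     "unflagged": 90, "non_outbreak_days": 100},
    {"detected": False, "delay": None, "duration": 40,
     "unflagged": 95, "non_outbreak_days": 100},
]
point = summarize_runs(runs)
```

Repeating this for each of the seven thresholds gives the seven points of an ROC curve (1-Sp against Se) and of an AMOC curve (1-Sp against Pt).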
To compare the overall performance of outbreak detection between single-stream surveillance and multi-stream surveillance, we calculated the overall sensitivity, specificity, and proportional timeliness by averaging these estimates across the relevant surveillance strategies.
The overall estimates for single-stream surveillance were calculated by averaging strategies of (1) Clinic Visit Only, (2) OTC drug purchase frequency Only, and (3) School Absence Only. The overall estimates for multi-stream surveillance were calculated by averaging strategies of (4) Clinic Visit + OTC Frequency, (5) Clinic Visit + School Absence, (6) OTC Frequency + School Absence, (7) Clinic Visit + OTC Frequency + School Absence.
All simulations were generated using VBA programming embedded in Microsoft Excel 2007; statistical analyses were conducted using SPSS version 12.0 (SPSS Inc., Chicago, IL, USA).
Written informed consent statements were obtained from all relevant participants, including parents of children. All participants and patients were anonymized and only aggregated data were used for data analysis. Personal identification information did not appear in the final database. The study was ethically approved by the Institutional Review Board of Tongji Medical College.
Realistic syndromic baseline
During the period from April 1, 2012, to June 30, 2013, the ISS recorded 16,956 visitors due to influenza-like syndrome (37.2±14.2 per day), 98,744 respiratory syndrome-related OTC drug sales (216.5±42.6 per day), and 715 student absences due to influenza-like syndrome (2.7±2.6 per day; Table S1 in File S1). The clinic visit data stream demonstrated higher levels of activity in summer and winter, coinciding with the seasonal peaks of respiratory diseases. Similar peaks occurred in the OTC drug purchase frequency data stream, but were less obvious in summer. Dips at the end of February 2013 in both clinic visits and OTC drug purchase frequency coincided with the Chinese New Year, in line with traditional Chinese avoidance of healthcare during this period. The school absenteeism data stream did not demonstrate a noticeable pattern because of the numerous vacation breaks (Figure 4).
Gray bars show three periods separated by vacation breaks. Circles point out dips during the Chinese New Year.
We also calculated Spearman's rank correlation coefficients between the different time series of data streams within three periods separated by vacation breaks (Table 2). The clinic visit stream correlated strongly with the OTC drug purchase frequency stream in all periods (maximum r = 0.79, lag = 5∼7 days in period 1; maximum r = 0.49, lag = 6∼7 days in period 2; and maximum r = 0.66, lag = 0 days in period 3). The clinic visit stream correlation with the school absenteeism stream was high in period 1 (maximum r = 0.57, lag = 2∼3 days) and period 2 (maximum r = 0.33, lag = -3∼-4 days), but not significant in period 3. The OTC drug purchase frequency and school absenteeism streams were significantly correlated in all periods (maximum r = 0.52, lag = -2∼-6 days in period 1; maximum r = 0.63, lag = -12∼-13 days in period 2; and maximum r = 0.28, lag = 3∼6 days in period 3).
Generation of simulated outbreaks
Using the SEIR model, 27 scenarios of simulated outbreaks were generated using different combinations of parameters (Table 1 and Table S2 in File S1). The number of infected individuals varied from 34 to 207. Outbreak 27 was the strongest, with a maximum of 20 cases occurring on the peak day. The three weakest outbreaks were 1, 2, and 3, which lasted for a long time without peaks; these were also more likely to be sporadic outbreaks. Total cases and peak cases rose with increasing R0, whereas increasing R0 decreased peak time and total duration. A decrease in the value of 1/ω (i.e., a decrease in the incubation period) raised the number of total cases and peak cases, and decreased peak time and duration. In contrast, raising the value of 1/γ (i.e., an increase in the infectious period) increased the number of total cases and duration, but did not significantly influence peak cases and peak time.
Healthcare-seeking behaviors pattern
The healthcare-seeking behavior questionnaires targeting an influenza-like syndrome were completed and returned with an overall response rate of 75.7% (n = 1,873 of 2,473; 53.0% male; 40.2 (17.9) years of age; 6.9% primary-school age population). Table 3 shows the discrete probability distribution of healthcare-seeking behaviors following the onset of syndrome. Of the total population, 51.3% (960/1,873) participants replied that they would visit a doctor, and 39.8% (746/1,873) would purchase drugs. In the primary school-age population, 25.4% (33/130) would miss school. Using these parameters, we converted 27 simulated outbreak datasets into 81 relevant syndromic datasets (Table S3 in File S1).
Validity and timeliness of syndromic surveillance
Figure 5 shows the ROCs of all surveillance strategies. In single-stream surveillance strategies (Figure 5-A, B, C), the clinic visit stream manifested the highest validity, while the OTC drug purchase frequency stream exhibited the lowest validity for all algorithms. In multi-stream surveillance strategies (Figure 5-D, E, F), all strategies exhibited similar performance for outbreak detection. We also plotted the ROCs of overall single-stream and multi-stream surveillance (Figure 5-G, H, I). According to the comparison results, overall multi-stream surveillance had superior sensitivity to overall single-stream surveillance, especially when specificities were below 90% (namely 1-Sp>10%).
CV: clinic visit surveillance; EARS: the Early Aberration Reporting System; OTC: over-the-counter frequency surveillance; SA: school absence surveillance. Overall single-stream surveillance contains strategies of CV, OTC, and SA; overall multi-stream surveillance contains strategies of CV + OTC, CV + SA, OTC + SA, and CV + OTC + SA.
Figure 6 shows the AMOC curves of all surveillance strategies. In single-stream surveillance (Figure 6-A, B, C), the school absenteeism stream had slightly better timeliness than the other two data streams for all algorithms; this, however, was not obvious at higher levels of specificity. In multi-stream surveillance (Figure 6-D, E, F), all strategies exhibited similar timeliness of outbreak detection. By comparing overall single-stream and multi-stream surveillance (Figure 6-G, H, I), we found that overall multi-stream surveillance had superior timeliness to overall single-stream surveillance when specificities were below 90% (namely 1-Sp>10%). However, at a high level of specificity, there was little difference in timeliness between single-stream and multi-stream surveillance for all algorithms.
Meanwhile, comparing the positions of ROC and AMOC curves between different detection algorithms, we found that the EARS ∼ C1 model exhibited the best validity (ROC curves of C1 were closest to the top left origin) and timeliness (AMOC curves of C1 were closest to the bottom left origin), while the EARS ∼ C3 model was slightly inferior to the other two algorithms in our study.
We explored the performance of multi-stream syndromic surveillance on outbreak detection in rural Hubei, China by using simulation influenza A (H1N1) outbreaks based on the healthcare-seeking behaviors model. Although several other studies have previously evaluated the performance of syndromic surveillance through simulation methods, most had done so by evaluating a single data stream , , , . One study considered concurrent surveillance of two data streams ; however, only a simple and fixed probability of healthcare-seeking was used, and did not factor in the time individuals sought care. In fact, all syndromic data streams were associated with each other. A simulation based on the healthcare-seeking behaviors model, which assessed individuals' healthcare-seeking behavior patterns following the onset of symptoms, is a useful framework for simulating associated syndromic datasets.
Results from our study areas showed that clinic visit surveillance exhibited the most favorable validity, similar to findings of previous studies , . The clinic visit data stream, which collects individual medical details including demographic characteristics and chief complaints, makes it easy to screen out visitors using precise symptoms related to specific diseases. Detail individual chief complaints can help to exclude those visitors who did not have symptom related to the target diseases. This decreases the non-specific noise bias of baseline datasets. Consequently, the fluctuations of visit volume data in clinics could largely be influenced by outbreaks, and allow for fine detection. When compared to the CV stream, the SA stream collected rough individual information of absence reasons that was obtained from patients, and the OTC stream could never collect individual information about reasons for medicine purchases due privacy concerns.
Like some previous studies , , we also found that SA showed a satisfactory performance of outbreak detection. Primary school-aged students who gather regularly in a relatively closed and crowded environment, allow for diseases to spread easily. Therefore, school absenteeism surveillance may be more sensitive to contagious diseases. Additionally, although only 25.4% of primary school-aged students would miss school after the onset of an influenza-like syndrome (Table 3), it still led to a drastic fluctuation compared to the relatively low baseline data (Figure 3-D), so that outbreaks could still be detected effectively. There are, however, some disadvantages to school absenteeism surveillance, such as limited coverage (only school-aged children), and intermittent surveillance due to schools being closed on weekends, and vacations.
In our study areas, the OTC drug purchase frequency surveillance exhibited inferior performance of outbreak detection, when compared to clinical visit and school absenteeism surveillance. We surmise that this occurred due to the fact that drug sale information is less specific to diseases (not all consumers buy drugs for illnesses; and drug sale records contain no individual medical information; moreover drugs may be preserved during a long period of time and to be used later). Additionally, the OTC drug purchase frequency baseline (an average of 216 persons per day) was much higher, so that the extra sales volume resulting from outbreaks was not apparent (Figure 3-C). Indeed, fluctuation of surveillance data can generally be influenced by the baseline. Outbreak is easier to detect when the incidence and variation of the baseline count are low relative to outbreak cases. Researchers have reported that outbreaks with a magnitude of less than 10% of the baseline are difficult to detect when operating at a high specificity . Therefore, detection performance of OTC drug purchase frequency surveillance was discounted by a dilution effect resulting from a large degree of non-specific noises in the baseline data.
Some modifications of OTC drug purchase frequency surveillance in SSI are worth performing in the future, such as collecting individual medical information as soon as possible and reducing non-specific baseline counts through a more refined classification of drug categories.
Our study also showed that multi-stream syndromic surveillance could improve outbreak detection performance at a low level of specificity; however, this improvement was not manifested when the specificity level was above 90%. In fact, multi-stream syndromic surveillance seems to improve detection performance by collecting a greater amount of pre-clinical information, but does so at the cost of non-specific signals. The more data streams are used, the more non-specific signals are captured; this may explain why multi-stream surveillance outperformed single-stream surveillance only at lower specificity. In practice, researchers usually raise the algorithm threshold to obtain a high specificity for outbreak detection. The outbreaks that can be detected at a higher threshold, however, are usually more intense, and a more intense outbreak itself gives rise to drastic fluctuations in the relevant syndromic surveillance data streams. Therefore, in general, severe outbreaks can be easily detected by both single-stream and multi-stream surveillance. This bias toward stronger outbreaks at a high threshold may partly explain the similar outbreak detection performance exhibited by single-stream and multi-stream surveillance at a high specificity.
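The trade-off between added information and added non-specific signals can be sketched numerically. The example below assumes a simple "alarm in any stream" combination rule and statistically independent streams; both are illustrative assumptions, since the text here does not specify how streams were combined, and the per-stream sensitivity and specificity values are hypothetical.

```python
from math import prod

def combine_any_alarm(sensitivities, specificities):
    """Sensitivity and specificity when an alarm in ANY stream counts as a signal.

    Assumes the streams alarm independently of one another, which is a
    simplifying assumption made only for this illustration.
    """
    se = 1 - prod(1 - s for s in sensitivities)
    sp = prod(specificities)
    return se, sp

# Hypothetical per-stream values, not taken from the study.
se_single, sp_single = 0.60, 0.95
for k in (1, 2, 3):
    se, sp = combine_any_alarm([se_single] * k, [sp_single] * k)
    print(f"{k} stream(s): Se = {se:.3f}, Sp = {sp:.3f}")
```

Under this rule each added stream raises sensitivity but lowers combined specificity, which is consistent with a multi-stream advantage that appears mainly at lower specificity.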
Like previous studies, we found that C1 had the best timeliness of detection because it used data from the 7 days closest to the current day (day t-7 through day t-1) as baseline, while C2 and C3 used data from day t-9 through day t-2 as baseline. Additionally, we found that C1 was also the superior model for validity of outbreak detection in our study sites, although the optimal model among C1, C2, and C3 may vary with the size, distribution, and duration of outbreaks.
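The difference in baseline windows can be made concrete with a short sketch. It computes a standardized deviation of the current count from the mean of a moving baseline, using the window bounds stated in the text. The function names, the floor on the standard deviation, and the sample counts are hypothetical; C3, which accumulates the statistic over several days, is omitted.

```python
from statistics import mean, stdev

def ears_like_statistic(counts, t, first_lag, last_lag):
    """Standardized deviation of counts[t] from a moving baseline.

    The baseline covers days t-first_lag through t-last_lag inclusive.
    Returns None when there is not enough history before day t.
    """
    start, end = t - first_lag, t - last_lag
    if start < 0:
        return None
    window = counts[start:end + 1]
    mu, sd = mean(window), stdev(window)
    # Guard against a flat baseline; the floor value is an arbitrary choice.
    sd = max(sd, 0.5)
    return (counts[t] - mu) / sd

def c1(counts, t):
    # Baseline: day t-7 through day t-1, as described in the text.
    return ears_like_statistic(counts, t, first_lag=7, last_lag=1)

def c2(counts, t):
    # Baseline: day t-9 through day t-2, following the text as written.
    return ears_like_statistic(counts, t, first_lag=9, last_lag=2)

# Hypothetical daily counts with a jump on the last day.
daily = [10, 12, 11, 9, 10, 11, 12, 10, 11, 10, 30]
t = len(daily) - 1
print("C1:", round(c1(daily, t), 2), "C2:", round(c2(daily, t), 2))
```

An alarm would be raised when the statistic exceeds a chosen threshold; raising that threshold is what trades sensitivity for specificity.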
Although outbreak simulation methods allow greater flexibility and permit quantitative evaluation of aberration detection performance, generalization from simulated outbreaks to real outbreaks is far from straightforward. Some limitations of our simulation methods and assumptions should be addressed.
First, the dynamic model used to simulate infectious disease transmission in our study is a deterministic model, defined by a system of ordinary differential equations. An attractive feature of this deterministic model is that it describes, in a straightforward manner, how the number of infections evolves through time. Nevertheless, disease transmission in the real world is stochastic and complex. Therefore, stochastic models may be more appropriate than deterministic models when fitting models to data. Some studies have reported relevant methods, such as specifying probability distributions for the incubation and infectious periods, and estimating the essential parameters through the Markov chain Monte Carlo (MCMC) method or the sequential Bayesian method.
Second, the host population was homogeneously mixed, meaning that simulated outbreaks did not account for the heterogeneous nature of human contact. All individuals were equally likely to come into contact with every other individual. This ignored the actual diversity of disease transmission across different populations, such as transmission within families, schools, or social circles. In fact, in the healthcare-seeking behaviors model, we also used a simple ‘p’ (the proportion of school-aged population) to structure the simulated school-aged infections. This likely underestimated the number of simulated school-aged infections by ignoring the vulnerability of children to influenza, as well as the fact that school children tend to spread diseases within the school environment.
Third, we assumed that the entire population was susceptible at the beginning of the simulation. We considered this reasonable for two reasons: pre-existing immunity to influenza A (H1N1) virus was very low in the Chinese population, and a simulation based on an entirely susceptible population can reflect emerging infectious diseases or bioterrorism, which are the likely focus of syndromic surveillance.
Fourth, the simulated population was static. Natural population change (births and deaths) was ignored, because the simulated outbreaks did not last long enough for such change to matter.
Fifth, recall bias may have existed in the questionnaire survey of healthcare-seeking behaviors. Information about healthcare-seeking behaviors was obtained from the memory of local residents by asking them to recall how they sought health care during their last influenza-like illness. For this reason, we did not stratify by syndrome severity, since reported severity could be heavily biased by the inaccurate memories and subjective judgments of the residents surveyed.
Sixth, because the parameters of the healthcare-seeking behaviors model were obtained from a survey of local residents, the results of our study apply only to the six towns in rural Hubei, China. Although we cannot extend them to other regions or populations, our evaluation methods based on the healthcare-seeking behaviors model may be applied anywhere.
Finally, this paper was merely a preliminary exploration of evaluation methods based on the healthcare-seeking behaviors model, assessing three data streams in rural Hubei, China. Other data streams, such as nurse hotline calls or workplace absenteeism, remain potential options for assessment using this method as long as they provide information on the relevant parameters of healthcare-seeking behaviors.
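The modelling assumptions behind the first four limitations (deterministic dynamics, homogeneous mixing, an entirely susceptible starting population, and no births or deaths) can be seen together in a minimal SEIR sketch. The version below integrates the standard SEIR ordinary differential equations with forward Euler steps; the parameter values, the single seeded exposed case, and the step size are hypothetical and are not those used in the study.

```python
def simulate_seir(n, beta, sigma, gamma, days, e0=1.0, dt=0.1):
    """Integrate a deterministic SEIR model with forward Euler steps.

    dS/dt = -beta*S*I/N          (homogeneous mixing)
    dE/dt = beta*S*I/N - sigma*E
    dI/dt = sigma*E - gamma*I
    dR/dt = gamma*I
    Returns one (S, E, I, R) tuple per day, including day 0.
    """
    # Everyone except the seeded exposed cases starts susceptible.
    s, e, i, r = n - e0, e0, 0.0, 0.0
    steps_per_day = round(1 / dt)
    trajectory = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            # No births or deaths: individuals only move between compartments.
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        trajectory.append((s, e, i, r))
    return trajectory

# Hypothetical parameters, not those of the study.
traj = simulate_seir(n=10_000, beta=0.6, sigma=0.5, gamma=0.33, days=120)
peak_day = max(range(len(traj)), key=lambda d: traj[d][2])
print("peak infectious on day", peak_day, "with", round(traj[peak_day][2]), "cases")
```

Because the model is deterministic, repeated runs with the same parameters give identical epidemic curves; a stochastic variant would instead draw the transitions between compartments at random.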
Overall, outbreak simulation based on the healthcare-seeking behaviors model offers a method for evaluating the detection performance of multi-stream syndromic surveillance. In the six towns within our study areas, clinic visit surveillance and school absenteeism surveillance exhibited satisfactory outbreak detection performance; multi-stream surveillance yielded better validity and timeliness than single-stream surveillance at low specificity (Sp <90%). We aim to explore other potential factors, such as stratification by age, detection scales, and outbreak intensity, in future research.
Table S1, Baseline data of syndromic surveillance. Table S2, The 27 scenarios of simulated outbreaks generated by the SEIR model. Table S3, Converted syndromic datasets generated by the healthcare-seeking behavior model.
We acknowledge the 152 health centers, 11 drugstores, and 26 primary schools in our study areas for providing syndromic surveillance data. We also thank our editors and anonymous referees for comments that substantially improved the manuscript.
Conceived and designed the experiments: YZF. Performed the experiments: YW HBJ WWY MY. Analyzed the data: YZF. Contributed reagents/materials/analysis tools: WRY VKD BX HJD LP SFN. Wrote the paper: YZF.
- 1. Morse SS (2012) Public health surveillance and infectious disease detection. Biosecur Bioterror 10: 6–16.
- 2. Buckeridge DL (2007) Outbreak detection through automated surveillance: a review of the determinants of detection. J Biomed Inform 40: 370–379.
- 3. Hiller KM, Stoneking L, Min A, Rhodes SM (2013) Syndromic surveillance for influenza in the emergency department-a systematic review. PLoS One 8: e73832.
- 4. Lombardo JS, Burkom H, Pavlin J (2004) ESSENCE II and the framework for evaluating syndromic surveillance systems. MMWR Morb Mortal Wkly Rep 53 Suppl: 159–165.
- 5. Moore KM, Edgar BL, McGuinness D (2008) Implementation of an automated, real-time public health surveillance system linking emergency departments and health units: rationale and methodology. CJEM 10: 114–119.
- 6. May L, Chretien JP, Pavlin JA (2009) Beyond traditional surveillance: applying syndromic surveillance to developing settings—opportunities and challenges. BMC Public Health 9: 242.
- 7. Edge VL, Pollari F, Ng LK, Michel P, McEwen SA, et al. (2006) Syndromic Surveillance of Norovirus using Over-the-counter Sales of Medications Related to Gastrointestinal Illness. Can J Infect Dis Med Microbiol 17: 235–241.
- 8. Heffernan R, Mostashari F, Das D, Besculides M, Rodriguez C, et al. (2004) New York City syndromic surveillance systems. MMWR Morb Mortal Wkly Rep 53 Suppl: 23–27.
- 9. Lombardo J, Burkom H, Elbert E, Magruder S, Lewis SH, et al. (2003) A systems overview of the Electronic Surveillance System for the Early Notification of Community-Based Epidemics (ESSENCE II). J Urban Health 80: i32–42.
- 10. Magruder SF, Lewis SH, Najmi A, Florio E (2004) Progress in understanding and using over-the-counter pharmaceuticals for syndromic surveillance. MMWR Morb Mortal Wkly Rep 53 Suppl: 117–122.
- 11. Mandl KD, Overhage JM, Wagner MM, Lober WB, Sebastiani P, et al. (2004) Implementing syndromic surveillance: a practical guide informed by the early experience. J Am Med Inform Assoc 11: 141–150.
- 12. van den Wijngaard C, van Asten L, van Pelt W, Nagelkerke NJ, Verheij R, et al. (2008) Validation of syndromic surveillance for respiratory pathogen activity. Emerg Infect Dis 14: 917–925.
- 13. Buckeridge DL, Burkom H, Moore A, Pavlin J, Cutchis P, et al. (2004) Evaluation of syndromic surveillance systems—design of an epidemic simulation model. MMWR Morb Mortal Wkly Rep 53 Suppl: 137–143.
- 14. Buckeridge DL, Switzer P, Owens D, Siegrist D, Pavlin J, et al. (2005) An evaluation model for syndromic surveillance: assessing the performance of a temporal algorithm. MMWR Morb Mortal Wkly Rep 54 Suppl: 109–115.
- 15. Chowell G, Nishiura H, Bettencourt LM (2007) Comparative estimation of the reproduction number for pandemic influenza from daily case notification data. J R Soc Interface 4: 155–166.
- 16. Nsoesie EO, Beckman RJ, Marathe MV (2012) Sensitivity analysis of an individual-based model for simulation of influenza epidemics. PLoS One 7: e45414.
- 17. Truscott J, Fraser C, Cauchemez S, Meeyai A, Hinsley W, et al. (2012) Essential epidemiological mechanisms underpinning the transmission dynamics of seasonal influenza. J R Soc Interface 9: 304–312.
- 18. Yang F, Yuan L, Tan X, Huang C, Feng J (2013) Bayesian estimation of the effective reproduction number for pandemic influenza A H1N1 in Guangdong Province, China. Ann Epidemiol 23: 301–306.
- 19. Uchida M, Tsukahara T, Kaneko M, Washizuka S, Kawa S (2013) Evaluation of factors affecting variations in influenza A/H1N1 history in university students, Japan. J Infect Chemother 19: 665–672.
- 20. Wang C, Yu E, Xu B, Wang W, Li L, et al. (2012) Epidemiological and clinical characteristics of the outbreak of 2009 pandemic influenza A (H1N1) at a middle school in Luoyang, China. Public Health 126: 289–294.
- 21. Punpanich W, Chotpitayasunondh T (2012) A review on the clinical spectrum and natural history of human influenza. Int J Infect Dis 16: e714–723.
- 22. Carrat F, Vergu E, Ferguson NM, Lemaitre M, Cauchemez S, et al. (2008) Time lines of infection and disease in human influenza: a review of volunteer challenge studies. Am J Epidemiol 167: 775–785.
- 23. Xu C, Bai T, Iuliano AD, Wang M, Yang L, et al. (2011) The seroprevalence of pandemic influenza H1N1 (2009) virus in China. PLoS One 6: e17919.
- 24. Yan W, Palm L, Lu X, Nie S, Xu B, et al. (2013) ISS—an electronic syndromic surveillance system for infectious disease in rural China. PLoS One 8: e62749.
- 25. Yan WR, Nie SF, Xu B, Dong HJ, Palm L, et al. (2012) Establishing a web-based integrated surveillance system for early detection of infectious disease epidemic in rural China: a field experimental study. BMC Med Inform Decis Mak 12: 4.
- 26. Sosin DM (2003) Draft framework for evaluating syndromic surveillance systems. J Urban Health 80: i8–13.
- 27. Jackson ML, Baer A, Painter I, Duchin J (2007) A simulation study comparing aberration detection algorithms for syndromic surveillance. BMC Med Inform Decis Mak 7: 6.
- 28. Hutwagner L, Browne T, Seeman GM, Fleischauer AT (2005) Comparing aberration detection methods with simulated data. Emerg Infect Dis 11: 314–316.
- 29. Hutwagner L, Thompson W, Seeman GM, Treadwell T (2003) The bioterrorism preparedness and response Early Aberration Reporting System (EARS). J Urban Health 80: i89–96.
- 30. Yang P, Duan W, Lv M, Shi W, Peng X, et al. (2009) Review of an influenza surveillance system, Beijing, People's Republic of China. Emerg Infect Dis 15: 1603–1608.
- 31. Hutwagner LC, Thompson WW, Seeman GM, Treadwell T (2005) A simulation model for assessing aberration detection methods used in public health surveillance for systems with limited baselines. Stat Med 24: 543–550.
- 32. Buckeridge DL, Burkom H, Campbell M, Hogan WR, Moore AW (2005) Algorithms for rapid outbreak detection: a research synthesis. J Biomed Inform 38: 99–113.
- 33. Watkins RE, Eagleson S, Veenendaal B, Wright G, Plant AJ (2008) Applying cusum-based methods for the detection of outbreaks of Ross River virus disease in Western Australia. BMC Med Inform Decis Mak 8: 37.
- 34. Kleinman KP, Abrams A, Mandl K, Platt R (2005) Simulation for assessing statistical methods of biologic terrorism surveillance. MMWR Morb Mortal Wkly Rep 54 Suppl: 101–108.
- 35. Rha B, Burrer S, Park S, Trivedi T, Parashar UD, et al. (2013) Emergency department visit data for rapid detection and monitoring of norovirus activity, United States. Emerg Infect Dis 19: 1214–1221.
- 36. Van-Dijk A, Aramini J, Edge G, Moore KM (2009) Real-time surveillance for respiratory disease outbreaks, Ontario, Canada. Emerg Infect Dis 15: 799–801.
- 37. Kara EO, Elliot AJ, Bagnall H, Foord DG, Pnaiser R, et al. (2012) Absenteeism in schools during the 2009 influenza A(H1N1) pandemic: a useful tool for early detection of influenza activity in the community? Epidemiol Infect 140: 1328–1336.
- 38. O'Neill PD (2010) Introduction and snapshot review: relating infectious disease transmission models to data. Stat Med 29: 2069–2077.
- 39. Cauchemez S, Carrat F, Viboud C, Valleron AJ, Boelle PY (2004) A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data. Stat Med 23: 3469–3487.
- 40. Chowell G, Simonsen L, Towers S, Miller MA, Viboud C (2013) Transmission potential of influenza A/H7N9, February to May 2013, China. BMC Med 11: 214.
- 41. Mikolajczyk RT, Akmatov MK, Rastin S, Kretzschmar M (2008) Social contacts of school children and the transmission of respiratory-spread pathogens. Epidemiol Infect 136: 813–822.