
Evaluation and comparison of statistical methods for early temporal detection of outbreaks: A simulation-based study

  • Gabriel Bédubourg ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft

    gabrielbedubourg@hotmail.fr

    Current address: CESPA, GSBDD Marseille Aubagne, 111 Avenue de la Corse, BP 40026, 13568 Marseille Cedex 02, France

    Affiliations CESPA, French Armed Forces Center for Epidemiology and Public Health, Marseille, France, Aix Marseille Univ, INSERM, IRD, SESSTIM, Sciences Economiques & Sociales de la Santé & Traitement de l’Information Médicale, Marseille, France

  • Yann Le Strat

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Software, Supervision, Validation, Writing – review & editing

    Affiliation Santé publique France, French national public health agency, F-94415 Saint-Maurice, France

Abstract

The objective of this paper is to evaluate a panel of statistical algorithms for temporal outbreak detection. Based on a large dataset of simulated weekly surveillance time series, we performed a systematic assessment of 21 statistical algorithms, 19 implemented in the R package surveillance and two other methods. We estimated false positive rate (FPR), probability of detection (POD), probability of detection during the first week, sensitivity, specificity, negative and positive predictive values and F1-measure for each detection method. Then, to identify the factors associated with these performance measures, we ran multivariate Poisson regression models adjusted for the characteristics of the simulated time series (trend, seasonality, dispersion, outbreak sizes, etc.). The FPR ranged from 0.7% to 59.9% and the POD from 43.3% to 88.7%. Some methods had a very high specificity, up to 99.4%, but a low sensitivity. Methods with a high sensitivity (up to 79.5%) had a low specificity. All methods had a high negative predictive value, over 94%, while positive predictive values ranged from 6.5% to 68.4%. Multivariate Poisson regression models showed that performance measures were strongly influenced by the characteristics of time series. Past or current outbreak size and duration strongly influenced detection performances.

Introduction

Public health surveillance is the ongoing, systematic collection, analysis, interpretation, and dissemination of data for use in public health action to reduce morbidity and mortality of health-related events and to improve health [1]. One of the objectives of health surveillance is outbreak detection, which is crucial to enabling rapid investigation and implementation of control measures [2]. The threat of bioterrorism has stimulated interest in improving health surveillance systems for early detection of outbreaks [3, 4], as have natural disasters and humanitarian crises, such as earthquakes or the 2005 tsunami, and the recent emergence or reemergence of infectious diseases such as Middle East Respiratory Syndrome, caused by a novel coronavirus (MERS-CoV), in 2012 [5] or Ebola in West Africa in 2014 [6].

Nowadays, a large number of surveillance systems are computer-supported. The computer support and statistical alarms are intended to improve outbreak detection for traditional or syndromic surveillance [7, 8]. These systems routinely monitor a large amount of data, recorded as time series of counts in a given geographic area for a given population. They produce statistical alarms that need to be confirmed by an epidemiologist, who determines if further investigation is needed. One limitation of these detection systems is an occasional lack of specificity, leading to false alarms that can overwhelm the epidemiologist with verification tasks [9, 10]. It is thus important to implement statistical methods that offer a good balance between sensitivity and specificity in order to detect a large majority of outbreaks without generating too many false positive alarms.

In the literature, a broad range of statistical methods has been proposed to detect outbreaks from surveillance data. The main statistical approaches have been reviewed by Shmueli et al. [11] and Unkel et al. [12]. Restricting these reviews to methods that detect outbreaks in time without integrating the spatial distribution of cases, the general principle is to identify a time interval in which the observed number of cases of an event under surveillance (i.e. the number of reported cases) is significantly higher than expected. This identification is mainly based on a two-step process: first, an expected number of cases of the event of interest for the current time unit (generally a week or a day) is estimated; it is then compared to the observed value by a statistical test, and a statistical alarm is triggered if the observed value is significantly different from the expected value. The main difference between statistical methods lies in how the expected value is estimated, which is most often done using statistical process control, regression techniques or a combination of both [12].
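As a minimal illustration of this two-step principle (a hedged sketch in R, not any of the published algorithms evaluated below; all values are illustrative), the expected count for the current week can be estimated from the historical mean and compared to the observed count through an upper Poisson quantile:

# Illustrative two-step detection on a single weekly time series.
# Generic sketch only, not one of the 21 evaluated algorithms.
set.seed(1)
history <- rpois(260, lambda = 8)            # five years of baseline weekly counts
current <- 19                                # observed count for the current week

expected  <- mean(history)                   # step 1: estimate the expected count
threshold <- qpois(0.99, lambda = expected)  # illustrative 99% Poisson quantile

alarm <- current > threshold                 # step 2: compare observed and expected
alarm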

A major constraint to the practical implementation of these methods is their capacity to be run on an increasing number of time series, provided by multiple sources of information and centralized in large databases [3, 13, 14]. Monitoring a large number of polymorphic time series requires flexible statistical methods able to deal with several well-known characteristics of time series: the frequency and variance of the number of cases, secular trend and one or more seasonality terms [14]. Although some authors have proposed classifying time series into a small number of categories and seeking a suitable algorithm for each category, in this automated and prospective framework statistical methods cannot easily be fine-tuned by choosing the most appropriate parameters for each time series in an operational way, as explained by Farrington et al. [15].

A key question for public health practitioners is what method(s) can be adopted to detect the effects of unusual events on the data. Some authors have proposed a systematic assessment of the performances of certain methods in order to choose one reference algorithm [16–20]. They assessed these methods on a real dataset [16, 21], a simulated dataset [18–20, 22, 23] or on real time series to which simulated outbreaks were added [24, 25]. Simulating data offers the advantage of knowing the exact occurrence of the simulated outbreaks and their characteristics (amplitude, etc.). For example, Lotze et al. developed a simulated dataset of time series and outbreak signatures [26]. In the same way, Noufaily et al. [9] proposed a thorough simulation study to improve the Farrington algorithm [15]. Guillou et al. [27] compared the performance of their own algorithm to that of the improved Farrington, using the same simulated dataset. This dataset was also used by Salmon et al. to assess their method [28].

To our knowledge, no study has been proposed to thoroughly evaluate and compare the performance of a broad range of methods on a large simulated dataset.

The objective of this paper is to evaluate the performance of 21 statistical methods applied to large simulated datasets for outbreak detection in weekly health surveillance. The simulated dataset is presented in Section 2. The 21 evaluated methods and performance measures are described in Section 3. Evaluations and comparisons are presented in Section 4. A discussion follows in the last section.

Materials

We simulated data following the approach proposed by Noufaily et al. [9].

First, simulated baseline data (i.e. time series of counts in the absence of outbreaks) were generated from a negative binomial model with mean μ and variance ϕμ, where ϕ ≥ 1 is the dispersion parameter. The mean at time t, μ(t), depends on a trend and on seasonality modeled using Fourier terms:

μ(t) = exp( θ + βt + Σ_{j=1}^{m} [ γ1 cos(2πjt/52) + γ2 sin(2πjt/52) ] )   (1)

Time series were simulated from 42 parameter combinations (called scenarios and presented in Table 1 in [9]) with different values taken by θ, β, γ1, γ2, m and ϕ, respectively associated with the baseline frequency of counts, trend, seasonality (no seasonality: m = 0, annual seasonality: m = 1, biannual seasonality: m = 2) and the dispersion parameter. For each scenario, 100 replicates of the baseline data (time series with 624 weeks) were generated. We thus obtained 42 × 100 = 4200 simulated time series. The last 49 weeks of each time series were named current weeks. The evaluated algorithms were run on these most recent 49 weeks. Performance measures described below were computed based on detection during these 49 weeks.
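As an illustration of this baseline generation step, the following R sketch simulates one series of 624 weekly counts from the model in Eq (1); the parameter values shown are illustrative and are not taken from the 42 scenarios of [9]:

# Sketch of one simulated baseline series following Eq (1); illustrative parameters.
set.seed(123)
n      <- 624                        # weeks per series
t      <- 1:n
theta  <- 1.5; beta <- 0; m <- 1     # baseline level, trend, annual seasonality
gamma1 <- 0.3; gamma2 <- 0.3
phi    <- 2                          # dispersion (variance = phi * mu)

fourier <- 0
for (j in seq_len(m)) {
  fourier <- fourier + gamma1 * cos(2 * pi * j * t / 52) + gamma2 * sin(2 * pi * j * t / 52)
}
mu <- exp(theta + beta * t + fourier)

# Negative binomial with mean mu and variance phi * mu (size = mu / (phi - 1));
# reduces to Poisson when phi = 1.
baseline <- if (phi > 1) rnbinom(n, mu = mu, size = mu / (phi - 1)) else rpois(n, mu)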

Second, for each time series, five outbreaks were simulated. Four outbreaks were generated in baseline weeks. Each outbreak started at a randomly drawn week, and we generated the outbreak size (i.e. the number of outbreak cases) as a Poisson variable with mean equal to a constant k1 times the standard deviation of the counts observed at the starting week. The fifth outbreak was generated in the current weeks in the same manner, using another constant denoted k2. We chose the values of k1 to be 0, 2, 3, 5 and 10 in baseline weeks and k2 from 1 to 10 in current weeks, as in [9].

Finally, the outbreak cases were distributed over time according to a lognormal distribution with mean 0 and standard deviation 0.5.
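Continuing the baseline sketch above, one simulated outbreak could be added to a series as follows; the use of the model-based standard deviation sqrt(ϕμ) at the starting week and the handling of cases falling after the end of the series are assumptions made for illustration:

# Sketch: adding one simulated outbreak to the baseline series generated above.
start_week <- sample(576:624, 1)                      # random week among the 49 current weeks
k2         <- 5                                       # current outbreak size constant
sd_start   <- sqrt(phi * mu[start_week])              # assumed model-based sd of counts at that week

n_cases <- rpois(1, lambda = k2 * sd_start)           # total number of outbreak cases
lags    <- rlnorm(n_cases, meanlog = 0, sdlog = 0.5)  # lognormal spread of cases over time
weeks   <- start_week + floor(lags)                   # week in which each outbreak case falls

# Cases falling beyond week 624 are simply dropped in this sketch.
outbreak_counts <- table(factor(weeks, levels = 1:n))
series_with_outbreak <- baseline + as.integer(outbreak_counts)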

A total of 231,000 time series were generated from the 42 scenarios: 21,000 time series during the first step of the simulation process (42 scenarios × 100 replicates × 5 values of k1), and 210,000 time series during the second step (21,000 × 10 values of k2), leading to a large simulated dataset including a great variety of time series, as observed in real surveillance data. At the end of the simulation process, 10,290,000 current weeks were generated, among which 6.2% were classified as outbreak weeks because they were included in an outbreak.

Methods

Statistical methods

We studied 21 statistical methods, 19 of which were implemented in the R package surveillance [29, 30]:

  • the CDC algorithm [31],
  • the RKI 1, 2 and 3 algorithms [29],
  • the Bayes 1, 2 and 3 algorithms [29],
  • CUSUM variants: the original CUSUM [29, 32], a Rossi approximate CUSUM [32], a CUSUM algorithm for which the expected values are estimated by a GLM [29], and a mixed Rossi approximate CUSUM GLM algorithm [29],
  • the original Farrington algorithm [15] and the improved Farrington algorithm [9],
  • a count data regression chart (GLRNB) [29, 33] and a Poisson regression chart (GLR Poisson) [29, 34],
  • the OutbreakP method [35],
  • the EARS C1, C2 and C3 algorithms [19, 36].

For all simulated time series, we used the tuning parameters recommended by the authors of each algorithm when available, or otherwise those proposed by default in the package surveillance. The commands used from the R package surveillance and the control tuning parameters chosen for these 19 algorithms are presented in Table 1.

Table 1. Commands, control tuning parameters and references of 19 algorithms implemented in the R package surveillance.

https://doi.org/10.1371/journal.pone.0181227.t001
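For orientation, a call to one of these algorithms via the package surveillance takes the general form sketched below (here the improved Farrington method on a single series, with α = 0.01 and the remaining control parameters of Table 1 left at the package defaults); the placeholder counts and week range are illustrative, and argument names may differ slightly across package versions:

# Minimal sketch: running the improved Farrington algorithm from the R package
# surveillance on one weekly series; only a subset of the control parameters is shown.
library(surveillance)

counts <- rpois(624, lambda = 8)                     # placeholder weekly counts
stsObj <- sts(observed = counts, start = c(2000, 1), frequency = 52)

res <- farringtonFlexible(stsObj,
                          control = list(range = 576:624,   # the 49 current weeks
                                         alpha = 0.01))
alarms(res)   # weekly statistical alarms (the "alarm" slot of the returned sts object)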

We also proposed two additional methods not implemented in the package surveillance:

  • a periodic Poisson regression where μ(t) is defined as in Eq (1). The threshold is the 1 − α quantile of a Poisson distribution with mean equal to the predicted value at week t.
  • a periodic negative binomial regression, also defined as in Eq (1), where the threshold is the 1 − α quantile of a negative binomial distribution with mean equal to the predicted value at week t and a dispersion parameter estimated by the model.

These last two models were run on all the historical data. An alarm was triggered if the observed number of cases was greater than the upper limit of the prediction interval. These two methods are basic periodic regressions. The R code of these two algorithms is presented in the S24 Appendix.
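The exact code of these two methods is given in S24 Appendix. For readability, a simplified sketch of the periodic Poisson regression principle is shown below (annual seasonality only; the negative binomial variant would replace the Poisson fit and qpois with their negative binomial counterparts); the placeholder series and variable names are illustrative:

# Simplified sketch of the periodic Poisson regression method; the full code is in S24 Appendix.
alpha  <- 0.01
n      <- 624
counts <- rpois(n, lambda = 8)          # placeholder weekly series
t      <- 1:n

d <- data.frame(y  = counts, t = t,
                c1 = cos(2 * pi * t / 52), s1 = sin(2 * pi * t / 52))

current <- n                                                             # week under surveillance
fit     <- glm(y ~ t + c1 + s1, family = poisson, data = d[-current, ])  # fit on historical weeks
pred    <- predict(fit, newdata = d[current, ], type = "response")

threshold <- qpois(1 - alpha, lambda = pred)      # 1 - alpha quantile of the predictive Poisson
alarm     <- d$y[current] > threshold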

We evaluated the performances of the methods with three different α values: α = 0.001, α = 0.01 and α = 0.05.

Performance measures

We considered eight measures to assess the performance of the methods:

  • Measure 1 is false positive rate (FPR). For each method and each scenario, we calculated the FPR defined as the proportion of weeks corresponding to an alarm in the absence of an outbreak, as in [9]. Nominal FPRs were 0.0005 for analyses with α = 0.001, 0.005 for analyses with α = 0.01 or 0.025 for analyses with α = 0.05.
  • Measure 2 is probability of detection (POD). For each scenario and for each current week period, if an alarm is generated at least once between the start and the end of an outbreak, the outbreak is considered to be detected [9]. POD is an event-based sensitivity (i.e. the entire outbreak interval is counted as a single observation for the sensitivity measurement) and is thus the proportion of outbreaks detected in 100 replicates.
  • Measure 3 is probability of detection during the first week (POD1week), which makes it possible to evaluate the methods’ ability to enable early control measures.
  • Measure 4 is observation-based sensitivity (Se): Outbreak weeks associated with an alarm were defined as True Positive (TP), non-outbreak weeks without alarm as True Negative (TN), outbreak weeks without alarm as False Negative (FN) and non-outbreak weeks with alarm as False Positive (FP). Thus, Se = TP/(TP+FN).
  • Measure 5 is specificity (Sp) defined as Sp = TN/(TN+FP). Unlike FPR, which was calculated on current weeks without any simulated outbreak, specificity was calculated on all current weeks of the 210,000 time series including current outbreaks.
  • Measure 6 is positive predictive value (PPV) defined as: PPV = TP/(TP+FP).
  • Measure 7 is negative predictive value (NPV) defined as: NPV = TN/(TN+FN).
  • Measure 8 is F1-measure defined as the harmonic mean of the sensitivity and the PPV: F1 = 2 × (Se × PPV)/(Se + PPV). F1-measure assumes values in the interval [0, 1] [37].
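As an illustration of how these week-level measures can be computed from the alarms produced by a method and the known outbreak status of each current week, a minimal R sketch (the two indicator vectors are purely illustrative):

# Sketch: week-level performance measures from alarm and outbreak indicators.
alarm    <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
outbreak <- c(FALSE, TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE)

TP <- sum(alarm & outbreak)
TN <- sum(!alarm & !outbreak)
FP <- sum(alarm & !outbreak)
FN <- sum(!alarm & outbreak)

Se  <- TP / (TP + FN)             # observation-based sensitivity (Measure 4)
Sp  <- TN / (TN + FP)             # specificity (Measure 5)
PPV <- TP / (TP + FP)             # positive predictive value (Measure 6)
NPV <- TN / (TN + FN)             # negative predictive value (Measure 7)
F1  <- 2 * Se * PPV / (Se + PPV)  # F1-measure (Measure 8)

# Event-based POD: an outbreak counts as detected if at least one of its weeks raised an alarm.
detected <- any(alarm[outbreak])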

In the Results section, we report averaged performance measures: FPR is calculated over all 21,000 time series without an outbreak during the current weeks, and the other performance measures are calculated over all 210,000 time series with simulated outbreaks during the current weeks.

FPR was estimated prior to the simulation of current outbreaks, i.e. among the 49 current weeks for 21,000 (5 × 4,200) time series. Other indicators (POD, POD1week, Se, Sp, PPV, NPV) were estimated once outbreaks had been simulated, i.e. on the current weeks of all the time series (210,000 time series).

For each α value, we propose a ROC curve-like representation of these results with four plots representing sensitivity according to 1-specificity, POD and POD1week as functions of FPR, and sensitivity according to PPV.

Factors associated with the performance measures

To identify the factors associated with the performance measures for α = 0.01 and assess the strength of associations, multivariate Poisson regression models [38] were run, as in Barboza et al. [39] or Buckeridge et al. [40]. A set of covariates corresponding to the characteristics of the simulated time series was included: trend (yes/no), seasonality (no/annual/biannual), the baseline frequency coefficient θ, the dispersion coefficient ϕ and k1 representing the amplitude and duration of past outbreaks. The last three covariates and k2 were treated as continuous and modeled using fractional polynomials. The statistical methods were introduced as covariates to estimate performance ratios, i.e. the ratios of performances of two methods, adjusted for the characteristics of the time series represented by the other covariates.

Adjusted FPR, POD, POD1week, sensitivity, and specificity ratios were estimated with the improved Farrington algorithm as reference. 95% confidence intervals were calculated with robust estimation of standard errors. For each continuous covariate modeled by fractional polynomials, ratios were presented for each value [41].

The simulation study, the implementation of the detection methods, and the estimations of performance were carried out using R (version 3.2.2), in particular using the package surveillance. Poisson regression models used to identify the factors associated with the performance measures and to assess the strength of associations were run using Stata 14.
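The regression models themselves were fitted in Stata 14; an analogous, purely illustrative specification in R, with robust standard errors obtained via the sandwich and lmtest packages and a mock data frame standing in for the week-level results, could look as follows:

# Illustrative R analogue of the multivariate Poisson regression used to estimate
# adjusted performance ratios; the data frame and covariate values are mock data.
library(sandwich)
library(lmtest)

set.seed(1)
df <- data.frame(perf   = rbinom(500, 1, 0.2),   # 1 = week counted as, e.g., a true detection
                 method = factor(sample(c("FarringtonImproved", "Bayes3", "CUSUM"), 500, replace = TRUE)),
                 trend  = rbinom(500, 1, 0.5),
                 season = factor(sample(c("none", "annual", "biannual"), 500, replace = TRUE)))
df$method <- relevel(df$method, ref = "FarringtonImproved")   # improved Farrington as reference

fit <- glm(perf ~ method + trend + season, family = poisson, data = df)

# Robust (sandwich) standard errors, as used in modified Poisson regression [38]
coeftest(fit, vcov. = vcovHC(fit, type = "HC0"))
exp(coef(fit))   # exponentiated coefficients = adjusted performance ratios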

Results

Averaged performances of the methods

In this section, we present the averaged performances of each evaluated method, i.e. the performances irrespective of the scenario and of the characteristics of the time series. Table 2 presents averaged FPR, specificity, POD, POD1week, sensitivity, negative predictive value, positive predictive value and F1-measure for all 42 scenarios and all past and current outbreak amplitudes and durations, for α = 0.01. Overall, FPR ranged from 0.7% to 59.9% and POD from 43.3% to 88.7%. Methods with the highest specificity, such as the improved Farrington method or the periodic negative binomial regression, presented a POD lower than 45% and a sensitivity lower than 21%. Averaged measures for α = 0.001 and α = 0.05 are presented in S1 Table and S2 Table. The performances of the RKI 1-3, GLR Negative Binomial, GLR Poisson, Bayes 1-3 and OutbreakP algorithms do not vary with α (see Table 1); they are therefore only reported in Table 2. For each method, a radar chart presenting measures 1-7 for α = 0.01 is provided in S23 Appendix.

Table 2. FPR, specificity, POD, POD1week, sensitivity, NPV, PPV and F1-measure for all 21 evaluated methods (for past outbreak constant k1 = 0, 2, 3, 5, 10 and current outbreak k2 = 1 to 10 for POD and sensitivity).

α = 0.01 for Improved Farrington, Original Farrington, Periodic Poisson GLM and Neg Binomial GLM, CDC and EARS C1-C3. α = 0.05 for Bayes 1-3.

https://doi.org/10.1371/journal.pone.0181227.t002

Fig 1 illustrates these results by plotting, for the 21 methods, the global results: sensitivity according to 1-specificity (line 1), POD according to FPR (line 2), POD1week according to FPR (line 3) and sensitivity according to PPV (line 4) for the 3 α values (columns 1-3). Two groups stand out from the rest. The first group consists of Bayes 1, 2 and 3. These methods present the best POD (around 0.8) and POD1week, with a FPR around 10%. The second group consists of the 4 CUSUM methods: CUSUM, CUSUM Rossi, CUSUM GLM, and CUSUM GLM Rossi. For α = 0.01, these methods present the best sensitivity (around 0.80) but the lowest specificity (0.55) and the highest FPR (0.40). Note that while most of the algorithms' test statistics are based on the likelihood of single-week observations, independent of recent ones, the CUSUM methods are not, and they may be important for applications where detection of gradual events rather than one-week spikes is especially critical. The OutbreakP method had the lowest specificity without having a better POD or POD1week than the first two groups. Finally, a third group consists of the other methods, which had good specificity (over 0.9) but a lower sensitivity, POD and POD1week than the first two groups. All 21 methods presented a high negative predictive value, greater than 94%. The PPV of OutbreakP was very low (6.5%), while the periodic Negative Binomial GLM method had the highest PPV (68.4%).

Fig 1. Sensitivity versus 1-specificity (line 1), POD versus FPR (line 2), POD1week versus FPR (line 3) and sensitivity versus PPV (line 4) for α = 0.001, 0.01 and 0.05 (columns 1-3).

(Farr = Improved Farrington, OrigFarr = Original Farrington, Serf = periodic Poisson GLM, SerfNB = periodic Negative Binomial GLM, CDC = CDC algorithm, CUSUM = CUSUM, CUSUMR = CUSUM Rossi, CUSUMG = CUSUM GLM, CSMGR = CUSUM GLM Rossi, Bay1 = Bayes 1, Bay2 = Bayes 2, Bay3 = Bayes 3, RKI1 = RKI 1, RKI2 = RKI 2, RKI3 = RKI 3, Pois = GLR Poisson, GLRNB = GLR Negative Binomial, C1 = EARS C1, C2 = EARS C2, C3 = EARS C3, OutP = Outbreak P).

https://doi.org/10.1371/journal.pone.0181227.g001

A first attempt to visualize certain differences is to plot POD and FPR according to the scenario and the k1 or k2 values. To illustrate this, Fig 2 shows the performances of the CDC method. The first row represents FPR for an increasing past outbreak constant k1 = 0, 2, 3, 5 and 10 according to the 42 scenarios. The second row shows POD according to k2 for the 42 scenarios (each curve corresponds to a simulated scenario) for an increasing past outbreak constant k1 = 0, 2, 3, 5 and 10. It clearly shows that performance depends on the scenario. The same plots, with tables presenting numerical values for each method and different α values, are presented in S2 Appendix to S22 Appendix. To better compare the 21 methods, we present their FPR according to the scenarios and their POD according to the k2 values, for k1 = 5 and α = 0.01, on a single display in S1 Appendix.

Fig 2. CDC algorithm performances for α = 0.01 by increasing past outbreak amplitude k1 = 0, 2, 3, 5 or 10 with (i) on the first row: false positive rate for 42 simulated scenarios, (ii) on the second row: probability of detection for 42 simulated scenarios (each curve corresponding to a scenario) by increasing current outbreak amplitude k2 = 1 to 10.

https://doi.org/10.1371/journal.pone.0181227.g002

To better understand which characteristics are associated with each performance and to compare each method with the improved Farrington method, we present the results obtained from the multivariate Poisson regression models in the next section.

Adjusted performance ratios and associated factors

Table 3 presents the adjusted performance ratios for performance measures 1 to 5 as described in the Methods’ section (α = 0.01 for Improved Farrington, Original Farrington, Periodic Poisson GLM and Neg Binomial GLM, CDC and EARS C1-C3. α = 0.05 for Bayes 1-3).

  • Adjusted FPR ratios decreased when the amplitude and duration of past outbreaks (driven by k1) increased. Large past outbreaks inflate the estimated expected number of cases, especially when the method does not down-weight their influence, so that alarms, and hence false alarms, become less likely. The adjusted FPR ratio was 2.75 times higher for time series with a secular trend than for the others. As we simulated time series with a non-negative trend (β ≥ 0 in Eq (1)), it was expected that FPR would increase with a trend, especially for methods which do not integrate a trend in the estimation of the expected number of cases. In the same way, annual seasonality–and biannual seasonality to an even greater extent–and overdispersion increased FPR. We observed a nonlinear relation between FPR and baseline frequency: the FPR ratio increased from the lowest frequencies up to 12 cases per week, then decreased for the highest frequencies, with no clear explanation. Only the periodic negative binomial GLM presented a FPR lower than that of the improved Farrington method (FPR ratio = 0.71). Adjusted FPR ratios of OutbreakP and all CUSUM variants were higher than 40. Another group of methods all presented FPR ratios below 10: CDC, RKI variants, EARS methods, periodic Poisson GLM, original Farrington, Bayes 2 and GLR negative binomial. FPR ratios for the other methods (Bayes 1 and 3, and GLR Poisson) were between 10 and 17.
  • Adjusted specificity ratios were almost all equal to 1 as the amplitude and duration of past outbreaks had little influence on specificity. They were significantly lower for time series with a secular trend (adjusted specificity ratio = 0.84) or with annual or biannual seasonality (respective ratios: 0.99 and 0.98). Specificity decreased when dispersion increased but increased when the baseline frequency (θ in Eq (1)) increased. Only the periodic negative binomial GLM presented a specificity as good as that of the improved Farrington method (specificity ratio = 1.00).
  • The adjusted POD ratios decreased significantly when past outbreak amplitude and duration (k1) increased, and increased when current outbreak amplitude and duration (k2) increased, as expected. POD was higher for time series with secular trends, which can be explained by the positive simulated trend. POD decreased in the presence of annual or biannual seasonality (respective POD ratios: 0.97 and 0.92). Only the highest dispersion value (ϕ = 5) had an influence on POD (adjusted POD ratio = 1.09). Bayes 1, 2 and 3, the CUSUM variants and the GLR Poisson method presented the highest POD ratios, from 1.75 (GLR Poisson) to 1.95 (CUSUM GLM). No method was less able to detect an outbreak than the improved Farrington algorithm.
  • POD1week presented results similar to those of POD. Adjusted POD1week ratios were significantly lower than the corresponding POD ratios for EARS C3 (0.25 versus 1.25), for CDC (0.55 versus 1.04) and for GLR negative binomial (0.87 versus 1.17). The other methods presented POD1week ratios similar to or greater than their POD ratios.
  • Finally, results for sensitivity were similar to those for POD. The Bayes 2 and 3 methods, OutbreakP, RKI 3, the CUSUM variants and the GLR Poisson method presented the highest sensitivity ratios, from 2.04 (RKI 3) to 3.89 (CUSUM GLM). As in the POD model, no method was less able to detect an outbreak than the improved Farrington algorithm.

Table 3. Performance ratios with the improved Farrington method as reference, adjusted for past and current outbreaks (duration and amplitude), trend, seasonality, dispersion and baseline frequency (α = 0.01 for Improved Farrington, Original Farrington, Periodic Poisson GLM and Neg Binomial GLM, CDC and EARS C1-C3. α = 0.05 for Bayes 1-3).

https://doi.org/10.1371/journal.pone.0181227.t003

Estimates from the multivariate regression models for PPV and NPV are presented in S3 Table.

Discussion

We presented a systematic assessment of the performance of 21 outbreak detection algorithms using a simulated dataset. One advantage of a simulation study for benchmarking outbreak detection methods is the a priori knowledge of the occurrence of outbreaks, which enables the development of a real “gold standard”. Some authors have already proposed that simulation studies be used to assess outbreak detection methods [18, 19, 23], and others have suggested adding simulated outbreaks to real surveillance data baselines [16, 24, 25], but without proposing a systematic assessment of the performance of a broad range of outbreak detection methods. Choi et al. [20] proposed such a study design, based on the daily simulation method of Hutwagner et al. [18], but did not study the influence of past outbreaks or of time series characteristics (frequency, variance, secular trend, seasonality, etc.) on the methods' performance.

The simulated dataset we used to perform our study is large enough to include the considerable diversity of time series observed in real surveillance systems. We also simulated a high diversity of outbreaks in terms of amplitude and duration. In our opinion, this simulated dataset is therefore highly representative of real weekly surveillance data. To extend our results to daily surveillance data, a similar study would need to be performed with daily surveillance data. These characteristics of the simulated dataset enabled us to estimate simple intrinsic performance indicators such as FPR, POD, sensitivity and specificity to compare the performance of the evaluated methods. Furthermore, this allows us to compare our results with those of other studies based on the same dataset. Negative predictive value and positive predictive value are proposed as operational indicators for decision making when an alarm is triggered, or not triggered, by an algorithm. A benefit of adding outbreaks to the baseline weeks is that the outlier removal strategies considered by many authors can be objectively tested and evaluated. One limitation of the simulation process is that only increasing secular trends were used: increasing secular trends would facilitate outbreak detection, while decreasing trends would hamper it. Furthermore, our study was designed for weekly surveillance, while syndromic surveillance systems are most often daily systems. In daily surveillance time series, other seasonalities, such as the day-of-the-week effect, need to be taken into account, which is not the case in our study.

The performance of the evaluated methods was only considered from a general perspective, with the aim of detecting outbreaks in a large number of polymorphic weekly-based time series. In a pragmatic approach, it seems very difficult to adapt the tuning parameters of these methods for every time series. In France, public health agencies such as the French National Public Health Agency (Santé publique France), the French Agency for Food, Environmental and Occupational Health Safety (Anses) and the French Armed Forces Center for Epidemiology and Public Health (CESPA) have deployed computer-supported outbreak detection systems in traditional or syndromic surveillance contexts [42–45]. They monitor a broad range of time series on a daily or weekly basis without, however, having rigorously evaluated the algorithms implemented. In our study, the performance of the methods varied according to different baseline profiles depending on trend, seasonality, baseline frequency and overdispersion. Although similar meta-models have already been proposed, for example by Buckeridge et al. [40], an original feature of our approach was to compare performance indicators adjusted for these characteristics in a regression model. As expected, the adjusted performance of the 21 methods was penalized by increasing amplitude and duration of past outbreaks and by annual or biannual seasonality. Conversely, performance was better for increasing amplitude and duration of the current outbreaks to be detected. More generally, the methods' performance was highly dependent on the simulation tuning parameters.

We proposed various measures to monitor the performance of outbreak detection methods. False positive rate (FPR) and probability of detection (POD) were proposed by Noufaily et al. [9]. We proposed an observation-based sensitivity measure and an event-based sensitivity (POD). The concept of sensitivity based on alerting in each observation period is not applicable in some applications, because signals of interest are intermittent and multimodal and may even be interpreted as multiple events. Many of the algorithms are based on the likelihood of single-week observations independent of recent ones, but the CUSUMs are not, and the large sensitivity advantage of the CUSUM methods, which diminishes for POD and POD1week, may be a result of the way the outbreak effects are modeled. By contrast, the implementation of the POD measure is uniformly applicable. Public health response to an outbreak depends on its early detection. In the POD definition, an outbreak was considered to be detected even if the first statistical alarm was issued during its last week. With the aim of estimating early detection performance, we also proposed POD during the first week, which cannot be considered alone, because an outbreak still needs to be detected even if detection occurs belatedly. While POD1week was an indicator of a method's ability to detect an outbreak early, we did not propose any measure of timeliness like Salmon et al. [28] or Jiang et al. [45]. This topic could be further explored in another study. To give some insight into the speed of detection, we calculated it for the improved Farrington algorithm and the CUSUM GLM Rossi algorithm. On average, over the whole dataset, it took 1.23 weeks for the improved Farrington method to detect an outbreak and 1.16 weeks for the CUSUM GLM Rossi method.

No method presented outbreak detection performance sufficient to provide reliable monitoring for a large surveillance system. Methods which provide a high specificity or a low FPR, such as the improved Farrington or CDC algorithms, are not sensitive enough to detect the majority of outbreaks. These two algorithms could be implemented in systems that monitor health events to detect the largest outbreaks with the highest specificity.

Conversely, methods with the highest sensitivity, able to detect the majority of outbreaks–Bayes 3 or CUSUM GLM Rossi for example–produced an excessive number of false alarms, which could saturate a surveillance system and overwhelm an epidemiologist in charge of outbreak investigations. As with a screening test in clinical practice, the aim of an early outbreak detection method is to identify the largest possible number of outbreaks without producing too many false alarms.

The performances presented in this paper should be interpreted with caution as they depend both on the tuning parameters and on the current implementation of the methods in the R packages. Packages evolve with time and their default parameters may also change. This work, based on available R packages, may therefore be viewed as a starting point for researchers to extend the comparison of methods and/or to optimize the tuning according to their data. Since no single algorithm presented sufficient performance for all scenarios, combinations of methods must be investigated to achieve a predefined minimum performance. Other performance criteria should be proposed in order to improve the choice of algorithms to be implemented in surveillance systems. Therefore, we suggest that a study of the detection period between the first week of an outbreak and the first triggered alarm be conducted.

Supporting information

S1 Appendix. Comparison of the 21 evaluated methods (α = 0.01 for Improved Farrington, Original Farrington, Periodic Poisson GLM and Neg Binomial GLM, CDC and EARS C1-C3; α = 0.05 for Bayes 1-3; k1 = 5).

https://doi.org/10.1371/journal.pone.0181227.s001

(PDF)

S2 Appendix. Overall performances of Improved Farrington algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s002

(PDF)

S3 Appendix. Overall performances of Original Farrington algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s003

(PDF)

S4 Appendix. Overall performances of Periodic Poisson GLM algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s004

(PDF)

S5 Appendix. Overall performances of Periodic Negative Binomial GLM algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s005

(PDF)

S6 Appendix. Overall performances of CDC algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s006

(PDF)

S7 Appendix. Overall performances of CUSUM algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s007

(PDF)

S8 Appendix. Overall performances of CUSUM Rossi algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s008

(PDF)

S9 Appendix. Overall performances of CUSUM GLM algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s009

(PDF)

S10 Appendix. Overall performances of CUSUM GLM Rossi algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s010

(PDF)

S11 Appendix. Overall performances of Bayes 1 algorithm (α = 0.05).

https://doi.org/10.1371/journal.pone.0181227.s011

(PDF)

S12 Appendix. Overall performances of Bayes 2 algorithm (α = 0.05).

https://doi.org/10.1371/journal.pone.0181227.s012

(PDF)

S13 Appendix. Overall performances of Bayes 3 algorithm (α = 0.05).

https://doi.org/10.1371/journal.pone.0181227.s013

(PDF)

S14 Appendix. Overall performances of RKI 1 algorithm.

https://doi.org/10.1371/journal.pone.0181227.s014

(PDF)

S15 Appendix. Overall performances of RKI 2 algorithm.

https://doi.org/10.1371/journal.pone.0181227.s015

(PDF)

S16 Appendix. Overall performances of RKI 3 algorithm.

https://doi.org/10.1371/journal.pone.0181227.s016

(PDF)

S17 Appendix. Overall performances of GLR Negative Binomial algorithm.

https://doi.org/10.1371/journal.pone.0181227.s017

(PDF)

S18 Appendix. Overall performances of GLR Poisson algorithm.

https://doi.org/10.1371/journal.pone.0181227.s018

(PDF)

S19 Appendix. Overall performances of EARS C1 algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s019

(PDF)

S20 Appendix. Overall performances of EARS C2 algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s020

(PDF)

S21 Appendix. Overall performances of EARS C3 algorithm (α = 0.001, 0.01 and 0.05).

https://doi.org/10.1371/journal.pone.0181227.s021

(PDF)

S22 Appendix. Overall performances of OutbreakP algorithm.

https://doi.org/10.1371/journal.pone.0181227.s022

(PDF)

S23 Appendix. Radar charts of performance indicators: POD1week, POD, PPV, NPV, 1-FPR, Sp and Se for all 21 methods (α = 0.01 for Improved Farrington, Original Farrington, Periodic Poisson GLM and Neg Binomial GLM, CDC and EARS C1-C3. α = 0.05 for Bayes 1-3).

https://doi.org/10.1371/journal.pone.0181227.s023

(PDF)

S24 Appendix. R code of periodic Poisson GLM algorithm and periodic negative binomial GLM algorithm.

https://doi.org/10.1371/journal.pone.0181227.s024

(PDF)

S1 Table. FPR, specificity, POD, POD1week, sensitivity, negative predictive value, positive predictive value and F1-measure for 12 evaluated methods and α = 0.001 (for past outbreak constant k1 = 0, 2, 3, 5, 10 and current outbreak k2 = 1 to 10 for POD and sensitivity).

https://doi.org/10.1371/journal.pone.0181227.s025

(PDF)

S2 Table. FPR, specificity, POD, POD1week, sensitivity, negative predictive value, positive predictive value and F1-measure for 15 evaluated methods and α = 0.05 (for past outbreak constant k1 = 0, 2, 3, 5, 10 and current outbreak k2 = 1 to 10 for POD and sensitivity).

https://doi.org/10.1371/journal.pone.0181227.s026

(PDF)

S3 Table. Other performance ratios, adjusted for past and current outbreak duration and amplitude, trend, seasonality, dispersion and baseline frequency (α = 0.01 for Improved Farrington, Original Farrington, Periodic Poisson GLM and Neg Binomial GLM, CDC and EARS C1-C3. α = 0.05 for Bayes 1-3).

https://doi.org/10.1371/journal.pone.0181227.s027

(PDF)

Acknowledgments

The authors would like to thank Angela Noufaily and Paddy Farrington for providing them with simulated datasets and an R code to simulate outbreaks.

References

  1. Buehler JW, Hopkins RS, Overhage JM, Sosin DM, Tong V, CDC Working Group. Framework for evaluating public health surveillance systems for early detection of outbreaks: recommendations from the CDC Working Group. MMWR Recommendations and reports: Morbidity and mortality weekly report Recommendations and reports / Centers for Disease Control. 2004;53(RR-5):1–11.
  2. Wagner MM, Tsui FC, Espino JU, Dato VM, Sittig DF, Caruana RA, et al. The emerging science of very early detection of disease outbreaks. Journal of public health management and practice: JPHMP. 2001;7(6):51–59. pmid:11710168
  3. Fienberg SE, Shmueli G. Statistical issues and challenges associated with rapid detection of bio-terrorist attacks. Statistics in Medicine. 2005;24(4):513–529. pmid:15678405
  4. Buckeridge DL. Outbreak detection through automated surveillance: a review of the determinants of detection. Journal of Biomedical Informatics. 2007;40(4):370–379. pmid:17095301
  5. Zaki AM, van Boheemen S, Bestebroer TM, Osterhaus ADME, Fouchier RAM. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. The New England Journal of Medicine. 2012;367(19):1814–1820. pmid:23075143
  6. Gates B. The next epidemic–lessons from Ebola. The New England Journal of Medicine. 2015;372(15):1381–1384. pmid:25853741
  7. Hulth A, Andrews N, Ethelberg S, Dreesman J, Faensen D, van Pelt W, et al. Practical usage of computer-supported outbreak detection in five European countries. Euro Surveillance: Bulletin Européen Sur Les Maladies Transmissibles = European Communicable Disease Bulletin. 2010;15(36).
  8. Salmon M, Schumacher D, Burmann H, Frank C, Claus H, Höhle M. A system for automated outbreak detection of communicable diseases in Germany. Euro Surveillance: Bulletin Européen Sur Les Maladies Transmissibles = European Communicable Disease Bulletin. 2016;21(13).
  9. Noufaily A, Enki DG, Farrington P, Garthwaite P, Andrews N, Charlett A. An improved algorithm for outbreak detection in multiple surveillance systems. Statistics in Medicine. 2013;32(7):1206–1222. pmid:22941770
  10. Burkom HS, Murphy S, Coberly J, Hurt-Mullen K. Public health monitoring tools for multiple data streams. MMWR Morbidity and mortality weekly report. 2005;54 Suppl:55–62.
  11. Shmueli G, Burkom H. Statistical Challenges Facing Early Outbreak Detection in Biosurveillance. Technometrics. 2010;52(1):39–51.
  12. Unkel S, Farrington CP, Garthwaite PH, Robertson C, Andrews N. Statistical methods for the prospective detection of infectious disease outbreaks: a review. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2012;175(1):49–82.
  13. Centers for Disease Control and Prevention (CDC). Syndromic surveillance. Reports from a national conference, 2003. MMWR Morbidity and mortality weekly report. 2004;53 Suppl:1–264. pmid:15714619
  14. Enki DG, Noufaily A, Garthwaite PH, Andrews NJ, Charlett A, Lane C, et al. Automated biosurveillance data from England and Wales, 1991-2011. Emerging Infectious Diseases. 2013;19(1):35–42. pmid:23260848
  15. Farrington CP, Andrews NJ, Beale AD, Catchpole MA. A Statistical Algorithm for the Early Detection of Outbreaks of Infectious Disease. Journal of the Royal Statistical Society Series A (Statistics in Society). 1996;159(3):547–563.
  16. Rolfhamre P, Ekdahl K. An evaluation and comparison of three commonly used statistical models for automatic detection of outbreaks in epidemiological data of communicable diseases. Epidemiology and Infection. 2006;134(4):863–871. pmid:16371181
  17. Kleinman KP, Abrams AM. Assessing surveillance using sensitivity, specificity and timeliness. Statistical Methods in Medical Research. 2006;15(5):445–464. pmid:17089948
  18. Hutwagner L, Browne T, Seeman GM, Fleischauer AT. Comparing aberration detection methods with simulated data. Emerging Infectious Diseases. 2005;11(2):314–316. pmid:15752454
  19. Fricker RD, Hegler BL, Dunfee DA. Comparing syndromic surveillance detection methods: EARS’ versus a CUSUM-based methodology. Statistics in Medicine. 2008;27(17):3407–3429. pmid:18240128
  20. Choi BY, Kim H, Go UY, Jeong JH, Lee JW. Comparison of various statistical methods for detecting disease outbreaks. Computational Statistics. 2010;25(4):603–617.
  21. Cowling BJ, Ho LM, Riley S, Leung GM. Statistical algorithms for early detection of the annual influenza peak season in Hong Kong using sentinel surveillance data. Hong Kong Medical Journal = Xianggang Yi Xue Za Zhi / Hong Kong Academy of Medicine. 2013;19 Suppl 4:4–5.
  22. Hutwagner LC, Thompson WW, Seeman GM, Treadwell T. A simulation model for assessing aberration detection methods used in public health surveillance for systems with limited baselines. Statistics in Medicine. 2005;24(4):543–550. pmid:15678442
  23. Stroup DF, Wharton M, Kafadar K, Dean AG. Evaluation of a method for detecting aberrations in public health surveillance data. American Journal of Epidemiology. 1993;137(3):373–380. pmid:8452145
  24. Wang X, Zeng D, Seale H, Li S, Cheng H, Luan R, et al. Comparing early outbreak detection algorithms based on their optimized parameter values. Journal of Biomedical Informatics. 2010;43(1):97–103. pmid:19683069
  25. Jackson ML, Baer A, Painter I, Duchin J. A simulation study comparing aberration detection algorithms for syndromic surveillance. BMC medical informatics and decision making. 2007;7:6. pmid:17331250
  26. Lotze T, Shmueli G, Yahav I. Simulating Multivariate Syndromic Time Series and Outbreak Signatures. Social Science Research Network. 2007.
  27. Guillou A, Kratz M, Le Strat Y. An extreme value theory approach for the early detection of time clusters. A simulation-based assessment and an illustration to the surveillance of Salmonella. Statistics in Medicine. 2014;33(28):5015–5027. pmid:25060768
  28. Salmon M, Schumacher D, Stark K, Höhle M. Bayesian outbreak detection in the presence of reporting delays. Biometrical Journal Biometrische Zeitschrift. 2015;57(6):1051–1067. pmid:26250543
  29. Höhle M. surveillance: An R package for the monitoring of infectious diseases. Computational Statistics. 2007;22(4):571–582.
  30. Höhle M, Meyer S, Paul M, Held L, Correa T, Hofmann M, et al. surveillance: Temporal and Spatio-Temporal Modeling and Monitoring of Epidemic Phenomena; 2015.
  31. Stroup DF, Williamson GD, Herndon JL, Karon JM. Detection of aberrations in the occurrence of notifiable diseases surveillance data. Statistics in Medicine. 1989;8(3):323–329; discussion 331–332. pmid:2540519
  32. Rossi G, Lampugnani L, Marchi M. An approximate CUSUM procedure for surveillance of health events. Statistics in Medicine. 1999;18(16):2111–2122. pmid:10441767
  33. Höhle M, Paul M. Count data regression charts for the monitoring of surveillance time series. Computational Statistics & Data Analysis. 2008;52(9):4357–4368.
  34. Höhle M. Poisson regression charts for the monitoring of surveillance time series. Discussion paper // Sonderforschungsbereich 386 der Ludwig-Maximilians-Universität München; 2006. 500.
  35. Frisén M, Andersson E, Schiöler L. Robust outbreak surveillance of epidemics in Sweden. Statistics in Medicine. 2009;28(3):476–493. pmid:19012277
  36. Hutwagner L, Thompson W, Seeman GM, Treadwell T. The bioterrorism preparedness and response Early Aberration Reporting System (EARS). Journal of Urban Health: Bulletin of the New York Academy of Medicine. 2003;80(Suppl 1):i89–i96.
  37. Hripcsak G, Rothschild AS. Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association: JAMIA. 2005;12(3):296–298. pmid:15684123
  38. Zou G. A modified poisson regression approach to prospective studies with binary data. American Journal of Epidemiology. 2004;159(7):702–706. pmid:15033648
  39. Barboza P, Vaillant L, Le Strat Y, Hartley DM, Nelson NP, Mawudeku A, Madoff LC, Linge JP, Collier N, Brownstein JS, Astagneau P. Factors influencing performance of internet-based biosurveillance systems used in epidemic intelligence for early detection of infectious diseases outbreaks. PloS One. 2014;9(3):e90536. pmid:24599062
  40. Buckeridge DL, Okhmatovskaia A, Tu S, O’Connor M, Nyulas C, Musen MA. Predicting Outbreak Detection in Public Health Surveillance: Quantitative Analysis to Enable Evidence-Based Method Selection. AMIA Annual Symposium Proceedings. 2008:76–80.
  41. Royston P, Ambler G, Sauerbrei W. The use of fractional polynomials to model continuous risk variables in epidemiology. International Journal of Epidemiology. 1999;28(5):964–974. pmid:10597998
  42. Danan C, Baroukh T, Moury F, Jourdan-DA Silva N, Brisabois A, Le Strat Y. Automated early warning system for the surveillance of Salmonella isolated in the agro-food chain in France. Epidemiology and Infection. 2011;139(5):736–741. pmid:20598207
  43. Caserio-Schonemann C, Meynard JB. Ten years experience of syndromic surveillance for civil and military public health, France, 2004-2014. Euro Surveillance: Bulletin Européen Sur Les Maladies Transmissibles = European Communicable Disease Bulletin. 2015;20(19):35–38.
  44. Meynard JB, Chaudet H, Texier G, Ardillon V, Ravachol F, Deparis X, et al. Value of syndromic surveillance within the Armed Forces for early warning during a dengue fever outbreak in French Guiana in 2006. BMC medical informatics and decision making. 2008;8:29. pmid:18597694
  45. Jiang X, Cooper GF, Neill DB. Generalized AMOC curves for evaluation and improvement of event surveillance. AMIA Annual Symposium proceedings / AMIA Symposium AMIA Symposium. 2009;2009:281–285. pmid:20351865