Leading Indicators and the Evaluation of the Performance of Alerts for Influenza Epidemics

Background: Most evaluations of epidemic thresholds for influenza have been limited to internal criteria of the indicator variable. We aimed to initiate discussion on appropriate methods for evaluation and the value of cross-validation in assessing the performance of a candidate indicator for influenza activity.
Methods: Hospital records of in-patients with a diagnosis of confirmed influenza were extracted from the Canadian Discharge Abstract Database from 2003 to 2011 and aggregated to weekly and regional levels, yielding 7 seasons and 4 regions for evaluation (excluding the 2009 pandemic period). An alert created from the weekly time-series of influenza positive laboratory tests (FluWatch, Public Health Agency of Canada) was evaluated against influenza-confirmed hospitalizations on 5 criteria: lead/lag timing; proportion of influenza hospitalizations covered by the alert period; average length of the influenza alert period; continuity of the alert period; and length of the pre-peak alert period.
Results: Influenza hospitalizations led laboratory positive tests an average of only 1.6 (95% CI: -1.5, 4.7) days. However, the difference in timing exceeded 1 week and was statistically significant at the significance level of 0.01 in 5 out of 28 regional seasons. An alert based primarily on 5% positivity and 15 positive tests produced an average alert period of 16.6 weeks. After allowing for a reporting delay of 2 weeks, the alert period included 80% of all influenza-confirmed hospitalizations. For 20 out of the 28 (71%) seasons, the first alert would have been signalled at least 3 weeks (in real time) prior to the week with maximum number of influenza hospitalizations.
Conclusions: Virological data collected from laboratories was a good indicator of influenza activity with the resulting alert covering most influenza hospitalizations and providing a reasonable pre-peak warning at the regional level. Though differences in timing were statistically significant, neither time-series consistently led the other.



Introduction
Many countries have used thresholds of weekly time series of consultation rates for influenza-like illnesses (ILI) to signal the start and end of the period of seasonal influenza activity. The resulting epidemic period identifies a period where ILI activity is considered in excess of what is normally expected. However, these thresholds are often set based only on visual inspection [1]. Recently, researchers have evaluated other sources of real-time data such as telehealth [2,3], prescription sales [4] or emergency department visits for specific syndromes commonly associated with ILI [5,6] for the purpose of signalling an emerging influenza epidemic earlier. Others have evaluated time series from social media. Google Flu Trends (GFT) is an application that calibrates the number of web searches for terms associated with ILI to the weekly ILI time series provided by public health surveillance. Because search queries can be processed quickly, the resulting ILI estimates were found to be consistently 1-2 weeks ahead of CDC ILI surveillance reports [7,8]. Time series based on Wikipedia page views have produced similar results [9].
Despite the strong correlations, various issues have been identified. Since the impact of influenza is highly variable, it is not surprising that re-calibration is necessary for Google Flu Trends to track ILI consultation rates, or that significant differences between the two time series were identified [10]. Performance evaluation has often been limited to using proposed standards, to comparing the proposed alert with the ILI based epidemic period, or to an internal validation [11][12][13]. Since the epidemic grows exponentially for many weeks before the first cases are detected through laboratory testing, or excess morbidity or mortality outcomes are identifiable [11,14], an alert that signals the emergence of an influenza epidemic before the excess is observed has considerable utility that may not be apparent using the ILI based epidemic period as the gold standard.
With these limitations in mind, we aimed to illustrate the insight provided by the cross-validation of an alert against public health oriented criteria rather than simply showing correlation or performing an internal validation. We chose virological data (number and percent of laboratory tests positive for influenza) as the candidate indicator variable, as these data appeared to be the most promising indicator from our national surveillance program, FluWatch [15]. As the weekly number of influenza-confirmed hospitalizations has been shown to be a good proxy at the seasonal level for excess morbidity and mortality attributable to influenza [16], we used weekly influenza-confirmed hospitalizations as the validation dataset. We drew upon a number of statistical measures to determine whether the two samples arose from the same distribution; that is, we tested for differences in timing (did laboratory reports or hospital admissions lead or lag?) and in the shape of the distribution function (which time series has longer tails, or more extreme values?). This approach provides a richer description than the correlation and cross-correlation analyses used elsewhere. Next, we aimed to assess the alert period by characteristics of potential interest to hospital resource management and developed 5 criteria: lead/lag and other timing differences; proportion of influenza hospitalizations covered by the alert period; average length of the influenza alert period; continuity of the alert period; and length of the pre-peak alert period.

Sources of data
Hospital discharge records for patients admitted to an acute care hospital with a diagnostic code of influenza, virus identified (J10) were extracted from the Canadian Institute of Health Information (CIHI) patient-specific Discharge Abstract Database (DAD) [17] [19]. Routine surveillance data is collected for each epidemiological week and published in a surveillance report within 2 weeks. Seasons were defined regionally from September to August of the following year. The weekly distribution function was calculated by dividing the weekly number of events by the seasonal total for each region. The weekly positivity rate was calculated by dividing the weekly number of influenza positive tests by the weekly number of tests. The 2009 pandemic period was excluded from the analysis, as laboratory testing initially increased sharply once circulation of a novel strain with pandemic potential was announced in late April of 2009, and subsequently declined substantially over the pandemic period.
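The two derived weekly series described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the counts are hypothetical and are not taken from the study data, and weeks with no tests are given a rate of None, a choice the text does not specify.

```python
def weekly_distribution(weekly_counts):
    """Share of the seasonal total falling in each week (shares sum to 1)."""
    total = sum(weekly_counts)
    if total == 0:
        raise ValueError("season has no events")
    return [count / total for count in weekly_counts]


def weekly_positivity(positive_tests, total_tests):
    """Weekly positivity rate; None for weeks in which no tests were done
    (an assumption of this sketch, not a rule from the paper)."""
    return [p / n if n > 0 else None
            for p, n in zip(positive_tests, total_tests)]


# Hypothetical counts for one short regional season (not study data).
hospitalizations = [2, 5, 12, 20, 9, 2]
positives = [4, 18, 60, 90, 40, 8]
tests = [200, 300, 500, 600, 400, 200]

dist = weekly_distribution(hospitalizations)
rate = weekly_positivity(positives, tests)
print([round(d, 2) for d in dist])
print([round(r, 3) for r in rate])
```

Dividing by the seasonal total puts hospitalizations and positive tests on the same scale, which is what allows the two weekly curves and their cumulative distributions to be compared directly despite the difference in volume.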

Statistical Analysis
Lead-lag and other timing differences. To test for differences in the timing of influenza infections between the two weekly time-series, we performed a two-sided t-test using the Satterthwaite variance calculation to account for unequal variances for each of the 4 regions and 7 seasons, and reported the mean difference along with Satterthwaite confidence intervals. We used the Folded F test to test for equality of variance, and the two-sample Kolmogorov-Smirnov test to test for significant differences in the cumulative distribution function (CDF). The two-sample Kolmogorov-Smirnov test is a nonparametric method for comparing the distributions of two samples; it is sensitive to differences in location (median), spread (the parametric equivalent is the standard deviation) and other shape characteristics such as differences in skewness or heavier tails. Confidence intervals for the Pearson correlation coefficient were calculated using Fisher's z transformation. Differences in timing and other differences between the CDFs were summarized for the 28 seasons. SAS Enterprise Guide 5.1 [20] was used for the analysis and provides descriptions of these statistics.
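The quantities named above can be illustrated with a short Python sketch that uses only the standard library. It is not the SAS analysis used in the study. It assumes that each weekly count acts as a frequency weight on the week index, it stops at the test statistics (mean difference, standard error, Satterthwaite degrees of freedom, Kolmogorov-Smirnov distance) without computing p-values, and the counts and the series length of 52 weeks are hypothetical. The correlation of 0.55 is the value reported for Fig 2.

```python
import math
from statistics import NormalDist


def weighted_mean_var(weekly_counts):
    """Mean and sample variance of event timing (in weeks), treating the
    weekly counts as frequency weights on the week index."""
    n = sum(weekly_counts)
    mean = sum(w * c for w, c in enumerate(weekly_counts)) / n
    var = sum(c * (w - mean) ** 2
              for w, c in enumerate(weekly_counts)) / (n - 1)
    return n, mean, var


def welch_difference(counts_a, counts_b):
    """Mean timing difference (a minus b), its standard error, and the
    Satterthwaite degrees of freedom for unequal variances."""
    na, mean_a, var_a = weighted_mean_var(counts_a)
    nb, mean_b, var_b = weighted_mean_var(counts_b)
    sa, sb = var_a / na, var_b / nb
    se = math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return mean_a - mean_b, se, df


def ks_distance(counts_a, counts_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical CDFs over the season."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    cum_a = cum_b = 0
    d = 0.0
    for a, b in zip(counts_a, counts_b):
        cum_a += a
        cum_b += b
        d = max(d, abs(cum_a / total_a - cum_b / total_b))
    return d


def fisher_z_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Hypothetical weekly counts for one regional season (not study data).
hospitalizations = [2, 5, 12, 20, 9, 2]
lab_positives = [4, 18, 60, 90, 40, 8]

diff, se, df = welch_difference(hospitalizations, lab_positives)
d_stat = ks_distance(hospitalizations, lab_positives)
ci_low, ci_high = fisher_z_ci(0.55, 52)  # n = 52 weeks is an assumption
print(f"difference = {7 * diff:.1f} days, SE = {7 * se:.1f} days, df = {df:.0f}")
print(f"KS distance = {d_stat:.3f}")
print(f"95% CI for r: ({ci_low:.2f}, {ci_high:.2f})")
```

The timing difference is computed in weeks and multiplied by 7 for reporting in days. Because weekly data are heavily tied, any p-value attached to the Kolmogorov-Smirnov distance from standard tables would be approximate.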
The alert and evaluation. The alert was set for week t if at least 15 influenza positive tests were observed for week t and the corresponding positivity rate was at least 5% (i.e. at least 300 specimens were tested in week t and 15 or 5% or more were positive). At any point in time, the most current influenza surveillance report is usually available for the period dated 2 weeks earlier. To allow for this delay in reporting and processing, a 2 week operational delay was assumed. That is, if the first alert was set based on laboratory reports for week 1, we assumed that the alert would be announced early in week 3, and preparations could begin in week 3. As gaps were more likely to occur at the beginning or end of the alert period when numbers were small, we waited one week before turning the alert off in order to improve the continuity of the alert period. The influenza hospitalizations for week t+2 (date of admission) were considered to be included in the alert period if the alert was set based on the virological data for week t or t-1. The length of the alert period is the time from the first to last alerted week (including any gaps). The length of the pre-peak period was calculated from the presumed week of first announcement (week of the first alert +2) to the week with the seasonal maximum number of hospitalizations by week of admission. All statistics were calculated based on the alert status as would have been reported in the most recent surveillance report available at the time of hospitalization. As well, the alerted weeks flagged in Figs 1-6 were adjusted for this 2 week operational delay. The first and last alerted weeks do not necessarily correspond to the beginning and end of the epidemic period as generally defined elsewhere to be periods in excess of what is normally expected (usually in reference to ILI surveillance). 
In using virological data for the alert, the ultimate objective is to provide some advance warning of an emerging epidemic before an excess case load is observed.
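The alert rule and the evaluation measures described above can be sketched as follows. This is one reading of the text rather than the code used in the study: the function names, the returned field names and the weekly counts are hypothetical, the alert length and continuity are computed on the operationally delayed status, and a hospitalization week w is treated as covered when the raw alert was set for week w-2 or w-3.

```python
def raw_alert(positives, tests, min_positive=15, min_rate=0.05):
    """Alert for week t: at least 15 positive tests and at least 5%
    positivity in that week."""
    return [p >= min_positive and n > 0 and p / n >= min_rate
            for p, n in zip(positives, tests)]


def operational_alert(raw, delay=2, hold=1):
    """Alert status available in week w: on if the raw alert was set for
    week w - delay, or for any of the `hold` weeks before that (the
    one-week wait before turning the alert off)."""
    status = []
    for w in range(len(raw)):
        last = w - delay
        first = max(0, last - hold)
        status.append(last >= 0 and any(raw[first:last + 1]))
    return status


def evaluate(positives, tests, hospitalizations, delay=2):
    """Coverage, alert length, gaps and pre-peak warning for one season."""
    raw = raw_alert(positives, tests)
    status = operational_alert(raw, delay=delay)
    alerted = [w for w, on in enumerate(status) if on]
    if not alerted:
        return None
    covered = sum(h for h, on in zip(hospitalizations, status) if on)
    length = alerted[-1] - alerted[0] + 1  # first to last week, gaps included
    peak_week = hospitalizations.index(max(hospitalizations))
    first_announced = raw.index(True) + delay
    return {
        "coverage": covered / sum(hospitalizations),
        "alert_length_weeks": length,
        "gap_weeks": length - len(alerted),
        "pre_peak_weeks": peak_week - first_announced,
    }


# Hypothetical 12-week season (not study data).
positives = [2, 6, 16, 40, 90, 150, 120, 60, 20, 10, 3, 1]
tests = [150, 200, 300, 400, 600, 800, 700, 500, 300, 250, 150, 100]
hospitalizations = [0, 1, 2, 4, 10, 25, 40, 30, 15, 8, 3, 2]

result = evaluate(positives, tests, hospitalizations)
print(result)
```

In this made-up season the raw alert is first set in week 2, becomes usable in week 4, and the hospitalization peak falls in week 6, so the pre-peak warning is 2 weeks in real time.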

Ethics Statement
This study was conducted in accordance with the principles expressed in the Declaration of Helsinki. Data provided to the Public Health Agency of Canada were collected under the Public Health Agency of Canada Act and were used in agreement with policy and regulations related to the publication of information related to public health. Identifying information was not available to this study. Hence, ethics approval was not required.

Results
Over the study period, there were 11,070 influenza hospitalizations and 52,715 influenza positive test reports of which 12,550 (24%) were for influenza B. The ratio of positive tests to admissions was 4.8:1. As an influenza diagnosis could have been based on a point of care test, or more than one laboratory test could be associated with one admission, the exact relationship between the number of weekly influenza positive tests reported to FluWatch by laboratories and the number of patients admitted to hospital with a confirmed influenza diagnosis is not known.

Lead-lag and other timing differences
Overall, influenza hospitalizations led laboratory positive tests by an average of only 1.6 days (95% CI: -1.5, 4.7). Though the average difference was not statistically significant, the difference in timing was statistically significant at the 0.01 level in 8 out of 28 epidemic seasons, and this difference exceeded 1 week in 5 seasons. After accounting for multiple comparisons, this level of detection remains highly significant (p-value<0.0001). The estimated variance was higher for hospitalizations than for influenza positive tests in all 28 periods analyzed, and the difference was statistically significant in most. The two-sample Kolmogorov-Smirnov test was significant at the 0.01 level in 10 out of the 28 periods analyzed (Table 1). From the perspective of public health, there was very close agreement in the weekly distribution of influenza positive tests and the number of patients admitted to hospital with influenza, though the distribution of hospital admissions was spread over a slightly longer period, i.e. the tails of the distribution were noticeably longer. The comparison of two weekly indicators of influenza activity over one season often identified distinct and statistically significant differences, similar to differences previously observed between geographically adjacent population centers [21].
Fig 3. [...] However, influenza hospital admissions continued for many months after the epidemic subsided in this region, and the impact of these later admissions is highlighted by the CDF comparison. As a result, the average date of hospital admissions lagged influenza positive tests by an average of 12 days. This season is of interest due to the early epidemic peak (week of Nov 9, 2003). The pre-peak alert period is 4 weeks (the first alert was set based on laboratory data for the week of Sept 28, with the alert period starting operationally in the week of Oct 12), well ahead of peak influenza activity for the region, thereby providing significant advance warning at a key time. doi:10.1371/journal.pone.0141776.g003
[...] strain accounted for 95% of the influenza viral identifications, the average alert period was 12.8 weeks. More recently, at least two strains circulated in significant numbers each season, which accounts for the longer alert period. Coverage rates were lowest in jurisdictions and seasons with a smaller number of influenza positive tests (Table 2). For 20 out of the 28 (71%) seasons, the first alert was signalled at least 3 weeks in real time prior to the week with the maximum number of influenza hospital admissions. For selected seasons, the weekly distributions of influenza-confirmed hospitalizations and influenza positive laboratory tests are shown in Figs 1 through 6, with the alerted weeks marked with a solid diamond. Note that the alerted weeks were adjusted for a 2 week operational delay to reflect the most recent alert status available during the corresponding week of admission to hospital. The time series used to set the weekly alerts (the weekly number and percent of laboratory tests positive for influenza) are not shown. The accompanying CDF plots illustrate the cumulative differences in timing over the full season that form the basis of the two-sample Kolmogorov-Smirnov test. Examples were selected primarily from the two regions with the [...] (Fig 1). Agreement was good between the two curves. The alert was first set based on virological data for the week of Nov 16, and available for planning 2 weeks later (week of Nov 30), or 3 weeks before the peak in influenza hospitalizations (week of Dec 21). In Fig 2, the correlation between curves was poorer (r = 0.55), though coverage was still good (77%). However, the alert status was on for only 2 weeks in real time before the peak in influenza hospitalizations. In the 2003/04 season, the A/Fujian/411/02 strain emerged very early in the season in the Prairies (Fig 3).
An unusually long tail for hospitalizations resulted in an estimated average lag of 12 days (95% CI: 7.5, 17.3), though the epidemic midpoint (CDF = 50%) occurred in the week of Nov 9 for both time-series. A pre-peak alert period of 4 weeks (set based on laboratory data for the week of Sept 28, and reportable in the week of Oct 12) would have provided significant advance warning.
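The multiple-comparison statement in the lead-lag results (8 of 28 regional seasons significant at the 0.01 level, overall p-value<0.0001) can be checked with a simple binomial tail probability. The text does not say which adjustment was used, so this sketch is only one plausible reading; it assumes the 28 tests are independent, each with a 1% false-positive rate under the null hypothesis of no timing difference.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


# Chance of seeing 8 or more significant results among 28 tests when each
# has a 1% false-positive rate (independence is an assumption).
p_value = binomial_tail(8, 28, 0.01)
print(f"{p_value:.2e}")
print(p_value < 0.0001)
```

Under these assumptions the tail probability falls far below 0.0001, which agrees with the reported conclusion that the number of significant seasons is unlikely to be a chance finding.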

Interpretation
This study shows that there was close agreement in the timing of the two indicators of influenza activity at the regional level: the number of laboratory tests reported positive for influenza and the number of admissions to hospital with a confirmed influenza diagnosis, based on date of test and date of admission. Though there was no evidence that one indicator consistently led the other, subtle differences in timing were identified. Laboratory positive tests were slightly but significantly more concentrated during periods of peak activity, while a slightly higher proportion of hospitalizations occurred outside the peak period, as seen by the longer tails in the [...] Also surprising was the large number of epidemic curve comparisons for which either the laboratory or hospitalization data led the other time series by more than 1 week. In some cases this difference could be attributed to the longer tail in the hospitalization data when the epidemic peaked early in the season (Fig 3). In this example the epidemic midpoint offered a more robust measure of lead/lag differences during the period of peak activity. Differences in testing frequencies and procedures between hospitals, clinical practices and health regions, as well as differences of a couple of weeks in the timing of peak activity within the region [21], could explain the irregular differences in timing. An alert based on 5% positivity of laboratory tests and 15 positive tests provided good coverage of influenza-confirmed hospital admissions and a reasonable warning period in the 2003/04 season when the epidemic emerged very early and a single strain dominated. Increasing the threshold to 15% positivity reduced the average length of the alert period only slightly, by 1 to 2 weeks, and coverage from 80% to 75%. However, a reduction in the pre-peak warning period of 1 to 2 weeks could have a more significant impact on operations.
Notes: 1 After accounting for a 2 week operational delay. In 12 out of the 28 regional seasons, the alert should have been available at least 3 weeks before the peak in influenza hospitalizations.
The threshold of 15 positive tests in one week is expected to be reached 5 weeks (3 weeks in real time) before the peak/epidemic midpoint in 90% of seasons if at least 800 positive tests are reported annually for the strain responsible for the peak. This estimate is based on previous estimates of the shape of the empirical epidemic curve [14] (the week five weeks before the epidemic midpoint accounted for approximately 2.5% of annual tests during seasons with a single dominant strain). As the number of positive tests nearly doubled from one week to the next during the exponential growth phase, reducing or increasing the threshold by a factor of two should advance or delay the start of the alert period by 1 week. However, small numbers of positive tests may represent clusters (from an institutional outbreak), and evidence of over-dispersion in influenza data is common. Though regions with fewer influenza confirmations could use a lower threshold, this would increase the risk of setting the alert very early based on strains circulating in the pre-epidemic period.
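The arithmetic behind these two statements can be written out as a small Python sketch. The function names are hypothetical, the 2.5% weekly share is the figure quoted above for seasons with a single dominant strain, and exact weekly doubling is a simplification of "nearly doubled".

```python
import math


def expected_tests_at_lead_week(annual_positive, weekly_share=0.025):
    """Expected positive tests in the week five weeks before the epidemic
    midpoint, given the annual total for the peak strain."""
    return annual_positive * weekly_share


def alert_shift_weeks(old_threshold, new_threshold, weekly_growth=2.0):
    """Weeks by which the first alert moves (positive = later) when the
    count threshold changes, assuming counts multiply by `weekly_growth`
    each week during the exponential growth phase."""
    return math.log(new_threshold / old_threshold) / math.log(weekly_growth)


lead_week_tests = expected_tests_at_lead_week(800)
print(lead_week_tests, lead_week_tests >= 15)
print(alert_shift_weeks(15, 30))
print(alert_shift_weeks(15, 7.5))
```

With 800 annual positive tests, about 20 are expected in the week five weeks before the midpoint, which clears the threshold of 15. Doubling the threshold delays the first alert by about one week and halving it advances the alert by about one week. The sketch says nothing about the 90% figure, which depends on season-to-season variation in the shape of the epidemic curve.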

Comparison with other studies
Despite numerous approaches to identify periods of influenza activity, thresholds are still usually set based on visual inspection. Even approaches based on complex statistical techniques usually require some pre-determined threshold to be nominated, again often by inspection [1]. A recent Delphi study used expert opinion (ie, visual inspection) to provide ground truth for algorithmic research [22]. In this study the focus was on identifying an alert that would provide some warning in real time prior to the epidemic peak for resource planning purposes and one that would include a large proportion of all confirmed influenza hospitalizations to be used, for example, to determine the period of empirical anti-viral treatment, as appropriate [23,24]. Noting that the onset of the influenza activity usually goes undetected for many weeks before the number of influenza cases is sufficiently large to be detected through methods used to identify excess morbidity or mortality or via laboratory testing, Cowling and colleagues [11] identified a similar objective of quickly generating an alarm before the start of the peak season.
Following the success of Google Flu Trends in identifying a leading indicator for weekly ILI consultation rates, a number of studies have shown strong correlations between various time series based on other administrative data or social media and conclude that results are promising [5]. However, others have noted that the degree of correlation was highly variable among regions [2,6]. Our results agree more strongly with the latter conclusion.
Olson and colleagues used clinical ILI surveillance data as ground truth to assess the GFT estimates. However, despite observing strong correlations, they also identified substantial differences between the two weekly time series and concluded that search query data was no substitute for timely clinical or laboratory surveillance data [10]. As confirmed in our recent study of emergency department visits [25], it seems inappropriate to treat ILI surveillance as the ground truth for influenza activity when virological time-series are available. Ortiz and colleagues [26] also noted that GFT was more closely correlated with ILI consultation rates than laboratory-confirmed influenza and hence concluded that of the three time series, virologic surveillance is the most critical to the understanding of influenza activity.
Studies that have used the lag with maximum cross-correlation have found considerable variation in the estimates of timeliness of peak influenza activity for different data sources [3]. Our study confirms that some differences in timing, which could be due to a lack of geographic representativeness [21], likely exist, though any noted differences are not likely to be reproducible. Since the difference between the maximum and the second largest cross-correlation coefficient was often not statistically significant, our recommendation for assessing differences in lead/lag timing of periods of peak influenza activity is to use differences in the average date of infection or, preferably, the epidemic midpoint, as the average is more sensitive to differences in the off-season (or tails of the distribution).
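The contrast between the average date and the epidemic midpoint can be illustrated with a short Python sketch. The weekly counts are hypothetical and were chosen so that the two series share the same peak while one has a long off-season tail; the midpoint is taken here as the first week in which the cumulative share reaches 50%, which is an assumption about how ties within a week are handled.

```python
def mean_week(weekly_counts):
    """Average week of occurrence, weighted by the weekly counts."""
    total = sum(weekly_counts)
    return sum(w * c for w, c in enumerate(weekly_counts)) / total


def epidemic_midpoint(weekly_counts):
    """First week in which the cumulative share of events reaches 50%."""
    total = sum(weekly_counts)
    if total == 0:
        raise ValueError("season has no events")
    running = 0
    for week, count in enumerate(weekly_counts):
        running += count
        if 2 * running >= total:
            return week


# Hypothetical early-peaking season (not study data): identical peaks, but
# hospital admissions continue at a low level after the epidemic subsides.
lab_positives = [5, 30, 60, 30, 5, 0, 0, 0, 0, 0]
hospitalizations = [5, 30, 60, 30, 5, 4, 4, 4, 4, 4]

print(epidemic_midpoint(lab_positives), epidemic_midpoint(hospitalizations))
print(mean_week(lab_positives), round(mean_week(hospitalizations), 2))
```

Both series reach their midpoint in the same week, yet the long tail pulls the average week of hospitalization later. This mirrors the pattern described for the 2003/04 season, where the average lag was 12 days although the midpoints coincided.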
Ginsberg and colleagues noted that Google web search queries could produce ILI estimates that were consistently 1-2 weeks ahead of CDC ILI surveillance reports because search query data could be processed quickly [7]. As the time from symptom onset to hospital admission averaged 4 days during the 2009 pandemic [27], and positive findings are less likely in persons presenting for medical care more than a week after symptom onset [28], it appears unlikely that secondary data sources could be found that would consistently lead the influenza epidemic by more than the reporting and operational delay of 1-2 weeks. There are a few possible exceptions that we have noted elsewhere: though lead/lag times between adjacent health regions have not been consistent, the Atlantic region in Canada has shown a tendency to lag other regions in Canada and the United States [21]; and influenza infections in persons aged 15-19 and 20-24 years have been shown to lead other age groups by up to 1 week [29]. As these age groups are more web savvy, web-based participatory surveillance projects [30] may be able to tap into this lead group. Because of the high baseline level for self-reports of influenza symptoms [31], it is unlikely that an age-specific alert based on self-reports of ILI would be triggered earlier; however, youth reports may be used to signal the timing of peak activity, or more specifically the end of the exponential growth phase in the general population.
The study by Reich and colleagues [32] is the only other study we are aware of to use similar operational performance criteria. They illustrated that alerts based on influenza-confirmed hospitalizations could be set at the hospital level, with thresholds in the range of 3 to 5 hospitalizations or approximately 2.5% of the annual total. Immediate access to hospital data would increase the timeliness by 1 week, though the small numbers would increase random variation. A threshold of 5 positive tests is too small for virological data from the general population as tests associated with an institutional outbreak in the pre-epidemic period or other sources of clustering could trigger an early alert. The regional and local (hospital) approaches are complementary, and in both cases, preparations for a surge in influenza cases would have to start based on a relatively small case load.

Limitations
We used historical data on confirmed influenza hospitalizations to assess the performance of an alert based on 5% positivity and at least 15 positive tests in one week at the regional level as an indicator of influenza activity. This assessment has a number of limitations. Confirmation of an influenza infection either through laboratory or point of care testing is still limited, so that the alert was set at a regional level, a geographic scale that is at times too large to ensure synchronicity within each region [21]. Since the relationship between laboratory tests and the burden of disease (influenza-attributed hospitalizations, for example) may vary from season to season [25], the number of laboratory confirmations is not a direct measure of the disease burden. However, as the number of confirmed influenza hospitalizations is only a fraction of all hospitalizations attributable to influenza [16,33] and only a small proportion of emergency department visits attributed to an influenza infection were given an ILI diagnosis [25], direct measures of the disease burden are not available. Though an indirect measure, regression models have been successful in estimating the annual burden attributable to influenza [16]. Though we did not compare the alert period generated by the virological data head-to-head with weekly ILI surveillance data due to data limitations, viral identifications would still be considered the gold standard to assess whether any excess in ILI consultations is likely due to influenza. We did not provide confidence intervals for many of the summary measures, such as the expected coverage rate associated with this threshold rule, as the characteristics of each season are highly variable and it is unclear to what extent the results can be generalized to future seasons.

Conclusions
In summary, virological data for four regions in Canada provided a reasonable pre-peak warning and indicated the period of influenza activity that covered most influenza hospitalizations. There is no consensus on a gold standard to define the period of influenza activity, and it seems unlikely that there will ever be one. More likely, the influenza indicator will be used in conjunction with the specific resource data in question, and the pre-peak alert will be used by health care resource managers as a reminder to monitor resources more closely, especially over the next couple of weeks. Alerts based on more complex time-series, such as ILI consultations, may provide reasonable performance; however, cross-validation is required to assess any performance advantages over virological data. This evaluation approach should be adaptable to alerts based on other surveillance time series, or the criteria could be modified to suit specific resource management issues. It is, however, important to conduct the performance evaluation over many epidemics, especially when evaluating lead time differences, as each epidemic is a unique mix of multiple strains of varying timing and severity.
Supporting Information S1