Spatial aggregation choice in the era of digital and administrative surveillance data

Traditional disease surveillance is increasingly being complemented by data from non-traditional sources like medical claims, electronic health records, and participatory syndromic data platforms. Because non-traditional data are often collected at the individual level and are convenience samples from a population, choices must be made about how to aggregate these data for epidemiological inference. Our study seeks to understand the influence of spatial aggregation choice on our understanding of disease spread with a case study of influenza-like illness in the United States. Using U.S. medical claims data from 2002 to 2009, we examined the epidemic source location, onset and peak season timing, and epidemic duration of influenza seasons for data aggregated to the county and state scales. We also compared spatial autocorrelation and tested the relative magnitude of spatial aggregation differences between onset and peak measures of disease burden. We found discrepancies in the inferred epidemic source locations and estimated influenza season onsets and peaks when comparing county- and state-level data. Spatial autocorrelation was detected across more expansive geographic ranges during the peak season than during the early flu season, and spatial aggregation differences were greater for early season measures. Epidemiological inferences are more sensitive to spatial scale early in U.S. influenza seasons, when there is greater heterogeneity in timing, intensity, and geographic spread of the epidemics. Users of non-traditional disease surveillance should carefully consider how to extract accurate disease signals from finer-scaled data for early use in disease outbreaks.

Comparison of region-county and state-county differences for four measures of disease burden

The spatial aggregation differences have spatial dependence and temporal structure, with multiple observations per county, and thus violate the assumptions of the classical paired t-test. Consequently, comparisons of spatial aggregation difference were assessed with Bayesian intercept models (effectively, a Bayesian paired t-test for spatially correlated data) that accounted for county spatial dependence (see SM).
In the intercept model, $\delta_i$ is defined as the difference between two sets of spatial aggregation differences (e.g., peak versus onset timing, peak versus onset intensity, or region-county versus state-county for a single measure) for county $i$ in a given influenza season. We modeled county spatial dependence $\phi_i$ with an intrinsic conditional autoregressive (ICAR) model (Equation 2), which smooths model predictions by borrowing information from neighbors [33]. In the ICAR model, $\xi_i$ represents the number of neighbors for node $i$, $\phi_{j,-i}$ represents the neighborhood of node $i$, which is composed of neighboring nodes $j$ (neighbors denoted $i \sim j$), and $\tau_\phi$ is the precision parameter.

The results of this ICAR model were compared to those of an iid error model, in which the term $\phi_i$ is replaced with independent and identically distributed error terms for each node $i$. We chose the ICAR model over the iid error model for all tests after reviewing the model fits and the Deviance Information Criterion for the two sets of models.
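The displayed equations for this model did not survive in this copy of the text. As a hedged reconstruction only, the block below writes an intercept model and the standard ICAR conditional distribution using the symbols defined above. The Gaussian noise term $\epsilon_i$ and its precision $\tau_\epsilon$ are assumptions introduced here for illustration and are not named in the source; the second expression is the textbook ICAR form, which may differ in notation from the authors' Equation 2.

```latex
% Assumed intercept model: Gaussian observation noise (epsilon_i, tau_epsilon)
% is an illustrative assumption, not taken from the source text.
\begin{equation}
  \delta_i = \beta_0 + \phi_i + \epsilon_i,
  \qquad \epsilon_i \sim \mathcal{N}\!\left(0,\ \tau_\epsilon^{-1}\right)
\end{equation}

% Standard ICAR conditional: each phi_i is centered on the mean of its
% neighbors, with precision scaled by the number of neighbors xi_i.
\begin{equation}
  \phi_i \mid \phi_{j,-i},\ \tau_\phi \sim
  \mathcal{N}\!\left(\frac{1}{\xi_i}\sum_{i \sim j}\phi_j,\ \frac{1}{\xi_i\,\tau_\phi}\right)
\end{equation}
```

Under this form, counties with more neighbors have tighter conditional variance, which is how the model borrows information across adjacent counties.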
The intercept model was implemented with approximate Bayesian inference in R using Integrated Nested Laplace Approximations (INLA) with the INLA package (www.r-inla.org), which accommodates the spatial dependence error term [23,24]. If the 95% credible interval for $\beta_0$ does not include zero, we interpret the difference between the measures contributing to $\delta_i$ as statistically significant. We used relatively non-informative normal priors for $\beta_0$ and relatively non-informative log-gamma priors for the precision term $\tau_\phi$.
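The decision rule on the credible interval can be illustrated with a minimal Python sketch. This is not the authors' R/INLA code: it assumes a list of posterior draws for $\beta_0$ is already available, and the function names `credible_interval` and `excludes_zero` are hypothetical. The interval is equal-tailed, computed by linear interpolation between order statistics.

```python
def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws.

    Quantiles are found by linear interpolation between sorted draws.
    """
    xs = sorted(draws)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two posterior draws")

    def quantile(p):
        pos = p * (n - 1)           # fractional index into the sorted draws
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    alpha = (1.0 - level) / 2.0
    return quantile(alpha), quantile(1.0 - alpha)


def excludes_zero(interval):
    """True when the whole interval lies strictly above or below zero."""
    lower, upper = interval
    return lower > 0 or upper < 0
```

For example, `excludes_zero(credible_interval(draws))` returning `True` corresponds to the bolded, statistically significant estimates described in the text.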

S2.1 County population has no consistent association with onset timing

S2.2 County population has no consistent association with peak timing

County points are shaded according to whether county peak timing precedes (purple), matches (green), or succeeds (blue) the state peak timing in that state.

S2.3 Influenza season features across spatial scales

Epidemic duration by spatial scale. Epidemic duration was defined as the number of weeks between achieving 20% and 80% of cumulative intensity.

Fig Q: County-level maps of disease burden and spatial aggregation difference for an example influenza season (2006-07). We present county-level disease burden for onset timing, peak timing, onset intensity, and peak intensity (top row) and their associated state-county spatial aggregation differences (bottom row). Timing measures are reported in number of weeks from week 40, while magnitude measures are reported as log intensity. Here, spatial aggregation difference represents the difference between state and county values of disease burden, where negative values (blue) indicate that state-level data underestimated intensity or had earlier timing than county-level data, and vice versa. The map base layer is from the US Census Bureau.

Table A: Comparison of region-county spatial aggregation differences between onset and peak season measures. Negative estimates indicate that peak timing had smaller differences than onset timing or that peak intensity had smaller differences than onset intensity. Bolded values denote mean estimates that were statistically significant, where the 95% credible intervals did not overlap with zero.

S2.4 Operational implications of spatial aggregation difference
Peak intensity had consistently low variability in spatial aggregation difference across seasons (Fig R). Onset timing had consistently high variability in spatial aggregation difference across seasons (Fig S

S2.5 Heterogeneity is associated with greater spatial aggregation differences

We hypothesized that spatial aggregation difference was positively associated with spatial heterogeneity in the disease burden measure of interest. For example, as the county-level variation in peak intensity increases within a given state, we might expect the state-level estimate of peak intensity to differ more, in absolute terms, from the county-level values. We note that spatial aggregation differences for counties in a given state will not sum to zero: as would be done in public health departments, we process our county and state time series directly from the raw surveillance counts, and ILI baselines and epidemic periods are identified independently for each time series.
To examine the relation between within-state variation and spatial aggregation difference, we calculated the Pearson correlation coefficient between the within-state variance in disease burden and the absolute magnitude of state-county error for each measure and flu season in our study period (28 total comparisons). We found statistically significant positive correlations ranging from 0.34 to 0.95 (p < 0.05) for all but four onset measure comparisons: onset timing for the 2005-06, 2006-07, and 2007-08 seasons and onset intensity for the 2006-07 season (Fig T, Fig U, Fig V, Fig W). While some results may be unduly influenced by outlier points, the overall pattern indicates a relationship between variation and error.
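The correlation described above can be sketched in Python for a single measure and season. This is an illustration under stated assumptions, not the authors' analysis code: it assumes one point per state, with within-state variance taken as the population variance of county values and state-county error summarized as the mean absolute difference between the state value and its county values. The data layout (dictionaries keyed by state) and the function names `pearson` and `state_summaries` are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def state_summaries(county_values, state_values):
    """Per-state variance of county values and mean absolute state-county error.

    county_values: dict mapping state -> list of county-level values
    state_values:  dict mapping state -> state-level value of the same measure
    """
    variances, abs_errors = [], []
    for state, counties in county_values.items():
        variances.append(pvariance(counties))
        # Spatial aggregation difference is state minus county; use its magnitude.
        errors = [abs(state_values[state] - c) for c in counties]
        abs_errors.append(mean(errors))
    return variances, abs_errors


# Toy values for illustration only; these are not study data.
county_values = {"A": [5.0, 5.0, 5.0, 5.0], "B": [3.0, 5.0, 7.0], "C": [1.0, 5.0, 9.0]}
state_values = {"A": 5.0, "B": 5.0, "C": 5.0}
variances, abs_errors = state_summaries(county_values, state_values)
r = pearson(variances, abs_errors)
```

In the toy data, the state with the most dispersed county values also has the largest mean absolute error, so `r` is positive. A significance test for `r` is omitted here.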