Abstract
In this paper, we present a method for estimating the infection-rate of a disease as a spatial-temporal field. Our data comprise time-series case-counts of symptomatic patients in various areal units of a region. We extend an epidemiological model, originally designed for a single areal unit, to accommodate multiple units. The field estimation is framed within a Bayesian context, utilizing a parameterized Gaussian random field as a spatial prior. We apply an adaptive Markov chain Monte Carlo method to sample the posterior distribution of the model parameters conditioned on COVID-19 case-count data from three adjacent counties in New Mexico, USA. Our results suggest that the correlation between epidemiological dynamics in neighboring regions helps regularize estimates in areas with high-variance (i.e., poor quality) data. Using the calibrated epidemic model, we forecast the infection-rate over each areal unit and develop a simple anomaly detector to signal new epidemic waves. Our findings show that an anomaly detector based on estimated infection-rates outperforms a conventional algorithm that relies solely on case-counts.
Citation: Safta C, Ray J, Bridgman W (2025) Detecting outbreaks using a spatial latent field. PLoS One 20(7): e0328770. https://doi.org/10.1371/journal.pone.0328770
Editor: Yoo Min Park, University of Connecticut, UNITED STATES OF AMERICA
Received: September 16, 2024; Accepted: June 27, 2025; Published: July 31, 2025
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: All codes and exemplars presented in the paper are available at: https://github.com/sandialabs/PRIME.
Funding: This work was funded by Sandia National Laboratories’ Laboratory Directed Research and Development (LDRD) program and the US Department of Energy, Office of Science’s Advanced Scientific Computing Research’s Biopreparedness Research Virtual Environment (BRaVE) program.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The infection-rate of a disease, especially a (human-to-human) communicable one, is perhaps the most concise distillation of the epidemiological dynamics of an outbreak. It waxes and wanes as a population’s mixing patterns change with the seasons or when a new variant arrives. It varies in space, modulated by risk factors, viz. socioeconomic conditions, population density and demographic profile. It could potentially be a very informative quantity to monitor as part of disease surveillance, but this is rarely done, because the infection-rate of an outbreak cannot be directly observed; instead, it has to be estimated, most commonly using a time-series of case-counts of patients (i.e., infected people who have tested positive). Depending on the quality of the case-count data, which can have large reporting errors and display considerable variability when obtained from a small population where case-counts are low, estimating the infection-rate can be a difficult task.
Regardless of these difficulties, there have been many studies that estimate the infection-rate, particularly for the COVID-19 pandemic [1–3]. Our own work [4–6] parameterized a temporally-varying infection-rate and convolved it with the incubation period of COVID-19 to construct a disease model; when fitted to COVID-19 case-count data using Bayesian inference, it yielded parameters of the infection-rate model. This model could be used to provide 2-week-ahead forecasts of the behavior of the outbreak; when the observed data disagreed with the forecasts consistently, it indicated a change in epidemiological dynamics, for example the effect of lockdowns in California [6] or the start of the fall wave of COVID-19 in New Mexico [4]. All these studies aggregate case-counts over large populations (usually above 250,000) to reduce the variability in the observed case-counts and thus ease the estimation problem for the infection-rate. However, this aggregation can be problematic if performed over a large, sparsely populated region, as is the case for the state of New Mexico, USA. The estimated infection-rate is necessarily an average over the regional population and may bear little resemblance to local conditions if the population displays large spatial heterogeneity; this is certainly the case with New Mexico due to the presence of urban areas as well as remote, sparsely-populated desert counties. Since public health measures are often decided at the county-level, these regionally-averaged estimates of infection-rate are only used as a rough guide by public health professionals.
In this paper, we develop a method to estimate the infection-rate as a spatiotemporal field, described over areal units that are adjacent and part of a larger area. Each areal unit supplies a time-series of case-counts for the estimation of the infection-rate field. For the purposes of this paper, we will use the COVID-19 outbreak in New Mexico (NM) and its counties as the test case, using data collected between June 1, 2020 and September 15, 2020 (the “Summer wave”; see Fig 1 (left)); after September 15, the case-counts in NM steadily rose into the winter, an event we will refer to colloquially as the “Fall 2020” wave. Our approach is based on two key hypotheses. Our first premise is that the parameterized model for the time-varying infection-rate, as developed by Safta et al. [6], can be used to model the temporal evolution of the outbreak in each areal unit. This will lead to an inverse or estimation problem that will scale with the number of areal units and could quickly become intractable. Our second premise is that the spatial correlations in the epidemiological dynamics, as observed in the case-count data, can be fashioned into a random field model to regularize the high-dimensional field inversion and render it tractable. As part of this investigation, our method will be exposed to observational data of variable “quality”, from relatively low-variability observations from populous counties, such as Bernalillo, to high-variability low case-count data from smaller counties around it.
Fig 1. Left: The data from the “Summer wave” (June 1, 2020 to September 15, 2020) will be used to estimate the infection-rate field. The Fall 2020 wave started around September 15, and is marked with a solid vertical line. The dashed line is August 15. Right: Case-count data during the “Summer wave” for 3 counties. Note the erroneous data spikes in the middle of the summer for Cibola. Data quality for the various areal units can vary significantly.
The development of this method will require us to address the following research questions:
- How does one fashion a random field model, from observational data of case-counts, to regularize the estimation problem for the infection-rate field?
- How does one include the random field model in the estimation of the infection-rate field? Does its inclusion improve the quality of the estimated infection-rate vis-à-vis an estimation performed using data from each areal unit independently? In particular, for counties or areal units with poor quality data, does the inclusion of the random field model (i.e., the ability to “borrow” information from neighbors) improve the estimation of the infection-rate?
- Can we use the estimated infection-rate to detect the arrival of the Fall 2020 wave in the counties of NM? How does it compare to a conventional outbreak-detector (specifically Höhle and Paul, 2008 [7])? In addition, in the absence of the Fall 2020 wave, does the use of the infection-rate lead to a false positive?
We re-iterate that the aim of this paper is to estimate the latent infection-rate field over multiple areal units and not just to forecast accurately – had that been the case, we would have used a time-series method.
We will address the questions using data from three adjoining NM counties viz. Bernalillo, Santa Fe, and Valencia. The inverse problem is sufficiently low-dimensional to be solved exactly using an adaptive Markov chain Monte Carlo (AMCMC; see Haario et al. [8]). A preprint that extends this method to all 33 counties of NM is also available [9]. It uses mean-field Variational Inference (MFVI) to solve the inverse problem for the infection-rate field approximately, as it becomes too high-dimensional for AMCMC.
The main contribution of the paper is in illustrating the use of random field models in inverse problems to yield local epidemiological information, using the spatial correlation extant in epidemiological dynamics (caused by population mixing) to compensate for the high variability in the case-count time-series observations. A second contribution of the paper is to demonstrate that the information so obtained (in the form of a local infection-rate) contains actionable public health information; we will do so by detecting the arrival of the Fall 2020 wave. Note that we do not attempt to build a full-fledged outbreak detector in this paper; that is left to future work. Also note that the use of random field models in disease mapping is well-established [10,11]; however, these methods seek only to smooth observed case-count data rather than estimate the underlying infection-rate.
The paper is structured as follows. In Sect 2 we review existing literature on infection-rate estimation, the empirical construction and parameterization of random field models, especially in disease mapping, and how outbreak-detectors function. In Sect 3 we perform an exploratory analysis of the New Mexico COVID-19 data to motivate the spatial model. In Sect 4, we parameterize a Gaussian random field (GRF) model to represent spatial correlations in epidemiological dynamics and formulate a general inverse problem for the infection-rate. In Sect 5, we present the results of the infection-rate estimation, jointly for the three counties, and compare them with the results obtained from independent estimation. We also discuss how the estimated infection-rate performs in detecting the Fall 2020 wave, compared to conventional techniques (Sect 6). We conclude in Sect 7.
2 Literature review
Covariates and spatial autocorrelation in COVID-19 dynamics: Huang et al. [12] analyzed the spatial relationship between the main environmental and meteorological factors and COVID-19 cases in the Hubei province of China using a geographically weighted regression (GWR) model. Their results suggest that the impacts of environmental and meteorological factors on the development of COVID-19 were not significant, something we also found in NM (see Sect 3). Their findings indicate that measures such as social distancing and isolation played the primary role in controlling the development of the COVID-19 epidemic. Geng et al. [13] analyzed spatio-temporal patterns of COVID-19 infections at scales spanning from county to continental. They found that the spatial evolution of COVID-19 cases in the United States followed multifractal scaling. A rapid increase in the spatial correlation was identified early in the outbreak (March to April 2020), followed by an increase at a slower rate until approaching the spatial correlation of the human population. For this study, the multiphase COVID-19 epidemics were modeled by a kernel-modulated susceptible–infectious–recovered (SIR) algorithm. Schuler et al. [14] employed a compartmental model for all 412 districts of Germany coupled with non-pharmaceutical intervention (NPI) models. They identify disease-spread dynamics that correspond to different spatial correlation levels, obtained via variogram estimation, between adjacent districts. McMahon et al. [15] analyzed the spatial correlations of new active cases in the USA at the county level and showed that various stages of the epidemic are distinguished by significant differences in the correlation length. Their results indicate that the correlation length may be large even during periods when the number of cases declines, and that correlations between urban centers were more significant than between rural areas. Rendana et al.
[16] analyzed the spatial distribution of COVID-19 cases, the epidemic infection-rate, and spatial patterns during the first and second waves in the South Sumatra Province of Indonesia. The study found little to no correlation between different regions. Air temperature, wind speed, and precipitation contributed to the high epidemic infection-rate in the second wave. Indika et al. [17] inspected the daily count data related to the total cases of COVID-19 in 93 counties in the state of Virginia using a Bayesian conditional auto-regressive (CAR) modeling framework. The authors find that Moran statistic values at specific time points are impacted by, and linked to, executive orders at the state level. In summary, there is some evidence that modeling of COVID-19 over small areal units might need to accommodate spatial auto-correlation, and might also require the inclusion of other covariates.
Ahmed et al. [18] proposed an SIR model for the COVID-19 epidemic and the control of its spread. They observed that indirect infection increases the basic reproduction number and can give rise to multiple endemic equilibria. Khan et al. [19] designed an SIRD (Susceptible-Infected-Recovered-Dead) model for infectious disease control. The model aims to optimize key control variables to reduce deaths, providing a practical framework for strategically mitigating disease spread.
Random fields and disease maps: There is little literature on the use of a random field to estimate the infection-rate of a disease. However, the estimation of a latent field called relative risk is central to disease mapping [20,21]. A disease map is a 2D plot of the risk of contracting a disease, computed from case-counts collected over areal units that comprise a region or province. First, one obtains an “expected” value $e_i$ of the observed case-counts $y_i$ for areal unit $i$, usually from a region-wide average of disease incidences and demographics. It is then locally adjusted (in space) using the relative risk field to bring it closer to observations, i.e., $y_i \sim \mathrm{Poisson}(e_i r_i)$. The risk $r_i$ is then modeled as $\log r_i = \mathbf{x}_i^T \boldsymbol{\beta} + u_i$, where $\mathbf{x}_i$ are co-variate risk factors for areal unit $i$, $\boldsymbol{\beta}$ are regression weights and $u_i$ captures auto-correlated random effects in space using a random field model. The simplest random field model is iCAR (intrinsic Conditional AutoRegressive [20]), a specific type of Gaussian Markov Random Field (GMRF). Thus $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, Q^{-1})$, with precision matrix $Q = \tau (D - W)$, where $W$ is the adjacency matrix of the areal units (i.e., $w_{ij} = 1$ if areal units $i$ and $j$ share a boundary) and $D$ is a diagonal matrix holding the number of neighbors of each unit. The object of estimation from data is $\tau$. The precision matrix $Q$ tends to be sparse. This formulation leads to an improper joint distribution for $\mathbf{u}$, since $Q$ is singular. The Besag-York-Mollie (BYM) model [22] overcomes this issue by extending iCAR as $u_i = \phi_i + \theta_i$, where $\boldsymbol{\phi}$ follows the iCAR model and $\theta_i \sim \mathcal{N}(0, \sigma^2)$ are IID random effects. We will use a variation of BYM in our work. The objects of estimation from case-count data are $\tau$ and $\sigma^2$. A second variation, called pCAR (proper CAR [23,24]), modifies the precision matrix to $Q = \tau(D - \rho W)$, where the objects of estimation are $\tau$ and $\rho$. The idea of a random field being used to smooth areal units in feature-space (as opposed to geometrical space) has also been developed using GMRFs [25]. Such a method is useful for diseases like alcohol abuse, where similarity of socioeconomic and health factors in areal units, rather than the geometric distance between them, is more relevant for smoothing. The difference lies in how $Q$ is modeled using a similarity matrix $S$ [26].
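The iCAR and pCAR precision matrices above can be assembled directly from the adjacency structure. The following is a minimal numpy sketch; the three-unit chain is a toy example, not the NM county adjacency.

```python
import numpy as np

def icar_precision(W, tau=1.0):
    """iCAR precision Q = tau * (D - W), where W is a symmetric binary
    adjacency matrix (w_ij = 1 if units i and j share a boundary) and D
    is diagonal with each unit's number of neighbors. Q is singular
    (its rows sum to zero), which is why the iCAR joint distribution
    is improper."""
    D = np.diag(W.sum(axis=1))
    return tau * (D - W)

def pcar_precision(W, tau=1.0, rho=0.9):
    """pCAR precision Q = tau * (D - rho * W), which is full-rank for
    suitable |rho| < 1, so the joint distribution is a proper Gaussian."""
    D = np.diag(W.sum(axis=1))
    return tau * (D - rho * W)

# Toy example: three areal units in a chain, 0 - 1 - 2
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
Q_icar = icar_precision(W)   # rank-deficient: improper prior
Q_pcar = pcar_precision(W)   # full rank: proper prior
```

The rank deficiency of the iCAR precision is what the BYM and pCAR variants repair, by adding IID effects or by damping the adjacency term with $\rho$.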
Outbreak detectors: Outbreak detection functions primarily as anomaly detection in space and time [27]. The case-count at time $t$, $y_t$, is often modeled as a normal random variate $y_t \sim \mathcal{N}(\mu_t, \sigma_t^2)$; an alarm is raised if $y_t > \mu_t + \kappa \sigma_t$, where $\kappa$ is a threshold value adjusted to trade off specificity and sensitivity of the detection. This approach can be considered as an expansion of Shewhart charts [28] and is sometimes referred to as a “statistical process control” (SPC) method. Methods differ on how $(\mu_t, \sigma_t)$ are computed. Serfling [29] fitted historical data of case-counts from influenza outbreaks with a linear trend and trigonometric functions (to account for their seasonality) to obtain estimates (and forecasts) of $(\mu_t, \sigma_t)$. A zero-mean Gaussian was assumed as a model for the fitting errors. The method is widely used, and over time the linear and periodic components have been adapted for local conditions and specific diseases [30]. For outbreaks with low counts, this approach has been modified to use Poisson error models, where the log-mean is modeled as a function of time, much like Serfling’s method [31,32]. Farrington’s widely used method [33] parallels Serfling’s approach, with linear and periodic trends, but a quasi-Poisson model accommodates the over-dispersion observed in epidemiological surveillance data as $\mathrm{Var}(y_t) = \phi \mu_t$, where the dispersion $\phi$ is estimated from the data. $(\mu_t, \sigma_t)$ have also been modeled and forecast using time-series models [34] such as AutoRegressive Integrated Moving Average (ARIMA [35]), but the surveillance time-series has to first be rendered stationary by subtracting out any trends and seasonality (which incurs errors). A comparison of ARIMA and SPC methods for detecting outbreaks showed that ARIMA methods were unremarkable in their ability to model surveillance data [36], due to non-stationarity and sparsity. Outbreak detection can also be modeled via state-transition events and thus based on Hidden Markov Models [37] and Markov switching models [38–40]. Outbreak detection can also be formulated as a two-component model consisting of an endemic phase (modeled using a Poisson distribution) and an epidemic one (modeled using an auto-regressive parameter). Both components are fitted to the data in a time-window around $t$ and a likelihood ratio test is used to evaluate which model fits better [7,41]. This can be used to detect when an epidemic starts. We will use such a model [7] as a baseline in Sect 6.
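The SPC alarm rule described above can be sketched in a few lines; the counts, the trailing-window baseline, and the threshold $\kappa = 3$ below are all hypothetical illustrations.

```python
import numpy as np

def spc_alarm(y_t, baseline, kappa=3.0):
    """Shewhart-style alarm: flag y_t if it exceeds mu_t + kappa * sigma_t,
    with (mu_t, sigma_t) estimated here from a trailing window of counts."""
    mu = np.mean(baseline)
    sigma = np.std(baseline, ddof=1)
    return y_t > mu + kappa * sigma

# Hypothetical daily counts for a stable endemic phase
history = np.array([10, 12, 9, 11, 10, 13, 11, 10])

ok = spc_alarm(12, history)     # False: within the control limits
alarm = spc_alarm(30, history)  # True: exceeds mu + kappa * sigma
```

Serfling- and Farrington-type methods differ only in how the baseline $(\mu_t, \sigma_t)$ is produced (regression with trend and seasonality, quasi-Poisson dispersion), not in the thresholding step itself.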
Infection-rate estimation: The infection-rate of a disease cannot be directly observed and therefore is estimated by fitting models to observed data e.g., case-counts per unit population or the incidence-rate of an outbreak. It is generally estimated from the data collected during the early epoch of an outbreak, and is treated as a purely temporal variable i.e., it is “averaged” over a region. The models used in its estimation can be mechanistic e.g., ordinary differential equations, statistical or drawn from machine-learning (ML) e.g., artificial neural networks. Mechanistic and statistical [42] models yield explicit estimates of the infection-rate (sometimes also called the “time-dependent reproduction number”) while ML models estimate them implicitly; comprehensive reviews of such estimation techniques can be found in Ref. [43], and a comparison of such methods on a benchmark COVID-19 dataset is in Ref. [44]. Note that when modeling the incidence-rate, spatial modeling for disease maps is quite common (see review above) including when exogenous explanatory variables are included [45].
Similar studies: Perhaps the investigations that are closest to ours, in modeling philosophy, are those by Lawson and collaborators [46–48]. Fundamentally, our approach consists of “stitching together” models meant for individual areal units [4,6] via CAR models (specifically, the BYM model). Lawson and co-workers model case-counts directly, whereas we use a parametric model of a temporally-variable (and, in this paper, also spatially-variable) infection-rate field that is related to the case-counts via the incubation period distribution. The use of the incubation-period model (see Sect 4) makes our model computationally more expensive than the ones used by Lawson and collaborators. Case-counts, in Lawson’s formulation, are modeled using a Susceptible-Infected-Removed compartmental formalism with a one-lagged-in-time auto-correlation and a BYM CAR model to couple with adjoining areal units; the clearest description of the model is in Lawson and Song, 2010 [46], which was applied to four counties in South Carolina. The same model was adapted to COVID-19 data from all counties of South Carolina [49] and the UK [50]. In an allied work, Lawson investigates, and selects between, various formulations of their basic model, as applied to COVID-19 data, with 1-step-ahead forecasting accuracy in mind; he finds no clear benefit of a space-time model over a purely temporal one [47]. The group has also investigated, much like us, whether departures from forecasts could be used to detect anomalies within the context of epidemiological surveillance [48,51]. They devised metrics such as the Surveillance Kullback-Leibler [52] (SKL) divergence and the Surveillance Conditional Predictive Ordinate [53] (SCPO) to monitor and detect outlier epidemiological behavior. Lawson and Kim [51] found that one needed to include a leading indicator of epidemiological activity, e.g. absenteeism, as a modeling covariate to detect epidemiological changes in a timely manner.
A more methodologically-oriented paper [48] investigated whether Poisson or Negative Binomial (NB) distributions should be used to link the observed case-counts to the modeled values in a likelihood function. They found that the NB distribution provided better goodness-of-fit (perhaps because the two-parameter distribution is more flexible than the Poisson), but for small datasets the Poisson provided more predictive forecasts. To summarize, one can use case-counts directly for (spatio-temporal) model-based syndromic surveillance, and there is some uncertainty over whether one should use Poisson or NB distributions to capture the stochasticity in the observations. However, the possibility of using a latent variable that might be better behaved, for example the infection-rate, has not been investigated.
3 Exploratory data analysis
In this section we perform an exploratory data analysis on the COVID-19 data from New Mexico (NM), in order to design the spatial problem.
3.1 The COVID-19 dataset
The COVID-19 dataset covers the duration from 2020-01-22 to 2022-05-13, and consists of daily (new) case-counts of COVID-19 from each of the 33 counties of NM; the data are available online [54,55]. The 73 covariates (i.e., risk factors) of COVID-19 span demographics, socioeconomic information (income, business and home ownership, etc.) and infrastructure. These were obtained from another group at Sandia National Laboratories and are described in their publication [56]; we provide a summary below. Demographic data on age distribution, gender, racial origins, housing, family units and living arrangements, education, health, etc. were obtained from the US Census Bureau’s QuickFacts for New Mexico [57], representing 5-year estimates between 2014-2018 and the 2013-2017 American Community Survey estimates. Geographical information, such as the area of counties and population densities, was also obtained from the Census dataset. Infrastructure represents the resources needed by a county to operate, such as the number of COVID testing sites, nursing homes and K-12 schools [58,59]. Geospatial data was also extracted from the University of New Mexico Earth Data Analysis Center, which develops the Resource Geographic Information System [60]. In total, data was compiled from 40 sources, manually down-selected to 73 features and adjusted (when needed) to each county’s population.
3.2 Data analysis
Let $\mathbf{y}_t = \{y_{r,t}\}, r = 1 \ldots R$, be the vector of case-counts reported on day $t$ in each of the $R$ areal units (i.e., counties of NM). Let $\mathbf{c}_t = \{c_{r,t}\}$ be the vector of normalized cumulative case-counts over the duration $[t, t+90)$, i.e., $c_{r,t} = p_r^{-1} \sum_{s=t}^{t+89} y_{r,s}$ is the cumulative number of case-counts over the 90-day period $[t, t+90)$ for areal unit $r$, and $p_r$ is the areal unit’s population. The 90-day window is adopted to average out the effect of reporting errors, as well as to reduce the effect of low case-counts in some of the very sparsely populated desert counties of NM. We assume that the case-counts can be modeled as a linear function of risk factors, i.e., $\mathbf{c}_t = w_{0,t} \mathbf{1} + X \mathbf{w}_t + \boldsymbol{\epsilon}_t$, where the $k^{th}$ column of $X$ contains the value of the $k^{th}$ risk factor for all $R$ areal units and $\mathbf{w}_t = \{w_{k,t}\}$ are their relative weights in time-window $t$. The risk factors $X$ are constant in time but vary between areal units. In disease mapping terms, the model $w_{0,t}\mathbf{1} + X \mathbf{w}_t$ provides the expected value of $\mathbf{c}_t$, and any deviations would be deemed “random”, to be modeled statistically.
Some of the risk factors are strongly correlated and thus carry little independent information; consequently, we simplify the model via sparse Principal Component Analysis [61] (PCA), replacing the risk factors with a set of $K$ principal components to remove unnecessary risk factors, i.e., $\mathbf{c}_t = w_{0,t}\mathbf{1} + \Phi \mathbf{w}_t + \boldsymbol{\epsilon}_t$, where the columns of $\Phi$ are the principal components. Note that the principal components from sparse PCA do not form an orthogonal basis set. We see from the scree plot in Fig 9 (in the Appendix) that $K = 10$ is sufficient to explain 95% of the variation in $X$. Further, sparse PCA constructs $\Phi$ using the most important risk factors. The main components of the sparse PCA modes are percent elderly, affluence, medical institutions per capita, size of population, percent Native American and percent male.
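A sparse-PCA reduction of this kind can be sketched with scikit-learn’s SparsePCA. The random matrix below merely stands in for the 33-county-by-73-risk-factor table, and the penalty `alpha` is illustrative, not a tuned value.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the (33 counties x 73 risk factors) matrix X
X = rng.normal(size=(33, 73))
X -= X.mean(axis=0)                 # center the risk factors

spca = SparsePCA(n_components=10, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)      # county scores on each component

# Sparse loadings: each component is built from a subset of risk factors
n_active = np.count_nonzero(spca.components_, axis=1)

# Unlike ordinary PCA, the components need not be orthogonal
G = spca.components_ @ spca.components_.T
```

Inspecting which risk factors carry nonzero loadings in each row of `components_` is the analogue of reading off the dominant factors (percent elderly, affluence, etc.) reported above.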
Fig 2. Top left: Results are plotted for the intercept and four principal components (PC). Only the intercept survives, and it is far larger than the weights associated with the principal components. Top right: Plot of the prediction error from a 7-fold cross-validation performed with the risk-factor model and LASSO, on case-count data accumulated over the entire two-and-a-half-year duration (and normalized by county populations). The figures on the upper horizontal axis denote the number of principal components retained in the fitted model. $\lambda_{\min}$ and $\lambda_{1SE}$ are clearly marked. Bottom left: Distribution of coefficients corresponding to penalties $\lambda_{\min}$ and $\lambda_{1SE}$; the intercept dominates. Bottom right: The residuals from the risk-factors model, i.e., the component not explained by the risk-factors model. The spatial correlations are clear.
We fit a regression model and simplify it with backward-forward stepwise elimination for each time window. New time-windows are obtained by advancing the previous one by 30 days. Fig 2 (top left) plots the variation of the absolute values of the coefficients $w_{k,t}$ over time. We see that the intercept $w_0$ dominates and persists over the entire duration, whereas the others are present only episodically, suggesting that the model might be fitting to noise. To investigate whether the risk factors play any part in the regression model, we take the cumulative sum of the case-counts over the entire duration of the dataset and fit the regression model via LASSO (Least Absolute Shrinkage and Selection Operator [62], as implemented in the R Statistical Software [63] (R version 4.3.2 (2023-10-31)) package glmnet [64]). Fig 2 (top right) shows the Mean Square Error (MSE) as a function of the sparsity penalty $\lambda$ in LASSO; the digits along the upper horizontal axis denote the number of PCA modes retained as $\lambda$ is increased. The “error bars” show the variation in MSE over a 7-fold cross-validation. We use the value $\lambda = \lambda_{1SE}$ in our regression model (the second vertical dotted line in Fig 2 (top right)), where the mean MSE lies 1 standard deviation away from the minimum MSE, observed for $\lambda = \lambda_{\min}$. The coefficients obtained from these two values of $\lambda$ are plotted in Fig 2 (bottom left). It is clear that the intercept $w_0$ dominates, i.e., the case-counts for COVID-19 are not very dependent on the risk factors or their principal components. The implication is that over the time-period of interest, the spatial patterns observed in the case-counts were not explained by the spatially-variable risk factors. Fig 2 (bottom right) plots the z–score of the residuals of the risk-factor model, and the spatial correlation of the epidemiological dynamics not modeled by risk factors is clear. There is a “blue” diagonal of NM counties running Northeast to Southwest, whereas the Northwest and Southeast corners are yellow. In between are “magenta” counties. Note that much of the blue diagonal is along the Rio Grande valley, and the population density falls as we travel away from it, into the desert. Clearly, a neighborhood matrix $W$ for a GMRF model could be made from this data, and we address this next. Note that this spatial variation is not explained by risk factors, but perhaps is due to mixing of populations in the counties.
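The paper performs the LASSO fit in R’s glmnet; a rough Python analogue, including glmnet’s “one-standard-error” rule for picking $\lambda_{1SE}$ from the cross-validation curve, might look as follows. The data are synthetic (a nearly constant response, mimicking the intercept-dominated fit found above), and note that scikit-learn calls the penalty `alpha` rather than $\lambda$.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
# Hypothetical stand-in: 10 sparse-PCA scores for 33 counties
Z = rng.normal(size=(33, 10))
# Response dominated by the intercept, as found for the NM data
y = 5.0 + 0.1 * rng.normal(size=33)

fit = LassoCV(cv=7, random_state=1).fit(Z, y)   # 7-fold CV, as in the paper

# One-standard-error rule: the largest penalty whose mean CV error is
# within one standard error of the minimum mean CV error
mse_mean = fit.mse_path_.mean(axis=1)
mse_se = fit.mse_path_.std(axis=1, ddof=1) / np.sqrt(fit.mse_path_.shape[1])
i_min = np.argmin(mse_mean)
ok = mse_mean <= mse_mean[i_min] + mse_se[i_min]
alpha_1se = fit.alphas_[ok].max()               # analogue of lambda_1SE
```

With a noise-only response, the selected model keeps essentially nothing but the intercept, which is the behavior reported for the NM case-counts.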
Moran’s I–statistic test [65], as implemented in the R package spdep [66,67], is used to detect spatial autocorrelation in a variable defined over areal units. It requires an adjacency matrix $W$ between areal units as input. We consider three different definitions of $W$, viz. “binary”, where $w_{ij} = 1$ when areal units $i$ and $j$ share a border (i.e., they are immediate neighbors); “binary-modified”, where $w_{ij}$ is weighted by the reciprocal of the distance between adjacent counties’ county seats; and “row-standardised”, where $w_{ij}$ is weighted by the number of neighbors that areal unit $i$ has. Moran’s I–statistic is computed for the field that is provided to the test (the “observed” I–statistic) versus the null case where the elements of the field are IID. The figure of merit is the standard deviate of the observed I–statistic. The standard deviate of the residual field shown in Fig 2 (bottom right), computed with the full dataset, is in Table 1, top row; clearly it is far from being IID random. Thereafter, we perform the same Moran’s I–statistic test for the 90-day windows (Fig 2 (top left)) and tabulate the mean and standard deviation of the I–statistic in Table 1, bottom row; again, the I–statistic indicates significant spatial auto-correlation. We see that the “binary” and “row-standardised” versions of the adjacency matrix give similar results, and both are far superior to the “binary-modified” form of $W$. The computation was repeated with an adjacency matrix with a 2-hop neighborhood (where the immediate neighbors of an areal unit, and their immediate neighbors, were included in the adjacency matrix) and the I–statistic was indistinguishable from random. Henceforth, we will adopt the row-standardised form of $W$ as our spatial prior when we estimate the infection-rate field over multiple areal units, as it provides the largest standard deviate of Moran’s I–statistic.
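The Moran’s I computation (done in the paper with spdep) can be sketched directly for a row-standardised $W$; the four-unit chain below is a toy example, and the standard deviate is obtained here by permutation rather than spdep’s analytical normality assumption.

```python
import numpy as np

def morans_i(x, W):
    """Moran's I statistic with a row-standardised weight matrix W."""
    z = x - x.mean()
    Wr = W / W.sum(axis=1, keepdims=True)   # row-standardise
    return (len(x) / Wr.sum()) * (z @ Wr @ z) / (z @ z)

def moran_standard_deviate(x, W, n_perm=999, seed=0):
    """Standard deviate of the observed I versus a null in which the
    values are randomly shuffled across areal units (IID null)."""
    rng = np.random.default_rng(seed)
    i_obs = morans_i(x, W)
    i_null = np.array([morans_i(rng.permutation(x), W)
                       for _ in range(n_perm)])
    return (i_obs - i_null.mean()) / i_null.std(ddof=1)

# Four areal units in a chain, 0-1-2-3, with a smooth spatial trend
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0, 4.0])
sd = moran_standard_deviate(x, W)   # large positive => autocorrelation
```

A smooth trend yields a positive I (neighbors resemble each other), while an alternating pattern yields a negative I, mirroring the interpretation of the standard deviates in Table 1.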
4 Formulation
Next we propose an epidemiological model to estimate the infection-rate field described over adjacent geographical regions (areal units). The quality of the data in these areal units can vary significantly; Fig 1 (right) plots the Summer wave for 3 NM counties, and Cibola shows an anomalous spike in the middle of the outbreak. The epidemiological model will have to smooth over such large spikes when and where they occur. While we will demonstrate our model (in this paper) on data from Bernalillo, Santa Fe and Valencia (which, per Fig 1, show reasonably good quality), our model has been used to estimate the infection-rate field over all 33 NM counties [9].
We develop our model as an extension of our previous work [4,6], which was meant for a single areal unit, by incorporating into it a GMRF to represent spatial auto-correlations. This is in contrast to the conventional approach of using a space-time proper Conditional AutoRegressive (pCAR-ST) model for the infection-rate field. The reasons for our alternative approach are two-fold. First, the use of a pCAR-ST model would require us to use a time-series model to capture temporal variations; as Fig 1 (right) shows, this might be difficult due to the large reporting errors in the data. Instead, we rely on our previous model that uses exogenous time-scales (the incubation period and the time-profile of the infection-rate) to perform the temporal smoothing. Secondly, Lawson and co-workers [46–48] used a pCAR model with a one-step-ahead model in time (much like an AR(1) model) and found it to be no better than a purely temporal model [46], though later papers with COVID-19 data showed the opposite result [47,48]. This uncertainty caused us to forgo the conventional pCAR-ST approach in favour of our older model, which was formulated intrinsically in terms of the infection-rate and has proven to be robust, even in the presence of multiple waves [4].
The framework described below is implemented in Python and released as open source [68].
4.1 Epidemiological model
The epidemiological model combines an infection-rate model and an incubation-rate model. In a given areal unit r, the infection-rate is assumed to follow a Gamma distribution (in time) with a probability density function (pdf) given by

f_inf(t; k_r, θ_r) = t^{k_r − 1} exp(−t/θ_r) / [Γ(k_r) θ_r^{k_r}].  (1)

The infection-rate in Eq (1) is controlled by two parameters, k_r (shape) and θ_r (scale), and is sufficiently flexible to capture a range of outbreaks. The third parameter, t_{0,r}, represents the start of the outbreak and will be inferred jointly with the infection-rate parameters. For incubation we employ a model calibrated against early COVID-19 data [69]. This model follows a lognormal distribution with parameters (μ, σ) and a cumulative distribution function (CDF) given by

F_inc(t; μ, σ) = (1/2) [ 1 + erf( (ln t − μ) / (σ √2) ) ].  (2)
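As a numerical sketch, the Gamma infection-rate profile of Eq (1) can be evaluated with scipy; the shape and scale values below are hypothetical, not calibrated values from the paper:

```python
import numpy as np
from scipy.stats import gamma

# Hypothetical shape/scale for one areal unit r (not calibrated values)
k_r, theta_r = 3.0, 10.0   # shape, scale (days)

t = np.linspace(0.0, 120.0, 121)            # days since the outbreak start t_{0,r}
f_inf = gamma.pdf(t, a=k_r, scale=theta_r)  # infection-rate time profile

# The profile is a valid pdf: it integrates to ~1 over a long horizon
print(np.trapz(f_inf, t))
```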
Note that the lognormal parameters μ and σ are not constants, but are random variables themselves. The mean μ is approximated as a Student's t-distribution [70] and σ² is assumed to have a chi-square distribution. These choices result in the 95% confidence intervals for μ and σ described in Safta et al. [6]. We will refer to this model as the stochastic incubation model.
The cumulative number of people that have turned symptomatic between time t_{0,r} (the start of the current epidemic wave) and time t_i is computed as a convolution between the infection-rate and the CDF of the incubation model,

n_r(t_i) = N_r ∫_{t_{0,r}}^{t_i} f_inf(τ − t_{0,r}; k_r, θ_r) F_inc(t_i − τ; μ, σ) dτ,  (3)

where N_r is the total number of people that will get infected (and counted) during the entire epidemic wave in areal unit r. This model assumes that a person shows symptoms once the virus incubation has completed. Furthermore, once symptoms are evident, it is also assumed that individuals have prompt access to medical services or otherwise self-report the COVID-19 infection, getting counted without delay. These assumptions will be relaxed in future versions of this effort, where the model above will be endowed with latent variables that account for uncertainties due to reporting delays and unreported positive counts.
The number of people that turn symptomatic over the time interval (t_{i−1}, t_i], in areal unit r, is estimated as

n_{r,i} = n_r(t_i) − n_r(t_{i−1})  (4)
        ≈ (t_i − t_{i−1}) N_r ∫_{t_{0,r}}^{t_i} f_inf(τ − t_{0,r}; k_r, θ_r) f_inc(t_i − τ; μ, σ) dτ,  (5)

where f_inc is the pdf of the incubation model. In transitioning from Eq (4) to Eq (5) we made use of the approximation

F_inc(t_i − τ; μ, σ) − F_inc(t_{i−1} − τ; μ, σ) ≈ (t_i − t_{i−1}) f_inc(t_i − τ; μ, σ),  (6)

which amounts to approximating the incubation-model pdf with a histogram with bins of size Δt_i = t_i − t_{i−1}. Thus the four parameters that describe the epidemiological dynamics in an areal unit r are Θ_r = {t_{0,r}, N_r, k_r, θ_r}, and Θ = {Θ_1, …, Θ_R} is the collection of these parameters over all R areal units. We will refer to them colloquially as the "epidemiological" parameters. In this paper we focus on outbreak detection, for which a single-wave model, as above, is sufficient. Given the assumptions above, these outbreak forecasts represent a lower bound on the actual number of people that are infected with COVID-19. A fraction of the population infected with a novel disease might also exhibit minor or no symptoms at all and might not seek medical advice, further contributing to lowering the predicted counts compared to the actual size of the epidemic.
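The convolution model above can be sketched numerically as follows; the infection-rate parameters and the lognormal incubation parameters are assumed, illustrative values, not the paper's calibrated ones:

```python
import numpy as np
from scipy.stats import gamma, lognorm

# Hypothetical parameters for one areal unit (illustrative only)
k_r, theta_r = 3.0, 10.0   # infection-rate shape/scale
N_r = 5000.0               # total infected-and-counted during the wave
mu, sigma = 1.57, 0.65     # assumed lognormal incubation parameters

days = np.arange(0.0, 121.0)   # t - t_{0,r}, daily grid

def cumulative_symptomatic(t):
    """N_r times the convolution of the infection-rate pdf with the incubation CDF."""
    tau = np.linspace(0.0, t, 400)
    f_inf = gamma.pdf(tau, a=k_r, scale=theta_r)
    F_inc = lognorm.cdf(t - tau, s=sigma, scale=np.exp(mu))
    return N_r * np.trapz(f_inf * F_inc, tau)

n_cum = np.array([cumulative_symptomatic(t) for t in days])
n_daily = np.diff(n_cum)       # new symptomatic counts per day

# Daily counts are non-negative and sum to (almost) N_r over a long wave
print(n_daily.sum())
```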
4.2 Model calibration
Given data in the form of time-series of daily counts, labeled generically as y, as shown in Sect 3.2, and the model predictions n(Θ) for the number of new symptomatic counts daily, presented in Sect 4.1, we will employ a Bayesian framework to calibrate the epidemiological model parameters. The discrepancy between the data and the model is written as

y_i = n_i(Θ) + ε_i,  (7)

where the full parameter vector collects the parameters Θ that describe the epidemiological models as well as those of the statistical discrepancy ε between the data and the epidemiological model. These parameters will be detailed in the following sub-sections. The probabilistic error model encapsulates both errors in the observations, e.g., availability of testing capabilities and test accuracy, as well as errors due to empirical modeling choices.
The multivariate distribution for the vector of parameters, denoted Φ, can be estimated in a Bayesian framework as

p(Φ | y) ∝ p(y | Φ) p(Φ),  (8)

where p(Φ | y) is the posterior distribution we are seeking after observing the data y, p(y | Φ) is the likelihood of observing the data y given a specific choice of parameters Φ, and p(Φ) contains the prior information about the model parameters. The subsections below provide a detailed description of the setup of the likelihood and prior distributions.
4.2.1 Likelihood construction with spatial correlations.
We now derive a likelihood expression which accounts for the discrepancies between the number of people reported symptomatic daily and the number of new cases predicted by the model, via Eq (5). We denote the reported daily count for day i by y_i = (y_{1,i}, …, y_{R,i}) and the daily predicted count by n_i(Θ) = (n_{1,i}, …, n_{R,i}), where n_{r,i} is the epidemiological model described in Eq (5), with Θ = {Θ_1, …, Θ_R} constituting the epidemiological parameters over the R regions, some of which might be adjacent. These, together with the error-model and spatial parameters introduced below, are the parameters that will be jointly inferred given the available data.
For a given day i, we state

y_i − n_i(Θ) = ε_i ∼ N(0, Σ_i),  (9)

i.e., we assume that the data–model mismatch is a multivariate Gaussian distribution with a block covariance matrix. We will assume that the discrepancies are independent over the temporal axis and correlated in space. Here Σ_i is the block in the large covariance matrix (that spans over N_d days of observations) that corresponds to the predictions for day i. Per the BYM model, we will model the discrepancy ε_i with two components, i.e., ε_i = ε_i^(s) + ε_i^(e). Per Fig 2 (bottom right), ε_i^(s) will be modeled with a pCAR to capture spatial auto-correlation. In contrast, ε_i^(e) models random, temporally independent, reporting errors and any model shortcomings. Consequently, the ε_i discrepancy is modeled as the sum of two independent, zero-mean multivariate Gaussian components [71], resulting in a joint covariance matrix given by

Σ_i = Σ_i^(e) + P^(−1),  (10)

where Σ_i^(e) = diag_r( (σ_a + σ_m n_{r,i})² ) combines an additive (σ_a) and a multiplicative (σ_m) error component, and P is the precision matrix associated with the GMRF model assumed to account for the spatial correlations between adjacent regions (a proper Conditional Auto-Regressive (pCAR) model [21]). We will refer to the parameters (σ_a, σ_m) as the "error model" (or ErrM). The precision matrix P is defined as

P = τ (D − α W), with D = diag(g_1, …, g_R).  (11)

Here, g_j is the number of regions adjacent to region j, and W is a matrix that encodes the relative topology of the regions considered in the joint inference, with entries defined as

W_{jk} = 1 if regions j and k are adjacent, and W_{jk} = 0 otherwise (including j = k).  (12)

Thus P defines a pCAR spatial model with row-standardization and is a function of the "spatial coefficients" (or SpC) (τ, α), which will also have to be estimated from the data. The inclusion of the pCAR component implies that the epidemiological parameters Θ will display spatial correlation. The magnitude of the correlation is unknown a priori, and will be estimated from the case-count data.
The dimensionality of the inverse problem (for fitting the spatial model to data from R areal units) is 4R + 4, whereas that of the purely temporal problem is 6. Thus for R > 2 the spatial inverse problem is less flexible (i.e., has fewer parameters) than R separate fits to data (corresponding to 6R parameters being estimated from data). This implies that if the data from all R areal units were of the same good quality, individual model fits could be better than that of the spatial model. However, all areal units do not have data of the same quality [9] (see Fig 1 (right)) and purely temporal fits to individual areal units may not even be possible. This was the motivation behind the development of the spatial model, so that we could exploit spatial auto-correlations to compensate for bad data.
To summarize, the accuracy of the spatiotemporal model for epidemiological dynamics is controlled by the epidemiological parameters of each areal unit together with the error-model and spatial coefficients, which will be inferred from data from R NM counties. The dimensionality of the inverse problem scales with R and is limited by the scalability of the inversion method. We will use R = 3 and consider inferences using the following setups:
- independent inferences (i.e., R = 1), county by county, for the counties of Bernalillo, Santa Fe, and Valencia.
- two adjacent counties (i.e., R = 2), i.e., Bernalillo & Santa Fe and Bernalillo & Valencia. For these cases the covariance matrix P^(−1) corresponding to the GMRF model is given by

P^(−1) = 1 / (τ (1 − α²)) [ 1  α ; α  1 ].  (13)
- three counties (i.e., R = 3), Bernalillo, Santa Fe, and Valencia, jointly. Bernalillo is adjacent to the other two counties but Santa Fe and Valencia do not share a border. With the counties ordered as (Bernalillo, Santa Fe, Valencia), the GMRF covariance matrix P^(−1) is given by

P^(−1) = 1 / (2 τ (1 − α²)) [ 1  α  α ; α  2 − α²  α² ; α  α²  2 − α² ].  (14)
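As a numerical sketch (with hypothetical values for the spatial coefficients τ and α), the pCAR precision matrix for the three-county adjacency can be assembled and inverted:

```python
import numpy as np

tau, alpha = 5.0, 0.5   # hypothetical spatial coefficients (SpC)

# Adjacency for (Bernalillo, Santa Fe, Valencia): Bernalillo borders the other two
W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
g = W.sum(axis=1)        # number of neighbors g_j per region
D = np.diag(g)

# pCAR precision matrix: P = tau * (D - alpha * W)
P = tau * (D - alpha * W)

# The GMRF covariance used in the likelihood is the inverse of P
Sigma = np.linalg.inv(P)
print(np.round(Sigma, 3))
```

Note that Santa Fe and Valencia, though not adjacent, end up with a nonzero covariance through their shared neighbor Bernalillo.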
4.2.2 Prior distributions.
We employ wide uniform priors for the shape and scale parameters, k_r and θ_r, of the infection-rate model in Eq (1) to ensure these parameters are constrained only by the information contained in the data. We also employ a uniform prior for the total count of infected people during the pandemic, N_r. In the algorithm implementation, we employ a normalized count for each areal unit, N_r divided by the total population of areal unit r. Thus the prior distribution for this normalized count is U(0,1). From our previous work [4,6] we observed that the convolution model in Eqs (3)–(5) exhibits sharp transitions when the inferred start time t_0 is not well constrained by the data, for example in situations where the daily counts are noisy in the low single digits. For this reason we selected for t_0 a Gaussian prior with a wide standard deviation, approximately 10 days, to allow the data to easily overcome this prior when the number of counts increases beyond the low single digits.
Further, to ensure the discrepancy model parameters, the additive and multiplicative standard deviations σ_a and σ_m, are automatically positive, we work with their natural logarithms in the Bayesian framework. Consequently, the equivalent uninformative prior for the logarithms, ln σ_a and ln σ_m, is the uniform distribution. For both these parameters, we bound the natural logarithm values to [−30, 10], a range sufficiently wide to account for the discrepancies between model predictions and observations, while preventing numerical underflow or overflow errors during the Markov chain Monte Carlo (MCMC) sampling.
For the parameters controlling the pCAR model, we employ a Gamma distribution with shape 10 and scale 2 for the precision parameter τ and a uniform distribution U(0,0.9) for the spatial dependence parameter α, following Shand et al. [56]. The prior distributions are summarized in Table 2 for clarity.
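The prior choices above can be sketched as scipy distributions; the bounds for k_r and θ_r are hypothetical placeholders (the paper only states they are wide), while the remaining choices follow the text:

```python
from scipy import stats

priors = {
    # wide uniform priors for the infection-rate shape/scale (bounds hypothetical)
    "k_r":       stats.uniform(loc=0.1, scale=9.9),     # U(0.1, 10)
    "theta_r":   stats.uniform(loc=0.1, scale=99.9),    # U(0.1, 100)
    # normalized total count N_r / population is U(0, 1)
    "N_r_norm":  stats.uniform(loc=0.0, scale=1.0),
    # Gaussian prior on the outbreak start, sd ~ 10 days (reference day hypothetical)
    "t_0":       stats.norm(loc=0.0, scale=10.0),
    # uniform priors on the logs of the error-model standard deviations
    "log_sig_a": stats.uniform(loc=-30.0, scale=40.0),  # U(-30, 10)
    "log_sig_m": stats.uniform(loc=-30.0, scale=40.0),
    # pCAR spatial coefficients, following Shand et al.
    "tau":       stats.gamma(a=10.0, scale=2.0),
    "alpha":     stats.uniform(loc=0.0, scale=0.9),     # U(0, 0.9)
}

# e.g., total log-prior density evaluated at each prior's mean
print(sum(d.logpdf(d.mean()) for d in priors.values()))
```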
4.2.3 Sampling the posterior distribution.
As in our previous work on epidemiological models [4,6], we employ an MCMC algorithm to sample from the posterior density, specifically the AMCMC framework [8]. To accommodate the stochastic incubation model (Eq (2)), we employ an unbiased estimate of the likelihood presented in Eq (9). For each MCMC step we select a random set of incubation-model parameters (μ, σ) according to their prescribed distributions, then run the epidemiological model to generate predicted counts and estimate the likelihood. This approach is similar to the pseudo-marginal MCMC algorithm [72], guaranteeing that the resulting samples correspond to the correct posterior distribution. We use the Effective Sample Size (ESS) [73] estimate to gauge the number of samples sufficient to describe the posterior distribution given the data available. For the results presented in this paper, we found that 1 to 2 million MCMC samples were needed to extract the 5K-10K effective samples required to estimate summary statistics and marginal distributions for the epidemiological models' parameters.
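As a sketch, an autocorrelation-based ESS estimate can be computed as below (one common ESS definition, not necessarily the exact estimator of [73]):

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = N / (1 + 2 * sum of leading positive autocorrelations)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x.var() * n)
    s = 0.0
    for rho in acf[1:]:
        if rho < 0.0:          # truncate at the first negative autocorrelation
            break
        s += rho
    return n / (1.0 + 2.0 * s)

# A strongly autocorrelated AR(1) chain yields far fewer effective samples
rng = np.random.default_rng(0)
z = np.zeros(5000)
for i in range(1, z.size):
    z[i] = 0.95 * z[i - 1] + rng.normal()
print(effective_sample_size(z))   # far below the 5000 raw draws
```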
4.2.4 Diagnostics.
The sampling process described in Sect 4.2.3 yields samples of the model parameters from the posterior probability density function (PDF), and the question arises of how we assess the accuracy and predictive skill of the PDF. Primarily, we will use posterior predictive tests, whereby we select 100 samples from the posterior PDF and use Eq (8) to predict case-counts. These forecasts will be limited to 14 days, beyond which, as described in our previous papers [6], the model is not expected to be predictive. Fundamentally, observations up to time t contain information about epidemiological dynamics up to time t + T_inc, with T_inc being a measure of the incubation period; after that, an increasing fraction of the infected people have yet to show symptoms and appear in the case-counts. Using the mean incubation period plus twice the standard deviation as an estimate for T_inc (Eq 2), we obtain approximately two weeks, and so we curtail forecasting at a 2-week horizon. These forecasts are compared with the observed case-counts, and in case of a mismatch, the epidemiological dynamics are assumed to have changed after time t. Apart from forecasting, the correlation structure in the posterior samples can be informative. For each of the areal units of interest, we plot 2D marginal plots (in the Appendix) and, in Sect 5.2, perform a grouped statistical dependence analysis to uncover how parameters for each areal unit vary with those from other areal units, or with global parameters such as the spatial coefficients.
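The 2-week forecast cutoff can be sanity-checked against an assumed lognormal incubation model; the μ and σ values below are illustrative (in the range reported for early COVID-19 data), not the paper's calibrated values:

```python
import numpy as np
from scipy.stats import lognorm

mu, sigma = 1.57, 0.65              # assumed lognormal incubation parameters
inc = lognorm(s=sigma, scale=np.exp(mu))

T = inc.mean() + 2.0 * inc.std()    # mean incubation + 2 standard deviations
print(round(T, 1), "days")          # on the order of two weeks
```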
5 Results
Our use of AMCMC [8] (which is not very scalable when coupled with a moderately computationally expensive model) limits us to 10-15 dimensional posterior distributions. For this reason, we limit our study to three regions, R = 3, for a total of 16 parameters, i.e. 4 parameters for each region, and 4 parameters to describe the error model and correlations between regions. The prior distributions for these parameters are listed in Table 2 for clarity.
We selected three NM counties, Bernalillo, Santa Fe and Valencia, shown in Fig 3, as this allows us to understand whether the adjacency between counties plays a role in the model calibration. Bernalillo is sandwiched between the other two counties and thus shares boundaries with both, while Santa Fe and Valencia do not share a boundary.
Simulations used between 1 and 2 million samples to ensure converged estimates for marginal posterior distributions and summary statistics. Simulations took approximately 1 hour for one areal unit and 1.5 hours for three areal units to generate 10^6 samples and used between 1 and 5 GB of memory. All simulations and analysis were performed on a machine with an Apple M2 Max processor.
5.1 Markov chain Monte Carlo results
In this section we discuss summaries of the samples drawn from the posterior distributions via MCMC. We first compare posterior results obtained for 1-, 2-, and 3-region statistical inference runs and then examine their impact on the quality of model predictions vs the available observations. In Fig 4 we plot the 1D marginalized posterior PDFs of the epidemiological parameters, i.e., (t_{0,r}, N_r, k_r, θ_r), for all three counties. The 2D marginals are in the Appendix in Figs 11, 13 and 12. The 1D PDFs were computed using data from all three counties jointly (denoted "3r" in the legend), jointly using data from 2 counties at a time (denoted as "2r") and independently (denoted as "1r" inversions). We see that joint estimation does not noticeably sharpen the PDFs for any of the objects of interest (OOI), but does shift the PDFs for Santa Fe. This robustness to population size is because the likelihood for the inverse problem is constructed with normalized counts, implying that the larger case-counts observed in Bernalillo (about 6 times larger than Santa Fe or Valencia) do not bias the results against the smaller counties. We note that the PDFs for Valencia do not change much in the three estimations. t_{0,r} values are negative as they are measured from June 2020, and the PDFs imply that infections for the Summer wave started in late May. In Fig 5, top row, we plot the parameters of the GMRF, i.e., the spatial coefficients. It is clear that these spatial parameters can be estimated from the 2r and 3r inversions, with one of them becoming easier to estimate with specificity as we add more regions, at the expense of the other.
Top row: PDFs for t_{0,r}. t_{0,r} values are negative as they are measured from June 2020, and the PDFs imply that infections for the Summer wave started in late May. Second row: PDFs for N_r. Third row: PDFs for k_r. Bottom row: PDFs for θ_r.
In Fig 5, bottom row, we plot the noise parameters for Santa Fe, obtained from the same set of inversions. We see that the noise parameters are small and can be estimated, though under joint estimation it becomes progressively more difficult to estimate one of them with much specificity, while the other becomes easier. This is because the former estimates the magnitude of the epidemiological processes unexplained by our model, and the genesis of these processes is likely to be different in the three counties, leading to the difficulty in estimation. This can be explained using Eq (10), where the two noise components appear additively, so the uncertainties in one could be exchanged for the other, as can be seen in Fig 5 top left and bottom right.
In Fig 6, we plot the fit of the model to data till September 15, 2020 (the arrival of the Fall 2020 wave) and the two-week forecasts done after that. These predictions are performed by randomly drawing 100 samples from the posterior distribution (Fig 4) and running the model forward from the start of our calibration period to the end of September 2020 (note that the calibration data stops at September 15, 2020, and the rest is a forecast). The data for the two-week period is also plotted and is not supposed to agree with the forecast, as the calibrated model does not contain information about the Fall 2020 wave. We see quite clearly that the uncertainties in predictions (the dashed blue lines denoting the outer percentiles) are tighter for the 3-region joint inversion (top row) for all three counties vis-à-vis the purely temporal fits in the bottom row. Note that this tightness does not necessarily imply that the forecast is more accurate. However, it does imply that it becomes easier for us to detect the discrepancy between the forecast and the data, the marker for the arrival of the Fall 2020 wave. This is particularly true for Santa Fe. The agreement between the predictions (up to September 15, 2020) and the reported case-counts is quantified using the CRPS (Continuous Ranked Probability Score [74]) and tabulated in Table 3. Note that in this case, the CRPS quantifies the "goodness" of the fit of the model to data collected up to September 15, 2020. We see that the most accurate predictions (i.e., better fits) do arise from independent estimations, but the 3r inversions are close behind. This is expected and was explained in Sect 4.2.1: if areal units have data of similar (good) quality, purely temporal fits to areal data may be slightly better than the spatial model. In our case, the quality of data is similar across all three counties. Further, CRPS has units of "cases per day" and, as Table 3 shows, the difference in the CRPS arising from 1r and 3r estimations is less than 1 case per day. This small difference is not the consequence of over-smoothing in time by our "mechanistic" treatment of the time-evolution of the outbreak vis-à-vis a pCAR-ST model, which would likely employ an auto-regressive or moving-average approach. In Fig 7, we plot the corresponding infection-rates for all three counties. Differences in the estimated infection-rates, 3r joint estimation (top row) versus independent (bottom row), are difficult to discern. This is because the infection-rate is only affected by (t_{0,r}, k_r, N_r, θ_r) and, as is clear from Fig 4, there is not much difference in their posterior PDFs. Instead, it is the noise and spatial parameters whose estimates differ as we add more regions to the joint estimation (see Fig 5).
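The CRPS values in Table 3 summarize fit quality; for reference, the empirical CRPS of an ensemble forecast can be computed with the standard energy-form estimator (a sketch; the forecast and observation below are hypothetical numbers, not values from the paper):

```python
import numpy as np

def crps_ensemble(samples, obs):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'| for an ensemble X and observation y."""
    x = np.sort(np.asarray(samples, dtype=float))
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# Ensemble of predicted daily counts vs an observed count (hypothetical)
rng = np.random.default_rng(1)
forecast = rng.normal(100.0, 10.0, size=500)   # 500 posterior-predictive draws
print(crps_ensemble(forecast, 103.0))          # units: cases per day
```

A perfect deterministic forecast has CRPS 0; smaller values indicate better probabilistic fits.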
The red line is the median prediction, the shaded teal region is the inter-quartile range, the dashed lines are the outer uncertainty percentiles, and the white circles are actual counts in the forecast regime.
The top row contains results obtained via joint inference (using the GMRF model) for Bernalillo (left), Santa Fe (middle), and Valencia (right). Results from independent inferences for each county separately are shown in the bottom row. The calibration data spans up to September 15, 2020, and the case-count data was smoothed with a 7-day running average. The red line is the median prediction, the shaded teal region is the inter-quartile range, and the dashed lines are the outer uncertainty percentiles.
5.2 Statistical dependence analysis
In this section we use distance correlation [75] to ascertain the degree of dependence in the posterior distributions for individual parameters and between collections of parameters that define the model for individual counties. Distance correlation values, denoted dCor, reveal the relationships between model parameters inside each region and between regions when the parameters are inferred jointly. This information can be used to aid in model construction and to gauge the degree to which the parameters controlling the dynamics of the epidemics are connected across region boundaries and therefore can benefit from a joint inference approach.
Numerically, we estimate the distance correlation using the algorithm presented in definition 3 in Székely et al. [75]. This algorithm employs samples generated by the MCMC exploration of the joint posterior distribution of the model parameters and estimates the degree of dependency between individual parameters conditioned on the case-count data available. We also employ this approach to estimate the degree of dependence between parameter subsets, grouped by regions.
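A compact sketch of the sample distance correlation via the double-centering construction of Székely et al. [75] (the plain estimator, not a bias-corrected variant; the data below are synthetic):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples of equal length."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    a = np.abs(x - x.T)                 # pairwise distance matrices
    b = np.abs(y - y.T)
    # double-center each distance matrix
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

rng = np.random.default_rng(2)
u = rng.normal(size=2000)
print(distance_correlation(u, u ** 2))                  # detects nonlinear dependence
print(distance_correlation(u, rng.normal(size=2000)))   # near zero for independence
```

Unlike the Pearson coefficient (which is zero for u vs u²), dCor picks up the nonlinear dependence.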
Table 4 shows dCor values for Bernalillo (left table) and Santa Fe (right table). The entries in this table can be viewed as quantitative assessments of the shapes observed for the 2D marginal PDFs presented in the right frames of Figs 11 and 12 included in the Appendix. For both counties we observe strong dependencies between k and θ, the shape and scale parameters of the Gamma distribution used to model the infection-rate, and t_0. These strong dependencies, reflected in the corresponding narrow 2D marginal PDFs (in Figs 11 and 12 in the Appendix), are induced by the strong constraints imposed by the available case-count data and the infection-rate dynamics. The error model parameters exhibit little dependency among themselves and with other model parameters for Bernalillo county, which is driven by larger case-count values. However, for Santa Fe, which exhibits lower case-counts and hence changes in case-count values are more relevant, the model discrepancy parameters show non-negligible dependencies with respect to each other and other model parameters. Similar trends are also observed for Valencia county (results not shown), for which the observed case-counts are comparable in magnitude to Santa Fe.
Table 5 shows dCor values computed with MCMC samples corresponding to a joint inversion for the three counties simultaneously. The sections in this table are colored to highlight the different types of parameter dependencies. The dCor values corresponding to Bernalillo and Santa Fe counties (colored in orange) are similar to the corresponding values obtained when the epidemiological models are calibrated region by region. This is expected, as the infection-rate models are defined on a per-region basis and hence the corresponding parameters are affected primarily by regional case-counts. Given the large discrepancy between the magnitude of the case-counts in adjacent regions, the additive component of the error model is now less impactful compared to the multiplicative component. The spatial correlation model parameters and the multiplicative error model component show non-negligible dCor values (with joint PDFs displaying negative correlations; results not shown). We also show, in Table 6, the corresponding dCor values between model parameters grouped by model components, i.e., by region, then spatial correlation and error models, respectively. These results are essentially summaries of the corresponding values aggregated in similarly colored regions in Table 5.
6 Discussion
In our previous papers [4,6], we had demonstrated, for a single areal unit, that we could estimate the infection-rate with a sufficient degree of accuracy so as to be able to provide short-term (2-week-ahead) forecasts of the evolution; the results in Sect 5 show that this ability is preserved when we introduce a pCAR model to impose spatial correlation across multiple areal units. Given that the inversion is, in effect, a smoothing operation (i.e., the observations inform infection processes that happened in the past), any discrepancy between forecasts and observations could be caused by a sudden change in the infection-rate. Thus it may be feasible to detect the arrival of a new wave of infection using the (latent) infection-rates estimated in Fig 7.
The state of NM experienced three waves of COVID-19 infections in 2020; the state-wide totals of case-counts are shown in Fig 1 (left). The second wave, felt between June and September 2020, provides us with ample data to infer an infection-rate and forecast the outbreak till the end of September. As is clear from Fig 1 (left), these forecasts will deviate from the data due to the arrival of the third wave (henceforth the "Fall 2020" wave). Our aim is to use the estimated infection-rate to detect the Fall 2020 wave, and compare our performance versus a conventional method. We will also conduct such a test using data collected till August 2020 (before the Fall 2020 wave) and check whether our infection-rate method detects a (false) positive.
We sample the posterior distribution (plotted in Fig 4) and produce a fantail of predictions of the evolution of the outbreak; an upper-percentile prediction is treated as the "outlier boundary" (similar to SCPO [53]) and any day with a case-count above the boundary is deemed an "outlier". We treat three consecutive days of outliers as an "alarm" indicating an anomaly in the behavior of the data with respect to the infection-rate estimated before. This is plotted for Bernalillo, Santa Fe and Valencia counties in Fig 8 (top row). The green line denotes September 15. Beyond this date, we see a number of days where the case-counts lie above the red "outlier boundary"; these are circled in red. Some days also have their case-count data encased inside a box; these are the third of a 3-day sequence of outlier days (and thus an "alarm" day). We see that in all three counties, we could detect the arrival of the Fall 2020 wave successfully. We repeated the infection-rate estimation using data from June to August 2020 and performed a similar check for "alarm" days in the weeks that followed; these are plotted in Fig 10 (in the Appendix). We detect many outlier days and, in a couple of areal units, false "alarm days". Thus monitoring the infection-rate allows us to detect the Fall 2020 wave when it is present but could also lead to false alarms when it is absent.
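The outlier/alarm logic above reduces to a short scan; the boundary and counts below are hypothetical placeholders for the per-day posterior-predictive percentile curve and the observations:

```python
import numpy as np

def alarm_days(counts, boundary, run_length=3):
    """Flag days ending a run of `run_length` consecutive outliers (count > boundary)."""
    outlier = np.asarray(counts) > np.asarray(boundary)
    alarms = []
    streak = 0
    for day, flag in enumerate(outlier):
        streak = streak + 1 if flag else 0
        if streak >= run_length:
            alarms.append(day)   # third (or later) consecutive outlier day
    return alarms

# Hypothetical forecast boundary and observed counts with an emerging wave
boundary = [50] * 10
counts   = [42, 48, 51, 47, 49, 55, 58, 63, 70, 81]
print(alarm_days(counts, boundary))   # -> [7, 8, 9]
```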
The symbols are the observed case-counts for Bernalillo (left), Santa Fe (middle), and Valencia (right). The Fall 2020 wave is believed to have started around September 15. The red line beyond September 15 is the outlier boundary; a day with a case-count above the dashed line is an "outlier" and is circled. A data point with a square box around it denotes the last of a sequence of three consecutive alarmed days. In all cases we see that the GLR-Poisson detector misses the Fall 2020 wave.
Next we compare the performance of the detection method using the infection-rate against a conventional detector [7], which we call “GLR-Poisson” (for Generalized Likelihood Ratio - Poisson). Our detection method is implemented in Python while the prior approach is provided in R package surveillance [76]. This detector uses the raw case-counts to fit a time-series model (complete with prediction uncertainty bounds) and thus detect “outlier days”. The detector has two formulations, one based on the negative binomial (NB) distribution and another based on Poisson.
The case-count y_t on any day t is modeled as y_t ~ NB(μ_t, ν) (or y_t ~ Poisson(μ_t)), where μ_t is the mean and ν is the dispersion of a NB distribution. The mean is modeled as

log μ_t = β_0 + β_1 t + Σ_{s=1}^{S} [ β_{2s} sin(s ω t) + β_{2s+1} cos(s ω t) ],

where ω = 2π/365; in essence, this is a seasonal log-linear model with parameters β. We set S = 1, since there is clearly only one mode in Fig 1 (left). We fit a model μ_{0,t} using data from June to September 15, 2020, before the arrival of the Fall 2020 wave, and test whether a new model (for μ_{1,t}), fitted solely to a moving window in the post-September 15 data, explains it appreciably better than the original μ_{0,t} model. Indexing the days after September 15 as t = 1, 2, …, we compute the set of days

{ n : max_{1 ≤ k ≤ n} Σ_{t=k}^{n} log [ f(y_t; μ_{1,t}) / f(y_t; μ_{0,t}) ] ≥ c_γ },

where f is the negative binomial distribution and c_γ is the detection threshold. In essence, in the 15-day period after September 15, we search for a window where the original μ_{0,t} model explains the data poorly. Note that this model does require much historical data to calibrate
(for example, to determine the seasonal nature of the outbreaks), something that is rarely available for novel diseases such as COVID-19. Using the distribution (negative binomial or Poisson), it is also possible to predict the case-count that would have caused an "outlier day". Per Kim et al. [48], the NB tends to give better fits whereas Poisson is preferable for small datasets, and so we test both formulations. The results with the NB distribution are clearly inferior and are in our technical report [77]. The results with the Poisson distribution are plotted in Fig 8 (right column), with the "outlier boundary" in red. For Bernalillo, in the post-September 15 period, we see many outliers and a few alarm days, implying that the Fall 2020 wave was detected. The detector does not show any alarms for Valencia or Santa Fe, thus completely missing the Fall 2020 wave. We repeat this analysis for data between June and August 2020 (see Fig 10 in the Appendix). Here the detector identifies outliers and alarms in the data for Bernalillo and Santa Fe, thus "detecting" the Fall 2020 wave a full month before its arrival; clearly, this is a false positive. In contrast, the detector behaves correctly for Valencia. The reason for the poor performance of the GLR-Poisson detector is likely the peculiarities of our COVID-19 data (no long historical record and low case-counts from sparsely populated areal units), which run afoul of many assumptions embedded in conventional disease detectors.
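A stripped-down sketch of GLR-based change detection for Poisson counts (a simplified stand-in for the surveillance package's implementation: a constant in-control mean and a window-wise MLE for the out-of-control mean; all numbers are hypothetical):

```python
import numpy as np
from scipy.stats import poisson

def glr_poisson_alarm(counts, mu0, threshold=5.0):
    """Return the first day n where the max over windows [k, n] of the Poisson
    log-likelihood ratio (window MLE vs in-control mean mu0) exceeds threshold."""
    y = np.asarray(counts, dtype=float)
    for n in range(1, len(y) + 1):
        glr = -np.inf
        for k in range(n):
            w = y[k:n]
            mu1 = max(w.mean(), mu0)   # one-sided: only increases raise alarms
            llr = np.sum(poisson.logpmf(w, mu1) - poisson.logpmf(w, mu0))
            glr = max(glr, llr)
        if glr >= threshold:
            return n - 1   # 0-based alarm day
    return None

# In-control mean 20/day, then a jump to ~40/day halfway through (hypothetical)
rng = np.random.default_rng(3)
y = np.concatenate([rng.poisson(20, 10), rng.poisson(40, 5)])
print(glr_poisson_alarm(y, mu0=20.0))
```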
Note that the ability to detect the Fall 2020 wave correctly does not imply that we have fashioned an infection-rate-based disease detector (e.g., we have not attempted to compute a Receiver Operating Characteristic curve); rather, it shows that the infection-rate of an outbreak of a novel disease has the information content that could be exploited within a disease detector. The smoothing effect of our estimation process (which reduces the effect of noise in the observations) and the use of epidemiological information, i.e., the incubation period distribution, compensate for the lack of long time-series data that conventional detectors rely on for information content, thus making our method particularly suited for novel outbreaks. For endemic diseases with long time-series and high-quality data, our method would possibly be unnecessarily complex.
Based on these results we identify three shortcomings. The first is that while our formulation is generalizable to many areal units, it has been demonstrated on just three areal units; i.e., we have not demonstrated the usefulness of the method on an areal unit with data so poor that it would not allow infection-rate estimation (or forecasting). This is due to the lack of scalability of MCMC. However, we have adapted our method to use approximate, but scalable, mean-field Variational Inference and scaled it to all 33 counties in NM; this is documented in a preprint [9]. That preprint illustrates how our method can provide infection-rate estimates in areal units with "spikes" in the data, i.e., where data collected over multiple weeks are occasionally reported all in one day. The second shortcoming is the susceptibility of the method to false positives. This can be addressed by averaging over time, though with some loss of timeliness of detection, as also illustrated in the preprint [9]. The third shortcoming is the use of Gaussian models throughout this paper, even though the low case-count data for some counties, for example Santa Fe, would have suggested a negative binomial distribution. This, however, would have required us to develop a random field model using negative binomials, and is left to future work.
The framework developed in this paper is based on a convolution between the infection-rate model and an incubation model. Earlier versions of this epidemic model have been employed in studies based on historical influenza and bubonic plague data [78,79]. Extensions of this framework rely on the availability of incubation models for the specific epidemic, which might not always be accurate during the early stages of a new epidemic. Nevertheless, the formulation can accommodate uncertain incubation models and produce forecasts commensurate with that uncertainty. This allows for a continuous refinement of the forecasts as the quality of the underlying models improves.
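As an illustration of this convolution structure, the minimal Python sketch below maps a daily infection-rate series to expected symptomatic case-counts by convolving it with a discretized incubation-period distribution. The lognormal parameters are illustrative placeholders (roughly in the range reported for COVID-19 [69]), not the values used in this paper.

```python
import numpy as np
from scipy.stats import lognorm

def predicted_cases(infection_rate, shape=0.42, scale=5.5):
    """Convolve a daily infection-rate series with a discretized
    lognormal incubation-period distribution (illustrative parameters)."""
    days = np.arange(len(infection_rate))
    # Probability that the incubation period lasts between d and d+1 days
    incubation_pmf = lognorm.cdf(days + 1, s=shape, scale=scale) \
        - lognorm.cdf(days, s=shape, scale=scale)
    # Full convolution, truncated to the observation window
    return np.convolve(infection_rate, incubation_pmf)[:len(infection_rate)]
```

The convolution delays and smears new infections into symptomatic case-counts according to the incubation distribution; replacing the lognormal with an uncertain (parameterized) incubation model yields forecasts that reflect that uncertainty.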
7 Conclusion
In this paper, we explore whether it is possible to use the (latent) infection-rate of a disease as a monitoring variable in disease surveillance. The infection-rate, which is governed by mixing patterns and the spreading characteristics of the pathogen in question, does not vary erratically from day to day; in contrast, observed case-counts, the monitoring variable for all conventional disease surveillance algorithms, are contaminated by reporting errors. The difficulty, of course, lies in estimating the infection-rate from the case-counts, which can have high variance when they are small numbers.
To this end, we developed a method to estimate a (spatiotemporal) infection-rate field defined over multiple areal units, conditional on case-count time-series, of various fidelities, gathered from those units. The aim of estimating a field, rather than a time-varying infection-rate inside a single areal unit, was driven by our desire to encode spatial patterns of epidemiological dynamics into the infection-rate field, allowing us to “borrow” information from neighboring areal units and compensate for poor-quality observations. The method was demonstrated on COVID-19 data from three counties of New Mexico: Bernalillo, Santa Fe, and Valencia. Our method uses COVID-19 data and exogenous covariates to uncover the spatial patterns in epidemiological dynamics and encode them as a Gaussian Markov random field (GMRF) model. We extend our original method for estimating the infection-rate in one areal unit [6] to multiple units, and use the GMRF to impose a degree of smoothing. Joint inversions for disease parameters showed that the PDFs and posterior predictive simulations for Santa Fe (which had low case-count data) were sharper than those obtained from inversions performed for a single areal unit.
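To sketch how a GMRF prior couples neighboring areal units, the snippet below evaluates an unnormalized CAR-type log-prior over three counties using their adjacency structure. The precision parameterization Q = τ(D − ρW) and the parameter values are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

# Adjacency for the three-county example: Bernalillo borders both
# Santa Fe and Valencia (order: Bernalillo, Santa Fe, Valencia).
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
D = np.diag(W.sum(axis=1))  # diagonal matrix of neighbor counts

def gmrf_logprior(x, tau=1.0, rho=0.9):
    """Unnormalized log-density of a CAR-type GMRF prior at field values x.
    tau controls overall precision; rho < 1 controls spatial coupling."""
    Q = tau * (D - rho * W)  # sparse precision: nonzeros only for neighbors
    return -0.5 * x @ Q @ x
```

A spatially smooth field (similar values in adjacent counties) receives a higher prior density than a rough one, which is the mechanism by which a county with poor data "borrows" strength from its neighbors.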
The infection-rate field, estimated using data from June 1, 2020 to September 15, 2020, was used to forecast the evolution of the outbreak two weeks ahead. Because the Fall 2020 wave of COVID-19 arrived on September 15, the forecasts were expected to be erroneous; i.e., a failing forecast acts as a detector of the new wave of infection. Our model’s performance was compared with that of a conventional surveillance algorithm that, like all other surveillance algorithms, relies on a long historical training database, which was not available for COVID-19 because of its novelty. Our method successfully detected the arrival within the two-week period, whereas the conventional detector failed. In addition, we tested the method with data through August 15, 2020, one month before the arrival of the Fall 2020 wave. Our method, as well as the conventional detector, suffered from false positives, detecting non-existent arrivals of the Fall 2020 wave in two of the three areal units considered here. This is due to the noise in the data. In addition, the conventional detector was negatively affected by an insufficiency of training data, but this is likely to be the case for any novel disease. Thus our premise that the infection-rate could be used as a monitoring variable in surveillance algorithms seems promising, and the approach does not require a large amount of data to function well. This is achieved by exploiting the characteristics of the disease, such as the incubation period, to compensate for sparse data. This robustness makes it particularly well-suited for novel diseases.
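The forecast-based detection logic described above can be sketched as follows: a day is flagged when the observed case-count exceeds an upper quantile of the posterior predictive forecast ensemble. The function name and the 95% threshold are illustrative choices for this sketch, not the paper's exact detector.

```python
import numpy as np

def flag_anomaly(forecast_samples, observed, q=0.95):
    """Flag days whose observed counts are inconsistent with the forecast.

    forecast_samples : (n_samples, n_days) posterior predictive draws
    observed         : (n_days,) observed case-counts
    Returns a boolean array; True marks a day above the q-quantile band.
    """
    upper = np.quantile(forecast_samples, q, axis=0)
    return observed > upper
```

A sustained run of flagged days, rather than a single exceedance, would signal the arrival of a new epidemic wave; averaging over several days trades timeliness for fewer false positives, as discussed above.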
Future studies will focus on modeling correlations between areal units that include multi-modal information. These developments will also benefit from statistical discrepancy models beyond Gaussian distributions, e.g., formulations that are tailored for low case-counts, or a mixture model that accounts for areal units with both low and high case-counts.
A software implementation of the method described in this paper and the associated data can be found at our GitHub repository [68].
Acknowledgments
We thank Lyndsay Shand and her co-workers for much of the data [56] used in this project. This work was funded by Sandia National Laboratories’ Laboratory Directed Research and Development (LDRD) program and the US Department of Energy, Office of Science’s Advanced Scientific Computing Research’s Biopreparedness Research Virtual Environment (BRaVE) program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This article has been authored by an employee of National Technology & Engineering Solutions of Sandia, LLC under Contract No. DE-NA0003525 with the U.S. Department of Energy (DOE). The employee owns all right, title and interest in and to the article and is solely responsible for its contents. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes. The DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan https://www.energy.gov/downloads/doe-public-access-plan.
Appendix
Figs 11–13 show 1D and 2D marginal posterior distributions for the three counties tackled in this study. These results indicate a strong correlation between the inferred start of the epidemic, t0, and the parameters of the infection model (e.g., k) for each of these regions. When calibrating the model for individual regions, the discrepancy between the model and the available observations results in an error model with both the additive and multiplicative components informed by the data for the counties with smaller populations, Santa Fe and Valencia. For Bernalillo, the additive error component alone is sufficient to model the discrepancy. When performing the statistical inference with all three counties jointly, the multiplicative component takes over, as that error-model component is less sensitive to phase shifts of the epidemic waves than the additive component.
We see that 12 principal components can cover 95% of the variations observed in the risk-factors.
August 15 is a month before the arrival of the Fall 2020 wave. Line and symbol settings are the same as in Fig 8. We see that for Bernalillo and Santa Fe, both methods suffer from false positives, raising alarms before the arrival of the Fall 2020 wave.
References
- 1. Daza-Torres ML, Capistrán MA, Capella A, Christen JA. Bayesian sequential data assimilation for COVID-19 forecasting. Epidemics. 2022;39:100564. pmid:35487155
- 2. Wang Z, Zhang X, Teichert GH, Carrasco-Teja M, Garikipati K. System inference for the spatio-temporal evolution of infectious diseases: Michigan in the time of COVID-19. Comput Mech. 2020;66(5):1153–76. pmid:35194281
- 3. Chen P, Wu K, Ghattas O. Bayesian inference of heterogeneous epidemic models: application to COVID-19 spread accounting for long-term care facilities. Comput Methods Appl Mech Eng. 2021;385:114020. pmid:34248229
- 4. Blonigan P, Ray J, Safta C. Forecasting multi-wave epidemics through Bayesian inference. Arch Comput Methods Eng. 2021;28(6):4169–83. pmid:34335019
- 5. Lin YT, Neumann J, Miller EF, Posner RG, Mallela A, Safta C, et al. Daily forecasting of regional epidemics of coronavirus disease with bayesian uncertainty quantification, United States. Emerg Infect Dis. 2021;27(3):767–78. pmid:33622460
- 6. Safta C, Ray J, Sargsyan K. Characterization of partially observed epidemics through Bayesian inference: application to COVID-19. Comput Mech. 2020;66(5):1109–29. pmid:33041410
- 7. Höhle M, Paul M. Count data regression charts for the monitoring of surveillance time series. Comput Statist Data Anal. 2008;52(9):4357–68.
- 8. Haario H, Saksman E, Tamminen J. An adaptive Metropolis algorithm. Bernoulli. 2001;7(2):223.
- 9. Bridgman W, Safta C, Ray J. Detecting outbreaks using a latent field: part II – scalable estimation. arXiv preprint 2024. https://arxiv.org/abs/2407.11233
- 10. Best N, Richardson S, Thomson A. A comparison of Bayesian spatial models for disease mapping. Stat Methods Med Res. 2005;14(1):35–59. pmid:15690999
- 11. Waller L, Carlin B. Disease mapping. In: Gelfand AE, Diggle PJ, Fuentes M, Guttorp P, editors. Handbook of spatial statistics. Chapman & Hall/CRC Press; 2010.
- 12. Huang X, Zhou H, Yang X, Zhou W, Huang J, Yuan Y. Spatial characteristics of coronavirus disease 2019 and their possible relationship with environmental and meteorological factors in Hubei Province, China. Geohealth. 2021;5(6):e2020GH000358. pmid:34189364
- 13. Geng X, Katul GG, Gerges F, Bou-Zeid E, Nassif H, Boufadel MC. A kernel-modulated SIR model for Covid-19 contagious spread from county to continent. Proc Natl Acad Sci U S A. 2021;118(21):e2023321118. pmid:33958443
- 14. Schüler L, Calabrese JM, Attinger S. Data driven high resolution modeling and spatial analyses of the COVID-19 pandemic in Germany. PLoS One. 2021;16(8):e0254660. pmid:34407071
- 15. McMahon T, Chan A, Havlin S, Gallos LK. Spatial correlations in geographical spreading of COVID-19 in the United States. Sci Rep. 2022;12(1):699. pmid:35027627
- 16. Rendana M, Idris WMR, Abdul Rahim S. Spatial distribution of COVID-19 cases, epidemic spread rate, spatial pattern, and its correlation with meteorological factors during the first to the second waves. J Infect Public Health. 2021;14(10):1340–8. pmid:34301503
- 17. Indika SHS, Diawara N, Jeng HA, Giles BD, Gamage DSK. Modeling the spread of COVID-19 in spatio-temporal context. Math Biosci Eng. 2023;20(6):10552–69. pmid:37322948
- 18. Ahmed M, Khan MdH-O-R, Alam Sarker MdM. COVID-19 SIR model: bifurcation analysis and optimal control. Results Control Optimiz. 2023;12:100246.
- 19. Harun-Or-Rashid Khan Md, Ahmed M, Alam Sarker MM. Optimal control of an SIRD model with data-driven parameter estimation. Results Control Optimiz. 2024;14:100346.
- 20. Lawson A, Lee D. Bayesian disease mapping for public health. In: Srinivasa Rao ASR, Pyne S, Rao CR, editors. Disease modelling and public health, Part A. Elsevier; 2017. p. 443–81.
- 21. MacNab YC. Bayesian disease mapping: past, present, and future. Spat Stat. 2022;50:100593. pmid:35075407
- 22. Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math. 1991;43(1):1–20.
- 23. Stern H, Cressie NA. Inference for extremes in disease mapping. In: Lawson A, editor. Disease mapping and risk assessment for public health. Chichester: Wiley; 1999. p. 63–84.
- 24. Cressie N. Statistics for spatial data. Wiley; 2015.
- 25. Baptista H, Mendes JM, MacNab YC, Xavier M, Caldas-de-Almeida J. A Gaussian random field model for similarity-based smoothing in Bayesian disease mapping. Stat Methods Med Res. 2016;25(4):1166–84. pmid:27566771
- 26. Best NG, Arnold RA, Thomas A, Waller LA, Conlon EM. Bayesian models for spatially correlated disease and exposure data. In: Bayesian statistics 6. Oxford University Press; 1999. p. 131–56. https://doi.org/10.1093/oso/9780198504856.003.0006
- 27. Unkel S, Farrington CP, Garthwaite PH, Robertson C, Andrews N. Statistical methods for the prospective detection of infectious disease outbreaks: a review. J Roy Statist Soc Ser A: Statist Soc. 2011;175(1):49–82.
- 28. Shewhart WA. Economic quality control of manufactured product. Bell System Technical Journal. 1930;9(2):364–89.
- 29. Serfling RE. Methods for current statistical analysis of excess pneumonia-influenza deaths. Public Health Rep (1896–1970). 1963;78(6):494.
- 30. Pelat C, Boëlle P-Y, Cowling BJ, Carrat F, Flahault A, Ansart S, et al. Online detection and quantification of epidemics. BMC Med Inform Decis Mak. 2007;7:29. pmid:17937786
- 31. Parker RA. Analysis of surveillance data with Poisson regression: a case study. Stat Med. 1989;8(3):285–94; discussion 331-2. pmid:2711062
- 32. Jackson ML, Baer A, Painter I, Duchin J. A simulation study comparing aberration detection algorithms for syndromic surveillance. BMC Med Inform Decis Mak. 2007;7:6. pmid:17331250
- 33. Farrington CP, Andrews NJ, Beale AD, Catchpole MA. A statistical algorithm for the early detection of outbreaks of infectious disease. J Roy Statist Soc Ser A (Statist Soc). 1996;159(3):547.
- 34. Reis BY, Mandl KD. Time series modeling for syndromic surveillance. BMC Med Inform Decis Mak. 2003;3:2. pmid:12542838
- 35. Cowpertwait SP, Metcalfe AV. Chapter 6. In: Introductory time series with R. Springer, Use R!; 2009.
- 36. Williamson GD, Weatherby Hudson G. A monitoring system for detecting aberrations in public health surveillance reports. Stat Med. 1999;18(23):3283–98. pmid:10602151
- 37. Le Strat Y, Carrat F. Monitoring epidemiologic surveillance data using hidden Markov models. Stat Med. 1999;18(24):3463–78. pmid:10611619
- 38. Martínez-Beneito MA, Conesa D, López-Quílez A, López-Maside A. Bayesian Markov switching models for the early detection of influenza epidemics. Stat Med. 2008;27(22):4455–68. pmid:18618414
- 39. Conesa D, Amorós R, López-Quilez A, Martínez-Beneito MA. Mean-variability hidden Markov models for the detection of influenza outbreaks. In: 25th International workshop on statistical modelling. Amsterdam, The Netherlands: Statistical Modelling Society; 2010.
- 40. Lu HM, Zeng D, Chen H. Markov switching models for outbreak detection. In: Castillo-Chavez C, Chen H, Lober WB, Thurmond M, Zeng D, editors. Infectious disease informatics and biosurveillance: research, systems and case studies. Springer US; 2011. p. 111–44.
- 41. Held L, Hofmann M, Höhle M, Schmid V. A two-component model for counts of infectious diseases. Biostatistics. 2006;7(3):422–37. pmid:16407470
- 42. Zhou T, Ji Y. Semiparametric Bayesian inference for the transmission dynamics of COVID-19 with a state-space model. Contemp Clin Trials. 2020;97:106146. pmid:32947047
- 43. Musulin J, Segota SB, Stifanic D, Lorencin I, Andelic N, Sustersic T. Application of artificial intelligence-based regression methods in the problem of COVID-19 spread prediction: a systematic review. Int J Environ Res Publ Health. 2021;18(8):4287.
- 44. Tariq MU, Ismail SB. AI-powered COVID-19 forecasting: a comprehensive comparison of advanced deep learning methods. Osong Public Health Res Perspect. 2024;15(2):115–36. pmid:38621765
- 45. Mollalo A, Vahedi B, Rivera KM. GIS-based spatial modeling of COVID-19 incidence rate in the continental United States. Sci Total Environ. 2020;728:138884. pmid:32335404
- 46. Lawson AB, Song H-R. Bayesian hierarchical modeling of the dynamics of spatio-temporal influenza season outbreaks. Spat Spatiotemporal Epidemiol. 2010;1(2–3):187–95. pmid:22749473
- 47. Lawson AB. Evaluation of predictive capability of Bayesian spatio-temporal models for Covid-19 spread. BMC Med Res Methodol. 2023;23(1):182. pmid:37568119
- 48. Kim J, Lawson AB, Neelon B, Korte JE, Eberth JM, Chowell G. Evaluation of Bayesian spatiotemporal infectious disease models for prospective surveillance analysis. BMC Med Res Methodol. 2023;23(1):171. pmid:37481553
- 49. Lawson AB, Kim J. Space-time covid-19 Bayesian SIR modeling in South Carolina. PLoS One. 2021;16(3):e0242777. pmid:33730035
- 50. Sartorius B, Lawson AB, Pullan RL. Modelling and predicting the spatio-temporal spread of COVID-19, associated deaths and impact of key risk factors in England. Sci Rep. 2021;11(1):5378. pmid:33686125
- 51. Lawson AB, Kim J. Issues in Bayesian prospective surveillance of spatial health data. Spat Spatiotemporal Epidemiol. 2022;41:100431. pmid:35691635
- 52. Rotejanaprasert C, Lawson A, Bolick-Aldrich S, Hurley D. Spatial Bayesian surveillance for small area case event data. Stat Methods Med Res. 2016;25(4):1101–17. pmid:27566768
- 53. Corberán-Vallet A, Lawson AB. Conditional predictive inference for online surveillance of spatial disease incidence. Stat Med. 2011;30(26):3095–116. pmid:21898522
- 54. Coronavirus (Covid-19) Data in the United States. https://github.com/nytimes/covid-19-data
- 55. COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. GitHub Repository. 2023. https://github.com/CSSEGISandData/COVID-19
- 56. Shand L, Foss A, Zhang A, Tucker JD, Huerta G. A statistical model for the spread of SARS-CoV-2 in New Mexico. SAND2020-10080. Albuquerque, NM: Sandia National Laboratories; 2020.
- 57. US Census Bureau. QuickFacts New Mexico. 2024. https://www.census.gov/quickfacts/NM
- 58. New Mexico Primary Care Association. Find a health center. 2024. https://www.nmpca.org/find-a-health-center
- 59. New Mexico Department of Health. COVID-19 screening and test sites. https://cvprovider.nmhealth.org/directory.html
- 60. Earth Data Analysis Center, University of New Mexico. Resource Geographic Information System. https://rgis.unm.edu
- 61. Erichson NB, Zheng P, Manohar K, Brunton SL, Kutz JN, Aravkin AY. Sparse principal component analysis via variable projection. SIAM J Appl Math. 2020;80(2):977–1002.
- 62. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. Springer; 2008.
- 63. R Core Team. R: A language and environment for statistical computing. 2023. https://www.R-project.org/
- 64. Friedman J, Hastie T, Tibshirani R, Narasimhan B, Tay K, Simon N. CRAN page for package glmnet. https://cran.r-project.org/web/packages/glmnet/index.html
- 65. Bivand RS, Wong DWS. Comparing implementations of global and local indicators of spatial association. TEST. 2018;27(3):716–48.
- 66. Bivand R. CRAN page for package spdep. https://CRAN.R-project.org/package=spdep
- 67. Bivand R. R packages for analyzing spatial data: a comparative case study with areal data. Geograph Anal. 2022;54(3):488–518.
- 68. PRIME GitHub Repository. https://github.com/sandialabs/PRIME
- 69. Lauer SA, Grantz KH, Bi Q, Jones FK, Zheng Q, Meredith HR, et al. The Incubation Period of Coronavirus Disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Ann Intern Med. 2020;172(9):577–82. pmid:32150748
- 70. Wackerly DD, Mendenhall W III, Scheaffer RL. Mathematical statistics with applications. 6th ed. Duxbury; 2002.
- 71. Cressie N, Johannesson G. Fixed rank kriging for very large spatial data sets. J Roy Statist Soc Ser B: Statist Methodol. 2008;70(1):209–26.
- 72. Andrieu C, Roberts GO. The pseudo-marginal approach for efficient Monte Carlo computations. Ann Statist. 2009;37(2):697–725.
- 73. Kass RE, Carlin BP, Gelman A, Neal RM. Markov chain Monte Carlo in practice: a roundtable discussion. The American Statistician. 1998;52(2):93–100.
- 74. Gneiting T, Katzfuss M. Probabilistic forecasting. Annu Rev Stat Appl. 2014;1(1):125–51.
- 75. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Statist. 2007;35(6).
- 76. Höhle M. surveillance: an R package for the monitoring of infectious diseases. Comput Statist. 2007;22(4):571–82.
- 77. Ray J, Safta C, Bridgman W, Horii M, Gould A. A spatially regularized detector for emergent/re-emergent disease outbreaks. Albuquerque, NM: Sandia National Laboratories; 2023. Report No.: SAND2023-09749R.
- 78. Ray J, Lefantzi S. Deriving a model for influenza epidemics from historical data. Sandia National Laboratories; 2011. Report No.: SAND2011-6633. https://doi.org/10.2172/1030332
- 79. Safta C, Ray J, Sargsyan K, Lefantzi S, Cheng K, Crary D. Real-time characterization of partially observed epidemics using surrogate models. Sandia National Laboratories; 2011. Report No.: SAND2011-6776. https://doi.org/10.2172/1030325