KGR-SKATER: Spatially clustered kernel graph regression for counting processes

Abstract

This paper proposes a procedure for fitting a spatiotemporal model with an interpretable and parsimonious dependence structure to high-dimensional non-Gaussian data. A graph is estimated to represent spatial dependence, and a locally periodic kernel is estimated to represent temporal dependence. These two components are then combined via a Kronecker product, producing a separable spatiotemporal covariance matrix that can account for multiple relevant variables and their dependencies at different time scales. Spatial clustering is used to reduce the dimensionality and estimation is carried out via the integrated nested Laplace approximation. The proposed model, which relies on all of these preceding steps, is introduced along with some alternative spatiotemporal reference models for comparison. The utility of this novel multi-step procedure is demonstrated by modeling monthly time series of respiratory-related deaths across California. Social deprivation indices are used to learn a graph structure, and surrogate variables constructed from exposure adjusted measures of air quality are used to learn time series regression relationships encoded by a kernel. The modeling results indicate that KGR-SKATER models fit the data in and out of sample as well as the reference models and have better coverage properties. A synthetic case study is also presented to demonstrate how the proposed procedure makes better forecasts than the reference model in settings where time series exhibit less stationary amplitudes and periods. Data and code available at: https://doi.org/10.5061/dryad.j3tx95xtt.

1 Introduction

Modeling point processes is particularly challenging when dealing with heterogeneous and high-dimensional spatiotemporal dependence structures. This paper proposes a novel framework for modeling complex spatiotemporal processes in a way that reduces dimensionality, accounts for spatiotemporal autocorrelation, and accommodates exogenous and endogenous time series regression structures. The approach shows particular promise for complex spatiotemporal data with evolving dependencies, offering a practical solution to the challenging problem of modeling heterogeneous covariance structures in space and time. This approach is motivated by an important real-world application aimed at characterizing how spatial patterns in socioeconomic indicators and temporal patterns in air pollutant levels drive respiratory-related mortality in California.

The field of point process analysis has evolved significantly since the early works of [1–3]. Key contributions to the field include the Besag-York-Mollié (BYM) model from [4], adapted from Bayesian image restoration, spatial generalized linear mixed models (GLMM) with spatial random effects [5], and log-Gaussian Cox processes (LGCP) from [6] for modeling inhomogeneous and dependent spatiotemporal data.

Modeling complex spatiotemporal dependence structures is especially challenging and computationally expensive for high-dimensional non-Gaussian data. Non-sparse covariance matrices are difficult or even intractable to estimate in high-dimensional cases where there are many autocorrelations to account for. Furthermore, flexible response distributions are necessary to account for various patterns of over- and underdispersion. A latent Gaussian model with a nontrivial dependence structure specification will be computationally expensive to estimate without some assumptions and/or approximations. Some recent techniques to facilitate estimation include the integration of adaptive radial basis functions in spatial GLMMs [7], the use of Integrated Nested Laplace Approximation (INLA) [8] for efficient Bayesian inference, and the combination of kernel regression and graph signal processing for improved prediction models [9–11].

This paper presents a novel procedure that combines the following useful techniques: (1) spatial clustering with the Spatial ‘K’luster Analysis by Tree Edge Removal (SKATER) algorithm [12], which reduces dimensionality in a way that has interpretative value; (2) graph estimation with the HUGE package [13], which allows one to estimate a graph on a relevant exogenous or endogenous variable instead of assuming a known graph/neighborhood structure; (3) kernel graph regression (KGR), which produces a separable, interpretable spatiotemporal covariance that incorporates both spatial and temporal dependencies in a way that is not as computationally expensive to estimate (see [14–16]); and (4) approximate Bayesian inference with INLA, which provides an efficient alternative to MCMC that is about as accurate for estimating latent Gaussian models. The combination of these steps into a single framework has not been formalized before; therefore, it will be referred to henceforth as KGR-SKATER.

Although each component of the proposed procedure draws on existing methodologies, e.g., spatial clustering (SKATER), graphical model estimation, kernel-based Gaussian process regression, and approximate Bayesian inference via INLA, the novelty of the KGR-SKATER framework lies in the integration of these components into a unified modeling pipeline for high-dimensional spatiotemporal count data. In particular, the proposed framework introduces three methodological contributions. First, it combines spatial clustering with data-driven graph learning so that spatial dependence is inferred from covariates rather than imposed through geographic adjacency alone. Second, it introduces a graph-filtered Gaussian process prior constructed via the Kronecker product of a filtered graph Laplacian and a temporal kernel Gram matrix, producing a parsimonious, interpretable, and separable spatiotemporal covariance structure. Third, the framework demonstrates how these components can be estimated efficiently within the INLA framework, allowing flexible kernel-based spatiotemporal models to be fit without relying on computationally expensive MCMC methods.

This procedure allows for thoughtful estimation of a flexible dependence structure that can account for multiple relevant variables and their dependencies at different time scales. The spatial dependence is data-driven and is learned from one subspace of variables using the ‘skater’ and ‘huge’ packages. The temporal dependence is attributed to cross-temporal structure and is learned in a nonstationary fashion on another subspace of variables with kernel functions. These two different ways of measuring dependence are combined in a computationally efficient manner via a Kronecker product. KGR-SKATER incorporates all these steps and is still relatively straightforward to implement. However, there are many choices that must be considered carefully within each step, allowing a model to be customized for a given application study, as explained later in this paper.

Recent developments in spatiotemporal modeling have explored kernel-based approaches that exploit sparse precision matrices constructed from local neighborhood interactions. For example, the stochastic local interaction (SLI) models proposed by [17,18] construct sparse precision matrices for space-time interpolation using local interaction energies. While these models also leverage graph-based representations, they focus on constructing sparse precision structures directly, whereas the KGR-SKATER framework emphasizes learning a spatial graph from covariates and combining it with kernel-based temporal dependence through a Kronecker product covariance structure.

The KGR-SKATER methodology is applied to model respiratory-related mortality across California over six years, considering socioeconomic status measured by the Social Deprivation Index (SDI) as a spatial component and air quality variations as a temporal component. It is well established that these variables are associated with respiratory-related disorders and death; see [19–22]. More importantly, as shown later in Fig 2 in Section 7.1, SDI has significant spatial variation but varies slowly in time. In contrast, air pollution measurements have significant temporal variation but do not vary much across California. This enables KGR-SKATER to distinguish temporal nonstationarity driven by air quality fluctuations from spatial nonstationarity linked to socioeconomic conditions, facilitating a cleaner decomposition of spatial and temporal dependence structures in the model.

The remainder of the paper is structured as follows: in Section 2, notation is introduced and the steps to model spatial dependence are described. In Section 3, the proposed model is introduced, along with a description of kernels for modeling temporal dependence. In Section 4, a few benchmark reference models are presented. In Section 5, a brief review of INLA and a description of how it is used to expedite the estimation of the proposed model are provided. In Section 6, an outline of the simulations conducted to validate the KGR-SKATER framework is provided, in addition to some practical implementation considerations. In Section 7, the proposed modeling procedure is applied in full to model the number of respiratory-related deaths from 2015 to 2019 in California at the monthly county level. In Section 8, a discussion of the main results of the paper and an outlook for future research are laid out. Finally, in the supporting information appendices, some additional experiments and findings are included that may be of interest to those who want to understand the finer details of the KGR-SKATER framework.

2 Components comprising KGR-SKATER approach

The KGR-SKATER framework proceeds through six main stages that jointly construct a parsimonious spatiotemporal dependence structure.

  1. Spatial dimensionality reduction: given a spatiotemporal dataset, the spatial dimension is first reduced using the SKATER clustering algorithm, which partitions spatial units into clusters according to similarity in selected covariates.
  2. Construction of cluster-level surrogate variables: the response variables and covariates are then aggregated within each cluster to produce surrogate variables representing cluster-level spatiotemporal observations, thereby replacing the original unit-level measurements with a lower-dimensional representation.
  3. Learning a spatial dependence graph: a subset of surrogate covariates is selected to characterize spatial dependence, and an undirected graph is estimated over the cluster-level data obtained in step (2). Each cluster is represented by a graph vertex located at the cluster centroid. The spatial dependence structure is learned using graphical LASSO within the HUGE framework, producing a sparse graphical representation of conditional dependencies between clusters.
  4. Spectral graph filtering: the estimated adjacency matrix is transformed into a graph Laplacian, after which a low-pass spectral filter is applied. The resulting filtered Laplacian is constructed following the graph Laplacian filtering framework of [11].
  5. Construction of the spatiotemporal covariance operator: the filtered spatial operator is combined with a temporal Gram matrix K constructed from surrogate time-series covariates. This combination is achieved through a Kronecker product, yielding a separable spatiotemporal covariance matrix that defines the dependence structure of a Gaussian process (GP) component within the proposed log-Gaussian Cox process (LGCP) model defined over the graph vertices.
  6. Model estimation: the resulting LGCP model is estimated using Integrated Nested Laplace Approximation (INLA), which enables efficient approximate Bayesian inference for latent Gaussian models.

Fig 1 summarizes the full modeling pipeline. The procedure begins with spatial clustering using SKATER to reduce dimensionality and identify regions with similar socioeconomic characteristics. Cluster-level surrogate variables are then constructed and used to estimate a spatial dependence graph via graphical LASSO. This graph is transformed into a filtered Laplacian that encodes spatial smoothness. Temporal dependencies are modeled using kernel regression based on surrogate time series covariates. Finally, the spatial and temporal structures are combined via a Kronecker product to form a separable spatiotemporal covariance matrix used in a latent Gaussian Cox process model estimated via INLA.

Fig 1. Unified KGR-SKATER workflow.

County-level socioeconomic, pollution, and mortality data are first transformed into cluster-level surrogate representations. The spatial dependence structure is learned from cluster-level SDI variables using graphical model estimation and spectral graph filtering to obtain the filtered Laplacian, while temporal dependence is learned from cluster-level pollution covariates through a kernel Gram matrix K. These structured components are combined within the KGR-SKATER model and estimated using INLA, followed by hyperparameter learning, model selection, and out-of-sample forecasting.

https://doi.org/10.1371/journal.pone.0348787.g001

2.1 Notation

Consider a data matrix Y comprising a spatiotemporal dataset with N spatial units over T time steps. Each spatial unit has associated latitudinal and longitudinal angles. Additionally, time series regression covariates are observed at each location.

Spatial coordinates are defined using latitude and longitude values. Distances between spatial units are computed using great-circle distances derived from geographic coordinates, ensuring that spatial relationships reflect the curvature of the Earth’s surface.
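The great-circle computation can be illustrated with a minimal sketch. The haversine formula and the mean Earth radius of 6371 km used below are assumptions made for illustration; the text does not state which great-circle formula or radius was used, and the function names are hypothetical.

```python
import math

# Mean Earth radius in kilometres: an assumed constant, not taken from the paper.
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the valid asin domain.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


def distance_matrix(coords):
    """Pairwise great-circle distances for a list of (lat, lon) pairs."""
    n = len(coords)
    return [[great_circle_km(*coords[i], *coords[j]) for j in range(n)] for i in range(n)]


# Illustrative coordinates only (roughly Los Angeles, San Francisco, Sacramento).
coords = [(34.05, -118.24), (37.77, -122.42), (38.58, -121.49)]
D = distance_matrix(coords)
```

Such a distance matrix could serve as the input to a distance-threshold neighborhood structure of the kind mentioned in Section 2.2.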

It is assumed that the process being modeled is sufficiently high-dimensional that it is beneficial to reduce the dimensionality of the problem by mapping the spatial problem to a graph topology comprising C total units, in which each graph vertex represents the centroid of a spatial cluster. The observation matrix is then transformed by aggregating the counts within each cluster, and the new counts make up an aggregated count data matrix. Similarly, in each spatial cluster c, a set of surrogate variables is constructed for the time series regression covariates. Finally, random effects are incorporated in the modeling through a spatiotemporal random effect matrix of latent variables denoted by F.

The graph structure defining spatial dependence between cluster centroids is denoted by G = (V, E), consisting of C vertices V and an edge set E connecting vertex pairs. There is an edge connecting vertices vi and vj if the centroids for clusters i and j are associated. The edge relationships are encoded in the C × C adjacency matrix A, such that a non-zero entry in row i and column j represents the presence of an undirected edge between nodes i and j; because the graph is undirected, A is symmetric.

For the kernel regression structures used in the KGR-SKATER model approach, let be the Reproducing Kernel Hilbert Space (RKHS) kernel characterizing the covariance operator introduced in the paper. There will be several kernel structures explored, but one choice that will be a core component that is used throughout the manuscript will be the locally periodic kernel:

If a kernel matrix is constructed using the kernel function, it will be denoted by the kernel Gram matrix K, capturing the temporal dependence.

In the kernel definition, the variables x and x′ represent vectors of surrogate covariates rather than spatial coordinates. The Euclidean distance therefore measures similarity between covariate vectors rather than geographic proximity. Prior to kernel construction, the covariates are standardized to ensure that the distance metric is not dominated by variables with larger numerical scales.
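As an illustration of how a Gram matrix can be built from standardized covariate vectors, the following sketch uses one commonly cited form of the locally periodic kernel: a periodic kernel multiplied by a radial basis function kernel, both evaluated on the Euclidean distance between covariate vectors. The exact parameterization used in this paper may differ; the hyperparameter names (`variance`, `lengthscale`, `period`), the shared length scale, and the toy covariate values are assumptions.

```python
import math


def standardize(columns):
    """Scale each covariate series to zero mean and unit variance.

    `columns` holds one list per covariate; the result holds one row per time point.
    """
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        scaled.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled)]


def locally_periodic(x, x_prime, variance=1.0, lengthscale=1.0, period=1.0):
    """Assumed form: variance * periodic(d) * rbf(d), with d the Euclidean distance."""
    d = math.dist(x, x_prime)
    periodic = math.exp(-2.0 * math.sin(math.pi * d / period) ** 2 / lengthscale ** 2)
    rbf = math.exp(-d ** 2 / (2.0 * lengthscale ** 2))
    return variance * periodic * rbf


def gram_matrix(rows, **hyper):
    """T x T Gram matrix K with K[s][t] = k(x_s, x_t)."""
    return [[locally_periodic(a, b, **hyper) for b in rows] for a in rows]


# Toy surrogate covariates for four time points (values are illustrative only).
pm25 = [8.0, 12.0, 15.0, 9.0]
ozone = [0.030, 0.045, 0.050, 0.032]
rows = standardize([pm25, ozone])
K = gram_matrix(rows, period=2.0)
```

Standardizing first means that neither covariate dominates the distance simply because of its units.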

2.2 Spatial clustering with SKATER

As noted in Section 2.1, the data are assumed to be sufficiently high-dimensional that it is beneficial to reduce the dimensionality of the problem by mapping the spatial problem to a graph topology comprising C total units, in which each graph vertex represents the centroid of a spatial cluster.

The Spatial ‘K’luster Analysis by Tree Edge Removal (SKATER) algorithm is a clustering technique designed to partition spatially structured data into homogeneous regions. It achieves this by leveraging graph theory and spatial adjacency relationships, making it particularly useful in geographical and spatial analysis. The SKATER algorithm uses a minimum spanning tree (MST) to identify clusters by progressively removing edges from the tree. The MST ensures that spatially adjacent regions are connected in a way that minimizes a specified dissimilarity measure. By removing edges with high dissimilarity, the algorithm isolates groups of nodes (spatial units) into distinct clusters.

There are three required inputs:

  • Spatial Units: spatial regions (e.g., neighborhoods, districts, counties) to be clustered.
  • Attribute Data: variables associated with each spatial unit (e.g., income, population density, environmental variables).
  • Spatial Adjacency Matrix: neighborhood structure to identify which spatial units are neighbors, typically constructed using either direct contiguity relationships or via a distance threshold, which can be a weighted adjacency or a binary graph adjacency.

The objective is then to partition a graph into C disjoint spatial clusters whose union is the full set of spatial units and each of which is a connected subgraph. To understand how SKATER works, one starts with a minimum spanning tree as defined in [12]. The MST is built using the spatial adjacency matrix and a selected dissimilarity measure. In this tree, the nodes are still spatial units and the edges are adjacency relationships weighted by the dissimilarity between nodes. The MST is then a subgraph that connects all nodes, has no cycles, and minimizes the total edge weight. There are numerous algorithms that can be used to construct the MST; see [23,24].
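The MST construction can be sketched as follows. Kruskal's algorithm with a union-find structure is used purely for illustration; the text does not state which MST algorithm is used in practice. The Euclidean attribute distance as the dissimilarity measure, the function names, and the toy data are likewise assumptions.

```python
import math


def dissimilarity_edges(adjacent_pairs, attributes):
    """Weight each adjacency edge by the Euclidean distance between unit attributes."""
    return [(math.dist(attributes[u], attributes[v]), u, v) for u, v in adjacent_pairs]


def minimum_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm; `weighted_edges` is a list of (weight, u, v) tuples."""
    parent = list(range(n_nodes))

    def find(u):
        # Follow parent pointers to the component root, compressing the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it cannot create a cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


# Four spatial units with one attribute each and five adjacency relationships.
attributes = [[0.0], [0.1], [5.0], [5.2]]
pairs = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
tree = minimum_spanning_tree(4, dissimilarity_edges(pairs, attributes))
```

In this toy example the tree keeps the two low-dissimilarity edges (0, 1) and (2, 3) and joins them through the cheapest remaining edge (1, 2).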

Having obtained the MST, the pruning exercise of SKATER involves iterative edge removal, which is performed in order to create the desired number of clusters. Each iteration of the pruning process involves identifying which edge to remove based on the edge with the highest weight (greatest dissimilarity) that still maintains the desired number of clusters. After removing an edge, the MST is then subsequently split into two disconnected subgraphs, each representing a cluster. At successive iterations, these partitions are iteratively refined on each of the previous iterations’ subgraphs until a stopping criterion is satisfied. In this case, the process stops once the k subgraphs, i.e., C clusters, are created.

One can formalize this method of iterative pruning of the MST by solving at each iteration step the optimization program:

$$\min_{\Pi} Q(\Pi), \qquad Q(\Pi) = \sum_{i=1}^{k} \mathrm{SSD}_i$$

where $\Pi$ is a partition of objects into k sub-trees, $Q(\Pi)$ is the cost associated with the quality of a partition, and $\mathrm{SSD}_i$ is the sum of squares deviation in region i. For tree k, this is given by

$$\mathrm{SSD}_k = \sum_{j=1}^{m} \sum_{i=1}^{n_k} \left(x_{ij} - \bar{x}_j\right)^2$$

where $n_k$ is the number of spatial objects in tree k, $x_{ij}$ is the j-th attribute of spatial object i, m is the number of attributes being used, and $\bar{x}_j$ is the average of the j-th attribute for all objects in tree k.

Therefore, formally, SKATER starts with a graph G* representing a minimum spanning tree $T_0$. At each iteration, it identifies the best arrangement, which is produced by removing the edge e from a tree T that creates two disjoint trees $T_a$ and $T_b$ with the largest reduction in the sum of squares deviation (SSD) of the objects in each tree:

$$\mathrm{SSD}_T - \left(\mathrm{SSD}_{T_a} + \mathrm{SSD}_{T_b}\right)$$

The implementation of SKATER optimization utilized in this work is based on the SKATER algorithm proposed by Assunção et al., as implemented in R [12,25]. This function can include one of two clustering constraints: a minimum population constraint with respect to one of the features in the spatial dataframe, or a minimum (or maximum, or both) number of units assigned to each cluster.
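The pruning step described above can be sketched in a few lines. This is a simplified illustration, not the R implementation referenced in the text: it assumes that the edge to remove is the one giving the largest reduction in total within-tree sum of squares deviation, it evaluates every candidate edge exhaustively, and it ignores the population and cluster-size constraints. All function names are hypothetical.

```python
def ssd(members, attributes):
    """Sum of squares deviation of the attributes within one tree."""
    total = 0.0
    for j in range(len(attributes[0])):
        mean_j = sum(attributes[i][j] for i in members) / len(members)
        total += sum((attributes[i][j] - mean_j) ** 2 for i in members)
    return total


def components(n_nodes, edges):
    """Connected components of the forest, each returned as a sorted node list."""
    adj = {u: set() for u in range(n_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u] - seen:
                seen.add(v)
                stack.append(v)
        comps.append(sorted(comp))
    return comps


def skater_prune(n_nodes, tree_edges, attributes, n_clusters):
    """Remove MST edges one at a time until `n_clusters` sub-trees remain."""
    if n_clusters > n_nodes:
        raise ValueError("cannot create more clusters than spatial units")
    edges = list(tree_edges)
    while len(components(n_nodes, edges)) < n_clusters:
        current = sum(ssd(c, attributes) for c in components(n_nodes, edges))
        best_edge, best_gain = None, -1.0
        for e in edges:
            remaining = [x for x in edges if x != e]
            after = sum(ssd(c, attributes) for c in components(n_nodes, remaining))
            # Only the tree containing e changes, so this equals SSD_T - (SSD_Ta + SSD_Tb).
            if current - after > best_gain:
                best_edge, best_gain = e, current - after
        edges.remove(best_edge)
    return components(n_nodes, edges)


# A path-shaped MST over four units with one attribute each.
attributes = [[0.0], [0.1], [5.0], [5.2]]
clusters = skater_prune(4, [(0, 1), (1, 2), (2, 3)], attributes, n_clusters=2)
```

Cutting the edge (1, 2) separates the two low-valued units from the two high-valued units, which is the split with the smallest remaining within-cluster deviation.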

2.3 Producing surrogate spatial cluster covariates

Once the clustering of the spatial units is identified, the data must be aggregated to the same resolution, i.e., from the original spatial units to the less granular cluster spatial units. This is achieved by constructing surrogate variables for each cluster unit at each time point.

The approach adopted is based on the method outlined in [26], where each covariate is aggregated from the original spatial unit to the cluster spatial unit via a population weighted mean (see Eq 1). Furthermore, the weights can be time invariant or changed according to a user specified window period, such as monthly, seasonally, annually, or upon each census.

In each spatial cluster, a set of surrogate variables will be constructed as follows for time series regression covariates in cluster c given by

(1)

where wc represents the weight for cluster c. The choice of weighting mechanism for the construction of the cluster surrogate variables will be described in more detail in the following sections. These surrogate variables will then form the graph regression covariates that will be carried into the graph and kernel estimation steps to characterize a spatiotemporal dependence structure at the cluster spatial unit.

The graph regression response variable per cluster unit must also be obtained, and this can be done in different ways depending on the context of the study and specified modeling assumptions. In the applications demonstrated, the observation matrix will be transformed by aggregating the counts in each cluster to new counts as follows for each time point

which will then make up an aggregated count data matrix of dimension denoted by .
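The two aggregation steps of this section can be illustrated with a minimal sketch: covariates are aggregated by a population weighted mean and counts are summed within each cluster. Time-invariant population weights are assumed here for simplicity, and the function names, argument layout, and toy values are hypothetical.

```python
def population_weighted_mean(values, populations):
    """Weighted mean of unit-level values using population weights."""
    return sum(v * p for v, p in zip(values, populations)) / sum(populations)


def aggregate_to_clusters(cluster_of, covariate, counts, population):
    """Aggregate unit-level series to cluster level.

    cluster_of[i] : cluster label of spatial unit i
    covariate[i][t], counts[i][t] : unit-level series over T time points
    population[i] : time-invariant weight of unit i (an assumption of this sketch)
    """
    n_times = len(counts[0])
    surrogate, cluster_counts = {}, {}
    for c in sorted(set(cluster_of)):
        members = [i for i, label in enumerate(cluster_of) if label == c]
        pops = [population[i] for i in members]
        surrogate[c] = [
            population_weighted_mean([covariate[i][t] for i in members], pops)
            for t in range(n_times)
        ]
        cluster_counts[c] = [sum(counts[i][t] for i in members) for t in range(n_times)]
    return surrogate, cluster_counts


# Three spatial units, two time points, two clusters.
cluster_of = [0, 0, 1]
covariate = [[10.0, 20.0], [30.0, 40.0], [5.0, 6.0]]
counts = [[1, 2], [3, 4], [5, 6]]
population = [100, 300, 50]
surrogate, cluster_counts = aggregate_to_clusters(cluster_of, covariate, counts, population)
```

Allowing the weights to change by month, season, or census would amount to passing a time-indexed population array instead of a single vector.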

2.4 Learning spatial graphical dependence from demographic variables

Given the cluster partitions and the surrogate variables, one must then construct a spatial dependence relationship that will be utilized in the subsequent KGR-SKATER modeling framework. This involves learning a graph over the C spatial cluster regions to obtain a C × C adjacency matrix. This graph can be estimated on any relevant endogenous or exogenous variable, which may be more insightful than the observed neighborhood structure in some cases. Note that this also demonstrates an advantage of applying the spatial clustering: the spatial dependence graph is reduced from N vertices to C, where typically C ≪ N.

In order to estimate the spatial graph, the graphical lasso package huge (high-dimensional undirected graph estimation) by [27] is utilized to learn the associations between the spatial cluster units. In this work, the spatial dependence structure represents spatial associations between a set of surrogate variables representing demographic factors from an index of multiple deprivation. However, any set of surrogate variables at the cluster spatial unit can be utilized to learn the spatial dependence in practice.

The spatial dependence learning of the graphical adjacency matrix is performed via a Gaussian graphical lasso formulation. The graphical lasso treats graph estimation as a sparse penalized maximum likelihood estimation problem, with an L1 penalty applied when estimating the precision matrix (inverse of the covariance matrix) in order to induce sparsity:

$$\hat{\Theta} = \arg\min_{\Theta \succ 0} \left\{ \operatorname{tr}(S\Theta) - \log\det\Theta + \lambda \lVert \Theta \rVert_1 \right\}$$

where $\Theta$ is the real positive definite precision matrix, S is the empirical covariance matrix, $\operatorname{tr}(\cdot)$ denotes the trace, $\lambda$ is the regularization parameter, and $\lVert \cdot \rVert_1$ denotes the L1 norm.

The solution to this precision matrix estimation, denoted $\hat{\Theta}$, has the interpretation that if $\hat{\Theta}_{ij} = 0$, then the ith and jth variables are conditionally independent given all other variables. As the penalty term increases, both the bias in the precision matrix and the number of zero elements in the precision matrix increase. The penalty therefore provides a trade-off between bias and sparsity of the precision matrix, where sparsity determines the connectedness of the cluster spatial units in terms of the target variable upon which $\hat{\Theta}$ is estimated.

The package huge estimates multiple graphs, from which an optimal one must be identified via a model selection criterion. Based on simulations, the extended Bayesian information criterion (EBIC) was chosen. EBIC, proposed by [28], is given by

$$\mathrm{EBIC}_{\gamma}(E) = -2\,\ell\big(\hat{\Theta}(E)\big) + |E| \log T + 4\gamma |E| \log p$$

where $\ell$ is the log likelihood, $\hat{\Theta}(E)$ is the precision matrix estimate with edge set E (i.e., non-zero entries corresponding to the graph with edge set E), |E| is the number of non-zero parameters (i.e., the cardinality of the edge set), T is the number of observations, p is the number of variables used in the sample covariance matrix for the graphical lasso estimation, and $\gamma$ is a tuning parameter that adjusts the penalty on the model complexity.
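To make the selection step concrete, the sketch below evaluates EBIC in the form commonly used for Gaussian graphical models, namely minus twice the log likelihood plus |E| log T plus 4γ|E| log p, and picks the candidate graph with the smallest value. The candidate log likelihoods and edge counts are invented for illustration, and the function names are hypothetical; in practice these quantities would come from the fitted regularization path.

```python
import math


def ebic(log_likelihood, n_edges, n_obs, n_vars, gamma=0.5):
    """EBIC = -2 * loglik + |E| * log(T) + 4 * gamma * |E| * log(p)."""
    return (-2.0 * log_likelihood
            + n_edges * math.log(n_obs)
            + 4.0 * gamma * n_edges * math.log(n_vars))


def select_graph(candidates, n_obs, n_vars, gamma=0.5):
    """Return the candidate (label, log_likelihood, n_edges) with the smallest EBIC."""
    return min(candidates, key=lambda c: ebic(c[1], c[2], n_obs, n_vars, gamma))


# Hypothetical regularization path: denser graphs fit better but are penalized more.
candidates = [
    ("sparse", -520.0, 4),
    ("medium", -495.0, 9),
    ("dense", -490.0, 25),
]
best = select_graph(candidates, n_obs=60, n_vars=10, gamma=0.5)
```

Setting `gamma=0` recovers the ordinary BIC, while larger values favor sparser graphs.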

2.5 Constructing a spatial-temporal graph regression factor prior

This section explains how to construct a latent graphical factor model that incorporates the statistical spatial dependence from the graph learned at the units of the spatial cluster with a multiple-output time series kernel based Gaussian Process factor model. The Gaussian process filters will be smoothed according to the topology of the estimated spatial graph via a form of graph filter that uses a Graph Fourier Transform (GFT) to smooth high frequency content from the factors according to the edge relations obtained spatially.

2.5.1 Calculating a graph filter via graph Fourier transform.

The estimated spatial graph structure with associated adjacency matrix A, where vertices represent the cluster spatial units and edges their spatial association according to a specified set of surrogate covariates, is used to construct a GFT given by the spectral decomposition of the graph Laplacian L with degree matrix D and adjacency matrix A:

$$L = D - A = U \Lambda U^{\top}$$

for an orthonormal basis matrix U and spectral frequencies for variation over the graph denoted by a diagonal matrix $\Lambda$ of decreasing eigenvalues corresponding to the eigenvectors in the columns of U.

Then a Laplacian spectral filtered graph is obtained by modifying these spectral frequencies using a decreasing function , such that

where for each spectral frequency . Numerous possibilities exist for such as the inverse, exponential, or ReLu filter (see discussion in [11]). In this work, a low pass filter was used that sets the filter as follows, for a given probability threshold :

To understand the effect of the filtering applied to the graph Laplacian on a set of graph valued covariates or graph latent factor processes, consider the Dirichlet energy for a graph valued process given by

When the signal is smooth across the graph, the Dirichlet energy is close to zero and exactly zero for a constant signal across the graph vertices. Thus, the graph filter smooths the Laplacian such that for all at all times t one obtains smoother signals over the graph satisfying:
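The smoothing effect can be checked numerically on a small graph. The sketch below builds the Laplacian L = D − A, computes its spectral decomposition with a plain cyclic Jacobi routine (chosen only to keep the example dependency-free), applies an exponential filter exp(−βλ) to the eigenvalues, and compares Dirichlet energies f'Lf before and after filtering. The exponential filter is one of the options mentioned above and is not the thresholded low-pass filter used in this work; `beta`, the path graph, and the test signal are assumptions.

```python
import math


def laplacian(adjacency):
    """Combinatorial graph Laplacian L = D - A."""
    n = len(adjacency)
    return [[(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
            for i in range(n)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dirichlet_energy(L, f):
    """Quadratic form f' L f, which is zero for a constant signal."""
    return sum(a * b for a, b in zip(f, matvec(L, f)))


def jacobi_eigh(M, sweeps=100, tol=1e-22):
    """Eigenvalues and eigenvectors (columns of V) of a small symmetric matrix."""
    n = len(M)
    A = [[float(x) for x in row] for row in M]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(A[p][q] ** 2 for p in range(n) for q in range(p + 1, n)) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                if A[p][p] == A[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # column rotation: A <- A J and V <- V J
                    A[k][p], A[k][q] = c * A[k][p] - s * A[k][q], s * A[k][p] + c * A[k][q]
                    V[k][p], V[k][q] = c * V[k][p] - s * V[k][q], s * V[k][p] + c * V[k][q]
                for k in range(n):  # row rotation: A <- J' A, which zeroes A[p][q]
                    A[p][k], A[q][k] = c * A[p][k] - s * A[q][k], s * A[p][k] + c * A[q][k]
    return [A[i][i] for i in range(n)], V


def exponential_filter(L, beta=1.0):
    """Filtered operator U g(Lambda) U' with g(lam) = exp(-beta * lam)."""
    eigvals, U = jacobi_eigh(L)
    g = [math.exp(-beta * max(lam, 0.0)) for lam in eigvals]
    n = len(L)
    return [[sum(U[i][k] * g[k] * U[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Path graph on four cluster vertices and a rapidly oscillating signal.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
L = laplacian(A)
H = exponential_filter(L, beta=1.0)
rough = [1.0, -1.0, 1.0, -1.0]
smooth = matvec(H, rough)
```

Because the filter leaves the zero-frequency (constant) component untouched and shrinks every other component, the filtered signal has lower Dirichlet energy than the original.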

This spectrally filtered graph Laplacian can be used to construct a smooth prior for a latent process over the graph that can be used in the kernel graph regression within the proposed KGR-SKATER methodology. This builds on the framework proposed in [11,15] to develop the graph regression multiple output Gaussian process prior. The latent factor multiple output Graphical Gaussian process is then given by the filtered latent signal

This can then be used to construct a latent factor Gaussian process prior given generically by

(2)

Here, the filtered graph Laplacian acts as a spatial covariance matrix, which encodes the assumption that signals observed over a graph are likely to be smooth with respect to the underlying topology, and the Gram matrix K encodes temporal and time series regression dependence using the desired surrogate time series covariates.
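The separable structure can be illustrated with a small Kronecker product. In this sketch, `S` stands in for the C × C spatial matrix obtained from the filtered graph Laplacian and `K` for the T × T temporal Gram matrix; both contain made-up values. The ordering of the product (space first, then time) is an assumption, since the opposite ordering merely permutes rows and columns.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


# Hypothetical 2 x 2 spatial matrix (two clusters) and 3 x 3 temporal Gram matrix.
S = [[1.0, 0.5],
     [0.5, 1.0]]
K = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.5],
     [0.2, 0.5, 1.0]]

cov = kron(S, K)  # 6 x 6 covariance over (cluster, time) pairs


def entry(c, t, c2, t2, n_times=3):
    """Covariance between (cluster c, time t) and (cluster c2, time t2)."""
    return cov[c * n_times + t][c2 * n_times + t2]
```

Separability means that every entry factorizes as `S[c][c2] * K[t][t2]`, so only C² + T² quantities need to be specified rather than (CT)².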

3 Proposed kernel graph regression spatial-temporal model: KGR-SKATER

Given observed surrogate covariate process over cluster spatial units constructed via Eq 1, latent Gaussian process factors specified according to prior structure in Eq 2, and observation process for spatial clusters , the KGR-SKATER proposed model is given observation data matrix by:

with latent intensity model parameters for C spatial clusters and for m temporal periods, and where prior kernel hyperparameters are given generically by with the vector of kernel parameters given for instance in the case of an radial basis function kernel by: . A table of kernel and hyper parameter vectors is provided in Table 1. The kernels used in this work take the form of either an additive or multiplicative product:

(3)

The models that are developed under this structure are known as either additive (aKGR) or multiplicative (mKGR) Kernel Graph Regression models. The interest in this proposed model structure explicitly lies in the fact that the spatial dependence, as captured by the graph Laplacian can be developed based on endogenous covariates of relevance to the application that can be used to specify directly the spatial dependence structure in the graph regression. In this manuscript, the spatial graph structure as specified by the graph filter Laplacian is learned from the surrogate variables related to the index of deprivation. Then, this spatial dependence is combined with a regression time series kernel prior regression structure via a graph product operator, which, in this work, corresponds to a Cartesian product. Monthly and cluster fixed effects are included to translate the estimated intensity function to a corresponding region and point in time. These fixed effects do not have to be included, but their inclusion leads to much better estimates.
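The distinction between the additive and multiplicative structures can be sketched on Gram matrices. The example assumes that Eq 3 combines component kernels by an entrywise sum (aKGR) or an entrywise product (mKGR); the two component kernels, their hyperparameter values, and the function names are chosen only for illustration.

```python
import math


def rbf(t, s, lengthscale=2.0):
    """Radial basis function kernel on a time index."""
    return math.exp(-(t - s) ** 2 / (2.0 * lengthscale ** 2))


def periodic(t, s, period=12.0, lengthscale=1.0):
    """Periodic kernel on a time index, here with a 12-month period."""
    return math.exp(-2.0 * math.sin(math.pi * abs(t - s) / period) ** 2 / lengthscale ** 2)


def gram(kernel, times):
    return [[kernel(t, s) for s in times] for t in times]


def combine(K1, K2, mode="additive"):
    """Entrywise sum (additive) or entrywise product (multiplicative) of Gram matrices."""
    if mode == "additive":
        return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]
    if mode == "multiplicative":
        return [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]
    raise ValueError("mode must be 'additive' or 'multiplicative'")


months = list(range(24))
K_rbf, K_per = gram(rbf, months), gram(periodic, months)
K_add = combine(K_rbf, K_per, mode="additive")
K_mul = combine(K_rbf, K_per, mode="multiplicative")
```

In the additive case, two months exactly one period apart remain correlated through the periodic component even when the RBF component has decayed, whereas in the multiplicative case the correlation vanishes once either component does.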

The simulation study section illustrates how the proposed model’s specification of spatiotemporal fixed effects and a more nuanced spatiotemporal covariance matrix allows it to produce better fits than the reference models presented in Section 4, especially when the periodicity of the time series being modeled is irregular. In the application study section, a variety of KGR-SKATER models will be fit on a real dataset and they are able to produce similar fits to those of the reference models, but with better coverage.

3.1 Mixture kernel components for temporal and time series kernel regression priors

There are two types of kernel structure that will be considered when designing the autocorrelation and regression structure for the factors at each vertex of the graph. The first is a temporal dependence kernel, which depends only on a function of time; the second is a time series kernel. In the case of the time series kernel, a multivariate time series of exogenous regression variables, which can be specified according to or independent of the graph vertices, is used to specify a conditional correlation dynamic. This is analogous to a type of Gaussian Process factor Dynamic Conditional Correlation (DCC) model expressed on a graph. A dynamic conditional correlation specification is useful when the temporal dependence of a process changes over time rather than remaining stationary. To account for this, nonstationary kernels, such as those used in [29–31], are employed to capture nonstationarity in the process being modeled by allowing the covariance function to change conditionally on observable regression time-series covariates, as is explained in the model specification below. One may think of this as a type of GP model with a conditionally deterministic covariance function (like a GARCHX model in classical time-series). Therefore, the GP model used in this work is not actually stationary. It is the hyperparameters of the covariance function that are stationary, which is distinct from the GP model itself being stationary. One may interpret this approach as aligned with elastic methods; see [31]. Such elastic measures account for time distortions (e.g., shifts) or different lengths between sequences. Related methods for GP design also include [32], the Global Alignment Kernel (GAK), see [33], and the KDTW approach of [34], which are based upon the Dynamic Time Warping (DTW) method of [35].

3.1.1 Damped period time series kernels.

The most basic form of the temporal kernel considered is the damped period kernel given by the product of a Radial Basis Function (RBF) kernel and a damped periodic kernel:

In a generalized form, this kernel can also be specified as a time series kernel whose inputs are exogenous regression variables obtained from the surrogate time series, so that both time and surrogate covariate time series can enter the kernel.
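As a rough illustration of the product structure described above, the following Python sketch builds a Gram matrix from an RBF kernel multiplied by a periodic kernel. It uses the standard textbook forms of both kernels and is not a reproduction of the exact parameterization used in this paper (whose implementation is in R). The function names and the hyperparameters sigma2, ell_rbf, ell_per and period, together with their default values, are assumptions made for illustration only.

```python
import math


def rbf_kernel(t, s, ell_rbf):
    """Standard RBF kernel: exp(-(t - s)^2 / (2 * ell_rbf^2))."""
    return math.exp(-((t - s) ** 2) / (2.0 * ell_rbf ** 2))


def periodic_kernel(t, s, ell_per, period):
    """Standard periodic kernel: exp(-2 * sin^2(pi * |t - s| / period) / ell_per^2)."""
    return math.exp(-2.0 * math.sin(math.pi * abs(t - s) / period) ** 2 / ell_per ** 2)


def locally_periodic_kernel(t, s, sigma2=1.0, ell_rbf=24.0, ell_per=1.0, period=12.0):
    """Product of the two kernels: periodic correlation that decays as |t - s| grows."""
    return sigma2 * rbf_kernel(t, s, ell_rbf) * periodic_kernel(t, s, ell_per, period)


def gram_matrix(kernel, inputs):
    """Evaluate a kernel on all pairs of inputs to form the Gram matrix."""
    return [[kernel(a, b) for b in inputs] for a in inputs]


# Monthly time index for five years (T = 60); hypothetical hyperparameter defaults.
months = list(range(1, 61))
K = gram_matrix(locally_periodic_kernel, months)
```

Passing a surrogate covariate series instead of the time index as the inputs would give a time series kernel in the same spirit, under the same assumptions.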

3.1.2 Locally banded and distributed lag time series kernels.

Additional kernel structures that were considered were locally banded inhomogeneous kernel structures, where at each vertex of the spatial graph , the p-banded Kernel matrix structure is considered on the d-dimensional transformed surrogate variables denoted by where the transformed variables are obtained as the residual of the surrogate variable after removing a structural decomposition of each vertex time series into trend , seasonal and residual given by

This decomposed residual covariate was then used in a local kernel of order p given by

(4)

to capture residual autocorrelation structure not captured by the kernels k(lp). This form of local-p kernel produces a banded but dynamic kernel matrix that can be incorporated with other kernels as per Eq 3. This localized kernel structure also allows for the development of inhomogeneous banded lag structures in a localized kernel that can be designed for a desired lag pattern. By specifying important lags given by , a banded kernel at those distributed lags is given by

(5)

Such localized kernels as will produce a kernel matrix with a lagged banded structure that can be incorporated with other kernels as per Eq 3.

3.1.3 Instantaneous and lagged covariate interaction time series kernels.

This last class of time series kernels is designed to introduce non-linear interaction effects from the surrogate time series covariates into the construction of the graph factor in the KGR-SKATER model family. These interaction kernels produce interactions between different time series regression covariates at lags, which for the i-th and j-th surrogate covariates are modeled by the kernel structure:

(6)

3.2 Families of KGR-SKATER models

This section summarizes the KGR-SKATER model subfamilies obtained by combining various kernel structures into a mixture kernel.

and are the most basic KGR-SKATER models. They use a single kernel, which produces a kernel Gram matrix (K) of dimension , to represent temporal dependence, albeit in different ways. These kernels, along with the three kernels presented in Section 3.1, can be combined with either element-wise multiplication or addition to get a more nuanced temporal dependence characterization. These lead to model subfamilies , , and . Intuitively, multiplying two kernels together will produce a kernel that has high values when both of the two base kernels have high values, while adding two kernels together will produce a kernel that has high values when either one or both of the base kernels have high values. This will give different structural relations between the time series regression covariates in the kernel Gram matrix construction.

Each of these mixture kernels produces a kernel Gram matrix that is of dimension because element-wise operations are used. In this paper, uniform weights () in the multiplicative mixtures and () for the additive mixtures are considered. Once again, these different kernels lead to different temporal dependence characterizations which, when combined with the graph filter , produce different spatiotemporal dependence structure representations.
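The intuition about multiplicative versus additive mixtures can be checked numerically. The Python sketch below combines two toy Gram matrices element-wise. The base kernels, their hyperparameters, and the additive weights of 0.5 are assumptions chosen for illustration and are not the settings used in this paper.

```python
import math


def gram_matrix(kernel, inputs):
    """Evaluate a kernel on all pairs of inputs."""
    return [[kernel(a, b) for b in inputs] for a in inputs]


def rbf(t, s, ell=3.0):
    """RBF kernel with a hypothetical length scale."""
    return math.exp(-((t - s) ** 2) / (2.0 * ell ** 2))


def periodic(t, s, period=4.0):
    """Periodic kernel with a hypothetical period and unit length scale."""
    return math.exp(-2.0 * math.sin(math.pi * abs(t - s) / period) ** 2)


def multiplicative_mixture(A, B):
    """Element-wise (Hadamard) product: large only where both kernels are large."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def additive_mixture(A, B, w_a=0.5, w_b=0.5):
    """Element-wise weighted sum: large where either kernel is large."""
    return [[w_a * a + w_b * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


times = list(range(8))
K_rbf = gram_matrix(rbf, times)
K_per = gram_matrix(periodic, times)
K_mult = multiplicative_mixture(K_rbf, K_per)
K_add = additive_mixture(K_rbf, K_per)
```

At a lag of half a period (entry [0][2] here), the RBF kernel is still high while the periodic kernel is low, so the product is small and the sum is moderate. Both mixtures keep the same dimension as the base Gram matrices.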

Although the locally periodic kernel provides a flexible mechanism for capturing quasi-periodic temporal dynamics, other kernel families could also be considered. For example, the Matérn kernel family introduces a smoothness parameter that allows the covariance function to interpolate between exponential and Gaussian kernels ([36]). Such kernels may provide additional flexibility in capturing varying degrees of temporal smoothness. Similarly, non-separable spatiotemporal kernels such as those proposed by [37] based on harmonic oscillator models could offer an alternative to the separable covariance structure adopted in the present study.

4 Benchmark reference models

This section presents benchmark reference models that will be compared with those produced by the proposed modeling framework. These reference models were chosen because they all have a GP component like KGR-SKATER models and build upon each other, culminating in an LGCP. A key component is a latent factor GP model that serves as the foundation for many popular models, including latent Gaussian models, which are used to model non-Gaussian responses. In spatial (or spatiotemporal) data settings, they are commonly used as random effects, like in a spatial GLMM, or to model the intensity function of a stochastic process, like in a Log Gaussian Cox Process (LGCP).

The first reference model is a spatial GLMM, which is used to model spatially dependent non-Gaussian variables in geostatistical contexts [5]. This reference model assumes the observed data conditionally follows a Poisson distribution and the intensity process at each cluster region can be modeled using a mixed effects model with a log link.

Reference model 1: Poisson GLMM ()

The reference model contains an intercept for each cluster, monthly fixed effects, and a random effect for each location which is estimated as a spatial random process that is assumed to be an unobservable stationary Gaussian random vector with spatial correlation captured by a homoskedastic covariance matrix . As a baseline, a diagonal covariance matrix for is used to allow for a reference that treats the regression as a type of Seemingly Unrelated Regression (SURE) LGCP framework. This is the most basic reference model to be considered. The hyperparameter is assigned a vague log gamma prior .

Obviously, this model is naive because spatial random effects are unlikely to be iid. The second reference model (), a Besag-York-Mollie (BYM) model, adds some spatial dependence structure. A BYM model is a log-normal Poisson model with an intrinsic conditional autoregressive (ICAR) component to capture spatial autocorrelations, i.e., a Besag model, plus a standard random effects component to capture non-spatial heterogeneity.

Reference model 2: Besag-York-Mollie Model ()

Reference model () can be seen as a more general form of reference model (), where instead of assuming that random effects between spatial units are iid, the BYM model imposes an ICAR spatial dependence structure. Recall that ICAR components are conditionally normally distributed. It also has an additional standard normal random effect to capture any residual variation that is not spatially dependent.

The third reference model () takes the form of an LGCP model. LGCPs are applicable to a variety of spatiotemporal settings mainly because they are flexible and relatively tractable. They provide good predictions but are not very interpretable [38]. For reference model 3, a locally periodic temporal kernel is used as the covariance of the underlying GP.

Reference model 3: LGCP with time kernel ()

5 Estimation via integrated nested Laplace approximation (INLA)

In order to estimate the counts for all time points at all units , the intensity for each cluster and time point must be estimated. This involves the estimation of the model intensity function regression parameters , the spatial temporal factors , and hyperparameters of the Gaussian process that will depend on the kernel functions used. INLA significantly speeds up the estimation and inference of latent Gaussian models and thus can be leveraged to rapidly compare many different model structures and choices against each other for tasks like kernel hyperparameter learning.

In general, INLA can perform hyperparameter learning; however, in the KGR-SKATER model, the use of the graph filter combined with the Gaussian process kernel matrix means that the standard INLA package will not accommodate this formulation directly. This can be resolved in the INLA approach while still using the standard INLA packages in R via a grid search for the estimation of with optimal values selected based on model ranking criteria such as DIC or WAIC. Essentially, for each point on the hyperparameter grid, the proposed LGCP model is fit in INLA on an in sample dataset and ranked by the resulting information criterion. Each model can be fit with INLA quickly, but a grid search with many hyperparameters can become computationally expensive. To alleviate some of the computational burden, a grid was constructed using Latin hypercube sampling, allowing for fewer grid points whilst adequately covering the parameter space of the hyperparameters.
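The grid construction and ranking step can be sketched as follows in Python. This is a minimal illustration, not the code used in this paper: the Latin hypercube sampler is a basic hand-written version, the hyperparameter bounds are hypothetical, and toy_criterion is a stand-in for fitting the model in INLA at a grid point and extracting its DIC or WAIC.

```python
import random


def latin_hypercube(n_points, bounds, seed=0):
    """Draw n_points samples so that each dimension has one point per equal-width stratum."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum, then shuffle the stratum order.
        strata = [(i + rng.random()) / n_points for i in range(n_points)]
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def select_hyperparameters(grid, criterion):
    """Score every grid point and return (best_score, best_point); lower is better."""
    return min((criterion(theta), theta) for theta in grid)


def toy_criterion(theta):
    """Placeholder for an information criterion returned by a fitted model (assumption)."""
    length_scale, period = theta
    return (length_scale - 12.0) ** 2 + (period - 12.0) ** 2


# Hypothetical bounds for a kernel length scale and a period, both in months.
grid = latin_hypercube(20, [(1.0, 36.0), (6.0, 18.0)], seed=42)
best_score, best_theta = select_hyperparameters(grid, toy_criterion)
```

With 20 points, each hyperparameter is still sampled once in every twentieth of its range, which is the property that lets a small grid cover the space.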

For each set of hyperparameters, the INLA method is adopted to estimate the marginal posteriors of each parameter of interest in Eq 9. This is done in R using the INLA and brinla libraries. See S7 Appendix for a demonstration of how and and an LGCP model are fit on synthetic data. INLA is a useful tool for conducting approximate Bayesian inference in cases, like Log Gaussian Cox Processes, where traditional Bayesian approaches like MCMC are more difficult to implement. Even though MCMC methods produce asymptotically more accurate results, they are much slower in comparison to INLA.

The main steps of INLA proposed in [8] are to use the joint density of the latent field, hyperparameters, and data given for reference models and by:

(7)

and for KGR-SKATER models by:

(8)

where Eq 7 corresponds to the estimation of and and Eq 8 corresponds to the estimation of the KGR-SKATER models. Note that for and , and for the proposed KGR-SKATER models .

Notice that since the latent factor covariance of KGR-SKATER models is estimated prior to the INLA step, it is included in the conditioning statement in the steps below for how INLA estimates KGR-SKATER models. Also note that denotes the collection of surrogate spatial cluster covariates, and these are used to estimate and . Since all of the distributions from this point on in this section will already be conditioned on , will be omitted for simplicity of notation.

The INLA approximation for KGR-SKATER models proceeds under two generic steps as follows:

1. Approximate the posterior of the hyperparameters as

where is a Gaussian approximation of the full conditional distribution of and represents the mode of the conditional distribution for given values of the hyperparameters and a Cholesky decomposition is used for the precision matrix, i.e.,

2. Then approximate the marginal posteriors of each parameter of interest using numerical integration as follows:

(9)

where there are K integration points and area weights defined by some numerical integration scheme.
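The numerical integration in the second step amounts to a weighted finite mixture over the K integration points. The Python sketch below illustrates this for a single latent parameter, assuming the simplest case in which each conditional density is Gaussian. The dictionary keys and all numerical values are hypothetical and serve only to make the example runnable.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def marginal_density(x, points):
    """Approximate marginal posterior density at x as a weighted mixture.

    Each integration point carries:
      'post'  : (possibly unnormalised) approximate posterior of the hyperparameters,
      'delta' : its area weight from the integration scheme,
      'mean', 'sd' : the Gaussian conditional approximation at that point.
    """
    total = sum(p["post"] * p["delta"] for p in points)
    mixture = sum(normal_pdf(x, p["mean"], p["sd"]) * p["post"] * p["delta"] for p in points)
    return mixture / total


# Three hypothetical integration points (K = 3).
points = [
    {"post": 0.2, "delta": 0.5, "mean": 0.0, "sd": 0.6},
    {"post": 0.7, "delta": 0.5, "mean": 0.4, "sd": 0.5},
    {"post": 0.3, "delta": 0.5, "mean": 0.9, "sd": 0.7},
]
density_at_half = marginal_density(0.5, points)
```

Dividing by the sum of the weights normalises the hyperparameter posterior on the grid, so the resulting mixture integrates to one.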

In [8], extensions of this approximation approach are developed based on the Laplace and simplified Laplace approximations as alternative methods to obtain better approximations for the conditional marginals of the latent field components. These are especially useful in non-Gaussian likelihood cases like Poisson processes. The Laplace approximation is given by:

where represents the Gaussian approximation of , which is a different conditional density than , and is the mode. This approximation has optimal performance but requires the largest computational budget since it must be computed for each value of , see discussion in [39]. Therefore, there is also a simplified Laplace approximation option, denoted by , which is obtained via a Taylor series expansion of around . This approximation is sufficiently accurate in many applications and requires a smaller computational budget, see [39].

Not only does INLA quickly provide approximations for the marginal posterior distributions of all the parameters of interest, but it can also generate approximations of the posterior predictive distribution, useful for forecasting:

In order to get the approximation of the predictive posterior, a quantization is used and the set of K support points is chosen such that they cover all areas with non-negligible posterior density, and for each support point an estimate of the posterior density is given, see further discussion in [40].

6 KGR-SKATER model validation case study

Before undertaking the detailed real data case study, this section briefly outlines a synthetic data case study that was undertaken to illustrate the behavior and accuracy of the KGR-SKATER modeling framework, compared to the reference models. The bulk of the synthetic data generation, validation, and results are outlined in detail in S13 Appendix. A synthetic dataset was generated intentionally with frequency and amplitude modulated intensity functions to assess performance in a challenging setting. A BYM model akin to and a KGR-SKATER model akin to are fit to this dataset.

In a one month ahead rolling forecast exercise, the reference model simply makes the same periodic prediction repeatedly. However, the KGR-SKATER model can adapt more flexibly to capture the patterns of the resulting intensity process, both in terms of prediction accuracy and coverage. Details of this relative performance comparison are provided in S13 Appendix.

6.1 Practical implementation considerations

Applying the KGR-SKATER framework in practice involves several modeling choices. First, the selection of covariates used for spatial graph estimation should be guided by domain knowledge regarding the drivers of spatial dependence. Second, the number of spatial clusters can be selected using clustering diagnostics such as silhouette width or domain-specific constraints. Third, kernel structures should be chosen based on the temporal characteristics of the data; locally periodic kernels are appropriate when quasi-periodic dynamics are expected.

Missing data or censored observations can be handled using standard imputation procedures prior to model estimation. In this study, an expectation–maximization procedure was used to impute censored mortality counts. Finally, hyperparameters governing the kernel functions can be selected using information criteria such as DIC or WAIC.

7 Application study: Modeling respiratory-related mortality in California

To demonstrate the KGR-SKATER methodology, this paper will attempt to model respiratory-related deaths across California whilst incorporating socioeconomic data for spatial dependence and air quality data as spatiotemporal time series regression covariates.

7.1 Data

Several publicly available data sources were used in this study. In this section, a brief outline of the details of each data set is provided; for a more comprehensive discussion of the data details and some plots in addition to the ones provided below, see S1 Appendix. The socioeconomic data in the form of Social Deprivation Indices (SDIs) come from [41] (see https://www.soa.org/resources/research-reports/2020/us-mort-rate-socioeconomic/#excel) and the Society of Actuaries (SoA), the air quality data from [42] (see https://aqs.epa.gov/aqsweb/documents/data_api.html), and the mortality data from [43] (see https://cal-vida.cdph.ca.gov/VSQWeb). Table 2 presents an empirical summary of the application study data:

Table 2. Summary statistics for covariates in application study.

https://doi.org/10.1371/journal.pone.0348787.t002

The SoA’s socioeconomic dataset contains 11 different sub-indices for each county for the years 2011–2019 (see Table 3).

The subindices in Table 3 were grouped into principal components, and the component that explained the most variation in mortality was used to create a single multidimensional index of social deprivation, SDI. This variable SDI is used in the clustering and graph estimation stages for the KGR-SKATER filtered graph Laplacian spatial dependence structure.

Table 3. Socioeconomic variables used to construct SDI score.

https://doi.org/10.1371/journal.pone.0348787.t003

The air quality data provided by the EPA include daily atmospheric measurements for the seven main air pollutants: lead, carbon monoxide (CO), sulfur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3), PM10 and PM2.5, along with the AQI value associated with a given pollutant for that day. Once the daily averages for each pollutant, for each county, had been queried and consolidated (see S2 Appendix for details of the procedure), they were aggregated into surrogate monthly time series regression covariates. These will be used to construct the time series regression kernel matrix for the KGR-SKATER model, denoted .

The observation process is comprised of mortality data categorized by cause of death from the California Department of Public Health (CDPH). Since air pollution primarily affects people’s health through their respiratory systems, the raw dataset was filtered to only include “chronic lower respiratory diseases” and “influenza and pneumonia” as causes of death because these were the only two that were distinctly respiratory-related. The months of 2015–2019 were selected as the study’s time window in order to avoid the COVID pandemic’s influence on respiratory deaths, which could confound the results. This filtering of the CDPH dataset resulted in observations of the number of deaths each month by county and by age group, e.g., 1–4 years, 15–24 years, up to 85 and older. However, this raw data selection had a known censoring that was applied by the data provider to protect privacy. Censoring was applied to any nonzero death counts less than 11 in each county and age group, in order to mask small cell counts. These censored death counts were imputed based on the Expectation Maximization (EM) Algorithm (see S3 Appendix for details).

While this approach provides statistically consistent estimates under standard missing-data assumptions, it may introduce additional uncertainty into the analysis. In particular, imputation may attenuate extreme observations and potentially underestimate variability. However, the relatively small proportion of censored observations suggests that any resulting bias is likely to be modest.
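To make the idea of EM-based imputation of censored counts concrete, the Python sketch below treats a single series of counts as independent Poisson draws with a common rate and fills in censored cells (true value between 1 and 10) with their conditional expectation. This is a simplified illustration under those stated assumptions; the actual procedure used in this study is described in S3 Appendix and may differ, for instance in how rates vary by county, age group, or month. All function names and the example data are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass function, computed on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def censored_mean(lam, low=1, high=10):
    """E[Y | low <= Y <= high] for Y ~ Poisson(lam)."""
    support = range(low, high + 1)
    probs = [poisson_pmf(k, lam) for k in support]
    return sum(k * p for k, p in zip(support, probs)) / sum(probs)


def em_impute(counts, n_iter=200, tol=1e-10):
    """counts: list of ints, with None marking a censored cell (true value in 1..10)."""
    observed = [c for c in counts if c is not None]
    n_censored = len(counts) - len(observed)
    lam = max(sum(observed) / len(counts), 1.0)  # crude positive starting value
    for _ in range(n_iter):
        fill = censored_mean(lam)  # E-step: expected value of each censored cell
        new_lam = (sum(observed) + n_censored * fill) / len(counts)  # M-step: Poisson MLE
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    fill = censored_mean(lam)
    return lam, [fill if c is None else c for c in counts]


# Hypothetical monthly counts for one cell; None marks a suppressed value.
counts = [12, None, 14, None, 0, 11, None, 13]
rate, imputed = em_impute(counts)
```

The imputed values are conditional expectations and therefore generally not integers, which is one way such imputation can smooth out extreme observations.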

With the mortality data now available by age group, month, and county, the data across age groups was added together to create the main response variable. Next, the county level data was added together to match the desired deaths per month by cluster format needed for the model. For the purpose of this application study, the spatiotemporal effects of air quality and socioeconomic status were the main focus; hence, the age groups were aggregated into a total county-level observation in order to reduce the zero inflation and overdispersion.

Time series of the cluster level surrogate data for the SDI, AQI and aggregated mortality are shown in Fig 2 for the case of seven spatial SKATER clusters:

Fig 2. Time series plots of aggregated response and aggregated surrogate time series explanatory variables for each cluster.

https://doi.org/10.1371/journal.pone.0348787.g002

7.2 SKATER clustering

The SKATER clustering was performed using the county level SDI covariate information. Details of the procedure for deciding this are provided in the S5 Appendix. California has 58 counties, which are reduced to between 2 and 10 spatial clusters obtained from the SKATER method, with the final number determined based on clustering performance. In the SKATER clustering package in R, one can also consider using one of the two constraints mentioned in Section 2.2. These constraints are useful for application studies like this, where one might want more balanced subgroups. The cluster order, i.e., the number of clustered spatial groups of counties of California, grouped according to the similarity and spatial contiguity of the Social Deprivation over time, was determined via a silhouette plot analysis. This was done in R using fviz_nbclust(), and analysis was performed to compare average widths from two to ten clusters.

The silhouette plot, which can be found in S4 Appendix, shows that the average width for two clusters is closest to 1 and that the average drops off progressively. However, it should be noted that the standard silhouette method implemented in these functions does not account for spatial contiguity adopted by SKATER clustering. In S8 Appendix, SKATER is run under the different constraints for two and seven clusters. One can see that the two cluster case is not very informative, and that running SKATER unconstrained may lead to clusters with just one or two counties. Ultimately, based on the silhouette cluster analysis, it was decided that a suitable tradeoff between spatial modeling and cluster dimension reduction was to use seven clusters. Furthermore, the minimum population constraint was used to maintain balanced clusters. These clusters are displayed in Fig 3.

Fig 3. Spatial dependence structure for seven clusters and a minimum population constraint visualized.

County boundary shapefiles obtained from the US Census Bureau (https://catalog.data.gov/dataset/tiger-line-shapefile-2016-state-california-current-place-state-based). These are in the public domain. Maps were generated by the authors using R packages (maps, sf, ggplot2).

https://doi.org/10.1371/journal.pone.0348787.g003

After the clustering step, surrogate variables are constructed as outlined in Section 2.3. However, first, the temporal resolutions of both the raw county level SDI and EPA data have to be matched to the mortality data’s temporal resolution. The data from EPA measuring stations are daily averages, so the median of each month’s measurements was used to get a monthly value for each pollutant. This aggregation was done after filtering out stations with bad data. Only one SDI value is calculated for each year, so the same value was used for each month within a given year. The population weights used to calculate the weighted average for each covariate were obtained by calculating the proportion of the population within each county in a given cluster using census data for each year. Once again, the surrogate variable for the SDI data is used to construct a spatial dependence structure and the surrogate variables for the EPA data are used to construct a temporal dependence structure in the time series kernel structure.
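The two aggregation steps described above (monthly medians of daily readings, then population-weighted averages over the counties in a cluster) can be sketched in Python as follows. The data structures, county labels, and numbers are hypothetical, and the sketch omits the station filtering step.

```python
from statistics import median


def monthly_median(daily_values):
    """daily_values: dict mapping (year, month) to a list of daily readings."""
    return {ym: median(vals) for ym, vals in daily_values.items() if vals}


def population_weighted_average(values, populations):
    """values and populations: dicts keyed by county, for one cluster and one month.

    Each county's weight is its share of the cluster population.
    """
    total_pop = sum(populations[county] for county in values)
    return sum(values[county] * populations[county] / total_pop for county in values)


# Hypothetical daily PM2.5 readings for one county.
daily = {(2015, 1): [10.0, 30.0, 20.0], (2015, 2): [12.0, 8.0]}
monthly = monthly_median(daily)

# Hypothetical monthly values and populations for a two-county cluster.
cluster_values = {"County A": 10.0, "County B": 20.0}
cluster_pops = {"County A": 100, "County B": 300}
surrogate = population_weighted_average(cluster_values, cluster_pops)
```

For the yearly SDI, the same weighted average would be computed once per year and repeated for each month of that year.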

7.3 Characterization of spatial dependence structure: Graphical LASSO

Once the spatial clustering is performed, the next step in the proposed KGR-SKATER modeling approach is to estimate the graphical spatial dependence. This is achieved with a graphical LASSO method using the R package huge. The inputs are the surrogate variables from the SDIs, and the resulting estimated graph complexity, given by the cardinality of the graph edge set, is determined using the EBIC model selection criterion (see further details in S6 Appendix).

After feeding the surrogate SDI variables into huge and following the steps outlined in Section 3.2, a graph, from which the graph filter can be calculated, was estimated. Fig 3 illustrates the estimated spatial dependence structure for seven clusters under the minimum population constraint. To see all cluster and graph filter results, refer to S8 Appendix.

7.4 Characterization of temporal dependence structure

The other component within the covariance matrix of the Gaussian process to be estimated for a KGR-SKATER model is K, the kernel Gram matrix that characterizes the temporal dependence structure. In Table 1, a variety of kernel choices are posited as potentially useful depending on the type of dependence structure that is hypothesized to exist. These different kernel components and mixture choices lead to the five different proposed models used in the application study to demonstrate the KGR-SKATER modeling capabilities. See S9 Appendix for the five KGR-SKATER models written out explicitly. Admittedly, these five models are similar; the main point of including all five models is to illustrate that there are many different ways to specify the covariance structure of the underlying intensity signal. One can see and compare the visual representations of the precision matrices for each proposed model in S11 Appendix. These models have the minor drawback of increased dimensionality in comparison to the simpler reference models. But in the next section, the improvement with respect to in-sample and out-of-sample fit will be exemplified.
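Since the spatial and temporal components are combined through a Kronecker product, the size of the resulting separable covariance matrix grows with both the number of clusters and the number of time points. The Python sketch below shows the operation on small toy matrices. The matrix values, the names S and K_t, and the ordering of the two factors are assumptions for illustration and do not reproduce the exact construction of the KGR-SKATER covariance.

```python
def kronecker(A, B):
    """Kronecker product of two matrices given as lists of rows.

    Block (i, j) of the result equals A[i][j] * B.
    """
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


# Hypothetical spatial matrix for 2 clusters and temporal Gram matrix for 3 time points.
S = [[1.0, 0.3],
     [0.3, 1.0]]
K_t = [[1.0, 0.5, 0.1],
       [0.5, 1.0, 0.5],
       [0.1, 0.5, 1.0]]

Sigma = kronecker(S, K_t)  # 6 x 6 separable spatiotemporal matrix
```

With seven clusters and 60 months, the same operation would give a 420 by 420 matrix, which illustrates why reducing the number of spatial units matters for dimensionality.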

8 Results

After presenting the evaluation metrics that will be used to compare in and out of sample performance, the reference and proposed models’ performance in and out of sample will be discussed. This section will show that the application study’s data can be modeled with several of the proposed models just as well as with the reference models. This is complemented by an increase in Frequentist coverage rates out of sample.

8.1 Evaluation metrics

When the reference and proposed models are fit in sample, the deviance information criterion (DIC) and Watanabe-Akaike information criterion (WAIC) are extracted and compared. WAIC results will be shown throughout the rest of the paper. For DIC results and a discussion on how these two metrics compare, see S10 Appendix.

To get an idea of how well each model fits the in sample data, the scaled root mean squared prediction error (RMSPE) is evaluated for each cluster. Standard RMSPE values for each cluster would be drastically different because the population sizes between clusters still vary significantly. As a result, the error rate is scaled by dividing the standard RMSPE value by the average of the actual observed data points for that cluster.

Since the response takes the form of counts, INLA can only predict the average intensity of this Poisson response, , at each location and time point. Because of this, an estimate of the true intensity is defined as . This estimate is obtained by taking the average of the number of deaths observed for month t + h over all years (2015–2019). So when the average intensity for December 2019 is to be predicted for instance, that prediction is compared with the average of observed mortalities during December 2015, 2016, 2017, and 2018.
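A minimal Python sketch of the two quantities described above is given below: the scaled RMSPE for one cluster, and the monthly average used as an estimate of the true intensity. The function names and example numbers are hypothetical, and the averaging in the second function runs over whichever years are supplied.

```python
import math


def scaled_rmspe(observed, predicted):
    """RMSPE divided by the mean of the observed values for one cluster."""
    n = len(observed)
    rmspe = math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)
    return rmspe / (sum(observed) / n)


def monthly_average_intensity(counts_by_year, month):
    """Average count for a calendar month (1 to 12) across the supplied years.

    counts_by_year: dict mapping year to a list of 12 monthly counts.
    """
    values = [counts[month - 1] for counts in counts_by_year.values()]
    return sum(values) / len(values)


# Hypothetical counts for one cluster.
error = scaled_rmspe([100, 200], [110, 190])
history = {
    2015: [50] * 11 + [80],
    2016: [50] * 11 + [90],
    2017: [50] * 11 + [100],
}
december_intensity = monthly_average_intensity(history, 12)
```

Because the error is divided by the cluster's mean count, clusters with very different population sizes can be compared on the same scale.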

8.2 In sample fitting

Each of the KGR-SKATER and reference models is fit with INLA to the aggregated respiratory-related mortalities for each cluster group from 2015 to 2019. The last six months of 2019 (months 55–60) are held out for the out of sample forecasting exercise. All three reference models and five proposed models are fit on the data clustered into two and seven clusters under the minimum population constraint. Using INLA, each of the models was fit in at most a couple of minutes. Additional output, such as the posterior densities of each parameter and hyperparameter, the model’s WAIC values, the posterior predictive fitted values, and uncertainty quantification, can also be extracted with ease.

From the WAIC values in Table 4, notice that as the number of clusters/spatial units increases, the WAIC increases, meaning these models of higher complexity are not as favorable. This is to be expected because the size of the graph filter depends on the number of clusters, and this directly affects how large the covariance matrix of each model’s underlying Gaussian process will be.

Figs 4 and 5 show the model fits for the reference and proposed models with the smallest WAIC, and ; the proposed model appears to be better with respect to uncertainty quantification. The fits are too close in accuracy to differentiate visually, so the in sample scaled RMSPEs were calculated for each model, for two and seven clusters, and are tabulated in S12 Appendix. The reference models turn out to have slightly lower error, but not by much.

Fig 4. Posterior predictive mean and credible interval bands estimated by .

Notice that the proposed models perform better with respect to uncertainty quantification and about the same with respect to in-sample fit.

https://doi.org/10.1371/journal.pone.0348787.g004

Fig 5. Posterior predictive mean and credible interval bands estimated by .

Notice that the proposed models perform better with respect to uncertainty quantification and about the same with respect to in-sample fit.

https://doi.org/10.1371/journal.pone.0348787.g005

As mentioned in Section 4, the proposed models do not need monthly fixed effects like the reference models to generate acceptable fits. These fits, which can be seen in S16 Appendix, are smoother and generally less accurate because their predictions tend towards the cluster mean instead of following the seasonal patterns. Including monthly fixed effects results in model fits indistinguishable from those of and . It turns out that due to the strong, persistent periodicity in the application study data, the monthly fixed effects capture the seasonal patterns very well. Since had the lowest WAIC values among the proposed models, it will be carried into the out of sample fitting analysis along with . Additionally, since the response exhibits signs of overdispersion, in S18 Appendix, a KGR-SKATER model similar to except with a Negative Binomial likelihood instead of Poisson was fit for comparison.

8.3 Out of sample forecasting

To evaluate the out of sample forecasting ability of the two models of interest, two forecasting exercises were carried out. The first has a forecast origin at month 54 and forecasts the last six months (July-December 2019) simultaneously. The second has a forecast origin at month 36 and performs a series of one step ahead forecasts to reconstruct the last 24 months (months 37–60), with the filtration/historical data increasing to include the newly forecasted value at each step. To specify which entries are out of sample, INLA instructs users to use NA placeholders for the response values to be forecasted. For a description of the derivation of the approximate posterior predictive distribution, refer back to Section 5. One can see from Figs 6 and 7 that produces virtually the same predictions out of sample as , if not a little better.

Fig 6. Posterior predictive mean and credible interval bands estimated by .

The number of respiratory-related deaths for July-December 2019 are forecasted simultaneously. Coverage included in the title is only calculated for the out of sample window.

https://doi.org/10.1371/journal.pone.0348787.g006

Fig 7. Posterior predictive mean and credible interval bands estimated by .

The number of respiratory-related deaths for July-December 2019 are forecasted simultaneously. Coverage included in the title is only calculated for the out of sample window.

https://doi.org/10.1371/journal.pone.0348787.g007

This is further supported by the similar out of sample forecast RMSPEs displayed in Table 5, as well as by all of the other forecasting performance metrics that were calculated (see S15 Appendix).

Table 5. Out-of-sample RMSPE values for each model.

https://doi.org/10.1371/journal.pone.0348787.t005

Where the two models mainly differ is with respect to uncertainty quantification. The coverage rates presented in Table 6 show that the proposed models’ credible interval bands tend to be wider (see S14 Appendix for all posterior predictive plots). As was the case when fitting in sample, produces narrower credible intervals, which sometimes do not encapsulate the observed data, like in Cluster 1. This is, of course, suboptimal. The better coverage is the main advantage of the KGR-SKATER models in the out of sample setting.
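Coverage and interval width, the two quantities being compared here, can be computed as in the following Python sketch. It assumes that lower and upper credible bounds are available for each out of sample observation. The function names and numbers are hypothetical.

```python
def empirical_coverage(observed, lower, upper):
    """Fraction of observations falling inside their interval [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(observed, lower, upper) if lo <= y <= hi)
    return hits / len(observed)


def mean_interval_width(lower, upper):
    """Average width of the intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical observed counts with 95% credible interval bounds.
observed = [5, 10, 15, 20]
lower = [4, 11, 10, 15]
upper = [6, 12, 20, 25]

coverage = empirical_coverage(observed, lower, upper)
width = mean_interval_width(lower, upper)
```

Reporting both numbers together is useful because wider intervals raise coverage mechanically, so coverage close to the nominal level with moderate width is the desirable outcome.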

Table 6. Out-of-sample Frequentist coverage for each model.

https://doi.org/10.1371/journal.pone.0348787.t006

After the six month horizon forecasting exercise, a rolling window forecast exercise was conducted to further evaluate out of sample prediction performance. For this exercise, one starts with a reduced version of the application study’s dataset, only 36 months long. Using this dataset, each model is estimated and used to make a forecast one month ahead. Then, using the original data and the new forecasts for the next month, i.e., months 1–37, the model is re-estimated and a forecast for the next month is produced. This process continues until the original 36 months of data have been used to produce a complete time series of 60 months, of which the last 24 months are forecasted one at a time.
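The rolling procedure described above can be sketched in Python as follows. The loop appends each new forecast to the history before the next forecast is made, as in the text. The seasonal naive forecaster is only a placeholder standing in for re-estimating a model with INLA at each step, and the toy series is hypothetical.

```python
def seasonal_naive(history, season=12):
    """Placeholder forecaster: repeat the value from one season ago (assumption)."""
    return history[-season] if len(history) >= season else history[-1]


def rolling_one_step_forecasts(series, n_initial, forecaster):
    """Forecast one step at a time, feeding each forecast back into the history.

    Returns the completed series (initial data followed by forecasts) and the forecasts.
    """
    history = list(series[:n_initial])
    forecasts = []
    while len(history) < len(series):
        next_value = forecaster(history)  # in practice: refit the model, then predict
        forecasts.append(next_value)
        history.append(next_value)
    return history, forecasts


# Toy monthly series of length 60 with an exact 12-month pattern.
series = [m % 12 for m in range(60)]
completed, forecasts = rolling_one_step_forecasts(series, 36, seasonal_naive)
```

Starting from 36 months of a 60 month series, the loop produces 24 one month ahead forecasts.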

No matter how many clusters were produced in the SKATER step, the results were more or less the same. The results for seven clusters are presented in Figs 8 and 9:

Fig 8. Rolling window forecasts produced by .

The fixed effects appear to dominate INLA’s estimates; thus, the forecasted values and credible interval bands are virtually identical. Coverage is slightly better for compared to due to cluster 6.

https://doi.org/10.1371/journal.pone.0348787.g008

Fig 9. Rolling window forecasts produced by .

The fixed effects appear to dominate INLA’s estimates; thus, the forecasted values and credible interval bands are virtually identical. Although the forecasted values look identical between and , they actually differ by one or two deaths in some cases.

https://doi.org/10.1371/journal.pone.0348787.g009

It turns out that the forecasts made by and are almost the same, give or take one or two deaths in some cases. However, this leads to noticeable differences in terms of the forecast metrics displayed in Fig 10. See S17 Appendix for a table with forecast metrics calculated for each horizon time point. Unlike in the previous forecasting exercise, there is not much difference in the coverage here. yielded a 95% coverage of 89.88% compared to 89.29% by .

Fig 10. Rolling forecast error metrics for and at each horizon time point.

Although the forecasted values look identical between the two models, actually has more accurate predictions, except with respect to MAPE.

https://doi.org/10.1371/journal.pone.0348787.g010

9 Discussion

This paper presents the KGR-SKATER framework, which integrates spatial clustering, graph signal processing, and approximate Bayesian inference to model high-dimensional, non-Gaussian spatiotemporal data. The proposed framework provides an interpretable and parsimonious structure for capturing complex spatiotemporal dependencies, achieved through the combination of spatial clustering and graph-based modeling. The spatial clustering step, implemented via the SKATER algorithm, reduces dimensionality and reveals spatial heterogeneity, while the graph filter constructed from these clusters encodes the spatial dependence structure. This graph filter is then combined with a locally periodic temporal kernel using a Kronecker product, resulting in a covariance matrix that captures both spatial and temporal dependencies.

The utility of the KGR-SKATER framework is demonstrated through an application to modeling respiratory-related mortality in California, leveraging socioeconomic and air quality data at the county and monthly levels. The results show that the KGR-SKATER model outperforms traditional models in terms of uncertainty quantification while maintaining comparable predictive accuracy. This advantage is particularly evident when the time series exhibits volatile periodicity and amplitude, as confirmed by the simulation study. Additionally, the framework’s robustness across different spatial clustering and temporal kernel configurations suggests its scalability to larger and more complex datasets.

The KGR-SKATER framework is designed to be broadly applicable across high-dimensional spatiotemporal settings beyond public health applications. Its primary strength lies in the deliberate construction of structured dependence representations that decompose complex dynamics into interpretable spatial and temporal components. By learning spatial relationships through graph-based representations and temporal dynamics through flexible kernel structures, the framework facilitates principled uncertainty quantification for both in-sample and out-of-sample predictions. Such uncertainty-aware forecasts can be directly leveraged to support informed policy and business decision-making.

A central modeling feature of the framework is its use of a separable spatiotemporal covariance structure, constructed via the Kronecker product of a graph-based spatial operator and a kernel-derived temporal Gram matrix. This design is not merely a simplification, but a deliberate structural decomposition that enables tractable inference while preserving interpretability. In particular, it allows practitioners to isolate and study spatial dependence (learned from covariates through graphical modeling and spectral filtering) separately from temporal dependence (captured through flexible, potentially nonstationary kernel constructions). As highlighted in the model formulation, this decomposition is especially effective in settings where different drivers govern spatial versus temporal variability.
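The separable construction can be illustrated with a small NumPy sketch. Everything specific here is an assumption for illustration: the 3-by-3 spatial operator `H` is a made-up symmetric positive definite matrix standing in for a learned graph filter, the locally periodic kernel uses the common "periodic times squared-exponential" form with hypothetical hyperparameters, and the ordering of the Kronecker factors (space first, time second) may differ from the paper's.

```python
import numpy as np


def locally_periodic_kernel(t, period=12.0, ell_per=1.0, ell_se=24.0, var=1.0):
    """Gram matrix of a periodic kernel multiplied by a squared-exponential decay."""
    d = t[:, None] - t[None, :]                                   # pairwise time lags
    periodic = np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell_per ** 2)
    decay = np.exp(-0.5 * d ** 2 / ell_se ** 2)                   # lets the period drift
    return var * periodic * decay


# Hypothetical spatial operator for 3 clusters (stand-in for a graph filter)
H = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])

t = np.arange(6, dtype=float)          # 6 monthly time points
K = locally_periodic_kernel(t)         # temporal Gram matrix, 6 x 6

Sigma = np.kron(H, K)                  # separable covariance, 18 x 18
print(Sigma.shape)
```

Each 6-by-6 block of `Sigma` is the temporal Gram matrix scaled by one entry of `H`, so spatial and temporal dependence can be inspected separately.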

Within this modular framework, practitioners are afforded substantial flexibility in tailoring each component to the application at hand. The choice of spatial clustering resolution, graph estimation procedure, graph filtering operator, and temporal kernel family can all be guided by domain knowledge. Rather than representing arbitrary tuning decisions, these modeling choices correspond to meaningful structural assumptions about the underlying data-generating process, such as the scale at which spatial homogeneity is expected, the sparsity of conditional dependencies, or the nature of temporal dynamics (e.g., periodicity, nonstationarity, or lag effects).

The use of SKATER-based clustering plays a key role in this structural design by providing a principled mechanism for dimensionality reduction that preserves interpretable spatial groupings. This not only enhances computational efficiency but also aligns the model with meaningful regional aggregation, enabling the learned graph structure to reflect relationships between functionally similar spatial units rather than purely geographic adjacency.

Similarly, the graph filtering step introduces a controlled form of spatial regularization through spectral smoothing of the graph Laplacian, encoding the assumption that latent processes vary smoothly over the learned topology. On the temporal side, the use of kernel mixtures allows for rich representations of time series behavior, including locally periodic dynamics, distributed lag effects, and nonlinear covariate interactions. These components can be combined additively or multiplicatively to reflect different structural hypotheses about temporal dependence.
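As an illustration of spectral smoothing on a graph, the sketch below applies a low-pass filter to a signal on a small path graph. The filter form (I + alpha L)^{-1}, the value of `alpha`, and the graph itself are assumptions chosen for simplicity; the paper's filter may be defined differently.

```python
import numpy as np

# Adjacency matrix of a hypothetical 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

L = np.diag(A.sum(axis=1)) - A          # combinatorial graph Laplacian
lam, U = np.linalg.eigh(L)              # graph frequencies and Fourier basis

alpha = 1.0                             # assumed smoothing strength
response = 1.0 / (1.0 + alpha * lam)    # attenuates high graph frequencies
H = U @ np.diag(response) @ U.T         # filter, equal to inv(I + alpha * L)

x = np.array([1.0, -1.0, 1.0, -1.0])    # rough signal alternating across nodes
x_smooth = H @ x

# The Laplacian quadratic form measures roughness over the graph
print(x @ L @ x, x_smooth @ L @ x_smooth)
```

A constant signal passes through this filter unchanged, while signals that oscillate between neighboring nodes are damped, which is the sense in which the filter encodes smooth variation over the learned topology.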

From a computational perspective, the framework’s structure is intentionally aligned with efficient approximate inference via INLA. The separable covariance construction and clustering-based dimensionality reduction jointly ensure that high-dimensional spatiotemporal models remain tractable without sacrificing key dependence features. While model complexity increases with the resolution of the spatial partition and the richness of the kernel design, these are controlled, interpretable dimensions of model specification rather than incidental burdens.

Overall, the KGR-SKATER framework should be viewed as a flexible, modular system for constructing structured spatiotemporal models, where each modeling component encodes a specific and interpretable assumption about the data. This design enables practitioners to balance model fidelity, interpretability, and computational feasibility in a principled manner, rather than treating such trade-offs as limitations.

The proposed KGR-SKATER framework offers significant improvements in both computational efficiency and model interpretability, providing a valuable tool for spatiotemporal modeling. Future work will explore the extension of this framework to accommodate multimodal data and additional graph constructions, further enhancing its applicability to a wider range of spatiotemporal data analysis tasks.

9.1 Computational complexity and scalability

The computational cost of the KGR-SKATER framework arises primarily from three components: spatial clustering, graph estimation, and latent Gaussian model inference.

The SKATER clustering step operates on a minimum spanning tree constructed from the spatial adjacency graph. The complexity of this step is approximately for constructing the MST and for iterative pruning, where N is the number of spatial units.

Graph estimation using graphical LASSO involves solving a penalized likelihood problem for the precision matrix. The computational complexity is typically for dense matrices, where C is the number of spatial clusters, although sparse solutions can reduce this cost substantially.

The Gaussian process component relies on a covariance matrix of dimension constructed as a Kronecker product . The Kronecker structure enables efficient linear algebra operations, reducing both storage and computational costs relative to fully dense covariance matrices.
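One standard way in which Kronecker structure saves work is the identity (H ⊗ K) vec(X) = vec(H X Kᵀ) for row-major vectorization, which avoids forming the full matrix. The sketch below checks this numerically with random matrices; the sizes `C` and `T` are hypothetical, and this is a generic linear algebra fact rather than a description of how the authors' implementation exploits the structure.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 4, 5                               # hypothetical numbers of clusters and time points

H = rng.standard_normal((C, C))           # spatial factor
K = rng.standard_normal((T, T))           # temporal factor
X = rng.standard_normal((C, T))           # signal: one row per cluster, one column per time

# Dense route: build the (C*T) x (C*T) matrix explicitly
dense = np.kron(H, K) @ X.ravel()

# Structured route: two small matrix products, no large matrix is formed
fast = (H @ X @ K.T).ravel()

print(np.allclose(dense, fast))

# Storage comparison: entries in the two factors versus the full Kronecker product
print(C * C + T * T, (C * T) ** 2)
```

The structured route only ever stores the two factors, so memory grows with C² + T² rather than (CT)².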

Finally, INLA performs approximate Bayesian inference using Laplace approximations. Its computational complexity is roughly cubic in the dimension of the latent field but benefits from sparse precision matrices and structured covariance operators.

Overall, the framework scales more favorably with the number of spatial units than traditional GP-based spatiotemporal models because clustering reduces the spatial dimension from N to C, where typically C ≪ N. All of the analysis carried out in this paper was conducted on a standard laptop.

9.2 Software accessibility

The current implementation relies on R packages such as SKATER, HUGE, and INLA; however, the methodology itself is not restricted to a particular programming environment. Equivalent functionality exists in other ecosystems: graphical LASSO implementations are available in Python (e.g., scikit-learn), Gaussian process modeling frameworks exist in libraries such as GPyTorch, and spatial clustering algorithms can be implemented using standard graph-processing tools. Future work could include a Python implementation of the KGR-SKATER framework to facilitate broader adoption. The code used to implement the application study is provided in the GitHub repository https://github.com/jeffwu25/KGR-SKATER.

Supporting information

S2 Appendix. Procedure for obtaining air quality and pollutant measurements for each county.

https://doi.org/10.1371/journal.pone.0348787.s002

(PDF)

S3 Appendix. Imputing “<11” values in data with EM Algorithm.

https://doi.org/10.1371/journal.pone.0348787.s003

(PDF)

S4 Appendix. Silhouette plot to determine optimal number of clusters.

https://doi.org/10.1371/journal.pone.0348787.s004

(PDF)

S6 Appendix. Evaluating different HUGE model selection criteria.

https://doi.org/10.1371/journal.pone.0348787.s006

(PDF)

S8 Appendix. Complete collection of SKATER and graph filter plots.

https://doi.org/10.1371/journal.pone.0348787.s008

(PDF)

S9 Appendix. Proposed KGR-SKATER model equations.

https://doi.org/10.1371/journal.pone.0348787.s009

(PDF)

S10 Appendix. Model comparison criterion for application study.

https://doi.org/10.1371/journal.pone.0348787.s010

(PDF)

S11 Appendix. Heatmaps of precision matrix of underlying GP of LGCP models.

https://doi.org/10.1371/journal.pone.0348787.s011

(PDF)

S12 Appendix. In sample RMSPE table for reference and proposed models.

https://doi.org/10.1371/journal.pone.0348787.s012

(PDF)

S14 Appendix. Posterior predictive plots for other reference and proposed models.

https://doi.org/10.1371/journal.pone.0348787.s014

(PDF)

S15 Appendix. Additional out of sample forecast performance tables.

https://doi.org/10.1371/journal.pone.0348787.s015

(PDF)

S16 Appendix. Fitting proposed models with and without monthly fixed effects.

https://doi.org/10.1371/journal.pone.0348787.s016

(PDF)

S18 Appendix. Fitting a KGR-SKATER model with Negative Binomial likelihood.

https://doi.org/10.1371/journal.pone.0348787.s018

(PDF)

Acknowledgments

The authors are grateful to Professor Tamma Carleton for thoughtful discussion that improved the modeling design for the application study and for the invitation to share preliminary results at the TWEEDS 2023 workshop.

References

  1. Greenwood M, Yule GU. An Inquiry into the Nature of Frequency Distributions Representative of Multiple Happenings with Particular Reference to the Occurrence of Multiple Attacks of Disease or of Repeated Accidents. 1920;83(2):255–79. https://www.jstor.org/stable/2341080
  2. Bartlett MS. The Spectral Analysis of Two-Dimensional Point Processes. 1964;51(3):299–311. Publisher: [Oxford University Press, Biometrika Trust]. https://www.jstor.org/stable/2334136
  3. Cox DR. Some Statistical Methods Connected with Series of Events. 1955;17(2):129–64. Publisher: [Royal Statistical Society, Wiley]. https://www.jstor.org/stable/2983950
  4. Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. 1991;43(1):1–20.
  5. Diggle PJ, Tawn JA, Moyeed RA. Model-Based Geostatistics. 1998;47(3):299–350. Publisher: [Wiley, Royal Statistical Society]. https://www.jstor.org/stable/2986101
  6. Møller J, Syversveen AR, Waagepetersen RP. Log Gaussian Cox Processes. 1998;25(3):451–82. Publisher: [Board of the Foundation of the Scandinavian Journal of Statistics, Wiley]. https://www.jstor.org/stable/4616515
  7. MacDonald R, Lee BS. Flexible basis representations for modeling large non-Gaussian spatial data. 2024;62:100841. https://www.sciencedirect.com/science/article/pii/S2211675324000320
  8. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. 2009;71(2):319–92. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2008.00700.x
  9. Hasanzadeh A, Liu X, Duffield N, Narayanan K, Chigoy BT. A Graph Signal Processing Approach For Real-Time Traffic Prediction In Transportation Networks. 2017. Available from: https://www.semanticscholar.org/paper/A-Graph-Signal-Processing-Approach-For-Real-Time-In-Hasanzadeh-Liu/56d5ab3283cb3d3bc2f950f6623bf0e2e5879fda
  10. Ioannidis VN, Romero D, Giannakis GB. Inference of Spatio-Temporal Functions over Graphs via Multi-Kernel Kriged Kalman Filtering. 2018;66(12):3228–39. http://arxiv.org/abs/1711.09306
  11. Antonian E, Peters GW, Chantler M. Kernel generalized least squares regression for network-structured data. PLOS ONE. 2025;20(5):e0324087.
  12. Martins A, Neves M, Câmara G, Da Costa Freitas D. Efficient Regionalization Techniques for Socio-Economic Geographical Units Using Minimum Spanning Trees. 2006;20:797–811.
  13. Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge Package for High-dimensional Undirected Graph Estimation in R. 2012;13:1059–62.
  14. Venkitaraman A, Chatterjee S, Händel P. Gaussian Processes Over Graphs. 2018;(arXiv:1803.05776). Available from: http://arxiv.org/abs/1803.05776
  15. Antonian E, Peters GW, Chantler M. Bayesian reconstruction of Cartesian product graph signals with general patterns of missing data. Journal of the Franklin Institute. 2024;361(9):106805.
  16. Antonian E, Peters GW, Chantler M. PyKronecker: A Python Library for the Efficient Manipulation of Kronecker Products and Related Structures. Journal of Open Source Software. 2023;8(81):4900.
  17. Hristopulos DT. Stochastic Local Interaction Model: Bridging Machine Learning and Geostatistics. Computers & Geosciences. 2015;85:26–37.
  18. Hristopulos DT, Agou VD. Stochastic Local Interaction Model with Sparse Precision Matrix for Space–Time Interpolation. Spatial Statistics. 2020;40:100403.
  19. Chen YH, Mukherjee B, Berrocal VJ. Distributed Lag Interaction Models with Two Pollutants. 2019;68(1):79–97. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6328049/
  20. Cortes TR, Silveira IH, Oliveira BFAd, Bell ML, Junger WL. Short-term association between ambient air pollution and cardio-respiratory mortality in Rio de Janeiro, Brazil. 2023;18(2):e0281499. Publisher: Public Library of Science. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0281499
  21. Albadrani M. Socioeconomic disparities in mortality from indoor air pollution: A multi-country study. 2025;20(1):e0317581. Publisher: Public Library of Science. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0317581. doi: https://doi.org/10.1371/journal.pone.0317581
  22. Cairns AJG, Kleinow T, Wen J. Drivers of mortality: risk factors and inequality. 2024;187(4):989–1012.
  23. Prim RC. Shortest connection networks and some generalizations. The Bell System Technical Journal. 1957;36(6):1389–401.
  24. Kruskal JB. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society. 1956;7(1):48–50.
  25. Turner S, Scholz M, Nagraj V. skater: Utilities for SNP-Based Kinship Analysis; 2023. R package version 0.1.2. Available from: https://CRAN.R-project.org/package=skater
  26. Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful Multilocus Tests of Genetic Association in the Presence of Gene-Gene and Gene-Environment Interactions. 2006;79(6):1002–16. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1698705/
  27. Jiang H, Fei X, Liu H, Roeder K, Lafferty J, Wasserman L, et al. huge: High-Dimensional Undirected Graph Estimation; 2021. R package version 1.3.5. Available from: https://CRAN.R-project.org/package=huge
  28. Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. 2008;95(3):759–71.
  29. Király FJ, Oberhauser H. Kernels for sequentially ordered data. Journal of Machine Learning Research. 2019;20(31):1–45.
  30. Koukorinis A, Peters GW, Germano G. Generative-discriminative machine learning models for high-frequency financial regime classification. Methodology and Computing in Applied Probability. 2025;27(2):36.
  31. De Felice G, Goulermas J, Gusev V. Time series kernels based on nonlinear vector AutoRegressive delay embeddings. Advances in Neural Information Processing Systems. 2023;36:37230–51.
  32. Lu Z, Leen TK, Huang Y, Erdogmus D. A reproducing kernel Hilbert space framework for pairwise time series distances. In: Proceedings of the 25th international conference on Machine learning; 2008. p. 624–31.
  33. Cuturi M. Fast global alignment kernels. In: Proceedings of the 28th international conference on machine learning (ICML-11); 2011. p. 929–36.
  34. Marteau PF, Gibet S. On recursive edit distance kernels with application to time series classification. IEEE transactions on neural networks and learning systems. 2014;26(6):1121–33.
  35. Berndt DJ, Clifford J. Using dynamic time warping to find patterns in time series. In: Proceedings of the 3rd international conference on knowledge discovery and data mining; 1994. p. 359–70.
  36. Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press; 2006. Available from: https://gaussianprocess.org/gpml/
  37. Hristopulos DT. Non-Separable Covariance Kernels for Spatiotemporal Gaussian Processes Based on a Hybrid Spectral Method and the Harmonic Oscillator. IEEE Transactions on Information Theory. 2024;70(2):1268–83.
  38. Diggle PJ, Moraga P, Rowlingson B, Taylor BM. Spatial and Spatio-Temporal Log-Gaussian Cox Processes: Extending the Geostatistical Paradigm. 2013;28(4). http://arxiv.org/abs/1312.6536
  39. Gómez-Rubio V. Bayesian inference with INLA; 2021. Available from: http://becarioprecario.bitbucket.io/inla-gitbook/index.html
  40. Held L, Schrödle B, Rue H. Posterior and Cross-validatory Predictive Checks: A Comparison of MCMC and INLA. In: Kneib T, Tutz G, editors. Statistical Modelling and Regression Structures: Festschrift in Honour of Ludwig Fahrmeir. Physica-Verlag HD; 2010. p. 91–110. https://doi.org/10.1007/978-3-7908-2413-1_6
  41. Barbieri M. Mortality by Socioeconomic Category in the United States; 2020. Available from: https://www.soa.org/resources/research-reports/2020/us-mort-rate-socioeconomic/#report
  42. Agency UEP. Air Quality System Data Mart; 2024. Accessed February 11, 2024. Available from: https://www.epa.gov/outdoor-air-quality-data
  43. State of California DoPH. California Vital Data (Cal-ViDa), Death Query; 2024. Accessed: 2023-06-18. Available from: https://cal-vida.cdph.ca.gov/