
Kernel generalized least squares regression for network-structured data

  • Edward Antonian,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Mathematics and Computer Science, Heriot-Watt University, Edinburgh, United Kingdom

  • Gareth W. Peters ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    garethpeters@ucsb.edu

    Affiliation Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, California, United States of America

  • Michael Chantler

    Roles Supervision, Writing – review & editing

    Affiliation School of Mathematics and Computer Science, Heriot-Watt University, Edinburgh, United Kingdom

Abstract

In this paper, we study a class of non-parametric regression models for predicting graph signals as a function of explanatory variables. Recently, Kernel Graph Regression (KGR) and Gaussian Processes over Graph (GPoG) have emerged as promising techniques for this task. The goal of this paper is to examine several extensions to KGR/GPoG, with the aim of generalising them to a wider variety of data scenarios. The first extension we consider is the case of graph signals that have only been partially recorded, meaning a subset of their elements is missing at observation time. Next, we examine the statistical effect of correlated prediction error and propose a method for Generalized Least Squares (GLS) on graphs. In particular, we examine first-order autoregressive (AR(1)) error processes, which are commonly found in time-series applications. Finally, we use the Laplace approximation to determine a lower bound for the out-of-sample prediction error and derive a scalable expression for the marginal variance of each prediction. These methods are tested on both real and synthetic data, with the former taken from a network of air quality monitoring stations across California. We find evidence that the generalised GLS-KGR algorithm is well-suited to such time-series applications, outperforming several standard techniques on this dataset.

1 Introduction

Graphs have proven to be a useful way of describing complex datasets due to their ability to capture general relational information between entities [1]. Knowledge about pairwise relationships between data points can, for example, help enhance regularization, improve computational efficiency and robustness, or better exploit unlabelled data [2,3]. Successful applications have included social networks [4], brain imaging [5], biomolecular systems [6], sensor networks [7], image and point cloud processing [8] and chemical structure [9] to name a few.

The Graph Signal Processing (GSP) community focuses in particular on generalisations of traditional signal processing techniques on regular domains. A toolbox of analogous graph-based techniques has been built by translating concepts such as convolution and spectral decomposition into the graph domain. These have been applied to a diverse range of problems including signal reconstruction [10], graph learning [11], denoising [12] and regression [13].

Graph-based regression specifically has received some attention recently in the form of kernel regression and Gaussian processes [13–16]. In this context, we are interested in constructing a function that maps input features to an output space, which is interpreted as a graph signal. This opens interesting avenues for enhanced regularisation using GSP tools and offers novel ways of defining kernel functions.

In this paper, we examine Kernel Graph Regression (KGR) and the closely related topic of Gaussian Processes over Graph (GPoG) from the perspective of Graph Signal Reconstruction (GSR). This framework allows signal prediction at unobserved test nodes as well as test times. Next, we extend this by developing a method for Generalized Least Squares (GLS) regression on graphs. This has particular relevance for the statistical analysis of time-series data where a graph signal is sampled at regular intervals, as it allows, for example, node-level error autocorrelation to be incorporated. Such scenarios are encountered in the analysis of many common graph applications such as sensor networks and financial time series. Finally, we use the Laplace approximation to calculate a lower bound for the out-of-sample prediction uncertainty and derive a scalable expression for the marginal uncertainty at each node.

1.1 Problem overview and scope

The goal of supervised learning and regression is to estimate an optimal function that maps an input space to an output space, given a set of training examples. It will be assumed that there is a static graph with a vertex set of N vertices (nodes), an edge set and an adjacency matrix. In this paper we are concerned specifically with real-valued vector regression inputs (M-dimensional time series covariates) and outputs (time series responses), where each response collects d-dimensional observation vectors over a subset of vertices out of a total of N vertices at time t. It is therefore assumed that the response is a partially observed graph signal, meaning its elements can be interpreted as being observed at time t on a subset of the full node-set.

In this setup, the M dimensional covariate time series may be exogenous to the graph vertices (not observed at any of the graph vertices) or they may be endogenous to the graph vertices (observed at a subset or all of the graph vertices). The methodology developed in this work admits both possible setups, but the focus in this work will be primarily on the former case of exogeneity of the regression covariate time series and this case will be considered unless otherwise stated.

The response time series vectors of the graph regression will be assumed to be observed, ideally at each time point t, stacked such that the first d coordinates correspond to the d-dimensional observation sub-vector observed at the first vertex and, analogously, each subsequent block of d coordinates corresponds to the d-dimensional observation sub-vector for each subsequent vertex.

In the case that the cardinality of the observed vertex set is less than that of the complete graph vertex set, the formulation will accommodate missingness of observations at some vertices in the setup, allowing for a subset of the vertices to have complete missingness of the observation vectors per time instant. Otherwise, one may have full or partial observations of all d-dimensional observation sub-vector coordinates per vertex at all vertices.

This gives a general pattern of missingness, where single observation coordinates may be missing at a given vertex, or an entire observation vector may be missing at a vertex at any instant of time. To capture this general pattern of missingness, a sensing matrix S will be utilised throughout, where S_{i,t} takes value 1 if the i-th coordinate is observed at time t and value 0 if it is missing at time t. Without loss of generality, in later sections, we will set d = 1 and we will decompose S into two sub-matrices.
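As a concrete illustration of the sensing matrix for d = 1 (the values below are hypothetical, chosen only to show the two missingness patterns), consider four nodes over three time steps, where one node exhibits complete missingness:

```python
import numpy as np

# S[i, t] = 1 if the signal at node i is observed at time t, 0 if missing
S = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 0, 0],   # node 2 is never observed (complete missingness)
              [1, 1, 0]])

# Masking a full N x T signal matrix F gives the observed entries
F = np.arange(12, dtype=float).reshape(4, 3)
Y = np.where(S == 1, F, np.nan)   # NaN marks missing observations
```

Rows of all-zeros in S correspond to the completely unobserved vertices, while isolated zeros capture per-instant missingness.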

The graph itself is assumed to be static in time with a known adjacency matrix over N vertices. The estimation challenge is to fit the graph signal reconstruction and regression models over T time points based on a potentially partially observed set of responses over the N vertices. The types of missingness that can be considered in this setup are quite general, and this aspect will be explained with illustrations in later sections of the paper.

In this formulation, the input features are ‘global’ in the sense that individual elements are not necessarily associated with any particular node in the output space. For example, if the graph represents a network of corporations, and the graph signal is their quarterly revenue growth, the features could represent global factors such as inflation and GDP, as well as firm-specific data such as the number of employees at a particular company. This is the most general formulation and allows for maximum flexibility in problem specification. Additional constraints explicitly linking elements from one space to the other are possible but not considered in this paper. This approach is shared by other recent work such as [13,17].

Throughout this paper we take a Bayesian perspective, using graph filters to construct spectral priors that can be used to make statements about the expected profile of a graph signal. Using graph filters in this way shares aesthetic and practical features with the graph kernel framework described in [18,19] and elsewhere. Graph filters as opposed to kernels, however, provide a more direct Bayesian interpretation which is useful in this context.

As mentioned, in this work, the graph adjacency matrix is assumed to be known a priori. In many applications such as social or sensor networks, this is a reasonable assumption, as a graph may be self-evident or easy to construct. For other applications, such as financial or biological networks, the dependency structure may be more opaque. In such scenarios, we refer the reader to the large body of work on graph learning [20–23].

1.2 Contributions

This paper contributes a statistical exploration of Kernel Graph Regression and Gaussian Processes over Graphs in the context of graph signal reconstruction. A primary aim is to incorporate the possibility of error correlation, which is of practical concern in many applications. We pay particular attention to time series modelling and node-level autocorrelation and describe an algorithm for determining function and model parameters in the presence of autocorrelated noise. We also focus on deriving mathematical expressions which lend themselves to practical computation for large-scale problems and use Bayesian reasoning to provide a lower bound for the model parameter uncertainty. These contributions are clarified below.

1.2.1 Incorporating partially observed graph signals.

Prior work on KGR and GPoG has assumed that the signal from all nodes under consideration is fully observed throughout training [24–26]. We introduce a modification that incorporates ideas from graph signal reconstruction to allow a constant subset of nodes to remain unobserved at train time and use the topology of the graph to make smooth predictions at these points.

1.2.2 Incorporating correlated prediction error via Generalised Least Squares (GLS).

We relax the statistical assumption that all prediction errors should be independent and identically distributed, allowing for the possibility of cross-correlation amongst nodes, and autocorrelation over time. This creates a broader class of models with rich statistical properties. In particular, we consider the scenario where Autoregressive AR(1) autocorrelation may exist at each node which has relevance in many time-series applications.

1.2.3 Strict enforcement of worst-case O(N³ + T³) complexity.

Graph regression problems are naturally expressed in terms of matrix operations on Kronecker products, i.e. in terms of operations on an NT × NT matrix, involving a high-dimensional inverse. This is problematic for all but the smallest of problems, as memory and computational costs escalate quickly. This work contributes an analysis of the complexity of the methods and ensures that all final mathematical expressions are given in terms of worst-case O(N³ + T³) operations, ensuring medium to large problems remain tractable.

1.2.4 A Laplace approximation for parameter uncertainty.

We use the Laplace approximation to estimate the full posterior and derive a tractable expression for the marginal variance. This can be used to provide a lower bound for the uncertainty over out-of-sample model predictions.

1.3 Paper organisation

Section 2 gives a brief introduction to some fundamental GSP concepts such as the Graph Fourier Transform (GFT) and graph filters. In section 3 we revisit the problem of Kernel Graph Regression (KGR) from a Bayesian perspective and incorporate the concept of partially observed graph signals. Section 4 builds on this, introducing the GLS model and outlining an iterative algorithm for estimating the prediction error, focusing particularly on an autoregressive model. Here, we also run several experiments on synthetic data to provide intuition for the hyperparameters this algorithm relies on. Section 5 then uses the Laplace approximation to derive a tractable lower bound for the marginal prediction uncertainty. Finally, section 6 analyses these methods on a real dataset concerning the prediction of pollutant levels across a network of monitoring stations in California.

1.4 Notation

Effort is made to adhere to the following variable naming conventions throughout this paper, see Table 1.

2 Preliminaries

2.1 Graphs and the GFT

Consider a weighted, undirected graph made up of N vertices and an edge set connecting vertex pairs. $A$ is the weighted adjacency matrix, where $A_{ij}$ represents the strength of interaction between nodes i and j; $A_{ij} = 0$ implies no edge between i and j, and $A_{ij} > 0$ implies an edge is present. One may define the graph Laplacian matrix as $L = D - A$, where $D$ is the diagonal degree matrix, $D_{ii}$ being the ith row sum (or column sum) of $A$. One key property of the graph Laplacian is that it can be used to measure the smoothness of a signal with respect to the graph.

\[ f^\top L f \;=\; \tfrac{1}{2} \sum_{i,j} A_{ij} \left( f_i - f_j \right)^2 \tag{1} \]

As can be seen, the quadratic product measures the squared difference in signal value between connected nodes, weighted by the respective adjacency matrix entry. In this way, it can be considered a measure of ‘roughness’ of the signal f, with smaller values indicating a smoother signal, achieving a minimum of zero when the function is constant on each disconnected subgraph [27]. It can be useful to view the Laplacian in terms of its eigendecomposition, $L = U \Lambda U^\top$, where the columns of the orthogonal matrix $U$ are the normalised eigenvectors of $L$, forming an orthonormal spanning set. (The eigenvalues are typically ordered such that $\lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_N$). Any graph signal can be expressed as a linear combination of the eigenvectors of the Laplacian as $f = \sum_i \tilde{f}_i u_i$, where $u_i$ is the i-th column of $U$ and $\tilde{f} = U^\top f$ is known as the Graph Fourier Transform of f [28]. This gives an alternative way to express the quadratic form of Eq (1).

\[ f^\top L f \;=\; \sum_{i=1}^{N} \lambda_i \tilde{f}_i^2 \tag{2} \]

These two facts provide the interpretation of the eigenvectors of the Laplacian as having varying degrees of smoothness, ordered by the size of the associated eigenvalue. The number of zero eigenvalues is the number of disconnected subgraphs. Beyond that, larger eigenvalues, analogous to frequency in classical signal processing, are associated with increasingly ‘rough’ eigenvectors. Fig 1 shows an example of this behaviour for a 3D mesh graph.
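These smoothness and spectral properties are straightforward to verify numerically. The sketch below (plain NumPy, with hypothetical helper names) builds the Laplacian of a small graph with two connected components, evaluates the quadratic form of Eq (1), and computes the Graph Fourier Transform of a signal:

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A, with D the diagonal degree matrix."""
    return np.diag(A.sum(axis=1)) - A

def roughness(A, f):
    """Quadratic form f^T L f, a measure of signal roughness on the graph."""
    return float(f @ laplacian(A) @ f)

# A small graph with two components: a path 0-1-2 and an edge 3-4
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

lam, U = np.linalg.eigh(laplacian(A))  # eigenvalues in ascending order

f = np.array([1.0, 2.0, 3.0, -1.0, 1.0])
f_gft = U.T @ f                        # Graph Fourier Transform of f
```

One can check directly that the quadratic form equals the weighted edge-difference sum, that constant signals have zero roughness, that the number of zero eigenvalues matches the number of components, and that Eq (2) holds term by term.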

Fig 1. Colour representation of seven low-frequency eigenvectors and two high-frequency eigenvectors of a 3D mesh graph, along with their associated eigenvalue.

The mesh is constructed from the classic Utah Teapot 3D object. Here, the main body, lid and spout are all mutually disconnected, giving three zero eigenvalues.

https://doi.org/10.1371/journal.pone.0324087.g001

2.2 Graph filters

The concept of a graph filter can be constructed by extending the analogy of the GFT [29]. A low-pass filter, $H$, which operates on a graph signal f, attenuates the high-frequency graph Fourier components and can be used to smooth or denoise a signal [30]. It can be constructed by defining a non-negative decreasing function $r(\lambda; \theta)$, parametrised by $\theta$, though we will suppress the parameter when using the function unless explicitly required. The filter, $H$, is then defined by

\[ H \;=\; U \, r(\Lambda) \, U^\top \tag{3} \]

where $r(\Lambda)$ denotes the application of $r$ element-wise to the diagonal. The filtered signal is then calculated as $Hf$. Note that the operator $H$ is necessarily symmetric and semi-positive definite. Numerous possibilities exist for $r$, some of which are outlined in Table 2.

Graph filters constructed in this way bear a close resemblance to graph kernels as described in [18], where instead the authors define an increasing function of the eigenvalues. In practical terms, a graph filter can be considered the inverse (or, more accurately, a generalised inverse) of a graph kernel as defined in this paper. “Graph kernel” in this context should not be confused with another concept of the same name, which concerns distance metrics between pairs of graphs [31].
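A minimal sketch of such a filter, using the exponential function r(λ) = exp(−βλ) (one of the standard decreasing choices; the helper names are illustrative):

```python
import numpy as np

def exponential_filter(L, beta=1.0):
    """Low-pass graph filter H = U r(Lambda) U^T with r(lam) = exp(-beta * lam)."""
    lam, U = np.linalg.eigh(L)
    return U @ np.diag(np.exp(-beta * lam)) @ U.T

# Path graph on four nodes
A = np.diag(np.ones(3), 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A

H = exponential_filter(L)
f = np.array([1.0, -1.0, 1.0, -1.0])   # a rough, high-frequency signal
f_smooth = H @ f                        # filtered (smoothed) signal
```

Since r(0) = 1, constant signals pass through unchanged, while the roughness f^T L f of any non-constant signal strictly decreases; H is symmetric positive definite as required.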

2.3 Graph spectral priors

Graph filters can also be used to construct covariance matrices, for example in [14,32]. Consider a Gaussian white noise graph signal $w$ with precision $\gamma$. If a filter $H$ is applied to give $f = Hw$, the resultant signal will be smoother with respect to the graph and will be drawn from a distribution

\[ f \;\sim\; \mathcal{N}\!\left( 0, \; \gamma^{-1} H^2 \right) \tag{4} \]

In a Bayesian setting, $\gamma^{-1} H^2$ can act as a covariance matrix which encodes the assumption that signals observed over a graph are likely to be smooth with respect to the underlying topology.

The selection of an appropriate graph filter in this context can be guided by prior knowledge about the expected power spectrum. Adaptive techniques for jointly learning a graph filter have also been proposed, such as [15,33], but are not considered in this paper.
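The covariance construction of Eq (4) can be sketched as follows (illustrative code; the ring graph and exponential filter are arbitrary choices made for the example):

```python
import numpy as np

N = 20
A = np.zeros((N, N))
for i in range(N):                       # ring graph on N nodes
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)
H = U @ np.diag(np.exp(-lam)) @ U.T      # exponential low-pass filter

gamma = 1.0                               # white-noise precision
C = (1.0 / gamma) * H @ H                 # prior covariance of f = H w

rng = np.random.default_rng(0)
w = rng.standard_normal(N)                # white noise draw
f = H @ w                                 # a smooth draw from the prior
```

The expected roughness of a prior draw is E[f^T L f] = tr(LC), which is strictly smaller than the white-noise value tr(L)/γ, confirming that this prior favours signals that are smooth with respect to the topology.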

3 Kernel graph regression

3.1 Data and definitions

This problem runs over a set of T discrete “times”. (These need not be regularly-sampled time intervals, but in many practical problems they will be). At each time, there exists a fixed graph with N nodes, as specified earlier by its vertex set and related by a static adjacency matrix.

At a distinct subset of the times we observe supervised learning pairs. Here, each input is a vector of explanatory variables, and each output is a partially observed graph signal. This signal is measured on a subset of the nodes which remains constant over all observation times. In a sensor network, this may be the case if only a subset of the sensors are functioning correctly, or if we wish to estimate the reading at locations where no physical sensor is placed. The goal is to predict the graph signal at the unobserved nodes at the labelled times, and at all nodes at the unlabelled times.

It is useful to define two binary selection matrices which help in mapping between the observed and unobserved node sets. The first has a one in each row at the column index of a corresponding observed node, with the rest of its entries set to zero. Similarly, the second has a one in each row at the column index of a corresponding unobserved node, with the rest set to zero. For simplicity, and without loss of generality, we can assume that

(5)

which implies that the first nodes are observed, and the first times contain the labelled examples. Algebraically, this is always possible under a reordering of nodes/times. However, statistically, such reordering may influence the regression results, depending on what assumptions are made about the autocorrelation structure and conditional dependence of the response variables, given the covariates, over time.

If the observations are assumed conditionally uncorrelated in time (or independent, under a Gaussian error assumption), given the covariates, then reordering in this manner will not influence the regression results, provided the covariate time series does not display autocorrelation or cross-autocorrelation in any coordinates (assuming, of course, that the covariates in the regression are also reordered accordingly). If, however, the covariates do have a temporal cross-correlation, then mixing assumptions may need to be considered when deciding on the suitability of reordering, such that reordering takes place only between time intervals with lags that are sufficiently uncorrelated in the design-space covariate time series. If, additionally, there is deemed to be conditional autocorrelation in the response time series, even after conditioning on the regressors, then such reordering should be considered carefully, as it may influence the results of the regression, including the smoothness of the signals learnt. In this instance, one may perform a decorrelating transformation on the observation covariates prior to performing the reordering; of course, this would assume knowledge of the conditional autocorrelation matrix of the response in the regression.

When this is not known, it leads to a generalised iterative graph regression that is the extension of Generalised Least Squares (see [34]) into the graph regression setting. Such an example is illustrated in the AR(1) error process in Section 4.3 below. Alternatively, if the assumptions required to perform reordering are not easily satisfied for a given regression application, then there is no problem with working with the selection matrices without any reordering; it is just less convenient algebraically.

3.2 KGR via signal reconstruction

We will now derive a Kernel Graph Regression model from the perspective of graph signal smoothing and reconstruction. Let us assume that there exists an underlying noiseless graph signal at each time that we wish to estimate. The observation is then modelled as a partial noisy observation of this latent signal. This statement is summarised by the following model.

(6)

Here, the observation matrix is a horizontal stacking of each partial graph observation, such that the t-th column is the partial graph signal observed at time t, and the n-th row is the time series observed at node n. The latent matrix F has a similar structure, but represents the true underlying function at the full, length-N node-set at all times. The error matrix consists of standard normal i.i.d. noise. The probability distribution of the observations can therefore be expressed as

(7)

where vec(·) has the usual meaning of a vertical stacking of matrix columns, and ⊗ is the Kronecker product. In order to estimate the latent signal F, we must specify a prior distribution indicating our belief at the outset about the likelihood of different signals. This provides the necessary Tikhonov regularisation term and avoids under-specification. In this case, an appropriate prior for F, which has also been used in [14], is the following

(8)

where the matrix is a filtered graph Laplacian, as defined in Equation (3), for a user-specified graph filter such as one of those in Table 2, together with a precision parameter which controls the regularisation strength.

Here, K is a kernel (or Gram) matrix defined by the relation $K_{ij} = k(x_i, x_j)$, where k is a Mercer kernel (see [35]) with parameter(s) $\theta$. In a kernel matrix, the entries represent the inner product between pairs of (potentially infinite-dimensional) basis function representations of the features; that is, $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. An example of a typical kernel used in machine learning is the Gaussian kernel:

\[ k(x_i, x_j) \;=\; \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right) \tag{9} \]
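A small self-contained sketch of the corresponding Gram-matrix computation (the bandwidth σ plays the role of the kernel parameter θ; helper name is illustrative):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows x_i of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
K = gaussian_gram(X)
```

As required of a Mercer kernel, the resulting K is symmetric and positive semi-definite, with a unit diagonal.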

A natural question to ask is why we use the prior given in Eq (8), with a covariance matrix given by the Kronecker product between a kernel matrix and a squared graph filter. Intuitively, it makes sense that the correlation between two node-times $(n_1, t_1)$ and $(n_2, t_2)$ is expected to be high when the nodes are closely connected and the explanatory variables are similar, i.e. $\lVert x_{t_1} - x_{t_2} \rVert$ is small. Eq (8) directly maps onto this intuition, since the prior covariance between two signal elements is proportional to $K_{t_1 t_2} \, (H^2)_{n_1 n_2}$.

In the supplementary materials, we provide a more formal justification for why this prior is appropriate. This is derived by considering the “weight-space” view of multivariate Gaussian processes [36]. The derivation begins with a multivariate linear regression model with a spherical prior over the regression coefficients. Then, the so-called ‘kernel trick’ is used to translate this into a non-parametric model with a kernel matrix K. The only substantial modification we make to this original derivation is to place a graph spectral prior over the regression coefficients, rather than the usual spherical prior. This encourages predictions which are smooth with respect to the graph topology.

Applying Bayes’ rule to Eqs (7) and (8) results in the following Maximum a Posteriori (MAP) optimisation problem for .

where we use the notation p(·) to represent a probability density function. Using the definition of the density function for a matrix normal distribution, the negative log posterior can be written as

(10)

Taking the derivative of this expression with respect to F and setting the result to zero yields a solution for the latent signal. We refer to this as the KGR solution, but it can also equivalently be considered the mean of a GPoG solution.

(11)

Applying a well-known matrix identity ([37], Eq 162) allows the dimension of the inverse to be significantly reduced.

(12)

where the two reduced matrices are formed from the kernel and filter matrices restricted to the observed times and nodes. However, in this form, computation is impractical for large problems (T or N or both large) due to the high complexity involved in inverting the Kronecker-structured matrix. Thankfully, this can be overcome by performing eigendecomposition on the kernel and filter factors separately and leveraging properties of the Kronecker product. For a detailed derivation, consult the supplementary materials. Using ∘ to represent the Hadamard product, the result is

(13)

where

(14)

and has the elements given by

(15)

A key benefit of the solution in this form is that the necessary dense eigendecompositions, which are typically the most computationally taxing step, are performed on matrices whose sizes are set by the observed node and time sets, whereas a naive implementation would require decompositions at the full dimensions. This can be a significant speed-up for large problems, especially when a meaningful portion of the data is unlabelled. (Decomposition of the graph Laplacian is also required, but this matrix is typically sparse, so the computation can often be accelerated using sparsity-specific linear algebra tools; see [38] and references therein).
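The essence of this computational trick, stripped of the regression-specific notation, is that a system of the form (A ⊗ B + cI) vec(X) = vec(Y) can be solved using only the eigendecompositions of the two small factors and a Hadamard division. A generic sketch (hypothetical helper name; A and B stand in for the kernel and filter factors of the KGR solution):

```python
import numpy as np

def kron_solve(A, B, c, Y):
    """Solve (A kron B + c I) vec(X) = vec(Y) without forming the NT x NT matrix.

    A (T x T) and B (N x N) must be symmetric; Y is N x T and vec denotes
    column stacking, so that (A kron B) vec(X) = vec(B X A^T)."""
    lam_a, Ua = np.linalg.eigh(A)
    lam_b, Ub = np.linalg.eigh(B)
    Yt = Ub.T @ Y @ Ua                      # rotate into the joint eigenbasis
    Xt = Yt / (np.outer(lam_b, lam_a) + c)  # Hadamard division by eigenvalues
    return Ub @ Xt @ Ua.T
```

Only O(T³ + N³) work is required for the decompositions, versus O(N³T³) for a naive inverse of the full Kronecker-structured matrix.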

A full outline of all the steps for graph kernel regression is highlighted in algorithm 1.

3.3 KGR and Cartesian product graphs

In this short section, we highlight the connection between Kernel Graph Regression and Graph Signal Reconstruction (GSR). In particular, the mathematical formalism of KGR described so far in this paper bears a strong resemblance to Bayesian signal reconstruction as applied to a Cartesian product graph. The reconstruction of signals defined over Cartesian product graphs is an area that has received increasing attention in recent years [39,40], particularly in the context of Time-Vertex (T-V) problems [41]. By adapting the KGR algorithm slightly, we obtain a signal reconstruction algorithm with little effort, which can provide insight into both areas.

A Cartesian product graph is described by two adjacency matrices $A_T$ and $A_N$ which, in the following, we refer to as the ‘time-like’ graph and the ‘space-like’ graph respectively. Fig 2 gives a graphical depiction of a small Cartesian product graph. The resultant graph has NT nodes, with an adjacency matrix $A_\oplus$, given by the Kronecker sum of the two individual adjacency matrices, specified as

Fig 2. A graphical depiction of a small Cartesian product graph.

https://doi.org/10.1371/journal.pone.0324087.g002

\[ A_\oplus \;=\; A_T \oplus A_N \;=\; A_T \otimes I_N \,+\, I_T \otimes A_N \tag{16} \]

Similarly, the Laplacian matrix of the Cartesian product, $L_\oplus$, is given by the Kronecker sum of the individual Laplacian matrices, that is

\[ L_\oplus \;=\; L_T \oplus L_N \;=\; L_T \otimes I_N \,+\, I_T \otimes L_N \tag{17} \]

The Laplacian can be eigendecomposed as follows.

\[ L_\oplus \;=\; U_\oplus \Lambda_\oplus U_\oplus^\top \tag{18} \]

where $U_\oplus = U_T \otimes U_N$ and $\Lambda_\oplus = \Lambda_T \oplus \Lambda_N$. Therefore, a general graph filter associated with the product graph can be understood as applying a decreasing function to the diagonal eigenvalue matrix $\Lambda_\oplus$.

\[ H_\oplus \;=\; U_\oplus \, r(\Lambda_\oplus) \, U_\oplus^\top \tag{19} \]

For a certain class of graph filter functions, for example, the exponential filter, it is the case that

\[ r(\lambda_T + \lambda_N) \;=\; r(\lambda_T) \, r(\lambda_N) \tag{20} \]

We refer to graph filters of this type as separable. If this condition holds, it implies that the total graph filter can also be expressed as a Kronecker product.

\[ H_\oplus \;=\; H_T \otimes H_N \tag{21} \]

where $H_T = U_T \, r(\Lambda_T) \, U_T^\top$ and $H_N = U_N \, r(\Lambda_N) \, U_N^\top$. Now consider a graph signal reconstruction problem, where the task is to estimate a smooth underlying signal, given a signal observed over a constant subset of the space-like and time-like nodes. Once again, we assume that the observation is a noisy partial observation of the latent signal, given by Eq (6). The only key difference from the KGR formulation is that the prior distribution is no longer given by Eq (8), but instead is given by

(22)

Note that the only material difference is that the kernel matrix K has been replaced by a second filter matrix. Following the Bayesian logic of the previous section immediately leads to a graph signal reconstruction method, given explicitly in algorithm 2.
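The separability property can be verified numerically: for the exponential filter, the filter built directly on the Kronecker-sum Laplacian coincides with the Kronecker product of the two factor filters. An illustrative sketch using small path graphs:

```python
import numpy as np

def path_laplacian(n):
    """Laplacian of a path graph on n nodes."""
    A = np.diag(np.ones(n - 1), 1)
    A = A + A.T
    return np.diag(A.sum(axis=1)) - A

def exp_filter(L, beta):
    """Exponential graph filter H = U exp(-beta Lambda) U^T."""
    lam, U = np.linalg.eigh(L)
    return U @ np.diag(np.exp(-beta * lam)) @ U.T

T_, N_, beta = 3, 4, 0.5
L_T, L_N = path_laplacian(T_), path_laplacian(N_)

# Kronecker-sum Laplacian of the Cartesian product graph
L_prod = np.kron(L_T, np.eye(N_)) + np.kron(np.eye(T_), L_N)

H_prod = exp_filter(L_prod, beta)                               # filter on the product graph
H_sep = np.kron(exp_filter(L_T, beta), exp_filter(L_N, beta))   # separable form
```

The equality of the two constructions follows from exp(−β(λ_T + λ_N)) = exp(−βλ_T) exp(−βλ_N) applied in the shared eigenbasis of the Kronecker sum.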

4 GLS kernel graph regression

In many real-world applications of regression modelling, the assumption that the error terms of Eq (6) are independent and identically distributed is unlikely to hold. In this section, we consider the situation where the differences between the observed and underlying signals are instead correlated according to a matrix normal distribution. Under this model, the elements of the error matrix E are distributed as

\[ \operatorname{vec}(E) \;\sim\; \mathcal{N}\!\left( 0, \; \Sigma_T \otimes \Sigma_N \right) \tag{23} \]

where $\Sigma_T$ and $\Sigma_N$ respectively represent the time correlation and vertex covariance matrices. We choose to parameterise the covariance matrix in this way for several reasons. Firstly, some method for reducing the dimensionality of the problem is certainly necessary for the estimation of the matrix. Since the available data to make the estimate is a single observation, estimating a full $NT \times NT$ matrix would not be possible without other regularisation assumptions. Secondly, expressing the covariance matrix as a Kronecker product of two smaller matrices enables tractable estimators. This is primarily due to the property of Kronecker products that $(\Sigma_T \otimes \Sigma_N)^{-1} = \Sigma_T^{-1} \otimes \Sigma_N^{-1}$. In addition, holding a dense $NT \times NT$ matrix in memory may be out of the question. Thirdly, the Kronecker product assumption has an intuitive interpretation in terms of independent factor-specific contributions to the overall covariance. Put simply, the covariance between node-times $(n_1, t_1)$ and $(n_2, t_2)$ will be $(\Sigma_T)_{t_1 t_2} (\Sigma_N)_{n_1 n_2}$. If there is no correlation between two times $t_1$ and $t_2$, then there should be no correlation between the corresponding node-times, no matter what the value of $(\Sigma_N)_{n_1 n_2}$. Similarly, if the two nodes are totally uncorrelated, then there should be no correlation between the node-times, no matter what the value of $(\Sigma_T)_{t_1 t_2}$. The actual value can be seen as counting the contribution from both the time correlation and the node-level correlation multiplicatively. Finally, this model has also been used in many applications relevant to network problems such as geospatial, econometric, and EEG models [42–44]. For other recent approaches, see [45,46].
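Both the inverse-factorisation property and the multiplicative covariance interpretation can be checked directly on small, illustrative matrices (values chosen for the example only):

```python
import numpy as np

Sig_T = np.array([[1.0, 0.5],
                  [0.5, 1.0]])             # time correlation, T = 2
Sig_N = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])        # vertex covariance, N = 3

C = np.kron(Sig_T, Sig_N)                  # full NT x NT covariance of vec(E)

# The inverse factorises, so only the small matrices are ever inverted
C_inv = np.kron(np.linalg.inv(Sig_T), np.linalg.inv(Sig_N))

def cov_entry(n1, t1, n2, t2, N=3):
    """Covariance between node-times (n1, t1) and (n2, t2) under column stacking."""
    return C[t1 * N + n1, t2 * N + n2]
```

The entry-level check confirms that the covariance between two node-times is exactly the product of the time correlation and node covariance entries.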

4.1 KGR with Gauss–Markov estimator

To begin, we derive a Gauss–Markov estimator for F, that is, the Best Linear Unbiased Estimator (BLUE), assuming that the covariance matrix is known. (This restriction is relaxed in the subsequent section). In this case, the log-likelihood of making an observation is altered such that the cost function of Eq (10) becomes

(24)

The BLUE estimator for a given $\Sigma_T$ and $\Sigma_N$ is defined to be the value of F which minimises this expression, and can be found by differentiating it with respect to F and setting the result equal to zero. By again following the same steps as in section 3, this procedure results in

(25)

Once again, in this form, the solution remains prohibitively expensive to compute for large N and T. However, it can be expressed alternatively, with $O(N^3 + T^3)$ complexity, by making the following definitions. First, eigendecompose the time and node covariance matrices.

Then perform eigendecomposition on the following matrices:

Note that both these matrices are guaranteed to be symmetric with positive, real eigenvalues [47]. Finally, make the following definitions:

The GLS estimator for F is given by

(26)

While a simpler $O(N^3 + T^3)$ solution is possible, the above method ensures that the eigendecompositions are performed on reduced-size matrices, effectively scaling with the sizes of the observed node and time sets. For a detailed derivation with intermediate steps, we refer the reader to the supplementary materials.

4.2 Estimating $\Sigma_N$ and $\Sigma_T$ given F

In this section, we construct estimators for the covariance matrices $\Sigma_N$ and $\Sigma_T$, given that the latent signal F, and therefore the prediction error E, is known. It follows from Eq (23) that E has a matrix normal distribution of the following form.

(27)

Past literature on parameter estimation in matrix-normal models has mostly focused on the task of estimating the covariance given multiple realisations of the error matrix [48–51]. In order to estimate both and in full, the number of observations of must be greater than [max  +  1] [52]. However, in the present case, only a single observation comprising values is available. It is therefore necessary to assume a simplified, parametrised structure for at least one of these matrices to reduce the number of degrees of freedom.

For this reason, we assume that the matrix describes the autocorrelation between different time measurements and is a function of a parameter or set of parameters . The elements of are constrained to be dimensionless Pearson correlation coefficients, with 1 along the diagonal. Conversely, is assumed to be an arbitrary covariance matrix that can take any value satisfying positive semi-definiteness. By explicitly parametrising one factor as a correlation matrix, governed by a parameter vector of much lower dimension than T, while letting the other act as a fully parameterised covariance matrix, we avoid the solution being under-determined (since rescaling one factor by a and the other by 1/a gives the same overall covariance matrix for any a).

The likelihood of observing an error matrix , given covariance matrices and , is

(28)

By specifying two independent prior distributions and , the posterior density over and can be expressed via Bayes’ rule as

Depending on the size of the problem, this could be used along with a Monte Carlo algorithm to sample from the whole posterior or, more likely, be used to calculate single MAP estimates for and . In either case, the negative log posterior, up to a scaling and additive constant, is equal to

(29)

A MAP estimate can be found by minimising this quantity jointly with respect to the parameters. There are several possible strategies for solving this, such as simple gradient descent. However, since an analytical solution for is often possible given and vice versa, an efficient algorithm using temporary approximate solutions can often be found. This so-called ‘flip-flop’ strategy is widely used in the literature on Kronecker-product covariance estimation [52]. Crucially, the number of iterations required for convergence does not depend greatly on the size of the system, meaning complexity is retained.
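The flavour of this alternating ‘flip-flop’ scheme can be illustrated with a minimal NumPy sketch: an AR(1)-parametrised time correlation matrix (one standard choice, anticipating the example of the next section) is alternated with a closed-form maximum-likelihood update for the node covariance. All names here (`ar1_corr`, `nll`, `fit`) are our own, and this is a sketch of the strategy under those assumptions, not the paper's exact estimator:

```python
import numpy as np

def ar1_corr(T, rho):
    """Stationary AR(1) correlation matrix with entries rho^|t1 - t2|."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def nll(E, sig_N, sig_T):
    """Negative log-likelihood (up to constants) of E ~ MN(0, sig_N, sig_T)."""
    N, T = E.shape
    _, ld_T = np.linalg.slogdet(sig_T)
    _, ld_N = np.linalg.slogdet(sig_N)
    quad = np.trace(np.linalg.solve(sig_T, E.T) @ np.linalg.solve(sig_N, E))
    return 0.5 * (N * ld_T + T * ld_N + quad)

def fit(E, n_iter=5, grid=np.linspace(-0.95, 0.95, 39)):
    """Flip-flop: closed-form sig_N for fixed rho, then a 1-d search for rho."""
    N, T = E.shape
    rho = 0.0
    for _ in range(n_iter):
        sig_T = ar1_corr(T, rho)
        sig_N = E @ np.linalg.inv(sig_T) @ E.T / T   # ML update for fixed sig_T
        rho = min(grid, key=lambda r: nll(E, sig_N, ar1_corr(T, r)))
    return sig_N, rho

# Synthetic check: a single error matrix with known AR(1) parameter 0.6
rng = np.random.default_rng(1)
N, T, rho_true = 5, 200, 0.6
S_N_true = np.diag([0.5, 1.0, 1.5, 2.0, 2.5])
E = np.linalg.cholesky(S_N_true) @ rng.standard_normal((N, T)) \
    @ np.linalg.cholesky(ar1_corr(T, rho_true)).T
sig_N, rho = fit(E)
```

Because the time factor is pinned to a low-dimensional correlation family, a single observed error matrix suffices, in line with the identifiability discussion above.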

4.3 Example: AR(1) process

The previous section stated the general formulation of the problem without any particular parametrisation of the matrix . In this section, we give a concrete example by considering the widely used AR(1) model, which assumes simple, serial correlation between prediction errors. Denoting the t-th column of as , the general AR(1) model assumes that

(30)

where is a matrix of autoregression coefficients [53]. In the following, we assume a simplified version, where the n-th element of is assumed only to be serially correlated with the n-th element of , in a way that is both stationary and uniform across all n. In effect, this requires that . This is, in essence, the multivariate extension of the AR(1) model considered in the seminal paper by Cochrane and Orcutt [54]. In the limit as , the correlation matrix , and its inverse , have a well-known form, which can be truncated approximately for large enough T to the form given by

(31)

where is a matrix with 0 on the diagonal and with 1 on the first upper and lower diagonals, and is the identity matrix with zero on the first and last diagonal entries [55]. Note also that the determinant of is given by [56]. To be a valid stationary process, the parameter must lie in the interval (–1,1). An appropriate prior distribution for could therefore be

(32)

for some scalar parameter . Depending on the value of , this prior is roughly uniform across the majority of the interval, whilst rapidly decreasing in likelihood towards –1 and 1. This effectively encodes a belief that the time series is stationary across all nodes, with controlling the strength of that stationarity assumption. The resultant -dependent part of the cost function of Eq (29) therefore becomes

(33)

Whilst this is not a quadratic function of , the unique maximiser can be found by differentiating this expression and setting the result to zero. This results in a MAP estimator which is the real root of a cubic polynomial.

(34)

where

(35)

A derivation of this expression can be found in the supplementary materials. Note that this estimator is a function of .
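For the stationary AR(1) correlation matrix with entries ρ^|t1−t2|, the tridiagonal structure of the inverse in Eq (31) and the determinant formula can be checked numerically; for this parametrisation the closed form is in fact exact (NumPy sketch, sizes illustrative):

```python
import numpy as np

T, rho = 6, 0.7
idx = np.arange(T)
sigma_T = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation matrix

# Closed-form inverse: (I + rho^2 * I_tilde - rho * D) / (1 - rho^2), where
# D has 1 on the first off-diagonals and I_tilde is the identity with
# zeroed first and last diagonal entries.
D = np.diag(np.ones(T - 1), 1) + np.diag(np.ones(T - 1), -1)
I_tilde = np.eye(T); I_tilde[0, 0] = I_tilde[-1, -1] = 0
inv_closed = (np.eye(T) + rho**2 * I_tilde - rho * D) / (1 - rho**2)
assert np.allclose(inv_closed, np.linalg.inv(sigma_T))

# Determinant: |Sigma_T| = (1 - rho^2)^(T - 1)
assert np.isclose(np.linalg.det(sigma_T), (1 - rho**2) ** (T - 1))
```

Having the inverse and determinant in closed form is what makes the ρ-dependent part of the cost function cheap to evaluate at every iteration.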

In terms of the cross-covariance matrix , we choose to implement a modified form of the estimator proposed in [57] and further developed in [58]. In essence, these papers considered a weighted combination of the high-variance and often ill-posed maximum likelihood estimate, , and the heavily-biased but well-conditioned estimate, . The only modification presented here is the introduction of into the ML estimate, that is, , which is necessary to account for the effect of autocorrelation.

(36)

[57] and [58] also investigate the optimal setting of the shrinkage coefficient , which is a parameter between 0 and 1. Here, we choose to implement the Rao-Blackwell Ledoit-Wolf (RBLW) estimator described in [58]. This is given by

[Eq (34)]
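The underlying shrinkage construction can be sketched as follows; the specific formula for the optimal coefficient (e.g. RBLW) is given in [58], so here it is simply left as a free parameter, and all names are our own:

```python
import numpy as np

def shrink_cov(S, lam):
    """Convex combination of a sample covariance S (possibly singular) and
    the well-conditioned target (tr(S)/N) * I, with 0 <= lam <= 1.
    The optimal lam (e.g. the RBLW formula) is given in the cited papers."""
    N = S.shape[0]
    target = np.trace(S) / N * np.eye(N)
    return (1 - lam) * S + lam * target

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 10))   # only 3 samples of a 10-dimensional variable
S = X.T @ X / 3                    # rank-deficient sample covariance
S_shrunk = shrink_cov(S, lam=0.3)

print(np.linalg.matrix_rank(S), np.linalg.matrix_rank(S_shrunk))
```

Even a small amount of shrinkage restores full rank, which is what keeps the node covariance estimate invertible inside the alternating algorithm.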

4.4 GLS kernel graph regression

To complete the GLS kernel regression algorithm, estimation of both and must be performed simultaneously. Since both rely on each other to be estimated, it is necessary to implement an alternating iterative algorithm, where first is estimated for some reasonable initial guess of and , and then and are estimated for this value of . This process then continues until some convergence criterion is met. These steps are outlined in algorithm 5.
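The alternating structure is the same as in the classical iterated Cochrane–Orcutt procedure [54], which the following self-contained sketch illustrates in the simplest scalar setting (a single time series rather than a graph signal; all names are our own):

```python
import numpy as np

def cochrane_orcutt(X, y, n_iter=20, tol=1e-8):
    """Iterated Cochrane-Orcutt: alternate a GLS fit of beta with an AR(1)
    estimate of the residual autocorrelation rho, until both stabilise."""
    rho = 0.0
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        # Quasi-difference the data to whiten AR(1) errors, then refit
        Xs = X[1:] - rho * X[:-1]
        ys = y[1:] - rho * y[:-1]
        beta_new = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        r = y - X @ beta_new
        rho_new = (r[1:] @ r[:-1]) / (r[:-1] @ r[:-1])
        if abs(rho_new - rho) < tol and np.allclose(beta_new, beta):
            beta, rho = beta_new, rho_new
            break
        beta, rho = beta_new, rho_new
    return beta, rho

# Synthetic check with known beta = (1, 2) and rho = 0.6
rng = np.random.default_rng(3)
T = 2000
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.6 * e[t - 1] + rng.standard_normal()
y = X @ np.array([1.0, 2.0]) + e
beta, rho = cochrane_orcutt(X, y)
```

GLS KGR replaces the scalar regression step with the kernel graph estimator of Eq (26) and the scalar ρ update with the MAP estimators of the previous sections, but the outer loop has the same shape.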

Estimation of the hyperparameters of the kernel is performed using a grid search procedure, as discussed further in Section 7.1 below. Additional details of this hyperparameter estimation approach, together with detailed worked examples on the synthetic and real data considered in this work and further real data examples, are available in the GitHub repository for the code at https://github.com/nickelnine37; see also the further discussion in [38]. In addition, the following Section 4.5 explores the model sensitivity to various hyperparameters.

4.5 Investigating model sensitivity to hyperparameter selection

In this subsection, we study the effect that the various hyperparameters have on the properties of the GLS-KGR algorithm, using synthetic data. The aim is to provide some intuition for how these variables should be set or learned on real data. The key variables of interest are γ, which acts as a global regularisation parameter, β, which dictates how smooth the graph signals are expected to be, α, which controls the strength of the stationarity assumption, and σ, which affects the local variance scale of the Gaussian kernel. We are interested in two key effects that these hyperparameters have on the algorithm, namely, their effect on the prediction accuracy, and their effect on the convergence rate.

[algorithm 3]

[algorithm 4]

Each experiment was set up in the following way. First, a smooth underlying graph signal was generated by applying a chain-graph diffusion filter to a matrix of Gaussian noise. 20% of the times and nodes were chosen uniformly at random to be hidden, giving two selection matrices and . A model error matrix was then generated from a matrix normal distribution with known covariance matrices and , and added to to create the observed signal . Finally, a matrix of Gaussian noise with 12 columns was created to act as the explanatory variables. This model was run under a variety of hyperparameter conditions, with the number of steps required for convergence, and the Root Mean Squared Error (RMSE) on both the seen and unseen node-times recorded on each run.

First, we analyse the convergence properties. The experimental variables , , and were found to have a negligible effect on the convergence rate of the GLS Kernel Graph Regression algorithm. That is, when , and remained fixed, varying any of the aforementioned experimental parameters did not affect the number of iterations required to reach a minimum level of precision across the model outputs and .

On the other hand, we found that , and all had a significant effect on convergence. In order to analyse this systematically, we fixed N = 100, T = 120, , and . We then varied from to in 9 increments of 0.2, and varied from 10⁻³ to 10¹ in 50 logarithmically spaced increments. For each unique pair we generated 50 unique sets of input data , , and randomly as specified at the start of this section, and ran the GLS KGR model to give unique trials. For each trial, we counted the total number of iterations required for convergence, including both the inner loop detailed in algorithm 4 and the outer loop of algorithm 5. This experiment was performed twice: in experiment (a) we set , and in experiment (b) we set , where is the Kronecker-delta symbol equal to 1 if i = j and 0 otherwise.

The mean number of iterations required for convergence for each pair is shown in Fig 3.

thumbnail
Fig 3. The mean number of iterations required for convergence on synthetic data is shown over a range of values for and .

The upper plot shows the results of experiment (a), when , and the lower plot shows the results for experiment (b), when .

https://doi.org/10.1371/journal.pone.0324087.g003

In general, we can see that , θ and α have a complex interaction with the number of iterations. In both experiments (a) and (b) we see that convergence is roughly symmetric for , with this symmetry slightly broken in experiment (b). In both cases, a spike in the number of iterations is observed for a certain value of α, although this is not observed in the experimental α range for in experiment (a). Further investigation into this interaction from a theoretical standpoint would be valuable; however, we leave this to a future study.

Next, we studied the effect that the hyperparameters had on prediction accuracy across the test and train sets. Here, by test set, we mean all unobserved node-times, and by train set, we mean all observed node-times. Since 80% of the nodes and 80% of the times were observed, this means that 64% of all node-times were in the train set, and 36% were in the test set. In this case, the most important parameters were γ, β and σ, with having a negligible impact on the prediction accuracy.

First, we used gradient descent to find the optimal parameters, with an objective function that evaluated the mean test set RMSE across 50 random realisations of the input data. The optimal values, in this case, were found to be , and . Next, we fixed two of these parameters and varied the third across a range of values, measuring the mean test and training set RMSE across 50 random realisations of the input data. The results are shown in Figs 4, 5, 6, 7. As is visible, the hyperparameter with the largest effect, and therefore the most important to set appropriately, is γ. This is expected, since it acts as a global regularisation parameter. β and σ also have a small but noticeable effect.

thumbnail
Fig 4. The Root Mean Squared Error on the test and train sets are shown as is varied.

https://doi.org/10.1371/journal.pone.0324087.g004

thumbnail
Fig 5. The Root Mean Squared Error on the test and train sets are shown as is varied.

https://doi.org/10.1371/journal.pone.0324087.g005

thumbnail
Fig 6. The Root Mean Squared Error on the test and train sets are shown as is varied.

https://doi.org/10.1371/journal.pone.0324087.g006

thumbnail
Fig 7. The Root Mean Squared Error on the test and train sets are shown as a function of the total percentage of node-times that were observed.

https://doi.org/10.1371/journal.pone.0324087.g007

As a final experiment, we also measured the accuracy across both test and train sets as a function of the percentage of nodes and times that were observed. Here, we ran the experiment as before, with the optimal hyperparameters, but increased the total percentage of node-times that were observed from p = 0.05 to p = 0.99 in 50 equally spaced increments. On each run, we set and , rounded to the nearest integer.

5 Latent signal uncertainty via the Laplace approximation

In this final methodology section, we outline the steps for estimating the uncertainty over the latent signal via the Laplace approximation. In the case of KGR, this is in fact not an approximation, but the exact Gaussian process prior. On the other hand, in the GLS case, it is indeed an approximation as point estimates for the uncertain covariance matrices and are used. Here, we derive the approximate posterior for the GLS case only but note that the exact posterior for simple KGR can be restored by setting and equal to the identity matrix of an appropriate size.

Since the latent signal has size , the true covariance matrix specifying its uncertainty has size NT × NT. Deriving a mathematical expression for this matrix is simple; however, it is generally impractical to compute and store in memory for all but the smallest of problems. For this reason, we take steps to derive an efficient expression for the matrix representing the marginal variance of each element of , which is generally the most useful information and allows scaling to relatively large problems.

Consider the negative log-likelihood given in Eq (24). Applying Laplace’s approximation gives the covariance matrix for the uncertainty over the vectorized latent signal.

(37)(38)

Holding this dense matrix in memory is often impractical, and inverting it is generally out of the question. Instead, we introduce the matrix defined by . That is, is the diagonal of of length NT stacked into an matrix of marginal variances, such that element (n,t) is the squared uncertainty for the n-th node at time t. Here we present an expression for , followed by a short derivation.

(39)

where is the Hadamard product,

and

Note that the eigendecompositions are no longer necessarily performed on symmetric matrices. This means that while and cannot be assumed to be unitary, the eigenvalues are still guaranteed to be positive and real [47]. To prove this expression, first note that can be factorized into the following:

From here, substitute in the eigendecomposition definitions.

This can be split into the following sum.

where is defined to be a matrix of zeros with element (t,t) equal to one. From this, it can be seen that the re-stacked diagonal is given by the following outer product sum.

where

Or more compactly

Further detail concerning this derivation can be found in the supplementary materials.
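The same Hadamard-product device can be demonstrated on a generic Kronecker-plus-ridge matrix (a simplified stand-in for the posterior precision; NumPy sketch with names of our own), recovering the restacked diagonal without ever forming the NT × NT inverse:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, gamma = 5, 4, 0.1

A = rng.standard_normal((N, N)); S_N = A @ A.T + np.eye(N)
B = rng.standard_normal((T, T)); S_T = B @ B.T + np.eye(T)

lam, U = np.linalg.eigh(S_N)
mu, V = np.linalg.eigh(S_T)

# Marginal variances: the diagonal of inv(S_T kron S_N + gamma * I),
# restacked as an N x T matrix, via Hadamard-squared eigenvectors.
G = 1.0 / (np.outer(lam, mu) + gamma)     # N x T grid of 1 / (lam_i * mu_j + gamma)
M = (U * U) @ G @ (V * V).T

# Dense check on this small example (vec stacks columns: index = t*N + n)
dense = np.linalg.inv(np.kron(S_T, S_N) + gamma * np.eye(N * T))
M_dense = np.diag(dense).reshape(T, N).T
assert np.allclose(M, M_dense)
```

The eigen-route costs two small decompositions plus an N × T matrix product, as against inverting an NT × NT matrix for the dense route.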

6 Experimental results

6.1 Spatio-temporal pollutant analysis

In this section, we consider the problem of predicting the concentration of various airborne pollutants, measured across a network of air quality monitoring stations in and around California. Each pollutant type is measured at a unique set of locations and is therefore treated as an independent graph regression task. The goal is to make accurate predictions using weather and environmental features from the previous day. The methods of Kernel Graph Regression and AR(1) GLS Kernel Graph Regression are analysed and compared to some standard baseline algorithms.

Pollutant concentration data was taken from the US Environmental Protection Agency’s air quality monitoring program [59]. Specifically, daily measurements of Ozone, Carbon Monoxide (CO), Nitrogen Dioxide (NO2), PM2.5 and PM10 were taken from January 2017 to April 2021, giving a total of T = 1570 days. This data set also contains daily measurements of humidity, pressure, wind speed and temperature at various locations, which we use as additional explanatory variables. In addition, data concerning historical wildfires in California was sourced from the Department of Forestry and Fire Protection in California [60].

6.2 Graph construction

A key decision that can significantly impact the effectiveness of a graph signal processing method is how to construct the underlying graph. In certain applications, such as social networks, a sparse graph may be self-evident or relatively simple to construct. However, for a large portion of practical problems, including the current case of geographically placed sensors, it is required to either propose a sensible construction method or learn a graph from available signal data. In this paper we opt for the former, and omit a full discussion of the available techniques, which can be found in, for example, [61].

The graph construction method we use makes use of both pairwise geodesic distances between monitors and the intermediate elevation profile. This is important as environmental processes can be strongly influenced by topography, especially in mountainous regions. Elevation data is sourced from the GLOBE30 project [62]. This dataset gives the approximate height above sea level over a 30 arc-second latitude/longitude grid. While more complex models incorporating land use, prevailing wind direction etc. are possible, we opt for a simpler model for the sake of brevity.

Our first step is to create a symmetric distance matrix . We define the “distance” between two monitors to be a weighted combination of their geodesic distance and the vertical relief along the intermediate path. This introduces a hyperparameter defining the relative importance of each component, which we learn later via cross-validation. We then use the perturbed minimum spanning tree algorithm outlined in [63] to construct a sparse, fully connected graph. A representation can be found in Fig 8.
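The construction can be sketched with SciPy as follows; here a plain minimum spanning tree stands in for the perturbed-MST algorithm of [63], and the coordinates and relief penalties are random stand-ins for the real monitor data:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(6)
n = 12
coords = rng.random((n, 2))                 # stand-ins for monitor locations

# Symmetric stand-in for the vertical-relief penalty between each pair
relief = rng.random((n, n))
relief = (relief + relief.T) / 2
np.fill_diagonal(relief, 0)

w = 0.3                                     # relative weight of the relief term
D = squareform(pdist(coords)) + w * relief  # combined "distance" matrix

# Plain MST as a simplified stand-in for the perturbed-MST construction
mst = minimum_spanning_tree(D)
adj = (mst + mst.T).toarray() > 0           # symmetrised boolean adjacency
```

The resulting graph is sparse (n − 1 edges) and connected by construction, which is the property the downstream Laplacian-based filters rely on.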

thumbnail
Fig 8. A graphical depiction of the graph connecting the 1382 available monitoring stations, with the map colour representing the elevation profile.

This is the graph used in all further experiments. “Reprinted from [38] under a CC BY license, with permission from [Heriot-Watt University], original copyright [2024].” Source available at: http://hdl.handle.net/10399/5041

https://doi.org/10.1371/journal.pone.0324087.g008

6.3 Data pre-processing

Several steps were taken to pre-process the raw sensor data before feeding it into the models. Firstly, any stations with more than a total of 15 null readings due to equipment failure over the observation period were discarded. Any remaining missing values were interpolated linearly between the available readings. Secondly, meter readings for each of the pollutant variables, which generally have units of mass or particle count per unit volume, were transformed via . This rectified the skew of the readings, generally producing more symmetric aggregate histograms. The same transformation was also applied to wind speed; however, pressure, humidity and temperature, which already had largely symmetric data distributions, were not transformed in this way.

To construct a daily set of features for wildfires, a historical list of all recorded wildfire events in California was used. This dataset includes information about the start and end date of each fire, along with the central location and the total ground area burned. For each fire we assumed that the burn rate rose and fell with a Gaussian shape, peaking at the midpoint of the burn window, such that 95% of the area burns within the specified time range. From this we estimate the total area burning on any given day in eleven different regions in California.
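This daily apportioning can be sketched directly from the stated assumptions (function and variable names are our own; ±1.96σ is taken to span the burn window so that roughly 95% of the area falls inside it):

```python
import numpy as np
from math import erf, sqrt

def daily_burn(total_area, n_days):
    """Spread total_area over a burn window with a Gaussian profile that
    peaks at the midpoint and puts ~95% of the area inside the window."""
    mu = n_days / 2
    sigma = n_days / (2 * 1.96)          # +/- 1.96 sigma spans the window
    edges = np.arange(n_days + 1)
    # Gaussian CDF at each day boundary; successive differences give daily area
    cdf = np.array([0.5 * (1 + erf((x - mu) / (sigma * sqrt(2)))) for x in edges])
    return total_area * np.diff(cdf)

burn = daily_burn(1000.0, 10)            # 1000 acres over a 10-day window
```

Summing the daily areas recovers ~95% of the total, with the remainder notionally falling just outside the recorded burn window.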

The final step before estimation was to transform all variables so that they were scaled and translated to achieve a unit marginal variance and zero mean. PCA dimensionality reduction was then performed on each of the feature groups individually, keeping enough dimensions to preserve 90% of the total variance for that group. The resultant input data for and had 1570 columns representing the dates from 2017-01-02 to 2021-04-20. Each of the N rows of held a time series of readings for a unique monitoring sensor with a well-defined location. The columns of contain, first, the PCA components of the weather conditions and, second, the approximate number of acres burning in each of the eleven regions. Regression was then performed for each of the five different pollutants. Table 3 gives information about the total number of features that were available after performing these processing steps.

thumbnail
Table 3. Information about the number of available features before and after PCA.

https://doi.org/10.1371/journal.pone.0324087.t003

6.4 Results

To test the performance of the KGR and GLS KGR methods, we compared them against several other regression algorithms, namely Ordinary Least Squares (OLS) regression, Ridge regression, Lasso regression and Elastic Net regression.

For each model tested, all relevant hyperparameters were tuned using cross-validation. First, the input data was split such that the first 80% of the days served as a training/validation set and the final 20% of the days as a test set. The train/validation set was then split into four folds of 219 days each uniformly at random. Hyperparameters were set by attempting to minimise the mean squared error averaged across each of these four folds, using three to train and one to validate accordingly. Final results were then reported on the test set. Fig 9 shows the predicted Ozone levels across the full graph (right) on a particular day given the partially observed signal (left). As expected, the output is fairly smooth across the network, indicating that the model has successfully utilised information from the graph structure. Fig 10 shows the output from the GLS KGR algorithm at a particular unobserved node. As is visible, the estimated reading approximates the ground truth well, especially during train times. In addition, the uncertainty about the estimate notably increased during the test times, further indicating a healthy model output.

thumbnail
Fig 9. Left: the observed graph signal for Ozone on a particular day represented via a colour map.

Right: prediction for the reconstructed latent graph signal across the entire network on the same day made by the GLS KGR method. “Reprinted from [38] under a CC BY license, with permission from [Heriot-Watt University], original copyright [2024].” Source available at: http://hdl.handle.net/10399/5041

https://doi.org/10.1371/journal.pone.0324087.g009

thumbnail
Fig 10. A section of the Ozone time-series signal predicted by GLS KGR at a particular unobserved node is shown along with the ground truth.

The blue shading depicts two standard deviations of prediction error arising from latent signal uncertainty as calculated by the Laplace approximation.

https://doi.org/10.1371/journal.pone.0324087.g010

Table 4 shows the mean squared error as reported on the test set after hyperparameter tuning. As is visible, either GLS KGR or KGR performs the best by this metric across all pollutants. Table 5 shows the total compute time for each method on a 4-core i7 intel CPU.

7 Discussion

7.1 Hyperparameter tuning

One weakness with KGR and GLS KGR in practice is that several hyperparameters must be tuned to produce an accurate model, namely and , as well as any used in the graph construction stage. Each is important and can have a significant effect on the output produced. For example, insufficient regularisation via can cause severe over-fitting and result in unduly small entries within the estimated cross-covariance matrix . Similarly, setting too high can effectively drive the estimated value of to zero, while setting it too low can result in singular matrices if the data is at all non-stationary. This creates a large, non-convex search space that can be both costly to explore and full of local minima. The degree to which this is an issue depends largely on the size of the problem at hand. For relatively small problems, such as the one considered here with nodes and time steps, well-known derivative-free minimisation algorithms such as Nelder–Mead can help automate this process. However, with significantly larger problems, hyperparameters would have to be carefully selected based on domain knowledge.

7.2 The normalised graph Laplacian

One model choice not discussed so far is the question of whether to use the normalised or un-normalised version of the graph Laplacian. Until now, we have assumed the regular Laplacian, , is being used; however, the normalised version, defined by , has properties that may make it preferable in some circumstances. For example, graph penalties constructed using give equal weight to all nodes, whereas penalties constructed using implicitly favour high-degree nodes because they appear more frequently in the sum of Eq (1) [27]. This may or may not be desirable depending on the problem at hand. Another potential benefit of is that its eigenvalues are guaranteed to fall in the interval [0,2], which makes the graph-regularisation parameter easier to set and interpret, as well as making a sensible comparison across problems possible. In the case of environmental monitoring networks, our initial experiments indicate that the normalised Laplacian may result in slightly better performance; however, further investigation is necessary.

7.3 Options for increasing scalability

In the previous analysis, we assumed that, while the graph Laplacian is often sparse, the filter and kernel matrices and were dense. This typically means that the primary bottleneck for KGR and GLS KGR is the eigendecomposition of these matrices, even in their down-sampled form. For large problems, it may be sufficient to only calculate the first k eigenvectors and eigenvalues, which can be performed efficiently for sparse matrices. Therefore, finding sparse representations for and can be highly beneficial for large applications. Numerous methods also exist in the literature on kernel regression for improving computational efficiency. One, in particular, is the Nyström method, which uses a low-rank approximation to the kernel matrix [64]. This can reduce the computational complexity of solving the linear system to , and the memory requirements to O(M² + MN), where M is the chosen number of data examples.
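A minimal Nyström sketch (Gaussian kernel, random landmark choice, and all sizes illustrative) shows the rank-M structure of the approximation:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, sigma = 300, 40, 1.0
X = rng.standard_normal((n, 2))

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

K = gaussian_kernel(X, X, sigma)              # full n x n kernel (for comparison)

idx = rng.choice(n, size=m, replace=False)    # m landmark points
C = gaussian_kernel(X, X[idx], sigma)         # n x m cross-kernel
W = C[idx]                                    # m x m landmark block
K_nys = C @ np.linalg.pinv(W) @ C.T           # rank-m Nystrom approximation

err = np.linalg.norm(K - K_nys) / np.linalg.norm(K)
print(round(err, 3))
```

Only C and W need to be formed in practice, giving the O(M² + MN) memory footprint quoted above; the approximation is exact on the landmark block.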

One way to create a sparse graph filter is to simply use a low-degree polynomial for the filter function . In this way, can be calculated efficiently as and decomposed faster. When it comes to the kernel matrix, a large literature also exists on compactly supported kernels, for example, the Wendland kernel [65]. Here, an integral operator is applied recursively to Askey’s truncated power functions to create a set of compactly supported radial basis functions, which are guaranteed to be smooth, continuous and positive definite. Fast algorithms for the computation of these basis functions have been outlined in [66].
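For example, with a degree-2 polynomial filter applied to a sparse chain-graph Laplacian (SciPy sketch; the coefficients are illustrative), the filter matrix stays banded and cheap to store:

```python
import numpy as np
from scipy.sparse import diags, identity

n = 1000
# Sparse chain-graph Laplacian (tridiagonal)
off = -np.ones(n - 1)
deg = np.r_[1.0, 2 * np.ones(n - 2), 1.0]
L = diags([off, deg, off], [-1, 0, 1], format="csr")

# Low-degree polynomial filter H = I - 0.3 L + 0.05 L^2 (coefficients illustrative)
H = identity(n, format="csr") - 0.3 * L + 0.05 * (L @ L)

# The bandwidth of H grows only with the polynomial degree, so the number of
# stored entries is O(n) rather than the n^2 of a dense filter
print(H.nnz, n * n)
```

Because H inherits the sparsity pattern of low powers of L, sparse eigensolvers can then extract its leading eigenpairs efficiently.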

8 Conclusions and future work

This paper has contributed a statistical analysis of regression methods for signals defined over networks. Drawing upon recent work on Kernel Graph Regression, Gaussian Processes over Graphs and graph signal reconstruction, we proposed the method of GLS KGR and demonstrated its effectiveness on a relevant task. In particular, we addressed the situation where a partial observation of an N-node graph signal is made at a subset of T time-points, deriving the steps for an AR(1) autocorrelation regression model with general cross-correlation. By assuming a matrix-normal error distribution, an algorithm was designed with O(N³ + T³) complexity at each iteration. Finally, the Laplace approximation was used to derive a lower bound for the marginal prediction error arising from latent signal uncertainty. This was tested and shown to be effective on real data taken from a network of pollutant monitoring stations.

One assumption in this work was that the unobserved nodes remain so over the lifetime of the problem. This is somewhat unrealistic in applications such as sensor networks where equipment may temporarily fail. In the future, it would be valuable to work on adapting these algorithms for arbitrary missing values. Another worthwhile investigation would be a systematic study into the effect of different graph construction methods within the context of graph regression.

References

  1. 1. Shuman DI, Narang SK, Frossard P, Ortega A, Vandergheynst P. The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process Mag. 2013;30(3):83–98.
  2. 2. Zhu X, Lafferty J, Ghahramani Z. Semi-supervised learning using Gaussian fields and harmonic functions. In: AISTATS 2005 - Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics. Bridgetown, Barbados, January 6-8, 2005. Society for Artificial Intelligence and Statistics; 2005.
  3. 3. Dong X, Thanou D, Toni L, Bronstein M, Frossard P. Graph signal processing for machine learning: a review and new perspectives. arXiv, preprint, 2020.
  4. 4. Perozzi B, Al-Rfou R, Skiena S. DeepWalk. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press; 2014, pp. 701–10. https://doi.org/10.1145/2623330.2623732
  5. 5. Huang W, Bolton TAW, Medaglia JD, Bassett DS, Ribeiro A, Van De Ville D. A graph signal processing perspective on functional brain imaging. Proc IEEE. 2018;106(5):868–85.
  6. 6. Pirayre A, Couprie C, Duval L, Pesquet J-C. BRANE Clust: cluster-assisted gene regulatory network inference refinement. IEEE/ACM Trans Comput Biol Bioinform. 2018;15(3):850–60. pmid:28368827
  7. 7. Wagner R, Delouille V, Baraniuk R. Distributed wavelet de-noising for sensor networks. In: 2005 IEEE/SP 13th Workshop on Statistical Signal Processing. 2007, pp. 373–9.
  8. 8. Hu W, Chen S, Tian D. Graph spectral point cloud processing. Graph spectral image processing. Wiley; 2021, pp. 181–219. https://doi.org/10.1002/9781119850830.ch7
  9. 9. Kearnes S, McCloskey K, Berndl M, Pande V, Riley P. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016;30(8):595–608. pmid:27558503
  10. 10. Tsitsvero M, Barbarossa S, Di Lorenzo P. Signals on graphs: uncertainty principle and sampling. IEEE Trans Signal Process. 2016;64(18):4845–60.
  11. 11. Pu X, Chau SL, Dong X, Sejdinovic D. Kernel-based graph learning from smooth signals: a functional viewpoint. IEEE Trans Signal Inf Process Netw. 2021;7:192–207.
  12. 12. Chen S, Sandryhaila A, Moura JMF, Kovacevic J. Signal denoising on graphs via graph filtering. In: 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE; 2014, pp. 872–6. https://doi.org/10.1109/globalsip.2014.7032244
  13. 13. Venkitaraman A, Chatterjee S, Handel P. Predicting graph signals using kernel regression where the input signal is agnostic to a graph. IEEE Trans Signal Inf Process Netw. 2019;5(4):698–710.
  14. 14. Venkitaraman A, Chatterjee S, Handel P. Gaussian processes over graphs. arXiv, preprint, 2020.
  15. 15. Zhi Y, Ng Y, Dong X. Gaussian processes on graphs via spectral kernel learning. arXiv, preprint, 2020.
  16. 16. Dunson D, Wu H, Wu N. Graph based Gaussian processes on restricted domains. arXiv, preprint, 2021.
  17. 17. Li T, Levina E, Zhu J. Prediction models for network-linked data. Ann Appl Stat. 2019;13(1).
  18. 18. Romero D, Ma M, Giannakis GB. Kernel-based reconstruction of graph signals. IEEE Trans Signal Process. 2017;65(3):764–78.
  19. 19. Xiaojin Z, Jaz K, John L, Zoubin G. Graph kernels by spectral transforms. Semi-supervised learning. The MIT Press; 2006. pp. 276–91. https://doi.org/10.7551/mitpress/9780262033589.003.0015
  20. Mateos G, Segarra S, Marques AG, Ribeiro A. Connecting the dots: identifying network structure via graph signal processing. IEEE Signal Process Mag. 2019;36(3):16–43.
  21. Venkitaraman A, Maretic H, Chatterjee S, Frossard P. Supervised linear regression for graph learning from graph signals. arXiv preprint; 2018.
  22. Giannakis GB, Shen Y, Karanikolas GV. Topology identification and learning over graphs: accounting for nonlinearities and dynamics. Proc IEEE. 2018;106(5):787–807.
  23. Dong X, Thanou D, Rabbat M, Frossard P. Learning graphs from data: a signal representation perspective. IEEE Signal Process Mag. 2019;36(3):44–63.
  24. Liu H, Chen X, Wasserman L, Lafferty J. Graph-valued regression. In: Advances in Neural Information Processing Systems, vol. 23; 2010.
  25. Kovac A, Smith ADAC. Nonparametric regression on a graph. J Comput Graph Stat. 2011;20(2):432–47.
  26. Tayewo R, Septier F, Nevat I, Peters GW. Graph regression model for spatial and temporal environmental data-case of carbon dioxide emissions in the United States. Entropy (Basel). 2023;25(9):1272. pmid:37761572
  27. Fouss F, Saerens M, Shimbo M. Algorithms and models for network data and link analysis. Cambridge University Press; 2016.
  28. Djurić P, Richard C. Graph signal processing. In: Cooperative and graph signal processing: principles and applications. Academic Press; 2018, pp. 239–59.
  29. Sandryhaila A, Moura JMF. Discrete signal processing on graphs. IEEE Trans Signal Process. 2013;61(7):1644–56.
  30. Tremblay N, Gonçalves P, Borgnat P. Design of graph filters and filterbanks. In: Djurić P, Richard C, editors. Cooperative and graph signal processing: principles and applications. Academic Press; 2018, pp. 299–324.
  31. Kriege NM, Johansson FD, Morris C. A survey on graph kernels. Appl Netw Sci. 2020;5(1).
  32. Chepuri SP, Leus G. Graph sampling for covariance estimation. IEEE Trans Signal Inf Process Netw. 2017;3(3):451–66.
  33. Ioannidis VN, Romero D, Giannakis GB. Inference of spatio-temporal functions over graphs via multikernel kriged Kalman filtering. IEEE Trans Signal Process. 2018;66(12):3228–39.
  34. Kariya T, Kurata H. Generalized least squares. John Wiley & Sons; 2004.
  35. Bach F, Jordan M. Learning graphical models with Mercer kernels. In: Advances in Neural Information Processing Systems 15 (NIPS 2002). ACM Press; 2002.
  36. Rasmussen C, Williams C. Gaussian processes for machine learning. MIT Press; 2006.
  37. Petersen KB, Pedersen MS. The matrix cookbook. Technical University of Denmark; 2012.
  38. Antonian E, Peters GW, Chantler M. Bayesian reconstruction of Cartesian product graph signals with general patterns of missing data. J Franklin Inst. 2024;361(9):106805.
  39. Ortiz-Jiménez G, Coutino M, Chepuri S, Leus G. Sampling and reconstruction of signals on product graphs. In: 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP). 2018, pp. 713–7.
  40. Romero D, Ioannidis VN, Giannakis GB. Kernel-based reconstruction of space-time functions on dynamic graphs. IEEE J Sel Top Signal Process. 2017:1–1.
  41. Grassi F, Loukas A, Perraudin N, Ricaud B. A time-vertex signal processing framework: scalable processing and meaningful representations for time-series on graphs. IEEE Trans Signal Process. 2018;66(3):817–29.
  42. Lu N, Zimmerman DL. The likelihood ratio test for a separable covariance matrix. Stat Probab Lett. 2005;73(4):449–57.
  43. Werner K, Jansson M, Stoica P. On estimation of covariance matrices with Kronecker product structure. IEEE Trans Signal Process. 2008;56(2):478–91.
  44. Mardia KV, Goodall CR. Spatial-temporal analysis of multivariate environmental monitoring data. Multivariate Environ Stat. 1993;6:347–85.
  45. Park S, Shedden K, Zhou S. Non-separable covariance models for spatio-temporal data, with applications to neural encoding analysis. arXiv preprint; 2017.
  46. Zhang X. Statistical analysis for network data using matrix variate models and latent space models. PhD thesis. The University of Michigan; 2020.
  47. Horn RA, Johnson CR. Matrix analysis. 2nd edn. Cambridge University Press; 2013, pp. 486–7.
  48. Dutilleul P. The MLE algorithm for the matrix normal distribution. J Stat Comput Simul. 1999;64(2):105–23.
  49. Roy A, Khattree R. On implementation of a test for Kronecker product covariance structure for multivariate repeated measures data. Stat Methodol. 2005;2(4):297–306.
  50. Roy A, Khattree R. Testing the hypothesis of a Kronecker product covariance matrix in multivariate repeated measures data. In: Proceedings of SUGI 30; 2005. Paper 199–30, pp. 1–11.
  51. Srivastava MS, von Rosen T, von Rosen D. Models with a Kronecker product covariance structure: estimation and testing. Math Meth Stat. 2008;17(4):357–70.
  52. Soloveychik I, Trushin D. Gaussian and robust Kronecker product covariance estimation: existence and uniqueness. J Multivariate Anal. 2016;149:92–113.
  53. Campbell JY, Lo AW, MacKinlay AC. The econometrics of financial markets. Princeton University Press; 1997.
  54. Cochrane D, Orcutt GH. Application of least squares regression to relationships containing auto-correlated error terms. J Am Stat Assoc. 1949;44(245):32–61.
  55. Kariya T, Kurata H. Generalized least squares, vol. 7. Wiley; 2004.
  56. Peña D, Rodríguez J. The log of the determinant of the autocorrelation matrix for testing goodness of fit in time series. J Stat Plann Inference. 2006;136(8):2706–18.
  57. Ledoit O, Wolf M. A well-conditioned estimator for large-dimensional covariance matrices. J Multivariate Anal. 2004;88(2):365–411.
  58. Chen Y, Wiesel A, Eldar YC, Hero AO. Shrinkage algorithms for MMSE covariance estimation. IEEE Trans Signal Process. 2010;58(10):5016–29.
  59. US Environmental Protection Agency. Air quality system data mart. https://www.epa.gov/airdata.
  60. California Department of Forestry and Fire Protection (CAL FIRE). Incidents. https://www.fire.ca.gov/incidents/.
  61. Qiao L, Zhang L, Chen S, Shen D. Data-driven graph construction and graph learning: a review. Neurocomputing. 2018;312:336–51.
  62. Hastings D, Dunbar P, Elphingstone G, Bootz M, Murakami H, Maruyama H. The global land one-kilometer base elevation (GLOBE) digital elevation model, version 1.0. 1999. http://www.ngdc.noaa.gov/mgg/topo/globe.html
  63. Zemel R, Carreira-Perpiñán M. Proximity graphs for clustering and manifold learning. In: Saul L, Weiss Y, Bottou L, editors. Advances in Neural Information Processing Systems, vol. 17. MIT Press; 2004. Available from: https://proceedings.neurips.cc/paper/2004/file/dcda54e29207294d8e7e1b537338b1c0-Paper.pdf.
  64. Drineas P, Mahoney M. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J Mach Learn Res. 2005;6:2153–75.
  65. Wendland H. Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Adv Comput Math. 1995;4(1):389–96.
  66. Zhu S. Compactly supported radial basis functions: how and why? 2012.