Abstract
We consider four main goals when fitting spatial linear models: 1) estimating covariance parameters, 2) estimating fixed effects, 3) kriging (making point predictions), and 4) block-kriging (predicting the average value over a region). Each of these goals can present different challenges when analyzing large spatial data sets. Current research uses a variety of methods, including spatial basis functions (reduced rank) and covariance tapering, to achieve these goals. However, spatial indexing, which is very similar to composite likelihood, offers some advantages. We develop a simple framework for all four goals listed above by using indexing to create a block covariance structure and nearest-neighbor predictions while maintaining a coherent linear model. We show exact inference for fixed effects under this block covariance construction. Spatial indexing is very fast, and simulations are used to validate the methods and compare them to another popular method. We study various sample designs for indexing, and our simulations showed that indexing schemes leading to spatially compact partitions are best over a range of sample sizes, autocorrelation values, and generating processes. Partitions can be kept small, on the order of 50 samples per partition. We use nearest neighbors for kriging and block kriging, finding that 50 nearest neighbors is sufficient. In all cases, confidence intervals for fixed effects, and prediction intervals for (block) kriging, have appropriate coverage. Some advantages of spatial indexing are that it is available for any valid covariance matrix, can take advantage of parallel computing, and easily extends to non-Euclidean topologies, such as stream networks. For an example data set, we use stream networks to show how spatial indexing can achieve all four goals listed above, for very large data sets, in a matter of minutes rather than days.
Citation: Ver Hoef JM, Dumelle M, Higham M, Peterson EE, Isaak DJ (2023) Indexing and partitioning the spatial linear model for large data sets. PLoS ONE 18(11): e0291906. https://doi.org/10.1371/journal.pone.0291906
Editor: Mohamed R. Abonazel, Cairo University, EGYPT
Received: February 6, 2023; Accepted: September 7, 2023; Published: November 1, 2023
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: The SPIN method has been implemented in the spmodel R package https://cran.r-project.org/web/packages/spmodel/index.html. The example data can be downloaded from the Github repository, https://github.com/jayverhoef/midColumbiaLSN.git.
Funding: JVH: The project received financial support through Interagency Agreement DW-13-92434601-0 from the U.S. Environmental Protection Agency (EPA), and through Interagency Agreement 81603 from the Bonneville Power Administration (BPA), with the National Marine Fisheries Service, NOAA. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The general linear model, including regression and analysis of variance (ANOVA), is still a mainstay in statistics,
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad (1)$$
where $\mathbf{Y}$ is an $n \times 1$ vector of response random variables, $\mathbf{X}$ is the design matrix with covariates (fixed explanatory variables, containing any combination of continuous, binary, or categorical variables), $\boldsymbol{\beta}$ is a vector of parameters, and $\boldsymbol{\varepsilon}$ is a vector of zero-mean random variables, which are classically assumed to be uncorrelated, $\text{var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$. The spatial linear model is a version of Eq (1) where $\text{var}(\boldsymbol{\varepsilon}) = \mathbf{V}$, and $\mathbf{V}$ is a patterned covariance matrix that is modeled using spatial relationships. Generally, spatial relationships are of two types: spatially-continuous point-referenced data, often called geostatistics, and finite sets of neighbor-based data, often called lattice or areal data [1]. For geostatistical data, we associate random variables in Eq (1) with their spatial locations by denoting the random variables as $Y(\mathbf{s}_i); i = 1, \ldots, n$, and $\varepsilon(\mathbf{s}_i); i = 1, \ldots, n$, where $\mathbf{s}_i$ is a vector of spatial coordinates for the $i$th point, and the $i,j$th element of $\mathbf{V}$ is $\text{cov}(\varepsilon(\mathbf{s}_i), \varepsilon(\mathbf{s}_j))$. Table 1 provides a list of all of the main notation used in this article.
The main goals from a geostatistical linear model are to 1) estimate $\mathbf{V}$, 2) estimate $\boldsymbol{\beta}$, 3) make predictions of unobserved $Y(\mathbf{s}_j)$, where $\{\mathbf{s}_j; j = n+1, \ldots, N\}$ form a set of spatial locations without observations, and 4) for some region $\mathcal{B}$, make a prediction of the average value $Y(\mathcal{B}) = \frac{1}{|\mathcal{B}|}\int_{\mathcal{B}} Y(\mathbf{s})\,d\mathbf{s}$, where $|\mathcal{B}|$ is the area of $\mathcal{B}$. Estimation and prediction both require $O(n^2)$ storage for $\mathbf{V}$ and $O(n^3)$ operations for $\mathbf{V}^{-1}$ [2], which, for massive data sets, is computationally expensive and may be prohibitive. Our overall objective is to use spatial indexing ideas to make all four goals possible for very large spatial data sets. We maintain the moment-based approach of classical geostatistics, which is distribution free, and we work to maintain a coherent model of stationarity and a single set of parameter estimates.
Quick review of the spatial linear model
When the outcome of the random variable $Y(\mathbf{s}_i)$ is observed, we denote it $y(\mathbf{s}_i)$; these outcomes are contained in the vector $\mathbf{y}$. These observed data are used first to estimate the autocorrelation parameters in $\mathbf{V}$, which we will denote as $\boldsymbol{\theta}$. In general, $\mathbf{V}$ can have $n(n+1)/2$ parameters, but use of distance to describe spatial relationships typically reduces this to just 3 or 4 parameters. An example of how $\mathbf{V}$ depends on $\boldsymbol{\theta}$ is given by the exponential autocorrelation model, where the $i,j$th element of $\mathbf{V}$ is
$$V[i,j] = \tau^2\exp(-d_{i,j}/\rho) + \eta^2\mathcal{I}(d_{i,j} = 0), \quad (2)$$
where $\boldsymbol{\theta} = (\tau^2, \eta^2, \rho)'$, $d_{i,j}$ is the Euclidean distance between $\mathbf{s}_i$ and $\mathbf{s}_j$, and $\mathcal{I}(\cdot)$ is an indicator function, equal to 1 if its argument is true, otherwise it is 0. The parameter $\eta^2$ is often called the "nugget effect," $\tau^2$ is called the "partial sill," and $\rho$ is called the "range" parameter. In Eq (2), the variances are constant (stationary), and when $d_{i,j} = 0$ we denote them $\sigma^2 = \tau^2 + \eta^2$. Many other examples of autocorrelation models are given in [1, 3].
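As a concrete illustration, below is a minimal R sketch of building $\mathbf{V}$ under Eq (2); the function name exp_cov() is ours for illustration, not from any package.

```r
# Minimal sketch of Eq (2): build V from coordinates and theta = (tau2, eta2, rho)
exp_cov <- function(coords, tau2, eta2, rho) {
  d <- as.matrix(dist(coords))  # Euclidean distances d_ij between all rows
  V <- tau2 * exp(-d / rho)     # partial sill times exponential correlation
  diag(V) <- tau2 + eta2        # sigma^2 = tau2 + eta2 on the diagonal (d_ij = 0)
  V
}

set.seed(1)
coords <- cbind(runif(100), runif(100))
V <- exp_cov(coords, tau2 = 10, eta2 = 0.1, rho = 0.5)
```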
We will use restricted maximum likelihood (REML) [4, 5] to estimate the parameters of $\mathbf{V}$. REML is less biased than full maximum likelihood [6]. REML estimates of covariance parameters are obtained by minimizing
$$\mathcal{L}(\boldsymbol{\theta}) = \log|\mathbf{V}_{\theta}| + \log|\mathbf{X}'\mathbf{V}_{\theta}^{-1}\mathbf{X}| + \mathbf{r}_{\theta}'\mathbf{V}_{\theta}^{-1}\mathbf{r}_{\theta} + c \quad (3)$$
for $\boldsymbol{\theta}$, where $\mathbf{V}_{\theta}$ depends on spatial autocorrelation parameters $\boldsymbol{\theta}$, $\mathbf{r}_{\theta} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{\theta}$, $\hat{\boldsymbol{\beta}}_{\theta} = (\mathbf{X}'\mathbf{V}_{\theta}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}_{\theta}^{-1}\mathbf{y}$, and $c$ is a constant that does not depend on $\boldsymbol{\theta}$. It has been shown [7, 8] that Eq (3) forms unbiased estimating equations for covariance parameters, so Gaussian data are not strictly necessary. After Eq (3) has been minimized for $\boldsymbol{\theta}$, these estimates, call them $\hat{\boldsymbol{\theta}}$, are used in the autocorrelation model, e.g. Eq (2), for all of the covariance values to create $\widehat{\mathbf{V}}$. This is the first use of the data $\mathbf{y}$. The usual frequentist method for geostatistics, with a long tradition [9], "uses the data twice" [10]. Now $\widehat{\mathbf{V}}$, along with a second use of the data, is used to estimate regression coefficients or make predictions at unsampled locations. By plugging $\widehat{\mathbf{V}}$ into the well-known best-linear-unbiased estimate (BLUE) of $\boldsymbol{\beta}$ for Eq (1), we obtain the empirical best-linear-unbiased estimate (EBLUE), e.g. [11],
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\widehat{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\widehat{\mathbf{V}}^{-1}\mathbf{y}. \quad (4)$$
The estimated variance of Eq (4) is
$$\widehat{\text{var}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\widehat{\mathbf{V}}^{-1}\mathbf{X})^{-1}. \quad (5)$$
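To make the estimation step concrete, here is a hedged R sketch of evaluating minus twice the restricted log-likelihood of Eq (3), dropping the constant $c$; reml_neg2ll() is illustrative and reuses the exp_cov() helper sketched above. In practice one would pass it to optim(), typically with a log-parameterization of $\boldsymbol{\theta}$ to enforce positivity.

```r
# Sketch of Eq (3) (dropping the constant c), evaluated for a given theta
reml_neg2ll <- function(theta, y, X, coords) {
  V     <- exp_cov(coords, theta[1], theta[2], theta[3])
  Vi    <- solve(V)
  XtViX <- t(X) %*% Vi %*% X
  beta  <- solve(XtViX, t(X) %*% Vi %*% y)   # GLS estimate of beta for this theta
  r     <- y - X %*% beta                    # GLS residuals
  as.numeric(determinant(V)$modulus +        # log|V|
             determinant(XtViX)$modulus +    # log|X'V^{-1}X|
             t(r) %*% Vi %*% r)              # r'V^{-1}r
}
```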
Let a single unobserved location be denoted $\mathbf{s}_0$, with a covariate vector $\mathbf{x}_0$ (containing the same covariates, and of the same length, as a row of $\mathbf{X}$). Then the empirical best-linear-unbiased prediction (EBLUP) [12] at an unobserved location is
$$\widehat{Y}(\mathbf{s}_0) = \mathbf{x}_0'\hat{\boldsymbol{\beta}} + \hat{\mathbf{c}}'\widehat{\mathbf{V}}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}), \quad (6)$$
where the $i$th element of $\hat{\mathbf{c}}$ is the estimated $\text{cov}(Y(\mathbf{s}_0), Y(\mathbf{s}_i))$, using the same autocorrelation model, e.g. Eq (2), and estimated parameters, $\hat{\boldsymbol{\theta}}$, that were used to develop $\widehat{\mathbf{V}}$. Note that if we condition on $\widehat{\mathbf{V}}$ and $\hat{\mathbf{c}}$ as fixed, then Eq (6) is a linear combination of $\mathbf{y}$, and can also be written as $\widehat{Y}(\mathbf{s}_0) = \boldsymbol{\lambda}'\mathbf{y}$ when Eq (4) is substituted for $\hat{\boldsymbol{\beta}}$. The prediction Eq (6) can be seen as the conditional expectation of $Y(\mathbf{s}_0)|\mathbf{y}$ with plug-in values for $\boldsymbol{\beta}$, $\mathbf{V}$, and $\mathbf{c}$. The estimated variance of EBLUP is [12]
$$\widehat{\text{var}}(\widehat{Y}(\mathbf{s}_0) - Y(\mathbf{s}_0)) = \hat{\sigma}_0^2 - \hat{\mathbf{c}}'\widehat{\mathbf{V}}^{-1}\hat{\mathbf{c}} + \hat{\boldsymbol{\delta}}'(\mathbf{X}'\widehat{\mathbf{V}}^{-1}\mathbf{X})^{-1}\hat{\boldsymbol{\delta}}, \quad (7)$$
where $\hat{\boldsymbol{\delta}} = \mathbf{x}_0 - \mathbf{X}'\widehat{\mathbf{V}}^{-1}\hat{\mathbf{c}}$ and $\hat{\sigma}_0^2$ is the estimated variance of $Y(\mathbf{s}_0)$ using the same covariance model as $\widehat{\mathbf{V}}$.
Spatial methods for big data
Here, we give a brief overview of the most popular methods currently used for large spatial data sets. There are various ways to classify such methods. For our purposes, there are two broad approaches. One is to adopt a Gaussian Process (GP) model for the data and then approximate the GP. The other is to model locally, essentially creating smaller data sets and using existing models.
There are several good reviews on methods for approximating the GP [13–16]. These methods include low rank ideas such as radial smoothing [17–19], fixed rank kriging [20–23], predictive processes [24, 25], and multiresolution Gaussian processes [26, 27]. Other approaches include covariance tapering [28–30], stochastic partial differential equations [31, 32], and factoring the GP into a series of conditional distributions [33, 34], which was extended to nearest neighbor Gaussian processes [35–38] and other sparse matrix improvements [39–41]. The reduced rank methods are very attractive, and allow models for situations where distances are non-Euclidean (for a review and example, see [42]), as well as fast computation.
Modeling locally involves an attempt to maintain classical geostatistical models by creating subsets of the data, using existing methods on the subsets, and then making inference from the subsets. For example, [43, 44] created local data sets in a spatial moving window, and then estimated variograms and used ordinary kriging within those windows. This idea allows for nonstationary variances, but forces an unnatural asymmetric autocorrelation because the range parameter changes when moving a window. It also does not estimate a single $\boldsymbol{\beta}$; rather, there is a different $\boldsymbol{\beta}$ for every point in space. Another early idea was to create a composite likelihood by taking products of subset-likelihoods and optimizing for autocorrelation parameters $\boldsymbol{\theta}$ [45]; the resulting estimate $\hat{\boldsymbol{\theta}}$ can then be held fixed when predicting in local windows. However, this does not solve the problem of estimating a single $\boldsymbol{\beta}$.
More recently, two broad approaches have been developed for modeling locally. One is a 'divide and conquer' approach, which is similar to [45]. Here, it is permissible to re-use data in subsets, or not use some data at all [46–48], with an overview provided by [49]. Another approach is a simple partition of the data into groups, where partitions are generally spatially compact [50–53]. This is sensible for estimating covariance parameters and will provide an unbiased estimator of $\boldsymbol{\beta}$; however, the estimated variance of $\hat{\boldsymbol{\beta}}$ will not be correct. Continuity corrections for predictions are provided, but predictions may not be efficient near partition boundaries.
A blocked structure for the covariance matrix based on spatially-compact groupings was proposed by [54], who then formulated a hybrid likelihood based on blocks of different sizes. The method that we feature is most similar to [54], but we show that there is no need for a hybrid likelihood, and that our approach is different than composite likelihood. Our spatial indexing approach is very simple and extends easily to random effects, and accommodates virtually any covariance matrix that can be constructed. We also show how to obtain the exact covariance matrix of estimated fixed effects without any need for computational derivatives or numerical approximations.
Motivating example
One of the attractive features of the method that we propose is that it will work with any valid covariance matrix. To motivate our methods, consider a stream network (Fig 1a). This is the Mid-Columbia River basin, located along part of the border between the states of Washington and Oregon, USA, with a small part of the network in Idaho as well (Fig 1b). The stream network consists of 28,613 stream segments. Temperature loggers were placed at 9,521 locations on the stream network, indicated by purple dots in Fig 1a. A close-up of the stream network, indicated by the dark rectangle in Fig 1a, is given as Fig 1c, where we also show a systematic placement of prediction locations with orange dots. There are 60,099 prediction locations that will serve as the basis for point predictions. The response variable is an average of daily maximum temperatures in August from 1993 to 2011. Explanatory variables obtained for both observations and prediction sites included elevation at temperature logger site, slope of stream segment at site, percentage of upstream watershed composed of lakes or reservoirs, proportion of upstream watershed composed of glacial ice surfaces, mean annual precipitation in watershed upstream of sensor, the northing coordinate, base-flow index values, upstream drainage area, a canopy value encompassing the sensor, mean August air temperature from a gridded climate model, mean August stream discharge, and occurrence of sensor in tailwater downstream from a large dam (see [55] for more details).
(a) A stream network from the mid-Columbia River basin, where purple points show 9521 sample locations that measured mean water temperature during August. (b) Most of the stream network is located in Washington and Oregon in the United States. (c) A close-up of the black rectangle in (a). The orange points are prediction locations.
These data were previously analyzed in [55] with geostatistical models specific to stream networks [11, 56]. The models were constructed as spatial moving averages, e.g., [57, 58], also called process convolutions, e.g., [59, 60]. Two basic covariance matrices are constructed, and then summed. In one, random variables were constructed by integrating a kernel over a white noise process strictly upstream of a site, which are termed “tail-up” models. In the other construction, random variables were created by integrating a kernel over a white noise process strictly downstream of a site, which are termed “tail-down” models. Both types of models allow analytical derivation of autocovariance functions, with different properties. For tail-up models, sites remain independent so long as they are not connected by water flow from an upstream site to a downstream site. This is true even if two sites are very close spatially, but each on a different branch just upstream of a junction. Tail-down models are more typical as they allow spatial dependence that is generally a function of distance along the stream, but autocorrelation will still be different for two pairs of sites that are an equal distance apart, when one pair is connected by flow, and the other is not.
When considering big data, such as those in Fig 1, we considered the methods as described in the previous section. The basis-function/reduced-rank approaches would be difficult for stream networks because an inspection of Fig 1 reveals that we would need thousands of basis functions in order to cover all headwater stream segments and run the basis functions downstream only. A separate set of basis functions would be needed that ran upstream, and then weighting would be required to split the basis functions at all stream junctions. In fact, all of the GP model approximation methods would require modifying a covariance structure that has already been developed specifically for stream networks. The spatial indexing method that we propose below is much simpler, requiring no modification to the covariance structure, and, as we will demonstrate, proved to be adequate, not only for stream networks, but more generally.
Objectives
In what is to follow, we will use spatial indexing, leading to covariance matrix partitioning and local predictions. We will use the acronym SPIN, for SPatial INdexing, as the collection of methods for covariance parameter estimation, fixed effects estimation, and point and block prediction. Our objective is to show how each of these inferences can be made computationally faster with SPIN, and still provide unbiased results with valid confidence/prediction intervals.
This article uses several acronyms. Table 2 provides a handy reference to the meaning of all acronyms used here.
Methods
The main advantage of the SPIN method is due to the way the covariance matrix is indexed and partitioned to allow for faster evaluation of the REML equations, Eq (3), whose optimization is iterative, requiring many evaluations involving the inverse of the covariance matrix. This optimization provides estimation of the covariance parameters, which we describe next.
Estimation of covariance parameters
Consider the covariance matrix $\mathbf{V}$ to be used in Eqs (4) and (6). First, we index the data to create a covariance matrix with $P$ partitions based on the indexes $\{i; i = 1, \ldots, P\}$,
$$\mathbf{V} = \begin{bmatrix} \mathbf{V}_{1,1} & \mathbf{V}_{1,2} & \cdots & \mathbf{V}_{1,P} \\ \mathbf{V}_{2,1} & \mathbf{V}_{2,2} & \cdots & \mathbf{V}_{2,P} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{V}_{P,1} & \mathbf{V}_{P,2} & \cdots & \mathbf{V}_{P,P} \end{bmatrix}. \quad (8)$$
In a similar way, imagine a corresponding indexing and partition of the spatial linear model as
$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \boldsymbol{\varepsilon}_i; \quad i = 1, \ldots, P. \quad (9)$$
Now, for the purposes of estimating covariance parameters, we optimize the REML equations, Eq (3), based on the block-diagonal covariance matrix
$$\mathbf{V}_{\text{part}} = \text{blockdiag}(\mathbf{V}_{1,1}, \mathbf{V}_{2,2}, \ldots, \mathbf{V}_{P,P}), \quad (10)$$
that is, Eq (8) with $\mathbf{V}_{i,j} = \mathbf{0}$ for all $i \neq j$, rather than Eq (8) itself. The computational advantage of using Eq (10) in Eq (3) is that we only need to invert the matrices $\mathbf{V}_{i,i}$ for all $i$, and, because we have large amounts of data, we assume that the $\{\mathbf{V}_{i,i}\}$ are sufficient for estimating covariance parameters. If the size of each $\mathbf{V}_{i,i}$ is fixed, then the computational burden grows linearly with $n$. Also, Eq (10) in Eq (3) allows for use of parallel computing because each $\mathbf{V}_{i,i}$ can be inverted independently.
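The following R sketch shows how Eq (3) can be evaluated with the block-diagonal $\mathbf{V}_{\text{part}}$ of Eq (10), looping (or parallelizing) over partitions; reml_neg2ll_part() is illustrative, building on the exp_cov() helper sketched earlier.

```r
# Sketch: Eq (3) evaluated with V_part of Eq (10); 'part' assigns each
# observation to one of P partitions.
reml_neg2ll_part <- function(theta, y, X, coords, part) {
  idx_list <- split(seq_along(y), part)
  Vi_list  <- vector("list", length(idx_list))
  logdetV  <- 0; XtViX <- 0; XtViy <- 0
  for (b in seq_along(idx_list)) {           # each block could run in parallel
    idx  <- idx_list[[b]]
    V_bb <- exp_cov(coords[idx, , drop = FALSE], theta[1], theta[2], theta[3])
    Vi_list[[b]] <- solve(V_bb)              # only block inverses are needed
    Xi <- X[idx, , drop = FALSE]
    logdetV <- logdetV + determinant(V_bb)$modulus
    XtViX   <- XtViX + t(Xi) %*% Vi_list[[b]] %*% Xi
    XtViy   <- XtViy + t(Xi) %*% Vi_list[[b]] %*% y[idx]
  }
  beta <- solve(XtViX, XtViy)                # pooled GLS, as in Eq (11) below
  quad <- 0
  for (b in seq_along(idx_list)) {
    idx  <- idx_list[[b]]
    r    <- y[idx] - X[idx, , drop = FALSE] %*% beta
    quad <- quad + t(r) %*% Vi_list[[b]] %*% r
  }
  as.numeric(logdetV + determinant(XtViX)$modulus + quad)
}
```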
Note that we are not concerned with the variance of $\hat{\boldsymbol{\theta}}$, which is generally true in classical geostatistics. Rather, $\boldsymbol{\theta}$ contains nuisance parameters that require estimation in order to estimate fixed effects and make predictions. Because data are massive, we can afford to lose some efficiency in estimating the covariance parameters. For example, sample sizes $\geq$ 125 are generally recommended for estimating the covariance matrix for geostatistical data [61], and REML is largely unbiased. If we have thousands of samples, and if we imagine partitioning the spatial locations into data sets (in ways that we describe later), then using Eq (10) in Eq (3) is, essentially, using REML many times to obtain a pooled estimate $\hat{\boldsymbol{\theta}}$.
Partitioning the covariance matrix is most closely related to the ideas of quasi-likelihood [62], composite likelihood [45], and divide and conquer [63]. However, for REML, they are not exactly equivalent. Consider the term $\log|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}|$ from Eq (3): using composite likelihood results in $\sum_{i=1}^{P}\log|\mathbf{X}_i'\mathbf{V}_{i,i}^{-1}\mathbf{X}_i|$, while using $\mathbf{V}_{\text{part}}$ results in $\log\left|\sum_{i=1}^{P}\mathbf{X}_i'\mathbf{V}_{i,i}^{-1}\mathbf{X}_i\right|$. An advantage of spatial indexing, when compared to composite likelihood, can be seen when $\mathbf{X}$ contains columns with many zeros, such as may occur for categorical explanatory variables. Then, partitioning $\mathbf{X}$ may result in some $\mathbf{X}_i$ with columns that are all zeros, which presents a problem when computing $\log|\mathbf{X}_i'\mathbf{V}_{i,i}^{-1}\mathbf{X}_i|$ for composite likelihood (the matrix is singular), but not when using $\mathbf{V}_{\text{part}}$.
The SPIN indexing can also allow for faster inversion of the covariance matrix when estimating fixed effects, but that requires some new results to obtain the proper standard errors of the estimated fixed effects, which we describe next.
Estimation of β
The generalized least squares estimate for $\boldsymbol{\beta}$ was given in Eq (4). Although the inverse $\mathbf{V}^{-1}$ only occurs once (as compared to repeatedly when optimizing the REML equations), it will still be computationally prohibitive if a data set has thousands of samples. Note that under the partitioned model Eq (9) with covariance matrix Eq (10), the estimator Eq (4) becomes
$$\hat{\boldsymbol{\beta}} = \mathbf{W}_{xx}^{-1}\mathbf{w}_{xy}, \quad (11)$$
where $\mathbf{W}_{xx} = \sum_{i=1}^{P}\mathbf{X}_i'\widehat{\mathbf{V}}_{i,i}^{-1}\mathbf{X}_i$ and $\mathbf{w}_{xy} = \sum_{i=1}^{P}\mathbf{X}_i'\widehat{\mathbf{V}}_{i,i}^{-1}\mathbf{y}_i$. This is a "pooled estimator" of $\boldsymbol{\beta}$ across the partitions. This should be a good estimator of $\boldsymbol{\beta}$ at a much reduced computational cost. It will also be convenient to show that Eq (11) is linear in $\mathbf{y}$, by noting that
$$\hat{\boldsymbol{\beta}} = \mathbf{Q}\mathbf{y}, \text{ where } \mathbf{Q} = \mathbf{W}_{xx}^{-1}\left[\mathbf{X}_1'\widehat{\mathbf{V}}_{1,1}^{-1}, \mathbf{X}_2'\widehat{\mathbf{V}}_{2,2}^{-1}, \ldots, \mathbf{X}_P'\widehat{\mathbf{V}}_{P,P}^{-1}\right]. \quad (12)$$
To estimate the variance of $\hat{\boldsymbol{\beta}}$ we cannot ignore the correlation between the partitions, so we consider the full covariance matrix Eq (8). If we compute the covariance matrix for Eq (11) under the full covariance matrix Eq (8), we obtain
$$\widehat{\text{var}}(\hat{\boldsymbol{\beta}}) = \mathbf{Q}\widehat{\mathbf{V}}\mathbf{Q}' = \mathbf{W}_{xx}^{-1}\mathbf{W}_{xVx}\mathbf{W}_{xx}^{-1}, \quad (13)$$
where $\mathbf{W}_{xVx} = \sum_{i=1}^{P}\sum_{j=1}^{P}\mathbf{X}_i'\widehat{\mathbf{V}}_{i,i}^{-1}\widehat{\mathbf{V}}_{i,j}\widehat{\mathbf{V}}_{j,j}^{-1}\mathbf{X}_j$. Note that while we set parts of $\mathbf{V} = \mathbf{0}$ in Eq (10) in order to estimate $\boldsymbol{\theta}$ and $\boldsymbol{\beta}$, we computed the variance of $\hat{\boldsymbol{\beta}}$ using the full $\mathbf{V}$ in Eq (8). Using a plug-in estimator, whereby $\boldsymbol{\theta}$ is replaced by $\hat{\boldsymbol{\theta}}$, no further inverses of any $\mathbf{V}_{i,j}$ are required if all $\widehat{\mathbf{V}}_{i,i}^{-1}$ are stored as part of the REML optimization. There is only a single additional inverse required, which is $R \times R$, where $R$ is the rank of the design matrix $\mathbf{X}$, and is already computed for $\mathbf{W}_{xx}^{-1}$ in Eq (11). Also note that if we simply substituted Eq (10) into Eq (5), then we would obtain only $\mathbf{W}_{xx}^{-1}$ as the variance of $\hat{\boldsymbol{\beta}}$. In Eq (13), $\mathbf{W}_{xVx}$ is the adjustment that is required for correlation among the partitions for a pooled estimate of $\boldsymbol{\beta}$. Partitioning the spatial linear model allows fast computation of Eq (11), while returning to the full model to develop Eq (13), which is a new result. This can be contrasted to the approaches for variance estimation of fixed effects using pseudo-likelihood, composite likelihood, and divide and conquer found in the earlier literature review.
Eq (13) is quite fast, and the number of inverse matrices to compute grows linearly (that is, if the observed sample size is $2n$, then there are twice as many inverses as for a sample of size $n$, if we hold partition size fixed). Also note that all inverses may already be computed as part of REML estimation of $\boldsymbol{\theta}$. However, Eq (13) is quadratic in pure matrix computations due to the double sum in $\mathbf{W}_{xVx}$. These can be made parallel, but may take too long for more than about 100,000 samples. One alternative is to use the empirical variation in the per-partition estimates $\hat{\boldsymbol{\beta}}_i = (\mathbf{X}_i'\widehat{\mathbf{V}}_{i,i}^{-1}\mathbf{X}_i)^{-1}\mathbf{X}_i'\widehat{\mathbf{V}}_{i,i}^{-1}\mathbf{y}_i$, where the $i$th matrix calculations are already needed for Eq (11) and can be simply computed and stored. Then, let
$$\widehat{\text{var}}_{\text{emp}}(\hat{\boldsymbol{\beta}}) = \frac{1}{P(P-1)}\sum_{i=1}^{P}(\hat{\boldsymbol{\beta}}_i - \bar{\boldsymbol{\beta}})(\hat{\boldsymbol{\beta}}_i - \bar{\boldsymbol{\beta}})', \quad (14)$$
where $\bar{\boldsymbol{\beta}} = \frac{1}{P}\sum_{i=1}^{P}\hat{\boldsymbol{\beta}}_i$, which has been used before for partitioned data, e.g. [64]. A second alternative is to pool the estimated variances of each $\hat{\boldsymbol{\beta}}_i$, which are $(\mathbf{X}_i'\widehat{\mathbf{V}}_{i,i}^{-1}\mathbf{X}_i)^{-1}$, to obtain
$$\widehat{\text{var}}_{\text{pool}}(\hat{\boldsymbol{\beta}}) = \frac{1}{P^2}\sum_{i=1}^{P}(\mathbf{X}_i'\widehat{\mathbf{V}}_{i,i}^{-1}\mathbf{X}_i)^{-1}, \quad (15)$$
where the first $P$ in the denominator averages the individual variances of the $\hat{\boldsymbol{\beta}}_i$, and the second $P$ is the reduction in variance due to averaging the $\hat{\boldsymbol{\beta}}_i$. Eqs (13)–(15) are tested and compared below using simulations.
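Below is a hedged R sketch of Eq (13) and the pooled alternative Eq (15), assuming the block inverses $\{\widehat{\mathbf{V}}_{i,i}^{-1}\}$ were stored during REML fitting (Vi_list), and using cross_cov(), a small helper we define here, to fill in the cross-partition blocks $\widehat{\mathbf{V}}_{i,j}$ under Eq (2).

```r
# Cross-covariance between two coordinate sets under Eq (2); no nugget,
# because locations in different partitions do not coincide
cross_cov <- function(c1, c2, theta) {
  d2 <- outer(rowSums(c1^2), rowSums(c2^2), `+`) - 2 * c1 %*% t(c2)
  theta[1] * exp(-sqrt(pmax(d2, 0)) / theta[3])
}

# Eq (13): W_xx^{-1} W_xVx W_xx^{-1}
var_beta_full <- function(theta, X, coords, part, Vi_list) {
  idx <- split(seq_len(nrow(X)), part)
  P   <- length(idx)
  A   <- lapply(seq_len(P), function(b)
           Vi_list[[b]] %*% X[idx[[b]], , drop = FALSE])   # V_bb^{-1} X_b
  W_xx <- Reduce(`+`, lapply(seq_len(P), function(b)
            t(X[idx[[b]], , drop = FALSE]) %*% A[[b]]))
  W_xVx <- 0
  for (i in seq_len(P)) for (j in seq_len(P)) {            # the double sum
    V_ij <- if (i == j) solve(Vi_list[[i]]) else
              cross_cov(coords[idx[[i]], , drop = FALSE],
                        coords[idx[[j]], , drop = FALSE], theta)
    W_xVx <- W_xVx + t(A[[i]]) %*% V_ij %*% A[[j]]
  }
  solve(W_xx) %*% W_xVx %*% solve(W_xx)
}

# Eq (15): pool the per-partition variances (X_i' V_ii^{-1} X_i)^{-1}
var_beta_pooled <- function(X, part, Vi_list) {
  idx <- split(seq_len(nrow(X)), part)
  P   <- length(idx)
  Reduce(`+`, lapply(seq_len(P), function(b) {
    Xi <- X[idx[[b]], , drop = FALSE]
    solve(t(Xi) %*% Vi_list[[b]] %*% Xi)
  })) / P^2
}
```

For $i = j$ we recover $\mathbf{V}_{i,i}$ by inverting the stored inverse for brevity; in practice one would keep the original diagonal blocks as well.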
Point prediction
The predictor for $Y(\mathbf{s}_0)$ was given in Eq (6). As for estimation, the inverse $\mathbf{V}^{-1}$ only occurs once (as compared to repeatedly when optimizing to obtain the REML estimates). If the data set has tens of thousands of samples, it will still be computationally prohibitive. Note that under the partitioned model Eq (9), which assumes zero correlation among partitions as in Eq (10), the predictor from Eq (6) is
$$\widehat{Y}(\mathbf{s}_0) = \mathbf{x}_0'\hat{\boldsymbol{\beta}} + \hat{\mathbf{c}}_{\text{part}}'\widehat{\mathbf{V}}_{\text{part}}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}), \quad (16)$$
where $\hat{\boldsymbol{\beta}}$ is obtained from Eq (11), $\widehat{\mathbf{V}}_{\text{part}}$ is given by Eq (10), and $\hat{\mathbf{c}}_{\text{part}}$ is the estimated covariance vector between $Y(\mathbf{s}_0)$ and $\mathbf{y}$, using the same autocorrelation model and parameters as for $\widehat{\mathbf{V}}_{\text{part}}$. Even though the predictor is developed under the block diagonal matrix Eq (10), the true prediction variance can be computed under Eq (8), as we did for estimation. However, the performance of these predictors turned out to be quite poor.
We recommend point predictions based on local data instead, which is an old idea, e.g. [43], and has already been implemented in software for some time, e.g. [10]. The local data may be in the form of a spatial limitation, such as a radius around the prediction point, or by using a fixed number of nearest neighbors. For example, the R [65] package nabor [66] finds nearest neighbors among hundreds of thousands of samples very quickly. Our method will be to use a single set of global covariance parameters as estimated under the covariance matrix partition Eq (10), and then predict with a fixed number of nearest neighbors. We will investigate the effect due to the number of nearest neighbors through simulation.
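For instance, finding the 50 nearest observed locations for every prediction site takes only a few lines with nabor; a quick usage sketch (nabor::knn() returns neighbor indices and distances):

```r
library(nabor)

set.seed(2)
obs_coords  <- cbind(runif(5000), runif(5000))  # observed locations
pred_coords <- cbind(runif(100),  runif(100))   # prediction locations

# 50 nearest observed neighbors of each prediction location
nn <- knn(data = obs_coords, query = pred_coords, k = 50)
dim(nn$nn.idx)  # 100 x 50 matrix of row indices into obs_coords
```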
A purely local predictor lacks model coherency, as discussed in the literature review section. We use a single $\hat{\boldsymbol{\theta}}$ for covariance, but there is still the issue of $\hat{\boldsymbol{\beta}}$. As seen in Eq (6), estimation of $\boldsymbol{\beta}$ is implicit in the prediction equations. If $\mathbf{y}_j \subset \mathbf{y}$ are the data in the neighborhood of prediction location $\mathbf{s}_j$, then using Eq (6) with local $\mathbf{y}_j$ implicitly adopts a varying-coefficient model for $\boldsymbol{\beta}$, making it also local, so call it $\hat{\boldsymbol{\beta}}_j$; it will vary for each prediction location $\mathbf{s}_j$. A further issue arises if there are categorical covariates. It is possible that a level of the covariate is not present in the local neighborhood, so some care is needed to collapse any columns in the design matrix that are all zeros. These are some of the issues that call into question the "coherency" of a model when predicting locally.
Instead, as for estimating the covariance parameters, we will assume that the goal is to have a single global estimate of $\boldsymbol{\beta}$. Then we take as our predictor for the $j$th prediction location
$$\widehat{Y}(\mathbf{s}_j) = \mathbf{x}_j'\hat{\boldsymbol{\beta}} + \hat{\mathbf{c}}_j'\widehat{\mathbf{V}}_j^{-1}(\mathbf{y}_j - \mathbf{X}_j\hat{\boldsymbol{\beta}}), \quad (17)$$
where $\mathbf{X}_j$ and $\widehat{\mathbf{V}}_j$ are the design and covariance matrices, respectively, for the same neighborhood as $\mathbf{y}_j$, $\mathbf{x}_j$ is a vector of covariates at prediction location $j$, $\hat{\mathbf{c}}_j$ is the estimated covariance vector between $Y(\mathbf{s}_j)$ and $\mathbf{y}_j$ (using the same autocorrelation model and parameters as for $\widehat{\mathbf{V}}_j$), and $\hat{\boldsymbol{\beta}}$ was given in Eq (11). It will be convenient for block kriging to note that if we condition on $\hat{\boldsymbol{\theta}}$ (hence $\widehat{\mathbf{V}}_j$ and $\hat{\mathbf{c}}_j$) being fixed, then Eq (17) can be written as a linear combination of $\mathbf{y}$, call it $\boldsymbol{\lambda}_j'\mathbf{y}$, similar to as mentioned after Eq (6). Suppose there are $m$ neighbors around $\mathbf{s}_j$, so $\mathbf{y}_j$ is $m \times 1$. Let $\mathbf{y}_j = \mathbf{N}_j\mathbf{y}$, where $\mathbf{N}_j$ is an $m \times n$ matrix of zeros and ones that subsets the $n \times 1$ vector of all data to only those in the neighborhood. Then
$$\boldsymbol{\lambda}_j' = (\mathbf{x}_j' - \hat{\mathbf{c}}_j'\widehat{\mathbf{V}}_j^{-1}\mathbf{X}_j)\mathbf{Q} + \hat{\mathbf{c}}_j'\widehat{\mathbf{V}}_j^{-1}\mathbf{N}_j, \quad (18)$$
where $\mathbf{Q}$ was defined in Eq (12).
Let $\widehat{\boldsymbol{\Sigma}}_{\hat{\beta}}$ be an estimator of $\text{var}(\hat{\boldsymbol{\beta}})$ from Eqs (13), (14), or (15); then the prediction variance of Eq (17), when using the local neighborhood set of data, is
$$\widehat{\text{var}}(\widehat{Y}(\mathbf{s}_j) - Y(\mathbf{s}_j)) = \hat{\sigma}_j^2 - \hat{\mathbf{c}}_j'\widehat{\mathbf{V}}_j^{-1}\hat{\mathbf{c}}_j + \hat{\boldsymbol{\delta}}_j'\widehat{\boldsymbol{\Sigma}}_{\hat{\beta}}\hat{\boldsymbol{\delta}}_j, \quad (19)$$
where $\hat{\boldsymbol{\delta}}_j = \mathbf{x}_j - \mathbf{X}_j'\widehat{\mathbf{V}}_j^{-1}\hat{\mathbf{c}}_j$, and $\hat{\sigma}_j^2$ is the estimated value of $\text{var}(Y(\mathbf{s}_j))$ using $\hat{\boldsymbol{\theta}}$ and the same autocorrelation model that was used for $\widehat{\mathbf{V}}_j$. Eq (19) can be compared to Eq (7).
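A hedged R sketch of Eqs (17) and (19) follows, for one prediction location; predict_local() is illustrative and assumes exp_cov() and cross_cov() from the earlier sketches, a global beta_hat from Eq (11), and Sigma_beta from one of Eqs (13)–(15).

```r
# Sketch of Eqs (17) and (19): local kriging with the global beta_hat
predict_local <- function(s0, x0, nn_idx, y, X, coords, theta,
                          beta_hat, Sigma_beta) {
  yj   <- y[nn_idx]                              # neighborhood data
  Xj   <- X[nn_idx, , drop = FALSE]
  Vj   <- exp_cov(coords[nn_idx, , drop = FALSE], theta[1], theta[2], theta[3])
  cj   <- drop(cross_cov(matrix(s0, 1), coords[nn_idx, , drop = FALSE], theta))
  Vi_c <- solve(Vj, cj)                          # V_j^{-1} c_j
  pred <- sum(x0 * beta_hat) + sum(Vi_c * (yj - Xj %*% beta_hat))  # Eq (17)
  delta <- x0 - drop(t(Xj) %*% Vi_c)
  pvar  <- (theta[1] + theta[2]) - sum(cj * Vi_c) +
           drop(t(delta) %*% Sigma_beta %*% delta)                 # Eq (19)
  c(pred = pred, predvar = pvar)
}
```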
Block prediction
None of the literature reviewed earlier considered block prediction, yet it is an important goal in many applications. In fact, the origins of kriging were founded on estimating total gold reserves in the pursuit of mining [9]. The goal of block prediction is to predict the average value over a region, rather than at a point. If that region is a compact set of points denoted as $\mathcal{B}$, then the random quantity is
$$Y(\mathcal{B}) = \frac{1}{|\mathcal{B}|}\int_{\mathcal{B}} Y(\mathbf{s})\,d\mathbf{s}, \quad (20)$$
where $|\mathcal{B}|$ is the area of $\mathcal{B}$. In practice, we approximate the integral by a dense set of points on a regular grid within $\mathcal{B}$. Let us call that dense set of points $\{\mathbf{s}_j; j = n+1, \ldots, n+N\}$, where recall that $\{\mathbf{s}_j; j = 1, \ldots, n\}$ are the observed data. Then the grid-based approximation to Eq (20) is $Y(\mathcal{B}) \approx \frac{1}{N}\sum_{j=n+1}^{n+N} Y(\mathbf{s}_j)$, with generic predictor $\widehat{Y}(\mathcal{B}) = \frac{1}{N}\sum_{j=n+1}^{n+N} \widehat{Y}(\mathbf{s}_j)$.
We are in the same situation as for prediction of single sites, where we are unable to invert the covariance matrix of all $n$ observed locations when predicting $Y(\mathcal{B})$. Instead, let us use the local predictions as developed in the previous section, which we will average to compute the block prediction. Let the point predictions be a set of random variables denoted as $\{\widehat{Y}(\mathbf{s}_j); j = n+1, \ldots, n+N\}$. Denote by $\mathbf{y}_o$ a vector of random variables for observed locations, and by $\mathbf{y}_u$ a vector of unobserved random variables on the prediction grid to be used as an approximation to the block. Recall that by Eq (18) we can write $\widehat{Y}(\mathbf{s}_j) = \boldsymbol{\lambda}_j'\mathbf{y}_o$. We can put all $\boldsymbol{\lambda}_j$ into a large matrix, $\boldsymbol{\Lambda} = [\boldsymbol{\lambda}_{n+1}, \boldsymbol{\lambda}_{n+2}, \ldots, \boldsymbol{\lambda}_{n+N}]'$. The average of all predictions, then, is
$$\widehat{Y}(\mathcal{B}) = \mathbf{a}'\boldsymbol{\Lambda}\mathbf{y}_o, \quad (21)$$
where $\mathbf{a} = (1/N, 1/N, \ldots, 1/N)'$. Let $\boldsymbol{\lambda}_{\mathcal{B}}' = \mathbf{a}'\boldsymbol{\Lambda}$, and so the block prediction is also linear in $\mathbf{y}_o$.
Let the covariance matrix for the vector $(\mathbf{y}_o', \mathbf{y}_u')'$ be
$$\begin{bmatrix} \mathbf{V}_{o,o} & \mathbf{V}_{o,u} \\ \mathbf{V}_{u,o} & \mathbf{V}_{u,u} \end{bmatrix},$$
where $\mathbf{V}_{o,o} = \mathbf{V}$ in Eq (8). Then, assuming unbiasedness, that is, $\text{E}(\mathbf{y}_o) = \mathbf{X}_o\boldsymbol{\beta}$ and $\text{E}(\mathbf{y}_u) = \mathbf{X}_u\boldsymbol{\beta}$, where $\mathbf{X}_o$ and $\mathbf{X}_u$ are the design matrices for the observed and unobserved variables, respectively, the block prediction variance is
$$\text{var}(\widehat{Y}(\mathcal{B}) - Y(\mathcal{B})) = \boldsymbol{\lambda}_{\mathcal{B}}'\mathbf{V}_{o,o}\boldsymbol{\lambda}_{\mathcal{B}} - 2\boldsymbol{\lambda}_{\mathcal{B}}'\mathbf{V}_{o,u}\mathbf{a} + \mathbf{a}'\mathbf{V}_{u,u}\mathbf{a}. \quad (22)$$
Although the various parts of V can be very large, the necessary vectors can be created on-the-fly to avoid creating and storing the whole matrix. For example, take the third term in Eq (22). To make the kth element of vector Vu,ua, we can create the kth row of Vu,u, and then take the inner product with a. This means that only the vector Vu,ua must be stored. We then simply take this vector as an inner product with a to obtain a′Vu,ua. Also note that computing Eq (21) grows linearly with observed sample size n due to fixing the number of neighbors used for prediction, but Eq (22) grows quadratically, in both n and N, simply due to the matrix dimensions in Vo,o and Vu,u. We can control the growth of N by choosing the density of the grid approximation, but it may require subsampling of yo if the number of observed data is too large. We often have very precise estimates of block averages, so this may not be too onerous if we have hundreds of thousands of observations.
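As a sketch of this on-the-fly strategy, the third term of Eq (22) can be computed in R one row of $\mathbf{V}_{u,u}$ at a time, using the cross_cov() helper from the earlier sketches:

```r
# Compute a' V_uu a without forming the N x N matrix V_uu
a_Vuu_a <- function(grid_coords, theta, a) {
  N     <- nrow(grid_coords)
  Vuu_a <- numeric(N)
  for (k in seq_len(N)) {
    row_k    <- cross_cov(grid_coords[k, , drop = FALSE], grid_coords, theta)
    row_k[k] <- theta[1] + theta[2]  # variance (with nugget) on the diagonal
    Vuu_a[k] <- sum(row_k * a)       # k-th element of the vector V_uu a
  }
  sum(a * Vuu_a)                     # inner product a'(V_uu a)
}
```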
The SPIN method
As we have shown, SPIN is a collection of methods for covariance parameter estimation, fixed effects estimation, and point and block prediction, based on spatial indexing. SPIN, as described above, estimates covariance parameters using REML, given by Eq (3), with a valid autocovariance model [e.g., Eq (2) used in a partitioned covariance matrix, given by Eq (10)]. Using these estimated covariance parameters, we estimate β using Eq (11), with estimated covariance matrix, Eq (13), unless explicitly stating the use of Eqs (14) or (15). For point prediction, we use Eq (17) with estimated variance Eq (19), unless explicitly stating the purely local version for given by Eq (6) with estimated variance Eq (7). For block prediction, we use Eq (21) with Eq (22).
Simulations
To test the validity of SPIN, we simulated $n$ spatial locations randomly within the [0, 1] × [0, 1] unit square to be used as observations, and we created a uniformly-spaced 40 × 40 grid of $(N - n) = 1600$ prediction locations within the unit square.
We simulated data with two methods. The first simulation method created data sets that were not actually very large, using exact geostatistical methods that require the Cholesky decomposition of the covariance matrix. For these simulations, we used the spherical autocovariance model to construct $\mathbf{V}$,
$$V[i,j] = \tau^2\left(1 - \frac{3d_{i,j}}{2\rho} + \frac{d_{i,j}^3}{2\rho^3}\right)\mathcal{I}(d_{i,j} \le \rho) + \eta^2\mathcal{I}(d_{i,j} = 0), \quad (23)$$
where terms are defined as in Eq (2). To simulate normally-distributed data from $\text{N}(\mathbf{0}, \mathbf{V})$, let $\mathbf{L}$ be the lower triangular matrix such that $\mathbf{V} = \mathbf{L}\mathbf{L}'$. If the vector $\mathbf{z}$ is simulated as independent standard normal variables, then $\boldsymbol{\varepsilon} = \mathbf{L}\mathbf{z}$ is a simulation from $\text{N}(\mathbf{0}, \mathbf{V})$. Unfortunately, computing $\mathbf{L}$ is an $O(n^3)$ algorithm, on the same order as inverting $\mathbf{V}$, which limits the size of data for simulation. Fig 2a and 2b show two realizations from $\text{N}(\mathbf{0}, \mathbf{V})$, where the sample size was $n = 2000$ and the autocovariance model, Eq (23), had $\tau^2 = 10$, $\rho = 0.5$, and $\eta^2 = 0.1$. Each simulation took about 3 seconds. Note that when including evaluation of predictions, simulations are required at all $N$ spatial locations. We call this the GEOSTAT simulation method. For all simulations, we fixed $\tau^2 = 10$ and $\eta^2 = 0.1$, but allowed $\rho$ to vary randomly from a uniform distribution between 0 and 2.
(a) and (b) are two different realizations of 2000 values from the GEOSTAT method with a range of 2. (c) and (d) are two realizations of 100,000 values from the SUMSINE method. Bluer values are lower, and yellower areas are higher.
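A minimal R sketch of the GEOSTAT method follows, with sph_cov() as an illustrative helper implementing Eq (23):

```r
# Spherical autocovariance, Eq (23)
sph_cov <- function(coords, tau2, eta2, rho) {
  d <- as.matrix(dist(coords))
  V <- tau2 * (1 - 1.5 * d / rho + 0.5 * (d / rho)^3) * (d <= rho)
  diag(V) <- tau2 + eta2
  V
}

set.seed(3)
n      <- 2000
coords <- cbind(runif(n), runif(n))
V   <- sph_cov(coords, tau2 = 10, eta2 = 0.1, rho = 0.5)
L   <- t(chol(V))      # chol() returns upper triangular, so transpose to get L
eps <- L %*% rnorm(n)  # one realization from N(0, V)
```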
We created another method for simulating spatially patterned data for up to several million records. Let $\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2]$ be the 2-column matrix of the spatial coordinates of the data, where $\mathbf{s}_1$ is the first coordinate, and $\mathbf{s}_2$ is the second coordinate. Let $\mathbf{S}_i^* = [\mathbf{s}_1^*, \mathbf{s}_2^*]$ be a random rotation of the coordinate system by $U_{1,i}\pi$ radians, where $U_{1,i}$ is a uniform random variable. Then let
$$\boldsymbol{\varepsilon}_i = U_{2,i}\sin(U_{3,i}\pi\mathbf{s}_1^* + U_{4,i}\pi)\sin(U_{5,i}\pi\mathbf{s}_2^* + U_{6,i}\pi), \quad (24)$$
which is a 2-dimensional sine wave surface with a random amplitude (due to uniform random variable $U_{2,i}$), random frequencies on each coordinate (due to uniform random variables $U_{3,i}$ and $U_{5,i}$), and random shifts on each coordinate (due to uniform random variables $U_{4,i}$ and $U_{6,i}$). Then the response variable is created by taking $\boldsymbol{\varepsilon} = \sum_i \boldsymbol{\varepsilon}_i$, where expected amplitudes decrease linearly, and expected frequencies increase, with each $i$. Further, the $\boldsymbol{\varepsilon}$ were standardized to zero mean and a variance of 10 for each simulation, and we added a small independent component with variance of 0.1 to each location, similar to the nugget effect $\eta^2$ for the GEOSTAT method. Fig 2c and 2d show two realizations from the sum of random sine-wave surfaces, where the sample size was 100,000. Each simulation took about 2 seconds. We call this the SUMSINE simulation method.
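Because the exact form of Eq (24) was reconstructed above, the R sketch below should be read the same way: the uniform ranges and the number of summed surfaces are illustrative assumptions, but the structure (random rotation, amplitudes decreasing and frequencies increasing with $i$, standardization to variance 10, and a small nugget-like component) follows the description.

```r
# Illustrative sketch of the SUMSINE method; ranges and n_surf are assumptions
sumsine <- function(coords, n_surf = 10) {
  eps <- numeric(nrow(coords))
  for (i in seq_len(n_surf)) {
    ang <- runif(1) * pi                                 # U_{1,i}: random rotation
    R   <- matrix(c(cos(ang), sin(ang), -sin(ang), cos(ang)), 2, 2)
    s   <- coords %*% R                                  # rotated coordinates
    amp   <- runif(1, 0, 2) * (n_surf - i + 1) / n_surf  # amplitude decreases in i
    freq  <- runif(2, 0, 2) * i                          # frequencies increase in i
    shift <- runif(2)
    eps <- eps + amp * sin(freq[1] * pi * s[, 1] + shift[1] * pi) *
                 sin(freq[2] * pi * s[, 2] + shift[2] * pi)
  }
  eps <- (eps - mean(eps)) / sd(eps) * sqrt(10)  # standardize to variance 10
  eps + rnorm(length(eps), sd = sqrt(0.1))       # small nugget-like component
}

coords <- cbind(runif(100000), runif(100000))
eps    <- sumsine(coords)
```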
Thus, random errors, $\boldsymbol{\varepsilon}$, for the simulations were based on the GEOSTAT or SUMSINE methods. In either case, we created two fixed effects. A covariate, $x_1(\mathbf{s}_i)$, was generated from standard independent normal distributions at the $\mathbf{s}_i$ locations. A second, spatially-patterned covariate, $x_2(\mathbf{s}_i)$, was created using the same model as the random error simulation for $\boldsymbol{\varepsilon}$, but a different realization. Then the response variable was created as
$$y(\mathbf{s}_i) = \beta_0 + \beta_1 x_1(\mathbf{s}_i) + \beta_2 x_2(\mathbf{s}_i) + \varepsilon(\mathbf{s}_i), \quad (25)$$
for $i = 1, 2, \ldots$, up to a specified sample size $n$, or $N$ (if wanting simulations at prediction sites), with $\beta_0 = \beta_1 = \beta_2 = 1$.
Evaluation of simulation results
For one summary of performance of fixed effects estimation, we consider the simulation-based estimator of root-mean-squared error,
$$\text{RMSE}_p = \sqrt{\frac{1}{K}\sum_{k=1}^{K}(\hat{\beta}_{p,k} - \beta_p)^2},$$
where $\hat{\beta}_{p,k}$ is the estimate of the $p$th $\beta$ parameter for the $k$th of $K$ simulations, and $\beta_p$ is the true parameter used in simulations. We only consider $\beta_1$ and $\beta_2$ in Eq (25). The next simulation-based estimator we consider is 90% confidence interval coverage,
$$\text{CI90}_p = \frac{1}{K}\sum_{k=1}^{K}\mathcal{I}\left(|\hat{\beta}_{p,k} - \beta_p| < 1.645\sqrt{\widehat{\text{var}}(\hat{\beta}_{p,k})}\right).$$
To evaluate point prediction we also consider the simulation-based estimator of root-mean-squared prediction error,
$$\text{RMSPE} = \sqrt{\frac{1}{KM}\sum_{k=1}^{K}\sum_{j=1}^{M}\left(\hat{y}_k(\mathbf{s}_j) - y_k(\mathbf{s}_j)\right)^2},$$
where $\hat{y}_k(\mathbf{s}_j)$ is the predicted value at the $j$th of $M$ prediction locations for the $k$th simulation and $y_k(\mathbf{s}_j)$ is the realized value at the $j$th location for the $k$th simulation. The final summary that we consider is 90% prediction interval coverage,
$$\text{PI90} = \frac{1}{KM}\sum_{k=1}^{K}\sum_{j=1}^{M}\mathcal{I}\left(|\hat{y}_k(\mathbf{s}_j) - y_k(\mathbf{s}_j)| < 1.645\sqrt{\widehat{\text{var}}(\widehat{Y}_k(\mathbf{s}_j))}\right),$$
where $\widehat{\text{var}}(\widehat{Y}_k(\mathbf{s}_j))$ is an estimator of the prediction variance.
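A short R sketch of these summaries for a single parameter, using qnorm(0.95) ≈ 1.645 for the 90% intervals:

```r
# RMSE over K simulations for one beta_p
rmse_p <- function(beta_hats, beta_true) sqrt(mean((beta_hats - beta_true)^2))

# 90% confidence interval coverage for one beta_p
ci90_p <- function(beta_hats, se_hats, beta_true) {
  mean(abs(beta_hats - beta_true) < qnorm(0.95) * se_hats)
}
```

RMSPE and PI90 are computed analogously, additionally averaging over the prediction locations.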
Effect of partition method
We wanted to test SPIN over a wide range of data. Hence, we simulated 1000 data sets where simulation method was chosen randomly, with equal probability, between GEOSTAT and SUMSINE methods. If GEOSTAT was chosen, a random sample size between 1000 and 2000 was generated. If SUMSINE was chosen, a random sample size between 2000 and 10,000 was generated. Thus, throughout the study, the simulations occurred over a wide range of parameters, with two different simulation methods and randomly varying autocorrelation. In all cases, the error models fitted to the data were misspecified, because we fitted an exponential autocorrelation model to the true models, GEOSTAT and SUMSINE, that generated them. This should provide a good test of the robustness of the SPIN method and provide fairly general conclusions on the effect of partition method.
After simulating the data, we considered 3 indexing methods. One was completely random, the second was spatially compact, and the third was a mixed strategy, starting with compact, and then 10% were randomly reassigned. To create compact data partitions, we used k-means clustering [67] on the spatial coordinates. K-means has the property of minimizing within group variances and maximizing among group variances. When applied to spatial coordinates, k-means creates spatially compact partitions. An example of each partition method is given in Fig 3. We created partition sizes that ranged randomly from a target of 25 to 225 locations per group (k-means has some variation in group size). It is possible to create one partition for covariance estimation, and another partition for estimating fixed effects. Therefore we considered all nine combinations of the three partition methods for each estimation method.
Sample size was 1000, and the data were partitioned into 5 groups of 200 each. (a) Random assignment to group. (b) K-means clustering on x- and y-coordinates. (c) K-means on x- and y-coordinates, with 10% randomly re-assigned from each group. Each color represents a different grouping.
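Creating the compact partitions is straightforward in R; for example, targeting about 50 locations per partition:

```r
set.seed(4)
coords <- cbind(runif(1000), runif(1000))
P    <- ceiling(nrow(coords) / 50)                # number of partitions
part <- kmeans(coords, centers = P, nstart = 5)$cluster
table(part)  # group sizes vary somewhat around the target of 50
```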
Table 3 shows performance summaries for the three partition methods, for both fixed effect estimation and point prediction, over wide-ranging simulations when using SPIN with 50 nearest-neighbors for predictions. It is clear that, whether for fixed effect estimation, or prediction, the use of compact partitions was the best option. The worst option was random partitioning. The mixed approach was often close to compact partitioning in performance.
Effect of partition size
Next, we investigated the effect of partition size. We only used compact partitions, because they were best, and we used partition sizes of 25, 50, 100, and 200 for both covariance parameter estimation and fixed effect estimation, and again used 50 nearest neighbors for predictions. We simulated data in the same way as above, and used the same performance summaries. Here, we also included the average time, in seconds, for each estimator. The results are shown in Table 4. In general, larger partition sizes had better RMSE for estimating covariance parameters, but the gains were very small after size 50. For fixed effects estimation, partition size of 50 was often better than 100, and approximately equal to size 200. For prediction, RMSPE was lower as partition size increased. In terms of computing speed, covariance parameter estimation was slower as partition size increased, but fixed effect estimation was faster as partition size increased (because of fewer loops in Eq (13)). Partition sizes of 25 often had poor coverage in terms of both CI90 and PI90, but coverage was good for other partition sizes. Based on Tables 3 and 4, one good overall strategy is to use compact partitions of block size 50 for covariance parameter estimation, and block size 200 for fixed effect estimation, for both efficiency and speed. Note that when partition size is different for fixed effect estimation than for covariance parameter estimation, new inverses of the diagonal blocks in Eq (10) are needed. If partition size is the same for fixed effect and covariance parameter estimation, the inverses of the diagonal blocks can be passed from REML to fixed effects estimation, so another good strategy is to use block size 50 for both fixed effect and covariance parameter estimation.
Variance estimation for fixed effects
In the section on estimating β, we described three possible estimators for the covariance matrix of , with Eq (13) being theoretically correct, and faster alternatives Eqs (14) and (15). The alternative estimators will only be necessary for very large sample sizes, so to test their efficacy we simulated 1000 data sets with random sample sizes, from 10,000 to 100,000, using the SUMSINE method. We then fitted the covariance model, using compact partitions of size 50, and fixed effects, using partition sizes of 25, 50, 100, and 200. We computed the estimated covariance matrix of the fixed effects using Eqs (13)–(15), and evaluated performance based on 90% confidence interval coverage.
Results in Table 5 show that all three estimators, at all block sizes, have confidence interval coverage very close to the nominal 90% when estimating β1, the independent covariate. However, when estimating the spatially-patterned covariate, β2, the theoretical estimator has proper coverage for block sizes 50 and greater, while the two alternative estimators have proper coverage only for block size 50. It is surprising that the results for the alternative estimators are so specific to a particular block size, and these estimators warrant further research.
Prediction with global estimate of β
In the sections on point and block prediction, we described prediction using both a local estimator of $\boldsymbol{\beta}$ and the global estimator $\hat{\boldsymbol{\beta}}$ from Eq (11). To compare them, and to examine the effect of the number of nearest neighbors, we simulated 1000 data sets as described earlier, using compact partitions of size 50 for both covariance and fixed-effects estimation. We then predicted values on the gridded locations with 25, 50, 100, and 200 nearest neighbors.
Results in Table 6 show that prediction with the global estimator $\hat{\boldsymbol{\beta}}$ had smaller RMSPE, especially with smaller numbers of nearest neighbors. As expected, predictors have lower RMSPE with more nearest neighbors, but gains are small after 50 nearest neighbors. Prediction intervals for both methods had proper coverage. The local estimator of $\boldsymbol{\beta}$ was faster because it used the local estimator of the covariance of $\boldsymbol{\beta}$, while predictions with $\hat{\boldsymbol{\beta}}$ needed the global covariance estimator (Eq 13) to be used in Eq (19). Higher numbers of nearest neighbors took longer to compute, especially with numbers greater than 100. Of course, predictions for the block average had much smaller RMSPE than those for points. Again, prediction got better when using more nearest neighbors, but improvements were small with more than 50. Computing time for block averaging increased with the number of neighbors, especially when greater than 100, and took longer than point predictions.
A comparison of methods
To compare methods, we simulated 1000 data sets using GEOSTAT (partial sill was 10, range was 0.5, and nugget was 0.1), where we fixed the sample size at n = 1000, and the errors were standardized before adding fixed effects. We compared 3 methods: 1) estimation and prediction using the full covariance matrix for all 1000 points, 2) SPIN with compact blocks of 50 for both covariance and fixed effects parameter estimation, and 50 nearest neighbors for prediction, and 3) nearest-neighbor Gaussian processes (NNGP). NNGP had good performance in [16] and software is readily available in the R package spNNGP [68]. For spNNGP, we used default parameters for the conjugate prior method and a 25 × 25 search grid for phi and alpha, which were the dimensions of the search grid found in [16]. We stress that we do not claim this to be a definitive comparison among methods, as the developers of NNGP could surely make adjustments to improve performance. Likewise, partition size and number of nearest neighbors for prediction could be adjusted to optimize performance of SPIN for any given simulation or data set. We offer these results to show that, broadly, SPIN and NNGP are comparable, and very fast, with little performance lost in comparison to using the full covariance matrix.
Table 7 shows that RMSE for estimation of the independent covariate, and the spatially-patterned covariate, were approximately equal for SPIN and NNGP, and only slightly worse than the full covariance matrix. RMSPE for SPIN was equal to the full covariance matrix, and both were just slightly better than NNGP. Confidence and prediction intervals for all three methods were very close to the nominal 90%.
Fig 4 shows computing times, using 5 replicate simulations, for each method for up to 100,000 records. Both NNGP and SPIN can use parallel processing, but here we used a single processor to remove any differences due to parallel implementations. Fitting the full covariance matrix with REML, which is iterative, took more than 30 minutes with sample sizes > 2500. Computing time for NNGP is clearly linear with sample size, while for SPIN, it is quadratic when using Eq (13), but linear when using the alternative variance estimators for fixed effects (Eqs 14 and 15). Using the alternative variance estimators, SPIN was about 10 times faster than NNGP, and even with quadratic growth when using Eq (13), SPIN was faster than NNGP for up to 100,000 records.
Application to stream networks
We applied spatial indexing to covariance matrices constructed using stream network models as described for the motivating example in the Introduction. These are variance component models, with a tail-up component, a tail-down component, and a Euclidean-distance component, each with 2 covariance parameters, along with a nugget effect; thus, there are 7 covariance parameters (4 partial sills, and 3 range parameters). A full covariance matrix was developed for these models [69], and we easily adapted it for spatial partitioning. We used compact blocks of size 50 for estimation, and 50 nearest neighbors for predictions. The 4 partial sill estimates were 1.76, 0.40, 2.57, and 0.66 for tail-up, tail-down, Euclidean-distance, and nugget effect, respectively. These indicate that tail-up and Euclidean-distance components dominated the structure of the overall autocovariance, and both had large range parameters. It took 7.98 minutes to fit the covariance parameters. The fitted fixed effects took an additional 2.15 minutes of computing time (Table 8), which are very similar to results found in [55]. Predictions for 65,099 locations are shown in Fig 5, which took 47 minutes.
Yellower colors are higher values, while bluer colors are lower values.
In summary, the original analysis [55] took 10 days of continuous computing time to fit the model and make predictions with a full 9521 × 9521 covariance matrix. Using SPIN, fitting the same model took about 10 minutes, with an additional 47 minutes for predictions. Note that these models take more time than Euclidean distance alone because there are 7 covariance parameters, and the tail-up and tail-down models use stream distance, which takes longer to compute. For this example, we used parallel processing with 8 cores when fitting covariance parameters and fixed effects, and making predictions, which made analyses considerably faster. We did not use block prediction, because that was not a particular goal for this study. However, it is generally important, and has been used for estimating fish abundance [70].
Discussion and conclusions
We have explored spatial partitioning to speed computations for massive data sets. We have provided novel and theoretically correct development of variance estimators for all quantities. We proposed a globally coherent model for covariance and fixed effects estimation, and then used that model for improved predictions, even when those predictions were done locally based on nearest neighbors. We include block kriging in our development, which is largely absent from the literature on spatial methods for big data.
Our simulations showed that, over a range of sample sizes, simulation methods, and autocorrelation values, spatially compact partitions are best. There does not appear to be a need for "large blocks," as used in [54]. A good overall strategy that combines speed without giving up much precision is based on 50/50/50, where compact partitions of size 50 are used for both covariance parameter estimation and fixed effects estimation, and 50 nearest neighbors are used for prediction. This strategy compares very favorably with a default strategy for NNGP.
One benefit of the data indexing is that it extends easily to any geostatistical model with a valid covariance matrix. There is no need to approximate a Gaussian process. We provided one example for stream network models, but other examples include geometric anisotropy, nonstationary models, spatio-temporal models (including those that are nonseparable), etc. Any valid covariance matrix can be indexed and partitioned, offering both faster matrix inversions and parallel computing, while providing valid inferences with proper uncertainty assessment.
Acknowledgments
We would like to thank Devin Johnson, Brett McClintock, Alan Pearse, and one anonymous reviewer for their reviews. The findings and conclusions in the paper are those of the author(s) and do not necessarily represent the views of the reviewers nor the EPA, BPA, and National Marine Fisheries Service, NOAA. Any use of trade, product, or firm names does not imply an endorsement by the US Government.
References
- 1. Cressie NAC. Statistics for Spatial Data, Revised Edition. New York: John Wiley & Sons; 1993.
- 2. Stein ML. A modeling approach for large spatial datasets. Journal of the Korean Statistical Society. 2008;37(1):3–10.
- 3. Chiles JP, Delfiner P. Geostatistics: Modeling Spatial Uncertainty. New York: John Wiley & Sons; 1999.
- 4. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58:545–554.
- 5. Patterson H, Thompson R. Maximum likelihood estimation of components of variance. In: Proceedings of the 8th International Biometric Conference. Biometric Society, Washington, DC; 1974. p. 197–207.
- 6. Mardia KV, Marshall R. Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika. 1984;71(1):135–146.
- 7. Heyde CC. A quasi-likelihood approach to the REML estimating equations. Statistics & Probability Letters. 1994;21:381–384.
- 8. Cressie N, Lahiri SN. Asymptotics for REML estimation of spatial covariance parameters. Journal of Statistical Planning and Inference. 1996;50:327–341.
- 9. Cressie N. The origins of kriging. Mathematical Geology. 1990;22:239–252.
- 10. Johnston K, Ver Hoef JM, Krivoruchko K, Lucas N. Using ArcGIS Geostatistical Analyst. vol. 300. Esri, Redlands, CA; 2001.
- 11. Ver Hoef JM, Peterson E. A moving average approach for spatial statistical models of stream networks (with discussion). Journal of the American Statistical Association. 2010;105:6–18.
- 12. Zimmerman DL, Cressie N. Mean squared prediction error in the spatial linear model with estimated covariance parameters. Annals of the Institute of Statistical Mathematics. 1992;44:27–43.
- 13. Sun Y, Li B, Genton MG. Geostatistics for large datasets. In: Porcu E, Montero JM, Schlather M, editors. Advances and Challenges in Space-Time Modelling of Natural Events. Springer; 2012. p. 55–77.
- 14. Bradley JR, Cressie N, Shi T. A comparison of spatial predictors when datasets could be very large. Statistics Surveys. 2016;10:100–131.
- 15. Liu H, Ong YS, Shen X, Cai J. When Gaussian process meets big data: A review of scalable GPs. arXiv preprint arXiv:1807.01065. 2018.
- 16. Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, et al. A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics. 2019;24(3):398–425. pmid:31496633
- 17. Kammann EE, Wand MP. Geoadditive models. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2003;52(1):1–18.
- 18. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press, Cambridge, UK; 2003.
- 19. Wood SN, Li Z, Shaddick G, Augustin NH. Generalized additive models for gigadata: modeling the UK black smoke network daily data. Journal of the American Statistical Association. 2017;112(519):1199–1210.
- 20. Cressie N, Johannesson G. Spatial prediction for massive datasets. In: Mastering the Data Explosion in the Earth and Environmental Sciences: Proceedings of the Australian Academy of Science Elizabeth and Frederick White Conference. Canberra, Australia: Australian Academy of Science; 2006. p. 11.
- 21. Cressie N, Johannesson G. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(1):209–226.
- 22. Kang EL, Cressie N. Bayesian inference for the spatial random effects model. Journal of the American Statistical Association. 2011;106(495):972–983.
- 23. Katzfuss M, Cressie N. Spatio-temporal smoothing and EM estimation for massive remote-sensing data sets. Journal of Time Series Analysis. 2011;32(4):430–446.
- 24. Banerjee S, Gelfand AE, Finley AO, Sang H. Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(4):825–848. pmid:19750209
- 25. Finley AO, Sang H, Banerjee S, Gelfand AE. Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis. 2009;53(8):2873–2884. pmid:20016667
- 26. Nychka D, Bandyopadhyay S, Hammerling D, Lindgren F, Sain S. A multiresolution Gaussian process model for the analysis of large spatial datasets. Journal of Computational and Graphical Statistics. 2015;24(2):579–599.
- 27. Katzfuss M. A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association. 2017;112(517):201–214.
- 28. Furrer R, Genton MG, Nychka D. Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics. 2006;15(3):502–523.
- 29. Kaufman CG, Schervish MJ, Nychka DW. Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association. 2008;103(484):1545–1555.
- 30. Stein ML. Statistical properties of covariance tapers. Journal of Computational and Graphical Statistics. 2013;22(4):866–885.
- 31. Lindgren F, Rue H, Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(4):423–498.
- 32. Bakka H, Rue H, Fuglstad GA, Riebler A, Bolin D, Illian J, et al. Spatial modeling with R-INLA: A review. Wiley Interdisciplinary Reviews: Computational Statistics. 2018;10(6):e1443.
- 33. Vecchia AV. Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Methodological). 1988;50(2):297–312.
- 34. Stein ML, Chi Z, Welty LJ. Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2004;66(2):275–296.
- 35. Datta A, Banerjee S, Finley AO, Gelfand AE. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association. 2016;111(514):800–812. pmid:29720777
- 36. Datta A, Banerjee S, Finley AO, Gelfand AE. On nearest-neighbor Gaussian process models for massive spatial data. Wiley Interdisciplinary Reviews: Computational Statistics. 2016;8(5):162–171. pmid:29657666
- 37. Finley AO, Datta A, Cook BC, Morton DC, Andersen HE, Banerjee S. Applying nearest neighbor Gaussian processes to massive spatial data sets: forest canopy height prediction across Tanana Valley, Alaska. arXiv preprint arXiv:1702.00434. 2017.
- 38. Finley AO, Datta A, Cook BD, Morton DC, Andersen HE, Banerjee S. Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics. 2019; p. 1–14. pmid:31543693
- 39. Katzfuss M, Guinness J. A general framework for Vecchia approximations of Gaussian processes. arXiv preprint arXiv:1708.06302. 2017.
- 40. Katzfuss M, Guinness J, Gong W, Zilber D. Vecchia approximations of Gaussian-process predictions. arXiv preprint arXiv:1805.03309. 2018.
- 41. Zilber D, Katzfuss M. Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. arXiv preprint arXiv:1906.07828. 2019.
- 42. Ver Hoef JM. Kriging models for linear networks and non-Euclidean distances: Cautions and solutions. Methods in Ecology and Evolution. 2018;9(6):1600–1613.
- 43. Haas TC. Lognormal and moving window methods of estimating acid deposition. Journal of the American Statistical Association. 1990;85(412):950–963.
- 44. Haas TC. Local prediction of a spatio-temporal process with an application to wet sulfate deposition. Journal of the American Statistical Association. 1995;90(432):1189–1199.
- 45. Curriero FC, Lele S. A composite likelihood approach to semivariogram estimation. Journal of Agricultural, Biological, and Environmental Statistics. 1999; p. 9–28.
- 46. Liang F, Cheng Y, Song Q, Park J, Yang P. A resampling-based stochastic approximation method for analysis of large geostatistical data. Journal of the American Statistical Association. 2013;108(501):325–339.
- 47. Eidsvik J, Shaby BA, Reich BJ, Wheeler M, Niemi J. Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics. 2014;23(2):295–315.
- 48. Barbian MH, Assunção RM. Spatial subsemble estimator for large geostatistical data. Spatial Statistics. 2017;22:68–88.
- 49. Varin C, Reid N, Firth D. An overview of composite likelihood methods. Statistica Sinica. 2011; p. 5–42.
- 50. Park C, Huang JZ, Ding Y. Domain decomposition approach for fast Gaussian process regression of large spatial data sets. Journal of Machine Learning Research. 2011;12(May):1697–1728.
- 51. Park C, Huang JZ. Efficient computation of Gaussian process regression for large spatial data sets by patching local Gaussian processes. Journal of Machine Learning Research. 2016;17(174):1–29.
- 52. Heaton MJ, Christensen WF, Terres MA. Nonstationary Gaussian process models using spatial hierarchical clustering from finite differences. Technometrics. 2017;59(1):93–101.
- 53. Park C, Apley D. Patchwork kriging for large-scale Gaussian process regression. The Journal of Machine Learning Research. 2018;19(1):269–311.
- 54. Caragea P, Smith RL. Approximate likelihoods for spatial processes. Preprint. 2006. https://rls.sites.oasis.unc.edu/postscript/rs/approxlh.pdf
- 55. Isaak DJ, Wenger SJ, Peterson EE, Ver Hoef JM, Nagel DE, Luce CH, et al. The NorWeST summer stream temperature model and scenarios for the western U.S.: a crowd-sourced database and new geospatial tools foster a user community and predict broad climate warming of rivers and streams. Water Resources Research. 2017;53(11):9181–9205.
- 56. Ver Hoef JM, Peterson EE, Theobald D. Spatial statistical models that use flow and stream distance. Environmental and Ecological Statistics. 2006;13(1):449–464.
- 57. Barry RP, Ver Hoef JM. Blackbox kriging: spatial prediction without specifying variogram models. Journal of Agricultural, Biological, and Environmental Statistics. 1996;1(3):297–322.
- 58. Ver Hoef JM, Barry RP. Constructing and fitting models for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference. 1998;69(2):275–294.
- 59. Higdon D. A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environmental and Ecological Statistics. 1998;5:173–190.
- 60. Higdon D, Swall J, Kern J. Non-stationary spatial modeling. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics 6—Proceedings of the Sixth Valencia International Meeting. Clarendon Press [Oxford University Press]; 1999. p. 761–768.
- 61. Webster R, Oliver MA. Geostatistics for Environmental Scientists. Chichester, England: John Wiley & Sons; 2007.
- 62. Besag J. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician). 1975;24(3):179–195.
- 63. Guha S, Hafen R, Rounds J, Xia J, Li J, Xi B, et al. Large complex data: divide and recombine (D&R) with RHIPE. Stat. 2012;1(1):53–67.
- 64. Chapman DG, Johnson AM. Estimation of fur seal pup populations by randomized sampling. Transactions of the American Fisheries Society. 1968;97(3):264–270.
- 65. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
- 66. Elseberg J, Magnenat S, Siegwart R, Nüchter A. Comparison of nearest-neighbor-search strategies and implementations for efficient shape registration. Journal of Software Engineering for Robotics. 2012;3(1):2–12.
- 67. MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J, editors. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1. University of California Press; 1967. p. 281–297.
- 68. Finley AO, Datta A, Banerjee S. spNNGP: R package for nearest neighbor Gaussian process models. arXiv preprint arXiv:2001.09111. 2020.
- 69. Ver Hoef JM, Peterson EE, Clifford D, Shah R. SSN: an R package for spatial statistical modeling on stream networks. Journal of Statistical Software. 2014;56(3):1–45.
- 70. Isaak DJ, Ver Hoef JM, Peterson EE, Horan DL, Nagel DE. Scalable population estimates using spatial-stream-network (SSN) models, fish density surveys, and national geospatial database frameworks for streams. Canadian Journal of Fisheries and Aquatic Sciences. 2017;74(2):147–156.