Indexing and partitioning the spatial linear model for large data sets

We consider four main goals when fitting spatial linear models: 1) estimating covariance parameters, 2) estimating fixed effects, 3) kriging (making point predictions), and 4) block kriging (predicting the average value over a region). Each of these goals can present different challenges when analyzing large spatial data sets. Current research uses a variety of methods, including spatial basis functions (reduced rank) and covariance tapering, among others, to achieve these goals. However, spatial indexing, which is closely related to composite likelihood, offers some advantages. We develop a simple framework for all four goals listed above by using indexing to create a block covariance structure and nearest-neighbor predictions while maintaining a coherent linear model. We show exact inference for fixed effects under this block covariance construction. Spatial indexing is very fast, and simulations are used to validate the methods and compare them to another popular method. We study various sample designs for indexing, and our simulations showed that spatially compact partitions perform best over a range of sample sizes, autocorrelation values, and generating processes. Partitions can be kept small, on the order of 50 samples per partition. We use nearest neighbors for kriging and block kriging, finding that 50 nearest neighbors is sufficient. In all cases, confidence intervals for fixed effects, and prediction intervals for (block) kriging, have appropriate coverage. Some advantages of spatial indexing are that it is available for any valid covariance matrix, can take advantage of parallel computing, and easily extends to non-Euclidean topologies, such as stream networks. We use stream networks to show how spatial indexing can achieve all four goals, listed above, for very large data sets, in a matter of minutes, rather than days, for an example data set.


Introduction
The general linear model, including regression and analysis of variance (ANOVA), is still a mainstay in statistics,

Y = Xβ + ε, (1)

where Y is an n × 1 vector of response random variables, X is the design matrix of covariates (fixed explanatory variables, containing any combination of continuous, binary, or categorical variables), β is a vector of parameters, and ε is a vector of zero-mean random variables, which are classically assumed to be uncorrelated, var(ε) = σ²I. The spatial linear model is a version of Eq (1) where var(ε) = V, and V is a patterned covariance matrix that is modeled using spatial relationships. Generally, spatial relationships are of two types: spatially continuous point-referenced data, often called geostatistics, and finite sets of neighbor-based data, often called lattice or areal data [1]. For geostatistical data, we associate the random variables in Eq (1) with their spatial locations by denoting the random variables as Y(s_i); i = 1, ..., n, and ε(s_i); i = 1, ..., n, where s_i is a vector of spatial coordinates for the ith point, and the i, jth element of V is cov(ε(s_i), ε(s_j)). Table 1 provides a list of all of the main notation used in this article.
The main goals from a geostatistical linear model are to 1) estimate V, 2) estimate β, 3) make predictions at unsampled Y(s_j), where j = n + 1, ..., N, from a set of spatial locations without observations, and 4) for some region B, make a prediction of the average value Y(B) = ∫_B Y(s)ds / |B|, where |B| is the area of B. Estimation and prediction both require O(n²) storage for V and O(n³) operations for V^{-1} [2], which, for massive data sets, is computationally expensive and may be prohibitive. Our overall objective is to use spatial indexing ideas to make all four goals possible for very large spatial data sets. We maintain the moment-based approach of classical geostatistics, which is distribution free, and we work to maintain a coherent model of stationarity and a single set of parameter estimates.

Quick review of the spatial linear model
When the outcome of the random variable Y(s_i) is observed, we denote it y(s_i); the observed values are contained in the vector y. These observed data are used first to estimate the autocorrelation parameters in V, which we denote as θ. In general, V can have n(n + 1)/2 parameters, but the use of distance to describe spatial relationships typically reduces this to just 3 or 4 parameters. An example of how V depends on θ is given by the exponential autocorrelation model, where the i, jth element of V is

τ² exp(−d_{i,j}/ρ) + η² I(d_{i,j} = 0), (2)

where θ = (τ², η², ρ)′, d_{i,j} is the Euclidean distance between s_i and s_j, and I(·) is an indicator function, equal to 1 if its argument is true and 0 otherwise. The parameter η² is often called the "nugget effect," τ² is called the "partial sill," and ρ is called the "range" parameter. In Eq (2), the variances are constant (stationary) when d_{i,j} = 0, which we denote σ² = τ² + η². Many other examples of autocorrelation models are given in [1,3]. We will use restricted maximum likelihood (REML) [4,5] to estimate the parameters of V.
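As a concrete illustration, the exponential model of Eq (2) can be sketched in a few lines of Python with numpy. This is our own illustrative code, not the authors' software; the function name and parameter values are our choices.

```python
import numpy as np

def exponential_cov(coords, tau2, eta2, rho):
    """i,j-th element of V per Eq (2): tau2 * exp(-d_ij / rho) + eta2 * I(d_ij = 0)."""
    # pairwise Euclidean distances d_ij between all rows of coords
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(100, 2))          # 100 random locations in the unit square
V = exponential_cov(coords, tau2=10.0, eta2=0.1, rho=0.5)
# the diagonal equals the stationary variance sigma^2 = tau2 + eta2
```

The resulting V is symmetric and positive definite, with the sill τ² + η² on the diagonal.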
REML is less biased than full maximum likelihood [6]. REML estimates of covariance parameters are obtained by minimizing

log|V_θ| + log|X′V_θ^{-1}X| + r_θ′V_θ^{-1}r_θ + c (3)

for θ, where V_θ depends on the spatial autocorrelation parameters θ, r_θ = y − Xβ̂_θ, β̂_θ = (X′V_θ^{-1}X)^{-1}X′V_θ^{-1}y, and c is a constant that does not depend on θ. It has been shown [7,8] that Eq (3) forms unbiased estimating equations for covariance parameters, so Gaussian data are not strictly necessary. After Eq (3) has been minimized for θ, these estimates, call them θ̂, are used in the autocorrelation model, e.g. Eq (2), for all of the covariance values to create V̂. This is the first use of the data y. The usual frequentist method for geostatistics, with a long tradition [9], "uses the data twice" [10]. Now V̂, along with a second use of the data, is used to estimate regression coefficients or make predictions at unsampled locations. By plugging V̂ into the well-known best linear unbiased estimator (BLUE) of β for Eq (1), we obtain the empirical best linear unbiased estimator (EBLUE), e.g. [11],

β̂ = (X′V̂^{-1}X)^{-1}X′V̂^{-1}y. (4)

The estimated variance of Eq (4) is

var̂(β̂) = (X′V̂^{-1}X)^{-1}. (5)

Let a single unobserved location be denoted s_0, with a covariate vector x_0 (containing the same covariates, and of the same length, as a row of X). Then the empirical best linear unbiased predictor (EBLUP) [12] at an unobserved location is

Ŷ(s_0) = x_0′β̂ + ĉ_0′V̂^{-1}(y − Xβ̂), (6)

where ĉ_0 ≡ ĉov(ε, ε(s_0)), using the same autocorrelation model, e.g. Eq (2), and estimated parameters, θ̂, that were used to develop V̂. Note that if we condition on V̂ as fixed, then Eq (6) is a linear combination of y, and can also be written as η_0′y when Eq (4) is substituted for β̂. The prediction Eq (6) can be seen as the conditional expectation of Y(s_0)|y with plug-in values for β, V, and c_0. The estimated variance of the EBLUP is

var̂(Y(s_0) − Ŷ(s_0)) = σ̂_0² − ĉ_0′V̂^{-1}ĉ_0 + (x_0 − X′V̂^{-1}ĉ_0)′(X′V̂^{-1}X)^{-1}(x_0 − X′V̂^{-1}ĉ_0), (7)

where σ̂_0² is the estimated variance of Y(s_0) using the same covariance model as V̂.
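A minimal numerical sketch of the REML criterion, Eq (3), together with the EBLUE of Eq (4), follows (Python/numpy; all names are ours, the constant c is dropped, and the direct inverse is for clarity rather than efficiency):

```python
import numpy as np

def reml_criterion(theta, y, X, coords):
    """Eq (3) up to the constant c, with V built from the exponential
    model of Eq (2); theta = (tau2, eta2, rho)."""
    tau2, eta2, rho = theta
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    V = tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    beta = np.linalg.solve(XtViX, X.T @ Vi @ y)   # Eq (4), the EBLUE
    r = y - X @ beta                              # r_theta
    crit = (np.linalg.slogdet(V)[1]
            + np.linalg.slogdet(XtViX)[1]
            + r @ Vi @ r)
    return crit, beta

rng = np.random.default_rng(1)
n = 80
coords = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
crit, beta = reml_criterion((10.0, 0.1, 0.5), y, X, coords)
```

In practice this function would be handed to a numerical optimizer; the point of the sketch is that every evaluation requires V^{-1}, which motivates the partitioning below.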
Spatial methods for big data

Here, we give a brief overview of the most popular methods currently used for large spatial data sets. There are various ways to classify such methods. For our purposes, there are two broad approaches. One is to adopt a Gaussian Process (GP) model for the data and then approximate the GP. The other is to model locally, essentially creating smaller data sets and using existing models.
Modeling locally involves an attempt to maintain classical geostatistical models by creating subsets of the data, using existing methods on the subsets, and then making inference from the subsets. For example, [43,44] created local data sets in a spatial moving window, and then estimated variograms and used ordinary kriging within those windows. This idea allows for nonstationary variances but forces an unnatural asymmetric autocorrelation because the range parameter changes when moving a window. Nor does it estimate a single β; rather, there is a different β for every point in space. Another early idea was to create a composite likelihood by taking products of subset likelihoods and optimizing for the autocorrelation parameters θ [45]; then θ can be held fixed when predicting in local windows. However, this does not solve the problem of estimating a single β.
More recently, two broad approaches have been developed for modeling locally. One is a 'divide and conquer' approach, which is similar to [45]. Here, it is permissible to re-use data in subsets, or not use some data at all [46-48], with an overview provided by [49]. Another approach is a simple partition of the data into groups, where partitions are generally spatially compact [50-53]. This is sensible for estimating covariance parameters and will provide an unbiased estimate of β; however, the estimated variance var̂(β̂) will not be correct. Continuity corrections for predictions are provided, but predictions may not be efficient near partition boundaries.
A blocked structure for the covariance matrix based on spatially compact groupings was proposed by [54], who then formulated a hybrid likelihood based on blocks of different sizes. The method that we feature is most similar to [54], but we show that there is no need for a hybrid likelihood, and that our approach is different from composite likelihood. Our spatial indexing approach is very simple, extends easily to random effects, and accommodates virtually any covariance matrix that can be constructed. We also show how to obtain the exact covariance matrix of estimated fixed effects without any need for computational derivatives or numerical approximations.

Motivating example
One of the attractive features of the method that we propose is that it will work with any valid covariance matrix. To motivate our methods, consider a stream network (Fig 1a). This is the Mid-Columbia River basin, located along part of the border between the states of Washington and Oregon. In Fig 1 we also show a systematic placement of prediction locations with orange dots. There are 60,099 prediction locations that will serve as the basis for point predictions. The response variable is an average of daily maximum temperatures in August from 1993 to 2011. Explanatory variables obtained for both observations and prediction sites included elevation at the temperature logger site, slope of the stream segment at the site, percentage of the upstream watershed composed of lakes or reservoirs, proportion of the upstream watershed composed of glacial ice surfaces, mean annual precipitation in the watershed upstream of the sensor, the northing coordinate, base-flow index values, upstream drainage area, a canopy value encompassing the sensor, mean August air temperature from a gridded climate model, mean August stream discharge, and occurrence of the sensor in tailwater downstream from a large dam (see [55] for more details).
These data were previously analyzed in [55] with geostatistical models specific to stream networks [11,56]. The models were constructed as spatial moving averages, e.g., [57,58], also called process convolutions, e.g., [59,60]. Two basic covariance matrices are constructed, and then summed. In one, random variables were constructed by integrating a kernel over a white noise process strictly upstream of a site, which are termed "tail-up" models. In the other construction, random variables were created by integrating a kernel over a white noise process strictly downstream of a site, which are termed "tail-down" models. Both types of models allow analytical derivation of autocovariance functions, with different properties. For tail-up models, sites remain independent so long as they are not connected by water flow from an upstream site to a downstream site. This is true even if two sites are very close spatially, but each on a different branch just upstream of a junction. Tail-down models are more typical, as they allow spatial dependence that is generally a function of distance along the stream, but autocorrelation will still be different for two pairs of sites that are an equal distance apart, when one pair is connected by flow and the other is not.
When considering big data, such as those in Fig 1, we considered the methods described in the previous section. The basis-function/reduced-rank approaches would be difficult for stream networks because an inspection of Fig 1 reveals that we would need thousands of basis functions in order to cover all headwater stream segments and run the basis functions downstream only. A separate set of basis functions would be needed that ran upstream, and then weighting would be required to split the basis functions at all stream junctions. In fact, all of the GP model approximation methods would require modifying a covariance structure that has already been developed specifically for stream networks. The spatial indexing method that we propose below is much simpler, requiring no modification to the covariance structure, and, as we will demonstrate, proved to be adequate, not only for stream networks, but more generally.

Objectives
In what follows, we will use spatial indexing, leading to covariance matrix partitioning and local predictions. We will use the acronym SPIN, for SPatial INdexing, for the collection of methods for covariance parameter estimation, fixed effects estimation, and point and block prediction. Our objective is to show how each of these inferences can be made computationally faster with SPIN, while still providing unbiased results with valid confidence/prediction intervals.
This article uses several acronyms. Table 2 provides a handy reference to the meaning of all acronyms used here.

Methods
The main advantage of the SPIN method is due to the way the covariance matrix is indexed and partitioned to allow for faster evaluation of the REML equations, Eq (3), whose optimization is iterative, requiring many evaluations involving the inverse of the covariance matrix. This optimization provides estimation of the covariance parameters, which we describe next.

Estimation of covariance parameters
Consider the covariance matrix to be used in Eqs (4) and (6). First, we index the data to create a covariance matrix with P partitions based on the indexes {i; i = 1, ..., P},

V = [ V_{1,1} V_{1,2} ... V_{1,P} ; V_{2,1} V_{2,2} ... V_{2,P} ; ... ; V_{P,1} V_{P,2} ... V_{P,P} ], (8)

where V_{i,j} contains the covariances between the ith and jth partitions. In a similar way, imagine a corresponding indexing and partition of the spatial linear model as

y_i = X_i β + ε_i; i = 1, ..., P. (9)
Now, for the purposes of estimating covariance parameters, we optimize the REML equations based on the block-diagonal covariance matrix

V_part = blockdiag(V_{1,1}, V_{2,2}, ..., V_{P,P}), (10)

rather than Eq (8). The computational advantage of using Eq (10) in Eq (3) is that we only need to invert matrices of the size of V_{i,i} for all i, and, because we have large amounts of data, we assume that the {V_{i,i}} are sufficient for estimating covariance parameters. If the size of V_{i,i} is fixed, then the computational burden grows linearly with n. Also, Eq (10) in Eq (3) allows for the use of parallel computing because each V_{i,i} can be inverted independently.
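Under Eq (10), every term of the REML criterion in Eq (3) decomposes into a sum over the P diagonal blocks, so only the small within-partition matrices are ever formed or inverted. The following sketch (our own illustrative Python; the `part` vector of partition labels is an assumption of the sketch) makes this decomposition concrete:

```python
import numpy as np

def blocked_reml_criterion(theta, y, X, coords, part):
    """Eq (3) evaluated with V replaced by V_part of Eq (10): every term is
    a sum over the P diagonal blocks, each of which can be handled
    independently (and hence in parallel)."""
    tau2, eta2, rho = theta
    p = X.shape[1]
    logdet_V, Txx, Txy, blocks = 0.0, np.zeros((p, p)), np.zeros(p), []
    for g in np.unique(part):
        idx = np.where(part == g)[0]
        c = coords[idx]
        d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)
        Vg = tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)  # block V_{g,g}
        Vgi = np.linalg.inv(Vg)                           # small inverse only
        logdet_V += np.linalg.slogdet(Vg)[1]
        Txx += X[idx].T @ Vgi @ X[idx]
        Txy += X[idx].T @ Vgi @ y[idx]
        blocks.append((idx, Vgi))
    beta = np.linalg.solve(Txx, Txy)                      # pooled beta
    quad = sum((y[i] - X[i] @ beta) @ Vgi @ (y[i] - X[i] @ beta)
               for i, Vgi in blocks)
    return logdet_V + np.linalg.slogdet(Txx)[1] + quad, beta

rng = np.random.default_rng(2)
n = 200
coords = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
part = np.repeat(np.arange(4), 50)   # P = 4 partitions of 50 locations each
crit, beta = blocked_reml_criterion((10.0, 0.1, 0.5), y, X, coords, part)
```

The cost per evaluation is P inversions of (n/P) × (n/P) blocks, so with fixed partition size the work grows linearly in n.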
Note that we are not concerned with the variance of θ̂, which is generally true in classical geostatistics. Rather, θ contains nuisance parameters that require estimation in order to estimate fixed effects and make predictions. Because the data are massive, we can afford to lose some efficiency in estimating the covariance parameters. For example, sample sizes ≥ 125 are generally recommended for estimating the covariance matrix for geostatistical data [61]. REML is for the most part unbiased. If we have thousands of samples, and if we imagine partitioning the spatial locations into data sets (in ways that we describe later), then using Eq (10) in Eq (3) is, essentially, using REML many times to obtain a pooled estimate of θ.
Partitioning the covariance matrix is most closely related to the ideas of quasi-likelihood [62], composite likelihood [45], and divide and conquer [63]. However, for REML, they are not exactly equivalent. From Eq (3), the term log|X′V_θ^{-1}X| becomes log|Σ_{i=1}^P X_i′V_{i,i}^{-1}X_i| when using V_part, whereas composite likelihood accumulates Σ_{i=1}^P log|X_i′V_{i,i}^{-1}X_i|. An advantage of spatial indexing, when compared to composite likelihood, can be seen when X contains columns with many zeros, such as may occur for categorical explanatory variables. Then, partitioning X may result in some X_i that have columns of all zeros, which presents a problem when computing log|X_i′V_{i,i}^{-1}X_i| for composite likelihood, but not when using V_part. The SPIN indexing can also allow for faster inversion of the covariance matrix when estimating fixed effects, but that requires some new results to obtain the proper standard errors of the estimated fixed effects, which we describe next.
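The zero-column issue can be seen in a toy example (our own illustration, with identity covariance blocks for simplicity): a dummy variable absent from one partition makes that partition's log|X_i′V_{i,i}^{-1}X_i| undefined, while the summed term under V_part remains nonsingular.

```python
import numpy as np

# Partition 1 contains both levels of a binary covariate; partition 2 does not
X1 = np.array([[1.0, 1.0], [1.0, 0.0]])
X2 = np.array([[1.0, 0.0], [1.0, 0.0]])  # dummy column is all zeros here
V1i = np.eye(2)                          # stand-in for V_{1,1}^{-1}
V2i = np.eye(2)                          # stand-in for V_{2,2}^{-1}

# Composite likelihood needs log|X_i' V_ii^{-1} X_i| for each partition:
sign2, logdet2 = np.linalg.slogdet(X2.T @ V2i @ X2)  # singular: sign is 0

# Spatial indexing sums the blocks first, then takes one log determinant:
T = X1.T @ V1i @ X1 + X2.T @ V2i @ X2
sign_T, logdet_T = np.linalg.slogdet(T)              # nonsingular
```

Here `slogdet` returns sign 0 for the singular per-partition term, while the pooled matrix T has a finite log determinant.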

Estimation of β
The generalized least squares estimator of β was given in Eq (4). Although the inverse V^{-1} occurs only once (as compared to repeatedly when optimizing the REML equations), it will still be computationally prohibitive if a data set has many thousands of samples. Note that, under the partitioned model Eq (9) with covariance matrix Eq (10), the estimator Eq (4) is

β̂_bd = T_xx^{-1} Σ_{i=1}^P X_i′V_{i,i}^{-1}y_i, (11)

where T_xx = Σ_{i=1}^P X_i′V_{i,i}^{-1}X_i. This is a "pooled estimator" of β across the partitions. It should be a good estimator of β at a much reduced computational cost. It will also be convenient to show that Eq (11) is linear in y, by noting that

β̂_bd = Qy, where Q = T_xx^{-1}[X_1′V_{1,1}^{-1}, X_2′V_{2,2}^{-1}, ..., X_P′V_{P,P}^{-1}]. (12)

To estimate the variance of β̂_bd we cannot ignore the correlation between the partitions, so we consider the full covariance matrix Eq (8). If we compute the covariance matrix of Eq (11) under the full covariance matrix Eq (8), we obtain

var(β̂_bd) = QVQ′ = T_xx^{-1} W_xx T_xx^{-1}, (13)

where W_xx = Σ_{i=1}^P Σ_{j=1}^P X_i′V_{i,i}^{-1}V_{i,j}V_{j,j}^{-1}X_j. Note that while we set parts of V to 0 in Eq (10) in order to estimate θ and β, we computed the variance of β̂_bd using the full V in Eq (8). Using a plug-in estimator, whereby θ is replaced by θ̂, no further inverses of any V_{i,j} are required if all V_{i,i}^{-1} are stored as part of the REML optimization. There is only a single additional inverse required, which is R × R, where R is the rank of the design matrix X, and it is already computed for T_xx^{-1} in Eq (11). Also note that if we simply substituted Eq (10) into Eq (5), then we would obtain only T_xx^{-1} as the variance of β̂_bd. In Eq (13), W_xx is the adjustment that is required for correlation among the partitions for a pooled estimate of β̂_bd. Partitioning of the spatial linear model allows computation of Eq (11), but we then return to the full model to develop Eq (13), which is a new result. This can be contrasted to the approaches for variance estimation of fixed effects using pseudo-likelihood, composite likelihood, and divide and conquer found in the earlier literature review.
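The identity in Eq (13) can be checked numerically: forming Q from Eq (12) and the full V from Eq (8), QVQ′ agrees with T_xx^{-1}W_xx T_xx^{-1}. The following is our own Python check on a small simulated covariance matrix, not production code:

```python
import numpy as np

rng = np.random.default_rng(3)
n, P = 60, 3
coords = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
V = 10.0 * np.exp(-d / 0.5) + 0.1 * (d == 0.0)   # the full V of Eq (8)

parts = np.array_split(np.arange(n), P)           # P contiguous partitions
Vii_inv = [np.linalg.inv(V[np.ix_(i, i)]) for i in parts]
Txx = sum(X[i].T @ Vi @ X[i] for i, Vi in zip(parts, Vii_inv))

# Q of Eq (12): T_xx^{-1} [X_1' V_11^{-1}, ..., X_P' V_PP^{-1}]
Q = np.linalg.solve(Txx, np.hstack([X[i].T @ Vi
                                    for i, Vi in zip(parts, Vii_inv)]))

# W_xx of Eq (13), using the cross blocks V_{i,j} of the full V
Wxx = sum(X[i].T @ Vi @ V[np.ix_(i, j)] @ Vj @ X[j]
          for i, Vi in zip(parts, Vii_inv)
          for j, Vj in zip(parts, Vii_inv))
var_bd = np.linalg.solve(Txx, np.linalg.solve(Txx, Wxx).T)  # T^{-1} W T^{-1}
```

Only the diagonal-block inverses V_{i,i}^{-1} are computed; the off-diagonal blocks V_{i,j} enter W_xx without inversion, just as the text describes.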
Eq (13) is quite fast, and the number of inverse matrices V_{i,i}^{-1} to compute grows linearly (that is, if the observed sample size is 2n, then there are twice as many inverses as for a sample of size n, holding partition size fixed). Also note that all inverses may already be computed as part of the REML estimation of θ. However, Eq (13) is quadratic in pure matrix computations due to the double sum in W_xx. These can be made parallel, but may take too long for more than about 100,000 samples. One alternative is to use the empirical variation in β̂_i = (X_i′V_{i,i}^{-1}X_i)^{-1}X_i′V_{i,i}^{-1}y_i, where the ith matrix calculations are already needed for Eq (11) and β̂_i can be simply computed and stored. Then, let

Ĉ_emp = [1/(P(P − 1))] Σ_{i=1}^P (β̂_i − β̄)(β̂_i − β̄)′, where β̄ = (1/P) Σ_{i=1}^P β̂_i, (14)

which has been used before for partitioned data, e.g. [64]. A second alternative is to pool the estimated variances of the individual β̂_i, which are (X_i′V_{i,i}^{-1}X_i)^{-1},

Ĉ_pool = (1/P²) Σ_{i=1}^P (X_i′V_{i,i}^{-1}X_i)^{-1}, (15)

where the first P in the denominator is for averaging the individual var̂(β̂_i), and the second P is the reduction in variance due to averaging the β̂_i. Eqs (13)-(15) are tested and compared below using simulations.

Point prediction
The predictor of Y(s_0) was given in Eq (6). As for estimation, the inverse V^{-1} occurs only once (as compared to repeatedly when optimizing to obtain the REML estimates), but if the data set has tens of thousands of samples, it will still be computationally prohibitive. Note that under the partitioned model, Eq (9), which assumes zero correlation among partitions, Eq (10), the predictor from Eq (6) is

Ŷ(s_0) = x_0′β̂_bd + ĉ_0′V̂_part^{-1}(y − Xβ̂_bd), (16)

where β̂_bd is obtained from Eq (11), using the same autocorrelation model and parameters as for V̂. Even though the predictor is developed under the block-diagonal matrix Eq (10), the true prediction variance can be computed under Eq (8), as we did for estimation. However, the performance of these predictors turned out to be quite poor.
We recommend point predictions based on local data instead, which is an old idea, e.g. [43], and has been implemented in software for some time, e.g. [10]. The local data may be in the form of a spatial limitation, such as a radius around the prediction point, or a fixed number of nearest neighbors. For example, the R [65] package nabor [66] finds nearest neighbors among hundreds of thousands of samples very quickly. Our method is to use a single set of global covariance parameters, as estimated under the covariance matrix partition Eq (10), and then predict with a fixed number of nearest neighbors. We investigate the effect of the number of nearest neighbors through simulation.
A purely local predictor lacks model coherency, as discussed in the literature review section.
We use a single θ̂ for the covariance, but there is still the issue of β. As seen in Eq (6), estimation of β is implicit in the prediction equations. If y_j ⊂ y are the data in the neighborhood of prediction location s_j, then using Eq (6) with local y_j implicitly adopts a varying coefficient model for β, making it also local, call it β̂_j, and it will vary for each prediction location s_j. A further issue arises if there are categorical covariates. It is possible that a level of the covariate is not present in the local neighborhood, so some care is needed to collapse any columns in the design matrix that are all zeros. These are some of the issues that call into question the "coherency" of a model when predicting locally. Instead, as for estimating the covariance parameters, we assume that the goal is to have a single global estimate of β. Then we take as our predictor for the jth prediction location

Ŷ′(s_j) = x_j′β̂_bd + ĉ_j′V̂_j^{-1}(y_j − X_jβ̂_bd), (17)

where X_j and V̂_j are the design and covariance matrices, respectively, for the same neighborhood as y_j, x_j is the vector of covariates at prediction location j, ĉ_j = ĉov(Y(s_j), y_j) (using the same autocorrelation model and parameters as for V̂_j), and β̂_bd was given in Eq (11). It will be convenient for block kriging to note that, if we condition on V̂_j being fixed, then Eq (17) can be written as a linear combination of y, call it λ_j′y, similar to η_0′y as mentioned after Eq (6). Suppose there are m neighbors around s_j, so y_j is m × 1. Let y_j = N_j y, where N_j is an m × n matrix of zeros and ones that subsets the n × 1 vector of all data to only those in the neighborhood. Then

λ_j′ = x_j′Q + ĉ_j′V̂_j^{-1}(N_j − X_jQ), (18)

where Q was defined in Eq (12). Let Ĉ be an estimator of var(β̂_bd) from Eqs (13), (14), or (15); then the prediction variance of Eq (17), var(Y(s_j) − Ŷ′(s_j)) when using the local neighborhood set of data, is

σ̂² − ĉ_j′V̂_j^{-1}ĉ_j + (x_j − X_j′V̂_j^{-1}ĉ_j)′Ĉ(x_j − X_j′V̂_j^{-1}ĉ_j), (19)

where σ̂² is the estimated value of var(Y(s_j)) using θ̂ and the same autocorrelation model that was used for V̂. Eq (19) can be compared to Eq (7).
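A sketch of the local predictor in Eq (17) follows (our own illustrative Python; the nearest-neighbor search by `argsort` is a stand-in for a proper k-d tree such as the one in the nabor package, and `beta` plays the role of the global β̂_bd). With no nugget effect (η² = 0), kriging is exact at an observed location, which gives a handy sanity check.

```python
import numpy as np

def local_predict(s0, x0, coords, X, y, beta, tau2, eta2, rho, m=50):
    """Eq (17): predict at s0 from its m nearest neighbors,
    using a single global coefficient vector beta."""
    dist0 = np.linalg.norm(coords - s0, axis=1)
    nn = np.argsort(dist0)[:m]               # indices of the m nearest neighbors
    c, Xj, yj = coords[nn], X[nn], y[nn]
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)
    Vj = tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)        # local V_j, Eq (2)
    c0 = tau2 * np.exp(-dist0[nn] / rho) + eta2 * (dist0[nn] == 0.0)
    return x0 @ beta + c0 @ np.linalg.solve(Vj, yj - Xj @ beta)

rng = np.random.default_rng(4)
n = 40
coords = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)
beta = np.array([0.5, 1.0])
# with eta2 = 0, the prediction at an observed location reproduces the datum
pred = local_predict(coords[7], X[7], coords, X, y, beta,
                     tau2=2.0, eta2=0.0, rho=0.3, m=20)
```

The cost per prediction involves only an m × m solve, so total prediction cost grows linearly in the number of prediction sites.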

Block prediction
None of the literature reviewed earlier considered block prediction, yet it is an important goal in many applications. In fact, the origins of kriging were founded on estimating total gold reserves for mining [9]. The goal of block prediction is to predict the average value over a region, rather than at a point. If that region is a compact set of points denoted B, then the random quantity is

Y(B) = (1/|B|) ∫_B Y(s)ds, (20)

where |B| = ∫_B 1 ds is the area of B. In practice, we approximate the integral by a dense set of points on a regular grid within B. Call that dense set of points D = {s_j; j = n + 1, ..., N}, where recall that {s_j; j = 1, ..., n} are the observed data. We are in the same situation as for prediction of single sites, where we are unable to invert the covariance matrix of all n observed locations when predicting {Ŷ(s_j); j = n + 1, n + 2, ..., N}. Instead, let us use the local predictions as developed in the previous section, which we average to compute the block prediction. Let the point predictions be a set of random variables denoted {Ŷ′(s_j); j = n + 1, n + 2, ..., N}. Denote by y_o the vector of random variables at observed locations, and by y_u the vector of unobserved random variables on the prediction grid D to be used as an approximation to the block. Recall from Eq (18) that Ŷ′(s_j) = λ_j′y_o. We can put all of the λ_j′ as rows of a large matrix, W. The grid-based approximation to Eq (20), the average of all point predictions, is then

Ŷ(B) = a′Wy_o, (21)

where a = (1/(N − n), 1/(N − n), ..., 1/(N − n))′. Let a_*′ = a′W, so the block prediction a_*′y_o is also linear in y_o. Let the covariance matrix of the vector (y_o′, y_u′)′ be

[ V_{o,o} V_{o,u} ; V_{u,o} V_{u,u} ],

where V_{o,o} = V in Eq (8). Then, assuming unbiasedness, that is, E(y_o) = X_oβ and E(y_u) = X_uβ, where X_o and X_u are the design matrices for the observed and unobserved variables, respectively, the block prediction variance is

var(Y(B) − Ŷ(B)) = a_*′V_{o,o}a_* − 2a_*′V_{o,u}a + a′V_{u,u}a. (22)

Although the various parts of V can be very large, the necessary vectors can be created on-the-fly to avoid creating and storing the whole matrix. For example, take the third term in Eq (22). To make the kth element of the vector V_{u,u}a, we can create the kth row of V_{u,u} and take its inner product with a. This means that only the vector V_{u,u}a must be stored. We then take the inner product of this vector with a to obtain a′V_{u,u}a. Also note that computing Eq (21) grows linearly with the observed sample size n, due to fixing the number of neighbors used for prediction, but Eq (22) grows quadratically, in both n and N, simply due to the matrix dimensions of V_{o,o} and V_{u,u}. We can control the growth of N by choosing the density of the grid approximation, but it may require subsampling of y_o if the number of observed data is too large. We often have very precise estimates of block averages, so this may not be too onerous if we have hundreds of thousands of observations.
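The on-the-fly computation of a′V_{u,u}a can be sketched as follows (our own illustrative Python; the row-block size `chunk` is an arbitrary tuning choice). The streamed result is checked against the dense computation on a small grid:

```python
import numpy as np

def exp_cov_rows(a_coords, b_coords, tau2, eta2, rho):
    """Exponential-model covariances, Eq (2), between two coordinate sets."""
    d = np.linalg.norm(a_coords[:, None, :] - b_coords[None, :, :], axis=2)
    return tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)

def quad_form_on_the_fly(coords_u, a, tau2, eta2, rho, chunk=128):
    """a' V_uu a built row-block by row-block: only the vector V_uu a is
    ever stored, never the full matrix V_uu."""
    Nu = coords_u.shape[0]
    Va = np.empty(Nu)
    for s in range(0, Nu, chunk):
        rows = exp_cov_rows(coords_u[s:s + chunk], coords_u, tau2, eta2, rho)
        Va[s:s + chunk] = rows @ a           # a block of elements of V_uu a
    return a @ Va

g = np.linspace(0.0, 1.0, 20)
coords_u = np.array([(x, yv) for x in g for yv in g])   # 400 grid points
a = np.full(400, 1.0 / 400)                             # equal block weights
q_stream = quad_form_on_the_fly(coords_u, a, 10.0, 0.1, 0.5)
q_dense = a @ exp_cov_rows(coords_u, coords_u, 10.0, 0.1, 0.5) @ a
```

Memory use is O(N · chunk) per block rather than O(N²) for the full V_{u,u}.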

The SPIN method
As we have shown, SPIN is a collection of methods for covariance parameter estimation, fixed effects estimation, and point and block prediction, based on spatial indexing. SPIN, as described above, estimates covariance parameters using REML, given by Eq (3), with a valid autocovariance model [e.g., Eq (2)] used in a partitioned covariance matrix, given by Eq (10). Using these estimated covariance parameters, we estimate β using Eq (11), with estimated covariance matrix Eq (13), unless we explicitly state the use of Eq (14) or (15). For point prediction, we use Eq (17) with estimated variance Eq (19), unless we explicitly state the purely local version for β given by Eq (6) with estimated variance Eq (7). For block prediction, we use Eq (21) with Eq (22).

Simulations
To test the validity of SPIN, we simulated n spatial locations randomly within the [0, 1] × [0, 1] unit square to be used as observations, and we created a uniformly-spaced (N − n) = 40 × 40 prediction grid within the unit square.
We simulated data with two methods. The first simulation method created data sets that were not actually very large, using exact geostatistical methods that require the Cholesky decomposition of the covariance matrix. For these simulations, we used the spherical autocovariance model to construct V, where terms are defined as in Eq (2). To simulate normally distributed data from N(0, V), let L be the lower triangular matrix such that V = LL′. If the vector z is simulated as independent standard normal variables, then ε = Lz is a simulation from N(0, V). Unfortunately, computing L is an O(n³) algorithm, on the same order as inverting V, which limits the size of data for simulation, because simulated values are required at all N spatial locations. We call this the GEOSTAT simulation method. For all simulations, we fixed τ² = 10 and η² = 0.1, but allowed ρ to vary randomly from a uniform distribution between 0 and 2.

We created another method for simulating spatially patterned data for up to several million records. Let S = [s_1, s_2] be the 2-column matrix of the spatial coordinates of the data, where s_1 is the first coordinate and s_2 is the second coordinate. Let

S_i = S [ cos(U_{1,i}π) −sin(U_{1,i}π) ; sin(U_{1,i}π) cos(U_{1,i}π) ]

be a random rotation of the coordinate system by the radian angle U_{1,i}π, where U_{1,i} is a uniform random variable. Then each ε_i is a 2-dimensional sine-wave surface on the rotated coordinates, with a random amplitude (due to uniform random variable U_{2,i}), random frequencies on each coordinate (due to uniform random variables U_{3,i} and U_{5,i}), and random shifts on each coordinate (due to uniform random variables U_{4,i} and U_{6,i}). The response error is then created by taking ε = Σ_{i=1}^{100} ε_i, where expected amplitudes decrease linearly, and expected frequencies increase, with each i. Further, the ε were standardized to zero mean and a variance of 10 for each simulation, and we added a small independent component with variance 0.1 to each location, similar to the nugget effect η² of the GEOSTAT method. Fig 2c and 2d show two realizations from the sum of random sine-wave surfaces, where the sample size was 100,000. Each simulation took about 2 seconds. We call this the SUMSINE simulation method.
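A generator in the spirit of SUMSINE can be sketched as follows. This is our own paraphrase in Python: the exact functional form of each sine surface, and the amplitude/frequency schedules, are our assumptions based on the verbal description above, not the authors' code.

```python
import numpy as np

def sumsine(coords, n_surf=100, rng=None):
    """Sum of randomly rotated 2-D sine surfaces with amplitudes shrinking
    and frequencies growing in i, standardized to variance 10 plus a small
    independent (nugget-like) component with variance 0.1."""
    rng = np.random.default_rng() if rng is None else rng
    n = coords.shape[0]
    eps = np.zeros(n)
    for i in range(1, n_surf + 1):
        u = rng.uniform(size=6)
        th = u[0] * np.pi                               # random rotation angle
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        s = coords @ R.T                                # rotated coordinates
        amp = u[1] * (n_surf - i + 1) / n_surf          # amplitudes decrease in i
        f1, f2 = u[2] * i, u[4] * i                     # frequencies increase in i
        eps += amp * (np.sin(2 * np.pi * (f1 * s[:, 0] + u[3]))
                      * np.sin(2 * np.pi * (f2 * s[:, 1] + u[5])))
    eps = (eps - eps.mean()) / eps.std() * np.sqrt(10.0)   # variance 10
    return eps + rng.normal(scale=np.sqrt(0.1), size=n)    # nugget-like noise

rng = np.random.default_rng(5)
coords = rng.uniform(size=(10000, 2))
eps = sumsine(coords, rng=rng)
```

Because no Cholesky factor of an n × n matrix is needed, the cost is linear in n, which is what makes million-record simulations feasible.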
Thus, the random errors ε for the simulations were based on either the GEOSTAT or the SUMSINE method. In either case, we created two fixed effects. A covariate, x_1(s_i), was generated from standard independent normal distributions at the s_i locations. A second, spatially patterned covariate, x_2(s_i), was created using the same model as the random error simulation for ε, but from a different realization. Then the response variable was created as

y(s_i) = β_0 + β_1 x_1(s_i) + β_2 x_2(s_i) + ε(s_i),

for i = 1, 2, ..., n for a specified sample size n, or i = 1, 2, ..., N (if wanting simulations at prediction sites).

Evaluation of simulation results
For one summary of performance of fixed effects estimation, we consider the simulation-based estimator of root-mean-squared error,

RMSE = sqrt[ (1/K) Σ_{k=1}^K (β̂_{p,k} − β_p)² ], (25)

where β̂_{p,k} is the kth simulation estimate of the pth β parameter among K simulations, and β_p is the true parameter value used in the simulations. We only consider β_1 and β_2 in Eq (25). The next simulation-based estimator we consider is 90% confidence interval coverage,

CI90 = (1/K) Σ_{k=1}^K I( |β̂_{p,k} − β_p| ≤ 1.645 sqrt[var̂(β̂_{p,k})] ).

To evaluate point prediction, we also consider the simulation-based estimator of root-mean-squared prediction error,

RMSPE = sqrt[ (1/(K · 1600)) Σ_{k=1}^K Σ_{j=1}^{1600} (Ŷ_k(s_j) − y_k(s_j))² ],

where Ŷ_k(s_j) is the predicted value at the jth prediction location for the kth simulation and y_k(s_j) is the realized value at the jth location for the kth simulation. The final summary that we consider is 90% prediction interval coverage,

PI90 = (1/(K · 1600)) Σ_{k=1}^K Σ_{j=1}^{1600} I( |Ŷ_k(s_j) − y_k(s_j)| ≤ 1.645 sqrt[var̂(Ŷ_k(s_j))] ),

where var̂(Ŷ_k(s_j)) is an estimator of the prediction variance.
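These summaries are straightforward to compute; a small self-check (our own Python, with a synthetic unbiased estimator whose standard error is known to be 1) confirms that the RMSE and CI90 formulas recover their nominal values:

```python
import numpy as np

def rmse(est, truth):
    """Eq (25): root of the mean squared error over K simulations."""
    return np.sqrt(np.mean((est - truth) ** 2))

def ci90_coverage(est, se, truth):
    """Proportion of nominal 90% intervals (est +/- 1.645 se) covering truth."""
    return np.mean(np.abs(est - truth) <= 1.645 * se)

rng = np.random.default_rng(6)
K = 100000
truth = 2.0
est = rng.normal(loc=truth, scale=1.0, size=K)  # unbiased estimates, SE = 1
est_rmse = rmse(est, truth)                     # should be close to 1.0
cover = ci90_coverage(est, np.ones(K), truth)   # should be close to 0.90
```

The RMSPE and PI90 summaries are the same computations applied over the K × 1600 grid of prediction errors.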

Effect of partition method
We wanted to test SPIN over a wide range of data. Hence, we simulated 1000 data sets where the simulation method was chosen randomly, with equal probability, between the GEOSTAT and SUMSINE methods. If GEOSTAT was chosen, a random sample size between 1000 and 2000 was generated. If SUMSINE was chosen, a random sample size between 2000 and 10,000 was generated. Thus, throughout the study, the simulations occurred over a wide range of parameters, with two different simulation methods and randomly varying autocorrelation. In all cases, the error models fitted to the data were misspecified, because we fitted an exponential autocorrelation model to data from the true models, GEOSTAT and SUMSINE, that generated them. This should provide a good test of the robustness of the SPIN method and provide fairly general conclusions on the effect of partition method.
After simulating the data, we considered 3 indexing methods. One was completely random, the second was spatially compact, and the third was a mixed strategy, starting with compact partitions and then randomly reassigning 10% of locations. To create compact data partitions, we used k-means clustering [67] on the spatial coordinates. K-means has the property of minimizing within-group variances and maximizing among-group variances; when applied to spatial coordinates, it creates spatially compact partitions. An example of each partition method is given in Fig 3. We created partition sizes that ranged randomly from a target of 25 to 225 locations per group (k-means has some variation in group size). It is possible to create one partition for covariance estimation, and another partition for estimating fixed effects. Therefore, we considered all nine combinations of the three partition methods for the two estimation tasks.
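The compact-partition step can be sketched with a plain Lloyd's k-means on the coordinates (our own minimal Python stand-in for the k-means implementation used in the paper; `P` is the number of partitions, so a target size of 50 per partition means P ≈ n/50):

```python
import numpy as np

def compact_partitions(coords, P, iters=25, seed=0):
    """Lloyd's k-means on spatial coordinates: returns a partition label
    for each location; groups are spatially compact by construction."""
    rng = np.random.default_rng(seed)
    centers = coords[rng.choice(coords.shape[0], size=P, replace=False)]
    labels = np.zeros(coords.shape[0], dtype=int)
    for _ in range(iters):
        # assign each location to its nearest center
        d = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned locations
        for p in range(P):
            pts = coords[labels == p]
            if len(pts):
                centers[p] = pts.mean(axis=0)
    return labels

rng = np.random.default_rng(7)
coords = rng.uniform(size=(5000, 2))
labels = compact_partitions(coords, P=100)   # target size ~50 per partition
```

In production one would use an optimized k-means (e.g., the standard R `kmeans` function); the sketch only illustrates how partition labels are derived from coordinates alone.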
Table 3 shows performance summaries for the three partition methods, for both fixed effect estimation and point prediction, over wide-ranging simulations when using SPIN with 50 nearest-neighbors for predictions. It is clear that, whether for fixed effect estimation or prediction, compact partitioning was the best option. The worst option was random partitioning. The mixed approach was often close to compact partitioning in performance.

Effect of partition size
Next, we investigated the effect of partition size. We only used compact partitions, because they were best, and we used partition sizes of 25, 50, 100, and 200 for both covariance parameter estimation and fixed effect estimation, and again used 50 nearest-neighbors for predictions. We simulated data in the same way as above, and used the same performance summaries. Here, we also included the average time, in seconds, for each estimator. The results are shown in Table 4. In general, larger partition sizes had better RMSE for estimating covariance parameters, but the gains were very small after size 50. For fixed effects estimation, a partition size of 50 was often better than 100, and approximately equal to size 200. For prediction, RMSPE was lower as partition size increased. In terms of computing speed, covariance parameter estimation was slower as partition size increased, but fixed effect estimation was faster as partition size increased (because of fewer loops in Eq (13)). Partition sizes of 25 often had poor coverage in terms of both CI90 and PI90, but coverage was good for other partition sizes. Based on Tables 3 and 4, one good overall strategy is to use compact partitions of block size 50 for covariance parameter estimation, and block size 200 for fixed effect estimation, for both efficiency and speed. Note that when the partition size for fixed effect estimation differs from that for covariance parameter estimation, new inverses of the diagonal blocks in Eq (10) are needed. If the partition size is the same for fixed effect and covariance parameter estimation, inverses of diagonal blocks can be passed from REML to fixed effects estimation, so another good strategy is to use block size 50 for both fixed effect and covariance parameter estimation.

Table 3. Effect of partition method. Results using 1000 simulations as described in the text. The first column of the table gives the data partition method for covariance parameter estimation (COPE) using REML, which was one of random partitioning (RAND), compact partitioning (COMP), or a mix of compact with 10% randomly redistributed (MIXD). The second column of the table gives the data partition method for fixed effects estimation (FEFE), which was one of RAND, COMP, or MIXD, while using the covariance parameters as estimated with the method in the first column. RMSE, RMSPE, CI90, and PI90 are described in the text. RMSE_1 and RMSE_2 are for the first (spatially independent) and second (spatially patterned) covariates, respectively. Similarly, CI90_1 and CI90_2 are for the first and second covariates, respectively. https://doi.org/10.1371/journal.pone.0291906.t003
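The reason per-block inverses can be reused, and why partition size drives the cost of fixed effects estimation, can be seen in a sketch of block-wise generalized least squares under a block-diagonal covariance. This is a simplified illustration of the general idea, not the paper's exact Eq (13); in particular, the naive covariance of the estimate below omits any correction terms the block construction may require:

```python
import numpy as np

def block_gls(X_blocks, y_blocks, V_blocks):
    """GLS estimate of fixed effects when the covariance matrix is
    block-diagonal: each small block is inverted separately, so cost is
    linear in the number of blocks and cubic only in the block size."""
    p = X_blocks[0].shape[1]
    XtVinvX = np.zeros((p, p))
    XtVinvy = np.zeros(p)
    for X_i, y_i, V_i in zip(X_blocks, y_blocks, V_blocks):
        Vinv_i = np.linalg.inv(V_i)        # small per-block inverse, reusable
        XtVinvX += X_i.T @ Vinv_i @ X_i
        XtVinvy += X_i.T @ Vinv_i @ y_i
    cov_beta = np.linalg.inv(XtVinvX)      # naive covariance of beta-hat
    beta_hat = cov_beta @ XtVinvy
    return beta_hat, cov_beta
```

Because the `Vinv_i` blocks depend only on the partition and the covariance parameters, they can be computed once during REML and passed along to fixed effects estimation whenever the two steps share the same partition.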

Variance estimation for fixed effects
In the section on estimating β, we described three possible estimators for the covariance matrix of β̂, with Eq (13) being theoretically correct, and Eqs (14) and (15) being faster alternatives. The alternative estimators will only be necessary for very large sample sizes, so to test their efficacy we simulated 1000 data sets with random sample sizes, from 10,000 to 100,000, using the SUMSINE method. We then fitted the covariance model, using compact partitions of size 50, and fixed effects, using partition sizes of 25, 50, 100, and 200. We computed the estimated covariance matrix of the fixed effects using Eqs (13)-(15), and evaluated performance based on 90% confidence interval coverage.
Results in Table 5 show that all three estimators, at all block sizes, have confidence interval coverage very close to the nominal 90% when estimating β_1, the coefficient of the spatially independent covariate. However, when estimating β_2, the coefficient of the spatially patterned covariate, the theoretical estimator has proper coverage for block sizes 50 and greater, while the two alternative estimators have proper coverage only for block size 50. It is surprising that the results for the alternative estimators are so specific to a particular block size, and these estimators warrant further research.

Prediction with global estimate of β
In the sections on point and block prediction, we described prediction using both a local estimator of β and the global estimator β̂. To compare them, and to examine the effect of the number of nearest neighbors, we simulated 1000 data sets as described earlier, using compact partitions of size 50 for both covariance and fixed-effects estimation. We then predicted values on the gridded locations with 25, 50, 100, and 200 nearest neighbors.
Results in Table 6 show that prediction with the global estimator β̂ had smaller RMSPE, especially with smaller numbers of nearest neighbors. As expected, predictors had lower RMSPE with more nearest neighbors, but gains were small beyond 50 neighbors. Prediction intervals for both methods had proper coverage. The local estimator of β was faster because it used the local estimator of the covariance of β, while predictions with β̂ needed the global covariance estimator (Eq 13) to be used in Eq (19). Higher numbers of nearest neighbors took longer to compute, especially with numbers greater than 100. Of course, predictions for the block average had much smaller RMSPE than point predictions. Again, prediction improved with more nearest neighbors, but improvements were small with more than 50. Computing time for block averaging increased with the number of neighbors, especially when greater than 100, and took longer than point prediction.
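A nearest-neighbor point predictor with a global β can be sketched as follows. This is illustrative only: it uses a simple exponential covariance as an assumed model, and it omits the extra variance contribution from estimating β (the role Eq (19) plays in the paper); all names and parameter values are ours:

```python
import numpy as np

def nn_krige(s0, x0, coords, X, y, beta, cov_fun, m=50):
    """Predict at location s0 (covariates x0) from its m nearest neighbors,
    applying a kriging predictor to residuals from the global trend X @ beta."""
    d = np.linalg.norm(coords - s0, axis=1)
    nn = np.argsort(d)[:m]                     # indices of m nearest neighbors
    S = coords[nn]
    C = cov_fun(np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2))
    c0 = cov_fun(d[nn])                        # covariances with s0
    w = np.linalg.solve(C, c0)                 # kriging weights
    resid = y[nn] - X[nn] @ beta               # residuals from the global trend
    pred = x0 @ beta + w @ resid
    pred_var = cov_fun(0.0) - c0 @ w           # naive kriging variance
    return float(pred), float(pred_var)

# an assumed exponential covariance: partial sill 1, range 0.5, no nugget
expcov = lambda h: np.exp(-np.asarray(h, dtype=float) / 0.5)
```

With a nugget-free covariance, predicting at an observed location returns the observed value with zero variance, which is a convenient sanity check for the weights.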

A comparison of methods
To compare methods, we simulated 1000 data sets using GEOSTAT (partial sill was 10, range was 0.5, and nugget was 0.1), where we fixed the sample size at n = 1000 and standardized the errors before adding fixed effects. We compared 3 methods: 1) estimation and prediction using the full covariance matrix for all 1000 points, 2) SPIN with compact blocks of 50 for both covariance and fixed effects parameter estimation, and 50 nearest-neighbors for prediction, and 3) nearest-neighbor Gaussian processes (NNGP). NNGP had good performance in [16] and software is readily available in the R package spNNGP [68]. For spNNGP, we used default parameters for the conjugate prior method and a 25 × 25 search grid for phi and alpha, which were the dimensions of the search grid used in [16]. We stress that we do not claim this to be a definitive comparison among methods, as the developers of NNGP could surely make adjustments to improve performance. Likewise, partition size and number of nearest neighbors for prediction could be adjusted to optimize performance of SPIN for any given simulation or data set. We offer these results to show that, broadly, SPIN and NNGP are comparable, and very fast, with little performance lost in comparison to using the full covariance matrix. Table 7 shows that RMSE for estimation of the spatially independent covariate and the spatially patterned covariate were approximately equal for SPIN and NNGP, and only slightly worse than for the full covariance matrix. RMSPE for SPIN was equal to that of the full covariance matrix, and both were just slightly better than NNGP. Confidence and prediction intervals for all three methods were very close to the nominal 90%.
Fig 4 shows computing times, using 5 replicate simulations, for each method for up to 100,000 records. Both NNGP and SPIN can use parallel processing, but here we used a single processor to remove any differences due to parallel implementations. Fitting the full covariance matrix with REML, which is iterative, took more than 30 minutes with sample sizes > 2500. Computing time for NNGP is clearly linear with sample size, while for SPIN, it is quadratic when using Eq (13), but linear when using the alternative variance estimators for fixed effects (Eqs 14 and 15). Using the alternative variance estimators, SPIN was about 10 times faster than NNGP, and even with quadratic growth when using Eq (13), SPIN was faster than NNGP for up to 100,000 records.

Application to stream networks
We applied spatial indexing to covariance matrices constructed using stream network models as described for the motivating example in the Introduction. These are variance component models, with a tail-up component, a tail-down component, and a Euclidean-distance component, each with 2 covariance parameters, along with a nugget effect; thus, there are 7 covariance parameters (4 partial sills and 3 range parameters). A full covariance matrix was developed for these models [69], and we easily adapted it for spatial partitioning. We used compact blocks of size 50 for estimation, and 50 nearest neighbors for predictions. The 4 partial sill estimates were 1.76, 0.40, 2.57, and 0.66 for the tail-up, tail-down, Euclidean-distance, and nugget components, respectively. These indicate that the tail-up and Euclidean-distance components dominated the structure of the overall autocovariance, and both had large range parameters. It took 7.98 minutes to fit the covariance parameters. Fitting the fixed effects took an additional 2.15 minutes of computing time (Table 8), with estimates very similar to results found in [55]. We made predictions for 65,099 locations. In summary, the original analysis [55] took 10 days of continuous computing time to fit the model and make predictions with a full 9521 × 9521 covariance matrix. Using SPIN, fitting the same model took about 10 minutes, with an additional 47 minutes for predictions. Note that these models take more time than Euclidean-distance models alone because there are 7 covariance parameters, and the tail-up and tail-down models use stream distance, which takes longer to compute. For this example, we used parallel processing with 8 cores when fitting covariance parameters and fixed effects, and making predictions, which made analyses considerably faster. We did not use block prediction, because that was not a particular goal for this study. However, it is generally important, and has been used for estimating fish abundance [70].

Table 8. Fixed effects estimates for the stream network model. The se(β̂) is the standard error using Eq (13). The z-value is the estimate divided by its standard error. Prob(> |z|) is the probability of getting the fixed effect estimate if it were truly 0, assuming a standard normal distribution. 1 Elevation (m/1000) at sensor site. 2 Slope (100 m/m) of stream reach at sensor site. 3 Percentage of watershed upstream of sensor site composed of lake or reservoir surfaces. 4 Mean annual precipitation (mm) in watershed upstream of sensor site. 5 Albers equal-area northing coordinate (10 km) of sensor site. 6 Percentage of base flow to total flow at sensor site. 7 Drainage area (10,000 km²) upstream of sensor site. 8 Riparian canopy coverage (%) of the 1 km stream reach encompassing a sensor site. 9 Mean annual August air temperature (°C). 10 Mean annual August discharge (m³/sec). https://doi.org/10.1371/journal.pone.0291906.t008

Discussion and conclusions
We have explored spatial partitioning to speed computations for massive data sets. We have provided novel and theoretically correct development of variance estimators for all quantities. We proposed a globally coherent model for covariance and fixed effects estimation, and then used that model for improved predictions, even when those predictions are made locally based on nearest neighbors. We include block kriging in our development, which is largely absent from the literature on big data for spatial methods.
Our simulations showed that, over a range of sample sizes, simulation methods, and autocorrelation values, spatially compact partitions are best. There does not appear to be a need for "large blocks," as used in [54]. A good overall strategy, which combines speed without giving up much precision, is based on 50/50/50, where compact partitions of size 50 are used for both covariance parameter estimation and fixed effects estimation, and 50 nearest neighbors are used for prediction. This strategy compares very favorably with a default strategy for NNGP.
One benefit of the data indexing is that it extends easily to any geostatistical model with a valid covariance matrix. There is no need to approximate a Gaussian process. We provided one example for stream network models, but other examples include geometric anisotropy, nonstationary models, spatio-temporal models (including those that are nonseparable), etc. Any valid covariance matrix can be indexed and partitioned, offering both faster matrix inversions and parallel computing, while providing valid inferences with proper uncertainty assessment.
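To illustrate this generality: the indexed model only needs a function that returns a valid covariance matrix for the locations within each partition, so any covariance family can be plugged in. A sketch, using a geometrically anisotropic exponential covariance as an arbitrary example (function names and parameter values are ours):

```python
import numpy as np

def block_covariances(coords, labels, cov_fun):
    """Return the list of within-partition covariance matrices for a
    user-supplied covariance function; this is all the indexed model needs."""
    blocks = []
    for g in np.unique(labels):
        S = coords[labels == g]
        blocks.append(cov_fun(S, S))
    return blocks

def aniso_exp_cov(S1, S2, sill=1.0, ranges=(2.0, 0.5), nugget=0.1):
    # geometrically anisotropic exponential covariance (illustrative choice):
    # coordinate axes are scaled by different range parameters before
    # computing distance, then a nugget is added at zero distance
    diff = (S1[:, None, :] - S2[None, :, :]) / np.asarray(ranges)
    h = np.linalg.norm(diff, axis=2)
    C = sill * np.exp(-h)
    C[h == 0.0] += nugget
    return C
```

Swapping in a stream-distance or spatio-temporal covariance function changes nothing else in the partitioned workflow, which is the sense in which the approach does not approximate the process itself.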

Fig 1. Study area for the motivating example. (a) A stream network from the mid-Columbia River basin, where purple points show 9521 sample locations that measured mean water temperature during August. (b) Most of the stream network is located in Washington and Oregon in the United States. (c) A close-up of the black rectangle in (a). The orange points are prediction locations. https://doi.org/10.1371/journal.pone.0291906.g001

Fig 2. Examples of simulated surfaces used to test methods. (a) and (b) are two different realizations of 2000 values from the GEOSTAT method with a range of 2. (c) and (d) are two realizations of 100,000 values from the SUMSINE method. Bluer values are lower, and yellower areas are higher. https://doi.org/10.1371/journal.pone.0291906.g002

Fig 3. Illustration of three methods for partitioning data. Sample size was 1000, and the data were partitioned into 5 groups of 200 each. (a) Random assignment to groups. (b) K-means clustering on x- and y-coordinates. (c) K-means on x- and y-coordinates, with 10% randomly re-assigned from each group. Each color represents a different grouping. https://doi.org/10.1371/journal.pone.0291906.g003

Table 4. Effect of partition sizes.
Results are based on 1000 simulations, using the same simulation parameters as in Table 3. The first column of the table gives the data partition size for covariance parameter estimation (COPE), and the second column gives the data partition size for fixed effects estimation (FEFE), while using the covariance parameters as estimated with the size in the first column. The columns RMSE_1, RMSE_2, RMSPE, CI90_1, CI90_2, and PI90 are the same as in Table 3. TIME_C is the average time, in seconds, for covariance parameter estimation, and TIME_F is the average time, in seconds, for fixed effects estimation.

Table 6. Effect of number of nearest neighbors for RMSPE and PI90.
Results are based on 1000 simulations, using the same simulation parameters as in Table 3. The first column of the table gives the number of nearest neighbors. TIME is the average computing time in seconds. The subscript 1 indicates a local estimator of β using Eq (6), the subscript 2 indicates the global estimator of β using Eq (17), and the subscript 3 indicates the block predictor, Eq (21). https://doi.org/10.1371/journal.pone.0291906.t006

Table 7. Comparison of 3 methods for fixed effects estimation and point prediction.
Data were simulated from 1000 random locations with a 40 × 40 prediction grid. The first column of the table gives the method, where Full uses the full 1000 × 1000 covariance matrix, and SPIN uses spatial partitioning with compact blocks of size 50 and 50 nearest-neighbor prediction points. NNGP uses default parameters from the R package for the conjugate prior method with a 25 × 25 search grid on phi and alpha. The columns RMSE_1, RMSE_2, RMSPE, CI90_1, CI90_2, and PI90 are the same as in Table 3. TIME is the average time, in seconds, for fixed effects estimation and prediction combined.