Forest surveys provide critical information for many diverse interests. Data are often collected from samples, and from these samples, maps of resources and estimates of areal totals or averages are required. In this paper, two approaches for mapping and estimating totals, the spatial linear model (SLM) and k-NN (k-Nearest Neighbor), are compared theoretically, through simulations, and as applied to real forestry data. While both methods have desirable properties, a review shows that the SLM has prediction optimality properties, and can be quite robust. Simulations of artificial populations and resamplings of real forestry data show that the SLM has smaller empirical root-mean-squared prediction errors (RMSPE) for a wide variety of data types, with generally less bias and better interval coverage than k-NN. These patterns held for both point predictions and for population totals or averages, with the SLM reducing RMSPE by 9% to 67% relative to some popular k-NN methods, and with the SLM also more robust to spatially imbalanced sampling. Estimating prediction standard errors remains a problem for k-NN predictors, despite recent attempts using model-based methods. Our conclusions are that the SLM should generally be used rather than k-NN if the goal is accurate mapping or estimation of population totals or averages.
Citation: Ver Hoef JM, Temesgen H (2013) A Comparison of the Spatial Linear Model to Nearest Neighbor (k-NN) Methods for Forestry Applications. PLoS ONE 8(3): e59129. doi:10.1371/journal.pone.0059129
Editor: Sergio Gómez, Universitat Rovira i Virgili, Spain
Received: October 26, 2012; Accepted: February 11, 2013; Published: March 19, 2013
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: This project received financial support from Alaska Fisheries Science Center of National Oceanic and Atmospheric Administration (NOAA) Fisheries. The findings and conclusions in the paper are those of the author(s) and do not necessarily represent the views of the National Marine Fisheries Service, NOAA. Any use of trade names is for description purposes only and does not imply endorsement by the U.S. Government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Forest surveys provide critical information for many interests: quantifying carbon sequestration, making sound management decisions, designing processing plants, guiding decisions among conflicting land uses, and quantifying wildlife habitats, to name a few. To meet national and international negotiations and reporting requirements, forest management plans require local inventory data on vegetation, site productivity, biomass, carbon and other resources. The data must be intensive enough to include structural variables relevant to biomass and carbon projections and extensive enough to cover hundreds to thousands of acres, but cannot be too expensive to collect. Thus, data are often collected from samples. From these samples, maps of resources and estimates of areal totals or averages are required. One approach to mapping and estimating totals of biomass and productivity data is a spatial linear model (SLM), which includes ordinary kriging and universal kriging. This approach was initially developed for a similar goal: to predict geographic values or totals for mining resources. However, another approach, k-NN (k-Nearest Neighbor), has been recently developed and has gained widespread use. The overall goal of this paper is to compare SLM to k-NN theoretically, through simulations, and as applied to real forestry data.
The k-NN method finds observed samples that are “close” to an unobserved location based on covariates, and then either imputes the “closest” one directly as a prediction (k = 1), or forms a weighted average as a prediction (k > 1). Widespread availability of remotely-sensed data as covariates allows extending ground information to large areas using k-NN. One of the reasons k-NN is popular is that, when k = 1, predictions are within the bounds of biological reality because they were observed in the samples. Also, the logical relationships among response variables will be maintained, so k-NN is a multivariate method that retains the variable relationships seen in the data, particularly when k = 1. When variables are predicted separately, the dependence structure among the response variables is generally lost. The multivariate aspect of k-NN may be necessary for inventory applications where information on multiple stand attributes is required for stand management decisions or further modeling. Because k-NN methods reuse existing samples, they are distribution-free. Non-parametric k-NN imputation methods may provide better matches to listings of tree species for complex stands with multiple species and a wide variety of tree sizes, which tend to have multi-modal distributions. Non-parametric methods were found to effectively describe local conditions and variability. One way to make k-NN local is to select a combination of neighbors from the neighborhood where the average of the covariates is closest to the target record covariates. Localization can also be achieved by using spatial coordinates as covariates or by restricting the selection of neighbors to a circular area around the target unit. The yaImpute R package has facilitated the comparison of different k-NN approaches and contributed to their wide use.
There are some recognized problems with k-NN. It tends to be highly biased at the edge of the data cloud because prediction sites will likely be paired with a more central sample value due to the asymmetric neighborhood. Extremely small values and extremely high values will be over- and underestimated, respectively, if the sample data do not cover the whole range of variability. Bias can also be a problem in the interior of the data cloud if the covariates are non-uniformly distributed. In k-NN methods, the match is found using the covariates that are available for every site. As the number of samples increases, there will be a higher chance of getting an exact match in the covariates. However, this only guarantees an exact match for the response variables if they are perfectly (or very strongly) correlated with the covariates. As sample size increases, there is no guarantee that the mean of the response variables will approach the true mean, and hence k-NN methods are not statistically consistent. The k-NN methods lack a good measure of uncertainty, and often the global root-mean-squared error from cross-validation is used for point-wise standard errors. However, that global root-mean-squared error may not be a good measure of accuracy when response variables have heteroscedastic variances around the covariates; it was recommended that graphical tools be used to evaluate issues of bias, homoscedasticity, influential observations, outliers, and extrapolations. An approach that uses a model of the covariate-space variogram to quantify prediction uncertainty was proposed, but is computationally demanding. Relevant accuracy statistics for assessing the quality of predictions of categorical variables are still lacking. However, there are recent proposals for a variance estimator that incorporates spatial correlation, model-based estimators of the uncertainty, and design-based approaches to derive the statistical properties of the k-NN predictions.
If k > 1, then k-NN loses many of its purported benefits, as some estimates may not be within the realm of real values. If there are several response variables or “rare” polygons (stands), a good match will be very difficult to find.
Geostatistical methods were also devised for prediction, both for single sites or block averages. For a history, see . Combining the notion of classical geostatistics with a linear model (e.g., regression) yields the spatial linear model (SLM) (called universal kriging in the geostatistical literature). In comparison to k-NN, SLM predictions were designed to minimize the root-mean-squared error. If the data-generating process is multivariate normal, then the SLM predictor is equivalent to the conditional expectation, which is optimal (pgs. 108–110). Moreover, if the generating mechanism is not multivariate normal, the SLM predictions are still optimal among the class of linear predictors, called best linear unbiased prediction (BLUP) (pgs. 151–155). The BLUP was extended to finite populations of spatial data –. Although distribution-free, BLUP requires specification of a covariance model; however predictions are generally robust to mis-specification of the covariance model , . Classically, geostatistical methods estimated the covariance model by binning data into distance classes and using a least-squares fit ; however, restricted maximum likelihood (REML) estimation ,  removes the arbitrary nature of binning and fitting and is an unbiased estimating equation approach to the SLM , . Finally, SLM predictors are linear, creating weighted averages of data, and we can appeal to a spatially correlated version of the central limit theorem that predictors will be asymptotically normally distributed, e.g. , allowing for inference based on a standard normal distribution (e.g., for prediction intervals). More detailed and mathematically-oriented arguments on the stability of the geostatistical method can be found in  and (pgs. 289–299). There are many extensions of the SLM when predictions cannot be assumed approximately normally-distributed, e.g. .
The SLM is not without some recognized problems. Most of these are related to the problem of estimating the spatial covariance function. In a sense, the data are used twice: once to estimate the covariance parameters and again for BLUP. This has been termed empirical best linear unbiased prediction (EBLUP). Forming multivariable models requires estimating many covariance parameters, and any gain in precision is often minimal, with gains often 10% or less. However, see ,  for forestry applications. Some claims against the SLM are in error. Some forestry data are obtained from polygons that do not have a unique position in space and are irregular lattice data (aggregates of spatial data). Some authors indicate that the SLM is not appropriate for these data, but that is not correct. Polygon centroids can be used to compute distance, as shown with the real forestry data below. We note, however, that this may be a problem for very irregularly shaped polygons.
It turns out that predictors for both SLM and k-NN are linear, and the above review indicates that SLM should be optimal. However, much of that theory is based on an assumed spatial stochastic model, does not take into account the estimation of the covariance parameters, nor does it say how much better SLM might be. k-NN has fewer assumptions, and perhaps it is easier to implement and faster to compute, and a small loss in efficiency is compensated by ease of use and computational speed (see  and references therein). We build on a few previous comparisons , , where they compare only for point predictions, they do not assess the validity of prediction standard errors, and they use variogram estimators on residuals, which is known to be biased for covariance parameters (pgs. 257–258). One problem in making the comparison is that k-NN is an algorithm that makes no assumption about how spatial data were created; it only assumes a fixed surface. The SLM, on the other hand, is based on the idea that data are a realization of a spatial stochastic model. By conditioning on realizations under a SLM, and hence adopting a fixed surface perspective, k-NN and SLM can be evaluated in a common framework. Foresters and forest managers are interested in the global performance of predictive or imputation methods across a management region, among management regions, and across time. Supporting that practice, the comparison of prediction or imputation methods using global computations of average bias and RMSPE (described in Section “Performance Measures”) is deeply rooted in the forestry literature , , , .
Our goal is to compare k-NN to the SLM using simulations of artificial populations and resampling real forestry data. In the Methods section, we make explicit the prediction goals and k-NN and SLM methods and models, and we describe the simulation methods and real forestry data. The outcomes of using k-NN and SLM on these data are presented in the Results section. We discuss the results and offer some conclusions in the Discussion section.
We will use the following notation for spatial data. Let the population of response values be partitioned into those that are observed and those that are unobserved , and . Let the index set for the observed data be and for the unobserved data be . We consider two main goals: 1) point prediction of for , and 2) block prediction of the total or average , where are the weights that define the block objective; e.g., if then is a population total, and if then is a population average. Note that prediction goals for small areas can also be defined using zeros as weights in , but we do not pursue that here. For all response values, there are covariates contained in a design matrix which has rows and columns, where the first rows, , correspond to and the next rows, , correspond to . The spatial coordinates are contained in the matrix which has rows and columns, where the first rows correspond to and the next rows correspond to . In , is generally two. For example, the first column could be longitude and the second column latitude, or some planar transformation of them. A coordinate such as height might also be included, but we do not consider it here.
(2) As will be seen, both k-NN and SLM are linear predictors. Note that  and  attempt model-based estimators for a version of (2) that includes predicted values of rather than simply summing the observed values. The variance of such predictors lacks any notion of a finite population that shrinks the variance as the sampling fraction increases; i.e., we want a variance estimator that equals zero if the whole population is observed, so we do not pursue their formulation any further.
Both the SLM and k-NN use distance in various ways so a general definition is given here. Let be a matrix with coordinates in the columns and the th row denoted as . A general distance formula between the th and th rows of is,(3)where is a weighting matrix.
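The general distance in (3) can be illustrated with a short sketch. The typeset formula did not survive in this copy, so the code below assumes the usual quadratic-form distance, the square root of (a_i − a_j)′ W (a_i − a_j), which agrees with the raw, normalized, and Mahalanobis special cases described in the text. The function names are hypothetical and plain Python lists stand in for matrices.

```python
import math

def weighted_distance(a_i, a_j, W):
    """Assumed form of (3): sqrt((a_i - a_j)' W (a_i - a_j))."""
    diff = [x - y for x, y in zip(a_i, a_j)]
    p = len(diff)
    # Quadratic form diff' W diff written out with plain loops.
    q = sum(diff[r] * W[r][c] * diff[c] for r in range(p) for c in range(p))
    return math.sqrt(q)

def identity(p):
    """W = identity gives the raw (Euclidean) distance."""
    return [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]

def inverse_variance_weights(rows):
    """Diagonal W holding 1 / (empirical variance) of each column,
    which gives the normalized distance."""
    n, p = len(rows), len(rows[0])
    W = [[0.0] * p for _ in range(p)]
    for c in range(p):
        col = [row[c] for row in rows]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        W[c][c] = 1.0 / var
    return W
```

Passing the inverse of an empirical covariance matrix as `W` would give the Mahalanobis distance under the same assumed form.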
Review of k-NN
Let and be the th and th rows of , respectively. Then a “variable-space distance” between the th and th sites can be computed as . Several types of distances are possible. For example, if is the identity matrix, this is raw distance; if is diagonal with the inverse of the empirical variance for each of the columns in as the diagonal elements, then this is normalized distance; if is the inverse of the empirical covariance matrix among the columns in , then this is Mahalanobis distance; and if , where is the matrix of canonical vectors from canonical correlation analysis between and and is the canonical correlation matrix, then this is the “most similar neighbor” (MSN) distance . All of these and others have been implemented in the yaImpute package  in R . The k-NN method chooses weights based on a distance matrix. Let be a distance matrix with th element , which can be partitioned as
Let be the th column of , , contained in ; i.e., . If is the index for , then for a first-order nearest neighbor, in (1) and all other . This essentially assigns the value of to for the th site that is closest to the th site in variable-space. Let be the index set of the nearest sites (smallest values) in . Then takes the average of from the nearest neighbors in variable-space. Another option is to weight inversely proportional to distance, where where and is the th element in .
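The three weighting schemes just described (nearest neighbor, mean of the k nearest, and inverse-distance weighting) can be sketched as follows. This is an illustration only: the function name and arguments are hypothetical, and the exact inverse-distance form used by the authors is not recoverable from this copy, so simple 1/d weights normalized to sum to one are assumed.

```python
def knn_predict(dists, y_obs, k=1, weighting="mean"):
    """Predict one unobserved site from observed responses.

    dists[i] is the variable-space distance from observed site i to the
    target site; y_obs[i] is the observed response at site i.
    """
    order = sorted(range(len(dists)), key=lambda i: dists[i])
    nearest = order[:k]
    if weighting == "mean":
        # Equal weights 1/k on the k nearest sites, zero elsewhere.
        weights = {i: 1.0 / k for i in nearest}
    elif weighting == "inverse":
        # Assumed form: weights proportional to 1/d, normalized to sum to one.
        zeros = [i for i in nearest if dists[i] == 0.0]
        if zeros:
            # An exact covariate match takes all of the weight.
            weights = {i: 1.0 / len(zeros) for i in zeros}
        else:
            total = sum(1.0 / dists[i] for i in nearest)
            weights = {i: (1.0 / dists[i]) / total for i in nearest}
    else:
        raise ValueError("weighting must be 'mean' or 'inverse'")
    return sum(w * y_obs[i] for i, w in weights.items())
```

With k = 1 both schemes return the observed value at the closest site, so the prediction is always a value that was actually sampled.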
Cross-validation is the method most often used to compute prediction standard errors. Cross-validation makes predictions for sites that already have values, where each sample is removed one at a time, and the rest of the sample is used to predict the one that was removed. The idea is to use in-sample averaged squared errors between the predicted and observed values to serve as a global estimator of squared errors when out-of-sample. Let the k-NN prediction standard error be estimated as, (4) where is the cross-validation prediction of for sample values. Assuming prediction errors are normally-distributed, 90% prediction intervals are formed as the prediction plus or minus 1.645 times this standard error for out-of-sample values. Note that the standard error is constant across all predicted sites; we examine a spatially explicit model-based approach in Section “A Geostatistical Approach for Estimating the Variance of k-NN Predictors.”
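A minimal sketch of the leave-one-out procedure follows. It assumes raw Euclidean distance on the covariates, a mean-of-k-neighbors predictor, and a divisor of n inside the square root; the divisor in (4) is not recoverable here, and the function names are hypothetical.

```python
import math

def loo_cv_rmse(X_obs, y_obs, k=1):
    """Leave-one-out cross-validation RMSE for a mean-of-k-neighbors predictor."""
    n = len(y_obs)
    sq_err = 0.0
    for i in range(n):
        # Hold out site i and rank the remaining sites by covariate distance.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(X_obs[i], X_obs[j]))
        pred = sum(y_obs[j] for j in others[:k]) / k
        sq_err += (pred - y_obs[i]) ** 2
    return math.sqrt(sq_err / n)

def prediction_interval(pred, se, z=1.645):
    """Normal-theory interval; z = 1.645 gives nominal 90% coverage."""
    return (pred - z * se, pred + z * se)
```

Because `loo_cv_rmse` returns a single number, every out-of-sample prediction receives an interval of the same width, which is the limitation that motivates a spatially explicit alternative.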
For the standard error of estimating a total, we borrow from the idea of classical sampling theory, e.g. , where replaces the standard error,(5)
Review of SLM
Assume only the linear model(6)where is a matrix of fixed covariates, is a vector of parameters, and is a random vector with for some unknown spatial multivariate distribution. Note the contrast to k-NN; (6) is a spatial stochastic model that allows optimization with respect to bias and squared error, which we now review. Let be partitioned as(7)and let be the th column of , , contained in ; i.e., . Consider squared-error loss, and the linear predictor in (1), where and contains at least a column of ones. The predictor that minimizes squared-error loss, known as the best linear unbiased predictor (BLUP), , has(8)where , with prediction variance of
(9) (pgs. 151–155). Notice that is unknown; the only assumption is the linear model and a known spatial covariance matrix.
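To make the point predictor concrete, here is a small sketch of a BLUP computed with known covariance parameters. It makes several assumptions that are not taken from the text: an exponential covariance with a nugget on the diagonal, the constrained (Lagrange-multiplier) form of the kriging equations, which is algebraically equivalent to the usual universal kriging predictor, and hypothetical function and argument names. The prediction variance is evaluated directly as λ′Σλ − 2λ′c + σ_jj.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def exp_cov(d, psill, alpha, nugget):
    """Assumed exponential covariance; the nugget applies at zero distance."""
    return psill * math.exp(-d / alpha) + (nugget if d == 0.0 else 0.0)

def blup(coords_obs, X_obs, y_obs, coord_new, x_new, psill, alpha, nugget):
    """Return (prediction, prediction variance, weights) for one new site."""
    n, p = len(y_obs), len(x_new)
    Sigma = [[exp_cov(math.dist(coords_obs[i], coords_obs[j]), psill, alpha, nugget)
              for j in range(n)] for i in range(n)]
    # Covariance between each observed site and the new site (no nugget).
    c = [psill * math.exp(-math.dist(s, coord_new) / alpha) for s in coords_obs]
    # Constrained system: Sigma lam + X m = c and X' lam = x_new.
    A = [Sigma[i] + list(X_obs[i]) for i in range(n)]
    A += [[X_obs[i][q] for i in range(n)] + [0.0] * p for q in range(p)]
    lam = solve(A, c + list(x_new))[:n]
    pred = sum(l * y for l, y in zip(lam, y_obs))
    quad = sum(lam[i] * Sigma[i][j] * lam[j] for i in range(n) for j in range(n))
    var = quad - 2.0 * sum(l * ci for l, ci in zip(lam, c)) + psill + nugget
    return pred, var, lam
```

With a single column of ones as the design matrix this reduces to ordinary kriging, and with a zero nugget the predictor interpolates exactly at an observed location. Replacing the known parameters with REML estimates would give the EBLUP.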
Assume the same linear model in (6), except this time the linear predictor is (2). Let , where is a vector of all ones, and and . For a finite population the BLUP, , has(10)with prediction variance of(11)where –. The finite population correction factor is not obvious in (11). However, as gets shorter in length, (11) goes to zero. If , then (11) simplifies to where is the sampling fraction; this is the classical formula in simple random sampling without replacement for finite populations, e.g. (pg. 16).
For equations (9) and (11) is unknown and must be estimated. In spatial models, is modeled through spatial information; in geostatistics this is spatial distance. Consider the exponential autocovariance model,(12)where is a general autocorrelation function, as defined in (3), is the th element of , with as the partial sill, as the range parameter, and as the nugget effect (which may absorb spatial autocorrelation at very fine scales within minimum sampling distances). We will fit models using ; for many other models see (pgs. 80–96). The larger , the more autocorrelation between sites for a given distance. The parameters and are variance components, with controlling the autocorrelated component and controlling the uncorrelated component. For all models in this article, we use (12), relying on the fact that inferences are generally robust to mis-specification of the model. We estimate the covariance parameters using restricted maximum likelihood (REML) , ,(13)where , , the dependence of on is denoted as , and is a constant that does not depend on . Equation (13) is an unbiased estimating equation ,  and minimizing it for and provides their REML estimates. Using the estimated covariance parameters from REML in equations (8–11) provides the EBLUP predictors and standard errors.
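For reference, one standard way to write the two ingredients described above is given below. The symbols are assumptions, since the typeset equations did not survive in this copy: the partial sill is written as \sigma^2_p, the range as \alpha, the nugget as \sigma^2_0, and the covariance parameters are collected in \theta. Some authors scale the exponent differently (for example, using 3d/\alpha), so this should be read as a sketch of the form rather than as the authors' exact parameterization.

```latex
% Assumed exponential autocovariance between sites i and j at distance d_{ij}
C(d_{ij}) = \sigma^2_p \exp(-d_{ij}/\alpha) + \sigma^2_0 \, I(i = j)

% Assumed REML criterion (minus twice the restricted log-likelihood), to be minimized in \theta
\mathcal{L}(\boldsymbol{\theta}) =
  \log|\boldsymbol{\Sigma}_{\theta}|
  + \mathbf{r}_{\theta}' \boldsymbol{\Sigma}_{\theta}^{-1} \mathbf{r}_{\theta}
  + \log|\mathbf{X}_o' \boldsymbol{\Sigma}_{\theta}^{-1} \mathbf{X}_o|
  + c

% Generalized least squares residuals used in the criterion
\mathbf{r}_{\theta} = \mathbf{y}_o - \mathbf{X}_o \hat{\boldsymbol{\beta}}_{\theta},
\qquad
\hat{\boldsymbol{\beta}}_{\theta} =
  (\mathbf{X}_o' \boldsymbol{\Sigma}_{\theta}^{-1} \mathbf{X}_o)^{-1}
  \mathbf{X}_o' \boldsymbol{\Sigma}_{\theta}^{-1} \mathbf{y}_o
```

Because the generalized least squares estimate of the regression coefficients appears inside the criterion, no separate de-trending step is needed before the covariance parameters are estimated.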
The significance of using (13) is three-fold: 1) normality is not required to use (13) because it is an unbiased estimating equation, 2) there is no need to de-trend because estimation of is essentially embedded in (13), and 3) there is no need to compute empirical variograms by binning residuals. Residuals from de-trending are biased (pgs. 257–258) and binning is arbitrary. Thus, (13) provides an automatic way to estimate a spatial covariance matrix in very general conditions.
A Geostatistical Approach for Estimating the Variance of k-NN Predictors
An iterated variogram estimator for the variance of a k-NN predictor has been proposed . Suppose that we start with iteration and (4) as an estimator for a constant prediction standard error, . Form standardized residuals as(14)where is the in-sample cross-validation prediction value using some k-NN method. Then compute an empirical semivariogram,(15)where , the th distance class is , where , , and is some function of ; e.g., might be mean of all distances in , the median of , or the midpoint . A semivariogram, such as the equivalent to (12),(16)is fit to (15), typically by minimizing a weighted-least-squares criteria; e.g. ,
After convergence, a local estimator of prediction variance is(18)where the iteration superscript is suppressed after convergence. The prediction variance estimator (18) can be seen as an attempt to form a local (in covariate space) version of (9) without having to estimate mean effects due to “nearness” in covariate space. Note that this estimator will not work for , and it is only sensible when the mean of nearest neighbors is computed, as compared to the distance-weighted version. However, the above method could be adopted for distance-weighted. The estimator (18) will be examined using simulations. For the simulations, we used 10 equal-interval variogram distance-bins ( was equal for all ) between 0 and the maximum distance in the data set. Equation (17) was minimized using the optim() function in R  with the Nelder-Mead simplex method , obtaining a starting value from a 10×10×10 search grid for the parameters of (16). A maximum of 30 iterations was allowed.
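The binning step behind the empirical semivariogram can be sketched as follows. The sketch assumes the classical estimator (half the average squared difference of residuals within each distance class), equal-width bins between 0 and the maximum distance, and bin midpoints as representative distances. The function name is hypothetical, and `points` holds whatever coordinates define "distance", which here would be covariate space.

```python
import math

def empirical_semivariogram(points, resid, n_bins=10):
    """Return (bin midpoints, semivariance per bin, pair counts per bin)."""
    n = len(resid)
    pairs = [(math.dist(points[i], points[j]), (resid[i] - resid[j]) ** 2)
             for i in range(n) for j in range(i + 1, n)]
    max_d = max(d for d, _ in pairs)
    width = max_d / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for d, sq in pairs:
        # Bins are (0, w], (w, 2w], ...; clamp so both ends stay in range.
        k = min(max(math.ceil(d / width) - 1, 0), n_bins - 1)
        sums[k] += sq
        counts[k] += 1
    # Empty bins are reported as None rather than as zero.
    gamma = [sums[k] / (2 * counts[k]) if counts[k] else None
             for k in range(n_bins)]
    mids = [(k + 0.5) * width for k in range(n_bins)]
    return mids, gamma, counts
```

The pair counts are returned because weighted-least-squares fits of a semivariogram model typically weight each bin by its number of pairs.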
Note that k-NN methods make the sensible constraint that when using the mean of the nearest neighbors or distance-weighting. If is a single column of ones, then the bias term above disappears. However, this is not true in general. In contrast, under the BLUP, the bias term disappears due to further constraints on that guarantee unbiasedness for any and . Hence, the RMSPE for k-NN will be greater than BLUP for two reasons: it is not optimized for minimizing the error variance and there is a bias-squared component.
because is unbiased for . Note that , so(19)is an estimator of the RMSPE of a k-NN predictor using parameters estimated under a SLM; i.e., covariance parameters can be estimated using (13) and can be estimated using generalized least squares, . We will analyze some k-NN estimators using (19) in simulations below.
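The decomposition discussed above can be written out explicitly for any linear predictor with weights \lambda applied to the observed data. The symbols are assumptions, because the original notation is not recoverable in this copy: \Sigma_{oo} is the covariance matrix of the observed data, c_j the covariance vector between the observed data and site j, \sigma_{jj} the variance at site j, and X_o, x_j the covariates for the observed sites and for site j.

```latex
\mathrm{E}\left[(\boldsymbol{\lambda}'\mathbf{y}_o - y_j)^2\right]
  = \underbrace{\boldsymbol{\lambda}'\boldsymbol{\Sigma}_{oo}\boldsymbol{\lambda}
      - 2\boldsymbol{\lambda}'\mathbf{c}_j + \sigma_{jj}}_{\text{error variance}}
  + \underbrace{\left(\boldsymbol{\lambda}'\mathbf{X}_o\boldsymbol{\beta}
      - \mathbf{x}_j'\boldsymbol{\beta}\right)^2}_{\text{squared bias}}
```

If the design matrix is a single column of ones and the weights sum to one, the squared-bias term is (\beta - \beta)^2 = 0, which matches the remark about k-NN weights. Taking the square root after plugging in estimated covariance parameters and the generalized least squares estimate of \beta gives an estimator of the kind described for (19).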
Simulation of Artificial Data
We created spatially-patterned and cross-correlated -variables. All data sets were repeatedly simulated on a 20×20 regular grid evenly spaced between −1 and 1 on both coordinate axes and eight covariates: , described next. Start withwhere is a vector of values (on the 20×20 regular grid) containing zero-mean spatially-autocorrelated random variables from some geostatistical model with partial sill and range parameter , and is a vector containing zero-mean independent random variables with variance . We also let be independent of . Next, we set up an autoregressive-like recursion, where
where contains zero-mean spatially-autocorrelated random variables from some geostatistical model with partial sill and range parameter , and contains zero-mean independent random variables with variance , where again is independent of . Note that is a parameter that creates cross-correlation between variables by regressing on . This set-up ensures cross-correlation among the -variables and spatial autocorrelation within each -variable. Now let
where all the elements of are constant, equal to . Finally, create the response variable as,(20)where , is a vector of parameters, and contains zero-mean spatially-autocorrelated random variables from some geostatistical model with partial sill and range parameter , and contains zero-mean independent random variables with variance , where again is independent of .
We simulated three types of data using these models. In all cases = (NA, 0.5, 0.5, 0.5, 0, 0.5, 0.5, 0.5). Note that for each simulated data set the covariates were cross-correlated through , but broke any further cross-correlation to the group , and then as a group were cross-correlated.
- The first simulated method had 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, , and , , , , and .
- For the second simulation method, let denote the simulated data, with 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, , and , , , , and . Then, was simulated from Poi(exp()) where Poi() is a Poisson distribution. Note that wandered well below zero over significant spatial patches creating zero-inflated count data. Variances and regression coefficients are smaller here to keep numbers reasonable when exponentiating.
- For the third simulation method, let denote the simulated data, with all of the same parameters as for the second simulation. Then, was simulated as , creating binary data of zeros and ones.
Note that the group had smaller ranges and variances than group . The response variable was related to both groups through , but the coefficients were zero for and . We made predictions for both k-NN and SLM methods using only , , , , , . Thus, several model mis-specifications were made: inclusion of covariates with no effect ( and ), exclusion of covariates with real effects ( and ), and in the case of SLM, generating data from a spherical autocovariance model but fitting the data with an exponential autocovariance model.
For each of the three simulation methods listed above, 2000 data sets were simulated (with 400 simulated values per data set). For each simulation 100 locations were sampled randomly, and from each sample the other 300 were predicted, along with the overall total.
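The sampling design for one simulated data set can be sketched in a few lines: a 20×20 grid on [−1, 1] in each coordinate, 100 sites drawn at random as the sample, and the remaining 300 left for prediction. The function names and the seed are hypothetical, and simulating the response values themselves is not shown.

```python
import random

def make_grid(n_side=20, lo=-1.0, hi=1.0):
    """Regular n_side x n_side grid, evenly spaced between lo and hi."""
    step = (hi - lo) / (n_side - 1)
    axis = [lo + i * step for i in range(n_side)]
    return [(x, y) for x in axis for y in axis]

def split_sample(n_total, n_sample, rng):
    """Randomly choose observed indices; the rest are to be predicted."""
    observed = sorted(rng.sample(range(n_total), n_sample))
    chosen = set(observed)
    unobserved = [i for i in range(n_total) if i not in chosen]
    return observed, unobserved

grid = make_grid()
rng = random.Random(2013)  # arbitrary seed, for reproducibility only
observed, unobserved = split_sample(len(grid), 100, rng)
```

Repeating `split_sample` for each simulated data set randomizes which sites are observed in every replicate.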
Data for this study were obtained from the Forest Inventory and Analysis (FIA) databases for Oregon. The FIA databases are part of the national inventory of forests for the United States , . The Food Security Act of 1985 protects the confidentiality of the Forest Inventory and Analysis (FIA) plot locations, the integrity of the FIA sample, and the privacy of landowners who allow FIA field crews on their land. For these reasons, the actual plot locations are not available publicly. Persons needing exact plot locations to examine or reproduce our results should contact the Pacific Northwest Forest Inventory and Analysis Program at http://www.fs.fed.us/pnw/fia/. Alternatively, FIA produces and maintains a set of public databases with perturbed plot coordinates that can be downloaded and used by anyone. The website for accessing these data is http://www.fia.fs.fed.us/tools-data/.
A tessellation of hexagons, each approximately 2400 hectares in size, is superimposed across the nation, with one field plot randomly located within each hexagon. Approximately the same number of plots is measured each year, each plot has the same probability of selection, and in the western U.S. plots are remeasured every ten years. Each field plot is composed of four subplots. Forested areas that are distinguished by structure, management history, or forest type are mapped as unique polygons (also called condition-classes) on the plot and correspond to stands of at least 0.4047 hectare in size. For our study area there were 1886 forested FIA plots measured between 2001 and 2006. Biomass (DRYBIOT), maximum potential mean annual increment (PMAI), elevation, and primary species identifier (i.e., Douglas-fir (Pseudotsuga menziesii), ponderosa pine (Pinus ponderosa) or western hemlock (Tsuga heterophylla)) were obtained from the FIA annual database.
For climate data, we used monthly temperature and precipitation normal data for the period 1971–2000 produced by the Parameter-elevation Regressions on Independent Slopes Model (PRISM), which provided an 800 m grid that produced differences between measured plot elevation and overlaid PRISM grid elevation up to 350 m in the mountainous areas of Oregon. To account for changes in climate due to these elevation differences we utilized a process similar to  where we created a scale-free interpolation process using a 90 m digital elevation model and PRISM temperature and elevation gradients of the larger 800 m grid. The result is a 90 m monthly climate grid. Like  we used this procedure for extracting temperature (T) only and used a simple distance weighting method for precipitation (P). Climate Moisture Index (CMI), a measure of precipitation in excess of evapotranspiration, was used to quantify moisture availability. The raw values of PMAI and DRYBIOT, along with residuals after fitting a linear model, are shown in Figure 1; covariates (-variables) in the linear model were temperature, precipitation, CMI, WH (an indicator variable for shade tolerance based on western hemlock trees) and elevation.
Figure 1. The gray-shaded histograms are based on the original centered data, and the cross-hatched histogram is based on the residuals after fitting a multiple regression model with main effects for all covariates.
For real forest data, we used a census of 1886 known FIA values for PMAI and DRYBIOT. These were subsampled in three different experiments. For each resampling experiment listed below, the data sets were subsampled 500 times. For each subsample, a sample of 386 values was chosen randomly without replacement, and from each sample the remaining 1500 locations were predicted, along with an estimate of the total for all 1886 values. The covariates used were temperature, precipitation, CMI, WH, and elevation. The 500 resamplings were performed for each of the following:
- PMAI: this variable reflects the maximum potential forest productivity at a particular site. It indicates the average annual productivity of wood volume (m³/ha/year) that would be realized over time.
- DRYBIOT: this variable is the total above-ground oven-dry biomass of live trees larger than 2.5 cm in diameter.
- PMAI with unbalanced spatial sampling. Here, we preferentially sampled geographically, as often happens in real applications due to access or other issues. We divided up the study area into four parts, using and as the dividing lines. We created samples by sampling 200 values without replacement for and , and then randomly sampling 186 values without replacement from the rest of the area. One such sample is shown in Figure 2. From each unbalanced spatial sample, the remaining 1500 locations were predicted.
Over the six preceding simulation/resampling experiments, the SLM is compared to k-NN using three measures of predictive performance, both for individual site predictions, and totals over all sites. Note that these are simple summaries of predictive performance, and make no assumptions about how the data were generated; they are global summary statistics to help evaluate and compare methods. Let be the true, known value for the th location and the th simulation or resampling. Note that if we use to denote the spatial location associated with , then will vary over for each simulation/resampling because observed samples were randomized each time. We indicate this dependence on as . A prediction of , using (1), is denoted for each and . For each , there is a true total , and an estimate, using (2), denoted for the th simulation/resampling. The three performance measures are:
- RMSPE: the root-mean-squared prediction error measures how close the estimates are to the true values.
for point-wise predictions, where = number simulations (1500) of artificial data or resamplings (500) of real forestry data for point prediction, and = number of point predictions (300 for artificial data and 1500 for real forestry data) per replication , and
for total prediction. A smaller value of RMSPE indicates predictors are closer to true values.
- SRB: signed relative bias. Absolute bias is hard to interpret on its own, so it is expressed as a fraction of the variability. It is well known that MSPE = bias² + variance, and we take MSPE to be the square of RMSPE from above. The sign of the bias is also informative, so signed relative bias as a fraction of variability is computed as,
for point-wise predictions, and
for total predictions, where sign() is the sign (positive or negative) of , and for a point-wise performance measure or for a total performance measure. A smaller absolute value of SRB indicates smaller bias; a negative sign indicates under-prediction and a positive sign indicates over-prediction.
- PIC90: 90% prediction interval coverage measures how well uncertainty is being estimated. For many predicted values, or over many simulations, a prediction interval should cover the true value with the claimed proportion. For point-wise predictions, the empirical prediction interval coverage was computed as,
where is the estimated standard error of , taken from (4) for k-NN methods, and from the square root of (9) for the SLM (EBLUP), with covariance parameters estimated from (13). PIC90 should be near 0.90 if prediction intervals are properly estimated. It is also possible to compute PIC95 by replacing 1.645 with 1.96 in the formula above, and PIC95 should be near 0.95. For total predictions,
where is the estimated standard error of , taken from (5) for k-NN methods, and from the square root of (11) for the SLM (EBLUP), with covariance parameters estimated from (13).
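The three measures above can be sketched in a few lines, assuming predictions, true values, and estimated standard errors have been flattened into plain lists over all locations and replications (or over all replications, for totals). The function names are hypothetical, and the SRB form used here, sign(bias) times the square root of bias² / (MSPE − bias²), is one reading of "bias as a fraction of variability" rather than a formula confirmed by the text.

```python
import math

def rmspe(pred, true):
    """Root-mean-squared prediction error over all supplied pairs."""
    errs = [p - t for p, t in zip(pred, true)]
    return math.sqrt(sum(e * e for e in errs) / len(errs))

def srb(pred, true):
    """Signed relative bias: sign(bias) * sqrt(bias^2 / (MSPE - bias^2))."""
    errs = [p - t for p, t in zip(pred, true)]
    bias = sum(errs) / len(errs)
    mspe = sum(e * e for e in errs) / len(errs)
    variance = mspe - bias * bias  # uses MSPE = bias^2 + variance
    if variance <= 0:
        # Degenerate case: every error is identical.
        return 0.0 if bias == 0 else math.copysign(math.inf, bias)
    return math.copysign(math.sqrt(bias * bias / variance), bias)

def pic(pred, se, true, z=1.645):
    """Proportion of intervals pred +/- z*se covering the truth.

    z = 1.645 gives PIC90; z = 1.96 gives PIC95.
    """
    hits = sum(1 for p, s, t in zip(pred, se, true)
               if p - z * s <= t <= p + z * s)
    return hits / len(pred)
```

Because the functions only see flattened lists, the same code serves point-wise and total summaries; only the inputs differ.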
Seven prediction methods were examined, namely five different k-NN methods, multiple regression (a special case of a SLM that assumes independence), and a SLM:
- MAH1: k-NN that uses Mahalanobis distance with k = 1.
- MAH5: k-NN that uses Mahalanobis distance with k = 5.
- MSN1: k-NN that uses most similar neighbor (MSN) with k = 1.
- MSN5: k-NN that uses MSN with k = 5.
- bstNN: k-NN that uses both Mahalanobis distance and MSN, and tries , and then chooses the distance matrix and with the smallest cross-validation RMSPE from the observed data.
- SLM: a spatial linear model using the same covariates as all k-NN methods as main effects only, with an exponential autocovariance model estimated by REML, and using prediction and variance equations as described in the Review of SLM section.
- LM: multiple regression like SLM but assuming all random errors are independent.
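As an illustration of the Mahalanobis-distance variants in this list, the following is a minimal pure-Python sketch of a k-NN predictor. It assumes an equally weighted mean of the k nearest observed responses and a Mahalanobis metric built from the sample covariance of the observed covariates; the weights in the paper's prediction equation may differ, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def covariance(rows):
    """Sample covariance matrix (divisor n - 1) of the covariate rows."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def knn_predict(x_obs, y_obs, x_new, k):
    """Predict each row of x_new as the mean response of its k nearest
    observed rows under squared Mahalanobis distance (equal weights assumed)."""
    cov = covariance(x_obs)

    def d2(u, v):
        diff = [a - b for a, b in zip(u, v)]
        # diff' * cov^{-1} * diff, computed via a linear solve (simple, not fast)
        return sum(d * s for d, s in zip(diff, solve(cov, diff)))

    preds = []
    for xn in x_new:
        nearest = sorted(range(len(x_obs)), key=lambda i: d2(x_obs[i], xn))[:k]
        preds.append(sum(y_obs[i] for i in nearest) / k)
    return preds
```

Calling `knn_predict(x_obs, y_obs, x_new, 1)` and `knn_predict(x_obs, y_obs, x_new, 5)` would correspond loosely to MAH1 and MAH5 under these assumptions.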
The performance measures for the first set of Gaussian simulated data are presented in Table 1. Note that this table is based on 2000 simulations with 300 predictions per simulation, so 2000 totals and 600,000 points were predicted. As expected, the SLM had the lowest RMSPE for both point and total predictions. Not only was it lowest, it was dramatically lower than that of any other predictor. The data were simulated with a high amount of autocorrelation, so this demonstrates how much better the SLM can be in that case. When compared to MAH5 and MSN1 (the two commonly used k-NN methods), SLM reduced RMSPE by 52.6 and 64.1% for the point estimates and by 43.1 and 66.8% for the total estimates. SLM was also noticeably better than LM (the linear model assuming independence), reducing RMSPE by 34.8 and 31.8% for point and block prediction, respectively. Among the k-NN methods, MSN5 was best for both points and totals, but still not as good as LM. All methods were essentially unbiased for both points and totals. For all point estimates, prediction interval coverage was near 0.90, as it should be. For total estimates, coverage appears a bit too high for MAH1 and perhaps a bit too low for MAH5.
The performance measures for the second set of simulated data, the Poisson data, are presented in Table 1. The SLM again had the lowest RMSPE for both point and total prediction. SLM reduced RMSPE by 31.3 and 14.8% for point estimates when compared to MAH5 and MSN1, and by 23.6 and 23.5% for total estimates. All of the methods appear to be unbiased for point prediction, with generally valid prediction interval coverage. There appears to be some bias among the k-NN methods for predicting the total, and coverage of some prediction intervals falls below 0.85 for the k-NN methods. Also, the 0.86 prediction interval coverage for the SLM was a bit low, and this simulation was its poorest performance on that measure.
For predicting a binary variable, we replaced the RMSPE with percent correctly classified (PCC) for point prediction. Only the k-NN methods with k = 1 truly predicted values that were 0 or 1, so for all other methods predictions were rounded to 0 or 1. The performance measures for the binary simulated data are listed in Table 1. In fact, the k-NN methods with k = 1 performed most poorly, with the SLM again best. SLM increased PCC by 12.5 and 10.3% over MAH5 and MSN1, respectively. Point prediction appears unbiased for all methods. Prediction interval coverage is poor for the methods. A total of binary variables is rarely of interest in forestry applications compared to estimating proportions. For this simulation, we used the block mean, which is the estimated proportion for binary data, instead of a total. For block prediction, SLM decreased the RMSPE by 22.9 and 24.4% over MAH5 and MSN1, respectively. There may be some bias for MAH1 and bstNN. Prediction interval coverage is a little low for MAH5 and bstNN.
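The PCC calculation described above can be sketched as follows. The function name is hypothetical, and the handling of a prediction of exactly 0.5 (rounded up to 1 here) is an assumption, since the text only says that predictions were rounded to 0 or 1.

```python
def percent_correctly_classified(pred, true):
    """Round continuous predictions to 0/1 and return the percent that match.

    A prediction of exactly 0.5 is rounded up to 1 (an assumption).
    """
    rounded = [1 if p >= 0.5 else 0 for p in pred]
    correct = sum(1 for r, t in zip(rounded, true) if r == t)
    return 100.0 * correct / len(true)

# Example with made-up values: three of the five predictions round correctly.
pcc = percent_correctly_classified([0.2, 0.7, 0.5, 1.3, -0.1], [0, 1, 0, 1, 1])
```

Predictions outside the interval from 0 to 1, which a linear predictor can produce, are simply mapped to the nearer class.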
The performance measures for resampling real PMAI forestry data are presented in Table 2 in rows marked with PM. For point prediction, SLM reduced RMSPE by 9.0 and 34.4% over MAH5 and MSN1, respectively. Point prediction appears unbiased for all methods. Prediction interval coverage is quite good for all methods. For predicting a total, there appears to be some bias for k-NN methods using Mahalanobis distance, and prediction intervals are too large for MSN1 and too short for MAH5. The SLM reduces the RMSPE for predicting a total by 21.8 and 25.9% over MAH5 and MSN1, respectively.
The performance measures for resampling real DRYBIOT forestry data are presented in Table 2 in rows marked with DB. For point prediction, bestNN approached the SLM, which had the smallest RMSPE. The MAH5 method also did quite well, but MSN1 was very poor. For predicting a total, SLM again has the lowest RMSPE. There appears to be some bias for the bestNN method. Coverage for all prediction intervals is within 5 percentage points of the nominal 90%.
Table 2, in rows marked UN, presents the performance measures for resampling real PMAI forestry data with spatially unbalanced sampling, as shown in Figure 2. For point prediction, unbalanced sampling creates substantially more bias for the k-NN methods than they showed under balanced sampling (rows marked PM in Table 2). SLM remains relatively unbiased, again with the smallest RMSPE and valid prediction intervals. For predicting a total, there are large biases for the k-NN methods, and their prediction interval coverage is far from the nominal 90%. These large biases cause the RMSPE for the k-NN methods to be much higher than for SLM.
Most methods showed little bias globally, with generally valid prediction intervals. Yet, the SLM, and geostatistics in general, aims to make prediction intervals that vary in space, while the cross-validation approach used for k-NN is constant in space. We re-ran simulation 1 using the iterated variogram (IterVar) variance estimator of McRoberts et al. (2007) in (18), testing its global and point-wise efficacy, compared to the SLM predictor and interval, and compared to the k-NN predictor under the SLM model (19), which we label kNNGeo. A scatter-plot of a single simulation, with 300 predictions for the unsampled locations, is shown in Figure 3, which plots on the x-axis and based on (9), (18), and (19) on the y-axis. We computed Kendall’s rank correlation between the true absolute error and the estimated prediction standard error, for each method. These correlations were computed for 1000 simulations, and then all correlations were plotted as violin plots for each method, which are shown in Figure 4.
IterVar is the iterated variogram method of McRoberts et al. (2007), kNNGeo is the covariance matrix as estimated with all main effects in a spatial linear model and REML, but using the k-NN weights, and EBLUP are the estimated standard errors from the SLM.
Figure 4 shows that, indeed, the individual prediction intervals for the SLM are generally related to the actual errors. In contrast, Figure 4 shows that the IterVar method has no relationship between the prediction intervals and the actual absolute errors. Also, McRoberts et al. (2007) claim that the algorithm is expected to rapidly converge. In our implementation, it converged only 57.4% of the time. It diverged before 30 iterations about 2% of the time. Globally, the IterVar 90% prediction interval has 88.8% coverage. The kNNGeo method of Figure 4 showed the strongest correlation between actual absolute errors and prediction intervals, largely because it correctly estimated a dominant component of the error, which was the bias-squared. A 90% prediction interval built from the RMSPE of the kNNGeo method had 94.2% coverage.
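The rank-correlation check between actual absolute errors and estimated standard errors can be sketched as below. This is a plain O(n²) implementation of the tie-corrected Kendall coefficient (tau-b); whether the original analysis used tau-a or tau-b is not stated, so that choice and the function names are assumptions.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            s = dx * dy
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")

def error_se_correlation(pred, true, se):
    """Correlate |prediction - truth| with the estimated standard errors."""
    abs_err = [abs(p - t) for p, t in zip(pred, true)]
    return kendall_tau_b(abs_err, se)
```

Repeating `error_se_correlation` over simulations gives one correlation per simulation and method, which is the kind of collection a violin plot would summarize.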
This article set out to compare k-NN to the SLM for forestry mapping, and for the estimation of totals or averages of forest resources. In the introduction, we laid out arguments that favor using k-NN, and arguments that favor using a SLM, along with disadvantages for both. Our simulations of artificial data and resamplings of real data are not exhaustive; however, for the criteria that we chose (RMSPE, signed relative bias, and prediction interval coverage), the results presented in the previous section clearly favor SLM in general. To summarize, we simulated data under conditions that should severely test the SLM method. Because k-NN is primarily used in forestry, we included various k-NN methods in the simulations. In all cases, even with mis-specified covariance models, mis-specified linear models (including nonsignificant covariates and ignoring significant ones), zero-inflated count data, binary data, and skewed real forestry data, the SLM performed better than k-NN, and generally provided valid inference with little bias, and prediction intervals that contained the true values the correct proportion of time. From a single simulation, it also appears that the SLM is more robust to unbalanced spatial sampling. These results generally verify the claim in the introduction that EBLUP used to estimate the SLM is fairly robust in a variety of ways. The SLM has an additional benefit from its model-based assumptions; it allows point-wise inference, with globally valid prediction intervals that vary at each point.
Our results can be compared to previous literature cited in the Introduction, such as , where our SLM is mathematically equivalent to their universal kriging (UK); however, parameter estimation likely differed in the studies ( do not specify if they used the REML option when they fit variograms using the GSTAT package ). In , the SLM performed well compared to another k-NN method called gradient nearest neighbor (GNN), but not as consistently better as our results. Our results can also be compared to , where our SLM is mathematically equivalent to their kriging-with-external-drift (KED). An interesting hybrid method that uses MSN with kriging on the residuals is compared to the SLM based on RMSPE and bias . They find that the MSN-kriging hybrid is slightly better than the SLM, with both better than MSN alone. However, we note that  do not give a standard error estimator of point-wise predictions for the MSN-kriging hybrid. Also  estimate the SLM by first fitting a linear model assuming independence, and then computing and fitting a semivariogram on residuals. The use of REML for the SLM, as we have described it, estimates the fixed effects assuming correlated residuals, and is expected to be more efficient.
Note that it may seem surprising that k-NN was mostly unbiased for these simulations. Clarification is required because the Introduction claimed that k-NN is biased –. These authors equate bias with the fact that k-NN underestimates large values and overestimates small values; in geostatistics, this characteristic is called smoothing (pg. 158). Smoothing is a desirable characteristic under squared-error loss, which the SLM minimizes, so it is also a property of the SLM (pg. 189). Because the SLM is BLUP, it is unbiased for point-wise predictions; however the predictions are not unbiased for nonlinear functionals of the spatial population, such as quantiles. For example, following , but for finite populations, is a spatial cumulative distribution function (SCDF). Then its inverse, defines the quantile, which is nonlinear in . Predictors that can handle such nonlinearity have been proposed , and by matching variances in the predictions to those in the data, predictions will no longer underestimate high values or overestimate low values. However, it should be noted that these predictors will sacrifice the pointwise MSPE as optimized for BLUP; for an example where the prediction-variance-constrained MSPE is twice that of the “smooth” SLM predictor, see . This illustrates that, in general, no set of predictors will be optimal for all purposes.
More generally, it is possible, though computationally expensive, to obtain multiple sets of predictions, where the predicted data are simulated from conditional distribution properties of the population. The multiple prediction sets can be averaged to obtain predictions that satisfy BLUP, or quantiles can be computed across the sets. Multiple sets of predictions also allow the propagation of uncertainty if prediction sets are used as inputs to other models. In fact, k-NN is closely related to multiple imputation methods –, which sample from existing data to impute for missing data; in that sense they are like using multiple times to give multiple possible “realizations.” Again, there are equivalent ideas in geostatistics, generally termed “conditional simulation,” e.g.,  and (pgs. 452–453). We do not pursue a comparison here but suggest it for further research. Given the above discussion, we emphasize that our goal was point-wise unbiased prediction while minimizing the MSPE, which is what the SLM achieves in a single map, and compare that to k-NN on the same basis.
A model-based variance for k-NN predictors remains problematic. Cross-validation works from a global standpoint. Attempts at making variance local, such as the iterated variogram approach of McRoberts et al. (2007), did not work well for one simulated data set, as shown by Figure 4. There was no correlation between estimated prediction standard errors and actual absolute errors, so cross-validation was just as good, and the iterated variogram approach had convergence problems and does not work with . More testing of this method, and possible improvements, are warranted. The kNNGeo method was correlated with actual absolute errors. However, global prediction intervals for kNNGeo were too conservative, with 94.2% coverage for 90% prediction intervals, because the bias component is not stochastic but is treated as such if included in prediction intervals. Thus, the SLM is the only viable choice that we examined for making valid uncertainty maps along with predictions.
Finally, we stress that the SLM was presented here as a “black box” method. As we used it, there were no decisions involved; after choosing a covariance model like the exponential, use all covariates that are available, estimate the covariance parameters with REML, and plug the resulting covariance matrix into the prediction equations. This allowed us to make predictions for thousands of simulations. Such a “black box” method is certainly possible when many predictions are needed by personnel with little statistical training. However, when data have been collected at great expense, a careful analysis is better. In that case, exploratory data analysis, understanding of covariate relationships, finding and explaining outliers, model selection and diagnostics, and finally prediction, all can enhance prediction and understanding for both the SLM and k-NN, and we recommend that over a black box approach. For example, a Bayesian approach can account for the fact that covariance parameters are estimated and should correct for the plug-in aspect of the EBLUP, e.g. , , with available software . Also, when remotely sensed data are involved, data sets can be massive in size. In that case, other methods can be used, e.g. , . Moreover, there is no single correct analysis for forestry data sets; they can be modeled in various ways to achieve different desired goals.
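The plug-in step of the “black box” recipe above can be sketched as follows: given covariance parameters, estimate the fixed effects by generalized least squares and add the kriged residual. This is only a sketch under stated assumptions. The REML estimation step is omitted, so `sill` and `range_` are taken as already estimated; the nugget effect and the prediction variance are left out; and the function names are hypothetical rather than taken from any package.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def exp_cov(s, t, sill, range_):
    """Exponential autocovariance between locations s and t (no nugget)."""
    h = math.hypot(s[0] - t[0], s[1] - t[1])
    return sill * math.exp(-h / range_)

def slm_predict(coords, X, y, coord0, x0, sill, range_):
    """Plug-in best linear unbiased prediction at one unsampled location."""
    n, p = len(y), len(X[0])
    S = [[exp_cov(coords[i], coords[j], sill, range_) for j in range(n)]
         for i in range(n)]
    # Columns of S^{-1} X and the vector S^{-1} y
    Sinv_X = [solve(S, [X[i][k] for i in range(n)]) for k in range(p)]
    Sinv_y = solve(S, y)
    # Generalized least squares: (X' S^{-1} X) beta = X' S^{-1} y
    A = [[sum(X[i][a] * Sinv_X[b][i] for i in range(n)) for b in range(p)]
         for a in range(p)]
    rhs = [sum(X[i][a] * Sinv_y[i] for i in range(n)) for a in range(p)]
    beta = solve(A, rhs)
    resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    # Kriged residual: c0' S^{-1} (y - X beta)
    c0 = [exp_cov(coords[i], coord0, sill, range_) for i in range(n)]
    w = solve(S, resid)
    return sum(x0[k] * beta[k] for k in range(p)) + sum(c0[i] * w[i] for i in range(n))
```

With no nugget, this predictor reproduces the observed value when asked to predict at a sampled location, which is a convenient sanity check on any implementation.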
Conceived and designed the experiments: JVH HT. Performed the experiments: JVH. Analyzed the data: JVH. Wrote the paper: JVH HT.
- 1. Moeur M, Stage AR (1995) Most similar neighbor: An improved sampling inference procedure for natural resource planning. Forest Science 41: 337–359.
- 2. Haara AM, Maltamo M, Tokola T (1997) The k-nearest neighbor methods for estimating basal area diameter distribution. Scandinavian Journal of Forest Research 12: 200–208.
- 3. LeMay V, Temesgen H (2005) Comparison of nearest neighbor methods for estimating basal area and stems per hectare using aerial auxiliary variables. Forest Science 51: 109–119.
- 4. Holmström H, Fransson J (2003) Combining remotely sensed optical and radar data in kNN-estimation of forest variables. Forest Science 16: 409–418.
- 5. McRoberts R (2008) Using satellite imagery and the k-nearest neighbors technique as a bridge between strategic and management forest inventories. Remote Sensing of Environment 112: 2212–2221.
- 6. Tuominen S, Fish S, Poso S (2003) Combining remote sensing, data from earlier inventories, and geostatistical interpolation in multisource forest inventory. Canadian Journal of Forest Research 33: 624–634.
- 7. Tomppo E, Gagliano C, De Natale F, Katila M, McRoberts R (2009) Predicting categorical forest variables using an improved k-nearest neighbor and Landsat imagery. Remote Sensing of Environment 113: 500–517.
- 8. Temesgen H, LeMay VM, Froese KL, Marshall PL (2003) Imputing tree-lists from aerial attributes for complex stands of south-eastern British Columbia. Forest Ecology and Management 177: 277–285.
- 9. Katila M, Tomppo E (2001) Selecting estimation parameters for the Finnish multisource national forest inventory. Remote Sensing of Environment 76: 16–32.
- 10. Fehrmann L, Lehtonen A, Kleinn C, Tomppo E (2008) Comparison of linear and mixed-effect regression models and a k-nearest neighbour approach for estimation of single-tree biomass. Canadian Journal of Forest Research 38: 1–9.
- 11. Sironen S, Kangas A, Maltamo M, Kalliovirta J (2003) Estimating individual tree growth with nonparametric methods. Canadian Journal of Forest Research 33: 444–449.
- 12. Maltamo M, Malinen J, Kangas A, Härkönen S, Pasanen AM (2003) Most similar neighbour-based stand variable estimation for use in inventory by compartments in Finland. Forestry 76: 449–464.
- 13. Malinen J (2003) Locally adaptable non-parametric methods for estimating stand characteristics for wood procurement planning. Silva Fenn 37: 109–118.
- 14. Sironen S, Kangas A, Maltamo M, Kangas J (2008) Localization of growth estimates using nonparametric imputation methods. Forest Ecology and Management 256: 674–684.
- 15. Crookston NL, Finley AO (2008) yaImpute: An R Package for kNN Imputation. Journal of Statistical Software 23: 1–16.
- 16. McRoberts R, Nelson M, Wendt D (2002) Stratified estimation of forest area using satellite imagery, inventory data, and the k-nearest neighbors technique. Remote Sensing of Environment 82: 457–468.
- 17. Baffetta F, Fattorini L, Franceschi S, Corona P (2009) Design-based approach to k-nearest neighbours technique for coupling field and remotely sensed data in forest surveys. Remote Sensing of Environment 113: 463–475.
- 18. Packalén P, Maltamo M (2007) The k-MSN method for the prediction of species-specific stand attributes using airborne laser scanning and aerial photographs. Remote Sensing of Environment 109: 328–341.
- 19. Baffetta F, Corona P, Fattorini L (2012) A matching procedure to improve k-NN estimation of forest attribute maps. Forest Ecology and Management 272: 35–50.
- 20. Stage A, Crookston N (2007) Partitioning error components for accuracy-assessment of near-neighbor methods of imputation. Forest Science 53: 62–72.
- 21. McRoberts R (2009) Diagnostic tools for nearest neighbors techniques when used with satellite imagery. Remote Sensing of Environment 113: 489–499.
- 22. Kim HJ, Tomppo E (2006) Model-based prediction error uncertainty estimation for k-NN method. Remote Sensing of Environment 104: 257–263.
- 23. McRoberts R, Tomppo EO, Finley AO, Heikkinen J (2007) Estimating areal means and variances of forest attributes using the k-nearest neighbors technique and satellite imagery. Remote Sensing of Environment 111: 466–480.
- 24. Magnussen S, McRoberts RE, Tomppo EO (2009) Model-based mean square error estimators for k-nearest neighbour predictions and applications using remotely sensed data for forest inventories. Remote Sensing of Environment 113: 476–488.
- 25. Cressie N (1990) The origins of kriging. Mathematical Geology 22: 239–252.
- 26. Cressie NAC (1993) Statistics for Spatial Data. New York: John Wiley and Sons. 900 p.
- 27. Ver Hoef JM (2000) Predicting finite populations from spatially correlated data. In: ASA Proceedings of the Section on Statistics and the Environment. American Statistical Association, 93–98.
- 28. Ver Hoef JM (2002) Sampling and geostatistics for spatial data. Ecoscience 9: 152–161.
- 29. Ver Hoef JM (2008) Spatial methods for plot-based sampling of wildlife populations. Environmental and Ecological Statistics 15: 3–13.
- 30. Stein ML (1988) Asymptotically efficient prediction of a random field with a misspecified covariance function. The Annals of Statistics 16: 55–63.
- 31. Putter H, Young GA (2001) On the effect of covariance function estimation on the accuracy of kriging predictors. Bernoulli 7: 421–438.
- 32. Cressie N (1985) Fitting variogram models by weighted least squares. Journal of the International Association for Mathematical Geology 17: 563–586.
- 33. Patterson HD, Thompson R (1971) Recovery of inter-block information when block sizes are unequal. Biometrika 58: 545–554.
- 34. Harville DA (1977) Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72: 320–338.
- 35. Heyde CC (1994) A quasi-likelihood approach to the REML estimating equations. Statistics & Probability Letters 21: 381–384.
- 36. Cressie N, Lahiri SN (1996) Asymptotics for REML estimation of spatial covariance parameters. Journal of Statistical Planning and Inference 50: 327–341.
- 37. Bolthausen E (1982) On the central limit theorem for stationary mixing random fields. The Annals of Probability 10: 1047–1050.
- 38. Cressie N, Zimmerman DL (1992) On the stability of the geostatistical method. Mathematical Geology 24: 45–59.
- 39. Diggle PJ, Tawn JA, Moyeed RA (1998) Model-based geostatistics. Journal of the Royal Statistical Society, Series C: Applied Statistics 47: 299–326.
- 40. Zimmerman DL, Cressie N (1992) Mean squared prediction error in the spatial linear model with estimated covariance parameters. Annals of the Institute of Statistical Mathematics 44: 27–43.
- 41. Ver Hoef J, Cressie N (1993) Multivariable spatial prediction. Mathematical Geology 25: 219–240.
- 42. Heisel T, Ersbøll AK, Andreasen C (1999) Weed mapping with co-kriging using soil properties. Precision Agriculture 1: 39–52.
- 43. Finley AO, Banerjee S, Ek AR, McRoberts R (2008) Bayesian multivariate process modeling for prediction of forest attributes. Journal of Agricultural, Biological, and Environmental Statistics 13: 60–83.
- 44. Finley AO, Banerjee S, McRoberts RE (2009) Hierarchical spatial models for predicting tree species assemblages across large domains. The Annals of Applied Statistics 3: 1052–1079.
- 45. Finley AO, McRoberts RE (2008) Efficient k-nearest neighbor searches for multi-source forest attribute mapping. Remote Sensing of Environment 112: 2203–2211.
- 46. Pierce KB Jr, Ohmann JL, Wimberly MC, Gregory MJ, Fried JS (2009) Mapping wildland fuels and forest structure for land management: a comparison of nearest neighbor imputation and other methods. Canadian Journal of Forest Research 39: 1901–1916.
- 47. Räty M, Kangas A (2012) Comparison of k-MSN and kriging in local prediction. Forest Ecology and Management 263: 47–56.
- 48. Schabenberger O, Gotway CA (2005) Statistical Methods for Spatial Data Analysis. Boca Raton, FL: Chapman & Hall/CRC. 512 p.
- 49. R Development Core Team (2010) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3–900051–07–0. Available: http://www.R-project.org. Accessed 2013 Feb 25.
- 50. Thompson SK (1992) Sampling. New York: John Wiley and Sons. 343 p.
- 51. Chiles JP, Delfiner P (1999) Geostatistics: Modeling Spatial Uncertainty. New York: John Wiley and Sons. 695 p.
- 52. Nelder JA, Mead R (1965) A simplex method for function minimization. Computer Journal 7: 308–313.
- 53. Roesch F, Reams G (1999) Analytical alternatives for an annual inventory system. Journal of Forestry 97: 44–48.
- 54. Czaplewski R (1999) Forest survey sampling designs: a history. Journal of Forestry 97: 4–10.
- 55. Wang T, Hamann A, Spittlehouse D, Aitken S (2006) Development of scale-free climate data for western Canada for use in resource management. International Journal of Climatology 26: 383–397.
- 56. Pebesma EJ (2004) Multivariable geostatistics in S: the gstat package. Computers & Geosciences 30: 683–691.
- 57. Johnston K, Ver Hoef J, Krivoruchko K, Lucas N (2001) Using ArcGIS geostatistical analyst, volume 300. Redlands, CA: ESRI Press. 300 p.
- 58. Lahiri SN, Kaiser MS, Cressie N, Hsu NJ (1999) Prediction of spatial cumulative distribution functions using subsampling. Journal of the American Statistical Association 94: 86–97.
- 59. Aldworth J, Cressie N (2003) Prediction of nonlinear spatial functionals. Journal of Statistical Planning and Inference 112: 3–41.
- 60. Schafer JL, Olsen MK (1998) Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research 33: 545–571.
- 61. Schafer JL (1999) Multiple imputation: A primer. Statistical Methods in Medical Research 8: 3–15.
- 62. Rubin DB (1996) Multiple imputation after 18+ years. Journal of the American Statistical Association 91: 473–489.
- 63. Rubin DB (1996) Reply to comments on “Multiple imputation after 18+ years”. Journal of the American Statistical Association 91: 515–517.
- 64. Ravenscroft PJ (1994) Conditional simulation for mining: Practical implementation in an industrial environment. In: Armstrong M, Dowd PA, editors. Geostatistical Simulations. New York: Kluwer Academic Publishers Group. 79–87.
- 65. Handcock MS, Stein ML (1993) A Bayesian analysis of kriging. Technometrics 35: 403–410.
- 66. Finley AO, Banerjee S, Carlin BP (2007) spBayes: An R package for univariate and multivariate hierarchical point-referenced spatial models. Journal of Statistical Software 19: 1–24.
- 67. Cressie N, Johannesson G (2008) Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B 70: 209–226.
- 68. Eidsvik J, Finley AO, Banerjee S, Rue H (2012) Approximate Bayesian inference for large spatial datasets using predictive process models. Computational Statistics and Data Analysis 56: 1362–1380.