On Extrapolating Past the Range of Observed Data When Making Statistical Predictions in Ecology

Ecologists are increasingly using statistical models to predict animal abundance and occurrence in unsampled locations. The reliability of such predictions depends on a number of factors, including sample size, how far prediction locations are from the observed data, and similarity of predictive covariates in locations where data are gathered to locations where predictions are desired. In this paper, we propose extending Cook’s notion of an independent variable hull (IVH), developed originally for application with linear regression models, to generalized regression models as a way to help assess the potential reliability of predictions in unsampled areas. Predictions occurring inside the generalized independent variable hull (gIVH) can be regarded as interpolations, while predictions occurring outside the gIVH can be regarded as extrapolations worthy of additional investigation or skepticism. We conduct a simulation study to demonstrate the usefulness of this metric for limiting the scope of spatial inference when conducting model-based abundance estimation from survey counts. In this case, limiting inference to the gIVH substantially reduces bias, especially when survey designs are spatially imbalanced. We also demonstrate the utility of the gIVH in diagnosing problematic extrapolations when estimating the relative abundance of ribbon seals in the Bering Sea as a function of predictive covariates. We suggest that ecologists routinely use diagnostics such as the gIVH to help gauge the reliability of predictions from statistical models (such as generalized linear, generalized additive, and spatio-temporal regression models).

examine the ability of the generalized independent variable hull (gIVH) to diagnose potential areas across the landscape where predictions of animal abundance using these models may be problematic, and to determine whether statistical inferences are more robust when restricted to sampling units within the gIVH. Each simulation replicate consisted of several steps, including 1. Simulate three hypothetical, statistically dependent, spatially autocorrelated environmental covariates over a 30×30 grid, 2. Simulate animal abundance across the landscape as a function of environmental covariates, 3. Simulate surveys across the landscape, including the position of count quadrats and the resulting animal counts, 4. Estimate animal abundance as a function of two of the environmental covariates according three different models: a generalized linear model (GLM), a generalized additive model (GAM), and a spatio-temporal regression model (STRM). For the latter, only a spatial dimension was modeled. The GLM and STRM models were specified hierarchically, with MCMC used for posterior simulation. For the GAM, we conducted a frequentist analysis using the mgcv R package (Wood 2006). 5. Calculate the gIVH using realized posterior variance (i.e. after data were collected and analyzed).
We now describe each of these tasks and results of the simulation study in further detail before describing results. All analyses were performed in the R programming environment 1 (R Development Core Team 2012); requisite code to recreate analyses is available in the R package SpatPred that accompanies this article.

Simulating environmental covariates
In the real world, different habitat covariates are often correlated with each other (e.g., altitude and precipitation), and are often patchily distributed across the landscape (i.e., are spatially autocorrelated). We thus desired a procedure for generating covariates that would allow some level of statistical dependence among covariates, together with spatial autocorrelation. For each simulation, we generated three spatially autocorrelated environmental covariates using a procedure motivated by linear coregionalization models in multivariate spatial statistics (e.g. Goulard and Voltz 1992) to impart desired behavior. To start, we used the R package RandomFields to simulate 10 realizations y i (i ∈ 1, 2, . . . , 10) of independent, mean-zero random fields over a 30 × 30 grid (where the lower case bold type denotes a vector). Each random field had a stationary, isotropic, exponential covariance structure, where the covariance C between two survey units (i.e. grid cells) was a function of the distance r between grid cell centroids, C(r) = exp(r/v). Note that the distance between horizontally and vertically adjacent grid cell centroids was standardized to 1.0; the scale parameter v of the exponential covariance function for each random field was drawn from a Uniform(5,100) distribution to induce heterogeneity in the spatial scale of each process.
Next, we determined the values of three spatially autocorrelated habitat covariates, z j by writing them as linear functions of the y i : Evidently, the covariance between each induced habitat covariate is a function of the weights ω ij . We used the following strategy to set ω ij : 1. Set all ω ij = 0 2. For each desired covariate, j, randomly sample four values u j from the set {1, 2, . . . , 10} without replacement.
This procedure led to "patchy" habitat covariates with realistic levels of covariation ( Fig.   S2.1). To prevent redundancy and collinearity, we rejected and resampled covariate values (using the same procedure) whenever maximum absolute correlation among covariates was greater than 0.75.

Simulating animal abundance
Given values of the three simulated covariates, we generated a vector of expected log-abundance in each cell (µ) as where the regression coefficients β were each drawn from a N (0, τ ) distribution, and the intercept, β 0 , was drawn from a N (2.5, 0.25) distribution. The design matrix X was constructed assuming linear and quadratic effects for each covariate, together with all one-way interactions for a total of 9 coefficients in addition to the intercept. The precision, τ , was set to 2.5 for linear fixed effects, and to 5.0 for quadratic effects and one-way interaction terms. Residual Gaussian errors ( ) were drawn from a N (0, 10) distribution.
Abundance in each cell i was generated as Regression coefficients were redrawn whenever i N i > 100, 000 or if the 20 most populous grid cells included > 90% of total abundance to prevent unreasonably high or constricted distributions of abundance.

Simulating sample locations and count data
For each simulated landscape, we selected 45 grid cells (5%) for sampling. We employed two possible survey designs: i) spatially balanced sampling using a random tessellation design (Stevens Jr. and Olsen 2004), and ii) a convenience sample where the inclusion probability of each grid cell i in the sampling frame was set proportional to exp(−0.2 * r i ), where r i is the Euclidean distance between the centroid of grid cell i and the center of the survey grid.
The latter survey design was meant to approximate the case where there is a "base of operations" in the middle of the survey grid and more effort is expended close to the center due to simpler sampling logistics. We configured simulations such that sample quadrats covered 10% of each targeted grid cell, and generated animal counts for each quadrat j as C j ∼ Binomial(N j , 0.1).

Estimating animal abundance
For each set of count data, we attempted to estimate animal abundance over the landscape using three different models, corresponding to a hierarchical, Bayesian generalized linear model (GLM; linear, fixed effects of covariates on the log scale), a frequentist generalized additive model (GAM; smooth effects of covariates on the log scale) fit using mgcv, and a hierarchical spatial regression model (STRM; with both linear, fixed effects and spatially autocorrelated random effects). Further details on these models, together with the procedure used for posterior simulation, are presented in Text S1. For this study, we used the RSR implementation of the STRM outlined in Text S1. Prior distributions for each precision parameter, τ , were set to Gamma(1.0, 0.01), which is diffuse while maintaining a flat shape near the origin. Regression parameters (β) were given vague N (0, τ = 0.01) priors. For the GLM and STRM, we used the median posterior prediction of total abundance as a point estimator. For the GAM, we used the sum of predictions of total abundance on the response scale, obtained using the predict.gam function in mgcv.