On Extrapolating Past the Range of Observed Data When Making Statistical Predictions in Ecology

doi:10.1371/journal.pone.0141416

Fig 1.

Example IVHs constructed from simulated data.

In (A) and (B), linear regression is used to relate a response variable to a single covariate, x, obtained at locations denoted with an “x”. Using x as a simple linear effect (A), only predictions less than the minimum observed value of x or greater than the maximum value of x are outside the IVH (shaded area), as scaled prediction variance in these areas (solid line) is greater than the maximum scaled prediction variance for observed data (dashed line). Using both linear and quadratic effects (B), some intermediate points are also outside the IVH. When both linear and quadratic effects of two covariates (x₁ and x₂) are modeled, the IVH is more nuanced and depends on whether interactions are omitted (C) or included (D).
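The IVH criterion described in panels (A) and (B) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes ordinary least squares with a polynomial design matrix, takes the scaled prediction variance at a point x0 to be x0'(X'X)^-1 x0, and uses made-up covariate values with a gap in the middle. The function names are hypothetical.

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)

def scaled_prediction_variance(X_obs, X_new):
    """Return x0' (X'X)^-1 x0 for every row x0 of X_new."""
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)
    return np.einsum("ij,jk,ik->i", X_new, xtx_inv, X_new)

def inside_ivh(x_obs, x_new, degree):
    """True where a prediction point lies inside the IVH.

    A point is inside when its scaled prediction variance does not
    exceed the maximum value attained at the observed data points.
    """
    X_obs = design_matrix(x_obs, degree)
    X_new = design_matrix(x_new, degree)
    v_max = scaled_prediction_variance(X_obs, X_obs).max()
    return scaled_prediction_variance(X_obs, X_new) <= v_max

# Made-up covariate values (illustration only): no observations near zero.
x_obs = [-3.0, -2.5, -2.0, 2.0, 2.5, 3.0]
x_new = [-4.0, 0.0, 2.5, 4.0]

linear_inside = inside_ivh(x_obs, x_new, degree=1)     # linear effect only
quadratic_inside = inside_ivh(x_obs, x_new, degree=2)  # linear + quadratic

print(linear_inside)
print(quadratic_inside)
```

With these values, x = 0 lies inside the IVH under the linear model but outside it once the quadratic term is added, which mirrors the intermediate points flagged in panel (B). Points at -4 and 4, beyond the observed range, fall outside the IVH under both models.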

Fig 2.

Depiction of a single simulation replicate where problematic extrapolation occurs.

Panels (A-C) give simulated covariate values, (D) gives true animal abundance, (E) gives estimated abundance from a GLM run on count data from a spatially balanced survey design, and (F) gives estimated abundance from a GLM applied to count data from a convenience survey. In (E-F), predictions outside the gIVH are represented by black boxes, and sampling locations are represented with an “x”. For the convenience sample, there was considerable positive bias, particularly in cells outside of the gIVH. In this case, the median posterior abundance prediction is 57% greater than true abundance when inference is made to the whole study area. When inference is restricted to cells within the gIVH, median posterior abundance is 16% greater than true abundance.

Fig 3.

Assembled covariates used to help explain and predict ribbon seal relative abundance in the eastern Bering Sea.

Covariates include distance from mainland (dist_mainland), distance from the 1000 m depth contour (dist_shelf), average remotely sensed sea ice concentration while surveys were being conducted (ice_conc), and distance from the southern sea ice edge (dist_edge). All covariates except ice concentration were standardized to have a mean of 1.0 prior to plotting and analysis.

Fig 4.

Aerial survey tracks over the Bering Sea, April 22–29, 2012.

Survey tracks are shown in blue and are overlaid on a tessellated study area consisting of 25 km by 25 km grid cells (gray lines). Dark gray indicates land, the orange dashed line indicates the 1000 m depth contour, and the solid brown line shows the U.S. Exclusive Economic Zone (EEZ) boundary. Colored pixels indicate ribbon seal counts along aerial transects. The average effective area surveyed in each grid cell was approximately 2.6 km² (0.4%). Note that surveys were designed to target multiple seal species, several of which had high densities farther north (results not shown).

Fig 5.

Predictions of ribbon seal apparent abundance across the eastern Bering Sea from models fit to survey data.

Predictions were obtained using the posterior predictive mean for GLM and STRM models, and for the GAM using the predict.gam function in the R mgcv package [7]. Each row gives results for a different model type (GLM, GAM, or STRM, respectively). Plots in the left column give results for naive runs without presumed absences, while plots in the right column give predictions for runs where presumed absence data (i.e., 0 counts in cells with <0.1% ice) were included. Cells highlighted in black indicate those where predictions were outside the generalized independent variable hull (gIVH).

Fig 6.

Boxplots summarizing proportional error in abundance from the simulation experiment.

Each boxplot summarizes the distribution of proportional error in the posterior predictive median of abundance as a function of estimation model (x-axis), survey design (columns) and whether or not inference was restricted to the gIVH (rows). The lower and upper limits of each box correspond to first and third quartiles, while whiskers extend to the lowest and highest observed bias within 1.5 interquartile range units from the box. Outliers outside of this range are denoted with points. Horizontal lines within boxes denote median bias. The two numbers located below each boxplot indicate mean bias (upper number) and the number of additional outliers for which proportional bias was greater than 2.0 (lower number).
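The quantities shown in each boxplot can be reproduced with a short sketch. This is an illustration only: it assumes proportional error is (estimate - truth) / truth, uses NumPy's default linear-interpolation quantiles (the plotting software used for the figure may define quartiles slightly differently), and runs on made-up values. The function name and the `cap` parameter are hypothetical.

```python
import numpy as np

def boxplot_summary(prop_error, cap=2.0):
    """Summarize proportional errors the way the caption describes."""
    e = np.asarray(prop_error, dtype=float)
    q1, median, q3 = np.percentile(e, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme observations within 1.5 IQR of the box.
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    within = e[(e >= lo_fence) & (e <= hi_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": within.min(),
        "whisker_high": within.max(),
        "outliers": e[(e < lo_fence) | (e > hi_fence)],
        "mean": e.mean(),                    # upper number below each box
        "n_above_cap": int((e > cap).sum()), # lower number below each box
    }

# Made-up proportional errors from seven hypothetical replicates.
errors = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.5, 3.0]
summary = boxplot_summary(errors)
print(summary)
```

In this toy input, the single replicate with a proportional error of 3.0 is both an outlier and counted in the tally of values above 2.0, and it pulls the mean well above the median.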
