Fig 1.
Location of field visits for vegetation survey and abiotic measurements in the Netherlands.
The blue dots show the locations of species observations in the LMF-M&N dataset, while the green dots represent the abiotic measurements from Wageningen University & Research. A grid of 2.88 km × 2.88 km cells has been imposed on the Netherlands, and the centre of each grid cell is used as a prediction location; the resolution can be adjusted arbitrarily. The axes denote distance to the centre of the Dutch coordinate system (Rijksdriehoek) in km.
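As an illustration of how prediction locations can be derived from such a grid, the following minimal sketch lists the centres of square cells covering a bounding box. The function name `grid_centres` and the bounding box in the usage note are hypothetical; only the 2.88 km cell size comes from the caption.

```python
import math


def grid_centres(x_min, x_max, y_min, y_max, cell_km=2.88):
    """Return the (x, y) centres, in km, of square cells covering a bounding box.

    The cell size defaults to the 2.88 km used in Fig 1; the bounding box is
    supplied by the caller and is an assumption of this sketch.
    """
    # Subtract a tiny tolerance so that an extent that is an exact multiple
    # of the cell size does not gain an extra cell from rounding error.
    nx = max(1, math.ceil((x_max - x_min) / cell_km - 1e-9))
    ny = max(1, math.ceil((y_max - y_min) / cell_km - 1e-9))
    return [
        (x_min + (i + 0.5) * cell_km, y_min + (j + 0.5) * cell_km)
        for j in range(ny)
        for i in range(nx)
    ]
```

For example, `grid_centres(0.0, 5.76, 0.0, 2.88)` covers a 5.76 km × 2.88 km box with two cells. Changing `cell_km` changes the prediction resolution.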
Table 1.
Total number of distinct species, plots, and years in data sets.
Each (plot, year) pair defines a distinct field visit or abiotic measurement in space and time.
Fig 2.
Directed Acyclic Graph representation in plate notation of the two-stage Bayesian model which first uses standard kriging models to spatially interpolate abiotic variables, and then fits a Species Distribution Model (SDM) to the predicted abiotic variables.
A rectangle groups variables that repeat together, with the repeating subscript or argument given at the bottom right. White nodes are random variables, gray nodes are observations of random variables, and square nodes are fixed values corresponding to hyperparameters. In the kriging models, the abiotic variables are observed at locations a ∈ D, Xj(s) are the random variables describing the random field at all locations, and the posterior means can be seen as predicted abiotic values, for all K abiotic variables. In the SDM, these posterior means are treated as observations of the true abiotic values, Zj(s) are the non-linear covariates, W(s) is the spatial effect, and Y(s) is the binary random variable indicating whether a species occurs, defined at all locations with observations at locations s ∈ D′. This two-stage model disregards the uncertainty about the true abiotic values. For the formulae, refer to Eqs 5 and 6.
Fig 3.
Directed Acyclic Graph representation in plate notation of the joint Bayesian model, which links the kriging models for the abiotic variables directly to the Species Distribution Model (SDM).
A rectangle groups variables that repeat together, with the repeating subscript or argument given at the bottom right. White nodes are random variables, gray nodes are observations of random variables, and square nodes are fixed values corresponding to hyperparameters. In the kriging models, the abiotic variables are observed at locations a ∈ D, and Xj(s) are the random variables describing the random field at all locations, for all K abiotic variables. In the SDM, the latent random variable Xj(s) describes the true abiotic values, Zj(s) are the non-linear covariates, W(s) is the spatial effect, and Y(s) is the binary random variable indicating whether a species occurs, defined at all locations with observations at locations s ∈ D′. The abiotic random variable is linked directly to the SDM, so its uncertainty is taken into account by considering the entire distribution of likely values. For the formulae, refer to Eqs 5 and 7.
Fig 4.
The Netherlands subdivided into provinces.
These divisions were used for leave-province-out cross-validation: one province is left out for validation and the model is trained on the other provinces; the process is then repeated for each province.
Fig 5.
The Netherlands subdivided into physical geographical regions (FGRs).
These divisions were used for leave-region-out cross-validation: one region is left out for validation and the model is trained on the other regions; the process is then repeated for each region.
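The leave-one-group-out scheme described for provinces and regions can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `leave_one_group_out` and the labels in the usage note are hypothetical, and each observation is assumed to carry one group label (a province or an FGR).

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) once per distinct group label.

    `groups` holds one label per observation, for example the province or
    region in which the plot lies.
    """
    for held_out in sorted(set(groups)):
        # Train on every observation outside the held-out group ...
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        # ... and validate on the observations inside it.
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx
```

With `groups = ["A", "A", "B", "C"]`, the generator yields three folds, one per label, and every observation is used for validation exactly once.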
Fig 6.
The true abiotic covariate values X1(r), X2(r) follow a latent Gaussian field, and the true species suitability is defined by linear effects of these covariates and a residual spatial effect W(r).
The random variables are defined on the entire unit square r (grid in A, B, C), but the observation locations of the abiotics (dots in A, B) and of the species (dots in C) are chosen separately. This results in spatial misalignment, where the species occurrence Y is observed at many locations but not at the abiotic measurement locations (NA).
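The misalignment described in this caption can be illustrated with a toy sketch in which abiotic and species locations are drawn independently from the same grid. Everything here is an assumption for illustration: the function name, the sample sizes, the coefficients, and in particular the smooth sine surface, which is a stand-in and not a Gaussian random field.

```python
import math
import random


def simulate_misaligned(n_abiotic=30, n_species=100, grid=20, seed=0):
    """Draw abiotic and species locations separately on a grid over the unit square."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(grid) for j in range(grid)]

    def x1(cell):
        # Stand-in smooth surface (NOT a Gaussian random field).
        u, v = (cell[0] + 0.5) / grid, (cell[1] + 0.5) / grid
        return math.sin(2 * math.pi * u) * math.cos(2 * math.pi * v)

    # The two sets of locations are sampled independently of each other.
    abiotic_cells = set(rng.sample(cells, n_abiotic))
    species_cells = rng.sample(cells, n_species)

    records = []
    for cell in species_cells:
        # Hypothetical linear effect on the log-odds scale.
        p = 1.0 / (1.0 + math.exp(-(-0.5 + 2.0 * x1(cell))))
        y = 1 if rng.random() < p else 0
        # The covariate is known (with measurement error) only where an
        # abiotic measurement happens to exist; otherwise it is missing.
        x_obs = x1(cell) + rng.gauss(0, 0.1) if cell in abiotic_cells else None
        records.append({"cell": cell, "y": y, "x1_obs": x_obs})
    return records
```

Most records returned by `simulate_misaligned()` have `x1_obs` set to `None`, which is the situation a two-stage or joint model has to resolve by interpolating the covariate.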
Fig 7.
Comparison of the estimated parameters in the two-stage and joint models for the example species Empetrum nigrum (Crowberry).
The first three panels (A) plot the SDM splines, the fourth panel (B) plots the coefficients for uncertain abiotic variables, the fifth panel (C) plots the parameters of the spatial component in the SDM, and the last five panels (D) plot the parameters of the spatial interpolation models of the abiotic variables and their measurement error.
Fig 8.
The left panel (A) shows the observed species (Empetrum nigrum). Gray grid cells are those without any observations; black grid cells are those with observations, but none of this species. Other colors depict the fraction of plots in the cell in which this species was observed. The top-row middle (B) and right (C) panels compare the mean predicted log-odds occurrence probability in the joint and two-stage models. The bottom-row middle (D) and right (E) panels compare the standard deviation, i.e. the uncertainty, of the predicted log-odds occurrence probability.
Table 2.
Pearson correlation coefficients of the estimated abiotic associations across all 50 species between the two-stage and joint models.
Fig 9.
We compare how similar the 6 predicted maps appear by calculating Pearson correlation coefficients for 50 plant species.
The maps are based on predicted mean log-odds occurrence probabilities (panel ‘SDM’) and predicted mean values (other panels). Each histogram consists of 50 correlation coefficients, one per species, calculated between the predicted grid cell values of the full-coverage maps from the two-stage and joint models.
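The map comparison can be sketched as a Pearson correlation between two vectors of grid cell predictions, one from each model. This is a minimal illustration assuming both maps are flattened to equal-length lists in the same cell order; the function name `pearson` is hypothetical.

```python
import math


def pearson(map_a, map_b):
    """Pearson correlation between two equally ordered lists of grid cell values."""
    n = len(map_a)
    if n != len(map_b) or n < 2:
        raise ValueError("maps must have the same length of at least 2")
    mean_a = sum(map_a) / n
    mean_b = sum(map_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant map")
    return cov / math.sqrt(var_a * var_b)
```

Applying this once per species yields the 50 coefficients summarized in each histogram.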
Fig 10.
Root-mean-square error (RMSE) comparison of validation results between the joint and two-stage models.
Dots represent 50 different plant species. The left panel plots leave-region-out validation, while the right panel plots leave-province-out validation. The joint model outperforms the two-stage model for species that are located below the diagonal.
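As a reminder of the quantity plotted, the sketch below computes an RMSE between held-out observations and predictions. It assumes, for illustration only, that observations are coded 0/1 and predictions are occurrence probabilities; the function name `rmse` is hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observations and predictions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

A species falls below the diagonal when the RMSE of its joint-model predictions is smaller than the RMSE of its two-stage predictions on the same held-out data.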
Fig 11.
Estimates of the linear effect and its 95% credible interval in the data sets generated from the simulation with spatial misalignment and measurement error.
Both the joint and the two-stage model return estimates of the true value (dashed line). Both return unbiased estimates, but the joint model is more certain (0.23 vs. 0.28). The joint model estimates are also generally more accurate (RMSE 0.22 vs. 0.24).
Fig 12.
Example of the relationship between the observed values, the kriging prediction, and the true values X1(r) of the first covariate in the region r ∈ [0, 1] × {0.8} of the first simulation run.
The observed value (blue dots) is a realization of the true value (blue line) with measurement error, whose uncertainty is shown by 95% credible intervals (blue shading). Kriging (orange line) estimates the true value, and the uncertainty of this estimate is shown by 95% credible intervals (orange shading).
Table 3.
Summary of SDM literature most closely related to our research.
Columns indicate whether the models adjust for uncertainty (Adjust), whether a GLM or GAM was used, whether multiple predictors were included (Xj), whether Berkson error (UB) or classical error (UC) was considered, whether the predictors were estimated by kriging (X(s)), and whether a spatial component was included in the SDM (W(s)).