Figures
Abstract
Background
Predicting and explaining species occurrence using environmental characteristics is essential for nature conservation and management. Species distribution models consider species occurrence as the dependent variable and environmental conditions as the independent variables. Suitable conditions are estimated based on a sample of species observations, where one assumes that the underlying environmental conditions are known. This is not always the case, as environmental variables at broad spatial scales are regularly extrapolated from point-referenced data. However, treating the predicted environmental conditions as accurate surveys of independent variables at a specific point does not take into account their uncertainty.
Methods
We present a joint hierarchical Bayesian model where models for the environmental variables, rather than a set of predicted values, are input to the species distribution model. All models are fitted together based only on point-referenced observations, which results in a correct propagation of uncertainty. We use 50 plant species representative of the Dutch flora in natural areas with 8 soil condition predictors taken during field visits in the Netherlands as a case study. We compare the proposed model to the standard approach by studying the difference in associations, predicted maps, and cross-validated accuracy.
Findings
We find that there are differences between the two approaches in the estimated association between soil conditions and species occurrence (correlation 0.64-0.84), but the predicted maps are quite similar (correlation 0.82-1.00). The differences are more pronounced in the rarer species. The cross-validated accuracy is substantially better for 5 species out of the 50, and the species can also help to predict the soil characteristics. The estimated associations tend to have a smaller magnitude with more certainty.
Citation: Viljanen M, Tostrams L, Schoffelen N, van de Kassteele J, Marshall L, Moens M, et al. (2024) A joint model for the estimation of species distributions and environmental characteristics from point-referenced data. PLoS ONE 19(6): e0304942. https://doi.org/10.1371/journal.pone.0304942
Editor: Charlotte Chibuzor Ndiribe, University of Deusto: Universidad de Deusto, SPAIN
Received: October 26, 2023; Accepted: May 21, 2024; Published: June 21, 2024
Copyright: © 2024 Viljanen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The novel code and data that support the findings of this study are openly available in at https://zenodo.org/doi/10.5281/zenodo.8275716. This repository also includes all code used to make the figures. All experimental results from running the codes are also provided as large CSV files: https://zenodo.org/doi/10.5281/zenodo.8261503.
Funding: This research was carried out in the framework of the Strategic Program RIVM (SPR), in which expertise and innovative projects prepare RIVM to respond to future issues in health, the environment and sustainability. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Biodiversity worldwide is under threat by a multitude of stressors, including anthropogenic pressures such as pollution and climate change [1, 2]. Successfully and effectively mitigating threats to a given species requires knowledge about its ecology, but such information is not always available. One solution is to use ecological models to explain and predict species presence.
A powerful tool to this aim are Species Distribution Models (SDMs), which are statistical models that attempt to predict and explain species occurrence using environmental characteristics [3, 4]. The response variable is species occurrence and the explanatory variables are most often environmental characteristics that include various descriptors of the abiotic environment. Researchers have developed increasingly complex SDM techniques based on statistical models and machine learning [5, 6]. SDMs are fitted to spatial data, where spatial auto-correlation is a characteristic that should be taken into account for statistical inference [7, 8] and prediction [9–11]. For more complete description of SDMs and related statistical issues we refer the reader to reviews [12, 13].
Data sets that describe environmental conditions are becoming increasingly available [14]. More data offers promise in more fully capturing the habitat characteristics of species, which can result in more accurate maps and detection of novel predictors of species occurrence [15–17]. These predictors are readily used in SDMs or other ecological models. However, environmental data is often predicted from other models, measured with error, or interpolated from measurement points. There is inherent uncertainty in GIS layers [18–20], local climate interpolated from weather stations [21, 22], thematic resolution and change of land use [23], and the coordinates of species occurrence in historical data [24]. Recent studies suggest that poor model performance can be attributed to high levels of uncertainty in the environmental data [25]. Spatial misalignment, where the measurements of environmental factors are not properly aligned to the species observation data, is one critical source of uncertainty in studying the effect of environmental factors on species distribution [26].
Predicting an accurate map of habitat suitability requires accurate environmental conditions at every possible point of the study region. A simple solution to unknown values is a two-stage approach. In the first stage, one predicts the environmental factor at every spatial location. Typical solutions are using a geostatistical model such as kriging, a machine learning model such as random forest, or down scaling each observation into a full coverage grid. At the second stage, these predicted environmental factors are treated as ground truth in the species distribution model. However, this approach does not consider uncertainty in the covariate values, which can lead to erroneous statistical inferences [27]. Few studies have attempted to assess the effect of uncertainty in the environmental variables on an SDM model [26–32]. There is still a need for a proper method for realistic SDMs that combines errors from the different environmental variables into a single model [33, 34].
One solution is to implement so-called errors-in-variables models [35]. Both classical measurement error and Berkson errors can be considered in a species distribution model depending on the beliefs about the underlying phenomena [31]. The models can be estimated by integrating over the latent variables without error in either frequentist (EM-algorithm) or Bayesian (MCMC or INLA) statistical inference. It is difficult to develop a fully general model because a complex species distribution model has to be integrated with an extrapolation method for the predictors to correctly consider the uncertainty and its propagation. Here, we propose a hierarchical Bayesian model as a particularly attractive solution, because both the observed data and model parameters can be random variables in Bayesian inference [36]. Approximations used by INLA make it possible to fit complex hierarchical and spatial models to large data sets [37]. Standard SDMs have been implemented before in the Bayesian framework using the INLA package, see for example [38–42]. As a novelty, we extend this approach to consider the environmental factors from a set of models as input to the SDM, with all models estimated jointly from data.
In this paper, we implement the joint model and the standard two-stage model in a Bayesian framework with a specification that is otherwise identical. Our aim is to investigate the impact of taking the uncertainty in the predictor variables into account. We compare the estimated associations, predicted maps, and accuracy in cross-validated data. Since there is uncertainty inherent in any model, including maps and associations predicted from an SDM, it is best if the uncertainty is explicitly quantified so that well-justified decisions can be made [33, 43–45].
2 Data
As a use case, we consider the Netherlands where The National Flora Monitoring Network—Environment and Nature Quality (LMF-M&N) [46] collects a data set of species observations and the Wageningen University & Research [47] has measured abiotic factors. Based on these data sets, we build SDMs to illustrate how uncertainty in the abiotic factors can be incorporated into the species distribution model.
The LMF-M&N [46] aims to collect measurements to inform policy makers in the Netherlands. One interest of the national government is to conserve the biodiversity in natural areas. Less common native plant species are threatened in particular and are an important consideration for making policy for nature management [48]. The data set is an inventory of vegetation in permanent test areas, recorded using the Braun-Blanquet method [49]. We considered the data recorded across 12799 plots up until 2019. All species of vascular plants (Tracheophytes) present in these plots are noted during field visits, and we extract the presence/absence of every species at every plot. In total 1442 plant species were observed. The ecologists performing the observations were given a list of possible plant species that could be found in this region, along with a list of species that had been observed at this plot in the past. In general, the recordings at any plot are repeated once every four years. The size of the plots differed per vegetation type that it falls in, generally ranging from between 2x2m and 5x5m for grasslands, and between 25x25m and 50x50m for forests. The monitoring network is a stratified sample, where natural areas of interest are over-represented in sampling. The Netherlands was divided into smaller spatial units [46] based on physical geographical regions (FGR) sharing similar environmental characteristics [50], which were further refined into smaller areas with different vulnerability to acidification and eutrophication [51].
Abiotic environmental factors are available for a subset of plots in a database from Wageningen University & Research [47]. Field measurements are used to record soil conditions such as pH, groundwater table, potassium and nitrogen at 1208 plots. Not every measurement was available at every location in the original dataset, but the 6 soil variables extracted from this data set were recorded at most plots. The abiotic data set is collected independently from the LMF-M&N but can be linked to it using the spatial location of each plot. However, this results in spatial misalignment where some species plots fully overlap with the abiotic field measurements, others are close by to some measurement, and others are far away. This results in varying degrees of uncertainty about abiotic factors at each species plot.
The spatial misalignment is illustrated in Fig 1 and Table 1. The Netherlands spans an extent of 312km x 264km and we created a grid of 2.88km x 2.88km cells to define the location of predictions in a full-coverage map, the resolution of which could be arbitrarily increased or decreased. Note in Fig 1 that, even if the abiotic measurements (green) and species occurrence field visits (blue) would have been perfectly aligned, predicting the entire Netherlands would also require the measurement of abiotic variables in all grid points. In principle, similar partial overlap and uncertainty is present when considering the year of measurement. For simplicity, we take the latest field visit and focus on spatial uncertainty.
The blue dots show the location of species observations in LMF-M&N dataset, while the green dots represent the abiotic measurements from Wageningen University & Research. A grid of 2.88km x 2.88km cells has been created and imposed on the Netherlands to define the grid cell centre as the location of predictions, but its resolution can be adjusted arbitrarily. The axis denote distance to the centre of the Dutch coordinate system (Rijksdriehoek) in km.
The (plot, year)-pairs define a distinct field visit or abiotic measurement in space and time.
We have chosen eight independent variables that describe habitat preferences in the SDM: day number of field visit, pH, amount of Organic matter (ORG), Carbon/Nitrogen-Ratio (C/N), Nitrogen (N), Phosphorus (P), Potassium (K), and land use [52]. Land use classification is available for the entire Netherlands from open spatial data. The day number, pH and land use are considered known, the five soil conditions ORG, C/N, N, P, K are considered uncertain over the entire Netherlands. For modelling these variables, we have log transformed them and converted them to z-scores, i.e. mean centered and divided by the standard deviation, to ensure that the vague prior has an equal impact on all variables. In addition, location information is provided as x&y-coordinates based on the ‘Rijksdriehoekscoördinaten’, the Dutch coordinate system. Every point has x-coordinate between 0 and 280 km and y-coordinate between 300 and 625 km, where the central point of the projection is in Amersfoort x = 155 km, y = 463 km. In the model and produced maps, we expressed both the x-coordinate and the y-coordinate as a deviation from the central point in km.
The day number (1–365) of field visit is recorded at every field visit, so it is available at each location. Most observations were done between March and November. We use an interpolated pH map in S1 Fig, which was created from pH measurements with kriging using the following covariates: land use, soil type [53], and common heather (Calluna vulgaris) as an indicator plant. Research indicates that similar maps of pH in the natural areas of Netherlands can be considered accurate [54]. We re-coded landuse into categories: Forest: -4, Dry natural terrain: -3, Wet natural terrain: -2, Water: -1, Recreation: 0, Railway: 1, Main road: 2, Build: 3, Agriculture and other agricultural: 4. Land use is illustrated in S2 Fig and the number of plots was stratified towards natural areas as illustrated in S3 Fig.
In total 50 species were selected by hand, see S1 Table. Criteria for the selection, besides a minimum number of positive findings of 20 in the database, were that the selection should be representative for the Dutch flora in natural areas. This means that we selected rare species according to the national Red List, threatened species, common species and species increasing in occurrence. Also some tree species, shrubs and dwarf shrubs were included in the selection as well as the most common species in the dataset Holcus lanatus. Species selection reflects a variety in preferences for abiotic circumstances, from nutrient poor to nutrient rich and from dry to wet.
3 Methods
We compare two models that both implement spatial interpolation combined with a species distribution model in a Bayesian framework and apply these to all of the 50 selected species:
- Two-stage model: Model the abiotic variables, then fit a species distribution model to the predicted mean of abiotic variables.
- Joint model: Model the abiotic variables and species distribution simultaneously, by using unknown latent abiotic variables.
These two approaches contain identical model components. One standard kriging model is used for every abiotic variable. A standard generalized additive model with a spatial component is used for the species distribution model based on these abiotic variables. The only difference is that the joint model does not assume the abiotic values are known for certain. The two stage model uses the predicted mean abiotic values at each species observation, whereas the joint model considers the entire possible distribution of the abiotic variables. This is accomplished by chaining the two models together in a hierarchical Bayesian model by feeding the output of abiotic variable models as input to the species distribution model, which are then random variables instead of constants. The proposed species distribution model is not based on the potentially false assumption of accurately measured abiotic variables. In the next 3 sections we first define the kriging model, then the SDM, and finally how they are combined in these two approaches.
3.1 Interpolation of abiotic variables
Kriging refers to a widely used method for interpolation or smoothing spatial data [55]. For a model-based approach, we assume data to be generated by a stochastic process, which is a real-valued random variable X*(a) indexed by a point a in domain . Given abiotic measurement locations
, the abiotic measurements consist of realizations of the process at these locations:
(1)
The stochastic process is a Gaussian Random Field (GRF) if for a set of locations the outcome vector is assumed to have a multivariate Normal distribution x* ∼ MVN(μ, Σ) where μi = E[X*(ai)] and
.
A model specifies a probability distribution for the vector of observed outcomes x* from the stochastic process X*(a), which we consider a ‘noisy’ realization of the latent Gaussian Field (GF) X(a). We define the observed stochastic process as X*(a) = X(a) + ϵ where ϵ ∼ N(0, δ2) is the measurement error. The underlying latent field for an abiotic variable is specified by a mean value E[X(ai)] = α and the Matérn covariance function:
(2)
where we used INLA’s default smoothness parameter ν = 1 and Kν is the modified Bessel function of the second kind of order ν. The parameter σ2 is the variance and κ is the scaling parameter of the spatial effect. The range is the distance where the spatial correlation is close to zero, related to the scaling parameter and empirically defined as
for the Matérn covariance function.
The covariance function now depends on two parameters: the variance (partial sill in kriging) and the range of the spatial effect. For large data sets, it can be computationally infeasible to estimate a dense spatial covariance matrix Σ. This is known as the ‘big n’ problem [56]. A computationally effective alternative is given by the stochastic partial differential equation (SPDE) approach [57], which involves a sparse precision matrix. The internal SPDE approach is parameterised by τ and κ and σ2 then follows from
. The default prior is log(τ) ∼ N(0, 10) and log(κ) ∼ N(0, 10).
The intercept defined as a fixed effect has a default prior α ∼ N(0, 1000). For the measurement error the default prior is defined in terms of precision, i.e. inverse of the variance .
3.2 Species Distribution Model (SDM)
We specify the SDM as a Generalized Additive Mixed Model (GAMM). We model species occurrence at a given location by a binary random variable Y(s), indexed by a point s in domain . Given species observation locations
, the outcomes are presence/absence at these locations:
(3)
Let y(si) denote the observed values of the binary random variable Y(si) ∼ Bernoulli(μi) whether the species is observed at location i, where
is the probability of occurrence at plot i given by the logistic link function. A structured additive predictor ηi accounts for the covariates in an additive way:
(4)
where β0 is the intercept;
are the coefficients with linear relationship to covariates x = (x1, …, xK) and f = (f1, …, fL) are functions of the covariates z = (z1, …, zL). These can be non-linear effects of covariates, time trends, random intercepts and slopes, temporal or spatial random effects. Non-linear effects can be modelled using a spline fj(u) based on a random walk model of order 2 over discrete domain u = (u1, …, ud) of the covariate values, where the second order increments are distributed as Δ2ui = ui − 2ui+1 + ui+2 ∼ N(0, ζ2). We use splines for day number, pH, and land use, while ORG, C/N, N, P, K are entered as fixed effects. The fixed effects, including the intercept, have default priors βj ∼ N(0, 1000). The default prior on random walk spline is again defined in terms of precision, i.e. inverse of the variance
.
3.3 Two-stage and joint models
The two-stage and joint models include the same two modelling tasks; interpolation of abiotic variables and SDM. There are two problems that cause uncertainty about the true covariate values in the SDM:
- We observe a realization
with an unknown amount of measurement error ϵ > 0.
- The set of abiotic measurements
does not contain every location
, so
is unobserved for
.
We have noisy observations at every abiotic measurement location
. However, the SDM defined in the previous section assumes the underlying realization of the random field xj(s) of every covariate j is known at every field visit location
for fitting the model and entire Netherlands
for predicting the map. Because the true abiotic values are not known, we must define how an SDM is fitted to the noisy point-referenced data.
First, consider K spatial (kriging) models for each covariate:
(5)
where the random variable
is a noisy realization of covariate Xj(s) with additional variance
in the outcome to express any possible measurement error. The underlying random field for each covariate j is specified as a realization of the Gaussian process depending on intercept αj and spatial correlation parameters
. We have also defined predicted abiotic values
as the the expected value of the posterior predictive distribution. These predictions are made at every field visit location
given observations at abiotic measurement locations
.
The two-stage model uses the predicted abiotic values as known constants:
(6)
The joint model considers the abiotic values xj(s) in every location as unknown latent variables Xj(s):
(7)
In both models, μi is the probability of occurrence at field visit i = 1, …, N given by the logistic link function g(ηi). At each field visit i, we have K random variable covariates
with linear effects and L fixed covariates
with non-linear effects fj modelled by a spline based on a random walk model of order 2. The models also include a spatial effect W(s) to model any spatial patterns in species occurrence unaccounted by the covariates. We use a Directed Acyclic Graph (DAG) to illustrate the two-stage model in Fig 2 and the joint model in Fig 3.
A rectangle is used to group variables that repeat together with the repeating subscript or argument defined at the bottom right. The white nodes are random variables, the gray nodes are observations of random variables, the square nodes are fixed values corresponding to hyperparameters. In the kriging models, are the observed values for the abiotics at locations a ∈ D, Xj(s) are the random variables depicting the random field at all locations
, and
are the posterior means, which can be seen as predicted abiotic values, for all K abiotic variables. In the SDM,
are treated as observations of true abiotic values, Zj(s) are the non-linear covariates, W(s) is the spatial effect, and Y(s) is the binary random variable indicating whether a species occurs at all locations
with observations at locations s ∈ D′. This two-stage model disregards the uncertainty about the true abiotic values. For the formulae refer to Eqs 5 and 6.
A rectangle is used to group variables that repeat together with the repeating subscript or argument defined at the bottom right. The white nodes are random variables, the gray nodes are observations of random variables, the square nodes are fixed values corresponding to hyperparameters. In the kriging models, are the observed values for the abiotics at locations a ∈ D, Xj(s) are the random variables depicting the random field at all locations
, for all K abiotic variables. In the SDM, the latent random variable Xj(s) describes the true abiotic values, Zj(s) are the non-linear covariates, W(s) is the spatial effect, and Y(s) is the binary random variable indicating whether a species occurs at all locations
with observations at locations s ∈ D′. The abiotic random variable is linked directly to the SDM, which means the uncertainty is taken into account by considering the entire distribution of the likely values. For the formulae refer to Eqs 5 and 7.
The two-stage model fits the K independent kriging models first, and uses the resulting predictions as input to a species distribution model fitted afterwards. Note that the joint model combines K + 1 separate models: K spatial models for each covariate and 1 species distribution model based on these covariates. These models are fitted simultaneously to observed data in each location, and it is not a requirement that every covariate or the species is observed at every location. The joint model correctly propagates all uncertainty related to the covariate prediction but can be computationally expensive especially for K > 1.
3.4 Validation
In order to evaluate model generalization ability, we mimic a scenario in which practitioners must use a model that has been trained in one area to estimate species occurrence in another location. We use spatial cross validation for the validation sets due to spatial correlation, since otherwise it would be relatively easy to predict species occurrence from observed nearby species occurrence. We formed the validation sets using both the set of provinces of the Netherlands as well as the Dutch Physical Geographical Regions (FGR) [50], see Figs 4 and 5. Provinces are administrative regions so we have some overlap in their environmental characteristics, which may help the model to estimate the habitat suitability of every species. The FGRs are regions defined by shared environmental characteristics, so we expect more errors since there are fewer similar areas included in the train data set.
These divisions were used for leave-province-out cross validation, where one province is left out for validation and the model is trained on other provinces, the process is then repeated for each province.
These divisions were used for leave-region-out cross validation, where one region is left out for validation and the model is trained on other regions, the process is then repeated for each region.
The models are trained by leaving all field visits in one of the regions out from training data and training on all other field visits. The predicted posterior distribution of the species occurrence probability μi ∈ [0, 1] is then compared to the true species occurrence yi ∈ {0, 1} in validation data. This gives cross-validated predictions for every species and field visit. We report the model accuracy as the Root Mean Squared Error (RMSE), which measures both discrimination and calibration. It is the square root of the Mean Squared Error (MSE), which is also a valid metric in binary classification known as the Brier score [58]:
We also compare predicted prevalence and true prevalence in every region. Given a partition of field visits into r = 1, …, T geographic regions
, where
and
, we calculate the true prevalence in each region as the percent of field visits where the species occurs
. We also calculate prediction intervals for the prevalence. This is done by sampling the species occurrence probabilities at every field visit location from their joint posterior distribution, then using these probabilities to sample the species occurrence (0/1). Repeating this sampling we obtain a distribution of predicted prevalences, where the 2.5% and 97.5% quantiles are used as the prediction intervals.
3.5 Simulation
It is not possible to know for sure which estimated associations are correct in a real data set since we do not know the ground truth. A simple simulation, illustrated in Fig 6, can be envisioned to assess which estimates are more correct. We sample data sets from an assumed parametric relationship defined by known parameter values, where a generative model of the data is given by the same likelihood that was used to define the models.
The random variables are defined on the entire unit square r (grid in A,B,C) but the observation locations of abiotics (dots in A,B) and species (dots in C) are chosen separately. This results in spatial misalignment, where the species occurrence Y is observed at many locations but not the abiotic measurement locations (NA).
Let us define the simulation formally. Consider all locations on the unit square r = [0, 1] × [0, 1]. Suppose we have observations Y(s) of species occurrence at field visit locations , and observations of abiotic variables
and
at measurement locations
. To mirror the scenario with our data set, we let N = 1000 and M = 100 to imply higher uncertainty about the abiotic values, and uniformly sample each location from the unit square. This results in spatial misalignment because the species and abiotic observations have different locations.
The abiotic observations are generated from latent variables that represent the true covariate values. We wish to define two covariates with spatial autocorrelation and confounding (Var[X1(r)] = Var[X2(r)] = 1.0 and Cov[X1(r), X2(r)] = 0.5). First, we define a latent Gaussian Random Field (GRF) on the unit square: let it have a variance parameter and a range parameter r† = 0.3. We then simulate three different realizations of the GRF:
according to the Matern covariance function. The true covariate values X1(r), X2(r) have the desired properties if we let X1(r) = Z(r) + Z1(r) and X2(r) = Z(r) + Z2(r). Their observed counterparts
are random variables with additional uncertainty caused by the measurement error, let it have a variance parameter
:
(8)
The true species occurrence probability μ at every location is defined using the true covariates with linear effects, let them have values
, and a spatial effect W(r):
(9)
Our simulation is therefore defined by the following ground truth parameter values: the number of locations (N, M), the GRF variance and range (
), the measurement error variance (
), and the linear effects (
). In each run of the simulation i, we generate a dataset
with a different realization of the abiotic covariates and species suitability using the ground truth parameter values, then sample the abiotic and species observation locations, and finally their observed values. This is illustrated in Fig 6 for one run of the simulation.
4 Results
In this section, we compare the interpretation and predictions given by the joint and two-stage models. First, we show all model output for a single species, including model parameters and predicted maps. We then compare all 50 species using summaries of these statistics and performance on the validation sets to assess model accuracy.
4.1 Example species: Empetrum nigrum
We chose to highlight Empetrum nigrum (Black crowberry) because it is one of the more common rare species and has clear soil type preferences. Moreover, the species is under threat of climate change in the Netherlands and could become extinct from the Netherlands with ongoing temperature raise. The species also has a longer lifespan, which makes it react to changes a bit slower than perennials but quicker than e.g. shrubs or trees. This species also shows differences in the model results that are representative of the differences that occurred in most species. Empetrum nigrum occurs in a minority of plots (2.4%), preferring dry, sunny to slightly shaded places with acidic sandy soil. Common occurrences in the Netherlands include the North Sea islands and large sandy nature areas.
We first show model parameters from both models in Fig 7. Empetrum nigrum has a strong tendency towards more acidic soils. The occurrence seems much more likely in natural areas, except those that are fully forested. For example, dry natural areas with high acidity have 8-fold increase in log-odds occurrence probability compared to forested areas with neutral pH. The joint model considers the day number, pH, and terrain to have a slightly stronger association with the species occurrence. However, there is a striking difference in the estimated associations and certainty of the five abiotic variables (Organic matter, Carbon/Nitrogen-Ratio, Nitrogen, Phosphorus, Potassium). The joint model that takes the uncertainty of the variables into account considers that associations are much smaller and more certain. For example, the association of Nitrogen and log-odds occurrence probability is estimated as -0.6 (std. 0.7) instead of -4.5 (std 4.0.) The estimates of the joint model are within credible intervals of the two-stage model. The spatial residual also tends to be slightly smaller in the joint model. Both models give almost identical estimates of the random field used to interpolate abiotic factors.
The first three panels (A) plot the SDM splines, the fourth panel (B) plots the coefficients for uncertain abiotic variables, the fifth panel (C) plots the parameters of the spatial component in the SDM, and the last five panels (D) plot the parameters of the spatial interpolation models of the abiotic variables and their measurement error.
We provide all six predicted maps from both models: 1 species occurrence map (Fig 8) and 5 abiotic maps in the supplementary material for Organic matter (S4 Fig) Carbon/Nitrogen-Ratio (S5 Fig), Nitrogen (S6 Fig), Phosphorus (S7 Fig), Potassium (S8 Fig). We compare each map over all grid cells by contrasting the two model predictions using a scatter plot in S9 Fig. Predicted abiotic values in Netherlands are virtually identical (1.00 correlation) but the predicted species occurrence maps have small differences (0.94 correlation). The joint model considers that Empetrum nigrum has a slightly higher occurrence towards more suitable areas and lower occurrence in non-suitable areas based on the comparison of mean posterior log-odds occurrence probability in S9 Fig. The predicted occurrence map is also considered slightly more certain in the joint model because the standard deviation is below the diagonal.
Left panel (A) is the observed species (Empetrum nigrum). The gray grid cells are where we had no observations at all, the black grid cells are where we did have observations, but not for these species. Ohter colors depict the fraction of how often this species was observed in plots in this cell. The top row middle (B) and right (C) panels compare the mean predicted log-odds occurrence probability in the joint and two-stage models. The bottom row middle (D) and right (E) panels compare the standard deviation, i.e. uncertainty, of predicted log-odds occurrence probability.
4.2 All species
We have run both models over all 50 expert selected species. We summarize these results by first considering the differences in abiotic associations, then compare the predicted maps, and finally the predictive accuracy in cross-validated areas.
The relationship between species occurrence and Organic matter, Carbon/Nitrogen-Ratio, Nitrogen, Phosphorus, and Potassium is displayed in S10 Fig. The same trend applies over all species: the joint model produces estimates of association that tend to be smaller and are almost always more certain. These differences are strongest for relatively rarer species, for which it is difficult to confidently estimate the associations with species occurrence. For more common species, estimates are often very similar, see for example Urtica dioica, Lolium perenne, and Quercus robur. Even though most estimates are similar there are few differences, for example Nitrogen for the Anthoxanthum odoratum. We summarize the correlations of the estimated abiotic parameters over all 50 species in Table 2. The correlations are modest to strong (0.64–0.84), indicating that there are differences between the two models.
We calculate a correlation coefficient from the predicted maps produced by the joint vs. two-stage model. Altogether, this results in 300 summaries of maps from 50 species and 6 predictions (SDM, ORG, C/N, N, P, K) displayed in S2 Table. The majority of maps estimated for different species are close to identical, but there are few exceptions with small differences that can occur in either estimated abiotic values or estimated species occurrence. For example, Empetrum nigrum has identical estimated abiotic values but shows different species occurrence maps, whereas Achillea millefolium has identical species occurrence and other abiotic values but shows a different estimated Nitrogen map. We visualize these summaries in Fig 9.
The maps are based on predicted mean log-odds occurrence probabilities (panel ‘SDM’) and predicted mean values (other panels). The histograms consist of 50 correlation coefficients calculated from the predicted grid cell values of the full-coverage map between the two-stage vs joint models.
We finally perform spatial cross-validation to compare out-of-sample species occurrence predictions to the true occurrence in each plot. The resulting RMSE for every species is summarized in Fig 10 and species specific RMSEs are listed in S3 Table. Generally, the joint model and the two-stage model display very similar RMSE. For some species the joint model appears more accurate; these include Primula elatior, Asparagus officinalis, Puccinellia maritima, Dactylorhiza maculata, and Epipactis palustris. The below diagonal area in Fig 10 signifies species for which the joint model has better performance.
Dots represent 50 different plant species. The left panel plots leave-region-out validation, while the right panel plots leave-province-out validation. The joint model outperforms the two-stage model for species that are located below the diagonal.
We also predict the overall out-of-sample prevalence for all 50 species in each province and FGR. The models generally produce almost identical estimates of prevalence, with the joint model giving more confident estimates, i.e. narrower prediction intervals. The full results are given in S11 Fig for provinces and S12 Fig for FGRs. The confidence increases more for species that occur rarely in the data set, are present in only few areas, or are considered threatened. The number of times each species occurs is counted in S1 Table. For example, in many provinces and FGR regions the least observed species Primula elatior has 0–25% credible prevalence in the two-stage model but 0–5% credible prevalence in the joint model.
4.3 Simulation
In the simulation, consider the covariate X1(r). It has a different random field in every run and a different estimated value , even though the true effect
stays the same. This enables assessing how well both models estimate the true parameter value in different data sets. We illustrate the estimates and their 95% credible interval from the simulations in Fig 11.
The joint and two-stage model return estimates of the true value (dashed line). Both return unbiased estimates () but the joint model is more certain (
0.23 vs 0.28). The joint model estimates are generally more accurate (0.22 vs. 0.24 RMSE).
We can consider four simulation settings if we exclude either spatial misalignment or measurement error. In spatially aligned data, abiotic measurements have also been performed at the 1000 species observation locations (M = 1100). In simulation without measurement error, we set . Kriging is not necessary if the data is spatially aligned because one may directly fit a regression model using the observed covariate values, which we call the ‘Direct’ model. A summary of all settings is displayed in S4 Table.
When there is no spatial misalignment or measurement error, all models return equally good estimates (1.01). Note that the direct model cannot be used to predict the map of abiotic values nor the map of species suitability
because there is no kriging interpolation of abiotics to all grid points. When there is no spatial misalignment but observations suffer from measurement error, the Direct model suffers from an attenuated estimate (0.65) whereas the Two-Stage and Joint model work equally well (1.02). If there is spatial misalignment, the Joint and Two-Stage model return similar estimates (1.02 vs. 1.03) of the effect
but the Joint model has a lower estimated variance (0.23 vs 0.28). The Direct model cannot be fitted because the covariate values are unknown at species observation locations. The Joint model has a higher accuracy in the predicted species observations map (0.108 vs 0.110) and the predicted covariate map (0.565 vs. 0.625).
5 Discussion
5.1 Comparing the joint and two-stage models
We have compared two modelling approaches, where the two-stage modelling approach is simple but does not take into account the uncertainty in the abiotic variables. The joint modelling approach is based on more correct assumptions since there is uncertainty in the abiotic variables as evidenced from the kriging results. The joint model and two-stage model asymptotically give the same results if the uncertainty decreases with better sampling coverage and less measurement error. The model definitions become identical if there is no uncertainty in the abiotic variables used in the SDM. We found differences to occur often in the model interpretation, with the joint model delivering abiotic associations of smaller magnitude and more certainty. Differences in predictive accuracy to unseen areas also occur for some relatively rarer species. However, predicted maps of species occurrence and abiotic values that cover the Netherlands appear almost identical when trained with the full data set for most of the species.
In our case study there are two reasons why the two-stage model is probably sufficient for predictive purposes. First, the abiotic variables can be reliably estimated due to a long range of spatial correlation, with credible values 20–150 km, see Fig 7. Second, there are 12799 species observation locations as illustrated in Fig 1, thanks to the LMF data having a large sample size and stratification designed with the goal of obtaining reliable status of plant occurrence in the Netherlands.
In the simulation we could choose whether there is spatial misalignment or measurement error. The goal is to always use the true covariate values in the SDM, and how well they are recovered depends on both the observation uncertainty caused by the measurement error and the model uncertainty caused by the spatial misalignment. A regression model can be fitted directly if there is no spatial misalignment, but the measurement error caused bias when the observed values were used in place of the true values, which is known as ‘regression dilution’. Kriging in both the Two-Stage and Joint estimates the expected value at every location, i.e. value of the abiotics without measurement error, so the association seems to be fixed when these are used in the SDM. It seems like the kriging in the two-stage model can be sufficient for dealing with the measurement error, but the joint model can be beneficial for dealing with spatial misalignment. The accuracy of predicted maps was higher in the joint model, especially for the abiotic value, which makes sense because abiotics had significantly more spatial uncertainty than species occurrence in the simulation.
The context of the SDM modelling task is important. We investigated a large data set designed for plant monitoring. Several factors could contribute to more uncertainty in other models or case studies, including: smaller sample sizes of species or abiotic variables, more rare species, more complex models, more covariates, more variability in covariates, smaller spatial ranges of covariates, and more measurement error. With more uncertainty in the covariates the differences between the models could become more pronounced.
5.2 Related literature
Some studies in the SDM literature can be seen as predecessors of our approach. Many researchers have suggested that a Bayesian analysis would provide solutions to a range problems in SDMs, including uncertainty in the predictor variables. The same idea motivates the model presented in our paper, but we argue it allows for more realistic SDMs since we include multiple predictor variables, spatial autocorrelation in the response variable, uncertainty estimated from data, and both Berkson and Classical error.
The definition of Classical and Berkson errors is as follows [59]. Denote the true value of a covariate as the random variable X and the observed value as X*. Denote a new random variable , Classical error UC ∼ N(0, δ2) and Berkson error UB ∼ N(0, σ2):
(10)
This corresponds to Classical error when UB = 0 since X* = X + UC, and Berkson error when UC = 0 since X = X* + UB. It directly follows that observations with Classical error have more variance: Var[X*] = Var[X] + δ2, whereas observations with Berkson error have less variance: Var[X] = Var[X*] + σ2. If neither is zero, both types of error are present. The INLA package can easily support either type of error [60] or more complex schemes that include both errors with missing values in covariates [61].
We connect the errors to our problem in Fig 12, which illustrates how uncertainty is created by kriging spatially misaligned values and having measurement error in the simulation. If repeated measurements of an abiotic factor give different results when measured at exactly the same location, we have measurement error. This corresponds to Classical error where the variance of the observations is larger than the variance of the true value. In Fig 12 for example, we have 4 measurements in the region X ∈ [0.63, 0.78], but we observe different outcomes due to the measurement error. If the observations of an abiotic factor are not done in all locations we use kriging to interpolate them. The kriging estimate is unbiased and the variance of predicted values is less than the variance of the true values. This corresponds to Berkson type error [62]. In Fig 12 for example, we have no measurements in the region X ∈ [0.78, 1.00], but we can estimate the abiotic values based on observations in the neighbouring regions. The predicted values are less variable than the true values. Regions with more observations imply less uncertainty; in the limit with unlimited observations we could perfectly recover the true value.
The observed value (blue dots) is a realization of the true value (blue line) with measurement error whose uncertainty is highlighted with 95% credible intervals (blue shaded). Kriging (orange line) estimates the true value whose uncertainty is highlighted with 95% credible intervals (orange shaded).
We summarize how our model can be seen as a generalization of previous studies in Table 3. Van Niel & Austin [28] explore the effects of Digital Elevation Model (DEM) error on SDMs, where an estimated error in the DEM is used to resample 10 different realizations of predictor variables. All of the steps of the modelling process were greatly affected by this type of error. The authors recommend that practitioners carefully consider how sensitive the modelling process and environmental variables are to errors in the source data. Denham et. al. [30] investigate a Bayesian framework for problems in ecology, where a classical measurement error model is inferred from validation data and linked to a regression model on these predictors. The results in real data yielded poor predictions, perhaps due to a faulty assumption of conditional independence between true and expert evaluated soil depth. Their simulations demonstrate the importance of using the measurement error model, instead of the standard model, for estimating the true relationship when the model assumptions are satisfied. McInerny & Purves [29] illustrate uncertainty in predictor variables using artificially generated data with a simple SDM where species absense or presense is determined by an unimodal response curve to temperature. They explain how uncertainty arising from fine-scale environmental variables causes ‘regression dilution’ and propose a correction using a Baysian method with a latent variable. Foster et. al. [27] investigate spatial misalignment in a synthetic survey where species presence or richness depends on the physical environment but the data sets are not co-located. They compare models fitted with true values, kriging predicted values, and Berkson error adjustment. However, even the true values resulted in bias, which the authors speculate could be due to spatial correlation. Stoklosa et. al. [31] show how classical error in predicted spatial climate variables can be incorporated into SDMs with hierarchical modelling and simulation-extrapolation (SIMEX). They perform a simple simulation with ground truth GLM parameters and fit the models to a species presence/absence data. Failing to account for error in variables resulted in biased parameter estimates and a loss of power as the error increases. Barber et al. [32] propose a hierarchical Bayes model to tackle the misalignment problem when species observations and annual mean temperature measurements have different locations. They compare 1) a joint model of the response variable and the covariate 2) a two-stage approach where the covariate is first predicted by kriging. They show that ignoring kriging uncertainty leads into different association, and interestingly the predicted map of species occurrence is more certain in the joint model.
Columns indicate: adjusting the models for uncertainty (Adjust), whether a GLM or GAM was used, multiple predictors (Xj), Berkson error (UB), Classical error (UC), the predictors were estimated by kriging (X(s)), spatial component was included in the SDM (W(s)).
5.3 Possible extensions
There are several improvements that can be considered in our work. First, the INLA R package has a technical limitation that uncertain covariates can only be included in linear terms, whereas it may be hypothesized that the abiotic variables ORG, C/N, N, P, K would have a unimodal response curve. We checked this assumption with a full GAMM where each covariate was modelled by a spline without uncertainty in covariates. However, results were often linear or at least monotonically increasing. There is no theoretical obstacle to implementing splines of uncertain covariates in a Bayesian framework. It may be interesting to investigate this with other, more general, packages such as Stan [63].
Second, computational efficiency will become a problem in increasingly complicated joint models. On average, to fit a single species takes 1 minute runtime for the two-stage model vs. 3 hours for the joint model, with identical components (4-core 2.5 Ghz, 8 GB RAM). A couple of hours is reasonable for a single modelling task and should not be an obstacle for taking the uncertainty into account, but performing the model fit over 50 species x 12+14 cross-validation splits would take months and necessitated a high performance cluster for parallel runs. The approximations used by INLA are quite fast, but more general approaches may result in an infeasible runtime. In practice, the benefits of taking uncertainty into account need to be balanced with the drawback of increasing computational burden.
Finally, future work might consider the uncertainty associated with time misalignment, i.e. situation where the field visit and abiotic measurement are done in different days or even years. INLA can support joint estimation of time series autocorrelation with spatial autocorrelation. This may be interesting especially for modelling effects of drought, acidification, climate change, and other environmental pressures over time.
6 Conclusions
We have investigated a new species distribution modelling approach where the uncertainty in abiotic variables caused by spatial misalignment and measurement error is taken into account. This is possible by fitting a joint model which simultaneously predicts the maps of abiotic values and species occurrence, feeding forward latent abiotic variables and their associated uncertainty estimated by kriging to the species distribution model. We have found that there are significant differences in the estimated associations between species occurrences and the uncertain abiotic variables. However, predicted maps for the Netherlands appear almost identical at this scale and resolution when models are fitted to the full data set. In testing out-of-sample predictive accuracy, we found that performance is very similar with the exception of few rare species where the new model is more accurate. It is interesting to note that information in the joint model goes both ways: abiotic variables will help to predict the species, but the species may also predict the abiotic variables. Our data set is a large nation-wide plant monitoring effort, and more uncertainty could imply larger differences between the models.
Supporting information
S1 Table. List of all 50 species chosen for the SDM.
The table shows common names, Latin names, and the number of field visits where the species is found (n). Bold common names indicate that the species is on the Red List, indicating rare or threatened species in the Netherlands.
https://doi.org/10.1371/journal.pone.0304942.s001
(PDF)
S2 Table. Pearson correlation coefficients.
We calculate a correlation coefficient between joint vs. two-stage model estimated maps for abiotic values and SDM log-odds occurrence probabilities in all 50 species.
https://doi.org/10.1371/journal.pone.0304942.s002
(PDF)
S3 Table. Validation RMSE in all 50 species of joint vs. two-stage model.
https://doi.org/10.1371/journal.pone.0304942.s003
(PDF)
S4 Table. Summary of model estimates over all simulation runs.
The first two columns signify spatial MisAlignment (MA) and Measurement Error (ME). Formally, the other columns are defined ,
,
,
,
where the expectation
is approximated by the sample mean over the sampled data sets
and r denotes the vector of all grid points.
https://doi.org/10.1371/journal.pone.0304942.s004
(PDF)
S1 Fig. Interpolated pH map.
The left panel (A) is the observed pH at abiotic measurement locations, the right panel (B) is the interpolated pH in every grid cell. Interpolation is based on spatial proximity, landuse, soiltype and Calluna vulgaris occurrence.
https://doi.org/10.1371/journal.pone.0304942.s005
(TIF)
S2 Fig. Landuse variable in every grid cell of the Netherlands.
Natural areas are small and localized regions, whereas most of the country is agricultural or built terrain.
https://doi.org/10.1371/journal.pone.0304942.s006
(TIF)
S3 Fig. Landuse in every grid cell vs. plots.
The number of field visits was stratified towards natural areas compared to the landuse in the entire country counted from grid cells.
https://doi.org/10.1371/journal.pone.0304942.s007
(TIF)
S4 Fig.
Left panel (A) is the observed organic matter mean value in each grid cell. The top row middle (B) and right (C) panels compare the mean predicted value in the joint and two-stage models. The bottom row middle (D) and right (E) panels compare the standard deviation, i.e. uncertainty, of the predicted value. Joint model uses Empetrum nigrum as the SDM species.
https://doi.org/10.1371/journal.pone.0304942.s008
(TIF)
S5 Fig.
Left panel (A) is the observed C/N-ratio mean value in each grid cell. The top row middle (B) and right (C) panels compare the mean predicted value in the joint and two-stage models. The bottom row middle (D) and right (E) panels compare the standard deviation, i.e. uncertainty, of the predicted value. Joint model uses Empetrum nigrum as the SDM species.
https://doi.org/10.1371/journal.pone.0304942.s009
(TIF)
S6 Fig.
Left panel (A) is the observed total nitrogen mean value in each grid cell. The top row middle (B) and right (C) panels compare the mean predicted value in the joint and two-stage models. The bottom row middle (D) and right (E) panels compare the standard deviation, i.e. uncertainty, of the predicted value. Joint model uses Empetrum nigrum as the SDM species.
https://doi.org/10.1371/journal.pone.0304942.s010
(TIF)
S7 Fig.
Left panel (A) is the observed total phosphorus mean value in each grid cell. The top row middle (B) and right (C) panels compare the mean predicted value in the joint and two-stage models. The bottom row middle (D) and right (E) panels compare the standard deviation, i.e. uncertainty, of the predicted value. Joint model uses Empetrum nigrum as the SDM species.
https://doi.org/10.1371/journal.pone.0304942.s011
(TIF)
S8 Fig.
Left panel (A) is the observed total potassium mean value in each grid cell. The top row middle (B) and right (C) panels compare the mean predicted value in the joint and two-stage models. The bottom row middle (D) and right (E) panels compare the standard deviation, i.e. uncertainty, of the predicted value. Joint model uses Empetrum nigrum as the SDM species.
https://doi.org/10.1371/journal.pone.0304942.s012
(TIF)
S9 Fig. Example species Empetrum nigrum (Crowberry) scatter plots that compare, at every grid cell, the predicted log-odds for the SDM and the predicted values for abiotic variables.
The top 6 panels illustrate the mean predicted value, while the bottom 6 panels present the standard deviation of the predicted value, i.e. uncertainty of the prediction.
https://doi.org/10.1371/journal.pone.0304942.s013
(TIF)
S10 Fig. Every panel plots for the given species the estimated associations between species occurrence and uncertain abiotic variables in the two-stage and joint model.
https://doi.org/10.1371/journal.pone.0304942.s014
(TIF)
S11 Fig. Every panel plots for the given species the predicted prevalence in the two-stage and joint model using leave-province-out validation.
These predictions are compared with the true prevalence (green) in each province.
https://doi.org/10.1371/journal.pone.0304942.s015
(TIF)
S12 Fig. Every panel plots for the given species the predicted prevalence in the two-stage and joint model using leave-region-out validation.
These predictions are compared with the true prevalence (green) in each FGR region.
https://doi.org/10.1371/journal.pone.0304942.s016
(TIF)
Acknowledgments
We thank Laurens Zwakhals for the overall coordination of the SPR Data project of which our research was a separate ‘proof of concept’ track. Job Spijker is thanked for helping with setting up the data management structure for this project. Also, we thank Jaap Slootweg, Ton de Nijs, Albert Wong, Arco van Strien en Arjen van Hinsberg for contributing to the kickoff meeting of the project. Statistics Netherlands (CBS) and the provinces of the Netherlands (“de gezamenlijke provincies”) generously provided the LMF-M&N data.
References
- 1. Ceballos G, Ehrlich PR, Barnosky AD, García A, Pringle RM, Palmer TM. Accelerated modern human–induced species losses: Entering the sixth mass extinction. Science advances. 2015;1(5):e1400253. pmid:26601195
- 2. Cardinale BJ, Duffy JE, Gozales A, Hooper DU, Perrings C, Venail P, et al. Biodiversity loss and its impact on humanity. Nature. 2014;486:59–67.
- 3. McCune J. Species distribution models predict rare species occurrences despite significant effects of landscape context. Journal of applied ecology. 2016;53(6):1871–1879.
- 4. Barbet-Massin M, Rome Q, Villemant C, Courchamp F. Can species distribution models really predict the expansion of invasive species? PloS one. 2018;13(3):e0193085. pmid:29509789
- 5. Segurado P, Araujo MB. An evaluation of methods for modelling species distributions. Journal of biogeography. 2004;31(10):1555–1568.
- 6. Li X, Wang Y. Applying various algorithms for species distribution modelling. Integrative zoology. 2013;8(2):124–135. pmid:23731809
- 7. Latimer AM, Wu S, Gelfand AE, Silander JA Jr. Building statistical models to analyze species distributions. Ecological applications. 2006;16(1):33–50. pmid:16705959
- 8. Miller JA. Species distribution models: Spatial autocorrelation and non-stationarity. Progress in Physical Geography. 2012;36(5):681–692.
- 9. Meyer H, Reudenbach C, Hengl T, Katurji M, Nauss T. Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation. Environmental Modelling & Software. 2018;101:1–9.
- 10. Meyer H, Reudenbach C, Wöllauer S, Nauss T. Importance of spatial predictor variable selection in machine learning applications–Moving from data reproduction to spatial prediction. Ecological Modelling. 2019;411:108815.
- 11. Paradinas I, Illian J, Smout S. Understanding spatial effects in species distribution models. Plos one. 2023;18(5):e0285463. pmid:37253039
- 12. Guisan A, Thuiller W. Predicting species distribution: offering more than simple habitat models. Ecology letters. 2005;8(9):993–1009. pmid:34517687
- 13. Elith J, Leathwick JR. Species distribution models: ecological explanation and prediction across space and time. Annual review of ecology, evolution, and systematics. 2009;40:677–697.
- 14. Braun CD, Arostegui MC, Farchadi N, Alexander M, Afonso P, Allyn A, et al. Building use-inspired species distribution models: Using multiple data types to examine and improve model performance. Ecological Applications. 2023;n/a(n/a):e2893. pmid:37285072
- 15. Kozak KH, Graham CH, Wiens JJ. Integrating GIS-based environmental data into evolutionary biology. Trends in ecology & evolution. 2008;23(3):141–148. pmid:18291557
- 16. He KS, Bradley BA, Cord AF, Rocchini D, Tuanmu MN, Schmidtlein S, et al. Will remote sensing shape the next generation of species distribution models? Remote Sensing in Ecology and Conservation. 2015;1(1):4–18.
- 17. Fournier A, Barbet-Massin M, Rome Q, Courchamp F. Predicting species distribution combining multi-scale drivers. Global Ecology and Conservation. 2017;12:215–226.
- 18. Goodchild MF. Integrating GIS and remote sensing for vegetation analysis and modeling: methodological issues. Journal of Vegetation Science. 1994;5(5):615–626.
- 19. Van Niel KP, Laffan SW, Lees BG. Effect of error in the DEM on environmental variables for predictive vegetation modelling. Journal of Vegetation Science. 2004;15(6):747–756.
- 20. Barry S, Elith J. Error and uncertainty in habitat models. Journal of Applied Ecology. 2006;43(3):413–423.
- 21. Daly C. Guidelines for assessing the suitability of spatial climate data sets. International Journal of Climatology: A Journal of the Royal Meteorological Society. 2006;26(6):707–721.
- 22. Rivington M, Matthews KB, Bellocchi G, Buchan K. Evaluating uncertainty introduced to process-based simulation model estimates by alternative sources of meteorological data. Agricultural Systems. 2006;88(2-3):451–471.
- 23. Marshall L, Beckers V, Vray S, Rasmont P, Vereecken NJ, Dendoncker N. High thematic resolution land use change models refine biodiversity scenarios: A case study with Belgian bumblebees. Journal of Biogeography. 2021;48(2):345–358.
- 24. Velásquez-Tibatá J, Graham CH, Munch SB. Using measurement error models to account for georeferencing error in species distribution models. Ecography. 2016;39(3):305–316.
- 25. Fernández M, Hamilton H, Kueppers LM. Characterizing uncertainty in species distribution models derived from interpolated weather station data. Ecosphere. 2013;4(5):1–17.
- 26. Martínez-Minaya J, Cameletti M, Conesa D, Pennino MG. Species distribution modeling: a statistical review with focus in spatio-temporal issues. Stochastic environmental research and risk assessment. 2018;32:3227–3244.
- 27. Foster SD, Shimadzu H, Darnell R. Uncertainty in spatially predicted covariates: is it ignorable? Journal of the Royal Statistical Society: Series C (Applied Statistics). 2012;61(4):637–652.
- 28. Van Niel KP, Austin MP. Predictive vegetation modeling for conservation: Impact of error propagation from digital elevation data. Ecological Applications. 2007;17(1):266–280. pmid:17479850
- 29. McInerny GJ, Purves DW. Fine-scale environmental variation in species distribution modelling: regression dilution, latent variables and neighbourly advice. Methods in Ecology and Evolution. 2011;2(3):248–257.
- 30. Denham RJ, Falk MG, Mengersen KL. The Bayesian conditional independence model for measurement error: applications in ecology. Environmental and Ecological Statistics. 2011;18:239–255.
- 31. Stoklosa J, Daly C, Foster SD, Ashcroft MB, Warton DI. A climate of uncertainty: accounting for error in climate variables for species distribution models. Methods in Ecology and Evolution. 2015;6(4):412–423.
- 32. Barber X, Conesa D, Lladosa S, López-Quílez A. Modelling the presence of disease under spatial misalignment using Bayesian latent Gaussian models. Geospatial health. 2016;11(1). pmid:27087038
- 33. Beale CM, Lennon JJ. Incorporating uncertainty in predictive species distribution modelling. Philosophical Transactions of the Royal Society B: Biological Sciences. 2012;367(1586):247–258. pmid:22144387
- 34.
Guisan A, Thuiller W, Zimmermann NE. Habitat suitability and distribution models: with applications in R. Cambridge University Press; 2017.
- 35.
Hall DB. Measurement error in nonlinear models: a modern perspective; 2008.
- 36.
Banerjee S, Carlin BP, Gelfand AE. Hierarchical modeling and analysis for spatial data. CRC press; 2014.
- 37. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the royal statistical society: Series b (statistical methodology). 2009;71(2):319–392.
- 38. Paradinas I, Conesa D, Pennino MG, Muñoz F, Fernández AM, López-Quílez A, et al. Bayesian spatio-temporal approach to identifying fish nurseries by validating persistence areas. Marine Ecology Progress Series. 2015;528:245–255.
- 39. Rufener MC, Kinas PG, Nóbrega MF, Oliveira JEL. Bayesian spatial predictive models for data-poor fisheries. Ecological Modelling. 2017;348:125–134.
- 40. Sadykova D, Scott BE, De Dominicis M, Wakelin SL, Sadykov A, Wolf J. Bayesian joint models with INLA exploring marine mobile predator–prey and competitor species habitat overlap. Ecology and Evolution. 2017;7(14):5212–5226. pmid:29242741
- 41. Juan P, Díaz-Avalos C, Mejía-Domínguez NR, Mateu J. Hierarchical spatial modeling of the presence of Chagas disease insect vectors in Argentina. A comparative approach. Stochastic Environmental Research and Risk Assessment. 2017;31:461–479.
- 42. Martínez-Minaya J, Conesa D, López-Quílez A, Vicent A. Spatial and climatic factors associated with the geographical distribution of citrus black spot disease in South Africa. A Bayesian latent Gaussian model approach. European Journal of Plant Pathology. 2018;151:991–1007.
- 43.
Burgman M. Risks and decisions for conservation and environmental management. Cambridge University Press; 2005.
- 44. Rocchini D, Hortal J, Lengyel S, Lobo JM, Jimenez-Valverde A, Ricotta C, et al. Accounting for uncertainty when mapping species distributions: the need for maps of ignorance. Progress in Physical Geography. 2011;35(2):211–226.
- 45. Lecours V. On the use of maps and models in conservation and resource management (warning: results may vary). Frontiers in Marine Science. 2017;4:288.
- 46.
van der Peijl MJ, Gremmen NJM, van Tongeren OFR, De Heer M. Ontwerp Landelijk Meetnet Flora-Milieu & Natuurkwaliteit (LMF-M&N). RIVM rapport. 2000;718101001:76.
- 47.
Wamelink GWW, Goedhart PW, Frissel JY, Wegman RMA, Slim PA, Van Dobben HF. Response curves for plant species and vegetation types. Alterra; 2007.
- 48. Wamelink GWW, Goedhart PW, Frissel JY. Why some plant species are rare. PLoS One. 2014;9(7):e102674.
- 49.
Braun-Blanquet J. 3 Aufl. Pflanzensoziologie. 1964;.
- 50.
Bal D, Looise BJ. Arc/Info-file Physical-Geographical Regions of the Nederlands (Fysisch-Geografische Regio’s van Nederland); 2001.
- 51. Planbureau C. Omgevingsscenario’s Lange Termijn Verkenning 1995–2020. Werkdocument. 1996;89(1):1–127.
- 52.
Statistics Netherlands. File Landusage (Bestand Bodemgebruik); 2017.
- 53.
Wageningen University. Soiltype map (Grondsoortenkaart); 2006.
- 54. Wamelink GWW, Walvoort DJJ, Sanders ME, Meeuwsen HAM, Wegman RMA, Pouwels R, et al. Prediction of soil pH patterns in nature areas on a national scale. Applied Vegetation Science. 2019;22(2):189–199.
- 55. Diggle PJ, Tawn JA, Moyeed RA. Model-Based Geostatistics. Journal of the Royal Statistical Society Series C (Applied Statistics). 1998;47(3):299–350.
- 56. Jona Lasinio G, Mastrantonio G, Pollice A. Discussing the “big n problem”. Statistical Methods & Applications. 2013;22:97–112.
- 57. Lindgren F, Rue H, Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(4):423–498.
- 58. Rufibach K. Use of Brier score to assess binary predictions. Journal of clinical epidemiology. 2010;63(8):938–939. pmid:20189763
- 59.
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models: a modern perspective. Chapman and Hall/CRC; 2006.
- 60. Muff S, Riebler A, Held L, Rue H, Saner P. Bayesian analysis of measurement error models using integrated nested Laplace approximations. Journal of the Royal Statistical Society Series C: Applied Statistics. 2015;64(2):231–252.
- 61. Skarstein E, Martino S, Muff S. A joint Bayesian framework for missing data and measurement error using integrated nested Laplace approximations. Biometrical Journal. 2023;65(8):2300078. pmid:37740134
- 62. Lopiano KK, Young LJ, Gotway CA. Estimated generalized least squares in spatially misaligned regression models with Berkson error. Biostatistics. 2013;14(4):737–751. pmid:23568241
- 63. Monnahan CC, Thorson JT, Branch TA. Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods in Ecology and Evolution. 2017;8(3):339–348.