The interplay of various sources of noise on reliability of species distribution models hinges on ecological specialisation

Digitized species occurrence data provide an unprecedented source of information for ecologists and conservationists. Species distribution model (SDM) has become a popular method to utilise these data for understanding the spatial and temporal distribution of species, and for modelling biodiversity patterns. Our objective is to study the impact of noise in species occurrence data (namely sample size and positional accuracy) on the performance and reliability of SDM, considering the multiplicative impact of SDM algorithms, species specialisation, and grid resolution. We created a set of four ‘virtual’ species characterized by different specialisation levels. For each of these species, we built the suitable habitat models using five algorithms at two grid resolutions, with varying sample sizes and different levels of positional accuracy. We assessed the performance and reliability of the SDM according to classic model evaluation metrics (Area Under the Curve and True Skill Statistic) and model agreement metrics (Overall Concordance Correlation Coefficient and geographic niche overlap) respectively. Our study revealed that species specialisation had by far the most dominant impact on the SDM. In contrast to previous studies, we found that for widespread species, low sample size and low positional accuracy were acceptable, and useful distribution ranges could be predicted with as few as 10 species occurrences. Range predictions for narrow-ranged species, however, were sensitive to sample size and positional accuracy, such that useful distribution ranges required at least 20 species occurrences. Against expectations, the MAXENT algorithm poorly predicted the distribution of specialist species at low sample size.


Introduction
Understanding spatio-temporal distribution patterns of species is fundamental for ecology, conservation, biogeography, and many environmental studies. Species distribution model (SDM) allows for predictions of species distributions by quantifying relationships between species occurrence and associated environmental conditions [1][2][3]. SDM, which conceptually relies on ecological niche theory, is referred to by a number of alternative names, including: with different resolutions of environmental information. This allowed us to disentangle the effects of the various parameters without bias by assess performance against a known distribution range, which is near-impossible for real species. For each of our four species, we built suitable habitat models using five SDM algorithms and two grid resolutions each based on a varying number of species occurrences and with different levels of positional accuracy. We then investigated the subsequent variation in SDM performance and reliability. We assessed the outcome of the SDM based on classic model evaluation metrics (threshold-dependent and -independent), which are typically used in empirical studies because the actual distribution of the species in focus are unknown. However, unlike empirical species distribution studies, we "know" the real distributions, and can therefore directly measure the agreement between the SDM's predictions and the "true" ranges; our actual goal for the modelling exercises.

Materials and methods
Our analysis followed three steps: 1) creating ranges for four virtual species, 2) modelling species distribution ranges following the usual routine for each of the four species by sampling species occurrences from these ranges with different levels of positional accuracy, associating them with environmental information, and fitting species distribution models, and finally, 3) assessing SDM performance and reliability (Fig 1).

Generating virtual species ranges
We based our analyses on species with known distribution ranges for unbiased model quality evaluation. To this end, we created ranges for the virtual species (henceforth referred to as the '"true" range') using four environmental variables: 1) annual mean temperature, 2) altitude, 3) precipitation seasonality, and 4) annual mean evapo-transpiration. These environmental variables are widely considered to have a direct influence on the eco-physiology and niche of many species [16,[48][49][50][51]. We downloaded these environmental variables at two resolutions, 2.5 and 10 arc-min, (the source of these variables in Table A in S1 Appendix). Variables only available only in higher resolution were downscaled to 2.5 and 10 arc-min resolutions using bilinear interpolation. We conducted a Principal Component Analysis (PCA) on these four environmental variables, and in order to avoid unrealistic distribution ranges due to species response to each of the four variables, we chose the first two PCs summarizing the environmental variability across the study area [52,53]. This is an objective and realistic approach and ensures that the ranges of the virtual species were delineated based on realistic environmental variables [27,52,53]. We defined the environmental range inhabited by the species based on the mean ± standard deviation (S.D.) for each of the first two axes of the PCA using a Gaussian distribution function. [16,27,[54][55][56]. We then set the mean value (optimum of the environmental range) of the first two axes of the PCA at (0,0) for all species, and determined the degree of specialisation by adjusting the S.D. values of the first two axes of the PCA according to the species specialisation. The generalist species were characterized by low specialisation covering 80% (S.D. 0.8) of the environmental range. The restricted generalist, relaxed specialist and specialist species had ranges covering 60%, 40% and 20% of the environmental range respectively (Fig A in S1 Appendix).
The overall environmental suitability of each virtual species was computed by using the multiplicative approach, multiplying the output of the Gaussian distribution function of each of the first two PCs. We considered this approach to be more realistic, since it accurately represents the interaction between the environmental variables. For example, if one environmental variable was very unfavourable at a given location, the species' probability of occurrence will be low overall, despite the other variables being close to the species' optimum [27,51,54,57,58]. Flow diagram explaining the study design used to answer the study questions. The first step uses the first two axes of PCA and a Gaussian distribution function to create the four virtual species by adjusting the standard deviation (S.D.) value according to the species specialisation level. The second step shows the modelling process for these four species using five modelling algorithms with different sample sizes and different levels of positional accuracy (one precise and three increasingly Finally, the defined environmental ranges were projected onto the world at the African continental scale at two different grid resolutions: 10 arc-min, representing the low resolution (spatial unit size % 400 km2), and 2.5 arc-min, representing the high resolution (spatial unit size % 25 km2) (Fig A in S1 Appendix). We used the "virtualspecies" R package to generate these four virtual species [52].

Sampling species occurrences and introducing different positional accuracy levels
The threshold approach is widely used in simulation studies, where a fixed threshold is being selected to convert the probability of occurrence into a binary presence-absence map [27,52,[59][60][61]. This threshold is arbitrarily selected since there are no objectively justifiable threshold values based on the data and/or on the validation values (e.g. sensitivity and specificity) [27]. However, using the threshold approach is problematic and has been criticized for numerous reasons: 1) it maximises the omission and commission errors rate in species occurrences, 2) it alters the predefined species-environment relationship and, 3) it is inappropriate for many of the regression models that rely on logistic functions, which in turn might provide misleading results [52,59]. To mitigate these issues, we used the threshold and probability based approaches together. First, we selected arbitrarily a threshold of 0.2 to convert the probability map into a binary, presence-absence map. Next, from the corresponding probability map, we used the values of the probabilities of occurrence in each pixel as the success rate for one sample of the binomial distribution (i.e. a pixel with a probability of 0.8 has an 80% chance of being occupied by species) [4,27,62]. In practice, for each pixel in the presence area in the binary map we generated a random value (r) on the interval [0,1], where a pixel was considered "present" if its r value was greater than its probability of occurrence. Similarly, the absence data was obtained by drawing pixels from the absence area in the binary map, where a pixel was considered "absent" if its r value was less than its probability of occurrence. This resulted in pixels with higher suitability to be more likely to be identified as "present" and pixels with lower suitability more likely identified as "absent" [4,18,27,51,59,62,63]. Thus, we tried to minimize the omission and commission error in species occurrences.
Next, we shifted the sampled species occurrences in a random direction to introduce four levels of positional accuracy [12,17,43,64]. To this end, we created four buffer areas around each occurrence and randomly sampled occurrence outside the buffer area to represent four levels of positional accuracy: 1) precise, where the buffer area size was zero to represent no change in the positional accuracy, 2) low imprecision, where the buffer area size was equal to one pixel, corresponding to % 6 km at the high resolution and % 20 km at the low resolution, 3) intermediate imprecision, where the buffer size was equal to two pixels, 4) high imprecision, where the buffer size was equal to three pixels.

Modelling framework
Environmental variables. We followed the standard SDM's routines for selecting the predictors, where typically the causal relationship between occurrence and environmental conditions is unknown. We do acknowledge that a poor choice of predictors is another common source of uncertainty in SDM's studies. However, our objective here is to mimic the empirical SDM's routines and assess how far predictions are from the reality under realistic conditions irrespective of the appropriateness of choosing the right set of predictors (which most empirical studies also are unaware of to begin with). We therefore selected 19 climatic predictors, two topographic predictors, five vegetation predictors, and one aridity variable to model distribution ranges at our two grid resolutions. We reduced the number of predictors by calculating the Variance Inflation Factor (VIF); a measure for collinearity. We removed collinearity by eliminating predictors with VIF scores greater than 10 [65], using the "vifstep" function in the "usdm" R package [66]. Finally, 15 predictors remained to build the SDM (Table A in S1 Appendix), which was sufficient to avoid model over-fitting and develop an accurate SDM [67]. We rescaled all predictors to the two different grid resolutions we used in our study (2.5 and 10 arc-min) using bilinear interpolation [68].
SDM algorithms. We modelled the ranges of each species using five commonly used algorithms that are either regression-based or machine learning-based approaches. We used two algorithms from the regression-based approaches: the Generalized Linear Model (GLM) [69,70], a widely used linear regression method, and the Generalized Additive Model (GAM) [70,71], a closely related method allowing for non-linear relationships. We used three implementations of machine learning-based approaches: Generalized Boosted Model (GBM) [72], Random Forest (RF) [73] and Maximum Entropy Modelling (MAXENT) [68,74], which characterize the environmental space directly from calibration data [67]. We fitted the models using the "Biomod2" R package [75].
Modelling procedure. To determine the acceptable minimum number of species occurrences, we calibrated the SDM for each species at the two grid resolutions using the five algorithms with different sample sizes (5, 10, 20, 50, 100 and 200 occurrences) with five-fold crossvalidation and five replicates, where each replicate used a different background set, i.e., each model ran 25 times. We acknowledge that using species-specific model parameter tuning is recommended [76], however, to avoid an overwhelming complexity of the study outcome and also for the benefit of a better comparison between the algorithms, we decided to keep the default settings of the respective SDM algorithms (Table B in S1 Appendix). To determine the acceptable level of the positional accuracy of species occurrences, we repeated the procedure as described using imprecise occurrences (low, intermediate, and high) to compare with precise occurrences (Fig 1).

Model evaluations
Model evaluation is a crucial step in model selection and assessing the accuracy of the prediction [77]. In general, model accuracy is measured mainly through evaluation and agreement metrics [78,79]. Evaluation metrics are widely used to measure model performance through assessing the ability of a model to distinguish between presence and absence locations correctly [78,79]. Agreement metrics, however, measure prediction reliability by assessing the spatial agreement between the "true" and predicted ranges taking into account the probability values of pixels. In other words, reliability can be used to inform how far the predicted ranges are from the truth or "reality" [78,79]. Using different evaluation metrics is strongly preferred when true absence data are unavailable, and also when the goal is to model potential distribution ranges rather than realized ranges [80]. Therefore, we calculated the area under the curve (AUC) of the receiver operating characteristic (ROC), as well as the True Skill Statistic (TSS) to evaluate the predictive performance of the models. The AUC value (a threshold-independent evaluation metric) ranges from 0 to 1, with values below 0.5 indicating performance no better than random, whereas a value of 1 indicates perfect performance [77]. TSS value (a threshold-dependent evaluation metric) varies from -1 to 1, where a value of 1 indicates perfect model performance, and a value lower than or equal to zero indicates a model performance no better than random [81]. In this study, we considered the models with either median AUC value ! 0.7 or median TSS value ! 0.4 as good models with usefully predictive distribution ranges (successfully able to discriminate the suitable from unsuitable areas) [43, [82][83][84][85]. We used the "Biomod2" R package [75] to calculate the evaluation metrics (AUC and TSS) for each SDM internally as usually done in empirical studies (henceforth referred to as 'standard AUC' and 'standard TSS'). Additionally, we evaluated the SDM by calculating AUC and TSS using independent data (presence and "true" absences) sampled from the true ranges (henceforth referred to as "independent AUC" and "independent TSS"). We calculated these independent metrics using the "accuracy" function in the "SDMTools" R package [86]. We compared the differences between the independent evaluation and 25 model evaluation metrics using one-sample Wilcoxon test using the "stats" R package [87]. To test whether the grid resolutions of the environmental predictors influenced model performance, we assessed the differences between the standard evaluation metric (standard AUC and standard TSS) values at the high and low grid resolutions for all models using two-sample non-parametric Wilcoxon test.
We assessed the interaction of spatial resolution, SDM algorithm, positional accuracy, sample size, and species specialisation on SDM's performance using generalised linear models. We fitted two models, first fitting the exponentially transformed AUC as a function of spatial resolution, the SDM algorithm, positional accuracy, sample size, and species specialisation. In a second model, we additionally included the two-way interaction of these factors. We used the Akaike Information Criterion (AIC) to select the most parsimonious model favouring a low AIC value [88].
Measuring spatial agreement. We measured relative agreement between "true" and modelled ranges by calculating their geographical niche overlap. We calculated Schoener's D index [89] using the "nicheOverlap" function in the "dismo" R package [90]. The niche overlap value varies between 0 and 1, where the value of 0 indicates no overlap and value of 1 indicates complete overlap [91,92]. Additionally, we measured the absolute agreement between the "true" and modelled ranges through a pixel wise comparison using the Overall Concordance Correlation Coefficient (OCCC), a measure of agreement between two continuous datasets which were generated using two different approaches [93]. We computed the OCCC using the "epiR" R package [94]. The OCCC value varies between 0 and 1, with 0 representing 100% disagreement and 1 represents 100% agreement between the true and predicted ranges (See S2 Appendix for details).

Minimum sample size of species occurrences required for SDM prediction
Our results revealed inconsistencies between the evaluation and agreement metrics regarding the minimum sample size of species occurrences required for SDM. The evaluation metrics showed that MAXENT was the only algorithm that successfully modelled the distribution range with five species occurrences regardless of the species specialisation. In contrast, GAM failed to successfully model the distribution ranges with fewer than 50 species occurrences. Though GLM, GBM, and RF required minimum 20 species occurrences to successfully model the distribution ranges for generalist and restricted generalist species, only five species occurrences were required for successful modelling of relaxed specialist and specialist species (Fig 2  and Figs B-D in S1 Appendix).
In contrast, the agreement metrics showed that both MAXENT and GLM required a minimum of 10 species occurrences to model the ranges of generalist and restricted generalist species to % 50% agreement with the "true" ranges. However, for specialist and relaxed specialist species, MAXENT required 50 occurrences to achieve % 40% agreement with the "true" ranges, and GLM failed to achieve a similar agreement, even with 200 occurrences. Both GBM and RF required a minimum of 20 occurrences to achieve % 45% agreement with the "true" ranges, whereas, for generalist and restricted generalist species RF could not achieve 40% agreement even with 100 occurrences (Fig 3 and Figs E-L, and P in S1 Appendix).
These findings are indicative of the presence of the interactive effect of species specialisation and the SDM algorithms on the number of species occurrence required for SDM, where, the number of species occurrences required for a good SDM varied according to the species specialisation and the type of algorithm used (Figs 2 and 3 and Figs B-D, O and P in S1 Appendix). Our results suggested statistically significant differences between the standard and independent evaluation metrics according to the Wilcoxon test, however, the magnitude of these differences were relatively small, implying that these two metrics are practically similar (Tables C-F and Figs M and N in S1 Appendix).
Our results also revealed that grid resolution had no considerable effect on SDM compared to species specialisation and model algorithm (Fig 3 and Figs O and P in S1 Appendix). Although there was a statistical difference between the high and low grid resolutions, this effect size was relatively small (Tables G and H in S1 Appendix). Moreover, the difference was not consistent: in some cases, models at high grid resolution performed better than those based on low grid resolution, whilst in other cases models using low grid resolution performed better.

Impact of positional accuracy of species occurrences on performance of SDM
The models based on precise species occurrences tended to perform slightly better than those based on imprecise occurrences (low, intermediate and high). However, in some instances, when low sample sizes were used the models based on the imprecise species occurrences outperformed those based on precise occurrences. Models fitted with imprecise species occurrences had a clear tendency to reduce SDM performance in relaxed specialist and specialist species, which disappeared with sample sizes above 20 occurrences (Fig 4 and Fig O in S1  Appendix).
The result of the linear models indicates a significant influence of the interaction between spatial resolution, SDM algorithm, positional accuracy, sample size, and species specialisation on the SDM's performance (delta AIC >5000). Species specialization and sample size were the most influential variables (in terms of the effect size), whereas spatial resolution and positional accuracy were the least influential variables ( Table 1). The full set of the explanatory variables modelled is presented in supplementary file (Table I in S1 Appendix).

Impact of positional accuracy of species occurrences on reliability of SDM
Both niche overlap and OCCC indicated a strong spatial agreement between ranges modelled with precise and imprecise species occurrence data for generalist and restricted generalist species. This agreement weakened with decreasing positional accuracy, and increasing specialisation. Moreover, this agreement also weakened with increasing numbers of imprecise species occurrences, and differences were more pronounced at low grid resolution (Fig 3 and Fig P in  S1 Appendix). These findings were consistent across the five algorithms. In general, our results suggest an interaction between sample size and positional accuracy, SDM algorithms, species specialisation, and grid resolutions on the reliability of SDM.

Discussion
Our comprehensive analysis uncovered how the sample size and positional accuracy of species occurrences, model algorithms, grid resolution, and species specialisation affected SDM performance and reliability. We showed that species specialisation had by far the most dominant impact, where the algorithm performance and the effect of sample size and positional accuracy of species occurrences depended most on species specialisation (Fig 5). These conclusions are based on ecological reliability and spatial agreement, rather than statistical performance in modelling the data itself. The impact of grid resolution on the SDM's reliability only became important with imprecise species occurrences when modelling highly specialized species (See S3 Appendix for more details on the impact of grid resolution on SDM). Our results also revealed that metrics of model performance can be misleading in representing the actual performance, if matching the true distribution was the goal.
We corroborate previous studies that found model performance and reliability improves with increasing sample size [15,16,35,38,39,78,95]. Nonetheless, useful distribution ranges for widespread and narrow-ranged species could be achieved with as few as 10 or 20 species The variation between the performance of SDMs fitted with precise and imprecise species occurrences with different sample size (x axis) using five different SDM algorithms (column-wise) for four species with difference specialisation levels (row-wise). Line colour represents the precision levels of the species occurrences. Solid lines represent low grid resolution, and dashed lines represent high resolution. Dotted line is the threshold value below which poor model performance is indicated. https://doi.org/10.1371/journal.pone.0187906.g004

Table 1. Result of the linear model analysis investigating determinants of area under the receiver operating characteristic curve (AUC) values.
Exponentially transformed AUC values were modelled as a function of spatial resolution, SDM algorithm, positional accuracy, sample size, and species specialisation. Akaike Information Criterion (AIC) showed that the full model with interaction was the less parsimonious model with AIC = -66657.46. Reliability in species distribution modelling occurrences respectively. In contrast to previous studies [15,16,35,38,39,41,[96][97][98][99][100][101][102][103], our result indicates that the ranges of the less specialized species are in fact easier to predict than those of highly specialized species. These previous studies based their conclusions on SDM's performance measured by the values of sensitivity and specificity, where the species ranges with low performance values were considered more difficult to predict. Together with other studies [80,104,105], we showed that this might not be always true, since SDM performance was strongly influenced by species specialisation and the size of the study area. Recently, a study evaluated SDM performance in response to the size of the buffer area (0-60%; 0% buffer, all background data were drawn from presence domain) surrounding the species range, and found that performance in ranges with buffer areas of 5% was no better than random, while increasing the size of the buffer area around the same range increased the performance value [48]. Accordingly, the evaluation metrics such as AUC and TSS could be more informing about how broadly the modelled species is distributed across the study area rather than inform about SDM performance [104,105]. Our results, in line with other studies [80,104,106,107], emphasize that standard evaluation metrics should not be used to compare performance between different species, nor within the same species when using different SDM settings The minimum sample size required for a useful SDM varied according to species specialisation and the SDM algorithm. In generalist and restricted generalist species, both MAXENT and GLM predicted useful distribution ranges with as few as 10 species occurrences. In relaxed specialist and specialist species, the optimal minimum number was 20 using GBM and RF. While many studies have explored the impact of the number of species occurrences on SDM, until now this issue has remained unresolved. For example, using GARP, Stockwell and Peterson  Reliability in species distribution modelling ignored species specialisation. Only two studies considered species specialisation in their analyses, and both concluded that for narrow-ranged species, five species occurrences sufficed [15,16].

Degree of freedom
Several studies have shown that MAXENT stands out as the single best SDM algorithm [15,16,38,108], however, our results were not unanimous and revealed variation according to species specialisation and the number of occurrences. For example, at a high number of species occurrences (! 50) this algorithm outperformed the others across all species specialisation levels. At low number of occurrences (< 50), however, MAXENT underperformed in relaxed specialist and specialist species, and outperformed the other algorithms in generalist and restricted generalist species. We relate this variability to the differences in SDM algorithms, where, in contrast to the other algorithms, MAXENT and GLM had a tendency to over-predict (over-estimate the range occupied by a species) when fitted with a low number of occurrences (Figs E-L in S1 Appendix). As a result, the predicted ranges using MAXENT and GLM based on low numbers of occurrences resulted in a distribution range with a widespread probability surface (as a result of over-prediction). This widespread probability surface causes a good agreement with the "true" ranges for the wide spread species and poor agreement for the narrow-ranged species.
Our results revealed that the impact of the positional accuracy of species occurrences on SDM's performance was relatively small across all species, where, in many cases, the models based on precise occurrences were only slightly better than those based on imprecise occurrences. These findings are in line with previous studies concluding that SDM is generally less sensitive to the levels of positional accuracy of species occurrences [12,43,44]. However, consulting the spatial agreement metrics revealed that the previous conclusion might not always be true, and that the sensitivity of SDM to the positional accuracy of species occurrences also depends on species specialisation and sample size. For generalist and restricted generalist species, the impact of species positional accuracy on SDM's reliability was relatively small across all algorithms. However, the relaxed specialist and specialist species were in fact sensitive to the positional accuracy of the species occurrences. This sensitivity increased with increasing numbers of imprecise species occurrences and at low grid resolution. The sensitivity of specialised species to the level of positional accuracy could be due to the increased likelihood of assigning the imprecise species occurrences to unsuitable areas, whereas, in generalist species this likelihood is inherently lower. Accordingly, our results highlight the importance of investing time and effort into improving the positional accuracy of species occurrences for species with narrow distribution ranges, when modelling putative ecological specialists. However, for widespread species we believe that positional accuracy of species occurrence will have minimal effect on the reliability of SDM. This increases the relevance of data available in museums and online portals, especially for widespread species.
The sample size and positional accuracy of species occurrence data that can be used in SDM inherently varies according to the objective of the study. For example, if the goal is to define the environmental conditions that limit the distribution of a focal species, using high sample size and high positional accuracy may be necessary to minimize the commission error [109]. By contrast, if practitioners are interested in discovering a new population of a poorly known species, using high sample size and positional accuracy species occurrence data may not be crucially important. Therefore, defining the objective and the goal of the SDM is critical for achieving reliable conclusions in conjunction with a minimum amount of prior information about the species in question [109].
We have shown that species specialisation is the key factor with a dominant influence on SDM, which is usually unknown and/or not considered in species distribution models a-priori, while the spatial grid resolution has no considerable impact on SDM. We can conclude that narrow-ranged species are likely to be more sensitive than widespread species to changes in the level of positional accuracy of species occurrence and sample size. More important, we have also found that a high SDM performance does not always also imply a high reliability. In addition to our study and previous work, further effort needs to be directed towards investigating the impact of sample size and positional accuracy of species occurrences on: 1) SDM variables contribution, and 2) SDM transferability (spatial and temporal). Finally, it would be beneficial to explore the impact of using mixed levels of positional accuracy on SDM's reliability. The work-flow (Fig 5) we provide should help other researchers to select the most appropriate approach according to the characteristic of the available data in the quest to make the best use of the data available in species distribution modelling studies.