Automatic variable selection in ecological niche modeling: A case study using Cassin’s Sparrow (Peucaea cassinii)

doi:10.1371/journal.pone.0257502

Fig 1.

MERRA/Max architecture.

Conceptual diagram showing the major hardware and software components of the MERRA/Max prototype. The study’s testbed consisted of 10 virtual machines (VMs) within NASA’s ADAPT science cloud, each contributing 10 processing cores, for 100 cores in total. Numbered arrows indicate the system’s processing workflow.

Table 1.

Bioclim variables.

Table 2.

MERRA-2 variables.

Fig 2.

MERRA/Max run-time performance and scaling properties.

Figure shows how the time MERRA/Max takes to complete a screening run (T; left Y axis and the colored lines labeled A, B, and C) depends on three quantities: the number of variables in the collection being scanned (N), the average number of random samples taken of each variable during the screening process (S), and the number of processor cores available in the compute environment (C; colored vertical bars and right Y axis). MERRA/Max’s parallel implementation scales linearly with respect to S. For any given collection of size N and sample size S, the estimated minimum possible run time (Tmin) (shown in parentheses) is reached when enough cores are available to screen the collection completely in parallel.
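
The relationship among T, N, S, and C described in this caption can be illustrated with a toy scheduling model. This is only a sketch under stated assumptions, not the authors' implementation or their measured timings: it assumes the screening consists of independent, equal-duration MaxEnt runs executed in "waves" across the available cores. The names `vars_per_run` and `secs_per_run` and their default values are hypothetical and do not come from the paper.

```python
import math


def screening_time(n_vars, samples_per_var, cores,
                   vars_per_run=2, secs_per_run=60.0):
    """Toy estimate of wall-clock screening time T in seconds.

    Assumptions (hypothetical, for illustration only):
      * each run examines `vars_per_run` variables and takes `secs_per_run`;
      * runs are independent and are scheduled in waves of `cores` at a time.
    """
    if min(n_vars, samples_per_var, cores, vars_per_run) < 1:
        raise ValueError("all counts must be at least 1")
    # Total number of runs needed so that each variable is sampled
    # `samples_per_var` times on average.
    runs = math.ceil(n_vars * samples_per_var / vars_per_run)
    # With C cores, at most C runs finish per wave.
    waves = math.ceil(runs / cores)
    return waves * secs_per_run


def cores_for_min_time(n_vars, samples_per_var, vars_per_run=2):
    """Cores needed for a fully parallel screening (one wave, T = Tmin)."""
    return math.ceil(n_vars * samples_per_var / vars_per_run)
```

Under these assumptions, doubling S at a fixed core count roughly doubles the number of waves, which mirrors the linear scaling in S noted in the caption. Once `cores` reaches `cores_for_min_time(...)`, adding more cores no longer shortens the run.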

Fig 3.

MERRA/Max use case scenarios.

Figure shows the results of two use cases involving Cassin’s Sparrow observational data and predictor data sets of contrasting size and complexity: the Bioclim collection with N = 19 variables (A) and a MERRA-2 reanalysis test collection comprising N = 86 variables (B). A Variable Screening step was used in each scenario to select the top six contributing variables in the underlying collection. Correlated variables (indicated with red text and yellow highlight) were identified in a Predictor Refinement step and thinned to reduce collinearities. In a third step, Model Calibration and a Final Model Run were performed with the remaining non-correlated variables (green highlight). AICc is Akaike’s information criterion corrected for small sample size, AUC is area under the receiver operating characteristic curve, PCC is percent correctly classified, TSS is True Skill Statistic, Parameters is MaxEnt’s measure of model complexity, r is Pearson’s correlation coefficient, r² is the coefficient of determination, and VIF is variance inflation factor. The estimated minimum run time (Tmin) for a completely parallel screening is shown in parentheses. Maps created by the authors show MaxEnt logistic output, which can be interpreted as an estimate of habitat suitability between 0 and 1 with warmer colors indicating better predicted conditions for the species.
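
The Predictor Refinement step named in this caption relies on pairwise correlation (r) and the variance inflation factor (VIF). A minimal sketch of how such a thinning could be done is given below. It is an illustration under assumptions, not the procedure used in the study: the greedy keep-by-rank rule, the function names, and the threshold `r_max=0.7` are hypothetical choices. The VIF calculation uses the standard identity that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_vars).

    Uses the identity VIF_j = [R^-1]_jj, where R is the correlation
    matrix of the predictors; equivalent to 1 / (1 - R_j^2).
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def thin_correlated(X, names, ranked, r_max=0.7):
    """Greedy thinning of correlated predictors (hypothetical rule).

    `ranked` lists variable names from most to least contributing.
    A variable is kept only if |r| with every already-kept variable
    is at most `r_max` (an assumed threshold, not from the paper).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    index = {name: i for i, name in enumerate(names)}
    kept = []
    for name in ranked:
        if all(corr[index[name], index[k]] <= r_max for k in kept):
            kept.append(name)
    return kept
```

With a screening result ranked by contribution, `thin_correlated` keeps the highest-ranked member of each correlated group, and `vif` can then be applied to the retained columns to check that little collinearity remains.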

Fig 4.

Cassin’s Sparrow baseline model and maps.

Figure shows (A) results from a MaxEnt run that builds on the Cassin’s Sparrow bioclimatic modeling work of Salas et al. [75] and reflects a more traditional approach to ENM, and (B) the Cassin’s Sparrow range map based on observational data. Highlighted variables are those that were also selected by MERRA/Max in the Bioclim use case. Range map provided by eBird (www.ebird.org), created 28 July 2020, and reprinted from [83] under a CC BY license, with permission from the Cornell Lab of Ornithology.

Fig 5.

Ecological niche modeling (ENM) process.

Schematic description of the ENM process. Color bars under each step indicate the approximate amount of time the step may need, ranging from low (blue) to high (red). Using MERRA/Max to prescreen a large collection of predictors could support variable selection in the data cleaning step. Image provided by [92] and adapted for use here under a CC BY license.
