Enhancing predictive performance for spectroscopic studies in wildlife science through a multi-model approach: A case study for species classification of live amphibians

Near infrared spectroscopy coupled with predictive modeling is a growing field of study for addressing questions in wildlife science aimed at improving management strategies and conservation outcomes for managed and threatened fauna. To date, the majority of spectroscopic studies in wildlife and fisheries applied chemometrics and predictive modeling with a single-algorithm approach. By contrast, multi-model approaches are used routinely for analyzing spectroscopic datasets across many major industries (e.g., medicine, agriculture) to maximize predictive outcomes for real-world applications. In this study, we conducted a benchmark modeling exercise to compare the performance of several machine learning algorithms in a multi-class problem utilizing a multivariate spectroscopic dataset obtained from live animals. Spectra obtained from live individuals representing eleven amphibian species were classified according to taxonomic designation. Seven modeling techniques were applied to generate prediction models, which varied significantly (p < 0.05) with regard to mean classification accuracy (e.g., support vector machine: 95.8 ± 0.8% vs. K-nearest neighbors: 89.3 ± 1.0%). Through the use of a multi-algorithm approach, candidate algorithms can be identified and applied to more effectively model complex spectroscopic data collected for wildlife sciences. Other key considerations in the predictive modeling workflow that serve to optimize spectroscopic model performance (e.g., variable selection and cross-validation procedures) are also discussed.


Introduction
Near infrared (NIR) spectroscopy used in tandem with chemometric analysis is a non-destructive biophotonic technique that has recently gained traction for addressing questions in wildlife sciences [1].Physiological traits can be discerned once models built from spectral data collected from biological samples (e.g., scent marks, feces, urine, hair, blood) or even the live animal, have been validated with direct methodologies [2].NIR spectroscopy offers several advantages over standard single-trait assays by providing a high throughput of rich biochemical information, in the form of a biochemical signature captured in a digitized spectral profile.With technological advances, NIR spectrometers have become more portable and fieldfriendly providing on-the-ground, real-time predictive capability for assessing aspects of animal physiology, particularly in studies pertaining to wildlife demography (e.g., sex, age, count data), disease detection, reproductive status assessments, nutrition profiling, and species taxonomic determination [1].The digitization of biochemical information captured in NIR spectra not only enables predictive model development across many applications at once, but also facilitates longitudinal studies that examine changes in population demographics or animal physiology and health over time [3].Once prediction models have been generated, new data can be incorporated to enhance sampling variability and novel modeling approaches may also be applied to improve predictive modeling outcomes.
In the field of spectroscopy, chemometric analysis involving a wide array of mathematical and statistical operations can be employed to generate a prediction model for the biophysiological parameter of interest.There are numerous algorithms (also referred to as models or learners) each with their own associated tuning parameters that may be applied and tweaked to calibrate a prediction model [4].In conventional classification and regression problems where the goal is to either predict the class or to quantify the value of an unknown sample for some target variable, unsupervised (e.g., k-means clustering, principal component analysis) and supervised learning (e.g., partial least squares, random forest, gradient boosting, Naïve Bayes) algorithms have been applied with great success [4].Although commonly used chemometric software platforms (e.g., Unscrambler X, OPUS) are user-friendly and highly customizable in some respects they can be limited in the number of algorithms used for developing and testing models.At present, modeling algorithms such as linear discriminant analysis (LDA) and various versions of partial least squares (PLS) represent the two most commonly cited algorithms for generating spectroscopic prediction models in wildlife science (Table 1).
However, there are many more algorithms for developing prediction models that are not only publicly available but are free to access in open-source software programs (e.g., Python, R).Given that an algorithm applies unique mathematical steps to generate predictions for the classification or regression task, it is entirely possible that different algorithms may offer more analytical flexibility, perform with varying levels of success, and some may prove to be better suited for the problem at hand [5][6][7][8][9].Furthermore, the ability of an algorithm to perform a specific task may vary depending on the characteristics inherent to the dataset (e.g., number of observations vs. predictor variables) and study system (e.g., species, tissue type) being examined [4].Overall predictive performance of spectroscopic-based models might benefit from a multi-model benchmark approach in which multiple algorithms are applied to independently train and test each of the models using identical data splits.Each model's predictive performance is then compared to identify the candidate algorithm that best predicts the classification or regression task of interest.For example, model efficacy varied considerably when three supervised model procedures were utilized to detect papillary carcinoma from Raman spectra, such that the random forest (81.5%) and AdaBoost (84.6%) algorithms yielded markedly higher classification rates compared to the decision tree classifier (75.4%) [8].While such multi-model approaches are routinely conducted to benchmark and evaluate candidate models in fields such as engineering, medicine, and agriculture [4,5,7,8,[10][11][12][13][14][15][16][17][18], the majority of spectroscopic-based studies in wildlife sciences (61/65 studies; see Methods) have applied a single-algorithm approach for model calibration and testing (Table 1).PLS models have been Table 1.Machine learning (ML) algorithms used in spectroscopic studies for wildlife science applications.LDA [36] PLS [3] Chinese giant salamander Sex Skin of live individual PLS [37] Dusky salamander Pheromone expression Skin of live individual LDA [3] Reptiles Sea turtle Nutrition Stomach contents PLS [38] Birds Macaw Nutrition Stomach contents PLS [39] Ostrich Nutrition Excreta PLS [40] Penguin

Myoglobin saturation Population ecology
Pectoral muscle implant Excreta LMM [41] PLSR [42] (Continued ) applied to answer a diverse set of questions across many different sampling types, such as species discrimination from dried blood traces between Bengal tigers and Indian leopards, and quantification of dietary components from the feces of black bears [19,20].However, it may be that different sample types and associated questions can be better addressed through the exploration of other algorithms [21].The wide variety of sampling modes (e.g., feces, hair,

Vertebrate
otolith, live animals) and questions addressed in wildlife sciences warrants a multi-faceted approach to robustly generate prediction models that are best suited for the sample type under investigation.For instance, K-nearest neighbors provided higher predictive outcomes compared to PLS when classifying structural differences among fish otoliths [21].The current study aimed to test the utility of a multi-model benchmark procedure in a question relevant to wildlife sciences, specifically species classification using the spectra collected from the live tissue of amphibians.Species determination is a valuable demographic trait that can be used to inform management decisions based off evaluations of biodiversity and community dynamics.
To compare model algorithms, we first created a spectral database containing NIR spectra from several amphibian species belonging to two different orders, Anura (i.e., frogs and toads) and Caudata (salamanders and newts).Although the high water content present in the dermis and dermal secretions of amphibians is thought to weaken the quality of spectral signals, previous NIR spectroscopy work with amphibians found that biologically relevant information could be obtained from the surface of amphibian skin [33,35,37,77].The porous, endocrine-like nature of amphibian skin allows the collection of a unique spectral profile that is likely associated with the skin's biochemical composition and can be analyzed without the use of molecular-based (e.g., DNA analysis) or potentially invasive procedures [78].The ability to hold, restrain, and scan amphibians in less than the minute needed to obtain spectra, makes them a valuable vertebrate class for studying how NIR spectroscopy can be used to obtain information on whole, live animals.Recent NIR studies on live amphibians have demonstrated promising results for evaluating and categorizing physiological factors such as biological sex, species, and reproductive status [19,36,37,77].For example, taxonomic classification that previously required genetic sequencing or specialized field biologists to distinguish closely related anurans has been facilitated with NIR spectroscopy [35].In the current study, we examined spectral profiles of eleven amphibian species to determine to what extent spectral profiles would differ between spectra belonging to different orders, families, genera, and species.We posited that species belonging to the same taxonomic clades (e.g., Order Anura or Caudata) would present greater similarities in their spectral profiles than species belonging to different taxonomic clades.
Secondly, we examined if predictive classification models based on the NIR spectra of amphibian skin in live amphibians could be improved through the employment of a multimodel benchmark approach.A benchmark modeling exercise was conducted in the open-source software, R, in a multi-class problem to compare the performance of seven conventional machine learning algorithms: generalized linear model with elastic net regression (GLM), K-nearest neighbors (KNN), linear discriminant analysis (LDA), partial least squares (PLS), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB).The scope of this study is limited to the aforementioned algorithms, but many others have recently been applied (e.g., Naïve Bayes, neural networks) to generate prediction models for spectroscopic datasets [79][80][81].Each model technique adheres to a unique, established set of mathematical rules to first assess spectral variability present in the dataset and then calibrate a model to subsequently make predictions on unseen test data.We investigated whether the seven algorithms applied would differ significantly with regard to their aggregated performance accuracies, as each model deploys a unique approach with varying characteristics (e.g., algorithm design, computational complexity, parameter tuning) to achieve the classification task of interest.

The spectra, principal component loadings, & scores
Spectral data were averaged and categorized according to their species designation, which indicated visually distinct patterns between the eleven amphibian species investigated (Fig 1).More specifically, major differences between species can be observed in both the raw and transformed spectra on the third overtone region of the NIR spectrum spanning 700-1200 nm, as well as the first and second overtone regions of the NIR spectrum that correspond to the range between 1200-1700 nm (Fig 1A & 1B).Overall, marked variability in spectral patterns were observed between the amphibians examined, especially in the NIR region adjacent to the visible region ranging from 700-1000 nm.Several additional spectral features were revealed in the transformed spectra (Fig 1B).Within the first overtone region, there is a distinct water signal spanning 1300-1600 nm, where spectral patterns can be distinguished between taxonomic orders (Anura in green and blue vs. Caudata in red).Differing moisture signals among species are notable and should be considered when assessing the spectra obtained from live tissue.Additionally, the application of mathematical pretreatments (see Methods) resolved the noisiness of some peaks (e.g., at approximately 950 nm) and improved overall spectral resolution.
The principal component scores plot is a representation of the original dataset that has been linearly transformed to a new coordinate-based system, such that the variables spanning 700-2500 nm are combined and reduced (i.e., based off their collinearity with one another) into linear combinations of latent variables called principal components (PC).PCs help to explain the total variance present in the dataset; the highest percentage of variance is explained by the first PC, then the second PC, and so on.Overall, the anuran (i.e., frogs and toads) spectral scores spanned all four quadrants of the PCA scores plot, whereas the majority of spectra belonging to caudates (i.e., salamanders) were strongly constrained to the negative axis of PC-1 (Fig 2).The biochemical similarities/differences inherent to the spectra are explained by the dominant peaks found in the positive and negative direction of the PCA loadings (Fig 2A inset), where the first two PCs accounted for 92% of the total variance present for the combined (anuran + caudate) database.Similarly, PC-1 and PC-2 accounted for 82% and 93% of the variance observed in the anuran-only and caudate-only databases, respectively.The loadings for both PC-1 and PC-2 varied distinctly at the following approximate wavelengths: 1100, 1400, and 1900 nm, as shown in Fig 2 .The Hotelling's T 2 influence plot produced by the PCA indicated that no outliers were present in the spectral dataset.Furthermore, all anuran species were clustered in close proximity to each other, with the greatest spectral overlap occurring between species belonging to the same genera (e.g., Anaxyrus species) and the greatest separation occurring between genera (e.g., Anaxyrus vs. Lithobates, Figs 2B and 3).Similarly, all caudate species were clustered in close proximity to one another, with very tight clustering observed among individuals belonging to the same genera (e.g., Ambystoma species) and distinct spectral separation occurring between genera (e.g., Ambystoma vs. Andrias and Cryptobranchus, Figs 2C and 3).For example, spotted and tiger salamanders belonging to the family, Ambystomatidae, were closely clustered and constrained to the negative hemisphere of PC-1, while Chinese giant and hellbender salamanders belonging to the family, Cryptobranchidae, were tightly clustered within the positive hemisphere of PC-1.

Discussion
The current study investigated the use of a multi-model benchmark approach for evaluating amphibian taxonomic status using a spectroscopic dataset.Several key observations were made based off the results from the specific datasets utilized, the algorithms tested, and the input parameter settings (i.e., the tuned parameters and hyperparameters specific to each model) applied to achieve the classification task at hand.First, we provide support for near infrared spectroscopy coupled with chemometrics as an analytical technique for detecting spectral differences at the taxonomic levels of order, family, genus, and species level for the anurans and caudates examined.Robust prediction models were produced to accurately classify the eleven amphibian species in our study, and we observed greater spectral similarities among individuals within the same taxonomic group than between taxonomic groups.Second, the multi-algorithm approach applied here demonstrates the ability of various machine learning algorithms to model and ultimately predict the species of unknown individuals based off of their spectral profiles.As indicated by significantly higher classification accuracies for both SVM and GLM compared to LDA and KNN models, we conclude that different modeling techniques have the ability to perform predictive species assignments of spectra collected from live amphibians with varying levels of success.Thus, the use of a multi-model approach may be especially beneficial for evaluating new questions and datasets for the first time, as an algorithm that produces the best performance for one dataset or question of interest may not be suitable across differing study systems.
In this study, we demonstrate that the spectral profiles that vary between species can be readily discriminated from the dermal tissue of living individuals in a non-invasive manner, with minimal handling and without the use of anesthetics.Despite the high water content present in amphibian skin, all prediction models classified the eleven amphibian species with a high degree of accuracy (>89%).In the present study, the SVM yielded the highest mean classification accuracy and KNN the lowest among the seven algorithms tested.In another study utilizing a multi-model approach, Benson et al. [21] identified KNN as the top-performing model for discriminating fish species via spectral scanning of otoliths, which highlights the importance of testing multiple models when evaluating new datasets and different species or testing physiological parameters for the first time.These findings are in accordance with the "no free lunch" theorem [82], which states that no single model is guaranteed to outperform competing models for any given machine learning problem.While classic chemometric modeling methods have proven useful for generating prediction models, it would likely benefit researchers within the discipline of wildlife science to employ a multi-model benchmark approach when undertaking novel investigations.Multi-model benchmarks can be employed at low cost and with relative ease using open-source software analytics programs and the code provided (S2 Document).Additionally, as software updates are regularly released by chemometric analysis programs, utilization of this approach may be facilitated in future studies.
For the vast majority of the misclassified spectra in this study, inaccurate predictions occurred between species classifications that were more taxonomically related.For example, spectra from caudate species tended to be misclassified as other caudates, and a similar pattern was upheld among misclassified anuran spectra, with the exception of the boreal toad.We are unsure why three boreal toad spectra were classified as caudates, but it could be a result of boreal toads being a more cold-adapted species and experiencing similar environmental pressures faced by caudates [83,84].The no-information rate, a metric that represents the best possible guess that can be made in the absence of information (i.e., the largest proportion of the dataset, or the most common class), was found to be 24%, suggesting that each of the models produced contained relevant information for predicting species designation in the current dataset (i.e., all models performed significantly better than a featureless, non-informative model) [4,85].In the case of the 11-class classification problem (anurans + caudates combined), 24% corresponds to the proportion of spectra belonging to the majority class (N = 399 spectra for Anaxyrus houstonensis) to the total number of spectra (N = 1661 spectra; Table 2).
Overall, the unique spectral profiles present between species likely represent a suite of biochemical differences in their skin that may be exploited for distinguishing individuals at the species level.We found that spectra could be used to group species based on taxonomic clade,  as indicated by the principal component scores and canonical discriminant plots.For example, salamander spectra (i.e., Order Caudata) were clustered in close proximity to one another, while spectra belonging to frogs and toads (i.e., Order Anura) tended to remain spatially distinct from their caudate counterparts.The PCA loadings for caudates versus anurans indicated differences at 1100, 1400, and 1900 nm approximately; these wavelengths coincide with absorbance frequencies of biochemical functional groups indicating different distributions of biological compounds in the matrix of living tissue, which explain the trends presented in the scores plots.
The spectral profiles between species were then evaluated at a finer taxonomic scale by analyzing anurans and caudates separately.When evaluating anuran spectra only, spectra belonging to the frog genus, Lithobates, were very tightly clustered within genera yet were spatially distinct from spectra belonging to the toad genus, Anaxyrus.Altogether, these findings suggest that clear genetic distinctions among genera produce signature spectral profiles that may become more distinct as the genetic distance between species increases.This study is the first to our knowledge to demonstrate the use of NIR spectroscopy for both anuran and caudate taxonomic designations using live, whole animal spectroscopic data.Scanning of live animals in real-time provides a unique opportunity to assess the in-vivo status of an individual.On the other hand, spectral analysis of biological and excreta samples for making remote predictions of what was occurring in the animal before/at the time of sample collection is also useful for applications in wildlife management and conservation [1,2].
The biochemical phenotypes inherent to each species, that present as different spectral profiles, are likely because amphibian skin is a complex mucosal organ that performs various critical physiological functions, including respiration, water uptake, ion transport, and innate immunity, whose composition will differ between species [86][87][88].Amphibian skin serves as a crucial immune organ constituting a sophisticated network of microbiological, physical, immunological, and chemical barriers to pathogen insult, as a result of having a wide range of secretory glands such as mucosal, granular, and adhesive glands [87].The presence and function of these glands is influenced by the varying environmental pressures faced by species as well as evolutionary mechanisms which serve to aid species survival and reproduction.For example, frogs in the genus Dermatonutus produce specialized intraspecific chemicals that are excreted from breeding glands, similar to pheromones produced by the mental gland of plethodontid salamanders during the breeding season [86,87].Likewise, salamander skin has evolved crucial evolutionary mechanisms for purposes such as predator defense and mate attraction; Plethodon shermani has specialized granular glands on the dorsal side of their tail that excrete a toxic, sticky protein that deters predators, while the glands on the ventral side of the tail produce pheromones used for scent marking and courtship-related behaviors [89].It is therefore crucial to consider the location of such glands when selecting the most appropriate sampling region for spectral data collection.The most optimal sampling region may very well differ depending on the physiological trait being modeled.
With regard to species investigated in the current study, adult hellbender salamanders rely solely on cutaneous breathing, making the skin an important site of electrolyte and gas exchange [90].Additionally, the skin of the Chinese giant salamander was found to contain 249 unique proteins, 155 of which are derived from the skin mucous alone [91].Members of the Anaxyrus genus possess granular glands that secrete peptides, biogenic amines, steroids, and alkaloids, which serve anti-predatory functions [92].Furthermore, within the Anaxyrus genus alone, there is a tremendous amount of amine and polypeptide diversity in the skin both inter-specifically and intra-specifically [93], which may have contributed to the unique spectral profiles observed among anurans in the present study.Amphibian skin also serves as an osmoregulatory organ that regulates acid-base balances in the body and the movement of hydrominerals [88].The epithelial tissue contains vital sodium, urea, and water channels (i.e., aquaporins) that are constantly changing factors affecting skin biochemistry [88].Given that different amphibian species produce varying types and amounts of glycoproteins and antimicrobial peptides, the types of fungal and bacterial communities that are harbored on the skin (i.e., microbiome) of individual species may also differ [94].Altogether, the enormous amount and phenotypic plasticity of biochemical diversity present within amphibian skin likely explains the high performance of near infrared spectroscopy for identifying various traits.
Multi-model benchmarks and ensembles have been explored and used extensively across several industries.Such an approach has become commonplace in fields such as engineering, medicine, and agriculture [4][5][6][7][8][10][11][12]14,15,95].Here, the application of a multi-model approach for optimizing spectroscopic prediction models in wildlife science is demonstrated.Considering the multivariate nature of large spectroscopic datasets, machine learning methodologies are well suited to address questions in wildlife science.Of importance to note is that ensembles may refer to at least two different modeling concepts.The first is similar to the multimodel approach employed in the current study, where numerous algorithms are applied to address the classification or regression problem on the same dataset, except results are combined in concert to base predictions.Ensembles go a step further in that the model acknowledges that one algorithm may perform more effectively for predicting a specific class, relative to other algorithms tested; for example, one algorithm may more effectively predict the majority class present in the dataset, while another may classify the minority class at a higher rate.On their own, it may be that neither algorithm is useful for predicting both classes, but when the algorithms are combined in an ensemble, they can achieve higher prediction rates than if they were operated separately [6].The concept of ensembling is also similar to stacking, where models are stacked on top of one another to base decisions, which may ultimately lead to the generation of a model with greater predictive performance.Alternatively, the term "ensemble" could refer to algorithms such as random forest, a model which builds numerous independent trees; each tree gets their own "vote", which is then tallied up in an ensemble to make the final prediction [96].In a random forest, the voting weight is equal amongst all trees, whereas for extreme gradient boosting algorithm, the vote is weighted by the overall accuracy yielded by a tree during the model calibration stage of model development.Furthermore, varying procedural steps used by each unique algorithm to make predictions likely contribute toward differences in their respective classification performance [4].As similar benchmarking approaches are applied across a wider range of species and tissue types (e.g., fin clip, vertebra, hair, blood, feces) in the future, candidate algorithm(s) most appropriate for specific study paradigms may be identified.
Based off the results from this study, we propose the application of a multi-model benchmark approach to evaluate the effectiveness of multiple algorithms for producing the best performance results.The algorithm that may perform best on one spectral dataset for any given classification or regression problem may differ depending on the problem and target species or sample matrix being investigated.Overall, this study demonstrates the utility of a multimodel benchmark approach for comparing and optimizing model performance using seven conventional machine learning algorithms.With that being said, other modeling frameworks, such as deep learning (e.g., convolutional neural networks) have been recently established and may serve to increase the predictive capacity of spectroscopic models built for addressing questions in wildlife science.

Dealing with imbalanced datasets
Imbalanced sample sizes among treatment groups are a major yet common obstacle to producing robust prediction models, especially in field settings where study populations are often limited.To overcome challenges concerning imbalanced datasets (e.g., when a class is either over-or under-represented), sample equalization can be conducted using a variety of methods, including under-sampling, over-sampling, augmenting the weight of the minority class, and through the synthetic minority over-sampling technique (SMOTE) [4].SMOTE helps to balance the samples for each class by generating new samples for the minority class; this procedure can be used in conjunction with other sample equalization techniques (e.g., undersampling of the majority class).However, it is important to note that although over-sampling the minority class or under-sampling the majority class can help to improve model performance measures for the newly balanced class, it may in turn negatively impact model performance for predicting other classes [97].For the present study, we acknowledge the imbalance of spectra between the eleven species investigated, with the greatest disparity occurring between the southern leopard frog (N = 26 spectra) and Houston toad (N = 399 spectra).This represents the classic problem of imbalanced datasets where the minority class (i.e., southern leopard frog in this case) is consequently the most misassigned class in the prediction model (Table 2).In such a case, the use of a class weighting procedure might be considered, which assigns a higher penalization value for misclassifying the minority class.Although an important consideration, methods accounting for class imbalance were not relevant to the objectives of this study and were thus not conducted.

Feature selection
Feature selection is a critical data processing step for exploring the dataset and evaluating the usefulness of the predictor variables, which is especially important for large spectroscopic datasets containing thousands of variables.Given that there are 1800 wavelengths in the NIR spectrum (700-2500 nm), each serving as a potential variable to predict the response, numerous methods have been employed to reduce database dimensionality in an effort to parse out meaningful variables in a sea of non-informative features.The variable reduction technique (also referred to as feature engineering) used here was Boruta (see Methods), although many other tools are available for reducing database dimensionality.Principal component analysis (PCA) is by far the most widely used method in spectroscopy studies, followed by three commonly used methods for reducing database complexity.First, some algorithms, such as linear discriminant analysis (LDA), partial least squares (PLS), and random forest (RF) internally perform "dimensionality reduction" processes.Second, "filter methods" may be used to compare the features (i.e., wavelengths) to one another and the outcome to determine relative importance of the features for predicting the response [98].For example, there are many filters available in the "mlr" package within R, all of which can be used to rank features based on their importance for the desired task.More specifically, filters such as the 'ranger_permutation' function works by dropping features from the dataset and observing how well the model performs in the absence of the excluded feature.Third, "wrapper" methods can be used and are similar to filter methods but make decisions on feature importance based on the model being built [99].For example, a sequential model wrapper may be used to select the optimal number of features by employing either forward or backward selection, which adds on or removes features sequentially at each step, similarly to stepwise regression, which only retains the most informative factors [99].As opposed to feature selection using filters, wrapper techniques use the model itself and thus wrapper models may yield more accurate results that come at the cost of having significantly longer run times during the calibration process.Given that both the quantity and quality (i.e., the informative value) of selected variables may vary between methods, ample consideration should be given to feature selection procedures during the model development stage.Variable selection methods for NIR spectroscopy have been extensively described in a recent work by Yun et al. [98].

Cross-validation schemes
Cross-validation is crucial for addressing concerns related to model overfitting and helps to avoid over-optimistic performance results for the training and test datasets [4].A recent study by Au et al. [100] employed a variety of cross-validation schemes (i.e., leave one out, leave group out, and random subset k-fold) to evaluate calibration and predictive error for assessing chemical compounds in eucalyptus leaves.They found that the method in which spectral observations were split (i.e., for calibration, cross-validation, and testing datasets) was important for fitting generalizable prediction models while minimizing error and bias for the crossvalidation and prediction datasets.Additionally, Au et al. [100] observed decreases in root mean square error prediction when the number of calibration/training samples increased, while prediction error plateaued when the calibration dataset reached a sufficient sample size.
The current study used a five-fold nested resampling scheme [101,102], which was selected after testing a variety of resampling methods in a pilot study.We observed that mean classification accuracies were highest and error minimized to the fullest extent when cross-validation was applied to both the inner (i.e., used for model calibration and tuning) and outer (i.e., used for testing) sampling datasets, whereas accuracy was lowest and variance the highest when holdout methods were solely applied to the outer resampling.Thus, testing and comparing various cross-validation schemes simultaneously may help to identify which cross-validation procedures provide the least unbiased estimates of predictive performance while minimizing error [102].Further considerations of cross-validation procedures for producing robust, broad-based models are outlined in detail by Au et al. and Cozzolino [100,103].Although the nested crossvalidation applied in this study is designed to provide a reliable estimate of model performance, a separate external dataset could serve as a final validation step to assess model efficacy on a completely unseen test dataset, which may be a valuable consideration for future studies.

Other algorithms, parameter, & hyperparameter tuning
Within freely available, open-source statistical programs such as R, it is possible to access numerous algorithms for model development and testing, each with their own unique set of tunable parameters and/or hyperparameters.During the calibration stage where models are tuned to achieve peak performance for the measure of interest (e.g., accuracy, mean misclassification error, kappa, ROC, AUC), tuning wrappers are employed with set search spaces, which can be conducted either at random with a given set of parameters or systematically in a grid-like fashion.While some algorithms, such as LDA or PLS have only one tunable parameter (i.e., number of principal components/factors applied), others such as GLM and XGB may have numerous tunable hyperparameters.Changing the search method as well as increasing or decreasing both the search space and the hyperparameters may greatly affect both the overall performance and run time for the prediction model.The cross-validation procedure that is applied can also profoundly influence the time required to calibrate the model (e.g., nested cross-validation is more computationally intensive than conventional k-fold cross-validation) [101].At any rate, after the prediction model has been adequately calibrated and tested, the tuned model can be used to make predictions at much faster rates.

Ethics statement
Experiments were approved by and adhered to the permit and/or animal care and use guidelines set forth by the Fort Worth Zoo (IACUC #17-H001; Federal Permit TE051818-0), Ladder Ranch (USFWS Permit #TE43754A-0), and Mississippi State University (IACUC # .

Data preprocessing & feature selection
Chemometric analyses were carried out in R version 4.1.2[104] and R Studio version 2020.02.0 [105] using the "mlr" package [106].The packages "xgboost" and "glmnet" were used to access the extreme gradient boosting and GLM algorithms, respectively [107,108].Regions within the visible-light spectrum were omitted so that only wavelengths within the near infrared range (700-2500 nm) would be included as predictor variables (often referred to as "features" in machine learning) in our analysis.Species class was used as the response variable (also referred to as the "task").The task for the model also includes information to specify predictor variables, if a blocking procedure is to be used, and whether the task is a classification or regression problem.Spectral data were preprocessed with smoothing (Savitsky-Golay, first order polynomial with a window size of 11) and first derivative (Savitsky-Golay, first order polynomial with 3 symmetric kernel points) functions using the "prospectr" package [109].The "Boruta" package was applied as the variable selection algorithm [110].The Boruta algorithm functions by using feature importance scores assigned to each original feature that competes against shadow features, which are latent variables representative of the original variables that are randomized to a new permutated form; only variables that score higher than the best shadow feature in the set for predicting the response variable in at least 95% of trials are selected as "important" [110].This significantly reduced the total number of variables in the dataset from 1800 to 906.Parameters "ntrees" and "maxRuns" affect the number of trees employed in the random forest algorithm used within Boruta, and the maximum number of runs completed, respectively; modulating these parameters may affect the total number of features selected.The newly reduced dataset of 906 variables (i.e., wavelengths) was used to calibrate each of the prediction models for the combined-species dataset.As for the caudate-only and anuran-only datasets, 387 and 715 variables, respectively, were retained by Boruta to build prediction models.

Cross-validation & model calibration
To avoid data leakage from the training dataset into the test dataset, we partitioned the full dataset into blocks and applied a five-fold cross-validation procedure.Due to several of the amphibian species being sampled with varying numbers of spectral replicates, the full dataset was partitioned into three blocks, depending on the number of spectral replicates collected for each species (range = 1-4 replicate scans; e.g., A. davidianus individuals had only one replicate taken versus three replicates taken for each A. houstonensis individual).The blocking procedure ensures that all replicate scans of an individual end up together in either the test or training dataset and removes the opportunity for these related scans to leak data between training and test datasets.The number of individuals, spectral replicates, and total number of spectra acquired for each species are provided in S3 Document.We used a nested resampling scheme where five-fold cross-validation was conducted on both the inner and outer resampling datasets; the inner resampling is used for model calibration and parameter tuning/optimization and the outer resampling returns results on the test dataset [101,111].This cross-validation procedure is expected to reduce overfitting and the overall variance yielded in the estimated performance accuracies, as the resulting output is an aggregated performance measurement that is generated by taking an average of the five training runs (as opposed to a single training run) for five iterations [102].More specifically, the outer sampling, comprised of the entire dataset, was split into five folds such that one fold was treated as the test dataset and the remaining four folds as the training dataset.The four latter folds were used in the inner resampling scheme and then followed a conventional five-fold cross-validation approach where hyperparameters were tuned for each iteration, and the hyperparameters yielding the highest aggregated results for the inner fold were then used for the prediction model's outer fold of test data that was initially held out [101].This process was run for five iterations (k = 5) so that each fold had a turn to be treated as test data; this is done by sequentially rotating test and training folds until each fold is represented in the test dataset yet remains independent from the training dataset.The five accuracies generated for each model were aggregated then compared using an ANOVA, and a Tukey HSD test (alpha = 0.05) was utilized to evaluate significant differences between the mean classification accuracies of all algorithms investigated.Assumptions for ANOVA were met according to Shapiro-Wilk, Bartlett's, and Durbin-Watson tests.Canonical discriminant plots to visualize spectral separation between taxonomic classifications were generated using JMP 14.0 [112].Variation between species designations was visualized by two canonical variables that are themselves uncorrelated with one another but each of which maximizes the correlations present for the species classification variable.
The 'benchmark' function in "mlr" was used to train and test each of the models using identical data splits for all seven algorithms to ensure that all models were directly comparable [106].The 'confusionMatrix' function in the "caret" package was used to provide an output of performance metrics including accuracy, no-information rate, and confusion matrices [85].Parameters and hyperparameters along with their tuning constraints (i.e., upper and lower range) for each of the algorithms applied are specified and briefly described in S4 Document.Confusion matrices for both the worst-and top-performing models for each of the three datasets (anurans + caudates combined, anurans-only, caudates-only) analyzed are reported in S5 Document.A literature search including research articles, published conference abstracts, and reviews was conducted in Google Scholar, Scopus, and Web of Science to determine the number of algorithms applied and tissue type sampled in each study (search words: "wildlife" AND "spectroscopy").We recognize the search conducted is not exhaustive but consider it representative of the general modeling approach currently utilized for spectroscopy studies in wildlife science.We acknowledge the possibility that authors explored but did not explicitly report the use of additional algorithms tested during the exploratory stage of data analysis.In any case, we categorized studies as a single or multi-model (two or more) approach according to the number of algorithms reportedly used for tuning models and making predictions.Results presented in Table 1 include studies containing vertebrate species exclusively, and studies found in our literature search were excluded if they did not clearly specify the algorithm or model type used for generating prediction models.Additionally, our search was limited to studies utilizing shallow machine learning algorithms, although deep learning is an emerging predictive modeling technique in spectroscopic studies [80,[113][114][115].

S5 Document. Confusion matrices for predicting taxonomic classifications for the anuranonly and caudate-only datasets. (XLSX)
S6 Document.Spectral database for the eleven amphibian species examined in this study.(XLSX)

Fig 1 .
Fig 1. A) Raw and B) transformed NIR absorbance spectra (700-2500 nm) were averaged and categorized according to their species designation.The averaged spectra for each species are colored by taxonomic groupings: frogs = green, toads = blue, and salamanders = red.Distinct patterns can be observed between the eleven species investigated across several regions of the NIR spectrum.https://doi.org/10.1371/journal.pcbi.1011876.g001

Fig 3 .
Fig 3. Canonical discriminant plots indicate biochemical separation between individuals belonging to different orders of amphibians (caudates located in the left oval and anurans located in the right oval).Discriminant plots visualizing biochemical relationships among the four caudates (bottom left) and seven anurans (bottom right) were also investigated.Illustrations by DMC.https://doi.org/10.1371/journal.pcbi.1011876.g003

Table 2 . Confusion matrix of results for the support vector machine (i.e., the top-performing model) applied to the combined (anuran + caudate) dataset.
Bolded values indicate the number of accurately classified spectra and non-bolded values indicate the misclassified spectra in the test dataset, respectively.Dotted lines separate major taxonomic orders (Anura vs. Caudata) and reveal that most spectra misclassifications occur within more closely related species.