How similar is “similar,” or what is the best measure of soil spectral and physiochemical similarity?

Spectral similarity indices are used to select similar soil samples from a spectral library and thereby improve the predictive accuracy for target samples. Many similarity indices are available, and how to select the optimum index has become a critical question. Five similarity indices were evaluated on a global soil spectral library: spectral angle mapper (SAM), Euclidean distance (ED), Mahalanobis distance (MD), and SAM and ED computed in the space of principal components (SAM_pca and ED_pca). The accordance between spectral and compositional similarity was used to select the optimum index. The optimum index was then tested to determine whether it maintained the greatest predictive accuracy when similar samples were selected from the spectral library to predict a target sample with a partial least squares regression (PLSR) model. The evaluated physiochemical properties were soil organic carbon, pH, cation exchange capacity (CEC), clay, silt, and sand content. Samples selected by SAM and SAM_pca were closer in composition to the target samples than those selected by the other indices. Based on similar samples selected using these two indices, PLSR models achieved the highest predictive accuracy for all soil properties except CEC. This supports the hypothesis that the accordance information between spectral and compositional similarity can help select the appropriate similarity index when selecting similar samples from a spectral library for prediction.


Introduction
Visible and near-infrared (VNIR) spectroscopy has demonstrated its ability to predict many soil physiochemical properties, such as soil organic matter (SOM), particle size, and iron content [1][2][3]. In addition to its wide use for predicting soil properties, comparison of spectra from soil samples is used in several soil science-related applications [4], such as forensic soil science, archeology, and soil pollution assessments. Similarity indices have also been used to select samples from spectral libraries [5,6] and to build local models that improve the physiochemical prediction for the target site [7][8][9]. The hypothesis of this strategy is that the selected similar spectra can better represent the spectral features of the target samples, thus leading to better model performance [10,11]. There are several spectral similarity indices, each quantifying similarity differently, and different indices yield different rankings of sample similarity [10]. For example, the spectral angle mapper (SAM; [12]) measures the angle between two spectral vectors to quantify similarity, while Euclidean distance (ED) measures the straight-line distance between the two vectors in n-dimensional space, where n is the number of spectral bands. How to select similar spectra is a critical question, as it determines which candidate samples will be included for subsequent model building. Calibration datasets based on different indices will differ, and thus lead to different model performances. In addition to the question of which spectral similarity index is most suitable for model calibration, the extent to which the index represents other measures of similarity between the soils was also explored; i.e., do similar spectra correspond to similar physiochemical properties of these soils? We may hope so, since the calibrated spectral library will be used to infer the physiochemical properties of the target samples.
Most previous research has examined a single familiar or widely applied similarity index [8], with no attempt to compare indices. Ramirez-Lopez et al. [13] were a notable exception: they developed an indicator to select the optimum similarity index. They proposed that the best distance metric would more accurately reflect soil compositional similarity, and implied (but did not confirm) that this would lead to the best predictive performance of the calibrated model. This was tested by comparing the spectral and compositional similarities of two soil properties: clay and pH. Clay has a strong spectral response in the VNIR around the water absorption features [14], while pH does not. Their research, however, did not cover soil properties with remarkably strong spectral responses, such as SOM [2].
We believe it is necessary to evaluate this method with more physiochemical properties, considering both properties with direct and indirect spectral responses. Direct spectral responses indicate direct interaction between the soil constituents and the electromagnetic radiation, while indirect responses are primarily based on a correlation with a combination of other soil properties. Moreover, a further step was taken beyond the research of Ramirez-Lopez et al. [13] by evaluating whether the calibration datasets selected by the better similarity index achieved higher predictive accuracy for a target sample.
Therefore, the objectives of this research were: (1) to evaluate five similarity indices using the accordance between spectral and compositional similarity (SOM, pH, CEC, clay, silt and sand) to select an optimum similarity index; and, (2) to determine if the optimum index maintains the greatest predictive accuracy when selecting similar samples from a spectral library for the prediction of a target sample. Our research hypothesis was that the samples selected based on the optimal similarity index would achieve higher predictive accuracy for the target samples.

Datasets
This study was carried out using a global soil spectral library, including 785 profiles (3831 generic horizons) selected from the International Soil Reference and Information Center (ISRIC). This library contains VNIR spectra (350-2500 nm, sampling interval 1 nm), geographic location, physical and chemical properties, and soil classification information [15].
These profiles were collected from 58 countries in Africa, Asia, Europe, North America, and South America. Spectral measurements were recorded with a FieldSpec FR spectrometer (Analytical Spectral Devices, Boulder, CO). The data providers reduced the spectra to 216 bands by averaging every 10 bands. Physiochemical properties measured by conventional laboratory methods included soil organic carbon (SOC), sand, silt, clay, pH, and cation exchange capacity (CEC).
We performed several quality control checks, and removed all samples with the sum of particle-size separates > 106% or < 94%, leaving 3,813 samples for further analysis [16]. From these, 500 samples with diverse spectral variations were selected as the test dataset using the Kennard-Stone (KS) algorithm [17], and the remaining 3,313 were used as the training set. Selection was based on the Euclidean distance of spectra as represented by principal components (PC). Before the implementation of the KS algorithm and subsequent distance calculation, reflectance spectra were transformed to absorbance, and baseline effects were corrected by a first-derivative transformation with Savitzky-Golay smoothing [18].
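The Kennard-Stone selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`euclidean`, `kennard_stone`) and the toy PC scores are hypothetical, and the sketch assumes the usual form of the algorithm, which starts from the two most distant samples and then repeatedly adds the sample farthest from everything already selected.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kennard_stone(points, k):
    """Return the indices of k points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of points")
    # Start with the two mutually most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: euclidean(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [
        min(euclidean(p, points[first]), euclidean(p, points[second]))
        for p in points
    ]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Add the point that is farthest from the current selection.
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], euclidean(points[i], points[nxt]))
    return selected

# Toy PC scores (hypothetical values, two PCs per sample).
scores = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0], [0.5, 0.0]]
test_idx = kennard_stone(scores, 3)
train_idx = [i for i in range(len(scores)) if i not in test_idx]
```

In this toy case the selected samples span the extremes and the middle of the score range, which is the behavior that makes the rule suitable for choosing a spectrally diverse test set.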

Spectral Angle Mapper (SAM)
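SAM is a commonly used similarity index first introduced by Kruse et al. [12]; it measures the spectral angle between two samples. Because the angle is unchanged when a spectrum is multiplied by a positive constant, SAM responds mainly to differences in spectral shape rather than amplitude (Eq 1):
$$\mathrm{SAM} = \cos^{-1}\left( \frac{\sum_{i=1}^{n} U_i R_i}{\sqrt{\sum_{i=1}^{n} U_i^2}\,\sqrt{\sum_{i=1}^{n} R_i^2}} \right) \quad (1)$$
where U_i and R_i represent the processed spectra for wavelength i (or their PC transformation) for the samples in the test and training datasets, respectively; and n is the number of spectral bands.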
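As an illustration, the spectral angle can be computed as the arccosine of the normalized dot product of two spectra. The Python sketch below assumes spectra are plain lists of equal length; the function name `spectral_angle` and the example values are hypothetical and are not taken from the study.

```python
import math

def spectral_angle(u, r):
    """Angle in radians between spectra u and r (0 means identical shape)."""
    if len(u) != len(r):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(a * b for a, b in zip(u, r))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_r = math.sqrt(sum(b * b for b in r))
    if norm_u == 0 or norm_r == 0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_r)))
    return math.acos(cos_theta)

# Hypothetical three-band spectra: `scaled` has the same shape as `target`
# at half the amplitude, so the angle between them is (numerically) zero.
target = [0.2, 0.4, 0.6]
scaled = [0.1, 0.2, 0.3]
angle_same_shape = spectral_angle(target, scaled)
angle_orthogonal = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

The first example shows why the angle depends on spectral shape and not on overall amplitude: rescaling a spectrum leaves the angle unchanged.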

Euclidean Distance (ED). ED measures the distance of the two spectral vectors in Euclidean space (Eq 2):
$$\mathrm{ED} = \sqrt{\sum_{i=1}^{n} (U_i - R_i)^2} \quad (2)$$
where U and R represent the processed spectral vectors (or their PC transformation) for the test and training datasets, respectively; U_i and R_i are their values at band i; and n is the number of spectral bands. ED was calculated using the dist function in the stats package of R.
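For comparison with the angle-based index, a minimal Python sketch of the Euclidean distance follows. It is only an illustration (the study itself used R's dist function); the function name and example spectra are hypothetical.

```python
import math

def euclidean_distance(u, r):
    """Square root of the summed squared band-wise differences."""
    if len(u) != len(r):
        raise ValueError("spectra must have the same number of bands")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, r)))

# Hypothetical spectra: same shape, different amplitude.
target = [0.2, 0.4, 0.6]
scaled = [0.1, 0.2, 0.3]
distance_scaled = euclidean_distance(target, scaled)
distance_identical = euclidean_distance(target, target)
```

Unlike the spectral angle, the distance between `target` and `scaled` is non-zero, because ED responds to amplitude differences as well as shape.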

Mahalanobis Distance (MD).
MD is the distance between two vectors, considering the covariance among vector elements (Eq 3):
$$\mathrm{MD} = \sqrt{(U_i - R_j)^{T} S^{-1} (U_i - R_j)} \quad (3)$$
where U_i and R_j are the processed spectral vectors (or their PC transformation) for sample i in the test dataset and sample j in the training dataset, respectively; and S is the covariance matrix between U_i and R_j.
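A small Python sketch of the Mahalanobis distance is given below. It makes two assumptions that are not stated in the text: the covariance matrix is estimated from a set of reference vectors (for example, training PC scores) with the n − 1 denominator, and the inverse is applied by solving a linear system. All function names and values are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance (n - 1 denominator) of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [
        [sum((row[a] - means[a]) * (row[b] - means[b]) for row in rows) / (n - 1)
         for b in range(p)]
        for a in range(p)
    ]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, size))
        x[row] = (aug[row][size] - tail) / aug[row][row]
    return x

def mahalanobis(u, r, cov):
    """sqrt((u - r)^T cov^{-1} (u - r)), computed without forming the inverse."""
    diff = [a - b for a, b in zip(u, r)]
    weighted = solve(cov, diff)
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))

# Hypothetical two-component scores for four reference samples.
reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
cov = covariance_matrix(reference)
md_example = mahalanobis([2.0, 0.0], [0.0, 0.0], cov)
```

With an identity covariance matrix the result equals the Euclidean distance, which is consistent with the small differences between ED and MD reported later in the text when bands or components are weakly correlated after transformation.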

Accordance between spectral and individual compositional similarity
For each sample in the test set, the most similar spectrum in the training dataset was selected as the one with the lowest distance under each of the five indices. The six physiochemical properties (pH, SOC, CEC, sand, silt, and clay) of the target samples were then compared to those of their corresponding matched samples. The root mean square error (RMSE) and the coefficient of determination (R²) evaluated from the 1:1 line (actual: predicted) were used to evaluate the accordance between spectral and individual compositional similarity (Eqs 4 and 5):
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2} \quad (4)$$
$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2} \quad (5)$$
where m is the number of samples in the test dataset, y_i is the physiochemical property of sample i in the test dataset, ŷ_i is the physiochemical property of the most similar sample in the training dataset, and ȳ is the average property value of the matched samples in the training dataset. Higher R² and lower RMSE indicate a greater accordance between the spectra and compositional similarity, and these criteria were used to select the optimum similarity index.
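The matching-and-scoring procedure above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: Euclidean distance stands in for any of the five indices, R² is taken as one minus the ratio of the residual sum of squares about the 1:1 line to the total sum of squares, and the mean in the denominator follows the text's definition (the average of the matched values). All names and data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar_index(target, library, distance):
    """Index of the library spectrum with the lowest distance to the target."""
    return min(range(len(library)), key=lambda j: distance(target, library[j]))

def rmse(observed, matched):
    m = len(observed)
    return math.sqrt(sum((yh - y) ** 2 for y, yh in zip(observed, matched)) / m)

def r2_one_to_one(observed, matched):
    """1 - SS_res / SS_tot about the 1:1 line (assumed form, see lead-in)."""
    mean_matched = sum(matched) / len(matched)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(observed, matched))
    ss_tot = sum((y - mean_matched) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical two-band spectra and pH values.
train_spectra = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
train_ph = [5.0, 6.0, 7.0]
test_spectra = [[0.1, 0.0], [1.9, 2.1], [1.0, 0.9]]
test_ph = [5.2, 6.6, 5.9]

# Property value of the closest training sample for each test sample.
matched_ph = [
    train_ph[most_similar_index(s, train_spectra, euclidean)]
    for s in test_spectra
]
accordance_rmse = rmse(test_ph, matched_ph)
accordance_r2 = r2_one_to_one(test_ph, matched_ph)
```

Swapping `euclidean` for another distance function is enough to repeat the comparison for a different index.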

Accordance between spectral and integral compositional similarity
Soil spectra are an integrated result of the physiochemical properties of the sample. Apart from comparing spectra with individual compositional similarity, the relationship between spectral and integral compositional similarity was also investigated. To represent the integral composition, we used six standardized PCs converted from the six physiochemical properties. Since these properties have different units and are correlated, they were scaled first by normal-score transformation, and then converted into six PC scores using the prcomp function of the stats package in R. Then, the PC distance between the target and the matched sample in Euclidean space was calculated to represent the integral compositional similarity (Eq 6):
$$D_i = \sqrt{\sum_{k=1}^{6} \left(U_{pc(i),k} - R_{pc(i),k}\right)^2} \quad (6)$$
where D_i is the integral compositional distance between sample U in the test dataset and the most similar sample R in the training dataset, as represented by PC; U_pc(i) and R_pc(i) are the PC scores for the ith samples U and R, respectively, as converted from their corresponding six physiochemical properties; and k indexes the six PCs.

Partial Least Squares Regression (PLSR) model comparison
The reported optimum similarity indices selected by the accordance between spectral and compositional similarity were evaluated in terms of their predictive power. For each sample in the test dataset, all six properties were predicted by PLSR models based on similar samples matched in the training dataset using the five similarity indices. The number of similar samples selected from the training dataset has a great effect on model performance [19], which, although important, was not the focus of this research. Different sizes (n = 5, 10, 20, 30, 40, 50, 100, 150, 200, 250, 300, 400 and 500) were tested for prediction of SOC. Predictive accuracy peaked and stabilized at around 250 samples; thus, this size was selected for all subsequent analyses of other physiochemical properties (S1 Fig). PLSR model performance was evaluated by the ratio of percent deviation (RPD; Eq 7):
$$\mathrm{RPD} = \frac{\mathrm{SD}}{\mathrm{RMSEP}} \quad (7)$$
where SD is the standard deviation of the observed property values for the test dataset, and RMSEP is the RMSE of the prediction (see Eq 4). For each sample in the test dataset, the most similar 250 samples in spectral space (as evaluated by different similarity indices) were selected from the training dataset to build the PLSR models for prediction of the soil properties. We followed the criteria proposed by Chang and Laird [20] to evaluate the performance of the PLSR models: (1) RPD < 1.4, the model is not able to predict the target property; (2) 1.4 ≤ RPD < 2.0, moderate model predictive performance; and (3) 2.0 ≤ RPD < 2.5, the model can predict the target property well.
Table 1 shows the summary statistics of physiochemical properties for the training and test datasets. Since a global soil spectral library was used, the soil properties covered a wide range of values (Table 1). Extremely acidic (pH = 3.00) and alkaline (pH = 10.5) soils, sandy soils with no SOC, and organic soils with high SOC (15.75%) were all included.
The particle size class ranged from extremely sandy to extremely clayey. The range of the test set was somewhat narrower as an artifact of the smaller sample size.
Fig 2 presents an illustrative example to depict how the five similarity indices differ in their selection of the most similar spectra. ED, MD, and ED_pca selected the same most similar spectrum, primarily because of the similar distance calculations of ED and MD. Their matched spectrum nearly overlapped with that of the target sample because the calculations of ED and MD focus on the relative difference of the reflectance values. The reflectance of the most similar spectrum selected by SAM was much lower than that of the target sample since SAM primarily considers similarity in spectral shape. SAM_pca also yielded different results from the other methods.
Table 2 presents a comparison between the spectra and the six individual compositional similarities. The similarity index was selected according to lower RMSE and higher R² values. As these two measurements always agreed, only R² will be presented in the following discussion of the individual soil properties. We used the following criteria [20] to indicate the accordance between spectra and individual compositional similarity: (1) R² < 0.5, poor; (2) 0.5 ≤ R² < 0.8, moderate; and, (3) R² ≥ 0.8, good accordance. For pH, the performance sequence was as follows: SAM > SAM_pca > MD > ED_pca > ED (R², 0.58-0.64; RMSE, 0.89-0.96); thus, SAM achieved the best accordance for pH. For SOC, the performance sequence was: ED > SAM > ED_pca > MD > SAM_pca (R², 0.47-0.56; RMSE, 1.13-1.25%). In contrast to the results for pH, ED and SAM yielded the best performance, both outperforming the similarity based on the reduced dimension of PC space (ED_pca and SAM_pca). Note that information may be lost when not all PCs are retained, even though the 15 PCs used explained almost all of the total observed variance of the spectra.
The performance for CEC was: SAM_pca > ED > SAM > ED_pca > MD (R², 0.58-0.67; RMSE, 9.60-10.83 cmol·kg⁻¹). Thus, the samples selected by SAM_pca were the most similar (R² = 0.67), while ED, SAM, ED_pca, and MD performed similarly (R², 0.58-0.60).

Comparison between spectral and individual compositional similarity
For particle size distribution, the accordance for percent clay content was generally superior to that for sand and silt. The performance sequence for clay was: SAM > SAM_pca > MD > ED > ED_pca (R², 0.49-0.61; RMSE, 14.43-16.51%). The performance sequence for percent [...] It is apparent that no single ranking of the indices holds across all properties. The highest accordance (R²) achieved for CEC, pH, clay, SOC, sand, and silt was 0.67, 0.64, 0.61, 0.56, 0.55, and 0.37, respectively. No property reached the level of good accordance (R² ≥ 0.8); most fell within the range of moderate accordance (0.50 ≤ R² < 0.80), and silt notably fell in the range of poor accordance (R² < 0.5).
We hypothesized that the properties with strong, direct spectral responses, such as SOC, would have good accordance, while properties with low or indirect spectral responses, such as pH, would be poor; however, the accordance for SOC was only moderate. The reason may be that the spectral response of SOC is masked or disturbed by the presence of other soil properties, such as iron content [21]. The moderate performance of pH indicated that properties with no direct spectral response still had the potential to be well predicted through spectral pedotransfer functions [22].
Predictably, when comparing similarity indices, different results were found. For example, the accordance for sand ranged from a moderate R² = 0.55 (SAM), to a relatively low R² = 0.33 (ED_pca). Thus, the selection of a proper similarity index is essential for determining suitable similar samples from spectral libraries of different scales. SAM provided the best or second-best performance for all six properties. SAM mainly considers the overall spectral shape, focusing less on the relative difference in reflectance. The performance of SAM_pca was better than SAM in some cases, but it performed poorly for sand (R² = 0.46). ED and ED_pca values were very similar, with ED performing slightly better. Surprisingly, the differences between ED and MD were small. We had expected MD to be superior because it accounts for covariance among bands, and the spectra were collected at high resolution and were highly correlated.

Comparison between spectral and integrative compositional similarity
Pearson correlations between the spectral similarity evaluated by the five indices, and their corresponding integrative compositional similarity (represented by Euclidean distance of PCs converted from the six physiochemical properties) are presented in Table 3.
The correlation between integrative compositional and spectral similarity was significant (p < 0.01) for all five similarity indices, with ED being the highest (in contrast to its moderate performance in individual physiochemical property evaluation), and MD the lowest. The high correlation achieved by SAM was consistent with its good performance as evaluated by individual compositional similarity. The different trends in similarity index performance between integrative and individual compositional similarity indicate that the interactions between physiochemical properties and spectral responses are substantial; thus, an integrated measure is not simply the sum of the simple measures.
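The Pearson correlation used for this comparison can be sketched in Python as below. The function name and the paired distances are hypothetical; the values are invented for illustration and are not results from Table 3.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / (spread_x * spread_y)

# Hypothetical paired distances for four test samples: the spectral distance
# to the matched sample and the corresponding compositional PC distance.
spectral_distance = [0.1, 0.2, 0.3, 0.4]
compositional_distance = [1.0, 1.5, 2.5, 3.0]
r_value = pearson(spectral_distance, compositional_distance)
```

A high positive value would indicate that samples that are close in spectral space also tend to be close in overall composition.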

PLSR model prediction accuracy
As presented in Table 4, the high accordance between spectral similarity and SOC similarity was achieved by ED and SAM. For PLSR model prediction, similar samples selected by SAM yielded the best performance, followed by MD and ED. For pH prediction, SAM and ED [...]. The best models for silt were based on SAM and SAM_pca, which is also consistent with the above similarity analysis. For prediction of sand composition, the models with the highest accuracy were also based on SAM_pca and SAM (RPD, 1.49-1.53).
In line with the comparison between spectral and individual compositional similarity, the PLSR models built on samples selected by SAM or SAM_pca provided the best performance for all of the physiochemical properties except CEC. Compared to the other similarity indices, SAM and SAM_pca selected samples that were closer in composition to the target samples. Two important variables influence the prediction accuracy of the PLSR model in this study: sample size and the selected similarity index. When the sample size was fixed (250 in the present study), selecting more compositionally similar samples achieved greater accuracy when using PLSR models for prediction. This aligns with our research hypothesis that the samples selected based on the optimal similarity index will achieve higher predictive accuracy for target samples. Considering the possible loss of information during PCA, the use of SAM is recommended over SAM_pca. The model based on ED was only slightly better than that based on MD, consistent with the small differences between them in the similarity comparison.
The RPD values of the PLSR models fell within two ranges: (1) for silt prediction, RPD was < 1.4; and (2) for all other PLSR models and indices, RPD was between 1.4 and 2.0. No models achieved an RPD > 2.0, possibly because of the large variations and heterogeneity of the global soil spectral library. We tested whether the RPD values of the five similarity indices differed significantly using pairwise t-tests. SAM was significantly superior to MD (p < 0.05); it also outperformed the other three indices (ED, ED_pca, and SAM_pca), but those differences were not significant. Thus, MD is not recommended considering its relatively poor performance. Although the difference between SAM and SAM_pca was not statistically significant, the performance of SAM_pca was unstable (relatively poor predictive performances for SOC and CEC), further supporting the use of SAM over SAM_pca. As shown in Table 4, the choice of similarity index had a substantial influence on the performance of the PLSR model for the physiochemical properties analyzed. For example, the RPD of SOC varied over a relatively wide range, from 1.45 to 1.67. Thus, the selection of a reliable similarity index is essential. The accordance information between spectral and compositional similarity can help select appropriate indices when one needs to select similar samples from a spectral library for the prediction of a target sample. In addition, as revealed by the results of the similarity analysis and PLSR models, the properties that have high accordance between individual composition and spectral similarity, for example, pH and clay in this study, can be more accurately predicted using PLSR models. In contrast, the accordance for silt and sand was low; therefore, their PLSR models performed poorly. Thus, the relationship between individual composition and spectral similarity can be used as an indicator of the potential of VNIR spectroscopy for the prediction of different properties.
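The RPD calculation and the thresholds applied to it can be sketched in Python as follows. The sketch assumes the sample standard deviation (n − 1 denominator) for SD, which the text does not specify, and it labels any RPD of 2.0 or more as "good" because the stated criteria do not cover values of 2.5 and above. Function names and data are hypothetical.

```python
import math

def rpd(observed, predicted):
    """Standard deviation of observed values divided by the RMSE of prediction."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Assumption: sample standard deviation with n - 1 in the denominator.
    sd = math.sqrt(sum((y - mean_obs) ** 2 for y in observed) / (n - 1))
    rmsep = math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)
    return sd / rmsep

def rpd_class(value):
    """Performance class following the thresholds quoted in the text."""
    if value < 1.4:
        return "poor"
    if value < 2.0:
        return "moderate"
    # Assumption: values of 2.5 and above are also treated as good.
    return "good"

# Hypothetical observed and predicted property values.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 2.5, 4.5, 4.5]
example_rpd = rpd(observed, predicted)
example_class = rpd_class(example_rpd)
```

Applied to the values reported above, `rpd_class(1.45)` and `rpd_class(1.67)` both fall in the moderate class, which matches the range described for SOC.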

Conclusions
Samples selected by SAM and SAM_pca were more compositionally similar to the target samples than those selected by the other similarity indices. Based on the similar samples selected by these indices, PLSR models achieved the highest predictive accuracy for all six of the soil physiochemical properties analyzed, except CEC. SAM is recommended over SAM_pca considering the possible loss of information during PCA. The findings support the hypothesis that the accordance information between spectral and compositional similarity can help select appropriate indices when one needs to select similar samples from a spectral library for predicting target samples.