Building geochemically based quantitative analogies from soil classification systems using different compositional datasets

Soil heterogeneity is a major contributor to the uncertainty in near-surface biogeochemical modeling. We sought to overcome this limitation by exploring the development of a new classification analogy concept for transcribing the largely qualitative criteria in the pedomorphologically based, soil taxonomic classification systems to quantitative physicochemical descriptions. We collected soil horizons classified under the Alfisols taxonomic Order in the U.S. National Resource Conservation Service (NRCS) soil classification system and quantified their properties via physical and chemical characterizations. Using multivariate statistical modeling modified for compositional data analysis (CoDA), we developed quantitative analogies by partitioning the characterization data up into three different compositions: Water-extracted (WE), Mehlich-III extracted (ME), and particle-size distribution (PSD) compositions. Afterwards, statistical tests were performed to determine the level of discrimination at different taxonomic and location-specific designations. The analogies showed different abilities to discriminate among the samples. Overall, analogies made up from the WE composition more accurately classified the samples than the other compositions, particularly at the Great Group and thermal regime designations. This work points to the potential to quantitatively discriminate taxonomically different soil types characterized by varying compositional datasets.


Introduction
Developing models for predicting soil behavior, such as biogeochemical reactions or the environmental fate of contaminants, typically begins with the application of generalized mass balance models that incorporate hydrologic and physical processes driving transport with experimentally determined parameters describing (i) the batch equilibrium distribution of the solutes between the soil and water phases (i.e., sorption distribution coefficient or K D ) and (ii) kinetic parameters describing nutrient or solute persistence, such as via abiotic/biotic degradation. Attempting to develop data that is broadly applicable to a wide variety of environmental conditions, experimental soil systems are generally manipulated to show simple connections between particular soil characteristics and its associated soil "behavior". For example, the environmental fate of contaminants is often considered in terms of comparing the contaminant's abiotic sorption on soils across varying soil organic carbon, clay, or metal oxide concentrations. For charged solutes, these systems are typically further manipulated to study the impacts of pH or redox conditions through the addition of acids or bases, and organic carbon. When studying biotically mediated contaminant fate, the experimental systems are even further manipulated to hold soil moisture, aeration, and temperature constant. While these experimental conditions are useful for elucidating potential mechanistic information related to how a soil's properties influence the eventual fate of contaminants, it is reasonable to assume that these manipulations may also promote substantial deviations in the natural biogeochemical behaviors. Thus, there is a risk that developing soil behavioral information using highly idealized systems could ultimately frustrate our ability to predict the environmental fate for natural soil types.
Much of the error in biogeochemical modeling originates from uncertainties in the parameters describing the soil and sediment compartments [1,2]. Attempts to overcome this problem procedurally through standardized experimental or analytical methods (such as via published ASTM, USEPA, or OECD methods) are challenged by our current inability to define thermodynamically the initial state of a soil system, as can be done in systems with homogenous solids-a challenge largely occurring due to the inherent heterogeneity of soils. Inherent soil heterogeneity makes it theoretically difficult to represent any particular soil biogeochemical behavior with a single "universal" coefficient, such as a solute partitioning coefficient (K D ) for the equilibrium contaminant sorption or a 1 st -order kinetic coefficient for indicating a contaminant's persistence in soil. In spite of these theoretical and practical limitations, these types of environmental fate parameters are almost universally employed in biogeochemical models. While the uncertainty in the models is nearly impossible to quantify, these models remain popular given that the environmental science community generally lacks any good alternatives. Some models attempt to reduce the uncertainty in using single universal coefficients to describe all contaminant interactions by utilizing different fate parameters for different environmental phases-a modeling approach employed in popular multicompartment models [3]. A second alternative approach may involve substituting single fate coefficients with empirical functions based on univariate soil physical and chemical characteristics [4,5]. Here, it is assumed that one can better tune an environmental model to a particular site of interest by incorporating the properties of soil. The dilemma here is that the analogy (defined in this paper as the "structural" construct underlying data interpretation) for a site of interest remains ambiguous and highly uncertain. This fact occurs because the analogy was built solely on quantitative characteristics without any consideration of the soil's "context". Without defining the soil's context or class, it is impossible to fully appreciate the importance of the quantitative characteristics, and ultimately, discriminate one soil from another [6]. Lacking the ability to unambiguously discriminate one soil from another makes it difficult to justify extrapolating any prediction of a soil's biogeochemical behavior generated from a poorly defined analogy to other sites of interest. Thus, a (perpetually) repeating, time-consuming, and expensive pattern that has emerged in the environmental science community requiring that every new site of interest conduct a corresponding new round of investigations.
Stepping back from the approaches discussed above to consider the pedomorphological features as primary descriptors of soil in environmental modeling provides some helpful perspectives. The most common pedomorphologically based soil taxonomic systems are the NRCS system and the Food and Agricultural Organization of the United Nations Educational, Scientific, and Cultural Organization (FAO-UNESCO) system-with its emerging update as the World Reference Base or WRB [7,8]. These systems were designed to provide a common criteria and terminology for not only distinguishing different soil types, but also discussing soil behavior and application, such as in agriculture or other technological uses of soils [9]. While pedomorphological considerations readily discriminate among different soil types, these distinctions are largely qualitative (based on visual inspection), supplemented by limited quantitative characterization data. Thus, the corresponding analogies explaining the similarity among different soil types in these classification systems are largely qualitative, containing quantitative information that is insufficient to develop a distinctive signature of each soil type. In theory, building quantitative (or numeric) analogies would make the highly discriminating capability of soil classification systems accessible to quantitative modeling, such as predicting soil biogeochemical behavior relative to a designated soil type. To be clear, soil taxonomic systems excel at providing the necessary "context" by which soil properties may be interpreted.
As formulated materials, Norris [10] considered soils as "multivariate entities", arguing that a more accurate description of a soil is made by considering multiple soil characteristic variables than any single variable alone. Numerous studies over the past few decades have emphasized this idea (to one degree or another) for soils and other environmentally relevant solids and matrices. Examples include identifying terpane chemical signatures based on different source geologies [11]; discriminating soil geochemical baseline for metal contaminated sites [12]; chemically distinguishing coal fly-ash from background sediment after a major spill in the U.S. [13]; discriminating minerals collected by NASA's Mars Science Laboratory rover based on remote laser-induced breakdown spectroscopy [14]; distinguishing rainwater by age [15] or simply chemically distinguishing sediments based on collection site [16,17] and time of deposition relative to establishment of the U.S. EPA's Clean Water Act [18]. Not only can these approaches be useful for discriminating characterization data, but also discriminating environmentally mediated processes. For example, multivariate models have been generated for correlating soil fungal and bacterial communities [19][20][21], and organic and inorganic contaminant sorption distribution coefficients (K D ) [22,23] and degradation [24,25] to soil "types" or classes. In these examples, soil type was not reported in terms of organized pedomorphic taxonomic systems, but generalized in terms of the geographical and vegetative criteria, such as desert soils, forest soils, coniferous soils, hard-wood growth forests, etc.
Aitchinson's landmark revelations regarding the nature of compositional data are extremely useful for circumventing latent structural artifacts in characterization data. The theoretical basis for compositional data has been reviewed elsewhere [26]. However, it is important to point out that nearly all soil characterization data is inherently compositional in nature, and thus, CoDA theoretical considerations are not only appropriate but necessary. Consider that nearly all quantitative soil characterization data are reported in units of concentration, such as part per million (e.g., mg kg -1 ) or as a percentage, and composed of all positive (non-negative) values. As such, these traits are key indicators of compositional data, implying that if all parts of the composition were analyzed, they would sum to 100%. This means that the composition is "closed"; a "fixed-closure bias" is evident in the structure of compositional data that appears distorted in Euclidean space. These distortions give way to potentially spurious statistical correlations obtained using classical structural methods. Thus, analysis must be conducted either in "compositional" space, or in real space after the data is correctly transformed.
This paper describes our efforts to develop quantitative analogical models for different soil types, utilizing taxonomic descriptors originating from the NRCS soil classification system. Here, the analogies were built using soil characterization data partitioned into three different compositions described hereafter.

Soil selection, collection, processing, and characterization
Pre-selected soil series were targeted based on interest in populating different subcategories of soil types within the Alfisols order. We preferentially sought out pristine, non-anthropogenically disturbed sampling sites containing native vegetation. For this reason, we targeted our collections at State Parks (after receiving both verbal and written permissions from Park Rangers) and historically non-used areas (as recommended by range managers) on U.S. Army bases in the Eastern and Midwestern U.S. At each site (Fig 1 and Table 1) the profile was sampled using a soil corer to determine if the NRCS-designated soil series descriptions qualitatively matched the observed features. Once satisfied, we collected samples by excavating soil approx. 10-20 inches (25.4-50.8 cm) down the profile, sampling each horizon based on visible delineators and the soil series descriptions. Samples were collected in plastic bags, sealed, and shipped to our laboratory in Vicksburg, MS, USA, where they were air-dried, sieved, and ground using a soil grinder to pass through a 2-mm sieve, homogenized using an acoustic mixer, and then stored in polypropylene bottles at 4˚C until used. The soil mixture was subsampled in triplicate, and geochemically characterized using common soil characterization methods (S1 Table) used in environmental science for focusing on surface interfacial properties.

Multivariate modeling for analogy generation
All statistical modeling was conducted using R statistical computing software [28] through the graphical interface RStudio [29]. Soil taxonomic information for the collected soil series were extracted from the USDA-NCSS soil databases using the "soilDB" package [30]. Soil characterizations were conducted using both chemical and physical analysis (see Supporting Information). Soil chemical properties were quantified based on solutions analyzed for as many solutes as possible in order to capture any lingering (and often ignored) covariate information. The characterization data was divided into three different compositions based on the experimental methods employed: (i) water-extracted (WE), representing a composition measurement obtained by washing the soils with a dilute salt solution (containing 20 variables), (ii) Mehlich-III extracted (ME), representing a composition obtained using a weak acid solution (containing 14 variables), and (iii) particle size distribution (PSD, containing 3 variables). The ratio of variables to observations conforms to more "tolerant" rules of the appropriate dimensionality required for stable PCA results [31]. In order to develop the analogy models, all solution characterization data were uniformly converted to homogeneous mass concentration units (mg kg -1 ), including pH using -10^(pH), and electrical conductivity (EC) by assuming 1 mS cm -1 = 10 meq(+) L -1 = 640 mg L -1 total salts [32]. The exception to this conversion was our calculation of cation exchange capacity (CEC), in units of cmol(+) kg -1 , which was calculated by summing the charge-equivalent Na, Ba, Ca, Mg, and K concentrations measured from the Mehlich-III extractions.
Missing data patterns (due to analytical non-detects) were studied using the "zCompositions" package [33] for R. Missing variables that exceeded 50% of the total samples were removed from the matrix (see Supporting Information). As a result, we removed ortho-PO 4 , B, P, S, Sn, Ti, and Sr solutes from the WE composition, and P, Sr, S, Th, Sn, and Ti from the ME composition. Missing "zeroes" in all compositions were replaced using zComposition's multivariate imputation method from a matrix containing estimates of the analytical detection levels of the missing characterization data. The final variable set was closed to the unit sum of 100 using the clo command in the "compositions" package [34].
Analogies were created by modeling the data using robust principal component analysis (PCA) via pcaCoDa command in the "robcompositions" package [35]. Here, PCA was performed first using an isometric log-ratio (ilr) transformation of the dataset (using a default basis set), followed by conversion to a centered logratio (clr) transformed data for interpreting results. The clr transformation is defined as: Here, the terms in Eq 1 represents the logratio of each variable and the geometric mean across the different components as gðxÞ ¼ ð Q D i¼1 x i Þ 1=D . Given no particular algorithms exist for determining the stopping point of PCA for compositional data, the optimum number of principal components (PCs) was selected based on the recommendations by Jackson [36]. Here, we considered a combination of the calculated eigenvalues and percent explained variance for each component, as well as the overall shape of the Scree plot (See Supporting Information). According to Jackson [36], agreement between the Scree criterion and eigenvalues > 1 is typically exhibited in structured data. Irregular Scree plots are more indicative of random variation within the data; thus, we opted to choose PCs with eigenvalues > 1. Robust clr-loadings were extracted from the PCA and plots were studied for opportunities to remove variables that were redundant, or otherwise, provided no significant information. Afterwards, robust clr-scores were extracted from the model and used for statistically discriminating the samples via linear discriminant analysis (LDA) using the "MASS" package [37]. Shapiro-Wilk tests showed that all clr-transformed data conformed to a normal distribution (see Supporting Information). However, for compositional data, vector lengths graphically represent the deviation of the variable from the model center (i.e., the plot origin in reduced space), or the geometric mean (g (x)) in Eq 1-thus, the clr transformation) of each variable [38]. Compositional biplots are interpreted by studying the log ratios between pairs of variables x i and x j represent as:

PCA-based soil analogies for the three different compositional datasets
Eq 2 is simple but important to keep in mind when considering the relationship between any pair of variables. Large log ratios indicate high variation between the two variables, while the inverse is true for small log ratios. Log ratios close to zero indicate very little difference between the variables, opening the way for possibly combining the variables to filter out nonessential information. Measured link distances (lines drawn from the tip of one variable vector to another) graphically represent the log-ratio variances, while the pairwise vector lengths show their relative standard deviation. These basic properties of the compositional data analysis hold important implications for soil analysis; that soil data collections can be intelligently prioritized based on the most important information. What constitutes the most "important" information for discriminating different soil "types" as well as identifying important variables driving soil reactions remains unclear, however, the fact that any one subcomposition is scale invariant [39] may simplify the choice of characterization variables. All this assumes that highly significant and robust analogies can be developed. The water-extracted (WE) composition was described with a 5-PC model (based on shape of the Scree plot and eigenvalues > 1) with the PC 1-2 representing 80% of the explained variance, making it a good candidate for exploratory analysis using biplots. In particular, we examined the variance matrix and loading plots to explore logratios and subcompositions of low and high variance (Fig 2A), looking for opportunities to reduce the size of the composition [40]. Log ratios contributing least to the total variance of the PCA model were log(Ba/Na), log (Na/Zn), representing approx. 0.7 and 1.4% of the explained variance, respectively (see Supporting Information for the logratio variation matrix). However, none of these rays for the different log ratios overlapped in the biplot (Fig 2A), therefore, were not appropriate to combine into a single variable. On the other hand, rays for log(V/NH 4 .N), log(Ni/Mg), log(Fe/V), log (Cu/EC), and log(pH.H 2 O/pH.CaCl 2 ) noticeably overlapped, but the links were sufficiently large to discourage further combining these variables as well.
After evaluating the WE composition for nonsignificant information, we examined the biplot for evidence of collinearity among the remaining variables [39][40][41]. Collinear variables were suggested by different rays situated on a common line, whether in the same or opposite direction. The most obvious example in the WE composition was observed in the long link connecting the pH.KCl and Ca variables along the axes of their rays (Fig 2B). Calculations showed that a pH.KCl-Ca-EC subcomposition contributed only 0.045% to the total variance. A ternary principal component (t-PC) analysis (Fig 2B) showed the data points clearly following the line defining the principal axis. The first PC, explaining 95% of the variance in the subcomposition was highly loaded by pH.KCl. The NH 4 .N-SO 4 -Cl subcomposition also appeared similarly collinear, however, considerable scattering of the data around the principal axis ( Fig  2C), probably reflecting the large number of nondetects in the Cl data that were modified by the multivariate imputation technique.
In the biplots (Fig 2), clustering was evident for most of the samples (represented as clrtransformed scores), except the populations appeared "stretched" across the principal axis, suggesting the presence of leverage outliers (thus, the use of robust principal component algorithm). This outcome was expected given diversity of taxonomically distinct soil types captured in this work under the Alfisols Order. From the biplot in Fig 2, narrow linear boundaries or margins separating several of the clusters (at the Great Group level) were apparent for PC 1-2, particularly for the Fragiudalfs, Paleudalfs, and Paleustalfs samples, largely occupying negative PC-2 space. On the other hand, the Hapludalfs samples appeared to cluster in two separate populations, one at negative PC-1 and the other in positive PC-1 space, with the small Glossudalfs group.
For the ME composition, we obtained a two PC model (based on the eigenvalues > 1 given the unusual shape of the Scree plot), explaining only 50% of the total variance. Sample clustering in the biplot based on the Great Group designation was much less apparent with the ME composition, thus we expected this composition to be less discriminating than the WE composition. Furthermore, there were limited opportunities to simplify the composition. For example, the very low variance of log(Ca/CEC) suggested that the two variables may be redundant, however, the two rays did not overlap in the biplot (Fig 3). This lack of overlap may be an artifact of the low explained variance of the PCA model, with the low variance of the log(Ca/CEC) possibly suggesting the links were either very small or orthogonal [39]. Whichever the case, we decided against combining variables in the ME composition given the low explained variance of the overall PCA model. Also, t-PCA analysis did not find any strong collinear relationships among the reduced ME composition (plots not shown).
We realized that the ME composition in itself contained a unique subcomposition describing the exchangeable cation content at the soil surface. Thus, we created a separate CEC subcomposition from the exchangeable cations, Na, Ba, Ca, Mg, and K, and added exchangeable Al given the appearance of collinearity in Fig 2B (which could not be confirmed because of the low explained variance of the model for the ME composition). From the CEC subcomposition, we selected a 3-PC model that explained 84% of the total variance in the data, representing a substantial improvement over the full ME composition. Interestingly, the biplot (Fig 3B) was similar to that obtained in Fig 3A, suggesting that the variance among the exchangeable cation logratios dominated the structure in the ME composition (i.e., subcompositional coherence). The log (Ca/Mg) represented the largest link, suggesting that the log(Ca/CEC) in the ME composition was probably redundant. With the higher explained variance of the CEC subcomposition, the collinear relationship between Al and Ca was more evident (Fig 3C).
The PCA model for the PSD composition was limited to two PCs, given that the composition consisted of only three parts. Thus, we plotted this PCA results as both in compositional and clr-transformed biplots (Fig 4). The first PC explained 98% of the total variance, and highly loaded by %Sand. It was more difficult to observe clustering in this data, in part due to the smaller populations (as described previously), however, there were some apparent relationships. For example, the Fragiudalfs appeared to cluster predominantly in negative PC-1, within the log(Clay/Silt) (Fig 4A), while the "older" Paleudalfs and Paleustalfs mapped out in positive PC-1, better described by the log(x/Sand). This clustering is somewhat evident in the t-PCA plot as well (Fig 4B), with clustering at higher %clay, and toward the %Silt vertex.
LDA tests were conducted to determine the extent in which the different classes were discriminated based on their PCA models. Here, the cross-validated results were used as the test set. Overall, the WE composition ( Table 2) was highly discriminating for most of the classesan outcome implied by the linearly separable clustering apparent in the biplot (Fig 2). The WE composition best discriminated the general Suborder class, with most misclassifications occurring due to the model confusing samples as Udalfs (representing the largest population of samples). However, at the Great Group level (representing one level down in the NRCS taxonomy), the WE composition was much less accurate. While Hapludalfs and Fragiudalfs were correctly classified 73 and 80% of the time, the other Great Groups were confused by the Hapludalfs designation. Similarly, the WE composition most consistently identified the correct temperature classes (under the Family name), with the accuracy for all three thermal regimes ranging from 87-91%. The mesic and thermal regimes were the most confused with each other by the WE composition, while no thermic regime samples were confused with the frigid temperature class. The ME composition (Table 3) was markedly inferior in discriminating the difference classes compared to the WE composition. Similar to the WE composition, the ME composition well-discriminated the Suborder class, but confused all of the Great Groups with the Hapludalfs designation. Similarly, the ME composition confused all of the frigid and most of the mesic samples with the thermic temperature regime. Both the WE and ME compositions were largely confused by the semi-active clay activity designation, but the WE composition did a much better job at discriminating the active and super-active classes. The CEC composition Table 3. Misclassification matrix from linear discriminant analysis testing the accuracy of 2-PCA model used to describe the Mehlich-III extracted (ME) composition and a 3-PCA model for the CEC subcomposition to correctly assign class memberships of the Alfisols samples among the different Great Group taxonomic designations. Percent correct classifications are shown on the diagonal (in bold) while percent misclassification samples are shown off-diagonal. Accuracy of the CEC subcomposition is given in parenthesis. Quantitative soil analogies using different compositional datasets showed a similar ability at discriminating the different samples to the ME composition (Table 3). In this sense, the subcompositional coherence of the ME composition (using the CEC subcomposition) was apparent. We expected the PSD composition (Table 4) to be the least discriminating composition, given the limited chemical information inherent in this composition. Yet, to our surprise, the PSD composition exhibited a discriminating ability that was, overall, on par with the ME and CEC compositions. Where we expected the PSD composition to excel, in the texture designation (Family name), it performed no better than the other compositions. In general, the different compositions were similarly confused by coarse-loamy, loamy, and fine-loamy textures, but was more accurate in distinguishing the fine-loamy and fine-silty textures.

Discussion
This work agrees with other reports [42,43] showing that multivariate analogies built to represent different soil "types" can do well to capture the natural complexity of soils as articulated by soil classification systems, even in a relatively well-defined soil Order like Alfisols. In this paper, we studied the application of compositional theory to different soil characterization data of Alfisols within the Eastern and Central U.S., and its implications for discriminating soil classification-analogies. The results showed that the WE composition provided the best discrimination among the different soil classes, both at general and more specific levels of classification. Surprisingly, all of the compositions similarly resolved the soils at the Suborder class, pointing to their value for discriminating soils at more general classification levels. The WE composition was noticeably better at discriminating the samples than the other compositions only one step down the classification hierarchy, at the Great Group level. However, we observed that all of the compositions were confused by the Hapludalfs designation. This may be attributed in part to the inherent ambiguity of this particular class, given that Hapludalfs serves as a "catch-all" designation for unremarkable Udalfs. If so, this represents a potential opportunity for meaningfully revising this designation relative to the composition of interest. At the temperature regime designation (within the Family name), the WE composition correctly classified samples nearly 90% of the time. This result was especially surprising given that the WE composition was missing any information related to soil carbon, a major characteristic typically used to distinguish soils from different climatic regimes. Additional analysis adding soil carbon, nitrogen, and sulfur contents as non-compositional variables (not shown) to the WE composition seemed to have no effect on the misclassification rate for the thermal regime.
Related to this point was the observation that all of the compositions exhibited similar ability to discriminate between surface and subsurface horizons. Here, horizon position was overwhelmingly confused by the subsurface horizon. This probably reflects that fact that the surface horizons for collected Alfisols are typically thin, making it easier to confuse with the subsurface horizons, particularly when diffuse horizon boundaries were apparent. However, adding carbon, nitrogen, and sulfur contents as non-compositional variables to the WE composition greatly improved the accuracy of the horizon position predictions from 68 to 83% (S4 Fig).
CoDA principles provided a clearer theoretical basis for more detailed interpretations of the biplots than we've experienced with classical statistical techniques. CoDA also provided a reasonable basis to compare the value of each compositional data set. For example, the collinear relationship in the pH.KCl-Ca-EC subcomposition was sensible, pointing to well-known soil pH buffering mechanisms such as the dissolution of soil CaCO 3 and release of exchangeable Ca. Witnessing this behavior in the WE composition was reasonable, but, of course, not anticipated in the ME composition because the Mehlich extractant generally overwhelms the soil pH. This pH buffering behavior was clearly more important for discriminating the different soil classes than via acid-extractable solutes or particle size distributions, and perhaps gets us closer to articulating true aqueous chemical behavior in soils. On the other hand, the CEC subcomposition showed the linear relationship between exchangeable Al and Ca cations occupying the exchange phase. At a lower pH, the exchange phase is relatively more enriched in Al 3+ cations than Ca 2+ at higher pH. In total, the WE, ME, and CEC compositions emphasized the importance of pH as a dominant soil characteristic.
There are several implications arising from this research. Our results alluded to the value of different compositions emphasizing different ranges of soil behavior. This point is particularly important given that soil data comes in many "shapes and sizes", representing an important barrier in our efforts to build the described soil analogies. Researchers may use different combinations of destructive (extractions, digests) and non-destructive (XRF, x-ray diffraction) techniques depending on the rarity of the samples, equipment availability, and logistics of transporting soils. To our knowledge, no standard method or protocol is universally suitable for chemically and physically characterizing all soils. This limitation becomes painfully obvious when seeking consensus on the properties important for various soil-mediated reactions, such as contaminant environmental fate or spectroscopic response. Thus, the way forward may involve studying the value of different compositions as opposed to developing a standard characterization protocol for soils, taking advantage of the soils' subcompositional coherence when considering gaps in one characterization set to another.
These results suggest that classification-analogies could be developed for other NRCS soil types, expecting that inherent or latent structure will similarly emerge from geochemical characterization data as shown in this work for the Alfisols Order. In particular, statistically discriminant analogical models are expected for the more pedomorphologically distinguished taxonomic Orders, such as Mollisols, Ultisols, and Spodosols, and their accompanying Suborder and Great Group sublevels. These results indicate that classification-analogies may be developed for other soil classification systems, such as the previously mentioned WRB, if the specific soil taxonomy of samples can be identified, and the systems are relatively free of arbitrary or poorly defined criteria. In addition, classification analogies may be useful for reconnaissance-related applications, such as developing calibration sets using local soil types for extrapolating soil properties and behavior to remote or otherwise inaccessible soil types. Overall, this research points to the importance of considering soil taxonomy when sampling, providing a means for more explicitly incorporating soil taxonomic information in research.
It is important to note that this work does not include a critical analysis of the NRCS classification system, nor is it the intention of this work to strictly follow conventions among pedologists using the NRCS system in terms of collecting or presenting specific data relative to the classification level. For example, soil nutrient data is typically associated at the Family level in the NRCS classification system, which provides information relevant to land-use, such as whether the land is under agricultural production, etc., because our interest was more related to developing signature earlier on in the hierarchy. This approach was adopted intentionally with the goal of building up a unique soil archive where soil samples could be repeatedly tested for different complex soil processes, such as contaminant sorption or degradation using the same set of soil analogs, and their corresponding multivariate signatures, as opposed to collecting new soil samples when new inquiries arose in terms of complex processes.

Conclusions
In this paper, we explored the application of CoDA theory to develop highly discriminating soil classification-analogies. We showed that the potential of three different compositions, and one subcomposition, made up of soil characterization data, to accurately predict both taxonomic and location-specific classes of soils. Overall, the WE composition best discriminated the different soil classes.
Supporting information S1  Table. MANOVA results on the robust PCA scores for the particle-size composition.