Ultrahigh Dimensional Variable Selection for Interpolation of Point Referenced Spatial Data: A Digital Soil Mapping Case Study

Modern soil mapping is characterised by the need to interpolate point referenced (geostatistical) observations and the availability of large numbers of environmental characteristics for consideration as covariates to aid this interpolation. Modelling tasks of this nature also occur in other fields such as biogeography and environmental science. This analysis employs the Least Angle Regression (LAR) algorithm for fitting Least Absolute Shrinkage and Selection Operator (LASSO) penalized Multiple Linear Regressions models. This analysis demonstrates the efficiency of the LAR algorithm at selecting covariates to aid the interpolation of geostatistical soil carbon observations. Where an exhaustive search of the models that could be constructed from 800 potential covariate terms and 60 observations would be prohibitively demanding, LASSO variable selection is accomplished with trivial computational investment.


Introduction
Global soils have been estimated to contain the largest pool of terrestrial organic carbon in the biosphere, storing more carbon than all land plants and the atmosphere combined (Schlesinger, 1977). The importance of the dynamic equilibrium between carbon in soils and carbon in the atmosphere has been illustrated by such estimates as there having been 3.3 times the amount of carbon 1 in the atmosphere as CO 2 (g) present in global soils (Lal, 2004). More than half of the global soil carbon pool has been estimated to be comprised of organic compounds collectively referred to as soil organic carbon (hereafter SOC) (Lal, 2004). SOC may be depleted to as little as 25% of capacity when natural ecosystems are converted into agricultural systems with the majority of this carbon lost to the atmosphere as CO 2 (g) (Lal, 2004). The contribution such SOC losses would have made to terrestrial carbon dynamics may be appreciated in the context of the estimate that 34% of the global land surface had been devoted to agriculture by 2007 (Betts et al., 2007). Recharging SOC levels by sequestering CO 2 (g) in agricultural soils has been demonstrated to provide direct benefits to agriculture, in addition to providing an opportunity to partially offset anthropogenic green house gas emissions (Lal and Follett, 2009). Consequently, it is a key feature of national and international level carbon accounting endeavours.
The effort and cost associated with sampling SOC via laboratory analysis of soil core samples has led to a need to improve soil core sample based maps of SOC through statistical modelling using more readily attainable environmental variables as covariates (Mueller and Pierce, 2003;Barnes et al., 2003;Chan et al., 2008;Johnson et al., 2001;Simbahan et al., 2006a;Miklos et al., 2010;Vasques et al., 2012). Such modelling is often accompanied by two challenges. The first is spatial misalignment of observations of different variables and or observations and the locations (or coverage extents and resolutions) to which SOC is to be interpolated. The second is the availability of large numbers of potentially relevant covariates coupled with the belief that selected model(s) should be sparse. In this paper, we address the challenges of spatial misalignment and selection of a parsimonious subset of covariates to aid spatial interpolation of the response under an ultrahigh dimensional scenario. We achieve this by showcasing the performance of Least Absolute Shrinkage Selection Operator (LASSO) penalized Multiple Linear Regression (MLR) models on data from a real world case study of soil core derived observations of %SOC across 137ha of agricultural land in New South Wales, Australia. The remainder of the article is structured as follows. Section 2 describes the field sites along with the data collection, collation and spatial realignment for our case study. In Section 3 we explain the motivation of our choice of LASSO variable selection and summarize the key characteristics of this method. Section 4 contains the results and discussion of our analysis of the case study data. In particular we compare LASSO variable selection to four popular variable selection methods, evaluate the set of selected covariates and describe fitting spatial polynomials to the residuals for more precise interpolation of %SOC. We describe how we combine the predictions from these spatial polynomials with the predictions from the covariate based modelling to produce a full cover predicted raster for %SOC. We conclude with a discussion of this work and promising avenues for future research in Section 5.

Data Collection & Collation
Our case study data were collected from a 137ha area of native pasture with remnant woody vegetation on the Sustainable, Manageable, Accessible, Rural Technology (SMART) Farm of the University of New England near Armidale, New South Wales, Australia. The 60 observations of our response variable, percentage soil organic carbon (%SOC), include 57 values less than 2.55% while the remaining three values are 3.08%, 5.01% and 5.13%. We summarize the 63 environmental characteristics we consider here as potential covariates in Table  1. The Digital Elevation Model (DEM) derived covariates were calculated with the System for Automated Geoscientific Analyses (SAGA v2.1.0 (Conrad et al., 2015)) software and the resulting GIS layers were read into R (R Core Team, 2015) with the 'RSAGA' (Brenning, 2008) package. The remaining raster covariates were read into R with the R package 'raster' (Hijmans et al., 2015). Further details regarding the study site, field methodology and covariates are provided in Appendices A and B.

Spatial Realignment of Covariate and Response Observations
The geostatistical response observations are spatially misaligned with the geostatistical covariate observations and with the raster covariate observations. As the majority of our raster based covariates are derived from the 25m 2 resolution DEM, we realign all covariates to 25m by 25m squares centered on each response observation. Geostatistical covariates are interpolated to regular 100 by 100 point grids centred on the response observations via thin plate splines with the R package 'fields' (Nychka et al., 2015). The values of raster covariates are queried at these same grids of points centred on each response observation. The realigned value of each covariate associated with each response observation is taken as the mean of the values of this covariate across the grid of points centred on that response observation.
We use LASSO modified MLR fitted via the LAR algorithm in our case study analysis. Model-averaging the predictions from the LASSO solutions obtained from LAR executions within a cross validation scheme yields an aggregate estimate in a manner similar to random forests, bagged trees and boosted trees. A cross validation based approach also facilitates estimation of the shrinkage parameter for the LASSO fits (λ in Equation 11). The choice of LASSO modified MLR allows the importance of covariate terms (linear, non-linear and interaction) to be compared in terms of which have the coefficients that are shrunk to zero and which are assigned non-zero values. In contrast, whether the overall role of a covariate within the aggregated estimate from random forests, bagged or boosted trees is closer to linear or non-linear (and if non-linear what manner of non-linear) would be harder to judge from the results of such a fit. This ease of interpretability of the LASSO modified MLR comes with the cost of having to recenter and rescale (to mean zero and magnitude one) all covariates in each training each set (a requirement of the LAR algorithm (Efron et al., 2004)) and mirror those transformations on each associated validation set. Whereas, such transformations are unnecessary for binary tree based techniques.

LASSO Variable Selection as a Special Case of PLS
Penalized Least Squares (PLS) coefficient estimates (β in Equation 11) are calculated by identifying the coefficient estimate vector that minimizes the sum of the residual sum of squares and the result of applying some penalty function to the coefficients. Simple PLS estimates use the L γ norm p j=1 |β j | γ of the coefficient vector β for some γ > 0 as the penalty function so that where the tuning parameter λ controls the degree to whichβ is shrunk towards the zero vector (Ahmed, 2014). When γ is set to 1, the solution to Equation 11 is the L 1 PLS estimate of β, also known as the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996). When γ is set to 2, the solution to Equation 11 is the L 2 PLS estimate of β which is referred to as a ridge regression estimate (Hoerl and Kennard, 1970). Other penalized least squares techniques including adaptive LASSO (Zou, 2006), Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li, 2001) and Minimax Concave Penalty (MCP) (Zhang, 2010) are derived through use of more complex penalty functions in place of the L γ norm in Equation 11. Solving Equation 11 with γ set to a value of 2 or less results in the values of some coefficients being estimated as zero exactly (how many depends on the value of the tuning parameter λ) (Ahmed, 2014). Since a coefficient estimate of zero is equivalent to exclusion from the selected model such a solution effectively performs both variable selection and shrinkage. As such, L γ penalized estimation with γ < 2 is applicable to our case study where the number of potential covariates exceeds the number of observations (p > n).
The requirement for a computational solution to L 1 penalized estimation (stemming from the presence of the absolute value in Equation 11) was originally addressed via relatively computationally expensive quadratic programming (Tibshirani, 1996) and has been addressed more recently by the computationally efficient Least Angle Regression (LAR) algorithm (Efron et al., 2004). From the PLS family of techniques we chose to adopt L 1 penalized estimation for three reasons: 1) suitability for variable selection and modelling with correlated covariates 2) suitability for variable selection in scenarios with p > n and 3) the computational efficiency of the LAR algorithm (Efron et al., 2004).
The LAR algorithm has been designed such that covariates continue to be added to the model until either the available degrees of freedom are exhausted or there are no covariates outside the current model that have a correlation with the current residual vector greater in magnitude than some user specified threshold value. In the case of the LASSO modification of the LAR algorithm, while steps of the algorithm may result in a covariate being removed from the current model, the algorithm still proceeds to add and remove covariates from the current model until either of the above criteria are met. Subsequently, the LAR algorithm (and the LASSO variant thereof) returns a sequence of selected models from which it is necessary to choose a parsimonious final model. Efron et al. [2004] derive a C p style stopping criterion for the LAR algorithm but note that this is most appropriate in scenarios with less potential covariates than observations. Alternative stopping criteria, applicable to more general scenarios, also exist (Valdman et al., 2012) though cross validation is a popular approach for ultrahigh dimensional problems (Engelmann and Spang, 2012;Usai et al., 2012Usai et al., , 2009). Hence, a cross validation based approach to making the final selection from the sequence of selected models produced by the LAR algorithm is adopted here. All analysis is conducted in the R language and environment for statistical computing (R Core Team, 2015) and all graphics are produced with the R package 'ggplot2' (Wickham, 2009). The data and R code associated with this work will be provided via a repository located at https://github.com/brfitzpatrick/larc once this work is published in a peer reviewed journal.

Comparison of Variable Selection Methods for MLR
We compare LASSO variable selection to the more generic variable selection methods: exhaustive search, forward stepwise selection, backwards stepwise selection and sequential replacement selection (also known as stepwise forwardsbackwards variable selection) on the case study data. Due to the complexity of interacting processes that may influence the formation, distribution and loss of SOC across the study site we consider polynomial terms up to order four for each covariate and all possible pairwise interactions of the covariates. The full set of potential covariates thus expands from 63 to 2205 potential covariate terms (63 * 4 + 63 2 ). With 60 observations of the response, if we wished to explore all possible models from an intercept only model up to those that used the available degrees of freedom, we would need to fit and compare some 60 i=1 2206 i ≈ 2.27 * 10 118 different models in an exhaustive search. To reduce the number of covariates considered and thus the required breadth of exhaustive search we pre-filter the design matrix such that no remaining pair of covariates have a correlation coefficient greater in magnitude than some critical value.
Since the correlation of a potential covariate with the response may be a poor indicator of the explanatory utility of this covariate in the presence of other covariates, we choose between highly correlated pairs of covariates based upon the spatial resolution at which each is available. The motivation behind this decision being an effort to optimise the spatial accuracy of our interpolation of the response. For covariates with the same spatial resolution, the one derived from the simpler function of observed data is chosen, otherwise the choice is made at random. These criteria are discussed in more detail in Appendix D.
Pre-filtering to enforce a maximum correlation coefficient magnitude (hereafter MCCM) of 0.4 between remaining covariate pairs results in a design matrix with 27 covariate terms. The branch-and-bound algorithm implemented in the 'leaps' package (Lumley and Miller, 2009) requires only a subset of the 28 i=1 28 i ≈ 2.68 * 10 8 models it is possible to construct from this design matrix to be fitted in order to determine the optimal model that would be returned from a full exhaustive search (Millar, 2002). As our aim here is to build models that make the best use of covariates for spatial interpolation of the response, the metric by which we compare the results of these variable selection techniques is the ability of the selected models to predict data held out from the fitting process. We conduct these comparisons on 500 unique divisions of the data into training and validation sets in a cross validation scheme. We use training and validation sets of 35 and 25 observations, respectively. The selection of a training set size is discussed in Appendix E.
Training sets constructed from the design matrix composed of 27 covariate terms are supplied to each of the variable selection methods (LASSO variable selection, forward selection, backward selection, sequential replacement and exhaustive search variable selection). In each case the final selection from the sequence of models returned is made to minimise the validation set element prediction error (here after VSEPE) sum of squares. The distributions of VSEPE absolute values from each variable selection technique are summarized in Table 2. When applied to these austerely pre-filtered design matrices, all five variable selection techniques yield very similar VSEPE distributions. The first three quarters of the ordered VSEPE absolute values obtained from LASSO variable selection are slightly more compressed towards zero than those from any other technique considered.
Predictions from the 500 selected models (one per training set) are modelaveraged with weights inversely proportional to the prediction error sum of squares on the associated validation sets. Taking i to index the 500 divisions of the data into training and validation sets, the weights for model-averaging, W i , are calculated following Equation 2. Here e i, j is the prediction error of the j th element of the i th validation set where each validation set has v elements.
The noticeable improvement in accuracy of the model-averaged predictions from the models selected by LAR is shown in the column of coefficient of determination values in Table 2. Corresponding summary statistics for the absolute values of the VSEPE obtained from model fitted to 800 term design matrices that result from using a much less stringent MCCM of 0.95 are also included in Table  2 along with the coefficient of determination for the associated model-averaged predictions. Similar, but greater magnitude, improvements are observed between the LAR selected models for the 27 covariates design matrices and the 800 covariates design matrices as were observed between models selected by other variable selection techniques and LAR selected models. These improvements come with an increased computational cost, but the application of the LAR algorithm to these expanded design matrices is still feasible even on a mid range laptop computer whereas exhaustive search variable selection on these expanded design matrices would be infeasible. The positive outliers in all the VSEPE distributions are likely the result of the three positive outliers in the response. When these are drawn as members of a validation set, models built from the associated training set likely under-predict these values in the validation set.
The distributions of the numbers of covariates selected by each of the variable selection methods from the 27 covariate design matrices are depicted in Figure 1. The LASSO method results in intercept only models far less frequently and larger numbers of covariates per model more frequently than the other techniques. The differences in predictive accuracy and numbers of covariates selected per model, between the LASSO and the forwards stepwise OLS based method may be explained in terms of the comparative theoretical properties of these algorithms. At each step in the respective algorithms both approaches choose the covariate most correlated with the current residual vector for inclusion in the current model. However, LAR adds this new covariate to the model in such a manner that the resulting prediction vector is equiangular between the previous prediction vector and this new covariate vector and only proceeds along this new prediction vector until some other covariate outside the current model is as correlated with the current residual vector as the most recently added covariate before repeating this procedure. Forwards selection, backwards stepwise variable selection and sequential replacement variable selection lack this facility to compromise between the correlated covariates. Furthermore, the differences between the results of LASSO variable selection and the exhaustive search variable selection may well stem from exhaustive search variable selection using OLS model fitting while the LASSO variable selection uses PLS based model fitting.

Frequently Selected Covariates
The numbers of the 500 selected models in which particular covariate terms occur can serve as an indicator of the relative importance of these terms for predicting the response. In Table 3 we list the 15 most frequently selected terms from LAR variable selection on the 800 column design matrices. Table  3 also lists covariate terms from the 2205 column design matrix which were very highly correlated (|r| > 0.95) with these top 15 covariates and were thus excluded from the analysis. A chord diagram depicting the selection frequencies of all 800 covariate terms is presented in Figure 2. The complexity of interacting processes producing the spatial distributions of SOC in agricultural landscapes like that of our case study site is reflected in the diversity of the categories of covariates terms selected (soil EC a , vegetation indices, DEM derived metrics, Table 3: The 15 most frequently selected covariates from the LAR variable selection executions on the 500 unique, 35 observation training sets constructed from the design matrix created by pre-filtering the full design matrix to enforce a maximum permitted correlation coefficient magnitude between remaining covariates pairs of 0.95. The second column contains the frequencies with which the selected covariates occurred in the 500 selected models. Accompanying each selected covariate in the final column are the covariates from the full design matrix that had correlation coefficient magnitudes with the covariate in question greater than 0.95 and thus were excluded from the design matrix supplied to the variable selection. Colons denote interaction terms for the two covariate terms the colon separates. Numeric superscripts denote polynomial terms for the covariate indicated by the acronym. Acronyms are expanded in Table 1 Figure 2: Covariate term selection frequencies across the 500 selected models obtained from applying the Least Angle Regression variable selection algorithm to the 35 observation training sets constructed from design matrices produced by pre-filtering the full design matrix to enforce a maximum permitted correlation coefficient magnitude between covariate pairs of 0.95. Poincaré segments represent interaction terms between the covariates they connect. Covariate Acronyms are expanded in Table 1. magnetic imagery, radiometric imagery and foliar projective cover layers) and the mixture of linear terms, higher order polynomial terms and interactions of linear terms selected for these covariates.

Modelling Spatial Component of Error
Following the model-averaging described above we fit a spatial model to the residual %SOC variation at each soil core location. This allows spatial position to serve as a locally appropriate proxy for all the unobserved processes and interactions that may influence the spatial distribution of %SOC at our case study site. One approach would be to use Kriging to spatially interpolate the residuals, but this requires the comparison of numerous pairs of orthogonal, directional, empirical semivariograms. A more attractive alternative is to calculate an empirical semivariogram raster, in which pairwise differences between geostatistical observations are assigned to two dimensional displacement bins and the empirical semivariance is calculated for each bin. The resulting raster may then either be smoothed (Banerjee et al., 2003) or simply examined directly and the spatial symmetry of the resulting values considered. In our study, however the small sample size would result in moderate numbers of pairs per bin only when a relatively large bin size is used. The resulting coarse spatial resolution would make characterisation any detected anisotropy infeasible. For this reason we adopt the simpler approach of fitting spatial polynomials to the residuals and model-averaging the results via the same procedure we use for the covariate based modelling.
The computational efficiency of the LAR algorithm enables us to explore design matrices that include single term polynomials for Easting and Northing values up to polynomial order 12 and interactions terms constructed from subsets of these single term polynomials such that all possible product terms which equate to an overall polynomial order of 6 or less are included in this exploration. We only consider interaction terms equivalent to a polynomial term of half the order of the maximum order of polynomial terms considered in order to avoid confounding between interactions terms of order equivalent to the higher order single polynomial terms. We use the results of fitting the spatial polynomials to training sets of 35 observations constructed from the design matrix pre-filtered to enforce a MCCM between covariate pairs of 0.95 for similar reasons involved in this decision for the covariate based variable selection. Again, 500 unique divisions of the data into training and validation sets are constructed and explored by LAR variable selection and final selections are made from each LAR model choice trajectory on the basis of which model minimizes the associated VSEPE sum of squares. Model-averaging is conducted with weights inversely proportional to the VSEPE sums of squares as per Equation 2.

Full Cover Inference
The 500 selected models (each selected for one of the unique training sets) yield 500 predicted values for %SOC at every pixel in the final prediction raster. We use the weighted model-averaging procedure described in Section 4.1 to calculate a %SOC prediction for each of these pixels. We also calculate an uncertainty estimate for these predictions, where the uncertainty is quantified by the width of the interval containing the middle 95% of the predictions for that pixel. A panel of two rasters is presented in Figure 3. The areal prediction of %SOC levels across the study area plus the areal prediction of the spatial component of the errors from the covariate based modelling is presented as the top raster in Figure 3. The predictions for each pixel from the covariate based modelling are constructed by model-averaging the predictions for that pixel from the models selected by LAR exploration of the 500 unique 35 observation training sets constructed by subsetting the 800 column design matrix. Our estimate of the uncertainty associated with these predictions is presented as the bottom raster in Figure 3. The predicted spatial distribution of %SOC levels is overall quite uniform across the study site with only a few localized regions of notably elevated or depressed values. The estimated uncertainty associated with the predicted %SOC levels is relatively low across the majority of the study site with a few regions of notably elevated uncertainty.

Discussion
In this work we demonstrate the suitability of LASSO modified MLR as implemented through the LAR algorithm for covariate assisted interpolation of a univariate response in a pedological context. The computational efficiency of the LAR algorithm is such that it is feasible to explore 500 unique, 35 observation subsets of a design matrix composed of 800 potential covariate terms, whereas the application of exhaustive search variable selection to this task would not have been computationally feasible. While LAR is often applied to the exploration of potential model spaces composed solely of linear main effects it may also be applied to the exploration of potential model spaces which include both polynomial terms for covariates and terms for the interactions of two or more covariates implemented through products of these terms. Efron et al. [2004] illustrate the exploration of such a model space in their simulation study which compares LAR, LAR-LASSO and Stagewise solution paths obtained from a potential model space comprised of linear main effects, interaction terms and quadratic terms. In such cases, the LAR algorithm is executed upon a design matrix that includes appropriately recentred and rescaled columns for polynomial terms and interaction terms. In our case study we expand 63 covariates to 2205 potential covariate terms by considering polynomial terms for all covariates up to polynomial order 4 and all possible pairwise linear interaction terms. Pre-filtering this full design matrix to enforce a MCCM between covariate pairs of 0.95 results in a design matrix comprised of 800 potential covariate terms. being the most westerly of these three pixels and the estimated uncertainty of 20.57 being the most northerly of these three pixels.
The L 1 penalty in LASSO regression allows for exploration of design matrices that include such highly collinear pairs of covariates. In contrast, it would be advisable to discard a great deal more of these covariates to reduce the degree of collinearity in the design matrices examined prior to conducting the variable selection with OLS based approaches such as information criteria based stepwise variable selection. Our concern regarding discarding numerous members of correlated pairs of covariates prior to conducting the variable selection appears justified in our case study. The VSEPE distributions arising from models fitted to design matrices filtered to enforce a MCCM between covariate pairs of 0.4 are more dispersed about zero than the VSEPE distributions arising from models fitted to design matrices filtered to enforce MCCM between covariate pairs of 0.95. Furthermore, it is the model averaged predictions of the models selected from exploration of training sets constructed from this less stringently pre-filtered design matrix that have the greatest coefficient of determination.
In our analysis we assume that covariate response correlations do not vary across the study area and so adopt a non-spatial regression approach. That is, we assume spatially stationary regression coefficients as the first stage of modelling the spatial distribution of %SOC. Spatially non-stationary linear regression coefficients may have added little here if some of the covariates varied in a spatially correlated manner. If there is spatial non-stationarity in a correlation between a covariate and some component of the response, this variation could well have be captured in our models by the selection of a polynomial term for the covariate in question were it also varying spatially. If this were the case, it would be difficult to show one of the these two interpretations to be more valid than the other in the absence of information beyond the data we have for the case study site. Given our primary objective of spatial interpolation of the response, the mechanism by which this interpolation is achieved (spatially stationary coefficients of polynomial terms or spatially non-stationary coefficients of linear terms) is less important than it would be if we were attempting to identify the pedological and edaphic processes that produce the observed distribution of %SOC.
Limitations of the analysis presented here include the interpolation of the covariates to the locations at which the response was observed being accomplished via separate models before the variable selection is performed. Further limitations stem from these interpolations being accomplished in a manner contingent upon the assumption of isotropic spatial dependence in the mean deviations of the covariates being realigned. By realigning the covariates by means external to the variable selection processes we, in effect, assume that the values we supply to the variable selection process are observed without error at the response locations. However, we know that there was both uncertainty associated with the collection of the covariates and uncertainty associated with the interpolation of the covariates to the locations at which the response was observed. The hierarchical Bayesian models for spatially misaligned data outlined by Banerjee et al. [2004] would be an interesting extension in this regard if these models could be extended to accomplish the variable selection task we have encountered. The advantage of such an approach would be a more realistic propagation of uncertainty, including the uncertainty associated with the spatial realignment of the data layers, through the model hierarchy to that associated with the final full cover areal predictions rather than the more limited cross validation based estimation of the uncertainty associated with areal prediction that we calculate here. If this were combined with a Bayesian LASSO, where the shrinkage parameter could be assigned a hyperprior and estimated as part of the model structure, the need for cross validation would no longer be as strong but the computational challenge would likely be substantial. Other penalized likelihood methods such as adaptive LASSO (Zou, 2006), SCAD (Fan and Li, 2001) and MCP (Zhang, 2010) could all form interesting comparisons to the LASSO modified MLR we have fitted with the LAR algorithm in this work. Further interesting comparisons could be conducted with Bayesian LASSO (Park and Casella, 2008), model-averaged Bayesian CART (Chipman et al., 1998), random forests (Breiman, 2001), boosted regression trees (Friedman, 2002) and model-averaged Bayesian treed regression (Chipman et al., 2002) with Bayesian LASSO implemented in the terminal node MLRs.

Acknowledgments
This work was funded by the CRC for Spatial Information ( , 2003)) and formed a part of the Uralla Plutonic Suite/Mount Duval Adamellite (acid porphyritic, hornblende-biotite monzogranite) and was characterized by yellow and brown chromosols (Isbell, 2002) upon the hills with alluvial soils and siliceous sand complexes distributed along the drainage routes.
The study site typically received 790mm of annual, summer dominant rainfall (Garraway and Lamb, 2011). The maximum elevation within the study site was 1120m and the elevation range across the study site was less than 110m.
The study site consisted of selectively cleared native pasture containing some remnant vegetation and regrowth and had historically been grazed by sheep and cattle. The south-east corner of the study site was situated at grid reference 371434E, 6632499N MGA GDA 94 Zone 56. Being used to grow pasture and receiving in excess of 450mm of annual rainfall, soils in this area fell within the class of agricultural soils deemed to have the highest potential of any agricultural soils in NSW for sequestration of atmospheric carbon as Soil Organic Carbon (SOC) (Chan et al., 2008). A hill-shaded plot of a digital elevation model cropped to the approximate boundaries of the study site and an aerial photograph of the study site have been presented as the panels of Figure 4.

Covariates
The study area was surveyed three times in 2009 for soil apparent electrical conductivity (EC a ) and the reflectance of the top of the pasture canopy under active illumination. The first survey was conducted in the warm summer month of February, following a prolonged dry period (0mm of precipitation in the preceding seven days) in what was otherwise the second wettest month of the year. The second survey was conducted in May, immediately after a significant rainfall event (84mm in the preceding seven days) in what was otherwise the cooler middle of the year when less rain fell than in summer. The third survey was conducted in November, which marked the end of the winter growing season and was a month in which less rain fell than did in the wet month of October and the very wet month of December. These surveys were conducted by an all terrain vehicle (ATV) towing a sensory array that consisted of a specially configured Geonics EM38 unit (Geonics Ontario Canada), an LED illumination array, near-infrared and visible light reflectance sensors (Crop Circle T M , Holland Scientific, USA) and a differential global positioning system (dGPS) (Trimble, Sunnyvale CA USA).
The Geonics EM38 instrument measured soil EC a which may be understood as the integral of the electrical conductivity response recorded across soil depths.
To collect these data the EM38 instrument emitted a varying magnetic field which induced an electric current in the soil underneath the instrument. The electric current induced in the soil created a magnetic field, the strength of which was proportional to the amount of electric current induced. The strength of the magnetic field that resulted from the current induced in the soil was taken as indicative of the strength of this induced electric current and recorded by the instrument. The strength of the electric current induced in the soil by the instrument-emitted electric field varied as a function of depth. With a Geonics EM38 instrument operated in the vertical dipole orientation, as Garraway et al. orientated the instrument in their survey, the relative signal response, S v , varied with depth, z, as follows S v (z) = 4z (4z 2 +1) 3 2 (Morris, 2009). Due to air essentially not conducting electricity at these strengths of inductive magnetic fields, the conductivity signal commenced at the soil surface. Thus the EC a measured by the EM38 instrument in vertical dipole mode was ∞ H σ(z)S v (z)dz where σ(z) was the electrical conductivity of the soil at depth z and S v (z) was the relative signal response of the soil at depth z (Morris, 2009). Thus, when operated in vertical dipole mode, the Geonics EM38 instrument had a peak relative signal response to the electrical conductivity of the soil 350mm below the instrument and was essential insensitive to the electrical conductivity of any medium immediately below the instrument. Garraway et al. mounted a Geonics EM38 instrument on a rubber sled that held the instrument approximately 15mm above the ground (Lamb et al., 2005) and towed this sled around the study area. Thus the EC a data we have for the case study would have been dominated by the soil electrical conductivity at depths around the relative signal response peak at 335mm below the soil surface with very little contribution from the electrical conductivity of the soil surface.
The reflectance sensors measured top of pasture reflectance of active illumination in the Near InfraRed (NIR) and visible Red (RED) regions of the electromagnetic spectrum. This style of proximal sensing of the reflectance of active illumination has been applied to both crop and soil mapping (Holland et al., 2012). The electro-optical principles governing the effectiveness of such sensors have been discussed in Holland et al. [2012]. The EC a , RED reflectance and NIR reflectance were recorded simultaneously at regular time intervals as the ATV traversed the study area. The east-west and north-south coordinates that accompanied each of these covariate observations were also recorded along with the associated Position Dilution Of Precision (PDOP) and Horizontal Dilution of Precision (HDOP) values. The data from each of the ATV surveys were cleaned of observations with large inaccuracies in positioning (as assessed by HDOP and PDOP measurements). The number of point observations that remained from each ATV survey after this cleaning had been conducted have been included in Table 4.

Response
In 2009, the study area was divided into five strata by means of k-means clustering (Bishop, 2006) the red, green and blue channels from aerial imagery and the EC a data from the February survey (Garraway and Lamb, 2011). At least six locations for soil core sampling were randomly selected within each strata with additional locations manually selected to improve the representation of landscape attributes. This stratified random sampling approach to choosing locations for soil core samples was similar to the process outlined in (Miklos et al., 2010). As a result, soil samples were collected to a depth of 200mm at 60 locations across the study area with locations georeferenced using a differential Global Positioning System (dGPS) instrument. At each of the 60 dGPS coordinates, three soil cores were collected from within a 1m radius of the coordinates and aggregated to form a single soil sample that was laboratory analyzed for percentage SOC (hereafter %SOC). Garraway et al. detailed the preparation of

Remotely Sensed Data
A 25m 2 resolution Digital Elevation Model (DEM) (sourced from the Department of Lands, New South Wales State Government, Australia) for the Armidale-Dumaresq region which contained the catchment in which the study area was situated was read into the System for Automated Geoscientific Analyses (SAGA v2.1.0 (Conrad et al., 2015)) software to calculate the terrain topographic and hydrological attributes listed in Table 5. The ready availability of the attributes listed in Table 5 and potential relevance of topography and hydrology to SOC distributions generally, lead us to include the full suite of such attributes that may be calculated with SAGA as potential covariates in this analysis. The resulting GIS layers were then read into R (R Core Team, 2015) with the 'RSAGA' (Brenning, 2008) package. Full cover layers for Foliar Projective Cover (FPC) produced by applying the Statewide Landcover and Trees Study (https://www.qld.gov.au/environment/land/vegetation/mapping/slats-methodology/) method to imagery from the SPOT5 satellite (10m 2 resolution) were acquired for the study region from 2012 and 2011 from the New South Wales State Government Department of Environment. These layers were read into R (R Core Team, 2015) with the 'raster' package (Hijmans et al., 2015). Potassium (K), Uranium (U) and Thorium (Th) count layers from an airborne γ ray radiometric survey (Brown, 2002) and similar layers from six channels of electromagnetic imagery (Brown, 2003) were read into R(R Core Team, 2015) with the 'raster' (Hijmans et al., 2015) package. Figure 4: a) A hill shaded terrain surface for the study site calculated from a digital elevation model. b) An aerial photograph of the study site. The locations at which soil core samples were collected have been depicted as filled circles in both panels. Cartographic projection systems are used to project latitude and longitude coordinates from a particular region of the surface of the Earth onto a two dimensional plane. In this analysis all spatial coordinates were treated as coordinates on a two dimensional plane and as such it was important to ensure that all data layers utilised the same projection system. This was accomplished with the R(R Core Team, 2015) package 'raster' (Hijmans et al., 2015) through the re-projection of all data layers that did not already use the most common projection system among the data layers to this projection system, namely a UTM projection for zone 56 South using the WGS84 ellipse.

Appendix B: Choice of Covariates
In this Appendix we explain our choice of environmental characteristics considered as potential covariates for modelling soil carbon. The first stage in collating this set of potential covariates was to identify the common covariates used in soil carbon modelling via a review of the available literature. Once we had this list of potential covariates, the second stage was to identify which of these we could obtain for our case study site.

Soil Organic Carbon and Soil Apparent Electrical Conductivity
Soil apparent electrical conductivity (EC a ) has variously been found indicative of some or all of soil: moisture content, pore distribution, pore size, salinity, clay content, mineralogy, cation exchange capacity and temperature . Discretizing soil EC a values into four classes produced significant factors in separate ANOVAs for each of soil total organic matter, soil particulate organic matter and total soil carbon in the top 30cm of soil sampled across 250 ha of farmland used for wheat, corn and millet in the state of Colorado in the United States of America (USA) (Johnson et al., 2001). A negative correlation was detected between soil EC a and SOC (r = -0.42) in the top 90cm of soil in 9ha of farmland that had a long history of cotton row cropping in Alabama, USA but no such correlation was detected in the top 30cm (Terra et al., 2006). Soil EC a was more correlated with soil carbon (r = -0.65 to -0.76) than were any of the local relative elevation, local slope and satellite measured soil surface reflectance of near bare fields at three sites (48.7 ha, 52.4 ha and 65.4 ha in area respectively) in farmland used for maize and soybean cropping in Nebraska, USA (Simbahan et al., 2006b). Studies have also been performed where it seemed likely that influences other than SOC were dominating the soil EC a signal. Principal Component Analysis (PCA) of soil properties in an 8ha agricultural field in Flanders, Belgium yielded largely independent spatial patterns in soil pH, EC a and SOC (Vitharana et al., 2008). Furthermore, in the soils of the world's oldest continuous cotton experiment (Alabama, USA) little correlation was detected between SOC in the top 15cm of soil and the EC a of the top 30cm of soil across 0.4ha of land (Siri-Prieto et al., 2006).

Soil Organic Carbon and Spectral Vegetation Indices
Land plant biomass in any location will have been influenced by many soil conditions. Through direct and indirect effects on soil: structural stability, water and nutrient retention, faunal activity and diversity, and elemental recycling (Lal and Follett, 2009) SOC levels may have influenced land plant biomass in many situations. Conversely, plants will have also provided an input of carbon to SOC via litter fall and root turn over. Thus empirical correlations between %SOC and plant biomass are plausible. The amount of green plant biomass present in a location may be indicated by the density of green leaves present above the soil there. The density of green leaves present above the land surface has often been estimated from the reflection spectra of the land surface when observed from above canopy height. Three spectral signatures of particular interest for these considerations have been identified as: those of healthy green leaves, those of stressed or senescent green leaves and those of agricultural soils. The marked difference in the intensities of light reflected from green leaves in the visible red (RED) and near infra red (NIR) wavelengths and the general weakness or absence of such a 'red edge' in the spectral reflectance signatures of stressed or senescent leaves and agricultural soils has formed the basis of many spectral vegetation indices used for monitoring vegetation (Pinter et al., 2003). Healthy leaves have typically exhibited a high reflectance of NIR light due to scattering of these wavelengths at the interface between the mesophyll and cell walls and low absorption of these wavelengths by photosynthetic pigments and organelles (Pinter et al., 2003). Whereas, healthy green leaves have typically displayed a low reflectance of visible wavelengths due to the high absorption of light in this region of the spectrum by photosynthetic and accessory pigments (Pinter et al., 2003). Green plant stress and senescence have often manifested in the form of depressed chlorophyll concentrations and the expression of accessory leaf pigments which together lower the absorption of visible wavelengths by leaves.
Where stress or senescence has lowered the absorption of visible wavelengths by leaves the reflectance peak of such leaves will have widened correspondingly from the green region of the spectrum typical of healthy green leaves through towards redder wavelengths. Where stress or senescence has manifested in this broadened reflectance of visible spectra by leaves a simultaneous decrease in the NIR reflectance of these leaves will have also occurred. Thus where it has occurred, stress or senescence would have resulted in a loss or weakening of the abrupt 'red edge' typical of the reflectance spectra of healthy green leaves (Pinter et al., 2003). Similarly, agricultural soils have been characterised by a lack of any such sharp contrast in the reflectance intensities of different wavelengths (Pinter et al., 2003). Most spectral vegetation indices have been constructed as functions of NIR and RED reflectance designed to quantify some aspect of the expected differences between the reflectance spectra of healthy green leaves and those of soils and stressed or senescent leaves. Thus vegetation indices have been designed to provide an estimate of the quantity of green plant material that contributed to the reflectance spectra of a land surface by quantifying the extent to which such differences in NIR and RED intensities occurred within this spectra (Pinter et al., 2003). The spectral signature obtained from an entire canopy may have differed markedly from that obtained from an individual green leaf and furthermore may have varied across the growing season as the canopy geometry altered with plant growth (Pinter et al., 2003). Thus differences between vegetation indices calculated from reflectance surveys conducted at the same location at different stages in the year may have yielded information about how green plant biomass there changed across the growing season. Changes in biomass across a growing season could, in turn, have been indicative of soil attributes baring other stronger influences on plant biomass lost or accrued. Thus the comparison of vegetation index values when collected in the same location at different points in the plant growth cycle could have been indicative of soil properties. For instance, plants that grew in poorer soils and experienced otherwise similar conditions could reasonably be expected to have produced less biomass in a growing season than the same plants in more favourable soils. Furthermore, the presence of SOC in surface soils has been documented as increasing soil aggregation thereby creating larger lacunae (also referred to as interstitial spaces) into which water may drain from the soil surface. Thus SOC has been broadly classified as beneficial to the infiltration of soil by water and the retention of water by soil (Franzluebbers, 2002). Thus increased %SOC at a location could in turn aid water retention there and allow plants that undergo seasonal curing (e.g. grasses like those at our case study site) to remain green longer into the dry part of the year at that location.
A review of studies of the correlations between soil organic matter and crop reflectance in visible and NIR regions of the electromagnetic spectrum and the vegetation indices thereby derived formed a section of the review paper . In certain situations, above ground plant biomass may have been related to SOC concentrations. Where this has been the case vegetation indices may have held relevance to %SOC concentrations. This seems to have been the case in the following studies. The Normalised Difference Vegetation Index (NDVI (Rouse et al., 1973)) and the Soil Adjusted Vegetation Index (SAVI (Huete, 1988)) have both achieved considerable popularity as vegetation indices. A positive correlation between canopy NDVI and biomass was detected in a 7ha cotton field in Larissa, Greece (Stamatiadis et al., 2005). Furthermore, the pasture canopy SAVI was found to be the best of a range of vegetation spectral indices for predicting pasture green dry mass across four 50ha paddocks in New South Wales, Australia (Trotter et al., 2010). Much like soil EC a , plant visible and NIR reflectance have variously been found correlated with soil properties other than SOC such as soil moisture and Cation Exchange Capacity (CEC)  along with prevailing climate, ecosystem, terrain and physical soil properties (Mulder et al., 2011). We have summarised studies that found correlations between any of a selection of vegetation indices that may be calculated from reflectance intensities in the NIR and RED bands and soil carbon in Table 6. Since our study site consisted of native pasture with remnant woody vegetation we restricted this summary to one of studies that were conducted in pastures, grasslands, prairies and steppes.

Soil Organic Carbon and Scattered Paddock Trees
Below ground root turnover and above ground litter fall have both been recognised as sources of detrital carbon to topsoil. Thus in native pastures with remnant woody vegetation in the form of scattered paddock trees, such as the pastures from which our case study data were collected, the locations of these trees may have influenced the spatial distribution of SOC. Elevated concentrations of organic matter in soils beneath and around trees and shrubs relative to surrounding soils have been observed across a variety of environments and ecosystems (Hibbard et al., 2011;Graham et al., 2004). In the Northern Table-lands of New South Wales (the region from which the data analysed here were collected) an ANOVA detected significantly elevated (P < 0.001) organic carbon content in the top 5cm of soils underneath the canopies of scattered paddock trees compared to soils beyond these canopy margins (Graham et al., 2004).

Soil Organic Carbon and Digital Elevation Model Derived Terrain Descriptors
Climate, parent material, topography and biotic factors may all have influenced pedogenesis to varying degrees in different ecological and geographic contexts. Topography may have influenced soil characteristics to a greater or lesser extent by having influenced hydrologic and erosional processes (e.g. soil water content, runoff and sedimentation) along with soil temperature (via aspect, exposure etc.) which together form and alter soils through mineral weathering, erosion, leaching, decomposition, horizontal zonation and sedimentation (Moore et al., 1993). Topography may also have affected the process of SOC loss that accompanied the conversion of natural land into agricultural land by having influenced SOC: leaching, movement as dissolved organic carbon or particulate organic carbon suspended in water flowing over or through the soil, and erosion by wind or water runoff moving soil and the constituent SOC (Lal, 2002). For the purposes of geostatistical modelling, topography has often been quantified via terrain metrics (e.g. elevation, slope, aspect, curvature, etc.) and hydrological metrics (e.g. catchment area, soil wetness index, stream force index, etc.) calculated for each of the pixels in a digital elevation model (DEM) of the land surface.
In 460ha of cropping and pastoral land in north-west New South Wales, Australia regression modelling identified elevation and plan curvature along with EC a , γ−ray radiometric potassium and thorium related emissions as useful predictors of total soil carbon (Miklos et al., 2010). In a 9ha field with a long history of row crop monoculture subject to conventional tillage in central Alabama, USA SOC was found correlated with a compound topographic index (metric of potential for water pooling on the land surface) (r = 0.48) and with land slope (r = -0.42) which lead the authors to postulate that erosion and field scale hydrodynamics were likely responsible for a large portion of the variability detected there in soil carbon (Terra et al., 2005). In mapping soil carbon in a 12.5ha field with a history of crop rotation between corn and soy beans in central Michigan, USA models that utilised terrain slope, aspect, plan curvature, profile curvature and tangential curvature were generally found to perform better than those that did not (Mueller and Pierce, 2003). In 9ha of cropping soil typically used for cotton in Alabama, USA models with combined topographic index, elevation, slope, silt content and EC a as covariates were found to account for up to 50% of the SOC variability leading the authors to conclude that the spatial distribution of SOC had been affected prominently by topography and historical erosion (Terra et al., 2004). In this same farmland in 2006 SOC concentrations in the top 30cm of soil were found to be correlated with the composite terrain index (r = 0.48) and terrain slope (r = -0.41) (Terra et al., 2006). In a 4.2ha catchment used as agricultural land in North Rhine-Westphalia, Germany correlations between SOC and profile curvature, plan curvature, catchment area, stream power index (Moore et al., 1993) and predictions from water and tillage erosion models (Quinn et al., 1991;Van Oost et al., 2000;Van Rompaey et al., 2001) of soil redistribution patterns have been detected (Dlugoßet al., 2010). From a study of 5.4ha of a dryland agroecosystem with a long history of winter wheat in North-Eastern Colorado, slope and wetness index were identified as the terrain attributes most correlated with soil organic matter (Moore et al., 1993).
Variable selection in this same study returned a linear model that explained 48% of soil organic matter variation with the covariates wetness index, stream power index and aspect. Studies that detected correlations between soil carbon and a selection of topography and hydrology metrics that may be calculated with the SAGA (Conrad et al., 2015) have been summarised in Table 7.

Soil Organic Carbon and Radiometric Imaging of the Earth Surface
Recording γ radiation naturally emitted from the surface of the Earth has been established as a means to detect geochemical anomalies in particular those associated with ore bodies (Cook et al., 1996). Collecting aerial images of such γ radiation emissions has been termed γ radiometry and the spectral signatures most frequently observed have been those associated with the production of 238 U , 232 T h and 40 K daughter radionuclides (Cook et al., 1996). In addition to detecting minerals rich in Uranium and Thorium and mapping geology based on prior knowledge of associations between the above radionuclides and geological materials, γ radiometry of a landscape has also facilitated the tracking of geochemical anomalies and inference regarding erosional processes therein (Cook et al., 1996). Such links to pedological processes have enabled γ radiometry to be used for soil mapping (Cook et al., 1996). On a broad spatial scale airborne radiometric data (particularly the K band) was found to improve the mean square error of predictions of soil organic carbon across Northern Ireland (∼13,843 km 2 ) when coupled with elevation data to 30.6% (Rawlins et al., 2009). Similarly, digital elevation model derived soil properties and γ radiometric survey data when used to build regression trees were found to account for 54% of the total soil carbon variation across 50, 000 ha of state forest in south Elevation (Terra et al., 2004;Miklos et al., 2010) Plan Curvature (Moore et al., 1993;Mueller and Pierce, 2003;Dlugoßet al., 2010;Miklos et al., 2010) Profile Curvature (Moore et al., 1993;Mueller and Pierce, 2003;Dlugoßet al., 2010) Slope (Moore et al., 1993;Mueller and Pierce, 2003;Terra et al., 2004Terra et al., , 2005Terra et al., , 2006 Stream Power Index (Moore et al., 1993;Dlugoßet al., 2010) Tangential Curvature (Mueller and Pierce, 2003) Topographic Indices (Terra et al., 2004(Terra et al., , 2005(Terra et al., , 2006 Wetness Index (Moore et al., 1993) eastern Australia (McKenzie and Ryan, 1999). Furthermore, γ radiometry has also been found useful for predicting the spatial distribution of soil carbon on scales closer to that across which our data were collected. Over a 5625ha square area of cropping land on the lower plains of the Macquarie River, News South Wales (Australia) percentage soil organic carbon in the top soil was found to be weakly negatively correlated with the concentrations of Potassium and Uranium in the ground as measured by a γ radiometric survey (Singh et al., 2012). Furthermore, regression modelling identified elevation and plan curvature along with EC a , γ−ray radiometric potassium and thorium reflectances as useful predictors of total soil carbon in 460ha of cropping and pastoral land in New South Wales (Miklos et al., 2010).

Appendix C: Choice of Modelling Method
In this appendix we compare and contrast a selection of Multiple Linear Regression (hereafter MLR) and Binary Tree (hereafter BT) based techniques in the context of data and objectives akin to those of our case study. We consider the defining characteristics of our case study data to be: (1) more potential covariate terms than observations (the p > n or ultrahigh dimensional situation for variable selection) (2) a high degree of collinearity among the potential covariate terms and (3) suspected importance of non-linear effects of covariates and interactions of covariate effects. The primary objective of our case study analysis was covariate assisted spatial interpolation of the response. Our case study also had the additional context of the modest computational resources provided by one mid-range laptop and our desire for an easily interpretable predictive mechanism. In Section 9.1 we introduce a selection of BT based models and in Section 9.2 we introduce a selection of MLR based models. In Section 9.3 we compare the relative merits of the model introduced in Section 9.1 and Section 9.2 for modelling data like those of our case study with our objective of covariate assisted spatial interpolation of the response.

CART
Classification And Regression Trees (CART) (Breiman et al., 1984) are two closely related techniques. Both utilise a binary tree based model structure but classification trees predict a categorical response while regression trees predict a continuous response. Our interest here is in techniques for modeling a continuous response and thus methods for solving regression problems. Subsequently we focus on regression trees. The regression trees of Breiman et al. 1984 partition the response observations y i , i = 1, ..., n, into M mutually exclusive sets by recursively partitioning the associated covariate space into M mutually exclusive regions R 1 , ..., R M through binary divisions along covariate axes. Each associated subset of the response is then modelled by the single parameter, maximum likelihood estimator for those observations, the group meanĉ m (Hastie et al., 2009). Thus, regression trees model a continuous response as per Equation 3.
Each y i will be modelled by one and only oneĉ m since the R m are disjoint and thus for any particular i, x i,. ∈ R m for exactly one unique R m .
Thus when continuous covariates are supplied to regression trees the predictions produced vary in a stepwise manner across the range of the covariates whereas the predictions from an MLR supplied with these same covariates would vary in a continuous manner across this same range. Furthermore, the recursive nature of the binary partitions of regression trees enable far more complex interactions to be modelled than those permitted by taking products of pairs of covariates as may be done in MLR. However, deep trees (trees with many binary partitions) would be required to form good approximations to even simple linear relationships between covariates and the response whereas such relationships may be modelled naturally as components of the structure of MLR. Deep trees rapidly become a concern with a limited number of response observations since they result in terminal node parameters being estimated from fewer observations (and thus less reliably).
Frequentist CART Regression trees, as proposed by Breiman et al. 1984, are constructed via recursive binary partitions of the observations based on whether particular covariate values of these observations exceeded threshold values with the range of these covariates. This results in the M mutually exclusive regions of covariate space R 1 , ..., R M referred to in Equation 3. Within each particular R m theĉ m that minimises the residual sum of squares for predicting the response in that region is simply the group mean for those response observationŝ c m = E(y i |x i ∈ R m ). The challenge when fitting regression trees is identifying a sequence of binary partitions that define a set of regions such that the residual sum of squares from the entire tree is low. Since an exhaustive search of the potential regression trees that may be constructed from a particular set of data is computationally infeasible in all but the most trivial cases, regression trees are typically constructed via a greedy algorithm in the hope of identifying a regression tree that fits the data well via a computationally tractable procedure (Hastie et al., 2009). Greedy algorithms are so named because at each step in the iterations of such an algorithm the decision that yields the best improvement in the decision metric (e.g. fit of the model etc.) between the current state and the next is the decisions that is taken. As such there is no long term planning or stochasticity involved in such algorithms and thus no guarantee of identifying the optimal solution. Indeed, such algorithms can be seen to be highly sensitive to local maxima.
For fitting regression trees using the residual sum of squares as the criterion for decisions to partition the data, the greedy algorithm is as follows. The algorithm commences with all data in a single set termed the root node of the tree. All possible binary divisions of this root node set along all possible covariate axes are then constructed in turn, and the residual sums of squares resulting from prediction of the response with the pairs of associated group means are computed. With a finite number of observations there is a finite number of ways to divide the data into two subsets based on the covariate values of these ob-servations. This complete but finite set of possible divisions of the data may be obtained by considering threshold values equidistant between each of the pairs of observed values of a covariate when these values are arranged in an ascending (or descending sequence) for each covariate in turn. The binary partition that yields the best improvement in the residual sum of squares for the entire tree relative to that at the previous step is then selected as the partition to use at this step in the algorithm and the process is repeated for each of the resulting subsets (also referred to as child nodes). This recursive partitioning process is then continued until some stopping criterion is met. A simple and popular choice of stopping criterion is a minimum number of observations per terminal node from which the practitioner considers it is still reliable to estimate a mean. Once this tree growing algorithm is halted by the satisfaction of the selected stopping criterion the resulting tree may then be 'pruned' by sequentially examining the effects of collapsing the parent nodes of the current terminal nodes on the basis of some cost-complexity criterion and taking this action where it is judged meritorious. More formally this growing and subsequent pruning procedure may be described as follows.
The algorithm commences at the root node which contains all observations. Given the full set of covariates as the design matrix, X, the algorithm takes each covariate, X j , in turn and computes the set of threshold values that would each produce different subsets of the response should they be used to define a binary partition of the data based on this covariate. For each unique pairing of a particular covariate, X j , and a particular threshold value for that covariate, s, the regions defined by a binary partition based on X j and s are two disjoint regions of covariate space: R 1 (j, s) = {X|X j ≤ s} and R 2 (j, s) = {X|X j > s}.
The algorithm compares all such regions that may be constructed at this step to identify the choice of covariate X j and threshold value s that solves arg min j,s arg min c1 xi,j ∈R1(j,s) (y i −ĉ 1 ) 2 + arg min c2 xi,j ∈R2(j,s) whereĉ 1 andĉ 2 are calculated for each pairs (j, s) as the respective response group means:ĉ 1 = E y i |x i,j ∈ R 1 (j, s) andĉ 2 = E y i |x i,j ∈ R 2 (j, s) . The resulting binary partition of the data is then made and the above process is repeated for each of the resulting child nodes until some stopping criterion, such as a threshold minimum number of observations per terminal node, is satisfied at which point the algorithm is halted. The resulting tree may then be pruned recursively subject to the changes in some cost-complexity criterion that result from collapsing the parent nodes of the various current terminal nodes. More formally, let: T ⊂ T 0 be defined as any tree that may be obtained by collapsing some non-terminal node(s) of T 0 , |T | be defined as the number of terminal nodes in tree T and let m index the terminal nodes of T each with the associated subset of the data R m . A cost-complexity criterion, C α (T ), may be defined as per Equation 6.
. This cost-complexity criterion is the sum of the residual sum of squares for the predictions from the whole tree and a multiple, α, of the number of terminal nodes in the tree. The tuning parameter α controls the trade off between the fit of the tree to the data and the complexity of the tree as quantified by the number of terminal nodes of the the tree. Increases in α will yield smaller trees with larger residual sums of squares. Each pair of original tree T 0 , grown as above, and tuning parameter value α will have some smallest sub-tree T α that minimizes C α (T ). This T α may be identified by weakest link pruning (Hastie et al., 2009) whereby the internal nodes that yield the smallest per-node increase in the residual sum of squares |T | m=1 N m Q m (T ) when collapsed are sequentially collapsed until there are no longer any binary partitions to collapse as the entire data are once again contained in the original 'root' node associated set. It has been shown that the sequence of sub-trees thus obtained dependably contains T α (Hastie et al., 2009;Breiman et al., 1984;Ripley, 1996). For the purposes of building a regression tree for interpolation, α may be estimated via cross validation (Hastie et al., 2009). This is traditionally how regression trees have been fitted however there is also an approach extant for fitting regression trees under the Bayesian paradigm.
Bayesian CART The Bayesian approach to fitting CART models was developed by Chipman et al. (1998) and involves the use of particular prior specifications for the terminal node parameters and the tree structure itself along with a stochastic search. The Bayesian CART (Chipman et al., 1998) is a CART model, it is the manner in which the model is fitted and the underlying assumptions that are Bayesian. Continuing our focus on models for a continuous response variable with continuous covariates we will describe the Bayesian approach to fitting a regression tree. The Bayesian regression tree model consists of a binary tree T with b terminal nodes and the associated parameter vector Θ = (θ 1 , θ 2 , ..., θ b ), each θ i being associated with the ith terminal node of the tree. If x i,. = [x i,1 , x i,2 , ..., x i,p ] falls within the region defined by the ith terminal node, the associated response variable y i |x i,. is modeled by the distribution f (y i |θ i ) with f representing some parametric family controlled by parameter(s) θ i . Chipman et al do not specify a closed form prior for the binary tree struc-ture but instead specify it implicitly by generating trees from a tree growing stochastic process. In this manner each tree grown by the stochastic process forms a randomly drawn observation from this tree prior. Such specification of the tree prior allows for simple evaluation of prior probability p(T ) for any tree T which in turn may be employed within a Metropolis Hasting (MH) algorithm.
To draw an observation from the tree prior a new tree is propagated from the tree consisting of a single 'root' node by stochastically dividing terminal nodes in an iterative process. This tree propagating process is governed by a function that controls the selection of terminal nodes for division and another function that controls the assignment of a division rule to a terminal node that has been selected to be divided. The function p DIV IDE (η, T ) generates the probability for tree T that terminal node η is divided. If terminal node η of tree T is chosen to be divided, the function p RU LE (ρ|η, T ) generates the probability of assigning the division rule ρ to this terminal node. A division rule for a binary partition specifies the covariate that will be used to define the binary partition and the threshold value of this covariate which will determine the division of the observations into two groups based on the values of this covariate associated with each of the observations. The form and parameter values assigned to these functions collectively control the frequency with which particular tree depths and geometries are generated and thus the eventual weighting of such trees in the prior distribution. As such, via influence on the prior, these functions may be used to guide the posterior towards identifying trees of the desired depths and geometries (for instance to emphasise trees with a minimum number of observations in each terminal node). The priors for the parameters of the distribution functions used to model the response observations in the terminal nodes may be taken as standard conjugate forms. A convenient option for priors on the parameters Θ = (θ 1 , θ 2 , ..., θ b ) for the terminal nodes 1, ..., b is to use mean shifted normal distributions for each θ i with (µ 1 , µ 2 , ..., µ b )|σ, T iid ∼ N (μ, σ 2 α ) and σ 2 |T ∼ IG( ν 2 , νλ 2 ). Such a formulation permits each terminal node an individual mean parameter and models all node mean parameters with independent and identical Normal distributions. Should a more flexible formulation be desired an individual variance parameter may be introduced for each terminal node mean via mean-variance shifted normal distributions as follows i.e. for i = 1, 2, ..., b, µ i |σ i , T iid ∼ N (μ, σ 2 i α ) and σ 2 i |T ∼ IG( ν 2 , νλ 2 ). Both the above the priors p(Θ|T ) facilitate closed from solutions for p(Y |X, T ) = p(Y |X, Θ, T )p(Θ|T )dΘ to be obtained analytically. Utilization of one of these closed form solutions along with a CART tree prior P (T ) enables the posterior of T to be obtained subject to a normalizing constant: p(T |X, Y ) ∝ p(Y |X, T )p(T ). The enormous number of trees that may be constructed from all but the smallest of data will render an exhaustive search of p(T |X, Y ) computationally intractable. Subsequently, the normalization constant will not be obtainable nor will it be possible to identify which trees have with the highest posterior probability. However, the posterior may still be explored using an MH algorithm to conduct a stochastic search. Such a stochastic search will result in a Markov chain sequence of trees that converges towards trees with higher posterior probabilities and converges in dis-tribution to the posterior p(T |Y, X). Chipman et al. (1998) note that their MH algorithm has a tendency to move rapidly towards a group of similar trees with high posterior probability proximate to the starting tree then remain in that vicinity exploring that group of trees with small local steps for many subsequent iterations of the algorithm. Given a large enough number of iterations MH algorithms will move between posterior modes and explore the entirety of the trees possible but there is no guarantee about how many iterations a MH algorithm must be run for in order to achieve such a complete exploration. In light of these considerations, Chipman et al. (1998) recommend comparing the results of multiple runs of their MH algorithm each originating from different starting values (origin trees). Chipman et al. (1998) recommend both multiple restarts of their MH algorithm from the single root node tree, citing high initial variability in the direction in which their MH algorithm will proceed often leading such restarts to converge on quite different trees, and multiple restarts from start values selected from interesting intermediate trees from previous runs of their algorithm or trees identified by other methods (e.g. bootstrap bumping (Tibshirani and Knight, 1999)). Such multiple restarts of the MH algorithm for fitting Bayesian CARTS will result in a range of selected trees which can either be model averaged or chosen between depending on the goals of the particular analysis. Aids for selecting the trees to include in the model averaging or for selecting a single tree could include the residual sum of squares of the resulting tree (either alone or constituent in a cost complexity metric) or plots of the observed likelihood of the trees p(Y |X, T ) against the number of terminal nodes of the same trees as a guide to the cost -complexity compromising being struck.

Bagged Trees
Perhaps the simplest elaboration of CART comes from applying it within a bootstrapping procedure. The term 'bagging' was created by compressing the term 'bootstrap aggregation' and refers to taking an average of the set of predictions obtained from applying the same model fitting procedure to a collection of bootstrap samples of some data. Bagged Trees are reviewed in Hastie et al. (2009) who make the following observations regarding properties of Bagged Trees that would be pertinent to the application this technique to data in which linear and non-linear effects of covariates are expected to be important and collinearity is extant among covariates. The average of many regression trees fitted to bootstrap resamples can better approximate linear and non-linear trends in the data than a single regression tree. This stems from how the average of many such stepwise approximations to a linear (or non-linear) relationship, many of which will differ slightly having been fitted to different bootstrap resamples of the data, the will form a much better approximation to this relationship than any of the individual constituent stepwise approximations. Collinearity among the covariates can lead to high variance among regression trees fitted to replicate data which bagging can smooth out in the hope of thereby obtaining a more generally applicable model. However bagging regression trees will not improve the bias in estimation relative to that associated with a single regression tree fit. This improvement in prediction due to reduction in variance from bagging comes at the cost of the interpretability of a single tree. This is occurs since bagged regression trees are an average of the predictions of many such binary trees that have different geometries and as such the simple binary, branching nature of a single regression tree is sacrificed.

Random Forests
Like bagging, random forests (Breiman, 2001) also involve averaging the predictions of a set of binary tree based models such as regression trees. Furthermore, both bagging and random forests build this set of tree based models from a set of bootstrapped samples of the data. Random forests elaborate on bagged trees by building 'de-correlated' trees. These de-correlated trees are propagated by choosing each binary division based on a randomly selected subset of the potential covariates. Repeating this tree propagating process many times, each on a different bootstrap resample of the data results in a set of 'de-correlated' trees that are then averaged to obtain a final prediction. As such random forests share the advantage of bagged trees over a single regression tree in that they may better approximate linear and non-linear trends in the data while still being able to model complex interactions between covariates. Random forests frequently perform synonymously to boosted trees and are easier to train and tune (Hastie et al., 2009).

Boruta All Relevant Variable Selection
Random forests have also been taken as a starting point for further methodological elaborations and refinements. One such technique is Boruta all relevant variable selection (Kursa and Rudnicki, 2010), hereafter BARVS (our acronym). The essence of the BARVS method is the recursive process of fitting a random forest then assessing which of the covariates utilised in this random forest made a sufficient contribution to the predictive performance of this random forest to warrant retention. The random forest is then refitted, this time using only the covariates deemed worth retaining in the previous iteration. This process in then repeated until a random forest is fitted in which all covariates utilised are deemed to have made sufficient contribution to the predictive performance of the random forest to warrant retention. Kursa et al. assess the contribution of a covariate to the predictive performance of the random forest via a Z score (though note that Z N (0, 1) ). The Boruta Z score for a covariate in a particular random forest relates to the loss of accuracy of prediction resulting from the random permutation of the values of that covariate among observations. The Boruta Z score for a covariate in a random forest is calculated from these losses of predictive accuracy from all the constituent regression trees that utilized that covariate. In particular, the Boruta Z score is calculated by dividing the mean of these losses in predictive accuracy by the standard deviation of these losses (hence the choice of name). The BARVS algorithm runs approximately as follows. Firstly, permuted copies of all covariates currently under consideration for inclusion in the random forest are added to this set of considered covariates and a random forest is fitted to these composite data. The Z scores are then calculated for all of covariates, including the permuted copies of the actual covariates. The maximum Z score among the permuted covariates is then identified and two sided t tests are performed to test the equality of the Z score for each actual covariate against this maximum Z score among the permuted covariates. Next, all actual covariates that have Z scores significantly less than this maximum Z score among the permuted covariates are discarded from the set of considered covariates and the current set of permuted covariates is also discarded. This procedure is then repeated until all remaining actual covariates have Z scores significantly greater than the maximum Z score among the permuted covariates created at that iteration. The random forest constructed from these exclusively relevant covariates is then retained as the final solution.

Boosted Trees
Boosting is a well regarded technique that may be applied to a variety of models including binary tree based models (Hastie et al., 2009). Boosted regression trees have been introduced in a manner accessible to quantitative scientists by Elith et al. (2008). Akin to bagged trees and random forests, boosted trees incorporate a sequence of binary trees (such as regression trees) fitted to sequentially modified versions of the original data, the final prediction being drawn from a combination of the predictions of this sequence of trees. In the case of boosted trees, the sequential modifications to the original data take the form of weightings calculated from the results of the tree fitted in the previous step. Boosted regression trees implement a type of functional gradient descent designed to minimise a loss function that quantifies the loss in predictive accuracy resulting from an imperfect model fit (Elith et al., 2008). The first tree is fitted to the original data by maximising the reduction in the value of the loss function relative to that from a single node tree. The second tree is fitted to the residuals from the first tree but the predictions from this second step in the fitting process are the result of combining the predictions first and the second tree. The third tree is then fitted to the residuals from the combined predictions of the first and second trees and so on. In this manner boosting sequentially focuses the model fitting on the observations that are difficult to explain. Boosting is typically conducted for as many iterations as is computationally feasible and the final prediction from the ensemble of boosted regression trees is a type of weighted average of the predictions from all the constituent trees. Predictions from boosted regression trees have increased stability and accuracy compared to those from a single regression tree model. Furthermore, the introduction of some stochasticity into the boosting algorithm via the inclusion of a bagging step can further improve the accuracy of the predictions and mitigate the effects overfitting (Friedman, 2002). Boosted trees like bagged trees and random forests involve the sacrifice of the interpretability of a single regression tree for better predictive performance. However, a relatively straight forward metric of covariate importance exists (Friedman, 2001;Friedman and Meulman, 2003) which scores covariates based on the frequency with which each defined a binary division and weights these scores proportionally to the improvement in the model fit that resulted from the inclusion of the associated division. These scores are averaged across all the trees in the boosting sequence and the resulting scores are scaled to sum to 100 for ease of interpretation. Unlike a single regression tree, boosted regression trees can easily model linear relationships, non-linear relationships and relationships that include step-like discontinuities (Elith et al., 2008). Unlike bagged trees and random forests, boosted regression trees reduce both the variance and bias associated with predictions (Elith et al., 2008).

Cubist
Cubist (https://www.rulequest.com/cubist-info.html) fits predictive models developed from Quinlans M5 model tree method (Quinlan, 1992). Quinlan's M5 method functions by creating a binary tree structure which is then pruned to reduce tree complexity without greatly reducing the overall fit of the tree to the data. Where this pruning converts former interior nodes of the tree into terminal nodes by collapsing the tree structure below them, MLR models are fitted to each of the subsets of the data thus defined. Each MLR model selects covariates from the set of covariates that were previously used to define the tree structure that was below (child nodes of) this current node. The Cubist method extends the M5 method by incorporating a boosting step.

Bayesian Treed Regression
The Bayesian implementation of CART (Chipman et al., 1998) has been extended to fit MLR models (rather than simple intercept only models) to the subsets of the data defined by the terminal nodes of the associated binary tree in a framework the creators dubbed Bayesian treed regression (Chipman et al., 2002). One motivation for such a model formulation being the scenario whereby different covariates are most useful for predicting the response in different subsets of the data. The end result is somewhat akin to C5 model trees (Quinlan, 1992) but the method of attaining such a fit is very different, the model being formulated under the Bayesian paradigm. Much like Bayesian CART, Bayesian treed regression utilises an implicitly specified tree prior and a stochastic search. The Bayesian formulation and stochastic search enable different tree geometries and MLR models in the terminal nodes of these trees to be explored in addition to different error variances to be modelled for each terminal node. The particulars of the model formulation are as follows. For each terminal node of the tree T , the response observations, Y , that are members of this node are assigned a parametric model. In this manner there is a separate parametric model for the response observations contained in each of the unique terminal nodes of the tree. In particular, the distribution of the elements of the response vector Y that are members of the i th terminal node of the tree T are modeled conditional upon the associated covariate values by the parametric model Y |x ∼ f (y|x, θ i ) where Θ = (θ 1 , ..., θ b ). This contrasts with a Bayesian CART where the distributions of the response observations in each terminal node are not modeled as conditional upon the associated observations of the covariates x there. Where CART and Bayesian CART models utilise a collection of stepwise functions to approximate correlations between the response and the covariates via the binary tree structure Treed regressions may be thought of as a collection of piecewise MLR models. As such Bayesian treed regression models would model linear or non-linear relationships between covariates and the response much more parsimoniously than Bayesian CART models as Bayesian treed models have the facility to invoke MLR in the terminal nodes. This facilitates the transfer of complexity in Bayesian treed models from the tree geometry to the terminal node MLR models. As such, one may reasonably expect shallower more readily intelligible trees to be coupled with the terminal node MLR models in Bayesian treed regression as compared to the more deeper, less readily intelligible trees that are coupled with terminal node single parameter models in Bayesian CART. This may be thought of as the binary tree component of the Bayesian treed models capturing the major subsets of the data and the associated MLR models describing the nuances thereof.
Each fit of a Bayesian treed model is uniquely defined by (Θ, T ) and Chipman et al. (2002) outline how the posterior distribution of these (Θ, T ) may be explored via stochastic search with the aid of a Metropolis Hastings algorithm. As with their Bayesian CART, Chipman et al. (2002) recommend comparing the fits to which multiple restarts of the stochastic search converge so as to better explore the posterior rather than running a single stochastic search for a long time. They make this recommendation in light of the noted tendency of their MH explorations to converge on local maxima in the posterior then explore the vicinity of that maxima for many iterations. Chipman et al. (2002) note that their approach is amendable to both Bayesian model selection, should a single model be desired, and Bayesian model averaging (e.g. by posterior or likelihood weighting of the iterations of the stochastic search) should better predictive performance be desired. When model averaging is elected, Chipman et al. suggest model averaging only the better fitting models.

An Introduction to the Multiple Linear Regression Based Models Considered this Work
Multiple Linear Regression (MLR) predicts observations of the response y i , i = 1, ..., n, from a linear combination of products of observations of the covariates x i,j , j = 1, ..., p, and the associated coefficient estimatesβ j plus an intercept termβ 0 as per Equation 9:ŷ In MLR, non-linear effects of covariates upon the response may be modelled via inclusion of additional covariate terms formed as single term polynomials constructed from the original covariates. Interactions between covariates in their effects upon the response may be modelled via including in the regression additional covariate terms constructed by taking products of the covariates constituent to the interaction in question.

MLR Maximum Likelihood
When a MLR model is fitted by ordinary least squares (OLS) likelihood maximisation the vector of coefficient estimatesβ is obtained from the design matrix X as per Equation 10.β = X T X −1 X T y As such ordinary least squares cannot estimate MLR fits for scenarios where the number of covariates, p, exceeds the number of observations, n, (the p > n or ultrahigh dimensional scenario). Even in situations where the number of covariates is less than the number of observations, consideration of non-linear terms for each covariate and the p k possible order k interaction terms can lead to the number of considered covariate terms grossly exceeding the number of observations and OLS fitting no longer being possible. Furthermore, fitting MLR by OLS is ill advised when collinearity exists among the covariates (Belsley et al., 1980).

MLR with Information Criterion Based Variable Selection
When the number of covariate terms desired to be considered for use in predicting the response exceeds the number of observations, OLS is often incorporated into a model comparison framework that fits and compares numerous models that utilise subsets of the available covariates such that p < n. This comparison is often effected via an information criterion such as the Akaike Information Criterion (AIC) (Akaike, 1974;Venables and Ripley, 2002;Konishi and Kitagawa, 2008) which assigns models a score that rewards goodness of fit to the data while penalizing model complexity. Where it is computationally feasible to do so all possible models that may be constructed from a particular set of covariates with a particular number of observations may be fitted via an exhaustive search procedure and compared, for instance in terms of information criterion values. Where the computational burden of an exhaustive search is deemed too great, a popular choice is to adopt some greedy algorithm based approach to search for optima in the information criterion values that accompany the set of possible models without having to fit all these models. Stepwise variable selection methods are examples of such techniques. It should be noted that in their base forms both stepwise variable selection and exhaustive searches still utilise OLS fitting procedures and as such are both ill advised in the presence of collinearity among the covariates.

MLR Bayesian
The Bayesian approach may also be used to fit MLR models (see for example Gelman et al. 2004). Furthermore, the Bayesian framework may be used to accomplish variable selection under the ultrahigh dimensional scenario by creating prior distributions that give each regression coefficient a high probability of being zero (Gelman et al., 2004) as performed in the Bayesian variable selection method Spike and Slab priors (Mitchell and Beauchamp, 1988;Geweke, 1996;George and McCulloch., 1997). A simple Bayesian formulation of an MLR with independent and non-informative priors on all coefficients may be adversely affected by collinearity among covariates. Collinearity among covariates will lead to high posterior variance of the associated coefficient estimates (Gelman et al., 2004) and subsequently slow Markov Chain Monte Carlo convergence. Bayesian analogues to shrinkage techniques for mitigating the undesirable effects of modeling with data that includes collinearity among the covariates are outlined along side their penalized likelihood based counterparts in the following section.

MLR Penalization / Shrinkage
One of the dangers when conducting MLR based modelling with covariates among which collinearity exists is that highly correlated pairs of covariates can be assigned arbitrarily large magnitude positive and negative coefficients that effectively negate each other. One action that may be taken to mitigate or at least control this effect is to impose a penalty upon the combined magnitude of the coefficients as part of the fitting process (Hastie et al., 2009). A simple choice for such penalty functions is to take the L γ norm of the regression coefficient vector β, for some value of γ, and search for theβ that minimizes the sum of the residual sum of squares and this norm of the regression coefficients as per Equation 11.
Solving Equation 11 with γ = 2 yields a Ridge Regression estimate equivalent to that obtained by solving Equation 12.
Ridge regression shrinks all coefficients towards zero and thus does not perform variable selection and is not useful for regression in the ultrahigh dimensional situation. Interestingly, ridge regression is equivalent to Bayesian MLR with an exchangeable normal prior distribution on the coefficients (Gelman et al., 2004). Using γ = 1 in Equation 11 yields an L 1 penalized least squares estimate also known as the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996). More complex choices for the penalty function than the L γ norm in Equation 11 are used in penalized least squares techniques such as adaptive LASSO (Zou, 2006), Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li, 2001) and Minimax Concave Penalty (MCP) (Zhang, 2010). Of these techniques L γ penalization is perhaps the most easily applied. Solving Equation 11 for any γ < 2, will shrink the coefficient estimates for some covariates to zero exactly (how many depends on value of tuning parameter λ) thereby performing variable selection in addition to penalized estimation (Ahmed, 2014). As such L γ penalized estimation with γ < 2 is applicable in scenarios where where the number of potential covariates exceeds the number of observations (p > n) and collinearity exists among the covariates. The absolute value in an L 1 penalized estimate requires a computational solution initially provided by quadratic programming (Tibshirani, 1996) and more recently by the more computationally efficient Least Angle Regression (LAR) algorithm (Efron et al., 2004). An estimate of the LASSO solution may also be obtained from the posterior mode estimates of a Bayesian MLR with independent and identical, Laplace (double exponential) priors on the regression coefficients (Park and Casella, 2008).

Comparing Multiple Linear Regression Based Techniques and Binary Tree Based Technique
The comparison of the properties of Multiple Linear Regression (MLR) and Classification And Regression Tree (CART) has most relevance to this work as a foundation to inform the comparison of the various modifications of each technique that would have better suited the coupling of the particular characteristics of the data from our case study and our objective for the analysis of these data. We consider the defining characteristics of our case study data to be: (1) more potential covariate terms than observations (the p > n or ultrahigh dimensional situation for variable selection) (2) a high degree of collinearity among the potential covariate terms and (3) suspected importance of non-linear effects of covariates and interactions of covariate effects. The primary objective with our case study analysis was the construction of a model for covariate assisted interpolation of the response. Our case study also had the additional context of the modest computational resources provided by one mid-range laptop and our desire for an easily interpretable predictive mechanism.
A plethora of BT based approaches exist today that are variously modifications of and elaborations upon the Classification And Regression Trees (Breiman et al., 1984) (hereafter CART) framework. We confine ourselves here to considering a subset of these that appeared appropriate for the case study data and objectives namely: Bayesian CART (Chipman et al., 1998), bagged regression trees (Breiman, 1996), random forests (Breiman, 2001), boruta all relevant variable selection (Kursa and Rudnicki, 2010) (an elaboration upon random forests), boosted regression trees (Friedman, 2002), cubist (Quinlan, 1992) and Bayesian treed regression (Chipman et al., 2002) (an elaboration upon Bayesian CART). Readers unfamiliar with these BT based techniques are referred to Section 9.1. A similar diversity of modifications to and elaborations upon MLR exist though we confine ourselves here to considering the base form and the modification thereof that seem appropriate to the defining features of our case study data. Namely, we consider shrinkage modified MLR for its relevance to the situation where collinearity exists among the covariates with particular attention to LASSO style shrinkage for its additional relevance to the situation where the number of covariates exceeds the number of observations of the response (see Section 9.2 for an introduction to these techniques). We consider both LASSO for MLR fitted under the Bayesian paradigm (Park and Casella, 2008) and LASSO implemented through likelihood penalization as fitted by the LAR algorithm (Efron et al., 2004). We consider each of these techniques in light of the key characteristics of our case study data and our objective of building a model to interpolate the response. This allows us to narrow down the choice of methods further and make a final decision. This comparison is also summarised in Table 8.
We commence this consideration with perhaps the most widely know technique introduced above, MLR. A MLR model cannot be fitted by Ordinary Least Squares (OLS) based likelihood maximisation when the number of covariates desired to be included in the model exceeds the number of observations (see Section 9.2). The large number of potential covariate terms and suspected importance of non-linear effects and interactions of covariate effects meant that conducting exhaustive search variable selection on these data was not computationally feasible. Under such scenarios, a common option has been to apply some deterministic variable selection procedure that optimises an information criterion. Stepwise variable selection with the Akaike Information Criterion (AIC) (Akaike, 1974;Venables and Ripley, 2002;Konishi and Kitagawa, 2008) has been a popular choice in this regard having been available in R via the step function (R Core Team, 2015) through numerous release cycles. However, with linear regression, correlations between potential covariates can be a cause for concern with such techniques that rely on ordinary least squares model fitting (which in the case of linear regression is also maximum likelihood model fitting). The undesirable effects of correlations among covariates in a MLR model have been explained in the comprehensive book (Belsley et al., 1980). While Bayesian methods would allow a MLR model to be fitted with more covariates than observations the high degree of collinearity among some pairs of covariates would render the associated pairs of coefficients poorly defined which in turn could complicate convergence of the MCMC iterations upon a fit to these data. In the presence of collinearity among covariates some form of shrinkage is advisable for preventing highly correlated covariates being assigned arbitrarily large magnitude but opposite signed coefficients that all but cancel each other out due to the high correlation of the associated covariates. Coefficient shrinkage is performed by the popular Penalized Least Squares (PLS) family of techniques (Ahmed, 2014) which forms a subset of the larger family of coefficient shrinkage themed modifications of MLR. We considered shrinkage techniques including Ridge regression (Hoerl and Kennard, 1970), LASSO (Tibshirani, 1996), LASSO fitted by LAR (Efron et al., 2004) and the Bayesian LASSO (Park and Casella, 2008). Of particular interest were both LASSO for MLR fitted under the Bayesian paradigm (Park and Casella, 2008) and LASSO implemented through likelihood penalization as fitted by the LAR algorithm (Efron et al., 2004) since LASSO is appropriate to both the situation where the number of covariates exceeds the number of observations and the situation where substantial collinearity exists among the covariates. LASSO conducts shrinkage in a manner that shrinks some coefficients to zero exactly which in effect excludes them from the model and as such is useful for conducting shrinkage and variable selection simultaneously. A LASSO modified MLR fit may be obtained via the deterministic and computationally efficient LAR algorithm (Efron et al., 2004) which adds or removes the covariate that best optimises its criteria at each iteration. Alternatively, a LASSO modified MLR estimate may be obtained by fitting a Bayesian MLR with independent and identical Laplace priors on all the coefficients (Park and Casella, 2008). Such an exploration, utilising multiple MCMC chains with a variety of starting values, could well conduct a more thorough exploration of the possible LASSO fits of MLR models that could be constructed from the potential covariates than would be conducted by the deterministic LAR algorithm. Though we note that were non-linear effects and all possible pairwise interactions considered as potential covariate terms, the computational burden involved in obtaining convergence of all chains could be considerable simply due to the number of coefficients to be estimated.
In contrast to MLR, CART may be run in its base form on ultrahigh dimensional data given sufficient computational resources. With more covariates than observations it would be vital to enforce a minimum number of observations per terminal node at which to commence pruning but if this were done there would be no a priori barrier to the implementation of this technique on such data. Were CART models being fitted within the Bayesian paradigm the tree prior would need to be parameterised to perform a similar function guiding the posterior away from deep trees with low numbers of observations per terminal node. Furthermore, the collinearity present among covariates would not prove a barrier to the implementation of CART (here regression trees as we have a continuous response) via the standard algorithms as even small differences among covariates would be sufficient for one covariate to define a binary partition of the response that reduces some loss metric more than the other covariate. The 63 covariates of our case study analysis, while not prohibitively excessive, would likely lead to some computational burden when fitting Bayesian CART models to these data as the stochastic searches necessary may take many iterations to converge. We postulate this slow convergence due to the collinearity among some covariates and also due to the number of covariates considered. Given the tendency of Bayesian CART to converge upon a local maxima and remain there moving locally for many MCMC iterations thereafter (Chipman et al., 1998) the greater the number of chains with a diversity of starting trees that could be run the greater the conviction one could have that a good enough exploration of the posterior had been conducted to have at least identified some trees that were more than just local maxima of extremely small regions of the posterior. As such a good case exists for devoting significant computational resources to fitting Bayesian CART. Suspected interactions of covariates and non-linear effects of covariates upon the response are a more interesting issue in the context of regression trees. Regression trees by their nature can model complex interactions between covariates and this is a strength they possess relative to MLR. However, as a single regression tree is essentially a collection of stepwise predictions for the response across mutually exclusive partitions of covariate space, deep trees will be necessary to approximate even a simple linear relationship between a covariates and the response well. The limited number of response observations available in our case study and large number of covariates of which several or more may well have non-linear relationships with the response means that trees of sufficient depth to describe multiple non-linear relationships would not be possible with sufficient observations in each terminal node to formulate reliable estimates of the response means there. Subsequently, some form of averaging of multiple trees would be advisable as the average of many slightly different stepwise functions can well approximate a linear or non-linear relationship even when the constituent binary trees are quite shallow. The trees thus averaged could be the trees that scored above some threshold posterior probability or trees fitted to multiple re-samples or re-weightings of the full set of observations.
Fitting regression trees to numerous bootstrap re-samples of the data would be the simplest option for generating multiple regression trees to model average in order construct better approximations to linear and non-linear relationships between covariates and the response than would be possible with a single binary tree. This technique is referred to as bootstrap aggregated trees or bagged trees for short (Breiman, 1996). Furthermore, averaging the predictions from numerous shallow trees each fitted to a different bootstrap re-sample of the data could also allow the various linear and non-linear relationships between different covariates and the response to be incorporated into the model despite the small sample size. If different covariates were most important to different subsets of the response, different trees could result from fitting a regression tree to each bootstrap re-sample and thus this diversity of relationships could be incorporated into the bootstrap aggregation. In this manner, even though the response sample size limits the depths of trees possible, numerous linear and non-linear relationships between covariates and the response could be incorporated into the predictions through the different shallow trees constituent to the model averaging. As such, bagged trees could incorporate approximations of far more linear and non-linear relationships into predictions of the response than it would be possible to utilise via a single binary tree fitted to the same data. As predictions from bagged trees are constructed from the predictions of numerous individual regression trees the same ability to operate on data with high collinearity among the covariates exists for bagged regression trees as does for single regression trees. Similarly, bagged regression trees retain and even enhance the strength of single regression trees at modelling complex interactions between covariates upon their effect on the response. However, these advantages come at the cost of the easily interpretable structure of a single regression tree.
Much the same assertions may be made for random forests.
Random forests (Breiman, 2001), like bagged trees, model average predictions from a set of regression trees derived from bootstrap re-samples of the data and thus have similar advantages to bagged trees over single regression trees at approximating linear and non-linear relationships between covariates and the response. Random forests differ from bagged trees in the manner in which regression trees are fitted to the bootstrap re-sample of the data. At each binary partition in such a tree in a random forest, only a randomly selected subset of the potential covariates are made available to the algorithm to define the binary partition. Random forests may be better than bagged trees at approximating different linear and non-linear relationships between the covariates and the response with a limited number of observations due to the combination of bootstrapping and selecting from random subsets of covariates at each partition resulting in an broader range of shallow trees to aggregate than bootstrapping alone. Random forests, being composed of many individual regression trees, should not be adversely affected by collinearity among covariates and should still be able to model complex interactions.
Boruta all relevant variable selection (Kursa and Rudnicki, 2010) is an elaboration of the random forest method. As such, it shares the advantages of bagged trees and random forests over single regression trees for approximating linear and non-linear relationships between covariates and the response and, like all binary tree based methods, is able to model complex interactions among covariates in their effects upon the response. However, collinearity among the covariates may pose a problem for the application of boruta all relevant variable selection. The problem stems from the boruta algorithm being vulnerable to discarding useful covariates when these covariates are highly correlated with one or more other covariates. If highly correlated covariates were supplied to a random forest algorithm and these covariates contained useful information for predicting the response these covariates could well be used interchangeably throughout particular regression trees for defining binary partitions of the response. Thus the loss in predictive accuracy that would result from permuting the values of just one of these correlated covariates and using it in place of the actual covariate would likely be less than would result if this particular covariate had been used in all binary partitions of the tree where the partition was defined based on one of the other covariates with which this covariate was highly correlated. Subsequently, the importance of this particular covariate to the predictive accuracy of the regression tree could be underestimated to the extent that the Boruta all relevant variable selection algorithm would discard this covariate from the set of covariates worth retaining. Furthermore, if the correlated covariates in question were sufficiently correlated that they were being used interchangeably throughout the regression tree then they could all dilute the estimated importance of the others in this manner such that none of these correlated covariates were selected for retention by the boruta all relevant variable selection algorithm despite all containing much the same important information for predicting the response and thus it being well worth retaining one of them.
Boosting iteratively refits a model to data that is re-weighted at each iteration to emphasize the observations that were poorly predicted by the model fitted in the previous iteration. The predictions from this sequence of models are then combined to produce the final prediction. Where these models are regression trees the procedure is referred to as boosted regression trees (Friedman, 2002). Thus the predictions constructed from a sequence of boosted regression trees will have similar advantages to those from bagged trees and random forests over a those from a single regression tree model. Such aggregated predictions will better approximate linear and non-linear relationships than the predictions from a single shallow regression tree given the small number of response observations and numerous potential non-linear and interacting effects of covariates expected to exist in our data. Furthermore, by its very nature boosting will produce a variety of trees each fitted to re-weightings of the data which emphasize a different subset of the response observations. Thus, where the small sample size would hinder a single regression tree capturing the suspected importance of multiple non-linear relationships between covariates and the response, different trees in the sequence of boosted trees will describe different relationships between covariates and the response that are found to be important to observations up weighted at that iteration. Since the ensuing predictions from all such trees will then be aggregated to produce the final predictions all these different identified relationships will be combined into the predictions of the response. Again, similar to the above approaches that aggregate sequences of regression trees to produce predictions, collinearity among the covariates should not greatly hinder the regression tree fitting process constituent to fitting boosted regression trees as differences will exist among even quite collinear covariates sufficient to choose between them when they are useful for defining a binary partition of the response. Furthermore, still being based on binary trees, boosted regression trees will be able to model complex interactions between covariates in their relationships with the response.
The cubist method https://www.rulequest.com/cubist-info.html is in essence an extension of the M5 or model tree approach (Quinlan, 1992) that incorporates a boosting step. As such the model that is fitted to the iteratively re-weighted data at each iteration of the boosting algorithm starts as a regression tree which is then iteratively pruned back by collapsing parent nodes to the current terminal nodes and fitting MLR models to the subsets of the data defined by these newly created terminal nodes. Each MLR model fitted to the data contained in a newly created terminal node uses as potential covariates all the covariates that defined the binary partitions that were collapsed to create this new terminal node. As such, collinearity among the covariates could lead to poorly defined coefficients in these terminal node MLRs as discussed above (assuming these covariates were also collinear in the subset of observations contained the terminal node in question) unless some shrinkage based fitting was conducted there. This model structure would allow for very flexible description of linear and non-linear relationships between covariates and the response as not only are different stepwise predictions being averaged but constituent in this averaging are also the predictions for various MLR fits. Similarly, being based upon a binary tree structure complex interactions may be modeled implicitly by this method.
Bayesian treed regression (Chipman et al., 2002) is superficially similar to Quinlan's M5 in that it fits MLR models to subsets of the data contained in the terminal nodes of a binary tree. Bayesian Treed Regression, however, is fitted under the Bayesian paradigm via stochastic search as an elaboration of a Bayesian CART model (Chipman et al., 1998). The intricate model structures permitted by fitting MLR models in the terminal nodes of binary trees is attractive for data in which complex combinations of linear, non-linear and interaction effects of covariates upon the response are suspected to be important. Some form of shrinkage would be advisable in the terminal nodes to mitigate the concerns of conducting MLR with collinear covariates. As has been discussed above LASSO may be implemented within the Bayesian framework by placing Laplace priors on the regression coefficients (Park and Casella, 2008). Our greatest concern associated with this method would be the computational burden inherent in conducting a good exploration of the posterior. Collinearity among covariates would slow the convergence of stochastic searches and the shear breadth of possible models would require numerous chains to be run with a great variety of starting values in order to have any confidence whatsoever that a good exploration of the posterior had been conducted given the noted tendency of this algorithm to rapidly converge on local maxima in the posterior then remain in the neighbourhood of this maxima for many subsequent MCMC iterations (Chipman et al., 2002(Chipman et al., , 1998. The number of parameters in the stochastic search could be reduced if one were willing to let the binary tree portion of the model be the only form of allowing for potential interactions of covariates (i.e. consideration of the 63 2 pairwise interaction terms in the MLR models could then be avoided). Further reduction in the number of parameters in the stochastic search could be achieved by only allowing linear terms for each covariates in the terminal node MLRs and relying on model averaging of high posterior probability fits to account for non-linear effects through the averaging of the predictions from multiple combinations of stepwise functions with MLR fits. Such an approach coupled with a tree prior parameterised to avoid deep binary trees would allow for a rich variety of linear, non-linear and interactions effects to be incorporated into the aggregated predictions.
Of the techniques considered, LASSO penalized MLR fitted via the LAR algorithm, random forests and boosted trees are here proposed as the techniques that were most suitable for our case study analysis (see Table 8). These three techniques: were appropriate for application to data with characteristics like those of our case study, suited our objective of building models to interpolate the response, were straight forward to implement with existing software and seemed unlikely to be adversely computationally burdensome with the com-putational resources available. Model-averaging of high posterior probability regression trees identified via the Bayesian CART method also appeared quite well suited to data and objectives like ours but would have required more effort to implement (tree priors require some work to parameterise so as to emphasize tree structures that maintain a minimum number of observations in all terminal nodes) and could well have proved quite computationally expensive with the computational resources available. Model-averaging of high posterior probability fits from Bayesian treed regression also appeared very promising for data and objectives like ours provided some form of shrinkage could be implemented in the terminal node MLRs and the 63 covariates were considered only as linear main effects. However, we note that the implementation of shrinkage in the terminal nodes would not be a trivial task and that both these techniques seemed likely to be quite computationally intensive with data like ours. The choice of an easily implemented technique that was appropriate to our data and computationally efficient was thus reduced to a choice between LASSO penalized MLR fitted via the LAR algorithm, random forests and boosted trees.
We have elected to use LASSO modified MLR fitted via the LAR algorithm in our case study analysis. Model-averaging the predictions from the LASSO solutions obtained from LAR executions within a cross validations scheme yielded an aggregate estimate in a manner conceptually similar to the manners in which random forests, bagged trees or boosted trees aggregate predictions from the same model structure fitted to variations on the same data set. Another useful consequence of using a cross validation based approach was that we were able to estimate the shrinkage parameter for the LASSO fits (λ in Equation 11) via cross validation. The motivation for this approach was an attempt to fit models that would perform well at interpolation. For the purposes of building models for interpolation, LASSO modified MLR fitted via the LAR algorithm was a defensible choice of method from a set of good options. Our preference for LASSO modified MLR fitted via LAR was the result of a secondary interest in which non-linear terms for covariates and which particular pairwise interactions were most useful for predicting the response. This was more readily apparent from LASSO modified MLR fits where each covariate term has a coefficient that has been either shrunk to zero, effectively excluding the covariate term from the model, or assigned a value. When employed within a cross validation scheme this translated into frequencies of selection of these specific covariate terms which were still easy to interpret. In contrast, whether the overall role of a covariate within the aggregated estimate from random forests, bagged or boosted trees was closer to linear or non-linear (and if non-linear what manner of non-linear) would have been harder to judge from the results of such a fit. This ease of interpretability of the LASSO modified MLR came at a cost of having to recenter and rescale (to mean zero and magnitude one) all covariates in each training each set (a requirement of the LAR algorithm (Efron et al., 2004)) and mirror those transformations on each associated validation set whereas this would not have been necessary for binary tree based techniques. Substantial collinearity existed among the 2205 potential covariate terms. We wished to create subsets of the full design matrix in order to explore less collinear sets of potential covariates for two reasons. Firstly, exploring design matrices which included highly collinear pairs of covariates seemed unnecessary. This was because a variable selection algorithm would have selected a covariate from each highly correlated pair of covariates on the basis of minute differences between the covariate values at the locations at which the response was observed and thus this decision could have been adversely influenced by errors in measurement or interpolation. Secondly, the variable selection functions in the version of the 'leaps' package (Lumley and Miller, 2009) we used required that the design matrices explored included no pairs of covariates with correlations coefficient magnitudes greater than 0.4.

49
The covariates derived from the All Terrain Vehicle (ATV) surveys had the finest spatial resolution of all the covariates considered in our case study. In an effort to build models that would have predicted the response with the greatest spatial accuracy, when faced with highly correlated pairs of covariates we chose to retain the covariates collected by the ATV surveys over any others. Of the ATV survey derived covariates: visible Red reflectance (RED), Near InfraRed reflectance (NIR) and soil apparent electrical conductivity (ECA) only ECA had no other covariate terms derived from it while all the vegetation indices were calculated as functions of the RED and NIR reflectance values. For this reason ECA was retained over NIR, RED and any other highly correlated covariate term. As the vegetation indices were theoretically more indicative of green biomass than raw RED or NIR reflectance, and thus potentially more closely related to SOC levels (see 8), vegetation indices were retained over the raw reflectance values were any such pairs overly correlated. Next in the order of detail of spatial resolution were the Foliar Projective Cover (FPC) Layers. We obtained two data such layers: the projected foliage cover for 2011 (FPCI) and the projected foliage cover for 2012 (FPCII). Since 2011 was less temporally removed from the 2009 soil survey than 2012, FPCI was set to be preferentially retained over FPCII or any other highly correlated covariates. The coarsest spatial resolution data were derived from the Digital Elevation Model (DEM). These data included elevation along with terrain and soil hydrology metrics calculated from the elevation. We considered that these terrain and soil hydrology metrics came closer to describing landscape processes that may have influenced SOC formation, mineralization and or transport and thus the spatial distribution of SOC levels. Subsequently, we elected to retain terrain and hydrology metrics over elevation should elevation have been highly correlated with any of these metrics. Any remaining pairs of highly correlated covariates were then chosen between at random. Once this hierarchy of filtering operations had been applied to the 63 potential covariate terms the remainder was expanded to include all remaining covariate terms to polynomial order four and all possible interactions between pairs of linear terms for these remaining covariates.
In the spirit of Occam's razor when searching the expanded design matrix for correlated pairs of covariate terms, single term polynomial terms were set to be retained in preference to any interaction terms with which they were found to be highly correlated. Finally, lower order polynomial terms were set to be retained in preference to any higher order polynomial terms with which they were highly correlated. Once all the above heuristics had been implemented in the order described a selection was made from any remaining pairs of highly correlated covariates at random to complete the enforcement of a maximum permitted correlation coefficient magnitude between covariates in the filtered design matrix.

Appendix E: Choice of Training Set Size and Design Matrix Filtering Austerity
We compared the results of using three different pairs of training and validation set sizes combined with each of four different levels of austerity in the design matrix filtering repeating the variable selection and model-averaging routine for each. The training set sizes compared were 35, 45 and 55. Sets of 500 unique training sets of each of these sizes were constructed from design matrices filtered to enforce maximum correlation coefficient magnitudes between remaining covariate pairs of 0.95, 0.8, 0.6 and 0.4. Given our primary objective of interpolating the response between the soil core observations our focus was on out of sample predictive accuracy. The metrics for out of sample prediction accuracy we adopted were the summary statistics for the distributions of the Validation Set Element Prediction Error (VSEPE) absolute values. The distributions of the absolute values of the VSEPE and the coefficients of determination for the model-averaged predictions from models selected from each of these combinations of training set size and design matrix filtering austerity have been summarized in Table 9. For each level of austerity in filtering the design matrix the distribution of the absolute values of the VSEPE appeared to become more compressed towards zero with increasing training set size. However, it should be noted that the number of validation set elements that were predicted under the scenarios where 500 training sets of 35 observations were used was much larger compared to the number predicted under scenarios where 500 training sets of 55 observations were used (500 * (60 − 35) compared too 500 * (60 − 55)). As such we only recommend comparing the VSEPE distributions obtained from each of the four collections of 500 training sets of the same size that were constructed from design matrices subjected to different the levels of filtering austerity considered.
The final model for each training set was selected from the sequence of models for that training set returned by the LAR algorithm as the model which maximised the predictive accuracy on the associated validation set. As the 35 observation training sets were the smallest of the three sizes of training sets considered, the models selected for these training sets would have had out of sample predictive accuracy most emphasised in this second stage of their selection process where the shrinkage parameter was selected to minimise validation set predictive error. Since areal interpolation of the response from full cover covariate observations via these models selected by LAR is in essence out of sample prediction, we elected to use the results of variable selection on one of the collections of 500 unique 35 observation training sets for this areal interpolation. Of the models selected from the collections of 500 training sets of 35 observations it was those constructed from the design matrix filtered to enforce a maximum permitted correlation coefficient magnitude between covariate pairs of 0.95 that had the first three quarters of the ordered VSEPE absolute values most compressed towards zero. Furthermore, of the scenarios involving 35 observation training sets it was the model-averaged prediction from the models fitted to the 500 training sets constructed from the design matrix filtered to enforce a maximum permitted correlation coefficient between covariates of 0.95 that had the best coefficient of determination. For these reasons, we elected to interpolate the response across the study area with the model-averaged predictions from the models selected by applying LAR to the 35 observation training sets constructed from the design matrix filtered to enforce a maximum permitted correlation coefficient magnitude between remaining covariate pairs of 0.95.