Estimation of soil pH with geochemical indices in forest soils

Soil pH is a critical soil quality index and controls soil microbial activities, soil nutrient availability, and plant roots growth and development. The current study aims to evaluate various pedotransfer functions for predicting soil pH using different geochemical indices (CaO, ratios of Al2O3, Fe2O3, TiO2, SiO2, MgO, and K2O to CaO) in forest soils. Various models including empirical functions (quadratic, cubic, sigmoid, logarithmic) and artificial neural network with these geochemical indices were assessed by independent testing set. Mean bias error (MBE), root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), coefficient of determination (R2), t-statistics (t-stat), and Akaike’s Information Criterion (AIC) were applied to evaluate the model performances. Additionally, a new indicator (global performance indictor, GPI) was originally introduced in this study and was used to rank these models. According to GPI, the sigmoid functions and ANNs performed better than others. On average, they could explain above 70% of the variability in soil pH. Both model structure and dataset shape impact on model performance. The best input was CaO for ANNs, sigmoid, and logarithmic functions. The ratios of K2O to CaO and Al2O3 to CaO were the best inputs for quadratic and cubic equations, respectively.


Introduction
Soil pH indicates soil acidity and alkalinity. Generally, slightly acidic soils are optimal for macro-and micro-nutrients availability [1]. Soil pH impacts on soil nutrients and plant growth and development [2]. It is a critical element for understanding soil nutrient availability and weathering as well as relationships between soil and biota. The relationship between soil pH and base saturation has been well studied. Some researchers observed a curvilinear relationship between soil pH and Ca saturation [3,4]. Others reported a linear relationship between them [5,6].
Soil CaO has been applied to predict soil pH with other geochemical elements. For example, Lukens et al. used ratios of Fe 2 O 3 , TiO 2 , and Al 2 O 3 to CaO to predict soil pH with sigmoid functions [7]. The models produced similar prediction accuracy with coefficient of determination changing between 0.7 and 0.74, root mean square error between 0.83 and 0.88. Nordt and Driese found that bulk soil CaO + MgO could be used to predict soil pH in Vertisol [8]. The prediction of soil pH using bulk soil elemental oxides is also an issue in pedotransfer functions. Soil CaO, is one source of Ca 2+ supply to soil solution, we believe that itself could be used to estimate soil pH. However, studies on this topic were limited.
The objectives of the current study were to (1) evaluate various pedotransfer functions for predicating soil pH using several geochemical indices and (2) investigate the usefulness of soil CaO to predict soil pH. To do this, five models with different geochemical indices were compared and tested. Specifically, artificial neural networks were evaluated with respect to the non-linear relationship between soil pH and the geochemical indices. Model performances were evaluated by an independent validation set.

Study site
The study area covering 13326 km 2 is located in the core region of the Three Gorges Reservoir of China (Fig 1). It has a humid subtropical monsoon climate with a mean annual precipitation of 1267 mm and a mean annual temperature of 16.02˚C. The elevation varies between 175 and 2033 m with a mean of 643 m. The slope changes between 0.45˚and 52.96˚with a mean of 17.83˚.

Data
A total of 1163 samples were collected from forest soils in the study area (Fig 1), where the major bedrock lithologies are carbonate rocks and sandstone and soil type is Combisols [9]. The study did not involve private land, protected land, endangered or protected species. No specific permissions were required for these locations/activities. In order to ensure an even distribution of selected sites, systematic sampling using a regular grid was applied in this work [10]. Surface soils at 0-20 cm depth were collected at a density of 1 sample/km 2 . For each sampling site, 3 to 5 subsamples collected within 50 m of the site were mixed to represent the sample. All the sampling locations were recorded by Global Positioning System (GPS). Standard measurements were performed on the soil samples. Prior to laboratory analysis, samples were air-dried and passed through a 2 mm soil sieve. Soil pH was determined in a soil-to-water ratio of 1:2.5 with a glass electrode. The elements (Al 2 O 3 , Fe 2 O 3 , TiO 2 , SiO 2 , K 2 O, Mg 2 O, and CaO) were measured by Inductively Coupled Plasma-Optical Emission Spectrometry (ICP-OES) method [10].
Ratios of Al 2 O 3 , Fe 2 O 3 , TiO 2 , SiO 2 , MgO, and K 2 O to CaO (hereafter AlCa, FeCa, TiCa, SiCa, MgCa, and KCa) and CaO were used to develop the pedotransfer functions to predict soil pH in forest soils [7]. These geochemical indices were calculated by where X represents Al 2 O 3 , Fe 2 O 3 , TiO 2 , SiO 2 , MgO, and K 2 O. All data were divided into calibration and validation sets for each dataset. Approximately 2/ 3 of the data were used to develop (or train) the models. The remaining 1/3 of the data were used to validate the models.

Models
Both empirical functions (quadratic, cubic, sigmoid, and logarithmic) and artificial neural network were tested in this work. The expressions of these empirical functions are given in Table 1. For sigmoid function, parameter k and p are the minimum and range of the response, respectively.
The artificial neural networks (ANNs) that are inspired by biological neural network are also frequently used tools for various fields [11][12][13]. ANNs can deal with both linear and nonlinear relationships between variables [11,12]. In the current study, ANNs with three layers (an input, a hidden, and an output layers) were tested and trained with scale conjugate gradient back propagation algorithm (Fig 2). The output of a node is, where f is an activation function, y is the output of a node j, x i is an input of the vector of inputs, w ij is the weight connected the input x i to the node j, and b j is a bias associated with the node j. The parameters (weight and bias) are determined during the training stage based on a set of input data and targets. The tangent and linear activation functions were used in the hidden layer and output layer, respectively [14][15][16][17].
The numbers of neurons in the hidden layer between 2 and 20 were tried. To train the ANNs, three datasets were created randomly based on the calibration dataset for training (70%), validating (15%), and testing (15%). The ANNs with the lowest value of root mean square error (RMSE) and the highest value of coefficient of determination (R 2 ) were selected to predict soil pH using the geochemical indices. Number of parameters was calculated by Table 1. Empirical models used in the current study.

Name
Ab. Equation Logarithmic Log y = b 0 +b 1 ln(x) a k and p are the minimum and range of the response, respectively.

Performance evaluation
Model performances could be evaluated by comparing predicted and measured data based on a set of statistical error indicators. In this work, mean bias error (MBE), root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), and coefficient of determination (R 2 ), t-statistics (t-stat), and Akaike's Information Criterion (AIC) [19] were employed to assess the model performances based on the independent validation set.
RMSE ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where n is the number of observations, y i , and ŷ i are the measured and estimated soil pH of the ith soil sample, respectively, � y is the mean value of the measured soil pH, k is the number of parameters. MBE shows overall under-or over-estimation tendency. A negative value of MBE indicates an overestimation of the model, and a positive one indicates an underestimation of the model. The most accurate model has an MBE value closed to zero, lower values of RMSE, MAPE, MAE, t-stat, AIC, and a higher value of R 2 . Each statistical error indicator has its specific strength and weakness. For example, RMSE is not a better indicator than MBE for evaluating average model performance [20]. However, MBE could not give the correct performance when the model has overestimations and underestimations at the same time. Therefore, to find out the best model based on the above-mentioned indicators, a new Global Performance Indicator (GPI) was introduced in this work. Each indicator should be scaled on a scale of 0-1 with 0 being the best and 1 representing the worst. For the indicators that have negative or positive values, their absolute values are used in GPI. For the indicators that the lower the better (e.g., RMSE and MAPE etc.), the minimum is scaled to 0 and maximum to 1 (Eq 11). For the indicators that the higher the better (e.g., R 2 ), the maximum is scaled to 0 and minimum to 1 (Eq 12). For the ith model, the GPI was defined as, Where P is the performance indicator. P max and P min are the maximum and minimum of P for the corresponding indicators of the evaluated models. I ij is the scaled value of indicator j for the ith model and m is the number of performance indicators. Models with GPI closer to zero perform better.

Statistical analysis
A one-way analysis of variance (ANOVA) was used to test the difference in variables between calibration and validation sets. Pearson's correlation coefficients were calculated to determine the strength of correlations between soil pH and geochemical indices. The analyses of descriptive statistics were performed in SPSS v13.0. Model development and validation were done by MATLAB v9.0.

Model calibration
The coefficients of determination (R 2 ) of the developed models based on the calibration set are given in Table 5. The ANNs with 18,7,11,7,14,19, and 15 hidden nodes were applied to estimate soil pH using CaO, AlCa, FeCa, SiCa, TiCa, MgCa, KCa, and respectively ( Fig 5). On average, ANN produced the highest value of R 2 (0.73), followed by sigmoid (R 2 = 0.7) and cubic (R 2 = 0.63) equations. The values of R 2 ranged between 0.21 (p < 0.01, logarithmic equation with SiCa) and 0.77 (p < 0.01, ANN with SiCa).  Soil pH and geochemical indices in forest soils

Model performance
Performances of the models were evaluated based on the validation set and the statistical error indicators were shown in Table 6. On average, all models except sigmoid functions presented underestimation tendency according to MBE. In terms of MAPE, models gave good estimation of soil pH (mean MAPE = 7.4%). ANN and sigmoid models could explain above 70% of the variability in soil pH (R 2 = 0.73 and 0.71, respectively). Logarithmic model performed worst with the highest values of MBE, RMSE, MAPE, MAE, AIC, and the lowest values of R 2 . ANN gave the best estimations of soil pH according to RMSE, MAPE, MAE, t-stat, and R 2 . Sigmoid model performed best based on AIC and MBE. The geochemical indices gave varied prediction performances with models. For example, SiCa produced the highest R 2 in ANNs, KCa in quadratic and cubic functions, CaO in logarithmic and sigmoid models. Lukens et al. [7] predicted soil pH by AlCa, FeCa, and TiCa using sigmoid models. They reported that TiCa and FeCa gave slightly better performances than AlCa. In the current work, CaO, AlCa, SiCa, and KCa produced better predictions of soil pH than FeCa and TiCa using sigmoid functions based on R 2 . Models gave different prediction accuracy indicated by different statistical error indicators. For example, ANN with SiCa was the best one in terms of RMSE, MAPE, MAE, and R 2 . Sigmoid function with TiCa performed best based on MBE and t-stat. Cubic with KCa was the best according to AIC.
Because the used statistical error indicators did not always give the consistent results, the GPI was introduced and calculated by combining these indicators. The ranking of the models according to each accuracy indicator and GPI was reported in Table 6. On average, the results of GPI indicated that sigmoid model, ANN, and cubic were ranked 1st, 2nd, and 3rd. The model performance indicated by GPI was acceptable and better, because it combined all the performance tests. GPIs were also calculated within each model. The geochemical indices gave different performance for the evaluated models. CaO ranked 1st in ANNs, sigmoid and logarithmic functions. KCa ranked 1st in quadratic models. Therefore, CaO and KCa were the best inputs to predict soil pH for both ANNs and the empirical equations over the study site. Scatter Table 4 plots of the observed and predicted soil pH by ANN with CaO and sigmoid with CaO were given in Fig 6. Statistics of validation results were listed in Table 7. The maximum pH values were underestimated while the minimums were overestimated for both models. There was no significant difference in soil pH between observations and predictions for the two models.

Discussion
On average, ANNs performed better than cubic, quadratic, and logarithmic functions. Among the empirical approaches, sigmoid function was the best one. Model structure results in the differences between them [21]. ANN constructs a network connected with weighted nodes that were trained by certain algorithms. Compared with other models, the main advantages of ANNs are: 1) they are non-parametric techniques and do not need any model assumptions; 2) ANNs have no assumption on data distribution. Generally, ANN is often criticized for its complex network structure that makes the results difficult to interpret [22]. The indicator, AIC, based on an "information-theoretical approach" has been widely used for model selection [23][24][25]. In this case, ANNs produced higher values of AIC than others, due to the larger number of model parameters. Besides, data set shape also impacts on model performance, especially for the empirical functions. The rank order of them are sigmoid > cubic > quadratic > logarithmic functions. The best input was CaO for ANNs, sigmoid and logarithmic functions. The ratios of K 2 O to CaO and Al 2 O 3 to CaO were the best inputs for quadratic and cubic equations, respectively. CaO and the ratios of elemental oxides to CaO could be used to predict soil pH, because Ca 2+ is the main driver affecting soil pH [7]. The sigmoid functions indicated the geochemical indices have different rates of change in soil pH. This was also given by the scatter plots ( Fig  4). The oxides that were more abundant than CaO had higher values of growth rate and inflection point (e.g., SiO 2 , Al 2 O 3 , Fe 2 O 3 ) and vice versa (e.g., TiO 2 , MgO, K 2 O). Lukens et al. (2018) stated that samples collected from calcareous soils could have a relatively large values of FeCa or AlCa and compressed intervals at higher index values, where pH decreases as a function of Ca loss and Fe or Al gain. This could also explain the relationships between soil pH and the ratios of elemental oxides to CaO over the current study site. Soil pH is a key parameter for understanding soil weathering and relationships between soil nutrient availability and environmental factors. Weathering indices that incorporate Ca in some form could track soil pH. A recent study reported that soil pH values are closely correlated with water balance (mean annual precipitation-mean annual potential evapotranspiration) at global scale [26]. The pedotransfer functions and geochemical proxies compared and evaluated in the current study could be used to estimate significantly environmental components in the past time [7].

Conclusions
Various pedotransfer functions with different geochemical indices were applied to estimate soil pH in forest soils. The predicted data were compared to the measurements of an individual validation dataset. In order to do so, 7 statistical indicators have been applied to test models performances. Moreover, a new accuracy factor, named Global Performance Indicator (GPI), was originally introduced in this study and was used to rank the proposed models. The rank order was sigmoid > artificial neural network > cubic > quadratic > logarithmic. Soil CaO could be used to predict soil pH with ANNs, sigmoid and logarithmic functions. KCa and AlCa were the best inputs for quadratic and cubic equations, respectively.