In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near-Infrared Spectroscopy

In situ measurements with visible and near-infrared spectroscopy (vis-NIR) provide an efficient way of acquiring soil information from paddy soils in the short time gap between the harvest and the following rotation. The aim of this study was to evaluate the feasibility of using in situ vis-NIR to predict a series of soil properties, including organic matter (OM), organic carbon (OC), total nitrogen (TN), available nitrogen (AN), available phosphorus (AP), available potassium (AK) and pH, of paddy soils in Zhejiang province, China. Firstly, linear partial least squares regression (PLSR) was performed on the in situ spectra and the predictions were compared to those obtained with laboratory-based spectra. Then, the non-linear least-square support vector machine (LS-SVM) algorithm was applied with the aim of extracting more useful information from the in situ spectra and improving the predictions. Results show that (i) for OC, OM, TN, AN and pH, predictions with the PLSR algorithm were worse using in situ spectra than using laboratory-based spectra; (ii) for the same properties, the prediction accuracy with in situ vis-NIR spectra using LS-SVM (R² > 0.75, RPD > 1.90) was clearly improved compared to PLSR, and was comparable to or even better than the results obtained using laboratory-based spectra with PLSR; (iii) for AP and AK, poor predictions were obtained with in situ spectra (R² < 0.5, RPD < 1.50) using either PLSR or LS-SVM. The results highlight the use of LS-SVM for in situ vis-NIR spectroscopic estimation of soil properties of paddy soils.


Introduction
Paddy soil is one of the most important soil resources for humans because more than half of the world's population relies on rice, the typical farming product of paddy soils, as a staple food. As one of the major rice producers, China has a large area of paddy fields of more than 25 million hectares, accounting for 29% of the cultivated lands of China and 23% of the world [1]. In the past 30 years, due to over-fertilization, a significant decline in soil pH has been found in major crop production areas, and enhanced nitrogen deposition has been identified in terrestrial and aquatic ecosystems as well as in rice [2,3]. As a result, characterizing the properties of paddy soils efficiently is of great importance for the management of crop growth and yield.
Over the past decades, various agricultural sensors have been used to determine soil properties as well as their spatial variability [4]. Among these sensors, visible and near-infrared (vis-NIR) spectroscopy has gained popularity because it is faster, less labor-intensive and more cost-effective than conventional chemical analysis, and it enables rapid measurement of various soil physical and chemical properties. However, the flooded soil condition in paddy fields makes it difficult to perform soil sampling and analysis. The best time for soil measurement is the short time gap between the harvest and the following rotation, when the irrigation water has been drained away. Despite the success of predicting various soil properties using laboratory-based vis-NIR measurements, the pretreatment of samples (e.g. air-drying, grinding and sieving) is still tedious and time-consuming. Because it is faster and more efficient than laboratory-based spectroscopic measurement, in situ vis-NIR is a promising method for measuring and mapping soil properties of paddy fields [5].
Researchers have reported successful application of in situ vis-NIR spectroscopy to the prediction of several soil properties. In terms of predicting clay content, Waiser et al. (2007) [6] found that in situ vis-NIR sensing can obtain results similar to those of laboratory-based sensing. With regard to soil organic and inorganic carbon, Morgan et al. (2009) [7] obtained slightly larger prediction errors using field-based vis-NIR measurements than using the laboratory-based sensing method. When predicting soil color and mineral composition, Viscarra Rossel (2009) [8] concluded that results from in situ vis-NIR measurements were in good agreement with Munsell Book and X-ray diffraction methods. Furthermore, Mouazen et al. (2009) [9] improved the prediction accuracy of available P by optimizing the field-based vis-NIR sensing system. In addition, a few other soil properties have been predicted with acceptable accuracy, including soil organic matter [10], nitrogen [11,12], pH [13] and water content [12,14]. However, most of the studies on predicting soil properties using in situ vis-NIR spectroscopy were conducted on dry farming land.
Although a couple of studies have determined properties of paddy soils using laboratory-based vis-NIR spectroscopy [15,16], to the best of our knowledge, few papers have described the systematic use of in situ vis-NIR measurements to predict soil properties in paddy fields.
The aims of this study were to evaluate the feasibility of in situ vis-NIR sensing for prediction of soil properties in paddy soils by (i) predicting various soil properties of paddy soils (i.e. organic carbon (OC), organic matter (OM), total nitrogen (TN), available nitrogen (AN), available phosphorus (AP), available potassium (AK) and pH) using in situ vis-NIR spectroscopy; (ii) comparing the prediction accuracy between in situ vis-NIR spectra and laboratory-based spectra for paddy soils; (iii) evaluating the prediction accuracy of in situ vis-NIR measurements of soil properties by implementing a multivariate calibration algorithm, i.e., linear partial least square regression (PLSR), and a data-mining algorithm, i.e., least-square support vector machine (LS-SVM).

Ethics Statement
We randomly chose 11 paddy fields in the close vicinity of six cities in Zhejiang province and obtained permission from the Agricultural Bureaus of these six cities, i.e. Tonglu

Soil sampling and spectroscopic measurements
In this study, the spectra of the soil samples were recorded by proximal in situ stationary vis-NIR sensing and by laboratory-based vis-NIR measurements. A total of 184 sampling sites were randomly selected in eleven paddy fields in Zhejiang Province, China, with latitudes ranging from 29°03′N to 30°10′N, and longitudes from 119°10′E to 122°48′E. The water in the paddy fields was drained and the fields were left to dry for 10 days prior to sampling and vis-NIR measurement.
The vis-NIR measurements at 104 sampling sites were taken in November 2011, while the remaining 80 sites were surveyed in August 2013. At each site, the water content of the surface soil (i.e. 0-20 cm) was first measured using a TDR-300 (Spectrum Technologies Inc., USA) with a 20-cm guide. Then, a soil sample was collected using a cube soil sampler to a depth of 20 cm. The surface of the sample profile was flattened and evened, without smearing the soil. Spectra were recorded at three randomly selected locations at different depths within the A horizon. If there were stones, roots or voids within the soil sample, spectroscopic measurements were made on the adjacent area. For each of the three sensing locations, 10 spectra were recorded, and the mean of all 30 spectra was used to represent the spectrum of the soil at that site. In total, 184 spectra were recorded under field conditions, one spectrum per site.
After the in situ vis-NIR measurements, the samples were packed into plastic bags, labeled and transported to the laboratory. The soil samples were air-dried, ground and sieved to less than 2 mm. The vis-NIR spectra of these 184 samples were then measured again under laboratory conditions. The chemical analyses of soil properties, which are described below, were also conducted on these samples.
A Fieldspec ProFR vis-NIR spectrometer (Analytical Spectral Devices, Boulder, CO, USA) was used for the in situ and laboratory-based measurements. The instrument measures spectra between 350 and 2500 nm, with a resolution of 3 nm at 700 nm and 10 nm at 1400 nm and 2100 nm. The sampling resolution of the spectra is 1 nm. To implement in situ sensing, a high intensity contact probe (Analytical Spectral Devices) was used to prevent interference from stray light during measurement. The probe has its own light source and a viewing window of 2 cm in diameter through which the measurements are made. To keep the measurements consistent, the contact probe was also used in the laboratory-based measurement. A Spectralon panel with 99% reflectance was used to calibrate the spectrometer before each measurement.

Chemical analysis
Soil OM was measured using the H2SO4-K2Cr2O7 oxidation method at 180°C for 5 minutes [17]. Soil TC and OC content were determined by dry combustion at 1100°C with a multi N/C 3100 (Analytik Jena AG, Germany). Before the determination of soil OC, soil samples were acidified with hydrochloric acid to remove the inorganic carbon in the soil. Soil TN was measured using the semi-micro Kjeldahl method and soil AN was measured by the alkaline hydrolysis diffusion method [18]. Soil AP was measured by the NH4F-HCl method [18]. Soil AK was measured using the NH4OAc extraction method and analyzed using a flame photometer [18]. Soil pH was measured in a 1:1 soil:water suspension [18]. The statistics of the measured soil properties are listed in Table 1.

Data pre-processing
The spectral regions 350-399 nm and 2451-2500 nm were deleted because of noise. The reflectance spectra were transformed to apparent absorbance (log(1/R)) and then mean-centered. The spectra were smoothed using the Savitzky-Golay algorithm with a window size of 11 and a polynomial of order 2 [19]. One sample was regarded as an outlier and removed from the dataset because its spectrum was anomalous. For each soil property, the measured values were sorted from small to large, and every fourth one was assigned to the test dataset, leaving the rest in the training dataset.
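The pre-processing steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, the logarithm is assumed to be base 10, mean-centering is assumed to be per wavelength, the smoother pads the band edges by repetition, and "every fourth" is assumed to mean the 4th, 8th, 12th, ... value in sorted order.

```python
import numpy as np

def to_absorbance(reflectance, wavelengths, lo=400, hi=2450):
    """Drop the noisy band edges, convert R to log10(1/R) and mean-center each band."""
    reflectance = np.asarray(reflectance, dtype=float)
    wavelengths = np.asarray(wavelengths)
    keep = (wavelengths >= lo) & (wavelengths <= hi)  # removes 350-399 and 2451-2500 nm
    absorbance = np.log10(1.0 / reflectance[:, keep])  # base 10 is an assumption
    return absorbance - absorbance.mean(axis=0), wavelengths[keep]

def savgol_smooth(spectra, window=11, order=2):
    """Savitzky-Golay smoothing along the wavelength axis (edge padding is an assumption)."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, order + 1, increasing=True)  # columns: 1, k, k^2
    coeffs = np.linalg.pinv(design)[0]  # weights that give the fitted value at the window centre
    padded = np.pad(np.asarray(spectra, dtype=float), ((0, 0), (half, half)), mode="edge")
    return np.array([np.convolve(row, coeffs, mode="valid") for row in padded])

def every_fourth_split(values):
    """Sort by property value and send every fourth sample to the test set."""
    order = np.argsort(values, kind="stable")
    test_idx = order[3::4]                      # 4th, 8th, 12th, ... in sorted order
    train_idx = np.setdiff1d(order, test_idx)   # all remaining sample indices
    return train_idx, test_idx
```

With 183 samples this split gives 45 test and 138 training samples, and both sets cover the full range of each property because the selection runs over the sorted values.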

Partial least square regression (PLSR)
Among the multiple linear calibration algorithms, partial least square regression (PLSR) [20] is one of the most popular algorithms used for spectral calibration and prediction. It is closely related to principal component regression (PCR) yet with a slight difference. Both of them compress the data before prediction while PLSR avoids the dilemma encountered by PCR of choosing components for the regression [21].
We assume that the spectral data matrix used as the independent variable in PLSR is $X = [x_1, x_2, \cdots, x_i]$ and that the soil property used as the dependent variable is y, both mean-centered. The first step of PLSR is to extract a few linear combinations (called components or factors), T, of the original spectral matrix X:

$$T = XV \tag{1}$$

where V are the scaled weights, which can be calculated as the eigenvectors of the matrix $X^{\top} y y^{\top} X$. Then both X and y can be regressed onto T as follows:

$$X = TP^{\top} + E \tag{2}$$

$$y = Tq + f \tag{3}$$

where P are the spectral loadings and q are the chemical loadings, describing how the variables in T relate to X and y. E and f are residuals and represent noise or irrelevant variability in X and y.
After the model parameters are estimated, they can be combined into the final prediction model as

$$\hat{y} = b_0 + \sum_{i} x_i \hat{b}_i \tag{4}$$

where $b_0$ is the intercept and $\hat{b}_i$ are the regression vectors. A detailed description of $\hat{b}_i$ can be found in [21].
To avoid over-fitting or under-fitting, leave-one-out cross validation was used to determine the number of factors to retain in the calibration models [22]. The root mean square error of cross validation (RMSE_CV) and the Akaike information criterion (AIC) [23] were used to decide the number of factors.

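$$\mathrm{RMSE}_{CV} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{5}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the observed value and n is the number of calibration samples.

$$\mathrm{AIC} = n \ln(\mathrm{RMSE}_{CV}) + 2p \tag{6}$$

where n is the number of samples and p is the number of features used in the prediction. The best model has the smallest RMSE_CV and AIC.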
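As an illustration of how the number of factors can be chosen, the following minimal sketch fits a single-response PLS model with the NIPALS algorithm and evaluates it by leave-one-out cross validation. It is not the authors' implementation: the function names are hypothetical, NIPALS is only one of several equivalent ways to compute PLSR, and the AIC is taken as n ln(RMSE_CV) + 2p, one common form, which is an assumption here.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Fit PLS1 by NIPALS. Returns (x_mean, y_mean, b) so that
    y_hat = y_mean + (x - x_mean) @ b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # mean-centered copies, deflated below
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xk.T @ yk                        # weight vector: covariance of X with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                     # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                           # scores (one column of T)
        tt = t @ t
        p = Xk.T @ t / tt                    # spectral loadings
        qk = yk @ t / tt                     # chemical loading
        Xk = Xk - np.outer(t, p)             # deflate X and y
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)      # regression vector in the original X space
    return x_mean, y_mean, b

def loo_rmse_cv(X, y, n_factors):
    """Leave-one-out RMSE_CV for a given number of factors."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        x_mean, y_mean, b = pls1_fit(X[mask], y[mask], n_factors)
        errors[i] = y_mean + (X[i] - x_mean) @ b - y[i]
    return float(np.sqrt(np.mean(errors ** 2)))

def aic(rmse_cv, n, p):
    """Assumed AIC form: n * ln(RMSE_CV) + 2 * p."""
    return float(n * np.log(rmse_cv) + 2 * p)

def factor_table(X, y, max_factors):
    """RMSE_CV and AIC for 1..max_factors factors, for plotting or inspection."""
    n = len(y)
    rows = []
    for k in range(1, max_factors + 1):
        rmse = loo_rmse_cv(X, y, k)
        rows.append((k, rmse, aic(rmse, n, k)))
    return rows
```

Plotting the two columns returned by `factor_table` against the number of factors gives the kind of curve used to pick a model that is both accurate and parsimonious.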

Least square support vector machine (LS-SVM)
Support vector machine (SVM) is a kernel-based learning algorithm [24] that has been widely used in pattern classification and regression. Kernel-based learning methods implicitly map the input data into a high-dimensional feature space defined by a kernel function, in which a regression model is built. As an optimized algorithm based on the standard SVM, the least-squares support vector machine (LS-SVM) [25] uses a squared loss function instead of the ε-insensitive loss function, from which equality constraints rather than inequality constraints follow. Compared to SVM, complex calculations are avoided in LS-SVM and the multivariate calibration problem can be solved relatively quickly. The theory of LS-SVM was introduced by Suykens et al. [25].
Similarly, the spectral data matrix used as the independent variable is $X = [x_1, x_2, \cdots, x_i]$, and the soil property used as the dependent variable is y. LS-SVM uses the nonlinear regression function:

$$\hat{y}(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b_0 \tag{7}$$

where $b_0$ is the bias; n is the number of samples; $x_i$ is the measured vis-NIR spectrum of the i-th sample; and $K(x, x_i)$ is defined by the kernel function. We used the radial basis function (RBF) kernel, which is the typical general-purpose kernel:

$$K(x, x_i) = \exp\left(-\frac{\lVert x - x_i \rVert^2}{\sigma^2}\right) \tag{8}$$

where σ² is the RBF kernel parameter, which determines the width of the kernel. The $\alpha_i$ are Lagrange multipliers (i.e. support values), which are obtained by solving the linear Karush-Kuhn-Tucker (KKT) system:

$$\begin{bmatrix} 0 & 1_n^{\top} \\ 1_n & K + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b_0 \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \tag{9}$$

where I is the (n×n) identity matrix; γ is the regularization parameter, which balances the model's complexity against the training errors; $1_n$ is an (n×1) vector of ones; y is the (n×1) vector of observed property values; and K is the kernel matrix. As these formulas show, two additional parameters (i.e. γ and σ²) need to be set by the user in order to build an LS-SVM model. The regularization parameter γ determines the trade-off between minimizing the fitting error and the smoothness of the estimated function, and is important for the generalization performance of the LS-SVM model. An increase in γ is analogous to an increase in the number of latent variables in a PLS model [26]. The RBF kernel parameter σ² changes the width of the kernel, and thus the degree of non-linearity that can be modeled. When σ² increases, the kernel becomes wider, forcing the model towards a linear regression, and its accuracy may decrease as well. By contrast, a decreased σ² combined with an increased γ may lead to over-fitting and should therefore be treated cautiously [26].
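To make the role of the two parameters concrete, the sketch below trains an LS-SVM regressor by solving the KKT linear system directly with NumPy. It is a minimal sketch under stated assumptions rather than the toolbox used in the study: the function names are hypothetical, and the RBF kernel is assumed to be scaled as exp(-d²/σ²) (some implementations divide by 2σ² instead).

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """RBF kernel matrix between the rows of A and B (assumed scaling: exp(-d^2 / sigma2))."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve the (n+1) x (n+1) KKT system for the bias b0 and support values alpha."""
    n = len(y)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0                                   # first row: sum(alpha) = 0
    system[1:, 0] = 1.0                                   # bias column
    system[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]                      # b0, alpha

def lssvm_predict(X_train, b0, alpha, sigma2, X_new):
    """Prediction: y_hat(x) = sum_i alpha_i K(x, x_i) + b0."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b0
```

Because training reduces to a single linear solve, every training sample receives a non-zero support value, which is the price LS-SVM pays for avoiding the quadratic programming step of a standard SVM.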

Assessment of statistics
The coefficient of determination (R²), the root mean square error (RMSE) and the ratio of prediction to deviation (RPD) were used to compare the prediction accuracies.

$$R^2 = \frac{\left[\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{10}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{11}$$

$$\mathrm{RPD} = \mathrm{SD} / \mathrm{RMSE} \tag{12}$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the observed value; $\bar{y}$ is the mean of the observed values; $\bar{\hat{y}}$ is the mean of the predicted values; SD is the standard deviation of the observed values; and n is the number of samples.
Williams (2003) [28] proposed a criterion for the classification of R² and RPD: an R² value below 0.50 or an RPD value below 1.5 indicates very poor model predictions that are not useful; an R² value between 0.50 and 0.65 or an RPD value between 1.5 and 2.0 indicates a possibility of distinguishing between large and small values; and an R² value between 0.66 and 0.81 or an RPD value between 2.0 and 2.5 makes approximate quantitative predictions possible. For an R² value between 0.82 and 0.90 or an RPD value between 2.5 and 3.0, the prediction is classified as good. If the R² value is larger than 0.91 and the RPD value is larger than 3.0, the prediction is considered excellent. Generally, a good model prediction has large values of R² and RPD and a small value of RMSE. To simplify the classification, Grades A to E were assigned to the accuracy classes from excellent to not useful.
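The three statistics and the grade assignment can be illustrated with a short, dependency-free sketch. Several choices here are assumptions rather than details given in the text: R² is computed as the squared correlation between observed and predicted values, SD uses the n-1 denominator, and the grade is assigned from RPD alone, so borderline cases may be graded differently when R² is also taken into account. The function names are hypothetical.

```python
import math

def prediction_stats(observed, predicted):
    """Return (R2, RMSE, RPD) for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((p - mean_pred) * (o - mean_obs) for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    r2 = cov ** 2 / (ss_obs * ss_pred)                  # squared correlation (assumed form)
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)
    sd = math.sqrt(ss_obs / (n - 1))                    # sample SD of observed values (assumption)
    return r2, rmse, sd / rmse

def grade_from_rpd(rpd):
    """Map RPD to the accuracy grades A (excellent) to E (not useful)."""
    if rpd > 3.0:
        return "A"
    if rpd >= 2.5:
        return "B"
    if rpd >= 2.0:
        return "C"
    if rpd >= 1.5:
        return "D"
    return "E"
```

For instance, `grade_from_rpd(2.18)` falls in the approximately quantitative class (Grade C), while any RPD below 1.5 maps to Grade E.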

Comparison of in situ spectra and laboratory-based spectra
The average reflectance (R) of the in situ and laboratory-based measurements of 183 samples and their respective standard deviations are given in Fig. 1. In brief, in situ spectra have smaller reflectance values than the laboratory-based spectra. This is because soil moisture, which replaces the air within the soil pores, increases the forward scattering of light and thus increases absorption at every wavelength [30].
Near-infrared (NIR) spectra are dominated by weak overtones and combinations of fundamental vibrations that occur in the mid-infrared (MIR) region, while visible spectra mainly consist of electronic transitions [22]. The absorption features of the raw reflectance spectra are usually broad and weak, and some of them are difficult to distinguish with the naked eye. As such, continuum removal was applied to all spectra to emphasize their absorption features. The averaged continuum-removed reflectance (CR) is given in Fig. 2, and wavelength-specific t-tests were performed between the continuum-removed laboratory-based and in situ spectra. In Fig. 2, shaded regions show where there were significant differences between the spectra at the α = 0.01 significance level. The absorption features due to soil iron oxides near 430 nm and 480 nm [31] have similar size and shape in both in situ and laboratory-based spectra. However, the absorption feature near 650 nm, probably correlated with haematite (Fe2O3) [32,33], is greater in the in situ spectra than in the laboratory-based measurements. The shallow absorption near 1000 nm, which may be due to the amidogen group, is present in both in situ and laboratory-based spectra, but differs significantly between them. The most obvious differences between the two types of spectra are located in the two primary water absorption regions within the NIR spectrum, i.e. one around 1450 nm and the other near 1950 nm. This can be explained by the permanently waterlogged conditions of the paddy soil samples. The absorptions caused by soil moisture increase when the soil is wet, and their features broaden and deepen compared to the laboratory-collected spectra. However, the strong water absorption near 1950 nm in the in situ spectra partly masks the absorptions of clay minerals near 2200 nm, which can be identified in the dry laboratory-based spectra. This might affect the prediction accuracies of the spectroscopic models [30].
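Continuum removal is commonly implemented by fitting an upper convex hull over the reflectance spectrum and dividing the spectrum by that hull, so that absorption features appear as dips below 1. The sketch below follows this common definition; the text does not state which implementation was used, so the approach and the function name are assumptions.

```python
import numpy as np

def continuum_removed(wavelengths, reflectance):
    """Divide a reflectance spectrum by its upper convex hull (the continuum)."""
    x = np.asarray(wavelengths, dtype=float)   # must be in increasing order
    y = np.asarray(reflectance, dtype=float)
    hull = []                                  # indices of the upper-hull vertices
    for k in range(len(x)):
        # Pop the last vertex while it lies on or below the line to the new point.
        while len(hull) >= 2:
            i, j = hull[-2], hull[-1]
            cross = (x[j] - x[i]) * (y[k] - y[i]) - (y[j] - y[i]) * (x[k] - x[i])
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(k)
    continuum = np.interp(x, x[hull], y[hull])  # piecewise-linear hull at every wavelength
    return y / continuum
```

The result equals 1 at the hull vertices and is smaller than 1 inside absorption features, which makes weak bands easier to compare between in situ and laboratory spectra.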

Prediction of soil properties using PLSR
PLSR was performed on the training dataset with the optimal number of factors decided by leave-one-out cross validation, and the test dataset was used to validate the PLSR model independently. Taking TN as an example, the cross-validated RMSE_CV and AIC were plotted against the number of factors (Fig. 3). The optimal number of factors was selected based on the minimum RMSE_CV and AIC. At the same time, when comparable predictions can be obtained, a smaller number of factors should be preferred to reduce the complexity of the model. As a result, 8 factors were selected for PLSR with laboratory vis-NIR spectra.
The prediction accuracy of the seven soil properties with laboratory-based soil spectra using the PLSR method, together with their accuracy classes, is presented in Table 2. Of all the measured soil properties, TN was best predicted, with an R² of 0.87 and an RPD of 2.81 (Grade B). OM and OC were approximately quantitatively predicted (Grade C), with an R² of 0.81 and an RPD of 2.30 for OM, and an R² of 0.81 and an RPD of 2.20 for OC. The predictions of TN, OM and OC are comparable to those of previous studies [34,35]. The successful predictions of these properties are mainly because carbon and nitrogen have direct spectral responses due to the overtones and combinations of N-H, C-H+C-H and C-H+C-C in the vis-NIR spectra [36,37].
However, the prediction accuracy often varies with the forms of carbon and nitrogen present in the soils [36,38]. This phenomenon also occurs in our results. For example, the prediction of AN shows a lower accuracy than that of TN, with an R² of 0.86 and an RPD of 2.49 (Grade B). This is because most of the AN in soil is inorganic, which has no characteristic absorption in the vis-NIR region, and the amount of AN is usually small, generally less than 5% of TN, and thus has a weaker effect on soil spectra.
Although some researchers have reported successful prediction of AP and AK using vis-NIR [39-42], this is not the case in this study. AP was not well predicted, with an R² of 0.29 and an RPD of 1.17 (Grade E); the prediction of AK was even worse, with an R² of 0.07 and an RPD of 0.77 (Grade E). This is because there are no direct spectral absorption features in the vis-NIR region for AP and AK. The occasionally successful prediction of these soil properties may be due to covariation with other soil properties that have direct spectral responses in the vis-NIR range [37]. However, in the present study, poor correlations of AP and AK with carbon and nitrogen were found (see Table 3).
Additionally, pH can be predicted with approximately quantitative accuracy, with an R² of 0.82 and an RPD of 2.42 (Grade B-C). Although pH has no direct spectral response in the vis-NIR region, its prediction has often been reported to be more successful than that of P and K [43,44]. This might be because pH is related to the wavelengths at which minerals absorb [33]. However, further investigation is needed.

Spectroscopic prediction using PLSR: in situ vs. laboratory-based
Prediction accuracies with in situ collected spectra using PLSR are given in Table 2. Compared to laboratory-based spectroscopic measurements, predictions of soil properties such as OC, OM, TN, AN and pH with in situ measured spectra were worse. For example, predictions of soil OM and OC using laboratory-based spectra were considered to be approximately quantitative (Grade C), while those using in situ measurements were only able to distinguish between high and low values (Grade D). In addition, the prediction accuracy of AN decreased from Grade B using laboratory-based spectra (R² = 0.86 and RPD = 2.49) to Grade D using in situ spectra (R² = 0.76 and RPD = 1.91). This may be caused by the environmental factors present during the in situ measurement, such as soil moisture, ambient light, temperature and the condition of the soil surface, which would partly mask the absorption features of some soil properties.
As the prediction of soil properties with in situ vis-NIR spectra was less accurate than with laboratory-based measurement when a linear calibration algorithm was used, a non-linear data-mining algorithm (i.e. LS-SVM) was applied with the aim of extracting more useful information from the in situ spectra and improving the predictions.

Spectroscopic prediction of soil properties: PLSR vs. LS-SVM
In an attempt to improve the prediction accuracy using in situ soil spectra, LS-SVM was used to build the models. To determine the parameters γ and σ² for the LS-SVM models, γ ranging from $2^{-1}$ to $2^{10}$ and σ² ranging from 2 to $2^{15}$ were tested. The ranges were based on previous studies. For each combination of γ and σ², the root mean square error of cross-validation (RMSE_CV) was calculated, and the combination giving the smallest RMSE_CV was taken as optimal. The optimization process for predicting TN is shown in Fig. 4. Grid search and leave-one-out cross validation were employed to find the optimal combination of γ and σ². Grid search is a two-dimensional minimization procedure based on exhaustive search in a limited range. The grid of "." in the first step was 10×10, and the search step at this stage was relatively large. The grid of "×" in the second step was also 10×10, and the search step in the second stage was relatively small. The optimal search area was determined using the contour lines of RMSE_CV plotted in Fig. 4.
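The two-stage search can be sketched as follows. This is a hedged illustration, not the procedure used to produce Fig. 4: the function name is hypothetical, the grids are assumed to be evenly spaced in log2 units, the fine grid is assumed to span one coarse step on either side of the coarse optimum, and `objective` stands for any user-supplied function that returns RMSE_CV for a given (γ, σ²) pair.

```python
import numpy as np

def two_stage_grid_search(objective, log2_gamma=(-1.0, 10.0), log2_sigma2=(1.0, 15.0), size=10):
    """Coarse size x size grid in log2 space, then a finer grid around the coarse optimum.

    objective(gamma, sigma2) must return the cross-validation error to minimize.
    Returns (best_gamma, best_sigma2, best_score).
    """
    def scan(g_range, s_range):
        g_grid = np.linspace(g_range[0], g_range[1], size)
        s_grid = np.linspace(s_range[0], s_range[1], size)
        scores = np.array([[objective(2.0 ** g, 2.0 ** s) for s in s_grid] for g in g_grid])
        i, j = np.unravel_index(np.argmin(scores), scores.shape)
        return scores[i, j], g_grid[i], s_grid[j], g_grid[1] - g_grid[0], s_grid[1] - s_grid[0]

    coarse = scan(log2_gamma, log2_sigma2)                # first stage: large steps
    _, g0, s0, g_step, s_step = coarse
    fine = scan((g0 - g_step, g0 + g_step),               # second stage: small steps
                (s0 - s_step, s0 + s_step))
    # Keep whichever stage found the lower error.
    score, g_best, s_best = min(coarse[:3], fine[:3], key=lambda r: r[0])
    return 2.0 ** g_best, 2.0 ** s_best, float(score)
```

Each stage costs size × size evaluations of the objective, so with leave-one-out cross validation inside the objective the total cost grows quickly with the number of samples.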
Predictions with in situ spectra using LS-SVM can be found in Table 2. Firstly, PLSR and LS-SVM were compared using in situ spectra. Soil OM and OC could only be distinguished between high and low values (i.e. Grade D) when the PLSR method was used (OM: R² = 0.75 and RPD = 1.83; OC: R² = 0.75 and RPD = 1.95). However, using the LS-SVM method, both OM and OC could be approximately quantitatively estimated (i.e. Grade C), with prediction accuracies of R² = 0.81 and RPD = 2.18 for OM, and R² = 0.79 and RPD = 2.20 for OC. The prediction of TN was even more accurate using LS-SVM, with R² = 0.88 and RPD = 3.05 (i.e. Grade A), compared to PLSR with R² = 0.86 and RPD = 2.68 (i.e. Grade B). In addition, comparable prediction accuracies of AN were obtained with LS-SVM and PLSR, both with R² = 0.76 and RPD = 1.91 (Grade D). In terms of pH, LS-SVM only slightly improved the prediction compared to PLSR. However, AK and AP remained unpredictable (Grade E) using either method. The use of the data-mining algorithm (i.e. LS-SVM here) improved the prediction accuracy of most of the soil properties compared with the linear PLSR algorithm with in situ vis-NIR spectra. Fig. 5 shows the predicted values of the seven soil properties against the observed ones using LS-SVM with in situ vis-NIR spectra.
Surprisingly, the predictions of OM, OC and pH with in situ spectra using LS-SVM were comparable to those using PLSR with laboratory-based spectra; the prediction of TN using in situ spectra with LS-SVM was one grade better than that using laboratory-based measurement with PLSR. The prediction accuracy of TN is comparable to the result of Kleinebecker et al. (2013) [45] with air-dried samples. However, in terms of AN, the laboratory-based model with PLSR still offers better prediction. Given the improved prediction results for OM, OC, TN and pH using LS-SVM, in situ vis-NIR spectroscopy could become an effective tool for rapid and reliable measurement of soil properties in the field.

In situ prediction: paddy soils vs. irrigated soils
The prediction of paddy soil properties with in situ vis-NIR spectra was compared with the results reported in a recent review of in situ vis-NIR measurements [36] of irrigated (arable) soils, i.e. dry-farming soils (Table 4). The prediction accuracy of TN and pH of paddy soils is similar to that of irrigated soils. However, due to the presence of a considerable amount of soil water in paddy soils, which affects the in situ measured soil vis-NIR spectra, OC, AP and AK are better predicted in irrigated soils than in paddy soils.

Conclusions
Compared with laboratory-based vis-NIR spectroscopic measurement, field-based measurement is more efficient because soil spectra are measured directly in situ. It thus offers a promising way to analyze soil properties quickly in paddy fields when water is drained away before and after harvest. In our study, systematic research on paddy soil properties using in situ vis-NIR spectra and laboratory-based vis-NIR spectroscopy was carried out, covering soil organic matter (OM), organic carbon (OC), total nitrogen (TN), available nitrogen (AN), available phosphorus (AP), available potassium (AK) and pH.
Using the PLSR algorithm with laboratory-based vis-NIR spectra, soil OM, OC, TN, AN and pH can be quantitatively estimated with various accuracies, while AP and AK are poorly predicted. However, the prediction accuracy of soil properties decreased to some extent when in situ spectra were used for modeling. This was especially the case for soil OM, OC, AN and pH, whose predictions dropped by one grade. This might be due to soil moisture and ambient light, as well as the environmental temperature and soil surface condition, which might fully or partly mask the absorption information in the spectra and influence the prediction accuracies.
By applying the non-linear LS-SVM algorithm, the prediction of soil OM, OC, TN and pH with in situ vis-NIR spectra was clearly improved. These predictions were comparable to or even better than those from laboratory-based spectroscopic measurement using the PLSR algorithm. The prediction of AN was not improved, and AP and AK remained unpredictable. Thus, we propose the use of the LS-SVM algorithm for in situ vis-NIR spectroscopic estimation of soil properties of paddy soils.
Owing to the permanently waterlogged conditions of paddy soils, in situ prediction of several soil properties of paddy fields is less accurate than that of irrigated soils. Other data-mining methods should be tested on the in situ paddy soil spectra. In addition, further research on chemometric algorithms for removing the effects of water and other environmental factors from the spectra might fundamentally improve the prediction of soil properties with in situ spectra.

Supporting Information
File S1 In situ measured vis-NIR spectra of 184 samples. Only every tenth wavelength was retained to reduce the size of the file. (XLSX)