Prediction of soil organic carbon in a coal mining area by Vis-NIR spectroscopy

Coal mining has led to increasingly serious land subsidence, and the reclamation of subsided land has become a topic of concern for governments and scholars. The soil quality of reclaimed land is a key indicator for evaluating the reclamation effect; hence, rapid monitoring and evaluation of reclaimed land is of great significance. Visible-near infrared (Vis-NIR) spectroscopy has been shown to be a rapid, timely and efficient tool for the prediction of soil organic carbon (SOC). In this study, 104 soil samples were collected from the Baodian mining area of Shandong province. Vis-NIR reflectance spectra and SOC content were then measured under laboratory conditions. The spectral data were first denoised using the Savitzky-Golay (SG) convolution smoothing method or the multiple scattering correction (MSC) method, after which the spectral reflectance (R) was subjected to reciprocal, reciprocal logarithm and differential transformations to improve spectral sensitivity. Finally, regression models for estimating the SOC content from the spectral data were constructed using partial least squares regression (PLSR). The results showed that: (1) The SOC content in the mining area was generally low (at the below-average level) and exhibited great variability. (2) The spectral reflectance increased as the SOC content decreased. In addition, the sensitivity of the spectrum to changes in SOC content, especially in the near-infrared band of the original reflectance, decreased when the SOC content was low. (3) The model performed best when the spectral reflectance was preprocessed by SG smoothing coupled with MSC and the first-order differential transformation (modeling R² = 0.86, RMSE = 2.00 g/kg; verification R² = 0.78, RMSE = 1.81 g/kg, and RPD = 2.69). In addition, the first-order differential of R with SG preprocessing, R with MSC preprocessing, and R with SG plus MSC preprocessing also produced better modeling results than the other pretreatment combinations. Vis-NIR modeling with suitable spectral preprocessing methods could predict SOC content effectively.


Introduction
Traditional methods for the determination of soil organic carbon (SOC) content are time consuming, laborious and costly, and they offer poor real-time performance [1]. Hyperspectral data, with their narrow and fine spectral bands, can capture information hidden in the soil, and they have been widely used to analyze soil physical and chemical properties such as soil humus structure [2], soil nutrients [3][4], soil salinity [5] and soil moisture content [6]. These studies have confirmed that hyperspectral data offer real-time, efficient and low-cost measurement, which can compensate for the shortcomings of the traditional methods.
In the field of SOC research, Vis-NIR spectroscopy is mainly used for spectral feature analysis and quantitative prediction. Bartholomeus [7] analyzed nine kinds of soil samples with SOC contents varying from 0.06% to 45.1% and concluded that a spectral index based on visible light (defined as the sum of the total reflectance minus the continuum removed function) correlated well with SOC. Henderson [8] found a good correlation between organic carbon in soils derived from different parent materials and near-infrared longwave bands (1100~2526 nm). Consorti suggested that laboratory reflectance spectroscopy in the Vis-NIR range coupled with a geostatistical analysis can be used as a tool for predicting spectrally mapped SOM [9]. Aïchi et al. [10] established a model that proved valid over a range of 0.90-5.20% organic carbon content using the original spectral absorbance at 400~950 nm, which included the near-infrared band. These studies have shown that visible and near-infrared bands exhibit good statistical relationships with SOC content. Mouazen concluded that appropriate preprocessing of Vis-NIR diffuse reflectance spectra could aid in extracting sensitive variables for PLSR modeling to achieve higher prediction accuracy for soil total N, total C, and organic C [11]. However, few studies have sought to improve the prediction accuracy by combining various spectral preprocessing methods.
Although the exploitation of coal resources has promoted economic development, it has also caused severe surface collapse problems, directly leading to deterioration of soil quality [12]. Land reclamation, in turn, can improve soil structure and soil physical and chemical properties, thereby increasing soil quality [13]. Although SOC accounts for only a small proportion of total soil, it plays a key role in regulating soil quality and function. SOC is not only a carbon source for the growth of microorganisms, vegetation and saprophytic animals, but also an important ecological factor that can maintain sound soil physical structure and biological diversity [14]. Therefore, the rapid monitoring and evaluation of the SOC content is important for choosing reclamation modes and management measures in mining areas [15]. In this study, soil samples were collected in a mining area for the determination of SOC content and Vis-NIR spectra, and the spectral data were pretreated by different combinations of Savitzky-Golay (SG) smoothing, multiple scattering correction (MSC) and various mathematical transformations. The transformed data were then modeled and optimized by PLSR to estimate the SOC content in the study area, which can provide a basis for monitoring and evaluation of SOC.

Research area
The Baodian mining area is located in the northwest region of Zoucheng City, which encompasses a total area of approximately 35.76 km² and lies between 35˚23'13.2"-35˚28'8.4"N latitude and 116˚48'7.2"-116˚52'26.4"E longitude. The study area has a temperate continental monsoon climate with four distinct seasons. The topography is dominated by plains, and the soil types are mainly fluvo-alluvial soil and Vertisol. Since 1986, long-term coal mining activities have caused serious ground subsidence problems, and the current water-logged area occupies nearly 1/3 of the area (Fig 1).

Sampling and analysis of soil samples
The sampling work was carried out during 2-7 June 2017, after the wheat crops were harvested. The sampling sites were arranged as evenly as possible according to the actual conditions in the study area, and a total of 104 soil samples were collected. The position of each sampling site was determined using GPS, and approximately 1.0 kg of surface soil (0-20 cm) was collected at each site. Each sample was a mixture of 5 subsamples collected by a diagonal sampling method within an area of 10 m × 10 m. Each soil sample was kept in a sealed package for subsequent spectral measurement and SOC content determination in the laboratory.
Plant roots, stones, small animals and other foreign matter were first removed from the soil samples, which were then dried, ground and passed through a 20-mesh sieve and then a 100-mesh sieve. The soil that passed through the 20-mesh sieve was used for the later spectral measurements, and the soil that passed through the 100-mesh sieve was used for the determination of the SOC content by the potassium dichromate titration method.

Determination and processing of soil spectra
The spectral measurements were carried out in a darkroom using an ASD FieldSpec4 spectroradiometer, whose spectral range is 350-2500 nm. The probe field angle was set to 15˚, and the incident angle of the light source was 30˚; the distances from the light source and the probe to the soil surface were 50 cm and 15 cm, respectively. The soil samples sieved through the 20-mesh screen were used for the spectral measurements. To avoid the influence of ground-reflected light, the soil was loaded into an aluminum case, under which a light-absorbing cloth was placed. The soil surface in the aluminum case was scraped smooth with cardboard [16]. Eight spectral curves were collected for each soil sample.
The mean of the 8 curves measured for each sample was used as the measured spectrum of that sample. The data were then exported to Excel, excluding the initial band (350-499 nm), the tail band (2451-2500 nm), and the bands influenced by water vapor in the environment (1300-1450 nm and 1800-1950 nm). The spectral data were then preprocessed in three different ways (SG, MSC, and SG together with MSC). The preprocessed reflectance (R) and its reciprocal (1/R), reciprocal logarithm (log(1/R)), first-order differential (R') and second-order differential (R") transformations gave a total of 15 sets of modeling data (3 preprocessing methods × 5 spectral forms). Previous studies have shown that spectral noise can be effectively reduced when the SG smoothing window is approximately 15, the fitting order is 3, and 2 smoothing passes are used [17]. MSC can also effectively eliminate the scattering effects caused by particle size, loading density and humidity [18]. Mathematical transformation can effectively extract soil spectral characteristics and highlight hidden spectral information [19].
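The preprocessing chain described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the SG window is fixed at 15, the logarithm is taken as base 10, the derivatives use simple finite differences, SG is applied before MSC in the combined case, and the retained bands are smoothed as if they were contiguous.

```python
import numpy as np

def keep_mask(wavelengths):
    """True for bands kept after dropping 350-499, 1300-1450, 1800-1950 and 2451-2500 nm."""
    wl = np.asarray(wavelengths)
    drop = (wl < 500) | ((wl >= 1300) & (wl <= 1450)) | ((wl >= 1800) & (wl <= 1950)) | (wl > 2450)
    return ~drop

def sg_smooth(X, window=15, order=3, passes=2):
    """Savitzky-Golay smoothing of each row (local polynomial least-squares fit)."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    A = np.vander(offsets, order + 1, increasing=True)  # columns: 1, x, x^2, ...
    coeffs = np.linalg.pinv(A)[0]                        # weights for the fitted centre value
    out = np.asarray(X, dtype=float)
    for _ in range(passes):
        # Assumption: edges are handled by repeating the end values.
        padded = np.pad(out, ((0, 0), (half, half)), mode="edge")
        out = np.array([np.convolve(row, coeffs[::-1], mode="valid") for row in padded])
    return out

def msc(X):
    """Scatter correction: regress each spectrum on the mean spectrum, then invert the fit."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, 1)       # row ~ intercept + slope * ref
        out[i] = (row - intercept) / slope
    return out

def transforms(R):
    """The five spectral forms: R, 1/R, log(1/R), R' and R''."""
    d1 = np.gradient(R, axis=1)
    return {"R": R, "1/R": 1.0 / R, "log(1/R)": np.log10(1.0 / R),
            "R'": d1, "R''": np.gradient(d1, axis=1)}

def build_datasets(R):
    """All 3 x 5 = 15 combinations of preprocessing method and spectral form."""
    pre = {"SG": sg_smooth(R), "MSC": msc(R), "SG+MSC": msc(sg_smooth(R))}
    return {(p, t): Xt for p, Xp in pre.items() for t, Xt in transforms(Xp).items()}
```

Given a samples × bands reflectance matrix `R` restricted to the retained bands, `build_datasets(R)` returns a dictionary keyed by pairs such as `("SG+MSC", "R'")`, one entry per modeling data set.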

Partial least squares regression analysis
In this study, R and its transformed forms were the independent variables. Hyperspectral bands are narrow and numerous, and each band is closely correlated with its adjacent bands. Consequently, the number of independent variables was much larger than the number of samples, and the independent variables were highly correlated with one another. PLSR, which combines features of principal component analysis, canonical correlation analysis and OLS regression [20], can effectively handle both problems.
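To illustrate how PLSR copes with many collinear predictors and few samples, the following is a minimal single-response PLS (NIPALS-style PLS1) sketch in NumPy. It is not the R routine used in the study; the function names and the choice of the number of components are assumptions made for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model; returns (coefficients, column means of X, mean of y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                     # latent score
        tt = t @ t
        p = Xk.T @ t / tt              # X loading
        c = yk @ t / tt                # y loading
        Xk = Xk - np.outer(t, p)       # deflate X and y before the next component
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original band space
    return beta, x_mean, y_mean

def pls1_predict(model, X):
    """Predict the response for new spectra."""
    beta, x_mean, y_mean = model
    return (np.asarray(X, dtype=float) - x_mean) @ beta + y_mean
```

Because each component is a single score vector built from all bands, the regression is carried out on a handful of orthogonal latent variables rather than on the correlated bands themselves. In practice the number of components would normally be chosen by cross-validation.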
Modeling was performed in the R software. The coefficient of determination (R²), root mean square error (RMSE) and the relative prediction deviation (RPD) were used to evaluate the accuracy of the models. A higher R² indicates a better degree of model fitting, and a lower RMSE indicates more accurate model prediction. Models with a higher RPD (greater than 2) are more robust. In summary, models with higher R² and RPD values but lower RMSE values are more reliable.
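The text does not give formulas for the three indices, so the sketch below uses common definitions as an assumption: R² as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared residual, and RPD as the sample standard deviation of the observed values divided by the RMSE. Some studies instead report R² as the squared correlation between observed and predicted values.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_obs, y_pred):
    """Root mean square error, in the units of y (g/kg for SOC)."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))

def rpd(y_obs, y_pred):
    """Assumed definition: standard deviation of observations / RMSE."""
    return float(np.std(np.asarray(y_obs, float), ddof=1) / rmse(y_obs, y_pred))
```

Under these definitions, a model whose RMSE is less than half the standard deviation of the observed SOC values has an RPD above 2.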

Statistical characteristics of organic carbon in soil samples
The full set of soil samples was divided into a training set (64 samples) and a verification set (40 samples) by the K-S algorithm, based on their spectral reflectance characteristics.
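Assuming that "K-S" refers to the Kennard-Stone algorithm (the text does not spell it out), a spectra-based split could look like the sketch below. The function name and the use of Euclidean distance between spectra are illustrative assumptions.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) row indices using Kennard-Stone selection."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Pairwise Euclidean distances between spectra.
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Start with the two most distant samples.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # For each candidate, distance to its nearest already-selected sample.
        nearest = D[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

With 104 spectra, `kennard_stone(X, 64)` would yield 64 training indices that span the spectral space and 40 leftover indices for verification.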
For the full sample set, the SOC content ranged from 0.79 g/kg to 27.72 g/kg, with an average value of 11.34 g/kg, indicating that the SOC content in the study area was generally low (at the below-average level). The standard deviation and coefficient of variation were 5.09 g/kg and 44.86%, respectively, indicating that the SOC values were fairly dispersed. The parameter values in the training set and verification set were similar to those in the full sample set (Table 1).

Spectral characteristics of soil organic carbon
In accordance with the classification standard of soil organic matter in the second national soil survey in China [21], the soil samples were divided into 6 groups (grade I, >23.20 g/kg; grade II, 17.40-23.20 g/kg; grade III, 11.60-17.40 g/kg; grade IV, 5.80-11.60 g/kg; grade V, 3.48-5.80 g/kg, and grade VI, 0.58-3.48 g/kg) based on their SOC content. The mean soil spectrum of each group was used to analyze the spectral characteristics of the SOC content at different grades.
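The grouping step can be sketched as follows. The thresholds are the ones listed above; the function names, the treatment of boundary values (a value equal to an upper bound falls into the lower grade) and the error for values below 0.58 g/kg are assumptions made for illustration.

```python
import numpy as np

# (lower bound in g/kg, grade) for grades I to V, taken from the ranges in the text.
GRADE_BOUNDS = [(23.20, "I"), (17.40, "II"), (11.60, "III"), (5.80, "IV"), (3.48, "V")]

def soc_grade(soc):
    """Map an SOC content in g/kg to its grade (I to VI)."""
    for lower, grade in GRADE_BOUNDS:
        if soc > lower:
            return grade
    if soc >= 0.58:
        return "VI"
    raise ValueError("SOC content below the lowest listed grade boundary (0.58 g/kg)")

def mean_spectrum_by_grade(soc_values, spectra):
    """Average the spectra (rows) of all samples that fall in the same grade."""
    spectra = np.asarray(spectra, dtype=float)
    grades = [soc_grade(v) for v in soc_values]
    result = {}
    for g in sorted(set(grades)):
        rows = [i for i, gi in enumerate(grades) if gi == g]
        result[g] = spectra[rows].mean(axis=0)
    return result
```

For example, the reported mean SOC of 11.34 g/kg falls in grade IV, and the reported maximum of 27.72 g/kg falls in grade I.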
The spectral characteristics of soils with different SOC grades are shown in Fig 2. R was negatively related to the SOC content; this relationship was obvious within grades I, II, and III, while R varied only slightly between grades V and VI. The spectral curves of the soils with different SOC content grades exhibited a uniform pattern: reflectance increased rapidly in the visible band (400-760 nm), then ascended gradually in the short near-infrared and near-infrared longwave bands (780-1300 nm), after which the curves formed a high-reflectivity plateau until they began to decline after 2100 nm. Because ambient water strongly absorbs electromagnetic waves near 1400 nm and 1900 nm, two absorption valleys were formed. There was a reflection peak near 2150 nm, where the reflectivity reached its maximum.

Establishment and optimization of hyperspectral prediction models for soil organic carbon
As shown in Table 2, a total of 15 sets of modeling data were used to construct the prediction model for SOC content by PLSR. The modeling results from log(1/R) and R" were both inferior to those from R, while R' did improve the model accuracy. The spectral data processed by MSC produced better results in F(R) (the inversion model that took R as the independent variable). Among models with SG preprocessing, F(R') (the inversion model that took R' as the independent variable) had a higher modeling accuracy; and among models with SG+MSC preprocessing, both F(R) and F(R') were better than the others. Overall, the preferable models were F(R') with SG preprocessing, F(R) with MSC preprocessing, and F(R') or F(R) with SG+MSC preprocessing. The predicted and observed values of these models were then compared. As shown in Fig 3, these four models exhibited good fitting, especially the model whose R was preprocessed using SG, MSC, and the first-order differential transformation (Fig 3D); these models could effectively predict the SOC content.

Discussion
This study demonstrated that the SOC content was negatively correlated with spectral reflectance, which is consistent with many other research findings [22][23]. Preprocessing R with MSC or SG+MSC helped improve the fitting R² (which could reach 0.84) of the constructed estimation models. The R² of F(R') with SG or SG+MSC preprocessing was relatively higher (0.71 and 0.86, respectively) than that of other models, and similar conclusions were also drawn by Wang HT [24] in a study on forest SOC. Fig 3 shows that the R² of F(R) was lower with SG preprocessing and higher with MSC preprocessing, whereas the situation was reversed for F(R'). Therefore, it was deduced that SG preprocessing combined with the R' transformation can better extract spectral characteristics and thus gives better modeling results (which is consistent with the results of Aixia Yang [25]). By contrast, MSC processing is more suitable for direct modeling using the original reflectivity, and the model accuracy is better than that of F(R') with SG preprocessing. This result differed from that of the study by Huazhou Chen [26], which may be due to differences in the regional environment and soil type [27]. This study found that when the SOC content decreased to a certain extent, the negative correlation trend between spectral reflectance and SOC content was no longer obvious. Presumably, when the SOC content is low, the spectral characteristic information of SOC may be obscured by other information in the soil spectrum. To further verify this speculation, correlation analysis between SOC content and the corresponding spectral reflectance was carried out in two groups divided according to SOC content.
As shown in Fig 4, in the range of 400-700 nm, the SOC-reflectance correlation of the low-content group was slightly higher than that of the high-content group; in the range of 750-2450 nm (excluding the ranges of 1300-1450 nm, 1800-1950 nm, and 2450-2500 nm), the correlation of the high-content group was much higher than that of the low-content group. This lends some support to the speculation. When the SOC content is low, R is not suitable for direct modeling because of its low sensitivity. However, the SOC content below which the hyperspectral model is no longer applicable still needs further study.
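The group-wise comparison described above amounts to a band-by-band Pearson correlation between SOC content and reflectance, computed separately for the low- and high-content samples. The sketch below illustrates this; the function names and the `threshold` parameter are hypothetical, since the text does not state how the two groups were divided.

```python
import numpy as np

def bandwise_correlation(soc, spectra):
    """Pearson correlation between SOC and reflectance, one value per band."""
    soc = np.asarray(soc, dtype=float)
    X = np.asarray(spectra, dtype=float)
    sc = soc - soc.mean()
    Xc = X - X.mean(axis=0)
    return (sc @ Xc) / (np.linalg.norm(sc) * np.linalg.norm(Xc, axis=0))

def correlation_by_group(soc, spectra, threshold):
    """Correlation curves for samples below and at/above a hypothetical SOC threshold."""
    soc = np.asarray(soc, dtype=float)
    X = np.asarray(spectra, dtype=float)
    low = soc < threshold
    return bandwise_correlation(soc[low], X[low]), bandwise_correlation(soc[~low], X[~low])
```

Plotting the two returned curves against wavelength gives a figure of the same kind as Fig 4, in which a curve closer to zero indicates bands where reflectance carries little information about SOC.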
In addition, the SOC content in the study area ranged from 0.79 g/kg to 27.72 g/kg, with an average value of 11.34 g/kg, indicating that the SOC content in the mining area was generally low (at the below-average level) and had great variability. It is speculated that the surface subsidence caused by coal mining disturbs the distribution of SOC and leads to a decrease in SOC content and to regional differences in it, which is consistent with a previous study [12].

Conclusion
(1) The SOC content in the study area was generally low (at the below-average level) and had great variability.
(2) The spectral curves of soils with different SOC contents were consistent in shape, but their reflectance values were negatively correlated with SOC content. In addition, the sensitivity of the original reflectance to changes in SOC content decreased when the SOC content was low.
(3) The SOC content in the study area could be predicted quantitatively and effectively by PLSR. The best model reached a modeling R² of 0.86, a verification R² of 0.78 and a relative prediction deviation (RPD) of 2.69.
(4) The modeling results differed among spectral processing methods. Four models (F(R') with SG preprocessing, F(R) with MSC preprocessing, and F(R') or F(R) with SG+MSC preprocessing) had relatively better precision, especially F(R') with SG+MSC preprocessing.

Supporting information
S1 File. The primary data of soil organic carbon content and soil spectral reflectance. (XLS)