Use of data mining techniques to classify soil CO2 emission induced by crop management in sugarcane field

Soil CO2 emissions are regarded as one of the largest flows of the global carbon cycle and small changes in their magnitude can have a large effect on the CO2 concentration in the atmosphere. Thus, a better understanding of this attribute would enable the identification of promoters and the development of strategies to mitigate the risks of climate change. Therefore, our study aimed at using data mining techniques to predict the soil CO2 emission induced by crop management in sugarcane areas in Brazil. To do so, we used different variable selection methods (correlation, chi-square, wrapper) and classification (Decision tree, Bayesian models, neural networks, support vector machine, bagging with logistic regression), and finally we tested the efficiency of different approaches through the Receiver Operating Characteristic (ROC) curve. The original dataset consisted of 19 variables (18 independent variables and one dependent (or response) variable). The association between cover crop and minimum tillage are effective strategies to promote the mitigation of soil CO2 emissions, in which the average CO2 emissions are 63 kg ha-1 day-1. The variables soil moisture, soil temperature (Ts), rainfall, pH, and organic carbon were most frequently selected for soil CO2 emission classification using different methods for attribute selection. According to the results of the ROC curve, the best approaches for soil CO2 emission classification were the following: (I)–the Multilayer Perceptron classifier with attribute selection through the wrapper method, that presented rate of false positive of 13,50%, true positive of 94,20% area under the curve (AUC) of 89,90% (II)–the Bagging classifier with logistic regression with attribute selection through the Chi-square method, that presented rate of false positive of 13,50%, true positive of 94,20% AUC of 89,90%. However, the (I) approach stands out in relation to (II) for its higher positive class accuracy (high CO2 emission) and lower computational cost.

Introduction Carbon dioxide (CO 2 ) is continuously exchanged between soils and the atmosphere, mainly through the processes of photosynthesis and incorporation of organic matter derived from plants (efflux of CO 2 ) and the decomposition of this organic matter by soil organisms (CO 2 flux). Therefore, the amount of carbon (C) stored in the soils depends mainly on the equilibrium between the C inputs and outputs of the system [1].
In this context, increased C storage in soils worldwide could help compensate for the increase in anthropogenic CO 2 emissions, while increased soil emissions could significantly worsen atmospheric rises [2].
Thus, the potential of agriculture to mitigate greenhouse gas emissions has been the subject of intense scientific research over recent years [1]. However, after many decades of research on soil organic matter decomposition, very few robust mathematical models and experiments to predict the effect of both biotic and abiotic factors on the soil C balance have been developed [3].
Considering this issue, the alternative that we propose is the use of data mining techniques to predict CO 2 emission from the soil. Among the techniques of data mining, classification is a task that stands out in the studies of the scientific community of Knowledge Discovery in Database (KDD). The principle of this task is to predict a categorical variable, i.e., to discover a function that maps a set of records into a set of predefined variables, called classes. Such a function can be applied to new records in order to predict the class in which such records fit [4,5].
Several algorithms can be used to perform such a task; however, some of them, such as J48, Naive Bayes, Multilayer Perceptron, SVM, Bagging, are widely used due to their usual good performance regarding classification processes. However, the models obtained through the use of different classifiers require assessment under the same conditions. An alternative to compare predictive models is the Receiver Operating Characteristic (ROC) curve, which is a graphical method for evaluation, organization, and selection of diagnostic and/or prediction systems. This tool has been widely used in data mining processes to assess classification models and is particularly useful in areas presenting a great disparity between classes or when different costs/benefits must be taken into account for the different errors/correctness in classification [6].
Hence, our study aimed at using data mining techniques to predict the soil CO 2 emission induced by crop management in sugarcane areas in Brazil. To accomplish that, we used different variable selection methods (correlation, chi-square, wrapper) and classification (Decision tree, Bayesian models, neural networks, support vector machine, bagging with logistic regression), and finally we compared the efficiency of different approaches through the ROC curve (Receiver Operating Characteristic).

Site locations and experiment descriptions
The composition of the original database for this work was carried out by collecting soil variable (physical, chemical, microbiological) and climatic data at an experimental sugarcane area located in the city of Iracemápolis, São Paulo, Brazil (22˚34'50" S, 47˚31'07" W; 608 m above sea level), during the sugarcane crop reform period.
The soil under investigation is classified as Rhodic Hapludox, according to the Soil Taxonomy System [7], and is described as a Latosssolo Vermelho Eutroférrico, according to the Brazilian Soil Classification [8], with very clayey texture. In the site characterization, the soil Oliveira has received no salary, gratification, bonus or gratuity for taking part in this research. Every effort made in this study was result of research collaboration among students, professors and researchers interested in the research line of this study. In summary, Dr. Oliveira's contribution is related to the co-supervision of this research work, as well as the use of machine learning algorithms (predictive modeling) and interpretation of the results. Embrapa did not provide any financial support for this research.
presented a pH around 4.6, an average organic C content of 10.5 g C dm -3 , base saturation of 50%, cation exchange capacity 10.25 cmol c dm -3 , bulk density of 1.34 kg dm -3 , from 0-40 cm depth. The distribution of particle sizes of sand, silt, and clay were 140, 194 and 666 g kg -1 , respectively. The data were obtained from a field experiment with a randomized complete block design in the split-plot design with four replicates. The plots involved treatments with and without cover crop with Crotalaria Juncea and subplots with two soil tillage conditions (minimum and conventional tillage). Each experimental unit (sub-plot) consisted of 15 lines of sugarcane, with a spacing of 1.5 m and 34 m in length.

Analysis of CO 2 emissions, soil temperature, and soil moisture
Soil CO 2 emissions were recorded between April 27 and August 3, 2013. Emissions were measured in the morning (i.e. between 8 and 10 am) [9], using five PVC collars (diameter = 10 cm and height = 7 cm), inserted 0.02 m into the soil at each of the 16 plots (80 collars in total). The PVC collars were distributed in sugarcane field as follow: two of them located in the in-row, one in an intermediate position considered as seedbed region (at 0.38 m distance from inrow), and two in the center of the inter-row (at 0.75 m distance from in-row).
Soil CO 2 emissions were measured using an infrared gas analyzer (IRGA) produced by LI-COR (model LI8100A, Nebraska, USA). This chamber was a closed system with an internal volume of 991 cm 3 and a contact area with the soil of 71.6 cm 2 . When recording emissions, the chamber was placed over the PVC collars to prevent any mechanical disturbance to the soil profile, which may result in overestimation of emissions.
The quantification of CO 2 emissions began 24 hours after soil tillage operations. In each plot, the CO 2 emission measurement lasted 90 seconds. The measurements were taken for 97 days days after soil tillage, until CO 2 emissions had stabilized. The CO 2 emissions recorded from the five collars within each plot (two in intra-rows and three in inter-rows) were aggregated into a single measure using a weighted average (assuming an area of 27% for the rows and 73% for the inter-rows). Soil CO 2 emission over the entire study period was estimated by the integral of the area formed below the emission curves over time.
The measurements of soil temperature and soil moisture were taken at exactly the same time as soil CO 2 measurements. Soil temperature (St) was assessed at all points studied with the sensor which is part ofthe same LI-COR instrument (model LI8100A). This sensor consists of a 20 cm rod inserted into the soil perpendicular to the surface near the PVC collars used to measure the soil respiration. Soil moisture (Sm) was measured simultaneously with the measurement of CO 2 concentration using Time Domain Reflectrometry (TDR). Probe Thetaprobe ML2 (Delta-T Devices, Cambridge, UK) is an English manufacturing tool that directly measures the water content in the soil, corresponding to the volumetric moisture content, using the principle of wave generation which releases an electromagnetic pulse to a set of rods with reflection measured in TDR. Moreover, during experimental period, rainfall events were monitored through an automatic surface weather station installed at the experiment site.
In the end of the CO 2 sampling period, disturbed and undisturbed soil samples were collected at the depths of 0.00-0.10 m, 0.10-0.20 m, 0.20-0.30 m, and 0.30-0.40 m.to evaluate the effects of the treatments on soil attributes.

Soil physical, chemical and biological variables
Total soil porosity (Sp), macroporosity (Macro), microporosity (Micro), and bulk density (Bd) were analyzed according to the Brazilian Agricultural Research Corporation methodologies [10]. The soil resistance to penetration (RP) was obtained using the Stolf´s formulation [11].
The mean diameter of the aggregates (MDA) was determined according to the method described by Kemper and Chepil [12] and the calculation of the aggregate tensile strength (Ts) was performed as described by Dexter and Kroesbergen [13].
All samples were taken to the laboratory, air dried and subsequently passed through a 2.0 mm mesh. We measured soil pH (CaCl 2 0,01 mol L -1 ), exchangeable cations (Ca 2+ , Mg 2+ and K + ), phosphorus available in resin (P), organic C concentration (wet oxidation), acidity potential, CEC potential, and base saturation in accordance with the methodology proposed by Raij [14].
For the analysis of microbial biomass C, the samples were collected and placed in a cooler and refrigerated during the transportation to the laboratory for preservation in cold storage at 4˚C until analysis. The microbial biomass C (MBC) was determined through the fumigationextraction method proposed by Vance [15].

Data mining
The original dataset was constituted of 19 variables (one dependent or response variable and 18 independent or explanatory variables) ( Table 1) which were added to the dataset numbering 1,552 observations. The variable-target refers to the soil CO 2 emission and is the classification target.
The data were initially evaluated by through descriptive statistics through Box-plot graphs. The box shows the data between the first and third quartiles (hinge), with median represented by a line inside the box. Vertical lines (whiskers), starting in the middle of the base (and top) of the box and ending in values (referred to as adjacent lower and upper) approximately indicate the variability of the data [16]. In order to identify different levels of CO 2 emissions from the soil, a goal attribute discretization was required. For this purpose, the CO 2 emission values (kg ha -1 dia -1 ) were organized in ascending order and divided equally into three emission classes: low, medium and high ( Table 2).
In order to compare the performance of different classifiers through the ROC curve, a new discretization was performed considering the high class belonging to the positive class and the remaining classes (low and medium) were grouped to form the negative class (Fig 1).
To do so, an imbalance between the classes occurred since the negative class obtained 2/3 of the data and the positive class kept only 1/3, hampering the classifiers to learn the positive class. For the purpose of improving the accuracy of the model and mitigating the problem of unbalanced classes, the "under sample" method was applied to balance the number of observations per class. This method is used in the training process to slightly reduce the observations of the major (negative) class and allow the (positive) minor class to be learned by the classifier, considering that the interest class is most often the positive class. For this purpose, the Stratified Remove Folds Filter was used to select two subsets, one for training, containing 90% of the original dataset, and one for testing, with 10% of the original dataset. Subsequently, we applied the Neighborhood Cleaning Rule (NCL) filter) available in Weka software 3.4.13. The distribution of classes in the database after this operation is shown in Fig 2. From this subset, the selection of variables was performed using different methods: 1. Absence of attribute selection upon the occurrence of use of all data.
2. Correlation-based feature selection (CFS) searching for the set of correlated variables to prevent the reuse of the same information [17].
3. The chi-square method (χ 2 ) is based on the concept of statistical independence.
4. Wrapper method, which functions as a "black box". In order to find the subset of variables that most satisfactorily fits the classification algorithm [18].
From the variables selected through each selection method, different classification algorithms were tested: 1. J48-Decision tree [19].
The different approaches were validated using the supplied test set method (90% of the database for training and 10% for testing) and two metrics: (i) accuracy rate; (ii) Kappa coefficient. The induction of the decision tree model resulted in the calculation of the known matrix of errors, or matrix of confusion (Table 3), widely used in statistical analysis of agreement [23].
Column 'Total' in Table 3 presents P as the total value of positive cases and N as the total of existing negative cases in the training set. In the Total, P' is the total number of cases that the model rated as positive cases and N' the total number of cases classified as negative. From the matrix of confusion, it is possible to extract the performance evaluation metrics. The rate of accuracy is the percentage of examples that were correctly classified by the classifier and can be expressed according to Eq 1.
To describe the measure of agreement between the predicted and observed classes, which deducts the expected number of correct answers (using a classification at random) of the actual number of the accuracy of the classifier, we used the Kappa measure (Eq 2). Kappa values vary from 0 to 1, representing bad and good classification results, respectively. The Kappa Classification of soil CO 2 emission coefficient can be defined by the following equation [23]: where Pr (a) is the relative agreement observed for a given class in the matrix of confusion; Pr (e) is the probability of expected agreement for this same class. The Kappa coefficient is calculated taking into account all classes. A possible interpretation of the performance of the models from this measure was introduced by Landis and Koch [24].
In order to compare the performance of the different classifiers associated with the attribute selection methods, the ROC curve was used, a graphical tool used to assess the algorithm using software Roc on 2.0.
The analysis of the curves generated by each classification model in the ROC space is represented by the ratio between the true positive rate (TPR) axis and the false positive rate (FPR). The classifiers positioned at the bottom of the "convex hull" have a precision rate lower than those contained in it. In addition, the larger the area under the curve, the greater the accuracy of the model. The point (0, 0) represents the strategy of never classifying an example as positive. Models that correspond to this point do not present any false positive, but are unable to classify any true positive either. The inverse strategy, of always classifying a new example as positive, is represented by the point (100%, 100%). The dot (0, 100%) represents the perfect   In contrast, the point (100%, 0) represents the model that always makes erroneous predictions [6].

Results
The behavior of soil CO 2 emission for each treatment along the experimental period is presented in Fig 3. It was possible to observe great variability of the CO 2 emission, mainly in the first days after the soil tillage practices and after rainfalls events, reaching emission values of up to 771 kg ha -1 day -1 (Fig 3A). Around the 40 th day after tillage, there was a tendency of stabilization of soil CO 2 emissions, independent of rainfall events in the experimental area.
Analyzing the average soil CO 2 emissions, it was observed that the treatment where conservationist practices were used (cover crop and minimal tillage-MTCC) lower median values of CO 2 emission was observed compared to other treatments (Fig 3B), which presented average CO 2 emission of 63 kg ha -1 day -1 versus 74 kg ha -1 day -1 of the treatment CTCC, 71 kg ha -1 day -1 of the standard treatment CTWC, and finally 78 kg ha -1 day -1 of the MTWC. Soil moisture was another attribute that presented high variability along the evaluated period. Soil moisture was influenced by rainfall events, where its values ranged from 4% to 36%. The soil temperature presented small variability among the treatments. In general, a tendency of small average temperature was observed in cover crop treatments (Fig 4).
In addition, it can be observed that the areas under cover crop cultivation presented higher porosity (macro, micro e TP), Al +3 , K, P, Organic C e and lower Bd, PR, RT, Mg+ 2 . Specifically for the MTCC treatment, it was observed that in this treatment occurred greater Sm, Micro, TP, Al + 3 , K + , MBC associated with lower MDA, Bd, H + Al + 3 , Ca + 2 , Mg + 2 (Figs 4 and 5).
According to the results presented in Table 4, the number of variables selected differed among the assessment methods, which has occurred since the selection of only one attribute, as in the case of the CFS method, having selected soil temperature exclusively, until the selection of ten variables, as in the case of method χ 2 ( Table 4).
When analyzing the variables selected in the different approaches, we observed the recurrence of some of them, such as: soil temperature, rainfall, soil moisture, pH, and soil organic C ( Table 4). Soil temperature appears in all attribute selection methods, followed by rainfall, present in six out of the seven approaches assessed, and the three variables, soil moisture, pH, and soil organic C, which appear in four of the seven methods tested.
Among the algorithms assessed for soil CO 2 emission classification, J48, Naïve Bayes and Bagging with logistic regression presented better performance when using the χ 2 method for attribute selection, with the accuracy rate of the different approaches ranging from 87.18 to 84.61 and Kappa coefficient from 0.73 to 0.66. In contrast, the SMO and Multilayer Perceptron algorithms presented a subset of variable selected through the Wrapper method with an accuracy rate varying from 89.10 to 85.90 and Kappa coefficient from 0.77 to 0.69 ( Table 5).
The IV approach provided higher AUC and TP, being 89.90 and 94.2, respectively, followed by the V approach, which presented an AUC of 89.85 and a TP of 94.2. In contrast, the III approach provided the lowest AUC, 85.10 and TP, 82.7, against the remaining approaches (Table 6).
Regarding the analysis of the different approaches through the ROC curve, in general, all of them had a good performance regarding the classification, since none of them remained below the random performance line (Fig 6). In addition, we observed that the best approaches were IV and V for being the only ones to be encompassed by the convex hull. However, the IV approach has advantages over the V for being closer to the point (0.0 and 100).

Effect of the cover plant and soil tillage to mitigate CO 2 emissions
In spite of the high variability of CO 2 emission, the extreme values were not considered as outliers, since they were associated by tillage operations and rainfall events in the experimental area.
The magnitude of CO 2 loss from soil influenced by tillage practices is highly related to the intensity of soil disturbance caused by tillage which alters soil organic matter decomposition environment due soil aeration, break the aggregates, incorporation of residue into the plow layer [25,26].
Studies conducted under Brazilian conditions confirm this tends, i.e., Iamaguti [27], evaluated the CO 2 emissions influenced by three systems of soil tillage in sugarcane replanting area, observed that for the conventional tillage the exposed and disaggregated soil favors higher losses of CO 2 . In a study on quantification the impact of sugarcane, harvest systems, straw and soil management on soil CO 2 emissions, Figueiredo [28], showed that intensive soil tillage and the incorporation of sugarcane straw into the soil increased short-term CO 2 emissions.  Rainfall events, in turn, promoted the increase of soil moisture and stimulated CO 2 emissions, producing CO 2 emissions belonging to the high class. This fact is supported by several studies in sugarcane fields that demonstrate the influence of rainfalls events on soil CO 2 emissions due to changes in soil water content [28] [29] [30].
The soil moisture is an attribute essential for soil microbial metabolism, and can present both direct and indirect influence under CO 2 emissions. Adequate soil moisture tends to stimulate microbial activity in the soil rather than inhibit. According to Vicent [31] optimum soil moisture values range from 25% to 40%, above this range the CO 2 emission is limited by excess water and lack of oxygen in the soil, and below limits soil respiration by drought. In our study, the average value of soil moisture remained around 20%, whose maximum soil moisture value was 36%, not exceeded, therefore, values that could compromise the availability of O 2 in the pore space of the soil.
However, after the 40 th day after soil tillage we observed a tendency to stabilize CO 2 emissions. The same occurred in other studies performed by Panosso [32] and La Scala [33]. This is probably associated to decrease of the labile C caused by tillage [34,35].
Our results also showed that the association between the cover crop and minimum tillage were efficient methods to promote the mitigation of soil CO 2 emissions in sugarcane fields. These results are in agreement with Figueiredo [28], Tanveer [36] and Moitinho [37], that observed a lower CO 2 emission due to the reduction of the frequency of tillage and maintenance of crop residues on soil surface.
The identification of the factors influencing the emission of CO 2 from the soil in agricultural areas opens up opportunities for adopting practices that reduce the net emissions of this gas [27]. In a study on the effect of different soil tillage systems on CO 2 emissions, Lu [38], observed that the average loss of C in the form of CO 2 was significantly lower in the treatment with less soil disturbance (no-tillage) compared to conventional tillage. The authors pointed out that the lower soil temperature presented by no-tillage treatment reduces microbial activity and, consequently the emission of CO 2 in the soil. Another explanation for the lower CO 2 emissions in no-tillage treatment could be greater physical protection of the C inside the soil aggregates. Moreover, the maintenance of crop residues on the soil surface in areas where soil disturbances are minimal, limits soil-residue contact, resulting in a reduced rate of decomposition, and consequently lower CO 2 emissions in these systems [25,39].
The adoption of cover crop and minimum tillage (CTCC) provided desirable modifications in soil attributes, such as: greater Sm, Micro, TP, K + , MBC, and lower, Bd. In agreement, Torres [40], evaluated changes in the soil physical attributes with the use of different cover crops under no-tillage system, and observed positive changes in physical attributes in the soil surface  layer. Similar results were also found by Lu [38], who observed significant changes in soil properties through the conversion of the conventional tillage system to no-tillage. However, it is important to highlight that our results represent a case study with one soil type and with specific climatic conditions, and more studies should be performed in order to analyze the results in different edaphoclimatic conditions.

Classification of CO 2 emissions
The variables of soil moisture, soil temperature, rainfall, pH, and soil organic C showed high predictive power regarding the classification of soil CO 2 emission and were frequently selected through the different methods of attribute selection assessed in this study. Several studies confirm the direct and/or indirect influence of these variables on the soil CO 2 emission. For example, La Scala Junior [33], Corradi [41], Teixeira [42] found direct correlations between soil moisture and CO 2 emissions under different conditions. Iamaguti [27], Karhu [43] and Tavares [44] observed a significant correlation between soil temperature and soil CO 2 emission. Moitinho [29] and Silva-Olaya [30] observed higher CO 2 emissions along days with rainfall events. Fuentes [45] and Marcelo [46] reported that increasing soil pH affects the activity and microbial population in the soil and consequently their CO 2 emission. Reuss and Johnson [47] and Tossell [48], point out that carbonic acid, produced by the dissolution of CO 2 in water, is an important acidifying agent in natural systems. La Scala Junior [49] observed a linear correlation between CO 2 emission and Organic C.
According to Park [50], the closer the AUC is to the value 1, the better the overall performance of the test, which makes a test with AUC = 1 be regarded as perfectly accurate; therefore, the higher the AUC, the better the overall performance of the approach used. In this sense, the IV approach presented higher AUC, being superior to all of the remaining approaches assessed; however, the difference between IV and V, I and II approaches was very low, within the order of 0.05, 0.46 and 0.74, respectively.
In contrast, the analysis of the different approaches in the ROC space proved both the IV and the V as the most satisfactory for having been the only approaches to be encompassed by the convex hull. However, approach IV tends to have advantages over V since it is closer to the point (0.0 and 100%). Prati [6] point out that an optimal model should be as close as possible to the point (0 and 100%). In addition, Fawcett [51] refers to the ROC chart as a description of the relative compensation between the benefits (true positive) and costs (false positives) of a classification. Thus, a point located in the upper left corner demonstrates that a higher amount of positive and negative examples are classified correctly, consequently generating a lower classification cost, corroborating with the statement that the IV approach is the best one.
Through their research, several authors have found good performance both in the attribute selection Wrapper method and in the Multilayer Perceptron classifier. Liu and Yu [52] point out to a variety of available attribute selection algorithms; however, the attribute selection Wrapper method has generally performed better than other methods for selecting a subset of variables that are more suitable to the predetermined mining algorithm, but also tending to demand high computational effort and cost, especially in datasets with great dimensionality, which makes it possibly not suitable to some mining algorithms. In agreement with the results, Hall and Geoffrey [53], by comparing several methods of variables selection for the supervised classification, also concluded that the Wrapper method was the most satisfactory for the selection.
Finally, Freitas [54] investigated the use of the classifier Multilayer Perceptron for the predicting of the spatial variability of soil CO 2 emissions in sugarcane areas, also concluding that this approach satisfactorily met the expectations having proved good application potential for this type of issue.

Conclusions
The CO 2 emissions present high variability, mainly in the first days after the soil tillage, in which soil disturbance caused by tillage and rainfall events stimulate CO 2 production, while the association between cover crop and minimum tillage are effective strategies to promote the mitigation of soil CO 2 emissions.
Data mining techniques to predict soil CO 2 emission proved promising results for having allowed the development of several classifications approaches with high accuracy rate (above 80%), with variables such as soil moisture, soil temperature, rainfall, pH, and organic C showing high predictive power. Finally, among the approaches assessed in this study, the Multilayer Perceptron approach (a particular case of artificial neural networks) with the Wrapper attribute selection method tends to be more advantageous for its higher accuracy of the positive class with a consequent lower cost of classification.
Finally, we highlighted that the data presented herein represent a short-term case study conducted in specific edaphoclimatic conditions. Comprehensive studies evaluating the efficiency of the data mining techniques to predict the soil CO 2 emissions should be encouraged to test the reproducibility of these techniques under different soil and climate conditions.  Table. Soil chemical attributes. Factor 1 = Cover crop; Factor 2 = Soil tillage; Block = Experimental design in randomized blocks; H+Al = Acidity potential.; Al +3 = Exchangeable aluminum; Ca +2 = Exchangeable calcium; Mg +2 = Exchangeable magnesium; K + = Exchangeable potassium; P = Exchangeable phosphorus; Organic C = Organic Carbon. (XLSX) S4