Application of FT-NIR spectroscopy to the prediction of Chromium contamination in soil by evolutionary chemometrics

  • Shaoyong Hong,

    Roles Data curation, Funding acquisition, Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Artificial Intelligence, Guangzhou Huashang College, Guangzhou, China

  • Zhanhong Liang,

    Roles Data curation, Formal analysis, Software, Visualization, Writing – original draft

    Affiliation School of Artificial Intelligence, Guangzhou Huashang College, Guangzhou, China

  • Huazhou Chen,

    Roles Funding acquisition, Investigation, Project administration, Validation

    Affiliation School of Mathematics and Statistics, Guilin University of Technology, Guilin, China

  • Jia Weng,

    Roles Formal analysis, Methodology, Validation

    Affiliation School of Artificial Intelligence, Guangzhou Huashang College, Guangzhou, China

  • Ken Cai,

    Roles Funding acquisition, Resources, Software

    Affiliation College of Automation, Zhongkai University of Agriculture and Engineering, Guangzhou, China

  • Xianchuan Wu

    Roles Conceptualization, Investigation, Resources, Supervision, Writing – review & editing

    wuxianchuan2025@163.com

    Affiliation School of Artificial Intelligence, Guangzhou Huashang College, Guangzhou, China

Abstract

Fourier-transform near-infrared (FT-NIR) technology offers a promising alternative to traditional methods for detecting soil chromium (Cr) contamination. However, the relationship between soil Cr content and the spectra may involve complex non-linear dynamics and data redundancy. Selecting spectral feature variables and constructing parametric scaling models for rapid estimation has therefore become a focal point of current research. In this study, a parametric scaling support vector machine (PSSVM) method is proposed for optimizing the modeling parameters, and a binary modified differential evolution (BDE) algorithm is designed for selecting the feature variables. A combined optimization system is established by embedding the PSSVM model into the BDE iterative process. The system (BDE-PSSVM) is validated by estimating soil Cr content from FT-NIR spectral data. The soil samples were collected from the area around a centralized waste treatment base. The original spectral data were preprocessed by Savitzky-Golay smoothing. The samples were then divided into training and testing sets by the SPXY algorithm, with the testing samples strictly excluded from model training. Feature selection and parametric scaling model optimization were performed simultaneously by the BDE-PSSVM model. The optimal model achieved a minimal root mean square error of 8.114 using only 56 discrete variables. Compared with several counterpart modeling methods, BDE-PSSVM uses fewer feature variables and yields better prediction results. This finding indicates that the proposed BDE-PSSVM modeling system provides an efficient way to rapidly estimate soil Cr content with FT-NIR technology. The system is expected to be tested on additional analytes.

1 Introduction

Heavy metal pollution in soil is mainly caused by human activities such as industrial emissions, overuse of pesticides and fertilizers, and improper treatment of waste. When the concentration of heavy metals exceeds the level that the environment can absorb, it harms the ecological environment and human health [1]. Within the ecosystem cycle, heavy metals accumulate in soil, are transferred to plants and crops, and are finally consumed by humans through the food chain [2]. Consequently, abnormal concentrations of heavy metals in soil can cause hemolysis and cell damage in humans, and may even bring serious irreversible hazards such as carcinogenesis and teratogenicity. Rapid and accurate detection of heavy metal contents in soil is therefore crucial for soil remediation, restoration and reuse, and for protecting human health [3].

Typical methods for soil heavy metal detection rely mainly on reagent-based chemical reactions, which are often destructive, time-consuming, and accompanied by the release of pollutants. Some traditional detection methods originate from electrochemical techniques [4], but they are confined to the laboratory and are not convenient for portable, large-scale detection [5]. Recent developments in spectral sensing technology have been verified for determining heavy metal contents. Laser-induced breakdown spectroscopy (LIBS) has been the common spectroscopic technology applied for Cr detection; combined with chemometrics, LIBS has proved useful for rapid analysis of soil Cr [6,7]. Near-infrared (NIR) spectroscopy has also emerged for the detection of soil heavy metals. It is generally accepted that heavy metal elements in soil produce little spectral response in the NIR bands, which makes it difficult to quantify soil metals by NIR technology [8]. Fourier transform near-infrared (FT-NIR) spectroscopy is an analytical technique based on the interaction of NIR light with the overtone and combination vibrations of molecules [9]. It correlates with the chemical composition and structure of substances to achieve qualitative identification or quantitative determination. The technique is non-destructive, requires no complex pretreatment, and is applicable to samples in various forms such as solids, liquids, and gases. It offers rapid analysis, with a single detection taking only a few minutes, and is environmentally friendly while reducing detection costs [10,11].

Changes in heavy metal content can be revealed in the diffuse reflectance mode, and the FT-NIR bands can reasonably be used to quantify heavy metals in soil samples when supported by recent advances in chemoinformatic methods [12]. Previous publications reported that chemometric studies for NIR/FT-NIR detection of Cr are mainly based on classical partial least squares (PLS) and principal component analysis (PCA) modeling, accompanied by some variable selection technique [13]. Few works report machine learning or deep learning methods for Cr determination. Research on methodologies, especially AI and machine learning methods, has provided a new direction for the inversion of heavy metal contents [14], so that NIR technology has become a powerful tool for estimating heavy metals in soil samples [15].

Methodological investigation of FT-NIR information usually involves noise reduction, feature selection, model optimization and the discussion of model stability [16]. When dealing with multivariate spectral data, the core challenge is to select appropriate spectral features (realized numerically as variable selection on the spectral data). Although selecting relevant bands to benefit the model is common practice, FT-NIR data, which incorporate signal amplification, still contain much hidden information stored in latent variables [17]. This is because there are strong correlations among spectral features, implying that the internal relationships of neighboring bands often overlap (Wang et al., 2020). As the number of targeted bands increases, the spectral curves of a complex analyte such as soil overlap each other and gradually become blurred, so that the collinearity can hardly be resolved. Sample information extracted from the same waveband, or from two partially overlapping wavebands, therefore exhibits similar spectral properties, which increases the difficulty of quantification, especially for trace heavy metals.

To handle the soil properties inherently contained in FT-NIR data, many chemometric methods have been studied or modified, and new ones proposed [18]. Waveband selection, or variable selection, is a traditional issue in spectral data analysis. Conventional methods involve interval screening, moving windows and spatial transformation. Classical methods include principal component analysis, partial least squares and its variants (interval selection, moving window) [19]. Rapid estimation of heavy metal content in soil requires a model that extracts spectral information well [20]. Many studies imply that nonlinear models are generally more accurate than linear models in estimating soil heavy metal content [21,22]. Chromium is a trace constituent, so a high-quality regression model is needed for good prediction. A model for estimating chromium content has been successfully constructed using surface soil samples; the study reported that nonlinear models based on the extreme learning machine can predict heavy metals with high accuracy [23].

Nonlinear models require considerable effort to refine their parameters. The support vector machine (SVM) is a supervised machine learning method based on support vectors. It has been applied in NIR analysis of soil nutrients, food safety, and medicines and pharmaceuticals [24,25]. By introducing support vectors, the correlation with the spectra can be enhanced, potentially aiding the rapid estimation of soil heavy metals [26]. SVM models working with evolutionary techniques have been validated. An SVM model combined with the genetic algorithm (GA) was applied for the inversion of heavy metal contents, and the particle swarm optimization algorithm was employed to enhance the calibration model with support vectors; through model refinement, a significant improvement in prediction accuracy was achieved [27]. Notably, these studies used evolutionary methods for optimizing the SVM parameters, whereas studies on evolutionary methods for spectral feature selection are just beginning.

Suitable optimization algorithms can help a nonlinear model quickly find suitable parameters, thus improving the stability and prediction accuracy of the model for rapid estimation of soil heavy metals [28]. To this end, evolutionary algorithms have been preferred in recent years, as they work through iterative cycles over populations of individuals [29]. However, because soil is a complex analyte containing diverse components, the spectral information extracted by the present algorithms may still lack relevant neighborhood attributes. Support vectors address this issue well. There is also increasing evidence that evolutionary methods are able to select variables for spectral data analysis, such as the validated applications of an adaptively modified GA [30] and of particle swarm optimization (PSO) [31]. These results indicate that an evolutionary algorithm combined with AI-based modeling can support feature selection in spectral chemoinformatics, for the optimization of both linear and nonlinear models. Differential evolution (DE) is a simple, easy-to-operate swarm intelligence algorithm, which provides an iterative way of continuously refining the model while simultaneously evaluating the prediction effects [32]. If the DE method is appropriately designed for spectral feature selection and fused with the tuning of modeling parameters, it is expected to enhance the stability and predictive accuracy of calibration models for rapid estimation of soil heavy metals.

This study focuses on training SVM models for NIR spectral analysis of soil. The spectral data and the corresponding laboratory-tested chromium (Cr) content are measured. The paper contributes in three aspects: (1) it proposes a parametric scaling strategy for optimizing the kernel-supported SVM model for quantitative determination of Cr content from NIR spectral data; (2) it proposes a binary-coding modified differential evolution (BDE) method to be combined with the parametric scaling SVM (PSSVM) model; (3) it explores a deep optimization strategy that couples parameter tuning with feature selection. In quantitative comparison with the conventional PLS and SVM models, the proposed combined optimization strategy provides a robust chemometric methodology for improving the prediction accuracy of the NIR calibration model, which benefits the rapid spectroscopic determination of Cr content in soil samples. However, when the proposed method is applied to other datasets, models must still be re-established for the different target samples. The key to the wider application of spectroscopic metrology models lies in passing the verification of modeling effectiveness.

2 Materials and preparations

2.1 Description of samples

Target soil samples were collected from around the Centralized Waste Treatment Base in north Zhongshan city, Guangdong province, China. The samples are mainly basal waterland soil and marginal red soil. A series of sampling sites was located, with the distance between adjacent sites varying slightly, in the range of about 3–5 meters. At each site, the five-point sampling method was applied in accordance with the national standard of China “Soil Quality - Guidance on Sampling Techniques” (GB/T 36197–2018). Centered at the designated point, a rectangular sampling area is demarcated. One sampling point is set at the center of the rectangular area and one at each of its four vertices, totaling 5 points. At each point, 10 cores were extracted from a depth of 0–20 cm, each weighing about 20 grams. These cores were mixed together as one fifth of a sample, and the parts from the 5 points were then mixed to form one sample (weighing approximately 1 kilogram). Each sample was placed separately in a labelled polythene bag for storage and delivery. The soil samples were first air-dried under sunlight with proper ventilation. They were then screened to remove clods, stones and twigs, and finely ground. Finally, the samples were passed through a 10-mesh nylon sieve for particle refinement.

Chromium is widely distributed in the copper smelting system and is present in almost all intermediate products [33]. Chromium pollution has the most extensive impact, covering the majority of the surveyed land. The measurement of chromium is therefore the primary focus of this study, and a detailed analysis of the chromium (Cr) content in these soils was conducted. One hundred and sixty-five soil samples were used to measure the Cr content and the NIR spectral data. The bagged material of each sample was divided into two parts. One part was used to collect the NIR spectral data, while the other part was used to determine the Cr content following the conventional method specified in the national standard of China “Chromium metal—Determination of chromium content—The ammonium ferrous sulfate titrimetric method” (GB/T 4702.1−2016). The histogram of the Cr content for the 165 soil samples is shown in Fig 1, together with the basic statistics.

Fig 1. Histogram distribution of Cr content for the 165 soil samples.

https://doi.org/10.1371/journal.pone.0341152.g001

2.2 The FT-NIR measurement

The FT-NIR spectra of the 165 samples were measured one by one using the BUCHI NIRMaster Spectrometer (BUCHI Labortechnik AG, Switzerland). The measurements were conducted in accordance with China’s national standards GB/T 21186−2007 “Fourier Transform Infrared Spectrometers” and GB/T 6040−2019 “General Rules for Infrared Analysis”.

Each soil sample was placed in a petri dish, ensuring complete coverage of the bottom and surface to minimize background interference in the measurements. The entire acquisition was carried out in a dark environment to minimize the effects of external stray light, and the data were collected under computer control. During acquisition, the NIR optical path is perpendicular to the soil observation surface; the light passes through the sample to near the bottom of the dish and is reflected back. Sensors and detectors collect the reflectance signals, which are then transformed and amplified by the FT section. During the spectroscopic measurement, the ambient temperature and humidity were kept constant at 23 °C and 42% RH. Each sample was queued to be automatically scanned 64 times, and the averaged data were recorded on the computer. The FT-NIR spectral range covered 10000 to 4000 cm⁻¹ with a digital resolution of 8 cm⁻¹, yielding spectra with a total of 1512 wavenumber points.

During detection, sample properties and environmental and instrument-related factors can all influence the quality of the data, introducing extraneous information such as noise, spikes, instrument-generated noise and stray light [34]. These interferences make it more difficult to find the valuable information and spectral features in the raw data. To address these issues, the Savitzky-Golay (SG) filtering algorithm was used to smooth the data. This method effectively reduces the impact of instrument background and drift on the signal. The optimal pretreatment was the first derivative computed from a cubic polynomial fitted over a 37-point window. The raw spectra and the SG-pretreated curves are shown in Fig 2.
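The SG pretreatment described above (first derivative, cubic polynomial, 37-point window) can be illustrated with a short NumPy sketch. This is not the authors' code: the function names `savgol_coeffs` and `savgol_derivative` are hypothetical, the derivative is expressed per index step rather than per cm⁻¹, and the half-window of points at each end of a spectrum is simply dropped, which is an assumption about edge handling.

```python
import math
import numpy as np

def savgol_coeffs(window: int, polyorder: int, deriv: int = 0) -> np.ndarray:
    """Least-squares Savitzky-Golay weights evaluated at the window center."""
    if window % 2 == 0 or polyorder >= window:
        raise ValueError("window must be odd and larger than polyorder")
    half = window // 2
    z = np.arange(-half, half + 1, dtype=float)
    # Vandermonde matrix: column k holds z**k for k = 0..polyorder.
    A = np.vander(z, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse yields polynomial coefficient a_deriv;
    # the deriv-th derivative of the fit at z = 0 equals deriv! * a_deriv.
    return math.factorial(deriv) * np.linalg.pinv(A)[deriv]

def savgol_derivative(spectra, window=37, polyorder=3, deriv=1):
    """Apply the SG filter row by row; output loses window-1 edge points."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    c = savgol_coeffs(window, polyorder, deriv)
    # np.convolve flips its kernel, so the weights are passed reversed.
    return np.array([np.convolve(row, c[::-1], mode="valid") for row in spectra])

if __name__ == "__main__":
    x = np.arange(100.0)
    d = savgol_derivative(x ** 2)          # exact derivative is 2x
    print(d.shape, d[0, :3])
```

With a 37-point window, a 1512-point spectrum yields 1476 filtered points under this edge handling. Library routines such as `scipy.signal.savgol_filter` provide the same filter with padded edges.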

Fig 2. The FT-NIR spectra of 165 soil samples.

(a) The raw data (b) The SG-pretreated data.

https://doi.org/10.1371/journal.pone.0341152.g002

2.3 Sample set division for data training and testing

For FT-NIR analysis, the sample pool is usually divided into sets for model training and testing. The testing set takes no part in model training, while the training set is used to establish the model, to tune the parameters and hyperparameters, and to determine a suitable optimization strategy. In our experiment, the 165 soil samples are divided into a training set and a testing set at a ratio of 7:3 using the Sample set Partitioning based on joint X-Y distance (SPXY) method [35]. This division allocates 115 samples for training and 50 samples for testing. Descriptive statistics of the Cr contents were computed for the training set and for the testing set (see Table 1). During training, the model is calibrated using 5-fold cross validation.
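As an illustration of the SPXY division, the sketch below implements the usual formulation of the method: pairwise distances in the spectral space and in the reference-value space are each normalized by their maximum and summed, and samples are then picked by the Kennard-Stone max-min rule. The function names are hypothetical, and the details (Euclidean distance, starting from the two most distant samples) are assumptions rather than the authors' exact settings.

```python
import numpy as np

def _pairwise_dist(A: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix computed via the Gram-matrix identity."""
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.maximum(d2, 0.0))

def spxy_split(X, y, n_train):
    """Return (train_idx, test_idx) chosen by joint X-Y distance (n_train >= 2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx, dy = _pairwise_dist(X), _pairwise_dist(y)
    d = dx / dx.max() + dy / dy.max()          # joint, normalized distance
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # For each remaining sample, distance to its nearest selected sample.
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(165, 20)), rng.normal(size=165)
    train, test = spxy_split(X, y, 115)
    print(len(train), len(test))
```

Because samples are picked to cover the joint X-Y space, the training set tends to span the range of both spectra and Cr values, leaving interior samples for testing.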

Table 1. Descriptive statistics of Cr contents for the training and testing sample sets, respectively.

https://doi.org/10.1371/journal.pone.0341152.t001

As is common in spectral analysis, three indicators are employed for model evaluation: the root mean square error (RMSE), the Pearson correlation coefficient (R) and the coefficient of determination (R²) [36]. They are formulated as:

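$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{1}$$

$$R = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 \sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}} \tag{2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3}$$

where $y_i$ represents the Cr content of the $i$-th sample and $\hat{y}_i$ is the FT-NIR modeling prediction value of the Cr content of the $i$-th sample; $\bar{y}$ and $\bar{\hat{y}}$ are the mean values of the sets of $y_i$ and $\hat{y}_i$, respectively; and $n$ represents the number of participating samples in the targeted sample set. For convenience, RMSE, R and R² are written with the subscript CV for cross validation (e.g., $\mathrm{RMSE}_{CV}$, $R_{CV}$) and with a separate subscript for testing.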
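The three evaluation indicators can be computed with a few lines of NumPy. The sketch below uses the standard textbook definitions of RMSE, Pearson R and R²; the function name `regression_metrics` is hypothetical, and normalizing RMSE by the sample count is an assumption about the convention used.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, Pearson R, R^2) for measured and predicted values."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((p - y) ** 2))
    yc, pc = y - y.mean(), p - p.mean()        # centered values
    r = np.sum(yc * pc) / np.sqrt(np.sum(yc ** 2) * np.sum(pc ** 2))
    r2 = 1.0 - np.sum((y - p) ** 2) / np.sum(yc ** 2)
    return rmse, r, r2

if __name__ == "__main__":
    # A constant offset leaves R at 1 but lowers R^2.
    print(regression_metrics([1, 2, 3, 4], [2, 3, 4, 5]))
```

The example shows why R and R² are reported separately: a systematic bias in the predictions does not affect the Pearson correlation but does reduce the coefficient of determination.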

3 Methodologies

Our research focused on machine learning methods for chemometric improvement in FT-NIR data analysis. It involved only non-destructive FT-NIR spectral measurements and surface soil sampling, which do not disturb the soil structure or the surrounding vegetation. The centralized waste treatment base where the samples were collected is not a restricted or protected ecological zone, a cultural relic site, or private property. This is therefore a non-destructive, low-impact research activity in a public, non-protected area, and no permits were required.

3.1 Parametric scaling design for support vector machine

Support Vector Machine (SVM) is a commonly used chemometric method for establishing a non-linear calibration model that minimizes the structural risk. It uses a kernel function to map the variables from a low-dimensional space to a high-dimensional feature space, and establishes the optimal decision function [37]. The aim of SVM optimization is to identify the following functional relationship

$$y = f(x) = w^{\mathsf{T}}\varphi(x) + b \tag{4}$$

where $y$ is the targeted analyte (the Cr content of soil), $x$ is the NIR spectral data, and $\varphi(\cdot)$ denotes the mapping to the feature space. To minimize the structural risk, the constraint is defined in relation to the model coefficients and the prediction errors, i.e.,

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + \frac{C}{2}\sum_{i=1}^{n}\xi_i^{2}, \qquad \text{s.t.}\quad y_i = w^{\mathsf{T}}\varphi(x_i) + b + \xi_i \tag{5}$$

where $w$ and $b$ are the coefficients for regression; $C$ is a tunable parameter for model regularization to avoid overfitting; $\xi$ represents the error between $y$ and $f(x)$, which is denoted as $\xi_i$ for each of the samples ($i = 1, 2, \dots, n$); and $n$ is the number of participating samples.

The radial basis function (RBF) is frequently used as the kernel for the mapping, as it simplifies the computation and is recognized as stable and robust [38]. The RBF kernel uses an exponential transform to describe the projection of $x$ on any one of the other variables $x_j$, namely,

$$K(x, x_j) = \exp\!\left(-\frac{\lVert x - x_j\rVert^{2}}{2\sigma^{2}}\right) \tag{6}$$

where $\sigma$ represents the radial width of the RBF kernel function, which controls the degree of nonlinearity of the mapping.

The parametric scaling SVM (PSSVM) model is operated by jointly tuning the regularization parameter $C$ and the kernel width $\sigma$, which mainly serves to control the optimal structure with minimum risk of falling into a local optimum. By applying a greedy grid search over the combined parameters $(C, \sigma)$, the SVM model can be precisely optimized [39].

However, although PSSVM optimization relies on greedy grid search, it does not provide an effective way to select informative variables in depth. Feature selection in advance of modeling could improve model training and testing [40].
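The grid search over the regularization parameter and the RBF kernel width can be sketched as follows. Several things here are assumptions and not taken from the paper: a least-squares SVM (which has a closed-form solution) is swapped in as the kernel regressor so that the example needs only NumPy, the function names are hypothetical, and the candidate grids passed by the caller are illustrative. Each parameter pair is scored by 5-fold cross-validated RMSE.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    d2 = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def lssvm_fit(X, y, C, sigma):
    """Solve the LS-SVM linear system; returns (bias, alphas)."""
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / C
    sol = np.linalg.solve(M, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

def lssvm_predict(X_train, bias, alpha, X_new, sigma):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + bias

def cv_rmse(X, y, C, sigma, k=5, seed=0):
    """k-fold cross-validated RMSE for one (C, sigma) pair."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_err = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        bias, alpha = lssvm_fit(X[train], y[train], C, sigma)
        pred = lssvm_predict(X[train], bias, alpha, X[fold], sigma)
        sq_err += np.sum((pred - y[fold]) ** 2)
    return np.sqrt(sq_err / len(y))

def grid_search(X, y, C_grid, sigma_grid):
    """Score every (C, sigma) pair; return the best pair and all scores."""
    scores = {(C, s): cv_rmse(X, y, C, s) for C in C_grid for s in sigma_grid}
    return min(scores, key=scores.get), scores

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X[:, 0])
    best, scores = grid_search(X, y, [0.1, 10.0, 1000.0], [0.01, 1.0, 100.0])
    print(best, round(scores[best], 4))
```

Keeping the full `scores` dictionary, rather than only the best pair, mirrors the idea that the grid exposes many acceptable models and not a single optimum.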

3.2 The algorithmic flow of binary differential evolution

Differential evolution (DE) is an iterative optimization technique based on swarm intelligence, which continuously updates evolving populations and evaluates the differences among them [41]. Taking a combination of variables as the population in the DE procedure, the evolution of the population across generations corresponds to the continuous updating of the selected combination of informative variables.

According to the traditional DE method, the parameters include the population size ($NP$), the number of variables in the interval ($L$), the crossover threshold ($CT$) and the maximal iteration number. The binary-modified DE (BDE) method inherits the basic steps of DE, including population initialization and multi-round iteration of mutation, crossover and selection [42]. As an improvement, BDE adopts 0–1 binary coding to decide whether each individual variable is selected, where Code 1 indicates that the variable is selected and Code 0 means that it is not. This operation simplifies the population evolution procedure: the main task is merely to test, for each individual variable, whether its 0–1 code changes during the mutation and crossover steps. Specifically, the BDE iterative optimization process is designed in the following steps.

Population initialization. A population is taken as the solution to the modeling search for feature variables. The population is initialized and evolves over an increasing series of generations (indexed by $g$). Suppose the population is composed of individual variables indexed by $i$, each coupled with a 0–1 binary label $z_i$. Each label is a binary vector $z_i^{g} = (z_{i,1}^{g}, z_{i,2}^{g}, \dots)$ whose $j$-th element takes the value 0 or 1, which determines whether the corresponding individual is selected in the current generation $g$. The individuals are then optimized in iterative cycles of mutation, crossover and selection. A fitness function is designed to evaluate the BDE evolutionary results for feature variable selection [43].

Mutation. The mutation of an individual strictly follows the mutation of its label $z_i$. How an individual mutates is influenced by the other individuals (represented by their labels). Taking two other individuals as the influencing factors, the elements of $z_i$ are updated as follows,

$$v_{i,j}^{g+1} = \begin{cases} 1 - z_{i,j}^{g}, & d_j = 1 \\ z_{i,j}^{g}, & d_j = 0 \end{cases} \qquad \text{with}\quad d_j = \left| z_{r_1,j}^{g} - z_{r_2,j}^{g} \right| \tag{7}$$

where $g$ points to the current generation and $g+1$ means the next generation; $z_{r_1}^{g}$ and $z_{r_2}^{g}$ represent the labels of two other individuals which are randomly selected ($r_1 \neq r_2 \neq i$); and $v_i^{g+1}$ represents a candidate for $z_i^{g+1}$ produced by mutation. Equation (7) states that whether the $j$-th element of $z_i^{g}$ mutates depends on the difference between the $j$-th elements of $z_{r_1}^{g}$ and $z_{r_2}^{g}$ (denoted as $d_j$). If $d_j = 1$, then $v_{i,j}^{g+1}$ becomes the opposite of $z_{i,j}^{g}$ (changed from 0 to 1 or from 1 to 0); if $d_j = 0$, then $v_{i,j}^{g+1}$ is the same as $z_{i,j}^{g}$. In this way, all elements of $z_i$ are checked and refreshed from the $g$-th generation to become the candidate at the $(g+1)$-th generation, as $j$ goes through the dimensions.

Crossover. Crossover takes place on the basis of the mutated individuals, in exact correspondence with the label $v_i^{g+1}$. The resulting crossover candidate is labeled $u_i^{g+1}$. The calculation is monitored with a preset threshold ($CT$), and an instant crossover probability ($p$) is produced for making the decision. If $p \le CT$, then $u_i^{g+1} = v_i^{g+1}$; otherwise, crossover is not accepted, and $u_i^{g+1}$ remains the same as $z_i^{g}$. In sequence, all individuals of the population are updated according to the labels $u_i^{g+1}$.

Selection. The fitness value is computed for selecting the evolved individuals at the $(g+1)$-th generation. The selection decision depends on comparing the candidate individuals labeled with $u_i^{g+1}$ to the original individuals labeled with $z_i^{g}$. If the fitness value is improved, the individuals of the $(g+1)$-th generation (denoted with labels $z_i^{g+1}$) take the candidate labels; otherwise, this round of BDE evolution has not improved the set of feature variables, and $z_i^{g+1}$ inherits the original individuals of the $g$-th generation, namely,

$$z_i^{g+1} = \begin{cases} u_i^{g+1}, & \text{if the fitness of } u_i^{g+1} \text{ is better than that of } z_i^{g} \\ z_i^{g}, & \text{otherwise} \end{cases} \tag{8}$$

Accordingly, the labels $z_i^{g+1}$ point to a refreshed set of improved feature variables after one round of BDE iteration. As the iteration proceeds to the end, the set of informative variables is well optimized, which enhances the model prediction performance.
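The mutation, crossover and selection steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the population is represented as `n_pop` binary masks over `n_vars` candidate variables (at least three masks are needed to draw two other individuals), the crossover decision is drawn per element as in standard DE, lower fitness is treated as better, and the names `bde_select` and `fitness` are hypothetical. In a spectral application, `fitness` would return a cross-validated error of the model trained on the selected columns.

```python
import numpy as np

def bde_select(fitness, n_vars, n_pop=8, ct=0.5, n_iter=200, seed=0):
    """Minimize fitness(mask) over 0-1 masks; returns (mask, best, history)."""
    if n_pop < 3:
        raise ValueError("need at least 3 individuals to pick two others")
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(n_pop, n_vars))       # random 0-1 labels
    fit = np.array([fitness(ind) for ind in pop])
    history = []
    for _ in range(n_iter):
        for i in range(n_pop):
            others = [k for k in range(n_pop) if k != i]
            r1, r2 = rng.choice(others, size=2, replace=False)
            # Mutation: flip bit j wherever the two reference labels differ.
            mutant = pop[i] ^ (pop[r1] ^ pop[r2])
            # Crossover: accept the mutated bit when the draw is <= threshold.
            accept = rng.random(n_vars) <= ct
            trial = np.where(accept, mutant, pop[i])
            # Selection: keep the trial only if it improves the fitness.
            trial_fit = fitness(trial)
            if trial_fit < fit[i]:
                pop[i], fit[i] = trial, trial_fit
        history.append(fit.min())
    best = int(np.argmin(fit))
    return pop[best].astype(bool), fit[best], history

if __name__ == "__main__":
    # Toy fitness: Hamming distance to a hidden target mask.
    target = np.random.default_rng(1).integers(0, 2, 40)
    mask, best, hist = bde_select(lambda m: int(np.sum(m != target)), 40, n_iter=50)
    print(best, hist[0], hist[-1])
```

Because a trial replaces its parent only when it improves, the best fitness in `history` never gets worse from one iteration to the next. Note that a bit on which all individuals agree can no longer flip under this mutation rule, so a larger or more diverse initial population may help.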

3.3 The combined modeling framework of BDE-PSSVM

In this study, we propose a novel optimization strategy that combines the PSSVM model with the iterative BDE method. The combined modeling procedure follows the BDE multi-round cycle of mutation, crossover and selection, with the PSSVM model embedded in each step. The model predictive value is used to evaluate the immediate evolutionary fitness values. The flowchart of the BDE-PSSVM modeling procedure is shown in Fig 3.

In the experiment, the RBF-supported SVM model is trained and tested with its combined parameters $(C, \sigma)$ scaled over a preset grid of values. The BDE iteration is tested with expanded parameter tuning for $NP$ (the size of the population changes from 1 to 8), $L$ (the number of individuals contained in any one population changes from 10 to 100 in steps of 10) and the maximal iteration number (a maximum of 200 iterative rounds). The crossover is tested at four different values of the threshold $CT$. The fitness value is evaluated using the predictive value from the kernel-supported SVM model. In this way, the NIR calibration model is finely optimized by the BDE-PSSVM methodology to obtain stable quantification results with high expected prediction accuracy.

4 Results and discussion

4.1 SVM parameter scaling optimization

The PSSVM model was applied to FT-NIR spectroscopic analysis for predicting the soil chromium content, with the RBF function used as the kernel. The regularization factor $C$ and the radial width $\sigma$ of the RBF kernel had great influence on the prediction effect of the model. Parametric scaling tuning of the parameter pair $(C, \sigma)$ optimizes the model used to determine the Cr content of soil. The calibration samples were used to train the model, and $(C, \sigma)$ are adjustable during training.

The parameters $C$ and $\sigma$ were arranged as pairs $(C, \sigma)$ for grid search. The candidate values of each parameter are preset as integer exponentials for screening; thus, a total of 100 candidate combinations of $(C, \sigma)$ were monitored. Each combination of $(C, \sigma)$ determines one SVM model for NIR prediction. The trials progress gradually from the initial prediction to the optimal result. The objective function value was calculated by cross validation during training. Through parametric scaling, the $\mathrm{RMSE}_{CV}$ resulting from each value of $(C, \sigma)$ is found, and a scatter plot is drawn for internal comparison (see Fig 4). In the figure, the values of $C$ and $\sigma$ are presented in logarithmic form, so the optimal SVM model is easily located; it gives an optimized $\mathrm{RMSE}_{CV}$ of 14.48 mg/kg. Beyond the optimal model, an advantage of parametric scaling is that it simultaneously provides many other models with acceptable prediction results, as seen for several other parameter combinations in Fig 4. These alternatives leave room for choice in actual application.

Fig 4. The grid search prediction results by parametric scaling of $(C, \sigma)$ for the SVM model.

https://doi.org/10.1371/journal.pone.0341152.g004

However, the PSSVM model is established and trained on the full-range spectroscopic data. Working on all variables, the candidate parameter combinations inevitably obtain a less-than-optimal result through the kernel calculation, even though the model runs in grid search scaling mode. To address this issue, the differential evolution method is applied to select feature variables for optimizing the PSSVM model.

4.2 Feature selection by BDE-PSSVM optimization

The model benefits from the BDE-PSSVM combination, in which BDE serves to select informative features. The characteristic of the model is that the PSSVM modeling section is embedded into the BDE iterative optimization process. Expanding the BDE parameters (i.e., the size of the population, the decomposed length of an individual, and the number of iterations) helps enlarge the coverage of model training and speeds up convergence to the optimal solution. The initial 0–1 value of each individual is partially random but obeys the BDE formulation. The parameters preset in our experiment are tested in each BDE iterative round of mutation, crossover and selection. Over the total of 200 iterations, all candidate PSSVM models are tested and the optimal one is highlighted. To fully account for the influence of different combinations of the PSSVM modeling parameters, we highlight the optimal modeling result at each step of the BDE iteration (see Fig 5). Across the BDE iterations, the best BDE-PSSVM model is observed at the 146th iteration. More practically, the model prediction results become stable and close to the best model after around the 95th iteration.

Fig 5. The iterative trend of BDE optimization in combination with PSSVM parametric scaling.

https://doi.org/10.1371/journal.pone.0341152.g005

Combinations of informative variables were selected by the BDE-PSSVM model, which can be categorized as a wavelength-based variable selection method. Examining how the BDE parameters are tuned helps to shape a strategy for practical feature selection. In addition to the number of iterations, the BDE parameters $NP$, $L$ and $CT$ are also monitored. With $NP$ tuned over 1 to 8, $L$ over 10 to 100 in steps of 10, and $CT$ over its four thresholds, a total of 8 × 10 × 4 = 320 parametric BDE patterns are each trained for 200 iterations, during which the PSSVM models are optimized. The best refined model is observed for each pattern (see Fig 6). Every pattern determines one variable combination that points to a set of informative features. For example, the optimal pattern, found at one particular combination of $\{NP, L, CT\}$, gives the minimal $\mathrm{RMSE}_{CV}$ of 8.114 and results in a combination of the 56 most informative variables (see Fig 7).

Fig 6. The BDE iterative results by tuning the specific parameters of {NP, L, CT}.

(a) NP = 1 (b) NP = 2 (c) NP = 3 (d) NP = 4 (e) NP = 5 (f) NP = 6 (g) NP = 7 (h) NP = 8.

https://doi.org/10.1371/journal.pone.0341152.g006
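Screening the BDE patterns amounts to enumerating every parameter combination and keeping the one with the lowest error. The sketch below shows that bookkeeping only. The tuning ranges are hypothetical (the text states the parameter names and the total of 320 patterns, and Fig 6 shows eight NP levels; the 8 x 5 x 8 split is an assumption), and `dummy_run` is a made-up placeholder for a full 200-iteration BDE-PSSVM run.

```python
from itertools import product

# Hypothetical tuning ranges; only the parameter names, the eight NP levels
# and the total of 320 patterns are taken from the article.
NP_VALUES = range(1, 9)  # eight levels, as in panels (a)-(h) of Fig 6
L_VALUES = range(1, 6)   # assumed: five levels
CT_VALUES = range(1, 9)  # assumed: eight levels


def best_pattern(run_pattern):
    """Evaluate every (NP, L, CT) pattern and return the one with the lowest score."""
    results = {
        pattern: run_pattern(*pattern)
        for pattern in product(NP_VALUES, L_VALUES, CT_VALUES)
    }
    winner = min(results, key=results.get)
    return winner, results[winner], results


def dummy_run(np_, l, ct):
    """Placeholder for one BDE-PSSVM run; returns a made-up error value."""
    return 1.0 + abs(np_ - 4) + abs(l - 3) + abs(ct - 2)


winner, score, results = best_pattern(dummy_run)
```

Keeping the full `results` dictionary, rather than only the winner, makes it possible to inspect near-optimal patterns as well, in the spirit of the per-pattern comparison in Fig 6.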

4.3. Model verification and comparison

The feature variables selected during BDE-PSSVM training were taken as the informative variable combination for the optimal prediction of soil Cr content. Subsequently, the model with its optimal parameters and selected feature variables was verified on the testing samples, which are entirely independent of the model training process.

To investigate the modeling efficiency, the BDE-PSSVM model is compared with models combined with other evolutionary methods (GA and PSO) and with classical variable selection models (MWPLS, SiPLS). The model training and testing results are shown in Table 2, which also lists the number of variables used. Table 2 shows that the BDE-PSSVM model exhibits the lowest RMSE and the highest R while using the fewest feature variables, which indicates that the regression model is simple. In contrast to the competitor models, BDE-PSSVM achieves marked improvements in both prediction accuracy and feature selection.

Table 2. Comparison of the optimal models by different methods with parametric scaling.

https://doi.org/10.1371/journal.pone.0341152.t002
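The two indicators compared in Table 2, RMSE and the correlation coefficient R, can be computed from measured and predicted values as sketched below. The function names and the four sample values are illustrative assumptions, not data from this study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return cov / (spread_t * spread_p)


# Made-up measured and predicted values, for illustration only.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
error = rmse(measured, predicted)
corr = pearson_r(measured, predicted)
```

A lower RMSE and an R closer to 1 both indicate better agreement, which is how the rows of Table 2 are ranked.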

The outputs of the proposed model were further compared with published results from recent years. For example, Wang et al. reported an of 0.737 and an RPD of 3.000 for soil Cr prediction using the MEA-BP neural network model [44]. Han et al. established an SVM model, achieving a predictive RMSE of 2.20 and an R² of 0.77 [45]. Shirley et al. employed a PLS model coupled with MSC preprocessing, yielding a prediction RMSE of 7.57 and an R² of 0.76 [46]. Yuan et al. proposed a wavelength phase-out PLS model combined with repetition rate priority combination methods to improve soil Cr prediction via NIR spectroscopy, achieving an RMSE of 9.15 and an R of 0.843 [47]. In direct comparison with these previously reported NIR/FT-NIR modeling results, the proposed BDE-PSSVM model yielded an RMSECV of 8.114 and an RCV of 0.931. The performance is superior to most of the previous studies. Moreover, the observed of 0.864 indicates that the model has high explanatory power for soil Cr content, thereby ensuring optimal testing results.
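Several of the cited studies report R² and RPD rather than R, so a short sketch of those two indicators may help when reading the comparison. It assumes the common conventions R² = 1 − SS_res/SS_tot and RPD = SD/RMSE with the sample standard deviation of the measured values; the cited papers may use slightly different definitions, and the sample values are made up.

```python
import math
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD of measured values over RMSE.

    Assumes the sample standard deviation (n - 1 denominator).
    """
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return statistics.stdev(y_true) / rmse


# Made-up measured and predicted values, for illustration only.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
r2 = r_squared(measured, predicted)
ratio = rpd(measured, predicted)
```

Note that RMSE depends on the concentration range of each sample set, so RMSE values from different studies are only loosely comparable; R² and RPD are scaled by the spread of the measured values and travel somewhat better across data sets.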

5 Conclusions

To accurately predict soil Cr content by FT-NIR spectroscopy, this study proposes a modified model optimization system based on BDE iterations combined with PSSVM modeling. The combined optimization model serves to carry out a deep learning strategy for grid search parameter tuning and feature selection.

Based on the analysis with the PSSVM model, a grid search approach to parametric optimization is proposed. Besides identifying the optimal PSSVM model, we also provided a number of acceptable parameter combinations for model optimization, which can support real-world applications. Moreover, the BDE-PSSVM modeling system is established on the iteration of generations by modifying the DE method. The parametric scaling SVM modeling step is embedded in each round of BDE iteration to adjust the modeling parameters step by step. BDE plays the 0–1 valuing role of selecting feature variables for model training and testing. Finally, the BDE-PSSVM model selects a set of 56 feature variables (see Fig 7) that are most informative for the quantitative prediction of soil Cr content. The model optimizes the FT-NIR calibration model well and improves prediction performance compared with the counterpart models, including evolutionary algorithms of a similar type, classical linear and nonlinear modeling methods (SVM, PLS) and their variants for feature selection (see Table 2).

The research findings demonstrate that the BDE-PSSVM model performs well in modeling FT-NIR data. They also reveal that, for the quantitative prediction of soil Cr, many acceptable models exist that may suit in-situ application. For other analytical targets, the models must still be re-established for each different sample set, which is a limitation common to studies on algorithms and methodologies. This research strengthened the stability and prediction accuracy of the binary modified DE algorithm combined with the parametric scaling SVM model, providing a reference for future prediction of Cr and other heavy metal contents in large-scale contaminated soil areas. For different detection objects, the approach can readily be extended by training the models on different sample sets. This study thus aims to verify that the proposed modeling and calculation methodology has great application potential.

References

  1. Ali MM, Rahman S, Islam MS, Rakib MRJ, Hossen S, Rahman MZ, et al. Distribution of heavy metals in water and sediment of an urban river in a developing country: a probabilistic risk assessment. Inter J Sedim Res. 2022;37(2):173–87.
  2. ERKOÇ HA, ÇOLAK ESETLİLİ B. Potential of Purslane (Portulaca oleracea L.) in Phytoremediation: A Study on the Bioaccumulation and Bio-Transfer of Cadmium, Nickel, and Copper in Contaminated Soils. J Agr Sci-Tarim Bili. 2023.
  3. Sun Y, Li Y. Potential environmental applications of MXenes: a critical review. Chemosphere. 2021;271:129578. pmid:33450420
  4. Duan Y, Wang S, He C, Zhang F, Wang Y, Li Y. Research on oil recovery of oily sludge by subcritical/supercritical hydrothermal upgrading. J of Chemical Tech & Biotech. 2024;99(6):1435–44.
  5. Miao Z, De Buck J. Biosensor for PCR amplicons by combining split trehalase and DNA-binding proteins: a proof of concept study. J Microbiol Methods. 2023;211:106780. pmid:37422082
  6. Li X, Huang J, Chen R, You Z, Peng J, Shi Q, et al. Chromium in soil detection using adaptive weighted normalization and linear weighted network framework for LIBS matrix effect reduction. J Hazard Mater. 2023;448:130885. pmid:36738619
  7. Xu J, Wang X, Yao M, Liu M. Polarization-resolved LIBS for chromium quantification in soil: a novel chemometric model for matrix effect suppression and detection limit enhancement. J Anal At Spectrom. 2025;40(9):2556–61.
  8. Huang Y-C, Huang C-Y, Minasny B, Chen Z-S, Hseu Z-Y. Using pXRF and Vis-NIR for characterizing diagnostic horizons of fine-textured podzolic soils in subtropical forests. Geoderma. 2023;437:116582.
  9. Chen H, Liu Z, Cai K, Xu L, Chen A. Grid search parametric optimization for FT-NIR quantitative analysis of solid soluble content in strawberry samples. Vibrat Spect. 2018;94:7–15.
  10. Quintelas C, Braga A, Cordeiro A, Ferreira EC, Belo I, Páscoa RNMJ. FT-NIR spectroscopy analysis for monitoring the microbial production of 2-phenylethanol using crude glycerol as carbon source. LWT Food Sci Technol. 2022;155:112951.
  11. Fazayeli H, Amodio ML, Fatchurrahman D, Serio F, Montesano FF, Burud I, et al. Potential application of hyperspectral imaging and FT-NIR spectroscopy for discrimination of soilless tomato according to growing techniques, water use efficiency and fertilizer productivity. Sci Hortic. 2024;328:112928.
  12. Abrantes G, Almeida V, Maia AJ, Nascimento R, Nascimento C, Silva Y, et al. Comparison between Variable-Selection Algorithms in PLS Regression with Near-Infrared Spectroscopy to Predict Selected Metals in Soil. Molecules. 2023;28(19):6959. pmid:37836802
  13. Wang J, Hu X, Shi T, He L, Hu W, Wu G. Assessing toxic metal chromium in the soil in coal mining areas via proximal sensing: prerequisites for land rehabilitation and sustainable development. Geoderma. 2022;405:115399.
  14. Chen H, Xie J, Xu L, Feng Q, Lin Q, Cai K. Feature selection for portable spectral sensing data of soil using broad learning network in fusion with fuzzy technique. IEEE Sensors J. 2024;24(5):5644–53.
  15. Zhai Y, Zhou L, Qi H, Gao P, Zhang C. Application of visible/near-infrared spectroscopy and hyperspectral imaging with machine learning for high-throughput plant heavy metal stress phenotyping: a review. Plant Phenomics. 2023;5:0124. pmid:38239738
  16. Liu K, Zhang Y, Gao T, Tong F, Liu P, Li W, et al. A handheld rapid detector of soil total nitrogen based on phase-locked amplification technology. Comput Electron Agric. 2024;224:109233.
  17. Chen H, Lin B, Cai K, Chen A, Hong S. Quantitative analysis of organic acids in pomelo fruit using FT-NIR spectroscopy coupled with network kernel PLS regression. Infrared Phy Technol. 2021;112:103582.
  18. Giménez-Campillo C, Arroyo-Manzanares N, Campillo N, Díaz-García MC, Viñas P. A volatilomic approach using ion mobility and mass spectrometry combined with multivariate chemometrics for the assessment of lemon juice quality. Food Control. 2025;169:111027.
  19. Mareczek L, Riehl C, Harms M, Reichl S. Analysis of the impact of material properties on tabletability by principal component analysis and partial least squares regression. Eur J Pharm Sci. 2024;200:106836. pmid:38901784
  20. Liu Q, Du B, He L, Zeng Y, Tian Y, Zhang Z, et al. Digital soil mapping of heavy metals using multiple geospatial data: feature identification and deep neural network. Ecol Indic. 2023;154:110863.
  21. Yolcu U, Yalcin IE, Uras ME, Ozyigit II. Modeling the effects of essential heavy metals on environmental pollution: a linear and nonlinear prediction model via cascade forward‐neural network. Math Methods App Sci. 2024;47(6):4306–18.
  22. Huo X-S, Chen P, Li J-Y, Xu Y-P, Liu D, Chu X-L. Comparative study of linear and nonlinear calibration algorithm for extrapolation ability of near infrared spectroscopy quantitative analysis. Vibrat Spectrosc. 2024;132:103693.
  23. Wang W, Man Z, Li X, Chen R, You Z, Pan T, et al. Response mechanism and rapid detection of phenotypic information in rice root under heavy metal stress. J Hazard Mater. 2023;449:131010. pmid:36801724
  24. Ayinde AS, Huaming Y, Kejian W. Review of machine learning methods for sea level change modeling and prediction. Sci Total Environ. 2024;954:176410. pmid:39312971
  25. Mitsui T, Mori A, Takamatsu T, Kadota T, Sato K, Fukushima R, et al. Evaluating the identification of the extent of gastric cancer by over-1000 nm near-infrared hyperspectral imaging using surgical specimens. J Biomed Opt. 2023;28:086001.
  26. Beniaich A, Terra FS, Demattê JAM, Horák-Terra I, Martins JKD, Sousa-Baracho IP. Enhancing soil property predictions using spectral fusion: comparisons between outer product analysis and vector concatenation and among modeling algorithms. Soil Tillage Res. 2025;251:106546.
  27. Wang Y, Li Z, Wang W, Liu P, Tan X, Bian X. Rapid quantification of single component oil in perilla oil blends by ultraviolet-visible spectroscopy combined with chemometrics. Spectrochim Acta A Mol Biomol Spectrosc. 2024;321:124710. pmid:38936207
  28. Lu X, Li F, Yang W, Zhu P, Lv S. Quantitative analysis of heavy metals in soil by X-ray fluorescence with improved variable selection strategy and bayesian optimized support vector regression. Chemomet Intell Lab Syst. 2023;238:104842.
  29. Jiang M, Chen M, Zeng J, Du Z, Xiao J. A comprehensive evaluation of the potential of three next-generation short-read-based plant pan-genome construction strategies for the identification of novel non-reference sequence. Front Plant Sci. 2024;15:1371222. pmid:38567138
  30. Feng Q, Chen H, Xie H, Cai K, Lin B, Xu L. A novel genetic algorithm-based optimization framework for the improvement of near-infrared quantitative calibration models. Comput Intell Neurosci. 2020;2020:7686724. pmid:32695153
  31. Wang S, Zhang X, Mpango P, Sun H, Bian X. Firefly interval selection combined with extreme learning machine for spectral quantification of complex samples. J Chemom. 2024;38(9).
  32. Shen Y, Wu J, Ma M, Du X, Wu H, Fei X, et al. Improved differential evolution algorithm based on cooperative multi-population. Eng Appl Artif Intell. 2024;133:108149.
  33. Liu Y, Chen H, Yang Y, Jiao C, Zhu W, Zhang Y, et al. Atomically inner tandem catalysts for electrochemical reduction of carbon dioxide. Energy Environ Sci. 2023;16(11):5185–95.
  34. James E, Powell S, Munro P. Performance optimisation of a holographic Fourier domain diffuse correlation spectroscopy instrument. Biomed Opt Express. 2022;13(7):3836–53. pmid:35991914
  35. Chen W, Chen H, Feng Q, Mo L, Hong S. A hybrid optimization method for sample partitioning in near-infrared analysis. Spectrochim Acta A Mol Biomol Spectrosc. 2021;248:119182. pmid:33234474
  36. Chen Y, Zhang J, Feng J, Chen W, Liu W, Chen J, et al. Holistic quality evaluation method of Epimedii Folium based on NIR spectroscopy and chemometrics. Phytochem Anal. 2024;35(4):771–85. pmid:38273442
  37. Forootani A, Iervolino R, Tipaldi M, Baccari S. A kernel-based approximate dynamic programming approach: theory and application. Automatica. 2024;162:111517.
  38. Elen A, Baş S, Közkurt C. An adaptive gaussian kernel for support vector machine. Arab J Sci Eng. 2022;47(8):10579–88.
  39. Mo L, Chen H, Chen W, Feng Q, Xu L. Study on evolution methods for the optimization of machine learning models based on FT-NIR spectroscopy. Infrared Phy Technol. 2020;108:103366.
  40. Diaz-Olivares JA, Bendoula R, Saeys W, Ryckewaert M, Adriaens I, Fu X, et al. PROSAC as a selection tool for SO-PLS regression: a strategy for multi-block data fusion. Anal Chim Acta. 2024;1319:342965. pmid:39122277
  41. Yang Q, Yuan S, Gao H, Zhang W. Differential evolution with migration mechanism and information reutilization for global optimization. Expert Syst Appl. 2024;238:122076.
  42. Zhang Y, Chen H, Chen W, Xu L, Li C, Feng Q. Near Infrared feature waveband selection for fishmeal quality assessment by frequency adaptive binary differential evolution. Chemom Intell Lab Syst. 2021;217:104393.
  43. Sallam KM, Abohany AA, Rizk-Allah RM. An enhanced multi-operator differential evolution algorithm for tackling knapsack optimization problem. Neural Comput Applic. 2023;35(18):13359–86.
  44. Wang X, An S, Xu Y, Hou H, Chen F, Yang Y. A back propagation neural network model optimized by mind evolutionary algorithm for estimating Cd, Cr, and Pb concentrations in soils using Vis-NIR diffuse reflectance spectroscopy. Appl Sci. 2020;10:51.
  45. Han A, Lu X, Qing S, Bao Y, Bao Y, Ma Q, et al. Rapid determination of low heavy metal concentrations in grassland soils around mining using vis-NIR spectroscopy: a case study of inner Mongolia, China. Sensors (Basel). 2021;21(9):3220. pmid:34066493
  46. Silva FSR, da Silva YJAB, Maia AJ, Biondi CM, Araújo PRM, Barbosa RS, et al. Prediction of heavy metals in polluted mangrove soils in Brazil with the highest reported levels of mercury using near-infrared spectroscopy. Environ Geochem Health. 2023;45(11):8337–52. pmid:37605089
  47. Yuan L, Chen X, Yao L, Pan T. Multi-parameter optimization for Vis–NIR spectroscopic analysis of multiple indicators of soil heavy metal in the tideland reclamation area of the Pearl River Delta. Soil Sediment Contam An Inter J. 2023;33(2):115–38.