
Research on the chemical oxygen demand spectral inversion model in water based on IPLS-GAN-SVM hybrid algorithm

  • Qirong Lu,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Validation, Visualization

    Affiliations College of Information Science and Engineering, Guilin University of Technology, Guilin, China, Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, China

  • Jian Zou ,

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    2297141085@qq.com

    Affiliations College of Information Science and Engineering, Guilin University of Technology, Guilin, China, Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, China

  • Yingya Ye,

    Roles Software, Supervision, Writing – review & editing

    Affiliations College of Information Science and Engineering, Guilin University of Technology, Guilin, China, Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, China

  • Zexin Wang

    Roles Conceptualization, Data curation, Formal analysis

    Affiliations College of Information Science and Engineering, Guilin University of Technology, Guilin, China, Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, China

Abstract

Spectral collinearity and limited spectral datasets are two problems that affect Chemical Oxygen Demand (COD) modeling. To address the first problem and obtain the optimal modeling range, the spectra are preprocessed using six methods, including Standard Normal Variate and Savitzky-Golay Smoothing Filtering (SG). Subsequently, the 190–350 nm spectral range is divided into 10 subintervals, and Interval Partial Least Squares (IPLS) is used to perform PLS modeling on each interval. The results indicate that the 7th range (238–253 nm) yields the best model. The Mean Square Error (MSE), Mean Absolute Error (MAE) and R2score of the model without pretreatment are 1.6489, 1.0661, and 0.9942. After pretreatment, SG performs better than the other methods, with MSE and MAE decreasing to 1.4727 and 1.0318 and R2score improving to 0.9944. Using the optimal model, the predicted COD values for three samples are 10.87 mg/L, 14.88 mg/L, and 19.29 mg/L. To address the problem of the small dataset, Generative Adversarial Networks are used for data augmentation, and three augmented datasets are obtained for Support Vector Machine (SVM) modeling. The results indicate that, compared with the original dataset, the SVM's MSE and MAE decrease, while its accuracy improves by 2.88%, 11.53%, and 11.53%, and its R2score improves by 18.07%, 17.40%, and 18.74%.

Introduction

Chemical Oxygen Demand (COD) is one of the indicators used to represent the degree of water pollution, reflecting the amount of oxidant consumed in oxidizing a water sample. Currently, there are various methods for measuring COD in water, and among them, models based on UV-visible spectroscopy have the advantages of easy operation and data analysis. UV-visible-near infrared spectrophotometers and infrared spectrometers [1–6] are typically used for qualitative analysis. Spectroscopic techniques [7–11] are also used for online monitoring and for studying the correlation between water indicators and spectral intensity. However, spectral data are easily affected by factors such as instrument response, sample preparation, and environmental noise, resulting in noise and biases. Pretreatment methods are essential to reduce noise and improve the correlation between spectral data and chemical composition. Methods such as Standard Normal Variate, Multiple Scattering Correction, Smoothing Filtering, Moving Average Filtering, First-Order Differentiation, Second-Order Differentiation, Wavelet Transformation, Standardization, and Normalization are widely adopted. In spectral modeling, the Partial Least Squares (PLS) algorithm is typically combined with other algorithms to achieve better modeling performance [12]. For example, Kernel Partial Least Squares and Boosting PLS have been utilized to predict leaf water content [13]. In another study, PLS and Support Vector Machine (SVM) algorithms were used to detect trace element content in poultry manure [14]. Due to the collinearity of spectra, selecting the optimal modeling wavelengths is crucial. To this end, Ying Li [15] integrated swarm intelligence algorithms and the PLS algorithm to establish a model for detecting apple juice adulteration. Similarly, Cheng et al. [16] combined the genetic algorithm with the PLS model to obtain optimal modeling wavelengths.

SVM algorithms are typically used for chemical concentration detection [17–20]. C. Robert [21] used both linear and non-linear SVM models to identify whole beef and lamb meats. Similarly, H. Sun [22] combined Kernel Principal Component Analysis and SVM to improve accuracy. However, compared to PLS, SVM requires larger datasets to achieve good training results. Furthermore, the high cost and large size of multi-functional spectrometers [23–25] make it impractical for research groups with limited funding to acquire large amounts of data. Therefore, Generative Adversarial Networks (GAN) can be used for data augmentation [26–32]. Cao Z et al. [33] combined GAN networks with spectral data analysis to enhance analysis accuracy and mitigate overfitting. In response to the scarcity of rice seed spectral data, Qi et al. [34] generated rice seed spectral data to address the issue of limited samples; based on this, a neural network model was established using three modeling methods: real data modeling, fake data modeling, and mixed modeling of real and fake data. Zhang M et al. [35] proposed a new data augmentation strategy based on the original GAN network to tackle the challenges of small sample sizes and imbalanced samples in hyperspectral image processing. J. Wang [36] utilized a trained CGAN model for data augmentation, resulting in a five-fold increase in the dataset. Additionally, Cai et al. [37] utilized the spectrograms of samples as inputs and applied GAN-based data augmentation to generate additional training data. Miao et al. [38] utilized a GAN to generate highly similar and diverse synthetic samples for fault diagnosis.

After a comprehensive analysis, the experiment combines UV-Visible spectroscopy, the Interval Partial Least Squares (IPLS) method, SVM, and GAN for COD concentration analysis. First, the UV-Visible spectrophotometer is used to obtain the spectral intensity of COD samples. At the same time, six methods are used to preprocess the water data. Secondly, spectral data training and test sets are created, and IPLS is used to select the spectral range for modeling, in which the entire spectral range is divided into 10 segments and a PLS model is established for each range. Thirdly, the model with the highest accuracy is selected. Subsequently, GAN is utilized to process both the original and preprocessed spectral data, generating additional data for modeling. Lastly, SVM models are constructed for both the original and generated spectral data to validate the feasibility of the GAN through modeling effects.

Materials and methods

The study does not involve activities that require specific permits, such as working with endangered species or in protected areas. In accordance with local regulations and guidelines, no permits are required for this study.

Instruments and reagents

The experiment uses a spectrum acquisition system composed of a hyperspectral imager, a quartz cuvette, and a tungsten lamp lighting source. The Ultraviolet (UV) spectrum of the water sample is obtained through this system, produced by Beijing Puxi General Instrument Co., Ltd., which covers a wide wavelength range of 190–900 nm with a wavelength indication error of ±0.3 nm. The detailed parameters are shown in Table 1 below:

This instrument has functions such as photometric measurement, spectral scanning, quantitative determination, time scanning, spectral bandwidth scanning, DNA protein determination, and graphic processing. During the experiment, to ensure the accuracy of measurement results, a dark current calibration is used to eliminate some instrument noise. When measuring the absorbance or transmittance, baseline calibration is required. In this paper, the first step is to use the spectral scanning function to obtain the data of the COD samples, and then use Python for visual analysis to obtain the relationship between the solution concentration and the spectrum.

The sample preparation is as follows: take 0.8502 g of potassium hydrogen phthalate, add distilled water to 1 L, and stir until the solute is completely dissolved to obtain a 1 g/L COD stock solution. Based on this stock, COD standard solutions with concentrations of 10–100 mg/L are prepared in sequence. The specific implementation process of the paper is shown below.
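As an illustration, the serial dilution implied by this preparation follows the usual C1·V1 = C2·V2 relation. The 100 mL working volume and the helper name below are hypothetical, not taken from the paper:

```python
# Hypothetical sketch: volume of 1 g/L (1000 mg/L) COD stock needed to
# prepare 100 mL of each 10-100 mg/L standard via C1*V1 = C2*V2.
STOCK_MG_L = 1000.0   # 0.8502 g/L potassium hydrogen phthalate ~ 1 g/L COD
FINAL_ML = 100.0      # assumed working volume per standard

def stock_volume_ml(target_mg_l, final_ml=FINAL_ML, stock_mg_l=STOCK_MG_L):
    """Volume of stock (mL) to dilute to final_ml for the target concentration."""
    return target_mg_l * final_ml / stock_mg_l

# One entry per standard in the 10-100 mg/L series
volumes = {c: stock_volume_ml(c) for c in range(10, 101, 10)}
```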

Collection and treatment of the water samples

The collection of surface water samples is an important step in environmental sciences, which is used to monitor water quality and to comprehend the state of pollution in water bodies. The following are the basic steps in surface water sampling:

  1. Determine the sampling points: firstly, the locations of the sampling points need to be determined so that they are representative of the water quality in the Li River area.
  2. Prepare sampling equipment: Bring appropriate sampling equipment such as sampling bottles and samplers. The equipment should be clean to avoid the contamination of the samples.
  3. Pre-treatment: Before surface water samples are collected, rinse the bottles or samplers several times with flowing field water to minimize possible sample contamination.
  4. Sampling method: At the chosen sampling location, the sampler is promptly immersed in water to prevent contact with other substances and to minimize the risk of airborne contamination. When sampling in water depths of 5 meters or less, samples are typically collected at a depth of 0.5 meters below the surface. For depths ranging from 5 to 10 meters, samples are collected at a depth of 0.5 meters below the surface and 0.5 meters above the bottom.
  5. Number of samples: According to the needs of the study and the requirements of laboratory tests, water samples were collected from seven different sections of the Li River, with about 500 mL to 1 L of water collected at each section.
  6. Marking of samples: At the time of sample collection, each sampling bottle was marked with relevant information, such as the name of the sampling site, date, time, and so on.
  7. Sample preservation: After sampling is completed, ensure that the samples are preserved under appropriate conditions to avoid contamination or degradation of the samples. The water samples will need to be stored at 4 degrees Celsius and sent to the laboratory for spectral analysis as soon as possible.

As shown in Fig 1, in the first step, spectra are obtained using the instrument. In the second step, model inversion research is conducted on Chemical Oxygen Demand (COD) using the Interval Partial Least Squares (IPLS), Support Vector Machine (SVM), and Generative Adversarial Networks (GAN) methods.

Fig 1. Specific implementation flowchart.

(A). The internal structure of the spectrometer. (B). Spectral diagram of water sample. (C). Inversion of chemical oxygen demand model.

https://doi.org/10.1371/journal.pone.0301902.g001

Fig 1A illustrates the internal structure of the spectrometer, while Fig 1B depicts the spectrum of the water sample. The spectrometer is used to collect spectral data, which is then utilized to create a spectrum for qualitative analysis of solution concentration. Fig 1C illustrates the process of qualitative analysis for COD. Initially, six methods are used for data preprocessing. Subsequently, the IPLS algorithm is used to model and predict the preprocessed data, followed by the application of GAN networks to augment data for SVM model construction. Evaluation metrics for the model include Mean Squared Error (MSE), Mean Absolute Error (MAE), and R2score.

Spectral collection and pretreatment

Spectral data collection is susceptible to noise, and therefore, pretreatment is essential. Pretreatment aids in noise elimination and reduces the impact of other factors on the model’s accuracy. As depicted in Fig 2, we have outlined common spectral preprocessing methods. This research considers six pretreatment methods, including Standard Normal Variate (SNV), Multiple Scatter Correction (MSC), Vector Normalization, Savitzky-Golay smoothing filtering (SG), Wavelet Transform (WT), and Standardization methods.

The original spectra are processed using methods to visualize them and establish an IPLS model. Based on the model’s effectiveness, the optimal pretreatment method is selected. The flowchart of pretreatment is presented in Fig 2.
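As a rough sketch of two of these pretreatments, SNV and SG smoothing can be implemented with NumPy and SciPy as follows; the window length, polynomial order, and random demo spectra are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def sg_smooth(spectra, window=11, polyorder=2):
    """Savitzky-Golay smoothing along the wavelength axis."""
    return savgol_filter(spectra, window_length=window,
                         polyorder=polyorder, axis=1)

# Demo on random stand-in "spectra": 5 samples x 160 wavelength points
X = np.random.default_rng(0).random((5, 160))
X_snv = snv(X)
X_sg = sg_smooth(X)
```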

Method

Interval partial least squares algorithm

Before introducing the IPLS algorithm, it is essential to introduce PLS, which is a typical mathematical optimization algorithm used to study the statistical relationship between the dependent variable and the independent variables. It can be employed for regression modeling when the number of sample points is less than the number of variables or when there is severe multicollinearity among the independent variables. Previous research [39, 40] has shown that, compared to other linear models, PLS gives better prediction results in qualitative analysis of UV spectra.

Before using PLS, it is essential to understand its basic principles and advantages. PLS projects the original independent variable data onto the direction of the dependent variable to obtain a new set of independent variables, thereby eliminating the multicollinearity between the independent variables. This can improve the stability and predictive ability of the model. For example, the concentration matrix of COD can be set as the dependent variable, denoted as Y = (y_ij)_{n×m}, while the measured UV spectral absorbance matrix can be set as the independent variable, represented as X = (x_ij)_{n×p}, where n is the number of water samples, m is the number of components, and p is the number of spectral wave points. We decompose X and Y into feature vectors, as shown below.

Y = UQ + G (1)
X = TP + F (2)

As shown above, U is the concentration characteristic factor matrix of n rows and d columns, Q is the d × m order concentration loading matrix; T is the UV absorbance characteristic factor matrix of n rows and d columns, P is the d × p UV absorbance loading matrix; G and F are the n × m concentration residual matrix and n × p UV absorbance residual matrix, respectively.

We decompose Y and X according to the correlation of eigenvectors to build a regression model, as shown below.

U = TB + Fd (3)

Fd is the random error matrix, and B is the d-dimensional diagonal regression coefficient matrix. For a water sample, if the measured UV absorbance vector is x, then the concentration y can be derived from the following equation.

y = xPᵀBQ (4)

The IPLS builds several PLS models over different spectral ranges and evaluates them using three metrics: Mean Square Error (MSE), Mean Absolute Error (MAE) and R2score. The equations of these metrics are shown below, where n denotes the sample size, y_i denotes the true value, ŷ_i denotes the predicted value, and ȳ denotes the mean of the true values.

MSE = (1/n) Σ_{i=1..n} (y_i − ŷ_i)² (5)
MAE = (1/n) Σ_{i=1..n} |y_i − ŷ_i| (6)
R2score = 1 − Σ_{i=1..n} (y_i − ŷ_i)² / Σ_{i=1..n} (y_i − ȳ)² (7)
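The three metrics can be computed directly; a minimal NumPy sketch (the function names are ours, not the paper's):

```python
import numpy as np

def mse(y, y_hat):
    """Mean Square Error, Eq. (5)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean Absolute Error, Eq. (6)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat))

def r2_score(y, y_hat):
    """Coefficient of determination, Eq. (7)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot
```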

IPLS is chosen for its capability to effectively handle collinear spectral data. By dividing the spectral range into intervals, IPLS can capture the nonlinear relationship between spectral variables and chemical properties, thus mitigating the effects of collinearity. Moreover, IPLS allows for the selection of informative spectral intervals, focusing modeling efforts on the most relevant spectral regions. This feature enhances model interpretability and reduces computational complexity, making IPLS a suitable choice for chemical oxygen demand (COD) modeling with spectral data.

SVM regression algorithm

The Support Vector Machine (SVM) Regression Algorithm is typically used in spectral analysis. It is an effective method to construct a nonlinear discriminant model. An introduction to SVM-based spectrum modeling is provided here.

  1. Data acquisition and preprocessing: Spectral data with various compositions is collected, containing reflectance or absorption intensities at multiple wavelengths. The raw spectral data is preprocessed, with steps such as noise removal, baseline correction, and spectral smoothing. The preprocessing aims to enhance the quality and resolvability of the data.
  2. Feature extraction: The features are extracted from the preprocessed spectral data. In spectral modeling, features are usually reflection or absorption values at various wavelengths in the spectrum.
  3. Model construction: The dataset is divided into a training set and a test set. Techniques such as cross-validation are used to ensure the reliability of the model. The appropriate kernel functions for SVM, such as linear, polynomial, or Gaussian kernel, are chosen based on the specific problem. The SVM model is trained using the training set, and the model parameters are adjusted to achieve optimal results. During the training process, it will search for an optimal hyperplane that maximizes the margin between sample points.
  4. Model evaluation and optimization: The trained model is evaluated using a test set, and the model performance is assessed using metrics such as accuracy, Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared score (R2score). The model is optimized based on the evaluation results, such as adjusting hyperparameters, increasing training samples, and employing other methods.

Overall, SVM-based spectral modeling is comprehensive and involves multiple steps, including data acquisition, preprocessing, feature extraction, model construction, evaluation, and optimization. Through these steps, a spectral analysis model is constructed to address practical problems. It is chosen as the modeling algorithm due to its robustness in handling high-dimensional data with limited samples. SVM can effectively model nonlinear relationships between spectral features and COD concentrations while avoiding overfitting, even with a relatively small dataset. Additionally, SVM offers flexibility in kernel selection, allowing the modeling of complex relationships between spectral variables and target variables. This versatility makes SVM suitable for capturing the relationships present in spectral data for COD prediction.
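The modeling steps above can be sketched with scikit-learn's SVR; the synthetic spectra, kernel settings, and train/test split here are illustrative assumptions, not the paper's actual data or tuned parameters:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a selected 16-point spectral interval whose
# absorbance scales with COD concentration.
rng = np.random.default_rng(0)
cod = rng.uniform(10, 100, 120)
X = cod[:, None] * 0.01 + rng.normal(0, 0.05, (120, 16))

# Steps 3-4: split, train with an RBF kernel, evaluate on the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, cod, test_size=0.25,
                                          random_state=0)
model = make_pipeline(StandardScaler(),
                      SVR(kernel="rbf", C=100.0, gamma="scale"))
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # R2score on held-out samples
```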

In this manuscript, the parameters of the SVM model are shown below.

  1. 'kernel': the default kernel function is 'rbf'; depending on the case, we can choose 'linear', 'sigmoid', 'poly', 'precomputed', etc. The kernel can transform a nonlinear problem into a linear one;
  2. 'C': the penalty parameter of C-SVC, whose default value is 1.0; the larger its value, the weaker the generalization ability of the model, and the smaller its value, the stronger the generalization ability;
  3. 'degree': when the kernel is set to 'poly', the degree of the polynomial can be set using the 'degree' parameter, whose default value is 3;
  4. 'gamma': the kernel coefficient for 'rbf', 'poly' and 'sigmoid', whose default value is 'auto';
  5. 'cache_size': the default value is 200, which denotes the size of the kernel function cache in MB.

The selection of proper parameters in the SVM model is essential to model accuracy. Although the parameters can be set empirically, doing so is time-consuming and requires significant effort. Alternatively, GridSearchCV can be employed to identify the optimal combination of hyperparameters, such as 'kernel', 'gamma', 'degree', and 'C'.

GridSearchCV performs a systematic search of the parameter space by evaluating the model's performance. This enables identification of the combination that yields the highest model accuracy. To further enhance the accuracy of the model, cross-validation techniques are employed in combination with GridSearchCV.

Generative adversarial networks

Generative adversarial networks (GAN) are among the most influential network models of recent years and have achieved significant success in computer vision, natural language processing, and other fields. The main principle of GAN is to generate optimal samples by pitting a generator and a discriminator against each other in a game-theoretic setting.

The number of samples influences the training of the model; in this paper, due to the limited number of spectral samples, a GAN is used to generate additional samples for better training.

As depicted in Fig 3, the generator accepts a set of random vectors and is responsible for generating realistic data, and the discriminator is responsible for learning to determine the authenticity of the data. The optimization objective function of the network is shown below.

min_G max_D V(D, G) = E_{x∼Pdata(x)}[log D(x)] + E_{z∼PZ(z)}[log(1 − D(G(z)))] (8)

As shown above, D represents the Discriminator and G represents the Generator, where the real data x follows the Pdata(x) distribution and z represents the noise, which follows the PZ(z) distribution. V(D, G) denotes the degree of difference between real samples and generator samples. max_D V(D, G) denotes maximizing the difference between real and generated samples when the Generator is fixed, and min_G denotes minimizing that difference when the Discriminator is fixed.

The training process of the generator is as follows: when the discriminator is fixed, the generator generates samples for it. At first, due to the discrepancy between the generated and real samples, the discriminator feeds the training losses to the generator. The ultimate objective is to train the generator to produce samples that are indistinguishable from real data, fooling the discriminator into classifying them as real with a high degree of confidence (i.e., close to 1).

min_G V(D, G) = E_{z∼PZ(z)}[log(1 − D(G(z)))] (9)

During the training process of the discriminator, the generator is fixed and the discriminator improves its discrimination capability by continuously comparing real samples with generated samples. Eventually, it attains a high discrimination performance, so that its output for non-real samples is close to 0.

max_D V(D, G) = E_{x∼Pdata(x)}[log D(x)] + E_{z∼PZ(z)}[log(1 − D(G(z)))] (10)

During the training process, both the generator and the discriminator become stronger and gradually reach a balance.
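Both players' objectives reduce to binary cross-entropy terms (the 'binary_crossentropy' loss mentioned later in the training procedure). A minimal NumPy sketch, using the labeling convention of 1 for real and 0 for generated samples:

```python
import numpy as np

def bce(labels, probs, eps=1e-12):
    """Binary cross-entropy, the loss both networks minimize."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(labels * np.log(probs)
                    + (1 - labels) * np.log(1 - probs))

def d_loss(d_real, d_fake):
    """Discriminator loss: real samples labeled 1, generated labeled 0."""
    return bce(np.ones_like(d_real), d_real) + \
           bce(np.zeros_like(d_fake), d_fake)

def g_loss(d_fake):
    """Generator loss: wants D(G(z)) -> 1, so fakes are labeled 1."""
    return bce(np.ones_like(d_fake), d_fake)
```

A discriminator that scores real samples near 1 and fakes near 0 has a small `d_loss`; a generator whose fakes fool the discriminator (scores near 1) has a small `g_loss`.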

When employing a GAN network to generate one-dimensional data, both the generator and discriminator are designed as neural networks optimized for processing one-dimensional data. The following outlines the specific training process:

  1. Data preparation: First, prepare authentic COD one-dimensional data and record the corresponding COD concentrations. The column dimension of the input data is 160, which corresponds to the total number of spectral data sampling points.
  2. Initialize the network: Randomly initialize the weights and biases of the generator and discriminator.
  3. Define the loss function: For generating one-dimensional data, use ’binary_crossentropy’ as the loss function for both the discriminator and generator. Additionally, utilize ’rmsprop’ (Root Mean Square Prop) as the optimizer for both networks. The generator’s loss function aims to ensure that the generated data distribution closely matches the distribution of real data, while the discriminator’s loss function aims to correctly distinguish between real and generated data.
  4. Train the Discriminator: In each training iteration, sample a batch of data from the original real data and generate a batch of fake data using the Generator. Merge two batches and assign labels (1 for real data and 0 for generated data). Then, feed the merged samples into the Discriminator, calculate its loss, and update parameters through backpropagation.
  5. Train the Generator: Generate a batch of fake data from the generator and feed it into the discriminator. The objective here is to have the generated samples misclassified as real data (labeled 1) by the discriminator. Calculate the generator’s loss and update its parameters through backpropagation, improving the generator’s ability to produce realistic samples.
  6. Adversarial training: During the training process, the generator and discriminator confront each other. The generator attempts to produce realistic COD samples to deceive the discriminator, while the discriminator strives to distinguish real data from the generated data. This adversarial training process continues for 500 iterations.
  7. End of training: The training process concludes when a certain number of iterations are reached or when the performance of the generator and discriminator stabilizes.
  8. Data generation: After training is complete, the generator can be employed to generate new one-dimensional data. By sampling from the generator, one can obtain one-dimensional data samples that match the generated model.

In this paper, the GAN network consists of Generators and Discriminators. First, the parameters of the three Generator networks are introduced, as shown in Table 2 below.

As shown in Table 2, all three Generators consist of four layers: Input, Dense1, Dense2, and Dense3. Their network structures are quite similar, so their parameters are presented in a single table. Taking the Output Shape of the Dense1 layer as an example, the values (None, 5/10/20) correspond to (None, 5), (None, 10), and (None, 20) for the three Generators, respectively; that is, the output dimensions after the Dense1 layer of the three Generators are (None, 5), (None, 10), and (None, 20). Likewise, '805/1610/3220' denotes the number of network parameters in the Dense1 layer for the three Generators, which are 805, 1610, and 3220, respectively.

As shown in Table 3, all three Discriminators consist of five layers: Input, Dense1, Dense2, Dropout and Dense3. Their network structures are also quite similar. Taking the Input Shape of the Dense3 layer as an example, the values (None, 5/10/20) correspond to (None, 5), (None, 10), and (None, 20) for the three Discriminators, respectively; that is, the input dimensions to the Dense3 layer of the three Discriminators are (None, 5), (None, 10), and (None, 20). Likewise, '6/11/21' denotes the number of network parameters in the Dense3 layer for the three Discriminators, which are 6, 11, and 21, respectively.

In summary, the Generator model takes a 160-dimensional vector as input, processes it through two hidden layers with several units each and ReLU activation, and then produces a 160-dimensional output vector with each element being the result of applying the hyperbolic tangent (tanh) function. The discriminator model takes a 160-dimensional vector as input and processes it through two hidden layers with several units each and ReLU activation functions. It then applies dropout to the outputs of the second layer, followed by a final dense layer with a sigmoid activation function to produce a single output representing the probability of the input being classified as the positive class in a classification task.
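To check that these shapes and parameter counts hang together, the smallest (width-5) variant can be reproduced as a plain NumPy forward pass. The weight initialization is arbitrary and dropout is omitted, so this is a structural sketch rather than the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def dense(n_in, n_out):
    """Weight matrix and bias for one fully connected layer."""
    return rng.normal(0, 0.05, (n_in, n_out)), np.zeros(n_out)

# Generator: 160 -> 5 -> 5 -> 160, tanh output (width-5 variant of Table 2).
g1, b1 = dense(160, 5); g2, b2 = dense(5, 5); g3, b3 = dense(5, 160)
def generator(x):
    h = relu(x @ g1 + b1)
    h = relu(h @ g2 + b2)
    return np.tanh(h @ g3 + b3)

# Discriminator: 160 -> 5 -> 5 -> 1, sigmoid output; the dropout applied
# after the second hidden layer during training is omitted at inference.
d1, c1 = dense(160, 5); d2, c2 = dense(5, 5); d3, c3 = dense(5, 1)
def discriminator(x):
    h = relu(x @ d1 + c1)
    h = relu(h @ d2 + c2)
    return sigmoid(h @ d3 + c3)

batch = rng.normal(size=(4, 160))
fake = generator(batch)       # shape (4, 160), values in (-1, 1)
probs = discriminator(fake)   # shape (4, 1), values in (0, 1)
```

Note that the Dense1 parameter count 160×5 + 5 = 805 and the Discriminator's Dense3 count 5×1 + 1 = 6 match the '805/1610/3220' and '6/11/21' entries described for the tables.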

GANs are selected for data augmentation to overcome the challenge of limited datasets. They can produce synthetic spectral data that mimic the distribution of real spectral samples. This expands the training dataset and enhances model generalization. Unlike traditional data augmentation methods such as interpolation or oversampling, a GAN can generate diverse and realistic spectral variations, capturing the complexity and variability of real-world spectral data more effectively.

GridSearchCV technique

GridSearchCV is a commonly used technique for parameter tuning in machine learning. It combines cross-validation and grid search to efficiently search for optimal parameters. By specifying a range of parameters to explore, it systematically evaluates the performance of different parameter combinations using cross-validation.

The process begins with the initialization of hyperparameter combinations. Subsequently, an SVM model is established, and the various parameter combinations are sequentially traversed and evaluated for their modeling effectiveness. Each parameter combination is inputted into the SVM model, and the modeling process is completed. Finally, the best-performing parameter combination, which yields the most favorable modeling results, is selected.

In this paper, we utilize GridSearchCV to fine-tune the parameters of the SVM model. By exhaustively searching through all possible combinations within the specified parameter range, we aim to identify the optimal parameter configuration that maximizes the model’s performance in cross-validation. This approach ensures that the chosen parameter values are well-suited to the problem at hand, enhancing the overall effectiveness and reliability of the SVM model.
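A sketch of this tuning procedure with scikit-learn's GridSearchCV; the parameter grid, scoring choice, and synthetic data below are illustrative assumptions, not the paper's actual search space:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the hyperparameters the paper tunes:
# 'kernel', 'C', 'gamma', and 'degree'.
param_grid = {
    "kernel": ["rbf", "poly"],
    "C": [1, 10, 100],
    "gamma": ["scale", "auto"],
    "degree": [2, 3],   # only used by the 'poly' kernel
}

# Synthetic stand-in data: 60 samples of a 16-point spectral interval.
rng = np.random.default_rng(0)
y = rng.uniform(10, 100, 60)
X = y[:, None] * 0.01 + rng.normal(0, 0.05, (60, 16))

# 5-fold cross-validation over all parameter combinations.
search = GridSearchCV(SVR(), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
best_model = search.best_estimator_  # refit on all data with the best params
```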

Results and discussion

Spectral pretreatment

Spectral pretreatment can remove irrelevant information such as noise, and it is useful to analyze the correlation between the spectrum and the COD concentration. The results of pretreatment using various methods are shown in Fig 4 below.

Fig 4 illustrates the effectiveness of pretreatment using the Standard Normal Variate (SNV), Multiple Scattering Correction (MSC), Normalization, Savitzky-Golay Smoothing Filtering (SG), Wavelet Transformation (WAVE) and Standardization methods on the original spectra. The spectra obtained by the SNV, MSC, and Normalization methods are relatively similar, while those obtained by the SG and WAVE methods also exhibit similarities. The spectral range from 190 to 340 nm is plotted on the horizontal axis, and the absorbance values are shown on the vertical axis. Pretreatment enhances the smoothness of the original spectra.

Feature wavelength selection

Interval partial least squares method.

After acquiring the pre-processed data, the selection of the wavelength range is executed using the IPLS method. To determine the optimal spectral range, the 190–350 nm spectral range is divided into ten subintervals of equal width, and a PLS regression is performed on each subinterval, thereby establishing individual regression models. Subsequently, the model exhibiting superior performance is selected. log10 (MSE), log10 (MAE) and R2score are used as metrics of the model.

Table 4 gives the MAE obtained with different pretreatment methods when modeling in different spectral ranges. All six methods achieve their minimum MAE in the seventh range of the spectrum, with the values increasing on both sides. As indicated in the "Original modeling effect" column of the table, the model achieved the best modeling effect in the seventh range without data preprocessing, with a log10 MAE of 0.0278, highlighted in bold in the table. From this log10 MAE value, the final MAE of the original model is found to be 1.0661. Specifically, taking the value of 0.0136 in the 7th row and 3rd column as an example, the original spectrum was first pre-processed using the SG method, which yielded the input for the PLS in the 7th spectral range; this value represents the log10 MAE of that model. The final MAE of the model processed with the SG method is 1.0318. Better results can be attained by modeling the data with a pretreatment method. The data are also visually depicted in Fig 5.

Table 4. Table of MAE corresponding to different pretreatment methods.

https://doi.org/10.1371/journal.pone.0301902.t004

As illustrated in Fig 5, the horizontal axis displays the spectral ranges, numbered 1 to 10, and the vertical axis the MAE in each range. The smallest error is obtained in the 7th range, corresponding to the 238–253 nm wavelength band. Among the pretreatment methods, the SG method yielded better results than the others.

The MSE values resulting from modeling with the different pretreatment methods and spectral ranges are presented in Table 5 below; the values in the table are log10 MSE.

Table 5. Table of MSE corresponding to different pretreatment methods.

https://doi.org/10.1371/journal.pone.0301902.t005

Table 5 gives the MSE obtained with each of the six preprocessing methods when modeling in each spectral range. All six methods achieve their minimum MSE in the seventh range, with the values increasing on either side. As indicated in the "Original modeling effect" column, without data preprocessing the model performs best in the seventh range, with a log10 MSE of 0.2172 (highlighted in bold), corresponding to a final MSE of 1.6489. Taking the value of 0.1681 in the 7th row and 3rd column as an example, the original spectrum was first pre-processed with the SG method and the result used as the input to the PLS model over the 7th spectral range; this value is the log10 MSE of that model.

Our analysis shows significant variation in model error across the spectral intervals: the 10th subinterval yields the largest error and the 7th the smallest. Pretreatment further improves the models. In the 7th range, the final MSE of the original data is 1.6489, while that of the model built with the SG method is 1.4727. The data are also depicted visually in Fig 6.

Fig 6 displays the MSE results. The MSE values are slightly larger than the MAE values, and the two figures show similar trends, differing only numerically. All six preprocessing methods achieve their minimum error in the seventh spectral range, with values increasing on either side. The MSE obtained by modeling in the 7th range is the smallest, and the SG method again outperforms the other pretreatments.

The R2score data are shown in Table 6 below; a "-" marks instances with poor modeling results.

Table 6. Table of R2score corresponding to different pretreatment methods.

https://doi.org/10.1371/journal.pone.0301902.t006

The results in Table 6 show that all six pretreatment methods perform best in the seventh range, as evidenced by the R2score values, which fall off on either side. The correlation modeled in the 7th range is the strongest, while the correlation deteriorates in the other ranges; the model built on the 7th subinterval is therefore optimal. Specifically, the SG method achieves the highest R2score of 0.9944 in the seventh range, closely followed by the original (unpretreated) model with an R2score of 0.9942.

As illustrated in Fig 7, the vertical axis represents the R2score values obtained with the different pretreatment methods across the spectral ranges; values closer to 1 indicate better performance. Among the pretreatment methods, the SG method outperformed the others in the seventh spectral range.

Finally, we evaluated the three indexes to determine the optimal pretreatment method. The results indicate that the SG pretreatment method is the most effective, and that the optimal modeling range is the 7th segment, corresponding to 238~253 nm. A PLS model is established on this spectral data for the subsequent inversion study.

Spectral inversion study

According to the water sample collection criteria in Environmental Monitoring, we gathered seven water samples from a section of the Li River and applied the optimal PLS model in a spectral inversion study. Figs 8–11 depict the spectral lines of the water samples and the COD standard solutions; by comparing these lines, the concentration of each water sample can be determined preliminarily.

Fig 8. Comparison of the first 4 water samples with the standard solution.

https://doi.org/10.1371/journal.pone.0301902.g008

Fig 9. Comparison of water sample 5 and standard solution.

https://doi.org/10.1371/journal.pone.0301902.g009

Fig 10. Comparison of water sample 6 and standard solution.

https://doi.org/10.1371/journal.pone.0301902.g010

Fig 11. Comparison of water sample 7 and standard solution.

https://doi.org/10.1371/journal.pone.0301902.g011

In the first step of our analysis, we compared the absorbance curves of the water samples with those of the standard COD solutions, as illustrated in Fig 8. The spectrogram shows the absorbance of the first four water samples as black, red, green, and blue dashed lines; notably, the spectral lines of these four samples closely resemble one another.

As illustrated in Fig 8, these four water samples show no obvious absorption peaks over the whole spectral range, and preliminary analysis suggests that their COD concentrations are low.

Water sample 5 is analyzed in Fig 9. Its spectral line, depicted as the black dashed line, closely resembles that of the COD standard solution with a concentration of 10 mg/L.

As illustrated in Fig 10, water sample 6 is compared with the standard solutions; its absorbance curve is depicted by the black dashed line. Analysis of the spectrum indicates that the COD of water sample 6 lies between the 10 mg/L and 20 mg/L standard solutions, at a concentration of approximately 15 mg/L.

As illustrated in Fig 11, water sample 7 is compared with the standard solutions; its absorbance curve is depicted by the black dashed line. The COD in water sample 7 is greater than that in water sample 6, and its spectrum most closely resembles that of the COD standard solution at a concentration of approximately 20 mg/L.

Finally, we used the PLS model to predict the COD concentrations of the water samples. The results indicate that the first four water samples had lower COD concentrations, while water samples 5, 6, and 7 measured 10.87 mg/L, 14.88 mg/L, and 19.29 mg/L, respectively. Notably, the model predictions are consistent with the qualitative analysis results.

This consistency offers a way to confirm the model's accuracy and improves the reliability of the modeling in several respects:

  1. Checking the model's correctness: since the spectral figures are examined for qualitative analysis, model outputs that align with those findings suggest a higher degree of accuracy. We can confirm whether the model accurately represents the components in the water sample by contrasting the predicted results with the findings of the qualitative investigation.
  2. Optimizing the chosen features and model parameters: consistency analysis can be used to evaluate how well the selected features and parameters worked. When the model's output agrees with the qualitative analysis, the selected features and parameters are likely appropriate; if there are discrepancies, the feature selection or parameter choices may need to be reconsidered.
  3. Increasing the model's credibility in real-world applications: reliable outcomes strengthen the case for the model and increase its credibility in practice. In contexts such as water-sample prediction, this credibility plays a critical role in enabling decision-making and appropriate action.

Consistency with qualitative analysis results can be viewed as an indicator of model quality and reliability. This approach can be adopted when high precision in the model is not a strict requirement.

SVM-based COD modeling

The original and pre-processed spectral data are fed into the GAN network to generate synthetic data for modeling purposes. Next, we use the SVM model to train on and evaluate against the original data, the generated data, and a mix of both. Through this process, we can verify the feasibility of the GAN in generating synthetic data for modeling.
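The manuscript does not specify the internal architecture of its GANs; purely to make the adversarial training loop concrete, the following is a deliberately tiny, fully linear GAN written in plain NumPy with hand-derived gradients. A real spectral GAN would use deeper networks and a framework such as PyTorch; all sizes and rates below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # tanh form of the logistic function avoids overflow for large |x|
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def train_gan(real, z_dim=8, steps=500, lr=0.05):
    """Linear generator (fake = Z @ Wg + bg) vs. logistic discriminator
    (D(x) = sigmoid(x @ wd + cd)), trained with hand-coded BCE gradients."""
    n, d = real.shape
    Wg = 0.1 * rng.standard_normal((z_dim, d))
    bg = np.zeros(d)
    wd = 0.1 * rng.standard_normal(d)
    cd = 0.0
    for _ in range(steps):
        # --- discriminator step: push D(real) -> 1, D(fake) -> 0 ---
        Z = rng.standard_normal((n, z_dim))
        fake = Z @ Wg + bg
        g_real = sigmoid(real @ wd + cd) - 1.0   # dBCE/dlogit for real labels (= 1)
        g_fake = sigmoid(fake @ wd + cd)         # dBCE/dlogit for fake labels (= 0)
        wd -= lr * (real.T @ g_real + fake.T @ g_fake) / (2 * n)
        cd -= lr * (g_real.mean() + g_fake.mean()) / 2
        # --- generator step: push D(fake) -> 1 ---
        Z = rng.standard_normal((n, z_dim))
        fake = Z @ Wg + bg
        g_logit = sigmoid(fake @ wd + cd) - 1.0  # generator wants label 1
        g_out = np.outer(g_logit, wd)            # chain rule back to the fake spectra
        Wg -= lr * (Z.T @ g_out) / n
        bg -= lr * g_out.mean(axis=0)
    return Wg, bg

# Toy "spectra": 40 samples x 16 wavelengths
real = 1.0 + 0.1 * rng.standard_normal((40, 16))
Wg, bg = train_gan(real)
synthetic = rng.standard_normal((20, 8)) @ Wg + bg  # 20 augmented samples
```

Once trained, the generator can emit arbitrarily many synthetic spectra from fresh noise vectors, which is exactly the augmentation role the GANs play before SVM training in this study.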

The original data, the pre-processed data, and the GAN-generated data are used as datasets for training the SVM; the modeling process is shown below. The initial parameter array of the SVM is set out in Table 7, which lists the kernel functions and their respective parameters. Each kernel includes a 'C' parameter, the penalty parameter of C-SVC, with a default value of 1.0: a lower value strengthens the model's generalization ability, while a higher value fits the training data more tightly at the risk of overfitting. The gamma parameter is the kernel coefficient for 'rbf', 'poly', and 'sigmoid', with a default value of 'auto'.

As previously mentioned, our SVM modeling utilizes four kernel functions: 'rbf', 'sigmoid', 'poly', and 'linear'. The 'gamma' parameter, valid only for 'rbf', 'poly', and 'sigmoid', defaults to '1/n_features', where 'n_features' is the number of sample features; the 'degree' parameter defaults to 3. The evaluation indexes obtained without a parameter search are shown in Table 8.

Table 8 displays the effectiveness of the four kernel functions for SVM modeling. Evaluated on the four indicators, the MSE and MAE of the models are large while the accuracy and correlation are low. This suboptimal SVM performance is likely due to the limited size of the spectral dataset and the absence of parameter tuning. To address this, we apply the GridSearchCV method: initialized with a parameter array for each kernel, it systematically probes the parameter combinations to identify the optimal one, as reported in Table 9.
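A sketch of this search with scikit-learn's GridSearchCV follows. Because the paper reports both regression metrics and accuracy, the formulation is ambiguous; we assume a regression setup (SVR) here, and the feature matrix, labels, and parameter grid are illustrative stand-ins rather than the study's actual values.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))                  # stand-in spectral features
y = X[:, 3] * 2.0 + 0.1 * rng.standard_normal(100)  # stand-in COD values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate kernels and penalty/coefficient values to probe exhaustively
param_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "C": [0.1, 1, 10, 50],
    "gamma": ["scale", "auto"],
}
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_tr, y_tr)

# Evaluate the refitted best estimator on held-out data
pred = search.best_estimator_.predict(X_te)
mse = mean_squared_error(y_te, pred)
mae = mean_absolute_error(y_te, pred)
r2 = r2_score(y_te, pred)
```

GridSearchCV trains one model per parameter combination under cross-validation and refits the best combination on the full training set, which is the workflow described for Table 9.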

Table 9. Table of the optimal parameters obtained from GridSearchCV.

https://doi.org/10.1371/journal.pone.0301902.t009

As indicated in Table 9, GridSearchCV yields the optimal parameters. Parameter lists for each kernel function, such as the 'C' and 'gamma' lists, are supplied; GridSearchCV exhaustively searches these lists, training an SVM model for each combination in order to discover the best one. The optimal parameters are then employed to train the SVM model, leading to enhanced training results.

After obtaining the optimal parameters, they are passed into the SVM for modeling, and the results are shown in Table 10.

Table 10. Effects of modeling using parameters from the GridSearchCV.

https://doi.org/10.1371/journal.pone.0301902.t010

The results in Table 10 demonstrate a significant improvement in model performance after the parameter search, compared with the results in Table 8. With the adjusted parameters, the linear kernel performs best of the four. The 'sigmoid' kernel's accuracy remains lower, with a corresponding 'C' value of 50; notably, such a high 'C' value tends to weaken the model's generalization. For the linear kernel, the correlation and accuracy have increased while the MSE and MAE values have decreased.

A key element of SVM is the kernel function, which enables SVM to perform nonlinear mapping into a high-dimensional space and thereby resolves linear inseparability in the original feature space. Four kernel functions are utilized for modeling in this article; their effects are examined below.

First, the 'rbf' kernel can handle data that is not linearly separable in the original feature space, learning more complex decision boundaries that adapt to a variety of data distributions.

Second, the 'sigmoid' kernel is sensitive to parameter choice and suits scenarios with extremely complex data distributions. The 'poly' kernel is well suited to data exhibiting polynomial relationships in the feature space, adapting to various distributions via the offset and order of the polynomial. The 'linear' kernel performs better when the data relationship is relatively simple and no intricate nonlinear mapping is required. The spectral data in this manuscript form a small, simple distribution with no complex structure in the feature space; the 'linear' kernel is therefore utilized to enhance the modeling results.

Overall, the comprehensive evaluation criteria indicate that the ’linear’ kernel outperforms the other alternatives, making it the preferred choice for subsequent modeling.

After selecting the 'linear' kernel, data are generated using three GAN networks with different structures; the GAN-generated data are aggregated with the original data for training, and the results are shown in Table 11.

Table 11. Effects of modeling after GAN-based data augmentation.

https://doi.org/10.1371/journal.pone.0301902.t011

As shown in Table 11, the original spectral data were blended with the data generated by the three GANs and used for SVM training. Compared with modeling directly on the original data, the MSE and MAE of the models decreased considerably, while the accuracy and R2score increased: the accuracy improved by 2.88%, 11.53%, and 11.53% in turn, and the R2score by 18.07%, 17.40%, and 18.74%.

Because of the small dataset, the number of training samples is limited, only approaching one hundred, and the SVM did not perform well on it; this motivates using a GAN for data augmentation. After augmentation, the modeling effect improves, and generating more data should yield further gains. In conclusion, GAN provides an effective means of data augmentation when a model is trained with little data and upgrades the training effect to a certain extent.
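The augmentation experiment can be sketched as follows. The "synthetic" spectra here are simple jittered copies standing in for GAN output, and all data are illustrative; the key design point is that the held-out test set contains only real samples, so synthetic data never leak into the evaluation.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-ins for ~100 real spectra with COD-like labels
X_real = rng.standard_normal((100, 16))
y_real = X_real[:, 0] + 0.1 * rng.standard_normal(100)

# Hold out a purely real test set before any augmentation
X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, test_size=0.3, random_state=0)

# Placeholder "GAN" output: jittered copies of the real training spectra
X_gen = X_tr + 0.05 * rng.standard_normal(X_tr.shape)
y_gen = y_tr.copy()

baseline = SVR(kernel="linear").fit(X_tr, y_tr)
augmented = SVR(kernel="linear").fit(np.vstack([X_tr, X_gen]),
                                     np.concatenate([y_tr, y_gen]))

mae_base = mean_absolute_error(y_te, baseline.predict(X_te))
mae_aug = mean_absolute_error(y_te, augmented.predict(X_te))
```

Comparing `mae_base` and `mae_aug` on the real-only test set is how an augmentation gain like those in Table 11 would be measured without optimistic bias.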

Conclusion

Spectral collinearity and a limited spectral dataset are two main problems affecting COD modeling. To address them, the IPLS method is first utilized to identify the spectral range for modeling and mitigate the impact of spectral collinearity. Second, we applied six data-pretreatment techniques; according to the results, the model fits best in the 7th range (238~253 nm). Without data pretreatment, the MSE, MAE, and R2score values are 1.6489, 1.0661, and 0.9942, respectively. After pretreatment with the SG method, the R2score rises to 0.9944 and the MSE and MAE drop to 1.4727 and 1.0318, respectively. This suggests that proper data pretreatment is essential for obtaining reliable results in spectroscopic analysis.

Next, we predicted the COD concentrations of the water samples using the best model. The findings show that the first four water samples had lower COD values, while samples 5, 6, and 7 measured 10.87 mg/L, 14.88 mg/L, and 19.29 mg/L, respectively. Notably, the qualitative analysis aligns with the model predictions.

To address the small-dataset problem, we used three different GANs for data augmentation before SVM modeling. The experimental results indicated that the MSE and MAE of the SVM models decreased compared to the original dataset; additionally, the R2score grew by 18.07%, 17.40%, and 18.74%, and the accuracy of the three models improved by 2.88%, 11.53%, and 11.53%.

In summary, IPLS, GAN, and SVM are selected based on their complementary strengths in addressing the challenges of spectral collinearity and limited datasets. IPLS addresses collinearity by emphasizing informative spectral intervals, GAN enhances the dataset with realistic synthetic samples, and SVM efficiently models relationships between spectral features and COD concentrations.

This research still has several limitations despite its progress: the model applies to relatively simple water systems. The sub-models described in the manuscript can nevertheless serve as a basis for future research. Some explicit directions and hypotheses for future studies are as follows.

  1. Investigating other data augmentation methods: alternative approaches such as variational autoencoders, plain autoencoders, or other variants of generative adversarial networks could be explored, and their performance compared on small-sample datasets.
  2. Extending the modeling to other water quality indicators: the proposed model could be applied to indicators such as total phosphorus and ammonia nitrogen, which would confirm the applicability and versatility of the approach.
  3. Examining multimodal data fusion: integrating multimodal data, such as sensor and spectral data, could further enhance the modeling accuracy of water quality indicators. It would be worthwhile to incorporate multiple data types into a single model and evaluate their impact on performance.

Acknowledgments

The acquisition of data and the implementation of experiments were supported by the Drainage Engineering Management Office, Guilin, China, and the Li River Comprehensive Treatment and Ecological Protection Project Headquarters, Guilin, China. We extend our heartfelt gratitude to Mo Bikun, Zhong Yan, and Jiang Yuwei for providing feedback and guidance that further improved our work.
