Discrimination and prediction of the origin of Chinese and Korean soybeans using Fourier transform infrared spectrometry (FT-IR) with multivariate statistical analysis

The ability to determine the origin of soybeans is an important issue following the inclusion of this information in the labeling of agricultural food products becoming mandatory in South Korea in 2017. This study was carried out to construct a prediction model for discriminating Chinese and Korean soybeans using Fourier-transform infrared (FT-IR) spectroscopy and multivariate statistical analysis. The optimal prediction models for discriminating soybean samples were obtained by selecting appropriate scaling methods, normalization methods, variable influence on projection (VIP) cutoff values, and wave-number regions. The factors for constructing the optimal partial-least-squares regression (PLSR) prediction model were using second derivatives, vector normalization, unit variance scaling, and the 4000–400 cm–1 region (excluding water vapor and carbon dioxide). The PLSR model for discriminating Chinese and Korean soybean samples had the best predictability when a VIP cutoff value was not applied. When Chinese soybean samples were identified, a PLSR model that has the lowest root-mean-square error of the prediction value was obtained using a VIP cutoff value of 1.5. The optimal PLSR prediction model for discriminating Korean soybean samples was also obtained using a VIP cutoff value of 1.5. This is the first study that has combined FT-IR spectroscopy with normalization methods, VIP cutoff values, and selected wave-number regions for discriminating Chinese and Korean soybeans.


Introduction
The soybean (Glycine max) is a useful plant crop with high lipid and protein contents [1]. Soybeans can be used to produce soybean oil, as a protein source, or as a good source of nutrients. They are also pharmacologically active, with these effects originating from their constituent PLOS  isoflavones [2]. The beneficial health effects of soybean isoflavones include reducing the risks of cardiovascular problems [3,4], cancer [5][6][7], and osteoporosis [8,9]. In Korea, soybeans are cooked and used to prepare foodstuffs such as doenjang (fermented soybean paste), cheonggukjang (fast-fermented soybean paste), and gochujang (fermented red pepper paste) [10]. Soybeans are frequently used in Korean cuisine. However, there are many cases where the country of origin of the beans is unclear, and relatively inexpensive foreign soybeans are often imported and labeled as Korean soybeans. The National Agricultural Products Quality Management Service introduced an agricultural food country-of-origin labeling system in 1991 to protect domestic agricultural producers and consumers [11]. Soybeans have been included in that system since 2017, and merchants must now indicate the origin of any soybeans that they advertise for sale [12]. This situation means that technology for discriminating Chinese and Korean soybean is needed.
The quality of soybeans depends on several factors such as their variety and where they were cultivated, and these factors must be considered when determining where particular soybeans originate from. However, it is difficult to consider all soybean varieties because there are hundreds of varieties spread over a vast area [13]. We assumed that soybeans cultivated for thousands of years within a particular region would have become well adapted to the local environmental conditions, and hence that the soybeans could be discriminated based on geographical factors rather than varietal differences.
Metabolomics can be used to discriminate genetic and environmental differences based on the comprehensive profiling and analysis of plant metabolites [14]. This can be implemented using established tools such as gas chromatography/mass spectrometry, nuclear magnetic resonance (NMR) spectroscopy, liquid chromatography/mass spectrometry, Fourier-transform infrared (FT-IR) spectroscopy, and direct-infusion mass spectrometry [15]. These tools can be used to discriminate the geographical origin of plants. For example, a method employing a socalled electronic nose and combined gas chromatography/mass spectrometry/olfactometry with principal-components analysis has been used to discriminate the geographical origin of chrysanthemum flower teas [16]. 1 H-NMR spectroscopy has been combined with statistical analysis to discriminate the geographical origin of Chinese, Indian, and Korean sesame oils [17]. Four different geographical origins of Lycium barbarum fruit (China, Mongolia, and two locations in Tibet) were discriminated using liquid chromatography coupled with quadrupole time-of-flight mass spectrometry for metabolite profiling [18]. Near-infrared reflectance (NIR) spectroscopy has been used to discriminate Korean soybeans from soybeans of various origins [19].
We chose FT-IR spectroscopy for the present study because it is a fast, convenient, and nondestructive analytical tool. These characteristics make FT-IR spectroscopy suitable for the rapid identification of foods and agricultural products [20]. However, the physical characteristics of samples (particle size and thickness) affect the obtained FT-IR spectra [21,22], and so the obtained raw data need to be normalized. Four normalization methods can be applied to FT-IR spectral data: area normalization, amide normalization, minimum-maximum (minmax) normalization, and vector normalization. Constructing more-precise prediction models for the discrimination of Chinese and Korean soybeans requires suitable normalization and scaling methods to be determined, and then a prediction model selected by comparing the predictive power of each cutoff for the variable influence on projection (VIP).
NMR spectroscopy has previously been used to discriminate between soybeans originating from China and Korea [23], while NIR spectroscopy was used to discriminate Korean soybeans and soybeans of various origins [19]. However, the present study is the first to investigate a prediction model that can discriminate between Chinese and Korean soybeans using recording the FT-IR spectra. The OMNIC program (version 8.2.0.387, Thermo Scientific, Waltham, Massachusetts, USA) was used to obtain all of the FT-IR spectra. Sixty-four scans were recorded in order to obtain average analytical results and enhance the signal-to-noise ratio. Each spectrum was scanned between 4000 and 400 cm -1 and had a spectral resolution of 4 cm -1 .
The following four normalization methods that are widely used in FT-IR spectroscopy analysis were used to process the FT-IR spectra: area normalization, min-max normalization, amide normalization, and vector normalization. In vector normalization, all spectra are converted from transmittance to absorbance, and the FT-IR absorbance spectra were converted into first and second derivatives using the Savitzky-Golay derivative with nine smoothing points in OMNIC. For vector normalization, the absorbance values of FT-IR spectral data were divided by the Euclidean norm to calculate the vector normalization value. For the other normalization processes, all spectra were converted from transmittance to absorbance, and then ATR correction was applied using OMNIC. For area normalization, each absorbance value at a specific wave number was divided by the total (integrated) absorbance area of the spectrum. For min-max normalization, each absorbance value was divided by the difference between the highest and lowest absorbance values. For amide normalization, each absorbance value was divided by the difference between the highest amide band and the lowest absorbance value.

Multivariate statistical analysis
After the FT-IR spectral data had been normalized, we used the SIMCA-P+ software (version 13.0, Umetrics, Umeå, Sweden) to carry out multivariate statistical analysis. Partial-leastsquares discriminant analysis (PLS-DA), partial-least-squares regression (PLSR), and hierarchical cluster analysis (HCA) were conducted using SIMCA-P+. Both the single linkage method and Ward's clustering method were employed to carry out HCA. Cross-validation and permutation tests were applied to the PLS-DA and PLSR models. Cross-validation was performed to evaluate the predictability of the models and to prevent overfitting. The models were evaluated using the R 2 Y and Q 2 Y parameters as obtained by cross-validation. Permutation tests were conducted 20 times using SIMCA-P+. Permutation test parameters such as the R 2 Y and Q 2 Y intercepts were obtained to evaluate the statistical significance of the models.

Results and discussion
Band assignment of the FT-IR spectra FT-IR spectral data were obtained for each soybean sample. A representative FT-IR spectrum -from the sample from Inner Mongolia Autonomous Region province in China-is shown in Fig 4, which contained 12 noticeable bands that could be assigned as follows (Table 1): 1. One at 3304 cm -1 due to N-H protein stretching [24].
5. One at 1745 cm -1 due to C = O stretching of lipids [27]. 6. One at 1645 cm -1 due to C-O and C-N protein stretching [24]. This is known as the amide I band and is the main amide band.
7. One at 1538 cm -1 due to C-N stretching and N-H bending modes of protein. This is known as the amide II band [24].
9. One at 1398 cm -1 due to CH 3 bending of protein and COO − symmetric stretching of fatty acids and amino acids [25,28].
10. One at 1239 cm -1 , which is the amide III band that contains contributions from PO 2asymmetric stretching [28].
11. One at 1155 cm -1 due to CO-O-C asymmetric stretching of cholesterol ester and C-O stretching of oligosaccharides and triacylglycerols [25,29].
In addition to the bands arising from soybean components, three bands arising from the environment were detected. The tiny band between 4000 and 3500 cm -1 is attributable to water-vapor O-H stretching, and the other two bands correspond to carbon dioxide: O-C-O stretching at 2442-2208 cm -1 and O-C-O bending at 914-400 cm -1 [31]. As listed in Table 1 and shown in Fig 1, the peaks associated with lipids (2925, 2854, 1745, and 1456 cm -1 ) and proteins (3304, 1645, 1538, and 1239 cm -1 ) could be clearly discriminated.

Determination of normalization and scaling methods
To determine the optimal normalization and scaling methods, permutation tests were carried out using two components. The normalization methods used were area normalization, amide normalization, min-max normalization, and vector normalization. Two types of scaling methods were employed: unit variance (UV) and Pareto scaling.  The permutation parameters of the PLS-DA models are listed in Table 2. The R 2 Y and Q 2 Y values, which indicate the model fit and predictability, respectively, range between 0 and 1.0. A PLS-DA model with a high R 2 Y value is regarded as providing a good fit to the data. A Q 2 Y value from 0.5 to 0.9 indicates good predictability, while one greater than 0.9 is considered to indicate excellent predictability. Permutation tests were performed to obtain the R 2 Y and Q 2 Y intercepts. The models were regarded as valid when the R 2 Y and Q 2 Y intercepts were less than 0.4 and 0.05, respectively [32].
The optimal PLS-DA models were selected after comparing the R 2 Y and Q 2 Y values. The optimal normalization and scaling methods for the model involved applying vector normalization after the second differentiation and UV scaling methods. The results are presented in Table 2, which indicates that this procedure yielded highest R 2 Y and Q 2 Y values of 0.938 and 0.912, respectively, for the comparison between Chinese and Korean soybean samples, of 0.747 and 0.701 for the comparison of Chinese soybean samples, and of 0.809 and 0.771 for the comparison of Korean soybean samples. Table 2 indicates that both the R 2 Y and Q 2 Y values were highest when using the vector normalization method, which is possibly due to the derivative process used in vector normalization revealing minute differences between similar spectra [33]. This hypothesis is supported by the use of second derivatives allowing better discrimination of the minute differences in the FT-IR spectra compared to using first derivatives.

Development of a PLSR model for determining the origin of soybeans using appropriate wave-number selection
PLSR can be employed to construct a prediction model for the origin of soybeans. Soybeanorigin PLSR models were developed by applying suitable vector normalization after a second differentiation and UV scaling, and using two components. Apart from normalization, scaling methods, and the number of components, the VIP cutoff value was used to establish the mostprecise prediction model. Training sets (six replicates) and a test set (one replicate) were prepared to construct the PLSR models. Both sets were used to obtain root-mean-square error (RMSE) values, including the root-mean-square error of estimation (RMSEE) and the rootmean-square error of prediction (RMSEP). RMSEE can be obtained from training sets, and its value is used to evaluate the accuracy of a PLSR model. RMSEP, which can be obtained from the test set, is employed to assess the predictability of PLSR models. These RMSE values range from 0 to 1, with smaller values indicating higher model accuracy and predictability.
Because FT-IR spectra may be affected by environmental factors such as water vapor and carbon dioxide, PLSR models constructed using different wave-number regions were compared to identify the best prediction model. Three wave-number regions were used to obtain prediction models: 4000-400 cm -1 , 4000-400 cm -1 excluding the water vapor and carbon dioxide regions, and 2000-400 cm -1 . As listed in Tables 3-5, numerous VIP cutoff values were used to select better prediction models based on the RMSEP values. The permutation parameters of the PLSR models for comparing Chinese and Korean soybeans are listed in Table 3, while those for comparisons of Chinese soybeans and of Korean soybeans are listed in Tables 4 and 5, respectively.
The PLSR models were compared to identify the PLSR models that satisfied the R 2 Y and Q 2 Y intercepts and had the lowest RMSEP values. The FT-IR spectral region between 4000 and 400 cm -1 that excluded the water vapor and carbon dioxide regions was the best. The PLSR model that did not apply a VIP cutoff value was selected for the prediction model presented in Table 3 for discriminating Chinese and Korean soybeans because it had the smallest RMSEP value (= 0.120). Table 4 indicates that the PLSR model with a VIP cutoff value of 1.5 was the optimal prediction model for discriminating Chinese soybeans, having an RMSEP value of 0.293, while Table 5 indicates that the PLSR model for discriminating Korean soybeans had the lowest RMSEP value of 0.170 for a VIP cutoff value of 1.5.
HCA dendrograms were constructed to evaluate the similarity of the samples using the optimal PLSR models for discriminating the soybean samples. As shown in Fig 5A, the Chinese and Korean soybean samples could be clearly discriminated using the single linkage Discrimination and prediction of the origin of soybeans using FT-IR with multivariate statistical analysis method. Fig 5B shows that the soybean samples from the northeastern and eastern provinces of Chinese were clustered in the same clade using Ward's method, whereas those from the southeastern provinces comprised the other clade. As shown in Fig 5C, three regions (upper, right side, and left side) were clustered using Ward's method. Because the soybeans from the right-and left-side provinces appeared to be similar, the Korean provinces could be simply divided into upper and lower provinces. This result suggests that Chinese and Korean soybean samples can be discriminated by latitude-dependent climatic factors without consideration of the plant variety.

Practical application of the PLSR model for predicting the origin of soybeans
The results presented in Tables 3-5 indicate that it is not only possible to discriminate between Chinese and Korean soybeans but also to identify the region in which soybeans have been cultivated. There is a wide diversity of soybean varieties used in China and Korea, but the present results indicate that it is possible to determine the origin of soybeans without considering their variety.
Our results indicate that it is possible to discriminate where soybeans originate from because they reflect regional characteristics. The soybean samples from China could be divided Table 4. List of permutation parameters of the PLSR models obtained using variables selected by vector normalization applied after the second differentiation, UV scaling, and with various VIP cutoff values using different wavenumber areas for the comparison of the three groups of Chinese provinces. Discrimination and prediction of the origin of soybeans using FT-IR with multivariate statistical analysis into those from the northeastern provinces (Neimenggu, Heilongjiang, Jilin, and Liaoning), Huang-Huai-Hai (Hebei, Shandong, and Anhui), Yangtze River (Hubei), and the southeastern provinces (Zhejiang, Jiangxi, Fujian, Guangdong, and Guangxi). If Huang-Huai-Hai and the Yangtze River region are considered to be the same province (due to their geographical proximity), the separations of the Chinese provinces are highly consistent with the predictions based on dividing the soybean regions into the northeastern, eastern, and southeastern provinces. The samples from South Korea were divided into those from the central provinces (Gyeonggi-do, Gangwon-do, Chungcheongbuk-do, and Chungcheongnam-do), Honam provinces (Jeollabuk-do and Jeollanam-do), and Youngnam provinces (Gyeongsangbuk-do and Gyeongsangnam-do). The results in Table 5 indicate that it was possible to separate three provinces (upper, left side, and right side) if Chungcheongnam-do (a central province) was grouped with Honam province. The flow chart in Fig 6 shows a method for discriminating the country of origin using prediction models when identifying unknown soybean samples. It is unclear where some of the soybeans available in Korean markets originate from, often because they are substituted by cheaper Chinese soybeans. The present results indicate that the flow chart in Fig 6 can be used to verify the origin of any suspect soybeans. Moreover, in addition to discriminating between Chinese and Korean soybeans, it is possible to discriminate between various production Discrimination and prediction of the origin of soybeans using FT-IR with multivariate statistical analysis regions in a single country. Our flow chart can be applied to identify the original location of cultivation and detect the adulteration of the cultivation origin of soybeans.

Conclusion
In this study we investigated whether FT-IR spectroscopy analysis can be combined with multivariate statistical analysis to predict the country of origin of soybean samples. This is the first study to discriminate the origin of soybeans using various factors including scaling methods, normalization methods, VIP cutoff, and wave-number region. These particular factors were selected since they allow the origin of soybeans to be determined easily and precisely. Our experimental results showed that this method could discriminate not only the country of origin but also the region of production within a country. The best PLSR prediction models for discriminating the origins employed UV scaling, vector normalization (second derivative), and the wave-number region from 4000 to 400 cm -1 excluding the water vapor and carbon dioxide regions. The PLSR prediction model for discriminating the country of origin (Chinese vs. Korean soybeans) was more precise when a VIP cutoff was not used. When the PLSR prediction models were constructed using a VIP cutoff within a single country, a VIP cutoff value of 1.5 was found to be optimal for discriminating the origin of soybeans. Various soybean varieties and landraces are provided and grown worldwide according to the demands of both growers and consumers. Soybean cultivars reportedly have a short market life; for example, 54% of the cultivars submitted to the Varietal Information Program for Soybeans (the program supported by the Illinois Soybean Association of the US) are new [34]. In addition, various types of soybean seed are utilized in the production of products such as meal, tofu, soymilk, and edamame, and these seeds can exhibit various differences such as in their texture, color, and hilum characteristics. It is also thought that soybean germ plasm has been exchanged internationally. Therefore, PLSR models for predicting or differentiating soybean Discrimination and prediction of the origin of soybeans using FT-IR with multivariate statistical analysis samples should be updated regularly (at least every 4-5 years) by sampling and analyzing the available samples using FT-IR spectroscopy. We suggest the application of additional objective criteria for the differentiation of various soybean seeds (varieties and landraces), such as the basic and novel protocols for differentiation and prediction as used in this study based on the optimization of preprocessing methods using FT-IR spectroscopy. Discrimination and prediction of the origin of soybeans using FT-IR with multivariate statistical analysis The practical application of these methods will require further studies using soybean samples from other countries. Once soybeans from many countries have been investigated, it might be possible to discriminate the countries of origin of unidentified soybean samples by using FT-IR spectroscopy combined with multivariate statistical analysis.
Supporting information S1 Table. The provinces, cities, and geographic coordinates of soybean samples harvested in 2016 from Republic of Korea and China. (DOCX)