Quantitative Structure Activity Relationship Models for the Antioxidant Activity of Polysaccharides

In this study, quantitative structure activity relationship (QSAR) models for the antioxidant activity of polysaccharides were developed with 50% effective concentration (EC50) as the dependent variable. To establish optimum QSAR models, multiple linear regressions (MLR), support vector machines (SVM) and artificial neural networks (ANN) were used, and 11 molecular descriptors were selected. The optimum QSAR model for predicting EC50 of DPPH-scavenging activity consisted of four major descriptors. MLR model gave EC50 = 0.033Ara-0.041GalA-0.03GlcA-0.025PC+0.484, and MLR fitted the training set with R = 0.807. ANN model gave the improvement of training set (R = 0.96, RMSE = 0.018) and test set (R = 0.933, RMSE = 0.055) which indicated that it was more accurately than SVM and MLR models for predicting the DPPH-scavenging activity of polysaccharides. 67 compounds were used for predicting EC50 of the hydroxyl radicals scavenging activity of polysaccharides. MLR model gave EC50 = 0.12PC+0.083Fuc+0.013Rha-0.02UA+0.372. A comparison of results from models indicated that ANN model (R = 0.944, RMSE = 0.119) was also the best one for predicting the hydroxyl radicals scavenging activity of polysaccharides. MLR and ANN models showed that Ara and GalA appeared critical in determining EC50 of DPPH-scavenging activity, and Fuc, Rha, uronic acid and protein content had a great effect on the hydroxyl radicals scavenging activity of polysaccharides. The antioxidant activity of polysaccharide usually was high in MW range of 4000–100000, and the antioxidant activity could be affected simultaneously by other polysaccharide properties, such as uronic acid and Ara.


Introduction
In our normal metabolism process, oxygen free radicals and non-oxygen free radicals are continuously produced, and lower concentrations of free radical can play a crucial role in regular physiological functions [1][2][3][4][5]. However, many diseases, such as cardiovascular diseases, diabetes, aging and cancer, can be conducted by unregulated overproduction of free radicals [6][7][8].
Thus, it is essential to develop natural and effective antioxidants [9]. Previously reports revealed that many natural polysaccharides possess potent scavenging activities of free radicals and can be used as potential antioxidants [10][11]. It is always impossible to obtain a large quantity of experimental data because of a lack of perfect data sites, and so the study on relationship between bioactivities and the properties of polysaccharides by model forecast approach was relatively poor [12].
The quantitative structure-activity relationship (QSAR) model, which use relevant molecular physico-chemical properties to predict important treatment responses, is considered as an alternative to the experimental evaluation [13]. It has gained increasingly attention, and a variety of QSAR methods have been developed for water treatment process selection, membrane separation and adsorption etc [14][15].
To date, QSAR models for predicting the bioactivities of polysaccharides have seldom been developed. A study reported the relationship between monosaccharide composition ratio and macrophage stimulatory activity by model forecast approach [12]. To obtain theoretical supports for applications of polysaccharides from natural products, the main aim of this work was to establish reliable soft measurement models to predict performance and study the relationship between polysaccharide properties and antioxidant activities of polysaccharides by QSAR. In our QSAR studies, multiple linear regression (MLR) method, and the nonlinear methods including artificial neural network (ANN) and support vector machine (SVM) were used.

Data set
The present study showed that the antioxidant activity of polysaccharide has related with many factors, including monosaccharide composition [16], uronic acid (UA), molecular weight (MW), protein content (PC) and sulfate group content et al [17]. In the data selection, we chose natural purified polysaccharides without sulfate groups to study QSAR models for predicting antioxidant activities of polysaccharides. A various set of polysaccharides and their antioxidant activities were collected from different published papers . Antioxidant activities of polysaccharides were represented by the 50% effective concentration (EC 50 ). To set up a more reliable model, we selected 141 compounds. The detailed publication lists with corresponding antioxidant activities and compounds were given. The normalization process was adopted in the distribution of the parameters with 2 as the bottom of the log logarithm, and MW was divided by 10000 in the normalization process.
In models, a training data set was applied to develop the model. A test set, which was never included during their development, was used to validate the predictive power of model [46][47]. The training set and test set were chosen by random distribution.

Descriptors
The structure of polysaccharide was complex and could be represented by variety of descriptors. However, the major composition of polysaccharide was monosaccharide joined together by glycosidic bonds, which was essential to their bioactivities, so we used monosaccharide composition as descriptors. The following descriptors of monosaccharide composition were considered for modeling EC 50 values in MLR, ANN and SVM analysis. Descriptors of monosaccharide composition: rhamnose (Rha), arabinose (Ara), mannose (Man), glucose (Glc), galactose (Gal), fucose (Fuc), xylose (Xyl), ribose (Rib), glucuronic acid (GlcA) and galacturonic acid (GalA). Usually, gas chromatography (GC) and high-performance liquid chromatography (HPLC) were performed for the identification and quantification of monosaccharide composition. For HPLC analysis, glucuronic acid (GlcA) and galacturonic acid (GalA) could not be identified. Thus, total uronic acid (UA) could be determined by other methods, such as the sulfuric acid carbazole method, and then UA was also used as a descriptor in our models. The descriptors of PC and MW were also adopted in models. STATISTIC.10 method was used to establish SVM, MLP and ANN models, and the picture was drawn by using RStudio (Version 0.99.902-2009-2016 RStudio, Inc.).

Linear model generation
There were primarily two different approaches for choosing a descriptor subset in MLR, and they were filter and wrapper methods. The procedure of filter method was that setting and filtering descriptors were supposed to generate the top priority subset before training. However, the learning algorithm was wrapped into the selection procedure in the wrapper method [48]. In MLR, we used wrapper as the target learning algorithm. The training data set was applied only for selecting descriptor. At first, we employed a two-dimensional research method. It was a combination of forward and backward search. Then we assessed the selected descriptors on the target learning algorithm. In the learning process, we used 10 fold cross validation method. In stepwise MLR analysis, we selected training descriptor sets and then established a linear model [49].

Artificial neural network and Support vector machines
It was appropriate for artificial neural network (ANN) to model nonlinear relationship. We can find many reviews about ANN research and its application in QSAR studies [49][50][51]. In this study, we employed multi-layer perceptron (MLP) [52] and three layer reverse Back-Propagation (BP) network. In the back-propagation ANN, we utilized the technique of supervised learning, and the trained network was trained by minimizing the squared error of the network's output. The first step of training model was to confirm the number of layers and neurons in each player. The second step was to optimize the learning rate as well as momentum parameters. In the input layer, the architecture of the network was composed of eleven neurons, which were the eleven relative descriptors chosen. In the output layer, there was one neuron, i.e. EC 50 values of the antioxidant activity. In all the layers, logistic function was applied. In the hidden layer, through changing the number of neurons, we got the lowest RMSE and highest correlation coefficient. We applied 30% of the training data set for verification. The verification was employed to hinder from the over fitting. All of optimization process were taken with 10 fold cross validation [53]. Support vector machines (SVM) was originally developed for the classification problem, and SVM has been used to solve nonlinear regression estimation. Nowadays, SVM has demonstrated much success in QSAR and quantitative structure-property relationship (QSPR) studies [54][55][56][57]. We selected support vector machine classifier method (epsilon-SVM) which was most commonly used in QSPR and QSAR studies to optimize the value of kernel parameter g (gamma) [53].

Validation techniques and model performance evaluation
We used a 10 fold cross validation technique. This procedure divided the data set into 10 folds or groups, created the model using 9 of the sets, and tested it on the remaining group. When the procedure was repeated, each of the 10 groups had served as a test group. The root mean square error (RMSE) was calculated, averaged, and then used to evaluate the predictive performance of three models.

Models for the DPPH scavenging activity of polysaccharides
The data was divided into two parts using random classification. One was the training set, the other was the test set. The entire data set including 74 compounds was divided into two clusters. The test set of 22 compounds was chosen randomly from this cluster, and the remaining compounds were used as the train set. Compound number 4,5,7,9,17,20,25,30,31,32,33,34,36,38,44,54,57,62,63,64,69,73 were selected as the test set, and the rest of the compounds were the train set. The test set and train set were given in Table 1. The data distribution of parameter was shown in Fig 1, the data distribution was uniform, and no other single variable values was close to EC 50 values distribution (-6, 2). The shape of data distribution from EC 50 and Ara was similar, which indicated that there was a certain relation between them. In addition to MW, other physical quantities were all the components of polysaccharides, so MW was used to establish the model by itself.

MLR results
In this study, the training data set of 52 compounds was used. A stepwise linear regression analysis was used to determine the relationship between the dependent variable of EC 50 and the independent variables of uronic acid (UA), protein content (PC) and monosaccharide compositions (Rha, Ara, Man, Glc, Gal, Fuc, Xyl, GlcA and GalA). To achieve this goal, regression analysis was implemented by using the forward stepwise. In stepwise regression procedures, the first was to choose the most correlated independent variable, and then to select independent variable which was most correlated with the remaining variance in the dependent variable. This procedure was to increase the additional independent variable with R-squared (R 2 ) which was not changing until a significance of at least 80%. Accordingly, the variables of Ara, GalA, GlcA and PC were included in the regression model. The relationship between the matrix of parameters and EC 50 was shown in Fig 2. One variable data was used as the abscissa, another variable data was used as ordinate, and all points had been portrayed by the matrix scatter plot. From the diagonal we can see that the distribution of the data was all similar in shape. Fig 3 showed the correlation between model parameters and EC 50 , and the proportion of Ara, GalA and GlcA accounted 0.51, 0.39 and 0.35, respectively, which indicated that they had the most effect on EC 50 . In Fig 3, we can see that EC 50 had a positive correlation with Ara and PC, and it has negative correlation with GalA and GlcA, which was consistent with the model given in equation. The regression Eq 1, which could be obtained through the statistical analysis, was as follows. Because the effect of UA on EC 50 was little, UA was not added to the model equation. The linear model selected four major relevant descriptors, and gave a stable model with R = 0.807 and RMSE = 0.423. EC 50 ¼ 0:033Ara À 0:041GalA À 0:03GlcA À 0:025PC þ 0:484 In the model, R value was 0.807 (p <0.001), fit indicators of the model were acceptable, the model was coincided with the data structure, and Ara, GalA, GlcA, PC and EC 50 were significant correlation. The predicted EC 50 values of the training and test set by using the MLR equation were given in the Table 2. Predicted values and experimental values of EC 50 in two sets of data were plotted and shown in Fig 4. Most of the data were distributed from 0 to 1.5, and there were some predicted and negative values existing in the left lower corner. The experimental values of these negative values were between 0 and 0.2, which could be accepted. Experimental values and predicted points were distributed in two sides of the curve fitting, and most point of test set distributed among the prediction set, which illustrated that the establishment of training set used for the multiple regression model was very good to predict the numerical value of test set. The above linear model was applied to predict the 22 test data set, and these test data were never used in model building. The result showed R = 0.872, RMSE = 0.361 and p = 1.245E -7 , which showed that there was a significant correlation. Multiple linear regressions (MLR) established the relationship between the dependent variable of EC 50 and the independent variable of polysaccharide properties. The results showed that the statistics for MLR equation were good, and it also offered some views about the polysaccharide properties influences on DPPH-scavenging activity of polysaccharides.

ANN results
Polysaccharide properties were considered as the input layer node in neural networks, and EC 50 values of the DPPH-scavenging activity was the output layer node. Numbers of nodes had a great influence on the test results. The optimization was done with 10 fold cross validation, and 30% of test data were used for validation. Selected parameters of the number of neurons in the hidden layer were optimized by changing from 4 to 14, and it was worthy to mention that the initial value of 7 selected was optimal. The selected network adopted Broyden Fletcher Goldfard Shanno (BFGS) algorithm which was still seen as the best Quasi-Newton algorithm. When the entire training data was trained in the network with the optimized parameters, it gave R = 0.96 and RMSE = 0.018. The experimental and predicted values of EC 50 for the train data using the ANN model were plotted and shown in Fig 5. The experimental value was abscissa, the point distribution of the prediction value for the y-coordinate was on both sides of the curve fitting from 0 to 1.5, and the point distribution was uniform and closed to each other. According to the view of point, the density of horizontal and vertical coordinates and the fitting effect were perfect. The predicted values of EC 50 for the train and test data were given in the Table 2. The test set was used for prediction and gave R = 0.933 and RMSE = 0.055.

SVM results
We selected radial basis function (RBF) kernel for function modeling in SVM, the best parameter C, g and ε were selected by using 10 fold cross validation, a SVM model was obtained by training the whole training set, and then the model was used for the test set. By varying the parameter values in the training set systematically, we optimized SVM parameters, and calculated RMSE of the model. The parameter value which gave the lowest RMSE was selected. The regularization parameter C controlled the alternate use between maximizing the margin and minimizing the training error. If the value of C was too small, then there was not sufficient stress on fitting the training data. To have a stable learning procedure, a large value of C should be set up first [57]. To discover an optimal value of C, the RMSE of SVM model with different C values was calculated. Then, this value C = 9 was selected as the optimal value. We achieved the selected parameters (g = 0.091, ε = 0.1, C = 9) and the final training running in the whole training set, and EC 50 of the DPPH-scavenging activity was predicted. The predicted EC 50 on the basis of this model was plotted and shown in Fig 6 and Table 2. The statistical parameters

Comparison of MLR, ANN and SVM models
The statistical parameters obtained from the investigative models for train and test set were shown in Table 3. The error estimates were applied to model performance evaluation, and RMSE were lower for nonlinear models (SVM, ANN) generated by the machine learning methods than that by multiple linear regression. The correlation coefficients (R) given by SVM and ANN models were also higher than that by multiple linear regression. The above results indicated that the performances of nonlinear models SVM and ANN were better than that of a linear MLR model for the prediction of DPPH-scavenging activity of polysaccharides. The comparison of the nonlinear models demonstrated that ANN model accurately predicted the relationship between polysaccharide properties and the DPPH-scavenging activity for the train Quantitative Structure Activity Relationship Models for the Antioxidant Activity of Polysaccharides data set, and this was obviously evident from a lower RMSE (0.018) and a higher R (0.96) value. While ANN model was also the best one in the prediction of the test set.  (Table 4), MW was normalized before the analysis, the size of MW was taken with a base-8 of log, and the data was shown in Table 4 [58][59][60][61][62][63][64][65][66].
We used EC 50 values as the horizontal coordinate and established the correlation between EC 50 and MW. As shown in Fig 7, the value of EC 50 decreased with the decrease of MW, which indicated that the smaller MW could have the stronger DPPH free radical scavenging activity. This result was in accord with those reported in the literature [20,59]. In Fig 7, it could also be found that there were some points which did not conform to the rules, such as TYAP-3 and BSFP-1. BSFP-1 had the smaller MW and a relatively larger EC 50 value [60], which may be because BSFP-1 had no UA. TYAP-3 had larger MW, but its EC 50 value was smaller. The reason may be that the content of Ara accounted for 45.82% in TYAP-3 [25] . Fig 7 showed that when the value of EC 50 arranged from 0 to 2, the value of Y axis was 0-5.5, which indicated that MW was between 4000 and 100000.  According to the above results, we could conclude that the antioxidant activity of polysaccharide usually was higher in MW range of 4000-100000. However, MW was not the only factor, and the antioxidant activity could be affected by other polysaccharide properties, such as UA and Ara.
We selected five relevant descriptors in MLR model, and a stable model EC 50 = 0.12PC+-0.083Fuc+0.013Rha-0.02UA+0.372 (R = 0.664, RMSE = 1.149, F = 8.268, p<5.17E -5 ) was given. According to the model, PC, Fuc, Rha and UA had significant correlation with EC 50 of the hydroxyl radicals scavenging activity, and the relevant correlation coefficient was shown in Table 6.
The statistical parameters of MLR, ANN and SVM models for the train set and the test set were shown in Table 7. According to a lower RMSE and a higher R value, the results indicated that nonlinear model ANN was better than models obtained from MLR and SVM for the prediction of hydroxyl radicals scavenging activity of polysaccharides.

Sensitivity analysis from ANN
According to two ANN models, the results of sensitivity analysis were shown in Table 8. The higher sensitivity coefficient indicated that this descriptor had the more influence upon the antioxidant activity of polysaccharides. The results indicated that Ara and GalA had a great effect on DPPH-scavenging activity, and PC, UA and GalA had a great effect on hydroxyl radicals scavenging activity of polysaccharides, which was consistent with the results from MLR.

Conclusions
To establish quantitative structure-activity relationship (QSAR) models for antioxidant activity of polysaccharides, MLR, SVM and ANN methods were used, and polysaccharide properties (UA, PC, monosaccharide compositions, MW) as descriptors were selected. MLR models for predicting EC 50 of DPPH-scavenging activity and hydroxyl radicals scavenging activity of polysaccharides consisted of four major descriptors, and the models were EC 50 = 0.033Ara-0.041GalA-0.03GlcA-0.025PC +0.484 and EC 50 = 0.12PC +0.083Fuc +0.013Rha -0.02UA +0.372, respectively. A comparison of results from models indicated that the ANN model with R = 0.96 and RMSE = 0.018 predicted more accurately the DPPH-scavenging activity of polysaccharides than SVM and MLR models. ANN model (R = 0.933, RMSE = 0.055) was also the best one for predicting the hydroxyl radicals scavenging activity of polysaccharides. According to MLR and ANN models, Ara and GalA were most critical in determining the DPPH-scavenging activity of polysaccharides, and PC, UA and GalA had a great effect on hydroxyl radicals scavenging activity of polysaccharides. The polysaccharide of MW 4000-100000 usually owned higher DPPH-scavenging activity, but the antioxidant activity could simultaneously be affected by other polysaccharide properties. These results may provide some new insights in the complex study of polysaccharide structure and bioactivities, and we can simply predict the antioxidant activity of polysaccharide by using the established models after determining the monosaccharide composition ratios and MW.
It is worth noting that the highly GalA-containing polysaccharide could exhibit significantly antioxidant activity, which might be because they owned the functional group-COOH. It has been reported that the functional groups such as-COOH, CH 3 CO-and-SH were generally recognized as good electron or hydrogen donors that might be related to the antioxidant activity of polysaccharides [5]. The antioxidant activity of polysaccharide was also found to correlate to complex structure such as glycosidic linkages, branch ratios, and microstructure etc, polysaccharide properties is not enough for fine detailed structure of polysaccharide, and the research on more precise structure-function relationships remained to be explored.