Optimal experimental conditions for Welan gum production by support vector regression and adaptive genetic algorithm

Welan gum is a novel microbial polysaccharide that is produced during microbial growth and metabolism under various external conditions. It can be used as a thickener, suspending agent, emulsifier, stabilizer, lubricant, film-forming agent and adhesive in agriculture. In recent years, finding optimal experimental conditions to maximize its production has received growing attention. In this work, a hybrid computational method is proposed to optimize the experimental conditions for producing Welan gum, using data collected from experimental records. Support Vector Regression (SVR) is used to model the relationship between Welan gum production and experimental conditions, and an adaptive Genetic Algorithm (AGA) is then applied to search for optimized experimental conditions. As a result, a mathematical model predicting Welan gum production from experimental conditions is obtained, which achieves an accuracy of 88.36%. In addition, a class of optimized experimental conditions is predicted to yield 31.65 g/L of Welan gum. Compared with the best result in chemical experiments, 30.63 g/L, the predicted production is 3.3% higher. The results provide potential optimal experimental conditions for improving the production of Welan gum.


Introduction
Welan gum is a polysaccharide secreted by the Alcaligenes sp. NX-3 strain. It has good stability, ideal thickening properties, good suspension and emulsification properties, and assured safety; owing to its unique shear-thinning behavior, it can be used in oil drilling. Finding optimal experimental conditions to maximize the production of Welan gum has received growing attention, since this can advance its industrial production. In 2014, laboratory fermentation of Welan gum was achieved in [1], where cyperus beans were used as raw material, with protein and hydrolysis as substrate. Subsequently, Bacillus foecalis alkaligenes was used as the starting bacterial strain to optimize the Welan gum production process by the response surface method [2].
[...] search ability and convergence speed. The adaptive Genetic Algorithm we adopt can improve these two aspects to a certain extent. For crossover, the AGA method lets the crossover probability vary over the course of evolution while giving all individuals of the same generation the same crossover probability, which improves the global search ability. For mutation, the AGA method adapts the mutation probability to the fitness value of each individual to be mutated as evolution proceeds.

Support vector regression
Support Vector Machine (SVM), a machine learning method for classification proposed in 1995 [15], has been widely used in biological data processing [16][17][18] and bioinformatics [19][20][21][22][23]. It seeks to minimize the structural risk, that is, the empirical risk together with the confidence interval [24,25], in order to improve the generalization ability of the learning machine, and thus achieves good statistical behavior even when the sample size is small. In its basic form, SVM is a two-class model defined as the linear classifier with the maximum margin in the feature space. The learning strategy of SVM is to maximize the margin, which can be converted into a convex quadratic programming problem. Support Vector Regression (SVR) was developed on the basis of SVM to deal with regression and forecasting problems [26,27]. Some basic concepts of SVR are briefly recalled below.
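Given a set of training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\} \subset \mathbb{R}^n \times \mathbb{R}$, where $x_i$ denotes an input sample, $y_i$ is the corresponding target value and $l$ is the total number of samples, the goal in SVR is to find a function $f(x)$, i.e., an optimal hyperplane, that deviates by at most $\varepsilon$ from the observed target $y_i$ for all the training data while being as flat as possible. The function takes the form

$$f(x) = (w \cdot \Phi(x)) + b, \qquad (1)$$

where $\Phi(\cdot)$ is a nonlinear mapping by which the input data $x$ is mapped into a high-dimensional space $F$, $(\cdot, \cdot)$ denotes the dot product in $F$, $w$ is the weight vector and $b$ is the bias. Eq (1) can be transformed into the following convex constrained optimization problem by introducing the non-negative slack variables $\xi_i$ and $\xi_i^*$ to cope with otherwise infeasible constraints:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad \begin{cases} y_i - (w \cdot \Phi(x_i)) - b \le \varepsilon + \xi_i, \\ (w \cdot \Phi(x_i)) + b - y_i \le \varepsilon + \xi_i^*, \\ \xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, l. \end{cases}$$

Here $C > 0$ is the penalty parameter, and $\xi_i$, $\xi_i^*$ are the slack variables introduced in order to allow a certain error [28][29][30][31][32]. $\varepsilon$ is the parameter of the ε-insensitive loss function and is called the tube size [33]. The greater the value of $C$, the greater the penalty for data points beyond the $\varepsilon$ deviation; $C$ thus determines the balance between the smoothness of the function and the number of sample points beyond the $\varepsilon$ deviation. To solve this convex quadratic programming problem, the Lagrangian function is introduced:

$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) - \sum_{i=1}^{l} \alpha_i \left( \varepsilon + \xi_i - y_i + (w \cdot \Phi(x_i)) + b \right) - \sum_{i=1}^{l} \alpha_i^* \left( \varepsilon + \xi_i^* + y_i - (w \cdot \Phi(x_i)) - b \right) - \sum_{i=1}^{l} (\eta_i \xi_i + \eta_i^* \xi_i^*),$$

where $\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0$ are Lagrange multipliers. The dual optimization problem is obtained as follows:

$$\max_{\alpha, \alpha^*} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) (\Phi(x_i) \cdot \Phi(x_j)) - \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) \quad \text{s.t.} \quad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C,$$

where $\alpha_i$ and $\alpha_i^*$ are the nonnegative Lagrange multipliers obtained by solving the convex quadratic programming problem. By exploiting the Karush-Kuhn-Tucker (KKT) conditions of the primal optimization problem [34][35][36], we get $\alpha_i \alpha_i^* = 0$, which means that either both multipliers $\alpha_i$ and $\alpha_i^*$ equal zero, or exactly one of them is zero and $(\alpha_i - \alpha_i^*)$ is nonzero. The data samples with non-vanishing Lagrange multipliers are called the support vectors; they lie on or outside the boundary of the ε-insensitive tube [33].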
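To make the roles of ε and C concrete, the following minimal Python sketch evaluates the ε-insensitive loss and the resulting regularized objective for a fixed linear model. The function names (`eps_insensitive_loss`, `primal_objective`), the identity feature map and the toy numbers are illustrative assumptions and are not taken from the original study.

```python
def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the eps-tube, growing linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def primal_objective(w, b, X, y, C, eps):
    """0.5 * ||w||^2 + C * (sum of eps-insensitive losses).

    For a fixed (w, b) the smallest feasible slack variables equal these
    losses, so this is the constrained objective evaluated at (w, b).
    Assumes the identity feature map (linear SVR) for illustration.
    """
    flatness = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        pred = sum(wj * xj for wj, xj in zip(w, xi)) + b
        penalty += eps_insensitive_loss(yi, pred, eps)
    return flatness + C * penalty


# Toy data (hypothetical): one feature, three samples.
X = [[0.0], [1.0], [2.0]]
y = [0.05, 1.0, 2.6]
obj_small_c = primal_objective([1.0], 0.0, X, y, C=1.0, eps=0.1)
obj_large_c = primal_objective([1.0], 0.0, X, y, C=10.0, eps=0.1)
```

Only the third sample lies outside the tube, so raising C from 1 to 10 increases the objective solely through that sample's penalty, which illustrates how C trades flatness against tolerance of deviations larger than ε.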
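The regression estimation function obtained by learning is

$$f(x) = \sum_{x_i \in SV} (\alpha_i - \alpha_i^*) K(x_i, x) + b,$$

with the bias computed from the standard support vectors as

$$b = \frac{1}{N_{NSV}} \left\{ \sum_{0 < \alpha_i < C} \left[ y_i - \sum_{x_j \in SV} (\alpha_j - \alpha_j^*) K(x_j, x_i) - \varepsilon \right] + \sum_{0 < \alpha_i^* < C} \left[ y_i - \sum_{x_j \in SV} (\alpha_j - \alpha_j^*) K(x_j, x_i) + \varepsilon \right] \right\},$$

where $SV$ is the set of support vectors, $N_{NSV}$ represents the number of standard support vectors and $K(x_i, x_j)$ is the kernel function. According to Hilbert-Schmidt theory, when the kernel function satisfies Mercer's condition, that is, $\iint K(x, x') g(x) g(x') \, dx \, dx' \ge 0$ for any function $g(x)$ for which $\int_a^b g^2(x) \, dx$ is finite, the value of the kernel equals the dot product of the images $\Phi(x_i)$ and $\Phi(x_j)$ of the two vectors $x_i$ and $x_j$ in the feature space, i.e.,

$$K(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j)).$$

We choose here the Gaussian radial basis function as the kernel function:

$$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right),$$

where $\sigma$ is the kernel parameter.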
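As an illustration of how the kernel enters the regression function, the sketch below evaluates a Gaussian radial basis kernel and a kernel expansion of the form $\sum_i (\alpha_i - \alpha_i^*) K(x_i, x) + b$. It assumes the common parameterization $\exp(-\|x - z\|^2 / (2\sigma^2))$; the function names, support vectors and coefficients are hypothetical and chosen only for demonstration.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svr_predict(x, support_vectors, coeffs, b, sigma):
    """f(x) = sum_i coeffs[i] * K(x_i, x) + b, with coeffs[i] = alpha_i - alpha_i^*."""
    return sum(c * rbf_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, coeffs)) + b


# Hypothetical support vectors and coefficients, for illustration only.
svs = [[0.0, 0.0], [1.0, 1.0]]
coeffs = [0.5, -0.25]
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma=1.0)   # identical points
k_far = rbf_kernel([0.0, 0.0], [10.0, 10.0], sigma=1.0)  # distant points
pred = svr_predict([0.0, 0.0], svs, coeffs, b=0.1, sigma=1.0)
```

The kernel equals 1 for identical points and decays toward 0 with distance, so each support vector influences predictions mainly in its own neighborhood. Libraries in the LIBSVM family parameterize the same kernel as $\exp(-g \|x - z\|^2)$, which corresponds to $g = 1/(2\sigma^2)$.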

Adaptive genetic algorithm
Genetic Algorithm (GA) derives from computer simulation studies of biological systems [37] and has been widely used in function optimization, combinatorial optimization, job shop scheduling [38], complex network clustering and pattern mining [39][40][41]. However, it still has disadvantages, the most obvious being low efficiency and a tendency to fall into local optima [42,43].
In 2000, the adaptive Genetic Algorithm (AGA) [44] was proposed, which improves the performance of the traditional GA to some extent. After that, the adaptive GA was improved by incorporating certain intelligent strategies, including crossover that avoids inbreeding, a crossover probability tied to the generation number, and an adaptively regulated mutation probability [45]. In [45], the crossover probability depends only on the generation number. In its formula, $m_{tmp}$ is an intermediate variable, $T_{Gen}$ is the preset maximum number of generations, $t$ is the current generation ($0 \le t \le T_{Gen}$), $P_{c,max}$ and $P_{c,min}$ are the preset largest and smallest crossover probabilities, and $P_c(t)$ is the crossover probability of the current population. The adaptive mutation probability depends on both the generation number and the individual fitness. In its formula, $P_{m,max}$ and $P_{m,min}$ are the preset largest and smallest mutation probabilities, $f(x_i)$ is the fitness value of individual $x_i$, $f_{max}$ is the maximum fitness value in the current population, and $P_m(t)$ is the mutation probability of individual $x_i$ in the current population [45].
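The general idea of generation-dependent and fitness-dependent probabilities can be sketched in a few lines of Python. The exact formulas of [45] are not reproduced here; the sketch substitutes a simple linear schedule as a stand-in, and the function names, the linear form and the example values are all assumptions made for illustration.

```python
def crossover_probability(t, t_gen, pc_min, pc_max):
    """Hypothetical schedule: decays linearly from pc_max (t = 0) to pc_min (t = t_gen).

    Every individual of generation t receives the same value.
    """
    return pc_max - (pc_max - pc_min) * t / t_gen


def mutation_probability(fitness, f_max, t, t_gen, pm_min, pm_max):
    """Hypothetical schedule: high for unfit individuals early on, low otherwise.

    Assumes 0 <= fitness <= f_max and f_max > 0.
    """
    fitness_gap = 1.0 - fitness / f_max   # 0 for the best individual
    time_left = 1.0 - t / t_gen           # 1 at the start, 0 at the end
    return pm_min + (pm_max - pm_min) * fitness_gap * time_left


# Example values (assumed): 500 generations, bounds 0.6/0.9 and 0.001/0.1.
pc_start = crossover_probability(0, 500, 0.6, 0.9)
pc_end = crossover_probability(500, 500, 0.6, 0.9)
pm_best = mutation_probability(10.0, 10.0, 0, 500, 0.001, 0.1)
pm_worst = mutation_probability(0.0, 10.0, 0, 500, 0.001, 0.1)
```

With this stand-in, the best individual of a generation always mutates with the minimum probability, while poorly performing individuals are perturbed more strongly in early generations, which is the qualitative behavior the adaptive scheme aims for.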

The mathematical model and data experiments
This section starts by selecting candidate variables from the original data and then determines the values of two important model parameters. After that, the mathematical model based on SVR is built to describe the relationship between Welan gum production and experimental conditions. With the model, AGA is applied to find the optimal sample point of the model, which corresponds to a class of potential optimal experimental conditions that maximize the production of Welan gum. The flowchart is shown in Fig 1.
The mathematical model
Data preparation. Before building the mathematical model that describes the relationship between Welan gum production and experimental conditions, the data need to be normalized. SVR mainly deals with nonlinear problems, and when the magnitudes of the sample feature values differ greatly, the results are strongly affected unless the samples are normalized. Besides, normalizing the samples avoids excessively small model weights, which lead to numerical instability, so that the parameter optimization converges faster and the accuracy of the model is improved. The normalization formula used in our method is as follows:

$$y = \frac{(y_{max} - y_{min})(x - x_{min})}{x_{max} - x_{min}} + y_{min},$$

where $x$ is the original data, $y$ is the normalized data, $x_{min}$ and $x_{max}$ are the minimum and maximum of the original data, and $y_{min}$ and $y_{max}$ are the minimum and maximum of the normalized data. The value of $y_{min}$ is set to 0 and the value of $y_{max}$ to 1. The normalized data are shown in Tables 1 and 2. Without loss of generality, all 67 samples collected from Welan gum production experiments are classified according to production into three types: high, medium and low level production.
Specifically, productions between 0 g/L and 5 g/L are low level (8 groups in total); productions between 5 g/L and 20 g/L are medium level (39 groups in total); productions above 20 g/L are high level (20 groups in total). Each time the model data are drawn, the order of the samples within each production level is shuffled randomly. For each level, the first 70% of the data is used as training data and the remaining 30% as testing data.
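The data preparation described above can be sketched as follows: min-max normalization to [0, 1], followed by a level-wise shuffle and 70/30 split. This is a minimal sketch under stated assumptions; the function names, the assignment of the boundary values 5 and 20 g/L, the rounding rule for the 70% cut and the synthetic productions are hypothetical and not taken from the original data.

```python
import random


def min_max_normalize(values, y_min=0.0, y_max=1.0):
    """Map values linearly from [min(values), max(values)] to [y_min, y_max]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:  # constant column: avoid division by zero
        return [y_min for _ in values]
    return [(y_max - y_min) * (v - x_min) / (x_max - x_min) + y_min
            for v in values]


def production_level(production):
    """Low (< 5 g/L), medium (5 to 20 g/L) or high (> 20 g/L).

    Which level the boundary values 5 and 20 belong to is an assumption.
    """
    if production < 5.0:
        return "low"
    if production <= 20.0:
        return "medium"
    return "high"


def stratified_split(samples, train_fraction=0.7, seed=0):
    """Shuffle within each level, then take the first 70% for training.

    samples: list of (features, production) pairs.
    """
    rng = random.Random(seed)
    by_level = {"low": [], "medium": [], "high": []}
    for sample in samples:
        by_level[production_level(sample[1])].append(sample)
    train, test = [], []
    for group in by_level.values():
        rng.shuffle(group)
        cut = round(train_fraction * len(group))  # rounding rule is an assumption
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Synthetic productions matching the reported group sizes (8, 39, 20).
productions = [2.0] * 8 + [12.0] * 39 + [25.0] * 20
samples = [([float(i)], p) for i, p in enumerate(productions)]
train, test = stratified_split(samples)
normalized = min_max_normalize([10.0, 20.0, 40.0])
```

Splitting within each level keeps low, medium and high productions represented in both the training and the testing data, which a plain random split of only 67 samples would not guarantee.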
Before building the mathematical model, it is necessary to determine the values of two parameters, namely the penalty factor (c) and the kernel function parameter (g). Here, the grid search method is used to determine the optimal values of the two parameters. The result is shown in Fig 2. In the contour plot, the two red dotted lines mark the optimal values of the two parameters, and their intersection, the red point in the figure, gives the corresponding value of "CVmse". CVmse is the mean of the squared differences between the predicted and the true values under 5-fold cross validation.
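The grid search over (c, g) with 5-fold cross validation can be sketched as below. To keep the block self-contained, `toy_model` is a deliberately simple stand-in (an RBF-weighted average shrunk by 1/c) and is not an SVR; in practice an SVR implementation would be plugged in at that point. The function names, the synthetic data and the power-of-two grids are assumptions made for illustration.

```python
import itertools
import math
import random


def kfold_indices(n, k, seed=0):
    """Split range(n) into k shuffled folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mse(fit_predict, X, y, k=5):
    """Mean squared prediction error over k held-out folds (CVmse)."""
    sq_errors = []
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        preds = fit_predict(X_tr, y_tr, [X[i] for i in fold])
        sq_errors += [(p - y[i]) ** 2 for p, i in zip(preds, fold)]
    return sum(sq_errors) / len(sq_errors)


def toy_model(c, g):
    """Stand-in for SVR (assumption): RBF-weighted average shrunk by 1/c."""
    def fit_predict(X_tr, y_tr, X_te):
        preds = []
        for x in X_te:
            w = [math.exp(-g * sum((a - b) ** 2 for a, b in zip(x, z)))
                 for z in X_tr]
            preds.append(sum(wi * yi for wi, yi in zip(w, y_tr))
                         / (sum(w) + 1.0 / c))
        return preds
    return fit_predict


def grid_search(X, y, c_grid, g_grid, k=5):
    """Return (best_mse, best_c, best_g) over all (c, g) pairs."""
    return min((cv_mse(toy_model(c, g), X, y, k), c, g)
               for c, g in itertools.product(c_grid, g_grid))


# Synthetic one-dimensional data; grid ranges are hypothetical.
X = [[i / 20.0] for i in range(40)]
y = [math.sin(3.0 * x[0]) for x in X]
c_grid = [2.0 ** e for e in range(-2, 5)]
g_grid = [2.0 ** e for e in range(-2, 7)]
best_mse, best_c, best_g = grid_search(X, y, c_grid, g_grid)
```

The same folds are reused for every (c, g) pair because `kfold_indices` uses a fixed seed, so differences in CVmse reflect the parameters rather than the partition. Plotting CVmse over the grid would give a contour plot of the kind referred to as Fig 2.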
After the parameter values are determined, the training data and testing data are selected by the method described above. The accuracy of the model is measured by the squared correlation coefficient. Figs 3 and 4 show the model's predictions on the testing data and the relative error.
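For reference, the two evaluation quantities mentioned here can be computed as in the following sketch, which takes the squared Pearson correlation coefficient between true and predicted values and the per-sample relative error. The function names and the example productions are hypothetical and not taken from the paper's tables.

```python
def squared_correlation(y_true, y_pred):
    """Square of the Pearson correlation between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


def relative_errors(y_true, y_pred):
    """(predicted - true) / true for each sample; true values must be nonzero."""
    return [(p - t) / t for t, p in zip(y_true, y_pred)]


# Hypothetical productions in g/L.
y_true = [10.0, 20.0, 30.0]
y_pred = [11.0, 19.0, 30.0]
r2 = squared_correlation(y_true, y_pred)
rel = relative_errors(y_true, y_pred)
```

A squared correlation close to 1 indicates that predictions track the measured productions closely, while the relative errors show how far individual predictions deviate in proportion to the measured value.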

Finding optimal experimental conditions by AGA
With the mathematical model constructed, an improved AGA is used to find the experimental conditions for optimal production. The process has the following steps.
Step 1: Initialize the population and encode the individuals. Each sample involves nine variables, so we treat the nine variables as nine genes that make up a chromosome. For example, [glucose, yeast, KH2PO4, MgSO4, fluid volume, pH value, temperature, rotational speed, inoculation amount] is encoded as $[x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9]$.
Step 2: Select good individuals based on their fitness values.
Step 3: Perform the crossover operation. Starting from the first individual in the population, the crossover probability of the individual is calculated, denoted as cross_rate. A random number between 0 and 1 is generated, denoted as rand_num. If rand_num is less than cross_rate, crossover is performed on the individual: two integers between 1 and 9 are randomly generated, where the smaller number is the starting position of the crossed segment and the larger number is the ending position, and the genes of the individual are exchanged with those of the next adjacent individual over the range from the starting position to the ending position. If the i-th individual did not undergo crossover, the process is repeated for the (i+1)-th individual; if it did, the process is repeated for the (i+2)-th individual.
Step 4: Perform the mutation operation. Starting from the first individual in the population, the mutation probability of the individual is calculated, denoted as mutate_rate. A random number between 0 and 1 is generated, denoted as rand_num. If rand_num is less than mutate_rate, mutation is performed on the individual: an integer between 1 and 9 is randomly generated as the location of the gene to be mutated, and the gene at that location is regenerated.
Step 5: The new individuals generated by the above operations constitute the new population; go to Step 2.
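Steps 3 and 4 above can be sketched as two passes over the population. This is a minimal sketch under stated assumptions: the probability hooks `cross_rate_fn` and `mutate_rate_fn`, the gene bounds, the uniform regeneration of a mutated gene and the handling of the last individual (which has no next neighbor) are hypothetical choices not specified in the text.

```python
import random


def crossover_pass(population, cross_rate_fn, rng):
    """Step 3: segment exchange between adjacent individuals.

    population: list of chromosomes (lists of genes), modified in place.
    cross_rate_fn(i): crossover probability of individual i (hypothetical hook).
    """
    n_genes = len(population[0])
    i = 0
    while i + 1 < len(population):  # last individual has no partner (assumption)
        if rng.random() < cross_rate_fn(i):
            a, b = sorted(rng.randint(1, n_genes) for _ in range(2))
            # Positions are 1-based in the text; slice [a-1:b] covers a..b.
            segment = population[i][a - 1:b]
            population[i][a - 1:b] = population[i + 1][a - 1:b]
            population[i + 1][a - 1:b] = segment
            i += 2  # both partners were used, continue with individual i+2
        else:
            i += 1
    return population


def mutation_pass(population, mutate_rate_fn, bounds, rng):
    """Step 4: regenerate one randomly chosen gene within its bounds."""
    for i, chrom in enumerate(population):
        if rng.random() < mutate_rate_fn(i):
            pos = rng.randint(1, len(chrom))  # 1-based gene position
            low, high = bounds[pos - 1]
            chrom[pos - 1] = rng.uniform(low, high)
    return population


rng = random.Random(0)
bounds = [(0.0, 1.0)] * 9  # nine normalized genes (assumption)
pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(6)]
before = sorted(g for chrom in pop for g in chrom)
crossover_pass(pop, lambda i: 0.9, rng)
after = sorted(g for chrom in pop for g in chrom)
mutation_pass(pop, lambda i: 0.1, bounds, rng)
```

Crossover only exchanges genes between two chromosomes, so the multiset of gene values in the population is unchanged by it (`before == after`); new gene values enter the population only through mutation.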
These steps are repeated until the optimal individual is found. The size of the initial population is set to 300 individuals and the number of iterations to 500. The selection operator is the roulette selection method, also known as the proportional selection operator. The basic idea is that the probability of each individual being selected is proportional to its fitness value:

$$P(x_i) = \frac{f(x_i)}{\sum_{j=1}^{K} f(x_j)},$$

where $P(x_i)$ is the selection probability of individual $x_i$, $f(x_i)$ is its fitness value and $K$ is the population size. The value of parameter $P_{c,min}$ is set to 0.6, $P_{c,max}$ to 0.9, $P_{m,max}$ to 0.1 and $P_{m,min}$ to 0.001. The search results are shown in Fig 5. To improve the accuracy and further reduce the range of the nine gene variables, we made the following changes by observing the gene variables of samples with productions higher than 30 g/L, which is [...]
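Roulette (fitness-proportional) selection as described above can be sketched with the Python standard library. The function names and the example fitness values are hypothetical; the sketch assumes all fitness values are positive, as is the case when fitness is a predicted production.

```python
import random


def selection_probabilities(fitness):
    """P(x_i) = f(x_i) / sum_j f(x_j); requires a positive total fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(population, fitness, n_select, rng):
    """Draw n_select individuals with replacement, proportionally to fitness."""
    return rng.choices(population, weights=fitness, k=n_select)


# Hypothetical individuals and fitness values (e.g., predicted production in g/L).
population = ["a", "b", "c", "d"]
fitness = [10.0, 20.0, 30.0, 40.0]
probs = selection_probabilities(fitness)
rng = random.Random(0)
picked = roulette_select(population, fitness, n_select=10000, rng=rng)
share_d = picked.count("d") / len(picked)
```

Over many draws, the share of each individual among the selected ones approaches its selection probability, so fitter individuals are favored while less fit ones still retain a chance of being selected.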

Results
The accuracy of the established mathematical model is 88.36%. The optimal medium composition ratio is shown in Table 3. The maximum predicted production of Welan gum is 31.65 g/L. This hybrid computational method, which combines SVM and AGA, has intelligent learning ability and can overcome the limitations of large-scale biotic experiments [46][47][48][49][50][51]. A mathematical model predicting the production of Welan gum from experimental conditions is obtained with an accuracy of 88.36%, and a class of optimized experimental conditions is designed to produce 31.65 g/L of Welan gum. Compared with the best result in chemical experiments, 30.63 g/L, the predicted production is 3.3% higher.

Conclusion
We focused on building a mathematical model of Welan gum production, taking the nine factors that make up the experimental conditions as the indicators to be optimized. The nine factors are glucose, yeast, KH2PO4, MgSO4, fluid volume, pH value, temperature, rotational speed and inoculation amount. A hybrid computational method combining SVM and AGA is proposed. By training on sample data, a mathematical model predicting the production of Welan gum from experimental conditions is obtained. We find the optimal sample point in the sample space, i.e., a class of optimized experimental conditions. This hybrid computational method has good learning ability, which avoids the high cost of large-scale biological experiments. It also overcomes the premature convergence defect of the traditional Genetic Algorithm. The result provides potential experimental conditions, obtained by data mining, for improving the production of Welan gum in the lab.
For further research, neural-like computing models, e.g., spiking neural P systems [52], can be used for the optimization of Welan gum production. In addition, some recently developed data processing and mining methods, such as the speculative approach to spatial-temporal efficiency for multi-objective optimization in cloud data and computing [53], privacy-preserving smart similarity search methods based on simhash over encrypted data in cloud computing [53], the k-degree anonymity with vertex and edge modification algorithm [54], and kernel quaternion principal component analysis for object recognition [55], might be used for optimizing the experimental conditions of Welan gum production. For data preparation, decision trees [56] can be used to deal with missing attribute values of some samples in the dataset.