Artificial Intelligence versus Statistical Modeling and Optimization of Cholesterol Oxidase Production by using Streptomyces Sp.

Cholesterol oxidase (COD) is a bi-functional FAD-containing oxidoreductase that catalyzes the oxidation of cholesterol into 4-cholesten-3-one. The wide biological functions and clinical applications of COD have prompted the screening, isolation and characterization of new microbes from diverse habitats as sources of COD, as well as the optimization and over-production of COD for various uses. The practicability of statistical and artificial intelligence techniques, namely response surface methodology (RSM), artificial neural network (ANN) and genetic algorithm (GA), was tested for optimizing the medium composition for COD production by the novel strain Streptomyces sp. NCIM 5500. All experiments were performed according to a five-factor central composite design (CCD), and the generated data were analysed using RSM and ANN. GA was employed to optimize the models generated by RSM and ANN. Based on the predicted COD concentration, the model developed with ANN was found to be superior to the model developed with RSM. The RSM-GA approach predicted a maximum of 6.283 U/mL COD production, whereas the ANN-GA approach predicted a maximum of 9.93 U/mL. The optimum concentrations of the medium variables predicted through the ANN-GA approach were: 1.431 g/50 mL soybean, 1.389 g/50 mL maltose, 0.029 g/50 mL MgSO4, 0.45 g/50 mL NaCl and 2.235 mL/50 mL glycerol. The experimentally obtained COD concentration (9.75 U/mL) agreed with the GA-predicted yield and was more than two times higher than the yield (4.2 U/mL) obtained with the un-optimized medium. This is the first report of statistical versus artificial intelligence based modeling and optimization of COD production by Streptomyces sp. NCIM 5500.


Introduction
The production of metabolites by microbial strains is strongly affected by the process parameters and medium components. Fermentation processes are generally multi-variable, and optimization of the medium components is a cumbersome task. The conventional one-factor-at-a-time (OFAT) approach is time-consuming and often incapable of reaching the true optimum because of complex interactions among the factors [1]. Statistical or mathematical designs are therefore used to reduce the number of experiments and to increase the precision of the results. Response surface methodology (RSM) is a combination of mathematical and statistical techniques that is generally used for modeling and analysing problems associated with multivariable systems. It is based on design of experiments (DOE) for the development of models, estimation of the model coefficients and prediction of the response for optimum conditions [2,3]. RSM estimates the relationship between the responses (i.e., product yield) and the experimental parameters (i.e., concentrations of the medium components). It adjusts the concentrations of the medium components to shift the product yield (response) in a certain direction to reach the required optimum. RSM has been successfully applied to the optimization of medium components for metabolite production [1,4] and of culture parameters in bioprocess engineering [5][6][7]. Despite its successful use in various processes, RSM has some limitations. With more than six or seven variables, the number of interaction terms increases, which adds to the complexity of the study and challenges the practical feasibility of the method [8]. In addition, RSM fails to precisely describe an objective function [9].
Artificial neural networks (ANNs) are complex mathematical models that mimic biological neural networks. ANNs have been used for optimization and prediction purposes and are often preferred over regression models for noisy data. They have been used to model and optimize highly nonlinear and complex biological processes [10][11][12][13][14][15][16][17][18]. Mathematical models generated by RSM or ANNs can be optimized more precisely by using mathematical tools such as the Nelder-Mead simplex method or a genetic algorithm (GA). GA is an optimization tool that can be used even when a complete model of the process is unavailable. It is based on Darwin's principle of evolution and uses genetic operators such as selection, mutation and crossover to find the optimum solution of a problem. In a microbial metabolite production process, the media compositions are represented as genomes or chromosomes, and the factors to be optimized, i.e., the levels of the medium constituents, are represented as genes [19]. The chromosomes with high productivity are selected and replicated in proportion to their productivity. GA selects individuals at random from the current population and uses them to produce the next generation. Over successive generations, the population "evolves" toward an optimal solution.
Cholesterol oxidase (COD; cholesterol: oxygen oxidoreductase, EC 1.1.3.6), a bi-functional FAD-containing enzyme, belongs to the family of oxidoreductases and catalyzes the oxidation of cholesterol into 4-cholesten-3-one in the presence of O2 and the isomerization of 4-cholesten-3-one into Δ4-3-ketosteroid [20]. COD has gained great importance because of its broad application in clinical laboratories for the determination of serum cholesterol and its use as a biocatalyst for the production of various steroids; it has also been implicated in the manifestation of some bacterial and viral diseases.
These biotechnological applications of COD have warranted the screening, isolation and characterization of new microbes from diverse habitats as sources of COD, as well as the optimization of microbial COD production at commercial scale [20,21]. This study attempts to determine the quantitative effects of five medium components (soybean meal, glycerol, maltose, sodium chloride and magnesium sulphate) on COD production by Streptomyces sp. NCIM 5500 using statistical response surface methodology and an artificial intelligence technique, followed by optimization using a genetic algorithm. COD production by Streptomyces sp. NCIM 5500 was previously studied in different production media, viz. cholesterol enrichment medium, MGYP medium, X-medium and YMG medium [22]. Cholesterol enrichment medium and X-medium were found to give the best COD production [22]. To keep production cost-effective, the soybean meal based X-medium was selected for the production and optimization of COD in the present study.

Microbial strain and fermentation conditions
The COD-producing microbial strain was isolated from a pre-treated soil sample collected from agricultural fields of Northern India, as reported earlier [22]. The strain was characterized on the basis of 16S rRNA homology (Gene Ombio Technologies, Pune, India) [22]. The seed flask was prepared by inoculating, with a loopful of slant culture, a medium composed of 0.5 g/L MgSO4·7H2O, 0.5 g/L (NH4)2HPO4, 3 g/L NaCl, 1 g/L K2HPO4, 10 g/L soybean meal, 3 g/L CaCO3 and 15 mL glycerol. The culture was incubated at 28°C for 48 h at 180 rpm. A two percent (v/v) inoculum was used to inoculate the production medium, which had the same composition as the seed medium described above. For enzyme production, the flasks were incubated at 28°C for 96 h at 180 rpm.

Enzyme assay and protein estimation
The culture broth was centrifuged at 10,000 rpm for 15 min at 4°C and the supernatant was used as the source of COD. The enzymatic activity of COD was assayed by Allain's method, based on the conversion of cholesterol into 4-cholesten-3-one [23]. For the assay, a 3.03 mL reaction mixture was prepared comprising 94 mM potassium phosphate, 0.35% Triton X-100, 3.4 mM taurocholic acid, 0.9 mM cholesterol, 19.8 mM phenol, 1.5 mM 4-aminoantipyrine and 19 units of horseradish peroxidase (HRP) isolated from horseradish root (Armoracia rusticana). The reaction mixture was incubated at 37°C for 5 min and then boiled for 5 min in a water bath to stop the reaction. The reaction mixture was cooled to room temperature and the absorbance was measured at 500 nm. One unit of COD is defined as the amount of enzyme required to produce 1 μmol of 4-cholesten-3-one per min under the test conditions. Total protein concentration in the broth was determined by Lowry's method using bovine serum albumin (BSA) as the standard [24].

Selection of effective medium components
The production medium with the highest productivity was selected by comparing COD production in different media [22]. The soybean meal based X-medium was chosen for all further experiments on the enhancement of COD concentration [22]. Classical approaches, i.e., removal, supplementation and replacement experiments, were performed using the OFAT methodology to select the effective medium components for COD production [1]. All experiments were performed in triplicate and the average values were used for the calculations.

Modeling and optimization of medium for COD production
Response surface models are multivariable polynomial models, mostly used to determine a set of variables that optimizes a response (i.e., COD concentration in this study). Five medium components, viz. soybean, glycerol, maltose, MgSO4 and NaCl, were selected to generate the model for response optimization. A circumscribed central composite design (CCD) was used to study the interaction effects between these variables. The uncoded and coded values of the variables at the five levels of the CCD are summarized in Table 1. For the five variables, a thirty-six run CCD containing sixteen factorial points, ten star (axial) points and ten centre points was generated by using the ccdesign function of the Statistics Toolbox of MATLAB 7.10.0 (R2010a) (MathWorks Inc., USA). The activity of COD was estimated for each experimental run. A quadratic response surface model was generated and its polynomial coefficients were calculated using the Statistics Toolbox of MATLAB. The experimental results were fitted to the quadratic equation (Eq 1) with the regstats function of the Statistics Toolbox to determine the coefficients of the equation and to obtain an optimum response surface model.
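The run count of the design can be checked with a short sketch. The following Python code is an illustration only, not the MATLAB ccdesign call used in the study: the function name central_composite_design is hypothetical, the half-fraction generator (fifth factor equal to the product of the first four) and the rotatable axial distance are assumptions, and the actual coded levels are those of Table 1.

```python
import itertools

def central_composite_design(n_factors=5, n_center=10):
    """Build a circumscribed CCD in coded units with a half-fraction core.

    Assumptions (not taken from the paper): the last factor of the factorial
    core is generated as the product of the others, and the axial distance
    is the rotatable value alpha = (number of factorial runs) ** 0.25.
    """
    # 2^(k-1) factorial core: k-1 free factors, last factor from the generator
    core = []
    for levels in itertools.product((-1.0, 1.0), repeat=n_factors - 1):
        last = 1.0
        for value in levels:
            last *= value
        core.append(list(levels) + [last])

    alpha = len(core) ** 0.25

    # Two star (axial) points per factor at -alpha and +alpha
    axial = []
    for i in range(n_factors):
        for sign in (-1.0, 1.0):
            point = [0.0] * n_factors
            point[i] = sign * alpha
            axial.append(point)

    # Replicated centre points at coded zero
    center = [[0.0] * n_factors for _ in range(n_center)]
    return core + axial + center, alpha

design, alpha = central_composite_design()
print(len(design), "runs, axial distance", alpha)
```

With five factors this gives 16 factorial, 10 axial and 10 centre runs, which matches the thirty-six runs reported above.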
$$Y = a_0 + \sum_{i=1}^{5} a_i X_i + \sum_{i=1}^{5} a_{ii} X_i^2 + \sum_{i<j} a_{ij} X_i X_j \qquad (1)$$
where Y is the predicted response, a_0 is the intercept coefficient, a_i X_i are the linear terms, a_ij X_i X_j are the interaction terms and a_ii X_i^2 are the square terms. Additionally, ANN was used to model the effect of the five media components on enzyme activity. Different architectures of feed-forward neural network were designed and trained using the Neural Network Toolbox of MATLAB. Different combinations of transfer functions were tested for the input and hidden layers, while the output-layer neurons used the 'purelin' transfer function. The networks were trained with a training data-set comprising 30 experimental runs (24 training runs and 6 test runs). Training was done with three algorithms, viz. gradient descent, gradient descent with adaptive learning rate and the Levenberg-Marquardt algorithm, using the MATLAB functions traingd, traingda and trainlm, respectively. The trained network models were simulated and validated for precision using a validation data set (experimental data that were not used for training).
The models generated through RSM and ANN were further optimized by employing genetic algorithm ga function of MATLAB. The input parameters of 'ga' function were as follows: Pop-

Selection of effective medium components
The soybean meal based X-medium was selected for the production and optimization studies of COD [22]. With the un-optimized production medium, the COD concentration was 4.2 U/mL. To enhance COD production, single-dimension optimization experiments were carried out. The removal experiments showed that removing soybean meal, glycerol, MgSO4 or NaCl caused a drastic decrease in COD yield (Fig 1). Further, in the carbon and nitrogen supplementation and replacement experiments (Table 2), ammonium ion showed a strong inhibitory action on COD production, whereas maltose had a positive effect; hence maltose was included in the statistical medium optimization studies [22].

Generation of response surface regression model for COD production
After fitting the experimental results to the quadratic equation (Eq 1), RSM yielded the response surface model (Eq 2), where Y is the response (i.e., enzyme concentration in U/mL) and X1, X2, X3, X4 and X5 are the coded values of the test variables soybean, glycerol, maltose, MgSO4 and NaCl, respectively. The goodness of fit of the model is expressed by the determination coefficient (R2 = 0.920067), which indicates that the second order polynomial model (Eq 2) fits the experimental data and can explain 92.01% of the variation in the results. The determination coefficient thus provides a measure of how precisely the model predicts the outcome of the experiments. The high value of the correlation coefficient (R = 0.959201, the square root of R2) indicates a strong correlation between the experimental and the predicted values. The statistical significance of the second order
Response surface plots (Fig 2) obtained from MATLAB are functions of two variables at a time, with the remaining variables held at fixed levels (central values, representing the zero level in coded units). Response plots are quite effective in explaining the individual as well as the interaction effects of the independent variables (in this case medium components) on the dependent variable (enzyme concentration, represented as enzyme activity) [26]. The dark red regions in each response surface plot represent the regions where maximum enzyme production was observed. Soybean and glycerol have an overall weak negative effect on enzyme production. Soybean and maltose appear to have a weak positive interaction effect. Soybean and MgSO4 show a negative interaction effect: increasing both of them together adversely affects enzyme production. Soybean and NaCl show a strong positive interaction effect. Glycerol and maltose also show a weak positive interaction effect. An interesting observation is the very strong negative interaction effect of glycerol and MgSO4 on enzyme production. This may be attributed to their specific negative individual effects, which multiply when these medium components are increased together.
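The fitting and slicing steps behind such plots can be sketched as follows. This is a minimal illustration in Python, not the MATLAB routine used in the study: the helper names (quadratic_terms, fit_quadratic, surface_slice) are hypothetical, the coefficients and data points are synthetic and randomly generated, and the fit solves the normal equations directly rather than calling a statistics library.

```python
import itertools
import random

def quadratic_terms(x):
    """Expand one coded point into [1, linear, interaction, square] terms."""
    terms = [1.0] + list(x)
    terms += [x[i] * x[j] for i, j in itertools.combinations(range(len(x)), 2)]
    terms += [v * v for v in x]
    return terms

def solve_linear(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - partial) / m[r][r]
    return beta

def fit_quadratic(points, responses):
    """Least-squares fit through the normal equations (X'X) beta = X'y."""
    rows = [quadratic_terms(p) for p in points]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, responses)) for i in range(k)]
    return solve_linear(xtx, xty)

def surface_slice(beta, i, j, grid, n_factors=5):
    """Predicted response over factors i and j, other factors at coded zero."""
    surface = []
    for vi in grid:
        row = []
        for vj in grid:
            x = [0.0] * n_factors
            x[i], x[j] = vi, vj
            row.append(sum(b * t for b, t in zip(beta, quadratic_terms(x))))
        surface.append(row)
    return surface

# Synthetic check: recover made-up coefficients from noise-free data
random.seed(0)
true_beta = [random.uniform(-1, 1) for _ in range(21)]
points = [[random.uniform(-2, 2) for _ in range(5)] for _ in range(60)]
responses = [sum(b * t for b, t in zip(true_beta, quadratic_terms(p)))
             for p in points]
beta = fit_quadratic(points, responses)
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
slice_01 = surface_slice(beta, 0, 1, grid)  # factors 1 and 2, rest at zero
print(len(beta), "coefficients;", len(slice_01), "x", len(slice_01[0]), "grid")
```

For five factors the quadratic model has 21 coefficients (1 intercept, 5 linear, 10 interaction and 5 square terms), which is why a 36-run design leaves enough degrees of freedom to estimate them.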

Generation of ANN regression model for COD production
A three-layered feed-forward back-propagation neural network, with five neurons in the input layer and fifteen neurons in the hidden layer, a hyperbolic tangent sigmoid transfer function for the hidden layer and a linear transfer function for both the input and output layers, was found to be the most efficient and was saved (Fig 3). The Levenberg-Marquardt (LM) training algorithm was found to be the most accurate and the fastest of the three algorithms. The model generated by applying the LM algorithm is given as Eq 3.

$$\text{Enzyme activity} = \mathrm{purelin}\big(LW \cdot \mathrm{tansig}(IW \cdot X + b) + a\big) \qquad (3)$$
Eq 3 represents the trained feed-forward ANN model, built in MATLAB, that correlates the concentrations of the five medium components with the COD concentration. Here, 'purelin' and 'tansig' are MATLAB functions that calculate a layer's output from its net input. purelin gives a linear relationship between the input and the output, whereas tansig is a hyperbolic tangent sigmoid transfer function that is mathematically equivalent to 'tanh'. tansig is faster than tanh in MATLAB simulations, thus it is used in neural networks. IW and LW are the weights of the connections from the input layer to the hidden layer and from the hidden layer to the output layer, respectively. The bias weights applied at the hidden and the output layers are represented as b and a, respectively. The input variables are represented as X. After training with the LM algorithm, the networks were simulated to predict the enzyme activity for a given media composition. The network learned the training data-set with 95.75% efficiency and predicted the validation data-set with 93.77% accuracy (Fig 4).
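A forward pass through a network of this shape (five inputs, fifteen tansig hidden neurons, one purelin output) can be sketched in Python as below. This is an illustration under stated assumptions, not the trained model: the weights are random placeholders, IW is taken as the input-to-hidden weight matrix and LW as the hidden-to-output weights (following MATLAB's naming), and any input or output scaling that MATLAB may apply around the network is not reproduced.

```python
import math
import random

def tansig(n):
    """Hyperbolic tangent sigmoid, written in the form 2 / (1 + exp(-2n)) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0

def purelin(n):
    """Linear transfer function: output equals net input."""
    return n

def predict(x, iw, b, lw, a):
    """Evaluate purelin(LW * tansig(IW * x + b) + a) for one input vector x."""
    hidden = [tansig(sum(w * xi for w, xi in zip(row, x)) + bias)
              for row, bias in zip(iw, b)]
    return purelin(sum(w * h for w, h in zip(lw, hidden)) + a)

# Placeholder weights (hypothetical values, NOT the trained network)
random.seed(1)
n_inputs, n_hidden = 5, 15
iw = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b = [random.uniform(-1, 1) for _ in range(n_hidden)]
lw = [random.uniform(-1, 1) for _ in range(n_hidden)]
a = random.uniform(-1, 1)

x = [0.0, 0.5, -0.5, 1.0, -1.0]  # one coded medium composition (made up)
y = predict(x, iw, b, lw, a)
print(y)
```

Because the hidden activations are bounded between -1 and 1, the output of such a network is bounded by the sum of the absolute output weights plus the output bias; a closed-form expression of this kind is what makes the trained model usable as a GA fitness function.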

Optimization of the RSM regression model using GA
The final response surface model was optimized using GA. The algebraic form of the model (i.e., Eq 2) was used as the fitness function for the GA. With the defined criteria, the response of the model reached its optimum value after eleven generations (Fig 5). The algorithm found the maximum enzyme output within the given experimental bounds at the optimized values of the variables. The maximum predicted enzyme production (6.283 U/mL) was obtained using 1.01 g/50 mL soybean, 1.49 g/50 mL maltose, 0.075 g/50 mL MgSO4, 0.45 g/50 mL NaCl and 1.488 mL/50 mL glycerol. The GA-optimized (predicted) productivity was verified experimentally and led to 6.04 (±0.5) U/mL COD production, which is in close agreement with the GA-predicted COD concentration (6.283 U/mL). The optimized experimental COD concentration (6.04 U/mL) was nearly 1.5-fold higher than that obtained with the un-optimized medium (4.2 U/mL).
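The general idea of maximizing a fitted model within coded bounds by GA can be sketched as follows. This Python code is a simplified stand-in for the MATLAB ga function, not a reproduction of it: the operators (tournament selection, blend crossover, Gaussian mutation, elitism), the parameter values and the function genetic_maximize are all assumptions chosen for illustration, and toy_fitness is a made-up concave function used in place of Eq 2.

```python
import random

def genetic_maximize(fitness, bounds, pop_size=60, generations=80,
                     mutation_rate=0.2, seed=0):
    """Maximize fitness(x) inside box bounds with a simple real-coded GA."""
    rng = random.Random(seed)

    def pick(population):
        # Tournament selection: keep the fitter of two random individuals
        c1, c2 = rng.choice(population), rng.choice(population)
        return c1 if fitness(c1) >= fitness(c2) else c2

    # Random initial population inside the bounds
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            p1, p2 = pick(pop), pick(pop)
            w = rng.random()
            # Blend crossover: weighted average of the two parents
            child = [w * g1 + (1.0 - w) * g2 for g1, g2 in zip(p1, p2)]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    # Gaussian mutation, clipped back into the bounds
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = max(lo, min(hi, child[i] + step))
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)

def toy_fitness(x):
    """Made-up concave response with its maximum (10.0) at a known point."""
    target = [0.5, -1.0, 1.0, 0.0, -0.5]
    return 10.0 - sum((xi - ti) ** 2 for xi, ti in zip(x, target))

bounds = [(-2.0, 2.0)] * 5  # coded range of the five factors (assumed)
best_x, best_y = genetic_maximize(toy_fitness, bounds)
print([round(v, 2) for v in best_x], round(best_y, 3))
```

Because a GA is stochastic, repeating the run with different seeds and comparing the optima is a cheap way to gain confidence that the result is not a local optimum.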

Optimization of the ANN regression model using GA
The algebraic form of the final trained neural network model (Eq 3) was used as the fitness function of the GA to optimize the concentrations of the medium components for maximum COD activity. The model was optimized within the same experimental range as the RSM model (Eq 2). Using a population size of 200, the GA reached the optimum value after 61 generations. The optimization was repeated several times to ensure that the global optimum was found. The ANN-GA model predicted a maximum COD concentration of 9.934 U/mL, in terms of enzyme activity, using 1.431 g/50 mL soybean, 1.389 g/50 mL maltose, 0.029 g/50 mL MgSO4, 0.45 g/50 mL NaCl and 2.235 mL/50 mL glycerol. The GA-optimized COD concentration was verified experimentally and yielded 9.75 U/mL COD at the optimized concentrations. The experimentally verified COD concentration in the optimized medium was more than double that obtained with the un-optimized medium (9.75 versus 4.2 U/mL) and nearly 60% higher than the yield predicted by the RSM generated model.
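As a quick arithmetic check of the fold changes, the ratios below use only the values reported in this and the preceding section (4.2 U/mL un-optimized, 6.04 U/mL experimental and 6.283 U/mL predicted for RSM-GA, 9.75 U/mL experimental and 9.934 U/mL predicted for ANN-GA); the rounding to two decimals is ours.

```latex
\begin{align*}
\text{RSM-GA medium vs. un-optimized:} \quad & \frac{6.04}{4.2} \approx 1.44 \\
\text{ANN-GA medium vs. un-optimized:} \quad & \frac{9.75}{4.2} \approx 2.32 \\
\text{ANN-GA vs. RSM-GA (experimental):} \quad & \frac{9.75}{6.04} \approx 1.61 \\
\text{ANN-GA vs. RSM-GA (predicted):} \quad & \frac{9.934}{6.283} \approx 1.58
\end{align*}
```

The last two ratios correspond to roughly 61% and 58% higher COD concentration, respectively, consistent with the "nearly 60%" figure quoted in the text.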

Discussion
It has previously been reported that COD is the first enzyme involved in cholesterol degradation and that it is produced by various microorganisms. Arthrobacter, Rhodococcus equi, Nocardia erythropolis, N. rhodochrous and Mycobacterium sp. are intracellular or intrinsic membrane-bound COD producers, whereas Pseudomonas sp., Schizophyllum commune, Brevibacterium sterolicum, Streptoverticillium cholesterolicum and some species of Streptomyces, such as S. violascens and S. parvus, produce extracellular COD [20,27-29]. COD produced by Streptomyces sp. has been reported to be of higher quality because of its lower production cost, stability and longer shelf life [30]. Earlier, we reported the extracellular production, purification and characterization of COD from the soil isolate Streptomyces sp. NCIM 5500 [22]. We also compared COD production by free cells and by Ca-alginate entrapped cells of Streptomyces sp. under batch conditions [31]. However, optimization of the medium components for COD production using statistical, mathematical or artificial intelligence based techniques has not been reported so far for this strain.
Root mean square error (RMSE) and mean absolute percentage error (MAPE) were determined for the two techniques (RSM and ANN) applied in this study for the prediction of the experimentally obtained enzyme concentrations. RMSE and MAPE for RSM are 4.92 and 13.52, respectively, while for ANN they are 4.1 and 7.8, respectively. The lower errors qualify ANN as a better predictor of the experimental values than RSM.
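For reference, the two error measures can be computed as in the following minimal Python sketch. It uses the standard definitions of RMSE and MAPE; the observed and predicted values are made-up numbers for illustration, not data from this study.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(observed, predicted):
    """Mean absolute percentage error (in percent); observed must be non-zero."""
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up enzyme activities (U/mL), for illustration only
observed = [4.0, 5.0, 8.0, 10.0]
predicted = [4.4, 4.5, 8.0, 9.0]
print(rmse(observed, predicted), mape(observed, predicted))
```

RMSE is expressed in the units of the response and penalizes large deviations more strongly, whereas MAPE is relative to the observed value, so the two measures can rank models differently when the response spans a wide range.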
The COD production (in terms of enzyme activity) in the un-optimized medium was 4.2 U/mL, which increased to 6.04 U/mL by employing RSM coupled with GA. ANN coupled with GA resulted in a further enhancement of COD concentration (9.75 U/mL), which was nearly 2.32-fold higher than the yield obtained with the un-optimized production medium. A combinatorial method using RSM coupled with GA has been successfully used to solve problems associated with process optimization [32,33]. Chauhan et al. (2009) reported a 2.48-fold increase in COD productivity from S. lavendulae by using statistical approaches [34]. Five medium components, viz. soybean, glycerol, maltose, MgSO4 and NaCl, were found to be important and were studied for the optimization of COD production. The effects of the individual medium components on COD activity correlated with the roles of those components in COD production. Glycerol and maltose showed a positive effect on COD production. An earlier study also reported that glycerol supports COD production in S. lavendulae [34]. Soybean meal is a complex nitrogen source that contains amino acids, carbohydrates and fatty acids [4,34], which enhance COD production [34]. In this study with Streptomyces sp. NCIM 5500, MgSO4 was found to be more effective than NaCl for COD activity, in contrast to the report of Amiri et al. (2008), who found that NaCl favors COD production more than MgSO4 [33]. However, other reports support the use of both salts in the production medium [34]. It was evident from the linear and quadratic effects that a higher concentration of MgSO4 and a lower concentration of NaCl are responsible for greater enzyme production. In contrast to NaCl supplementation of the production medium, a plethora of reports suggest the use of MgSO4 for stabilization or even enhancement of COD activity [11,34,35]. Also, El-Shora et al. (2011) reported that COD production is activated by Mg2+ ions in Staphylococcus epidermidis [35].
Media optimization using the ANN model coupled with GA resulted in a higher COD concentration than the RSM-GA approach. RSM is a useful technique for understanding the interaction effects of variables, but a neural network is better in terms of precision, and the same was found in this study. In general, biological processes are defined by many non-linear, complex relationships. ANNs are nonlinear stochastic models that mimic biological neural networks and are efficient in modeling complex biological processes. Desai et al. (2008) compared the efficiency of RSM and ANN in predictive modeling and medium optimization for the production of scleroglucan [36]. They reported that ANN fitted the experimental data with greater efficiency than RSM. The ANN based model was more general, as it predicted completely unseen data with greater efficiency (98%) than RSM (89%) [36].
In this study, RSM and ANN were used along with CCD to derive models for the interaction effects of the medium components (i.e., soybean meal, glycerol, maltose, NaCl and MgSO4) on COD production. GA was then employed to optimize the RSM and ANN models. The media compositions obtained by optimizing both models resulted in a higher COD concentration than the yield recovered with the un-optimized medium. The hybrid methodology, i.e., coupling of ANN with GA, improved COD production significantly (more than 2-fold) and proved better than RSM, as the model developed through ANN gave a nearly 60% higher COD concentration than the yield predicted by the RSM generated model. The combinatorial approach (coupling of ANN with GA) presented in this study is sufficiently general and can therefore also be employed for the optimization of various parameters in other bioprocesses. Overall, the higher COD concentration achieved in this study through ANN coupled with GA will pave the way for future studies on the production of COD at commercial scale using Streptomyces sp. NCIM 5500, as well as on the use of other artificial intelligence techniques, alone or in combination, for higher and sustainable production of COD.