Flux prediction using artificial neural network (ANN) for the upper part of glycolysis

The selection of optimal enzyme concentration in multienzyme cascade reactions for the highest product yield in practice is very expensive and time-consuming process. The modelling of biological pathways is a difficult process because of the complexity of the system. The mathematical modelling of the system using an analytical approach depends on the many parameters of enzymes which rely on tedious and expensive experiments. The artificial neural network (ANN) method has been successively applied in different fields of science to perform complex functions. In this study, ANN models were trained to predict the flux for the upper part of glycolysis as inferred by NADH consumption, using four enzyme concentrations i.e., phosphoglucoisomerase, phosphofructokinase, fructose-bisphosphate-aldolase, triose-phosphate-isomerase. Out of three ANN algorithms, the neuralnet package with two activation functions, “logistic” and “tanh” were implemented. The prediction of the flux was very efficient: RMSE and R2 were 0.847, 0.93 and 0.804, 0.94 respectively for logistic and tanh functions using a cross validation procedure. This study showed that a systemic approach such as ANN could be used for accurate prediction of the flux through the metabolic pathway. This could help to save a lot of time and costs, particularly from an industrial perspective. The R-code is available at: https://github.com/DSIMB/ANN-Glycolysis-Flux-Prediction.


Introduction
The emergence of genomics, transcriptomics and proteomics, along with improvements in information technology, helped us to integrate the information, build the mathematical in silico model of a biological system and observe its behaviour [1][2][3][4][5]. The integration of different "-omics" data helped us to understand the genetic difference between the phenotypes, to identify the molecular signature [6,7] and use metabolic engineering [8,9] etc. There have been many attempts to model biological systems, like Saccharomyces cerevisiae [4,[10][11][12], Escherichia coli [13][14][15], other organisms [3] and many plant metabolic networks for observing and predicting the behaviour of a system using different methods [2,16]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Many different kinds of mathematical models exist to study biological systems [17,18]. Several approaches have been developed to determine or estimate the flux through the metabolic pathway [19][20][21]. Based on the data and constraints used, the mathematical modelling can be classified into two broad categories [2,16] i.e., kinetic modelling or mechanistic modelling [22][23][24], and constraint-based or stoichiometric modelling [12,25,26]. The kinetic model defines the reaction mechanism in the system using kinetic parameters to evaluate rate laws. These rate laws are defined from the experiment, assuming that the experimental conditions are similar to in-vivo conditions [27]. To build a kinetic model, the system has been made as simple as possible, while retaining system behaviour. The modelling of enzymes like phosphofructokinase could be problematic and might need more parameters than other enzymes [28]. Determining the kinetic parameter is expensive and time consuming; some parameters could be more difficult to measure. Although many enzymatic assays are described in the literature, sometimes it is necessary to modify the assay for new enzymes or to find a new one. In some cases, for example, following enzyme reaction through spectrophotometers or spectrofluorimeters, this is difficult due to no absorption or emission signals [29] linked to the reactants. Most of the available kinetic data are obtained from in-vitro studies using purified enzymes which might not represent the exact properties of in-vivo enzymes [23]. For example: The V max value measured in vitro may not represent the value of an in vivo system because of the destruction of enzyme complexes, cellular organisation and the absence of an unknown inhibitor or activator [30,31]. A constraint-based model uses physiochemical constraints like mass balance, thermodynamic constraints, etc., in the modelling, to observe and study the behaviour of the system [25]. There are different methods, like flux balance analysis [32] and metabolic flux analysis [33]. Flux balance analysis is an approach to studying biochemical networks on a genomic scale, which includes all the known metabolite reactions, and the genes that encode for a particular enzyme. The data from genome annotation or existing knowledge is used to construct the network [5,34] and the physicochemical constraints are used to predict the flux distribution, considering that the total product formed must be equal to the total substrate consumed in steady state conditions [32]. This method is used to predict the growth rate [5,32,34,35] or the production of a particular metabolite [36]. Metabolic flux analysis, an experimental based method, allows the quantification of metabolite in the central metabolism using the Carbon-labelled substrate [33,37,38]. The labelled substrate is allowed to distribute over the metabolic network and is measured using NMR [39] or mass spectrometry [32].
Many of the biomolecules like organic acids [40,41], antibiotics [42][43][44], bioethanol etc. [45,46] have been used in the pharmaceutical and food industries and as energy sources. Biomolecule production is attracting the attention of biologists and industries due to the decrease in non-renewable resources and global warming [47,48]. Synthetic biology and systems biology help to obtain the highest yield of biomolecules from the source [49][50][51].
Glycolysis is the centre of the metabolic system in all living organisms. It is an anaerobic pathway present in almost all living cells and helps in ATP generation. Glycolysis has been established as the central core for the fermentation. It contributes to the production of different metabolites, like citric acid, succinic acid, amino acids, etc., through pyruvate, the end product of glycolysis [52].
The neural network is an architecture, modelled on the brain, that is organized with neurons and synapses present in a structure of nodes (formal neuron) connected together [53,54]. Each numerical input corresponds to the input layer, and the value to predict (variable to explain) corresponds to the last level, the output layer. Between those two layers, intermediary nodes are present, built specifically and in sufficient number to model the problem; they form the hidden layer. The neural network has been successfully applied in different fields of science such as physics [55,56], environmental science [57][58][59] and data mining [60,61] for the prediction of different features in the system. The ANN is also core for deep learning [62]. The artificial neural network could be used to predict the product outcome (i.e. flux through the pathway) when combined with Flux balance analysis or other modelling approaches.
The ANN has been used earlier in predicting the fluxes from 13 C labelling of metabolites in mammalian gluconeogenesis by M.R. Antoniewicz et al. [21]. Three linear regression modelling methods, multiple linear regression (MLR), principal component regression (PCR) and partial least square regression (PLS) were run on simulated data and compared to the ANN. The study showed that ANN, that requires the larger sample (>200) performed better than the other methods for flux prediction using new mass isotopomer data [21].
Due to the challenges in estimating the flux using different methods like constraint-based and kinetic-based modelling approaches [63][64][65][66], we developed a simple method using artificial neural networks. This method is based purely on the existing experimental data and hence does not require kinetic parameters as in kinetic modelling and no prior information is required regarding the stoichiometry of the metabolic pathway. In this study, an artificial neural network was built to estimate the flux using enzyme concentrations for the upper part of glycolysis as input. Finding the optimum enzyme concentration-which gives the highest product-through experiments is very tedious and expensive. The neural network approach could be helpful for choosing the optimum enzyme concentrations which enhance the final product concentration without experimental setup, within a short period of time. Experiments were carried out in order to: i) assess the structure of the ANN using three different approaches, ii) evaluate different activation functions and iii) compare the prediction of flux of NAD + to the fluxes predicted by Fievet & al. (2006).

Principle of artificial neural networks
The base element of ANN is the perceptron, defined in 1958 by Rosenblatt [67]. A combination function computes a value based on the input layer and some weights. This is a weighted sum ∑n i p i (observed node) of the n i values in the input layer. To define the output value, a function called activation function is applied to this value. We note n i for the node i, the weight p i corresponds to the connection between node i, the observed node and the activation function f, associated with the observed node (Fig 1).
The structure of an ANN is defined by the number of layers and nodes, by the way they are linked (activation function) and the method to estimate the weights.

Input for building the ANN model
The flux measurement data from in-vitro reconstructed upper part of glycolysis [20] was used to build the artificial neural network (Fig 2). The input for the ANN model consists of concentrations of enzymes phosphoglucoisomerase (PGI), phosphofructokinase (PFK), fructose bisphosphate aldolase (FBA) and triose phosphate isomerase (TPI) in mg/L, and the output is flux J (μM/s) measured as the NADH consumption by glycerol-3-phosphate dehydrogenase (G3PDH). The flux measured through the upper part of glycolysis is indirect and we assume that most of the NADH in the system is consumed during the measurement. The data has been normalised using min-max method before building the neural network.

Experimental details
The upper part of glycolysis was reconstructed in vitro (Fig 2), with constant concentration of hexokinase andglycerol-3-phosphate dehydrogenase, while the other four enzymes (PGI, PFK, FBA and TPI) concentrations varied. The total enzyme concentration of the four enzymes (PGI, PFK, FBA and TPI) was constant at 101.9 mg/L. The NADH consumption using the glycerol-3-phosphate dehydrogenase is monitored every 2s with Uvikon 850 spectrometer at 390 nm from 60 to 120s. The linear slope of NADH was calculated as the flux through the pathway. The assays were performed in triplicate by Fiévet et al., 2006, at 25˚C, by adding 1mM ATP at pH 7.5. Data is given in supplementary information (Table A in S1 File).

Structure of ANN
The artificial neural network built with a single layer of hidden units [68]

Results and discussion
The three-neural network algorithms: nnet [69], neuralnet [70] and RSNNS [71] are built with a hidden number of units ranging from 9 to12, as shown in Equation 1. The RMSE and coefficient of determination were compared between algorithms during leave-one-out cross validation. Out of three algorithms, the neuralnet performed better than the other two (Table B in Using the neuralnet model, with "logistic" and "tanh" activation functions, the effect of the number of hidden units on RMSE and R 2 was studied (Fig 3) with a leave-one-out-cross-validation procedure (Table C in S1 File). The logistic function with 13 hidden units gives a RMSE of 0.847, R 2 of 0.93 and tanh function RMSE of 0.804, and R 2 of 0.94 with 6 hidden units. The R-script to build the ANN with leave-one-out cross validation is given in R-Scripts A in S1 File.
Experimental flux was compared with the ANN predicted flux by leave-one-outcross-validation procedure, using chosen hidden units with logistic and tanh function (Fig 4). The effects of enzyme concentrations on the predicted flux and experimental flux were compared, and found to follow a similar trend (Fig 5).
During the cross validation of the neural network model, a negative flux value was predicted for one combination of enzymes (Table 1: Index-3). This is because, during a leave-oneout procedure (LOOcv), one combination of the four enzymes is not included in the model training and has to be predicted. The negative value shows the poor ability of the ANN model to predict the outliers, i.e. a combination that is not close (in terms of PGI, PFK, FBA and TPI concentrations) to those included in the training data set.
The original study by J. B. Fievet et al. developed a model to predict flux. As the authors mentioned in their article, their flux predictor overestimates the observed flux by a constant factor. The predicted flux in their method, has an R 2 value of 0.86, whereas an ANN approach with logistic function showed an R 2 value to be 0.93 and in case of tanh activation function, an R 2 of 0.94, obtained with leave-one-out cross validation, which implies that the ANN approach is more efficient in predicting the flux than the method developed in the Fievet study. The effect of enzyme concentrations on the predicted flux by both methods follows a similar trend.
The difference between actual flux and ANN predicted flux was an average of 0.57 μM/s for logistic and for tanh, with a standard deviation of 0.63 and 0.57 respectively (Table 1)  Flux prediction using artificial neural network (ANN)

Conclusion
Kinetic modelling of metabolic pathways is challenging because of the difficulties in estimating the kinetic parameters [29,65], and is sometimes expensive because of the high-cost substrates and technologies involved [72,73], whereas the constraint-based model does not use any kinetic parameters but is efficient enough to predict the flux of metabolites. Choosing the optimum enzyme concentrations for the highest flux could be a challenge when conducting experiments. Using artificial intelligence with available experimental data can help us find a quicker and more cost-effective solution for biological problems.
In this study, a neural network model was built successfully with two different activation functions: i.e., logistic (sigmoidal) and tanh, with RMSE and R 2 values of 0.847, 0.93 and 0.804, 0.94 respectively. The difference between actual flux and ANN predicted flux was an average of 0.57 for both activation functions. The J. B. Fievet et al. method after the correction has a RMSE of 1.30, with a 1.05 difference between predicted and observed flux, which clearly indicates that the ANN method works better than the other method.
It has not escaped our attention that the artificial neural network model depends on the diversity of the training data, and hence training the model with a maximum of variability in the concentration of enzymes plays a crucial role. The model built in this study might not be enough to extrapolate a model for all other enzyme concentrations. In future, the study will be extended to predict the wide range of flux and the enzyme concentration for a particular flux. Experiments in a wet laboratory will be carried out for an upcoming application paper. Even though further work has to be performed, the artificial neural network approach is a promising method in metabolic pathway modelling and could find its place in metabolic engineering and industrial scale biomolecule synthesis.  Supporting information S1 File.