Rapid Prediction of Bacterial Heterotrophic Fluxomics Using Machine Learning and Constraint Programming

13C metabolic flux analysis (13C-MFA) has been widely used to measure in vivo enzyme reaction rates (i.e., metabolic flux) in microorganisms. Mining the relationship between environmental and genetic factors and metabolic fluxes hidden in existing fluxomic data will lead to predictive models that can significantly accelerate flux quantification. In this paper, we present a web-based platform MFlux (http://mflux.org) that predicts the bacterial central metabolism via machine learning, leveraging data from approximately 100 13C-MFA papers on heterotrophic bacterial metabolisms. Three machine learning methods, namely Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Decision Tree, were employed to study the sophisticated relationship between influential factors and metabolic fluxes. We performed a grid search of the best parameter set for each algorithm and verified their performance through 10-fold cross validations. SVM yields the highest accuracy among all three algorithms. Further, we employed quadratic programming to adjust flux profiles to satisfy stoichiometric constraints. Multiple case studies have shown that MFlux can reasonably predict fluxomes as a function of bacterial species, substrate types, growth rate, oxygen conditions, and cultivation methods. Due to the interest of studying model organism under particular carbon sources, bias of fluxome in the dataset may limit the applicability of machine learning models. This problem can be resolved after more papers on 13C-MFA are published for non-model species.


Introduction
With the advent of systems biology tools, such as genomics, transcriptomics, proteomics, and metabolomics during the last decade, the understanding of intracellular metabolisms from genotype to phenotype has been dramatically boosted. Notably, 13 C metabolic flux analysis ( 13 C-MFA) enables the quantification of metabolic reaction rates in vivo [1]. It determines carbon metabolic fluxes using the mass isotopomer distribution (MID) of proteinogenic amino acids or free metabolites from 13 C labeling experiments. 13 C-MFA is considered as a reliable measurement of central metabolic reaction rates [2], which has demonstrated its power in discovering novel pathways [3,4], validating gene functions [3], verifying engineered strains [5,6], and revealing energy metabolism of host strains [7]. In the past decade, advanced parallel bioreactor systems, mass spectrometry, and computational tools resolving metabolic fluxes have been developed [8][9][10][11], which improved the precision of flux profiles [12] and extended 13 C-MFA's application to the non-stationary metabolic phase [13,14]. On the other hand, broad applications of 13 C-MFA are still hindered because 13 C experiments, biomass analysis, and flux calculations are expensive and time-consuming [15]. Moreover, some microbial systems may not be amenable to 13 C-MFA if they require complex nutrients or their genome annotation is incomplete [16]. Before performing 13 C-MFA on non-model species, laborious work is needed to examine extracellular metabolites, to characterize unknown pathways, and to analyze biomass compositions.
This study aims to employ an artificial intelligence (AI) approach called machine learning (ML) to investigate bacterial fluxomics patterns. ML is a powerful tool in systems biology [17] and has demonstrated successes in omics studies [18,19]. For example, the precision of genome annotation on the model species C. elegans has been significantly enhanced by employing a simplified Support Vector Machine (SVM) method. Researchers have reached an accuracy of 75% on controversial genes [20]. At the transcriptomics level, ML approaches have been frequently used for disease identification. For instance, SVM has successfully recognized the gene expression patterns of hepatocellular carcinoma (HCC) [21], diffuse large B-cell lymphoma (DLBCL) [22] and ovarian cancer [23]. At the proteomics level, Supek et al. have employed a combined approach by integrating the Principal Component Analysis (PCA) method with SVM, to enhance analytic power in identifying "fingerprint" proteins (i.e., unique proteins in each tissue) from different horseradish tissues (leaf, teratoma, and tumor) grown in vitro [24]. In metabolomics, an SVM method can resolve the NMR data of metabolites in urine samples from different groups of people (healthy vs. pneumonia) [25]. In metabolic modeling, Karp's group have adopted ML algorithms to predict the existence of various pathways for metabolic network reconstruction in different organisms [19].
The general idea of ML is to statistically build a numerical predictive model or an estimator which is a function f : X 7 ! y that maps a vector of numbers called the feature vector to a vector of numbers called the target or the label. In many cases, the target is a 1-D vector, or a scalar. One may consider the feature vector as the input and the target as the output of the model. If the target takes discrete values, we call the ML model a classifier. Otherwise, a regressor. A commonly used classifier is binary classifier, where the cardinality (size) of the target set y is 2, e.g., y = {+1,−1}. In this paper, we build a regressor f : R n 7 !R for each flux, where R stands for the set of all real numbers. In supervised ML, a pair of a feature vector and a target form a training sample. Given a finite set of N samples {(X 1 , y 1 ), . . ., (X N , y N )}, an ML algorithm will find such a function, usually through solving a numerical optimization problem, to minimize the predictive error. Samples used to train a model form the training set while those for testing the performance form the test set. Given a new piece of data, numerically represented as a vector X new , the model f will predict the target f(X new ), e.g., a flux value given reaction parameters where are represented by the vector X new in this paper. The models learned through ML are usually not analytical models that can be represented using equations. Rather, they are numerical operators. For example, an artificial neural network (ANN) model can be represented by a series of weight and bias matrices, each of which is for one layer. A poor model can only predict well on the training set as if it only "remembers" the training samples, while a good model can learn the patterns among data and still be accurate on samples it has never "seen". Hence, researchers make the training and test sets mutually exclusive. A mechanism called cross validation is used to ensure the mutual exclusiveness of training and test sets while making full use of the dataset.
A distinct advantage for ML applications is that they can reduce the need for costly experimental supplies and time-consuming benchwork. Despite the progress in utilizing ML methods in systems biology, there is no similar application in the fluxomics field to predict the flux profile. Therefore, we conceived the idea of integrating ML strategies with fluxomics research. To efficiently employ ML methods, a dataset with a sufficient number of samples is a prerequisite. Recently, a 13 C-MFA dataset named CeCaFDB has been constructed, which includes more than 100 papers mostly on prokaryotic species [26]. Based on this dataset, five categorical and sixteen continuous features were initiated to describe the environmental and genetic factors involved in 13 C-MFA of bacterial species. Unlike most omics projects employing ML approaches, this work built regressors rather than classifiers: 29 lumped central metabolic fluxes were adopted as the outputs to describe the central carbon metabolism of bacteria species. A 10-fold cross validation evaluated the performance of different algorithms. Furthermore, we included a knowledge-based system to check whether user inputs were biologically meaningful. Lastly, quadratic programming was employed to adjust the fluxes predicted by ML to satisfy stoichiometric constraints. Our web-based platform MFlux provides reasonable predictions for central metabolic flux profiles on 30 bacteria species, and it can be accessed online at http://mflux.org along with the training data. Although our platform is still in the early phase, our trial to integrate AI approaches with mechanistic models will have broad impacts on both systems biology and metabolic engineering fields.

Data collection
The dataset used to build MFlux are constructed from the literature. The total uptake rate of carbon sources is normalized as 100; all other fluxes are normalized based on the uptake rate of carbon sources. We obtained 13 C-MFA information for bacterial species from the CeCaFDB dataset and added a few recent papers (approximately 120 papers in total, as of January 2015). 13 C-MFA data related to photosynthetic bacteria was excluded in ML study because of their diverse CO2 fixation pathways, light-sensitive fluxomes, and insufficient sampling sizes for ML. For photosynthetic species, MFlux currently only reports a general description of their fluxomic features based on corresponding references.
In heterotrophic microorganisms, interconversions between glycolysis metabolites (phosphoenolpyruvate and pyruvate) and TCA cycle metabolites (oxaloacetate and malate) involve a set of anaplerotic reactions (e.g., phosphoenolpyruvate carboxylase, phosphoenolpyruvate carboxykinase, pyruvate carboxylase, and malic enzyme) serving as a key switch point for carbon flux distribution in bacteria [27]. These reactions balancing both carbon and cofactors may be employed by different microbial species. For example, E. coli anaplerotic pathways involve phosphoenolpyruvate carboxylase and malic enzyme, while Bacillus species furnish pyruvate carboxylase (the pyruvate shunt). In the case of Corynebacterium, both phosphoenolpyruvate carboxylase and pyruvate carboxylase are functional [28,29]. These anaplerotic pathways can re-route fluxes when central pathway such as pyruvate kinase is knocked out. To ease the ML efforts, the anaplerotic pathways were lumped into two routes that exchanges fluxes between the TCA cycle and the glycolysis nodes: (MAL ! PYR + CO 2 and PEP + CO 2 ! OAA). This simplification also considered the fact that 13 C-MFA has poor resolutions on anaplerotic fluxes because various combinations of these reactions could generate similar labeling patterns in amino acids [30].

Feature vector for ML
As mentioned earlier, supervised ML builds models based on the samples, each of which is a pair of a feature vector and a target. Based on published 13 C-MFA methodologies and microbial physiologies, we proposed five categorical features: species, nutrient types, oxygen conditions, engineering method, genetic background, and cultivation methods. There were two considerations when choosing those features. First, genetic modifications can significantly reorganize fluxomes. To improve the predictability on mutant strains, our platform allows toggling on or off certain central pathways (by manually setting the flux boundaries) in engineered strains. Second, the factor of cultivation method aims to reveal fluxome differences between shake flask cultures (a pseudo-steady state approach) and bioreactor cultures (a well-controlled fermentation or chemostat cultivation). Meanwhile, we introduced sixteen continuous features: growth rate, substrate uptake rate, and the ratio of multiple substrate uptakes (glucose, fructose, galactose, gluconate, glutamate, citrate, xylose, succinate, malate, lactate, pyruvate, glycerol, acetate and NaHCO 3 , as shown in Fig 1). Since the features include both categorical and continuous ones, one-hot encoders were used to convert categorical feature values into real numbers. Each feature was then standardized into zero mean and unit variance as assumed by many ML approaches. For each predicted flux, or the target/label in ML terminology, we scaled it into the interval [0, 1] by the min-max method. In addition to the min-max method, we also tested unit-variance-zero-mean standardization for scaling flux values, and the result was quite similar.

Machine learning algorithms
The problem of predicting fluxes was modeled as a regression problem in ML where a computer program learns from existing data to estimate continuous variables. Twenty-nine regressors were trained to predict the 29 fluxes. We tested three widely-applied ML algorithms, including k-nearest neighbors (k-NN), decision tree, and SVM. To ensure a fair comparison, we performed a grid search for the best parameter set of each algorithm. The detailed parameter sets for 29 SVM-based regression models can be found from our web page. The programming language used for this project was Python 2.7 and the numpy and scikit-learn modules were utilized for machine learning [31]. Program files for training the models and testing them are wrapped in S1 Program. Full version including web-end code is released under GNU GPL v3 at https://bitbucket.org/forrestbao/influx

Model evaluation and cross validation
Considering the limited number of samples in the current dataset, we adopted a 10-fold cross validation. An N-fold cross validation works as follows. All samples in our dataset are spliced into N equal parts. In each iteration, N − 1 parts are used as the training set, while the remaining as the test set. In the next iteration, the test set will be rotated to another part of the data, and the training set will consist of all other samples. This procedure will stop when all parts of the data have been incorporated into the test set exactly once, and training set exactly N − 1 times. Finally, the accuracy of the model can be calculated by checking the prediction result for each sample. For each flux, the error in cross validation is computed using Mean Squared Error (MSE).

Stoichiometric constraints and boundary
One unique feature of our method is incorporating the overall mass balance through central metabolic pathways. The stoichiometric equations in Fig 1 under steady state are summarized as follows: AceCoA Ru5P Specifically, v 1 represents the flux from carbon substrate (either glucose or galactose) to G6P since both glucose and galactose can be catabolized to G6P, vaa 1 and vaa 2 represent fluxes involved in biomass building block synthesis or extracellular products, while vbm represents carbon fluxes going to biomass from different precursors [32].
A series of linear constraints can be derived from the stoichiometric equations above and used to restrain fluxes predicted by the ML methods: Among equations listed above, Eq 22 indicates the case for co-metabolism of both C6 sugars. Meanwhile, a list of inequality constraints can be drawn, given that all biomass fluxes are non-negative: Among all inequality constraints, Eq 39 works well except for the case of zwf knockout, where the direction of Eq 39 could be reversed [33].

Flux adjustment using stoichiometric constraints
We adopted a quadratic programming method similar to minimization of metabolic adjustment (MOMA) [34], to tune fluxes to satisfy the stoichiometric constraints. The CVXOPT package for Python was employed here for quadratic programming [35]. The optimization problem is modeled as where the vectorv ¼ ½v 1 ; . . . ;v 29 is the flux values predicted by ML, the vector v = [v 1 , . . ., v 29 ] is the flux values to be solved in this optimization problem, the function Scaled(Á) using Min-Max scaling to scale all fluxes into the range [0, 1], the matrix S is obtained from all equality constraints from Eq 22 to Eq 30, and the matrix A is obtained from all inequality constraints from Eq 31 to Eq 42. Notably, the biomass composition for a same species varies significantly under various conditions. Therefore, the quadratic programming looses mass balance constraints toward biomass synthesis. The purpose of scaling fluxes into the same range is to avoid the bias because fluxes have different dynamic ranges. The objective function f(v) can be rewritten into a standard quadratic programming problem using the following steps: where Min i and Max i are the range of the i-th flux. Since the last term 1 2v i Max i ÀMin i 2 and the coefficient 2 are constants, they can be omitted from the objective function. Hence, Eq 43 can be rewritten in standard quadratic programming form as

Constraint programming and input checking
To ensure user inputs, e.g., growth rates, oxygen usage, and substrate uptake rates, are biologically meaningful, MFlux first checks the satisfiability (e.g., whether cell growth rate is realistic) of input values [36]. The biological meaningfulness is represented using constraint programming [37], where each input is treated as a variable of a given domain. A set of inputs lacking of biological meaning will cause those constraints to be unsatisfied and MFlux will report an error message to warn the user. The Python module python-constraint [38] is used as the constraint solver.
Overall system design Different parts of MFlux mentioned above are put together as illustrated in Fig 2. The prediction on 29 fluxes is done via an RBF-kernel SVM, whose outcome will be finalized by quadratic programming. Users can set boundary constraints to represent information about genes that are knocked out on the species, and such information will be used in quadratic programming. If parameters set by the user are not biologically meaningful, a warning message will be displayed. In the future, users will also have the option to enter flux constraints and settings of their own experiment to improve the prediction accuracy of MFlux.

Pathway map and statistical analysis
The core metabolism of bacteria is summarized into a pathway map in Fig 1. Considering the availability of information, 29 major fluxes with 14 potential substrates were used to represent a universal heterotrophic carbon metabolism for non-photosynthetic bacteria species, which includes glycolysis, the tricarboxylic acid (TCA) cycle, the pentose phosphate (PP) pathway, the Entner-Doudoroff (ED) pathway, the glyoxylate shunt and the anaplerotic pathway. It is difficult for 13 C-MFA to precisely resolve the anaplerotic pathway fluxes [39]. Information on the anaplerotic pathway is either incomplete or not precise in many publications in our dataset. Consequently, we simplified the anaplerotic pathway into two reversible fluxes. Similarly, we ignored several overflow fluxes which occasionally appear in 13 C-MFA anaerobic metabolisms (e.g., the secretion of formate, butyrate, or pyruvate), because of lacking sufficient samples for machine learning. The omission of those fluxes can also partially explain the high prediction error in some fluxes (e.g., v 8 : Pyruvate ! Acetyl-CoA).
By statistical analysis, we determined the variation between each flux profile and the average flux profile from our 13 C-MFA dataset. The average value, the range, and the 95% confidence interval for each flux are shown in Fig 3. The most conservative fluxes from our dataset include the non-oxidative pentose phosphate pathway and the glyoxylate shunt. The former pathway supplies precursors for bio-synthesizing amino acids (i.e., histidine, phenylalanine, and tyrosine) and nucleotides. The latter acts as an alternative carbon reserving path to the TCA cycle and is inhibited by the presence of glucose (most 13 C-MFA is based on the glucose metabolism). All 29 fluxes are found to have a relatively narrow confidence interval compared to possible flux ranges, suggesting that fluxes of different bacteria species varies in a relatively small range. This is because most 13 C-MFA studies are focusing on models species (e.g., E. coli and B. subtilis) and glucose based metabolism, while there are much less MFA efforts to study nonmodel species or metabolism of carbon substrates other than sugars (i.e., bias of fluxome research across).

Optimization of algorithms and parameters
To decide the most suitable ML algorithm, we first performed a grid search in the parameter space, using a dataset of wild type (WT) samples only. The best results of three different algorithms (for SVM, linear kernel only here) are presented in Fig 4. SVM makes better predictions than either the decision tree or k-NN on most fluxes. After this step, we carried out a second round of grid search to optimize parameters and improve the performance of SVM on the whole phenotype (WP) dataset (both WT and engineered). Both the linear kernel and radial bias function (RBF) kernel were included in this round of grid search.  Better cross validation was expected from the SVM models trained on the WT dataset, rather than on the WP dataset, while sophisticated genetic variations are not included in the WT dataset. However, cross-validation results refuted our initial thought: the models from the WP dataset demonstrated significantly better performance than those trained on the WT dataset (data shown in Fig 5). This result can be interpreted as that the size of the training set is a major factor affecting the model quality, especially when the training set is relatively small (the sizes of WT and WP datasets are about 150 and 450 samples, respectively). We also compared the SVM results using the linear kernel with those using the RBF kernel, and the RBF kernel showed slightly better performance (Fig 6). The parameter set producing the most accurate cross-validation result was used to configure MFlux. Notably, prediction on v 11 (the second step of the oxidative PP pathway) and v 24 (the glyoxylate shunt) have relatively large variations. Two factors may contribute to this fact. Both v 11 and v 24 have relatively narrow ranges (see    1) and consequently even small numerical variations will generate larger relative errors for both fluxes. Meanwhile, genetic modifications may influence both v 11 (e.g., zwf knockout [40]) and v 24 (e.g., ppc knockout [41]) significantly. For instance, knocking out zwf in E. coli will cause a zero flux in v 10 (the oxidative pentose phosphate pathway, OPP pathway) [42]. However, the lack of sufficient information on flux re-organization mechanisms in engineered microbes reduces ML predictability. This is because most engineered microbial fluxomics studies are focused on a few model species such as E. coli. To resolve this problem, the MFlux platform allows the users to manually set the boundaries of central fluxes to improve prediction quality (e.g., setting a zero flux through the OPP pathway for E. coli zwf mutant).

Flux correction by quadratic programming
After parameter optimization, the SVM models of the best parameter sets can predict with relatively small error. However, the flux profile predicted by the ML method does not necessarily satisfy the inherent stoichiometric constraints of metabolic networks because the ML methods do have big enough dataset at this stage to reflect this. The situation could get even worse where specific fluxes predicted by the ML algorithm may go beyond a biologically meaningful range (e.g., the predicted glyoxylate shunt flux v 24 may have a negative value). To address those issues, we employed quadratic programming for flux correction as described in the Methods section. More rational results with improved accuracy are expected after flux correction. An essential assumption of this step is that ML predictions are relatively close to real values reported in the literature. This hypothesis is backed by our cross-validation results further validated in the following case studies.

Case studies
To demonstrate the functionality of MFlux, we carried out tests on 20 cases, and the results are illustrated in Fig 7. Brief information for each case is listed in Table 1, and comprehensive results are included in S1 and S2 Tables. In general, MFlux can achieve decent flux predictions. Here we demonstrate two cases which are Cases 8 and 16.
In Case 8, B. subtilis strain uptakes the mixed substrates succinate and glutamate. To illustrate mixed substrates co-metabolisms, we tested MFlux with 13 C-MFA data of B. subtilis reported by Chubukov et al. [44]. Microbial fermentation fed with multiple substrates of low price is promising for the biotechnology industry. However, there are very few quantitative analyses of this topic. In this test, we adopted the same set of parameters found in the literature In Case 16, G. thermoglucosidasius strain M10EXG grows under microaerobic conditions. G. thermoglucosidasius is a thermophilic and ethanol tolerant bacterium which can convert both hexose and pentose into ethanol [28]. To predict its central fluxomes, the parameter set used is listed in S1 Table, Table). In this case, MFlux reduces the deviations of predicted fluxes from 13 C-MFA values.
For species with genetic modifications in major pathways (Cases 2, 3, 4, 12, and 13, E. coli and C. glutamicum), MFlux predictions have an RMSE between 5 and 10, higher than the RMSE for prediction of wild type strains. Since MFlux is currently unable to capture complex regulatory mechanisms of flux reorganization, human-computer interaction can be employed by manually tuning boundary values of certain fluxes to improve flux prediction quality. For example, knocking out ppc on E. coli may activate the glyoxylate shunt [41,42]. The users can assign a non-zero lower boundary of the glyoxylate shunt when running MFlux.

Improving flux balance analysis of microbial metabolism via MFlux
Stoichiometry-based flux balance analysis (FBA) is an important mechanistic tool to predict unknown cell metabolism [50]. Accurate FBA prediction relies highly on setting the objective function and the flux constraints appropriately (based on thermodynamics or experimental analysis). Here, we compare FBA with MFlux for predicting E. coli metabolisms. The latest  subtilis was incubated in a shake flask (37 C, 300 rpm, aerobic condition), and supplied with labeled succinate and glutamate as carbon sources in M9 minimal medium. Detailed information is in S1 Table. doi:10.1371/journal.pcbi.1004838.g008  version of E. coli iJO1366 genome-scale model (2583 fluxes) was used [51]. Two comparative case studies were performed on E. coli fluxomes: one case for glucose based 13 C-MFA via parallel labeling experiments [12] and the other for glucose and glycerol co-utilization (unpublished data from the Shimizu Group). Neither of the test cases was included in the training set of MFlux. Given 13 C-MFA results as the control, MFlux results apparently have smaller RMSEs than FBA predictions. In the first case, the FBA has an RMSE of 11.3, while MFlux has an RMSE of 6.5 (Fig 10A). In the second case, the FBA has an RMSE of 22.5, while MFlux has an RMSE of 5.1 (Fig 10B). To circumvent variations caused by alternative solutions in FBA, we also employed pFBA and geometricFBA for both cases [52,53] (S2 Table). In general, pFBA does not show better results compared with FBA for either case, while geometricFBA does not converge in our calculation. FBA alone has given good predictions of growth rate as well as input and output fluxes, but not of intercellular fluxes. It is difficult to obtain actual P/O ratios, the ATP maintenance cost, the oxygen flux, and the transhydrogenase activities [55]. These energy/cofactor variables strongly affect the fluxes in the oxidative PP pathway (NADPH generation) and the TCA cycle (NADH, NADPH, and FADH 2 generation). Without proper flux constraints and  [54]. The default values of growth associated maintenance energy (GAM) and nongrowth associated maintenance energy (NGAM) were adopted. A) E. coli fluxome of glucose metabolism was precisely measured via parallel labeling experiments (a recent paper not in our dataset) [12]. B) E. coli fluxome of glycerol and glucose co-metabolism as measured by Drs. Yao and Shimizu (unpublished data). The E. coli strain was cultured in chemostat fermentor with a working volume of 1 L(37 C). The dilution rate in the continuous culture was 0.35 h −1 . [1-13 C] glucose and [1, 3-13 C] glycerol were used for tracer experiments. The flux calculation is based on a previous method [42]. The RMSE from FBA is 22.5, while the RMSE from MFlux (this work) is 5.1. The COBRA toolbox running on MATLAB R2012b was employed for FBA/pFBA/geometricFBA simulation, and Gurobi 5.5 was used for linear programming. Detailed information is included in S2 Table. doi:10.1371/journal.pcbi.1004838.g010 objective functions, it is more challenging for FBA to narrowly determine intracellular fluxomes in suboptimal metabolisms, especially for co-metabolism dual substrates because of the large solution space for the cell metabolism to optimize biomass growth using two substrates. As a complementary tool, MFlux may offer a quick metabolic overview and provide biologically meaningful flux boundaries to reduce FBA solution spaces when proper constraints for FBA are unavailable.

Discussion
Metabolic robustness of fluxome patterns among microbial species "Robustness" was originally defined as the closed-loop process stability under perturbations in the control field. This definition is applicable to biochemical networks. To maintain the physiological output (i.e., the fluxome) within a desired range, microorganisms employ sophisticated control disciplines at different architecture levels, from the genome to the phenotype. In contrast to chaotic transcriptional profiles, the microbial fluxome shows robustness so that cells can survive in constantly-altering environments or in response to genetic mutations [56][57][58]. Metabolic rigidity at the flux level was first reported by Stephanopoulos in the early 1990s [59,60]: NADPH is important for anabolism in the exponential growth phase, and the flux ratio around glucose-6-P node is rigid to form NADPH [60]. Moreover, 12 precursors from the central metabolism are required for biomass formation, which all have relatively small variations that are mainly dependent on biomass compositions. Due to both thermodynamic and mass balance constraints, cell metabolism aims to minimize variations in flux ratios under environmental perturbations. This rule also works for engineered microbes with moderately overexpressed pathways or strains from random mutations or deletions of non-essential genes. The feature of metabolic robustness facilitates ML applications.
Flux pattern recognition enables MFlux to predict metabolism of new species by learning from a small set of fluxome information from the same genus. For example, the metabolisms of P. aeruginosa, P. fluorescens, and P. putida have been studied by 13 C-MFA in the past decade [61][62][63][64][65]. The results show that different Pseudomonas species employ remarkably identical fluxomics types: they employ a highly active ED pathway for glycolytic metabolism and keep a low flux on the PP pathway for biomass synthesis, due to the lack of the pfk gene [66]. The ED pathway has less cost for protein formation than the Embden-Meyerhof-Parnas (EMP) pathway, yet only one ATP is formed per glucose [67,68]. Pseudomonas species have slow cell growth rates and their aerobic metabolisms do not yield by-products. They also demonstrate a very active pyruvate shunt (MAL ! PYR) and NADPH overproduction flux (a benefit for counteracting oxidative stress). On the other hand, the TCA cycle in Pseudomonas species show plasticity under genetic and environmental variations [69], and can respond to increased ATP and NADH demands under stress conditions [70].
For different bacterial species (e.g., E. coli and Bacillus), their fluxomes (e.g., glucose metabolisms) can be similar, because central fluxes in catabolism are regulated by energy and building block requirements that show much smaller variations than genome or transcriptional differences. On the other hand, change of carbon substrates may alternate flux distributions. For example, co-utilization of glucose and glycerol in E. coli cause significant reorganization of fluxomes. In a same microbial strain, different fluxome patterns can be employed for metabolizing different substrates (e.g., glucose-based fluxome vs acetate based fluxomes). Recognizing these metabolic patterns allows the use of a relatively small training set to perform a decent metabolic prediction of diverse metabolic types. Consequently, these common principles of certain classes of microorganisms can be captured by machine learning for fluxome predictions.

Limitations of machine learning
There are several major challenges regarding MFlux. First, the 13 C-MFA flux in literature may have errors and biases, which would be included in the learning/training process of MFlux and lead to further variations. For example, current 13 C-MFA studies are not evenly distributed among a broad scope of microbial genera. Most reported MFAs are concentrated in a few model microbial species or metabolism of only a few substrates (mainly glucose), and thus our current ML cannot predict fluxomes well in certain cases. Such problem (model bias) can be resolved after more 13 C-MFA papers for non-model species are included in the database and more constraints are implemented by our platform.
Second, the predictability of ML is limited to species and pathways that are already included in learning. More information and efforts are required to deal with cases of genetically modified strains with engineered pathways that hijack flux for synthesis of diverse commodity chemicals [13]. Currently, 13 C-MFA has not widely used by synthetic biology community yet. In future versions of MFlux, new metabolic knowledge and rules should be applied for flux corrections.
Third, it is still difficult to incorporate regulation mechanisms into the current model. For instance, various catabolite repression mechanisms regulate the cell fluxome in the presence of multiple substrates (e.g., glucose shows catabolite repression for fast growing E. coli when both glucose and glycerol are available, Fig 10) [71]. These hierarchy regulations among substrate utilization can be dependent on growth rates or can differ among microbial species (E. coli, Bacillus and Corynebacterium).
Fourth, when oxygen is not available, fast bacterial sugar utilization will activate mixed acid fermentation (e.g., by utilizing lactate dehydrogenase and pyruvate formate lyase) to produce complicated overflow metabolites [13]. This mechanism is also furnished in yeast and mammalian cells. However, 13 C-MFA studies on anaerobic metabolisms are much less frequent than on aerobic metabolisms. MFlux cannot predict the complicated patterns of overflow fluxes at this stage.
Fifth, our current dataset is still unable to support ML studies on phototrophic bacterial fluxomes. For phototrophic metabolism, its energy generation (ATP, NADH and NADPH) may not be controlled by substrate catabolism. Some phototrophic bacteria (e.g., cyanobacteria) have versatile autotrophic and photomixotrophic metabolism that is highly sensitive to light and substrate availability. Other phototrophs may even have CO 2 fixation pathway (such as the reversed TCA cycle). Therefore, our MFlux platform could not make ML predictions but only reports a general description of metabolic features of these species.
Lastly, ML cannot directly estimate fluxes for carbon sources which are not part of the learning dataset. To predict fluxomes for new substrates, users need to assume that similar entrypoints of carbon sources into the central metabolic network may cause similar flux distributions (e.g., sucrose has to be treated as a combination of glucose and fructose).

Conclusion
This proof-of-concept study demonstrates that AI methods can facilitate fluxomics research with reasonable precision. 13 C-MFA is a very small field of just hundreds of MFA research papers on microbial species published in the past two decades. In the long term, ML methods may solve this problem: with a large and reliable fluxomics dataset and more information from 13 C-MFA and AI scientists, the future MFlux model can make broad-scope metabolism predictions. To sum up, MFlux presents the first platform introducing ML in the field of fluxomics and it will be continuously updated and improved. It will inspire the development of similar computational tools to advance omics and metabolic engineering fields [72].
Supporting Information S1 Program. MFlux Computer Program (Source code). Python scripts in a ZIP file. (ZIP) S1 Table. Results of 20 case studies. Detailed information for 20 cases studies using MFlux, including literature references, input conditions, 13 C-MFA flux, the flux profiles predicted by ML, and the flux profiles predicted by MFlux with additional constraints. (XLSX) S2 Table. Detailed information of the comparison with FBA/pFBA. The information of constraints, objective function, and simulation results. (XLSX)