The authors have declared that no competing interests exist.
Conceived and designed the experiments: VS CW. Performed the experiments: VS. Analyzed the data: VS AM. Contributed reagents/materials/analysis tools: P. Rai DS. Wrote the paper: VS AM P. Rai JB P. Ravikumar DS CW.
A widely studied problem in systems biology is to predict bacterial phenotype from growth conditions, using mechanistic models such as flux balance analysis (FBA). However, the inverse prediction of growth conditions from phenotype is rarely considered. Here we develop a computational framework to carry out this inverse prediction on a computational model of bacterial metabolism. We use FBA to calculate bacterial phenotypes from growth conditions in
Research into metabolism and physiology generally tries to uncover how an organism's internal state is determined by the environment to which the organism is exposed. For example, one might ask which genes are up- or downregulated as microbes are grown on different nutrient sources
Do we expect physiology, and in particular the internal metabolic fluxes, to be predictive of the current environment? On the one hand, one could envision a scenario where an organism can only assume a small number of distinct metabolic states, and many diverse environments elicit the same physiological response. In other words, the mapping from environment to metabolism is many-to-one. Under this scenario, metabolism would not be particularly predictive of environment. On the other hand, each environment might elicit an entirely different metabolic response, i.e., the mapping from environment to metabolism is one-to-one. Under this scenario, organismal physiology can be considered an accurate reflection of the specific environment the organism resides in, and the environment can be predicted accurately from the metabolic state. In reality, we can expect the mapping between environment and metabolism to fall somewhere between these two extremes. While there are probably many different metabolic states an organism can assume, there will also be distinct environments that create similar metabolic responses.
Metabolic modeling approaches generally ask the forward question, i.e., how can we calculate the metabolic state as a function of the environment. For example, flux balance approaches calculate the metabolic fluxes in an organism as a function of input fluxes and the organism's metabolic network
Here, we asked whether the internal metabolic fluxes in an
Our overarching question is to what extent the internal metabolic fluxes in an
We use the
A biochemical network can be treated as a system that takes up the nutrients from the environment and converts them into useful metabolic precursors such as amino acids, nucleotides, and lipids. These environmental nutrients are brought into the cell via
Further, to make the task of predicting growth conditions from fluxes more difficult and more realistic, we introduced background impurities in all simulated environments. Each environment consisted of a set of primary metabolites (usually one carbon and one nitrogen source) plus a small quantity of randomly chosen other metabolites. We varied the number of metabolites serving as impurities to evaluate how sensitive the regression model was to the amount of chemical noise present in the environment. Impurities were selected at random from a set of 174 carbon and 78 nitrogen sources used previously with the
We first wanted to test how well prediction might perform in a best-case scenario. To this end, we selected seven carbon and seven nitrogen sources (
Carbon sources | Nitrogen sources |
D-glucose | Ammonia |
Pyruvate | Adenine |
Glycerol | Cytidine |
Acetate | Putrescine |
D-ribose | L-glycine |
D-fructose | L-alanine |
D-sorbitol | L-glutamine |
Growth condition | Replicates | Total observations | Impurities | Viable observations | Training data size | Test data size |
7C, 7N | 100 | 4900 | 1 C/N | 4893 | 489 | 2447 |
7C, 7N | 100 | 4900 | 5 C/N | 4860 | 486 | 2430 |
7C, 7N | 100 | 4900 | 10 C/N | 4836 | 483 | 2418 |
7C, 7N | 100 | 4900 | 1 C/N | 4893 | 1223 | 2447 |
7C, 7N | 100 | 4900 | 5 C/N | 4860 | 1215 | 2430 |
7C, 7N | 100 | 4900 | 10 C/N | 4836 | 1209 | 2418 |
7C, 7N | 100 | 4900 | 1 C/N | 4893 | 2446 | 2447 |
7C, 7N | 100 | 4900 | 5 C/N | 4860 | 2430 | 2430 |
7C, 7N | 100 | 4900 | 10 C/N | 4836 | 2418 | 2418 |
Maltose, 7N | 100 | 700 | 1 C/N | 695 | NA | 695 |
Maltose, 7N | 100 | 700 | 20 C/N | 699 | NA | 699 |
Cytosine, 7C | 100 | 700 | 1 C/N | 602 | NA | 602 |
Cytosine, 7C | 100 | 700 | 20 C/N | 700 | NA | 700 |
Excess N, normal C (7N, 7C) | 100 | 4900 | 1 C/N | 4865 | 2432 | 2433 |
Excess C, normal N (7C, 7N) | 100 | 4900 | 1 C/N | 4848 | 2424 | 2424 |
min. abs. flux (7C, 7N) | 100 | 4900 | 10 C/N | 4139 | 2069 | 2070 |
only C impurities (7C, 7N) | 100 | 4900 | 20 C | 4396 | 2198 | 2198 |
only N impurities (7C, 7N) | 100 | 4900 | 20 N | 4898 | 2449 | 2449 |
174C, 78N | 2 | 27144 | 1 C/N | 25140 | 12596 | 12544 |
Each row details the growth conditions used for flux balance analysis (FBA) and the sizes of training and test data sets for inverse prediction of growth conditions from simulated phenotypes.
Viable observations include only those observations with a biomass value above the viability threshold of 0.558.
Models were trained on the 7C, 7N data set with 1 C/N impurity, 2446 data points.
We considered two alternative approaches to prediction, joint prediction and separate prediction. Under joint prediction, we considered all 49 pairwise combinations of the seven carbon and seven nitrogen sources as distinct outcomes, and we trained a single model to predict one of those 49 possibilities. Under separate prediction, we trained two separate models, one for the seven carbon sources and one for the seven nitrogen sources. Overall, both prediction approaches worked quite well. Even at relatively high numbers of impurities, we could correctly identify the main carbon and nitrogen sources in over 80% of the cases (
For joint prediction, each data point corresponds to training/testing a new regression model. Similarly, for separate prediction, each data point corresponds to training/testing two separate new regression models. (A) The misclassification rate increases as the number of impurities increases. (B) The misclassification rate decreases as the size of the available training data increases. In all cases, separate prediction outperforms joint prediction.
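How a combined misclassification rate can be obtained from two separately trained models is sketched below. This is a minimal illustration, assuming that an observation counts as misclassified whenever the carbon source, the nitrogen source, or both are predicted incorrectly. The function names and the toy labels are hypothetical and are not taken from the study's analysis scripts.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of observations whose predicted label differs from the true one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have equal length")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def combined_rate(true_c, pred_c, true_n, pred_n):
    """Separate prediction: an observation is wrong if C or N (or both) is wrong."""
    true_pairs = list(zip(true_c, true_n))
    pred_pairs = list(zip(pred_c, pred_n))
    return misclassification_rate(true_pairs, pred_pairs)


# Toy labels (hypothetical, not data from the study).
true_c = ["glucose", "acetate", "glycerol", "pyruvate"]
pred_c = ["glucose", "pyruvate", "glycerol", "pyruvate"]
true_n = ["ammonia", "adenine", "cytidine", "ammonia"]
pred_n = ["ammonia", "ammonia", "cytidine", "adenine"]

rate_c = misclassification_rate(true_c, pred_c)           # carbon model alone
rate_n = misclassification_rate(true_n, pred_n)           # nitrogen model alone
rate_cn = combined_rate(true_c, pred_c, true_n, pred_n)   # either source wrong
```

Under this definition, the combined count can be smaller than the sum of the carbon and nitrogen counts, because an observation with both sources mispredicted is counted only once.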
To understand where the misclassifications are coming from, we plotted heatmaps that show the actual growth sources and the predicted sources at two different numbers of impurities (1 C/N and 10 C/N). At 10 C/N, a number of carbon sources are predicted as either acetate or pyruvate (
For each heat map, the actual C or N source is plotted along the
In a direct comparison, however, the separate prediction models always outperformed the joint prediction models (
Next, we examined the role of excess resources on prediction results. Above, we used the conventional maximum uptake rate of 20 mmol gDW⁻¹ hr⁻¹ that is generally used for carbon and nitrogen sources in FBA studies. To determine to what extent our results depended on this choice, we artificially increased the uptake rates of the carbon source to a maximum of 1000 mmol gDW⁻¹ hr⁻¹ while keeping the nitrogen source at the normal rate, and vice versa. These simulations can be considered as conditions of excess carbon (when maximal carbon uptake is artificially increased) or excess nitrogen (when maximal nitrogen uptake is artificially increased).
When predicting growth conditions from the final fluxes, we obtained results similar to those above, i.e., separate prediction performed better than joint prediction. For an artificially high uptake rate for nitrogen but a normal uptake rate for carbon sources, the misclassification rate with separate prediction was 10%, while the misclassification rate with joint prediction was 26%, at 1 C/N impurities and with a training data size of ∼2450 replicates. Separately predicting carbon resulted in 151 mispredictions, compared to 109 mispredictions for nitrogen. In combination, there were 250 mispredictions using separately trained models (observations in which both sources were mispredicted are counted once). Joint prediction resulted in 638 mispredictions. Similarly, for artificially high uptake rates for carbon sources and normal uptake rates for nitrogen sources, the misclassification rate under separate prediction was 3.8%, while the misclassification rate under joint prediction was 14%. Joint prediction resulted in 324 mispredictions. Separate prediction resulted in 94 combined mispredictions (64 for C and 32 for N). Clearly, prediction rates were better for separate prediction compared to joint prediction even with larger training data sizes, which was not the case for
Since individual prediction seemed to work well, we next tested whether we could use this approach to predict growth conditions chosen from the comprehensive list of 174 carbon and 78 nitrogen sources. Joint prediction in this case was infeasible, since we would have had to train a model to distinguish between
All the results presented so far were obtained with a simple maximization of the biomass reaction. FBA can also be carried out with different optimization functions, and the equilibrium fluxes that are found will depend on the specific optimization function chosen. To confirm whether our approach would work under different optimization schemes, we carried out additional simulations in which we maximized biomass and then subsequently minimized the absolute sum of fluxes, holding the maximal biomass value constant. Then we performed the regression analysis as described above. We carried out this analysis for the case of 7 distinct C and 7 distinct N growth substrates, 10 C/N impurities, and individual prediction of C and N sources. We found a combined misclassification rate of 23%, relative to 15% using only the biomass maximization. (See
The previous subsection has shown that a regularized regression model is capable of predicting the primary carbon and nitrogen sources used from steady-state metabolic fluxes. We next wanted to investigate how exactly the regularized regression model carries out this task. For each flux balance simulation, the resulting flux data set contains 1443 flux values, corresponding to 1443 reactions that are not transport reactions. One of these reactions is the biomass reaction, which we excluded from the regression modeling. Thus, we have 1442 predictor variables in the regression model. In this situation, a standard regression model would have to determine 1443 regression coefficients, one per reaction plus an intercept. By contrast, the regularized regression model we employed sets most regression coefficients to zero and retains only a small number of non-zero coefficients. (The exact number of non-zero coefficients is determined through the choice of a tuning parameter, which is selected by cross-validation. See
To gain mechanistic insight into predictive reactions, we mapped them onto the
Reactions in distinct parts of the metabolism are predictive for different carbon sources. A list of the predictive reactions can be found in
Reactions in distinct parts of the metabolism are predictive for different nitrogen sources. A list of the predictive reactions can be found in
At an impurity number of 1 C/N and using the largest training data size (see
We also analyzed how the regression model performed when some of the key predictive reactions were removed. As mentioned above, there were 100 unique reaction IDs for individual prediction of carbon and nitrogen sources at the lowest number of impurities and with the largest training data set analyzed. We eliminated these 100 reactions one at a time as predictors in the regression model, trained a new model separately for both the carbon and nitrogen sources, and calculated the prediction accuracy. We combined the results of individual predictions to calculate the prediction accuracy of the combination of the sources. With the exception of the reaction “glucose 6-phosphate isomerase” (PGI), the misclassification rate remained unchanged when we eliminated any of the other 99 reactions before model fitting. PGI catalyzes a reaction that produces fructose-6-P from glucose-6-P, and knock-out of the PGI gene causes diminished growth rate
Finally, we asked whether we could predict growth substrates simply on the basis of flux through the entry points of the metabolites into the metabolic network, i.e., based on the first set of reactions past transport. For the 7 C and 7 N substrates we considered, there are 38 such post-transport reactions (
Substrate | Reactions |
D-glucose | PGI, G6PDH2r, PGMT |
Pyruvate | PDH, LDH, PYK, ME1, ME2 |
Glycerol | GLYCDx, GLYK |
Acetate | ACKr, ACS |
D-ribose | RBK, RPI, TKT1 |
D-fructose | FRUK, F6PA, PFK |
D-sorbitol | SBTPD |
Ammonia | ALLTAMH, DAPAL, HMBS, SADH |
Adenine | ADD, ADPT |
Cytidine | CYTD, CYTDH, CYTDK2 |
Putrescine | GGPTRCS, PTRCTA, SPMS |
L-glycine | GLYAT, GLYCL, GLYTRS |
L-alanine | ALAR, ALATA_L, ALATA_L2 |
L-glutamine | GLNX, GLUN, GLUSy |
Instead of using all internal reactions in the regression model, we also considered a regression model that contained only the post-transport reactions listed here.
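Restricting the predictors to post-transport reactions amounts to selecting a subset of columns from the flux matrix before model fitting. The sketch below illustrates this with plain Python lists. The reaction IDs in `POST_TRANSPORT` are copied from two rows of the table above; the function name, the toy flux values, and the two extra reaction IDs (`MDH`, `CS`) are assumptions made for illustration only.

```python
# Post-transport reaction IDs for two substrates, copied from the table above.
POST_TRANSPORT = {
    "D-glucose": ["PGI", "G6PDH2r", "PGMT"],
    "Acetate": ["ACKr", "ACS"],
}


def select_predictors(reaction_ids, flux_rows, keep):
    """Return (ids, rows) restricted to the reactions named in `keep`."""
    keep = set(keep)
    cols = [i for i, rid in enumerate(reaction_ids) if rid in keep]
    kept_ids = [reaction_ids[i] for i in cols]
    kept_rows = [[row[i] for i in cols] for row in flux_rows]
    return kept_ids, kept_rows


# Toy flux matrix (hypothetical values): one row per simulation,
# one column per reaction. MDH and CS stand in for internal reactions
# that are not on the post-transport list.
reaction_ids = ["PGI", "MDH", "ACS", "CS"]
flux_rows = [
    [4.9, 7.1, 0.0, 3.2],
    [0.0, 0.0, 9.8, 5.5],
]

keep = [rid for rids in POST_TRANSPORT.values() for rid in rids]
kept_ids, kept_rows = select_predictors(reaction_ids, flux_rows, keep)
```

Reactions on the list that are absent from the flux matrix are simply ignored, so the same list can be applied to any subset of the model's reactions.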
Next, we wanted to determine how the prediction would perform on previously unseen carbon or nitrogen sources. We first obtained simulated flux measurements using maltose as the carbon source and using each of the seven nitrogen sources used earlier. We generated simulated flux data for 100 replicates and at impurity numbers of 1 C/N and 20 C/N. This resulted in 700 observations. After eliminating replicates with very low biomass (see
For 20 C/N impurities, there were 699 viable flux measurements. At this amount of chemical noise, maltose was predicted as glucose 68% of the time, while the correct nitrogen source was predicted 81% of the time. For both low and high numbers of impurities, separate prediction seems to outperform joint prediction. Further, separate prediction is more likely to correctly predict all the known growth sources while assigning the unknown ones to their nearest known compound.
Next, we ran simulations to test how an unseen nitrogen source is predicted by the above models. For this, we used cytosine as the nitrogen source and each of the 7 carbon sources used earlier. Note that cytosine is one of the 4 bases found in DNA and RNA. We used 2 numbers of impurities, 1 C/N and 20 C/N, and we generated 100 replicates for each case for testing. At 1 C/N, there were 602 viable flux measurements, i.e., for these measurements biomass was greater than the threshold used in this study. Interestingly, all 98 non-viable flux measurements were for Cytosine + Acetate sources. For the viable flux measurements, only 5 carbon sources were wrongly predicted (∼1% misclassification). Interestingly, in all cases, the nitrogen source cytosine was predicted as ammonia. This result may be due to a reaction that directly liberates the exocyclic amine of cytosine as ammonia.
At 20 C/N impurities, all 700 flux measurements were viable (biomass greater than threshold). In this case, 27 carbon sources were incorrectly predicted (∼4% misclassification rate). The nitrogen source cytosine was predicted as ammonia in 78.8% of cases and as adenine in all other cases.
We have developed a method for making predictions regarding bacterial growth conditions from known simulated metabolic fluxes. We generated fluxes using the complete
It was surprising to us that, given the same number of observations in the training set, separate prediction of nutrients always performed better than joint prediction. There are two likely explanations for this result. First, making joint predictions requires discriminating between 49 different pairwise combinations. By contrast, making individual predictions only requires discriminating 7 different conditions in two different sets. Thus, one possible explanation for the lack of predictive power is that we simply did not have the appropriate level of training data. Indeed, adjusting the amount of training data appears to have a dramatic effect on joint prediction in particular (
Although the background chemical noise can have a dramatic effect on model accuracy, the misclassification rate remained acceptably low even with 10 randomly picked C/N impurities. The addition of these impurities revealed one interesting and unexpected physiological hypothesis about the
To verify that the observed default carbon-source misclassification was not an artifact of nutrient limitation (carbon versus nitrogen), we set the uptake rate of the carbon source artificially high while keeping the nitrogen source at its normal uptake rate, and vice versa. This ensures limiting conditions for one source and non-limiting conditions for the other. These simulations did not alter our earlier conclusion that separate prediction performs better than joint prediction.
In addition to these simulations, we carried out three further analyses. First, instead of using all the metabolic reactions in the iAF1260 model, we used only the post-transport reactions in the regression model, as the earlier analysis had suggested that the key reactions for a growth substrate often seemed to be at the substrate's entry point into the metabolic network. However, this approach led to poorer predictions than did the approach of initially using all metabolic reactions in the model and letting the LASSO technique select the predictive ones. This finding confirms that the growth substrates are non-trivially encoded in the internal fluxes of the metabolic network. Second, instead of using equal numbers of carbon and nitrogen impurities, we also considered using only carbon or only nitrogen impurities. When only carbon (or nitrogen) impurities were present, prediction of nitrogen (or carbon) sources had higher sensitivity than when there was a mixture of C/N impurities. Finally, we considered an alternative optimization protocol where we minimized the absolute sum of fluxes on the FBA solution obtained by maximizing biomass. Under this protocol, prediction accuracy was somewhat lowered relative to our default protocol. However, prediction remained possible at accuracies well above random guessing.
Our regression model had a relatively large feature space (1442 reactions) compared to the number of observations used to train the model (∼480 to ∼2450). Therefore, efficient feature reduction was crucial to obtain reliable models. We prevented over-fitting during feature selection by employing regularized regression via the LASSO
Our work is conceptually related to the work by Brandes et al.
One shortcoming of our statistical approach to predicting growth conditions is that it cannot predict previously unseen nutrients, i.e., carbon or nitrogen sources that were not used in the training data set. Nevertheless, we found that our regression model made reasonable choices, such as predicting the previously unseen maltose (a disaccharide consisting of two glucose molecules) as glucose. In this context, it is comforting that separate prediction generally outperformed joint prediction, since separate prediction was much more robust to previously unseen nutrients. In particular, a previously unseen carbon source did not substantially negatively affect prediction of a previously seen nitrogen source and vice versa.
Throughout this work, we have used basic flux balance analysis to predict the bacterial phenotype. In principle, one could use more realistic models that integrate regulatory information and/or signalling-pathway information with flux balance analysis techniques
We have found that predicting growth conditions from simulated metabolic flux data is a computationally tractable problem. Of note, our data indicate that separately predicting carbon and nitrogen sources performs better than jointly predicting them from paired input. Although this result is to some extent influenced by the volume of training data, it very likely reflects the structure of the metabolic reactions in the
We carried out flux balance analysis (FBA) using the COBRA toolbox
After setting up the constraints on transport fluxes as described in the previous paragraph, we carried out FBA using biomass as the objective function. FBA finds fluxes that are consistent with the given constraints and that maximize the objective function. The FBA was performed with the function
To understand if the choice of objective function affects the prediction of growth substrates, we also carried out simulations with an additional level of constraints. The additional constraint we imposed was minimization of the absolute sum of fluxes on the solution obtained from prior biomass maximization. This analysis was carried out using the function
We initially carried out simulations on 49 growth conditions consisting of all pairwise combinations of 7 carbon and 7 nitrogen sources (
For any given growth environment we simulated, we set the lower bound of the exchange reactions corresponding to the carbon and nitrogen sources present to −20 mmol gDW⁻¹ hr⁻¹. This lower bound is commonly used in many studies
For conditions with excess carbon or nitrogen, we increased the maximum uptake rate of one source while keeping the other one fixed. Thus, we changed the lower bounds (uptake rate) of carbon sources to −1000 mmol gDW⁻¹ hr⁻¹ while keeping the lower bounds of nitrogen sources at −20 mmol gDW⁻¹ hr⁻¹, and vice versa.
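The bound settings for normal and excess conditions can be summarized in a small helper. This is a sketch in plain Python, independent of any FBA toolbox; the function name and the exchange-reaction IDs are placeholders rather than identifiers from the model, and only the two bound values are taken from the text.

```python
NORMAL_UPTAKE = -20.0    # mmol gDW^-1 hr^-1, lower bound stated in the text
EXCESS_UPTAKE = -1000.0  # lower bound for the source supplied in excess


def exchange_lower_bounds(carbon_ex, nitrogen_ex, excess=None):
    """Map exchange-reaction IDs to lower bounds for one growth environment.

    excess: None (both sources at the normal rate), "carbon", or "nitrogen".
    """
    if excess not in (None, "carbon", "nitrogen"):
        raise ValueError("excess must be None, 'carbon', or 'nitrogen'")
    return {
        carbon_ex: EXCESS_UPTAKE if excess == "carbon" else NORMAL_UPTAKE,
        nitrogen_ex: EXCESS_UPTAKE if excess == "nitrogen" else NORMAL_UPTAKE,
    }


# Exchange-reaction IDs below are placeholders, not taken from the source.
normal = exchange_lower_bounds("EX_carbon", "EX_nitrogen")
excess_c = exchange_lower_bounds("EX_carbon", "EX_nitrogen", excess="carbon")
excess_n = exchange_lower_bounds("EX_carbon", "EX_nitrogen", excess="nitrogen")
```

The resulting dictionary would then be applied to the model's exchange reactions before running FBA; a more negative lower bound permits a higher maximal uptake.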
Finally, we also carried out simulations on all pairwise combinations of all 174 carbon and 78 nitrogen sources previously used in Feist et al.
To make the simulation scenario more challenging and more realistic, we incorporated different numbers of impurities (chemical noise) into the simulated growth media. For this, we used a subset of the 174 carbon and 78 nitrogen sources previously used in Feist et al.
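Random selection of impurities can be sketched as follows. The sketch reads an impurity level of "k C/N" as k carbon plus k nitrogen impurities, and it assumes that the primary sources are excluded from the impurity draw; both points are assumptions, as are the function name and the placeholder metabolite names. The uptake bound assigned to impurities is not specified here and is therefore left out.

```python
import random


def sample_impurities(carbon_pool, nitrogen_pool, primary_c, primary_n, k, rng):
    """Pick k carbon and k nitrogen impurities, excluding the primary sources."""
    c_candidates = [m for m in carbon_pool if m != primary_c]
    n_candidates = [m for m in nitrogen_pool if m != primary_n]
    # rng.sample draws without replacement, so impurities are distinct.
    return rng.sample(c_candidates, k), rng.sample(n_candidates, k)


# Placeholder pools standing in for the 174 carbon and 78 nitrogen sources.
carbon_pool = [f"C{i}" for i in range(174)]
nitrogen_pool = [f"N{i}" for i in range(78)]

rng = random.Random(0)  # fixed seed so the example is reproducible
c_imp, n_imp = sample_impurities(carbon_pool, nitrogen_pool, "C0", "N0", 5, rng)
```

Passing an explicit `random.Random` instance keeps each replicate's draw reproducible without touching the global random state.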
For all the results described above, we used a biomass threshold to filter out non-viable flux measurements. We calculated this threshold value using biomass measurements at the lowest number of impurities (1 C/N), using all pairwise combinations of the 7 carbon and 7 nitrogen sources chosen for the first analysis, and with the largest training dataset size (∼2450 replicates). We recorded the biomass values for all these simulations and set the lower threshold of viability to three standard deviations below the mean, which came out to 0.558.
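The threshold rule described here (mean minus three standard deviations) can be sketched with the Python standard library. The function names and toy values are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev


def viability_threshold(biomass_values, n_sd=3.0):
    """Return the mean minus n_sd sample standard deviations."""
    return mean(biomass_values) - n_sd * stdev(biomass_values)


def filter_viable(observations, threshold):
    """Keep (biomass, fluxes) pairs whose biomass is above the threshold."""
    return [obs for obs in observations if obs[0] > threshold]


# Toy biomass values (hypothetical, not from the study).
biomass = [1.0, 1.2, 0.8, 1.1, 0.9]
threshold = viability_threshold(biomass)

# Filtering with the threshold value reported in the text (0.558).
observations = [(0.9, [1.0, 2.0]), (0.3, [0.1, 0.0])]
viable = filter_viable(observations, 0.558)
```

In this toy example only the first observation survives the filter, since its biomass of 0.9 exceeds 0.558 while 0.3 does not.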
We predicted growth conditions using regularized multinomial logistic regression, as implemented in the GLMNET package
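For reference, the model class named here can be written out explicitly. The following is the standard form of L1-penalized (LASSO) multinomial logistic regression, given as a sketch rather than a statement of the exact settings used. The symbols are introduced for illustration: $x_i$ is the vector of $p = 1442$ flux predictors for observation $i$, $g_i$ its true class, $K$ the number of classes (7 for separate prediction, 49 for joint prediction), and $\lambda$ the tuning parameter chosen by cross-validation.

```latex
\Pr(G = k \mid x) =
  \frac{\exp\!\left(\beta_{0k} + x^{\top}\beta_k\right)}
       {\sum_{l=1}^{K} \exp\!\left(\beta_{0l} + x^{\top}\beta_l\right)},
\qquad k = 1, \dots, K

\min_{\{\beta_{0k},\, \beta_k\}_{k=1}^{K}}
  \; -\frac{1}{N} \sum_{i=1}^{N} \log \Pr(G = g_i \mid x_i)
  \; + \; \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \left|\beta_{jk}\right|
```

The absolute-value penalty is what drives most coefficients $\beta_{jk}$ to exactly zero, so that only a small subset of reactions is retained as predictors for each class.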
After filtering for biomass, for each number of impurities, we used half of the dataset as test set. We used subsets of the remaining half as training sets (i.e., ∼245, ∼490, ∼2450 observations). On the training sets we fitted regularized regression models via 3-fold cross validation, using the function
To guarantee that the LASSO model would converge, we imposed a minimum threshold of 10⁻⁶ on the magnitude of all flux values. Absolute flux values below the threshold were set to zero before fitting the LASSO model.
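The flux thresholding step is simple enough to state as code. The sketch below is a minimal illustration; the function name is hypothetical, and the example values are made up.

```python
FLUX_FLOOR = 1e-6  # minimum absolute flux magnitude stated in the text


def zero_small_fluxes(fluxes, floor=FLUX_FLOOR):
    """Set fluxes whose absolute value is below `floor` to exactly zero."""
    return [0.0 if abs(v) < floor else v for v in fluxes]


# Toy flux vector (hypothetical values).
cleaned = zero_small_fluxes([2.5, -3e-9, 1e-6, -0.4, 7e-7])
```

Values exactly at the floor are kept, because only magnitudes strictly below the threshold are zeroed. The sign of each retained flux is preserved.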
All raw data and analysis scripts are available online in the form of a git repository at
We thank members of the Segrè lab for helpful discussions on flux balance analysis. We thank Dakota Derryberry for a critical reading of the manuscript. The Bioinformatics Consulting Group and the Texas Advanced Computing Center (TACC) at UT provided high-performance computing resources.