Phenomenological Model for Predicting the Catabolic Potential of an Arbitrary Nutrient

The ability of microbial species to consume compounds found in the environment to generate commercially-valuable products has long been exploited by humanity. The untapped, staggering diversity of microbial organisms offers a wealth of potential resources for tackling medical, environmental, and energy challenges. Understanding microbial metabolism will be crucial to many of these potential applications. Thermodynamically-feasible metabolic reconstructions can be used, under some conditions, to predict the growth rate of certain microbes using constraint-based methods. While these reconstructions are powerful, they are still cumbersome to build and, because of the complexity of metabolic networks, it is hard for researchers to gain from these reconstructions an understanding of why a certain nutrient yields a given growth rate for a given microbe. Here, we present a simple model of biomass production that accurately reproduces the predictions of thermodynamically-feasible metabolic reconstructions. Our model makes use of only: i) a nutrient's structure and function, ii) the presence of a small number of enzymes in the organism, and iii) the carbon flow in pathways that catabolize nutrients. When applied to test organisms, our model allows us to predict whether a nutrient can be a carbon source with an accuracy of about 90% with respect to in silico experiments. In addition, our model provides excellent predictions of whether a medium will produce more or less growth than another () and good predictions of the actual value of the in silico biomass production.


Data controls
In what follows, we describe how we modified the ATP hydrolysis in the in silico organisms so that biomass production reflects nutrient catabolism, and how we identified true biomass production from the results of flux balance analysis (FBA) experiments.
ATP maintenance. Some of the in silico organisms were constructed to directly predict the in vivo growth rate, which is typically lower than the predicted biomass yield. The growth rate of an organism is related to the biomass yield, but it is a kinetic property that is affected by a variety of factors such as nutrient uptake, regulation of protein expression, and temperature. The biomass yield is dimensionless and reflects the efficiency with which the organism uses the nutrients available [1]. The authors of the in silico organisms model the growth rate by forcing the in silico organism to hydrolyze a fixed amount of extra ATP, known as ATP maintenance. The ATP maintenance is typically calculated by fitting growth rate data of the organism on a single medium [2]. As we are only modeling biomass yield, we remove ATP maintenance from the in silico organisms as follows.
There are two types of maintenance: growth-associated maintenance (GAM) and non-growth-associated maintenance (NGAM). For NGAM, the hydrolysis of ATP is added as a separate reaction, which merely changes the lower bound of nutrient uptake required for biomass production to begin (see the section on $b^{s}_{\min}$ below). If this reaction is present, we turn it off by constraining its flux to zero.
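As an illustration of the NGAM step, the following is a minimal sketch that assumes a model is represented as a plain dictionary of flux bounds. The reaction identifier `ATPM`, the helper `disable_reaction`, and all numeric bounds are hypothetical and are not taken from any of the published reconstructions.

```python
from typing import Dict, Tuple

# reaction id -> (lower, upper) flux bound
Bounds = Dict[str, Tuple[float, float]]


def disable_reaction(bounds: Bounds, reaction_id: str) -> Bounds:
    """Return a copy of `bounds` with `reaction_id` forced to carry zero flux.

    If the reaction is absent, the bounds are returned unchanged, which
    mirrors turning the reaction off only "if it is present".
    """
    updated = dict(bounds)
    if reaction_id in updated:
        updated[reaction_id] = (0.0, 0.0)
    return updated


# Hypothetical toy model: identifiers and values are illustrative only.
toy_bounds: Bounds = {
    "ATPM": (8.0, 8.0),          # fixed NGAM demand (made-up value)
    "EX_nutrient": (-1.0, 0.0),  # uptake written as a negative flux
}
no_ngam = disable_reaction(toy_bounds, "ATPM")
```

Returning a copy rather than editing in place keeps the published bounds available for comparison.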
For GAM, the hydrolysis of ATP is incorporated into the biomass function, and this affects the overall growth rate of the in silico organism. However, the ATP hydrolysis in the biomass function accounts for both GAM and the ATP needed to polymerize protein and DNA. The energetic cost of polymerizing biomass components is important for biomass yield, so we cannot remove the ATP hydrolysis from the biomass function entirely.
To solve this problem, we calculate the ATP needed for the polymerization of biomass components using the experimentally determined values published for E. coli [2,3]. The published biomass function for each in silico organism contains the stoichiometry of protein, DNA, and RNA, so it is straightforward to calculate the stoichiometric coefficient for ATP hydrolysis needed for polymerization costs alone. We replace the published coefficient for ATP hydrolysis with the calculated value. For E. coli and S. cerevisiae, the ATP needed for polymerization costs was already available [2,4,5].
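The polymerization-only coefficient described above can be sketched as a weighted sum over macromolecule classes. This is a sketch under stated assumptions: the function name `polymerization_atp`, the units (mmol of monomer per gram dry weight), and every number below are placeholders chosen for illustration. They are not the experimentally determined E. coli values cited in [2,3].

```python
from typing import Mapping


def polymerization_atp(
    monomer_amounts: Mapping[str, float],
    atp_per_monomer: Mapping[str, float],
) -> float:
    """ATP hydrolysis coefficient that covers polymerization costs only.

    Sums (amount of monomer in the biomass function) x (ATP hydrolyzed per
    monomer polymerized) over the macromolecule classes.
    """
    missing = set(monomer_amounts) - set(atp_per_monomer)
    if missing:
        raise KeyError(f"no ATP cost given for: {sorted(missing)}")
    return sum(
        amount * atp_per_monomer[cls]
        for cls, amount in monomer_amounts.items()
    )


# Placeholder numbers, NOT the published E. coli values.
monomers = {"protein": 5.0, "rna": 0.5, "dna": 0.1}  # assumed mmol per gDW
costs = {"protein": 4.0, "rna": 0.5, "dna": 1.0}     # assumed ATP per monomer
atp_coefficient = polymerization_atp(monomers, costs)
```

The resulting value would then replace the published ATP hydrolysis coefficient in the biomass function.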
Source of carbon. Consider a species $s$ growing on nutrient $i$. Using FBA, we find the optimal biomass produced, $b^{s}_{i}$. Often $b^{s}_{i} > 0$, but to decide whether this can be considered true biomass production, we have to take into account the following issues:
• H. pylori and M. tuberculosis present the unusual case of having nutrients in the minimal medium with which they can already produce biomass in the absence of any additional nutrients. This means that we will observe biomass production for any nutrient we test, even if the nutrient cannot be catabolized. We counter this by introducing a minimal biomass production, $b^{s}_{\min}$. Any biomass production in excess of $b^{s}_{\min}$ is then attributable to the nutrient we are testing.
• In the latest reconstruction of E. coli, there are several nutrients for which Feist et al. considered the resulting biomass to be too small [2]. Additionally, when considering an organism that has $b^{s}_{\min} > 0$, some nutrients such as pyrimidines will produce biomass above $b^{s}_{\min}$ even though they are not catabolized. The reason is that these nutrients are incorporated directly into the biomass rather than catabolized, which saves the organism carbon and energy and results in a larger biomass production. For these reasons, we estimated the minimal biomass production threshold $b_{\mathrm{cat}}$ beyond which biomass production is attributable to a nutrient being catabolized. We set $b_{\mathrm{cat}} = 0.008$, and assume that nutrients for which $b^{s}_{i} - b^{s}_{\min} < b_{\mathrm{cat}}$ are not catabolized.
In building our model, our first concern is to determine whether a nutrient can be a source of carbon. Therefore, we reduce the biomass production $b^{s}_{i}$ to a binary observation $\alpha^{s}_{i}$ such that:
$$\alpha^{s}_{i} = \begin{cases} 1 & \text{if } b^{s}_{i} - b^{s}_{\min} \geq b_{\mathrm{cat}}, \\ 0 & \text{otherwise.} \end{cases}$$
In the reconstructions of E. coli and H. pylori, there are five nutrients available that were added as sinks for metabolites that had been observed to accumulate in silico [2,6]. We do not consider these nutrients in our analysis.
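The reduction to a binary observation, together with the exclusion of sink nutrients, can be sketched as follows. The threshold 0.008 is the value quoted in the text. The function name `carbon_source_flags`, the nutrient labels, and the biomass values are hypothetical, and treating the boundary case as catabolized (a `>=` comparison) is an assumption that follows from the stated "<" condition for non-catabolized nutrients.

```python
from typing import Dict, Iterable

B_CAT = 0.008  # catabolic threshold quoted in the text


def carbon_source_flags(
    biomass: Dict[str, float],
    b_min: float,
    excluded: Iterable[str] = (),
    b_cat: float = B_CAT,
) -> Dict[str, int]:
    """Map each nutrient to 1 if b_i - b_min >= b_cat, else 0.

    Nutrients listed in `excluded` (for example, sink nutrients) are dropped
    from the result entirely.
    """
    skip = set(excluded)
    return {
        nutrient: int(b - b_min >= b_cat)
        for nutrient, b in biomass.items()
        if nutrient not in skip
    }


# Illustrative values only; not results from any reconstruction.
toy_biomass = {"glucose": 0.09, "cytosine": 0.015, "sink_x": 0.5}
flags = carbon_source_flags(toy_biomass, b_min=0.01, excluded=["sink_x"])
```

For an organism whose minimal medium alone yields no biomass, `b_min` would simply be 0.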
The biomass reaction for S. aureus was derived from the same source as that of B. subtilis, but it was designed so that its demand in the number of carbons is approximately 100-fold higher than that of B. subtilis. If we reduce the observation to a binary variable, this should not matter, even if the resulting biomass produced is 100-fold smaller. However, for some nutrients, FBA fails to reach a solution because the maximum nutrient uptake value of −1 is too small. We counter this by using a maximum nutrient uptake value of −100 in S. aureus.
Figure S1. Redundancy of nutrient-pathway membership. (A) The overlap for pairs of groups. Many nutrients are found in more than one KEGG pathway, and also in more than one group of pathways. This redundancy is depicted here as group-group overlap. For each pair of groups, we count the number of nutrients that are members of pathways found in both groups, and normalize by the number of nutrients in the smaller group. We only perform this analysis on nutrients that are not classified as G or NG based on chemical structure and function (see main text). (B) Average overlap for individual groups. We find that Amino acid metabolism (AA) has the highest average overlap, sharing many nutrients with other groups. This finding is consistent with the notion that AA is central to metabolism. Key: AA=Amino acid metabolism; C=Carbohydrate metabolism; oAA=Metabolism of other amino acids; CoV=Metabolism of cofactors and vitamins; N=Nucleotide metabolism; E=Energy metabolism; L=Lipid metabolism.
Figure S2. Model selection. We generate logistic models according to the pathways listed in Supplemental Table S6. We focus first on the G nutrients (15 pathways) and only then add the 11 pathways comprising mostly NG nutrients. Note how for N > 7 the values of the information criteria start increasing, indicating that the model is overfitting the data.
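The group-group overlap described in the Figure S1 caption (shared nutrients divided by the size of the smaller group) can be sketched as below. The group memberships are made up for illustration, the function names are hypothetical, and computing the per-group average as the mean over all pairs involving that group is an assumption about how panel B was obtained.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def group_overlap(groups: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Overlap for each pair of groups: |A & B| / min(|A|, |B|)."""
    overlaps = {}
    for a, b in combinations(sorted(groups), 2):
        smaller = min(len(groups[a]), len(groups[b]))
        if smaller == 0:
            continue  # overlap is undefined when a group is empty
        overlaps[(a, b)] = len(groups[a] & groups[b]) / smaller
    return overlaps


def average_overlap(groups: Dict[str, Set[str]]) -> Dict[str, float]:
    """Mean pairwise overlap of each group with every other group."""
    pairwise = group_overlap(groups)
    averages = {}
    for name in groups:
        values = [v for pair, v in pairwise.items() if name in pair]
        if values:
            averages[name] = sum(values) / len(values)
    return averages


# Made-up memberships; group labels follow the key in the caption.
toy_groups = {
    "AA": {"n1", "n2", "n3"},
    "C": {"n2", "n4"},
    "N": {"n3"},
}
pairwise = group_overlap(toy_groups)
per_group = average_overlap(toy_groups)
```

Normalizing by the smaller group means a small group fully contained in a larger one scores an overlap of 1.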

Supplementary Tables
Supplemental Table S1: Minimal media for in silico organisms.
The presence of a symbol indicates that a species takes up the nutrient in question. The symbol type indicates whether a nutrient's uptake is constrained: (✓) the nutrient's uptake is unconstrained; (n) the nutrient's uptake is constrained by n. Such nutrients contain organic carbon, and the value of n is taken from the biomass function.
In the case of cytosine in S. aureus, n is the sum of the CMP and dCMP requirements in the biomass. M. barkeri respires anaerobically and thus does not have oxygen in its minimal medium. In addition, M. barkeri's source of sulfur is $\mathrm{SO_3^{2-}}$ and its terminal electron acceptor is $\mathrm{H_2S}$.
The presence of a symbol indicates that a species takes up the nutrient in question. The symbol type indicates whether the nutrient taken up can be a source of carbon in the species in question: (✓) source of carbon, (x) not a source of carbon.
[Table body not recovered. Surviving row labels: Ferrichrome, Glutathione, Heme, Hemin, Sialic acid, Pantothenate.]
The simple nutrients that compose each complex nutrient are listed here for all the species that take up the complex nutrient. The absence of a symbol indicates that the species does not take up the nutrient in question.
(x) The species takes up the nutrient but does not catabolize it; (•) the species takes up the nutrient but does not catabolize all of the simple nutrients that compose it; (✓) the species takes up the nutrient and catabolizes all of its components.
†: meso-2,6-diaminopimelate.
[Table body not recovered. Surviving row labels: Riboflavin deg., Acetyl-cystine bimane, Bimane, Citrate-Mg.]
The presence of a symbol indicates that a species takes up the nutrient in question. The symbol type indicates whether the nutrient taken up can be a source of carbon in the species in question: (✓) source of carbon, (x) not a source of carbon.

[Table body not recovered. Column headers: Name, Bs, Ec, Mb, Sc. Surviving row labels: Adenine, Guanine, Xanthine, Hypoxanthine, Urate.]
Table S7: G and NG nutrients in pathways considered for the logistic model.
The data in this table complement Figure S2.
Table S7. G and NG nutrients in candidate pathways.