Population FBA predicts metabolic phenotypes in yeast

Using protein counts sampled from single cell proteomics distributions to constrain fluxes through a genome-scale model of metabolism, Population flux balance analysis (Population FBA) successfully described metabolic heterogeneity in a population of independent Escherichia coli cells growing in a defined medium. We extend the methodology to account for correlations in protein expression arising from the co-regulation of genes and apply it to study the growth of independent Saccharomyces cerevisiae cells in two different growth media. We find the partitioning of flux between fermentation and respiration predicted by our model agrees with recent 13C fluxomics experiments, and that our model largely recovers the Crabtree effect (the experimentally known bias among certain yeast species toward fermentation with the production of ethanol even in the presence of oxygen), while FBA without proteomics constraints predicts respirative metabolism almost exclusively. The comparisons to the 13C study showed improvement upon inclusion of the correlations and motivated a technique to systematically identify inconsistent kinetic parameters in the literature. The minor secretion fluxes for glycerol and acetate are underestimated by our method, which indicate a need for further refinements to the metabolic model. For yeast cells grown in synthetic defined (SD) medium, the calculated broad distribution of growth rates matches experimental observations from single cell studies, and we characterize several metabolic phenotypes within our modeled populations that make use of diverse pathways. Fast growing yeast cells are predicted to perform significant amount of respiration, use serine-glycine cycle and produce ethanol in mitochondria as opposed to slow growing cells. We use a genetic algorithm to determine the proteomics constraints necessary to reproduce the growth rate distributions seen experimentally. We find that a core set of 51 constraints are essential but that additional constraints are still necessary to recover the observed growth rate distribution in SD medium.

k cat was always selected, preferably from experiments expressing wild type S. cerevisiae proteins, but when one was not available, mutants and other species were allowed. Finally, for proteins which did not have a turnover rate in BRENDA, the highest available for a wild-type yeast enzyme, 38, 000, was chosen so as to avoid over-constraining the model.

A1.2 Noise Properties of Proteins in S. cerevisiae
Variability in protein copy number has been investigated previously in single cell fluorescence studies of yeast and E. coli [8,9]. Both studies observed two distinct regimes of noise behavior based on the mean copy number. Noise in low protein copy number proteins dropped hyperbolically with mean copy number; this noise has been described as intrinsic noise arising from the inherent stochastic nature of gene expression. Higher copy number proteins exhibit an approximately constant level of noise; this plateau is labeled associated with extrinsic sources arising from variability in common factors involved in gene expression like RNA polymerase or ribosomes. We observe similar behavior in noise measured by Dénervaud et. al., (Fig A1). Proteins with mean copy numbers below approximately 1,000 show a decrease in their noise with mean copy number while proteins expressed at levels higher than approximately 1,000 have fairly constant noise.

A1.3 Reliability of mRNA Microarray Correlation Data
We considered two ways of determining the reliability of the mRNA microarray-based correlation data we used. The first was to determine if the correlations exhibit discernible behavioral traits, such as anti-correlation between genes associated with fermentation and respiration. To see if this was reflected in the correlation data, we considered the glucose transporter HXT1. As seen in Fig A11, we found negative correlations between HXT1 and several genes associated with the TCA cycle and oxidative phosphorylation. We also see positive correlation between HXT1 and ethanol fermentation genes, as would be expected in a Crabtree-positive yeast strain. The second way we evaluated the reliability of our correlation data was by comparing it to experimentally established regulatory links in yeast [10]. To perform this comparison, we devised a distance metric (Equation A1): where Regs i,j represents the number of common transcription factors regulating genes i and j, and δ(x) represents the Kronecker delta function (1 if the argument is 0, and 0 otherwise) This metric is based on the idea that if two genes share a common transcription factor, their expression should be correlated, either positively or negatively depending on the regulatory relationship between transcription factor and the genes. The first term penalizes gene pairs with large numbers of shared transcription factors but weak correlation, while the second penalizes gene pairs with large correlation but no shared transcription factors. If the correlations generally reflect the known regulatory links, the distance metric will be smaller. To see if this is true we compared the distance metric obtained for correlations from actual expression data [11] with the distance metric obtained for correlations from randomized expression data. As seen in Fig A12 top, the distance metric for the actual expression data is significantly smaller than the distribution of distance metrics (Fig A12 bottom) obtained by randomizing expression data.

A1.4 Extended Methodology: Genetic Algorithm for Constraint Selection
Main Methods: A new procedure for filtering overly-constraining turnover rates based on the Micro Genetic Algorithm (GA) formalism was developed [12]. This method utilizes an entire growth distribution as a target for optimizing the selection of experimental constraints.Micro Genetic Algorithm was chosen instead of a "regular" Genetic Algorithm solely for computational cost concerns. In a "regular" GA algorithm in dozens to hundreds of genomes would have to be simulated at each generation, and several hundred generations could need to be evaluated to reach the same results. The computational cost would be extremely higher as compare to our GA implementation. In our attempt to reduce the size of search space we have restricted GA variables to binary values representing weather to use a particular k cat or 38,000 s −1 rather than more flexible values k cat can take in the doubling procedure. Briefly, a population of 10 "genomes" was simulated, each one composed of a list of "genes" that indicated if a protein's k cat would be kept at its original value, or if it would be raised to 38, 000 s −1 .
The original k cat values are either obtained from BRENDA or from literature (See Table A3 and SI File S1). The genomes were allowed to evolve by exchanging information, and each new generation was created by a random selection of solutions biased by their fitness, while always taking the best solution to the next generation (see SI Section Extended Methodology: Genetic Algorithm for Constraint Selection for details). The fitness of each genome was determined by simulating a cell population based in its k cat selection, and then calculating the goodness-of-fit between the resulting growth rate distribution and the observed distribution [4].
Extended Methods: Each genome was composed of 368 "Boolean genes", one for each protein that had a k cat available. The value of the binary gene indicated whether a original k cat would be used with its respective protein count to calculate a v max , or if the maximum k cat of 38, 000 s −1 would be used for the v max calculation. Since the micro GA applies a uniform cross-over operator, where each gene in an offspring is randomly selected from one of two parent genomes, no mutation operator was used. Moreover, a tournament selection strategy was applied with a sample size of 4, while also applying elitism for the fittest genome. This guarantees high variability and a fast convergence, while preserving optimal results. Population convergence was determined by comparing each genome to the fittest genome in a generation, and counting the number of different gene states between them. When all genomes had less than 5 different gene values when compared with the fittest genome, we determined the population had converged, in which case the fittest genome was kept and all other 9 genomes were re-set to random states. The fitness was calculated by simulating a 4800 cell population from each genome, and then calculating the Watson variation of the Cramér-von-Misses goodness of fit test, comparing the simulated and the experimentally observed growth rate distribution [4] for the entire population. The fitness function was defined as the inverse of the test statistic. The genetic algorithm took anywhere from 2.5 to 15.5 hours to achieve a good goodness-of-fit between simulated and observed growth distribution using 320 CPUs.

A1.5 Proteins with Significant Mean Copy Number But Zero Flux Predicted
Metabolic models are mappings between the genotype of an organism and the reactions that can be catalyzed by their gene products. Given a growth medium and knowledge of the strain (specifically the existence of gene knock-outs, etc.), these models can predict which reactions can carry flux and which cannot. Flux variability analysis with zero optimal growth requirement was used to determine the minimum and maximum possible flux through each reaction. So called "dead ends"-reactions that can never carry flux due to incompleteness of the model-were not considered. We observed several proteins (Table A7) that were measured in significant copy number despite the reactions they catalyze being unable to carry any flux. This inconsistency may be due to two reasons. The first is that the function of the protein might be out of scope of the metabolic model. Some proteins could be involved in both metabolic and non-metabolic functions within the cell; the expression of such a protein might be predominantly associated with its "moonlighting" function. Another possibility is that certain proteins are highly expressed due to transcriptional regulation which is not accounted for by the metabolic model. For example URA1 is expressed at a copy number of around 18,000 even though there is a deletion in URA3, a gene downstream of URA1 in the strain being used. Other proteins in uracil biosynthesis like URA4 and URA5 (Fig A13) are also expressed in high copy number. This might be due to the inducing activity of dihydroorotic acid [13] which is a metabolic intermediate in uracil biosynthesis and is known to up-regulate expression of URA1 and URA4.

A1.6 Metabolic Map
Metabolic maps are extensively applied by the systems biology community as a tool to both explore and understand metabolic activity in a cell. They allow an easy way of visualizing the flux distribution throughout the simulated metabolic pathways. Despite their importance, they suffer from a problem which afflicts most efforts to combine large scale biological information: conflicting naming conventions. Maps built for previously developed metabolic models can rarely be reused because metabolites and reaction identifiers constantly change from version to version, and between organisms and data sources (such as fluxomics, transcriptomics and proteomics). The latest versions of the yeast metabolic models tried to unify nomenclature and create a consistent pattern for metabolites and reaction names, but a new map was not created to make use of such developments. In this work, a comprehensive map representation for Yeast 7.6 was built using Escher [15], and it is presented and made available to the community.

MTD (NADP)
FVA with 90 % Growth and 100% Total ux (from 100% Growth) Fig. A18. Flux variability of reactions in serine glycine cycle among fast growing cells with 90 % optimality for growth and 100 % for total flux Minimum and maximum fluxes through reactions in the serine glycine cycle according to flux variability analysis while maintaining 90% of the optimal growth rate and 100 % of optimal total flux as calculated using pFBA. Optimal total flux is calculated at 100 % optimal growth. Fluxes are calculated for 605 fast growing cells with growth rate > 0.35 hr −1 . Little variation is exhibited except in the reactions involving methylenetetrahydrofolate dehydrogenase (MTD) which are alternate pathways for NADH and NADPH production. Lower fluxes are observed than the case when 100 % optimality is enforced both for growth rate and total flux because lower growth requirement reduces the energy requirement and hence the flux through the cycle. Right panel shows histogram with no variability in threonine uptake while growth rate and total flux is fixed at its optimal value as determined using pFBA. Aminoacylation reactions are not present in all models, therefore their biomass pseudo-reaction uses either amino acids (iMM904 and Yeast 5), or charged tRNAs (Yeast 6 and Yeast 7). All yeast consensus models were taken from yeast.sourceforge.net.