Prediction of Metabolic Flux Distribution from Gene Expression Data Based on the Flux Minimization Principle

Prediction of possible flux distributions in a metabolic network provides detailed phenotypic information that links metabolism to cellular physiology. To estimate metabolic steady-state fluxes, the most common approach is to solve a set of macroscopic mass balance equations subjected to stoichiometric constraints while attempting to optimize an assumed optimal objective function. This assumption is justifiable in specific cases but may be invalid when tested across different conditions, cell populations, or other organisms. With an aim to providing a more consistent and reliable prediction of flux distributions over a wide range of conditions, in this article we propose a framework that uses the flux minimization principle to predict active metabolic pathways from mRNA expression data. The proposed algorithm minimizes a weighted sum of flux magnitudes, while biomass production can be bounded to fit an ample range from very low to very high values according to the analyzed context. We have formulated the flux weights as a function of the corresponding enzyme reaction's gene expression value, enabling the creation of context-specific fluxes based on a generic metabolic network. In case studies of wild-type Saccharomyces cerevisiae, and wild-type and mutant Escherichia coli strains, our method achieved high prediction accuracy, as gauged by correlation coefficients and sums of squared error, with respect to the experimentally measured values. In contrast to other approaches, our method was able to provide quantitative predictions for both model organisms under a variety of conditions. Our approach requires no prior knowledge or assumption of a context-specific metabolic functionality and does not require trial-and-error parameter adjustments. Thus, our framework is of general applicability for modeling the transcription-dependent metabolism of bacteria and yeasts.


Introduction
Cellular metabolism involves a myriad of regulatory processes and metabolic components functioning together through a complex set of interactions and reactions. Although ''omics'' technologies provide an increasingly large body of information on each individual component involved in metabolism, our knowledge of how these components as a system give rise to multiple phenotypes under different conditions is far from complete. A powerful approach to investigate metabolism and metabolic processes is to analyze the flow of material and energy through a metabolic network. In particular, the analysis of metabolite fluxes in a metabolic network serves as an essential tool in many biotechnology and biomedical applications, for example, to enhance the production of food and biofuels [1], identify disease biomarkers and drug targets [2,3], and study complex human physiological processes [4]. Metabolite flows in a network can be determined by experimental or computational techniques.
A standard experimental technique to quantify the distribution of fluxes in a network is to perform a metabolic flux analysis (MFA), which is based on isotope labeling techniques (mostly using 13 C) [5]. 13 C-MFA traces isotope-labeled metabolites using mass spectrometry and determines individual reaction fluxes by fitting 13 C data to a network model with the help of additional measurements on exchange fluxes, such as nutrient uptake and product excretion rates. Due to experimental difficulties in obtaining quantitative and precise measurements that cover a large-size network with diverse pathways and many metabolites, the use of 13 C-MFA is typically limited to the determination of fluxes related to the central carbon metabolism [6]. The most common computational techniques used for the analysis of genome-scale networks are flux balance analysis (FBA) and its derivatives [7,8]. FBA postulates steady-state cellular metabolism as being driven toward maximizing a certain fitness function (typically, biomass production) and estimates the flux distribution by solving a linear programming (LP) problem. Modification of the FBA algorithm to incorporate additional biological information from gene expression profiles is often used to generate contextdependent flux estimates for specific biological conditions without changing the fundamental optimization criterion of the algorithm.
Although gene transcripts are not a direct readout of enzyme activities, as posttranscriptional events determine cellular protein concentrations and activity, a number of applications have shown that they provide important cues for the likelihood that associated reactions are activated [9][10][11][12][13]. These studies include the pioneering work of Shlomi et al. [14], who identified distinct metabolic activity in 10 different human cancer tissues. Our previous work in this area includes the prediction of metabolic adaptation of Mycobacterium tuberculosis under hypoxic and anaerobic conditions [15] and the development of a kinetic modeling framework to predict phenotypic alterations of Saccharomyces cerevisiae in response to chemical treatments [16]. Depending on the experimental design and platform, gene transcriptional expression profiles are collected either as absolute or differential values. Metabolic network integration algorithms that depend on differential expression data generally require reliable measurements or estimates of the flux distribution at a reference condition. The availability of a well-characterized biological reference state provides a robust starting point for investigating perturbed states or conditions. However, data for a reference state may not always be available or even obtainable. Absolute gene expression data are more typically used when studying complex tissues, where a reference no longer strictly refers to a unique cell population or metabolic state. Our interest lies in generating a generalized methodology that can use absolute gene transcriptional data with a minimum number of assumptions, constraints, parameters, or auxiliary data inputs.
Existing methods that similarly address these issues using absolute expression data include E-Flux [17], integrative metabolic analysis tool (iMAT) [14,18], and the algorithm developed by Lee et al. [19]. These methods vary in their assumptions and the approaches used to constrain the solutions of the metabolic fluxes based on gene transcription data. E-Flux can be regarded as an extended version of FBA with additional flux constraints, i.e., it maximizes biomass yield under the imposed upper and lower limits on fluxes determined as a function of their associated gene expression levels. In contrast, iMAT maximizes the fit between gene expression and the flux state of the network such that the number of reactions that are highly expressed and carry a flux is maximized, while the number of reactions that are underexpressed and carry a flux is minimized. Finally, the algorithm recently presented by Lee et al. maximizes the correlation between flux magnitudes and the associated gene expression levels. Importantly, none of these methods have an explicit requirement for defining a cellular metabolic object through an optimal biomass production. Although maximization of biomass production as used in E-Flux and FBA has been exploited to great advantage in many simulations and analyses of microbial growth, the general assumption is context-and growth-condition specific [20,21] and is not supported for multicellular organisms and tissues [19,22]. While other alternative objective functions, such as maximizing or maintaining a fixed level of ATP production or other vital cellular component, could be considered, it is difficult to know a priori which one is the most appropriate in a given condition [23,24].
The methods discussed above have been shown to provide accurate and detailed predictions of flux distributions for specific systems, however, as shown in this article, they do not perform consistently across different conditions and organisms. Thus, our aim was to create a more broadly applicable computational approach that does not heavily rely on context-specific knowledge and assumptions. Our approach is based on flux minimization, with the hypothesis that flux magnitudes are proportional to enzyme concentrations, and that cells are frugal in synthesizing enzymes due to limited internal resources (such as ribosomes, RNA polymerases, and ATP) [21,25]. The principle of flux minimization has been used successfully to estimate the metabolic states (i.e., flux distributions) in uncharacterized environments in EXploration of Alternative Metabolic Optima (EXAMO) [26]. As a preceding step, the implementation of EXAMO requires the reconstruction of an environment-specific subnetwork, which generally includes a number of iterative curation procedures and heuristic decisions. In contrast, our algorithm predicts flux distributions from a generic network by minimizing a weighted sum of flux magnitudes, where the weights are a function of the corresponding gene expression levels. A similar idea was used in Gene Inactivity Moderated by Metabolism and Expression (GIMME) [27], which is a framework for assembling contextspecific networks from a large set of metabolic reactions, but not aimed at predicting flux distributions. In this work, we developed an LP-based framework by modifying GIMME for quantitative prediction of flux distributions. Whereas GIMME requires certain metabolic functionalities to be active above condition-dependent thresholds, we removed these decisions by forcing biomass production to carry nonzero flux. In both GIMME and our algorithm, the use of absolute gene expression data provides the ability to use transcriptional data taken from a single experimental condition where a control condition may not be available. We have termed our method ''expression data-guided flux minimization'' (E-Fmin).
We evaluated the developed E-Fmin algorithm through case studies of S. cerevisiae and Escherichia coli metabolism by a comparison with experimentally determined flux data and other model predictions discussed above. The analysis showed that, whereas other algorithms performed well for one condition/ organism, our algorithm showed consistently good predictions of flux distribution for both organisms under different conditions. Thus, we believe that E-Fmin will provide a robust capability for analyzing complex metabolic systems.

Results and Discussion
Structure of the E-Fmin Algorithm E-Fmin employed a similar structure to that of GIMME, with the following optimization problem that minimizes a weighted sum of flux magnitudes, i.e., subject to Sr~0 ð2Þ where r i is the i th flux, its weight w i is a function of gene expression level, S is the (m6n) stoichiometric matrix, r is the vector of n fluxes, r L and r U are vectors of upper and lower limits of r, respectively, RMF stands for required metabolic functionalities, and d is a threshold to be specified as an input. Eq. 2 denotes stoichiometric balances of fluxes under steady-state conditions. Flux bounds in Eq. 3 are determined by accounting for their directionality as constrained by thermodynamics, i.e., upper and lower bounds are 6' if a reaction is reversible, and if one of the reactions is irreversible, the opposite reaction bound was set to zero. By the inequality constraint in Eq. 4, one or more reactions classified as RMF are forced to carry fluxes above d. GIMME formulates the weight w i as g cutoff {g i , if g cutoff wg i , and as 0 otherwise, where g i is the gene expression level mapped on the i th reaction and g cutoff is a cutoff value. The choice of RMFs and two parameters, d and g cutoff , is condition-specific. In other words, depending on environmental conditions and organism characteristics, RMF can be biomass production, ATP production, or some other metabolic functionality; the appropriate values of d and g cutoff may also vary.
To develop a tool that enables quantitative flux predictions, we made modifications to GIMME. First, we replaced Eq. 4 with the constraint where r B is the specific rate of biomass production and e is an arbitrarily small value. Eq. 5 allows the specific rate of biomass production to vary from a negligible value to its theoretical maximal value, and thus should be valid even under conditions of significantly suppressed growth. In this regard, we view Eq. 5 as a context-independent constraint that can be applied across a wide range of conditions. It should be noted that e is not a parameter to be adjusted, as the final normalized results are invariant for any particular choice. Second, we set g cutoff as the maximum value of gene expression data (i.e., unity value after normalization). Then, w i is represented as 1{g i , meaning that E-Fmin suppresses all reactions (except those fully expressed) in inverse proportion to their associated gene expression levels. For more details on the E-Fmin algorithm, see MATERIALS AND METHODS. Figure 1 shows the overall procedure of implementing and assessing the performance of E-Fmin. Figure 2 shows the application of E-Fmin to a toy metabolic network. The example network contains four major pathway options (P 1 to P 4 ) for converting substrate into product and biomass. Path P 1 leads to product formation through reactions r 1 and r 2 , whereas paths P 2 , P 3 , and P 4 lead to biomass formation through reactions (r 1 , r 3 , and r 9 ), (r 1 , r 4 , r 5 , and r 9 ), and (r 1 , r 6 , r 7 , r 8 , and r 9 ), respectively. This metabolic network is underdetermined, as it has nine unknown fluxes, r 1 to r 9 , but only five available equations given by the stoichiometric matrix S in Figure 2. Thus, the determination of a particular flux distribution among infinite solutions requires either additional experimental flux measurements or application of a computational optimization method such as E-Fmin.

Illustrative Example
In the absence of gene expression data, E-Fmin treats all weights as equal, i.e., w i = 1 (i = 1,2, …, 9). E-Fmin then selects P 2 , the shortest among the biomass-producing pathways, as the solution. Note that P 1 , while shorter than P 2 , is not the solution of E-Fmin due to the constraint that forces biomass production to carry flux.
Incorporation of gene transcript expression levels into E-Fmin allows us to use biological and condition-specific information to estimate the flux distribution in the system. Figure 2 shows two different exemplar gene expression patterns (cases 1 and 2). The weights are given as a function of expression level, 1 -g i , where g i is the normalized gene expression level for the i th reaction, ranging from 0 to 1. We set the weights of the reactions for which no gene expression levels are available to 1. In case 1, E-Fmin discards P 2 as the associated genes are not highly expressed. For paths P 3 and P 4 , which are similar in their gene expression levels, E-Fmin selects the former because of the smaller sum of flux magnitudes. Case 2 shows that even the longest pathway, P 4 , can be selected as the solution of E-Fmin if the associated relative gene expression levels are sufficiently high.
E-Fmin, like other LP-based algorithms, yields only a single pathway among alternative solutions, despite the likelihood that multiple routes can simultaneously be activated at a certain ratio according to their associated gene expression levels. For example, in case 1, it is probable that both P 3 and P 4 could carry nonzero throughput fluxes. To account for these situations, we would implement flux variability analysis (FVA) after the E-Fmin analysis to account for the range of flux variation through each reaction.

Prediction of Exometabolomic Fluxes of S. cerevisiae
We applied our algorithm to the aerobic growth of S. cerevisiae. We used experimental data collected by Lee et al., which included RNA-seq transcriptomic data and exchange flux measurements of S. cerevisiae aerobically growing in chemostat cultures. This study provides data under two different growth conditions, i.e., glucose uptake fluxes of 16.5 and 11.0 mmol/(gDW?h), which correspond to 75% and 85% of the maximal attainable biomass levels, respectively. We used a genome-scale metabolic network of S. cerevisiae, Yeast 5 [28]. While earlier metabolic network reconstructions of yeast have been reported separately by different groups, Yeast 5 is a community-driven network reconstruction based on standard names and methods and is updated periodically.
Figures 3A and 3B show experimentally measured exchange fluxes and model predictions at glucose uptake fluxes of 16.5 and 11.0 mmol/(gDW?h), respectively. For comparison, we also show the predictions from other methods, including GIMME, FBA, E-Flux, Lee et al.'s algorithm [19], and iMAT, along with those of E-Fmin. We set the maximization of biomass production as the objective of FBA and implemented it in two different forms: without flux minimization (or classical) and with flux minimization. For GIMME and iMAT, we constrained the production of biomass to be above a certain threshold (90% of the maximal growth rate as predicted by FBA), which prevents the predicted growth rate from becoming zero. See MATERIALS AND METHODS for details on their implementation.
The experimental data showed that ethanol and CO 2 are the predominant products, whereas the production of glycerol, acetate, trehalose, lactose, and biomass is insignificant. This trend is well captured by E-Fmin, E-Flux, and Lee et al., but not by GIMME, FBA, and iMAT. Table 1 and Table S1 show correlation coefficients (denoted by r in this report) and sum of squared error (SSE) values, respectively, obtained by comparison between model predictions and experimental data. E-Fmin, E-Flux, and Lee et al. show higher r values (Table 1) and lower SSE values (Table S1) in comparison to other methods. As detailed in the MATERIALS AND METHODS, we obtained a P value of 0.04 and rejected the null hypothesis that the ranking of average correlation coefficients in Table 1 could occur by chance alone. The ability to accurately predict the production of metabolic products (such as ethanol) is important, as they are often the target products of interest for biotechnological applications. The poor prediction of FBA implies that maximization of biomass production may not be a valid assumption for these specific experimental conditions for the yeast growth. Whereas the performance of GIMME (and iMAT) may be improved by adjusting their parameters, their appropriate values cannot be determined a priori. We provided the flux data and model predictions in Table S2.

Prediction of Intracellular Flux Distribution in E. coli
To test our algorithm across organisms, we additionally applied it to the aerobic growth of E. coli in chemostat cultures. We obtained microarray gene expression profiles (from central carbon metabolism) and 13 C-MFA-based flux data of E. coli K-12 strains from Ishii et al. [29]. In their work, Ishii et al. experimentally investigated the response of E. coli to environmental and genetic perturbations and provided multiple high-throughput omics data for both wild-type and mutant strains. To study the effect of environmental perturbations, they cultured wild-type cells at varied dilution rates, while the effect of genetic perturbations was examined by disrupting 24 single genes contained in glycolysis and in the pentose phosphate pathway. As a result, they observed that gene disruptions lead to only subtle changes in mRNA levels, suggesting that E. coli is able to adequately compensate for the loss of a single gene or enzyme in the central metabolism by using excess or complementary capacity of other available enzymes. Conversely, they reported that wild-type E. coli cells appreciably change mRNA levels in response to the variation of the dilution rates but exhibit low-variations (robustness) in metabolite concentration levels. Figure S1 shows that the experimentally measured flux distributions are also robust against such environmental perturbations. Therefore, this system poses a challenging problem in integrating varying gene expression profiles for different dilution rates yet predicting an unchanging flux distribution.
For computational predictions, we incorporated the gene expression data of Ishii et al. and an E. coli network model into the E-Fmin framework as detailed in MATERIALS AND METHODS. Multiple genome scale metabolic reconstructions of E. coli have been built in the last decade [30], and, in this study, we used a recent, well-curated genome-scale network of E. coli K-12 strain, iAF1260 [31]. As an updated version of the previous reconstruction iJR904 [32], iAF1260 comprehensively accounts for 1,260 open reading frames, which correspond to ,30% of E. coli's genome [33]. Figure 4 shows intracellular metabolic fluxes of wild-type E. coli strains obtained from E-Fmin and 13 C-MFA at respective dilution rates. Predicted fluxes show good matches with 13 C-MFA data in all cases. Table 2 shows r with an average value of 0.91 along with P values. Low P values indicate that the correlations obtained from E-Fmin are statistically significant. Table S1 shows relatively low values of SSE with an average value of 57.8. We observed that E-Fmin underestimated the CO 2 production rate in all cases. Table 2 also shows the performance results for the other algorithms investigated. Similar to the statistical analysis of Figure 1. Schematic description of the E-Fmin framework. The algorithm was implemented through the following procedures. The first step was to obtain absolute gene expression profiles in a given condition from microarray, RNA-seq, or other high-throughput methods. Second, gene expression profiles were mapped onto individual reactions using gene-reaction associations. Third, the mapped expression data were integrated with the network model, and the optimization problem is solved to predict the flux distribution. Finally, model predictions were validated by comparison with experimentally measured flux data. The performance of model prediction can be gauged using standard measures, such as correlation coefficients (denoted by r) and sum of squared error (SSE). doi:10.1371/journal.pone.0112524.g001 Table 1, we obtained a P value of ,10 210 and rejected the null hypothesis that the ranking of average correlation coefficients in Table 2 could arise by chance alone. The classical FBA (without flux minimization) performed poorly because some of the fluxes attained their upper (or lower) bound values, a problem associated with the existence of multiple LP solutions. These outliers were removed by applying the flux minimization as a secondary objective. Consequently, with the flux minimization, FBA provides reliable estimates, implying that the hypothesis of maximal biomass production may be valid for E. coli growing under these specific conditions. Interestingly, despite using the same objective function, the prediction of E-Flux was not comparable to that of FBA. It seems that the E-Flux predictions in this system are more affected by the imposed flux bounds than the choice of objective function. Algorithms based on the direct association with expression levels, such as in Lee et al., showed weak predictive powers, which may be ascribed in part to the use of gene expression data covering only the central carbon metabolism. Similarly to the classical FBA, the flux vector predicted by iMAT also contained outlier fluxes that reached their bounds. After removing them, iMAT showed an improved predictive capability.
We provided the intracellular flux data and model predictions at the dilution rate of 0.1 1/h in Table S3. Table 3 shows similar results for the 24 single-gene knockout mutants; i.e., E-Fmin, GIMME, FBA (with flux minimization), and iMAT calculations were associated with correlation coefficient averages of 0.87, 0.84, 0.92, and 0.86, respectively, whereas the other algorithms exhibited relatively lower average correlation coefficients ranging from 0.05 to 0.53. We obtained a P value of ,10 210 and rejected the null hypothesis that the ranking of average correlation coefficients in Table 3 could arise by chance alone. Performance comparison using SSE values in Table S1 showed the same trend. The satisfactory performance results of GIMME and iMAT were due to the imposed biomass production constraint. The correlation coefficients (and SSE) values became lower (and higher) for both methods when the threshold was adjusted to a lower value (results not shown).

Prediction of Biomass Yield
As demonstrated in the case studies considered above, E-Fmin has strong predictive capabilities for the metabolism of both S. cerevisiae and E. coli despite the different metabolic characteristics Figure 2. Toy example illustrating an implementation of the E-Fmin algorithm. The network model includes nine reactions (r 1 to r 9 ) but only five available stoichiometric constraints among the five intracellular metabolites under the steady-state assumption. E-Fmin determines the full flux vector for this undetermined system by solving a linear programming problem such that a weighted sum of flux magnitudes is minimized while biomass production (i.e., r 9 in this example) carries nonzero flux. Given two sets of transcriptomic data, E-Fmin generates different flux distributions (denoted by thick arrows). The weight to the i th reaction (w i ) is formulated as a decreasing function of the associated gene expression level (g i ), i.e., w i = 1 -g i . The weights highlighted in red represent the reactions for which no associated gene expression data are available. doi:10.1371/journal.pone.0112524.g002 of these organisms. For E. coli, the assumption of maximal production of biomass underlies the successful application of FBA in many studies of this organism [34,35]. Experimentally measured biomass yields of E. coli at balanced growth conditions are close to the theoretical maximum predicted by FBA [36]. Conversely, the metabolism of more advanced organisms, including yeast, may not be adequately accounted for by the same hypothesis [20]. Accordingly, the biomass yield of yeast is much lower than its theoretical maximal value.
The E-Fmin algorithm was designed to account for all of these features. For E. coli, E-Fmin predicted biomass yield of 0.095 gDW/(mmol-glucose) for both wild-type and mutant E. coli strains, which is close to the FBA prediction of 0.096 gDW/ (mmol-glucose), but without requiring biomass production to be maximized. In the case of yeast, E-Fmin predicted a biomass yield of 0.028 gDW/(mmol-glucose) for both uptake fluxes of 16.5 and 11.5 mmol/(gDW?h), which correspond to ,25% of the theoretical maximum 0.104 gDW/(mmol-glucose). While Crabtreenegative yeast strains, such as Pichia stipitis, show high biomass yield close to the FBA prediction, the biomass yield of Crabtreepositive yeast S. cerevisiae is known to be much lower due to appreciable production of fermentation products, particularly ethanol [37]. The experimentally obtained biomass yields by Lee et al. were 0.020 gDW/(mmol-glucose) for both systems, confirming the accuracy of the E-Fmin predictions. The low biomass yield of S. cerevisiae growing in aerobic cultures has also been observed in other experiments. For instance, Papini et al. [37] experimentally measured the biomass yield of wild-type S. cerevisiae growing on glucose as 0.031 gDW/(mmol-glucose), which is close to our prediction. Conversely, the algorithm by Lee et al. predicted zero biomass yields for both E. coli and S. cerevisiae. The prediction of iMAT also led to zero biomass yields without additional constraint for biomass production to be larger than a certain level (e.g., 90% in our simulations) of the theoretical maximal.

Features of the E-Fmin Algorithm
The main distinguishing features of the E-Fmin algorithm provide multiple advantages when analyzing cellular metabolism based on gene expression data. First, E-Fmin can use a generic network model to predict context-specific flux distribution. This is not the case for other methods, such as EXAMO, where the reconstruction of a condition-specific subnetwork is a prerequisite. Second, E-Fmin uses absolute gene expression data that can be collected directly from a single condition without defining or determining a standard reference conditions. This is advantageous over many of the currently available methods that are based on relative expression data [14][15][16], which typically require expanded sets of data including flux distribution at a reference condition. Conversely, if the reference flux distribution is not known, E-Fmin can be used to provide the reference flux distribution. Third, E-Fmin contains no parameters to adjust. Note that any arbitrary positive value can be used for e in Eq. 5. The flux distribution obtained as the solution of E-Fmin becomes the same after normalization regardless of the value of e. The normalized flux vector (obtained from E-Fmin) can be scaled to 'actual' values using experimentally measured fluxes, whenever available. Finally, the formulated LP problem leads to fast simulations. This is an important consideration when performing extensive FVA or analyzing systems for a large number of conditions. In addition, due to the principle of flux minimization, fluxes predicted by E-Fmin contain no thermodynamically infeasible cyclic paths. The flux minimization also shrinks the space of alternative optimal solutions, leading to very narrow flux variation through individual reactions. We observed no appreciable changes in E-Fmin predictions from the implementation of FVA. In all simulation studies including E-Fmin, we assumed that the cellular composition was known. For cases in which cells grow in conditions that significantly differ from standard cultures, additional measurement of biomass composition is likely to improve model predictions.

Conclusions
The work reported in this article addresses the issue of how one can effectively use gene expression data to study a phenotypic response of metabolism under various conditions by estimating flux distribution. Although a direct link between transcriptomic and fluxomic data is generally weak due to posttranscriptional modifications, the framework we developed was able to extract maximal information of metabolic fluxes by integrating mRNA data with a metabolic network model based on the principle of flux minimization. Importantly, E-Fmin does not require a priori knowledge of context-specific metabolic functionalities. This feature allows E-Fmin to be applicable across different conditions/organisms without modifying any components of the framework. We validated this capability through studies of S. cerevisiae and E. coli strains exhibiting distinct metabolic characteristics. In comparing the overall ability of the different methods to achieve a ranking above average, i.e., a meta-analysis of all rankings from the separate Tables 1-3, we found that both E-Fmin and FBA (flux min) were statistically significantly ranked above all other methods (P values 0.03 and 0.04, respectively). Although similar in nature and performance, this analysis misses the fact that the latter method was unable to give satisfactory results for the Saccharomyces cerevisiae data. Thus, compared with other methods, E-Fmin provided more consistently reliable predictions for both organisms. While these studies used gene expression measurements, E-Fmin could incorporate protein expression data as an alternative input to predict flux distribution. Table 1.  As a basic constraint, E-Fmin forced nonzero flux through the biomass-producing reaction. This constraint was sufficient for E-Fmin to be able to correctly predict biomass production, high in E. coli and low in S. cerevisiae, based on absolute gene expression data. Nevertheless, the inclusion of an expanded set of key reactions may be required to obtain physiologically reliable prediction for a specific condition. A systematic basis for the identification of additional condition-specific constraints is therefore a topic of great general interest. In our future work, we will expand the applications of the E-Fmin framework to examine a broader range of physiological conditions.

Metabolic Network Models and Experimental Data
We used genome-scale metabolic network models taken from the recent E. coli and S. cerevisiae reconstructions iAF1260 [31] and Yeast 5 [28], respectively. We downloaded the Systems Biology Markup Language (SBML) model of iAF1260 from the Biochemical Genetic and Genomic knowledgebase [38], which contains 2,382 reactions, 1,668 metabolites, and 1,261 genes. We obtained the SBML model of Yeast 5 (version 5.21) from Lee et al. [19], which contains 2,061 reactions, 1,605 metabolites, and 893 genes. To evaluate the prediction of our algorithm using a minimum number of auxiliary metabolic information, we removed any condition/organism-specific constraints, such as ATP requirements for cellular maintenance and oxygen/carbon uptake rates. We allowed the magnitudes of all fluxes to vary without bounds.
We obtained preprocessed gene expression profiles and 13 C-MFA data of E. coli from Ishii et al. [29]. They provided wild-type strain data at several different dilution rates (i.e., 0.1, 0.2, 0.4, 0.5, and 0.7 1/h) and mutant strain data for 24 single-gene disruptions at a fixed dilution rate of 0.2 1/h. We similarly obtained data for S. cerevisiae from Lee et al. [19], which included preprocessed gene expression data, growth rate, and exchange flux measurements, including the uptake rate of glucose and the production rates of ethanol, CO 2 , glycerol, acetate, trehalose, and lactose. In both experiments, data were collected from glucose-limited chemostat cultures.

Gene-To-Reaction Mapping and Further Processing
The initial step in predicting fluxes using E-Fmin is to map gene expression data onto reactions based on gene-protein-reaction associations, which are provided with the iAF1260 and Yeast 5 network models. This mapping is straightforward if a single gene product catalyzes a reaction, i.e., the association of a gene via mRNA and enzyme to a single reaction is unambiguous. In general, a reaction is associated with multiple gene products, and their relations are described using Boolean operators such as ''AND'' and ''OR.'' The AND operation represents the involvement of multiple gene products in catalyzing a reaction, whereas the OR operation signifies that only one of the gene products is involved in the reactions. We implemented these operations by taking the minimal and maximal value of the associated gene expression data [39], as follows: where uppercase and lowercase letters indicate gene names and their expression levels. After this mapping, we divided the Table 2. Correlation coefficients (r) and P values of E-Fmin and other methods in predicting intracellular flux distributions in wild-type Escherichia coli at varied dilution rates. expression data by their maximum value to normalize their ranges from 0 to 1. We used g i to denote the resulting normalized gene expression values for the i th reaction. We designed w i in Eq. 1 as a linearly decreasing function of g i as follows: Eq. 8 ensures that fluxes with lower (higher) gene expression levels are more (less) suppressed.

Implementation
We recast the absolute-sum minimization problem presented in Eqs. 1-3 and 5 into a linear form, as follows: subject to in addition to the constraints given in Eqs. 2, 3, and 5. We used 0.01 for e in Eq. 5. We solved the above problem using the CPLEX (ILOG, Mountain View, CA) LP solver. We ran all simulations on a desktop PC with an Intel (Santa Clara, CA) Pentium i3 CPU and 4-GM RAM. We provide the MATLAB scripts used for generating the results in Data Set S1. We used our own code for the implementation of E-Fmin and GIMME, the COBRA package for FBA and E-flux, and Lee et al.'s scripts for their algorithm and iMAT. For the implementation of GIMME, we constrained the biomass production to be greater than or equal to 90% of the theoretical maximum and set 0.05 as the cutoff value (g cutoff ). Data Sets S2 and S3 provide the network models and data files used for the simulations of S. cerevisiae and E. coli metabolism, respectively.

Statistical Analysis of Correlation Tables
Given a table of correlation coefficients of predicted intracellular flux distributions for various conditions and methods, we wanted to test whether the observed rankings could occur by chance. We first normalized the correlation coefficients to scores ranging from 0 to 1 by rank-ordering all of them from small to large and, for each observation, scoring its rank as a percentile. Then, we used the F-test in a one-way analysis of variance [40] to ascertain significance. In short, we computed the F-ratio of between-method variability (V b ) to within-method variability (V w ): The variability values were calculated as where S ij is the score of method i in condition j, S i denotes the average of the scores of method i in all conditions, n i is the number of conditions for method i, Sdenotes the overall average, Nis the number of observations, and K is the number of methods. Using the F-ratio, between-method degree K{1, and within-method degree N{K, we determined the P value according to the Fdistribution. Finally, we used rank product analysis [41] to assess whether any method performed significantly above average among all tested methods. For each of the Tables 1-3, we sorted the methods according to their average scores from large to small and obtained their ranks. We computed the geometric mean of the ranks of each method of Tables 1-3 as its rank product RP: where R it is the rank of method i in table t. We exhaustively permuted the three, sorted lists, and computed the P value as the fraction of all permutations that had smaller or equal rank products.    Data Set S1 Matlab scripts. Provides numeric codes for running E-Fmin and other algorithms to predict flux distributions from given gene expression profiles. At the end of the simulation, the code calculates correlation coefficients (r) together with P values and coefficients of determination (R 2 ) as performance measures for comparison. (ZIP)