lumpGEM: Systematic generation of subnetworks and elementally balanced lumped reactions for the biosynthesis of target metabolites

In the post-genomic era, Genome-scale metabolic networks (GEMs) have emerged as invaluable tools to understand metabolic capabilities of organisms. Different parts of these metabolic networks are defined as subsystems/pathways, which are sets of functional roles to implement a specific biological process or structural complex, such as glycolysis and TCA cycle. Subsystem/pathway definition is also employed to delineate the biosynthetic routes that produce biomass building blocks. In databases, such as MetaCyc and SEED, these representations are composed of linear routes from precursors to target biomass building blocks. However, this approach cannot capture the nested, complex nature of GEMs. Here we implemented an algorithm, lumpGEM, which generates biosynthetic subnetworks composed of reactions that can synthesize a target metabolite from a set of defined core precursor metabolites. lumpGEM captures balanced subnetworks, which account for the fate of all metabolites along the synthesis routes, thus encapsulating reactions from various subsystems/pathways to balance these metabolites in the metabolic network. Moreover, lumpGEM collapses these subnetworks into elementally balanced lumped reactions that specify the cost of all precursor metabolites and cofactors. It also generates alternative subnetworks and lumped reactions for the same metabolite, accounting for the flexibility of organisms. lumpGEM is applicable to any GEM and any target metabolite defined in the network. Lumped reactions generated by lumpGEM can be also used to generate properly balanced reduced core metabolic models.


Introduction
Stoichiometric models have been extensively used since 1980s [1][2][3] and prediction capabilities of these networks have been proven to be very useful. The size and structure of these models varied among different studies. One of the pioneering studies on E. coli through a small stoichiometric model is performed by Varma et al. [4,5] in where the authors described the model as composed of core carbon metabolism pathways namely, glycolysis, pentose phosphate pathway, TCA cycle and formation of some by-product formations accompanied by a part of the Electron Transport Chain (ETC). This stoichiometric definition is further extended by the integration of a biomass composition formulation that is provided in the classic text published by Neidhardt [6]. In this textbook, E. coli metabolism has been explained extensively, and all the components (amino acids, lipids, DNA, RNA etc.) that constitute 1gDW of cell were reported based on previous experiments [7]. Moreover, the amounts of 12 precursor metabolites from the core carbon metabolism (erythrose-4-phosphate, ribose-5-phosphate, pyruvate, alpha-ketoglutarate, phosphoenolpyruvate etc.) along with the requirement of cofactors (ATP, NADH, NADPH) and inorganic compounds (S, NH4) to synthesize these biomass building blocks (BBB) were estimated. Such representation has been used in different studies to understand the core carbon metabolism and its relation with biomass accumulation [8,9]. This information for E. coli allowed the authors to develop a model that can describe the growth requirements and limitations of the organism without including the complex biosynthesis routes for each individual biomass building block. With such a small stoichiometric representation of the core metabolism (~50 reactions), authors were able to predict many aspects of E. coli physiology. Similar metabolic models with a reduced representation of the biosynthesis of the building blocks have been used in many other studies for E. coli and other organisms [10][11][12][13][14][15].
In the following years, with the development of sequencing and high-throughput technologies, the gene-protein-reaction (GPR) associations [16] have been improved and the number of sequenced genomes has sparked off [17,18]. This accumulation of knowledge eventually has led to the development of Genome scale metabolic networks (GEMs) [19,20], which encapsulate all the known biochemistry of organisms. These comprehensive representations of metabolism are accompanied by biomass formulations that account for the cell composition [21]. The contribution of each biomass building block is either determined empirically, or approximated from phylogenetically close species [22].
Many metabolic models for various organisms [20,[23][24][25][26][27] and their strains [28,29] have been constructed and they proved to be extremely useful for many different purposes ranging from strain design for biosynthesis of industrial chemicals to drug discovery [30]. Although GEMs are widely used and have provided important insight and guidance, small and yet predictive models are still widely in use in different areas, such as metabolic flux analysis (MFA) and kinetic modelling. And while the biomass formulation of Neidhardt is still in use and has proven to be valid in the last 25 years, as Pramanik et al. [12] have shown, the changes in the biomass composition have significant effects on the internal fluxes, thus should be considered very carefully. In this respect, the extended and curated biochemistry in GEMs can be used to validate and to improve the approximations made by Neidhardt, and to extend it for any organism and for every biomass building block defined in biomass compositions of GEMs and to account for alternative synthesis routes.
Towards this, we developed lumpGEM, a mixed-integer linear programming algorithm that identifies all the alternative reaction subnetworks that should be used to produce a cellular metabolite or biomass building block from a defined set of metabolites. For each subnetwork, lumpGEM derives a lumped reaction that can capture the overall stoichiometry of the subnetwork, while preserving the elemental balance. The concept of identifying minimum number of reaction or enzymes to perform a certain function is not new, throughout the last two decades, different studies have been performed around this idea. Hatzimanikatis et al. developed a mixed-integer linear programming (MILP) formulation to determine the minimum number of regulatory structures to manipulate to optimize a bioprocess, along with possible alternatives [31]. In another pioneering study, Burgard et al. [32] have determined the minimum number of reactions that can sustain growth in E. coli contemporary GEM [19]. Later, the concept of finding all possible optimal solutions in metabolic networks has been introduced [33]. Elementary flux modes analysis (EFMA) has been another popular tool to analyze the metabolic networks in terms of minimal functional units [34] and used before to study nucleotide production [35]. The main limitation of EFMA is the requirement of generation of all the EFMs, which is computationally too costly to be applied to genome-scale networks. Due to this limitation, Figueiredo et al. [36] discussed the concept of K-shortest EFM, an MILP approach, and identified 10 minimal reaction sets that can produce lysine in E. coli and C.glutamicum. In lumpGEM, we follow a similar but an efficient approach compared to the previous studies and we have merged it with thermodynamics-based flux analysis (TFA) to account for bioenergetics constraints. With lumpGEM we can generate thousands of thermodynamically feasible minimal subnetworks that can produce a target metabolite from a set of selected precursor metabolites.
In this study, we focused on the biosynthesis pathways of biomass building blocks of E. coli iJO1366 [29], and S. cerevisiae iMM904 [37] and with lumpGEM (Fig 1), we have identified known and possible other alternative synthesis routes/subnetworks for all BBBs from core carbon metabolism as defined in [4,5]. We demonstrated that lumpGEM is capable of building lumped reactions in where the contribution of each core carbon metabolite to synthesize a biomass building block is identified and properly accounted.
Then, we performed a comparison between the values reported by Neidhardt and those generated by lumpGEM and found that lumpGEM recovered Neidhardt tables and extended them by including alternative biosynthetic routes/lumped reactions. We also performed a comparison between E. coli and S. cerevisiae for their metabolic capabilities to produce BBBs and revealed their similarities and differences. Such studies will help us to understand the capabilities of E. coli and S. cerevisiae 'per' biomass building block and identify the flexibility of the organism to survive by activating different parts of the metabolism to accumulate biomass. The generality of the method makes it applicable to any GEM that has a well-defined biomass composition. In addition, lumpGEM can generate lumped reactions from any part of the metabolism for any target metabolite, either a biomass building block or a biochemical and chemical compound or sets of compounds, which makes it a versatile tool to be used for different purposes.

Results and discussion
The genome-scale models (GEMs) can simulate the growth characteristics of organism since they include the necessary biosynthetic paths to biomass building blocks (BBBs), which make up 1 gDW of cell. However, when the GEM is optimized for maximum specific growth using Flux Balance Analysis, the contribution of M core (the core metabolites, for definitions, see Material and methods) to synthesize a biomass building block is not evident from the flux distribution due to the degrees of freedom that the system has, and the alternative routes that a BBB can be synthesized. In order to overcome this limitation, lumpGEM utilizes a Mixed-Integer Linear Programming (MILP) formulation (See Material and methods) to reveal the contribution of M core and R core (core reactions) while preserving the elemental balance. MILP formulations have been often used on biochemical networks for many purposes [31,36,38,39] since they allow the control of reactions with an on/off manner. We made use of this binary Inputs and outputs for lumpGEM. By defining the core precursors (AKG: alpha-keto glutarate, oxaloacetate, . . .), cofactor pairs (NADH, ATP, . . .), inorganics (SO 4 , NH 4 ), biomass building blocks (BBBs), and non-core parts of metabolism, the GEM is provided to lumpGEM. The output of lumpGEM is thermodynamically feasible subnetworks, which with the core, is capable of synthesizing BBBs. The MILP characteristic of lumpGEM allows the building of alternative subnetworks and lumped reactions for the same BBB j , and it ranks them according to yield. decision in order to control the flux through the reactions of R ncGEM (the reactions defined in GEM network other than R core , (also see Postulate 1, Material and methods), and lumpGEM allowed us to build minimal subnetworks that can synthesize BBBs from any selected part of the metabolism in GEMs, in this specific case, the core carbon metabolism.
The biomass formulation defined in E. coli IOJ1366 is very well characterized and detailed and contains 102 biomass building blocks. It is mainly composed of amino acids, lipids, nucleoside triphosphates (NTPs), deoxy-NTPs and inorganic compounds (Nickel, Zinc, Iron, etc.) along with cofactors such as NAD + /NADH, NADP + /NADPH, CoA/AcCoA, and FAD. Experimental estimates of the growth associated ATP maintenance and production of diphosphate are also included in the biomass composition.
The main difference between our approach for generating synthesis routes for the BBBs and the database-based analysis is that the subnetworks that our method generates may include branches from linear synthesis pathways. The difference emerges from the mass conservation constraint that we force during our analysis. For instance, the smallest subnetwork that lump-GEM generated for the synthesis of histidine is composed of 21 reactions and the precursors are ribose-5-phosphate (R5P) and oxaloacetate. In the databases such as EcoCyc [40], the pathway for histidine synthesis is linear and composed of 10 steps, specifying ribose-5-phosphate (R5P) as the only precursor. When we analyse the 21-reaction subnetwork (Fig 2), we see branching points in the linear route from R5P to histidine. Due to the mass balance constraint, three metabolites, 1-(5-Phosphoribosyl)-5-amino-4-imidazolecarboxamide, L-Glutamine and diphosphate cannot be balanced in a network that is composed of core reactions and the linear pathway from ribose-5-phophate to histidine. Hence, the generated sets of reactions are not only the linear routes from precursor metabolites to biomass building blocks, but balanced subnetworks with stoichiometrically proportional branches (Postulate 3, Materials and methods). lumpGEM captured reactions from various subsystems that are parts of histidine subnetwork, namely alanine and aspartate metabolism, anaplerotic reactions, folate metabolism, glutamate metabolism, histidine metabolism, nucleotide salvage pathway, purine and pyrimidine biosynthesis. In addition, the lumped reaction that is generated from this subnetwork (for lumping algorithm, see Materials and methods) has only core metabolites, biomass building blocks and by-products on both reactants and products sides. This representation is similar to Neidhardt's definition, since he also described the stoichiometric expenditure of core metabolites in his estimations. Similar to our analysis, the values that Neidhardt et al. reported for the synthesis of histidine are different than the linear route that is reported in databases. However, Neidhardt reported only 1 lumped reaction for each biomass building block, and they are not overall stoichiometrically balanced (S2 Table). lumpGEM allows us to build alternative subnetworks and corresponding lumped reactions for the same BBB j . In this specific case, with the minimum subnetwork size S j min being 21, lumpGEM generated 12 alternative subnetworks, and 3 unique lumped reactions. This signifies that the overall lumped reactions of different subnetworks can be the same. This has been observed also in a previous study that focuses on pathway generations for an industrial chemical [41].
A comparison between lumpGEM results and Neidhardt tables for the common metabolites (ATP, NADH, NADPH, etc) indicates that they are close to each other, however the overall stoichiometry that lumpGEM reports for the biosynthesis of histidine is more detailed and includes more metabolites as co-substrates and co-products ( Table 1). The main reason for this discrepancy is that Neidhardt did not report elementally balanced lumped reactions for any of the biomass building blocks. For instance, oxaloacetate appears as a precursor to balance the non-core metabolite L-aspartate in the subnetworks that participates in adenylosuccinate synthase reaction as a co-substrate, and is not reported in Neidhardt precursors (Table 1).
Therefore, values reported by lumpGEM are more comprehensive, and account for the balancing of all metabolites in the synthesis pathways, along with the small molecules, such as inorganic metabolites and protons.
The differences between the alternative subnetworks may emerge from different reactions in the 'linear' pathway from main precursor to the biomass building blocks, or from the other non-core reactions, which are balancing the non-core metabolites in the linear route. These two sources of differences, and especially the latter, may result in an explosion in the number of subnetworks that can be generated for some of the biomass building block. For metabolites like amino acids, which are not so far from the core carbon metabolism, the number of alternative smallest subnetworks S j min is relatively small (Table 1), whereas for big molecules, such as lipids, there exists hundreds of alternative routes (S1 Table). The small number of alternative subnetworks for amino acids also shows that the number of non-core metabolites that appeared The subnetwork cannot produce any histidine from core network without including the purple reactions due to mass balance constraints. 'non-core linear' represents non-core metabolites along the linear synthesis pathway, 'non-core mass balance' represents the metabolite that appears as non-core within the purple reactions.  along the linear synthesis route is small, since the main explosion in the number of subnetworks emerges from these reactions. As an example, all 12 alternative subnetworks for histidine include the linear 10 steps route from R5P to histidine and alternative subnetworks are generated by other non-core reactions. Moreover, there is a slight correlation between the number of alternative subnetworks and the size of the subnetworks. Most of the amino acids that have more than 2 alternative subnetworks require more than 10 steps for their biosynthesis. A big molecule, Phosphatidylglycerol (dihexadecanoyl, n-c16:0) is a lipid with a S j min of 40 reactions, and has 127 alternative subnetworks with 16 unique lumped reactions. In the first subnetwork generated by lumpGEM, within the 40 reactions, 34 of them are part of linear synthesis route and 6 of them are balancing non-core metabolites. However, despite the increase in the number of subnetworks, the number of unique lumped reactions remains relatively small (S1 Table).
When we analyse the alternative lumped reactions, in some cases, we only observe a difference in the stoichiometry of the cofactors in the reactant and product side. For example, the small change between the two alternative L-arginine lumped reactions is caused from alternative reactions that are decomposing pyrophosphate. In one case, pyrophosphate is decomposed into phosphate and water by the inorganic diphosphatase enzyme and in the second subnetwork it is decomposed into ATP, phosphate and proton using ADP as co-substrate by polyphosphate kinase enzyme. Thus, the main carbon flows of the subnetworks are exactly the same, and the lumped reactions differ only in the stoichiometry of the cofactor metabolites.

Complex biomass components and biomass associated processes
Along with the biomass building blocks such as amino acids and lipids, the biomass formulation of iJO1366 includes growth associated maintenance (GAM) and the de novo synthesis of adenylate pool metabolites. The Varma core network (S2 File 2) is capable of hydrolyzing the amount of ATP for GAM in the biomass (53.95mmol/gDW). However, the stoichiometric coefficients of ATP and ADP are not the same in the biomass equation, which indicates that the biomass formulation of E. coli requires synthesis of adenylate pool metabolites. In order to identify the subnetwork(s) for the synthesis of these pool metabolites, we followed the same procedure that we did for biomass building blocks, and we built a GEM with an additional reaction with the coefficients of ATP, ADP, water, phosphate and proton from GEM biomass. Then, by forcing a flux through this reaction, and by minimizing the number of non-core reactions (See Material and methods), we generated minimal subnetworks S min . The resulting networks are composed of 27 reactions with 24 alternatives, which synthesize adenylate pool metabolite ADP.
Along with the biomass building blocks, GAM and adenylate pool metabolites, the biomass equation of iJO1366 includes diphosphate as a by-product. By following the same procedure that we have followed for adenylate pool, we generated subnetworks that balance the diphosphate in the biomass formulation. There are 2 alternative subnetworks composed of only 1 reaction in R ncGEM that can balance the diphosphate in biomass equation: Inorganic Diphosphatase (PPA) and Polyphosphate Kinase (PPKr). Both of these reactions are decomposing the diphosphate into phosphate and proton, the former with water, and the latter with ATP/ ADP cofactor pair.

Ranking alternative lumped reactions-Yield analysis
When we analysed the alternative lumped reactions for the same biomass building block, we observe different requirement of precursors, cofactors, nitrogen and sulphur sources. This behaviour is expected, and a detailed analysis could also suggest which lumped reaction is more suitable for specific studies. One of the main criteria to rank the lumped reactions is their capability to synthesize the BBB from the carbon source, specifically the yield of BBB on the carbon source. In order to calculate the yield per lumped reaction, we built a 'mini' core model for each of them, which is composed of M core -R core, V BBB and the lumped reactions under study. By optimizing the synthesis of the selected BBB and calculating the C-mole yield over the carbon source of interest, specifically glucose, we ranked the alternative lumped reactions for each BBB. Interestingly, different lumped reactions can produce different amounts of biomass building blocks over a wide range of yield amounts ( Table 2).
dTTP is a deoxy nucleoside triphosphates and its main precursor for all the generated subnetworks is ribose-5-phosphate (R5P) from Pentose Phosphate Pathway. Consequently, the core network can supply the same amount of carbon to all the generated subnetworks and corresponding lumped reactions. One explanation for the differences in yields (Table 2) is the capability of the lumped reactions to direct all the carbon from R5P to dTTP. When we analyse the 4 lumped reactions with highest yield, we see that on the product side, the only compound other than inorganics and cofactor pairs like quinone/quinol, ATP/ADP-PI is dTTP. For the lumps with second highest yield, along with the pyruvate (C3) in the reactant side, we see AcCoA (C2) and CO 2 (C1) on the product side. In the 3 rd highest yield case, fumarate (C4) replaces pyruvate on the reactant side, and succinate (C4) appears on the product side as a by-product. The lowest yield producing lumped reaction has fumarate and pyruvate on the reactant side, and has AcCoA, CO 2 and succinate on the product side of the equation. This signifies that the lumped reactions with lower yield are losing carbon through those core metabolites and the number of carbons in these metabolites defines the yield. However, it is not possible to generalize such a rule, since the metabolites appearing on the product side of the lumped reaction can be assimilated by the system and the capacity of the network for this reassimilation will have a significant effect on the yield of the lumped reaction. Moreover, having a lower yield does not necessarily mean that these lumped reactions are not useful for metabolic modelling, since they can be used under sub-optimal growth conditions or under conditions when growth is not the main physiological optimality criterion or to provide flexibility under mutation. The use of these alternatives and their physiological interpretation deserves an in-depth study and lumpGEM can serve as a framework for systematic studies.

Generating a metabolic model with lumpGEM
By generating subnetworks for GAM and diphosphate, lumpGEM took into account all the components of biomass formulation both on the product and reactant sides. By testing for yield, we have shown that all the generated lumped reactions are capable of producing their target BBB j . However, to produce biomass, these lumped reactions must be able carry flux under the same quasi steady state condition, in the same model with biomass as cellular objective. This requires generating a metabolic network composed of the defined core network, lumped reactions, the transport and sink reactions defined in GEM. We used the Varma model (S2 File 2) as core model and the GEM iJO1366 as defined in the previous sections and introduced a new mixed-integer linear programming (MILP) formulation to have a systematic way to choose the lumped reactions. We formulated the problem as the following: We generated a model with all the lumped reactions generated by lumpGEM and calculated the theoretical maximum yield for biomass. We then fixed this yield in the model, and minimized the number of active lumped reactions and include only these lumped reactions for further analysis. This network consists of 56 cytosolic enzymatic (Varma network), 144 transport reactions, 78 lumpGEM output reactions along with 64 sinks (343 in total with biomass formulation) and 153 unique metabolites. The metabolic network is capable of producing 0.951/hr specific growth rate with 10 mmol/gDWhr glucose uptake rate. This signifies that all the lumped reactions were capable of producing corresponding BBB j successfully under the given condition simultaneously. The specific growth rate of GEM for the same condition is 0.997/hr, which is close to the biomass accumulation of the model generated with the output of lumpGEM. This shows that lumpGEM can be used to generate networks, which are small, but comprehensive and able to mimic the GEM behaviour.
Analysis on compartmentalized models, test case on S. cerevisiae lumpGEM can be applied to any GEM with a proper biomass formulation, and in order to test its applicability, we used it on a eukaryotic, compartmentalized GEM, iMM904 [37] of S. cerevisiae. This yeast is one of the mostly studied unicellular organisms along with E. coli, and has many biotechnological applications [42]; thus making it a strong candidate for modelling approaches. Applying lumpGEM on this S. cerevisiae strain revealed the contribution of different possible precursors and cofactors for each of the biomass building blocks defined in GEM, moreover it revealed alternative subnetworks and lumped reactions for the same biomass building block. This can be interpreted as building 'Neidhardt style' tables for S. cerevisiae. In order to define the core for S. cerevisiae, we used the Metabolic Flux analysis (MFA) model of Christen et al. [43] and mapped the reactions of this model with the iMM904 reactions. In addition, we mapped reactions of iMM904 that are in Varma network but not in MFA model and included them as core reactions for S. cerevisiae network. The generated core network is composed of (S2 File 2) 88 reactions and 67 unique metabolites along 2 compartments, cytoplasm and mitochondria. Following these steps, we have applied the lumpGEM algorithm to the GEM as described in Material Methods section. The main difference between E. coli and S. cerevisiae GEMs are that S. cerevisiae GEM is compartmentalized, however this does not bring any more complexity for lumpGEM since it treats the transport reactions between compartments as it treats single enzymatic reactions. These transport reactions are part of the generated subnetworks for biomass building blocks and can participate in lumped reactions. These lumped reactions can include metabolites from different compartments, if these compartments are included in the ad-hoc selected network. By the nature of the lump-GEM algorithm, the lumped reactions include only core metabolites, biomass building blocks and by-products. Since the core network has 2 compartments, mitochondria and cytosol, the resulting lumped reactions can have metabolites from these compartments. The synthesis pathways of common biomass building blocks of S. cerevisiae and E. coli such as amino acids are very similar. The sizes of the networks differ mainly from the transport reactions between the compartments. Another reason for the divergence is the non-core metabolites along the linear routes to biomass building blocks, because there are different enzymes in these two organisms that are balancing these non-core metabolites. Interestingly, different subnetworks between the two organisms do not necessarily produce different lumped reactions. When the synthesis pathway is in one compartment (mainly cytosol), the lumped reactions of E. coli and S. cerevisiae are very similar (Table 3). L-phenylalanine, L-methionine, L-serine, L-cysteine and L-tyrosine are some examples of these similarities. Small differences for these overall reactions emerge from different cofactor usage. A similar behaviour is also observed for subnetworks including more than 1 compartment. The main difference between E. coli and S. cerevisiae overall reactions emerges from metabolites in different compartments, which are also different due to the energetics cost of transport reactions. As an example, the L-arginine biosynthesis in E. coli and S. cerevisiae is very similar, however the main difference emerges from the transport reactions between the compartments in the yeast. A smallest subnetwork for E. coli is composed of 13 reactions, and for S. cerevisiae, this number rises to 17. Two additional reactions for S. cerevisiae are the transport reactions between cytosol and mitochondria for the metabolites L-aspartate and ornithine. Another difference is the bicarbonate equilibration reaction (HCO3E) converting carbon dioxide to bicarbonate, which is the co-substrate for the carbamoyl-phosphate synthase reaction converting glutamine to carbamoyl phosphate. The reaction that produces carbamoyl phosphate in E. coli model uses carbon dioxide instead of bicarbonate as co-substrate. Moreover, it does not use L-glutamine as co-substrate. Therefore, the last difference in the subnetworks emerges from the synthesis of glutamine by S. cerevisiae through the usage of glutamate synthase (NADH2) and glutamine synthetase enzymes.
As a general comparison, lumpGEM generated 2797 subnetworks and 615 unique lumped reactions for 102 biomass building blocks for E. coli, whereas it generated 155 subnetworks and 114 lumped reactions for S. cerevisiae for 44 biomass building blocks (S1 File). Even though the numbers seems very different, the main explosion of number of possible subnetworks for E. coli emerges from lipids, which are not common with S. cerevisiae. Moreover, E. coli GEM has a more detailed biomass building block definition compared to S. cerevisiae GEM. Thus, it is fairer to compare the common biomass building blocks between two organisms as we have shown for amino acids.
We have also performed an analysis to compare E. coli and S. cerevisae for the production of a valuable industrial chemical, succinate. In this case, we did not define any core, other than the media composition, and minimized the number of active reactions to produce succinate. The analysis indicated that S. cerevisiae requires at least 33 reactions to produce succinate with a 100% carbon-mole yield over glucose, whereas E. coli needs only 25. The overall lumped reaction for E. coli do not have any metabolite other than glucose, oxygen and cytosolic proton in the reactant side, and only succinate, water and periplasmic proton on the product side. The overall lumped reaction generated for the S. cerevisiae subnetwork also includes the same metabolites, with additional production of ATP along with succinate. When we apply the same formulation to the biosynthesis of L-malate, we observe that minimal subnetwork generated for S. cerevisiae (30 reactions) are still bigger compared to E. coli (24 reactions), however this time the lumped reaction does not include any metabolite other than L-malate, glucose, water, oxygen and proton. The E. coli lumped reactions for L-malate is very similar to the succinate case. This analysis is done only for 1 S min subnetwork for both compounds, and can be extended by generating all the minimum subnetworks, and for every common metabolite between E. coli and S. cerevisiae.

Conclusion
The complexity of cellular metabolism necessitates the development of methods to better understand and investigate the metabolic capabilities of organisms. For this purpose, small sized metabolic models are developed and widely used to study cellular physiology, and are proven to be useful platforms for many applications. With the emergence of genome scale models (GEMs), studies on metabolism entered a new era, since GEMs encapsulate all biochemistry that occurs in an organism. However, it is still crucial to be able to focus on certain parts of the metabolism with a modular manner and to understand the capabilities of these modules. Within this scope, we developed lumpGEM, a systems biology tool that captures the minimal sized subnetworks that are capable of producing target compounds from a set of defined core metabolites. In this work, we applied lumpGEM on all biomass building blocks (BBB) of E. coli iJO1366 and S. cerevisiae iMM904 and generated different subnetworks that can participate in producing biomass. We also used lumpGEM to re-define the pathway and subsystem definitions for the biosynthesis of BBBs, and identified that many different subsystems participate for the production of a single BBB. Moreover, by lumping the generated subnetworks, lumpGEM allowed us to identify the individual contribution of core metabolites and cofactors for the synthesis of each BBB. This procedure is a very promising method for many applications, such as experimental studies like MFA [44] and in silico studies like FBA, TFA [45,46] and atom mapping analysis. The main advantage of lumpGEM is its capability of making the ad-hoc selected core models consistent with genome-scale model (GEMs) in terms of biomass requirements. Metabolic Flux Analysis (MFA) analysis can benefit from such approach, since lumpGEM identifies the expenditure of core metabolites for the biosynthesis of biomass building blocks, thus accounting correctly for the metabolic costs. lumpGEM can also be used to build synthesis pathways for any metabolite defined in the metabolic network. This makes it a versatile tool to study the characteristics of industrial chemical production strains [47], since it identifies all the enzymes, either linearly connected or nested that participate in the biosynthesis of the target compound. This approach can be used to compare the similarities and differences between the host organisms to produce a target chemical, since lumpGEM can compare the metabolic costs and capabilities of different organisms for the biosynthesis of an industrially relevant molecule. Apart from pathway and subsystems/subnetworks analysis, lumping strategy reduces the complexity of the networks significantly. By exploiting this property of lumpGEM, we built a reduced model that has an ad hoc defined core with a biomass yield very close to its parent GEM model. With a systematic approach to define the core [48], we can generate representative reduced models that are consistent with their GEM for different studies, such as kinetic modelling [49][50][51], in where it is crucial to base the analysis on models that do not sacrifice stoichiometric, thermodynamic and physiological constraints.

Materials and methods
The algorithm considers a core system of metabolites and reactions and identifies them in genome scale model (GEM) of the organism. Using this GEM, it decomposes the biomass composition of the organisms into individual biomass building blocks (BBB). lumpGEM source code is available in S3 File.
In lumpGEM, we introduce and use the following definitions: where, v BBB, j : The sink that is created in Step 1.a for BBB j for its biosynthesis. μ max : Theoretical maximum specific growth rate for the given physiology in 1/hr units. n j, GEM : The stoichiometric coefficient for BBB j in mmol/gDW unit as defined in original GEM.
Postulate 2: Any M core is a potential precursor for the biosynthesis of BBB j . Postulate 3: Maximizing for the sum of z rxn, i results in the smallest subnetwork S j to produce BBB j from M core . This subnetwork is not necessarily composed of only linear pathways as reported in databases such as KEGG [53], SEED [54] or EcoCyc [40] etc. and may include branches.
Postulate 4: The flux distribution for each generated subnetwork cannot guarantee an optimum flux distribution that will specify the individual stoichiometric contribution of each M core to synthesize BBB j due to the degrees of freedom (DOF) that the system has.
Moreover, only the defined core reactions and metabolites of the GEM built in Step f for generating subnetworks is constraint with thermodynamics, and the generated subnetworks are constrained by only mass-balance.
To test the thermodynamic feasibility of these subnetworks, we have built the following MILP formulation for each S j : a. Generate a model comprised of: i. M core and R core ii. S j iii. v BBB b. Apply thermodynamic constraints on this model as described in [46,52].
Postulate 5: Minimizing the sum of net flux in S j generates a stoichiometrically proportional flux distribution in the subnetwork S j . This leads to the exact stoichiometric expenditure of each M core to synthesize BBB j . d. Lump R sub with respect to the flux distribution obtained after minimization. This is collapsing the reactions into 1 overall reaction that is stoichiometrically equivalent to the flux distribution generated above.

Generating alternative subnetworks for each BBB j
To identify alternative subnetworks for BBB j , GEM is further constrained with the following integer cuts constraint after generating each S j with an iterative manner [33].
Postulate 6: Since R sub k is active if only z R sub k ¼ 0, the next solution will have at least 1 different reaction from the previous solution. Aftermath, the same procedure is applied for the newly generated S j,2 .
Supporting information S1 File. The subnetworks generated for E. coli and S. cerevisiae biomass building blocks. All generated subnetworks for all the biomass building blocks of E. coli iJO1366 and S. cerevisiae