Identifying All Moiety Conservation Laws in Genome-Scale Metabolic Networks

The stoichiometry of a metabolic network gives rise to a set of conservation laws for the aggregate level of specific pools of metabolites, which, on one hand, pose dynamical constraints that cross-link the variations of metabolite concentrations and, on the other, provide key insight into a cell's metabolic production capabilities. When the conserved quantity identifies with a chemical moiety, extracting all such conservation laws from the stoichiometry amounts to finding all non-negative integer solutions of a linear system, a programming problem known to be NP-hard. We present an efficient strategy to compute the complete set of integer conservation laws of a genome-scale stoichiometric matrix, also providing a certificate for correctness and maximality of the solution. Our method is deployed for the analysis of moiety conservation relationships in two large-scale reconstructions of the metabolism of the bacterium E. coli, in six tissue-specific human metabolic networks, and, finally, in the human reactome as a whole, revealing that bacterial metabolism could be evolutionarily designed to cover broader production spectra than human metabolism. Convergence to the full set of moiety conservation laws in each case is achieved in extremely reduced computing times. In addition, we uncover a scaling relation that links the size of the independent pool basis to the number of metabolites, for which we present an analytical explanation.


Introduction
When studying metabolic networks at the scale of the whole genome, it is often the case that the information required to develop dynamical models is not available, because either kinetic parameters or reaction mechanisms are partially or fully unknown. In many cases, the most reliable information is encoded in the stoichiometry of the reaction network (the stoichiometric matrix) and, partly, in the assignment of reaction reversibility [1][2][3][4]. Based on these, constraint-based models like Flux Balance Analysis (FBA) have been able to shed light on functional optimality in different contexts, providing (mostly for unicellular organisms) unprecedented predictive power on issues like the organization of reaction fluxes, response to knock-outs, or gene essentiality [5][6][7][8]. FBA can now almost routinely be performed on genome-scale networks [9]. Stoichiometric matrices however harbor a host of additional physical, biological and functional information [5,[10][11][12], including regulatory (see e.g. extreme pathways [13,14] and flux modes [15]), robustness (see e.g. the geometry of the FBA solution space [16][17][18]) and statistical (see e.g. the individual distributions of allowed fluxes and flux-flux correlations [18][19][20]) characterizations. Unluckily, the full solution of the problems just listed on genome-scale networks with thousands of reactions and metabolites presents serious computational challenges, as the algorithms currently available do not scale gently with the system size. This is also the case for the identification of moiety conservation laws that we shall consider here [10,21,22].
Conservation relationships for concentration variables emerge in biochemical reaction networks from the sheer structure of their input-output stoichiometry [10]. In particular, given a stoichiometric matrix, it is generically possible to find linear combinations of metabolite levels that are due to be constants of motion of the dynamical system governing the time evolution of concentrations and reaction rates. For instance, if the subset of metabolic reactions describing energy metabolism is considered, ATP and ADP would be coupled in every reaction -one is a product whenever the other is a substrate -, so that the aggregate concentration of ATP and ADP would remain constant over time while individual levels may fluctuate in a correlated way. The existence of such relationships has profound consequences. In first place, any intervention aimed at altering the level of a certain metabolite should consider whether its variations are limited or not by conservation laws. Secondly, conservation laws constrain a network's production capabilities, as there clearly cannot be a net stationary output of a compound belonging to a conserved quantity. Therefore, mapping out these conservation laws amounts to obtaining a genome-scale picture of what a cell can (in principle) excrete or make available to processes outside metabolism, in particular secondary metabolism, e.g. the production of pigments and antibiotics. Finally, conserved quantities imply an effective reduction of the number of independent flux or concentration variables in a reaction network, an important aspect especially for dynamical modeling.
The problem of finding generic conservation laws is relatively straightforward to solve with the tools of linear algebra, since, as said above, they correspond to specific (linear) dependencies of the rows of the stoichiometric matrix. When one is interested in the conservation of particular chemical moieties, however, the problem takes a more challenging twist. Indeed, because of intrinsic discreteness, the combinations that describe moiety conservation should only be constructed with non-negative integers. We refer to these particular combinations as 'moiety conservation laws' (MCLs), and we shall be interested in finding a basis for MCLs (i.e. a maximal linearly independent set of MCLs) in large, genome-scale metabolic networks. Computing an MCL basis in a large system is much harder than finding linear rowdependencies, as one passes from a linear to an integer-linear programming problem with the concomitant increase of computational complexity up to NP-hard [23,24]. On the other hand, knowledge of an MCL basis allows, as we shall see, to map out exhaustively a large class of conservation laws in biochemical networks.
The literature on conservation laws in reaction networks, mostly focused on characterizing the so-called semi-positive conservation laws (SPCLs, given by linear combinations of MCL basis vectors), is quite rich. Pioneering mathematical analyses of SPCLs in biochemical systems date back to the early 1990's [21,25]. Besides clarifying the origin and significance of these invariants, such studies have shown that conservation laws can be encoded in a convex representation of the left kernel of the stoichiometric matrix [25]. Later on [22], a classification of conservation laws has been proposed based on an analogy between properties of the right (extreme pathways) and left null spaces of the stoichiometric matrix. The computational demand of finding MCLs in genomescale networks has driven further work concerned with the development of ad hoc algorithms [26,27]. After the seminal proposals made in [25], already in [22] a method is presented, that however scales exponentially with the system size and can only be employed for the analysis of rather small networks. The Metabolite Concentration Coupling Analysis (MCCA) and the Minimal Pool Identification (MPI) tools for genome-scale metabolic network analysis were instead introduced in 2005 [28]. The former allows for the identification of subsets of metabolites whose concentrations are coupled within common SPCLs, while MPI helps to determine the SPCLs for individual metabolites. More recently [29], the formal algebraic duality of metabolite producibility and conservation has been exploited to devise a method that relates biomass producibility to nutrient availability, which was then applied to the metabolism of Escherichia coli, obtaining a large set of novel putative growth media. Then in [30] a mixed integerlinear programming problem has been posed and solved in order to find the ensemble of all metabolites that appears in SPCLs. Furthermore, in modeling metabolic networks in terms of Petri nets, the problem of finding MCLs has been connected with the search for the so-called P-invariants of the network [31]. It is worth noting that metabolite levels are in principle experimentally accessible (e.g. by mass spectrometry), although such knowledge is mostly used in metabolomic analysis to reconstruct flux patterns.
Despite many efforts, a consistent computational method to determine all MCLs in genome-scale networks has remained so far elusive. In this work we construct such a computational method. The technique we propose exploits the above mentioned duality and combines different kinds of algorithms (message passing, Monte Carlo and relaxation). The mathematical background and the structure of the method we employ are discussed in 'Materials and Methods'. As case studies, we have considered different reconstructions of the metabolic network of the bacterium E. coli and six tissue-specific human metabolic networks derived from the Recon-2 database [32]. In particular, we have been able to identify in each case all MCLs in different conditions (see 'Results'). Furthermore, by studying E. coli we have uncovered a relation between the number of pools and the size of a network (number of metabolites and/or reactions), a theoretical justification for which is also discussed, before reporting our conclusions.

Methods
Background. Given a metabolic network encoded by the stochiometric matrix S~(S mr ), where S mr is the stochiometric coefficient of metabolite m[f1, . . . ,Mg in reaction r[f1, . . . ,Ng (with the standard sign convention to distinguish substrates from products), the time evolution of the concentration vector c~fc m g satifies where v~fv r g is the vector of reaction fluxes and we have assumed that the stoichiometry of metabolite exchanges with the environment is included in S. Consider a linear combination Q of concentration variables with fixed coefficients k m §0, i.e.
where the bracket ( : , : ) stands for the scalar product. Clearly, So if k belongs to the left null-space of S, that is if then the aggregate concentration variable Q is conserved in any flux state v. Following [21], we shall generically call a conserved metabolite pool like Q, defined by a vector k satisfying (4), a 'semipositive conservation law' (SPCL). From a physical viewpoint, a SPCL represents an invariance constraint that is required to be satisfied by trajectories of the dynamics of the system, i.e. by solutions of (1) with given specifications of how v depends on c. In principle, every vector k belonging to the left kernel of S (without sign restriction) defines a conserved quantity, to which we can refer generically as a 'conservation law'. The total number of linearly independent conservation laws of this type equals the dimension of the left null-space of S, i.e. M{rank(S). This number also provides an upper bound to the number of linearly independent SPCLs, namely The restriction to k §0 in (4) allows for a more straightforward interpretation of quantities like Q. Considering, for instance, the toy network formed by the three reactions a simple calculation shows that two conserved quantities exist, i.e. Q 1~cB zc E zc D and Q 2~cE zc F . However, Q 2 could be also written as Q' 2~cF {c B {c D . While both Q 2 and Q' 2 are conserved quantities, Q 2 can be interpreted to be a total enzyme mass if E and F represent, respectively, a bound and a free enzyme species. A similar physical interpretation is harder to imagine for Q' 2 (and the situation rapidly becomes more complicated in larger networks). Among solutions of (4), those for which the coefficients k m are non-negative integers can be fully rationalized in chemical terms as related to the conservation of moieties, groups or chemical elements. We shall hereafter define a 'moiety conservation law' (MCL) as a solution of The problem we face here concerns the identification of all MCLs of a given stoichiometric matrix S. Furthermore, we will be interested in constructing an integer basis of the left null space of S through which all SPCLs can be obtained as linear combinations with real coefficients. From a mathematical view point, finding such a basis represents a variation of the more general problem of computing the minimal integer Hilbert basis of the polyhedral cone defined by the left null-space of the stoichiometric matrix, which is known to be NP-hard [33,34]. Although some exact deterministic algorithms are available [35,36], their computational costs become too high when the underlying network is sufficiently large. In particular, for the sizes relevant to genome-scale metabolic modeling (N,M * > 10 3 ) one always runs into the combinatorial explosion of computation times. Incidentally, if this integer basis suffices to generate all conservation laws via linear combinations with real coefficients, then all conservation relationships encoded in S are SPCLs and the number of MCLs in the basis saturates the bound (5). Otherwise, S necessarily allows for at least one conservation law that cannot be expressed through a SPCL but also involves negative coefficients. In such cases, we have #findependent MCLsgvM{rank(S) and an integer basis of the left kernel, rigorously speaking, does not exist.
It is very important to stress that a high computational complexity does not necessarily imply the existence of an exponential number of elements of the MCL basis (as a matter of fact, we shall see that the basis size grows roughly linearly with the network size); rather, it quantifies the time required to find a single element of the basis in the worst case. Roughly speaking, combinatorial explosion of computing times occurs when one has to go through an exponential number of candidates (k-vectors in our case) before finding a solution of (7). It is rather intuitive that finding all solutions of (7) by a deterministic procedure would require exploring the entire space of k-vectors, which assuming non-zero upper bounds for the k m 's is indeed exponentially large in the number of metabolites.
In order to find all independent solutions to (7), stochastic strategies are therefore mandatory. In brief, we shall map this task to a global optimization problem whose solution can be retrieved via stochastic algorithms known to be exact in special situations, a kind of approach that has been used before with considerable success in the solution of other NP-complete and NP-hard problems [18,[37][38][39]. Our strategy is divided in three steps: (a) compute a list of all metabolites belonging to at least one MCL; (b) construct an integer basis for MCLs from that list; (c) check that S does not allow for any further MCL.
Step (a) can be carried out in different ways, starting, as we shall see, from a straightforward analysis of the kernel of S T . We shall also discuss here a more involved but more informative approach based on a messagepassing procedure.
Step (b) will be done by Monte Carlo and step (c) by a relaxation algorithm. The conceptual implementation of each step is sketched below and a C++ code performing the complete procedure can be downloaded from http://chimera. roma1.infn.it/SYSBIO/, together with a test case (the E. coli iAF1260 network). Further details and work flow are given in the Text S1.
Step (a): finding all metabolites belonging to at least one MCL. A first, elementary preprocessing step may consist in computing the left kernel of S, e.g. by Gaussian elimination. Evidently, every vector in a basis of Ker(S T ) is a generic, linearly independent conservation law, i.e. a non-null solution of S T k~0 with real-valued k. Eventually, some of these vectors will be such that k m §0 for each m, i.e. will be outright SPCLs. The search for MCLs can be carried out among metabolites that appear in this basis with positive components. Furthermore, the size of such a basis is an upper bound for the maximum number of linearly independent MCLs. In order to compute the left kernel, we resort to a Gaussian elimination method with partial scaled pivoting, a technique currently employed to deal effectively with possible illconditioning of the stoichiometric matrix under control. We however cross-checked results, in every case, with the results obtained by the robust routines employed by both Mathematica and MatLab softwares. Ultimately, though, the relaxation algorithm discussed below provides a definitive certificate for correctness, in the sense that convergence only occurs when all kernel vectors have been found (and removed from the stoichiometric matrix, see below).
A more complex but (in our view) more rewarding alternative is suggested by the following considerations. Because MCLs are represented by solutions of (7), one could obtain a statistical picture of the set of MCLs by computing the marginals of the probability distribution where d(x; y) is the Kronecker delta function (~1 if x~y and~0 otherwise) and N sol stands for the number of solutions of (7). Indeed, each marginal represents the probability that the m-th component of the k vector attains a value k m over the solution space of (7). Disposing of all such marginals is equivalent to disposing of the list of metabolites belonging to at least one MCL, since P m (k m )~0 implies that metabolite m does not belong to any MCL. Actually, the set of marginals provides much more information than that: in fact, by definition, where in the last equivalence we stressed the probabilistic interpretation of the marginal as the histogram of the m-th coordinate over the solutions fk s g, s~1, . . . ,N sol . So, P m is proportional to the number of distinct MCLs to which metabolite m belongs. Unluckily, a direct evaluation of (10) requires computing a sum over k M{1 max terms (assuming k m [f0,1,2, . . . ,k max g), which becomes computationally infeasible for M larger than a few tens. We can however estimate each marginal effectively by resorting to message-passing (MP) techniques. MP is an efficient computational strategy (with a running time scaling linearly, as opposed to exponentially, with the number of variables) that is formally exact on locally tree-like networks [40][41][42] and is extensively used as a heuristic procedure to solve hard computational problems defined on sparse or even loopy networks [37,43,44]. It has been also applied previously to the analysis of metabolic networks [18].
In short, we have devised a MP algorithm to compute the marginals (10) and obtain a full statistical representation of the space of MCLs. Details are given in the Text S1. Upon convergence, when all marginal probability distributions are evaluated, one disposes of a list of metabolites belonging to at least one MCL (with the additional information described above). The following step concerns the construction of an integer basis for MCLs from this list.
Step (b): constructing the basis. Because MP will have considerably pruned the set of metabolites (thereby greatly reducing the number of variables: we shall now denote by M c the number of metabolites belonging to at least one MCL), the most convenient method to explore the structure of individual MCLs is Monte Carlo. Indeed, the problem of finding the integer solutions of (7) can be mapped onto that of finding the minima of the fictitious, discrete 'energy function' given by where k m [f0,1,2, . . .gVm and J m,n : Note that E(k)~0 if k satisfies (7). Therefore MCLs correspond to the minima of E. Several controlled and optimized Monte Carlo methods are available to compute the minima of functions like (11) [45]. More specifically, these protocols (the simplest of which is probably the Metropolis scheme) are capable of sampling vectors k distributed according to where Tw0 is a parameter and Z(T) a normalization factor. The minima are recovered upon slowly decreasing T ('annealing') in the limit T?0.
In this work, in order to retrieve the individual MCLs as ground states of E, we have performed iterated Metropolis-based annealings to minimize the energy (11) as detailed in the Text S1. Eventual linear dependencies among the retrieved minima can be resolved by Gaussian elimination to yield an actual nonnegative integer basis for the space of MCLs. In case superpositions of independent MCLs (for which E~0 as well) are identified as minima of E by Monte Carlo, they can be easily reduced by iteratively subtracting the other independent MCLs found, taking care to maintain the non-negativity constraint for the coefficients k m .
The solution spaces of (7) and (14) are linked by the Motzkin theorem of the alternative, which can be stated as follows: Theorem (Motzkin, 1936). Consider any arbitrary subset R of rows of S. Then, either there exists a solution v * to system (14) such that all inequalities corresponding to the subset R hold strictly, or system (7) has a solution k * , with k Ã m w0 for each m[R. In essence, Motzkin's result guarantees that solutions of (14) verify strict equalities for metabolites belonging to MCLs. This is rather intuitive if one interprets strict inequalities in (14) as conditions for metabolite producibility [29,46]. Luckily, a solution of a subset of constraints in (14) with strict inequalities can be found very efficiently by relaxation algorithms (e.g. MinOver [47] or Motzkin's scheme [48]). These classic methods work by correcting iteratively (t being the step) the least unsatisfied constraint, according to the scheme l being a parameter that can be fixed in different ways, from a constant (as in MinOver [49]) to a quantity proportional to the amount by which the constraint is violated (as in the so-called Motzkin scheme [48]). The above dynamics converges to a solution, if one exists, in polynomial time. Therefore a simple numerical check to confirm that all MCLs have been found consists in looking for a solution of (14) with strict inequalities for all m's remaining after having removed from S the rows corresponding to metabolites belonging to at least one MCL: when all MCLs have been found, a solution necessarily exists and relaxation converges to it. Further detailes are reported in the Text S1. This completeness analysis finalizes our protocol.

Materials
We shall apply the method described above to find MCL bases for two reconstructions of E. coli's metabolic networks of rather different sizes, namely iJR904 [4], with M~761 chemical species and N~1074 reactions, and iAF1260 [50], with M~1668 and N~2381. These numbers refer to the sizes of the respective stoichiometric matrices including all uptakes but excluding the biomass reaction. Two limiting cases for the choice of the exchange fluxes will be considered. First, we shall analyze MCLs formed in a 'rich medium', where all uptake reactions are active. Then, we shall look at the case of 'minimal medium', by studying MCLs after having eliminated part of the intakes, while keeping exchange reactions for ca2, fe2, glc-D, h2o, h, k, mg2, mn2, na1, nh4, o2, pi, so4 and zn2. (In the latter case a much larger number of MCLs is to be expected.) We shall see that, while for iAF1260 independent MCLs suffice to generate all conservation laws as linear combinations with non-negative coefficients (in other words, the bound (5) is saturated), the model iJR904 presents one conserved quantity that cannot be expressed as a linear superposition of MCLs with non-negative coefficients.
We have furthermore computed the independent MCLs emerging in 6 tissue-specific reconstructions of human metabolic networks derived from the reactome Recon-2 [32] (bone marrow, breast glandular, heart muscle, hippocampus glial and neuronal and liver hepatocytes), as well as in the entire reactome. The latter case serves mainly to obtain computing times in a worst case (largest reconstruction tested). MCLs for the human networks were obtained using stoichiometric matrices that include all uptakes and exclude the (eventual) biomass reaction. Table 1 lists the independent MCLs found for iAF1260 with a 'rich medium'. They are 38 in total, matching exactly the dimension of the left kernel of the stoichiometric matrix.

E. coli iAF1260
20 of them (numbers 1-19 and 32) are formed by a tRNA in two forms: free and bound to its corresponding amino-acid. To have a physical interpretation, we note that if a model possesses a conserved quantity corresponding to a chemical moiety, then that model is closed with respect to that moiety, in the sense that it does not allow for changes in the level of that particular chemical group. In this sense, MCLs based on a tRNA reflect the fact that, in the model where they have been found, the expression of each tRNA is necessarily constant (more precisely, it is assumed to change on time scales longer than those over which metabolite levels equilibrate).
Compounds in MCL 20, arbutin 6-phosphate (arbt6p) and hydroquinone (hqn) only occur in one reaction (arbutin 6phosphate glucohydrolase: arbt6p + h2o R g6p + hqn). From a topological point of view, they represent 'leaves' of the reaction network, in the sense that arbt6p is not produced by any reaction while hqn is not consumed by any reaction. The fact that their overall level is invariant is merely due to this peculiar geometric property. MCL 21 is composed by hydrogen cyanide (cyan) and thiocyanate (tcynt) in their cytoplasmic form. Interestingly, the aggregate concentration of these compounds is conserved despite the fact that, in the rich medium, there are uptakes for both. This is due to the fact that the model lacks reactions that transport the periplasmic species into the cytoplasm. Exactly the same situation holds for the MCLs 22 and 23, formed by dymethyl-sulfide (dms) and -sulfoxide (dmso), and by thrymethylamine (tma) and thrymethylamine-N-oxide (tmao), respectively.
MCLs 24-28 express the conservation of the level of lipoproteins (apolipoprotein, disulfide isomerase I and II, disulfide interchange and oxidase). Notice that MCL 28 is the only one to involve periplasmic species exclusively.
Compounds in MCLs 33 and 34 are all based on the element selenium, with respect to which the model is closed (i.e. there are no uptakes of selenium-based compounds). We also note from this example that independent MCLs can be overlapping, in the sense that the same compound (sectrna in this case) may belong to different conserved quantities. In such cases, the levels of metabolites in the overlapping groups become effectively coupled, allowing to re-cast the moiety conservation laws associated to the overlapping pools in simpler terms. In the present case, for instance, the fact that sectrna is shared by MCLs 33 and 34 corresponds to the conservation of the single quantity c seln½c zc selnp½c {c sectrnasec½c {c trnasecys½c . MCL 35 reflects the conservation of the level of biotin, while the sulfur atom is a leaf of the network, appearing only in the biotin synthase reaction.
Finally, MCL 38 represents the conservation of the level of acyl carrying protein (ACP).
We can argue that a suitable set of additional uptakes (comprising tRNAs, selenium, disulfide proteins, the aforementioned redox enzymes, biotin, and ACP), together with the missing periplasm-cytosol transport reactions (for cyan, tma and dms), will render the iAF1260 network completely open, thereby allowing for the possibility that the levels of each of the chemical moieties appearing therein are altered. Eliminating uptakes, on the other hand, will generate additional MCLs. Table 2 reports the 36 additional independent MCLs that occur in iAF1260 in a 'minimal medium' containing only ca2, fe2, glc-D, h2o, h, k, mg2, mn2, na1, nh4, o2, pi, so4 and zn2 (i.e. by allowing for 14 of the 299 possible uptakes). A close inspection reveals that many of these MCLs emerge from the lack of uptakes for elements like silver (39), cadmium (40), nickel (41), molybdenum (42), cobalt (43), tungsten (44), mercury (46), chloride (47), arsenic (55) and copper (64). A more detailed biochemical analysis is required to interpret the remaining MCLs.
Notice that, while 72 of the MCLs discussed above correspond to solutions of (7) with k m [f0,1g Vm, MCL 72 has k m [f0,1,2g Vm while MCL 73 has k m [f0,1, . . . ,4g Vm. This shows that, while in general identifying SPCLs cannot be treated as a Boolean problem, the range of values of k m to be considered in (7) can be relatively small.

E. coli iJR904
For sakes of comparison, in Tables 3 and 4 we report the independent MCLs found in the iJR904 reconstruction of E. coli's metabolism in the 'rich' (all uptakes allowed) and 'minimal' (defined in the same way as for iAF1260) media, respectively. One can see that, in essence, the MCLs of iJR904 are included among those of iAF1260, with some simplifications. For instance, the pool related to ACP conservation (number 17 in Table 3) is smaller in iJR904 than it is in iAF1260. Notice also that MCL 29 in Table 4 displays two anomalous coefficients k m~5 0, due to the effective, non-integer stoichiometry with which the corresponding compounds occur in the reconstruction. When a compound is represented by a small non-integer coefficient in a reaction, solutions of (7) will typically take on large values of k m .
Considering the 'rich medium', it is interesting to note that, even though Table 3 exhausts all of its MCLs (17 in total), the existence of an additional conservation law is revealed by studying the left kernel of the stoichiometric matrix, whose dimension turns out to be 18 rather than 17. The corresponding conserved quantity cannot be expressed as a linear combination of MCLs with non-negative coefficients. In particular, it is formed by the levels of 7 metabolites, with the formula Once we move to the 'minimal medium', however, the chemical species pertaining to this conserved quantity fall into well defined MCLs, namely numbers 30 and 31 in Table 4. The additional 14 independent MCLs generated by iJR904 in a 'minimal medium' suffice to describe all conservation laws for this network.

Human metabolic network reconstructions
Finally we turn our attention to the independent MCLs arising within the human reactome Recon-2 and in six of the tissuespecific networks derived from it [32]. The full details of the MCLs we found are provided as Material S1. To summarize results, the sizes of the bases (number of independent MCLs) and the convergence times of the method (specifically, of the C++ code downloadable from http://chimera.roma1.infn.it/SYSBIO/) are reported in Table 5.
We note that if a MCL occurs in Recon-2 and its metabolites are present in a specific subnetwork, then the same MCL will be present in the subnetwork as well. An example is given by the aggregate level of mithocondrial NAD and NADPH, which is conserved in the complete reactome and in all subnetworks we tested. On the other hand, a MCL may be present in a tissue specific network but not in the full reactome, as is the case for the MCL describing the conservation of the sodium cathione, which occurs in all subnetworks we tested but not in Recon-2, due to the presence of a sodium uptake in the latter but not in the tissuespecific reconstructions.
Only one of the networks we analyzed presents conservation laws that cannot be derived from MCLs, namely the bone marrow cell metabolic network. In this particular case, the convergence time of the method suffers from the lack of a stopping criterion in terms of the dimension of the kernel, so that verification of completeness by a relaxation dynamics is mandatory. In just two other cases convergence times exceed a few seconds, namely for the entire reactome (whose size is over twice as large as that of any other network we tested) and the hippocampus glial network. In the latter case, a large MCL related to the conservation of coenzyme A is present, whose discovery requires a longer computing time. In the worst case (Recon-2), however, the time for the full procedure to converge and output the basis vectors is just over 3 minutes on an Intel Dual Core at 3.06 GHz.
Finally, we note that the number of MCLs and of metabolites involved therein considerably exceed those found in the E. coli reconstructions, even though the network sizes are comparable at least with iAF1260. Considering that a reduced structure of MCLs correlates with larger production capabilities, this suggests that bacterial metabolic networks may be intrinsically designed by evolution to cover a broader spectrum of production profiles than their human counterparts. It will be important to validate this scenario on future, more accurate reconstructions of human metabolism.

Scaling of the number of independent MCLs with the network size
In Figure 1 we display the size of the MCL basis (i.e. the number of independent MCLs) as a function of the network size (M in this plot) for the E. coli networks we have considered above as well as for a smaller reconstruction, namely the core matrix of the iAF1260 model [50] (formed by M~72 metabolites that interact through N~94 reactions), both for the 'rich' and 'minimal' media. The 'rich medium' for the E. coli core network allows for 20 uptakes, 5 of which (glc-D, o2, nh4, pi, h2o) survive in the 'minimal medium'. This network is easily seen to present, in both media, 5 MCLs with straightforward biochemical meaning: nad z nadh (nad conservation), nadp z nadph (nadp conservation), q8 z q8h2 (coenzyme-Q conservation), amp z arp z atp (adenylate moiety conservation) and coa z accoa z succoa (coenzyme-A conservation). It appears that the number of independent MCLs scales approximately linearly with the network size. (A similar study for the human tissue-specific networks is less fruitful, as the network sizes are similar in those cases and indeed the number of independent MCLs retrieved is roughly constant, as shown in the Text S1.) While the investigation of a larger family of networks is needed to characterize this regularity more thoroughly, it is instructive to analyze the origin of this scaling behaviour. Interestingly, some insight can be obtained from the analysis of random networks.
Consider an ensemble of 'random metabolic networks' with N reactions and M compounds, such that each stoichiometric coefficient is chosen randomly and independently with probabilities (0vcv1)  Prob(S mr~1 )~Prob(S mr~{ 1)~c=2: The parameter c rules the average connectivity of the network. In particular, we will assume that c~2c=N (with c a constant), and consider the limit N,M??, keeping a fixed ratio a~N=M. In this limit, the above model generates sparse stoichiometric matrices with Poisson distribution for the in-and out-degrees, with average values Sd m in T~Sd m out T~c for metabolites and Sd r in T~Sd r out T~c=a for reactions, respectively. (In real networks, the degree distribution of metabolites is known to have heavy tails due to the presence of ubiquitous compounds like water, ATP, etc. whose connectivities typically grow with the network size. On the other hand, the degree distribution of the remaining metabolites follows a Poissonian law to a good approximation.) By definition, the overall number of MCLs of size n (i.e. involving n chemical species) is given by where d(x) denotes Dirac's d-function, we assume k m [f0,1g for sakes of simplicity for each m, and DDkDD~P m k m . N n as defined above however includes all linear combinations of independent MCLs that produce a new MCL of size n. To obtain the number of independent MCLs, one should subtract from N n the contributions due to superpositions of smaller pools. For instance, all distinct pairs of pools of size 1 would contribute to N 2 as well, so that the number of independent pools of size 2 is given by Expression (20) furthermore depends on the particular network being examined. We shall focus on its average over the entire ensemble. Writing the d-function as d(x)~1 2p Ð 2p 0 exp(ixw)dw, summing over k m 's and averaging over the stoichiometry one finds Expanding the integrand and noting that which can be evaluated in the limit N??. For n~2, keeping only the leading-order terms in M and approximating SN 2 1 T^SN 1 T 2 in (21), one gets For the networks being examined, once the most connected compounds are removed, one finds c^1:6, leading to SN irr 2 T^2,15,32 for the E. coli core, iJR904 and iAF1260 models, respectively (considering a 'rich medium'). This should be compared with the actual numbers of independent pools of size 2 we found, namely 3,10 and 31 respectively. (Similar results can be obtained, with more work, for larger values of n.) It is now straightforward to show that the size B of the pool basis scales linearly with M. Upon summing (24) over n, the total number of pools N tot~P n SN n T is seen to satisfy In other terms, there exists a number z[½1,2 such that N tot~z M . On the other hand, assuming for simplicity that pools in the basis are non-overlapping, one has N tot~2 B , from which we get B~M log 2 z, i.e. a linear scaling with M, in agreement with the behavior displayed in Figure 1. Therefore, despite the coarse way in which it approximates the considerably more structured real metabolic networks, this ensemble of random networks does provide useful hints about the origin of the scaling behavior observed in cellular networks.

Discussion
Conservation laws described by the left kernel of the stoichiometric matrix S take on a specific biochemical significance when the coefficients involved in their definition are non-negative integers, in which case they typically describe the conservation of a particular molecular moiety. Moiety conservation laws have been shown to have relevant biotechnological or medical implications [51], and are bound to play a key role in understanding the dynamics of intracellular reaction networks. Indeed, conservation relationships, as specific dependencies among variables, allow in principle to reduce the number of degrees of freedom, thereby speeding up dynamical simulations. Unluckily, though, dynamical approaches are still largely prevented from limited knowledge of kinetic constants and reaction mechanisms. Identifying the independent conserved moieties embedded in a given S, however, requires solving the hard constraint-satisfaction problem of finding all integer, non-negative, solutions to a linear system of equations defined by the network's input-output relationships and corresponding roughly to the maximal conserved moieties introduced in [25]. Methods to tackle this issue on small networks are available and have been employed before. For the network sizes relevant in metabolic modeling, though, the a priori search space of the problem is huge and exact methods are doomed to fail because of exceeding computational costs. It is crucial to stress that intractability arises even in presence of a small number of independent MCLs from the fact that retrieving them by a deterministic method would require an exhaustive search through an exponentially large number of candidate solutions. (On the other hand, it is clear that problems with an exponential number of solutions are not necessarily hard to solve.) Luckily, powerful heuristics to tackle this type of problems has been developed in the last decade at the interface between statistical mechanics and computer science. Here, we have constructed and applied a technique that allows to obtain full information about the independent MCLs associated to a largescale stoichiometric matrix S in reasonable compsuting times (a few seconds for genome-scale metabolic networks). In particular, we combine different approaches (message passing, Monte Carlo, and relaxation algorithms) in order to construct a basis for MCLs through which all conserved metabolite pools (i.e. all semi-positive conservation laws) can be described. In short, we first prune irrelevant variables, then extract individual MCLs from the