Chemical Basis of Metabolic Network Organization

Although the metabolic networks of the three domains of life consist of different constituents and metabolic pathways, they exhibit the same scale-free organization. This phenomenon has been hypothetically explained by preferential attachment principle that the new-recruited metabolites attach preferentially to those that are already well connected. However, since metabolites are usually small molecules and metabolic processes are basically chemical reactions, we speculate that the metabolic network organization may have a chemical basis. In this paper, chemoinformatic analyses on metabolic networks of Kyoto Encyclopedia of Genes and Genomes (KEGG), Escherichia coli and Saccharomyces cerevisiae were performed. It was found that there exist qualitative and quantitative correlations between network topology and chemical properties of metabolites. The metabolites with larger degrees of connectivity (hubs) are of relatively stronger polarity. This suggests that metabolic networks are chemically organized to a certain extent, which was further elucidated in terms of high concentrations required by metabolic hubs to drive a variety of reactions. This finding not only provides a chemical explanation to the preferential attachment principle for metabolic network expansion, but also has important implications for metabolic network design and metabolite concentration prediction.


Introduction
One of the most intriguing findings in systems biology is that despite the varied constituents and metabolic pathways of three domains of life, their metabolic networks exhibit the same scalefree organization. That is, a small part of metabolites participate in a large number of reactions (which are also termed hubs), while others are involved in a few reactions [1]. As the scale-free architectures are robust and error-tolerant, this finding provides meaningful insights into the design principle of metabolic networks.
The scale-free organization of metabolic networks has been hypothetically explained in terms of evolution that the newrecruited metabolite members attach preferentially to those that are already well connected (rich get richer, also known as preferential attachment principle) [2][3][4]. This implies that the metabolic network hubs originated relatively earlier than others in evolutionary history [5]. However, several issues about this evolutionary explanation remain elusive. First, the molecular basis of preferential attachment principle has not been fully elucidated, as it is inexplicable how the new metabolites ''know'' which metabolites are well connected. Second, the evolutionary explanation to the metabolic network organization has little implications for network design, because we do not know how to choose metabolites as hubs to construct a new metabolic network. Since most metabolites are small molecules and metabolic processes are basically chemical reactions, we speculate that the metabolic network organization may have a chemical basis, which stimulated our interest to address these issues by combining bioinformatics and chemoinformatics. The latter is a discipline devoted to encoding, storing, managing, searching and analyzing all kinds of chemical data by information technology [6,7].

Correlations between network topology and chemical properties
Primarily, we explored the relationships between network topology and chemical properties for the metabolites recorded in Kyoto Encyclopedia of Genes and Genomes (KEGG). As illustrated in Figure S1, the metabolic network of KEGG is scale-free. There are 154 metabolites with degrees (defined as the number of edges linked to the metabolites) higher than 10, while 1180 are connected with only one metabolite. As shown in Table 1 and Figure 1, there exist qualitative and even quantitative correlations between degree and some chemical properties. In particular, molecular polarity, characterized by partition coeffi-cients (ClogP, AlogP and LogD), ratio of atomic charge weighted partial positive surface area on total molecular surface area (FPSA3) and water solubility, rises with the increase of degree. Similar correlations can be observed for the metabolic networks of Escherichia coli (E. coli) ( Figure 2) and Saccharomyces cerevisiae (S. cerevisiae) ( Table 2). Therefore, it seems that metabolites get more polar and thus more water-soluble with the rise of degrees, which implies that the organization of the metabolic networks has a chemical basis. It is of apparent interest to explore the reasons underlying these correlations.
Explanation to the correlations between network topology and chemical properties As metabolic reactions are basically chemical reactions, it is natural to resort to chemical principles to explain the correlations. It is well known that the precondition for a chemical reaction to occur is DG = DG 0 + RTlnQ ,0, where Q is the reaction quotient and is determined by the relative concentrations of reactants and products. Thus, for metabolites that participate in a large number of reactions as reactants (which usually have large degrees, as shown in Table S4), they must reserve high concentrations (quantities) to drive the reactions. Since metabolic reactions mainly occur in non-membrane systems which are hydrophilic environments, the metabolic network hubs must be highly water-soluble to reach high concentrations, which means that the hubs tend to be strong-polar. Therefore, the observed correlations between degree and chemical properties could be basically explained in terms of chemical property requirements of metabolic hubs. This explanation is supported by the correlations between degree and metabolite concentration and between metabolite concentration and chemical properties.
Recently, the absolute concentrations for over 100 metabolites of E. coli, exponentially growing in aerobic environment, were determined by Bennett and co-workers [8]. The concentrations of the measured metabolites are strongly biased. The top 10 abundant compounds account for 77% of the total concentration, while the less abundant half comprise only 1.3%, reminiscent of the topological structures of metabolic networks. As shown in Figure 3, there exists a correlation between the concentration and degree for E. coli metabolites. The metabolites with larger degrees have relatively higher concentrations and the degrees decline gradually with the drop of concentrations. However, one may argue that the metabolite concentrations oscillate during different phases of life, so how the concentrations of metabolites can correlate with degrees of connectivity-a static property? The answer resides in the fact that the amplitude of metabolite oscillation is rather low. For instance, during the life cycle of a yeast cell the amplitude of metabolite oscillation is usually within 10-fold, with a median of ,2.4-fold [9]. Therefore, it is reasonable to consider that the observed correlation between degree and metabolite concentration (at the level of order of magnitude) is robust.
A stepwise multiple linear regression analysis was conducted by SPSS (Version 15.0. SPSS Inc. Chicago, IL.) to select the most meaningful chemical properties from 83 descriptors to correlate with negative logarithm of E. coli metabolite concentrations (2LogC). The final regression equation is: 2LogC = 6.105 + 0.431 6 "ClogP" + 15.595 6 "FNSA3" + 16.727 6 "FPSA3" 2 5.333 6 "RPCG", in which ClogP, FNSA3 (ratio of atomic charge weighted partial negative surface area on total molecular surface area), FPSA3 and RPCG (ratio of most positive charge on sum total positive charge) are all descriptors characterizing molecular

Author Summary
The metabolic networks of the three domains of life exhibit the same scale-free organization, which has been hypothetically explained in terms of preferential attachment principle. Here we reveal that the scale-free organization of metabolic networks may have a chemical basis. Through a chemoinformatic analysis on metabolic networks of Kyoto Encyclopedia of Genes and Genomes (KEGG), Escherichia coli and Saccharomyces cerevisiae, it was found that the metabolites with higher degrees of connectivity (hubs) are of relatively stronger polarity. The reason underlying this phenomenon is that to drive a variety of reactions, metabolic hubs have to be highly concentrated. Since the intracellular environments are hydrophilic, metabolic hubs have to be strong-polar to reach high concentrations. This finding has direct implications for metabolic network design and provides a chemical explanation to the preferential attachment principle, which has been validated by numerical simulations of metabolic network expansion. In addition, the correlations between metabolite concentration, metabolic network topology and metabolite chemical properties also suggest that we can use chemical and topological properties of metabolites to predict their intracellular concentrations. A support vector regression model has been successfully established to predict the metabolite concentrations for Escherichia coli. polarity. The fitted concentrations by the chemical properties correlate well with the experimental values ( Figure 4), indicating that the metabolite concentrations (at least for E. coli) are determined to a certain extent by their polarity and solubility, namely, strong-polar metabolites have relatively high concentrations. This finding is similar to the observation about protein abundance of E. coli that highly abundant proteins are on average more hydrophilic than those with low copy numbers [10]. However, in protein-protein interaction (PPI) networks, protein degree is negatively correlated with concentration [11], just contrary to the observation on metabolic networks. The underlying reason was suggested as that the hub proteins of PPI networks tend to use hydrophobic residues at surface to bind diverse partners through nonspecific hydrophobic interactions [11]. The cellular concentrations of hub proteins are thus constrained by their hydrophobicity. Therefore, the different behaviors of PPI and metabolic network hubs can be well understood by basic chemical rules. Taken together, the above observations offer an explanation to the correlation between topology and chemistry of metabolic networks. This finding also provides new clues to understanding the molecular basis of preferential attachment principle underlying the evolution of metabolic networks.

Chemical basis for the preferential attachment principle
Since life originated from water environments, the primordial metabolites must be highly hydrophilic. With the evolution of organisms, more and more complex membrane systems evolved, which required hydrophobic metabolites to perform intercellular and intracellular communications [12]. As a result, the evolutionary direction of metabolites is from hydrophilic to hydrophobic, which is clearly shown in the chemical evolution of S. cerevisiae metabolomes (Table 3). According to the correlation between metabolite concentration and chemical properties (Figure 4), it is reasonable to infer that the early-originated metabolites have relatively higher concentrations than the late-recruited counterparts in water environments. Since high-concentrated metabolites have more potential to drive new reactions, it is understandable why the new-recruited metabolites prefer to select old members as initial reactants (because they are more abundant and thus more accessible). Taken together, the present analysis reveals that metabolite concentration is a key factor to govern the metabolic network expansion. Although the late metabolites can not ''know'' which counterpart is well connected, they can ''sense'' which member is abundant, which provides a self-consistent explanation to the preferential attachment principle in terms of chemistry.
This explanation was validated by numerical simulations that were based on three rules. First, the network expands continuously by adding new metabolites (vertices) with a constant rate, namely, n metabolites are added in each step (n = 1 in the present simulations). Second, the newly added metabolites have lower  concentrations compared to the old ones, i.e., there is a declining trend for the concentrations of emerging metabolites. Third, the metabolites of higher concentrations have higher probability to be involved in the emerging reactions (edges). The present simulations start with 1 metabolite with the initial concentration (C i ) of 1,000,000 and terminate when a metabolite reaches a concentration (C f ) of # 10. This concentration range spans five orders of magnitude, which coincides with the variation range of metabolite concentrations in E. coli (from ,10 27 to ,10 22 mol/L) [8]. The concentration decline (d) in each step is 1,000, with a random fluctuation (f) of 1,500. As a result, the total number of generated metabolites reaches around 1,000, which is close to the real number of metabolites of organisms. The numbers of reactions (edges) added in each step are 5 or 10. As shown in Figure 5, the simulations with different parameters exhibit similar power-law distributions of node degrees, which suggests that the concentration-governed model provides a viable explanation to the scalefree organization of metabolic networks.

Implications for metabolic network design
The above finding implies a chemical criterion in metabolic network design that the polarity of hubs should be compatible with the working environments to guarantee the high concentrations of these critical metabolites. If the environments are polar (e.g., water), one should use hydrophilic molecules as hubs, while if the environments are non-polar (e.g., hydrocarbon solutions) [13], hydrophobic molecules should be selected as hubs. This opinion is preliminarily supported by the fact that the ''core'' of organic chemical network (i.e., a small set of strongly connected, chemically diverse substances) identified by Bishop et al. [14] are really much less polar than the hubs of metabolic networks (Table 4), well reflecting the fact that organic chemical reactions are mainly performed in organic solvents which are less polar than water. Thus, this chemical criterion is of apparent value in metabolic network design.

Implications for metabolite concentration prediction
A primary goal of systems biology is to quantitatively characterize cellular behaviors, which requires the information about the absolute concentrations of metabolites. As the intracellular content of metabolites is quite low [15], it is a big challenge to determine their concentrations experimentally. Thus, it is of great significance to use theoretical methods to do predictions. In a pioneering study, Kümmel et al established a network-embedded thermodynamic   (NET) method to predict intracellular metabolite concentrations [16]. However, this method depends largely on Gibbs energies of formation for metabolites, so its use is restricted to a small part of metabolites. The correlations between metabolite concentration and their topological/chemical properties revealed in this study suggest that intracellular metabolite concentrations may be predicted by their topological and chemical properties. By using the support vector regression (SVR) [17] method in R (version 2.11.1), a SVR model was established to predict E. coli metabolite concentrations by their topological and chemical properties. This model was evaluated by leave-one-out cross validation. The squared correlation coefficient is 0.5906 and the total mean squared error is 0.5316. The fitted metabolite concentrations by this model correlate well with the original experimental values (Figure 6). To evaluate the relative contribution of each descriptor to the performance of SVR model, we constructed SVR models by deleting one parameter each time and calculated the squared correlation coefficients of leave-one-out cross validation by using grid search over supplied parameter ranges. The smaller the squared correlation coefficient becomes, the more important the deleted descriptor is to the SVR model. As shown in Table 5, the deletion of degree results in the lowest squared correlation coefficient, followed by the deletion of ClogP, which means that degree and ClogP make most important contributions to the performance of SVR model.
The E. coli metabolite concentrations that have been predicted by the NET method [16] were also estimated by the SVR model. The SVR predictions agree well with the NET results and those determined by prior experiments (at the level of order of magnitude) ( Table 6). By the SVR method, the intracellular concentrations for other E. coli metabolites were also predicted and presented in Table S6, which can be used as initial data in E. coli metabolic network simulation. As the SVR model only depends on very basic (topological and chemical) properties of metabolites, it is expected to be applicable in metabolite concentration prediction for other bacteria.
In summary, the present analysis indicates that the organization of metabolic networks has a chemical basis. That is, metabolic hubs prefer to select relatively strong-polar metabolites. This basis can be explained in terms of high concentrations required by metabolic hubs to drive a variety of reactions. The present finding not only provides a molecular-level explanation to the preferential attachment principle for metabolic network expansion but also has direct implications for metabolic network design and metabolite concentration prediction.

Metabolic network reconstruction and topological parameter calculation
The KEGG-based metabolic network was reconstructed by manually screening the 8100 small-molecule reactions recorded in KEGG Ligand Database (http://www.genome.jp/kegg/ligand. html) (up to Sep 2009) [18]. The screening criteria are as follows: i) The reactions involving macromolecules (e.g., polymers, proteins and nucleic acids) and metabolites with unspecified residues (denoted by R group) were deleted; ii) Currency metabolites, including gases, metal ions and cofactors were discarded, except that they directly participate in metabolic reactions [19,20]. The resulting small-molecule metabolic network consists of 4875 nodes (compounds) and 9263 undirectional edges (substrate-product relations).
The metabolic network of E. coli was reconstructed by manually screening the 1317 small-molecule reactions for E. coli K-12 recorded in EcoCyc Database (http://www.ecocyc.org) [21]. The screening criteria are the same as above described. The resulting small-molecule metabolic network consists of 601 nodes (compounds) and 1538 undirectional edges (substrate-product relations).
The metabolic network of S. cerevisiae was reconstructed by manually screening the 1923 small-molecule reactions recorded in YEASTNET (http://www.comp-sys-bio.org/yeastnet) [22]. The screening criteria are the same as above described. The resulting small-molecule metabolic network consists of 612 nodes (compounds) and 2654 undirectional edges (substrate-product relations).
The parameters describing the network topology were calculated by Network Analyzer Plugin in Cytoscape-2.7.0 [23,24]. The node degree of a node n is defined as the number of edges linked to n. The basic information for KEGG, E. coli and S. cerevisiae metabolites that are involved in the metabolic networks are presented in Tables S1-S5.

Identification of early and late members of S. cerevisiae metabolome
To elucidate the molecular basis of preferential attachment principle underlying the evolution of metabolic networks, we identified the early and late members from S. cerevisiae metabolome. Recently, Prachumwat and Li classified yeast proteins into five age groups, according to the occurring patterns of their orthologs in other species [25]. The oldest age group, consisting of 1806 members, includes proteins that can be traced back to eubacterial genomes. Among these proteins, 972 are enzymes. According to the KEGG records, 633 metabolites associated with these ancient enzymes were collected, 12 of which are aerobic metabolites (according to the aerobic metabolite information provided by Raymond and Segrè [26]) and thus are not early metabolites. The remained 621 metabolites constitute the set of early metabolites of S. cerevisiae, in which 243 members are involved in the metabolic network of S. cerevisiae. The other 369 ( = 6122243) metabolites of S. cerevisiae metabolic network were thus regarded as late members.
Chemical property calculation, network expansion simulation and statistical analysis 83 commonly used property descriptors were calculated with Cerius2 (Version 4.11L. Accelrys Inc. San Diego, CA.), Sybyl

Support vector regression model construction
By a trial-and-deletion procedure, 11 properties that have largest contributions to the support vector regression (SVR) model were selected, which include degree and 10 chemical properties, i.e., 6mem rings Molecules (number of 6 membered rings), Amide Molecules (number of amide), ALogP (the Ghose and Crippen octanol-water partition coefficient), ClogP (partition coefficient octanol/water), FNSA3 (ratio of atomic charge weighted partial negative surface area on total molecular surface area), FPSA3 (ratio of atomic charge weighted partial positive surface area on total molecular surface area), HBD Count (number of hydrogen bond donating groups in the molecule), N Count (number of Nitrogen atoms), LScore Molecules (floating point Lipinski measure) and RPCG (ratio of most positive charge on sum total positive charge (Relative positive charge)). Radial basis kernel function e {cju{vj 2 was chosen to construct a e-SVR model. The parameters were trained by using grid search over supplied parameter ranges and the best parameters were obtained as follows: gamma = 0.01, epsilon = 0.22, cost = 7.9. The SVR algorithm for metabolite concentration prediction is available on request. Figure S1 Power-law degree distribution of KEGG metabolites. (DOC)