Conceived and designed the experiments: SM. Performed the experiments: TYP. Analyzed the data: TYP SM. Wrote the paper: TYP SM. Derived mathematical solutions: SM.

The authors have declared that no competing interests exist.

In prokaryotic genomes the number of transcriptional regulators is known to be proportional to the square of the total number of protein-coding genes. A toolbox model of evolution was recently proposed to explain this empirical scaling for metabolic enzymes and their regulators. According to its rules, the metabolic network of an organism evolves by horizontal transfer of pathways from other species. These pathways are part of a larger “universal” network formed by the union of all species-specific networks. It remained to be understood, however, how the topological properties of this universal network influence the scaling law of the functional content of genomes in the toolbox model. Here we answer this question by first analyzing the scaling properties of the toolbox model on arbitrary tree-like universal networks. We prove that critical branching topology, in which the average number of upstream neighbors of a node is equal to one, is both necessary and sufficient for quadratic scaling. We further generalize the rules of the model to incorporate reactions with multiple substrates/products as well as branched and cyclic metabolic pathways. To achieve its metabolic tasks, the new model employs evolutionarily optimized pathways with a minimal number of reactions. Numerical simulations of this realistic model on the universal network of all reactions in the KEGG database produced approximately quadratic scaling between the number of regulated pathways and the size of the metabolic network. To quantify the geometrical structure of individual pathways, we investigated the relationship between their number of reactions, byproducts, intermediate, and feedback metabolites. Our results validate and explain the ubiquitous appearance of the quadratic scaling for a broad spectrum of topologies of underlying universal metabolic networks. They also demonstrate why, in spite of “small-world” topology, real-life metabolic networks are characterized by a broad distribution of pathway lengths and sizes of metabolic regulons in regulatory networks.

It has been previously reported that in prokaryotic genomes the number of transcriptional regulators is proportional to the square of the total number of genes. We recently offered a general explanation of this empirical power-law scaling in terms of the “toolbox” model in which metabolic and regulatory networks co-evolve. This evolution is driven by horizontal gene transfer of co-regulated metabolic pathways from other species. These pathways are part of a larger “universal” network formed by the union of all species-specific networks. In the present work we address the question of how topological properties of this universal network influence the power-law scaling of regulators in the toolbox model. We also generalize its rules to include reactions with multiple substrates and products, branched and cyclic metabolic pathways, and to account for optimality of metabolic pathways. The main conclusion of our analytical and numerical modeling efforts is that the quadratic scaling is a robust feature of the toolbox model in a broad range of universal network topologies. Our results also demonstrate why, in spite of “small-world” topology, real-life metabolic networks are characterized by a broad distribution of pathway lengths and sizes of regulons in regulatory networks.

In prokaryotic genomes the number of transcriptional regulators is known to
quadratically scale with the total number of protein-coding genes

The question we address in this study is how the topology of the universal network
determines this scaling exponent. To answer this question we first consider and
solve a more realistic (yet still mathematically treatable) case in which the
universal metabolic network is a directed tree of arbitrary topology. While being
closer to reality than previously solved

To make our approach even more realistic we propose and numerically study a
completely new version of the toolbox model incorporating metabolic reactions with
multiple substrates and products as well as branched and cyclic metabolic pathways.
Furthermore, unlike random linear pathways on a universal network

We will first consider the case where the universal metabolic network is a
directed tree. For simplicity in this section we will consider the case of
catabolic pathways, while identical arguments (albeit with opposite direction of
all reactions) apply to anabolic pathways. The root of the tree corresponds to
the central metabolic core of the organism responsible for biomass production.
Peripheral catabolic pathways (branches of the tree) convert external nutrients
(leaves) to this core, while the internal nodes of the tree represent
intermediate metabolites. Each metabolite is characterized by its distance

The organism-specific metabolic network (filled circles and thick edges)
is always a subset of the universal network (the entire tree). Nodes are
divided into layers based on their distance

The toolbox model specifies rules by which an organism acquires new pathways in the
course of its evolution. It consists of the following steps: 1) randomly pick a
new nutrient metabolite (a leaf node of the universal network that currently
does not belong to the metabolic network of the organism); 2) use the universal
network to identify the unique linear pathway that connects the new nutrient to
the root of the tree (the metabolic core); and finally 3) add the reactions and
intermediate metabolites in the new pathway to the metabolic network of the
organism (filled circles and thick edges in
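
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the universal tree is given as a hypothetical `parent` map (child to parent), the function name `toolbox_on_tree` is invented for this example, and the root (metabolic core) is not counted as a metabolite.

```python
import random

def toolbox_on_tree(parent, root, leaves, rng):
    """Add leaf-to-root pathways in random order.

    Returns a list of (number of nutrients, number of metabolites) pairs,
    one per added pathway. The root itself is not counted (an assumption).
    """
    in_network = {root}              # metabolites already in the organism
    history = []
    order = list(leaves)
    rng.shuffle(order)               # step 1: nutrients are picked at random
    for n_added, leaf in enumerate(order, start=1):
        node = leaf
        # steps 2-3: follow the unique path toward the root and add every
        # node that is not yet part of the organism-specific network
        while node not in in_network:
            in_network.add(node)
            node = parent[node]
        history.append((n_added, len(in_network) - 1))
    return history

# Hypothetical toy universal tree, written as child -> parent.
parent = {"a": "root", "b": "root", "c": "a", "d": "a",
          "e": "c", "f": "c", "g": "b"}
leaves = ["d", "e", "f", "g"]
history = toolbox_on_tree(parent, "root", leaves, random.Random(0))
```

Each new pathway stops as soon as it reaches a metabolite the organism already has, which is why later pathways tend to contribute fewer new metabolites than earlier ones.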

Consider an organism capable of utilizing _{M}_{L}

We further assume that due to random selection

This equation allows one to iteratively calculate

The Galton-Watson branching process _{2}, _{1} and
_{0} correspondingly. If the average number of
parents

The principal geometric difference between supercritical and critical trees is
that in the former case the number of nodes in a layer

A random critical network where each node has at most two parents in the
previous layer is defined by
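
A branching tree of this kind can be grown layer by layer, as in the sketch below. The probabilities used here (0.25, 0.5, 0.25 for zero, one and two upstream neighbors) are illustrative assumptions chosen only so that the mean equals one; they are not values from the text, and the function names are hypothetical.

```python
import random

def mean_branching(p0, p1, p2):
    """Average number of upstream neighbors per node."""
    return 0 * p0 + 1 * p1 + 2 * p2

def grow_layers(p0, p1, p2, max_depth, rng):
    """Return the number of nodes in each layer, starting from one root."""
    assert abs(p0 + p1 + p2 - 1.0) < 1e-12
    sizes = [1]                       # layer 0 holds only the root
    for _ in range(max_depth):
        next_size = 0
        for _ in range(sizes[-1]):
            r = rng.random()
            if r < p2:
                next_size += 2        # node has two upstream neighbors
            elif r < p2 + p1:
                next_size += 1        # node has one upstream neighbor
            # otherwise the node is a leaf (no upstream neighbors)
        if next_size == 0:
            break                     # the process has terminated
        sizes.append(next_size)
    return sizes

# Assumed illustrative probabilities; mean_branching == 1 means "critical".
P0, P1, P2 = 0.25, 0.5, 0.25
sizes = grow_layers(P0, P1, P2, max_depth=50, rng=random.Random(1))
```

Raising `p2` above `p0` makes the mean exceed one (supercritical), in which case layer sizes typically grow with depth instead of fluctuating.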

A critical branching process that has not terminated by level

Here

This quadratic relation is exact in a critical branching tree where each node can
branch out into at most two nodes at the next layer, and it is still correct to
leading order in

As was explained in the previous section the ratio

The geometrical properties of the universal network such as its total number of
nodes/edges

For a supercritical branching process

Here

Note that this equation connecting

In conclusion, while the toolbox model on a critical universal network is
characterized by a quadratic scaling between

To test our mathematical results for a more realistic version of the universal
tree we linearized pathways and reactions in the network formed by the union of
all reactions in the KEGG database

To better understand the origins of this scaling we investigated the topology of
the underlying universal tree. The criticality of a tree is defined by the
asymptotic value of the ratio

In addition to using a random spanning tree to linearize the KEGG network we also
tried a version using minimal paths. In this version the universal network is
generated by randomly picking a metabolite and connecting it to the root of the
tree (pyruvate) by the shortest path. At first glance such a “minimal
path” selection appears reasonable from an evolutionary standpoint.
Indeed, evolution would favor simpler and shorter pathways in order to minimize
the expenditure of resources to achieve a given metabolic goal

How do we reconcile the evolutionary pressure apparently selecting for minimal
pathways with the dramatically wrong scaling properties of this model? We believe
that most of the ultra-short “small world” pathways generated by
minimal paths on the KEGG network are unrealistic from a biochemical standpoint.
Indeed, highly connected co-factors often position metabolites with very
different chemical formulas in close proximity to each other. For example, the
KEGG reaction R00134: _{2} →
cyanate. As a consequence of such artificial shortcuts, branches of the universal
network linearized by minimal paths are much shorter than they are in reality.
This problem is at least partially alleviated by 1) removing unusually
high-degree nodes corresponding to common co-factors such as H₂O,
ATP, and NAD from the metabolic network, so that some unrealistic paths are eliminated,
and 2) using a random spanning tree instead of the shortest paths. In Ref.
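
The effect of a highly connected co-factor on minimal paths can be seen on a toy graph. The sketch below is an illustration under assumptions: the graph, the node names (`hub`, `m1` to `m4`) and the helper functions are hypothetical, and breadth-first search stands in for whatever shortest-path routine was actually used.

```python
from collections import deque

def shortest_path_tree(adj, root):
    """Breadth-first search; returns a child -> parent map of shortest paths."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def distance_to_root(parent, node):
    """Number of steps from node to the root along the tree."""
    steps = 0
    while parent[node] is not None:
        node = parent[node]
        steps += 1
    return steps

def without_nodes(adj, removed):
    """Copy of the graph with the given nodes (e.g. co-factors) deleted."""
    return {u: [v for v in nbrs if v not in removed]
            for u, nbrs in adj.items() if u not in removed}

# Toy undirected graph: a chain root-m1-m2-m3-m4 plus a co-factor-like
# "hub" node that links the root directly to m4.
adj = {
    "root": ["m1", "hub"],
    "m1": ["root", "m2"],
    "m2": ["m1", "m3"],
    "m3": ["m2", "m4"],
    "m4": ["m3", "hub"],
    "hub": ["root", "m4"],
}
with_hub = distance_to_root(shortest_path_tree(adj, "root"), "m4")
pruned = without_nodes(adj, {"hub"})
no_hub = distance_to_root(shortest_path_tree(pruned, "root"), "m4")
```

In this toy case the hub shortens the path to `m4` from four steps to two, which is the kind of shortcut that removing high-degree co-factors is meant to eliminate.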

Real metabolic reactions routinely include multiple inputs (substrates) and
multiple outputs (products) (see

The number of substrates of an irreversible reaction (rows) vs. the number of products of an irreversible reaction (columns)

Substrates \ Products | 1 | 2 | 3 | 4 | 5
1 | 157 | 141 | 4 | |
2 | 82 | 491 | 95 | 7 |
3 | 1 | 123 | 170 | 31 | 1
4 | 10 | 73 | 15 | |
5 | 1 | | | |

The number of substrates/products at the opposite end of a reversible reaction (rows) vs. the number of substrates/products at one end of a reversible reaction (columns)

Opposite end \ One end | 1 | 2 | 3 | 4 | 5
1 | 143 | 231 | 6 | |
2 | 553 | 284 | 15 | |
3 | 106 | 69 | 1 | |
4 | 6 | 3 | | |

The new version of the toolbox model simulates addition of anabolic pathways
aimed at production of new metabolites from those the model organism can
currently synthesize (its current metabolic core). The new pathways are

At the beginning of the simulation, the model organism starts with a
“seed” metabolic network consisting of 40 metabolites
classified by the KEGG as parts of central carbohydrate metabolism, plus
a number of “currency” metabolites such as water, ATP and
NAD (see the section “Seed metabolites of the scope
expansion” of

At each step a new metabolite that cannot yet be synthesized by the
organism is randomly selected from the “scope”

To search for the minimal pathway that converts core metabolites to this
target we first perform the “scope expansion”

Next we trace back the added reactions, starting from the target and progressively moving to lower levels. One starts by finding the reaction responsible for producing the target metabolite and adding it to the new pathway (if several such reactions exist in the last layer, we randomly choose one of them). In the case of a multi-layer expansion process, some substrates of this reaction are not among the core metabolites (otherwise this reaction would be in the first layer). One then goes down one layer and adds the reactions producing these missing substrates. This is repeated all the way down to the first level of the original expansion. The resulting pathway includes the minimal (or nearly minimal) set of reactions needed to generate the target metabolite from the current metabolic core of the organism. Starting from the next step of the model, the target and all intermediate metabolites become part of the metabolic core. Genes for enzymes catalyzing these new reactions are assumed to be horizontally transferred to the genome of the organism. The newly added metabolic pathway is assumed to have a dedicated transcriptional regulator, so that the number of transcription factors in our model is always equal to the number of pathways or their target metabolites.
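
The expansion and trace-back described above can be sketched as follows. This is a simplified illustration under assumptions rather than the procedure used in the study: reactions are a hypothetical dictionary mapping a reaction id to a `(substrates, products)` pair, all reactions are treated as irreversible, and each metabolite is traced back only through reactions of the layer in which it first became available.

```python
import random

def scope_expansion(core, reactions, target):
    """Expand layer by layer until the target becomes available.

    Returns (producers, n_layers), where producers maps each new metabolite
    to the reactions of the layer in which it first appeared, or None if
    the target cannot be reached from the core.
    """
    available = set(core)
    remaining = dict(reactions)
    producers = {}
    n_layers = 0
    while target not in available:
        fired = [r for r, (subs, _) in remaining.items()
                 if set(subs) <= available]
        if not fired:
            return None                  # target lies outside the scope
        n_layers += 1
        new = set()
        for r in fired:
            for m in remaining[r][1]:
                if m not in available:
                    producers.setdefault(m, []).append(r)
                    new.add(m)
            del remaining[r]
        available |= new
    return producers, n_layers

def trace_back(core, reactions, producers, target, rng):
    """Collect the reactions needed to make the target from the core."""
    core = set(core)
    pathway, made, stack = [], set(), [target]
    while stack:
        m = stack.pop()
        if m in core or m in made:
            continue
        r = rng.choice(producers[m])     # random choice among alternatives
        pathway.append(r)
        made |= set(reactions[r][1])
        stack.extend(reactions[r][0])    # substrates still to be explained
    return pathway

# Hypothetical toy example: core {A, B}, target T.
core = {"A", "B"}
reactions = {
    "R1": (("A", "B"), ("C", "X")),
    "R2": (("C", "A"), ("D",)),
    "R3": (("D", "B"), ("T", "A")),
    "R4": (("A",), ("Y",)),
}
producers, n_layers = scope_expansion(core, reactions, "T")
pathway = trace_back(core, reactions, producers, "T", random.Random(0))
```

Reaction `R4` fires during the expansion but does not contribute to the target, so the trace-back leaves it out of the pathway.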

Steps 1–5 are repeated until the metabolic network of the organism
reaches its maximal size. At this stage it includes the entire scope

The diagram explains different types of metabolites and reactions. Reactions (squares) in the added pathway use base substrates (yellow circles with horizontal shading) from the metabolic core of the organism (light blue area) to produce the target metabolite (the red circle). The added pathway generates intermediate products (green circles) as well as byproducts that are not further converted to the target (blue circles). Products of some reactions feed back into the metabolic core (yellow circles with vertical shading). Reactions are labeled with expansion steps at which they were added to the pathway.

Numerical simulation of this model shows that the number of transcriptional
regulators scales with the number of metabolites with power

The scaling between the number of regulated pathways (leaves),

The mathematical formalism derived in the previous sections is limited to
tree-like universal networks and thus does not directly apply to the new model.
Nevertheless, one generally expects the quadratic scaling to be limited only to
critical, “large world” networks in which organisms with small
genomes initially tend to acquire sufficiently long pathways. As noted before,
from a purely topological standpoint the KEGG network has a “small
world” property, making long pathways unlikely. It is important to check whether
the realistic treatment of multi-substrate reactions did in fact restore the
“large world” property and criticality to the KEGG universal network
by increasing the minimal number of steps required for connecting target
metabolites to the metabolic core. To quantify the criticality of the expansion
process we use, as before, the ratio

The ratio

Unlike linearized pathways in the original version of the toolbox model
1) n_{border rxn}: the number of added reactions that are connected (as a substrate or a product) with
at least one metabolite in the core, 2) n_{base}: the number of metabolites in the core
that serve as substrates to reactions in the added pathway, 3) n_{feedback}: the number of core metabolites
that are products of reactions in the new pathway, 4) n_{byproduct}: the number of final metabolic
products of the added pathway that are neither core metabolites nor the target,
5) length: the number of steps (layers of the scope expansion process) it takes
to transform core metabolites into the target product. Figure 4 illustrates the
definition of these parameters while
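
The pathway parameters defined above can be computed directly from a list of reactions, as in this minimal sketch. The data layout (reaction id mapped to a `(substrates, products)` pair), the function name `pathway_parameters`, and the toy pathway are assumptions made for illustration; a "final" product is taken here to mean one that no reaction of the pathway consumes.

```python
def pathway_parameters(core, reactions, pathway, target):
    """Count border reactions, base, feedback and byproduct metabolites."""
    core = set(core)
    substrates = set().union(*(reactions[r][0] for r in pathway))
    products = set().union(*(reactions[r][1] for r in pathway))
    return {
        # reactions touching at least one core metabolite
        "n_border_rxn": sum(
            1 for r in pathway
            if core & (set(reactions[r][0]) | set(reactions[r][1]))
        ),
        # core metabolites consumed by the pathway
        "n_base": len(substrates & core),
        # core metabolites produced by the pathway
        "n_feedback": len(products & core),
        # products that are never consumed and are neither core nor target
        "n_byproduct": len(products - substrates - core - {target}),
        # products that are consumed again inside the pathway
        "n_intermediate": len((products & substrates) - core),
    }

# Hypothetical toy pathway: core {A, B}, target T.
core = {"A", "B"}
reactions = {
    "R1": (("A", "B"), ("C", "X")),
    "R2": (("C", "A"), ("D",)),
    "R3": (("D", "B"), ("T", "A")),
}
params = pathway_parameters(core, reactions, ["R1", "R2", "R3"], "T")
```

In this toy pathway `X` is the only byproduct, `C` and `D` are intermediates, and `A` is both a base substrate and a feedback metabolite.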

Faster-than-linear scaling of the number of byproducts,
n_{byproduct}, and the total number of metabolites,

Approximately linear relationship between a) pathway's length and
its number of reactions, b) n_{border rxn} and the total number
of reactions, c) n_{base} and the total number of
metabolites, d) n_{feedback} and the
total number of metabolites,

Approximately linear relationship between n_{border rxn}
vs. n_{base}) or as products
(n_{feedback}). Further analysis indicates that
“currency metabolites” (common co-factors that serve as substrates
or products of many reactions) constitute a significant fraction
(∼25%) of all core metabolites involved in border reactions (see the
section “Analysis of the currency metabolites in the toolbox model”
of

The lengths of the pathways are represented by circles and solid line, while the shortest distances of the targets from the core are represented by crosses and dotted line.

As can be seen from _{feedback}
metabolites in the core network of the organism to which they were added. The
relative scarcity of byproducts suggests that pathways in our model satisfy the
evolutionary constraints imposed on real-life organisms. Indeed, as previously
proposed in Ref

The small world property of complex biomolecular networks has been extensively
discussed in the literature during the last decade (see

The extent of small world topology in metabolic networks has been recently disputed
in

How can this apparent contradiction be reconciled? The answer, known from the pioneering
studies of R. Heinrich and collaborators (see e.g.

These arguments lead us to adapt the “scope expansion” algorithm by
Heinrich et al

Optimality of metabolic pathways in central carbon metabolism was recently discussed
in Ref.

Simplified “toy” models based on artificial chemistry reactions have a long history of being used to reveal fundamental organizational principles of metabolic networks:

The recent model of Riehl et al

The study of Pfeiffer et al

Finally the artificial chemistry studied by Hintze et al

In our study we used the real-life (even if incomplete and sometimes noisy) metabolic
universe of all reactions in the KEGG database. The only simplifying approximations
remaining in the new, most realistic version of the toolbox model are the random selection
of metabolic targets in the course of evolution and the easy availability of any subset
of KEGG reactions for horizontal transfer. Both of these approximations can be relaxed
in later versions of the model. Another promising direction is to extend the toolbox
model to artificial chemistry universal networks of Refs.

The universal network used in our study consists of the union of all reactions listed
in the KEGG database. The directionality of reactions and connected pairs of
metabolites were inferred from the map version of the reaction formula:

Supplementary information.

(0.37 MB DOC)

Additional support of this work was provided by the DOE Systems Biology Knowledgebase project “Tools and Models for Integrating Multiple Cellular Networks”.