Conceived and designed the experiments: AM GMP JH. Analyzed the data: AM. Wrote the paper: AM GMP JH.

The authors have declared that no competing interests exist.

The availability of genomes of many closely related bacteria with diverse metabolic capabilities offers the possibility of tracing metabolic evolution on a phylogeny relating the genomes to understand the evolutionary processes and constraints that affect the evolution of metabolic networks. Using simple (independent loss/gain of reactions) or complex (incorporating dependencies among reactions) stochastic models of metabolic evolution, it is possible to study how metabolic networks evolve over time. Here, we describe a model that takes the reaction neighborhood into account when modeling metabolic evolution. The model also allows estimation of the strength of the neighborhood effect during the course of evolution. We present Gibbs samplers for sampling networks at the internal nodes of a phylogeny and for estimating the parameters of evolution over a phylogeny. By iteratively sampling from the conditional distributions of the internal networks and parameters, the samplers avoid exploring the whole search space. The samplers are used to estimate the parameters of evolution of metabolic networks of bacteria in the genus

Metabolic networks correspond to one of the most complex cellular processes. Most organisms have a common set of reactions as a part of their metabolic networks that relate to essential processes such as generation of energy and the synthesis of important biological molecules, which are required for their survival. However, a large proportion of the reactions present in different organisms are specific to the needs of individual organisms. The regions of metabolic networks corresponding to these non-essential reactions are under continuous evolution. Using different models of evolution, we can ask important biological questions about the ways in which the metabolic networks of different organisms enable them to be well-adapted to the environments in which they live, and how these metabolic adaptations have evolved. We use a stochastic approach to study the evolution of metabolic networks and show that evolutionary inferences can be made using the structure of these networks. Our results indicate that plant pathogenic

Biological networks are under continuous evolution and their evolution is one of the major areas of research today

In this work, we focus on metabolic networks. The evolution of metabolic networks is characterized by gain and loss of reactions (or enzymes) connecting two or more metabolites and can be described as a discrete-space, continuous-time Markov process where, at each step of the network evolution, a reaction is either added or deleted until the desired network is obtained
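The gain and loss process described above can be illustrated with a minimal simulation sketch. It assumes the simplest case, in which every absent reaction is inserted at a constant rate and every present reaction is deleted at a constant rate, independently of its neighbors. The function name and the parameters `ins_rate`, `del_rate` and `t_end` are hypothetical and chosen for illustration only; the neighbor-dependent rates of the models discussed in this paper are not reproduced here.

```python
import random


def simulate_independent_edges(state, allowed, ins_rate, del_rate, t_end, rng=None):
    """Simulate reaction gain/loss as a continuous-time Markov chain.

    state    : set of hyperedge (reaction) labels currently present
    allowed  : set of all hyperedges in the reference network
    ins_rate : per-hyperedge insertion rate (assumed constant)
    del_rate : per-hyperedge deletion rate (assumed constant)
    t_end    : total evolutionary time to simulate
    """
    rng = rng or random.Random()
    state = set(state)
    t = 0.0
    events = []
    while True:
        absent = sorted(allowed - state)
        present = sorted(state)
        total = ins_rate * len(absent) + del_rate * len(present)
        if total == 0:
            break  # no event can occur
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        # Choose the event type in proportion to its total rate.
        if rng.random() * total < ins_rate * len(absent):
            edge = rng.choice(absent)
            state.add(edge)
            events.append((t, "insert", edge))
        else:
            edge = rng.choice(present)
            state.remove(edge)
            events.append((t, "delete", edge))
    return state, events
```

Each event toggles exactly one hyperedge, so the returned event list is one possible path between the starting network and the final network.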

Here, we present an extended model called the hybrid model that combines an independent edge model, where edges are gained or lost independently, and a neighbor-dependent model of network evolution

We use the hybrid model to study the evolution of a set of metabolic networks connected over a phylogeny. Previous attempts to study the evolution of metabolic networks in a phylogenetic context include Dandekar

In the neighbor-dependent model for the evolution of metabolic networks

The neighborhood component

Although the neighbor-dependent model summarized above produces biologically relevant behavior, whereby highly connected reactions are toggled more frequently than their poorly connected counterparts, it does not allow one to determine the strength of the neighborhood structure affecting the evolution of metabolic networks. To overcome this limitation, a parameter can be introduced in the model that corresponds to the neighborhood effect during the course of metabolic network evolution.

Consider two networks

It can be seen from (4) that the model behaves as the independent edge model when

(A) Toy networks consisting of 13 nodes. The nodes are labeled from A to M (blue) and the hyperedges are labeled from 1 to 10 (red). The reference network consists of all allowed hyperedges for this example system. Networks

Biological networks are connected over a phylogenetic tree which is known through sequence analysis. Calculating the likelihood over a phylogeny requires a sum, over all possible networks that may have existed at the interior nodes of the tree, of the probabilities of each scenario of events. For example,

Here

In general, the likelihood of a tree with more than three networks can be calculated using the recursion described by Felsenstein
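As an illustration of the recursion, the sketch below applies Felsenstein's pruning algorithm to a single hyperedge treated as a two-state (absent/present) character. This is a simplification: it assumes independent hyperedges with a constant insertion rate `lam` and deletion rate `mu`, so the transition probabilities have a closed form, and it places the stationary distribution at the root. The tree encoding and all function names are hypothetical. Under the neighbor-dependent and hybrid models the hyperedges are not independent, which is why sampling is needed instead.

```python
import math


def transition_matrix(lam, mu, t):
    """Two-state transition probabilities P[x][y] over a branch of length t."""
    total = lam + mu
    decay = math.exp(-total * t)
    p01 = lam / total * (1.0 - decay)  # absent -> present
    p10 = mu / total * (1.0 - decay)   # present -> absent
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def partial_likelihood(node, lam, mu):
    """Return [L(0), L(1)] for the subtree rooted at node.

    A leaf is an int (observed state 0 or 1); an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, int):
        return [1.0 if node == x else 0.0 for x in (0, 1)]
    left, t_left, right, t_right = node
    out = [1.0, 1.0]
    for child, t in ((left, t_left), (right, t_right)):
        child_l = partial_likelihood(child, lam, mu)
        p = transition_matrix(lam, mu, t)
        for x in (0, 1):
            out[x] *= sum(p[x][y] * child_l[y] for y in (0, 1))
    return out


def tree_likelihood(tree, lam, mu):
    """Likelihood of the leaf states, using the stationary root distribution."""
    pi = [mu / (lam + mu), lam / (lam + mu)]
    l = partial_likelihood(tree, lam, mu)
    return pi[0] * l[0] + pi[1] * l[1]
```

For independent hyperedges, the likelihood of a whole set of networks would be the product of `tree_likelihood` over all hyperedges.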

Evaluating Equations 5 and 7 requires an algorithm to systematically and efficiently sample networks at the internal nodes of a tree and a method to calculate the pairwise likelihood of network evolution. A Metropolis-Hastings algorithm to calculate the pairwise likelihood based on sampling paths between network pairs was described by Mithani

Given a set of networks related by a phylogenetic tree, the networks at the internal nodes of the tree can be sampled using a Gibbs sampler. The general idea is to sample each internal network by conditioning on its three neighbors (one parent and two children). This approach for sampling internal networks is similar to the one used by Holmes and Bruno

Consider a network

For each hyperedge

Calculate, for each neighbor

Sample the new state

The tree contains arbitrary networks assigned at the internal nodes. Also shown are the proportion of insertion and deletion events and the proportion of allowed insertion and deletion events while going from various ancestral networks to descendant networks.

Once the transition probability matrices have been obtained, the sample for the new network

The samples for hyperedges labeled 2 to 10 can be drawn in a similar fashion to obtain the new network.
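The per-hyperedge update can be sketched as follows. The sketch assumes that the 2×2 transition probability matrices along the three branches (parent to internal node, internal node to each child) have already been computed and are passed in as nested lists indexed as `P[from_state][to_state]`; how those matrices are obtained under the hybrid model is not reproduced here. Function and argument names are hypothetical.

```python
import random


def edge_presence_probability(p_parent, p_child1, p_child2,
                              s_parent, s_child1, s_child2):
    """Conditional probability that a hyperedge is present at an internal node.

    The weight of state x is
        P_parent[s_parent][x] * P_child1[x][s_child1] * P_child2[x][s_child2],
    normalised over x in {0, 1}.
    """
    weights = [
        p_parent[s_parent][x] * p_child1[x][s_child1] * p_child2[x][s_child2]
        for x in (0, 1)
    ]
    total = sum(weights)
    if total == 0:
        raise ValueError("neighbor states are impossible under these matrices")
    return weights[1] / total


def sample_edge_state(p_parent, p_child1, p_child2,
                      s_parent, s_child1, s_child2, rng=None):
    """Draw the new state (0 = absent, 1 = present) of one hyperedge."""
    rng = rng or random.Random()
    p1 = edge_presence_probability(p_parent, p_child1, p_child2,
                                   s_parent, s_child1, s_child2)
    return 1 if rng.random() < p1 else 0
```

Sweeping this update over every hyperedge of an internal network, and then over every internal node of the tree, gives one iteration of such a sampler.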

The Gibbs sampler described above samples the internal networks on a phylogenetic tree for given parameter values. This can be extended to estimate the parameters

Choose initial values for the parameters

Generate

Use

Repeat

The samples for parameters can be drawn using a Metropolis-Hastings algorithm

For a given tree

Starting from root, calculate the proportion of insertion events

The calculation of hyper-parameters

The hybrid model for metabolic network evolution described above allows estimation of the neighborhood effect shaping the evolution of a given set of networks. The proposal for the parameter

Calculate the average number of neighbors present in the networks present at the leaves of the phylogeny. For example, if the network

The shape parameter

The proposal probability

The proposal probability
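A single Metropolis-Hastings update for a positive rate parameter can be sketched as below. The sketch assumes a gamma proposal whose mean equals the current value, with a fixed shape parameter; this choice, the function names and the default `shape=10.0` are illustrative assumptions and are not taken from the text. Because the gamma proposal is asymmetric, the acceptance ratio includes the Hastings correction. In the setting described here, `log_target` would stand for the log-likelihood of the tree plus the log-prior of the parameter.

```python
import math
import random


def gamma_logpdf(x, shape, scale):
    """Log density of a gamma distribution with the given shape and scale."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))


def mh_step(current, log_target, shape, rng):
    """One Metropolis-Hastings update with a gamma proposal centred on current."""
    proposal = rng.gammavariate(shape, current / shape)  # mean equals current
    if proposal <= 0.0:
        return current, False  # guard against numerical underflow
    log_ratio = (log_target(proposal) - log_target(current)
                 + gamma_logpdf(current, shape, proposal / shape)    # reverse move
                 - gamma_logpdf(proposal, shape, current / shape))   # forward move
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return current, False


def run_chain(log_target, start, n_iter, shape=10.0, seed=0):
    """Run the sampler and return the samples and the acceptance fraction."""
    rng = random.Random(seed)
    x, samples, accepted = start, [], 0
    for _ in range(n_iter):
        x, ok = mh_step(x, log_target, shape, rng)
        samples.append(x)
        accepted += ok
    return samples, accepted / n_iter
```

A larger `shape` gives narrower proposals and a higher acceptance rate; a smaller one gives bolder moves that are rejected more often.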

The Metropolis-Hastings procedure described above to sample parameters requires the likelihood of the tree when moving in the parameter space. The likelihood can be calculated using Equation 5 which in turn requires calculation of the pairwise likelihood between network pairs. The pairwise likelihood can be calculated using the Metropolis-Hastings algorithm described in Mithani

Let

Initialize

Select the hyperedge

Add the top

Remove the hyperedges present in

Repeat steps 2–4 until

Increment

An example is given in

The pseudo-likelihood of going from the network

To see if the hybrid model fitted the metabolic network data better than the neighbor-dependent model, a likelihood ratio test was performed using the metabolic data for the bacteria belonging to the genus

| Pathway map | Neighbor-dependent model: parameter estimates | Neighbor-dependent model: log LH | Hybrid model: parameter estimates | Hybrid model: log LH | LH ratio |
|---|---|---|---|---|---|
| Glycolysis/Gluconeogenesis | (2.6177, 0.4229) | −76.53 | (0.4989, 0.1598, 0.2152) | −63.47 | 26.13 |
| Pentose phosphate pathway | (0.5680, 0.7144) | −60.13 | (0.4762, 0.2953, 0.4259) | −53.42 | 13.41 |
| Lysine degradation | (0.0127, 1.0780) | −59.43 | (0.0063, 0.2926, 0.0159) | −52.40 | 14.05 |
| Histidine metabolism | (0.7669, 0.3895) | −54.22 | (0.1852, 0.1643, 0.1370) | −47.28 | 13.89 |
| Phenylalanine metabolism | (1.1035, 0.6856) | −62.40 | (1.0299, 1.0297, 0.0038) | −49.91 | 24.97 |
| Pyruvate metabolism | (0.1648, 0.5656) | −88.64 | (0.0897, 0.1913, 0.1194) | −81.74 | 13.81 |
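The likelihood ratio statistic in the last column is twice the difference between the two log-likelihoods. The sketch below recomputes it and attaches a p-value, assuming the usual chi-square approximation with one degree of freedom (the hybrid model has one more parameter than the neighbor-dependent model). That approximation is an assumption made here for illustration and may be inexact if the extra parameter lies on the boundary of its range. The function name is hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return (statistic, p-value) for nested models differing by one parameter.

    The p-value uses the chi-square survival function with 1 degree of
    freedom, which equals erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (loglik_alt - loglik_null)
    p_value = math.erfc(math.sqrt(stat / 2.0)) if stat > 0 else 1.0
    return stat, p_value


# Glycolysis/Gluconeogenesis row of the table, using the rounded log-likelihoods.
stat, p_value = likelihood_ratio_test(-76.53, -63.47)
```

With the rounded values from the table the statistic comes out at about 26.12, which agrees with the reported 26.13 up to rounding, and lies well above the 5% critical value of 3.84 for one degree of freedom.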

The fit of the data was further tested by comparing the degree distributions of the nodes obtained by simulating network evolution under the neighbor-dependent and hybrid models. The MLEs for the evolution of networks obtained under the two models were used as the simulation parameters. For example, when evolving the pathway maps in

Boxplots showing the degree distributions of nodes obtained by simulating the evolution for the pentose phosphate pathway, lysine degradation and phenylalanine metabolism maps in
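A node degree distribution of this kind can be computed directly from the list of hyperedges. The sketch below assumes that the degree of a node (metabolite) is the number of hyperedges (reactions) in which it takes part; the exact definition used for the boxplots is not restated here, so this is an assumption, and the small network in the usage example is invented for illustration.

```python
from collections import Counter


def node_degrees(hyperedges):
    """Map each node to the number of hyperedges it participates in.

    hyperedges : dict mapping a hyperedge label to an iterable of node labels
    """
    degrees = Counter()
    for nodes in hyperedges.values():
        for node in set(nodes):  # count each node once per hyperedge
            degrees[node] += 1
    return dict(degrees)


def degree_distribution(hyperedges):
    """Map each degree value to the number of nodes having that degree."""
    return dict(Counter(node_degrees(hyperedges).values()))


# Hypothetical three-reaction network, for illustration only.
example = {1: ["A", "B", "C"], 2: ["C", "D"], 3: ["C", "D", "E"]}
degrees = node_degrees(example)
distribution = degree_distribution(example)
```

Applying `degree_distribution` to each simulated network and pooling the results gives the kind of summary that can be compared between models.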

To test the Gibbs sampler described in Section ‘Sampling internal nodes’, the three-network phylogeny shown in

We also ran the Gibbs sampler for parameter estimation for the toy networks. The sampler was run from a random starting value for 60,000 iterations with the first 10,000 iterations regarded as the burn-in period. The samples were collected every
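Discarding the burn-in period and thinning the remaining chain can be sketched as follows. The run length (60,000) and burn-in (10,000) are taken from the text; the thinning interval `step=50` is a placeholder chosen for illustration, not the value used in the analysis, and the function names are hypothetical.

```python
def thin_chain(samples, burn_in, step):
    """Drop the first burn_in samples, then keep every step-th sample."""
    if burn_in < 0 or step < 1:
        raise ValueError("burn_in must be >= 0 and step must be >= 1")
    return samples[burn_in::step]


def posterior_mean(samples):
    """Average of the retained samples."""
    if not samples:
        raise ValueError("no samples retained")
    return sum(samples) / len(samples)


# Stand-in chain: iteration indices instead of real parameter samples.
chain = list(range(60000))
kept = thin_chain(chain, burn_in=10000, step=50)  # step is a placeholder value
```

Thinning reduces autocorrelation between retained samples at the cost of a smaller sample, so the interval is usually chosen by inspecting the trace.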

Parameters were estimated for the toy network phylogeny shown in

To study the metabolic evolution in bacteria, we used the Gibbs sampler to estimate the evolution parameters for the metabolic networks of bacteria belonging to the genus

The phylogenies were generated using multi-locus sequence analysis of conserved housekeeping genes (

| Pathway map | Phylogeny | δ | var(δ) | λ | var(λ) | μ | var(μ) | λ/μ |
|---|---|---|---|---|---|---|---|---|
| Glycolysis/Gluconeogenesis (MAP00010) | ((pfs,pfo),pfl) | 0.2276 | 0.0054 | 2.0552 | 1.0280 | 1.2228 | 0.2653 | 1.6807 |
| | (pst,(psb,psp)) | 0.2404 | 0.0070 | 1.8645 | 3.3775 | 0.7280 | 0.2891 | 2.5611 |
| | 17 pseudomonads | 0.1506 | 0.0034 | 0.7610 | 0.0117 | 0.7329 | 0.0114 | 1.0383 |
| Pentose phosphate pathway (MAP00030) | ((pfs,pfo),pfl) | 0.2785 | 0.0057 | 2.1103 | 0.5582 | 1.7194 | 0.3227 | 1.2273 |
| | (pst,(psb,psp)) | 0.3251 | 0.0071 | 1.8490 | 1.6921 | 1.0172 | 0.3212 | 1.8178 |
| | 17 pseudomonads | 0.1863 | 0.0042 | 0.6762 | 0.0029 | 0.7462 | 0.0126 | 0.9062 |
| Lysine degradation (MAP00310) | ((pfs,pfo),pfl) | 0.0802 | 0.0032 | 0.9662 | 0.2795 | 1.6567 | 1.5943 | 0.5832 |
| | (pst,(psb,psp)) | 0.0637 | 0.0025 | 0.6986 | 0.1030 | 2.7245 | 2.8663 | 0.2564 |
| | 17 pseudomonads | 0.0473 | 0.0030 | 0.4706 | 0.3492 | 0.6443 | 0.7188 | 0.7304 |
| Histidine metabolism (MAP00340) | ((pfs,pfo),pfl) | 0.1833 | 0.0065 | 1.6829 | 1.2133 | 1.0507 | 0.3456 | 1.6017 |
| | (pst,(psb,psp)) | 0.1749 | 0.0064 | 1.5321 | 0.9479 | 1.0735 | 0.3082 | 1.4272 |
| | 17 pseudomonads | 0.0986 | 0.0022 | 0.8685 | 0.0203 | 0.6795 | 0.0057 | 1.2781 |
| Phenylalanine metabolism (MAP00360) | ((pfs,pfo),pfl) | 0.0783 | 0.0029 | 1.1686 | 0.2072 | 1.8255 | 0.9345 | 0.6402 |
| | (pst,(psb,psp)) | 0.0678 | 0.0024 | 1.0573 | 0.1448 | 2.2334 | 1.1112 | 0.4734 |
| | 17 pseudomonads | 0.0617 | 0.0017 | 0.6004 | 0.0061 | 1.0723 | 0.0682 | 0.5599 |
| Pyruvate metabolism (MAP00620) | ((pfs,pfo),pfl) | 0.1413 | 0.0018 | 1.6497 | 0.3362 | 1.7913 | 0.4424 | 0.9210 |
| | (pst,(psb,psp)) | 0.1559 | 0.0020 | 1.5542 | 0.5376 | 1.2840 | 0.2888 | 1.2105 |
| | 17 pseudomonads | 0.1119 | 0.0007 | 0.7668 | 0.0099 | 0.6838 | 0.0142 | 1.1213 |

When estimating the parameters, the hyperedges corresponding to the reactions that were common to all seventeen

The high insertion to deletion ratio (

The comparison of the evolution parameters between

The final aim of this study was to infer reactions present in the common ancestor of

To demonstrate this, the Gibbs sampler was run on the pathway maps listed in

The ancestral networks were reconstructed over the

The ancestral networks were reconstructed over the

The ancestral networks were reconstructed over the

The ancestral networks were reconstructed over the

The ancestral networks were reconstructed over the

The ancestral network reconstruction using the Gibbs sampler reported high likelihood values for reactions which are present in all the networks down a lineage and low likelihood values for reactions which show variable distributions across the

Ancestral predictions were also generated under the parsimony model for these networks using the Fitch algorithm
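For comparison, the bottom-up pass of the Fitch algorithm for a single reaction can be sketched as below. The tree is encoded as nested pairs of leaf names, following the bracket notation used for the phylogenies in this paper. The presence/absence states in the usage example are invented for illustration and do not correspond to any particular reaction; the function name is hypothetical. Only the first (bottom-up) pass is shown, which yields the candidate root states and the minimum number of changes.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass for one binary character.

    tree        : a leaf name (str) or a pair (left_subtree, right_subtree)
    leaf_states : dict mapping leaf name to 0 (absent) or 1 (present)

    Returns (set of most parsimonious states at this node, number of changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no change is needed here.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical presence/absence pattern on the phylogeny ((pfs,pfo),pfl).
root_states, changes = fitch((("pfs", "pfo"), "pfl"),
                             {"pfs": 1, "pfo": 1, "pfl": 0})
```

In this example the root state set is ambiguous (`{0, 1}`) with one change, which illustrates that parsimony returns a set of equally good states rather than a probability for each.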

In this study, we have used a Bayesian approach to study the evolution of metabolic networks. We extended the neighbor-dependent model described by Mithani

The neighbor-dependent model

We also presented a Gibbs sampler to sample the networks at internal nodes of a phylogenetic tree where the internal networks were sampled by conditioning on three neighbors (one parent and two children) in an approach similar to the one used by Holmes and Bruno

A Gibbs sampler to estimate the evolution parameters was also presented. Standard distributions were used to generate proposals for parameters. The standard distributions provide satisfactory mixing of the MCMC sampler with appropriate scaling

The evolution parameters were estimated on a phylogeny connecting the metabolic networks of bacteria belonging to the genus

An important factor affecting the results when estimating the evolution parameters and reconstructing the ancestral networks relates to the use of individual pathway maps. Although computationally tractable, individual pathway maps do not take a complete network perspective and may, therefore, lead to incorrect results by ignoring a part of the reaction neighborhood, the so-called border effect. This is particularly true for reactions which occur at the boundary of a metabolic pathway map, which may have a large number of their neighbors not included in that pathway map. The calculation of reaction neighborhood solely using the pathway map under consideration ignores all neighboring reactions that are not present in the pathway map thus affecting the likelihood values. For example, consider R01424 in phenylalanine metabolism. This reaction is present in thirteen out of seventeen pseudomonads including all three

When performing the analyses the hyperedges present in all seventeen

Finally, the analysis presented here uses data from the KEGG database. The metabolic annotations available for the majority of genome-sequenced organisms are generated using automated annotation tools based on the similarity of predicted genes to genes of known function and therefore contain a substantial amount of noise. For example, some genes predicted to have a broad enzymatic function are linked to multiple reactions, while others fail to meet the detection threshold for annotation and are therefore recorded as absent. Nevertheless, networks deposited in databases like KEGG are commonly treated as if they are as certain as sequence data, which is a serious error that undermines many present investigations. It would be desirable to take this noise into account while modeling the evolution of metabolic networks. One way would be to use hidden states to model experimentally validated metabolisms, which are observed through predicted metabolisms. This would not only enable one to model the noise in the data but also allow correct prediction of a metabolism for an organism using homologous information similar to comparative genome annotation

In summary, evolutionary modeling of metabolic networks is an important area. Using statistical models of network evolution such as the one described here not only allows one to investigate how metabolic networks evolve in closely related organisms but also enables testing of biological hypotheses such as specialization of genomes and identification of regions of metabolic networks that are under high selection.

Simulation results for insertion frequencies for the toy network H_{1} shown in

(0.02 MB PDF)

Sub-networks of the toy network H_{1} shown in _{1} but present in the sub-network are shown in gray.

(0.05 MB PDF)

Likelihood surfaces calculated by matrix exponentiation using all 1024 networks (True Likelihood) and using the networks visited by the Gibbs sampler (Estimated Likelihood) for different insertion and deletion rates for the toy networks phylogeny shown in

(0.11 MB PDF)

An example MCMC trace showing the rate parameters for the first 1,000 iterations of the Gibbs sampler for the toy networks phylogeny shown in

(0.04 MB PDF)

Example MCMC traces showing the rate parameters for the first 1,000 iterations of the Gibbs sampler initiated from different starting values. The sampler was run on the

(0.22 MB PDF)

Number of insertion and deletion events for the alterable reactions, that is the reactions which were neither defined as core nor were defined as prohibited in the network obtained using the Gibbs sampler run under the hybrid model. The sampler was run for the six pathway maps used in this analysis over the phylogeny connecting the seventeen ^{th} iteration.

(0.02 MB PDF)

Number of insertion and deletion events at each branch of the phylogeny connecting the seventeen ^{th} iteration. Strain abbreviations: pae:

(0.02 MB PDF)

Degree distributions of nodes at the ancestral levels of the

(0.04 MB PDF)

Degree distributions of nodes at the ancestral levels of the

(0.07 MB PDF)

Degree distributions of nodes at the ancestral levels of the

(0.08 MB PDF)

Degree distributions of nodes at the ancestral levels of the

(0.04 MB PDF)

Degree distributions of nodes at the ancestral levels of the

(0.03 MB PDF)

Degree distributions of nodes at the ancestral levels of the

(0.08 MB PDF)

Pseudo-likelihood P^{*}(H_{2}|H_{1}) conditioned on H for the toy networks shown in

(0.03 MB PDF)

Basic information of the metabolic networks for the seventeen genome-sequenced strains of

(0.04 MB PDF)

Effective sample sizes (ESS) for the estimated parameters (δ: neighbor dependence probability, λ: insertion rate and μ: deletion rate) using the Gibbs sampler run under the hybrid model for the evolution of metabolic networks over the phylogeny connecting different ^{th} iteration. Strain abbreviations: pfl:

(0.03 MB PDF)

Running time and the acceptance percentage of the Gibbs sampler for the estimation of evolution parameters (δ: neighbor dependence probability, λ: insertion rate and μ: deletion rate) run under the hybrid model for different metabolic networks over the phylogeny connecting different ^{th} iteration. Strain abbreviations: pfl:

(0.03 MB PDF)

Testing of the Gibbs sampler for parameter estimation.

(0.04 MB PDF)

We would like to thank Dr. Thomas Mailund for providing valuable comments on the manuscript.