Conceived and designed the experiments: RHB DJK GAC. Performed the experiments: RHB DJK. Analyzed the data: RHB. Wrote the paper: RHB DJK GAC.

The authors have declared that no competing interests exist.

Graphical models describe the linear correlation structure of data and have been used to establish causal relationships among phenotypes in genetic mapping populations. Data are typically collected at a single point in time. Biological processes, on the other hand, are often non-linear and display time-varying dynamics. The extent to which graphical models can recapitulate the architecture of an underlying biological process is not well understood. We consider metabolic networks with known stoichiometry to address the fundamental question:

High-throughput profiling data are pervasive in modern genetic studies. The large-scale nature of the data can make interpretation challenging. Methods that estimate networks or graphs have become popular tools for proposing causal relationships among traits. However, it is not obvious that these methods are able to capture causal biological mechanisms. Here we address the power and limitations of causal inference methods in biological systems. We examine metabolic data from simulation and from a well-characterized metabolic pathway in plants. We show that variation has to propagate through the pathway for reliable network inference. While it is possible for causal inference methods to recover the ordering of the biological pathway, it should not be expected. Causal relationships create subtle patterns in correlation, which may be dominated by other biological factors that do not reflect the ordering of the underlying pathway. Our results shape expectations about these methods and explain some of the successes and failures of causal graphical models for network inference.

Understanding the nature of cause and effect is fundamental to all fields of scientific investigation, but the concept of causality can present special difficulties in biology

Recent advances in high-throughput phenotyping technologies have made large-scale measurements of molecular traits possible. Expression QTL (eQTL), metabolic QTL (mQTL) and protein QTL (pQTL) can be used to link thousands of molecular phenotypes to genetic loci, as well as to clinical phenotypes

The interpretation of a directed edge between nodes

Several algorithms have been proposed for the inference of causal relationships among phenotypes using genetic data

Deterministic models of cellular metabolism can be defined by ordinary differential equations (ODEs) derived from simple laws of mass-balance

Glucosinolates are secondary metabolites that influence the interaction of plant and pest and have a wide range of important functions in human health

The aliphatic glucosinolate biosynthetic pathway occurs in three stages: (1) side-chain elongation, (2) formation of the glucone moiety and (3) side-chain modification. The metabolites that are measured in the Bay

In order to address these questions, we have inferred causal networks from mQTL data using simulated metabolic models of common

Pathway motifs were constructed using ODEs (

(A) Linear, (B) merging pathway via metabolic reaction, (C) merging pathway via independent paths, (D) branching pathway, (E) branching pathway with inhibition, (F) branching pathway with epistasis.

Correlation of the genotype variable,

Left: The correlation between metabolites and genetic multipliers. Correlation indicates evidence of a QTL; the sign and magnitude indicate the direction and size of the effect, respectively. Center: Metabolite correlation after conditioning on QTL. Right: The inferred causal graphical model estimated from the top ten graphs from MCMC. Edge weights indicate regression coefficients.

In most cases, the correlation between metabolites after conditioning on genotype variables was enhanced (

The linear and merging pathway reconstructions did not mimic the ordering in the metabolic pathway (

Significant QTL were identified for all of the metabolites in the aliphatic glucosinolate biosynthesis pathway (

QTL mapping was performed for metabolites in the homo-methionine, dihomo-methionine and penta/hexa-methionine side-chains from the Bay

Correlation dissipated non-uniformly after conditioning metabolites on QTL (

Correlation of metabolites from the Bay

Side chains: homo-methionine, dihomo-methionine and hexahomo-methionine, were first examined independently (

The (A) homo-methionine, (B) dihomo-methionine and (C) hexahomo-methionine side chains were reconstructed independently. (D) The network was reconstructed from the entire panel of aliphatic metabolites and their QTL. Edge weights indicate regression coefficients.

The entire panel of QTL and metabolites from the glucosinolate biosynthesis pathway were examined in a single model (

In order to infer a causal relationship between a substrate

Suppose there is no propagation of the non-genetic variation,

Consider the Bay

A real data illustration of the necessity of non-genetic residual propagation for causal inference. Consider the causal model:

Graphical models provide a framework for estimating causal relationships between genotypes and phenotypes. Models of this type can be used to perform

Several algorithms have been proposed for building graphical models in the context of genetic crosses

We analyzed metabolite data from real and simulated pathways with known network stoichiometry. The Michaelis-Menten kinetics used in our simulated metabolic reactions are special cases of Hill functions and represent a rough approximation to actual enzyme reactions. Similar models have been used to describe gene regulatory networks and other biological phenomena, e.g.
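The relationship between the two rate laws mentioned above can be checked numerically. The following is a minimal sketch, not the simulation code used in this study: the function names and the parameter values (`vmax`, `k`, `n`) are illustrative assumptions. It shows that a Hill function with exponent 1 reduces to the Michaelis-Menten form.

```python
def hill_rate(s, vmax, k, n=1.0):
    """Hill-type rate law: vmax * s^n / (k^n + s^n).

    s is the substrate concentration, vmax the maximal rate,
    k the half-saturation constant and n the Hill exponent.
    """
    return vmax * s**n / (k**n + s**n)


def michaelis_menten_rate(s, vmax, km):
    """Michaelis-Menten rate law: vmax * s / (km + s)."""
    return vmax * s / (km + s)


if __name__ == "__main__":
    # Illustrative (assumed) parameter values, not taken from the paper.
    vmax, k = 2.0, 0.5
    for s in (0.1, 1.0, 10.0):
        # With n = 1 the two columns agree.
        print(s, hill_rate(s, vmax, k, n=1.0), michaelis_menten_rate(s, vmax, k))
```

For any exponent, the rate equals half of `vmax` when the substrate concentration equals `k`; larger exponents only sharpen the transition around that point.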

Correlation in metabolite data can be driven by a variety of factors that do not directly relate to the network stoichiometry. In order to capture the biochemical ordering of the pathway, noise has to propagate through the biochemical network. Many biological pathways are buffered by feedback or other stabilizing features that reduce noise propagation and mask the correlations that would imply causal connections. Failure to consistently observe substrate-product correlation may explain some of the differences observed between the plant data and simulations for matching pathway architectures. Our objective is not to confirm that our simulations accurately reflect the plant data or to make generalizations about certain pathway architectures. Rather, we seek to leverage real data from a well-studied biological system and simulated data from pathway motifs to explore a variety of architectures and conditions. A shortcoming of

In the plant data, many of the substrate-product relationships remain intact after conditioning on QTL (

Conditioning on QTL genotypes strengthens the correlation among metabolites in most of the simulated pathway motifs (

Biosynthetic pathways, which often branch to produce two or more end products, are especially prone to inhibition

Estimation of kinetic parameters in dynamic models requires time course data, which is often sparse, and the computations involved can be challenging

Using both real and simulated data, we tested the ability of graphical models to capture causal relationships between variables from a variety of metabolic pathway topologies and conditions. We found that the use of linear statistical models to approximate relationships in dynamic non-linear systems from static data has some merit, but the results should be interpreted carefully. It is not realistic to expect to fully recover ordered pathway relationships with causal inference methods. Our results emphasize the necessity of biological variation beyond the genetic factors in the model for reliable network inference. We demonstrated that the residual correlation induced between substrate and product in a metabolic reaction can be dominated by a variety of factors, including flux shunting, co-regulation, position in the pathway, genetic factors and inhibition. We found that inhibition can lead to missing edges in graphical models, washing out the genetic signal and making connected pathways look independent. An accurate genetic model is important, especially when epistasis is present. Taken together, these results temper our expectations and explain some of the successes and failures of causal graphical models for genotype-phenotype inference.
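The role of non-genetic variation can be illustrated with a toy example. The sketch below is a simplified linear-Gaussian stand-in, not the ODE-based simulations described in this paper: the model (one QTL acting on a substrate that feeds a product), the function names and all coefficients are assumptions chosen for illustration. It compares the substrate-product correlation after conditioning on the QTL when the substrate's non-genetic variation does, and does not, propagate to the product.

```python
import numpy as np


def simulate(n, propagate, rng, a=1.0, b=0.8, noise_sd=0.5):
    """Toy model Q -> substrate -> product (assumed, linear-Gaussian).

    If propagate is False, the product depends only on the genetic
    component of the substrate, so the substrate's non-genetic noise
    never reaches the product.
    """
    q = rng.integers(0, 2, size=n).astype(float)   # biallelic QTL genotype
    e_sub = rng.normal(scale=noise_sd, size=n)     # non-genetic variation in substrate
    e_prod = rng.normal(scale=noise_sd, size=n)    # non-genetic variation in product
    substrate = a * q + e_sub
    upstream = substrate if propagate else a * q
    product = b * upstream + e_prod
    return q, substrate, product


def residual_correlation(q, x, y):
    """Correlation of x and y after regressing each on the genotype q."""
    design = np.column_stack([np.ones_like(q), q])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for propagate in (True, False):
        q, s, p = simulate(2000, propagate, rng)
        print("propagate =", propagate,
              "residual correlation =", round(residual_correlation(q, s, p), 3))
```

Under these assumptions, the residual correlation is clearly non-zero only when the substrate's non-genetic variation reaches the product; without propagation, the two metabolites appear conditionally independent given the QTL even though they are adjacent in the pathway.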

Metabolic QTL data from a population of 403

Pathway motifs were used to define systems of ODEs that depend on flux rates,

The dynamics of a substrate

There are

The Pearson correlation is calculated for the variables in each pathway architecture. Residuals are estimated after each metabolite is conditioned on the QTL in the model. The residuals are used to calculate the conditional correlation of the metabolites given the genetic factors in the model. Directed graphical models are estimated using Bayesian networks with an MCMC algorithm
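The residual-based conditional correlation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis code used here: it assumes that each metabolite is regressed by ordinary least squares on all supplied genotype columns plus an intercept, and the function names are hypothetical. The Bayesian network estimation step is not shown.

```python
import numpy as np


def residualize(y, genotypes):
    """Residuals of y after least-squares regression on genotypes plus an intercept."""
    design = np.column_stack([np.ones(len(y)), genotypes])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta


def conditional_correlation(metabolites, genotypes):
    """Pearson correlation among metabolites given the genetic factors.

    metabolites: array of shape (n_samples, n_metabolites)
    genotypes:   array of shape (n_samples,) or (n_samples, n_qtl)
    """
    resid = np.column_stack([
        residualize(metabolites[:, j], genotypes)
        for j in range(metabolites.shape[1])
    ])
    return np.corrcoef(resid, rowvar=False)


if __name__ == "__main__":
    # Synthetic (assumed) data: one QTL affects two otherwise unrelated metabolites.
    rng = np.random.default_rng(0)
    q = rng.integers(0, 2, size=500).astype(float)
    m = np.column_stack([2 * q + rng.normal(scale=0.5, size=500),
                         2 * q + rng.normal(scale=0.5, size=500)])
    print("marginal:   ", np.corrcoef(m, rowvar=False)[0, 1])
    print("conditional:", conditional_correlation(m, q)[0, 1])
```

In this synthetic case the marginal correlation is driven entirely by the shared QTL, so it largely disappears after conditioning on the genotype.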


We thank Daniela Kamir for her helpful discussions on biochemistry.