9 Apr 2014: The PLOS Computational Biology Staff (2014) Correction: Learning Gene Networks under SNP Perturbations Using eQTL Datasets. PLoS Comput Biol 10(4): e1003608. doi: 10.1371/journal.pcbi.1003608 View correction
The standard approach for identifying gene networks is based on experimental perturbations of gene regulatory systems such as gene knock-out experiments, followed by a genome-wide profiling of differential gene expressions. However, this approach is significantly limited in that it is not possible to perturb more than one or two genes simultaneously to discover complex gene interactions or to distinguish between direct and indirect downstream regulations of the differentially-expressed genes. As an alternative, genetical genomics study has been proposed to treat naturally-occurring genetic variants as potential perturbants of gene regulatory system and to recover gene networks via analysis of population gene-expression and genotype data. Despite many advantages of genetical genomics data analysis, the computational challenge that the effects of multifactorial genetic perturbations should be decoded simultaneously from data has prevented a widespread application of genetical genomics analysis. In this article, we propose a statistical framework for learning gene networks that overcomes the limitations of experimental perturbation methods and addresses the challenges of genetical genomics analysis. We introduce a new statistical model, called a sparse conditional Gaussian graphical model, and describe an efficient learning algorithm that simultaneously decodes the perturbations of gene regulatory system by a large number of SNPs to identify a gene network along with expression quantitative trait loci (eQTLs) that perturb this network. While our statistical model captures direct genetic perturbations of gene network, by performing inference on the probabilistic graphical model, we obtain detailed characterizations of how the direct SNP perturbation effects propagate through the gene network to perturb other genes indirectly. We demonstrate our statistical method using HapMap-simulated and yeast eQTL datasets. In particular, the yeast gene network identified computationally by our method under SNP perturbations is well supported by the results from experimental perturbation studies related to DNA replication stress response.
A complete understanding of how gene regulatory networks are wired in a biological system is important in many areas of biology and medicine. The most popular method for investigating a gene network has been based on experimental perturbation studies, where the expression of a gene is experimentally manipulated to observe how this perturbation affects the expressions of other genes. Such experimental methods are costly, laborious, and do not scale to a perturbation of more than two genes at a time. As an alternative, genetical genomics approach uses genetic variants as naturally-occurring perturbations of gene regulatory system and learns gene networks by decoding the perturbation effects by genetic variants, given population gene-expression and genotype data. However, since there exist millions of genetic variants in genomes that simultaneously perturb a gene network, it is not obvious how to decode the effects of such multifactorial perturbations from data. Our statistical approach overcomes this computational challenge and recovers gene networks under SNP perturbations using probabilistic graphical models. As population gene-expression and genotype datasets are routinely collected to study genetic architectures of complex diseases and phenotypes, our approach can directly leverage these existing datasets to provide a more effective way of identifying gene networks.
Citation: Zhang L, Kim S (2014) Learning Gene Networks under SNP Perturbations Using eQTL Datasets. PLoS Comput Biol 10(2): e1003420. doi:10.1371/journal.pcbi.1003420
Editor: Kevin Chen, Rutgers University, United States of America
Received: July 23, 2013; Accepted: November 18, 2013; Published: February 27, 2014
Copyright: © 2014 Zhang, Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This material is based upon work supported by an NSF CAREER Award under grant No. MCB-1149885, Sloan Research Fellowship, and Okawa Foundation Research Grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Recent advances in the next-generation sequencing and other high-throughput technology has allowed researchers to collect various types of genome-scale datasets, providing unprecedented opportunities to discover detailed gene regulation processes in cells via analysis of the massive data. The standard approaches for identifying the causal regulatory relationship among genes have been based on examining gene-expression data collected after an experimental perturbation of one or two genes such as in gene knockout studies ,  or over-expression studies . In a typical experimental design, genome-wide gene-expression levels are measured using microarrays under different experimental conditions such as strains with one or two genes knocked out , . Then, the differential gene-expression patterns between control and experimental conditions are examined to obtain clues as to the organization of genes into functional modules and key regulators of those modules.
However, this standard approach for identifying the wiring of gene networks based on experimental perturbations of gene regulation system comes with many limitations. Experimentally perturbing the activity of a gene can be very costly, time-consuming, and laborious and it is even more so for repeating such perturbation for every single gene in an organism to obtain a comprehensive picture of gene network wiring. Furthermore, the experimental methods are usually limited to a perturbation of one or two genes at a time due to experimental infeasibility and combinatorial explosion in the number of experiments to perform. Thus, they cannot be used to perturb more than two genes at the same time to obtain information on multifactorial gene interactions. More importantly, it is often not possible to apply such experimental perturbation studies to humans for ethical reasons. Finally, given the set of differentially expressed genes under each perturbation, it is difficult to distinguish between those genes that are directly regulated by the perturbed gene and those genes in the downstream of the pathway whose expressions are influenced as secondary/indirect effects.
Genetical genomics approach has been proposed as a less expensive but more powerful alternative to the approach with experimental perturbations , . Genetical genomics treats genetic variation as naturally-occurring perturbation of gene regulatory networks and tries to learn gene networks by examining the effects of genetic variation on gene expression measurements within a large population of individuals. The key advantage of genetical genomics approach is that unlike experimental perturbations that can be performed only on one or two genes at a time, there are more than millions of genetic variants across genomes in the case of a human population, enabling the effects of multifactorial perturbations to be observed directly in gene expression data. Another advantage is that while experimental perturbation studies involve artificial perturbations in lab with often large perturbation effects, the perturbations of gene network by genetic variants occur in nature and usually induce more subtle changes in gene expressions. Thus, understanding the consequences of network perturbations by genetic variants is likely to lead to more direct understanding of the gene networks that exist in nature. In addition, genetical genomics approach can be easily applied to humans as well as other organisms, since genotype and gene-expression data are routinely collected for the purpose of expression quantitative trait locus (eQTL) mapping to understand the genetic architecture of complex phenotypes and diseases , .
However, the genetical genomics approach poses a significant computational challenge, because it is not obvious how to decode the effects of multifactorial genetic perturbations from genotype and gene expression data. In an experimental approach with a perturbation of one or two genes, the genes that are differentially expressed under each perturbation experiment can be easily identified with a simple computation. However, in genetical genomics approach, the gene expression variability is the result of aggregated effects of multifactorial perturbations by a large number of genetic variants and it is not obvious how to decouple the aggregated perturbation effects to identify the set of genes that are differentially expressed with respect to each perturbation by each individual genetic variants. Furthermore, as many of the genetic variants do not have any functional consequences or perturbation effects, genetical genomics approach has an additional computational challenge of identifying eQTLs or the genetic variants that affect gene expressions, while identifying the gene network perturbed by these eQTLs at the same time. Because of these computational challenges, given gene-expression and genotype data, researchers tended to limit their analysis to eQTL mapping, where eQTL mapping can be viewed as a special case of genetical genomics analysis that assumes an isolated effect of genetic perturbation on a single gene with no downstream effects in gene network. While statistical methods such as graph-guided fused lasso (GFlasso)  have been developed to detect genetic effects on multiple correlated gene-expression traits, they focused only on eQTL mapping assuming a known gene network, instead of performing a more powerful genetical genomics analysis. Those few existing computational methods for genetical genomics analysis have been limited in terms of computational efficiency and statistical power –. For example, discovering eQTLs and reconstructing the gene network perturbed by those eQTLs were performed in two separate steps, leading to reduced statistical power , or average gene-expression levels within each gene module were used as a trait to identify eQTLs rather than using the original gene-expression data, leading to the loss of information on individual gene activities .
In this paper, we propose a statistical framework that directly addresses all of the above computational challenges of genetical genomics analysis within a single statistical analysis to achieve the maximum statistical power for identifying gene networks via single-nucleotide-polymorphism (SNP) perturbations. Given SNP genotype and gene-expression data collected for a large number of individuals, our statistical method simultaneously identifies 1) the gene network structure by decoding the effects of multifactorial perturbations of gene regulation system by a large number of SNPs, 2) eQTLs that perturb this gene network, 3) the genes whose expressions are directly perturbed by eQTLs and the genes whose expressions are indirectly perturbed as secondary downstream effects of the direct perturbations in the network, and 4) detailed characterizations of the SNP perturbation effects on the gene network by decoupling the complex multifactorial SNP effects on the gene network with respect to each individual perturbation.
Our proposed statistical framework is based on probabilistic graphical models, and in particular, we introduce a new statistical model, called a sparse conditional Gaussian graphical model (CGGM), that models a gene network under SNP perturbations as an undirected graphical model. In our statistical model, the unknown gene network is represented as a graph over gene-expression traits and this graph is associated with an unknown probability distribution that models the strengths of gene-gene interactions in the gene network and the strengths of perturbation effects of SNPs (Figure 1A). Then, both the gene network structure perturbed by SNPs and probability distribution associated with the network structure are learned jointly from data. We show that the learning problem is convex, leading to increased statistical power and a guarantee in the quality of the estimated model, and develop an efficient learning algorithm that scales to a large dataset. Given the estimated graphical model, we describe inference methods to characterize the detailed mechanisms of how the effects of SNP perturbations propagate through the network (Figures 1B–D).
(A) The graph structure of sparse CGGM for modeling a gene network perturbed by SNPs. The gene network is defined over gene-expression traits 's (). The edges between gene-expression traits 's and SNPs 's () indicate the direct perturbations of the gene-expression traits by the given SNPs. The nodes for SNPs 's are shaded to show that the SNPs are conditioning variables in the conditional probability model. (B) Illustration of how the effects of the direct perturbation of the gene network by SNP propagate through the gene network, as obtained by performing inference on sparse CGGM in Panel (A). While SNP perturbs gene-expression traits and directly, this effect propagates through the network to perturb the expressions of other genes indirectly. The two directly perturbed genes and are shown as diamond-shaped nodes. The size and color-shade of each node indicate the strength of indirect perturbation of the given gene-expression trait by SNP , with a larger and darker node for stronger perturbation. (C) The portion of the overall indirect SNP perturbation effects in Panel (B) that arose from the propagation of the direct perturbation of gene by SNP . (D) The portion of the overall indirect SNP perturbation effects in Panel (B) that arose from the propagation of the direct perturbation of gene by SNP . Within our statistical framework, we can perform inference on sparse CGGM in Panel (A) to obtain the indirect perturbations in Panel (B), and then decompose the indirect perturbations in Panel (B) into Panels (C) and (D) in a principled manner.
From the computational point of view, addressing the challenges of genetical genomics analysis requires handling the computational challenges of both gene-network analysis given gene-expression data and eQTL mapping given gene-expression and genotype data, namely gene-network structure learning and SNP feature selection, at the same time. We show that in fact our sparse CGGM subsumes as special cases both a sparse Gaussian graphical model –, which is popular as a model for gene network, and a sparse linear regression model , , , which is widely used for eQTL mapping, thus providing a natural unifying representation for a gene network and eQTLs perturbing this network. Moreover, by embedding the standard regression model for eQTL mapping within a probabilistic graphical model and leveraging the representational power of a graphical model, our approach allows to extract a significantly more detailed characterization of the functional roles of eQTLs than any of the existing methods for eQTL mapping.
In our experiments, we apply our statistical framework to HapMap-simulated and yeast eQTL datasets , . Using HapMap-simulated data, we demonstrate our approach can recover the true underlying gene network under SNP perturbations, and at the same time, can recover true eQTLs with greater statistical power than other existing methods that have been developed for eQTL mapping. In addition, we applied our method to yeast eQTL dataset collected for 112 segregants of two yeast parent strains, BY4716 and RM11-1a . Nearly all of the previous analyses of this dataset focused either on eQTL mapping or on gene network analysis ignoring the genetic information, with an exception of the genetical genomics analysis of this dataset performed by multivariate regression with covariance estimation (MRCE)  that has also been mistakenly called sparse CGGM  as we further discuss in the next section. However, the analysis by MRCE  was performed for only a small subset of the full dataset because of its expensive computational cost. In our experiment, MRCE took more than weeks of computation to analyze the full yeast eQTL dataset, whereas our method ran within a day. We provide an in-depth analysis of SNP perturbations of a subnetwork over genes involved in stress response in yeast, and show that this subnetwork obtained from SNP perturbations by our approach is well supported by the network inferred from experimental perturbations in knock-out studies in the literature.
Materials and Methods
We begin our discussion with a brief background on sparse Gaussian graphical models for learning gene networks from gene expression data and sparse regression models for identifying eQTLs from gene expression and SNP data. Then, we present our proposed statistical framework for genetical genomics analysis and describe the model, learning algorithm, and inference procedure for sparse CGGM.
Background on Sparse Gaussian Graphical Models for Gene Network Learning
A Gaussian graphical model defines a probability distribution over an undirected graph that models a gene network. The nodes of the graph correspond to continuous-valued random variables for gene-expression traits and the edges represent probabilistic conditional dependence relationships between pairs of nodes –. Given microarray gene-expression measurements for genes and individuals, a Gaussian graphical model assumes that the gene-expression measurement for the th individual is an independently and identically distributed sample from a Gaussian distribution , where is a vector of 0's and is a covariance matrix. Then, it is well-known that the inverse covariance represents a Gaussian graphical model, where the non-zero (or zero) value for in the th entry of represents the presence (or absence) of edges between the th and th gene-expression traits in the gene network. While each non-zero element in implies conditional dependency between the th and th gene-expression traits given all the other gene-expression traits, computing the inverse of to obtain the covariance amounts to performing an inference in this graphical model to obtain marginal dependencies, or equivalently dependencies between the two nodes without consideration of any other nodes.
Graphical lasso , ,  has been widely used to learn a sparse Gaussian graphical model, where only statistically significant gene-gene interactions have edges with non-zero entries in . Graphical lasso minimizes the negative log-likelihood with sparsity-inducing penalty as follows:(1)where is the trace of matrix , is the sample covariance matrix, is the norm of , and is the regularization parameter that determines the amount of sparsity. A large value of leads to a sparser estimate with a greater number of zero elements in . The optimal value for can be determined using cross-validation. The problem in Eq. (1) is convex and can be solved efficiently .
Background on Sparse Regression Methods for eQTL Mapping
In eQTL mapping, the problem of identifying SNPs influencing gene-expression levels from eQTL data is often formulated as that of learning a multivariate linear regression model , , . Given genotype data for SNPs , where is a vector of length with each element taking values from for the number of minor alleles at the given locus, and the expression measurements for genes for samples, a linear regression model for the functional mapping from SNPs to gene-expression traits is given as:(2)where is the regression coefficient matrix representing the unknown association strengths, and is the matrix of noise terms whose rows are Gaussian-distributed with mean zeros and covariance . Typically, a model without an intercept is considered, assuming that the genotype data are standardized to have mean and unit variance.
Since each gene-expression trait typically has only a small number of eQTLs, lasso ,  has been widely used to obtain a sparse estimate of . Lasso minimizes the squared-error criterion with penalty as follows:(3)where is a regularization parameter that controls the amount of sparsity in . Eq. (3) is convex with a single globally optimal solution, and efficient algorithms are available for solving it .
Lasso essentially performs separate regression analyses, treating the gene-expression traits as independent of each other. In order to combine the statistical power across multiple correlated gene-expression traits, GFlasso  assumed a known gene network and extended the standard lasso by including an additional penalty, called graph-guided fusion penalty, that encourages multiple related genes in the gene network to be influenced pleiotropically by a common SNP. Given a gene network with a set of edges and edge weights 's for each edge between the th and th genes, the graph-guided fusion penalty takes the form of , where each term in the penalty encourages the amount of influence of the th SNP on the expression levels of the th and th genes to be similar if the two genes are connected with an edge in the network. GFlasso was capable of identifying SNPs with pleiotropic effects, but it was restrictive in that the gene network should be known a priori.
Within the same statistical framework of linear regression method, MRCE  attempted to shift the focus from eQTL mapping towards genetical genomics analysis by identifying the gene network and eQTLs jointly. Towards this goal, MRCE relaxed the assumption of uncorrelated noise (i.e., being a diagonal matrix) in lasso and estimated the full noise covariance matrix . Then, the inverse of the noise covariance corresponds to a gene network. MRCE minimizes the negative log-likelihood of data with an penalty for both and :(4)where and are the regularization parameters. We notice that unlike GFlasso, MRCE does not have any mechanisms to leverage the estimated gene network to model pleiotropic effects of SNPs on multiple correlated gene-expression traits in . The optimization problem in Eq. (4) is not convex, but bi-convex, since fixing either or and solving for the other is a convex optimization problem. Thus, Rothman et al.  proposed to optimize for each of and alternately given the other over iterations. However, they noted that this strategy often does not converge, and instead suggested to use an approximate method that prematurely terminates the iterative optimization procedure after two iterations. As we discuss in the Results section, we found that even this approximate method was too slow to be applicable to a dataset of even moderate size.
The same statistical method for MRCE has been proposed independently in the literature under the name of sparse conditional Gaussian graphical models (CGGMs) . However, we emphasize that the statistical model that is learned in MRCE is not a graphical model, because as we further discuss in detail in the next sections, the parameters do not model conditional dependencies as in graphical models but only models marginal dependencies. We believe that MRCE was mistakenly called a sparse CGGM due to the resemblance between the inverse noise covariance matrix in MRCE and the inverse covariance matrix in graphical lasso as well as the aspect of the standard regression model as a conditional model for given . The sparse CGGM that we propose in this paper is set up as a proper probabilistic graphical model and as we show in the next sections, is significantly more powerful than MRCE in terms of representational power and computational efficiency.
Under the same name of sparse CGGMs, Li et al.  introduced a related but different statistical method that also models a gene network corresponding to in MRCE. However, unlike our approach and MRCE, the estimation procedure for the sparse CGGM in  amounted to a two-stage process, where the gene-expression data are pre-processed to remove SNP effects in the first stage and then these pre-processed gene-expression data are used to learn a gene network in the second stage. Thus, their graphical model was defined only on gene-expression traits and did not directly model the relationship between SNPs and gene expressions to identify eQTLs. In contrast, our sparse CGGM is set up as a graphical model on both gene expressions and SNPs, performs a joint estimation of gene network and eQTLs, and infers various perturbation effects of SNPs on gene expressions via inference.
CGGMs for Modeling Gene Networks under SNP Perturbations
In this section, we introduce a statistical model for CGGM as a model for a gene network under SNP perturbations. Then, in the next sections, we describe a learning algorithm for estimating a sparse model for CGGM from data and discuss inference schemes for the estimated sparse CGGM. A sparse CGGM estimated from data captures the gene network structure and direct perturbations of the gene-expression levels by eQTLs via conditional dependency structure in the estimated graph structure. By performing inference on this estimated sparse CGGM, we can obtain a detailed characterization of how the direct SNP perturbation effects propagate through the gene network to perturb the expression levels of other genes indirectly.
The key idea behind our proposed approach is to model a gene network under SNP perturbations as a Gaussian graphical model for a gene network conditional on SNPs. We derive a CGGM as a conditional distribution from the Gaussian graphical model for a joint probability distribution for SNPs and gene-expression traits for the th individual. Let us assume a Gaussian graphical model with covariance and inverse covariance , where zero mean is assumed after mean-centering each gene-expression trait and SNP. Then, the conditional distribution of given can be obtained as . We further re-write this conditional distribution, using the inverse covariance matrix and the partitioned inverse formula  to obtain a CGGM:(5)CGGM parameters and represent a gene network and SNP perturbation effects on this gene network, respectively. A non-zero value for the th element of indicates that the th SNP is an eQTL for the th gene-expression trait. This SNP perturbation captures the direct influence of the th SNP on the th gene expression, since the graphical model captures conditional dependencies. While in experimental perturbation studies the expressions of only one or two genes can be directly perturbed (e.g., by knocking out the genes), our CGGM for genetical genomics study allows multiple gene-expression traits to be perturbed by multiple SNPs at the same time. Then, this multifactorial genetic perturbations of gene-expression levels are decoded to learn a gene network by a learning algorithm that estimates and simultaneously.
In order to show the direct correspondence between a CGGM and a general undirected graphical model, we re-write Eq. (5) by expanding the quadratic term in the Gaussian distribution in Eq. (5) to obtain:(6)where is a constant, also known as a partition function in the literature of probabilistic graphical models , which ensures that forms a proper probability distribution integrating to . As Eq. (6) is equivalent to the Gaussian-distribution form in Eq. (5), this constant can be obtained in a closed-form by directly comparing Eq. (6) with Eq. (5). If is positive definite, the integral in the partition function is finite and the probability distribution is well-defined. The representation in Eq. (6) explicitly shows that a CGGM is an undirected graphical model  defined over a graph with two sets of edges, namely the set of edges connecting each pair of gene-expression traits in gene network and another set of edges connecting each SNP to gene expressions that the SNP is influencing (Figure 1A). Then, following the definition of an undirected graphical model , the numerator in Eq. (6) is a weighted sum of features over the graph edges, where 's and 's are features and and define edge weights. The gene network is modeled an undirected graph, but the directions from SNPs to gene-expression traits are implicit, since the model is a conditional probability model for gene-expression traits conditional on SNPs.
Learning Sparse CGGMs
Since both gene-gene interactions and SNP perturbations of gene-expression traits are highly modular and localized, we are interested in learning a sparse model for CGGM. In other words, only statistically significant gene-gene interactions should be represented as edges with non-zero entries in and only a small number of statistically significant direct SNP perturbations should be estimated as having non-zero effect sizes in . In order to impose a sparsity constraint, we learn a sparse CGGM by minimizing the negative log-likelihood of data with an penalty as follows:(7)where is the negative log-likelihood of data based on Eq. (6) or equivalently Eq. (5), and and are the regularization parameters that control the amount of sparsity. It is not necessary to explicitly consider the positive-definite constraint for within the optimization problem in Eq. (7), because the partition function in the data log-likelihood contains term that acts as a log-barrier function for the positive-definite constraint . Within the penalty for , we do not penalize the diagonal elements of , since we found that this leads to a slightly better performance in our experiments, consistent with what has been reported for graphical lasso. It is straightforward to prove that the problem in Eq. (7) is convex (Text S1) . Thus, the learning algorithm is guaranteed to find the globally optimal solution that achieves the maximum statistical power.
The main challenge for solving Eq. (7) arises from the non-smoothness of the penalty function. We adopt a variant of accelerated proximal gradient algorithms, called a Nesterov's second method , that has been developed as a general-purpose algorithm for handling a non-smooth component of the parameter estimation problem while improving the convergence (and thus, computation time) of the standard gradient descent algorithm –. We provide details of the learning algorithm in Text S2.
Lasso  and graphical lasso , ,  can be viewed as special cases of the sparse CGGM estimation problem in Eq. (7). When , the sparse CGGM learning problem in Eq. (7) essentially reduces to applying graphical lasso to gene-expression data, ignoring genotype data, since the large encourages all or nearly all of the elements of to be set to zeros. On the other hand, if , the sparse CGGM learning problem becomes equivalent to lasso that fits a regression model for each gene-expression trait separately, ignoring gene network, since the large tends to set all or almost all of the off-diagonal elements of to zeros. The optimal values for and that strike the right balance between these two extreme cases can be found by cross-validation.
Inference on Sparse CGGMs for Detailed Characterizations of SNP Perturbations of Gene Networks
So far, we showed that by learning a sparse CGGM, it is possible to decode the underlying gene network and its direct multifactorial perturbations by SNPs from data. Now, we show that given a sparse CGGM estimated from data, we can perform inference on this graphical model to characterize the mechanisms of SNP perturbations of gene network in detail. Below, we discuss how inference schemes can be used on our estimated model to learn about indirect/secondary downstream effects of the direct SNP perturbations, a decomposition of the overall multifactorial SNP perturbation effects with respect to each individual direct perturbation, and a decomposition of observed covariance in gene expressions into genetic and non-genetic components. We note that all of these inference schemes involve only few simple matrix operations and are highly efficient.
Direct and indirect SNP perturbations of gene network.
Unlike experimental perturbation studies for learning a gene network, in genetical genomics analysis with our statistical framework, the properties of a sparse CGGM as a graphical model allow us to learn whether the differential gene expression arises from direct or indirect/secondary effects of the SNP perturbation. Direct perturbations of gene-expression traits by SNPs are encoded explicitly as conditional dependencies or as the sparsity pattern of in the estimated sparse CGGM. Then, performing inference on this graphical model is equivalent to inferring indirect SNP perturbations of gene-expression traits, which arose from the effects of direct SNP perturbations propagating through the gene network to affect other neighboring gene-expression traits indirectly and pleiotropically.
More specifically, while Eq. (6) provides the standard form of a CGGM as a graphical model (Figure 1A), obtaining the form in Eq. (5) from Eq. (6) amounts to performing inference in the CGGM (Figure 1B). In other words, performing inference in a CGGM parameterized by and is equivalent to computing and , where the choice of notation and was deliberate due to the connection between inference in a CGGM and the standard regression model in Eq. (2) as discussed in the next section. Thus, even if the th SNP does not perturb the th gene-expression trait (i.e., the th entry of is zero), if the th entry of is non-zero, the th SNP perturbs the th gene-expression trait indirectly, as a result of the downstream effects of the direct perturbation of other genes in the network by the th SNP.
More generally, indirect SNP perturbations can be inferred by examining the sparsity pattern of . The indirect SNP perturbations as captured by the sparsity pattern of in sparse CGGM is determined by both direct SNP perturbations (the sparsity pattern of ) and the topology of the estimated gene network (the sparsity pattern of ). For example, if the gene network contains multiple connected components, where each connected component is a network over a subset of gene-expression traits without any edges across different connected components, the effects of direct SNP perturbations propagate to influence other gene-expression traits only within the given connected component. This leads to a sparsity pattern in , where a SNP directly influencing a gene-expression trait in has non-zero indirect perturbation effects in for all of the gene-expression traits within the same connected component. On the other hand, if is a diagonal matrix with no edges in the gene network, the direct SNP perturbation effects do not propagate through the gene network, and the sparsity pattern in remains the same as that of . More generally, gene network plays the role of calibrating how the effects of direct genetic perturbations in propagate to affect other gene-expression traits indirectly in , such that an eQTL has strong pleiotropic effects within strongly-connected parts of the gene network.
Decomposition of SNP perturbation effects.
Given a sparse CGGM, the amount of the overall influence of a SNP on a particular gene-expression trait in can be further decomposed into contributions from each of the directly-perturbed genes that pass the perturbation effects to other parts of the network. For example, while each row of captures the overall direct/indirect perturbation effects of the given SNP on gene network (Figure 1B), this overall perturbation effects can be decomposed with respect to each of the genes that are directly perturbed by the same SNP (Figures 1C and D). In experimental perturbation studies, only one or two genes are perturbed, but in genetical genomics analysis with multifactorial perturbations, each SNP can directly perturb multiple genes at the same time, such as mutations in a transcription factor perturbing expressions of multiple target genes. We show that a simple matrix operation for a decomposition of can identify how each of the directly perturbed genes propagate the perturbation effects to other genes.
More specifically, the overall indirect perturbation effects can be decomposed into 's, , as follows:(8)where and represent the th column and row of matrix , respectively. contains information on how the th gene passes on the direct SNP perturbation effects it received to other gene-expression traits. Figures 1B, C, and D illustrate a single row of each of , , and , respectively, corresponding to SNP and gene-expression traits and in the graphical model in Figure 1A.
Decomposition of gene-expression covariance.
In addition, given the estimate of sparse CGGM parameters, the overall observed covariance in gene-expression data can be approximately decomposed into the covariance induced by the gene network and the covariance induced by direct SNP perturbations propagating to different parts of the network. In order to derive this decomposition, ignoring the penalty that introduces a sparsity bias, we consider the problem of finding the optimal CGGM parameter as finding the parameter values that satisfy the following optimality condition:(9)where the expectation is taken with respect to and can be computed asThen, from the above two equations, we obtain(10)The left-hand side of the above equation corresponds to the observed covariance of gene-expression traits, whereas the first and second terms on the right-hand side of the equation correspond to the decomposition of the overall covariance into the part induced by gene network and the other part induced by SNP perturbations, respectively. Eqs. (9) and (10) also follow directly from the standard result on the property of log-linear models, a type of undirected graphical models, that finding maximum likelihood estimates of parameters is equivalent to matching the expectation of features under the model with the empirical expectation of features (Theorem 20.1 in ). CGGMs are a member of log-linear models, and in Eq. (9), the expectation of features under the model and the empirical expectation of features correspond to and , respectively, where are features. In sparse CGGMs, because of the bias introduced by the penalty for sparsity in parameters, the equality in Eq. (10) holds only approximately.
Relationship between CGGM and standard regression method.
The CGGM representation in Eq. (5) as obtained after inference in the probabilistic graphical model in Eq. (6) reveals its connection to the standard regression model in Eq. (2). The standard regression model is parameterized directly by and , whereas in CGGMs, the graphical model parameterized by and is learned first and then and are obtained via inference. Maximum-likelihood estimation of parameters in the absence of sparsity constraint leads to the same estimate for in both models. However, imposing a sparsity constraint on and for the conditional dependencies and then obtaining via inference in sparse CGGM results in different estimates for , compared to sparsifying directly as in the standard sparse regression approach. We argue that imposing a sparsity constraint on CGGM parameters rather than on the standard regression parameters produces more interpretable results that distinguish between sparse direct SNP perturbations and their propagation through the gene network for indirect SNP perturbations of the network. This is analogous to the case of Gaussian graphical models, where it is generally accepted as a more reasonable approach to impose a sparsity constraint on inverse covariance parameters for the graphical model than on covariance parameters that result from inference in the graphical model.
The CGGM representation in Eq. (5) after performing inference also reveals its connection to and advantages over MRCE. Although MRCE and CGGM are related in that in MRCE corresponds to in CGGM, MRCE is based on the standard regression model and is parameterized by . One of the key advantages of using CGGM parameterization over MRCE parameterization is that our formulation leads to a convex optimization for estimation, whereas MRCE estimation problem is only bi-convex, not jointly convex. The important consequence of this difference is that MRCE optimization only reaches a locally-optimal solution with unstable convergence. In contrast, CGGM optimization is orders-of-magnitude more efficient than MRCE and finds the globally optimal solution, achieving higher accuracy for the recovered network and eQTLs with greater computational efficiency. Finally, although in MRCE and in CGGM are equivalent as far as the model is concerned, their estimates are not identical because of the difference in the properties of optimization problems and the quality of solutions between the two approaches.
In this section, we demonstrate the performance of our statistical framework on HapMap-simulated and yeast eQTL datasets ,  and compare the results with those from MRCE  that has been previously developed for genetical genomics analysis. Although our method primarily focuses on the recovery of gene network under SNP perturbations, it has the potential to enhance the power for detecting eQTLs by considering eQTL mapping in the context of network learning problem. In order to test this hypothesis, we also compare our method and MRCE with GFlasso  in terms of the accuracy for detecting eQTLs, as GFlasso has been developed specifically for eQTL mapping.
We used a Matlab implementation of Algorithm 1 in Text S2 for sparse CGGM, an R implementation of MRCE downloaded from the authors' website, and a Matlab implementation of the smoothing proximal gradient method  for GFlasso. We used the approximate MRCE algorithm that terminates the exact method after two iterations as suggested by the authors , because the exact method often did not converge and was too slow to be applicable to large-scale simulation experiments, taking two or three days with cross-validation for a single simulated dataset of 500 SNPs and 30 gene-expression traits.
In order to simulate eQTL datasets, we used the SNP genotype data from HapMap phase III release 2  as SNP data and simulated gene-expression traits , given and known model parameters. We used the SNP data for chromosome 21 of the 343 individuals of African origin, including ASW, LWK, MKK, and YRI population groups. After removing SNPs with minor allele frequency and highly correlated SNPs with squared correlation coefficient , we obtained 4,901 SNPs. In each simulated dataset, we randomly selected a region of 500 SNPs and simulated the values of 30 gene-expression traits for each individual, based on the CGGM in Eq. (5). As almost all statistical methods for eQTL mapping assumes the standard linear regression model in Eq. (2), we performed experiments on gene-expression traits simulated from this model as well. In order to set the model parameters, we first set the sparsity pattern and then assigned values to the non-zero elements of the parameters as follows.
- Simulation from CGGM with known and . In order to set the sparsity pattern of for gene network structure, we randomly partitioned the 30 expression traits into three modules of similar size and assumed relatively dense edge connections within each module and sparse edge connections between different modules. Specifically, we connected a pair of gene-expression traits with an edge with probability within a module and with probability between modules. Given this sparsity pattern in , we assigned values to the non-zero elements of by drawing edge weights randomly from a uniform distribution for pre-determined values of and and setting to the graph Laplacian of this weight matrix. A graph Laplacian is a symmetric matrix, where the off-diagonal elements are negative edge weights and diagonal elements are the sum of the weights of all the edges connecting to each node . In general, a graph Laplacian is only positive-semidefinite, and to ensure is positive definite, we added small positive values to the diagonal elements.
To assign the sparsity pattern of for direct SNP perturbations, we assumed that only a single gene-expression trait within each module is perturbed directly by a small number of SNPs. Specifically, we randomly selected one gene-expression trait in each of the three modules in and set each SNP as directly perturbing the selected gene-expression trait with probability . Then, we set the values of the non-zero elements of to random draws from a uniform distribution .
- Simulation from the standard linear regression model with known and . We used the same strategy as in to set . In order to set the sparsity pattern of , we assumed pleiotropic effects of eQTLs, where all gene-expression traits within each module are influenced by the same set of SNPs. For each gene module in , we let each SNP be an eQTL for the gene-expression traits in the module with probability . The association strengths for these eQTLs were randomly drawn from a uniform distribution .
While in our simulation studies we primarily focused on the relatively small datasets of 500 SNPs and 30 gene-expression traits as described above, in order to demonstrate the performance and scalability of our method, we also applied our method to larger-scale simulated datasets of 1,000 SNPs and 500 gene-expression traits. Because MRCE, the main competing method to our approach, required substantially more computation time than our approach even on the smaller datasets, we were unable to compare the performance of MRCE on these larger simulated datasets. Instead, we compared our method with other computationally efficient methods, including GFlasso and a base-line approach of applying graphical lasso  and lasso  sequentially to learn gene networks and eQTLs.
We use precision-recall curves and prediction errors as quantitative measures of the performance of different statistical methods. Precision-recall curves summarize how accurately each method recovers the true eQTLs and gene network structure by plotting precisions and recalls on - and -axes. In order to compute precisions and recalls, for each simulated dataset, we ranked all the elements of the estimated parameter matrices in a descending order according to their absolute values and compared this ranked list with the set of non-zero elements in the true parameters. On the other hand, prediction errors evaluate the performance of different methods on how accurately each method can predict gene-expression levels, given SNPs and estimated model parameters. Once the parameters are estimated using training data, prediction errors are obtained as , where is the prediction of gene expressions given by the model for the th individual in an independent test dataset and is the number of samples in the test set.
Given the 343 samples in the full dataset, we used 300 samples as a training dataset and the remaining 43 samples as a test dataset. In order to determine the optimal regularization parameters in sparse CGGM, MRCE, and GFlasso during the training phase, we created a grid for different choices of regularization parameters, performed a five-fold cross-validation for each point on the grid, and selected the values that give the smallest cross-validation error as the optimal regularization parameters.
Illustration of Sparse CGGM Behavior
In order to illustrate the behavior of a sparse CGGM, we present the results from applying our method to a single dataset simulated from a CGGM parameterization and compare them with what we obtained from GFlasso and MRCE (Figure 2). The non-zero elements of the true model parameters were drawn from for and for . Since GFlasso requires the gene network to be known, we used the correlation coefficient matrix of gene-expression trait data thresholded at as a gene network in GFlasso estimation. The true parameters for and along with are shown in the left, middle, and right columns, respectively, in Figure 2A. The estimated parameters for sparse CGGMs, MRCE, and GFlasso are shown in Figures 2B–D, respectively. In the plots for and , the rows and columns correspond to gene-expression traits and SNPs, respectively, and the results are shown only for the first 150 SNPs. In each panel, the white pixels correspond to the zero elements of the parameters and the darker pixels to non-zero elements. We note that while sparse CGGM provides the estimates of both and , MRCE and GFlasso provide a single estimate of eQTL effect sizes in and do not distinguish between direct and indirect effects of eQTLs on gene-expression traits.
A known sparse CGGM was used to generate the simulated dataset. The left, middle, and right columns show the absolute values of for gene-network edge weights, for the strengths of direct SNP perturbations, and for strengths of indirect SNP perturbations, respectively. In the middle and right columns, and are shown with gene-expression traits in rows and SNPs in columns. White pixels represent zero elements and darker pixels represent non-zero elements of the parameter matrix. The true model parameters are shown in Panel (A), and the estimated parameters are shown for (B) sparse CGGM, (C) MRCE, and (D) GFlasso. MRCE and GFlasso use the standard regression model for eQTL mapping, and thus provide a single summary of SNP effects on gene expressions in . GFlasso focuses only on the task of eQTL mapping and thus, does not provide an estimate of gene network.
As shown in Figure 2B, our method successfully recovers the three gene modules in the true gene network as the block-diagonal structure in along with the sparse direct perturbation of this network by eQTLs in . When we perform inference in the estimated sparse CGGM by computing to learn indirect perturbation of the network by eQTLs, the direct perturbations of eQTLs in propagate primarily within each gene module in , leading to vertical stripes in . Although MRCE learns a gene network from data, unlike sparse CGGM, it does not have any mechanism to leverage this gene network to learn pleiotropic or indirect effects of eQTLs on gene modules and the estimated in Figure 2C shows isolated eQTLs for individual gene-expression traits rather than vertical stripes. As can be seen in Figure 2D, the GFlasso estimate of shows vertical stripes for eQTLs common within each gene module. However, it is immediately clear that GFlasso results have significantly more false positives for eQTLs than sparse CGGM and MRCE. This demonstrates that genetical genomics approach has the potential to improve the accuracy for detecting eQTLs than the conventional approach that focuses solely on eQTL mapping.
Simulation from Sparse CGGMs
We performed a quantitative comparison of the performance of sparse CGGM, MRCE, and GFlasso, by obtaining precision-recall curves and prediction errors averaged over 50 simulated datasets. Since MRCE and GFlasso are based on the standard linear regression model and sparse CGGM is based on a graphical model, we evaluate the different methods on datasets simulated from both models.
Recovery of gene networks.
We first evaluate the two statistical methods, sparse CGGM and MRCE, on the accuracy of gene network structure recovery. The precision-recall curves for the estimated and averaged over 50 simulated datasets are shown in Figure 3. In each panel of Figure 3, the true model parameters were drawn from one of the nine combinations of the three uniform distributions, namely , , and for (rows in Figure 3) and (columns in Figure 3). The results in Figure 3 show that sparse CGGMs can recover the gene network structure with significantly higher accuracy for nearly all of the parameter settings. In addition, the performance of sparse CGGMs is relatively stable across all parameter settings, whereas the performance of MRCE varies widely for different parameter settings.
Each panel shows the results from datasets simulated under different parameter settings for (rows) and (columns). Each precision-recall curve was obtained as an average over results from 50 simulated datasets. Simulated datasets with 30 gene-expression traits and 500 SNPs were used.
Recovery of eQTLs.
In the process of identifying the gene network structure under SNP perturbations, sparse CGGM and MRCE also address eQTL mapping, identifying the SNPs with or without perturbation effects on gene network. Using the same simulated datasets and estimated models from Figure 3, in Figure 4, we evaluate the two methods on how accurately they can recover the true eQTLs, and compare the precision-recall curves with those from GFlasso that focuses only on eQTL mapping. For sparse CGGM, we present two precision-recall curves, one for direct eQTL effects in and another for indirect eQTL effects obtained from , by comparing the estimated values with the true and . For MRCE and GFlasso that do not distinguish between the direct and indirect effects of eQTLs, we obtained a single precision-recall curve by comparing the estimated with the true .
Precision-recall curves for the recovery of eQTLs are shown, using the same simulated datasets and estimated models in Figure 3. Each panel shows results from datasets simulated under different parameter settings for (rows) and (columns). For sparse CGGMs, each panel shows two precision-recall curves, one for eQTLs with direct perturbation effects and another for indirect perturbation effects , whereas for MRCE and GFlasso, the results are shown only for the association strengths .
Overall, in Figure 4, sparse CGGM consistently outperforms MRCE and GFlasso on estimating eQTLs under all parameter settings. In addition, the two statistical methods for genetical genomics approach achieve significantly higher accuracy for detecting eQTLs than GFlasso that focuses on eQTL mapping, showing the advantage of discovering eQTLs by systematically decoding a gene network and its perturbation effects by SNPs. We also observe that for a given parameter setting for in each column of Figure 4, as the range of values for increases from to , the performance of all methods on detecting eQTLs tend to improve.
In order to evaluate the different methods on how accurately the estimated model can predict gene-expression values given SNPs, we computed the prediction errors on the test data using the same set of estimated models in Figures 3 and 4. The results are shown as box plots in Figure 5. We observe from Figure 5 that sparse CGGMs consistently outperform MRCE and GFlasso with the smallest mean and variance of the prediction errors under all of the nine parameter settings.
Simulation from Standard Linear Regression Models
Since MRCE and GFlasso use the standard linear regression model, we also compared the performance of the different methods, using datasets simulated from the model in Eq. (2) with known parameters for and (Figure 6). We present the precision-recall curves for and averaged over 50 simulated datasets in Figures 6A and B, respectively, and show the prediction errors in Figure 6C. In our simulation, we set the true parameter values to random draws from for and for . As can be seen in Figure 6, even if the datasets were simulated from the standard linear regression model as used in MRCE and GFlasso, our method still outperforms MRCE and GFlasso.
(A) Precision-recall curves for the recovery of gene network structure in (or ). (B) Precision-recall curves for the recovery of eQTLs in . (C) Prediction errors. The results were obtained as an average over 50 simulated datasets. Simulated datasets with 30 gene-expression traits and 500 SNPs were used.
Simulation Using Large-Scale Datasets
The simulation results so far demonstrated that our method has greater power for identifying gene networks and eQTLs than other methods. However, in these experiments, we were constrained to use relatively small datasets of only 30 gene-expression traits with 500 SNPs, because MRCE could not handle larger datasets effectively in a systematic simulation study. In this section, we demonstrate the effectiveness and scalability of our method, using substantially larger simulated datasets of 500 gene-expression traits and 1,000 SNPs.
Given a region of 1,000 SNPs from chromosome 21 of the African individuals in HapMap phase III SNP data as described above, we simulated the values for 500 gene-expression traits, assuming sparse CGGMs with the true parameters determined as follows. We assumed that the true gene networks are scale-free networks, and set the network using the following strategy. First, we determined the number of neighbors of each node by making a random draw from a power-law distribution . Then, we applied the algorithm for generating a scale-free network  that repeatedly connects two nodes until we achieve the desired node degrees initially determined according to the power-law distribution. Given this network structure, we set the edge weights to random draws from a uniform distribution , and set to the graph Laplacian of the edge-weight matrix with small positive values added to the diagonal elements. We set the true eQTLs in by choosing each SNP as an eQTL for each gene-expression trait with probability and selecting one additional SNP as an eQTL for hub nodes with more than 20 neighbors in the network. For each eQTL in , we set the eQTL effect sizes to random draws from a uniform distribution with the signs of the values determined randomly.
In Figure 7, we compare the performance of the different methods averaged over 30 datasets simulated according to the above strategy. The results are shown for the precision-recall curves for the accuracy of detecting gene-network structures (Figure 7A) and eQTLs (Figure 7B) as well as prediction errors (Figure 7C). As MRCE could not run on a single dataset of the given size within a few days, instead of using MRCE in our experiment, we compared our method with GFlasso and also with a two-stage method of applying graphical lasso and lasso to learn gene networks and eQTLs separately. We observe from Figure 7 that sparse CGGMs outperform all the other methods on these large-scale datasets.
(A) Precision-recall curves for the recovery of gene network structure in (or ). (B) Precision-recall curves for the recovery of eQTLs in . (C) Prediction errors. The results were obtained as an average over 30 simulated datasets. Simulated datasets with 500 gene-expression traits and 1,000 SNPs were used.
Computation Time and Scalability
In order to examine the scalability of sparse CGGM, MRCE, and GFlasso, we compared the computation time for a single run of the different methods on varying sizes of datasets in Figure 8. Figure 8A shows the computation time for varying the number of gene-expression traits with the number of SNPs fixed at , whereas Figure 8B shows the results from varying the number of SNPs with the number of gene-expression traits fixed at . Even though we used the approximate method for MRCE to reduce the computational cost of the exact method, our sparse CGGM optimization is more efficient by orders of magnitude than MRCE for problems of all sizes. Although GFlasso is more efficient than both sparse CGGM and MRCE, it is significantly more limited in that it focuses only on the problem of eQTL mapping.
The computation time for a single run of sparse CGGM, MRCE, and GFlasso is shown for (A) varying the number of gene-expression traits with the number of SNPs fixed at and (B) varying the number of SNPs with the number of gene-expression traits fixed at . The results for MRCE were obtained using the approximate algorithm.
Analysis of Yeast eQTL Dataset
We applied our method to an eQTL dataset collected for two yeast parent strains, BY4716 (BY) and RM11-1a (RM), and their 112 segregants . We obtained SNP genotypes for 1,260 loci after removing the redundant SNPs with the same genotypes in neighboring regions of the genome and obtained expression measurements for 3,684 genes after removing the genes whose expression measurements were missing for more than 5% of the 114 samples. In order to select the optimal regularization parameters and , we performed a cross-validation with three random splits of data into 100 samples for estimating model parameters and 14 samples for computing cross-validation errors. Then, a final estimate of parameters was obtained by training a model on the entire dataset using the optimal regularization parameters.
Below, we examine the yeast gene network and eQTLs with direct and indirect perturbations estimated by our method. In addition, we provide an in-depth analysis of a subnetwork with a strong evidence of being involved in DNA replication stress response based on the literature. We also provide a quantitative comparison of sparse CGGMs and other methods in terms of prediction errors.
Scale-Free Gene Network for Yeast
Since many previous works showed that gene networks tend to have a scale-free topology with few hub genes having many neighbors, we examined the gene network parameters in the estimated sparse CGGM for a scale-free property . Given the network edge weights in , we defined the degree of each node as the sum of the absolute values of all incoming edge weights for the node. Then, the best ordinary least square fit of linear model for the empirical cumulative degree distribution was , showing that the estimated network has a strongly scale-free topology. Overall, in our estimate of network, 14 genes were connected to more than 100 other genes, 156 genes had more than 10 neighbors, and 1,593 genes had at least one neighbor.
We hypothesized that each hub gene and its immediate neighbors form a hub-gene module and are involved in a common biological process. In order to test this hypothesis, we performed a gene ontology (GO) enrichment analysis for the 25 largest hub-gene modules, using Fisher's exact tests (Table 1). We found that each hub-gene module was significantly enriched with genes in a common GO category, showing that the genes in each hub-gene module are likely to participate in a common pathway.
SNP Perturbations of Yeast Gene Network
Next, we examined the eQTLs identified by our method as perturbing the above gene network. The eQTLs with direct perturbations of the gene network as captured in tended to concentrate on a small number of genetic loci, forming eQTL hotspots. Although the sparsity pattern of showed that 1,248 out of 1,260 SNP loci regulate directly at least one gene-expression trait, the top 10 SNPs that affect the largest number of gene-expression traits accounted for 15.8% of all SNP/gene-expression-trait pairs with direct influence of SNPs on gene-expression traits, and the top 20 SNPs accounted for 28.5%. We defined these top 20 SNPs as eQTL hotspots, and the genes directly regulated by each eQTL hotspot as a hotspot-regulated gene module (Table 2). In order to avoid redundancy, in the case of multiple eQTL hotspots within a 20 kb region with largely overlapping hotspot-regulated gene modules, we examined only one of those hotspots with the largest hotspot-regulated gene module. Out of the 13 eQTL hotspots that have been previously reported in analysis of the same dataset in , 9 hotspots overlapped with the results from our method.
In order to investigate whether the genes in each hotspot-regulated gene module are involved in a common biological function, we performed GO enrichment analysis (Table 2). We performed Fisher's exact tests, using GO slim categories downloaded from http://www.geneontology.org/GO.slims.shtml, after removing the GO categories with more than 500 genes. Our results show that all of the hotspot-regulated gene modules are significantly enriched for some GO categories, providing evidence that each eQTL hotspot regulates a functionally coherent set of genes.
Since indirect SNP perturbations result from direct SNP perturbations propagating through the network, eQTLs with direct perturbations are likely to have stronger effect sizes than indirect perturbations. In Figure 9A, we compared the overall distribution of effect sizes of direct and indirect SNP perturbations as captured in and of the sparse CGGM by plotting histograms of the absolute values of non-zero elements in and . For indirect SNP perturbations, only those SNP/gene-expression-trait pairs that were estimated to be zero in but non-zero valued in were included. As can be seen in Figure 9A, the direct perturbations are generally stronger than indirect perturbations, confirming our hypothesis.
Histograms of the effect sizes of direct and indirect SNP perturbations in yeast are shown for (A) all eQTLs and (B) cis eQTLs identified by sparse CGGM.
Then, we examined whether the direct SNP perturbations estimated by our method are more likely to be cis eQTLs than the indirect perturbations. We declared the direct and indirect SNP perturbations in and as cis eQTLs, if for a given pair of SNP/gene-expression-trait, the gene sequence overlaps with the linkage region represented by the given SNP. The histogram in Figure 9B shows the distribution of the effect sizes of the estimated direct and indirect SNP perturbations for cis eQTLs. As can be seen in Figure 9B, direct SNP perturbations are significantly more frequent in cis eQTLs than indirect SNP perturbations, and explain nearly all of the cis eQTLs with strong effect sizes.
When we examined the cis eQTLs with direct perturbations in our estimated model, we found that our approach was able to identify some of the well-known direct genetic perturbations in the literature. The genotypes for LEU2, URA3, HO, and LYS2 are known to differ in the parent strains, BY and RM, where these genetic differences have a large impact on the expressions of the corresponding genes as well as other genes . While LYS2 was not included in our analysis, the LEU2, URA3, and HO expressions were found to have cis eQTLs with direct perturbations in our analysis, and at the same time, in our estimate of gene network, LEU2 and URA3 appeared as hub genes with more than 50 neighbors. In particular, the cis eQTLs with direct perturbations of LEU2 and URA3 were found to have the strongest effects among all cis eQTLs shown in Figure 9B, whereas HO had a cis eQTL with moderately strong direct perturbation. These results provide evidence that our method can recover the true direct SNP perturbations of a gene network, by decoupling direct SNP perturbations of gene expressions from their secondary/indirect effects on other gene expressions.
DNA Replication Stress Response Subnetwork and Its Perturbation by eQTLs in Yeast
We performed an in-depth analysis of the subnetwork around the TFS1 gene and its perturbation by eQTLs. This subnetwork is shown in Figure 10A, where the edge thicknesses correspond to the absolute values of edge weights in , representing the strength of dependency between two gene-expression traits. To avoid clutter, we only show the edges with the absolute values of edge weights .
(A) The yeast subnetwork for DNA replication stress response and its direct/indirect perturbation by a SNP in the region of 1,095 kb on chromosome 4 estimated by sparse CGGM learning algorithm. This SNP directly perturbs TFS1, HSP26, RTN2, and GAD1, and the propagation of this direct perturbation to other parts of the network is obtained by performing inference on the estimated sparse CGGM. Edge thicknesses correspond to absolute values of edge weights in . The diamond-shaped nodes represent gene-expression traits that are directly perturbed by the SNP, whereas the round and colored nodes represent those genes whose expressions are indirectly perturbed by the SNP. The color shade and size of nodes indicate the strength of the SNP perturbation of gene-expression trait. Our statistical framework allows the overall indirect SNP perturbation effects in Panel (A) to be decomposed into the components that arose from the propagation of the direct perturbation effects of each of (B) TFS1, (C) HSP26, (D) RTN2, and (E) GAD1 by the given SNP.
This subnetwork in Figure 10A contains many genes involved in DNA replication stress response and other types of stimulus response. In particular, TFS1 has been identified as a high-copy suppressor of guanine nucleotide-exchange factor CDC25, which activates the Ras/cyclic AMP pathway regulating growth and metabolism in response to nutrients . Many of the genes in this subnetwork, including TFS1, its immediate neighbors (PGM2, SOL4, RTN2, GDB1, RME1, PRB1, SDS24, IGD1), and 22 other genes, have been previously observed with changed abundance or localization under DNA damage . In addition, the DDR2 gene in the subnetwork that codes for the DNA damage responsive protein has been found to have multi-stress response function . Also, several hub genes in the subnetwork have been annotated as responding to other stress conditions such as heat shock (HSP26) and oxidative stress (CTT1 and GAD1) –.
Direct and indirect SNP perturbations of the subnetwork.
We examined the direct and indirect perturbations of this subnetwork by an eQTL located in 1,095 kb region of chromosome 4. In Figure 10A, the genes whose expressions are directly perturbed by this eQTL are shown as diamond-shaped nodes. The round and colored nodes correspond to genes whose expressions are indirectly perturbed by this eQTL. The node size and color shade correspond to the effect sizes of the eQTL, with larger and darker nodes for stronger SNP perturbations.
In Figure 10A, we see that only a small number of gene-expression traits (TFS1, HSP26, GAD1, RTN2, and PRC1) are directly perturbed by the given eQTL, whereas this direct eQTL effects are passed on to other parts of the network to affect the expressions of many other genes indirectly. These indirect eQTL effects tend to be stronger on the immediate neighbors of the directly perturbed genes and gradually decrease for those genes farther away from the directly perturbed genes.
Decomposition of SNP perturbation effects.
While Figure 10A shows the overall perturbation effects of the given eQTL on the subnetwork, in order to obtain a more detailed description of the perturbation effects, we used Eq. (8) to perform a decomposition of the overall effects of this eQTL with respect to each of the directly-influenced genes from which the indirect eQTL effects originated. The results of the decomposition are shown for each of the four directly-influenced genes (TFS1, HSP26, RTN2, and GAD1) in Figures 10B–E, respectively. The genes whose expressions are directly perturbed by the eQTL and who pass such direct eQTL effects to other parts of the network are marked with red lines around the corresponding nodes in Figures 10B–E. Although PRC1 is also directly perturbed by this eQTL, we do not show the result for PRC1, because the strength of the direct SNP perturbation on PRC1 is significantly smaller than those for the other four directly-influenced genes. PRC1 is connected to TFS1 and its activity is known to be inhibited by TFS1-encoded protein with high affinity .
How strongly the direct eQTL perturbation effects propagate to other parts of the network depends on both the strengths of the direct SNP perturbations and the strengths of network edge connections . For example, in Figure 10A, the given eQTL has the largest direct influence on RTN2 among all of the directly-influenced genes, and as can be seen in Figure 10D, RTN2 can pass on this direct SNP perturbation to a larger part of the network than TFS1, HSP26, and GAD1. On the other hand, as shown in Figures 10A and D, among RTN2's neighbors, HBT1 and MSC1 receive the strongest indirect SNP perturbations because these two genes have stronger edge connections to RTN2.
Decomposition of gene-expression covariance.
Sparse CGGM also allows for a decomposition of the observed covariance of gene-expression data into two components, the covariance component induced by the gene network and the covariance component induced by the SNP perturbations, as described by Eq. (10). Based on the same subnetwork in Figure 10 and all eQTLs perturbing this subnetwork, we show the results of this decomposition in Figure 11. Figure 11A represents the observed covariance in the gene-expression data (the left-hand side of Eq. (10)), and Figures 11B and C show the covariance component induced by the gene network (the first term on the right-hand side of Eq. (10)), and the covariance component induced by eQTLs (the second term on the right-hand side of Eq. (10)), respectively. Edge thicknesses indicate the absolute values of covariances. We note that the edges in Figure 11 represent marginal dependencies in covariances, whereas the edges in Figure 10 represent conditional dependencies between two gene-expression traits conditional on all the other gene-expression traits.
(A) The covariance of yeast gene expression data for the genes shown in Figure 10. Sparse CGGM allows the observed covariance in Panel (A) to be decomposed approximately into (B) the covariance induced by the gene network and (C) the covariance induced by SNP perturbations and its propagation through the network. Edge width corresponds to covariance or the strength of gene-gene interaction. We note that the edges show marginal dependencies in covariances rather than conditional dependencies in inverse covariances.
Experimental evidence for the estimated subnetwork.
Although the DNA replication stress response subnetwork in Figures 10 and 11 was obtained computationally using our proposed statistical method, we found that this computational result is supported well by the results from experimental perturbation study in the literature and that many genes in the subnetwork have a known role in DNA replication stress response . In , 45 up-regulated transcripts were identified in a telomerase-deficient mutant, after an exposure to telomere shortening and cellular senescence. Out of these 45 genes, 32 genes appeared in our gene-expression dataset of 3,684 genes, and 30 out of the 32 genes were one or two edges away from TFS1 in our subnetwork in Figure 10A: 13 genes were immediate neighbors of TFS1 and another 17 genes were neighbors of either CTT1 (cytosolic atlas T) or DDR2 (DNA damage responsive). Both CTT1 and DDR2 were neighbors of TFS1 in our subnetwork and are known multi-stress response proteins targeted by the transcription activator MSN2 that responds to stress conditions . In addition, our subnetwork captures many of the known interactions among several key elements in DNA replication stress response, such as PHM7, SOL4, RTN2, and PGM2 , .
Finally, our results indicate that different types of stress response pathways are closely related to each other, as genes involved in stress responses under conditions other than DNA replication are often found as neighbors of DNA replication stress response genes in our subnetwork. For example, HSP26, a heat shock protein with chaperon activity, is another target of MSN2, and is known to be involved in response to heat shock and other stimuli . This HSP26 gene is connected in our subnetwork to XBP1, a transcriptional repressor whose relative distribution in the nucleus increases under DNA replication stress . Likewise, GAD1, a glutamate decarboxylase, is required in oxidative stress tolerance  and is a neighbor of TFS1 and RTN2 in our subnetwork.
Overall, we conclude that the gene network obtained computationally by sparse CGGM learning algorithm using SNPs as naturally-occurring perturbations of the gene regulatory system generally matches the network obtained from experimental perturbation study.
Prediction accuracy for yeast gene expressions given SNPs.
As a quantitative evaluation of our method on yeast eQTL dataset, we computed prediction errors on an independent test set for sparse CGGM and compared the result with what we obtained for GFlasso (Table 3). While it took sparse CGGM and GFlasso less than a day for a single run of model training on all SNPs and gene-expression traits, for MRCE a single run of the approximate learning method was not completed after several weeks of computation and we were unable to obtain prediction errors for MRCE. We estimated a model on 100 samples and computed prediction errors on the test set of the remaining 14 samples. During the estimation stage, we selected the optimal regularization parameters by cross-validation with three random splits of the 100 samples into 90 samples for fitting a model and 10 samples for computing cross-validation errors. Since GFlasso requires a known gene network, we used the correlation coefficient matrix of gene-expression data thresholded at as an input gene network. In visual inspection, we found that this thresholded correlation matrix for gene network captured gene module structures reasonably well, when compared to the results from hierarchical agglomerative clustering applied to the gene-expression data. The results in Table 3 show that sparse CGGM outperforms GFlasso on predicting gene-expression levels from SNP genotypes.
In this paper, we presented a new statistical framework for genetical genomics analysis to learn a gene network by treating SNPs as naturally occurring perturbants of a gene network. Within this framework, we introduced a statistical model, called a sparse CGGM, for modeling a gene network under SNP perturbations and discussed an efficient learning algorithm and inference methods. While genetical genomics approach has been recognized as a more effective and less costly method for learning a gene network than experimental methods, this approach has not been widely used mainly because of the computational challenge that the effects of perturbations by often millions genetic variants at a time need to be decoded from data. Our approach directly addresses this challenge and identifies a gene network by decoupling the effects of multifactorial perturbations in eQTL data. At the same time, our approach addresses many of the weaknesses of the experimental methods and is able to identify which genes are directly perturbed by each SNP or are indirectly perturbed as downstream effects in the pathway. As eQTL data collection is being routinely performed for model organisms, and is more amenable for human tissues than experimental perturbations, our approach opens up doors to the possibility of leveraging these datasets for gene network learning rather than focusing on finding eQTLs from such data. Although the primary goal of our work and more generally genetical genomics analysis is to identify a gene network, our statistical approach has additional advantages of enhancing the current statistical tools of eQTL mapping and extracting significantly more detailed information on the functional role of eQTLs in the context of gene network.
Our approach provides a flexible statistical framework for learning a gene network along with eQTLs that can be easily extended in several different ways. For example, although in this paper, our gene network was defined over the expression levels of mRNAs, it is straightforward to include microRNA expression data to construct a network over both mRNA and microRNA expressions, both of which can be perturbed by genetic variants. Another possible extension is to model epistatic interactions among SNPs within sparse CGGM by introducing additional features for SNP interactions in the probabilistic graphical model.
However, there are certain limitations to our approach. Although our approach can handle thousands of gene-expression traits and SNPs efficiently, it is still not efficient enough to be directly applied to genome-wide analysis of eQTL datasets of higher-level organisms with tens of thousands of gene-expression traits and millions of SNPs. For such large-scale datasets, we suggest to split the full dataset into smaller sets of gene-expression traits by applying a clustering algorithm to obtain coarse-grained gene modules. Then, our approach can be applied to each subset of gene-expression traits for coarse-grained gene modules to extract fine-grained information on gene-network connectivities. In order to perform a full joint analysis of all data, in future work, we will consider improving the computational bottleneck of matrix inversion for the gene network parameters in the learning algorithm by replacing it with an approximate but computationally less expensive inversion.
Another future direction is to relax the assumption in our model that the gene-expression traits under SNP perturbations follow a Gaussian distribution. Although Gaussian graphical models have been widely used to infer a gene network from gene-expression data due to many of the properties of Gaussian distributions that lead to easy computations, this assumption can be potentially restrictive for modeling realistic biological processes.
The software for sparse CGGMs is available at http://www.cs.cmu.edu/~sssykim/softwares/softwares.html#scggm.
Proof of convexity of sparse CGGM optimization problem.
Learning algorithm for sparse CGGMs.
We thank Kyung-Ah Sohn for the help with preliminary results. We thank Joel McManus for helpful discussion on yeast data analysis.
Conceived and designed the experiments: SK. Performed the experiments: LZ. Analyzed the data: SK LZ. Contributed reagents/materials/analysis tools: SK LZ. Wrote the paper: SK.
- 1. Tong A, Evangelista M, Parsons A, Xu H, Bader G, et al. (2001) Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294: 2364–2368. doi: 10.1126/science.1065810
- 2. Hu Z, Killion P, Iyer V (2007) Genetic reconstruction of a functional transcriptional regulatory network. Nature Genetics 39: 683–687. doi: 10.1038/ng2012
- 3. Chua G, Morris Q, Sopko R, Robinson M, Ryan O, et al. (2006) Identifying transcription factor functions and targets by phenotypic activation. PNAS 103: 12045–50. doi: 10.1073/pnas.0605140103
- 4. Jansen RC, Nap JP (2001) Genetical genomics: the added value from segregation. Trends in Genetics 17: 388–391. doi: 10.1016/s0168-9525(01)02310-1
- 5. Jansen R (2003) Studying complex biological systems using multifactorial perturbation. Nature Reviews Genetics 4: 145–151. doi: 10.1038/nrg996
- 6. Morley M, Molony C, Weber T, Devlin J, Ewens K, et al. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430: 743–747. doi: 10.1038/nature02797
- 7. Stranger B, Forrest M, Clark A, Minichiello M, Deutsch S, et al. (2005) Genome-wide associations of gene expression variation in humans. PLoS Genetics 1: e78. doi: 10.1371/journal.pgen.0010078
- 8. Kim S, Xing E (2009) Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics 5: e1000587. doi: 10.1371/journal.pgen.1000587
- 9. Logsdon B, Mezey J (2010) Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations. PLoS Computational Biology 6: e1001014. doi: 10.1371/journal.pcbi.1001014
- 10. Chaibub N, Keller M, Attie A, Yandell B (2010) Causal graphical models in systems genetics: a unified framework for joint inference of causal network and genetic architecture for correlated phenotypes. Annals of Applied Statistics 4: 320–339. doi: 10.1214/09-aoas288
- 11. Aten J, Fuller T, Lusis A, Horvath S (2008) Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Systems Biology 2: 320–339. doi: 10.1186/1752-0509-2-34
- 12. Ye C, Galbraith S, Liao J, Eskin E (2008) Using network component analysis to dissect regulatory networks mediated by transcription factors in yeast. PLoS Computational Biology 5: e1000311. doi: 10.1371/journal.pcbi.1000311
- 13. Lee SI, Dudley A, Drubin D, Silver P, Krogan N, et al. (2009) Learning a prior on regulatory potential from eQTL data. PLoS Genetics 5: e1000358. doi: 10.1371/journal.pgen.1000358
- 14. Meinshausen N, Buhlmann P (2006) High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34: 1436–1462. doi: 10.1214/009053606000000281
- 15. Schafer J, Strimmer K (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications of Genetics and Molecular Biology 4: 32. doi: 10.2202/1544-6115.1175
- 16. Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9: 432–41. doi: 10.1093/biostatistics/kxm045
- 17. Banerjee O, Ghaoui LE, dAspremont A (2008) Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research 9: 485–516.
- 18. Shi W, Lee K,Wahba G (2007) Detecting disease causing genes by LASSO-Patternsearch algorithm. Technical Report 1140, Department of Statistics, University of Wisconsin.
- 19. Wu TT, Chen YF, Hastie T, Sobel E, Lange K (2009) Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25: 714–721. doi: 10.1093/bioinformatics/btp041
- 20. The International HapMap 3 Consortium (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58.
- 21. Brem R, Kruglyak L (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. PNAS 102: 1572–1577. doi: 10.1073/pnas.0408709102
- 22. Rothman A, Levina E, Zhu J (2010) Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics 19: 947–962. doi: 10.1198/jcgs.2010.09188
- 23. Yin J, Li H (2011) A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Annals of Applied Statistics 5: 2630–50. doi: 10.1214/11-aoas494
- 24. Banerjee O, El Ghaoui L, d'Aspremont A (2008) Model selection through sparse maximum likeli- hood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research 9: 485–516.
- 25. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, Series B 58: 267–288. doi: 10.1111/j.1467-9868.2011.00771.x
- 26. Friedman J, Hastie T, Tibshirani R (2007) Pathwise coordinate optimization. The Annals of Applied Statistics 1: 302–332. doi: 10.1214/07-aoas131
- 27. Li B, Chun H, Zhao H (2012) Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association 107: 152–167. doi: 10.1080/01621459.2011.644498
- 28. Mardia K, Kent J, Bibby J (1979) Multivariate Analysis. Academic Press.
- 29. Koller D, Friedman N (2009) Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- 30. Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press.
- 31. Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103: 127–152. doi: 10.1007/s10107-004-0552-5
- 32. Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Sciences 2: 183–202. doi: 10.1137/080716542
- 33. Nesterov Y (2007) Gradient methods for minimizing composite objective function. CORE Dis- cussion Papers 2007076, Universit catholique de Louvain, Center for Operations Research and Econometrics (CORE).
- 34. Chen X, Lin Q, Kim S, Carbonell J, Xing E (2012) Smoothing proximal gradient method for general structured sparse regression. Annals of Applied Statistics 6: 719–752. doi: 10.1214/11-aoas514
- 35. Luxburg U (2007) A tutorial on spectral clustering. Statistics and Computing 17: 395–416. doi: 10.1007/s11222-007-9033-z
- 36. Chakrabarti D, Faloutsos C (2006) Graph mining: Laws, generators, and algorithms. ACM Computing Surveys 38: 2. doi: 10.1145/1132952.1132954
- 37. Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4: Article 17. doi: 10.2202/1544-6115.1128
- 38. Zhu J, Zhang B, Smith E, Drees B, Brem R, et al. (2008) Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics 40: 854–61. doi: 10.1038/ng.167
- 39. Robinson L, Tatchell K (1991) TFS1: A suppressor of cdc25 mutations in Saccharomyces cerevisiae. Molecular and General Genetics 230: 241–250. doi: 10.1007/bf00290674
- 40. Tkach JM, Yimit A, Lee AY, Riffle M, Costanzo M, et al. (2012) Dissecting DNA damage response pathways by analysing protein localization and abundance changes during DNA replication stress. Nature Cell Biology 14: 966–976. doi: 10.1038/ncb2549
- 41. Kobayashi N, Mcclanahan TK, Simon JR, Treger JM, Mcentee K (1996) Structure and functional analysis of the multistress response gene DDR2 from Saccharomyces cerevisiae. Biochemical and Biophysical Research Communications 229: 540–547. doi: 10.1006/bbrc.1996.1840
- 42. Schmitt AP, McEntee K (1996) Msn2p, a zinc finger DNA-binding protein, is the transcriptional activator of the multistress response in Saccharomyces cerevisiae. PNAS 93: 5777–82. doi: 10.1073/pnas.93.12.5777
- 43. Grant CM, Perrone G, Dawes IW (1998) Glutathione and catalase provide overlapping defenses for protection against hydrogen peroxide in the yeast Saccharomyces cerevisiae. Biochemical and Biophysical Research Communications 253: 893–898. doi: 10.1006/bbrc.1998.9864
- 44. Coleman S, Fang T, Rovinsky S, Turano F, Moye-Rowley W (2000) Expression of a glutamate decarboxylase homologue is required for normal oxidative stress tolerance in Saccharomyces cerevisiae. Journal of Biological Chemistry 276: 244–250. doi: 10.1074/jbc.m007103200
- 45. Bruun AW, Svendsen I, Sorensen SO, Kielland-Brandt MC, Winther JR (1998) A high-affinity inhibitor of yeast carboxypeptidase Y is encoded by TFS1 and shows homology to a family of lipid binding proteins. Biochemistry 37: 3351–3357. doi: 10.1021/bi971286w
- 46. Teng SC, Epstein C, Tsai YL, Cheng HW, Chen HL, et al. (2002) Induction of global stress response in Saccharomyces cerevisiae cells lacking telomerase. Biochemical and Biophysical Research Communications 291: 714–721. doi: 10.1006/bbrc.2002.6509
- 47. Zhu J, Sova P, Xu Q, Dombek KM, Xu EY, et al. (2012) Stitching together multiple data dimensions reveals interacting metabolomic and transcriptomic networks that modulate cell regulation. PLoS Biology 10: e1001301. doi: 10.1371/journal.pbio.1001301
- 48. Amoros M, Estruch F (2001) Hsf1p and Msn2/4p cooperate in the expression of Saccharomyces cerevisiae genes HSP26 and HSP104 in a gene- and stress type-dependent manner. Molecular Microbiology 39: 1523–1532. doi: 10.1046/j.1365-2958.2001.02339.x