Learning Transcriptional Regulatory Relationships Using Sparse Graphical Models

Understanding the organization and function of transcriptional regulatory networks by analyzing high-throughput gene expression profiles is a key problem in computational biology. The challenges in this work are 1) the lack of complete knowledge of the regulatory relationship between the regulators and the associated genes, 2) the potential for spurious associations due to confounding factors, and 3) the number of parameters to learn is usually larger than the number of available microarray experiments. We present a sparse (L1 regularized) graphical model to address these challenges. Our model incorporates known transcription factors and introduces hidden variables to represent possible unknown transcription and confounding factors. The expression level of a gene is modeled as a linear combination of the expression levels of known transcription factors and hidden factors. Using gene expression data covering 39,296 oligonucleotide probes from 1109 human liver samples, we demonstrate that our model better predicts out-of-sample data than a model with no hidden variables. We also show that some of the gene sets associated with hidden variables are strongly correlated with Gene Ontology categories. The software including source code is available at http://grnl1.codeplex.com.


Introduction
Transcriptional regulatory networks govern the expression levels of thousands of genes as part of a diverse biological processes. Regulatory proteins called transcription factors (TF) are the main players in the regulatory network. TFs bind to promoter regions at the start of other genes and thereby initiate or inhibit gene expression. Determining accurate models for transcriptional regulatory interactions is an important challenge in computational biology. With the development of high-throughput DNA microarray technologies, it is possible to simultaneously monitor the expression levels of essentially all genes. Extensive research has been done to build quantitative regulatory models by associating gene expression levels (see [1][2][3] for reviews).
One challenge in this work is that not all TFs have been identified and the regulatory relationship between TFs and their associated genes may not be available (except for some well studied model organisms such as yeast). Another challenge is the potential for spurious associations between regulators and affected genes due to confounding factors such as expression heterogeneity [4][5][6][7][8]. Moreover, in large scale genome-wide expression datasets, the number of genes (or probs) is usually much larger than the number of samples. This is the so called ''large p small n'' problem [9,10]. Feature selection is required when analyzing such datasets.
Assuming that the (partial) knowledge of the network topology between TFs and genes is available, network component analysis [11,15] aims to reconstruct signals from the regulators and their strengths of influence on each genes. However, such knowledge may not be always available, e.g., for human. Similarly, in [13,14], methods have been proposed to infer the TF activities (concentration levels) assuming the TF-gene relationship is known. The work in [12] does not assume a known regulatory network, and tries to reconstruct one from sequence and array data. The proposed methods was applied to yeast datasets. The goal is different from ours, which is to reconstruct the regulatory network from the microarray data without the sequence information. Clustering approaches have also been developed to analyze gene expression data [17][18][19][20]. These methods partition samples into groups according to the expression patterns of genes in different groups. The TF information is not used in these algorithms.
In this paper, we propose a linear-Gaussian graphical model to address the challenges in learning regulatory relationships. Our model consists of two layers of nodes as shown in Figure 1. The upper layer nodes include the set of known/putative TFs and a set of hidden variables. The hidden variables are used to model possible unknown TFs and confounding factors. The lower layer represents the remaining genes that are not included in the upper layer. The nodes are connected via arcs from the upper layer nodes to the lower layer nodes. The expression levels of a node (gene) in the lower layer are modeled as a linear function of the expression levels of the upper layer nodes-that is, known TFs and hidden factors. Note that graphical model has also recently been applied to find expression quantitative trait loci [21].
To learn the parameters of the model from data, which is usually of high dimension and low sample size, we use L1 regularization as is done in [22] (see also [23][24][25]). This approach yields a sparse network, where a large number of association weights are zero [26]. In gene regulatory networks, the number of TFs is much smaller than the number of transcribed genes, and most genes are regulated by a small number of TFs. The matrix that describes the connections between the transcription factors and the regulated genes is expected to be sparse. Thus L1 regularization is a natural choice for this setting. We apply our model to large scale human gene expression data and show that our model has better prediction accuracy than do other alternatives. We examine each gene set defined by those in the lower layer connected to a single hidden variable in the upper layer. We find that some of these gene sets are strongly correlated with GO categories, suggesting that the hidden variables at least in part represent unknown TFs. The software including source code is publically available at http://grnl1.codeplex.com.

Linear Regression and Probabilistic PCA
Our model can be thought of as a combination of linear regression and probabilistic principal component analysis (PPCA) [27] with L1 regularization. In this subsection, we briefly review these two approaches.
Throughout the paper, we assume that all vectors are column vectors. Let r~(r 1 ,r 2 , Á Á Á ,r K ) T represent the K known/putative TFs (e.g., [28]), and x~(x 1 ,x 2 , Á Á Á ,x D ) T represent the D genes in the dataset. Note that the two sets r and x are disjoint-that is, x include the genes that are not TFs. This restriction is added so that the resulting graph is acyclic and therefore amenable to straightforward estimation techniques. The linear regression model assumes that the expression level of a gene x d can be represented by a linear function of the expression levels of the TFs.
where b dk (1ƒkƒK) is the coefficient that quantifies the strength of the TF r k to initiate (positive) or suppress (negative) the regulation of gene x d , m d is a translation factor, and is the additive noise of Gaussian distribution with zero-mean and standard deviation d-that is, e*N (0,d 2 ): The idea of PPCA is similar to that of linear regression. The difference is that the expression level of a gene x d is modeled as a linear function of the expression levels of a set of hidden (unobserved) variables z~(z 1 ,z 2 Á Á Á ,z M ) T : where z has a zero-mean, unit-variance Gaussian distribution.

Our Model
To incorporate both known/putative TFs and unknown factors, our model combines linear regression and PPCA. We model the expression level of a gene to be a linear function of the expression levels of both known/putative TFs and hidden factors. x d~X K k~1 b dk r k z X M m~1 w dm z m zm d ze: A graphical representation of the model is shown in Figure 1. It has two layers. The upper layer consists of random (vector) variable r representing TFs, and z representing hidden factors. The factors are assumed to mutually independent (although, because the known/putative factors are observed, any dependencies among them do not affect the predictive ability of the model).
The lower layer contains the random variable x representing the genes regulated by the upper layer nodes. These regulated genes are assumed to be mutually independent given the regulators in the upper layer. Next, we use multivariate notation to formalize and derive the likelihood function of our model. Let m~(m 1 ,m 2 where I is the identity matrix. Let the prior distribution over latent variable z be given by a zero-mean unit-covariance Gaussian The conditional distribution of the observed variable x, conditioned on the value of the latent variable z, is also Gaussian, of the form p(xDz)~N (BrzWzzm,d 2 I): Integrating out latent variable z, the marginal distribution is again Gaussian where the D|D covariance matrix C is defined by where the M|M matrix M is defined by Let R = {r n } and X = {x n } be the sets of N observed data points. The loss function (negative log likelihood function) is The parameter space in our model is SB,W,m,dT. Since only a small fraction of the candidate TFs are expected to be true regulators for any given gene, most of the weights in B and W should be set to zero to indicate non-regulation. L1 regularization is a well known approach for effective feature selection. In this approach, we add a penalty to the objective function that automatically pushes the elements in the parameter space to be zero. It has shown experimentally and theoretically to be capable of learning good models when most features are irrelevant [26]. The new objective function with L1 regularization is of the form where l is a tuning parameter that can be determined using cross validation, which will be discussed later.

Optimization
To optimize the likelihood function with L1 norm, we use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm described in [29]. The OWL-QN algorithm minimizes functions of the form where loss is an arbitrary differentiable loss function, and DwD 1 is the L1 norm of the weight (parameter) vector. It is based on the L-BFGS Quasi-Newton algorithm [30], with modifications to deal with the fact that the L1 norm is not differentiable. The algorithm is proven to converge to a local optimum of the parameter vector. The algorithm is very fast, and capable of scaling efficiently to problems with millions of parameters. Thus it is a good option for our problem where the parameter space is large when dealing with large scale genome-wide gene expression data.
Besides the loss function, and the penalized parameters, the OWL-QN algorithm also needs the gradient of the loss function, which (without detailed derivation) is The number of hidden variables M and the L1 penalty l are determined by two-fold cross validation within a wrapper used to evaluate out-of-sample prediction (see Evaluation). Only two folds are used at this stage to lessen the computational burden.
We note that sparse PCA is not convex [31]. Nonetheless, when we applied the optimization program with 10 random parameter initializations for give different models, the program converged to the same solution for each condition.

Data Set
The gene expression data is taken from 1109 human liver samples. Each RNA sample was profiled on a custom Agilent 44,000 feature microarray composed of 39,296 oligonucleotide probes targeting transcripts representing 34,266 known and predicted genes, including high-confidence, non-coding RNA sequences. The gene expression data was originally collected to characterize the genetic architecture of gene expression in human liver [32]. The expression data was processed using the median imputation method as in [32]. All microarray data associated with the human liver cohort were previously deposited into the Gene Expression Ominbus (GEO) database [33] under accession number GSE24335. The set of known and putative TFs is taken from [28], which is publicly available from http://hg.wustl.edu/ lovett/TF_june04table.html. The total number of such TFs is 1660.

Evaluation
We evaluated three models: (1) one with hidden variables, (2) one with no hidden variables, and (3) a reference model that assumes the non-TF genes are mutually independent (i.e., a model with no top layer in the corresponding graph). We evaluated the models by measuring out-of-sample log likelihoods via ten-fold cross validation. More specifically, we partition the samples into 10 subsets of equal size. In each fold, we use samples in 9 subsets as training data and test the learned model in the remaining 1 subset of samples. By measuring out-of-sample versus in-sample predictions, we avoid rewarding models that over fit the data. Within each cross, optimal values for l were determined with two-fold cross validation. In one fold, the optimal value for M (the number of hidden variables) was determined to be 20; and we used this value for the remaining nine folds. Figure 2 shows the model log-likelihoods of the out-of-sample predictions across the 10 folds of the data. As can be seen from the figure, the model with hidden variables always outperforms the model without hidden variables. Assuming the log likelihoods are independent (which is roughly the case as there are only 10 folds in the cross validation), the difference in predictive ability is significant (p~1:91|10 {6 via a paired Wilcoxon signed rank test). Similarly, the model without hidden variables predicts significantly better than does the reference model (p~1:91|10 {6 ).

GO Enrichment Analysis for Gene Sets Associated with Hidden Variables
Hidden variables can model the effect of unknown regulators or hidden confounders. To better understand the effect of the hidden variables, we look for correlations between genes associated with a given hidden variable and sets of genes in GO categories (Biological Process Ontology) [34].The GO categories are downloaded from website for gene set enrichment analysis (GSEA) http://www. broadinstitute.org/gsea/. In particular, for each gene set H, we identify the GO category whose set of genes is most correlated with H. We measure correlation via a p-value determined by application of Fisher's exact test. Since multiple gene sets H need to be examined, the raw p-values need to be calibrated because of the multiple testing problem [35]. To compute calibrated p-values for each H, we perform a randomization test, wherein we apply the same test to 1000 randomly created gene sets that have the same number of genes as H.
In Table 1, each row represents the gene set associated with a hidden variable. The calibrated p-values for the gene sets associated with hidden variables are listed in the second column in the table. The third column shows the false discovery rate (FDR) [36] of the gene sets. As can be seen from the Table, with an FDR significance threshold 0.05, nine of the twenty gene sets  The first column of Table 1 shows the sizes of the gene sets associated with hidden variables. As can be seen, each gene set covers a large number of genes, despite the use of an L1 penalty that tends to drive many association weights to zero. This result   indicates that the hidden variables may model the effects that influence many different genes.
Among the gene sets associated with known/putative regulators, there are 803 gene sets with size greater or equal to 5. The maximum size is 8820. Figure 3 shows the histogram of sizes of these gene sets. It can be seen from the figure that the gene sets associated with known/putative regulators are much smaller than those associated with hidden variables. This is reasonable since real regulators are expected to regulate a relatively small subset of genes. On the other hand, the large sizes of the gene sets associated with hidden variables indicate that the hidden variables are useful in modeling confounding factors that may effect most of the genes.

Comparison to Network Component Analysis Method
Our method aims to learn the transcriptional regulatory relationship without any prior knowledge of the network topology. As discussed in the Introduction section, various methods have been proposed to learn transcription factor activity assuming that the regulatory network topology is known [11][12][13][14][15][16]. Among the existing methods, Network Component Analysis (NCA) is a widely used approach. NCA aims at decomposing gene expression matrix X into two matrices A and P, such that X~AP, where A represents the connectivity network, and P presents the transcriptional factor activities (TFA). The connectivity matrix A is a required input of NCA. A nonzero value indicates there is an edge from a TF to a gene, and zero value indicates there is no edge between them. The nonzero values in the input matrix A can be random. The algorithm automatically learns the optimized A and P. The zero entries in A remain unchanged. That is, the structure of the regulatory network does not change. Therefore, NCA is mainly used to infer the TF activities with known network structure.
The NCA algorithm needs three criteria to ensure the decomposition to be unique [11]. First, the connectivity matrix A must have full-column rank. Second, when a node in the regulatory layer is removed along with all of the output nodes connected to it, the resulting network must be characterized by a connectivity matrix that still has full-column rank. This implies that each column of A must have at least K-1 zeros, where K is the number of TFs. Third, matrix P must have full row rank.
To apply NCA to infer the regulatory structure, we use a random matrix as the input matrix A, of which K-1 random elements in each column are set to 0. This is needed in order to satisfy the three criteria required by NCA. For the fairness of comparison, after the connectivity structure is learned by NCA, we remove the edges with small weights so that the number of remaining edges is equal to that of our model.
We apply GO enrichment analysis on the gene sets learned by NCA and our method. Table 2 shows the average raw p-value of the gene sets and the number of significant gene sets (with significance level 0.05 after correction for multiple testing). As can been seen from the table, the average raw p-value of our model is much less than that of NCA. Moreover, our model identified more significance gene sets than did NCA. The main reason for this difference is that NCA requires prior knowledge about the regulatory structure. Our model dose not have this assumption and tries to reconstruct the regulatory structure from the expression values of the TFs and genes.

Conclusion
Reconstructing gene transcriptional regulatory networks is a central problem in computational systems biology. Challenging issues include the incorporation of knowledge about TFs and modeling unknown TFs and confounders. We have developed a probabilistic graphical model that includes the known TFs as observed variables, uses hidden variables to model unknown TFs and confounders, and uses L1 regularization to address the high dimensionality and relatively low sample size of the data. Using human gene expression data, we have shown that the proposed model predicts significantly better than does the model without hidden variables. In addition, we have found that some of gene sets corresponding to hidden variables have significant correlations with GO categories, suggesting that the hidden variables at least in part represent unknown TFs.