## Figures

## Abstract

Identifying latent structure in high-dimensional genomic data is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for *biclustering* to infer subsets of co-regulated genes that covary in all of the samples or in only a subset of the samples. Our biclustering method, *BicMix*, allows overcomplete representations of the data, computational tractability, and joint modeling of unknown confounders and biological signals. Compared with related biclustering methods, BicMix recovers latent structure with higher precision across diverse simulation scenarios as compared to state-of-the-art biclustering methods. Further, we develop a principled method to recover context specific gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and to gene expression data from a cardiovascular study cohort, and we recover gene co-expression networks that are differential across ER+ and ER- samples and across male and female samples. We apply BicMix to the Genotype-Tissue Expression (GTEx) pilot data, and we find tissue specific gene networks. We validate these findings by using our tissue specific networks to identify trans-eQTLs specific to one of four primary tissues.

## Author Summary

Recovering gene co-expression networks from high-throughput experiments to measure gene expression levels is essential for understanding the genetic regulation of complex traits. It is often assumed for simplicity that gene co-expression networks are static across different contexts—e.g., drug exposure, genotype, tissue, age, and sex. The biological reality is that, along with differences in gene expression levels, there are differences in gene interactions across contexts. In this work, we describe a model for Bayesian biclustering, or recovering non-disjoint clusters of co-expressed genes in subsets of samples using gene expression level data. Using results from our biclustering model, we build gene co-expression networks jointly across all genes by computing the full regularized covariance matrix between all pairs of genes instead of testing each possible edge separately. Because biclustering recovers structure in subsets of the samples, we are able to recover gene co-expression networks that occur across all samples, that are differential across contexts (e.g., up-regulated in males and down-regulated in females), and that are unique to a context (e.g., only co-expressed in lung tissue). We illustrate the robustness of our approach and biologically validate the networks recovered from three different gene expression data sets.

**Citation: **Gao C, McDowell IC, Zhao S, Brown CD, Engelhardt BE (2016) Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering. PLoS Comput Biol 12(7):
e1004791.
https://doi.org/10.1371/journal.pcbi.1004791

**Editor: **Xianghong Jasmine Zhou,
University of Southern California, UNITED STATES

**Received: **May 14, 2015; **Accepted: **February 3, 2016; **Published: ** July 28, 2016

**Copyright: ** © 2016 Gao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **The gene expression data from the breast cancer study are available from Bioconductor as BreastCancerNKI. The Cardiovascular and Pharmacogenetics (CAP) study gene expression data were acquired through GEO GSE36868. The Genotype-Tissue Expression (GTEx) pilot data are available through dbGaP, accession phs000424.v4. All code is publicly available at the Engelhardt Group Website: http://beehive.cs.princeton.edu/software/

**Funding: **BEE, CG and SZ were funded by NIH R00 HG006265 and NIH R01 MH101822. CDB was funded by NIH R01 MH101822. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Cellular mechanisms tightly regulate gene transcription. Gene transcription is not independently regulated across genes: many of the mechanisms regulating transcription affect multiple genes simultaneously. Functional *gene modules* consist of subsets of genes that share similar expression patterns and perform coordinated cellular functions [1, 2]. This cluster-based description of gene expression fails to capture the informative co-expression patterns among genes within a gene module.

If we consider each gene as a vertex in a network, then pairs of genes within a gene module for which the correlation in expression levels cannot be explained by other genes may be connected by an undirected edge. Across all genes, these pairwise relationships constitute gene co-expression networks. Constructing these undirected gene networks, as compared to clustering genes into gene modules [3–6], provides rich detail about pairwise gene relationships. An even richer structure capturing these pairwise relationships would be a directed network of genes, but currently directed networks are computationally intractable to construct relative to undirected gene networks [7–10]. Our work describes a rigorous approach to recover undirected gene co-expression networks from gene expression data that uses a probabilistic latent factor model to quantify the relationships between all pairs of genes.

Several methods have been proposed to construct gene co-expression networks by partitioning a set of genes (and, in some cases, samples) into gene modules from which an undirected graph is elicited [11–14]. In most cases, gene partitioning creates disjoint sets of genes, implying that genes only participate in a single gene module. Biologically this assumption does not hold; the impact is that the gene networks built from methods that assume disjoint clusters are modular. These approaches are not probabilistic, and thus uncertainty in the network edges is not well characterized.

Alternatively, statistical latent factor models are often used to identify groups of co-regulated genes in gene expression data [15–18]. In particular, latent factor models decompose a matrix **Y** ∈ ℜ^{p×n} of *p* genes and *n* samples into the product of two matrices, **Λ** ∈ ℜ^{p×K}, the factor loadings, and **X** ∈ ℜ^{K×n}, the latent factor matrix with *K* latent factors, assuming independent Gaussian noise. Because it is costly to obtain and assay genome-wide gene expression levels in a single sample, most expression studies include observations of many more genes *p* than samples *n* [19, 20]. This so-called *p* ≫ *n* scenario limits our ability to find latent structure in this expansive, underconstrained space. High dimensional data suggests the use of strong regularization on the latent space to provide sufficient structure for the optimization to reach a robust solution. For example, we may regularize a latent space to discourage all but a few genes from contributing to a latent factor through a sparsity-inducing prior or penalty on the loading vectors [15, 21, 22]. Non-disjoint clusters of genes can be extracted from the resulting fitted sparse loading matrix by recovering all genes with non-zero loadings on the same factor [16]. Sparse latent factor models are more interpretable than their non-sparse counterparts because of this clustering effect.

Besides encouraging sparsity in the factor loading matrix, which results in non-disjoint clusters of genes that co-vary across all samples, one can also induce sparsity in the factor matrix, which results in non-disjoint subsets of samples within which subsets of genes uniquely exhibit co-variation. Statistically, this corresponds to regularizing both factor and loading matrices using priors that encourage zero-valued elements. Biologically, such a model recovers components that identify small numbers of correlated genes, where the correlation among the genes is exclusive to, for example, female samples. This statistical model encodes a general framework known as *biclustering* [23–35]. A biclustering model decomposes a matrix into clusters that each correspond to a subset of samples and a subset of features that exhibit latent structure unique to those subsets.

Gene expression levels have been shown to be sensitive to a number of environmental, biological, and technical covariates including experimental batch, sex, ethnicity, smoking status, or sample tissue heterogeneity [36, 37]. Methods to adjust the observation matrix to control the effects of these covariates without eliminating signals of interest have been proposed. Most attempts have been limited to correcting for confounding effects in a two-stage approach [20, 38, 39] or controlling for confounding effects jointly with association testing [40, 41]. The two-stage approach applied to estimates of co-expression networks has not been successful: often variation in expression levels of groups of co-expressed genes are captured in the estimates of confounding effects and controlled in the first stage, leading to false negatives [42].

In this paper, we develop a Bayesian statistical model for biclustering called *BicMix*. Our motivation behind developing this method was to identify large numbers of subsets of co-regulated genes capturing as many sources of gene transcription variation as possible within arbitrary subsets of the samples. Our biclustering model also includes non-sparse components to represent sources of transcription variation that affect all genes or all samples, which includes many types of confounding effects. We developed a simple but principled statistical method to reconstruct gene co-expression networks based on the regularized covariance matrices estimated using our biclustering model. This method recovers different types of gene co-expression networks, categorized by quantifying the contribution of each sample to the latent components: i) ubiquitous co-expression networks, ii) co-expression networks specific to a sample context, and iii) networks with differentially co-expressed genes across a sample context.

In this paper, we motivate and describe our Bayesian model for biclustering, BicMix. We validate our biclustering model on extensive simulations and compare biclustering with a number of state-of-the-art methods. We then apply our model to gene expression data without correcting for known or unknown confounders. In particular, we apply our biclustering model to gene expression levels measured in heterogeneous breast cancer tissue samples to recover a co-expression network that is differentially expressed across estrogen receptor positive and negative (ER+ and ER-) samples [43, 44]. Next, we apply our biclustering model to gene expression levels measured in lymphoblastoid cell lines (LCLs) from a cohort of patients in a cardiovascular disease study to identify co-expression networks with differential co-expression across males and females, and across smoking status [45]. Finally, we apply BicMix to the Genotype-Tissue Expression (GTEx) pilot data to elucidate tissue specific gene co-expression networks [46]. We validate the recovered networks by identifying tissue specific trans-eQTLs using the recovered tissue specific co-expression networks.

## Results

### Bayesian biclustering using BicMix

Biclustering was first introduced to detect clusters of states and years that showed similar voting patterns among Republicans in national elections [47] and was later referred to as *biclustering* in the context of identifying co-expressed genes in subsets of samples [23]. Biclustering has also been referred to as two mode clustering [48], subspace clustering [49, 50], or co-clustering [51] in various applied contexts. Biclustering was used successfully to explore latent sparse structure in different applied domains [52], including gene expression data [23, 24, 53–56], neuroscience [57], time series data [54], and recommendation systems [58]. Refer to comprehensive biclustering reviews for details [27, 59].

Biclustering approaches fall into four general categories. The first category assumes that each observed gene expression level for one sample is a linear combination of a mean effect, a row (gene) effect, and a column (sample) effect, some of which may be zero [23]. One approach in this category, *Plaid*, captures gene expression levels as a sum of many sparse submatrix components, where each submatrix includes non-zero values only for a subset of genes and subset of samples [30, 31]. The second category of biclustering methods uses hierarchical clustering to group together similar samples and features [3]. For example, samples may be clustered by considering some measure of feature similarity [24, 25, 28, 32, 35]. The third category of biclustering methods builds up biclusters by iteratively grouping features in a greedy way—e.g., identifying all genes that have correlated expression levels with a selected gene—and then removing samples that do not support that grouping [26]. The last category of biclustering methods uses Bayesian sparse factor analysis models [33]. These models decompose a gene expression matrix into two sparse matrices. Sparsity-inducing priors, such as the Laplace prior, are imposed on elements of both the loading and the factor matrices to induce zero-valued elements. *K* biclusters are recovered as the non-zero feature and sample components for each of the *K* latent components. Our approach falls into this last category of a sparse latent factor model for biclustering.

#### BicMix: Bayesian biclustering model.

We developed a Bayesian biclustering model, *BicMix*, built on factor analysis with sparsity-inducing priors on both of the low dimensional matrices. In particular, we defined the following latent factor model for matrix **Y** ∈ ℜ^{p×n}, which is the set of observations of *p* gene expression levels across *n* samples:
(1)
where **Λ** ∈ ℜ^{p×K} is the *loading matrix*, **X** ∈ ℜ^{K×n} is the *factor matrix*, ** ϵ** ∈ ℜ

^{p×n}is the residual error matrix, and

*K*is fixed a priori. We assume that the residual error is independent across genes and samples and has a zero-mean multivariate Gaussian distribution with gene specific variance: for

*i*= 1, …,

*n*, where

**Ψ**= diag(

*ψ*

_{1}, …,

*ψ*

_{p}). While a value must be specified for

*K*, the number of latent factors, this model removes factors that are unsupported in the data through a global sparsity-inducing prior, so

*K*should be set as an overestimate of the number of latent factors (see Methods) [60].

To induce sparsity in both the factors (samples) and the loadings (genes), we used the three parameter beta () distribution [61], which has been shown to be computationally efficient and to induce flexible modeling behavior. In previous work [60, 62], we included three layers of regularization via the distribution to induce sparsity in the loading matrix. Here, we extended this model to include this same sparsity-inducing prior on the factor matrix (see Methods). With a flexible sparsity-inducing prior on both the factor and loading matrices, the model becomes a biclustering model, estimating subsets of genes for which co-variation is observed in a subset of samples. This structured prior produces favorable behavior in this biclustering model: i) the number of factors and loadings are effectively estimated from the data, because the sparsity-inducing prior removes unused factors; ii) each factor and corresponding loading has a different level of shrinkage applied to it, enabling a non-uniform level of sparsity and corresponding percentage of variance explained (PVE) for each factor and loading pair [60]; iii) neither the clusters of genes nor the clusters of samples are disjoint, so all genes and all samples may be in any number of clusters, or none; and iv) strong regularization allows overcomplete estimates of the response matrix, with possibly more factors *K* than samples *n* or genes *p*.

In gene expression data, observed covariates or unobserved confounding effects may systematically influence variation in the observation [38, 40, 41]. As in prior work, we tailored our sparsity-inducing prior to jointly model these often dense confounding effects [60]. In particular, we adapted our model so that the loadings and factors are drawn from a two-component mixture distribution, where each vector is either *sparse*—with zero elements—or *dense*—with no zero elements or all zero elements (Fig 1). We extracted information about whether a vector is sparse or dense directly from the fitted model parameters using the posterior mean of the mixture component assignment variables for each component *k* = 1, …, *K*, where *z*_{k} ∈ {0, 1} indicates a dense or a sparse loading and *o*_{k} ∈ {0, 1} indicates a dense or a sparse factor. This two-component mixture in the prior distribution for the factors and loadings adds two favorable behaviors to the biclustering model. First, it jointly models covariates that regulate variation in most genes and also in few genes; we have found that confounding effects are often captured in the dense components as all samples and most genes are affected (e.g., batch effects, population structure) [37, 60]. Second, the mixture component has the effect of relaxing a computationally intractable space, enabling scalable parameter estimation in a Bayesian framework. Specifically, considering all possible subsets of genes and samples to identify biclusters is an intractably difficult problem; however, it is computationally tractable to first search over the space for which cluster membership is relaxed, represented as a continuous value between 0 and 1. Then we identify clusters by iteratively shrinking the small magnitude membership values to zero, which maps the continuous values to the binary representation of cluster membership. We estimated parameters in this model using both Markov chain Monte Carlo (MCMC) and a variational expectation-maximization (VEM) approach (see Methods).

Ordered from left to right are, respectively, the *p* × *n* gene expression matrix **Y**, the *p* × *K* loading matrix Λ including both sparse and dense columns, the *K* × *n* factor matrix **X** including both sparse and dense rows, and the *p* × *n* residual matrix *ϵ*. Blue, red, and white entries in each matrix correspond to negative, positive, and zero values, respectively.

#### Gene co-expression networks from biclusters.

To construct an undirected gene network, we built a Gaussian Markov random field, or a Gaussian graphical model (GGM) [63], using the components estimated with our biclustering model (see Methods). In particular, regularized estimates of the gene-by-gene covariance matrix *Ω* may be computed from our parameter estimates as **Ω** = **Λ** **Σ** **Λ**^{T} + **Ψ**, where **Σ** is the covariance matrix for **X**. Factor analysis is often viewed as a method for low-rank covariance estimation by marginalizing over the factors, **X**. Furthermore, for any subset of components with sparse loading vectors, *A* ⊆ {1, …, *K*}, , where **Σ**_{A,A} is the covariance matrix for **X**_{A}, estimates a regularized covariance matrix for the genes loaded on **Λ**_{A}. Note that **Ω**_{A} is generally both sparse and full rank; biclustering is a highly structured approach to estimating regularized covariance matrices [64]. The inverted covariance matrix is a symmetric precision matrix , where each element *δ*_{j,j′} can be transformed into the *partial correlation* between genes *j* and *j*′,
where *x*_{¬(j, j′),⋅} indicates all genes in *X* that are neither *j* nor *j*′.

In a GGM, edges are defined as pairs of nodes for which the partial correlation is non-zero. Since each loading Λ_{k}, *k* ∈ *A*, specifies a cluster of genes, we do not invert the full covariance matrix, but instead invert the submatrix that corresponds to genes with non-zero loadings in those components. This approach avoids inducing non-zero precision estimates—and, correspondingly, edges in the GGM—between genes that never occur in the same bicluster. We used GeneNet [63] to test the precision matrix for significant edges. GeneNet assumes that the edges are drawn from a mixture of the null (no edge) and alternative (edge) distributions, *f*(*E*) = *η*_{0} *f*_{0}(*E*) + *η*_{1} *f*_{1}(*E*), to calculate the probability of each edge being present or not. Practically, we selected edges with a probability > 0.8.

To recover co-variance networks specific to a subset of the samples, where the subset is characterized by context status, we chose the subset of components that contributes to this covariance matrix carefully. In particular, when we select subset *A* to include only components that have non-zero factor values for samples in a specific context (e.g., only female samples), we identify context specific covariance. When we select *A* such that all samples have a non-zero contribution to a component, we recover ubiquitous components. When we select *A* such that the mean rank of the factor values for one sample context is different than the mean rank of factor values for a different sample context—evaluated using a Wilcoxon signed-rank test— we identify components that are differential across the two contexts.

To combine results across runs of biclustering, we extracted co-expression network edges using this procedure separately for each run. Then we used ideas from an ensemble method, bootstrap aggregation (*bagging*) [65], and counted the number of times each edge for a specific network type was recovered across the runs. The final co-expression network contained edges that were identified in ≥ *r* runs.

### Simulation results across biclustering methods

We validated our biclustering model using simulated data sets and compared results from five state-of-the-art biclustering methods. We simulated data from an alternative generative model for observation matrix **Y** = **Λ** **X** + **ϵ**, where **Y** has dimension *p* = 500 by *n* = 300 and . Within this model, we simulated sparsity as follows: for each loading and factor, a number *m* ∈ [5, 20] of elements were randomly selected and assigned values drawn from ; the remaining elements were set to zero. We allowed components to share as many as five elements. Simulation 1 (Sim1) had ten sparse components. Simulation 2 (Sim2) had ten sparse components and five dense components, for which the loadings and factors were drawn from a distribution. The components were shuffled so that a sparse loading may correspond to a dense factor, and vice versa. For both simulations we considered low and high noise scenarios: the residual variance parameter in the low noise (LN) setting was *ν*^{−1} = 1 and, in the high noise (HN) setting, was *ν*^{−1} = 2. Ten matrices *Y* were generated from each simulation scenario.

We ran BicMix and five other biclustering methods–Fabia [33], Plaid [30], CC [23], Bimax [27], and Spectral biclustering [34]. For all simulations, we ran BicMix by setting *a* = *b* = 0.5, *c* = 1, *d* = 0.5, *e* = *f* = 1 and *ν* = *ξ* = 1 to promote substantial sparsity at the local level (i.e., the horseshoe prior [66]), weaker sparsity at the factor specific level (i.e., the Strawderman-Berger prior [67, 68]), and a uniform prior at the global level of the hierarchy [60]. The algorithm was initialized with warm start values by running MCMC for 500 iterations and using the final sample as the initial state for variational EM. For BicMix results, components that were classified as sparse have each element thresholded at 10^{−10}, because our parameter estimation methods converged to values near, but not exactly, zero. All other methods were run using their recommended settings (see Methods). For Sim2, we corrected the simulated data for the dense components by controlling for five principal components (PCs) before all other methods were run; without this initial correction for dense components, results from all five other biclustering methods were uninterpretable. For all runs, BicMix was initialized with *K* = 50 latent factors; all other methods were initialized with the correct number of sparse factors *K* = 10. For Fabia, we ran the software in two different ways. The results from running Fabia with the recommended settings are denoted as *Fabia*. We also set the sparsity threshold in Fabia to the number (from 100 quantiles of the uniform distribution over [0.1, 5]) that produced the closest match in the recovered matrices to the number of non-zero elements in the simulated data; we label these results *Fabia-truth*.

We used the *recovery and relevance score* (R&R score) [27] to measure the false discovery rate (FDR) and sensitivity of each method in recovering true biclusterings. Let the true set of sparse matrices be **M**_{1} and the estimated set of sparse matrices be **M**_{2}; then the R&R score is calculated as:
(2) (3) *Recovery* quantifies the proportion of true clusters that are recovered (i.e., recall); *relevance* refers to the proportion of true clusters identified in the recovered clusters (i.e., precision). For BicMix, we applied this R&R score to the components for which both the loading and the factor vectors were sparse, which indicates a bicluster. For the doubly-sparse latent factor models, Fabia and BicMix, we also calculated a sparse stability index (SSI) [60] to compare the recovered and true matrices; SSI is invariant to label switching and scale, and falls between [0, 1] with 1 indicating perfect recovery.

For Sim1, we found that BicMix recovered the sparse loadings, sparse factors, and the biclusters well in the low noise scenario based on both R&R (Fig 2a) and SSI (Fig 2b). Fabia-truth had the second best performance based on R&R. For comparison, Fabia-truth achieved better R&R scores than Fabia (Fig 2a); the clustering results from BicMix dominated those from Fabia-truth, although there was only a small gain in relevance in the low noise Sim1 results for BicMix. Plaid showed high relevance for the recovered biclusters regardless of the noise level for Sim1, but at the expense of poor recovery scores. The remaining methods did not perform well in these simulations with respect to the R&R score for both low and high noise simulation scenarios.

Top row: Simulation with low noise. Bottom row: Simulation with high noise. Left column: Sim1 with only sparse components. Right column: Sim2 with sparse and dense components. Panel a: Recovery score on the x-axis, relevance score on the y-axis for all methods in the legend. Panel b: Stability statistic (y-axis) for the sparse components recovered by BicMix and Fabia.

For Sim2, BicMix correctly identified the sparse and dense components (Fig 2a), where a threshold of 〈*z*_{k}〉 > 0.9 was used to determine when a loading *k* was dense. The performance of Fabia on Sim2 deteriorated substantially relative to its performance on Sim1, although the confounders were removed using principal components (PCs) and the correct number of factors was given. For both BicMix and Fabia, additional noise in the simulation made bicluster recovery more difficult, as shown in deterioration of the recovery score for both methods; however, unlike Fabia, the relevance score of the biclustering from BicMix was unaffected by additional noise in the Sim2 high noise scenario.

The other methods show inferior performance relative to BicMix and Fabia on Sim2. CC assumes that genes in each bicluster have constant expression values, which limits its ability to cluster genes with heterogeneous expression levels. Bimax assumes binary gene expression values (i.e., over- or under-expressed), which limits its utility for heterogeneous expression levels. Spectral biclustering imposes orthogonal constraints on the biclusters; this orthogonality assumption is violated in these simulations and also in gene expression data, where correlated sources of variation may impact similar subsets of genes.

### Application of BicMix to three gene expression study data sets

We now turn to the application of BicMix to gene expression data from three studies. For each application, we first describe the biclustering results from BicMix. Then we show the interpretable networks that were recovered from the BicMix results and discuss network validation using cis- and trans-eQTLs.

#### Breast cancer network.

We investigated a breast cancer data set that contains 24,158 genes assayed in 337 breast tumor samples [44, 69, 70] after removing genes that are > 10% missing and imputing missing values of included genes [71] (see Methods). All patients in this data set had stage I or II breast cancer and were younger than 62 years old. Among the 337 patients, 193 had lymph-node negative disease and 144 had lymph-node positive disease; prognostic signatures such as *BRCA1* mutations, estrogen receptor status (ER), distant metastasis free survival (DMFS) were collected for all patients. We focused on building differential gene co-expression networks across ER positive (ER+) and ER negative (ER-) patients because of ER’s prognostic value in profiling breast cancer patients [72]: cancer patients that are ER+ are more likely to respond to endocrine therapies than patients that are ER-. In these data, there are 249 ER+ and 88 ER- patients.

We ran *BicMix* on these data, setting *a* = *b* = 0.5, *c* = 1, *d* = 0.5, *e* = *f* = 0.5 and *ν* = *ξ* = 1 as in the simulations; the initial number of components was set to *K* = 300. Starting from 1,000 random values, BicMix was run until the total number of genes with non-zero loadings across components changed over 100 iterations. Removing runs that did not converge, we recovered 53,814 components across 900 runs, of which 110 loadings and 19,239 factors were dense (Fig 3a and 3b).

Panel a: Distribution of the number of genes with non-zero values in each of the 53,814 loadings. Panel b: Distribution of the number of samples with non-zero values in each of the 53,814 factors. Panel c: average PVE for the components sorted by PVE within each run. The middle lines show the median PVE, the ribbons show the range of the minimum and maximum PVE across 900 runs. For panels a and b, the peaks on the far right correspond to the number of the genes and samples for the dense loadings and dense factors.

The distribution of the number of genes in each sparse component was skewed to small numbers (Fig 3a). We categorized each component as one of four configurations: sparse gene loadings with sparse sample factors (SS), sparse gene loadings with dense sample factors (SD), dense gene loadings with sparse sample factors (DS), and dense gene loadings with dense sample factors (DD). SS components captured subsets of genes that are uniquely co-expressed in a subset of samples, which may be due to, e.g., a genotype or environmental context that impacts expression levels of a small number of genes. SD components capture subsets of genes that are differentially co-expressed among all samples, which may be due to, e.g., sex-differential expression or batch effects. DS components capture a subset of samples in which all genes have additional co-variation, which may be due to, e.g., sample contamination. DD components capture variation that occurs across all samples and affects co-variation across all genes, which may be due to, e.g., latent population structure.

For each run, we calculated the percentage of variance explained (PVE) per component *k* as , where *Tr* denotes the trace operator. We ordered the components by PVE within each SS, SD, DS, and DD component category. We calculated the mean, maximum, and minimum values of the PVE-ordered, categorized components across the runs. Note that, because there are no orthogonality constraints, it is possible that many of these components explain similar variation in the observations; for this quantification we are assuming this PVE is disjoint and normalizing across all component-wise PVEs. We also calculated the total PVE explained by each component category by summing the total PVE for all components jointly in each category. The distribution of the number of genes contained in each loading and the PVE by component and across SD, SS, DS, DD categories show that SD and SS components made up the vast majority of recovered components, and that, on average, SD components accounted for a larger PVE than SS components (Fig 3a). The number of components that fell into the SS, SD, DS, DD categories accounted for 64.1%, 35.7%, 0.1%, 0.1%, respectively, of the total number of components. In the same order, components in the four categories accounted for 24.7%, 74.8%, 0.3% and 0.1% of the total PVE.

We selected components from the fitted BicMix model to identify ER+ and ER- specific gene co-expression networks (see Methods). Moreover, to recover gene co-expression networks that are differentially expressed across ER+ and ER- samples, we identified components corresponding to factors that had a significant difference in the mean rank of the factor value between the ER+ and ER- samples based on a Wilcoxon signed-rank test (). Across our components, we found 996 components unique to ER+ samples, 135 components unique to ER- samples, and 17,051 components differential across ER+ and ER- samples. Interestingly, we note that differential ER status represents the bulk of the SD components recovered across all of the runs. We contrasted the correlation of all of the observed covariates with a representative sample of the SD factors (S1 Fig), and we found that ER status correlated well with many of these SD factors. Tumor grade also correlated well with many of the SD factors; however, tumor grade is somewhat anti-correlated with ER status (S2 Fig), so we chose to focus on ER status because of its clinical utility.

The precision matrices of the subsets of components corresponding to the three network types were constructed and edges among these genes were tested using our method for extracting gene co-expression networks from the fitted biclustering model (see Methods) [63]. For the ER- specific network, we recovered a total of 15 genes and 10 edges that are replicated ≥ 2 times; for the ER+ specific network, we recovered 621 genes and 760 edges that are replicated ≥ 2 times (S3 and S4 Figs; S1 and S2 Tables). We recovered more genes for ER+ than ER- specific networks because there are many more ER+ samples than ER- samples. We found one node and no edges shared across the ER+ and ER- specific networks.

For the co-expression network differential across ER+ and ER- samples, we recovered 432 genes and 8,156 edges that were replicated ≥ 15 times across the 900 runs (Fig 4; S3 Table). We hypothesized that, in these differential networks, the 432 genes may be divided into two sub-groups: a group of genes that is up-regulated in the ER+ samples and down-regulated in the ER- samples, and a group of genes that is down-regulated in the ER+ samples and up-regulated in the ER- samples. To explore this hypothesis, we quantified differential expression levels for the 432 genes in the differential gene co-expression network (Fig 4b). We found 430 genes that are up-regulated in ER- samples and down-regulated in ER+ samples. In contrast, we found two genes that are down-regulated in ER- samples and up-regulated in ER+ samples (Fig 4b, first two columns). This bias is likely due to the unbalanced number of ER+ and ER- samples. The genes in the ER-status specific co-expression networks also show a pattern of differential expression (S5 Fig), although not as strong, because gene correlation is differential across networks as opposed to coordinated differential expression levels across a subset of genes.

Panel a: differential ER gene co-expression network, where node size corresponds to betweenness centrality, which quantifies the number of shortest paths between all pairs of nodes in the network in which the gene is included. Panel b: gene expression levels for 432 genes in the ER-status differential co-expression network.

In the ER-status differential network, we found that many of the annotated genes play critical roles in breast cancer development. For genes that are up-regulated in ER- and down-regulated in ER+ patients:

*CD53*, the gene that encodes the leukocyte surface antigen*CD53*protein, and*IL10RA*, the gene that encodes the receptor for interleukin 10, belong to a set of three genes (including*DMFS*) that are predictive of outcome specifically in ER tumors [73];*CD247*encodes the CD3 zeta chain protein; a previous study has shown that the down-regulation of CD3 zeta expression plays an important role in breast cancer progression [74, 75];*CXCR3*encodes the chemokine (*C-X3-C*motif) ligand protein; studies have shown that*CXCR3*deficiency speeds tumor progression [76], and that*CXCR3*isoforms have divergent roles in promoting cancer stem-like cell survival and metastasis [77].

For genes that are up-regulated in ER+ and down-regulated in ER- patients:

#### Cardiovascular and Pharmacogenetic (CAP) RNA expression study data.

We next applied our biclustering model to a gene expression study with 10,195 expressed genes measured in 480 human lymphoblastoid cell lines (LCLs) [45]. In these data, there are 221 female samples, 259 male samples, 64 smokers, and 416 non-smokers. Age and BMI are also available for each sample. The data were processed according to previous work [45, 60]; In particular, we removed genes with probes on the gene expression array that aligned to multiple regions of the genome using a BLAST analysis against human genome reference hg19, leaving 8,718 expressed genes. No known covariates nor PCs were controlled in these data before projecting each gene expression level to the quantiles of a standard normal distribution (S6 Fig).

We set *K* = 400 and ran EM from 1000 starting points with *a* = *b* = 0.5, *c* = 1, *d* = 0.5, *e* = *f* = 1, *ν* = 1 and *α* = *β* = 1 [60]. After removing runs that did not converge, we recovered a total of 865 runs. On average there were 93 factors across runs, of which on average 21 were dense factors and dense loadings (where both the loading and the factor mixture component posterior probability of being non-sparse ≥ 0.9).

#### Sex-differential networks.

To construct a sex differential network, we selected factors that had differential mean values across sex. We recovered 68 genes corresponding to 78 edges that were replicated ≥ 5 times across runs (Fig 5a). The expected number of edges to replicate in this experiment under the null hypothesis is approximately 0.8 (Eq 89). Many of the genes that were identified to have differential co-expression partners across sex play important roles in sex determination or sex specific regulatory activities.

- The hub gene,
*UTX*, regulates somatic and germ cell epigenetic reprogramming [80]; more recently it has been shown to be a sex specific tumor suppressor in T-cell acute lymphoblastic leukemia [81]; *ZFX*, a gene on the human X chromosome that is structurally similar to the Y chromosome gene*ZFY*, has been used for sex identification in many species [82];*USP9X*, an X-linked gene, is differentially expressed across sexes in the adult mouse brain, and is hypothesized to play a role in differential neural development in women with XO Turner syndrome [83].

A heatmap of the gene expression levels of these genes across male and female samples shows patterns of differential expression as in the genes highlighted in the breast cancer ER-status differential co-expression network (Fig 5b).

Panel a: differential sex gene co-expression network, where node size corresponds to betweenness centrality. Panel b: gene expression levels for 61 genes in the sex differential gene co-expression network.

#### Sex specific and smoking status specific networks.

To construct sex specific networks and smoking status specific networks, we identified components with non-zero factor values for either only male or only female samples, or only smokers or only non-smokers, respectively. Using these components, we recovered 57 genes with 176 edges specific to males and 38 genes with 80 edges specific to females, where both networks included edges that were recovered ≥ 10 times across runs (Fig 6). The expected number of edges to replicate in this experiment under the null hypothesis is approximately 2.3 × 10^{−7} (males) and 9.4 × 10^{−7} (females; Eq 89).

Panel a: Gene co-expression network specific to males. Panel b: Gene co-expression network specific to females. Node size and color correspond to betweenness centrality.

For the smoking status specific networks, we identified 31 genes with 36 edges specific to smokers that were recovered ≥ 5 times across runs (S7a Fig); we identified 294 genes with 2,272 edges specific to non-smokers that were recovered ≥ 10 times across runs (S7b Fig). The expected number of edges to replicate in this experiment under the null hypothesis is approximately 4.6 × 10^{−7} (smokers) and 1.3 × 10^{−6} (non-smokers; Eq 89). The decrease in power to detect smoker specific edges compared to non-smoker specific edges is likely due to the imbalanced number of smokers (64) and non-smokers (416) in this study.

#### Genotype-Tissue Expression (GTEx) study.

We applied BicMix to a subset of the pilot Genotype-Tissue Expression (GTEx) project data [46]. Our goal for these data was to identify gene co-expression networks that were specific to a single tissue while controlling for networks that appeared across all tissues and other confounding effects. The subset of gene expression data that we used contained *p* = 20,850 genes measured in *n* = 446 samples across four tissues: adipose (*n*_{f} = 103), artery (*n*_{a} = 118), lung (*n*_{l} = 122), and skin (*n*_{s} = 106). We preprocessed these RNA-sequencing data as described in Methods. We concatenated these data sample-wise to create a single response matrix **Y** ∈ ℜ^{p × (nf + na + nl + ns)}.

We set *K* = 300 and ran EM from 1000 starting points with *a* = *b* = 0.5, *c* = 1, *d* = 0.5, *e* = *f* = 1, *ν* = 1 and *α* = *β* = 1 [60]. After eliminating runs that did not converge, we used a total of 958 runs. In these runs, we recovered on average 119 factors, approximately eight of which were dense in both factors and loadings (S8 Fig).

#### Comparison with biclustering results from comparative methods.

To determine whether or not the existing biclustering methods also recover similarly informative biclusters, we ran Fabia, CC, Bimax, Spectral, and Plaid on the GTEx data. To control for confounding effects, we removed the effects of five principal components of the gene expression matrix separately within each tissue so as to maintain the tissue specific effects for all methods except Fabia and BicMix. We found that both Plaid and Fabia were able to separate the tissues in the sample space, as was principal components analysis (PCA; S9 Fig). Starting from 300 biclusters, Plaid reduced the number of biclusters to four, with each bicluster containing samples from a subset of one tissue; adipose samples, however, were not identified uniquely (S10 Fig).

Results from Fabia depended on the specified number of biclusters: when the number of clusters is set to the number of tissues (four), the biclusters distinguished the four tissues (S11 Fig). However, with larger numbers of clusters specified, the separation of the samples by tissue became increasingly obfuscated until, with 20 clusters, there were no tissue specific patterns across clusters (S12 Fig). Compared to results from Plaid and Fabia, the other methods produced uninterpretable results: with 300 clusters, CC found one bicluster that contained all genes and one sample; Bimax failed to recover any biclusters in these data; Spectral found two biclusters, each with zero genes. BicMix, as described above, identified the four tissues in separate clusters as with Fabia (four factors) and PCA (S13 Fig).

#### Tissue specific networks.

In the BicMix results, we identified all of the bicluster factors and loadings that appeared uniquely in one of the four tissues. We collected genes and edges that were recovered ≥ 10 times across 1000 runs. We recovered 152 genes with 379 edges for adipose tissue, 167 genes with 966 edges for artery tissue, 98 genes with 199 edges for lung tissue, and 70 genes with 171 edges for skin tissue (Fig 7; S4–S7 Tables). The expected number of edges to replicate in this experiment under the null hypothesis is approximately 9.3 × 10^{−6} (adipose), −2.8 × 10^{−6} (artery), 8.2 × 10^{−6} (adipose), 1.7 × 10^{−6} (skin; Eq 89). Note that the negative expectation is due to the approximation in Eq (89).

Adipose: gene co-expression network for adipose. Artery: gene co-expression network for artery. Lung: gene co-expression network for lung. Skin: gene co-expression network for skin. Node size and color correspond to betweenness centrality.

To study the biological significance of the tissue specific gene clusters, we applied the functional gene annotation tool, DAVID (version 1.1) [84] to perform a gene ontology enrichment analysis on the recovered tissue specific gene clusters from a subset of 100 randomly chosen runs. Using a false discovery rate (FDR) threshold of 0.05, we found 262 (adipose), 265 (artery), 672 (lung), and 560 (skin) *molecular function* Gene Ontology (GO) categories enriched in clusters specific to each of the four tissues (S8–S11 Tables). In particular, the adipose specific gene clusters were most enriched for *lipid biosynthetic process* (FDR ≤ 2.8 × 10^{−4}), *lipid metabolism* (FDR ≤ 1.7 × 10^{−2}), and *insulin signaling pathway* (FDR ≤ 0.046); Interestingly, artery specific functions also appeared in adipose enriched terms; it has been shown that angiogenesis and vascular functions modulate obesity, adipose metabolism, and insulin sensitivity [85]. The artery specific gene clusters were most enriched for *vascular smooth muscle contraction* (FDR ≤ 9 × 10^{−3}), *circulatory system process* (FDR ≤ 0.03), and *ventricular cardiac muscle cell development* (FDR ≤ 0.049). The lung specific gene clusters were enriched for *respiratory burst* (FDR ≤ 5 × 10^{−3}) and *Toll-like-receptor* (FDR ≤ 1.8 × 10^{−2}), where *Toll-like-receptor* has a role in lung disease [86]. The skin specific gene clusters were most enriched for *melanogenesis* (FDR ≤ 1.9 × 10^{−2}), the process of formation of pigmentation, typically of the skin, and *epidermolysis bullosa (EB)* (FDR ≤ 0.03), an inherited connective tissue disease causing blisters in the skin and mucosal membranes. We also observed that many GO terms were enriched in multiple tissues, suggesting that a similar set of biological functions are achieved via different, cell specific mechanisms across tissues.

In the tissue specific networks, we found that the genes with larger betweenness centrality were known to play important roles in tissue specific functions. For adipose tissue, we found that the hub genes play important roles in adipose metabolism and show significant associations with obesity related traits. For example, *DOK1*, the gene that encodes *Docking Protein 1*, was shown to mediate high-fat diet-induced adipocyte hypertrophy and obesity through modulation of *PPAR*-gamma phosphorylation [87]. *APBB1IP*, which encodes the *Rap1-GTP*-interacting adaptor, has been shown to be involved in regulating metabolism and protecting against obesity [88]. *RHOQ*, also known as *TC10*, alters its gene expression levels in visceral adipose tissue of rats that are given a high-fat diet [89]. Genes with smaller betweenness centrality play important adipose specific functions as well; for example, gene expression levels of *PNPLA3* are affected by diet-induced obesity [90], and *LGPAT1* was shown to influence BMI and percent body fat in Native Americans [91].

In the artery specific network, we found that the genes with larger betweenness centrality play important roles in *angiogenesis* and *vasculogenesis*. For example, *ELTD1*, also known as epidermal growth factor (*EGF*), regulates angiogenesis [92]. *CLIC4*, a gene that encodes the chloride intracellular channel protein 4, regulates vasculogenesis through endothelial tube formation; abnormal *CLIC4* expression may play a role in pulmonary arterial hypertension pathology [93]. *CD36* encodes the thrombospondin receptor, and polymorphisms in this gene have significant associations with coronary artery heart disease risk [94]. *BAIAP2* encodes the brain specific angiogenesis inhibitor 1 *BAI1*, and is protective of angiogenesis [95].

In the lung specific network, the genes with larger betweenness centrality play important roles in maintaining the proper function of the lungs or are associated with lung and respiratory diseases. For example, *IRAK3* encodes a member of the interleukin-1 receptor-associated kinase protein family; hospital patients on mechanical ventilation with injury have differential expression of *TLR4* and *IRAK3* [96]. *CD46* encodes a regulatory protein for which higher expression in the lungs of ex-smokers appears to reduce inflammation and protect individuals from emphysema and chronic obstructive pulmonary disease [97, 98]. *IL18RAP* encodes the interleukin 18 receptor accessory protein. A recent study found that genetic variants in *IL1RL1* and *IL18R1* were significantly associated with bronchial hyperresponsiveness [99].

In the skin specific network, we found genes with larger betweenness centrality played critical biological roles in the skin, and were often related to skin tumor growth or carcinoma. For example, *RBPMS* encodes the RNA binding protein with multiple splicing; *RBPMS* is one of six genes that were shared among the top up-regulated genes both in dedifferentiated carcinoma and in carcinoma with loss of 13q [100]. *MCAM* encodes the melanoma cell adhesion molecule; increased expression of *MCAM* in human melanoma cells leads to increased tumor growth and metastasis [101, 102]. Finally, *MSRB* encodes the methionine sulfoxide reductase, which is important for antioxidant repair in human skin [103]. We note that it is possible in the current methodological framework to recover gene networks specific to subsets of the context specific samples; here it is possible that a subset of these skin samples include pre-cancerous cells, which will lead to recovering skin specific networks with edges corresponding to genes that are co-expressed only in pre-cancerous skin tissue.

#### Tissue specific trans-eQTLs.

To validate the tissue specific networks identified in the GTEx data, we performed a trans-eQTL analysis on the genes in these networks as follows. An expression QTL (eQTL) is, in this context, a single nucleotide polymorphism (SNP) that regulates expression levels of a specific gene. Transcription regulation may happen in an allele specific way (cis-eQTL) or may be mediated by a non-allele specific process (trans-eQTL). Given a gene *A*, a SNP *Q* that has been identified as a cis-eQTL for gene *A*, and a target gene *B* that is a neighbor of *A* in the tissue specific network, we tested for association between *Q* and *B* in gene expression values derived from samples with matched primary tissue. Using a list of eQTLs identified in previous work [104], we performed a pairwise univariate regression analysis of these eQTLs informed by the tissue specific networks. The same univariate analysis was performed on the permuted gene expression data to find *p* values under the null distribution of no trans-association.

Using our adipose specific networks, we found 40 cis-eQTLs for genes *RHOQ* and *PARVA12*, which were trans-eQTLs for many different target genes in adipose tissue samples (FDR ≤ 0.2). Two of these genes with trans-eQTLs specific to adipose tissue were *TK2* and *EHD2*, which have been associated with the abnormal development of adipose tissues and adipokine levels in mice [105] and lead to specific changes in white adipose tissue of growth hormone receptor-null mice [106], respectively. In artery, we found 121 cis-eQTLs for nine genes that act as trans-eQTLs to multiple target genes. The artery specific trans-eQTL target genes include *CCM2L*, involved in controlling vascular stability and growth [107], and *ARHGEF15*, which affects retinal angiogenesis in endothelial cells [108]. In lung, we found 19 cis-eQTLs for seven genes that are trans-eQTLs for seven target genes including *IRAK3*, described above, and *SIGLEC5*, which has been shown regulate acute pulmonary neutrophil inflammation [109]. In skin samples, we found 13 cis-eQTLs targeting three genes that are trans-eQTLs for two target genes that are unique to skin samples. One of these skin specific trans-eQTLs affects *SLC11A1*, which has been associated with susceptibility to Buruli ulcers [110]. (Full list of tissue specific trans-eQTLs in S12–S15 Tables.)

## Discussion

In this work, we developed a statistical approach to biclustering based on a Bayesian sparse latent factor model. We included a two-component mixture distribution to allow both sparse and dense representations of the features or samples, which captures heterogeneous sources of structured variation within the gene expression data. We used the regularized covariance matrix estimated from the latent factor model to build a Gaussian graphical model with the features represented as nodes in the undirected network. By extracting covariance matrices corresponding to subsets of components, we were able to identify gene co-expression networks that were shared across all samples, unique to a subset of samples, or differential across sample subsets.

We applied our methodology to breast tumor tissue gene expression samples and recovered co-expression networks that are differential across ER+ and ER- tumor types. We applied our methodology to gene expression data from the CAP project and recovered sex-differential, sex specific, and smoking status specific gene co-expression networks. We applied our methodology to the GTEx gene expression pilot data and recovered tissue specific networks for four tissues, which we validated by identifying tissue specific trans-eQTLs.

Factor analysis methods, including the biclustering approach presented here but extending to many other well-studied models, are statistical tools developed for exploratory analyses of the data. In this work, we have exploited the latent structure in both the factor and the loading matrix to estimate the covariance matrix that is specific to sample subsets. We considered tissue type and tumor types, but these methods can be used for any observed binary, categorical, integer, or continuous covariate (e.g., case-control status, batch, sex, age, EBV load). We showed in the Results that the recovered latent structure has substantial context specific biological meaning.

Our results on the GTEx data show that a number of genes were identified as part of multiple tissue specific networks. While individual genes may overlap across networks, the interactions of those genes did not. Genes that co-occurred in multiple tissue specific networks are good candidates to test for differential function across tissues. We also used this approach to study sexual dimorphism, extracting gene networks specific to one sex or differential across the sexes. We showed the potential of this approach to improve statistical power to identify sex specific trans-eQTLs.

In this version of BicMix, extracting a covariance matrix specific to a subset of the samples was performed *post hoc*: the linear projection to the latent space was performed in a mostly unsupervised way, although our three layer sparsity inducing priors add additional structure above SFA-type approaches. As described in the Results, there were multiple categories of gene-interactions that we recovered. These categories included: gene interactions that existed across context, gene interactions that were unique to specific contexts, and gene interactions that were present across contexts but differentially interact in different contexts. However, this approach currently does not use known context to directly inform the projection.

Indirectly, we saw that the sparsity structure on the samples allowed small subsets of the samples to inform projection, but this still relied on a post hoc interpretation of those sample subsets to recover specific network types. Correlation between contexts, such as tissue and batch, or smoking status and sex, would confound these results; here we checked for these correlation among observed covariates (S2 and S6 Figs) and also validated our results using literature searches and trans-eQTL recovery. Furthermore, it may be the case that, for a sample context of interest (e.g., age, sex), there is insufficient signal that is uniquely attributable to those samples (e.g., female) to identify a covariance matrix corresponding to the values of interest in this unsupervised framework. We are currently extending this approach so that the linear projection is explicitly informed by the context of interest.

## Methods

### Bayesian biclustering model for BicMix

We consider the following factor analysis model:
(4)
where **Y** ∈ ℜ^{p×n} is the matrix of observed variables, **Λ** ∈ ℜ^{p×K} is the loading matrix, **X** ∈ ℜ^{K×n} is the factor matrix, and **ϵ** ∈ ℜ^{p×n} is the residual error matrix for *p* genes and *n* samples. We assumed , where **Ψ** = diag(*ψ*_{1}, …, *ψ*_{p}).

In previous work [60, 62], a three parameter beta () [61] prior was used to model the variance of **Λ**. The three parameter distribution has the form
(5)
for *x* ∈ (0, 1), *a* > 0, *b* > 0 and *ϕ* > 0.

We used to induce flexible shrinkage to both **Λ** and **X**. Specifically, we included three layers of shrinkage—global, factor specific, and local—for both the factors and the loadings. Next we describe the sparsity-inducing structure for **Λ** and **X**.

#### Hierarchical structure on Λ.

The hierarchical structure for **Λ** is written as
(6) (7) (8) (9)
We used the fact that
(10)
where *Be*′(*a*, *b*) and indicate an inverse beta and a gamma distribution, respectively. Making the substitution , we get the equivalent hierarchical structure [61]:
(11) (12) (13) (14) (15) (16) (17)

We applied a two-component mixture model to jointly model possibly dense confounding effects by letting *θ*_{j,k} be generated from a mixture of sparse and dense components:
(18)
where the hidden variable *z*_{k}, which indicates whether or not loading *k* is sparse (1) or dense (0), is generated from the following beta-Bernoulli distribution:
(19) (20)

#### Hierarchical structure for X.

The hierarchical structure inducing sparsity in **X**, which is structurally identical to that for Λ, is:
(21) (22) (23) (24) (25) (26) (27)
with *σ*_{k,i} generated from a two component mixture. Here, the hidden variable *o*_{k}, which indicates whether or not the factor is sparse (1) or dense (0), has the following beta-Bernoulli distribution:
(28) (29)

### Variational expectation maximization

Because of the large dimension of matrices to which we apply BicMix, we used approximate methods for parameter estimation. In particular, we used variational expectation maximization (VEM) to estimate values for latent variables and parameters directly from the data in an approximate way. Extending previous work [60], the posterior probability is written as:
(30)
where we used **Θ**_{Λ} and **Θ**_{X} to denote the set of parameters related to **Λ** and **X**, respectively. Then,
(31)
and
(32)

To derive the variational EM algorithm, we expanded the posterior probability (Eq (30)) and wrote the expected complete log likelihood for parameters related to **Λ**: *Q*(**Θ**_{Λ}) = 〈ℓ_{c}(**Θ**_{Λ},**Λ**|**z**,**X**,**Y**)〉 as:
(33)
where we used 〈*X*〉 to represent the expected value of *X*.

Similarly, the expected complete log likelihood for parameters related to **X** takes the following form:
(34)
To simplify the calculation, we assumed that the joint distribution *p*(*o*_{k},*x*_{k,i}) factorizes as *p*(*o*_{k})*p*(*x*_{k,i}), implying that corresponding factors’ and loadings’ sparsity statuses are independent.

We computed the MAP estimates for the parameters that encourage sparsity in the Λ matrix, . Specifically, we solved equation to find the closed form MAP estimates. The MAP estimate for the *j*th row of **Λ**, Λ_{j,⋅}, in matrix form, is:
(35)
where
(36)
and
(37)
and **I** is the identity matrix.

We computed the expected value of **X**, 〈**X**〉, which has the following form:
(38)
where
(39)
and
(40)

We computed the expected value of :
(41)
where **Σ**_{X} denotes the covariance matrix of **X**.

Parameter *θ*_{j,k} has a generalized inverse Gaussian conditional probability [60, 61], and its MAP estimate is:
(42)

Similarly, the MAP estimate for *σ*_{k,i} has the following closed form:
(43)

The MAP estimate for *δ*_{j,k} is:
(44)

Parameter *ϕ*_{k} generates both sparse and dense components, and its MAP estimate takes the form:
(46)
where
(47) (48) (49)

Correspondingly, (50) where (51) (52) (53)

The following parameters have similar updates to *δ*_{j,k}, with simple closed forms because of the conjugacy of the distributions:
(54) (55) (56)

The corresponding parameters related to **X** have similar forms:
(57) (58) (59)

The expected value of *z*_{k}|**Θ** is computed as:
(60)

The prior on the indicator variable for sparse and dense components, *π*, has a beta distribution, and its geometric mean is the following:
(61)
where *ψ* is the digamma function.

The corresponding parameters related to **X** are
(62) (63)

Assuming that the residual precision has a conjugate (gamma) prior, , then we have (64)

In estimating the factors, we invert a *K* × *K* matrix for each sample, and, in estimating the loading matrix, we invert a *K* × *K* matrix for reach gene, so the main source of computational complexity in our algorithm is . To summarize the description above, we write the complete VEM algorithm for parameter updates:

**Algorithm 1:** Variational expectation maximization for BicMix

**Data:** *Y*, *K*, *n*_*itr*, *a*, *b*, *c*, *d*, *e*, *f*, *a*_{X}, *b*_{X}, *c*_{X}, *d*_{X}, *e*_{X}, *f*_{X}, *α*, *β*,*α*_{X}, *β*_{X}

**Initialize starting values:**

Sample *η*, *γ*, *χ*, *φ* ←

Sample *π* ← , *π*_{X} ← ,

**for** *j* ← 1 **to** *p* **do**

Sample *ψ*_{j, j} ←

**for** *k* ← 1 **to** *K* **do**

Sample *z*_{k} ← , *o*_{k} ←

Sample *ϕ*_{k}, *τ*_{k}, *ω*_{k}, *κ*_{k} ←

**for** *j* ← 1 **to** *p* **do**

Sample Λ_{j,k} ← , Sample *θ*_{j,k}, *δ*_{j,k} ←

**for** *i* ← 1 **to** *n* **do**

Sample *x*_{k,i} ← , Sample *σ*_{k i}, *ρ*_{k,i} ←

**for** *t* ← 1 **to** *n*_*itr* **do**

**for** *j* ← 1 **to** *p* **do**

Update Λ_{j, ⋅} ← Eq (35)

**for** *k* ← 1 **to** *K* **do**

Update *θ*_{j,k} ← Eq (42), *δ*_{j,k} ← Eq (44)

**for** *k* ← 1 **to** *K* **do**

Update *ϕ*_{k} ← Eq (46), *τ*_{k} ← Eq (54), *z*_{k} ← Eq (60)

Update *η* ← Eq (55), *γ* ← Eq (56), *π* ← Eq (61)

**for** *i* ← 1 **to** *n* **do**

Update *X*_{⋅,i} ← Eq (38)

**for** *k* ← 1 **to** *K* **do**

Update *σ*_{k,i} ← Eq (43), *ρ*_{k,i} ← Eq (45)

**for** *k* ← 1 **to** *K* **do**

Update *ω*_{k} ← Eq (50), *κ*_{k} ← Eq (57), *o*_{k} ← Eq (62)

Update *χ* ← Eq (58), *φ* ← Eq (59), *π*_{X} ← Eq (63)

**for** *j* ← 1 **to** *p* **do**

Update *ψ*_{j,j} ← Eq (64)

Output **Λ**, **X**, **z**, **o**

Across hundreds of runs, we found that the number of recovered factors was stable within each application with low variance. As a rule of thumb, we initialized the number of latent factors to *K* = 2 × min(*n*, *p*), and, if we find that the number of factors recovered is not reduced by approximately a third, we increased the number of initial factors until we saw this substantial reduction in *K*.

### MCMC: Conditional distributions of BicMix parameters

We derive below the conditional distributions that capture the MCMC approach that we implemented for BicMix. In our manuscript, we used MCMC to compute the warm start parameter settings in the simulations.

We updated the loading matrix **Λ** one row at a time, where each row consists of values across the *K* components. The *j*th row of the loading matrix, Λ_{j,⋅}, has the following posterior distribution,
(65)
where **V**_{j} is a *K* × *K* diagonal matrix. If we used *V*_{j,k,k} to denote the (*k*, *k*)th element for **V**_{j}, then we sampled *V*_{j,k,k} and its related parameters as follows:
(66)

Similarly, each column of **X** consists of values across the *K* components; the *i*th column of the factor matrix, *X*_{⋅,i}, has the following posterior distribution,
(67)
where **W**_{i} is a *K* × *K* diagonal matrix. If *W*_{i,k,k} denotes the (*k*, *k*)th element of *W*_{i}, then we sampled the value of *W*_{i,k,k} as follows:
(68)

We sampled values for the parameters conditional on sparse and dense state as follows. If *z*_{k} = 1,
(69) (70) (71)

Correspondingly, the following parameters related to **X** were sampled as follows. If *o*_{k} = 1
(73) (74) (75)

The following parameters are not sparse or dense component specific; they each have a gamma conditional distribution because of conjugacy: (77) (78) (79)

Parameters related to **X** were sampled as,
(80) (81) (82)

The conditional probability for *z*_{k} has a Bernoulli distribution:

Let *p*_{z} = *p*(*z*_{k} = 1|**Θ**_{Λ}); then the conditional probability for *z*_{k} is
(83)

The mixing proportion *π* has a beta conditional probability:
(84)
where is the indicator function.

Similarly for **X**, the conditional probability for *o*_{k} has a Bernoulli distribution:
(85)

Let *p*_{X} = *p*(*o*_{k} = 1|**Θ**_{X}); then the conditional probability for *o*_{k} is
(86)

The mixing proportion *π* has a beta conditional probability:
(87)
where is the indicator function.

We implemented the following MCMC algorithm for sampling the parameters of the BicMix model.

**Algorithm 2:** MCMC algorithm for BicMix

**Data:** *Y* *p* × *n* gene expression matrix, *K*, *n*_*itr*

**Initialize parameters**

**for** *t* ← 1 **to** *n*_*itr* **do**

**for** *j* ← 1 **to** *p* **do**

**for** *k* ← 1 **to** *K* **do**

Sample *V*_{j,k,k} according to Eq (66)

Sample Λ_{j,⋅} according to Eq (65)

**for** *k* ← 1 **to** *K* **do**

**if** *z*_{k} = 1 **then**

Sample *ϕ*_{k} according to Eq (71)

**for** *j* ← 1 **to** *p* **do**

Sample *θ*_{j,k} according to Eq (69), *δ*_{j,k} according to Eq (70),

**if** *z*_{k} = 0 **then**

Sample *ϕ*_{k} according to Eq (72)

**for** *k* ← 1 **to** *K* **do**

Sample *τ*_{k} according to Eq (77), *z*_{k} with Eq (83)

Sample *η* according to Eq (78), *γ* with Eq (79), *π* with Eq (84)

**for** *i* ← 1 **to** *n* **do**

**for** *k* ← 1 **to** *K* **do**

Sample *W*_{i,k,k} according to Eq (68)

Sample *X*_{⋅,i} according to Eq (67)

**for** *k* ← 1 **to** *K* **do**

**if** *o*_{k} = 1 **then**

Sample *ω*_{k} according to Eq (75)

**for** *i* ← 1 **to** *n* **do**

Sample *σ*_{k,i} according to Eq (73), *ρ*_{k,i} according to Eq (74),

**if** *o*_{k} = 0 **then**

Sample *ω*_{k} according to Eq (76)

**for** *k* ← 1 **to** *K* **do**

Sample *κ*_{k} according to Eq (80), *o*_{k} according to Eq (86)

Sample *χ* according to Eq (81), *φ* with Eq (82), *π*_{X} with Eq (87)

**for** *j* ← 1 **to** *p* **do**

Sample *ψ*_{j,j} according to Eq (88)

Output **Λ**, **X**, **z**, **o**

### Data processing and comparative methods

#### Processing the breast cancer gene expression data.

The breast cancer data set is maintained by Dana-Farber Cancer Institute at Harvard University, and is available through their R package: breastCancerNKI version 1.3.1 [70]. We removed genes with > 10% missing values. We imputed the remaining missing values using the R package impute (version 1.36.0) [71, 111]. We projected the gene expression levels of each gene to the quantiles of a standard normal. There were 24,158 genes remaining in the data set after filtering.

#### Processing the CAP gene expression data.

The Cardiovascular and Pharmacogenetics (CAP) gene expression data were generated from the Krauss Lab at the Children’s Hospital Oakland Research Institute and is publicly available at Gene Expression Omnimbus (GEO), GSE36868. We used expression levels from 8,718 expressed genes measured in a sample of 480 human immortalized blood cell lines (LCLs) [45]. The data were processed according to previous work [45, 60], including filtering unexpressed genes; however, neither known covariates nor PCs were controlled for before quantile normalization. We also removed genes with probes on the gene expression array that aligned to multiple regions of the genome using a BLAST analysis and human reference genome hg19 in order to remove gene pairs that appeared well correlated due to a co-hybridization effect. We did this because pairs of genes that were artefactually correlated due to co-hybridization were incorrectly identified by BicMix as co-regulated genes, and we wanted to control this artifact.

#### Processing the GTEx gene expression data.

We downloaded from dbGaP the Genotype-Tissue Expression (GTEx) project v4 pilot data [46]. The subset of gene expression data that we used contained *p* = 20,134 genes measured in *n* = 446 samples across four tissues: adipose (*n*_{f} = 103), artery (*n*_{a} = 118), lung (*n*_{l} = 122) and skin (*n*_{s} = 106). We preprocessed these RNA-seq data as in earlier work and recapitulated here [104].

We trimmed the RNA-seq reads from the GTEx pilot data v4 [46] using Trimmomatic (v.0.30) [112]. We also trimmed the adapter sequences and overrepresented contaminant sequences that are identified by FastQC (v.0.10.1) [113] with 2 seed mismatches and a simple clip threshold of 20. For all reads, we trimmed the leading and trailing nucleotides (low quality or Ns) until a canonical base was encountered with quality greater than 3. For adaptive quality trimming, we scanned the reads with a 4-base sliding window and trimmed when the average quality per base dropped below 20. We discarded sequences shorter than 30 nucleotides. We were left with 3.65 B read pairs in adipose, 4.02 B read pairs in artery, 4.63 B read pairs in lung, 3.63 B read pairs in skin after trimming.

Before mapping the RNA-seq reads to human reference genome assembly GRCh37.p13, we prepared the genome with STAR aligner genomeGenerate mode by setting the splice junction database (sjdbGTFfile) to GENCODE v.19, and setting the splice junction database overhang (sjdbOverhang) to 75 bp. We used default settings for the rest of the parameters. We ran STAR aligner alignReads using default settings except setting outFilterMultimapNmax to 1. We mapped 14.52 B read pairs (91%) uniquely, and we discarded the 1.2 B read pairs (7.5%) that mapped to multiple loci.

We converted the mapped reads to read counts (one read pair per read count) using the software featureCounts [114] on settings that discounted multi-overlapping and chimeric reads. We also required both ends of a read pair to be mapped to the same gene. After filtering all genes with > 25% non-zero values across all tissues, we were left with 20,134 genes. We normalized the read abundance by gene length, GC-content, and library size using the Bioconductor R package cqn [115], and the resulting gene expression values were mapped to the quantiles of the standard normal distribution. Note that we did not correct for confounding factors nor known covariates before quantile normalization. We concatenated these gene expression data sample-wise to create a single response matrix **Y** ∈ ℜ^{p×(nf + na + nl + ns)}.

### Simulation comparison

Using these simulated data, we compared BicMix to five other methods: Fabia, Plaid, CC, Bimax, and Spectral biclustering. We ran these methods using the following settings.

For Sim1, we set the number of components to the correct values, and ran each method as follows.

- We ran Fabia (version 2.10.2) using its default parameter settings.
- We ran Fabia-truth using default parameter settings. We set the sparsity threshold in Fabia to the number (from 100 quantiles of the uniform distribution over [0.1, 5]) that produced the closest match in the recovered matrices to the number of non-zero elements in the simulated data.
- We ran Plaid (implemented in the R package biclust [116] version 1.0.2) using background = TRUE to capture the noise, maximum layers were set to 10, number of iterations to find starting values was set to 10, and the number of iterations to find the layers was set to 100.
- We ran CC (implemented in the R package biclust [116] version 1.0.2) by setting the maximum accepted score delta = 1.5 and the scaling factor alpha = 1.0.
- We ran Bimax (implemented in the R package biclust [116] version 1.0.2) by setting the minimum number of rows and columns to 2.
- We ran Spectral biclustering (implemented in the R package biclust [116] version 1.0.2) by setting the normalization method to bistochastization, the number of eigenvalues for constructing the biclusters was set to 10, and the minimum number of rows and columns for the biclusters were set to 2.

For Sim2, we corrected the simulated data for the dense components by controlling for five PCs in the simulated gene expression data, and we ran the methods on the residual matrix as in Sim1, setting the number of components to 10.

### Redundancy of components

We calculated a simple statistic to check the redundancy of the multiple components recovered across multiple runs as follows. For every component in all runs, we counted the number of genes with non-zero values, denoted as *n*_{g}, and the number of samples with non-zero values, denoted as *n*_{s} for each component. We then grouped the components that share the same *n*_{g} and *n*_{s}. For each pair of components in the same group, we counted how many components have non-zero values for the same genes and the same samples (i.e., the *ℓ*_{0} norm between components, or the Manhattan distance). Redundant components corresponded to pairs of factors and loadings for which the Manhattan distance is zero.

### Algorithm for identifying gene co-expression networks from BicMix

We write out the algorithm we used to build the gene co-expression networks using the fitted BicMix model. Note that the sparsity-inducing prior on the covariance matrix of the factors increases the difficulty of computing the gene-wise covariance matrix relative to the common identity matrix covariance in the prior of the factors; however, all of the elements necessary to compute an estimate of the factor covariance matrix have been explicitly quantified in the VEM algorithm already.

**Algorithm 3:** Algorithm to construct gene co-expression network

**Data:** *p* × *K* loading matrix and *K* × *n* factor matrix; Ψ; *net*_*type*, *rep*, *c*, (*d*).

**for** *i* ← 1 **to** *n*_*runs* **do**

**for** *k*_{1} ← 1 **to** *K* − 1 **do**

**for** *k*_{2} ← *K*_{1} + 1 **to** *K* **do**

**if** *cor*(Λ_{k1}, Λ_{k2}) × *cor*(**X**_{k1}, *X*_{k2})> 0.5 **then**

discard run

**if** *net*_*type* = *subset specific* **then**

Add component to *A* when non-zero factors are only in context *c*

**if** *net*_*type* = *subset differential* **then**

Compute Wilcoxon signed-rank test for non-zero factor values across contexts *c*, *d* Add component to *A* when *p* < 1 × 10^{−10}

Construct the covariance matrix for **X** as **Σ** ← **〈** **X** **X**^{T} **〉** − **〈** **X** **〉** **〈** **X** **〉**^{T} (Eqs 41 and 38)

Calculate the variance for the residual as **Ψ**← Eq (64)

Construct the precision matrix for subset *A* as

Run GeneNet [63] on **Δ**^{i} to test significance of edges

Store edges with probability of presence ≥ 0.8

Count number of times each edge is found across all runs

Keep edges that are found ≥ *rep* times

Output the nodes and edges

Draw graph using Gephi [117]

### Quantifying the expected number of edges using ensemble method

We computed the (approximate) expected number of edges to appear *r* or more times at random using our ensemble method as follows. Let *R* represent the total number of runs, *r* represent the threshold for number of runs an edge must appear, *E* represent the total number of possible edges, and represent the average number of edges per run. Note that this is an approximate expectation because we are using instead of the true number of edges recovered in each run; this expectation has a noticeable impact on the result when the variance in edges per run is large (this was the case only in the breast cancer data set), but otherwise lead to a reasonable approximation.

We computed the probability of a single edge occurring in at least *r* networks as follows:
Then the expected number of edges that will occur *r* or more times at random was approximated as follows, assuming edge independence:
(89)

## Supporting Information

### S1 Fig. Correlation between the sparse loadings (genes), dense factors (samples), and experimental covariates in the breast cancer data.

The x-axis represents 30 recovered factors; the y-axis represents the observed covariates; darker blue and red represent large magnitude correlations, whereas white represents no correlation.

https://doi.org/10.1371/journal.pcbi.1004791.s001

(PDF)

### S2 Fig. Correlation among the known covariates in breast cancer data.

The x- and y-axes represents the observed covariates; darker blue and red represent large magnitude correlations, whereas white represents no correlation.

https://doi.org/10.1371/journal.pcbi.1004791.s002

(PDF)

### S3 Fig. Gene co-expression network specific to ER- samples.

Node size corresponds to betweenness centrality.

https://doi.org/10.1371/journal.pcbi.1004791.s003

(PDF)

### S4 Fig. Gene co-expression network specific to ER+ samples.

Node size corresponds to betweenness centrality.

https://doi.org/10.1371/journal.pcbi.1004791.s004

(PDF)

### S5 Fig. Heatmap of genes identified in the ER+ specific and ER- specific networks.

There is evidence of differential expression levels across the two sample types for these genes that are in the ER+ and ER- specific networks.

https://doi.org/10.1371/journal.pcbi.1004791.s005

(PDF)

### S6 Fig. Correlation among the known covariates in the CAP data.

The x- and y-axes represents the observed covariates; darker blue and red represent large magnitude correlations, whereas white represents no correlation.

https://doi.org/10.1371/journal.pcbi.1004791.s006

(PDF)

### S7 Fig. Smoking status specific gene co-expression networks in the CAP data.

Panel a: Gene co-expression network specific to smokers. Panel b: Gene co-expression network specific to non-smokers.

https://doi.org/10.1371/journal.pcbi.1004791.s007

(PDF)

### S8 Fig. Distribution of the number of genes in loadings and the number of samples in the factors for the GTEx data.

Both the sparse and dense loadings and factors are shown. Left: number of genes for all loadings; Middle: number of samples for all factors; Right: a summary of the PVE explained by the components across all runs, where upper bound and the lower bound of the ribbon correspond to the maximum and minimum PVE, and the solid line correspond to the median. The components are inversely sorted by the median.

https://doi.org/10.1371/journal.pcbi.1004791.s008

(PDF)

### S9 Fig. Principal components analysis applied to the GTEx pilot data.

PC1 effectively separates the four tissue types.

https://doi.org/10.1371/journal.pcbi.1004791.s009

(PDF)

### S10 Fig. Plaid applied to the GTEx pilot data.

Each sample point is plotted with jitter to denote the density of each tissue in each included or excluded component. The four factors each capture variation in one (or a subset of one) of the tissues reasonably well, except for adipose.

https://doi.org/10.1371/journal.pcbi.1004791.s010

(PDF)

### S11 Fig. Fabia with four components applied to the GTEx pilot data.

Across factors 1, 2, and 3, the four tissue types are effectively separated.

https://doi.org/10.1371/journal.pcbi.1004791.s011

(PDF)

### S12 Fig. Fabia with twenty components applied to the GTEx pilot data.

With 20 factors, Fabia is no longer able to separate the four tissue types because of limited sparsity.

https://doi.org/10.1371/journal.pcbi.1004791.s012

(PDF)

### S13 Fig. BicMix applied to the GTEx pilot data.

The substantial sparsity induced in BicMix is illustrated in these panels. Note that BicMix separates all four tissues in the first four factors.

https://doi.org/10.1371/journal.pcbi.1004791.s013

(PDF)

### S1 Table. Genes that are specific to ER- patients.

The importance of the genes are ordered by their betweenness centralities.

https://doi.org/10.1371/journal.pcbi.1004791.s014

(TEX)

### S2 Table. Genes that are specific to ER+ patients.

The importance of the genes are ordered by their betweenness centralities.

https://doi.org/10.1371/journal.pcbi.1004791.s015

(TEX)

### S3 Table. Genes that are differential between ER+ and ER- patients.

The importance of the genes are ordered by their betweenness centralities.

https://doi.org/10.1371/journal.pcbi.1004791.s016

(TEX)

### S4 Table. Genes that are in the adipose-specific network.

The genes are ordered by their betweenness centralities.

https://doi.org/10.1371/journal.pcbi.1004791.s017

(TEX)

### S5 Table. Genes that are in the artery-specific network.

The genes are ordered by their betweenness centralities.

https://doi.org/10.1371/journal.pcbi.1004791.s018

(TEX)

### S6 Table. Genes that are in the lung-specific network.

The genes are ordered by their betweenness centralities.

https://doi.org/10.1371/journal.pcbi.1004791.s019

(TEX)

### S7 Table. Genes that are in the skin-specific network.

The genes are ordered by their betweenness centralities.

https://doi.org/10.1371/journal.pcbi.1004791.s020

(TEX)

### S8 Table. Gene Ontology enrichment analysis across recovered factors for adipose tissue.

All term enrichments are for a cutoff of FDR ≤ 0.05. Gene names are suppressed, but the number of genes in each factor is shown.

https://doi.org/10.1371/journal.pcbi.1004791.s021

(TEX)

### S9 Table. Gene Ontology enrichment analysis across recovered factors for artery tissue.

All term enrichments are for a cutoff of FDR ≤ 0.05. Gene names are suppressed, but the number of genes in each factor is shown.

https://doi.org/10.1371/journal.pcbi.1004791.s022

(TEX)

### S10 Table. Gene Ontology enrichment analysis across recovered factors for lung tissue.

All term enrichments are for a cutoff of FDR ≤ 0.05. Gene names are suppressed, but the number of genes in each factor is shown.

https://doi.org/10.1371/journal.pcbi.1004791.s023

(TEX)

### S11 Table. Gene Ontology enrichment analysis across recovered factors for skin tissue.

https://doi.org/10.1371/journal.pcbi.1004791.s024

(TEX)

### S12 Table. eQTLs for adipose tissue.

Associations are listed in the order of the significance of p values.

https://doi.org/10.1371/journal.pcbi.1004791.s025

(TEX)

### S13 Table. eQTLs for artery tissue.

Associations are listed in the order of the significance of p values.

https://doi.org/10.1371/journal.pcbi.1004791.s026

(TEX)

### S14 Table. eQTLs for lung tissue.

Associations are listed in the order of the significance of p values.

https://doi.org/10.1371/journal.pcbi.1004791.s027

(TEX)

### S15 Table. eQTLs for skin tissue.

Associations are listed in the order of the significance of p values.

https://doi.org/10.1371/journal.pcbi.1004791.s028

(TEX)

## Acknowledgments

The authors would like to acknowledge conversations with Sebastian Seung and Eleazar Eskin that contributed to this manuscript.

## Author Contributions

Conceived and designed the experiments: BEE CG. Performed the experiments: CG. Analyzed the data: CG BEE. Contributed reagents/materials/analysis tools: ICM SZ. Wrote the paper: BEE CG CDB.

## References

- 1. Hung JH, Whitfield TW, Yang TH, Hu Z, Weng Z, et al. (2010) Identification of functional modules that correlate with phenotypic difference: The influence of network topology. Genome Biology 11: R23. pmid:20187943
- 2. Parkkinen JA, Kaski S (2010) Searching for functional gene modules with interaction component models. BMC Systems Biology 4: 1–11.
- 3. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95: 14863–14868.
- 4. Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering 16: 1370–1386.
- 5. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, et al. (2006) GenePattern 2.0. Nature Genetics 38: 500–501. pmid:16642009
- 6. de Souto MC, Costa IG, Araujo DSd, Ludermir TB, Schliep A (2008) Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 9: 497. pmid:19038021
- 7. Friedman N, Linial M, Nachman I, Pe’er D (2000) Using Bayesian networks to analyze expression data. Journal of Computational Biology 7: 601–620. pmid:11108481
- 8. Davidich MI, Bornholdt S (2008) Boolean network model predicts cell cycle sequence of fission yeast. PLoS ONE 3: e1672. pmid:18301750
- 9. MacNeil LT, Walhout AJM (2011) Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression. Genome Research 21: 645–657. pmid:21324878
- 10. Karlebach G, Shamir R (2008) Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology 9: 770–780. pmid:18797474
- 11. Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4. pmid:16646834
- 12. Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, et al. (2006) The Inferelator: An algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biology 7: R36. pmid:16686963
- 13. Ruan J, Dean AK, Zhang W (2010) A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC Systems Biology 4: 8. pmid:20122284
- 14. Glass K, Huttenhower C, Quackenbush J, Yuan GC (2013) Passing messages between biological networks to refine predicted interactions. PLoS ONE 8: e64832. pmid:23741402
- 15. Engelhardt B, Stephens M (2010) Analysis of population structure: A unifying framework and novel methods based on sparse factor analysis. PLoS Genetics 6: e1001117. pmid:20862358
- 16. Carvalho CM, Lucas JE, Wang Q, Chang J, Nevins JR, et al. (2008) High-dimensional sparse factor modelling: Applications in gene expression genomics. Journal of the American Statistical Association 103: 1438–1456. pmid:21218139
- 17. West M (2003) Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statistics 7: 723–732.
- 18. Bhattacharya A, Dunson DB (2011) Sparse Bayesian infinite factor models. Biometrika 98: 291–306. pmid:23049129
- 19. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, et al. (2007) Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315: 848–853. pmid:17289997
- 20. Brown CD, Mangravite LM, Engelhardt BE (2013) Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs. PLoS Genetics 9: e1003649. pmid:23935528
- 21. Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10: 515–534. pmid:19377034
- 22. Srivastava S, Engelhardt BE, Dunson DB (2014) Expandable factor analysis. arXiv preprint arXiv:14071158: 1–32.
- 23. Cheng Y, Church GM (2000) Biclustering of expression data. Proceedings of the International Conference on Intelligent Systems for Molecular Biology 8: 93–103.
- 24. Ben-Dor A, Chor B, Karp R, Yakhini Z (2003) Discovering local structure in gene expression data: The order-preserving submatrix problem. Journal of Computational Biology 10: 373–384. pmid:12935334
- 25. Murali TM, Kasif S (2003) Extracting conserved gene expression motifs from gene expression data. Proceedings of the Pacific Symposium on Biocomputing: 77–88.
- 26. Li G, Ma Q, Tang H, Paterson AH, Xu Y (2009) QUBIC: A qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Research 37: e101–e101. pmid:19509312
- 27. Prelic A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, et al. (2006) A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22: 1122–1129. pmid:16500941
- 28. Bergmann S, Ihmels J, Barkai N (2003) Iterative signature algorithm for the analysis of large-scale gene expression data. Physical review E, Statistical, nonlinear, and soft matter physics 67: 031902. pmid:12689096
- 29. Huttenhower C, Mutungu KT, Indik N, Yang W, Schroeder M, et al. (2009) Detailing regulatory networks through large scale data integration. Bioinformatics 25: 3267–3274. pmid:19825796
- 30. Lazzeroni L, Owen A (2000) Plaid models for gene expression data. Statistica Sinica 12: 61–86.
- 31. Gu J, Liu JS (2008) Bayesian biclustering of gene expression data. BMC Genomics 9: S4. pmid:18366617
- 32.
Bozdag D, Parvin JD, Catalyurek UV (2009) A biclustering method to discover co-regulated genes using diverse gene expression datasets. In: Rajasekaran S, editor, Bioinformatics and Computational Biology, Springer Berlin Heidelberg, number 5462 in Lecture Notes in Computer Science. pp. 151–163.
- 33. Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, et al. (2010) FABIA: Factor analysis for bicluster acquisition. Bioinformatics 26: 1520–1527. pmid:20418340
- 34. Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research 13: 703–716. pmid:12671006
- 35. Aguilar-Ruiz JS (2005) Shifting and scaling patterns from gene expression data. Bioinformatics 21: 3840–3845. pmid:16144809
- 36. Storey JD, Akey JM, Biswas S, Leek JT (2007) On the design and analysis of gene expression studies in human populations. Nature Genetics 39: 808–809.
- 37. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11: 733–739. pmid:20838408
- 38. Leek J, Storey J (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3: e161–1735.
- 39. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, et al. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464: 768–772. pmid:20220758
- 40. Stegle O, Parts L, Durbin R, Winn J (2010) A bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eqtl studies. PLoS Computational Biology 6: e1000770. pmid:20463871
- 41. Listgarten J, Kadie C, Schadt E, Heckerman D (2010) Correction for hidden confounders in the genetic analysis of gene expression. Proceedings of the National Academy of Sciences of the United States of America 107: 16465–16470. pmid:20810919
- 42. Pierson E, Koller D, Battle A, Mostafavi S, Consortium G, et al. (2015) Sharing and specificity of co-expression networks across 35 human tissues. PLOS Computational Biology 11: e1004220. pmid:25970446
- 43. Van’t Veer LJ, Dai H, Van de Vijver MJ, He YD, Hart AAM, et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530–536.
- 44. van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AAM, et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine 347: 1999–2009. pmid:12490681
- 45. Brown CD, Mangravite LM, Engelhardt BE (2013) Integrative Modeling of eQTLs and Cis-Regulatory Elements Suggests Mechanisms Underlying Cell Type Specificity of eQTLs. PLoS Genetics 9: e1003649+. pmid:23935528
- 46. Ardlie KG, Deluca DS, Segre AV, Sullivan TJ, Young TR, et al. (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 348: 648–660.
- 47. Hartigan JA (1972) Direct clustering of a data matrix. Journal of the American Statistical Association 67: 123–129.
- 48. Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: A structured overview. Statistical Methods in Medical Research 13: 363–394. pmid:15516031
- 49. Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Transactions on Knowledge and Data Engineering 18: 902–916.
- 50. Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3: 1–58.
- 51. Yoon S, Benini L, De Micheli G (2007) Co-clustering: a versatile tool for data analysis in biomedical informatics. IEEE transactions on information technology in biomedicine: a publication of the IEEE Engineering in Medicine and Biology Society 11: 493–494.
- 52. Busygin S, Prokopyev O, Pardalos PM (2008) Biclustering in data mining. Computers & Operations Research 35: 2964–2987.
- 53. Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions in Computational Biology and Bioinformatics 1: 24–45.
- 54. Madeira SC, Oliveira AL (2009) A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series. Algorithms for Molecular Biology 4: 8. pmid:19497096
- 55. Turner H, Bailey T, Krzanowski W (2005) Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics & Data Analysis 48: 235–254.
- 56.
Santamaría R, Quintales L, Therón R (2007) Methods to bicluster validation and comparison in microarray data. In: Yin H, Tino P, Corchado E, Byrne W, Yao X, editors, Intelligent Data Engineering and Automated Learning, Springer Berlin Heidelberg, number 4881 in Lecture Notes in Computer Science. pp. 780–789.
- 57. Neng Fan NB (2009) Recent advances of data biclustering with application in computational neuroscience: 105–132.
- 58.
de Castro P, de Franga F, Ferreira H, Von Zuben F (2007) Evaluating the performance of a biclustering algorithm applied to collaborative filtering: A comparative analysis. In: Proceedings of the 7th International Conference on Hybrid Intelligent Systems. pp. 65–70.
- 59. Eren K, Deveci M, Kücüktunc O, Catalyürek UV (2012) A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics: 32.
- 60. Gao C, Brown CD, Engelhardt BE (2013) A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects. ArXiv preprint arXiv:13104792.
- 61.
Armagan A, Dunson DB, Clyde M (2011) Generalized beta mixtures of Gaussians. In: Proceedings of Neural Information Processing Systems. pp. 523–531.
- 62.
Gao C, Engelhardt B (2012) A sparse factor analysis model for high dimensional latent spaces. NIPS: Workshop on Analysis Operator Learning vs Dictionary Learning: Fraternal Twins in Sparse Modeling.
- 63. Schäfer J, Strimmer K (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21: 754–764. pmid:15479708
- 64. Schäfer J, Strimmer K (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4. pmid:16646851
- 65. Breiman L (1996) Bagging predictors. Machine Learning 24: 123–140.
- 66. Carvalho CM, Polson NG, Scott JG (2010) The horseshoe estimator for sparse signals. Biometrika 97: 465–480.
- 67. Strawderman WE (1971) Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics 42: 385–388.
- 68. Berger J (1980) A robust generalized Bayes estimator and confidence region for a multivariate normal mean. The Annals of Statistics 8: 716–761.
- 69. Van’t Veer LJ, Dai H, Van de Vijver MJ, He YD, Hart AAM, et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530–536.
- 70.
Schroeder M, Haibe-Kains B, Culhane A, Sotiriou C, Bontempi G, et al. (2011) breastCancerNKI: Gene expression dataset. R package version 1.0.6.
- 71.
Hastie T, Tibshirani R, Sherlock G, Eisen M, Brown P, et al. (1999) Imputing missing data for gene expression arrays. Technical report.
- 72. Zhang MH, Man HT, Zhao XD, Dong N, Ma SL (2014) Estrogen receptor-positive breast cancer molecular signatures and therapeutic potentials. Biomedical Reports 2: 41–52. pmid:24649067
- 73. Hu Y, Wu G, Rusch M, Lukes L, Buetow KH, et al. (2012) Integrated cross-species transcriptional network analysis of metastatic susceptibility. Proceedings of the National Academy of Sciences 109: 3184–3189.
- 74.
Schüle J, Bergkvist L, Håkansson L, Gustafsson B, Håkansson A (2002) Down-regulation of the
*CD3*-*ζ*chain in sentinel node biopsies from breast cancer patients. Breast Cancer Research and Treatment 74: 33–40. pmid:12150450 - 75.
Yu B, Zhang W (2011) Down-regulation of
*CD3*-*ζ*is a breast cancer biomarker associated with immune suppression. Cell Biology International 35: 165–169. pmid:20883209 - 76.
Oghumu S, Varikuti S, Terrazas C, Kotov D, Nasser MW, et al. (2014)
*CXCR3*deficiency enhances tumor progression by promoting macrophage M2 polarization in a murine breast cancer model. Immunology 143: 109–119. pmid:24679047 - 77.
Li Y, Reader JC, Ma X, Kundu N, Kochel T, et al. (2014) Divergent roles of
*CXCR3*isoforms in promoting cancer stem-like cell survival and metastasis. Breast Cancer Research and Treatment: 1–13. - 78.
King TD, Suto MJ, Li Y (2012) The wnt/
*β*-catenin signaling pathway: A potential therapeutic target in the treatment of triple negative breast cancer. Journal of Cellular Biochemistry 113: 13–18. pmid:21898546 - 79. Verghese ET, Drury R, Green CA, Holliday DL, Lu X, et al. (2013) MiR-26b is down-regulated in carcinoma-associated fibroblasts from ER-positive breast cancers leading to enhanced cell migration and invasion. The Journal of Pathology 231: 388–399. pmid:23939832
- 80.
Mansour AA, Gafni O, Weinberger L, Zviran A, Ayyash M, et al. (2012) The H3K27 demethylase
*UTX*regulates somatic and germ cell epigenetic reprogramming. Nature 488: 409–413. pmid:22801502 - 81.
Van der Meulen J, Sanghvi V, Mavrakis K, Durinck K, Fang F, et al. (2015) The H3K27me3 demethylase
*UTX*is a gender-specific tumor suppressor in T-cell acute lymphoblastic leukemia. Blood 125: 13–21. pmid:25320243 - 82.
Aasen E, Medrano J (1990) Amplification of the
*ZFY*and*ZFX*genes for sex identification in humans, cattle, sheep and goats. Biotechnology 8: 1279–1281. pmid:1369448 - 83.
Xu J, Taya S, Kaibuchi K, Arnold A (2005) Sexually dimorphic expression of
*USP9X*is related to sex chromosome complement in adult mouse brain. The European Journal of Neuroscience 21: 3017–3022. pmid:15978012 - 84. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, et al. (2003) DAVID: Database for annotation, visualization, and integrated discovery. Genome Biology 4: P3. pmid:12734009
- 85. Cao Y (2013) Angiogenesis and vascular functions in modulation of obesity, adipose metabolism, and insulin sensitivity. Cell Metabolism 18: 478–489. pmid:24035587
- 86. Basu S, Fenton MJ (2004) Toll-like receptors: Function and roles in lung disease. American Journal of Physiology 286: L887–L892. pmid:15064235
- 87.
Hosooka T, Noguchi T, Kotani K, Nakamura T, Sakaue H, et al. (2008)
*DOK1*mediates high-fat diet–induced adipocyte hypertrophy and obesity through modulation of*PPAR*-*γ*phosphorylation. Nature Medicine 14: 188–193. pmid:18204460 - 88.
Yeung F, Ramírez CM, Mateos-Gomez PA, Pinzaru A, Ceccarini G, et al. (2013) Non-telomeric role for
*RAP1*in regulating metabolism and protecting against obesity. Cell Reports 3: 1847–1856. pmid:23791522 - 89.
Jun HS, Hwang K, Kim Y, Park T (2008) High-fat diet alters
*PP2A*,*TC10*, and*CIP4*expression in visceral adipose tissue of rats. Obesity 16: 1226–1231. pmid:18388891 - 90.
Oliver P, Caimari A, Díaz-Rúa R, Palou A (2012) Diet-induced obesity affects expression of adiponutrin/
*PNPLA3*and adipose triglyceride lipase, two members of the same family. International Journal of Obesity 36: 225–232. pmid:21556044 - 91.
Traurig MT, Orczewska JI, Ortiz DJ, Bian L, Marinelarena AM, et al. (2013) Evidence for a role of
*LPGAT1*in influencing BMI and percent body fat in Native Americans. Obesity 21: 193–202. pmid:23505186 - 92.
Masiero M, Simões FC, Han HD, Snell C, Peterkin T, et al. (2013) A core human primary tumor angiogenesis signature identifies the endothelial orphan receptor
*ELTD1*as a key regulator of angiogenesis. Cancer Cell 24: 229–241. pmid:23871637 - 93. Wojciak-Stothard B, Abdul-Salam VB, Lao KH, Tsang H, Irwin DC, et al. (2014) Aberrant chloride intracellular channel 4 expression contributes to endothelial dysfunction in pulmonary arterial hypertension. Circulation 129: 1770–1780. pmid:24503951
- 94.
Zhang Y, Ling Z, Deng S, Du H, Yin Y, et al. (2014) Associations between
*CD36*gene polymorphisms and susceptibility to coronary artery heart disease. Brazilian Journal of Medical and Biological Research 47: 895–903. pmid:25118627 - 95.
Koh JT, Lee ZH, Ahn KY, Kim JK, Bae CS, et al. (2001) Characterization of mouse brain-specific angiogenesis inhibitor 1 (
*BAI1*) and phytanoyl-CoA alpha-hydroxylase-associated protein 1, a novel*BAI1*-binding protein. Molecular Brain Research 87: 223–237. pmid:11245925 - 96.
Villar J, Cabrera NE, Casula M, Flores C, Valladares F, et al. (2010) Mechanical ventilation modulates
*TLR4*and*IRAK-3*in a non-infectious, ventilator-induced lung injury model. Respiratory Research 11: 27. pmid:20199666 - 97.
Grumelli S, Lu B, Peterson L, Maeno T, Gerard C (2011)
*CD46*protects against chronic obstructive pulmonary disease. PLoS ONE 6: e18785. pmid:21573156 - 98.
Burdorf L, Stoddard T, Zhang T, Rybak E, Riner A, et al. (2014) Expression of human
*CD46*modulates inflammation associated with*GalTKO*lung xenograft injury. American Journal of Transplantation 14: 1084–1095. pmid:24698431 - 99. Reijmerink NE, Postma DS, Koppelman GH (2010) The candidate gene approach in asthma: What happens with the neighbours? European Journal of Human Genetics 18: 17. pmid:19654613
- 100. Skawran B, Steinemann D, Becker T, Buurman R, Flik J, et al. (2008) Loss of 13q is associated with genes involved in cell cycle and proliferation in dedifferentiated hepatocellular carcinoma. Modern Pathology 21: 1479–1489. pmid:18820673
- 101.
Xie S, Luca M, Huang S, Gutman M, Reich R, et al. (1997) Expression of
*MCAM/MUC18*by human melanoma cells leads to increased tumor growth and metastasis. Cancer Research 57: 2295–2303. pmid:9187135 - 102.
Mills L, Tellez C, Huang S, Baker C, McCarty M, et al. (2002) Fully human antibodies to
*MCAM/MUC18*inhibit tumor growth and metastasis of human melanoma. Cancer Research 62: 5106–5114. pmid:12208768 - 103.
Taungjaruwinai WM, Bhawan J, Keady M, Thiele JJ (2009) Differential expression of the antioxidant repair enzyme methionine sulfoxide reductase (
*MSRA*and*MSRB*) in human skin. The American Journal of Dermatopathology 31: 427–431. pmid:19542914 - 104.
McDowell I, Pai A, Guo C, Vockley C, Brown C, et al. (2014) Identification of long intergenic non-coding RNA eQTLs in four primary tissues reveals association with obesity-related traits. In Review.
- 105. Villarroya J, Dorado B, Vila MR, Garcia-Arumí E, Domingo P, et al. (2011) Thymidine kinase 2 deficiency-induced mitochondrial DNA depletion causes abnormal development of adipose tissues and adipokine levels in mice. PLoS ONE 6: e29691. pmid:22216345
- 106. Sackmann-Sala L, Berryman DE, Lubbers ER, Zhang H, Vesel CB, et al. (2014) Age-related and depot-specific changes in white adipose tissue of growth hormone receptor-null mice. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences 69: 34–43.
- 107. Zheng X, Xu C, Smith AO, Stratman AN, Zou Z, et al. (2012) Dynamic regulation of the cerebral cavernous malformation pathway controls vascular stability and growth. Developmental Cell 23: 342–355. pmid:22898778
- 108.
Kusuhara S, Fukushima Y, Fukuhara S, Jakt LM, Okada M, et al. (2012)
*ARHGEF15*promotes retinal angiogenesis by mediating*VEGF*-induced*CDC42*activation and potentiating*RHOJ*inactivation in endothelial cells. PLoS ONE 7: e45858. pmid:23029280 - 109.
McMillan SJ, Sharma RS, McKenzie EJ, Richards HE, Zhang J, et al. (2013)
*Siglec-E*is a negative regulator of acute pulmonary neutrophil inflammation and suppresses*CD11b**β*2-integrin–dependent signaling. Blood 121: 2084–2094. pmid:23315163 - 110.
Stienstra Y, Van Der Werf T, Oosterom E, Nolte I, Van der Graaf W, et al. (2006) Susceptibility to Buruli ulcer is associated with the
*SLC11A1*(*NRAMP1*) D543N polymorphism. Genes and Immunity 7: 185–189. pmid:16395392 - 111. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520–525. pmid:11395428
- 112. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114–2120. pmid:24695404
- 113.
Andrews S (2012). http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/.
- 114. Liao Y, Smyth GK, Shi W (2014) featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30: 923–930. pmid:24227677
- 115. Hansen KD, Irizarry RA, Wu Z (2012) Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13: 204–216. pmid:22285995
- 116.
Kaiser S, Santamaria R, Theron R, Quintales L, Leisch F (2009) biclust: Bicluster algorithms. R package version 07 2.
- 117.
Bastian M, Heymann S, Jacomy M (2009) Gephi: An open source software for exploring and manipulating networks.