A mixture model to detect edges in sparse co-expression graphs with an application for comparing breast cancer subtypes

We develop a method to recover a gene network's structure from co-expression data, measured in terms of normalized Pearson's correlation coefficients between gene pairs. We treat these co-expression measurements as weights in the complete graph in which nodes correspond to genes. To decide which edges exist in the gene network, we fit a three-component mixture model such that the observed weights of 'null edges' follow a normal distribution with mean 0, and the non-null edges follow a mixture of two lognormal distributions, one for positively- and one for negatively-correlated pairs. We show that this so-called L 2 N mixture model outperforms other methods in terms of power to detect edges, and that it allows us to control the false discovery rate. Importantly, our method makes no assumptions about the true network structure. We demonstrate our method, which is implemented in an R package called edgefinder, using a large dataset consisting of expression values of 12,750 genes obtained from 1,616 women. We infer the gene network structure by cancer subtype and find insightful subtype characteristics. For example, we find thirteen pathways which are enriched in each of the cancer groups but not in the Normal group, two of which are associated with autoimmune diseases and two others with graft rejection. We also find specific characteristics of different breast cancer subtypes. For example, the Luminal A network includes a single, highly connected cluster of genes which is enriched in the human diseases category, and in the Her2 subtype network we find a distinct, highly interconnected cluster which is uniquely enriched in drug metabolism pathways.

simple but faster algorithm called the graphical lasso, which also estimates a sparse precision matrix, using a coordinate descent procedure for the lasso. Yuan and Lin [28], Banerjee et al. [29], Rothman et al. [30] and Levina et al. [31] also proposed algorithms to solve the l 1 -penalized Gaussian log-likelihood to estimate a large, sparse precision matrix. These methods have been used extensively to estimate sparse gene networks. The use of the precision matrix for estimating a gene network is justified when the data are generated from a multivariate normal distribution. Under this assumption, zero elements of the precision matrix correspond to conditional independence restrictions between nodes in a gene network. These approaches work well when analyzing gene data obtained from large retrospective studies such as the TCGA data [32]. However, applying these approaches to datasets with a relatively small number of samples may not yield robust estimates of the regression coefficients.
The main goal of this paper is to introduce a new method to uncover the structure of sparse gene networks from expression data. Rather than estimating the precision matrix, we use the covariance matrix to determine which genes are co-expressed. Sparsity in the covariance matrix (rather than the precision matrix) is motivated mostly by biological arguments: (i) expression levels of genes are associated with their functionality, and thus are expected to be highly correlated for genes that share the same functional process; and (ii) the number of genes associated with most biological functions is small, relative to the total number of genes. There is also a strong mathematical argument for modeling sparsity in the covariance matrix, rather than the precision matrix. For each gene, its expression profile is an N-dimensional vector, where N is the sample size, and for each pair of genes the correlation between their expression profiles represents the cosine of the angle between the two corresponding vectors. So, the correlation matrix offers an intuitive representation of the similarity between expression profiles. As Frankl and Maehara [33] showed, as N increases, any two random vectors in the Euclidean N-dimensional space are approximately orthogonal with probability that approaches 1. Thus, the correlation between two random expression profiles is expected to be close to cos(π/2) = 0, which means that the correlation (and hence, the covariance) matrix is expected to be sparse.
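The near-orthogonality argument is easy to check empirically. The following Python sketch (illustrative only, not part of the edgefinder package) draws independent pairs of "expression profiles" and shows that their average absolute correlation shrinks toward 0 as the sample size N grows:

```python
import numpy as np

# Illustration of the Frankl-Maehara argument: correlations between
# independent 'expression profiles' concentrate near cos(pi/2) = 0
# as the sample size N grows.
rng = np.random.default_rng(0)
for N in (10, 100, 1000):
    x = rng.standard_normal((N, 200))   # 200 random profiles of length N
    y = rng.standard_normal((N, 200))   # 200 more, independent of x
    r = [np.corrcoef(x[:, j], y[:, j])[0, 1] for j in range(200)]
    print(N, round(float(np.mean(np.abs(r))), 3))
```

The mean absolute correlation decays roughly like 1/√N, consistent with the claim that the correlation matrix of many unrelated genes is approximately sparse.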
While gene expression datasets consist of thousands of genes and the complete network contains millions of edges, in most biological systems gene networks are expected to be sparse [34]. With a finite sample, however, the observed correlations are not zero, even for uncorrelated pairs, so the first problem is to define an accurate decision rule to differentiate between spurious correlations and true edges. A second, related problem, is how to perform the computation needed to implement such a rule in practice.
To address these challenges, we first obtain weights, w ij , by applying Fisher's Z transformation to the sample correlation coefficients, r ij . For uncorrelated pairs, the asymptotic distribution of w ij is normal, with mean zero and variance 1/(N − 3). This motivates fitting a mixture model to {w ij } in which the majority of pairs belong to a normally distributed 'null component', and a small percentage of the weights belong to one of two 'non-null components', which follow log-normal distributions (one for positive and one for negative correlations). This so-called L 2 N model was first presented in Bar and Schifano [35], in the context of identifying differentially expressed (or dispersed) genes. We use the L 2 N model to recover a sparse gene network for the following reasons. First, the L 2 N mixture model leads to shrinkage estimation and to borrowing strength across all pairs, which increases the power to detect co-expressed pairs. Second, the specific form of the mixture model allows us to establish a decision rule which controls the error rate. Third, the mixture model lends itself to a computationally-efficient estimation of the parameters via the EM algorithm [36]. Note that when Bar and Schifano [35] used the model, they assumed independence between genes, which is not the case in our application. Furthermore, they used the L 2 N model in the k-group comparison setting, whereas in our application we fit the model to detect a network for each group. This paper is organized as follows: in Section 2 we present our method to uncover the structure of gene networks, and discuss some computational challenges and our proposed solutions. In Section 3 we evaluate the goodness of fit of our method and its ability to correctly recover the network structure, and compare our method's ability to correctly identify true edges with that of existing methods. In Section 4 we apply our method to a publicly available collection of breast cancer datasets from twelve studies.
We conclude with a discussion in Section 5.

A Mixture Model for Edge Indicators in a Gene Network
A gene network can be represented by a weighted, undirected graph in which each node corresponds to a gene and each edge corresponds to a pair of genes that are 'co-expressed', meaning that their expression levels are highly correlated. The weights represent the strength of the connection between two genes, namely, their tendency to be co-expressed.
Given normalized expression data of G genes, our objective is to discover the network structure, namely, which of the K = G(G − 1)/2 pairs are co-expressed. Our general strategy is to associate with every putative edge in the complete graph on G nodes a latent indicator variable whose value (0 or 1) is determined by a statistical model.
To start, we define the edge weights in terms of pairwise correlation coefficients. Let x g be a vector of normalized expression levels for gene g ∈ {1, . . . , G} obtained from N samples (N > 3). Suppose that the true correlation coefficient between genes m and n is ρ mn , and let r mn = corr(x m , x n ) be the observed correlation coefficient. Using Fisher's Z transformation, we obtain the estimated weight w mn = arctanh(r mn ), which is known to be approximately normally distributed, with mean arctanh(ρ mn ) and variance 1/(N − 3). Let E = {e mn } be the set of true edges in the network. We assume that G is large and that most pairs are not co-expressed, so the network is sparse: |E| ≪ K. This assumption, along with the asymptotic normality of w mn , motivates our model choice. Specifically, we assume that the weights follow the so-called L 2 N mixture distribution in Bar and Schifano [35].
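The weight computation can be sketched in a few lines. The following is an illustrative Python version (the paper's implementation is the R package edgefinder):

```python
import numpy as np

def fisher_z_weights(X):
    """Edge weights w_mn = arctanh(r_mn) from an N x G expression matrix X."""
    R = np.corrcoef(X, rowvar=False)       # G x G sample correlation matrix
    iu = np.triu_indices_from(R, k=1)      # the K = G(G-1)/2 gene pairs
    return np.arctanh(R[iu])

# For uncorrelated pairs, w_mn is approximately N(0, 1/(N - 3)):
rng = np.random.default_rng(0)
N, G = 100, 50
w = fisher_z_weights(rng.standard_normal((N, G)))
print(round(float(np.var(w)), 4), round(1 / (N - 3), 4))
```

With independent profiles, the empirical variance of the weights is close to the asymptotic value 1/(N − 3), which is what motivates the null component of the mixture.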
Note that σ 2 consists of two variance components, namely, σ 2 = 1/(N − 3) + σ 2 0 , where 1/(N − 3) is the variance component due to the asymptotic distribution of arctanh(r mn ), whereas σ 2 0 is due to the random effect model, which allows us to account for extra variability among uncorrelated pairs. A graphical representation of the L 2 N model is shown in Figure 1.
If we denote the null mixture component in the L 2 N model by C 0 , the two non-null components by C 1 and C 2 , the corresponding probability density functions by f j , and the mixture probabilities by p j , for j = 0, 1, 2, such that p 0 + p 1 + p 2 = 1, then we classify a putative edge between nodes m and n in the complete graph into one of the three mixture components based on the posterior probabilities, P r(e mn ∈ C j |w mn ) = p j f j (w mn ) / Σ k p k f k (w mn ). Let b mn = (b 0mn , b 1mn , b 2mn ) be an indicator vector, so that b jmn = 1 for the component j with the highest posterior probability, P r(e mn ∈ C j |w mn ), for the pair mn, and 0 for the other two components. Using this notation, the G × G matrix A = [1 − b 0mn ] denotes the adjacency matrix between the G nodes in the graph. Our goal is to obtain an accurate estimate of A. To do that, we treat the indicators b mn as missing data, and use the EM algorithm [36] to estimate the parameters of the mixture model. The hierarchical and parsimonious nature of the L 2 N model leads to shrinkage estimation and borrowing strength across all pairs of genes, as well as to computational efficiency. This is critical, since K is typically very large and can be much larger than the sample size, N . Details regarding the parameter estimation for the L 2 N model can be found in [35]. The algorithm is implemented as an R package called edgefinder (see the Supplementary Material).
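The classification step (the E-step of the EM algorithm, given current parameter values) can be sketched as follows. This is an illustrative Python version with made-up parameter values; the component parameterization (a normal null, a lognormal for positive weights, and a mirrored lognormal for negative weights) follows the L 2 N description above, but the parameter names are ours:

```python
import numpy as np
from scipy import stats

def posterior_probs(w, p, sig0, theta1, kap1, theta2, kap2):
    """Posterior probabilities Pr(e_mn in C_j | w_mn), j = 0, 1, 2.

    C0: N(0, sig0^2) for null pairs; C1: lognormal on w > 0;
    C2: mirrored lognormal on w < 0. Parameter names are illustrative.
    """
    f0 = stats.norm.pdf(w, 0.0, sig0)
    f1 = np.where(w > 0, stats.lognorm.pdf(w, s=kap1, scale=np.exp(theta1)), 0.0)
    f2 = np.where(w < 0, stats.lognorm.pdf(-w, s=kap2, scale=np.exp(theta2)), 0.0)
    num = np.vstack([p[0] * f0, p[1] * f1, p[2] * f2])
    return num / num.sum(axis=0)            # columns sum to 1 (Bayes' rule)

# Classify each putative edge to the component with highest posterior:
w = np.array([0.01, 0.9, -0.8])
post = posterior_probs(w, p=(0.95, 0.03, 0.02), sig0=0.1,
                       theta1=-0.25, kap1=0.5, theta2=-0.25, kap2=0.5)
b = post.argmax(axis=0)                     # 0 = null, 1 = C1, 2 = C2
print(b)                                    # [0 1 2]
```

A weight near zero is assigned to the null component, while large positive and negative weights are assigned to C 1 and C 2 respectively; the adjacency matrix entry is then 1 − b 0mn.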

Implementation Notes
There are a couple of challenges related to the parameter estimation via the EM algorithm which should be addressed. First, the complete set of pairwise correlation coefficients, and thus the normalized weights, w mn , cannot be assumed to be mutually independent. Conditionally, they are all asymptotically normally distributed with variance 1/(N − 3), but for a fixed m, some w mn may be correlated. Second, obtaining estimates based on all K pairwise correlations is time-consuming and requires storing a very large matrix in the computer's memory, since the values of the indicator variables, b jmn , for each pair of genes must be updated in each iteration of the EM algorithm.
When the number of genes is large, as is the case in the applications which motivated this paper, we propose taking a random sample of G ′ < G genes (e.g., G ′ = 1000) and fitting the mixture model to this random subset. Using the smaller subset of genes greatly improves computational efficiency and yields highly accurate estimates. Furthermore, randomly selecting a subset of the genes allows us to assume that the remaining weights used in the (complete data) likelihood function are approximately independent. With the estimates obtained from the random subset, it is then possible to compute the posterior probabilities (4) for all K pairs, even on a computer with standard memory capacity.
In addition to computational efficiency and increasing power through shrinkage estimation, the mixture model allows us to estimate how many pairs of genes are correctly (or incorrectly) classified as co-expressed. Specifically, for a predetermined posterior probability ratio threshold, T > 1, we set b 0mn = 1 if the ratio of the posterior probability of the null component to that of the non-null components exceeds T , and b 0mn = 0 otherwise. Alternatively, for any α, we can (numerically) find thresholds c 1 > 0 and c 2 < 0 such that the estimated probability, under the null component, of observing a weight greater than c 1 or smaller than c 2 is α. That is, we control the estimated probability of a Type I error at a certain level, α. Similarly, we can control the false discovery rate [37]. Using the thresholds c 1 and c 2 , we can also estimate the probability of a Type II error, namely the estimated probability that a non-null weight falls inside [c 2 , c 1 ].

Simulation Study
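The threshold computation can be illustrated with a simplified null component. In this Python sketch we assume the fitted null is N(0, sig0²) and split α symmetrically between the two tails; edgefinder solves for c 1 and c 2 numerically from the full fitted mixture, so this is only a sketch of the idea:

```python
import numpy as np
from scipy import stats

# Assumed fitted null component: N(0, sig0^2); target Type I error alpha.
alpha, sig0 = 0.01, 0.1

# A simple symmetric choice puts alpha/2 of the null mass in each tail.
c1 = stats.norm.ppf(1 - alpha / 2, scale=sig0)   # c1 > 0
c2 = stats.norm.ppf(alpha / 2, scale=sig0)       # c2 < 0

# Estimated Type I error: null mass outside [c2, c1].
type1 = stats.norm.sf(c1, scale=sig0) + stats.norm.cdf(c2, scale=sig0)
print(round(type1, 4))   # 0.01
```

Declaring an edge whenever w mn > c 1 or w mn < c 2 then controls the estimated null misclassification probability at α; the same cutoffs can be plugged into the non-null densities to estimate the Type II error.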

Data Generated Under the L 2 N Model
In the first simulation study, we assess the power and goodness of fit of our model using different configurations, with varying numbers of genes (G), samples (N), degrees of sparsity (p 1 + p 2 ), and graph structures. In this section, data are generated according to the L 2 N model from Section 2, and we use different parameters for the log-normal components. We show representative results with N = 100 and G = 500 (thus K, the maximum possible number of edges, is 124,750). Four network configurations are used in this section. They are described in terms of the shape of the G × G adjacency matrix, A, where I denotes an identity matrix and 0 a matrix of zeros. For example, the complete configuration contains one clique (complete subgraph) with 100 nodes, and nodes not in the clique are not connected to other nodes; here |E| = 4,950 (≈ 0.04K), p 1 = 0.0396, and p 2 = 0.
For pairs i, j such that A ij = 0, we generated w ij independently from a standard normal distribution. In the complete and two independent blocks configurations, for A ij = 1 we generated only positively correlated pairs (so p 2 = 0), and in the two negatively correlated blocks configuration, the pairs were positively correlated within each block, but pairs across the two blocks were generated to be negatively correlated. For the ar structure, weights generated from the log-normal distribution appear in the off-diagonals of A in decreasing order. That is, the largest G − 1 weights are placed randomly in the secondary diagonal (elements A i,i+1 ), the next G − 2 largest weights are placed randomly in the ternary diagonal (elements A i,i+2 ), etc. (Note that the values A ij in this case are only used to indicate that elements along diagonals are equal, but these values are not used to generate the weights.) All four configurations are sparse, with only 2-4% of the putative edges being present in the graph. The two negatively correlated blocks structure has the same sparsity as the complete and ar graphs, but it has different weights for the non-null mixture components. The ar graph has a more constrained structure, with weights among pairs of nodes which decay as |i − j| increases, for i, j ≤ S. Recall, however, that our method does not rely on any assumptions about the structure of the graph. Therefore, it is expected that its performance will depend only on the parameters involved in the mixture model. In the simulations presented here we examined the power of the method to detect edges in the graphs as a function of the location parameters of the log-normal components. We varied θ 1 over a grid of values (up to 0.75), and set κ 2 1 = 0.25. In the two negatively correlated blocks configuration, we used θ 2 = θ 1 and κ 2 = κ 1 .
The weights which were generated according to the L 2 N model were transformed into correlation coefficients using the tanh function, and the resulting covariance matrix was used to simulate expression values for 500 genes and 100 subjects. For each configuration we generated 20 different datasets.
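The generation pipeline for one configuration can be sketched as follows. This is an illustrative Python version with two simplifications of our own: the null weights are scaled to the asymptotic standard deviation 1/√(N − 3) so that the back-transformed correlations behave like sample correlations of uncorrelated pairs, and the resulting correlation matrix is projected to the nearest positive semi-definite matrix (by eigenvalue clipping) before sampling:

```python
import numpy as np

rng = np.random.default_rng(1)
G, N = 500, 100

# 'complete' configuration: one clique of 100 nodes; all others isolated.
A = np.zeros((G, G), dtype=int)
A[:100, :100] = 1
np.fill_diagonal(A, 0)

# Null weights ~ N(0, 1/(N-3)); edge weights ~ lognormal (theta1 = -0.25).
iu = np.triu_indices(G, k=1)
w = rng.standard_normal(iu[0].size) / np.sqrt(N - 3)
edge = A[iu] == 1
w[edge] = rng.lognormal(mean=-0.25, sigma=0.5, size=int(edge.sum()))

# Back-transform weights to correlations and assemble a symmetric matrix.
R = np.eye(G)
R[iu] = np.tanh(w)
R.T[iu] = np.tanh(w)

# Project to the nearest PSD correlation matrix (eigenvalue clipping).
vals, vecs = np.linalg.eigh(R)
R = vecs @ np.diag(np.clip(vals, 1e-6, None)) @ vecs.T
d = np.sqrt(np.diag(R))
R = R / np.outer(d, d)

# Simulate expression values for 500 genes and 100 subjects.
X = rng.multivariate_normal(np.zeros(G), R, size=N)
print(X.shape)   # (100, 500)
```

Fitting the mixture model to the weights recomputed from X then tests the whole recovery pipeline end to end.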
We applied our method and checked the goodness of fit of the mixture model and the ability to correctly recover the structure of the network, in terms of the number of true- and false-positive edges. In our simulations, we used the approach described in Section 2 to control the false discovery rate at the 0.01 level. To demonstrate how well our algorithm estimates the true mixture model, we plotted for each configuration the histogram of the observed w mn and the fitted mixture, and measured the goodness of fit in terms of the root mean squared error (rMSE). In all configurations, the rMSE was very small (≤ 0.01), and the mixture weights (p 0 , p 1 , p 2 ) were estimated very accurately, with increasing accuracy as θ 1 increases. See, for example, Figure S1 in the supplementary material, where the average estimate of p 1 is plotted versus θ 1 using the two negatively correlated blocks configuration. A representative goodness of fit plot is shown in Figure 2, for the complete configuration, with θ 1 = −0.25. The red curve represents the null component, the green lines represent the non-null components (in this case, C 2 is estimated, correctly, to have a weight which is very close to 0), and the dashed blue line is the mixture. In all cases, the FDR was controlled at the 0.01 level.

Arguably, more important than assessing goodness of fit is determining a method's ability to recover the true network structure correctly, i.e., to identify as many existing edges as possible while maintaining a low number of falsely-detected edges. Figure 3 shows the average power of our method to detect true edges for a range of values of θ 1 (with a fixed κ 2 1 ) and for different network configurations, where power is the total number of true positives divided by the total number of edges in the graph. It can be seen that, as the location parameter of the log-normal distribution increases, the power increases and approaches 1.
When the data are generated under the L 2 N model, this is the expected behavior, since as θ 1 (θ 2 ) increases, the positive (negative) non-null component, C 1 (C 2 ), is pushed further to the right (left), making it easier to discriminate between non-null and null components.
Note that our method has approximately the same power curve for all four configurations.
The average false discovery rate across all configurations and replications is 0.008. For small values of θ 1 (< −0.75), the average FDR is slightly higher (approximately 0.012), and for θ 1 > 0, the average FDR is less than 0.01. More detailed results regarding the achieved false discovery rate are shown graphically in Figure S2.

Data Generated Under Other Models
In the second simulation study we evaluate the ability of our method to recover the true network structure and compare it with other methods. For a fair comparison, the data are not generated under the L 2 N model. Rather, we use data generated from a multivariate normal distribution whose covariance matrix Σ is defined as a function of an adjacency matrix A. To generate the data, we use the R-package huge [38]. We generate five types of network configurations, as follows:
• random: each edge is randomly set to exist in the graph using K i.i.d. Bernoulli(p) draws (p = 0.01, 0.05, 0.1). |E| ≈ 1000 × (1000 − 1) × p/2.
• hub: the network consists of g disjoint groups, and nodes within each group are connected only through a central node in the group (g = 25, 50, 100). |E| = 1000 − g.
• overlapped-cluster: we modified a function that generates a cluster network consisting of g non-overlapping groups. In the modified function, the groups are aligned in an adjacency matrix so that each group shares 20% of its nodes with its left-adjacent group and another 20% with its right-adjacent group. Edges in each group are randomly generated with probability p. We use p = 0.3 for g = 25 and 50.
We compare the performance in terms of the number of true positives (i.e., correctly identified edges) given the same total number of edges identified. We define the true network in two different ways: (i) based on the adjacency matrix - a pair m, n is set to be connected if and only if A mn = 1; and (ii) by applying a threshold to the true covariance matrix - for a predetermined t, a pair is connected iff |Σ mn | > t, where the threshold t is set to achieve the same sparsity level as that of the adjacency matrix. The two are different in that the former assumes conditional independence between two nodes that are not connected by an edge, while the latter does not.
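As an example of how these adjacency matrices are built, the hub configuration can be sketched as follows. This is an illustrative Python version, not the huge package's generator; the group layout (equal-sized groups, hub as the first node of each group) is an assumption on our part:

```python
import numpy as np

def hub_adjacency(G, g):
    """Hub configuration: g disjoint groups of G // g nodes; each node is
    connected only to its group's central (hub) node, so |E| = G - g."""
    A = np.zeros((G, G), dtype=int)
    size = G // g
    for k in range(g):
        lo, hi = k * size, (k + 1) * size
        A[lo, lo + 1:hi] = 1      # hub = first node of the group
        A[lo + 1:hi, lo] = 1
    return A

A = hub_adjacency(1000, 50)
print(A.sum() // 2)   # 950 edges = 1000 - 50
```

The edge count matches the formula |E| = 1000 − g given above; the other configurations differ only in how the 1-entries of A are placed.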
The results from the two network definitions are similar. We show the results from (i), where the true network is derived from the adjacency matrix, here, and the results from (ii), where the true network is derived from the true covariance matrix, in the supplementary material (Figure S3). As a baseline, we consider the number of true positives expected when the edges are identified in a random manner. In the band, g = 50 configuration (panel I), the competing methods do as well as or worse than chance, whereas our method gives much better results. In the random, p = 0.1 configuration (panel C), the other methods perform worse than choosing edges randomly, and our method performs slightly better than chance up to a certain point (approximately 30,000 total edges being detected) before deteriorating, but it still gives better results than the other methods.
The scale-free configuration appears to be especially challenging for all the methods.
They all detect only approximately 10-20 true edges when the total number of edges detected was 1,000 (Figure 4M). In this case, correlation thresholding, glasso, and MB-OR perform only slightly better than our method.
Note that sparsity levels change as we vary the network size: larger G results in a sparser network. Figure S4 depicts the number of true positive edges given the total number of edges for various sizes of scale-free networks. When G = 2,000, our method is much better than MB-AND (as we have also seen in Figure 4M), and slightly better than the other methods. When G = 200 or 500, our model clearly outperforms the other methods.

Case Study
Subtype Characterization of Breast Cancer. We analyzed a large dataset used by Allahyar et al. [40], who introduced SyNet, a computational tool aiming to improve network-based cancer outcome prediction. The dataset, referred to as ACES [41], contained expression data for 12,750 genes collected from 1,616 women across 12 studies. To obtain and preprocess the data, we used the raw data and scripts provided by Allahyar et al. [40] via their github page (github.com/UMCUGenetics/SyNet). We focused on analyzing the gene network of each cancer subtype.
Then, we create clusters recursively as follows. In each iteration, i, we find the node, x i , with the largest degree and identify all of its neighbors. We sort its neighbors in increasing dissimilarity order, so that within the i-th cluster, the nodes are arranged from closest to x i to farthest. Then, in the subsequent iterations, node x i and all its neighbors are excluded, until we have exhausted all the nodes.
Figure 6 shows that the genes in the second largest cluster in the Her2 network appear to be highly connected. Figure 7 shows again the plot of γ m d m versus d m , but this time, genes identified as belonging to the second largest cluster appear in red. Clearly, the upper 'arm' in the plot consists of genes from that cluster.
For each subtype, we perform gene set enrichment analysis [21] and find the KEGG pathways significantly enriched in every identified cluster (p-value ≤ 0.01). We then identify uniquely enriched pathways for each subtype compared to the other subtypes (see Table 1).
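The recursive clustering step described above can be sketched as follows. This is an illustrative Python version of the greedy degree-based procedure (the sorting of neighbors by dissimilarity to the hub is noted in a comment rather than implemented, since it requires the weight matrix):

```python
import numpy as np

def degree_clusters(A):
    """Greedy clustering: repeatedly take the remaining node with the
    largest degree together with its remaining neighbors as a cluster,
    then exclude them and continue until all nodes are used."""
    active = list(range(A.shape[0]))
    clusters = []
    while active:
        deg = {m: int(A[m, active].sum()) for m in active}
        x = max(deg, key=deg.get)                # highest-degree node
        nbrs = [n for n in active if A[x, n] == 1]
        # in edgefinder, nbrs would also be sorted by dissimilarity to x
        clusters.append([x] + nbrs)
        active = [m for m in active if m not in clusters[-1]]
    return clusters

# A toy 4-node network: a triangle (0-1-2) plus an isolated node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]])
print(degree_clusters(A))   # [[0, 1, 2], [3]]
```

Each pass removes one cluster from the graph, so highly connected hubs and their neighborhoods are extracted first, which is what produces the large, tight clusters discussed below.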
Our results obtained from the pathway analysis are consistent with previous findings. The HER2 group has three uniquely enriched pathways: drug metabolism - cytochrome P450, steroid hormone biosynthesis, and tryptophan metabolism. The relationship of drug metabolism - cytochrome P450 to HER2 has been reported in Towles et al. [42], the relationship of steroid hormone biosynthesis to HER2 has been reported in Huszno et al. [43], and the relationship of tryptophan metabolism to HER2 has been reported in Fisher et al.
Table 1: Uniquely enriched pathways in each breast cancer subtype. All identified clusters are indexed by their size; the cluster index indicates the order of the clusters. P-values are adjusted to control the false discovery rate using Benjamini and Hochberg [37].
Several of the uniquely enriched pathways in the Normal group are parts of the immune system. Furthermore, phagosome, antigen processing and presentation, natural killer cell mediated cytotoxicity, graft-versus-host disease, T cell receptor signaling pathway, and primary immunodeficiency are enriched in a larger number of clusters in the Normal group than in the other subtypes. These pathways are also part of, or related to, the immune system (see Table S1 in the Supplementary Material).

Discussion
We propose a new approach for detecting edges in gene networks, based on co-expression data. We consider the entire set of genes as a network in which nodes represent genes and weights on edges represent the correlation between expression levels of pairs of genes. We start by modeling the normalized pairwise correlations as a mixture of three components: a normal component with mean 0, representing the majority of pairs which are not coexpressed, and two non-null components, modeled as log-normal distributions, for positively and negatively correlated pairs.
From a theoretical point of view, the so-called L 2 N model has the advantage that the overlap between the null component and the non-null components around 0 is negligible. This helps to avoid identifiability problems which are known to affect other mixture models which rely only on normal components, such as the spike-and-slab or a three-way normal mixture model. Furthermore, the mixture model allows us to accurately estimate the proportion of spurious correlations among all pairs of genes and to derive a cutoff criterion in order to eliminate the vast majority of the edges in the graph that correspond to uncorrelated genes. We also derived estimators for the probabilities of Type-I and Type-II errors, as well as the false discovery rate associated with the cutoff criterion.
From a practical point of view, this model appears to fit co-expression data extremely well, even when the data are not generated according to the mixture model. Our simulations show that it outperforms other methods in terms of the power to detect edges, and that it maintains a low false discovery rate. Estimation of the model parameters is done very efficiently, using the EM algorithm. In typical gene expression datasets which consist of thousands of genes and millions of putative edges, computational efficiency is critical.
Our approach does not require any assumptions about the underlying structure of the network. We only assume that the normalized correlations follow the L 2 N model. This is a very modest assumption since the Fisher z-transformed correlations are indeed (asymptotically) normally distributed for all the uncorrelated pairs.
Our case study yielded results that are insightful and consistent with previous findings.
In particular, for the HER2 group, we identified pathways that are known to be associated with this subtype. For the Luminal B group, we identified pathways that are associated with endocrine treatment resistance in breast cancer with estrogen receptor alpha. All the breast cancer subtypes are less prone to have pathways related to the immune system than the Normal group, which is reasonable because cancer is known to weaken the immune system.
We plan to extend this method to handle time varying networks. This will be particularly useful when analyzing gene expression data from repeated measures designs. We also plan to extend the model to applications that involve multiple platforms, such as methylation and proteomics. In principle, with the appropriate normalization technique for each platform, one can simply construct a graph with G 1 + . . . + G k nodes, where G i is the number of 'building blocks' observed in each platform. However, much more work is needed to establish the theoretical framework to define 'co-expression' across platforms.