Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets

Many genome-wide datasets are routinely generated to study different aspects of biological systems, but integrating them to obtain a coherent view of the underlying biology remains a challenge. We propose simultaneous clustering of multiple networks as a framework to integrate large-scale datasets on the interactions among and activities of cellular components. Specifically, we develop an algorithm JointCluster that finds sets of genes that cluster well in multiple networks of interest, such as coexpression networks summarizing correlations among the expression profiles of genes and physical networks describing protein-protein and protein-DNA interactions among genes or gene-products. Our algorithm provides an efficient solution to a well-defined problem of jointly clustering networks, using techniques that permit certain theoretical guarantees on the quality of the detected clustering relative to the optimal clustering. These guarantees coupled with an effective scaling heuristic and the flexibility to handle multiple heterogeneous networks make our method JointCluster an advance over earlier approaches. Simulation results showed JointCluster to be more robust than alternate methods in recovering clusters implanted in networks with high false positive rates. In systematic evaluation of JointCluster and some earlier approaches for combined analysis of the yeast physical network and two gene expression datasets under glucose and ethanol growth conditions, JointCluster discovers clusters that are more consistently enriched for various reference classes capturing different aspects of yeast biology or yield better coverage of the analysed genes. These robust clusters, which are supported across multiple genomic datasets and diverse reference classes, agree with known biology of yeast under these growth conditions, elucidate the genetic control of coordinated transcription, and enable functional predictions for a number of uncharacterized genes.


Algorithm Analysis
The proof of the theorem on JointCluster algorithm is organized in two parts: the first collects claims about the cuts made by the algorithm, and the second uses them to prove the theorem. Please refer the main text for the algorithm description, theorem statement, and relevant definitions.
Claims and notations. Let the cuts produced by the algorithm be (S 1 , T 1 ), (S 2 , T 2 ), . . ., and let the type of a cut (S j , T j ) be the label τ j ∈ {1, 2, . . . , p} of the graph that produced it. That is, (S j , T j ) is the smallest of the chosen cuts in the algorithm description above, and approximates the sparsest cut in the current node set in G τj . For convenience, let S j be the "smaller" side of the cut (i.e., a τj (S j ) ≤ a τj (T j )). Slightly modify the algorithm just for the sake of analysis so that at any point, the current node set S j ∪ T j satisfying a τj (S j ∪ T j ) < n a τj (V ) is split into singletons. The conductance of singletons is defined to be 1.
Let a partition C 1 , C 2 , . . . , C l of V be an ({α i }, ) simultaneous clustering. We need to show that the partition defined by the algorithm's cuts approximates this optimal clustering. To this end, note that the inter-cluster edges X of the algorithm's partition is simply the edges crossing all cuts {(S j , T j )}. To bound the cost of edges lost in some (S j , T j ), we define a new cluster-avoiding cut (S j , T j ) in S j ∪T j as follows. For each cluster C t in the optimal clustering, place all of (S j ∪ T j ) ∩ C t either into S j if a τj (S j ∩ C t ) ≥ a τj (T j ∩ C t ), or into T j otherwise. By this construction, |a τj (S j )−a τj (S j )| ≤ l t=1 min(a τj (S j ∩C t ), a τj (T j ∩C t )). We now present two claims that will help us bound the edges lost in the algorithm's cuts in the next section. These claims are simple extensions of the single graph case [1], but proved here for completeness. The first claim is on the maximum number of S j or S j sets a node can belong to. Claim 1. Consider any graph G i . For each node u ∈ V , there are at most log n values of j such that τ j = i and u belongs to S j . Further, there are at most 2 log n values of j such that τ j = i and u belongs to S j .
Proof of Claim 1. Fix a node u ∈ V . Let Clearly if u ∈ S j ∩ S k with j < k, then (S k , T k ) must be a partition of S j or a subset of S j . Also, if the cut types τ j = τ k = i, we have a i (S k ) ≤ 1 2 a i (S k ∪ T k ) ≤ 1 2 a i (S j ). So a i (S j ) reduces by a factor of 2 or greater between two successive times j ∈ I u . The maximum value of a i (S j ) is at most a i (V ) and the minimum value is at least n a i (V ), yielding the first statement of the claim. Now suppose j, k ∈ J u with j < k. Suppose also u ∈ C t . Then u ∈ T j ∩ C t , and T j or a subset of T j is later partitioned into (S k , T k ). Since u ∈ S k \ S k with τ k = i, we also have . This halving of a i (T j ∩ C t ) between two successive times that j ∈ J u shows that |J u | ≤ log n . This proves the second statement in the claim as u ∈ S j when u ∈ S j or u ∈ S j \ S j .
The second claim considers a set C t in the optimal partition, and uses its high conductance (of at least α i in G i ) to show that another conductance-like measure defined using intra-cluster cut edges (within C t ) is high as well.
Claim 2. Consider any graph G i . For each cluster C t in the optimal clustering, which has conductance at least Proof of Claim 2. Consider how the cuts (S 1 , T 1 ), (S 2 , T 2 ), . . . induce a partition of C t into P 1 , P 2 , . . .. Every edge between two vertices in C t that belong to different sets of the partition must be cut by some cut (S j , T j ), and conversely, every edge of every cut (S j ∩ C t , T j ∩ C t ) must have its two end points in different sets of the partition. So given that C t has conductance α i , we obtain all j Each node u ∈ C t belongs to the smaller (with respect to a i (·)) of the two sets S j ∩ C t , T j ∩ C t for at most log n values of j. Also every P s contained in (and which partition) the smaller of S j ∩ C t , T j ∩ C t is smaller than C t \ P s . So we have The claim follows from these two bounds taken together.
Proof of algorithm guarantees. We now prove that the partition defined by the algorithm's cuts approximates the optimal clustering, which has conductance at least α i in G i and inter-cluster edge cost at most in the sum graph (V, a s ).
Proof of Theorem 2 in the main text. From the algorithm description, it follows that each cluster output by our algorithm upon termination has conductance at least Thus it remains to bound the weight of the inter-cluster edges. To do so, we divide the algorithm's cuts into two groups, and bound their edge weights separately. The first group H consists of cuts with "high" conductance within clusters. Let a I i (S j , T j ) denote the weight of intra-cluster edges of the cut (S j , T j ); i.e., a I i (S j , T j ) = l t=1 a i (S j ∩ C t , T j ∩ C t ). Then, The remaining cuts will be called as cuts with "low" conductance within clusters. We now bound the cost of the high conductance group using the cost of the cluster-avoiding cuts. Recall that a cluster-avoiding cut (S j , T j ) of a cut (S j , T j ) satisfies |a τj (S j ) − a τj (S j )| ≤ t min(a τj (S j ∩ C t ), a τj (T j ∩ C t )). Also for all j ∈ H, we have Putting them together, we obtain |a τj (S j ) − a τj (S j )| ≤ 1 2 a τj (S j ), and hence min(a τj (S j ), a τj (T j )) ≥ 1 2 a τj (S j ). We now use the sparsest cut approximation guarantee to upper bound a τj (S j , T j ) in terms of a τj (S j , T j ).
Hence we have bounded the overall cost of the high conductance cuts with respect to the cost of the cluster-avoiding cuts. Observe that every edge in a cluster-avoiding cut is an inter-cluster edge in the optimal clustering. Also note that if an edge (u, v) crosses a cut (S j , T j ), then either u ∈ S j or v ∈ S j . So by applying Claim 1 for any i = 1, 2, . . . , p, each edge (u, v) crosses a cut (S j , T j ) of type τ j = i for at most 4 log n values of j. Applying these observations along with Claim 1 and Holder's inequality, we bound the cost of the high conductance group cuts in the min graph (V, a m ).
Next we deal with the group of cuts with low conductance within clusters. Using the definition of H and Claim 2, we can bound the intra-cluster cost of the low conductance group of cuts.
In addition, since each inter-cluster edge in the optimal clustering belongs to at most one cut (S j , T j ), the cost of such edges crossing the low-conductance group of cuts in the sum graph (and hence min graph) is at most 2 a s (V ). So the total cost of the low conductance group of cuts in the min graph is a s (V ).
To get the total edge cost in the min graph, we then sum up the cost of the two groups of cuts from above, and note that splitting up all the S j ∪ T j with a τj (S j ∪ T j ) ≤ n a τj (V ) into singletons costs us at most 2 a s (V ) on the whole. Substituting a s (V ) as twice the total sum of edge weights gives the edge cost bound in Theorem 2 in the main text.

JointCluster -Overall Framework
JointCluster performs a combined analysis of multiple graphs using a flexible framework comprising two phases. The first phase produces a clustering tree using the JointCluster algorithm and heuristics described in the main text, and the second phase parses this clustering tree to produce clusters preserved in multiple graphs as described in this section. The algorithm parameters are also learnt automatically as described here, so that we could jointly cluster multiple graphs in an unsupervised fashion.
Parsing the clustering tree. Our algorithm recursively partitions the common node set V of the graphs. To capture this hierarchical structure, we use a clustering tree with the same structure as the algorithm's recursion call tree, and map the subset of V on which a recursive call is made to the corresponding vertex in the tree. Adopting an empirically tested clustering framework [2], our JointCluster method parses this clustering tree into a partition of V that both respects the clustering tree and optimizes a certain objective function.
A partition of V is a collection of disjoint node sets C 1 , C 2 , . . . C T (C t ⊆ V ) whose union equals V . A collection of disjoint node sets is said to respect a clustering tree if each node set is mappable to some vertex in the clustering tree. Our objective function for a partition is based on the modularity score [3], studied in other biological contexts [4], and described next.
We take the min-modularity score of a partition C, which is the sum of the min-modularity score of the sets C t in the partition, as our objective function. We compute the modularity score of C t in each graph G i as below, and take their minimum as the min-modularity score of C t . The modularity of a set of nodes C t in graph G i is given by: The first term is the normalized edge density of the nodes in G i and the second the expectation of the same measure in a randomized graph obtained from degree-preserved shuffling of the edges in G i . The partition of V that respects the clustering tree and optimizes the min-modularity score can be found by dynamic programming, much in the same way as done for the correlation clustering objective function in [2]. For a given vertex in the clustering tree, let D ⊆ V be the nodes mapped to it, and D l , D r be the nodes mapped to its left and right children respectively. If OP T (D) is the optimal partition of D, then the dynamic programming recurrence is OP T (D) = arg max(min-modularity(D), min-modularity(OP T (D l )∪ OP T (D r ))). To enable a fair comparison of JointCluster with other methods, we needed a size limit on the clusters. To find a partition where each set or cluster in the partition is of some maximum size (set at 100 nodes), we simply change the above recurrence to OP T (D) = OP T (D l ) ∪ OP T (D r ) whenever |D| > 100. From the size-limited optimal partition of V found using this procedure, we only output clusters of some minimum size (at least 10 nodes) as the final list of meaningful clusters. This list is ordered by the min-modularity scores of clusters (highest score first), and the clusters are identified by their rank in this ordering.
Parameter learning. In the problem definition, we saw how graph-specific conductance thresholds {α i } accommodate the varying extents to which different graphs can be clustered. We desire an unsupervised method for learning the related conductance threshold α * i (see algorithm description) for each network of interest G i from the inherent cluster structure present in the network. To do so, we repeat the following procedure separately for each graph G i .
1. Separately cluster G i using JointCluster with the conductance threshold set without loss of generality to the maximum possible value 1. The result is a fully-resolved clustering tree, where each node in G i is mapped to a leaf vertex in the tree.
2. Parse the clustering tree into a clustering with the optimal min-modularity score, which turns out to be the modularity score as only one graph is clustered. This clustering yields a set of sizeconstrained, meaningful clusters {C t } as described above.
3. Set α * i to the minimum conductance threshold that would result in the same set of clusters {C t } upon clustering G i as the threshold of 1. This value is determined from the clustering tree by selecting the tree vertices to which the clusters {C t } are mapped to, and taking the maximum of the conductance of all cuts made by the algorithm in resolving the clustering tree to the selected vertices.
Note that when clustering multiple graphs, very stringent conductance thresholds might not yield any (non-trivial) clusters, and very low conductance thresholds might only yield coarsely resolved clustering trees and large clusters. The rationale behind the above heuristic is to automatically choose a threshold that is both sufficiently low to enable simultaneous clustering with the scaling heuristic, and sufficiently high to provide a clustering tree that is at least as resolved as the single-graph clustering trees.
Significance of modularity scores. The modularity score of a cluster in a network is said to be statistically significant (or simply significant) if the P-value of the score is at most 1/100 as estimated from random sets containing the same number of genes as the cluster. In detail, a random set of genes of the same size as the cluster is chosen without replacement from all 4482 genes in the network, and its modularity score calculated in the network. The P-value of a modularity score a is calculated by the proportion of 100 random sets that achieve modularity score at least a. The P-value of the minmodularity score of a cluster is similarly estimated from the min-modularity score of 100 random sets of genes of the same size as the cluster, and the min-modularity score is said to be (statistically) significant if the estimated P < 1/100. The P-value of the min-modularity score of a whole clustering is estimated from the min-modularity score of 100 random partitions of all genes into random sets of genes, with each random partition having the same number of clusters and cluster sizes as the whole clustering. Again, the min-modularity score of a clustering is said to be (statistically) significant if the estimated P < 1/100.
Adapted scoring for protein-protein and protein-DNA networks. To obtain some of our results, we decomposed the physical network into a network of protein-protein and a network of protein-DNA interactions, and applied JointCluster to these two networks in place of the single physical network. The clustering tree was produced and parsed as before, except for the use of a slightly adapted min-modularity score whenever protein-protein and protein-DNA networks were involved. The adapted min-modularity score of a cluster is the minimum of the best modularity score of the cluster in the protein-protein and protein-DNA networks, and the modularity score of the cluster in other integrated networks if any. This allowed us to find clusters that were present in the protein-protein network (as a protein complex for instance) or in the protein-DNA network (as a coregulated module for instance).

Performance Comparison
Tested methods. We already described the tested methods (JointCluster, single tree, single graph and Coassociation methods) in the main text. To complete the description, we provide details on implementing the spectral clustering method M [2], and building the coassociation graph [5]. The spectral method M was implemented by simply applying our JointCluster program on one input graph, since JointCluster is an extension of the spectral method from one graph to multiple graphs. Clustering a graph using this implementation of method M yields the clustering tree, and optionally the clusters that result from parsing this tree using the modularity score described above. The coassociation graph is basically a consensus of different clusterings (partitions) of the same set of nodes. The graph was defined over the nodes being clustered, with an edge between two nodes u, v weighted by the number of clusterings in which both u, v belong to the same cluster.
The implementation of Matisse and Co-clustering methods described in the main text was available as a single software http://acgt.cs.tau.ac.il/matisse/. The same software was also used to estimate missing expression values using the k-nearest neighbors option (with k = 10) before building the yeast coexpression networks.
Simulated data for benchmarking. To generate one random instance of two test graphs G 1 , G 2 , we proceeded as in an earlier study [4], by creating each graph over 128 nodes and implanting a fixed "true" clustering of 4 clusters with 32 nodes each. Let the internal degree of a node v count the neighbors of v that belong to the same cluster as v, and its external degree count the remaining neighbors of v. In order to obtain a meaningful cluster structure, the edge probability between two nodes was set such that the average external degree (the k out parameter) of every node is at most its average internal degree. The sum of these two average degrees was fixed at a constant 16 as in the earlier study, so only the values k out ≤ 8 lead to a meaningful cluster structure. In more detail, we classified edges as intra or inter cluster, and set these edges independently with probability p i and p o respectively. The p i and p o values were chosen such that the average internal degree 32p i and average external degree k out = 96p o summed up to 16 for different values of k out .
Reference classes. The biological significance of detected clusters were assessed using reference sets of yeast genes belonging to different classes, as mentioned in the Results. We provide detail on how these five reference classes were collected.
• GO Process: Genes in each reference set in this class are annotated either directly or indirectly to the same GO Biological Process term, using any evidence code other than IEA (Inferred from Electronic Annotation) [6] (Jan 2009 version). We restrict ourselves to sets of size at least 50 and at most 1000 to avoid reference sets that are either too small or too large to provide a meaningful enrichment category. • TF (Transcription Factor) Perturbations: Genes in each set are differentially expressed at P-value at most 0.01 between TF-perturbed and wildtype (or control) strains. The list of reference sets from two studies, TF deletion [7] and TF over-expression [8], are concatenated to form this reference class. The P-value for differential expression of a gene under a TF deletion or overexpression is based respectively on the X-score or Z-score reported in the study. • Compendium of Perturbations: Genes in each set are differentially expressed at P-value at most 0.05 and exhibit a fold change at least 1.5 between perturbed (deletions of genes, or chemical treatment) and wildtype (or mock-treated control) yeast strains [9]. The P-values based on a gene-specific error model and fold changes are as reported in the study. • TF Binding Sites: Genes in a set have binding sites of the same TF in their upstream genomic regions, with sites predictions downloaded from a previous study based on ChIP binding data [10] at a binding P-value cutoff of 0.005 and cross-species conservation of site sequences determined using a moderate criterion [11]. • eQTL Hotspots: The genomic regions with wide-spread transcriptional effects, i.e., exhibiting a significant excess of linkages of expression-related traits to genotypic variations, are called eQTL hotspot regions. The hotspot regions have been reported for three traits, gene expression under glucose and ethanol conditions and difference between these two expressions, using linkages significant at about 5% FDR [12]. Genes with trait linkages to such a eQTL hotspot region are grouped into a reference set.
Besides the criteria described above, we discard from each class, reference sets that are too small (less than 5 genes) to provide a meaningful enrichment category. A reference set comprising differentially expressed genes is referred to as a signature set or simply signature in the text (eg. deletion, overexpression or perturbation signatures).
Evaluation measures. To evaluate the methods on clustering of the simulated datasets, we used the standard Jaccard index measure, which ranges from 0 to 1, to reflect the degree of overlap between the true clustering and the clustering detected by the methods. Jaccard Index between two clusterings of the same set of elements is the ratio between the number of element pairs that are intra-cluster wrt (with respect to) both clusterings and the number of element pairs that are intra-cluster wrt either of the clusterings. Given a clustering (partition) of a set of elements into clusters, a pair of elements that belong to a single cluster is denoted as intra-cluster wrt this clustering.
To evaluate the methods on clustering of the yeast networks, we measured the overlap between clusters detected by the methods against sets in the various reference classes. The significance of overlap between two sets of genes belonging to a larger background set (typically all the genes in the yeast networks) is reported as an enrichment P-value computed using the hypergeometric distribution (denoted as P in the text).
Sensitivity and specificity measure how well the clusters produced by a clustering method align with reference sets of a specific class such as GO Process or TF Perturbations. Sensitivity is the fraction of reference sets in the class that are enriched for genes belonging to some cluster output by the method. Specificity is the fraction of clusters output by a method that are enriched for genes belonging to some reference set in the class. Both fractions are reported in the tables as percentages rounded to the nearest tenth. A cluster is said to be significantly enriched or simply enriched for genes belonging to some reference set, if the enrichment P-value of overlap between the cluster and reference set genes, computed as said above using all genes in the networks of interest as background, is at most 0.005 after Bonferroni correction for the multiple reference sets being tested. The same method is used to declare a reference set as being enriched for genes belonging to some cluster, but the Bonferroni correction is now using the number of clusters being tested.
A set of genes is said to be recovered by a clustering method if it is enriched for genes belonging to some cluster in the clustering found by the method, using the same definition of enrichment of a cluster for a reference set described above. The genes in the three-network clusters obtained from applying JointCluster on the yeast physical and glucose/ethanol coexpression networks is provided in an accompanying "pge_clusters.orf" file. The enrichment of these clusters for all five reference classes is provided in an accompanying "pge_clusters_enrich.out" file. The prediction of function for uncharacterized ORFs (Open Reading Frames) based on the GO Process enrichment of these clusters is provided in an accompanying "pge_clusters_predns.out" file. The accompanying files and JointCluster software are available at http://www.math.mcgill.ca/vetta/jc/ .  Figure 1. Connectivity of the best preserved clusters. Two top-ranking preserved clusters (Clusters #1, #2) detected from applying JointCluster on the yeast physical and glucose/ethanol coexpression networks. Cluster #1 is significantly enriched for functions related to translation and mitochondria, and Cluster #2 is enriched for functions related to ribosome biogenesis. Modularity score of a cluster in each network is shown below the corresponding heatmap. In these heatmaps, the darkness of a spot x, y is scaled by the weight of the edge between the x-th and y-th gene in the cluster (darkest spot is weight 1.0 and brightest is weight 0.0). The weight is a binary (1 or 0) indicator of interaction between the genes in the physical network (labeled physical), and absolute correlation coefficient between genes in the coexpression network (labeled glucose and ethanol). The inner band enveloping a heatmap shows for each node v, the average edge weight from v to neighbors of v belonging to the same cluster as v; the outer band shows the average edge weight from v to the remaining neighbors of v. Hence, a good cluster's inner band would be darker than its external band. The weight is a binary (1 or 0) indicator of interaction between the genes in the physical network (labeled physical), and absolute correlation coefficient between genes in the coexpression network (labeled glucose/ethanol).  Figure 1, except that the heatmap is based on an ordering of cluster genes that places physically isolated genes first and other genes missed by Matisse clusters next before the remaining genes.