Voting-Based Cancer Module Identification by Combining Topological and Data-Driven Properties

Recently, computational approaches integrating copy number aberrations (CNAs) and gene expression (GE) have been extensively studied to identify cancer-related genes and pathways. In this work, we integrate these two data sets with protein-protein interaction (PPI) information to find cancer-related functional modules. To integrate CNA and GE data, we first built a gene-gene relationship network from a set of seed genes by enumerating all types of pairwise correlations, e.g. GE-GE, CNA-GE, and CNA-CNA, over multiple patients. Next, we propose a voting-based cancer module identification algorithm by combining topological and data-driven properties (VToD algorithm) by using the gene-gene relationship network as a source of data-driven information, and the PPI data as topological information. We applied the VToD algorithm to 266 glioblastoma multiforme (GBM) and 96 ovarian carcinoma (OVC) samples that have both expression and copy number measurements, and identified 22 GBM modules and 23 OVC modules. Among 22 GBM modules, 15, 12, and 20 modules were significantly enriched with cancer-related KEGG, BioCarta pathways, and GO terms, respectively. Among 23 OVC modules, 19, 18, and 23 modules were significantly enriched with cancer-related KEGG, BioCarta pathways, and GO terms, respectively. Similarly, we also observed that 9 and 2 GBM modules and 15 and 18 OVC modules were enriched with cancer gene census (CGC) and specific cancer driver genes, respectively. Our proposed module-detection algorithm significantly outperformed other existing methods in terms of both functional and cancer gene set enrichments. Most of the cancer-related pathways from both cancer data sets found in our algorithm contained more than two types of gene-gene relationships, showing strong positive correlations between the number of different types of relationship and CGC enrichment -values (0.64 for GBM and 0.49 for OVC). This study suggests that identified modules containing both expression changes and CNAs can explain cancer-related activities with greater insights.


Introduction
Cancer is a common genetic disease and a worldwide leading cause of death. Cancer genomics identifies changes of genes that play important roles in cancer initiation and progression. Decades of research have revealed that cancer is closely related to abnormal changes in regulatory and signaling pathways during its growth and malignance [1,2]; such dysregulations in key pathways occur due to combinations of genetic alterations and expression changes of oncogenes or tumor suppressor genes [3][4][5]. Therefore, many algorithms have been developed to identify pathways related to cancer [6][7][8][9] using DNA CNAs, GE changes, PPIs, and so on.
Extensive uses of GE for studying molecular pathways have helped in classifying cancer subtypes, predicting prognosis, and developing drugs for cancer. However, using only GE data for identifying cancer-related genes is not enough because some important genes in cancer-related pathways might not be differentially expressed and some differentially expressed genes might not be relevant to cancer. CNAs are structural variations of DNA sequences that represent abnormal copies of DNA segments in a form of deletion or amplification in the cell [10]. CNAs are known to be a hallmark of cancer, and methods including GISTIC [11], RAE [12], and WIFA [13] have been used to detect cancerdriver genes in aberrant genomic regions. A recent large-scale analysis of GBM samples from The Cancer Genome Atlas (TCGA) [8] showed genetic alterations including mutations, deletions, and amplifications of DNA in 78%, 87%, and 88% of 206 GBM samples in the core components of RB, TP53, and RTK/PI3K pathways, respectively.
Several studies have recently reported the importance of integrating CNAs and GE data sets for the identification of cancer-related pathways. TCGA research on ovarian cancer showed that genetic alterations and gene expression changes simultaneously occur in the retinoblastoma signaling pathway [14]. Jörnsten et al. [15] developed a model that explains the effects of CNAs on GE in a large-scale network. Based on the model, prognostic scores were calculated and cancer-related genes were identified. Akavia et al. [16] employed an integrative Bayesian approach to identify biologically and therapeutically important driver genes in genetically altered regions by associating candidate driver genes with differentially expressed genes. They applied the proposed method to a melanoma data set and identified known driver genes in melanoma, along with novel cancer driver genes TBC1D16 and RAB27A. An important

A Framework for Combining Topological and Datadriven Properties
We developed the VToD approach to construct modules that are composed of a set of functionally coherent and cancer-related genes. VToD was developed based on four main ideas; (i) genes with similar gene expression profiles and copy number changes are more likely to be in the same module, (ii) genes can be assigned into multiple modules to reflect the biological knowledge that some genes are involved in multiple pathways, (iii) genes in a short distance in the PPI network are more likely to belong to the same module, and (iv) hub genes in the PPI network are more likely to be included in the modules since many hub genes having a large number of interacting partners may contribute to cancer development. The former two ideas consider data-driven properties and the latter two reflect topological properties of genes within the PPI network.
The schematic diagram of our proposed VToD method is shown in Figure 1. VToD constructs a gene-gene relationship network, GGR :~(S,R) by integrating GE and CNA data sets, where S is a set of seed genes and R is a set of gene-gene relationships. Seed genes are selected by combining differentially expressed (DE) genes and CNA genes, where CNA genes are obtained from TCGA [8,14] and listed in Table S1. For GBM, 4,821 seed genes were selected by combining 2,976 DE genes and 2,073 CNA genes. For OVC, 6,649 seed genes were constructed by 710 DE genes and 6,510 CNA genes. Note that some seed genes are both differentially expressed and copy number altered. The gene-gene relationships R were constructed, where two genes have strong correlation in at least one of three types of relationships: GE-GE, CNA-GE, and CNA-CNA. Then, VToD integrates a PPI data set with the gene-gene relationship network GGR by following four major steps.
1. Calculate the association between genes: For every two genes g and m, an association value from gene m to gene g is calculated by combining the gene-gene relationship and the PPI data set. The association value is called a vote-value in this study, since we assume that gene m votes for gene g to represent the strength of the association between two genes. 2. Select representative genes of each gene: For gene g, vote-values from all other genes are sorted in descending order, and genes located within the top vote th% of the vote-values are selected as the representative genes of the gene g. 3. Form pre-modules: If a gene m is selected as a representative gene from multiple genes, other genes selecting the gene m as the representative gene along with the gene m itself form a premodule. 4. Merge pre-modules: Two pre-modules are merged if pairwise members of the two pre-modules are highly related in the genegene relationship network and are closely connected in the PPI network.
The VToD algorithm is inspired by a dynamic signal transduction system (STM) algorithm [32] in which, for each gene, the most associated genes are selected to form pre-modules based on the PPI topology only. However, the clear distinction lies between STM and VToD in the process of (i) calculating the association between two genes and (ii) merging pre-modules, since our approach integrates GE, CNAs, and PPI data sets.
The constructed modules were assessed in two aspects; (i) we measured functional relevance of the identified modules by testing whether genes in a module were enriched for KEGG, BioCarta pathways, and biological processes in GO terms (called a functional enrichment test), and (ii) we assessed the relevance of the modules to cancer by applying an enrichment test to the cancer-related pathways or cancer-related biological functions, which are subsets of the above three categories of pathways/GO terms enriched with cancer-related genes from CGC [31] (called a cancer-related pathway enrichment test). Further, we tested whether genes in the identified modules were enriched with cancer genes from CGC, GBM driver genes [33], and OVC-related genes [34]. In these assessments, the hypergeometric statistics were used for the enrichment test.

Modules from the VToD Algorithm
The distributions of all enumerated pairwise gene-gene relationships (GE-GE, CNA-GE, and CNA-CNA) among seed genes are shown in Figure S1, and the distributions of all votevalues for GBM and OVC data sets are shown in Figure S2. Since the number of pre-modules depends on the vote th% values in Step 2 of the VToD algorithm, we tried three vote th values to examine how vote th values affect on the constructed premodules. Vote-values of the top 1%, 0.25%, and 0.1% eventually yielded 100, 68, and 43 pre-modules for GBM, and 138, 53, and 34 pre-modules for OVC. Then, we applied the functional enrichment tests and cancer-related pathway enrichment tests to pre-modules generated using the three threshold values above. Figure 2 shows the fraction of enriched pre-modules; although many pre-modules have significant overlaps with known pathways across all three thresholds, pre-modules from vote th = 0.25% and 0.1% have more overlaps with the pathways compared to vote th = 1%, showing that higher vote-values generate higher fraction of functionally relevant and cancer-related modules.
We also tested the importance of considering both topological and data-driven properties for pairwise vote calculation. We generated pre-modules using only topological and only datadriven properties. When the single property was used, the same number of gene pairs was selected with that of gene pairs selected by combining both properties for each value of vote th. Across all three values for the vote th threshold, the fraction of functionally enriched modules was higher when topological and data-driven properties were combined than when only a single property was used for both GBM and OVC, as shown in Figure 2.
We chose vote th = 0.1% as a threshold for further analysis. Using this threshold, for GBM, 43 pre-modules were obtained. By merging these pre-modules, 22 modules were generated, and the average number of genes in the modules was 24. For OVC, using the same threshold, 34 pre-modules were generated, and 23 modules were obtained after merging pre-modules, where the average of number of genes is 57. All genes in the modules are listed in Table S2 and Table S3. The statistical significance of the identified modules is shown in Figure S3.
Since the VToD algorithm allows multiple appearances of genes in several modules, we calculated the average ratio of common genes between modules. For GBM, the ratio of common gene was 16.07%, which was similar to those of the KEGG and BioCarta pathways. Also, the distribution of ratios of common genes was calculated. Around half of the modules had ƒ 10% of common genes, which indicates that final modules will be enriched with distinct functional pathways or terms ( Figures S4A and S4B). We also investigated three different types of direct relationships (GE-GE, CNA-GE, and CNA-CNA) between the gene pairs within each of these 22 GBM modules ( Figure S5A). Around 64% of the modules contained at least two types of relationships, showing (i) that genes with similar gene expression and DNA copy number changes are more likely to be in the same module, and (ii) that the activity of the genes in these identified modules can be explained by different molecular mechanisms (Table S4).
For 23 OVC modules, the average ratio of common genes was 11.68%, which was also lower than those from KEGG and BioCarta, and more than half of the 23 OVC modules had ƒ 10% of common genes ( Figures S4C and S4D). Around 83% of all 23 OVC modules ( Figure S5B) contained at least two types of direct relationships.
Cancer-related modules identified by the VToD algorithm for GBM. We applied functional and cancer gene set enrichment tests to 22 GBM modules. We found that 19 (86.36%), 14 (63.63%), and 20 (90.9%) modules were significantly enriched (FDR q-value v 0.05) with at least one KEGG, BioCarta, or GO terms, respectively, showing that identified modules are functionally coherent. Also, 15 (68.18%), 12 (54.55%), and 20 (90.9%) GBM modules were significantly enriched with cancer-related KEGG, BioCarta pathways, and GO terms, respectively. In the case of the cancer gene set enrichment test, 9 and 2 GBM modules had significant overlap (FDR q-value v 0.05) with CGC [31] and GBM-related genes [33], respectively. These results show that our modules are related to cancer development. Table 1 shows the summary of the top five selected modules ordered by GBM-related gene enrichment qvalues; these modules contain many GBM-related genes. All enrichment results for the GBM data set are shown in Tables S4,  S5, S6, and S7.
We selected GBM Module 2 to explain in detail how genes are interacting with other genes and are involved in biological pathways in modules. We selected this module for further explanation since it has a low enrichment q-value with cancer gene sets, and contains gene pairs with strong correlations in three types of direct relationships. This module contains 1,080 gene pairs from 48 genes, and among them there were 300 GE-GE, 9 CNA-GE, and 8 CNA-CNA direct relationships. Figure 3A shows the network view of the GBM Module 2 with direct relationships only. There were three types of edges in this network: i) red edges for CNA-CNA, ii) blue edges for CNA-GE, and iii) green edges for GE-GE relationships between two genes. Genes belonging to significantly enriched pathways/terms were grouped together. Information for DNA CNAs and/or expression changes for genes were also labeled with them within each group. Frequencies of copy number changes were presented as a percentage of 206 GBM samples with either focal amplification or homozygous deletion in [8]. To count the fraction of tumor samples with gene expression changes for gene i , we considered that a tumor sample s is over-or under-expressed if the value of Dr i,s D in Equation (1)    Voting-Based Cancer Module Identification PLOS ONE | www.plosone.org receptor signaling. In BRCA2-ING1, both genes play critical roles in cell cycle control [35,36]; ING1 is a tumor suppressor gene and interacts with TP53, and its under-expression and genetic rearrangement have been observed in several cancers, including GBM [37]; and BRCA2, a tumor suppressor gene, has recently been targeted for sensitizing glioma cells for killing by anti-cancer drugs [38]. In BTBD2-TEP1, TEP1 is a well-known GBM suppressor gene, and the deletion/mutation of this gene has been observed in many cancers, including GBM [39]; polymorphism of BTBD2 is involved in the double-strand break repair pathway that can be useful for GBM survival [40]. In ING1-HMGB1, both genes are located in chromosome 13q, where copy number loss has been reported [41][42][43], suggesting co-occurring deletion of these two genes. In APEX1-HIF1A and HIF1A-TEP1 having the CNA-CNA relationship, APEX1 and HIF1A directly interact with each other in vitro [44]; and, in GBM, copy number loss at 14q11.1-q13.1, 14q23.2-q23.3, and 14q32.33, where these genes are located, has been reported by Donovan et al. [45]. The relationship between 14q11.1-11.2 and 14q23.1-31.3 are also shown in our findings of CNA-GE relationships (APEX1-BRCA1, BRCA1-HIF1A, and BRCA1-TEP1) within this module. In BTBD2-BARD1, BARD1 was suggested as a mediator of apoptosis since its over-expression induces cell death [46]; and high LOH has been detected in human carcinoma metastases to the brain at chromosome 19p13.3 for BTBD2 [47]. Figure 3B shows enrichment tests using KEGG and BioCarta pathways for the GBM Module 2. To find GBM-related pathways, we also calculated the q-values for the enrichment of GBM-related genes in these pathways, respectively. In Figure 3B, the top 15 of 37 enriched KEGG and the top 15 of 49 enriched BioCarta pathways are shown for the GBM Module 2, along with their corresponding overlapping q-values, sorted by those q-values. GBM Module 2 contains many previously known GBM-related KEGG pathways including Glioma, P53 signaling, MAPK signaling, ERBB signaling, mTOR signaling, and VEGF signaling, and GBM-related BioCarta pathways, including ATM, G2, G1, RB, P53, PTEN, and MET pathways [48]. GBM Module 2 is also enriched with cancer-related 40 KEGG, 48 BioCarta pathways, and 92 GO terms.
We also tested the relevance of GBM Module 2 with cancer using CGC and GBM-related genes, as shown in Figure   (78.26%), and 23 (100%) OVC modules were significantly enriched with cancer-related KEGG, BioCarta, and GO terms, respectively. Table 2 shows the summary of five selected modules ordered by OVC-related gene set enrichment q-values. All enrichment results for the OVC data set are shown in Tables  S8, S9, S10, and S11.
We investigated OVC Module 8 in detail, as shown in Figure 4; it contains 629 gene pairs of 37 genes, and among them there were 2 GE-GE, 28 CNA-GE, and 49 CNA-CNA direct relationships. In OVC Module 8, STAT5B-STAT3 gene pair is activated in ovarian cancer [49], interacts with each other [50], and is involved in many pathways including Jak-STAT signaling, RAS signaling, Chemokine signaling, EGF, IL10, PDGF, and TPO pathways. In STAT5B-PRLR, both genes are involved in Jak-STAT signaling, a signal transduction pathway with key control over proliferation, differentiation, and survival of mammary cells [51]. Recently, it has been shown that PRLR and its downstream STAT5B are acetylated by CREB-binding protein (CBP) [52]. In EGF-STAT1 and EGF-STAT3, both gene pairs are involved in pancreatic cancer, EGF pathway, and signal transduction pathway; both STAT1 and STAT3 are activated by the Jak kinase in response to EGF [53][54][55], where JAK2/STAT3 signaling is required for EGFdriven ovarian cancer [55]. In PIK3R1-IGF1R, these genes interact with each other [56] and are involved in many functional pathways, including the IGF1, IGF1R, HDAC, BAD, IGF1MTOR, and focal adhesion pathways. In ERBB2-STAT, these genes are involved in pancreatic cancer and signal transduction pathways; the correlation between the activation of ERBB2 and STAT3 has been observed in many human tumors [57,58]. In ERBB2-STAT5B, both genes interact with JAK2 [59,60] and are involved in ERBB signaling and signal transduction pathways. In EGF-ERBB2, these genes directly interact with each other [61] and are involved in many cancers, including pancreatic, endometrial, prostate, bladder and ovarian cancers. They are also involved in ERBB signaling and focal adhesion pathways. In HRAS-FYN, these genes interact with each other in vitro [62] and are involved in many pathways, such as focal adhesion, axon guidance, T-cell receptor signaling, and FC epsilon RI signaling, ECM, TCR, and integrin pathways.
The top 15 of 37 enriched KEGG and top 15 of 59 enriched BioCarta pathways are also shown for OVC Module 8 in Figure 4B. It includes known OVC-related KEGG pathways, such as focal adhesion, JAK-STAT signaling, ERBB signaling, cytokine-cytokine receptor interaction, chemokine signaling and VEGF signaling, and OVC-related BioCarta pathways, such as AKT signaling, IL6, RAS, EGF, IGF1, PDGF, VEGF, CXCR4, and HER2 pathways [34]. We also tested the relevance of the OVC Module 8 to cancer. OVC Module 8 was enriched with 39 KEGG, 58 BioCarta pathways, and 49 GO terms, which were cancer-related subsets of the original pathways/terms. Also, as shown in Figure 4C, the OVC Module 8 contained 7 CGC genes (PTPN11, AKT1, ERBB2, FOXO1, HRAS, LIFR, and PIK3R1) with a q-value of 2.08|10 {07 and 6 OVC-related genes (EGF, EPHA2, ERBB2, PIK3R1, STAT3, and VEGFA) with a q-value of 5.23|10 {10 . These results suggest that our identified modules from the OVC data set represent cancer-related pathways.
Comparing VToD with other Methods Table 3 shows performance comparisons between our proposed VToD algorithm and other clustering methods using GBM and OVC data sets; when compared to these algorithms, a higher fraction of VToD modules were functionally enriched than   modules from other algorithms. Although the functional enrichment of DFM-CIN modules is comparable to those of VToD, VToD identified a higher fraction of modules encriched with cancer-related pathways than DFM-CIN. Note that, since algorithms were designed for different data types, they were compared using data types in the original paper. For a hierarchical clustering method, GE, CNAs, and PPI data sets were integrated.
N Hierarchical clustering: To find modules by the hierarchical clustering algorithm, we converted our gene-gene relationship network GGR into a distance matrix using the topological overlap metric [63] of the WCGNA tool in the R computational suite. This distance matrix was then used for hierarchical clustering with the average linkage. The dendrogram of the cluster was cut by a dynamic tree-cut [64] algorithm, finally producing 216 modules when the GBM data set was used. We applied functional and cancer gene set enrichment tests with these 216 modules. We found 14, 0, and 13 modules having significant overlaps with KEGG, BioCarta pathways, and GO terms, respectively, and 4, 0, and 4 enriched modules with cancer-related subsets of KEGG, BioCarta, and GO terms, respectively. Also, 5 and 1 modules were enriched with CGC-and GBM-related genes (Table  S12). Table 3 shows the comparative performance between hierarchical clustering and VToD algorithms, showing that VToD identified more pathway-enriched modules than the hierarchical clustering algorithm (Table S13). Moreover, Figure S6A shows the box plot of CGC and GBM driver gene enrichment q-values, indicating higher cancer gene enrichments in VToD compared to hierarchical clustering. Also, the pie charts in Figure S6B show different combinations of three types of direct relationships (CNA-CNA, GE-CNA, GE-GE). Here, VToD produced a larger fraction of modules containing more than one type of direct relationships compared to hierarchical clustering. N Cerami et. al.: Cerami et al. [9] developed an algorithm to integrate DNA copy numbers, somatic mutation, and PPI data sets, and applied it to 84 TCGA GBM data [8]. In their study, altered genes were identified using RAE [12], and a network of genes were constructed based on PPI information using an edge-betweenness algorithm [65], resulting in 10 overlapping modules. When functional and cancer gene set enrichment tests were conducted for these modules, one, one, and three modules were significantly enriched with at least one KEGG, BioCarta pathways, and GO terms, respectively, and 2, 1, and 2 modules were significantly enriched with cancer-related subsets of these three categories of pathways. Also, 2 and 2 modules were significantly enriched with CGC-and GBMrelated gene sets, respectively. DFM-CIN [67] was comparable to VToD and outperformed other competing methods in functional enrichment tests for both GBM and OVC data sets. However, VToD outperformed all other methods in terms of cancer gene set enrichment tests and cancer-related pathway enrichment tests for both GBM and OVC data sets, indicating that identified modules were more likely to be related to the cancer. The numbers of distinct pathways or functional terms enriched for VToD modules were comparable to DFM-CIN and greater than those of other methods, showing the convincing performance of our algorithm. All distinct enriched pathways or terms found by the above methods, including VToD for both GBM and OVC data sets, are shown in Tables S13 and S14, respectively.

Discussion and Conclusions
We proposed the voting-based module construction approach by integrating three direct relationships (GE-GE, CNA-GE, and CNA-CNA), along with indirect relationships and PPI information. We have shown that our relationship network by integrating GE-GE, CNA-GE, and CNA-CNA types can be useful for giving explainable relationships between genes in identified modules since most of the modules contained different types of relationships; by observing CGC enrichment result, all 9 GBM modules and 14 of 15 OVC modules constructed by the VToD algorithm contain at least two types of direct relationships, implying that GE changes and CNAs occur simultaneously in cancer modules. This conclusion was further confirmed when we found that the numbers of different types of direct relationships in modules had strong a positive correlation with CGC enrichment q-values (0.64 for GBM and 0.49 for OVC) and module sizes (0.67 for GBM and 0.52 for OVC).
In this study, we combined both data-driven and topological properties throughout the whole algorithm, from constructing premodules to merging pre-modules. However, our approach has limitations in combining these two properties. When we combined the data-driven and topological properties to calculate vote-values among gene pairs, we integrated them using the same weights (see Equation (4) in the Methods section), although the distribution and the contribution of each property might be different. For further investigation, we drew distributions of topological values and datadriven values of gene pairs contained in the pre-modules of GBM, as shown in Figures S7A and S7C, respectively. The distributions of these two values were different; a Kolmogorov-Smirnov (K-S) test under the null hypothesis that two distributions are identical gives a p-value of 2.2e-16. Similar results were observed in OVC, as shown in Figures S7B and S7D. However, when we drew scatter plots of data-driven property values and topological property values of gene pairs included in the GBM and OVC pre-modules ( Figures S7E and S7F), one property was not dominated by the other property. In many gene pairs, one of two properties had a relatively larger value while the other had relatively smaller value, showing negative correlations between them (20.550 for GBM and 20.259 for OVC). This observation showed that both properties were significantly contributing to constructing pre-modules.
When we combined the data-driven and topological properties to merge two pre-modules, we also integrated them using the same weights (see Equation (5) in the Methods section). Distributions of topological values and data-driven values of all pairs of premodules are different ( Figure S8). One interesting observation is that most pairs of pre-modules have value one for the topological property for both GBM and OVC data sets, as shown in Figures  S8A and S8C, respectively. Consequently, most merged premodules have value one for the topological property (Figures S9A and S9B for GBM and OVC, respectively), and values of the datadriven property in those modules is also high. According to these observations, different distributions of these two properties might not significantly reduce the performance of the proposed algorithm. Our method of combining two different distributions can be further improved; for example, (i) a distribution is transformed into the standard normal distribution, and then (ii) the optimal contribution weights of two distributions is searched. However, a new weight parameter adds an additional complexity to the model. The advantages of the current approach are parameter-less and intuitively simple, and the comparative assessments showed that our methods outperformed other methods in detecting cancerrelated modules.
Our primary goal in this study was to establish the relationships among genes in cancer pathways using an integrated approach. We hope that our research will help explain complex relationship between genes in cancer development. Although we validated the identified modules by using functional and cancer gene set enrichment tests in the current study, more experiments, such as survival analysis and the classification of normal/tumor patients, will be part of our future work.

Data Sets
GE and CNA (level 3) data from 266 paired GBM and 96 paired OVC tumor samples were downloaded from the TCGA data portal (as of 24 July 2011 and 25 June 2012, respectively). GE data were measured using the Affymetrix Human Genome U133 array platform. For both cancer data sets, expression values for 12,044 genes were organized as GE data matrices, labeling rows with gene symbols and columns with sample identifiers ( Figure 1A). DNA copy numbers were measured using Agilent's Human Genome CGH microarray 244A platform. To obtain gene-level CNA values, segmented regions in the level 3 data (for both GBM and OVC data sets) except sex chromosomes (Chr X and Y) were mapped to gene symbols (Table S1); gene symbols were mapped to segmented regions using refGene.txt (version hg18), downloaded from the UCSC genome browser, and ambiguous annotations (genes with multiple annotations) of 28 genes were manually resolved by using either NCBI, the Encode Gencode manual, a BLAT similarity/score (26 genes), or refGene.txt of version hg19 (2 genes). Then, CNA data matrices for both GBM and OVC data sets were organized with the same tumor samples (columns) as in corresponding GE data matrices, and 22,082 and 22,086 genes (rows), respectively. Missing values in CNA matrices were imputed with the mean across all samples. PPI information was collected from [9], including i) manually curated interactions in HPRD [69], ii) signaling pathways from Reactome and NCI/Nature pathway interactions, and iii) the MSKCC Cancer Cell Map. We further converted this PPI information into an undirected graph with 44,959 genes as nodes and 96,347 pairwise interactions as edges.

Constructing a Gene-gene Relationship Network
We define a gene-gene relationship network as GGR: = (S, R).
Here, S is a set of seed genes and S: = (DE | SA), where DE and SA are sets of differentially expressed and significantly copy number altered genes, respectively. R is the set of pairwise genegene relationships among these seed genes ( Figure 1B).
Seed genes. To find the DE genes in S, the two-tailed pooled t-test was used. For the t-test, 10 and 8 unmatched normal samples downloaded from the GBM and OVC pages of the TCGA data portal were used as control data sets, respectively. p-values were corrected by the Bonferroni correction method and genes with corrected p-values below a threshold were selected. To find SA genes in GBM, we collected the focal aberrant regions identified by the GISTIC and RAE algorithms in [8], where a subset of samples of our study were used (Table S3 and S4 of the original article, respectively). In these two algorithms, focal aberrant regions were detected to distinguish relatively short aberrant regions containing cancer-related genes from random broad aberrant regions. Similarly, we collected altered regions found by GISTIC in [14] ( Supplementary Table S5.2 of the original article) for OVC. Some genes in S had only GE data, some had only CNA data, and others had both GE and CNA data (shown in open rectangles, open circles, and filled circles in Figure 1B, respectively).
Relationships among gene pairs. Pairwise relationships between genes are measured by using an absolute value of Pearson correlation coefficient (PCC). For any gene pair (gene i , gene j ), absolute PCC values from GE-GE, CNA-GE, and CNA-CNA data are calculated depending on data availability. Since any type of relationship between two genes might affect cancer development, all three absolute PCCs are considered, and the maximum of them was chosen as a potential relationship value and defined as a r value ( Figure 1B). Note that, since proximal genes on the chromosome are frequently amplified or deleted together, gene pairs were not considered the CNA-CNA relationship if they were in the same focal aberrant regions or within the same cytoband.
A gene pair (gene i , gene j ) is included in the gene-gene relationship R with the r value as a weight if the r value is larger than a threshold, and the relationship is called a direct relationship. The threshold is empirically chosen based on the distribution of all PCC values (see the Parameter selection section below). For each gene pair (gene i , gene j ) = [ R, indirect relationships are found by searching a statistically significant simple path between two genes. Here, the significant path between two genes is defined as a list of other genes that gives the statistically significant geometric mean of r values from gene i to gene j throughout genes in the list compared to the geometric mean of the path in a random PPI network. The random PPI network is generated such that interactions between genes are randomly assigned, while the topology of the network and expression values of genes are conserved with those in the observed PPI network. The null hypothesis for the statistical significance test is that the geometric mean of r values of a simple path in the randomly generated PPI network is greater than or equal to that of the observed path. However, since it takes an exponential time to consider every possible pathes between two genes, a heuristic search is developed; paths between gene i and gene j are searched if two genes are connected in the PPI graph. A breadth first search algorithm is used to search all simple paths between two genes (gene i , gene j ) = [ R. Also, only the paths in which all genes have either GE, CNA, or both types of data are considered. Since searching might yield multiple paths, we chose the path P Ã with the maximum average PPI connectivity, since genes having larger interactions with other genes are more likely to be related to cancer (see File S1). P Ã is measured by Equation (2), where deg(gene i ) is the degree of connectivity in the PPI graph for gene i and n is the number of genes along the path P Ã . Before calculating the average connectivity for the genes along the path, the connectivity of each gene gene i is normalized with the global maximum PPI connectivity (Equation (3)) to make the value in the range of [0, 1].
Then, the statistical significance of the path P* between gene i and gene j is assessed based on the randomly generated PPI network. In the null hypothesis mentioned above, the observed value is the geometric mean of all pairwise r values along the path P*. If pvalue of the path P* obtained by comparing to N rand random PPI networks is below a threshold, the gene pair (gene i , gene j ) is included in the gene-gene relationship network R as the indirect relationship.

Module Detection: VToD
The proposed algorithm is described in Table 4, and source codes of the VToD algorithm is provided in http://www.gcancer. org/VToD/VToD.html.
Step 1: Calculate the association between genes. In VToD, pairwise votes are first calculated for every pair of genes fg,mg [ S, as shown in Equation (4), where norm deg(m) is the normalized PPI connectivity of m using Equation (3), SPL(g,m) is the shortest path length between two genes in the PPI graph, and r value(g,m) is the relationship value used from our constructed network GGR.
vote(g,m)~n orm deg(m) SPL(g,m) zr value(g,m) ð4Þ Equation (4) calculates the score when a gene g chooses a gene m as a representative gene; the score increases if (i) a gene pair fg,mg has a high relationship r value, denoting data-driven property, or (ii) the gene m with a high topological value is closely interacting with the gene g in the PPI network. Hub genes in the PPI network have more chances to be selected as representative genes due to norm deg(m), but are controlled by the length of the shortest path between g and m to produce functional modules related to the gene g. We apply constraints to the shortest path length value ƒ path th to increase the compactness of a premodule, and to reduce the time-complexity for searching the shortest path. If the shortest path length between the gene g and the gene m is larger than path th, the topological information is not considered and the vote-value is defined using only the datadriven value between them. Note that vote(g,m) has values between 0 and 2 since both terms have values between 0 and 1.
Step 2: Select representative genes of each gene. After the vote calculation, a gene m is selected as a representative gene of a given gene g based on a local rank and a global rank. In calculating the local rank of the gene m for the gene g, all genes in S are ranked by descending order of vote-values to the gene g. Then, the cumulated vote-value from the largest vote-value to the vote-value of the gene m is calculated. If the cumulated vote-value is within the top k% of all cumulative vote-values, the gene m is considered a candidate representative gene of the gene g. For the global rank, if the vote(g,m) value is located within the top vote th% of vote-values between all gene pairs in S, the gene m is selected as the representative gene of the given gene g. This approach allows multiple representative genes for the gene g, and one gene can be selected as a representative gene for multiple genes.
Step 3: Forming pre-modules. Each representative gene m (from Step 2) starts forming a pre-module including only itself. Then, each module is enlarged by aggregating all genes that selected the gene m as the representative gene. A redundant premodule is removed when it is a subset of other pre-modules. Smaller pre-modules are also removed if they contain either only two genes (including the representative gene) or all the genes except the representative gene in the pre-module are located in the same focal region of chromosomes.
Step 4: Merging pre-modules. In this step, two pre-modules are merged if pairwise members of the two pre-modules are highly related in the gene-gene relationship network and are closely connected in the PPI network. A pairwise merging value MV(C l , C m ) between any two pre-modules C l and C m is calculated by Equation (5). Let the sizes of C l and C m be n l and n m , respectively, and let n l ƒ n m . In the equation, the topological property between two pre-modules (or modules) is given as the ratio of genes in C l having at least one protein interaction partner in C m (interconnectivity: IC(C l , C m )). Data-driven properties are calculated as the average of gene-gene relationship values between two premodules (or modules).
At every merging step, two modules having the maximum pairwise merging value, denoted as MaxPair:Value in Step 4 of Table 4, are merged and replaced by the newly merged module in the module set. Such merging processes continue until the MaxPair:Value is below the threshold merging th that is decided by comparing the merging values generated by randomized modules (see the next section for generating a randomized module).

Statistical Significance of the Identified Modules
The statistical significance of the identified modules is validated by comparing them to randomized modules. To generate randomized modules, r values of gene pairs in the gene-gene relationship network GGR are shuffled, while the PPI network remains unchanged, so that the topological property is disconnected from the data-driven property; then, using this shuffled relationship network GGR, the whole VToD algorithm from Step 1 to Step 4 runs until a single pre-module is left. This generation of randomized modules were repeated 100 times. Concurrently, the observed pre-modules were merged until a single pre-module is left. At each merging step, a pre-module pair that yields the maximum merging value is selected, and these values for both observed and random cases are plotted in Figures S3A and S3C, for GBM and OVC, respectively. At each merging step, the observed value was significantly greater than the average of the maximum merging values in random cases, confirming that the modules identified by the VToD algorithm are statistically significant.
Next, to find the merging threshold merging th, the stopping criteria of merging pre-modules, for each merging step, we compared the maximum merging value of the observed case with the maximum merging values of the first merging step in all 100 random cases and measured the empirical p-value (Figures S3B and 3D for GBM and OVC data sets, respectively).

Functional and Cancer Gene Set Enrichment
We tested whether constructed modules from both GBM and OVC data sets were enriched with known signaling pathways or biological functions. We used 186 KEGG pathways, 217 BioCarta pathways, and 751 biological processes in GO downloaded from the Molecular Signature Database (MsigDB) at the Broad Institute (http://www.broad.mit.edu/gsea/msigdb/msigdb_index.html).
We excluded GO terms with sizes ƒ 5 and § 250 to omit toospecific or too-general terms. Next, we selected cancer-related subsets of pathways/terms from all pathways/terms. To find such pathways/terms, we measured statistically significant enrichment of cancer genes from CGC [31] in pathways/terms by applying a hypergeometric test and by correcting the p-values using the FDR multiple comparison correction, giving q-values. By applying q- values ƒ 0.05, 83, 139, and 338 cancer-related pathways/terms were selected from KEGG, BioCarta pathways, and GO biological process terms, respectively; they are listed in Table  S15. Also, cancer genes from CGC [31], GBM-related genes [33], and OVC-related genes [34] were used to measure the cancer gene enrichment of the identified modules. For the enrichment analysis, a hypergeometric test was applied to each module using the above all pathways/terms, cancerrelated pathways/terms, and cancer gene sets, giving p-values, and q-values were obtained by the FDR multiple comparison correction. The q-values v 0.05 was used for an enrichment threshold. Note that q-values depend on the number of comparisons and p-values of comparisons in the enrichment test. Therefore, it may happen that although a module is enriched for a pathway when the multiple comparison correction was performed using the cancer-related subset of pathways, the module is not enriched for the same pathway when the correction was done using all pathways, and vice versa.

Parameter Selection
Our algorithm has following parameters: thresholds for selecting differentially expressed genes, thresholds for r value, path th for searching indirect relationships, k% and vote th% for selecting representative genes, and merging th for merging pre-modules. We used q-value v 0.05 for selecting differentially expressed genes (Bonferroni corrected) and merging th. vote th were tested for three different values, as shown in the Results section. However, path th, thresholds for r value, and k% were empirically chosen since these parameters affect the intermediate steps and are not critical for final modules. Here, we explain these parameters in detail.
From the distribution of three different direct relationships (GE-GE, CNA-GE, and CNA-CNA), the top 10% of all corresponding PCC values were selected as thresholds: 0.38, 0.165, and 0.435 for GE-GE, CNA-GE, and CNA-CNA relationships, respectively, for GBM ( Figure S1A). By applying corresponding thresholds to the r values between any pair of genes, 2,617,259 direct relationships were included in the gene-gene relationship R, which was 22.53% of 11,618,610 total gene pairs consisting of 4,821 seed genes in S.
To search indirect relationships for any pair (gene i , gene j ) = [ R, the most relevant path P Ã was chosen after exploring all simple paths in the PPI graph with path th = 2. The geometric mean of pairwise r values along P Ã was calculated and the statistical significance was measured over N rand ( = 50) randomly generated PPI networks. A gene pair (gene i , gene j ) having a p-value ƒ 0.05 was considered an indirect relationship and added to the genegene relationship R. The 42,532 total pairs updated by indirect relationships were 0.47% of all 9,001,351 ( = 11,618,610-2,617,259) gene pairs.
Using an experimental setup similar to the one above, we selected 0.295, 0.19, and 0.19 as thresholds for GE-GE, CNA-GE, and CNA-CNA direct relationships, respectively, for OVC ( Figure  S1B). Applying these thresholds to r values, we defined 5,681,333 (25.71% of all pairs) direct relationships in R, followed by updating 52,969 pairs (0.32% of all remaining pairs in R) as indirect relationships.
To select representative genes for each gene, we needed to decide two thresholds, k% and vote th%; the vote(g,m) value is located (i) within the top k% of local vote-values and (ii) within the top vote th% of global vote-values. We used k = 1 and tested three values for vote th of the top 1%, 0.25%, and 0.1%, as mentioned in the Results section. The distributions of all votes for the GBM and OVC data sets are shown in Figure S2. Figure S1 Distributions of all enumerated pairwise direct relationships among the genes in S. (A) is for GBM data set and (B) is for OVC data set. X-axis indicates the absolute Pearson correlation coefficient (PCC) for GE-GE, CNA-GE and CNA-CNA relationships. For each distributions, y-axis indicates the proportion of gene-pairs among the total number of pair-wise relationships (GE-GE, CNA-GE, and CNA-CNA) having corresponding PCC values. Here, we show the selection of individual thresholds in the distribution using the arrows. For both data sets, several peaks were observed, but we did not find any particular reason for these peaks. Since we used a binning approach to draw distributions, the observed peaks depend on the bin size. For our convenience, we used the bin size of 0.01.             File S1 Descriptions about finding indirect relationships, and topological and data-driven properties in merging pre-modules.

(PDF)
Author Contributions