Multiscale Embedded Gene Co-expression Network Analysis

Gene co-expression network analysis has been shown effective in identifying functional co-expressed gene modules associated with complex human diseases. However, existing techniques to construct co-expression networks require some critical prior information such as predefined number of clusters, numerical thresholds for defining co-expression/interaction, or do not naturally reproduce the hallmarks of complex systems such as the scale-free degree distribution of small-worldness. Previously, a graph filtering technique called Planar Maximally Filtered Graph (PMFG) has been applied to many real-world data sets such as financial stock prices and gene expression to extract meaningful and relevant interactions. However, PMFG is not suitable for large-scale genomic data due to several drawbacks, such as the high computation complexity O(|V|3), the presence of false-positives due to the maximal planarity constraint, and the inadequacy of the clustering framework. Here, we developed a new co-expression network analysis framework called Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) by: i) introducing quality control of co-expression similarities, ii) parallelizing embedded network construction, and iii) developing a novel clustering technique to identify multi-scale clustering structures in Planar Filtered Networks (PFNs). We applied MEGENA to a series of simulated data and the gene expression data in breast carcinoma and lung adenocarcinoma from The Cancer Genome Atlas (TCGA). MEGENA showed improved performance over well-established clustering methods and co-expression network construction approaches. MEGENA revealed not only meaningful multi-scale organizations of co-expressed gene clusters but also novel targets in breast carcinoma and lung adenocarcinoma.

(PFNs) and MEGENA-derived multi-scale clusters. We evaluate interactions captured by PFNs using a series of simulated and real-world gene expression data in comparison with False Discovery Rate (FDR) based networks (FDRN). We further compare the functional relevance of MEGENA-derived multi-scale clusters and those identified by several established clustering analysis approaches. Simulated expression data as well as the gene expression data in breast carcinoma (BRCA) and lung adenocarcinoma (LUAD) from The Cancer Genome Atlas (TCGA) were used. A detailed description about the data acquisition and preprocessing can be found in S1 Text. We proceed to highlight some novel insights revealed by multi-scale clusters and then conclude the paper by summarizing the key contributions and pointing out future work.
Overview of MEGENA MEGENA consists of four major steps: 1) Fast Planar Filtered Network construction (FPFNC) by introducing parallelization, early termination and prior quality control; 2) Multiscale Clustering Analysis (MCA) by introducing compactness of modular structures characterized by a resolution parameter; 3) Multiscale Hub Analysis (MHA) to identify highly connected hubs of each cluster at each scale and 4) Cluster-Trait Association Analysis (CTA) to explore the relevance of cluster to clinical outcomes. FPFNC constructs PFN by mostly following the network embedding rationale from the PMFG algorithm. All pairs of genes are first ranked via a similarity measure quantifying respective interaction strengths and then iteratively tested for planarity to grow the embedded network that favors inclusion of pairs with larger similarities [21]. To make the PFN construction scalable for whole genome co-expression network analysis, two techniques were developed. Firstly, insignificant interactions are removed before the network embedding step by controlling the False Discovery Rate (FDR) of similarity for each gene pair. However, such a filtering may not be necessary since we will show in the subsequent section of Evaluation of PFNs that PFNs are very robust with respect to different FDR thresholds. Secondly, a parallelized screening procedure (PCP) is developed to extract a subset of gene pairs that are more likely to be embedded. Such procedures enable FPFNC to efficiently and effectively construct embedded co-expression networks by capturing significant interactions at the whole genome level.
PFN constructed through FPFNC is then input to MCA to identify multiscale clusters. MCA incorporates three distinct criteria to identify locally coherent clusters while maintaining a globally optimal partition. First, shortest path distances (SPD) [28] are utilized to optimize within-cluster compactness. Second, local path index (LPI) is used to optimize local clustering structure. Third, overall modularity (Q) [29] is employed to identify optimal partition. Specifically, MCA adopts a hierarchical divisive approach to dissect complex interactions in PFN into coherent interactomes across different resolution scales by iterating two steps, k-split and compactness evaluation. k-split identifies the clusters that lead to an optimal partition of a parent network via optimization of SPD, LPI and Q. In the step of compactness evaluation, individual clusters from k-split are compared to the parent network via a measure of network compactness defined below, where, V is the set of nodes in the network, SPD is the average of shortest path distances of all node pairs, and α is the resolution parameter. Given that the denominator log(|V|) α is the hallmark of the small-world property represented by the scaling relation d SPD $ logðjVjÞ when α = 1, ν measures the coherence of a network's topology. Therefore, a smaller α identifies more compact clusters. For a given cluster (network), MCA searches through a range of α values for a resolution scale that leads to more compact clusters than the parent cluster (network). These clusters are further split by k-split until no more compact clusters can be identified. Each split represents a finer picture of modular structure of the given PFN. The output of MCA is a hierarchy of clusters at various levels defined by α.
Finally, MHA and CTA constitute the downstream analyses in MEGENA. MHA first identifies significant hubs within each cluster with respect to an established random model of planar networks [23,[29][30][31]. The nodes that are hubs at multiple scales are called multiscale hubs. CTA evaluates the relevance of individual clusters to clinical outcomes through principal component and correlation analyses.

Scalability of MEGENA: Effectiveness of PCP
PCP is a key technique developed to speed up PFN construction to overcome the worst case O(|V| 3 ) complexity of the existing serial PMFG algorithm (See Methods for detailed discussion) [21,27]. In conjunction with correlation screening, PCP-mediated FPFNC dramatically increases its efficiency in construction whole genome co-expression network. In order to verify this, we compared PCP-mediated network embedding and the existing serial PMFG using the TCGA gene expression data that involve over 20,000 genes. We compared the acceptance rate of pairs filtered by PCP in MEGENA and that of non-filtered pairs by PMFG. The acceptance rate, defined as |E|/|E| max , where |E| is the number of edges embedded in a PFN, and |E| max = 3 (|V| -2) is the maximal number of edges embeddable in a planar network by Euler relation [32].
As shown in Fig 2, the acceptance rate by the serial PMFG algorithm quickly decreases close to 0% as the number of links in PFN reaches the maximal number of links. The finding indicates that PMFG performs exponentially increasing number of computations to embed more edges as the number of links in PFN saturates towards the maximal number. On the contrary, PCP remedies the problem by dramatically boosting the acceptance rate close to 100% as the number of links in PFN increases. These results demonstrate the effectiveness of PCP in reducing the overall computation time by leveraging parallel computation capability, and scalability of FPFNC for whole-genome co-expression network.

Evaluation of PFNs
We evaluated the performance of PFNs from multiple aspects. We first evaluated capacity of PFNs in capturing underlying regulatory interactions by comparing to golden standard networks using simulated datasets from DREAM challenge [33]. We then compared the network neighborhoods of a number of genes in PFNs with their actual targets derived from the perturbation experiments. Furthermore, we compared the global topological properties of inferred PFNs with the established hallmark signatures of complex networks.
Evaluation by simulated data. To evaluate the accuracy of PFN in capturing the underlying gene regulatory network, we leveraged GeneNetWeaver [34] to generate 10 time series datasets for each of the golden standard networks from DREAM challenge [33]. The generation of the simulated data was detailed in S1 Text. We systematically compared MEGENA-derived PFNs with those based on the state-of-art methods including ARACNE [35] and Random Forest (RF) [36] across various FDR thresholds and similarity/dissimilarity measures. Specifically, we tested Pearson's correlation coefficient (denoted as PCC), Mutual Information (denoted as MI), and Euclidean distance (denoted as euclid), and applied FDR thresholds (0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8 and 1) on each similarity/dissimilarity measure to filter out insignificant interactions. FDR thresholded interactions were then used to construct PFNs and ARACNE networks. Note that the three measures are all applicable to PFN, but ARACNE uses only MI as it assumes Data Processing Inequality (DPI) imposed by MI [35]. As RF is not dependent on any similarity/dissimilarity measure, FDR of the interactions determined by RF is estimated by permuting input data.
For each inferred network from a simulated data, we compared the weighted shortest path distances for all pairs of the nodes in the network and those in the underlying gold standard network. The node pairs connected in a gold standard network were treated as a positive class and the pairs not connected were taken as a negative class. We then calculated Area Under Curve in Receiver Operating Characteristic curve (AUC-ROC) for shortest path lengths in each inferred network.
As shown in Fig 3A, the PFNs from PCC and MI consistently outperform the RF and ARA-CNE networks across various FDR thresholds. Table 1 shows the best average AUC-ROC scores, indicating that PFNs from PCC and MI across various FDR thresholds show consistently the best performance except for InSilicoSize100-Yeast2 data set where PFNs are only slightly outperformed by RF based networks at an FDR threshold of 1. At FDR thresholds of 0.2 or less are practically used in almost all cases, PFNs from PCC and MI show the best overall performance.
To assess the stability of the performance of each method, we calculated Coefficient of Variation (CV) of the average AUC-ROC scores across different FDR thresholds. PFNs from different similarity/dissimilarity measures have the most stable performance among the tested networks across different FDR thresholds ranging from 0 to 1 (S1 Fig). The performance of PFNs peaks at FDR thresholds around 0.01 and/or 0.05 in most cases. However, PFNs from Euclidean distance have relatively poor performance in general.
Evaluation by disease-specific data. We further evaluated the accuracy of PFNs in detecting the true interactions in a disease dataset, we constructed PFN from BRCA gene expression data set from TCGA (hence, BRCA PFN). The details about the data acquisition and preprocessing can be found in S1 Text. For comparison with the BRCA PFN, we also constructed FDR thresholded network (FDRN) with FDR < 0.05 for PCC and MI. These co-expression networks were then tested for the enrichment of the siRNA knockdown signatures of key transcription factors (TFs) of breast carcinoma in MCF7 cells [37]. Of 78 TFs in the knockdown experiments, 32 TFs remained after data processing (see Data Acquisition and Processing in S1 Text), and appeared in the BRCA PFN and their siRNA signatures were used for testing the networks. Namely, the 32 TFs are: BCL2, BRCA1, BRCA2, CCL5, CCNA2, CCNB1, CDC20, CDC25A, CDC25B, CDKN2A, CEBPB, CEBPD, CENPE, CENPF, CHEK1, E2F1, E2F5, ERBB2, ESR1, FOS, FOXC1, GATA3, HIF1A, HOXB7, ID1, MYBL2, MYC, PAX3, SKP2, STAT1, TOP2A, and WT1. Two differentially expressed gene (DEG) signatures for each experiment were identified based on T-test p value < 0.05 and fold changes ! 1.3 or 1.5. Two different fold change cut-offs were chosen to give a more comprehensive performance.
For each co-expression network, we tested the enrichment of a TF's DEG signature in its llayer network neighborhoods (l = 1,. . .,l max , where l max is the largest shortest distance between the given TF and the other nodes in the network). The optimal layer l optimal was chosen based on FET p-value that shows the most significant enrichment of the DEG signature. The final FET p-values were adjusted for multiple testing. In the case of MI, Fig 3B and 3C show the number of signatures enriched in the BRCA PFN and FDRN at various thresholds for the corrected FET p-values and enrichment fold changes, respectively. Fig 3D and 3E show the results for the PCC based coexpression networks. Clearly, these experiment-derived gene signatures are far more significantly enriched in the PFNs than in the FDRNs, implying that the PFNs constructed by MEGENA can better capture true gene regulatory relationships using either PCC or MI.
Evaluation by topological characteristics. Degree distributions and diameters of PFNs were computed since these are the key network topological characteristics representing the hallmark features such as scale-free and small-world networks [28]. We present the PFNs constructed from PCC in this section.  its clusters at α = 1. We reserve "k" to denote the node degree only in this section, but use k to denote the number of clusters in the other sections.
As shown in Fig 5A, the BRCA PFN is scale-free by following a typical power-law degree distribution with exponent γ 3, consistent with the frequently observed range of exponents, i.e., 2 γ 3, in real-world complex networks [28]. However, the LUAD PFN does not exhibit the characteristics of scale-free degree distribution across all k though the distribution between 3 k 50 is scalefree (Fig 5B). The decaying tail at k ! 50 in the LUAD PFN shows the characteristics of exponential distributions. The diameters of the BRCA and LUAD PFNs arẽ 11.3, which is in accordance to the hallmark feature of small-world networks with diameters around log(|V|).
In summary, these PFNs possess the hallmark features of complex networks. The qualitative difference between the degree distributions of the two PFNs demonstrate a broad range of network characteristics from scale-free to exponential distributions [23]. These results support the breadth of topological diversity in embedded networks such as PFNs. Several studies of statistical mechanics of embedded networks on topological sphere showed that those networks can possess either exponential degree distribution [31], or scale-free degree distribution with various exponents [23] in thermodynamic limits, depending on the underlying evolutionary dynamics.

Evaluation of Multicale Clusters
We further compared the clusters derived from MEGENA and those identified by other established clustering and network inference approaches using the TCGA BRCA and LUAD gene expression data (see Data Acquisition and Preprocessing in S1 Text for description of BRCA and LUAD data). Specifically, we considered two other types of coexpression networks including weighted co-expression networks (WGCN) and unweighted coexpression networks (FDRN, based on the links at FDR < 0.05) [14,16] (see S1 Text for details) and three established clustering techniques including infomap [38], walktrap [39], and leading eigenvector based spectral clustering [40]. Note that these clustering methods detect coherent clusters in complex network by optimizing for Newman's modularity Q. Two different similarity measures, MI and PCC, were used for constructing coexpression networks.
Weighted co-expression network analysis (WGCNA) uses its own clustering method which is not suitable for un-weighted networks like PFNs and FDRNs. Towards this end, we compared MEGENA (as a combination of PFN and MCA), WGCNA and the following 6 combinations of networks (PFN and FDRN) and clustering methods (infomap, walktrap, leading significantly enriched signatures. C shows enrichment fold change cut-off against the number of significantly enriched signatures. D-E. Comparisons of BRCA TF knock down signatures on inferred networks from PCC. D and E correspond to FDR corrected FET p-values and enrichment fold changes, similarly to B and C. As there are a few oncogenic signatures available in LUAD, the evaluation of LUAD networks is less comprehensive than that of the BRCA networks. Therefore, we focus on the results from the BRCA data in the main text and report the results from the LUAD data in S1 Text. Functional analysis of multiscale clusters. We collected a large number of gene sets associated with known molecular functions and pathways from Molecular Signature DataBase (MSigDB) across GO-BP (Gene Ontology-Biological Processes), GO-CC (Gene Ontology-Cellular Components), GO-MF (Gene Ontology-Molecular Functions), KEGG (Kyoto Encyclopedia of Genes and Genomes) and REACTOME (Reactome database) categories. The significance of the overlap between a given cluster and an annotated gene set was calculated by the Fisher Exact Test (FET), and corrected for multiple testing via Bonferroni correction for the total number of comparisons (i.e. number of clusters Х number of gene sets tested). The detailed results from the enrichment analysis were included in S1 Data.
As shown in Fig 6A and 6B, the multiscale clustering analysis (MCA) has the best performance since the resulting clusters are enriched for the largest number of the annotated gene sets with respect to all significance levels in both MI-and PCC-based networks. More importantly, the MCA-derived clusters show the largest fold enrichment of the BRCA oncogenic signatures. Similar results are observed in PCC-based LUAD networks (see S3A and S3B Fig).
MCA consistently outperforms the established co-expression network analysis method, WGCNA in both the BRCA and LUAD cases.
Interestingly, the methods that directly optimize for Q (infomap and eigenvector) consistently show better performance on the FDRNs than the PFNs. Q assumes that the random . The x-axis is the logarithm of degree k and the y-axis is the logarithm of inverse cumulative degree distribution, P(k' > k). Red straight line is fitted distribution for P(k^'>k)~k^(γ+1), where γ is the estimated exponent of the underlying degree distribution. Respective γ value is displayed at the top. networks follow the configurational model, where edges are shuffled while maintaining the degree sequence [29]. In other words, the underlying random model of Q assumes that the chance of connecting two nodes by an edge is equal for all pairs to 2|E|/(|V|(|V|-1)), regardless of the degrees of connected nodes. This implies that there is no correlation between the degrees of the nodes sharing an edge in the random network model. However, this assumption does not hold for geometrical networks such as PFNs. We have previously shown that, these random geometrical networks do possess significant degree correlations [23], therefore optimization of Q alone cannot properly address the optimal partition of PFN. On the other hand, FDRN is relatively free from these constraints, and therefore the resulting clusters by optimizing for Q in FDRN reflect the underlying biology better than the case of PFN. On the other hand, the walktrap method resolves this problem by leveraging local random walks instead of Q. Particularly, the walktrap method on the PFNs outperforms the infomap and eigenvector methods on the FDRNs. Altogether, the results imply that the modular structures in PFNs can be better identified by clustering methods that capture local clustering structure, supporting the use of LPI in MEGENA, as described in Methods.
Prognostic analysis of multiscale clusters. We then examined how each cluster identified from various approaches is associated with the overall survival based on principal component analysis (for details, see Section 4 Cluster-Trait Association Analysis in Methods). As shown in Fig 6C, MCA (multiscale) identifies the largest number of significant clusters across a wide range of FDR corrected Cox p-value thresholds, and this is commonly observed in both MIand PCC-based networks. Although there are relatively more clusters from MCA than other clustering methods due to the hierarchical divisive nature, several clusters from MCA are most predictive of survival. This outstanding performance is also observed in the case of PCC-based LUAD data set (see S3C Fig).
We further identified the clusters that were not enriched for any functions/pathway signatures based on Bonferroni corrected FET p-value < 0.05 and the odds ratio > 1. As these clusters have unknown functions, they are termed as unknown-functional clusters (UNC). The UNCs from various approaches are listed in S2 Data. We then checked how many of the UNCs identified by each approach were predictive of survival at various thresholds for FDR corrected Cox p-values. As shown in Fig 6D, MCA identified the most number of UNCs that were predictive of survival. Such an unambiguous trend was observed in both PCC-and MIbased networks.
In summary, MEGENA as a combination of PFN and MCA identifies the largest number of functional clusters across wide range of cluster sizes in both BRCA and LUAD datasets (see S4 Fig). In particular, MEGENA outperforms WGCNA in terms of more significant enrichment for known pathways and better predictive power of survival.
Highlight of an adipocytokine-enriched cluster in BRCA. To highlight the findings by MEGENA, we identified 9 functionally annotated clusters (FACs) detected only by MEGENA. These clusters are significantly enriched for some GO-BP/KEGG/REACTOME gene sets that are not enriched in any clusters detected by any other aforementioned approaches based on a threshold of 0.05 for the multiple-test (Bonferroni) corrected p-values. The 9 FACs and their enriched gene sets are shown in S1 Table. Among these MEGENA-specific FACs, one cluster (comp1_56) is enriched for the genes in adipocytokine signaling pathway (corrected FET p = 0.02, 2.7 fold). The hub genes of this cluster are all significantly associated with the overall survival of the BRCA Luminal-B patients. Luminal-B, a molecular subtype of hormone-receptor positive breast cancers, is associated with higher grade and increased proliferation rate, and has a poorer overall prognosis than its hormone-receptor positive counterpart, Luminal-A [41]. Fig 7A and 7B show the localization of genes with univariate Cox p-value < 0.05 for the Luminal B patients' overall survival at the adipocytokine-enriched cluster. The hub genes of this cluster are all predictive of the Luminal B patients' overall survival: AQP7 (Cox p-value < 1.5e-3), C14orf180 (Cox p-value < 4.4e-3), CIDEC (Cox p-value < 1.6e-3), CIDEA (Cox p-value < 2.1e-2) and MRAP (Cox pvalue < 2.2e-2). We further examined the significance of the survival difference between expression median defined subgroups for each hub gene. Fig 8 shows the Kaplan-Meier plots for the two most predictive hub genes, AQP7 and CIDEC.
. This is in agreement with the previous findings that AQP7 expression was down-regulated in breast tumor [42] and CIDEA and CIDEC are cell death-inducing DFFA-like effectors to activate apoptosis [43]. Particularly, CIDEA and CIDEC are involved in adipose tissue loss in cancer cachexia [44], an important, negative prognostic marker that has been linked to systemic inflammation and cell death [45].
In summary, these findings suggest that adipocytokine signaling pathways may play an important role in Luminal B subtype of breast cancer though the exact mechanisms need further experimental validation. Interestingly, the expression of the hub genes in the adipocytokine cluster/subnetwork were not significantly associated with the overall survival outcome of the Luminal A patients. Therefore, MEGENA is capable of capturing finer-scale functional subnetworks to stratify a breast cancer subtype into the subgroups with prognostic significance.

Multiscale Organizations in PFNs
In this section, we will explore the multiscale clustering structures in PFNs constructed by MEGENA. Here, we mainly focus on the PCC-based BRCA PFN as it showed slightly better performance than the MI-based network (Fig 3B-3E).
Multiscale organization of functions/pathways. We performed MHA on the BRCA PFN to identify the groups of scales that had similar interaction patterns and shared highly connected hubs across different scales. Six distinctive scale groups were identified: S1 (0.   To further understand the biology of the clusters at the different scales, we annotate each cluster with its most enriched function or pathway and then build up a cluster hierarchy based on the parent-child relationships (e.g., a child cluster is a subset of the parent cluster). As shown in Fig 9B,  Identification of novel key drivers via MHA. Having validated that some known interactions are present in the BRCA PFN via the oncogenic signatures, we hypothesize that hub genes of the BRCA PFN are more likely to be key drivers of BRCA etiology, and further explore biological significance of novel key drivers of the network.
To verify biological significance of hub genes detected by MHA, we compared expression fold changes of network hubs between different cancer stages, to those of the non-hub genes. We first identified the hub genes at each scale, and then intersected the hub gene sets at the different scales to identify a more stringent hub set, denoted as "multiscale hubs". We then evaluated the significance of the difference between the fold change distributions from each hub set and the corresponding non-hub genes using the Kolmogorov-Smirnov (KS) test. In many cases, the hubs show significantly higher expression fold changes than the non-hub genes, and the average fold changes are greater for hubs at larger scales. Particularly, the multiscale hubs show the highest average fold changes, suggesting that they may be transcriptomic drivers of breast cancer progression. There are 14 multiscale hub genes including ROPN1, TPX2, TEKT1, FOXA1, ESR1, CCNT1, SDPR, THSD4, MLPH, TBC1D9, FOXC1, SPDEF, SFRP1, and AR. Many of these genes are known to play important roles in breast cancer etiology. For instance, AR and ESR1 are well-established endocrine receptors in breast cancer [46,47]. FOXA1 is a pioneer transcription cofactor for ESR1, opening chromatin containing ESR1 target genes [48]. In ER-breast cancers, FOXA1 may promote androgen signaling and may contribute to the development of resistance to anti-androgen therapy [46]. SPDEF, a target gene of ESR1, overexpressed in breast and other solid tumors, was shown to be associated with worse outcomes in patients with ER+ breast cancers and to be critical for the survival of ER+ breast cancer cells in vitro [49]. SPDEF was upregulated in ER + breast cancer cell models of estrogendeprivation resistance and tamoxifen resistance [50]. TPX2 is an established regulator in spindle assembly and DNA damage response across many solid tumors [51], and indeed is a key regulator of the cell-cycle cluster identified by MEGENA. FOXC1, a prognostic biomarker of basal-like breast cancer patients [52] is a key regulator of NF-kappaB signaling pathways in basal-like breast cancer cells [53] and promotes breast cancer invasion [54]. SFRP1 is a known inhibitor of Wnt pathway and tumor suppressor gene, which is epigenetically silenced in a variety of tumors including breast cancer [55].
This multiscale hub set also includes a number of novel genes as promising targets of breast cancer for further studies. One example is ROPN1, which is an immediate neighborhood of The numeric labels on x-axis represent the ranges of α values defining the resolution levels of the hubs, "multiscale" represents intersection of hub genes across different scales, and "non.hub" represents the rest of genes. FOXC1 and FOXA1. ROPN1, also known as ropporin, is a cancer-testis antigen [56] and a potential immune-therapeutic target for multiple myeloma as Chiriva-Internati et al. generated human leukocyte antigen class I-restricted cytotoxic lymphocytes to kill autologous multiple myeloma cells [57]. ROPN1 is significantly associated with the overall survival for all the BRCA patients in TCGA (Cox p-value < 2.6e-2, logrank p-value < 5.8 e-2), PR+ (Cox pvalue < 9.7e-5, logrank p-value < 3.0e-4), and ER+ (Cox p-value < 3.3e-4, logrank pvalue < 2.7e-4). Higher expression of ROPN1 was associated to better prognosis. Furthermore, ROPN1 is mostly down-regulated in tumor samples in comparison to normal samples (FDR corrected test p-value < 8.4e-14, the ratio of the average expression in the tumor samples to that in the adjacent normal samples = 0.14).  between the ROPN1-low and -high groups for all, ER+ and PR+ patients. Moreover, ROPN1 is predictive of survival of the Her2 subtype identified by PAM50 biomarkers (Cox pvalue < 2.2e-3) [58]. These results suggest ROPN1 as a desirable immunotherapeutic target for further validation.
We also performed MHA on the PFN clusters based in the TCGA LUAD data set to verify the findings in the BRCA data set shown in S7 and S8 Figs. We again observed a similar pattern to the BRCA PFN, i.e., the multiscale hubs of the LUAD PFN have significant expression fold changes. TPX2, C16orf89 and GJB3 emerged as the multiscale hubs present at all scale groups in the LUAD PFN. Notably, GPR116 and HOPX directly connect with C16orf89 in the LUAD PFN and are hubs at several scales. In particular, HOPX (HOP homeobox) is a lineagespecific transcriptional regulator of differentiation and has been shown to control the fate of LUAD progression where the cooperative expression of HOPX with GATA6 limits metastatic competence of LUAD cells [59]. GPR116, an essential regulator of lung surfactant homeostasis, was shown to play a crucial role in preventing alveolar collapse through its ability to reduce surface tension [60,61]. We further showed that the DEG signatures from HOPX and GATA6 double-knockdown in the lung adenocarcinoma cell lines [59] and GPR116 knockdown in the murine type II alveolar epithelial cells from the Gpr116 knockout mice (GSE41417) [62] were significantly enriched in HOPX and Gpr116's neighborhoods, respectively, (see S9 Fig). Altogether, these results suggest C16orf89 as a novel therapeutic target in preventing lineage-specific metastasis in LUAD for murine type II alveolar epithelial tissue.

Discussion
We developed a novel framework, Multiscale Embedded Gene Co-expression Network Analysis (MEGENA), to infer gene co-expression networks, by implementing a parallelized algorithm for embedding co-expression networks on topological sphere and a new clustering analysis algorithm to detect coherent clusters at various compactness scales. MEGENA constructs a co-expression network by enforcing an objective criterion of "embeddability" of candidate connections on topological sphere, and thus reduces the inherent redundancy of pair-wise interactions. MEGENA-derived networks possess the hallmarks of complex networks [21,23,31]. Application of MEGENA and the state-of-the-art network inference approaches to the simulated data with gold standard networks shows that the MEGENAderived PFNs have the best and most stable performance. The outstanding performance of MEGENA was further demonstrated in the significant overlap between the siRNA knockdown signatures of a large number of functionally important TFs and their inferred network signatures.
We showed that the novel multiscale clustering technique in MEGENA can identify biologically more meaningful and relevant coexpressed gene clusters than established network clustering methods such as infomap, walktrap, leading eigenvector spectral clustering, and WGCNA.
We highlighted the novel insights from the multiscale approach to decipher gene-gene interaction networks. Key drivers/hubs of MEGENA-derived networks were further identified by MHA, a key procedure in MEGENA. We identified ROPN1 and C16orf89 as the novel candidate drivers for BRCA and LUAD, respectively.
As an alternative approach to analyzing big Omics data, MEGENA has demonstrated competitive performances and will have a great potential in unraveling novel pathways and key regulators in complex diseases. There are rooms to further improve MEGENA. Currently, we are extending the network construction algorithm to higher genus to account for more complex interaction patterns as hyperbolic surfaces with higher genus are able to accommodate for more complexity [31].

Fast Planar Filtered Network Construction
A key component of MEGENA is the construction of Planar Filtered Networks (PFNs). Here we developed a new procedure named FPFNC to substantially improve the existing PMFG in terms of efficiency and scalability. Specifically, we first introduced a parallelization process for testing planarity, and then implemented early termination options to construct 'nearly maximal' embedded networks which prevent inclusion of less informative but computationally expensive links. The procedure for constructing a PFN is detailed below.
Compute similarities between gene expression profiles. Given a gene expression dataset with N genes and M samples, we first compute the similarity between any two genes. A number of similarity measurements such as correlation, Euclidean distance and mutual information can be employed to compute the similarity between expression profiles. MEGENA can take as input similarities from any similarity measurement. Comparison of various similarity measurements is beyond the scope of this paper. Gene-gene similarities are then filtered by False Discovery Rate (FDR) to minimize the impact of false positives. FDR is computed by permuting gene expression matrix across the samples (global FDR), or by directly calculating pairwise nominal p-values by Fisher's Z-transformation. In the current implementation of FPFNC, we set FDR < 0.05 as the default threshold for filtering similarities.
Construct PFN. The existing PMFG algorithm embeds an input network onto a topological sphere by the following steps [21] (Fig 12)

i)~iii) is repeated in a serial manner until the maximal number of edges |E max | = 3(|V o |-2) is reached, and results in
The resulting network G f is an embedded network on a topological sphere where every edge can be drawn without crossing another.
To speed up the planarity test, we designed a parallelized screening procedure (PCP) to extract a subset of gene pairs that are more likely to be embedded. The underlying premise of PCP is explained as follows.
Intuitively speaking, we are utilizing the fact that any subnetwork of a planar network is always planar. Therefore, if ij is one of edges included in the final network G f , the combination of any subnetwork in G f or G o with the edge ij must be embeddable since the combination is also a subnetwork of a planar network. This allows to use G o as a platform to test the quality of N c individual pairs prior to the link embedding step (the step (iii) in PMFG) in PCP where each of N c pairs are combined with G o and tested for planarity in parallel. The qualified pairs after PCP proceed to the serial embedding step in PMFG, finally yielding to the updated network G f . This process is also shown in Fig 12. Currently, we implemented N c = 1000 x N core where N core is the number of cores used in parallel computation. Furthermore, we have set PCP to start when the acceptance rate of each edge into G f falls below 10%, where acceptance rate is defined as the number of accepted pairs by the number of tested pairs. We recommend that users follow the current parameters as we show that PCP with current parameters improves the overall computation time effectively as discussed in Section 1 in Results.
Terminate PFN construction. Early termination conditions are set up to further bypass unnecessary computations to embed less informative pairs that are not filtered out by Step 1.1 while keeping sufficient information for network construction. By setting up reasonable termination options, the resulting PFN still harbors sufficient information. The construction process will be terminated if one of the following conditions is satisfied: (i) a network is maximally embedded (identical to PMFG when Step 1.1 is bypassed) [21]; (ii) an embedded network reaches a certain saturation where the mathematically permitted maximum number of links is 3(|V|-2); (iii) the number of rejected pairs per one embedded link reaches a certain threshold. For the condition (iii), we set the threshold for the number of rejected links as 20|V|, corresponding to FDR < 0.05. Empirically, we found out that the condition (ii) with 90~95% of the

Multiscale Clustering Analysis (MCA) on PFN
As most biological networks exhibit highly modular and yet hierarchical organizations [3,7], we next set about to identify coherent modular structures in PFNs. It has been well-known that characterization of the organization patterns in complex networks cannot be done by a single perspective, but requires a combination of multiple distinctive and diverse features [8].
Measures of compactness and local clusteredness. We perform multiscale clustering analysis of PFNs by optimizing a number of key network features including within-cluster compactness, local clustering structures, overall modularity, and other network-theoretic measures.
Specifically, within-cluster compactness is measured via the shortest path distances (SPD) on PFN [28], local clustering structure via Local Path Index (LPI) [63], and overall modularity via Q [29]. SPD is defined as distances of shortest geodesic paths among all nodes in a given network [64], thus can be naturally used to represent the degree of compactness for a given set of nodes by summarizing overall distances between the nodes on the network. However, SPD tends to over-emphasize hierarchical organization of nodes, and thus ignores clustering structures [65].
In order to mitigate this problem, we chose LPI to incorporate local clustering structures in PFN. LPI is defined as A 2 + A 3 , where A is the binary adjacency matrix of the network, is a free parameter, and [A n ] ij is the number of paths of n steps linking nodes i and j [63]. LPI is a quasi-local measure that evaluates structural similarity between two nodes on a network by accounting for all paths of lengths 2 and 3 between any two nodes. LPI with = 0.01 in the context of link prediction has been shown to perform superior to some of already established network theoretic similarity measures in many real-world cases [63,66]. Particularly, the abundance of 3-cliques and 4-cliques in PFN works synergistically with LPI for the following reasons: a) a majority of the 2-and 3-step paths are confined within these cliques, and b) organization patterns of adjacent 3-cliques and 4-cliques provide rich clustering information of PFN [22,23,27], and c) the adjacency information can be readily captured by LPI by paths that span between different cliques. Furthermore, the computation time for LPI is linear with the size of network [63], making it suitable for large-scale co-expression network analysis. Lastly, we leveraged the capability of Q to evaluate significance of within-cluster connectivity in tandem with between-cluster connectivity to guide proper splitting of networks into sub-clusters.
We introduced a network compactness measure ν(α) as a function of a resolution parameter α, to capture the cohesiveness of modular structures in PFN at different resolutions (or scales) defined by α. Particularly, we leveraged geometrical characteristics of PFN that retains similarity between two nodes by means of geodesic path distance such that coherent clusters exist as connected subgraphs with higher local clustering coefficients [21]. Furthermore, these geometrical networks exhibit a broad spectrum of compactness characteristics from loose-world to semi-ultra small-world [23], and thus they allow different degrees of compactness to co-exist. Fig 1B presents an overview of the procedure in MCA and the details of MCA are described in the following subsections.
Network split: k-split. At each iteration, a nested split is performed on each cluster to derive an optimal split with respect to the initial cluster via k-medoids clustering which detects k optimal clusters by minimizing the SPD based intra-cluster distance [67]. However, SPD based evaluation of the partitions at each k does not take into account clusteredness of local topology, leading to misclassification of the nodes at the boundaries between adjacent clusters.
To remedy this problem, we designed a procedure to update the boundaries of the clusters derived from a partition at each k. Boundary nodes are those with immediate neighbors from at least two different clusters. Let V bnd be a set of boundary nodes. Then, for a node i V bnd , its cluster membership to V l is defined via the following equation, where membership of i is assigned to V l . The Eq (1) assigns a given node i to a new cluster in which the node i has the maximal number of interactions as determined by LPI. This boundary detection procedure (BDP) is iterated until there is no change in cluster membership, therefore leading to a stable partition. This process is illustrated in Fig 13 in the step "update boundary", showing a toy example where misclassified red nodes in "Before" panel are correctly assigned in the "After" panel after BDP.
As network split requires determination of the optimal k value, we utilized an established measure of clusteredness on networks, Newman's modularity, Q [29]. Specifically, we repeat the k-medoids partition and BDP for a range of k values until no further optimal solution by Q is available in the interval [k'-dk, k'], where dk is set as 10. In other words, we search for the optimal partition determined by Q within 10 further partitions, and repeat the search until no better solutions are detected. Although we do not have the data for systematic evaluation on setting the appropriate dk value, we did not observe much differences in resulting clusters in the exemplified BRCA and LUAD PFNs for dk ! 10, and dk = 10 often succeeds to explore up to k % 60 in these cases. Given that the clustering is a hierarchical divisive method and 2 dk 6 are often the number of clusters explored for each split in hierarchical divisive methods [15,40], we recommend dk = 10 is sufficient to effectively identify these clusters.
Identification of significantly compact sub-clusters. Upon splitting a network G o (V o ,E o ) into clusters, each cluster is evaluated by the compactness measure defined as the average shortest path distance within a cluster V l , normalized by the logarithm of the cluster size, where α is a scaling parameter to control the degree of compactness. Normalization by log(|V|) α is in accordance to the scaling effect from the diameter, d SPD, of geometrical networks such as PFNs. In the thermodynamic limit of |V| ! 1, geometrical networks can be modeled by means of a stochastic network model where small clusters of "bubbles" aggregate to give characteristic features of real-world complex networks, i.e., d SPD $ logðjVjÞ a , where α ! 1, a hallmark feature of small-world property [23]. However, d SPD does not reflect network structural information since it relies on the minimum of the longest pair-wise shortest path lengths. To mitigate this, we utilize SPD to incorporate all information between all pairs of nodes, and take the scaling effect from d SPD as an upper bound of network compactness. For a given network subject to split, a split is obtained by maximizing for Q that is independent of α. Then, at a given α, the following criterion determines the acceptance or rejection of the split: Accept : if n o > u l for any l; Reject : if n o < u l for all l; ð4Þ ( where, ν o is the compactness of the parent cluster. The criterion implies that, for a given α, a split that results in at least one cluster l with improved ν l in comparison with G o is considered as valid.
In order to efficiently search for valid splits across all α, we leveraged the independence between the split detection via Q, and the split acceptance criterion. For a given network G o , we define a characteristic α value as the one that can identify a more compact cluster l than G o : Calculation of a c l for all the clusters from a split, allows us to identify α values at which the split is accepted according to the criterion (4). Flow chart of the clustering analysis procedure for each value of compactness resolution parameter, α. The upper panel illustrates the k-split procedure within each cluster to detect optimal sub-clusters. The lower panel describes the compactness evaluation procedure (CEP) after k-split. CEP compares the parent cluster prior to k-split with the sub-clusters after k-split by means of the compactness measure, ν l , and updates the partition accordingly. On the left, each step is illustrated by a graphical toy example. From the top, the pictures correspond to: the initial network subject to clustering, correct classification of boundary nodes by BDP (Before: before BDP, After: correction after BDP), identification of the optimal k via modularity Q k , final clusters, and comparison between initial network and sub-clusters via compactness. These steps are iterated for all clusters from the newly updated partition until no further update can be made. Individual sub-clusters from an accepted split are evaluated by calculating the statistical significance of ν l at a scale a ¼ a c l . Specifically, a random PFN is first generated by shuffling the link weights of the parent cluster. Then, |V l | nodes are sampled for 100 times from the random parent, and the compactness values for these random networks, n 0 l , are calculated to estimate significance p-value of ν l . Random node insertion moves (namely T2 moves) are utilized to generate random PFNs which exhibit scale-free degree distributions and small-worldness in thermodynamic limits [23].
Termination of algorithm. The search for sub-clusters is performed iteratively until no valid sub-clusters can be further identified under the following conditions: -no sub-clusters can be further identified as more compact than respective parent clusters at any α, or -no further sub-clusters show significant compactness (P>0.05).

Multiscale Hub Analysis (MHA)
We developed a Multiscale Hub Analysis (MHA) procedure to identify highly connected nodes at each scale defined by α and across all the scales. MHA identifies the nodes with significantly high connectivity within each significant clusters previously identified through the following steps: 1) Group the scales that show similar within-cluster connectivity patterns, 2) Identify hubs at each scale, and 3) identify multiscale hubs by combining significance scores of individual nodes across all different scales. The procedure is detailed in the following subsection.
Grouping similar scales. At each scale α, we define within-cluster connectivity of node v i as where V a l is the set of nodes in the cluster l at the scale α, and A(ν i ,ν j ) = w is the adjacency matrix denoting the weight of link connecting v i and v j . Combining c w (ν i , α) across all nodes and α values, we obtain C w (V,A) where V is the set of all nodes in PFN, and F is a set of α values that have at least one significant split.
Then, k-medoids clustering is performed for each α 2 F. The optimal number of clusters, k, is then estimated by summarizing results from multiple established internal validity indices since a single validity index may emphasize only a specific criterion and may not provide good quality clusters [68]. Specifically, we utilized a number of internal validity indices such as average silhouette width, Normalized Gamma statistics, Dunn's index and separation index [68,69] to evaluate quality of clustering results across k 2 [2,|A| − 1].
We then summarized the results by calculating combined normalized ranks across these cluster quality indices by the following formula, where M is the set of cluster quality indices, rank(k, m) is the rank of k by the cluster quality index m, and score(k) is the summarized score of k. Fig 14 shows the grouping process in analyzing the BRCA PFN where 6 distinct scales were identified.
Identification of hubs at each scale. For each subnetwork at a scale α, n s random networks are generated as described previously in MCA description. The significance of withincluster connectivity c w (ν i ,α) is then evaluated by taking the average number of nodes that have higher connectivity than c w (ν i ,α) from n s random networks with n s = 100 as default.
Identification of multiscale hubs. We adopt Fisher's inverse Chi-square approach to compute the combined statistics for each node i across α values grouped together, termed a scale group. That is, where, p ðaÞ i is significance p-value of c w (ν i ,α), and A l is the scale group including the set of α values grouped together by clustering for C w (V,A). In order to evaluate the significance of S i , we generate the null distribution of S i by randomly shuffling for α and i in the equation above. This shuffling is performed N s times to generate stable null distribution, where N s = 100 by default. Nominal p-values for each of S i is calculated from the null distribution, and is corrected for multiple testing by Bonferroni correction for the number of nodes in the PFN. We have set Bonferroni corrected p-value < 0.05 as the default threshold to identify significant hubs for each scale group A l .

Cluster-Trait Association Analysis (CTA)
To relate each cluster with clinical outcomes, principal component analysis (PCA) is first performed for each cluster and then the correlation between the first (or multiple) principal component(s) and each trait is computed as cluster relevance to the trait.
For patient survival data, the association is examined by multivariate Cox proportional hazards regression model that regresses patient survival onto the first (or multiple) principal component(s) of a given module, and Cox p-value is calculated to evaluate the significance. To further investigate the prognostic power of each cluster, logrank p-value is calculated to characterize the difference between the survival curves of two molecular subtypes defined by the median expression of the first PC of each cluster. The logrank p-values and Cox p-values are then corrected for multiple testing by Benjamini-Hochberg FDR correction.

Computational Complexity Analysis
Among the four major steps of MEGENA including FPFNC, MCA, MHA and CTA, FPFNC is most time consuming. The complexity of the existing serial PMFG algorithm has a complexity of O(|V| γ ), 2 γ 3, where the worst case of O(|V| 3 ) is due to performing O(|V|) Myrvold-Boyer planarity test on O(|V| 2 ) correlation pairs. FPFNC circumvents this problem by reducing the number of correlation pairs subject to the planarity test by means of testing significance of every correlation pair, taking O(|V| 2 ). Assuming that a certain threshold such as FDR < 0.05 leaves a faction of nodes correlation to every node, we can approximate the number of remaining pairs to construct PFN as |V|. Combining these two, the overall complexity is O('|V| 2 ), which is a substantial improvement over the previous algorithm. Furthermore, the parallelization via PCP with several cores allows to handle for the multiplicative factor ', leading to O(''|V| 2 ) with ' > '' ! 1. As a result, FPFNC achieves a scalable computation of PFN of |V|~20,000 with moderate computational resources within a few days, while the exiting serial PMFG algorithm takes over a week to handle a network with |V|~5000. For instance, using FPFNC on 16 cores (3.5 GHz Intel Ivy Bridge), the LUSC PFN with |V| = 20523 (Fig 2A and 2B) was constructed in less than 36 hours and the THCA PFN with |V| = 16639 (Fig 2C and 2D) took less than 18 hours. Construction of such large-scale PFNs is not feasible for the existing serial PMFG algorithm as we estimate that it will take over a month.
MCA is governed by computation of shortest-path distance (SPD) for all pairs of nodes, and iterative k-medoids clustering. We adopted the Bellman-Ford algorithm to compute SPD [70] which has a computation complexity O(|V||E|). Given |E| 3(|V|-2) in embedded networks on surface with g = 0, the time complexity of computing SPD is O(|V| 2 ). SPD is calculated for multiple times in MCA. It is calculated first from the global PFN as the input dissimilarity matrix for k-medoids clustering, and then from multiple random planar networks to calculate statistics for cluster compactness to evaluate each split. Therefore, the initial computation of SPD from PFN dominates the running time since computing SPDs for candidate clusters become relatively negligible due to dramatic decrease in cluster size. Therefore, the overall complexity involving all SPD calculations becomes O(|V| 2 ), where corresponds to the number of clusters with sizes comparable to the PFN. Additionally, the computational complexity of k-medoids clustering is O(|V| 2 /k) [67], where k is tested from k = 2,. . .,k max with k max reaching around 50 in practical cases with current implementation of k-split. Therefore, the overall time complexity of MCA is O('|V| 2 ). In the current implementation, the overall computation time of MCA for a PFN with |V| = 15402 took less than 2 hours on a single core (3.5 GHz Intel Ivy Bridge).
Lastly, the computational complexity of MHA is dictated by calculation of significance of within-cluster connectivity for each node, across a range of α values. Given that we generate n s (= 100 by default) random planar networks for each unique cluster, and we calculate degree of each node for each random network, the time complexity for performing the statistical test for each cluster is O(|V l |), and for all clusters is O(S l |V l |). Since S l |V l |~|V| where is the mean number of instances that a single node appears in different clusters, the overall time complexity for MHA is fairly linear with |V|. Indeed, MHA for the BRCA and LUAD PFNs in this manuscript took only few minutes.
Overall, the computation complexity of MEGENA is O(β|V| 2 ), where β largely depends on the number of cores to perform parallelized computations. The space (memory) complexity of MEGENA is O(|V| 2 ) due to a |V|x|V| similarity matrix. Based upon 16 cores of 3.5 GHz Intel Ivy Bridge, FPFNC just needed less than 3 hours to construct the BRCA and LUAD PFNs with 6999 and 7562 nodes, respectively while the existing PMFG algorithm took over a week. The whole MEGENA took less than 4 hours for both cases.
Supporting Information S1 Data. Supporting data including the results from comparing Molecular Signature Data-Base (MSigDB) gene sets from Gene Ontology (GO) collection and pathway databases (KEGG and REACTOME) via Fisher Exact Test (FET), testing for over-representation (i.e. odds ratio > 0). (TXT) S2 Data. Supporting data including the results from comparing Molecular Signature Data-Base (MSigDB) gene sets from Gene Ontology (GO) collection and pathway databases (KEGG and REACTOME) via Fisher Exact Test (FET), testing for under-representation (i.e. odds ratio < 0). (TXT) S1 Text. Supporting information describing data acquisition and quality control, the established clustering methods in comparison with MEGENA, and the results from applying MEGENA to the TCGA LUAD dataset.