Identifying Network Perturbation in Cancer

We present a computational framework, called DISCERN (DIfferential SparsE Regulatory Network), to identify informative topological changes in gene-regulator dependence networks inferred on the basis of mRNA expression datasets within distinct biological states. DISCERN takes two expression datasets as input: an expression dataset of diseased tissues from patients with a disease of interest and another expression dataset from matching normal tissues. DISCERN estimates the extent to which each gene is perturbed—having distinct regulator connectivity in the inferred gene-regulator dependencies between the disease and normal conditions. This approach has distinct advantages over existing methods. First, DISCERN infers conditional dependencies between candidate regulators and genes, where conditional dependence relationships discriminate the evidence for direct interactions from indirect interactions more precisely than pairwise correlation. Second, DISCERN uses a new likelihood-based scoring function to alleviate concerns about accuracy of the specific edges inferred in a particular network. DISCERN identifies perturbed genes more accurately in synthetic data than existing methods to identify perturbed genes between distinct states. In expression datasets from patients with acute myeloid leukemia (AML), breast cancer and lung cancer, genes with high DISCERN scores in each cancer are enriched for known tumor drivers, genes associated with the biological processes known to be important in the disease, and genes associated with patient prognosis, in the respective cancer. Finally, we show that DISCERN can uncover potential mechanisms underlying network perturbation by explaining observed epigenomic activity patterns in cancer and normal tissue types more accurately than alternative methods, based on the available epigenomic data from the ENCODE project.


Introduction
Genes do not act in isolation but instead work as part of complex networks to perform various cellular processes. Many human diseases including cancer are caused by dysregulated genes, with underlying DNA or epigenetic mutations within the gene region or its regulatory elements, leading to perturbation (topological changes) in the network [1][2][3][4][5][6][7]. This can ultimately impair normal cell physiology and cause disease [8][9][10][11]. For example, cancer driver mutations [12][13][14][15][16][17][18][19] on a transcription factor can alter its interactions with many of the target genes that are important in cell proliferation (Fig 1A). A key tumor suppressor gene can be bound by different sets of transcription factors between cancer and normal cells, which leads to different roles [6,[20][21][22][23] (Fig 1B). Recent studies stress the importance of identifying the perturbed genes that create large topological changes in the gene network between disease and normal tissues as a way of discovering disease mechanisms and drug targets [8,9,[24][25][26][27]. However, most existing analysis methods that compare expression datasets between different conditions (e.g., disease vs. normal tissues) focus on identifying the genes that are differentially expressed [28][29][30]. For example, a recent review paper on biological network inference [31] emphasized that there is a lack of methods that focus on inferring the differential network between different conditions (e.g., distinct species, and disease conditions).
Several recent studies compare gene networks inferred between conditions based on expression datasets [1,[32][33][34][35][36][37]. They fall into three categories: 1) Network construction based on prior knowledge: West et al. (2012) computes the local network entropy, based on the protein interaction network from prior knowledge and expression datasets from cancer and normal tissues [1]. 2) Pairwise correlation-based networks: Guan et al. (2013) [35] proposed the local network similarity (LNS) method to compare the pairwise Pearson's correlation matrices of all genes between two conditions. Still other authors compared pairwise correlation coefficients for all gene pairs between conditions with different correlation measures including t-test p-values [32,33,36]. 3) Learning a condition-specific conditional dependency network for each condition and comparing the networks between conditions: Gill et al. (2010) proposed a method, called PLSNet, that fits a partial least squares model to each gene, computes a connectivity scores between genes, and then calculates the L 1 distance between score vectors to estimate network perturbation [34]. Zhang et al. (2009) proposed a differential dependency network (DDN) method that uses lasso regression to construct networks, followed by permutation tests to measure the significance of the network differences [37].
There have been approaches to identify dysregulated genes in cancer by utilizing multiple types of molecular profiles, not based on network perturbation across disease states estimated based on expression data. Successful examples use a linear model to infer each gene expression model based on copy number variation, DNA methylation, ChIP-seq, miRNAs or mRNA levels of transcription factors [38][39][40]. The advantages of the aforementioned methods that take only expression datasets as input to identify perturbed genes are in their applicability to diseases for which only expression data are available. In this paper, we focus on identifying perturbed genes purely based on gene expression datasets representing distinct states, and compare our method with existing method, LNS, D-score and PLSNet. (A) A simple hypothetical example that illustrates the perturbation of a network of 7 genes between disease and normal tissues. One possible cause of the perturbation is a cancer driver mutation on gene '1' that alters the interactions between gene '1' and genes '3', '4', '5', and '6'. (B) One possible cause of network perturbation. Gene '1' is regulated by different sets of genes between cancer and normal conditions. (C) The overview of our approach. DISCERN takes two expression datasets as input: an expression dataset from patients with a disease of interest and another expression dataset from normal tissues (top). DISCERN computes the network perturbation score for each gene that estimates the difference in connection between the gene and other genes between disease and normal conditions (middle). We perform various post-analyses to evaluate the DISCERN method by comparing with alternative methods, based on the importance of the high-scoring genes in the disease through a survival analysis and on how well the identified perturbed genes explain the observed epigenomic activity data (bottom).
We present a new computational method, called DISCERN (DIfferential SparsE Regulatory Network), to identify perturbed genes, i.e. the genes with differential connectivity between the condition specific networks (e.g., disease versus normal). DISCERN takes two expression datasets, each from a distinct condition, as input, and computes a novel perturbation score for each gene. The perturbation score captures how likely a given gene has a distinct set of regulators between conditions (Fig 1A). The DISCERN method contains specific features that provide advantages over existing approaches: 1) DISCERN can distinguish direct associations among genes from indirect associations more accurately than methods that focus on marginal associations such as LNS; 2) DISCERN uses a penalized regression-based modeling strategy that allows efficient inference of genome-wide gene regulatory networks; and 3) DISCERN uses a new likelihood-based score that is more robust to the expected inaccuracies in local network structure estimation. We elaborate on these three advantages below: First, DISCERN infers gene networks based on conditional dependencies among genes-a key type of probabilistic relationship among genes that is fundamentally distinct from correlation. If two genes are conditionally dependent, then by definition, their expression levels are still correlated even after accounting for (e.g., regressing out) the expression levels of all other genes. Thus, conditional dependence relationship is less likely to reflect transitive effects than mutual correlation, and provides stronger evidence that those genes are functionally related. These functional relationships could be regulatory, physical, or other molecular functionality that causes two genes expression to be tightly coupled. As a motivating example, assume that the expression levels of genes '3' and '5' are regulated by gene '1' in a simple 7-gene network ( Fig 1A). This implies that the expression level of gene '1' contains sufficient information to know the expression levels of genes '3' and '5'. In other words, genes '3' and '5' are conditionally independent from each other and from the rest of the network given gene '1'.
Second, DISCERN uses an efficient neighborhood selection strategy based on a penalized regression to enable the inference of a genome-wide network that contains tens of thousands of genes. Penalized regression is a well established technique to identify conditional dependencies [41]. Inferring the conditional dependence relationships from high-dimensional expression data (i.e., where the number of genes is much greater than the number of samples) is a challenging statistical problem, due to a very large number of possible network structures among tens of thousands of genes. Unlike pairwise correlation, the conditional dependence between '1' and '2' cannot be measured based on just the expression levels of these two genes. We should consider the possible networks among all genes and find the one that best explains the expression data. This involves both computational and statistical challenges. To make this process feasible, DISCERN uses a sparse regression model for each gene to select neighbors in the network [41,42]. The use of a scalable method to infer a genome-wide conditional dependence network is a key distinguishing feature of the DISCERN method.
Finally, one of the most novel features of DISCERN is the ability to avoid the overestimation of the degree of network perturbation due to dense correlation among many genes. Revisiting the 7-gene network example (Fig 1A), assume that genes '5' and '7' are highly correlated to each other, in which case a penalized regression that imposes a sparsity penalty, such as the lasso method, may arbitrarily select one of them. This can result in a false positive edge between genes '1' and '7' instead of '1' and '5'. This may lead to overestimation of the perturbation of gene '1' (Fig 1A). Our network perturbation score overcomes this limitation by measuring the network differences between conditions based on the likelihood when the estimated networks are swapped between conditions-not based on the differences in topologies of the estimated networks. We demonstrate the effectiveness of this feature by comparing with methods based on the topology differences of the estimated networks.
We evaluated DISCERN on both synthetic and gene expression data from three human cancers: acute myeloid leukemia (AML), breast cancer (BRC), and lung cancer (LUAD). Integrative analysis using DISCERN on epigenomic data from the Encyclopedia of DNA Elements (ENCODE) project leads to hypotheses on the mechanisms underlying network perturbation ( Fig 1C). The resulting DISCERN score for each gene in AML, BRC and LUAD, the implementation of DISCERN, and the data used in the study are freely available on our website http:// discern-leelab.cs.washington.edu/.

Method overview
Here, we describe the DISCERN method, referring to the Methods for a full description. We postulate that a gene can be perturbed in a network largely in two ways: A gene can change how it influences other genes (Fig 1A), for example, a driver mutation on a transcription factor can affect cell proliferation pathways [12][13][14][15][16][17][18][19]. A gene can change the way it is influenced by other genes, a common example being when a mutated (genetically or epigenetically) gene acquires a new set of regulators, which occurs frequently in development and cancer [6,[20][21][22][23] (Fig 1B). Identifying the genes that are responsible for large topological changes in gene networks could be crucial for understanding disease mechanisms and identifying key drug targets [8,9,[24][25][26][27]. However, most current methods for identifying genes that behave differently in their expression levels between diseased and normal tissues focus on differential expression [28][29][30], rather than differential connection with other genes in a gene expression network (Fig 1A).
We model each gene's expression level using a sparse linear model (lasso regression): let y ðsÞ i be expression levels of gene i in an individual with state s, cancer (s = c) or normal (s = n), modeled as: y ðsÞ i % X p r¼1 w ðsÞ ir x ðsÞ r . Here, x 1 , . . .x p denote candidate regulators, a set of genes known to regulate other genes, including transcription factors, chromatin modifiers or regulators, and signal transduction genes, which were used in previous work on network reconstruction approaches [43][44][45][46] (S1 Table). Linear modeling allows us to capture conditional dependencies efficiently from genome-wide expression data containing tens of thousands genes. Naturally, a zero weight w ir indicates that a regulator r does not affect the expression of the target gene i. Sparsity-inducing regularization helps to select a subset of candidate regulators, which is a more biologically plausible model than having all regulators, and makes the problem wellposed in our high-dimensional setting (i.e., number of genes ) number of samples).
To determine the regulators for any given gene, we use a lasso penalized regression model [47] with the optimization problem for each lasso regression defined as: jw ðsÞ ir j; where y ðsÞ ij means the expression level of the i th gene in the j th patient in the s th state, and x ðsÞ rj similarly means the expression level of the r th regulator in the j th patient in the s th state. The second term, the L 1 penalty function, will zero out many irrelevant regulators for a given gene, because it is known to induce sparsity in the solution [47]. We normalize the expression levels of each gene and each regulator to be mean zero and unit standard deviation, a process called standardization, which is a standard practice before applying a penalized regression method [47][48][49][50]. The difference in the weight vector between conditions, w ðnÞ i and w ðcÞ i , can indicate a distinct connectivity of gene i with p regulators between the conditions. However, simply computing the difference of the weight vectors is unlikely to be successful, due to the correlation among the regulators. The lasso, or other sparsity-inducing regression methods, can arbitrarily choose different regulators between cancer and normal. Examining the difference in the weight vectors between conditions would therefore lead to overestimation of network perturbation.
Instead, DISCERN adopts a novel network perturbation score that measures how well each weight vector learned in one condition explains the data in a different condition. This increases the robustness of the score to correlation among regulators, as demonstrated in the next section. We call this score the DISCERN score, defined as DISCERN i ¼ per sample ÀlogÀlikelihood computed using learned weights w ðsÞ in a different condition s 0 per sample ÀlogÀlikelihood computed using learned weights w ðsÞ in the same condition s . This is equivalent to Here n s is the number of samples in the data from condition s. The numerator measures the error of predicting gene i's expression levels in cancer (normal) based on the weights learned in normal (cancer). If gene i has different sets of regulators between cancer and normal, it is likely to have a high DISCERN score. The denominator plays an important role as a normalization factor, which is demonstrated by comparing with an alternative score, namely the D 0 score (Fig 2A), that uses only the numerator of the DISCERN score. We also compare with existing methods, such as LNS [35] and PLSNet [34], that compare the weight vectors between cancer and normal models where we demonstrate the advantages of the likelihood-based model that DISCERN uses.

Comparison with previous approaches on synthetically generated data
In order to systematically compare DISCERN with alternative methods in a controlled setting, we performed validation experiments on 100 pairs of synthetically generated datasets representing two distinct conditions. Each pair of datasets contains 100 variables drawn from the multivariate normal distribution with zero mean and covariance matrices S 1 and S 2 . We divided 100 variables into the following three categories: 1) variables that have different sets of edge weights with other variables across two conditions, 2) variables that have exactly the same sets of edge weights with each other across the conditions, and 3) variables not connected with any other variables in the categories 2) and 3) in both conditions. For example, in Fig 1A, '1' is in category 1). '2', '4', '6', and '7' are in category 2), and '3' and '5' is in category 3). We describe how we generated the network edge weights (i.e., elements of S À1 1 and S À1 2 ) among the 100 variables in more detail in Methods.
We compared DISCERN with 4 alternative methods to identify perturbed genes: LNS [35], D-score [36], PLSNet [34], and D 0 that uses only the numerator of the DISCERN score. Here, we do not compare with the methods to identify differentially expressed genes, such as ANOVA, because the synthetic data were generated from a zero mean Gaussian distribution. We note that the PLSNet method uses empirical p-values as the network perturbation scores, where the empirical p-value for each gene is estimated from permutation tests that generate the null distribution of the gene's score [34]. All the other methods, such as DISCERN, LNS, and D-score, do not require permutation tests (see Methods for details). To show that DISCERN outperforms existing methods and those that use the empirical p-values obtained through permutation tests as the network perturbation scores, we developed the following methods for comparison: LNS, D-score, and D 0 followed by permutation tests to compute the empirical pvalues, called pLNS, pD-score, and pD 0 , respectively.
The average receiver operating characteristic (ROC) curves across 100 pairs of datasets for these methods (Fig 2A) show that DISCERN significantly outperforms all the other 7 methods-3 existing methods (LNS, D-score, and PLSNet), and 4 methods we created for comparison (D 0 , pD 0 , pLNS, and pD-score). Except DISCERN, PLSNet performs the best among all existing methods. However, its run time grows too quickly as the number of variables increases, which makes it two to three orders of magnitude slower than DISCERN when run on larger data ( Fig  2B). PLSNet was too slow to run on genome-scale data and therefore we did not use it for the subsequent experiments on genome-wide gene expression data from cancer patients.
We note that DISCERN does not need permutation tests to generate the null distribution of the score for each gene. All other methods improve when the empirical p-values from permutation tests are used, which indicates that the gene-level bias on the magnitude of the raw scores hurts their performance to identify perturbed genes. DISCERN significantly outperforms D 0 that uses only the numerator of the DISCERN score, which indicates that the denominator of the DISCERN score plays a role to normalize the score such that the scores of different genes can be compared to each other. Computing the empirical p-value for each gene based on the gene-specific null distribution obtained through permutation tests is not feasible on genomewide data. To obtain a p-value of 0.05 after Bonferroni correction, we need at least (1/0.05 × p) permutation tests per gene, where p is the total number of genes, and (1/0.05 × p 2 ) permutation tests in total. When p = 20,000, this number is (4 × 10 9 ) permutation tests, which is not feasible even when using multiple processors at a reasonable cost. This is demonstrated in Fig 2B that shows the run time of PLSNet, a permutation test-based method, when applied to data containing a varying number of genes (p).

Comparison of methods on gene expression datasets
We used genome-wide expression datasets consisting of 3 acute myeloid leukemia (AML) datasets, 3 breast carcinoma (BRC) datasets and 1 lung adenocarcinoma (LUAD) dataset (Table 1). Details on the data processing are provided in Methods. To evaluate the performance of the DISCERN method, we compared DISCERN with existing methods that scale to over tens of thousands of genes: LNS [35] and D-score [36] that aim to estimate network perturbation, and ANOVA that measures differential expression levels between cancer and normal samples.
We first computed the DISCERN, LNS, D-score, and ANOVA scores in the 3 cancers based on the following datasets that contain normal samples: AML1, LUAD1 and BRC1 (Table 1). Then, we used the rest of the datasets to evaluate the performance of each method at identifying genes previously known to be important in the disease, for example, the genes whose expression levels are significantly associated with survival time in cancer. The value of the sparsity tuning parameter λ was chosen via cross-validation tests, a standard statistical technique to determine the value of λ [47]. For the chosen λ values, the overall average regression fit measured by cross-validation test R 2 was 0.493.
To remove any potential concern of the effect of standardization on genes with very low expression level, we first show that genes with low mean expression do not tend to have high enough DISCERN score to be considered in our evaluation in the next sections (S1 Fig). The Pearson's correlation between the mean expression before standardization and the DISCERN score ranges from 0.08 and 0.43 across datasets. Positive correlation is induced because genes with low mean expression tend to have lower DISCERN scores, indicating that genes whose expression are likely essentially noise would not be selected as high-scoring genes. To further reduce the potential concern of genes with low expression in RNA-seq data (LUAD), we applied the voom normalization method that is specifically designed to adjust for the poor estimate of variance in count data, especially for genes with low counts [51].
We assessed the significance of the DISCERN scores through a conservative permutation testing procedure, where we combined cancer and normal samples, and permuted the cancer/ normal labels among all samples (more details in Methods). Unlike the gene-based permutation test described in the previous section, here, we generate a single null distribution for all genes, which requires a significantly less number of permutation tests (one million in this experiment). After applying false discovery rate (FDR) correction on these p-values, there are 1,351 genes (AML), 2,137 genes (BRC), and 3,836 (LUAD) genes whose FDR corrected p-values are less than 0.05. We consider these genes to be significantly perturbed genes (S2 Table). The difference in these numbers of significant perturbed genes identified by DISCERN is consistent with a prior study that showed that lung cancer has a larger number of non-synonymous mutations per tumor than breast cancer, which has a larger number than AML [52].

Top scoring DISCERN genes in AML reveal known cancer drivers in AML
The 1,351 genes that were predicted to be significantly perturbed between AML samples and normal non-leukemic bone marrow samples were enriched for genes causally implicated previously in AML pathogenesis (S2 Table). This include a number of genes that we and others have previously identified as being aberrantly activated in leukemic stem cells such as BAALC, GUCY1A3, RBPMS, and MSI2 [55][56][57]. This is consistent with over-production of immature stem-cell like cells in AML, which is a major driver of poor prognosis in the disease. Prominent among high-scoring DISCERN genes were many HOX family members, which play key roles in hematopoietic differentiation and in the pathogenesis of AML [58]. HOX genes are frequently deregulated by over-expression in AML, often through translocations that result in gene fusions. The highest ranked gene in AML by DISCERN is HOXB3 which is highly expressed in multipotent hematopoietic progenitor cells for example. Thirteen (out of 39 known) HOX genes are in the 1,351 significantly perturbed genes (p-value: 5.99 × 10 −6 ). When compared to known gene sets from the Molecular Signature Database (MSigDB) [59] in an unbiased way, the top hit was for a set of genes (VERHAAK_AML_WITH_NPM1_ MUTATED_DN; p-value: 2 × 10 −86 ) that are down-regulated by NPM1 (nuclephosmin 1) mutation in AML (S1 File). NPM1 is one of three markers used in AML clinical assessment; the others are FLT3 and CEBPA that are significantly perturbed genes identified by DISCERN as well. Mutation leads to aberrant cytoplasmic location of itself and its interaction partners, leading to changes in downstream transcriptional programs that are being captured by DISCERN. Also highly significant were genes highly expressed in hematopoietic stem cells [60] (JAATINEN_ HEMATOPOIETIC_STEM_CELL_UP; p-value: 6 × 10 −74 ). Among these were key regulators of hematopoietic system development such as KIT, HOXA3, HOXA9, HOXB3 (with the latter homeobox genes also implicated in AML etiology), as well as FLT3 which plays a major role in AML disease biology, with its mutation and constitutive activation conferring significantly worse outcomes for patients [61]. Comparison to Gene Ontology (GO) categories identified dysregulation of genes involved in hemostasis and blood coagulation, a key clinical presentation of AML. Furthermore, GTPase activity/binding and SH3/SH2 adaptor activity were enriched among high-scoring DISCERN genes. These are pertinent to AML due to previously noted high expression in AML leukemic stem cells of GUCY1A3 and SH3BP2, both identified as perturbed genes by DISCERN [55]. However, their function has not been examined in detail, suggesting that they are potential targets for further investigation as to their role in AML disease mechanisms. Several other highly significant enrichments were for AML subtypes that are driven by specific translocations, including MLL (mixed lineage leukemia) translocation with various partners, as well as t (8;21) translocations. The latter is of particular interest, since it is primarily a pediatric AML, whereas our network analysis uses purely adult AML samples-indicating the potential to uncover putative mechanisms that generalize beyond the context of the immediate disease type.

Top scoring DISCERN genes in lung cancer reveal biological processes known to be important in lung cancer
There are 3,836 significantly perturbed genes identified by DISCERN in lung cancer (LUAD) (S2 Table). The 3rd and 4th highest ranked genes are ICOS (inducible costimulator) and YWHAZ (14-3-3-zeta). Both genes have known roles in disease initiation or progression in lung cancer. Polymorphisms in ICOS have been associated with pre-disposition to non-small cell lung cancer [62], while over-expression of YWHAZ is known to enhance proliferation and migration of lung cancer cells through induction of epithelial-mesenchymal transitions via beta-catenin signaling [63]. GIMAP5 (GTPase IMAP Family Member 5), another high scoring LUAD gene (11th), is consistently repressed in paired analyses of tumor vs normal lung tissue from the same patient, and encodes an anti-apoptotic protein [64]. Down-regulation of GIMAP5 in lung tumors therefore potentially facilitates their evasion of programmed cell death, one of the hallmarks of cancer.
Several of the GO biological categories enriched in 3,836 high-scoring DISCERN genes in LUAD (FDR-corrected p-value <0.05) reflected metabolic and proliferative processes that are commonly de-regulated in solid tumors such as lung adenocarcinoma. Among these were cellular response to stress, mitotic cell cycle, amino acid metabolism, and apoptosis (S1 File). In fact the top-ranked gene was MCM7 (minichromosome maintenance protein 7), an ATPdependent DNA helicase involved in DNA replication which has been implicated in carcinogenesis previously due to its function as a binding partner of PRMT6 [65]. Moreover, it was specifically identified as being a potential therapeutic target due to its over-expression in solid tumors relative to normal tissues. The high ranking of genes associated with apoptosis is consistent with the fact that there is often high rate of tumor cell death. Although the highlyranked CARD6 (caspase recruitment domain family member 6) functions in apoptotic processes, it is also known as a regulator of downstream NF-κβ signaling. Indeed, consistent with this, we found enrichment for NF-κβ signaling pathway genes among high DISCERN-scoring genes in LUAD including NFKBIB (NF-κβ inhibitor β) which inhibits the NF-κβ complex by "trapping" it in the cytoplasm, preventing nuclear activation of its downstream targets. Although the role of NFKBIB in lung cancer has not been studied extensively, its related family member NFKBIA is known to be a silencer in non-small-cell lung cancer patients with no smoking history, suggesting that it could play some role in LUAD that arises through inherent genetic influences, or environmental insults other than smoking [66]. Levels of β-catenin have been known for some time to influence progression and poor prognosis in LUAD, potentially through its role in differentiation and metastasis from primary tumor sites [67]. We found that components of β-catenin degradation pathways-including most notably CTNNBIP1 (β-catenin interacting protein 1)-ranked among the most significant DISCERN genes in our LUAD analysis.
When comparing to other sets of genes in MSigDB, we also found targets of transcription factors including MYC, which is often de-regulated in solid tumors (either by mutation or copy number variation), and targets of the polycomb repressive complex gene EZH2. The developmental regulator EZH2 functions through regulation of DNA methylation [68], and has been implicated in B-cell lymphomas through somatic mutations [69], promotion of transformation in breast cancer [70], as well as progression in prostate cancer [71]. Interestingly, the most highly dys-regulated gene set identified by comparison to GO categories in LUAD was one related to NGF (nerve growth factor)-TrkA signaling. There are a few reports on the relevance of this axis to cancers including neuroblastoma, ovarian cancer, and a possible role in promoting metastasis in breast cancer. However, its striking appearance as the most significant hit for high-ranking DISCERN genes suggests that it merits study in lung cancer.

Top scoring DISCERN genes in breast cancer reveal biological processes known to be important in breast cancer
Here, we did the functional enrichment analysis with 2,137 genes identified by DISCERN to be significantly perturbed in breast cancer (BRC) (S2 Table). BRC showed perturbation of distinct genes and sets of genes in comparison to LUAD, as well as similarities. Again, these included GO biological processes that one would generically expect to be over-activated in a solid tumor, such as translation intiation, cell cycle, proliferation, and general cellular metabolic processes. As with LUAD, targets of MYC were enriched in high-scoring DISCERN genes in BRC. Another high-scoring group in BRC was comprised of genes that are highly correlated with each other, but with this relationship de-regulated by BRCA1 mutation [72]. Additional significant overlaps were identified with luminal A, luminal B, HER2-enriched, and basal-like breast cancer subtype-specific genes that are associated with clinical outcomes [73], and genes associated with ER-positive breast cancer [74]. The 3rd highest ranked DISCERN gene was BRF2 (TFIIB-related factor 2). BRF2 is a known oncogene in both breast cancer and lung squamous cell carcinoma, and a core RNA polymerase III transcription factor that senses and reacts to cellular oxidative stress [75]. A GO category associated with NGF (nerve growth factor)-TrkA signaling shows the highest overlap with DISCERN genes in BRC (p-value: 3.16 × 10 −104 ). NGF-TrkA signaling is upstream of the canonical phosphatidylinositol 3-kinase (PI3K) -AKT and RAS -mitogen-activated protein kinase (MAPK) pathways, both of which impinge on cell survival and differentiation. In the context of breast cancer, over-expression of TrkA has been connected to promoting growth and metastasis, as an autocrine factor, presumably due to its influence on PI3K-AKT and RAS/MAPK [76]. TrkA is reportedly overexpressed in breast carcinoma relative to normal breast tissue in a majority of cases [77], supporting the high-ranking of genes in this pathway by DISCERN. Taken together, these results indicate that DISCERN highly ranks genes that are connected to known phenotypic and survival-associated processes in breast cancer. However, intriguingly the top DISCERN gene was CLNS1A (chloride nucleotide-sensitive channel 1A). This chloride channel gene has not, to our knowledge, been implicated in pathogenesis in any cancer, although it is a member of the BRCA1-related correlation network noted above. In fact there appear to have been few studies of its function although Entrez gene notes that it performs diverse functions.

DISCERN scores reveal survival-associated genes across multiple cancer types
In this section, we focus on the quantitative assessment of DISCERN and the comparison with LNS and D-score in terms of how much the identified genes are enriched for genes implicated to be important in the disease. Specifically, genes whose expression levels are significantly associated positively or negatively with survival time are often considered to be associated with tumor aggression. Identifying such genes has been considered as an important problem by a number of authors, where breast cancer was one of the first cancers to show promise in terms of identifying clinically relevant biomarkers [78,79]. Here, we evaluated DISCERN based on how well it reveals survival-associated genes identified in an available independent dataset.
We chose the datasets with measures of patient prognosis: AML2, BRC2, and LUAD1. AML2 and BRC2 were not used for computing any scores (DISCERN, LNS, D-score, and ANOVA). For each of these datasets we computed the survival p-values based on the Cox proportional hazards model [80] measuring the association between each gene's expression level and survival time. We defined survival-associated genes as the genes whose expression levels are associated with survival time based on the Cox proportional hazards model (p-value < 0.01) (S3 Table).
We considered the genes whose DISCERN scores are significantly high at FDR corrected pvalue < 0.05 in each cancer: 1,351 genes (AML), 2,137 genes (BRC), and 3,836 genes (LUAD). We first computed the Fisher's exact test p-values to measure the statistical significance of the overlap between these significantly perturbed genes and survival-associated genes in each of three cancers. For each cancer, we compared with existing methods to detect network perturbation-LNS and D-score-when exactly the same number of top-scoring genes were considered (Fig 3A-3C). Since these numbers of genes were chosen specifically for DISCERN, there is a chance that LNS and D-score would show a higher enrichment for survival-associated genes if different numbers of top-scoring genes were considered. As discussed in the previous section, performing the gene-based permutation tests to estimate the confidence of each gene's score in genome-wide data is not feasible. Instead, we compared the Fisher's exact test p-values of the three methods across a range of numbers of top-scoring genes from 0 to N Fig 3D-3F. It is pretty clear that neither LNS nor D-score would be better than DISCERN in revealing survivalassociated genes, even when different numbers of top-scoring genes were considered across all cancer types.
ANOVA is a well-established method to identify differentially expressed genes across distinct conditions; DISCERN LNS, and D-score are methods to identify differentially connected genes across conditions. Therefore, the purpose of the comparison with ANOVA is not to evaluate DISCERN in identifying survival-associated genes as perturbed genes. The purpose is to compare between differentially expressed genes (that are commonly considered important) and perturbed genes estimated by the three methods (DISCERN, LNS, and D-score), in terms of the enrichment for genes with potential importance to the disease. For ANOVA, in Fig 3A-3C, we considered 8,993 genes (AML), 7,922 genes (BRC) and 13,344 genes (LUAD) that show . For ANOVA, we considered 8,993 genes (AML), 7,922 genes (BRC) and 13,344 genes (LUAD) that show significant differential expression at FDR corrected p-value < 0.05. (D)-(F) We consider up to 1,500 (AML), 2,500 (BRC), and 4,000 (LUAD) top-scoring genes in each method, to show that DISCERN is better than LNS and D-score in a range of N value. The red-colored dotted line indicates 1,351 genes (AML), 2,137 genes (BRC), and 3,836 genes (LUAD) that are identified to be significantly perturbed by DISCERN (FDR < 0.05). We compare among the 4 methods consisting of 3 methods to identify network perturbed genes (solid lines) and ANOVA for identifying differentially expressed genes (dotted line) in 3 cancer types.
doi:10.1371/journal.pcbi.1004888.g003 significant differential expression between cancer and normal samples at FDR corrected pvalue < 0.05. The perturbed genes identified by DISCERN are more associated with survival than differentially expressed genes captured by ANOVA in AML and LUAD (Fig 3).
In addition to the comparison with other methods-LNS and D-score-we also compare with frequently mutated genes and genes annotated to be involved in the respective cancer. We considered the following three gene sets: 1) a gene set constructed based on the gene-disease annotation database, Malacards [81], 2) genes known to have cancer-causing mutations based on the Cancer Gene Census [82], and 3) genes predicted to have driver mutations identified by MutSig [5] applied to The Cancer Genome Atlas (TCGA) data for the respective cancer type. The Malacards (gene set #1) and TCGA driver gene sets (#3) are generated for each cancer type-AML, breast cancer, or lung cancer. For example, for Malacards, we used the genes that are annotated to be involved in AML in Malacards to compare it with DISCERN genes identified in AML. Similarly, for the TCGA driver gene sets (#3), we used the AML TCGA data to identify the frequently mutated genes that are likely driver genes, and compared with high DIS-CERN-scoring genes in AML. We used the breast cancer TCGA data for BRC, and lung cancer TCGA data for LUAD. The Cancer Gene Census (CGC) gene set is a rigorously defined set of genes with multiple sources of evidence that its genes are cancer drivers in a single or multiple cancers.
For each cancer type, we compared these three sets of genes with the perturbed genes identified by DISCERN-1,351 (AML), 2,137 (BRC), and 3,836 (LUAD) genes with high DISCERN scores-on the basis of the significance of the enrichment for survival-associated genes. S2 Fig shows that the perturbed genes identified by DISCERN are more significantly enriched for survival-associated genes.

Prognostic model based on high DISCERN-scoring genes
In this section, we evaluated the DISCERN score based on how well it identifies genes that are predictive of patient prognosis. Here, we test the possibility of using the network perturbed genes identified by DISCERN as prognostic markers. For the cancer types with at least three data sets (AML and BRC; see Table 1), we construct a survival time prediction model using the genes with significant DISCERN scores (AML: 1,351 genes, BRC: 2,137 genes) identified based on one data set (Data # 1: AML1 and BRC1) as described in the previous subsection. Then, we trained the prediction model using one of the other datasets (Data #2: AML2 and BRC2) not used for the computation of the DISCERN score. Finally, we tested the prediction accuracy on the third data set (Data #3: AML3 and BRC3).
We controlled for clinical covariates whose data are available-age in case of AML and age, grade and subtype in case of BRC-by adding them as unpenalized covariates into our elastic net Cox regression model. We trained the Cox regression model using Data #2 and tested the survival prediction model on Data #3. Since we evaluated the survival prediction in separate data (AML3 and BRC3) that were not used when training the survival prediction model, using more predictors, e.g., by adding clinical covariates, does not necessarily improve the prediction performance. Adding more predictors often leads to a higher chance of overfitting. Our survival prediction model based on the high DISCERN-scoring genes works at least as well as models based on the genes contained in the previously established prognosis markers, such as Leukemic Stem Cell score (LSC) [54] for AML and MammaPrint signature (with *70 genes) [83] for BRC, as shown in Fig 4. The c-index in AML is 0.669 with standard error (se) being 0.031 ( Fig 4B); in BRC, the c-index is 0.668 (se: 0.027) (Fig 4D). The DISCERN-based expression marker with clinical covariates makes better predictions than when clinical covariates alone are used.  DISCERN explains epigenomic activity patterns in cancer and normal cell types more accurately than alternative methods One of the possible mechanisms underlying network perturbation identified in gene expression datasets representing different conditions (e.g., cancer and normal) is the following: A transcription factor (TF) 'X' binds to a gene 'Y"s promoter or its enhancer region in cancer but not in normal (or vice versa). Then, 'X' or its co-regulator could be an expression regulator for 'Y' in cancer but not in normal (or vice versa), and Y is identified as a perturbed gene (i.e., a high DISCERN-scoring genes). It is possible that 'X"s binding information is not available and 'X"s protein level is not reflected in its mRNA expression level; thus we cannot expect the DISCERN score of a gene inferred from expression data to be perfectly correlated with whether the gene has a differential biding of a certain TF, inferred from ChIP-seq or DNase-seq data. However, the degrees of correlation between the network perturbation score (DISCERN, LNS or Dscore) of a gene and whether a TF differentially binds to the gene can be a way to evaluate the network perturbation scoring methods.
To determine whether or how much our statistical estimates of network perturbation reflects perturbation of the underlying TF regulatory network, we queried epigenomic data from ENCODE project. Two of the ENCODE cell lines-NB4 (an AML subtype [84]) and CD34+ (mobilized CD34 positive hematopoietic progenitor cells)-are closest to AML and normal conditions, and the DNase-seq data from these cell lines are available. We used the DNase-seq data from NB4 and the position weight matrices (PWMs) of 57 TFs available in the JASPAR database [85] to find the locations of the PWM motifs that are on the hypersensitive regions. This is a widely used approach to estimate active binding motifs using DNase-seq data, when ChIP-seq data are not available. We identified the locations of these PWM motifs on the hg38 assembly by using the FIMO [86] method (p-value 10 −5 ). We then intersected these motif locations with hypersensitive regions identified by the DNase-seq data for each TF. We repeated for the other cell line CD34+.
For each TF, we measured how well the DISCERN score of a gene can predict the differential binding of the TF in active enhancer regions (marked by H3K27Ac) within 15kbs of the transcription start site (TSS) of the gene (Fig 5A-5C) and 5kb of the gene between blood cancer and normal cell lines (NB4 and CD34+) (S3A- S3C Fig). We show that the DISCERN score can reflect differential binding of most of the TFs better than existing methods to identify network perturbation (LNS and D-score) and a method to identify differentially expressed genes (ANOVA). As a way to summarize these results across all 57 TFs, we computed the Pearson's correlation between the score of each gene and the proportion of TFs that differentially bind to that gene out of all TFs that bind to that gene. Fig 5D shows that DISCERN detects genes with many TFs differentially bound between cancer and normal better than the other network perturbation detection methods (LNS and D-score) and ANOVA.
Considering hypersensitive sites identified by DNase-seq data as the indication of "general" binding of TFs or other DNA-associated proteins, we assume that a gene is differentially bound if there is a DNase signal within a 150bp window around its TSS in one condition (cancer or normal), but not in the other condition. We observe that the DISCERN scores of the genes that are differentially bound are significantly higher than those of the genes that are not (Fig 5E). These results suggest that DISCERN identifies possible regulatory mechanisms underlying network perturbation more accurately than existing network perturbation detection methods (LNS and D-Score) and a method for identifying differential expression levels (ANOVA).
As a specific example, STAT3 has been shown to differentially regulate the mRNA expression of BATF in myeloid leukemia but not in normal condition [87]. We found that STAT3 differentially binds to BATF in the AML cell line but not in the normal cell line based on our differential binding analysis using the DNase-seq/motif data, as described above (S4 Table). Interestingly, DISCERN identifies BATF as a perturbed gene in AML (FDR corrected pvalue < 0.05). DISCERN also identifies STAT3 as the strongest regulator for BATF in AML expression data, but STAT3 is not selected as an expression regulator in normal expression data (S1 File). Interestingly, LNS and D-Score detect STAT3 as an expression regulator of BATF in both conditions, not as a differential expression regulator.
Two of the Tier 1 ENCODE cell lines-K562 (chronic myeloid leukemia cell line) and GM12878 (a lymphoblastoid cell line)-correspond to blood cancer and normal tissues as well [88]. Tier 1 data contain the largest number of TFs with ChIP-seq datasets, which allows us to perform this kind of analyses using ChIP-seq datasets for these TFs. We repeated the same analysis with these cell lines and showed similar results (see S4 and S5 Figs).

Combining DISCERN with ENCODE data improves the enrichment of known pathways
Additionally, we investigated whether one can use DISCERN as a filtering step to increase the power in a pathway enrichment analysis. We consider hypersensitive sites identified by Fig 5. Kolmogorov-Smirnov test p-value measuring the significance of the difference in score between genes differentially bound by the corresponding transcription factor (TF) (x-axis) and those not differentially bound by the corresponding TF. We performed the one-sided test with an alternative hypothesis that differentially bound genes have higher scores; thus high −log 10 (p-value) means that high-scoring genes tend to show differential binding. The TFs are divided into the 3 sets: (A) TFs that are known to be associated with leukemia, (B) TFs that are known to be associated with cancer, and (C) TFs that are currently not known to be associated with cancer or leukemia, based on the gene-disease annotation database Malacards [81]. (D) Comparison of the p-values for the Pearson's correlation between the score of each gene and the proportion of differential TFs out of all TFs bound to the genes. (E) Kolmogorov-Smirnov test (one-sided as above) p-value measuring the significance of the difference in score between the genes with differential binding purely based on the DNase-seq data and those not. Here, a differentially bound gene is defined as a gene that has a DNase signal within a 150bp window around its TSS in one condition but not in the other condition. DNase-seq data as the indication of "general" binding of TFs or other DNA-associated proteins, and important regulatory events. As describe above, we identified differentially regulated genes between AML and normal cell lines (NB4/ CD34+) by identifying gene that have DNaseseq peaks within 150bp around the TSS in one condition (cancer or normal), but not in the other condition. There are 3,394 differentially regulated genes selected based on the DNase-seq data, of which 339 are significant DISCERN genes (S1 File). Presumably, these disease specific targets should be enriched for pathways or categories that will help us understand mechanisms underlying the disease. Alternatively, some targets may be spurious, especially considering the use of cell lines that are not a perfect match to healthy and diseased bone marrow samples and experimental noise.
Here we attempt to identify differentially regulated genes between AML and normal samples, by integrating the information on the DNase-seq data (i.e., differentially bound genes) and significantly perturbed genes identified by DISCERN based on the expression datasets from AML samples and normal non-leukemic bone marrow samples. To show that combining these two pieces of information helps us to identify pathways that are specifically active in one condition not in the other, we compared the significance of the enrichment for Reactome pathways measured in fold enrichment between 1) 339 differentially bound DISCERN genes (intersection of 3,394 differentially bound genes and high DISCERN-scoring genes), and 2) 3,394 differentially bound genes. S6 Fig shows that for most of the pathways, using the intersection of differentially bound and perturbed genes increases the fold enrichment compared to when differentially bound genes were used (Wilcox p-value <7 × 10 −5 ).
Among the pathways, 'platelet activation signalling and aggregation' shows significant improvement in fold enrichment: 1) when differentially bound DISCERN genes were used (f = 2.9; FDR q-value = 0.01), compared to 2) when differentially bound genes were used (f = 1.03). It has been shown that the interactions between platelets and AML cells have considerable effects on metastasis, and the various platelet abnormalities have been observed in AML and other leukemias [89]. G-alpha signalling-related pathways also show significant boost in fold enrichment when DISCERN was used as a filtering mechanism for differentially bound genes. 'G q signalling pathway' shows significant increase in fold enrichment: 1) when differentially bound DISCERN genes were used (f = 2.16; FDR q-value = 0.05), compared to 2) when differentially bound genes were used (f = 0.92). 'G 12/13 signalling pathway' shows significant improvement in fold enrichment: 1) when differentially bound DISCERN genes were used (f = 3.4; q-value <0.03), compared to 2) when differentially bound genes were used (f = 1.5). These pathways have been implicated in leukemias [90].

Discussion
We present a general computational framework for identifying the perturbed genes, i.e., genes whose network connections with other genes are significantly different across conditions, and tested the identified genes with statistical and biological benchmarks on multiple human cancers. Our method outperforms existing alternatives, such as LNS, D-score, and PLSNet, based on synthetic data experiments and through biological validation performed using seven distinct cancer genome-wide gene expression datasets, gathered on five different platforms and spanning three different cancer types-AML, breast cancer and lung cancer. We demonstrated that DISCERN is better than other methods for identifying network-perturbation in terms of identifying genes known to be or potentially important in cancer, as well as genes that are subject to differential binding of transcription factor according to the ENCODE DNase-seq data. We also demonstrated a method to use DISCERN scores to boost signal in the enrichment test of targets of differential regulation constructed using DNase-seq data available through the ENCODE Project.

Data preprocessing
Raw cell intensity files (CEL) for gene expression data in AML1, AML2, and AML3 were retrieved from GEO [91] and The Cancer Genome Atlas (TCGA). Expression data were then processed using MAS5 normalization with the Affy Bioconductor package [92], and mapped to Enztrez gene annotations [93] using custom chip definition files (CDF) [94], and batcheffect corrected using ComBat [95] implemented in package sva from CRAN.
BRC1 expression data were accessed through Broad Firehose pipeline (build 2013042100). We checked whether BRC1 processed by Firehose shows evidence of batch effects. We confirmed that the first three principal components are not significantly associated with the plate number (which we assumed to be a batch variable), which indicates no strong evidence of batch effects. BRC2 and BRC3 were accessed through Synapse (syn1688369, syn1688370). All probes were then filtered and mapped using the illuminaHumanv3.db Bioconductor package [96]. Probes mapped into the same genes were then collapsed by averaging if the probes being averaged were significantly correlated (Pearson's correlation coefficient greater than 0.7).
LUAD1 expression data were accessed through Broad Firehose pipeline (build 2015110100). Genes which had a very weak signal were filtered out of the LUAD1 data. We then applied the voom normalization method that is specifically designed to adjust for the poor estimate of variance in count data, especially for genes with low counts [51]. The voom algorithm adjusts for this variance by estimating precision weights designed to adjust for the increased variance of observations of genes with low counts. This would stabilize the estimated distribution of RSEM values in the LUAD data, making it more normally distributed. Since LUAD data comes from different tissue source sites, we have applied batch-effect correction using ComBat.
For all datasets, only probes that are mapped into genes that have Entrez gene names were considered. Table 1 shows the number of samples and genes used in each dataset. For AML1, BRC1, and LUAD1 that were used for score computationa, we splitted each dataset into two matrices, one with only cancerous patients and one with normal patients. These matrices are normalized to 0-mean, unit-variance gene expression levels for each gene, before each network perturbation score (DISCERN, LNS, and D-score) was computed, which is a standard normalization step for accurately measuring the difference in the network connectivity. For methods that measure the differential expression levels (ANOVA), such normalization was not applied.
Lastly, candidate regulators are identified from a set of 3,545 genes known to be transcription factors, chromatin modifies, or perform other regulatory activity, which have been used in many studies on learning a gene network from high-dimensional expression data [43][44][45][46] (S1 Table).

DISCERN score
DISCERN uses a likelihood-based scoring function that measures for each gene how much likely the gene is differently connected with other genes in the inferred network between two conditions (e.g., cancer and normal). We model each gene's expression level based on a sparse linear model. Let y ðsÞ i be a standardized expression levels of gene i in an individual with a condition s (cancer or normal) modeled as: y ðsÞ i % X p r¼1 w ðsÞ ir x ðsÞ r , where x ðsÞ 1 ; . . . x ðsÞ p denote standardized expression levels of candidate regulator genes in a condition s. Standardization is a standard practice of normalizing expression levels of each gene to be mean zero and unit stadard deviation before applying penalized regression method [47][48][49][50]. To estimate weight vector w ðsÞ i lasso þ l X p r¼1 jw ðsÞ ir j; where the subscript j in the formula iterates over all patients, used as training instances for lasso. Here, y ðsÞ ij corresponds to the expression level of the i th gene in the j th patient in the s th state and x ðsÞ ij similarly corresponds to the expression level of the i th regulator in the j th patient in the s th state. The second term, the L 1 penalty function, will zero out many irrelevant regulators for a given gene, because it is known to induce sparsity in solution [47].
After estimating w ðsÞ i for each s, the DISCERN score measures how well each weight vector learned on one condition explains the data in the other condition, by using a novel model selection criteria defined as: The first step of calculating the DISCERN score and D 0 score is to fit a sparse linear model (such as lasso [47]) for each gene's expression level. We used the scikit-learn Python package (version 0.14.1) to calculate these scores with the values of the sparsity tuning parameters λ chosen by using the 5-fold cross-validation tests.

ANOVA score computation
Analysis of Variance (ANOVA) is a standard statistical technique to measure the statistical significance of the difference in mean between two or more groups of numbers. For each gene, the 1-way ANOVA test produces a p-value from the F-test, which measures how significantly its expression level is different between conditions (e.g., cancer and normal). The ANOVA score was computed as negative logarithm of a p-value, obtained from 1-way ANOVA test using f_oneway function in scipy.stats Python package.

PLSNet score computation
PLSNet score attempts to measure how likely each gene is differently connected with other genes between conditions. It was computed using dna R package version 0.2_1 [34]. The network perturbation score for each gene is computed based on the empirical p-value from 1,000 permutation tests.

LNS score computation
In Guan et al. (2013) [35], the authors defined the local network similarity (LNS) score for gene i that is defined as correlation of the Fisher's z-transformed correlation coefficients between expression of gene i and all other genes between two conditions: where c s ij represents the correlation coefficient between expression levels of genes i and j in condition s = n for normal and s = c for cancer.

D-score score computation
For synthetic data analysis, we have also introduced a D-score, computed as following (as used in Wang et al. (2009) [36]): where d s ij is a normalized correlation (normalized to have zero mean and unit variance across genes) between genes i and j in condition s, also known as Glass' d score [97].

Synthetic data generation
We generated 100 pairs of datasets, each representing disease and normal conditions. Each pair of datasets contains 100 variables drawn from the multivariate normal distribution with zero mean and covariance matrices S 1 and S 2 . Each dataset contains n 1 and n 2 samples, respectively, where n 1 is randomly selected from uniform distribution between 100 and 110, and n 2 is from uniform distribution between 16 and 26. This difference in n 1 and n 2 reflects the ratio of the cancer samples and normal samples in the gene expression data (Table 1).
For each of the 100 pairs of datasets, we divided 100 variables into the following three categories: 1) variables that have different sets of edge weights with other variables across two conditions, 2) variables that have exactly the same sets of edge weights with each other across the conditions, and 3) variables not connected with any other variables in the categories 2) and 3) in both conditions. For example, in Fig 1A, '1' is in category 1) (i.e., perturbed genes). '2', '4', '6', and '7' are in category 2), and '3' and '5' is in category 3). In each of the 100 pairs of datasets, the number of genes in category #1 (perturbed genes), p, is randomly selected from uniform distribution between 5 and 15. The number of genes in each of the other two categories #2 and #3 is determined as (100 − p)/2.
We describe below how we generated the network edge weights (i.e., elements of S À1 1 and S À1 2 ) among the 100 variables. To ensure that only the genes in #1 have differing edge weights between two conditions, we generated two p × p matrices, X 1 and X 2 , with elements randomly drawn from a uniform distribution between -1 and 1. Then, we generated symmetric matrices, X ⊺ 1 X 1 and X ⊺ 2 X 2 , and added positive values to the diagonal elements to these symmetric matrices, if its minimum eigenvalue is negative-a commonly used method to generate positive definite matrices [98]. They become submatrices of S À1 1 and S À1 2 for these p variables. Similarly, we generate a common submatrix for the variables in category #2-variables that have the same edge weights with other variables across conditions. Variables in category #3 have identity matrix as the inverse covariance matrix among the variables in that categories. Finally, we added mean zero Gaussian noise to each element of S À1 1 and S À1 2 , where the standard deviation of the Gaussian noise is randomly selected between 0.5 and 5.
This procedure allows having datasets of varying levels of difficulty in terms of highdimensionality and network perturbation, which provides an opportunity to compare the average performances of the methods in various settings.

Conservative permutation tests
To generate a conservative null distribution, we performed permutation tests by randomly reassigning cancer/normal labels to each sample, preserving the total numbers of cancer/normal samples. The correlation structure among genes would be preserved, because every gene is assigned the same permuted label in each permutation test. We then computed the DISCERN score for a random subset of 300 genes. We repeated this process to get over one million DIS-CERN scores to form a stable null distribution, which was used to compute empirical p-values.

Identifying survival-associated genes
For the survival-associated genes enrichment analysis, we first computed the association between survival time and each gene expression level. Genes that had a p-value from the Cox proportional hazards model (computed using survival R package) smaller than 0.01 were considered significantly associated with survival. These include 1,280 genes (AML), 1,891 genes (BRC) and 1,273 genes (LUAD) (S3 Table). Statistical significance of the overlap with top N DISCERN, LNS, D-score and ANOVA -scoring genes was computed by using the Fisher's exact test based on the hypergeometric distribution function from scipy.stats Python package [99].

Gene sets previously known to be important in cancer
We presented the results on the comparison with three sets of genes that are known to be important in cancer (S2 Fig). Here, we describe how we obtained these gene sets. First, Malacards genesets were constructed based on the data from malacards.org website accessed in September 2012. Second, we used a set of 488 genes we downloaded from Catalogue of Somatic Mutations in Cancer website (CGC) [82]. For each cancer type, we considered the intersection between this list and the genes that are present in the expression data. Finally, a set of genes likely to contain driver mutations selected by MutSig was defined as those that pass q < 0.5 threshold based on 20141017 MutSig2.0 report from Broad Firehose.

Cross-dataset survival time prediction
To evaluate the performance of the DISCERN score on identifying genes to be used in a prognosis prediction model, we trained the survival prediction model using one dataset and tested the model on an independent dataset (Fig 4). To train the survival prediction model, we used the elastic net regression (α = 0.5) using glmnet CRAN package (version 1.9-8). Available clinical covariates-age for AML, and age, grade and subtype for BRC-were added as unpenalized covariates. Regularization parameter λ was chosen by using the built-in cross-validation function. Testing was always performed in the independent dataset with held-out samples from the dataset that was not used for training. For comparison, we trained the prediction model using 22 LSC genes [54] with age in AML, and 67 genes from the 70-gene-signature [83] (3 genes from the signature were missing in the dataset we were using) with clinical covariates (age, stage, and subtype) in BRC, as shown in Fig 4B and 4D, respectively.

Epigenomics analysis
The Encyclopedia of DNA Elements (ENCODE) is an international collaboration providing transcription factor binding and histone modification data in hundreds of different cell lines [100]. Data for ENCODE analysis were accessed through the UCSC Genome Browser data matrix [101] and processed using the BedTools and pybedtools packages [102,103]. Two of the ENCODE cell lines-NB4 (an AML subtype [84]) and CD34+ (mobilized CD34 positive hematopoietic progenitor cells)-are closest to AML and normal conditions, and the DNaseseq data from these cell lines are available.
For each cell line, we used the DNase-seq data and the position weight matrices (PWMs) of 57 transcription factors (TFs) available in the JASPAR database [85] to find the locations of the PWM motifs that are on the hypersensitive regions. We identified the locations of these PWM motifs on the hg38 assembly by using FIMO [86] (p-value 10 −5 ). We then intersected these motif locations with hypersensitive regions identified by the DNase-seq data for each TF. We repeated this process to identify active binding motifs of the 57 TFs in each of the cell lines, NB4 and CD34+.
For each TF, we identified the genes the TF differentialy binds to between cancer and normal cell lines. We assumed that a certain TF is bound near a gene if the center of the peak is in the active enhancer regions (marked by H3K27Ac) within 15kbs of the transcription start site (TSS) of the gene or the 5kb around the gene's transcription start site. We show that for most of the TFs, differentially bound genes have significantly high DISCERN scores than those not (Fig 5A-5C).
The differential regulator score for each gene was computed by taking the number of differentially bound TFs and dividing it by the total number of TFs bound to the gene in any condition. We show that the differential regulator score is highly correlated with the DISCERN score ( Fig 5D). For DNase-based analysis (Fig 5E), we defined a gene to be differentially regulated if hypersensitive sites detected by DNase-seq are within 150bp upstream of the gene in one condition and not in another.

Reactome enrichment and DISCERN filtering
A set of 605 Reactome pathways was downloaded through Broad Molecular Signature Database (MSigDB) [59]. We postulate that hypersensitive sites identified by DNase-seq in a particular cell line indicate the regions where important regulatory events occur, such as transcription factor binding. We constructed the list of differentially regulated genes by comparing the hypersensitive sites identified by DNase-seq data between cancer and normal cell lines within 150bp upstream from TSS of each gene. For each pathway, we computed the fold enrichment (¼ number of genes in the intersection of two groups of genes number of genes in the intersection by random chance ) that measures the significance of the overlap between genes in the pathway and the identified differentially regulated genes. We compared the fold enrichment with when the genes in the intersection of differentially regulated genes and 1,351 significantly perturbed genes identified by DISCERN were used (S6 Fig). To reduce the noise, we only considered the pathways that had !5 genes in the overlap before filtering. The p-values were then FDR corrected for multiple hypothesis testing. Although pvalues would measure the significance of the overlap between a gene set with a pathway, we used the enrichment fold as a measure of the significance of the overlap because we compared a set of genes with another set much smaller size.
Supporting Information S1 File. Lists of genes used in various analyses in this paper, created based on Cancer Gene Census, Malacards, TCGA mutation profiles, ENCODE epigenomic data, and survival-associated genes for each cancer type. (ZIP) S1 Fig. A scatter plot that shows the relationship between the DISCERN score (y-axis) and the mean expression level of each gene. We show that genes with very low mean expression levels do not tend to have high enough DISCERN score to be considered in our analysis. The Pearson's correlation between the mean expression before standardization and the DISCERN score ranges from 0.08 and 0.43. Positive correlation is induced because genes with very low mean expression tend to have lower DISCERN scores, indicating that there is probably not an issue in terms of overly selecting genes whose expression are essentially noise. In LUAD, the voom normalization method removes many genes with low expression. We can see that many dots were removed between -5 and -10 for LUAD (bottom). (TIFF) S2 Fig. Comparison of the enrichment for survival associated genes in gene sets known to be important in cancer and high DISCERN-scoring genes. We consider a set of genes previously known to be cancer drivers based on Cancer Gene Census (CGC) [82], genes annotated to be associated with the disease (AML, breast cancer or lung cancer) in the gene-disease annotation database Malacards [81], and genes predicted to have driver mutations identified by MutSig [5] in the corresponding cancer type. We compare these 3 gene sets with genes with high DISCERN scores: 1,351 genes (AML), 2,137 genes (BRC), and 3,836 (LUAD) genes whose FDR corrected significance p-values are less than 0.05. We evaluated each method in terms of the enrichment p-value for survival-associated genes in the following datasets: (A) AML2 containing 400 CGC genes, 716 Malacards genes, and 38 MutSig genes; (B) BRC3 containing 258 CGC genes, 2160 Maracards genes, and 34 MutSig genes; and (C) LUAD containing 400 CGC genes, 514 Malacards genes, and 238 MutSig genes. Kolmogorov-Smirnov test p-value measuring the significance of the difference in score between genes differentially bound by the corresponding transcription factor (TF) (x-axis) and those not differentially bound by the corresponding TF. We assume that a TF binds to a gene if the TF has a peak within 5kbs of the transcription state site (TSS) of the gene. We performed the one-sided test with an alternative hypothesis that differentially bound genes have higher scores; thus high −log 10 (p-value) means that high-scoring genes tend to show differential binding. The TFs are divided into the 3 sets: (A) TFs that are known to be associated with leukemia, (B) TFs that are known to be associated with cancer, and (C) TFs that are currently not known to be associated with cancer or leukemia, based on the gene-disease annotation database Malacards [81]. (D) Comparison of the p-values from the Pearson's correlation between the score of each gene and the proportion of differential TFs out of all TFs bound to the genes. (E) Kolmogorov-Smirnov test (one-sided as above) p-value measuring the significance of the difference in score between the genes with differential binding purely based on the DNase-seq data and those not. Here, the differentially bound gene is defined as the gene has a DNase signal within a 150bp window around its TSS in one condition but not in the other condition. (EPS) S4 Fig. Epigenomic data analysis using the GM12878 cell line and K562 cell line. Kolmogorov-Smirnov test p-value measuring the significance of the difference in score between genes differentially bound by the corresponding transcription factor (TF) (x-axis) and those not differentially bound by the corresponding TF. We assume that a TF binds to a gene if the TF has a peak in the active enhancer regions (marked by H3K27ac) within 15kbs of the transcription state site (TSS) of the gene. We performed the one-sided test with an alternative hypothesis that differentially bound genes have higher scores; thus high −log 10 (p-value) means that highscoring genes tend to show differential binding. The TFs are divided into the 3 sets: (A) TFs that are known to be associated with leukemia, (B) TFs that are known to be associated with cancer, and (C) TFs that are currently not known to be associated with cancer or leukemia, based on the gene-disease annotation database Malacards [81]. (D) Comparison of the p-values from the Pearson's correlation between the score of each gene and the proportion of differential TFs out of all TFs bound to the genes. (E) Kolmogorov-Smirnov test (one-sided as above) pvalue measuring the significance of the difference in score between the genes with differential binding purely based on the DNase-seq data and those not. Here, the differentially bound gene is defined as the gene has a DNase signal within a 150bp window around its TSS in one condition but not in the other condition. (EPS) S5 Fig. Epigenomic data analysis using the GM12878 cell line and K562 cell line. Kolmogorov-Smirnov test p-value measuring the significance of the difference in score between genes differentially bound by the corresponding transcription factor (TF) (x-axis) and those not differentially bound by the corresponding TF. We assume that a TF binds to a gene if the TF has a peak within 5kbs around the transcription state site (TSS) of the gene. We performed the onesided test with an alternative hypothesis that differentially bound genes have higher scores; thus high −log 10 (p-value) means that high-scoring genes tend to show differential binding. The TFs are divided into the 3 sets: (A) TFs that are known to be associated with leukemia, (B) TFs that are known to be associated with cancer, and (C) TFs that are currently not known to be associated with cancer or leukemia, based on the gene-disease annotation database Malacards [81]. (D) Comparison of the p-values from the Pearson's correlation between the score of each gene and the proportion of differential TFs out of all TFs bound to the genes. (E) Kolmogorov-Smirnov test (one-sided as above) p-value measuring the significance of the difference in score between the genes with differential binding purely based on the DNase-seq data and those not. Here, the differentially bound gene is defined as the gene has a DNase signal within a 150bp window around its TSS in one condition but not in the other condition. (EPS) S6 Fig. Integrative analysis of differential binding using DNase-seq data and high DIS-CERN-scoring genes improves the enrichment of known pathways. We compared the fold enrichment of Reactome pathways between 3,394 differentially regulated genes (x-axis) and 339 differentially regulated DISCERN genes (y-axis). We considered 33 Reactome pathways, which is a subset of 605 pathways that has !5 genes. (EPS) S1 Table. The 3,545 candidate regulators used as potential features for the gene regulation model for each gene. (TXT) S2 Table. Lists of genes with significant DISCERN scores: 1,351 genes (AML), 2,137 genes (BRC), and 3,836 (LUAD) genes whose FDR corrected significance p-values are less than 0.05. (CSV) S3 Table. Survival p-values computed in AML2 and LUAD1. Genes that had a p-value from the Cox proportional hazards model (computed using survival R package) smaller than 0.01 were considered significantly associated with survival. These include 1,280 genes (AML), and 1,273 genes (LUAD). The BRC2 data are not publicly available. (CSV) S4 Table. Lists of genes differentially bound by STAT3 in NB4 and CD34+ cell lines. (CSV)