A Novel Network Profiling Analysis Reveals System Changes in Epithelial-Mesenchymal Transition

Patient-specific analysis of molecular networks is a promising strategy for making individual risk predictions and treatment decisions in cancer therapy. Although systems biology allows the gene network of a cell to be reconstructed from clinical gene expression data, traditional methods, such as Bayesian networks, only provide an averaged network for all samples. Therefore, these methods cannot reveal patient-specific differences in molecular networks during cancer progression. In this study, we developed a novel statistical method called NetworkProfiler, which infers patient-specific gene regulatory networks for a specific clinical characteristic, such as cancer progression, from gene expression data of cancer patients. We applied NetworkProfiler to microarray gene expression data from 762 cancer cell lines and extracted the system changes that were related to the epithelial-mesenchymal transition (EMT). Out of 1732 possible regulators of E-cadherin, a cell adhesion molecule that modulates the EMT, NetworkProfiler, identified 25 candidate regulators, of which about half have been experimentally verified in the literature. In addition, we used NetworkProfiler to predict EMT-dependent master regulators that enhanced cell adhesion, migration, invasion, and metastasis. In order to further evaluate the performance of NetworkProfiler, we selected Krueppel-like factor 5 (KLF5) from a list of the remaining candidate regulators of E-cadherin and conducted in vitro validation experiments. As a result, we found that knockdown of KLF5 by siRNA significantly decreased E-cadherin expression and induced morphological changes characteristic of EMT. In addition, in vitro experiments of a novel candidate EMT-related microRNA, miR-100, confirmed the involvement of miR-100 in several EMT-related aspects, which was consistent with the predictions obtained by NetworkProfiler.


Introduction
Currently, several large-scale omics projects, such as the National Cancer Institute's Cancer Genome Atlas (http://cancergenome.nih. gov/) and the Sanger Institute's Cancer Genome Project (http:// www.sanger.ac.uk/genetics/CGP/), produce large amounts of data, including genomic, epigenomic, and transcriptomic information, about cancer patients or cell lines. Two challenges in omics are to construct and analyze patient-specific molecular networks to develop a comprehensive understanding of the molecular mechanisms of tumorigenesis and to identify molecules that are critical for tumor proliferation and progression [1]. If these challenges can be overcome, it may be possible to personalize cancer therapy, improve its efficacy, and reduce its toxicity and cost [2,3].
Systems biology integrates various types of omics data and computational tools to represent and analyze complex biological systems. For example, gene network estimation that is based on Bayesian networks or mutual information networks can reconstruct biological systems from gene expression data [4]. However, most traditional gene network estimation methods construct a static network by using gene expression data from different cellular conditions. As a result, these methods only produce an averaged network for all patients and cannot reveal patient-specific molecular mechanisms of cancer. In addition, it is very difficult to infer a patientspecific gene network from only a few gene expression profiles of the patient without making any assumptions about the network.
In this study, we developed a novel statistical method called NetworkProfiler, which infers patient-specific gene regulatory networks from a dataset of cancer gene expression profiles. NetworkProfiler is based on a statistical graphical model with varying coefficients and a kernel-based data integration method with elastic net regularization for parameter estimation. A key feature of NetworkProfiler is that the strengths of the relationships between genes are allowed to vary depending on cancer characteristics, such as cancer progression, metastasis, disease-free survival, and drug sensitivity. NetworkProfiler groups samples according to the specific cancer characteristics so that neighboring samples have common gene regulatory systems. Then, by integrating the gene expression profiles of neighboring samples with a kernel method, NetworkProfiler produces a gene regulatory network for each sample. Finally, we analyzed 2 post-analysis to discover upstream regulatory genes and downstream target genes for specific cancer characteristics. Network-Profiler is the first algorithm for constructing patient-specific gene regulatory networks from clinical cancer gene expression data to elucidate cancer heterogeneity.
We applied NetworkProfiler to gene expression microarray data from 762 cancer cell lines to determine system changes related to the epithelial-mesenchymal transition (EMT). The epithelial-mesenchymal transition (EMT) is a process that changes proliferating cells from an aplanetic state to a motile state [5], which allows cancer cells to leave the primary tumor and metastasize. The loss of E-cadherin, a cell adhesion molecule, is a biomarker of EMT [5]. NetworkProfiler identified 25 key regulators of E-cadherin, of which half have been previously described and the other half were novel candidates. NetworkProfiler also revealed regulatory changes in miR-141, ZEB1, and E-cadherin. Specifically, our results suggested that decreased expression of miR-141 in mesenchymal cells disrupts the negative feedback loop between miR-141 and ZEB1, which would allow ZEB1 to decrease the expression of E-cadherin during the EMT. In addition, we predicted 45 EMT-dependent putative master regulators that control sets of genes involved in cell adhesion, migration, invasion and metastasis, namely, 17 of which are downstream targets of TGFB1, a master switch of the EMT. To further validate the performance of NetworkProfiler, we experimentally evaluated in silico predictions obtained by NetworkProfiler. We consequently found that knockdown of KLF5, a new candidate regulator of E-cadherin, decreased E-cadherin expression and induced morphological changes characteristic of EMT. In addition, the functional involvement of miR-100 was validated in some EMT-related aspects, which was consistent with the predictions obtained by Network Profiler.

Overview of NetworkProfiler
Here, we provide an overview of NetworkProfiler; please refer to the Methods section for a complete description. NetworkProfiler is a modulator-dependent graphical model because it includes a modulator (M) variable in addition to regulator (R) and target (T) variables (genes). R controls the transcription of T and M is a cofactor that modulates the interaction between R and T. In this study, we defined M as a biological or a clinical feature that is related to cancer, such as drug response, survival risk, or a molecule or pathway that is related to cancer initiation, progression, or metastasis. The relationships between R, T, and M are illustrated in Figure 1a. As shown in Figure 1b, the strength of the relationship between R and T varies depending on the value of M. Thus, M does not affect R and T directly; instead, it influences the strength of the relationship between R and T. In contrast, existing graphical models, such as Bayesian networks and mutual information networks [4], do not consider the effect of M (Figure 1c), so the strength of the relationship between R and T remains constant for all values of M (Figure 1d).
In addition, NetworkProfiler can infer the relationships between R and T, given a value of M. As a result, we could use NetworkProfiler to construct patient-specific networks with varying R-T relationships that reflect changes in the feature of interest in cancer patients. A simple example with synthetic data for R, T, and M is shown in Figure 2a. In this example, we assume that R regulates T only with a high value of M (Figure 2b). In this case, most existing methods that only consider R and T in all of the samples ( Figure 2c) and ignore M would conclude that R does not regulate T. In contrast, NetworkProfiler attempts to quantify the strength of the relationship between R and T for a specific value m of M by reweighting the data according to the value of M to identify the neighborhood of samples with values of M that are close to m. Then, NetworkProfiler measures the dependency between R and T on the basis of these neighboring samples. The optimization of the size of the neighborhood is explained in the Method section.
A schematic representation of the entire analytical process of NetworkProfiler is shown in Figure 3. NetworkProfiler used 2 inputs: (1) gene expression data and (2) the modulator for each sample (Figure 3a). The gene expression data was represented as a p|n matrix, where p is the number of genes and n is the number of samples (patients). If the modulator was an observable variable, then we directly applied NetworkProfiler to these inputs. However, if the modulator was a variable that is difficult to observe, then we used a signature-based hidden modulator extraction algorithm to estimate the value of the modulator. The output of NetworkProfiler is a set of gene networks for every value of M (i.e., samplespecific gene networks) shown in Figure 3b.
Afterwards, we used 2 post-analysis techniques to extract biological information from the networks. The first technique identified upstream regulators of a target gene of interest in the constructed modulator-dependent gene networks. To evaluate the modulator-dependent strength of a regulator for the target gene, we created a measure called the regulatory effect. The regulatory effect profiles of the upstream regulators for specific target genes are shown in Figure 3c. The second technique discovered putative master regulators that control downstream target gene sets with previously curated functions. To evaluate the enrichment of the target genes on a functional gene set, we created measure called the enrichment score. The resulting regulator-function matrix (Figure 3d) illustrates the candidate regulators (rows) of functions (columns) that are enhanced in the target genes.

Identification of system changes in the epithelialmesenchymal transition
To identify system changes during the EMT, we applied NetworkProfiler to gene expression profiles of 762 cancer cell lines from the Sanger Cell Line Project (http://www.broadinstitute. org/cgi-bin/cancer/datasets.cgi). This dataset included the expression profiles of 22,777 probes, which correspond to 13,006 mRNAs in these cancer cell lines from the Affymetrix GeneChip Human Genome U133 Array Set (HG-U133A) and the expression profiles of 502 human microRNAs from bead-based oligonucleotide arrays. The MAS5-normalized mRNA dataset was further transformed to the log scale and quantile-normalized. During the mapping of the probes to genes, we selected 1 probe for each gene that had the largest variance, which produced a final 13,508 (genes) | 762 (cancer cell lines) gene expression matrix.
In this study, we considered transcription factors, nuclear receptors, and microRNAs to be potential regulators. To identify transcription factors and nuclear receptors, we selected human genes that were annotated as a ''transcription regulator'' or ''ligand-dependent nuclear receptor'' from the Ingenuity Knowledge Base (IKB; http://www.ingenuity.com). We also included some transcription factors that were not annotated in the IKB but were annotated in the Biobase Knowledge Library (BKL; http:// biobase-international.com/). We mapped a total of 1230 genes in the HG-U133A microarray gene set to 1183 transcription factors and 47 nuclear receptors (Table S1). In addition, we included 502 human miRNA probes (Table S2).
To calculate the modulator values for the EMT in the 762 cancer cell lines, we applied a signature-based hidden modulator extraction algorithm (see Methods for details) to the expression data. First, we selected 122 genes labeled ''EMT_UP'', ''EMT_DN'', ''JECHLIN-GER_EMT_UP'', and ''JECHLINGER_EMT_DN'' from Molecular Signatures Database v2.5 ([6]; http://www.broadinstitute.org/ gsea/msigdb/index.jsp). Then, this algorithm narrowed the set to 50 functionally coherent genes with pv10 {5 by using the extraction of expression module (EEM) [7] (Table S3) and computed the first principal component of these 50 genes as hidden values of the EMT-related modulator (Table S4). Since the direction of the first principal component did not always correspond to that of the EMT, we changed the sign of the modulator values by multiplying either plus or minus one so that epithelial-like cells have lower modulator values than mesenchymal-like cells. Figure 4 shows the expression profiles of the 50 functionally coherent genes in ascending order of the EMT-related modulator values. These modulator values clearly discriminated cell lines that were epithelial-like or mesenchymal-like. Specifically, cells with smaller or larger modulator values had more epithelial or mesenchymal phenotypes, respectively. Furthermore, many carcinomas and squamous tumors had low modulator values, while many gliomas and melanomas had high values. By using these EMT-related modulator values, NetworkProfiler constructed 762 regulatory gene networks that are related to the EMT. The list of the estimated edges in each of these networks can be downloaded from the supporting web site (Files S1, S2, and S3; http://bonsai. hgc.jp/*shima/NetworkProfiler).

Identification of regulators of E-cadherin that induce the epithelial-mesenchymal transition
To identify possible regulators that might control the expression of E-cadherin during the EMT, we calculated the regulatory effects of the upstream regulators of E-cadherin. Out of 1732 potential regulators, NetworkProfiler inferred that 370 of them may control the expression of E-cadherin in any of the 762 cancer cell lines (Table S5). These putative regulators were ranked according to the change in their regulatory effect during the EMT. Although we did not include any information on known E-cadherin regulators, about half of the 25 highest ranked regulators were previously reported in the literature (Table 1). For example, 2 zinc finger transcription factors, ZEB1 and ZEB2, are direct repressors of E-cadherin and are involved in the EMT [9,15]. In addition, the miR-200 family indirectly suppresses the EMT by inhibiting the translation of ZEB1 and ZEB2 mRNAs [8]. Similarly, miR-192 inhibits the translation of ZEB2 [13,14]. In addition, SNAI2, a member of the Snail superfamily of zinc finger transcription factors, also is involved in the EMT [16]. Likewise, TCF4 (also known as E2-2), a class I bHLH transcription factor, is an EMT regulator; its isoforms induce the EMT in MDCK kidney epithelial cells [12]. In contrast, FOXA1 and FOXA2 are positive regulators of E-cadherin, which suppress the EMT in pancreatic ductal adenocarcinoma [11]. KLF4 also inhibits the EMT by regulating E-cadherin expression [10]. NetworkProfiler also identified several other known direct repressors of E-cadherin, such as TWIST1 [17] and TCF3 (also known as E47) [18]; however, these regulators were ranked 38th and 84th, respectively.
The other half of the 25 highest ranked regulators has not yet been reported and may be novel EMT-dependent regulators of Ecadherin. For example, although the relationship between GRHL2 and EMT is not known, GRHL2 is required for morphogenesis of epidermal and tracheal cells and plays an important role in regulating the expression levels of E-cadherin in Drosophila post-embryonic neuroblasts [19]. ZNF217 binds the Ecadherin promoter [20], which suggests that ZNF217 might be a transcription factor for E-cadherin.
Next, we compared the performance of NetworkProfiler with that of a structural equation model (SEM) of E-cadherin that was inferred by the elastic net [22]. This model was equivalent to a regression model where the response variable is the expression of E-cadherin and the explanatory variables are the 1732 regulator expressions. The significance of each regulator was evaluated based on the number of non-zero regression coefficients in 1000 bootstrapped datasets. The SEM inferred 627 putative regulators (Table S6). Among these putative regulators, there were only 6 regulators, namely, ZEB1, miR-141, ZEB2, TCF3, miR-200b, and miR-200c, in the 25 highest ranked regulators that were previously reported in the literature. This result suggested that NetworkProfiler was superior to the traditional gene network estimation methods to identify regulators of E-cadherin that are involved in the EMT. Moreover, NetworkProfiler can reveal regulatory changes among genes during the EMT. Figures 5a and 5b show the regulatory profiles of putative regulators of E-cadherin when the lengths of the paths from the regulators to E-cadherin is 1 and 2, respectively.
NetworkProfiler can also predict mechanistic interpretations of published experiments. For example, it is known that ZEB1 and ZEB2 induce EMT by repressing E-cadherin transcription and that ectopic expression of the miR-200 family (miR-200a, miR-200b, miR-200c, and miR-141) or miR-205 leads to downregulation of ZEB1 and ZEB2, upregulation of Ecadherin, and mesenchymal-epithelial transition (MET) in cells [8]. As the relationships between these genes, the prediction of NetworkProfiler provides the following results. As shown in Figures 6c and 6d, although the expression of miR-141 had a strong positive effect on that of E-cadherin in epithelial-like cells, this effect decreases during the EMT. In contrast, although the expression of ZEB1 had a weak negative effect on that of Ecadherin in epithelial-like cells, this effect increased during the EMT. Interestingly, miR-141 and ZEB1 had a strong, direct negative effect on each other only when the EMT-related modulator values were low. This implied that there is a negative feedback loop between miR-141 and ZEB1 in epithelial-like cells, which is consistent with a previous study [23]. Furthermore, during the EMT, the expression levels of miR-141 and Ecadherin decreased, while the expression level of ZEB1    increased. These results suggested that reduced expression of miR-141 disrupts the negative feedback loop between miR-141 and ZEB1 (Figures 6a and 6b), which would allow ZEB1 to decrease the expression of E-cadherin, as illustrated in Figure 6c. It should be noted that these results cannot be predicted by traditional graphical models which infer a static gene network structure.
Identification of relationships between regulators and epithelial-mesenchymal transition-related functional gene sets The EMT-dependent relationships between downstream target genes for each regulator and previously curated functional gene sets in each sample were analyzed by applying gene set analysis (see Methods for details) to the constructed gene networks for 762 cancer cell lines. We tested five curated gene sets included in Ingenuity Knowledge Base (IKB; http://www.ingenuity.com). These gene sets were related with adhesion, migration, invasion, and metastasis which were hallmarks of EMT [5], and EMT itself. By using gene set analysis, the statistical significances (q-values) for the enrichments of downstream genes for the 1732 regulators on the five functional gene sets were calculated in each of the 762 cell lines. These results can be downloaded from the supporting web site (File S4; http://bonsai.hgc.jp/*shima/NetworkProfiler).
To search for regulators that strongly affected the five EMTrelated functional gene sets, the change in the enrichment score during the EMT and their integral q-value were calculated. The result was summarized by a regulator function matrix (Table S7). We focused on 45 regulators with the integral q-values less than 10 {10 as putative master regulators that strongly enhanced the  functional gene sets related with the EMT. Interestingly, among the 45 regulators, 17 regulators were downstream targets of transforming growth factor b-1 (TGFB1), a master switch of EMT [24], with published evidence (Table S8). This result suggests that these regulators have crucial roles in TGFB1-induced EMT.
As a control, we tested how well the NetworkProfiler analysis identified known relationships between regulators and functional gene sets in the Ingenuity Knowledge Base. The known functional relationships of the 45 putative master regulators are shown in Table 2. For example, FOSL1 increases the migration of MDA-MB-436 cells [25] and the invasion of A549 cells [26]. SMAD3 increases the adhesion [34], the metastasis [35], and the migration [36] of cells, respectively. Similarly, HIF1A increases the adhesion of undifferentiated trophoblast stem cells [27], the metastasis of LM2 cells [41], the migration of HUVEC cells [42], and the invasion of Achn cells [43], respectively.
Although some of the 47 putative master regulators have not been reported to enhance the EMT-related functions in IKB, some predictions were supported by other resent works which were not included in IKB. For example, the prediction of NetworkProfiler suggested that PTRF regulates gene sets related with migration (q-value = 2:45|10 {8 ) and with metastasis (qvalue = 2:03|10 {6 ) during the EMT. Consistent with the in silico result, PTRF expression inhibits migration and correlates with metastasis in PC3 prostate cancer cells [61]. Similarly, NetworkProfiler predicted that miR-146 contributes to migration (q-value = 3:27|10 {9 ) and invasion (q-value = 1:01|10 {4 ) during the EMT. This in silico result is comparable with the in vitro result that miR-146 inhibits invasion and migration, and acts as a metastasis suppressor [62]. In addition, some predictions between miRNAs and functions seem reasonable based on the known functions of the miRNA host genes. For example, the prediction of NetworkProfiler provided the hypothesis that miR-143 and miR-145 promotes metastasis (qvalue = 7:17|10 {4 and 3:15|10 {5 ) and migration (qvalue = 1:37|10 {6 and 6:10|10 {8 ), respectively. miR-143 and miR-145 cooperatively target a network of transcription factors, such as KLF4, to control smooth muscle phenotype switching [63]. Since KLF4 increases the migration of cells [29] and induces EMT [10], these miRNAs might be related with EMT-related functions or control EMT by targeting KLF4. Again, it should be noted that these relationships between regulators and functions cannot be predicted from one gene network constructed by traditional graphical models, and only the results of multiple network comparison between epitheliallike and mesenchymal-like cells based on NetworkProfiler enables us to support the recent biological knowledge and new hypotheses about unknown relationships.

Comparison between in silico predictions and in vitro results
To validate the performance of NetworkProfiler, in silico predictions obtained by NetworkProfiler were evaluated experimentally. We first conducted in vitro experiments of a new candidate regulator of E-cadherin listed in Table 1, KLF5, to investigate whether KLF5 affects E-cadherin expression and induces morphologic changes characteristic of EMT using A549 lung adenocarcinoma cell line, which is well known to exhibit EMT in response to TGF-b [64]. KLF5 knockdown markedly altered a cobblestone epithelial morphology of A549 cells and induced a more fibroblast-like morphology with reduced cell-cell contact, which was similar to that seen in TGF-b-treated A549 cells (Figure 7a and Figure S1). Immunofluorescence analysis showed significant reduction of E-cadherin expression in A549 cells knocked down for KLF5 (Figure 7b), which was also confirmed by western blot analysis (Figure 7c). Conversely, vimentin expression was shown to be modestly increased by siKLF5 treatment (Figure 7c). Consistent with the in vitro results, the prediction of NetworkProfiler suggested that KLF5 affects Ecadherin expression as well as Vimentin expression during the EMT, since the changes in the regulatory effects from KLF5 to Ecadherin and Vimentin were much larger compared with the other regulators (12.42 and 16.57, respectively) which was ranked 15-th and 10-th among the 1732 regulators (Table S9). The result of gene set analysis (Table S7) also suggested that KLF5 affects EMT (q-value = 1:60|10 {24 ). Thus, we consequently found that in silico predictions obtained by NetworkProfiler was confirmed with the results of in vitro experiments; KLF5, a newly identified candidate regulator of EMT, was shown to affect expressions of Ecadherin and Vimentin as well as morphologic characteristics related to EMT as a repressor of EMT.
We also conducted in vitro experiments to validate functional involvement of a novel candidate EMT-related microRNA, miR-100 whose expression was increased in 762 cancer cell lines during the EMT ( Figure S2). miR-100 was found to be expressed at a low level in A549, NCI-H727 and NCI-H1439 NSCLC cell lines, which had low EMT-related modulator values among the 762 cell lines panel (Figure 8a). miR-100 was transiently introduced into A549 cells, resulting in a significant increase of cell migration activity (Figure 8b). However, overexpression of miR-100 did not affect expressions of an epithelial marker, E-cadherin, and a mesenchymal marker, vimentin (Figure 8c), and also did not influence cell morphology (Figure 8d). However, overexpression of miR-100 significantly increased cell migration without noticeably affecting morphology in NCI-H727 and NCI-H1437 cells ( Figure  S3). Consistent with the in vitro results, the prediction of NetworkProfiler suggested that miR-100 enhances migration (qvalue = 1:42|10 4 ) but does not affect EMT itself (q-value = 0.24) from gene set analysis (Table S7). It also suggested that miR-100 does not affect the expressions of E-cadherin and Vimentin during the EMT, since E-cadherin and Vimentin were not target genes of miR-100 in all the 762 cell line-specific gene networks related with the EMT(Files S1, S2, and S3) and the changes in the regulatory effects from miR-100 to E-cadherin and Vimentin were much smaller compared with the other regulators (0 and 1.72, respectively), which were ranked 371-th and 151-th among the 1732 regulators (Table S9). Thus, we conclude that several hypotheses of miR-100 functions provided by NetworkProfiler are consistent with the results of in vitro experiments; NetworkProfiler has the potential to uncover novel biological mechanisms.

Discussion
We developed a novel algorithm called NetworkProfiler to infer patient-specific, modulator-dependent gene regulatory networks from gene expression data. Unlike traditional methods that infer a static network for a specific state of a cell or an averaged network for many patients, NetworkProfiler can be used to construct patientspecific gene networks for specific diseases, such as cancer. Subsequently, information about the regulatory effects of individual genes and functional gene sets can be extracted from these networks. In order to show the performance of NetworkProfiler, we applied NetworkProfiler to microarray gene expression data from 762 cancer cell lines to identify the system changes that were related to the EMT. As a result, we identified 25 EMT-dependent regulators of E-cadherin. Although some of these regulators have been reported in the literature, others may be novel master regulators of E-cadherin that induce the EMT. Moreover, in comparison to the traditional SEM approach, the performance of NetworkProfiler was superior for identifying regulators of E-cadherin during the EMT. We also showed that NetworkProfiler can reveal regulatory changes of E-cadherin during the EMT. In particular, our results suggested that decreased expression of miR-141 disrupts the negative feedback loop between miR-141 and ZEB1, which would allow ZEB1 to decrease the expression of E-cadherin.
Furthermore, we also identified putative relationships between regulators and EMT-dependent functional gene sets, some of which had published evidence. Based on the significance of the enrichment of downstream target genes for the regulator on the 5 functional gene sets, we identified 45 putative master regulators for the EMT. We found that 17 regulators were downstream targets of TGFB1 that is a master switch of the EMT. We then showed that NetworkProfiler can not only predict the relationships between these regulators and functions that were supported by many published evidence, but also produce new hypotheses that some of them might enhance EMT-related functions or induce EMT.
Finally, it is of note that we were able to validate the in silico predictions obtained by NetworkProfiler in our in vitro experiments. KLF5, a newly identified candidate regulator of EMT, was experimentally shown to affect E-cadherin expression as well as morphologic characteristics related to EMT, validating the NetworkProfiler-based prediction that KLF5 is a negative regulator of EMT. We also conducted in vitro experiments of another regulator, miR-100, for which NetworkProfiler predicted its association with some EMT-associated functions. As a result, we found that the predicted miR-100 functions conformed to the results of in vitro experiments. Thus, we conclude that the effectiveness of the proposed method was validated not only from published literature but also from new in vitro validation experiments.
We anticipate several possible applications and extensions of NetworkProfiler. In this study, we only focused on the system changes that are associated with the EMT. NetworkProfiler also could be used to infer system changes and reconstruct modulatordependent gene networks for other well-defined modulators, such as drug sensitivity and prognosis risk. Currently, a significant limitation of NetworkProfiler is that the modulator must be onedimensional. However, cancer development is a multivariate process. It may be possible to use multivariate kernel functions in NetworkProfiler to overcome this limitation.
During the past decade, cancer therapy has become increasingly personalized [2,3]. Unlike the traditional ''one-size-fits-all'' approach to cancer therapy, patient-specific cancer therapy reduces the side effects of chemotherapy and predicts the odds of cancer recurrence more accurately by tailoring cancer treatment to specific genetic defects in the tumors of individual patients. However, this goal is not an easy task since cancer is an extremely complex and heterogeneous disease. We believe that NetworkProfiler will help elucidate the systems biology of cancer and facilitate personalized chemotherapy.

Cell lines and reagents
Human non-small cell lung cancer (NSCLC) cell lines, A549, NCI-H1437 and NCI-H727, were purchased from American Tissue Culture Collection, while other NSCLC cell lines, Calu1, Calu6 and SK-MES1, were generously provided by Dr. L. J. Old (Memorial Sloan-Kettering Cancer Center). Cells were maintained in RPMI 1640 supplemented with 10% fetal bovine serum. The anti-Ecadherin antibody was purchased from BD Transduction Laboratories, anti-vimentin from Santa Cruz Biotechnology, anti-a-tublin from Sigma Aldrich, and anti-mouse IgG from Cell Signaling Technology. The Alexa-conjugated anti-mouse IgG was purchased from Molecular Probes. siRNAs against KLF5 (siKLF5 #1 and #2) and a negative control (siNC) were purchased from Sigma Genosys. Pre-miR has-miR-100 and negative control #2 were purchased from Ambion. Human TGF-b was purchased from R&D Systems, Inc.
Immunostaining, western blot analysis and in vitro motility assay 2|10 4 cells in 6-well plates were transiently transfected with either 20 nM siRNA or 10 nM Pre-miR molecules using Lipofectamine RNAiMAX (Invitrogen), as previously described [65]. Immunofluorescence staining was carried out after fixation with 3.7% formaldehyde and postfixing with 0.1% Triton X-100 each for 10 min at RT. Photographs were taken 72 hr after transfection. Cells were harvested 48 hr after transfection for western blot analysis. In vitro motility assay based on Transwell-chamber culture systems was performed, as previously described [66].

Quantitative real-time reverse transcription (RT)-PCR analysis
Quantitative real-time RT-PCR analysis of KLF5 was performed using Power SYBR Green (Applied Biosystems) and the following PCR primers: 59-CCCTTGCACATACACAATGC-39 and 59-GGATGGA-GGTGGGGTTAAAT-39. Quantitative real-time RT-PCR analysis of miR-100 and RNU44 was performed using TaqMan probes and 7500 Fast Real-Time PCR system (Applied Biosystems), essentially as previously described [67].

NetworkProfiler
NetworkProfiler employed a varying-coefficient structural equation model (SEM) to represent the modulator-dependent conditional independence between gene transcripts. Let there be q possible regulators, R 1 , . . . ,R q , that may control the transcription of the k-th target gene T k when the modulator M~m. Then the varying-coefficient structural equation model for T k is where b jk (m) is the coefficient function that represents the effect of R j on T k , R 0~1 , and e k is a noise term. If T k~Rl , then the term b lk (m) : R l can be omitted from the model, i.e., b lk (m)~0 for all m. By estimating the parameters b jk (m), we obtain the transcriptional regulatory gene network at M~m.
We used a kernel-based method to estimate these parameters. Let there be n sets of gene expression profiles. Then, the SEM for the a-th sample can be rewritten as where t ak , r aj , and m a are the values of the k-th target gene, the jth regulator, and the modulator for the a-th sample, respectively; r 0k~1 , and b jka~bjk (m a ). For n samples, we obtain n modulatordependent gene regulatory networks, i.e., the regulatory effects of R j (j~1, . . . ,q) on T k (k~1, . . . ,p) are determined bŷ b b 111 , . . . ,b b qpn , whereb b jka is the estimate of b jka . We assumed that the values of the coefficients are almost constant for the neighborhood samples of the a-th sample with respect to the modulator m, that is, b jki &c for the i-th sample that satisfies jm i {m a j v d for some constant c and small d. Then, we estimated the parameters b jka for fixed a by minimizing a regularized kernel-based weighted residual sum of squares where K(m i {m a jh k ) is a Gaussian kernel function defined by and l ka and c ka are hyperparameters that control the L 1 (lasso [68]) and L 2 (ridge [69]) penalties, respectively. In addition, w jka is an importance weight for b jka , and h k is the bandwidth of the Gaussian kernel. The kernel function K(m i {m a jh k ) defines the neighborhood around the a-th sample in terms of M; a large value of K(m i {m a jh k ) means that the i-th sample is in the neighborhood of the a-th sample. By fixing l ka , c ka , w jka , and h k , we obtain the estimates This parameter estimation method is a weighted version of the elastic net [22]. The L 1 penalty zeroes some coefficients [68], which produces a sparse network structure. In contrast, the L 2 penalty stabilizes the solution by a grouping effect that promotes the collective inclusion or exclusion of highly correlated variables in the model [22]. The importance weights w jka allow tuning parameters to take on different values for different coefficients b jka . For example, if w jka has a large value, then an estimatorb b jka tends to be zero. In contrast, if w jka has a small value that is nearly equal to zero,b b jka tends to be non-zero. These weights create a sparser network structure than the lasso and elastic net methods. The parameters b jka were estimated by using a recursive procedure, and the weights w jka were updated by w jka~1 =(b b jka zj) [70], whereb b jka is the estimate from the previous step and j~10 {5 to avoid dividing by zero. Then, the modulator-dependent networks for n samples can be derived from the estimates ofb b jka (j~1, . . . ,q, k~1, . . . ,p, and a~1, . . . ,n). For convenience of subsequent explanations, we introduce the following notations: In these expressions, t ka (h k ) and R a (h k ) were normalized so that the means and variances for t ka (h k ) and each column of R a (h k ) were 0 and 1, respectively. As a result, the intercept b 0ka was not included in the loss function (1). For fixed h k , the loss function (1) can be minimized by using a kernel-based weighted version of the recursive elastic net [70]. The tuning parameters l ka and c ka were selected by minimizing a modified version of the bias-corrected weighted Akaike information criterion (AIC) [71]: where n a (h k )~P n i~1 k ia (h k ), andŝ s 2 ka is estimated bŷ In addition,d df ka is the unbiased estimate of the degrees of freedom given bŷ where I is the identify matrix andR R(h k ) is the submatrix of R(h k ), which has columns that correspond to the nonzero coefficients, respectively. The NetworkProfiler algorithm is shown below: Algorithm: NetworkProfiler. The results from NetworkProfiler, which are the estimates of q coefficientsb b jka (j~1, . . . ,q) for the k-th target gene of the a-th patient, depend on the values of h k . We used cross-validation to select an optimal value of h k and estimate q|n coefficients, b 1k1 , . . . ,b qkn by minimizing the cross-validation error: where S is a randomly selected set of samples andb b ({a) 1ka , . . . ,b b ({a) qka are estimated from the remaining samples by minimizing: The algorithm in NetworkProfiler for minimizing this loss function (3) is shown below: Algorithm: Conditional optimization with crossvalidation.
. Subsequently, the modulator-dependent gene networks for n samples are determined from the coefficient vectorsb b k1 (ĥ h k ), . . ., b b kn (ĥ h k ) (k~1, . . . ,p) by applying the above algorithm for all k~1, . . . ,p. The computational cost of this algorithm rapidly increases as the number of samples and genes increase. Thus, for computers that only have a single central processing unit (CPU), this algorithm is only practical for medium-sized networks with up to several genes. However, since this algorithm can be executed in parallel for every k, it can be run on a stand-alone workstation with multi-core CPUs and computer clusters. Figure S4 represents the histogram of computational times based on 12 core CPUs (Intel Xeon Processor E5450 (# of cores = 4, clock speed = 3.0GHz) | 3) for calculating 762 cancer cell line-specific gene networks from 13,508 | 762 gene expression data through 100,000 iterations when 100 target genes were randomly selected among 13,508 genes and the number of regulators was not restricted, i.e., 1732 regulators were used. The average computational time was about 9 days. In this situation, we can find putative master regulators of the focused target genes related with a modulator of interest. Of course, for calculating gene networks of 762 samples for a large number of target genes, a supercomputer is required. In this study, we used the Super Computer System at the Human Genome Center, Institute of Medical Science, University of Tokyo, Japan, to analyze 762 gene networks with 13,508 target genes.

Signature-based hidden modulator extraction
When the modulator was a variable that is difficult to observe, we used a signature-based hidden modulator extraction algorithm to estimate the value of the modulator for each sample. This algorithm takes seed genes that are related to the modulator and computes the underlying latent variable of the modulator by using principal components and extraction of expression modules (EEM) [7]. Let M be a gene set that is related to the modulator and let X M be an n|jMj matrix of n expression levels of M. Then, a linear model, which is a special case of the single factor model [72], relates M Ã , a subset of M, to an underlying latent variable U as follows: where X j is the expression level of the j-th gene in M Ã , a 0j is the yintercept, a 1j is a coefficient, and e' j is a noise term. We assumed that other genes that do not include M Ã (fX j ; j 6 [M Ã g) are independent of U.
The values of U for n samples, u i (i~1, . . . ,n), can be estimated by the following procedure: Algorithm: signature-based hidden modulator extraction.
1: For a given set M, find a subset M Ã based on the expression coherence with the EEM algorithm [7].
2: Given M Ã , singular value decomposition of the data matrix X M Ã estimates u i by the largest principal component. 3: Return the values u i (i~1, . . . ,n). In the first step, we estimate M Ã . In the second step, we assume that the noise terms e' j have Gaussian distributions with equal variances. As a result, the singular value decomposition generates maximum likelihood estimates of u i for the single factor model [72].

Regulatory effect
To identify upstream regulators that had strong effects on the expression of a target gene of interest in the constructed modulator-dependent gene networks, we defined a measure, called the regulatory effect, of the effect of the j-th regulator on the k-th target gene in the a-th sample as where p jka is the set of all possible paths from R j to T k , and b b (j?k) l (m a ) is the product of the estimated coefficients on the l-th path that includes p jka . For example, given all the possible paths from R 1 to T 2 in the a-th sample ( Figure S5), the set p 12a is and the regulatory effect RE 12a is In our analysis, the length of the paths from R j to T k is restricted to either 1 or 2.
To determine how the modulator affects the regulatory effect RE jka , we also defined the change in the regulatory effect of the jth regulator on the k-th target as REC jk~m axfRE jka ; a~1, . . . ,ng{ minfRE jka ; a~1, . . . ,ng: ð8Þ In addition to this definition, it is also possible to use percentiles instead of max and min to achieve more robust results. However, in our analysis, we used max and min to increase the power of the method. It should be noted that the change in the regulatory effect REC jk does not explain the mode of action for the modulator with respect to the regulator-target relationship. File S5 (http:// bonsai.hgc.jp/,shima/NetworkProfiler) is provided to determine the modulator mode of action by statistical test.

Gene set analysis of downstream genes for a regulator
To identify regulators that enhanced the functions of their targets, we calculated the statistical significance of the enrichment of targets for a given regulator in each sample. To test the enrichment, we use the degree of independence between the two properties: A ja :gene is in the list of targets for the j-th regulator in the a{th sample B u :gene is a member of the u-th priori set Testing the association between the properties A ja and B u corresponds to Fisher's exact test. The p-value calculated by this test, P jua , indicates the probability of observing at least the same amount of enrichment when downstream genes are randomly selected out of all genes. Thus, a very small p-value gives strong evidence for an association between A ja and B u for the j-th regulator in the a-th sample. To correct for multiple hypotheses testing, Benjamini-Hochberg (BH)-corrected p-values (q-values) [73], Q jua , were calculated.
To determine how the modulator affects the functions of downstream genes for a regulator, we defined the enrichment score, ES ju , as a change in the statistical significance of the enrichment of targets for the j-th regulator on the u-th function: ES ju~l og ( maxfQ jua ; a~1, . . . ,ng= minfQ jua ; a~1, . . . ,ng): ð9Þ Thus, a very large ES ju indicates that the modulator causes a significant change of the enrichment of the targets for the j-th regulator on the u-th function.
To identify putative master regulators that control more functional gene sets than other regulators, we also calculated the total enrichment score, TES j , by combining independent enrichment scores, ES j1 , . . . ,ES jU , where U is the number of functional gene sets: The total enrichment score is equivalent to the difference of the Fisher's statistic {2 P k i~1 log P k [74] which was used to combine independent tests obtained from k studies based on the p-values, P 1 , . . . ,P k . The Fisher's method is based on the fact that the statistic {2 P k i~1 log P i follows a chi-square distribution with 2k degrees of freedom under the global null hypothesis that all null hypotheses are true. A small integral p-value for the hypothesis indicates that the j-th regulator controlled at least one or more functional gene sets during the change of the modulator.  Figure S4 Histogram of computational times for inferring cancer cell line-specific gene networks running on 12 core CPUs. The 762 cancer cell line-specific gene networks related with the EMT were calculated from 13,508 | 762 gene expression data when 100 target genes were randomly selected among 13,508 genes and the number of regulators was not restricted, i.e., 1,732 regulators were used. The comptational times were based on 12 core CPUs (Intel Xeon Processor E5450 (# of cores = 4, clock speed = 3.0 GHz)|3). The histogram was calculated by 100,000 iterations. (PDF) Figure S5 Example of paths among four genes, R 1 , T 2 , R 3 , and R 4 : (PDF)     Table S7 Regulator function matrix between 1732 regulators and 5 functions. The row and column indicate regulator and functional gene set, respectively. The (i,j)-th element represents the change during the EMT in the statistical significance (-log 10 (q-value)) for the enrichment of target genes of the i-th regulator on the j-th function. The last column indicate the integral q-value of each row regulator which were used to determine which regulator strongly affected the functional gene sets. (XLS)