Using Functional Signatures to Identify Repositioned Drugs for Breast, Myelogenous Leukemia and Prostate Cancer

The cost and time required to develop a drug continue to be a major barrier to widespread distribution of medication. Although the genomic revolution appears to have had little impact on this problem, and might even have exacerbated it because of the flood of additional and usually ineffective leads, the emergence of high throughput resources promises the possibility of rapid, reliable and systematic identification of approved drugs for originally unintended uses. In this paper we develop and apply a method for identifying such repositioned drug candidates against breast cancer, myelogenous leukemia and prostate cancer by looking for inverse correlations between the most perturbed gene expression levels in human cancer tissue and the most perturbed expression levels induced by bioactive compounds. The method uses variable gene signatures to identify bioactive compounds that modulate a given disease. This is in contrast to previous methods that use small and fixed signatures. This strategy is based on the observation that diseases stem from failed/modified cellular functions, irrespective of the particular genes that contribute to the function, i.e., this strategy targets the functional signatures for a given cancer. This function-based strategy broadens the search space for effective drugs and yields an impressive hit rate. Among the 79, 94 and 88 candidate drugs for breast cancer, myelogenous leukemia and prostate cancer, 32%, 13% and 17% respectively are either FDA-approved/in-clinical-trial drugs, or drugs with suggestive literature evidence, with an FDR of 0.01. These findings indicate that the method presented here could lead to a substantial increase in efficiency in drug discovery and development, and has potential application for personalized medicine.


Introduction
The average research and development (R&D) cost for the 10-odd years needed to develop a new pharmaceutical now exceeds a billion dollars [1,2], with anti-cancer drugs being especially costly [2]. The process encompasses compound identification, toxicity testing in animals, early phase clinical trials, and efficacy testing in late phase trials. The failure of more than 90% of drugs during development [1] is perhaps the single greatest contributor to the overall cost of pharmaceutical R&D. This cost in time and money can in principle be substantially reduced by repositioning drugs that are already approved for other purposes.
One way to screen approved drugs for new purposes is computationally. Computational chemistry provides valuable contributions in hit- and lead-compound discovery [3]. Systems biology approaches have also been recently used to capture the complexity of drug discovery and repositioning [4,5]. Computational approaches have rarely, however, been a key contributor to drug discovery or repositioning [3]. This is in part because the majority of the studies focus on only a few genes/proteins [6], either as the drug targets or as "disease signatures", while there is increasing evidence that many effective drugs act on multiple rather than single targets [4], and evidence is starting to emerge that pathologies can be a consequence of small abnormalities in many genes, rather than major abnormalities in a few genes [7,8]. In addition, many existing methods constrain the search space by imposing similarity requirements, including similarity of ligand structures [9], expression profiles of drug response [10], topological similarity of target-drug, drug-drug and disease-drug networks [11,12], and side-effect similarity [13], which diminishes the effectiveness of de novo drug discovery.
The main idea underlying a number of current methods, including the one presented here, is to identify genes whose expressions are reverse correlated under disease and drug perturbations [14,15,16]. Our approach, however, uses functional signatures rather than gene signatures. Ideally a functional signature would be represented by pathways or other functional modules that are perturbed by the disease and restored by drugs.
The utility of such a definition is limited by the lack of a comprehensive set of functional modules/pathways. We therefore adopted an alternative approach that identifies a drug for repositioning when the reverse ordered lists of disease perturbed and drug perturbed genes have a statistically significant overlap. We thereby remove the requirement for representing a disease by a fixed number of genes. Because we use a large number of genes in our analysis, we filter out genes that are expressed differently between untreated cell line and disease samples, a step that is generally not present in gene signature based methods.
Our approach allows the detection of heterogeneous drug candidates that may restore cellular functions through different paths, in keeping with the idea that drugs acting selectively on multiple targets may be more efficacious than single-target agents, and that a particular physiological process may be modulated by multiple paths. This is in contrast to other approaches which either use a fixed small number of genes as the disease signature [14,15,16] or limit candidates to drugs whose properties (such as expression profiles) are similar to those of existing drugs [9,10,11,12,13].
As with other approaches [14,15,16] we utilize two databases: the Connectivity Map (CMAP), which provides information on expressed genes in cancer cell lines perturbed by bioactive compounds [6,14], and the Gene Expression Omnibus (GEO) [17], which stores transcript levels for various cancers. We consider as potential candidates compounds that down(up)-regulate cell-line genes which are up(down)-regulated in transformed tissue cells. We use a three-step strategy to identify candidate compounds. First, we compare the expression of genes in the untreated cell line and the cancer tissue sample, and retain genes that are expressed in both. Second, we download the ranked list of perturbed cell line genes from CMAP, and generate a ranked list of genes from tissue samples ranked by differential expression. Both steps are designed to make the expression data comparable between cell lines and cancer samples. Finally, as shown in Fig. 1, we compare the K (window size) most upregulated genes in the tissue (UC) against the K most downregulated genes in the cell line list (DB), for each compound. We assume a compound is a candidate for repositioning if there is a significant number of overlapping genes between UC and DB, and vice versa.
We illustrate that this new strategy, combining database integration with straightforward statistical analysis, is able to identify a remarkably large number of plausible candidates for myelogenous leukemia, prostate and breast cancer. Of the more than 1300 CMAP compounds, 4 are currently in use against breast cancer, 5 against myelogenous leukemia and 3 against prostate cancer. Our analysis returned 1 of the 4, 2 of the 5 and 1 of the 3. The relative plausibility of the candidates is further indicated by the fact that 11/45, 5/50 and 6/50 candidates for repositioning against breast cancer, myelogenous leukemia and prostate cancer, respectively, are currently in clinical trials for those diseases. These statistics summarize the most important indicators of performance. These results not only demonstrate the effectiveness of the approach, but also hint at its potential application to personalized medicine by reverse-correlating a patient's expression profile against the expression profiles of all available drugs, as detailed in the Discussion section.

Statistics of significant bioactive compounds
Breast cancer. As shown in Table 1, we detected 28 bioactive compounds from correlations between genes that are up-regulated in cancer (UC) and down-regulated in response to bioactive compounds (DB), and another 62 by comparing genes down-regulated in cancer (DC) to those up-regulated by bioactive compounds (UB). Ten compounds appear in both lists, i.e. they display duality: they both up-regulate down-regulated cancer genes (DC/UB) and down-regulate up-regulated cancer genes (UC/DB). Consequently, the 90 detections correspond to 80 distinct compounds, 46 of which are FDA approved. CMAP includes 4 FDA approved drugs for breast cancer. We recovered one of them, fulvestrant, which displays duality (Table 2 and Table S1); i.e. it down-regulates genes that are highly up-regulated in breast cancer, and also up-regulates genes that are highly down-regulated. The remaining 45 are FDA approved for diseases other than breast cancer and are therefore candidates for repositioning.
Myelogenous leukemia. We detected 89 (UC/DB) and 26 (DC/UB) bioactive compounds for myelogenous leukemia, 96 of which are distinct (19 show duality), and of those, 52 are FDA approved. Of the five CMAP compounds currently in use against myelogenous leukemia, we recovered 2 (etoposide and prednisone), leaving 50 candidates for repositioning.
Prostate cancer. We detected 83 (UC/DB) and 88 (DC/UB) bioactive compounds for prostate cancer. Of the 171, 89 are distinct and 51 of these are FDA approved. We recovered one of the 3 CMAP compounds that are FDA approved for prostate cancer (diethylstilbestrol), leaving 50 potential candidates for repositioning.
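The relation between the per-direction counts and the number of distinct compounds can be checked with a short calculation. The sketch below assumes only that a compound showing duality is counted once in each direction; the counts are those reported above, and the dictionary layout and function name are illustrative. The duality count it yields for prostate cancer is implied by the reported numbers rather than stated in the text.

```python
# Reported counts: detections in each direction and number of distinct compounds.
reported = {
    "breast cancer": {"uc_db": 28, "dc_ub": 62, "distinct": 80},
    "myelogenous leukemia": {"uc_db": 89, "dc_ub": 26, "distinct": 96},
    "prostate cancer": {"uc_db": 83, "dc_ub": 88, "distinct": 89},
}


def implied_dual(counts):
    """Compounds found in both directions = total detections - distinct compounds."""
    return counts["uc_db"] + counts["dc_ub"] - counts["distinct"]


dual = {disease: implied_dual(counts) for disease, counts in reported.items()}
for disease, n_dual in dual.items():
    print(f"{disease}: {n_dual} compounds show duality")
```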
Supporting evidence
(i) Recall. As indicated above, our method recovered 1/4, 2/5 and 1/3 of the CMAP compounds that are FDA approved for breast cancer, myelogenous leukemia and prostate cancer, respectively. We also note, as outlined below (iii), less direct, but nonetheless important supporting evidence for potential efficacy of a substantial number of identified compounds.
(ii) Clinical trials. Twenty-two of the predicted distinct compounds that are FDA approved, and are consequently candidates for repositioning, are in fact in clinical trials: 11 for breast cancer, 5 for leukemia and 6 for prostate cancer, representing 24% (11/45), 10% (5/50) and 12% (6/50) of the distinct candidates for those diseases.
(iii) Other evidence. As summarized in Tables 1 and S1, published results provide suggestive evidence for the potential efficacy of an additional 13 distinct breast cancer candidates, 5 distinct leukemia candidates and 8 distinct prostate cancer candidates. Six of the 13, three of the 5 and seven of the 8 are FDA approved drugs, and are therefore candidates for repositioning.

Author Summary
An effective drug for a given disease aims to bring the abnormal functions associated with the disease back to the normal state. Using the expression profile as a surrogate marker of cellular function, we introduce a novel procedure to identify candidate therapeutics by searching for those bioactive compounds that either down-regulate abnormally over-expressed genes, or up-regulate those that are abnormally under-expressed. We show that the approach detects a pool of plausible candidates as repositioned or new drugs. In contrast to previous studies, our approach uses a large, variable number of genes and/or gene combinations as a representation of functional signatures to identify bioactive compounds that modulate a given disease, irrespective of the particular genes that contribute to the cellular functions; it therefore covers potential drugs with heterogeneous properties. The method may also have potential application for personalized medicine.
(iv) Functional plausibility. We defined perturbed pathways as those over-represented in genes that are significantly up- or down-regulated in diseased relative to normal tissue, as explained in Methods. We expect and find that for a given disease, a number of pathways are perturbed by multiple compounds. As elaborated below, identification of common processes could provide clues about cancer biogenesis, mechanism and treatment. We look for common processes using a tandem approach: starting with pathway analysis for the most specific relations, and then using Gene Ontology analysis to search for higher order connections. We show that although the gene sets used for reverse-correlation may be different for different drug candidates, these genes involve many functions common to the target cancer.
Over-represented pathways for breast cancer. Predicted drug candidates for breast cancer are aimed at restoring expression of genes that are up-regulated by the disease in seven pathways: adherens junction, focal adhesion, ErbB signaling, riboflavin metabolism, thiamine metabolism, nucleotide excision repair and bacterial invasion of epithelial cells. The inhibition of over-expressed genes in both the adherens junction and focal adhesion pathways hints at the critical role of endothelial barrier enhancement [20] to impede cancer cell extravasation. The overrepresentation of the bacterial invasion pathway, on the other hand, indicates augmented breast cancer cell invasiveness and adhesiveness under conditions of bacterial infection. This suggests that the increased risk of metastasis due to infection could be the result of direct interaction of infectious bacteria, and not just bacterially induced inflammation [21].
The involvement of the ErbB signaling pathway is not surprising: it is well-known that the ErbB protein family or epidermal growth factor receptor (EGFR) family, especially ErbB-2 (HER-2), is often over-expressed with aggressive clinical behavior and poor outcome in patients with breast cancer [22]. In addition, the dual inhibition of the focal adhesion and EGFR signaling pathways can cooperatively enhance apoptosis in breast cancers [23]. The identification of these pathways is consistent with the recent development of therapy for breast cancer, i.e., targeting of ErbB-2 with trastuzumab, and vascular endothelial growth factor (VEGFA) with bevacizumab in combination with chemotherapy has proven to be a milestone in molecular targeted therapy for breast cancer [24].
The over-representation of both the riboflavin (vitamin B2) metabolism and thiamine (vitamin B1) metabolism pathways is consistent with previously noted connections between vitamin B complex and breast cancer [25]. In addition the serum levels of the estrogen inducible riboflavin carrier protein, which occupies a key position in riboflavin metabolism, may be useful as a new marker to predict early-stage breast cancer [26].
Finally, the nucleotide excision repair pathway corrects DNA damage caused by environmental toxins including cigarette smoke and ultraviolet radiation. Polymorphisms in this pathway have been reported in breast cancer patients [27], suggesting the possibility of impaired repair and consequent accumulation of mutations. More specifically a number of genes in this pathway, such as ERCC4 (Table S2), are tightly associated with breast cancer [28]. In addition, one of the important cancer-related genes, P53, regulates excision repair through DNA damage response genes such as GADD45 [29].
Genes whose expression is repressed by breast cancer and increased by predicted drug candidates (DC/UB) are over-represented in only two pathways. The involvement of the cytochrome P450 (CYPs) pathway indicates that the CYPs may be key enzymes in breast cancer formation and cancer treatment. Their importance lies in the fact that they metabolize drugs used for cancer treatment, and are therefore potential targets for anticancer therapy [30]. Among our top ranking genes for predicted breast cancer drugs are CYP2A6 and CYP2C19. Their pronounced polymorphism [30] suggests that for any strategy targeting them, individualized or stratified therapy could be especially critical.
Disease genes that are highly perturbed are over-represented in the ribosomal pathway. Many studies report that the morphological and functional changes in the nucleolus are a consequence of both the increased demand for ribosome biogenesis, and changes in the mechanisms controlling cell proliferation. The loss or functional changes in the two major tumor suppressor proteins, retinoblastoma protein (pRB) and p53, cause an up-regulation of ribosome biogenesis in many cancer tissues including breast cancer [31]. On the other hand, some down-regulated ribosomal proteins, such as RPL35A, RPL18 and RPL14 (Table S2) that we find in both the breast cancer tissue and cell lines have received relatively little attention [32] and might be worth pursuing.
Over-represented pathways for myelogenous leukemia. We identified five pathways with gene sets that are highly up-regulated in myelogenous leukemia, and highly down-regulated by compounds: glycerolipid (triglyceride) metabolism, glycerophospholipid metabolism, glycosylphosphatidylinositol (GPI-anchor) biosynthesis, vascular smooth muscle contraction, and transforming growth factor β (TGF-β) signaling. The first three are components of lipid metabolism whose close association to leukemia has been studied for decades [33]. Although abnormal glycerolipid metabolism is well-known to be associated with cardiovascular disease and diabetes, there is also strong evidence that alkyl glycerolipids induce apoptosis of leukemia cell lines [34]. Disordered glycerophospholipid metabolism has been reported in the leukemia cell line, and retinoic acid treatment suppresses the synthesis of ethanolamine-containing glycerophospholipids [35]. The reported connection between the synthesis of GPI-anchor and leukemia is indirect and uncommon: the deficiency of GPI-anchor occurs in rare diseases including hemolytic anemia and paroxysmal nocturnal hemoglobinuria (PNH). PNH often develops in people with aplastic anemia, which occasionally transforms into leukemia [36].
The association between the TGF-β signaling pathway and cancer is well known. The pathway is involved in tumor suppression, as well as in tumor progression and invasion [37]. Its over-representation among over-expressed genes indicates that in myelogenous leukemia it more likely behaves as a promoter. This is consistent with recent observations that support a permissive role for TGF-β in growth [38] and metastasis [39] of established tumors.
There seems to be no direct association between the vascular smooth muscle (VSM) contraction pathway and leukemia/cancer; its over-representation may be the result of over-representation of genes shared by relevant pathways. In particular, examination of the genes involved in the VSM pathway (Table S3) indicates that the two most frequently appearing genes, PLA2G6 and ROCK2, are also genes in the GPI-anchor biosynthesis pathway and the TGF-β pathway respectively; and TGF-β is known to promote the contractile phenotype in VSM cells [40].
Three pathways have been reported among DC/UB genes; two of them, the apoptosis [41] and cell cycle [42] pathways, are well known to be cancer-associated, and have been studied extensively for myelogenous leukemia. While molecular defects in apoptotic pathways are thought to often contribute to the abnormal expansion of malignant cells and their resistance to chemotherapy, the abnormality of the cell cycle pathway usually produces cells with too many or too few chromosomes (aneuploidy), which is frequently associated with the transition to leukemia. The third pathway, the T-cell receptor signaling pathway, is central to cell-mediated immunity, which is invariably activated by tumor associated antigens [43]. The down-regulated T-cell receptor signaling genes which are reactivated by the predicted drug candidates include PTPRC, CD8A, CD3D and src family protein kinase (FYN), all of which play key roles in triggering intracellular signaling including activation-induced cell death [44].
Gene ontology (GO) term enrichment analysis. In order to obtain broader insight we examined enriched GO terms among the identified gene sets using the GO Term Enrichment Analysis (GOTEA) and batch mode of the VisANT system [45]. For the purpose of comparison, we use informative GO terms under which there are more than 400 annotated genes with FDR < 0.01, and mark the terms using the abbreviation of corresponding KEGG pathways whenever they can be matched. The detailed results are listed in Tables S4 and S5. As expected, this analysis reveals more cellular functions, as well as the cellular compartments where these functions are carried out. Most of the over-represented pathways are reproduced. More interestingly, this analysis also finds GO terms that are shared between UC/DB and DC/UB, probably because some of the terms, such as "regulation of transport", are not specific enough. We also find some GO terms common to Tables S4 and S5, which may hint at how the drugs can be repositioned between breast cancer and myelogenous leukemia.

Discussion
We introduced a novel procedure for identifying candidate therapeutics from gene expression profiles. The general idea is that viable drug candidates will be among those bioactive compounds that either down-regulate abnormally over-expressed genes, or upregulate those that are abnormally under-expressed. We show that the idea leads to a pool of plausible candidates for repositioning.

Targeting functions
One distinguishing feature of our method is that it targets cellular functions rather than genes, i.e., the focus of the method is to bring abnormal functions associated with disease back to the normal state. This strategy is based on the observation that diseases stem from failed/modified cellular functions, regardless of which of the particular genes contributing to the function are aberrant [19]. For the purpose of finding therapeutics, we do not have a fixed list of signature genes for a given disease. Instead, from a large set of ranked differentially expressed genes for a particular disease, we find compounds whose effect on the expression of the most perturbed genes is opposite to that of the disease. This results in a number of overlapping but different (for different compounds) subsets of genes. On the other hand, for a particular disease the functions associated with the subsets are similar. This characteristic of variability at the level of genes, with conservation at the level of function, can be partially seen in Tables S2 and S3, where for each drug candidate the list of genes is very different while the list of pathways is similar.
We used mRNA expression as a surrogate measure of the functional change because of its wide availability for both drug response and disease perturbation. The method is, however, applicable to other data types (protein expression, methylation and so forth).
Since our method focuses on functional recovery and identifies different but overlapping subsets of genes for different compounds, it can cover potential drugs with heterogeneous properties. On the other hand, we do find genes that are targeted by a large number of our identified compounds. For example, LAMB1, CAV1 and RPL35 tend to be targeted by most of the predicted drugs for breast cancer, as shown in Table S2.

Mechanisms of action
The mechanisms and range of action of many current drugs are poorly understood. Even drugs with known targets often have "off-target" effects [5]. While many such effects are undesirable, some of them provide the opportunity for repositioning. We have used pathway analysis to interpret the functional rationale for repositioning. The same analysis also provides some understanding of mechanism.
As an example, consider tamoxifen, which is used extensively for the treatment of both early and advanced estrogen receptor positive (ER+) breast cancer [46]. Our results indicate that tamoxifen is a candidate for repositioning to myelogenous leukemia. In particular, the overrepresentation of genes in the TGF-β signaling pathway, which are up-regulated in myelogenous leukemia and down-regulated by tamoxifen, suggests the possibility that aberrant TGF-β signaling plays a role in myelogenous leukemia. Since TGF-β production is down-regulated by tamoxifen in other tissues [47], tamoxifen might function as an anti-myelogenous leukemia drug by repressing this pathway (Table S3). This suggestion is supported by the fact that expression of estrogen receptors ESR1 and ESR2 is relatively unaffected by treatment with tamoxifen (of the 20,469 ranked genes, ESR1 and ESR2 ranked 4184 and 4734 respectively, well outside the top ranking genes used in the study: 700/800 for UC/DB and DC/UB). Consequently it seems unlikely that the effect of tamoxifen on leukemic cells is mediated by these receptors.
We therefore speculate that tamoxifen acts similarly in breast cancer, and thereby exerts its effects in a dual manner; i.e. through inhibition of TGF-β, in addition to inhibition of estrogen. Militating against this possibility are the facts that the TGF-β pathway is not over-represented in UC/DB transcripts, and other investigations did not find evidence for the regulation of TGF-β genes/proteins by tamoxifen in breast cancer patients [48]. On the other hand an increased expression of TGF-β1, which is often seen in tumors of breast cancer patients, correlates with poor prognostic outcome [49]. This apparent conflict might be resolved by the recent discovery that tamoxifen decreases extracellular TGF-β1 proteins secreted from breast cancer cells, but not intracellular ones [50]. This result is also compatible with our finding that the adherens junction and focal adhesion pathways are both overrepresented in breast cancer cells, and these pathways are potentially inducible by TGF-β [37]. These observations are in line with other studies documenting decreased metastasis when TGF-β signalling is blocked in high-grade breast tumors [51], and suggest that tamoxifen represses the metastasis of breast cancer cells by down-regulating the TGF-β pathway and preventing loss of polarity and cell-cell contacts.
Taken collectively, the functional analysis of our results suggests a potential mechanism for tamoxifen, which is independent of an interaction with the estrogen receptor, and has tamoxifen suppressing tumor metastasis and growth by down-regulating TGF-β signaling.

Beyond repositioning
Our results also suggest that some exploration of the identified non-FDA-approved drugs (new drug candidates) could be fruitful. If the fraction of FDA approved drugs in clinical trials is taken as a measure of what is worth exploring (i.e. we conservatively neglect other supporting evidence), then we would expect 8 of the 34 non-FDA approved drugs for breast cancer, 4 of the 44 for myelogenous leukemia and 5 of the 38 for prostate cancer to be ultimately worthy of clinical trials (i.e. we would expect this number to get through animal toxicity tests, and efficacy tests when available, and enter phase 1 trials).
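The expected numbers quoted above follow from applying the clinical-trial fraction of the FDA approved candidates to the non-FDA approved compounds. A minimal sketch of that arithmetic, using only counts reported in this paper (variable names are illustrative, and rounding to the nearest integer is our assumption about how the figures were obtained):

```python
# (candidates in clinical trials, FDA approved candidates for repositioning)
trial_fraction = {
    "breast cancer": (11, 45),
    "myelogenous leukemia": (5, 50),
    "prostate cancer": (6, 50),
}
# Distinct compounds that are not FDA approved (80 - 46, 96 - 52, 89 - 51).
non_fda = {
    "breast cancer": 34,
    "myelogenous leukemia": 44,
    "prostate cancer": 38,
}

expected = {
    disease: round(in_trials / candidates * non_fda[disease])
    for disease, (in_trials, candidates) in trial_fraction.items()
}
print(expected)
```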

Limitation and future development
There are several issues that may limit the future development of the approach. First, the optimization of the window size requires that known FDA-approved drugs be available in CMAP, which may not always be the case, especially when expanding this approach to other diseases that are functionally close to the three cancers. Second, the sensitivity of the approach to the subtype, or the stage, of the same disease needs to be studied further. The approach will have great application to personalized medicine if it is able to identify different drugs for the disease at different stages/subtypes, because a patient expression profile is relatively inexpensive to obtain. Finally, although mRNA expression is used here to measure the functional change of the cell, we expect better results from other data that may be more representative of cellular functions, such as protein expression.

Transcript expression
Expression data in response to bioactive compounds for breast cancer, prostate cancer and myelogenous leukemia cell lines were obtained from the Connectivity Map (http://www.broad.mit.edu/CMAP/) (Build 02) [6,14]. Differential expression data in response to breast cancer (GDS2617), leukemia (GDS2908), and prostate cancer (GDS1439) were obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) [17]. The data sets were chosen so that each contains a fairly large number of samples and the expression values are normalized by the GEO database. The ranked list of differentially expressed genes for a given cancer is calculated using the t-statistic.
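The ranking step can be sketched as follows. The text specifies only that a t-statistic is used; the unequal-variance (Welch) form, the function names and the toy expression values below are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(cancer, normal):
    """Welch's t-statistic; positive values mean higher expression in cancer."""
    se = sqrt(variance(cancer) / len(cancer) + variance(normal) / len(normal))
    return (mean(cancer) - mean(normal)) / se


def rank_genes(expression):
    """Rank genes from most up-regulated to most down-regulated in cancer.

    expression maps gene -> (cancer sample values, normal sample values).
    """
    scores = {gene: welch_t(c, n) for gene, (c, n) in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical normalized expression values for three genes.
toy = {
    "GENE_A": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "GENE_B": ([1.0, 2.0, 3.0], [5.0, 6.0, 7.0]),
    "GENE_C": ([2.0, 3.0, 4.0], [2.0, 3.0, 5.0]),
}
ranked = rank_genes(toy)
print(ranked)
```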

Gene filtering
The bioactive compound specific signatures fetched from CMAP are based on cell lines (i.e. cancerous cells with and without treatments), while those from GEO are based on tissue cells (i.e. normal and cancer tissue cells). Since the different cell types are not directly comparable, we first normalized gene expression according to the untreated cell line and the cancer tissue samples. We retain only genes that are expressed in both tissue and cell line. In particular we applied the t-test to the normalized scores, and corrected the p-values for multiple testing by a false discovery rate (FDR) procedure. The FDR is defined as the expected proportion of false positives among the significant results and is a more appropriate measure than the raw p-value for multiple hypothesis testing. The FDR threshold was set at 0.01, and the genes with clearly different expression were removed from both samples. As a result, we retained 15572 genes (77%), 20469 genes (92%), and 12220 genes (55%) for breast cancer, myelogenous leukemia, and prostate cancer, respectively.
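A minimal sketch of the filtering step is given below. The text does not name the FDR procedure; the Benjamini-Hochberg adjustment is assumed here, and the function names and toy p-values are hypothetical. Genes whose adjusted p-value falls below the threshold differ clearly between the untreated cell line and the cancer tissue, and are removed.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def retain_comparable(pvalue_by_gene, fdr=0.01):
    """Keep genes NOT significantly different between cell line and tissue."""
    genes = list(pvalue_by_gene)
    qvalues = benjamini_hochberg([pvalue_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, qvalues) if q >= fdr]


# Hypothetical t-test p-values (untreated cell line vs. cancer tissue).
toy_pvalues = {"G1": 0.001, "G2": 0.008, "G3": 0.039, "G4": 0.041, "G5": 0.9}
retained = retain_comparable(toy_pvalues)
print(retained)
```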

Comparison of reverse-correlated cancer and bioactive compound specific gene sets
We prepared two types of ranked lists of genes. One was generated from tissue samples ranked by differential expression between normal and cancer tissues from GEO data. The other was obtained from the ranked list of perturbed cell line genes from CMAP. In the former case, the top and bottom k genes were defined as up-regulated genes in cancer (UC) and down-regulated genes in cancer (DC). In the latter case, the top and bottom k genes were defined as up-regulated genes by bioactive compounds (UB) and down-regulated genes by bioactive compounds (DB). The genes of interest are the top and the bottom k genes in a ranked list where k ranges from 100 to 10000 in increments of 100.
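The four gene sets can be illustrated with a short sketch. It assumes both ranked lists are ordered from most up-regulated to most down-regulated; the function name and the toy gene identifiers are hypothetical, and real lists contain thousands of genes.

```python
def top_bottom(ranked_genes, k):
    """Return the top-k and bottom-k genes of a ranked list as two sets."""
    if not 0 < k <= len(ranked_genes):
        raise ValueError("k must be between 1 and the length of the ranked list")
    return set(ranked_genes[:k]), set(ranked_genes[-k:])


# Window sizes scanned in the text: 100, 200, ..., 10000.
WINDOW_SIZES = range(100, 10001, 100)

# Toy rankings, most up-regulated first.
tissue_ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]    # GEO: cancer vs. normal
compound_ranked = ["g6", "g4", "g5", "g3", "g1", "g2"]  # CMAP: treated vs. untreated

uc, dc = top_bottom(tissue_ranked, 2)    # up / down in cancer
ub, db = top_bottom(compound_ranked, 2)  # up / down under the compound
uc_db = uc & db  # up in cancer, down under the compound
dc_ub = dc & ub  # down in cancer, up under the compound
print(sorted(uc_db), sorted(dc_ub))
```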
We counted overlapping genes between UC and DB (UC/DB) and between DC and UB (DC/UB) to investigate compounds up-regulating down-regulated cancer genes (DC/UB), or down-regulating up-regulated cancer genes (UC/DB). We performed Fisher's exact test to assess whether the overlap is significant by comparing the number of overlapping genes to that expected from randomly selected genes (background). The p-value was transformed into an FDR corrected for multiple hypotheses. The FDR threshold was set at 0.01.
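The significance of an overlap can be sketched as below. We assume a one-sided test (probability of an overlap at least as large as the one observed when genes are drawn at random from the retained background); the text does not state the sidedness, and the function name and toy sets are hypothetical.

```python
from math import comb


def overlap_pvalue(n_background, set_a, set_b):
    """One-sided Fisher's exact test for the overlap of two gene sets.

    Returns (overlap size, P[overlap >= observed]) when set_b is drawn at
    random from n_background genes, of which len(set_a) belong to set_a.
    """
    a, b = len(set_a), len(set_b)
    observed = len(set_a & set_b)
    total = comb(n_background, b)
    tail = sum(
        comb(a, i) * comb(n_background - a, b - i)
        for i in range(observed, min(a, b) + 1)
    )
    return observed, tail / total


# Toy example: 10 background genes, two sets of 3 genes sharing 2.
overlap, p = overlap_pvalue(10, {"g1", "g2", "g3"}, {"g2", "g3", "g4"})
print(overlap, p)
```

In the full analysis this p-value would be computed for every compound at each window size k and then corrected for multiple testing.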

Choice of window size
For each value of k, a compound is labeled as bioactive if the number of overlapping genes (as explained in Fig. 1) is statistically significant. The sensitivity and specificity were calculated by measuring the proportions of true positives (fraction of FDA drugs identified) and true negatives (fraction of identified compounds that failed clinical trials). For each cancer, we chose values of k (one for UC/DB and one for DC/UB) that gave maximum specificity, subject to the constraints of nonzero sensitivity (at least 1 correct prediction), nonzero duality and an FDR less than 0.01. In this way we identified for further investigation a total of 90 compounds (and associated genes) for breast cancer (28 suppressors of up-regulated cancer genes; 62 enhancers of down-regulated genes); 36 compounds for myelogenous leukemia (10 suppressors; 26 enhancers), and 171 compounds for prostate cancer (83 suppressors; 88 enhancers). The results for different window sizes are presented in Table S7 and Fig. S1.
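The selection rule can be sketched as a constrained maximization. The record type, field names and toy values below are hypothetical; compounds are assumed to have already passed the FDR threshold, and ties in specificity are broken in favor of the first window listed, which the text does not specify.

```python
from typing import NamedTuple, Optional, Sequence


class WindowResult(NamedTuple):
    k: int              # window size
    sensitivity: float  # fraction of known FDA drugs recovered
    specificity: float
    n_dual: int         # compounds significant in both UC/DB and DC/UB


def choose_window(results: Sequence[WindowResult]) -> Optional[WindowResult]:
    """Pick the window with maximum specificity among admissible windows."""
    eligible = [r for r in results if r.sensitivity > 0 and r.n_dual > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.specificity)


# Toy values for illustration only.
toy_results = [
    WindowResult(100, 0.00, 0.99, 2),   # rejected: zero sensitivity
    WindowResult(700, 0.25, 0.95, 10),
    WindowResult(800, 0.25, 0.90, 12),
    WindowResult(900, 0.50, 0.97, 0),   # rejected: zero duality
]
best = choose_window(toy_results)
print(best)
```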

Pathway over-representation analysis
We mapped correlated genes in UC/DB and in DC/UB onto the KEGG pathways and counted, for each pathway, the number of genes mapped and the total number of genes in the pathway. Given these counts and the total number of genes we used, a p-value is calculated from the hypergeometric distribution [52]; we accepted only pathways with p-values below 0.05 as over-represented pathways [53].
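Written out under the usual upper-tail convention (an assumption, since the formula is not given explicitly here), with N the total number of genes used, M the number of those genes belonging to a given KEGG pathway, n the number of correlated genes, and x the number of correlated genes mapped onto the pathway (all four symbols are introduced only for this illustration), the p-value would take the form:

```latex
p \;=\; \sum_{i=x}^{\min(M,\,n)} \frac{\binom{M}{i}\,\binom{N-M}{\,n-i\,}}{\binom{N}{n}}
```

A pathway is then called over-represented when p < 0.05.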

Drug and clinical trial information retrieval
We collected data from the KEGG DRUG Database (http://www.genome.jp/kegg/drug/), DrugBank (http://www.drugbank.ca/) and PharmGKB (http://www.pharmgkb.org/) to map International Nonproprietary Names (INN) to generic names and aliases. FDA approved drugs were found from the FDA service Drugs@FDA. All clinical trials data and references that we checked for our predictions are shown in Table 2 and Table S1 with corresponding hyperlinks.
Figure S1 The specificity and the sensitivity of the bioactive compounds identified at each value of the parameter k for each cancer type, both with and without filtering out genes with apparently different expression between cell types. (A) Breast cancer with filtering (B) Breast cancer without filtering (C) Leukemia with filtering (D) Leukemia without filtering (E) Prostate cancer with filtering (F) Prostate cancer without filtering. (JPG)
Table S1 Candidates for repositioning for three cancers. FDA approved compounds (*); Compounds showing duality (1); The 1st number in the bracket associated with each compound is the p-value, the 2nd number is the number of overlapping genes.