Gene-Set Local Hierarchical Clustering (GSLHC)—A Gene Set-Based Approach for Characterizing Bioactive Compounds in Terms of Biological Functional Groups

  • Feng-Hsiang Chung ,

    Contributed equally to this work with: Feng-Hsiang Chung, Zhen-Hua Jin

    Affiliations Institute of Systems Biology and Bioinformatics, National Central University, Zhongli, 32001, Taiwan, Center for Dynamical Biomarkers and Translational Medicine, National Central University, Zhongli, 32001, Taiwan

  • Zhen-Hua Jin ,

    Contributed equally to this work with: Feng-Hsiang Chung, Zhen-Hua Jin

    Affiliation Institute of Systems Biology and Bioinformatics, National Central University, Zhongli, 32001, Taiwan

  • Tzu-Ting Hsu,

    Affiliation Institute of Systems Biology and Bioinformatics, National Central University, Zhongli, 32001, Taiwan

  • Chueh-Lin Hsu,

    Affiliation Institute of Systems Biology and Bioinformatics, National Central University, Zhongli, 32001, Taiwan

  • Hsueh-Chuan Liu,

    Affiliation Institute of Systems Biology and Bioinformatics, National Central University, Zhongli, 32001, Taiwan

  • Hoong-Chien Lee

    hclee12345@gmail.com

    Affiliations Institute of Systems Biology and Bioinformatics, National Central University, Zhongli, 32001, Taiwan, Center for Dynamical Biomarkers and Translational Medicine, National Central University, Zhongli, 32001, Taiwan, Department of Physics, Chung Yuan Christian University, Zhongli, 32023, Taiwan, Physics Division, National Center for Theoretical Sciences, Hsinchu, 30043, Taiwan


Abstract

Gene-set-based analysis (GSA), which uses the relative importance of functional gene-sets, or molecular signatures, as units for analysis of genome-wide gene expression data, has exhibited major advantages in accuracy, robustness, and biological relevance over individual gene analysis (IGA), which uses log-ratios of individual genes for analysis. Yet IGA remains the dominant mode of analysis of gene expression data. The Connectivity Map (CMap), an extensive database of genomic profiles of the effects of drugs and small molecules that is widely used for studies related to repurposed drug discovery, has been mostly employed in IGA mode. Here, we constructed a GSA-based version of CMap, Gene-Set Connectivity Map (GSCMap), in which all the genomic profiles in CMap are converted, using gene-sets from the Molecular Signatures Database, to functional profiles. We showed that GSCMap essentially eliminated cell-type dependence, a weakness of CMap in IGA mode, and yielded significantly better performance on sample clustering and drug-target association. As a first application of GSCMap we constructed the platform Gene-Set Local Hierarchical Clustering (GSLHC) for discovering insights on coordinated actions of biological functions and facilitating classification of heterogeneous subtypes on drug-driven responses. GSLHC was shown to tightly cluster drugs with known similar properties. We used GSLHC to identify the therapeutic properties and putative targets of 18 compounds of previously unknown characteristics listed in CMap, eight of which suggest anti-cancer activities. The GSLHC website http://cloudr.ncu.edu.tw/gslhc/ contains 1,857 local hierarchical clusters accessible by querying 555 of the 1,309 drugs and small molecules listed in CMap. We expect GSCMap and GSLHC to be widely useful in providing new insights into the biological effects of bioactive compounds, in drug repurposing, and in function-based classification of complex diseases.

Introduction

Microarray technology has been a powerful tool for profiling gene expression on a genome-wide scale and for studying associations between gene expression and the pathology of common diseases, including various cancers and Alzheimer's disease [1, 2]. A common practice, the Individual Gene Analysis (IGA) of microarrays, focuses on statistics-based identification of differentially expressed genes (DEGs) between two phenotypes. Standard and popular methods of this type include Student's t-test, z-test, SAM, Limma, and ANOVA [3–7]. While most biological processes, including metabolic processes, signal transduction, and regulation of transcription, typically involve the collaborative activation of large sets of genes, IGA methods treat individual genes as independent and neglect the expected correlations in gene expression.

An improvement on IGA is to explore whether, among IGA-selected DEGs, functionally related gene sets, such as those given by Gene Ontology [8] and KEGG [9], are significantly over-represented. An example of this approach is Fisher's exact test [10]. A drawback of this approach is that genes not among the DEGs, namely the vast majority of genes, are excluded from consideration. When the DEG set is large, the correspondingly long list of sets of functionally related genes makes it cumbersome to compare results between studies. Most importantly, this approach tends to be dominated by large gene-sets, such as those of immune response and metabolic pathways, and results in the neglect of possibly important functions represented by smaller gene-sets.

The Connectivity Map (CMap) was first developed as a generic solution for identifying the functional associations between diseases, genes, and drugs [11]. This approach provides a common analytical platform using genomic profiles as a shared language to connect diseases, gene functions, and drug activities. Many studies have employed disease-defined gene-sets to query CMap for the discovery of repurposed drug activities against common diseases, including diabetes [12], Alzheimer's disease [13, 14], and inflammatory bowel disease [18], and against solid tumours such as colon cancer [15], breast cancer [16], and lung adenocarcinoma [17]. CMap has also been used to study drug-induced differential expression of drug target mRNA [19] and, in combination with public repositories of gene expression data characterizing diseases, to construct a database that connects input genomic profiles with CMap drugs and diseases [20]. The standard application of CMap has been IGA based [17]. However, results of IGA-based application of CMap on human samples tend to be dominated by cell types (Supporting Information in [11]). One way to overcome this tendency is to generate a consensus genomic profile for each drug by merging CMap data from different cell lines [21]. In another approach, the transcriptional response data (i.e., genomic profiles) of a drug are decomposed into factors specific to individual cell lines and factors shared by two or more cell lines, and only shared factors are assumed to be relevant in characterizing drugs [22].

Gene-Set Analysis (GSA) was developed to address the shortcomings of IGA [23]. GSA uses sets of genes connected by biological functions, instead of individual genes, as units of analysis. In Gene Set Enrichment Analysis (GSEA) [24], the first GSA method, the relative importance of a functional gene-set is represented by an enrichment score (ES). GSEA was employed to generate a map that links genomic profiles of diseases to corresponding drug responses in CMap [11].

More recent variants of GSEA, including GSA [25], SAFE [26], Catmap [27], ErmineJ [28], and SAM-GS [29], employ variations in matrix ranking, definition of enrichment scores, or scheme for significance estimation. Other methods, including FunNet [30], PARADIGM [31], and COFECO [32], are network-based and more sophisticated, but their application may be limited by the availability of gene-gene interaction data. GSA methods have been employed to explore functional relationships in large-scale compendiums of clinical cancer cohort samples and to elucidate associations in drug-driven signatures for therapeutic purposes [18, 33]. Another unsupervised method, based on annotation-driven clustering, also achieved excellent results in recovering clinically relevant patient subgroups [34]. An integrated approach using chemical structures and biological functions to discover novel links between specific chemical structure properties and distinct biological responses in cells has also been reported [35].

In GSA, a genomic profile may be expressed as the set of ESs for a comprehensive list of gene-sets computed from that genomic profile; we shall call that set a functional genomic profile (hereafter, functional profile). Because a functional profile neither relies on an arbitrary threshold for gene selection, as does IGA, nor by definition is it dominated by a few functionalities involving large gene-sets, it is expected to be more accurate and sensitive in reflecting the global as well as detailed properties of a genome-wide gene expression than IGA.

Here, we built Gene-Set Connectivity Map (GSCMap), an enhanced version of CMap in which the genomic profiles of drugs in the CMap database are converted to functional profiles. Like CMap, GSCMap may be used for repurposed drug discovery, except that in GSCMap the functional signature of a phenotype is matched to functional profiles of drugs. The goal is to construct a database that one may expect to yield a more robust drug-phenotype association. We conducted tests to establish the internal consistency of GSCMap. We showed that grouping of drugs with similar biological activities is much more robust with GSCMap than with CMap in IGA mode. As an application of GSCMap we developed Gene-set-based Local Hierarchical Clustering (GSLHC), which utilizes an agglomerative hierarchical method for clustering a subset of functional gene-sets associated with "local" drug responses (Fig 1). The idea is that, given a very large matrix of gene-set enrichment scores, a clear pattern of coordinated expression in sets of functionalities is usually confined to a subgroup of samples, a pattern that may not be easily detected by global measurements [36, 37]. Through GSLHC we identified the therapeutic properties and putative targets of 18 compounds of previously unknown characteristics listed in CMap, placing each in a subclass of drugs grouped by the similarity of the functional response they induce. Eight of the 18 subclasses contain putative anti-cancer activities. Our results revealed novel links among gene-sets and between drugs and functions.

Our results showed GSCMap to be a robust and biologically more reliable version of CMap, and GSLHC, in combination with GSCMap, to be useful in discovering linkages among bioactive compounds characterized by their functional properties.

Materials and Methods

External database

The CMap database (build 02). Four types of human cancer cell lines (MCF7, PC3, HL60, SKMEL5) were treated with 1,309 distinct small molecules, including U.S. Food and Drug Administration (FDA) approved drugs and uncharacterized bioactive compounds (called perturbagens by the authors of CMap; here referred to as drugs for simplicity), for a total of 6,097 treatments [11]. Gene (total RNA) expression levels from the 6,097 “instances” (an instance is a cell line treated with a drug at a given dosage, together with its non-treated control) were recorded in two batches of microarrays: 671 HG-U133A (Affymetrix) chips (on 407 drugs) and 5,426 HT-HG-U133A chips (for a total of 6,097 chips on 1,309 drugs). Raw data were downloaded from the CMap website (http://www.broadinstitute.org/cmap/).

Molecular signature database. We downloaded the 4,884 annotated gene-sets (called tags) from the Molecular Signatures Database (MSigDB: http://www.broadinstitute.org/gsea/msigdb/index.jsp) [38]. We used four types of tags in MSigDB: C2: curated tags from known pathways, online databases, and knowledge of domain experts; C3: motif tags based on conserved cis-regulatory motifs from human, mouse, rat, and dog genomes; C4: computational tags determined by co-expression neighbourhoods centered on 380 cancer-related genes; C5: gene-ontology tags, each collecting genes that share the same GO annotation. C1 (positional tags on each human chromosome) was not included in this study, to save computation time given the large size of its tags. For convenience, gene symbols in each tag were combined and converted to HG-U133A Affymetrix IDs according to the updated annotation file from the Affymetrix website (http://www.affymetrix.com/estore/).

Chemical structure database. In order to cluster compounds based on 3D structure similarity, we queried the 1,309 drug names in the NCBI PubChem database (http://pubchem.ncbi.nlm.nih.gov/). Next, the 1,267 retrieved compounds (97% of the CMap dataset) were hierarchically clustered with the Chemical Structure Clustering tool on the PubChem website, based on 3D structure (fingerprint) similarity and using the single linkage algorithm [39]. Finally, we partitioned the tree into K clusters, with K ranging from 10 to 200, and evaluated the clustering performance using the F-score [40].

Pharmacological classification system. We retrieved class information for 798 compounds (61% of the CMap dataset) from the Anatomical Therapeutic Chemical (ATC) classification system on the World Health Organization (WHO) website (http://www.whocc.no/) for information on similar therapeutic classes. In this system, drugs are classified into groups at 5 different levels: the first level of the code indicates the anatomical main group; the second level indicates the therapeutic main group; the third level indicates the therapeutic/pharmacological subgroup; the fourth level indicates the chemical/therapeutic/pharmacological subgroup; the fifth level indicates the chemical substance. We used the first four levels of ATC to evaluate the performance of gene and tag clusters using the F-score. The fifth level of the code was not included in our analysis because at this level CMap was too fragmented (almost one drug to a class) for the code to be useful.

Molecular target database. We extracted information on known therapeutic protein targets, relevant diseases or cancers, and corresponding drugs (787 drugs; 60% of the CMap dataset) from the Therapeutic Target Database (TTD: http://bidd.nus.edu.sg/group/ttd/) [41]. The modes of action of the drugs on their specific targets (including activator, adduct, agonist, antagonist, antibody, binder, blocker, breaker, cofactor, inducer, inhibitor, intercalator, modulator, multitarget, opener, regulator, stimulator, and suppressor) were simply divided into two major groups: inhibition or activation. Because drugs and targets do not have a one-to-one correspondence and class sizes are small, we did not calculate F-scores. Instead, we computed drug-drug correlations by target group in IGA and GSA. A drug pair is assumed to have a correlation value of 1 if the two drugs have similar effects on the same protein target.

Local database

CMap mirror database. Following the original methods described in CMap, the raw image of CEL files for the 6,097 instances from the CMap database were converted to average log-ratios and confidence calls using the algorithms MAS 5.0 (Affymetrix) and linear-fit-on-Pcall [11]. For each instance the log-ratios for the 22,283 HG-U133A probesets were ranked and the ranked data for all instances were saved in matrix form locally.

Local CMap program. The web version of CMap cannot be queried in batch mode. Furthermore, in each individual query the number of genes, or the size of the tag, is limited to 1000. To overcome these limitations, we used C++ to build a local program encoding the same algorithms and datasets used by CMap. This program allows CMap-type queries to be made locally in single or batch mode, and permits the GSEA (Gene Set Enrichment Analysis [38]) parameters to be varied. The program was tested for reliability and speed before being applied to the current study (see Results).

Matrix CMap and the enrichment-score matrix GSCMap and their sub-matrices

CMap is a 22,283x6,097 probe-set versus instance matrix; the elements of the matrix are log-ratios of expression intensities. From this a number of extended maps/matrices were constructed:

CMap1 – The 22,283x671 sub-matrix of CMap involving the 671 instances in CMap v1.0.

tCMap1 – A 300x671 sub-matrix of CMap1 involving the 300 highest variance probe-sets.

CMd – A 22,283x1,309 probe-set versus drug matrix reduced from CMap by averaging over same-drug instances.

IGCMd – A 4,884x1,309 sub-matrix of CMd involving the 4,884 highest variance probe-sets.

GSCMap – A 4,884x6,097 tag versus instance matrix; elements of the matrix are enrichment scores (ESs). For each of the 4,884 tags from MSigDB (collections C2-C5), we queried the 6,097 instances in CMap (version 2.0) to yield a 6,097-component vector (called Vd) of Kolmogorov-Smirnov statistic [11, 42, 43] based ESs, as defined in [38]. GSCMap is the set of 4,884 Vd’s.

GSCMap1 – The 4,884x671 sub-matrix of GSCMap involving the 671 instances in CMap v1.0.

tGSCMap1 – A 300x671 sub-matrix of GSCMap1 involving the 300 tags with the largest ES variance.

GSCMd – A 4,884x1,309 tag versus drug matrix. In CMap each drug was applied in a variable number of treatments (averaging 6,097/1,309 = 4.66). For each tag and each drug the matrix element is the Kolmogorov-Smirnov statistic score (as in GSEA [38]) obtained by ranking the vector Vd corresponding to the tag and querying it using the multiple treatments for that drug.
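
As an illustration of how a single entry of such an enrichment-score matrix could be computed, the following Python sketch implements an unweighted Kolmogorov-Smirnov running-sum score for one tag against one ranked instance. It is only a minimal sketch and not the C++ program used in this work: the function names ks_enrichment_score and functional_profile are hypothetical, and the unweighted step sizes are an assumption about the statistic defined in [38].

```python
def ks_enrichment_score(ranked_genes, gene_set):
    """Unweighted KS running-sum enrichment score (assumed form, see lead-in).

    ranked_genes: list of unique gene/probe-set IDs, best rank first.
    gene_set: iterable of IDs forming the tag.
    Returns the signed maximum deviation of the running sum from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_total = len(ranked_genes)
    n_hit = len(members)
    if n_hit == 0 or n_hit == n_total:
        return 0.0  # score is undefined for empty or all-inclusive tags
    hit_step = 1.0 / n_hit                 # step up when a tag gene is met
    miss_step = 1.0 / (n_total - n_hit)    # step down otherwise
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit_step if gene in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


def functional_profile(ranked_genes, tags):
    """Map each tag name to its ES for one ranked instance.

    tags: dict of tag name -> iterable of gene IDs (hypothetical layout).
    """
    return {name: ks_enrichment_score(ranked_genes, genes)
            for name, genes in tags.items()}
```

Applying functional_profile to every instance, one column at a time, would give a tag versus instance matrix with the same layout as GSCMap.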

Significance by permutation and normalized enrichment score (NES)

We tested the significance of the ES of a tag-drug pair by random permutation. Consider a tag and a drug, where the drug had t treatments in CMap and a (tag versus drug) enrichment score ES0. We generated a distribution of randomized ESs by running r trials, in each trial recalculating the Kolmogorov-Smirnov ES after replacing the t treatments for the drug by t treatments randomly selected from among the 6,097 treatments. A randomization (two-sided) p-value for the ES was computed from ES0 and the distribution. The normalized enrichment score (NES) was taken to be ES0 divided by the mean of the distribution [38]. In this work we set r = 10,000.
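
The permutation scheme described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the program used here: the names ks_of_ranks and permutation_test are hypothetical, the signed Kolmogorov-Smirnov statistic follows the form used for CMap queries [11], the p-value includes a +1 correction that the text does not specify, and the NES divides ES0 by the mean magnitude of the same-sign randomized scores, which is the GSEA convention and one possible reading of "the mean of the distribution".

```python
import random


def ks_of_ranks(ranks, pool_size):
    """Signed KS statistic for the 1-based ranks of t treatments
    within a ranked list of pool_size treatments."""
    ordered = sorted(ranks)
    t = len(ordered)
    a = max((j + 1) / t - v / pool_size for j, v in enumerate(ordered))
    b = max(v / pool_size - j / t for j, v in enumerate(ordered))
    return a if a > b else -b


def permutation_test(observed_ranks, pool_size, n_trials=10000, seed=0):
    """Return (ES0, two-sided randomization p-value, NES)."""
    rng = random.Random(seed)
    t = len(observed_ranks)
    es0 = ks_of_ranks(observed_ranks, pool_size)
    # Each trial replaces the t treatments by t randomly drawn treatments.
    null = [ks_of_ranks(rng.sample(range(1, pool_size + 1), t), pool_size)
            for _ in range(n_trials)]
    extreme = sum(1 for es in null if abs(es) >= abs(es0))
    p_value = (extreme + 1) / (n_trials + 1)   # +1 correction: assumption
    same_sign = [abs(es) for es in null if es * es0 > 0]
    nes = es0 / (sum(same_sign) / len(same_sign)) if same_sign else float("nan")
    return es0, p_value, nes
```

With pool_size = 6097 and n_trials = 10000 the call mirrors the setting r = 10,000 used in this work; smaller values are convenient for quick checks.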

Gene-set based Local Hierarchical Clustering (GSLHC)

GSLHC is an application of GSCMap for discovering links among drugs through tags strongly acted on by the drugs. Its implementation involves the following steps: (i) Select a query drug set, which may be a single drug, a group of drugs with a known shared property, or a drug of unknown property. (ii) For the query drug set, cull from the functional profiles of drugs in GSCMap a subset of tags, each of which is significantly enriched against every drug in the query drug set, where significant enrichment means a randomization p-value below an upper bound (we used p < 0.005). In the randomization test we generate a distribution of ESs by computing the ES for a tag-drug pair many times, each time replacing the genes in the tag by randomly selected genes from the entire gene pool [23]. (iii) Perform a two-way hierarchical clustering of the culled tags with the entire set of 1,309 drugs, and cut out from the resulting heatmap the clade of drugs that includes the query drug set with correlation above a threshold value (we used 0.9).
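
The tag selection and neighbour extraction in these steps can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the GSLHC implementation: step (iii) is replaced by a direct Pearson-correlation filter against a single query drug instead of cutting a clade from a two-way hierarchical clustering, and the names select_tags and local_neighbors as well as the nested-dictionary data layout are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_tags(p_values, query_drugs, alpha=0.005):
    """Step (ii): keep tags significant (p < alpha) for every query drug.

    p_values: dict tag -> dict drug -> randomization p-value.
    """
    return [tag for tag, row in p_values.items()
            if all(row[drug] < alpha for drug in query_drugs)]


def local_neighbors(scores, tags, query_drug, min_corr=0.9):
    """Simplified stand-in for step (iii): drugs whose ES profile over the
    selected tags correlates with the query drug above min_corr.

    scores: dict drug -> dict tag -> enrichment score.
    """
    reference = [scores[query_drug][t] for t in tags]
    hits = []
    for drug, profile in scores.items():
        if drug == query_drug:
            continue
        r = pearson(reference, [profile[t] for t in tags])
        if r > min_corr:
            hits.append((drug, r))
    return sorted(hits, key=lambda item: -item[1])
```

The thresholds alpha = 0.005 and min_corr = 0.9 are the values quoted in the text; everything else in the sketch is illustrative.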

Cluster evaluation

We used the F-score, the harmonic mean of precision and recall [40], to evaluate a cluster as a classifier of a known classification. Let TP, FP, and FN be the numbers of true positives, false positives, and false negatives, respectively. The precision P and recall R of the cluster are given by P = TP/(FP + TP) and R = TP/(TP + FN), respectively. Suppose several nodes in a cluster are meant to represent a classification; then, for class i, the F-score Fi for that class is the maximum nodal value of 2PR/(P + R), and the F-score for the classification is the weighted average of Fi summed over the nodes. The F-score ranges from 0 to 1; the higher the F-score, the better the classification by the cluster.
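
A minimal Python sketch of this evaluation is given below. The function names f_score and classification_f_score are hypothetical, and weighting each class by its size is an assumption; the text specifies only that a weighted average is used.

```python
def f_score(node_members, class_members):
    """F = 2PR/(P + R) for one dendrogram node scored against one class."""
    node, cls = set(node_members), set(class_members)
    tp = len(node & cls)   # class members captured by the node
    fp = len(node - cls)   # node members outside the class
    fn = len(cls - node)   # class members missed by the node
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def classification_f_score(nodes, classes):
    """Weighted average over classes of the best nodal F-score.

    nodes: list of member sets, one per node of the dendrogram.
    classes: dict class name -> set of members.
    Weights are class sizes (assumption, see lead-in).
    """
    total = sum(len(members) for members in classes.values())
    return sum(len(members) / total * max(f_score(node, members) for node in nodes)
               for members in classes.values())
```

A dendrogram containing one node that coincides exactly with each class scores 1; nodes that mix classes lower the precision term, and nodes that split a class lower the recall term.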

Whenever possible, computations were conducted in the R environment (R version 2.15.1). Conversion of CMap to GSCMap was lengthy and took many hours of computation time. However, a typical application of GSLHC for constructing a high-correlation drug cluster requires less than one minute on a standard student grade laptop.

Ethics information

None.

Results

The local program reproduced results from CMap server with better efficiency

We used a tag called BRUINS_UVC_RESPONSE_LATE from MSigDB, which contains 1,137 genes differentially expressed only 12 h after UV-C irradiation of MEF cells, to compare the local program with the remote CMap server on the 100 drugs with the smallest p-values. The two programs yielded practically identical ESs (S1A Fig, dashed lines) and almost identical permutation p-values (S1A Fig, solid lines). Identical p-values were not expected, since the permutation test is stochastic; proportionally large differences in p-value occurred only when p < 10^-3. We used the 772 tags in the C2 collection of MSigDB (the number of genes in the tags ranged from 50 to 1000) to compare the speed of the local program and the CMap server and found that the computation times were comparable, but the local program was slower when the number of genes in the tag exceeded 600 (S1B Fig). The slower speed of the local program was more than compensated for by the possibility of querying in batch mode.

DEGs have low reproducibility in CMap genomic profiles

In CMap each of the 1,309 perturbagens has an average of 4.7 genomic profiles (from different treatments) resulting from the total of 6,097 treatments. We computed the fractional overlaps of top-1,000 DEGs between pairs of genomic profiles. The average reproducibility (common DEGs/1000) between different-perturbagen pairs has a sharp peak at 0.05, with few cases exceeding 0.1. That of the same-perturbagen pairs also peaks strongly at 0.06, but has a long weak tail (S2 Fig), with 10,771 of cases having a reproducibility greater than 0.2.
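
The reproducibility measure used here can be sketched in a few lines of Python. The function names top_degs and reproducibility are hypothetical, and ranking DEGs by absolute log-ratio is an assumption, since the text does not state how the top-1,000 DEGs were ordered.

```python
def top_degs(log_ratios, k):
    """Return the k probe-sets with the largest absolute log-ratio.

    log_ratios: dict probe-set ID -> log-ratio for one genomic profile.
    """
    ranked = sorted(log_ratios, key=lambda probe: abs(log_ratios[probe]),
                    reverse=True)
    return set(ranked[:k])


def reproducibility(profile_a, profile_b, k=1000):
    """Fraction of the top-k DEGs shared by two genomic profiles."""
    shared = top_degs(profile_a, k) & top_degs(profile_b, k)
    return len(shared) / k
```

Evaluating reproducibility over all same-perturbagen pairs and all different-perturbagen pairs would give the two distributions compared in this section.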

The CMap and GSCMap matrices and their sub-matrices were constructed

Here, by CMap we mean the 22,283 (probes) x 6,097 (instances) matrix of log-ratios from CMap database. Using CMap we constructed the 4,884 (tags) x 6,097 GSCMap matrix of ESs using the 4,884 tags in MSigDB. Then we constructed sub-matrices of CMap, CMap1 (22,283x671), tCMap1 (300x671), CMd (22,283x1,309), and IGCMd (4,884x1,309), and sub-matrices for GSCMap, GSCMap1 (4,884x671), tGSCMap1 (300x671) and GSCMd (4,884x1,309), where 1,309 refers to the number of drugs/small molecules in CMap, 671 refers to the number of instances in CMap v1.0, 4,884 refers to the number of tags in MSigDB or the 4,884 highest variance probe-sets (for IGCMd), and 300 refers to 300 highest variance probe-sets (CMap) or ESs (GSCMap) (detail in Methods).
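
The two reductions used to derive these sub-matrices, averaging over same-drug instances and keeping the highest-variance rows, can be sketched in Python as follows. The names average_by_drug and top_variance_rows and the dictionary-of-rows layout are hypothetical, and the use of the population variance is an assumption.

```python
from statistics import mean, pvariance


def average_by_drug(matrix, instance_drug):
    """Collapse instance columns into drug columns by averaging.

    matrix: dict row ID -> list of values, one per instance.
    instance_drug: list giving the drug name of each instance column.
    Returns (sorted drug names, dict row ID -> list of per-drug means).
    """
    drugs = sorted(set(instance_drug))
    columns = {drug: [i for i, name in enumerate(instance_drug) if name == drug]
               for drug in drugs}
    reduced = {row: [mean(values[i] for i in columns[drug]) for drug in drugs]
               for row, values in matrix.items()}
    return drugs, reduced


def top_variance_rows(matrix, k):
    """Keep the k rows with the highest variance across columns."""
    ranked = sorted(matrix, key=lambda row: pvariance(matrix[row]), reverse=True)
    return {row: matrix[row] for row in ranked[:k]}
```

In this layout, a matrix analogous to IGCMd would correspond to top_variance_rows applied with k = 4884 to the output of average_by_drug.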

Cell-type dependence of CMap data was strong in IGA but weak in GSA

As a first comparison between IGA and GSA, we separately hierarchically clustered the two (300x671) matrices tCMap1 (S1 Table, http://figshare.com/download/file/2258031), representing IGA, and tGSCMap1 (S2 Table, http://figshare.com/download/file/2258032), representing GSA, using a Pearson distance metric and average linkage, and examined the properties of the two resulting 671-branch dendrograms as cell-type classifiers. Under visual inspection the tCMap1 dendrogram was overwhelmingly dominated by cell type (Fig 2A) whereas the tGSCMap1 dendrogram was not (Fig 2B). Quantitatively, F-scores (Materials and Methods) for the tCMap1 dendrogram indicated that it provided a close to perfect classification of the four cell types (Table 1, permutation p-value < 0.01). In contrast, the tGSCMap1 dendrogram was a poor (but fair for HL60) classifier of cell types. A similar result was found in a Principal Component Analysis on the full CMap dataset (S3 Fig). These results imply that GSA results have a significantly better chance than IGA results of not being masked by cell-type dependence.
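
For readers unfamiliar with this clustering setup, the following Python sketch shows agglomerative clustering with a Pearson distance (one minus the Pearson correlation) and average linkage. It is a naive, standard-library illustration suitable only for small inputs and is not the R code used for the analysis; the function names pearson_distance and average_linkage are hypothetical.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation; 1.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / sqrt(sxx * syy)


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict instance name -> list of values (e.g. log-ratios or ESs).
    Returns the list of merges as (members_a, members_b, distance) tuples.
    """
    names = list(profiles)
    dist = {frozenset((a, b)): pearson_distance(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

    def cluster_distance(c1, c2):
        # Average linkage: mean pairwise distance between members.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        d, i, j = min((cluster_distance(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

The member sets recorded at each merge are the dendrogram nodes against which a cell-type classification can be scored with the F-score.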

Fig 2. Hierarchical clustering of CMap instances is less dominated by cell type when clustering is based on gene-set/tag enrichment scores.

Dendrograms show hierarchical clustering of CMap instances based on gene expression (A) and tag enrichment score (B). Colors in the color bar below each dendrogram represent the cell lines SKMEL5 (red), PC3 (green), MCF7 (blue), and HL60 (purple). For each instance, the 300 genes or tags with the top expression log-ratios or ESs were selected for clustering based on a Pearson distance metric and average linkage.

https://doi.org/10.1371/journal.pone.0139889.g002

Table 1. Cell-type effects are eliminated in hierarchical clustering based on gene-set enrichment.

https://doi.org/10.1371/journal.pone.0139889.t001

Testing drug responses in IGA and GSA

GSA had clearer and more varied drug response than IGA.

We separately performed two-way hierarchical clustering on the two (4,884x1,309) matrices IGCMd (S3 Table, http://figshare.com/download/file/2258034) for IGA and GSCMd (S4 Table, http://figshare.com/download/file/2258033) for GSA, using a Pearson distance metric and average linkage (Fig 3). All computations were carried out over two days on a personal computer with an Intel(R) dual core Quad CPU, 2.40 GHz processor and 8 GB of RAM. While the vast majority of tags responded to the drugs by being either positively or negatively enriched (Fig 3A), the vast majority of high-variance genes were neither up-regulated nor down-regulated by the drugs (Fig 3B).

Fig 3. Two-way hierarchical clustering heatmap of 1309 CMap perturbagens shows higher contrast when clustering is based on gene-set/tag enrichment scores.

Two-way hierarchical clustering heatmaps were generated based on a Pearson distance metric and average linkage using, for each CMap perturbagen: (A) normalized enrichment scores (NESs) of 4,884 tags from MSigDB, and (B) log-ratios for expression levels of the top-4884 high-variance genes. Color code: red, positive NES or log-ratio; black, NES or log-ratio ~0; green, negative NES or log-ratio.

https://doi.org/10.1371/journal.pone.0139889.g003

GSA gave a better drug classifier than IGA.

In CMap a drug is typically represented by several instances. For example, the two drugs trichostatin and LY-294002 occur in 15 and 9 instances, respectively, each instance represented by a vector of 4,884 ESs (in GSCMap) or 22,283 intensity log-ratios (in CMap). We hierarchically clustered the combined 24 instances separately for each representation. Viewed as classifiers of the two drugs, the GSA cluster had an F-score of 0.98, and the IGA cluster, 0.72 (Fig 4A). The superiority of GSA over IGA in its ability to tell one drug from another proved to be a general feature. We repeated the above comparison for all the 20,736 drug-pairs with multiple instances in CMap1 and in GSCMap1 and found that the (drug classification) F-score for GSA was about 0.036 higher than for IGA, against an average of 0.75 (Fig 4B, two-sample Kolmogorov-Smirnov test: p-value < 2.2e-16).

Fig 4. Separation of two drugs among all instances involving the drug pair is done better using gene-set/tag enrichment scores.

Quality of separation is determined by F-scores for hierarchical clusters, constructed using tag enrichment scores and log-ratios for gene expressions, respectively, of all instances involving the drug pair. (A) Two clusters for the drug-pair valproic acid and trichostatin A: left, cluster based on gene expression; right, cluster based on tags. The two-color bar indicates drug classification. (B) Ranking by F-scores of ~20,000 drug-pairs from the CMap development batch on the HG-U133A platform involving 407 drugs and 674 chips; black, gene expression; red, tag.

https://doi.org/10.1371/journal.pone.0139889.g004

GSA and IGA responded similarly to chemical properties of CMap drugs.

The F-scores of clusters, constructed through GSA (using ESs from GSCMd) and IGA (using gene expression log-ratios from IGCMd), of drugs classified according to their anatomical, chemical, therapeutic, pharmacological (Anatomical Therapeutic Chemical (ATC) classification system, World Health Organization, http://www.whocc.no/) and structural (PubChem Structure Database [44]) properties (Material and Methods) were indistinguishable (S4 Fig).

Genomic signatures of same-target drug pairs had higher correlation in GSA than in IGA.

We expect the genomic signatures of drugs sharing a target to be more similar than those of drugs that do not. Information on drug targets was obtained from the Therapeutic Target Database (TTD) [41] (Material and Methods). The same-target drug-pairs correlated much better under GSA (ESs from GSCMd) than under IGA (gene expression log-ratios from IGCMd) (Fig 5). An outstanding case was the triplet vorinostat, valproic acid, and trichostatin A, all of which target the histone deacetylase (HDAC) protein. The three pair-wise correlations for the triplet ranged from 0.8 to 1.0 in GSA and from 0.05 to 0.15 in IGA. Averaged over all 5,034 pairs involving 639 drugs, the mean GSA correlation was 0.35 (S.D. = 0.27) and the mean IGA correlation was 0.18 (S.D. = 0.15) (two-sample t-test, p-value < 2.2e-16).

Fig 5. Same-target drug-pairs correlate better when evaluated by gene-set/tag enrichment scores.

Figure plots correlation of same-target drug pair evaluated by tag enrichment score (ES) versus that evaluated by gene expression. Drug targets were those given by TTD database. In the tag approach, each drug, or CMap perturbagen, was represented by the ESs of 4884 MSigDB tags. In the gene expression case, each drug was represented by the set of top-4884 high variance genes. The three red dots are from the three pairs formed by the three drugs, vorinostat, valproic acid, and trichostatin A, all targeting the histone deacetylase (HDAC) protein.

https://doi.org/10.1371/journal.pone.0139889.g005

Validation of GSLHC and novel HDAC inhibitors

There are 106 active compounds in the CMap database that are poorly studied, and GSLHC was developed as an application of GSCMap to discover, for these compounds, drug partners with known therapeutic properties. We tested GSLHC by giving it a set of tags common to and significantly enriched in the functional profiles of three histone deacetylase (HDAC) inhibitors (vorinostat, also known as suberoylanilide hydroxamic acid or SAHA; valproic acid; and trichostatin A) to see whether it could recover them from GSCMap. The three HDAC inhibitors were chosen because they have been thoroughly studied [45–47]. A set of 597 tags significantly enriched with permutation p < 0.005 was selected for the test (Material and Methods). The selected tags had functions related to HDAC inhibitor activities. For example, among the down-regulated functions were histone acetylation, histone and chromatin modification, and maintenance of chromatin structure (Fig 6C). The test was successful; the triplet was among the six recovered drugs (Fig 6A and 6B). The three extras are not known as HDAC inhibitors, but two of the three, scriptaid and HC toxin, have been reported to have HDAC inhibition activities [48, 49].

Fig 6. GSLHC finds novel HDAC inhibitors.

The three known HDAC inhibitors valproic acid, trichostatin A, and vorinostat are all significantly enriched by 597 tags with permutation p < 0.005; these 597 tags were used in a new heatmap in the GSLHC protocol. (A) A sub-heatmap including the three HDAC inhibitors and all neighbors with correlation > 0.9. (B) Detail of the drug cluster associated with the sub-heatmap. The two drugs rifabutin and scriptaid in the cluster, not previously known as HDAC inhibitors, have literature support as having inhibition functions on HDAC proteins. (C) Detail of the tag cluster associated with the sub-heatmap shows several functions known to be related to HDAC inhibitor activities.

https://doi.org/10.1371/journal.pone.0139889.g006

Sample applications of GSLHC to characterization of active compounds

A novel cyclin-dependent kinase inhibitor (CDKi).

The compound 0175029–0000 is among the molecules in CMap known to be active in certain biological roles [11] but poorly studied in the literature. Its ES profile had 1,080 significantly enriched tags with permutation p< 0.005. Our GSLHC search showed it to be closely associated with three CDKi’s with correlation coefficient (CE) > 0.97 and five DNA topoisomerase inhibitors with CE > 0.92 (Fig 7A and 7B). Biological functions negatively regulated by these drugs included those related to the cell cycle and cell-cycle checkpoints (Fig 7C).
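The idea of finding closely associated drug partners can be sketched as follows. This is a simplification: it replaces the hierarchical clustering step with a plain Pearson-correlation filter over the selected tags, and the function names, drug names, and ES values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def local_partners(query, profiles, threshold=0.9):
    """Return (drug, correlation) pairs whose profile correlates with the
    query profile above `threshold`, sorted from highest to lowest."""
    q = profiles[query]
    hits = []
    for name, prof in profiles.items():
        if name == query:
            continue
        r = pearson(q, prof)
        if r > threshold:
            hits.append((name, r))
    return sorted(hits, key=lambda t: t[1], reverse=True)


# Toy ES profiles restricted to the query's significantly enriched tags.
toy_profiles = {
    "query_compound": [2.1, -1.8, 1.5, -2.2, 0.3],
    "drug_X": [2.0, -1.7, 1.6, -2.0, 0.2],   # similar profile
    "drug_Y": [-1.9, 1.6, -1.4, 2.1, 0.1],   # opposite profile
    "drug_Z": [0.5, 0.4, -0.3, 0.2, 1.9],    # unrelated profile
}

partners = local_partners("query_compound", toy_profiles)
print(partners)
```

Only drug_X survives the 0.9 cut-off here; the anti-correlated and unrelated profiles are dropped, which mirrors the role of the correlation > 0.9 sub-heatmap.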

Fig 7. GSLHC identifies 0175029–0000 as a novel cyclin-dependent kinase inhibitor (CDKi).

(A) A correlation > 0.9 sub-heatmap including the compound 0175029–0000 of unknown function from a GSLHC-generated heatmap based on the ES of 1080 tags significantly enriched in 0175029–0000 with permutation p< 0.005. (B) Detail of the drug cluster associated with the sub-heatmap. According to the TTD database, GW-8510, alsterpaullone, and H-7 (red asterisk) are CDK inhibitors, and doxorubicin, camptothecin, azacitidine, mitoxantrone, and ellipticine (blue asterisk) are DNA topoisomerase inhibitors. All have anti-tumor activities. (C) Detail of the tag cluster associated with the sub-heatmap shows functions known to be related to the inhibition of the cell cycle.

https://doi.org/10.1371/journal.pone.0139889.g007

A novel antibiotic, anesthetic, and anti-inflammatory agent.

The ES profile of compound CP-863187 had 36 significantly enriched tags with permutation p< 0.005. Our GSLHC search showed it to be closely associated with an antibiotic (piperacillin; CE > 0.98), an anesthetic (benzocaine), and an anti-inflammatory agent (betulinic acid; CE > 0.97), as well as with another anti-inflammatory agent (CE > 0.96) and five other antibiotics (CE > 0.90) (Fig 8A and 8B). Biological functions affected by drugs associated with CP-863187 included negative regulation of the integrin signalling pathway and hydrolases (Fig 8C).

Fig 8. GSLHC identifies CP-863187 as a potential antibiotic.

(A) A correlation > 0.9 sub-heatmap including the compound CP-863187 of unknown function from a GSLHC-generated heatmap based on the ES of 36 tags significantly enriched in CP-863187 with permutation p< 0.005. (B) Detail of the drug cluster associated with the sub-heatmap. According to the TTD database, piperacillin, dapsone, tocainide, ampicillin, sulfadimethoxine, and metronidazole (red asterisk) are antibiotics, betulinic acid and isoflupredone (blue asterisk) are anti-inflammatory agents, and benzocaine (green asterisk) is an anesthetic. (C) Detail of the tag cluster associated with the sub-heatmap shows functions known to block the formation of the bacterial cell wall by inhibition of the integrin signaling pathway.

https://doi.org/10.1371/journal.pone.0139889.g008

Summary of drug discovery by GSLHC (Table 2)

Eighteen previously uncharacterized compounds in CMap, including 0175029–0000 and CP-863187, were discovered by GSLHC to have closely associated drug partners (in CMap), putative targets, and therapeutic indications (Table 2; details in S5–S20 Figs). Among the discoveries, eight compounds (tyrphostin AG-825, 5248896, 0175029–0000, H-7, U0125, STOCK1N-35215, 0297417-0002B, and F0447-0125) were identified as having potential anti-tumor activities. Their molecular mechanisms differ, depending on their closest putative drug partners. Tyrphostin AG-825, U0125, and CP-944629, with closest partners camptothecin, irinotecan, and betulinic acid, respectively, were predicted to block DNA transcription by inhibiting DNA topoisomerase activities. The compounds 0175029–0000 and H-7, with closest partner GW-8510, were predicted to be cyclin-dependent kinase inhibitors. Compounds predicted to have therapeutic activities on non-cancer diseases include 5186324 (closest partner neostigmine bromide and therapeutic activity on myasthenia gravis) and Prestwick-692 (closest partner isoflupredone and therapeutic activity on rheumatoid arthritis).

Table 2. Putative molecular target and pharmacology obtained from application of GSLHC on CMap perturbagens without known indication.

https://doi.org/10.1371/journal.pone.0139889.t002

GSLHC website (http://cloudr.ncu.edu.tw/gslhc/)

This website contains 1,857 local hierarchical clusters accessible by querying 555 of the 1,309 drugs and small molecules listed in CMap v2.0. The other CMap drugs do not yield local hierarchical clusters that meet the criteria of permutation p-values not greater than 0.01 and Pearson correlation not less than 0.90. The full datasets of NES values (http://figshare.com/download/file/2288071) and permutation p-values (http://figshare.com/download/file/2288072) used to generate the hierarchical cluster results shown on the GSLHC website can be downloaded, so that the results can be replicated on a local computer.

Discussion and Summary

We used CMap as a vehicle to demonstrate that GSA is a better way than IGA of utilizing genome-wide gene expression. Because this would involve repeated and massive application of CMap, we constructed a local extended version of CMap. The local CMap was stored, and computations using it were conducted, on a personal computer equipped with an Intel(R) dual core Quad CPU 2.40 GHz processor and 8 GB of RAM. Advantages of the local program over the remote CMap include: (i) no reliance on the Internet, which saves network connection time; (ii) the length of the query gene list is not limited to 1000; (iii) capability for batch mode operation. Extensive tests conducted on the local version confirmed its accuracy, and verified that in single mode its running speed is comparable to that of the remote CMap (S1 Fig).

We implemented a GSA-based application of CMap by constructing GSCMap, an analog of CMap where gene-based genomic profiles of instances in CMap are replaced by tag-based functional profiles.

Hierarchical clustering based on gene expression has been an important tool in genomic technology. We showed that IGA-based hierarchical clustering of the CMap (the matrix) was dominated by cell types, a dominance absent in the GSA-based GSCMap (Fig 2). This notion was strengthened by our quantitative measure, using F-scores, of the clusters as classifiers of cell types. We confirmed a previous report that CMap was an excellent classifier of cell types, a result that imposes strong constraints on its being a good classifier of drug effects. In contrast, our F-score analysis showed GSCMap to be a poor classifier of cell types (Table 2). It is biologically reasonable that drug sensitivity varies with different cell types [50]. In a method that studies drug effects using a database such as CMap, the question is whether the method can pick out drug-induced signals over the background of cell type-specific signatures. For instance, the clustering of instances of the two drugs trichostatin and LY-294002 shows that under IGA, signatures associated with the cell line HL60 dominate over drug effects, whereas under GSA, drug-specific signals dominate over cell type-specific signatures (Fig 4A). This suggests that GSA provides a better means than IGA for focusing on drug effects, or “shared factors” [22], that are common to different cell types.
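How a cluster can be scored as a classifier of a cell type is sketched below. The sketch assumes the standard F-score (harmonic mean of precision and recall); the exact definition used for the analysis is given elsewhere in the paper and may differ, and the instance identifiers are hypothetical.

```python
def f_score(cluster_members, class_members):
    """F-score of a cluster viewed as a classifier of one class
    (for example, all instances from one cell line).

    Assumes the standard definition F = 2PR / (P + R).
    """
    cluster, cls = set(cluster_members), set(class_members)
    true_pos = len(cluster & cls)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(cluster)   # fraction of the cluster in the class
    recall = true_pos / len(cls)          # fraction of the class in the cluster
    return 2 * precision * recall / (precision + recall)


# Hypothetical instance identifiers.
hl60_instances = {"i1", "i2", "i3", "i4"}
cluster = {"i1", "i2", "i3", "i9"}

score = f_score(cluster, hl60_instances)
print(score)
```

A score near 1 means the cluster coincides with the cell line, which is the behaviour reported for CMap; a low score means cluster membership says little about cell type.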

Having demonstrated that GSCMap has far weaker cell-type dependence than CMap, we conducted three tests to show that the former has more discriminating responses to drug properties than the latter. The first test (using the 4,884x1,309 matrices GSCMd and IGCMd) showed that tag response to drugs in GSCMap exhibited a much wider range than gene expression response to drugs in CMap (Fig 3). A second test showed that GSCMap clustered same-drug instances consistently better than CMap (Fig 4). A third test showed that the genomic profiles of a pair of drugs having the same target had higher correlation in GSCMap than in CMap (Fig 5). Our assumption for the third test is that same-target drugs are designed to have similar indication. Based on this assumption, the result of the test (the GSEA-based and IGA-based correlations have a two-sample t-test p-value of < 2.2e-16) suggests that GSCMap connects drugs with similar indication much better. The case of the three HDAC inhibitors–vorinostat, valproic acid, and trichostatin A–brings home this point (red dots in Fig 5; admittedly this represents an extreme case). In GSEA the three pairwise correlations among the three drugs have a mean value of 0.90 (SD = 0.082), and in IGA the mean correlation is 0.077 (SD = 0.055). The t-test p-value for the two sets is 0.00301. Thus, in the IGA mode, if a query (a genomic profile or a gene set) matches (i.e., has a high IGA enrichment score) with one of the three HDAC inhibitors, it will not match either of the other two. In contrast, in the GSEA mode, a query will either match all three HDAC inhibitors or not match any.
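The mean and standard deviation of the pairwise correlations among a small group of drugs can be computed as in the following sketch. The profiles are toy values and the drug labels are placeholders, so the numbers printed are not those reported above.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlation_summary(profiles):
    """Return (mean, sample SD, list) of all pairwise correlations."""
    rs = [
        pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    ]
    return mean(rs), stdev(rs), rs


# Toy profiles for three drugs (hypothetical values).
toy = {
    "drug_1": [1.0, 2.0, 3.0, 4.0],
    "drug_2": [2.1, 3.9, 6.2, 7.8],
    "drug_3": [0.9, 2.2, 2.8, 4.1],
}

m, sd, rs = pairwise_correlation_summary(toy)
print(round(m, 3), round(sd, 3))
```

Running the same summary once on tag-based profiles and once on gene-based profiles gives the two sets of three correlations that a two-sample t-test then compares.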

Similar correlation-based analysis applied to drug pairs having structural similarities at the chemical level or shared therapeutic indications at the clinical level did not exhibit any difference between GSCMap and CMap (S4 Fig). This is not surprising, since global genomic signatures do not generally bear any direct relation to the chemical structures of the drug and the target. Chemical compatibility between drug and target is a crucial consideration in drug design, especially when the purpose is to regulate a specific target that has a central role in a biological pathway. CMap (hence GSCMap) was not constructed to address the question of chemical compatibility. CMap focuses on the effects of a drug as manifested in the changes it causes in the genomic profile, but makes no assumption about how those changes were brought about. This implies that in Table 2, the test drug may not share the target of the partner drug.

GSLHC was designed to discover, through GSCMap, functional links among drugs in CMap. The principle of the method, local hierarchical clustering, is generally applicable to any large list that may or may not represent drug effects. We validated GSLHC by using three known HDAC inhibitors as bait and saw that they were recovered as part of a tight cluster returned by GSLHC (red dots in Fig 5). The cluster also included three drugs, scriptaid, HC toxin, and rifabutin, not previously known as HDAC inhibitors. GSLHC showed all three as having significant correlation with biological functions relating to switching histone modification and destroying chromatin maintenance (Fig 6); scriptaid and HC toxin have been reported to inhibit HDAC proteins [48, 49], and rifabutin is primarily used in the treatment of tuberculosis. We regard all three as potential novel HDAC inhibitors.

Of the 106 uncharacterized compounds in the CMap dataset, GSLHC found drug partners of known indications for 18 (Table 2), 8 of which (tyrphostin AG-825, 0175029–0000, H-7, U0125, STOCK1N-35215, 0297417-0002B, F0447-0125, and CP-944629) were inferred to have anti-tumor activities. In each case we found significant correlations between the compound with the newly inferred indication and biological functions related to that indication (Figs 7 and 8, and S5–S20 Figs). As mentioned, these predictions do not make any statement about drug targets.

The compound 0175029–0000 was shown to be closely associated with three CDKi’s–GW-8510 [51–58], alsterpaullone [51–58], H-7 [51–58]–and five DNA topoisomerase inhibitors–doxorubicin [51–58], camptothecin [51–58], azacitidine [51–58], mitoxantrone [51–58], and ellipticine [51–58] (Fig 7), and was inferred to be a putative CDKi/DNA topoisomerase inhibitor. All of these drugs have been reported to have anti-tumour activities [51–58], and they significantly expressed biological functions that negatively regulate the cell cycle and cell-cycle checkpoints (Fig 7C).

The compound CP-863187 was shown to be closely associated with an antibiotic (piperacillin), an anesthetic (benzocaine), and an anti-inflammatory agent (betulinic acid) (Fig 8), and to significantly express negative regulation of integrin signaling and hydrolases (Fig 8C). There are studies suggesting that antibiotics may have anti-inflammatory and anesthetic properties [59, 60]. The source of the shared properties may be that, as signal transducers, integrins are involved in activities on cell membranes and in cell-cell interactions. Hydrolases are ubiquitous and play important roles in bacteria, including digesting the murein of bacteria [61], acting as a pacemaker for cell wall growth [62], and splitting the septum during cell division [63].

Despite its apparent success, the GSLHC approach has its own limitations. Statistical concerns regarding the neutrality of GSEA have been raised [64, 65] (and replied to [64, 65]). There is no perfect method for extracting hypothesis-free information from something as rich as a modern set of genome-wide gene expression data. The several tests shown in this work do show that, for practical purposes, GSA, including GSEA and two algorithms derived from it, PAGE and GAGE, is superior to IGA. Of the 106 unknown compounds in CMap (version 2.0), we found drug partners for only 18. That we failed to do the same for the other 88 compounds has many possible reasons: a weakness of GSLHC; the tags in MSigDB are not sufficiently comprehensive; the set of compounds presently included in CMap is too restrictive. Improvements on all three fronts are possible, even expected. Already in its current form, we expect the GSLHC approach to be more widely applicable to many areas other than what was demonstrated here. To name a few: repurposed drug discovery based on functional-profile characterization of phenotypes, function-based diagnosis and classification of complex diseases, and prognosis for advanced-stage patients after chemotherapy treatment.

A sequel to CMap, the LINCS L1000 dataset consisting of over 1.4M gene-expression profiles collected from human cells treated with chemical compounds, was recently constructed and made available online (http://support.lincscloud.org/hc/en-us) by The Broad Institute. In L1000 each profile is a 1000-gene representation of a gene-expression profiling assay based on the direct measurement of the transcriptome. A GSA as carried out in the present paper will not be suitable for L1000 gene-sets. However, it will be interesting to investigate the cell-type dependence of the LINCS data.

Supporting Information

S1 Fig. The local program reproduces results of CMap server.

(A) The local program (blue) tracks results given by CMap for permutation p-value (solid lines), with small deviations when the drug list is shorter than 30, and enrichment score (dashed lines). (B) Run times for the local program and CMap are comparable, with the former slightly faster when the size of the probe set is less than 700, and slightly slower otherwise.

https://doi.org/10.1371/journal.pone.0139889.s001

(TIF)

S2 Fig. Most replicates treated with the same perturbagen show low reproducibility of the top-1000 differentially expressed genes (DEGs) across all CMap datasets.

The reproducibility between two treatments (blue: the same perturbagen; red: two different perturbagens) is defined by the frequency of the number of overlapping genes versus the number of 1000 DEGs.

https://doi.org/10.1371/journal.pone.0139889.s002

(TIF)
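The overlap measure described for S2 Fig can be sketched as a simple fraction of shared top-ranked genes. The function name and the gene identifiers are hypothetical, and a small k is used so the toy lists stay short.

```python
def top_k_overlap(ranked_a, ranked_b, k=1000):
    """Fraction of the top-k genes shared by two ranked gene lists.

    Each list is assumed to be ordered from most to least
    differentially expressed and to contain at least k genes.
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    return len(top_a & top_b) / k


# Toy ranked gene lists from two treatments (hypothetical identifiers).
treatment_1 = ["g1", "g2", "g3", "g4", "g5"]
treatment_2 = ["g3", "g9", "g1", "g7", "g2"]

overlap = top_k_overlap(treatment_1, treatment_2, k=4)
print(overlap)
```

Here two of the top four genes are shared, giving an overlap of 0.5; collecting this value over many treatment pairs gives the kind of frequency distribution described in the caption.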

S3 Fig. Principal component analysis of the full CMap dataset.

The first two components, together accounting for 21.7% of the total weight, show a clear separation of data from the HL60 (black circle) and PC3 (green cross) cell lines.

https://doi.org/10.1371/journal.pone.0139889.s003

(TIF)

S4 Fig. Performance test (F-score) showed no difference between gene and tag clusters by the Anatomical Therapeutic Chemical (ATC) classification system and the PubChem structure database.

(A) In the PubChem database, we used the chemical structure clustering tool to cluster compounds based on structure (fingerprint) similarity using the Single Linkage algorithm; the number of clusters decreases with cluster size. Both results indicated that the F-score increases with decreasing class size. (B) In the ATC system, drugs are classified into groups at 4 different levels–from general anatomical groups to detailed chemical/therapeutic/pharmacological subgroups.

https://doi.org/10.1371/journal.pone.0139889.s004

(TIF)

S5 Fig. GSLHC identified the compound 5186324 as a novel acetylcholinesterase inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound 5186324 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in 5186324 with permutation p< 0.005. (B) Detail of the dendrogram showing 5186324 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s005

(TIF)

S6 Fig. GSLHC identified the compound DL-PPMP as a novel cyclooxygenase-1 inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound DL-PPMP of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in DL-PPMP with permutation p< 0.005. (B) Detail of the dendrogram showing DL-PPMP (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s006

(TIF)

S7 Fig. GSLHC identified the compound Prestwick-692 as a novel glucocorticoid receptor agonist.

(A) A correlation > 0.9 sub-heatmap including the compound Prestwick-692 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in Prestwick-692 with permutation p< 0.005. (B) Detail of the dendrogram showing Prestwick-692 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s007

(TIF)

S8 Fig. GSLHC identified the compound tyrphostin AG-825 as a novel DNA topoisomerase I inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound tyrphostin AG-825 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in tyrphostin AG-825 with permutation p< 0.005. (B) Detail of the dendrogram showing tyrphostin AG-825 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s008

(TIF)

S9 Fig. GSLHC identified the compound 5248896 as a novel human epidermal growth factor receptor (HER)-2/neu inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound 5248896 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in 5248896 with permutation p< 0.005. (B) Detail of the dendrogram showing 5248896 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s009

(TIF)

S10 Fig. GSLHC identified the compound H-7 as a novel Cyclin-dependent kinase 2 Inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound H-7 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in H-7 with permutation p< 0.005. (B) Detail of the dendrogram showing H-7 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s010

(TIF)

S11 Fig. GSLHC identified the compound Prestwick-1103 as a novel Tumor necrosis factor antibody.

(A) A correlation > 0.9 sub-heatmap including the compound Prestwick-1103 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in Prestwick-1103 with permutation p< 0.005. (B) Detail of the dendrogram showing Prestwick-1103 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s011

(TIF)

S12 Fig. GSLHC identified the compound U0125 as a novel DNA topoisomerase I inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound U0125 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in U0125 with permutation p< 0.005. (B) Detail of the dendrogram showing U0125 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s012

(TIF)

S13 Fig. GSLHC identified the compound 5109870 as a novel Alpha adrenergic receptor antagonist.

(A) A correlation > 0.9 sub-heatmap including the compound 5109870 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in 5109870 with permutation p< 0.005. (B) Detail of the dendrogram showing 5109870 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s013

(TIF)

S14 Fig. GSLHC identified the compound MG-132 as a novel Proteasome Inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound MG-132 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in MG-132 with permutation p< 0.005. (B) Detail of the dendrogram showing MG-132 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s014

(TIF)

S15 Fig. GSLHC identified the compound PHA-00851261E as a novel CGMP-inhibited 3',5'-cyclic phosphodiesterase.

(A) A correlation > 0.9 sub-heatmap including the compound PHA-00851261E of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in PHA-00851261E with permutation p< 0.005. (B) Detail of the dendrogram showing PHA-00851261E (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s015

(TIF)

S16 Fig. GSLHC identified the compound STOCK1N-35215 as a novel Histone deacetylase inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound STOCK1N-35215 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in STOCK1N-35215 with permutation p< 0.005. (B) Detail of the dendrogram showing STOCK1N-35215 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s016

(TIF)

S17 Fig. GSLHC identified the compound 0297417-0002B as a novel Purine nucleoside phosphorylase Inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound 0297417-0002B of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in 0297417-0002B with permutation p< 0.005. (B) Detail of the dendrogram showing 0297417-0002B (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s017

(TIF)

S18 Fig. GSLHC identified the compound F0447-0125 as a novel DNA Inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound F0447-0125 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in F0447-0125 with permutation p< 0.005. (B) Detail of the dendrogram showing F0447-0125 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s018

(TIF)

S19 Fig. GSLHC identified the compound W-13 as a novel Mineralocorticoid receptor agonist.

(A) A correlation > 0.9 sub-heatmap including the compound W-13 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in W-13 with permutation p< 0.005. (B) Detail of the dendrogram showing W-13 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s019

(TIF)

S20 Fig. GSLHC identified the compound CP-944629 as a novel DNA polymerase beta inhibitor.

(A) A correlation > 0.9 sub-heatmap including the compound CP-944629 of unknown function from a GSLHC-generated heatmap based on tags significantly enriched in CP-944629 with permutation p< 0.005. (B) Detail of the dendrogram showing CP-944629 (marked by black asterisk) with its partner drugs.

https://doi.org/10.1371/journal.pone.0139889.s020

(TIF)

S1 Table. The 300x671 tCMap1 matrix used to construct the one-way cluster in Fig 2A.

tCMap1 is the 300 by 671 similarity matrix of the 300 highest variance microarray probe-sets and the 671 instances in CMap v1.0 (http://figshare.com/download/file/2258031).

https://doi.org/10.1371/journal.pone.0139889.s021

(XLSX)

S2 Table. The 300x671 tGSCMap1 matrix used to construct the one-way cluster in Fig 2B.

tGSCMap1 is the 300 by 671 similarity matrix of the 300 largest ES variance MSigDB tags and the 671 instances in CMap v1.0 (http://figshare.com/download/file/2258032).

https://doi.org/10.1371/journal.pone.0139889.s022

(XLSX)

S3 Table. The 4884x1309 IGCMd matrix used to construct the two-way cluster in Fig 3A.

IGCMd is the 4884 by 1309 similarity matrix of the 4884 highest variance microarray probe-sets against the 1309 drugs/chemicals in CMap v2.0 (http://figshare.com/download/file/2258034).

https://doi.org/10.1371/journal.pone.0139889.s023

(XLSX)

S4 Table. The 4884x1309 GSCMd matrix used to construct the two-way cluster in Fig 3B.

GSCMd is the 4884 by 1309 similarity matrix of the 4884 largest ES variance MSigDB tags against the 1309 drugs/chemicals in CMap v2.0 (http://figshare.com/download/file/2258033).

https://doi.org/10.1371/journal.pone.0139889.s024

(XLSX)

Author Contributions

Conceived and designed the experiments: FHC ZHJ H.C. Lee. Performed the experiments: FHC ZHJ. Analyzed the data: FHC ZHJ TTH CLH H.C. Lee. Contributed reagents/materials/analysis tools: FHC H.C. Liu H.C. Lee. Wrote the paper: FHC H.C. Lee.

References

  1. Perez-Diez A, Morgun A, Shulzhenko N. Microarrays for cancer diagnosis and classification. Adv Exp Med Biol. 2007;593:74–85. Epub 2007/02/03. pmid:17265718.
  2. Miller JA, Oldham MC, Geschwind DH. A systems level analysis of transcriptional changes in Alzheimer's disease and normal aging. J Neurosci. 2008;28(6):1410–20. Epub 2008/02/08. doi: 28/6/1410 [pii] pmid:18256261; PubMed Central PMCID: PMC2902235.
  3. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003;4(4):210. Epub 2003/04/19. pmid:12702200; PubMed Central PMCID: PMC154570.
  4. Zaravinos A, Lambrou GI, Boulalas I, Delakas D, Spandidos DA. Identification of common differentially expressed genes in urinary bladder cancer. PLoS One. 2011;6(4):e18135. Epub 2011/04/13. pmid:21483740; PubMed Central PMCID: PMC3070717.
  5. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America. 2001;98(9):5116–21. pmid:11309499; PubMed Central PMCID: PMC33173.
  6. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3. Epub 2006/05/02. pmid:16646809.
  7. Pavlidis P. Using ANOVA for gene selection from microarray studies of the nervous system. Methods. 2003;31(4):282–9. Epub 2003/11/05. doi: S1046202303001579 [pii]. pmid:14597312.
  8. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9. Epub 2000/05/10. pmid:10802651; PubMed Central PMCID: PMC3037419.
  9. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40(Database issue):D109–14. Epub 2011/11/15. doi: gkr988 [pii] pmid:22080510; PubMed Central PMCID: PMC3245020.
  10. Hosack DA, Dennis G Jr., Sherman BT, Lane HC, Lempicki RA. Identifying biological themes within lists of genes with EASE. Genome Biol. 2003;4(10):R70. Epub 2003/10/02. pmid:14519205; PubMed Central PMCID: PMC328459.
  11. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313(5795):1929–35. pmid:17008526.
  12. Aramadhaka LR, Prorock A, Dragulev B, Bao Y, Fox JW. Connectivity maps for biosimilar drug discovery in venoms: the case of Gila monster venom and the anti-diabetes drug Byetta(R). Toxicon: official journal of the International Society on Toxinology. 2013;69:160–7. pmid:23602926.
  13. Meng F, Dai E, Yu X, Zhang Y, Chen X, Liu X, et al. Constructing and characterizing a bioactive small molecule and microRNA association network for Alzheimer's disease. Journal of the Royal Society, Interface / the Royal Society. 2014;11(92):20131057. pmid:24352679; PubMed Central PMCID: PMC3899875.
  14. Chen F, Guan Q, Nie ZY, Jin LJ. Gene expression profile and functional analysis of Alzheimer's disease. American journal of Alzheimer's disease and other dementias. 2013;28(7):693–701. pmid:24005853.
  15. Garman KS, Acharya CR, Edelman E, Grade M, Gaedcke J, Sud S, et al. A genomic approach to colon cancer risk stratification yields biologic insights into therapeutic opportunities. Proc Natl Acad Sci U S A. 2008;105(49):19432–7. Epub 2008/12/04. doi: 0806674105 [pii] pmid:19050079; PubMed Central PMCID: PMC2592987.
  16. Huang L, Zhao S, Frasor JM, Dai Y. An integrated bioinformatics approach identifies elevated cyclin E2 expression and E2F activity as distinct features of tamoxifen resistant breast tumors. PloS one. 2011;6(7):e22274. pmid:21789246; PubMed Central PMCID: PMC3137633.
  17. Wang G, Ye Y, Yang X, Liao H, Zhao C, Liang S. Expression-based in silico screening of candidate therapeutic compounds for lung adenocarcinoma. PloS one. 2011;6(1):e14573. pmid:21283735; PubMed Central PMCID: PMC3024967.
  18. Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Science translational medicine. 2011;3(96):96ra77. pmid:21849665; PubMed Central PMCID: PMC3502016.
  19. Iskar M, Campillos M, Kuhn M, Jensen LJ, van Noort V, Bork P. Drug-induced regulation of target expression. PLoS computational biology. 2010;6(9). pmid:20838579; PubMed Central PMCID: PMC2936514.
  20. Pacini C, Iorio F, Goncalves E, Iskar M, Klabunde T, Bork P, et al. DvD: An R/Cytoscape pipeline for drug repurposing using public repositories of gene expression data. Bioinformatics. 2013;29(1):132–4. pmid:23129297; PubMed Central PMCID: PMC3530913.
  21. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(33):14621–6. pmid:20679242; PubMed Central PMCID: PMC2930479.
  22. Parkkinen JA, Kaski S. Probabilistic drug connectivity mapping. BMC bioinformatics. 2014;15:113. pmid:24742351; PubMed Central PMCID: PMC4011783.
  23. Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008;9(3):189–97. Epub 2008/01/19. doi: bbn001 [pii] pmid:18202032.
  24. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34(3):267–73. Epub 2003/06/17. [pii]. pmid:12808457.
  25. Efron B, Tibshirani R. On Testing the Significance of Sets of Genes. Ann Appl Stat. 2007;1(1):107–29. pmid:ISI:000261050400006.
  26. Barry WT, Nobel AB, Wright FA. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics. 2005;21(9):1943–9. Epub 2005/01/14. doi: bti260 [pii] pmid:15647293.
  27. Breslin T, Eden P, Krogh M. Comparing functional annotation analyses with Catmap. BMC Bioinformatics. 2004;5:193. Epub 2004/12/14. doi: 1471-2105-5-193 [pii] pmid:15588298; PubMed Central PMCID: PMC543458.
  28. Lee HK, Braynen W, Keshav K, Pavlidis P. ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics. 2005;6:269. Epub 2005/11/11. doi: 1471-2105-6-269 [pii] pmid:16280084; PubMed Central PMCID: PMC1310606.
  29. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, et al. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics. 2007;8:242. Epub 2007/07/07. doi: 1471-2105-8-242 [pii] pmid:17612399; PubMed Central PMCID: PMC1931607.
  30. Prifti E, Zucker JD, Clement K, Henegar C. FunNet: an integrative tool for exploring transcriptional interactions. Bioinformatics. 2008;24(22):2636–8. Epub 2008/09/19. doi: btn492 [pii] pmid:18799481.
  31. Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26(12):i237–45. Epub 2010/06/10. doi: btq182 [pii] pmid:20529912; PubMed Central PMCID: PMC2881367.
  32. Sun CH, Kim MS, Han Y, Yi GS. COFECO: composite function annotation enriched by protein complex data. Nucleic Acids Res. 2009;37(Web Server issue):W350–5. Epub 2009/05/12. doi: gkp331 [pii] pmid:19429688; PubMed Central PMCID: PMC2703949.
  33. Wong DJ, Nuyten DS, Regev A, Lin M, Adler AS, Segal E, et al. Revealing targeted therapy for human cancer by gene module maps. Cancer research. 2008;68(2):369–78. pmid:18199530.
  34. Lottaz C, Toedling J, Spang R. Annotation-based distance measures for patient subgroup discovery in clinical microarray studies. Bioinformatics. 2007;23(17):2256–64. pmid:17586546.
  35. Khan SA, Faisal A, Mpindi JP, Parkkinen JA, Kalliokoski T, Poso A, et al. Comprehensive data-driven analysis of the impact of chemoinformatic structure on the genome-wide biological response profiles of cancer cells to 1159 drugs. BMC bioinformatics. 2012;13:112. pmid:22646858; PubMed Central PMCID: PMC3532323.
  36. Ben-Dor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data: the order-preserving submatrix problem. J Comput Biol. 2003;10(3–4):373–84. Epub 2003/08/26. pmid:12935334.
  37. Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002;18 Suppl 1:S136–44. Epub 2002/08/10. pmid:12169541.
  38. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–50. Epub 2005/10/04. doi: 0506580102 [pii] pmid:16199517; PubMed Central PMCID: PMC1239896.
  39. 39. Li QL, Chen TJ, Wang YL, Bryant SH. PubChem as a public resource for drug discovery. Drug Discovery Today. 2010;15(23–24):1052–7. pmid:ISI:000285235700010.
  40. 40. Manning CD, Raghavan P, Schutze H. Introduction to Information Retrieval: Cambridge University Press.
  41. 41. Zhu F, Shi Z, Qin C, Tao L, Liu X, Xu F, et al. Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res. 2012;40(Database issue):D1128–36. Epub 2011/09/29. doi: gkr797 [pii] pmid:21948793; PubMed Central PMCID: PMC3245130.
  42. 42. Hollander M, Wolfe D. Nonparametric Statistical Methods 2ed. New York: Wiley; 1999.
  43. 43. Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, et al. A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell. 2003;114(3):323–34. Epub 2003/08/14. doi: S0092867403005701 [pii]. pmid:12914697.
  44. 44. Li Q, Cheng T, Wang Y, Bryant SH. PubChem as a public resource for drug discovery. Drug Discov Today. 2010;15(23–24):1052–7. Epub 2010/10/26. doi: S1359-6446(10)00773-7 [pii] pmid:20970519; PubMed Central PMCID: PMC3010383.
  45. 45. Richon VM. Cancer biology: mechanism of antitumour action of vorinostat (suberoylanilide hydroxamic acid), a novel histone deacetylase inhibitor. Br J Cancer. 2006;95(S1):S2–S6.
  46. 46. Gottlicher M, Minucci S, Zhu P, Kramer OH, Schimpf A, Giavara S, et al. Valproic acid defines a novel class of HDAC inhibitors inducing differentiation of transformed cells. EMBO J. 2001;20(24):6969–78. Epub 2001/12/18. pmid:11742974; PubMed Central PMCID: PMC125788.
  47. 47. Kemp MG, Ghosh M, Liu G, Leffak M. The histone deacetylase inhibitor trichostatin A alters the pattern of DNA replication origin activity in human cells. Nucleic Acids Res. 2005;33(1):325–36. Epub 2005/01/18. doi: 33/1/325 [pii] pmid:15653633; PubMed Central PMCID: PMC546162.
  48. 48. Keen JC, Yan L, Mack KM, Pettit C, Smith D, Sharma D, et al. A novel histone deacetylase inhibitor, scriptaid, enhances expression of functional estrogen receptor alpha (ER) in ER negative human breast cancer cells in combination with 5-aza 2'-deoxycytidine. Breast Cancer Res Treat. 2003;81(3):177–86. Epub 2003/11/19. pmid:14620913.
  49. 49. Balakin KV, Ivanenkov YA, Kiselyov AS, Tkachenko SE. Histone deacetylase inhibitors in cancer therapy: latest developments, trends and medicinal chemistry perspective. Anticancer Agents Med Chem. 2007;7(5):576–92. Epub 2007/09/28. pmid:17896917.
  50. 50. Yadav B, Pemovska T, Szwajda A, Kulesskiy E, Kontro M, Karjalainen R, et al. Quantitative scoring of differential drug sensitivity for individually optimized anticancer therapies. Scientific reports. 2014;4:5193. pmid:24898935; PubMed Central PMCID: PMC4046135.
  51. 51. Wood ER, Kuyper L, Petrov KG, Hunter RN 3rd, Harris PA, Lackey K. Discovery and in vitro evaluation of potent TrkA kinase inhibitors: oxindole and aza-oxindoles. Bioorg Med Chem Lett. 2004;14(4):953–7. Epub 2004/03/12. S0960894X03012848 [pii]. pmid:15013000.
  52. 52. Lahusen T, De Siervi A, Kunick C, Senderowicz AM. Alsterpaullone, a novel cyclin-dependent kinase inhibitor, induces apoptosis by activation of caspase-9 due to perturbation in mitochondrial membrane potential. Mol Carcinog. 2003;36(4):183–94. Epub 2003/04/02. pmid:12669310.
  53. 53. Keller HU, Zimmermann A, Niggli V. Diacylglycerols and the protein kinase inhibitor H-7 suppress cell polarity and locomotion of Walker 256 carcinosarcoma cells. Int J Cancer. 1989;44(5):934–9. Epub 1989/11/15. pmid:2555310.
  54. 54. Tan C, Tasaka H, Yu KP, Murphy ML, Karnofsky DA. Daunomycin, an antitumor antibiotic, in the treatment of neoplastic disease. Clinical evaluation with special reference to childhood leukemia. Cancer. 1967;20(3):333–53. Epub 1967/03/01. pmid:4290058.
  55. 55. Rose MG. Hematology: Azacitidine improves survival in myelodysplastic syndromes. Nat Rev Clin Oncol. 2009;6(9):502–3. Epub 2009/08/27. doi: nrclinonc.2009.125 [pii] pmid:19707240.
  56. 56. Ko MW, Tamhankar MA, Volpe NJ, Porter D, McGrath C, Galetta SL. Acute promyelocytic leukemic involvement of the optic nerves following mitoxantrone treatment for multiple sclerosis. J Neurol Sci. 2008;273(1–2):144–7. Epub 2008/08/09. doi: S0022-510X(08)00315-8 [pii] pmid:18687447.
  57. 57. Kim JY, Lee SG, Chung JY, Kim YJ, Park JE, Koh H, et al. Ellipticine induces apoptosis in human endometrial cancer cells: the potential involvement of reactive oxygen species and mitogen-activated protein kinases. Toxicology. 2011;289(2–3):91–102. Epub 2011/08/17. doi: S0300-483X(11)00291-5 [pii] pmid:21843585.
  58. 58. Ulukan H, Swaan PW. Camptothecins: a review of their chemotherapeutic potential. Drugs. 2002;62(14):2039–57. Epub 2002/09/25. doi: 621404 [pii]. pmid:12269849.
  59. 59. Rubin BK, Tamaoki J. Antibiotics as anti-inflammatory and immunomodulatory agents. Basel; Boston: Birkhäuser; 2005. xiii, 273 p. p.
  60. 60. Sanders WE Jr. Antibiotics during anesthesia and surgery. Int Anesthesiol Clin. 1968;6(1):211–8. Epub 1968/01/01. pmid:5704457.
  61. 61. Smith TJ, Blackman SA, Foster SJ. Autolysins of Bacillus subtilis: multiple enzymes with multiple functions. Microbiology. 2000;146 (Pt 2):249–62. Epub 2000/03/09. pmid:10708363.
  62. 62. Holtje JV. From growth to autolysis: the murein hydrolases in Escherichia coli. Arch Microbiol. 1995;164(4):243–54. Epub 1995/10/01. pmid:7487333.
  63. 63. Garcia P, Gonzalez MP, Garcia E, Lopez R, Garcia JL. LytB, a novel pneumococcal murein hydrolase essential for cell separation. Mol Microbiol. 1999;31(4):1275–81. Epub 1999/03/30. pmid:10096093.
  64. 64. Damian D, Gorfine M. Statistical concerns about the GSEA procedure. Nature Genetics. 2004;36(7):663-. pmid:ISI:000222354100002.
  65. 65. Mootha VK, Daly MJ, Patterson N, Hirschhorn JN, Groop LC, Altshuler D. Statistical concerns about the GSEA procedure—Reply. Nature Genetics. 2004;36(7):663-. pmid:ISI:000222354100003.