One author, Dr. Ari Kahn, is affiliated with a commercial company, SRA International Inc. This does not alter the authors’ adherence to all the PLoS ONE policies on sharing data and materials.
Conceived and designed the experiments: BRZ. Analyzed the data: BRZ. Contributed reagents/materials/analysis tools: BRZ. Wrote the paper: BRZ KK JNW YP VL AK. Researched and facilitated data analysis: AK. Provided general guidance throughout the project: WR. Provided critical feedback for improving the scientific content of the manuscript during the internal review process: WR.
The NCI-60 is a panel of 60 diverse human cancer cell lines used by the U.S. National Cancer Institute to screen compounds for anticancer activity. We recently clustered genes based on correlation of expression profiles across the NCI-60. Many of the resulting clusters were characterized by cancer-associated biological functions. The set of curated glioblastoma (GBM) gene expression data from the Cancer Genome Atlas (TCGA) initiative has recently become available. Thus, we are now able to determine which of the processes are robustly shared by both the immortalized cell lines and clinical cancers.
Our central observation is that some sets of highly correlated genes in the NCI-60 expression data are also highly correlated in the GBM expression data. Furthermore, a “double fishing” strategy identified many sets of genes that show Pearson correlation ≥0.60 in both the NCI-60 and the GBM data sets relative to a given “bait” gene. The number of such gene sets far exceeds the number expected by chance.
Many of the gene-gene correlations found in the NCI-60 do not reflect just the conditions of cell lines in culture; rather, they reflect processes and gene networks that also function in vivo.
The NCI-60
When we recently clustered genes based on correlation of expression profiles across the NCI-60
The full size figures are available as Figures S1 and S2. The numbers appended after the gene name refer to the NCI-60 cluster in which that gene appeared.
A previous study
NCI-60 CLUSTER | GBM CLUSTER 1 | GBM CLUSTER 2 | GBM CLUSTER 3 | MARGINALS |
1 | 9 | 1 | 5 | 15 |
2 | 12 | 34 | 18 | 64 |
MARGINALS | 21 | 35 | 23 | 79 |
The Fisher Exact p-value corresponding to this contingency table was 0.00039.
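As a rough illustration of how an exact p-value can be obtained for a 2×k contingency table such as the one above, the following Python sketch enumerates every table with the same marginals and sums the probabilities of the tables that are no more likely than the observed one (the Freeman-Halton generalization of Fisher's exact test). It is not the R code used in this study. The function names are hypothetical and the tie tolerance is an assumption, so small numerical differences from the reported value are possible.

```python
from math import comb


def table_probability(row1, col_totals):
    """Multivariate hypergeometric probability of the first row of a 2xk
    table, given fixed column totals and a fixed first-row total."""
    numerator = 1
    for count, col_total in zip(row1, col_totals):
        numerator *= comb(col_total, count)
    return numerator / comb(sum(col_totals), sum(row1))


def first_rows(col_totals, remaining):
    """Yield every feasible first row whose entries sum to `remaining`."""
    if len(col_totals) == 1:
        if remaining <= col_totals[0]:
            yield (remaining,)
        return
    for count in range(min(col_totals[0], remaining) + 1):
        for rest in first_rows(col_totals[1:], remaining - count):
            yield (count,) + rest


def fisher_exact_2xk(row1, row2, rel_tol=1e-7):
    """Two-sided exact p-value for a 2xk table: total probability of all
    tables with the same marginals that are no more likely than the observed
    table. `rel_tol` (an assumption) guards against floating-point ties."""
    col_totals = [a + b for a, b in zip(row1, row2)]
    p_observed = table_probability(row1, col_totals)
    p_value = 0.0
    for candidate in first_rows(col_totals, sum(row1)):
        p = table_probability(candidate, col_totals)
        if p <= p_observed * (1 + rel_tol):
            p_value += p
    return min(p_value, 1.0)


# Counts from the cluster 52 contingency table (NCI-60 cluster 1 and 2 rows).
p = fisher_exact_2xk((9, 1, 5), (12, 34, 18))
print(f"exact p-value: {p:.5f}")
```

Full enumeration is practical here because the table has only two rows and a few columns. For larger tables, a Monte Carlo approximation or a dedicated statistics package would be a more sensible choice.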
Cancer cells in culture are subject to very different conditions than tumor cells in the host. They have been removed from their physiological milieu of other cell types, tissue architecture, hormonal influences, and autocrine/paracrine signals. So the question remained: “What does such a pattern of association in cell culture tell us about cancer cells in vivo?”
To explore that question, we analyzed the highly curated glioblastoma (GBM) transcript expression data set generated by The Cancer Genome Atlas (TCGA) initiative
NUMBER OF CLUSTERS (parameter “k” of cutree()) | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
FISHER EXACT P-VALUE | 0.00254 | 0.00039* | 0.00103 | 0.00093 | 0.00167 | 0.00157 | 0.00189 |
The Fisher’s Exact p-value for 100 randomizations corresponding to k = 3 was 0.464±0.279. The asterisk (bold in the original table) marks the lowest p-value model for the real data. This was the model used for the remainder of the cluster 52 analysis.
NCI-60 CLUSTER | GBM CLUSTER 1 | GBM CLUSTER 2 | GBM CLUSTER 3 |
1 | ACCN2, ALDOC, EPB41, FOXO4, MUTYH, OXSM, PDE3B, PRMT7, SORD | FOS | GALK2, MCAT, MRPS30, SCO2, SLC43A1 |
2 | AGER, ALCAM, DBN1, DDAH1, DYNLL1, GLRB, MYBL1, NMT2, ROBO3, SYNE1, TRPC1, ZNF281 | ACTB, ADAM9, AXL, CAV1, CAV2, CLIC4, CNN2, DLC1, DSE, DUSP1, FGFR1, FOSL2, GADD45A, GAS6, IL6, INHBA, ITGB1, JUN, KLF7, LOXL2, MYH9, MYLK, PDGFC, PDLIM7, PTRF, PVR, QSOX1, RRAS, TGFB1I1, THBS1, TNFRSF12A, TRAM1, VCL, ZYX | ACTN3, CDH2, CDH4, CHST3, DKK3, EGFR, FADS3, FEZ2, FGF2, KCNMA1, MGLL, MT1X, MYL6, SOCS5, SORBS3, TGFB2, TUFT1, VCAN |
In the present analysis, we tested whether sets of genes that we previously found to be (1) highly co-expressed across the NCI-60, and (2) functionally coherent were also highly co-expressed across the GBM samples. We then extended that basic analysis by a “double fishing” strategy. That is, we identified sets of genes that showed correlation ≥0.60 in both the NCI-60 and GBM data sets relative to a given “bait” gene. We found that the number of such gene sets far exceeded the number expected by chance. That analysis does not mean that cancer cells in culture share all, or even most, of their characteristics with cells in vivo, but it does indicate similarities.
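The “double fishing” idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in this study: the function names, the dictionary layout (gene name mapped to a list of expression values), and the toy profiles are all hypothetical. Only the 0.60 Pearson threshold comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def fish(expression, bait, threshold=0.60):
    """Genes whose profile correlates with the bait's profile at or above the threshold."""
    bait_profile = expression[bait]
    return {
        gene
        for gene, profile in expression.items()
        if gene != bait and pearson(bait_profile, profile) >= threshold
    }


def double_fish(nci60, gbm, bait, threshold=0.60):
    """Genes that pass the threshold relative to the bait in BOTH data sets."""
    return fish(nci60, bait, threshold) & fish(gbm, bait, threshold)


# Toy data (hypothetical): 4 "cell lines" and 5 "tumor samples".
nci60 = {"BAIT": [1, 2, 3, 4], "G1": [2, 4, 6, 8.5], "G2": [4, 3, 2, 1], "G3": [1, 3, 2, 4]}
gbm = {"BAIT": [1, 2, 3, 4, 5], "G1": [1, 2, 3, 5, 4], "G2": [5, 4, 3, 2, 1], "G3": [5, 1, 4, 2, 3]}

print(sorted(double_fish(nci60, gbm, "BAIT")))
```

In the toy data, G3 correlates with the bait across the cell lines but not across the tumor samples, so it is caught by the first “fishing” pass and dropped by the second.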
NCI-60 CLUSTER | GBM CLUSTER 1 | GBM CLUSTER 2 | GBM CLUSTER 3 | GBM CLUSTER 4 | MARGINALS |
1 | 8 | 11 | 0 | 5 | 24 |
2 | 11 | 12 | 38 | 14 | 75 |
MARGINALS | 19 | 23 | 38 | 19 | 99 |
The Fisher Exact p-value corresponding to this contingency table was 0.00001.
NUMBER OF CLUSTERS (parameter “k” of cutree()) | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
FISHER EXACT P-VALUE | 0.09162 | 0.10917 | 0.00001* | 0.00001 | 0.00001 | 0.00001 | 0.00001 |
The Fisher’s Exact p-value for 100 randomizations corresponding to k = 3 was 0.529±0.283. The asterisk (bold in the original table) marks the lowest p-value model for the real data. This was the model used for the remainder of the cluster 68 analysis.
For GBM expression data, the files
NCI-60 CLUSTER | GBM CLUSTER 1 | GBM CLUSTER 2 | GBM CLUSTER 3 | GBM CLUSTER 4 |
1 | ACVR2A, CSNK1G3, CTBP2, MAP3K13, SMAD5, YES1, ZNF205, ZNF35 | ADAM15, AGPAT3, AHNAK, DUSP3, EMP2, GRN, LMNA, MGAT4B, PLP2, SPR, ZFHX3 | (none) | APP, DOCK1, MLF1, NOL3, OAT |
2 | CHRNA3, CYFIP2, ELOVL4, GNB1L, GRIK5, MYB, NFKBIL1, SLIT1, SMPD3, TSPAN7, USP20 | ADA, ALDH1A2, CD1E, CD79A, GP5, IGLL1, KRT1, LAX1, LTA, PTGDR, RAG1, SEPT6 | AIF1, ARHGDIB, CCR4, CCR7, CD1D, CD27, CD3D, CD3E, CD3G, CD4, CD5, CD52, CD7, CD84, CD96, CECR1, CTSW, FLI1, FYB, GFI1, GMFG, GNA15, GRAP2, IL12RB1, IL2RG, ITGAL, ITGB2, ITK, LST1, MAP4K1, RHOH, SELL, SH2D1A, SIT1, SPN, TRAT1, TREML2, ZAP70 | C9, CD1A, GDF10, GRAP, LY6H, NKX2-5, PRKCQ, RAG2, RASGRP2, RORB, SLC15A2, SLC18A2, TAL1, VPREB1 |
We used all 202 GBM samples that are available, representing roughly comparable numbers of samples of each subtype. Since the calculated correlation values will be more accurate if they come from a more diverse sampling population, we wanted to retain as much diversity as possible by looking at all subtypes together, so we did not report co-expression within or between subtypes.
The full size figures are available as Figures S3 and S4. The numbers appended after the gene name refer to the NCI-60 cluster in which that gene appeared.
The full size CIM is available as Figure S5. The gene name given as the column header is the representative of a list of genes. The full list of genes is available in the HTGM Download S1.
NCI-60 expression data were obtained from CellMiner
As mentioned above for the GBM samples, we are trying to achieve as high a degree of diversity as possible in the cell lines, so the highly heterogeneous mixture of cell lines represented by the NCI-60 is ideal. For illustration, consider two genes. We are looking to see if the expression levels of those two genes go up and down together as we traverse the 60 cell lines. If all of the cell lines were essentially identical to one another, there would be no variation and we could not see how the two genes relate in different conditions.
For most of the studies reported here, the expression data for GBM and for NCI-60 were restricted to those genes that were present in both
Rank | Designated gene (G) | Number of genes with correlation ≥0.60 | Generalized functional correlation | Number of genes in common with cluster 52 | Number of genes in common with cluster 68 |
1 | WAS | 50 | immune | 0 | 21 |
2 | IL2RG | 37 | immune | 0 | 23 |
3 | CD4 | 37 | immune | 0 | 9 |
4 | CD48 | 36 | immune | 0 | 13 |
5 | PTPRC | 35 | immune | 0 | 15 |
6 | PTPN7 | 35 | immune | 0 | 12 |
7 | HCLS1 | 35 | immune | 0 | 12 |
8 | CORO1A | 35 | immune | 0 | 11 |
9 | CD37 | 34 | immune | 0 | 8 |
10 | PLCB2 | 33 | immune | 0 | 12 |
11 | RHOH | 32 | immune | 0 | 16 |
12 | LILRA2 | 31 | immune | 0 | 4 |
13 | LRMP | 29 | immune | 0 | 9 |
14 | RNASE6 | 28 | immune | 0 | 3 |
15 | NCF4 | 28 | immune | 0 | 0 |
16 | CD3D | 28 | immune | 0 | 20 |
17 | CSF3R | 27 | immune | 0 | 4 |
18 | CYBB | 26 | immune | 0 | 0 |
19 | SIT1 | 23 | immune | 0 | 19 |
20 | DOCK2 | 23 | immune | 0 | 9 |
21 | CD1D | 23 | immune | 0 | 14 |
22 | MYH9 | 21 | angiogenesis | 7 | 0 |
23 | CD2 | 21 | immune | 0 | 15 |
24 | SERPINE1 | 20 | angiogenesis | 4 | 0 |
25 | LOXL2 | 20 | angiogenesis | 4 | 0 |
26 | CYR61 | 20 | angiogenesis | 6 | 0 |
27 | SH2D1A | 19 | immune | 0 | 14 |
28 | GNA15 | 19 | immune | 0 | 10 |
29 | COL5A1 | 19 | extracellular matrix | 2 | 0 |
30 | BTK | 19 | immune | 0 | 2 |
31 | LY86 | 18 | immune | 0 | 0 |
32 | LOX | 18 | angiogenesis | 2 | 0 |
33 | FLNA | 18 | cell-cell junction | 6 | 0 |
34 | CD52 | 18 | immune | 0 | 13 |
35 | S100A8 | 17 | immune | 0 | 2 |
36 | RNASE3 | 17 | immune | 0 | 2 |
37 | LGALS9 | 17 | immune | 0 | 8 |
38 | CSF2RB | 17 | immune | 0 | 0 |
39 | CD5 | 17 | immune | 0 | 13 |
40 | TNFRSF12A | 16 | chemotaxis | 5 | 0 |
41 | ST8SIA4 | 16 | tyrosine phosphorylation | 0 | 5 |
42 | PLK4 | 16 | mitosis | 0 | 0 |
43 | OAS2 | 16 | immune | 0 | 0 |
44 | MCM3 | 16 | DNA repair | 0 | 0 |
45 | LSP1 | 16 | immune | 0 | 2 |
46 | ITGB2 | 16 | immune | 0 | 10 |
47 | CD96 | 16 | immune | 0 | 14 |
48 | CD7 | 16 | immune | 0 | 11 |
49 | ATP2A3 | 16 | immune | 0 | 12 |
50 | COL1A2 | 15 | extracellular matrix | 0 | 0 |
51 | CCR7 | 15 | immune | 0 | 13 |
52 | ADAM12 | 15 | extracellular matrix | 1 | 0 |
53 | MED6 | 14 | NA | 0 | 0 |
54 | LGALS1 | 14 | NA | 0 | 0 |
55 | LCK | 14 | immune | 0 | 10 |
56 | FOSL2 | 14 | angiogenesis | 4 | 0 |
57 | DSE | 14 | immune | 4 | 0 |
58 | COL6A3 | 14 | extracellular matrix | 0 | 0 |
59 | COL1A1 | 14 | extracellular matrix | 1 | 0 |
60 | CLCF1 | 14 | tyrosine phosphorylation | 7 | 0 |
61 | ANXA2 | 14 | NA | 3 | 0 |
62 | PTPN22 | 13 | NA | 0 | 4 |
63 | PLAUR | 13 | immune | 1 | 0 |
64 | LTB | 13 | immune | 0 | 11 |
65 | CTSW | 13 | immune | 0 | 13 |
66 | IRAK3 | 12 | immune | 0 | 0 |
67 | GRB7 | 12 | NA | 0 | 0 |
68 | CTCF | 12 | chromatin assembly | 0 | 0 |
The designated gene (G) appearing in the gene column is the representative of a group of genes that correlate strongly with G.
“NA” indicates that the gene set did not map to any statistically significant GO categories.
The complete High-Throughput GoMiner (HTGM) download is provided in file Download S1. The files in the subdirectory work2026406846/inputFileDir are named according to each gene G. Each such file contains the complete list of genes correlating with G.
R language code
The key question we addressed here was whether genes that co-clustered with respect to their expression profiles across the NCI-60 cells also co-clustered with respect to their expression profiles across the GBM samples. To facilitate that analysis, we took advantage of the R language function
To determine the optimal value of k, we constructed a 2×k contingency table (
Without reference to any prior clustering analysis, the program constructed
We wished to determine whether the number of pairs of genes having correlation ≥0.60 with respect to both NCI-60 and GBM expression profiles exceeded the number expected by chance. We therefore performed a set of 10 studies in which we randomized the gene names in the GBM expression profiles. The number of such pairs obtained in the real study was 2708, whereas the number in the randomization studies was far smaller (193±14).
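The logic of this randomization can be sketched in Python as follows. This is a simplified illustration, not the code used in this study. It assumes the gene pairs with correlation ≥0.60 have already been collected for each data set, and it relies on the observation that randomizing gene names leaves the profile-to-profile correlations unchanged, so it amounts to relabeling the genes in the GBM pair list. All function names and the toy pair lists are hypothetical.

```python
import random
from statistics import mean, stdev


def shared_pairs(pairs_a, pairs_b):
    """Number of gene pairs present in both sets (pairs are 2-gene frozensets)."""
    return len(pairs_a & pairs_b)


def relabel(pairs, genes, rng):
    """Apply one random permutation of the gene names to every pair."""
    shuffled = list(genes)
    rng.shuffle(shuffled)
    mapping = dict(zip(genes, shuffled))
    return {frozenset(mapping[gene] for gene in pair) for pair in pairs}


def randomization_counts(nci60_pairs, gbm_pairs, genes, n_runs=10, seed=0):
    """Shared-pair counts after randomizing the GBM gene names n_runs times."""
    rng = random.Random(seed)
    return [
        shared_pairs(nci60_pairs, relabel(gbm_pairs, genes, rng))
        for _ in range(n_runs)
    ]


# Toy example (hypothetical): 30 genes whose high-correlation pairs form the
# same chain g0-g1-g2-... in both data sets.
genes = [f"g{i}" for i in range(30)]
chain = {frozenset((genes[i], genes[i + 1])) for i in range(29)}
nci60_pairs, gbm_pairs = set(chain), set(chain)

real = shared_pairs(nci60_pairs, gbm_pairs)
counts = randomization_counts(nci60_pairs, gbm_pairs, genes)
print(f"real: {real}; randomized: {mean(counts):.1f} ± {stdev(counts):.1f}")
```

Comparing the real count with the mean and standard deviation of the randomized counts mirrors the comparison reported above (2708 versus 193±14).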
Functional categorization of gene lists was performed using the High-Throughput GoMiner (HTGM) program
We used either the Genesis clustering program
We recently clustered genes based on correlation of expression profiles across the NCI-60
Using the expression profiles for the cluster 52 genes across the NCI-60 cell lines and also across the GBM samples, we were able to generate expression correlation CIMs across both of those sets of expression profiles (
In the correlation CIMs, we appended a number (1 or 2) to the gene names, corresponding to membership in the two major clusters in the NCI-60 CIM. Those same numbers were retained in the gene names for the GBM CIM to allow identification of the cluster to which that gene belonged in the NCI-60 CIM. The pattern of clustering in the GBM correlation CIM (
More precisely,
A notable finding is that nearly half of the genes in GBM cluster 2 (
To investigate other potential examples of coherence between gene expression clusters in NCI-60 cell lines and GBM samples, we repeated that analysis with the immune system-related cluster 68 genes
There were 34,865 gene pairs with correlation ≥0.60 in the NCI-60 data set but not in GBM, 87,556 in GBM but not in the NCI-60, and 2,708 in both the NCI-60 and GBM. The highest-ranking gene of the 2,708 was WAS; 49 genes showed correlation ≥0.60 with WAS. Of the top 100 genes (
The genes in cluster 52 or cluster 68 had been obtained by prior clustering of the gene expression profiles across NCI-60 cell lines, but not across TCGA GBM samples. We expect to find that some of the
This analysis shows ways in which strong gene-gene correlations and functional categorization (
Full version of
(PDF)
Full version of
(TIF)
Full version of
(PDF)
Full version of
(PDF)
HTGM GO categories
(PNG)
The parameters used in running HTGM.
(JPG)
Zip archive of HTGM results.
(ZIP)
We would like to thank Dr. Roel G. W. Verhaak for helpful discussions about the TCGA data sets.