Multi-study Integration of Brain Cancer Transcriptomes Reveals Organ-Level Molecular Signatures

We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. The resulting signatures were based on a new method reported herein, Identification of Structured Signatures and Classifiers (ISSAC), that yielded a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, and others have known roles in cancer biology. Analyses of large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between published studies: as is typical, we observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences. For this reason, we restricted ourselves to phenotypes for which at least two independent studies were available, and we reprocessed all the raw data from those studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and predictive accuracy on truly independent validation sets, even when the size of the training set was kept the same. This was most likely because the meta-signature encompasses more of the heterogeneity across different sources and conditions while amplifying the signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, we obtained 90% phenotype prediction accuracy, defined as the accuracy of identifying a particular brain cancer against the background of all phenotypes. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.


Text S4. Global statistical enrichment analysis of gene-pair classifiers
Our discussion in the main text of selected marker-panel genes provides some insight into their role in disease. Next, rather than focusing on a limited number of genes (such as only those in our marker panel), we extended our analysis to large sets of gene pairs that distinguish GBM from OLG. We hypothesized that using more complete information would offer more direct insight into the basis of the classifiers' relative expression reversal behavior, because the global patterns of these gene pairs can be associated with differences in the pathophysiology underlying the two brain cancers.
For any two classes and any given set of gene pairs, the union of the genes that are expressed relatively higher (respectively, lower) in each gene pair for Class 1 will be referred to as 'gene-set i' (resp., 'gene-set j'). Thus if there are N gene pairs, then 'gene-set i' and 'gene-set j' each consist of N genes. For the top 500, 1,000, and 1,500 gene pairs (numbers arbitrarily chosen to identify major trends) that best distinguish GBM (Class 1) from OLG (Class 2), we performed an enrichment analysis on the biological process ontologies of 'gene-set i' (expressed relatively higher in GBM than in OLG) and on those of 'gene-set j' (expressed relatively lower in GBM than in OLG) to find the most consistently enriched category by Z-score.
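The construction of the two gene sets can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_gene_sets`, the input format (a ranked list of pairs, each written as the gene expressed relatively higher in Class 1 followed by the gene expressed relatively lower), and the placeholder gene names are all assumptions. The sketch keeps one entry per gene pair so that each gene set has N entries, matching the count given in the text; whether repeated genes were collapsed in the original analysis is not stated.

```python
from typing import List, Tuple

# Each pair is (gene relatively higher in Class 1, gene relatively lower in Class 1).
# This ordering convention is an assumption made for the sketch.
GenePair = Tuple[str, str]


def build_gene_sets(ranked_pairs: List[GenePair], top_n: int) -> Tuple[List[str], List[str]]:
    """Split the top_n ranked gene pairs into 'gene-set i' and 'gene-set j'.

    One entry is kept per pair, so each returned list has min(top_n, len(ranked_pairs))
    entries; a gene that appears in several pairs is listed once per pair.
    """
    gene_set_i: List[str] = []  # expressed relatively higher in Class 1
    gene_set_j: List[str] = []  # expressed relatively lower in Class 1
    for higher, lower in ranked_pairs[:top_n]:
        gene_set_i.append(higher)
        gene_set_j.append(lower)
    return gene_set_i, gene_set_j


# Hypothetical, already-ranked pairs with placeholder gene names.
ranked_pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_A", "GENE_E")]
gene_set_i, gene_set_j = build_gene_sets(ranked_pairs, top_n=3)
```

Calling `set(gene_set_i)` gives the distinct genes if duplicates should be collapsed, in which case a set may hold fewer than N genes.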
Information on the biological processes and chromosome numbers of genes is available in the PANTHER database [1]. The Z-score is defined as:

$$Z_k = \frac{p_k - \pi_k}{\sqrt{\pi_k (1 - \pi_k) / N}}$$

where $N$ is the total number of classifier gene pairs between two classes (e.g. 500, 1,000, or 1,500), and $p_k$ and $\pi_k$ are the proportions of genes characterized by biological category (or chromosome number) $k$ in a given gene set (e.g. gene-set i) and in the null distribution (i.e. all genes in PANTHER), respectively.
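A small numerical sketch may help make the enrichment score concrete. It assumes the Z-score takes the usual one-sample form for a proportion, (proportion in the gene set minus proportion in the background) divided by sqrt(background proportion × (1 − background proportion) / N). The function name `enrichment_z`, the dictionary-based annotation, the restriction to a single category per gene, and the toy data are illustrative assumptions and do not reflect the PANTHER file format or the authors' implementation.

```python
import math
from typing import Dict, Iterable


def enrichment_z(gene_set: Iterable[str], annotation: Dict[str, str],
                 category: str, n_pairs: int) -> float:
    """One-sample proportion Z-score for a category within a gene set.

    annotation maps every background gene to a single category (a simplification);
    n_pairs is N, the number of classifier gene pairs.
    """
    genes = list(gene_set)
    if not genes or not annotation or n_pairs <= 0:
        raise ValueError("gene set, annotation, and n_pairs must be non-empty/positive")
    # Proportion of the gene set annotated with the category.
    p_set = sum(annotation.get(g) == category for g in genes) / len(genes)
    # Proportion of the background (all annotated genes) with the category.
    p_null = sum(c == category for c in annotation.values()) / len(annotation)
    if p_null in (0.0, 1.0):
        raise ValueError("background proportion must lie strictly between 0 and 1")
    return (p_set - p_null) / math.sqrt(p_null * (1.0 - p_null) / n_pairs)


# Toy background: 10 genes, 2 of which are annotated 'Immunity and Defense'.
annotation = {f"g{i}": ("Immunity and Defense" if i < 2 else "Other") for i in range(10)}
toy_gene_set = ["g0", "g1", "g5", "g6"]  # 2 of 4 genes fall in the category
z = enrichment_z(toy_gene_set, annotation, "Immunity and Defense", n_pairs=4)
# Here the set proportion is 0.5, the background proportion is 0.2, and z is approximately 1.5.
```

The same function applies to chromosome enrichment if the annotation maps each gene to its chromosome number instead of a biological process.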
Among the 18 major biological processes in the PANTHER database, 'Immunity and Defense' was the most strongly enriched biological process in gene-set i (Figure S4a). Strong enrichment in 'Immunity and Defense' for the genes expressed relatively higher in GBM reflects the frequently observed presence of chronic inflammation in highly malignant cancers [2], such as GBM [3].
In a tumor-associated inflammatory micro-environment, immune cells infiltrate the tumor and secrete reactive oxygen species. This can cause further oxidative DNA damage and oncogenic mutations, including amplification of oncogenes or deletion of cell-cycle regulators, and thereby facilitate cancer progression, survival, and migration. Our results show that a relatively highly inflamed tumor environment, characterized by deep infiltration of immune cells and by tumor cells exhibiting functions that combat such oxidative conditions, is the most representative pathophysiological trait that differentiates GBM tumors from OLG tumors.
'Neuronal Activities' was the most enriched biological process in gene-set j, the group of genes that are expressed lower in GBM than in OLG (Figure S4b). This functional category includes basic activities of nerves and neurons, such as synaptic transmission, neurotransmitter release, and action potential propagation. It has been shown that GBM cells release glutamate, an amino-acid neurotransmitter [4,5]. Elevated extracellular glutamate is followed by acute degeneration and death of neurons [4,5,6], a process known as excitotoxicity, which is one of the underlying causes of tumor-associated epileptic seizures and neuro-cognitive deficiencies in glioblastoma patients [7]. Glutamate excitotoxicity in GBM may therefore account for the lower normal synaptic transmission and neural function relative to OLG, as suggested by our enrichment results for gene-set j. Glutamate neurotoxicity has also been implicated in other neurodegenerative diseases, including stroke and Alzheimer's disease [8].
Applying the same enrichment analysis strategy described above to chromosome number, we looked for associations between our expression data and gene copy-number alterations frequently observed in GBM and OLG. The genes in gene-set i and gene-set j were most enriched in Chromosome 1 (Figure S4c) and Chromosome 10 (Figure S4d), respectively. The loss of Chromosome 10 is one of the most frequent genetic aberrations in GBM [9,10], causing the expression of its genes to be heavily suppressed. This offers a possible explanation for the overrepresentation of Chromosome 10 genes in the gene set that is expressed relatively less in GBM ('gene-set j'), and thereby higher in OLG. The deletion of the short arm of Chromosome 1 is a hallmark feature of OLG [11,12], which we suspect caused the overrepresentation of Chromosome 1 genes in the set that is expressed relatively less in OLG ('gene-set i'), and thereby higher in GBM.
The results from our global enrichment analysis of biological processes and chromosome numbers display the relative differences between the collective properties of the two diseases. This offers a holistic view of the major trends that underlie the classifiers' relative expression reversal behavior, which could not have been detected by studying the gene pairs in our marker panel alone. It is worth noting that we did not observe these same enrichment properties in the GBM-node or OLG-node classifiers in Table 2. This reflects a clear limit to the extent to which disease properties can be explained using only the minimal number of classifier genes, since those gene pairs were chosen solely to obtain the smallest set with the highest predictive accuracy, regardless of biological relevance.