Genome-Wide Methylation Analyses in Glioblastoma Multiforme

Few studies had investigated genome-wide methylation in glioblastoma multiforme (GBM). Our goals were to study differential methylation across the genome in gene promoters using an array-based method, as well as repetitive elements using surrogate global methylation markers. The discovery sample set for this study consisted of 54 GBM from Columbia University and Case Western Reserve University, and 24 brain controls from the New York Brain Bank. We assembled a validation dataset using methylation data of 162 TCGA GBM and 140 brain controls from dbGAP. HumanMethylation27 Analysis Bead-Chips (Illumina) were used to interrogate 26,486 informative CpG sites in both the discovery and validation datasets. Global methylation levels were assessed by analysis of L1 retrotransposon (LINE1), 5 methyl-deoxycytidine (5m-dC) and 5 hydroxylmethyl-deoxycytidine (5hm-dC) in the discovery dataset. We validated a total of 1548 CpG sites (1307 genes) that were differentially methylated in GBM compared to controls. There were more than twice as many hypomethylated genes as hypermethylated ones. Both the discovery and validation datasets found 5 tumor methylation classes. Pathway analyses showed that the top ten pathways in hypomethylated genes were all related to functions of innate and acquired immunities. Among hypermethylated pathways, transcriptional regulatory network in embryonic stem cells was the most significant. In the study of global methylation markers, 5m-dC level was the best discriminant among methylation classes, whereas in survival analyses, high level of LINE1 methylation was an independent, favorable prognostic factor in the discovery dataset. Based on a pathway approach, hypermethylation in genes that control stem cell differentiation were significant, poor prognostic factors of overall survival in both the discovery and validation datasets. Approaches that targeted these methylated genes may be a future therapeutic goal.


Introduction
Cancers are now recognized as driven as much by epigenetic as well as genetic changes [1]. Among epigenetic alterations that occur during oncogenesis, aberrant gene promoter hypermethylation is the most commonly investigated. However, there have been few studies that evaluated differential promoter methylation across the entire genome in glioblastoma multiforme (GBM), which is the most common type of malignant brain tumors in adults [2][3][4][5]. The primary goal of some studies, such as the Cancer Genome Altas Project (TCGA), was to characterize methylation patterns in tumors and to correlate with other genomic alterations such as gene mutations, copy number alterations and expression [6]. The investigation of differential methylation poses a challenge, because unlike colon, breast or prostate cancers, it is not possible to obtain matching ''normal'' tissues during surgery for GBM. The alternative method, which is to procure a substantial number of unrelated normal brain tissues for comparison, is also challenging. Moreover, previous reports on genome-wide methylation in normal brain tissues showed methylation patterns varied between neuro-anatomically distinct regions, and methylation level may change in the brain with increasing age [7][8][9]. Thus, an accurate profile of differential methylation will require appropriate control tissues with age and neuro-anatomical distribution matching those of glioma subjects.
Compared to genome-wide methylation near gene promoters, methylation derangement in the repetitive elements of the GBM genome was even less studied. Repetitive elements may comprise over two-thirds of the human genome, and a high proportion of them are retrotransposons, whose expression is normally suppressed by methylation of cytosine [10]. Retrotransposons become hypomethylated early on in oncogenesis. This can lead to transposable elements insertion, and some of them, such as L1, can express their RNAs, which then promote DNA damage, spreading of methylation to promoters and genomic deletions [11,12]. Despite their abundance and importance in tumorigenesis, the sequences and maps of repetitive elements in the genome have been difficult to ascertain, because repeats created ambiguities in alignment and in genome assembly [13]. Nevertheless, surrogate markers that estimate global cytosine methylation content, which indirectly reflects methylation levels in repetitive elements due to high CpG contents in those regions (.65% of total genomic CpGs), have been developed and used to study cancer risk, tumor stage, relationship to other molecular phenotypes and prognosis [14][15][16][17][18]. One study that measured 5-methylcytosine content using a methyl acceptance assay in one epileptic specimen and 10 GBM tissues showed global hypomethylation in tumors [19]. The methylome of other cancers had showed concurrent global hypomethylation and gene promoter hypermethylation [20].
This study had three primary objectives. First, we explored differential methylation of gene promoters/CpG islands across the genome, evaluating more than 14,000 genes at single CpG resolution. To accomplish this goal, we used standard non parametric and biological pathway based analytical approaches to compare primary GBM (de novo) with a substantial number of representative normal brain tissues. Second, we investigated genome-wide methylation level, which included CpG methylation levels in the repetitive elements, as potential diagnostic marker in GBM. We characterized and compared changes in LINE1 (L1 retrotransposon), 5 methyl-deoxycytidine (5m-dC) and 5 hydroxylmethyl-deoxycytidine (5hm-dC). Analysis of LINE1 is widely used as a marker of global cytosine methylation level [10,18]. Analysis of 5m-dC gives a broader and more accurate measure of global methylation across the genome. 5hm-dC is an oxidized product of 5m-dC generated by the a-ketoglutarate-dependent TET dioxygenases [21]. One report showed that 5hm-dC was strongly depleted in glioma and other cancers [21]. Third, we evaluated the prognostic values of methylation pathways and global methylation markers in a multi-variable Cox proportional hazard model, adjusted for IDH1 mutation, GCIMP status, MGMT methylation and other clinical factors.

GBM and Brain Control Tissues in the Discovery Dataset
This study was approved by the institutional review boards (IRBs) of Columbia University (CUMC) and Case Western Reserve University (CWRU). Participants provided written informed consents. Primary GBMs were retrieved from each institution's biorepositories. All tumor tissues were snap-frozen immediately post resection and were examined neuropathologically. Only tissues with an estimated 80% tumor nuclei and less than 50% necrosis were accepted for DNA extraction and subsequent methylation analyses. Fifty-four (54) de novo GBM (40 from CUMC and 14 from CWRU) passed these criteria and were included in this study. Control brain tissues were obtained from the New York Brain Bank. We retrieved 24 post-mortem, freshly frozen control tissues from 24 unique individuals. All control tissues had been previously examined by a neuro-pathologist and were verified to be without pathological evidence of other neurological or psychiatric diseases. These 24 brain controls and the aforementioned 54 GBMs comprised the discovery dataset. The data of our discovery dataset was deposited into GEO (accession # 50923).

GBM and Control Brain Tissues in the Validation Dataset
A validation GBM dataset was retrieved from the publicly available Cancer Genome Atlas Data Portal (TCGA data portal). This dataset comprised of 163 GBM samples submitted at initial diagnoses and were analyzed with the Illumina Methylation27 platform (CWRU cases were excluded). The data were from batches 16, 20, 26, 38 and 62, which were not included in the previous TCGA marker paper on GBM methylation [6]. The four control brain tissues used for that publication were not part of the TCGA dataset and were not available for download in the TCGA Public Portal (personal communication Daniel Weinsberger). Instead, we retrieved 140 publicly available brain tissue controls from GEO accession (# 15745) and dbGAP (phs000249.v1.p1) for comparison with TCGA tumors. This cohort of control brain tissues were obtained from consented subjects, at the time of autopsy, at the University of Maryland, Johns Hopkins University and the National Institute of Aging. They were examined neuropathologically to be without any intra-cranial pathology. The methylation results using Illumina Methylation 27 K platform were published previously [22].
In addition to having a validation dataset, we also validated the most significantly methylated CpG sites via pyrosequencing experiments. We chose those top sites that not only passed our FDR adjusted criteria but also showed at least 4 fold increased or decreased in methylation compared to control tissues. For correlation of validated methylation probes with gene expression, we used the corresponding TCGA gene expression dataset for the same 162 GBM patients (Agilent 244k Custom Gene Expression G4502A-07) to calculate overall Spearman correlation coefficients.

DNA Methylation and Illumina Infinium Human Methylation 27 K Platform
DNA was extracted by standard proteinase K/RNase treatment and phenol/chloroform extraction. Bisulfite modification of 1 mg of DNA was conducted using an EZ DNA Methylation Kit (Zymo Research, Irvine, CA). The HumanMethylation27 DNA Analysis BeadChips (Illumina) were used to interrogate 27,578 highly informative CpG sites at single nucleotide resolution, covering 14,495 genes. The array hybridization was conducted under a temperature gradient program, and the array was imaged using a BeadArray Reader. Image processing and intensity data extraction was performed as described previously [23].
LINE-1 DNA methylation levels were determined by pyrosequencing as previously described [25][26][27]. Each set of amplifications included bisulfite-converted CpGenome TM (Millipore) universal methylated, unmethylated and non-template controls. Percent methylation within a sample was subsequently determined by averaging across all three interrogated CpG sites. Non-CpG cytosine residues were used as internal controls to verify efficient sodium bisulfite DNA conversion. The inter-and intra-assay coefficients of variation were 1.90 and 1.30%, respectively. All samples were run blinded to tissue status.

Determination of Glioma CpG Island Methylator Phenotype (GCIMP) Status
In both the discovery and validation datasets, we found those probes in the Illumina 27 K array that corresponded to the validated GCIMP markers as documented by Noushmehr et al [6]. They were ANKRD43, HFE, MAL, LGALS3, FAS, RHO-F and DOCK5. Although the paper documented 2 FAS markers:  Figure 1. Heat map of differential methylation in the discovery dataset. Heat map based on a set of 1864 CpG sites that significantly segregated GBM and control brain tissues into six methylation classes in the discovery dataset. Methylation class numbers are marked inside the annotation bar. The heat map columns represented CpG probes, and the rows are tumor and control brain samples. In the color scale, relative hypermethlyation is denoted by a shift towards the red color, and relative hypomethylation towards blue. Neutral methylation is gray. doi:10.1371/journal.pone.0089376.g001

IDH1 Mutation
For GBM samples in the discovery dataset, IDH1 mutation status was determined via pyrosequencing. The portion of IDH1 spanning codon 132 (75 bp amplicon) was amplified. Forward primer was 59-GCTTGTGAGTGGATGGGTAAA-39 and biotinylated reverse primers was 59-TTGCCAACATGACTTACTT-GATC-39. Polymerase chain reaction (PCR) and pyrosequencing assays were performed as previously described [28]. Pyrosequencing primer provided sequence data that included codon 131 and the first nucleotide of codon 132 (59-GGGTAAAACCTATCAT-CATA-39). Negative controls were run with all subjects' samples. Sequence data were analyzed using PyroMark Q24 software.
For GBM samples in the TCGA validation dataset, we determined their IDH1 mutation status by examining their level 2 DNA sequencing data, which was generated using Illumina's Genome Analyzer (GA).

MGMT Methylation
For MGMT methylation status in the discovery dataset, we chose pyrosequencing as the analytical method, because previous studies showed that it provided the best prognostic value, cost effectiveness and ease of use [29]. Seven CpG sites in the promoter region of MGMT were selected based on previous validations, with a Qiagen kit (PM00149702) [30]. Polymerase chain reaction (PCR) was performed in a 25-ul reaction mix containing 50 ng of bisulfite-converted DNA, 1x Pyromark PCR Master Mix (Qiagen), 1x Coral Load Concentrate (Qiagen), and 0.3-uM forward and 59 biotinylated reverse primers, using the cycling conditions and amplifications as outlined previously [31]. Each set of amplifications included bisulfite-converted CpGenome TM universal methylated (Millipore, Billerica, MA), unmethylated (whole genome amplified DNA), and non-template controls. The sequencing reaction and quantitation of methylation was conducted using a PyroMark Q24 instrument and software (Qiagen). Percentage methylation was calculated by averaging across all CpG sites interrogated. As percentage methylation is a continuous variable, we converted it to a binary variable using a ''cutoff'' to facilitate clinical interpretation. There has been no established consensus cut-off for pyrosequencing percentage values, but in normal brain tissues, average MGMT promoter methylation ranges between 0% and 10% [29]. Thus, as reported previously, we used 14% as the threshold to distinguish unmethylated from methylated MGMT promoter in a given tumor [15].
Since TCGA did not separately provide MGMT methylation level of their GBMs, we used the CpG probes on the Illumina 27 K array to determine these tumors' MGMT methylation status. Two validated MGMT probes, cg12434587 and cg12981137, were used in prognostic models as a continuous variable, because a previous study confirmed their prognostic and classification properties [32].

Pyrosequencing Validation of Differentially Methylated Genes
For other significant CpG sites that were differentially methylated, the regions selected for interrogation covered the particular CpG sites on the Illumina arrays as well as surrounding sites. PCR and pyrosequencing assays were as described above using Qiagen kits. Primers were included in Table S1. Percent methylation of each gene was calculated by averaging across all CpG sites interrogated.

Statistical Methods
Data assembly. Each methylation data point represents the fluorescent signals from the M (methylated) and U (unmethylated) alleles. Background intensity was computed from a set of negative controls and was subtracted from each analytical data point. The ratio of fluorescent signals was then computed from the 2 alleles to reflect the fractional methylation level at each CpG site (b-value), which is between 0 and 1 as the proportion of methylation for a given CpG site. Beta values were generated using Illumina BeadStudio software. For quality control, methylation measures with a detection P value .0.05 and samples with CpG coverage , 75% were removed (for 7 probes total). All X and Y chromosome probes (including 1,085 in X and 7 in Y) were dropped, leaving 26,486 probes for all further analyses. We performed two major types of analyses: 1. Locus by locus comparison between GBM and control brain tissues; 2. Unsupervised hierarchical clustering of tumors and control tissues.
Locus-by-locus analyses. For both the discovery and validation datasets, we first filtered out those CpG sites with median |Db| ,0.2, as studies in the past had shown that this methylation array cannot accurately detect b difference at or below 0.17 [23]. Then we used the non-parametric Wilcoxon Rank Sum test to compare each CpG site's methylation levels between normal brain tissues and controls; Benjamin-Hochberg false discovery rate (FDR) was used to adjust for multiple comparisons. Significance level was set at FDR #0.05. Due to influence on differential methylation by neuroanatomical region and age, we also performed an adjusted analysis using a published method based on logistic regression: logit (P) = m ij +A*b ij +B*age j + C*location j +e ij , where P is the probability to be a tumor; b = beta for the CpG probe i of sample j; m = intercept for the CpG i of sample j; age = age of the patient from sample j; location = brain location of sample j; e = error term of the CpG i of sample j [33]. Histograms were generated to show median |Db| distributions of hyper-and hypomethylated CpG sites.
Unsupervised hierarchical clustering. To explore data patterns, we performed unsupervised hierarchical clustering on those differentially methylated CpG sites (FDR #0.05) using Euclidean distance metric and Ward linkage. The same clustering algorithm was applied to the discovery and validation datasets. To further reduce the dimensionality of our datasets, we also used principal component analyses (PCA) with correlation matrix for data reduction.
Biological pathways involved in differential methylation. Ingenuity Pathway Analysis (IPA, Ingenuity Systems, Redwood City, CA) was used for canonical pathway analyses of those validated, differential genes. This bioinformatics tool was used to provide insights into the most involved biological pathways in tumorigenesis based on DNA methylation alterations.
Correlation of global methylation markers with methylation classes. We compared LINE1, 5m-dC and 5hm-dC levels among methylation classes using the non-parametric Kruskal-Wallis Test, as prior analyses had shown that these markers were not normally distributed [26]. Post-hoc pairwise comparisons after significant Kruskal-Wallis Tests were conducted using the Tukey HSD test. Survival analyses. Prognostic assessments were performed separately for the discovery tumor dataset and the TCGA validation dataset, using Cox proportional hazard regression models. We investigated the value of methylation biomarkers other than MGMT or GCIMP, such as LINE1, 5m-dC and 5hm-dC, as potential independent prognostic factors. Moreover, biological processes do not act through the effect of a single gene but are the results of combined influences of many genes in a relevant pathway. Thus we also explored the net effect of the most important methylated pathways, such as those discovered by IPA, as prognostic factors. To achieve this goal, we calculated an index for a top pathway, which is a combination of the cross product of beta values and univariable Cox regression coefficient of each involved genes in that pathway. The index for the pathway was then evaluated in regression models. This method was previously published in other pathway-based survival studies using genomewide microarrays [34]. Standard clinical and molecular pathology information included MGMT methylation, GCIMP status, IDH1 mutation, age at diagnoses, Karnofsky performance score (KPS) at the time of diagnoses, extent of surgical resection, bevacizumab use at recurrence or tumor progression and the center which provided the specimens. In the discovery dataset, concomitant chemo-radiation was not included in survival model, as all patients received combined treatment. In the TCGA dataset, there was no information on extent of surgery and no global methylation biomarkers.
Each prognostic factor was first evaluated in a univariable Cox proportional hazard model analyses. Those factors that reached significance levels of p#0.1 were entered into the multivariable Cox model. All prognostic factors in the multivariable model were then removed one by one via backward elimination if the covariate p value is .0.05. This process continued until covariates kept in the model were all significant. Then the eliminated factors were added one-by-one back into the model to ensure that they were not significant in the multivariable model. The final model consisted of all significant factors (p#0.05) in the presence of each other. For each covariate, proportional hazard assumption was tested by plotting scaled Schoenfeld residuals against the natural log of time.

Demographics of GBM Cases and Normal Brain Controls
The clinical, demographic and pathological characteristics of GBM cases and brain controls are detailed in Table 1. Most GBM and control tissues came from subjects over age 50. Controls from GEO/dbGAP had median age younger than other groups. More men than women were represented in both tumors and controls. Frontal, parietal and temporal lobes represented the most common anatomical locations of both tumors and control tissues. For the group of brain controls retrieved from GEO/dbGAP, methylation data from four brain locations: frontal, temporal, pons and cerebellum were available. We only used frontal and temporal brain controls for comparison with TCGA glioblastoma, because these tumor tissues were mostly from frontal and temporal lobes. For NYBB brain controls, the frozen postmortem interval (PMI), which was calculated from the subject's reported time of death to the time the brain was processed, was a median of 5 hours; this time interval was shorter than that of 14.5 hours of dbGAP brain controls or other post-mortem brain tissues [8,9]. Causes of death for NYBB controls were cardiac (n = 14), pulmonary (n = 4), renal (n = 2), trauma (n = 2), cholangiocarcinoma (n = 1; no brain metastases) and unknown (n = 1). The causes of death for GEO/ dbGAP controls were unknown.

Differential Methylation between GBM and Control Brain Tissues in the Discovery Dataset
Methylation in 1864 CpG sites, corresponding to 1639 genes, differed significantly between GBM and normal brain tissues in the discovery dataset. Unadjusted and adjusted analyses essentially gave the same CpG list. Table S2 shows the complete list of unadjusted, differentially methylated CpG sites in the discovery dataset. Overall 1389 CpG sites (1175 genes) were hypomethylated in the tumors relative to controls, and 475 CpG sites (464 genes) were hypermethylated. The top 10 most differentially hypomethylated and hypermethylated CpGs are presented in Table 2 and 3, respectively. Figure 1 shows two key features of the unsupervised hierarchical clustering analysis. First, tumors and controls segregated into six classes, with five classes of tumors and one of control. Control brains did not form subgroups based on age or tissue of origin. Class 3 is the dominant tumor class with 29 subjects. Class 5 contained four tumors that were positive for the Glioma CpG Island Methylator Phenotype (GCIMP+). GCIMP status was verified using the markers described by Noushemir et al ( Figure S1a). As previously reported, subjects with GCIMP+ tumors were significantly younger than GCIMPsubjects (p = 0.007). Pyrosequencing analyses showed that five tumors harbored mutations in IDH1: four R132H and one R132L mutations. Four of these five IDH1 mutated tumors corresponded to the four GCIMP+GBMs but one was a GCIMP negative tumor.
The second feature of the heatmap showed that hyper-and hypomethylated CpG sites separated well into two respective blocks (see columns in Figure 1). Overall, nearly 70% differentially methylated probes were relatively hypomethylated in the tumors. Among tumor classes, each class showed variations in the pattern or degree of hypo and hyper methylation. Class 1 tumors appeared to have higher degree of hypomethylation than other tumor classes. Class 3 tumors showed the clearest transition from hypermethylated to hypomethylated CpG blocks. Compared to brain controls, Class 5 (GCIMP+) is only hypermethylated at discrete loci. Figure 2a illustrates the range of values of median |Db| in the discovery dataset. There were more hypomethylated than hypermethylated CpGs at moderate |Db| between 0.2 and 0.49; however, hypermethylated CpGs predominated when |Db| .0.5. Figure 3a illustrates principal component analyses (PCA) of the discovery dataset. The 1864 significant CpGs can be reduced to 40 orthogonal principal components (PC) that explained 95% of the variance of the dataset, with the first three PC explained 67% of the variance. Overall, controls clustered tightly together, whereas GBM showed wide dispersion in space due to increase in tumor variance. Each of the 5 tumor classes had its own elliptical plane that is orthogonal to each other, though some members of the classes overlapped each other at the periphery. Figures S2a shows the top down view of PCA analyses. It illustrates the posterior position of the GCIMP+ group, which was apart from other methylation classes but was difficult to fully appreciate from the frontal view. Table 4 shows the correlations in methylation level between MGMT and 5 differentially methylated CpG sites from the discovery dataset using Illumina's BeadChip and pyrosequencing assays. Correlations overall using Spearman's rho statistics was .0.8. Thus our results supported previous reports of excellent validations of this methylation array technology using pyrosequencing [3,31].

Results of Validation Dataset and Pyrosequencing Validation
Comparison of 163 TCGA GBMs with 140 publicly available controls showed 2445 CpG sites (2018 genes) were differentially methylated between GBM and normal. Table S3 shows the list of differentially methylated CpGs. There were 1625 hypomethylated CpGs (1368 genes) and 820 hypermethylated CpG sites (650 genes). Figure 4 shows the heatmap of the validation dataset, which also demonstrates hyper and hypomethylated probes formed two separate blocks. Tumors and controls were clustered into 7 classes. Similar to the discovery dataset, there were 5 classes of tumors but 2 of controls. Control subjects in Class 6 were significantly older than those in Class 7 (p,0.03). Class 1 tumors contained CpG sites that acquired a higher degree of hypomethylation. Class 5 had 13 GBMs that were GCIMP +. Again, relative to controls, hypermethylated CpGs were located in discrete loci. Figure S1b showed unsupervised hierarchical clustering that identified these 13 GCIMP+tumors, using markers as described by Noushemir et al. Our results were also confirmed by those reported in the TCGA Wiki. Out of these 13 GCIMP+ tumors, 6 were IDH1 mutated. Similar to the GBMs in our discovery dataset, the distribution of median |Db| ranges showed hypomethylated CpGs were more prevalent in the moderate |Db| range. But at |Db| .0.5, there were more hypermethylated probes (Figure 2b). Figure 3b shows the PCA analysis of the validation dataset, which reduced 2445 significant CpGs to 112 orthogonal PC that explained 95% variance of the dataset, and the top 3 PC accounted for 69% of the total variance. Similar to our discovery dataset, controls clustered tightly together, whereas GBM showed a wide range of methylation variability. Supporting Figure S2b shows the top down view of the PCA analyses, which also illustrates the posterior and separate position of the GCIMP+ group. The physical relationship between the two control groups is better visualized in this view as well.   Overall, 1548 CpG sites (1307 genes) in the validation dataset overlapped with the 1864 CpG sites (1639 genes) from the discovery dataset (83%). Of the 1307 validated genes, 905 were hypermethylated and 402 were hypomethylated. These differentially methylated CpGs and corresponding genes are included in Table S4.
Out of 1307 validated, differentially methylated genes, 1130 had available matched mRNA expression in the TCGA GBM data files (Agilent 244k Custom Gene Expression G4502A). All 163 TCGA GBM cases had corresponding methylation and mRNA expression data. The overall Spearman's rho was 20.42 (95 CI 20.54, 20.29, p-values = 0.018) for hypermethylated genes, and 20.28 (95% CI 20.42, 20.13, p value = 0.043). In the set of hypermethylated genes, 71. 89% of methylation-expression pairs showed significant inverse correlations (p#0.05), whereas 55.36% of hypomethylated genes-expression pairs were inversely related. Hence, gene expression and methylation intensity were negatively correlated for these significant genes, but correlation appeared to be stronger for hypermethylated genes.

Biological Characteristics of Validated Genes and Involved Pathways
IPA analyses were conducted separately for hypomethylated and hypermethylated genes. Our results showed that the top 10 significant canonical pathways involved in hypomethylation were all related to immune system functions (Figure 5a). In contrast, only 5 pathways were significant among hypermethylated genes ( Figure 5b); the top one influenced embryonic stem-cell pluripotency, but other significant pathways were involved ubiquitously in cell signaling, such as cAMP and G-protein.
We then used the top two hypomethylated pathways, which are related to granulocyte and agranulocyte adhesion and diapedesis and created a pathway index as mentioned previously in the Materials and Methods section. This immune index included CXCL10, C5AR1, CCL7, MYL4, ICAM2, IL1b, MMP14, SELE, IL18, IL1R1, MMP3, MMP19, CDH5, MYH4, ITGB2 and CCL11. With the same method, we also created an index for stem cell pluripotency, which was the top hypermethylated pathway. The embryonic stem cell (EST) index included GATA4, GATA6, NEUROG1, HOXB1, ISL1, FOXD3, GBX2 and MYF5. These indices were later used in survival analyses (see below).
We investigated targets of polycomb repressive complex 2 (PRC2) or histone 3 lysine 27 trimethylation (H3K27me 3 ) in  . Levels of global methylation markers among methylation classes. a. Levels of 5m-dC between brain controls and tumors, and among methylation classes. Red, yellow and green lines (dotted) denoted pairwise comparison between two classes and the P values of their comparisons. b. Levels of LINE1 between brain controls and tumors, and among methylation classes. c. Levels of 5hm-dC between brain controls and tumors, and among methylation classes. doi:10.1371/journal.pone.0089376.g007 human embryonic stem cells (hESC). To quantify the degree of enrichment in our validated genes, we queried CHIP-seq datasets of H1-hESC (Tier 1) in ENCODE and from published papers [35,36]. We downloaded lists of genes that are targets of H3K27me 3 and PRC2, which included Suppressor of Zeste 12 Homolog (SUZ12), Embryonic Ectoderm Development (EED) and Enhancer of Zeste Homolog 2 (EZH2). The resulting lists of targets were matched to our validated gene list. Overall, 164 of 402 validated and hypermethylated genes (40.80%) were targets of at least one of PRC2 or H3K27me 3 , whereas 53 of 905 hypomethylated genes (5.86%) were their targets. Hypermethylated genes were enriched with PRC2 or H3K27me 3 targets (x 2 = 245.42, df = 1, p = 0.0001). In total, 217 of 1307 genes (16.60%) were targets of PRC2 or H3K27me 3 . Figure 6 illustrates the frequency and overlap of enrichment of PRC2 and H3K27me 3 in our validated, differentially methylated CpG sites using a Venn diagram.

Correlation of Methylation Classes with Biomarkers of Global Methylation Levels
LINE1, 5m-dC and 5hm-dC levels were all significantly lower in GBMs compared to control brain tissues (p,0.0001 for all 3 markers). However, 5m-dC level was most capable in discriminating among various methylation classes (Figure 7a). In both 5m-dC and LINE1, global methylation levels were lowest in Class 1 tumor, and their levels successively rose from Class 1 to 5 (Figure 7a and 7b). Class 4 and 5 tumors had 5m-dC and LINE1 levels that were not statistically different from those of control brains. With respect to 5hm-dC levels, tumors were uniformly low compared to control tissues, but there were no differences in levels among tumor classes. (Figure 7c).

Survival Analyses
Univariable and multivariable survival analyses results are shown in Table 5 and 6, respectively. Overall the median survival of the discovery population is 20.09 months (IQR: 9.11-34.39 months). In this dataset, age at diagnosis, KPS, study center (CWRU versus Columbia), LINE1 methylation level, MGMT methylation (50% methylated in control brains and 67% methylated in tumors), immune index and ESC index were all statistically significant prognostic factors in univariable analyses. Gross total resection and methylation Class 4 and 5 (GCIMP) showed trends towards favorable prognoses in the Univariable Cox models. When these variables were included in a multivariable Cox proportional hazards model, high level of LINE1 methylation (higher level of genomic stability), methylated MGMT, along with high KPS and gross total resection were all significant favorable prognostic factors. High levels of methylation in genes that affect stem cell pluripotency or promote differentiation, as indicated by higher score of the ESC index, remained a significant poor prognostic factor. However, age at diagnosis, GCIMP, study center and immune index were no longer significant in the presence of these other variables.
We found 3 factors that were strongly associated with more advanced age at diagnoses and might have eliminated its effect in multivariable survival analyses: 1. A higher ESC index (p = 0.001); 2. A higher immune index, which denotes less hypomethylation in genes that control leukocyte trafficking (p = 0.032); 3. Low LINE1 methylation levels (p = 0.05). Due to our limited sample size and only 4. GCIMP+ tumors, we were unable to confirm GCIMP's prognostic ability. However, GCIMP+ tumors showed strong correlation with MGMT methylation (p = 0.017).
The median survival of the TCGA validation dataset was 14.53 months (IQR: 7.63-21.30 months). Univariable Cox model showed that age at diagnosis, KPS, concomitant radiation with temozolomide (TMZ), treatment with Bevacizumab, study center 41, MGMT methylation, immune index and ESC index were all significant prognostic factors on their own. When the dominant Methylation Class 2 was served as a reference, Class 1, 3 and 5 (GCIMP) showed favorable prognoses, with GCIMP+ group having the best prognosis. In the multivariable model, younger age at diagnosis, higher KPS, concomitant treatment with radiation and TMZ and methylation Class 3 were significant favorable prognostic factors. Higher scores of the immune index or less degree of hypomethylation in immunity related genes, high ESC index and study center 41 remained significant poor prognostic factors. Of note, methylation Class 5 (GCIMP) was no longer significant in the multivariable model. Instead, methylation Class 3 (with Class 2 as reference) showed a favorable survival after adjustment by other covariates.
When we evaluated factors that most influenced GCIMP status in logistic regression, younger age of onset and a lower score in the immune index (higher degree of hypomethylation) were the strongest factors associated with the GCIMP+ group (p = 0.03 and p = 0.01, respectively). Thus, in multivariable survival analyses, these 2 factors had overshadowed GCIMP as more significant prognostic factors. Similarly, MGMT methylation level based on probes on the Illumina array was no longer a significant prognostic factor in the multivariable survival model. Again, high level of demethylation in immune related genes might have accounted for MGMT's effect, as MGMT methylation was most strongly related to a lower score in the immune index (p = 0.007) and also GCIMP+ status (0.01).
Based on our analyses, the ESC index is a novel pathway-based biomarker for overall survival, and we were able to validate its prognostic significance in the TCGA dataset. Consistent between the two study populations, higher degree of methylation in genes that promote differentiation of stem cells is a poor prognostic factor. Figures 8a and 8b show the adjusted Cox survival curves based on EST index at 25 th , 50 th and 75 th percentiles, for both the discovery (a) and validation cohorts (b). In both datasets, tests of proportional hazard assumptions on all covariates did not show any violation of this assumption.

Discussion
This study validated more than 1500 differentially methylated sites and discovered 5 patterns of methylation changes across tumor samples in both the discovery and validation datasets. To our knowledge, it included the largest numbers of control brain tissues used for the investigation of differential methylation in glioma. The increase in brain control samples have helped us to better detect epigenetic alterations in de novo GBM. However, our PCA scatterplot illustrated that tumor methylation showed a wider amount of variability compared to the variability in controls. This finding may agree with that of another study that evaluated 139 cancer-specific differentially methylated regions (cDMRs) using a custom Illumina bead array. The investigator showed that differential methylation was characterized by increased stochastic variation in methylation level within each tumor type, suggesting a general disruption of the integrity of the cancer epigenome [33].
One feature of our differential methylation analyses is that there were more than twice as many hypomethylated CpGs as hypermethylated loci in the tumors. This finding is supported by differential methylation studies in other types of cancers, which suggested that hypomethylated loci are at least as numerous as, or often more abundant than hypermethylated CpGs [31]. However, hypomethylated loci, though more numerous, tended to show more moderate b changes compared to controls, whereas hypermethylated CpGs manifested larger changes even though they were fewer in numbers. A reason for this phenomenon may relate to differences in epigenetic remodeling, such as changes in chromatin marks, which lead to gene promoter hypomethylation versus those chromatin alterations that affect hypermethylation [37]. There were only 4 GCIMP+tumors and 5 IDH1 mutant tumors in the discovery dataset. This supported the finding that GCIMP and IDH1 mutations are uncommon findings in de novo GBM. In both datasets, strong correlation existed between methylated MGMT, or younger age of onset and GCIMP+ status, which confirmed findings from previous investigations [2,6]. However, hypermethylation relative to controls were found in discrete loci, or ''blocks'' and did not appear to be uniform across all CpGs. This may be due to the fact that brain controls were already hypermethylated in many CpG sites. Moreover, as one recent systematic review pointed out, there has been a lack of consensus on the precise definition of CIMP in cancers [38].
For genes that were differentially methylated in our dataset, we were able to demonstrate that different pathways are involved in hypo-and hypermethylated genes. We found genes that regulate the immune system, affecting both innate and cellular immunities, were aberrantly hypomethylated in the tumors. Previous publications had focused primarily on gene hypermethylation; thus this finding will hopefully prompt future studies on how immune pathways, through epigenetic alterations, will relate to the generation of immune-suppressive tumor environment, or the host's ability to detect and eliminate GBM. Our IPA canonical pathway results showed different pathways were affected by hypermethylation. The top pathway confirmed that epigenetic regulation was crucial to maintenance of stem cell pluoripotency in GBM. Related to this finding is that almost 40% of hypermethylated CpG sites were targets of PRC2 or H3K27me 3 . PRC2 was up-regulated in glioblastoma stem-like cells [39]. PRC2 targeted developmentally important genes, induced compact chromatin, repressed expression of target genes and maintained ''stemness'' in embryonic stem cells [40]. These same genes were targets of hypermethylation in cancer, via transformation from a polycomb-dependent silencing to methylation-dependent silencing during cancer development. Moreover, it appeared that the DNA demethylating agent, 5-deoxy-azacytidine (DAC) was able to reverse methylation and induce gene expression in cancer cells that were marked by both repressive chromatin marks (positive for H3K27me 3 ) and methylation, but histone deacetylase inhibitor (HDACi) was not able to re-activate these genes [1,36]. These results indicated that hypermethylated genes in cancer, even if they were maintained in a suppressed state by polycomb marking, are competent to reactivate upon removal of the methylation mark. Nevertheless, in GBM, demethylating agents have not been considered in clinical trials because promoter MGMT methylation is related to temozolomide (TMZ) response [41]. Moreover, in view of the fact the genes involved in immune system functions were hypomethylated in their promoters, and repetitive elements of de-methylation may trigger further genomic instability, broad-spectrum, demethylating drugs may potentially bring on undesirable consequences and genomic instability.
Consistent with findings from other cancers, GBM also showed global hypomethylation when compared to control tissues. Our three biomarkers consistently demonstrated that tumors were hypomethylated compared to control brains. Since these markers measured cytosine methylation across compartments in the human genome, and repetitive elements consisted of .65% of genomic CpG sites versus 5.5% of them in promoters of protein coding genes, demethylation in the repetitive elements may contribute to a significant degree of global hypomethylation in GBM compared to control tissues [42]. Nevertheless, since LINE1 and 5m-dC levels in methylation Class 4 and Class 5 (GCIMP) were similar to controls, the epigenetic alterations of these tumors might involve a lesser degree of wide-spread, concomitant de-methylation of the repetitive elements. Future studies using bisulfite next gen sequencing will help to further define boundaries between hypoand hypermethylated domains in the GBM epigenome, as one study had recently done in other cancer types [33].
The reasons for the strong and uniform depletion of 5hm-dC in GBM were not clear. TET1, which converts 5m-dC to 5hm-dC, has not been shown to be mutated or over-expressed in GBM. Although it is also a global methylation marker and is closely related to 5m-dC, 5hm-dC is enriched within the gene body, promoters and transcription start site, whereas 5m-dC and LINE1 were primarily located in heterochromatin and repetitive elements [43]. In the evaluation of prognostic marker, LINE1, was a significant prognostic factor for overall survival in the discovery dataset, even after it was adjusted for the effect of other known prognostic factors such as MGMT methylation, extent of surgical resection and KPS. This marker is worthy of further validation in larger study or clinical trial, as the TCGA does not have information on this biomarker.
We were able to show that higher ESC index score was a poor survival factor in the discovery dataset, and the TCGA dataset validated its prognostic significance. Thus, high levels of methyl-ation of genes involved in this pathway may maintain these cells in the pluripotent state and promote treatment resistance. However, therapeutic design to re-differentiate this gene set cannot be ''offthe shelf'' demethylating agents. The approach will have to target other epigenetic components that promote these methylation markers, or specific enhancers/suppressors of these genes.
Future studies may also want to elucidate factors that lead to methylation changes in GBM, such as alteration in chromatin states, and local DNA features. Some studies have suggested that the presence of SINE and LINE may predispose methylation spread into nearby gene promoters, unless certain insular DNA elements, such as the presence of CTCF, SP1 or USF1, protect against such spreading into downstream promoters [44][45][46]. This study illustrated that repetitive element demethylation and epigenetic alteration in gene promoters occur hand-in-hand, but whether destabilization of repetitive elements may enhance methylation spread into adjacent genes in GBM will need further laboratory evaluation. Figure S1 Heat map showing unsupervised hierarchical clustering using GCIMP markers. a. Using GCIMP markers, four GBM subjects were found to be GCIMP+ in the discovery dataset (red lines on the row dendrogram). They corresponded to 4 of the 5 IDH1 mutated subjects. b. Using GCIMP markers, 13 GBM subjects were found to be GCIMP+ in the validation dataset (red lines on the row dendrogram). Six of them had IDH1 mutations. (TIF) Figure S2 Top down views of PCA analyses. a. In the discovery dataset, the separate, posterior location of the GCIMP+ group was best appreciated from this view; b. In the validation dataset, GCIMP+ group occupied a similar location posteriorly. Also, segregation of the 2 control groups was better visualized in this view. (TIF)