Corruption of the Intra-Gene DNA Methylation Architecture Is a Hallmark of Cancer

Epigenetic processes - including DNA methylation - are increasingly seen as having a fundamental role in chronic diseases like cancer. It is well known that methylation levels at particular genes or loci differ between normal and diseased tissue. Here we investigate whether the intra-gene methylation architecture is corrupted in cancer and whether the variability of levels of methylation of individual CpGs within a defined gene is able to discriminate cancerous from normal tissue, and is associated with heterogeneous tumour phenotype, as defined by gene expression. We analysed 270985 CpGs annotated to 18272 genes, in 3284 cancerous and 681 normal samples, corresponding to 14 different cancer types. In doing so, we found novel differences in intra-gene methylation pattern across phenotypes, particularly in those genes which are crucial for stem cell biology; our measures of intra-gene methylation architecture are a better determinant of phenotype than measures based on mean methylation level alone (K-S test in all 14 diseases tested). These per-gene methylation measures also represent a considerable reduction in complexity, compared to conventional per-CpG beta-values. Our findings strongly support the view that intra-gene methylation architecture has great clinical potential for the development of DNA-based cancer biomarkers.


Introduction
Epigenetic information is stored in the genome in the form of heritable modifications to the chemical structure of DNA, such as methylation of particular bases, as well a variety of chemical modifications of the histone proteins which package the DNA.Epigenetic information can be modulated during the lifetime of an organism by environmental cues [1][2][3] and these changes persist in subsequent mitoses, leading to an acquired change of phenotype.
DNA methylation is an epigenetic mark consisting almost entirely of the methylation of CpG dinucleotides [4], and it is possible for one, both, or neither alleles at a particular genomic locus to be methylated [5].Hypermethylation of CpGs in the gene promoter (the region close to the transcriptional start site, TSS) are incontrovertibly associated with silencing of the corresponding gene, and this effect is particularly important in cancer, where aberrant gene silencing is associated with functional changes important in every stage of tumour progression [6].
It has been found previously that variability of methylation at specific genomic locations is important in the development of cancer [7].It has been noted in particular that there is an increase in stochastic methylation variability in regions which are already known to have altered levels of methylation in cancers, leading to aberrant and varying gene expression, and providing an epigenetic mechanism for tumour heterogeneity [8].It has also been shown recently that statistics based on differential variability of methylation can lead to improved detection of risk markers in precancerous growths [9,10].
Polycomb group proteins (PcG) play a fundamental role in developmental processes, maintaining a class of genes known as polycomb group targets (PCGTs) in a repressed state in ES (embryonic stem) cells, to maintain pluripotency, and 'poised for activation' during differentiation [11].The link between PCGTs and cancer has been discussed by many authors [12][13][14]; it was recently shown that DNA hypermethylation in cancers preferentially targets PCGTs which are developmental regulators [15], those authors hypothesising that this may contribute to the stemlike characteristics of cancer; in further support of these ideas it has been noted that tumours which are particularly poorly differentiated tend to display expression patterns which are similar to ES cells, including repression of PCGTs [16].
Polycomb group proteins maintain the repressed state of genes via chromatin (the DNA packaging); DNA in its compact state is wrapped around histone proteins (a main component of chromatin), and PRC2 (polycomb repressive complex 2) is responsible for the trimethylation of lysine 27 of histone 3 (leading to the epigenetic mark H3K27me3), which is associated with this compact state [17].Genes occupied by PRC2 in ES cells mostly carry bivalent chromatin marks [15]; bivalency includes the histone modification H3K4me3 (trimethylation of lysine 4 on histone 3), a mark which is associated with activation of the corresponding gene, in addition to the repressive H3K27me3 mark.It is thought that it is this bivalent state which maintains stemness, keeping the gene repressed, but poised for activation upon differentiation.As DNA methylation is also associated with repression and activation of genes, it is of interest whether the methylation patterns of genes which carry the chromatin markings H3K27me3 and/or H3K4me3 in stem cells are altered in cancer, as such aberrant alteration of gene regulation via DNA methylation might be associated with a return of or accentuation of stem-like cell characteristics.
The role of early epigenetic changes in oncogenic transformation, including disruption of the healthy epigenotype of progenitor cells, the creation of an epigenetically permissible environment in which genetic aberrations can have tumorigenic effects, and phenotypic plasticity leading to tumour adaptation and associated with intra-tumour heterogeneity, was originally proposed by Feinberg and colleagues [1].It is hypothesised that one way in which stochastic dysregulation of stem cell genes and associated phenotypic heterogeneity might manifest, is in terms of cell to cell variability of methylation; this would in turn be expected to correlate with intra-gene variability of methylation, as measured using aggregated mixtures of heterogenous cells in a microarray experiment.
Intra-gene methylation variability is deemed to be a disruption of the normal methylation profile, or architecture, of a particular gene, and such a change may be more generally linked to the creation of an epigenetically permissible environment for oncogenic transformation, and to tumourigenesis.Such changes would be expected to accompany the early stages or even precede the onset of the disease, and hence identifying reliable indicators of such changes might provide a valuable lead for the development of DNA-based cancer biomarkers in bodily fluids, especially as it has been shown recently that DNA methylation biomarkers related to stem cell genes are associated with clinical outcome in women's cancers [18].
Previous studies [7,9,10] have focussed on the effects of sample to sample variability of methylation; here for the first time, we analyse the association of phenotype with intra-gene variability of methylation.Making use of data derived from the Illumina Infinium HumanMethylation450 platform, which interrogates w485000 CpGs genome-wide including w330000 with known gene annotations (corresponding to on average 17 CpGs per gene), we have investigated measures of intra-gene methylation architecture, and their ability to differentiate between healthy and disease phenotypes.For this we have developed new measures, and adapted standard ones.

Results
To investigate intra-gene methylation architecture, four genecentric measures are considered, as follows: 1.The mean deviation of the sample methylation profile from the mean methylation profile of healthy phenotype control samples, for each gene.This mean methylation profile may fluctuate a lot within each gene, and so it is not the same as the mean methylation level of a gene.Because this mean deviation is normalised at every probe by dividing by the probe standard deviation across the healthy phenotype control samples, it is called the 'mean z-score' measure; this is illustrated in Figure 1(a).An example of one of the genes found to be most significant according to this measure is shown in Figure 1(b) and (c).2. The mean derivative of the methylation measurements for each gene.The derivative of the methylation profile for a given gene and sample is approximated by the differences between the methylation values measured at consecutive probes mapping to that gene.The mean of the absolute values of these differences is then calculated as the 'mean derivative' measure; this is the same as the sum total of all the increases and decreases in methylation level from one probe to the next across the gene; this is illustrated in Figure 1(d).This is a self-calibrating measure of intra-gene methylation variability, because it is calculated for a given sample from the differences within that sample, and without reference to any other sample.3. The mean of the methylation measurements for a particular genomic region for each gene; this is illustrated in Figure 1(e).Typical mean methylation levels vary greatly from one genomic region to another; hence the mean methylation level for a particular genomic region was used as the 'mean methylation measure' for a gene, and the same region was used for each gene.4. The variance for each gene of the methylation measurements for a particular genomic region; this is illustrated in Figure 1(f).
Because variance is calculated in relation to the mean, this measure was similarly calculated for each gene using only the probes mapping to a particular genomic region, again using the same genomic region for each gene.This is called the 'methylation variance' measure; it is another self-calibrating measure.
These four measures each seek to examine a different characteristic of intra-gene methylation architecture, and all are able to classify samples one-by-one, i.e., they are intra-gene or intra-sample measures, rather than sample to sample measures as has been investigated previously in the context of methylation variability.
As the mean z-score is calculated as a mean measure of methylation difference from the healthy methylation profile, strictly speaking it is a measure of methylation instability.The mean derivative and methylation variance measures are both measures of intra-gene methylation variability; however, the mean derivative is calculated with reference to the ordering of the probes (i.e., this measure would return a different number if the order of the probes was randomised) whereas the methylation variance would not; the mean derivative additionally considers all probes mapping to the gene, whereas the methylation variance measure only considers probes mapping to a particular genomic region.The mean methylation measure is unique here in that it does not measure difference in methylation level and instead measures absolute methylation level; it is included here mainly for comparison.
The properties of these four measures were initially investigated in the context of fourteen Illumina Infinium Human Methylation 450 data sets, which were downloaded from The Cancer Genome Atlas (TCGA) [19].We applied these four measures to the fourteen TCGA data sets; in all, we analysed 450 K DNAm data from 3284 tumour and 681 healthy samples; details of the number of samples of each phenotype and in each data set are shown in Table 1 (for data set abbreviations, see 'Methods and Models').We also carried out a meta-analysis of these data which is to our knowledge the largest meta-analysis performed in any DNA methylation study.

Comparison of Intra-gene Methylation Measures
As a preliminary assessment of the relative merits of these four measures, we looked at their ability to distinguish between tumour and healthy tissue.The correlation of the tissue sample phenotype to the four methylation measures was considered in terms of distributions of per-gene AUCs (area under curve, which is a measure of prediction accuracy, see 'Methods and Models' for details).These distributions are shown in box-plots in Figure 2.For every data set, the mean z-score measure is significantly better at discriminating tumour from healthy tissue using these methylation data, than the mean derivative measure, the methylation variance measure, and the mean methylation measure (visual comparison of Figure 2 was confirmed by Kolmogorov-Smirnov tests, data not shown); this is because the mean z-score measure is defined relative to the healthy mean methylation profile.Excluding the mean z-score measure, the mean methylation measure is significantly better at discriminating tumour from healthy tissue than the remaining two measures in ten of the remaining data sets, with the mean derivative discriminating significantly better in two data sets (READ and THCA), and inconclusive results for the remaining data sets (KIRC and PAAD, which has unstable results due to small sample size).Figure S3 shows, in scatter plots, pairwise comparisons of each of the four methylation measures for a gene which was among the top 1000 genes with the highest AUC according to each of these measures.
To directly compare the effectiveness of the mean z-score measure at predicting phenotype (cancer/healthy) independent of mean methylation level, a logistic regression model was fitted to each gene using mean z-score and mean methylation as covariates, leading to p-values for each gene for each of mean z-score and mean methylation.In every data set except two, for the large majority (80-100%) of those genes with at least one of the two covariates significant, the mean z-score covariate p-value was more significant than the corresponding mean methylation covariate p-value.In the remaining two data sets, the mean z-score covariate p-value was more significant for the majority (50-80%) of genes with at least one significant covariate (detailed results not shown).Hence, the mean z-score is a better predictor of phenotype than the mean methylation, even after adjustment for mean methylation level.

Meta-analysis and Gene-set Enrichment Analysis
A meta-analysis of the fourteen data sets was carried out.Genes were assigned significance according to their mean AUC (based on the mean z-score measure) across all data sets by a permutation method (see 'Methods and Models' for details); this identified over 4000 significant genes which were associated with a consistent difference between cancer and healthy phenotypes across tissue types (FDR qƒ0:05).These genes consistently show the biggest differences between healthy and cancer phenotypes (as the mean z-score measure is defined relative to healthy control samples), and as the mean z-score is a measure of methylation instability, they are termed the most unstable meta-analysis genes.The mean zscores for individual tumour and healthy samples for the 50 most significant of these most unstable meta-analysis genes are displayed in Figure 3, and details about the 100 most significant of these genes are shown in Table S1.In particular, Figure 3 shows the extent to which the instability is consistent (high mean z-score, red) across cancer patients as compared to healthy patients (low mean z-score, blue).Genes with a mean AUC close to 0.5 across most tumour types were also found; these are genes which tend to have the smallest differences between healthy and cancer phenotypes across tissue types and hence are marked as least unstable metaanalysis genes.Over 2800 least unstable meta-analysis genes were found to be significant by this permutation method (FDR qƒ0:05) and the 100 most significant of these are shown in Table S2.There is however less consistency among the least unstable meta-analysis genes across tumour types, e.g., the 100th placed significant least unstable meta-analysis gene has an AUC of less than 0.6 for only 10 out of 14 tumour types.
To confirm the biological significance of the findings of this meta-analysis with reference to genes which are well known to be important in cancer biology, the most unstable and least unstable meta-analysis genes were tested for enrichment by genes which in  ES cells carry the repressing/activating chromatin marks H3K27me3 (H3K27 ES genes), H3K4me3 (H3K4 ES genes) and bivalent (i.e., both H3K27me3 and H3K4me3 marks, Biv ES genes) and enrichment by PCGTs (ES cell polycomb group targets).The most unstable meta-analysis genes are highly enriched by Biv and H3K27 ES genes and PCGTs, and the least unstable meta-analysis genes are highly enriched by H3K4 ES genes (Table 2).
A more general gene-set enrichment analysis (GSEA) was also carried out, testing enrichment of the most unstable and least unstable meta-analysis genes by members of over 6000 gene sets (see 'Methods and Models' section for details).The 100 most significantly enriched of these gene sets by the most unstable and least unstable meta-analysis genes appear in tables S3 and S4 respectively.In particular Table S3 (gene sets enriched by most unstable meta-analysis genes) shows many developmental and cell signalling gene sets.The most unstable meta-analysis genes are associated with generally higher methylation levels than genes which are not significant according to the meta-analysis (i.e., genes which are neither most unstable or least unstable meta-analysis genes) for both tumour and healthy samples, for these genomic regions located closer to the promoter across all tissue types, however the most unstable meta-analysis genes are also associated with a large variability of methylation levels (Figure S4).The least unstable meta-analysis genes conversely are associated with consistently very low levels of methylation in both tumour and healthy samples for these genomic regions, and particularly for TSS200, 59UTR and 1stExon, suggesting that the low methylation instability of these genes is associated with a lack of methylation in the most functionally important genomic regions in both diseased and normal tissues, and therefore that regulation of these genes is by mechanisms other than those involving DNA methylation, in particular the availability of transcription factors.

Correlation of Tumour Gene Expression with Intra-gene Methylation Architecture
In order to investigate the effect of intra-gene methylation architecture on gene expression, the 217 BRCA tumour samples with matched gene expression and methylation data available from TCGA were considered in more detail.For each gene a nonlinear multivariate regression analysis was performed (see 'Methods and Models') of gene expression to intra-gene methylation architecture, for these matched tumour samples, taking gene expression as the response, and taking one of mean z-score, mean derivative and methylation variance as one covariate predictor, together with mean methylation as a second covariate predictor.The relative proportions of genes found as significant or not, and significant according to one covariate or the other, or both, are shown in Figure 4; in particular there are many genes with expression not significantly predicted by mean methylation but significantly predicted by mean z-score, mean derivative, or methylation variance.
Enrichment by stem cell genes of genes with expression significantly predicted by only one covariate was again tested to confirm the biological significance of findings with reference to genes which are well known to be important in cancer biology.It was found that genes with expression predicted by only the mean z-score covariate were significantly enriched by Biv ES genes and PCGTs (p~1:3|10 {3 and p~5:0|10 {3 respectively, Fisher's exact test), a result which is consistent with the findings here that Biv ES genes are enriched among the most unstable meta-analysis genes, i.e., those genes which are most consistently associated with the biggest difference in methylation pattern between cancer and healthy phenotypes.It was also found that, correspondingly, genes with expression predicted by only the mean methylation covariate in the multivariate regression with the mean z-score covariate were significantly enriched (p~9:0|10 {4 , Fisher's exact test) by H3K4 ES genes, a result which is consistent with our findings that H3K4 ES genes are enriched among least unstable meta-analysis genes, i.e., those genes which have consistently least difference in methylation pattern between cancer and healthy phenotypes.Similarly, it was found that genes with expression predicted by only the mean derivative covariate were significantly enriched by Biv ES genes and PCGTs (p~9:5|10 {4 and p~8:4|10 {4 respectively, Fisher's exact test) and that genes with expression predicted only by the mean methylation covariate in the same P-values (one-sided Fisher's exact test) show enrichment of most unstable meta-analysis genes (MUs) and enrichment of least unstable meta-analysis genes (LUs) by genes in various SC categories.This confirms the biological significance of the findings of the meta-analysis with reference to these genes which are well known to be important in cancer biology.doi:10.1371/journal.pone.0068285.t002The proportion of significant genes (i.e., the proportion of the genes represented by the left of each pair of bars in a) which are significant due to one, or the other, or both covariates.For the genes which are significant due to only one covariate predictor, the proportions of these genes for which the significance is due to positive or negative correlation are indicated on the bars with/and \ respectively.There are many genes with expression not significantly predicted by mean methylation but significantly predicted by mean z-score, mean derivative, or methylation variance.doi:10.1371/journal.pone.0068285.g004 multivariate regression were significantly enriched by H3K4 ES genes (p~3:1|10 {4 , Fisher's exact test).These findings extend to heterogeneous tumour phenotype, as defined by gene expression, the idea that differences in methylation patterns in stem cell genes are a hallmark of cancer, and shows that this can be measured by intra-gene methylation architecture in the form of intra-gene methylation variability (according to the mean derivative and methylation variance measures) and instability (according to the mean z-score measure) more accurately than by mean methylation level alone.

Association of Genome-wide Mean z-score with Breast Cancer Intrinsic Subtypes
Differences in intra-gene methylation architecture between heterogenous tumour phenotypes (as defined by gene expression) was further explored, in the context of breast cancer intrinsic subtypes.The same 217 BRCA samples with matched gene expression and methylation data available were each uniquely assigned to one of these disease subtypes, according to established molecular definitions, using the PAM50 classifier [20].This was done by correlating the gene expression profile (Spearman correlation) for each sample to the PAM50 classifier canonical gene expression profiles for 5 different intrinsic subtypes, and for each sample choosing the subtype with the largest correlation coefficient, leading to 42 samples classified as Basal, 24 as Her2, 81 Luminal A, 54 Luminal B, and 16 classified as Normal.For each of these samples, a genome-wide mean z-score was also calculated, as a per-sample genome-wide measure of intra-gene methylation architecture.The distributions of these genome-wide mean zscores for each intrinsic subtype are shown in Figure 5; there are clear differences in the means and distributions between each of the subtypes.A Kruskal-Wallis test was carried out to check the significance of these differences, with a very significant result, p~1:4|10 {12 .Removing the samples classified as Luminal B and Normal (as the distributions of genome-wide mean-z scores have larger and smaller variances, respectively, for these subtypes than the others), still resulted in a significant result in the Kruskal-Wallis test, p~0:023.This ability to distinguish between heterogenous tumour phenotypes, in the context of established molecular definitions of disease subtypes, indicates that it may be possible to use intra-gene methylation architecture to develop new molecular classifiers of cancer, or make established ones more robust.This is particularly interesting, since methylation levels are typically more stable than gene expression levels.

Discussion
We have shown that the reorganisation of intra-gene methylation architecture is a fundamental characteristic of cancer cells, and that there are many ways to assess these differences, which can provide complimentary information.We have developed measures to detect some of these differences, including the first investigation of intra-gene variability of methylation (as opposed to sample to sample variability of methylation).We have shown that our mean z-score measure is consistently more effective at predicting cancer compared to healthy phenotype than mean methylation, even after adjustment for the mean methylation level.
We have carried out what is, to our knowledge, the largest metaanalysis performed in any DNA methylation study.In particular, over 4000 genes were found to be significantly associated with a consistent difference between cancer and healthy phenotypes, demonstrating that, as a method for distinguishing cancer from healthy tissue, our mean z-score measure is robust to differences between tumour types.The 100 most significant genes according to this meta-analysis (Table S1) can be considered as particularly characteristic of a generalised and non tissue-specific cancer phenotype.These least unstable meta-analysis genes are also significantly enriched (Table 2) by genes carrying H3K27 and bivalent chromatin marks in ES cells and by PCGTs, consistent with the idea that the tumour phenotype is associated with the acquisition of stem-like cell characteristics [15].In this metaanalysis, over 2800 genes were also found to be significantly associated with an absence of difference in methylation pattern from healthy to cancer, and these are significantly enriched by genes carrying the activating H3K4 chromatin mark in ES cells (Table 2).
There is a particularly big contrast in the effectiveness of these methods with respect to endocrine cancers.On the one hand, these methods are particularly insensitive to PRAD and THCA (Figure 2), and furthermore, the genes identified as being most significant in the meta-analysis do not seem to show the same pattern of instability in these cancers (Figure 3).This suggests the possibility that epigenetic mechanisms may, in general, be less relevant to oncogenic processes in THCA and PRAD.On the other hand, These methods are very effective at determining differences in epigenetic patterns with respect to both BRCA and UCEC (Figure 2), and these cancers show significant patterns of instability in a large proportion of the genes identified as being most significant in the meta-analysis (Figure 3).
The correlation for tumour samples of gene expression to intragene methylation architecture (Figure 4) shows that there are a substantial number of genes for which mean methylation is not significantly predictive of gene expression but other measures of intra-gene methylation architecture are.In particular, in the case of our mean z-score and mean derivative measures, genes with The mean across all genes of the mean z-scores was calculated for the 217 BRCA samples with matched expression and methylation data available.These samples were independently classified by correlation of their gene expression profiles (Spearman correlation) with those of the PAM50 breast cancer intrinsic subtype classifier [20].The distributions of these genome-wide mean zscores, for each intrinsic subtype, are shown in the boxplots.Indicated significance was calculated using the Kruskal-Wallis test.doi:10.1371/journal.pone.0068285.g005expression predicted by these measures and not by mean methylation are enriched by Biv ES genes and PCGTs, suggesting that the intra-gene methylation instability and variability are able to provide important information about heterogeneous tumour phenotype (as measured by gene expression), particularly in relation to stem-like cell characteristics, which is beyond the reach of measures based on mean methylation level alone.
The differences in the genome-wide mean z-scores across breast cancer intrinsic subtypes (Figure 5) highlight the potential of intragene methylation architecture to distinguish between heterogenous tumour phenotypes in the context of established gene expression based definitions of distinct subtypes of this disease.This indicates that it may be possible to use intra-gene methylation architecture to develop new molecular classifiers of cancer, or make established ones more robust.
Further improvements in classification by our methods will be gained by the inclusion of complementary epigenetic data, in particular those which measure patterns of histone modification.As discussed, it is well established how crucial genes which carry important histone markings in stem cells are to understanding cancer biology.By extending the view of the epigenetic landscape beyond DNA methylation to consider also histone markings not just in stem cells but also in mature healthy cells and cancer cells, we will gain mechanistic insights into the interaction between intra-gene methylation architecture and histone modifications.
In summary, we have shown for the first time that generalised differences in intra-gene methylation architecture are a better predictor of phenotype than mean methylation level alone, and we have developed novel measures of these differences, which offer a considerable reduction in complexity from per CpG methylation measures (hundreds of thousands of features) to per gene methylation measures (tens of thousands of features).We have shown that there are many genes with expression predicted by measures of intra-gene methylation architecture other than mean methylation level, and therefore that more general measures of intra-gene methylation architecture offer novel information about heterogeneous tumour phenotype (as defined by gene expression).We have also shown that intra-gene methylation architecture is able to distinguish between established molecular definitions of heterogenous cancer subtypes.Because it has been shown previously that differences in methylation pattern occur prior to the onset of disease [18], we anticipate that our measures of intragene methylation architecture might also be able to efficiently find pre-disease methylation patterns.We therefore believe that our measures of intra-gene methylation architecture have potential for further development as DNA based cancer biomarkers.

Data Source and Preprocessing
Methylation data, collected via the Illumina Infinium Human-Methylation450 platform, were downloaded from The Cancer Genome Atlas (TCGA) project [19] at level 3.These data were obtained from fourteen different tumour types, as follows: Bladder Urothelial Carcinoma (BLCA), Breast Invasive Carcinoma (BRCA), Colon Adenocarcinoma (COAD), Head and Neck Squamous Cell Carcinoma (HNSC), Kidney Renal Clear Cell Carcinoma (KIRC), Kidney Renal Papillary Cell Carcinoma (KIRP), Liver (LIHC), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), Pancreatic Adenocarcinoma (PAAD), Prostate Adenocarcinoma (PRAD), Rectum Adenocarcinoma (READ), Thyroid Carcinoma (THCA), and Uterine Corpus Endometrioid Carcinoma (UCEC).These data were pre-processed by first removing probes with non-unique mappings and which map to SNPs (as identified in the TCGA level 3 data); probes mapping to sex chromosomes were also removed; in total 98384 probes were removed in this way from all data sets.After removal of these probes, 270985 probes with known gene annotations remained.Individually for each data set, probes were then removed if they had less than 95% coverage across samples; probe values were also replaced if they had corresponding detection p-value greater than 5%, by KNN (k nearest neighbour) imputation (k~5).
Matched gene expression data were also downloaded for 217 samples for the BRCA data set, and were quantile normalised.

Intra-gene Methylation Measures
Four methylation measures were considered, and were calculated separately for each sample, for each gene: N 'Mean z-score': the mean of the z-scores calculated from the methylation values for the probes mapping to the gene, with population parameters for each probe calculated from healthy control samples N 'Mean derivative': the mean absolute derivative of the methylation profile across the gene N 'Methylation variance': the variance of the methylation values for probes mapping to one genomic region of the gene N 'Mean methylation': the mean of the methylation values for probes mapping to one genomic region of the gene To calculate the mean of the z-scores for each gene, the R [21]/ Bioconductor [22] package 'IlluminaHumanMethylation450k' [23] was used to identify the probes mapping to each gene.Then for each probe, the mean and standard deviation of the methylation values for that probe were found from healthy tissue samples, allowing a z-score z i,j for each probe i, and for each sample j, to be calculated according to equation 1.By taking the mean of the absolute z i,j for all probes i mapping to gene g, a single intra-gene methylation predictor value x j (g) was then calculated for each gene g, for each sample j, according to equation 2. A regularisation parameter, j, was added to each probe standard deviation when calculating probe z-scores to prevent very large values from occurring; j was chosen to be 0.01 after considering the distribution of probe standard deviations (Figure S5).
where b i,j is the methylation value for probe i and sample j, m (h) i and s (h) i are the mean and standard deviation of the methylation values corresponding to the relevant healthy tissue samples for probe i, n(g) denotes the number of probes mapping to gene g and P(g) is the set of probes mapping to gene g.
To calculate the 'mean derivative' methylation measure, the 'IlluminaHumanMethylation450k' package was again used to find the probes mapping to each gene.Ordering the probes P(g)~fi(1),:::,i n(g) ð Þg mapping to gene g as they are positioned along the DNA, the derivative of the methylation profile for gene g and sample j is estimated as the differences between the beta the mean z-score measure was calculated for each gene.Significance was then assigned to each of these per-gene mean AUCs by similarly calculating null mean AUCs after permuting AUCs within data sets.This resulted in 4267 significant unstable meta-analysis genes with FDR q-value [26] less than 5%, i.e., those genes corresponding to the upper tail of the null mean AUC distribution, which are associated with a consistent difference between cancer and healthy phenotypes across tissue types; the top 100 most significant of these unstable genes appear in Table S1.This permutation method also resulted in 2818 significant (FDR qƒ0:05) significant least unstable metaanalysis genes, i.e., those genes corresponding to the lower tail of the null mean AUC distribution, which were associated with least difference from healthy to cancer phenotype across tissue types the top 100 most significant of these least unstable genes appear in Table S2.
To confirm the biological significance of the findings of this meta-analysis with reference to genes which are well known to be important in cancer biology, the least unstable and least unstable meta-analysis genes were tested for enrichment by genes which in ES cells carry the repressing/activating chromatin marks H3K27me3 (H3K27 ES genes), H3K4me3 (H3K4 ES genes) and bivalent (i.e., both H3K27me3 and H3K4me3 marks, Biv ES genes) and enrichment by PCGTs (ES cell polycomb group target genes) using the one-tailed Fisher's exact test.A more general gene-set enrichment analysis (GSEA) was also carried out both on the least unstable and least unstable meta-analysis genes; 6811 gene set definitions were downloaded from the Broad Institute Molecular Signatures Database (http://www.broadinstitute.org/),and each gene set was tested separately for enrichment among the significant genes.Enrichment was again tested using the one-sided Fisher's exact test, finding 1048 and 778 gene sets significantly (FDR qƒ0:05) enriched by least unstable and least unstable metaanalysis genes respectively.The top 100 of these gene sets are shown in tables S3 and S4 respectively.

Correlation of Tumour Gene Expression with Intra-gene Methylation Architecture
For the 217 BRCA tumour samples for which matched gene expression and methylation data were available, for each gene a multivariate regression analysis of gene expression and intra-gene methylation architecture was carried out.Gene expression was used as the response, with one of mean z-score, mean derivative and methylation variance as one covariate predictor, and with mean methylation as a second covariate predictor.As it was expected that this relationship would be non-linear, and as for a non-specified non-linear monotonic function the ranks of data points in response and predictor variables are linearly related if there is a good association between these variables, the ranks of each of the variables across the samples were correlated to one another, as follows.
Defining for gene g the ranks of the samples according to the expression data as r (e) (g), the ranks of the samples according to the mean z-score, mean derivative or methylation variance as r (x) (g), and the ranks of the samples according to the mean methylation as r (m) (g), the data were modelled according to equation 4: r (e) (g)~a(g)r (x) (g)zc(g)r (m) (g)zm(g)z[ ð4Þ where m(g) is the intercept term for gene g, and is the model error.
Where r (e) (g) is well-correlated with r (x) (g), similar integer entries in these vectors (corresponding to similar ranks) will appear in similar positions in these vectors (N.B., these vectors are not themselves ordered).This will then be reflected as a small p-value for this comparison (calculated from the corresponding t-statistic for the linear model a(g) coefficient), and similarly for r (m) (g) (and corresponding c(g) coefficient), if it is well-correlated with r (e) (g).This linear model was applied to the data for each gene present in the matched expression and methylation data for the BRCA dataset.'Body' annotated probes were again used to calculate the methylation variance and mean methylation measures as used in this model, because probes annotated to this genomic region produced, in both cases, the greatest number of significant pvalues (for the respective covariate), as compared to using probes annotated to each of the other genomic regions.Figure S3 Scatter plots showing pairwise comparisons of each of the four methylation measures, for the ONECUT3 gene.ONECUT3 was among the top 1000 genes with the highest AUC according to each of the four methylation measures.There is one point in each scatter plot for each of the 98 healthy and 586 cancer samples in the BRCA data set.(PDF)

Supporting Information
Figure S4 Genomic feature mean methylation levels for healthy and tumour samples.(1) significant consistently most unstable genes in the meta-analysis (sig MUs) (2) genes not significant in the meta-analysis, (3) significant consistently least unstable genes in the meta-analysis (sig LUs).(PDF) Figure S5 Distributions of probe standard deviations.For each tumour type, the standard deviation of the beta values for each probe is found for cancer and for healthy samples; then estimates of the density distributions of the standard deviations for all probes are plotted for cancer and healthy samples for each tumour type.Locations of the modal standard deviation of each density distribution estimate are indicated with dashed lines, and are stated on each plot; where the distribution is multimodal the modal standard deviation corresponding to the greatest density is used.(PDF) Table S1 Meta-analysis: 100 most significant most unstable genes.

Figure 1 .
Figure 1.Per-gene methylation measures.(a) The mean z-score measure is calculated for tumour sample j (shown in red) for gene g (to which n probes map), from the mean, m i , and standard deviation, s i , of the healthy control samples at each probe i (b) The methylation profiles of 586 cancer (red) and 98 healthy (blue) samples across a gene found as significant according to the mean z-score measure, with probes spaced (unevenly) according to their genomic loci.Genomic regions are indicated under the gene with the colour code displayed at the bottom of the figure.(c) A heatmap illustrating the same gene, with probes evenly spaced; b values for each sample and each probe are indicated by the colour code displayed at the top of the figure.Samples are plotted in order of mean z-score, such that the tumour sample with the smallest mean z-score and the healthy sample with the smallest mean z-score are adjacent.Genomic regions are indicated under the gene with the colour code displayed at the bottom of the figure.N.B., this gene has two transcriptional start sites (TSSs) in different locations.(d) The mean derivative measure is calculated, for sample j, as the mean of the absolute differences in the corresponding b values between consecutive probes, across the whole of gene g.(e) The mean methylation measure is calculated, for sample j, as the mean of the corresponding b values of the probes annotated to a particular genomic region of g. (f) The methylation variance measure is calculated, for sample j, as the variance of the corresponding b values of the n' probes annotated to a particular genomic region of g.N.B., (d)-(f) are calculated without reference to healthy samples, whereas (a) is calculated with reference to healthy samples.doi:10.1371/journal.pone.0068285.g001

Figure 3 .
Figure3.Heatmap of the mean z-score for the top 50 genes found by the meta-analysis.Mean z-scores for tumour (T) and healthy (H) samples are displayed in a heatmap according to the colour code for the top 50 meta-analysis genes (top 50 most consistently unstable genes).The heatmap shows the extent to which the instability is consistent (high mean z-score, red) across cancer patients as compared to healthy patients (low mean z-score, blue).For each tissue type healthy samples appear to the right of tumour samples; where no space is available the (H) label is omitted.Abbreviations: R (READ), B (BLCA), K(KIRP), P (PAAD).doi:10.1371/journal.pone.0068285.g003

Figure 4 .
Figure 4. Correlation of expression to intra-gene methylation architecture, for matched BRCA samples.Expression was taken as the response variable, with one of mean z-score, mean derivative and methylation variance as one covariate predictor, together with mean methylation as a second covariate predictor.(a) The proportion of genes with at least one covariate significant (FDR qƒ0:05), and the proportion of genes with neither covariate significant.(b) The proportion of significant genes (i.e., the proportion of the genes represented by the left of each pair of bars in a) which are significant due to one, or the other, or both covariates.For the genes which are significant due to only one covariate predictor, the proportions of these genes for which the significance is due to positive or negative correlation are indicated on the bars with/and \ respectively.There are many genes with expression not significantly predicted by mean methylation but significantly predicted by mean z-score, mean derivative, or methylation variance.doi:10.1371/journal.pone.0068285.g004

Figure 5 .
Figure5.Distributions of genome-wide mean z-score, for breast cancer intrinsic subtypes.The mean across all genes of the mean z-scores was calculated for the 217 BRCA samples with matched expression and methylation data available.These samples were independently classified by correlation of their gene expression profiles (Spearman correlation) with those of the PAM50 breast cancer intrinsic subtype classifier[20].The distributions of these genome-wide mean zscores, for each intrinsic subtype, are shown in the boxplots.Indicated significance was calculated using the Kruskal-Wallis test.doi:10.1371/journal.pone.0068285.g005

Figure
Figure S1 Distributions of per-gene AUCs calculated from genomic feature methylation variance measures.P-values shown are for Kolmogorov-Smirnov tests comparing the distributions of the most effective and second most effective measures.(PDF) Figure S2 Distributions of per-gene AUCs calculated from genomic feature mean methylation measures.Pvalues shown are for Kolmogorov-Smirnov tests comparing the distributions of the most effective and second most effective measures.(PDF)

Table 1 .
Number of samples in each data set.

Table 2 .
Enrichment of MUs and LUs genes by stem cell genes.