The impact of DNA methylation on the cancer proteome

Aberrant DNA methylation disrupts normal gene expression in cancer and broadly contributes to oncogenesis. We previously developed MethylMix, a model-based algorithmic approach to identify epigenetically regulated driver genes. MethylMix identifies genes where methylation likely executes a functional role by using transcriptomic data to select only methylation events that can be linked to changes in gene expression. However, given that proteins more closely link genotype to phenotype recent high-throughput proteomic data provides an opportunity to more accurately identify functionally relevant abnormal methylation events. Here we present a MethylMix analysis that refines nominations for epigenetic driver genes by leveraging quantitative high-throughput proteomic data to select only genes where DNA methylation is predictive of protein abundance. Applying our algorithm across three cancer cohorts we find that using protein abundance data narrows candidate nominations, where the effect of DNA methylation is often buffered at the protein level. Next, we find that MethylMix genes predictive of protein abundance are enriched for biological processes involved in cancer including functions involved in epithelial and mesenchymal transition. Moreover, our results are also enriched for tumor markers which are predictive of clinical features like tumor stage and we find clustering using MethylMix genes predictive of protein abundance captures cancer subtypes.

Aberrant DNA methylation disrupts normal gene expression in cancer and broadly contributes to oncogenesis. We previously developed MethylMix, a model-based algorithmic approach to identify epigenetically regulated driver genes. MethylMix identifies genes where methylation likely executes a functional role by using transcriptomic data to select only methylation events that can be linked to changes in gene expression. However, given that proteins more closely link genotype to phenotype recent high-throughput proteomic data provides an opportunity to more accurately identify functionally relevant abnormal methylation events. Here we present a MethylMix analysis that refines nominations for epigenetic driver genes by leveraging quantitative high-throughput proteomic data to select only genes where DNA methylation is predictive of protein abundance. Applying our algorithm across three cancer cohorts we find that using protein abundance data narrows candidate nominations, where the effect of DNA methylation is often buffered at the protein level. Next, we find that MethylMix genes predictive of protein abundance are enriched for biological processes involved in cancer including functions involved in epithelial and mesenchymal transition. Moreover, our results are also enriched for tumor markers which are predictive of clinical features like tumor stage and we find clustering using MethylMix genes predictive of protein abundance captures cancer subtypes.

Author summary
To elucidate the molecular basis of cancer we examine the variation and dynamics characterizing the flow of information from epigenome to the transcriptome and proteome. Conducting the first genome wide analysis of epigenome-proteome associations, we present a MethylMix analysis that leverages protein abundance data taking advantage of recent high-throughput proteomic data generated using mass-spectrometry technology to elucidate the role of DNA methylation in cancer. By integrating across molecular data types,

Introduction
Genomic characterization can elucidate underlying biology, disease etiology and reveal biomarkers of cancer development and progression; however, each molecular feature is susceptible to different sources of biological and technical measurement noise and provides only one view on the cell state. Therefore, comprehensive studies are needed to understand the molecular basis of disease. Toward this end a multi-institutional consortium, The Cancer Genome Atlas (TCGA), has extensively characterized numerous cancer sites producing genome wide data for mutations, copy number alterations (CNA), RNA expression, microRNA expression, and DNA methylation [1][2][3][4][5]. As part of this project, the proteome was probed using protein array Reverse Phase Protein Assay (RPPA) technology. However, antibody based analysis are inherently limited because of the reduced coverage and inability to easily compare across proteins due to differential binding effects [6,7]. Transcending these limitations, recent advancements in proteomics through high sensitivity mass-spectrometry (MS) are opening new opportunities in cancer research [8]. To accelerate the uptake of proteomics the Clinical Proteomic Tumor Analysis Consortium (CPTAC) is performing proteomic analyses of TCGA tumor bio-specimens for a growing number of tissue types and establishing standardized workflows using high-throughput liquid chromatography tandem mass-spectrometry (LC-MS/MS) to capture the proteome as a whole [6,9,10].
To best leverage this new technology comparative analysis between protein abundance and RNA expression can highlight factors influencing concordance and inform how to best interpret proteomic data [11]. For example, multiple studies have proven that concordance between mRNA and protein is highly variable, such that one cannot be used to reliably predict the other. Correlation between mRNA and protein has been repeatedly shown to vary by tissue type and cancer status among other molecular features like biological function or molecular stability [7]. It was shown across multiple cancers that dynamic proteins involved in metabolism show strong agreement whereas housekeeping proteins and RNA processing proteins are weakly or negatively correlated [6,9,10]. So, although many biological functions are regulated primarily through RNA expression-producing moderate correlation between proteomic and transcriptomic data, with mean spearman rho: 0.23-0.47 -post-transcriptional mechanisms also play a significant role that cannot be overlooked.
The proteome represents the final link from genotype to molecular phenotype, so proteins are of special importance among molecular features and likely provide a more accurate depiction of cell state; this enhanced view on disease can be leveraged to assess functional effects of upstream aberrations, such as epigenetic modifications. Multi-level epigenetic features such as DNA methylation and histone modification work in concert to regulate gene expression. DNA-methylation, the covalent addition of methyl groups to CpG dinucleotides to form 5-methylcytosine (5mC), is catalyzed by DNA methyltransferases, and is influenced by both environmental and hereditary factors [12]. Previous studies have shown that DNA methylation plays a key role in health and is involved in processes of embryonic development and cellular differentiation, where changes can occur through imprinting, inheritance, or de novo events [13,14]. Furthermore, DNA methylation has been numerously cited as a potentially causative event in cancer [15,16]. Among potential DNA methylation drivers, silencing of tumor suppressors through promoter CpG island hypermethylation is best understood and linked to corresponding gene silencing [13,17,18]. Global hypo-methylation on the other hand can potentially result in genomic instability and reactivation of oncogenes [12,13,15].
To elucidate the role of DNA methylation in disease, our goal is to investigate whether linking proteomic data with DNA methylation data identifies key genes, describes molecular features and subtypes in cancer. Previously we presented MethylMix an algorithm that formalizes the identification of DNA methylation driver genes using a model-based approach [19][20][21][22][23].
Recognizing the complex role of the methylome in epigenetic regulation of cancer, MethylMix uses mRNA data to select only differentially methylated genes that show a downstream effect on gene expression (MethylMix-GE). This selects for likely functional aberrations with the aim of discriminating between true driver genes, and passenger events which are characteristic of genome wide dysfunction in cancer. Herein we present MethylMix-PA (Protein Abundance), a MethylMix analysis which refines candidate nominations for epigenetic driver genes by excluding aberrations that are buffered at the protein level; this likely selects for events which are functional over those which may accumulate during cancer but do not drive pathogenesis. Using proteomic data generated by MS technology from three cancer cohorts: breast invasive carcinoma, colorectal adenocarcinoma, and ovarian serous cystadenocarcinoma, we report MethylMix-PA genes, which include potential cancer progression markers and therapeutic targets. We describe MethylMix-PA's ability to elucidate key molecular and higher level disease features and evaluate MethylMix-PA performance against MethylMix-GE. In summary, our study highlights the differences between integrated epigenomic-proteomics and epigenomic-transcriptomics analyses.

Results
We applied MethylMix to identify differentially methylated genes in two ways: once with gene expression data defined as MethylMix-GE and once with protein abundance data defined as MethylMix-PA [19][20][21][22] across three cancer types (Fig 1, Table 1): breast invasive carcinoma (BRCA), colorectal adenocarcinoma (COADREAD), and ovarian serous cystadenocarcinoma (OV). These analyses result in two lists of genes: MethylMix-GE and MethylMix-PA representing genes that are both differentially methylated compared to normal methylation, and with an significant relationship with gene expression and protein abundance respectively (S1 Table). Next, we will specifically examine the biological and clinical relevance of both analyses' output and utility for downstream analysis.

MethylMix-PA narrows candidate nominations for epigenetically driven genes
For each cohort both models identify genes that are 1) differentially methylated when compared to normal adjacent tissue and 2) functionally predictive of downstream effects at the level of RNA expression in the case of MethylMix-GE or protein abundance in the case of MethylMix-PA (Fig 2). Among all three cancer cohorts we observe significant correlations between RNA expression and protein abundance (mean rho: 0.23-0.47), indicating that most genes are regulated at the transcript level (Fig 3, S2 Table). Therefore, it is unsurprising that MethylMix-PA shows high agreement with MethylMix, where more than 90% of MethylMix-PA genes are also identified by MethylMix. However, MethylMix-PA lists are more conservative identifying fewer candidate genes across all three cancers, where often the effect of methylation is present at the RNA level, but not detected at the protein level (Fig 2), likely because they are buffered at the protein due to post-transcriptional, translational, or degradation regulation. Therefore, MethylMix-PA better enriches for methylation-states that more likely execute functional roles in cancer development.

MethylMix-PA identifies new genes with significant methylation effects only at the protein level
For each cancer cohort using protein abundance data also identifies a few unique driver genes, the majority of which have documented roles in carcinogenesis. Explanative mechanisms by which the effect of DNA methylation may be undetected at the RNA level but functional at the protein level are further addressed below in the discussion.
In breast cancer we discovered 19 novel differentially methylated genes of diverse biological functions. MethylMix-PA identified hyper-methylation of FSTL1, an autoantigen that promotes immune response. This candidate tumor suppressor, FSTL1, has also been shown to mediate tumor immune evasion in nasopharyngeal cancer through hyper-methylation silencing [24]. MethylMix-PA also found hyper-methylation of DHX40 which has an unclear link to cancer; although it is of note that RNA splicing proteins-like DHX40 -are highly stable, perhaps explaining the particularly stronger effect of DNA methylation on protein abundance than mRNA [25] (S1 Table). Next, MethylMix-PA identified hypo-methylation of CEACAM5 (also known as CEA), a cell surface glycoprotein that is used as a clinical biomarker for gastrointestinal cancers and may promote tumor development through its role as a cell adhesion molecule. High levels of CEACAM5 have been associated with operable early breast cancer [26,27]. Next, MethylMix-PA also identified hyper-methylation of FOXO1, a transcritionf factor where low expression has been associated with cancer [28].
In colorectal cancer the MethylMix-PA analysis uniquely recovers several genes associated with immune function and inflammation, which is known to play a key role in pathogenesis. We found that MethylMix-PA identifies a functional effect of UTR hypo-methylation of the PTPRC gene. PTPRC belongs to a family of protein tyrosine phosphatase which contains oncogenes regulating cell growth and differentiation. PTPRC is also related to tumor necrosis and disrupts normal T-and B-cell signaling through SRC kinase pathways-which are separately implicated in colorectal cancer through amplification [9,29]. Next, MethylMix-PA identified upregulation of S100A9 through promoter hypo-methylation. Of note, elevated S100A9 mRNA and protein levels are commonly observed in many conditions associated with inflammation [30]; additionally in hydropharangeal cancer where knockdown inhibited cell growth and invasion, S100A9 is also prognostic of worse outcome and indications like metastasis [31]. Of note MethylMix-PA filtered out functional effects of a UTR hypo-methylation in S100A9 previously detected by MethylMix-GE. Next, MethylMix-PA identified hyper-methylation across the promoter region of LTF, a likely tumor suppressor which is produced by neutrophils to regulate growth and differentiation. In the context of colorectal tissue LTF has been shown The impact of DNA methylation on the cancer proteome to restrict inflammation by regulating T cell interaction [32]. Additionally, gene expression of LTF has previously been shown to correlate with tumor size and survival in breast cancer [33].
MethylMix-PA picks up hypo-methylation states in 18 unique genes in ovarian cancer related to processes of invasion and proliferation. MethylMix-PA uniquely identifies hypomethylation in the promoter region of EVL a key regulator of the actin cytoskeleton, associated with invasion and metastasis. Overexpression of EVL is also indicative of advanced stage in breast cancer [34] and has been implicated in malignancies due to inappropriate recombination [35]. Next, MethylMix-PA identified hypo-methylation of CTSZ, also known as cathepsin Z, a a lysosomal cysteine proteinase that has been shown to be involved in many primary tumors. For example, high levels of CTSZ promote epithelial to mesenchymal transition and are associated with the mesenchymal-like cell phenotype [36]. We also found hypermethylation of GSTM2, a gene that is normally high expressed in ovary, but has been shown to be a hypermethylated in lung cancers [37] and colorectal cancers [38], suggesting a tumor suppressor role for GSTM2 across tissues. Lastly, MethylMix-PA also identifies hypo-methylation in the mitochondrial genes SPG7, speculatively linked to cancer through metabolic function [39].

MethylMix-PA genes are enriched for biological processes involved in cancer
We conducted enrichment analysis to identify biological processes that are overrepresented in MethylMix-PA and MethylMix-GE genes ( Table 2, S3 Table). Given the large proportion of common genes, across all three cancers both models capture many of the same annotations. However, comparing enrichments found for each cancer site, we find that broadly Methyl-Mix-PA results include more significant enrichments for functions associated with cell adhesion and migration of epithelial and endothelial cells; these processes increase cell motility and invasiveness and are indicative of epithelial to mesenchymal transition (EMT) which is key to cancer development. Additionally, we observed that enrichment for immune functions are highly variable between each model's results.
Comparing unique annotations among breast cancer genes, MethylMix-PA includes enrichments for cell-cell adhesion, STAT signaling, response to interferon-gamma and immune cell functions, whereas MethylMix-GE similar pathways, but is also enriched for several other functions with less relevance to cancer such as homeostasis, muscle cell proliferation and skin development. In colorectal cancer, the MethylMix-PA gene list is shorter as fewer MethylMix-PA genes have been identified (Fig 2). These genes are only enriched in cell-cell adhesion ( Table 2). The MethylMix-GE list for colorectal cancer is also enriched in cell-cell adhesion but also includes seemingly irrelevant enrichments for humoral immune response and detection of stimulus involved in sensory perception (S3 Table). For ovarian cancer, the MethylMix-PA enrichment mirrors the MethylMix-GE enrichment almost exactly with enrichments for metabolic processes, NF-kappa-beta signaling and interleukin-1 production.

MethylMix-PA genes are enriched for tumor progression markers
Taking an orthogonal approach, we identified putative biomarker of disease progression based on correlations between gene expression and clinical features (Table 3). Although MethylMix-PA gene lists contain much fewer identifications, we find that across all three cancers Methyl-Mix-PA lists include a larger proportion of markers of tumor stage and size and show stronger odds of containing such genes ( Table 3). The greatest difference in frequency of tumor stage marker is observed in breast cancer where 12% versus 7% of genes show correlation in Methyl-Mix-PA and MethylMix-GE gene lists respectively. The most significant associations however are observed in colorectal cancer where 15% of MethylMix-PA genes show correlation   between gene expression and tumor stage, this includes LTF which is mentioned among unique MethylMix-PA genes (Table 3A, S1 Table). The same trend applies when correlating gene expression with tumor size where the largest difference in enrichment can be seen in colorectal cancer where 7% versus 3% of genes correlate with size when comparing models. However, the enrichment is much stronger for breast cancer where 29% of genes correlate with tumor size compared to 21% of MethylMix-GE genes (Table 3B).

Clustering on MethylMix-PA genes captures cancer subtypes
Clustering on methylation has been shown to stratify patients into clinically relevant subgroups [2,20,21,23]. We performed consensus clustering using the DM values for MethylMix-PA and MethylMix-GE genes evaluating clusters sizes from two to six (Table 4); for clarity we discuss clusters at K = 2, examining the gross differences between MethylMix-GE and Methyl-Mix-PA. We evaluated if these epigenetically defined subgroups correspond to previously The impact of DNA methylation on the cancer proteome published subtypes and clinical and genetic features and found that MethylMix-PA identifies subgroups of patients that enriched for specific cancer subtypes and other molecular features and performs similarly to MethylMix-GE (Fig 4, S4 Table).
In breast cancer MethylMix-PA clusters significantly correlate with molecular subtypes and other molecular features such as Progesterone and Estrogen Receptor (PR, ER) status (Fig 4A). Similar to other studies our clusters differentiate between canonical breast cancer molecular subtypes: Cluster-1 and Cluster-2 containt the majority of patients with Luminal A/B type tumors while Cluster-3 contains the majority of patients with Basal-like tumors and as expected is enriched for samples negative for ER, PR, or HER2. HER2 and Normal subtypes are less clearly distinguished in MethylMix-PA clusters.
Among colorectal samples we are able to confirm the CpG island methylator phenotype (CIMP) (Fig 4B). Cluster-2 contains all but one of the patients labeled CIMP-High using methylation signatures and 75% of patients labeled Microsatellite Instable/CIMP using gene-expression signatures. The CIMP subtype has known association with MLH1 silencing through hyper-methylation, which is reflected in our MethylMix-PA subtypes where we find cluster-1 to include the majority of samples with non-silenced MLH1. MethylMix-PA subtypes also Examining subtypes in ovarian cancer our MethylMix-PA clusters agree well with molecular subtypes and are significantly correlated (Fig 4C). Cluster-1 contains 78% of Immunoreactive subtype and 78% of Differentiated subtype patients, while about half of cluster-2 is comprised of patients labeled as Proliferative. Lastly Mesenchymal subtype patients can be found with relatively equal frequencies in each cluster [40][41][42]. MethylMix-PA clusters also significantly correlate with tumor features, where cluster-2 and cluster-1 roughly correspond patients with lower-grade and higher-grade tumors.

Discussion
Epigenetic aberrations contribute to oncogenesis, where DNA hypermethylation inactivates tumor suppressor genes, while hypomethylation is known to promote genomic instability and activate oncogenes [12,20]. Therefore, DNA methylation has potential to inform patient treatment and improve patient outcomes through new diagnostics and therapeutics. When identifying epigenetically driven cancer genes, it is of note that most biological functions-subject to genomic and epigenomic dysregulation-are ultimately executed at the protein level, so we can expect neutralization of non-functional upstream effects at-or before-the proteome. Herein we confirm the potential of using proteomic data to elucidate functional DNA methylation events by conducting the first genome wide analysis of epigenome-proteome relationships across three large human cancer cohorts. We present MethylMix-PA, a method that formalizes the identification of abnormally methylated genes that are predictive of protein abundance, like MethylMix-GE, and uses a model-based approach, negating the use of arbitrary user-defined thresholds for abnormal DNA methylation, and identifies subpopulations of hypo or hypermethylated samples within a heterogeneous population. By integrating DNA methylation array and quantitative MS technologies, MethylMix-PA identifies candidate epigenetic driver genes with clinical value as potential therapeutic targets and protein biomarkers for assessing prognosis and treatment stratification. MethylMix-PA builds on our model MethylMix and addresses the potential limited predictive value of mRNA as proxy for phenotype due to the role of post-transcriptional mechanisms.
MethylMix-PA identifies oncogenes and tumor suppressors and-with the exception of a few genes-returns a subset of MethylMix identifications, where often the effect of DNA Methylation does not propagate to the proteome (Fig 1, S1 Table). In other cancer studies similar buffering has been observed in both cis and trans CNA effects, suggesting that many detectable aberrations in cancer do not manifest in expression changes at the protein level [6,10]. Otherwise put, many abnormally methylated genes are likely only passengers and do not functionally contribute to cancer development. Identification of a reduced set of genes in our study has pragmatic benefits for cancer research, where narrowing nominations to fewer high-quality candidates increases the likelihood of finding true targets; strongest candidates include genes identified by both models that show negative correlation between DNA methylation and both gene expression and protein abundance, and therefore have clear biological interpretations amenable to validation in the laboratory. Similar methods to identify true targets have been described, where genes that show correlation between mRNA and protein are more likely to have tumor promoting effects [10]. Conversely, novel MethylMix-PA genes should be taken with due consideration given the lack of clear mechanisms explaining how changes in DNA methylation may alter protein levels, but be undetectable at the transcript level-plausible explanations that remain to be tested include erroneous or noisy gene expression data, low mRNA stability or alternative splicing confounding expression at the RNA level.
Nevertheless, most new identifications are well supported to have tumor promoting effects and therefore warrant further investigation to uncover how DNA-methylation may influence regulation of genes like EHF, FSTL1, PTPRC, S100A9, LTF, EVL, and TSTA3. Importantly, in all these cases the type of DNA methylation is consistent with gene function, where known tumor-suppressors are hyper-methylated and oncogenes are hypo-methylated at regions where DNA methylation negatively regulates transcription.
Taken together MethylMix-PA genes highlight important features in cancer related to tumor features and subtypes. MethylMix-PA genes also capture oncogenic biological processes based on enrichment analysis showing key aspects of cancer development such as processes related to EMT, immune function, and proliferative signaling ( Table 2, S3 Table). MethylMix-PA also elucidates more shared annotations between cancer types, and thus a greater ability to identify genes of core cancer pathways that are shared across cancer sites. Next, using a completely orthogonal approach we also find that MethylMix-PA is more descriptive of tumor progression; although this new analysis produces a reduced number of identifications, Methyl-Mix-PA genes are more likely to correlate in expression with disease features such as tumor stage and size (Table 3). Lastly, we find MethylMix-PA performs reasonably well for patient clustering recapitulating established molecular subtypes. Given the limitations of our study, we expect our clustering to have reduced discriminative power, since we limit our observations to genes for which we have both matched gene expression and protein abundance measurements in our analysis. This significantly diminishes the feature space we used for learning. Nevertheless, we find that MethylMix-PA performs similarly to MethylMix-GE in identifying cancer subtypes such as luminal and basal types of breast cancer, the CIMP type in colorectal cancer and all subtypes in ovarian cancer, with the exception of the mesenchymal subtype which is the least clearly defined subtype [40] (Fig 2, S4 Table). These findings suggest the reduced number MethylMix-PA genes capture the major sources of variation in each cancer cohort and facilitate translatability into feasible panels for testing.
Overall MethylMix-PA shows practical utility for improving nominations of cancer driver genes and elucidating new mechanisms of cancer development missed by our previous model. More broadly our study supports using proteomic data to better understand how epigenetic deregulation promotes cancer. Similar approaches have been applied and found to potentially improve aspects of patient care. For example, a retrospective analysis of outcomes in an oncology trial for glioblastoma-which tested efficacy of different temozolomide regiments-found that updating the clustering model to incorporate MGMT protein expression and c-MET protein abundance provided better separation of overall survival prognostic groups than incorporating MGMT promoter methylation alone [43]. These findings and ours support the claim that protein data combined with DNA methylation is a better way to stratify patients and understand cancer features then using DNA methylation alone.
Although milestone initiatives like TCGA and CPTAC provide valuable date for the acceleration of discovery and research in cancer, we acknowledge the limitations of this study and further work required. A barrier to translation, the number of specimens used here is insufficient to draw conclusive clinical correlations and require replication of these results by independent studies. Importantly molecular measurements used here are also subject to sources of technical and biological bias. For example, it is known that bulk measurements obscure the complex nature of tumor microenvironment which includes many cell types including vascular, lymphatic, and immune cells. This confounding effect is compounded considering that each molecular feature was measured using different tumor fragments, which may have very different cellular compositions due to intra-tumor heterogeneity. Additionally, we recognize further characterization of genome wide proteomic studies is required to fully understand possible biases, such as worse detection of highly hydrophobic and hydrophilic peptides, or low-abundance peptides co-eluting with very high-abundance peptide [9]. Moreover, early proteomic techniques such as those utilized in CPTAC's Common Data Analysis Pipeline have not yet reached the genome level resolution of other omic measurements; these methods require refinement to address low coverage due to inherent limitations of proteolytic measurements such immeasurable peptides that are excessively large or small tryptic fragments and the inability to distinguish some amino acids [9]. This reduced coverage to a few thousand genes in our study excludes many genes with possible roles in cancer.
The complex nature of disease development and interplay between interacting biological aberrations-genetic, epigenetic, somatic or germline-often makes it difficult to elucidate causal mechanisms of cancer development. Furthermore, there is still much work in multiomics to elucidate causal flows of information influencing cellular physiology and pathology and to discriminate how separate phenomena are linked to create cancer [3,5,42,44]. However, integrated multi-omic approaches like MethylMix-PA can provide additional insights into pathways and processes involved in oncogenesis and how they manifest as clinical phenotypes. As CPTAC moves into its second phase and characterizes more samples across more cancer types, models such as MethylMix-PA may leverage this valuable data to improve understanding of the molecular basis of cancer.

Ethics statement
All data used in this study is third party data, and is available from the following articles [6,9,10,[45][46][47]. All other data is available within the paper and Supporting Information files.
DNA methylation. CpG site methylation levels/percentages were measured using Illumina Infinium Human Methylation 27k and 450k BeadChip Platforms [45][46][47]. We limit our observations to overlapping probes or CpG sites for cancer tissues measured using both platforms, otherwise we use all available probes. The methylation level is recorded as a beta value representing a ratio of the signal/intensity from the methylated probe over the sum of both the methylated probe and the unmethylated probes. Values close to 0 indicate low levels of DNA methylation and values close to 1 indicate high levels of DNA methylation. We removed CpG sites with more than 10% missing entries across all samples and we applied 15-K Nearest Neighbor (KNN) to impute the remaining missing values, this procedure was replicated for all molecular data types. We observed significant technical sources of variation among tissue samples processed in batches using a one-way analysis, which we corrected using COMBAT [48]. To reduce dimensionality of the CpG data we applied hierarchical clustering with complete linkage and a minimum average Pearson correlation of 0.4 between values. Last, we matched clusters to their corresponding genes by mapping to the closest transcriptional start sites, where one gene may relate to many CpG clusters but each CpG cluster only maps to one gene. Therefore, we limit our analysis to DNA methylation states with cis regulation effects.
RNA expression. We used transcriptomic data in MethylMix produced by RNA sequencing technology [45][46][47]. We log-transformed the RNAseq counts and replaced infinities with a non-zero low value. Similar to our DNA methylation data processing, we estimated missing values using 15-KNN and used COMBAT to correct for batch effects [48].
Protein abundance. Proteomic data used in MethylMix-PA was provided by CPTAC [6,9,10]. Participating research institutions used the following Common Data Analysis Pipeline to produce protein level measurements: First tissue samples were enzymatically digested, cutting large proteins in a sequence specific manner into smaller peptides. Peptides were fractioned using liquid chromatography to improve downstream quality before measurements using Thermo Fisher high-resolution tandem mass spectrometry (LC-MS/MS). Next, the resultant mass ladders were matched to theoretical mass ladders in the FASTA database and subsequently assigned to peptide spectra using software tools and The Reference Sequence Database. The data was then filtered to exclude peptide fragments common to more than one protein and to only include protein-identifying or unshared peptides i.e. fragments with unique sequences. Lastly peptides were mapped to a parsimonious set of genes.
The BRCA and OV workflows used iTRAQ-labeling to increase throughput, where 3 patient samples are isotopically labelled and analyzed against a common reference standard and describe relative ion intensities. Quantities are recorded after taking the log2 ratio of the abundances. Alternatively, measurement of COADREAD samples used label free MS technology and are reported as absolute counts, which were transformed to relative quantities by taking the log2 of quantile normalized values using the limma R package [10]. OV samples collected from Pacific Northwest National Laboratory and John Hopkins University were merged and corrected for batch effects using COMBAT [48].
To remove samples compromised by protein degradation we filtered samples using the QC method described by Mertins et al. [6]: we calculated the standard deviation of non-normalized protein measurements across all genes for each sample and segmented samples into groups using a two component Gaussian mixture model. Samples belong to the poor-quality group i.e. higher mean standard deviation were excluded from study. Applying this method we discarded 28 BRCA and 5 OV samples. Finally, for each cancer we removed samples with greater than 75% missing values, estimating the remaining missing values using 15-KNN algorithm [49].

Algorithm
Step 1: Fit mixture model to methylation data. As described earlier methylation levels are recorded as beta values or values ranging from 0 to 1 representing the percentage of methylation and therefore gene values across all samples are beta distributed. MethylMix identifies subgroups of patients with a distinct methylation pattern or state by finding the beta mixture model with the number of components that best describe the data. To map samples to subgroups we iteratively add components requiring that each additional component improve the Bayesian Information Criterion (BIC) to avoid overfitting. To define the most descriptive subgroups we include methylation measurements across all samples, however our model integrates epigenetic data with proteomic and transcriptomic data using only the subset of these samples with available matched data (Table 1).
Step 2: Compare methylation to normal tissue. To identify differentially methylated CpG clusters we compare the mean methylation level-the mean value of the beta mixture component-to the mean methylation level of normal samples. To measure if an observed difference is significant we perform a Wilcoxon rank sum test with a Q-value cutoff of 0.05, using both p-value multiple testing correction with False Discover Rate (FDR). As an additional measure, we require a minimum difference of 0.10 based on the platform sensitivity [50]. If significant, the difference in methylation level between the mode and normal is recorded as the Differential Methylation value or DM value for each methylation state.
Step 3: Select for functionally predictive genes. Next, we filter our set of genes, requiring that genes be not only differentially methylated when compared to normal but also predictive of gene expression or protein abundance. Hyper-methylation should lower gene expression and corresponding protein abundance when compared to the normally methylated samples, therefore we only accept genes that have a negative correlation between methylation level and downstream gene products. Note this assumption is only explanatory of methylation at promoter regions and does not necessarily apply to methylation at the gene-body or 3' and 5' untranslated regions (UTRs). To assess the likelihood that methylation events are functional, MethylMix-GE uses the relationship between methylation and gene expression, whereas MethylMix-PA examines the effect on protein abundance. In both cases, we perform a linear regression between methylation levels and RNA expression or protein abundance data respectively. We use the R-square statistic to estimate the magnitude of the correlation and used cutoffs at R-square greater than 0.05 and a Q-value of 0.01 using FDR multiple testing correction.
Applying the procedures outlined above for MethylMix-GE and MethylMix-PA each produces a list of candidate cancer drivers (referenced as MethylMix-GE and MethylMix-PA genes) and a corresponding matrix of DM-values for identified CpG clusters across all samples. All MethylMix genes have the following statistical properties: (i) a DNA methylation difference based on the beta mixture model that is significantly different from normal that is > = 0.1 based on the platform sensitivity [50], and (ii) an R-square statistic greater than 0.05 with a Q-value less than 0.01 for correlation with gene expression and protein abundance for Methyl-Mix-GE and MethylMix PA respectively.
To assess the validity of each list we used orthogonal clinical and biological data to assess utility for downstream analysis and relevance to disease state. Evaluation GO term enrichment. To describe the underlying biological processes captured by each model, we tested for enrichment of Gene Ontology (GO) terms in MethylMix-GE and Methyl-Mix-PA genes. This analysis was implemented using the PANTHER Classification System's statistical overrepresentation tool [51] with the following settings: Homo-sapiens for organism, the background set to include all genes with matching protein and RNA data, and either MethylMix-PA or MethylMix-GE genes for input. Enrichment was calculated using fisher's exact test. For each gene list we rank terms using significance of the test statistic with a minimum p-value of 0.10.
Methylation subtypes. With the matrices of DM values for our CpG clusters we performed consensus clustering to identify robust groupings of patients based on epigenetic signatures [52]. Our analysis for each cancer cohort used the following parameters: maximum number of clusters is 6, number of bootstrap subsamples is 500 with 0.8 the proportion of the subsample, and our method uses k-means cluster algorithm and Euclidean distance. To identify the optimal number of clusters we inspected the proportion of ambiguous classification (PAC Score) [53,54], and the consensus heatmap and values, where the score/index between two samples is the proportion of clustering runs in which the two items are clustered together. We define the intra cluster consensus as the mean of all pairwise consensus scores between samples clustered in the same group, and inter cluster consensus as the mean of all consensus indexes between a sample and all the other samples clustered in different groups. A robust clustering result ideally shows high intra cluster consensus and low inter cluster consensus. We tested for association between cluster assignments and several disease features, using a Chi-squared test for categorical variables such as molecular subtype labels or a Kruskal-Wallis test for ordinal values such as tumor grade. Our analysis includes genetic, molecular, and clinical annotations, which were collected from supplementary tables from the original TCGA publications [45][46][47] in addition to annotations downloaded using the TCGAbiolinks R package [55].
Enrichment for putative tumor markers. We compared MethylMix-GE and MethylMix-PA genes by investigating their enrichment in genes related to disease progression. We used correlation of gene expression with cancer stage and tumor size to identify potential genes capturing disease progression. We took the spearman correlation between gene expression levels and these clinical variables using all available samples. We selected genes using a p-value cutoff of 0.05 and biased for genes with greater likelihood of relevance by taking top 50th quantile in sample variance. Next, we filtered for only relationships that can be explained by methylation, such that genes identified as hyper-methylated in cancer tissue were required to show a negative correlation between gene expression and disease progression (tumor-suppressor genes) and hypo-methylated genes positively correlated (oncogenes). To assess each models' likelihood in picking up genes related to disease progression we examined the overlap between these genes and the MethylMix-PA and MethylMix-GE genes, using Fisher's exact test to evaluate significance.