Proteogenomic view of cancer epigenetics: the impact of DNA methylation on the cancer proteome

Aberrant DNA methylation disrupts normal gene expression in cancer and broadly contributes to oncogenesis. We previously developed MethylMix, a model-based algorithmic approach to identify epigenetically regulated driver genes. MethylMix identifies genes where methylation likely executes a functional role by using transcriptomic data to select only methylation events that can be linked to changes in gene expression. However, given that proteins more closely link genotype to phenotype recent high-throughput proteomic data provides an opportunity to more accurately identify functionally relevant abnormal methylation events. Here we present ProteoMix, which refines nominations for epigenetic driver genes by leveraging quantitative high-throughput proteomic data to select only genes where DNA methylation is predictive of protein abundance. Applying our algorithm across three cancer cohorts we find that ProteoMix narrows candidate nominations, where the effect of DNA methylation is often buffered at the protein level. Next, we find that ProteoMix genes are enriched for biological processes involved in cancer including functions involved in epithelial and mesenchymal transition. ProteoMix results are also enriched for tumor markers which are predictive of clinical features like tumor stage and we find clustering on ProteoMix genes captures cancer subtypes.


Introduction
Genomic characterization can elucidate underlying biology, disease etiology and reveal biomarkers of cancer development and progression; however, each molecular feature is susceptible to different sources of biological and technical measurement noise and provides only one view on the cell state. Therefore, comprehensive studies are needed to understand the molecular basis of disease. Toward this end a multi-institutional consortium, The Cancer Genome Atlas (TCGA), has extensively characterized numerous cancer sites producing genome wide data for mutations, copy number alterations (CNA), RNA expression, microRNA expression, and DNA methylation (1)(2)(3)(4)(5).
As part of this project, the proteome was initially probed using protein array Reverse Phase Protein Assay (RPPA) technology. However, antibody based analysis are inherently limited because of the reduced coverage and inability to easily compare across proteins due to differential binding effects (6,7). Transcending these limitations, recent advancements in proteomics through high sensitivity mass-spectrometry (MS) are opening new opportunities in cancer research (8). To accelerate the uptake of proteomics the Clinical Proteomic Tumor Analysis Consortium (CPTAC) is performing proteomic analyses of TCGA tumor bio-specimens for a growing number of tissue types and establishing standardized workflows using high-throughput liquid chromatography tandem mass-spectrometry (LC-MS/MS) to capture the proteome as a whole (6,9,10).
To best leverage this new technology comparative analysis between protein abundance and RNA expression can highlight factors influencing concordance and inform how to best interpret proteomic data (11). For example, multiple studies have proven that concordance between mRNA and protein is highly variable, such that one cannot be used to reliably predict the other. Correlation between mRNA and protein has been repeatedly shown to vary by tissue type and cancer status among other molecular features like biological function or molecular stability (7). It was shown across multiple cancers that dynamic proteins involved in metabolism show strong agreement whereas housekeeping proteins and RNA processing proteins are weakly or negatively correlated (6,9,10). So, although many biological functions are regulated primarily through RNA expression -producing moderate correlation between proteomic and transcriptomic data, with mean spearman rho: 0.23 -0.47 -post-transcriptional mechanisms also play a significant role that cannot be overlooked.
The proteome represents the final link from genotype to molecular phenotype, so proteins are of special importance among molecular features and likely provide a more accurate depiction of cell state; this enhanced view on disease can be leveraged to assess functional effects of upstream aberrations, such as epigenetic modifications.
Multi-level epigenetic features such as DNA methylation and histone modification work in concert to regulate gene expression. DNA-methylation, the covalent addition of methyl groups to CpG dinucleotides to form 5-methylcytosine (5mC), is catalyzed by DNA methyltransferases, and is influenced by both environmental and hereditary factors (12). Previous studies have shown that DNA methylation plays a key role in health and is involved in processes of embryonic development and cellular differentiation, where changes can occur through imprinting, inheritance, or de novo events (13,14).
Furthermore, DNA methylation has been numerously cited as a potentially causative event in cancer (15,16). Among potential DNA methylation drivers, silencing of tumor suppressors through promoter CpG island hypermethylation is best understood and linked to corresponding gene silencing (13,17,18). Global hypo-methylation on the other hand can potentially result in genomic instability and reactivation of oncogenes (12,13,15).
To elucidate the role of DNA methylation in disease, our goal is to investigate whether linking proteomic data with DNA methylation data identifies key genes, describes molecular features and subtypes in cancer. Previously we presented MethylMix an algorithm that formalizes the identification of DNA methylation driver genes using a model-based approach (19)(20)(21)(22)(23). Recognizing the complex role of the methylome in epigenetic regulation of cancer, MethylMix uses mRNA data to select only differentially methylated genes that show down-stream effect on gene expression. This selects for likely functional aberrations with the aim of discriminating between true driver genes, and passenger events which are characteristic of genome wide dysfunction in cancer.
Herein we present ProteoMix which refines candidate nominations for epigenetic driver genes by excluding aberrations that are buffered at the protein level; this likely selects for events which are functional over those which may accumulate during cancer but do not drive pathogenesis. Using quantitative MS data from three cancer cohorts: breast invasive carcinoma, colorectal adenocarcinoma, and ovarian serous cystadenocarcinoma, we report ProteoMix' gene identifications, which include potential markers and therapeutic targets. We describe ProteoMix' ability to elucidate key molecular and higher level disease features and evaluate ProteoMix' performance against MethylMix. In summary, our study highlights the differences between integrated epigenomic-proteomics and epigenomic-transcriptomics analyses.
Therefore, it is unsurprising that ProteoMix shows high agreement with MethylMix, where more than 90% of ProteoMix genes are also identified by MethylMix. However, ProteoMix lists are more conservative identifying fewer candidate genes across all three cancers, where often the effect of methylation is present at the RNA level, but not detected at the protein level ( Figure 1), likely because they are buffered at the protein due to post-transcriptional, translational, or degradation regulation. Therefore, ProteoMix better enriches for methylation-states that more likely execute functional roles in cancer development. In breast cancer ProteoMix discovers three novel differentially methylated genes of diverse biological functions. ProteoMix detects a functional effect of hypo-methylation in the untranslated region (UTR) of EHF, which is a well-studied transcription factor involved in HER2 mediated epithelial differentiation (35); a likely oncogene, knockdown of EHF has been shown to inhibit tumor invasion and proliferation (36). Next, ProteoMix identifies hyper-methylation of FSTL1, an autoantigen that promotes immune response.
This candidate tumor suppressor, FSTL1, has also been shown to mediate tumor immune evasion in nasopharyngeal cancer through hyper-methylation silencing (37).
ProteoMix also reports hyper-methylation of DHX40 which has an unclear link to cancer; although it is of note that RNA splicing proteins -like DHX40 -are highly stable, perhaps explaining the particularly stronger effect of DNA methylation on protein abundance than mRNA (38) (Supplementary Table 1).
In colorectal cancer ProteoMix recovers several genes associated with immune function and inflammation, which is known to play a key role in pathogenesis. ProteoMix uniquely identifies a functional effect of UTR hypo-methylation of the PTPRC gene.
PTPRC belongs to a family of protein tyrosine phosphatase which contains oncogenes regulating cell growth and differentiation. PTPRC is also related to tumor necrosis and disrupts normal T-and B-cell signaling through SRC kinase pathways -which are separately implicated in colorectal cancer through amplification (9,39). Next, ProteoMix identifies upregulation of S100A9 through promoter hypo-methylation. Of note, elevated S100A9 mRNA and protein levels are commonly observed in many conditions associated with inflammation (40); additionally in hydropharangeal cancer where knockdown inhibited cell growth and invasion, S100A9 is also prognostic of worse outcome and indications like metastasis (41). Of note ProteoMix filtered out functional effects of a UTR hypo-methylation in S100A9 previously detected by MethylMix. Next, ProteoMix identifies hyper-methylation across the promoter region of LTF, a likely tumor suppressor which is produced by neutrophils to regulate growth and differentiation. In the context of colorectal tissue LTF has been shown to restrict inflammation by regulating T cell interaction (42). Additionally, gene expression of LTF has previously been shown to correlate with tumor size and survival in breast cancer (43). Lastly, ProteoMix uniquely identifies hypo-methylation mediated upregulation of DAK, also known as TKFC, which is related to virus-associated chronic inflammation (44).
ProteoMix picks up hypo-methylation states in five new genes in ovarian cancer related to processes of invasion and proliferation. ProteoMix uniquely identifies hypomethylation in the promoter region of EVL a key regulator of the actin cytoskeleton, associated with invasion and metastasis. Overexpression of EVL is also indicative of advanced stage in breast cancer (45) and has been implicated in malignancies due to inappropriate recombination (46). ProteoMix discovers elevated TSTA3 expression caused by gene body hypo-methylation. TSTA3 is linked to malignant transformations through abnormal glycosylation and controls cell proliferation and invasion by regulating CXCR4 chemokine mediated T cell signaling; additionally in breast cancer, high TSTA3 expression correlated with poor survival (47). Next, ProteoMix detects promoter hypomethylation mediated upregulation of HMGB3, a well-documented oncogene implicated in several cancers including breast, lung, esophageal, bladder, and colorectal cancers (48)(49)(50)(51). HMGB3 promotes cell proliferation in colorectal cancer cells through regulation of MYC, where genes from same family have been linked to decreased MYC expression which is unique to proliferative subtypes of ovarian cancer (26). Lastly, ProteoMix also identifies hypo-methylation in two mitochondrial genes ATP5D and SPG7, speculatively linked to cancer through metabolic function (52).

ProteoMix genes are enriched for biological processes involved in cancer
We conducted enrichment analysis to identify biological processes that are overrepresented in ProteoMix and MethylMix genes (Supplementary Table 3

ProteoMix genes are enriched for tumor progression markers
Taking an orthogonal approach, we identified putative biomarker of disease progression based on correlations between gene expression and clinical features ( Table 2).
Although ProteoMix gene lists contain much fewer identifications we find that across all three cancers that ProteoMix' lists include a larger proportion of markers of tumor stage and size and show stronger odds of containing such genes (  (Table 2B).  Table   4).  Figure 2C). About half of cluster-1 is comprised of patients labeled as Proliferative, while cluster-2 contains 75% of Immunoreactive subtype and 80% of Differentiated subtype patients, lastly Mesenchymal subtype patients can be found with relatively equal frequencies in each cluster (54)(55)(56). ProteoMix clusters also significantly correlate with tumor features, where cluster-1 and cluster-2 roughly correspond patients with lower-grade and higher-grade tumors.

Discussion
Epigenetic aberrations contribute to oncogenesis, where DNA hypermethylation inactivates tumor suppressor genes, while hypomethylation is known to promote genomic instability and activate oncogenes (12,20). Therefore, DNA methylation has potential to inform patient treatment and improve patient outcomes through new diagnostics and therapeutics. When identifying epigenetically driven cancer genes, it is of note that most biological functions -subject to genomic and epigenomic dysregulation -are ultimately executed at the protein level, so we can expect neutralization of non-functional upstream effects at -or before -the proteome. Herein we confirm the potential of using proteomic data to elucidate functional DNA methylation events by conducting the first genome wide analysis of epigenome-proteome relationships across three large human cancer cohorts. We present ProteoMix, a data-  Overall ProteoMix shows practical utility for improving nominations of cancer driver genes and elucidating new mechanisms of cancer development missed by our previous model. More broadly our study supports using proteomic data to better understand how epigenetic deregulation promotes cancer. Similar approaches have been applied and found to potentially improve aspects of patient care. For example, a retrospective analysis of outcomes in an oncology trial for glioblastoma -which tested efficacy of different temozolomide regiments -found that updating the clustering model to incorporate MGMT protein expression and c-MET protein abundance provided better separation of overall survival prognostic groups than incorporating MGMT promoter methylation alone (57). These findings and ours support the claim that protein data combined with DNA methylation is a better way to stratify patients and understand cancer features then using DNA methylation alone.
Although milestone initiatives like TCGA and CPTAC provide valuable date for the acceleration of discovery and research in cancer, we acknowledge the limitations of this study and further work required. A barrier to translation, the number of specimens used here is insufficient to draw conclusive clinical correlations and require replication of these results by independent studies. Importantly molecular measurements used here are also subject to sources of technical and biological bias. For example, it is known that bulk measurements obscure the complex nature of tumor microenvironment which includes many cell types including vascular, lymphatic, and immune cells. This confounding effect is compounded considering that each molecular feature was measured using different tumor fragments, which may have very different cellular compositions due to intra-tumor heterogeneity. Additionally, we recognize further characterization of genome wide proteomic studies is required to fully understand possible biases, such as worse detection of highly hydrophobic and hydrophilic peptides, or low-abundance peptides co-eluting with very high-abundance peptide (9).

Moreover, early proteomic techniques such as those utilized in CPTAC's Common Data
Analysis Pipeline have not yet reached the genome level resolution of other omic measurements; these methods require refinement to address low coverage due to inherent limitations of proteolytic measurements such immeasurable peptides that are excessively large or small tryptic fragments and the inability to distinguish some amino acids (9). This reduced coverage to a few thousand genes in our study excludes many genes with possible roles in cancer.
The complex nature of disease development and interplay between interacting biological aberrations -genetic, epigenetic, somatic or germline -often makes it difficult to elucidate causal mechanisms of cancer development'. Furthermore, there is still much work in multi-omics to elucidate causal flows of information influencing cellular physiology and pathology and to discriminate how separate phenomena are linked to create cancer (3,5,56,58). However, integrated multi-omic approaches like ProteoMix can provide additional insights into pathways and processes involved in oncogenesis and how they manifest as clinical phenotypes. As CPTAC moves into its second phase and characterizes more samples across more cancer types, models such as ProteoMix may leverage this valuable data to improve understanding of the molecular basis of cancer.

DNA Methylation
CpG site methylation levels/percentages were measured using Illumina Infinium Human Methylation 27k and 450k BeadChip Platforms (24)(25)(26). We limit our observations to overlapping probes or CpG sites for cancer tissues measured using both platforms, otherwise we use all available probes. The methylation level is recorded as a beta value representing a ratio of the signal/intensity from the methylated probe over the sum of both the methylated probe and the unmethylated probes. Values close to 0 indicate low levels of DNA methylation and values close to 1 indicate high levels of DNA methylation. We removed CpG sites with more than 10% missing entries across all samples and we applied 15-K Nearest Neighbor (KNN) to impute the remaining missing values, this procedure was replicated for all molecular data types. We observed significant technical sources of variation among tissue samples processed in batches using a one-way analysis, which we corrected using COMBAT (27) . To reduce dimensionality of the CpG data we applied hierarchical clustering with complete linkage and a minimum average Pearson correlation of 0.4 between values. Last, we matched clusters to their corresponding genes by mapping to the closest transcriptional start sites, where one gene may relate to many CpG clusters but each CpG cluster only maps to one gene. Therefore, we limit our analysis to DNA methylation states with cis regulation effects.

RNA Expression
We used transcriptomic data in MethylMix produced by RNA sequencing technology (24)(25)(26). We log-transformed the RNAseq counts and replaced infinities with a non-zero low value. Similar to our DNA methylation data processing, we estimated missing values using 15-KNN and used COMBAT to correct for batch effects (27).

Protein Abundance
Proteomic data used in ProteoMix was provided by CPTAC (6,9,10). Participating research institutions used the following Common Data Analysis Pipeline to produce protein level measurements: First tissue samples were enzymatically digested, cutting large proteins in a sequence specific manner into smaller peptides. Peptides were fractioned using liquid chromatography to improve downstream quality before measurements using Thermo Fisher high-resolution tandem mass spectrometry (LC-MS/MS). Next, the resultant mass ladders were matched to theoretical mass ladders in the FASTA database and subsequently assigned to peptide spectra using software tools and The Reference Sequence Database. The data was then filtered to exclude peptide fragments common to more than one protein and to only include proteinidentifying or unshared peptides i.e. fragments with unique sequences. Lastly peptides were mapped to a parsimonious set of genes.
The BRCA and OV workflows used iTRAQ-labeling to increase throughput, where 3 patient samples are isotopically labelled and analyzed against a common reference standard and describe relative ion intensities. Quantities are recorded after taking the log2 ratio of the abundances. Alternatively, measurement of COADREAD samples used label free mass spectrometry technology and are reported as absolute counts, which were transformed to relative quantities by taking the log2 of quantile normalized values using the limma R package (10). OV samples collected from Pacific Northwest National Laboratory and John Hopkins University were merged and corrected for batch effects using COMBAT (27).
To remove samples compromised by protein degradation we filtered samples using the

Algorithm
Step 1: Fit mixture model to methylation data integrates epigenetic data with proteomic and transcriptomic data using only the subset of these samples with available matched data (Table 1).
Step 2: Compare methylation to normal tissue To identify differentially methylated CpG clusters we compare the mean methylation level -the mean value of the beta mixture component -to the mean methylation level of normal samples. To measure if an observed difference is significant we perform a Wilcoxon rank sum test with a Q-value cutoff of 0.05, using both p-value multiple testing correction with False Discover Rate (FDR). As an additional measure, we require a minimum difference of 0.10 based on the platform sensitivity (29). If significant, the difference in methylation level between the mode and normal is recorded as the Differential Methylation value or DM value for each methylation state.
Step 3: Select for functionally predictive genes Next, we filter our set of genes, requiring that genes be not only differentially methylated when compared to normal but also predictive of gene expression or protein abundance.
Hyper-methylation should lower gene expression and corresponding protein abundance when compared to the normally methylated samples, therefore we only accept genes that have a negative correlation between methylation level and downstream gene products. Note this assumption is only explanatory of methylation at promoter regions and does not necessarily apply to methylation at the gene-body or 3' and 5' untranslated regions (UTRs). To assess the likelihood that methylation events are functional, MethylMix uses the relationship between methylation and gene expression, whereas ProteoMix examines the effect on protein abundance. In both cases, we perform a linear regression between methylation levels and RNA expression or protein abundance data respectively. We use the R-square statistic to estimate the magnitude of the correlation and used cutoffs at R-square greater than 0.05 and a p-value less than 0.001.
Applying the procedures outlined above for MethylMix and ProteoMix each produces a list of candidate cancer drivers (referenced as MethylMix and ProteoMix genes) and a corresponding matrix of DM-values for identified CpG clusters across all samples. To assess the validity of each list we used orthogonal clinical and biological data to assess utility for downstream analysis and relevance to disease state.

GO Term Enrichment
To describe the underlying biological processes captured by each model, we tested for enrichment of Gene Ontology (GO) terms in MethylMix and ProteoMix genes. This analysis was implemented using the PANTHER Classification System's statistical overrepresentation tool (30) with the following settings: Homo-sapiens for organism, the background set to include all genes with matching protein and RNA data, and either ProteoMix or MethylMix genes for input. Enrichment was calculated using fisher's exact test. For each gene list we rank terms using significance of the test statistic with a minimum p-value of 0.10.

Methylation Subtypes
With the matrices of DM values for our CpG clusters we performed consensus clustering to identify robust groupings of patients based on epigenetic signatures (31).
Our analysis for each cancer cohort used the following parameters: maximum number of clusters is 6, number of bootstrap subsamples is 500 with 0.8 the proportion of the subsample, and our method uses k-means cluster algorithm and Euclidean distance.
To identify the optimal number of clusters we inspected the proportion of ambiguous classification (PAC Score) (32, 33), and the consensus heatmap and values, where the score/index between two samples is the proportion of clustering runs in which the two items are clustered together. We define the intra cluster consensus as the mean of all pairwise consensus scores between samples clustered in the same group, and inter cluster consensus as the mean of all consensus indexes between a sample and all the other samples clustered in different groups. A robust clustering result ideally shows high intra cluster consensus and low inter cluster consensus. We tested for association between cluster assignments and several disease features, using a Chi-squared test for categorical variables such as molecular subtype labels or a Kruskal-Wallis test for ordinal values such as tumor grade. Our analysis includes genetic, molecular, and clinical annotations, which were collected from supplementary tables from the original TCGA publications (24)(25)(26) in addition to annotations downloaded using the TCGAbiolinks R package (34).

Enrichment for putative tumor markers
We compared MethylMix and ProteoMix genes by investigating their enrichment in genes related to disease progression. We used correlation of gene expression with cancer stage and tumor size to identify potential genes capturing disease progression.
We took the spearman correlation between gene expression levels and these clinical variables using all available samples. We selected genes using a p-value cutoff of 0.05 and biased for genes with greater likelihood of relevance by taking top 50th quantile in sample variance. Next, we filtered for only relationships that can be explained by methylation, such that genes identified as hyper-methylated in cancer tissue were required to show a negative correlation between gene expression and disease progression (tumor-suppressor genes) and hypo-methylated genes positively correlated (oncogenes). To assess each models' likelihood in picking up genes related to disease progression we examined the overlap between these genes and the ProteoMix and MethylMix genes, using Fisher's exact test to evaluate significance.