DNA Methylation Patterns Can Estimate Nonequivalent Outcomes of Breast Cancer with the Same Receptor Subtypes

Breast cancer has various molecular subtypes and displays high heterogeneity. Aberrant DNA methylation is involved in tumor origin, development and progression. Moreover, distinct DNA methylation patterns are associated with specific breast cancer subtypes. We explored DNA methylation patterns in association with gene expression to assess their impact on the prognosis of breast cancer based on Infinium 450K arrays (training set) from The Cancer Genome Atlas (TCGA). The DNA methylation patterns of 12 featured genes that had a high correlation with gene expression were identified through univariate and multivariable Cox proportional hazards models and used to define the methylation risk score (MRS). An improved ability to distinguish the power of the DNA methylation pattern from the 12 featured genes (p = 0.00103) was observed compared with the average methylation levels (p = 0.956) or gene expression (p = 0.909). Furthermore, MRS provided a good prognostic value for breast cancers even when the patients had the same receptor status. We found that ER-, PR- or Her2- samples with high-MRS had the worst 5-year survival rate and overall survival time. An independent test set including 28 patients with death as an outcome was used to test the validity of the MRS of the 12 featured genes; this analysis obtained a prognostic value equivalent to the training set. The predict power was validated through two independent datasets from the GEO database. The DNA methylation pattern is a powerful predictor of breast cancer survival, and can predict outcomes of the same breast cancer molecular subtypes.


Introduction
Breast cancer is the second largest cause of morbidity worldwide, the first cause of tumors in women [1], and the leading cause of cancer death in women. Moreover, the incidence rates of breast cancer are continuing increase [2]. Breast cancer has multiple molecular subtypes that are classified using tumor biomarkers, such as hormone receptors (HR) (i.e., estrogen receptor (ER) and progesterone receptor (PR)) and human epidermal growth factor receptor 2 (Her2) [3]. Four clinical subtypes of breast cancer can be separated according to HR expression and the epithelial cell of origin (luminal or basal): Luminal A (HR+/Her2-), Luminal B (HR+/Her2 +), Her2-enriched (HR-/Her2+), and triple-negative (HR-/Her2-) [4][5][6][7]. Breast cancer is a highly heterogeneous disease between and within tumors as well as among cancer-bearing individuals, which is a challenge for the diagnosis, treatment and prognosis [8].
DNA methylation is an epigenetic modification that plays important roles in gene expression regulation, cellular differentiation, development and even tumorigenesis. DNA methylation often occurs at the C-5 position of cytosine, especially cystonsines located in C-phosphate-G (CpG) sites. DNA hypermethylation in gene promoter or CpG islands can result in tumor suppressor silencing, leading to tumorigenesis. Therefore, a large number of differentially methylated regions in cancer have been identified to explore the epigenetic regulation mechanisms underlying oncogenesis [9]. Recently, DNA methylation biomarkers for the diagnosis, molecular typing and prognosis of breast cancer were identified. For example, hypermethylation of RASSF1A can be used to detect breast cancer during the early stages using a CpG island that is hypermethylated in 60-70% of breast cancers [10,11] or a promoter that is hypermethylated in 70% of breast cancer individuals [12]. Methylated RASSF1A is strongly associated with metastasis, tumor size, and an increased risk of death [13]. BRCA is a well known tumor suppressor for both breast and ovarian cancer whose mutations are more likely to be higher grade, poorly differentiated, highly proliferative, ER negative, PR negative and harbor p53 mutations [14,15]. However, Xu et al. found that although methylation of the BRCA1 promoter was more prevalent in cancers with tumors size greater than 2 cm, hypermethylation of BRCA1 from breast cancers with BRCA1 mutations had no overall correlation with ER, PR or grade [16]. Aberrant DNA methylation may be correlated with more advanced tumor stages at the time of diagnosis, but it is independent of BRCA1 mutation. Furthermore, DNA methylation has several advantages over sequence mutations as a cancer biomarker [17]. First, the aberrant methylation of specific CpG islands or gene promoters is more frequent than mutations. Second, aberrant methylation patterns can be detected even when they are embedded in an excess amount of normal DNA molecules. Third, techniques for the detection of methylation patterns are relatively simple [18].
Breast cancers can have different treatment responses and overall outcomes even when they are at the same stage of the disease or have the same subtype. Therefore, a good prognostic biomarker of breast cancer can not only contribute to the accurate classification of the subtype but also guide clinical treatment and improve breast cancer outcomes. Signatures predicting the clinical outcomes of breast cancer based on gene expression profiling have been identified and will provide benefits for adjuvant therapy [19]. However, the identification of prognostic predictors in breast cancer that regulate gene expression may have more benefits than unstable gene expression. For example, aberrant methylation of the TSC [20], SFRP1 [21], and RASSF1A [22] genes was associated with an unfavorable prognosis of breast cancer and could be regarded as independent predictors. Although several methylation biomarkers have been identified to predict breast cancer survival, they are usually limited to average methylation levels of several genes based on experiential knowledge. However, there is a weak correlation between the average DNA methylation levels of gene promoter and gene expressions in genome wide [23]. This finding prompted us to hypothesize that methylated CpGs in promoters might be not have equivalent regulatory effects on gene expression. Here, we used canonical correlation analysis to obtain the methylation patterns of CpGs with the strongest correlation with gene expression, and identified predictors of breast cancer prognosis based on high throughput DNA methylation data. The methylated features showed a good distinction of breast cancer outcomes even in samples with the same receptor status.

Materials and Methods
Data downloading and processing DNA methylation data sets of breast cancers based on Human Infinium 450K arrays were obtained from TCGA (http://cancergenome.nih.gov/). Gene expression data sets of breast cancers based on AgilentG4502A_07 were also downloaded from TCGA. Breast cancer samples that had both DNA methylation and gene expression information were retained as the training set. Other samples with only DNA methylation and patient outcome information were regarded as the test set. Samples or CpG sites missing data in the training set were filtered in the following analysis. Additionally, two DNA methylation datasets based on Human Infinium 450K arrays (GSE37754) and Human Infinium 27K arrays (GSE20712) were downloaded from GEO database (http://www.ncbi.nlm.nih.gov/geo/) as independent test sets to assess the predictive power of the DNA methylation biomarkers.

Canonical correlation analysis
Due to the multiple CpGs located in the gene promoters and the variability of the DNA methylation levels of multiple CpGs located in the same gene, canonical correlation analysis was used to estimate the correlation between gene expression and DNA methylation levels from multiple CpGs. Let Y i ¼ fy i 1 ; y i 2 ; . . . ; y i n g denote the expression levels of the ith gene among n samples, and X i ¼ fX i 1 ; X i 2 ; . . . ; X i p g denote the DNA methylation levels of p CpGs from the ith gene, where X i j ¼ fx i 1j ; x i 2j ; . . . ; x i nj g t is the methylation levels of the jth CpG among n samples.
For each gene, we can describe the methylation and expression data using the following matrix.
x n1 x n2 Á Á Á x np y n Assuming the mean and covariance of X and Y were The corresponding covariance matrix was S.
U and V were the linear combination of X j and Y respectively.
The objective function that obtained the maximum correlation between U and V was solved based on the Lagrange multiplier method.
The estimated valueâ andb constituted the canonical variables U and V that obtained the maximum correlation. Therefore, the methylation pattern of multiple CpGs located in the gene promoter was represented by U which was estimated through the following equation.
The expression pattern was represented by V which was estimated through the following equation.
V ¼b Á Y whereb denoted the regulatory direction of methylation on expression. Ab less than 0 indicated negative regulation; conversely, ab greater than 0 indicated positive regulation. Additionally, r(U,V) was the correlation degree between the methylation pattern and expression pattern.
For each gene, the methylation pattern score (MPS) was defined as where, x kj was the DNA methylation levels of the jth CpG in the kth sample.

Survival analysis
The log-rank test was used to identify a subset of genes whose MPS showed significant differences between the high and low groups and to obtain a p value. Univariate and multivariable analyses were performed using Cox proportional hazards models incorporating MPS and known prognostic clinical factors, including age at diagnosis ( 55 vs !56 years), tumor pathological stage (I & II vs III & IV), and tumor size (1-2 vs 3-4) as categorical variables. Univariate Cox regression analysis was performed to assess the survival prognosis capabilities of the selected gene set using the overall survival time as a dependent variable. To create an optimal feature for genes based on methylation patterns to assess breast cancer outcomes, the methylation risk score (MRS) was defined through featured genes identified based on multivariable Cox proportional hazards models with the MPS as a continuous variable.
where k was the kth sample. m was the number of feature genes filtered by the multivariable Cox proportional hazards models and c j (j = 1, 2, . . ., m) was the coefficient estimated by the multivariable Cox proportional hazards models. The 5-year overall survival for each MRS scoring group (high vs low) was calculated using the Kaplan-Meier method, and the statistical significance was assessed using the log-rank test. The significance level of all statistical tests was 0.05.

Multiple test correction
To control the false discover rate (FDR) of the featured genes, we adopted two methods to correct the p value of the statistical test. For the identification of the gene subset based on MPS through the log-rank test, sample labels were permutated 100 times and the log-rank test was re-performed. Empirical p values were calculated according to the order of the observed p value among the 100 permutations. If the empirical p value was less than 0.01, the gene was retained. Thus, all p values from the 100 permutations were larger than the observed p value.
To obtain the significant candidate gene set with the univariate Cox proportional hazards models, we again permuted the sample labels 100 times. In each permutation, we obtained the pvalue of the univariate Cox proportional hazards models. According to the FDR equation, the cutoff of p (p 0 ) was determined through FDR = 0.01.
where the numerator was the expected number of genes whose p value from the univariate Cox proportional hazards models was random less than cut off p 0 in random and the denominator was the number of genes whose p value was less than p 0 in the real situation.

Identification of featured biomarkers based on the effect of DNA methylation regulatory patterns of CpGs on expression
The gene expression and DNA methylation data from 209 breast cancer patients were obtained after processing the missing data from TCGA, which included 15,801 genes and 281,066 CpGs located in gene promoters. The breast cancer details are shown in S1 Table. The methylation patterns of CpGs in gene promoters that showed the maximum correlation with the gene expression patterns were obtained through canonical correlation analysis, which measured the maximum regulatory effect of DNA methylation on gene expression. Pearson's correlation analysis was performed between the average methylation of CpGs and gene expression. However, the canonical correlation degree was significantly higher than the Pearson's correlation degree (Wilcoxon p value < 2.2 × 10 −16 , S1A Fig). Moreover, we found that DNA methylation had a negative regulatory effect on the expression of some genes and a positive regulatory effect on others (S1B Fig). The MPS was calculated through the canonical variable from the canonical correlation analysis to estimate the methylation patterns of CpGs. High-and low-MPS groups were classified based on the median MPS among the samples. The log-rank test was used to identify genes that showed significant difference in outcomes between the high-and low-MPS groups. The p value of the log-rank test of these genes was less than 0.05 and the lowest of 100 permutations. A gene subset including 151 genes was identified. Furthermore, 38 genes were retained through univariate Cox proportional hazards models with p < 0.05 and FDR = 0.01 among 100 permutations (S2 Table).
The methylation prognostic biomarkers of breast cancer were identified through multivariable Cox proportional hazards models based on 38 genes with p < 0.01. The methylation prognostic biomarkers consisted of 7 protective genes with negative Cox proportional hazard model regression coefficient, indicating that the survival time increased as the MPS increased, and 5 risk genes with positive coefficients, indicating that the MPS increased as the survival time decreased (Table 1).

Estimating breast cancer outcomes according to the methylation risk score
We assessed the MRS of each sample based on the 12 methylation biomarkers described as follows to predict the breast cancer outcomes. The Kaplan-Meier method was used to estimate the 5-year overall survival rate between the high-and low-MRS groups classified through the median MRS among the samples. As shown in Fig 1A, the high-MRS group had a significantly shorter 5-year overall survival rate than the low-MRS group (68.5% vs 94.1%, log-rank p = 0.00103). The average methylation and expression levels of 12 featured genes were adopted to perform survival analysis. No significant difference were detected between the high and low groups classified through the median methylation or expression levels (Fig 1B and 1C). Additionally, we used the 7 protective and 5 risk genes for the survival analysis. The overall survival time and 5-year survival rates were significant different between the high and low groups; the 7 protective genes especially allowed more distinct estimations (S2 Fig). MRS had a significant association with hazard ratio (HR) of death in both the univariate and multivariable Cox proportional hazards models compared with the prognostic clinical  Table). This result suggested that the regulatory effect of the DNA methylation pattern on expression might be better able to predict the outcome of breast cancer than depending on methylation or expression alone.
Differential outcomes of receptor states with differential MRS High heterogeneity and differential outcomes of breast cancer were found among patients from the same subtype or receptor status. However, no significant differences in the overall survival time and 5-year survival rate were found between ER+ and ER-, PR+ and PR-, Her2 + and Her2-patients (Fig 2A-2C). When we combined these factors with the MRS, found a significant difference between the positive and negative receptor status and the breast cancer outcome.  (Fig 2E and 2F). Significantly differential outcomes were observed between the high-and low-MRS groups, although the patients had the same receptor state (Fig 2G-2L). In conclusion, differential outcomes were observed when the MRS and receptor status were combined, even when the patients had the same receptor state. We found that ER-, PR-or Her2-patients with high-MRS had the worst outcomes, which was consistent with the known conclusion.

Application of the methylation risk feature on breast cancers that resulted in death state and independent test datasets
To validate the prognostic value of DNA methylation biomarkers on other breast samples, the MRS of 12 featured genes were applied to 28 breast samples from patients with a death outcome from the TCGA dataset that only had DNA methylation information and were not included in the previous dataset. We found a significantly differential survival time between the low-and high-MRS groups (Fig 3A). Moreover, the overall survival time of the high-MRS group was less than 5 years. MRS showed a good ability to distinguish outcomes between ER + and ER-, PR+ and PR-, and Her2+ and Her2-samples (Fig 3B-3D). We observed similar results with the training set, with ER-, PR-or Her2-samples with high-MRS showing the worst prognosis; indeed, less than one year of survival was estimated for the Her2-& high-MRS group. The effect of the prognostic outcome in the independent samples suggested that a DNA Survival Analysis of Breast Cancer Based on DNA Methylation Pattern methylation biomarker identified through the regulation of gene expression by DNA methylation patterns in CpGs, is believable and has good predictive value for breast cancer outcomes. Additionally, we evaluated the performance of 12 featured genes using two independent breast cancer datasets (GSE37754 and GSE20712). NARS and SLC25A21 were not included in the 27K DNA methylation dataset (GSE20712), and an additional 10 featured genes were used. All 12 featured genes were used in the 450K dataset (GSE37754). Using two independent cohorts, we found differential outcomes between the high-and low-MRS groups (Fig 4); p values of the log-rank test were especially significant in GSE20712 (Fig 4B). MRS showed a preferable distinguishing power between the high-and low-MRS groups. This finding suggested that DNA methylation biomarkers might be robust factors for the prediction of breast cancer outcomes.

Discussion
Breast cancer is a heterogeneous tumor. Molecular classification has been successfully used to design individualized therapies, leading to significant improvements in disease-specific survival [24]. Recently, breast cancer was classified into three major subtypes based on luminal, Her2 + and basal-like based gene expression profiling [25,26]. Moreover, DNA methylation showed distinct patterns among subtypes of breast cancer (especially between luminal B and basal-like) [27]. Each of the breast cancer subtypes has different risk factors for incidence, response to treatment, risk of disease progression and outcomes. For example, triple negative breast cancer which usually includes basal-like tumors that lack HR and Her2, has a worse outcomes than the other subtypes because no specific molecular targets have been identified [28]. Recently, the identification of differential methylated regions of triple-negative breast cancer based on whole-genome methylation sequencing has provided diagnostic and prognostic value for personalized management [29].
We identified DNA methylation patterns of CpGs located within gene promoters that had a maximal regulatory effect on gene expression based on canonical correlation analysis to define the methylation pattern score of adjacent CpGs. DNA methylation featured genes associated with breast cancer outcomes that were obtained according to the DNA methylation patterns, which contributed to the construction of the methylation risk score. The methylation risk score of the featured genes showed the best ability to estimate the survival time between the high and low-risk groups compared to average DNA methylation or gene expression. Moreover, we found significant differential outcomes between the high-and low-MRS groups even though the breast samples had the same HR or Her2 status (especially ER-, PR-or Her2-), with the high-MRS group having the worst outcomes. A similar conclusion was supported using 28 breast samples with death as an outcome. We evaluated the estimation ability of the DNA methylation pattern of featured genes in other breast cancers based on Human Infinium 450K and 27K arrays from the GEO database. Due to the absence of some CpGs and genes in the 27K arrays compared with the 450K arrays, we used shared featured genes and CpGs between the 450K and 27K datasets to measure the MRS. Only 10 featured genes were found (excluding NARS and SLC25A21) in the 27K datasets; these genes were used to predict breast cancer outcomes. We also found differential outcomes between the high-and low-MRS groups in two independent cohorts. However, the p values of the log-rank test from the 450K test sets were not significant. We speculate that the high heterogeneity in cancer may lead to differential outcomes. Notably, the breast samples from GSE37754 (450K arrays) were in the early stages of cancer with tumor pathological stages I or II, which might have caused the non-significant pvalue. Although differences in the platform between the 27K and 450K arrays led to the absence of some featured genes and CpGs, the p value of the log-rank test was significant and the breast cancer outcomes were clearly distinguished between the high-and low-MRS groups. This finding implied that the survival model based on the methylation patterns of featured genes showed a good survival prognostic capability in breast cancer.
The DNA methylation featured genes we identified were ZNF536, ZNF592, NARS, SLC25A21, HERPUD2, NRIP2, PPP1R12C, DPAGT1, PFN1, ZFAND5, HPX and TMEM184B. ZNF536, ZNF592 and ZFAND5 were zinc finger proteins that had important roles in transcript regulation, embryonic development and cell differentiation. Several studies reported that SLC25A21 [30], PFN1 [31], HPX [32]and TMEM184B [33] were associated with a risk of breast cancer. Additionally, other genes were associated with the risk of other diseases, such as NARS, which causes nonsyndromic hearing loss, Leigh syndrome [34] and Alpers syndrome [35], and DPAGT1, which is involved in the pathogenesis of oral cancer [36]. The methylation featured genes did not include the BRCA gene. This finding was consistent with the conclusion of Xu et al, who reported that the methylation of BRCA had no overall correlation with ER, PR or grade. These results suggest that DNA methylation is an independent predictor of breast cancer prognosis and is independent of BRCA1 mutation. We attempted to compare the survival time between the BRCA mutation and non-mutation samples through MRS, but the analysis was limited by the small sample number with BRCA mutations. The good prognostic power of DNA methylation biomarkers can help guide clinical treatment and predict the outcome of breast cancer.