Prediction of Breast Cancer Survival Using Clinical and Genetic Markers by Tumor Subtypes

Purpose To identify the genetic variants associated with breast cancer survival, a genome-wide association study (GWAS) was conducted of Korean breast cancer patients. Methods From the Seoul Breast Cancer Study (SEBCS), 3,226 patients with breast cancer (1,732 in the discovery and 1,494 in the replication set) were included in a two-stage GWAS on disease-free survival (DFS) by tumor subtypes based on hormone receptor (HR) and human epidermal growth factor receptor 2 (HER2). The associations of the re-classified combined prognostic markers through recursive partitioning analysis (RPA) of DFS for breast cancer were assessed with the Cox proportional hazard model. The prognostic predictive values of the clinical and genetic models were evaluated by Harrell’s C. Results In the two-stage GWAS stratified by tumor subtypes, rs166870 and rs10825036 were consistently associated with DFS in the HR+ HER2- and HR- HER2- breast cancer subtypes, respectively (P rs166870=2.88×10-7 and P rs10825036=3.54×10-7 in the combined set). When patients were classified by the RPA in each subtype, genetic factors contributed significantly to differentiating the high risk group associated with DFS inbreast cancer, specifically the HR+ HER2- (P discovery=1.18×10-8 and P replication=2.08×10-5) and HR- HRE2- subtypes (P discovery=2.35×10-4 and P replication=2.60×10-2). The inclusion of the SNPs tended to improve the performance of the prognostic models consisting of age, TNM stage and tumor subtypes based on ER, PR, and HER2 status. Conclusion Combined prognostic markers that include clinical and genetic factors by tumor subtypes could improve the prediction of survival in breast cancer.


Results
In the two-stage GWAS stratified by tumor subtypes, rs166870 and rs10825036 were consistently associated with DFS in the HR+ HER2-and HR-HER2-breast cancer subtypes, respectively (P rs166870 =2.88×10 -7 and P rs10825036 =3.54×10 -7 in the combined set). When patients were classified by the RPA in each subtype, genetic factors contributed significantly to differentiating the high risk group associated with DFS inbreast cancer, specifically the HR+ HER2-(P discovery =1.18×10 -8 and P replication =2.08×10 -5 ) and HR-HRE2-subtypes

Introduction
Breast cancer is one of the most common malignancies among women in the world. Although breast cancer patients have generally a good prognosis [1], because the 5-year survival for invasive breast cancer cases from 1999 to 2005 was about 90%, large differences exist in survival rate because of a variety of clinicopathological prognostic factors [2]. The tumor-node-metastasis (TNM) staging system approved by the American Joint Committee on Cancer (AJCC) is a well-known important prognostic factor [3]. However, there are prognostic differences within specific stages because of the biological heterogeneity of tumors; thus, additional tumor markers such as tumor grade, lymphovascular invasion, proliferation markers, estrogen and progesterone receptor (ER and PR) status, and human epidermal growth factor receptor 2 (HER2) overexpression have been suggested to provide a more precise prognosis of breast cancer [3][4][5].
Among those prognostic factors, ER, PR, and HER2 status have been used for breast tumor subtypes classification in terms of heterogeneous clinical behavior and systematic therapy recommendations [6]. The tumor subtype based on ER, PR, and HER2 status has been validated in independent data set with significant differences in their clinical features even in Asian and European, early and metastatic breast cancer patients suggesting the robust classification [7][8][9][10].
In this study, we hypothesized that the association of breast cancer prognosis with common genetic variants may vary by breast tumor subtypes. This study aims to conduct a two-stage GWAS for disease-free survival (DFS) in breast cancer stratified by tumor subtypes defined by the ER, PR, and HER2 status and evaluate the performance of prognostic models that included genetic variants with well-known clinical factors.

Study Population
The Seoul Breast Cancer Study (SEBCS) is a multicenter-based case-control study of female breast cancer in Seoul, Korea as previously reported [29,30]. This two-stage GWAS included a total of 3,226 incident breast cancer cases. A total of 4,040 histologically confirmed breast cancer patients were recruited from Seoul National University Hospital (SNUH) and ASAN Medical Center (AMC) between 2001 and 2007. For the discovery stage, 2,273 breast cancer patients who had been participated in GWAS on breast cancer risk were selected with sufficient DNA samples and successful genotyping [29]. We excluded subjects who had a previous history of breast or other cancers before the recruitment (N = 67), were diagnosed with benign breast disease (N = 17), or had no clinicopathological information (N = 73). After those exclusions which were not mutually exclusive, the subjects with a metastatic disease (N = 30) on review of their medical records were additionally excluded and 2,111 subjects remained. For survival analysis, the subjects who had a follow-up loss or follow-up time of less than 90 days (N = 113) were excluded and 1,998 subjects (95% of 2,111 eligible subjects) remained. Among those subjects, a total of 1,732 incident breast cancer patients with known tumor subtypes were included in the discovery set in this study.
For the replication set, a total of 1,837 breast cancer patients were included comprised of 508 SEBCS participants who were not included in the discovery set and 1,329 newly recruited participants who were histologically confirmed as having breast cancer at SNUH between 2000 and 2008. Of those patients, 1,735 breast cancer patients whose DNA samples were sufficient in concentration and purity were successfully genotyped. After exclusion in common with the discovery stage (N previous history of cancers = 13, N benign breast disease = 4, N metastatic disease = 16, N follow-up time of less than 90 days = 86), 1,616 subjects remained. The subjects with unknown tumor subtypes were also excluded, and a total of 1,494 subjects were included in the replication set in this study.
All participants in this study provided written informed consent. The study design was approved by the Committee on Human Research of Seoul National University Hospital (IRB No. H-0503-144-004).

Tumor Subtypes
Information on ER, PR, and HER2 status was obtained from the medical records of patients' based on laboratory results and the interpretation of pathologists. The ER and PR status was determined with immunohistochemistry (IHC) test. When ER and/or PR tumor cells showed 10% or more expression by IHC, the hormone receptor (HR) status was considered positive. Otherwise, HR was considered negative when both ER and PR tumor cells showed less than 10% expression by IHC. The HER2 status was defined by IHC and fluorescence in situ hybridization (FISH) tests according to HercepTest criteria [31]. When using the IHC staining score of HER2, 0 or 1+ was regarded as negative, while 3+ was considered as positive. When the IHC staining score of HER2 was 2+, the HER2 status was estimated with the FISH test. Tumor subtypes were classified as ER and/or PR positive and HER2 negative (HR+ HER2-), ER and/or PR positive and HER2 positive (HR+ HER2+), ER and PR negative and HER2 positive (HR-HER2+), and ER and PR negative and HER2 negative (HR-HER2-) subtypes.

Genotyping and Quality Control
Genotyping was conducted using Affymetrix Genome-Wide Human SNP array 6.0 chip (Affymetrix, Inc.) and quality control steps ((a) a p-value<1.0×10 -6 for deviation from Hardy-Weinberg equilibrium (HWE), (b) a call rate<95%, (c) a minor allele frequency (MAF)<1%, (d) a p-value<1.0×10 -4 for differential missingness between cases and controls, and (e) multiple positioning and/or mitochondrial SNPs) were considered, as previously described [29]. Finally, a total of 555,525 genotyped SNPs remained in the discovery set. Moreover, an imputation of the SNPs based on the Han Chinese from Beijing and Japanese from Tokyo (CHB+JPT) data from the HapMap Phase II database (release 22) as a reference panel was done with the hidden Markov model using MaCH 1.0 [32]. Among the 2,416,663 inferred SNPs, 2,210,580 remained after excluding SNPs that had an imputation quality score (r 2 ) of <0.3 in the discovery set. When SNPs were genotyped as well as imputed, the information from the genotyped SNPs was used.
For the replication set, SNPs with a p-value less than 5.0×10 -6 and a MAF equal to or more than 10% for the per allele hazard ratio (HR) were selected from each tumor subtype in the discovery stage. A total of 10 lead SNPs that included other SNPs in linkage disequilibrium (LD, r 2 >0.4) at loci with multiple SNPs were selected for genotyping in the replication stage as follows: rs161041, rs2835688, rs9935088, and rs166870 in HR+ HER2-; rs1896346 and rs12940572 in HR+ HER2+; rs34073156 and rs10906761 in HR-HER2+, and rs10825036 and rs10862597 in HR-HER2-. Proxy SNPs, rs1081228 (r 2 = 0.98) and rs4750561 (r 2 = 1.00), were genotyped for rs166870 at 15q25 and rs10906761 at 10p31, respectively, because of the genotyping failure of the original ones. The LD metrics (r 2 ) of the selected SNP pairs were calculated using the SNP Annotation and Proxy Search (SNAP) based on HapMap release 22 in the CHB-+JPT population panel. When the selected SNP pairs showed LD (r 2 >0.4), SNPs with the lowest p-value were selected for the per-allele HR, which were genotyped with the Fluidigm 192.24 Dynamic Array. Integrated Fluidic Circuit (IFC) (Fluidigm Corp. South San Francisco, CA, USA) was used according to the manufacturer's instructions. When the selected SNPs failed genotyping, proxy SNPs were selected based on the LD metrics (r 2 ) and genotyped. The success rates for genotyping were greater than 99% for all replication SNPs.

Outcomes
The information on follow-up time, and recurrence status was obtained through retrospectively reviewing the patients' medical records. The DFS time was defined as the time from the initial breast cancer surgery to the time of recurrence which includes loco-regional recurrence, first distant metastasis, contralateral breast cancer, and second primary cancer. The breast cancer patients who did not have evidence on recurrence were censored at last follow-up until 2011.

Statistical Analysis
The associations between each SNP and DFS stratified by tumor subtypes were estimated with Cox proportional hazard models adjusted for age, recruiting center, and TNM stage. The hazard ratios (HRs) and 95% confidence intervals (CIs) per allele for each SNP were assessed in the additive model which was based on the number of rare alleles carried. The statistical significance of the associations was estimated with the p-value for the trend test with 1 degree of freedom. The analyses were done with the PLINK program version 1.07 and R 2.15.1 package (GenABEL and ProbABEL) and confirmed with SAS 9.3. To validate previously reported association, the SNPs identified from previous GWAS also analyzed. Using web-based Locus Zoom, regional association plots of the selected gene regions were generated. To estimate combined associations of the discovery and replication sets between SNPs and DFS, random-effects meta-analyses were done with STATA version 12.
A recursive partitioning analysis (RPA) of the prognostic factors was performed to classify breast cancer patients by distinguished groups based on the survival time [33]. The prognostic factors assessed by RPA were age, recruiting center, TNM stage, tumor subtype, and selected SNPs (rs166870 and rs10825036) were included. RPA was also done within specific tumor subtypes for those SNPs from the GWAS that were considered predictive factors. Kaplan-Meier graphs and HRs and 95% CIs of the Cox model are presented for the combined prognostic groups. Within each group, the probabilities of DFS and the percentage of breast cancer patients were measured. The predictive powers of survival models which included age, recruiting centers, TNM stage, and tumor subtypes with or without selected SNPs were calculated with Harrell's C statistics, and the differences between the predictive powers were estimated with the p-value expressed by the lincom command in STATA. All statistical analyses were done again among patients with TNM stage I-III as a sensitivity analysis, and a statistically significant level was a two-sided p-value of 0.05.

Characteristics of the Study Population
The characteristics of the 3,226 study participants and the associations with DFS are summarized in Table 1. The median follow-up time was 3.8 years (range, 0.3-8.0 years) in the discovery and 4.6 years (range, 0.3-8.5 years) in the replication sets. During the follow-up period, 214 (12.4%) patients in the discovery set and 164 (11.0%) patients in the replication set had events. Tumor size, nodal status, TNM stage, and tumor subtypes were statistically significant in associations with DFS in both the discovery and replication sets. The participants had a similar distribution for age, nodal status and ER and PR status, but a different distribution for tumor size, TNM stage, HER2 status, and breast tumor subtypes between discovery and replication sets (p-value<0.05 by Chi-square test). The characteristics including age, TNM stage, and tumor subtypes were not significantly different between remained and excluded subjects due to follow-up loss (data not shown). The characteristics of the study participants by tumor subtypes are presented in S1 Table.

Genome-Wide Association Study on Prognosis
The associations between previously identified SNPs through the GWAS of prognosis and DFS in the SEBCS by tumor subtypes are listed in S2 Table. Although none of those SNPs reported in the previous GWAS were further replicated in the overall breast cancer, 4 SNPs showed significant associations with DFS in the specific tumor subtypes.
Although there were no SNPs that reached a nominal genome-wide statistical significance (p-value<5.0×10 -8 ), a total of 10 SNP for DFS achieved p-values of 5.0×10 -5f in each subtype in the discovery set (Table 2). Among these SNPs, rs166870 in HR+ HER2-(p trend = 0.03) and rs10825036 in HR-HER2-(p trend = 0.06) had statistically marginal significance in the replication set ( Table 2). The regional plots for those two SNPs in associations with DFS in breast cancer for each subtype are shown in Fig 1. In combined analyses of the discovery and replication sets, those two SNPs had strong associations among breast cancer patients for each subtype (HR rs166870 = 2.30, 95% CI = 1.67-3.15, p trend = 2.88×10 -7 in HR+ HER2-and HR rs10825036 = 2.26, 95% CI = 1.34-3.81, p trend = 3.54×10 -7 in HR-HER2-, Table 2). The results were similar when breast cancer patients with TNM stage 0 were excluded (S3 Table). To identify the heterogeneity of the prognosis for those SNPs according to tumor subtypes, the associations with DFS for the other tumor subtypes of breast cancer were estimated, and they were not statistically associated with the other subtypes (Fig 2) and p-values for heterogeneity by tumor subtypes were statistically significant (p rs166870 <0.01 and p rs10825036 = 0.02 in combined set).

Prognostic Value of the Combined Markers of Clinical and Genetic Factors
RPA classified patients into distinct prognostic groups in each subtype shown in Table 3, which were significantly associated with the DFS of breast cancer in both the discovery and replication sets. The rs166870 (CC+CT or TT) was the second node among the HR+ HER2-patients after the TNM stage (0-II or III) (p discovery = 1.18×10 -8 and p replication = 2.08×10-5, Table 2), and  rs10825036 (TT+TG or GG) was the only node among the HR-HER2-patients showing significant differences between the groups (p discovery = 2.35×10 -4 and p relication = 2.60×10 -2 , Table 3). The similar results were presented when breast cancer patients with TNM stage 0 were excluded (S4 Table).

Discussion
From the two-stage GWAS, genetic factors that were associated with DFS in breast cancer were identified by tumor subtypes, and the prognostic values for the combined clinical and genetic factors were evaluated. The SNPs, rs166870 and rs10825036, showed a statistically significant association with DFS in the HR+ HER2-and HR-HER2-breast tumor subtypes, respectively, and these associations were not seen in the other tumor subtypes. They contributed to the prognostic models by improving the prediction of DFS within specific subtypes. We conducted a subtype-specific GWAS, unlike other previous studies that had conducted a GWAS for overall breast cancer before stratifying by ER, PR, and HER2 status because breast cancer is considered as a heterogeneous disease for which the prognosis varies across subtypes [34]. This intertumor heterogeneity is plausible in that breast cancer could originate from different cell types according to the tumor subtype [35] and is supported by previous studies showing the heterogeneous associations between SNPs and the prognosis of breast cancer by ER, PR and HER2 status [15,[18][19][20][21][22][23][24][25][26][27], in agreement with the current study (Fig 2). Another reason for the subtype-specific analyses was that breast cancer subtypes are considered as a predictor factor that distinguishes different responses to particular therapies among patients [36]. Because those differences in responses to particular therapies could have been a result of subtype-specific biological differences, the stratification of breast tumors by subtypes is necessary [37].
Among previously identified SNPs by GWAS for the prognosis of breast cancer, none of the SNPs were associated with DFS overall in this study (S2 Table). Of those SNPs that showed an association in the subtypes, rs3784099 and rs9934948 had been associated with the total mortality, for overall and ER+ breast cancer in Chinese women [15]. Although the association of SNP rs9934948 was not in the same direction as in this study, the reason for this might be because the tumor subtypes, specifically HR+ and HR-, had a different tumor biology from that of ErbB2, and the luminal subtypes showed entirely different up-regulated gene patterns even in the same organ relapse patients [38]. The other identified SNPs, rs1387389, rs2774307, and rs4778137 (especially in ER-), are associated with survival in European women, and the same directions for the estimates are shown in our patients [12,14]. The SNP rs4778137 is also significantly associated with the overall survival (OS) of breast cancer in Chinese women, even though it has not been replicated in the ER-subtype [15].
In the region surrounding the SNP rs166870, an acetylation of lysine 27 as an activation mark in the H3 histone protein (H3K27Ac) was observed by in silico analysis (S1A Fig). Rs166870 is close to the methenyltetrahydrofolate synthetase (MTHFS) gene, which is involved in folate mediated one-carbon metabolism. Although associations between SNPs in the MTHFS gene and the risk and prognosis of breast cancer have not been reported, an association has been reported between MTHFS variants and the prognosis of lung cancer [39], and also other one-carbon metabolism pathway genes are associated with the prognosis of breast cancer [22,[40][41][42]. One-carbon metabolism influences DNA methylation and synthesis [43], regulating Bcl-2/adenovirus E1B 19 kDa-interacting protein 3 (BNIP3). The loss of BNIP3 expression has been correlated with poor prognostic features such as lymph node metastasis, a higher mitotic activity index (MAI), and tubule formation in breast cancer [44]. Moreover, the MTHFS protein is known as a potential mediator of insulin-like growth factor-1 receptor (IGF-1R) dependent transformation [45]. Breast cancer patients, especially HR+, HER2-, and tumor patients with a Ki-6714%, who had a better score for IGF-1R expression had a higher survival [46]. SNP rs10825036 was also represented as a H3K27Ac mark by in silico analysis (S1B Fig). Although there are no studies on rs10825036, the SNP was weakly correlated with rs583012, which was associated with the c-reactive protein [47], and rs12256830 was associated with antibody levels [48]. Rs10825036 is close to the PCDH15 gene which encodes integral membrane proteins that mediate calcium-dependent cell-cell adhesion. Previously, the SNPs of the PCDH15 gene are known for associations with adverse events caused by chemotherapy in breast cancer [49] as well as with lipid abnormalities [50]. The lipid profiles have been associated with the risk, stage, and recurrence of breast cancer [51][52][53]. Moreover, lipids profiles have been distinguished between triple-negative and other breast tumor subtypes [54].
The predictive power of the combined model including rs166870 and rs10825036 identified from the two-stage GWAS, was more improved than that of the clinical model which did not include the SNPs. In previous multivariate survival models, Harrell's C statistics were estimated ranging from 0.69 to 0.82 according to the number and type of clinicopathological factors and the characteristics of the study population included in the models [55][56][57]. There were no SNPs whose c-indices were estimated, but the gene expression signatures improved the predictive powers when additionally included in multivariate clinicopathological models [58].
To assign the risk group according to the prognosis of breast cancer, clinical and genetic factors were combined and re-classified with RPA. From the results of the RPA, the genetic factors selected from the two-stage GWAS were more valuable when the analyses were stratified by tumor subtypes, and only one node of the genetic factors was statistically significant regardless of the clinical factors in HR-HER2-breast cancer. Therefore, prognostic markers that include the SNPs identified from the GWAS could be valuable in predicting the prognosis of breast cancer, particularly in specific tumor subtypes. This is the first study that conducted a two-stage GWAS by tumor subtypes based on HR and HER2 status. Furthermore, combined survival models that include genetic factors identified by the two-stage GWAS as well as other well-known clinical factors were evaluated for predicting the prognosis of breast cancer. The first limitation of this study was that the statistical significances of the associations from the two-stage GWAS did not reach a p-value<5.0×10 -8 as the nominal significance for the GWAS [59]. However, there have been a few GWAS on the prognosis of breast cancer, and none of the SNPs associated with the prognosis of breast cancer have had a nominal significance from the GWAS so far [11][12][13][14][15]. Second, the treatment information for breast cancer was not controlled in the analyses because of substantial missing data. Although the adjuvant chemotherapy and radiation did not affect associations of survival, the hormone therapy was associated with survival in the discovery set but not in the replication set (data not shown), which tended to depend on the tumor subtypes. All the analyses were adjusted or stratified by tumor subtypes instead of controlling for treatments.
It has been inconclusive whether genetic factors influence survival by intrinsic subtypes. In this analysis, the novel genetic markers including rs166870 and rs10825036 were associated with survival in HR+ HER2-and HR-HER2-tumors showing heterogeneity between tumor subtypes. The novel genetic markers identified in this study would be helpful to understand biological insights in heterogeneous breast cancer patients. Furthermore, RPA showed those genetic markers played a role in distinguishing between high and low risk groups of breast cancer patients. The combined prognostic markers that include the genetic markers and well-known clinical factors could be useful to predict the clinical outcome for breast cancer patients.
In conclusion, our two-stage GWAS identified two novel SNPs (rs166870 and rs10825036) associated with DFS in the HR+ HER2-and HR-HER2-subtypes, respectively. When these genetic factors were added to well-known clinical survival models that included age, TNM stage, and tumor subtype, improved predictive powers of the models were observed. Furthermore, our RPA showed that genetic factors had a role in distinguishing between high and low risk groups when using combined prognostic markers. To validate these results, further studies are needed to evaluate the predictive power of the survival models which include genetic factors as well as clinical factors.