Exploring the Link between Germline and Somatic Genetic Alterations in Breast Carcinogenesis

Recent genome-wide association studies (GWASs) have identified candidate genes contributing to cancer risk through low-penetrance mutations. Many of these genes were unexpected and, intriguingly, included well-known players in carcinogenesis at the somatic level. To assess the hypothesis of a germline-somatic link in carcinogenesis, we evaluated the distribution of somatic gene labels within the ordered results of a breast cancer risk GWAS. This analysis suggested frequent influence on risk of genetic variation in loci encoding for “driver kinases” (i.e., kinases encoded by genes that showed higher somatic mutation rates than expected by chance and, therefore, whose deregulation may contribute to cancer development and/or progression). Assessment of these predictions using a population-based case-control study in Poland replicated the association for rs3732568 in EPHB1 (odds ratio (OR) = 0.79; 95% confidence interval (CI): 0.63–0.98; Ptrend = 0.031). Analyses by early age at diagnosis and by estrogen receptor α (ERα) tumor status indicated potential associations for rs6852678 in CDKL2 (OR = 0.32, 95% CI: 0.10–1.00; Precessive = 0.044) and rs10878640 in DYRK2 (OR = 2.39, 95% CI: 1.32–4.30; Pdominant = 0.003), and for rs12765929, rs9836340, rs4707795 in BMPR1A, EPHA3 and EPHA7, respectively (ERα tumor status Pinteraction<0.05). The identification of three novel candidates as EPH receptor genes might indicate a link between perturbed compartmentalization of early neoplastic lesions and breast cancer risk and progression. Together, these data may lay the foundations for replication in additional populations and could potentially increase our knowledge of the underlying molecular mechanisms of breast carcinogenesis.


Introduction
With the advent of technical and methodological advances, several GWASs identifying common genetic variation associated with risk of developing cancer have been completed recently [1]. Thus, initiatives such as the National Cancer Institute's Cancer Genetic Markers of Susceptibility (CGEMS) and efforts carried out by deCODE Genetics and the Breast Cancer Association Consortium have led to the identification of breast cancer risk alleles in single nucleotide polymorphisms (SNPs) replicated across populations [2][3][4][5][6]. Intriguingly, illustrating the unbiased nature of GWASs, most hits have corresponded to a priori unexpected candidate genes. In this context, the involvement of biological processes beyond the canonical DNA damage response in breast cancer is further suggested by the observed differential influence of low-penetrance risk alleles among BRCA1 and BRCA2 mutation carriers [7][8][9].
A potential common characteristic of the unexpected lowpenetrance susceptibility genes is the previously identified contribution to tumorigenesis, but at the somatic level. Common genetic variation in loci encoding for FGFR2 and MAP3K1 influences risk of breast cancer [2,4], and these genes were previously found to be somatically mutated in diverse neoplasias including breast cancer [10,11]. In addition, and central to the understanding of cancer progression, common risk alleles showed differential influence according to ERa tumor status [12], and variation in the locus encoding for ERa, ESR1, also influences risk of breast cancer [13,14]. More recently, additional breast cancer susceptibility loci have been described that include CDKN2A/B as candidates [15]. While these observations suggest a ''germlinesomatic'' link in breast carcinogenesis, an analogous situation may exist for other neoplasias. Variation in loci encoding for CDH1 and SMAD7 influences risk of colorectal cancer [16,17] and, similarly, these genes were previously identified as inactivated or deregulated in tumors [18][19][20][21]. Moreover, deregulated germline expression of a paradigmatic proto-oncogene, MYC, may be a common mechanism of tumorigenesis in epithelial tissues [22][23][24][25]. However, despite some evidence of a germline-somatic link, as yet there is no explicit evaluation of this hypothesis and its potential usefulness in replication studies. Here we present an examination of this link through analysis of the CGEMS GWAS breast cancer dataset and subsequent assessment of the predictions in a case-control study of incident breast cancer in Poland.

Distribution of somatic gene sets in ordered breast cancer GWAS results
Previously, analysis of the CGEMS GWAS dataset using the lowest genotypic P value per gene locus suggested true associations in genes annotated with Gene Ontology (GO) biological process terms linked to somatic events [26,27]. However, since there is a positive correlation between the extension of a given locus and the number of SNPs it may contain (and, therefore, the possibility of significant association results being obtained by chance), an unadjusted GWAS rank is biased at its lowest P values for specific processes in which large gene products frequently participate [26,28,29] (Fig. 1A). Nevertheless, cancer genes tend to expand across large genomic regions [30], and examination of eight genes likely involved in breast cancer through low-penetrance mutations-CASP8, COX11, ESR1, FGFR2, LSP1, MAP3K1, RAD51L1 and TOX3 [2][3][4][5][6]13,14]-showed a trend for larger genomic loci (mean (x) genomic extension = 211 kilo bases (kb) and standard deviation (s) = 283 kb; compared to x = 66 kb and s = 128 kb for all annotated genes in the CGEMS GWAS rank).
Having identified caveats to the ranking of GWAS results, we performed 10,000 permutations of case-control status and used the null distribution of t statistics from the age-adjusted partial correlation analysis to correct the original rank, which then showed an unbiased distribution (Fig. 1B). Prior to the evaluation of somatic sets, analysis of GO biological process terms in the GWAS permutation P values rank did not show any significant asymmetry using the Gene Set Enrichment Analysis (GSEA) tool [31] with multiple testing correction by the false discovery rate (FDR) approach [32]. Nonetheless, most processes with nominally significant P values were those previously highlighted, which are associated with somatic events [26,27] (Table S1). This observation appears to agree with recently described results of pathwaybased analysis of the same GWAS dataset [33].
Next, evaluation of somatic sets related to cancer prognosis and treatment response prediction, and to genetic and genomic alterations (see Materials and Methods), revealed significant asymmetrical distribution of ''driver kinases'' [34,35]; that is, kinases whose deregulation through frequent somatic mutation contributes to tumor development and/or progression (''driver mutations''). In contrast, ''passenger mutations'' were defined as essentially neutral and linked to the inherent genetic instability in cancer cells [34,35]. Thus, the driver kinases set was found to be biased towards the top (nominal significant association results) of the GWAS permutation rank (GSEA nominal P,0.001; FDRadjusted P = 0.010) ( Fig. 1C and Table S2). Among the remaining of somatic sets evaluated, only cooperation response genes (CRGs) to oncogenic mutations [36] showed a trend for a distribution similar to that of driver kinases (GSEA nominal P = 0.080; FDRadjusted P value = 0.25) (Fig. 1D), although the intersection between both sets only contained two genes (Table S2). Therefore, in somatic cancer genes, common genetic variation in driver kinase loci might frequently influence risk of breast cancer.
The set of driver kinases contained a benchmark gene, FGFR2 [2,4], and a locus recently replicated in an independent study, BMPR1B [37]. Nevertheless, a significant bias was still observed following exclusion of these two loci (GSEA nominal P = 0.001; FDR-adjusted P = 0.048), which suggests that variation at additional driver kinase loci influences risk of breast cancer. Importantly, using the set of non-driver kinases-either the subsequent equivalent set as originally statistically ordered or the total set (n = 344) [35]-did not reveal significant bias (GSEA nominal P = 0.99 and 0.66, respectively), which reinforces the idea of frequent involvement of driver kinases. However, if only the individual statistical data for each locus were considered, most of the driver kinase loci would perhaps not have been selected for replication in other populations.

Independent association results for common variation in driver kinase loci
Given the possible bias in GWAS rank identified above, we examined the top 20 driver kinase variants in the original rank (Table S3, including details of the CGEMS and results below) in a case-control study of incident breast cancer in Szczecin (Poland), previously used in other replications [38]. Applying genotyping quality controls and Hardy-Weinberg equilibrium analysis, 16 SNPs representing an identical number of driver kinase loci (i.e., a single SNP for each locus and representing the strongest potential statistical association) were examined for their association with risk of breast cancer using 880 controls and 1,173 cases (see Materials and Methods). In this analysis, the rs3732568 variant in the ephrin type-B receptor 1 (EPHB1) locus was found to be associated with risk of breast cancer: OR = 0.79, 95% CI: 0.63-0.98; P trend = 0.031 (Table 1). Further evaluation of this association through 10,000 case-control permutations in our study gave a similar significance value, P trend = 0.034. Importantly, this association was in the same direction and with similar magnitude to the result in the CGEMS GWAS: age-adjusted OR = 0.78, 95% CI: 0.64-0.94; P trend = 0.009.
While deregulated expression or function of EPHs and EPH receptors is thought to play a critical role in the initial stages of epithelial neoplasia [39,40], recent analysis of early breast cancer expression changes suggests a link between disruption of cell adhesion and extracellular matrix pathways, and the risk of developing breast cancer [41]. Analysis of this recent dataset also revealed an early expression change of EPHB1, between normal breast tissue and atypical ductal hyperplasia (Fig. 2). This alteration consisted of infra-expression in hyperplasia, akin to its potential role in the compartmentalization of early neoplastic lesions [42]. Together, association studies, early expression changes in carcinogenesis and the regulation of cell adhesion suggest the involvement of EPHB1 in risk of breast cancer.
Having potential differences by ERa tumor status, we next examined associations in ERa-positive and -negative breast cancer patients. Thus, rs3732568 in EPHB1 showed a similar influence on either type of breast cancer (Table 3)-which is consistent with an overall significant association-and rs12765929 in BMPR1A and rs9836340 in EPHA3 showed a potential major impact on the risk of ERa-negative breast cancer (P for difference in OR (interaction) by ERa status ,0.05), while rs4707795 in EPHA7 showed a differential effect between ERa-negative versus ERa-positive breast cancer risk (P interaction = 0.007) ( Table 3). None of these additional candidates linked to ERa tumor status, or those linked to an early age of diagnosis above, showed significant expression differences at early stages of breast carcinogenesis as EPHB1. On the other hand, the remaining SNPs examined in this study after applying quality controls and Hardy-Weinberg equilibrium analysis (i.e., 10 out of 16), did not show significant associations following CGEMS evidence (Table S3). Together, the gene-set based analysis of GWAS data and the subsequent replication attempt might indicate that common genetic variation in specific driver kinase loci, and particularly in EPH receptor genes, influence risk of breast cancer.

Discussion
Evaluation of a germline-somatic link in breast carcinogenesis suggests a role for driver kinases and, perhaps to a lesser extent, genes with a synergistic response to oncogenic mutations. This study might be limited by the assignment of the lowest genotypic P value per gene locus within a defined genomic window (i.e., Figure 1. GWAS ranks and distribution of cancer somatic gene sets. A, Original GWAS results ranked according to the lowest genotypic association test P value per gene locus (unadjusted for genomic extension; taken SNPs in defined genomic window of 610 kb relative to the first and last exons of a given gene). The Y-axis indicates the number of SNPs per gene locus while the X-axis indicates the lowest association P value per gene locus. Bias can be appreciated as the number of SNPs per gene locus increases at lower P values. B, GWAS results ranked according to the lowest association P value per gene locus but adjusted by genomic extension through case-control permutations. Compared to the previous graph, the bias largely disappears. C, Following the rank in B, the Y-axis indicates odds ratios (ORs) of allele effects and density distributions of gene sets (driver kinases correspond to a light lilac curve; the rest of the genome in the GWAS dataset is shown by a dark lilac curve), while the X-axis indicates the logtransformed association P values, previously adjusted by genomic extension. As indicated by the density curves, SNPs mapping to driver kinase loci are relatively more frequent at lower association adjusted P values. This observation is supported by GSEA results using the same CGEMS GWAS adjusted rank; nominal P,0.001 and FDR-adjusted P = 0.010 (Table S2). D, Similarly to the graph in C, distribution of CRGs in the CGEMS GWAS rank adjusted through permutations. doi:10.1371/journal.pone.0014078.g001 610 kb)-thus excluding a large proportion of variation that cannot be assigned to a specific known gene-and by its focus on the additive model of influence of risk alleles when adjusted through case-control permutations. Future analyses taking into account the potential perturbation of germline gene expression by, for example, common variation at distant regulatory regions may improve the identification of susceptibility genes using GWAS complete data. Another limitation in the interpretation of the results presented here may lie in the case-control study designs: the CGEMS addressed breast cancer risk in postmenopausal women, while the Polish study was relatively enriched in early-onset cases. Therefore, studies in additional populations, with diverse designs, are warranted to corroborate the results shown here.
The results of the replication study may be consistent with previously detected somatic genetic alterations and/or functional roles. Somatic mutations in CDKL2 were nonsense and were only detected in breast and ovarian cancer cell lines or tumors [11,35]. CDKL2 (also known as p56 or KKIAMRE) is the most distant member of the CDC2-related serine/threonine protein kinase family, involved in epidermal growth factor signaling [44], but with a mostly uncharacterized function. DYRK2 was found to be mutated in breast and central nervous system tumors, in nonsense and missense alterations, respectively [11,35]. The functional role of DYRK2 in the DNA damage response [45] may link to CGEMS GWAS results for RAD51L1 [3]: loss of DYRK2 function alters the activation of apoptosis in response to DNA damage via ATM [45], which may therefore promote carcinogenesis.
Having revealed potential associations linked to known somatic alterations, the most striking results of this study may concern the identification of risk alleles at three EPH receptor loci. EPHmediated signaling regulates important biological process altered in carcinogenesis, such as cell-to-cell communication, and cell migration and adhesion via the actin cytoskeleton [39,40]. Thus, through RHO and RAS/MAPK activities [46], this signaling pathway has been implicated in the maintenance of epithelial tissue architectures and is therefore thought to act as a tumor suppressor [39,40]. These observations may indicate that, similarly to colorectal tumorigenesis [42], EPH-mediated compartmentalization of early breast tissue neoplastic lesions is critical to prevent the subsequent emergence of carcinoma. Therefore, through a germline expression or functional perturbation, EPHB1 may contribute to the observed variability in the transition from an in situ lesion to an invasive carcinoma [47]. While the associations revealed here warrant further replication in other populations, the existing data could potentially increase current knowledge of the genetic basis and molecular mechanisms of breast carcinogenesis.

CGEMS dataset
The National Cancer Institute CGEMS initiative has conducted genome-wide association studies to identify common genetic variants and the corresponding functionally affected genes involved in breast cancer and prostate cancer susceptibility. An initial CGEMS whole genome scan was designed to study the main effect of SNPs on breast cancer risk in postmenopausal women [2]. The study involved 1,145 invasive postmenopausal breast cancer cases and 1,142 matched controls from the Nurses' Health Study nested case-control study [48]. Results of the CGEMS GWAS of breast cancer were obtained upon approval of a Data Access Request.

GWAS rank
In our previous analyses [26,27], ordered CGEMS GWAS results (i.e., ranks) corresponded to the lowest P value per gene for the genotypic test in a genomic region of +/210 kb at each gene locus, defined by the Ensembl human genome release 57. Assigned SNPs were curated using Ensembl gene annotations. We [26] and others [28] noted that such ranks were biased along with the genomic extension-and therefore with the number of SNPs-per gene locus. To adjust for this bias, several statistical strategies are possible [28], including carrying out permutations of the case-control status to correct the significance of the original statistic. In our analysis, considering typed and informative SNPs in each gene locus, we first chose the maximum absolute value of the t statistic from the ageadjusted partial correlation in the additive model. Next, 10,000 permutations of the same informative SNPs were performed to create a null distribution for this maximum t statistic, which was used to assess its significance corrected by number of SNPs.

GSEA application
The distribution of gene sets in ranked GWAS results was examined using the non-parametric algorithm in the GSEA tool, with default values for all parameters [31] except for the set size when appropriated. In GSEA, a pre-defined gene set is mapped to a rank-in our case genes/loci ordered according to the adjusted association statistic-to assess potential bias using an enrichment score that reflects the degree to which this set is overrepresented at the extremes of the entire ranked list. In the interpretation of the results, caution should be taken when considering sets of different size. In our study, different hypotheses were examined independently (i.e., gene sets linked to prognosis, prediction or genetic/ genomic somatic alterations), and P values were corrected for multiple testing within each group : 1) genes whose expression in primary breast tumors was associated with patient prognosis and/ or metastasis [49][50][51][52][53][54][55]; 2) genes whose expression in primary breast tumors was associated with patient therapeutic treatment response [56][57][58][59]; 3) genes whose expression levels differed according to ERa breast tumor status or grade [60], or in response to 17bestradiol [61]; and 4) genes with somatic genetic and/or genomic somatic alterations (Table S2). This last group was made up of five sets : i/ driver kinases (conditional probability of containing driver mutations .0.70, n = 119 as defined previously [35], of which 95 were uniquely mapped in the GWAS rank); ii/ CRGs to oncogenic mutations [36]; iii/ cancer gene census, somaticallymutated only [62,63]; iv/ genes affected by somatic chromosomal rearrangements and/or fusions [64]; and v/ amplified and overexpressed cancer genes [65] (Table S2).

Gene expression analysis
Raw expression microarray data on breast cancer progression [41] were downloaded from the Gene Expression Omnibus reference GSE16873 and normalized with robust multiarray average (RMA) [66] and significance analysis was performed using the significance analysis of microarray (SAM) algorithm [67].

Study samples in Poland and association study
A case-control study of unselected invasive breast cancer collected between 1996 and 2003 in Szczecin (Poland) was analyzed. The series included 976 cases of breast cancer unselected for age and an additional group of 367 cases of breast cancer diagnosed at age 50 or below. Therefore, the series was enriched for early-onset cases: mean age of diagnosis was 52.4 years (range 19-88). Subjects were unselected for family history and 15% of cases reported a first-or second-degree relative with breast cancer. The participation rate exceeded 70% among women with breast cancer invited to enroll. Collected information included year of birth, age at diagnosis of breast and/or ovarian cancer, tumor bilaterality, family history (first-and second-degree relatives with breast and/or ovarian cancer) and tumor pathological features in .80% of cases (ERa and progesterone receptor status, and grade). Cases were also examined for BRCA1 founder mutations in Poland [68] and, if positive,  Supporting Information