Fine-Mapping of the 1p11.2 Breast Cancer Susceptibility Locus

The Cancer Genetic Markers of Susceptibility genome-wide association study (GWAS) originally identified a single nucleotide polymorphism (SNP), rs11249433, at 1p11.2 associated with breast cancer risk. To fine-map this locus, we genotyped 92 SNPs in a 900 kb region (120,505,799–121,481,132) flanking rs11249433 in 45,276 breast cancer cases and 48,998 controls of European, Asian and African ancestry from 50 studies in the Breast Cancer Association Consortium. Genotyping was done using iCOGS, a custom-built array. Because of the complicated nature of the region on chr1p11.2: 120,300,000–120,505,798, which lies near the centromere and contains seven duplicated genomic segments, we restricted analyses to 429 SNPs outside the duplicated regions (42 genotyped and 387 imputed). Per-allele associations with breast cancer risk were estimated using logistic regression models adjusting for study and ancestry-specific principal components. The strongest association observed was with the originally identified index SNP rs11249433 (minor allele frequency (MAF) 0.402; per-allele odds ratio (OR) = 1.10, 95% confidence interval (CI) 1.08–1.13, P = 1.49 x 10^-21). The association for rs11249433 was limited to estrogen receptor (ER)-positive breast cancers (test for heterogeneity P ≤ 8.41 x 10^-5). Additional analyses by other tumor characteristics showed stronger associations with moderately/well differentiated tumors and tumors of lobular histology. Although no significant eQTL associations were observed, in silico analyses showed that rs11249433 was located in a region that is likely a weak enhancer/promoter. Fine-mapping analysis of the 1p11.2 breast cancer susceptibility locus confirms that the risk association at this region is limited to ER-positive cancers.

Fine-scale mapping of the susceptibility regions identified by GWAS has the potential to further narrow down the relevant area of interest, identify additional risk SNPs, and predict potential functional mechanisms. Fine-mapping of the 1p11.2 locus among Chinese women (878 cases and 900 controls) identified a novel SNP, rs2580520, as a variant significantly associated with breast cancer risk, which was not identified in European women [26]. However, fine-mapping has not been performed at this locus in a large population of multi-ethnic women. The Collaborative Oncological Gene-environment Study (COGS) designed and executed a collaborative genotyping and fine-mapping effort utilizing a custom-built iSelect genotyping array (iCOGS) [8]. In this study we fine-mapped the 1p11.2 breast cancer susceptibility locus utilizing the data generated through iCOGS, using both genotyped and imputed SNPs from over 50 case-control studies within the Breast Cancer Association Consortium (BCAC). Further, we determined whether the associated SNPs displayed heterogeneity by tumor subtype defined by ER expression, as well as tumor grade and histology.

Study populations
Fifty breast cancer studies participating in the Breast Cancer Association Consortium (BCAC) were included in this analysis. The majority of the included studies were population-based or hospital-based case-control studies that included participants of European ancestry (41 studies), Asian ancestry (9 studies), and African ancestry (2 studies), totaling 45,276 breast cancer cases and 48,998 controls. Study participants were recruited under protocols approved by the Institutional Review Board at each institution, and all subjects provided written informed consent, as previously described [8]. For a list of all approving Institutional Review Boards by study, refer to Table A in S1 File.

SNP selection, genotyping and imputation
Genotyping and quality control (QC) measures used in COGS have been described elsewhere [8]. In brief, we excluded SNPs with call rates of < 95%, SNPs deviating from Hardy-Weinberg equilibrium in controls at P < 1 x 10^-7, and SNPs with more than 2% discrepant genotypes in duplicate samples across all COGS consortia. The 900 kb genomic region for fine-mapping of the 1p11.2 locus (chr1p11.2: 120,300,000-121,185,600; based on build hg19) included all known SNPs correlated (r^2 > 0.1) with the index variant rs11249433. In total, 92 genotyped SNPs from the iCOGS array satisfied the initial QC metrics above.
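The three exclusion criteria for genotyped SNPs can be sketched as a simple filter. This is only an illustration of the thresholds stated above, not the COGS pipeline itself; the record type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP QC summary (hypothetical field names, for illustration only)."""
    rsid: str
    call_rate: float              # fraction of samples with a called genotype
    hwe_p_controls: float         # Hardy-Weinberg test P-value in controls
    duplicate_discordance: float  # fraction of discrepant genotypes in duplicates


CALL_RATE_MIN = 0.95    # SNPs with call rate < 95% are excluded
HWE_P_MIN = 1e-7        # SNPs with HWE P < 1 x 10^-7 in controls are excluded
DISCORDANCE_MAX = 0.02  # SNPs with > 2% discrepant duplicate genotypes are excluded


def passes_qc(snp: SnpQc) -> bool:
    """Return True if the SNP survives all three exclusion criteria."""
    return (
        snp.call_rate >= CALL_RATE_MIN
        and snp.hwe_p_controls >= HWE_P_MIN
        and snp.duplicate_discordance <= DISCORDANCE_MAX
    )
```

A SNP is retained only if it passes every criterion, so a single failing metric is enough to drop it.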
We used imputation to estimate the genotypes at variants in the region not typed on the iCOGS array. Imputation was performed using IMPUTE2 [27], separately for each ethnic group. The IMPUTE2-info score and posterior probabilities at each SNP were used to evaluate imputation performance; scores ranged from 1 (high confidence) to 0 (no confidence). Markers with an IMPUTE2-info score < 0.9 or minor allele frequency (MAF) < 3% were excluded from the association analyses as unreliable. Imputed genotypes where the maximum probability was < 0.9 were considered unknown. For reference panels we used the International HapMap Project Phase 3 CEU data [28] and the 1000 Genomes Project June 2010 release [29].
Based on these reference panels, genotypes at an additional 4,279 SNPs were reliably imputed across the 1p11.2 region.
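As a rough illustration of the imputation filters described above, the sketch below applies the info-score and MAF thresholds to a marker and converts posterior genotype probabilities into a call or an unknown. It is not IMPUTE2 code. It assumes the 3% MAF threshold is a lower bound and that probabilities are ordered as 0, 1 and 2 copies of the counted allele; the function names are hypothetical.

```python
from typing import Optional, Sequence

INFO_MIN = 0.9       # markers with IMPUTE2-info score < 0.9 are excluded
MAF_MIN = 0.03       # assumption: markers with MAF below 3% are excluded
CALL_PROB_MIN = 0.9  # genotypes with maximum probability < 0.9 are unknown


def keep_imputed_snp(info: float, maf: float) -> bool:
    """Marker-level filter on imputation quality and allele frequency."""
    return info >= INFO_MIN and maf >= MAF_MIN


def hard_call(probs: Sequence[float]) -> Optional[int]:
    """Return the most probable genotype (0, 1 or 2 copies), or None if unknown.

    probs = (P(0 copies), P(1 copy), P(2 copies)) for one individual.
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= CALL_PROB_MIN else None
```

For example, posterior probabilities of (0.02, 0.95, 0.03) give a heterozygous call, whereas (0.5, 0.45, 0.05) would be treated as unknown.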
After reviewing the 1p11.2 genomic region using the UCSC Genome Browser, we observed that several SNPs mapped to multiple genomic regions due to duplication of genomic segments on both sides of the centromere on chromosome 1 [30,31]. Therefore, we restricted our analysis to the region of 1p11.2 that excluded the following duplicated genomic segments: chr1:120531871-120697156; chr1:120747157-120936695; chr1:121086699-121133098; chr1:121160483-121222841; chr1:121280229-121351595; chr1:121361172-121418375; and chr1:121418377-121472478. The final analysis was therefore based on 42 genotyped and 387 imputed SNPs across ~210 kb of genomic sequence. We also investigated data for a previously described SNP noted to be associated with risk in Asian populations, rs2580520 [26]. Unfortunately, the rs2580520 SNP was not genotyped in the iCOGS effort, is not curated in the 1000 Genomes database that was used for imputation of the 1p11.2 region (www.1000genomes.org/data [32]), and falls within the duplicated region noted above.
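The restriction to non-duplicated sequence amounts to an interval filter on SNP positions. The sketch below uses the seven segments listed above; treating both endpoints as inclusive is an assumption, and the function names and the (rsid, position) input format are hypothetical.

```python
# Duplicated segments on chr1 (hg19 coordinates as listed in the text).
DUPLICATED_SEGMENTS = [
    (120531871, 120697156),
    (120747157, 120936695),
    (121086699, 121133098),
    (121160483, 121222841),
    (121280229, 121351595),
    (121361172, 121418375),
    (121418377, 121472478),
]


def in_duplicated_segment(pos: int) -> bool:
    """True if a chr1 position falls inside any listed segment (endpoints inclusive)."""
    return any(start <= pos <= end for start, end in DUPLICATED_SEGMENTS)


def restrict_to_unique(snps):
    """Keep only (rsid, position) pairs that lie outside the duplicated segments."""
    return [(rsid, pos) for rsid, pos in snps if not in_duplicated_segment(pos)]
```

With only seven intervals a linear scan is sufficient; an interval tree would be worth considering only for much larger exclusion lists.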

Statistical analysis
The LD structure based on the 1000 Genomes CEU data was visualized using the R package snp.plotter [33]. A line graph was constructed displaying likelihood ratio statistics for recombination hotspot using SequenceLDhot software based on the background recombination rates inferred by PHASE v2.1. Physical locations of SNPs were based on hg19 and gene annotation and the LD plot was based on the NCBI RefSeq genes from the UCSC Genome Browser [34].
Standard logistic regression models, with the common allele as the referent, were used to assess the association of all genotyped and imputed SNPs with breast cancer risk. All analyses (overall and for breast cancer subtypes) used a 1-degree-of-freedom test (additive model) to estimate the per-allele odds ratio (OR) for the variant allele and the corresponding 95% confidence interval (CI) for each SNP. Association analyses were adjusted for study and eight eigenvectors to capture population structure, obtained from principal component analyses [8]. P-values for trend from the Wald test are reported; imputed SNPs were handled using the estimated allele dose. To identify SNPs independently associated with breast cancer risk within the 1p11.2 locus, we conducted forward stepwise logistic regression analysis separately for each ethnicity (European, Asian and African), conditioning on rs11249433, the top SNP originally identified in CGEMS and the top SNP at this locus in iCOGS. After identifying a novel independent signal, stepwise logistic regression analyses were repeated conditioning on the newly identified SNP rs146784183. Bonferroni-adjusted significance was set at P < 7 x 10^-5, corrected for 4,371 SNPs.
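To make the reported quantities concrete, the sketch below shows how a per-allele OR, its 95% CI and a two-sided Wald P-value follow from a fitted log-odds coefficient and its standard error, and how an allele dose can be derived from imputed genotype probabilities. It is a minimal illustration, not the authors' analysis code; the function names are hypothetical, and the dose definition (expected number of copies of the counted allele) is the conventional one, assumed here.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def allele_dose(probs):
    """Expected allele count from (P(0 copies), P(1 copy), P(2 copies))."""
    return probs[1] + 2.0 * probs[2]


def wald_summary(beta, se):
    """Per-allele OR, 95% CI bounds and two-sided Wald P-value.

    beta: log-odds coefficient for the SNP term in the additive model
    se:   its standard error
    """
    odds_ratio = math.exp(beta)
    ci_low = math.exp(beta - Z_95 * se)
    ci_high = math.exp(beta + Z_95 * se)
    z = beta / se
    # Two-sided P-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return odds_ratio, ci_low, ci_high, p_value
```

In the additive model the SNP enters as a single numeric term (genotype count or dose), which is why the test has one degree of freedom.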
To determine if there were differences in the associated effects of the independent signals on different subtypes of breast cancer among women of European ancestry, we conducted stratified analyses according to subtypes defined by: 1) tumor histology (ductal/mixed, lobular, other), 2) tumor grade (well-differentiated, moderately-differentiated, poorly-differentiated), and 3) ER status (ER-positive or ER-negative). To determine if SNP associations varied significantly between defined subtypes of breast cancer, we fitted polytomous logistic regression models, and P-values for heterogeneity were obtained from case-case analysis for tumor subtypes (ER, tumor grade and tumor histology). Meta-analyses were performed using the random effects model to estimate the I^2 statistic and P-value for heterogeneity by study.

In silico functional analysis and eQTL data
To evaluate any possible functional implications of our top-associated SNPs, we assessed in silico functional data and expression quantitative trait loci (eQTL). Utilizing the UCSC Genome Browser and HaploReg v3 we reviewed ENCODE data to determine potentially altered regulatory motifs. RegulomeDB v1.1 was used to query publicly available eQTL data from multiple cell types associated with the identified SNPs and select SNPs significantly correlated to the tag SNP rs11249433.

Results and Discussion
Fine-scale mapping of the 1p11.2 locus
Following quality control and genomic restrictions, a total of 429 SNPs (42 genotyped and 387 imputed) were examined for their association with breast cancer risk. Fig 1 shows the genotyped and imputed SNPs analyzed in European women, plotted against corresponding chromosomal positions within 1p11.2. Gene annotations within this genomic region, including the NOTCH2 gene, and the degree of linkage disequilibrium between the SNPs, are also shown in Fig 1.

Fig 1. Regional plots of breast cancer association in 1p12-11.2. Regional plot of association results, recombination hotspots and linkage disequilibrium for the 1p12-11.2:120,505,799-121,481,132 breast cancer susceptibility locus. Association results from a trend test, as -log10 P values (y axis, left; red diamond, the top ranked breast cancer associated locus in the region; blue diamond, best conditioned analysis results conditioned on rs11249433; black diamonds, genotyped SNPs; gray diamonds, imputed SNPs), are shown for the SNPs according to their chromosomal positions (x axis). Linkage disequilibrium structure based on the 1000 Genomes CEU data (n = 85) was visualized by snp.plotter software. The line graph shows likelihood ratio statistics (y axis, right) for recombination hotspots by SequenceLDhot software based on the background recombination rates inferred by PHASE v2.1. Physical locations are based on hg19. Gene annotation was based on the NCBI RefSeq genes from the UCSC Genome Browser.

Of the 429 SNPs, 136 SNPs were associated with breast cancer risk overall in European women at P < 5 x 10^-8 (Table B in S1 File and S1 Fig). The most significant association with breast cancer risk was observed for the previously identified rs11249433 SNP (MAF 0.402; per-G-allele OR = 1.10, 95% CI 1.08-1.13, P = 1.49 x 10^-21, Table 1) [12]. To test for the existence of additional independent signals within the 1p11.2 locus, we conducted forward stepwise logistic regression analyses conditioning on the top SNP rs11249433. A second signal was identified corresponding to an imputed SNP rs146784183 (MAF 0.101; per-A-allele OR = 0.88, 95% CI 0.85-0.91, P = 1.27 x 10^-5 after adjustment for rs11249433, Table 1). SNP rs146784183 was not strongly correlated with the index SNP (r^2 = 0.086); it is located 57 kb telomeric of rs11249433 and closer to the NOTCH2 gene. Stepwise regression analyses conditioning on both rs146784183 and rs11249433 did not identify any additional independent signals at this locus (Table C in S1 File). Meta-analyses demonstrated that association results were similar across studies for both rs11249433 (I^2 = 0%, P-het study = 0.844) and rs146784183 (I^2 = 6.7%, P-het study = 0.351).

Association analysis by estrogen receptor status in European women
We next determined whether risk associations at the 1p11.2 locus varied by estrogen receptor (ER) status; the associations observed were limited to ER-positive breast cancers (rs11249433: per-G-allele OR = 1.12, 95% CI 1.10-1.15, P-het = 9.88 x 10^-9; rs146784183: per-A-allele OR = 0.86, 95% CI 0.82-0.89, P-het = 8.41 x 10^-5; Table 2 and Table D in S1 File). Associations for these two SNPs among ER-negative breast cancers were null (rs11249433: per-G-allele OR = 1.00, 95% CI 0.95-1.05, P = 0.90; rs146784183: per-A-allele OR = 0.99, 95% CI 0.92-1.06, P = 0.68; Table 2 and Table D in S1 File). Meta-analyses stratified by estrogen receptor status demonstrated that association results were similar across studies for both rs11249433 (ER-positive: I^2 = 0%, P-het study = 0.846) and rs146784183 (ER-positive: I^2 = 0%, P-het study = 0.524).

Table footnotes: Per-allele odds ratios (OR) and 95% confidence intervals (95% CI) were estimated from logistic regression adjusted for study site and 7 principal components in Europeans and 2 principal components in women with Asian ancestry. The common allele was the referent for calculating odds ratios; the G allele for both rs11249433 and rs146784183. f: Odds ratios (OR) and 95% confidence intervals (95% CI) were estimated from logistic regression mutually adjusted for rs146784183 and top SNP (rs11249433) along with study site and 7 principal components.

Association analysis by tumor grade and histology in European women
Assessment of risk associations by tumor grade showed that SNP rs11249433 was significantly associated with risk for well-differentiated tumors (per-G-allele OR = 1.18, 95% CI 1.14-1.23) and moderately-differentiated tumors (per-G-allele OR = 1.13, 95% CI 1.10-1.16), but not poorly-differentiated tumors (per-G-allele OR = 1.02, 95% CI 0.98-1.05; P-het = 8.90 x 10^-11, Table 2 and Table E in S1 File).
Similarly, SNP rs146784183 showed significant associations for well- and moderately-differentiated tumors, but not poorly-differentiated tumors (P-heterogeneity = 8.80 x 10^-4, Table 2 and Table E in S1 File). Results were similar when assessing heterogeneity by tumor grade only among ER-positive breast cancers; there were no significant associations for poorly-differentiated ER-positive tumors (Table F in S1 File).
Differential risk associations for rs11249433 were also seen by tumor histology, where associations were strongest for lobular tumors (per-G-allele OR = 1.28, 95% CI 1.22-1.35; P-het = 7.60 x 10^-11), and less so for ductal/mixed or other tumor histology (Table 2). Significant risk differences by tumor histology were not observed for SNP rs146784183 (P-heterogeneity = 0.11), though the risk reduction associated with this SNP was strongest for lobular tumors (Table 2). Of the 160 genotyped and imputed SNPs found to be significantly associated with lobular breast tumors at a Bonferroni-adjusted P < 7 x 10^-5, 127 (79%) were also associated with ductal/mixed tumors, and only 30 (19%) were also associated with tumors of other histology (Table G in S1 File).

Analysis of index SNPs in different ethnic groups
We also examined breast cancer risk associations among participants in the nine case-control studies that included women of Asian ancestry (Table 2, S1 Fig and Table H in S1 File). The degree of linkage disequilibrium between the SNPs in this region was examined using HapMap data (S2 Fig).
The top SNP among European women, rs11249433, was also associated with breast cancer risk among Asian women (per-G-allele OR = 1.19, 95% CI 1.04-1.36; P = 0.01, Table 1 and S1 Fig). Although this SNP is rare in this population (MAF = 0.037), the OR was consistent with that in Europeans. SNP rs146784183 was also associated with breast cancer risk among Asian women (per-A-allele OR = 0.89, 95% CI 0.82-0.96; P = 0.002, Table 1 and S1 Fig).
The most strongly associated SNP within the Asian population, genotyped SNP rs115775083, was significantly associated with breast cancer risk overall within the Asian population (per-T-allele OR = 1.78, 95% CI 1.43-2.20, P = 1.52 x 10^-7, Table H in S1 File). The rs115775083 genotyped SNP is a rare variant among Asian women, with a MAF of 0.011. This variant is also rare among European women (MAF = 0.016) but not associated with breast cancer risk (per-T-allele OR = 0.95, 95% CI 0.88-1.02, P = 0.15). SNP rs115775083 is not correlated with the rs11249433 and rs146784183 SNPs identified to be associated with breast cancer risk in European women (r^2 < 0.01). Conditioning on the top SNP identified in women of Asian ancestry did not identify any novel signals within the 1p11.2 locus, but did reaffirm SNP rs115775083 as the most significant signal among Asian women (Table I in S1 File). Similar analyses were performed among women with African ancestry using data from two BCAC studies (N = 378 cases and N = 254 controls). No SNPs within the 1p11.2 locus were significantly associated with breast cancer risk after adjusting for multiple comparisons (Table J in S1 File and S1 Fig).

In silico functional and eQTL data
SNP rs11249433 was strongly correlated with one other SNP, rs12134101 (r^2 = 0.943), which showed a similar association with risk (both for overall and ER-positive breast cancer). All other SNPs were less strongly associated with risk (likelihood ratio < 1:1000 relative to rs11249433), suggesting that one or both of SNPs rs11249433 and rs12134101 are likely to be causally implicated in breast cancer risk.
In silico analyses showed that SNP rs11249433 is located within a weak enhancer and weak promoter in myoblasts and leukemia cells, respectively. This SNP is also located within a region of DNase I hypersensitivity and histone H3K27 acetylation in multiple cell types, including T47D and MCF7 breast cancer cell lines. There were no proposed regulatory motifs altered by SNP rs146784183, and neither rs11249433 nor rs146784183 was found to have any significant eQTL associations.
In this large-scale fine-mapping analysis of nearly 50,000 breast cancer cases and 50,000 controls within the Breast Cancer Association Consortium (BCAC), we found index SNP rs11249433 to be the strongest signal within the 1p11.2 locus associated with breast cancer risk in European women. An additional association signal was identified, rs146784183, that was independent of the index SNP for overall breast cancer risk. Neither signal was found to be significantly associated with breast cancer risk among women with Asian or African ancestry, after adjusting for multiple comparisons. Notably, rs11249433 and rs146784183 displayed significant heterogeneity in risk associations by important tumor characteristics including ER status, tumor grade and histology. Our findings highlight the value of fine-mapping analyses to identify novel risk associations, and the utility of performing large-scale genotyping projects within varied ethnic populations to aid in narrowing down the genomic area relevant to future functional analyses.
Fine-mapping the 1p11.2 locus was complex due to the proximity to the centromere and the presence of duplicated genomic segments. As such, we employed strict quality control measures to increase our likelihood of finding true association signals. In this study we have identified SNP rs146784183 as a novel independent signal within the 1p11.2 locus among European women. SNP rs146784183 and the index SNP were only weakly correlated (r^2 = 0.086), and this newly identified SNP is located about 57 kb telomeric of the index SNP, and closer to the NOTCH2 gene.
Our findings concur with previous research identifying rs11249433 as a SNP displaying heterogeneity by important tumor characteristics including ER status and histology. Specifically, rs11249433 was found to be more strongly associated with tumors of lobular histology and those that were ER-positive [20][21][22]. Further, we have recently shown that this SNP was more strongly associated with tumors having low E-cadherin breast tissue expression compared to E-cadherin high tumors [22]. Our current and previous findings for SNP rs11249433 are consistent given that expression of the E-cadherin tumor suppressor protein is frequently lost in tumors of lobular histology.
We did not identify any eQTL signals for either rs11249433 or rs146784183. In silico analyses showed that rs11249433 is situated in a DNase I hypersensitive region which contains open chromatin with histone marks, suggesting that this region might be a weak enhancer in some cell types [35]. SNP rs11249433 is located upstream of the NOTCH2 gene on chromosome 1. The NOTCH signaling pathway has been frequently implicated in breast cancer development, though the exact function of NOTCH2 in this process is not well characterized [36][37][38][39][40]. Interestingly, the NOTCH2 gene was shown to be associated with super-enhancers, or large clusters of transcriptional enhancers that drive expression of genes that function in the acquisition of hallmark capabilities in cancer [41]. Dysregulation of the NOTCH signaling pathway has been implicated in breast cancer initiation and progression; this pathway is also considered a target for novel therapeutics [36][37][38][39][40]. Consequently, though rs11249433 is located within a weak enhancer, it is plausible that it participates in transcriptional regulation through the function of a larger super-enhancer that contributes to tumor pathology.
In the current study we did not perform functional analyses; however, in a study of 180 breast tumors, Fu and colleagues found that carriers of the risk genotypes of rs11249433 (AG/GG) had increased mRNA expression of the NOTCH2 gene [20][21][22]. Further, expression of NOTCH2 was highest in ER-positive/TP53 wild-type tumors. This study supports the potential regulation of NOTCH2 gene expression by SNP rs11249433 and, in turn, is in line with our observation that this SNP is specific to ER-positive breast tumors. In a separate study of NOTCH2 protein expression in breast cancer, NOTCH2 levels were found to be high in well-differentiated tumors and low in poorly-differentiated tumors [42]. If, as suggested by Fu et al. [21], rs11249433 contributes to the increased expression of NOTCH2, the observation by Parr and colleagues that NOTCH2 is highest among well-differentiated tumors supports our findings for low grade, well-differentiated tumors. However, without direct experimental evidence, it is difficult to determine the functional implications of these SNPs with certainty. While it is possible that these two variants (rs11249433 and rs146784183) influence different genes, the patterns of association with breast cancer sub-types suggest that they may affect similar biological and/or signaling processes.
Our analyses in a diverse population of women showed that the top association signals found in European women showed similar associations in women of Asian ancestry, although associations were weaker. However, no significant signals were observed among women with African ancestry. These findings support what has been previously shown in multi-ethnic studies of the 1p11.2 locus [24,25,43]. Among Asian women, a rare variant, SNP rs115775083, was found to be the strongest association signal for breast cancer overall. This region of chromosome 1 and its association with breast cancer has been examined among Chinese women. Jiang and colleagues assessed the association of seven tagging SNPs, including rs11249433, within a 277 kb region of 1p11.2 [26]. In the Jiang study, the authors observed borderline significant associations of rs11249433 with breast cancer risk in their population of Chinese women. However, given that this SNP is rare among women with Asian ancestry, the absence of a significant association is likely due to decreased power caused by insufficient numbers of cases harboring the risk allele. rs115775083, the top SNP among Asian women in our population, was not included among the seven SNPs assessed in the Jiang study [26]. We were unable to replicate the findings from Jiang et al. [26], which identified rs2580520 as a significant association signal among Chinese women. The rs2580520 SNP was not genotyped as part of the iCOGS effort, is not found in the 1000 Genomes Project Phase 1 data [32] which was used for imputation, and maps to a suspected duplicated region. These data illustrate the challenges of genotyping this complex region. Though no significant signals were found among women with African ancestry, the regional plots of association among European, Asian and African women suggest that the relevant area of interest for future studies lies within the interval spanning chr1p11.2: 121,105,799-121,405,799.
The strengths of our study are in analysis of a very large data set, which includes subjects of European, Asian and African ancestry; and availability of detailed genetic and tumor pathology data, which allowed us to refine these risk associations by pathologic subtypes of breast cancer. Moreover, the findings observed in this pooled analysis did not differ significantly by study. Our study was limited by the available genomic information of the 1p11.2 region. However, the genomic map of the peri-centromeric region that harbors our region of interest was significantly improved in the latest build of the reference human genome. Due to this improvement, some genomic gaps were filled and some new pseudogene transcripts were mapped in the region; this could potentially increase SNP coverage and improve fine-mapping quality.

Conclusions
In summary, we showed that the 1p11.2 locus is specific for ER-positive breast cancers and provided data to narrow the relevant area of interest for future functional studies, which should provide further insights into the underlying causal SNPs responsible for its association with breast cancer.

S1 File
Table A: List of studies and ethics approvals.
Table B: Genotyped and imputed SNPs at the 1p11.2 locus associated with overall breast cancer risk at genome-wide significance (p < 5 x 10^-8) among European women in BCAC.
Table C: Genotyped and imputed SNPs at the 1p11.2 locus associated with breast cancer risk at genome-wide significance (p < 5 x 10^-8) after conditioning on SNP rs146784183 among European women in BCAC.
Table D: Genotyped and imputed SNPs at the 1p11.2 locus associated with ER-positive breast cancer risk at Bonferroni-adjusted significance (p < 7 x 10^-5) and the corresponding association with ER-negative breast cancer risk among European women in BCAC.
Table E: Genotyped and imputed SNPs at the 1p11.2 locus associated with well differentiated breast cancer risk at Bonferroni-adjusted significance (p < 7 x 10^-5) and the corresponding association with moderately differentiated and poorly differentiated breast cancer risk among European women in BCAC.
Table F: Two independent association signals at the 1p11.2 locus: Association results for breast cancer risk among European women in BCAC, by tumor characteristic.
Table G: Genotyped and imputed SNPs at the 1p11.2 locus associated with lobular breast cancer risk at Bonferroni-adjusted significance (p < 7 x 10^-5) and the corresponding association with ductal or mixed and other breast cancer risk among European women in BCAC.
Table H: Top 5 SNPs at the 1p11.2 locus and their association with breast cancer risk among women with Asian ancestry in BCAC.
Table I: Results of association analyses after conditioning on the top SNP (rs115775083) identified among women with Asian ancestry.
Table J: Two independent association signals at the 1p11.2 locus, lack of association with breast cancer risk among women with African ancestry.