Genetic susceptibility markers for a breast-colorectal cancer phenotype: Exploratory results from genome-wide association studies

Background Clustering of breast and colorectal cancer has been observed within some families and cannot be explained by chance or known high-risk mutations in major susceptibility genes. Potential shared genetic susceptibility between breast and colorectal cancer, not explained by high-penetrance genes, has been postulated. We hypothesized that as yet undiscovered genetic variants predispose to a breast-colorectal cancer phenotype. Methods To identify variants associated with a breast-colorectal cancer phenotype, we analyzed genome-wide association study (GWAS) data from cases and controls that met the following criteria: cases (n = 985) were women with breast cancer who had one or more first- or second-degree relatives with colorectal cancer, men/women with colorectal cancer who had one or more first- or second-degree relatives with breast cancer, and women diagnosed with both breast and colorectal cancer. Controls (n = 1769) were unrelated, breast and colorectal cancer-free, and frequency-matched to cases on age and sex. After imputation, 6,220,060 variants were analyzed using the discovery set, and variants associated with the breast-colorectal cancer phenotype at P<5.0E-04 (n = 549, at 60 loci) were analyzed for replication (n = 293 cases and 2,103 controls). Results Multiple correlated SNPs in intron 1 of the ROBO1 gene were suggestively associated with the breast-colorectal cancer phenotype in the discovery and replication data (most significant: rs7430339, Pdiscovery = 1.2E-04; rs7429100, Preplication = 2.8E-03). In meta-analysis of the discovery and replication data, the most significant association remained at rs7429100 (P = 1.84E-06). Conclusion The results of this exploratory analysis did not find clear evidence for a susceptibility locus with a pleiotropic effect on hereditary breast and colorectal cancer risk, although the suggestive association of genetic variation in the region of ROBO1, a potential tumor suppressor gene, merits further investigation.


Introduction
Population-based studies have revealed a strong clustering of breast and colorectal cancer within some families [1,2]. This clustering has given rise to speculation that there are "breast-colon" cancer susceptibility genes with variants that predispose to co-occurrence of breast and colorectal cancer [3][4][5]. While co-occurrence of these two common cancers in some families could be due to chance, there is evidence that members of families with segregating germline mutations in BRCA1, BRCA2 and CHEK2 genes (which are associated with breast cancer) are also at moderately increased risk of colorectal cancer [6][7][8][9][10]. Similarly, families with germline mutations in the MLH1 or MSH2 genes responsible for Lynch syndrome, or with mutations in the LKB1/STK11 gene responsible for Peutz-Jeghers syndrome, are known to cosegregate the two cancers [11,12]. However, these relatively rare syndromes cannot explain all the observed familial clustering of breast and colorectal cancer [2].
Daley et al., using a sib-pair analysis of a genome-wide linkage scan of 33 families with a breast-colorectal cancer phenotype, detected multiple linkage peaks, one of them in the region of BRCA2. They also detected other novel linked regions, including D17S1308 on chromosome 17p, in close proximity to the candidate gene hypermethylated in cancer 1, HIC1 [13]. Two recent population-based studies have also found suggestive evidence for a positive genetic correlation between colorectal and breast cancer. Lindstrom and colleagues quantified genetic correlation between different cancer types and found shared heritability of 0.22 for breast and colorectal cancer (P = 0.01) [14]. Similarly, Yu and colleagues used the Swedish Family-Cancer Database to identify a modestly increased risk of breast cancer among families of probands affected with colorectal cancer, after excluding cases with a known hereditary predisposition [15]. Further evidence to support the hypothesis that there are additional susceptibility genes is the observation that many families/persons fit our definition of a breast-colorectal cancer phenotype (clustering of breast and colorectal cancer in at least two first- or second-degree relatives in a family, or persons affected with synchronous or metachronous breast and colorectal cancers), but do not have mutations in the known breast or colorectal cancer genes. These families/persons, however, often exhibit features of an inherited predisposition, such as cancer diagnosis at a young age, a Mendelian inheritance pattern, and presence of multiple cancers such as both breast and colorectal cancers in the same person [3]. Such families/persons are suggestive of as yet undiscovered susceptibility genes for breast and colorectal cancers and continue to be a challenge for clinical geneticists and their patients because of the difficulty in estimating cancer risk for relatives and defining strategies for future surveillance.
The identification of genetic markers associated with the risk of a breast-colorectal cancer phenotype has clinical implications for prevention and screening/surveillance guidelines. Further, it could aid in identifying families/persons who are at high-risk of an inherited predisposition for breast and colorectal cancer, but do not have any of the known high-penetrance mutations. Therefore the aim of this study was to identify novel susceptibility markers for this understudied phenotype using a genome-wide association study (GWAS) that included discovery and replication phases. We hypothesized that this phenotype is clinically distinct from known hereditary breast and colorectal cancer predisposition syndromes and that unique susceptibility genes influence genetic predisposition to a breast-colorectal cancer phenotype, putting some families and persons at increased risk of both breast and colorectal cancer.

Discovery Phase
Data sources. The primary sources were the Colon Cancer Family Registry (CCFR) and the Breast Cancer Family Registry (BCFR), both established to support studies on the etiology, prevention, and clinical management of colorectal and breast cancer, respectively. The CCFR is an international consortium of six sites in North America and Australia for which recruitment of colorectal cancer case families and controls occurred between 1998 and 2012 [16]. The BCFR is a collaboration of six sites in North America and Australia for which recruitment commenced in 1996 [17]. Both CFRs used standardized protocols to collect blood and tissue samples, and questionnaires to collect information about family history and personal and environmental risk factors. By design, both registries are enriched for families with multiple cancer-affected family members.
GWAS data. Three separate GWASs were undertaken by the CCFR and BCFR from which data for our study were sourced. All subjects were non-Hispanic white. For the CCFR GWASs, 1189 population-based colorectal cancer cases and 986 unrelated population-based controls were genotyped in Phase 1 in 2009, and 825 cases and 825 same-generation family controls were genotyped in Phase 2 in 2010. The CCFR Phase 1 samples were genotyped on an Illumina Human1M v1 and/or Illumina Human 1M-Duo v3.0 single nucleotide polymorphism (SNP) array, and the Phase 2 samples on the Illumina HumanOmni1-Quad v1.0 array (~50% overlap with the 1M array used in Phase 1) [18,19]. The BCFR GWAS consisted of population-based case women diagnosed with invasive breast cancer before age 51 years, and women controls between the ages of 20 and 51 years with no history of breast cancer [20]. Genotyping was performed using Illumina 610 Quad and Illumina Cyto12 SNP BeadChip arrays. All GWAS studies excluded cases with a known or likely hereditary predisposition syndrome. The CCFR GWASs excluded colorectal cancer cases with familial adenomatous polyposis; with microsatellite unstable tumors; with tumors for which immunohistochemistry revealed loss of DNA mismatch repair protein; with an MYH mutation; or with a known deleterious mismatch repair gene mutation. The BCFR GWAS excluded breast cancer cases with known BRCA1 or BRCA2 gene mutations.
The data were accessed by submitting a proposal and obtaining approval from the CCFR/BCFR Data Access Committees (CCFR: http://www.coloncfr.org/collaboration; BCFR: http://www.bcfamilyregistry.org/for-researchers/initiate-collaborations). The CFR data can be similarly accessed by researchers with appropriate approvals.
Participating studies. All cases and controls for the Discovery Phase were selected from the CCFR or BCFR GWASs and included data from five CCFR and three BCFR sites (Table 1). Cases were defined as: 1) women diagnosed with breast cancer who had one or more first- or second-degree relatives with colorectal cancer; or 2) persons diagnosed with colorectal cancer who had one or more first- or second-degree relatives with breast cancer; or 3) persons diagnosed with both breast and colorectal cancer, irrespective of which cancer was diagnosed first and the time between the two cancer diagnoses. The BCFR GWAS included only women younger than 51 years. Controls were sex- and age-matched (within 5 years) to the cases, unrelated to cases, and not known to have a personal or family history of breast or colorectal cancer. Based on these criteria, 1,078 cases and 2,001 controls were used in the Discovery Phase.
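The three case definitions and the control criterion above can be expressed as a simple eligibility rule. The following is a minimal sketch, not the registries' actual selection code: the `Person` record and its field names are hypothetical, relatives are assumed to be already restricted to first- or second-degree, and the age and sex matching of controls is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical record; field names are illustrative assumptions."""
    sex: str                               # "F" or "M"
    has_breast_cancer: bool
    has_colorectal_cancer: bool
    relative_with_breast_cancer: bool      # first- or second-degree relative
    relative_with_colorectal_cancer: bool  # first- or second-degree relative


def is_case(p: Person) -> bool:
    # Definition 3: both cancers in the same person, in any order.
    if p.has_breast_cancer and p.has_colorectal_cancer:
        return True
    # Definition 1: woman with breast cancer and a relative with colorectal cancer.
    if p.sex == "F" and p.has_breast_cancer and p.relative_with_colorectal_cancer:
        return True
    # Definition 2: person with colorectal cancer and a relative with breast cancer.
    if p.has_colorectal_cancer and p.relative_with_breast_cancer:
        return True
    return False


def is_eligible_control(p: Person) -> bool:
    # No personal or family history of breast or colorectal cancer.
    return not (
        p.has_breast_cancer
        or p.has_colorectal_cancer
        or p.relative_with_breast_cancer
        or p.relative_with_colorectal_cancer
    )
```

Checking the double-cancer definition first keeps the three rules independent of one another, so a person who meets more than one definition is still counted once.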
The study was approved by the MD Anderson Institutional Review Board (IRB) and by the respective IRBs of the data providing CFR sites where written informed consent was obtained from all participants prior to collecting the data.
Quality control. Standard quality control (QC) measures were implemented on the CCFR and BCFR GWAS datasets [20]. For our study the CCFR dataset consisted of 1,528,306 SNPs (612 cases and 999 controls), and the BCFR dataset consisted of 1,265,521 SNPs (391 cases, 788 controls). For each independent dataset, we applied filters to exclude SNPs with After filtering, 465,717 CCFR SNPs and 1,214,531 BCFR SNPs remained for a sample of 1,003 cases and 1,787 controls. The datasets from both CFRs were merged and the combined BCFR and CCFR data had 433,277 common SNPs. Additional QC filters were applied for relatedness (a first- or second-degree relative inferred by pairwise allele sharing estimates of identity by descent) and for mismatch between called and phenotypic sex. There were no sex-discrepant individuals, but two cases and four controls were removed due to relatedness (PI_HAT > 0.15 after the PLINK IBS test). For this filtered dataset, principal components analysis was performed to remove population outliers using PLINK v1.07 (detailed documentation of the PLINK commands can be found at http://zzz.bwh.harvard.edu/plink/strat.shtml#cluster) [21]; 16 cases and 12 controls were removed because they were 4 or more standard deviations from the centroid (|Z| > 4), and 2 additional controls were removed as outliers after running the smartpca script in Eigenstrat [22]. After these exclusions, 985 cases and 1769 controls remained. After removal of the outliers, the PC eigenvectors were recalculated, and the first 6 principal components were included in the association analyses.
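As an illustration of the sample-level QC steps, the sketch below tallies the exclusions reported in this paragraph and flags samples lying more than 4 standard deviations from the mean on any principal component. The outlier rule is a simplified per-component Z-score filter written for illustration; it is an assumption and does not reproduce the PLINK or smartpca procedures, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def flag_outliers(pc_scores, threshold=4.0):
    """Return indices of samples with |Z| > threshold on any component.

    pc_scores holds one list of principal-component scores per sample.
    Simplified per-component rule (an assumption), not PLINK's own test.
    """
    flagged = set()
    for k in range(len(pc_scores[0])):
        column = [row[k] for row in pc_scores]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # a constant component cannot identify outliers
        for i, value in enumerate(column):
            if abs((value - mu) / sd) > threshold:
                flagged.add(i)
    return sorted(flagged)


def remaining(start, *removed):
    """Samples left after subtracting each exclusion count."""
    return start - sum(removed)


# Counts taken from the text: relatedness, |Z| > 4 outliers, smartpca outliers.
cases_after_qc = remaining(1003, 2, 16)            # 985
controls_after_qc = remaining(1787, 4, 12, 2)      # 1769
```

The tally agrees with the 985 cases and 1769 controls reported in the abstract and in the Results.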
Discovery analysis. Genome-wide association analyses to identify genetic loci associated with a breast-colorectal cancer phenotype were computed by logistic regression, in an additive genetic model (per allele additive trend test), adjusting for study (CCFR or BCFR), age, sex and six PCs. The analyses for the imputed data were performed using ProbABEL (version 0.4.1; release date August 29, 2013) [25]. We generated quantile-quantile plots and calculated genomic inflation factors to estimate the inflation in test statistics arising from any systematic causes of bias.
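The genomic inflation factor mentioned above is conventionally computed as the median observed 1-degree-of-freedom chi-square statistic divided by the median expected under the null. A minimal sketch follows, assuming two-sided P-values strictly between 0 and 1 as input; the function name is hypothetical and this is not the code used in the study.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the squared 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Genomic inflation factor (lambda) from two-sided P-values in (0, 1]."""
    # A two-sided P-value p corresponds to the statistic z**2, where
    # z = inv_cdf(p / 2); using the lower tail avoids rounding 1 - p/2 to 1.
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value near 1 indicates little systematic inflation of the test statistics; values well above 1 suggest residual population stratification or other bias.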

Replication Phase
The top 549 SNPs associated with the breast-colorectal cancer phenotype at P<5.0E-04 across 18 chromosomes in the Discovery Phase were selected for replication using the colorectal cancer GWAS studies that are part of the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO). Details regarding studies participating in GECCO, GWAS genotyping, imputation, and QC have been described elsewhere [19]. Using the same case and control definitions as in the Discovery Phase, 293 cases and 2,103 controls from three GWAS datasets (Women's Health Initiative [WHI] 1, WHI 2, and VITamins And Lifestyle cohort [VITAL]) were eligible for inclusion in the Replication Phase (Table 1). All cases had a diagnosis of colorectal cancer with a family history of breast cancer in a first- or second-degree relative. Moreover, all cases and controls were non-Hispanic white women because both WHI cohorts consisted of women only, and family history of breast cancer for men in the VITAL cohort was not reported.
Replication analysis was performed using a marginal logistic regression model for each study, followed by a meta-analysis of the study level results. Age, center and three PCs were included in the model.

Meta-analysis of the discovery and replication phases
Association P-values of the discovery data (CCFR and BCFR) and replication data (GECCO) were meta-analyzed using the Genome-wide Association Meta-Analysis Software, GWAMA (v. 2.1), which uses an inverse variance method [26].
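For readers unfamiliar with the inverse variance method, the following minimal sketch shows a fixed-effects combination of per-study log odds ratios and standard errors. It illustrates the general technique only and is not GWAMA's implementation; the function name and any input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects inverse-variance meta-analysis.

    betas: per-study effect estimates (e.g. log odds ratios per allele).
    standard_errors: matching per-study standard errors.
    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    two_sided_p = 2.0 * NormalDist().cdf(-abs(z))
    return pooled_beta, pooled_se, two_sided_p
```

Because weights scale with precision, a large discovery set contributes more to the pooled estimate than a smaller replication set with the same effect size.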

Overlap with variants detected by breast and colorectal GWASs
Published GWAS SNPs associated with breast cancer (n = 99) and with colorectal cancer (n = 39) for Europeans/non-Hispanic whites were extracted from HaploReg version 3 [accessed on 3/20/18] [27]. We examined results of association analysis for these GWAS SNPs in our Discovery Phase dataset.

Overlap with functional elements in the genome
Encyclopedia of DNA Elements data were queried to assess overlap with potentially functional genomic characteristics such as DNase I hypersensitivity sites, transcription factor regulatory regions and enhancer elements. Similarly, RegulomeDB [28] and HaploReg version 4 [27] were queried to assess overlap of the SNPs with regulatory genomic regions.

Results
Data for 985 cases and 1769 controls from 5 CCFR sites and 3 BCFR sites were analyzed in the Discovery Phase (Table 1). The CCFR dataset consisted of 589 cases of colorectal cancer with a family history of breast cancer and 13 cases with breast and colorectal cancer (n = 602 cases). Similarly, the BCFR dataset included 378 breast cancer cases with a family history of colorectal cancer and 5 women with breast and colorectal cancer (n = 383 cases). All cases and controls were non-Hispanic whites. Most participants were women (72.9% of cases and 73.2% of controls). The controls were older than the cases in the overall sample (mean ages 51.5 versus 49.4 years), although the BCFR cases and controls were younger than the CCFR cases and controls (Table 1). The genomic inflation factor was λ = 1.02 (see quantile-quantile plot, S1 Fig).
The most significant association signal from the genome-wide association analysis was for rs12548629 on chromosome 8q22.3 (P = 2.5E-07), an intronic SNP in BAALC (Table 2 and S2 Fig). We found no genome-wide significant variants (at P<5.0E-08). Using a lower threshold (P<5.0E-04), we identified 60 suggestive regions/loci across the genome, on multiple chromosomes (chromosomes 1-12 and 16-20). The most significant SNP associations at each locus are listed in Table 2. A majority of the associated variants were in non-coding or intergenic regions. SNPs likely to affect binding or gene expression are indicated by a low score from RegulomeDB [28]. Two potentially functional SNPs, rs11666622 (3'UTR) and rs1468348 (RegulomeDB score 1f), were annotated because they were in linkage disequilibrium (r² > 0.9) with the SNP with the smallest P-value in the region. We investigated these loci further using the replication dataset for potential association with a breast-colorectal cancer phenotype. Results of association testing for SNPs with P<5.0E-04 are provided in S1 Table.
Replication of the most significant SNP associations (n = 549 SNPs) in 60 regions on 18 chromosomes was performed using 293 cases and 2,103 controls from 3 studies in GECCO. All cases and controls were women and the mean ages were 66.4 and 67.7 years, respectively (Table 1). The replication analysis of the 549 SNPs revealed multiple correlated SNPs in the 3p12, 9p13.3, and 18q12.2 regions that were nominally associated at P<0.05 (most significant SNP, rs7429100, P = 2.8E-03), along with three independent SNPs on 5p13.3, 9p22.2 and 20p11.23 (Table 3). The signal on chromosome 8 around the BAALC gene did not replicate (P = 0.54).
The meta-analysis of discovery and replication results did not identify a genome-wide significant signal (smallest meta P was for rs7429100 on chromosome 3p12.3, P = 1.84E-06; Table 3). The strongest suggestive association from the combined dataset (on 3p12.3) was for several highly correlated SNPs in the region of the roundabout guidance receptor 1 gene, ROBO1 (P ranging between 1.8E-06 and 9.6E-06; Table 3). The ROBO1 signal was driven by 53 highly correlated SNPs (r²: 0.78-1.0) in intron 1.
The in silico functional analysis of the ROBO1 SNPs displayed a range of altered binding motifs: rs9878764 had protein binding activity with CEBPB (CCAAT/Enhancer Binding Protein Beta), but no breast or colorectal tissue specific expression quantitative trait loci (eQTL) were associated with any of the ROBO1 SNPs (S4 Table). However, ROBO1 gene expression is fairly ubiquitous across multiple human tissues (S3 Fig), and ROBO1 is frequently mutated across almost all cancer types (S4 Fig), including breast and colon.

Discussion
In this study, we analyzed GWAS data from the Colon and Breast Cancer Family Registries in cases enriched for family history of breast and colorectal cancer, and controls, to agnostically identify novel markers of genetic susceptibility for the joint breast-colorectal cancer phenotype. Our cases were diagnosed at a younger age (mean age, 49.4 years), which coupled with their cancer family histories, favors the likelihood of a genetic predisposition. Our main findings include a suggestive association of a cluster of SNPs in ROBO1 with the breast-colorectal cancer phenotype, although none of the SNPs were genome-wide statistically significant. There has been long interest and debate around possible genetic susceptibility to a distinct breast-colorectal cancer phenotype. The clustering of breast and colorectal cancer in families was described as early as in 1972 by Lynch et al. who found that some families where many members had breast cancer also had a high predisposition to colorectal cancer [29]. Subsequently, the idea of a distinct hereditary breast and colorectal cancer phenotype (HBCC) was proposed by Meijers-Heijboer and colleagues, when they found a significant association of the CHEK2 1100delC mutation with HBCC using a subset of familial breast cancer families that did not carry the BRCA1 or BRCA2 mutations [10]. However, a larger study did not confirm the HBCC syndrome as a separate entity linked to the CHEK2 1100delC mutation, suggesting that HBCC could be due to chance or yet undiscovered genes [4]. A similar message was conveyed in a commentary by Lipton and colleagues, who suggested that other than the known clinical syndromes, many breast-colorectal cancer families probably result from chance clustering of two common cancers or through a genetic predisposition to one of the cancers and chance co-occurrence of the other. 
However, they acknowledged that there are families that present with evidence of genetic disease not accounted for by known genes or chance. They also posited that it may be difficult to identify potential breast-colorectal cancer genes, suggesting a candidate gene analysis approach as the most suitable (at the time, in the pre-GWAS era) [3]. GWAS data have now allowed us to look beyond candidate genes and use genetic variation across the genome to try to identify breast-colorectal cancer susceptibility loci.
In the present study, no genome-wide significant loci were detected in the Discovery Phase, but we found suggestive associations across multiple chromosomes, the most promising on chromosome 8q22.3, overlying the BAALC gene. However, of the 549 SNPs at 60 loci tested for replication, the BAALC SNPs did not replicate, whereas P-values <5.0E-03 were found for several correlated SNPs in the ROBO1 gene region on chromosome 3p12.3. Although suggestive, the association signal did not meet the Bonferroni multiple-testing threshold (P<9.1E-05 for testing 549 SNPs, or P<8.3E-04 if considering 60 loci). Nevertheless, the cluster of SNPs in intron 1 of the ROBO1 gene also had the smallest P-values in the meta-analysis of the discovery and replication datasets (smallest P = 1.84E-06; rs7429100).
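The two Bonferroni thresholds quoted above follow from dividing a family-wise error rate by the number of tests. The sketch below reproduces them, assuming a family-wise alpha of 0.05 (implied by the reported thresholds rather than stated explicitly) and using the replication P-value reported for rs7429100.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise error rate

snp_threshold = bonferroni_threshold(ALPHA, 549)    # about 9.1E-05
locus_threshold = bonferroni_threshold(ALPHA, 60)   # about 8.3E-04

replication_p = 2.8e-03  # rs7429100 in the replication data
passes_snp_level = replication_p < snp_threshold
passes_locus_level = replication_p < locus_threshold
```

Under either correction the replication P-value stays above the threshold, which is why the ROBO1 signal is described as suggestive rather than replicated.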
ROBO1 merits follow-up as it may have a functional role in breast-colorectal carcinogenesis. A transmembrane receptor of the immunoglobulin family, ROBO1 interacts with SLIT2 (Slit Guidance Ligand 2) to regulate many biological functions, is differentially expressed in human cancers, and has a possible role as a tumor suppressor gene [30]. Studies have found that low ROBO1 expression is an adverse prognostic factor for breast cancer [31,32] and might play a role in the pathogenesis of colorectal cancer [33].
Evidence in support of ROBO1 as a susceptibility gene for a hereditary breast-colorectal cancer (HBCC) phenotype comes from a study by Villacis and colleagues [34]. The aim of their study was to identify genomic alterations (copy number variations, CNVs) related to cancer predisposition in patients with a suggestive HBCC phenotype who did not carry high risk mutations in the major genes known to be implicated in hereditary breast or colorectal cancer (i.e., patients who met HBCC criteria as defined by Naseem et al. [4]). The authors identified a ROBO1 germline deletion in intron 4, spanning 37.470 kb (chr3:78,990,568-79,028,038 hg18), in three unrelated cases out of 113 HBCC patients. Pathogenicity of the deletion was supported by familial co-segregation with disease, its rarity in public CNV databases, and in silico evidence of the deletion having a functional role due to the presence of several enhancers and a histone marker in the deleted region. Notably, the authors reported that direct sequencing did not reveal any pathogenic point mutations in ROBO1. In our data, the association signal for ROBO1 comprised a cluster of 53 SNPs in intron 1, spanning 87.866 kb (chr3:79,703,548-79,791,414 hg18), and the SNP closest to the deletion was located 675.51 kb away from it. Furthermore, unlike the rare deletion identified by Villacis and colleagues [34], these were common SNPs with a minor allele frequency ranging between 0.39 and 0.46.
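The spans and the separation quoted above can be checked directly from the hg18 coordinates. A minimal sketch, assuming span is taken as end minus start (the convention that matches the reported figures); the helper names are hypothetical.

```python
# hg18 coordinates on chromosome 3, as quoted in the text.
deletion = (78_990_568, 79_028_038)     # germline deletion, Villacis et al.
snp_cluster = (79_703_548, 79_791_414)  # 53-SNP association cluster


def span_bp(interval):
    """Length of an interval in base pairs, computed as end minus start."""
    return interval[1] - interval[0]


def gap_bp(a, b):
    """Distance between two intervals; 0 if they touch or overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


deletion_kb = span_bp(deletion) / 1000          # 37.47
cluster_kb = span_bp(snp_cluster) / 1000        # 87.866
separation_kb = gap_bp(deletion, snp_cluster) / 1000  # 675.51
```

The two regions therefore do not overlap, and the nearest associated SNP lies well outside the deleted segment.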
Of the published GWAS SNPs associated with colorectal cancer risk and breast cancer risk, our analysis of the combined breast-colorectal dataset found that only a few SNPs were nominally significant at P<0.05. The SMAD7 SNP rs4939827, identified as a colorectal cancer risk SNP [35], was the most significantly associated SNP in our breast-colorectal data (P = 1.85E-04), with a consistent direction of association for the risk allele (T). To our knowledge, there is no published evidence of association of this SNP with breast cancer risk, although a role of SMAD7 in the modulation of cancer growth and progression has been suggested for many cancers, including breast cancer [36]. Among the GWAS SNPs known to be associated with breast cancer risk, notable were two FGFR2 SNPs (rs298175 and rs2981582) [37,38] that were nominally significant in our breast-colorectal data. Genetic alterations in FGFR2 have been found to be associated with cancers other than breast, but we have not found any association of FGFR2 SNPs with colorectal cancer in the published literature.
Unlike the present study, which used a familial clustering approach to identify cases, other studies have applied a meta-analysis approach to large GWAS datasets to identify common genetic susceptibility variants across multiple cancers. For example, using colorectal cancer and endometrial cancer genome-wide data for ~13,000 cases unselected for age of disease onset or family history, and ~40,000 controls, Cheng and colleagues identified two novel polymorphisms, rs3184504 in the SH2B3 gene and rs12970291 near the TSHZ1 gene, with evidence for a shared colorectal and endometrial cancer predisposition [39]. Another study by Hung and colleagues reported that the same SH2B3 SNP on chr12q24 was associated with lung, colorectal and breast cancer [40]. In our study, however, we did not find an association of rs3184504, or any other variant in the chr12q24 region, with the breast-colorectal cancer phenotype. Furthermore, a recent comprehensive analysis of pleiotropic associations across five cancers using 61,851 cases and 61,850 controls did not find any evidence of pleiotropy across breast and colorectal cancer [41], which suggests that these two cancers may not have a common genetic susceptibility mechanism.
The strengths of the present study include using an agnostic approach to identify genetic susceptibility loci, and its enrichment for genetic susceptibility through the incorporation of early cancer onset (for breast cancer) and family history to define the breast-colorectal cancer phenotype. Furthermore, any breast-colorectal GWAS signal was less likely to be due to syndromic cases since all cases carrying BRCA1 or BRCA2 or known colorectal cancer susceptibility gene mutations were excluded from the GWAS. Despite this, our study could have had limited statistical power to detect alleles with small true effect sizes, especially if they are rare. Our study was powered to detect risk alleles with a frequency greater than 10% and a per-allele odds ratio of 1.7. Although this detectable odds ratio is high for a GWAS, we reasoned that because the cases had a family history, the frequency of the risk allele could be higher than it would be for unselected cases [42]. Study power could also have been reduced because data were from different GWAS platforms; however, imputation of genetic variants after merging datasets allowed us to maximize genetic markers for our analyses. Another potential limitation was that the breast cancer cases, being younger, had less time to be diagnosed with colorectal cancer and, similarly, their relatives had fewer person-years at risk for colorectal cancer than cases that had incident colorectal cancer and were older. Furthermore, although the cases were identified based on family history, the inclusion criteria regarding the number of relatives affected or case subjects' age at cancer onset were not stringent. This is in contrast to the HBCC criteria suggested by Naseem and colleagues, which included age of colorectal or breast cancer onset <50 years as a defining feature for the affected or relative [4].
We used the relaxed criteria to obtain a larger sample size and increase the power to detect an association, as this was an exploratory analysis; however, power may be reduced due to smaller effect sizes. Future studies of affected families with stronger clustering of breast and colorectal cancer might reveal a specific genetic signal. There is also the possibility of recall bias in the capture of family history between cases and controls; however, if the controls had unreported breast or colorectal cancer-affected first- or second-degree relatives, it may lead to type 2 (false-negative) rather than type 1 error. Finally, the study was limited by the lack of availability of a large dataset for replication. While many thousands of people have been genotyped in the GWASs, these studies provide only limited phenotype data, and most notably, lack data on cancer family history. Lack of replication could also be because the replication data did not closely resemble the discovery data: the replication series included only women with a higher mean age, whereas the discovery data included both men and women who were relatively younger.
This analysis, aimed at elucidating genes/regions associated with a pleiotropic effect for breast and colorectal cancer risk, did not show a clear susceptibility locus for this phenotype. This raises the possibility that aggregation of these cancers within families may be due to chance co-occurrence of two common cancers. However, since germline variation in the region of ROBO1 was suggestive of an association with breast-colorectal cancer risk, and given mounting evidence for the role of ROBO1 as a tumor suppressor gene, further investigation of this association may be warranted.