Colorectal Cancer Linkage on Chromosomes 4q21, 8q13, 12q24, and 15q22

A substantial proportion of familial colorectal cancer (CRC) is not a consequence of known susceptibility loci, such as mismatch repair (MMR) genes, supporting the existence of additional loci. To identify novel CRC loci, we conducted a genome-wide linkage scan in 356 white families with no evidence of defective MMR (i.e., no loss of tumor expression of MMR proteins, no microsatellite instability (MSI)-high tumors, and no evidence of linkage to MMR genes). Families were ascertained via the Colon Cancer Family Registry multi-site NCI-supported consortium (Colon CFR), the City of Hope Comprehensive Cancer Center, and Memorial University of Newfoundland. A total of 1,612 individuals (average 5.0 per family including 2.2 affected) were genotyped using genome-wide single nucleotide polymorphism linkage arrays; parametric and non-parametric linkage analysis used MERLIN in a priori-defined family groups. Five lod scores greater than 3.0 were observed assuming heterogeneity. The greatest were among families with mean age of diagnosis less than 50 years at 4q21.1 (dominant HLOD = 4.51, α = 0.84, 145.40 cM, rs10518142) and among all families at 12q24.32 (dominant HLOD = 3.60, α = 0.48, 285.15 cM, rs952093). Among families with four or more affected individuals and among clinic-based families, a common peak was observed at 15q22.31 (101.40 cM, rs1477798; dominant HLOD = 3.07, α = 0.29; dominant HLOD = 3.03, α = 0.32, respectively). Analysis of families with only two affected individuals yielded a peak at 8q13.2 (recessive HLOD = 3.02, α = 0.51, 132.52 cM, rs1319036). These previously unreported linkage peaks demonstrate the continued utility of family-based data in complex traits and suggest that new CRC risk alleles remain to be elucidated.


Introduction
Colorectal cancer (CRC) is the third most common cancer and the third leading cause of cancer death in the United States. Approximately 141,210 new cases and 49,380 deaths from CRC were expected in the United States in 2011 [1]. Family history is a consistent risk factor [2]; without CRC family history, the lifetime risk for an individual above the age of 50 years is 5% to 6%, yet this can be as high as 20% when there are first- or second-degree relatives with CRC [3][4][5], and reaches 80% to 100% in familial syndromes [6]. Lynch syndrome represents up to 5% of CRCs and results from germline mutations in one of several DNA mismatch repair (MMR) genes (MLH1, MSH2, MSH6, and PMS2) [7]. MMR mutations result in a defective mismatch repair (dMMR) tumor phenotype manifested by absence of MMR protein expression [8,9] and high-level DNA microsatellite instability (MSI-H). Segregation analyses excluding Lynch syndrome families suggest that additional loci for CRC susceptibility exist [5].
Here, we describe a genome-wide linkage scan of 356 white families without evidence of dMMR using family groups defined by age at diagnosis, ascertainment method, and number of affected family members. This represents the largest linkage study of proficient MMR (pMMR) CRC families to date and suggests novel regions with evidence of high-penetrance loci.

Ascertainment and Collection of Families
A total of 578 linkage-informative families were identified, each having at least two affected individuals diagnosed with invasive CRC in sibling, half-sibling, cousin, grand-parental, or avuncular pairs [27], absence of sequence-confirmed Lynch syndrome and MYH-associated polyposis [28], and absence of medical-record-confirmed familial adenomatous polyposis.
The majority of families (N = 480) were from the Colon Cancer Family Registry multi-site NCI-supported consortium (Colon CFR) ascertained between 1997 and 2007 by Cancer Care Ontario (Toronto, Canada), a University of Southern California Consortium (Los Angeles, CA), a University of Melbourne Consortium (Victoria, Australia), the University of Hawaii (Honolulu, HI), the Mayo Clinic (Rochester, MN), and the Fred Hutchinson Cancer Research Center (Seattle, WA) [28]. All study sites ascertained population-based families, although varying sampling schemes based upon age and/or family history were used. Clinic-based families were ascertained by the University of Melbourne Consortium (through family cancer clinics in Adelaide, Perth, Sydney, Brisbane, and Melbourne, Australia and Auckland, New Zealand), the University of Southern California Consortium (through the Cleveland Clinic), and the Mayo Clinic. Epidemiologic data, blood samples, tumor blocks, and pathology reports were collected on all participants with CRC at each site, using standardized core protocols.
Clinic-based families from a City of Hope consortium (N = 59) were recruited between 1998 and 2005 at the City of Hope (Duarte, CA), Tufts University (Medford, MA), the University of Pittsburgh (Pittsburgh, PA), Northwestern University (Chicago, IL), the University of Wisconsin (Madison, WI), Vanderbilt University (Nashville, TN), the University of South Florida/Moffitt Cancer Center (Tampa, FL), Maine Medical Center (Portland, ME), and Rose Medical Center (Denver, CO). White CRC cases older than 18 years of age, who had at least one living sibling diagnosed with CRC, were enrolled. Blood samples, pathology reports, and a brief questionnaire focused on ethnicity and family history were collected on all cases.
Population-based and clinic-based families (N = 39) from Newfoundland and Labrador, Canada were obtained at Memorial University of Newfoundland as previously described [29,30]. Briefly, pathologically confirmed cases diagnosed under the age of 75 years were enrolled via the provincial tumor registry between 1997 and 2003. Epidemiologic data including family history and risk factors, blood samples, tumor tissue, and pathology reports were collected. Clinic-based families were contacted following referrals to the high-risk cancer clinic of the provincial Medical Genetics Program.

SNP Genotyping and Quality Control
We genotyped all available affected individuals within each family, as well as key unaffected individuals, including siblings, children, and spouses of deceased affected individuals; parents of affected siblings; grandparents of affected cousins; and other individuals useful for estimation of phase [31]. Single nucleotide polymorphism (SNP) genotyping was conducted using the Affymetrix 10K 2.0 array (Affymetrix, Santa Clara, CA) for 327 families (1,753 individuals) and the Illumina Infinium Linkage 12 bead array (Illumina, San Diego, CA) for 251 families (1,001 individuals) following manufacturers' protocols [32,33]. A CEU trio (Coriell Institute for Medical Research, Camden, NJ, USA) was included in each 96-well plate.

Family Exclusions
We aimed to analyze white families without relationship errors and without evidence of MMR deficiency. Self-reported family structures were confirmed via evaluation of Mendelian inheritance using PREST [36] and Pedcheck [37] based on SNP data. Where probable sample switches or non-paternities were found, family structures were altered (nine sibships changed to half-sibships) or the families were excluded (34 families). We used EIGENSTRAT [38] to estimate ethnicity for individuals with missing self-reported ethnicity, verify ethnic similarity among related individuals, and exclude families with individuals clustering outside the large self-reported white cluster (43 families were excluded, Figure S1).
MMR proficiency was evaluated using MSI testing, immunohistochemical (IHC) analysis, and LOD scores at known MMR loci. MSI testing of Colon CFR and Newfoundland families was performed on multiple family members using paired normal and tumor DNA isolated from formalin-fixed, paraffin-embedded (FFPE) material [39]. Ten markers were tested (mono-nucleotide markers BAT25, BAT26, BAT34C4, and BAT40; di-nucleotide markers ACTC, D5S346, D10S197, D17S250, and D18S55; and complex repeat MYCL), and four unequivocal results were required. Eighty-nine families with at least one MSI-H tumor were excluded. IHC analysis of Colon CFR and Newfoundland families for MLH1, MSH2, MSH6, and PMS2 expression was performed on FFPE samples, as previously described [39]. IHC staining across all sites was done at three centers, and pathologist interpretation was conducted blind to MSI status. Forty-one families in which at least one tumor showed protein loss were excluded. Finally, we excluded an additional 15 families with dominant LOD scores >0.4 within 20 kb surrounding MSH2, MLH1, MSH6, PMS2, PMS1, MSH3, or MLH3 (linkage methods described below). Thus, 356 white families with no evidence of MMR deficiency were included in the analysis.
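The exclusion counts reported in this section can be tallied as a quick consistency check. The sketch below is illustrative only: the step labels are descriptive shorthand, and it assumes the exclusions were applied in the order described and did not overlap. The counts themselves are those given in the text, starting from the 578 linkage-informative families.

```python
# Sequential family exclusions, using the counts reported in the text.
# Assumption: steps are applied in this order and do not overlap.
EXCLUSIONS = [
    ("relationship errors (PREST/Pedcheck)", 34),
    ("clustering outside the white cluster (EIGENSTRAT)", 43),
    ("at least one MSI-H tumor", 89),
    ("loss of MMR protein expression on IHC", 41),
    ("linkage evidence at an MMR locus", 15),
]


def tally_exclusions(start, exclusions):
    """Return (label, families remaining) after each exclusion step."""
    remaining = start
    steps = []
    for label, n_excluded in exclusions:
        remaining -= n_excluded
        steps.append((label, remaining))
    return steps


steps = tally_exclusions(578, EXCLUSIONS)
for label, remaining in steps:
    print(f"{remaining:4d} families remain after excluding: {label}")
```

Under these assumptions the tally ends at 356 families, the number analyzed, and 544 families remain after the relationship-error step, the number evaluated in Figure S1.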

Linkage Analysis
Multipoint parametric and nonparametric linkage analyses used MERLIN version 1.1.2 [40]; dominant and recessive models were based on a prior segregation analysis (Table S2) [5]. Parametric linkage in the presence of heterogeneity was assessed using heterogeneity LOD (HLOD) scores, and the proportion of families linked to each locus (α) was estimated using HOMOG [41]. Nonparametric Kong & Cox LOD (NPL) scores from the linear model were computed along with S_all statistics [42,43]. As has been useful for other cancers [44,45], we sought to improve power by increasing genetic homogeneity using family sub-groups defined a priori based on presumed genetically relevant characteristics. Thus, family groups were based on mean age at diagnosis (<50 years, ≥50 years), ascertainment scheme (population-based, clinic-based, or unknown), and number of affected individuals (2, 3, 4 or more). Likelihood ratio testing evaluated heterogeneity of linkage across the independent subsets of each subgroup factor (i.e., age at diagnosis, ascertainment scheme, and number of affected individuals).
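To make the HLOD concrete, the following minimal sketch evaluates the standard admixture-model heterogeneity LOD at a single map position, where α is the proportion of linked families, and maximizes it over α by grid search. It is not the MERLIN or HOMOG implementation; the function names and the per-family LOD scores are hypothetical and chosen only for illustration.

```python
import math


def hlod(family_lods, alpha):
    """Admixture-model heterogeneity LOD at one position for a given alpha."""
    return sum(
        math.log10(alpha * 10.0 ** lod + (1.0 - alpha)) for lod in family_lods
    )


def max_hlod(family_lods, grid_steps=1000):
    """Grid-search alpha in [0, 1]; return (maximum HLOD, alpha at maximum)."""
    best_score, best_alpha = 0.0, 0.0  # alpha = 0 always gives HLOD = 0
    for step in range(1, grid_steps + 1):
        alpha = step / grid_steps
        score = hlod(family_lods, alpha)
        if score > best_score:
            best_score, best_alpha = score, alpha
    return best_score, best_alpha


# Hypothetical per-family LOD scores at a single map position.
example_lods = [0.9, 0.7, -0.4, 0.8, -0.6, 0.5]
score, alpha = max_hlod(example_lods)
print(f"HLOD = {score:.2f} at alpha = {alpha:.2f}")
```

At α = 1 the HLOD reduces to the ordinary summed LOD score, so the maximized HLOD is never smaller than the homogeneity LOD and never below zero.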

Association Analysis
In key regions identified by linkage analysis, we also performed association testing among an additional 1,136 cases (343 family-history-positive and 793 family-history-negative cases, i.e., with and without a first-degree relative with CRC, respectively) and 997 controls from population-based collections of the Colon CFR who were genotyped using the Illumina 1M/1M Duo SNP array, as described previously [46]. Logistic regression estimated association between genotype and CRC risk adjusted for age, gender, study site, and four principal components representing ancestry [46]. A quantile-quantile (Q-Q) plot of genome-wide observed versus expected test statistics indicated no evidence of inflation (λ = 0.938) [46].
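The text does not state how the inflation factor of 0.938 was computed. A common definition, assumed here, is the ratio of the median observed 1-df chi-square test statistic to the median expected under the null. The sketch below illustrates that definition with hypothetical test statistics; the function name is ours.

```python
from statistics import NormalDist, median

# Median of a 1-df chi-square distribution: the square of the 75th
# percentile of the standard normal (approximately 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(chi2_stats):
    """Ratio of the observed median test statistic to its null expectation."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical 1-df association test statistics, for illustration only.
example_stats = [0.02, 0.11, 0.30, 0.43, 0.47, 0.95, 1.60, 2.80, 4.10]
print(f"lambda = {genomic_inflation(example_stats):.3f}")
```

Values near 1 indicate test statistics that follow the null distribution; values well above 1 would suggest confounding such as population stratification.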

Results
This collection of white CRC families with no evidence of dMMR consisted of 277 families from the Colon CFR, 48 families from the City of Hope consortium, and 31 families from Newfoundland. A total of 1,612 individuals were successfully genotyped including, on average, five individuals per family (range, 2-10; mean 2.2 affected and 2.8 unaffected individuals). The mean age at diagnosis was 59.7 years (range, 36-79) and 56.2 years (range, 31-74) among population- and clinic-based families, respectively. The majority of families had two affected members (56%) and an older (>50 years) mean age at diagnosis (84%). MSI data were available on 224 families (209 MSS and 15 MSI-L), and IHC data were available on 255 families and showed no evidence of MMR deficiency (Table 1). Both MSI and IHC data were available on 190 families. Sixty-seven families were not tested but had a LOD <0.04 within 20 kb surrounding MSH2, MLH1, MSH6, PMS2, PMS1, MSH3, or MLH3.
Genome-wide linkage scans of nine family groups were conducted including analysis of all families and of subsets of families defined by age, ascertainment scheme, and number of affected individuals. Four regions in five family groups were observed with HLOD scores greater than 3.0 (Figure 1). The strongest result was based on analysis of 58 families with a mean age at diagnosis <50 years. In this group, we observed a dominant HLOD of 4.51 on chromosome 4q21.1 (145.40 cM, NPL = 2.52) with an estimated 84% of families linked (Table 2). The peak occurred at rs10518142, which is in intron 5 of NAAA, encoding N-acylethanolamine acid amidase. The linkage region, defined as a 1-HLOD support interval, spanned 16.0 cM (8.7 Mb). This peak was not seen in older mean age at diagnosis families (Figure S2), although significant heterogeneity by mean age at diagnosis was not observed (LRT p = 0.35). Other regions of interest in families defined by age at diagnosis (HLOD >2.0) are provided in Table 3.
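A 1-HLOD support interval is the region around a peak within which the HLOD stays within one unit of its maximum. The sketch below shows one simple way to read such an interval from a sampled HLOD curve, without interpolating between grid points. The curve values are hypothetical and are not the study data; the function name is ours.

```python
def one_hlod_support_interval(positions_cm, hlods, drop=1.0):
    """Return (start, end) of the contiguous region around the peak where
    HLOD stays within `drop` units of the maximum."""
    peak = max(range(len(hlods)), key=hlods.__getitem__)
    threshold = hlods[peak] - drop
    left = peak
    while left > 0 and hlods[left - 1] >= threshold:
        left -= 1
    right = peak
    while right < len(hlods) - 1 and hlods[right + 1] >= threshold:
        right += 1
    return positions_cm[left], positions_cm[right]


# Hypothetical HLOD curve sampled every 2 cM; not the study data.
positions = [136, 138, 140, 142, 144, 146, 148, 150, 152]
curve = [1.2, 2.1, 3.6, 4.0, 4.4, 4.5, 3.9, 3.2, 2.0]
start, end = one_hlod_support_interval(positions, curve)
print(f"1-HLOD support interval: {start}-{end} cM")
```

Because the interval is read off the sampling grid, its endpoints are only as precise as the spacing between evaluated positions.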
The second strongest linkage peak occurred in analysis of all families (N = 356) at 12q24.32 with a maximum dominant HLOD of 3.60 (285.15 cM, NPL = 2.88) and an estimated 48% of families linked (Table 2). The peak SNP, rs952093, resides in intron 1 of TMEM132C encoding transmembrane protein 132C; the equivalent of a 1-HLOD interval defined a 14 cM (1.3 Mb) region. Three suggestive regions in analysis of all families (HLOD >2.0) were seen on chromosomes 4, 15, and 17 (Figure 1; Table 3), including a region near the 4q21.1 peak seen in younger age at diagnosis families.
Additional linkage peaks with HLODs just over 3.0 were observed on chromosome 15q22.31 (101.40 cM, rs1477798) among 67 families with four or more affected individuals and among 88 clinic-based families (Figure 1). Among families with at least four affected members, a dominant HLOD of 3.07 was observed (α = 0.29, NPL = 1.03), and among clinic-based families a dominant HLOD of 3.03 was seen (α = 0.32, NPL = 1.03). Thirty-five families contributed to both analyses (i.e., clinic-based families with four or more affected individuals) (Table 4); analysis of these revealed a dominant HLOD of 3.15 (α = 0.35, NPL = 1.88). Of note, this region was also suggested by analysis of all families (HLOD = 2.51, α = 0.20, Table 3). This peak was not seen in analysis of smaller families, population-based families, or families with unknown ascertainment, although significant heterogeneity by family size or ascertainment scheme was not observed (all LRT p >0.10). The peak SNP, rs1477798, is in an intron of MEGF11, which encodes multiple EGF-like-domains 11.
An additional HLOD over 3.0 was observed in recessive analysis of 200 families with only two affected family members (Figure 1, Table 2). On 8q13.2, a recessive HLOD of 3.02 was seen at rs1319036 (in an intron of pseudogene LOC100129096; α = 0.51, NPL = 0.08). Linkage assuming a recessive mode of inheritance is consistent with an affected sibling pair family structure. This region was not highlighted in analysis of larger families (Table 3) (Table 3); the peak SNP rs888115 is in intron 4 of MSI2 which encodes musashi homolog 2 (Drosophila). Additional linkage results are provided in Figure S2. A second recessive model linkage peak downstream of 8q13.2 was observed in the same families with two affected members (HLOD = 2.0, α = 0.51) on 8q12.2. These two nearby peaks were 11.2 cM (8.1 Mb) apart.
Finally, we analyzed association within the 1-HLOD-support intervals surrounding each linkage peak with HLOD >3.0 using additional Colon CFR cases (N = 1,176) and controls (N = 997). In 4q21.21, which showed evidence of linkage in younger age at diagnosis families, the linkage SNP rs10518142 showed no evidence of association; however, rs12643573, which is 2 cM downstream, showed some evidence of association (OR 1.64, p = 5.4 × 10^-5; family history positive OR 1.82, p = 1.0 × 10^-4) (Figure S3). At rs1477798 in 15q22.31, which showed evidence of linkage in clinic-based, larger families, a nominally significant case-control association was observed (OR 1.16, p = 0.04), which was modestly strengthened for cases with CRC family history (OR 1.24, p = 0.03); however, no significant difference in risk by family history was observed and associations were far from genome-wide significant. No other associations of note were observed.

Discussion
Results of this genome-wide linkage scan provide strong evidence for four previously unreported CRC susceptibility loci. Notably, we identified a region at 4q21.1 among families with younger mean age at diagnosis (dominant HLOD = 4.51) and estimated that 84% of these families were linked. The 1-HLOD-support interval of this region, 16 cM (139 cM-155 cM) spanning 8.7 Mb, contains multiple known genes including NAAA. NAAA encodes an N-acylethanolamine-hydrolyzing enzyme and has been shown to be expressed in a variety of human tissues including colon [47].
Many of the genes upstream and downstream of NAAA are members of the chemokine family that are clustered in the 4q12-21 region. The CXC chemokines modulate tumor behavior by regulation of angiogenesis, activation of a tumor-specific immune response, and direct stimulation of tumor proliferation in an autocrine or paracrine fashion [48].
Among all families, evidence for linkage was seen at 12q24.32 (HLOD = 3.60) with an estimated 48% of families linked to this locus (1-HLOD-support interval of 14 cM [276 cM-290 cM] spanning 1.3 Mb). This region contains four known genes (TMEM132C, SLC15A4, GLT1D1, and TMEM132D), four hypothetical genes (LOC100128554, LOC387895, LOC440117, and FLJ37505), and one microRNA (MIR3612). The four known genes in this region are conserved in dog, mouse, and chicken and, in some cases, zebrafish and Arabidopsis. One of these transmembrane proteins (TMEM132D) is known to be expressed in mature oligodendrocytes [49], but little else is known about either function or pathology, as is also true of GLT1D1 (glycosyltransferase 1 domain containing 1) in humans. Members of the SLC15 (solute carrier family 15) family are electrogenic transporters of short-chain peptides into a variety of cells [50].
Evidence for linkage at 15q22.31, with a 1-HLOD-support interval of 38 cM (78 cM-116 cM) spanning 12.9 Mb, was particularly evident among families enrolled at high-risk clinics or with four or more affected individuals (dominant HLOD = 3.15). This is a large gene-rich region and contains many known genes including MEGF11 and RAB11A. Very little is known about MEGF11 [51]. RAB11A is a RAS oncogene family member expressed in tumor cell lines and suggested to be involved in membrane trafficking [52].
Finally, among families with only two affected individuals, the 1-HLOD-support interval of 12 cM (126 cM-138 cM) spans 5 Mb (8q13.2, recessive HLOD = 3.02) and contains mostly pseudogenes. Notably, SULF1 in this region has been suggested to modulate signaling by heparin-binding growth factors, and downregulation represents a novel mechanism by which cancer cells can enhance growth factor signaling [53].
Like all complex diseases, CRC is heterogeneous and most likely due to multiple partially penetrant susceptibility alleles as well as non-genetic factors. In order to maximize power to detect linkage, we sought to increase genetic homogeneity by grouping families with similar, potentially genetically driven features, such as age at diagnosis, clinic-based ascertainment, and number of affected family members [5]. A number of other groups have taken a similar predefined subset approach, reporting evidence of CRC linkage in specific regions among family subsets [54]. Here, linked regions on 4q21.1 and 8q13.2 became apparent only in the families with younger mean age at diagnosis and only two affected members, respectively, and the 15q22.31 peak suggested by analysis of all families strengthened when considering clinic-based or large families only. Two observations provide particular reassurance about the use of this subset approach: first, the subsets predicted by segregation analysis to be more likely to be genetic (younger age at diagnosis, clinic-based) showed greater evidence for linkage; and second, the peak among smaller families (sibling pairs) was identified using a recessive model.
A number of features of this study are unique among CRC genome-wide linkage scans. First, ours is the largest study and thus had higher power for detection. Second, our population included only families with no evidence of MMR deficiency; only two smaller studies have focused on pMMR families [18,26]. In this respect, our approach of studying a large number of pMMR families allowed us to identify specific linkage regions for this subgroup of families who are known to differ clinically from dMMR families and do not arise from MMR mutations [55][56][57]. Unlike some prior studies, we included MSI-L families (N = 15) in our analysis, because the relatedness of this phenotype to dMMR disease is unknown; in all regions, results did not differ when analyses were repeated with exclusion of these families. Finally, the two most significant regions reported here showed similar NPL scores.
Several GWAS have reported highly replicated low-penetrance loci [10][11][12][13][14][15][16][17], including a meta-analysis of ten independent studies (11,067 cases and 12,517 controls) which replicated eight previously reported associations [46]. In relation to the four linkage regions reported here, the closest reported GWAS association is on chromosome 12q24 (rs7315438) [46], 3 Mb away from our peak HLOD. It is not surprising that GWAS and linkage analyses may identify different loci, given the complementary strengths of each approach and the evidence, for many cancers, that the familial and non-familial forms of the disease do not often show affected pathways in common. This is largely supported by our analysis of association within the linkage regions reported here. In fact, despite the attractiveness of the two-hit hypothesis, colorectal cancer is an important exception to the pattern among adult cancers, rather than the rule: APC is central to a dominant familial syndrome and frequently mutated somatically in the nonfamilial disease [58]. There is a similar pattern involving the MMR genes: they are mutated in the germline among those with Lynch syndrome, and MLH1, at least, is frequently hypermethylated in the non-familial cancer.
In conclusion, these results suggest novel CRC susceptibility loci on chromosomes 4q21, 8q13, 12q24, and 15q22. Further confirmatory studies are needed, including targeted sequencing and dense mapping of the identified linkage regions. Targeted sequencing of these regions will facilitate identification of novel variants that may be missed with linkage analysis, while fine-mapping studies will narrow the region of interest to be examined. In addition, pooling of linkage data across multiple genome-wide scans should allow for fine-level analysis of discrepant results across family collections. It is clear from this work and the work of others that multiple loci are involved in increasing susceptibility to CRC in families and that family-based studies remain critical to the identification and characterization of these loci.

Supporting Information
Figure S1 Ethnicity Estimation using Eigen Analysis.
EIGENSTRAT was used to verify ethnic similarity among related individuals from 544 families based on self-report and to estimate ethnicity for individuals with missing ethnicity. The first two principal components are plotted by (A
Plot shows the 1-HLOD interval surrounding rs10518142, the peak linkage SNP among families with younger mean age at diagnosis. The x-axis indicates genomic position. The y-axis indicates -log10 association p-values for genotyped SNPs (solid circles) adjusted for age, gender, study site, and four principal components representing ancestry. The most significantly associated SNP is indicated by a purple diamond. Other than rs10518142, which is indicated by a yellow circle, the colored points indicate the strength of LD with the SNP most associated with CRC risk (purple diamond). Also shown are the SNP build 36 coordinates in kilobases (kb) and a subset of the known genes in the region (below x-axis). (DOCX)