Multiple Common Susceptibility Variants near BMP Pathway Loci GREM1, BMP4, and BMP2 Explain Part of the Missing Heritability of Colorectal Cancer

Genome-wide association studies (GWAS) have identified 14 tagging single nucleotide polymorphisms (tagSNPs) that are associated with the risk of colorectal cancer (CRC), and several of these tagSNPs are near bone morphogenetic protein (BMP) pathway loci. The penalty of multiple testing implicit in GWAS increases the attraction of complementary approaches for disease gene discovery, including candidate gene- or pathway-based analyses. The strongest candidate loci for additional predisposition SNPs are arguably those already known both to have functional relevance and to be involved in disease risk. To investigate this proposition, we searched for novel CRC susceptibility variants close to the BMP pathway genes GREM1 (15q13.3), BMP4 (14q22.2), and BMP2 (20p12.3) using sample sets totalling 24,910 CRC cases and 26,275 controls. We identified new, independent CRC predisposition SNPs close to BMP4 (rs1957636, P = 3.93×10−10) and BMP2 (rs4813802, P = 4.65×10−11). Near GREM1, we found using fine-mapping that the previously-identified association between tagSNP rs4779584 and CRC actually resulted from two independent signals represented by rs16969681 (P = 5.33×10−8) and rs11632715 (P = 2.30×10−10). As low-penetrance predisposition variants become harder to identify—owing to small effect sizes and/or low risk allele frequencies—approaches based on informed candidate gene selection may become increasingly attractive. Our data emphasise that genetic fine-mapping studies can deconvolute associations that have arisen owing to independent correlation of a tagSNP with more than one functional SNP, thus explaining some of the apparently missing heritability of common diseases.


Introduction
Genome-wide association (GWA) studies of colorectal cancer (CRC) have so far identified 14 common, low-risk susceptibility variants [1]. Of these 14 variants, 3 are close to loci that are secreted members of the bone morphogenetic protein (BMP) signalling pathway: GREM1 (rs4779584); BMP4 (rs4444235); and BMP2 (rs961253). In the colon, GREM1 is one of several BMP antagonists produced by sub-epithelial myofibroblasts (ISEMFs). GREM1 binds to and inactivates the ligands BMP2 and BMP4 that are primarily produced by inter-cryptal stromal cells.
Our GWA studies have utilised a primary phase of genomewide typing of tagging single nucleotide polymorphisms (tagSNPs), followed by larger validation phases of those SNPs with the strongest signals of association. We have previously used relatively stringent statistical thresholds to take SNPs forward into the final validation phases [1]. Whilst such a design has been cost-effective, the use of a lower threshold may have led to the discovery of more CRC SNPs, albeit at the cost of a relatively high type I error rate. One means of reducing false positives might be to select SNPs using a less stringent threshold where there is a priori evidence for candidacy. We reasoned that the best candidate loci were those already identified as harbouring CRC risk alleles. Of those 14 loci, we prioritised GREM1, BMP2 and BMP4 for further analysis owing to their strongly-related functions.
The GWA studies had identified a single tagSNP associated with CRC risk close to each of GREM1, BMP2 and BMP4 [1]. Examination of the regions around these genes in public databases such as HapMap (http://www.hapmap.org/) showed in all cases that the coding sequence and predicted surrounding regulatory regions were present within more than one linkage disequilibrium (LD) block. For each of the 3 genes, therefore, it was possible that there were additional genetic determinants of CRC risk, independent of the already-identified SNPs. We proceeded to test this hypothesis in large sets of CRC cases and controls of European origin.

Results
The rs4779584 CRC signal results from two independent SNPs close to GREM1 In order to refine the location of CRC-associated functional variation close to the GREM1, BMP4 and BMP2 loci, we genotyped 442 SNPs close to rs4779584, rs4444235 and rs961253 in 4,878 CRC cases and 4,914 controls from the UK2 and Scotland2 sample sets and imputed other SNPs within these regions. No significant localisation of a functional variant was achieved for rs4444235 or rs961253 ( Figure S1), but at GREM1, rs16969681 (chr15:30,780,403 bases) had a notably stronger signal of association than rs4779584 (pairwise LD: r 2 = 0.18, D9 = 0.70) ( Figure 1 and Figure S2). We genotyped rs16969681 in additional independent CRC case-control series (UK1, UK4, VQ58, Helsinki, Cambridge and EPICOLON; see Methods). After combined analysis, a significant association between the minor allele at rs16969681 and CRC risk was seen (P = 5.33610 28 ; Table 1). Unconditional logistic regression analysis, incorporating sample series as a co-variate, showed that rs16969681 was more strongly associated with CRC than rs4779584, but that the signals were non-independent (for rs16969681, OR = 1.16, 95% CI 1.07-1.25, P = 1.91610 24 ; for rs4779584, OR = 1.08, 95% CI 1.02-1.14, P = 5.27610 23 ). Akaike information criteria metrics for rs16969681 and rs4779584 respectively were 25608 and 25922, consistent with a superior fit of the risk model incorporating the former SNP. Intriguingly, we found that rs16969681 maps to a site of open chromatin in GREM1-expressing CRC cell lines, raising the possibility that it may be directly functional ( Figure S3).
Haplotype risk analysis (Table S2) provided evidence that rs16969681 alleles do not capture all the CRC risk associated with rs4779584. In brief, data from UK2 and Scotland2 showed that the risk alleles at rs16969681 and rs4779584 were defined by a TGGTC haplotype at rs16969681-rs16969862-rs12594722-rs4779584-rs9888701. The TT rs16969681-rs4779584 haplotype was at a frequency of 0.063 in cases and 0.052 in controls (P = 6.29610 25 ). However, there appeared to be a residual effect of the T allele at rs4779584, since there was also an elevated risk associated with the CT rs16969681-rs4779584 haplotype (P = 0.026).
We therefore tested the hypothesis that rs4779584 tags two independent risk SNPs at GREM1. We used reverse stepwise logistic regression to search the set of GREM1 SNPs genotyped in the UK2 and Scotland2 samples (Table S1) for associations that were independent of rs16969681 genotype and that captured the residual rs4779584 signal. This analysis led to elimination of rs4779584 from the regression model and identification of a model in which only rs16969681 (P = 1.04610 24 ) and another SNP, rs11632715 (P = 1.00610 23 ), produced independent signals. rs11632715 (chr15:30,791,539) is in low LD with rs16969681 (r 2 = 0.009, D9 = 0.31) and modest LD with rs4779584 (r 2 = 0.18, D9 = 0.90; Figure S2). Through genotyping of additional casecontrol series, we showed that rs11632715 was significantly associated with CRC risk (P = 2.30610 210 ; Table 1). Unconditional logistic regression in the 21,139 samples typed for both rs11632715 and rs16969681 provided confirmatory evidence of the independence of the signals (for rs16969681, P = 1.84610 26 and for rs11632715, P = 6.36610 27 ); these associations were of very similar magnitude to those obtained when each SNP was analysed individually in those sample sets ( Figure S2). Incorporation of rs4779584 into the logistic regression model showed that this SNP had a weaker effect than that of either rs16969681 or rs11632715 and did not significantly improve the model fit (Table  S3). Inspection of the region containing rs4779584, rs16969681 and rs11632715 ( Figure S2) showed that rs4779584 lay within a recombination hotspot. This finding was consistent with our discovery that rs4779584 tags two independent functional variants that are, in turn, tagged by rs16969681 and rs11632715.

Identification of additional, independent CRC susceptibility SNPs near BMP4 and BMP2
The regions analysed for fine mapping encompassed only a minority of the transcribed and flanking regions of GREM1, BMP4 and BMP2. We therefore tested for further independent CRCassociated SNPs around these loci (Table S4) by undertaking a pooled analysis of data from 5 CRC GWA studies (UK1, Scotland1, VQ58, CCFR, Australia) and from the UK2 and Scotland2 samples that had been genotyped at 55,000 SNPs with the strongest evidence of association from meta-analysis of UK1 and Scotland1 ( Figure S4) [1]. Since each of the 7 sample sets had been genotyped using different, but overlapping, SNP panels, we performed the combined analysis irrespective of the number of studies in which any SNP had been typed. Figure 2 shows the resulting signals of association from single SNP analysis in this discovery phase.
We prioritised SNPs for further assessment in the replication data sets if they passed two thresholds. First, we required SNPs to show association with CRC at P,1610 24 under the allelic or Cochran-Armitage tests; this was a less stringent threshold than that used in our previously-reported hypothesis-free GWA studies [1,2,3], reflecting the fact that GREM1, BMP4 and BMP2 were strong candidate susceptibility genes. Four SNPs at BMP4, 3 at BMP2 and 9 at GREM1 fulfilled this criterion ( Figure 2). Second, since our aim was to test for novel, independent disease variants rather than to refine existing signals of association, we required that SNP genotypes were not correlated with each other or with previously identified risk SNPs (r 2 ,0.05, D9,0.10). After applying these criteria, one SNP at BMP4 (rs1957636) and one at BMP2 (rs4813802) were retained for subsequent analyses.
We used unconditional logistic regression, adjusting for sample series, to test the independence of the two pairs of SNPs at BMP4 and at BMP2. In both cases, each signal remained independent, reflecting the existence of recombination hotspots between the pairs of SNPs at each locus ( Figure S5 and Figure S6). For rs4444235 and rs1957636, association P-values were respectively 2.09610 28 (I 2 = 47.7%) and 3.93610 210 (I 2 = 0%)). For rs961253 and rs4813802, P-values were 1.89610 215 (I 2 = 5%) and 4.65610 211 (I 2 = 5%)). Thus, all 4 SNPs represented independent signals of association with CRC. Further imputation around BMP4 and BMP2 provided no evidence for the alternative possibility that a single variant was tagged by the two SNPs in each region (details not shown).

Author Summary
Genome-wide association studies (GWAS) have identified several colorectal cancer (CRC) susceptibility polymorphisms near genes that encode proteins in the bone morphogenetic protein (BMP) pathway. However, most of the inherited susceptibility to CRC remains unexplained. We investigated three of the best candidate BMP genes (GREM1, BMP4, and BMP2) for additional polymorphisms associated with CRC. By extensive validation of polymorphisms with only modest evidence of association in the initial phases of the GWAS, we identified new, independent CRC predisposition polymorphisms close to BMP4 (rs1957636) and BMP2 (rs4813802). Near GREM1, we used additional genotyping around the GWAS-identified polymorphism rs4779584 to demonstrate two independent signals represented by rs16969681 and rs11632715. Common genes with modest effects on disease risk are becoming harder to identify, and approaches based on informed candidate gene selection may become increasingly attractive. In addition, genetic fine mapping around polymorphisms identified in GWAS can deconvolute associations which have arisen owing to two independent functional variants. These types of study can identify some of the apparently missing heritability of common disease.
rs4813802 maps to chr20:6,647,595, about 49 kb upstream of BMP2 and 295 kb telomeric of the previously-identified BMP2 CRC susceptibility SNP, rs961253 (chr20:6,352,281). There is very little LD between these two SNPs (r 2 = 0.000, D9 = 0.017 from UK1) owing to a recombination hotspot at chr20:6,587,000 ( Figure S5). rs4813802 lies within a region of LD flanked by rs727689 (chr20:6,636,405) and rs6117401 (chr20:6,664,097). This region contains 3 spliced ESTs, BX107852, BG822004 and DB094697; none of these has any known functional role or homology to other human or non-human transcripts or genes. The nearest gene to rs4813802 apart from BMP4 is FERMT1 (transcriptional start site, chr20:6,052,191). From HapMap3 release 2 and 1000 Genomes Pilot 1, 29 SNPs were found to be in moderate or greater LD (r 2 .0.20) with rs4813802 in Europeans. Of those SNPs, several in the region chr20:6,636,405-6,647,595 mapped to sites of potential functional importance in BMP2 transcription. None of the SNPs in the area around rs4813802 is the location of a reported eQTL.

Gene-gene interactions and other SNPs near BMP pathway loci
Using a case-control logistic regression design, we searched for pairwise gene-gene interactions between 5 SNPs associated with CRC risk (rs4444235, rs1957636, rs961253, rs4813802 and rs4779584). Risks were additive and no evidence of epistasis was detected (P.0.2 for all SNP pairs). We also searched for evidence of CRC susceptibility alleles at tagSNPs close to other BMP pathway genes. Using the transcribed regions of flanking genes as boundaries, we identified 4,361 tagSNPs mapping to 37 BMP agonist, antagonist and receptor loci (Table S5). However, we found no statistically significant evidence of associations with disease (P.10 23 in all cases).

Discussion
We have identified two new CRC predisposition tagSNPs close to BMP4 (rs1957636) and BMP2 (rs4813802). To date, few other loci have been shown at stringent levels of significance to harbour more than one, independent cancer susceptibility variant. One notable exception is the locus proximal to MYC on chromosome 8q24.21 that contains multiple regions independently associated with the risk of prostate and other cancers [4]. Low-penetrance cancer predisposition loci are becoming increasingly hard to identify, owing to small effect sizes and/or low risk allele frequencies -and a return to candidate gene-based approaches may become increasingly attractive. It is true that in the past, candidate gene approaches have generally been unsuccessful at identifying cancer risk loci, but it is now possible to make use of information, such as expression quantitative trait locus identification, that increasingly permits a more considered approach. All data sets in which rs16969681 and/or rs11632715 were genotyped are shown. The sample sets genotyped for the SNPs near GREM1 are overlapping, but nonidentical, largely because rs11632715 and rs4779584 (but not rs16969681) are present on the proprietary Illumina genome-wide arrays, and also because the Cambridge data set was additionally genotyped for rs16969681. In addition to the overall association test statistics, the P value for the replication phase (excluding UK2 and Scotland2) is shown for rs16969681 and rs11632715. Although there is considerable overlap, the sample sets genotyped here differ somewhat from those typed for the BMP2 and BMP4 SNPs. These differences result entirely from sample and data availability and practical issues of genotyping, including the following: GWAS data but not samples were available from some data sets, so that SNPs such as rs16969681 could not be genotyped in those sample sets; the 1958 Birth Cohort samples were not available at the time of genotyping rs16969681; and for some sample sets, DNA quantity was limiting. doi:10.1371/journal.pgen.1002105.t001 We have also found good evidence that the original CRCassociated SNP near GREM1, rs4779584 [5], tags two independent functional SNPs, represented by association signals at rs16969681 and rs11632715. This finding emphasises that genetic fine-mapping studies are valuable not only for detecting stronger association signals, but also for deconvoluting tagSNP associations that have arisen owing to independent correlation of the tagSNP with more than one functional SNP. The original rs4779584 tagSNP signal could be described as an example of ''synthetic association'', a term that has been used to describe a situation in which multiple, sometimes rare, variants underlie a tagSNP signal [6,7]. Synthetic association can explain some of the apparently missing heritability of complex diseases. Here, we estimate that the 6 SNPs close to the 3 BMP pathway genes contribute approximately 2% of the heritability of CRC, about double that estimated before this study.
Finally, our data provide evidence that GREM1, BMP4 and BMP2 are the targets of the functional variation in each region. Multiple, independently-acting variants close to these loci contribute to CRC risk. Perhaps unexpectedly, there are no detectable genetic interactions among these variants. If the downstream SMAD effectors that function within both the BMP and TGF-beta pathways are included, the components of BMP signalling involved in CRC risk might comprise up to 3 highpenetrance predisposition genes (SMAD4, BMPR1A, GREM1) and 8 low-penetrance variants at GREM1, BMP4, BMP2, SMAD7 and LAMA5 (tagged respectively by rs16969681 and rs11632715, rs4444235 and rs1957636, rs961253 and rs4813802, rs4939827, and rs4925386) [1,2,3,5,8,9,10,11]. Collectively these data emphasise the potential importance of genetic variants in the BMP pathway for CRC predisposition.

Ethics statement
Collection of blood samples and clinico-pathological information from patients and controls was undertaken with informed consent and ethical review board approval in accordance with the tenets of the Declaration of Helsinki.

Overall strategy
The study had two main components: (i) refinement of existing GWAS signals at the GREM1, BMP4 and BMP2 loci using a dense genotyping and imputation approach in several thousand cases and controls previously used for GWAS validation; and (ii) a search for new, independent CRC tagSNPs at the same three loci using a less stringent threshold for validation than used previously, combined with multiple validation sample sets.
Discovery screen data sets UK1 (CORGI) [1] comprised 922 cases with colorectal neoplasia (47% male) ascertained through the Colorectal Tumour Gene Identification (CoRGI) consortium. All had at least one firstdegree relative affected by CRC and one or more of the following phenotypes: CRC at age 75 or less; any colorectal adenoma (CRAd) at age 45 or less; $3 colorectal adenomas at age 75 or less; or a large (.1 cm diameter) or aggressive (villous and/or severely dysplastic) adenoma at age 75 or less. The 929 controls (45% males) were spouses or partners unaffected by cancer and without a personal family history (to 2 nd degree relative level) of colorectal neoplasia. Known dominant polyposis syndromes, HNPCC/ Lynch syndrome or bi-allelic MYH mutation carriers were excluded. All cases and controls were of white UK ethnic origin. Scotland1 (COGS) [1] included 980 CRC cases (51% male; mean age at diagnosis 49.6 years, SD66.1) and 1,002 cancer-free population controls (51% male; mean age 51.0 years; SD65.9). Cases were for early age at onset (age #55 years). Known dominant polyposis syndromes, HNPCC/Lynch syndrome or biallelic MYH mutation carriers were excluded. Control subjects were sampled from the Scottish population NHS registers, matched by age (65 years), gender and area of residence within Scotland.
The Colon Cancer Family Registry (CCFR) data set [14] comprised 1,332 familial CRC cases and 1,084 controls Colon Cancer Family Registry (Colon-CFR) (http://epi.grants.cancer. gov/CFR/about_colon.html). The cases were recently diagnosed CRC cases reported to population complete cancer registries in the USA (Puget Sound, Washington State) who were recruited by the Seattle Familial Colorectal Cancer Registry; in Canada (Ontario) who were recruited by the Ontario Familial Cancer Registry; and in Australia (Melbourne, Victoria) who were recruited by the Australasian Colorectal Cancer Family Study. Controls were population-based and for this analysis were restricted to those without a family history of colorectal cancer.
The Australian study [15] comprised 591 patients treated for CRC at the Royal Melbourne, Western and St Francis Xavier Cabrini Hospitals in Melbourne from 1999 to 2009. The 2,353 controls were derived from Queensland or Melbourne: for the former, the controls came from the Brisbane Twin Nevus Study [16]; for the latter, individuals were participants in the Genes in Myopia study [17]. There was no overlap between the CFR and Australian data sets. Owing to potential residual ethnic heterogeneity within the Melbourne population, for the Australian cohort only we performed an additional screen to minimise heterogeneity after performing principal components analysis (PCA) to remove individuals who clustered with non-CEU individuals (see below). We achieved this by performing PCA on the Australian cases and controls without reference samples of known ancestry. We then paired each case with a control in a 1:1 ratio based on a maximum separation of 0.050 using the first and second eigenvectors. All unpaired samples were excluded, leaving 441 cases and 441 controls in the study. The genomic inflation factor, l GC , was 1.02 after this filtering.
UK2 (NSCCG) [1] consisted of 2,854 CRC cases (58% male, mean age at diagnosis 59.3 years; SD68.7) ascertained through two ongoing initiatives at the Institute of Cancer Research/Royal Marsden Hospital NHS Trust (RMHNHST) from 1999 onwards -The National Study of Colorectal Cancer Genetics (NSCCG) and  Cambridge/SEARCH consisted of 2,248 CRC cases (56% male; mean age at diagnosis 59.2 years, SD68.1) and 2,209 controls (42% males; mean age 57.6 years; SD615.1. Samples were ascertained through the SEARCH (Studies of Epidemiology and Risk Factors in Cancer Heredity, http://www.cancerhelp.org. uk/trials/a-study-looking-at-genetic-causes-of-cancer) study based in Cambridge, UK. Recruitment started in 2000; initial patient contact was though the general practitioner. Control samples were collected post-2003. Eligible individuals were sex-and frequencymatched in five-year age bands to cases.
The COIN samples [18] were 2,151 cases derived from the COIN and COIN-B clinical trials of metastatic CRC. Median age was 63 years. COIN cases were compared against genotypes from 2,501 population controls (1,237 males,) from the WTCCC2 National Blood Service (NBS) cohort (50% male; mean age at diagnosis 53.2 years, SD615.4).
The Helsinki (FCCPS) study (http://research.med.helsinki.fi/ gsb/aaltonen/) comprised 988 cases from a population-based collection centred on south-eastern Finland and 864 population controls from the same collection.
EPICOLON [19] included 1,410 cases matched with the same number of controls collected in a prospective fashion from centres in Spain. Exclusion criteria were Mendelian CRC syndromes and a personal history of inflammatory bowel disease.
In all cases CRC was defined according to the ninth revision of the International Classification of Diseases (ICD) by codes 153-154 and all cases had pathologically proven adenocarcinomas.

Sample preparation and genotyping
DNA was extracted from samples using conventional methods and quantified using PicoGreen (Invitrogen). The VQ, UK1, Scotland1 and Australia GWA cohorts were genotyped using Illumina Hap300, Hap370, or Hap550 arrays. 1958BC and NBS genotyping was performed as part of the WTCCC2 study on Hap1M arrays. The CCFR samples were genotyped using Illumina Hap1M or Hap1M-Duo arrays. In UK2 and Scotland2, genotyping was conducted using custom Illumina Infinium arrays according to the manufacturer's protocols. Some COIN SNPs were typed on custom Illumina Goldengate arrays. To ensure quality of genotyping, a series of duplicate samples was genotyped, resulting in 99.9% concordant calls in all cases.
Other genotyping was conducted using competitive allelespecific PCR KASPar chemistry (KBiosciences Ltd, Hertfordshire, UK), Taqman (Life Sciences, Carlsbad, California) or MassAR-RAY (Sequenom Inc., San Diego, USA). All primers, probes and conditions used are available on request. Genotyping quality control was tested using duplicate DNA samples within studies and SNP assays, together with direct sequencing of subsets of samples to confirm genotyping accuracy. For all SNPs, .99% concordant results were obtained.

Quality control and sample exclusion
We excluded SNPs from analysis if they failed one or more of the following thresholds: GenCall scores ,0.25; overall call rates ,95%; MAF,0.01; departure from Hardy-Weinberg equilibrium (HWE) in controls at P,10 24 or in cases at P,10 26 ; outlying in terms of signal intensity or X:Y ratio; discordance between duplicate samples; and, for SNPs with evidence of association, poor clustering on inspection of X:Y plots.
We excluded individuals from analysis if they failed one or more of the following thresholds: duplication or cryptic relatedness to  To identify individuals who might have non-northern European ancestry, we merged our case and control data from all sample sets with the 60 European (CEU), 60 Nigerian (YRI), and 90 Japanese (JPT) and 90 Han Chinese (CHB) individuals from the International HapMap Project. For each pair of individuals, we calculated genome-wide identity-by-state distances based on markers shared between HapMap2 and our SNP panel, and used these as dissimilarity measures upon which to perform principal components analysis. Principal components analysis was performed using Eigenstrat/SmartPCA using CEU, YRI and HCB HapMap samples as reference. The first two principal components for each individual were plotted and any individual not present in the main CEU cluster (that is, .5% of the PC distance from HapMap CEU cluster centroid) was excluded from subsequent analyses.
We had previously shown the adequacy of the case-control matching and possibility of differential genotyping of cases and controls using Q-Q plots of test statistics in STATA. The inflation factor l GC was calculated by dividing the mean of the lower 90% of the test statistics by the mean of the lower 90% of the expected values from a x 2 distribution with 1 d.f. Deviation of the genotype frequencies in the controls from those expected under HWE was assessed by x 2 test (1 d.f.), or Fisher's exact test where an expected cell count was ,5.

SNP selection and genotyping for fine mapping
Regions selected for fine mapping were: chr15:30,733,560-30,802,752; chr14:53,430,973n 53,530,761; and chr20:6,292,730-6,402,661. These corresponded to the haplotype blocks and immediately flanking regions harbouring rs4779584, rs4444235, and rs961253. To define these haplotype blocks and the recombination hotspots harbouring these CRC-associated SNPs, we used Haploview and SequenceLDHot. From dbSNP (build 128), we selected all SNPs between the recombination hotspots flanking the haplotype block. All these SNPs were submitted to Illumina for assay design and those with a design score.0.3 were genotyped on custom arrays in the UK2 and Scotland2 case-control series. In total, we genotyped 81, 42 and 60 SNPs in the 15q13.3, 14q22.2 and 20p12.3 regions respectively. A list of these SNPs is shown in Table S1. Association statistics, using an additive model, were obtained with SNPTEST v2 (www.stats.ox.ac.uk/,marchini/ software/gwas/snptest.html). We used genotype data from the 1000 Genomes CEPH (http://www.1000genomes.org/) and Hap-Map3 CEPH and TSI samples (www.hapmap.org/) and the IMPUTE v2 software (https://mathgen.stats.ox.ac.uk/impute/ impute_v2.html) to generate in silico genotypes at additional SNPs in all three regions. This imputation resulted in the addition of 74, 113 and 255 markers in the chromosome 15q13.3, 14q22.2 and 20p12.3 regions respectively (for details on imputed and genotyped markers see Table S1). Association meta-analyses only included markers with proper_info scores .0.5, imputed call rates per SNP .0.9 and minor allele frequencies (MAFs) .0.01. Meta-analyses of the two sample sets were carried out with Meta (http://www.stats. ox.ac.uk/,jsliu/meta.html) using the genotype probabilities from IMPUTE v2, where a SNP was not directly typed. To test for the presence of additional independent risk alleles in each region, we carried out logistic regression analysis within each region, both pairwise with the original tagSNP and then in a backwards analysis that included all SNPs with evidence of association in the metaanalysis at P,5610 24 .

Statistical analysis
Association between SNP genotype and disease status was primarily assessed in STATA v10 (http://www.stata.com/) and PLINK v1.07 (http://pngu.mgh.harvard.edu/,purcell/plink/) using allelic and Cochran-Armitage tests (both with 1df) respectively, or by Fisher's exact test where an expected cell count was ,5. Genotypic (2df), dominant (1df) and recessive (1df) tests were also performed. The risks associated with each SNP were estimated by allelic, heterozygous and homozygous odds ratios (ORs) using unconditional logistic regression, and associated 95% confidence intervals (CIs) were calculated. Joint analysis of data generated from multiple phases was conducted using standard methods for combining raw data based on the Mantel-Haenszel method in STATA and PLINK. The reported meta-analysis statistics were derived from analysis of allele frequencies, and joint ORs and 95% CIs were calculated assuming fixed-and random-effects models. Tests of the significance of the pooled effect sizes were calculated using a standard normal distribution. Cochran's Q statistic to test for heterogeneity [20] and the I 2 statistic [21] to quantify the proportion of the total variation due to heterogeneity were calculated. Large heterogeneity is typically defined as I 2 $75%. Where significant heterogeneity was identified, results from the random effects model were reported. Alongside, we also performed meta-analysis based on allele dosage (0, 1, 2) and incorporated age and sex as co-variates. Although age and sex are associated with colorectal cancer risk, they were not associated with SNP genotype and did not materially affect the significance of any of the 6 reported associations (details not shown).
We used Haploview software v4.2 (http://www.broadinstitute. org/haploview) to infer the LD structure of the genome in the regions around GREM1, BMP2 and BMP4. The combined effects of pairs of loci identified as associated with CRC risk were investigated by multiple logistic regression analysis in PLINK to test for independent effects of each SNP and stratifying by sample series. Evidence for interactive effects between SNPs (epistasis) was assessed by likelihood ratio test assuming an allelic model in PLINK.
The sibling relative risk attributable to a given SNP was calculated using the formula l Ã~p pr 2 zqr 1 ð Þ 2 zq pr 1 zq ð Þ 2 p 2 r 2 z2pqr 1 zq 2 ð Þ 2 where p is the population frequency of the minor allele, q = 12p, and r 1 and r 2 are the relative risks (estimated as OR) for heterozygotes and rare homozygotes, relative to common homozygotes [22]. Assuming a multiplicative interaction, the proportion of the familial risk attributable to a SNP was calculated as log(l * )/log(l 0 ), where l 0 is the overall familial relative risk estimated from epidemiological studies of CRC, assumed to be 2.2 [23]. UK2/NSCCG2 samples were used for this estimation. The Akaike information criterion was calculated using the swaic command in STATA. Genome co-ordinates were taken from the NCBI build 36/hg18 (dbSNP b126).      Text S1 Consortium co-authors. (DOCX)