A Colorectal Cancer Susceptibility New Variant at 4q26 in the Spanish Population Identified by Genome-Wide Association Analysis

Background Non-hereditary colorectal cancer (CRC) is a complex disorder resulting from the combination of genetic and non-genetic factors. Genome–wide association studies (GWAS) are useful for identifying such genetic susceptibility factors. However, the single loci so far associated with CRC only represent a fraction of the genetic risk for CRC development in the general population. Therefore, many other genetic risk variants alone and in combination must still remain to be discovered. The aim of this work was to search for genetic risk factors for CRC, by performing single-locus and two-locus GWAS in the Spanish population. Results A total of 801 controls and 500 CRC cases were included in the discovery GWAS dataset. 77 single nucleotide polymorphisms (SNP)s from single-locus and 243 SNPs from two-locus association analyses were selected for replication in 423 additional CRC cases and 1382 controls. In the meta-analysis, one SNP, rs3987 at 4q26, reached GWAS significant p-value (p = 4.02×10−8), and one SNP pair, rs1100508 CG and rs8111948 AA, showed a trend for two-locus association (p = 4.35×10−11). Additionally, our GWAS confirmed the previously reported association with CRC of five SNPs located at 3q36.2 (rs10936599), 8q24 (rs10505477), 8q24.21(rs6983267), 11q13.4 (rs3824999) and 14q22.2 (rs4444235). Conclusions Our GWAS for CRC patients from Spain confirmed some previously reported associations for CRC and yielded a novel candidate risk SNP, located at 4q26. Epistasis analyses also yielded several novel candidate susceptibility pairs that need to be validated in independent analyses.


Introduction
Colorectal cancer (CRC) represents globally, in terms of frequency, the third leading cause of cancer-related mortality, and the second most frequent malignant disease in Europe [1].A minority of patients have a family history of CRC, suggesting some hereditary contribution.Germ-line mutations have been identified as the cause of inherited cancer risk in some of these CRC-prone families.Overall, high penetrance mutations are estimated to account for less than 5% of CRC cases [2].On the other hand, the vast majority of patients with CRC have no clear evidence of having inherited the disorder and are therefore classified as ''sporadic'' cancer.
Sporadic CRC is considered a complex disorder resulting from the combination of genetic and non-genetic risk factors in concert with somatic genetic and epigenetic alterations.The non-Mendelian genetic risk factors are common low-risk variants distributed throughout the genome.The Genome-wide association studies (GWAS) approach is an useful tool for identifying such variants [3].Using this approach about 30 risk genetic variants related to CRC susceptibility have been reported in the last years [4][5][6][7][8][9][10][11][12][13][14][15].In spite of this, the combined effect of these variants altogether only represents a small proportion of the genetic risk for CRC development in the general population [16].This suggests that many other risk genetic variants are yet to be discovered.
In general, GWAS have been insufficient to uncover all genes involved in complex diseases and, most importantly, they have not been very useful in isolating specific molecular pathways related to the disorders under study [17].One of the reasons could be that single-locus approach is typically the only method applied to GWAS datasets, and this does not take into account the multigenic nature that underlies the etiology of complex diseases.Thus, new analytical methods that would help to detect more powerful genetic associations based on combination of markers have been proposed by us and others [18][19][20].Recently, the first two-locus association study in CRC has been reported [21].Additional studies are clearly necessary for a more comprehensive understanding of the genetic complexity of CRC susceptibility in the different human populations.
The aim of this work was to search for genetic risk factors for CRC in the Spanish population, performing a new GWAS using single-locus and two-locus genetic association analyses.
All of the SNPs were genotyped using the Affymetrix NSP I 250K chip.After quality control, 20 cases were discarded (4 discordant sex, 8 different ethnicity and 8 low sample call rate).Finally, 480 cases and 801 controls were selected for association analysis.Principal component analysis performed among this sample did not reveal population admixture (Figure S1).Age at recruitment was 58.069.1 years in cases and 51.968.8 years in controls (mean 6 standard deviation).The corresponding number (percentage) of female samples were 278 (57.9%), and 368 (45.9%), respectively.Among the 262264 SNPs that can be genotyped with this chip, 83334 did not pass the quality controls (52964 SNPs were discarded due to low minor allele frequency (MAF), 2307 SNPs failed HWE, and 28333 had a significantly different rate of missingness between case and control groups).A total of 178,930 markers were finally selected for subsequent association analyses.There was no overall inflation of the test statistic (genomic inflation factor = 1.10) (see Figure S2), providing reassurance that systematic confounding factors were unlikely.
We also performed a two-locus analysis using the HFCC software (see Patients and Methods section), exclusively on the SNPs that passed the quality controls.A total of 1.60610 10 twolocus combinations were finally obtained.After applying control direction and tracking filters, this software yielded 5x10 5 two locus strata.Although none of them reached the cut off p value established at 3.12610 212 some pairs reached values close to that threshold (Table S2).

Phase II. Validation and meta-analysis
To test the best genetic associations observed in phase I, first, those SNPs that were included in any of the best 157 two-locus signals (Table S2) were selected.These pairs accounted for 276 single SNPs because 38 SNPs were present in more than one pair.Second, 79 SNPs from the single-locus analyses were selected according to the association p value obtained in phase I (p, 6.9610 24 ) or the probability to be successfully genotyped with the Veracode technology.Thus, a total of 355 SNPs were initially selected for the preparation of custom-made arrays.However, it was only possible to design oligonucleotide pools for 340 SNPs (79 single locus SNPs and 261 two-locus SNPs).
These genetic markers were genotyped in 423 different cases and 1448 different controls (NXC-VAL sample).Age at recruitment was 58.767.3 years in cases and 51.1612.9 in controls (mean 6 standard deviation).The corresponding number (percentage) of female samples was 262 (61.8%), and 920 (63.5%), respectively.Twenty SNPs did not pass the quality control (14 SNPs were not genotyped in more than 80% of samples, and 6 SNPs showed a HWE p-value ,0.001 in controls).As for the samples, 66 controls were excluded (31 individuals did not achieve a genotyping call rate .80%,and 35 individuals showed some degree of relatedness to each other according to data obtained with the GRR software).Finally 423 CRC cases and 1382 controls were genotyped with 320 markers (77 single-locus and 243 two-locus selected SNPs) (Table S3).Table 1 shows those selected SNPs that were replicated in the NXC-VAL sample (p, 0.05 and same effect direction).Only one SNP, rs3987 at 4q26, reached a GWAS significant p-value in the meta-analysis (Table 2).Interestingly, four more SNPs in the same genomic region showed a trend for association at GWAS-significant p-value (Table 2).
Regarding two locus analysis, only five pairs were validated in phase II (p,0.05 and same effect direction).Although none of them reached GWAS significant p-value (p,3.12610 212) in the meta-analysis (Table 3), a SNP pair, rs1100508 CG and rs8111948 AA, was borderline for association (4.35610 211 ).

Result validation using additional datasets
To test whether the results could be replicated in another Spanish dataset, we used data from the Epicolon project [23].However, none of the SNPs that were considered significant or candidates in phase II of this study replicated in this Epicolon sample.
The results obtained in our GWAS (phase I and II), and those obtained from the Epicolon cohort, were combined in an effort to see a global effect of all those SNPs checked in phase II.None of the SNPs reached the GWAS significant p-value in the combined study (Table S4).Table 4 shows the best results obtained in this study (selected from those SNPs showing an effect in the same direction in all three analyzed series.See details from those selected SNPs in Table S5).
Regarding two-locus HFCC analysis, no SNP-pair showed a significant and consistent effect (in the same direction) when the 3 samples (NXC-GWAS, NXC-Val and Epicolon) were analyzed together.

Analysis of SNPs previously associated with CRC
Only one of the previously associated SNPs with CRC risk was successfully genotyped in our GWAS.In order to cover a greater number of these SNPs we imputed genotypes using CEU HapMap data base and Plink software.After imputation, we obtained a total of 1,371,009 SNPs for subsequent analysis.A total of 16 previously reported as CRC associated SNPs were available at the time of the analysis (Table 5).Of these, five SNPs located at 3q36.2 (rs10936599), 8q24 (rs10505477), 8q24.21(rs6983267),11q13.4(rs3824999) and 14q22.2(rs4444235), showed nominal association with CRC in our GWAS, and with effects in the same direction than those previously reported (Table 5).Two more SNPs located at 8q23.3 (rs16892766) and 12q13.13(rs7136702) showed a trend to nominal association with CRC in our study, again with the effect in the same direction than previously reported (Table 5).
We could not test the candidate SNPs reported by Fernandez-Rozadilla et al. [23] in their CRC-GWAS performed in the Spanish population (Epicolon sample), because those candidates were not covered or successfully genotyped/imputed in our study.
We also tested two-locus interactions between rs1571218 (20p12.3)and rs10879357 (12q21.1)previously associated with CRC [21].Applying general lineal models we did not observe any evidence of interaction between them in our dataset (data not shown).

Discussion
We present a new two-phase CRC-GWAS carried out in the Spanish population for single locus and also for two-locus association using our HFCC software [18].One marker, rs3987 at 4q26, reached association with CRC susceptibility at GWAS significant p-value.Furthermore, one SNP pair, rs1100508 CG rs8111948 AA (located at 7q31.33 and 19q12, respectively), showed also a trend for epistatic association.
In spite of limitations of our GWAS -low density of genomic coverage of the DNA-chip, and a moderate sample size -we replicated 5 of the 16 SNPs previously associated with CRC.In addition, the majority of these 16 SNPs in our GWAS study were in the same direction than in the published reports (Table 5).Furthermore, regression analysis showed good concordance of the odds ratios (Figure S3).These data together suggest that our study is in line with previously published CRC GWAS analyses.
In our two-phase CRC-GWAS, one marker, namely rs3987 at 4q26, exhibited association with CRC susceptibility at GWAS significant p-value.This SNP is located in an intergenic region of 4q26 between TRAM1L1 and NDST3 genes (,500 kb and ,180 kb, respectively).Several studies have already suggested the presence of cancer genes in 4q region [24,25], and it has also been reported that somatic deletions at 4q26 are frequent in CRC [26,27].Interestingly, the NDST4 gene, located also at 4q26, and belonging to the same family than NDST3, has been identified as a possible tumour suppressor gene in CRC [27].
The two-locus analysis revealed that one of the SNPs pairs, rs1100508 CG and rs8111948 AA (located at 7q31.33 and 19q12, respectively), showed a trend for association.These SNPs are in intergenic regions located at 7q31.33 and 19q12.The closest gene to rs1100508 is GPR37, a member of the G protein-coupled receptor family that is known to interact with Parkin, albeit its function remains to be fully characterized.On the other hand, rs8111948 is located between LINC00662 and LINC00906 (,500 kb and ,600 kb, respectively), two loci belonging to the long non-coding RNA (lncRNA) family.If the association of this SNP pair is confirmed, the nature of that interaction will need to be further characterized.We also studied the markers associated with CRC from our two-phase GWAS in an independent Spanish GWAS dataset (Epicolon), but none of these associations replicated.However, since our GWAS could validate more of the well-stablished CRC associations than the Epicolon GWAS [23], we consider that the candidates derived from our study deserve to be validated in further meta-analysis including other GWAS and validation studies performed in the Spanish population, or in a more general Caucasian population.
According to the GWAS catalogue from NIH (http://www.genome.gov/26525384),and previous works in this topic [5][6][7][8][9][10][11][12][13][14][15], neither the variants associated with CRC reported in table 1 or 2, nor variants included in the SNP pairs reported in table 3 (or in linkage disequilibrium with them) have been previously associated with CRC.Since the majority of these previous studies were not particularly performed in the Southern Caucasian population, our results could be specific for that population.An alternative explanation would be that they are false positive.The clustering of several SNPs at the same 4q26, and the replication of previously reported associations argues against this possibility.
Although our results could not be replicated in the independent Epicolon sample, we carried out a meta-analysis taking into account the three samples analyzed here (NXC-GWAS, NXC-VAL, and Epicolon).None of the SNPs, or combinations of them, were replicated in the three samples, but the best signals comprise several SNPs in linkage disequilibrium at 9q31.1, within or close to LINC00587 locus (Table 4).This gene also belongs to the lncRNA family involved in cellular differentiation and proliferation as posttranscriptional regulators of splicing or as molecular decoys for miRNA [28,29].The expression of lncRNAs is deregulated in many different cancers, including colon cancer [30], and some studies suggest a role in cancer initiation, progression and metastasis [31].The association reported in previous GWAS between CRC susceptibility and SNPs located at 8q24 could be due to the PRNCR1 locus, a lncRNA member [32].
Interestingly, a high proportion of SNPs found to be associated with CRC in our study discovery phase (tables 1, 2 and 4), were selected by the two-locus analysis.This suggests that in addition to identify epistatic interactions, our two-locus analysis method (HFCC software) can also improve the capture of single signals in the genome related to CRC susceptibility in particular and thus in multigenic disease in general.This is an enticing hypothesis that might be confirmed if some of these SNPs are validated in future studies.On the other hand, the results of our two-locus analyses suggest that the interaction signals have no more powerful predictive value than single loci for CRC susceptibility because of the failure to detect SNP pairs associated with CRC at GWAS significant p-value.This observation, together with the absence of statistically significant results in our global meta-analysis, as well as the lack of replication of the only SNP pair interaction previously reported as associated with CRC [21] suggests that the role of genetic factors in CRC susceptibility might be more intricate that previously thought.
In conclusion, we have carried out a CRC-GWAS in the Spanish population that is in line with some previously reported associations and yielded a new candidate SNP for CRC susceptibility at 4q26 that needs to be validated in future studies.Our two-locus study also provides evidence of the high level of complexity in genetic cancer risk.

Patients
Subjects in phase I were 801 controls from the Spanish general population (which were previously described [33]) and 500 cases diagnosed of CRC with pathological confirmation (NXC-GWAS sample).In phase II 1448 controls and 423 cases of CRC were used (NXC-VAL sample).CRC samples were collected in two different Spanish hospitals (Hospital Universitario Virgen del Rocı ´o in Seville and Hospital Universitario 12 de Octubre in Madrid) from November 2002 to April 2008.The control samples included in phase II were collected during the same time period in several primary health care centres from all around Spain.These samples have been previously used as controls in other association studies performed for different diseases in the Spanish population [34].Therefore, a total of 923 CRC cases and 2249 controls from the Spanish general population were included in this study.All individuals enrolled were Caucasian with registered Spanish ancestors (two generations) as recorded by clinical researchers.
of the effect had to be the same for each replication group (which approximates to p,1610 26 over all three replication groups).
To explore the nature and strength of interactions in selected two-locus patterns, we further evaluated epistasis among selected markers using Alambique software [18].Specifically, Alambique was programmed to measure departure from additive models by calculating the Synergy index, AP or RERI statistics, whilst departure from multiplicity was measured by computing strataspecific odds ratios and case-only interaction test.The algorithms included in the Alambique software have been previously described elsewhere [42,43].
During the validation process, those SNPs selected by HFCC that were successfully genotyped in the NXC-VAL sample were analyzed for replication.In this case two groups of replication were created: the NXC-GWAS sample and the NXC-VAL sample.When the selected pairs were also studied in the Epicolon cohort, three groups of replication were created: NXC-GWAS, NXC-VAL and the Epicolon sample.
Multiple-testing correction was applied in those studies taking into account the number of different SNP-pairs generated.Thus, the p-value threshold was established at (p = 3.12610 212 (0.05/ total number of SNP-pairs generated in the phase I dataset).

Imputation
We imputed genotypes using HapMap phase 2 CEU founders (n = 60) as a reference panel with Plink [22].Genotype calls with high quality scores (info.0.8) were used in subsequent association analyses.Figure S3 Correlation between the effects (OR) found in the NXC-GWAS and the reported effects for the 16 SNPs previously found to associate with CRC risk.The blue line represents perfect correlation.The green line indicates the correlation excluding the outlayer rs16969681 (red circle).This SNP was originally reported in the UK2 GWAS with an OR of 1.247, that reached GWAS significant after meta analysis with other Northern Europe GWAS but was not replicated in the Epicolon GWAS of Southern Europe.The coefficient of determination (R2) and p-value (Pearson's P) of the correlation are indicated.Without excluding the rs16969681, the coefficient of determination and p-value were 0.28 and 0.035, respectively.(PDF)

Figure 1 .
Figure 1.Manhattan plot of CRC-GWAS.Blue and red horizontal lines correspond to p values of 6.97610 24 and 5610 28 respectively.doi:10.1371/journal.pone.0101178.g001 ; SNP: Single Nucleotide Polymorphism; BP: Base pair position; A1: Reference allele (minor allele).The last eight columns show the minor allele frequency in cases (MAF A), the minor allele frecuency in controls (MAF U) and, the p and the Odds Ratio (OR) values obtained in each analyzed sample.*SNPsselected by two-locus association analyses in the NXC-GWAS sample.{Thenearest gene or the gene where the SNP is located.`According to UCSC genome browser (NCBI36/hg18) and dbSNP build 130. doi:10.1371/journal.pone.0101178.t001

Figure
Figure S1 Scatterplot of the two main eigenvectors obtained from the principal component analysis performed on 801 controls (green circles) and 480 cases (blue circles) selected for the phase-I association study.(PDF) Figure S2 Quantile-Quantile (Q-Q) plot of the observed and expected x2 values obtained from the study of the association between SNP genotype and colorectal cancer risk.(PDF)

Table 1 .
SNPs validated in the phase II.

Table 2 .
Meta-analysis of SNPs validated in the phase II.
CHR: Chromosome; SNP: Single Nucleotide Polymorphism; BP: Base pair position; A1: Reference allele (minor allele); P: Fixed-effects p-value; P(R): Random-effects p-value; OR: Fixed-effects Odds Ratio; OR(R): Random-effects Odds Ratio; Q: p-value for heterogeneity of OR; I: effect size for heterogeneity of OR. *SNPs selected by two-locus association analyses in the NXC-GWAS sample.{Thenearest gene or the gene where the SNP is located.

Table 3 .
Best SNP6SNP interactions validated in the phase II and meta-analysis results.The nearest gene or the gene where the SNP is located.doi:10.1371/journal.pone.0101178.t003 {

Table 4 .
Top results in the global meta-analysis.The nearest gene or the gene where the SNP is located. {

Table 5 .
Results of previously reported SNPs that were successfully genotyped or imputed in our analysis.Single Nucleotide Polymorphism.The last four columns show the allele reported frequency in cases (F A), in controls (F U) and, the p and the Odds Ratio (OR) values obtained in the NXC-GWAS sample.In bold type, SNPs with a nominal p-value below 0.05.
{The nearest gene or the gene where the SNP is located.* Table S1 Best phase I results obtained by Plink.Table S3 SNPs included in the phase II and metaanalysis results.(DOC) Table S4 SNPs included in the stage II and global metaanalysis results.(DOC) Table S5 Details of the results obtained in each sample from those SNPs that showed the best results in the global meta-analysis.