Single nucleotide polymorphisms associated with susceptibility for development of colorectal cancer: Case-control study in a Basque population

Given the significant population diversity in genetic variation, we aimed to investigate whether single nucleotide polymorphisms (SNPs) previously identified in studies of colorectal cancer (CRC) susceptibility were also relevant to the population of the Basque Country (North of Spain). We genotyped 230 CRC cases and 230 healthy controls for 48 previously reported CRC-susceptibility SNPs. Only rs6687758 in DUSP10 exhibited a statistically significant association with CRC risk based on the crude analysis. The rs6687758 AG genotype conferred an approximately 2.13-fold increased risk of CRC compared to the AA genotype. Moreover, we found significant associations in cases between smoking status, physical activity, and the rs6687758 SNP. The results of a Genetic Risk Score (GRS) showed that the risk alleles were more frequent in cases than in controls and that the score was associated with CRC in the crude analysis. In conclusion, we have confirmed a CRC susceptibility locus and the existence of associations between modifiable factors and the rs6687758 SNP; moreover, the GRS was associated with CRC. However, further experimental validations are needed to establish the role of this SNP, the function of the gene identified, as well as the contribution of the interaction between environmental factors and this locus to the risk of CRC.


Introduction
Colorectal cancer (CRC) is the fourth most common type of tumour, accounting for 6.1% of all new cancer cases diagnosed in 2018, and one of the major causes of cancer-related morbidity and mortality globally (9.2% of cancer deaths) [1]. There is wide geographical variation in incidence, with rates varying 8-fold (colon cancer) and 6-fold (rectal cancer) in both sexes worldwide [1]. Spain is one of the countries with the highest incidence of CRC: taking both sexes into account, it was the most frequently diagnosed cancer in 2018, with 13.7% of new cancer cases [2], and it is the main cause of cancer-related deaths [3]. Considering the magnitude of the problem, the use of screening tests for early detection and effective treatment of CRC during the initial stages would have a significant impact on public health. In this regard, the US Preventive Services Task Force and the American Cancer Society recommend screening for CRC by annual faecal occult blood testing (FOBT), flexible sigmoidoscopy (every 5 years) or colonoscopy (every 10 years) in subjects aged 50 years or older [4].
The mechanisms underlying CRC occurrence and progression are complicated and mainly involve genetic and environmental factors, such as sex [5,6], diet and physical activity [5,7]. Various oncogenes and tumour suppressors, such as KRAS, APC, BRAF, TP53, and SMAD4, have been identified by CRC-related studies and may be useful for diagnosing and treating CRC in the future [5,8,9].
There is a direct association between sporadic tumour occurrence and susceptibility variants carried by an individual [10]. Many candidate gene [11] and genome-wide association studies (GWAS) [12] have evaluated common genetic risk factors for CRC; however, only a few of these have been replicated in subsequent studies [10]. Thus, in this study, we aimed to test the hypothesis that some of the previously reported CRC-related SNPs are associated with CRC susceptibility in the Basque population, in which there are no previous studies of this kind. Therefore, we investigated possible associations between 48 susceptibility SNPs and development of sporadic CRC in the adult population of the Basque Country.

Design
This is an observational, matched case-control study in a population group residing in the Basque Country (Spain).

Study population
Participants in this study were recruited among patients attending, between January 2012 and December 2014, any of the three hospitals of the Osakidetza/Basque Health Service (Basurto, Galdakao and Donostia) that belong to the Basque Country Colorectal Cancer Screening Programme (CRCSP) [13]. To be eligible for the CRCSP, individuals had to be at average risk, aged 50 to 69 years, asymptomatic for colorectal symptoms and registered with the Osakidetza/Basque Health Service [13]. Subjects with symptoms suggesting CRC or at high risk of CRC, such as individuals with familial adenomatous polyposis or hereditary nonpolyposis colorectal cancer, are managed outside this programme and were not included in this analysis. Subjects were invited to participate in this study by the gastroenterologists who performed the colonoscopies as a confirmatory test.
The recruitment and data collection for the present study were conducted between 2014 and 2016. All patients who were newly diagnosed with CRC (n = 601) were invited to participate, that is, individuals with a positive (abnormal) result on an immunochemical faecal occult blood test (iFOBT), with a faecal haemoglobin cut-off point of 20 μg Hb/g faeces for both sexes [13], and a confirmatory colonoscopy [13]. Of those, 283 refused to participate in the study, and 10 were excluded due to missing information. Ultimately, 308 subjects (66.2% men) consented to participate in the survey and completed all the questionnaires.
In addition, for each case, three age- (±9.0 years) and sex-matched control patients were randomly sought from the list of CRC-free subjects (n = 1,836) who participated in the CRCSP during the same period as the cases. The matched controls were patients with positive (abnormal) iFOBT results and negative (normal) colonoscopy results. The participation rate of the controls was 37.6%, and 17 subjects were excluded due to missing information. Finally, the matched case-to-control ratio was 1:1, and the final dataset included 308 cases who were diagnosed with CRC and 308 age- and sex-matched controls. The flowchart displaying the selection process for the CRC cases and controls is shown in Fig 1. Thirty-three cases, 39 controls and 6 case-control pairs initially included in this study were excluded from the genetic analysis because genotyping was incomplete owing to insufficient DNA for the assay; the matched partners of the excluded cases and controls were also excluded. Finally, genotyping data were obtained from 230 cases and 230 controls.
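The random selection of age- and sex-matched controls described above can be illustrated with a short sketch. The text does not describe the sampling algorithm that was used, so the function below is only one plausible way to do it; the function name, the data layout (`id -> (age, sex)`) and the fixed random seed are assumptions made for illustration.

```python
import random

def match_controls(cases, controls, n_per_case=3, age_tol=9.0, seed=0):
    """Randomly draw up to n_per_case unused controls per case.

    cases, controls: dicts mapping an identifier to (age, sex).
    A control is eligible if it has the same sex as the case and its age
    differs by at most age_tol years. Each control is used at most once.
    Hypothetical helper: the study's actual sampling procedure is not given.
    """
    rng = random.Random(seed)      # fixed seed so the draw is reproducible
    available = dict(controls)     # copy, so the caller's dict is untouched
    matches = {}
    for case_id, (age, sex) in cases.items():
        eligible = [
            cid for cid, (c_age, c_sex) in available.items()
            if c_sex == sex and abs(c_age - age) <= age_tol
        ]
        chosen = rng.sample(eligible, min(n_per_case, len(eligible)))
        for cid in chosen:
            del available[cid]     # sampling without replacement across cases
        matches[case_id] = chosen
    return matches

# Toy data, invented purely for illustration.
cases = {"case1": (60, "M"), "case2": (55, "F")}
controls = {
    "c1": (58, "M"), "c2": (66, "M"), "c3": (52, "M"),
    "c4": (69, "M"), "c5": (50, "M"), "c6": (57, "F"),
}
matches = match_controls(cases, controls)
```

In this toy example `c5` can never be matched to `case1`, because the age difference (10 years) exceeds the ±9.0-year tolerance stated in the text.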
The time elapsed between participation in the CRCSP and participation in the present study was 1.8 (1.0) years (range: 0.4-4.6) in cases and 1.6 (1.5) years (range: 0.2-3.7) in controls, without significant differences (P = 0.119). Consenting participants self-completed and returned a detailed Food Frequency Questionnaire (FFQ) and one general questionnaire (GQ). The questions referred to behaviours before participating in the CRCSP. Assistance from the study staff was available to help the patients understand the items on the questionnaires.
This study was conducted according to the guidelines laid down in the Declaration of Helsinki, and all procedures involving patients were approved by the Clinical Research Ethics Committee of the Basque Country (reference numbers PI2011006 and PI2014042). Written informed consent was obtained from all the study participants.

Biological samples and genotyping
In this study, healthy tissue or saliva samples of 230 CRC patients and 230 controls were collected and genotyped. Samples were provided by the Basque Biobank for Research-OEHUN (www.biobancovasco.org) and were processed following standard operating procedures with appropriate ethical approval. DNA was extracted using the AllPrep DNA/RNA kit (Qiagen) for paraffin-embedded tissue samples and the AutoGenFlex Tissue DNA Extraction kit (Autogen) for mouthwash saliva samples, and was then quantified with a NanoDrop™ Spectrophotometer (ThermoFisher).
Double-stranded DNA was quantified by fluorometry using the Quant-iT™ PicoGreen® dsDNA Assay Kit (Invitrogen, CA) on a DTX 880 Multimode Detector (Beckman Coulter) to normalize DNA concentration. After an updated review of the published SNPs associated with susceptibility for development of CRC [14,15], those shown in Table 1 were selected. These SNPs were organized by the gene(s) at or near the locus and by chromosome locus. Allelic discrimination was assessed using the MassARRAY® System (Agena Bioscience) at CeGen-PRB2-ISCII (Nodo USC) following the procedure provided by the manufacturer. Quality control samples were included in the genotyping assays.

Associated data
The GQ mentioned above was used to gather information on weight status (self-reported weight and height) and environmental factors (demographic factors: age and sex; and lifestyle information: physical activity (PA) and smoking). These questions were taken from the Spanish Health Questionnaire [16]. Body mass index (BMI), estimated from self-reported height and weight, was classified according to the WHO criteria for those under 65. The FFQ consists of 67 items and requires the subjects to recall the number of times each food item was consumed either per week or per month. Respondents could also record the consumption of other foods that were not included on the food list.
Average portion sizes were used to convert the consumption frequencies reported in the FFQ into amounts consumed [21]. For items that included several foods, each food's contribution was estimated with weighting coefficients that were obtained from the usual consumption data [22]. All the food items that were consumed were entered into DIAL 2.12 (2011, ALCE INGENIERIA), a dietary assessment software package, to estimate energy intake (kcal/d). Moreover, the FFQ included specific questions about the frequency of intake of five major types of alcoholic beverages: beer, wine, cider, aperitif with alcohol and liquor. In terms of the amount consumed, 10 g of alcohol was considered a standard drink [23]. Participants were categorized into non-drinker/moderate consumption and risk consumption, according to the SENC criteria, which define moderate drinking as up to 1 standard drink per day for women and up to 2 standard drinks per day for men [23]. Alcohol consumption was also expressed in tertiles of ml per day according to sex (men: T1, ≤ 70.6; T2, 70.7-138.8; T3, ≥ 138.9; and women: T1, ≤ 5.8; T2, 5.9-69.8; T3, ≥ 69.9).
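The two alcohol categorizations described above can be expressed as a small sketch. The thresholds (10 g per standard drink, 1 or 2 drinks per day, and the sex-specific tertile cut-offs) are taken from the text; the function names and the use of grams per day as the input to the first function are assumptions made for illustration.

```python
STANDARD_DRINK_G = 10.0  # grams of alcohol per standard drink (from the text)

# Upper limit of moderate drinking, in standard drinks per day (from the text).
MODERATE_LIMIT = {"women": 1.0, "men": 2.0}

# Upper bounds of tertiles T1 and T2, in ml per day (from the text).
TERTILE_BOUNDS = {"men": (70.6, 138.8), "women": (5.8, 69.8)}

def classify_alcohol(grams_per_day, sex):
    """Return 'non-drinker/moderate' or 'risk' for a daily alcohol intake."""
    drinks_per_day = grams_per_day / STANDARD_DRINK_G
    if drinks_per_day <= MODERATE_LIMIT[sex]:
        return "non-drinker/moderate"
    return "risk"

def alcohol_tertile(ml_per_day, sex):
    """Return 'T1', 'T2' or 'T3' using the sex-specific cut-offs."""
    t1_max, t2_max = TERTILE_BOUNDS[sex]
    if ml_per_day <= t1_max:
        return "T1"
    if ml_per_day <= t2_max:
        return "T2"
    return "T3"
```

For example, `classify_alcohol(15, "women")` falls in the risk category (1.5 standard drinks per day), while the same intake in a man counts as moderate.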
Additionally, socioeconomic status was assessed with an index obtained from the clinical databases developed by the Health Department of the Basque Government, namely the socioeconomic deprivation index (DI). This index was estimated using the MEDEA project criteria [24] from simple indicators in the 2001 Census, namely unemployment, manual workers, casual workers, low education level and low education level among young people. The DI was divided into quintiles (Q), with the first being the least disadvantaged and the fifth being the most disadvantaged. The DI was successfully assigned to 82.4% of participants, while the quality of the address information did not permit linkage for the remaining 17.6%.

Quality management
In the present research, we applied quality management procedures similar to those used in the IDEFICS study [25]. A unique subject identification number was attached to each recording sheet, questionnaire, and sample, as in other studies. The identification number had to be entered twice before the document could be entered into its respective database. All data were entered twice independently, and deviating entries were corrected. Inconsistencies that were identified by additional plausibility checks were rectified.

Statistical analysis
Statistical analyses were performed using SPSS 22.0 (SPSS Inc, Chicago, USA) and STATA 13.0 (StataCorp LP, Texas, USA). Categorical variables are shown as percentages, and continuous variables are shown as means and standard deviations (s.d.). Normality was checked using the Kolmogorov-Smirnov-Lilliefors test. A paired t-test or Wilcoxon rank-sum test was used to compare two related means, and a χ2 test was used to evaluate differences. Tests for association and deviation from Hardy-Weinberg equilibrium were performed separately in CRC patients and healthy controls. When expected frequencies were less than 5, Fisher's exact test was used.
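As an illustration of the Hardy-Weinberg check mentioned above, the following sketch computes a χ2 goodness-of-fit statistic (1 degree of freedom) from the three genotype counts of a biallelic SNP. The text does not state exactly which test was used for Hardy-Weinberg equilibrium, so the χ2 form is an assumption, and the genotype counts in the example are invented.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test for deviation from Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed counts of the AA, AB and BB genotypes.
    Returns (chi2, p_value, smallest_expected_count). A small expected
    count suggests an exact test would be more appropriate.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped subjects")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, min(expected)

# Invented counts: 50 AA, 40 AB, 10 BB.
chi2, p_value, min_expected = hwe_chi_square(50, 40, 10)
```

With these invented counts the allele frequencies are 0.7 and 0.3, the expected genotype counts are 49, 42 and 9, and the statistic is small, so there is no evidence of deviation from equilibrium.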
In the case-control study, we estimated the odds ratio (OR) and 95% confidence interval (95% CI) for the selected polymorphisms using conditional logistic regression adjusted for age (50-59 years old vs. 60-69 years old), sex (women vs. men), BMI (underweight/normal weight vs. overweight/obesity), physical activity (≥ 15 min/d vs. < 15 min/d), smoking status (never smoker vs. current and former smoker, and quit smoking ≥ 11 years ago vs. < 11 years ago), alcohol consumption (T1, T2 and T3) and Deprivation Index (DI) (quintile 1-3 vs. quintile 4-5) as categorical variables, and energy intake (kcal/d) as a quantitative variable. ORs were calculated for the codominant model, dominant model, recessive model, and allelic comparison. The most frequent genotype (homozygous) was considered the reference group to calculate ORs in the codominant and dominant models, and the most frequent genotype (homozygous) together with the heterozygous genotype containing the risk allele were considered the reference group in the recessive model. The significance level was corrected using a Bonferroni correction by dividing the standard two-tailed P value (0.05) by the total number of SNPs analyzed (n = 48), giving a significance threshold of approximately 0.001 (α = 0.05/48).
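The genotype coding under the different genetic models and the Bonferroni threshold can be sketched as follows. The sketch also includes the crude odds ratio for 1:1 matched pairs with a binary exposure, for which the conditional maximum-likelihood estimate reduces to the ratio of discordant pairs; this is a simplified stand-in for the conditional logistic regression run in the statistical packages, not the authors' code. Function names and the toy pairs are hypothetical, and genotypes are assumed to be expressed as the number of risk alleles (0 = most frequent homozygote).

```python
import math

ALPHA = 0.05
N_SNPS = 48
BONFERRONI_ALPHA = ALPHA / N_SNPS  # roughly 0.00104, reported as 0.001

def code_genotype(risk_alleles, model):
    """Map a risk-allele count (0, 1 or 2) to an exposure code."""
    if risk_alleles not in (0, 1, 2):
        raise ValueError("risk_alleles must be 0, 1 or 2")
    if model == "dominant":    # carriers (1 or 2 copies) vs. reference homozygote
        return int(risk_alleles >= 1)
    if model == "recessive":   # risk homozygote vs. the other two genotypes
        return int(risk_alleles == 2)
    if model == "codominant":  # each genotype kept as its own category
        return risk_alleles
    raise ValueError("unknown model: " + model)

def matched_pair_or(pairs, model="dominant", z=1.96):
    """Crude OR and 95% CI from 1:1 matched pairs (binary models only).

    pairs: iterable of (case_risk_alleles, control_risk_alleles).
    Only discordant pairs carry information about the odds ratio.
    """
    if model not in ("dominant", "recessive"):
        raise ValueError("binary exposure needed: use dominant or recessive")
    case_only = control_only = 0
    for case_g, control_g in pairs:
        case_exp = code_genotype(case_g, model)
        control_exp = code_genotype(control_g, model)
        if case_exp == 1 and control_exp == 0:
            case_only += 1
        elif case_exp == 0 and control_exp == 1:
            control_only += 1
    if case_only == 0 or control_only == 0:
        raise ValueError("OR undefined without both kinds of discordant pair")
    odds_ratio = case_only / control_only
    se = math.sqrt(1 / case_only + 1 / control_only)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Invented pairs: 20 with only the case exposed, 10 with only the control exposed.
pairs = [(1, 0)] * 20 + [(0, 1)] * 10 + [(0, 0)] * 5 + [(2, 2)] * 3
odds_ratio, lower, upper = matched_pair_or(pairs, model="dominant")
```

Concordant pairs (both exposed or both unexposed) drop out of the estimate, which is why a matched design with few discordant pairs yields wide confidence intervals.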
Additionally, correspondence analysis (CA) was performed using PAST 3.21 to identify potential associations between SNPs associated with CRC and associated data. CA is a multivariate statistical technique which provides Cartesian diagrams based on the association of the variables examined. All variables were represented in graphs; the closer the points, the higher the level of association between variables [26].
To assess genetic susceptibility, two methods were used: a simple, unweighted count method (count Genetic Risk Score, c-GRS) and a weighted method (w-GRS) [27,28]. Both methods assumed each SNP to be independently associated with risk [29]. An additive genetic model was assumed: weightings of 0, 1, and 2 were given according to the number of risk alleles present [29,30].
The count method assumed that each SNP contributed equally to CRC risk and was calculated by summing the number of risk alleles across the panel of SNPs tested. This produced a score between 0 and twice the number of SNPs, i.e., representing the total number of risk alleles. The weighted GRS was calculated by multiplying each β-coefficient for the CRC phenotype from the discovery set by the number of corresponding risk alleles (0, 1, or 2 copies of the risk allele, except for the SNP rs5934683 on chromosome X, which was coded 0, 0.5, and 1) and then summing the products [31].
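The two scores described above can be written compactly. The sketch below follows the coding given in the text (0, 1 or 2 risk alleles per SNP, with rs5934683 coded 0, 0.5 and 1) and assumes that the same coding applies to the count score; the β-coefficients and the two-SNP example are invented placeholders, not values from the discovery set.

```python
# SNP coded 0, 0.5, 1 instead of 0, 1, 2 (chromosome X, as stated in the text).
X_LINKED_HALF_DOSE = {"rs5934683"}

def dosage(snp, risk_allele_copies):
    """Return the additive dosage used in the score for one SNP."""
    if risk_allele_copies not in (0, 1, 2):
        raise ValueError("risk_allele_copies must be 0, 1 or 2")
    if snp in X_LINKED_HALF_DOSE:
        return risk_allele_copies / 2
    return float(risk_allele_copies)

def count_grs(genotypes):
    """Unweighted score: sum of risk-allele dosages over all SNPs."""
    return sum(dosage(snp, n) for snp, n in genotypes.items())

def weighted_grs(genotypes, betas):
    """Weighted score: sum of beta * dosage over all SNPs."""
    return sum(betas[snp] * dosage(snp, n) for snp, n in genotypes.items())

# Invented genotypes and beta-coefficients, for illustration only.
genotypes = {"rs6687758": 1, "rs4939827": 2, "rs5934683": 2}
betas = {"rs6687758": 0.1, "rs4939827": 0.2, "rs5934683": 0.3}
c_grs = count_grs(genotypes)            # 1 + 2 + 1.0
w_grs = weighted_grs(genotypes, betas)  # 0.1*1 + 0.2*2 + 0.3*1.0
```

Under this coding, a panel of 47 autosomal SNPs plus rs5934683 has a maximum count score of 2 × 47 + 1 = 95.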
Finally, we defined the GRS as the count of risk alleles across all 48 SNPs, ranging from 0 to 95 for c-GRS and 0 to 105 for w-GRS. Since the published effects of each SNP were similar, an unweighted GRS was preferred. However, we also explored the models using weights derived from the GWAS publications and models fitted to our data [32].

Gene expression association analyses
Gene expression changes in tumour and normal colon tissue associated with SNPs significantly associated with CRC risk were analyzed using publicly available data and bioinformatic tools. First, the Genomic Data Commons Data Portal (GDC) (https://portal.gdc.cancer.gov) was used to examine data generated by the TCGA (The Cancer Genome Atlas) research network (https://www.cancer.gov/tcga); for SNPs with no data available in the GDC portal, alternative bioinformatic tools were applied. On the one hand, gene expression data from case and control samples of colon and rectum adenocarcinomas were compared using GEPIA (Gene Expression Profiling Interactive Analysis) (http://gepia.cancer-pku.cn/index.html) [33]. On the other hand, GTEx (The Genotype-Tissue Expression project) (https://gtexportal.org/home/) was used to check the relationship between SNPs and the expression level of genes related to these SNPs in colon tissue of healthy donors.

Results
Table 2 shows the comparisons of associated data between cases and controls. Cases smoked more cigarettes/day and were more often engaged in regular physical activity at a medium-high level than controls. In addition, in the total sample, there were more smokers among men than among women (70.6% vs. 54.5%; P<0.001), and men smoked more cigarettes/day (11.6 (11.1) vs. 9.0 (11.4); P = 0.030). Among controls, 51.9% of women and 65.4% of men were smokers (P = 0.049); among cases, 57.1% of women and 75.8% of men were smokers (P = 0.004).
The distribution of genotypes and alleles at the selected SNPs in the CRC group and in the control group, together with deviations from Hardy-Weinberg equilibrium, is shown in the Supplementary Material (S1 Table). The SNPs that deviated from Hardy-Weinberg equilibrium in cases were rs12080929 and rs5934683. After Bonferroni correction, none of the genotype or allele frequencies for the SNPs analysed differed significantly between cases and controls. Table 3 presents some results of the association of susceptibility genotypes and alleles with the risk of CRC in the codominant model. Other SNPs analyzed in this study are shown in the Supplementary Material (S2 Table). Adjusting for potential confounders did not appreciably alter the observed ORs. Only rs6687758 exhibited a statistically significant association with CRC risk based on the crude analysis. The AG genotype of rs6687758 conferred an approximately 2.13-fold increased risk of CRC compared to the AA genotype.
Moreover, there was an association between smoking status, physical activity and the rs6687758 SNP for CRC risk in cases (Fig 2). We did not find an association between the risk genotype for rs6687758 and the other associated variables (BMI, sex, alcohol consumption, DI and age). The results of the CA for all cases are shown in a Cartesian diagram. The first three axes accounted for more than 50.0% of the total variance in all cases (axis 1: 23.0%; axis 2: 19.6% and axis 3: 13.4%). An inverse association can be observed between the variable DI, which plotted at the negative end of axis 1, and age, positioned in the positive segment of axis 1. Overall, axis 1 represents a gradient that runs from low values for DI (0: Q1-Q3; 1: Q4-Q5) to high values for age (0: 50-59 y; 1: 60-69 y). From the genetic viewpoint, the SNP that showed the closest association with the associated variables was rs6687758, which also plotted in the quadrant delimited by the positive segments of axes 1 and 2.

Discussion
In this study, we investigated SNPs associated with susceptibility for the development of CRC in a Basque population who took part in the population screening programme. We found that, out of 48 analysed SNPs, only rs6687758 was associated with the risk of CRC in this population. This is in agreement with previous GWAS that reported a positive association between this SNP and CRC, also in European populations [15,34]. Some authors have also observed relationships between this SNP and colorectal polyp risk [35], although this SNP is not significantly associated with adenoma risk and appears to exert its effect at the malignant stage of colorectal tumorigenesis [36]. The frequency of the risk allele of rs6687758 (G) in the European population (22.2%) [37] is similar to that registered in the cases of the present study and higher than that of the controls.
The other 47 risk SNPs did not replicate in our population. This may be due to differences in the underlying linkage patterns given the ethnic differences in the populations studied. Twenty-one of the SNPs analyzed have been replicated in Asian, American-Caucasian or African populations, but not in European populations (rs11903757, rs1321311, rs10505477, rs719725, rs704017, rs12241008, rs11196172, rs174537, rs4246215, rs174550, rs1535, rs10849432, rs3217901, rs4444235, rs11632715, rs4939827, rs10411210, rs1800469, rs2241714, rs961253 and rs4813802); and 4 were not replicated in population studies, although they were associated with susceptibility for development of CRC in GWAS (rs1665650, rs59336, rs1957636 and rs12603526). The effect sizes of some of these associations were small (OR <1.20, P<0.05, for rs1321311, rs12241008, and rs704017) [38-40]. Additionally, it may be that the distribution of environmental factors in our population differs from that of the populations in which these genetic variants were discovered.
The SNP rs6687758 is in a regulatory region, flanking the promoter of DUSP10, at ~250 kb from the start of the gene. Hence, it is likely to affect the expression of this gene. Polymorphisms in the DUSP10 gene (dual specificity protein phosphatase 10) have previously been shown to be associated with CRC risk [41,42]. In this study, we confirmed this CRC susceptibility locus in the Basque population sample. Earlier analyses have found frequent dysregulation of dual specificity protein phosphatase 10 (DUSP10/MKP-5) in CRC [41]. DUSP10 belongs to the dual-specificity phosphatase (DUSP) family. These proteins are associated with cellular proliferation and differentiation, and they act as tumour suppressors [41,43].
Target kinases of DUSPs are inactivated by dephosphorylation of both phosphoserine/threonine and phosphotyrosine residues [41,42]. They act at several levels, taking part in fine-tuning signalling cascades. DUSPs negatively regulate members of the mitogen-activated protein kinase (MAPK) superfamily [41,44], which are implicated in some activities that are often dysregulated in cancer, such as cell proliferation, survival, and migration [41]. MAPK signalling also plays a key role in determining the response of tumour cells to cancer therapies, since its abnormal signalling has important consequences for the development and progression of human cancer [44].
Several studies have already shown the involvement of DUSPs as major modulators of critical signalling pathways dysregulated in different cancers [43], as in the case of the overexpression of DUSP1/MKP-1 in the early phases of cancer and its decrease during tumour progression [42].
There is abundant evidence that DUSP10, in particular, may play an important role in tumorigenesis and could alter CRC risk [45,46]. It inactivates p38 and JNK in vitro [41,47], and its upregulation is very common in CRC [48]. The activation of JNK protein is due to the protein kinase G (PKG)/MEKK1/SEK1/JNK cascade, and it is related to cell proliferation and the induction of apoptosis [41,49]. Moreover, p38 is involved in the promotion of cellular senescence as a means of eluding oncogene-induced transformation; it participates in cell cycle regulation, suppressing cell proliferation and tumorigenesis [41,49].
On the other hand, the results of the gene expression association analyses show a higher expression of the DUSP10 gene in CRC cases, but also a higher expression of this gene in colon tissue of healthy controls who have the GG genotype for rs6687758. Thus, it would be likely to find a relationship between higher expression of the gene and the presence of allele G in rs6687758 in tumour tissue. Nonetheless, it would be interesting to explore this aspect further through future analyses comparing gene expression between individuals carrying the risk variant and control individuals. Previous studies point in the same direction, reporting an overall increase in patients' relapse-free survival when DUSP10 expression is upregulated, and increased DUSP10 mRNA in the tumour compared with normal tissue adjacent to the tumours [46,49,50].
We found an association between smoking status and the rs6687758 SNP for CRC risk in cases. Other authors have also observed this association [51]. Benzo[a]pyrene, one of the carcinogenic compounds in cigarette smoke, up-regulated COX-2 in mouse cells [52], which in turn could either activate or be dependent on the MAPK pathway, suggesting a possible gene-smoking interaction [53,54]. As far as we know, there are no precedents in the literature for the association between physical activity, the rs6687758 SNP and CRC risk. However, other studies have found interactions between polymorphisms associated with growth hormone (GH1) and insulin-like growth factor I (IGF-I) (rs647161, rs2665802), physical activity and CRC [53,54]. According to our results, rs6687758, a medium-high physical activity level and CRC would be associated. However, this outcome, contrary to what might be expected, could be related to changes in lifestyle, including physical activity level, in cases after diagnosis [55].
We also analyzed unweighted and weighted GRS models. We observed that cases had more risk alleles than controls, a result in line with expectations from previous studies [56]. In the crude analysis, we observed that patients with a higher number of risk alleles had a higher risk of CRC. Other authors observed similar results using an adjusted unweighted model [32]. However, some other authors did not find this association [57]. It should be noted that common allele variants generally have modest effect sizes [58], but the combination of multiple loci with modest effects into a global GRS might improve the identification of patients with genetic risk for common complex diseases, such as cancer [59]. In this regard, Ortlepp et al. [60] concluded that more than 200 polymorphisms might be necessary for "reasonable" genetic discrimination.
Our study has several limitations and strengths. The principal limitation of this study was the small sample size, which makes it difficult to detect possible associations between polymorphisms and disease risk, since some genotypes showed very low frequencies in our population. Another disadvantage of a small sample size is that it can produce false-positive results; to avoid this, the Bonferroni correction was used. A strength of the study is that, although controls tested positive in the iFOBT, colonoscopy within the CRCSP confirmed that they were free of the disease. Colonoscopy was also used as the diagnostic criterion to identify cases, in order to avoid false positives and negatives.
In conclusion, most SNPs analyzed were not associated with risk of CRC. Only one of the 48 SNPs analyzed, rs6687758, was associated with risk of CRC in this population (on crude analysis). Moreover, there were significant associations between smoking status, physical activity, the rs6687758 SNP and CRC risk. On the other hand, the results of the GRS showed that the risk alleles were more frequent in cases than in controls and that this score was associated with this type of cancer in the crude analysis. Therefore, in this study, we have confirmed a CRC susceptibility locus and the existence of associations between modifiable factors such as smoking and physical activity and the presence of the risk genotype for rs6687758. However, further experimental validations are needed to establish the role of this SNP, the function of the gene identified, as well as the contribution of the interaction between environmental factors and this polymorphism to the risk of CRC.

Supporting information
S1 Table. Deviation from Hardy-Weinberg equilibrium and differences in allele frequencies and genotype distribution between cases and controls. A, adenine; C, cytosine; G, guanine; HWE, Hardy-Weinberg equilibrium; rs, reference single nucleotide polymorphism; SNP, single nucleotide polymorphism; T, thymine; a Valid percentages; b P<0.001 was significant; c Differences in allele frequencies and genotype distribution between cases and controls. (PDF)
S2 Table. Association between genetic variants of susceptibility and the risk of CRC in the codominant model. A, adenine; C, cytosine; CI, confidence interval; G, guanine; NA, no available data; OR, odds ratio; rs, reference single nucleotide polymorphism; SNP, single nucleotide polymorphism; T, thymine; a The most frequent genotype was considered the reference group; b Model I, crude conditional logistic regression model; c Model II, conditional logistic regression adjusted for: age, sex, BMI, physical activity, smoking status, alcohol consumption, Deprivation Index and energy intake. Participants with missing data for the confounding variables were included as a separate category for these variables; d P<0.001 was significant. (PDF)
S1 Fig. eQTL violin plot