Polymorphisms in DNA-Repair Genes in a Cohort of Prostate Cancer Patients from Different Areas in Spain: Heterogeneity between Populations as a Confounding Factor in Association Studies

Background Differences in the distribution of genotypes between individuals of the same ethnicity are an important confounding factor that is commonly undervalued in typical association studies conducted in radiogenomics.
Objective To evaluate the genotypic distribution of SNPs in a wide set of Spanish prostate cancer patients in order to determine the homogeneity of the population and to disclose potential bias.
Design, Setting, and Participants A total of 601 prostate cancer patients from Andalusia, the Basque Country, the Canary Islands and Catalonia were genotyped for 10 SNPs located in 6 different genes associated with DNA repair: XRCC1 (rs25487, rs25489, rs1799782), ERCC2 (rs13181), ERCC1 (rs11615), LIG4 (rs1805388, rs1805386), ATM (rs17503908, rs1800057) and P53 (rs1042522). SNP genotyping was performed on a Biotrove OpenArray® NT Cycler.
Outcome Measurements and Statistical Analysis Comparisons of genotypic and allelic frequencies among populations, as well as haplotype analyses, were performed using the web-based environment SNPator. Principal component analysis was performed using the SnpMatrix and XSnpMatrix classes and methods implemented as an R package. Non-supervised hierarchical clustering of SNPs was performed using MultiExperiment Viewer.
Results and Limitations We observed that the genotype distribution of 4 out of 10 SNPs was statistically different among the studied populations, with the greatest differences between Andalusia and Catalonia. These observations were confirmed by cluster analysis, by principal component analysis and by the differential distribution of haplotypes among the populations. Because tumor characteristics were not taken into account, it is possible that some polymorphisms influence tumor characteristics in the same way that they may pose a risk factor for other disease characteristics.
Conclusion Differences in the distribution of genotypes within different populations of the same ethnicity could be an important confounding factor responsible for the lack of validation of SNPs associated with radiation-induced toxicity, especially when extensive meta-analyses with subjects from different countries are carried out.


Introduction
Genetic polymorphisms are variants of the genome that arise by mutation in some individuals, are transmitted to offspring and reach some frequency (at least 1%) in the population after many generations. Polymorphisms are the basis of evolution, and those that become consolidated may be silent or provide benefits to individuals, but can also be involved in disease development [1]. The most frequent polymorphisms are single nucleotide polymorphisms (SNPs). The ethnic origin of a population determines its distribution of genotypes, which need not match that of other populations. Moreover, differences observed within populations of the same ethnic origin suggest that race is not a sufficient factor to ensure the homogeneity of the sample. In that sense, several significant axes of stratification are known to exist in the genotype distribution of the European population, most prominently along a north-southeast trend, but also along an east-west axis [2]. In the case of Spain, although the populations inhabiting the Iberian Peninsula show substantial genetic homogeneity [3], there are findings suggesting that Northwest African influences exist among the Spanish populations and that these differences might increase the risk of false positives in genetic epidemiology studies [4].
Radiation therapy (RT) is an effective treatment offered to patients with localized prostate cancer as a viable alternative to surgery [5]. Although both therapies show comparable results in terms of survival [6], the main differences between them relate to adverse effects. Tumour control by RT requires the use of the maximum dose that can be delivered while maintaining a tolerable risk of normal tissue toxicity, with clinical toxicity being the factor that limits the efficacy of the treatment [7]. The role of genetics in the response of normal tissues to RT is widely accepted by the scientific community, and it would help to explain why patients treated with RT experience a large variation in normal tissue toxicity, even when similar doses and schedules are administered [8]. Radiation causes the loss of structure and function of most biological molecules, including DNA. DNA repair involves several mechanisms (nucleotide and base excision repair, homologous recombination, non-homologous end-joining, mismatch repair and telomere metabolism), and the individual capacity to repair damaged DNA may modify the response of tumour tissue and normal tissue to radiation [9]. Thus, studies of candidate genes have focused mainly on genes involved in DNA damage recognition and repair (e.g., ATM, XRCC1, XPD, ERCC1, LIG4, and TP53, among others), and also on genes involved in free radical scavenging (e.g., SOD2) or the anti-inflammatory response (e.g., TGFB1).
The association between SNPs and radiation toxicity has been extensively explored [10], and numerous consortia have been formed to identify common genetic variations associated with the development of radiation toxicity [11]. Although promising, the overall results failed at the validation stage [12], and today the development of a SNP signature for the prediction of toxicity is still far away. Although this lack of association could be explained by different reasons (presence of confounding factors, insufficient sample size, and lack of consensus in the methodology used in terms of genotyping, statistics, and even the grading of radiation toxicity) [13], the heterogeneity of the studied populations is a factor whose effect has been commonly underestimated.
With these considerations in mind, we designed a study to evaluate the genotypic distribution of 10 SNPs in 6 different genes involved in DNA repair and classically associated with radiation-induced toxicity, in a wide set of Spanish prostate cancer patients, in order to determine the homogeneity of the population and to disclose potential undervalued confounders in the association between SNPs and radiation toxicity.

Patients
A total of 601 patients with non-metastatic localized prostate cancer (PCa) were included in the study. Geographical distribution of patients was as follows (

DNA Isolation and Quantification
All the blood samples were sent to the Hospital Universitario de Gran Canaria Dr. Negrín for DNA extraction and subsequent analyses. DNA was isolated from 300 µl of whole blood in an iPrep purification system (Applied Biosystems, Foster City, CA) using the iPrep™ PureLink™ gDNA Blood Kit (Applied Biosystems). DNA was quantified and the quality of samples was determined in a NanoDrop 2000 (Thermo Scientific, Wilmington, DE).

Genotyping
SNP genotyping was performed on a Biotrove OpenArray® NT Cycler (Applied Biosystems). DNA for OpenArray (OA) was diluted to a concentration of 50 ng/µl, with A260/A280 and A260/A230 ratios of 1.7-1.9. A total of 300 ng of genomic DNA was used. A final amount of 150 ng was incorporated into the array with the autoloader and genotyped according to the manufacturer's recommendations. A non-template control (NTC) consisting of DNase-free double-distilled water was included in each assay. Once the DNA and master mix had been transferred, the loaded OA plate was filled with an immersion fluid and sealed with glue. The multiplex TaqMan assay reactions were carried out in a Dual Flat Block (384-well) GeneAmp PCR System 9700 (Applied Biosystems) with the following PCR cycle: an initial step at 93°C for 10 minutes, followed by 50 cycles of 45 seconds at 95°C, 13 seconds at 94°C and 2 minutes 14 seconds at 53°C, followed by a final step of 2 minutes at 25°C and a hold at 4°C.
The fluorescence results were read using the OpenArray® SNP Genotyping Analysis software version 1. site/us/en/home/Global/forms/taqman-genotyper-softwaredownload-reg.html) using autocalling as the call method. Genotype calls were accepted when the quality value of the data point was above a threshold of 0.95. Genotyping analysis was done for each population separately (Figure 1).

Statistical Analysis
Genotype and allelic frequencies were determined using the web-based environment SNPator (SNP Analysis To Results, from Spain's National Genotyping Center and the National Institute for Bioinformatics) [21]. Relative excess heterozygosity was determined to check the compatibility of genotype frequencies with Hardy-Weinberg equilibrium (HWE). Thus, p-values from the standard exact HWE lack-of-fit test were calculated using SNPator. Comparisons of genotypic and allelic frequencies among populations, as well as haplotype analyses, were also done in SNPator.
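As an illustration of the Hardy-Weinberg check described above, the following Python sketch computes the relative excess heterozygosity and a goodness-of-fit p-value for a single SNP. It is not the SNPator implementation: the exact test used in the study is replaced here by the simpler 1-df χ² approximation, and the function name `hwe_summary` and the genotype counts are hypothetical.

```python
import math


def hwe_summary(n_aa, n_ab, n_bb):
    """Summarise the fit of one SNP's genotype counts to Hardy-Weinberg equilibrium.

    Assumption: a 1-df chi-square goodness-of-fit test stands in for the
    exact test mentioned in the text. Counts are homozygous A, heterozygous,
    homozygous B.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    if p == 0.0 or q == 0.0:
        raise ValueError("monomorphic SNP: HWE cannot be assessed")

    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # Positive values mean more heterozygotes than HWE predicts.
    rel_excess_het = (n_ab - expected[1]) / expected[1]

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return {"rel_excess_het": rel_excess_het, "chi2": chi2, "p_value": p_value}


# Hypothetical counts with a clear excess of heterozygotes.
print(hwe_summary(10, 80, 10))
```

With small samples or rare alleles the χ² approximation is unreliable, which is one reason an exact test is usually preferred in practice.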
Principal component analysis (PCA) was performed using the SnpMatrix and XSnpMatrix classes and methods [22], implemented as an R package and available from Bioconductor (as of version 2.11; http://bioconductor.org). PCA transforms the set of original variables into another set of variables, the principal components, each obtained as a linear combination of the originals. The full set of new variables retains all the information, but most of the principal components have so little variability that they can be ignored. Thus, a few components (generally 3 or fewer) can reasonably represent the objects in the sample with little loss of information. PCA reduces the complexity of the data and permits the graphical representation of the variables.
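To make the idea of a principal component concrete, the sketch below extracts the first component of a small genotype matrix and reports the share of variance it explains. It is only an illustration under stated assumptions: it uses plain power iteration on the covariance matrix rather than the Bioconductor classes used in the study, and the function name `first_principal_component` and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=500):
    """Return (loadings, explained_fraction, scores) for the first component.

    rows: one list of numeric genotype codes per subject (columns = SNPs).
    Assumption: power iteration from a uniform start vector, which fails if
    that vector happens to be orthogonal to the leading eigenvector.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]

    # Sample covariance matrix of the SNP columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]

    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance along the start direction")
        v = [x / norm for x in w]

    # Rayleigh quotient gives the eigenvalue (variance along the component).
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m))
                     for i in range(m))
    total_variance = sum(cov[i][i] for i in range(m))
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, eigenvalue / total_variance, scores


# Hypothetical data: two perfectly correlated SNPs and one constant SNP.
toy = [[-1, -1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
loadings, explained, scores = first_principal_component(toy)
print(loadings, explained, scores)
```

Plotting the per-subject scores, coloured by population, is the kind of display that lets a single component be inspected for separation between groups.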
Non-supervised hierarchical clustering [23] of SNPs in each population was performed using MultiExperiment Viewer (available at: www.tigr.org). Clustering used Euclidean distance and average linkage. To perform the clustering, wild-type homozygous genotypes were encoded as -1, heterozygous as 0 and mutated homozygous as 1.
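The encoding and clustering steps described above can be sketched as follows. This is a minimal illustration, not the MultiExperiment Viewer procedure: the genotype labels, the SNP names and the function names (`encode`, `euclidean`, `average_linkage`) are all hypothetical, and the agglomeration is a naive average-linkage loop suitable only for small inputs.

```python
import math
from itertools import combinations

# Numeric coding stated in the text: -1, 0, 1.
ENCODING = {"wild_homozygous": -1, "heterozygous": 0, "mutated_homozygous": 1}


def encode(calls):
    """Map genotype labels to their numeric codes."""
    return [ENCODING[c] for c in calls]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping an item name to its numeric vector.
    Returns the list of merges as (cluster_a, cluster_b, distance).
    """
    def dist(a, b):
        # Average linkage: mean of all pairwise distances between members.
        total = sum(euclidean(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical genotype profiles of three SNPs across four subjects.
profiles = {
    "snp_a": encode(["wild_homozygous", "wild_homozygous", "heterozygous", "mutated_homozygous"]),
    "snp_b": encode(["wild_homozygous", "wild_homozygous", "heterozygous", "mutated_homozygous"]),
    "snp_c": encode(["mutated_homozygous", "mutated_homozygous", "heterozygous", "wild_homozygous"]),
}
for merge in average_linkage(profiles):
    print(merge)
```

A symmetric coding around 0 places the heterozygote midway between the two homozygotes, so Euclidean distance grows with the number of allele differences.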
All additional statistical analyses were performed using PASW Statistics 15 (IBM Corporation, Armonk, NY, USA).

Results
All the genotyped samples met quality criteria as stated above, and all samples were genotyped with the same batch of material at The genotypic and allelic frequencies are shown in Table 3. A relative excess of heterozygosity, indicating a deviation from HWE, was observed in 4 SNPs from 2 different populations: rs25487 (XRCC1) in subjects from Catalonia, and rs13181 (ERCC2), rs11615 (ERCC1) and rs1800057 (ATM) in subjects from Andalusia (Table 3). The genotype distribution differed between the study populations in 4 of the 10 SNPs: rs25487, rs13181, rs11615, and rs1805386 (LIG4) (χ² test, Table 3). Non-supervised hierarchical clustering was performed to visualize the differences in the genotype distributions among the four populations. As shown in Figure 2, polymorphisms were distributed into two main clusters, each with a different number and identity of SNPs, suggesting heterogeneity among populations. Moreover, the web-based tool SNPator was used to compare the populations pairwise. Differences in genotypic distributions were mainly present between Andalusia and the other populations (χ² test, Table 4). Accordingly, the populations from Catalonia and Andalusia showed the greatest differences, with 3 SNPs (rs25487, rs13181 and rs11615) differentially distributed between the PCa patients from these two populations.
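The between-population comparisons reported above rest on a χ² test of a populations-by-genotypes contingency table. The sketch below shows that calculation for one SNP under stated assumptions: the function name `chi2_genotype_test` and the counts are hypothetical, no genotype column is empty, and the p-value uses the closed form that holds only for an even number of degrees of freedom (always the case with three genotype columns).

```python
import math


def chi2_genotype_test(table):
    """Pearson chi-square test of homogeneity for genotype counts.

    table: one row per population, one column per genotype.
    Returns (chi2, degrees_of_freedom, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for row, rt in zip(table, row_totals):
        for observed, ct in zip(row, col_totals):
            expected = rt * ct / n
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(col_totals) - 1)
    if df % 2:
        raise ValueError("closed-form p-value implemented for even df only")

    # For df = 2k: P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = chi2 / 2.0
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i)
                                    for i in range(df // 2))
    return chi2, df, p_value


# Hypothetical genotype counts for one SNP in two populations.
print(chi2_genotype_test([[10, 20, 30], [30, 20, 10]]))
```

A pairwise comparison of two populations gives 2 degrees of freedom, while comparing all four populations at once gives 6.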
Principal component analysis (PCA) was performed to identify global differences among populations. Components 1 and 2 accounted for 15.3% and 14.3% of the variance, respectively. When both components were plotted together, they did not appear to discriminate between populations (Figure 3A). However, when the components were analyzed separately, the first one could distinguish between the populations of Andalusia and Catalonia (Figure 3B), corroborating the results in Table 4 and showing the differences in the distribution of genotypes between the analyzed populations.
Finally, haplotype analysis was performed in SNPator. As shown in Table 5, the three most frequent haplotypes differed among populations. For SNPs on chromosome 11 (those located in the ATM gene), the haplotype GG was absent in the Andalusian population. For SNPs on chromosome 13 (those located in the LIG4 gene), haplotypes GG and AA showed a different distribution among the populations. In the case of SNPs on chromosome 19 (those located in the XRCC1, ERCC2 and ERCC1 genes), haplotype CCGGG was present only in PCa patients from the Canary Islands and Catalonia, while haplotype CCGTG was present only in PCa patients from Andalusia and the Basque Country. The fact that the most frequent haplotypes were the same in all populations suggests a similarity between individuals of the same ethnicity.

Discussion
Radiogenomics is the study of genetic variants, primarily single nucleotide polymorphisms (SNPs), associated with the development of radiotherapy toxicity, in an attempt to find an assay capable of predicting which cancer patients are most likely to develop adverse effects after RT [9]. The prediction of normal tissue toxicity would allow radiation doses to be adjusted individually for each patient, which is especially relevant because higher radiation dose levels are associated with improved biochemical control outcomes and a reduction in distant metastases in PCa patients [24]. The role of genetics in radiation toxicity has been demonstrated [25]. In that sense, genetics seems to help explain the high interindividual variability observed between cases, even when patients are similar and receive the same treatment schedule [26]. However, although a large body of literature has reported a predictive role for some SNPs in normal tissue toxicity, the validation studies have failed, calling into question the utility of SNPs as a tool for predicting radiation-induced toxicity [12]. Population association between genotype at a particular locus and a binary trait (such as presence/absence of radiation-induced toxicity) can arise in three ways [27]: i) the locus may be causally related to the disease (different alleles carrying different risks), ii) the locus may not itself be causal (but may be sufficiently close to a causal locus as to be in linkage disequilibrium with it), or iii) the association may be due to confounding by population stratification or admixture. Confounding may act to create a population association in the absence of a causal link, or to obscure a causal relationship. Thus, it is important to exclude spurious association by appropriate design and/or analysis of studies, taking into account that biases resulting from systematic error (such as selection biases or biases in measuring outcomes) persist as the sample size increases.
Confounding would arise if the population contained several ethnic groups, if allele frequencies at the locus of interest differed between groups, and if disease frequency also differed between groups for reasons quite unrelated to the locus of interest. It is known that ethnicity influences the applicability of pharmacogenetics [28].
The Canary population, like the other populations included in this study, is considered Caucasian. However, the histories of, for example, the Canary Islands and the Basque Country differ. While the Canary population has been influenced by Northwest African migration and European colonisation [29], Basques have a different origin [30]. Nevertheless, a recently published study in which 30 individuals from 10 different Spanish populations (the Canary population was not included) were genotyped for 120 SNPs concluded that the studied populations were genotypically similar [3]. None of the SNPs considered in the present study were included in that article. We found that the genotype distribution of 4 SNPs differed among the populations from Andalusia, the Basque Country, the Canary Islands and Catalonia. We compared our findings with the largest cohort of PCa patients analyzed in Spain [31]. A total of 698 Galician PCa patients were screened for 14 SNPs located in the ATM, ERCC2, LIG4, MLH1 and XRCC3 genes. Three of these SNPs were included in our multicenter study: rs1805388 (LIG4), rs1805386 (LIG4) and rs1800057 (ATM). The genotypic distributions of rs1805388 and rs1805386 were significantly different between the Galician population and the populations included in the present study (χ² test, p = 0.001 and p = 0.007, respectively), highlighting that the variability between populations of the same ethnicity (Caucasian) and from the same country depends on the SNP. According to our results, Andalusia was the population whose genotype distribution differed most, showing the greatest disparity with Catalonia (as observed in the χ² analyses and PCA). Differences among populations were also evident in the haplotype analysis. These results suggest that each SNP needs to be considered individually, looking for possible confounding variables that would be crucial for the interpretation of results.
In case-control studies, which are the usual design for discovering associations between SNPs and radiation toxicity, the fundamental assumption is that the two series of subjects (controls and cases) may be used to provide unbiased estimates of the corresponding distributions among affected and unaffected members of some underlying population [27]. This fundamental assumption may not be met in practice, leading to biased findings that fall into two broad classes: selection bias caused by inappropriate sampling of cases and controls, and information bias caused by differential measurement errors in cases and controls. When a confounding variable is detected in the study, the classical method in epidemiology is to stratify the analysis by the potentially confounding variable and to test for association between the factors of interest (i.e. genotype) and disease within strata (i.e. grades of radiation-induced toxicity). Concern over the presence of bias from population stratification in genetic case-control studies should be alleviated by proper design and analysis of case-control studies, evaluation of the likelihood of major bias in a given study [32] and, if needed, methods for correction [33].
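The stratified analysis described above can be illustrated with a small numerical sketch. It assumes one particular stratified estimator, the Mantel-Haenszel pooled odds ratio, which is not named in the text; the function names and all counts are hypothetical. Each stratum is a 2x2 table given as (carrier cases, carrier controls, non-carrier cases, non-carrier controls).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a single 2x2 table."""
    return (a * d) / (b * c)


def crude_odds_ratio(strata):
    """Odds ratio after pooling all strata into one table (ignores the confounder)."""
    a, b, c, d = (sum(values) for values in zip(*strata))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio, adjusted for the stratifying variable."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical data: two populations that differ in both allele frequency
# and toxicity frequency, with no genotype effect inside either population.
strata = [
    (40, 40, 10, 10),   # population 1: odds ratio 1 within the stratum
    (5, 20, 20, 80),    # population 2: odds ratio 1 within the stratum
]
print(crude_odds_ratio(strata))     # pooled estimate, distorted by stratification
print(mantel_haenszel_or(strata))   # stratum-adjusted estimate
```

In this constructed example the pooled table suggests an association that is absent within each population, which is the spurious effect that stratification is meant to expose.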
The present study has some limitations that should be noted. First, all subjects were PCa patients, and the genotype frequencies may differ from those of a population of healthy subjects. However, in studies designed to evaluate possible associations between SNPs and radiation toxicity, controls are patients with no or low-grade toxicity and cases are patients with high-grade toxicity, but all subjects are cancer patients. Thus, this limitation could be considered an advantage because it mimics the standard design of such studies. Second, the number of subjects from the different populations varies widely. However, the fact that the main differences were not found in the population with the smallest number of patients (Basque Country, with 51 PCa patients) suggests that this limitation may not be decisive in the interpretation of results. Moreover, if heterogeneity among populations is considered a systematic bias, it is independent of sample size. Third, to blind the analysis, no clinical data on patients were available; that is, there are no data on TNM staging, tumor grade, biochemical failure, or Gleason score. In that sense, it is possible that some polymorphisms influence tumor characteristics in the same way that they may pose a risk factor for other disease characteristics [34,35]. On the other hand, some advantages should be highlighted: i) the study includes a number of subjects sufficient to provide reliable data on the distribution of these 10 SNPs in the PCa populations studied (especially for the Canary Islands and Catalonia); ii) all subjects were male, thus avoiding possible bias due to gender; and iii) all the determinations (6,010 in total) were performed with the same methodology (OpenArray, Applied Biosystems), with the same batch of chips and by the same investigator, thus minimizing biases of technical origin.

Conclusions
Differences in the distribution of genotypes within different populations of the same ethnicity could be an important confounding factor responsible for the lack of validation of those SNPs associated with radiation-induced toxicity, especially when extensive meta-analyses with subjects from different countries are carried out [36]. Our results suggest that the genetic homogeneity of the subjects (especially of those considered as controls) should be checked before proceeding with any further analysis.