Allelic interaction effects of DNA damage and repair genes on the predisposition to age-related cataract

Purpose Age-related cataract (ARC) is a leading cause of visual impairment and blindness worldwide. DNA damage and malfunction of DNA repair are believed to contribute to the pathogenesis of ARC. Aside from increasing age, the risk factors for ARC appear to be rather complex, and one or more gene variations could play critical roles in the diverse processes of ARC progression. This study aimed to investigate the combined effects of different genetic variants on ARC risk. Methods A cohort of 789 ARC patients and 531 normal controls from the Jiangsu Eye Study was included in this study. Genotyping of 18 single-nucleotide polymorphisms (SNPs) in 4 DNA damage/repair genes was performed using TaqMan SNP assays. SNP-SNP interactions were analyzed via multifactor dimensionality reduction (MDR), classification and regression tree (CART) and genetic risk score (GRS) analyses. Results Based on single-locus analyses of the 18 SNPs examined, WRN-rs11574311 (T>C) was associated with ARC risk. However, in MDR, the gene-gene interaction among the five SNPs (WRN-rs4733220 (G>A), WRN-rs1801195 (T>G), OGG1-rs2072668 (G>C) and OGG1-rs2304277 (A>G)) on ARC risk was significant (OR = 5.03, 95% CI: 3.54~7.13). CART analyses also revealed that the combination of five SNPs above was the best polymorphic signature for discriminating between the cases and the controls. The overall odds ratio for CART ranged from 4.56 to 7.90 showing an incremental risk for ARC. This result indicated that these critical SNPs participate in complex interactions. The GRS results showed an increased risk for ARC among individuals with the SNPs in this polymorphic signature. Conclusion The use of multifactorial analysis (or an integrated approach) rather than a single methodology could be an improved strategy for identifying complex gene interactions. The multifactorial approach used in this study has the potential to identify complex biological relationships among ARC-related genes and processes. This approach will lead to the discovery of novel biological information, ultimately improving ARC risk management.


Results
Based on single-locus analyses of the 18 SNPs examined, WRN-rs11574311 (T>C) was associated with ARC risk. However, in MDR, the gene-gene interaction among the five SNPs (WRN-rs4733220 (G>A), WRN-rs1801195 (T>G), OGG1-rs2072668 (G>C) and OGG1-rs2304277 (A>G)) on ARC risk was significant (OR = 5.03, 95% CI: 3.54~7.13). CART analyses also revealed that the combination of five SNPs above was the best polymorphic signature for discriminating between the cases and the controls. The overall odds ratio for CART ranged from 4.56 to 7.90 showing an incremental risk for ARC. This result indicated that these critical SNPs participate in complex interactions. The GRS results showed an increased risk for ARC among individuals with the SNPs in this polymorphic signature. PLOS

Introduction
Age-related cataract (ARC), a leading cause of visual impairment and blindness worldwide [1], is a growing global public health problem that affects approximately 37 million people and accounts for 48% of all cases of blindness [2,3]. According to the location of the opacity within the lens, ARC can be classified as cortical (C), nuclear (N), posterior subcapsular (PSC) or mixed (M) [4]. The development of ARC can be influenced by multiple factors, ranging from degenerative processes or personal characteristics to environmental and dietary factors. Age, gender, smoking, and exposure to sunlight are the documented risk factors for ARC, and several recent studies have identified numerous single nucleotide polymorphisms (SNPs) in genes such as OGG1, EPHA2, WRN and glutathione S-transferases (GSTs) that are associated with ARC [5][6][7][8].
In the past, the majority of studies have analyzed individual genes by directly testing the effects of one or several SNPs in a candidate gene on disease development. However, because of the weak marginal effects of these disease-associated SNPs, each individual SNP has limited power to predict the risk of ARC. More recently, to evaluate whether interactions and combined effects among multiple SNPs contribute to the susceptibility to ARC, researchers have turned to multifactorial analysis. The analysis of such interactions and combined effects in case-control studies is hampered by one major concern: dimensionality.
In the current study, we employed a combination of three methods-multifactor dimensionality reduction (MDR), classification and regression tree (CART) analysis and genetic risk score (GRS) calculation-to extend our previous work [9] on ARC susceptibility by jointly investigating 18 SNP genotypes in 4 genes. This analytical approach avoided the problems related to dimensionality and multiple comparisons.

Study design and participants
The participants in this case-control study, including both ARC patients and normal controls, were recruited from the Jiangsu Eye Study (JES), a population-based epidemiologic study. We identified and selected ARC patients as research subjects from a total of 2208 cataract patients from the JES. According to the inclusion and exclusion criteria, 1064 ARC patients (C = 335, N = 470, PSC = 42, M = 217) were included. Applying further exclusion criteria led to the removal of 163 participants, including ARC patients with systemic diseases such as diabetes, kidney disease, or cancer and those with macular diseases or other retinal diseases, as well as 67 patients of any ARC subtype with LOCSIII grade<2 for the worse eye. As a result, 834 ARC patients were eligible for this study. Of these 834 ARC patients, 45 patients lacked samples for DNA extraction and genotyping. Ultimately, we examined 789 ARC patients. The details of the design and procedures of this study have been described elsewhere [9].
This study was conducted according to the Declaration of Helsinki and was approved by the Ethics Committee of the Affiliated Hospital of Nantong University. Each participant was fully informed of the purpose and procedures of the study and signed an Informed Consent Form.

Statistical analysis
Continuous variables were presented as means±standard deviation (SD) and were evaluated using t tests. Categorized variables were presented using numbers and percentages and were evaluated using the χ 2 test or Fisher's exact test. A two-tailed P-value of less than 0.05 was considered to be statistically significant. All statistical analyses were performed using SPSS software version 18.0 (SPSS, Inc., Chicago, IL, USA). Gene-gene interactions among the 18 loci of the 4 genes were determined using MDR (MDR software, version 2.0 beta8). A CART model was conducted using R (version 3.0.1) (http://www.rproject.org/).

MDR
MDR is a non-parametric and genetic model-free method that uses constructive induction or attribute construction to identify the interactions among multi-locus genotypes. This method was initially introduced by Ritchie et al [10]. The benefit of MDR is the minimization of statistical issues such as invalidity of parameter estimates owing to the presence of few or no observations assigned to contingency table cells when testing for interactions. In MDR, a set of multilocus genotypes with n dimensions is reduced to a single dimension (i.e. constructive induction) of 1 variable with 2 possible values for the genotypes of 2 loci: high risk or low risk. The model is evaluated for its ability to correctly classify and predict disease (case vs. control) according to the testing balanced accuracy (TBA) statistic [11], the cross-validation consistency (CVC) and permutation testing. One thousand permutations were repeatedly conducted for each randomized dataset to determine the statistical significance of the best models and to identify false positives. A P-value of less than 0.05 for the MDR permutation results was considered to be statistically significant.

CART analysis
Decision trees date back to the early 1960s with the work of Morgan and Sonquist. Breiman and colleagues published the first comprehensive description of recursive partitioning methodology, a novel application of CART analysis to clinical and physiological data related to mood disorders, and this method therefore merits a more extensive description. CART is a powerful statistical method of data mining that can analyze data from different perspectives for summarization into useful and practical information in order to identify important correlations or patterns among dozens of variables in large relational databases. CART requires no distributional assumptions; however, CART models are highly unstable in response to small changes in the data, and this instability represents the major drawback of CART analysis. CART creates binary tree-shaped structures and classifies patients or predicts outcomes by selecting the most important genetic and environmental risk factors. The binary decision tree first considers all individuals pooled together in a heterogeneous root node. Before growing a tree, the measures for goodness of split are chosen using the criteria determined by the rpart package algorithm, which identifies splits that maximize the homogeneity of subnodes with respect to the value of the target variable. Each subnode can then be treated independently as a new root node, and the pattern continues until no further partitions can be performed, resulting in a very large and complex tree. In CART analysis, the primary variable used for splitting is examined together with other variables via a pruning procedure to avoid overfitting the model. Proportions closer to 0 or 1 are considered to reflect purer partitions. The binary tree structure shows the effects of interactions between variables using the optimal splits. Finally, the risk of various genotypes was evaluated using a special type of CART analysis: logistic regression analysis. However, for a large number of parameters, it is computationally infeasible to examine every possible combination of factors along with their interactions in a logistic regression. CART can precisely determine the combination of variables that maximizes predictive power.

GRS
One popular approach of incorporating identified genetic variants is the calculation of a GRS for modeling using a variety of approaches, such as additive simple count and weighted GRSs [12,13]. The applicability of these cumulative GRSs as predictive models for disease has been proposed and has shown anecdotal success in real genetic studies [14][15][16][17][18]. The GRS is defined as the cumulative number of risk-associated alleles in each individual. A value of 2, 1 or 0 was allotted to the homozygous variant, heterozygous and homozygous wild type genotypes, respectively, and the values for all 18 SNPs were then summed. We treated alleles with an odds ratio (OR) greater than or equal to 1 at each locus as a risk allele (the reference allele was considered as the risk allele if its OR was less than 1).

Population characteristics
The demographic and clinical characteristics of the study subjects are presented in Table 1. A total of 1329 participants from the JES, including 789 ARC cases (C = 257, N = 366, PSC = 34, M = 132) and 531 controls, were recruited for the current study. The average age was 69.7 years for the cases and 70.4 years for the controls (P = 0.053). The gender distribution of the two groups was matched (P = 0.057). The fasting blood glucose level was not significantly different between the two groups (P = 0.063).

Allelic distribution and associations of polymorphisms with ARC
Among the controls, the genotypes of all SNPs except for BLM rs8027126 that were considered in the current study were in accordance with HWE. Thus, we excluded this SNP from our study.
We found that only WRN-rs11574311 (T>C) was associated with a statistically significantly increased risk for developing ARC (OR = 1.49, 95%CI: 1.17-1.90, P = 0.003), although its significance was attenuated after Bonferroni correction (P = 0.054; Table 2). The differences regarding the CC genotype of WRN rs11574311 between ARC and control were statistically significant (p = 0.005) in the genotype analysis. And in dominant model, it also indicated its harmful roles in developing ARC (OR = 1.55, P<0.05; Table 3)

Gene-gene interactions
We compared the allele frequency between ARC cases and controls and then analyzed the distribution of allele frequencies stratified according to ARC subtype. Table 4 shows the best interaction model based on MDR analysis for all ARC cases and controls for the one-locus through five-locus models. The one-factor model for predicting ARC risk was WRN-rs11574311 (T>C) SNP (testing accuracy = 0.5378, CVC = 10/10, permutation test P<0.001). The two-factor model with a potential gene-gene interaction between OGG1-rs2072668 (G>C) and OGG1-rs2304277 (A>G) showed an improved testing accuracy of 0.6193 and an increased CVC (10/10) (permutation test P<0.001). A significant three-factor   permutation test P-values, compared with the single-factor models, the multifactorial model that included WRN-rs11574311 (T>C), WRN-rs4733220 (G>A), WRN-rs1801195 (T>G), OGG1-rs2072668 (G>C) and OGG1-rs2304277 (A>G) was regarded as the best fit model with the highest testing accuracy of 62.73%, the greatest CVC of 10/10, and a permutation test P-value<0.001. The results shown in Table 4 represent the associations of higher-order interactions with the risk of the C, N and M subtypes of ARC.

CART analysis
The final tree structure, which contained nine terminal nodes, was generated via CART analysis for identification of ARC-related factors, considering all investigated genetic variants of the selected pathways (Table 5). CART analysis showed that patients with higher and lower risks of ARC could be identified based on specific genotype combinations. Consistent with the best one-factor MDR model, the initial factor splitting the root node on the decision tree was WRN-rs11574311; this result suggests that this SNP is the strongest risk factor for ARC among the polymorphisms examined.  Tables 6, 7 and 8 represent the associations of higher-order interactions with the C, N and M subtypes of ARC risk, respectively. These results were consistent with those obtained from MDR analyses. We also used logistic regression (LR) to detect SNP-SNP interactions from both MDR and CART and found that a p for interaction was 0.0627 using 5 SNPs in MDR model and was 0.9142 using 6 SNPs in CART model. Table 9 shows the additive effects of multiple SNPs. For each individual, we counted the number of alleles associated with increased risk for ARC. The total GRS ranged from 3 to 26 for all 1320 participants, with a median of 11 among control subjects and 12 among cases. The mean (±SD) total GRS was 12.08±4.07 in ARC patients and 11.46±3.77 in controls (P = 0.005). The patients with the C and N subtypes of ARC had higher total GRSs than the controls. We categorized the participants into 13 groups and considered participants with a GRS of 3-4 as the reference group. Compared with the reference group, the group of participants with a GRS<19 showed a non-significant difference in ARC risk. However, the significant OR (2.67, 95%CI: 1.08-6.66, P = 0.034) for the group of participants with a GRS>19 (19)(20)(21)(22)(23)(24)(25)(26) compared with the reference group suggested that the former group had an increased risk of ARC. This result corresponded to a several-fold difference in ARC risk between those carrying lowest number of risk-associated alleles and those carrying the greatest number of risk-associated alleles in our population.

Discussion
To more comprehensively evaluate ARC risk, the present analysis examined sets of sequence variants associated with high and low intrinsic risk of ARC. Of the 18 examined polymorphisms, several were found to be significantly associated with ARC risk in our previous study [9], while others showed little or no influence on the risk for ARC development. We took WRN-rs11574311 data as a reference for power analysis. Based on a pre-defined two-sided alpha of 0.05, our sample sets has greater than 85% power to detect a OR of 1.50. Moreover, accumulating evidence supports the importance of oxidative stress to ARC development, as oxidative stress induces various types of DNA damage in the lens, thus causing cataract [19][20][21][22][23]. Therefore, we further extended our work by incorporating factors potentially related to cataract pathogenesis. The genes selected in this study that encode for DNA repair enzymes play a vital role in the DSER, NER and BER pathways. To our knowledge, this is the first study to examine both main and epistatic effects of four candidate genes, WRN, BLM, OGG1 and ERCC6, on the risk of ARC.
In the single-locus analysis, the WRN-rs11574311 (T>C) SNP was associated with ARC risk [9]. Furthermore, in the current study, we applied a multifactorial analysis strategy combining the MDR, CART and GRS approaches to systematically identify particular combinations of genetic variants that contribute to ARC risk. Therefore, a promising finding reported for the first time in this study was that the WRN, BLM, OGG1 and ERCC6 genes may play an important role in independently modulating the etiology of ARC in an interactive manner.
We used 10-fold cross-validation method to compare MDR and CART models and to evaluate if there is an over-fitting issue. In the 5-factor and their two-way interaction items model that was derived from MDR, the AUC from training data analysis was 0.6130, which was slightly higher than that from testing data analysis (AUC = 0.5566). The difference of these two AUCs was 0.0563. However, in the 7-factor and their two-way interaction items selected from Cart model the AUC from training data analysis 0.5703, which was slightly higher than that derived from testing AUC (0.5488, difference = 0.0215). AUC from model established by MDR is slightly high, compared with AUC from the model established by CART, which seems predictive performance of two models are similar in current study. There are smaller difference of AUCs between training data and testing data analysis in two models respectively, which implied that there is no serious over-fitting issue in both models.
The completed model increases with the order of interactions. Peduzzi P. et al. performed a Monte Carlo study and found that LR can detect only low-order interactions. This limitation of LR is referred to as the curse of dimensionality [24] and also consistent with what we found in this study. A fully saturated model with numerous terms may be prone to unstable and biased estimates due to sparse data and multicollinearity. In this condition, large sample theory underlying the test statistic may be violated. In some SNP-SNP interactions, Briollais et al. found that the permutation distribution of the likelihood ratio test did not closely match a chi square distribution, which justifies the use of a distribution-free test statistics [25]. CART and MDR do not require or assume any specific parametric or distributions for the relationship between predictors and outcomes. Therefore, they could uncover SNP-SNP interactions that are missed by LR. These model-free methods are better in dealing with sparse and high-dimension data and can account for non-linear SNP-SNP interactions. However, when the study design is suboptimal, such as relative small sample size or minor allele frequencies, these model-free methods have a high chance of detecting false associations [26].
Besides these, we also constructed and assessed a GRS from the number of risk alleles for risk assessment of ARC. We found individuals with ARC have a high GRS compared to normal individuals with the low GRS. Further calculating the GRS individually in the type of ARC, our finding was that all three types conferred increase risk.
Our multifactorial analytic approach revealed that the combination of the WRN-rs11574311 (T>C), WRN-rs4733220 (G>A), WRN-rs1801195 (T>G), OGG1-rs2304277 (A>G) and OGG1-rs2072668 (G>C) SNPs may predicts a significantly increased risk for developing ARC. Furthermore, certain loci in the WRN, OGG1, ERCC6 and BLM genes were associated with the C, N and M subtypes of ARC. Our number of included PSC cases remained small, even though we tried our best to increase the sample size, because the number of cases in this populationbased study was fixed at the time point of the survey. The influence of genetic polymorphisms on the function of an enzyme may lead to different subtypes of cataract. Given the absence of cell nuclei in the lens nuclei, the functional effects of these SNPs might originate from lens epithelial cells. Aberrant metabolism of lens epithelial cells can easily cause dysfunction in the lens fibers.
Expression of the BLM gene is thought to increase in the S and G 2 phases of the cell cycle as a result of crossing over during homologous recombination-mediated DNA repair events [27]. BLM assures genomic integrity through faithful chromosome segregation. Mutations in BLM deleting or altering its helicase motifs and disabling its 3'-5' helicase activity may induce Bloom syndrome. Furthermore, several studies reported that BLM influences the selection of the pathway for the repair of double-strand breaks in human chromosomes [28] and that polymorphisms in BLM are associated with colorectal cancer [29] and breast cancer [30]. Although ARC pathogenesis is different from that of cancers, these diseases might share pathways related to aging and genome instability. Our results show the association of specific gene-gene interactions with subtypes of ARC as well as overall ARC.
The WRN gene is responsible for maintaining the genome and serves as an important link between repair of defective DNA and processes related to aging [31]. Previous studies have reported associations between WRN polymorphisms and age-related diseases such as myocardial infarction [32] and type 2 diabetes mellitus [33]. Recently, an Israeli study found that the WRN C1367T (rs1346044) polymorphism is not linked to cataract among the elderly [21]. However, our results showed that WRN-rs11574311 is associated with the C and M subtypes of ARC and that WRN-rs2725338 is associated with the M subtype of ARC based on either single-factor or multifactorial analysis. Additionally, rs11574311 showed strong linkage disequilibrium with rs1346044. This inconsistency in results between studies may be due to the genetic heterogeneity between populations and the limited sample size in the Israeli study.
Mutations in ERCC6 gene may lead to Cockayne syndrome, which often presents as severe cataract [34] and AMD [35,36]. ARC and AMD, both of which are age-related eye diseases, may be caused by long-term UV radiation, oxidative damage, aging and a similar set of genetic factors. In our study, we found gene-gene interaction effect of ERCC6 on the risk for the C subtype of ARC. OGG1 is responsible for the removal of 8-oxoguanine, which is produced via the incorporation of 8-oxo-dGTP from the oxidation of dGTP by ROS during DNA replication in the BER pathway. We selected two new common SNPs in OGG1 and found gene-gene interaction effects of these SNPs on not only subtypes of ARC but also overall ARC.
Compared to the results of single-locus analyses, the overall analysis of all 18 selected polymorphisms did not diminish either overall ARC or the various subtypes of ARC. Thus, we concluded that the application of this multifactorial analytical approach was more sensitive and accurate than single-factor approaches and showed reasonable power for identifying genes for disease risk prediction.
Moreover, a prominent, significant role of oxidative stress processes was elucidated based on the GCSs of selected pathways independently or in combination. Therefore, exhaustive multi-factorial analyses using approaches such as MDR, CART and GRS are well recognized methods in understanding complex traits, such as disease susceptibility, and the etiology of complex diseases.
In summary, the results from this comprehensive study using multi-factorial genetic analysis to determine risk factors for ARC development suggest that individuals with more genetic variations in oxidative stress pathway genes may elevate the risk for ARC. This finding confirms the importance of applying a multigenic pathway-based approach to disease risk assessment. This finding also indicates that the development of ARC involves complex genetic interactions and proceeds via different pathways depending on the specific genetic background of the individual. The present study provides evidence supporting the contribution of oxidative stress pathway genes, most importantly the interactions between the WRN, BLM and OGG1 genes, to the risk for ARC.
Thus, our results support the concept that genetic polymorphisms can be used as predictors of ARC risk and that combined analysis of multiple polymorphisms may enables more delineation of risk groups. Thus, our results suggest the future direction of association studies. These results must be replicated in other ethnic groups, as our study included only Chinese individuals.
Supporting information S1