Genetic Polymorphisms of TGFB1, TGFBR1, SNAI1 and TWIST1 Are Associated with Endometrial Cancer Susceptibility in Chinese Han Women

Endometrial cancer (EC) is a complex disease involving multiple gene-gene and gene–environment interactions. TGF-β signaling plays pivotal roles in EC development. This study aimed to investigate whether the genetic polymorphisms of TGF-β signaling related genes TGFB1, TGFBR1, SNAI1 and TWIST1 contribute to EC susceptibility. Using the TaqMan Genotyping Assay, 19 tagging-SNPs of these four genes were genotyped in 516 EC cases and 707 controls among Chinese Han women. Logistic regression (LR) showed that the genetic variants of TGFB1 rs1800469, TGFBR1 rs6478974 and rs10733710, TWIST1 rs4721745 were associated with decreased EC risk, and these four loci showed a dose-dependent effect (Ptrend < 0.0001). Classification and regression tree (CART) demonstrated that women carrying both the genotypes of TGFBR1 rs6478974 TT and rs10512263 TC/CC had the highest risk of EC (aOR = 7.86, 95% CI = 3.42–18.07, P<0.0001). Multifactor dimensionality reduction (MDR) revealed that TGFB1 rs1800469 plus TGFBR1 rs6478974 was the best interactional model to detect EC risk. LR, CART and MDR all revealed that TGFBR1 rs6478974 was the most important protective locus for EC. In haplotype association study, TGFBR1 haplotype CACGA carrier showed the lowest EC risk among women with longer menarche-first full term pregnancy intervals (˃11 years) and BMI˂24 (aOR = 0.39, 95% CI = 0.17–0.90, P = 0.0275). These results suggest that polymorphisms in TGFB1, TGFBR1, SNAI1 and TWIST1 may modulate EC susceptibility, both separately and corporately.

Introduction 1999 and 2011. Patients with history of cancer, metastasized cancer from other organs, and radiotherapy or chemotherapy history were excluded from our study. The epidemiological information including age, body mass index (BMI), age at menarche/menopause/primiparity, smoking history and family history of cancer in the first-degree relatives was collected. The eligible 707 controls were randomly selected from women who participated in a communitybased screening program for non-infectious diseases conducted in Beijing between 2011 and 2012. The selection criteria included no history of cancer, Chinese Han ethnic background and frequency-matched to the cases by 5 year-age. All controls provided the same epidemiological information as that we collected from the cases. The characteristics of the 707 controls and the 516 cases are summarized in S1 Table. This study was approved by the Ethics Committee of Peking University Health Science Center.

DNA isolation and genotyping assay
Genomic DNA for controls was isolated from peripheral blood leukocytes, whereas cases' genomic DNA was extracted from formalin-fixed paraffin-embedded normal fallopian tube tissues. Genotyping was conducted with the ABI 7900HT 1 Real-Time PCR System (Applied Biosystems, Foster City, California) using TaqMan 1 Assay in compliance with the manufacturer's instructions. Primers and probes (FAM-and VIC-labeled) were supplied by ABI incorporation and the PCR reaction system was the same as described previously [16]. Briefly, all assays were carried out in 384-well plates with negative and positive controls. Plates were sealed and heated at 95°C for 5min, then subjected to 45-50 cycles of 92°C for 15s and 60°C for 1min. Data from plates failing in more than 15% samples were excluded from the analysis. At least 1% of the samples were duplicated randomly in each SNP genotyping assay, and the concordance between duplicates was more than 99%.

Statistical analysis
Differences in the distribution of demographic characteristics and selected variables between controls and cases were calculated by two-sided Pearson's χ 2 test or Student's t test, where appropriate. Hardy-Weinberg equilibrium (HWE) was evaluated in controls using goodnessof-fit χ 2 test within each tSNPs. The D' values of LD plots were produced using the Haploview program. The expectation-maximization (EM) algorithm was used to evaluate the most probable haplotype by maximum-likelihood estimation among current population. A two-sided χ 2 test was employed to compare differences in the distribution of genotypes and alleles between cases and controls. Each genotype was assessed in terms of additive (co-dominant), dominant and recessive models of inheritance. Also, Cochran-Armitage trend test was performed to estimate the association between EC risk and allele dose in each tSNP (P trend). Odds ratios (OR) and 95% confidence intervals (CI) were assessed by using univariate and multivariate unconditional logistic regression (LR), with adjustment for BMI, age at menarche/primiparity, menopause status, number of childbearing and family history of cancer. Statistical significance was defined as P<0.05. A Bonferroni-corrected P value was carried out in individual tSNPs and haplotype/diplotype association analysis. The potential gene-environment interactions between TGFBR1 haplotype CACGA and clinical risk factors (estrogen exposure, family history of cancer and BMI) were assessed by LR in stratified population. All statistics were analyzed by SAS software (v.9.1; SAS Institute, Cary, NC).
Classification and regression tree (CART) analysis was performed for high-order gene-gene interactions using SPSS software (v.19.0; SPSS Inc., Chicago, IL, USA) to build a decision tree via recursive partitioning. The decision tree started with a root node which contained the total sample and split into two child nodes. The splitting process continued until the terminal nodes had no subsequent statistically significant splits or reached a pre-supposed minimum size, and then the terminal subgroups were further analyzed. The case rate was calculated for each terminal node and the association of subgroups with EC risk was evaluated by LR analysis, using the subgroup with the least percentage of cases as reference. The ORs and 95% CI were adjusted as mentioned above.
Multifactor dimensionality reduction (MDR) analysis was performed to identify high-order interaction models that were associated with EC risk using open-source MDR software (v.2.0 beta 8.4, http://www.epistasis.org) [17]. Statistical significance was determined using permutation testing in MDRpt (v.1.0 beta 2.0). MDR analysis collapsed multi-dimensional data into a single independent dimensional variable with two levels (high and low risk) using the ratio of the number of cases to the number of controls, and thereby reduced multiple dimensional data into one dimension and permitted detection of interactions in relatively small sample sizes. The new one-dimensional multi-locus genotype variable was evaluated for its ability to classify and examine disease status through cross-validation and permutation test. The best candidate interaction model was regarded as the one with maximal testing accuracy and cross-validation consistency (CVC). To better confirm and visualize the interaction models, we further built an entropy-based interaction dendrogram. This would enable the loci that strongly interact to each other to appear close together at the branches of the tree, and those with weak interaction to appear distant from one another. MDR 1,000-fold permutation results were regarded as statistically significant at P<0.05. The conjoint effect of the variables in the best model was assessed by LR analysis.

Characteristics of study population
The characteristics of population were herein described in S1 Table. The controls and cases seemed to be adequately matched on age (P = 0.7528). The cases, as expected, had higher BMI (P<0.0001), earlier age of menarche (P<0.0001) and later age of menopause (P = 0.0002) compared with the controls. In addition, the percentage of nulliparous women in patients was significantly higher than that in controls (P<0.0001). EC patients were more likely to have family history of cancer in first-degree relatives (P = 0.0464). These variables with significant differences between cases and controls were used in multivariate LR models to further adjust for any possible confounding effect on the association of selected genetic variants with risk of EC.

LD degree between tSNPs
The genotype frequencies for selected 19 tSNPs were all consistently in agreement with those expected by HWE in controls (P<0.05, S2 Table). For our study, haplotype blocks were reconstructed in cases and controls as well as in HapMap CHB population based on D' value (Fig 1). There were some differences in SNPs' pairwise LD between controls and cases. For TGFB1, three LD blocks were reconstructed in disease-free participants. For TGFBR1 as well as SNAI1, all selected tSNPs were reconstructed into one high-LD block in controls. For TWIST1, only one haplotype block was reconstructed, in which the rs4721746 and rs4721745 were excluded from the analysis because their MAFs were lower than 5%.
We further explored the combination effects between the aforementioned four protective polymorphisms by setting up two binary (1, 0) dummy variables. Firstly, we assessed the relative importance of the four protective tSNPs in their designated models. The adjusted OR value indicated that these four protective tSNPs affected EC susceptibility at a similar level (S4 Table). Then, individuals were categorized into five groups based on the number of protective genotypes they carried, and those without any protective genotypes were defined as the reference group. The analysis of combination effects indicated that the adjusted OR of EC for individuals carrying two protective genotypes was 0.41 (95% CI = 0.23-0.74, P = 0.0029). Coexisting three or four protective genotypes substantially decreased the susceptibility of EC in an almost similar degree (Table 2). Also, the protective genotypes took effect in a dose-dependent manner (Ptrend < 0.0001) ( Table 2).

Association of high-order interactions among genetic variants with EC risk by CART analysis
CART is a binary recursive partitioning method that produces a decision tree to identify subgroups of subjects at higher risk [18]. Fig 2 demonstrated the tree structure. The tree initiated from the total study population (node 0) and contained five terminal nodes in the final tree structure. TGFBR1 was singled out in the first splitting node, and TGFBR1 rs6478974 TA/AA genotype carriers had the least percentage of EC cases (35.7%), indicating that rs6478974 locus was the strongest susceptible factor for EC risk among the examined polymorphisms. Then the tree progressed along node 1 with the major allele homozygotes of SNP rs6478974. We designated node 2 as a reference node, because women in this node (with TGFBR1 rs6478974 TA/ AA genotypes) had the lowest EC risk. This tree structure revealed that individuals harboring TGFBR1 rs6478974 TT, TGFBR1 rs10512263 TT, TGFB1 rs4803455 CC and TGFBR1 rs10733710 GG genotypes (node 7) had significantly higher risk calculated by multivariate LR analysis (aOR = 3.71, 95% CI = 2.14-6.43, P<0.0001), and women with both the genotypes of TGFBR1 rs6478974 TT and TGFBR1 rs10512263 TC/CC (node 4) imparted the highest predisposition to EC risk in our population (aOR = 7.86, 95% CI = 3.42-18.07, P<0.0001, Table 3).

Association of high-order interactions among genetic variants with EC risk by MDR analysis
We applied the MDR method, a nonparametric and genetic model-free analysis, to identify interaction models. The best one-factor model generated by MDR for examining EC risk was TGFBR1 rs6478974 (testing accuracy 0.561, CVC 9/10, Table 4), which was consistent with the first splitting node by CART analysis. The two-factor model including both TGFB1 rs1800469 and TGFBR1 rs6478974 was the best interaction model, which yielded the maximal CVC of 10/ 10 and the highest testing accuracy of 0.589. The best three-factor model including TGFB1 rs1800469, TGFBR1 rs6478974 and TGFBR1 rs10733710 and the four-factor model consisting of TGFB1 rs1800469, TGFBR1 rs6478974, TGFBR1 rs10512263 and TGFBR1 rs10733710 had higher testing accuracy compared with the one-factor model (0.584, 0.575, respectively), but the CVCs were decreased (7/10, 6/10, respectively). All interaction permutation P value was less than 0.05. The interaction dendrogram showed that TGFB1 rs1800469 and TGFBR1 rs6478974 had the strongest synergistic interaction (black line), which also interacted with TGFBR1 rs10733710 (dark grey line). Furthermore, TGFBR1 rs10512263 had weak interaction with TGFBR1 rs10733710, TGFB1 rs1800469 and TGFBR1 rs6478974 (light grey line, Fig 3). For the combined effect of TGFB1 rs1800469 and TGFBR1 rs6478974 in the best interaction model identified above, LR analysis demonstrated that the adjusted OR of EC was 0.43 (95% CI = 0.27-0.69, P = 0.0003, data not shown). The summary of these three approaches for single-locus analysis was shown in S5 Table. Association of haplotypes and diplotypes in TGFB1, TGFBR1, SNAI1, TWIST1 with EC risk by LR analysis To further explore the modest etiological effects of polymorphisms on EC susceptibility, haplotype was reconstructed as surrogate to provide higher resolution and potentially greater statistic power [16]. In our study, TGFB1 haplotypes (rs1800469 and rs2241716) with above 1% frequency were tested separately against the most common haplotypes, and the remaining rare haplotypes in the block (frequency < 1%) were not analyzed. The haplotype CG in block 1 was associated with increased EC risk relative to haplotype TG by univariate LR algorithm (OR = 1.60, 95% CI = 1.32-1.95, P˂0.0001, Table 5), and the diplotype CA-CG, carrying at-risk haplotype CG, increased about 62% of EC risk compared to the most common diplotype TG-CA (aOR = 1.62, 95% CI = 1.08-2.43, P = 0.0187). They did not remain significant after the Bonferroni correction. For TGFBR1, CACGA, harboring a protective locus rs6478974, could decrease about 42% of EC risk (aOR = 0.58, 95% CI = 0.43-0.77, P = 0.0003). Even after adjustment for Bonferroni-corrected multiple testing, the haplotype was still significantly associated with EC risk (Bonferroni-corrected P<0.05). Furthermore, the diplotype CACGA-CT-TAA, containing a protective haplotype CACGA was also associated with decreased EC risk compared with the most common diplotype TTTGG-CACGA (aOR = 0.35, 95% CI = 0.18-0.66, P = 0.0012), with a Bonferroni corrected P<0.05 (Table 5). Haplotypes in SNAI1 and TWIST1 were not associated with EC susceptibility (S6 Table).

Association of interactions among genetic variants and environmental factors with EC risk
Given that long-term exposure to estrogen, cancer history in first-degree relatives and overweight are clinical EC risk factors [19], we conducted analysis in stratified population to EC, endometrial cancer; OR, odds ratios; CI, confidence intervals. a The genetic variants of rs1800469, rs6478974, rs10733710 and rs4721745 were considered as protective genotypes. Individuals in group 1 had no protective genotypes; in the next four groups, we pooled all individuals harboring any one protective genotype as group 2, harboring any two protective genotypes as group 3, harboring any three protective genotypes as group 4 and four protective genotypes as group 5. b Adjusted for BMI, age at menarche, age at primiparity, number of childbirth, menopause status and family history of cancer in first-degree relatives.
Bold numbers denote a statistical significance at 0.05 level.

Discussion
In this study, we applied multiple strategies including LR, CART and MDR approaches to systematically evaluate the association of EC susceptibility with germline variants in TGF-β signaling related genes TGFB1, TGFBR1, SNAI1 and TWIST1 among Chinese Han women. In single-locus analysis using multivariate LR, five polymorphisms, rs1800469 in TGFB1, rs6478974 and rs10733710 in TGFBR1, rs4721746 and rs4721745 in TWIST1 showed significant association with EC susceptibility. Although LR has been widely used in multivariate gene-gene or gene-environment interactions, it cannot fully characterize them because of the sparseness of data in high dimensions. Moreover, its statistic power would decrease and type II errors would increase in relatively small sample size [21]. So the non-parametric CART and MDR analysis were employed in high-order gene-gene interactions to examine particular combination effects of genetic variants. In this study, CART analysis indicated that the most important splitting variable was TGFBR1 rs6478974, followed by TGFBR1 rs10512263. The MDR method, which reduces the genotype parameters from multi-dimension to one dimension, demonstrated that TGFB1 rs1800469 and TGFBR1 rs6478974 together were the best interactional polymorphisms to examine EC risk.
All the three approaches in single-locus analysis consistently indicated that the genotype TGFBR1 rs6478974 TA/AA (located in intron 1) had the strongest protective effect on EC susceptibility. Until now, common variants were seldom reported in the exons or functional regions of TGFBR1 to have clear functional relevance. Although the vast majority of SNPs are located in the genomic non-coding regions, new evidence suggests that SNPs, located in gene promoter or regulatory regions, play critical roles in regulating the nature and timing of gene expression [22]. Chen J et al found that rs6478974 was associated with increased risk of gastric cancer in Chinese population (in dominant model: aOR = 1.36, 95% CI = 1.14-1.63; in additive model: aOR = 1.23, 95% CI = 1.08-1.40) [23]. They discovered that rs6478974 was in moderate LD with rs334348 and rs1590 (in the 3'-UTR, both r 2 = 0.504) using online software SNPinfo (http://manticore.niehs.nih.gov/cgi-bin/snpinfo/snpfunc.cgi), and these two loci probably regulated miRNAs binding and influenced gastric cancer development. Because TβRI inhibits cell growth during early tumorigenesis [9,24], we speculate that individuals carrying TGFBR1 rs6478974 TA/AA express higher levels of TβRI than TT genotype carriers, and therefore have lower susceptibility to EC. We also found that individuals with both the genotypes of TGFBR1 rs6478974 TT and TGFBR1 rs10512263 TC/CC had higher susceptibility compared with those harboring the genotype TGFBR1 rs6478974 TA/AA by CART analysis, which indicates that rs10512263 could be a risk locus. A two-stage case-control study of gastric cancer (the first stage of cases/controls = 650/683; the second stage of cases/controls = 484/348) showed that rs10512263 in dominant models (CT/CC vs. TT) was significantly associated with increased risk of gastric cancer in Chinese population [23], which is consistent with our results. But Scollen S et al discovered that the minor allele C of rs10512263 had a protective effect on breast cancer susceptibility (OR = 0.87, 95% CI = 0.81-0.95, P = 0.001) in meta-analysis of the SEARCH and PBCS studies [25]. The discrepancies among these results could be due to the ethnic diversity of populations and complicated environmental factors. In TGFB1, we observed that the T allele of rs1800469 (C<T at the 5'UTR region) was associated with decreased EC susceptibility under dominant model, which was consistent with the result in gastric cancer among the same ethnic population (cases/controls = 675/704, aOR = 0.65, 95% CI = 0.52-0.82) [26]. Our MDR analysis demonstrated that the combined genetic variants of TGFB1 rs1800469 and TGFBR1 rs6478974, the best interaction model, decreased the EC risk, which was in accordance with the results analyzed by LR. It was reported that the T allele of rs1800469 could enhance the affinity of its promoter with some transcriptional factors such as Yin Yang 1 (YY1), and increase the expression of TGF-β1 [27,28]. Moreover, Grainger DJ et al showed that the concentration of TGF-β1 in plasma was extremely higher in T allele carriers than C allele carriers among UK population [29]. The polymorphism TGFB1 rs1800469 may perform its protective function during early tumorigenesis by altering the expression of TGF-β1.
In TWIST1, we discovered that the variant genotypes of rs4721746 and rs4721745, both locating in the 3' flanking regions, had opposite effects on EC risk in our population. When using web-based functional annotation tool F-SNP (http://compbio.cs.queensu.ca/F-SNP/) [30], these two polymorphisms were both predicted to influence transcriptional regulation by TFSearch and Consite tools (functional significance scores = 0.239; 0.208, respectively). Further studies in other population are needed to verify our findings. Haplotype-based approach may have greater power than single-locus analysis when SNPs are in strong LD and would provide additional statistical power to detect genes involved in complex trait diseases [31,32]. In our haplotype-reconstruction association study, we found that TGFB1 haplotype CG and diplotype CA-CG were both associated with increased EC susceptibility. In TGFBR1, haplotype CACGA and diplotype CACGA-CTTAA decreased EC risk. Also, we observed significant joint effects of haplotype CACGA, family history of cancer, BMI status and estrogen exposure in stratified analysis. Haplotype CACGA, harboring a protective allele A of rs6478974, decreased the risk of EC regardless of what the environmental factors were, which further indicated that A allele of rs6478974 might be the most important protective locus in our population. If these haplotypes and diplotypes could be proved in other populations, they can be used as molecular makers for the estimation of EC risk, and can also provide some clues for finding causal SNPs.
There are three main strengths in our study. First, in single-locus analysis, we not only used traditional LR, but also CART and MDR approaches to identify high-order interactions while overcoming LR's shortcomings, such as inaccurate parameter estimates and low power for detecting interactions. Second, we performed gene-wide analysis of tSNPs that covered all common SNPs of the four genes. Third, we reconstructed haplotype blocks according to our genotyping data in controls, which could guarantee the reasonable division of haplotype blocks. However, our study has several limitations. First, our sample size was relatively small, and the number of individuals was even smaller when the data were stratified. Second, the causal genetic variants hidden behind the association have not been revealed, and a further fine mapping study with high-density SNPs within the target region would be helpful in identifying the causal variants.
Supporting Information S1