MUC5AC Upstream Complex Repetitive Region Length Polymorphisms Are Associated with Susceptibility and Clinical Stage of Gastric Cancer

MUC5AC was deemed to be involved in gastric carcinogenesis since aberrant MUC5AC expression has been repeatedly detected in patients with gastric cancer (GC). In this study, length polymorphisms in a complicated repetitive region adjacent to MUC5AC promoter were assessed in 230 patients with GC and 328 cancer-free controls. Alleles of 1.4 and 1.8 kb were significantly more prevalent in GC group than in controls. In contrast, 2.3 and 2.8 kb alleles occurred at significantly lower frequencies in patients than in controls. Alleles were then classified into susceptible (S; 1.4 and 1.8 kb), protective (P; 2.3 and 2.8 kb) and null (N; all other alleles) categories with respect to their linkage with the susceptibility to GC. Individuals with genotype SS had a 2.7-fold increased risk of GC occurrence, but PN genotype was associated with a significantly reduced risk of this cancer. Moreover, homozygous or heterozygous individuals with one or two copies of 1.4 kb allele showed an earlier age of onset and more advanced metastasis stage compared with patients without this allele (Bonferroni corrected p = 1.35×10−4 and 6.60×10−4 accordingly), whereas homozygous patients with two copies of 1.8 kb allele were linked to less advanced GC TNM stage. Our results suggest that certain genetic variations in MUC5AC upstream repetitive region are associated with the susceptibility and progression of GC.


Introduction
Gastric cancer (GC) is one of the most common malignancies and the second leading cause of cancer-related death worldwide [1]. However, its mechanism remains unclear. Although some environmental factors, such as diet, cigarette smoking and Helicobacter pylori, may contribute to carcinogenesis of gastric epithelial cells [2][3][4], only a fraction of the population exposed to such risk factors develop GC during their lifetime. This suggests that genetic factors play a crucial role in determining an individual's susceptibility to GC [5,6].
Mucins are a group of diverse, complex, highly glycosylated extracellular proteins important in maintaining epithelial homeostasis. Cancer cells are often observed to express aberrant forms or amounts of mucins, and these aberrations are thought to play a role in carcinogenesis, especially in regulation of tumor cell differentiation, proliferation, and tumor invasion [7]. For example, overexpression of MUC1 and MUC4 in several different forms of adenocarcinoma contributed to the regulation of cancer cell proliferation via an interaction with epidermal growth factor regulator (EGFR) and extracellular signal-regulated kinases [8]. Velcich et al. demonstrated that Muc2 2/2 mice develop adenomas in the intestine that progress to invasive adenocarcinomas [9], suggesting a protective role for MUC2 in intestinal tumorigenesis.
MUC5AC is a secreted gel-forming mucin and a marker of gastric foveolar epithelial cells [10]. MUC5AC was deemed to be involved in gastric carcinogenesis since gastric carcinoma was found to contain a lower level of MUC5AC expression than normal gastric mucosa [11][12][13], and several clinical studies demonstrated that MUC5AC expression level was associated with severity of GC; however, these data were inconsistent [12,14]. There has been little research on MUC5AC function and the mechanisms underlying its role in GC development, until recently it was reported that silencing MUC5AC, using a small hairpin RNA-containing lentivirus, increased gastric cancer cell invasion and migration in vitro [13]. This adds to evidence that altered levels of MUC5AC expression may be involved in GC pathogenesis. Functional genetic polymorphisms in the regulation region may affect MUC5AC gene expression and then contribute to an individuals' susceptibility to gastric cancer.
Repetitive regions of DNA are common throughout the human genome and are characterized by their dynamic, unstable features [15,16]. They are the major generator of genetic variation and are considered to underlie substantial genetic variability, with novel mutations in such regions explaining much of the 'missing' heritability in polygenic diseases [17,18], including GC [19]. However, this kind of genetic variation cannot be included in genome-wide association studies (GWAS) panels and is challenging to assess reliably. Intensive review of the MUC5AC upstream regulation region and around identified a complicated repetitive region (termed MUC5AC-u repetitive region). We undertook this case-control study to determine the nature and extent of genetic polymorphisms within this region, and to explore the association of each genetic variant with the occurrence and progression of GC.

Methods
Database searches and analysis of the upstream region of MUC5AC The UCSC genome browser (http://genome.ucsc.edu) and the GRCh37/hg19 release of the human genome were used to generate a map showing the location and major genomic features of the MUC5AC gene, including histone H3 lysine 27 acetylation (H3K27AC) status, transcription factor binding sites, common single-nucleotide polymorphisms (SNPs), and repetitive sequence genomic features of the upstream region. The DNA sequence of the upstream region was downloaded from the Ensembl (http:// useast.ensembl.org/Homo_sapiens/ Info/Index).

Ethics statement
This study was conducted with the approval of the Medical Ethics Committee of Shandong University and informed written consent was received from all subjects. The manuscript does not contain identifying patient information. The data were analyzed anonymously and all clinical investigations were conducted according to the principles expressed in the Declaration of Helsinki.

Study subjects
Two hundred and thirty patients with GC were recruited in Shandong Province, northeastern China, between January 2011 and December 2012. All diagnoses of GC were pathologically confirmed; exclusion criteria included a history of cancer of any other organ (not originally from stomach) or having undergone radiotherapy or chemotherapy. Three hundred and twenty-eight cancer-free individuals without any detectable or known cancers were collected as controls. All these subjects were living in the same residential areas as the cases, the vast majority of them were selected from the healthy volunteers, and a small portion of our aged controls were collected from inpatients with mild cardiovascular diseases of the hospitals. Their age and sex were matched with those of patients with GC. All subjects were genetically unrelated ethnic Han Chinese. Each subject was evaluated individually with a pretested questionnaire to obtain demographic data and information on related risk factors, including tobacco smoking and alcohol consumption. Individuals who smoked at least once a day for longer than one year were defined as smokers, and those who consumed three or more alcoholic drinks per week for more than six months were considered alcohol drinkers. Clinical data and pathological characteristics of patients were collected and confirmed from their medical history records and questionnaires, and GC tumor, node and metastasis (TNM) stages were classified according to the system of the World Health Organization (WHO).

Specimens and DNA extraction
One mL peripheral blood sample was collected from each subject. Genomic DNA was isolated from each sample using a modified salt extraction technique [20].
We obtained tissue samples from 36 GC patients in our cohort, and samples from each patient consisted of cancerous tissue, the respective para-carcinoma (defined as being 1.0 cm away from the tumor mass) and surrounding noncancerous gastric mucosal tissues. Genomic DNA was extracted from these samples using the Blood and Cell Culture DNA Mini Kit (Tiangen Biotech, Beijing, China).

Assessment of allele sizes
MUC5AC-u repetitive region genotyping was performed using the polymerase chain reaction (PCR); the gene-specific primer sequences used were as follows: sense 59-TCCACCC-TAACCCTGTCAGCCGC-39; antisense 59-GTGGCAG-GAGTGTGGGGAAAGG G-39. PCR amplification of DNA was performed in a total reaction volume of 50 mL, containing 100 ng genomic DNA, 0.2 mM of each primer and 25 mL PrimeSTAR Max DNA Polymerase (Takara, Japan). PCR was conducted in a 9700 Thermacycler (Perkin-Elmer, CA, USA) as follows: a 5 minute initial denaturation at 94 uC, followed by 30 cycles of 10 s at 98uC and 2 minutes at 68uC. PCR products were analyzed by gel electrophoresis (1 volt/cm) in TAE buffer through 1.0% agarose gel.

DNA sequencing assay
To confirm the genotyping results, PCR-amplified DNA samples (amplicons) were selected and sent to BGI Tech (Beijing, China) for purification and Sanger sequencing. This assay was conducted blind with respect to the specimens and study design.
Statistical analysis SPSS 13.0 software (SPSS, Chicago, IL, USA) was employed for statistical analysis. Differences in demographic variables, smoking and drinking habits, and grouped allelic frequencies between case and control participants were compared using the chi-squared test or Fisher's exact test. Regression analyses were performed to determine the odds ratios (ORs) for association of GC and MUC5AC-u repetitive region genotypes between the controls and GC patients. ORs were estimated using the natural logarithm and its standard error. The chi-squared test or Fisher's exact test were used for comparison of clinical and pathological characteristics of patients. In order to allow multiple comparisons, p values were corrected (pc) using the Bonferroni correction; pc = p631 as, across the whole study, 31 statistical tests were conducted. All tests were two-sided, with pc,0.05 considered to be statistically significant.

Identification of the MUC5AC-u repetitive region
Intensive review of the MUC5AC upstream region identified a complicated 1710 bp repetitive region (termed the MUC5AC-u repetitive region) located between nucleotides 23162 to 21452 upstream from the ATG initiation codon ( Figure 1). This position is immediately downstream of a genomic locus with the capacity to bind several transcription factors. The MUC5AC-u repetitive region contains many interrupted irregular repeats of different lengths and is a complicated combination of microsatellite (e.g., CTCA), minisatellite (e.g., CATTCACT or CATTCACTCATT) and megasatellite (e.g., ACCCATTCACTCACTCACTTATT-CACTC) repeats. At the 59 region, a 300 bp sequence was found to be duplicated exactly, head-to-tail.

Study population
All individuals in the study (328 cancer-free controls and 230 GC patients) were from a Han Chinese population and without any known hereditary disease. Both groups had similar distributions of age, sex and alcohol consumption (x 2 test;  Table S1). There was no significant difference in the distribution of cigarette smoking between the patients and controls (p = 0.098). According to the TNM system, 10.9%, 10.0%, 21.7%, 42.2% and 15.2% of patients had stage 0, I, II, III and IV disease, respectively (Table  S2).

Association of repetitive region genotypes with the risk of GC
Genomic DNA samples were isolated from whole blood of all the subjects and used as templates to amplify the MUC5AC-u repetitive region. Eight alleles with discontinuous sizes ranging from 1.1 to 2.8 kb were identified in this Han Chinese population ( Figure 2). The 1.1 kb allele was most common, the 1.8 and 2.0 kb alleles were less common, and the others were all relatively uncommon ( Table 1).
The overall distribution of the MUC5AC-u repetitive region alleles among patients with GC differed significantly from that found in controls (x 2 = 58.44, p = 3.09610 210 ). For further analysis, comparisons of allele frequencies between patients and controls were made individually for each allele, using Fisher's test ( Table 1). The 1.4 and 1.8 kb alleles were significantly more prevalent in patients with cancer than in controls (3.9% vs. 0.0%, pc = 3.00610 26 ; 35.4% vs. 25.8%, pc = 1.56610 22 , respectively). Additionally, the frequencies of the 2.3 and 2.8 kb alleles were significantly lower in patients with cancer than in controls (3.3% vs. 9.0%, p = 1.51610 24 ; 0.0% vs. 1.8%, p = 0.002, respectively), and the multiple comparisons corrected p values were 4.68610 23 for the 2.3 kb and 0.062 (suggestive) for the 2.8 kb allele. No significant differences were found when frequencies of other alleles between cases and controls were compared.
Based on these observations, we classified the eight alleles as susceptible (S), protective (P), or null with respect to risk (N) as follows: S, 1.4 or 1.8 kb; P, 2.3 or 2.8 kb; and N, all other alleles. Twenty-one MUC5AC-u repetitive region genotypes were totally identified in our case-control population (Table S3), the genotypes were then defined as NN, SN, PN, SP, SS, and there was no PP genotype in our cohort. The most common genotype (NN) was designated as the reference group. Individuals with the homozygous genotype SS had a 2.7-fold increased risk of GC occurrence (OR = 2.683, 95% CI = 1.554-4.361, pc = 0.012; Table 2). The PN genotype was associated with a significantly reduced risk of GC (OR = 0.257, 95% CI = 0.116-0.569, pc = 0.031). Neither of the heterozygous genotypes SN and SP was associated with a change in the risk of GC (both p.0.05).

Clinical and pathological characteristics at diagnosis of GC patients with differing MUC5AC-u repetitive regions
As certain variable number of tandem repeat polymorphisms are reported to exert dual, conflicting effects on the risk and prognosis of cancer [21], we compared the age at onset and clinical stages between GC patients with and without MUC5AC-u repetitive regions of 1.4, 1.8 or 2.3 kb separately.
There were 128 GC patients (55.7%) in our sample who carried the 1.8 kb version of the MUC5AC-u repetitive region; 35 patients   (Table 4). We did not find individuals showing the homozygous genotype 2.3/2.3 kb in our sample; however, fifteen GC patients (6.5%) were heterozygous for this allele. Heterozygous patients were older at GC onset than patients who were not, although this result was at a marginal level of significance and did not survive the correction for multiple tests. There was no significant difference in distributions of T, N, M or TNM stages of cancer between patients with one or no copy of the 2.3 kb allele ( Table 5).

Analysis of repetitive region instability in cancer tissues
As repetitive regions of DNA are unstable in various human malignancies, including GC [22], we next determined whether the hypervariable MUC5AC-u repetitive regions differed in length between cancer, para-carcinoma and surrounding normal tissues from 36 GC patients. The results showed no differences in band pattern between para-carcinoma and normal tissues in all 36 patients; however, length alterations were observed in DNA samples of cancer tissues in two GC patients ( Figure 3). In both cases, bands were detected showing a shift from long alleles in cancer tissue to short alleles in para-carcinoma tissue. In one case, one allele shifted from 2.0 kb to a novel, 0.9 kb allele, and, in another case, one allele shifted from 2.3 kb to 1.4 kb. Among the 36 gastric cancer patients tested, the frequency of cancer-related genome rearrangement in the MUC5AC-u repetitive region was 5.6%.
Sanger sequencing of the 1.1, 1.4 and 1.8 kb alleles from three GC patients PCR amplicons of the 1.1 kb and 1.4 kb alleles from the gastric cancer tissue DNA were successfully sequenced using the Sanger sequencing technique. These sequences are listed in supporting information files. We were unable to sequence the entire fragment of a 1.8 kb amplicon (PCR amplicon using the gastric cancer tissue DNA), or any other fragments .1.8 kb, due to the complicated and repetitive structure of the target region and limitations of the technique. The sequences show the same main genetic structure and repetitive units as the UCSC genome reference sequence but with different overall lengths. The initial 300 bp at the 5' end of the 1.4 kb MUC5AC-u repetitive region sequence are exactly duplicated in a head-to-tail pattern.

Discussion
In this study, we assessed the association of genetic variation in a repetitive region close to the MUC5AC promoter with the risk of occurrence and progression of GC. Our study was suggested by the diverse biological functions of MUC5AC in the healthy and diseased states, the unique location of the region potential regulating the gene expression, the highly dynamic nature of the repetitive sequence, and the effect of this instability on generating novel mutations.
Analysis of 230 GC patients and 328 controls showed the MUC5AC-u repetitive region was highly polymorphic, with eight different alleles (plus a 0.9 kb allele in the cancer tissue from one GC patient) being present in a Han Chinese population from northeastern China. Based on the distribution and differences of Table 2. Association of MUC5AC-u repetitive region genotypes and gastric cancer risk. allelic frequencies between GC patients and controls, these eight alleles were classified into susceptible alleles (S: 1.4 and 1.8 kb), protective alleles (P: 2.3 and 2.8 kb), and null alleles (N: the others). Individuals bearing two susceptible alleles (SS) had a 2.7fold increased risk of developing GC, and the genotype PN was associated with a reduced risk of gastric cancer. Our findings suggest that genetic variation in this region is significantly associated with susceptibility to GC, and thus add to the existing evidence that changes in MUC5AC expression is involved in the pathogenesis of this malignant disease.
In further analysis, we found that these genetic variants were not only associated with GC susceptibility, but also with its prognosis. We found patients with the 1.4 kb allele had an earlier age of GC onset and were more likely to have advanced T and M stage diseases. As advanced T and M stages are associated with a poor prognosis in general, our results indicated that GC patients with the 1.4 kb allele were linked to more rapid progression of the disease. In contrast, patients homozygous for the 1.8 kb allele tended to have an older age at diagnosis and less advanced T, N, and TNM stages than other patients, indicating this genotype Table 3. Clinical and pathological characteristics of GC patients with MUC5AC-u repetitive region 1.4 kb allele.  might decrease the risk of developing advanced gastric cancer and be associated with a better outcome. Repetitive regions of the genome have been dismissed as nonfunctional ''junk'' DNA previously; however, a recent study found that up to 25% of gene promoters in the Saccharomyces cerevisiae genome contain repetitive sequences [23]. A comparable distribution of tandem repeats in the promoters of Homo sapiens genes also demonstrated that genes driven by repeat-containing promoters had significantly higher rates of transcriptional divergence [23]. A number of studies have shown that many variations in repetitive regions of promoters affect gene expression and contribute to genetic susceptibility to various human disorders [24][25][26], and for cancers as well [27][28][29]. Several molecular mechanisms may underlie the effects of repetitive regions in promoters on gene expression; for example, they may alter the number of transcription factor binding sites, generate changes in the spacing of critical promoter elements, modulate the activity of RNA-binding proteins or affect the chromatin structure [23]. According to the Encyclopedia of DNA Elements (ENCODE) dataset, available for visualization and download via the UCSC Genome Browser (http://genome.ucsc.edu/), the region containing the MUC5AC-u repetitive region contains clusters of known transcription factor binding sites and is enriched for histone H3 lysine 27 acetylation (H3K27ac), a reliable marker for active chromatin. Thus length variations of this repetitive region might have considerable impact on DNA structure and transcription factor binding, and hence upon gene regulation. Therefore, our finding of an association between the length of the repetitive region and a change in GC risk might be explained by alterations in MUC5AC levels. This will be explored in future studies, which will investigate whether the 1.4 and 1.8 kb alleles enhance promoter activity and if 2.3 and 2.8 kb alleles repress it. Such studies will help to reveal the exact role this region plays in the development and prognosis of GC.
Genomic instability was shown to affect tumor initiation and progression by accelerating the accumulation of the multiple genetic alterations responsible for cancer cell development [30]. Although spontaneous rearrangements of repetitive regions were detected more frequently in the germ line than in somatic cells [31], several studies have demonstrated repetitive regions are unstable in various human neoplasms [32][33][34], including GC [35]. When we examined the MUC5AC-u repetitive region length in DNA from normal and cancer tissues from some GC patients, we found two examples of length alterations. Both converted long to short alleles, and the 1.4 kb allele, associated in our study with an increased risk of GC, appeared in one case. Although the genetic rearrangement frequency was relatively low, this result implies that instability at this locus contributes to the pathogenesis of gastric cancer in some cases.
The duplication of a 300 bp DNA segment at the beginning of this complicated repetitive region is relevant in this context. This duplication is likely to have occurred multiple times to form the larger allelic variants, which differ from each other mostly in  ,300 bp increments. For example, we speculate the 1.4 kb allele was generated by this duplication from the 1.1 kb allele, which is the most common allele in our study population. This duplication event may be associated with genome instability which is extensively involved in tumorigenesis, development and metastasis of gastric cancer. There are reports of similar duplication events in large central repetitive exons of MUC5AC [36]. Although we did not achieve novel sequences distinguishing them with the reference sequence in three selected GC patients, from 1.1, 1.4 and 1.8 kb allele DNA fragments, we found many SNPs, which could possibly be used as proxies for allele sizes and in strong LD with other genetic markers outside of the region, besides of the length differences. Due to the great complexity, length, high similarity across the region, and the limits of the sequencing technique, we could not sequence the entire region of the DNA amplicons from all subjects; thus, other sequencing features and genetic variants have not been revealed very likely. Moreover, we can not tell if there are more dramatic genetic mutation events occurred in the genome DNA of the cancer tissue which will be more challenging, but likely more productive.
To the best of our knowledge, this is the first report to indicate the association between genetic variation in MUC5AC-u repetitive region and gastric cancer risk. We have shown certain genetic length variants in the repetitive region around the MUC5AC promoter are significantly associated with susceptibility to GC, and with its clinical stages. Prospective, large-scale trials, as well as well-designed mechanistic studies, are required to validate our findings. Figure S1 Multiple sequence alignment of MUC5AC-u repetitive region variants. PCR amplicons of the 1.