Pooled Resequencing of 122 Ulcerative Colitis Genes in a Large Dutch Cohort Suggests Population-Specific Associations of Rare Variants in MUC2

Genome-wide association studies have revealed several common genetic risk variants for ulcerative colitis (UC). However, little is known about the contribution of rare, large effect genetic variants to UC susceptibility. In this study, we performed a deep targeted re-sequencing of 122 genes in Dutch UC patients in order to investigate the contribution of rare variants to the genetic susceptibility to UC. The selection of genes consists of 111 established human UC susceptibility genes and 11 genes that lead to spontaneous colitis when knocked-out in mice. In addition, we sequenced the promoter regions of 45 genes where known variants exert cis-eQTL-effects. Targeted pooled re-sequencing was performed on DNA of 790 Dutch UC cases. The Genome of the Netherlands project provided sequence data of 500 healthy controls. After quality control and prioritization based on allele frequency and pathogenicity probability, follow-up genotyping of 171 rare variants was performed on 1021 Dutch UC cases and 1166 Dutch controls. Single-variant association and gene-based analyses identified an association of rare variants in the MUC2 gene with UC. The associated variants in the Dutch population could not be replicated in a German replication cohort (1026 UC cases, 3532 controls). In conclusion, this study has identified a putative role for MUC2 on UC susceptibility in the Dutch population and suggests a population-specific contribution of rare variants to UC.


Introduction
Inflammatory bowel diseases (IBD) are common chronic gastrointestinal inflammatory disorders. The two major forms of IBD are Crohn's disease (CD) and ulcerative colitis (UC). CD can affect any part of the gastrointestinal tract, while UC is restricted to the colon and the rectum. UC is probably caused by an aberrant immune response against the commensal intestinal flora, influenced by a combination of genetic, microbial and environmental factors, resulting in chronic inflammation of the colonic epithelium. Defects in both innate and adaptive immunity and epithelial barrier function are associated with UC [1].
The genetics of complex diseases has been thoroughly investigated in genome wide association studies (GWAS). These identified thousands of common genetic variants associated with disease susceptibility [2]. GWAS and meta-analyses have identified 200 risk loci in IBD, including 29 risk loci specifically associated with UC. While relevant disease pathways have been identified by GWAS, UC-associated common variants only explain 8.2% of variance in disease onset [3]. Therefore, research looking into the missing heritability in UC is now focused on the contribution of low frequency and rare variants [4,5].
Sequencing studies have revealed that low frequency (minor allele frequency (MAF) between 1% and 5%) and rare (MAF < 1%) genetic variants are more likely to have a deleterious effect on health compared to common variants (MAF > 5%) [6]. Also, population-based studies characterizing detailed genetic variation within a population, like the Genome of The Netherlands (GoNL), have shown that rare genetic variants can be very population-specific [7].
Since rare variants are population-specific and only one previous study investigated UC, we aimed to further investigate the contribution of rare, large effect genetic variants to UC susceptibility. We identified a putative role of variants in the MUC2 gene on UC susceptibility in the Dutch population and suggest a population-specific contribution of rare variants to UC liability.

Materials and Methods
We performed a targeted resequencing study in 790 UC patients (Phase I) followed by replication of identified variants in an independent Dutch cohort of 1021 UC cases and 1161 Healthy controls (Phase II) and a German cohort consisting of 1026 UC cases and 3532 healthy controls (Phase III).
Pooled targeted deep high-throughput sequencing has been performed of 122 genes: We have selected two groups of target genes for re-sequencing.
The first group of genes (n = 111) originates from genomic loci identified through previous GWAS and Immunochip studies conducted by the International IBD Genetics Consortium [12], The second group consisted of genes selected based on the fact that they lead to the development of a spontaneous colitis in knock-out mice (n = 11) [13] (Phase I). (S1 File) In addition to the coding sequence, for 45 of these genes with a known cis-eQTL effect (expression Quantitative Trait Locus) we also sequenced the promoter region [14]. We used whole genome sequence data of 500 healthy unrelated Dutch individuals from the Genome of the Netherlands (GoNL) as a control cohort [7]. Follow-up genotyping of identified variants was performed in 1021 Dutch cases and 1166 healthy controls (Phase II) and in independent German cohorts of 1026 UC cases and 3532 healthy controls (Phase III). Fig 1 shows an overview of our analysis strategy (Phases I, II and III).

Phase I: Discovery
Target selection, design and enrichment. In total, for 122 genes, we sequenced all exons including 20 flanking intronic base pairs. In addition, for the genes with a known cis-eQTL effect [15], we included 1000 base pairs upstream of the transcription start site in the sequencing design to enable us to identify regulatory variants in the promoter sequence of those genes.
Pooled targeted enrichment of DNA from 790 Dutch UC patients (12 individuals per pool) was performed using a custom-made kit (Agilent HaloPlex). The HaloPlex kit was designed with Agilent's Sure Design, resulting in coverage of 99.9% of the target sequence (S1 File).
Sequencing, read alignment and annotation. Next, after the enrichment, the resulting libraries were sequenced using 100 bp paired-end sequencing on an Illumina HiSeq 2500 machine with 8 barcoded pools per sequence lane. Sequences were aligned using an in housedeveloped pipeline adapted for pooled sequencing (Genome Build 37, Genome Analysis Toolkit [GATK]). To reduce false-positive SNVs, we performed a second alignment and variant calling with NextGENE software (http://www.softgenetics.com/NextGENe.html). Only variants called by both algorithms were included for further analysis.
Chi-squared and the Fisher-exact tests with R statistical software [7] were used for association analyses. The allele frequency was based on allele counts per Single Nucleotide Variant (SNV). Variants were annotated using SNPeff and SeattleSeq [16,17]. To check for regulatory functions of the variants, the Encyclopedia of DNA Elements (ENCODE) [18] was searched using the UCSC Genome Browser [19].
Quality control and variant selection: prioritization of relevant variants. As part of our quality control procedure several identified variants were validated by Sanger sequencing (S1 File). An overview of the quality control steps is shown in Fig 2 and described in detail in S1 File.
After quality control, a total of 2562 confidential SNVs remained (S1 Table). To prioritize relevant variants for follow up genotyping, we removed SNVs that had been tested previously in other studies that used the Immunochip genotyping array (n = 527) [12]. Synonymous mutations (n = 335) were removed since they lack functional consequence. Next, we used the following strategies to select non-synonymous SNVs (coding), including splice-sites, (n = 418) as well as non-coding SNVs (n = 1282).
In the coding variant group, we used an allele frequency (AF) threshold of <0.05 for inclusion of variants for follow-up genotyping since common variant (AF > 0.05) analyses within these regions have extensively been performed within the original GWAS and Immunochip based studies [12]. A slightly different strategy was obtained for genes that are known to lead to spontaneous colitis when knocked-out in mice. Here the aim was to study whether genomic variants in these genes exist in humans and whether they are associated with UC susceptibility. In this group of genes we took a more liberal approach in selecting variants for further followup and included common variants with predicted functional consequences for follow-up genotyping (Fig 2). After this step, 377 SNVs remained. Further prioritization was based on damaging effect predicted by Polyphen (damaging effects between 0.8 and 1.0) and/or damaging effect predicted by Sift (n = 112). We included all nonsense variants (n = 6), the variants in splice sites (n = 4) and variants that were significantly different in AF compared to the AF in GoNL (n = 5). We also included newly identified variants that were present in multiple pools (n = 13). In total, 140 coding variants remained after this filtering step. Overview of quality control and prioritization in Phase I. a) After pooled sequencing, a total of 7969 SNVs were detected with a coverage of >360x (12 individuals* 30x coverage). b) All variants called by two alignment strategies were included and filtered using a Forward/Reverse balance between 20-80%. c) Variants previously tested in a large IBD cohort with the Immunochip (n = 527) and silent mutations (n = 335) were excluded. d) We used different strategies to select nonsynonymous SNVs (coding), including splice-sites, (n = 418) (d1) and non-coding SNVs (n = 1282) (d2). d1) The coding variants were selected on the basis of allele frequency (AF): known SNVs with an AF > 0.05 were excluded. A different strategy was obtained for genes that are known to lead to spontaneous colitis when in knocked-out mice. In this group of genes we took a more liberal approach in selecting variants for further follow-up and included common variants with predicted functional consequences for follow-up genotyping. Three hundred seventy-seven SNVs remained after this step. d2) To prioritize the noncoding SNVs in regulatory regions, we selected 48 SNVs in a transcription factor binding site (TFBS), based on ENCODE data in the UCSC browser e) Further prioritization was based on damaging effect prediction by Polyphen (damaging effects between 0.8 and 1.0) and/or damaging effect predicted by Sift (n = 112). We included all nonsense variants (n = 6), the variants in splicesites (n = 4) and variants that were significantly different in AF compared to the AF in GoNL (n = 5). We also included unknown SNVs present in more than one pool (n = 13). f) In total, 140 coding and 48 non-coding rare variants remained after filtering. To prioritize the non-coding SNVs in regulatory regions, we selected 48 SNVs in a transcription factor binding site (TFBS), based on ENCODE data in the UCSC browser [19].
In total, 19 variants were selected for replication in an independent cohort (Phase III), including variants with a significant p-value (p< 0.05), singletons replicated in cases in Phase II and SNVs based on the gene-based analysis. SNVs were excluded if the association was in the opposite direction between discovery (Phase I) and replication phase 1 (Phase II).
In total, 188 SNVs were selected for follow-up genotyping, of which 171 passed the design of the Agena Bioscience iPlex (Phase II). After quality control 111 SNVs remained. The relatively low number of replicated SNVs results from the stringent cut-off threshold to exclude false positives. For 30 of the 111 rare SNVs, we could not identify additional carriers in either cases or controls. For half of the variants, we detected a discrepancy in the direction of the AF between cases and controls in the discovery (Phase I) and replication phase 1 (Phase II). For one singleton variant, we detected one additional carrier in the cases. For the SNVs located in a TFBS, we detected nine additional carriers, but no significant differences in AF between the cases and controls in the replication phase 1 (Phase II, (S2 Table).
Single marker permutation (10,000x) allelic association analysis, performed with the Megaanalysis of Rare Variants (MARV) software, detected eight SNVs (P < 0.05) with a significant difference in AF between cases and controls [9]. Four of these SNVs were located in the coding region of MUC2. The other four SNVs consisted of one stop-gain variant located in CCDC88B, two damaging coding variants in RFTN2 and MMEL1 and one variant in a TFBS in the promoter region of the PMCA gene (Table 3). Gene-based analysis with SKAT-O resulted in nine variants in the MUC2 gene with a significant p-value of 9.2 x 10 −5 (threshold p = 0.0011 after Bonferroni correction).
In total, 19 variants were selected for replication phase 2 (Phase III). After quality control, 17 variants remained, and none of the variants were associated with UC in the German cohort (Phase III).

Discussion
In this large Dutch sequencing study, we investigated the contribution of rare variants to the genetic susceptibility of UC. We identified a supposed role for the MUC2 gene on UC susceptibility in the Dutch population, suggesting a population-specific contribution of rare variants to UC susceptibility. What distinguishes our study from previous re-sequencing studies is that we include 11 genes that are known to lead to spontaneous colitis when knocked-out in mice [13]. Moreover, we include the promoter regions of genes with a known cis-eQTL effect. We have sequenced 122 genes in 790 Dutch UC patients, using a targeted pooled sequencing approach. After prioritization of variants with a pathogenic probability, extensive follow-up genotyping in~1000 additional Dutch UC cases and~1200 healthy Dutch controls revealed an association of variants in the MUC2 gene with UC in the Dutch population. This association was not replicated in an independent German cohort. We also confirmed known rare variants in the IL23R (rs41313262, rs76418789, rs11209026), CARD9 (rs141992399, rs200735402) and JAK2 (rs41316003) genes, most with similar AFs to those reported in other studies (Table 1).
Pooled sequencing has proven to be a highly cost-effective method for screening large populations. Therefore, it has been used in several re-sequencing studies in IBD [9][10][11]21]. A major problem of sequencing studies is the relative high rate of false-positive SNVs. The recommended approach to minimize the high false-positive rate is very deep sequencing (100x per individual) of a large population with geographically matched individuals [22]. In this study, we used the largest Dutch UC cohort available for discovery (Phase I) and replication phase 1    (Phase II). Target enrichment was performed with HaloPlex capturing, in which genomic DNA is fragmented by restriction enzyme digestion and circularized by hybridization to probes. Compared to hybrid capture methods, HaloPlex is relatively quick and inexpensive. It also requires a smaller amount of DNA and has a higher fraction of sequence reads in our region of interest [23]. However, because of the fragmentation with restriction enzymes instead of random fragmentation, it is impossible to exclude duplicate reads in the alignment in order to reduce sequencing artefacts. Therefore, we used the presence of the SNVs in both forward and reverse sequencing reads as a quality control filter, which substantially reduced the number of false positives. Since this output cannot be deduced from our standard bioinformatics GATK-pipeline, we did additional alignment and variant calling using the NextGene Software.
After an extensive, stringent quality control with the additional alignment,~2500 highly confident variants remained with a minimal coverage of >59x and with a transition/transversion ratio ti/tv = 2.52, indicative of a relatively high true-positive rate for our dataset [10,9]. Single marker association and gene-based analyses (p-value = 9.2 x 10 −5 ) showed an association of the MUC2 gene with UC in the Dutch population (Table 3). MUC2 was selected because it induces spontaneous colitis when knocked out in mice [1,13,24]. The MUC2 gene encodes a member of the mucin protein family and is the major mucin secreted in the large intestine. The colonic mucus layer plays a critical role in intestinal homeostasis by limiting contact between luminal bacteria and the mucosal immune system. A defective mucosal barrier is a key feature of active UC [25,26]; patients with UC present with a reduction of goblet cells, decreased glycosylation of mucins, and absence of MUC2 expression in goblet cells in the affected colon mucosa. Altogether, this functional evidence supports MUC2 as a candidate gene for UC pathogenesis.
MUC2 has not been previously identified as an UC-associated gene. A previous small candidate-approach genetic association study did not show an association of MUC2 with UC [3]. Furthermore, MUC2 has never been associated with UC in GWAS studies or meta-analyses; the Immunochip contains just two MUC2 SNPs and only a few were present on previous GWAS platforms (Illumina HumanHap550). The reason for this could be the difficulty of designing specific probes because of the homology of the MUC2 gene with other members of the mucin protein family (MUC5AC, MUC5B, MUC6 and MUC19). This strong homology could be a source of problems in the alignment of sequencing reads, thereby introducing false positive SNVs. However, we were able to validate our variants using Agena Bioscience assays, which were highly specific for MUC2 as demonstrated by blasting of our sequences in the UCSC genome browser (http://genome.ucsc.edu). Blast output and a clusterplot of MUC2 is shown in the S1 file. MUC2 is a very large gene. The exonic sequence contains 49 exons and the entire MUC2 gene product has more than 5100 amino acids in its commonest allelic form. The size of the gene makes it more likely to detect mutations.
While our association of MUC2 in the Dutch UC population could not be replicated in a German cohort, this might be because our associated SNVs are population specific or because of a lack of power. Recently, the first trans-ancestry association study in IBD was performed in a cohort of 86,640 European individuals and 9,846 individuals of East Asian, Indian or Iranian descent [3]. The majority of the loci found, based on common SNPs with a MAF >5%, were shared between different ancestry groups. However, this study also found genetic heterogeneity between divergent populations at several established risk loci driven by difference in allele frequency (NOD2) or effect size (TNFSF15 and ATG16L1), or a combination of these factors (IL23R and IRGM). Rare variants are even more likely to be specific to a particular population, as was demonstrated by a recent sequencing study in a Korean IBD population [21]. Table 1 shows that allele frequencies for a rare variant in IL23R (rs76418789) differ strongly among populations, even between closely related UK populations [11] in Prescott's study and the large population used in the Rivas and Beaudoin studies (NIDDK consortium (North America), Australia, Italian, Dutch, Swedish, German, UK population) [10,9]. The Korean study shows a 10x higher allele frequency compared to European populations [21]. These differences in MAF between populations, even in ancestrally close populations, could explain the lack of replication between our Dutch and German cohorts. There could also simply be a lack of power to detect association in our replication phase 2 (Phase III, Table 3). For example, the CARD9 splice-site (rs141992399) has the same allele frequency in the large population of the Rivas paper (28,000 patients and 17,570 healthy controls) as in our study, but our p-value is much higher (Table 1), which underlines the importance of well-powered studies to detect significant rare variants.
Large whole genome sequencing (WGS) and whole exome sequencing (WES) studies in IBD are in progress. Although we identified potential variants in TFBS, none of them were statistically significant in replication phase 1 (Phase II). Thus the WGS and WES studies will increase the power to explore the non-coding part of the genome and the association of the MUC2 region to UC in different populations.

Conclusions
Identifying associations of rare variants in complex diseases remains challenging, and the approach of re-sequencing known genes might not be the key to resolving the missing heritability in complex diseases like UC. The power of rare variants could be better captured in the regulatory, non-coding part of the genome by sequencing the whole genome or, specifically, the enhancer regions. Another option is to select genes based on pathway analyses or candidate genes, or to use specific phenotypic populations (like early onset IBD or family based studies). If the eventual goal is individual risk-scores for disease development, genomic interpretation of the non-coding part of the genome is crucial. For this, large well powered WGS and WES studies are necessary to give a realistic view of the role of rare variants in complex disease. , Nr_pools = number of pools (of 12 patients) in which variant is detected, SNV = rs-id if available from dbSNP137, refAllele = reference allele, altAllele = alternative allele, UC_freq = allele frequency detected in 790 UC patients, Controls_GoNL_Freq = allele frequency detected in 500 healthy controls of the Genome of the Netherlands cohort(GoNL), CHISQ = p-value after Chi-squared test, FISHER = p-value after Fisher-exact test, Wash_EA_AF = allele frequency based on European population in Exome Variant Server (http://evs.gs. washington.edu/EVS/), 1000G_EUR_AF = allele frequency based on European population in 1000 genomes (http://www.1000genomes.org), ExAC = allele frequency based on Exome Aggregation Consortium (http://exac.broadinstitute.org), HGVS.c = Variant using Human Genome Variation Society notation (DNA level), HGVS.p = If variant is coding, this field describes the variant using Human Genome Variation Society notation(Protein level), SnpEf-f_effect = Effect of this variant based on SnpEff, SnpEff_gene_biotype = This field is 'COD-ING' if any transcript of the gene is marked as protein coding, SnpEff_gene_name = name of the Gene, Selection group = genes selected in UC genes or genest hat lead to spontaneous colitis when knocked-out in mice. DNASE1 = DNase I hypersensitive sites from ENCODE, HISTONE = histone modification from ENCODE, POLYMERASE = polymerase subunits from ENCODE, TFBS = Transcription Factor Binding Sites from ENCODE, DNASE1_ CELLTYPES = DNase I hypersensitive sites specific cell types from ENCODE, HISTONE_ CELLTYPES = histone modification specific cell types from ENCODE, POLYMERASE_ CELLTYPES = polymerase specific cell types from ENCODE, TFBS_CELLTYPES = Transcription Factor Binding Sites specific cell types from ENCODE, PolyPhen = polymorphism phenotyping, used to predict functional effect of human missense variant, in this study the damaging effect cut-off is between 0.8-1.0. ClinicalAssociation = link with known clinical association, SIFT = predicts whether an amino acid substitution affects protein function, SNP_on_ICHIP = SNP already tested on Immunochip, Refseq = annotation based on Reference sequence database (http://www.ncbi.nlm.nih.gov/refseq/), Imputed_SNV_ICHIP = variant is imputed in Immunochip dataset, using GoNL data Selected_follow-up = variants selected for follow-up (Phase II), 140 coding variants, 48 variants based on location in Transcription Factor Binding Sites. (XLSX) S2 Table. Replication phase 1 (Phase II): complete list of 111 SNVs after quality control. Follow-up genotyping of 171 SNVs (after quality control) in an additional Dutch cohort (funded by the Parelsnoer Institute) consisted of 1021 UC cases, 1166 healthy controls and 111 SNVs, with a genotype call rate of 98%. Allelic association analysis (χ 2 test, PLINK v1.07) and permutation (10,000X) association analysis was performed with the Mega-analysis of Rare Variants (MARV) software, significance cut-off p-value of <0.05. A Gene-Based analysis was performed with EPACTS-software, 45 genes (all variants with AF < 0.05) were tested with the SKAT-O test. (http://genome.sph.umich.edu/wiki/EPACTS). OR = Odds Ratio, Zstat_10,000perm = Z-statistic (which is compared to a reference standard normal distribution) after 10,000 permutations, P_10,000perm = p-value after 10,000 permutations (MARV software). (XLSX)