Exome chip association study excluded the involvement of rare coding variants with large effect sizes in the etiology of anorectal malformations

Introduction Anorectal malformations (ARM) are rare congenital malformations, resulting from disturbed hindgut development. A genetic etiology has been suggested, but evidence for the involvement of specific genes is scarce. We evaluated the contribution of rare and low-frequency coding variants in ARM etiology, assuming a multifactorial model. Methods We analyzed 568 Caucasian ARM patients and 1,860 population-based controls using the Illumina HumanExome Beadchip array, which contains >240,000 rare and low-frequency coding variants. GenomeStudio clustering and calling was followed by re-calling of ‘no-calls’ using zCall for patients and controls simultaneously. Single variant and gene-based analyses were performed to identify statistically significant associations, applying Bonferroni correction. Following an extra quality control step, candidate variants were selected for validation using Sanger sequencing. Results When we applied a MAF of ≥1.0%, no variants or genes showed statistically significant associations with ARM. Using a MAF cut-off at 0.4%, 13 variants initially reached statistical significance, but had to be discarded upon further inspection: ten variants represented calling errors of the software, while the minor alleles of the remaining three variants were not confirmed by Sanger sequencing. Conclusion Our results show that rare and low-frequency coding variants with large effect sizes, present on the exome chip do not contribute to ARM etiology.


Introduction
Anorectal malformations (ARM) are rare congenital malformations, resulting from disturbed hindgut development. A genetic etiology has been suggested, but evidence for the involvement of specific genes is scarce. We evaluated the contribution of rare and low-frequency coding variants in ARM etiology, assuming a multifactorial model.

Methods
We analyzed 568 Caucasian ARM patients and 1,860 population-based controls using the Illumina HumanExome Beadchip array, which contains >240,000 rare and low-frequency a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

Introduction
Congenital anorectal malformations (ARM) are the most frequent malformations of the gastrointestinal tract, with a prevalence of 2 to 7 in 10,000 live births [1]. ARM encompass a broad range of phenotypes, which are usually classified according to the type of fistula to neighboring organs. In approximately 50% of the patients, ARM is associated with additional congenital malformations, including vertebral, cardiac and/or renal malformations [2,3]. Multiple surgical corrections are required during the first years of a patient's life. Despite major improvements in treatment and care of ARM patients in the past decade, a substantial number of patients face lifelong physical and psychosocial problems [4].
Our current understanding of the embryology and etiology of ARM is limited. Syndromes caused by a fully penetrant mutation in a single gene, such as Currarino syndrome (OMIM #176450) or Townes-Brocks syndrome (OMIM #107480), are identified in at most 10% of ARM patients [2]. In the remaining patients, the involvement of both genetic and non-genetic factors in the occurrence of ARM seems likely. Aggregation of ARM has been observed in some families [5][6][7][8]. In addition, non-genetic risk factors, such as fertility treatment [9][10][11][12], maternal overweight or obesity[5, 13,14], pre-existing diabetes [15][16][17], and previous miscarriages [18], have all been found to be associated with ARM in various studies. Genetic research into ARM has mainly focused on candidate genes that are involved in embryonic signaling pathways, such as sonic hedgehog (SHH), wingless-type integration site (WNT), and fibroblast growth factor (FGF) signaling, identified from animal studies [19,20]. Human ARM studies provided only limited evidence to support a contribution of these genes in ARM etiology [21][22][23][24][25][26][27][28]. Hypothesis-generating approaches, such as genome-wide studies seem valuable to obtain new knowledge and hypotheses on genes being involved in the etiology of ARM.
Until now, Wong et al. performed the only genome-wide association study for ARM in 175 ARM patients. This study did not yield any associated common single nucleotide variants, and the results did not suggest a role for common copy number variants (CNV) in the etiology of ARM either [29]. There was, however, an apparent excess of rare CNVs in ARM patients compared to controls and to healthy individuals from the Database of Genomic Variants [29]. This suggests the importance of rare genomic variation in the etiology of ARM.
In this study, we aimed to identify rare genetic variants for ARM by genotyping >240,000 known rare and low-frequency coding variants, using an Illumina Human Exome Beadchip array, in a cohort of almost 600 ARM patients.

Study population
AGORA (Aetiologic research into Genetic and Occupational/environmental Risk factors for Anomalies in children) is a large data-and biobank with DNA samples and clinical and questionnaire data from children with congenital malformations or childhood cancer, control children, and their parents. AGORA is a multicentre effort coordinated by the Radboud university medical center (Radboudumc) in Nijmegen, The Netherlands [30]. For the current study, AGORA provided 429 blood or saliva samples from live born European patients who were treated for ARM at the departments of Pediatric Surgery of the Radboudumc Amalia Children's Hospital, the Sophia Children's Hospital Erasmus MC Rotterdam (EMC), and the University Medical Center Groningen (UMCG), in The Netherlands. The German Network for Congenital Uro-REctal malformations (CURE-Net) provided 169 additional DNA samples from ARM patients of European ancestry. These patients were recruited through the German self-help organization for ARM patients (SoMA e.V.) and pediatric surgical departments throughout Germany. DNA samples from ARM patients with known chromosomal abnormalities or syndromes with a known genetic cause were excluded from the study population.
Pediatric surgeons, clinical geneticists, and researchers reviewed the medical records of the ARM patients extensively to obtain clinical information on ARM phenotypes and associated birth defects. We classified ARM phenotypes according to the Krickenbeck criteria [31], while other congenital malformations were classified according to the EUROCAT classification [32].
Controls were derived from the Nijmegen Biomedical Study (NBS), a population-based survey conducted by the Department for Health Evidence and the Department of Laboratory Medicine of the Radboudumc [33]. In total, 22,451 age and sex stratified randomly selected adult inhabitants of the municipality of Nijmegen received an invitation to fill out a postal questionnaire on items, such as lifestyle and medical history, and to donate two blood samples. The response to the questionnaire was 43% (N = 9,350), and 69% (N = 6,468) of the responders donated blood samples. For the current study, DNA samples of 1,930 European controls were used. The Arnhem-Nijmegen Regional Committee on Research Involving Human Subjects approved the AGORA and NBS study protocols and the Ethics Committees of the University of Bonn and the University of Heidelberg approved the CURE-Net study protocol. Written informed consent was obtained from all adult participants and parental consent for children under 18 years of age.

Genotyping and quality control
Using standard methods, DNA was extracted from blood collected in EDTA-containing tubes or saliva specimens collected in Oragene containers (DNA Genotek Inc., Ottawa, Canada). The DNA samples were genotyped using the Illumina HumanExome BeadChip array (v1.1), which contains 242,901 variants throughout the exome, hereafter referred to as exome chip. The majority of these variants (~220,000) are rare and low-frequency coding variants (nonsynonymous, splice site, stop gain, and stop loss variants), and a small proportion are non-coding variants. The variants were selected based on their occurrence in approximately 12,000 sequenced genomes and exomes from multiple populations of primarily European ancestry [34]. Approximately 20% of the variants have a minor allele frequency (MAF) >1%, and 80% of the variants a MAF <1%. Genotyping, calling, and quality control procedures were performed at the Erasmus MC in Rotterdam, The Netherlands, as part of the ExomeChip Rainbow Project (RP10) of Biobanking and Biomolecular Research Infrastructure Netherlands (BBMRI-NL). Regular GenomeStudio clustering and calling was followed by quality control and re-calling of the 'no-calls' using zCall, for patient and control samples simultaneously, to prevent batch effects. The zCall tool was especially designed for calling rare variants from array data [35].
The raw dataset after GenomeStudio calling consisted of 2,528 samples and 242,532 variants. After quality control (S1 Fig), 2,432 samples and 239,253 variants remained before the zCall procedure was applied. After the zCall procedure, two additional samples were removed because of excess heterozygosity, 205 additional variants because of low call rates (<99%), and 6 other variants because they were not in Hardy-Weinberg equilibrium (HWE). In total, 2,430 samples and 239,042 variants were included in the final dataset. We additionally removed one patient because of a clinical diagnosis of caudal regression syndrome (OMIM #600145) and one patient because of known relatedness with another patient (PI-HAT 0.28). As a result, 239,042 genotyped variants in 568 patients and 1,860 controls remained for further analyses.
Single variant analyses. We first restricted the single variant analyses to variants with a minor allele frequency (MAF) value of �1.0% in the combined patient and control series. Based on this criterion, 36,062 markers remained for the statistical analyses. Secondly, we were interested in less frequently occurring variants. To reduce potential false-positives and to have sufficient power to possibly replicate a statistically significantly associated variant we investigated variants with a MAF value of �0.4%, an arbitrary value which corresponds to a minor allele count of a minimum of 20 minor alleles in the combined patient and control series. Based on this criterion, 43,653 markers remained for the statistical analyses.
We tested the associations between ARM and each variant using Fisher's exact tests in an allelic model, assuming HWE, and an additive genetic model. We used an exact method, as the asymptotic score test is known to be too liberal and does not provide robust results when variant allele counts are low. Potential of bias due to population stratification was assessed using the quantile-quantile (QQ) plot and by calculating the genomic inflation factor (lambda), which was defined as the regression coefficient of the observed to expected-log p values from the Fisher's exact test. Test statistics were adjusted for lambda. A p-value below 1.39x10 -6 was considered statistically significant in the single variant analyses with a MAF value of �1.0%, which corresponds to a p-value of 0.05 with Bonferroni correction for 36,062 tests. Similarly, a p-value below 1.15x10 -6 with Bonferroni correction for 43,653 tests was used in the single variant analyses with a MAF �0.4%.
Gene-based analyses. Gene-based analyses were performed using the sequence kernel association test (SKAT) in R. SKAT has been shown to be a powerful method when both harmful and protective variants with different magnitudes of effect occur in one gene [36]. We tested 6,188 genes with at least two variants that passed quality control and were polymorphic in either patients or controls. A p-value below 8.08x10 -6 was considered statistically significant, corresponding to a Bonferroni correction for 6,188 tests. The analyses were performed using default settings [36].

Extra quality control step
To make sure that the statistically significant variants represented true positive calls, we applied an extra quality control step (S1 Text) for the variants that showed apparent statistical significance in the single variant analyses. This involved manual inspection of the minor allele calls in the cluster plots to exclude possible technical artefacts, which may have led to false calls.

Sanger sequencing
After the extra quality control step, three SNPs were selected for technical validation using Sanger Sequencing, including all samples that showed a heterozygous genotype for one of these three SNPs. More detailed information on primer design and Sanger sequencing can be found in S1 Text. The sequences obtained were compared with the reference sequence derived from the UCSC genome browser (Hg38) by two independent researchers using VECTOR NTI software 11.0.

Statistical analyses-Single variant analyses
Single variant analyses using an allelic model (lambda 1.063) for variants with MAF �1% did not identify variants that were statistically significantly associated with ARM ( Figure A in  S2 Fig).
Analyses of variants with MAF �0.4% and an allelic model identified 13 variants that were statistically significantly associated with ARM (lambda 1.068) (Table 2 and Figure B in S2 Fig). The additive genotype model was also applied, and showed similar results, after correction of lambda 1.072 (S1 Table).

Extra quality control step
The extra quality control step for the 13 statistically significant variants from the single variant analysis (MAF �0.4%) showed that 10 out of the 13 variants most likely represented false minor allele calls due to technical errors (S1 Text). Therefore, these variants were excluded as candidates for technical validation using Sanger sequencing.

Sanger sequencing
The three variants that remained eligible for technical validation using Sanger sequencing were exm42669, exm297681, and exm1452159. For variant exm42669, we were unable to confirm heterozygosity in 28 out of the 29 patients who were called as heterozygous risk allele carriers by the exome chip. We did not have DNA available for the last patient anymore. For variant exm297681, we had DNA for 22 out of 24 patients available and for variant exm1452159, for 29 out of 31 patients. Again, we were not able to confirm the presence of the minor alleles for these variants by Sanger sequencing.

Statistical analyses-Gene-based analyses
Gene-based analyses were performed for the dataset with MAF restricted to �1.0%, as we encountered many false-positive findings when we restricted the MAF cut-off to �0.4%. We did not find any gene that was statistically significantly associated with ARM in the combined patient and control series. The genes that passed a p-value threshold of 0.01 are given in S2 Table.

Discussion
This study aimed to identify rare genetic variants associated with ARM by analyzing exome chip data of >240,000 known rare and low-frequency coding variants in a large ethnically homogeneous series of 568 Caucasian ARM patients and 1,860 Caucasian population-based controls. We identified 13 variants that were statistically significantly associated with ARM, but all of these appeared to be false-positive findings. Therefore, this first exome chip study did not provide statistical evidence for association of rare genetic variants with large effect sizes and ARM. Although the study by Wong et al. suggested the importance of rare genomic variation in the etiology of ARM [29], we were not able to support this hypothesis with findings in the current study. Although the prevalence of ARM is relatively low, we were able to include a large number of patients. The frequency of ARM patients with additional anomalies in our study was in concordance with previous literature [2,3,11], and the distribution of ARM phenotypes was comparable with patients from the European ARM-Net registry [37]. This illustrates that we included a representative ARM patient series within our study. Despite the inclusion of almost 600 ARM patients, we could only detect, at 80% power and a multiple testing-adjusted alpha, rare variants with MAFs of 0.4-1.0% that have large effect on ARM (allelic ORs >4.0). To be able to detect rare variants with smaller effect sizes (e.g. OR 2.0), >3000 ARM patients are required for variants with MAF 1.0% and >7500 ARM patients for variants with MAF 0.4%. Because of the assumption of a multifactorial model in ARM etiology, smaller effect sizes are likely, which indicates the need for much larger sample sizes when arrays are used.
Given the well-known limitations of calling rare variants on genotyping arrays [38], the higher risk of incorrect genotype calling when lower MAFs are used [39], and the large number of rare variants that were present on the exome chip, measurement errors may have had a large impact on the study results. However, we applied Bonferroni corrections to minimize the chance of reporting false-positive associations. In addition, we applied extra quality control steps, which included the inspection of the cluster plots and validation by Sanger sequencing. Both of these quality control steps suggested that the initial findings for all 13 statistically significant variants were driven by incorrect calling of rare alleles. This highlights the importance of visual inspection of cluster plots and validation of findings using an independent genotyping technique. Another note about this study is that the content of the exome chip is based on the coding and splice site variants that were observed several times in approximately 12,000 sequenced individuals of mainly European descent. Hence, a large proportion of rare human exome variation and private variant alleles were not covered. Therefore, whole genome sequencing is required to allow for discovering relevant rare coding variants that are not covered by this exome chip, as well as relevant non-coding variants. Because ARM is observed as a component within certain syndromes [2], and aggregation of ARM has been observed within families[5-8], a role of genetics in the etiology of ARM remains likely.
Wong et al. performed the only GWAS thus far in 175 cases and 2971 controls, which failed to detect common variants associated with ARM [29]. Both, the study by Wong et al. and our study indicate that major effects of rare or common single nucleotide variants as captured by exome-and genome-wide arrays do not seem to play a major role in the etiology of ARM. ARM appears to comprise a genetically heterogeneous set of malformations, as was also suggested by Wong et al. [40] A potential hypothesis regarding the etiology is that ARM may only be caused by single-gene defects in a, probably small, proportion of patients. Since ARM comprise a wide spectrum of phenotypes, several different monogenic forms of ARM may exist. This hypothesis was also suggested for congenital malformations of the kidney and urinary tract (CAKUT) [41]. Equally, ARM may be a truly multifactorial disorder with involvement of many genetic as well as non-genetic risk factors with relatively small effects acting simultaneously in the majority of patients. Support for this scenario derives from several non-genetic factors being associated with ARM [19], [42].
In conclusion, the present study among 568 ARM patients did not yield evidence for associations between ARM and rare and low-frequency coding variants with large effect sizes, as captured by the exome chip. Future studies with even larger sample sizes are needed to identify potential common and/or rare genetic variants with small effects on ARM etiology. For these studies, international collaborations are essential. Genetic studies in phenotypically homogenous subgroups of ARM may further contribute to the elucidation of underlying genetic causes.