Best Practices and Joint Calling of the HumanExome BeadChip: The CHARGE Consortium

Genotyping arrays are a cost effective approach when typing previously-identified genetic polymorphisms in large numbers of samples. One limitation of genotyping arrays with rare variants (e.g., minor allele frequency [MAF] <0.01) is the difficulty that automated clustering algorithms have to accurately detect and assign genotype calls. Combining intensity data from large numbers of samples may increase the ability to accurately call the genotypes of rare variants. Approximately 62,000 ethnically diverse samples from eleven Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium cohorts were genotyped with the Illumina HumanExome BeadChip across seven genotyping centers. The raw data files for the samples were assembled into a single project for joint calling. To assess the quality of the joint calling, concordance of genotypes in a subset of individuals having both exome chip and exome sequence data was analyzed. After exclusion of low performing SNPs on the exome chip and non-overlap of SNPs derived from sequence data, genotypes of 185,119 variants (11,356 were monomorphic) were compared in 530 individuals that had whole exome sequence data. A total of 98,113,070 pairs of genotypes were tested and 99.77% were concordant, 0.14% had missing data, and 0.09% were discordant. We report that joint calling allows the ability to accurately genotype rare variation using array technology when large sample sizes are available and best practices are followed. The cluster file from this experiment is available at www.chargeconsortium.com/main/exomechip.


Introduction
Exome-and whole-genome sequencing is becoming increasingly affordable and allows for detection and genotyping of rare variants in the human genome. Yet, genotyping arrays remain a cost-effective approach when investigating genetic polymorphisms previously identified in large populations. A limitation of using arrays to genotype rare variants is the difficulty that automated clustering algorithms have to accurately detect and assign accurate genotype calls [1,2]. Large sample sizes increase the number of occurrences of rare variants and, therefore, should facilitate automated clustering and genotyping.
An array focused on rare and low frequency coding variation, hereafter referred to as the exome chip, has been developed by querying the exomes sequenced in ,12,000 individuals and aggregating the variation that is seen in more than two individuals in more than two sequencing efforts (http://genome.sph.umich. edu/wiki/Exome_Chip_Design). Participating studies in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium [3] consented to have their Illumina Infinium HumanExome BeadChip intensity data analyzed collectively (n = 62,266) in order to increase the accuracy of rare variant genotype calls. The resulting cluster file (.egt) is publically available and we show that its use, along with best practices, increase genotype accuracy compared to other methods alone.

Results
Genotypes were obtained for 238,876 successful variants in accordance with our best practices (96.4% SNP pass rate) which were converted to PLINK format [4] by cohort and combined into a single aggregate file for further analyses. Of the 62,266 samples genotyped, 1,380 (2.2%) had a GenCall quality score in the lower 10 th percentile of the distribution across all variants genotyped (p10GC) ,0.38 or call rate ,0.97 and were excluded from allele frequency calculations. Because founder effects and unique population structure have been previously observed in Icelandic samples [5,6], the Age, Gene/Environment, Susceptibility-Reykjavik study was excluded from subsequent steps. Known duplicated samples, individuals without self-reported race, and the HapMap controls were also removed. After excluding duplicate variants (n = 811), the minor allele frequencies (MAF) for 238,065 successful SNPs and 56,407 samples by self-reported race are described in Table 1. There were 10,693 monomorphic SNPs (4.5%), and 78.6% of the variants on the exome chip have a MAF ,0.005. Allele frequencies for each variant by race group are reported in the SNP information file (see Methods and Data Access sections). Ethnicity specific HapMap allele frequencies for the 96 controls (48 CEU and 48 YRI) and genotypes are also available.
To evaluate the performance of the rare variant calling approach (see Methods), we compared exome chip genotypes derived from three calling methods to available exome sequencing data in 530 ARIC individuals. First, exome chip genotypes were called with the Illumina issued cluster file HumanExome-12v1.egt (see Data Access section for file location) (Dataset I). Second, we used zCall (threshold set to 7) [7] to determine genotypes for the missing variant calls in Dataset I to create Dataset Z. Third, we used the CHARGE best practices (see Data Access) and joint calling approach described to ascertain exome chip genotypes (Dataset C). A total of 185,119 variants that were present in the exome sequence dataset and passed our best practices were compared using genotype concordance and uncertainty coefficient tests. Results are presented in Table 2. The uncertainty coefficients indicate that we can predict 86.4% of the information (entropy) in the exome sequence data when using the Illumina cluster file, 91.2% when using the zCall algorithm, and 93.4% when the CHARGE clustering method was utilized.
These data demonstrate the importance of implementing stringent laboratory quality control measures in addition to the clustering algorithms and rare variant calling approaches tested. The complete list of 8,994 failing SNPs identified in the jointly called exome chip project are available for download on the CHARGE public website. Genotypes ascertained with the CHARGE jointly called exome chip cluster file (Dataset C) were 99.77% concordant with sequence data, 0.14% were missing in exome chip data, in the exome sequence data, or both, and 0.09% were discordant (Figure 1). Heterozygotes in Dataset C were most often misclassified when compared to the common allele homozygote, and mismatches were attributed equally to both sequencing and genotyping ( Table 2).
We also tested the ability of the CHARGE exome chip cluster file to accurately assign genotypes in the three rarest variant bins: singletons (minor allele count = 1), doubletons (minor allele count = 2), and tripletons (minor allele count = 3). We observed high concordance between exome chip singletons (99.99%), doubletons (99.98%), and tripletons (99.97%) when compared to their respective sequence genotypes in the same 530 ARIC individuals previously described (data not shown). These results are consistent with the global concordance tests which suggest we are able to accurately call very rare variants.

Discussion
The results presented here demonstrate that rare variants on the exome chip can be accurately called when using a large, combined cluster file and best practices described when compared to existing clustering algorithms and rare variant calling methods. The joint calling protocol, accompanying cluster file, list of poor performing variants on the chip, and annotation data are a valuable resource for the scientific community and will be of great utility to those having smaller sample sets where the calling of rare variants is problematic. All new projects will require user decisions based on their own cohort data and the metrics and best practices presented here should be updated accordingly.

Ethics Statement
All subjects provided written and informed consent to participate in genetic studies, and all study sites received approval to conduct this research from their local respective Institutional Review Boards (IRB) as follows: ''The National Bioethics Committee'' and ''The Data Protection Authority'' (AGES); University of Mississippi Medical Center IRB (ARIC -Jackson Field Center), Wake Forest University Health Sciences IRB (ARIC -Forsyth County Field Center), University of Minnesota IRB  . Each center genotyped a common set of 96 HapMap samples to be utilized for quality control and determination of batch effects. The two channel raw data files (.idat) for all samples were transferred to a central location and assembled into a single project for joint calling. A summary of the samples genotyped within each cohort by race and gender is described in Table 3. The following variables were provided for each sample included in the project: study specific sample ID, cohort name, sample type (DNA or WGA), race (self-reported), gender, sample plate, sample well, chip barcode, chip position, replicate ID, father and mother IDs (if applicable).

Clustering, Genotype Calling and Laboratory Quality Control
The Illumina GenomeStudio v2011.1 software was utilized with the GenTrain 2.0 clustering algorithm. Genomic DNA study samples and HapMap controls with call rates .99% (n = 55,142) were used to define genotype clusters with races combined and reruns excluded. The no-call threshold was set to 0.15 and we excluded female Y SNPs when calculating SNP statistics. The genotype quality score, representing the 10 th percentile of the distribution of GenCall scores across all SNPs genotyped (p10GC), was visually examined in a scatter plot across all samples (Index vs. p10GC). Samples with an empirically determined p10GC ,0.38 were identified as outliers and flagged for exclusion. The SNP parameters ''Expected Number of Clusters of Y SNPs'' and ''Expected Number of Clusters of mtSNPs'' were set to 2. Following automated clustering, all variants meeting the criteria provided in Table 4 (n = 107,175) were visually inspected and manually clustered, if possible, by two independent laboratory technicians. AA and BB theta deviation cutoffs were determined empirically. Variants removed from the HumanExome BeadChip v1.1 (n = 4,969) and cautious sites, as defined by the exome chip design committee (n = 333) (ftp://share.sph.umich.edu/ exomeChip/IlluminaDesigns/cautiousSites/cautiousSite.sorted. sites), were also inspected. Samples with a call rate between 0.95 and 0.99 that had been previously excluded were brought back in to the project and re-inspected based on the criteria listed in Table 4. This additional review was necessary as the CHARGE exome chip project contains samples from multiple DNA sources and ethnicities that were genotyped at several centers. SNPs exhibiting obvious batch effects were excluded. After joint calling, reproducibility and heritability statistics, SNP statistics and sample statistics were updated and the SNP-level quality control criteria described in Table 5 were implemented. SNPs with reproducibility (rep) errors .2, parent-parent-child (PPC) error .1, or parent-child (PC) error .1 were not excluded, but were flagged and reported back to the participating studies for further investigation. A list of the 8,994 excluded variants is provided on the CHARGE exome chip website as cluster positions for these sites are zeroed out in the.egt file (note: all SNP statistics for these sites will be converted to zero when the cluster file is imported into the Genome Studio project). A portion of excluded SNPs may be recoverable in projects with a homogenous population substructure, and we recommend clustering and reviewing the subset of variants with the user's high quality samples. Table 6 describes the exome chip content and number of variants excluded by functional category (see Annotation). Importantly, the v1.0 cluster file should not be used for calling the Illumina v1.1 exome chip as the two versions were manufactured with different bead pools.

Exome Chip Performance
Genotypes derived from available exome sequencing of 540 ARIC participants were used as the comparison dataset to test the performance of the exome chip. We excluded 10 individuals from the sequencing dataset due to a high missing data rate ,0.90, or non-overlap of individuals with existing exome chip data. Exome sequencing data is accessible via dbGaP as part of the National Heart Lung and Blood Institute (NHLBI) GO-ESP: Heart Cohorts Component of the Exome Sequencing Project (ARIC) (Study Accession: phs000398.v1.p1).
The following variants were excluded from the exome chip dataset as they were not available in the genotype data derived from exome sequencing results: replicate sites that were determined as triallelic or duplicates on opposite strands, short insertion/deletions, XY chromosome SNPs, Y chromosome SNPs, mitochondrial SNPs, or sites not identified in the exome sequencing dataset (n = 56,042). Poor performing variants identified by our best practices criteria were removed if not previously excluded (n = 6,709), thus a total of 185,119 variants were available for concordance analyses in 530 individuals.
Since concordance results are potentially high due to rare variation on the exome chip, we also calculated uncertainty coefficients [24] to determine the degree of association between each of the exome chip calling methods and exome sequence data. The uncertainty coefficient is a measure of association that is based Table 3. Sample sizes of cohorts participating in joint calling effort by gender and self-reported race. on information entropy [25], or the uncertainty in a random variable, that is, a variable subject to chance variations. Uncertainty coefficients are useful when evaluating results obtained from clustering algorithms since genotype classification is usually random (all minor alleles are not classified as either AA or BB), thus the algorithm is not susceptible to rare variation bias in which the more common genotype could have been called by chance alone. See Press et al. (1992), pp. 758-762, for further clarification of the uncertainty coefficient metric [26].

Annotation
Annotation of the v1.0 exome chip variants was performed with dbNSFP [27]. The dbNSFP v2.0 annotations are available on the CHARGE exome chip public website in the SNP information file. dbSNP rs information has been curated and a look up table with the associated Illumina SNP name is also available. The reason for inclusion of the variant on the exome chip by the design team is also provided in the SNP info file (ftp://share.sph.umich.edu/ exomeChip/IlluminaDesigns/annotatedList.txt).