Genome-Wide Analyses of Exonic Copy Number Variants in a Family-Based Study Point to Novel Autism Susceptibility Genes

The genetics underlying the autism spectrum disorders (ASDs) is complex and remains poorly understood. Previous work has demonstrated an important role for structural variation in a subset of cases, but has lacked the resolution necessary to move beyond detection of large regions of potential interest to identification of individual genes. To pinpoint genes likely to contribute to ASD etiology, we performed high density genotyping in 912 multiplex families from the Autism Genetics Resource Exchange (AGRE) collection and contrasted results to those obtained for 1,488 healthy controls. Through prioritization of exonic deletions (eDels), exonic duplications (eDups), and whole gene duplication events (gDups), we identified more than 150 loci harboring rare variants in multiple unrelated probands, but no controls. Importantly, 27 of these were confirmed on examination of an independent replication cohort comprised of 859 cases and an additional 1,051 controls. Rare variants at known loci, including exonic deletions at NRXN1 and whole gene duplications encompassing UBE3A and several other genes in the 15q11–q13 region, were observed in the course of these analyses. Strong support was likewise observed for previously unreported genes such as BZRAP1, an adaptor molecule known to regulate synaptic transmission, with eDels or eDups observed in twelve unrelated cases but no controls (p = 2.3×10−5). Less is known about MDGA2, likewise observed to be case-specific (p = 1.3×10−4). But, it is notable that the encoded protein shows an unexpectedly high similarity to Contactin 4 (BLAST E-value = 3×10−39), which has also been linked to disease. That hundreds of distinct rare variants were each seen only once further highlights complexity in the ASDs and points to the continued need for larger cohorts.


Introduction
The autism spectrum disorders (ASDs, MIM: 209850) are a heterogeneous group of childhood diseases characterized by abnormalities in social behavior and communication, as well as patterns of restricted and repetitive behaviors [1]. Twin studies have demonstrated much higher concordance rates of ASD in monozygotic twins (92%) than dizygotic twins (10%) [2,3], indicating a strong genetic basis for autism susceptibility. Although previous work has implicated numerous genomic regions of interest [4–8], the identification of specific genetic variants that contribute to ASD risk remains challenging.
Substantial progress towards the identification of genetic risk variants has come from recent characterization of structural variation (i.e., copy number variation or CNV). For example, an initial report involving patients with syndromic autism characterized genomic variation using array comparative genomic hybridization (CGH) and identified large de novo CNVs in 28% of cases [9]. Similarly, subsequent work demonstrated that the frequency of de novo CNVs is higher in cases versus controls [7,8]. CNV analyses have proven useful in the identification of regions that are potentially disease-related [8,10–13] and have begun to be employed to advance the candidacy of individual genes, including NRXN1, CNTNAP2, and NHE9 [6,14–16]. Recent work that characterized structural variation in cases and ethnically matched controls, associated ubiquitin-pathway genes with autism, and replicated this finding in the AGRE dataset is likewise notable [17], although family data were not reported in that study. Using the AGRE dataset as a discovery cohort, along with family information available for AGRE samples, we describe distinct and complementary analyses, prioritizing exonic events over CNVs in introns and intergenic intervals, which provide important new insights into the genetic architecture of the ASDs.
Towards the identification of additional genes and regions that may modulate disease risk, we have assembled a resource characterizing genome-wide structural variation from over nine hundred multiplex ASD families. Presented below are results from analyses contrasting events observed in cases and healthy ethnically matched controls, focusing on three classes of genic events: exonic deletions (eDels), exonic duplications (eDups), and whole gene duplications (gDups). Recovery of known ASD loci, together with the identification of novel regions harboring variants in multiple cases but no controls, supports the utility of this dataset. Consistent with enormous inter-individual variation, we further document a large number of events observed in only individual cases (Table S4). Importantly, all of these data have been made available to the scientific community pre-publication (www.agre.org), greatly enhancing the utility of existing publicly accessible biomaterials and phenotype data. These data further highlight the extent of structural variation in both human and the ASDs and offer an important resource for hypothesis-generation and interrogation of individual loci.

Author Summary
Autism spectrum disorders (ASDs) are common neurodevelopmental syndromes with a strong genetic component. ASDs are characterized by disturbances in social behavior, impaired verbal and nonverbal communication, as well as repetitive behaviors and/or a restricted range of interests. To identify genes likely to contribute to ASD etiology, we performed high density genotyping in 912 multiplex families from the Autism Genetics Resource Exchange (AGRE) collection and contrasted results to those obtained for 1,488 healthy controls. To enrich for variants most likely to interfere with gene function, we restricted our analyses to deletions and gains encompassing exons. Of the many genomic regions highlighted, 27 were seen to harbor rare variants in cases and not controls, both in the first phase of our analysis, and also in an independent replication cohort comprised of 859 cases and 1,051 controls. More work in a larger number of individuals will be required to determine which of the rare alleles highlighted here are indeed related to the ASDs and how they act to shape risk.

Results/Discussion
To characterize structural variation in ASD multiplex families and unrelated controls, we typed individuals at 561,466 SNP markers using Illumina HumanHap550 version 3 arrays. After excluding samples that failed to meet QC thresholds (see Table S1), we obtained array data on 3832 individuals from 912 multiplex families enrolled in the Autism Genetic Resource Exchange (AGRE) [18], 1070 disease-free children from the Children's Hospital of Philadelphia (CHOP), and 418 neurologically normal adults and seniors from the National Institute of Neurological Disorders and Stroke (NINDS) control collection [19]. Using the PennCNV software [20], we detected CNVs with a mean size of 59.9 Kb and mean frequency of 24.3 events per individual (see Table S2). Sensitivity compares favorably with previous BAC array-based [9,21] and SNP-based methods [8], in which mean resolution was observed to be in the range of Mbs and hundreds of Kbs, respectively.
As a first step towards validation of genotyping accuracy, we examined the inheritance of CNVs in the AGRE cohort. Consistent with high quality, 96.2% of CNV calls made in children were also detected in a parent. To explore the issue of genotyping accuracy further, we generated CNV calls for an independently generated data set in which an overlapping set of 2,518 AGRE samples were genotyped using the Affymetrix 5.0 platform [11]. For CNVs (>500 kb) in known ASD regions (e.g. 15q11-13, 16p11.2, and 22q11.21; Table 1) [8,11,21,22], we observed 100% correspondence between the two platforms for individuals genotyped on both platforms. For further confirmation of CNV calls, we compared de novo variants identified here to those highlighted in previous analyses of AGRE families. We identified all five de novo CNVs reported by Sebat et al [7], three of the five de novo CNVs reported by Szatmari et al [6], one de novo CNV within A2BP1 reported by Martin et al [23], and all five 16p11.2 de novo deletions reported by Weiss et al [11] and Kumar et al [10]. Of the two of thirteen de novo CNVs reported by Szatmari et al not detected as de novo in our study, one was very small (2 SNPs, 180 bp on 8p23.2), and the second clearly appears to be inherited (469 SNPs, 1.4 Mb on 17p12). Thus, our data are concordant with several other studies, and provide a more comprehensive picture of de novo CNVs in multiplex autism families. To further evaluate the quality of these data on another independent platform, we used TaqMan to determine relative copy number at 12 previously unreported de novo CNVs identified in AGRE probands, confirming 11/12 loci (Figure 1 and Table S3). Together these results suggest that the CNV calls we report are consistent and reliable.
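The inheritance check described above can be illustrated with a short sketch. It assumes CNV calls are available as (chromosome, start, end, type) records and treats a child's call as inherited when a parent carries a same-type call covering at least half of it; the 50% threshold and every name below are assumptions made for illustration, not the procedure used in the study.

```python
from typing import NamedTuple, Sequence


class CNV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # "del" or "dup"


def overlap_fraction(child: CNV, parent: CNV) -> float:
    """Fraction of the child's CNV that is covered by the parent's CNV."""
    if child.chrom != parent.chrom or child.kind != parent.kind:
        return 0.0
    shared = min(child.end, parent.end) - max(child.start, parent.start)
    return max(shared, 0) / (child.end - child.start)


def inherited_fraction(child_calls: Sequence[CNV],
                       parent_calls: Sequence[CNV],
                       min_overlap: float = 0.5) -> float:
    """Share of child calls matched by a parental call (hypothetical criterion)."""
    if not child_calls:
        return 0.0
    inherited = sum(
        any(overlap_fraction(c, p) >= min_overlap for p in parent_calls)
        for c in child_calls
    )
    return inherited / len(child_calls)
```

Calls left unmatched by such a check are candidate de novo events, or false positives, which is why independent validation is still needed.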
We therefore undertook additional analyses to identify specific loci in which structural variants were enriched in cases versus controls. Because the majority of such variants were intronic or intergenic, we sought to prioritize CNVs most likely to interfere with the molecular function of specific genes. We first filtered CNV calls to include only exonic deletions (eDels) observed to overlap with a RefSeq gene. Overall, such eDels were observed at similar frequencies in AGRE cases, first-degree relatives of AGRE cases, and unrelated controls (CHOP and NINDS cohorts), with an average of ~2 such variants per person (Table S2). To identify events related to the ASDs we then looked for genes harboring eDels in at least one case but no unrelated controls. Among the 284 genes that met these criteria (Table S4) we observed several known ASD or mental retardation genes including: ASPM [24], DPP10 [8], CNTNAP2 [25,26], PCDH9 [16], and NRXN1 [6].
To enrich for genes most likely to contribute to ASD risk, we used family-based calling to evaluate which of these genes carried eDels in three or more cases from at least two unrelated families (Table S5). This stringent filtering resulted in 72 genes at 55 loci, including NRXN1. This is notable, given that eleven distinct disease-linked NRXN1 variants have been identified [6,8,15,27,28]. Neurexin family members are known to interact functionally with ASD-related neuroligins [29–32], and likewise play an important role in synaptic specification and specialization [33,34]. eDels in more recently identified candidates, including DPP10 and PCDH9, were likewise retained. Similarly, recovery of RNF133 and RNF148 within intron 2 of CADPS2 [7,35] highlights additional complexity at this locus. Although CNV breakpoints cannot be mapped precisely using SNP data alone, it is possible to determine overlap with protein coding exons and use these data to predict impact on gene function. Consistent with perturbation of function, distinct alleles at the loci highlighted here are predicted to eliminate or truncate the corresponding protein products (Figure 2). Importantly, CNVs at a majority of these eDel loci show unique breakpoints in different families and/or result in the loss of distinct exons, demonstrating that they are independent. Moreover, because it is well established that CNVs at a subset of loci show identical breakpoints in unrelated individuals [10], this result is likely to underestimate the extent to which variants described here arose independently. Results from multi-dimensional scaling are likewise consistent with the interpretation that variants we highlight arose independently (Figure S1).
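The recurrence filter described above (three or more cases, at least two unrelated families, no controls) can be sketched as follows. The input layout, function name, and gene labels are hypothetical; the thresholds are the ones stated in the text.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def recurrent_case_specific_genes(
    case_calls: Iterable[Tuple[str, str, str]],
    control_genes: Set[str],
    min_cases: int = 3,
    min_families: int = 2,
) -> List[str]:
    """Return genes hit in >= min_cases affected individuals drawn from
    >= min_families families, and never hit in a control.

    case_calls holds (gene, family_id, individual_id) for affected carriers;
    control_genes holds every gene with a qualifying event in any control.
    """
    cases_by_gene = defaultdict(set)
    families_by_gene = defaultdict(set)
    for gene, family_id, individual_id in case_calls:
        cases_by_gene[gene].add(individual_id)
        families_by_gene[gene].add(family_id)
    return sorted(
        gene
        for gene in cases_by_gene
        if gene not in control_genes
        and len(cases_by_gene[gene]) >= min_cases
        and len(families_by_gene[gene]) >= min_families
    )
```

Counting distinct individuals and distinct families separately keeps a single large family from satisfying the filter on its own.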
Given the large number of variants identified, it was critically important to confirm, in an independent case-control analysis, how many of these eDels were truly overrepresented in cases, as opposed to being potentially attributable to Type I error. To address this concern, we sought to determine eDel frequency in these same genes in a replication dataset comprising 859 independently ascertained ASD cases and 1051 unrelated control subjects from the Autism Case Control cohort (ACC, see description in Methods). One third of the loci identified in the discovery phase were observed in one or more ACC controls (18/55; 32.7%), suggesting that while rare, eDels at these loci are not limited to ASD cases and family members. In contrast, and providing evidence for formal replication, 14 separate loci encompassing 22 genes were observed to carry eDels in both AGRE and ACC cases, but none of 2539 controls (Table S2).
Our replication data lend strong support to the involvement of specific loci in the ASDs (Table 2). However, to ensure that these results were not observed by chance alone, we performed 10,000 permutation trials on data from the replication cohort by permuting case/control status across individuals. In each permuted dataset, we maintained the same numbers of cases and controls as in the original data, and calculated the number of genes harboring CNVs exclusively in cases. None of the 10,000 permutation trials gave results comparable to experimental observations for replicated case-specific loci (n = 14; p<0.0001; Figure 3). In contrast, findings comparable to those for nonreplicated loci (highlighted as case-specific in the discovery phase but subsequently seen in replication controls) were seen in controls in 246/10,000 trials (n = 18; p = 0.02; Figure S2). Although additional experimental work in independent cohorts will be required to determine if variation in any of the genes highlighted here does in fact impact ASD risk, no more than 5 replicated loci would be predicted to be observed by chance alone.
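The label-permutation procedure described above can be sketched in a few lines. This is a simplified illustration that assumes carriers are stored per gene as sets of individual identifiers; the data layout and function names are assumptions, and only the overall logic (shuffle case/control labels while keeping group sizes fixed, then recount case-specific genes) follows the text.

```python
import random
from typing import Dict, Set, Tuple


def count_case_specific(carriers_by_gene: Dict[str, Set[str]],
                        is_case: Dict[str, bool]) -> int:
    """Number of genes whose carriers are all labeled as cases."""
    return sum(
        1
        for carriers in carriers_by_gene.values()
        if carriers and all(is_case[i] for i in carriers)
    )


def permutation_p_value(carriers_by_gene: Dict[str, Set[str]],
                        is_case: Dict[str, bool],
                        n_trials: int = 10_000,
                        seed: int = 0) -> Tuple[int, float]:
    """Return the observed count and the fraction of label permutations
    yielding at least as many case-specific genes."""
    rng = random.Random(seed)
    observed = count_case_specific(carriers_by_gene, is_case)
    individuals = list(is_case)
    labels = [is_case[i] for i in individuals]
    hits = 0
    for _ in range(n_trials):
        # Shuffling preserves the original numbers of cases and controls.
        rng.shuffle(labels)
        permuted = dict(zip(individuals, labels))
        if count_case_specific(carriers_by_gene, permuted) >= observed:
            hits += 1
    return observed, hits / n_trials
```

When no permutation reaches the observed count, the empirical fraction is zero and is conventionally reported as p < 1/n_trials, which corresponds to the p<0.0001 quoted for 10,000 trials.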
Despite the challenges associated with obtaining statistical support for individually rare events [7,36], we next sought to assign P values for replicated eDel loci. We were able to obtain support for each of the following loci: BZRAP1 at 17q22 (p = 8.0×10−4), NRXN1 at 2p16.3 (p = 3.3×10−4), MDGA2 at 14q21.3 (p = 1.3×10−4), MADCAM1 at 19p13 (p = 5.5×10−5), and a three gene locus at 15q11 (p = 1.3×10−11). CNV calls at each of 15q11 and 19p13 are highly error-prone, suggesting that results here be interpreted with caution (see footnotes C and F in Table 2). Recovery of NRXN1, however, provides confidence for involvement of additional loci that were likewise replicated. Benzodiazepine receptor (peripheral) associated protein 1 (BZRAP1, alternatively referred to as RIMBP1), is an adaptor molecule thought to regulate synaptic transmission by linking vesicular release machinery to voltage gated Ca2+ channels [37]. Identification of this synaptic component here, in a hypothesis-free manner, is particularly satisfying and also provides additional support for synaptic dysfunction in the ASDs [29,38]. Less is known about MDGA2 [39], although comparison of the predicted protein to all others within GenBank by BLASTP indicated an unexpectedly high similarity to Contactin 4 (24% identity over more than 500 amino acids; Expect = 3×10−39). Given previous reports of hemizygous loss of CNTN4 in individuals with mental retardation [40] and autism [17,41], the similarity between MDGA2 and CNTN4, surpassed only by resemblance to MDGA1, is notable. Likewise intriguing in light of the suggestion that common variation in cell adhesion molecules may contribute to autism risk [42] is the structural likeness of MDGA2 to members of this family of molecules.
Although some published analyses emphasize the greater contribution of gene deletion events in autism pathogenesis [7], there are also clear examples of duplications that strongly modulate ASD risk [43,44]. We therefore conducted a parallel analysis of duplications, distinguishing between events involving entire genes (gDups) which might increase dosage and those restricted to internal exons (eDups) which could give rise to a frameshift or map to a chromosomal region distinct from the reference gene. For gDups, we identified 449 genes that were duplicated in at least one AGRE case but no CHOP/NINDS controls (Table S4). Of those, 200 genes at an estimated 63 loci, including genes at 15q11.2 [43], met the more stringent criteria of being present in three or more cases from at least two independent families (Table S5). Of these, 11.5% (23/200) were also seen in ACC controls, whereas 24.5% (49/200) were case-specific in the replication cohort. Strong statistical support was obtained for established loci (e.g. p = 9.3×10−6 for UBE3A and other genes in the PWS/AS region at 15q11-q13), and nominal evidence was observed for the following novel loci: CD8A at 2p11.2 (p = 0.069),
For eDups, we reasoned that duplication of one or more internal exons could serve to disrupt the corresponding open reading frame and be predicted to impair gene function as a result. Despite the caveat that observed copy number gains need not map to the wild-type locus, known ASD genes including TSC2 [45] and RAI1 [44,46] within the Potocki-Lupski Syndrome critical interval were amongst the 159 loci observed in at least one AGRE case, but no CHOP/NINDS controls (Table S4). Such events were also seen in one family at the NLGN1 locus, which is of interest given previous support for NLGN3 and NLGN4 [29].
Filtering of these results, using the more stringent criteria employed above in consideration of eDels, limited this set of events to 76 loci observed in at least three cases from two separate families (Table S5). Interestingly, BZRAP1, reported above to harbor eDels at significantly higher frequencies in AGRE and ACC cases versus controls (p = 8.0×10−4), was amongst these, with eDups observed here in four unrelated AGRE cases (screening p = 0.021). Eight other genes, including the voltage gated potassium channel subunit KCNAB2 (p = 4.7×10−3), remained absent from ACC controls and were also replicated in the independent case cohort. Although eDups at BZRAP1 were not detected in ACC cases, eDels at this locus were replicated, underscoring the importance of variation here. When considering eDels and eDups at the BZRAP1 locus together, the likelihood of such an observation occurring by chance alone is small (p = 2.3×10−5).
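As a worked illustration of how a combined P value of this kind can arise, the sketch below computes a one-sided Fisher's exact test from the hypergeometric distribution. It assumes 12 carriers (8 eDels plus 4 eDups) among the 1,771 cases and none among the 2,539 controls, and it assumes a one-sided alternative; the source does not state the exact denominators or sidedness used, so these are assumptions rather than a reconstruction of the authors' calculation.

```python
from math import comb


def fisher_one_sided(case_carriers: int, n_cases: int,
                     control_carriers: int, n_controls: int) -> float:
    """P(at least `case_carriers` of all carriers fall among cases)
    under the hypergeometric null of no case/control difference."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    numerator = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return numerator / comb(total, carriers)


# Assumed counts: 12 case carriers of 1,771; 0 control carriers of 2,539.
p_combined = fisher_one_sided(12, 1771, 0, 2539)
print(f"{p_combined:.2e}")
```

Under these assumptions the result is on the order of 2×10−5, close to the combined value reported for BZRAP1. With zero carriers among controls, the calculation reduces to the probability that all twelve carriers are drawn from the case group.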
Although none of the variants we highlight were observed in any of 2539 unrelated controls, key events, including eDels at NRXN1, BZRAP1, and MDGA2, were observed in both cases and non-autistic family members (Figure 4). This is in keeping with previous work which suggests that haploinsufficiency at NRXN1 may contribute to the ASDs [15], but is insufficient to cause disease. Such data are also consistent with the well established finding of the "broader autism phenotype", such as subclinical language and social impairment in first degree relatives of cases with an ASD, which supports a multi-locus model [47,48]. We were also surprised to see that key variants at these loci appear to be transmitted to only a subset of affected individuals in some families (Figure 4). These observations parallel findings at other major effect loci including 16p11.2 [11] and DISC1 [49,50] and are consistent with a model in which multiple variants, common and rare, act in concert to shape clinical presentation [51–53]. Results are also consistent with the idea that true risk loci are likely to show incomplete penetrance and imperfect segregation with disease [13], a reality that will complicate gene finding efforts. Relatedly, substantial effort will be required to determine whether rare alleles of moderate effect act independently on distinct aspects of disease (endophenotype model) or together to undermine key processes in brain development (threshold model). How distinct alleles may interact to shape presentation is yet another question that will require larger cohorts along with multigenerational families to resolve [54].
By limiting CNV calls to include only exonic deletions (eDels) and duplications (eDups and gDups), we have attempted to enrich for variants most likely to impact gene function and in doing so improve the signal to noise ratio similar to work in other complex diseases [55]. At the same time, like other gene-based strategies, we preserve our ability to consider eDels involving the same transcriptional unit as separate but equivalent. Given that such events appear rare, this is an important consideration.
Pathway analysis by DAVID [56] found support for overrepresentation of cell adhesion molecules amongst recurrent eDel genes (uncorrected p = 0.002; CDH17, PCDH9, LAMA2, MADCAM1, NRXN1, POSTN, SPON2), although it should be noted that this analysis does not adjust for gene size and may favor larger genes. Nevertheless, aside from SPON2 no eDels in these genes were observed in any of the controls interrogated. In contrast, no evidence for such overrepresentation was observed for genes in the ubiquitin degradation pathway and neither term was highlighted as overrepresented amongst eDups or gDups. Given that this study focused only on events encompassing RefSeq exons, differences from Glessner and colleagues [17] are to be expected.

Figure 3. Observed replication unlikely to be attributable to chance alone. We performed 10,000 phenotype permutation trials on replication data and determined for each the number of loci harboring CNVs in cases but not controls. Thus, within each trial, the number of loci absent from controls in the replication cohort was determined. None of the permutation trials generated as many case-specific loci as observed in our actual dataset (n = 14; p<0.0001). doi:10.1371/journal.pgen.1000536.g003
Despite the large cohorts interrogated at each phase of our investigations, only a minority of loci (established or novel) were replicated between AGRE and ACC cases. For example, variants at each of the following previously reported loci were observed multiple times in AGRE cases but not once amongst ACC probands: PCDH10 and DPP10 (eDels), RAI1 and TSC2 (eDups), and DIDO1 (gDups). This suggests that even with current numbers, the present experiments are underpowered to obtain replication for a subset of recurrent variants. Because events seen only in single cases collectively account for a substantial fraction of observed variation, even larger cohorts still will be required for a thorough understanding of the genetic basis of complex disorders like the ASDs.

Table 2. A subset of eDel loci were observed to harbor rare variants in both discovery and replication cohorts, but none of 2539 controls.
eDel: exonic deletion; ACRD: autism chromosome rearrangement database (http://projects.tcag.ca/autism/). For eDels at a given locus, the ratio of unaffected carriers (siblings or parents) to total number of carriers (cases and family members).
C The significant difference in CNV frequency between AGRE and ACC cases (p = 2.6×10−6), along with multiple instances of similar variation in the DGV (see Tables S4 and S5), suggests that additional factors, including some potentially unrelated to diagnosis, may be relevant here. Sparse SNP coverage along with regional complexity (large segmental duplications) is also likely to increase false positives and false negatives at this locus. Replication data (and corresponding p value) are for OR4N4, as only one eDel at either LOC650137 or OR4M2 was observed amongst ACC cases.
D A comparable number of eDels were observed at multiple neighboring genes; carrier fraction corresponds to FLYWCH1, the lowest observed at this locus.
E Joint consideration of eDels (n = 8) and eDups (n = 4) at BZRAP1 further improves statistical support for this locus (p = 2.3×10−5).
F Note extreme telomeric position of this locus which may undermine/interfere with reliable calling of structural variants. CNV counts and carrier fraction correspond to MADCAM1; fewer variants were observed at C19orf19 amongst ACC cases and carrier fraction was higher than that for MADCAM1.
doi:10.1371/journal.pgen.1000536.t002
In summary, we have performed a high resolution genome-wide analysis to characterize the genomic landscape of copy number variation in ASDs. Through comparison of structural variation in 1,771 ASD cases and 2,539 controls and prioritization of events encompassing exons we identified more than 150 loci harboring rare variants in multiple probands but no control individuals. For each class of structural variant interrogated, the recovery of known loci serves to validate the methods employed and results obtained.
Greatest confidence should be placed in loci harboring variants in multiple unrelated cases but no controls and also recovered in both screening and replication cohorts. Amongst novel genes, best support was obtained for BZRAP1 and MDGA2, intriguing candidate genes for which additional study is warranted.

Sample ascertainment
For initial screening we assembled three sample collections: 1) 943 ASD families (4,444 unique subjects) from the Autism Genetic Resource Exchange (AGRE) collection; 2) 1,070 de-identified and unrelated children of European ancestry from the Children's Hospital of Philadelphia (CHOP), with no evidence of neurological disorders; 3) 542 unrelated neurologically normal adults and seniors of European ancestry from the National Institute of Neurological Disorders and Stroke (NINDS) control collection. The AGRE families include 917 multiplex families, 24 simplex families and 2 families without an ASD diagnosis. For all analyses, AGRE cases annotated with "Autism" (n = 1,463), "Broad Spectrum" (n = 149) or "Not Quite Autism" (n = 71) were treated equally and as affected. Samples from AGRE and NINDS were genotyped using DNA extracted from Epstein-Barr Virus (EBV)-transformed lymphoblastoid cell lines, while the CHOP controls were genotyped using DNA extracted from whole blood. All AGRE and control samples included in these analyses were genotyped on the Illumina HumanHap550 version 3 arrays, and 281 samples genotyped on version 1 arrays were excluded from the present analysis. Since the NINDS controls were genotyped at a different location and time, they were used to assess the frequency of specific CNVs in an independent cohort and to address concerns of cell line artifacts. This study was approved by the Institutional Review Board of Children's Hospital of Philadelphia. All subjects provided written informed consent for the collection of samples and subsequent analysis.
The Autism Case-Control (ACC) cohort included 859 cases from multiple sites within the United States, all of whom were of European ancestry affected with ASD. Of those, 703 were male and 156 were female; 828 met diagnostic criteria for autism, and 31 met criteria for other ASDs. Subjects ranged from 2-21 years of age when the Autism Diagnostic Interview (ADI) was given. Of the case subjects, 54% were from simplex families with the balance coming from multiplex families. The control group used for replication included 1051 children of self-reported Caucasian ancestry who had no history of ASDs. These controls were recruited by CHOP nursing and medical assistant staff under the direction of CHOP clinicians within the CHOP Health Care Network, including four primary care clinics and several group practices and outpatient practices that included well child visits.

Detection and annotation of copy number variation
For each data set, we applied identical and stringent quality control criteria to remove samples with low signal quality. CNV calls were generated using PennCNV [20], an algorithm which employs multiple sources of information, including total signal intensity, allelic intensity ratios, SNP allele frequencies, distance between neighboring SNPs, and family information to generate calls. We excluded samples meeting any of the following criteria: a) standard deviation for autosomal log R ratio values (LRR_SD) higher than 0.28, b) median B Allele Frequency (BAF_median) higher than 0.55 or lower than 0.45, c) fraction of markers with BAF values between 0.2 and 0.25 or 0.75 and 0.8 (BAF_drift) higher than 0.002. We also excluded from our analysis CNVs within IGLC1 (22q11.22), IGHG1 (14q32.33) and IGKC (2p11.2), and the T cell receptor constant chain locus (14q11.2), as well as CNVs in chromosomes showing evidence of heterosomic aberrations (chromosome rearrangements in sub-populations of cells) in BeadStudio.
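The three sample-level exclusion criteria listed above translate directly into a filter. The sketch below uses the thresholds stated in the text; the record layout and function name are hypothetical, and the per-sample summary statistics are assumed to have been computed already (for example, by the CNV caller).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SampleQC:
    sample_id: str
    lrr_sd: float      # standard deviation of autosomal log R ratio
    baf_median: float  # median B allele frequency
    baf_drift: float   # fraction of markers with BAF in 0.2-0.25 or 0.75-0.8


def passes_qc(sample: SampleQC,
              max_lrr_sd: float = 0.28,
              baf_median_range: Tuple[float, float] = (0.45, 0.55),
              max_baf_drift: float = 0.002) -> bool:
    """True if the sample meets none of the exclusion criteria a) to c)."""
    low, high = baf_median_range
    return (
        sample.lrr_sd <= max_lrr_sd
        and low <= sample.baf_median <= high
        and sample.baf_drift <= max_baf_drift
    )
```

A sample is retained only when all three conditions hold, mirroring the "meeting any of the following criteria" wording for exclusion.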
CNV calls were mapped onto genes by identifying overlap with RefSeq exons, the coordinates of which we obtained from the UCSC table browser. Deletion events overlapping with exons retrieved in this way were listed as eDels. eDups were defined as gains overlapping one or more coding exons and seen to be internal to the beginning and end of the corresponding transcript. Gains observed to encompass all exons for a given gene were annotated as gDups. P values for relative CNV burden in cases and controls were calculated at each locus by Fisher's exact test. P values presented in Tables S2, S4, and S5 have not been subjected to correction for multiple testing. To compare our CNV calls with other publications that have used AGRE families [10,11,21,22], we examined published calls on the same individuals with the same AGRE identifiers. The CNV calls were retrieved from the Supplementary Materials of each corresponding publication.
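The eDel, eDup, and gDup definitions above can be expressed as a small classifier. This is a minimal sketch that assumes half-open genomic coordinates and a simplified gene model holding transcript bounds and exon intervals; it does not distinguish coding from untranslated exons, and gains that overlap exons but extend beyond one end of the transcript without covering every exon are left unclassified. All names are illustrative.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    tx_start: int
    tx_end: int
    exons: List[Tuple[int, int]]  # half-open (start, end) intervals


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def classify(cnv_start: int, cnv_end: int, kind: str, gene: Gene) -> Optional[str]:
    """Label a CNV relative to one gene as 'eDel', 'eDup', 'gDup', or None."""
    hits_exon = any(overlaps(cnv_start, cnv_end, s, e) for s, e in gene.exons)
    if not hits_exon:
        return None  # intronic or intergenic event
    if kind == "del":
        return "eDel"
    # Gains: whole-gene duplication if every exon lies inside the CNV.
    if all(cnv_start <= s and e <= cnv_end for s, e in gene.exons):
        return "gDup"
    # Gains strictly internal to the transcript are exonic duplications.
    if gene.tx_start < cnv_start and cnv_end < gene.tx_end:
        return "eDup"
    return None
```

Checking the whole-gene case before the internal case keeps the two duplication classes mutually exclusive.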

Quantitative PCR for CNV validation
TaqMan primer/probe sets were designed to query random CNVs using FileBuilder 3.0 on the repeat-masked human genome (NCBI_36; March 2006 release; http://genome.ucsc.edu/). For each assay, 10 ng of genomic DNA was assayed in quadruplicate in 10-µL reactions containing 1× final concentration TaqMan Universal Master Mix (ABI part number 4304437), and 200 nM of each primer and probe. Cycling was performed under default conditions in 384-well optical PCR plates on an ABI 7900 machine. Copy number was defined as 2^(−ΔΔCT), where ΔCT is the difference in threshold cycles for the sample in question normalized against an endogenous reference (RNAseP) and expressed relative to the average values obtained by three arbitrary control DNAs. A list of TaqMan probes against the 12 CNVs tested is included in Table S3.

Figure S1 Multi-dimensional scaling plot of AGRE affected subjects, with red cross highlighting subjects carrying the eDels. Subjects of European ancestry are clustered toward the right side of the triangle. Found at: doi:10.1371/journal.pgen.1000536.s001 (0.11 MB DOC)

Figure S2 We performed 10,000 phenotype permutation trials on replication data and determined for each the number of loci harboring CNVs exclusively in controls. During each trial a new set of control-specific loci was identified and the number of these absent from cases determined. We observed results comparable to those obtained experimentally (n = 18) in 246 of 10,000 trials (p = 0.02). Found at: doi:10.1371/journal.pgen.1000536.s002 (0.03 MB DOC)

families. We also gratefully thank the children and their families who donated blood samples to CHOP for research purposes. We also acknowledge the technical staff at the Center for Applied Genomics for producing the genotypes used for analyses and members of the Bucan lab for performing additional resequencing experiments and validation experiments.