Conceived and designed the experiments: WECB JVR AB. Performed the experiments: DD G CP. Analyzed the data: WECB JVR DD HF CP RA VP. Contributed reagents/materials/analysis tools: HF. Wrote the paper: WECB DD ML. Oversaw ethics issues: ML. Oversaw project: RP AB.
WECB is a consultant for, and has shares in, Genizon Biosciences Inc. JVR, DYB, EG, HF, CP, RA, VP, ML, RJAP and AB are employees of Genizon Biosciences. These affiliations do not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials, as detailed online in the guide for authors:
We have examined the genomic distribution of large rare autosomal deletions in a sample of 440 parent-parent-child trios from the Quebec founder population (QFP), recruited for a study of Attention Deficit Hyperactivity Disorder.
DNA isolated from blood was genotyped on Illumina Hap300 arrays. PennCNV, combined with visual evaluation of images generated by the Beadstudio program, was used to define deletion boundaries with sufficient precision to discern independent events, with near-perfect concordance between parent and child in about 98% of the 399 events detected in the offspring; the remaining 7 deletions were considered
Nine of the 13 hotspots carry one gene (7 of which are very long), while the rest contain no known genes. All nine genes have been implicated in disease. The patterns of exon deletions support the proposed roles for some of these genes in human disease, such as
Recent improvements in microarray-based genotyping technology have led to significant advances in our understanding of the genetic contribution to common disease in the last few years. In addition to identification of chromosome regions carrying haplotypes putatively involved in conferring disease susceptibility, these studies have also allowed quantitative assessment of large-scale deletions and duplications (also known as copy number variations, CNVs) on a genome-wide basis. The number of individuals carrying a given CNV at a known gene locus is frequently small, so conventional association studies based on statistical analyses cannot always be performed. Nevertheless, discovery of these anecdotal changes can be useful since, for example, deletion of even a portion of one copy of a gene expected to have an important biological role can have important consequences and can ultimately lead to insights into disease etiology.
A number of publications have documented copy number variations (CNV) in a variety of population samples using data from genome-wide association studies. These contributions have revealed potentially important roles for CNVs in disorders such as autism, schizophrenia and other neurological conditions
Genizon Biosciences has recently used the Illumina Infinium Hap300 platform to complete genome-wide studies of 550 attention-deficit hyperactivity disorder (ADHD) trios, 540 endometriosis (EN) simplex patients and 480 schizophrenia (SZ) simplex patients, as well as 640 individuals recruited in a longevity study (LG; this sample comprised individuals more than 95 years of age and was genotyped on the Illumina Hap550 array). As a first step in determining the possible role of rare autosomal deletions in our samples, we evaluated fluorescence intensities as presented by the Beadstudio program, using a combination of computer-based scanning and individual assessment by a human observer. The trio structure of the ADHD study allowed mutual verification of each transmitted deletion in both parent and child, and we were able to define precisely the SNPs situated within each deletion. We found remarkable patterns of clustering, with the frequency of independent deletion events per unit length varying as much as 100-fold across the genome. The points of greatest deletion frequency fell within about 15 regions, mostly about 0.4–0.6 Mb in length; 13 of these were evaluated in the other cohorts and found to be similarly enriched for deletions. Four of the regions had no genes whereas nine harbored a single gene or gene region, all nine of which have been implicated in disease. The patterns of exon loss vs. retention allowed insights into the role, or lack thereof, of the genes in the respective diseases.
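The strength of the clustering can be gauged against a uniform null model. The following is a minimal sketch, not the authors' analysis: it assumes an autosomal length of roughly 2,875 Mb (a round figure not given in the text), treats the 399 offspring deletions as an illustrative genome-wide total, tiles the autosomes into non-overlapping 0.5 Mb windows, and ignores differences in SNP density between windows. The names `poisson_tail`, `AUTOSOME_MB` and the threshold of four deletions are chosen here for illustration.

```python
import math

def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    lower = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - lower

# Assumed autosomal genome length in Mb (not stated in the paper).
AUTOSOME_MB = 2875.0
# Approximate hotspot size reported in the text.
WINDOW_MB = 0.5
# Offspring deletions reported in the text, used as an illustrative total.
N_DELETIONS = 399

# Expected deletions per window if deletions fell uniformly at random.
lam = N_DELETIONS * WINDOW_MB / AUTOSOME_MB
n_windows = AUTOSOME_MB / WINDOW_MB

# Expected number of windows holding four or more deletions under the null.
expected_windows = n_windows * poisson_tail(4, lam)
print(f"lambda per window: {lam:.4f}")
print(f"expected windows with >= 4 deletions: {expected_windows:.4f}")
```

Under these assumptions far fewer than one window would be expected to hold four or more deletions by chance, which is why regions with four or more independent events stand out.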
The number of transmitted and non-transmitted autosomal deletions in the ADHD trios: The analysis of the offspring as described in the
In the second round we examined the deletions called by PennCNV in the parents, and found 549 independent rare deletions which were visually affirmed, of which 49 were present in a second parental sample for a total of 598. Of these 598, 270 had called, affirmed deletions in their offspring from the first round. The samples of the offspring of each of the other 328 parents were visually examined in the Beadstudio application for any uncalled deletions corresponding to that found in the parent. Thirty-three were found, of which 6 had already been found in other trios. (Possible reasons for the lower rate of apparent false negatives in the second round are discussed in
This iterative process therefore revealed a total of 399 deletions, counting the
Distribution of the deletions in the genome: A striking observation is that many of the deletions cluster within a limited number of chromosomal regions. Examples are shown in
Two regions with particularly high frequency of deletions are presented, using the UCSC Genome Browser. A. Chromosome 20. The hotspot in 20p12 was the most unstable, with 27 independent deletions in four population samples comprising 2540 individuals. A second hotspot was observed on this chromosome, at about 40.6 Mb. B. First 25 Mb of Chromosome 9. No
Similar deviation from uniform distribution was seen in several other regions, which generally ranged from 0.4–0.6 Mb in length (
Number of independent deletions found (number of deletions affecting coding exons)
Region location | Limits of region (Mb) | Gene | ADHD, transmitted | ADHD, non-transmitted | ADHD, independent in all chromosomes | EN | SZ | LG |
1q43 | 235.2–235.7 | none | 4 (na) | 0 (na) | 4 | 1 (na) | 1 (na) | 1 (na) |
2p16.3 | 50.7–51.3 | | 2 (1) | 1 (0) | 3 | 2 (0) | 7 (2) | 3 (0) |
6q26 | 162.5–162.9 | | 4 (2) | 5 (4) | 9 | 1 (0) | 3 (2) | 6 (1) |
8p23.2 | 4.3–4.9 | | 3 (0) | 1 (0) | 3 | 3 (0) | 2 (0) | 3 (0) |
8p23.2 | 5.6–6.2 | none | 2 (na) | 2 (na) | 4 | 3 (na) | 3 (na) | 1 (na) |
8p22 | 15.2–15.6 | | 4 (1) | 1 (0) | 5 | 0 (0) | 0 (0) | 1 (0) |
9p23 | 11.7–12.2 | none | 5 (na) | 5 (na) | 9 | 7 (na) | 8 (na) | 13 (na) |
9p21.1 | 30.4–30.7 | none | 3 (na) | 0 (na) | 3 | 1 (na) | 3 (na) | 1 (na) |
10q21.3 | 67.8–68.2 | | 3 (3) | 5 (4) | 7 | 8 (4) | 4 (2) | 8 (4) |
13q31.1 | 83.1–83.7 | | 4 (0) | 4 (0) | 8 | 2 (0) | 2 (0) | 3 (0) |
16p13.2 | 6.6–7.0 | | 0 (0) | 4 (0) | 4 | 1 (0) | 4 (1) | 4 (1) |
20p12.1 | 14.5–15.1 | | 8 (0) | 4 (0) | 11 | 9 (3) | 6 (0) | 3 (0) |
20q12 | 40.5–40.7 | | 4 (0) | 0 (0) | 4 | 2 (0) | 0 (0) | 1 (0) |
ADHD, attention deficit hyperactivity disorder, 440 parent-parent-child trios; EN, 540 endometriosis simplex patients; SZ, 480 schizophrenia simplex patients; LG, 640 individuals over 95 years of age.
*includes 4 de novo deletions, three in 20p12.1 and one in 2p16.3.
The genomic distribution of deletions in the non-transmitted chromosomes was then examined and again these were found to cluster in hotspot regions. Eight of the 11 loci in the ‘transmitted’ list of hotspots were also in the list of non-transmitted deletion hotspots, and only one locus (chr16p13.2) with no transmitted deletions was represented on the non-transmitted list of hotspots (
To determine the universality of these hotspot domains, 13 of the domains with three or more deletions in the entire ADHD study were assessed for deletions present in other samples. The PennCNV algorithm was used to identify candidate deletions within Genizon's SZ and EN samples, and subsequently the visual inspection protocol developed for ADHD was used to confirm these loci (
A third sample, from the longevity study, which had been genotyped with the Illumina 550k microarray, was analyzed separately, and again clustering was observed within the same hotspots (
The mean length of the 135 different clustered deletions found across all samples was 94 kb, very close to the mean for the non-hotspot (non-HS) transmitted ADHD deletions, but the median was 54 kb, nearly twice that of the non-HS group. The standard deviation was much greater in the non-HS group (219 kb vs. 111 kb). These numbers reflect a greater uniformity of length among deletions found in hotspots: 68% were between 20 and 200 kb, compared with 50.2% of non-HS deletions (chi-square 4.01, p<0.05). This length range corresponds very approximately to the length of chromatin loops, and may reflect some aspect of the mechanism underlying this high-frequency deletion phenomenon.
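The reported significance level can be checked from the statistic alone. The sketch below assumes the comparison was a Pearson chi-square test on a 2x2 table (hotspot vs. non-HS, inside vs. outside the 20–200 kb range), which has one degree of freedom; the text does not state the test variant or the number of non-HS deletions, so the underlying table is not reconstructed here. The function names are illustrative.

```python
import math

def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df, P(X > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# p-value implied by the reported statistic, assuming 1 degree of freedom.
p_reported = chi2_sf_1df(4.01)
print(f"p = {p_reported:.3f}")
```

With the actual counts of deletions inside and outside the size range in each group, `chi2_2x2` would give the statistic itself.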
We were able to trace the parent-of-origin for five of the
With respect to gene content, a remarkable characteristic of the hotspots is the length of the genes therein. Nine of the domains have one and only one gene, and 7 of these genes rank among the 35 longest genes in the genome (according to gene annotations in the UCSC Genome Browser), including 4 in the top 8. The regions of greatest deletion frequency lie in the very long introns of the long genes, usually at the 5′ end. Interestingly, the DMD hotspot extending from exon 44 to exon 53, which is comparable to those described here, lies in the longest gene. In any event, it is clear that the coding sequence density is, on average, very low in the hotspots. As noted below, none of the sequence motifs or functional entities known or suspected to be involved in chromosomal instability is present in abundance in any of the hotspots. Furthermore, the proximity of fragile sites, which overlap with only one of the 13 regions we identify (6q26), does not explain the major part of this instability.
We have documented the presence of 13 regions of about 0.5 Mb each in the human genome where deletions occur at up to 100-fold higher frequency than in the remaining 99.8% of the genome. We also present suggestive evidence for the existence of as many as 30 or more hotspots throughout the genome (Poisson distribution,
By filtering out frequently observed deletions, we eliminated from consideration common deletions existing in the population (by analogy with SNPs, probably those which arose a relatively long time ago and against which there is no negative selection) as well as those rare but recurring deletions arising between repeated elements such as segmental duplications. In so doing, we maximized the chance of finding regions where deletions frequently occur for reasons other than the presence of repeated elements, such as those in 16p11.2 associated with autism
The hotspot in 20p12.1 is particularly noteworthy. A total of 27 independent deletions were documented in the 4 population samples, 25 of which were seen only once. The exactitude of the deletion borders is not in question, as the log R ratio graphs show how cleanly the first and last SNP of each deletion could be called (illustrated in
The offspring shows an inherited deletion from the father, as well as a
Although this is to our knowledge the first such genome-wide description of genomic instability at this level, our conclusions are supported by published work in a variety of ways. First, deletions in two of the genes residing in hotspots have been intensively studied by many groups because of their roles in disease. Deletions in the
Second, in a genome-wide study similar to ours
It is of interest to place these findings in the context of what is known regarding hotspots of chromatin instability, and deletions in particular. One of the few such regions which are well enough characterized to allow estimates of deletion frequency is in the
In our study, we observed 5
This work raises a number of intriguing questions at the fundamental level, one being, why do these hotspots exist? It may be that as a consequence of some form of stress, a chromatin loop may escape its natural confines within the highly organized and compact nuclear structure, and this event simply happens much more often at these sites. Alternatively, these high-frequency deletions may reflect some protective element, for which positive selection has occurred. It is of note that these scenarios are not mutually exclusive, in that there may exist situations of stress where a chromatin domain may (or must) undergo deletion; it would be to the organism's advantage if the deletion occurred in a DNA domain of low coding sequence density. In this way the hotspots we have characterized could be considered as hypothetical safety valves.
A second question that can be raised concerns the molecular mechanism of the high frequency of deletions. Many of the chromosomal elements such as low copy repeats (LCR), and segmental duplications (SD) which have been associated with structural alterations identified in diseases such as autism, neurofibromatosis and Sotos syndrome (OMIM) have been ruled out in the case of the DMD hotspot
At a more applied level, these data also have implications for gene-disease associations. The finding of rare deletions in or near coding sequences, especially if they arise
Nine of the documented hotspots carry genes, and every one has been implicated in disease. We propose that a careful delineation of precise deletion (or amplification) boundaries in and around these genes will be useful, since at least some of the deletions may be present simply due to the unstable nature of the chromosomal domain rather than because they contributed to the phenotype by affecting gene function. In our samples, exons were unaffected in three of the nine genes, perhaps reflecting important roles for these genes in human health; however because of small numbers involved we cannot draw conclusions from this information. Nevertheless, the patterns of exon disruption in the other genes are somewhat informative, and the following paragraphs present some examples.
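The exon counts behind this statement can be tallied directly from the hotspot table. The following is a minimal sketch, assuming that the nine table rows not marked "none" in the Gene column are the gene-carrying hotspots. Each cell is copied as printed, in the order transmitted, non-transmitted, EN, SZ, LG; the dictionary keys and helper names are labels chosen for illustration.

```python
import re

# Cells are "independent deletions (deletions affecting coding exons)".
# Column order: ADHD transmitted, ADHD non-transmitted, EN, SZ, LG.
GENE_HOTSPOTS = {
    "2p16.3": ["2 (1)", "1 (0)", "2 (0)", "7 (2)", "3 (0)"],
    "6q26": ["4 (2)", "5 (4)", "1 (0)", "3 (2)", "6 (1)"],
    "8p23.2 (4.3-4.9 Mb)": ["3 (0)", "1 (0)", "3 (0)", "2 (0)", "3 (0)"],
    "8p22": ["4 (1)", "1 (0)", "0 (0)", "0 (0)", "1 (0)"],
    "10q21.3": ["3 (3)", "5 (4)", "8 (4)", "4 (2)", "8 (4)"],
    "13q31.1": ["4 (0)", "4 (0)", "2 (0)", "2 (0)", "3 (0)"],
    "16p13.2": ["0 (0)", "4 (0)", "1 (0)", "4 (1)", "4 (1)"],
    "20p12.1": ["8 (0)", "4 (0)", "9 (3)", "6 (0)", "3 (0)"],
    "20q12": ["4 (0)", "0 (0)", "2 (0)", "0 (0)", "1 (0)"],
}

CELL = re.compile(r"(\d+)\s*\((\d+)\)")

def exon_hits(cell: str) -> int:
    """Extract the exon-affecting count from a cell such as '7 (2)'."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(2))

# Total exon-affecting deletions per hotspot across all five columns.
exon_totals = {
    region: sum(exon_hits(cell) for cell in cells)
    for region, cells in GENE_HOTSPOTS.items()
}

# Hotspots in which no observed deletion touched a coding exon.
exon_free = [region for region, total in exon_totals.items() if total == 0]
print(exon_free)
```

The tally returns three exon-free hotspots, in agreement with the "three of the nine genes" quoted in the text.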
Deletions in this gene have been implicated in neurological disorders including autism and mental retardation in anecdotal fashion
One report suggests this gene as a candidate for Kabuki syndrome, since a
This region of the genome has also been implicated in colorectal cancer, with the report
The hotspot on chromosome 10 falls in the 3′ half of this gene. Exons were affected in a substantial proportion of the deletions, including four such deletions which would produce a frameshift in the LG cohort (all 4 subjects were mentally alert). This gene has been associated with late-onset Alzheimer's disease in women by genetic studies
The association of this gene with familial early-onset PD is well established, since it is homozygously mutated in about 50% of such cases
One of the hallmarks of a tumor-suppressor gene (TSG) is the presence of deletions in tumors which affect coding sequence, as was seen with the prototypical TSG, RB1
In general, therefore, the existence of hotspots with the properties we present here should be incorporated into any interpretation of deletion data concerning the genes associated with these hotspots. In some instances, it may become appropriate to incorporate exon-dosage assays in evaluating individuals' risk and potential treatment scenarios.
Ethical approval was obtained from Ethica, Montreal, for all stages of recruitment and data generation
PennCNV
Determining CNV boundaries with precision is a significant issue in calling structural variations in the genome
The ultimate goal of this work was to detect genomic deletions potentially associated with disease, so we anticipated that the pertinent alterations might be rare, as has been found for schizophrenia (SZ)
Text, figures and tables presenting information not shown in the main text.
We thank Jan Mellegers for intellectual contributions and commentary on the results.