Enhancer Chip: Detecting Human Copy Number Variations in Regulatory Elements

Critical functional properties are embedded in the non-coding portion of the human genome. Recent studies have shown that variations in distant-acting gene enhancer sequences can contribute to disease. In fact, various disorders, such as thalassaemias, preaxial polydactyly or susceptibility to Hirschsprung’s disease, may be the result of rearrangements of enhancer elements. We have analyzed the distribution of enhancer loci in the genome and compared their localization to that of previously described copy-number variations (CNVs). These data suggest negative selection against copy number variable enhancers. To identify CNVs covering enhancer elements, we have developed a simple and cost-effective test. Here we describe the gene selection, design strategy and experimental validation of a customized oligonucleotide Array-Based Comparative Genomic Hybridization (aCGH) platform, designated Enhancer Chip. It has been designed to investigate CNVs, allowing the analysis of the whole genome at a 300 Kb resolution and of specific disease regions (telomeres, centromeres and selected disease loci) at a tenfold higher resolution. Moreover, this is the first aCGH able to test over 1,250 enhancers in order to investigate their potential pathogenic role. Validation experiments have shown that Enhancer Chip efficiently detects duplications and deletions covering enhancer loci, demonstrating that it is a powerful instrument to detect and characterize copy number variable enhancers.


Introduction
Recently, researchers have been focusing their efforts on the study of the non-coding part of human DNA and, in particular, on its predicted role in the regulation of gene expression [1]. Comparative sequence analysis has proved to be a valuable instrument to identify regulatory elements that have been highly conserved throughout evolution [2], many of these being non-coding sequences shown to act as enhancers in experimental models [3,4].
A database of human and mouse non-coding fragments with gene enhancer activity has been developed [4]. The VISTA Enhancer Browser is a public resource that provides access to conserved sequence elements tested for enhancer activity [5]. The database contains human candidate regions identified either by their conservation between human and non-mammalian vertebrates across long (chicken and frog) or extremely long (pufferfish and zebrafish) evolutionary distances or by their unusually high conservation among mammals, such as 'ultra'-conservation (100% identity for at least 200 bp between human, mouse and rat) [5,6]. Moreover, putative enhancers have been assayed for their capacity to drive reporter gene expression in a transgenic mouse model: positive enhancers are elements that drive reporter gene expression at mouse embryonic day 11.5 (E11.5); negative enhancers are not functional at E11.5, even though they could act as enhancers at different time points or in different physiological conditions, or their activity could depend on the presence of additional cis-regulatory elements.
Chromosomal rearrangements or deletions may lead to a disturbance of long-range control and, as a consequence, to pathological conditions [7]. Up to now, several alterations in enhancer structure or DNA sequence have been found to be causative of human diseases. For example, thalassaemias may be the result of deletions or rearrangements of β-globin gene (HBB) enhancers [8], sonic hedgehog (SHH) limb-enhancer point mutations can cause preaxial polydactyly [9] and a susceptibility to Hirschsprung's disease has been associated with a RET proto-oncogene enhancer variant [10]. Moreover, a large number of disease susceptibility regions overlapping non-coding intervals have been mapped in genome-wide association studies (GWAS) [11].
While impressive results have been obtained in the discovery and mapping of tissue-specific enhancers [12], the analysis of CNVs covering these elements and their correlation with the phenotype has been hampered by the lack of a method able to detect them.
Recently, Array-Based Comparative Genomic Hybridization (aCGH) has been found to detect causative alterations in 11% to 15% of examined patients with unexplained developmental delay/intellectual disability (DD/ID), autism spectrum disorders (ASD), and multiple congenital anomalies (MCA) [13]. Copy number variations in regions not investigated so far could be responsible for other undiagnosed cases.
Moreover, enhancers have been demonstrated to be located near genes active during development [11], suggesting their involvement in the regulation of these, often disease related, genes.
CNVs encompassing enhancer noncoding sequences could in this way affect target gene expression, causing human disorders.
To characterize CNVs overlapping VISTA enhancer loci, we have compared the coordinates of human VISTA enhancer loci with CNVs deposited in the Database of Genomic Variants (DGV), with Indels (small insertions and deletions of 100 bp-1 kb length) and with two highly polymorphic sets of deleted and duplicated regions. We have shown that highly polymorphic CNVs are under negative selection at VISTA enhancer loci, suggesting that copy number variable enhancers could represent functional variants. Array-CGH represents a reasonable, cost-effective instrument to investigate multiple DNA regions. To confirm the functional relevance of enhancers and to verify whether dysmorphic features or mental retardation could be associated with rare or private duplications and deletions in these elements, we have designed the Enhancer Chip custom array that is described below.
The number of enhancer loci and the fraction of the genome covered by CNV regions were calculated using the "feature coverage" and "base coverage" tools available on Galaxy, a web portal for large-scale interactive data analyses [18].
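The two coverage measures named above can be illustrated with a minimal Python sketch. It is not the Galaxy implementation: the function names, the half-open `(chrom, start, end)` interval convention and the toy coordinates are all assumptions made for illustration. It counts how many enhancers are touched by at least one CNV and what fraction of enhancer bases is covered.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) half-open intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(s) for s in out]
    return merged

def covered_bases(feature, merged_cnvs):
    """Number of bases of one feature that fall inside the merged CNV set."""
    chrom, start, end = feature
    total = 0
    for c_start, c_end in merged_cnvs.get(chrom, []):
        total += max(0, min(end, c_end) - max(start, c_start))
    return total

def coverage_summary(enhancers, cnvs):
    """Return (number of enhancers overlapped, fraction of enhancer bases covered)."""
    merged = merge_intervals(cnvs)
    per_feature = [covered_bases(e, merged) for e in enhancers]
    n_overlapped = sum(1 for b in per_feature if b > 0)
    total_len = sum(end - start for _, start, end in enhancers)
    return n_overlapped, sum(per_feature) / total_len

# Hypothetical toy coordinates, not taken from VISTA or DGV.
toy_enhancers = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
toy_cnvs = [("chr1", 150, 250), ("chr1", 180, 300), ("chr2", 0, 40)]
n_hit, base_fraction = coverage_summary(toy_enhancers, toy_cnvs)
```

Applying the same base-coverage calculation to the whole genome instead of the enhancer set would give the genome-wide fraction against which the enhancer fraction is compared.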

Array CGH Design
Enhancer Chip design was developed using the Agilent platform and the SurePrint G3 8×60K format.
Biological features were randomly distributed on the microarray, and the routinely used Human CGH 1K Agilent Normalization Probe Group (1,262 features) and the Human CGH 1K Agilent Replicate Probe Group (5,000 features) were also included in the design.

Loci Selection
The Enhancer Chip array was designed to provide redundancy with high sensitivity and specificity for detection of clinically significant unbalanced chromosomal abnormalities, while minimizing detection of non-pathogenic CNVs or CNVs of uncertain significance. To this aim, we selected 322 diseases related to developmental delay or congenital physical anomalies, listed in Table S2. Moreover, to further characterize CNVs overlapping VISTA enhancer loci and verify whether physical anomalies or mental retardation could be associated with aberrations in these elements, we also selected 1,276 putative enhancers contained in the VISTA enhancer database in July 2010, excluding those (5 enhancers) for which no Agilent probes were available.

Array Design Strategy
For Enhancer Chip we designed 25,000 probes to cover the whole genome (backbone) with an average spacing of 100 Kb and a resolution of 300 Kb (Figure S1a). We also added 18,000 probes to cover regions of interest (telomeres, centromeres, and selected disease loci) with an average resolution of about 40 Kb (Figure S1b and c). This design fulfills these suggestions. Moreover, 7,790 probes were designed to investigate VISTA enhancer loci. The number of probes on each element depends on enhancer length, with at least 3 probes for each enhancer and an average spacing of 238 nt (Figure S1d). The residual free space on the array (2,853 features; approximately 4.8%) was randomly filled with probes from the commercially available Agilent Human Genome 44K array CGH.
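As a quick check of the feature budget, the probe counts quoted in the text can be summed in a few lines of Python. This is only an arithmetic sketch: it assumes that the six probe groups listed in this and the preceding section make up the whole array, which the text does not state explicitly, and the dictionary keys are illustrative labels.

```python
# Probe counts as quoted in the text; the grouping into one total is an assumption.
probe_groups = {
    "backbone": 25_000,
    "regions_of_interest": 18_000,
    "vista_enhancers": 7_790,
    "filler_from_44K": 2_853,
    "normalization_group": 1_262,
    "replicate_group": 5_000,
}

total_features = sum(probe_groups.values())
filler_percent = 100 * probe_groups["filler_from_44K"] / total_features

print(f"total features: {total_features}")
print(f"filler share: {filler_percent:.1f}%")
```

Under this assumption the total is 59,905 features and the filler probes account for about 4.8% of them, which agrees with the percentage given above.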

Reference DNA Samples
Anonymous blood samples were collected from healthy, unrelated Italian individuals. All subjects studied entered the diagnostic centers of Naples or Rome and signed an appropriate consent form for genetic testing as well as forms related to privacy of data. Approval for the study was obtained from the Seconda Università di Napoli Ethics Committee (prot. 862/08). Genomic DNA was extracted using standard procedures.
After amelogenin-based sex confirmation by PCR, the concentration and purity of DNA samples were assessed using a Nanodrop ND 1000 (Thermo Scientific Inc., USA) by evaluating the A260/A280 and A260/A230 ratios to exclude contaminating proteins or other organic compounds. After dilution to a final concentration of 100 ng/µL, six sex-matched DNA samples were pooled together and used as male or female reference DNA samples in array CGH experiments.

DNA Samples
For validation experiments, we used DNA samples from patients in whom genomic deletions or duplications had previously been detected with alternative diagnostic methods. In addition, we used DNA samples from patients with developmental delay and/or congenital anomalies but without a molecular diagnosis. Further DNA samples from healthy individuals were also analyzed.
In particular, forty DNA samples were collected at Università Cattolica del Sacro Cuore (Rome) and seven at the Seconda Università di Napoli (Naples).
For all DNA samples, sex was confirmed by amelogenin-based PCR assay. Concentration and purity of each DNA sample were assessed using a Nanodrop ND 1000 (Thermo Scientific Inc., USA). After dilution to a final concentration of 100 ng/µL, all DNA samples were blindly tested in array CGH experiments.

Array CGH Hybridization and Analysis
Labeling and hybridization were performed according to the manufacturer's specifications (Agilent Oligonucleotide Array-Based CGH for Genomic DNA Analysis protocol, version 6.1; Agilent Technologies, USA). Scanned array images were analyzed with Feature Extraction software (version 10.5.1.1; Agilent Technologies, USA). Graphical overview and analysis of data were obtained using DNA Analytics as part of Agilent Genomic Workbench software (version 5.0; Agilent Technologies, USA), evaluating the quality of each test with the quality control (QC) metrics generated with DNA Analytics software. For identifying duplications and deletions we used the standard set-up of the Aberration Detection Method 2 (ADM-2) algorithm for the data that passed QC metrics testing. The threshold applied to the algorithm was empirically chosen for each test. An aberration filter was set to select aberrant regions with at least 3 targets showing the same direction in copy-number change and to exclude aberrant regions if the average log2 ratio within the region was less than the value of Derivative Log Ratio spread (DLRSpread). Variants not known to be pathogenic were compared with the Database of Genomic Variants (http://projects.tcag.ca/variation/) and with the Decipher database (http://decipher.sanger.ac.uk/) to facilitate interpretation.
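The aberration filter described above can be sketched in Python as follows. This is not the Agilent Genomic Workbench implementation: the `Region` class, the function name and the example values are hypothetical, and two readings of the text are assumptions, namely that "same direction" means the sign of a probe's log2 ratio matches the sign of the region mean, and that the mean is compared with DLRSpread as an absolute value.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Region:
    """A candidate aberrant region with one log2 ratio per probe (hypothetical type)."""
    chrom: str
    start: int
    end: int
    log2_ratios: list = field(default_factory=list)

def passes_filter(region, dlr_spread, min_probes=3):
    """Keep a region only if enough probes agree and the mean signal exceeds DLRSpread."""
    ratios = region.log2_ratios
    if len(ratios) < min_probes:
        return False
    avg = mean(ratios)
    # Assumption: a probe "agrees" when its sign matches the sign of the region mean.
    n_agree = sum(1 for r in ratios if r * avg > 0)
    if n_agree < min_probes:
        return False
    # Assumption: the comparison uses the absolute average log2 ratio.
    return abs(avg) >= dlr_spread

# Hypothetical example values.
deletion = Region("chr7", 1_000, 5_000, [-0.9, -1.1, -1.0])
weak_gain = Region("chr2", 2_000, 9_000, [0.10, 0.12, 0.09])
too_short = Region("chrX", 100, 900, [-1.0, -1.2])

kept = [r for r in (deletion, weak_gain, too_short) if passes_filter(r, dlr_spread=0.2)]
```

With a DLRSpread of 0.2, only the first region is kept: the second has a mean log2 ratio below the noise estimate and the third is supported by fewer than three probes.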
All validation experiments were carried out by comparison with male or female reference DNA samples obtained by pooling of six sex-matched genomic DNA samples from healthy and unrelated individuals of the same ethnic origin [19].

Real Time PCR, Long-range PCR and DNA Sequencing
Real Time PCR reactions were performed using Bio-Rad iQ SYBR Green Supermix with 1 ng of DNA, according to the manufacturer's specifications. The PCR conditions were the following: 96 °C × 1 min; 45 cycles of 96 °C × 30 s, 62 °C × 30 s and 68 °C × 30 s, with 72 °C × 12 min as final step.
Long-range PCR reactions were performed using 100 ng DNA, 0.5 µM of each primer, 400 µM of each dNTP, and 1.5 units TaKaRa LA Taq in 1X LA PCR Buffer II (TaKaRa Bio, Inc., Japan). The PCR conditions were the following: 96 °C × 1 min; 30 cycles of 96 °C × 30 s, 62 °C × 1 min and 68 °C × 4 min plus 5 s/cycle, with 72 °C × 12 min as final step.
PCR products were double-strand sequenced using BigDye Terminator sequencing chemistry (Life Technologies, USA) and analyzed on an ABI 3130xl automatic DNA sequencer (Applied Biosystems, USA).
Next, we determined whether the overlap of the CNVs and enhancer loci was random (null hypothesis) or whether the CNVs were underrepresented at these loci (alternative hypothesis). To test these hypotheses, we compared the fractions of the enhancer loci and the fractions of the genome covered by the differentially defined CNV regions. Figure 1 (b and c) shows that the fractions of the enhancer loci (0.19% and 0.33%) covered by the two sets of "polymorphic" CNVs are at least five times lower than the fractions of the genome covered (1.05% and 2.10%). The 34,186 small Indels deposited in the DGV (with a genomic coverage of 0.3%) also overlap the VISTA enhancer loci about two-fold less than expected (0.16%, corresponding to 8 enhancers; see Figure 1b and Table 2 for a complete list).
These data demonstrate a negative selection of highly polymorphic CNVs and of small Indels at enhancer loci.
The CNV purification effect is weaker if one compares the fraction of the enhancers (24%) covered by the "DGV-deposited" CNVs with the fraction of the genome covered by those CNVs (30%) (Figure 1a). As already discussed in recently published papers [16,20], the low purifying effect observed for the "DGV-deposited" CNVs suggests that some of these CNVs are very rare or private or could be false positive artifacts.
Finally, not only common CNVs but also CNVs implicated in specific diseases can affect enhancer loci and thus can play an important role in pathogenesis. We have identified 77 enhancers (Table S4) located in chromosomal regions implicated in microdeletion/microduplication syndromes (DECIPHER v5.0) [15]. The role of enhancer CNVs in the pathogenesis of these conditions has never been investigated.
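The depletion ratios behind these comparisons can be reproduced with a short Python sketch. The percentages are those quoted in the text; the pairing of 0.19% with 1.05% and of 0.33% with 2.10% is an assumption based on the order in which the values are listed, and the set labels are illustrative.

```python
# (percent of enhancer loci covered, percent of genome covered), as quoted in the text.
coverage_percent = {
    "polymorphic CNV set 1": (0.19, 1.05),
    "polymorphic CNV set 2": (0.33, 2.10),
    "DGV small Indels": (0.16, 0.30),
    "DGV-deposited CNVs": (24.0, 30.0),
}

def fold_depletion(enhancer_pct, genome_pct):
    """How many times lower the enhancer coverage is than the genome-wide coverage."""
    return genome_pct / enhancer_pct

depletion = {
    name: fold_depletion(enh, gen) for name, (enh, gen) in coverage_percent.items()
}

for name, fold in depletion.items():
    print(f"{name}: {fold:.2f}-fold lower coverage at enhancer loci")
```

The two polymorphic sets come out above five-fold, the small Indels at just under two-fold and the DGV-deposited CNVs at 1.25-fold, which matches the ordering of effects described in the text.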

Validation Strategy of the Enhancer Chip
For the Enhancer Chip validation, we analyzed 47 blind DNA samples (Table 3). These included 31 samples from patients in whom genomic imbalances had previously been detected by aCGH (27/31), G-banding (3/31) or MLPA (1/31) and 12 from patients with a clinical diagnosis of developmental delay or congenital anomalies (with the exception of GF and X926, who each had a prevalent muscular phenotype). Two samples (16 and 36) from affected individuals with deleterious mutations undetectable with Enhancer Chip and two (27 and 28) from healthy individuals were also included as negative controls.
The Enhancer Chip detected no clinically relevant genomic imbalances in any of the 4 negative controls and confirmed the molecular diagnosis in 31 out of 31 samples that had previously been analyzed (Table 3). Sample 15 presented a heterozygous intragenic deletion of the KCNQ1 gene (MIM 607542) and sample 20 showed a heterozygous intragenic deletion of the FGF14 gene (MIM 601515). These variations, undetectable by the Agilent 44K array, had been identified by using a 244K Agilent array and were confirmed by using the Enhancer Chip. We also revealed an NF1 deletion (118) detected by MLPA and a 6p duplication (IM and IF) detected by G-banding. In these cases our array proved to be a valuable tool to define breakpoint boundaries and the extension of the aberrations (data not shown), demonstrating that it is a good alternative to commercial whole-genome platforms. In 2 out of 12 patients (17)(18)(19)(20)(21) with only a clinical diagnosis, the Enhancer Chip array was able to detect new small copy-number changes, classified as variants of uncertain clinical significance (Table S5).
In patient 1 (Figure 2a) the result was confirmed by long-range PCR and direct sequencing, showing an insertion of 3 nucleotides at breakpoint boundaries.
In samples 25 and 27 (Figures 2b-c) the extension of these deletions was confirmed and better defined by Real-Time PCR. Finally, the breakpoints were finely mapped by direct sequencing, demonstrating an 11-bp long and a 2-bp long insertion in samples 25 and 27, respectively.
Interestingly, all three enhancers are active at stage E11.5. In particular, hs775 drives the reporter gene expression in the forebrain of transgenic mice, hs676 in the branchial arch, ear, forebrain and hindbrain, and hs607 in the hindbrain (rhombencephalon) and neural tube (Table 4). None of these rearrangements has been described previously and there are no CNVs deposited in the DGV overlapping these three enhancers.
To genotype these 3 new deletions, we developed a triplex PCR assay (Figure 2d). The forward primer was located 5′ to the breakpoint, the reverse primer for detecting the wild type allele within the deleted sequence and the reverse primer for detecting the deletion 3′ to the breakpoint. This enabled us to specifically amplify both the wild type and the deletion alleles, even in heterozygous samples. We carried out this analysis on 300 samples to genotype the deletions detected in samples 1, 25 and 27 and we confirmed these to be rare or private losses.
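The logic of reading such a triplex assay can be written down as a small Python sketch. The function name and the boolean band calls are hypothetical; the sketch only encodes the interpretation implied by the primer layout, where one amplicon reports the wild type allele and the other reports the deletion allele.

```python
def call_genotype(wild_type_band: bool, deletion_band: bool) -> str:
    """Interpret the two possible amplicons of a triplex PCR (illustrative only)."""
    if wild_type_band and deletion_band:
        return "heterozygous deletion"
    if deletion_band:
        return "homozygous deletion"
    if wild_type_band:
        return "wild type"
    # Neither product amplified: the reaction failed and cannot be interpreted.
    return "no call"

# Hypothetical gel readings: (wild type band present, deletion band present).
readings = {"sample_A": (True, True), "sample_B": (True, False), "sample_C": (False, False)}
genotypes = {name: call_genotype(wt, dl) for name, (wt, dl) in readings.items()}
```

Because a wild type product is expected in every sample carrying at least one intact allele, a lane with no product at all is treated here as a failed reaction rather than as a genotype.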

Discussion
"Polymorphic" CNVs show some purifying effects at VISTA enhancer loci, as already seen for miRNA genes [21], which are equally underrepresented in polymorphic copy number variable regions. As indicated in Table 1, only a small fraction of CNV-enhancers has been identified so far. The enhancers are not only conserved elements across evolution but also relatively stable among humans.
Although it is very difficult to predict how many highly polymorphic CNV-enhancers are present in the human genome, they are potential functional variants and could represent candidate loci, especially if located in regions implicated in diseases by linkage or association studies. A previous study [22] has shown that a deletion of several ultraconserved non-coding sequences in mice may not result in obvious phenotypes, demonstrating that even an extreme evolutionary constraint does not necessarily indicate that a non-coding sequence is required for viability. The general opinion is that non-coding conserved sequences are essential and that their deletion may result in severe phenotypes. This lack of an obvious effect could be due to several considerations, similar to those that could explain the absence of a phenotype upon deletion of highly conserved protein-coding genes: minor phenotypes not detected; a functional redundancy with other genes or enhancers; or reductions in fitness that only become apparent over multiple generations or are not easily detected in a controlled laboratory environment [11]. However, contrary to this finding, the number of recorded cases of noncoding mutations linked to human diseases has been growing rapidly. Several chromosomal alterations demonstrate a link between malformations and regulatory mutations. Aniridia and related eye anomalies may arise from chromosomal rearrangements that disrupt the region downstream of the PAX6 transcription unit [23]. A number of long-range regulatory disruptions are associated with genes of the forkhead/winged helix group of transcription factors, such as FOXC1, FOXC2 and FOXL2, causing ocular malformations [24]. Chromosomal rearrangements can remove one or more cis-regulatory elements of the SOX9 gene, leading to campomelic dysplasia [25]. Holoprosencephaly has also been associated with long-range regulator mutations leading to a haploinsufficiency of SIX3 or SHH proteins [26].
The lack of precise data on CNV-enhancers, their polymorphisms and their putative pathogenic role is mostly due to the absence of appropriate methods for their identification and characterization in a large number of samples. A simple and inexpensive method that enables an accurate characterization of several CNVs of interest has never been proposed up to now, hampering the analyses of CNVs and their correlation with the phenotype.
In order to characterize all CNV-enhancers and eventually identify cryptic disease-associated deletions or duplications, we have developed our Enhancer Chip, a straightforward and cost-effective assay for research purposes. A custom-designed array represents an important diagnostic instrument [27], as well as a powerful technique to identify novel disease genes or to characterize relatively unknown elements. The validation experiments described here demonstrate its sensitivity and specificity, confirming all the results generated by other methods. Moreover, thanks to probes on the VISTA enhancer loci, our custom array is an innovative tool, the first one to investigate enhancer elements. Clearly, we have designed this array not as a tool for molecular diagnosis, but to discover new CNVs covering enhancers and potentially causing disease phenotypes. As described, the Enhancer Chip has allowed the detection of three new and supposedly rare deletions covering enhancers. These three enhancers have been demonstrated to be active during embryonic development even if nothing is known about their gene targets. Our samples show heterozygous deletions of these elements. It would be interesting to evaluate the expression levels of their putative targets, once identified. Probably none of the three alterations is directly responsible for the observed phenotypes in these patients. Further studies on a large population could help to identify the phenotypic effects of copy number variable enhancers.
Other recent papers have demonstrated the utility of an exon-targeted oligonucleotide array (i.e., aCGH using an array with probes concentrated disproportionately in the exons) to detect intragenic copy-number changes in patients with various clinical phenotypes [28,29]. An exon-targeted design improves the resolution of aCGH to the level of the exon while excluding much of the noise inherent in other strategies. Our subsets of experimentally validated probes covering the VISTA enhancer loci could be used in any dedicated exon-targeted array design.
Although we have focused on distant-acting enhancers, there are other categories of functional elements in the non-coding portion of the genome (for example insulators, negative regulators, promoters and non-coding RNAs), which are also crucial targets for a large-scale study of regulatory elements in the human genome. To identify deletions or duplications in all these elements, we plan to develop a new customized CGH array covering these regulatory regions.